Continuous speech recognition has had a rocky history. In the early 1970s, the United States funded ASR research with a DARPA challenge. The goal was to develop a recognizer for a 1,000-word vocabulary, and it was achieved a few years later by Carnegie-Mellon's Harpy system. But the future prospects were disappointing and funding dried up. This was the start of the first big AI winter. Performance improved in the '80s and '90s with the refinement of probabilistic models. More recently, increases in computing power have made much larger neural network models a reality.

So what makes speech recognition hard? The first set of problems to solve are related to the audio signal itself: noise, for instance. Cars going by, clocks ticking, other people talking, microphone static; our ASR has to know which parts of the audio signal matter and which parts to discard.

Then there's variability of pitch and variability of volume. One speaker sounds different from another even when saying the same word. Pitch and loudness, at least in English, don't change the ground truth of which word was spoken. If I say hello in a high pitch, a low pitch, or a louder voice, it's all the same word and spelling. We could even think of these differences as another kind of noise that needs to be filtered out.

There's also variability of word speed. Words spoken at different speeds need to be aligned and matched. If I say "speech" quickly or draw it out slowly, it's still the same word with the same number of letters. It's up to the ASR to align the sequences of sound correctly.

And then there are word boundaries. When we speak, words run from one to the next without pause; we don't separate them naturally. Humans understand this because we already know where the word boundaries should be.

This brings us to another class of problems that are language or knowledge related. The fact is, humans perceive speech with more than just their ears. We have domain knowledge of our language that allows us to automatically sort out ambiguities as we hear them: words that sound the same but have different spellings, word groups that are reasonable in one context but not in another. Here's a classic example. When I say "recognize speech," it sounds a lot like "wreck a nice beach." But you knew what I meant because you know I'm discussing speech recognition. The context matters, and an inference like this is going to be tricky for a computer model.

Another aspect to consider: spoken language is different from written language. There are hesitations, repetitions, fragments of sentences, slips of the tongue. A human listener is able to filter these out. Imagine a computer that only knows language from audiobooks and newspapers read aloud. Such a system may have a hard time decoding unexpected sentence structures.

Okay, we've identified lots of problems to solve here: variability of pitch, volume, and speed, and ambiguity due to word boundaries, spelling, and context. We're going to introduce some ways to solve these problems with a number of models and technologies. We'll start at the beginning with the voice itself.
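
Before we do, here is a rough sense of what aligning words spoken at different speeds can look like. This is a minimal sketch of dynamic time warping, one classic alignment technique; it is not the method this lesson builds later. The 1-D numbers stand in for per-frame acoustic features, and the function name and toy data are my own illustrative assumptions.

```python
def dtw_distance(a, b):
    """Minimum cumulative cost of aligning sequence a with sequence b."""
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i frames of a with the first j frames of b
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])              # frame-to-frame distance
            cost[i][j] = d + min(
                cost[i - 1][j],      # one frame of b matched to several frames of a
                cost[i][j - 1],      # one frame of a matched to several frames of b
                cost[i - 1][j - 1],  # frames advance together
            )
    return cost[len(a)][len(b)]

# The "same word" spoken quickly and slowly: the slow version repeats frames.
fast = [1.0, 3.0, 4.0, 2.0]
slow = [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 2.0, 2.0]
print(dtw_distance(fast, slow))  # low cost: the sequences align well despite different lengths
```

The point of the sketch is just that stretching or compressing one sequence against the other lets the recognizer treat a fast and a slow rendition of the same word as the same thing.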
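And here is a toy illustration of the context point, under assumptions of my own: the bigram and unigram counts below are invented, and a real recognizer would estimate a language model from a large text corpus and combine it with acoustic scores. The idea is simply that a statistical model of word sequences can prefer "recognize speech" over "wreck a nice beach" when the surrounding words support it.

```python
from math import log

# Invented counts for illustration only; a real system learns these from data.
bigram_counts = {
    ("recognize", "speech"): 50,
    ("wreck", "a"): 5,
    ("a", "nice"): 40,
    ("nice", "beach"): 8,
}
unigram_counts = {"recognize": 60, "wreck": 10, "a": 500, "nice": 45}

def score(words):
    """Sum of log bigram probabilities, with add-one smoothing for unseen pairs."""
    total = 0.0
    for w1, w2 in zip(words, words[1:]):
        count = bigram_counts.get((w1, w2), 0) + 1
        total += log(count / (unigram_counts.get(w1, 0) + len(unigram_counts)))
    return total

print(score(["recognize", "speech"]))          # higher score: favored in this context
print(score(["wreck", "a", "nice", "beach"]))  # lower score: the decoder sets it aside
```

Either way, the takeaway is the same: resolving sound-alike phrases takes knowledge about word sequences, not just the audio.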