Let’s take a closer look at the basic VUI pipeline we described earlier. To recap, we identified three general pieces: voice to text, text input reasoned to text output, and finally, text to speech.

It starts with voice to text. This is speech recognition. Speech recognition has historically been hard for machines but easy for people, and it remains an important goal of AI. As a person speaks into a microphone, sound vibrations are converted into an audio signal. This signal can be sampled at some rate, and those samples converted into vectors of component frequencies. Shown here is a spectrogram. These vectors represent features of the sound in a data set, so this step can be thought of as feature extraction.

The next step in speech recognition is to decode, or recognize, the series of vectors as a word or sentence. To do that, we need probabilistic models that work well with time series data for the sound patterns. This is the acoustic model. Decoding the vectors with an acoustic model gives us a best guess as to what the words are. That might not be enough, though; some sequences of words are much more likely than others. For example, depending on how the phrase “hello world” was said, the acoustic model might not be sure whether the words are “hello world” or “how a word” or something else. Now, you and I know that it was most likely the first choice, “hello world”. But why do we know? We know because we have a language model in our heads, trained from years of experience, and that is something we need to add to our decoder. An accent model may be needed for the same reason. If these models are well trained on lots of representative examples, we have a higher probability of producing the correct text.

That’s a lot of models to train. Acoustic, language, and accent models are all needed for a robust system, and we haven’t even gone through the whole VUI pipeline yet. We’ll learn more about speech recognition models in a later lesson, but here’s a preview. Remember when I said we need a probabilistic model that works well with time series data? Think back to two of these you’ve already studied in this course. Earlier, we built Hidden Markov Models (HMMs) to decode a series of gestures. In our deep learning lessons, we used Recurrent Neural Networks (RNNs) to train on time series data. Both of these models have been used successfully in speech recognition, and we’ll talk more about them when we study speech recognition in detail.

Back to the pipeline. Once we have our speech in the form of text, it’s time to do the thinking part of our voice application: the reasoning logic. If I ask you, a human, a question like “How’s the weather?”, you may respond in many ways: “I don’t know,” “It’s cold outside,” “The thermometer says 90 degrees,” et cetera. In order to come up with a response, you first had to understand what I was asking for, and then process the request and formulate a response. This was easy because you’re human. It’s hard for a computer to understand what we want and what we mean when we speak. The field of natural language processing (NLP) is devoted to this quest. To fully implement NLP, large datasets of language must be processed, and there are a great many challenges to overcome.

But let’s look at a smaller problem, like getting just a weather report from a VUI device. Let’s imagine an application that has weather information available in response to some text request. Rather than parsing all the words, we could take a shortcut and just map the most probable request phrases for the weather directly to a get-weather process.
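To make that shortcut concrete, here is a minimal Python sketch of what such a phrase-to-intent mapping could look like. The phrases, the REQUEST_MAP table, and the get_weather function are hypothetical stand-ins for illustration, not part of any real application from this lesson.

    # A minimal sketch of the shortcut described above: map common request
    # phrases directly to a get-weather handler instead of fully parsing them.
    # The phrases and get_weather() are made-up examples.

    def get_weather():
        return "It's 90 degrees and sunny."

    REQUEST_MAP = {
        "how's the weather": get_weather,
        "what's the weather like": get_weather,
        "what's the temperature": get_weather,
    }

    def respond(text):
        # Normalize the recognized text and look it up in the pre-mapped phrases.
        key = text.lower().strip(" ?!.")
        handler = REQUEST_MAP.get(key)
        if handler is None:
            return "Sorry, I didn't understand that."  # request was not pre-mapped
        return handler()

    print(respond("How's the weather?"))    # -> It's 90 degrees and sunny.
    print(respond("Will it rain Sunday?"))  # -> Sorry, I didn't understand that.

The design choice here is the trade-off the lesson describes: the lookup is simple and fast, but it only covers requests that were mapped in advance, and the table has to grow over time to handle new phrasings.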
In that case, the application would in fact understand requests most of the time. This won’t work if the request hasn’t been pre-mapped as a possible choice, but it can be quite effective for limited applications and can be improved over time.

Once we have a text response, the remaining task in our VUI pipeline is to convert that text to speech. This is speech synthesis, or text to speech (TTS). Here again, examples of how words are spoken can be used to train a model that provides the most probable pronunciation components of spoken words. The complexity of the task can vary greatly as we move from, say, a monotonic robotic voice to a rich, human-sounding voice that includes inflection and warmth. Some of the most realistic-sounding machine voices to date have been produced using deep learning techniques.
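As a rough illustration of just the pronunciation step, here is a toy Python sketch that maps words to phoneme-like pronunciation components. The PRONUNCIATIONS dictionary and text_to_phonemes function are invented for this example; a real system would use a full pronunciation lexicon and a trained model, and would still need a later stage to synthesize the actual audio.

    # A toy sketch of the first stage of text to speech: looking up the most
    # probable pronunciation components (phonemes) for each word. The tiny
    # dictionary below is a made-up stand-in for a real pronunciation lexicon.

    PRONUNCIATIONS = {
        "hello": ["HH", "AH", "L", "OW"],
        "world": ["W", "ER", "L", "D"],
    }

    def text_to_phonemes(text):
        phonemes = []
        for word in text.lower().split():
            # Fall back to spelling out unknown words letter by letter.
            phonemes.extend(PRONUNCIATIONS.get(word, list(word.upper())))
        return phonemes

    print(text_to_phonemes("hello world"))
    # -> ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']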