If HMMs work, why do we need a new model? It comes down to potential. Suppose we have all the data we need and all the processing power we want. How far can an HMM take us, and how far could some other model take us? According to Baidu’s Adam Coates in a recent presentation, additional training of a traditional ASR system levels off in accuracy. Meanwhile, deep neural network solutions are unimpressive with small data sets, but they shine as we increase data and model sizes.

Here’s the process we’ve looked at so far: we extract features from the audio speech signal with MFCC, use an HMM acoustic model to convert those features into sound units, phonemes, or words, and then use statistical language models such as N-grams to straighten out language ambiguities and create the final text sequence. It’s possible to replace these many individually tuned parts with a multi-layer deep neural network. Let’s get a little intuition as to why they can be replaced.

In feature extraction, we’ve used models based on human sound production and perception to convert a spectrogram into features. This is similar, intuitively, to the idea of using Convolutional Neural Networks to extract features from image data. Spectrograms are visual representations of speech, so we ought to be able to let a CNN find the relevant features for speech in the same way.

An acoustic model implemented with HMMs includes transition probabilities to organize time-series data. Recurrent Neural Networks can also track time-series data through memory, as we’ve seen in earlier lessons on RNNs. The traditional model also uses HMMs to sequence sound units into words. RNNs, however, produce probability densities over each time slice, so we need another way to solve the sequencing issue. A Connectionist Temporal Classification (CTC) layer converts the RNN outputs into words, which means we can replace the acoustic portion of the network with a combination of RNN and CTC layers.

The end-to-end DNN still makes linguistic errors, especially on words it hasn’t seen in enough examples. It should be possible for the system to learn language probabilities from audio data, but presently there just isn’t enough of it. The existing technology of N-grams can still be used. Alternatively, a neural language model (NLM) can be trained on massive amounts of available text. Using an NLM layer, the probabilities of spelling and context can be rescored for the system.
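To make the feature-extraction step concrete, here is a minimal sketch of computing MFCC features. It assumes the librosa library, a 16 kHz sample rate, and an input file named speech.wav; none of these are prescribed by the lesson, and other toolkits would work just as well.

```python
import librosa

# Load the waveform, resampling to 16 kHz (a common rate for ASR).
signal, rate = librosa.load("speech.wav", sr=16000)

# Compute 13 MFCC coefficients per analysis frame.
# Result shape: (n_mfcc, n_frames) -- one feature vector per time slice.
mfcc = librosa.feature.mfcc(y=signal, sr=rate, n_mfcc=13)
```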
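Likewise, the CNN-over-spectrogram idea might look like the following sketch. The framework (PyTorch), the layer sizes, and the spectrogram dimensions are all illustrative assumptions, loosely patterned on the convolutional front ends used in end-to-end ASR models.

```python
import torch
import torch.nn as nn

# Treat the spectrogram as a one-channel "image":
# (batch, channels=1, frequency_bins, time_steps).
conv_features = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=(11, 5), stride=(2, 1), padding=(5, 2)),
    nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=(11, 5), stride=(2, 1), padding=(5, 2)),
    nn.ReLU(),
)

spectrogram = torch.randn(8, 1, 161, 200)  # dummy batch of 8 spectrograms
features = conv_features(spectrogram)      # learned features, still ordered in time
```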
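For the RNN-plus-CTC replacement of the acoustic model, here is one way the pieces could fit together, again in PyTorch. The hidden size, the 29-symbol character vocabulary (letters, space, apostrophe, and a CTC blank), and the dummy tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

vocab_size = 29  # assumed alphabet: 26 letters + space + apostrophe + CTC blank

rnn = nn.GRU(input_size=161, hidden_size=256, num_layers=2, batch_first=True)
classifier = nn.Linear(256, vocab_size)
ctc_loss = nn.CTCLoss(blank=0)

x = torch.randn(8, 200, 161)                         # (batch, time, features)
hidden, _ = rnn(x)
log_probs = classifier(hidden).log_softmax(dim=-1)   # a distribution per time slice

# nn.CTCLoss expects (time, batch, vocab) log-probabilities.
targets = torch.randint(1, vocab_size, (8, 30))      # dummy character targets
input_lengths = torch.full((8,), 200, dtype=torch.long)
target_lengths = torch.full((8,), 30, dtype=torch.long)
loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
```

The key point is that the network emits a distribution over characters at every time slice; the CTC loss then sums over all alignments that collapse to the target transcript, so no frame-level labels are needed.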
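Finally, rescoring with a language model can be as simple as reweighting decoder hypotheses. The function below is purely hypothetical: rescore, lm_score, and the weight alpha are names invented for this sketch, not part of any particular library.

```python
# Hypothetical rescoring: combine each hypothesis's acoustic score with a
# language-model score, weighted by alpha (all names here are assumptions).
def rescore(hypotheses, lm_score, alpha=0.5):
    """hypotheses: list of (text, acoustic_log_prob) pairs from a beam-search decoder.
    lm_score: callable returning a log-probability for a text string."""
    return max(hypotheses, key=lambda h: h[1] + alpha * lm_score(h[0]))
```

Whether lm_score comes from an N-gram model or a neural language model, the combination step looks the same: the language model pulls the final choice toward spellings and word sequences that are probable in text.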