10 – N-Grams

The job of the Language Model is to inject language knowledge into the words-to-text step of speech recognition, providing another layer of processing between words and text to resolve ambiguities in spelling and context. For example, since an Acoustic Model is based only on sound, it can't distinguish the correct spelling for words that sound the same, such as hear and here. Other sequences may not make sense but could be corrected with a little more information.

The words produced by the Acoustic Model are not absolute choices; they can be thought of as a probability distribution over many different words. For each possible sequence, we can calculate the likelihood that that particular word sequence was produced by the audio signal. A statistical Language Model, in turn, provides a probability distribution over sequences of words. If we have both of these, the Acoustic Model and the Language Model, then the most likely sequence is the combination over all these possibilities with the greatest likelihood score. Scoring every possibility in both models would require a very large number of computations, but we can get a good estimate by looking at only a limited depth of choices. It turns out that in practice, the words we speak at any time depend primarily on only the previous three to four words.

N-grams are probabilities of single words, ordered pairs, triples, etc. With N-grams we can approximate the sequence probability with the chain rule: the probability that the first word occurs is multiplied by the probability of the second given the first, and so on, to get the probability of a given sequence. We can then score these probabilities along with the probabilities from the Acoustic Model to remove language ambiguities from the sequence options and provide a better estimate of the utterance in text.
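As a minimal sketch of the chain-rule idea, here is a bigram (N = 2) language model trained on a tiny toy corpus; the corpus, the `<s>` start token, and the function names are all illustrative assumptions, and a real system would train on a large text corpus and add smoothing for unseen pairs:

```python
from collections import Counter

# Toy corpus; a real Language Model is trained on a large text corpus.
corpus = [
    ["i", "can", "hear", "you"],
    ["come", "over", "here", "now"],
    ["i", "can", "hear", "music"],
    ["sit", "here", "with", "me"],
]

# Count unigrams and bigrams, padding each sentence with a start token.
unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    padded = ["<s>"] + sentence
    unigrams.update(padded)
    bigrams.update(zip(padded, padded[1:]))

def bigram_prob(prev, word):
    """P(word | prev), estimated from counts (no smoothing)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def sequence_prob(words):
    """Chain-rule probability of a word sequence under the bigram
    (first-order Markov) approximation:
    P(w1..wn) ~= P(w1|<s>) * P(w2|w1) * ... * P(wn|wn-1)."""
    prob = 1.0
    prev = "<s>"
    for w in words:
        prob *= bigram_prob(prev, w)
        prev = w
    return prob

# The model prefers "hear" after "can", resolving the hear/here ambiguity.
print(sequence_prob(["i", "can", "hear", "you"]))  # 0.25
print(sequence_prob(["i", "can", "here", "you"]))  # 0.0
```

In a full recognizer, these Language Model scores would be combined with the Acoustic Model's scores for each candidate sequence, so a spelling the audio alone cannot decide is settled by which word sequence is more probable in the language.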
