Phonetics is the study of sound in human speech. Linguistic analysis of languages around the world is used to break human words down into their smallest sound segments. In any given language, some number of phonemes defines the distinct sounds of that language. In US English, there are generally 39 to 44 phonemes defined. A grapheme, in contrast, is the smallest distinct unit that can be written in a language. In US English, the smallest grapheme set we can define is the 26 letters of the alphabet plus a space. Unfortunately, we can't simply map phonemes to graphemes, or individual letters, because some letters map to multiple phoneme sounds and some phonemes map to more than one letter combination. For example, in English the letter C sounds different in cat, chat, and circle. Meanwhile, the long-E phoneme we hear in receive, beet, and beat is represented by a different letter combination in each word.

Here's a sample of a US English phoneme set called Arpabet, written out in the sketch below. Arpabet was developed in 1971 for speech recognition research and contains 39 phonemes, 15 vowel sounds and 24 consonants, each represented as a one- or two-letter symbol. Check the reference section for links to the full set.

Phonemes are often a useful intermediary between speech and text. If we can successfully produce an acoustic model that decodes a sound signal into phonemes, the remaining task is to map those phonemes to their matching words. This step is called lexical decoding, and it is based on a lexicon, or dictionary, of the data set. Why not just use our acoustic model to translate directly into words? Why take the intermediary step? That's a good question, and there are systems that do translate features directly into words. This is a design choice, and it depends on the dimensionality of the problem. If we want to train on a limited vocabulary of words, we might just skip the phonemes; but if we have a large vocabulary, converting to smaller units first reduces the number of comparisons that need to be made in the system overall.
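For reference, here is one way to write out that 39-phoneme Arpabet inventory, as used by the CMU Pronouncing Dictionary (the exact symbol set varies slightly between sources):

```python
# The 39-phoneme Arpabet set used by the CMU Pronouncing Dictionary:
# 15 vowels and 24 consonants, each a one- or two-letter symbol.
VOWELS = ["AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
          "EY", "IH", "IY", "OW", "OY", "UH", "UW"]
CONSONANTS = ["B", "CH", "D", "DH", "F", "G", "HH", "JH",
              "K", "L", "M", "N", "NG", "P", "R", "S",
              "SH", "T", "TH", "V", "W", "Y", "Z", "ZH"]
assert len(VOWELS) == 15 and len(CONSONANTS) == 24  # 39 phonemes total
```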
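To make the letter-to-sound mismatch concrete, here is a small sketch of the example words above transcribed into Arpabet phonemes, following CMU Pronouncing Dictionary conventions with stress markers omitted:

```python
# Arpabet transcriptions (stress markers omitted) -- illustrative,
# not exhaustive.
PRONUNCIATIONS = {
    "cat":     ["K", "AE", "T"],             # letter C -> K phoneme
    "chat":    ["CH", "AE", "T"],            # letters CH -> CH phoneme
    "circle":  ["S", "ER", "K", "AH", "L"],  # letter C -> S phoneme
    "receive": ["R", "IH", "S", "IY", "V"],  # "ei" spells IY
    "beet":    ["B", "IY", "T"],             # "ee" spells IY
    "beat":    ["B", "IY", "T"],             # "ea" spells IY
}
```

The letter C surfaces as three different phonemes (K, CH, S), while the single phoneme IY is spelled three different ways (ei, ee, ea).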
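And here is a minimal sketch of the lexical decoding step, assuming the acoustic model has already produced a phoneme sequence: we invert a toy lexicon and look up candidate words. The lexicon contents and the `lexical_decode` helper are illustrative; a real decoder scores many competing hypotheses rather than doing an exact lookup.

```python
from collections import defaultdict

# A toy lexicon mapping words to Arpabet phoneme sequences
# (a real system would load the full CMU Pronouncing Dictionary).
LEXICON = {
    "beet": ("B", "IY", "T"),
    "beat": ("B", "IY", "T"),
    "cat":  ("K", "AE", "T"),
    "chat": ("CH", "AE", "T"),
}

# Invert the lexicon: phoneme sequence -> candidate words.
# Homophones like "beet"/"beat" share a key, so decoding can be
# ambiguous; a language model would choose between them.
phonemes_to_words = defaultdict(list)
for word, phones in LEXICON.items():
    phonemes_to_words[phones].append(word)

def lexical_decode(phones):
    """Return candidate words for a decoded phoneme sequence."""
    return phonemes_to_words.get(tuple(phones), [])

print(lexical_decode(["B", "IY", "T"]))   # ['beet', 'beat']
print(lexical_decode(["K", "AE", "T"]))   # ['cat']
```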