What part of the audio signal is really important for recognizing speech? One human creates words and another human hears them, so our speech is constrained both by the mechanisms we use to produce voice and by what our ears can perceive.

Let's start with the ear and the pitches we can hear. The mel scale, developed in 1937, describes which pitches human listeners can actually distinguish. It turns out that some frequencies sound the same to us, and that we hear differences between lower frequencies more distinctly than differences between higher ones. If we can't hear a pitch, there is no need to include it in our data, and if our ears can't tell two frequencies apart, they might as well be treated as one for our purposes. For feature extraction, then, we can collect the frequencies of the spectrogram into bins that are relevant to our own ears and filter out sound we can't hear. This reduces the number of frequencies we're working with considerably.

That's not the end of the story, though. We also need to separate out the elements of sound that are speaker-independent. For this, we focus on the mechanism we use to produce speech. Human voices vary from person to person even though our basic anatomy is the same. We can think of human voice production as a combination of source and filter, where the source is unique to an individual and the filter is the articulation of words that we all share when speaking. Cepstral analysis relies on this model to separate the two; the cepstrum can be extracted from a signal with an algorithm you'll find in the references. The main thing to remember is that we drop the component of speech unique to an individual's vocal cords and preserve the shape of the sound made by the vocal tract.

Cepstral analysis combined with mel frequency analysis gets you 12 or 13 MFCC features related to speech. Delta and delta-delta MFCC features can optionally be appended to the feature set; this doubles or triples the number of features but has been shown to give better results in ASR.

The takeaway for MFCC feature extraction is that we greatly reduce the dimensionality of our data while squeezing noise out of the system. The sketches below illustrate each of these steps in code; after that, we'll look at sound from a language perspective: the phonetics of the words we hear.
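As a rough illustration of the mel-scale binning, here is a minimal sketch in Python using only NumPy. It uses the common O'Shaughnessy closed-form approximation of the 1937 mel scale; the filter count, FFT size, and sample rate are illustrative defaults, not values from the text.

```python
import numpy as np

def hz_to_mel(f_hz):
    # O'Shaughnessy's approximation of the mel scale
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse of the approximation above
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sr=16000, fmin=0.0, fmax=None):
    if fmax is None:
        fmax = sr / 2
    # Points equally spaced in mel (i.e., perceptually), converted back to Hz
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    # One triangular filter per row, rising to a peak then falling
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank
```

Each row of `fbank` is a triangular filter; multiplying a power-spectrum frame by `fbank.T` collapses hundreds of FFT bins into a couple of dozen perceptually spaced mel bins, which is the dimensionality reduction described above.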
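The cepstral step itself can be sketched as a discrete cosine transform of the log mel energies. The function below assumes frames of a power spectrogram as input and reuses the `fbank` from the previous sketch; the function name and the small constant added before the log are illustrative choices, not from the text.

```python
import numpy as np
from scipy.fft import dct

def mfcc_from_power_spectrum(power_frames, fbank, n_mfcc=13):
    """power_frames: array of shape (n_frames, n_fft // 2 + 1)."""
    mel_energies = power_frames @ fbank.T        # bin frequencies the mel way
    log_mel = np.log(mel_energies + 1e-10)       # compress dynamic range; epsilon avoids log(0)
    # The DCT is the cepstral step: low-order coefficients capture the
    # slowly varying vocal-tract envelope (the filter), while high-order
    # coefficients carry the fast-varying source/pitch detail we discard.
    cepstra = dct(log_mel, type=2, axis=1, norm='ortho')
    return cepstra[:, :n_mfcc]
```

Keeping only the first 12 or 13 coefficients retains the smooth vocal-tract shape and drops the fine source structure, which is exactly the speaker-dependent component we want to remove.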
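In practice, a library such as librosa wraps all of these steps, including the delta features. A minimal end-to-end sketch, assuming a 16 kHz recording (the filename is a placeholder):

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)        # placeholder path
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape (13, n_frames)
delta = librosa.feature.delta(mfcc)                 # first-order differences
delta2 = librosa.feature.delta(mfcc, order=2)       # second-order differences
features = np.vstack([mfcc, delta, delta2])         # 39 features per frame
```

Stacking the deltas and delta-deltas triples the 13 base coefficients to 39 features per frame, the layout used by many classic ASR front ends.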