When we speak, we create vibrations in the air; higher pitches vibrate faster, at a higher frequency, than lower pitches. A microphone detects these vibrations and transduces them from acoustical energy carried in the sound wave into electrical energy, where they are recorded as an audio signal.

The audio signal for "hello world" looks like this. As in any other kind of modeling, we need to get a handle on the features that make up our input. So, what's going on in this signal? It appears in two blobs, and those blobs do correspond to the two words, "hello" and "world." We also see immediately that some of the vibrations in the signal are taller than others, that is, they have a higher amplitude. The amplitude of the audio signal tells us how much acoustical energy is in the sound: how loud it is.

If we look closer at a time slice of the signal, we can see it has an irregular, wiggly shape. Our speech is made up of many frequencies at the same time, and the signal we see here is really the sum of all those frequencies added together. To properly analyze the signal, we would like to use the component frequencies as features, and a Fourier transform can break the signal into these components. The Fast Fourier Transform (FFT) algorithm is widely available for this task.

We can use this splitting technique to convert the sound into a spectrogram. In this spectrogram of the "hello world" phrase, we see frequency on the vertical axis plotted against time on the horizontal axis, and the intensity of shading indicates the amplitude of the signal. To create a spectrogram, first divide the signal into time frames, then split each frame's signal into its frequency components with an FFT. Each time frame is now represented by a vector of amplitudes, one per frequency. If we line the vectors up again in their time-series order, we get a visual picture of the sound's components: the spectrogram.
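The frame-then-FFT recipe above can be sketched in a few lines of NumPy. This is a minimal illustration, not production code: the frame size, hop length, test tone, and sample rate below are illustrative choices, not values from the original.

```python
import numpy as np

def spectrogram(signal, frame_size=256, hop=128):
    """Divide a signal into overlapping time frames, FFT each frame,
    and stack the amplitude vectors into a 2-D spectrogram array."""
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        # A Hann window tapers the frame edges to reduce spectral leakage
        windowed = frame * np.hanning(frame_size)
        # rfft returns the frequency components of a real-valued signal;
        # the magnitude is the amplitude at each frequency
        frames.append(np.abs(np.fft.rfft(windowed)))
    # Rows are time frames, columns are frequency bins
    return np.array(frames)

# Illustrative input: a pure 440 Hz tone, one second at an 8 kHz sample rate
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)

spec = spectrogram(tone)
# Each frequency bin spans sr / frame_size = 31.25 Hz, so the peak
# amplitude should land in the bin nearest 440 Hz
peak_hz = spec[0].argmax() * sr / 256
```

Plotting `spec.T` with time on the horizontal axis and frequency on the vertical axis, with color mapped to amplitude, reproduces the spectrogram picture described above.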
The spectrogram can be lined up with the original audio signal in time. With the spectrogram, we have a complete representation of our sound data, but we still have noise and variability embedded in it, and there may be more information here than we really need. Next, we'll look at feature extraction techniques to both reduce the noise and reduce the dimensionality of our data.