9 – Language Models

So far, we have tools for addressing noise and speech variability through our feature extraction. We have HMMs that can convert those features into phonemes and address the sequencing problems for our full acoustic model. We haven’t yet solved the problem of language ambiguity, though. The ASR system can’t tell from the acoustic model …

8 – HMMs in Speech Recognition

You learned the basics of hidden Markov models in an earlier lesson. To recap, HMMs are useful for detecting patterns through time. This is exactly what we are trying to do with an acoustic model. HMMs can solve the challenge we identified earlier: time variability. For instance, my earlier example of speech versus speech, …

7 – Acoustic Models and the Trouble with Time

We’ve got our data now. With feature extraction, we’ve addressed noise problems due to environmental factors as well as variability of speakers. Phonetics gives us a representation for sounds and language that we can map to. That mapping, from the sound representation to the phonetic representation, is the task of our acoustic model. We still …

6 – Voice Data Lab Introduction

You’ve learned a lot about speech audio. We’ve introduced signal analysis and feature extraction techniques to create data representations for that speech audio. Now, we need a lot of examples of audio, matched with text, the labels, that we can use to create our dataset. If we have those labeled examples, say a string of …

5 – Phonetics

Phonetics is the study of sound in human speech. Linguistic analysis of language around the world is used to break down human words into their smallest sound segments. In any given language, some number of phonemes define the distinct sounds in that language. In US English, there are generally 39 to 44 phonemes defined. …
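The word-to-phoneme mapping can be sketched as a simple lexicon lookup. The few entries below are hand-written using ARPAbet-style symbols purely for illustration; real systems rely on large pronunciation dictionaries such as CMUdict.

```python
# A tiny pronunciation lexicon mapping words to ARPAbet-style phonemes.
# These entries are illustrative; production lexicons hold 100k+ words.

lexicon = {
    "cat": ["K", "AE", "T"],
    "speech": ["S", "P", "IY", "CH"],
    "hello": ["HH", "AH", "L", "OW"],
}

def to_phonemes(sentence):
    # Flatten a sentence into its phoneme sequence, word by word
    return [ph for word in sentence.lower().split() for ph in lexicon[word]]

print(to_phonemes("hello cat"))  # ['HH', 'AH', 'L', 'OW', 'K', 'AE', 'T']
```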

4 – Feature Extraction

What part of the audio signal is really important for recognizing speech? One human creates words and another human hears them. Our speech is constrained by both our voice-making mechanisms and what we can perceive with our ears. Let’s start with the ear and the pitches we can hear. The Mel Scale was developed in …

3 – Signal Analysis

When we speak we create sinusoidal vibrations in the air. Higher pitches vibrate faster, with a higher frequency than lower pitches. These vibrations can be detected by a microphone and transduced from acoustical energy carried in the sound wave to electrical energy, where it is recorded as an audio signal. The audio signal for hello …
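The pitch-equals-frequency idea is easy to demonstrate: synthesize a pure tone as a stand-in for a recorded signal, then recover its frequency from the spectrum. The 440 Hz tone and 8 kHz sample rate below are arbitrary choices for the sketch.

```python
import numpy as np

# One second of a 440 Hz sine wave, standing in for a recorded audio
# signal; a higher pitch would simply mean a faster vibration.
sample_rate = 8000                        # samples per second
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440 * t)

# The FFT decomposes the signal into its sinusoidal components;
# the biggest peak sits at the tone's frequency.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
peak = freqs[np.argmax(spectrum)]
print(peak)  # 440.0
```

With one full second of audio, the FFT's frequency bins are exactly 1 Hz apart, so the peak lands precisely on 440.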

2 – Challenges in ASR

Continuous speech recognition has had a rocky history. In the early 1970s, the United States funded ASR research with a DARPA challenge. The goal was to develop a recognizer for a 1,000-word vocabulary. That goal was achieved a few years later by Carnegie Mellon’s Harpy System. But the future prospects were disappointing and funding dried …

13 – Outro

Congratulations on completing the speech recognition module. We’ve covered a lot of ground. We started with signal analysis, taking apart the sound characteristics of the signal and extracting only the features we required to decode the sounds and the words. We learned how the features could be mapped to sound representations of phonemes with HMM …

12 – Deep Neural Networks as Speech Models

If HMMs work, why do we need a new model? It comes down to potential. Suppose we have all the data we need and all the processing power we want. How far can an HMM model take us, and how far could some other model take us? According to Baidu’s Adam Coates in a …

11 – A New Paradigm

The previous lessons identified the problems of speech recognition, and provided a traditional ASR solution using feature extraction, HMMs, and language models. These systems have gotten better and better since they were introduced in the 1980s. But is there a better way? As computers become more powerful and data more available, deep neural networks have …

10 – N-Grams

The job of the Language Model is to inject language knowledge into the words-to-text step in speech recognition, providing another layer of processing between words and text to solve ambiguities in spelling and context. For example, since an Acoustic Model is based on sound, we can’t distinguish the correct spelling for words …
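A minimal version of that idea: count bigrams in a corpus and use the counts to score which homophone spelling is more likely after a given context word. The tiny corpus and the "there"/"their" pair below are invented for illustration.

```python
from collections import Counter

# A toy bigram language model built from raw counts. Real language
# models use far larger corpora and smoothing for unseen bigrams.
corpus = "over there we saw their dog and their cat sitting over there".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(prev, word):
    # Maximum-likelihood estimate of P(word | prev); 0 for unseen context
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

# The acoustic model hears "there" and "their" identically,
# but the counts separate them by context:
print(bigram_prob("over", "there"))  # 1.0 -> "over there" is attested
print(bigram_prob("over", "their"))  # 0.0 -> "over their" never occurs
```

In practice, unsmoothed counts like these would assign zero probability to any unseen pair, which is why real N-gram models add smoothing.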

1 – Introduction to Speech Recognition

Hello again. In the last module, we built a VUI application that used a commercial implementation of speech recognition. Like magic, the words we speak into a voice-enabled device are converted into text. In this module, we take a closer look at how speech recognition really works. Now, when we say speech recognition, we’re …

3 – VUI Applications

VUI applications are becoming more and more commonplace. There are a few reasons driving this. First of all, voice is natural for humans. It’s effortless for us to converse by voice compared to reading and typing. And secondly, it turns out it’s also fast. Speaking into a text transcriber is three times faster than …

2 – VUI Overview

Let’s take a closer look at the basic VUI pipeline we described earlier. To recap, three general pieces were identified: voice to text, text input reasoned to text output, and finally, text to speech. It starts with voice to text. This is speech recognition. Speech recognition is historically hard for machines but easy for people and …
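The three pieces above compose naturally. The skeleton below wires them together with stub stages; every function body is a placeholder standing in for a real ASR engine, a dialogue/reasoning component, and a TTS engine.

```python
# A skeletal VUI pipeline with stub stages. The canned strings are
# placeholders, not real recognition or synthesis.

def speech_to_text(audio):
    # stub for the ASR stage: pretend we recognized the audio
    return "what time is it"

def reason(text_in):
    # stub for the reasoning stage: map a request to a text response
    return "it is noon" if "time" in text_in else "sorry, say again"

def text_to_speech(text_out):
    # stub for the TTS stage: pretend to synthesize audio
    return f"<audio:{text_out}>"

def vui_pipeline(audio):
    # voice -> text -> reasoned text -> speech, in order
    return text_to_speech(reason(speech_to_text(audio)))

print(vui_pipeline(b"..."))  # <audio:it is noon>
```

The value of the decomposition is that each stage can be swapped out independently, which is exactly how commercial VUI stacks are assembled.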

1 – Welcome

Hello and welcome to voice user interfaces. A VUI is a speech platform that enables humans to communicate with machines by voice. To help you learn how to design voice user interfaces, we have Dana Sheahen. Hello. This is a great topic and I’m delighted to have the chance to teach it. VUIs used to …