6 – Voice Data Lab Introduction

You’ve learned a lot about speech audio, and we’ve introduced signal analysis and feature extraction techniques for turning that audio into useful data representations. Now we need many examples of audio matched with text transcriptions, the labels, so we can build our dataset. Given those labeled examples, say a string of words matched with an audio snippet, we can convert the audio into spectrogram or MFCC representations for training a probabilistic model.

Fortunately for us, ASR is a problem a lot of people have worked on. That means labeled audio data is readily available, along with plenty of tools for converting sound into various representations. One popular benchmark data source for ASR training and testing is the TIMIT Acoustic-Phonetic Corpus. Developed specifically for speech research and released in 1993, it contains 630 speakers each voicing 10 phoneme-rich sentences, such as ‘George seldom watches daytime movies.’ Two popular large-vocabulary data sources are the LDC Wall Street Journal Corpus, with 73 hours of newspaper reading, and the freely available LibriSpeech Corpus, with 1,000 hours of readings from public domain books.

Tools for converting these audio files into spectrograms and other feature sets are available in a number of software libraries. In the following lab, you’ll explore some dataset samples as well as create some audio files and data of your own. You can even take a look at the spectrograms with an open-source visualization tool. Have fun.
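As a minimal sketch of what that conversion looks like in Python, here is one way to extract a spectrogram and MFCCs using the open-source librosa library; the filename and the window/hop settings below are illustrative assumptions, not part of the lab itself:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load an audio clip; "sample.wav" is just a placeholder path.
# librosa resamples on load, so we ask for 16 kHz, common for ASR corpora.
signal, sr = librosa.load("sample.wav", sr=16000)

# Short-time Fourier transform -> log-magnitude spectrogram.
# A 25 ms window (400 samples) and 10 ms hop (160 samples) are typical for speech.
stft = librosa.stft(signal, n_fft=512, win_length=400, hop_length=160)
spectrogram = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

# 13 MFCCs per frame, a common compact feature set for acoustic models.
mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13, hop_length=160)

# Quick visual check of the spectrogram.
librosa.display.specshow(spectrogram, sr=sr, hop_length=160,
                         x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram")
plt.show()
```

Either representation, paired with the clip’s text transcription, gives you exactly the kind of labeled example described above.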
