Week 5

Notes about modern-day speech recognition

In 2017 there exist a lot of resources for understanding the theory and practical applications of speech recognition. Unfortunately, one might initially misinterpret this as, “learning to do speech recognition is straightforward.” This is incorrect. I do think building a state-of-the-art speech recognition system is probably a lot more “straightforward” today than it was twenty or even ten years ago. Still, the components of speech recognition are many and deeply technical.

First, as it is currently practiced, the problem of speech recognition, or automatic speech recognition (ASR), spans three major disciplines: machine learning, natural language processing, and signal processing. Each of these disciplines is itself an intersection of multiple fields of science and engineering. These intersections have produced an array of applications that are now common in everyday life. Whether it be interacting with automated telephone operators, giving commands to our personal electronic assistants, or transcribing an interview for a blog post, speech recognition is steadily becoming a more standard way to interact with our personal electronic devices. Anyone who has used any of these recent applications of ASR has probably noticed a few flaws. ASR is very advanced, but still nowhere near the fluidity that we perceive some animals to be capable of.

Performing modern-day speech recognition

At a high level, most ASR systems are constructed by acquiring training data (child-directed speech in our case) from a particular language, for example English. Then, using this training data, we build a statistical model of speech, so that given new acoustic data we can determine the most likely sentence, or sequence of words, that was spoken (the standard formulation is written out below).

This means the accuracy of these ASR systems fundamentally depends on the acoustic data used to train the system and the environment the system will be used in. A lot of acoustic data can be acquired by becoming a member of an organization like the Linguistic Data Consortium (LDC), though that is costly: access to such databases carries a price tag of thousands of dollars. In fact, for large-scale speech training, one typically needs not just access to acoustic data but hundreds of hours of it, a cluster of machines, and a fast internet connection. Building a robust speech recognition system is expensive.
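To write the “most likely sentence” step down explicitly: the textbook formulation (the standard Bayes decomposition used by essentially every ASR system, not anything specific to our project) is to find, given a sequence of acoustic observations O, the word sequence W that maximizes the posterior probability:

$$
\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid O) = \operatorname*{arg\,max}_{W} P(O \mid W)\, P(W)
$$

Here P(O | W) is the acoustic model and P(W) is the language model; the acoustic data described above is what the acoustic model is estimated from.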

Performing speech recognition with Kaldi

Kaldi was chosen as the speech recognition framework for our project for a number of reasons, most of which come down to troubleshooting and community support. Still, looking through the various tutorials and notes on Kaldi and on other speech recognition systems has made me realize that speech recognition is something that takes months. Acquiring sufficient data and learning Kaldi could take years, really…

Collecting VoxForge data

As mentioned above, having access to large amounts of quality acoustic data is expensive. Luckily, resources like VoxForge, a free speech corpus and acoustic model repository, help one avoid paying a huge price tag.

After setting the appropriate DATA_ROOT variable within Kaldi, I downloaded the VoxForge data so I could test it out with their example scripts:

$ export DATA_ROOT="/Users/erickoduniyi/voxforge/data"
$ ./getdata.sh

This was about 10 GB of acoustic data, so it took about an hour and a half to download everything.
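As a quick sanity check that the download finished, I can look at the size and contents of the data directory (these are just generic shell commands, not part of the recipe):

$ du -sh "$DATA_ROOT"   # should report roughly 10 GB
$ ls "$DATA_ROOT"       # the downloaded VoxForge submissions live here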

Configuring Kaldi

Most of the speech recognition training will be done by running the various recipes by “hand”. That is, I will usually be copying commands into the respective run.sh scripts and running them step by step. This lets me configure the variables within the run.sh file that specify how I want to train (number of parallel processes, number of speakers, language model, etc.). Configuring Kaldi for my training setup will be slightly different once I start using computer cluster systems later.
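To give a sense of what those variables look like, here is a sketch of the kind of settings that sit at the top of a run.sh. The names and values below are illustrative examples, not the VoxForge recipe’s exact defaults:

njobs=2        # number of parallel jobs for feature extraction and training
nspk_test=20   # number of speakers to hold out for the test set
lm_order=2     # n-gram order of the language model
. ./path.sh || exit 1   # set up Kaldi paths for this recipe
. ./cmd.sh || exit 1    # pick run.pl (local machine) vs. queue.pl (cluster)

The cmd.sh part is what will change once I move to a cluster, since that is where the recipe chooses between running jobs locally and submitting them to a grid.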

Data selection

Next, I use the local/voxforge_select.sh script to create symbolic links to the matching submission subdirectories in ${DATA_ROOT}. I’m interested in all the English VoxForge speech, so I set the dialect to American/British:

dialect="(American)|(British)"
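For context, this variable gets fed to the selection script. The call looks something like the line below, though the exact source and destination directories are whatever the recipe’s run.sh passes in, so treat the paths as illustrative:

# Illustrative invocation; check run.sh for the exact arguments
local/voxforge_select.sh --dialect "$dialect" ${DATA_ROOT}/extracted data/local/selected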

Now that the prerequisites have been satisfied, I can move on to acoustic mapping, training, and testing with the VoxForge data.
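As a preview, the core of that training stage in a typical Kaldi recipe looks roughly like the commands below. Options and directory names are illustrative, and the VoxForge run.sh wraps these in its own data-preparation, graph-compilation, and decoding steps:

# Illustrative core steps; exact flags and paths come from the recipe's run.sh
steps/make_mfcc.sh --cmd "$train_cmd" --nj $njobs data/train exp/make_mfcc/train mfcc   # extract MFCC features
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train mfcc                         # per-speaker mean/variance normalization
steps/train_mono.sh --cmd "$train_cmd" --nj $njobs data/train data/lang exp/mono        # train a monophone system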

Best,
EO