Week 11

Decoding, decoding, decoding

Just to recap: I collected the VoxForge speech data, formatted it for Kaldi, split it into training and testing sets, and finished installing the language modeling and dictionary systems.

This week I started decoding the VoxForge data. Decoding is essentially the last stage of the speech recognition process. Once it is completed I can get word error rates (WER) for my speech processing experiments. These error rates let me assess how well my speech recognition system recognizes new data. If the WER is high, that indicates subpar performance, which could be improved by adjusting a suite of parameters. For now, though, I just wanted to see what the “base” configuration would get me.
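For reference, WER is computed from the word-level edits needed to turn the system’s hypothesis into the reference transcript:

WER = (substitutions + deletions + insertions) / (number of words in the reference)

so a lower WER means better recognition.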

Now that I have all the parts of a speech recognition system and know how I want them configured, I can run the decoding process. First, I needed to change this:

export train_cmd="queue.pl --mem 2G"
export decode_cmd="queue.pl --mem 4G"
export mkgraph_cmd="queue.pl --mem 8G"

to this within cmd.sh:

# Setting local system jobs (local CPU - no external clusters)
export train_cmd=run.pl
export decode_cmd=run.pl

I don’t have access to any computing clusters at the moment, but hopefully that will become a reality in the near future…

After my machine was all set up, all I had to do was:

$ ./run.sh

This runs the whole ASR pipeline: monophone training and decoding, followed by triphone training and decoding.
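Under the hood, the main stages look roughly like this (a simplified sketch, not the literal VoxForge recipe; the real run.sh also does data prep and feature extraction, and the exact options and Gaussian counts below are just placeholders):

# monophone training, graph building, and decoding
steps/train_mono.sh --nj 4 --cmd "$train_cmd" data/train data/lang exp/mono
utils/mkgraph.sh data/lang_test exp/mono exp/mono/graph
steps/decode.sh --nj 4 --cmd "$decode_cmd" exp/mono/graph data/test exp/mono/decode

# align with the monophone model, then train and decode a triphone model on top
steps/align_si.sh --nj 4 --cmd "$train_cmd" data/train data/lang exp/mono exp/mono_ali
steps/train_deltas.sh --cmd "$train_cmd" 2000 11000 data/train data/lang exp/mono_ali exp/tri1
utils/mkgraph.sh data/lang_test exp/tri1 exp/tri1/graph
steps/decode.sh --nj 4 --cmd "$decode_cmd" exp/tri1/graph data/test exp/tri1/decode

So from a single run I actually get a set of different WERs, a couple for the monophone model and a couple for each triphone model: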

exp/mono/decode/wer_14
%WER 53.09 [ 2371 / 4466, 27 ins, 1074 del, 1270 sub ]
%SER 90.60 [ 453 / 500 ]
exp/mono/decode/wer_15
%WER 54.05 [ 2414 / 4466, 27 ins, 1131 del, 1256 sub ]
%SER 90.40 [ 452 / 500 ]
exp/tri1/decode/wer_8
%WER 26.35 [ 1177 / 4466, 126 ins, 193 del, 858 sub ]
%SER 70.20 [ 351 / 500 ]
exp/tri1/decode/wer_9
%WER 24.61 [ 1099 / 4466, 99 ins, 216 del, 784 sub ]
%SER 66.40 [ 332 / 500 ]
exp/tri2a/decode/wer_7
%WER 27.43 [ 1225 / 4466, 156 ins, 173 del, 896 sub ]
%SER 72.00 [ 360 / 500 ]
exp/tri2a/decode/wer_8
%WER 25.71 [ 1148 / 4466, 131 ins, 192 del, 825 sub ]
%SER 69.60 [ 348 / 500 ]
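To pick out the best score for each model without eyeballing every wer_N file, the standard Kaldi utils include a small helper (assuming utils/best_wer.sh is linked into this recipe, as it is in most egs):

$ for x in exp/*/decode; do grep WER $x/wer_* | utils/best_wer.sh; done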

Note: one measure of a speech recognition system’s accuracy is 1 - WER. From the looks of it, the triphone models perform better than the monophone models, which was expected.
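For example, plugging in the numbers from the best triphone pass above (exp/tri1/decode/wer_9): (99 insertions + 216 deletions + 784 substitutions) / 4466 reference words = 1099 / 4466 ≈ 0.246, i.e. a 24.61% WER and roughly 75% accuracy, compared with roughly 47% accuracy for the best monophone pass.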

Preparing CHILDES data for Kaldi

At this point I still have a pretty rough understanding of Kaldi, but I think I have enough experience to attempt to build an ASR system with the CHILDES data (or at least a subset of it).
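The preparation should mirror what I did for VoxForge: Kaldi mostly just needs a data directory per split, so the real work will be mapping CHILDES audio and transcripts into roughly this layout (the paths and IDs below are made up for illustration):

data/train/
  wav.scp   # utterance-id -> audio path, e.g. "chi_001_utt01 /path/to/chi_001_utt01.wav"
  text      # utterance-id -> transcript, e.g. "chi_001_utt01 the dog ran"
  utt2spk   # utterance-id -> speaker-id, e.g. "chi_001_utt01 chi_001"
  spk2utt   # speaker-id -> utterance-ids (generated with utils/utt2spk_to_spk2utt.pl)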

SANLAB meeting

This week there was a SANLAB (Speech and Applied Neuroscience Laboratory) meeting with Dr. Brumberg. I’m usually pretty excited to attend these because I think the work the other students are doing is super cool. I mean, I only understand bits and pieces of the conversation between the students and Dr. Brumberg, but somehow it’s all intriguing. Also, the group sometimes goes off on tangents that prompt Dr. Brumberg to bring out some piece of technology. For instance, this week he ended up bringing out a Google Glass headset to make a couple of points about virtual/augmented reality, which was cool. Anyways, I think the projects carried out within SANLAB are often realized as Human-Computer Interaction (HCI) experiments with a clinical focus. Then again, I don’t think it’s really possible to separate HCI from SANLAB, since HCI is broad and people-focused, encompassing engineering, computer science, psychology, and neuroscience. I like the intersection of those fields.

Best,
EO