Week 14

Revisiting data cleanup

This entire semester we've been trying to gather an appropriate amount of data to train our automatic speech recognition system and build co-occurrence networks. It has been a little difficult because of the massive amount of data available on CHILDES. We are still in the process of calculating the summary statistics of the CHILDES and VoxForge corpora. Now that I am a bit more comfortable with the Kaldi framework, the next chunk of work is scraping, cleaning, and formatting the speech data correctly. As a start, a few weeks ago I wrote a script to calculate the hours of audio within a given corpus using MATLAB's audio functions.

Here is the portion of the script that calculates the total duration:

totalDuration = 0; % running total, in seconds

if totalNumberOfFiles >= 1
	for k = 1 : totalNumberOfFiles
		% Go through all those files
		thisFolder = allFileInfo(k).folder;
		thisBaseFileName = allFileInfo(k).name;
		fullFileName = fullfile(thisFolder, thisBaseFileName);
		% Get audio file duration (in seconds)
		info = audioinfo(fullFileName);
		audioFileDuration = info.Duration;
		% Accumulate total audio file duration
		totalDuration = totalDuration + audioFileDuration;
	end
end

% Convert the running total from seconds to hours and display it
totalDuration = totalDuration/3600
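
For completeness, here is a rough sketch of how the file list the loop iterates over might be gathered; the corpus path and the *.wav extension are placeholders of mine, not necessarily what my script actually uses:

% Sketch: build the file list used by the duration loop above.
% The folder path and *.wav extension are assumptions.
corpusFolder = '/path/to/corpus';                     % hypothetical path
filePattern = fullfile(corpusFolder, '**', '*.wav');  % recursive search
allFileInfo = dir(filePattern);
totalNumberOfFiles = numel(allFileInfo);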

Group meeting

This week was another group meeting. We presented our respective outlines (as PowerPoint slides) to Dr. Brumberg and Shadi, and they seemed to think the outlines were good. Regarding the per-word-accuracy (PWA) algorithm, Dr. Brumberg mentioned there were some cases I wasn't checking for, particularly the problem of shifting when calculating the edit distance: we must account for the moments when the reference transcript and the hypothesis are misaligned. For example, if the reference transcript contains the utterance ABC but the hypothesis is AABC, then the entire string is shifted over by one, which could generate an interesting type of error. I haven't spent much time trying to figure this out yet, but I know it is a common problem in string comparison algorithms. There are also other edit distance algorithms, so maybe those could be useful.
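
To make the shifting problem concrete, here is a minimal word-level Levenshtein sketch in MATLAB; the function name and interface are placeholders of mine, not our actual PWA code. Because insertions and deletions each cost 1, an extra leading A in the hypothesis (ABC vs. AABC) is scored as a single insertion rather than cascading into mismatches across the rest of the string:

function d = wordEditDistance(refWords, hypWords)
	% Word-level Levenshtein distance between two cell arrays of words.
	% Insertions/deletions absorb alignment shifts, so {'A','B','C'}
	% vs. {'A','A','B','C'} scores 1 (one insertion), not several
	% substitutions.
	nRef = numel(refWords);
	nHyp = numel(hypWords);
	D = zeros(nRef + 1, nHyp + 1);
	D(:, 1) = 0:nRef;   % cost of deleting all reference words
	D(1, :) = 0:nHyp;   % cost of inserting all hypothesis words
	for i = 1:nRef
		for j = 1:nHyp
			subCost = ~strcmp(refWords{i}, hypWords{j});
			D(i+1, j+1) = min([D(i, j)   + subCost, ... % match/substitute
			                   D(i, j+1) + 1, ...       % deletion
			                   D(i+1, j) + 1]);         % insertion
		end
	end
	d = D(end, end);
end

For instance, wordEditDistance({'A','B','C'}, {'A','A','B','C'}) returns 1.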

We spent some time trying to figure out how to extract timestamps from the transcripts using CHA. We found that not all transcripts have timestamps, so if we wish to generate the appropriate audio segments, we can only use transcripts that contain the Unicode bullets (how timestamps are denoted in the CHA environment). Alternatively, we don't have to rely on the bullets, since the timestamps are given directly in the form "XXXXX_XXXXX" when the transcripts are viewed as plain text files. I will spend next week figuring out which transcripts have the appropriate timestamps and parent tags (MOT, FAT, PAR, MOM, etc.).
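
As a starting point for that filtering, something like the sketch below could flag usable transcripts when they are read as plain text; the regular expressions are my guesses at the raw format (speaker tiers like *MOT: and XXXXX_XXXXX timestamps), so they will need checking against real files:

% Sketch: flag whether a transcript (read as plain text) contains
% timestamps and a target speaker tier. The file name and regex
% patterns are assumptions about the raw format, to be verified.
txt = fileread('transcript.cha');   % hypothetical file name
hasTimestamps = ~isempty(regexp(txt, '\d+_\d+', 'once'));
speakerTags = {'MOT', 'FAT', 'PAR', 'MOM'};
hasParentTier = false;
for k = 1:numel(speakerTags)
	pattern = ['^\*', speakerTags{k}, ':'];
	if ~isempty(regexp(txt, pattern, 'once', 'lineanchors'))
		hasParentTier = true;
		break;
	end
end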

Best,
EO