Week 42 (Extension)

Additional summary statistics

So, before I created a threshold I wanted to go through the resegmented audio files and get the snr values, so I can collect an average snr for each of the sub corpus and ages ranges. This will be one of my metrics for getting rid of the unusable audio files:

for k = 1 : totalNumberOfFiles
  %% Go through each file
  thisFolder = allFileInfo(k).folder;
  thisBaseFileName = allFileInfo(k).name;
  fullFileName = fullfile(thisFolder, thisBaseFileName);

  [input_signal, Fs] = audioread(fullFileName);
  r = mean(abs(hilbert(input_signal)))/(std(input_signal));
  AccumulatedSNR = [AccumulatedSNR; {r}];
  AccumulatedDuration = [AccumulatedDuration, {length(input_siganl)/Fs}];             
end

So, I’ll end up with three list that I can use to calculate the average, median, and standard deviation for snr, samples, and duration respectively. Notice, I can calculate the average number of samples from the average number of duration. Once I have these calculated I can plot them to observe the nature of the data.

VoxForge corpus

Additionally, I wanted to go through the VoxForge transcripts and scrape the mean length utterance (MLU), mean length word (MLW), and the total number of words, and unique words:

for k = 1 : totalNumberOfFiles
		%% Go through each file
        thisFolder = allFileInfo(k).folder;
		thisBaseFileName = allFileInfo(k).name;
		fullFileName = fullfile(thisFolder, thisBaseFileName);

        % Read in the transcript
        % Note: fileText is a .txt file
        fileText = textread(fullFileName,'%s','delimiter','\n');
        % Go through each line
        for i = 1:length(fileText)
            utterance = extractAfter(fileText(i)," ");
            str = strjoin(utterance);
            matches = regexpi(str, '\w+');
            N = numel(matches);
            AccumulatedMLU = [AccumulatedMLU; {N}];
            split_str = split(strtrim(erase(str,[".",","])));
            words = strlength(split_str);
            for j = 1:length(words)
                AccumulatedWords = [AccumulatedWords, {split_str(j)}];
                AccumulatedMLW = [AccumulatedMLW; {words(j)}];
            end         
        end
	end

Though, Rebekah already completed this for me… VoxForge_Stats

Again, the VoxForge corpus is supposed to represent typical adult speech, so the typical length of utterance and and typical length of word and thus we expect them on average to be longer.

Voice activity detection

Still trying to figure out a good way to set a threshold for the audio segments…In any case it would be useful to now start training on VoxForge and testing on some of the CHILDES data.

Best,
EO