Week 18
Cleaning up data
This is the most tedious part of the project. I think that’s why I’ve been avoiding it for so long. It’s nothing but cleaning up the usable data so that it can be properly formatted for the actual ASR construction and analysis.
For my data clean up I used a combination of R and MATLAB (these are the two environments I prefer to do most of my work in). Really, I just used a bunch of my favorite MATLAB and R library/functions. When one set of functions weren’t giving me what I needed I switched over to the other. My mom always says, “Erick, is a very picky person”, maybe she was right?
Since most of the initial data collection and summary statistics were gathered through MATLAB I decided to just aggregate all the usable data in MATLAB first and then pass it to R, so I could do more “fine-grained text analysis” , MATLAB would be good for this too, but R is KING.
The MATLAB code to aggregate the transcripts follows the same design pattern as the two previous scripts:
.
.
.
AccumulatedTranscripts = {};
if totalNumberOfFiles >= 1
for k = 1 : totalNumberOfFiles
% Go through all those files
thisFolder = allFileInfo(k).folder;
thisBaseFileName = allFileInfo(k).name;
fullFileName = fullfile(thisFolder, thisBaseFileName);
% Open and read the transcript
% Note: This converts the transript file to .txt files
fileText = fileread(fullFileName);
% Check if the transcript has the MOT, FAT, or PAR tags
% Note: "" is required
pattern_1 = ["MOT","FAT","PAR"];
% Extract the MOT, FAT, and PAR utterances
if(contains(fileText,pattern_1))
% Extract the lines (columns) of when the parent (MOT, FAT, PAR) is talking
expr_MOT = '[^\n]**MOT[^\n]*';
expr_FAT = '[^\n]**FAT[^\n]*';
expr_PAR = '[^\n]**PAR[^\n]*';
matches = [regexp(fileText,expr_MOT,'match') regexp(fileText,expr_FAT,'match') regexp(fileText,expr_PAR,'match')];
else
% Extract the lines (columns) of when the parent (MOM, DAD, PAR) is talking
expr_MOM = '[^\n]**MOM[^\n]*';
expr_DAD = '[^\n]**DAD[^\n]*';
expr_PAR = '[^\n]**PAR[^\n]*';
matches = [regexp(fileText,expr_MOM,'match') regexp(fileText,expr_DAD,'match') regexp(fileText,expr_PAR,'match')];
end
% Transpose data (i.e 1 by 233 -> 233 by 1)
matches = transpose(matches);
AccumulatedTranscripts = [AccumulatedTranscripts; matches]
end
end
% Output all the transcript lines to a text file.
dlmcell('AccumulatedTranscripts.txt', AccumulatedTranscripts);
Okay, okay, this wasn’t very difficult to do, but MATLAB doesn’t make outputting cell arrays to txt files very easy. Further, if you’re trying to save a non-numeric cell array to a .csv or .xlsx file on macOS, then you’re going to need some “outside help”. Otherwise, fprint(), csvwrite(), xlsxwrite() are going to ask for what seems like a mountain of specificity about your generated data, which I never deliver timely.
For my outside help I used the dlmcell function created by Roland Pfister. This allows me to offload all the confusing, convoluted text processing into one line:
% Output all the transcript lines to a text file.
dlmcell('AccumulatedTranscripts.txt', AccumulatedTranscripts);
I also noticed that within the CHILDES corpus, not all the media files have the .wav extension. Sometimes they’re given as .mp3, .mp4, and .MOV only. So, that’s a little annoying because I believe I can only use .wav files with Kaldi (for getting mfcc values). That’s okay, ffmpeg and built in MATLAB functions make converting the audio from video files to .wav files super straightforward:
% Convert .mp4 files into .wav files
filename = 'af06.mp4';
[input_file, Fs] = audioread(filename);
audiowrite(strrep(filename, '.mp4', '.wav'), input_file, Fs);
We can also visualize the generated audio file in a variety ways within MATLAB. To be clear, I’m not particularly interested in analyzing these signals at the moment, but I think a couple of exercises plotting the audio files would be good:
% Visualize audio
[input_file, Fs] = audioread(strrep(filename, '.mp4', '.wav'));
figure
plot(input_file)
title('Sound Signal');
xlabel('Time (s)');
ylabel('Amplitude');
% Visualize audio
[input_file, Fs] = audioread(strrep(filename, '.mp4', '.wav'));
specgram(input_file(:,1),1024,Fs);
title('Sound Signal');
xlabel('Time (s)');
ylabel('Frequency (kHz)');
And for something a little fancier:
% Visualize audio
[input_file, Fs] = audioread(strrep(filename, '.mp4', '.wav'));
signal = input_file(:,1);
plot(psd(spectrum.periodogram,signal,'Fs',Fs,'NFFT',length(signal)))
Anyways, I let MATLAB finish converting all the .mp4 files to .wav files and began extracting the exploratory information from the aggregated transcripts in R.
I then noticed two things that I was not checking for within the transcripts. First, there are others (i.e ‘PAM’, ‘PAT’) talking to the children during the CHILDES sessions. Before I realized this we were only collecting ‘FAT’, ‘MOT’, but really we want all the utterances that are not from the child (‘CHI’). Secondly, even if some transcripts give timestamps, not all utterances are consistently notated. Which means all the lines without timestamps I can’t use to generate the respective audio file. I suppose this isn’t too troubling, because we’ve already narrowed down our corpus. I do need to change the top script to:
% Get all the non-child utterances and timestamps
expr_PTAGS = '[^\n]**[A-BD-GJ-Z][^\n]+'
expr_TS = '[0-9]+_[0-9]';
% Extract the MOT, FAT, and PAR utterances
matches = regexp(fileText,expr_PTAGS,'match');
% Transpose data (i.e 1 by 233 -> 233 by 1)
matches = transpose(matches);
% Collect transcripts
if(not(isempty(contains(regexp(matches(i), expr_TS, 'match'))))
AccumulatedTranscripts = [AccumulatedTranscripts; matches]
end
For next week I will continue to clean up and format the data.
Best,
EO