Week 2

17 Sep 2017

Collecting data

Our first major task is finding sufficient data for both training and testing the speech recognition system. Our speech recognition system is unlikely to learn without sufficient language data (much like a young child). We suspect that if we train a speech recognition system on baby-talk to recognize actual words, then it should be possible to understand the linguistic and statistical properties of language input.

Dr. Beckage said to look at the CHILDES website for speech data. Dr. Brumberg also noted the LENA foundation will be a primary resources for our project.

CHILDES

The Child Language Data Exchange System (CHILDES) corpora is an extensive collection of language acquisition data. It was created and maintained by Brian MacWhinney and Catherine Snow. The CHILDES database belongs to a larger collection called TalkBank, a publicly available resource for studying “conversational interactions”. This is a good collection of audio files (mp3 & wav) with their associated transcripts (encoded in the CHAT format).

Working with CHILDES

The CHILDES website gives a nice outline on how to interact with the different corpora. The two main options are using their browsable database or downloading the media files and transcripts directly to your computer. The browsable database has a very clean interface and allows one to look through the media files and transcripts quickly online. Their online browser also supports further analysis through the CLAN programming language. For example, once one finds a study of interest it’s possible to run statistics like co-occurence or frequency of words within the transcript.

We will be needing to download a lot of audio and transcript files from the CHILDES collection. It would be very tedious to download all the files one by one, so we’ll use wget, a utility from getting files from the web.

As an example, I wanted to download all the wav files from the Bernstein-Ratner Corpus:

$ wget -c robots=off -r -l inf --no-remove-listing -nH --no-parent -R 'index.html*' -A '*.wav' http://media.talkbank.org/CHILDES/Eng-NA/Braunwald/

The CHILDES website provides additional instructions for scraping large amounts of data from the collection.

As we progress we’ll need to more selective of the corpora we’re utilizing for training and testing our models. Until then, it’s still just interesting to browse.

Best,
EO