Week 10
Building a dictionary
After I got SRILM installed I was ready to build a dictionary. The dictionary is a mapping between phonemes (or some other linguistic unit) to actual words. Within Kaldi, the script used for preparing such a dictionary is local/voxforge_prepare_dict.sh. This script downloads the CMU’s pronunciation dictionary and prepares a list of the words that are found in the train set, but not in cmudict:
$ local/voxforge_prepare_dict.sh
=== Preparing the dictionary ...
--- Striping stress and pronunciation variant markers from cmudict ...
133287 forms processed
123699 baseforms, 9283 variants and 305 duplicates found.
variant distribution:
2 8591
3 527
4 165
--- Searching for OOV words ...
1365 data/local/dict/vocab-oov.txt
12575 data/local/dict/lexicon-iv.txt
utils/make_absolute.sh: line 7:
cd:/Users/erickoduniyi/kaldi-trunk/egs/voxforge/s5/../../../tools/sequitur/lib/python*: No such file or directory
The Sequitur was not found !
The prerequisites! There are a lot of tools packaged with Kaldi (especially within the extra directory), one of which is Sequitur. Sequitur is a data-driven grapheme-to-phoneme converter. I need this before I can proceed with the dictionary building:
$/tools/extras/install_sequitur.sh
File "<string>", line 1
from distutils import sysconfig as s; print s.get_config_vars()['INCLUDEPY']
^
SyntaxError: invalid syntax
This error is an artifact from python 2.7. One must have parenthesis “()” in python 3 when using print. I corrected this by going into install_sequiter.sh script and putting parentheses around the print statement. Actually, after this error I realized that Sequitur is dependent on a lot python functions that are no longer in use…
In file included from sequitur_wrap.cpp:3144:
./EditDistance.cc:41:14: error: use of undeclared identifier 'PyObject_Compare'
return (PyObject_Compare(a, b)) ? 1 : 0;
^
1 error generated.
Multigram.cc:38:33: error: use of undeclared identifier 'PyInt_FromLong'; did you
mean 'PyLong_FromLong'?
PyTuple_SET_ITEM(result, i, PyInt_FromLong(data_[i]));
^~~~~~~~~~~~~~
PyLong_FromLong
/Users/erickoduniyi/anaconda/include/python3.5m/tupleobject.h:62:75: note:
expanded from macro 'PyTuple_SET_ITEM'
#define PyTuple_SET_ITEM(op, i, v) (((PyTupleObject *)(op))->ob_item[i] = v)
^
/Users/erickoduniyi/anaconda/include/python3.5m/longobject.h:18:24: note:
'PyLong_FromLong' declared here
PyAPI_FUNC(PyObject *) PyLong_FromLong(long);
^
Multigram.cc:53:10: error: use of undeclared identifier 'PyInt_Check'
if (!PyInt_Check(item)) {
^
Multigram.cc:57:18: error: use of undeclared identifier 'PyInt_AsLong'; did you
mean 'PyLong_AsLong'?
long value = PyInt_AsLong(item);
^~~~~~~~~~~~
PyLong_AsLong
/Users/erickoduniyi/anaconda/include/python3.5m/longobject.h:23:18: note:
'PyLong_AsLong' declared here
PyAPI_FUNC(long) PyLong_AsLong(PyObject *);
^
3 errors generated.
Again, all of these errors were generated because Sequitur uses python functions that are deprecated in python 3. These errors were corrected by going into EditDistance.cc and Multigram.cc (where the functions are used) and swapping out the old python functions for their updated version. For example, in python 3 PyObject_Compare(a, b) was replaced with PyObject_RichCompare(a, b, Py_EQ). After these errors were corrected Sequitur was installed successfully. Now I can go back to setting up the dictionary:
$ local/voxforge_prepare_dict.sh
=== Preparing the dictionary ...
--- Striping stress and pronunciation variant markers from cmudict ...
133287 forms processed
123699 baseforms, 9283 variants and 305 duplicates found.
variant distribution:
2 8591
3 527
4 165
--- Searching for OOV words ...
1365 data/local/dict/vocab-oov.txt
12575 data/local/dict/lexicon-iv.txt
--- Preparing pronunciations for OOV words ...
File "/Users/erickoduniyi/kaldi-trunk/tools/sequitur-g2p/bin/g2p.py", line 46
fields = line.split()
^
TabError: inconsistent use of tabs and spaces in indentation
--- Prepare phone lists ...
--- Adding SIL to the lexicon ...
*** Dictionary preparation finished!
Group meeting
This week we had our 6th group meeting. During this meeting we showed our concept slides to Dr. Beckage, Dr. Brumberg, and Shadi. I think they thought the slides were fine, we ended up using them as a starting point to discuss new ideas or rather the ideas we can actually try and implement at the moment. At a high level we’re interested in inferring the child’s learning and “cognitive structures” from their parents and then seeing if there are any similarities between age groups. That is, based on a subset of the the different parents language network and phonetic output we want to better understand how language learning happens over time.
Best,
EO