In this directory we will prepare the given data so that they become Janus-readable. We will add a few missing words to the dictionary and create a Janus-style database for the given task. We will also split the given data into a training and a test set.
Let's start by having a look at what we have. Type
more ../data/dict
and have a look at the given dictionary. You can find dictionaries of
this kind at various publicly accessible places on the internet. The
one you are currently looking at is an excerpt from the freely available
CMU dictionary that was used by various sites for the (D)ARPA continuous
speech recognition evaluations.
Every line contains one word (given in all capital letters) followed
by the space-separated phoneme sequence that describes one common
pronunciation of the word. Some words have multiple valid pronunciations,
but to make things easier we did not include the pronunciation variants
in this tutorial's dictionary. Type the command
cut -f2- -d' ' ../data/dict | tr ' ' '\012' | sort | uniq -c
This will list all the used phonemes together with their frequencies. There are 43 different phones. To make the development process faster and easier to follow, we've decided to use only 16 phonemes for the tutorial.
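If you prefer to do such counting in Tcl, here is a sketch with the same effect (it assumes that no dictionary line contains characters that are special to Tcl lists, which holds for the all-capitals entries used here):

# countPhones.tcl -- count how often each phone occurs in the dictionary
set fp [open ../data/dict r]
while { [gets $fp line] != -1 } {
    foreach p [lrange $line 1 end] {
        if { ![info exists count($p)] } { set count($p) 0 }
        incr count($p)
    }
}
close $fp
foreach p [lsort [array names count]] {
    puts [format "%7d %s" $count($p) $p]
}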
more ../data/phonesMap
will show you a suggested mapping from the 43 CMU phones down to the 16 phones that we are going to use. Some of the CMU phones (e.g. diphthongs) will be replaced by two phones from our smaller 16-phone set.
Transcriptions can come in many different formats and styles. Possibly you've done some transcribing yourself. Do a:
more ../data/transcripts
to look at the given transcriptions. In our case we have one transcription per line in the transcriptions file. The first space-separated item in every line is the "name" of the utterance. You don't have to care about the meaning of these utterance names; they are just the last 5 characters of the corresponding WSJ names used by ARPA/NIST/LDC. The rest of the line contains the transcribed words, written in capital letters only. Keep in mind that Janus reads transcriptions and dictionaries (as well as everything else) in a case-sensitive mode, so don't ever assume that two words capitalized differently are the same thing; they are not.
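Since the transcripts will later be read by Tcl anyway, you can also inspect them from within Tcl. The following sketch prints every utterance name together with its word count (again assuming no line contains Tcl-special characters):

set fp [open ../data/transcripts r]
while { [gets $fp line] != -1 } {
    puts "[lindex $line 0]: [expr {[llength $line] - 1}] words"
}
close $fp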
Usually, Janus can understand almost any format of raw data, such that there is no need for manipulating the recordings. All we need is a suitable feature description that tells Janus how to interpret the recordings. In our case such a "feature description" could look like this:
$fes readADC ADC $arg(adc) -bm shorten -hm 1024 -offset mean
$fes adc2mel MSC ADC 16ms
It has two lines. The first line tells the feature module how to read in the recording files. It defines a feature named "ADC" which is filled by the "readADC" command using the arguments that follow in the rest of the line. The second line specifies how to preprocess the data. In this example the preprocessing command is "adc2mel", which means: compute melscale coefficients (by default 16 coefficients), where each frame covers 16 ms on the time axis. This tutorial does not go into the details of preprocessing, but to find out more about the feature module and what kind of preprocessing is possible, you can look at the available methods of the FeatureSet type using the help function. A good book would be the best place to look for more details about the theory.
Usually you would have a feature description file that contains these two lines. For now we don't have to use a file; we can just type the commands manually into a running Janus.
You can enter the following commands in Janus to see what the preprocessing does:
% FeatureSet fs
% fs readADC ADC ../data/recordings/a0101 -bm shorten -hm 1024 -offset mean
% fs adc2mel MSC ADC 16ms
% fs show ADC
A window will pop up and display the waveform. Play around with the controls of the feature-displaying tool; you can also select the MSC feature there and have a look at the mel spectral coefficients.
To reduce the 43-phoneme set of the CMU dictionary down to the 16 phones that we want to use, you can use the following lines:
echo cat ../data/dict \| sed `cat ../data/phonesMap | \
awk '{ printf(" -e ,s/ %s / %s %s /g,",$1,$2,$3); }'` | tr , "'" | sh > tmp
echo cat tmp \| sed `cat ../data/phonesMap | \
awk '{ printf(" -e ,s/ %s / %s %s /g,",$1,$2,$3); }'` | tr , "'" | sh > mappedDict
rm tmp
Of course you are welcome to write your own script (preferably in Tcl) if you don't like the one above.
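For instance, a minimal Tcl sketch of the same mapping could look like the following (the script name is just a suggestion; it assumes the phonesMap format shown above, i.e. a CMU phone followed by one or two target phones per line, and applies the mapping exactly once per phone, which has the same effect as the double sed pass):

# mapPhones.tcl -- map the CMU phones in ../data/dict down to the
# 16-phone set described by ../data/phonesMap
set fp [open ../data/phonesMap r]
while { [gets $fp line] != -1 } {
    set phoneMap([lindex $line 0]) [lrange $line 1 end]
}
close $fp

set in  [open ../data/dict r]
set out [open mappedDict w]
while { [gets $in line] != -1 } {
    set mapped [lindex $line 0]
    foreach p [lrange $line 1 end] {
        if { [info exists phoneMap($p)] } {
            set mapped [concat $mapped $phoneMap($p)]
        } else {
            set mapped [concat $mapped $p]
        }
    }
    puts $out $mapped
}
close $in
close $out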
Often you will encounter the case that some of the words in your database are not covered by the dictionary. If there are only a few missing words, then you will simply have to add them manually. You'll have to find a pronunciation yourself. You should do this by looking at the pronunciations of similar words that are covered by the dictionary. Sometimes adding the new word means just adding a plural-s at the end, or concatenating two other words, etc.
To find out which words are missing, type the following one-liner:
cut -f2- -d' ' ../data/transcripts | tr ' ' '\012' \
| sort -u | join -v 1 - ../data/dict > missingWords
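The same check can be done in Tcl; here is a sketch (the script name is hypothetical) that collects all dictionary words and then writes out every transcript word that is not among them:

# missingWords.tcl -- find transcript words without a dictionary entry
set fp [open ../data/dict r]
while { [gets $fp line] != -1 } {
    set known([lindex $line 0]) 1
}
close $fp

set fp [open ../data/transcripts r]
while { [gets $fp line] != -1 } {
    foreach w [lrange $line 1 end] { set seen($w) 1 }
}
close $fp

set out [open missingWords w]
foreach w [lsort [array names seen]] {
    if { ![info exists known($w)] } { puts $out $w }
}
close $out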
Now the file "missingWords" contains a list of all words that should be added to the dictionary. The following lines are an example of how the dictionary could be completed:
echo "+GARBIFY_PREV+ GARBAGE" >> mappedDict echo "+LIP_SMACK+ GARBAGE" >> mappedDict echo "+MISC_NOISE+ GARBAGE" >> mappedDict echo "+TONGUE_CLICK+ GARBAGE" >> mappedDict echo "+UNINTELLIGIBLE+ GARBAGE" >> mappedDict echo "+PERCENT B A S E N D" >> mappedDict echo "+POINT B O Y N D" >> mappedDict echo "CHANGED D S E Y N D S D" >> mappedDict echo "COMMISSIONS G A M I S A N S" >> mappedDict echo "FIFTEEN F I F D I N" >> mappedDict echo "HUMANKIND H Y U M A N G A Y N D" >> mappedDict echo "WEEKS U I G S" >> mappedDict echo 'SIL _' >> mappedDict echo '$ _' >> mappedDict echo '( _' >> mappedDict echo ') _' >> mappedDict sort -o mappedDict mappedDict
Janus expects a special format for its dictionary. Fortunately this format does not differ much from what commonly available dictionaries look like. Usually a dictionary looks like:
ABLE EY B AX L
ABOUT AX B AW TD
ACCEPTANCE AE K S EH PD T AX N S
ACCEPT AE K S EH PD TD
ACTION AE K SH AX N
...
Whereas Janus wants:
{ABLE} {EY B AX L}
{ABOUT} {AX B AW TD}
{ACCEPTANCE} {AE K S EH PD T AX N S}
{ACCEPT} {AE K S EH PD TD}
{ACTION} {AE K SH AX N}
...
The curly braces are used by Tcl, because the dictionary will be interpreted by Tcl when it is read. Obviously you wouldn't need the braces around the words in the above example, but you could imagine words that include special characters, and those would have to be packed in braces. However, you are strongly discouraged from using special characters in the names of words or phonemes; they are almost certain to cause you trouble.
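To see what the braces buy you, try feeding such a line to a plain Tcl shell; each dictionary line becomes a two-element Tcl list, so the word and its pronunciation can be picked apart with lindex:

% set line {{ABLE} {EY B AX L}}
% lindex $line 0
ABLE
% lindex $line 1
EY B AX L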
The following one-liner does the conversion of the dictionary for our example:
cat mappedDict | sed 's/ $//g' \
| perl -pe 's/([^ ]*) ([^ \n]*)(.*)/{$1} {{$2 WB} $3}/g' \
| perl -pe 's/ ([^ }]+)}\n/ {$1 WB}}\n/g' \
| tr -s ' ' | sed s/GARBAGE/+/g > convertedDict
Remember, it's a one-line-command. We've only split it into four lines for your reading convenience.
This way, we can easily create our Janus-readable dictionary. It's now called "convertedDict" and will be used from now on without further modifications.
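Because Janus will parse convertedDict with Tcl, a cheap sanity check is to verify that every line is a well-formed two-element Tcl list (word plus pronunciation); a minimal sketch:

# checkDict.tcl -- flag convertedDict lines that don't parse as
# a two-element Tcl list
set fp [open convertedDict r]
set n 0
while { [gets $fp line] != -1 } {
    incr n
    if { [catch {llength $line} len] || $len != 2 } {
        puts "line $n looks suspicious: $line"
    }
}
close $fp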
Databases are standard objects in Janus. They can be used for anything,
but one of the most common usages is a task database, describing the recordings
of the task, giving all the needed information about every utterance. In
our example, we have a file ../data/transcripts which contains the most
essential information about the task's utterances, namely an utterance
ID and a transcription.
Other tasks can be organized in a different way, so you'll have to figure
out for yourself how best to structure your data into a Janus database.
In our example the following script can be run in Janus and will create
a Janus-style database:
[DBase db] open db.dat db.idx -mode rwc
set fp [open ../data/transcripts r]
while { [gets $fp line] != -1 } {
    set utt [lindex $line 0]
    db add $utt [list [concat text [lrange $line 1 end]] [list utt $utt]]
}
db close
exit
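To convince yourself that the database came out right, you can reopen it read-only in a fresh Janus session and look up an entry. Here we assume that a0101 is one of your utterance IDs and that your DBase version offers a get method; if it doesn't, check the available methods with the help function:

% [DBase db] open db.dat db.idx -mode r
% db get a0101
% db close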
There are 120 utterances in our data set. Of course, when training and evaluating a real system, you would want to use cross-validation sets for training and an unseen set for testing, but to make things simple, we will split our 120 sentences into just two chunks, a training set of 120, and a test set of 10 sentences. Doesn't add up to 120? Well, something must have gone wrong, but don't tell anybody :-)
You can use
cut -f1 -d' ' ../data/transcripts > trainIDs
foreach utt ( 1 3 10 21 31 39 53 64 66 112 )
  head -$utt ../data/transcripts | cut -f1 -d' ' | tail -1 >> testIDs
end
to create two files: one that contains the utterance IDs of the training set and one that contains the utterance IDs of the test set.
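If you'd rather avoid the csh loop, here is a Tcl sketch with the same effect (the numbers are the same line numbers in ../data/transcripts used above):

# split.tcl -- write trainIDs (all utterances) and testIDs (10 of them)
set fp [open ../data/transcripts r]
set ids {}
while { [gets $fp line] != -1 } { lappend ids [lindex $line 0] }
close $fp

set train [open trainIDs w]
foreach id $ids { puts $train $id }
close $train

set test [open testIDs w]
foreach n { 1 3 10 21 31 39 53 64 66 112 } {
    puts $test [lindex $ids [expr {$n - 1}]]
}
close $test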