Preparing the Data

In this directory we will prepare the given data so that it becomes Janus-readable. We will add a few missing words to the dictionary, create a Janus-style database for the given task, and split the given data into a training set and a test set.


Change to the directory step1. If it doesn't exist, create it next to the data directory. Originally this directory is empty. The rest of this page documents how to create the files that should eventually be there.

Looking at the Dictionary

Let's start by having a look at what we have. Type

   more ../data/dict

and have a look at the given dictionary. You can find dictionaries of this kind at various publicly accessible places on the internet. The one you are currently looking at is an excerpt from the freely available CMU dictionary that was used by various sites for the (D)ARPA continuous speech recognition evaluations.
Every line contains one word (given in all capital letters) followed by the space-separated phoneme sequence that describes one common pronunciation of the word. Some words can have multiple valid pronunciations, but to make things easier we did not include the pronunciation variants in this tutorial's dictionary. Type the command

  cut -f2- -d' ' ../data/dict | tr ' ' '\012' | sort | uniq -c

This will list all phonemes used in the dictionary together with their frequencies. There are 43 different phones. To make the development process faster and easier to follow, we've decided to use only 16 phonemes for the tutorial.
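
To check the number of different phones directly, the pipeline above can be reduced to a count. The following is only a minimal sketch on a toy two-entry dictionary; the temporary directory and the variable name n_phones are made up for illustration. On the real data you would use ../data/dict instead of the toy file.

```shell
tmp=$(mktemp -d)
# Toy dictionary in the same layout: WORD followed by space-separated phonemes.
printf '%s\n' 'ABLE EY B AX L' 'ABOUT AX B AW TD' > "$tmp/dict"

# One phoneme per line, duplicates removed, then count the remaining lines.
n_phones=$(cut -f2- -d' ' "$tmp/dict" | tr ' ' '\012' | sort -u | grep -c .)
echo "$n_phones distinct phonemes"
rm -r "$tmp"
```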

  more ../data/phonesMap

will show you a suggested mapping from the CMU phones down to the 16 phones that we are going to use. Some of the CMU phones (e.g. diphthongs) will be replaced by two phones from our smaller 16-phone set.

Looking at the Transcriptions

Transcriptions can come in many different formats and styles. Possibly you've done some transcribing yourself. Do a:

    more ../data/transcripts

to look at the given transcriptions. In our case we have one transcription per line in the transcriptions file. The first space-separated item in every line is the "name" of the utterance. You don't have to care about the meaning of these utterance names. They are just the last 5 characters of the corresponding WSJ names used by ARPA/NIST/LDC. The rest of the line contains the transcribed words. They are written in capital letters only. Keep in mind that Janus reads transcriptions and dictionaries (as well as everything else) in a case-sensitive mode. Don't ever assume that two words capitalized differently are the same thing; they are not.
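
Because of this case sensitivity it can be worth confirming that the transcriptions really contain capital letters only. The following is a minimal sketch on toy data; the two transcript lines and the variable name lower are invented for illustration. On the real data you would read ../data/transcripts instead.

```shell
tmp=$(mktemp -d)
# Toy transcripts: utterance name first, then the words. "Week" is the odd one out.
printf '%s\n' 'a0101 ABLE WEEKS' 'a0102 Week ABLE' > "$tmp/transcripts"

# Drop the utterance name, put one word per line, keep words with a lowercase letter.
lower=$(cut -f2- -d' ' "$tmp/transcripts" | tr ' ' '\012' | grep '[[:lower:]]' | sort -u)
echo "$lower"
rm -r "$tmp"
```

An empty result means that no word contains a lowercase letter.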

Looking at the Recordings

Usually, Janus can understand almost any format of raw data, so there is no need to manipulate the recordings. All we need is a suitable feature description that tells Janus how to interpret the recordings. In our case such a "feature description" could look like this:

   $fes readADC   ADC   $arg(adc) -bm shorten -hm 1024 -offset mean
   $fes adc2mel   MSC   ADC     16ms

It has two lines. The first line tells the feature module how to read the recording files. It defines a feature named "ADC", which is filled by the "readADC" command using the arguments that follow in the rest of the line. In the second line we specify how to preprocess the data. In this example the preprocessing command is "adc2mel", which means: compute mel-scale coefficients (16 coefficients by default), with each frame covering 16 ms on the time axis. This tutorial does not go into the details of preprocessing. To find out more about the feature module and what kinds of preprocessing are possible, look at the available methods of the FeatureSet type using the help function. For the theory, a good book is the best place to look.

Usually you would have a feature description file that contains these two lines. For now we don't have to use a file; we can just type the commands manually into a running Janus.

You can enter the following commands in Janus to see what the preprocessing does:

% FeatureSet fs
% fs readADC ADC ../data/recordings/a0101 -bm shorten -hm 1024 -offset mean
% fs adc2mel MSC ADC 16ms
% fs show ADC

Then a window will pop up and display the waveform. Using the controls of the feature display tool, you can also select the MSC feature there and have a look at the mel spectral coefficients.

Mapping the Dictionary Phones

To reduce the 43-phoneme set of the CMU dictionary down to the 16 phonemes that we want to use, you can use the following lines:

 echo cat ../data/dict \| sed `cat ../data/phonesMap | \
 awk '{ printf(" -e ,s/ %s / %s %s /g,",$1,$2,$3); }'` | tr , "'" | sh > tmp
 echo cat tmp \| sed `cat ../data/phonesMap | \
 awk '{ printf(" -e ,s/ %s / %s %s /g,",$1,$2,$3); }'` | tr , "'" | sh > mappedDict
 rm tmp

Of course you are welcome to write your own script (preferably in Tcl) if you don't like the one above.
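
As one possible alternative, the mapping can be sketched in awk. This is not the tutorial's own script but a minimal sketch under assumptions: each line of the map file holds a CMU phone followed by one or two replacement phones (which is what the awk call above suggests), and the three map entries and the dictionary entry below are hypothetical toy data. Unlike the sed version, it maps every phoneme exactly once in a single pass and leaves the word itself untouched.

```shell
tmp=$(mktemp -d)
# Hypothetical map entries: CMU phone, then one or two replacement phones.
printf '%s\n' 'K G' 'AY A Y' 'T D' > "$tmp/phonesMap"
printf '%s\n' 'KITE K AY T' > "$tmp/dict"

mapped=$(awk '
  NR == FNR {            # first file: remember the replacement for each phone
    repl = $2
    for (i = 3; i <= NF; i++) repl = repl " " $i
    map[$1] = repl
    next
  }
  {                      # second file: keep the word, map every phoneme once
    out = $1
    for (i = 2; i <= NF; i++) out = out " " (($i in map) ? map[$i] : $i)
    print out
  }
' "$tmp/phonesMap" "$tmp/dict")
echo "$mapped"
rm -r "$tmp"
```

Phones that do not appear in the map are copied unchanged.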

Adding Missing Words to the Dictionary

Often you will find that some of the words in your database are not covered by the dictionary. If there are only a few missing words, you will simply have to add them manually. You'll have to find a pronunciation yourself. You should do this by looking at the pronunciations of similar words that are covered by the dictionary. Sometimes adding the new word means just adding a plural-s at the end, or concatenating two other words, etc.

To find out which words are missing, type the following one-liner:

  cut -f2- -d' ' ../data/transcripts | tr ' ' '\012' \
  | sort -u | join -v 1 - ../data/dict > missingWords

Now the file "missingWords" contains a list of all words that should be added to the dictionary. The following lines are an example of how the dictionary could be completed:

echo "+GARBIFY_PREV+ GARBAGE"           >> mappedDict
echo "+LIP_SMACK+ GARBAGE"              >> mappedDict
echo "+MISC_NOISE+ GARBAGE"             >> mappedDict
echo "+TONGUE_CLICK+ GARBAGE"           >> mappedDict
echo "+UNINTELLIGIBLE+ GARBAGE"         >> mappedDict
echo "+PERCENT B A S E N D"             >> mappedDict
echo "+POINT B O Y N D"                 >> mappedDict
echo "CHANGED D S E Y N D S D"          >> mappedDict
echo "COMMISSIONS G A M I S A N S"      >> mappedDict
echo "FIFTEEN F I F D I N"              >> mappedDict
echo "HUMANKIND H Y U M A N G A Y N D"  >> mappedDict
echo "WEEKS U I G S"                    >> mappedDict
echo 'SIL _'                            >> mappedDict
echo '$ _'                              >> mappedDict
echo '( _'                              >> mappedDict
echo ') _'                              >> mappedDict

sort -o mappedDict mappedDict 
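
To verify that the completed dictionary now covers every transcription word, the missing-word check from earlier in this section can be repeated against mappedDict. The following is a minimal sketch on toy data; the file contents, the helper file dictWords, and the variable name missing are invented for illustration. It sorts both inputs under LC_ALL=C on the assumption that join expects both sides in the same collation order, which matters for entries such as "$" or "(".

```shell
tmp=$(mktemp -d)
printf '%s\n' 'ABLE E B L' 'WEEK U I G' > "$tmp/mappedDict"
printf '%s\n' 'a0101 ABLE WEEKS' 'a0102 WEEK ABLE' > "$tmp/transcripts"

# Word column of the dictionary, sorted under the C locale.
cut -f1 -d' ' "$tmp/mappedDict" | LC_ALL=C sort -u > "$tmp/dictWords"

# Transcription words without a dictionary entry (join -v 1 prints the
# lines of the first input that have no partner in the second).
missing=$(cut -f2- -d' ' "$tmp/transcripts" | tr ' ' '\012' \
  | LC_ALL=C sort -u | LC_ALL=C join -v 1 - "$tmp/dictWords")
echo "$missing"
rm -r "$tmp"
```

An empty result means that all transcription words are covered.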

Creating a Janus-Readable Dictionary

Janus expects a special format for its dictionary. Fortunately this format does not differ much from what the usually available dictionaries look like. Usually a dictionary looks like:

        ABLE            EY B AX L
        ABOUT           AX B AW TD
        ACCEPTANCE      AE K S EH PD T AX N S
        ACCEPT          AE K S EH PD TD
        ACTION          AE K SH AX N
        ...

Where Janus wants:

        {ABLE}          {EY B AX L}
        {ABOUT}         {AX B AW TD}
        {ACCEPTANCE}    {AE K S EH PD T AX N S}
        {ACCEPT}        {AE K S EH PD TD}
        {ACTION}        {AE K SH AX N}
        ...

The curly braces are needed because the dictionary is interpreted by Tcl when it is read. Obviously you wouldn't need the braces around the words in the above example, but words that include special characters must be enclosed in braces. However, you are strongly discouraged from using special characters in the names of words or phonemes; they will almost certainly cause you trouble.

The following one-liner does the conversion of the dictionary for our example:

  cat mappedDict | sed 's/ $//g'  \
  | perl -pe 's/([^ ]*) ([^ \n]*)(.*)/{$1} {{$2 WB} $3}/g' \
  | perl -pe 's/ ([^ }]+)}\n/ {$1 WB}}\n/g' \
  | tr -s ' ' | sed s/GARBAGE/+/g > convertedDict

Remember, it's a one-line-command. We've only split it into four lines for your reading convenience.

This way, we can easily create our Janus-readable dictionary. It's now called "convertedDict" and will be used from now on without further modifications.
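
To see what the converted entries look like, here is a minimal awk sketch of the same conversion on three toy entries. It is an assumption-laden re-implementation, not the one-liner itself: it puts braces around the word and the phoneme list, tags the first and the last phoneme with WB, and turns the GARBAGE placeholder into +. The temporary directory and the variable name converted are made up for illustration.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'ABLE E B L' 'SIL _' '+LIP_SMACK+ GARBAGE' > "$tmp/mappedDict"

converted=$(awk '{
  out = "{" $1 "} {"
  for (i = 2; i <= NF; i++) {
    p = ($i == "GARBAGE") ? "+" : $i            # GARBAGE becomes +
    if (i == 2 || i == NF) p = "{" p " WB}"     # tag first and last phoneme
    out = out (i > 2 ? " " : "") p
  }
  print out "}"
}' "$tmp/mappedDict")
echo "$converted"
rm -r "$tmp"
```

For the first toy entry this prints {ABLE} {{E WB} B {L WB}}. For one-phoneme entries the perl one-liner leaves an extra space before the closing brace, so the two outputs can differ in whitespace.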

Creating a Task Database

Databases are standard objects in Janus. They can be used for anything, but one of the most common usages is a task database, describing the recordings of the task, giving all the needed information about every utterance. In our example, we have a file ../data/transcripts which contains the most essential information about the task's utterances, namely an utterance ID and a transcription.
Other tasks can be organized in a different way, so you'll have to figure out yourself what the best way is to structure your data into a Janus database. In our example the following script can be run in Janus and will create a Janus-style database:

[DBase db] open db.dat db.idx -mode rwc
set fp [open ../data/transcripts r]
while { [gets $fp line] != -1 } {
  set utt [lindex $line 0]
  db add $utt [list [concat text [lrange $line 1 end]] [list utt $utt]]
}
db close
exit
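
Since the script above uses the utterance ID as the database key, it may be worth checking beforehand that no ID occurs twice in the transcripts. The following is a minimal sketch on toy data; the three transcript lines and the variable name dups are invented for illustration, and how the database reacts to a repeated key is not covered here.

```shell
tmp=$(mktemp -d)
# Toy transcripts in which the ID a0101 appears twice.
printf '%s\n' 'a0101 ABLE' 'a0102 WEEK' 'a0101 ABOUT' > "$tmp/transcripts"

# uniq -d prints only those IDs that occur more than once.
dups=$(cut -f1 -d' ' "$tmp/transcripts" | sort | uniq -d)
echo "$dups"
rm -r "$tmp"
```

An empty result means that all utterance IDs are unique.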

Defining a Training Set and a Test Set

There are 120 utterances in our data set. Of course, when training and evaluating a real system, you would want to use cross-validation sets for training and an unseen set for testing, but to keep things simple we will use just two sets: a training set of 120 sentences and a test set of 10 sentences. Doesn't add up to 120? That is because the commands below put all 120 utterances into the training set and pick the 10 test utterances from among them, so the test set is not really unseen. Don't tell anybody :-)

You can use the following csh/tcsh commands

      cut -f1 -d' ' ../data/transcripts > trainIDs
      foreach utt ( 1 3 10 21 31 39 53 64 66 112 )
          head -$utt ../data/transcripts | cut -f1 -d' ' | tail -1 >> testIDs
      end

to create two files, one that contains the utterance IDs of the training set and one that contains the utterance IDs of the test set.
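
If you work in a Bourne-style shell (sh, bash) rather than csh, the same split can be sketched as follows. This is a minimal sketch on generated toy data: the 120 stand-in transcript lines, the temporary directory, and the variable names are made up for illustration. On the real data you would read ../data/transcripts and write trainIDs and testIDs into the current directory.

```shell
tmp=$(mktemp -d)
# Toy stand-in for ../data/transcripts: 120 lines of the form "ID WORD".
awk 'BEGIN { for (i = 1; i <= 120; i++) printf "u%03d WORD\n", i }' > "$tmp/transcripts"

# Training set: every utterance ID.
cut -f1 -d' ' "$tmp/transcripts" > "$tmp/trainIDs"

# Test set: the IDs found on the listed line numbers.
for n in 1 3 10 21 31 39 53 64 66 112; do
  sed -n "${n}p" "$tmp/transcripts" | cut -f1 -d' '
done > "$tmp/testIDs"

n_train=$(grep -c '' "$tmp/trainIDs")
n_test=$(grep -c '' "$tmp/testIDs")
last_test=$(tail -n 1 "$tmp/testIDs")
echo "$n_train training IDs, $n_test test IDs, last test ID $last_test"
rm -r "$tmp"
```

Because the loop output is redirected once with > instead of appended with >>, running the sketch a second time does not duplicate the test IDs.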