Preparing the Data

First, we prepare our data so that they become "Janus-readable". Finally, we create the most basic architecture descriptions.

Change to the directory prepare. If it doesn't exist, create it next to the data directory. Initially this directory is empty. The rest of this page documents how to create the files that should eventually be there.

Reading the Recordings

Janus can usually understand almost any format of raw data, so there is no need to manipulate the recordings. All we need is a suitable description file that tells Janus how to interpret the recordings. In our case such a "feature description file" could look like this:

   $fes readADC   ADC   $arg(adc) -bm shorten -hm 1024 -offset mean
   $fes adc2mel   MSC   ADC     16ms

It has two lines. The first line tells the feature module how to read the recording files. It defines a feature named "ADC", which is filled by the "readADC" command using the arguments that follow in the rest of the line. In the second line we specify how to preprocess the data. In this example the preprocessing command is "adc2mel", which computes mel-scale coefficients (16 by default), with each frame covering 16 ms on the time axis. Since this tutorial is not intended to teach you preprocessing, you should read the documentation of the feature module to find out what kind of preprocessing is possible, and consult a book for details about the theory.
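To get a feeling for what "16 ms" means in terms of samples, here is a small arithmetic sketch in Python. It is not part of Janus; the function name samples_per_frame is made up for illustration, and the 16 kHz sampling rate is an assumption about the recordings, so substitute the actual rate of your data.

```python
def samples_per_frame(sample_rate_hz: int, frame_ms: int) -> int:
    """Number of samples covered by one frame of the given duration."""
    # Integer arithmetic avoids floating-point rounding for typical rates.
    return sample_rate_hz * frame_ms // 1000


# Assumption: recordings sampled at 16 kHz (check your own data).
print(samples_per_frame(16000, 16))
```

Under that assumption, one 16 ms frame spans 256 samples.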

You can enter the following commands in Janus to see what the preprocessing does:

% [FeatureSet fs] setDesc @featdesc 
% fs eval { { adc ../data/waves/014c0211.wv1 } }
% fs show ADC

A window will then pop up and display the waveform. Using the controls of the feature display tool, you can also select the MSC feature there and have a look at the mel spectral coefficients.

Creating a Janus-Readable Dictionary

Janus expects a special format for its dictionary. Fortunately, this format differs little from that of commonly available dictionaries. A typical dictionary looks like:

	ABLE 		E Y P A L
	ABOUT 		A P A T
	ACCEPTANCE 	E K S E P T A N S
	ACCEPT 		E K S E P T
	ACTION 		E K S A N
	...

Janus, however, wants:

	{ABLE} 		{EY B AX L}
	{ABOUT} 	{AX B AW TD}
	{ACCEPTANCE} 	{AE K S EH PD T AX N S}
	{ACCEPT} 	{AE K S EH PD TD}
	{ACTION} 	{AE K SH AX N}
	...

The curly braces are Tcl syntax: the dictionary is interpreted by Tcl when it is read. Obviously you wouldn't need the braces around the words in the above example, but you can imagine words that include special characters, and those must be wrapped in braces. However, you are strongly discouraged from using special characters in the names of words or phonemes; they will almost certainly cause you trouble.

The following one-liner converts the dictionary for our example:

	cat ../data/dict | perl -pe 's/^([^ ]*) (.*)/{$1} {$2}/g' > dict

This way, we can easily create our Janus-readable dictionary.
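If you would like to see the transformation spelled out, the same conversion can be sketched in Python. This is only an illustration, not part of the Janus distribution; the function name to_janus_entry is hypothetical, and, like the Perl one-liner, it assumes that a single space separates the word from its pronunciation.

```python
import re

# Same pattern as the Perl one-liner: everything up to the first space is
# the word, everything after it (up to the end of the line) is the
# pronunciation.
_ENTRY = re.compile(r"^([^ ]*) (.*)")


def to_janus_entry(line: str) -> str:
    """Wrap the word and its pronunciation in Tcl braces.

    Lines that contain no space are returned unchanged.
    """
    return _ENTRY.sub(r"{\1} {\2}", line, count=1)


# Example entry taken from the Janus-style listing above.
print(to_janus_entry("ABLE EY B AX L\n"), end="")
```

Because "." does not match the newline, the line ending is preserved, so the function can be applied line by line to the whole dictionary file.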

Creating a Task Database

Databases are standard objects in Janus. They can be used for anything, but one of the most common uses is a task database, which describes the recordings of the task and gives all the needed information about every utterance. In our example, we have a file ../data/transcripts which contains the most essential information about the task's utterances, namely an utterance ID and a transcription. In the WSJ task the utterance ID consists of eight characters SSSTRRUU with the following meanings:

SSS three-character speaker ID

T type of utterance (a=adaptation, c=regular, o=verbalized punctuation)

RR ID of the speaker's recording session

UU ID of the utterance within the recording session

Other tasks may be organized differently, so you will have to figure out for yourself how best to structure your data into a Janus database. In our example the following line creates a script that can produce a Janus database (the line is split in two for your reading convenience, but should be entered as a whole):

  cat ../data/transcripts | perl -pe 's/(...)(.)(..)(..)\t(.*)/
	      dbase add $1$2$3$4 {{text $5} {s $1} {t $2} {r $3} {u $4}}/g' > dbase.src
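To make clear what the substitution produces for a single transcript line, here is a minimal Python sketch of the same mapping. It is an illustration only: the function name to_dbase_add is hypothetical, the transcription text in the example is made up, and the sketch assumes that each line consists of an eight-character utterance ID, a tab, and the transcription.

```python
import re

# Utterance ID layout SSSTRRUU, followed by a tab and the transcription.
_LINE = re.compile(r"(...)(.)(..)(..)\t(.*)")


def to_dbase_add(line: str) -> str:
    """Turn one transcript line into a 'dbase add' command.

    The trailing newline is dropped; lines that do not match the
    expected layout are returned unchanged (minus the newline).
    """
    line = line.rstrip("\n")
    m = _LINE.match(line)
    if m is None:
        return line
    s, t, r, u, text = m.groups()
    return "dbase add %s%s%s%s {{text %s} {s %s} {t %s} {r %s} {u %s}}" % (
        s, t, r, u, text, s, t, r, u)


# The transcription "HELLO WORLD" is a made-up placeholder.
print(to_dbase_add("40mc020p\tHELLO WORLD\n"))
```

Each resulting command stores the transcription together with the speaker, type, session and utterance fields under the full utterance ID as key.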

The following lines, entered into Janus, will create the Janus-readable database files: a file named dbase.dat, which contains the data of the database, and a file named dbase.idx, which contains the information needed for random access to the data.

  % DBase dbase
  % dbase configure -hashSizeX 8
  % dbase open dbase.dat dbase.idx -mode rwc
  % source dbase.src 
  % dbase close

After that you can restart Janus and have a look at the dbase with:

  % DBase dbase
  % dbase open dbase.dat dbase.idx -mode r
  % dbase get 40mc020p

Refer to the documentation of the database module for further details about how to create and use databases.

Defining a Training Set and a Test Set

You know that you should use a cross-validation set for developing a recognizer and report the results on a test set that has never been seen before. But this is not a tutorial on how to do proper research, only on how to use Janus. So we will split our 126 sentences into just two chunks: a training set of 100 and a test set of 26 sentences. We run:

	cut -f1 ../data/transcripts | head -n 100 > trainIDs
	cut -f1 ../data/transcripts | tail -n +101 > testIDs

to create two files: one that contains the utterance IDs of the training set, and one that contains the utterance IDs of the test set.
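The same split can be sketched in Python, which may be convenient if you later want a different partition. This is an illustration under stated assumptions: the function name split_ids is hypothetical, and the utterance ID is assumed to be the first tab-separated field of each transcript line, as in the cut commands.

```python
def split_ids(transcript_lines, n_train=100):
    """Return (train_ids, test_ids) from tab-separated transcript lines.

    The first n_train utterance IDs form the training set; all remaining
    IDs form the test set.
    """
    # Keep only the first tab-separated field, like 'cut -f1'.
    ids = [line.rstrip("\n").split("\t", 1)[0] for line in transcript_lines]
    return ids[:n_train], ids[n_train:]


if __name__ == "__main__":
    import os

    path = "../data/transcripts"  # location used throughout this page
    if os.path.exists(path):
        with open(path) as f:
            train, test = split_ids(f)
        print(len(train), "training IDs,", len(test), "test IDs")
```

To write the two ID lists to trainIDs and testIDs, print one ID per line into each file.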

