The task database is a collection of files (waveforms, labels, transcriptions, dictionaries, etc.) that together describe a task. This is what can be found in the file "data.tar.gz".
The other meaning of database refers to the Janus object DBase, which can be used to store any kind of string-indexed information. It is generally used to store the utterance descriptions, such as speaker, gender, location, transcription, etc.
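For illustration only, the kind of string-indexed utterance description that typically ends up in such a database might look like the following plain Tcl sketch; the keys and field names are made up for the example, and how records are actually written into and read from a DBase object depends on your Janus version, so check the DBase documentation for the real method names.

# illustration of string-indexed utterance descriptions (keys and field names are made up);
# how such records are stored in a DBase object is not shown here
array set uttInfo {
    ls001-utt0042 {SPEAKER ls001 GENDER f TEXT {HELLO WORLD}}
    ls002-utt0007 {SPEAKER ls002 GENDER m TEXT {GOOD MORNING}}
}
puts [array get uttInfo ls001-utt0042]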
The following is a list of things that you might have to do with your data before you can use them with Janus.
Some languages use special characters (Chinese, Japanese, Arabic, Klingon, etc.) that are difficult to handle with traditional software. Although there are standards for encoding and displaying them, these standards are not widely used and make life a lot more complicated than necessary. Since virtually all software (and even hardware) is made to be used with the Roman alphabet (possibly with minor language-dependent modifications), it is often much easier to translate the language-specific characters into something that uses Roman characters (easily encodable in ASCII) than to try to work with 16-bit characters and non-standard software. For most languages there already exists some kind of romanization (romaji for Japanese, pinyin for Chinese, etc.).
So, before starting to work with Janus, translate all your data, the dictionary and the transcriptions into Roman characters.
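As an illustration, here is a minimal Tcl sketch of such a transliteration using a simple character-to-Roman mapping table. The mapping shown (a few German umlauts) and the file names are only placeholders; replace them with whatever your language and data require.

# toy transliteration table; replace with the mapping for your language
set romanMap {ä ae ö oe ü ue ß ss}

# read a transcription file, map every special character, write the result
set in  [open transcripts.orig r]
set out [open transcripts.roman w]
while {[gets $in line] >= 0} {
    puts $out [string map $romanMap $line]
}
close $in
close $out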
Often the transcriptions of your recordings are things like newspaper articles. It would be too much to ask of Janus to cope with their punctuation, line breaks, and other contamination such as headlines, abbreviations, etc.
So, before really using your data, make sure that all the transcriptions are in a format free of punctuation, such that you have one or more files that contain exactly one line of space-separated words per spoken utterance. Make sure that every word occurs in the dictionary in exactly the same spelling (case-sensitive).
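A minimal Tcl sketch of such a cleanup step follows, assuming the raw transcriptions are already one utterance per line and only need punctuation stripped; the character set to remove, the upper-casing, and the file names are assumptions you may have to adapt.

# strip punctuation and normalize whitespace, one utterance per line
set in  [open transcripts.raw r]
set out [open transcripts.clean w]
while {[gets $in line] >= 0} {
    regsub -all {[.,;:!?"()]} $line " " line   ;# remove punctuation (adapt this set)
    regsub -all {[ \t]+} $line " " line        ;# collapse whitespace
    puts $out [string toupper [string trim $line]]
}
close $in
close $out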
The format of the Janus dictionary is such that every line of the file that does not start with a comment character can be used as an argument to the add method of the Dictionary object class. So every line must contain two Tcl items: first the name of a word, then a list containing optionally tagged phonemes. It is recommended to use braces around the word to allow some special characters to occur in it. Nevertheless, you are discouraged from using special characters wherever they are not necessary. An optionally tagged phoneme is either just the name of a phoneme, or a list containing the name of the phoneme followed by one or more tag names. Tags are used to identify things like word boundaries, syllable boundaries, or stress. The following line could be one from a Janus dictionary:
{HELLO} {{H WB} {E ST SB} L {OW WE}}
The word is named HELLO and is built of 4 phonemes, H E L OW; the H is tagged with the tag WB, the E has two tags, namely ST and SB, the L has no tag, and the OW has the tag WE.
It is not necessary to use tags at all. The most common usage is a single tag, namely WB, to indicate word boundaries on the first and the last phoneme of a word. Since dictionaries can come in many different formats, a general script can be written that converts any given format into a Janus-usable one. Have a look at this topic's doItYourself page for a few examples.
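As one example of such a conversion, here is a small Tcl sketch that turns a plain "WORD P1 P2 ... Pn" file into the Janus format shown above, adding the WB tag to the first and last phoneme of every word. The input format, the tag name, and the file names are assumptions; adapt them to your source dictionary.

# convert "WORD P1 P2 ... Pn" lines into Janus dictionary lines with WB tags
set in  [open dict.plain r]
set out [open dict.janus w]
while {[gets $in line] >= 0} {
    set word   [lindex $line 0]
    set phones [lrange $line 1 end]
    set n [llength $phones]
    set tagged {}
    for {set i 0} {$i < $n} {incr i} {
        set p [lindex $phones $i]
        if {$i == 0 || $i == $n - 1} {
            lappend tagged [list $p WB]        ;# tag word-boundary phonemes
        } else {
            lappend tagged $p
        }
    }
    puts $out "{$word} {$tagged}"
}
close $in
close $out

For the input line "HELLO H E L OW" this would produce the line {HELLO} {{H WB} E L {OW WB}}.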
If you don't have a dictionary and you can't find one for your language anywhere (which should be unlikely), there is no other way than to write one manually. For some languages the orthography is very closely based on the actually spoken phonemes, or it is easily possible to define rules that will produce phoneme sequences from grapheme sequences automatically (e.g. for many Slavic languages).
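Where such grapheme-to-phoneme rules exist, they can often be written as a short ordered list of substitutions. The Tcl sketch below illustrates the idea; the rules shown are invented placeholders, not a real rule set for any language.

# toy grapheme-to-phoneme rules: ordered "grapheme pattern -> phoneme" pairs
# (the rules below are invented placeholders, not a real rule set)
set g2pRules {sch SH ei AY a A b B d D e E l L n N}

proc graphemesToPhones {word rules} {
    set word [string tolower $word]
    set phones {}
    while {[string length $word] > 0} {
        set matched 0
        foreach {pat ph} $rules {
            if {[string match ${pat}* $word]} {
                lappend phones $ph
                set word [string range $word [string length $pat] end]
                set matched 1
                break
            }
        }
        if {!$matched} {
            # no rule matched: drop the character and warn
            puts stderr "no rule for [string index $word 0] in $word"
            set word [string range $word 1 end]
        }
    }
    return $phones
}

puts [graphemesToPhones LADEN $g2pRules]   ;# prints: L A D E N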
Often your dictionary will not be complete; some words that you would like to use are not in it. The standard approach is then to load the dictionary into a text editor and manually add the missing words. While doing this you will prefer to take pronunciations from existing words that have some parts in common with the new word. To model the word BEAUTIFUL you would have a look at the word BEAUTY and the word HELPFUL, then take the transcription of BEAUTY and append the final three phones of HELPFUL.
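For illustration, assuming the (made-up) existing entries {BEAUTY} {B Y UW T IY} and {HELPFUL} {H EH L P F AX L}, the new entry built this way could look like this:

{BEAUTIFUL} {B Y UW T IY F AX L}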
Sometimes a word already exists in the dictionary, but it might be pronounced differently from what is found in there. In this case you can add a new pronunciation variant, which is marked by appending a parenthesized integer to the word's name. This can look like this:
{TOMATO} {T OW M EY D OW}
{TOMATO(2)} {T OW M AA T OW}
Please make sure that the variant pronunciation of a word does not come before the baseform, because Janus reads the dictionary in one pass and needs to see the baseform before adding a variant.
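A small Tcl sketch of a sanity check for this ordering follows; it assumes variants are marked exactly as shown above with a parenthesized number appended to the word name, that the dictionary file is called dict.janus, and that ";" is the comment character (adapt these assumptions to your setup).

# check that every variant, e.g. TOMATO(2), appears after its baseform TOMATO
set in [open dict.janus r]
array set seen {}
set lineNo 0
while {[gets $in line] >= 0} {
    incr lineNo
    if {$line eq "" || [string index $line 0] eq ";"} { continue }   ;# skip comments/empty lines
    set word [lindex $line 0]
    if {[regexp {^(.*)\(\d+\)$} $word -> base]} {
        if {![info exists seen($base)]} {
            puts stderr "line $lineNo: variant $word appears before baseform $base"
        }
    }
    set seen($word) 1
}
close $in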
Janus can understand many different formats for speech recordings. But there are still some that Janus does not understand. In these cases you need some program that can convert audio files (something like audioconvert on SUNs or the SOX package, which is available for various platforms).
The most common format used by Janus is simple headerless linear encoding. In this format the sample values (which could be 8 or 16 bits wide) simply follow each other in the file.
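For illustration, this Tcl sketch reads such a headerless file, assuming 16-bit signed little-endian samples; the sample width, byte order, and file name are assumptions you have to know about your own recordings.

# read a headerless file of 16-bit little-endian samples into a Tcl list
set f [open speech.raw r]
fconfigure $f -translation binary
set raw [read $f]
close $f
binary scan $raw s* samples       ;# "s" = 16-bit little-endian signed integers
puts "[llength $samples] samples read"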
If you have a large database, you might want to save some disk space by compressing the files. One of the best compression methods is the "shorten" algorithm (developed by Tony Robinson). Janus is able to read files that were compressed with this algorithm (up to version 1). So if you have a shorten compressor available, use it in version-1 mode to compress the audio files. This should reduce the needed disk space by more than 50%.
Unless someone else has already defined a split, you will have to divide all your recordings into at least two, preferably three or more parts. The largest part will be the "training data", used for computing the recognizer's acoustic parameters. A second part will be the "test data". It is mandatory that the test and training data are completely disjoint; you should even take care that the speakers of the training set and the speakers of the test set are disjoint. The test set is also often called the "evaluation test set", because it is used to evaluate the performance of the recognizer.
While training a recognizer, you will often have to make decisions based on performance measures, such as whether you want to continue training or stop at a stage where performance seems optimal, or which language model parameters perform best. If you make many such decisions based on the performance on the same test set, then your recognizer will do particularly well on this test set, and it might perform significantly worse on a different test set. We call this "tuning" the recognizer to a test set. When reporting the performance of a recognizer, it doesn't make much sense to report the accuracy on a tuned test set. Therefore it is recommended to use a different set for tuning than for evaluation. So, besides the evaluation test set, you should use a "development test set", sometimes also called a "cross-validation set". Then you can tune the recognizer on this set and base all your decisions on the performance on this set, but report official results only on the evaluation test set, which has never been touched during the development of the recognizer.
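A minimal Tcl sketch of such a speaker-disjoint split follows. It assumes a file named utt2spk that lists one utterance ID and its speaker ID per line; the 80/10/10 proportions and the output file names are just placeholders.

# split utterances into train / dev / eval sets with disjoint speakers
set in [open utt2spk r]          ;# each line: uttID speakerID
array set bySpk {}
while {[gets $in line] >= 0} {
    lappend bySpk([lindex $line 1]) [lindex $line 0]
}
close $in

set speakers [lsort [array names bySpk]]
set n      [llength $speakers]
set nTrain [expr {int($n * 0.8)}]
set nDev   [expr {int($n * 0.1)}]

set trainSpk [lrange $speakers 0 [expr {$nTrain - 1}]]
set devSpk   [lrange $speakers $nTrain [expr {$nTrain + $nDev - 1}]]
set evalSpk  [lrange $speakers [expr {$nTrain + $nDev}] end]

foreach {name spkList} [list train.ids $trainSpk dev.ids $devSpk eval.ids $evalSpk] {
    set out [open $name w]
    foreach spk $spkList {
        foreach utt $bySpk($spk) { puts $out $utt }
    }
    close $out
}

Splitting by speaker rather than by utterance guarantees that no speaker ends up in more than one of the three sets.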