What Does Janus Need

Remember that this page is just an overview. It does not give any details about what to do or how to do it. It is only there to help you get oriented and to introduce some frequently used names of various items in Janus.

Mandatory Stuff

To train a speech recognizer, you need some speech recordings, often called "wave-files" or "ADCs" (Analog-Digital-Converted). You also need to know what was said in these recordings; we call this information the "transcriptions." Currently Janus cannot be trained without a pronunciation lexicon, so you need that too; we call it the "dictionary." If any of these three items is missing, you can't use Janus to build a recognizer.

Helpful Stuff

Besides the mandatory ADCs, transcriptions, and dictionary, it can be very helpful to have "labels", which describe in detail which part of the spoken utterance was spoken at what time. For recognition, it's good to have a "language model" or at least enough example text to build one.

Janus-Readability

Before you can work with Janus, you might have to convert some of the files from your database into something that is Janus-readable. This usually means a bit of text processing with emacs, awk, sed, or other tools.

ADCs

The ADCs can usually be left unchanged. Only if the ADCs' storage consumption becomes a problem might you consider keeping them in a compressed format or storing only their preprocessed counterparts. But if you do store preprocessed recordings, it could become difficult to experiment with different kinds of preprocessing.

Dictionaries

Dictionaries are available for most common languages (English is covered very well), so if you didn't get a dictionary, you may be able to find one somewhere on the internet. The Janus dictionary format is a bit different from what publicly available dictionaries usually look like, but it shouldn't take more than a simple one-liner to make them Janus-readable. You should also consider removing any special characters (especially those that mean something to Tcl, like spaces, braces, quotes, etc.) to avoid problems later.
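As a sketch of such a conversion, the snippet below turns a CMUdict-style entry into a braced Janus-style entry. The target format `{WORD} {P1 P2 ...}` is an assumption here; check your Janus version's documentation for the exact layout it expects.

```python
import re

def cmudict_to_janus(line):
    """Convert one CMUdict-style entry to a braced Janus-style entry.

    The output format {WORD} {P1 P2 ...} is an assumption -- verify it
    against the dictionary format your Janus version actually reads.
    """
    word, *phones = line.split()
    word = re.sub(r"\(\d+\)$", "", word)               # drop variant markers like WORD(2)
    phones = [re.sub(r"\d$", "", p) for p in phones]   # strip stress digits: AH0 -> AH
    return "{%s} {%s}" % (word, " ".join(phones))

print(cmudict_to_janus("HELLO  HH AH0 L OW1"))  # {HELLO} {HH AH L OW}
```

The same transformation could of course be done with a short awk or sed command instead.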

Transcriptions

The transcriptions could contain information that is not needed by the recognizer, or that should be ignored. Although it is possible to text-process the transcriptions during training, you will probably prefer transcriptions that are easy to read for both humans and computers. Life is much easier if, for example, the capitalization of the words in the transcriptions matches that of the dictionary, and if the transcriptions don't contain punctuation that is not used by the recognizer. You can also save a lot of time later if you take the time now to remove (or transform) all characters that are special to Tcl shells (like quotes, braces, etc.). Spaces should be used to separate words.
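A minimal normalization pass along these lines might look as follows. The specific cleanup rules (uppercasing, which characters to drop) are hypothetical; adapt them to your database's conventions and your dictionary's capitalization.

```python
import re

def clean_transcription(text):
    """Normalize a raw transcription line for training.

    These cleanup rules are only an example -- adjust them to match
    your own database and dictionary conventions.
    """
    text = text.upper()                         # match the dictionary's capitalization
    text = re.sub(r'[{}\[\]"$\\;]', " ", text)  # drop characters special to Tcl
    text = re.sub(r"[.,!?:]", " ", text)        # drop punctuation the recognizer ignores
    return " ".join(text.split())               # single spaces between words

print(clean_transcription('  "Hello, world!" {noise} '))  # HELLO WORLD NOISE
```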

Labels

Labels can come in many different formats, or they might not come at all. If you don't have an existing recognizer for the language you want to develop for, or if your existing recognizers' phonesets differ too much from the phoneset you would like to use, then you should try to use the given labels, even if they are not in Janus' binary format. It is possible to read any kind of labels and have Janus train along them. If you don't have labels, you should use an existing recognizer to write some. If even this isn't an option for you, starting with random weights and without labels is your last chance, but it can take a long time until it really works.

Language Models

A language model describes the statistics of the word sequences in your task's domain. As with any statistics, it is always better to have more data to estimate it from. So far, Janus can only work with unigrams, bigrams, and trigrams. If you didn't get a language model at all, you should build one yourself; there are tools available that can compute language models from text data. The text data not only can, but should, be much larger than the data used for acoustic training. If you did get a language model, you should convert it into NIST format, which is the preferred format of Janus.
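At its core, what those tools compute from text data is n-gram counts. The toy sketch below counts bigrams with sentence-boundary markers; real language-model toolkits additionally handle smoothing and back-off, which this example omits.

```python
from collections import Counter

def bigram_counts(sentences):
    """Count bigrams over sentences, with <s>/</s> boundary markers.

    A toy illustration of the raw statistics behind a bigram language
    model; real toolkits also apply smoothing and back-off.
    """
    counts = Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        counts.update(zip(words, words[1:]))
    return counts

counts = bigram_counts(["good morning", "good evening"])
print(counts[("<s>", "good")])     # 2
print(counts[("good", "morning")]) # 1
```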

Description Files

Once you have all of the needed files, you can start developing a recognizer. Janus will need architecture description files which describe the entire recognizer. If you don't want to experiment with different types of architectures yourself, you can simply use the default scripts, and Janus will create its description files itself.

Weights

Besides architecture description files, Janus will produce parameter files. We call them "weight-files", a name that comes from neural networks' weights. Of course, the same architecture can be run with many different weights. The most common weight files are "codebooks" and "distributions", which are used for Gaussian mixtures. As Janus steadily grows, other weight files will appear, too.

Feature Description

One thing that you have to write manually, or, if you are lucky, just select from a predefined library, is a "feature-description". This is a Tcl script which defines the preprocessing of your recordings. Since there are many different recording formats (8-bit, 16-bit, compressed, already preprocessed somehow, little or big endian, etc.), and many different kinds of features that a recognizer can be trained on, you'll have to define this at some point. The feature module is well documented, so this task shouldn't be too difficult.

Phoneme Subsets

Eventually, you will probably want to use context-dependent acoustic models. There are different ways to use them, but the preferred way is to do divisive top-down clustering to build decision trees. The decision trees use questions about phonemes belonging to some subset of the entire phoneset (e.g. "is phone x a vowel" or "is phone x a nasal", etc.). For English or German, many such subset definitions are available. If you don't have any for the language you are working on, or if your desired phoneset differs greatly from the standard phonesets, you should consider defining such subsets yourself.
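Conceptually, a subset definition is just a named set of phones, and a decision-tree question is a membership test against it. The sketch below illustrates the idea; the subset names and phone symbols are hypothetical and depend entirely on your phoneset.

```python
# Hypothetical phoneme subsets for decision-tree questions; the actual
# subset names and phone symbols depend on your own phoneset.
SUBSETS = {
    "VOWEL": {"AA", "AE", "AH", "EH", "IH", "IY", "OW", "UW"},
    "NASAL": {"M", "N", "NG"},
}

def answer(question, phone):
    """Answer a question such as "is phone x a nasal?" by set membership."""
    return phone in SUBSETS[question]

print(answer("NASAL", "M"))   # True
print(answer("VOWEL", "M"))   # False
```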