Training with Labels

Maximum Likelihood (ML) training of HMMs requires a mapping from the feature vectors of each training utterance onto the acoustic models to which these vectors correspond. There are two ways of getting such a mapping. One is to use a forced-alignment procedure such as the Viterbi algorithm or the forward-backward algorithm. The other is to use labels, which define the mapping explicitly. The advantage of forced alignment is that you don't need labels. However, the alignment found by Viterbi or forward-backward is poor if the initial acoustic models are poor, and random initialization certainly produces poor models. Although it is theoretically possible to start with random parameters, it is much better (faster and more accurate) to start with at least roughly reasonable parameters. These can be obtained either by using labels or by taking them from some other existing recognizer. If you have labels from somewhere, it's usually a good idea to use them to compute initial parameters (i.e. codebooks and mixture weights) with an algorithm like "k-means" or "neural gas". Once the system is initialized, you can switch to Viterbi or forward-backward training, expecting that the initialized acoustic models are good enough to find reasonable alignments (paths) for your utterances.
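
To make the label-based initialization concrete, here is a minimal sketch in plain Python. It assumes that the labels assign exactly one acoustic model name to every frame, and it uses a simple k-means to derive a codebook (the cluster centroids) and mixture weights (the relative cluster sizes) per model. The function names kmeans and init_from_labels and the toy data are hypothetical and not part of any toolkit; a real system would work on far more data and typically also estimate covariances.

```python
import random


def kmeans(vectors, k, iterations=10, seed=0):
    """Plain k-means. Returns (centroids, weights) for one acoustic model.

    centroids: k mean vectors (the initial codebook)
    weights:   fraction of frames assigned to each centroid in the last pass
    """
    rng = random.Random(seed)
    # Start from k distinct training frames (an assumption of this sketch).
    centroids = [list(v) for v in rng.sample(vectors, k)]
    dim = len(vectors[0])
    counts = [0] * k
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for v in vectors:
            # Assign the frame to the nearest centroid (squared Euclidean distance).
            best = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(v, centroids[j])),
            )
            counts[best] += 1
            for d, a in enumerate(v):
                sums[best][d] += a
        # Move each non-empty centroid to the mean of its assigned frames.
        for j in range(k):
            if counts[j] > 0:
                centroids[j] = [s / counts[j] for s in sums[j]]
    total = sum(counts)
    weights = [c / total for c in counts]
    return centroids, weights


def init_from_labels(frames, labels, k):
    """Group frames by their label and run k-means separately per model."""
    by_model = {}
    for frame, label in zip(frames, labels):
        by_model.setdefault(label, []).append(frame)
    # Never ask for more centroids than there are frames for a model.
    return {m: kmeans(vs, min(k, len(vs))) for m, vs in by_model.items()}


# Toy example: six 2-dimensional frames labeled with two hypothetical models.
frames = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [9.0, 1.0], [9.1, 1.2]]
labels = ["A", "A", "A", "A", "B", "B"]
params = init_from_labels(frames, labels, k=2)
for model, (codebook, mixture_weights) in sorted(params.items()):
    print(model, codebook, mixture_weights)
```

The parameters obtained this way only serve as a starting point; the subsequent Viterbi or forward-backward training is what refines them.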

If you don't have labels available, there are two ways to proceed. One is to start with random parameters; the other is to use an existing recognizer built for some other task. In the latter case, you can either pretend that the other task's recognizer is good for your new task and simply continue training as if everything were fine, or let the other task's recognizer write labels for your new task and continue as described above. Using an existing recognizer for a new task can cause problems with phoneme sets, especially when you have to build a recognizer for a new language. You will then have to decide yourself, or ask a phonetician, how to map the phonemes of the new language onto phonemes of the old language, so that you can use the existing recognizer's acoustic models for the new task. Generally this means that you will have to build a pronunciation lexicon for your new task which uses only phonemes from the existing recognizer. Once you have a system that performs at least somewhat reasonably on the new task, you can write labels with it and start all over with a completely new recognizer, a completely new architecture, and a completely new set of phonemes and acoustic models.
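
The phoneme mapping described above can be sketched as a simple lexicon rewrite. The following Python snippet is only an illustration: the function name map_lexicon, the phoneme symbols, and the mapping itself are invented for this example, and a real mapping has to come from you or from a phonetician. Each phoneme of the new language maps to a sequence of one or more phonemes of the existing recognizer, and unmapped phonemes are reported rather than silently dropped.

```python
def map_lexicon(lexicon, phone_map):
    """Rewrite every pronunciation so it uses only the old recognizer's phonemes.

    lexicon:   dict word -> list of phonemes of the new language
    phone_map: dict new phoneme -> list of phonemes of the existing recognizer
    """
    mapped = {}
    for word, phones in lexicon.items():
        new_phones = []
        for p in phones:
            if p not in phone_map:
                # Fail loudly: a missing mapping must be decided by a human.
                raise KeyError(f"no mapping for phoneme {p!r} in word {word!r}")
            new_phones.extend(phone_map[p])
        mapped[word] = new_phones
    return mapped


# Hypothetical symbols: "ts" has no direct counterpart and is split into two phonemes.
phone_map = {"a": ["AA"], "ts": ["T", "S"], "n": ["N"]}
new_lexicon = {"tsan": ["ts", "a", "n"], "na": ["n", "a"]}
print(map_lexicon(new_lexicon, phone_map))
```

The resulting lexicon can then be used with the existing recognizer's acoustic models to align the new data and write labels.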