Training with Labels
When doing Maximum Likelihood (ML) training of HMMs, you need some kind of mapping from the feature space vectors of the training utterances onto the acoustic models to
which these vectors correspond. There are two ways of getting such a mapping. One is to use a forced alignment procedure such as the Viterbi
algorithm or the forward-backward algorithm. The other is to use labels which define the mapping explicitly. The advantage of a forced alignment
is that you don't need labels. However, the alignment found by Viterbi or forward-backward will be miserable if the initial acoustic models are poor,
and random initialization definitely does produce poor models. Although it is theoretically possible to start with random parameters, it is much
better (faster and more accurate) to start with at least somewhat reasonable parameters. These can be obtained either by using labels or by taking
them from some other existing recognizer. If you have labels from somewhere, it is usually a good idea to use them to compute initial
parameters (i.e. codebooks and mixture weights) with an algorithm like "k-means" or "neural gas". Once you have an initialized system,
you can switch to Viterbi or forward-backward training, expecting that the initialized acoustic models are good enough to find
reasonable alignments (paths) for your utterances.
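
To make the label-based initialization concrete, here is a minimal Python sketch (not tied to any particular toolkit) that collects, for every
acoustic model, the feature vectors which the labels assign to it and runs a plain k-means on them to obtain an initial codebook and uniform
mixture weights. The data layout (a feature matrix plus one model name per frame) and all function names are illustrative assumptions.

    import numpy as np

    def kmeans(vectors, k, iterations=20, seed=0):
        """Plain Lloyd's k-means; returns k codebook vectors (centroids)."""
        rng = np.random.default_rng(seed)
        centroids = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
        for _ in range(iterations):
            # assign every vector to its nearest centroid
            dist = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
            assignment = dist.argmin(axis=1)
            # move each centroid to the mean of the vectors assigned to it
            for j in range(k):
                members = vectors[assignment == j]
                if len(members) > 0:
                    centroids[j] = members.mean(axis=0)
        return centroids

    def init_models(features, labels, codebook_size=16):
        """features: (num_frames, dim) array; labels: one model name per frame."""
        labels = np.asarray(labels)
        codebooks, weights = {}, {}
        for model in np.unique(labels):
            frames = features[labels == model]
            k = min(codebook_size, len(frames))
            codebooks[model] = kmeans(frames, k)
            # start with uniform mixture weights; later ML training refines them
            weights[model] = np.full(k, 1.0 / k)
        return codebooks, weights

After an initialization of this kind, Viterbi or forward-backward training can take over and re-estimate the codebooks and mixture weights
from the alignments it finds itself.
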
If you don't have labels available, there are two ways to proceed. One is to start with random parameters; the other is to use an existing
recognizer for some other task. Either you just pretend the other task's recognizer is good for your new task and simply continue training as if
everything were fine, or you let the other task's recognizer write labels for your new task and continue as described above. Using an existing
recognizer for a new task can cause problems when it comes to phoneme sets, especially when you have to build a recognizer for a new language.
Then you will have to decide yourself, or ask a phonetician, how to map the phonemes of the new language onto phonemes of the old language, such
that you can use the existing recognizer's acoustic models for the new task. Generally this means that you will have to build a pronunciation
lexicon for your new task which uses only phonemes from the existing recognizer. Once you have a system that performs at least somewhat
reasonably on the new task, you can write labels with it and start all over with a completely new recognizer, a completely new architecture, and
a completely new set of phonemes and acoustic models.
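
As a rough illustration of the phoneme-mapping step, the following Python sketch rewrites a pronunciation lexicon so that it only uses phonemes
known to the existing recognizer. The mapping table, the file names, and the lexicon format (one word per line, followed by its phoneme
sequence) are assumptions made for the example; the actual mapping should come from you or a phonetician.

    # new-language phoneme -> closest existing phoneme (assumed examples);
    # one new phoneme may also map to a sequence of old ones
    PHONEME_MAP = {
        "ny": "n",
        "rr": "r",
        "ts": "t s",
    }

    def convert_lexicon(in_path, out_path, phoneme_map):
        """Rewrite a pronunciation lexicon using only the old phoneme set."""
        with open(in_path) as src, open(out_path, "w") as dst:
            for line in src:
                if not line.strip():
                    continue
                word, *phones = line.split()
                mapped = [phoneme_map.get(p, p) for p in phones]
                dst.write(word + " " + " ".join(mapped) + "\n")

    convert_lexicon("lexicon.new", "lexicon.mapped", PHONEME_MAP)
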