for each utterance
    1) load and preprocess the utterance's features
    2) get an alignment path from somewhere
    3) whatever has to be trained, let it accumulate the necessary training information
whatever has to be trained, let it update its parameters according to the accumulated data

Here step 2) can either run a Viterbi or a forward-backward alignment, or load an already aligned path from a file, which we call a labels-file.
Usually, training along labels is much faster than computing a forced alignment for every utterance.
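
The loop described above, including the choice between loading a labels-file and computing an alignment, can be sketched in Python roughly as follows. This is only an illustration and not the toolkit's own API: the names MeanAccumulator, get_path and train_iteration, the JSON file layout, and the aligner passed in as `align` are all assumptions made for this sketch. The toy trainable just estimates one mean per state, and step 1) is assumed to have happened already, so the features arrive preprocessed.

```python
import json
import os
from typing import Callable, Dict, List


class MeanAccumulator:
    """Hypothetical toy trainable: one mean per state, estimated from aligned frames."""

    def __init__(self) -> None:
        self.sums: Dict[int, float] = {}
        self.counts: Dict[int, int] = {}
        self.means: Dict[int, float] = {}

    def accumulate(self, features: List[float], path: List[int]) -> None:
        # Step 3: collect sufficient statistics along the alignment path.
        for value, state in zip(features, path):
            self.sums[state] = self.sums.get(state, 0.0) + value
            self.counts[state] = self.counts.get(state, 0) + 1

    def update(self) -> None:
        # Re-estimate parameters from the accumulated data, then reset.
        self.means = {s: self.sums[s] / self.counts[s] for s in self.sums}
        self.sums.clear()
        self.counts.clear()


def get_path(utt_id: str, features: List[float], label_dir: str,
             align: Callable[[List[float]], List[int]]) -> List[int]:
    """Step 2: load the path from a labels-file if present, else align and store it."""
    label_file = os.path.join(label_dir, utt_id + ".json")  # assumed file layout
    if os.path.exists(label_file):
        with open(label_file) as f:
            return json.load(f)
    path = align(features)  # e.g. a Viterbi alignment; supplied by the caller
    with open(label_file, "w") as f:
        json.dump(path, f)
    return path


def train_iteration(utterances: Dict[str, List[float]], model: MeanAccumulator,
                    label_dir: str,
                    align: Callable[[List[float]], List[int]]) -> None:
    """One pass over all utterances, followed by a single parameter update."""
    for utt_id, features in utterances.items():
        path = get_path(utt_id, features, label_dir, align)
        model.accumulate(features, path)
    model.update()
```

In this sketch the aligner is called only for utterances that have no labels-file yet, so later iterations over the same data skip the alignment step entirely, which is where the speed-up comes from.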