Besides using weights that have been trained by some other system, the most popular way to initialize codebooks is the k-means algorithm. Whenever we change to a feature space we have not used before, we have to find new reference vectors for our codebooks. This is always the case, for example, when we compute a new LDA matrix. The k-means algorithm needs a large number of example vectors, which it then clusters into fewer vectors, namely as many as we want for our codebooks. Since both the set of reference vectors of a codebook and the set of example vectors can be regarded as matrices, the k-means operation is a matrix-to-matrix operation: the source matrix holds the sample vectors and the destination matrix holds the reference vectors.
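To make this matrix-to-matrix picture concrete, here is a minimal sketch of a single k-means pass in plain Tcl, independent of any Janus objects; the actual clustering is later done for us by the neuralGas method:

  # One k-means pass: assign every sample vector to its nearest reference
  # vector, then replace each reference vector by the mean of its samples.
  proc sqDist {a b} {
      set d 0.0
      foreach x $a y $b { set d [expr {$d + ($x - $y) * ($x - $y)}] }
      return $d
  }
  proc kmeansPass {samples refs} {
      set refN [llength $refs]
      set dim  [llength [lindex $refs 0]]
      for {set i 0} {$i < $refN} {incr i} { set sum($i) [lrepeat $dim 0.0] ; set cnt($i) 0 }
      foreach s $samples {
          # assignment step: find the nearest reference vector
          set best 0 ; set bestD [sqDist $s [lindex $refs 0]]
          for {set i 1} {$i < $refN} {incr i} {
              set d [sqDist $s [lindex $refs $i]]
              if {$d < $bestD} { set bestD $d ; set best $i }
          }
          set acc {}
          foreach x $sum($best) y $s { lappend acc [expr {$x + $y}] }
          set sum($best) $acc ; incr cnt($best)
      }
      # update step: each reference vector becomes the mean of its samples
      set newRefs {}
      for {set i 0} {$i < $refN} {incr i} {
          if {$cnt($i) == 0} { lappend newRefs [lindex $refs $i] ; continue }
          set mean {}
          foreach x $sum($i) { lappend mean [expr {$x / $cnt($i)}] }
          lappend newRefs $mean
      }
      return $newRefs
  }
  # e.g. cluster six 2-dimensional samples into two reference vectors
  set samples {{0.0 0.1} {0.2 0.0} {0.1 0.2} {5.0 5.1} {4.9 5.0} {5.2 4.8}}
  puts [kmeansPass $samples {{0.0 0.0} {5.0 5.0}}]

Repeating such passes until the reference vectors stop moving yields the initial codebook.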
The full script, which is described in detail below, can be found here.
  SampleSet sms fs LDA 12
  foreach ds [dss:] { sms add $ds ; sms map [dss index $ds] -class $ds }

This should remind you very much of the creation of an LDA object. In fact, it is the same thing, only with a different kind of object. Here, too, we define classes and specify which acoustic unit indices belong to which class.
When the sample set object is created, we configure some of its properties:
  set fp [open ../step5/ldaCounts] ; makeArray counts [read $fp] ; close $fp
  foreach class [sms:] {
    sms:$class configure -maxCount 500 -modulus [expr 1+$counts($class)/500]
  }

First we read the counts file that we wrote when doing the LDA. This way we know how many vectors to expect from every class in the entire database. If we only want 500 example vectors per class for the k-means, it would be risky to just take the first 500 occurrences; it is better to take examples from all over the database. So we take every n-th vector that belongs to a class, where n is the occurrence count divided by 500. This way we get about 500 examples per class, spread over the entire database. To do this, we first build an array named counts with the makeArray command that we have already used before. Then we configure the maximum number of vectors to be extracted for every class to be 500, and the modulus defining the n from above to be the number of counts for the class divided by 500, plus 1. Adding the 1 avoids a modulus of 0 for classes that have fewer than 500 counts.
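To illustrate the arithmetic with two hypothetical class counts:

  # hypothetical counts, only to show how the modulus behaves
  expr 1 + 3200/500   ;# => 7, so every 7th vector is taken, roughly 457 samples
  expr 1 + 120/500    ;# => 1, so every vector is taken, 120 samples (below -maxCount)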
Then we run a loop over the entire training data that is very similar to the one we used for computing the LDA matrix:
  foreach utt [db] {
    puts $utt
    set uttInfo [db get $utt]
    makeArray arr $uttInfo
    fs eval $uttInfo
    hmm make $arr(text) -optWord SIL
    path bload ../step4/labels/$utt
    path map hmm -senoneSet sns -stream 0
    sms accu path
  }

In fact, the only difference is that this time we accumulate for the sms object instead of the lda object. When the loop is finished, we do a
  sms flush

to write out all the vectors that have not been written to a file yet. Remember that the sample set object is a buffer whose purpose is to extract sample vectors fast. If the buffer is smaller than the maximum number of extracted vectors per class, it will be flushed automatically when it is full. At the end of the loop we must flush the remainder manually.
So we create a new codebook set named cbs2, plus some helper objects for holding vectors and matrices:
  CodebookSet cbs2 fs
  FMatrix smp
  FVector cnt

Then we run a loop over all codebooks; for every codebook of our old codebook set we must create one in the new codebook set:
  foreach cb [cbs:] {
    puts $cb
    cbs2 add $cb LDA [cbs:$cb configure -refN] 12 [cbs:$cb configure -type]

In the new codebook set we use the same number of reference vectors (-refN) and the same covariance matrix type (-type), but we use the feature LDA and only 12 coefficients.
Then, still within the loop, we load the extracted sample vectors from their file into the previously created smp matrix. We have to reduce the size of the matrix, because the sample set object saved not only the 12 LDA coefficients but also one additional coefficient containing the path likelihood of the frame (always 1.0 for Viterbi paths, and the "gamma" value for forward-backward paths):
    smp bload $cb
    smp resize [smp configure -m] [expr [smp configure -n]-1]

Now the smp matrix contains only the 12 LDA coefficients for all the extracted vectors of the $cb codebook. We can now call the k-means algorithm:
    cbs2:$cb.mat neuralGas smp -maxIter 5 -tempS 0 -counts cnt

The method is called "neuralGas" because k-means is a special case of the neural gas algorithm. With -tempS 0 we are saying that we only want pure k-means.
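To see why pure k-means falls out of neural gas at zero temperature, here is a small plain-Tcl illustration (the rank-based weighting shown here is the generic neural gas scheme, not code taken from Janus): neural gas updates every reference vector with a weight that decays exponentially with its distance rank, and as the temperature goes to zero only the nearest vector keeps a noticeable weight, which is exactly the hard assignment of k-means.

  # weight of the reference vector with distance rank r (0 = nearest) at temperature T
  proc ngWeight {r T} { expr {exp(-$r / double($T))} }
  foreach T {1.0 0.01} {
      set w {}
      foreach r {0 1 2 3} { lappend w [format %.3f [ngWeight $r $T]] }
      puts "T=$T -> $w"
  }
  ;# T=1.0  -> 1.000 0.368 0.135 0.050   (soft: all reference vectors move)
  ;# T=0.01 -> 1.000 0.000 0.000 0.000   (hard: plain k-means)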
By giving the option -counts cnt, we get a vector cnt which contains 16 coefficients (one for each reference vector); the n-th coefficient contains the number of sample vectors that were clustered into the n-th reference vector of the new codebook. We can use these counts to compute a mixture weight distribution like this:
    set sum 0 ; set vec {}
    foreach x [cnt puts] { set sum [expr $sum + $x] }
    foreach x [cnt puts] { lappend vec [expr $x/$sum] }
    dss:$cb configure -count $sum -val $vec
  }

When the loop has finished, we have a new codebook set cbs2, filled with new codebooks, and the same old distribution set, now filled with mixture weight distributions that correspond to the newly computed codebooks. All that is left to do is to store the new data structures:
  cbs2 write codebookSet
  cbs2 save codebookWeights
  dss save distribWeights
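As a quick sanity check of the weight computation inside the loop above, here is the same normalization applied to a hypothetical counts vector in plain Tcl, outside the Janus objects:

  # hypothetical counts for a codebook with four reference vectors
  set cntExample {300.0 100.0 50.0 50.0}
  set sum 0.0 ; set vec {}
  foreach x $cntExample { set sum [expr $sum + $x] }
  foreach x $cntExample { lappend vec [expr $x/$sum] }
  puts "count $sum, weights $vec"   ;# count 500.0, weights 0.6 0.2 0.1 0.1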