Besides using weights that have been trained by some other system, the most popular way to initialize codebooks is the k-means algorithm. Whenever we change to a feature space we have not used before, we have to find new reference vectors for our codebooks. This is always the case, for example, when we compute a new LDA matrix. The k-means algorithm needs a large number of example vectors, which it then clusters into fewer vectors, namely as many as we want for our codebooks. Since both the set of reference vectors of a codebook and the set of example vectors can be regarded as matrices, the k-means operation is a matrix-to-matrix operation: the source matrix holds the sample vectors and the destination matrix holds the reference vectors.
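To make this matrix-to-matrix picture concrete, here is a minimal sketch of a single k-means pass in plain Tcl, independent of any Janus objects; the actual clustering is later done for us by the neuralGas method:

  # One k-means pass: assign every sample vector to its nearest reference
  # vector, then replace each reference vector by the mean of its samples.
  proc sqDist {a b} {
      set d 0.0
      foreach x $a y $b { set d [expr {$d + ($x - $y) * ($x - $y)}] }
      return $d
  }
  proc kmeansPass {samples refs} {
      set refN [llength $refs]
      set dim  [llength [lindex $refs 0]]
      for {set i 0} {$i < $refN} {incr i} { set sum($i) [lrepeat $dim 0.0] ; set cnt($i) 0 }
      foreach s $samples {
          # assignment step: find the nearest reference vector
          set best 0 ; set bestD [sqDist $s [lindex $refs 0]]
          for {set i 1} {$i < $refN} {incr i} {
              set d [sqDist $s [lindex $refs $i]]
              if {$d < $bestD} { set bestD $d ; set best $i }
          }
          set acc {}
          foreach x $sum($best) y $s { lappend acc [expr {$x + $y}] }
          set sum($best) $acc ; incr cnt($best)
      }
      # update step: each reference vector becomes the mean of its samples
      set newRefs {}
      for {set i 0} {$i < $refN} {incr i} {
          if {$cnt($i) == 0} { lappend newRefs [lindex $refs $i] ; continue }
          set mean {}
          foreach x $sum($i) { lappend mean [expr {$x / $cnt($i)}] }
          lappend newRefs $mean
      }
      return $newRefs
  }
  # e.g. cluster six 2-dimensional samples into two reference vectors
  set samples {{0.0 0.1} {0.2 0.0} {0.1 0.2} {5.0 5.1} {4.9 5.0} {5.2 4.8}}
  puts [kmeansPass $samples {{0.0 0.0} {5.0 5.0}}]

Repeating such passes until the reference vectors stop moving yields the initial codebook.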
The full script, which is described in detail below, can be found here.
  SampleSet sms fs LDA 12
  foreach ds [dss:] { sms add $ds ; sms map [dss index $ds] -class $ds }

This should remind you very much of the creation of an LDA object. In fact, it is the same thing, only with a different kind of object. Here, too, we define classes and specify which acoustic unit indices belong to which class.
When the sample set object is created, we configure some of its properties:
  set fp [open ../step5/ldaCounts] ; makeArray counts [read $fp] ; close $fp
  foreach class [sms:] {
    sms:$class configure -maxCount 500 -modulus [expr 1+$counts($class)/500]
  }

First we read the counts file that we wrote when doing the LDA. This way we know how many vectors to expect from every class in the entire database. If we only want 500 example vectors per class for the k-means, it would be risky to just take the first 500 occurrences; it is better to take examples from all over the database. So we take every n-th vector that belongs to a class, where n is the occurrence count divided by 500. This way we get about 500 examples per class, spread over the entire database. To do this, we first build an array named counts with the makeArray command that we have already used before. Then we configure the maximum number of vectors to be extracted for every class to be 500, and the modulus defining the n from above to be the number of counts for the class divided by 500, plus 1. Adding the 1 avoids a modulus of 0 for classes that have fewer than 500 counts.
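To illustrate the arithmetic with two hypothetical class counts:

  # hypothetical counts, only to show how the modulus behaves
  expr 1 + 3200/500   ;# => 7, so every 7th vector is taken, roughly 457 samples
  expr 1 + 120/500    ;# => 1, so every vector is taken, 120 samples (below -maxCount)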
Then we run a loop over the entire training data that is very similar to the one we used for computing the LDA matrix:
  foreach utt [db] {
    puts $utt
    set uttInfo [db get $utt]
    makeArray arr $uttInfo
    fs eval $uttInfo
    hmm make $arr(text) -optWord SIL
    path bload ../step4/labels/$utt
    path map hmm -senoneSet sns -stream 0
    sms accu path
  }

In fact, the only difference is that this time we accumulate for the sms object instead of the lda object. When the loop is finished, we do a
  sms flush

to write out all the vectors that have not been written to a file yet. Remember that the sample set object is a buffer whose purpose is to extract sample vectors fast. If the buffer is smaller than the maximum number of extracted vectors per class, it will be flushed automatically when it is full. At the end of the loop we must flush the remainder manually.
So we create a new codebook set named cbs2, plus some helper objects for holding vectors and matrices:
  CodebookSet cbs2 fs
  FMatrix smp
  FVector cnt

Then we run a loop over all codebooks; for every codebook of our old codebook set we must create one in the new codebook set:
  foreach cb [cbs:] {
    puts $cb
    cbs2 add $cb LDA [cbs:$cb configure -refN] 12 [cbs:$cb configure -type]

In the new codebook set we use the same number of reference vectors (-refN) and the same covariance matrix type (-type), but we use the feature LDA and only 12 coefficients.
Then, still within the loop, we load the extracted sample vectors from their file into the previously created smp matrix. We have to reduce the size of the matrix, because the sample set object saved not only the 12 LDA coefficients but also one additional coefficient containing the path likelihood of the frame (always 1.0 for Viterbi paths, and the "gamma" value for forward-backward paths):
    smp bload $cb
    smp resize [smp configure -m] [expr [smp configure -n]-1]

Now the smp matrix contains only the 12 LDA coefficients for all the extracted vectors of the $cb codebook. We can now call the k-means algorithm:
    cbs2:$cb.mat neuralGas smp -maxIter 5 -tempS 0 -counts cnt

The method is called "neuralGas" because k-means is a special case of the neural gas algorithm. With -tempS 0 we are saying that we only want pure k-means.
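To see why pure k-means falls out of neural gas at zero temperature, here is a small plain-Tcl illustration (the rank-based weighting shown here is the generic neural gas scheme, not code taken from Janus): neural gas updates every reference vector with a weight that decays exponentially with its distance rank, and as the temperature goes to zero only the nearest vector keeps a noticeable weight, which is exactly the hard assignment of k-means.

  # weight of the reference vector with distance rank r (0 = nearest) at temperature T
  proc ngWeight {r T} { expr {exp(-$r / double($T))} }
  foreach T {1.0 0.01} {
      set w {}
      foreach r {0 1 2 3} { lappend w [format %.3f [ngWeight $r $T]] }
      puts "T=$T -> $w"
  }
  ;# T=1.0  -> 1.000 0.368 0.135 0.050   (soft: all reference vectors move)
  ;# T=0.01 -> 1.000 0.000 0.000 0.000   (hard: plain k-means)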
By giving the option -counts cnt, we get a vector cnt which contains 16 coefficients (one for each reference vector); the n-th coefficient contains the number of sample vectors that were clustered into the n-th reference vector of the new codebook. We can use these counts to compute a mixture weight distribution like this:
    set sum 0 ; set vec {}
    foreach x [cnt puts] { set sum [expr $sum + $x] }
    foreach x [cnt puts] { lappend vec [expr $x/$sum] }
    dss:$cb configure -count $sum -val $vec
  }

When the loop has finished, we have a new codebook set cbs2, filled with new codebooks, and the same old distribution set, now filled with mixture weight distributions that correspond to the newly computed codebooks. All that is left to do is to store the new data structures:
  cbs2 write codebookSet
  cbs2 save codebookWeights
  dss save distribWeights
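As a quick sanity check of the weight computation inside the loop above, here is the same normalization applied to a hypothetical counts vector in plain Tcl, outside the Janus objects:

  # hypothetical counts for a codebook with four reference vectors
  set cntExample {300.0 100.0 50.0 50.0}
  set sum 0.0 ; set vec {}
  foreach x $cntExample { set sum [expr $sum + $x] }
  foreach x $cntExample { lappend vec [expr $x/$sum] }
  puts "count $sum, weights $vec"   ;# count 500.0, weights 0.6 0.2 0.1 0.1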