Doing K-Means for the Second Time

Now that we have modified our feature space by defining a new LDA matrix, we have to initialize the context-dependent codebooks anew. The basics of the script are the same as in the first k-means initialization. Only the startup is different, because we have new architecture description files and a new LDA matrix:

[FeatureSet fs] setDesc   @../step5/featDesc
            fs  setAccess @../step2/featAccess

[CodebookSet cbs fs]               read ../step14/codebookSetClustered
[DistribSet  dss cbs]              read ../step14/distribSetClusteredPruned
[PhonesSet ps]                     read ../step14/phonesSet
[Tags tags]                        read ../step2/tags
[Tree dst ps:phones ps tags dss]   read ../step14/distribTreeClusteredPruned

SenoneSet sns [DistribStream str dss dst]

[TmSet tms]                        read ../step2/transitionModels
[TopoSet tps sns tms]              read ../step2/topologies
[Tree tpt ps:phones ps tags tps]   read ../step2/topologyTree
[Dictionary dict ps:phones tags]   read ../step1/convertedDict
[DBase db]                         open ../step1/db.dat ../step1/db.idx -mode r
[FMatrix ldaMatrix]               bload ../step15/ldaMatrix
AModelSet amo tpt ROOT
HMM hmm dict amo
Path path
dst configure -padPhone [ps:phones index pad]

Since we can expect a very large number of files holding the extracted sample vectors, this time we should put them in a separate directory:

catch { mkdir data }
catch { rm data/* }

The rm command removes the remains of an experiment that might have been run in the same directory earlier. The catch around the commands keeps Tcl from stopping if the data directory already exists or if there are no data/* files.
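The same cleanup can also be sketched with Tcl's built-in file commands, which do not raise an error for an existing directory or an empty file list, so no catch is needed. This is only an illustrative alternative, not what the tutorial script uses; the procedure name resetDir is hypothetical.

```tcl
# Hypothetical helper (not part of the tutorial script): make sure a
# directory exists and remove all regular files directly inside it.
proc resetDir {dir} {
    # file mkdir succeeds silently if the directory already exists
    file mkdir $dir
    # glob -nocomplain returns an empty list when nothing matches;
    # -types f restricts the match to regular files, as "rm data/*" would
    foreach f [glob -nocomplain -types f -directory $dir *] {
        file delete -force -- $f
    }
}

resetDir data
```

Calling resetDir data a second time is harmless, which is the same property the catch wrappers provide above.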

The definition of the sample set object is basically the same as before. This time, however, we should remember to use the new LDA-model-counts file, and we should use a smaller maximum number of sample vectors because we have many more classes. In the main loop we have to change the names of the files from which the source matrices for the k-means are loaded. We should also check that each of these files exists before trying to load it, because the loop runs over all codebooks of the codebook set, and not all codebooks are actually used.
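The existence check described above might look roughly like the following plain-Tcl sketch. It assumes that the sample files sit in the data directory and are named after their codebook with a .smp extension; the extension, the procedure name namesWithSamples, and the example codebook names are all hypothetical, and in the real script the list of names would come from the codebook set.

```tcl
# Hypothetical helper: from a list of codebook names, keep only those
# for which a sample file exists. The ".smp" extension is an assumption.
proc namesWithSamples {names {dir data} {ext .smp}} {
    set found {}
    foreach name $names {
        set path [file join $dir $name$ext]
        if {[file exists $path]} {
            lappend found $name
        }
    }
    return $found
}

# Made-up codebook names; only those with a sample file are visited.
foreach cb [namesWithSamples {cbA cbB cbC}] {
    puts "would load samples and run k-means for $cb"
}
```

Filtering the names first keeps the loop body free of error handling for codebooks that never received any sample vectors.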