FastMix
Introduction
FastMix generates Gaussian mixture models for large datasets using efficient
EM clustering algorithms developed by the
Auton Project. The program
automatically selects the number of Gaussians and the locations of the
Gaussians using the KD-Clust algorithm.
The algorithm uses kd-trees for two purposes: to accelerate EM by
caching
sufficient statistics and to find regions
where the model underpredicts the data. The regions of underprediction
are used as candidates for new cluster locations.
FastMix takes as input a file of data points and outputs a set of Gaussian centers and covariances. The program is currently supported by Peter Sand (psand@cs.cmu.edu).
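To make the idea of caching sufficient statistics in a kd-tree concrete, here is a minimal Python sketch. It is not FastMix's implementation: the names `KDNode` and `build` are hypothetical, the split rule (median along a cycling axis) is an assumption, and the bounding boxes and pruning logic that a real EM accelerator would need are omitted. Each node stores the count, the sum, and the sum of outer products of the points beneath it, so the mean and covariance of any subtree can be read off without revisiting the points.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

Point = Sequence[float]


@dataclass
class KDNode:
    count: int                  # number of points under this node
    total: List[float]          # per-dimension sum of those points
    outer: List[List[float]]    # sum of outer products x * x^T
    left: Optional["KDNode"] = None
    right: Optional["KDNode"] = None

    def mean(self) -> List[float]:
        return [s / self.count for s in self.total]

    def covariance(self) -> List[List[float]]:
        # Maximum-likelihood covariance: E[x x^T] - mean mean^T.
        m = self.mean()
        d = len(m)
        return [[self.outer[i][j] / self.count - m[i] * m[j]
                 for j in range(d)] for i in range(d)]


def build(points: List[Point], leaf_size: int = 2, depth: int = 0) -> KDNode:
    """Build a kd-tree over a non-empty point list (leaf_size >= 1)."""
    d = len(points[0])
    if len(points) <= leaf_size:
        # Leaf: compute the statistics directly from the points.
        total = [sum(p[i] for p in points) for i in range(d)]
        outer = [[sum(p[i] * p[j] for p in points) for j in range(d)]
                 for i in range(d)]
        return KDNode(len(points), total, outer)
    # Assumed split rule: median along an axis that cycles with depth.
    axis = depth % d
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    left = build(ordered[:mid], leaf_size, depth + 1)
    right = build(ordered[mid:], leaf_size, depth + 1)
    # Internal node: statistics are the sums of the children's statistics.
    total = [a + b for a, b in zip(left.total, right.total)]
    outer = [[a + b for a, b in zip(ra, rb)]
             for ra, rb in zip(left.outer, right.outer)]
    return KDNode(left.count + right.count, total, outer, left, right)
```

Because the statistics are additive, an internal node is filled in from its two children alone, which is what lets an algorithm treat a whole subtree as one summarized block of data.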
Downloads
    fastmix-linux.tar.gz (Linux Version)
    fastmix-sun4.tar.gz (Sun4 Version)
    sample1.ds (sample data file)
    sample2.ds (sample data file)
    sample.mix (sample Gaussian mixture file)
Options
The standard command line syntax for FastMix includes an input file
and an optional output file:
    fastmix input-file output-file
If the output file is omitted, the standard output is used. Additional options may be specified on the command line or in the fastmix.conf file. (Options specified on the command line override values specified in the configuration file.)
The following is a sample FastMix command line:
    fastmix data.ds mix.cen -t 500 -s aic -d 300
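For scripted use, the command line syntax above could be wrapped as in the following Python sketch. The helper names `fastmix_command` and `run_fastmix` are hypothetical, and the sketch assumes a `fastmix` executable is on the PATH. Options are passed through verbatim, since their meanings are not described here. Whether options are accepted when the output file is omitted is also an assumption.

```python
import subprocess
from typing import List, Optional, Sequence


def fastmix_command(input_file: str,
                    output_file: Optional[str] = None,
                    options: Sequence[str] = ()) -> List[str]:
    """Assemble an argument list following `fastmix input-file output-file`."""
    cmd = ["fastmix", input_file]
    if output_file is not None:
        cmd.append(output_file)
    # Options are appended unchanged, as in the sample command line.
    cmd.extend(options)
    return cmd


def run_fastmix(input_file: str,
                output_file: Optional[str] = None,
                options: Sequence[str] = ()) -> str:
    """Run FastMix and return whatever it wrote to standard output.

    When no output file is given, the documentation says the mixture
    is written to standard output, so the returned text is the mixture.
    """
    cmd = fastmix_command(input_file, output_file, options)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Calling `fastmix_command("data.ds", "mix.cen", ["-t", "500", "-s", "aic", "-d", "300"])` reproduces the argument list of the sample command line.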
File Formats
FastMix can load standard text-based data files with fields separated by
spaces or commas, and records separated by line breaks. Any lines that
are blank or start with the # character are ignored. The program also
supports fast-loading data files. These fast-loading files (which must
end with a .fds extension) can be generated using the
fastmix-fds utility (see the utilities
section).
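The text data format described above can be read with a few lines of Python. This is a sketch rather than FastMix's own loader: the function name `parse_data_lines` is hypothetical, every field is assumed to be numeric, and a `#` preceded only by whitespace is assumed to count as a comment line.

```python
import re
from typing import Iterable, List


def parse_data_lines(lines: Iterable[str]) -> List[List[float]]:
    """Parse records whose fields are separated by spaces or commas."""
    records = []
    for line in lines:
        stripped = line.strip()
        # Blank lines and lines starting with '#' are ignored.
        if not stripped or stripped.startswith("#"):
            continue
        # Split on any run of commas and/or whitespace; drop empty pieces.
        fields = [f for f in re.split(r"[,\s]+", stripped) if f]
        records.append([float(f) for f in fields])
    return records
```

To read a file, pass the open file object directly, for example `parse_data_lines(open("sample1.ds"))`.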
FastMix outputs files containing a mixture of Gaussians. The file starts with two header lines:
    gaussians: num_gaussians
    dimensions: num_dimensions
Each line after the header specifies a single Gaussian. The values in each line are separated by spaces and include the Gaussian's mean, covariance, and probability (the probabilities sum to 1 over the entire mixture). Each line contains fields in the following order (d denotes the number of dimensions):
    gaussian-id probability mean[0] ... mean[d-1] cov[0,0] ... cov[d-1,0] ... cov[d-1,d-1]
A FastMix log file contains lines of the following format:
    secs-since-start score num-gaussians
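The mixture file layout described above can be parsed roughly as follows. This is a sketch under stated assumptions, not a reference reader: the function name `parse_mixture` is hypothetical, the covariance is assumed to be written out as all d*d entries, and the gaussian-id is kept as a string because its type is not specified. Since a covariance matrix is symmetric, the order in which the two indices vary does not change the result.

```python
from typing import Dict, Iterable, List


def parse_mixture(lines: Iterable[str]) -> List[Dict]:
    """Parse a FastMix mixture file into a list of Gaussian records."""
    rows = [ln.split() for ln in lines if ln.strip()]
    # Header lines: "gaussians: N" and "dimensions: D".
    header = {r[0].rstrip(":"): int(r[1]) for r in rows[:2]}
    n, d = header["gaussians"], header["dimensions"]
    body = rows[2:]
    if len(body) != n:
        raise ValueError(f"expected {n} Gaussians, found {len(body)}")
    gaussians = []
    for fields in body:
        # Assumption: id, probability, d mean values, d*d covariance values.
        if len(fields) != 2 + d + d * d:
            raise ValueError(f"unexpected field count: {len(fields)}")
        flat = [float(x) for x in fields[2 + d:]]
        gaussians.append({
            "id": fields[0],
            "probability": float(fields[1]),
            "mean": [float(x) for x in fields[2:2 + d]],
            # Chunk the flat list into d rows of d values each.
            "covariance": [flat[i * d:(i + 1) * d] for i in range(d)],
        })
    return gaussians
```

A quick sanity check after parsing is to confirm that the `probability` values sum to approximately 1.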
Auxiliary Utilities
FastMix provides a utility that scores a mixture file using the specified dataset. The utility has the following syntax (valid values for score-type are aic and bic):
    fastmix-score score-type dataset-file mixture-file
FastMix also includes a program that generates a graphical display of a dataset and a mixture file:
    fastmix-show dataset-file mixture-file
Another utility generates a PostScript rendering of a dataset and a mixture file:
    fastmix-ps dataset-file mixture-file ps-file
The following program converts a standard text data file into a fast-loading .fds file:
    fastmix-fds input-file
Another utility updates a mixture model by running a sequence of EM steps:
    fastmix-em dataset-file input-mixture-file output-mixture-file options
The following options may be used:
The output mixture model is the highest scoring, not necessarily the last.
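The "highest scoring, not necessarily the last" behavior corresponds to a keep-the-best loop, sketched below in Python. The names `best_of_sequence`, `step`, and `score` are hypothetical stand-ins for an EM update and a scoring function. Treating a higher score as better follows the wording above, and counting the initial model as a candidate is an assumption.

```python
from typing import Callable, Tuple, TypeVar

Model = TypeVar("Model")


def best_of_sequence(initial: Model,
                     step: Callable[[Model], Model],
                     score: Callable[[Model], float],
                     num_steps: int) -> Tuple[Model, float]:
    """Run num_steps updates and return the highest-scoring model seen."""
    model = initial
    best_model, best_score = initial, score(initial)
    for _ in range(num_steps):
        model = step(model)
        current = score(model)
        # Keep the update only if it strictly improves on the best so far.
        if current > best_score:
            best_model, best_score = model, current
    return best_model, best_score
```

With this pattern, a later step that lowers the score does not overwrite the result, which is why the returned model may come from the middle of the sequence.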
References
Andrew W. Moore,
Very Fast EM-based Mixture Model Clustering using Multiresolution
kd-trees,
Advances in Neural Information Processing
Systems 11, (Submitted May 1998, Proceedings published May 1999).
Peter Sand and Andrew W. Moore, Repairing Faulty Mixture Models using Density Estimation, International Conference on Machine Learning, 2001 (ICML2001).