ngram-class

NAME

ngram-class - induce word classes from N-gram statistics

SYNOPSIS

ngram-class [ -help ] option ...

DESCRIPTION

ngram-class induces word classes from distributional statistics, so as to minimize the perplexity of a class-based N-gram model given the provided word N-gram counts. Presently, only bigram statistics are used; hence the induced classes are best suited for a class-bigram language model.

The program generates the class N-gram counts needed by ngram-count(1) to train a class N-gram model, as well as the class expansions needed by ngram(1) to apply it.
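
For example, a typical pipeline might look as follows (a minimal sketch; the file names are placeholders, and the ngram-count(1) and ngram(1) options shown are their standard count-reading and class-expansion options):

    # Induce 100 word classes from bigram statistics of the training text
    ngram-class -text train.txt -numclasses 100 \
        -class-counts class.counts -classes class.defs

    # Train a class bigram model from the class N-gram counts
    ngram-count -read class.counts -order 2 -lm class.bo

    # Apply the class model, expanding classes back into words
    ngram -lm class.bo -order 2 -classes class.defs -ppl test.txt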

OPTIONS

Each filename argument can be an ASCII file, a compressed file (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.

-help
Print option summary.
-version
Print version information.
-debug level
Set the debugging output level to level. Level 0 means no debugging; debugging messages are written to stderr. A useful level for tracing the formation of classes is 2.

Input Options

-vocab file
Read a vocabulary from file. Subsequently, out-of-vocabulary words in both counts and text are replaced with the unknown-word token. If this option is not specified, all words found in the input are implicitly added to the vocabulary.
-tolower
Map the vocabulary to lowercase.
-counts file
Read N-gram counts from file. Each line contains an N-gram of words followed by an integer count, all separated by whitespace. Repeated counts for the same N-gram are added, and counts collected by -text and -counts are additive as well. (See the example at the end of this subsection.)
Note that the input should contain consistent lower- and higher-order counts (i.e., both unigrams and bigrams), as would be generated by ngram-count(1).
-text textfile
Generate N-gram counts from textfile, which should contain one sentence per line. Begin/end sentence tokens are added if not already present. Empty lines are ignored.
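
As an illustration of the counts format read by -counts, a hypothetical fragment of unigram and bigram counts (the words and numbers here are made up) might look like:

    <s>      1000
    the      520
    cat      37
    <s> the  212
    the cat  35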

Class Merging

-numclasses C
The target number of classes to induce. A zero argument suppresses automatic class merging altogether (e.g., for use with -interact).
-full
Perform full greedy merging over all classes starting with one class per word. This is the O(V^3) algorithm described in Brown et al. (1992).
-incremental
Perform incremental greedy merging, starting with one class each for the C most frequent words and then adding one word at a time. This is the O(V*C^2) algorithm described in Brown et al. (1992), and it is the default. (See the example at the end of this subsection.)
-interact
Enter a primitive interactive interface when done with automatic class induction, allowing manual specification of additional merging steps.
-noclass-vocab file
Read a list of vocabulary items from file that are to be excluded from classes. These words or tags do not undergo class merging, but their N-gram counts still affect the optimization of model perplexity.
The default is to exclude the sentence begin/end tags (<s> and </s>) from class merging; this can be suppressed by specifying -noclass-vocab /dev/null.
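
For example (a sketch with placeholder file names), the two merging strategies are selected as follows:

    # Full greedy merging; feasible only for small vocabularies
    ngram-class -text train.txt -numclasses 100 -full -classes class.defs

    # Incremental greedy merging (the default); scales to larger vocabularies
    ngram-class -text train.txt -numclasses 100 -incremental -classes class.defs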

Output Options

-class-counts file
Write class N-gram counts to file when done. The format is the same as for word N-gram counts, and can be read by ngram-count(1) to estimate a class N-gram model.
-classes file
Write class definitions (member words and their probabilities) to file when done. The output format is the same as required by the -classes option of ngram(1).
-save S
Save the class counts and/or class definitions every S iterations during induction. The filenames are obtained from the -class-counts and -classes options, respectively, by appending the iteration number. This is convenient for producing sets of classes at different granularities in a single run (see the example below). S=0 (the default) suppresses the saving actions.
-save-maxclasses K
Modifies the action of -save so as to only start saving once the number of classes reaches K. (The iteration numbers embedded in filenames will start at 0 from that point.)
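
For instance, to checkpoint class sets at several granularities in one run (a sketch; per -save above, the actual output file names are formed by appending the iteration number to the names given here):

    # Save class counts and definitions every 50 merging iterations,
    # but only once the number of classes has dropped to 500
    ngram-class -text train.txt -numclasses 100 \
        -save 50 -save-maxclasses 500 \
        -class-counts class.counts -classes class.defs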

SEE ALSO

ngram-count(1), ngram(1).
P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer, ``Class-Based n-gram Models of Natural Language,'' Computational Linguistics 18(4), 467-479, 1992.

BUGS

Classes are optimized only for bigram models at present.

AUTHOR

Andreas Stolcke <stolcke@speech.sri.com>.
Copyright 1999-2007 SRI International