ngram-class

NAME

ngram-class - induce word classes from N-gram statistics

SYNOPSIS

ngram-class [ -help ] option ...

DESCRIPTION

ngram-class induces word classes from distributional statistics, so as to minimize the perplexity of a class-based N-gram model given the provided word N-gram counts. Presently, only bigram statistics are used; hence the induced classes are best suited for a class-bigram language model.

The program generates the class N-gram counts needed by ngram-count(1) to train a class N-gram model, as well as the class expansions needed by ngram(1) to apply it.
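
For example, a typical pipeline might look as follows (a minimal sketch; the file names are placeholders, and the ngram-count(1) and ngram(1) options shown are their standard count-reading and class-expansion options):

    # Induce 100 word classes from bigram statistics of the training text
    ngram-class -text train.txt -numclasses 100 \
        -class-counts class.counts -classes class.defs

    # Train a class bigram model from the class N-gram counts
    ngram-count -read class.counts -order 2 -lm class.bo

    # Apply the class model, expanding classes back into words
    ngram -lm class.bo -order 2 -classes class.defs -ppl test.txt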

OPTIONS

Each filename argument can be an ASCII file, a compressed file (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.

-help
Print option summary.
-version
Print version information.
-debug level
Set the debugging output level to level. Level 0 means no debugging; debugging messages are written to stderr. A useful level for tracing the formation of classes is 2.

Input Options

-vocab file
Read a vocabulary from file. Subsequently, out-of-vocabulary words in both counts and text are replaced with the unknown-word token. If this option is not specified, all words found in the input are implicitly added to the vocabulary.
-tolower
Map the vocabulary to lowercase.
-counts file
Read N-gram counts from file. Each line contains an N-gram of words followed by an integer count, all separated by whitespace. Repeated counts for the same N-gram are added, and counts collected by -text and -counts are additive as well. (See the example at the end of this subsection.)
Note that the input should contain consistent lower- and higher-order counts (i.e., both unigrams and bigrams), as would be generated by ngram-count(1).
-text textfile
Generate N-gram counts from textfile, which should contain one sentence per line. Begin/end sentence tokens are added if not already present. Empty lines are ignored.
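
As an illustration of the counts format read by -counts, a hypothetical fragment of unigram and bigram counts (the words and numbers here are made up) might look like:

    <s>      1000
    the      520
    cat      37
    <s> the  212
    the cat  35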

Class Merging

-numclasses C
The target number of classes to induce. A zero argument suppresses automatic class merging altogether (e.g., for use with -interact).
-full
Perform full greedy merging over all classes starting with one class per word. This is the O(V^3) algorithm described in Brown et al. (1992).
-incremental
Perform incremental greedy merging, starting with one class each for the C most frequent words and then adding one word at a time. This is the O(V*C^2) algorithm described in Brown et al. (1992), and it is the default. (See the example at the end of this subsection.)
-interact
Enter a primitive interactive interface when done with automatic class induction, allowing manual specification of additional merging steps.
-noclass-vocab file
Read a list of vocabulary items from file that are to be excluded from classes. These words or tags do not undergo class merging, but their N-gram counts still affect the optimization of model perplexity.
The default is to exclude the sentence begin/end tags (<s> and </s>) from class merging; this can be suppressed by specifying -noclass-vocab /dev/null.
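
For example (a sketch with placeholder file names), the two merging strategies are selected as follows:

    # Full greedy merging; feasible only for small vocabularies
    ngram-class -text train.txt -numclasses 100 -full -classes class.defs

    # Incremental greedy merging (the default); scales to larger vocabularies
    ngram-class -text train.txt -numclasses 100 -incremental -classes class.defs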

Output Options

-class-counts file
Write class N-gram counts to file when done. The format is the same as for word N-gram counts, and can be read by ngram-count(1) to estimate a class N-gram model.
-classes file
Write class definitions (member words and their probabilities) to file when done. The output format is the same as required by the -classes option of ngram(1).
-save S
Save the class counts and/or class definitions every S iterations during induction. The filenames are obtained from the -class-counts and -classes options, respectively, by appending the iteration number. This is convenient for producing sets of classes at different granularities in a single run (see the example below). S=0 (the default) suppresses the saving actions.
-save-maxclasses K
Modifies the action of -save so as to only start saving once the number of classes reaches K. (The iteration numbers embedded in filenames will start at 0 from that point.)
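
For instance, to checkpoint class sets at several granularities in one run (a sketch; per -save above, the actual output file names are formed by appending the iteration number to the names given here):

    # Save class counts and definitions every 50 merging iterations,
    # but only once the number of classes has dropped to 500
    ngram-class -text train.txt -numclasses 100 \
        -save 50 -save-maxclasses 500 \
        -class-counts class.counts -classes class.defs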

SEE ALSO

ngram-count(1), ngram(1).
P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer, ``Class-Based n-gram Models of Natural Language,'' Computational Linguistics 18(4), 467-479, 1992.

BUGS

Classes are optimized only for bigram models at present.

AUTHOR

Andreas Stolcke <stolcke@speech.sri.com>.
Copyright 1999-2007 SRI International