ngram-class
NAME
ngram-class - induce word classes from N-gram statistics
SYNOPSIS
ngram-class [ -help ] option ...
DESCRIPTION
ngram-class
induces word classes from distributional statistics,
so as to minimize perplexity of a class-based N-gram model
given the provided word N-gram counts.
Presently, only bigram statistics are used, i.e., the induced classes
are best suited for a class-bigram language model.
The program generates the class N-gram counts and class expansions
needed by
ngram-count(1)
and
ngram(1),
to train and to apply the class N-gram model, respectively.
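As an illustration of how class N-gram counts and class expansions combine in a class-bigram model, the following minimal sketch computes a word bigram probability as p(class(w2) | class(w1)) * p(w2 | class(w2)). The class names, words and probabilities are hypothetical toy values, not output of ngram-class, and the function is not part of any SRILM tool.

```python
# Hypothetical toy model; none of these values come from ngram-class.
word_class = {"monday": "DAY", "friday": "DAY", "on": "PREP"}    # word -> class
p_word_given_class = {"monday": 0.6, "friday": 0.4, "on": 1.0}   # class expansions
p_class_bigram = {("PREP", "DAY"): 0.5, ("DAY", "PREP"): 0.2}    # p(c2 | c1)


def word_bigram_prob(prev, word):
    """Return p(word | prev) under a class-bigram model.

    The class transition probability is multiplied by the probability
    of the word within its class. Unseen class bigrams get 0.0 here;
    a real model would apply smoothing instead.
    """
    c1, c2 = word_class[prev], word_class[word]
    return p_class_bigram.get((c1, c2), 0.0) * p_word_given_class[word]


print(word_bigram_prob("on", "friday"))
```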
OPTIONS
Each filename argument can be an ASCII file, a
compressed file (name ending in .Z or .gz), or "-" to indicate
stdin/stdout.
- -help
-
Print option summary.
- -version
-
Print version information.
- -debug level
-
Set debugging output at
level.
Level 0 means no debugging.
Debugging messages are written to stderr.
A useful level to trace the formation of classes is 2.
Input Options
- -vocab file
-
Read a vocabulary from file.
Subsequently, out-of-vocabulary words in both counts and text are
replaced with the unknown-word token.
If this option is not specified, all words found are implicitly added
to the vocabulary.
- -tolower
-
Map the vocabulary to lowercase.
- -counts file
-
Read N-gram counts from a file.
Each line contains an N-gram of
words, followed by an integer count, all separated by whitespace.
Repeated counts for the same N-gram are added.
Counts collected by
-text
and
-counts
are additive as well.
Note that the input should contain consistent lower- and higher-order
counts (i.e., unigrams and bigrams), as would be generated by
ngram-count(1).
- -text textfile
-
Generate N-gram counts from text file.
textfile
should contain one sentence unit per line.
Begin/end sentence tokens are added if not already present.
Empty lines are ignored.
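The input conventions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's implementation: the function names read_counts and count_text are hypothetical, malformed count lines are simply skipped, and text is counted up to bigrams only.

```python
from collections import Counter


def read_counts(lines, counts=None):
    """Parse 'w1 ... wN count' lines; repeated N-grams are summed."""
    counts = Counter() if counts is None else counts
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # assumption of this sketch: skip blank or malformed lines
        *words, count = fields
        counts[tuple(words)] += int(count)
    return counts


def count_text(lines, counts=None, order=2):
    """Count N-grams up to `order` from one-sentence-per-line text."""
    counts = Counter() if counts is None else counts
    for line in lines:
        words = line.split()
        if not words:
            continue  # empty lines are ignored
        # Add begin/end sentence tokens only if not already present.
        if words[0] != "<s>":
            words.insert(0, "<s>")
        if words[-1] != "</s>":
            words.append("</s>")
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


# Counts from both sources accumulate in the same table.
counts = read_counts(["a b 2", "a b 3", "a 5"])
counts = count_text(["a b", "", "<s> b a </s>"], counts)
print(counts[("a", "b")])
```

Passing the same Counter to both functions mirrors the additive behavior described for -counts and -text.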
Class Merging
- -numclasses C
-
The target number of classes to induce.
A zero argument suppresses automatic class merging altogether
(e.g., for use with
-interact).
- -full
-
Perform full greedy merging over all classes starting with one class per
word.
This is the O(V^3) algorithm described in Brown et al. (1992).
- -incremental
-
Perform incremental greedy merging, starting with
one class each for the
C
most frequent words, and then adding one word at a time.
This is the O(V*C^2) algorithm described in Brown et al. (1992);
it is the default.
- -interact
-
Enter a primitive interactive interface when done with automatic class
induction, allowing manual specification of additional merging steps.
- -noclass-vocab file
-
Read a list of vocabulary items from
file
that are to be excluded from classes.
These words or tags do not undergo class merging, but their
N-gram counts still affect the optimization of model perplexity.
The default is to exclude the sentence begin/end tags (<s> and </s>)
from class merging; this can be suppressed by specifying
-noclass-vocab /dev/null.
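To make the idea of full greedy merging concrete, here is a deliberately naive Python sketch. It starts with one class per word and repeatedly applies the merge that costs the least class-bigram log likelihood. It recomputes the objective from scratch for every candidate pair, so it is far slower than the bookkeeping described by Brown et al. (1992), and it is not the ngram-class implementation. The function names, the toy counts, and the choice not to count excluded words toward the target number of classes are assumptions of this sketch.

```python
import math
from collections import Counter
from itertools import combinations


def objective(bigrams, cls):
    """Class-dependent part of the class-bigram log likelihood."""
    pair, left, right = Counter(), Counter(), Counter()
    for (w1, w2), n in bigrams.items():
        if n <= 0:
            continue  # zero counts contribute nothing
        c1, c2 = cls[w1], cls[w2]
        pair[c1, c2] += n
        left[c1] += n
        right[c2] += n

    def xlogx(table):
        return sum(n * math.log(n) for n in table.values())

    return xlogx(pair) - xlogx(left) - xlogx(right)


def full_greedy_merge(bigrams, num_classes, noclass=("<s>", "</s>")):
    """Merge word classes greedily until num_classes mergeable classes remain."""
    words = {w for bigram in bigrams for w in bigram}
    cls = {w: w for w in words}                # one class per word
    mergeable = sorted(words - set(noclass))   # excluded words are never merged
    while len(mergeable) > max(num_classes, 1):
        best = None
        for a, b in combinations(mergeable, 2):
            # Tentatively fold class b into class a and rescore.
            trial = {w: (a if c == b else c) for w, c in cls.items()}
            score = objective(bigrams, trial)
            if best is None or score > best[0]:
                best = (score, b, trial)
        _, merged_away, cls = best
        mergeable.remove(merged_away)
    return cls


# Toy counts: "mon"/"fri" and "cat"/"dog" occur in identical contexts.
bigrams = {
    ("on", "mon"): 3, ("on", "fri"): 3, ("mon", "we"): 3, ("fri", "we"): 3,
    ("we", "the"): 6, ("the", "cat"): 3, ("the", "dog"): 3,
    ("cat", "ran"): 3, ("dog", "ran"): 3, ("ran", "on"): 6,
}
classes = full_greedy_merge(bigrams, 6)
print(classes["mon"] == classes["fri"], classes["cat"] == classes["dog"])
```

Words listed in noclass keep singleton classes, but their counts still enter the objective, which matches the description of -noclass-vocab above.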
Output Options
- -class-counts file
-
Write class N-gram counts to
file
when done.
The format is the same as for word N-gram counts, and can be
read by
ngram-count(1)
to estimate a class-N-gram model.
- -classes file
-
Write class definitions (member words and their probabilities) to
file
when done.
The output format is the same as required by the
-classes
option of
ngram(1).
- -save S
-
Save the class counts and/or class definitions every
S
iterations during induction.
The filenames are obtained from the
-class-counts
and
-classes
options, respectively, by appending the iteration number.
This is convenient for producing sets of classes at different granularities
during the same run.
S=0
(the default) suppresses the saving actions.
- -save-maxclasses K
-
Modifies the action of
-save
so as to only start saving once the number of classes reaches
K.
(The iteration numbers embedded in filenames will start at 0 from that point.)
SEE ALSO
ngram-count(1), ngram(1).
P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai and R. L. Mercer,
"Class-Based n-gram Models of Natural Language,"
Computational Linguistics 18(4), 467-479, 1992.
BUGS
Classes are optimized only for bigram models at present.
AUTHOR
Andreas Stolcke <stolcke@speech.sri.com>.
Copyright 1999-2007 SRI International