multi-ngram
NAME
multi-ngram - build multiword N-gram models
SYNOPSIS
multi-ngram [ -help ] option ...
DESCRIPTION
multi-ngram
builds N-gram language models that contain multiwords, i.e., compound words
that are a concatenation of words from some prior given model.
It will optionally generate multiword N-grams and insert them into
an existing reference N-gram model, so as to cover multiwords occurring
in a specified vocabulary.
It will then assign probabilities to the multiword N-grams so that word
strings containing multiwords have the same probabilities as the strings
of component words in the reference model.
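The equivalence stated above can be illustrated with a minimal sketch. It assumes a toy bigram reference model held in a Python dict; the table REF_BIGRAMS, its probabilities, and the function names are hypothetical, and backoff is ignored. It is not the program's actual implementation. The probability of a multiword given its history is taken to be the chain-rule product over its component words.

```python
import math

# Hypothetical toy reference bigram model: P(word | previous word).
REF_BIGRAMS = {
    ("<s>", "new"): 0.2,
    ("new", "york"): 0.5,
    ("york", "city"): 0.4,
}

def expand(tokens, multi_char="_"):
    """Split every multiword token into its component words."""
    words = []
    for tok in tokens:
        words.extend(tok.split(multi_char))
    return words

def token_prob(history_tokens, token, multi_char="_"):
    """P(token | history) as the product of reference bigram probabilities
    over the token's components (assumption: bigram model, no backoff)."""
    prev = expand(history_tokens, multi_char)[-1]
    prob = 1.0
    for word in token.split(multi_char):
        prob *= REF_BIGRAMS[(prev, word)]
        prev = word
    return prob

# String written with a multiword: <s> new_york city
p_multiword = (token_prob(["<s>"], "new_york")
               * token_prob(["<s>", "new_york"], "city"))

# Same string written with regular words: <s> new york city
p_words = (token_prob(["<s>"], "new")
           * token_prob(["<s>", "new"], "york")
           * token_prob(["<s>", "new", "york"], "city"))
```

Both products multiply the same three reference probabilities, so the two spellings of the string receive the same score.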
Note that the inverse operation (expanding a multiword N-gram to contain
only regular words) is subsumed by the
ngram -expand-classes
function.
OPTIONS
Each filename argument can be an ASCII file, a
compressed file (name ending in .Z or .gz), or "-" to indicate
stdin/stdout.
- -help
Print option summary.
- -version
Print version information.
- -order n
Set the maximal N-gram order to be used from the reference model.
NOTE: The order of the model is not set automatically when a model
file is read, so the same file can be used at various orders.
To use models of order higher than 3, this option must always be
specified.
- -multi-order n
The maximal N-gram order in the multiword-based model.
- -debug level
Set the debugging output level (0 means no debugging output).
- -vocab file
Words to be added to the model.
In particular, this should include all the multiwords to be added.
- -multi-char C
Character used to delimit component words in multiwords
(an underscore character by default).
- -lm file
Reference N-gram model.
- -multi-lm file
Model containing multiwords; the N-grams in this model will be assigned
new probabilities based on the reference model.
If this option is not given, the multiword model is generated by adding
multiword N-grams to the reference model.
- -prune-unseen-ngrams
This option prevents the insertion of multiword N-grams whose component
N-grams are not contained in the reference model.
For example, for a multiword bigram "a_b c_d" to be inserted, a trigram
reference model must contain the trigrams "a b c" and "b c d".
If the reference model were a bigram LM, it would have to contain
"a b", "b c", and "c d".
This option is important to control the size of the multiword LM for
large vocabularies.
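The containment test in the example above can be sketched as a sliding window over the expanded component words. This is a reading inferred from the "a_b c_d" example only; the function names are hypothetical, and the program's real check may differ in detail.

```python
def component_ngrams(multiword_ngram, ref_order, multi_char="_"):
    """List the reference-model N-grams spanned by a multiword N-gram.

    Assumption: the components are the windows of length `ref_order`
    over the expanded word sequence, as the example in the text suggests.
    """
    words = []
    for tok in multiword_ngram:
        words.extend(tok.split(multi_char))
    n = min(ref_order, len(words))
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def can_insert(multiword_ngram, ref_ngrams, ref_order, multi_char="_"):
    """True if every component N-gram is present in the reference model."""
    return all(ng in ref_ngrams
               for ng in component_ngrams(multiword_ngram, ref_order, multi_char))

trigram_parts = component_ngrams(("a_b", "c_d"), 3)
bigram_parts = component_ngrams(("a_b", "c_d"), 2)

# Hypothetical reference model containing only one of the two needed trigrams.
allowed = can_insert(("a_b", "c_d"), {("a", "b", "c")}, 3)
```

Here trigram_parts holds "a b c" and "b c d", bigram_parts holds "a b", "b c", and "c d", and allowed is False because "b c d" is missing from the hypothetical reference set.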
- -write-lm file
Output location of the generated multiword model.
SEE ALSO
ngram(1), ngram-format(5).
BUGS
This program is a hack for cases where the original training data is
not available and a multiword model has to be generated from an existing
model.
The resulting model is no longer properly normalized, since the
same word string can potentially be represented with or without multiwords.
The generation of multiword N-grams uses a heuristic algorithm that
works well for bigrams and trigrams, but is not exhaustive.
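The normalization problem can be made concrete by enumerating the ways one word string can be written once multiwords are in the vocabulary. The sketch below is illustrative only; the function name and the sample vocabulary are hypothetical.

```python
def segmentations(words, multiwords, multi_char="_"):
    """Enumerate every way to write `words` using single words and
    multiwords drawn from the set `multiwords`."""
    if not words:
        return [[]]
    results = []
    for end in range(1, len(words) + 1):
        token = multi_char.join(words[:end])
        # A single word is always allowed; longer spans must be known multiwords.
        if end == 1 or token in multiwords:
            for rest in segmentations(words[end:], multiwords, multi_char):
                results.append([token] + rest)
    return results

forms = segmentations(["new", "york", "city"], {"new_york"})
```

Here forms contains both "new york city" and "new_york city". Since each form is assigned the probability of the underlying word string, that probability mass is counted once per form, and the total over all token strings exceeds one.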
AUTHOR
Andreas Stolcke <stolcke@speech.sri.com>.
Copyright 2000-2004 SRI International