multi-ngram

NAME

multi-ngram - build multiword N-gram models

SYNOPSIS

multi-ngram [ -help ] option ...

DESCRIPTION

multi-ngram builds N-gram language models that contain multiwords, i.e., compound words formed by concatenating words from a given prior model. It will optionally generate multiword N-grams and insert them into an existing reference N-gram model, so as to cover the multiwords occurring in a specified vocabulary. It will then assign probabilities to the multiword N-grams so that word strings containing multiwords have the same probabilities as the corresponding strings of component words in the reference model.

Note that the inverse operation (expanding a multiword N-gram to contain only regular words) is subsumed by the ngram -expand-classes function.
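
As a rough illustration of the probability assignment described above, the following Python sketch computes the probability of a multiword as the chain-rule product of its component words. The toy table ref_probs, its values, and the helper multiword_prob are hypothetical; backoff and order truncation are ignored, and this is not the algorithm multi-ngram actually implements.

```python
# Toy reference model (hypothetical values): maps (history, word) to P(word | history).
ref_probs = {
    (("i",), "want"): 0.2,
    (("i", "want"), "to"): 0.5,
}

def multiword_prob(history, multiword, probs, multi_char="_"):
    """Return P(multiword | history) as the product of its component-word probabilities."""
    context = tuple(history)
    p = 1.0
    for word in multiword.split(multi_char):
        # Raises KeyError if the N-gram is missing; a real model would back off instead.
        p *= probs[(context, word)]
        # Each component word extends the history for the next component.
        context = context + (word,)
    return p

# P(want_to | i) = P(want | i) * P(to | i want)
p = multiword_prob(("i",), "want_to", ref_probs)
```

With this assignment, the string "i want_to" receives the same conditional probability as "i want to" in the toy reference table, which is the property the description asks for.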

OPTIONS

Each filename argument can be an ASCII file, or a compressed file (name ending in .Z or .gz), or "-" to indicate stdin/stdout.

-help: Print option summary.
-version: Print version information.
-order n: Set the maximal N-gram order to be used from the reference model. NOTE: The order of the model is not set automatically when a model file is read, so the same file can be used at various orders. To use models of order higher than 3 it is always necessary to specify this option.
-multi-order n: The maximal N-gram order in the multiword-based model.
-debug level: Set the debugging output level (0 means no debugging output).
-vocab file: Words to be added to the model. In particular, this should include all the multiwords to be added.
-multi-char C: Character used to delimit component words in multiwords (an underscore character by default).
-lm file: Reference N-gram model.
-multi-lm file: Model containing multiwords; the N-grams in this model will be assigned new probabilities based on the reference model. If this option is not given, the multiword model will be generated by adding multiword N-grams to the reference model.
-prune-unseen-ngrams: This option prevents the insertion of multiword N-grams whose component N-grams are not contained in the reference model. For example, for a multiword bigram "a_b c_d" to be inserted, a trigram reference model must contain the trigrams "a b c" and "b c d". If the reference model were a bigram LM, it would have to contain "a b", "b c", and "c d". This option is important to control the size of the multiword LM for large vocabularies.
-write-lm file: Output location of the generated multiword model.
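
The coverage check described under -prune-unseen-ngrams can be sketched in Python as follows. This is one reading of the example given for that option, not the tool's actual code; the function names component_ngrams and is_covered and the set-based toy model are assumptions made for illustration. The delimiter defaults to the underscore, as for -multi-char.

```python
def component_ngrams(multiword_ngram, order, multi_char="_"):
    """Expand a multiword N-gram into plain words and list its component N-grams."""
    words = []
    for token in multiword_ngram.split():
        words.extend(token.split(multi_char))
    # Slide a window of the reference model's order over the expanded words.
    n = min(order, len(words))
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def is_covered(multiword_ngram, reference_ngrams, order, multi_char="_"):
    """True if every component N-gram is present in the reference N-gram set."""
    return all(gram in reference_ngrams
               for gram in component_ngrams(multiword_ngram, order, multi_char))

# Toy trigram reference model (hypothetical contents).
trigram_lm = {("a", "b", "c"), ("b", "c", "d")}

print(component_ngrams("a_b c_d", 3))        # [('a', 'b', 'c'), ('b', 'c', 'd')]
print(is_covered("a_b c_d", trigram_lm, 3))  # True
print(component_ngrams("a_b c_d", 2))        # [('a', 'b'), ('b', 'c'), ('c', 'd')]
```

Under this reading, a multiword N-gram is inserted only when is_covered returns true, which keeps the multiword model from growing with N-grams the reference model never observed.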

BUGS

This program is a hack for cases where the original training data is not available and a multiword model has to be generated from an existing model.
The resulting model is no longer properly normalized, since the same word string can potentially be represented with or without multiwords.
The generation of multiword N-grams uses a heuristic algorithm that works well for bigrams and trigrams, but is not exhaustive.