Both BINDUMP and normal files can be read from gzip-compressed files. The .gz extension is detected automatically; do not add it to the filename.
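The automatic .gz lookup described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the reader's actual code: the function name open_lm_file is hypothetical, and the order of the checks (plain file first, then the .gz variant) is assumed, since the text does not say which one wins when both exist.

```python
import gzip
import os


def open_lm_file(filename):
    """Open a language model file for binary reading.

    Hypothetical helper: tries `filename` as given, then `filename + ".gz"`.
    The lookup order is an assumption, not taken from the original reader.
    """
    if os.path.exists(filename):
        return open(filename, "rb")
    gz_name = filename + ".gz"
    if os.path.exists(gz_name):
        # gzip.open in "rb" mode transparently decompresses while reading.
        return gzip.open(gz_name, "rb")
    raise FileNotFoundError(filename)
```

The caller passes the name without the .gz ending in both cases, which matches the rule stated above.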
<s> is the marker for the beginning of an utterance.
</s> is the marker for the end of an utterance.
The value in the first column is the log probability of the
unigram, bigram or trigram. The value in the last column is the
backoff value, which is used when the corresponding bigram (in the
unigram section) or trigram (in the bigram section) is unseen.
A file always has at least one unigram and one bigram.
Trigrams can be missing. If there are no trigrams, there should be no
backoff values for bigrams either.
logP(w1,w2,w3) = logP(w1,w2,w3)               if trigram w1,w2,w3 in list
               = backoff(w1,w2) + logP(w2,w3) if bigram w2,w3 in list
               = backoff(w2) + logP(w3)       else
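The backoff rule above can be illustrated with a small lookup function. This is a minimal sketch, not the implementation used by any particular decoder. The function name trigram_logprob and the dictionary layout are hypothetical. One assumption goes beyond the text: when the context (w1,w2) or w2 has no entry of its own, its backoff value is taken to be 0.0. The tables hold the values of the small example file in this section, with -00.0 written as 0.0.

```python
# Tables keyed by word tuples. Values are (log probability, backoff value);
# trigrams carry no backoff value, so they map to a plain log probability.
UNIGRAMS = {
    "<s>": (-99.0, 0.0),
    "</s>": (0.0, -99.0),
    "ja": (-3.5, -1.5),
    "nein": (-3.5, -1.5),
}
BIGRAMS = {
    ("<s>", "</s>"): (0.0, -0.5),
    ("</s>", "<s>"): (-99.0, -0.6),
    ("<s>", "ja"): (-4.2, -0.5),
    ("ja", "</s>"): (-6.7, -0.5),
}
TRIGRAMS = {
    ("ja", "ja", "</s>"): -3.2,
    ("<s>", "ja", "ja"): -2.3,
    ("nein", "ja", "ja"): -3.3,
}


def trigram_logprob(w1, w2, w3, unigrams=UNIGRAMS, bigrams=BIGRAMS,
                    trigrams=TRIGRAMS):
    """Return the log probability for w1,w2,w3 using the three-case rule."""
    # Case 1: the trigram itself is listed.
    if (w1, w2, w3) in trigrams:
        return trigrams[(w1, w2, w3)]
    # Case 2: back off to the bigram w2,w3.
    if (w2, w3) in bigrams:
        # Assumption: an unlisted context w1,w2 contributes a backoff of 0.0.
        backoff = bigrams.get((w1, w2), (0.0, 0.0))[1]
        return backoff + bigrams[(w2, w3)][0]
    # Case 3: back off to the unigram w3 (raises KeyError for unknown words).
    backoff = unigrams.get(w2, (0.0, 0.0))[1]
    return backoff + unigrams[w3][0]


if __name__ == "__main__":
    print(trigram_logprob("<s>", "ja", "ja"))     # trigram is listed
    print(trigram_logprob("<s>", "ja", "</s>"))   # falls back to bigram ja </s>
    print(trigram_logprob("ja", "nein", "ja"))    # falls back to unigram ja
```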
This is a silly little nist file. Put in here when and where the file
was built, and if possible on which database using which program.

\data\
ngram 1=4
ngram 2=4
ngram 3=3

\1-grams:
-99.0 <s> 0.0
-00.0 </s> -99.0
-3.5 ja -1.5
-3.5 nein -1.5

\2-grams:
-00.0 <s> </s> -0.5
-99.0 </s> <s> -0.6
-4.2 <s> ja -0.5
-6.7 ja </s> -0.5

\3-grams:
-3.2 ja ja </s>
-2.3 <s> ja ja
-3.3 nein ja ja

\end\
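To make the layout of the example file above concrete, here is a rough sketch of a reader for the text format. It is written under assumptions and is not the original parser: the name parse_lm and the returned data structure are hypothetical, everything before \data\ is treated as a free-form comment, and a missing last column is stored as None.

```python
import re

EXAMPLE = r"""This is a silly little nist file.

\data\
ngram 1=4
ngram 2=4
ngram 3=3

\1-grams:
-99.0 <s> 0.0
-00.0 </s> -99.0
-3.5 ja -1.5
-3.5 nein -1.5

\2-grams:
-00.0 <s> </s> -0.5
-99.0 </s> <s> -0.6
-4.2 <s> ja -0.5
-6.7 ja </s> -0.5

\3-grams:
-3.2 ja ja </s>
-2.3 <s> ja ja
-3.3 nein ja ja

\end\
"""


def parse_lm(text):
    """Parse the text format into (counts, ngrams).

    counts: order -> declared number of entries
    ngrams: order -> {word tuple: (log probability, backoff or None)}
    """
    counts, ngrams = {}, {}
    order = None       # order of the section currently being read
    in_data = False    # becomes True once the \data\ marker was seen
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "\\data\\":
            in_data = True
            continue
        if not in_data:
            continue   # header comment before \data\ is skipped
        if line == "\\end\\":
            break
        section = re.fullmatch(r"\\(\d+)-grams:", line)
        if section:
            order = int(section.group(1))
            ngrams[order] = {}
            continue
        if order is None:
            declared = re.fullmatch(r"ngram\s+(\d+)\s*=\s*(\d+)", line)
            if not declared:
                raise ValueError("unexpected line: " + line)
            counts[int(declared.group(1))] = int(declared.group(2))
            continue
        fields = line.split()
        if len(fields) not in (order + 1, order + 2):
            raise ValueError("malformed entry: " + line)
        words = tuple(fields[1:1 + order])
        backoff = float(fields[order + 1]) if len(fields) == order + 2 else None
        ngrams[order][words] = (float(fields[0]), backoff)
    return counts, ngrams


if __name__ == "__main__":
    counts, ngrams = parse_lm(EXAMPLE)
    for n in sorted(counts):
        print(n, counts[n], len(ngrams.get(n, {})))
```

Comparing the declared counts with the number of entries actually read, as the final loop does, is a cheap consistency check for hand-edited files.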