compute_unigram.c File Reference

Calculate the probabilities for each 1-gram.

#include <math.h>
#include <stdio.h>
#include "ngram.h"
#include "idngram2lm.h"
#include "pc_libs/pc_general.h"

Functions

void compute_unigram (ng_t *ng, int verbosity)


Detailed Description

Calculate the probabilities for each 1-gram.

From comments to v1:

A) open_vocab == 0:

Closed vocab model. P(UNK)=0. UNK is not part of the vocab. The discount mass is divided equally among all zerotons (vocabulary words with a training count of zero). If this results in P(zeroton)>Q*P(singleton) for some appropriate fraction Q, where a singleton is a word seen exactly once, P(zeroton) is reduced to Q*P(singleton) and the entire array is then renormalized.

B) open_vocab == 1:

Open vocab model, where the vocab was chosen without knowing the partition of the data into training and testing, so unicount[UNK]/N is probably a reasonable estimate of P(UNK). So: Treat UNK as any other word. As in (A), the discount mass is divided among all zerotons. If some is left over (because of the constraint P(zeroton)<=Q*P(singleton)), renormalize everything to absorb it.

C) open_vocab == 2:

Open vocab model, where the vocab was defined to include all the training data, hence unicount[UNK]=0. So: The discount mass is split: one part (1-OOV_fraction) is divided among the zerotons, as above. The other part, plus any leftover from the first part, is put into P(UNK).

Note: UNK is hardwired to id=0, here and elsewhere.
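
The three cases above can be sketched in a few lines of C. This is an illustration only, not the code in compute_unigram.c: the function name sketch_unigram, the SKETCH_* constants, and the parameters discount_mass and q are hypothetical. It also assumes a single uniform discount, where every seen word keeps (1 - discount_mass) * count / N; the real function selects among several discounting methods. UNK is id 0, as noted above.

```c
#include <assert.h>

/* Hypothetical vocabulary-type codes mirroring cases A, B and C. */
enum { SKETCH_CLOSED = 0, SKETCH_OPEN_1 = 1, SKETCH_OPEN_2 = 2 };

/*
 * Fill prob[0..vocab_size] from count[0..vocab_size]; id 0 is UNK.
 * Assumption: discount_mass is the total probability mass freed by
 * discounting, and q is the cap fraction Q from the description.
 */
static void sketch_unigram(const int *count, int vocab_size, int vocab_type,
                           double discount_mass, double q, double oov_fraction,
                           double *prob)
{
    /* Only case B treats UNK (id 0) as an ordinary word. */
    int first = (vocab_type == SKETCH_OPEN_1) ? 0 : 1;
    long n_tokens = 0;
    int n_zerotons = 0;
    int i;

    assert(discount_mass >= 0.0 && discount_mass < 1.0);
    for (i = first; i <= vocab_size; i++) {
        n_tokens += count[i];
        if (count[i] == 0) n_zerotons++;
    }
    assert(n_tokens > 0);

    /* Discounted probability of a word seen once, and the zeroton cap. */
    double p_singleton = (1.0 - discount_mass) / (double)n_tokens;
    double cap = q * p_singleton;

    /* Case C reserves oov_fraction of the discount mass for UNK. */
    double zeroton_mass = discount_mass;
    if (vocab_type == SKETCH_OPEN_2) zeroton_mass *= (1.0 - oov_fraction);

    double p_zeroton = 0.0;
    if (n_zerotons > 0) {
        p_zeroton = zeroton_mass / n_zerotons;
        if (p_zeroton > cap) p_zeroton = cap;   /* P(zeroton) <= Q*P(singleton) */
    }
    /* Discount mass not handed to zerotons. */
    double leftover = discount_mass - n_zerotons * p_zeroton;

    prob[0] = 0.0;   /* stays 0 in case A; overwritten in cases B and C */
    for (i = first; i <= vocab_size; i++)
        prob[i] = (count[i] > 0) ? count[i] * p_singleton : p_zeroton;

    if (vocab_type == SKETCH_OPEN_2) {
        prob[0] = leftover;                     /* case C: UNK takes the rest */
    } else {
        double total = 1.0 - leftover;          /* cases A and B: renormalize */
        for (i = first; i <= vocab_size; i++) prob[i] /= total;
    }
}
```

With counts {0, 6, 3, 1, 0, 0}, discount_mass 0.1, q 0.5 and oov_fraction 0.5, the closed-vocab case hits the cap and renormalizes, so each zeroton ends up with exactly half the probability of the singleton. In case C the zerotons share 0.05 of the mass and the remaining 0.05 goes to UNK.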

Definition in file compute_unigram.c.


Function Documentation

void compute_unigram (ng_t *ng, int verbosity)

Definition at line 67 of file compute_unigram.c.

References ng_t::abs_disc_const, ABSOLUTE, CLOSED_VOCAB, ng_t::context_cue, ng_t::count, ng_t::count4, ng_t::count_table, ng_t::disc_range, ng_t::discounting_method, ng_t::first_id, ng_t::four_byte_counts, fprintf(), ng_t::freq_of_freq, GOOD_TURING, ng_t::gt_disc_ratio, i, ng_t::lin_disc_ratio, LINEAR, ng_t::n_unigrams, ng_t::no_of_ccs, ng_t::oov_fraction, OPEN_VOCAB_2, pc_message(), quit(), return_count(), ng_t::uni_log_probs, ng_t::uni_probs, verbosity, ng_t::vocab_size, ng_t::vocab_type, WITTEN_BELL, and ng_t::zeroton_fraction.

Referenced by main().


Generated on Tue Dec 21 13:54:46 2004 by doxygen 1.2.18