CMU 11-731 (MT & Seq2Seq) n-gram model

Language Models 1: n-gram Language Models

  • Goal: model the probability of a target sentence: $$ P(E) $$

Practical use:

  • Assess naturalness
  • Generate text

Word-by-word Computation of Probabilities

Transition from:

$$ P(E) = P(|E| = T, e_1^T) $$

To:

$$ P(E) = \prod_{t=1}^{T+1} P(e_t \mid e_1^{t-1}) $$

where $$ e_{T+1} = \langle /s \rangle $$ (the end-of-sentence symbol)
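
A minimal sketch of this chain-rule decomposition in Python; `cond_prob(word, history)` is a hypothetical stand-in for any model of $P(e_t \mid e_1^{t-1})$:

```python
# Minimal sketch: compute P(E) word by word via the chain rule.
# `cond_prob(word, history)` is a hypothetical conditional-probability model.
def sentence_prob(words, cond_prob, eos="</s>"):
    prob = 1.0
    history = []
    for w in words + [eos]:          # e_{T+1} = </s> marks the end of the sentence
        prob *= cond_prob(w, tuple(history))
        history.append(w)
    return prob
```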

Count-based n-gram Language Models

$$ P_{ML}(e_t \mid e_1^{t-1}) = \frac{c_{\text{prefix}}(e_1^t)}{c_{\text{prefix}}(e_1^{t-1})} $$
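
A toy sketch of collecting these counts from a tokenized corpus and turning them into maximum-likelihood estimates; the function names and the `<s>`/`</s>` padding tokens are illustrative, not from the lecture:

```python
from collections import Counter

def train_counts(corpus, n=2):
    # Collect n-gram and context (prefix) counts from a list of tokenized sentences.
    ngram_counts, context_counts = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            ngram_counts[context + (tokens[i],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts

def p_ml(word, context, ngram_counts, context_counts):
    # P_ML(word | context) = c(context, word) / c(context); 0 if the context was never seen.
    denom = context_counts[tuple(context)]
    return ngram_counts[tuple(context) + (word,)] / denom if denom else 0.0
```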

  • Problem: any sentence containing a prefix that never appears in the training corpus gets probability 0
  • Solution (2 steps):

    • Limit the context to a fixed window -> n-gram models (remaining problem: what if the 2-word string is still not in the corpus?)
    • Smoothing (e.g., interpolation); see the sketch after this list
  • More smoothing techniques:

    • Context-dependent smoothing coefficients
    • Back-off
    • Modified distributions
    • Modified Kneser-Ney smoothing
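
A minimal sketch of linear interpolation, assuming bigram and unigram estimators are already available; the mixture weights here are illustrative fixed values, whereas in practice they are tuned on dev data:

```python
def p_interp(word, prev, bigram_p, unigram_p, vocab_size, lam1=0.7, lam2=0.25):
    # Linear interpolation: mix bigram, unigram, and a uniform fallback distribution,
    # so the result is nonzero even when the bigram (or unigram) count is zero.
    # lam1/lam2 are illustrative; in practice they are chosen on dev data.
    lam3 = 1.0 - lam1 - lam2
    return (lam1 * bigram_p(word, prev)
            + lam2 * unigram_p(word)
            + lam3 * (1.0 / vocab_size))
```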

Evaluation of Language Models

  • Data

    • Training data
    • Dev data
    • Test data
  • Measures

    • likelihood -> log likelihood
      • reason: avoids numerical precision (underflow) problems and is mathematically convenient (derivatives are easier)
    • per-word log likelihood (log likelihood / number of words)
    • perplexity - “how confused is the model about its decision?” - Graham; see the sketch after this list
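
A sketch of computing perplexity from the per-word log likelihood on a test set, assuming a conditional probability function `cond_prob` (hypothetical name) that assigns nonzero probability to every word, e.g. after smoothing:

```python
import math

def perplexity(test_sentences, cond_prob, eos="</s>"):
    # Sum log-probabilities (avoids underflow), divide by the number of words,
    # then exponentiate: perplexity = 2 ** (-average log2 likelihood per word).
    total_log_prob, num_words = 0.0, 0
    for sent in test_sentences:
        history = []
        for w in sent + [eos]:
            total_log_prob += math.log2(cond_prob(w, tuple(history)))
            history.append(w)
            num_words += 1
    return 2 ** (-total_log_prob / num_words)
```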

Handling Unknown Words

  • Problem: some of the words in the test data do not appear in the training data
  • Solutions:
    • Assume a closed vocabulary
    • Interpolate with an unknown-word distribution
    • Add an “unk” word (see the sketch after this list)
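
A minimal sketch of the “unk” approach: replace rare training words with an `<unk>` token so the model learns a probability for unknown words; the count threshold and token name are illustrative:

```python
from collections import Counter

def replace_rare_with_unk(corpus, min_count=2, unk="<unk>"):
    # Map every training word seen fewer than `min_count` times to <unk>,
    # so the model reserves probability mass for unseen words at test time.
    counts = Counter(w for sent in corpus for w in sent)
    return [[w if counts[w] >= min_count else unk for w in sent] for sent in corpus]
```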

Further reading (done / not yet)
