- reference:
Language Models 1: n-gram Language Models
- Goal: assign a probability $$ P(E) $$ to a target sentence $E$
Practical use:
- Assess naturalness
- Generate text
Word-by-word Computation of Probabilities
Transition from:
$$ P(E) = P(|E| = T, e_1^T) $$
To:
$$ P(E) = \prod_{t=1}^{T+1} P(e_t \mid e_1^{t-1}) $$
where $$ e_{T+1} = \langle /s \rangle $$
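As a quick worked example (the sentence here is my own illustration, not from the lecture), for E = "i love cats":
$$ P(E) = P(\text{i}) \cdot P(\text{love} \mid \text{i}) \cdot P(\text{cats} \mid \text{i love}) \cdot P(\langle /s \rangle \mid \text{i love cats}) $$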
Count-based n-gram Language Models
$$ P_{ML}(e_t \mid e_1^{t-1}) = \frac{\mathrm{count}(e_1^t)}{\mathrm{count}(e_1^{t-1})} $$
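A minimal sketch of this maximum-likelihood estimate for a bigram model (the tokenization, sentence markers, and toy corpus below are my own assumptions for illustration):

```python
from collections import Counter

def train_bigram_ml(corpus):
    """Count context tokens and bigrams; sentences are lists of tokens."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    return unigrams, bigrams

def p_ml(cur, prev, unigrams, bigrams):
    """P_ML(cur | prev) = count(prev, cur) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, cur)] / unigrams[prev]

corpus = [["i", "love", "cats"], ["i", "love", "dogs"]]
unigrams, bigrams = train_bigram_ml(corpus)
print(p_ml("love", "i", unigrams, bigrams))     # 1.0
print(p_ml("cats", "love", unigrams, bigrams))  # 0.5
```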
- problem: any word sequence not seen in the training corpus gets count 0, which makes the whole sentence probability 0
solution (2 steps):
- limit the context window -> n-gram models (problem: what if the 2-word string is still not in the corpus?)
- smoothing (e.g. interpolation) - see the sketch below
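A hedged sketch of the interpolation idea, building on the bigram counts from the previous snippet (the weight lam = 0.9 and the unigram fallback are assumptions for illustration):

```python
def p_interp(cur, prev, unigrams, bigrams, lam=0.9):
    """Linear interpolation: lam * P_ML(cur|prev) + (1 - lam) * P_uni(cur)."""
    total = sum(unigrams.values())
    p_bi = bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0
    p_uni = unigrams[cur] / total if total else 0.0  # rough unigram estimate from the same counts
    return lam * p_bi + (1 - lam) * p_uni

# An unseen bigram no longer gets probability exactly 0:
print(p_interp("dogs", "i", unigrams, bigrams))  # 0.0125, small but nonzero
```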
More smoothing techniques:
- Context-dependent smoothing coefficients
- Back-off
- Modified distributions
- Modified Kneser-Ney smoothing (see the bigram sketch below)
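These are only named in the notes; as an illustration of the general family, here is a minimal interpolated Kneser-Ney sketch for bigrams (a simplification of the basic, not the "modified", variant; the discount d = 0.75 and the toy corpus are my own assumptions):

```python
from collections import Counter, defaultdict

def train_kn_bigram(corpus, d=0.75):
    """Interpolated Kneser-Ney for bigrams: discounted counts plus continuation probability."""
    bigrams = Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        bigrams.update(zip(tokens, tokens[1:]))

    context_total = Counter()       # c(prev, .)
    followers = defaultdict(set)    # distinct words seen after prev
    histories = defaultdict(set)    # distinct contexts seen before cur
    for (prev, cur), c in bigrams.items():
        context_total[prev] += c
        followers[prev].add(cur)
        histories[cur].add(prev)
    n_bigram_types = len(bigrams)

    def p_kn(cur, prev):
        p_cont = len(histories[cur]) / n_bigram_types            # continuation probability
        if context_total[prev] == 0:
            return p_cont                                        # unseen context: back off entirely
        discounted = max(bigrams[(prev, cur)] - d, 0) / context_total[prev]
        lam = d * len(followers[prev]) / context_total[prev]     # leftover probability mass
        return discounted + lam * p_cont
    return p_kn

p_kn = train_kn_bigram([["i", "love", "cats"], ["i", "love", "dogs"]])
print(p_kn("dogs", "love"))  # seen bigram: discounted count + continuation mass
print(p_kn("cats", "i"))     # unseen bigram: nonzero thanks to the continuation term
```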
Evaluation of Language Models
Data
- Training data
- Dev data
- Test data
Measures
- likelihood -> log likelihood
- reason: numerical precision problems & mathematically convenient (derivatives)
- per-word log likelihood: log likelihood / number of words
- perplexity - “how confused is the model about its decision?” - Graham
- computed from the per-word log likelihood (see the sketch below)
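A minimal sketch of how these measures fit together, assuming a hypothetical helper p(cur, context) that returns the model's conditional probability $P(e_t \mid e_1^{t-1})$, and using natural logarithms:

```python
import math

def evaluate(test_corpus, p):
    """Return (log likelihood, per-word log likelihood, perplexity)."""
    log_lik, n_words = 0.0, 0
    for sent in test_corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        for t in range(1, len(tokens)):
            log_lik += math.log(p(tokens[t], tokens[:t]))  # sum of log probabilities
            n_words += 1
    per_word = log_lik / n_words
    perplexity = math.exp(-per_word)  # lower is better; 1 = model is never "confused"
    return log_lik, per_word, perplexity
```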
Handling Unknown Words
- problem: some of the words in the test data do not appear in the training data
- solutions:
- Assume a closed vocabulary
- Interpolate with an unknown-word distribution
- Add an “unk” word (see the sketch below)
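A common way to realize the "unk" option, as a sketch (the frequency threshold min_count = 2 and the token name "<unk>" are my own choices):

```python
from collections import Counter

UNK = "<unk>"

def build_vocab(train_corpus, min_count=2):
    """Keep words seen at least min_count times; everything else maps to <unk>."""
    counts = Counter(w for sent in train_corpus for w in sent)
    return {w for w, c in counts.items() if c >= min_count}

def unkify(sent, vocab):
    """Replace out-of-vocabulary words with the <unk> token."""
    return [w if w in vocab else UNK for w in sent]

vocab = build_vocab([["i", "love", "cats"], ["i", "love", "dogs"]])
print(unkify(["i", "love", "birds"], vocab))  # ['i', 'love', '<unk>']
```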
Further reading (done / not yet)
Large-scale language modeling
- Efficient data structures
- Distributed parameter servers
- Lossy compression algorithms
To be updated ….