- reference:
Language Models 1: n-gram Language Models
- Goal: assign a probability $$ P(E) $$ to a target sentence $E$
Practical use:
- Assess naturalness
- Generate text
Word-by-word Computation of Probabilities
Transition from:
$$ P(E) = P(|E| = T, e_1^T) $$
To:
$$ P(E) = \prod_{t=1}^{T+1} P(e_t \mid e_1^{t-1}) $$
where $$ e_{T+1} = \langle /s \rangle $$
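As a quick worked example (the sentence here is my own illustration, not from the lecture), for E = "i love cats":
$$ P(E) = P(\text{i}) \cdot P(\text{love} \mid \text{i}) \cdot P(\text{cats} \mid \text{i love}) \cdot P(\langle /s \rangle \mid \text{i love cats}) $$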
Count-based n-gram Language Models
$$ P_{ML}(e_t \mid e_1^{t-1}) = \frac{\mathrm{count}(e_1^t)}{\mathrm{count}(e_1^{t-1})} $$
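A minimal sketch of this maximum-likelihood estimate for a bigram model (the tokenization, sentence markers, and toy corpus below are my own assumptions for illustration):

```python
from collections import Counter

def train_bigram_ml(corpus):
    """Count context tokens and bigrams; sentences are lists of tokens."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    return unigrams, bigrams

def p_ml(cur, prev, unigrams, bigrams):
    """P_ML(cur | prev) = count(prev, cur) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, cur)] / unigrams[prev]

corpus = [["i", "love", "cats"], ["i", "love", "dogs"]]
unigrams, bigrams = train_bigram_ml(corpus)
print(p_ml("love", "i", unigrams, bigrams))     # 1.0
print(p_ml("cats", "love", unigrams, bigrams))  # 0.5
```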
- problem: any word sequence not seen in the training corpus gets count 0, which makes the whole sentence probability 0
solution (2 steps):
- limit the context window -> n-gram models (problem: what if the 2-word string is still not in the corpus?)
- smoothing (e.g. interpolation) - see the sketch below
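A hedged sketch of the interpolation idea, building on the bigram counts from the previous snippet (the weight lam = 0.9 and the unigram fallback are assumptions for illustration):

```python
def p_interp(cur, prev, unigrams, bigrams, lam=0.9):
    """Linear interpolation: lam * P_ML(cur|prev) + (1 - lam) * P_uni(cur)."""
    total = sum(unigrams.values())
    p_bi = bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0
    p_uni = unigrams[cur] / total if total else 0.0  # rough unigram estimate from the same counts
    return lam * p_bi + (1 - lam) * p_uni

# An unseen bigram no longer gets probability exactly 0:
print(p_interp("dogs", "i", unigrams, bigrams))  # 0.0125, small but nonzero
```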
More smoothing techniques:
- Context-dependent smoothing coefficients
- Back-off
- Modified distributions
- Modified Kneser-Ney smoothing (see the bigram sketch below)
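These are only named in the notes; as an illustration of the general family, here is a minimal interpolated Kneser-Ney sketch for bigrams (a simplification of the basic, not the "modified", variant; the discount d = 0.75 and the toy corpus are my own assumptions):

```python
from collections import Counter, defaultdict

def train_kn_bigram(corpus, d=0.75):
    """Interpolated Kneser-Ney for bigrams: discounted counts plus continuation probability."""
    bigrams = Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        bigrams.update(zip(tokens, tokens[1:]))

    context_total = Counter()       # c(prev, .)
    followers = defaultdict(set)    # distinct words seen after prev
    histories = defaultdict(set)    # distinct contexts seen before cur
    for (prev, cur), c in bigrams.items():
        context_total[prev] += c
        followers[prev].add(cur)
        histories[cur].add(prev)
    n_bigram_types = len(bigrams)

    def p_kn(cur, prev):
        p_cont = len(histories[cur]) / n_bigram_types            # continuation probability
        if context_total[prev] == 0:
            return p_cont                                        # unseen context: back off entirely
        discounted = max(bigrams[(prev, cur)] - d, 0) / context_total[prev]
        lam = d * len(followers[prev]) / context_total[prev]     # leftover probability mass
        return discounted + lam * p_cont
    return p_kn

p_kn = train_kn_bigram([["i", "love", "cats"], ["i", "love", "dogs"]])
print(p_kn("dogs", "love"))  # seen bigram: discounted count + continuation mass
print(p_kn("cats", "i"))     # unseen bigram: nonzero thanks to the continuation term
```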
Evaluation of Language Models
Data
- Training data
- Dev data
- Test data
Measures
- likelihood -> log likelihood
- reason: numerical precision problems & mathematically convenient (derivatives)
- per-word log likelihood: log likelihood / number of words
- perplexity - “how confused is the model about its decision?” - Graham
- computed from the per-word log likelihood (see the sketch below)
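A minimal sketch of how these measures fit together, assuming a hypothetical helper p(cur, context) that returns the model's conditional probability $P(e_t \mid e_1^{t-1})$, and using natural logarithms:

```python
import math

def evaluate(test_corpus, p):
    """Return (log likelihood, per-word log likelihood, perplexity)."""
    log_lik, n_words = 0.0, 0
    for sent in test_corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        for t in range(1, len(tokens)):
            log_lik += math.log(p(tokens[t], tokens[:t]))  # sum of log probabilities
            n_words += 1
    per_word = log_lik / n_words
    perplexity = math.exp(-per_word)  # lower is better; 1 = model is never "confused"
    return log_lik, per_word, perplexity
```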
Handling Unknown Words
- problem: some of the words in the test data do not appear in the training data
- solutions:
- Assume a closed vocabulary
- Interpolate with an unknown-word distribution
- Add an “unk” word (see the sketch below)
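A common way to realize the "unk" option, as a sketch (the frequency threshold min_count = 2 and the token name "<unk>" are my own choices):

```python
from collections import Counter

UNK = "<unk>"

def build_vocab(train_corpus, min_count=2):
    """Keep words seen at least min_count times; everything else maps to <unk>."""
    counts = Counter(w for sent in train_corpus for w in sent)
    return {w for w, c in counts.items() if c >= min_count}

def unkify(sent, vocab):
    """Replace out-of-vocabulary words with the <unk> token."""
    return [w if w in vocab else UNK for w in sent]

vocab = build_vocab([["i", "love", "cats"], ["i", "love", "dogs"]])
print(unkify(["i", "love", "birds"], vocab))  # ['i', 'love', '<unk>']
```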
Further reading (done / not yet)
Large-scale language modeling
- Efficient data structures
- Distributed parameter servers
- Lossy compression algorithms
To be updated ….