- Reference translations
Manual Evaluation
- Adequacy
- Fluency
- Rank-based Evaluation
Automatic Evaluation and BLEU
BLEU Score
- n-gram Precision
- Brevity Penalty
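A minimal sketch of how these two components combine into the BLEU score, for a single tokenized candidate/reference pair (real implementations such as sacreBLEU additionally handle multiple references and tokenization):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Plain BLEU for one candidate/reference pair (lists of tokens).

    Geometric mean of the clipped 1..max_n-gram precisions, times the
    brevity penalty. Returns 0.0 if any precision is zero.
    """
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clipping: a candidate n-gram is credited at most as often
        # as it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if overlap == 0:
            return 0.0
        log_prec_sum += math.log(overlap / total)
    # Brevity penalty: punish candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(log_prec_sum / max_n)
```

Note that a single unmatched 4-gram already drives the whole score to zero, which motivates the smoothing discussed below for sentence-level use.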
Notes on BLEU and Automatic Evaluation
- Comparability
- Corpus Level vs. Sentence Level
- At the sentence level, use smoothed BLEU or BLEU+1
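One common smoothing scheme, in the spirit of BLEU+1, adds one to the numerator and denominator of the higher-order n-gram precisions so that a single missing 4-gram no longer zeroes the whole sentence score; a sketch (exact BLEU+1 formulations vary):

```python
import math
from collections import Counter

def smoothed_sentence_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with add-one (BLEU+1-style) smoothing of
    the n-gram precisions for n > 1."""
    def ngram_counts(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if n == 1:
            if overlap == 0:
                return 0.0  # no word overlap at all
            p = overlap / max(total, 1)
        else:
            # Add-one smoothing keeps the geometric mean nonzero.
            p = (overlap + 1) / (total + 1)
        log_prec_sum += math.log(p)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(log_prec_sum / max_n)
```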
Significance and Stability Concerns
Randomness in Data Selection
- Bootstrap resampling: score the systems on many random subsets of the test set
- If one system is consistently better on all subsets, we can assume the improvement is stable
- How much variance is there? –> standard bootstrap
- How certain are we that one system is better? –> paired bootstrap
Randomness in Training
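A minimal sketch of the paired bootstrap idea, assuming per-sentence scores for two systems on the same test set (for BLEU one would recompute the corpus score per resample from sufficient statistics rather than average sentence scores):

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Paired bootstrap test: on what fraction of resampled test sets
    does system A beat system B?

    scores_a / scores_b: per-sentence scores of the two systems on the
    SAME sentences. Each resample draws sentences with replacement and
    compares the two systems on that subset.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    idx = range(len(scores_a))
    wins = 0
    for _ in range(n_samples):
        sample = [rng.choice(idx) for _ in idx]  # resample with replacement
        if sum(scores_a[i] for i in sample) > sum(scores_b[i] for i in sample):
            wins += 1
    return wins / n_samples
```

A result close to 1.0 (conventionally at least 0.95) suggests the difference between the systems is stable rather than an artifact of this particular test set.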
Other Automatic Evaluation Metrics
- METEOR
- Translation Edit Rate (TER)
- Focused measures
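TER counts the edits (insertions, deletions, substitutions, and block shifts) needed to turn the hypothesis into the reference, normalized by reference length. A sketch that omits the shift operation, so it is really word error rate, an upper bound on true TER:

```python
def simple_ter(candidate, reference):
    """Simplified TER: word-level edit distance / reference length.

    Real TER also allows moving a contiguous block ("shift") at cost 1;
    this sketch omits shifts, making it plain word error rate.
    """
    c, r = len(candidate), len(reference)
    # Standard Levenshtein dynamic program over words.
    prev = list(range(r + 1))
    for i in range(1, c + 1):
        cur = [i] + [0] * r
        for j in range(1, r + 1):
            sub = prev[j - 1] + (candidate[i - 1] != reference[j - 1])
            cur[j] = min(sub, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[r] / max(r, 1)
```

Lower is better (0.0 means the hypothesis matches the reference exactly), in contrast to BLEU and METEOR, where higher is better.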
Meta-evaluation
- Pearson’s correlation
- Spearman’s rank correlation
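Both correlations, used here to compare metric scores against human judgments, fit in a few lines of standard Python; Spearman's coefficient is simply Pearson's applied to the ranks (this sketch ignores tie handling):

```python
import math

def pearson(xs, ys):
    """Pearson's correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson on the ranks (no ties)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=vs.__getitem__)
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson(ranks(xs), ranks(ys))
```

Spearman rewards any monotone relationship between metric and human scores, while Pearson also requires the relationship to be linear.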