CMU 11-731 (MT & Seq2Seq): Evaluating MT Systems

Manual Evaluation

  • Adequacy: does the translation convey the meaning of the source sentence?
  • Fluency: is the output fluent, natural text in the target language, regardless of the source?
  • Rank-based Evaluation: instead of assigning absolute scores, annotators rank the outputs of several systems against each other

Automatic Evaluation and BLEU

  • BLEU Score

    • n-gram Precision: the fraction of n-grams in the hypothesis (typically up to 4-grams) that also appear in the reference, with counts clipped so an n-gram cannot match more often than it occurs in the reference
    • Brevity Penalty: penalizes hypotheses shorter than the reference, since precision alone would reward very short outputs (see the sketch after this list)
  • Notes on BLEU and Automatic Evaluation

    • Comparability: BLEU scores are not comparable across different test sets, numbers of references, or tokenization schemes
    • Corpus Level vs. Sentence Level: BLEU is designed as a corpus-level metric; at the sentence level the higher-order n-gram precisions are often zero, making plain BLEU uninformative
      • use smoothed BLEU or BLEU+1, which adds 1 to the higher-order n-gram counts (see the `smooth` flag in the sketch below)
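
To make the definitions above concrete, here is a minimal single-reference BLEU sketch in Python (the function name and the toy sentence pair are illustrative, not from the lecture). The `smooth` flag implements the BLEU+1-style add-one smoothing mentioned above:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of the token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hypothesis, reference, max_n=4, smooth=False):
    """Single-reference BLEU. With smooth=True, higher-order n-gram
    counts get +1 (BLEU+1 style) so sentence-level scores stay nonzero."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped matches: a hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(len(hyp) - n + 1, 0)
        if smooth and n > 1:  # BLEU+1: smooth the higher-order precisions
            matches, total = matches + 1, total + 1
        if matches == 0 or total == 0:
            return 0.0  # unsmoothed BLEU is zero if any precision is zero
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: punish hypotheses shorter than the reference,
    # since n-gram precision alone rewards overly short output.
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat is on the mat", smooth=True))
```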

Significance and Stability Concerns

  • Randomness in Data Selection

    • bootstrap resampling (random subsets)
      Repeatedly draw random subsets of the test set (with replacement) and re-evaluate on each. If one system is consistently better across the resampled subsets, we can conclude that the difference is stable rather than an artifact of the particular test set.

      • how much variance does a system's score have → standard bootstrap
      • how certain are we that one system is better → paired bootstrap (see the sketch after this list)
  • Randomness in Training
    Training itself is stochastic (random initialization, data ordering), so a single training run can mislead; comparing results averaged over multiple runs with different seeds is more reliable.
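
A minimal sketch of both bootstrap variants, assuming per-sentence quality scores for each system (function names are illustrative). Note that for a corpus-level metric like BLEU one would recompute the metric on each resampled subset rather than averaging sentence scores, as done here for simplicity:

```python
import random

def resample_indices(n):
    """Draw one bootstrap sample: n indices with replacement."""
    return [random.randrange(n) for _ in range(n)]

def standard_bootstrap(scores, num_samples=1000):
    """Standard bootstrap: estimate the variance of a system's mean
    score by recomputing it on many resampled subsets."""
    n = len(scores)
    means = []
    for _ in range(num_samples):
        idx = resample_indices(n)
        means.append(sum(scores[i] for i in idx) / n)
    mean = sum(means) / num_samples
    var = sum((m - mean) ** 2 for m in means) / num_samples
    return mean, var

def paired_bootstrap(scores_a, scores_b, num_samples=1000):
    """Paired bootstrap (Koehn, 2004): resample the *same* sentences
    for both systems and count how often system A wins."""
    assert len(scores_a) == len(scores_b)
    n = len(scores_a)
    wins_a = 0
    for _ in range(num_samples):
        idx = resample_indices(n)
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins_a += 1
    # If A wins on more than 95% of subsets, the difference is
    # significant at p < 0.05.
    return wins_a / num_samples
```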

Other Automatic Evaluation Metrics

  • METEOR: considers both precision and recall, and matches words not only exactly but also via stems and synonyms
  • Translation Edit Rate (TER): the number of edits (insertions, deletions, substitutions, and shifts) needed to turn the hypothesis into the reference, normalized by reference length (a simplified sketch follows this list)
  • Focused measures: metrics that target a specific aspect of translation quality rather than overall quality
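
A simplified TER sketch: word-level edit distance divided by reference length, omitting the block-shift operation of full TER (`simple_ter` is an illustrative name, not a standard API):

```python
def simple_ter(hypothesis, reference):
    """Simplified TER: insertions, deletions, and substitutions over
    words, divided by reference length. Full TER also allows shifts."""
    hyp, ref = hypothesis.split(), reference.split()
    # Standard dynamic-programming edit distance over words.
    dist = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dist[i][0] = i
    for j in range(len(ref) + 1):
        dist[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution/match
    return dist[len(hyp)][len(ref)] / len(ref)

print(simple_ter("the cat sat on the mat", "the cat is on the mat"))
```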

Meta-evaluation

  • Pearson’s correlation: measures linear correlation between metric scores and human judgments
  • Spearman’s rank correlation: measures how well the metric preserves the ranking given by human judgments (see the sketch below)
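
A short sketch of both correlations using scipy.stats; the metric scores and human judgments below are made-up numbers for illustration:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: an automatic metric's scores and human adequacy
# judgments for the same five systems (values invented for illustration).
metric_scores = [0.31, 0.28, 0.35, 0.22, 0.30]
human_scores = [3.9, 3.5, 4.2, 2.8, 3.7]

r, _ = pearsonr(metric_scores, human_scores)     # linear correlation
rho, _ = spearmanr(metric_scores, human_scores)  # rank correlation
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```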