Code for Statistical Significance Testing for MT Evaluation Metrics
IMPORTANT NOTE: A newer version of this code, compatible with mteval-v13a,
is available here.
The version below may exhibit errors due to a bug in mteval-v11b (corrected as of mteval-v13a). Thanks to Alex Fraser for pointing this out.
This page contains scripts to perform paired bootstrap resampling (Koehn, 2004) for
three commonly-used MT automatic evaluation metrics: BLEU (Papineni et al., 2002),
NIST (Doddington, 2002), and METEOR (Banerjee and Lavie, 2005).
All of these metrics compute document-level scores from a small amount of information (sufficient statistics) gathered from each segment, so once this segment-level information is available we can rapidly perform any number of paired-sample comparisons.
We therefore proceed in two steps: (1) compute segment-level statistics from the hypothesis and reference SGML files for the two systems we wish to compare, writing them to two files, and (2) read in the segment-level statistics for the two systems and perform bootstrap resampling to test for significance.
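For concreteness, here is a minimal Perl sketch of the kind of per-segment information step (1) gathers for BLEU: for each segment, the clipped n-gram match counts and n-gram totals for n = 1..4, plus the hypothesis and reference lengths. The tab-separated format shown here is purely illustrative and is not the actual format written by mteval-v11b-sig.pl.

#!/usr/bin/perl
# Illustrative only: per-segment BLEU sufficient statistics (single reference).
use strict;
use warnings;

sub ngram_counts {
    my ($tokens, $n) = @_;
    my %counts;
    for my $i (0 .. scalar(@$tokens) - $n) {
        $counts{ join(' ', @{$tokens}[$i .. $i + $n - 1]) }++;
    }
    return \%counts;
}

sub segment_stats {
    my ($hyp, $ref) = @_;
    my @h = split ' ', $hyp;
    my @r = split ' ', $ref;
    my @stats = (scalar(@h), scalar(@r));        # hypothesis length, reference length
    for my $n (1 .. 4) {
        my $hc = ngram_counts(\@h, $n);
        my $rc = ngram_counts(\@r, $n);
        my ($match, $total) = (0, 0);
        for my $gram (keys %$hc) {
            $total += $hc->{$gram};
            my $r_count = exists $rc->{$gram} ? $rc->{$gram} : 0;
            $match += $hc->{$gram} < $r_count ? $hc->{$gram} : $r_count;   # clipped match count
        }
        push @stats, $match, $total;
    }
    return join("\t", @stats);                   # one tab-separated line per segment
}

# Print one line of statistics for a toy segment pair.
print segment_stats("the cat sat on the mat", "the cat was on the mat"), "\n";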
We make only minimal changes to the original evaluation metric scripts for step (1) so that they print out this segment-level
information. Step (2) is then handled by a separate script that performs the test
by sampling segments from the documents with replacement and computing document-level
scores on each set of samples; small sections of the original scripts are reused
to compute these document-level scores. Sampling proceeds for a user-specified
number of iterations, and on each iteration the winning system under each metric is recorded.
Given a p-value, significance is then tested for each metric using the fraction of samples on which system 1
performed better than system 2.
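As a rough illustration of step (2), the Perl sketch below draws segments with replacement (using the same indices for both systems), recomputes a document-level BLEU score from the resampled statistics, and reports the fraction of samples on which system 1 wins. It assumes the toy statistics format from the sketch above and covers only BLEU; the actual paired_bootstrap_resampling_nistbleu.pl script reads the files produced by the modified mteval script and also handles the NIST score.

#!/usr/bin/perl
# Minimal sketch of paired bootstrap resampling over per-segment BLEU statistics.
# Assumes each statistics line is: hyp_len ref_len m1 t1 m2 t2 m3 t3 m4 t4
# (the toy format from the sketch above, not the actual scripts' format).
use strict;
use warnings;

# Document-level BLEU from a collection of per-segment statistics lines.
sub bleu_from_stats {
    my @lines = @_;
    my ($hyp_len, $ref_len) = (0, 0);
    my @match = (0) x 4;
    my @total = (0) x 4;
    for my $line (@lines) {
        my @f = split ' ', $line;
        $hyp_len += $f[0];
        $ref_len += $f[1];
        for my $n (0 .. 3) {
            $match[$n] += $f[2 + 2 * $n];
            $total[$n] += $f[3 + 2 * $n];
        }
    }
    my $log_prec = 0;
    for my $n (0 .. 3) {
        return 0 if $match[$n] == 0;                    # zero n-gram precision
        $log_prec += log($match[$n] / $total[$n]) / 4;
    }
    my $bp = $hyp_len > $ref_len ? 1 : exp(1 - $ref_len / $hyp_len);   # brevity penalty
    return $bp * exp($log_prec);
}

sub paired_bootstrap {
    my ($stats1, $stats2, $num_samples, $p_value) = @_;
    my $n = scalar @$stats1;
    my $wins1 = 0;
    for my $iter (1 .. $num_samples) {
        # Draw $n segment indices with replacement; both systems use the same indices.
        my @idx = map { int(rand($n)) } (1 .. $n);
        $wins1++ if bleu_from_stats(@{$stats1}[@idx]) > bleu_from_stats(@{$stats2}[@idx]);
    }
    my $frac = $wins1 / $num_samples;
    printf "system 1 wins on %.1f%% of %d samples\n", 100 * $frac, $num_samples;
    printf "system 1 better at p = %g: %s\n", $p_value, $frac >= 1 - $p_value ? "yes" : "no";
}

# Toy example: two systems, two segments each (statistics lines as above).
my @stats1 = ("6\t6\t4\t6\t3\t5\t2\t4\t1\t3", "5\t6\t5\t5\t3\t4\t2\t3\t1\t2");
my @stats2 = ("6\t6\t3\t6\t2\t5\t1\t4\t0\t3", "5\t6\t4\t5\t3\t4\t2\t3\t1\t2");
paired_bootstrap(\@stats1, \@stats2, 1000, 0.05);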
The following scripts are provided:
mteval-v11b-sig.pl -- modified version of the NIST BLEU v11b evaluation script that can print segment-level statistics
meteor-sig.pl -- modified version of the METEOR v0.6 evaluation script that does the same
paired_bootstrap_resampling_nistbleu.pl -- performs paired bootstrap resampling for the NIST and BLEU metrics given two files containing segment-level statistics computed using the modified scripts above. The number of samples and the p-value for the test are command-line parameters.
paired_bootstrap_resampling_meteor.pl -- does the same for METEOR (requires an additional parameter indicating the target language)
compute_nistbleu_from_stats.pl -- computes NIST and BLEU scores from a file containing segment-level statistics. Not used by the paired_bootstrap_resampling_nistbleu.pl script above, but useful as a check to ensure that the statistics were output correctly by the modified NIST BLEU script.
compute_meteor_from_stats.pl -- does the same for METEOR
For questions, bug reports, etc. please contact Kevin Gimpel (kgimpel at cs.cmu.edu).
References for the evaluation metrics and the significance test:
Banerjee, S. and A. Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at ACL 2005.
Doddington, G. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. HLT 2002.
Koehn, P. Statistical Significance Tests for Machine Translation Evaluation. EMNLP 2004.
Papineni, K., S. Roukos, T. Ward, and W. Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. ACL 2002.