Code for Statistical Significance Testing for MT Evaluation Metrics
IMPORTANT NOTE: A newer version of this code, compatible with mteval-v13a,
is available here.
The version below may exhibit errors due to a bug in mteval-v11b (corrected as of mteval-v13a). Thanks to Alex Fraser for pointing this out.
This page contains scripts to perform paired bootstrap resampling (Koehn, 2004) for
three commonly-used MT automatic evaluation metrics: BLEU (Papineni et al., 2002),
NIST (Doddington, 2002), and METEOR (Banerjee and Lavie, 2005).
All of these metrics compute document-level scores from a small amount of information (sufficient statistics) gathered from each segment, so once this segment-level information is available we can rapidly perform any number of paired-sample comparisons.
We therefore proceed in two steps: (1) compute segment-level statistics from the hypothesis and reference SGML files for the two systems we wish to compare, writing them to two files, and (2) read in the segment-level statistics for the two systems and perform bootstrap resampling to test for significance.
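For concreteness, here is a minimal Perl sketch of the kind of per-segment information step (1) gathers for BLEU: for each segment, the clipped n-gram match counts and n-gram totals for n = 1..4, plus the hypothesis and reference lengths. The tab-separated format shown here is purely illustrative and is not the actual format written by mteval-v11b-sig.pl.

#!/usr/bin/perl
# Illustrative only: per-segment BLEU sufficient statistics (single reference).
use strict;
use warnings;

sub ngram_counts {
    my ($tokens, $n) = @_;
    my %counts;
    for my $i (0 .. scalar(@$tokens) - $n) {
        $counts{ join(' ', @{$tokens}[$i .. $i + $n - 1]) }++;
    }
    return \%counts;
}

sub segment_stats {
    my ($hyp, $ref) = @_;
    my @h = split ' ', $hyp;
    my @r = split ' ', $ref;
    my @stats = (scalar(@h), scalar(@r));        # hypothesis length, reference length
    for my $n (1 .. 4) {
        my $hc = ngram_counts(\@h, $n);
        my $rc = ngram_counts(\@r, $n);
        my ($match, $total) = (0, 0);
        for my $gram (keys %$hc) {
            $total += $hc->{$gram};
            my $r_count = exists $rc->{$gram} ? $rc->{$gram} : 0;
            $match += $hc->{$gram} < $r_count ? $hc->{$gram} : $r_count;   # clipped match count
        }
        push @stats, $match, $total;
    }
    return join("\t", @stats);                   # one tab-separated line per segment
}

# Print one line of statistics for a toy segment pair.
print segment_stats("the cat sat on the mat", "the cat was on the mat"), "\n";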
We make only minimal changes to the original evaluation metric scripts for step (1) so that they print out this segment-level
information. Step (2) is then handled by a separate script that performs the test
by sampling segments from the documents with replacement and computing document-level
scores on each set of samples; small sections of the original scripts are reused
to compute these document-level scores. Sampling proceeds for a user-specified
number of iterations, and on each iteration the winning system under each metric is recorded.
Given a p-value, significance is then tested for each metric using the fraction of samples on which system 1
performed better than system 2.
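As a rough illustration of step (2), the Perl sketch below draws segments with replacement (using the same indices for both systems), recomputes a document-level BLEU score from the resampled statistics, and reports the fraction of samples on which system 1 wins. It assumes the toy statistics format from the sketch above and covers only BLEU; the actual paired_bootstrap_resampling_nistbleu.pl script reads the files produced by the modified mteval script and also handles the NIST score.

#!/usr/bin/perl
# Minimal sketch of paired bootstrap resampling over per-segment BLEU statistics.
# Assumes each statistics line is: hyp_len ref_len m1 t1 m2 t2 m3 t3 m4 t4
# (the toy format from the sketch above, not the actual scripts' format).
use strict;
use warnings;

# Document-level BLEU from a collection of per-segment statistics lines.
sub bleu_from_stats {
    my @lines = @_;
    my ($hyp_len, $ref_len) = (0, 0);
    my @match = (0) x 4;
    my @total = (0) x 4;
    for my $line (@lines) {
        my @f = split ' ', $line;
        $hyp_len += $f[0];
        $ref_len += $f[1];
        for my $n (0 .. 3) {
            $match[$n] += $f[2 + 2 * $n];
            $total[$n] += $f[3 + 2 * $n];
        }
    }
    my $log_prec = 0;
    for my $n (0 .. 3) {
        return 0 if $match[$n] == 0;                    # zero n-gram precision
        $log_prec += log($match[$n] / $total[$n]) / 4;
    }
    my $bp = $hyp_len > $ref_len ? 1 : exp(1 - $ref_len / $hyp_len);   # brevity penalty
    return $bp * exp($log_prec);
}

sub paired_bootstrap {
    my ($stats1, $stats2, $num_samples, $p_value) = @_;
    my $n = scalar @$stats1;
    my $wins1 = 0;
    for my $iter (1 .. $num_samples) {
        # Draw $n segment indices with replacement; both systems use the same indices.
        my @idx = map { int(rand($n)) } (1 .. $n);
        $wins1++ if bleu_from_stats(@{$stats1}[@idx]) > bleu_from_stats(@{$stats2}[@idx]);
    }
    my $frac = $wins1 / $num_samples;
    printf "system 1 wins on %.1f%% of %d samples\n", 100 * $frac, $num_samples;
    printf "system 1 better at p = %g: %s\n", $p_value, $frac >= 1 - $p_value ? "yes" : "no";
}

# Toy example: two systems, two segments each (statistics lines as above).
my @stats1 = ("6\t6\t4\t6\t3\t5\t2\t4\t1\t3", "5\t6\t5\t5\t3\t4\t2\t3\t1\t2");
my @stats2 = ("6\t6\t3\t6\t2\t5\t1\t4\t0\t3", "5\t6\t4\t5\t3\t4\t2\t3\t1\t2");
paired_bootstrap(\@stats1, \@stats2, 1000, 0.05);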
The following scripts are provided:
mteval-v11b-sig.pl -- modified version of the NIST BLEU v11b evaluation script that can print segment-level statistics
meteor-sig.pl -- modified version of the METEOR v0.6 evaluation script that does the same
paired_bootstrap_resampling_nistbleu.pl -- performs paired bootstrap resampling for the NIST and BLEU metrics given two files containing segment-level statistics computed using the modified scripts above. The number of samples and the p-value for the test are command-line parameters.
paired_bootstrap_resampling_meteor.pl -- does the same for METEOR (requires an additional parameter indicating the target language)
compute_nistbleu_from_stats.pl -- computes NIST and BLEU scores from a file containing segment-level statistics. Not used by the paired_bootstrap_resampling_nistbleu.pl script above, but useful as a check to ensure that the statistics were output correctly by the modified NIST BLEU script.
compute_meteor_from_stats.pl -- does the same for METEOR
For questions, bug reports, etc. please contact Kevin Gimpel (kgimpel at cs.cmu.edu).
References for the evaluation metrics and the significance test:
Banerjee, S. and A. Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at ACL 2005.
Doddington, G. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. HLT 2002.
Koehn, P. Statistical Significance Tests for Machine Translation Evaluation. EMNLP 2004.
Papineni, K., S. Roukos, T. Ward, and W. Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. ACL 2002.