Meteor consists of two major components: a flexible monolingual word aligner and a scorer. For machine translation evaluation, hypothesis sentences are aligned to reference sentences. Alignments are then scored to produce sentence and corpus level scores. Score and alignment information can also be used to visualize word alignments and score distributions using Meteor X-ray. For detailed information on Meteor word alignment and scoring, see Denkowski and Lavie, 2011. This paper also details the flexible matching support that allows Meteor to align words and phrases with differing surface forms.
This release includes the following software:
Meteor is released under the GNU Lesser General Public License (LGPL) and includes some files subject to the (compatible) WordNet license. See the included COPYING files for details.
Language support is divided into two groups. Fully supported languages include flexible word and phrase matching (at least one type of match other than exact) and language-specific parameters tuned to maximize correlation between Meteor scores and human judgments of translation quality. Partially supported languages include flexible word matching and use language-independent parameters chosen to generalize well across known languages.
Fully supported languages:
Language | Exact Match | Stem Match | Synonym Match | Paraphrase Match | Tuned Parameters |
English | Yes | Yes | Yes | Yes | Yes |
Arabic | Yes | No | No | Yes | Yes |
Czech | Yes | No | No | Yes | Yes |
French | Yes | Yes | No | Yes | Yes |
German | Yes | Yes | No | Yes | Yes |
Spanish | Yes | Yes | No | Yes | Yes |
Partially supported languages:
Language | Exact Match | Stem Match | Synonym Match | Paraphrase Match | Tuned Parameters |
Danish | Yes | Yes | No | No | LI |
Dutch | Yes | Yes | No | No | LI |
Finnish | Yes | Yes | No | No | LI |
Hungarian | Yes | Yes | No | No | LI |
Italian | Yes | Yes | No | No | LI |
Norwegian | Yes | Yes | No | No | LI |
Portuguese | Yes | Yes | No | No | LI |
Romanian | Yes | Yes | No | No | LI |
Russian | Yes | Yes | No | No | LI |
Swedish | Yes | Yes | No | No | LI |
Turkish | Yes | Yes | No | No | LI |
Paraphrase capability can also be added to unsupported languages. If your MT system has a bilingual phrase table, you can use Parex to build paraphrase tables and use them with Meteor. For example, if you want to evaluate a system that translates into Danish and have built a paraphrase table named paraphrase-da.gz, you can use:
java -Xmx2G -jar meteor-*.jar test reference -l da \
    -a paraphrase-da.gz -m 'exact stem paraphrase' -w '1.0 0.5 0.5'
This tells Meteor to use the paraphrase table (-a paraphrase-da.gz), add the paraphrase module (-m 'exact stem paraphrase'), and add a weight for paraphrases (-w '1.0 0.5 0.5').
Other Languages:
Meteor is capable of scoring UTF-8 encoded data for any language. Specifying language "other" will automatically select exact matches only for alignment and language-independent scoring parameters. Remember to pre-segment, tokenize, and lowercase text as needed.
java -Xmx2G -jar meteor-*.jar test reference -l other
To call Meteor, run the following:
java -Xmx2G -jar meteor-*.jar
Running Meteor with no arguments prints the following help message:
Meteor version 1.4

Usage: java -Xmx2G -jar meteor-*.jar <test> <reference> [options]

Options:
-l language                     Fully supported: en cz de es fr ar
                                Supported with language-independent parameters:
                                  da fi hu it nl no pt ro ru se tr
-t task                         One of: rank util adq hter li tune
                                  util implies -ch
-p 'alpha beta gamma delta'     Custom parameters (overrides default)
-m 'module1 module2 ...'        Specify modules (overrides default)
                                  Any of: exact stem synonym paraphrase
-w 'weight1 weight2 ...'        Specify module weights (overrides default)
-r refCount                     Number of references (plaintext only)
-x beamSize                     (default 40)
-s wordListDirectory            (if not default for language)
-d synonymDirectory             (if not default for language)
-a paraphraseFile               (if not default for language)
-f filePrefix                   Prefix for output files (default 'meteor')
-q                              Quiet: Segment scores to stderr, final to stdout,
                                  no additional output (plaintext only)
-ch                             Character-based precision and recall
-norm                           Tokenize / normalize punctuation and lowercase
                                  (Recommended unless scoring raw output with
                                   pretokenized references)
-lower                          Lowercase only (not required if -norm specified)
-noPunct                        Do not consider punctuation when scoring
                                  (Not recommended unless special case)
-sgml                           Input is in SGML format
-mira                           Input is in MIRA format
                                  (Use '-' for test and reference files)
-vOut                           Output verbose scores (P / R / frag / score)
-ssOut                          Output sufficient statistics instead of scores
-writeAlignments                Output alignments annotated with Meteor scores
                                  (written to <prefix>-align.out)

Sample options for plaintext: -l <lang> -norm
Sample options for SGML: -l <lang> -norm -sgml
Sample options for raw output / pretokenized references: -l <lang> -lower

See README file for additional information
The simplest way to run Meteor is as follows:
java -Xmx2G -jar meteor-*.jar test reference -l en -norm
This tells Meteor to score the file "test" against "reference", where test and reference are UTF-8 encoded files that contain one sentence per line. The "-l en" option tells Meteor to use settings for English. The -norm flag tells Meteor to apply language-specific text normalization before scoring. These are the ideal settings for which language-specific parameters are tuned.
Important note: If you are scoring text in a partially supported language, do not use the -norm flag, as Meteor has no normalization rules for these languages. Instead, use your own tools for segmenting, tokenizing, and lowercasing (if desired) the test and reference text prior to scoring. Meteor will warn if the -norm flag is used with unsupported languages. For example, to score Danish text, pre-tokenize the files and run:
java -Xmx2G -jar meteor-*.jar test.da.tok reference.da.tok -l da
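The pre-processing step for partially supported languages could be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expression is a naive stand-in for a real tokenizer, and the function name simple_tokenize is hypothetical and not part of Meteor.

```python
import re

# Assumption: a token is either a run of word characters or a single
# punctuation mark. Real tokenizers handle many more cases.
_TOKEN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(line, lowercase=True):
    """Return the line as space-separated tokens, optionally lowercased."""
    if lowercase:
        line = line.lower()
    return " ".join(_TOKEN.findall(line))


def tokenize_file(in_path, out_path):
    """Tokenize a UTF-8 file line by line, keeping one sentence per line."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(simple_tokenize(line) + "\n")
```

Applying tokenize_file to both the test and reference files would yield inputs such as test.da.tok and reference.da.tok. The same tokenizer should be used for both files so that exact matches are not lost to inconsistent segmentation.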
To score the example files included with Meteor, use the following:
java -Xmx2G -jar meteor-*.jar example/xray/system1.hyp example/xray/reference -l en -norm
You should see the following output:
Meteor version: 1.4

Eval ID:        meteor-1.4-wo-en-norm-0.85_0.2_0.6_0.75-ex_st_sy_pa-1.0_0.6_0.8_0.6

Language:       English
Format:         plaintext
Task:           Ranking
Modules:        exact stem synonym paraphrase
Weights:        1.0 0.6 0.8 0.6
Parameters:     0.85 0.2 0.6 0.75

Segment 1 score:        0.447752250844953
Segment 2 score:        0.4284116369815996
Segment 3 score:        0.2772888474043816
Segment 4 score:        0.39587671218995263
Segment 5 score:        0.34983532103052495
.
.
.
Segment 2485 score:     0.29553941444479426
Segment 2486 score:     0.27829272093582047
Segment 2487 score:     0.2825995999223381
Segment 2488 score:     0.32037812996981163
Segment 2489 score:     0.33120147321343485

System level statistics:

           Test Matches                  Reference Matches
Stage      Content  Function    Total    Content  Function    Total
1            16268     20842    37110      16268     20842    37110
2              485        26      511        489        22      511
3              820       119      939        845        94      939
4             3813      3162     6975       3954      2717     6671
Total        21386     24149    45535      21556     23675    45231

Test words:             61600
Reference words:        62469
Chunks:                 20118
Precision:              0.6767347074578696
Recall:                 0.6500539115850005
f1:                     0.663126043401952
fMean:                  0.6539211143997783
Fragmentation penalty:  0.5099053526424513

Final score:            0.3204832379614146
The output contains the following in order:
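The system-level statistics in the sample output can be cross-checked by hand. The following sketch assumes the scoring formulas described in Denkowski and Lavie, 2011: a parameterized harmonic mean of precision and recall, and a fragmentation penalty computed from the number of chunks divided by the average number of matched words. The function name meteor_from_stats is hypothetical, and the precision and recall values are taken as reported rather than recomputed from the match table.

```python
def meteor_from_stats(precision, recall, chunks, test_matches, ref_matches,
                      alpha, beta, gamma):
    """Recompute fMean, fragmentation penalty, and final score."""
    # Parameterized harmonic mean of precision and recall.
    f_mean = (precision * recall) / (alpha * precision + (1.0 - alpha) * recall)
    # Fragmentation: chunks per matched word, averaged over both sides.
    frag = chunks / ((test_matches + ref_matches) / 2.0)
    penalty = gamma * frag ** beta
    return f_mean, penalty, f_mean * (1.0 - penalty)


# Values copied from the sample output; alpha, beta, gamma are the first
# three entries of the "Parameters" line (delta is already reflected in
# the reported precision and recall).
f_mean, penalty, score = meteor_from_stats(
    precision=0.6767347074578696, recall=0.6500539115850005,
    chunks=20118, test_matches=45535, ref_matches=45231,
    alpha=0.85, beta=0.2, gamma=0.6)
print(f_mean, penalty, score)
```

Under these assumptions the three printed values should agree closely with the fMean, Fragmentation penalty, and Final score lines of the sample output.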
For the majority of scoring scenarios, only the -l and -norm options should be used. For more advanced usage, the full list of options follows.
SCORE ||| reference 1 words ||| reference n words ||| hypothesis words
Scores a hypothesis against one or more references and returns a line of sufficient statistics.
EVAL ||| stats
Calculates final scores using the output of SCORE lines. Meteor exits on end-of-file.
tstLen refLen stage1tstTotalMatches stage1refTotalMatches stage1tstWeightedMatches stage1refWeightedMatches s2tTM s2rTM s2tWM s2rWM s3tTM s3rTM s3tWM s3rWM s4tTM s4rTM s4tWM s4rWM chunks lenCost
Title precision recall fragmentation score
sentence1
sentence2
Line2Start:Length Line1Start:Length Module Score
...
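The sufficient statistics format listed above (tstLen refLen, four values per stage, then chunks lenCost) can be unpacked into named fields. This is a minimal sketch that assumes the fields are whitespace-separated numbers and that four matching stages are present; the function name parse_sufficient_stats and the dictionary layout are hypothetical conveniences, not part of Meteor.

```python
def parse_sufficient_stats(line, stages=4):
    """Split a sufficient statistics line into named fields."""
    values = [float(v) for v in line.split()]
    expected = 2 + 4 * stages + 2
    if len(values) != expected:
        raise ValueError("expected %d fields, got %d" % (expected, len(values)))
    stats = {"tstLen": values[0], "refLen": values[1], "stages": []}
    for i in range(stages):
        base = 2 + 4 * i  # first field of stage i + 1
        stats["stages"].append({
            "tstTotalMatches": values[base],
            "refTotalMatches": values[base + 1],
            "tstWeightedMatches": values[base + 2],
            "refWeightedMatches": values[base + 3],
        })
    stats["chunks"] = values[-2]
    stats["lenCost"] = values[-1]
    return stats
```

If a run uses a different number of modules, the stages argument would need to change accordingly; this has not been verified against Meteor's output for such configurations.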
Meteor includes a monolingual word aligner that can be run independently of the scorer. To run the aligner, use:
java -Xmx2G -cp meteor-*.jar Matcher
Running the aligner with no arguments shows the help message:
Meteor Aligner version 1.4

Usage: java -Xmx2G -cp meteor-*.jar Matcher <test> <reference> [options]

Options:
-l language                     One of: en da de es fi fr hu it nl no pt ro ru se tr
-m 'module1 module2 ...'        Specify modules (overrides default)
                                  One of: exact stem synonym paraphrase
-t type                         Alignment type (coverage vs accuracy)
                                  One of: maxcov maxacc
-x beamSize                     Keep speed reasonable
-d synonymDirectory             (if not default)
-a paraphraseFile               (if not default)

See README file for examples
Most options are the same as in the Meteor scorer. The additional option is -t, which specifies whether alignments should maximize coverage (comparable to recall) or accuracy (comparable to precision).
Sentences are read from test and reference files, one per line, and alignments are written to stdout using the Meteor format:
Alignment <line N>
sentence1
sentence2
Line2Start:Length Line1Start:Length Module Score
...
Important note: the Meteor Aligner does not apply any normalization to input text. Text should be segmented, tokenized, and lowercased as desired prior to Meteor alignment.
Meteor includes a standalone word stemmer for supported languages. To run the stemmer, use:
java -cp meteor-*.jar Stemmer
Running the stemmer with no arguments shows the help message:
Snowball stem some text in a supported language
Languages: en da de es fi fr hu it nl no pt ro ru se tr
Usage: Stemmer lang < in > out
The stemmer reads lines from stdin and writes to stdout. Each word in the input is stemmed using the Snowball stemmer for the specified language.
Important note: the Meteor Stemmer does not apply any normalization to input text. Text should be segmented, tokenized, and lowercased as desired prior to stemming.
The simplest way to integrate Meteor with your software involves using the -stdio option:
java -Xmx2G -jar meteor-*.jar - - -l en -norm -stdio
This tells Meteor to use the English settings, normalize text, and use stdin/stdout. You can then write lines of the following form to Meteor's stdin:
SCORE ||| reference 1 words ||| reference n words ||| hypothesis words
This scores a hypothesis against one or more references and returns a line of sufficient statistics.
EVAL ||| stats
This reads a line of sufficient statistics and produces a final score. Meteor exits on end-of-file.
Languages such as C++, Python, and Perl can open an external process and communicate with its stdin and stdout. For more information, see the documentation for process control for your language.
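As an illustration of the process-control approach, the following Python sketch wraps the -stdio protocol described above. It is a sketch under stated assumptions rather than a reference client: the class name MeteorProcess and the helper functions are hypothetical, jar_path must be a concrete file name because the meteor-*.jar wildcard is not expanded without a shell, and the reply to an EVAL line is assumed to be a single number on one line.

```python
import subprocess


def score_line(references, hypothesis):
    """Build a SCORE request: references first, hypothesis last."""
    return "SCORE ||| " + " ||| ".join(list(references) + [hypothesis])


def eval_line(stats):
    """Build an EVAL request from a line of sufficient statistics."""
    return "EVAL ||| " + stats.strip()


class MeteorProcess:
    """Keep one Meteor process open and exchange lines with it."""

    def __init__(self, jar_path, lang="en"):
        self.proc = subprocess.Popen(
            ["java", "-Xmx2G", "-jar", jar_path, "-", "-",
             "-l", lang, "-norm", "-stdio"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            text=True, bufsize=1)

    def _ask(self, line):
        # One request line in, one reply line out.
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().strip()

    def score(self, references, hypothesis):
        stats = self._ask(score_line(references, hypothesis))
        # Assumption: the EVAL reply parses as a single float.
        return float(self._ask(eval_line(stats)))

    def close(self):
        # Closing stdin signals end-of-file, on which Meteor exits.
        self.proc.stdin.close()
        self.proc.wait()
```

Keeping a single process open avoids reloading Meteor's resources for every sentence, which is the main reason to prefer this over launching the jar once per hypothesis.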
If your software is written in Java, you can use the Meteor API directly:
import edu.cmu.meteor.scorer.MeteorConfiguration;
import edu.cmu.meteor.scorer.MeteorScorer;
import edu.cmu.meteor.util.Constants;

MeteorConfiguration config = new MeteorConfiguration();
config.setLanguage("en");
config.setNormalization(Constants.NORMALIZE_KEEP_PUNCT);
MeteorScorer scorer = new MeteorScorer(config);
double score = scorer.getMeteorStats("test string", "reference string").score;
Remember to add meteor-*.jar to your classpath. See the source files for MeteorConfiguration and MeteorScorer for additional information.
X-ray visualizes alignments and scores of one or more MT systems against a set of reference translations. When scoring translation hypotheses with Meteor, use the -writeAlignments option to produce alignment files annotated with Meteor statistics. X-Ray uses these files to produce graphical representations of alignment matrices and score distributions via XeTeX and Gnuplot. Final output is in PDF form with intermediate LaTeX and plot files preserved for easy inclusion in reports and presentations.
Requirements:
sudo apt-get install python texlive-full gnuplot unifont
Setup:
If XeTeX and Gnuplot are installed somewhere other than /usr/bin, edit xray/Generation.py to include the correct locations:
xelatex_cmd = '/usr/bin/xelatex'
gnuplot_cmd = '/usr/bin/gnuplot'
Usage:
Run X-ray with the following:
python xray/xray.py
Running X-Ray with no arguments shows the help message:
MX: X-Ray your translation output
Usage: xray.py [options] <align.out> [align.out2 ...]

Options:
  -h, --help            show this help message and exit
  -c, --compare         compare alignments of two result sets (only first 2
                        input files used)
  -n, --no-align        do not visualize alignments
  -x MAX, --max=MAX     max alignments to sample (default use all)
  -p PRE, --prefix=PRE  prefix for output files (default mx)
  -l LBL, --label=LBL   optional system label list, comma separated:
                        label1,label2,...
  -u, --unifont         use unifont (use for non-western languages)
Example usage: score and visualize the hypotheses from system1 and system2 in the example/xray directory.
Score system1 with Meteor using the following options:
java -Xmx2G -jar meteor-*.jar example/xray/system1.hyp example/xray/reference \
    -norm -writeAlignments -f system1
-norm: tokenize and normalize before scoring
Visualize alignments and scores of system1 with Meteor X-Ray:
python xray/xray.py -p system1 system1-align.out
-p system1: prefix output files with 'system1'
Files produced:
Score system2 with Meteor:
java -Xmx2G -jar meteor-*.jar example/xray/system2.hyp example/xray/reference \
    -norm -writeAlignments -f system2
Compare performances of system1 and system2:
python xray/xray.py -c -p compare system1-align.out system2-align.out
-c: compare two Meteor outputs
Files produced:
Additional systems:
To compare any number of systems, score each with Meteor (as above) and pass the align.out files to X-Ray. Without the -c flag, X-Ray will generate individual alignment matrices for each system and a single score PDF with score distributions for all systems. This is useful for comparing many configurations of the same system.
Meteor parameters can be optimized to maximize agreement with human judgments of translation quality. The most frequently used evaluation task is ranking, where metrics should replicate human preferences between multiple translation hypotheses. Training Meteor on ranking data requires the following:
segment-id lang-pair1 system1 lang-pair2 system2
To prepare for training, convert the input files to SGML format where needed. (Plaintext would be fine in most cases, but datasets distributed as SGML aren't required to have segments in consistent order, which can create problems for older data.) The following example uses the included data in example/train:
mkdir my-train-dir
python scripts/sgmlize.py t < example/train/fr-en.sys1 > my-train-dir/fr-en.sys1.sgm
python scripts/sgmlize.py t < example/train/fr-en.sys2 > my-train-dir/fr-en.sys2.sgm
python scripts/sgmlize.py r < example/train/fr-en.ref > my-train-dir/fr-en.ref.sgm
cp example/train/fr-en.rank my-train-dir
Since parallel training requires loading several copies of Meteor into memory, filter the paraphrase table to minimize memory usage:
java -cp meteor-*.jar FilterParaphrase data/paraphrase-en.gz filtered.gz \
    example/train/fr-en.ref
Meteor trains parameters by calculating sufficient statistics for all hypotheses and running an exhaustive grid search over rescorings. Every point explored is written out. To run the Trainer directly on one cpu:
java -cp meteor-*.jar Trainer rank my-train-dir -a filtered.gz > train.out
To find the best training point, sort the output:
sort -gr train.out > train.out.sort
The point with the highest correlation is the first line of the sorted file.
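Where the sort utility is unavailable, the same selection could be done in Python. This sketch assumes, based on the use of sort -gr above, that every line of train.out begins with the correlation value as its first whitespace-separated field; the function name best_training_point is hypothetical.

```python
def best_training_point(lines):
    """Return the line whose leading numeric field is largest."""
    best_value, best_line = None, None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        try:
            value = float(fields[0])
        except ValueError:
            continue  # skip lines that do not start with a number
        if best_value is None or value > best_value:
            best_value, best_line = value, line.rstrip("\n")
    return best_line
```

With a training log on disk, this could be called as best_training_point(open("train.out", encoding="utf-8")).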
Running on multiple cpus greatly improves the speed of Meteor training. To run the grid search in parallel, use the meteor_shower script:
python scripts/meteor_shower.py meteor-*.jar en 4 rank my-train-dir work-dir 8 -a `pwd`/filtered.gz
This will keep 8 trainers running in parallel. Make sure to specify an absolute path for the paraphrase file. The results will be written to work-dir, along with a script for sorting the results.
Meteor now supports language-specific evaluation for any target language for which there is enough data to build a standard phrase-based machine translation system. To build language-specific resources (paraphrase table and function word list), run the new_language.py script with your parallel data and a Moses-format phrase table:
python scripts/new_language.py out-dir corpus.f corpus.e phrase-table.gz [target-corpus.e]
Paraphrases will be extracted matching the target corpus (this can be a collection of relevant dev sets). If no target corpus is provided, the first 10,000 lines of the English corpus will be used (in practice this works adequately). Meteor can then be run with these files:
java -Xmx2G -jar meteor-*.jar test reference -new out-dir/meteor-files
Data should be pre-tokenized. Meteor will lowercase all data for evaluation (-new implies -lower). A universal parameter set will be used. These parameters are tuned on over 100,000 binary ranking judgments across 8 language directions and encode the following general properties:
Authors of previous Meteor versions: