Minimum Error Rate Training Tools
If you are looking to use MER with the STTK system, please read this, otherwise continue reading this page. The scripts below implement Minimum Error Rate training for SMT as described in Och(20003) (see my bib files for the references) and Venugopal,Vogel (2005). It include several improvements to the basic training method including randomized initial conditions and permuted model order (in order to deal with the greedy nature of the algorithm) and dynamic parameter range expansion or restriction (to increase their potential relative impact, or in order to limit the use of certain models). The script is to be used with the BLEU metric. Simple modifications can be made for other metrics, please feel free to contact me if you need help. The documentation below explains the function and usage of this script.
What is Minimum Error Rate Training ?
Most modern SMT systems use some variation on the exponential log linear model (which contains the noisy channel model as a special case) to combine diverse knowledge source tto score translation output. The translation with the highest probability under this model (or lowest -log probabilty) is chosen assuming the argmax decision rule. See Venugopal, Zollman,Waibel (2005) and Kumar,Bryne (2003) for discussion on alternative decision rules and training methods. Chosing the parameters of this exponential model (or scaling factors for each knowledge source in the noisy channel model) significantly affects translation quality, since they are used to drive the search space (through the space of possible target language translations). We need a method that explores the parameter space of scaling factors and picks value that maximize translation quality according to an automatic evaluation metric. Venugopal,Vogel (2005) shows the difficulty of using MMI to learn these parameters. Minimum Error Rate training as introduced in Och (20003) using the n-best list as an approximation to the translation search space, and considers rescoring the translations with different choices of scaling parameters. This explicitly generates an error surface whose minumum can be inspected, and the corresponding parameter choices reported. Och (2003) shows that the space of parameter configurations can be limited to those that actually cause the error surface to change, making this search strategy quite feasible when used in a greedy search through the parameter space (optimizing one parameter at a time).
Requirements for Minimum Error Rate Training
- Argmax decision rule (alternative decision rules cannot take advantage of the shortcut described above)
- N-Best list represents sufficient diversity in translations (and by consequence, individual model probabilities)
- Automatic evaluation metric that scores translations. Ideally this metric can be partially computed (like computing n-gram statistics for BLEU)
- A way to merge N-Best list across succesive iterations of the training process
- A development set of source sentences for which references are available
Data for Minimum Error Rate Training
The optimization script takes an n-best list (in the form of 2 files) which has been augmented with data used for automatic evaluation, and a file that specifies restrictions to the search process. The first n-best list feature file contains a list of model costs, along with intermediate data (n-gram match counts and the length of the nearest reference) that represents each candidate translation.
There is no separation between the n-best lists of each source sentence, (this data is specified in the second file). The script provided is tailored for n-gram matching evaluation metrics, like BLEU or NIST. Each line in the n-best list feature begins with an arbitrary number of model scores (whose number has to match the number of models represented in the restriction file), and is followed by n-gram matching statistics for that line. The n-gram statistics are arranged as follows
1-gram-matched 1-grams-suggested 2-g.... corresponding-metric-length
If these statistics are accumulated over a large corpus, you can calculated the BLEU score for this corpus, see Papeneni (2002) for details. The last field, corresponding-metric-length is different in the IBM and NIST flavors of the BLEU metric. IBM BLEU uses the length of the closest reference translation (assuming mulitple exist), while NIST BLEU uses the length of the shortest reference. These scores are calculated outside this script since it requires string matching/manipulation, a costly process which is not well suited to Matlab.
The second n-best list file just indicated the number of candidate translations present in the feature file for each sentences. Its format is
sentenceNumber #ofCandidateTranslations
We need to generate 1 more file before running the optimization. This file restricts the scaling factors for each model to stay within a specified range. The format for this file is shown below. It is usually useful to restrict one of the parameters to 1.0, and let the other parameters optimize around this fixed value. Initially you can set the ExpansionMargin (see below) to allow the system to select its own ranges.
minValueModel1 minValueModel2 ... minValueModelN
maxValueModel1 maxValueModel2 ... maxValueModelN
Here are sample lines from the 2 files that represent the n-best list augmented for IBM BLEU scoring
feats file
-43.9954 85.8221 34 46 67.7138 76.2273 27 46 12 45 5 44 3 43 41
-43.9954 85.8221 45.25 46 67.7138 76.2273 27 46 12 45 5 44 3 43 41
cands file
0 1000
1 1000
restriction file
-3 0 -3 -6 0 0
3 3 3 3 3 3
Running the script
The following variables need to be set before running the script in Matlab.
- feats_opt = feature file for n-best list
- cands_opt = candidate file for n-best list
- init_opt = parameter restriction file for optimization
matlab -nodisplay -r
feat_opt=\'feats.opt\',cands_opt=\'cands.opt\',init_opt=\'init.opt\',
optimizeV2IBMBLEU
When running the script...
As the script on an example feature file, cands file, restriction file you should will see a lot of output. This data is generated for word-phrase translation for 100 sentences. It is meant to serve as a very simple example.
- Initial Score: 0.168 (this is the score of the n-best as ordered in the file, taking the first translation for each candidate
- Parameter settings (9 in this case) and an error (1.0 - BLEU score) this is the error that is gradually being reduced optimizing over each dimension
- Start Values after which you will see values for each parameter
- Final Parameters (9 parameter values) and one (1.0-BLEU score) error that is the minimum value
- Final Score: 0.183 (final bleu score after optimization)
- Each line during the training will look like this...
0.000000 2.695273 0.722074 2.674726 .... 0.831022 - You will see the last error value decreasing during each cycle through the parameters. Once the error stops changing, the process is repeated for each permutation of parameter orders, in case some parameters are directly correlated.
- After each full cycle of permutations, new start values are initialized. Parameters in the script determine how many times this happens
- After running the script, use the values generated by this process to re-initialize your tranlation (n-best list generation process), and generate a new n-best list. You should then MERGE each successive n-best list. Merging is described in the Venugopal, Vogel,2005
User parameters in the script
There are a few parameters in the script that allow you to change the way the optimization runs. Their default settings usually work well, but you might want to experiment with them to speed up the process if your parameters are generalizing well. Here they are in no particular order.
- NumRandomTests : currently set to 5. Number of times new random seeds are used in the optimization process
- ConvergedLimit : currently set to 3. Number of times the error must stay the same to claim convergence has occurred (within a random seed run)
- IterationLimit : currently set to 20. Max number of times random permutations are considered for a single random seed
- ExpansionMargin : currently set ot 0.0 If the optimized parameter value gets within the ExpansionMargin of its right bound, the right bound will be increased by ExpansionFactor*( maxValue - minValue), effectively increasing the search space.