SAMT System Documentation
Table of contents
1 - Overview
2 - Installation
3 - Example walk-through
4 - Commands we used for our IWSLT-07 submission
1 - Overview
The SAMT system consists of three parts:
- Extraction of statistical translation rules from a training
corpus; either plain hierarchical rules a la Chiang (2005) or
syntax-augmented rules a la Zollmann&Venugopal (2006).
- CKY+ style chart-parser employing the statistical translation rules to translate test sentences
- A minimum-error-rate (similar to Och 2003) optimization tool (integrated into the chart
parser or as a standalone tool) to tune the parameters of the underlying log-linear model on a
held-out development corpus
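To make the log-linear model concrete: the decoder combines several feature scores with tunable weights and picks the derivation with the lowest combined cost; MER training adjusts the weights. A minimal sketch (the feature values and weights below are invented for illustration, not from a real SAMT model):

```python
# Illustrative sketch of log-linear scoring; feature values and
# weights are invented, not from a real SAMT model.
def loglinear_cost(features, weights):
    # decoders of this kind minimize cost = -sum_i w_i * f_i
    return -sum(w * f for w, f in zip(weights, features))

def best_hypothesis(hypotheses, weights):
    # hypotheses: list of (translation, feature_vector) pairs
    return min(hypotheses, key=lambda h: loglinear_cost(h[1], weights))[0]
```

MER optimization then searches for the weight vector under which best_hypothesis yields the highest metric score on a development set.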
2 - Installation
Components of SAMT system
The decoding and MER components of the SAMT system are built in C++
and link against the Berkeley DB shared libraries. The SRI LM toolkit
is used in the decoder, but due to various compiler compatibility
issues, we have simply included the relevant code from the SRI
toolkit in our source distribution. Note: the included SRI code dates
back a few versions in the SRI history; this artifact will
eventually be addressed.
For phrase extraction we provide instructions to create a
generalized, annotated rule table, which is stored as a Berkeley DB
file. While
you can use any underlying phrase extraction toolkit to generate the
pure lexical phrases, we provide instructions to interface directly
with Philipp Koehn's toolkits. Our rule creation process is written
primarily in Perl, so you will need to extend your Perl installation
to handle Berkeley DB.
In summary, we will be using or updating the following tools.
- Moses: used to perform (non-hierarchical) phrase extraction
- Charniak parser: generates target language parse trees
- Berkeley DB: used to store hierarchical and syntax structured grammars on disk
- Perl : Used in our scripts that extract hierarchical and syntax
structured models
Perl
You need Perl >= 5.8.7. Make sure the executable is in your $PATH.
Often, you will not be able to add modules to the perl distro on your
computing environment. If this is the case, here is how to install it
locally.
- wget http://downloads.activestate.com/ActivePerl/Linux/5.8/ActivePerl-5.8.8.820-x86_64-linux-glibc-2.3.3-gcc-274679.tar.gz
- unpack the tarball and run sh install.sh from the extracted directory
- set PERL_DIR = /usr2/ashishv/external-tools/ActivePerl-5.8
- setenv PATH $PERL_DIR/bin:${PATH}
- rehash
You will also need to install the following modules into your perl distribution:
- Set::IntRange (agree to the dependencies Bit::Vector,Carp::Clan)
- Log::Log4perl
Install modules into the Perl found in your path like this:
cpan
install Set::IntRange
You also need the perl module Tree-R-0.05 by Ari Jolma, which is
included with this distribution.
To install this module, type the following from the SAMT directory:
cd Tree-R-0.05
perl Makefile.PL
make
make test
make install
Berkeley DB
You need a recent version of Berkeley DB, freely available at www.sleepycat.com
(to avoid network filesystem problems, use at least 4.4.20!), and the
Perl module BerkeleyDB, which is included in the Berkeley DB
distribution but is not installed by default.
This is how to install Berkeley DB locally (i.e., without root rights), assuming you are in the Berkeley DB's main directory:
cd build_unix
../dist/configure --prefix=whereyouwantittobe/programs/BerkeleyDB-4.4.20.NC --enable-cxx --disable-shared
make
make install
Note: Setting the --enable-cxx flag will generate the db_cxx.h header file and the corresponding C++ libraries; you MUST do this to generate the libraries that the decoder links against.
Then for the Perl module BerkeleyDB, change into perl/BerkeleyDB and
modify config.in to match the place of your Berkeley DB, e.g.:
INCLUDE = /nfs/islpc7_1/Andreas/programs/BerkeleyDB-4.4.20.NC/include
LIB = /nfs/islpc7_1/Andreas/programs/BerkeleyDB-4.4.20.NC/lib
Then type:
perl Makefile.PL
make
make test
make install
GIZA++ and Moses training scripts
Unless you have your own system to extract phrase pairs from the
training corpus, you will need GIZA++ (freely available at www.fjoch.com/GIZA++.html) for
the word alignment and Moses phrase extraction scripts (version from April 2007 or later) - freely available at
www.statmt.org.
Compile GIZA++ (with -DBINARY_SEARCH_FOR_TTABLE) and mkcls following the instructions in the GIZA++ distribution.
Syntactically Structured Translation Rules
If you want to make use of the syntactic capabilities of our system,
you need to create syntactic Penn treebank style parse trees of the
target-language side of the training corpus. Our training
scripts assume the use of Eugene Charniak's parser, version
05Aug16 (freely available at ftp://ftp.cs.brown.edu/pub/nlparser),
with a slight modification to the file PARSE/parseIt.C that makes sure
that the parsing output lines don't get out of sync with the English
training text lines in case of a failed parse. We have included the
modified version of parseIt.C in directory ./charniakparserchanges. Copy
that file into your ./parser05Aug16/PARSE directory, change into that
directory, and type
make parseIt
to recompile the parser for your system.
The modification ensures that an output line is produced even for
failed parses. In case you don't want to use the provided version of
Charniak's parser, this is the DIFF of the modified file parseIt.C:
82c82,86
< if(len > params.maxSentLen) continue;
---
> //az (next if modified)
> if(len > params.maxSentLen) {
> cout << "_fail (maxSentLen exceeded)" << endl << endl;
> continue;
> }
Compiling the chart-parsing decoder
The decoder uses g++ and the GNU automake/autoconf machinery to
automate the build process. Here is a known working configuration
of GNU tools used to build the system. Other configurations might
work, but keeping track of these revisions is harder than doing
research. This setup was tested on 64-bit machines running Fedora Core 5:
- g++-4.1.0
- automake 1.7.9
- autoconf 2.13
- libtoolize 1.5
Several people have reported difficulty getting the system working via this
automake and autoconf system. As an alternative, we also provide a script
that can generate a makefile for you. The script is called
generatemakefile.pl and is found in the dist directory.
- cp -r ./dist ./myoptions (myoptions will correspond to the compile flags you pick, e.g., Opti vs. Debug)
- cd myoptions; perl generatemakefile.pl --bdb /pathtoyourbdb/ >
Makefile
- make
Here is the structure of the SAMT.tar.gz that you downloaded
- src : source code for CKY+ decoder
- dist: files that make autoconf/automake tick
- doc: contains a Doxygen file to generate documentation
- scripts: scripts to generate a rule table
- examples: a sample environment to test your installation
Follow these steps (from the dir above src) to build a binary version of
the CKY+ decoder called FastTranslateChart via the automake and autoconf
mechanism.
- cp -r ./dist ./myoptions (myoptions will correspond to the compile flags you pick, e.g., Opti vs. Debug)
- cd myoptions; autoreconf -i (will try to learn something about your system setup)
- myoptions> ./configure CXXFLAGS="-O2 -DNDEBUG
-DHAVE_CXX_STDHEADERS -DINSTANTIATE_TEMPLATES -I${BDB_INCLUDE}"
LDFLAGS="-L${BDB_LIB} -ldb_cxx"
To produce optimized code, use -O2, NOT -O3; we have not had great luck getting consistent code from -O3.
- myoptions> ln -s ../src src
- make
You will probably have the most difficulty with the ./configure ...
line, since linking to the Berkeley DB libs can be a struggle. Again, ensure
that you have run "make install" in the Berkeley DB installation, and
that you can find the libdb_cxx.a library. Sometimes you might have to
add the Berkeley DB lib path to your LD_LIBRARY_PATH to get it to
locate the libraries successfully.
Path issues
Add the ./scripts directory to your $PATH.
3 - Example Walk-through
We now explain how to train, tune, and use the translation
system, using as a running example the Europarl Spanish-English corpus from the
ACL 2007 MT Workshop, which is freely available at http://www.statmt.org/wmt07/.
We have put the first 10,000 lines from that corpus into the
SAMT/examples/europarl directory. Try to walk through the following steps, starting in this directory.
Preprocessing
To make the data compatible with the Charniak parser, "..." and ".."
have to be removed from the English side and empty lines have to be modified to contain at least one symbol.
To avoid confusion with meta
symbols in our system, in both source and target data, "#" must be
removed and no "@" is allowed to stand at the beginning of a word.
To do all this, simply go into your training data directory, and type
sbmtpreprocess.sh *
This will preprocess all files in that directory and keep the original files under the same names plus extension ".bak".
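For illustration, the normalization rules just described can be sketched as follows. This is a hypothetical re-implementation, not the actual sbmtpreprocess.sh; in particular, the placeholder token used for empty lines is invented:

```python
import re

# Hypothetical sketch of the normalization described above
# (NOT the actual sbmtpreprocess.sh).
def samt_preprocess(line, is_target=True):
    if is_target:
        # "..." and ".." confuse the Charniak parser
        line = line.replace("...", " ").replace("..", " ")
    line = line.replace("#", " ")            # '#' is a meta symbol in SAMT
    line = re.sub(r"(^|\s)@", r"\1", line)   # no word may start with '@'
    line = re.sub(r"\s+", " ", line).strip()
    return line if line else "EMPTYLINE"     # invented placeholder: empty lines need a symbol
```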
Also, you might want to remove lines with too many words from the corpus to avoid prohibitively long training times for these
sentences. You can use the following Moses script for this:
$MOSES_SCRIPTS_ROOT/training/clean-corpus-n.perl europarl-small es en europarl-tiny 1 40
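The effect of this cleaning step can be sketched as follows (a simplification: the real clean-corpus-n.perl may apply further checks, such as a length-ratio limit):

```python
# Sketch of parallel-corpus length cleaning: keep only sentence pairs
# where both sides fall within [min_len, max_len] tokens, matching the
# "1 40" arguments in the command above.
def clean_corpus(src_lines, tgt_lines, min_len=1, max_len=40):
    kept = []
    for s, t in zip(src_lines, tgt_lines):
        if (min_len <= len(s.split()) <= max_len
                and min_len <= len(t.split()) <= max_len):
            kept.append((s, t))
    return kept
```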
Preprocessing your development set references
Later we will tune the SAMT model's feature weights on a development corpus towards a given metric (e.g. IBM-BLEU score).
In order for that tuning process to be as close as possible to the real test scenario, translations and references
should be normalized in the same way as during actual testing. Note that this is not a symmetrical process: Normalization
of the reference is a given process defined by the evaluation campaign, whereas translation output normalization can contain
additional steps such as capitalizing the first word and removing double-punctuation marks. Therefore, the normalizing script (this can be any executable file) called during devset tuning only normalizes the translations, not the respective references. You need
to normalize the references manually. For the standard mt-eval normalization, run (the script is in the SAMT/scripts directory and therefore in your path):
mteval-preprocess-references.pl < dev2006-small.en > dev2006-small.en.mtevalpp
and, in order to be reported correct scores for your test set:
mteval-preprocess-references.pl < test2006-small.en > test2006-small.en.mtevalpp
Word-aligning the training corpus and extracting the phrases
This call will run GIZA++ to compute IBM4 word alignments for the training sentence pairs, and will also join the Spanish-English
and English-Spanish alignments using the grow-diag-final-and method, as well as compute the
word-to-word lexical relative frequencies (source- and target-conditioned):
$MOSES_SCRIPTS_ROOT/training/train-factored-phrase-model.perl -scripts-root-dir $MOSES_SCRIPTS_ROOT -root-dir . -corpus europarl-tiny -f es -e en -alignment grow-diag-final-and -first-step 1 -last-step 4 >& train-alignment.log &
Afterwards, run this line to extract phrase pair spans with maximum source length 8 and save them in the file `extraction-log':
$MOSES_SCRIPTS_ROOT/training/phrase-extract/extract model/aligned.0.en model/aligned.0.es model/aligned.grow-diag-final-and extract 8 --OnlyOutputSpanInfo > extraction-log
Note: Sometimes GIZA++ fails on sentences (e.g. when they are too long). This shouldn't happen in your case because you removed
sentences that are too long. To be sure, you can double-check that the file `model/aligned.0.en' is identical to the file
`europarl-tiny.en'. If the former file is different, then it will be a subset of your original corpus. In that case, you have
to parse based on `model/aligned.0.en' to obtain matching parses. Usually this mismatch shouldn't occur and you can thus
parallelize the parsing step.
Training the Language Model
Run
cat europarl-tiny.en | $SRILM/bin/i686-m64_c/ngram-count -text - -order 3 -kndiscount -interpolate -lm europarl-tiny.en.srilm
where $SRILM is your SRI language model directory.
Parsing the target-side training sentences (only needed when creating syntax-augmented rules)
Assuming that you installed the Charniak parser in directory /nfs/islpc7_1/Andreas/programs/parser05Aug16, and that your English training sentences are in file europarl-tiny.en, type:
(cat europarl-tiny.en | replace
'^(.*)\n' '<s> \1 <\/s>\n' |
/nfs/islpc7_1/Andreas/programs/parser05Aug16/PARSE/parseIt -T20 -l400
-N1 /nfs/islpc7_1/Andreas/programs/parser05Aug16/DATA/EN/ | replace
'^\n' '' >targetparsetrees) >& parselog &
This should take around 10 minutes. The script replace is in SAMT/scripts (and thus in your $PATH) and replaces ARG1 by ARG2.
You can trade off parsing run-time against parsing accuracy by
modifying the parameter "-T" (the smaller, the faster; -T210 is default
speed, -T50 is supposed to lose one percent of accuracy).
As a sanity test, use `wc' to check that the resulting parse tree file has the same number of lines as your input file.
Rule extraction
You are now ready to run `extractrules.pl' (takes around 1 hour):
(extractrules.pl --PhrasePairFeedFile extraction-log --TargetParseTreeFile targetparsetrees | gzip > extractrules.out.gz) >& log-extractrules &
The perl script extracts SAMT rules, sentence by sentence, and writes them to standard output. If you don't specify
the --TargetParseTreeFile parameter, non-syntactic hierarchical rules will be extracted. You can also pipe
the extraction log into STDIN by specifying `--PhrasePairFeedFile -'.
The following parameters (case insensitive) restrict the kind of non-lexical rules (i.e. rules containing at least one nonterminal) allowed.
Note that lexical rules (i.e., phrase pairs) are not being restricted by these parameters.
- --MAXSOURCEWORDCOUNT maximum number of source words in created rules
- --MAXTARGETWORDCOUNT same for target
- --MAXSOURCESYMBOLCOUNT maximum number of source symbols (source words + substitution sites) in created rules
- --MAXTARGETSYMBOLCOUNT same for target
- --MAXSUBSTITUTIONCOUNT maximum number of nonterminal substitution sites allowed in created rules
Note also that --MAXSOURCESYMBOLCOUNT and --MAXSUBSTITUTIONCOUNT are
the two parameters that can significantly influence execution speed, as
they allow for omission of recursive subroutine calls.
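To make the word/symbol distinction concrete, here is a toy counter using the rule notation that appears later in this document ('@'-prefixed tokens are nonterminals, as in "@PN va / @1 goes"). This is an illustration only, not code from extractrules.pl:

```python
# Toy illustration of the counts the restriction parameters refer to.
def rule_counts(rule):
    source, target = [side.split() for side in rule.split(" / ")[:2]]
    nonterm = lambda tok: tok.startswith("@")
    return {
        "source_words":   sum(not nonterm(t) for t in source),
        "target_words":   sum(not nonterm(t) for t in target),
        "source_symbols": len(source),  # words + substitution sites
        "target_symbols": len(target),
        "substitutions":  sum(nonterm(t) for t in source),
    }
```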
On-the-fly test set filtering
To speed up the extraction process and save disk space, you can have
extractrules.pl filter the rules on-the-fly for your development and
test set. Assuming that you concatenated your development and your test
set together into the file test2000andrealtest2000.fr, and that you only have phrases in your phrase table of length up to 12, you can run the filtered rule extraction as follows:
(extractrules.pl --PhrasePairFeedFile extraction-log --TargetParseTreeFile targetparsetrees -r test2000andrealtest2000.fr --MaxSourceLength 12 | gzip > extractedrules-test2000andrealtest2000filtered.gz) >& log-extractrules &
When using -r, you always have to specify a maximum source phrase
length via --MaxSourceLength because we're hashing in all n-grams
of length up to --MaxSourceLength from the dev/test set.
Note that on-the-fly filtering distorts the target-conditioned relative
frequency feature calculation.
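The n-gram hashing idea behind -r can be sketched like this (a hypothetical simplification of what extractrules.pl does: a rule is kept only if every contiguous block of terminals on its source side occurs somewhere in the dev/test set):

```python
# Sketch of on-the-fly test-set filtering via n-gram hashing.
def source_ngrams(lines, max_len):
    grams = set()
    for line in lines:
        toks = line.split()
        for i in range(len(toks)):
            for j in range(i + 1, min(i + max_len, len(toks)) + 1):
                grams.add(tuple(toks[i:j]))
    return grams

def rule_matches(source_side, grams):
    # split the rule's source side at nonterminals into terminal blocks
    block, blocks = [], []
    for tok in source_side.split():
        if tok.startswith("@"):
            if block:
                blocks.append(tuple(block))
                block = []
        else:
            block.append(tok)
    if block:
        blocks.append(tuple(block))
    return all(b in grams for b in blocks)
```

This is why --MaxSourceLength must be given with -r: it bounds the length of the n-grams that get hashed.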
Rule extraction in several pieces
Depending on the size of your training corpus and the rule restriction
parameters, the rule extraction process can take quite a long time. It is,
however, easy to parallelize since extractrules.pl works on a per-sentence
basis. After the word alignment step, chunk the files
`model/aligned.0.en', `model/aligned.0.es' and `model/aligned.grow-diag-final-and'
into pieces and run the phrase span extraction and the parsing and the subsequent
rule extraction separately for each of these pieces.
After all pieces are processed, simply concatenate the respective extracted rules output
from the individual extractrules.pl calls.
Other options
Check out the top of the file SAMT/scripts/extractrules.pl for a detailed explanation of all the command line parameters.
Rule merging
The rules have been extracted individually for each sentence. Therefore, identical rules now
have to be merged. This is done by the C++ program MergeRules (should take about 5 minutes):
zcat extractrules.out.gz | sortsafe.sh -T /tmp | MergeRules 0 0 8 8 0 | gzip > mergedrules.gz
Here /tmp is the temp directory to be used for the unix sort (sortsafe.sh from the script directory makes sure
that no locale is turned on for sorting). By setting the first two parameters for
MergeRules to non-zero values, you can restrict MergeRules to only output rules that occurred a specified
minimum number of times (first parameter: minimum occurrence frequency for lexical rules; second parameter:
minimum occurrence frequency for nonlexical rules). Doing this is very useful if you run out of memory (or time)
during rule filtering (the next step) despite (or because you cannot do) test-set restriction.
You shouldn't need to change the other parameters
for MergeRules.
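Conceptually, the merge step does something like the following sketch. MergeRules itself is C++ and streams over sorted input rather than holding counts in memory, and the lexical/nonlexical test here is a toy approximation:

```python
from collections import Counter

# Toy sketch of MergeRules: collapse identical rules, sum their counts,
# and drop rules below a minimum count (separately for lexical rules,
# i.e. plain phrase pairs, and nonlexical rules).
def merge_rules(rules, min_lex=0, min_nonlex=0):
    merged = {}
    for rule, n in Counter(rules).items():
        is_lexical = "@" not in rule   # toy test: no nonterminal symbol
        if n >= (min_lex if is_lexical else min_nonlex):
            merged[rule] = n
    return merged
```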
Rule filtering
Once your rules are extracted, use filterrules.pl to compute additional
features such as lexical weights for the rules, add the glue rules, possibly merge the set of
nonterminals into clusters, filter the rules for a test set (if not
done on-the-fly during rule extraction), and convert them into a
Berkeley DB B-tree that can be used by our chart-parsing decoder.
You can filter your merged rule file as follows
(again, parameters are case insensitive): [this will take 30 minutes and 3.4 GB of RAM, use test set restriction (see below)
if you don't have a 4GB machine]
(zcat mergedrules.gz | filterrules.pl --cachesize 4000 --PhrasalFeatureCount 0 --LexicalWeightFile model/lex.0-0.n2f --LexicalWeightFileReversed model/lex.0-0.f2n --BeamFactorLexicalRules 0.05 --BeamFactorNonLexicalRules 0.05 --MinOccurrenceCountLexicalrules 0 --MinOccurrenceCountNonlexicalrules 0 -o rules.db) >& log-filterrules &
This creates a rule database in the BerkeleyDB file specified by parameter -o with
23 features (check the top of SAMT/scripts/filterrules.pl for the
description of the individual features). If you do not specify parameter -o then the rules will be output in text form to STDOUT.
Some important parameters (for more, check the beginning of the file filterrules.pl):
- --MinOccurrenceCountLexicalRules C
Removes a LEXICAL rule source/target/type if its occurrence count is less than C and there is another rule source/target2/type2 with higher occurrence count.
- --MinOccurrenceCountNonlexicalRules C
Removes a NONLEXICAL rule source/target/type if its occurrence count is less than C.
- --MinLexSourceCondRFlexicalrules p
Causes filterrules.pl
to drop all lexical rules (i.e., rules without abstractions) whose relative frequency amongst rules with
the same source side is less than p.
- --MinLexSourceCondRFnonlexicalrules p
Does the same for nonlexical rules, where the source side here comprises words and nonterminal symbols, i.e., the rules "@PN va / @1 goes" and "@NN va / @1 goes" do not compete because their source sides are considered different.
- --BeamFactorLexicalRules p
Only output those of the lexical rules whose frequency is at least p * 'frequency of highest-frequency rule with same source side'
- --BeamFactorNonLexicalRules p
Same for non-lexical rules
- --MaxAbstractionCount N
The maximum number of substitution site pairs in a rule is N
- --AllowAbstractRules
Allow rules without any words in them, such as "@VP @NP / @1 @2 / @S". By default, these are not allowed because decoding will become very slow.
- --noAllowAdjacentNonterminals
Don't allow adjacent substitution sites. Employed e.g. by Chiang (2005).
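The relative-frequency and beam-factor filters above can be sketched as follows (an illustration of the described behavior, not the actual filterrules.pl logic):

```python
from collections import defaultdict

# Sketch of per-source-side filtering: within each group of rules
# sharing a source side, drop rules whose relative frequency is below
# min_rf or whose count is below beam * (count of the group's best rule).
def filter_by_source(rules, min_rf=0.0, beam=0.0):
    # rules: list of (source, target, count)
    by_src = defaultdict(list)
    for src, tgt, n in rules:
        by_src[src].append((tgt, n))
    kept = []
    for src, group in by_src.items():
        total = sum(n for _, n in group)
        best = max(n for _, n in group)
        for tgt, n in group:
            if n / total >= min_rf and n >= beam * best:
                kept.append((src, tgt, n))
    return kept
```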
As with extractrules.pl, you can restrict the rules to the ones matching a test set using the parameter "-r testset.fr". YOU SHOULD DEFINITELY DO THIS (preferably already when running extractrules.pl) if you don't need a production system translating spontaneous sentences because this will save you lots of time and memory in the filtering process and also slightly speed up the translation process.
NOTE: If you're tuning to a development set (see below) and later test on a 'real' test set, don't filter rules for the development and test set individually, but
rather run one rule filtering on the concatenated devset+testset,
because the size of the rule database has a (slight) effect on count-based features and their respective optimal weights.
You can also have filterrules.pl compute the lexical phrasal cost
features based on a source- and/or a target-conditioned word-to-word
translation probability file. If you extracted your phrases with
Moses, such files were created in the ./model directory
(lex.0-0.n2f and lex.0-0.f2n). You can even specify multiple such lexica
(e.g., alignment-frequency based one and IBM1 probability based).
Each file passed with option --LexicalWeightFile has to consist of
entries of probability = P(frenchword|englishword) of the form:
frenchword englishword probability
Each file passed with option --LexicalWeightFileReversed has to consist
of entries of probability = P(englishword|frenchword) of the form:
englishword frenchword probability
The resulting lexical features for the rules will be appended to the
end of the rules' feature vectors, first the ones corresponding to the
--LexicalWeightFile files specified, then the ones corresponding to the
--LexicalWeightFileReversed files specified.
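For concreteness, here is one common lexical-weight variant (a sketch of the general idea, not necessarily the exact formula filterrules.pl implements): for each source word, take the best available P(f|e) over the rule's target words and multiply these maxima together.

```python
# Sketch of one common lexical weighting; not necessarily the exact
# formula used by filterrules.pl.
def load_lexicon(lines):
    # each line: "frenchword englishword probability"
    # (the --LexicalWeightFile format described above)
    lex = {}
    for line in lines:
        f, e, p = line.split()
        lex[(f, e)] = float(p)
    return lex

def lexical_weight(src_words, tgt_words, lex):
    w = 1.0
    for f in src_words:
        w *= max(lex.get((f, e), 0.0) for e in tgt_words)
    return w
```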
To speed up the MER training (feature weight tuning) process (see next
section), you might want to restrict the set of features to the ones
that are relevant for you. You can do this with the parameter
--RestrictToFeatures=v where v is a binary vector (elements separated
by '_') indicating which features to keep. For example,
--RestrictToFeatures=1_1_1_0_0_0_0_1_1_1_1_0_0_0_0_0_0_0_0_0_0_0_0
would project the rule features down to features 1 to 3 and 8 to 11.
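The projection itself is straightforward; a sketch:

```python
# Sketch of the --RestrictToFeatures projection: keep feature i iff
# position i of the '_'-separated binary mask is 1.
def restrict_features(features, mask_string):
    mask = [int(b) for b in mask_string.split("_")]
    return [f for f, keep in zip(features, mask) if keep]
```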
As with the rule extraction, there are many other parameters that can be
specified for rule filtering. Check the top of
SAMT/scripts/filterrules.pl for a detailed description.
By the way: You can convert one or more text files into a rule database using rules2db.pl and convert a rule database back into text format using db2rules.pl. Doing the latter and piping the resulting output to filterrules.pl thus enables you to re-filter an already-filtered rule database.
Parameter tuning on the development set
Once you have your rule database, you can translate, but first you
should tune the feature weights on a held-out development set. Run (took around 2.5 hours on our Intel(R) Xeon(TM) CPU 3.60GHz):
FastTranslateChart \
--NumReferences 1 --SentenceList dev2006-small.es --ReferenceList dev2006-small.en.mtevalpp --LMFiles europarl-tiny.en.srilm \
--RuleDB rules.db \
--NormalizingScript $SAMT/scripts/punctuation-postprocess-eval4.pl \
--ScoringMetric IBMBLEU --RemoveUnt 0 --MEROptimize 1 --IterationLimit 20 --Opti_NumModify 5 --Opti_Epsilon 0.0001 \
--NBest 1000 --ExtractUnique 1 \
--PruningMap 0-200-5-@_S-400-5 --ComboPruningFuzzCostDifference 4 --MaxCombinationCount 8 --MaxRuleAppCountDifference 5 \
--HistoryLength 2 --SRIHistoryLength 2 --RecomputeLMCostsDuringNBest 0 --RescoreLM 1 \
--FeatureWeightsParsing 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 \
--Verbosity 0 --DisplayNBestData 2 \
>& log-dev &
You can specify the option -f to have parameters read in from a GNU-style
parameter file (one line per parameter, no leading '--'; use a leading
'#' for comments). Parameters passed on the command line override
those from the param file.
Provide the reference translations via ReferenceList. If there are
multiple references (e.g., 16), the reference file consists of e.g. 16
lines for the first test sentence, followed by 16 lines for the second
test sentence, and so forth. The number of references per test
sentence, if more than 1, has to be specified via --NumReferences.
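The reference file layout can be sketched as a simple grouping (illustration only):

```python
# Sketch of the --ReferenceList layout: with NumReferences = k, the
# file holds k consecutive lines per test sentence.
def group_references(lines, num_references):
    assert len(lines) % num_references == 0
    return [lines[i:i + num_references]
            for i in range(0, len(lines), num_references)]
```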
Supply the scoring metric with the --ScoringMetric parameter. There are
currently three supported metric scores: IBMBLEU, NISTBLEU and NIST.
Note that in the example above, we are running MER training, treating the
200-sentence dev2006-small data as development data. In practice, this is a bit too small a set to estimate
parameters from, especially if there is only one reference per sentence.
When using one reference per sentence, your development set should contain about 1000 sentences.
You can speed up the optimization process by making it greedier. For
that, supply --Opti_TwoStage=0 to the translation call. You also have
the option to remove untranslated words (by default these are kept in
the translation output and thus count toward the scoring metrics). Set
"--RemoveUnt 1" to remove untranslated words.
More details about the parameters used can be found by typing "FastTranslateChart" without arguments, or
looking into the FastTranslateChart.cc file.
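To give a feel for what the MER step does, here is a deliberately tiny sketch: rerank fixed n-best lists under trial weights and keep any weight change that lowers the corpus-level error. The real optimizer uses Och-style exact line search rather than the naive grid sweep shown here, and all names and data below are invented:

```python
# Toy sketch of minimum-error-rate tuning on fixed n-best lists.
# The real MER tool uses Och-style exact line search, not a grid sweep.
def one_best(nbest, weights):
    # nbest: list of (translation, feature_vector); lowest cost wins
    return min(nbest, key=lambda h: -sum(w * f for w, f in zip(weights, h[1])))[0]

def mer_tune(nbest_lists, references, error, weights):
    grid = [x / 10.0 for x in range(21)]   # trial values 0.0 .. 2.0
    best_err = error([one_best(nb, weights) for nb in nbest_lists], references)
    improved = True
    while improved:
        improved = False
        for i in range(len(weights)):
            for v in grid:
                trial = weights[:i] + [v] + weights[i + 1:]
                err = error([one_best(nb, trial) for nb in nbest_lists], references)
                if err < best_err:
                    best_err, weights, improved = err, trial, True
    return weights, best_err
```

In the real system, each MER iteration re-decodes with the updated weights, merges the new n-best lists with the old ones, and re-optimizes until no new translations appear.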
System output
While FastTranslateChart is running, use the command
grep Sent log-dev
to see how the translation is progressing and how long it takes per sentence. Note that for the first sentences,
FastTranslateChart will be slower than average because the rule database is not in the cache yet.
Look through the log-dev file; output of particular importance includes:
- TrainTranslationScore: The metric score after each iteration (will converge to 26.41% in our example)
- OracleTranslationScore: The best possible metric score after
each iteration, that would have been achieved when choosing for
each sentence the best translation from the n-best list (will converge to 30.07%)
- FinalScore: Score after current iteration's MER optimization step (on the combined n-best list of all iterations so far). Should converge to 26.46%
- TotalNewTranslations: Number of new unique translations generated after each MER iteration. Should become zero (this is the stopping criterion,
but the parameter --IterationLimit can be used to specify a hard limit).
- Got Params: Parameters found after MER optimization
- grep -A2 'unique tr' log-dev | grep 'm_totalCost' prints the top translation output for each sentence with parse and synchronous-span information
Parallelizing translation
Especially for parameter tuning, which usually requires about 10 iterations over the development set to reach
reasonable parameters, it is beneficial to farm the translations out to a computer cluster. We do this using Condor. If you are able
to use a cluster, look into the script 'runbees.sh'. This script starts up several FastTranslateChart 'bees', each of which looks for
the next untranslated sentence, grabs a lock for it, and processes it. After all translations are complete, the bees report back as 'done' and die, and the runbees script calls 'MER', a stand-alone tool (which should also be in your 'dist' directory after compiling) that performs only the MER optimization step. Then runbees starts up the bees for the next round of translation, based on MER's returned optimal feature weights. In the section describing our IWSLT'07 training and testing commands you can find an example call of runbees.
Finally: using or testing the system
After MER training, the final output should be some line like
Final feature weights for parsing:
1.08194_0.553082_0.357918_-1.43708_0.02299_0.0244001_0_-0.240129_0.0399453_0.504576_0.0971147_-0.737959_0.470949_-0.24972_1.20947e-07_-1.3741_-0.0336305_1.70817_0.0125906_0_-0.108608_0.0742462_-0.155831_0
These are the feature weights
learned by MER. For real translation calls of FastTranslateChart, supply them with the --FeatureWeightsParsing parameter.
NB: In test mode, you don't need to generate n-best lists, so you can
set --NBest=1; remember to also turn off MER, i.e.,
--MEROptimize=0.
4 - Commands we used for our IWSLT-07 submission
For our official IWSLT-2007 Chinese-English submission, which ranked third in the evaluation, we used the following commands to train and tune the system:
cd /SMT/Projects/IWSLT-Eval/IWSLT-2007/Chinese-to-English
cp ./train/IWSLT07_CE_training_E.txt ./train/IWSLT07_CE_training_C.txt /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/
cat ./dev/TXT/IWSLT07_CE_devset1_CSTAR03_C.txt | replace '^.*\\' '' | replace '(.*\n)' '$1$1$1$1$1$1$1$1$1$1$1$1$1$1$1$1' > /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset1_C.txt
cat ./dev/TXT/IWSLT07_CE_devset2_IWSLT04_C.txt | replace '^.*\\' '' | replace '(.*\n)' '$1$1$1$1$1$1$1$1$1$1$1$1$1$1$1$1' > /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset2_C.txt
cat ./dev/TXT/IWSLT07_CE_devset3_IWSLT05_C.txt | replace '^.*\\' '' | replace '(.*\n)' '$1$1$1$1$1$1$1$1$1$1$1$1$1$1$1$1' > /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset3_C.txt
cat ./dev/TXT/IWSLT07_CE_devset5_IWSLT06_C.txt | replace '^.*\\' '' | replace '(.*\n)' '$1$1$1$1$1$1$1' > /nfs/islpc7_0/iwlst07/data/devset5_C.txt
cat ./dev/TXT/IWSLT07_CE_devset1_CSTAR03_E.mref.txt | replace '^.*\\' '' > /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset1_E.txt
cat ./dev/TXT/IWSLT07_CE_devset2_IWSLT04_E.mref.txt | replace '^.*\\' '' > /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset2_E.txt
cat ./dev/TXT/IWSLT07_CE_devset3_IWSLT05_E.mref.txt | replace '^.*\\' '' > /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset3_E.txt
cat ./dev/TXT/IWSLT07_CE_devset5_IWSLT06_E.mref.txt | replace '^.*\\' '' > /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset5_E.txt
cat ./dev/TXT/IWSLT07_CE_devset4_IWSLT06_E.mref.txt | replace '^.*\\' '' > /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset.en
cat ./dev/TXT/IWSLT07_CE_devset4_IWSLT06_C.txt | replace '^.*\\' '' > /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset.utf8
pushd /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/
cat *E.txt | replace '^.*\\' '' > train.en
cat *C.txt | replace '^.*\\' '' > train.utf8
set TGT = `expandfilename train.en`
# NOTE: the program eng_tokenizer.pl is third party and not available with our distribution, but the
# preprocessing script tokenizer.perl from the MOSES distribution should do pretty much the same
cat $TGT | eng_tokenizer.pl | sbmtpreprocess.pl > $TGT.sbmtpp.tokenized
punctuation-preprocess.pl --TrueCasingData train.en.sbmtpp.tokenized < train.en.sbmtpp.tokenized > train.en.sbmtpp.tokenized.punctpp
sbmtpreprocess.pl < train.utf8 > train.utf8.sbmtpp
sbmtpreprocess.sh devset.utf8 realtest.utf8
cat devset.en | mteval-preprocess-references.pl > devset.en.iwsltpp
# Now comes the training
cd ..
mkdir train2
cd train2
ln -s ../data/train.en.sbmtpp.tokenized.punctpp train.en
ln -s ../data/train.utf8.sbmtpp train.utf8
set MOSES_SCRIPTS_ROOT=/nfs/islpc7_0/moses/bin/moses-scripts/scripts-20070418-2137
$MOSES_SCRIPTS_ROOT/training/clean-corpus-n.perl train utf8 en train1 1 40
nohup $MOSES_SCRIPTS_ROOT/training/train-factored-phrase-model.perl -scripts-root-dir $MOSES_SCRIPTS_ROOT -root-dir . -corpus train1 -f utf8 -e en -alignment grow-diag-final-and -first-step 1 -last-step 4 >& train-alignment.log &
nohup $MOSES_SCRIPTS_ROOT/training/phrase-extract/extract model/aligned.0.en model/aligned.0.utf8 model/aligned.grow-diag-final-and extract 10 --OnlyOutputSpanInfo > extraction-log
condorsubmit.sh 3000 parse-charniak 'parse-charniak.sh < train1.en > train1.en.parsed'
(extractrules.pl --MaxSourceWordCount=10 --MaxTargetWordCount=15 --MaxSourceSymbolCount=10 --MaxTargetSymbolCount=15 --MaxSubstitutionCount=2 --PhrasePairFeedFile ../train1/extraction-log --TargetParseTreeFile train1.en.parsed --MaxSourceLength 10 -r devset-realtest.utf8 | gzip > extractrules.out.gz ) >& log-extr &
setenv PATH $andi3/SBMT-0705/dist64b:$PATH
condorsubmit.sh 3000 mergerules 'zcat extractrules.out.gz | sortsafe.sh -T /tmp | MergeRules 0 0 8 8 0 | gzip > mergedrules.gz'
condorsubmit.sh 3000 filterrules '(zcat mergedrules.gz | filterrules.pl --NullRules --UseNULL --UseRefinedIBM1ProbEstimate --noAllowAbstractRules --PhrasalFeatureCount=0 --MeanTargetSourceRatio=1.14 --cachesize 4000 --PhrasalFeatureCount 0 --LexicalWeightFile model/lex.0-0.n2f --LexicalWeightFileReversed model/lex.0-0.f2n --BeamFactorLexicalRules 0.05 --BeamFactorNonLexicalRules 0.05 -o rules-withnull-onlinefiltered.db --MinOccurrenceCountLexicalrules=0 --MinOccurrenceCountNonlexicalrules=0 ) > & log-filterrules'
cat train1.en | $andi3/programs64/srilm/bin/i686-m64_c/ngram-count -memuse -text - -order 5 -gt2min 0 -gt3min 0 -gt4min 2 -gt5min 2 -kndiscount -interpolate -lm train1.en.5gram.srilm
# PARAMETER TUNING
pushd /data/rack0temp14/SMT/iwslt07ce
setenv CONDOR_SMALL "condorsubmit.sh 3000"
setenv CONDOR_LARGE "condorsubmit.sh 3000"
setenv CONDOR_MERGE "condorsubmit.sh 3000"
setenv PATH $andi3/SBMT-0705/dist64b:$PATH
set SRC = /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset.utf8
set REF = /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset.en.iwsltpp
set NREF = 7
set RULEDB = /data/rack0temp14/SMT/iwslt07ce/rules-withnull-onlinefiltered.db
set LM = /nfs/islpc7_2/ashishv/iwslt/iwlst07/train1/train1.en.5gram.srilm
set SCALING = 0.365239_1.44995_-0.30802_-0.185773_-0.00197066_0.00647094_-0.038661_0.00752506_0_0.091088_-0.0187804_-0.275446_0.104277_0_-0.00810137_-0.0928521_0.0364509_0.124571_0.0571254_0.487417_0.0308388_0.107574_0.00515493_0.00470195
( nohup runbees.pl --params "--RuleDB ${RULEDB} --SentenceList ${SRC} --ReferenceList ${REF} --NumReferences ${NREF} --NormalizingScript $andi3/SBMT/scripts/punctuation-postprocess-eval4.pl --LMFiles ${LM} --RecomputeLMCostsDuringNBest 1 --RescoreLM 0 --SRIHistoryLength 4 --HistoryLength 2 --MEROptimize 1 --IterationLimit 20 --PruningMap 0-100-7-@_S-200-7 --ComboPruningBeamSize 10000 --ComboPruningFuzzCostDifference 5 --MaxCostDifferencePerCell inf --MaxCombinationCount 10 --NBest 2000 --HypsPerNode 1000 --RemoveUnt 0 --MaxRuleAppCountDifference 5 --RA_Mult 0 --RA_UpdateHyp 1 --RA_LowestCount 0 --NBestSubSpans 0 --Verbosity 0 --TagSetList @_dummy --ScoringMetric IBMBLEU --DisplayNBestData 2 --CompareTargetWords 1 --NBestStyle 2 --ExtractUnique 1 --PruningMapUseBoundary 1 --Opti_NumModify 10" --merparams "--Opti_NumModify 10 --ScoringMetric IBMBLEU --Opti_Epsilon 0.0001" --numbees 35 --expdir /data/rack0temp14/SMT/iwslt07ce/expibm8-mcc10-withnull-onlinefiltered --scaling ${SCALING} --iterlimit 20 --iterstart 0 ) > & log-exp8withnull &
# gets to 30.48 InitialScore, 30.90 FinalScore, and 30.16 TrainTransScore after 6 iterations
# gets to 30.8224 InitialScore AND FinalScore and 30.4 TrainTransScore after 8 iterations
# REAL TEST
setenv CONDOR_SMALL "condorsubmit.sh 3000"
setenv CONDOR_LARGE "condorsubmit.sh 3000"
setenv CONDOR_MERGE "condorsubmit.sh 3000"
setenv PATH $andi3/SBMT-0705/dist64b:$PATH
set SRC = /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/realtest.utf8
set REF = /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/realtest.utf8
set NREF = 1
set LM = /nfs/islpc7_2/ashishv/iwslt/iwlst07/train1/train1.en.5gram.srilm
set SCALING = 0.0209398_0.0233861_2.1456e-05_-0.0100184_0.00216776_-0.000383287_0_0.00178281_0_0.00731606_-0.00307071_1.65476_0_0.00149819_0.0125884_0.00380715_0.0018003_0.0107588_0_-0.036329_0.00201955_0.00719299_0_0.000755101
set RULEDB = /data/rack0temp14/SMT/iwslt07ce/rules-withnull-onlinefiltered.db
( nohup runbees.pl --params "--RuleDB ${RULEDB} --SentenceList ${SRC} --ReferenceList ${REF} --NumReferences ${NREF} --NormalizingScript $andi3/SBMT/scripts/punctuation-postprocess-eval4.pl --LMFiles ${LM} --RecomputeLMCostsDuringNBest 1 --RescoreLM 0 --SRIHistoryLength 4 --HistoryLength 2 --MEROptimize 1 --IterationLimit 20 --PruningMap 0-100-7-@_S-200-7 --ComboPruningBeamSize 10000 --ComboPruningFuzzCostDifference 5 --MaxCostDifferencePerCell inf --MaxCombinationCount 10 --NBest 2000 --HypsPerNode 1000 --RemoveUnt 0 --MaxRuleAppCountDifference 5 --RA_Mult 0 --RA_UpdateHyp 1 --RA_LowestCount 0 --NBestSubSpans 0 --Verbosity 0 --TagSetList @_dummy --ScoringMetric IBMBLEU --DisplayNBestData 1 --CompareTargetWords 1 --NBestStyle 2 --ExtractUnique 1 --PruningMapUseBoundary 1 --Opti_NumModify 10" --merparams "--Opti_NumModify 10 --ScoringMetric IBMBLEU --Opti_Epsilon 0.0001" --numbees 20 --expdir /data/rack0temp14/SMT/iwslt07ce/realtestexpibm8-mcc10-withnull-onlinefiltered --scaling ${SCALING} --iterlimit 1 --iterstart 0 ) > & log-realtestexpibm8withnull &
nbestsent-to-1best.sh realtestexpibm8/iterXXX/ > realtestexpibm8/translations.txt
perl ~joy/Eval/EvaluateXlat.pl -b IWSLT_JE07 samt translations.txt
38.08 BLEU (LP .917)