SAMT System Documentation
Table of contents
1 - Overview
2 - Installation
3 - Example walk-through
4 - Commands we used for our IWSLT-07 submission
1 - Overview
The SAMT system consists of three parts:
- Extraction of statistical translation rules from a training
corpus; either plain hierarchical rules a la Chiang (2005) or
syntax-augmented rules a la Zollmann&Venugopal (2006).
- CKY+ style chart-parser employing the statistical translation rules to translate test sentences
- A minimum-error-rate (similar to Och 2003) optimization tool (integrated into the chart
parser or as a standalone tool) to tune the parameters of the underlying log-linear model on a
held-out development corpus
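To make the log-linear model concrete: the decoder combines several feature scores with tunable weights and picks the derivation with the lowest combined cost; MER training adjusts the weights. A minimal sketch (the feature values and weights below are invented for illustration, not from a real SAMT model):

```python
# Illustrative sketch of log-linear scoring; feature values and
# weights are invented, not from a real SAMT model.
def loglinear_cost(features, weights):
    # decoders of this kind minimize cost = -sum_i w_i * f_i
    return -sum(w * f for w, f in zip(weights, features))

def best_hypothesis(hypotheses, weights):
    # hypotheses: list of (translation, feature_vector) pairs
    return min(hypotheses, key=lambda h: loglinear_cost(h[1], weights))[0]
```

MER optimization then searches for the weight vector under which best_hypothesis yields the highest metric score on a development set.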
2 - Installation
Components of SAMT system
The decoding and MER components of the SAMT system are built in C++
and link against the Berkeley DB shared libraries. The SRI LM toolkit
is used in the decoder, but due to various compiler compatibility
issues, we have simply included the relevant code from the SRI
toolkit in our source distribution. Note: the included SRI code dates
back a few versions in the SRI history; this artifact will
eventually be addressed.
For phrase extraction we provide instructions to create a
generalized, annotated rule table, which is stored as a Berkeley DB
file. While
you can use any underlying phrase extraction toolkit to generate the
pure lexical phrases, we provide instructions to interface directly
with Philipp Koehn's toolkits. Our rule creation process is written
primarily in Perl, so you will need to extend your Perl installation
to handle Berkeley DB.
In summary, we will be using or updating the following tools.
- Moses: used to perform (non-hierarchical) phrase extraction
- Charniak parser: generates target language parse trees
- Berkeley DB: used to store hierarchical and syntax structured grammars on disk
- Perl : Used in our scripts that extract hierarchical and syntax
structured models
Perl
You need Perl >= 5.8.7. Make sure the executable is in your $PATH.
Often, you will not be able to add modules to the perl distro on your
computing environment. If this is the case, here is how to install it
locally.
- wget http://downloads.activestate.com/ActivePerl/Linux/5.8/ActivePerl-5.8.8.820-x86_64-linux-glibc-2.3.3-gcc-274679.tar.gz
- unpack the tarball and run sh install.sh from the extracted directory
- set PERL_DIR = /usr2/ashishv/external-tools/ActivePerl-5.8
- setenv PATH $PERL_DIR/bin:${PATH}
- rehash
You will also need to install the following modules into your perl distribution:
- Set::IntRange (agree to the dependencies Bit::Vector,Carp::Clan)
- Log::Log4perl
Install modules into the Perl found in your path like this:
cpan
install Set::IntRange
You also need the perl module Tree-R-0.05 by Ari Jolma, which is
included with this distribution.
To install this module, type the following from the SAMT directory:
cd Tree-R-0.05
perl Makefile.PL
make
make test
make install
Berkeley DB
You need a recent version of Berkeley DB, freely available at www.sleepycat.com
(to avoid network filesystem problems, use at least 4.4.20!), and the
Perl module BerkeleyDB, which is included in the Berkeley DB
distribution but is not installed by default.
This is how to install Berkeley DB locally (i.e., without root rights), assuming you are in the Berkeley DB's main directory:
cd build_unix
../dist/configure --prefix=whereyouwantittobe/programs/BerkeleyDB-4.4.20.NC --enable-cxx --disable-shared
make
make install
Note: Setting the --enable-cxx flag will generate the db_cxx.h header file and the corresponding C++ libraries; you MUST do this to generate the libraries that the decoder links against.
Then for the Perl module BerkeleyDB, change into perl/BerkeleyDB and
modify config.in to match the place of your Berkeley DB, e.g.:
INCLUDE = /nfs/islpc7_1/Andreas/programs/BerkeleyDB-4.4.20.NC/include
LIB = /nfs/islpc7_1/Andreas/programs/BerkeleyDB-4.4.20.NC/lib
Then type:
perl Makefile.PL
make
make test
make install
GIZA++ and Moses training scripts
Unless you have your own system to extract phrase pairs from the
training corpus, you will need GIZA++ (freely available at www.fjoch.com/GIZA++.html) for
the word alignment and Moses phrase extraction scripts (version from April 2007 or later) - freely available at
www.statmt.org.
Compile GIZA++ (with -DBINARY_SEARCH_FOR_TTABLE) and mkcls following the instructions in the GIZA++ distribution.
Syntactically Structured Translation Rules
If you want to make use of the syntactic capabilities of our system,
you need to create syntactic Penn treebank style parse trees of the
target-language side of the training corpus. Our training
scripts assume the use of Eugene Charniak's parser, version
05Aug16 (freely available at ftp://ftp.cs.brown.edu/pub/nlparser),
with a slight modification to the file PARSE/parseIt.C that makes sure
that the parsing output lines don't get out of sync with the English
training text lines in case of a failed parse. We have included the
modified version of parseIt.C in directory ./charniakparserchanges. Copy
that file into your ./parser05Aug16/PARSE directory, change into that
directory, and type
make parseIt
to recompile the parser for your system.
The modification ensures that an output line is produced even for
failed parses. In case you don't want to use the provided version of
Charniak's parser, this is the DIFF of the modified file parseIt.C:
82c82,86
< if(len > params.maxSentLen) continue;
---
> //az (next if modified)
> if(len > params.maxSentLen) {
> cout << "_fail (maxSentLen exceeded)" << endl << endl;
> continue;
> }
Compiling the chart-parsing decoder
The decoder uses g++ and the GNU automake/autoconf machinery to
automate the build process. Here is a known working configuration
of GNU tools used to build the system. Other configurations might
work, but keeping track of these revisions is harder than doing
research. This setup was tested on 64-bit machines running Fedora Core 5:
- g++-4.1.0
- automake 1.7.9
- autoconf 2.13
- libtoolize 1.5
Several people have reported difficulty getting the system working via this
automake and autoconf system. As an alternative, we also provide a script
that can generate a makefile for you. The script is called
generatemakefile.pl and is found in the dist directory.
- cp -r ./dist ./myoptions (myoptions will correspond to the compile flags you pick, e.g., Opti vs. Debug)
- cd myoptions; perl generatemakefile.pl --bdb /pathtoyourbdb/ >
Makefile
- make
Here is the structure of the SAMT.tar.gz that you downloaded
- src : source code for CKY+ decoder
- dist: files that make autoconf/automake tick
- doc: contains a Doxygen file to generate documentation
- scripts: scripts to generate a rule table
- examples: a sample environment to test your installation
Follow these steps (from the dir above src) to build a binary version of
the CKY+ decoder called FastTranslateChart via the automake and autoconf
mechanism.
- cp -r ./dist ./myoptions (myoptions will correspond to the compile flags you pick, e.g., Opti vs. Debug)
- cd myoptions; autoreconf -i (will try to learn something about your system setup)
- myoptions> ./configure CXXFLAGS="-O2 -DNDEBUG
-DHAVE_CXX_STDHEADERS -DINSTANTIATE_TEMPLATES -I${BDB_INCLUDE}"
LDFLAGS="-L${BDB_LIB} -ldb_cxx"
To produce optimized code, use -O2, NOT -O3; we have not had great luck getting consistent code from -O3.
- myoptions> ln -s ../src src
- make
You will probably have the most difficulty with the ./configure ...
line, since linking to the Berkeley DB libs can be a struggle. Again, ensure
that you have run "make install" in the Berkeley DB installation, and
that you can find the libdb_cxx.a library. Sometimes you might have to
add the Berkeley DB lib path to your LD_LIBRARY_PATH to get it to
locate the libraries successfully.
Path issues
Add the ./scripts directory to your $PATH.
3 - Example Walk-through
We now explain how to train, tune, and use the translation
system, using as a running example the Europarl Spanish-English corpus from the
ACL 2007 MT Workshop, which is freely available at http://www.statmt.org/wmt07/.
We have put the first 10,000 lines from that corpus into the
SAMT/examples/europarl directory. Try to walk through the following steps, starting in this directory.
Preprocessing
To make the data compatible with the Charniak parser, "..." and ".."
have to be removed from the English side and empty lines have to be modified to contain at least one symbol.
To avoid confusion with meta
symbols in our system, in both source and target data, "#" must be
removed and no "@" is allowed to stand at the beginning of a word.
To do all this, simply go into your training data directory, and type
sbmtpreprocess.sh *
This will preprocess all files in that directory and keep the original files under the same names plus extension ".bak".
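For illustration, the normalization rules just described can be sketched as follows. This is a hypothetical re-implementation, not the actual sbmtpreprocess.sh; in particular, the placeholder token used for empty lines is invented:

```python
import re

# Hypothetical sketch of the normalization described above
# (NOT the actual sbmtpreprocess.sh).
def samt_preprocess(line, is_target=True):
    if is_target:
        # "..." and ".." confuse the Charniak parser
        line = line.replace("...", " ").replace("..", " ")
    line = line.replace("#", " ")            # '#' is a meta symbol in SAMT
    line = re.sub(r"(^|\s)@", r"\1", line)   # no word may start with '@'
    line = re.sub(r"\s+", " ", line).strip()
    return line if line else "EMPTYLINE"     # invented placeholder: empty lines need a symbol
```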
Also, you might want to remove lines with too many words from the corpus to avoid prohibitively long training times for these
sentences. You can use the following Moses script for this:
$MOSES_SCRIPTS_ROOT/training/clean-corpus-n.perl europarl-small es en europarl-tiny 1 40
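The effect of this cleaning step can be sketched as follows (a simplification: the real clean-corpus-n.perl may apply further checks, such as a length-ratio limit):

```python
# Sketch of parallel-corpus length cleaning: keep only sentence pairs
# where both sides fall within [min_len, max_len] tokens, matching the
# "1 40" arguments in the command above.
def clean_corpus(src_lines, tgt_lines, min_len=1, max_len=40):
    kept = []
    for s, t in zip(src_lines, tgt_lines):
        if (min_len <= len(s.split()) <= max_len
                and min_len <= len(t.split()) <= max_len):
            kept.append((s, t))
    return kept
```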
Preprocessing your development set references
Later we will tune the SAMT model's feature weights on a development corpus towards a given metric (e.g. IBM-BLEU score).
In order for that tuning process to be as close as possible to the real test scenario, translations and references
should be normalized in the same way as during actual testing. Note that this is not a symmetrical process: Normalization
of the reference is a given process defined by the evaluation campaign, whereas translation output normalization can contain
additional steps such as capitalizing the first word and removing double-punctuation marks. Therefore, the normalizing script (this can be any executable file) called during devset tuning only normalizes the translations, not the respective references. You need
to normalize the references manually. For the standard mt-eval normalization, run (the script is in the SAMT/scripts directory and therefore in your path):
mteval-preprocess-references.pl < dev2006-small.en > dev2006-small.en.mtevalpp
and, in order to be reported correct scores for your test set:
mteval-preprocess-references.pl < test2006-small.en > test2006-small.en.mtevalpp
Word-aligning the training corpus and extracting the phrases
This call will run GIZA++ to compute IBM4 word alignments for the training sentence pairs, and will also join the Spanish-English
and English-Spanish alignments using the grow-diag-final-and method, as well as compute the
word-to-word lexical relative frequencies (source- and target-conditioned):
$MOSES_SCRIPTS_ROOT/training/train-factored-phrase-model.perl -scripts-root-dir $MOSES_SCRIPTS_ROOT -root-dir . -corpus europarl-tiny -f es -e en -alignment grow-diag-final-and -first-step 1 -last-step 4 >& train-alignment.log &
Afterwards, run this line to extract phrase pair spans with maximum source length 8 and save them in the file `extraction-log':
$MOSES_SCRIPTS_ROOT/training/phrase-extract/extract model/aligned.0.en model/aligned.0.es model/aligned.grow-diag-final-and extract 8 --OnlyOutputSpanInfo > extraction-log
Note: Sometimes GIZA++ fails on sentences (e.g. when they are too long). This shouldn't happen in your case because you removed
sentences that are too long. To be sure, you can double-check that the file `model/aligned.0.en' is identical to the file
`europarl-tiny.en'. If the former file is different, then it will be a subset of your original corpus. In that case, you have
to parse based on `model/aligned.0.en' to obtain matching parses. Usually this mismatch shouldn't occur and you can thus
parallelize the parsing step.
Training the Language Model
Run
cat europarl-tiny.en | $SRILM/bin/i686-m64_c/ngram-count -text - -order 3 -kndiscount -interpolate -lm europarl-tiny.en.srilm
where $SRILM is your SRI language model directory.
Parsing the target-side training sentences (only needed when creating syntax-augmented rules)
Assuming that you installed the Charniak parser in directory /nfs/islpc7_1/Andreas/programs/parser05Aug16, and that your English training sentences are in file europarl-tiny.en, type:
(cat europarl-tiny.en | replace
'^(.*)\n' '<s> \1 <\/s>\n' |
/nfs/islpc7_1/Andreas/programs/parser05Aug16/PARSE/parseIt -T20 -l400
-N1 /nfs/islpc7_1/Andreas/programs/parser05Aug16/DATA/EN/ | replace
'^\n' '' >targetparsetrees) >& parselog &
This should take around 10 minutes. The script replace is in SAMT/scripts (and thus in your $PATH) and replaces ARG1 by ARG2.
You can trade off parsing run-time against parsing accuracy by
modifying the parameter "-T" (the smaller, the faster; -T210 is default
speed, -T50 is supposed to lose one percent of accuracy).
As a sanity test, use `wc' to check that the resulting parse tree file has the same number of lines as your input file.
Rule extraction
You are now ready to run `extractrules.pl' (takes around 1 hour):
(extractrules.pl --PhrasePairFeedFile extraction-log --TargetParseTreeFile targetparsetrees | gzip > extractrules.out.gz) >& log-extractrules &
The perl script extracts SAMT rules, sentence by sentence, and writes them to standard output. If you don't specify
the --TargetParseTreeFile parameter, non-syntactic hierarchical rules will be extracted. You can also pipe
the extraction log into STDIN by specifying `--PhrasePairFeedFile -'.
The following parameters (case insensitive) restrict the kind of non-lexical rules (i.e. rules containing at least one nonterminal) allowed.
Note that lexical rules (i.e., phrase pairs) are not being restricted by these parameters.
- --MAXSOURCEWORDCOUNT maximum number of source words in created rules
- --MAXTARGETWORDCOUNT same for target
- --MAXSOURCESYMBOLCOUNT maximum number of source symbols (source words + substitution sites) in created rules
- --MAXTARGETSYMBOLCOUNT same for target
- --MAXSUBSTITUTIONCOUNT maximum number of nonterminal substitution sites allowed in created rules
Note also that --MAXSOURCESYMBOLCOUNT and --MAXSUBSTITUTIONCOUNT are
the two parameters that can significantly influence execution speed, as
they allow for omission of recursive subroutine calls.
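To make the word/symbol distinction concrete, here is a toy counter using the rule notation that appears later in this document ('@'-prefixed tokens are nonterminals, as in "@PN va / @1 goes"). This is an illustration only, not code from extractrules.pl:

```python
# Toy illustration of the counts the restriction parameters refer to.
def rule_counts(rule):
    source, target = [side.split() for side in rule.split(" / ")[:2]]
    nonterm = lambda tok: tok.startswith("@")
    return {
        "source_words":   sum(not nonterm(t) for t in source),
        "target_words":   sum(not nonterm(t) for t in target),
        "source_symbols": len(source),  # words + substitution sites
        "target_symbols": len(target),
        "substitutions":  sum(nonterm(t) for t in source),
    }
```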
On-the-fly test set filtering
To speed up the extraction process and save disk space, you can have
extractrules.pl filter the rules on-the-fly for your development and
test set. Assuming that you concatenated your development and your test
set together into the file test2000andrealtest2000.fr, and that you only have phrases in your phrase table of length up to 12, you can run the filtered rule extraction as follows:
(extractrules.pl --PhrasePairFeedFile extraction-log --TargetParseTreeFile targetparsetrees -r test2000andrealtest2000.fr --MaxSourceLength 12 | gzip > extractedrules-test2000andrealtest2000filtered.gz) >& log-extractrules &
When using -r, you always have to specify a maximum source phrase
length via --MaxSourceLength because we're hashing in all n-grams
of length up to --MaxSourceLength from the dev/test set.
Note that on-the-fly filtering distorts the target-conditioned relative
frequency feature calculation.
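The n-gram hashing idea behind -r can be sketched like this (a hypothetical simplification of what extractrules.pl does: a rule is kept only if every contiguous block of terminals on its source side occurs somewhere in the dev/test set):

```python
# Sketch of on-the-fly test-set filtering via n-gram hashing.
def source_ngrams(lines, max_len):
    grams = set()
    for line in lines:
        toks = line.split()
        for i in range(len(toks)):
            for j in range(i + 1, min(i + max_len, len(toks)) + 1):
                grams.add(tuple(toks[i:j]))
    return grams

def rule_matches(source_side, grams):
    # split the rule's source side at nonterminals into terminal blocks
    block, blocks = [], []
    for tok in source_side.split():
        if tok.startswith("@"):
            if block:
                blocks.append(tuple(block))
                block = []
        else:
            block.append(tok)
    if block:
        blocks.append(tuple(block))
    return all(b in grams for b in blocks)
```

This is why --MaxSourceLength must be given with -r: it bounds the length of the n-grams that get hashed.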
Rule extraction in several pieces
Depending on the size of your training corpus and the rule restriction
parameters, the rule extraction process can take quite a long time. It is,
however, easy to parallelize since extractrules.pl works on a per-sentence
basis. After the word alignment step, chunk the files
`model/aligned.0.en', `model/aligned.0.es' and `model/aligned.grow-diag-final-and'
into pieces and run the phrase span extraction and the parsing and the subsequent
rule extraction separately for each of these pieces.
After all pieces are processed, simply concatenate the respective extracted rules output
from the individual extractrules.pl calls.
Other options
Check out the top of the file SAMT/scripts/extractrules.pl for a detailed explanation of all the command line parameters.
Rule merging
The rules have been extracted individually for each sentence. Therefore, identical rules now
have to be merged. This is done by the C++ program MergeRules (should take about 5 minutes):
zcat extractrules.out.gz | sortsafe.sh -T /tmp | MergeRules 0 0 8 8 0 | gzip > mergedrules.gz
Here /tmp is the temp directory to be used for the unix sort (sortsafe.sh from the script directory makes sure
that no locale is turned on for sorting). By setting the first two parameters for
MergeRules to non-zero values, you can restrict MergeRules to only output rules that occurred a specified
minimum number of times (first parameter: minimum occurrence frequency for lexical rules; second parameter:
minimum occurrence frequency for nonlexical rules). Doing this is very useful if you run out of memory (or time)
during rule filtering (the next step) despite (or because you cannot do) test-set restriction.
You shouldn't need to change the other parameters
for MergeRules.
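Conceptually, the merge step does something like the following sketch. MergeRules itself is C++ and streams over sorted input rather than holding counts in memory, and the lexical/nonlexical test here is a toy approximation:

```python
from collections import Counter

# Toy sketch of MergeRules: collapse identical rules, sum their counts,
# and drop rules below a minimum count (separately for lexical rules,
# i.e. plain phrase pairs, and nonlexical rules).
def merge_rules(rules, min_lex=0, min_nonlex=0):
    merged = {}
    for rule, n in Counter(rules).items():
        is_lexical = "@" not in rule   # toy test: no nonterminal symbol
        if n >= (min_lex if is_lexical else min_nonlex):
            merged[rule] = n
    return merged
```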
Rule filtering
Once your rules are extracted, use filterrules.pl to compute additional
features such as lexical weights for the rules, add the glue rules, possibly merge the set of
nonterminals into clusters, filter the rules for a test set (if not
done on-the-fly during rule extraction), and convert them into a
Berkeley DB B-tree that can be used by our chart-parsing decoder.
You can filter your merged rule file as follows
(again, parameters are case insensitive): [this will take 30 minutes and 3.4 GB of RAM, use test set restriction (see below)
if you don't have a 4GB machine]
(zcat mergedrules.gz | filterrules.pl --cachesize 4000 --PhrasalFeatureCount 0 --LexicalWeightFile model/lex.0-0.n2f --LexicalWeightFileReversed model/lex.0-0.f2n --BeamFactorLexicalRules 0.05 --BeamFactorNonLexicalRules 0.05 --MinOccurrenceCountLexicalrules 0 --MinOccurrenceCountNonlexicalrules 0 -o rules.db) >& log-filterrules &
This creates a rule database in the BerkeleyDB file specified by parameter -o with
23 features (check the top of SAMT/scripts/filterrules.pl for the
description of the individual features). If you do not specify parameter -o then the rules will be output in text form to STDOUT.
Some important parameters (for more, check the beginning of the file filterrules.pl):
- --MinOccurrenceCountLexicalRules C
Removes a LEXICAL rule source/target/type if its occurrence count is less than C and there is another rule source/target2/type2 with higher occurrence count.
- --MinOccurrenceCountNonlexicalRules C
Removes a NONLEXICAL rule source/target/type if its occurrence count is less than C.
- --MinLexSourceCondRFlexicalrules p
Causes filterrules.pl
to drop all lexical rules (i.e., rules without abstractions) whose relative frequency amongst rules with
the same source side is less than p.
- --MinLexSourceCondRFnonlexicalrules p
Does the same for nonlexical rules, where the source side here comprises words and nonterminal symbols, i.e., the rules "@PN va / @1 goes" and "@NN va / @1 goes" do not compete because their source sides are considered different.
- --BeamFactorLexicalRules p
Only output those of the lexical rules whose frequency is at least p * 'frequency of highest-frequency rule with same source side'
- --BeamFactorNonLexicalRules p
Same for non-lexical rules
- --MaxAbstractionCount N
The maximum number of substitution site pairs in a rule is N
- --AllowAbstractRules
Allow rules without any words in them, such as "@VP @NP / @1 @2 / @S". By default, these are not allowed because decoding will become very slow.
- --noAllowAdjacentNonterminals
Don't allow adjacent substitution sites. Employed e.g. by Chiang (2005).
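The relative-frequency and beam-factor filters above can be sketched as follows (an illustration of the described behavior, not the actual filterrules.pl logic):

```python
from collections import defaultdict

# Sketch of per-source-side filtering: within each group of rules
# sharing a source side, drop rules whose relative frequency is below
# min_rf or whose count is below beam * (count of the group's best rule).
def filter_by_source(rules, min_rf=0.0, beam=0.0):
    # rules: list of (source, target, count)
    by_src = defaultdict(list)
    for src, tgt, n in rules:
        by_src[src].append((tgt, n))
    kept = []
    for src, group in by_src.items():
        total = sum(n for _, n in group)
        best = max(n for _, n in group)
        for tgt, n in group:
            if n / total >= min_rf and n >= beam * best:
                kept.append((src, tgt, n))
    return kept
```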
As with extractrules.pl, you can restrict the rules to the ones matching a test set using the parameter "-r testset.fr". YOU SHOULD DEFINITELY DO THIS (preferably already when running extractrules.pl) if you don't need a production system translating spontaneous sentences because this will save you lots of time and memory in the filtering process and also slightly speed up the translation process.
NOTE: If you're tuning to a development set (see below) and later test on a 'real' test set, don't filter rules for the development and test set individually, but
rather run one rule filtering on the concatenated devset+testset,
because the size of the rule database has a (slight) effect on count-based features and their respective optimal weights.
You can also have filterrules.pl compute the lexical phrasal cost
features based on a source- and/or a target-conditioned word-to-word
translation probability file. If you extracted your phrases with
Moses, such files were created in the ./model directory
(lex.0-0.n2f and lex.0-0.f2n). You can even specify multiple such lexica
(e.g., alignment-frequency based one and IBM1 probability based).
Each file passed with option --LexicalWeightFile has to consist of
entries of probability = P(frenchword|englishword) of the form:
frenchword englishword probability
Each file passed with option --LexicalWeightFileReversed has to consist
of entries of probability = P(englishword|frenchword) of the form:
englishword frenchword probability
The resulting lexical features for the rules will be appended to the
end of the rules' feature vectors, first the ones corresponding to the
--LexicalWeightFile files specified, then the ones corresponding to the
--LexicalWeightFileReversed files specified.
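For concreteness, here is one common lexical-weight variant (a sketch of the general idea, not necessarily the exact formula filterrules.pl implements): for each source word, take the best available P(f|e) over the rule's target words and multiply these maxima together.

```python
# Sketch of one common lexical weighting; not necessarily the exact
# formula used by filterrules.pl.
def load_lexicon(lines):
    # each line: "frenchword englishword probability"
    # (the --LexicalWeightFile format described above)
    lex = {}
    for line in lines:
        f, e, p = line.split()
        lex[(f, e)] = float(p)
    return lex

def lexical_weight(src_words, tgt_words, lex):
    w = 1.0
    for f in src_words:
        w *= max(lex.get((f, e), 0.0) for e in tgt_words)
    return w
```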
To speed up the MER training (feature weight tuning) process (see next
section), you might want to restrict the set of features to the ones
that are relevant for you. You can do this with the parameter
--RestrictToFeatures=v where v is a binary vector (elements separated
by '_') indicating which features to keep. For example,
--RestrictToFeatures=1_1_1_0_0_0_0_1_1_1_1_0_0_0_0_0_0_0_0_0_0_0_0
would project the rule features down to features 1 to 3 and 8 to 11.
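The projection itself is straightforward; a sketch:

```python
# Sketch of the --RestrictToFeatures projection: keep feature i iff
# position i of the '_'-separated binary mask is 1.
def restrict_features(features, mask_string):
    mask = [int(b) for b in mask_string.split("_")]
    return [f for f, keep in zip(features, mask) if keep]
```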
As with the rule extraction, there are many other parameters that can be
specified for rule filtering. Check the top of
SAMT/scripts/filterrules.pl for a detailed description.
By the way: You can convert one or more text files into a rule database using rules2db.pl and convert a rule database back into text format using db2rules.pl. Doing the latter and piping the resulting output to filterrules.pl thus enables you to re-filter an already-filtered rule database.
Parameter tuning on the development set
Once you have your rule database, you can translate, but first you
should tune the feature weights on a held-out development set. Run (took around 2.5 hours on our Intel(R) Xeon(TM) CPU 3.60GHz):
FastTranslateChart \
--NumReferences 1 --SentenceList dev2006-small.es --ReferenceList dev2006-small.en.mtevalpp --LMFiles europarl-tiny.en.srilm \
--RuleDB rules.db \
--NormalizingScript $SAMT/scripts/punctuation-postprocess-eval4.pl \
--ScoringMetric IBMBLEU --RemoveUnt 0 --MEROptimize 1 --IterationLimit 20 --Opti_NumModify 5 --Opti_Epsilon 0.0001 \
--NBest 1000 --ExtractUnique 1 \
--PruningMap 0-200-5-@_S-400-5 --ComboPruningFuzzCostDifference 4 --MaxCombinationCount 8 --MaxRuleAppCountDifference 5 \
--HistoryLength 2 --SRIHistoryLength 2 --RecomputeLMCostsDuringNBest 0 --RescoreLM 1 \
--FeatureWeightsParsing 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 \
--Verbosity 0 --DisplayNBestData 2 \
>& log-dev &
You can specify the option -f to have parameters read in from a GNU-style
parameter file (one line per parameter, no leading '--'; use a leading
'#' for comments). Parameters passed on the command line override
those from the param file.
Provide the reference translations via ReferenceList. If there are
multiple references (e.g., 16), the reference file consists of e.g. 16
lines for the first test sentence, followed by 16 lines for the second
test sentence, and so forth. The number of references per test
sentence, if more than 1, has to be specified via --NumReferences.
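The reference file layout can be sketched as a simple grouping (illustration only):

```python
# Sketch of the --ReferenceList layout: with NumReferences = k, the
# file holds k consecutive lines per test sentence.
def group_references(lines, num_references):
    assert len(lines) % num_references == 0
    return [lines[i:i + num_references]
            for i in range(0, len(lines), num_references)]
```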
Supply the scoring metric with the --ScoringMetric parameter. There are
currently three supported metric scores: IBMBLEU, NISTBLEU and NIST.
Note that in the example above, we are running MER training, treating the
200-sentence dev2006-small data as development data. In practice, this is a bit too small a set to estimate
parameters from, especially if there is only one reference per sentence.
When using one reference per sentence, your development set should contain about 1000 sentences.
You can speed up the optimization process by making it greedier. For
that, supply --Opti_TwoStage=0 to the translation call. You also have
the option to remove untranslated words (by default these are kept in
the translation output and thus count toward the scoring metrics). Set
"--RemoveUnt 1" to remove untranslated words.
More details about the parameters used can be found by typing "FastTranslateChart" without arguments, or
looking into the FastTranslateChart.cc file.
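To give a feel for what the MER step does, here is a deliberately tiny sketch: rerank fixed n-best lists under trial weights and keep any weight change that lowers the corpus-level error. The real optimizer uses Och-style exact line search rather than the naive grid sweep shown here, and all names and data below are invented:

```python
# Toy sketch of minimum-error-rate tuning on fixed n-best lists.
# The real MER tool uses Och-style exact line search, not a grid sweep.
def one_best(nbest, weights):
    # nbest: list of (translation, feature_vector); lowest cost wins
    return min(nbest, key=lambda h: -sum(w * f for w, f in zip(weights, h[1])))[0]

def mer_tune(nbest_lists, references, error, weights):
    grid = [x / 10.0 for x in range(21)]   # trial values 0.0 .. 2.0
    best_err = error([one_best(nb, weights) for nb in nbest_lists], references)
    improved = True
    while improved:
        improved = False
        for i in range(len(weights)):
            for v in grid:
                trial = weights[:i] + [v] + weights[i + 1:]
                err = error([one_best(nb, trial) for nb in nbest_lists], references)
                if err < best_err:
                    best_err, weights, improved = err, trial, True
    return weights, best_err
```

In the real system, each MER iteration re-decodes with the updated weights, merges the new n-best lists with the old ones, and re-optimizes until no new translations appear.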
System output
While FastTranslateChart is running, use the command
grep Sent log-dev
to see how the translation is progressing and how long it takes per sentence. Note that for the first sentences,
FastTranslateChart will be slower than average because the rule database is not in the cache yet.
Look through the log-dev file; output of particular importance includes:
- TrainTranslationScore: The metric score after each iteration (will converge to 26.41% in our example)
- OracleTranslationScore: The best possible metric score after
each iteration, that would have been achieved when choosing for
each sentence the best translation from the n-best list (will converge to 30.07%)
- FinalScore: Score after current iteration's MER optimization step (on the combined n-best list of all iterations so far). Should converge to 26.46%
- TotalNewTranslations: Number of new unique translations generated after each MER iteration. Should become zero (this is the stopping criterion,
but the parameter --IterationLimit can be used to specify a hard limit).
- Got Params: Parameters found after MER optimization
- grep -A2 'unique tr' log-dev | grep 'm_totalCost' prints the top translation output for each sentence with parse and synchronous-span information
Parallelizing translation
Especially for parameter tuning, which usually requires about 10 iterations over the development set to reach
reasonable parameters, it is beneficial to farm the translations out to a computer cluster. We do this using Condor. If you are able
to use a cluster, look into the script 'runbees.sh'. This script starts up several FastTranslateChart 'bees', each of which looks for
the next untranslated sentence, grabs a lock for it, and processes it. After all translations are complete, the bees report back as 'done' and die, and the runbees script calls 'MER', a stand-alone tool (which should also be in your 'dist' directory after compiling) that performs only the MER optimization step. Then runbees starts up the bees for the next round of translation, based on MER's returned optimal feature weights. In the section describing our IWSLT'07 training and testing commands you can find an example call of runbees.
Finally: using or testing the system
After MER training, the final output should be some line like
Final feature weights for parsing:
1.08194_0.553082_0.357918_-1.43708_0.02299_0.0244001_0_-0.240129_0.0399453_0.504576_0.0971147_-0.737959_0.470949_-0.24972_1.20947e-07_-1.3741_-0.0336305_1.70817_0.0125906_0_-0.108608_0.0742462_-0.155831_0
These are the feature weights
learned by MER. For real translation calls of FastTranslateChart, supply them with the --FeatureWeightsParsing parameter.
NB: In test mode, you don't need to generate n-best lists, so you can
set --NBest=1; remember to also turn off MER, i.e.,
--MEROptimize=0.
4 - Commands we used for our IWSLT-07 submission
For our official IWSLT-2007 Chinese-English submission, which ranked third in the evaluation, we used the following commands to train and tune the system:
cd /SMT/Projects/IWSLT-Eval/IWSLT-2007/Chinese-to-English
cp ./train/IWSLT07_CE_training_E.txt ./train/IWSLT07_CE_training_C.txt /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/
cat ./dev/TXT/IWSLT07_CE_devset1_CSTAR03_C.txt | replace '^.*\\' '' | replace '(.*\n)' '$1$1$1$1$1$1$1$1$1$1$1$1$1$1$1$1' > /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset1_C.txt
cat ./dev/TXT/IWSLT07_CE_devset2_IWSLT04_C.txt | replace '^.*\\' '' | replace '(.*\n)' '$1$1$1$1$1$1$1$1$1$1$1$1$1$1$1$1' > /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset2_C.txt
cat ./dev/TXT/IWSLT07_CE_devset3_IWSLT05_C.txt | replace '^.*\\' '' | replace '(.*\n)' '$1$1$1$1$1$1$1$1$1$1$1$1$1$1$1$1' > /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset3_C.txt
cat ./dev/TXT/IWSLT07_CE_devset5_IWSLT06_C.txt | replace '^.*\\' '' | replace '(.*\n)' '$1$1$1$1$1$1$1' > /nfs/islpc7_0/iwlst07/data/devset5_C.txt
cat ./dev/TXT/IWSLT07_CE_devset1_CSTAR03_E.mref.txt | replace '^.*\\' '' > /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset1_E.txt
cat ./dev/TXT/IWSLT07_CE_devset2_IWSLT04_E.mref.txt | replace '^.*\\' '' > /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset2_E.txt
cat ./dev/TXT/IWSLT07_CE_devset3_IWSLT05_E.mref.txt | replace '^.*\\' '' > /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset3_E.txt
cat ./dev/TXT/IWSLT07_CE_devset5_IWSLT06_E.mref.txt | replace '^.*\\' '' > /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset5_E.txt
cat ./dev/TXT/IWSLT07_CE_devset4_IWSLT06_E.mref.txt | replace '^.*\\' '' > /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset.en
cat ./dev/TXT/IWSLT07_CE_devset4_IWSLT06_C.txt | replace '^.*\\' '' > /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset.utf8
pushd /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/
cat *E.txt | replace '^.*\\' '' > train.en
cat *C.txt | replace '^.*\\' '' > train.utf8
set TGT = `expandfilename train.en`
# NOTE: the program eng_tokenizer.pl is third party and not available with our distribution, but the
# preprocessing script tokenizer.perl from the MOSES distribution should do pretty much the same
cat $TGT | eng_tokenizer.pl | sbmtpreprocess.pl > $TGT.sbmtpp.tokenized
punctuation-preprocess.pl --TrueCasingData train.en.sbmtpp.tokenized < train.en.sbmtpp.tokenized > train.en.sbmtpp.tokenized.punctpp
sbmtpreprocess.pl < train.utf8 > train.utf8.sbmtpp
sbmtpreprocess.sh devset.utf8 realtest.utf8
cat devset.en | mteval-preprocess-references.pl > devset.en.iwsltpp
# Now comes the training
cd ..
mkdir train2
cd train2
ln -s ../data/train.en.sbmtpp.tokenized.punctpp train.en
ln -s ../data/train.utf8.sbmtpp train.utf8
set MOSES_SCRIPTS_ROOT=/nfs/islpc7_0/moses/bin/moses-scripts/scripts-20070418-2137
$MOSES_SCRIPTS_ROOT/training/clean-corpus-n.perl train utf8 en train1 1 40
nohup $MOSES_SCRIPTS_ROOT/training/train-factored-phrase-model.perl -scripts-root-dir $MOSES_SCRIPTS_ROOT -root-dir . -corpus train1 -f utf8 -e en -alignment grow-diag-final-and -first-step 1 -last-step 4 >& train-alignment.log &
nohup $MOSES_SCRIPTS_ROOT/training/phrase-extract/extract model/aligned.0.en model/aligned.0.utf8 model/aligned.grow-diag-final-and extract 10 --OnlyOutputSpanInfo > extraction-log
condorsubmit.sh 3000 parse-charniak 'parse-charniak.sh < train1.en > train1.en.parsed'
(extractrules.pl --MaxSourceWordCount=10 --MaxTargetWordCount=15 --MaxSourceSymbolCount=10 --MaxTargetSymbolCount=15 --MaxSubstitutionCount=2 --PhrasePairFeedFile ../train1/extraction-log --TargetParseTreeFile train1.en.parsed --MaxSourceLength 10 -r devset-realtest.utf8 | gzip > extractrules.out.gz ) >& log-extr &
setenv PATH $andi3/SBMT-0705/dist64b:$PATH
condorsubmit.sh 3000 mergerules 'zcat extractrules.out.gz | sortsafe.sh -T /tmp | MergeRules 0 0 8 8 0 | gzip > mergedrules.gz'
condorsubmit.sh 3000 filterrules '(zcat mergedrules.gz | filterrules.pl --NullRules --UseNULL --UseRefinedIBM1ProbEstimate --noAllowAbstractRules --PhrasalFeatureCount=0 --MeanTargetSourceRatio=1.14 --cachesize 4000 --PhrasalFeatureCount 0 --LexicalWeightFile model/lex.0-0.n2f --LexicalWeightFileReversed model/lex.0-0.f2n --BeamFactorLexicalRules 0.05 --BeamFactorNonLexicalRules 0.05 -o rules-withnull-onlinefiltered.db --MinOccurrenceCountLexicalrules=0 --MinOccurrenceCountNonlexicalrules=0 ) > & log-filterrules'
cat train1.en | $andi3/programs64/srilm/bin/i686-m64_c/ngram-count -memuse -text - -order 5 -gt2min 0 -gt3min 0 -gt4min 2 -gt5min 2 -kndiscount -interpolate -lm train1.en.5gram.srilm
# PARAMETER TUNING
pushd /data/rack0temp14/SMT/iwslt07ce
setenv CONDOR_SMALL "condorsubmit.sh 3000"
setenv CONDOR_LARGE "condorsubmit.sh 3000"
setenv CONDOR_MERGE "condorsubmit.sh 3000"
setenv PATH $andi3/SBMT-0705/dist64b:$PATH
set SRC = /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset.utf8
set REF = /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/devset.en.iwsltpp
set NREF = 7
set RULEDB = /data/rack0temp14/SMT/iwslt07ce/rules-withnull-onlinefiltered.db
set LM = /nfs/islpc7_2/ashishv/iwslt/iwlst07/train1/train1.en.5gram.srilm
set SCALING = 0.365239_1.44995_-0.30802_-0.185773_-0.00197066_0.00647094_-0.038661_0.00752506_0_0.091088_-0.0187804_-0.275446_0.104277_0_-0.00810137_-0.0928521_0.0364509_0.124571_0.0571254_0.487417_0.0308388_0.107574_0.00515493_0.00470195
( nohup runbees.pl --params "--RuleDB ${RULEDB} --SentenceList ${SRC} --ReferenceList ${REF} --NumReferences ${NREF} --NormalizingScript $andi3/SBMT/scripts/punctuation-postprocess-eval4.pl --LMFiles ${LM} --RecomputeLMCostsDuringNBest 1 --RescoreLM 0 --SRIHistoryLength 4 --HistoryLength 2 --MEROptimize 1 --IterationLimit 20 --PruningMap 0-100-7-@_S-200-7 --ComboPruningBeamSize 10000 --ComboPruningFuzzCostDifference 5 --MaxCostDifferencePerCell inf --MaxCombinationCount 10 --NBest 2000 --HypsPerNode 1000 --RemoveUnt 0 --MaxRuleAppCountDifference 5 --RA_Mult 0 --RA_UpdateHyp 1 --RA_LowestCount 0 --NBestSubSpans 0 --Verbosity 0 --TagSetList @_dummy --ScoringMetric IBMBLEU --DisplayNBestData 2 --CompareTargetWords 1 --NBestStyle 2 --ExtractUnique 1 --PruningMapUseBoundary 1 --Opti_NumModify 10" --merparams "--Opti_NumModify 10 --ScoringMetric IBMBLEU --Opti_Epsilon 0.0001" --numbees 35 --expdir /data/rack0temp14/SMT/iwslt07ce/expibm8-mcc10-withnull-onlinefiltered --scaling ${SCALING} --iterlimit 20 --iterstart 0 ) > & log-exp8withnull &
# gets to 30.48 InitialScore, 30.90 FinalScore, and 30.16 TrainTransScore after 6 iterations
# gets to 30.8224 InitialScore AND FinalScore and 30.4 TrainTransScore after 8 iterations
# REAL TEST
setenv CONDOR_SMALL "condorsubmit.sh 3000"
setenv CONDOR_LARGE "condorsubmit.sh 3000"
setenv CONDOR_MERGE "condorsubmit.sh 3000"
setenv PATH $andi3/SBMT-0705/dist64b:$PATH
set SRC = /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/realtest.utf8
set REF = /nfs/islpc7_2/ashishv/iwslt/iwlst07/data/realtest.utf8
set NREF = 1
set LM = /nfs/islpc7_2/ashishv/iwslt/iwlst07/train1/train1.en.5gram.srilm
set SCALING = 0.0209398_0.0233861_2.1456e-05_-0.0100184_0.00216776_-0.000383287_0_0.00178281_0_0.00731606_-0.00307071_1.65476_0_0.00149819_0.0125884_0.00380715_0.0018003_0.0107588_0_-0.036329_0.00201955_0.00719299_0_0.000755101
set RULEDB = /data/rack0temp14/SMT/iwslt07ce/rules-withnull-onlinefiltered.db
( nohup runbees.pl --params "--RuleDB ${RULEDB} --SentenceList ${SRC} --ReferenceList ${REF} --NumReferences ${NREF} --NormalizingScript $andi3/SBMT/scripts/punctuation-postprocess-eval4.pl --LMFiles ${LM} --RecomputeLMCostsDuringNBest 1 --RescoreLM 0 --SRIHistoryLength 4 --HistoryLength 2 --MEROptimize 1 --IterationLimit 20 --PruningMap 0-100-7-@_S-200-7 --ComboPruningBeamSize 10000 --ComboPruningFuzzCostDifference 5 --MaxCostDifferencePerCell inf --MaxCombinationCount 10 --NBest 2000 --HypsPerNode 1000 --RemoveUnt 0 --MaxRuleAppCountDifference 5 --RA_Mult 0 --RA_UpdateHyp 1 --RA_LowestCount 0 --NBestSubSpans 0 --Verbosity 0 --TagSetList @_dummy --ScoringMetric IBMBLEU --DisplayNBestData 1 --CompareTargetWords 1 --NBestStyle 2 --ExtractUnique 1 --PruningMapUseBoundary 1 --Opti_NumModify 10" --merparams "--Opti_NumModify 10 --ScoringMetric IBMBLEU --Opti_Epsilon 0.0001" --numbees 20 --expdir /data/rack0temp14/SMT/iwslt07ce/realtestexpibm8-mcc10-withnull-onlinefiltered --scaling ${SCALING} --iterlimit 1 --iterstart 0 ) > & log-realtestexpibm8withnull &
nbestsent-to-1best.sh realtestexpibm8/iterXXX/ > realtestexpibm8/translations.txt
perl ~joy/Eval/EvaluateXlat.pl -b IWSLT_JE07 samt translations.txt
38.08 BLEU (LP .917)