Syntax Augmented Machine Translation via Chart Parsing
Latest version: HERE
This is a Hadoop-based, MapReduce-parallelized version (see our IWSLT'08 paper). See the file setup-commands.txt for installation instructions. The readme file still refers to the old non-Hadoop SAMT version.
Instead, read the "Grammar based statistical MT on Hadoop" paper below for usage instructions.
References:
Our open-source SAMT system consists of three parts:
- Extraction of statistical translation rules from a training corpus; either plain hierarchical rules a la Chiang (2005) or syntax-augmented rules a la Zollmann & Venugopal (2006).
- CKY+ (Chappelier and Rajman, 1998) style chart parser employing the statistical translation rules to translate test sentences:
  - Fast C++ code: translates the 2000 (realtest) sentences of the Europarl French-English data in approx. 40 min (roughly 46 sentences per minute), achieving state-of-the-art scores
  - Implements CKY+ for internal binarization during parsing
  - Can efficiently handle thousands of non-terminal categories
  - Performs LM intersection with the grammar at run-time, or optionally uses future cost estimates for LM cost, producing state-of-the-art scores
- A minimum-error-rate optimization and scoring tool (integrated into the chart parser) to tune the parameters of the underlying log-linear model on a held-out development corpus
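To make the tuning step in the last item concrete, here is a minimal sketch of a log-linear model score and a naive weight search over n-best lists. It is not the SAMT implementation: the data layout, the function names (`score`, `corpus_error`, `tune_one_weight`), the toy feature values, and the use of a plain grid search (rather than the exact line search normally used in minimum-error-rate training) are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: a candidate translation is (feature_values, error_count).
Candidate = Tuple[Sequence[float], float]


def score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Log-linear model score: weighted sum of the feature values."""
    return sum(w * h for w, h in zip(weights, features))


def corpus_error(weights: Sequence[float],
                 nbest_lists: List[List[Candidate]]) -> float:
    """Sum the error of the top-scoring candidate of each n-best list."""
    total = 0.0
    for nbest in nbest_lists:
        best = max(nbest, key=lambda cand: score(weights, cand[0]))
        total += best[1]
    return total


def tune_one_weight(weights: Sequence[float],
                    nbest_lists: List[List[Candidate]],
                    index: int,
                    grid: Sequence[float]) -> Tuple[List[float], float]:
    """Grid search over one weight; keep the value with the lowest dev error."""
    best_weights = list(weights)
    best_error = corpus_error(best_weights, nbest_lists)
    for value in grid:
        trial = list(weights)
        trial[index] = value
        error = corpus_error(trial, nbest_lists)
        if error < best_error:
            best_weights, best_error = trial, error
    return best_weights, best_error


# Toy development set (made-up numbers): two sentences, two candidates each,
# features are (LM log-probability, translation-model log-probability).
toy_dev = [
    [((-2.0, -1.0), 1.0), ((-1.0, -3.0), 0.0)],
    [((-4.0, -1.0), 2.0), ((-2.0, -2.0), 0.0)],
]

if __name__ == "__main__":
    start = [1.0, 1.0]
    print("error before tuning:", corpus_error(start, toy_dev))
    print("after tuning weight 1:",
          tune_one_weight(start, toy_dev, 1, [0.0, 0.5, 1.0, 2.0]))
```

In practice the search would cycle over all weights and repeat with fresh n-best lists from the decoder; the sketch only shows why the objective is the error of the top-scoring candidate rather than a likelihood.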
The system is available open-source under the GNU General Public License. Click here to download it.
(Library LGPL version [needed if used for commercial purposes, no support provided]: here.)
Documentation for SAMT is available from the following sources:
- Readme.html documentation: detailed instructions on installing the system and running through a quick-start example.
- Detailed technical overview at the top of FastTranslateChart.cc, which complements the published work
- Doxygen comments on classes/functions, plus detailed notes in the code
- The samt-technical mailing list (see below), for all the points we forgot to explain fully
We will be updating the SAMT system regularly. We have created the following Google groups to manage announcements and host technical discussions regarding the system.
- samt-announce: to receive information on major updates.
- samt-technical: to participate in technical discussions regarding the SAMT system. Get your compiling / running / theory questions answered here.
Of course, you can also email us directly: {zollmann or ashishv} (at) cs.cmu.edu
InterACT homepage
Andreas's homepage
Ashish's homepage