ARK Machine Translation Research
This page is the home for machine translation research conducted by members of Noah's ARK in the Language Technologies
Institute at Carnegie Mellon University.
Machine translation is an active research area that offers great
promise to revolutionize the way we communicate. Our goal is to make
machine translation systems better and faster, and also to develop
techniques that can be useful for other areas of natural language
processing and machine learning.
Members and Collaborators
- Waleed Ammar, Ph.D. student, LTI
- Victor Chahuneau, Masters student, LTI
- Chris Dyer, Assistant Professor, LTI
- Kevin Gimpel, Assistant Professor, TTI-Chicago (formerly CMU LTI)
- Noah A. Smith, Associate Professor, LTI and MLD
Software
Rampion
Rampion (Gimpel and Smith, 2012) is an algorithm for training statistical machine translation models based
on minimizing structured ramp loss. The code provided here can be used with the Moses decoder or any other decoder that supports the same formats for configuration files and k-best lists. There is also an implementation of Rampion in cdec.
Version 0.2 also includes implementations of PRO and risk minimization as well as several additional forms of ramp loss from Gimpel (2012). Also included are the improved sentence-level BLEU approximations from Nakov et al. (2012), which are recommended for single-reference training.
Version 0.1 released 6/6/2012: rampion-v0.1.tar.gz
Version 0.2 released 12/28/2012: rampion-v0.2.tar.gz
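To give a rough idea of what minimizing a structured ramp loss over a k-best list involves, here is a minimal sketch in Python. It is not taken from the Rampion release. It assumes one "hope/fear" form of ramp loss, in which the loss for a sentence is the best (model score + cost) in the k-best list minus the best (model score - cost). The function name `ramp_loss`, the (score, cost) tuple layout, and the toy numbers are all hypothetical; cost could be, for example, 1 minus a sentence-level BLEU approximation.

```python
from typing import List, Tuple


def ramp_loss(kbest: List[Tuple[float, float]]) -> Tuple[float, int, int]:
    """Compute a hope/fear ramp loss over one sentence's k-best list.

    Each entry is (model_score, cost), where cost >= 0 measures how bad
    the hypothesis is (hypothetical layout, chosen for illustration).
    Returns (loss, hope_index, fear_index).
    """
    indices = range(len(kbest))
    # "Hope": high model score and low cost.
    hope = max(indices, key=lambda i: kbest[i][0] - kbest[i][1])
    # "Fear": high model score and high cost.
    fear = max(indices, key=lambda i: kbest[i][0] + kbest[i][1])
    loss = (kbest[fear][0] + kbest[fear][1]) - (kbest[hope][0] - kbest[hope][1])
    return loss, hope, fear


# Toy k-best list: (model score, cost) for three hypotheses.
kbest = [(2.0, 0.25), (1.5, 1.0), (1.0, 0.0)]
loss, hope, fear = ramp_loss(kbest)
print(loss, hope, fear)
```

With non-negative costs this quantity is never negative. A training procedure would then adjust the model weights to raise the score of the hope hypothesis relative to the fear hypothesis; that update step is omitted here.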
cdec
cdec (Dyer et al., 2010) is a flexible and
efficient software framework for machine translation and other structured prediction tasks. It was used for
our German-English submission to the WMT11 shared task (Dyer et al., 2011b), for recent work on feature-rich
modeling for unsupervised word alignment (Dyer et al., 2011a), as well as for transliteration (Ammar et al., 2012). It implements training and decoding algorithms
for several commonly used models in machine translation.
Inference for Monolingual and Bilingual Gappy Pattern Models
Below is a link to code that implements the models described by Gimpel and Smith (2011a).
These models can discover gappy patterns in either monolingual or bilingual (word-aligned) text.
Sample data files and execution scripts are provided.
Version 0.1 released 7/20/2011: gaplm-v0.1.tar.gz
sample patterns
Code for Statistical Significance Testing for MT Evaluation Metrics
The software linked below performs paired bootstrap resampling (Koehn, 2004) for the
BLEU metric (Papineni et al., 2002) as computed using the mteval-v13a script provided
by NIST (http://www.itl.nist.gov/iad/mig/tests/mt/2009/).
The code is available in the following tar.gz file: paired_bootstrap_v13a.tar.gz
A previous release for use with mteval-v11b is archived here.
You may also be interested in code by Jon Clark for
performing bootstrap resampling and approximate randomization with BLEU, METEOR, and TER.
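For readers unfamiliar with paired bootstrap resampling, the following is a minimal Python sketch of the general idea, not the code in the tar.gz files above. The test set is resampled with replacement many times, both systems are scored on each resample, and the fraction of resamples in which one system beats the other is reported. The function `paired_bootstrap`, its parameters, and the toy per-sentence quality values are hypothetical. Each metric is passed as a function of sentence indices because a corpus-level metric such as BLEU has to be recomputed on each resample; the toy example uses a simple mean as a stand-in, not BLEU.

```python
import random
from typing import Callable, List


def paired_bootstrap(
    metric_a: Callable[[List[int]], float],
    metric_b: Callable[[List[int]], float],
    num_sentences: int,
    num_samples: int = 1000,
    seed: int = 0,
) -> float:
    """Return the fraction of resampled test sets on which system A
    scores strictly higher than system B.

    metric_a and metric_b map a list of sentence indices (with repeats)
    to a corpus-level score for that resampled test set.
    """
    rng = random.Random(seed)
    wins_a = 0
    for _ in range(num_samples):
        # Draw a test set of the original size, with replacement.
        # Both systems are scored on the same sample (the "paired" part).
        sample = [rng.randrange(num_sentences) for _ in range(num_sentences)]
        if metric_a(sample) > metric_b(sample):
            wins_a += 1
    return wins_a / num_samples


def mean_over(values: List[float]) -> Callable[[List[int]], float]:
    """Stand-in corpus metric: mean of per-sentence values (not BLEU)."""
    return lambda idx: sum(values[i] for i in idx) / len(idx)


# Toy per-sentence quality values for two hypothetical systems.
quality_a = [0.9, 0.8, 0.7, 0.9]
quality_b = [0.1, 0.2, 0.3, 0.1]
win_rate = paired_bootstrap(mean_over(quality_a), mean_over(quality_b), 4, 200)
print(win_rate)
```

If system A wins on, say, at least 95% of the resamples, the difference is commonly reported as significant at the corresponding level. The number of resamples and the threshold are choices left to the user.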
References
- W. Ammar, C. Dyer, and N. A. Smith (2012). Transliteration by Sequence Labeling with Lattice Encodings and Reranking. Named Entities Workshop at ACL 2012.
- C. Dyer, J. H. Clark, A. Lavie, and N. A. Smith (2011a). Unsupervised Word Alignment with Arbitrary Features. ACL 2011.
- C. Dyer, K. Gimpel, J. H. Clark, and N. A. Smith (2011b). The CMU-ARK German-English Translation System. WMT 2011.
- C. Dyer, A. Lopez, J. Ganitkevitch, J. Weese, F. Ture, P. Blunsom, H. Setiawan, V. Eidelman, and P. Resnik (2010). cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models. ACL 2010.
- K. Gimpel (2012). Discriminative Feature-Rich Modeling for Syntax-Based Machine Translation. Ph.D. Thesis, Carnegie Mellon University.
- K. Gimpel and N. A. Smith (2009). Feature-Rich Translation by Quasi-Synchronous Lattice Parsing. EMNLP 2009.
- K. Gimpel and N. A. Smith (2011a). Generative Models of Monolingual and Bilingual Gappy Patterns. WMT 2011.
- K. Gimpel and N. A. Smith (2011b). Quasi-Synchronous Phrase Dependency Grammars for Machine Translation. EMNLP 2011.
- K. Gimpel and N. A. Smith (2012). Structured Ramp Loss Minimization for Machine Translation. NAACL 2012.
- P. Koehn (2004). Statistical Significance Tests for Machine Translation Evaluation. EMNLP 2004.
- P. Nakov, F. Guzmán, and S. Vogel (2012). Optimizing for Sentence-Level BLEU+1 Yields Short Translations. COLING 2012.
- K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. ACL 2002.
Acknowledgments
This research has been supported in part by the National Science Foundation
under grants IIS-0844507 and IIS-1054319,
by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant number W911NF-10-1-0533, by grants from Google,
and by Sandia National Laboratories (fellowship to K. Gimpel).