Return-Path:
Delivered-To: teruko+@lti.cs.cmu.edu
Received: from KYOTO.LTI.CS.CMU.EDU ([128.2.178.127]) by lti.cs.cmu.edu id aa22326; 20 Jan 2005 18:21 EST
Received: from kyoto.lti.cs.cmu.edu ([127.0.0.1]) by kyoto.lti.cs.cmu.edu id aa04987; 20 Jan 2005 18:21 EST
From: Teruko Mitamura
To: sfung@cs.cmu.edu, bhan@andrew.cmu.edu, khannan@cs.cmu.edu, sjayaram@andrew.cmu.edu, frank+@cs.cmu.edu, ssaleem@cs.cmu.edu
Subject: MT lab class tomorrow at 2pm in 1305 NSH
Cc: ref@cs.cmu.edu, alavie@cs.cmu.edu, ralf@cs.cmu.edu, lsl@cs.cmu.edu, ehn@cs.cmu.edu, stephan.vogel@cs.cmu.edu
Date: Thu, 20 Jan 2005 18:21:27 -0500
Message-ID: <4985.1106263287@kyoto.lti.cs.cmu.edu>
Sender: Teruko_Mitamura@kyoto.lti.cs.cmu.edu
X-UIDL: 39ffb3b438d341d688016de94fc5963c

Hi All,

Here are some descriptions of the possible MT lab topics we are going to discuss tomorrow. We have 5 faculty presenting the possible topics. Each faculty member will have about 10 min., followed by Q&A.

See you at 2pm in 1305 NSH.

--Teruko

----------------------------------------------------------------------
MT Lab: possible topics
----------------------------------------------------------------------

Advisor: Robert Frederking

"Comparison of ROVER and MEMT"

The basic idea is that ROVER (combining speech recognizer outputs) is claimed to usually give "good" results. But the analogous idea in MT, Multi-Engine MT (MEMT), has found it very difficult to achieve really "good" results, in the sense of really significant improvements, or improvements anything like what humans can get from combining a set of MT outputs.

This experiment would involve these steps:
-- verify that a set of SR outputs produced good results when ROVERed,
-- run one or more MEMT algorithms over the same outputs,
-- verify whether the MEMT outputs show improvement similar to the ROVER outputs on the SR data,
-- compare the SR data with typical MT data, and try to determine what characteristics of the MT data make the problem more difficult.
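The voting step of the first task can be illustrated with a minimal sketch. This is not ROVER itself: real ROVER builds the word transition network by incremental alignment, while the sketch below assumes the hypotheses have already been aligned slot-by-slot (with None marking alignment gaps) and only shows the majority vote.

```python
from collections import Counter

def rover_vote(aligned_hyps):
    """Combine pre-aligned hypotheses by majority vote per slot.

    aligned_hyps: token lists of equal length; None marks a gap
    where a hypothesis has no word aligned to that slot.
    """
    combined = []
    for slot in zip(*aligned_hyps):
        winner, _ = Counter(slot).most_common(1)[0]
        if winner is not None:   # the majority voted for "no word" here
            combined.append(winner)
    return combined

# Toy example: three recognizer outputs for the same utterance.
hyps = [
    ["the", "cat", "sat",  None,    "down"],
    ["the", "cat", "sat",  "right", "down"],
    ["a",   "cat", "spat", None,    "down"],
]
print(" ".join(rover_vote(hyps)))  # the cat sat down
```

Each slot keeps the word chosen by most systems, so single-system errors ("a", "spat", the inserted "right") are voted out whenever at least two systems agree.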
The main approach to the last step would be to produce synthetic data that is "in between" the SR and MT data on various dimensions, and see what kind of MEMT quality one can get on the synthetic data. There are a number of hypotheses about why MT data might be harder; this would empirically demonstrate the contribution of the different dimensions.

----------------------------------------------------------------------

Advisor: Teruko Mitamura

"Building a Chinese to English MT system with KANTOO"

Develop lexicon, grammar, and mapping rules for Chinese analysis and English generation for a small sample corpus. This project can be a team project if more than one student is interested.

KANT home page: http://www.lti.cs.cmu.edu/Research/Kant/

----------------------------------------------------------------------

Advisor: Alon Lavie

(1) The utility of Mutual Information for Assessing MT quality
(2) Advanced experimentation on MT Evaluation Metrics
(3) Rapid MT Prototyping using the AVENUE Transfer Framework

----------------------------------------------------------------------

Advisor: Ralf Brown

"Clustering for Generalization in Example-Based Machine Translation"

Our EBMT engine has the ability to generalize its training texts through the use of word equivalence classes. There is existing code to find such classes automatically via clustering, but it has proven not to improve translation quality consistently, because there is too much noise in the clusters. A newly published clustering algorithm built on top of k-means clustering promises to dramatically reduce that noise.

The task in this lab project will be to implement the new clustering algorithm (code for k-means is already in place) and then train EBMT systems using (a) the original text only, (b) the equivalence classes generated by the existing clustering program, and (c) the equivalence classes generated by the newly implemented algorithm, in order to compare the effectiveness of the clustering for machine translation.
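To make the clustering idea concrete, here is a minimal sketch of finding word equivalence classes with plain k-means. It is not the EBMT engine's actual code: the features (immediate left/right neighbor counts), the Euclidean distance, and the toy corpus are all invented for illustration, and the noise-reduction refinements of the newer algorithm are not shown.

```python
import random

def context_vectors(sentences, vocab):
    """Represent each word by counts of its immediate neighbors."""
    idx = {w: i for i, w in enumerate(vocab)}
    vecs = {w: [0.0] * (2 * len(vocab)) for w in vocab}
    for sent in sentences:
        for i, w in enumerate(sent):
            if i > 0:
                vecs[w][idx[sent[i - 1]]] += 1               # left neighbor
            if i + 1 < len(sent):
                vecs[w][len(vocab) + idx[sent[i + 1]]] += 1  # right neighbor
    return vecs

def kmeans(vecs, k, iters=20, seed=0):
    """Plain k-means over the context vectors; returns word clusters."""
    rng = random.Random(seed)
    words = sorted(vecs)
    centers = [list(vecs[w]) for w in rng.sample(words, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for w in words:  # assign each word to its nearest center
            d = [sum((a - b) ** 2 for a, b in zip(vecs[w], c)) for c in centers]
            clusters[d.index(min(d))].append(w)
        for j, cl in enumerate(clusters):  # recompute each center as the mean
            if cl:
                centers[j] = [sum(vecs[w][i] for w in cl) / len(cl)
                              for i in range(len(centers[j]))]
    return clusters

# Toy corpus: "dog"/"cat" share contexts, as do "the"/"a" and "ran"/"sat".
sentences = [["the", "dog", "ran"], ["the", "cat", "ran"],
             ["a", "dog", "sat"], ["a", "cat", "sat"]]
vocab = sorted({w for s in sentences for w in s})
clusters = kmeans(context_vectors(sentences, vocab), k=3)
print(clusters)
```

Words with identical contexts ("dog"/"cat") always land in the same class, which is exactly the kind of equivalence class the EBMT engine can substitute into its generalized templates.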
----------------------------------------------------------------------

Advisor: Stephan Vogel

Topic 1: A fertility model for data-driven MT

The IBM3 word alignment model introduces the so-called fertility model to capture the observation that a word in one language is often aligned to several words in the other language. Our current SMT system does not use this kind of information in an explicit way (it is only implicit in the phrase-to-phrase alignment). The fertility information could be used to improve the phrase alignment. A second idea is to use the word fertility information to construct a sentence length model.

Tasks:
- implement (in C++) the fertility model based on the phrase alignment model used for the SMT system
- train the fertility model with the GIZA toolkit
- compare the quality of the two methods
- construct a sentence length model based on the word fertilities
- evaluate the quality of this sentence length model
- use this sentence length model in the SMT decoder and study its effect on translation quality

Topic 2: Optimizing an MT system using an Evolutionary Strategy

Each translation system has a number of parameters which need to be tuned to achieve optimal performance for a given language pair, data situation, and evaluation criterion. As MT systems become more sophisticated and the number of parameters increases, manual tuning becomes less and less feasible. In the last two years, different optimization approaches (maximum entropy, minimum error training, the simplex algorithm, etc.) have been applied to do this optimization in an automatic and principled way. This is usually done by recalculating the scores for an n-best list, thereby selecting a new first-best hypothesis. The idea here is to use an Evolutionary Strategy (ES) (or a Genetic Algorithm) to do this optimization.

Tasks:
- design an ES which can optimize a variable number of parameters, where each parameter is characterized by meaningful boundaries and a minimal step size
- implement this (in C++) and test it on some standard problems
- use it to optimize over an n-best list
- perform a full optimization of the SMT system for different evaluation criteria on one or two translation tasks
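The ES-over-n-best-lists idea above can be sketched as follows. This is only an illustration, not the SMT system's interface: the feature values, the "quality" score standing in for the evaluation criterion, and the simple (1+1) mutate-and-keep-if-not-worse scheme are all invented, but the sketch does show the two ingredients the task calls for: per-parameter boundaries and a minimal step size.

```python
import random

def rescore(weights, nbest):
    """Pick the hypothesis with the highest weighted sum of model scores."""
    return max(nbest, key=lambda h: sum(w * f for w, f in zip(weights, h["feats"])))

def evolve(nbest_lists, bounds, steps, iters=300, seed=0):
    """(1+1)-ES: mutate all weights, snap each to its minimal-step-size
    grid, clip to its boundaries, and keep the child unless it is worse."""
    rng = random.Random(seed)

    def fitness(w):
        # Total quality of the first-best hypotheses selected by weights w.
        return sum(rescore(w, nb)["quality"] for nb in nbest_lists)

    parent = [(lo + hi) / 2 for lo, hi in bounds]  # start at the midpoints
    best = fitness(parent)
    for _ in range(iters):
        child = []
        for p, (lo, hi), s in zip(parent, bounds, steps):
            x = p + rng.gauss(0, 3 * s)          # mutate
            x = lo + round((x - lo) / s) * s     # snap to the step-size grid
            child.append(min(hi, max(lo, x)))    # clip to the boundaries
        f = fitness(child)
        if f >= best:                            # accept ties to allow drift
            parent, best = child, f
    return parent, best

# Toy setup: two n-best lists, each hypothesis with two invented model
# scores ("feats") and a known quality score standing in for the metric.
nbest_lists = [
    [{"feats": [0.9, 0.1], "quality": 0.2},
     {"feats": [0.2, 0.8], "quality": 0.9}],
    [{"feats": [0.8, 0.3], "quality": 0.3},
     {"feats": [0.1, 0.9], "quality": 0.8}],
]
bounds = [(0.0, 1.0), (0.0, 1.0)]
steps = [0.05, 0.05]
weights, best = evolve(nbest_lists, bounds, steps)
print(weights, best)
```

Because the rescoring only re-ranks an existing n-best list, each fitness evaluation is cheap, which is what makes population-style search affordable here compared with re-decoding the test set for every candidate weight vector.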