home | research | publications | courses | bio | links

Research interests:

My Research Blog

Dissertation Research | Avenue Project | Quechua | Mapudungun | Publications | Past research in Speech


Dissertation Research

Ph.D. Thesis: Automatic Improvement of Machine Translation Systems

Interactive and Automatic Refinement of Translation Rules for Transfer-based MT Systems
OR
Can the Internet help improve MT?

Achieving high translation quality remains the biggest challenge for Machine Translation (MT) systems. To address this challenge, researchers have explored a variety of methods to include user feedback in the MT loop. However, most MT systems have failed to incorporate post-editing efforts beyond adding corrected translations to the parallel training data for Statistical and Example-Based systems or to a translation memory database. My research centers on developing a largely automated approach that uses online post-editing feedback from non-experts to refine translation rules. Precise error correction information that is relevant to the system allows the Automatic Rule Refiner to trace errors back to the incorrect lexical and grammar rules responsible for them and to propose concrete fixes to those rules. Since this approach attacks the problem at its core, it generalizes beyond the input sentences corrected by bilingual speakers and allows for correct translation of unseen data. The reach of the Internet further enhances the relevance of this work. We envision turning the product of my research into an online game with a purpose. This game will allow bilingual speakers to correct MT output, get rewards for making good corrections, and compare their scores and speed to those of other players. For the MT community, this game will provide a free and easy way to get feedback for MT system improvement.

Short paper describing my dissertation research in more detail: "Can the Internet help improve Machine Translation". Doctoral Consortium at HLT-NAACL, June 4, 2006, New York, USA.   pdf

Other papers can be found here.

I presented my thesis proposal in November 2004. If you would like me to send you the proposal document or presentation slides, send me an email (aria AT cs.cmu.edu).

AVENUE project

Since May 2002, I have been working on the AVENUE project (CMU blackboard for the AVENUE project), developing automatic Machine Translation systems for resource-poor languages.

I am currently in charge of developing the resources for Quechua. During the summer of 2005, I spent three months in Cusco building the resources and infrastructure to implement a Quechua-Spanish MT prototype system, as part of the V-Unit (Vision Unit). The V-Unit is part of the TechBridgeWorld Initiative at CMU. Currently the Quechua-Spanish MT prototype system has 25 translation rules and 683 lexical entries (40 manually and 643 semi-automatically created).

From May 2002 until January 2005, I was in charge of Mapudungun. In April and November of 2002, I travelled to Temuco, Chile, and worked with the local team at the Instituto de Estudios Indígenas (Universidad de la Frontera) to develop resources and NLP tools for Mapudungun-Spanish (transcribed spoken corpus, dictionary, morphological analyzer). I also provided technical assistance in setting up and using software developed at CMU.

For my thesis, I am working on automatically refining translation rules, by using minimal corrections from non-expert bilingual users. More


Quechua

Quechua or runasimi, which means language of the people, is the indigenous language of a large portion of the South American highlands, and there are about 10 million speakers today. However, we know of no electronic resources in Quechua, let alone any information and communication technologies in Quechua.

The term Quechua covers a variety of distinct languages and dialects. The Ethnologue database lists 46 dialects of Quechua, 32 of them spoken in Peru. Quechua is also spoken in Bolivia, Ecuador, southern Colombia and northern Argentina. The most important dialect is the one spoken in Cuzco, the seat of the former Inca Empire. Quechua spread through conquests carried out before and during that empire. It displaced several earlier languages, only to find itself increasingly displaced today by Spanish. In spite of this intense competition, Quechua in its various forms remains a vital language in Peru and elsewhere.

A piece of good news for us computational linguists is that the endless battle over which of the two competing orthographies, the pentavocalic or the trivocalic, should be the official one has finally ended in favor of the pentavocalic system, which has the closest correspondence with the Quechua letter-to-sound rules.

In the Spring 2005 semester, I audited Quechua II at the University of Pittsburgh, taught by Salome Gutierrez. During my time in Cusco (June-August 2005), I studied both the Quechua language and culture at Centro Bartolome de las Casas, where I enjoyed daily classes taught by native speaker and educator Gina Maldonado.




Mapudungun

Mapudungun is an American Indigenous language spoken in Chile and Argentina by about half a million Mapuche people.

Our Chilean local team is located in the Instituto de Estudios Indígenas, Universidad de la Frontera, in Temuco, Chile.

  • The first AVENUE Mapudungun-Spanish Machine Translation system is now available online.


  • For some information on the Mapuche people and their language, Mapudungun, you can visit these links:
    • About their history
    • About their language
    • To see their flags




  • PhD research related publications

    Some recent papers and presentations

  • "A Walk on the Other Side: Adding Statistical Components to a Transfer-Based Translation System" with Stephan Vogel. To appear in Syntax and Structure in Statistical Translation (SSST) Workshop at HLT-NAACL, 26 April 2007, Rochester, New York, USA.
    pdf

  • "The Inner Works of an Automatic Rule Refiner for Machine Translation" with William Ridmann. METIS-II Workshop, January 11, 2007, Leuven, Belgium.
    pdf

  • "Automating Post-Editing to Improve MT Systems" with Jaime Carbonell. Automated Post-Editing Workshop at AMTA, August 12, 2006, Boston, USA.
    pdf

  • "Giving the Power to Bilingual Speakers". Position Paper for the Automated Post-Editing Workshop at AMTA, August 12, 2006, Boston, USA.
    pdf

  • "Can the Internet help improve Machine Translation". Doctoral Consortium at HLT-NAACL, June 4, 2006, New York, USA.
    pdf    [slides]    [poster]

  • "A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation" with Jaime Carbonell and Alon Lavie. EAMT 10th Annual Conference 30-31 May 2005, Budapest, Hungary.
    pdf

  • "Building Machine translation systems for indigenous languages" with Roberto Aranovich and Lori Levin. Second Conference on the Indigenous Languages of Latin America (CILLA II), 27-29 October 2005, Texas, USA.
    pdf

  • "Error Analysis of Two Types of Grammar for the Purpose of Automatic Rule Refinement" with Katharina Probst and Jaime Carbonell. forthcoming at AMTA, 2004.
    postscript    pdf

  • "The Translation Correction Tool: English-Spanish user studies" with Jaime Carbonell. LREC, 2004. Lisbon, Portugal.
    postscript    pdf

  • Lavie, A., S. Vogel, L. Levin, E. Peterson, K. Probst, A. Font Llitjos, R. Reynolds, J. Carbonell, and R. Cohen, "Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario". ACM Transactions on Asian Language Information Processing (TALIP), to appear in 2(2), June 2003.
    postscript    pdf

  • Two HCI project proposals presented December 11, 2002     powerpoint presentation

  • My 2 cents to the panel "From Bits to Bots: Women Everywhere, Leading the Way": AVENUE, Automatic Machine Translation for low-density languages. Grace Hopper Celebration, 2002, Vancouver, Canada. With Lenore Blum, Anastassia Ailamaki, Manuela Veloso, Sonya Allin and M. Bernardine Dias.   powerpoint presentation


  • Here is a brief overview of the Catalan language (català) for MT, which I presented February 17, 2003, during one of our surprise language exercise meetings: powerpoint
    [To my grandfather, Joan Llitjós Armengou, a generous and positive man who worked all his life to give us a better life.]
    And here is a very nice and thorough description by the Gran Enciclopedia Catalana (GREC) Word



  • Past research: Speech

    CMU Speech group (only from CMU)

    A couple of years ago, I worked on:

    In September 2001, I defended my Masters Thesis on:
    Improving Pronunciation Accuracy of Proper Names with Language Origin Classes
    postscript   pdf    slides  
    For less detailed versions, you can take a look at the Eurospeech '01 paper or at some of the related talks I've given:
           - Eurospeech presentation (September 7 2001)   powerpoint (~15 minutes)
           - presentation in Catalan (July 18, 2001)   powerpoint (~20 minutes)
           - talk at the Sphinx lunch (June 21, 2001)    postscript    powerpoint (~45 minutes)
          
    Please take a minute to participate in the evaluation of our pronunciation models by going to the
                       PRONUNCIATION OF PROPER NAMES SITE

    In the past, I have worked on:

    • VXML Dialog Systems (BusLine)
    • Writing Spanish generation grammars for multilingual Machine Translation systems (JANUS and NLPWIN)
    • Writing a Catalan Constraint Grammar
    • Writing LFG-based Catalan analysis and generation grammars in LEKTA



    Ariadna Font Llitjós
    Last modified: Mon May 13 21:16:17 EDT 2002