Pronominal Anaphora Translation into Spanish


In this experiment, we evaluated the translation of English third-person personal pronouns into Spanish.

We tested the method on the portions of the SemCor and MTI corpora used previously in the process of anaphora resolution. The training corpus was used for improving the number and gender rules. The remaining fragments of the corpora were reserved for test data.

We needed to know the semantic category (person, animal, or object) and the grammatical gender (masculine or feminine) of the pronoun's antecedent in order to apply the number and gender rules. In the SemCor corpus, the WordNet sense was used to identify the antecedent's semantic category. In the MTI corpus, due to the lack of semantic information, a set of heuristics was used to determine the antecedent's semantic category.

With regard to information about the antecedent's gender, an English-Spanish electronic dictionary was used since the POS tag does not usually provide gender and number information. The dictionary was incorporated into the system as a database. For each English word, the dictionary provides a translation into Spanish, and the word's gender and number in Spanish.
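The dictionary lookup described above can be pictured with a minimal sketch. The table name, column names, and sample entries below are hypothetical, chosen only for illustration; the source does not describe the actual database schema used in the system.

```python
import sqlite3

# Hypothetical schema: one row per English word, with its Spanish
# translation and the gender and number of that translation.
def build_dictionary(entries):
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dictionary ("
        "english TEXT PRIMARY KEY, spanish TEXT, gender TEXT, number TEXT)"
    )
    conn.executemany("INSERT INTO dictionary VALUES (?, ?, ?, ?)", entries)
    return conn

def lookup(conn, english_word):
    """Return (spanish, gender, number), or None if the word is missing."""
    return conn.execute(
        "SELECT spanish, gender, number FROM dictionary WHERE english = ?",
        (english_word.lower(),),
    ).fetchone()

# Illustrative entries only, not taken from the real dictionary.
SAMPLE_ENTRIES = [
    ("information", "información", "feminine", "singular"),
    ("processor", "procesador", "masculine", "singular"),
]

conn = build_dictionary(SAMPLE_ENTRIES)
print(lookup(conn, "information"))
print(lookup(conn, "kernel"))  # a word absent from the sample dictionary
```

A missing entry returns None, which corresponds to the case where a word does not appear in the dictionary and no gender is available for the rules.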

The number and gender rules were applied using this morphological and semantic information. We conducted a blind test over the entire test corpus, and the obtained results appear in Table 10.

Table 10: Translation of pronominal anaphora into Spanish, evaluation phase

	Subject	Compl	Correct	Total	P(%)
SEMCOR	197	57	229	254	90.2
MTI	239	231	353	470	75.1
TOTAL	436	288	582	724	80.4

The evaluation of this task was carried out automatically after the anaphoric annotation of each pronoun. This annotation included information about the antecedent and the translation of the anaphor into the target language. For this purpose, the human annotators translated the anaphors according to the criteria established by the morphological rules. For example, the pronoun it with subject function was translated into the Spanish pronoun él if its antecedent was of the animal type and masculine; if its antecedent was of the object type and masculine, it was translated into the Spanish pronoun éste; and so on. In the Spanish-English translation, the pronoun él with subject function was translated into the English pronoun he if its antecedent was of the person type and masculine; if its antecedent was of the object or animal type, whether masculine or feminine, it was translated into the English pronoun it; and so on²⁰.
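The way such rules select a target pronoun can be sketched as a lookup keyed on the pronoun, its function, and the antecedent's semantic category and gender. This is an illustrative sketch only: the function and table names are hypothetical, and the table holds just the two English-to-Spanish cases stated in the text, not the full rule set.

```python
# Rule table keyed by (pronoun, function, antecedent category, antecedent gender).
# Only the two cases spelled out in the text are listed; the remaining
# rules ("and so on") are deliberately not filled in.
RULES = {
    ("it", "subject", "animal", "masculine"): "él",
    ("it", "subject", "object", "masculine"): "éste",
}

def translate_pronoun(pronoun, function, category, gender):
    """Return the Spanish pronoun chosen by the rule table, or None if no rule is listed."""
    return RULES.get((pronoun.lower(), function, category, gender))

print(translate_pronoun("it", "subject", "animal", "masculine"))
print(translate_pronoun("it", "subject", "object", "masculine"))
print(translate_pronoun("it", "subject", "object", "feminine"))  # not listed here
```

Because the antecedent's category and gender are part of the key, a wrong antecedent or a wrong dictionary gender leads directly to a different target pronoun.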

Table 10 shows the anaphoric pronouns of each corpus classified by grammatical function: subject and complement (direct or indirect object). The last three columns give the number of pronouns successfully solved, the total number of solved pronouns, and the obtained precision, respectively. For instance, the SemCor corpus contains 197 pronouns with subject function and 57 complement pronouns (254 in total, consistent with the row and column totals). The precision obtained in this corpus was 90.2% (229 out of 254).

Discussion. In the translation of English third-person personal pronouns into Spanish, an overall precision of 80.4% (582 out of 724) was obtained. Specifically, the precision was 90.2% on the SemCor corpus and 75.1% on the MTI corpus.
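The precision figures can be rechecked from the Correct and Total columns of Table 10. The short sketch below does so; the variable and function names are illustrative.

```python
# (correct, total) per corpus, taken from Table 10.
RESULTS = {"SemCor": (229, 254), "MTI": (353, 470)}

def precision(correct, total):
    """Precision as a percentage: correctly translated / total solved."""
    return 100.0 * correct / total

for corpus, (correct, total) in RESULTS.items():
    print(f"{corpus}: {precision(correct, total):.1f}%")

# The overall figure pools the counts of both corpora.
total_correct = sum(c for c, _ in RESULTS.values())
total_solved = sum(t for _, t in RESULTS.values())
print(f"Overall: {precision(total_correct, total_solved):.1f}%")
```

The overall value is computed over the pooled counts (582 out of 724), so the larger MTI corpus weighs more heavily than SemCor; it is not the plain average of the two per-corpus percentages.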

From these results, we drew the following conclusions:

In the SemCor corpus, all of the instances of the English pronouns he, she, him, and her were correctly translated into Spanish. There are two reasons for this:
- The semantic roles of these pronouns were correctly identified in all of the cases.
- These pronouns contain the necessary grammatical information (gender and number) that allows the correct translation into Spanish, independent of the antecedent proposed as a solution by the AGIR system.
The errors in the translation of the pronouns it, they, and them had the following causes:
- There were mistakes in the anaphora-resolution stage, that is, the antecedent proposed by the system was not the correct one (44.4% of all mistakes). This caused an incorrect translation into Spanish, mainly because the proposed antecedent and the correct one had different grammatical genders.
- There were mistakes in the identification of the semantic role of the pronouns, which caused an incorrect morphological rule to be applied (44.4%). These mistakes were mainly due to incorrect clause splitting.
- There were mistakes caused by the English-Spanish electronic dictionary (11.2%). Two situations could occur: (a) the word did not appear in the dictionary; or (b) the gender recorded in the dictionary differed from the word's actual gender, because the word had several meanings.
In the MTI corpus, nearly all the pronouns were it, they, and them (96.2% of the total pronouns). The errors in the translation of these pronouns had the same causes as those in the SemCor corpus, although the percentages were different:
- There were mistakes in the anaphora-resolution stage (22.9% of the mistakes).
- There were mistakes in the identification of the pronouns' semantic role (62.9%).
- There were mistakes that originated in the English-Spanish dictionary (14.2%). This corpus contains a large number of technical words that did not appear in the electronic dictionary.

After analyzing the results, we observed that the precision on the SemCor corpus was approximately 15 percentage points higher than that obtained on the MTI corpus. The lower precision on the MTI corpus was the result of the corpus itself (most of the pronouns in this corpus are it, they, and them) and of the lack of semantic information.

To measure the effectiveness of our proposal, we compared our system with one of the most representative MT systems of the moment: Systran. Systran was designed and built more than thirty years ago, and it is continually modified to improve its translation quality. Moreover, it is easily accessible to Internet users through the web MT service BABELFISH²¹, which provides free translations between different languages. With regard to pronominal anaphora resolution and translation, Systran is one of the best MT systems studied (see Section 2) because, like our own system, it treats intersentential pronominal anaphora and Spanish zero pronouns on unrestricted texts after carrying out a partial parsing of the source text. As mentioned in Section 2, a free trial of the commercial product SYSTRANLinks²² was used to translate the evaluation corpora between English and Spanish. The results appear in Table 11.

Table 11: Translation of pronominal anaphora (complement pronouns only) into Spanish, SYSTRANLinks and AGIR, precision (%)

	SYSTRANLinks	AGIR
SEMCOR	75.4	82.5
MTI	58.1	69.3

The evaluation of the SYSTRANLinks output was carried out manually by a human translator. Pronouns judged acceptable by the translator were counted as correctly translated; all others were counted as incorrectly translated.

Table 11 shows only the evaluation of English complement pronoun translation into Spanish because Systran did not translate all the subject pronouns into Spanish. By analyzing the Systran outputs for both corpora, we drew the following conclusions:

- All the instances of the English pronouns he and she (always with subject function) were correctly translated into their Spanish equivalents él and ella.
- All the instances of the English pronouns it and they with subject function were omitted in Spanish (zero pronouns). These pronouns were not resolved in English and consequently were not translated into Spanish.

On the other hand, in our AGIR system, we have evaluated the correct application of the morphological rule to translate all source pronouns into target pronouns. A subsequent task must decide if the pronoun in the target language (a) must be generated as our system proposes, (b) must be substituted by another kind of pronoun (e.g., a possessive pronoun), or (c) must be eliminated (i.e., Spanish zero pronouns). Therefore, we have only taken into account the complement pronoun translation in order to make a fair comparison between the two systems.
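The complement-only comparison in Table 11 can be restated as a difference in percentage points per corpus. The following is a minimal sketch; the names are illustrative and the figures are copied from Table 11.

```python
# Precision (%) on complement pronouns, from Table 11.
TABLE_11 = {
    "SemCor": {"SYSTRANLinks": 75.4, "AGIR": 82.5},
    "MTI": {"SYSTRANLinks": 58.1, "AGIR": 69.3},
}

def gap(row):
    """AGIR precision minus SYSTRANLinks precision, in percentage points."""
    return round(row["AGIR"] - row["SYSTRANLinks"], 1)

for corpus, row in TABLE_11.items():
    print(f"{corpus}: AGIR is {gap(row)} points higher")
```

These are absolute differences between two percentages, not relative improvements.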

As shown in Table 11, the precision obtained using AGIR is approximately 7 to 11 percentage points higher (depending on the corpus) than that obtained using Systran. The errors in Systran originated in mistakes in the anaphora-resolution stage that caused incorrect translations, since the proposed antecedents and the correct ones have different grammatical genders. These errors can occur in intrasentential anaphors (as presented in Section 2) or in intersentential anaphors, as in the following example extracted from the corpora:

(E) [This information] is only valid for Linux on the Intel platform. Much of it should be applicable to Linux on other processor architectures, but I have no first hand experience or information.
(S) Esta información es solamente válida para Linux en la plataforma de Intel. Mucho de él debe ser aplicable a Linux en otras configuraciones del procesador, pero no tengo ninguna experiencia o información de primera mano.

This example shows an incorrect English-Spanish translation of the pronoun it by Systran. In this case, the antecedent (this information, feminine) is in the sentence preceding the anaphor. The anaphor is incorrectly resolved and therefore incorrectly translated (the masculine pronoun él instead of the feminine pronoun ésta).

Jesus Peral 2002-12-13