This experiment evaluated the translation of English third-person
personal pronouns into Spanish.
We tested the method on the portions of the SemCor and MTI
corpora previously used for anaphora resolution.
The training corpus was used to improve the number and gender
rules; the remaining fragments of the corpora were reserved as
test data.
We needed to know the semantic category (person,
animal, or object) and the grammatical gender
(masculine or feminine) of the pronoun's antecedent
in order to apply the number and gender rules. In the SemCor corpus,
the WordNet sense was used to identify the antecedent's
semantic category. In the MTI corpus, due to the lack of semantic
information, a set of heuristics was used to determine the
antecedent's semantic category.
The antecedent's gender was obtained from an
English-Spanish electronic dictionary, since the POS
tag does not usually provide gender and number information. The
dictionary was incorporated into the system as a database.
For each English word, the dictionary provides a translation
into Spanish, and the word's gender and number in Spanish.
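As a minimal sketch of how such a dictionary lookup might behave, assume a single database table keyed by the English word. The table name, the column names, the `lookup` helper, and the sample entries are illustrative assumptions; the actual database used by the system is not described at this level of detail.

```python
import sqlite3
from typing import Optional, Tuple

# Hypothetical schema and sample entries (assumptions for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dictionary ("
    "english TEXT PRIMARY KEY, spanish TEXT, gender TEXT, number TEXT)"
)
conn.executemany(
    "INSERT INTO dictionary VALUES (?, ?, ?, ?)",
    [
        ("information", "información", "feminine", "singular"),
        ("book", "libro", "masculine", "singular"),
    ],
)


def lookup(word: str) -> Optional[Tuple[str, str, str]]:
    """Return (Spanish translation, gender, number), or None if the word is absent."""
    return conn.execute(
        "SELECT spanish, gender, number FROM dictionary WHERE english = ?",
        (word.lower(),),
    ).fetchone()


print(lookup("information"))
print(lookup("kernel"))  # not among the sample entries
```

A missing entry is returned as `None` so that the caller can decide how to proceed when no gender information is available.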
The number and gender rules were applied using this
morphological and semantic information. We conducted a blind
test over the entire test corpus; the results appear
in Table 10.
Table 10: Translation of pronominal anaphora into Spanish, evaluation phase

Corpus    Subject   Compl   Correct   Total   P(%)
SEMCOR    197       47      229       254     90.2
MTI       239       231     353       470     75.1
TOTAL     436       288     582       724     80.4

The evaluation of this task was automatically carried out after
the anaphoric annotation of each pronoun. This annotation
included information about the antecedent and the translation
into the target language of the anaphor. To do so, the human
annotators translated the anaphors according to the criteria
established by the morphological rules. For example, the pronoun
it with subject function was translated into the
Spanish pronoun él if its antecedent was of the animal type
and masculine; on the other hand, if its antecedent was of the
object type and masculine, it was translated into
the Spanish pronoun éste; and so on. In the Spanish-English
translation, the pronoun él with subject function was
translated into the English pronoun he if its antecedent
was of the person type and masculine; on the other hand, if
its antecedent was of the object or animal type
(masculine or feminine), it was translated into the English
pronoun it; and so on.
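The rule examples above can be sketched as a small lookup. Only the cases spelled out in the text are encoded; the function names and the tuple layout are assumptions made for illustration, and the full rule set of the system is not reproduced here.

```python
from typing import Optional

# Key: (English pronoun, grammatical function, antecedent type, antecedent gender).
# Only the two English-Spanish cases given in the text are listed.
EN_TO_ES_RULES = {
    ("it", "subject", "animal", "masculine"): "él",
    ("it", "subject", "object", "masculine"): "éste",
}


def translate_en_to_es(pronoun: str, function: str,
                       antecedent_type: str, antecedent_gender: str) -> Optional[str]:
    """Return the Spanish pronoun, or None when no listed rule applies."""
    key = (pronoun.lower(), function, antecedent_type, antecedent_gender)
    return EN_TO_ES_RULES.get(key)


def translate_es_to_en(pronoun: str, function: str,
                       antecedent_type: str, antecedent_gender: str) -> Optional[str]:
    """Spanish-English direction, covering only the subject pronoun 'él' as in the text."""
    if pronoun.lower() != "él" or function != "subject":
        return None
    if antecedent_type == "person" and antecedent_gender == "masculine":
        return "he"
    if antecedent_type in ("object", "animal") and antecedent_gender in ("masculine", "feminine"):
        return "it"
    return None


print(translate_en_to_es("it", "subject", "animal", "masculine"))
print(translate_es_to_en("él", "subject", "object", "masculine"))
```

Returning `None` for any combination not listed keeps the sketch from guessing at rules that the text does not state.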
Table 10 shows the anaphoric
pronouns of each corpus classified by grammatical function:
subject and complement (direct or indirect object).
The last three columns give the number of correctly translated
pronouns, the total number of pronouns, and the
resulting precision, respectively. For instance, the SemCor
corpus contains 197 pronouns with subject function and 47
complement pronouns. The precision obtained on this
corpus was 90.2% (229 out of 254).
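The precision figures in Table 10 follow from the Correct and Total columns as P = 100 x Correct / Total. A short sketch that recomputes them is given below; the `RESULTS` layout and the `precision` helper are illustrative names, not part of the system.

```python
# (correct, total) per corpus, taken from the Correct and Total columns of Table 10.
RESULTS = {
    "SEMCOR": (229, 254),
    "MTI": (353, 470),
    "TOTAL": (582, 724),
}


def precision(correct: int, total: int) -> float:
    """Precision as a percentage, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


for corpus, (correct, total) in RESULTS.items():
    print(f"{corpus}: {precision(correct, total)}% ({correct} out of {total})")
```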
Discussion. In the translation of English third-person
personal pronouns into Spanish, an overall
precision of 80.4% (582 out of 724) was obtained.
Specifically, precisions of 90.2% and 75.1% were
obtained on the SemCor and MTI corpora, respectively.
From these results, we drew the following conclusions:
In the SemCor corpus, all of the instances of the English pronouns
he, she, him, and her were
correctly translated into Spanish. There are two
reasons for this:
The semantic roles of these pronouns were correctly
identified in all of the cases.
These pronouns contain the necessary grammatical information
(gender and number) that allows the correct translation into
Spanish, independent of the antecedent proposed as a solution by
the AGIR system.
The errors in the translation of the pronouns it,
they, and them had
the following causes:
There were mistakes in the anaphora-resolution stage, that is, the
antecedent proposed by the system was not the correct one (44.4%
of the global mistakes). This caused an incorrect translation into
Spanish mainly due to the fact that the proposed antecedent and
the correct one had different grammatical genders.
There were mistakes in the identification of the semantic role of the
pronouns that caused the application of an incorrect morphological
rule (44.4%). These mistakes stemmed mainly from
incorrect clause splitting.
There were mistakes caused by the English-Spanish electronic dictionary
(11.2%). Two situations occurred: (a) the word did not
appear in the dictionary; and (b) the gender recorded in the
dictionary differed from the word's actual gender, because the word
had several meanings.
In the MTI corpus, nearly all the pronouns were
it, they, and them (96.2% of the
total pronouns). The errors in the translation of these pronouns
had the same causes as those in the SemCor corpus, although the
percentages were different:
There were mistakes in the anaphora-resolution stage (22.9% of the
mistakes).
There were mistakes in the identification of the pronouns' semantic role
(62.9%).
There were mistakes caused by the English-Spanish dictionary
(14.2%). This corpus contained a large number of technical words that
did not appear in the electronic dictionary.
After analyzing the results, we observed that the precision
on the SemCor corpus was approximately 15 percentage points higher than that
obtained on the MTI corpus. The lower precision on the MTI corpus was
the result of the corpus itself (most of the pronouns
in this corpus are it, they, and
them) and of the lack of semantic information.
In order to measure the effectiveness of our proposal, we compared
our system with one of the most representative MT systems of the
moment: Systran. Systran was designed and built more than thirty
years ago, and it is continually being modified in order to
improve its translation quality. Moreover, it is easily accessible
to Internet users through the web MT service BABELFISH, which
provides free translations between different
languages. With regard to the problem of pronominal anaphora
resolution and translation, Systran is one of the best MT systems
studied (see Section 2) because, like our own system, it treats
the problems of intersentential pronominal anaphora and Spanish
zero pronouns on unrestricted texts after carrying out a partial
parsing of the source text. As mentioned in Section 2, a free
trial of the commercial product SYSTRANLinks was used to translate
the evaluation corpora between English and Spanish. The results appear in
Table 11.
Table 11: Translation of pronominal anaphora (complement pronouns only) into Spanish, SYSTRANLinks and AGIR

Corpus    SYSTRANLinks P(%)   AGIR P(%)
SEMCOR    75.4                82.5
MTI       58.1                69.3

The SYSTRANLinks output was evaluated manually by a
human translator. Pronouns judged acceptable by the
translator were considered correctly translated; all others
were considered incorrectly translated.
Table 11 shows only the
evaluation of English complement pronoun translation into Spanish
because Systran did not translate all the subject
pronouns into Spanish. By analyzing the Systran outputs for both
corpora, we drew the following conclusions:
All the instances of the English pronouns he and
she (always with subject function) were correctly
translated into their Spanish equivalents él and ella.
All the instances of the English pronouns
it and they with subject function were omitted in
Spanish (zero pronouns). These pronouns were not resolved in
English, and subsequently were not translated into Spanish.
On the other hand, in our AGIR system, we have evaluated the
correct application of the morphological rule to translate all
source pronouns into target pronouns. A subsequent task must
decide if the pronoun in the target language (a) must be
generated as our system proposes, (b) must be substituted by
another kind of pronoun (e.g., a possessive pronoun), or (c) must
be eliminated (i.e., Spanish zero pronouns). Therefore, we have
only taken into account the complement pronoun translation in
order to make a fair comparison between the two systems.
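Using the precision figures reported in Table 11, the gap between the two systems on each corpus can be computed directly. The snippet below is only an arithmetic sketch; `TABLE_11` and `gap` are illustrative names.

```python
# Precision (%) on complement pronouns, as reported in Table 11.
TABLE_11 = {
    "SEMCOR": {"SYSTRANLinks": 75.4, "AGIR": 82.5},
    "MTI": {"SYSTRANLinks": 58.1, "AGIR": 69.3},
}


def gap(corpus: str) -> float:
    """AGIR precision minus SYSTRANLinks precision, in percentage points."""
    row = TABLE_11[corpus]
    return round(row["AGIR"] - row["SYSTRANLinks"], 1)


for corpus in TABLE_11:
    print(f"{corpus}: {gap(corpus)} percentage points")
```

The difference is expressed in percentage points (an absolute difference between two percentages) rather than as a relative improvement.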
As shown in Table 11, the
precision obtained using AGIR is approximately 7-11 percentage points higher
(depending on the corpus) than that obtained using Systran. The
errors in Systran stemmed from mistakes in the
anaphora-resolution stage that caused incorrect translations,
since the proposed antecedents and the correct ones had different
grammatical genders. These errors can occur in intrasentential
anaphors (as presented in Section 2) or in intersentential
anaphors, as in the following example extracted from the corpora:
(E) [This information] is only valid for Linux
on the Intel platform. Much of it should be applicable
to Linux on other processor architectures, but I have no first hand experience or information.
(S) Esta información es solamente válida
para Linux en la plataforma de Intel. Mucho de él debe ser
aplicable a Linux en otras configuraciones del procesador, pero
no tengo ninguna experiencia o información de primera mano.
This example shows an incorrect English-Spanish translation of
the pronoun it produced by Systran. In this case, the
antecedent (this information, feminine) is in the sentence
preceding the anaphor. The anaphor is incorrectly resolved and
therefore incorrectly translated (as the masculine pronoun él instead
of the feminine pronoun ésta).
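The kind of error in this example can be expressed as a gender-agreement check between the generated Spanish pronoun and the correct antecedent. The sketch below is an illustration only: the `PRONOUN_GENDER` table covers just the pronouns mentioned in this section, and `agrees` is a hypothetical helper, not part of either system.

```python
# Grammatical gender of the Spanish pronouns mentioned in this section.
PRONOUN_GENDER = {
    "él": "masculine",
    "ella": "feminine",
    "éste": "masculine",
    "ésta": "feminine",
}


def agrees(spanish_pronoun: str, antecedent_gender: str) -> bool:
    """True if the pronoun's gender matches the antecedent's gender.

    Pronouns missing from the table are treated as not agreeing.
    """
    return PRONOUN_GENDER.get(spanish_pronoun) == antecedent_gender


# The antecedent "this information" is feminine in Spanish.
print(agrees("él", "feminine"))
print(agrees("ésta", "feminine"))
```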
Jesus Peral
2002-12-13