The algorithm for anaphora resolution in English is based on
the one developed for Spanish, suitably adapted for
English. The main difference between the two algorithms lies
in the order of the preferences obtained after the
training phase. From this phase, we drew the following
conclusions:
- Spanish has more morphological information than
  English. As a consequence, morphological constraints in Spanish
  discard more candidates than constraints in English.
- Spanish is a nearly free-order language, in which the
  different constituents of a sentence (subject, object, etc.) can
  appear almost at any position. For this reason, the preference of
  syntactic parallelism has a more important role in the
  anaphora-resolution method in English than in Spanish.
- Spanish sentences are usually longer than English ones. This
  fact implies more candidates for Spanish anaphors than for
  English ones.
After the training phase, the algorithm was evaluated on the
test corpus. In the evaluation phase, two experiments were
carried out. In the first experiment, only lexical, morphological, and
syntactic information was used. The results obtained with
the SemCor and MTI corpora appear in Table 7.
Table 7: Anaphora resolution in English, evaluation phase: experiment 1

Corpus   He  She   It  They  Him  Her  Them  Corr  Total  P(%)
SEMCOR  116   10   38    50   34    0     6   175    254  68.9
MTI       1    0  347    56    0    0    66   361    470  76.8

The table shows the number of pronouns (classified by type) for
the different corpora. The last three columns give the number
of correctly resolved pronouns, the total number of pronouns, and the
resulting precision, respectively. For instance, 361 of the 470
pronouns in the MTI corpus were resolved correctly, a precision of 76.8%.
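
As a quick check of how the last three columns relate, the following minimal sketch recomputes the Total and P(%) columns from the per-pronoun counts and the Corr column of Table 7. It assumes precision is simply Corr divided by Total, as the column description suggests; the names `TABLE_7` and `precision` are illustrative and do not come from the original system.

```python
# Per-pronoun counts in the column order of Table 7:
# He, She, It, They, Him, Her, Them
TABLE_7 = {
    "SEMCOR": {"counts": [116, 10, 38, 50, 34, 0, 6], "correct": 175},
    "MTI": {"counts": [1, 0, 347, 56, 0, 0, 66], "correct": 361},
}


def precision(correct, total):
    """Return correct/total as a percentage rounded to one decimal.

    Assumption: this ratio is what the P(%) column reports.
    """
    return round(100.0 * correct / total, 1)


if __name__ == "__main__":
    for corpus, row in TABLE_7.items():
        total = sum(row["counts"])  # reproduces the Total column
        print(corpus, total, precision(row["correct"], total))
```

Run on the figures of Table 7, the totals come out as 254 and 470 and the precisions as 68.9 and 76.8, matching the table.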
Discussion. In pronominal anaphora resolution in English,
the following precision (P) and recall (R) figures were obtained in the
first experiment: SemCor corpus, P = 68.9%, R = 66%; MTI corpus,
P = 76.8%, R = 72.9%.
From these results, we drew the following conclusions:
- The types of pronouns vary considerably according to the
  corpus. In the SemCor corpus, 15% of the pronouns are occurrences
  of the it pronoun, whereas in the MTI corpus this percentage is
  73.8%. This is explained by the kind and domain of each
  corpus. SemCor is a corpus with a narrative style that
  contains many person entities (footnote 14), which are
  referred to in the text with personal pronouns (he,
  she, and they). The MTI corpus, on the other hand, is a
  collection of technical manuals that contains almost no
  person entities. Rather, most references are to object
  entities, using it pronouns.
- In the SemCor corpus, errors originated from different
  causes:
  - The lack of semantic information caused 57% of all
    mistakes. In seventeen cases, the system proposed a person entity
    as the antecedent of an it pronoun. In a further twenty-eight
    cases, occurrences of the pronouns he, she, him, and her were
    resolved incorrectly because the system proposed an object or
    animal entity as the antecedent.
  - There were exceptions in the application of preferences (38%), mainly
    due to the large number of candidates compatible with the
    anaphor (footnote 15).
  - There were mistakes in the POS tagging (5%).
- In the MTI corpus, errors occurred mainly in the resolution of
  it pronouns (73.4% of all mistakes). The it
  pronoun lacks gender information (it is compatible with masculine and
  feminine antecedents), and consequently there are many candidates per
  anaphor (footnote 16). This causes errors in the application
  of preferences. The remaining errors are caused by the lack of
  semantic information.
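
The percentages quoted in these conclusions can be rederived from Table 7. The sketch below is only an arithmetic check: it computes the share of it pronouns per corpus and the share of SemCor errors attributed to missing semantic information, assuming the number of errors equals Total minus Corr. The variable and function names are illustrative.

```python
# Pronoun counts per corpus, taken from Table 7
SEMCOR = {"he": 116, "she": 10, "it": 38, "they": 50, "him": 34, "her": 0, "them": 6}
MTI = {"he": 1, "she": 0, "it": 347, "they": 56, "him": 0, "her": 0, "them": 66}


def share(counts, pronoun):
    """Percentage of all pronouns in a corpus that are `pronoun`."""
    return round(100.0 * counts[pronoun] / sum(counts.values()), 1)


# Share of "it" pronouns in each corpus
it_share_semcor = share(SEMCOR, "it")
it_share_mti = share(MTI, "it")

# SemCor errors, assuming errors = Total - Corr (254 - 175)
semcor_errors = sum(SEMCOR.values()) - 175
# Errors attributed to missing semantic information: 17 + 28
semantic_errors = 17 + 28
semantic_error_share = round(100.0 * semantic_errors / semcor_errors)

if __name__ == "__main__":
    print(it_share_semcor, it_share_mti)
    print(semcor_errors, semantic_errors, semantic_error_share)
```

Under that assumption, SemCor has 79 errors, of which 45 (57%) are the semantic ones, and the it shares come out as 15.0% and 73.8%, in line with the text.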
After analyzing the results, it was observed that the precision
on the SemCor corpus was approximately 8 percentage points lower than
on the MTI corpus. The errors in the SemCor corpus stemmed mainly
from the lack of semantic information. Therefore, in order to
improve the results, a second experiment was carried out with the
addition of semantic information.
The modifications made for the second experiment were the following:
- Two new semantic constraints, presented in [Saiz-Noeda et al., 2000], were
  added to the morphological and syntactic constraints:
  - The pronouns he, she, him, and her
    must have person entities as their antecedents.
  - The pronoun it must have a non-person
    entity as its antecedent.
  To apply these new constraints, the twenty-five top concepts of
  WordNet (the concepts at the top level in the ontology) were
  grouped into three categories: person, animal, and
  object. WordNet was then consulted with the head
  of each candidate to obtain the semantic category of that
  candidate.
- This experiment was carried out exclusively with the SemCor corpus because
  it is the only one in which content words are
  annotated with their WordNet sense.
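
The two semantic constraints can be illustrated with a minimal sketch. It is not the original implementation: the grouping of the twenty-five WordNet top concepts into person, animal, and object is not listed in this section, so the `TOP_CONCEPT_TO_CATEGORY` table below is a small hypothetical stand-in, and each candidate is assumed to arrive already labelled with the top concept of its head's annotated sense. All names (`Candidate`, `filter_candidates`, and so on) are illustrative.

```python
from dataclasses import dataclass

# Hypothetical grouping of a few top concepts into the three categories
# named in the text. The real grouping is not given here (assumption).
TOP_CONCEPT_TO_CATEGORY = {
    "person": "person",
    "animal": "animal",
    "artifact": "object",
    "substance": "object",
    "location": "object",
}

PERSON_PRONOUNS = {"he", "she", "him", "her"}


@dataclass
class Candidate:
    head: str         # head word of the candidate noun phrase
    top_concept: str  # top concept of the head's sense (assumed precomputed)


def semantic_category(candidate):
    """Map a candidate to person, animal, or object (object by default)."""
    return TOP_CONCEPT_TO_CATEGORY.get(candidate.top_concept, "object")


def satisfies_semantic_constraints(pronoun, candidate):
    """Apply the two constraints stated in the text."""
    category = semantic_category(candidate)
    pronoun = pronoun.lower()
    if pronoun in PERSON_PRONOUNS:
        return category == "person"   # he/she/him/her need a person entity
    if pronoun == "it":
        return category != "person"   # it needs a non-person entity
    return True  # no semantic constraint is stated for they/them


def filter_candidates(pronoun, candidates):
    """Keep only the candidates compatible with the pronoun."""
    return [c for c in candidates if satisfies_semantic_constraints(pronoun, c)]


if __name__ == "__main__":
    candidates = [
        Candidate("doctor", "person"),
        Candidate("dog", "animal"),
        Candidate("manual", "artifact"),
    ]
    print([c.head for c in filter_candidates("he", candidates)])
    print([c.head for c in filter_candidates("it", candidates)])
```

In this sketch the constraints act as a filter applied before the preferences, so they only shrink the candidate list. They would remove exactly the kind of error described for the first experiment, where a person entity was proposed for it or an object or animal entity for he, she, him, or her.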
Table 8 shows the number of pronouns (classified by type) for
the different corpora after these changes were incorporated.
Table 8: Anaphora resolution in English, evaluation phase: experiment 2

Corpus   He  She   It  They  Him  Her  Them  Corr  Total  P(%)
SEMCOR  116   10   38    50   34    0     6   220    254  86.6
MTI       1    0  347    56    0    0    66   361    470  76.8

As shown in Table 8, the
addition of the two simple semantic constraints resulted in a considerable
improvement in precision for the SemCor corpus (approximately 18
percentage points, from 68.9% to 86.6%). We concluded that the use of
semantic information (such as new constraints and preferences) in the
process of anaphora resolution will improve the results obtained.
Finally, Table 9 compares anaphora resolution using AGIR with the
other approaches previously presented (footnote 17). It is important to
emphasize the high percentages obtained using our system and Hobbs's
method in the SemCor corpus; both systems incorporate semantic
information (footnote 18) into their methods
using semantic constraints (selectional restrictions), whereas
none of the other approaches incorporates semantics.
Table 9: Anaphora resolution in English, comparison of AGIR with other approaches