... pronouns1
This kind of pronoun is presented in detail in Section 4.1.
... Systran2
A free trial of the commercial product SYSTRANLinks (copyright 2002 by SYSTRAN S.A.) was used to translate, between English and Spanish, all the corpora used in the evaluation of our approach (URL = http://w4.systranlinks.com/config, visited on 06/22/2002).
... correctly3
In this paper, we use the symbols (S) and (E) to mark Spanish and English texts, respectively. The symbol "Ø" marks the position of an omitted pronoun. In the examples, the pronoun and its antecedent carry an index; co-indexing indicates co-reference between them.
... descriptions4
One-anaphora has the following structure in English: a determiner and the pronoun "one" with premodifiers or postmodifiers (the red one; the one with the blue bow). In Spanish, this kind of anaphor consists of a noun phrase in which the noun has been omitted (el rojo; el que tiene el lazo azul). In definite descriptions, anaphors are formed by definite noun phrases that refer to objects that are usually uniquely determined in the context.
... information5
The SS stores the following information for each constituent: constituent name (NP, PP, etc.), semantic and morphological information, discourse marker (identifier of the entity or discourse object), and the SS of its subconstituents.
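As a rough illustration of the record described in this footnote, the sketch below models one SS entry as a recursive Python data class. The class name `SSNode`, its field names, the helper `discourse_markers`, and the example values are all assumptions made for illustration; the footnote specifies only which kinds of information are stored, not how they are represented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SSNode:
    """One constituent entry (hypothetical layout, not the system's actual format)."""
    name: str                                                      # constituent name: "NP", "PP", ...
    semantic: Dict[str, str] = field(default_factory=dict)        # semantic information
    morphological: Dict[str, str] = field(default_factory=dict)   # e.g. number, gender
    discourse_marker: str = ""                                     # identifier of the entity
    subconstituents: List["SSNode"] = field(default_factory=list)  # nested entries


def discourse_markers(node: SSNode) -> List[str]:
    """Collect the discourse markers of a constituent and its subconstituents (pre-order)."""
    found = [node.discourse_marker] if node.discourse_marker else []
    for child in node.subconstituents:
        found.extend(discourse_markers(child))
    return found


# Example: "the one with the blue bow" as an NP containing a PP that contains an NP.
inner_np = SSNode("NP", {"type": "object"}, {"number": "singular"}, "x2")
pp = SSNode("PP", subconstituents=[inner_np])
outer_np = SSNode("NP", {"type": "object"}, {"number": "singular"}, "x1", [pp])
```

Nesting the entries mirrors the last item in the footnote (each constituent holds the SS of its subconstituents), so a later stage could walk the structure to list candidate antecedents by their discourse markers.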
... stage6
In the evaluation of our approach, only one corpus (the English SemCor corpus) has all its content words annotated with their WordNet sense; this sense has been used to identify the semantic category of each word. The remaining corpora carry no information about the senses of the content words; therefore, a set of heuristics has been used to identify their semantic categories. A WSD module [Montoyo & Palomar, 2000] is currently being developed in our Research Group and will be incorporated into our system in the future.
... LEXESP7
The LEXESP corpus belongs to the project of the same name, carried out by the Psychology Department of the University of Oviedo and developed by the Computational Linguistics Group of the University of Barcelona, with the collaboration of the Language Processing Group of the Catalonia University of Technology, Spain.
... insignificant8
To compare our system with other systems, in Section 6.2 we evaluate pronoun translation (including zero pronouns) between Spanish and English using the commercial product SYSTRANLinks and the Spanish LEXESP corpus. The evaluation highlights the deficiencies of zero-pronoun detection, resolution, and translation (out of 559 anaphoric, third-person, zero pronouns in the LEXESP corpus, only 266 were correctly translated into English, a precision of only 47.6%).
... information9
It is important to mention here that semantic information was not available for the Spanish corpora.
... SUPAR10
A detailed study of these implementations in SUPAR is presented in [Palomar et al., 2001].
... Hobbs11
Hobbs's algorithm is frequently used as a baseline against which work on anaphora resolution is compared. It does not work as well as ours because it requires a full parse of the text. Furthermore, the manner in which Hobbs's algorithm explores the syntactic tree is not the best one for Spanish, since Spanish is nearly a free-word-order language.
... resolution12
As previously mentioned, only anaphoric, third-person, personal pronouns will be resolved in order to translate them into the target language.
... MTI13
This corpus was provided by the Computational Linguistics Research Group of the School of Humanities, Languages and Social Studies, University of Wolverhampton, England. The corpus is annotated for anaphora, marking the anaphors and their correct antecedents.
... entities14
If we use a basic ontology based on semantic features, at the top level, entities could be classified into three main categories: person, animal, and object.
... anaphor15
The sentences of the SemCor corpus are very long (with an average of 24.3 words per sentence). This fact implies a large number of candidates per anaphor (an average of 15.2) after applying constraints.
... anaphor16
The sentences of the MTI corpus are not very long (with an average of 15.5 words per sentence). However, the number of candidates per anaphor, after applying constraints, is high (an average of 13.6).
... presented17
As mentioned earlier, all the results presented here were automatically obtained after the anaphoric annotation of each pronoun. After the tagging and the partial parsing of the source text, pronominal anaphora were resolved and translated into the target language. None of the intermediate outputs needed to be adjusted manually in order to be processed subsequently.
... information18
Hobbs proposed the use of semantic information, in the form of selectional restrictions, as a straightforward extension of his method to improve the results obtained in anaphora resolution.
... texts19
To detect pleonastic it pronouns in AGIR, a set of rules based on pattern recognition was constructed to identify this type of pronoun. These rules were based on the work of [Lappin & Leass, 1994; Paice & Husk, 1987; Denber, 1998], which dealt with this problem in a similar way. We have used the information provided by the POS tagger to improve the detection of the different patterns. We have evaluated the method using journalistic texts from a portion of the Federal Register corpus that contains a set of 313 documents (156,831 words). In the detection of pleonastic it pronouns, a precision of 88.7% (568 out of 640) was obtained. Finally, it is important to point out the high percentage of it pronouns in the test corpus that are pleonastic (32.9%). This fact demonstrates the importance of correctly detecting this kind of pronoun in any MT system.
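To give a concrete idea of what pattern-based detection can look like, the following is a minimal sketch only. It is not the AGIR rule set: the word lists, the three regular expressions, and the function name `is_pleonastic` are assumptions chosen for illustration, loosely in the style of the patterns discussed in the cited literature (for example, "it is" followed by a modal adjective, or by a cognitive verb and "that"). It also works on raw text, whereas the method described in this footnote additionally uses POS-tagger output.

```python
import re

# Illustrative word lists (assumptions, far from exhaustive).
MODAL_ADJ = r"(?:necessary|possible|important|likely|certain|good|useful|advisable)"
COG_VERB = r"(?:believed|thought|known|expected|assumed|recommended)"

# Each pattern describes one surface context in which "it" is typically non-referential.
PATTERNS = [
    # "it is (very) important to/that/for ..."
    re.compile(rf"\bit\s+(?:is|was)\s+(?:\w+\s+)?{MODAL_ADJ}\s+(?:that|to|for)\b", re.I),
    # "it is believed that ..."
    re.compile(rf"\bit\s+(?:is|was)\s+{COG_VERB}\s+that\b", re.I),
    # "it seems that ..."
    re.compile(r"\bit\s+(?:seems|appears|means|follows)\s+that\b", re.I),
]


def is_pleonastic(sentence: str) -> bool:
    """Return True if any illustrative pattern matches somewhere in the sentence."""
    return any(p.search(sentence) for p in PATTERNS)
```

A sentence such as "It is believed that the rule applies." matches the second pattern, while "The dog ate it quickly." matches none, so its pronoun would be passed on to anaphora resolution.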
... on20
In the automatic evaluation, a pronoun was considered correctly translated when the pronoun proposed by the system was the same as the one proposed by the human annotator. With this criterion, we evaluated the correct application of the corresponding morphological rule.
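The criterion above amounts to an exact-match comparison between two aligned lists of pronouns. The sketch below shows one way it could be computed; the function name `translation_precision`, the case-insensitive comparison, and the sample pronouns are assumptions for illustration and do not reproduce the evaluation scripts actually used.

```python
from typing import List, Tuple


def translation_precision(system: List[str], reference: List[str]) -> Tuple[int, float]:
    """Count exact matches between system and annotator pronouns.

    Returns (number correct, precision as a percentage).
    Comparison ignores case and surrounding whitespace (an assumption).
    """
    if len(system) != len(reference):
        raise ValueError("system and reference lists must be aligned")
    if not reference:
        raise ValueError("cannot compute precision over an empty list")
    correct = sum(
        s.strip().lower() == r.strip().lower()
        for s, r in zip(system, reference)
    )
    return correct, 100.0 * correct / len(reference)


# Hypothetical sample: the system output disagrees with the annotator on the second pronoun.
system_out = ["he", "it", "they", "she"]
annotated = ["he", "she", "they", "she"]
n_correct, pct = translation_precision(system_out, annotated)
```

Because only the surface form of the pronoun is compared, a mismatch signals that the wrong morphological rule was applied (for instance, the wrong gender or number), which is exactly what this criterion is meant to measure.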
... web--BABELFISH21
URL = http://www.babelfish.altavista.com (visited on 03/11/2002).
... SYSTRANLinks22
URL = http://w4.systranlinks.com/config (visited on 06/22/2002).