Evaluation



The task was evaluated in two experiments: one on zero-pronoun detection and one on zero-pronoun resolution. In both experiments the method was tested on two kinds of corpora. First, we used a portion of the LEXESP⁷ corpus consisting of thirty-one documents (38,999 words) from different genres and by different authors. Its texts vary in style and topic (newspaper articles about politics, sports, etc.; narratives about specific topics; novel fragments; etc.). Second, the method was tested on a fragment (15,571 words) of the Spanish version of the Blue Book (BB) corpus, a technical manual that contains the handbook of the International Telecommunications Union (CCITT), published in English, French, and Spanish. Both corpora were tagged automatically, using different taggers.

We randomly selected a subset of the LEXESP corpus (three documents, 6,457 words) and a fragment of the Blue Book corpus (4,723 words) as training corpora. The remaining fragments of the corpora were reserved as test data.
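The size of the test data follows from the figures above. The short Python sketch below derives it, assuming that the training portions are drawn from within the 38,999-word and 15,571-word totals reported earlier (the text implies this but does not state the test sizes explicitly). The dictionary layout and variable names are illustrative only.

```python
# Word counts as reported in the text: (total words, training words).
# Assumption: the training words are a subset of the reported totals.
corpora = {
    "LEXESP": (38_999, 6_457),
    "Blue Book": (15_571, 4_723),
}

# Words left for testing = total words minus training words.
test_words = {
    name: total - train for name, (total, train) in corpora.items()
}

# Share of each corpus used for training.
train_share = {
    name: train / total for name, (total, train) in corpora.items()
}

for name in corpora:
    print(f"{name}: test = {test_words[name]} words, "
          f"training share = {train_share[name]:.1%}")
```

Under that assumption, 32,542 words of LEXESP and 10,848 words of the Blue Book fragment remain for testing.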

It is important to emphasize that all the tasks presented in this paper were evaluated automatically, after each pronoun (including zero pronouns) had been annotated. To this end, each anaphoric, third-person, personal pronoun was annotated with information about its antecedent and its translation into the target language. Furthermore, co-reference chains were identified. The annotation phase proceeded as follows: (1) two annotators (native speakers) were selected for each language, (2) the annotators agreed on the annotation scheme, (3) each annotator annotated the corpora, and (4) a reliability test [Carletta, J., et al., 1997] was run on the annotation to validate the results. The reliability test used the kappa statistic, which measures the agreement between two annotators when making judgments about categories. Annotation is thus treated as a classification task that consists of selecting the adequate solution from the candidate list. According to Carletta et al. [Carletta, J., et al., 1997], a k measurement such as $0.68 < k < 0.80$ allows us to draw encouraging conclusions, and a measurement $k > 0.80$ means there is total reliability between the results of the two annotators. In our tests, we obtained a kappa measurement of 0.83. Therefore, we consider the annotation obtained for the evaluation to be totally reliable.
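To make the reliability test concrete, the following is a minimal Python sketch of a kappa computation for two annotators. It uses Cohen's formulation, in which chance agreement is estimated from each annotator's own label distribution; this is an assumption, since the text does not say which variant of kappa was applied. The function name `cohen_kappa` and the toy labels are hypothetical and are not taken from the annotated corpora.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (P_o - P_e) / (1 - P_e), where P_o is the observed
    agreement and P_e the agreement expected by chance.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each category, the product of the two
    # annotators' relative frequencies, summed over categories.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical category throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical): each label is the candidate antecedent
# chosen by an annotator for one pronoun.
annotator_a = ["x"] * 5 + ["y"] * 5
annotator_b = ["x"] * 5 + ["y"] * 4 + ["x"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")
```

In this toy case the annotators agree on nine of ten items while chance agreement is 0.5, so kappa is about 0.8, below the raw agreement of 0.9. This correction for chance is why kappa, rather than plain percentage agreement, is used to judge the annotation.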


Jesus Peral 2002-12-13