Coreference Resolution

Andy Schlaikjer (hazen+)
Read the Web, Spring 2006
School of Computer Science
Carnegie Mellon University

This is my project page for the coreference resolution work I'm doing for the 10709 Read the Web course. Feel free to look around, although if you're not already a member of the course, this will likely make very little sense :)

Overview

I'm concentrating on producing a "within-document" coreference resolution module for the larger RTW system. The module should be capable of bootstrapping resolution models from data extracted from materials provided by the Automatic Content Extraction (ACE) series of evaluations, along with various layers of annotation produced by other text processing tools.

Data produced by this module will be somewhat low-level: annotations sitting directly on top of documents stored in an Annotations Database (ADB) instance will encode referential phrases and their antecedent phrases. No attempt will be made to link these phrases to higher-level entities in some knowledge base or concepts in some ontology.

Data Types

Most of the data I'll be using will be stored as "Tag" objects (or annotations) inside an ADB instance. Each Tag object has a "type" as well as a "source", which together help identify what it is and where it came from. Tags may also have an arbitrary number of "attributes" (key-value pairs, kind of like attributes on XML elements). In order to better explain the kinds of data I'm using, I'll outline all the ADB tag types, their sources, and possible attributes here.

Please note this list is subject to change, especially if you have some useful input on how your project might be doing things differently! Hopefully this list will grow to accommodate other project groups' data types.
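To make the Tag description above a bit more concrete, here is a minimal Python sketch of what such an object might look like in memory. This is not the actual ADB API: the `Tag` class, the `start`/`end` character offsets, and the `family` helper are all assumptions made up for illustration. Only the "type", "source", and "attributes" fields come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Tag:
    """Hypothetical in-memory view of an ADB annotation (not the real ADB API)."""
    type: str    # e.g. "SENTENCE", "POS_VBD", "ENAMEX_PERSON"
    source: str  # tool that produced the tag, e.g. "STANFORD"
    start: int   # assumed: character offset where the span begins
    end: int     # assumed: character offset just past the end of the span
    attributes: Dict[str, str] = field(default_factory=dict)

    def family(self) -> str:
        """Return the part of the type before the first underscore."""
        return self.type.split("_", 1)[0]


text = "Mary wrote the page."
tags = [
    Tag("SENTENCE", "MXTERMINATOR", 0, 20),
    Tag("ENAMEX_PERSON", "IDENTIFINDER", 0, 4),
    Tag("POS_VBD", "STANFORD", 5, 10),
]

# Group tags by family, so every POS_* or ENAMEX_* tag can be found together.
by_family: Dict[str, list] = {}
for tag in tags:
    by_family.setdefault(tag.family(), []).append(tag)
```

Keeping the subtype in the type string (as in `POS_VBD`) mirrors the `POS_*` style naming used in the list of tag types; a separate subtype field would work just as well.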

Tag Type	Source	Description
`SENTENCE`	`MXTERMINATOR`	A sentence tag which bounds a span of text representing a sentence.
`POS_*`	`STANFORD`	A part-of-speech (POS) tag. There are many types of POS tags, such as `NN`, `NNS`, `VB`, `VBD`, etc. Please refer to the Penn Treebank tagset for details.
`SYN_*`	`STANFORD`	A syntactic constituent. There are many types of constituent tags, and each may have a "parent" tag which represents some higher-level sentential constituent. Currently, the tagset includes `NP`, `VP`, `PP`, `ADJP`, `ADVP`, `S`, `SBAR`, `SBARQ`, `SQ`, `WHADVP`, `WHNP`, `WHPP`. Please refer to the Penn Treebank constituent label set for details.
`SEM_*`	`ASSERT`, `KANTOO`	A semantic argument or "target" action. Following the Propbank tagset (details in the Propbank distribution README), the primary semantic tag types are `ARG0`, `ARG1`, `ARG2`, `ARG3`, `ARG4`, and `TARGET`. There are also "modifier" argument tags (`ARGM-*`), which are subcategorized by "function": `EXT`, `DIR`, `LOC`, `TMP`, `REC`, `PRD`, `NEG`, `MOD`, `ADV`, `MNR`, `CAU`, `PNC`, `DIS`.
`ENAMEX_*`	`IDENTIFINDER`	An entity name expression (or proper name), e.g. a person name, location, or organization. This does not include anaphoric mentions of named entities, such as the pronoun "he". The types supported are `PERSON`, `ORGANIZATION`, `LOCATION`.
`TIMEX_*`	`IDENTIFINDER`	A time expression. Types supported are `DATE`, `TIME`.
`NUMEX_*`	`IDENTIFINDER`	A numerical expression. Types supported are `MONEY`, `PERCENT`.
`ANTECEDENT`	`COREFEREE`	An antecedent phrase. This tag is a marker which signals that some `REFERENT` tag links back to this `ANTECEDENT` tag.
`REFERENT`	`COREFEREE`	A referential phrase. This tag will also have an attribute which links it back to the `ANTECEDENT` to which the text refers.
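As a rough illustration of how a REFERENT tag might point back at its ANTECEDENT, here is a small Python sketch. The attribute names `id` and `antecedent_id`, the `Tag` class, the character offsets, and the `resolve` function are all hypothetical; the description above only says that the REFERENT carries some attribute linking it to its ANTECEDENT, not what that attribute is called.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Tag:
    """Hypothetical in-memory view of an ADB annotation (not the real ADB API)."""
    type: str
    source: str
    start: int  # assumed: character offset where the span begins
    end: int    # assumed: character offset just past the end of the span
    attributes: Dict[str, str] = field(default_factory=dict)


def resolve(referent: Tag, tags: List[Tag]) -> Optional[Tag]:
    """Follow a REFERENT's (assumed) link attribute to its ANTECEDENT tag."""
    target = referent.attributes.get("antecedent_id")
    if target is None:
        return None  # this tag carries no link at all
    for tag in tags:
        if tag.type == "ANTECEDENT" and tag.attributes.get("id") == target:
            return tag
    return None  # dangling link: no ANTECEDENT with that id


text = "Mary wrote the page. She updated it."
tags = [
    Tag("ANTECEDENT", "COREFEREE", 0, 4, {"id": "a1"}),
    Tag("REFERENT", "COREFEREE", 21, 24, {"antecedent_id": "a1"}),
]

referent = tags[1]
antecedent = resolve(referent, tags)
if antecedent is not None:
    print(text[referent.start:referent.end], "->",
          text[antecedent.start:antecedent.end])
```

Because the link lives on the REFERENT only, several referential phrases can share one ANTECEDENT without the ANTECEDENT tag needing to change.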

Reading List

This list has been extracted from the project proposal document.

Sally Goldman and Yan Zhou. Enhancing supervised learning with unlabeled data. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 327--334, 2000.
Sanda Harabagiu, Razvan Bunescu, and Steven Maiorano. Text and knowledge mining for coreference resolution. In Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 55--62, 2001.
Ryu Iida, Kentaro Inui, and Yuji Matsumoto. The issue of combining anaphoricity determination and antecedent identification in anaphora resolution. In IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLPKE), pages 244--249, 2005.
Ryu Iida, Kentaro Inui, Hiroya Takamura, and Yuji Matsumoto. Incorporating contextual cues in trainable models for coreference resolution. In Proceedings of the EACL Workshop on the Computational Treatment of Anaphora, pages 23--30, 2003.
Katja Markert, Natalia Modjeska, and Malvina Nissim. Using the Web for nominal anaphora resolution. In Proceedings of the EACL Workshop on the Computational Treatment of Anaphora, 2003.
Andrew McCallum and Ben Wellner. Toward conditional models of identity uncertainty with application to proper noun coreference. In IJCAI Workshop on Information Integration on the Web, 2003.
Christoph Mueller, Stefan Rapp, and Michael Strube. Applying cotraining to reference resolution. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), July 2002.
Vincent Ng. Learning noun phrase anaphoricity to improve coreference resolution: Issues in representation and optimization. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 152--159, 2004.
Vincent Ng. Machine learning for coreference resolution: From local classification to global ranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), 2005.
Vincent Ng and Claire Cardie. Bootstrapping coreference classifiers with multiple machine learning algorithms. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2003.
Vincent Ng and Claire Cardie. Weakly supervised natural language learning without redundant views. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2003.
Wee Meng Soon, Hwee Tou Ng, and Daniel Chung Yong Lim. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521--544, 2001.
Mark Steedman, Miles Osborne, Anoop Sarkar, Stephen Clark, Rebecca Hwa, Julia Hockenmaier, Paul Ruhlen, Steven Baker, and Jeremiah Crim. Bootstrapping statistical parsers from small datasets. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2003.

This page last modified on 2006-07-11