Coreference Resolution
Andy Schlaikjer (hazen+)
Read the Web, Spring 2006
School of Computer Science
Carnegie Mellon University
This is my project page for the coreference resolution work I'm doing for the
10709 Read the Web course. Feel free to look around, although if you're not
already a member of the course, this will likely make very little sense :)
Overview
I'm concentrating on producing a "within-document" coreference resolution module
for the larger RTW system, capable of bootstrapping resolution models using data
extracted from materials provided by the Automatic Content Extraction (ACE)
series of evaluations, along with various layers of annotation produced by other
text processing tools.
Data produced by this module will be somewhat low-level: annotations sitting
directly on top of documents stored in an Annotations Database (ADB) instance
will encode referential phrases and their antecedent phrases. No attempt will be
made to link these phrases to higher-level entities in some knowledge base or
concepts in some ontology.
Data Types
Most of the data I'll be using will be stored as "Tag" objects (or annotations)
inside an ADB instance. Each Tag object has a "type" and a "source", which together
help identify what it is and where it came from. Tags may also have an arbitrary
number of "attributes" (key-value pairs, much like the attributes of an XML element). In order to
better understand the kinds of data I'm using, I'll outline all the ADB tag
types, their sources, and possible attributes here.
Please note this list is subject to change, especially if you have some useful
input on how your project might be doing things differently! Hopefully this list
will grow to accommodate other project groups' data types.
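To give a rough picture of the Tag structure described above (type, source,
attributes), here is a minimal sketch in Python. The `Tag` class, its field
names, and the `start`/`end` character offsets are assumptions made purely for
illustration; this is not the actual ADB API.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    """Hypothetical stand-in for an ADB Tag; not the real ADB class."""
    type: str    # e.g. "SENTENCE" or "POS_NN"
    source: str  # tool that produced the tag, e.g. "STANFORD"
    start: int   # assumed: character offset where the span begins
    end: int     # assumed: character offset just past the end of the span
    attributes: dict = field(default_factory=dict)  # arbitrary key-value pairs

    def text(self, document: str) -> str:
        """Return the span of the document that this tag covers."""
        return document[self.start:self.end]


document = "The dog barked."
noun = Tag(type="POS_NN", source="STANFORD", start=4, end=7)
sentence = Tag(type="SENTENCE", source="MXTERMINATOR", start=0, end=len(document))
```

Here `noun.text(document)` picks out the word the POS tag covers, and
`sentence.text(document)` covers the whole string.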
| Tag Type | Source | Description |
| --- | --- | --- |
| SENTENCE | MXTERMINATOR | A sentence tag, bounding a span of text that represents a sentence. |
| POS_* | STANFORD | A part-of-speech (POS) tag. There are many types of POS tags, such as NN, NNS, VB, VBD, etc. Please refer to the Penn Treebank tagset for details. |
| SYN_* | STANFORD | A syntactic constituent. There are many types of constituent tags, but each may have a "parent" tag which represents some higher level sentential constituent. Currently, the tagset includes NP, VP, PP, ADJP, ADVP, S, SBAR, SBARQ, SQ, WHADVP, WHNP, WHPP. Please refer to the Penn Treebank constituent label set for details. |
| SEM_* | ASSERT, KANTOO | A semantic argument or "target" action. Following the Propbank tagset (details in the Propbank distribution README), the primary semantic tags are ARG0, ARG1, ARG2, ARG3, ARG4, and TARGET. There are also "modifier" argument tags (ARGM-*), which are subcategorized by "function": EXT, DIR, LOC, TMP, REC, PRD, NEG, MOD, ADV, MNR, CAU, PNC, DIS. |
| ENAMEX_* | IDENTIFINDER | An entity name expression (or proper name), e.g. a person name, location, or organization. This does not include anaphoric mentions of named entities, such as the pronoun "he". The types supported are PERSON, ORGANIZATION, LOCATION. |
| TIMEX_* | IDENTIFINDER | A time expression. Types supported are DATE, TIME. |
| NUMEX_* | IDENTIFINDER | A numerical expression. Types supported are MONEY, PERCENT. |
| ANTECEDENT | COREFEREE | An antecedent phrase. This tag is a marker which signals that some other REFERENT tag links back to this ANTECEDENT tag. |
| REFERENT | COREFEREE | A referential phrase. This tag will also have an attribute which links it back to the ANTECEDENT to which the text refers. |
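To make the link between REFERENT and ANTECEDENT tags concrete, here is a
minimal Python sketch. Tags are represented as plain dictionaries, and the `id`
key, the `start`/`end` offsets, and the `antecedent` attribute name are all
assumptions for illustration; this page does not specify how the linking
attribute is actually named or how tags are identified.

```python
def resolve_antecedent(referent, tags_by_id):
    """Return the ANTECEDENT tag a REFERENT tag links back to, or None."""
    # "antecedent" is a hypothetical attribute name holding the target tag's id.
    target_id = referent["attributes"].get("antecedent")
    target = tags_by_id.get(target_id)
    if target is not None and target["type"] == "ANTECEDENT":
        return target
    return None


document = "Mary said she was late."
tags = [
    {"id": 1, "type": "ANTECEDENT", "source": "COREFEREE",
     "start": 0, "end": 4, "attributes": {}},
    {"id": 2, "type": "REFERENT", "source": "COREFEREE",
     "start": 10, "end": 13, "attributes": {"antecedent": 1}},
]
tags_by_id = {tag["id"]: tag for tag in tags}

referent = tags[1]
antecedent = resolve_antecedent(referent, tags_by_id)
referent_text = document[referent["start"]:referent["end"]]
antecedent_text = document[antecedent["start"]:antecedent["end"]]
```

In this toy document the REFERENT covering "she" resolves to the ANTECEDENT
covering "Mary". A REFERENT without a usable link simply resolves to `None`.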
Reading List
This list has been extracted from the project proposal document.
- Sally Goldman and Yan Zhou. Enhancing supervised learning with unlabeled data. In Proceedings of
the 17th International Conference on Machine Learning (ICML), pages 327--334, 2000.
- Sanda Harabagiu, Razvan Bunescu, and Steven Maiorano. Text and knowledge mining for coreference
resolution. In Proceedings of the 2nd Meeting of the North American Chapter of the Association for
Computational Linguistics (NAACL), pages 55--62, 2001.
- Ryu Iida, Kentaro Inui, and Yuji Matsumoto. The issue of combining anaphoricity determination and
antecedent identification in anaphora resolution. In IEEE International Conference on Natural Language
Processing and Knowledge Engineering (IEEE NLPKE), pages 244--249, 2005.
- Ryu Iida, Kentaro Inui, Hiroya Takamura, and Yuji Matsumoto. Incorporating contextual cues in
trainable models for coreference resolution. In Proceedings of the EACL Workshop on the Computational
Treatment of Anaphora, pages 23--30, 2003.
- Katja Markert, Natalia Modjeska, and Malvina Nissim. Using the Web for nominal anaphora resolution.
In Proceedings of the EACL Workshop on the Computational Treatment of Anaphora, 2003.
- Andrew McCallum and Ben Wellner. Toward conditional models of identity uncertainty with application
to proper noun coreference. In IJCAI Workshop on Information Integration on the Web, 2003.
- Christoph Mueller, Stefan Rapp, and Michael Strube. Applying cotraining to reference resolution. In
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), July
2002.
- Vincent Ng. Learning noun phrase anaphoricity to improve coreference resolution: Issues in
representation and optimization. In Proceedings of the 42nd Annual Meeting of the Association for
Computational Linguistics (ACL), pages 152--159, 2004.
- Vincent Ng. Machine learning for coreference resolution: From local classification to global ranking. In
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), 2005.
- Vincent Ng and Claire Cardie. Bootstrapping coreference classifiers with multiple machine learning
algorithms. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing
(EMNLP), 2003.
- Vincent Ng and Claire Cardie. Weakly supervised natural language learning without redundant views.
In Proceedings of the Human Language Technology Conference of the North American Chapter of the
Association for Computational Linguistics (HLT-NAACL), 2003.
- Wee Meng Soon, Hwee Tou Ng, and Daniel Chung Yong Lim. A machine learning approach to coreference
resolution of noun phrases. Computational Linguistics, 27(4):521--544, 2001.
- Mark Steedman, Miles Osborne, Anoop Sarkar, Stephen Clark, Rebecca Hwa, Julia Hockenmaier, Paul
Ruhlen, Steven Baker, and Jeremiah Crim. Bootstrapping statistical parsers from small datasets.
In Proceedings of the 11th Conference of the European Chapter of the Association for Computational
Linguistics (EACL), 2003.
This page last modified on 2006-07-11