Coreference Resolution

Andy Schlaikjer (hazen+)
Read the Web, Spring 2006
School of Computer Science
Carnegie Mellon University

This is my project page for the coreference resolution work I'm doing for the 10709 Read the Web course. Feel free to look around, although if you're not already a member of the course, this will likely make very little sense :)

Overview

I'm concentrating on producing a "within-document" coreference resolution module for the larger RTW system, capable of bootstrapping resolution models using data extracted from materials provided by the Automatic Content Extraction (ACE) series of evaluations, along with various layers of annotation produced by other text processing tools.

Data produced by this module will be somewhat low-level: annotations sitting directly on top of documents stored in an Annotations Database (ADB) instance will encode referential phrases and their antecedent phrases. No attempt will be made to link these phrases to higher-level entities in a knowledge base or to concepts in an ontology.

Data Types

Most of the data I'll be using will be stored as "Tag" objects (or annotations) inside an ADB instance. Each Tag object has a "type" as well as a "source", which together help identify what it is and where it came from. Tags may also have an arbitrary number of "attributes" (key-value pairs, kind of like the attributes on XML elements). To make it clearer what kinds of data I'm using, I'll outline all the ADB tag types, their sources, and possible attributes here.
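To make the description above a bit more concrete, here is a minimal sketch of what a Tag might look like as a plain record. This is not the ADB's actual API: the class layout, the `start`/`end` character offsets, and the `text` helper are all assumptions made for illustration. Only the "type", "source", and "attributes" fields come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Tag:
    """Hypothetical stand-in for an ADB Tag (not the real ADB class)."""
    type: str      # e.g. "SENTENCE" or "ENAMEX_PERSON"
    source: str    # the tool that produced the tag, e.g. "MXTERMINATOR"
    start: int     # assumed: character offset where the span begins
    end: int       # assumed: character offset just past the span
    attributes: Dict[str, str] = field(default_factory=dict)

    def text(self, document: str) -> str:
        """Return the span of the document that this tag covers."""
        return document[self.start:self.end]


doc = "Alice left. She was tired."
sentence = Tag("SENTENCE", "MXTERMINATOR", 0, 11)
person = Tag("ENAMEX_PERSON", "IDENTIFINDER", 0, 5)
```

With these made-up offsets, `sentence.text(doc)` gives back the first sentence and `person.text(doc)` gives back the name it contains.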

Please note this list is subject to change, especially if you have some useful input on how your project might be doing things differently! Hopefully this list will grow to accommodate other project groups' data types.

Tag Type | Source | Description
SENTENCE | MXTERMINATOR | Bounds a span of text that represents a sentence.
POS_* | STANFORD | A part-of-speech (POS) tag. There are many types of POS tags, such as NN, NNS, VB, VBD, etc. Please refer to the Penn Treebank tagset for details.
SYN_* | STANFORD | A syntactic constituent. There are many types of constituent tags, and each may have a "parent" tag which represents some higher-level sentential constituent. Currently, the tagset includes NP, VP, PP, ADJP, ADVP, S, SBAR, SBARQ, SQ, WHADVP, WHNP, WHPP. Please refer to the Penn Treebank constituent label set for details.
SEM_* | ASSERT, KANTOO | A semantic argument or "target" action. Following the Propbank tagset (details in the Propbank distribution README), the primary semantic tag types are ARG0, ARG1, ARG2, ARG3, ARG4, and TARGET. There are also "modifier" argument tags (ARGM-*), which are subcategorized by "function": EXT, DIR, LOC, TMP, REC, PRD, NEG, MOD, ADV, MNR, CAU, PNC, DIS.
ENAMEX_* | IDENTIFINDER | An entity name expression (or proper name), e.g. a person name, location, or organization. This does not include anaphoric mentions of named entities, such as the pronoun "he". The types supported are PERSON, ORGANIZATION, LOCATION.
TIMEX_* | IDENTIFINDER | A time expression. Types supported are DATE, TIME.
NUMEX_* | IDENTIFINDER | A numerical expression. Types supported are MONEY, PERCENT.
ANTECEDENT | COREFEREE | An antecedent phrase. This tag is a marker which signals that some REFERENT tag links back to it.
REFERENT | COREFEREE | A referential phrase. This tag will also have an attribute which links it back to the ANTECEDENT to which the text refers.
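
As a rough illustration of how a REFERENT tag could point back at its ANTECEDENT, here is a small sketch. The tag types and the COREFEREE source come from the list above. Everything else is an assumption: the `tag_id` field, the `"antecedent"` attribute key, the character offsets, and the `resolve` helper are hypothetical names, and the real ADB may represent the link quite differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Tag:
    """Hypothetical minimal tag record (not the real ADB class)."""
    tag_id: int    # assumed: some unique identifier per tag
    type: str
    source: str
    start: int     # assumed: character offsets into the document
    end: int
    attributes: Dict[str, str] = field(default_factory=dict)


def resolve(referent: Tag, tags: List[Tag]) -> Optional[Tag]:
    """Follow the (assumed) "antecedent" attribute back to an ANTECEDENT tag."""
    target = referent.attributes.get("antecedent")
    for tag in tags:
        if tag.type == "ANTECEDENT" and str(tag.tag_id) == target:
            return tag
    return None  # no link recorded, or the linked tag is missing


doc = "Alice left. She was tired."
tags = [
    Tag(1, "ANTECEDENT", "COREFEREE", 0, 5),
    Tag(2, "REFERENT", "COREFEREE", 12, 15, {"antecedent": "1"}),
]
antecedent = resolve(tags[1], tags)
```

In this toy document the REFERENT covers "She" and `resolve` returns the ANTECEDENT tag covering "Alice". Keeping the link on the REFERENT side matches the list above, where the ANTECEDENT tag is only a marker.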

Reading List

This list has been extracted from the project proposal document.

This page last modified on 2006-07-11