Abstract
Entities involved in typical information extraction tasks often
consist of multiple words or tokens. Most state-of-the-art NER
systems perform sequential labeling, in which each token in the input
sequence is assigned a label. We argue that NER tasks should classify {\em
segments} of multiple adjacent words instead of single words. We
formalize this as a semi-Markov process, which relaxes the usual Markov
assumptions of word-based labeling tasks. This formalism allows the
direct use of entity-level rather than word-level features, and
provides a more natural formulation of the NER problem than sequential
word classification. In particular, this allows a natural way of
incorporating noisy external dictionaries of multi-word entities
through high-performance string similarity measures from the record
linkage literature. I will present how Conditional Random Fields
(CRFs), a popular and high-performance IE model, can be extended to
perform such semi-Markov sequential labeling. Experiments in multiple domains
show that the new model can substantially improve extraction
performance, relative to previously published methods for using
external dictionaries in NER.
(This is joint work with William Cohen)
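
To make the segment-level formulation above concrete, here is a rough sketch of such a model (the notation is illustrative and not taken from the abstract). A semi-Markov CRF scores an entire segmentation $s = \langle s_1, \ldots, s_p \rangle$ of the input $\mathbf{x}$, where each segment $s_j = (t_j, u_j, y_j)$ has a start position, an end position, and a single label, and each feature may look at a whole candidate segment together with the previous label:
\[
  P(s \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}
  \exp\!\big(\mathbf{W} \cdot \mathbf{G}(\mathbf{x}, s)\big),
  \qquad
  \mathbf{G}(\mathbf{x}, s) \;=\; \sum_{j=1}^{p} \mathbf{g}(y_j, y_{j-1}, t_j, u_j, \mathbf{x}).
\]
Because $\mathbf{g}$ sees the full segment, one plausible (again, illustrative) dictionary feature is the similarity of the segment text to its nearest entry in a noisy external dictionary $D$,
\[
  g_{\mathrm{dict}}(s_j, \mathbf{x}) \;=\; \max_{e \in D} \mathrm{sim}(\mathbf{x}_{t_j \ldots u_j},\, e),
\]
where $\mathrm{sim}$ is a record-linkage string similarity such as Jaro-Winkler or soft TF-IDF.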