Text Segmentation Using Exponential Models
Abstract
This paper introduces a new statistical approach to
automatically partitioning text into coherent segments. Our proposed
model enlists both long-range and short-range language models to help it
sniff out likely sites of topic changes in text. To aid its search, the
model consults a set of simple lexical hints it has learned to associate
with the presence of boundaries through inspection of a large corpus of
annotated data. We also propose a new probabilistically motivated error
metric for use by the natural language processing community, intended to
supersede precision and recall for appraising segmentation algorithms.
Qualitative assessment of our algorithm and evaluation using this
new metric demonstrate the effectiveness of our approach in two very
different domains: Wall Street Journal articles and broadcast
news transcripts.
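To make the idea of a probabilistically motivated segmentation metric concrete, here is a minimal sketch of one plausible instantiation: the error rate is the probability that two positions a fixed distance k apart are classified inconsistently (same segment vs. different segments) by the hypothesis relative to the reference. The abstract does not specify the metric's details, so the function name, the label-sequence input format, and the window-size convention below are illustrative assumptions, not the paper's exact definition.

```python
def segmentation_error(reference, hypothesis, k=None):
    """Sketch of a window-based probabilistic segmentation error.

    `reference` and `hypothesis` are sequences of segment labels,
    one label per text position (e.g. per sentence). The returned
    value estimates the probability that a randomly chosen pair of
    positions k apart is judged inconsistently: the reference puts
    them in the same segment while the hypothesis does not, or
    vice versa. All conventions here are illustrative assumptions.
    """
    n = len(reference)
    if len(hypothesis) != n:
        raise ValueError("segmentations must cover the same positions")
    if k is None:
        # Illustrative convention: half the mean reference segment length.
        num_segments = len(set(reference))
        k = max(1, round(n / num_segments / 2))
    total = n - k
    errors = 0
    for i in range(total):
        same_ref = reference[i] == reference[i + k]
        same_hyp = hypothesis[i] == hypothesis[i + k]
        if same_ref != same_hyp:
            errors += 1
    return errors / total
```

Unlike precision and recall on exact boundary positions, a window-based probability of this kind gives partial credit to near-miss boundaries and degrades gracefully, which is one reason such a metric can be preferable for appraising segmentation algorithms.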