Thesis Research - Abstract
In this era, where the volume of electronic text is growing
exponentially and where time is a critical resource, it has become virtually
impossible for any user to browse or read large numbers of individual
documents. It is therefore important to explore methods of allowing
users to locate and browse information quickly within collections of
documents. Automatic text summarization of multiple documents
fulfills such information-seeking goals by providing a method for the
user to quickly view highlights and/or relevant portions of document
collections. To date, there has been little work on
multi-document summarization, although single-document summarization
has been a subject of focus in the last few years.
Multi-document summarization differs from single-document summarization in that the issues of
compression, speed, redundancy and passage selection are critical in
the formation of useful summaries. If multi-document summarization is
to be useful across subject areas and languages, it must be relatively
independent of natural language understanding. A statistical approach
satisfies this requirement and allows for rapid passage selection. The maximal
marginal relevance (MMR) metric is used to provide "relevant"
novelty in passage selection, i.e., selecting passages that meet the
criteria of relevance to a query, while reducing redundancy and
maximizing diversity among the individual passages.
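
To make the selection criterion concrete, the following is a minimal sketch of greedy passage selection under one common formulation of MMR: each candidate is scored as lam * relevance-to-query minus (1 - lam) * its maximum similarity to any passage already selected. The function names (cosine, mmr_select), the bag-of-words cosine similarity, and the weight lam are assumptions made for illustration only; they are not taken from the thesis system.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query, passages, k, lam=0.5):
    """Greedily choose up to k passage indices by an MMR-style score.

    lam (hypothetical parameter name) trades off relevance to the query
    against redundancy with passages that were already chosen.
    """
    q_vec = Counter(query.lower().split())
    vecs = [Counter(p.lower().split()) for p in passages]
    selected = []
    candidates = list(range(len(passages)))

    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(vecs[i], q_vec)
            # Redundancy is the highest similarity to any selected passage;
            # it is 0.0 while nothing has been selected yet.
            redundancy = max(
                (cosine(vecs[i], vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    return selected


passages = [
    "storm hits coast",
    "storm hits coast",
    "coast residents rebuild homes",
]
diverse = mmr_select("storm coast", passages, k=2, lam=0.5)
relevance_only = mmr_select("storm coast", passages, k=2, lam=1.0)
```

In this toy example the first two passages are identical. With lam=0.5 the duplicate is penalized and the third passage is chosen second, whereas with lam=1.0 the score reduces to pure relevance ranking and the duplicate is kept.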
The approach builds on previous work in single-document summarization
by using additional available information about the document set as a
whole, the relationships between the documents, and the properties
of individual documents. The underlying framework is modular,
allowing easy parameterization to account for different document
genres or corpus characteristics, user requirements, and
linguistic properties of languages that can enhance summarization
results.
The principal question being addressed is "Can multi-document
summarization effectively indicate the textual content of document
collections and assist users to rapidly find their desired
information?" I will explore this question by evaluating the system
in the domains of newswire articles, web pages, and, time permitting,
computer science technical reports.
Committee:
Jaime Carbonell (Chair)
Jamie Callan
Vibhu Mittal
Jan Pedersen