Abstract:
While document classification — the grouping together of texts that have similar content — is presumably simpler than explicitly determining document meaning, it is still plenty hard. In this talk, we will investigate representations and classification techniques for improving existing algorithms.
We first present a projection-based alternative to Latent Semantic Indexing (LSI) for representing document content. Our new algorithm, Iterative Residual Rescaling, is motivated by an analysis employing matrix perturbation theory to reveal a precise relationship between LSI's performance and the uniformity of the document set's underlying topic distribution. The algorithm generalizes LSI and empirically outperforms it at capturing topic-based inter-document similarity.
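For intuition only, here is a minimal Python sketch (not the authors' implementation) contrasting plain LSI with an IRR-style procedure as described above; the rescaling exponent q, the term-by-document matrix layout, and the numpy-based setup are illustrative assumptions.

```python
# Illustrative sketch only, not the authors' implementation.
import numpy as np

def lsi(docs: np.ndarray, k: int) -> np.ndarray:
    """LSI: project the term-by-document matrix onto its top-k left singular vectors."""
    u, _, _ = np.linalg.svd(docs, full_matrices=False)
    return u[:, :k].T @ docs              # k-dimensional document representations

def irr(docs: np.ndarray, k: int, q: float = 2.0) -> np.ndarray:
    """IRR-style procedure: repeatedly rescale residual document vectors by their
    lengths raised to the power q, extract the dominant direction of the rescaled
    residuals, and remove it from the residuals.  Setting q = 0 recovers plain LSI."""
    residual = docs.astype(float).copy()
    basis = []
    for _ in range(k):
        scales = np.linalg.norm(residual, axis=0) ** q   # per-document rescaling
        u, _, _ = np.linalg.svd(residual * scales, full_matrices=False)
        b = u[:, 0]                                      # dominant residual direction
        basis.append(b)
        residual -= np.outer(b, b @ residual)            # deflate the residuals
    return np.stack(basis, axis=1).T @ docs
```

The rescaling step gives documents on under-represented topics more influence on the extracted directions, which is how a non-uniform topic distribution is compensated for in this sketch.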
We then turn to a different problem, that of classifying texts not by topic but by sentiment: for example, one might want to determine whether a movie review is "thumbs up" or "thumbs down". Sentiment analysis has empirically been shown to be resistant to traditional text-categorization approaches, and in general involves more subtlety than one might at first imagine. We demonstrate that techniques that model inter-item relationships by finding minimum cuts in graphs yield state-of-the-art results even when no explicit linguistic information is used.
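Again for intuition only, the following Python sketch (not the authors' code) illustrates the minimum-cut formulation: each item is linked to a "positive" source and a "negative" sink with weights from an individual classifier, pairs of items are linked with association weights, and the minimum cut yields a joint labeling. The particular scores, the example items, and the use of networkx are illustrative assumptions.

```python
# Illustrative sketch only, not the authors' implementation.
import networkx as nx

def mincut_labels(ind_pos, assoc):
    """ind_pos[i]   : individual-classifier score that item i is positive, in [0, 1].
       assoc[(i,j)] : nonnegative association weight between items i and j."""
    g = nx.Graph()
    source, sink = "POS", "NEG"
    for item, p in ind_pos.items():
        g.add_edge(source, item, capacity=p)        # cost of labeling item negative
        g.add_edge(item, sink, capacity=1.0 - p)    # cost of labeling item positive
    for (i, j), w in assoc.items():
        g.add_edge(i, j, capacity=w)                # cost of separating i from j
    _, (pos_side, _) = nx.minimum_cut(g, source, sink)
    return {item: item in pos_side for item in ind_pos}

# Item "c" leans slightly negative on its own (0.4) but is pulled to the
# positive side by its strong association with "a".
labels = mincut_labels(
    ind_pos={"a": 0.9, "b": 0.8, "c": 0.4},
    assoc={("a", "c"): 1.0},
)
```

The appeal of this formulation is that the globally optimal trade-off between individual scores and pairwise associations can be found exactly and efficiently with standard max-flow/min-cut algorithms.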
Joint work with Rie Kubota Ando and with Bo Pang.