Abstract:
While document classification — the grouping together of texts that have similar content — is presumably simpler than explicitly determining document meaning, it is still plenty hard. In this talk, we will investigate representations and classification techniques for improving existing algorithms.
We first present a projection-based alternative to Latent Semantic Indexing (LSI) for representing document content. Our new algorithm, Iterative Residual Rescaling, is motivated by an analysis employing matrix perturbation theory to reveal a precise relationship between LSI's performance and the uniformity of the document set's underlying topic distribution. The algorithm generalizes LSI and empirically outperforms it at capturing topic-based inter-document similarity.
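For intuition only, here is a minimal Python sketch (not the authors' implementation) contrasting plain LSI with an IRR-style procedure as described above; the rescaling exponent q, the term-by-document matrix layout, and the numpy-based setup are illustrative assumptions.

```python
# Illustrative sketch only, not the authors' implementation.
import numpy as np

def lsi(docs: np.ndarray, k: int) -> np.ndarray:
    """LSI: project the term-by-document matrix onto its top-k left singular vectors."""
    u, _, _ = np.linalg.svd(docs, full_matrices=False)
    return u[:, :k].T @ docs              # k-dimensional document representations

def irr(docs: np.ndarray, k: int, q: float = 2.0) -> np.ndarray:
    """IRR-style procedure: repeatedly rescale residual document vectors by their
    lengths raised to the power q, extract the dominant direction of the rescaled
    residuals, and remove it from the residuals.  Setting q = 0 recovers plain LSI."""
    residual = docs.astype(float).copy()
    basis = []
    for _ in range(k):
        scales = np.linalg.norm(residual, axis=0) ** q   # per-document rescaling
        u, _, _ = np.linalg.svd(residual * scales, full_matrices=False)
        b = u[:, 0]                                      # dominant residual direction
        basis.append(b)
        residual -= np.outer(b, b @ residual)            # deflate the residuals
    return np.stack(basis, axis=1).T @ docs
```

The rescaling step gives documents on under-represented topics more influence on the extracted directions, which is how a non-uniform topic distribution is compensated for in this sketch.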
We then turn to a different problem, that of classifying texts not by topic but by sentiment: for example, one might want to determine whether a movie review is "thumbs up" or "thumbs down". Sentiment analysis has empirically been shown to be resistant to traditional text-categorization approaches, and in general involves more subtlety than one might at first imagine. We demonstrate that techniques that model inter-item relationships by finding minimum cuts in graphs yield state-of-the-art results even when no explicit linguistic information is used.
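Again for intuition only, the following Python sketch (not the authors' code) illustrates the minimum-cut formulation: each item is linked to a "positive" source and a "negative" sink with weights from an individual classifier, pairs of items are linked with association weights, and the minimum cut yields a joint labeling. The particular scores, the example items, and the use of networkx are illustrative assumptions.

```python
# Illustrative sketch only, not the authors' implementation.
import networkx as nx

def mincut_labels(ind_pos, assoc):
    """ind_pos[i]   : individual-classifier score that item i is positive, in [0, 1].
       assoc[(i,j)] : nonnegative association weight between items i and j."""
    g = nx.Graph()
    source, sink = "POS", "NEG"
    for item, p in ind_pos.items():
        g.add_edge(source, item, capacity=p)        # cost of labeling item negative
        g.add_edge(item, sink, capacity=1.0 - p)    # cost of labeling item positive
    for (i, j), w in assoc.items():
        g.add_edge(i, j, capacity=w)                # cost of separating i from j
    _, (pos_side, _) = nx.minimum_cut(g, source, sink)
    return {item: item in pos_side for item in ind_pos}

# Item "c" leans slightly negative on its own (0.4) but is pulled to the
# positive side by its strong association with "a".
labels = mincut_labels(
    ind_pos={"a": 0.9, "b": 0.8, "c": 0.4},
    assoc={("a", "c"): 1.0},
)
```

The appeal of this formulation is that the globally optimal trade-off between individual scores and pairwise associations can be found exactly and efficiently with standard max-flow/min-cut algorithms.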
Joint work with Rie Kubota Ando and with Bo Pang.