Dimension Reduction Techniques for Qualitative and Count Data

Thomas Hofmann
Brown University

Many applications of machine learning in domains such as information retrieval, natural language processing, molecular biology, neuroscience, and economics must deal with various sorts of discrete data that are typically of very high dimensionality.

One standard approach to dealing with high dimensional data is to perform a dimension reduction and map the data to some lower dimensional representation. Reducing the data dimensionality is often a valuable analysis by itself, but it may also serve as a pre-processing step to improve or accelerate subsequent stages such as classification or regression. Two closely related methods that are often used in this context, and that can be found in virtually every textbook on unsupervised learning, are principal component analysis (PCA) and factor analysis. However, PCA relies on a least-squares approximation principle, and factor analysis is based on assumptions about the normality of random variables. In contrast, methods for discrete data such as qualitative or count data (as well as for continuous but non-Gaussian data) have been largely ignored.
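As a minimal illustration of the dimension reduction described above, the following sketch computes a PCA projection of a small term-by-document count matrix via the singular value decomposition. The data and the choice of two components are purely hypothetical; the point is only the least-squares character of the projection.

```python
import numpy as np

# Hypothetical term-by-document count matrix: 6 documents, 4 terms.
X = np.array([
    [3, 0, 1, 0],
    [2, 0, 0, 1],
    [0, 4, 0, 2],
    [0, 3, 1, 2],
    [1, 0, 3, 0],
    [0, 1, 0, 4],
], dtype=float)

# Center each column, then take the top-k right singular vectors:
# these principal axes minimize the squared reconstruction error.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Z = Xc @ Vt[:k].T  # 2-dimensional representation of each document
```

Note that nothing here exploits the fact that the entries are counts; the least-squares criterion treats them as arbitrary real values, which is exactly the mismatch the abstract points to.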

Various techniques have been proposed in the statistical literature over the last 60 years, canonical analysis, correspondence analysis, association analysis, and latent class analysis being the most important ones. The history of these methods up to the present day has been a story of oblivion and re-discovery. It is thus important to find a systematic framework that makes it possible to understand the relationships between these methods, both conceptually and in terms of their computational aspects. We provide such a unifying view by clarifying the geometrical foundations of dimension reduction methods for qualitative data. We also address the question of how today's machine learning problems differ from traditional statistical problems and what consequences this has for the applicability of dimension reduction techniques. Experimental results from information retrieval, collaborative filtering, and linguistics are used to illustrate this point.
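Of the methods listed above, correspondence analysis has perhaps the simplest computational core, and a sketch of it makes the contrast with PCA concrete: the SVD is applied not to centered raw data but to the standardized residuals of a contingency table, so that chi-square rather than Euclidean distances are preserved. The table below is hypothetical toy data.

```python
import numpy as np

# Hypothetical two-way contingency table (e.g. word-by-topic counts).
N = np.array([[20,  5,  2],
              [ 3, 15,  4],
              [ 1,  4, 18]], dtype=float)

P = N / N.sum()        # correspondence matrix (joint proportions)
r = P.sum(axis=1)      # row masses
c = P.sum(axis=0)      # column masses

# Standardized residuals: departure from row-column independence,
# scaled so that squared distances are chi-square distances.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Principal row coordinates in the first two dimensions.
F = (U[:, :2] * sv[:2]) / np.sqrt(r)[:, None]
```

Subtracting the independence model `np.outer(r, c)` removes the trivial dimension, which is why the last singular value vanishes; the remaining singular values decompose the table's total inertia (the chi-square statistic divided by the grand total).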


Last modified: Fri Nov 19 22:41:42 EST 1999