Abstract
Most statistical approaches to modeling text implicitly assume that
informative words are rare. This assumption is usually appropriate for
topical retrieval and classification tasks; however, in non-topical
classification and soft-clustering problems where classes and latent
variables relate to sentiment or author, informative words can be
frequent. In this paper we present a comprehensive set of statistical
learning tools that handle frequently occurring words in a sensible
manner. We introduce probabilistic models of contagion
for classification and soft-clustering based on the Poisson and
Negative-Binomial distributions, which share with the Multinomial the
desirable properties of simplicity and analytic tractability. We then
introduce the Delta-Square statistic to select features and avoid
over-fitting.
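As a rough illustration of count-based classification (a minimal sketch of a plain Poisson naive Bayes classifier with hypothetical toy rates, not the contagion models introduced in the paper):

```python
import math

def poisson_log_score(counts, rates):
    """Log-likelihood of word counts under per-word Poisson rates,
    dropping the count-factorial term, which is constant across classes."""
    return sum(x * math.log(lam) - lam for x, lam in zip(counts, rates))

def classify(counts, class_rates):
    """Return the class whose Poisson rates best explain the counts."""
    return max(class_rates, key=lambda c: poisson_log_score(counts, class_rates[c]))

# Toy example: two classes over a 3-word vocabulary.
class_rates = {
    "pos": [5.0, 1.0, 0.5],   # word 0 is frequent under "pos"
    "neg": [0.5, 1.0, 5.0],   # word 2 is frequent under "neg"
}
doc = [6, 1, 0]               # word counts for a new document
label = classify(doc, class_rates)
```

Unlike the Multinomial, the Poisson scores each word's absolute count against its own rate, so a frequent word carries signal rather than being drowned out by normalization.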
As an example, we demonstrate the Dirichlet-Poisson model for
classification and soft-clustering. On a technical level, this model
leverages: (a) the "reference length" parameter, in order to implicitly
normalize word-counts in a probabilistic fashion, and ultimately
correct parameter estimates for the different word-length of documents,
and (b) the "sum/ratio" parameterization, in order to promote the
tractability of variational inference, the interpretability of
parameters and priors, and geometrical intuitions.
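One plausible reading of the reference-length idea (an assumed interpretation for illustration, not the authors' exact formulation) is that rates fit at a fixed reference length are rescaled by the document's actual length before scoring:

```python
import math

def adjusted_log_score(counts, rates, reference_length):
    """Hypothetical sketch: score a document against Poisson rates fit
    at a reference length L, scaled by n / L so that expected counts
    grow in proportion to document length n."""
    n = sum(counts)
    scale = n / reference_length
    return sum(x * math.log(lam * scale) - lam * scale
               for x, lam in zip(counts, rates))
```

Under this scaling, two documents with the same word proportions but different lengths receive comparable parameter estimates, which is the length correction the abstract alludes to.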
This is joint work with William Cohen and Stephen Fienberg.
Pradeep Ravikumar Last modified: Sat Nov 5 09:08:53 EST 2005