Error-correcting output coding for text classification
This paper applies error-correcting output coding (ECOC) to the task of
document categorization. ECOC, of recent vintage in the AI literature, is a
method for decomposing a multiway classification problem into many binary
classification tasks, and then combining the results of the subtasks into a
hypothesized solution to the original problem. The machine learning community
has shown much recent interest in algorithms that integrate "advice" from many
subordinate predictors into a single classifier, and error-correcting output
coding is one such technique. We provide experimental
results on several real-world datasets, extracted from the Internet, which
demonstrate that ECOC can offer significant improvements in accuracy over
conventional classification algorithms.
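As a concrete illustration of the idea (a minimal sketch, not the paper's exact
setup; the base classifier and code matrix below are placeholder choices), ECOC
assigns each class a binary codeword, trains one binary classifier per codeword
bit, and classifies a new document by finding the class whose codeword is
nearest in Hamming distance to the concatenated bit predictions:
```python
# Minimal ECOC sketch (illustrative only; not the paper's exact setup).
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ecoc(X, y, code):
    """code: (n_classes, n_bits) 0/1 matrix; y: class indices into its rows."""
    y = np.asarray(y)
    classifiers = []
    for bit in range(code.shape[1]):
        target = code[y, bit]          # relabel each example by its class's bit
        classifiers.append(LogisticRegression(max_iter=1000).fit(X, target))
    return classifiers

def predict_ecoc(X, code, classifiers):
    bits = np.column_stack([clf.predict(X) for clf in classifiers])
    # Hamming distance from each predicted bit vector to every codeword
    dists = (bits[:, None, :] != code[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)        # index of the nearest codeword / class
```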
- Citation:
- A. Berger (1999). Error-correcting output coding for text classification.
IJCAI'99 Workshop on Machine Learning for Information Filtering.
Stockholm, Sweden.
- Online documents:
- Workshop paper
[pdf]
 
 
Statistical models for text segmentation
This paper introduces a new statistical approach to
automatically partitioning text into coherent segments. The approach is based
on a technique that incrementally builds an exponential model by identifying
features correlated with the presence of boundaries in labeled training text.
The models use two classes of features: topicality features that use
adaptive language models in a novel way to detect broad changes of topic, and
cue-word features that detect occurrences of specific, possibly
domain-specific, words that tend to be used near segment boundaries. Assessment
of our approach on quantitative and qualitative grounds demonstrates its
effectiveness in two very different domains, Wall Street Journal news
articles and television broadcast news story transcripts. Quantitative results
on these domains are presented using a new probabilistically motivated error
metric, which combines precision and recall in a natural and flexible way.
This metric enables a quantitative assessment of the relative contributions of
the different feature types, as well as a comparison with decision trees and
previously proposed text segmentation algorithms.
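The error metric in question is, roughly, the probability that two positions a
fixed distance k apart in the text are placed in the same segment by one
segmentation but in different segments by the other. A hedged sketch of that
computation (the details of the paper's exact formulation may differ; the
window width k and boundary encoding here are assumptions):
```python
# Sketch of a windowed segmentation error of the kind the paper motivates:
# slide a window of width k over the text and count positions where the
# reference and hypothesis disagree about whether the two window endpoints
# fall within the same segment.
def seg_ids(boundaries, n):
    """Map each of n positions to a segment id; boundaries are the positions
    after which a new segment starts."""
    ids, seg, b = [], 0, set(boundaries)
    for i in range(n):
        ids.append(seg)
        if i in b:
            seg += 1
    return ids

def window_error(ref_bounds, hyp_bounds, n, k):
    ref, hyp = seg_ids(ref_bounds, n), seg_ids(hyp_bounds, n)
    disagree = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
        for i in range(n - k)
    )
    return disagree / (n - k)
```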
- Citation:
- D. Beeferman, A. Berger, and J. Lafferty (1999). Statistical models for text
segmentation. Machine Learning 34:177-210. Special Issue on Natural
Language Learning (C. Cardie and R. Mooney, eds.).
- Online documents:
- journal article
[pdf]
- Conference paper (shorter; presented at EMNLP '97)
[ps]
 
 
A maximum entropy approach to natural language processing
The concept of maximum entropy can be traced back along multiple
threads to Biblical times. Only recently, however, have computers become
powerful enough to permit the widescale application of this concept to real
world problems in statistical estimation and pattern recognition. In this
paper we describe a method for statistical modeling based on maximum entropy.
We present a maximum-likelihood approach for automatically constructing maximum
entropy models and describe how to implement this approach efficiently, using
as examples several problems in natural language processing.
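For orientation, the conditional maximum entropy form the paper works with
assigns p(y|x) proportional to the exponential of a weighted sum of feature
functions. A minimal sketch of evaluating such a model (feature functions and
weights are placeholders; training by iterative scaling or gradient ascent is
omitted):
```python
# Sketch of evaluating a conditional maximum entropy model:
# p(y|x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x).
import math

def maxent_prob(x, labels, features, lambdas):
    """features: list of functions f_i(x, y) -> 0/1; lambdas: matching weights."""
    scores = {
        y: math.exp(sum(l * f(x, y) for f, l in zip(features, lambdas)))
        for y in labels
    }
    Z = sum(scores.values())                  # normalizer Z(x)
    return {y: s / Z for y, s in scores.items()}
```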
- Citation:
- A. Berger, S. Della Pietra and V. Della Pietra (1996). A maximum entropy
approach to natural language processing. Computational Linguistics 22(1).
Also available as IBM Research Report RC-19694.
- Online documents:
- journal article
[ps]
- IBM Tech Report
[ps]
 
The Candide system for machine translation
We present an overview of Candide, a system for automatic
translation of French text to English text. Candide uses
methods of information theory and statistics to develop a
probability model of the translation process. This model,
which is made to accord as closely as possible with a large
body of French and English sentence pairs, is then used to
generate English translations of previously unseen French
sentences. This paper provides a tutorial in these methods,
discussions of the training and operation of the system, and
a summary of test results.
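At the core of this style of system is the source-channel decomposition of
statistical translation: given a French sentence f, the decoder seeks the
English sentence e maximizing the posterior, which Bayes' rule factors into a
language model and a translation model:
```latex
\hat{e} \;=\; \arg\max_{e} \Pr(e \mid f)
        \;=\; \arg\max_{e} \Pr(e)\,\Pr(f \mid e)
```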
- Citation:
- A. Berger, P. Brown, S. Della Pietra, V. Della Pietra, J. Gillett,
J. Lafferty, H. Printz, L. Ures (1994). The Candide system for machine
translation. ARPA Workshop on Speech and Natural Language. Morgan
Kaufmann Publishers, 157-163.
- Online documents:
- Workshop paper
[ps]
 
Just in Time Language Modelling
Traditional approaches to language modelling have relied on a fixed
corpus of text to inform the parameters of a probability distribution
over word sequences. Increasing the corpus size often leads to
better-performing language models, but no matter how large, the corpus
is a static entity, unable to reflect information about events which
postdate it. In these pages we introduce an online paradigm which
interleaves the estimation and application of a language model. We
present a Bayesian approach to online language modelling, in which the
marginal probabilities of a static trigram model are dynamically
updated to match the topic being dictated to the system. We also
describe the architecture of a prototype we have implemented which
uses the World Wide Web (WWW) as a source of information, and provide
results from some initial proof of concept experiments.
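One simple way to realize such dynamic updating (a sketch only, not necessarily
the paper's exact scheme; the retrieval step and interpolation weight are
placeholders) is to interpolate the static trigram estimate with a crude
topical estimate from text retrieved on the fly:
```python
# Hedged sketch of interleaving estimation and application: the static
# trigram probability is interpolated with a unigram estimate computed
# from documents retrieved on the fly (e.g., from the Web) for the
# current topic.
from collections import Counter

def dynamic_prob(word, history, static_trigram_prob, retrieved_text, lam=0.9):
    counts = Counter(retrieved_text.split())
    total = sum(counts.values()) or 1
    topical = counts[word] / total            # crude topical unigram estimate
    return lam * static_trigram_prob(word, history) + (1 - lam) * topical
```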
- Citation:
- A. Berger, R. Miller (1998). Just in Time Language Modelling.
IEEE Conference on Acoustics, Speech and Signal Processing (ICASSP'98).
Seattle, WA.
- Online documents:
- Conference paper
[pdf]
- Slides of ICASSP'98 talk
[ps]
 
Probabilistic decoding of low-density Cayley codes
We report on some investigations into the behavior of a class of
low-density codes constructed using algebraic techniques. Recent work
shows expansion to be an essential property of the graphs underlying the
low-density parity-check codes first introduced by Gallager. In
addition, it has recently been shown that certain spectral techniques
similar to those based on Fourier analysis for classical cyclic codes
can be applied to codes constructed from Cayley graphs. This motivates
us to compare the behavior of algebraically constructed expanders and
randomly generated bipartite graphs using a probabilistic decoding
algorithm. Preliminary results indicate that the performance of the
explicit, algebraic expanders is comparable to that of random graphs in
the case where each variable is associated with only two parity checks,
while such codes are inferior to randomly generated codes with three or
more constraints for each variable.
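For orientation, a minimal hard-decision relative of the decoders studied here
is Gallager-style bit flipping: repeatedly flip the bits that violate the most
parity checks. The sketch below assumes a 0/1 parity-check matrix H; the
probabilistic decoder examined in the paper passes soft messages on the same
bipartite graph rather than making hard flips:
```python
# Gallager-style bit-flipping over a parity-check matrix H: a hard-decision
# stand-in for the probabilistic decoder, shown only to fix ideas.
import numpy as np

def bit_flip_decode(H, word, max_iters=50):
    """H: (m, n) 0/1 int array; word: length-n 0/1 int array."""
    word = word.copy()
    for _ in range(max_iters):
        syndrome = H.dot(word) % 2            # 1 marks an unsatisfied check
        if not syndrome.any():
            return word                       # valid codeword found
        violations = H.T.dot(syndrome)        # unsatisfied checks per bit
        word[violations == violations.max()] ^= 1   # flip most-suspect bits
    return word                               # give up after max_iters
```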
- Citation:
- A. Berger and J. Lafferty (1997). Probabilistic decoding of low-density
Cayley codes. Proceedings of the Canadian Workshop on Information Theory.
Toronto, Canada.
- Online documents:
- Workshop paper
[pdf]
 
A Model of Lexical Attraction and Repulsion
This paper introduces new techniques based on exponential
families for modeling the correlations between words in
text and speech.
The motivation for this work is to build improved
statistical language models by treating a static trigram model as a
default distribution, and adding sufficient statistics, or "features,"
to a family of conditional exponential distributions in order to model
the nonstationary characteristics of language. We focus on features
based on pairs of mutually informative words which allow the trigram
model to adapt to recent context. While previous work assumed the
effects of these word pairs to be constant over a window of several
hundred words, we show that their influence is nonstationary on a much
smaller time scale. In particular, empirical samples drawn from both
written text and conversational speech reveal that the "attraction"
between words decays exponentially, while stylistic and syntactic
constraints create a "lexical exclusion" effect that discourages close
co-occurrence. We show that these characteristics are well described by
mixture models based on two-stage exponential distributions. These
models are a common tool in queueing theory, but they have not
previously found use in speech and language processing. We show how the
EM algorithm can be used to estimate the parameters of these models,
which can then be incorporated as penalizing features in the posterior
distribution for predicting the next word. Experimental results
illustrate the benefit these techniques yield when incorporated into
a long-range language model.
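For reference, a two-stage exponential density over the distance d between word
occurrences (the sum of two independent exponential stages, shown here for the
equal-rate case; the paper's exact parameterization may differ) and a mixture
of such components take the form
```latex
p(d) \;=\; \sum_{j} \pi_j\, \lambda_j^{2}\, d\, e^{-\lambda_j d},
\qquad \sum_{j} \pi_j = 1 .
```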
- Citation:
- D. Beeferman, A. Berger, J. Lafferty (1997). A model of lexical attraction
and repulsion. ACL-EACL'97 Joint Conference. Madrid, Spain.
- Online documents:
- Conference paper
[ps]
 
Language translation apparatus and method using context-based translation
models
An apparatus for translating a series of source words in a first
language to a series of target words in a second language. For an input series
of source words, at least two target hypotheses, each comprising a series of
target words, are generated. Each target word has a context comprising at least
one other word in the target hypothesis. For each target hypothesis, a
language model match score is computed, comprising an estimate of the
probability of occurrence of the series of words in the target hypothesis.
At least one alignment connecting each source word with at least one target
word in the target hypothesis is identified. For each source word and each
target hypothesis, a word match score is computed, comprising an estimate of
the conditional probability of occurrence of the source word, given the target
word in the target hypothesis which is connected to the source word and given
the context in the target hypothesis of that target word. For each target
hypothesis, a translation match score is computed, comprising a combination of
the word match scores for the target hypothesis and the source words in the
input series. A target hypothesis match score is then computed, comprising a
combination of the language model match score and the translation match score
for the target hypothesis. The target hypothesis having the best match score
is output.
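In rough pseudocode terms, the claimed score combination might look like the
sketch below. The log-domain combination is an assumption, and
language_model_score, word_match_prob, and the alignment representation are
placeholders standing in for components the patent specifies only abstractly:
```python
# Hedged sketch of the claimed score combination, in the log domain.
import math

def hypothesis_match_score(source_words, hypothesis, alignment,
                           language_model_score, word_match_prob):
    """alignment: for each source word, the index of its aligned target word."""
    translation_score = sum(
        math.log(word_match_prob(s, hypothesis[t], context=hypothesis))
        for s, t in zip(source_words, alignment)
    )
    return language_model_score(hypothesis) + translation_score

# The hypothesis with the best (highest) match score is output.
```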
- Citation:
- A. Berger, P. Brown, S. Della Pietra, V. Della Pietra, A. Kehler,
R. Mercer (1996). Language translation apparatus and method using context-based
translation models. U.S. patent #5,510,981.
 
Recognition performance of a large-scale dependency-grammar language model
In this paper we report on a large-scale investigation of dependency-grammar
language models. Our work includes several significant departures from earlier
studies, notably a much larger training corpus, improved model structure,
different feature types, new feature selection methods, and more coherent
training and test data. We report the effect of using this model, in a
rescoring paradigm, on the word error rate (WER) of the IBM speech recognition
system.
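In a rescoring paradigm of the usual kind (a generic sketch; the weight and
score sources below are placeholders, not the paper's exact configuration),
the recognizer's n-best hypotheses are re-ranked by combining the original
recognizer score with the new model's score:
```python
# Generic n-best rescoring sketch: re-rank recognizer hypotheses by
# interpolating the original recognizer score with the new language
# model's score.
def rescore(nbest, new_lm_score, alpha=0.5):
    """nbest: list of (hypothesis, recognizer_score) pairs."""
    return max(
        nbest,
        key=lambda h: alpha * h[1] + (1 - alpha) * new_lm_score(h[0]),
    )[0]
```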
- Citation:
- A. Berger and H. Printz (1998). Recognition performance of a large-scale
dependency-grammar language model. Int'l Conference on Spoken Language
Processing (ICSLP'98), Sydney, Australia.
- Online documents:
- Conference paper
[ps]
 
A Comparison of Criteria for Maximum Entropy/Minimum Divergence
Feature Selection
In this paper we study the gain, a naturally-arising statistic from
the theory of MEMD modeling, as a figure of merit for selecting features for an
MEMD language model. We compare the gain with two popular
alternatives---empirical activation and mutual information---and argue that the
gain is the preferred statistic, on the grounds that it directly measures a
feature's contribution to improving upon the base model.
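In the sense used in the MEMD literature, the gain of a candidate feature f is
the best improvement in training log-likelihood obtainable by adding f to the
current model q with a single new weight alpha:
```latex
G(f) \;=\; \max_{\alpha}\,\bigl[\, L(q_{\alpha f}) \,-\, L(q) \,\bigr]
```
where q_{alpha f} denotes the model q extended with feature f at weight alpha,
and L is the log-likelihood of the training data.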
- Citation:
- A. Berger, H. Printz (1998). A comparison of criteria for maximum
entropy/minimum divergence feature selection. Third Conference on Empirical
Methods in Natural Language Processing, 96-107. Granada, Spain.
- Online documents:
- Conference paper
[compressed ps]
 
Cyberpunc: A lightweight punctuation annotation system for speech
This paper describes a lightweight method for the automatic insertion
of intra-sentence punctuation into text. Despite the intuition that
pauses in an acoustic stream are a positive indicator for some types
of punctuation, this work will demonstrate the feasibility of a system
which relies solely on lexical information. Besides its potential role
in a speech recognition system, such a system could serve equally well
in non-speech applications such as automatic grammar correction in a
word processor and parsing of spoken text. After describing the design
of a punctuation-restoration system, which relies on a trigram
language model and a straightforward application of the Viterbi
algorithm, we summarize results, both quantitative and subjective, of
the performance and behavior of a prototype system.
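To give a flavor of the approach (a sketch only: a bigram model stands in for
the paper's trigram to keep the state space small, and the punctuation
inventory is a placeholder), restoration can be cast as a Viterbi search over
the choice of punctuation token, or none, after each word:
```python
# Sketch of punctuation restoration as a Viterbi search under a bigram
# language model: after each word, choose a punctuation token (or none)
# so the interleaved token sequence maximizes language model probability.
PUNC = ["", ",", "."]                          # "" means: insert nothing

def restore_punctuation(words, bigram_logprob):
    # Each state is the most recently emitted token; its value is the best
    # (log-probability, output-so-far) pair reaching that state.
    states = {"<s>": (0.0, [])}
    for w in words:
        nxt = {}
        for prev, (lp, out) in states.items():
            base = lp + bigram_logprob(prev, w)
            for p in PUNC:
                last = p or w                  # token the next word follows
                score = base + (bigram_logprob(w, p) if p else 0.0)
                if last not in nxt or score > nxt[last][0]:
                    nxt[last] = (score, out + [w] + ([p] if p else []))
        states = nxt
    return max(states.values(), key=lambda v: v[0])[1]
```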
- Citation:
- D. Beeferman, A. Berger, J. Lafferty (1998). Cyberpunc: A lightweight
punctuation annotation system for speech. IEEE Conference on Acoustics,
Speech and Signal Processing. Seattle, WA.
- Online documents:
- Conference paper
[ps]
 
Experiments in spoken document retrieval at CMU
We describe our submission to the TREC-7 Spoken Document Retrieval (SDR)
track, along with the speech recognition and information retrieval engines
behind it. We present SDR evaluation results and a brief analysis. A few
developments are also described in greater detail, including:
- a new, probabilistic retrieval engine based on language models;
- a new, TFIDF-based weighting function that incorporates word error
probability (one plausible form is sketched below);
- the use of a simple confidence estimate for word probability based on
speech recognition lattices.
Although improvements over a development test set were promising, the new
techniques failed to yield significant gains on the evaluation test set.
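As a rough illustration of the confidence-weighted TFIDF idea above (one
plausible form, not the paper's exact weighting function; doc_error_probs and
doc_freq are placeholder inputs), each occurrence of a term can contribute
one minus its error probability rather than a full count:
```python
# Hedged sketch of a confidence-weighted TFIDF score: recognized words
# contribute fractional counts discounted by their error probability.
import math

def weighted_tfidf(term, doc_words, doc_error_probs, n_docs, doc_freq):
    tf = sum(
        1.0 - perr
        for w, perr in zip(doc_words, doc_error_probs)
        if w == term
    )
    idf = math.log(n_docs / (1 + doc_freq(term)))
    return tf * idf
```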
- Citation:
- M. Siegler, A. Berger, M. Witbrock, A. Hauptmann (1998).
Proceedings of TREC-7, The Seventh Text Retrieval Conference.
- Online documents:
- Conference paper [pdf]