Outline for 11-761: Language and Statistics


Foundations [1 week]

Basic Tools from Probability and Statistics: Laws of probability, Bayes' theorem, maximum likelihood, estimators (variance, bias, consistency, efficiency).
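
For illustration, a minimal Python sketch (not course code) of estimator bias: the maximum-likelihood variance estimator divides by n and is biased low, while dividing by n-1 is unbiased. All numbers are toy choices.

    import random

    random.seed(0)
    true_var, n, trials = 4.0, 5, 20000
    mle_sum = unbiased_sum = 0.0
    for _ in range(trials):
        xs = [random.gauss(0.0, true_var ** 0.5) for _ in range(n)]
        m = sum(xs) / n
        ss = sum((x - m) ** 2 for x in xs)
        mle_sum += ss / n             # MLE divides by n and is biased low
        unbiased_sum += ss / (n - 1)  # dividing by n-1 removes the bias
    print("MLE variance estimate:     ", mle_sum / trials)       # close to 3.2
    print("Unbiased variance estimate:", unbiased_sum / trials)  # close to 4.0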

Basic Concepts from Information Theory: Properties of entropy, Kullback-Leibler divergence, mutual information, the data processing inequality, compression and coding, arithmetic coding, and intuitive interpretations of these quantities.
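
For illustration, a minimal Python sketch of entropy, KL divergence, and mutual information computed directly from their definitions; the distributions are toy values.

    from math import log2

    def entropy(p):
        return -sum(pi * log2(pi) for pi in p if pi > 0)

    def kl(p, q):
        return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    def mutual_information(joint):
        # I(X;Y) = KL( P(x,y) || P(x)P(y) )
        px = [sum(row) for row in joint]
        py = [sum(col) for col in zip(*joint)]
        return sum(joint[i][j] * log2(joint[i][j] / (px[i] * py[j]))
                   for i in range(len(px)) for j in range(len(py))
                   if joint[i][j] > 0)

    print(entropy([0.5, 0.5]))                 # 1.0 bit: a fair coin
    print(kl([0.9, 0.1], [0.5, 0.5]))          # > 0: divergence is nonnegative
    print(mutual_information([[0.4, 0.1],
                              [0.1, 0.4]]))    # > 0: X and Y are dependent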

The Noisy Channel Model [1 week]
The source-channel model. Applications: speech recognition, machine translation, spelling correction, OCR, speech processing with side information, and other problems in language processing.
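
For illustration, a minimal Python sketch of the noisy-channel decision rule applied to spelling correction: choose the source word w maximizing P(w) * P(observed | w). The language-model and channel probabilities below are invented for the example.

    lm = {"the": 0.06, "then": 0.002, "them": 0.003}   # P(w), toy numbers
    channel = {                                        # P("teh" | w), toy numbers
        "the": 0.01,      # a plausible transposition error
        "then": 0.0001,
        "them": 0.0001,
    }
    observed = "teh"
    best = max(lm, key=lambda w: lm[w] * channel[w])
    print(best)  # "the": both the prior and the channel favor it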
Language Modeling and N-grams [1.5 weeks]
Perplexity and alternative measures, data sparseness, conditional modeling, history partitioning, N-grams. Word frequencies, Zipf's law, type-token curves, vocabulary and N-gram growth, the zero-frequency problem, smoothing, discounting, the Good-Turing estimate. The backoff model. A Dirichlet language model. N-gram data structures, the CMU-Cambridge toolkit.
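
For illustration, a minimal Python sketch of the Good-Turing estimate on toy bigram counts: the adjusted count for an N-gram seen r times is r* = (r+1) * N_{r+1} / N_r, where N_r is the number of N-gram types seen exactly r times.

    from collections import Counter

    tokens = "the cat sat on the mat the cat ran".split()
    bigrams = list(zip(tokens, tokens[1:]))
    counts = Counter(bigrams)
    freq_of_freqs = Counter(counts.values())   # N_r

    def good_turing(r):
        # Undefined when N_{r+1} = 0; real implementations smooth N_r first.
        return (r + 1) * freq_of_freqs[r + 1] / freq_of_freqs[r]

    print(good_turing(1))                      # adjusted count for bigrams seen once
    # Probability mass reserved for unseen bigrams: N_1 / N
    print(freq_of_freqs[1] / len(bigrams))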
The EM Algorithm [1 week]
The basic algorithm and example applications. The mathematics underlying the algorithm.
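
For illustration, a minimal Python sketch of EM for a two-coin mixture: each sequence of 10 flips comes from coin A or coin B, but the coin identity is unobserved. The data and initial guesses are toy values.

    heads = [9, 8, 1, 2, 8]   # heads out of 10 flips in each sequence (toy data)
    n = 10
    pA, pB = 0.6, 0.4         # initial guesses for each coin's heads probability
    for _ in range(50):
        hA = tA = hB = tB = 0.0
        for h in heads:
            # E-step: posterior probability that coin A produced this sequence
            # (uniform prior over coins; binomial coefficients cancel).
            la = pA ** h * (1 - pA) ** (n - h)
            lb = pB ** h * (1 - pB) ** (n - h)
            w = la / (la + lb)
            hA += w * h;       tA += w * (n - h)
            hB += (1 - w) * h; tB += (1 - w) * (n - h)
        # M-step: re-estimate each coin from its expected counts.
        pA, pB = hA / (hA + tA), hB / (hB + tB)
    print(round(pA, 3), round(pB, 3))  # converges near 0.83 and 0.15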
Finite State Models [2 weeks]
Markov chains, hidden Markov models and the forward-backward algorithm, the Cave-Neuwirth analysis of English, deleted interpolation, class-based n-grams, tagging.
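
For illustration, a minimal Python sketch of the forward pass of the forward-backward algorithm on a toy two-state HMM; alpha[t][s] is the probability of the first t observations with the chain in state s. All parameters are made up.

    states = ["C", "V"]                      # toy hidden states
    start = {"C": 0.5, "V": 0.5}
    trans = {"C": {"C": 0.3, "V": 0.7}, "V": {"C": 0.8, "V": 0.2}}
    emit = {"C": {"t": 0.6, "a": 0.4}, "V": {"t": 0.1, "a": 0.9}}
    obs = ["t", "a", "t"]

    alpha = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({s: emit[s][o] * sum(prev[r] * trans[r][s] for r in states)
                      for s in states})
    print(sum(alpha[-1].values()))  # total probability of the observation sequence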
Clustering and Decision Trees [1.5 weeks]
Clustering: hierarchical clustering, mutual information techniques, word compounds.
Decision Trees: The CART technique, applications.
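
For illustration, a minimal Python sketch of the CART-style split criterion: choose the binary question about the history that most reduces the entropy of the predicted word. The data and candidate questions are toy examples, not the actual CART implementation.

    from math import log2
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    # (previous word, next word) pairs: predict the next word from the previous.
    data = [("the", "cat"), ("the", "dog"), ("a", "cat"),
            ("on", "mat"), ("on", "hill"), ("the", "cat")]

    def split_gain(question):
        yes = [w for h, w in data if question(h)]
        no = [w for h, w in data if not question(h)]
        n = len(data)
        return entropy([w for _, w in data]) - (
            len(yes) / n * entropy(yes) + len(no) / n * entropy(no))

    questions = {"prev is a determiner": lambda h: h in ("the", "a"),
                 "prev is 'on'": lambda h: h == "on"}
    best = max(questions, key=lambda q: split_gain(questions[q]))
    print(best, round(split_gain(questions[best]), 3))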
Stochastic Grammars [1 week]
The inside-outside algorithm, context-sensitive models, automatic grammar induction, link grammars.
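
For illustration, a minimal Python sketch of the inside half of the inside-outside algorithm for a toy PCFG in Chomsky normal form; inside[i, j, A] is the probability that nonterminal A derives words i through j. The grammar is invented for the example.

    from collections import defaultdict

    # Rules as (lhs, rhs, prob); rhs is a pair of nonterminals.
    binary = [("S", ("NP", "VP"), 1.0), ("NP", ("Det", "N"), 1.0),
              ("VP", ("V", "NP"), 1.0)]
    lexical = {("Det", "the"): 1.0, ("N", "cat"): 0.5, ("N", "mat"): 0.5,
               ("V", "saw"): 1.0}
    words = "the cat saw the mat".split()
    n = len(words)

    inside = defaultdict(float)              # keys: (i, j, nonterminal)
    for i, w in enumerate(words):
        for (a, term), p in lexical.items():
            if term == w:
                inside[i, i, a] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for a, (b, c), p in binary:
                for k in range(i, j):
                    inside[i, j, a] += p * inside[i, k, b] * inside[k + 1, j, c]
    print(inside[0, n - 1, "S"])  # probability of the sentence under the grammar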
Maximum Entropy [1.5 weeks]
Exponential models, triggers, feature induction and iterative scaling, priors and distance models.
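
For illustration, a minimal Python sketch of generalized iterative scaling for a conditional exponential model p(y|x) proportional to exp(sum_i lambda_i f_i(x, y)), on toy data. Strict GIS adds a slack feature so every event has total feature count C; that correction is omitted here for brevity.

    from math import exp, log

    data = [({"f1": 1}, "A"), ({"f1": 1}, "A"), ({"f2": 1}, "B")]
    feats = ["f1", "f2"]
    labels = ["A", "B"]

    def active(x, y, f):
        # Toy feature: fires when the input attribute is present and the label matches.
        return x.get(f, 0) if (f, y) in {("f1", "A"), ("f2", "B")} else 0

    lam = {f: 0.0 for f in feats}
    C = 1  # maximum total feature count per event
    for _ in range(100):
        expected = {f: 0.0 for f in feats}
        for x, _y in data:
            scores = {y: exp(sum(lam[f] * active(x, y, f) for f in feats))
                      for y in labels}
            z = sum(scores.values())
            for f in feats:
                expected[f] += sum(scores[y] / z * active(x, y, f) for y in labels)
        observed = {f: sum(active(x, y, f) for x, y in data) for f in feats}
        for f in feats:
            if expected[f] > 0:
                lam[f] += (1 / C) * log(observed[f] / expected[f])
    print({f: round(v, 2) for f, v in lam.items()})  # weights grow to fit the data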
Language Model Adaptation [0.5 weeks]
Caches, smoothing, Bayesian methods.
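
For illustration, a minimal Python sketch of a cache language model: a static unigram model interpolated with a unigram cache built from the recent document history. The probabilities and interpolation weight are toy values.

    from collections import Counter

    static_p = {"the": 0.05, "court": 0.0004, "ruled": 0.0002}  # toy base model
    history = "the court ruled the court would review the ruling".split()
    cache = Counter(history)

    def cache_p(w, lam=0.2):
        return (1 - lam) * static_p.get(w, 1e-6) + lam * cache[w] / len(history)

    print(cache_p("court"))  # boosted: "court" is frequent in this document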
Special Topics [2 weeks]
Statistical machine translation, text segmentation, tokenization and text conditioning, search algorithms.
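
For illustration, a minimal Python sketch of beam search over word hypotheses, the pruning idea behind most decoders: keep only the k best partial hypotheses at each step. The expansion function and scores are toy stand-ins.

    import heapq

    def beam_search(expand, start, steps, k=3):
        """expand(hyp) yields (extended_hyp, score_increment) pairs."""
        beam = [(0.0, start)]
        for _ in range(steps):
            candidates = [(score + inc, hyp2)
                          for score, hyp in beam
                          for hyp2, inc in expand(hyp)]
            beam = heapq.nlargest(k, candidates)   # prune to the k best
        return max(beam)

    # Toy expansion: append one of three words, each with a fixed log-score.
    words = {"a": -1.0, "b": -0.5, "c": -2.0}
    best = beam_search(lambda h: [(h + [w], s) for w, s in words.items()],
                       [], steps=4, k=2)
    print(best)  # (-2.0, ['b', 'b', 'b', 'b']): the highest-scoring path kept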

