11-661/11-761 Language and Statistics

11-661/761 Language and Statistics

Fall 2018

Internet search, speech recognition, machine translation, question answering, information retrieval, biological sequence analysis -- are all at the forefront of this century’s information revolution. In addition to their use of machine learning, these technologies rely heavily on classic statistical estimation techniques. Yet most CS and engineering undergraduate programs do not prepare students in this area beyond an introductory probability & statistics course. This course is designed to address this gap.

The goal of "Language and Statistics" is to ground the data-driven techniques used in language technologies in sound statistical methodology. We start by formulating various language technology problems in both an information theoretic framework (the source-channel paradigm) and a Bayesian framework (the Bayes classifier). We then discuss the statistical properties of words, sentences, documents and whole languages, and the various computational formalisms used to represent language. These discussions naturally lead to specific concepts in statistical estimation.

Topics include: Zipf's distribution and type-token curves; point estimators, Maximum Likelihood estimation, bias and variance, sparseness, smoothing and clustering; interpolation, shrinkage, and backoff; entropy, cross entropy and mutual information; decision tree models applied to language; latent variable models and the EM algorithm; hidden Markov models; exponential models and the maximum entropy principle; semantic modeling and dimensionality reduction; probabilistic context-free grammars and syntactic language models.

“Language and Statistics” is designed for LTI and other SCS graduate students, but others are also welcome. CS undergraduate upperclassmen who have taken it have done well, although they found it challenging.

The Master-level version of this course (11-661) does not require the course project.

Instructor: Bhiksha Raj

bhiksha@cs.cmu.edu

TAs:

Pramati Kalwad (pkalwad@andrew.cmu.edu)
Ashwin Naresh Kumar (anareshk@andrew.cmu.edu)
Zhilin Yang (zhiliny@andrew.cmu.edu)

Lecture: Monday and Wednesday, 12:00 noon - 1.20pm

Location: Gates-Hillman Complex GHC 4307

Recitation: TBD

Office hours:

Bhiksha: TBD
Ashwin: Tuesday, 3pm-4pm GHC 5417
Pramati: Wednesday, 2pm-3pm GHC 5417
Zhilin: Thursday, 11am-12pm GHC 5505

Prerequisites

Strong quantitative aptitude. Familiarity and comfort with basic undergraduate-level probability. Some programming skill.

Units

This course is worth 12 units.

Course Work

	Maximum
Exam	1 exam, total contribution to grade 35%
Assignments	7 assignments, total contribution to grade 35%
Final Project	1 project, total contribution to grade 30% (only for those taking 11-761; otherwise, renormalize)

Top “Endorsed Answer” Answerers on Piazza may get bonus points.
Class participation can also improve your grade.

Late Policy

Extensions: If you have an unavoidable conflict that prevents you from completing an assignment on time (such as travel to a conference or a medical emergency), please send an email to the TA in charge of that assignment, as soon as you become aware of the problem, briefly stating the circumstances and how much more time you need. The TA-in-charge is authorized to grant an extension as long as you requested it promptly. Do not send extension requests to the Instructor, nor to the other TAs.

Assignments turned late without prior approval will incur a 25% penalty per 24-hour period or any part thereof.

Books

The course will not follow a specific book, but will draw from a number of sources. We list relevant books at the end of this page. We will also put up links to relevant reading material for each class. Students are expected to familiarize themselves with the material before the class. The readings will sometimes be arcane and difficult to understand; if so, do not worry, we will present simpler explanations in class.

Discussion board: Piazza

We will use Piazza for discussions. Here is the link. Please sign up.

Academic Integrity

You are expected to comply with the University Policy on Academic Integrity and Plagiarism.

You are allowed to talk with / work with other students on homework assignments
You can share ideas but not code, you should submit your own code

Your course instructor reserves the right to determine an appropriate penalty based on the violation of academic dishonesty that occurs. Violations of the university policy can result in severe penalties including failing this course and possible expulsion from Carnegie Mellon University. If you have any questions about this policy and any work you are doing in the course, please feel free to contact your instructor for help.

Schedule of Classes (This schedule is only tentative and subject to change)

Lectures	Dates	Topics	Required Reading (due BEFORE class)	Lecture notes/Slides	Additional readings, if any	Assignments
1,2	August 29, September 5	Course Overview Statistical Language Modeling Computational Linguistics Statistical Decision Making Source-Channel Paradigm Heterogeneity of language Word types and tokens, type token curves Zipf's law, Mandelbrot's distribution	[MS] 1.1 - 1.3, 2.1	Intro, SLM+CL Heterogeneity, Zipf's Law	Historical overview by Lafferty [BCW] ch. 4, Zipf's Law	HW 1 out on 09/08
3,4,5	September 10, September 12, September 17	Probability Review Parameter Estimation Desiderata for Estimators	[MS] 1.4			HW1 due on 09/14
6,7	September 19, September 24, September 26	Unigrams: Statistical Estimation Maximum Likelihood Estimates	[MS] 6.2.1	Maximum Likelihood Estimation	[mD] ch. 6, esp. 6.5	HW2 out on 09/23. Latex source available here
	October 1	Class Cancelled
8	October 3	Sparseness Smoothing	[MS] 6.2.2	Slides included in the Maximum Likelihood Estimation deck	Good-Turing	HW2 due on 09/30
9	Oct 5 @ NSH 3002 (Make up class)	Probability Simplex N-grams: Linear Interpolation Pseudocounts MAP Conjugate Priors	[MS] 6.1, 6.3 [sK]	Slides included in the Maximum Likelihood Estimation deck	Chen & Goodman 98(pp. 1-21)	HW 3 out on 10/04
10,11	Oct 8, Oct 10	MAP cont'd Expectation Maximization (EM) I EM II (formal) Gaussian Mixture Models (GMMs)	[MS] 14.2 Notes by Guy Lebanon: Derivation of EM for Gaussian mixture EM derivation shortcut for exponential family	Slides included in the Maximum Likelihood Estimation deck	More advanced EM notes by John Lafferty	HW 3 due on 10/13 HW 4 out on 10/16. LaTeX source available here
12,13	Oct 15, Oct 17	Whole Sentence Models Ngrams: Shrinkage and EM for Shrinkage Backoff Information Theory I	Convexity, Maximum Likelihood, and All That	Ngram Models
14,15	Oct 22, Oct 24	Information Theory II Perplexity, Feature Selection		Perplexity	Elements of Information Theory Visual Information Theory IT notes: Entropy of English	HW 4 due on 10/24
16,17	Oct 29, Oct 31	Clustering Trees			[MS] 16.1 [BBDM]
18,19	Nov 5, Nov 7	Hidden Markov Models - I and II Exam Review	[MS] ch. 9		Larry Rabiner's classic HMM tutorial	HW 5 out on 11/05
20	Nov 12	Hidden Markov Models - III	[MS] ch. 9		Larry Rabiner's classic HMM tutorial	HW 5 due on 11/11 HW 6 out on 11/12
	Nov 15 - Nov 17	Take Home Exam
21,22,23	Nov 14, Nov 19, Nov 26	Exponential Models - I and II Maximum Entropy Models	Adam Berger's online tutorial		[MS] 16.2 Noah Smith's tutorial [BDD] [rR] [rR slides] using MCMC with language	HW 6 due on 11/22 HW 7 out on 11/23
24	November 28	Semantic Models Probabilistic Context Free Grammars (PCFG) The Inside-Outside Algorithm	Notes on Probabilistic Context Free Grammars		[MS] 11.1-11.4
25,26	Dec 3, Dec 5	Overflow Topics: Latent Semantic Analysis Dimensionality Reduction cf. Word2vec Syntactic Language Models Presentations - Mandatory Class Attendance	Bellegarda 99 Indexing by latent semantic analysis Jelinek and Chelba 99		Bellegarda 00 Yan Liu's Slides Hoffmann 99 Gildea and Hofmann 99 Raux and Singh 04 Chelba Slides 98	HW 7 due on 12/07

Abbreviations (in order of appearance):

Abbreviation	Book
[MS]	Manning and Schutze, Foundations of Statistical Natural Language Processing.
[BCW]	Bell, Cleary and Witten, Text Compression.
[mD]	Morris DeGroot, Probability and Statistics, 2nd edition.
[sK]	Slava M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer", IEEE Transactions on Acoustic, Speech and Signal Processing, vol. 35, no. 3, pp. 400-401, 1987.
[BBDM]	L. Bahl, P. Brown, P. de Souza and R. Mercer, "A tree-based statistical language model for natural language speech recognition", IEEE Transactions on Acoustic, Speech and Signal Processing, vol. 37, no. 7, pp. 1001-1008, 1989.
[rR]	Roni Rosenfeld, "A maximum entropy approach to adaptive statistical language modeling", Speech and Language, vol. 10, pp. 187-228, 1996.