15-859(B) MACHINE LEARNING THEORY
- Instructor: Avrim Blum
- Time: MW 1:30-2:50
- Place: Wean 5409
Course description:
This course will focus on theoretical aspects of machine learning. We
will examine questions such as: What kinds of guarantees can one prove
about learning algorithms? What are good algorithms for achieving
certain types of goals? Can we devise models that are both amenable to
mathematical analysis and make sense empirically? What can we say
about the inherent ease or difficulty of learning problems? Addressing
these questions will require pulling in notions and ideas from
statistics, complexity theory, information theory, cryptography, game theory, and
empirical machine learning research.
Text:
An Introduction to Computational Learning Theory by Michael Kearns
and Umesh Vazirani, plus papers and notes for topics not in the book.
Office hours: Just stop by my office or make an appointment.
The final exam is Tuesday 5/16, 1:00-4:00pm, in Wean 6423 (not 4623).
Handouts
- General course information
- Handout on tail inequalities
- Homework 1 [ps,pdf]. Solutions [ps,pdf]
- Homework 2 [ps,pdf]. Solutions [ps,pdf]
- Homework 3 [ps,pdf]. Solutions [ps,pdf]
- Homework 4 [ps,pdf]. Solutions [ps,pdf]
- Homework 4.5 [ps,pdf].
- Homework 5 [ps,pdf]. Solutions [ps,pdf]
- Homework 6 [ps,pdf]. Solutions [ps,pdf]
Lecture Notes & tentative plan
- 01/18: Introduction, basic definitions, the consistency model.
- 01/23: The Mistake-bound model.
- 01/25: The Winnow algorithm + applications.
- 01/30: The Perceptron Algorithm, Margins, and Kernel functions.
- 02/01: Combining expert advice, Weighted-Majority, Regret-bounds, and connections to game theory.
- 02/06: Kalai-Vempala algorithm.
- 02/08: PAC model I: basic results and Occam's razor.
- 02/13: PAC model II: Chernoff/Hoeffding bounds, MB->PAC [pdf]. Handout on tail inequalities.
- 02/15: VC-dimension I: Proof of Sauer's lemma. Slides and notes.
- 02/20: [no class today]
- 02/22: VC-dimension II.
- 02/27: Boosting I: weak vs strong learning, basic issues.
- 03/01: Boosting II: Adaboost.
- 03/06: Cryptographic hardness results.
- 03/08: Maxent and maximum-likelihood exponential models. Connection to Winnow.
- 03/20: Learning from noisy data. The Statistical Query model.
- 03/22: Characterizing SQ learnability with Fourier analysis.
- 03/27: Finish SQ hardness. Using Fourier for learning.
- 03/29: Membership queries I: basic algorithms, KM algorithm.
- 04/03: Membership queries II: Fourier spectrum of DTs & DNFs, Bshouty's alg.
- 04/05: Membership queries III: Angluin's algorithm for learning DFAs. Additional notes.
- 04/10: Learning finite-state environments without a reset.
- 04/12: MDPs and reinforcement learning.
- 04/17: MDPs and reinforcement learning II.
- 04/19: Semi-supervised learning, active learning.
- 04/24: Boosting and margins.
- 04/26: Kernels and similarity functions.
- 05/01: Project presentations.
- 05/03: Project presentations.
Additional Readings & More Information
Books (general):
- J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
- N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines (and other kernel-based learning methods). Cambridge University Press, 2000.
- M. Anthony and P. Bartlett, Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
- V. Vapnik. Statistical Learning Theory. Wiley, 1998.
- L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.
Online Learning:
- Nick Littlestone, Learning Quickly when Irrelevant Attributes Abound: A New Linear-threshold Algorithm. Machine Learning 2:285-318, 1988. (The version pointed to here is the tech report UCSC-CRL-87-28.) This is the paper that first defined the Mistake-bound model and also introduced the Winnow algorithm; a minimal sketch of Winnow appears after this list. A great paper.
- Littlestone and Warmuth, The Weighted Majority Algorithm. Information and Computation 108(2):212-261, 1994. (The version pointed to here is the tech report UCSC-CRL-91-28.) Introduces the weighted majority algorithm, along with a number of variants; a sketch appears after this list. Also a great paper.
- Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David
Helmbold, Robert Schapire, and Manfred Warmuth, How
to use expert advice, Journal of the ACM, 44(3):427-485, May 1997.
Yoav Freund and Robert Schapire, Adaptive game playing using
multiplicative weights, Games and Economic Behavior, 29:79-103,
1999.
Continuing the line of research begun in the [LW] paper, these give tighter analyses of multiplicative-weighting expert algorithms and a game-theoretic perspective, as well as addressing a number of other issues.
- Adam Kalai and Santosh Vempala, Efficient algorithms for the online decision problem, COLT '03. Martin Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, ICML '03. These papers give efficient algorithms for a broad class of settings that one can view as having exponentially many "experts", but which are represented in an implicit, compact way; a sketch of online gradient descent appears after this list.
- Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert Schapire, The Nonstochastic Multiarmed Bandit Problem, SIAM J. Comput. 32(1):48-77, 2002. Brendan McMahan and Avrim Blum, Online Geometric Optimization in the Bandit Setting Against an Adaptive Adversary, COLT '04. Abie Flaxman, Adam Tauman Kalai, and Brendan McMahan, Online Convex Optimization in the Bandit Setting: Gradient Descent Without a Gradient, SODA '05. These papers extend the above results to the bandit setting, in which only the loss or gain of the action actually played can be observed at each time step; a sketch of the Exp3 algorithm appears after this list.
- Avrim Blum, On-Line Algorithms in Machine Learning (a survey). From "Online Algorithms: The State of the Art", Fiat and Woeginger eds., LNCS #1442, 1998.
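For concreteness, here is a minimal sketch of the Winnow algorithm from the Littlestone paper above, for learning a monotone disjunction over n Boolean attributes. The promotion/demotion factors (2 and 1/2) and the threshold n are one standard choice, and the function and variable names are illustrative, not from the paper:

    def winnow(examples, n):
        """Winnow sketch: predict with a linear threshold, and on each
        mistake multiplicatively promote or demote the weights of the
        active attributes. Makes O(r log n) mistakes when the target is
        an r-literal monotone disjunction."""
        w = [1.0] * n            # one weight per attribute, initialized to 1
        theta = float(n)         # fixed threshold
        mistakes = 0
        for x, y in examples:    # x is a 0/1 list of length n; y in {0, 1}
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
            if pred == 0 and y == 1:    # false negative: promote active weights
                w = [wi * 2.0 if xi else wi for wi, xi in zip(w, x)]
            elif pred == 1 and y == 0:  # false positive: demote active weights
                w = [wi / 2.0 if xi else wi for wi, xi in zip(w, x)]
            mistakes += int(pred != y)
        return mistakes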
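Similarly, a minimal sketch of the deterministic Weighted Majority algorithm from the Littlestone-Warmuth paper; the halving factor beta = 1/2 is one standard choice. (The randomized variant, which predicts proportionally to weight, achieves expected regret O(sqrt(T log N)) against the best of N experts.)

    def weighted_majority(expert_preds, labels, beta=0.5):
        """Deterministic Weighted Majority sketch: take a weighted vote of
        the experts, then multiply the weight of each expert that erred by
        beta. Total mistakes are O(log N) plus a constant times the best
        expert's mistakes."""
        n = len(expert_preds[0])        # number of experts
        w = [1.0] * n
        mistakes = 0
        for preds, y in zip(expert_preds, labels):  # preds: experts' 0/1 guesses
            vote1 = sum(wi for wi, p in zip(w, preds) if p == 1)
            vote0 = sum(w) - vote1
            guess = 1 if vote1 >= vote0 else 0
            mistakes += int(guess != y)
            w = [wi * beta if p != y else wi for wi, p in zip(w, preds)]
        return mistakes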
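And a sketch of Zinkevich-style online gradient descent for online convex programming, which handles such implicitly-represented settings. The step-size schedule eta/sqrt(t) and the names grad_fns and project are illustrative assumptions, not the papers' notation:

    import math

    def online_gradient_descent(grad_fns, project, x0, eta=1.0):
        """Online convex programming sketch: play x_t, observe the convex
        loss f_t, take a step against its gradient, and project back onto
        the feasible set. With a ~1/sqrt(t) step size (bounded gradients
        and feasible set), regret is O(sqrt(T))."""
        x = list(x0)
        for t, grad in enumerate(grad_fns, start=1):
            g = grad(x)                   # gradient of f_t at the point played
            step = eta / math.sqrt(t)     # decaying step size
            x = project([xi - step * gi for xi, gi in zip(x, g)])
            yield x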
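Finally, a minimal sketch of the Exp3 algorithm from the Auer et al. paper above. The importance-weighted estimate x/p[i] is what compensates for observing only the pulled arm's reward; pull, K, T, and gamma are illustrative names and parameters:

    import math, random

    def exp3(pull, K, T, gamma=0.1):
        """Exp3 sketch for the nonstochastic multiarmed bandit: mix
        exponential weights with uniform exploration, and importance-weight
        the one observed reward so its estimate is unbiased."""
        w = [1.0] * K
        for _ in range(T):
            total = sum(w)
            p = [(1 - gamma) * wi / total + gamma / K for wi in w]
            i = random.choices(range(K), weights=p)[0]
            x = pull(i)                         # reward in [0,1]; only arm i is seen
            xhat = x / p[i]                     # unbiased estimate of arm i's reward
            w[i] *= math.exp(gamma * xhat / K)  # update only the pulled arm
        return w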
PAC model:
- David Haussler, chapter on the PAC learning model and decision-theoretic generalizations, with applications to neural nets. From Mathematical Perspectives on Neural Networks, Lawrence Erlbaum Associates, 1995; contains reprinted material from "Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications", Information and Computation 100:78-150, September 1992. This is a really nice survey of the PAC model and various sample-complexity results; a representative bound appears after this list.
- David Williamson, John Shawe-Taylor, Bernhard Schölkopf, and Alex Smola, Sample Based Generalization Bounds. Gives tighter generalization bounds where, instead of using "the maximum number of ways of labeling a set of 2m points", you can use "the number of ways of labeling your actual sample"; see the note after this list.
- My FOCS '03 tutorial on Machine Learning Theory.
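As a representative example of the sample-complexity results surveyed in the Haussler chapter above: for a finite hypothesis class H, any learner that outputs a hypothesis consistent with

    m \;\ge\; \frac{1}{\epsilon}\Big(\ln|H| + \ln\frac{1}{\delta}\Big)

random examples has, with probability at least 1 - \delta, true error at most \epsilon. This is the standard Occam's-razor bound for consistent learners; exact constants vary by source.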
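To see what the Williamson et al. improvement buys, recall the classic VC-style uniform convergence bound, which (up to constants) says that with probability at least 1 - \delta, every h in H satisfies

    \mathrm{err}(h) \;\le\; \widehat{\mathrm{err}}(h) + O\!\left(\sqrt{\frac{\ln \Pi_H(2m) + \ln(1/\delta)}{m}}\right),

where \Pi_H(2m) is the maximum number of labelings of any 2m points. The sample-based bounds replace this worst-case growth function with the number of labelings of the sample actually drawn, which can be much smaller.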
Fourier analysis, weak learning, SQ learning:
- Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kearns, Yishay Mansour, and Steven Rudich, Weakly Learning DNF and Characterizing Statistical Query Learning Using Fourier Analysis, STOC '94, pp. 253-262.
- Y. Mansour, Learning Boolean Functions via the Fourier Transform. Survey article in "Theoretical Advances in Neural Computation and Learning", pp. 391-424, 1994.
- A. Blum, C. Burch, and J. Langford, On Learning Monotone Boolean Functions. Proceedings of the 39th Annual Symposium on Foundations of Computer Science (FOCS '98).
Web sites on maxent: