15-859: Algorithms for Big Data, Fall 2017
- Instructor: David Woodruff
- Lecture time: Thursdays, 15:00-17:30 (15:00-16:10, break till 16:20, 16:20-17:30), GHC
4303.
- TA: Dhivya Eswaran
- David's office hours: Mondays, 13:00-14:00 in GHC 7217 or by appointment.
- Dhivya's recitation: Fridays, 10:30-11:20 in Baker Hall 235A
- Dhivya's office hours: Wednesdays, 10:00-11:00 in GHC 6008 or by appointment.
Grading
Grading is based on problem sets, scribing a lecture, and a presentation/project. There will be no exams.
General information about the breakdown for the grading is available
here:
grading.pdf
Latex
Homework solutions, scribe notes, and final projects must be typeset in LaTeX. If you are not familiar with LaTeX, see this introduction.
A template for your scribe notes is here:
template.tex
Lectures
- Lecture 1 slides (Least squares regression, subspace embeddings, net arguments, matrix Chernoff)
Scribe notes 1
Scribe notes 2
- Lecture 2 slides (Subsampled Randomized Hadamard Transform, Approximate Matrix Product CountSketch)
Scribe notes 3
Scribe notes 4
- Lecture 3 slides (CountSketch, Affine Embeddings, Low Rank Approximation)
Scribe notes 5
Scribe notes 6
- Lecture 4 slides (Sketching-based Preconditioning, Leverage Scores, Distributed Low Rank Approximation, Coresets)
Scribe notes 7
Scribe notes 8
- Lecture 5 slides (Coresets, Distributed Low Rank Approximation)
Scribe notes 9
Scribe notes 10
- Lecture 6 slides (L1 Regression, Cauchy Random Variables, L1 Subspace Embeddings)
Scribe notes 11
Scribe notes 12
- Lecture 7 slides (Fast L1 Regression, Data Stream Model, Estimating Norms in a Stream)
Scribe notes 13
Scribe notes 14
- Lecture 8 slides (Estimating p-Norms, Heavy Hitters, and Non-zero items in a Stream)
Scribe notes 15
Scribe notes 16
- Lecture 9 slides (Information Theory, Distances Between Distributions, Indexing, Lower Bounds for Norms)
Scribe notes 17
Scribe notes 18
- Lecture 10 slides (Information Complexity and Norm Lower Bounds in a Stream, Non-negative Matrix Factorization, Weighted Low Rank Factorization)
Scribe notes 19 and 20
- Lecture 11 slides (Weighted Low Rank Approximation, Gaussian Width, M-Estimators, Compressed Sensing)
Scribe notes 21
Scribe notes 22
Recitations
- Recitation 1 slides (review of linear algebra and probability)
- Many of the recitations ended up not using slides. The TA will post material used in recitations to piazza.
Problem sets
Course Description
With the growing number of massive datasets in applications such as machine learning and numerical linear algebra, classical algorithms for processing such datasets are often no longer feasible. In this course we will cover algorithmic techniques, models, and lower bounds for handling such data. A common theme is the use of randomized methods, such as sketching and sampling, to provide dimensionality reduction. In the context of optimization problems, this leads to faster algorithms, and we will see examples of this in the form of least squares regression and low rank approximation of matrices and tensors, as well as robust variants of these problems. In the context of distributed algorithms, dimensionality reduction leads to communication-efficient protocols, while in the context of data stream algorithms, it leads to memory-efficient algorithms. We will study some of the above problems in such models, such as low rank approximation, but also consider a variety of classical streaming problems such as counting distinct elements, finding frequent items, and estimating norms. Finally we will study lower bound methods in these models showing that many of the algorithms we covered are optimal or near-optimal. Such methods are often based on communication complexity and information-theoretic arguments.
References
One recommended reference book is the lecturer's monograph
Sketching as a Tool for Numerical Linear Algebra.
Slide notes for about half the course are available here:
Videos from a previous course I taught on the linear algebra, l1, and weighted slides are available here:
videos . Note that mine start on 27-02-2017.
The rest of the material will be based on
the lecturer's experience and related research papers or surveys.
Materials from the following related courses might be useful in
various parts of the course:
Intended audience: The course is indended for both graduate students and advanced undegraduate students with mathematical maturity and comfort with algorithms, discrete probability, and linear algebra. No other prerequisites are required.
Maintained by David Woodruff