Welcome to the programming component of this assignment!
This assignment includes an autograder for you to grade your answers on your machine. This can be run with the command:
python3.6 autograder.py
The code for this assignment consists of several Python files, some of which you will need to read and understand in order to complete the assignment, and some of which you can ignore. You can download and unzip all the code, data, and supporting files from hw4_programming.zip.
kmeans.py |
Your code to implement K-means clustering algorithm. |
gmm.py |
Your code to implement clustering with Gaussian mixture models. |
util.py |
Convenience methods to generate various plots that will be needed in this assignment. |
test_cases/Q*/*.py |
These are the unit tests that the autograder runs. Ideally, you would be writing these unit tests yourself, but we are saving you a bit of time and allowing the autograder to check these things. You should definitely be looking at these to see what is and is not being tested. These test cases also generate plots related to the task; the plots are saved in a directory named figures . The autograder on Gradescope may run a different version of these unit tests. |
autograder.py |
Autograder infrastructure code. |
Files to Edit and Submit: You will fill in portions of kmeans.py
and gmm.py
during the assignment. You should submit this file containing your code and comments to the Programming component on Gradescope. Please do not change the other files in this distribution or submit any of our original files other than these files. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder.
Report: Many of the sections in this programming assignment will contain questions that are not autograded. You will place the requested results in the appropriate locations within the PDF of the Written component of this assignment.
Evaluation: Your assignment will be assessed based on your code, the output of the autograder, and the required contents of in the Written component.
Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.
Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours, recitation, and Piazza are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these assignments to be rewarding and instructional, not frustrating and demoralizing. But, we don't know when or how to help unless you ask.
In kmeans.py
, implement the kmeans
function to perform K-means clustering algorithm on the MNIST dataset. See function docstring for details.
As you'll see in the code, we initialize the cluster centers to the first K points in the dataset. While this isn't an ideal initialization, it helps take remove any randomness in the autograder.
You may run the following command to run some unit tests on your Q1 implementation:
python3.6 autograder.py -q Q1
The autograder will also render your 784 dimensional K-means centers as images. It will save these plots as as png files in a new directory named figures
. You are required to include these some of these figures as part of the written component of this assignment. See the written component for details about which images to include.
In this question, we'll be testing our GMM skills on an MNIST dataset with just the zero and one digits. Of course, since this in an unsupervised clustering task, we don't have any of the class labels for this dataset. To make our probabilistic clusting easier to visualize, we'll preprocess the MNIST dataset with PCA to compress it down to a 2-dimensional feature space.
In gmm.py
, implement the pca
function to use PCA to project data onto the first K principal components. See function docstring for details.
You may use either numpy.linalg.eig
or numpy.linalg.svd
to compute your eigenvectors. We recommend numpy.linalg.svd
as it returns sorted arrays. You may NOT use high-level PCA solvers, for instance sklearn.decomposition.PCA
.
NOTE: Eigenvectors are not unique. They can be scaled and still be correct eigenvectors. To make the autograder happy, you must pass your eigenvector matrix through gmm.consistent_scale_eigenvectors
before using them to transform your data.
You may run the following command to run some unit tests on your Q2 PCA implementation:
python3.6 autograder.py -t test_cases/Q2/test1.py
python3.6 autograder.py -t test_cases/Q2/test2.py
We've already given you the high level structure for the EM algorithm for Gaussian mixture models in gmm.learn_gmm()
, which alternates between: E) updating the conditional class probabilities given the data and the parameters and M) updating the parameters given the data and the conditional class probabilities. While you shouldn't edit the core content of this learn_gmm()
function, you may want to uncomment some of the printing and plotting statements to help you better understand what is happening at each iteration.
We've also provided a function to compute PDF of a multivariate Gaussian, gmm.gaussian_pdf()
, and a function to compute the log likelihood of the GMM, gmm.log_likelihood()
, that we are trying to maximize:
\[ \ell(\boldsymbol{\theta}; \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \cdots, \boldsymbol{x}^{(N)}) = \sum_{i=1}^N \log p\left(\boldsymbol{x}^{(i)} \mid \boldsymbol{\theta}\right) = \sum_{i=1}^N \log \sum_{k=1}^K \pi_k \mathcal{N}\left(\boldsymbol{x}^{(i)}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right) \]
where \(\pi_k = p\left(z_k^{(i)} = 1\right), \forall i \).
Implement the update_Z()
function in gmm.py
. See the function docstring for details.
For a fixed set of Gaussian mixture model parameters, \(\boldsymbol{\theta}^{(t)}\), update the probability that each point, \(\boldsymbol{x}^{(i)}\), belongs to cluster \(k\):
\[p\left(z_k^{(i)} = 1 \mid \boldsymbol{x}^{(i)}, \boldsymbol{\theta}^{(t)}\right) \leftarrow \frac{\pi_k^{(t)} \mathcal{N}\left(\boldsymbol{x}^{(i)}; \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)}\right)}{\sum_{j=1}^K \pi_j^{(t)} \mathcal{N}\left(\boldsymbol{x}^{(i)}; \boldsymbol{\mu}_j^{(t)}, \boldsymbol{\Sigma}_j^{(t)}\right)}, \forall i, k\]
Implement the update_parameters()
function in gmm.py
. See the function docstring for details.
\[ \pi_k^{(t+1)} \leftarrow \frac{\sum_{i=1}^N P\left(z_k^{(i)} = 1 \mid \boldsymbol{x}^{(i)}, \boldsymbol{\theta}^{(t)}\right) }{N}, \forall k \]
\[ \boldsymbol{\mu}_k^{(t+1)} \leftarrow \frac{\sum_{i=1}^N P\left(z_k^{(i)} = 1 \mid \boldsymbol{x}^{(i)}, \boldsymbol{\theta}^{(t)}\right) \boldsymbol{x}^{(i)} }{\sum_{i=1}^N P\left(z_k^{(i)} = 1 \mid \boldsymbol{x}^{(i)}, \boldsymbol{\theta}^{(t)}\right)}, \forall k \]
\[ \boldsymbol{\Sigma}_k^{(t+1)} \leftarrow \frac{\sum_{i=1}^N P\left(z_k^{(i)} = 1 \mid \boldsymbol{x}^{(i)}, \boldsymbol{\theta}^{(t)}\right) \left(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_k^{(t+1)}\right) \left(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_k^{(t+1)}\right)^T }{\sum_{i=1}^N P\left(z_k^{(i)} = 1 \mid \boldsymbol{x}^{(i)}, \boldsymbol{\theta}^{(t)}\right)}, \forall k \]
You may run the following command to run some unit tests on your Q2 implementation:
python3.6 autograder.py -q Q2
The autograder will also save various plots of PCA and GMM results as png files in a new directory named figures
. You are required to include these some of these figures as part of the written component of this assignment. See the written component for details about which plots to include.
Complete all questions as specified in the above instructions. Then upload kmeans.py
and gmm.py
to Gradescope. Your submission should finish running within 20 minutes, after which it will time out on Gradescope.
Don't forget to include any request results in the PDF of the Written component, which is to be submitted on Gradescope as well.
You may submit to Gradescope as many times as you like. You may also run the autograder on your own machine to speed up the development process. Just note that the autograder on Gradescope will be slightly different than the local autograder. The autograder can be invoked on your own machine using the command:
python3.6 autograder.py
Note that running the autograder locally will not register your grades with us. Remember to submit your code when you want to register your grades for this assignment.
The autograder on Gradescope might take a while but don't worry: so long as you submit before the deadline, it's not late.