\( \def\v{\mathbf{v}} \def\w{\mathbf{w}} \def\x{\mathbf{x}} \def\D{\mathbf{D}} \def\V{\mathbf{V}} \def\S{\mathbf{S}} \def\F{\mathcal F} \def\bold#1{\bf #1} \)

 

 

MLSP Fall 2016: Homework 2
Sparse Representations

Problem 1: Sparse recovery

We have previously noted in class that given a dictionary $D$ and data $X$ that can be composed as a sparse combination of dictionary entries, i.e. we can write $X = D Y$, where $|Y|_0 \leq k$ for some small value of $k$, then $Y$ can be recovered from $X$ and $D$ using sparse recovery techniques. Specifically, we noted that this can be estimated through the following minimization: \[ \hat{Y} = \arg\min_Y \|X - DY\|_2^2 + \lambda |Y|_1 \] We will now see how this simple principle can be used to subsample data and still recover it. This principle is known as “compressive sensing”.
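
As a small illustration of this principle, the following MATLAB sketch recovers a synthetic sparse $Y$ from $X = DY$. It assumes the Statistics and Machine Learning Toolbox function lasso is available; note that lasso scales the squared-error term by $\frac{1}{2M}$, so its regularization weight is not numerically identical to the $\lambda$ above.

    N = 100; M = 20; k = 3;                % dictionary size, number of measurements, sparsity
    D = randn(M, N);                       % a random dictionary
    Ytrue = zeros(N, 1);
    Ytrue(randperm(N, k)) = randn(k, 1);   % a k-sparse coefficient vector
    X = D * Ytrue;                         % data composed from k dictionary entries
    Yhat = lasso(D, X, 'Lambda', 0.01);    % l1-regularized least-squares estimate of Y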

We will use an example to explain the concept. The example will also be your homework problem.

A little story: Cassini-Huygens

The Cassini spacecraft was launched on 15 Oct 1997 and entered into orbit around Saturn on 1 July 2004. Ever since, it has been orbiting the ringed planet, taking thousands of pictures of Saturn, its rings, and its moons, and beaming them back to Earth. Here are some of the gorgeous pictures sent by Cassini to Earth. Cassini is powered by about 33 kg of Plutonium-238, which will continue to provide power to the spacecraft until the end of its mission.

On Christmas day 2004 Cassini launched the Huygens probe towards Saturn's moon, Titan. Huygens landed on Titan on 14 Jan 2005. Unlike Cassini, Huygens was powered by a battery. By design, the module had no more than three hours of battery life, most of which was planned to be used during the descent onto Titan. Engineers expected to get at most only 30 minutes of data from the surface, much of which was in the form of pictures of Titan's surface and topography. Here are the incredible pictures sent by Huygens.

Our “story” ends with this note: Huygens could only send 350 pictures before its batteries ran out of power. (An additional 350 pictures were lost due to a software bug that caused the channel they were sent on to be ignored.)

Compressive sensing to the rescue

One of the main consumers of on-board battery power is the transmission of the pictures. Each image must be adequately “wrapped” with error-correcting code and transmitted to the recipient (Huygens transmitted to both Cassini and the Earth). The amount of energy consumed to transmit a picture directly relates to the number of pixels captured -- more pixels require more energy both to capture and to transmit.

Let us now consider an alternate scheme.

A note on image sparsity in the wavelet domain

It is well known that when considered in the wavelet transform domain, images are sparse. A wavelet transform can be viewed as a matrix operation (just as a discrete Fourier transform can). Let $I$ be any image (expressed as a vector). The wavelet transform $\F$ of the image can be computed as \[ \F = WI \] where $W$ is the transform matrix. (In reality, the transform is viewed as a filterbank, but we will assume the simpler model above for our discussion).

The image itself can be recovered from the transform as \[ I = W_{inv} \F \] where $W_{inv}$ is the inverse transform matrix.

For images, $\F$ is generally sparse, i.e. most of the components of $\F$ are zero or close to zero. We will use this to our benefit.
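
A minimal MATLAB sketch (assuming the Wavelet Toolbox and any grayscale test image; 'db1' is an illustrative choice of wavelet) showing this sparsity: most coefficients of the full wavelet decomposition are near zero, yet the image is recovered exactly from the transform.

    I = double(imread('cameraman.tif'));                   % any grayscale test image (illustrative)
    [F, S] = wavedec2(I, wmaxlev(size(I),'db1'), 'db1');   % full 'db1' wavelet decomposition
    fracSmall = mean(abs(F) < 0.01 * max(abs(F)));         % fraction of near-zero coefficients
    fprintf('%.1f%% of coefficients are below 1%% of the largest\n', 100 * fracSmall);
    Irec = waverec2(F, S, 'db1');                          % the image is recovered exactly from F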

Capturing one-pixel masked integral images

Instead of simply snapping a picture directly, consider a scheme in which Huygens applies a random mask to its camera and computes an integral of the masked image. For instance, applying a random mask gives the following picture

[Figure: Saturn image with a random mask applied]

and computes a single integral value \[ P_1 = \sum_{i,j} m_{i,j} I_{i,j} \] where $m_{i,j}$ is the value of the mask at pixel $(i,j)$, and $I_{i,j}$ is the actual image value. Representing the mask as the vector $\mathbf{m}_1$, where the subscript $1$ is to indicate that this is the first such mask applied, Huygens' first measurement would be \[ P_1 = \mathbf{m}_1^\top I \]

Similarly, applying a series of other such masks, e.g.

[Figures: two further randomly masked versions of the Saturn image]

Huygens can now obtain a series of scalar measurements $P_1,~P_2,~P_3,\cdots,P_K$, where $P_i$ has been obtained using mask $\mathbf{m}_i$, and $K$ is the total number of measurements obtained in this manner. Representing all of the masks as a single matrix $\mathbf{M} = [\mathbf{m}_1~\mathbf{m}_2~\mathbf{m}_3\cdots\mathbf{m}_K]^\top$, and the measurements $P_i$ collectively as a vector $\mathbf{P} = [P_1~P_2~P_3\cdots P_K]^\top$, we can write \[ \mathbf{P} = \mathbf{M}I \]

Note that $K$ may be far fewer than the number of pixels in the image. Huygens can now simply transmit $P_1 \cdots P_K$, and save on power. If $K \lt N$, where $N$ is the number of pixels in the image, Huygens could use the saved energy to take more pictures and transmit them. Since we are sending far fewer numbers than we would if we were to actually send the entire image, we will call these compressed measurements.
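
A minimal MATLAB sketch of this measurement process, using an arbitrary grayscale test image and random binary masks (both illustrative assumptions):

    I = double(imread('cameraman.tif'));   % any grayscale test image (illustrative)
    I = I(1:4:end, 1:4:end);               % downsample to 64x64 to keep the sketch small
    N = numel(I);                          % number of pixels
    K = 1024;                              % number of compressed measurements, K < N
    M = double(rand(K, N) > 0.5);          % each row of M is one random binary mask m_i
    P = M * I(:);                          % K scalar measurements, P_i = m_i' * vec(I)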

Recovering the full image from the compressed measurements

But how then can the $N$-pixel pictures themselves be recovered from this sequence of $K$ compressed measurements? For this we exploit the sparsity of images in the wavelet domain.

Recall from our earlier discussion that we can represent $I = W_{inv} \F$, where $\F$ is sparse. Hence we can write \[ \mathbf{P} = \mathbf{M}W_{inv}\F \]

Let $\mathbf{R} = \mathbf{M}W_{inv}$. Remember that the random masks applied to the camera are known, i.e. $\mathbf{M}$ is known. The inverse wavelet transform matrix $W_{inv}$ is of course known. Hence $\mathbf{R}$ is known. We can therefore write \[ \mathbf{P} = \mathbf{R}\F \] In other words, we are expressing the compressed measurements $\mathbf{P}$ as a sparse combination of the columns in a dictionary matrix $\mathbf{R}$.

To recover the image $I$ from the $K \times 1$ compressed measurement vector $\mathbf{P}$, it is sufficient to recover the $N \times 1$ sparse vector $\F$, since $I = W_{inv}\F$. We can do so using the sparse recovery technique we discussed in the context of dictionary-based representations, which we recalled above: \[ \hat{\F} = \arg\min_{\F} \|\mathbf{P} - \mathbf{R}\F\|_2^2 + \lambda |\F|_1 \]

The complete image can now be recovered as $I = W_{inv}\hat{\F}$. This is the basic concept behind compressive sensing.

We have managed to come up with a solution that enables Huygens to send far fewer numbers (only $K$) than the size of the image itself ($N$ pixels) and still recover the image. Huygens can now use the conserved battery power to take more pictures. We have thus rescued NASA's space exploration program!!

Having achieved these lofty goals, let us now return to a more mundane issue: homework problems.


Homework Problem Ia

In this problem we give you compressed measurements from an image taken by Cassini (not Huygens). We also give you the “measurement matrix” $\mathbf{R}$. You must recover the full image. (Note: we are referring to $\mathbf{R}$ in the above equations as the measurement matrix; more conventionally $\mathbf{M}$ would be called the measurement matrix).

Here is the original image. This is only for reference, to compute error; we will not use it in recovery computations. It is a $128 \times 128$ image with a total of 16384 pixels. Note that the sparse wavelet transform $\F$ of the image will also have 16384 components, although most of them will be very small or 0.

This file contains three sets of compressed measurements “P1639.mat”, “P2048.mat” and “P4096.mat”, and their corresponding measurement matrices “R1639.mat”, “R2048.mat”, and “R4096.mat”. In each case the number in the name is the number of measurements. Thus “P1639.mat” contains a $1639 \times 1$ vector, representing 1639 one-pixel measurements, and “R1639.mat” is a $1639 \times 16384$ matrix that gives the linear transform from the 16384-component sparse transform $\F$ to the measurement vector in “P1639”. In the directory you will also find files named “SXXXX” (where XXXX is 1639, 2048 or 4096). These files are required by MATLAB for image reconstruction and are not part of the actual recovery algorithm.

You are required to use the sparse recovery procedure to recover $\F$ for each of these three measurements, and to reconstruct the image.

  1. For each of the three measurements, solve \[ \hat{\F} = \arg\min_{\F} \|\mathbf{P} - \mathbf{R}\F\|_2^2 + \lambda |\F|_1 \] We recommend setting $\lambda = 0.001\max(abs(\mathbf{R}^\top \mathbf{P}))$, although you may also try other values. You may use any solver for the above problem. As a default, we include a MATLAB implementation of a least-squares $\ell_1$ solver from Stephen Boyd. Note however that this is not the best algorithm; there are other, better algorithms for $\ell_1$ minimization. The Stanford sparselab page has several nice solvers under the “downloads” link. You may try these; some of them should give you better results than l1_ls. Submissions that report lower error (as explained below) will obtain better scores for this problem, so it is to your benefit to explore. You will be required to submit your code. A minimal recovery sketch using l1_ls is given after this list.
  2. For each of the three measurements, reconstruct the original image using the recovered $\F$. You can do so in MATLAB through the following commands:
    load SXXXX.mat;                                  % loads the bookkeeping matrix SXXXX
    S = SXXXX;
    Irec = waverec2(reshape(F,128,128), S, 'db1');   % F is the recovered wavelet coefficient vector
    
    Compute the error $E = \sum_{i,j} (I(i,j) - I_{rec}(i,j))^2$, where $I(i,j)$ and $I_{rec}(i,j)$ are the $(i,j)$th pixel value in the original and recovered image respectively. Report this error.
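
For reference, here is a minimal sketch of the full recovery for one of the measurement sets, using the provided l1_ls solver and the reconstruction commands from step 2. It assumes each .mat file defines a variable of the same name (as the commands above suggest) and that the original reference image has been loaded into a $128 \times 128$ matrix I; the solver tolerance is an illustrative choice.

    load P2048.mat; load R2048.mat; load S2048.mat;   % assumes variables P2048, R2048, S2048
    P = P2048;  R = R2048;  S = S2048;
    lambda = 0.001 * max(abs(R' * P));                % recommended regularization weight
    F = l1_ls(R, P, lambda, 1e-3, true);              % sparse wavelet coefficients (16384 x 1)
    Irec = waverec2(reshape(F, 128, 128), S, 'db1');  % reconstruction, as in step 2 above
    E = sum(sum((I - Irec).^2));                      % squared error against the reference image I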

Homework Problem Ib

In part Ia we solved the problem of sparse recovery as one of $\ell_1$ minimization. We did so because it is well known that minimizing the $\ell_1$ norm often also minimizes the $\ell_0$ norm of the solution; however, our actual objective in sparse recovery is to find a solution that has minimal $\ell_0$ norm, i.e. the smallest number of non-zero elements! Often, the optimal $\ell_1$-minimized solution will differ significantly from (and can be significantly worse than) the optimal $\ell_0$-minimized solution.

In this part of the problem we will solve the alternate optimization problem: \[ \hat{\F} = \arg\min_{\F} \|\mathbf{P} - \mathbf{R}\F\|_2^2~~~~{\rm such~that}~~\|\F\|_0 \leq K \] where $K$ is the level of sparsity desired in $\F$. This is clearly an $\ell_0$ constrained optimization problem. $\|\F\|_0 \leq K$ specifies a feasible set, which is the set of all vectors that have no more than $K$ non-zero elements (note that this is a non-convex set). Recall from our lecture on optimization that we can use the method of projected gradients for this problem.

The gradient of $ \|\mathbf{P} - \mathbf{R}\F\|_2^2$ w.r.t. $\F$ is given (up to a constant factor of 2, which we absorb into the step size below) by \[ \nabla_{\F} \|\mathbf{P} - \mathbf{R}\F\|_2^2 = -\mathbf{R}^\top (\mathbf{P} - \mathbf{R}\F) \]

Representing the $n^{\rm th}$ iterate in an iterative estimate of $\F$ by $\F^n$, the projected gradient descent algorithm to estimate $\F$ would be given by \[ \F_{unconstrained}^{n+1} = \F^n + \eta \mathbf{R}^\top (\mathbf{P} - \mathbf{R}\F^n) \\ \F^{n+1} = P_K(\F_{unconstrained}^{n+1}) \] where $\eta$ is a step size. The operator $P_K$ in the second step of the algorithm simply sets the $N-K$ smallest (in magnitude) elements of $\F_{unconstrained}^{n+1}$ to zero, forcing it to be a $K$-sparse vector.

The second step in the above iteration is the “projection” step of the projected gradient algorithm. It provably projects $\F_{unconstrained}^{n+1}$ onto the feasible set for $\F$, namely the set of all $K$-sparse vectors.

The above iteration is guaranteed to converge under specific conditions on $\mathbf{R}$, $K$ and $\eta$. This is captured nicely in the following algorithm known as “Iterative Hard Thresholding” (IHT), proposed by Tom Blumensath.

Iterative Hard Thresholding

The IHT algorithm assumes that the individual columns of the measurement matrix $\mathbf{R}$ are vectors of ($\ell_2$) length $\leq 1.0$. As a first step, confirm that this is true for the provided $\mathbf{R}$.

The IHT algorithm then performs iterations of the following computation (a minimal MATLAB sketch is given after the steps):

  1. Choose a maximum sparsity level $K$ for your solution.
  2. Initialize $\F^{0} = 0$. (Alternatively, you could set it to any random value.)
  3. Iterate: \[ \F^{n+1} = P_K(\F^n + \mathbf{R}^\top (\mathbf{P} - \mathbf{R}\F^n)) \] until the error $\|\mathbf{P} - \mathbf{R}\F\|_2^2$ converges, or the maximum number of iterations is achieved.
  4. The final output is $\hat{\F} = \F_{conv}$, where $\F_{conv}$ is the final estimate derived from the previous step.
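
A minimal MATLAB sketch of the iteration above, assuming the measurement matrix R and measurement vector P have been loaded and a sparsity level K chosen; the convergence tolerance is an illustrative choice.

    assert(all(sqrt(sum(R.^2, 1)) <= 1 + 1e-9));  % columns of R must have length <= 1
    maxIter = 200;  tol = 1e-6;                   % illustrative convergence settings
    F = zeros(size(R, 2), 1);                     % step 2: F^0 = 0
    prevErr = inf;
    for n = 1:maxIter
        F = F + R' * (P - R * F);                 % unconstrained gradient step
        [~, idx] = sort(abs(F), 'descend');
        F(idx(K+1:end)) = 0;                      % projection P_K: keep only the K largest entries
        err = norm(P - R * F)^2;                  % squared measurement error
        if abs(prevErr - err) < tol * max(err, eps), break; end
        prevErr = err;
    end
    Fhat = F;                                     % the final estimate \hat{F}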

The actual homework problem is the following

  1. Implement the IHT algorithm.
  2. Find the estimate $\hat{\F}$ for each of the three measurements in problem Ia. Use $K = \max(\frac{M}{4},\frac{N}{10})$ (rounded to an integer), where $M$ is the number of measurements, and $N$ is the size of the image. Note that these are far from optimal settings. Set the maximum number of iterations to 200.
  3. For each of the three measurements, reconstruct the original image using the recovered $\F$, as you did in problem Ia. Compute the error $E = \sum_{i,j} (I(i,j) - I_{rec}(i,j))^2$, where $I(i,j)$ and $I_{rec}(i,j)$ are the $(i,j)$th pixel value in the original and recovered image respectively. Report this error.
  4. You should find that the IHT solution is considerably worse than the $\ell_1$ minimization result. Can you hazard any intuition as to why this is so?

Problem 2: Linear Discriminant Analysis

In this problem we will explore the use of Linear Discriminant Analysis for the problem of language identification.

The goal is to build a language recognition system based on “I-vector” representation of speech signals. I-vectors are a factor-analysis based mechanism for representing audio recordings as fixed-length vectors. They are particularly useful for categorizing sounds, specifically speech, but also other data, by their categorical content (e.g. gender, language, speaker ID).

In the file Data_problem_problem2.tar you will find three directories: “Train”, “Dev” and “Eval”. These correspond respectively to the training data that will be used to train your classifier, development data to tune your classifier, and evaluation data to evaluate the different models. In each directory you will find 24 files that correspond to 24 different language classes. Each line of these 24 files corresponds to a single I-vector of dimension 600 and represents a single speech recording. The goal of this problem is to build two types of classifiers, one based on Linear Discriminant Analysis (LDA) and another based on sparse dictionary representation.

Problem 2A: Linear Discriminant Analysis

Implement LDA (linear discriminant analysis) and cosine scoring for the I-vectors in the data set. The classification algorithm based on LDA and cosine scoring is described as follows:

The goal of LDA is to estimate an LDA matrix $\V$ that solves the following generalized Eigenvalue problem: \[ \S_b\v = \lambda \S_w \v \] where $\S_b$ is the between-class scatter matrix and $\S_w$ is the within-class scatter matrix computed from the training I-vectors. You can solve the above generalized Eigenvalue problem using any Eigenvalue solver. In MATLAB you can use the command

    [V,D] = eigs(Sb, Sw, L-1);    % generalized eigenvectors of the pair (Sb, Sw)

If you have $L$ different classes, LDA will provide $L-1$ non-zero Eigenvalues.

The $\V$ matrix we want to compute is obtained from the $L-1$ Eigenvectors corresponding to the $L-1$ non-zero Eigenvalues: this gives us a $D \times (L-1)$ matrix $\V$ (where $D$ is the dimensionality of the data, in our case the 600-dimensional I-vectors), whose columns are these Eigenvectors. Note that the columns of $\V$ are orthogonal with respect to $\S_w$ (i.e. $\V^\top \S_w \V$ is diagonal), though not, in general, orthogonal in the ordinary sense.
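
To make the scatter matrices concrete, here is a minimal MATLAB sketch of estimating $\V$. It assumes the training I-vectors of class $l$ have been loaded into the rows of a cell entry Xtrain{l} (a hypothetical variable name) and uses the standard scatter-matrix definitions, which is an assumption about the exact convention intended here.

    L = 24;  D = 600;                             % number of languages, I-vector dimension
    Sw = zeros(D, D);  Sb = zeros(D, D);
    mu = mean(cat(1, Xtrain{:}), 1)';             % global mean of all training I-vectors
    for l = 1:L
        X = Xtrain{l}';                           % D x n_l matrix of class-l I-vectors
        nl = size(X, 2);
        ml = mean(X, 2);                          % class mean
        Xc = bsxfun(@minus, X, ml);               % centered class data
        Sw = Sw + Xc * Xc';                       % within-class scatter (assumed definition)
        Sb = Sb + nl * (ml - mu) * (ml - mu)';    % between-class scatter (assumed definition)
    end
    [V, ~] = eigs(Sb, Sw, L - 1);                 % D x (L-1) matrix of generalized Eigenvectors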

Training the classifier

  1. Project the (normalized) I-vectors onto the columns of $\V$ and then renormalize the length (see the sketch after this list): \[ \hat{\w} = \frac{\V^\top \w}{\|\V^\top \w\|} \]
  2. For each class $l$, compute the mean of its projected vectors and normalize its length. \[ \mathbf{m}_l^{raw} = \frac{1}{n_l} \sum_{\hat{\w} \in l} \hat{\w} \\ \mathbf{m}_l = \frac{\mathbf{m}_l^{raw}}{\|\mathbf{m}_l^{raw}\|} \] $\mathbf{m}_l$ is the “model” for language $l$, and $n_l$ is the number of training vectors for that language.
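
A minimal sketch of these two training steps, continuing from the $\V$ and Xtrain{l} of the previous sketch (both hypothetical names):

    m = zeros(L - 1, L);                                % one model vector per language
    for l = 1:L
        W = V' * Xtrain{l}';                            % project the class's I-vectors, (L-1) x n_l
        W = bsxfun(@rdivide, W, sqrt(sum(W.^2, 1)));    % length-normalize each projected vector
        mraw = mean(W, 2);                              % raw class mean
        m(:, l) = mraw / norm(mraw);                    % normalized model for language l
    end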

Testing the classifier

  1. Project the test I-vectors onto the columns of $\V$ and then renormalize the length (see the sketch after this list): \[ \hat{\w}_{test} = \frac{\V^\top \w_{test}}{\|\V^\top \w_{test}\|} \]
  2. Classify the test vector as belonging to the language (class) whose mean is closest to it in angle, i.e. assign the test vector to the language whose model mean has the greatest inner product with it: \[ \hat{l}(\hat{\w}) = \arg\max_l \left(\mathbf{m}_l^\top\hat{\w}_{test}\right) \]
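
A minimal sketch of scoring a single test I-vector wtest (a hypothetical 600-dimensional column vector) against the language models m from the training sketch:

    what = V' * wtest;                 % project the test I-vector onto the LDA directions
    what = what / norm(what);          % renormalize the length
    [~, lhat] = max(m' * what);        % language whose model has the largest inner product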

Actual Homework Problem

  1. Train an LDA projection matrix $\V$. You will have to submit $\V$.
  2. Train models for all the classes.
  3. Classify each of the test recordings in both the “dev” and “eval” directories. Return the classification output.
  4. Report dev and eval accuracies as follows: \[ Acc_{dev} = \frac{\rm number~of~correctly~classified~vectors~in~dev~directory}{\rm total~number~of~vectors~in~dev~directory} \] \[ Acc_{eval} = \frac{\rm number~of~correctly~classified~vectors~in~eval~directory}{\rm total~number~of~vectors~in~eval~directory} \]

Problem 2B: Classification through Overcomplete Dictionaries

In this problem we will use sparse overcomplete dictionary based representations to perform classification.

  1. Load all the (unnormalized) training vectors for each language $l$ into a dictionary $\D_l$. Thus, if there are $n_l$ training recordings for language $l$, $\D_l$ will have $n_l$ columns.
  2. Concatenate all the dictionaries into one large super dictionary: \[ \D = [\D_1,~\D_2,\cdots,\D_L] \]
  3. For each test example $\x_{test}$, solve the $\ell_1$-regularized sparse optimization problem: \[ \hat{S} = \arg\min_S \|\x_{test} -\D S\|^2_2 + \lambda \|S\|_1 \] You may use any of the solvers from sparselab or any other solver you may choose for this. (We recommend that you try a few to find out which ones give you the best classification.)
  4. We can separate $S$ into the contributions of each language: \[ S = [S_1^\top, S_2^\top, \cdots, S_L^\top]^\top \] where $S_i$ is the vector of weights corresponding to the columns from $\D_i$ in $\D$.
  5. For each language $l$ compute the language-specific reconstruction error: \[ E_l = \|\x_{test} - \D_l S_l\|^2_2 \]
  6. Classify the recording as the language with the smallest reconstruction error (a minimal sketch of this procedure follows the list): \[ \hat{l}(\x_{test}) = \arg\min_l E_l \]
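
A minimal sketch of this classification for a single test I-vector xtest, assuming the per-language dictionaries are stored in a cell array Dl{l} (hypothetical names) and using l1_ls as the solver; the regularization weight is an illustrative choice, and any other $\ell_1$ solver may be substituted.

    D = cat(2, Dl{:});                             % super dictionary, 600 x (n_1 + ... + n_L)
    lambda = 0.01;                                 % illustrative regularization weight
    S = l1_ls(D, xtest, lambda, 1e-3, true);       % sparse weights over all dictionary columns
    E = zeros(L, 1);  offset = 0;
    for l = 1:L
        nl = size(Dl{l}, 2);
        Sl = S(offset + (1:nl));                   % weights corresponding to language l's columns
        E(l) = norm(xtest - Dl{l} * Sl)^2;         % language-specific reconstruction error
        offset = offset + nl;
    end
    [~, lhat] = min(E);                            % classify as the language with the smallest error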

Actual Homework Problem

  1. Classify each of the test recordings in both the “dev” and “eval” directories using the above procedure. Return the classification output.
  2. Report dev and eval accuracies as given earlier.

Problem 2C: Classification through Overcomplete Dictionaries on LDA vectors

Repeat problem 2B, but now using the normalized LDA-projected vectors $\hat{\w}$ computed in problem 2A.

Compare the performance of 2C with 2B, and comment on the difference.


Submission Details

The homework is due at 11:59pm on November 9, 2016.

Please use the HW2 submission template for your submission.

Solutions should be emailed to Chiyu, Anurag and Bhiksha.