\( \def\xx{\mathbf x} \def\xX{\mathbf X} \def\yy{\mathbf y} \def\bold#1{\bf #1} \)
Let's warm up with a simple problem.
[This] is a recording of "Polyushka Polye", played on the harmonica. It has been downloaded from YouTube with permission from the artist.
Here is a set of notes from a harmonica. You are required to transcribe the music: for the transcription, you must determine how each of the notes is played over time to compose the music.
You can use the MATLAB instructions given here to convert each note into a spectral vector, and the entire piece of music into a spectrogram matrix.
Simple projection of music magnitude spectrograms (which are non-negative) onto a set of notes will result in negative weights for some notes. To explain, let $\mathbf{M}$ be the (magnitude) spectrogram of the music. It is a matrix of size $D \times T$, where $D$ is the size of the Fourier transform and $T$ is the number of spectral vectors in the signal. Let $\mathbf{N}$ be a matrix of notes. Each column of $\mathbf{N}$ is the magnitude spectral vector for one note. $\mathbf{N}$ has size $D \times K$, where $K$ is the number of notes.
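For concreteness, here is a minimal MATLAB sketch of how $\mathbf{M}$ and $\mathbf{N}$ might be built; the filenames, the window/hop choices, and the use of the mean note spectrum are our illustrative assumptions, and the linked instructions are authoritative.

    % Hypothetical sketch (requires the Signal Processing Toolbox).
    % Filenames, FFT size, and hop are illustrative assumptions.
    [x, fs] = audioread('polyushka.wav');         % the music recording
    nfft = 2048; hop = nfft/4;
    win  = hann(nfft, 'periodic');
    C = spectrogram(x, win, nfft - hop, nfft);    % complex one-sided STFT
    M = abs(C);                                   % D x T magnitude spectrogram

    noteFiles = {'note1.wav', 'note2.wav'};       % hypothetical note recordings
    K = numel(noteFiles);
    N = zeros(size(M, 1), K);
    for k = 1:K
        xk = audioread(noteFiles{k});
        Ck = spectrogram(xk, win, nfft - hop, nfft);
        N(:, k) = mean(abs(Ck), 2);               % one spectral vector per note
    end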
Conventional projection of $\mathbf{M}$ onto the notes $\mathbf{N}$ computes the approximation \[ \widehat{\mathbf{M}} = \mathbf{N} \mathbf{W} \]
such that $||\mathbf{M} - \widehat{\mathbf{M}}||_F^2 = \sum_{i,j} (M_{i,j} - \widehat{M}_{i,j})^2$ is minimized. Here $||\mathbf{M} - \widehat{\mathbf{M}}||_F$ is known as the Frobenius norm of $\mathbf{M} - \widehat{\mathbf{M}}$. $M_{i,j}$ is the $(i,j)^{\rm th}$ entry of $\mathbf{M}$ and $\widehat{M}_{i,j}$ is similarly the $(i,j)^{\rm th}$ entry of $\widehat{\mathbf{M}}$. Please note the definition of the Frobenius norm; we will use it later.
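The Frobenius norm is built into MATLAB, so the definition above can be sanity-checked directly:

    A = randn(3, 4);        % any matrix
    norm(A, 'fro')^2        % built-in Frobenius norm, squared
    sum(A(:).^2)            % the elementwise definition above; identical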
$\widehat{\mathbf{M}}$ is the projection of $\mathbf{M}$ onto $\mathbf{N}$. $\mathbf{W}$, of course, is given by $\mathbf{W} = {\rm pinv}(\mathbf{N})\,\mathbf{M}$. $\mathbf{W}$ can be viewed as the transcription of $\mathbf{M}$ in terms of the notes in $\mathbf{N}$. So the $j^{\rm th}$ column of $\mathbf{M}$, which we denote $M_j$ and which is the spectrum in the $j^{\rm th}$ frame of the music, is approximated by the notes in $\mathbf{N}$ as \[ M_j \approx \sum_i N_i W_{i,j} \]
where $N_i$, the $i^{\rm th}$ column of $\mathbf{N}$, represents the $i^{\rm th}$ note, and $W_{i,j}$ is the weight assigned to the $i^{\rm th}$ note in composing the $j^{\rm th}$ frame of the music.
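Given $\mathbf{M}$ and $\mathbf{N}$ as above, this conventional projection takes two lines of MATLAB:

    W    = pinv(N) * M;     % least-squares weights, K x T
    Mhat = N * W;           % projection of M onto the notes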
The problem is that in this computation we will frequently find $W_{i,j}$ values to be negative. In other words, this model requires you to subtract some notes: since $W_{i,j} N_i$ will have negative entries if $W_{i,j}$ is negative, this is equivalent to subtracting the weighted note $|W_{i,j}|N_i$ from the $j^{\rm th}$ frame. Intuitively, this is an unreasonable operation; when we actually play music, we never unplay a note (which is what playing a negative note would amount to).
Also, $\widehat{\mathbf{M}}$ may have negative entries. In other words, our projection of $\mathbf{M}$ onto the notes in $\mathbf{N}$ can result in negative spectral magnitudes in some frequencies at certain times. Again, this is meaningless physically -- spectral magnitudes cannot, by definition, be negative.
In this homework problem we will try to fix this anomaly.
We will do this by computing the approximation $\widehat{\mathbf{M}} = \mathbf{N} \mathbf{W}$ with the constraint that the entries of $\mathbf{W}$ must always be greater than or equal to $0$, i.e. they must be non-negative. To do so we will use a simple gradient descent algorithm which minimizes the error $||\mathbf{M} - \mathbf{N}\mathbf{W}||_F^2$ subject to the constraint that all entries in $\mathbf{W}$ are non-negative.
We define the following error function: \[ E = \frac{1}{DT}||\mathbf{M} - \mathbf{N}\mathbf{W}||_F^2. \] where $D$ is the number of dimensions (rows) in $\mathbf{M}$, and $T$ is the number of vectors (frames) in $\mathbf{M}$.
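In MATLAB this error is a one-liner, which will also be handy for tracking convergence later:

    [D, T] = size(M);
    E = norm(M - N*W, 'fro')^2 / (D*T);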
Derive the formula for $\frac{dE}{d\mathbf{W}}$.
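As a check on your derivation (the steps are left to you): writing $E$ in trace form as $E = \frac{1}{DT}{\rm tr}\big((\mathbf{M}-\mathbf{N}\mathbf{W})^\top(\mathbf{M}-\mathbf{N}\mathbf{W})\big)$, the standard matrix-calculus result for quadratics of this form gives \[ \frac{dE}{d\mathbf{W}} = \frac{2}{DT}\,\mathbf{N}^\top(\mathbf{N}\mathbf{W} - \mathbf{M}). \]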
We define the following gradient descent rule to estimate $\mathbf{W}$ iteratively. Let $\mathbf{W}^0$ be the initial estimate of $\mathbf{W}$ and $\mathbf{W}^n$ the estimate after $n$ iterations.
We use the following projected gradient update rule: \[ \begin{aligned} \widehat{\mathbf{W}}^{n+1} &= \mathbf{W}^n - \eta \left.\frac{dE}{d\mathbf{W}}\right|_{\mathbf{W}^n} \\ \mathbf{W}^{n+1} &= \max(\widehat{\mathbf{W}}^{n+1}, 0) \end{aligned} \]
where $\frac{dE}{d\mathbf{W}}|_{\mathbf{W}^n}$ is the derivative of $E$ with respect to $\mathbf{W}$ computed at $\mathbf{W} = \mathbf{W}^n$, and $\max(\widehat{\mathbf{W}}^{n+1},0)$ is a component-wise flooring operation that sets all negative entries in $\widehat{\mathbf{W}}^{n+1}$ to 0.
In effect, our feasible set for values of $\mathbf{W}$ is $\mathbf{W} \succcurlyeq 0$, where the symbol $\succcurlyeq$ indicates that every element of $\mathbf{W}$ must be greater than or equal to 0. The algorithm performs a conventional gradient descent update, and projects any solution that falls outside the feasible set back onto the feasible set through the max operation.
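Here is a minimal MATLAB sketch of this projected gradient descent, assuming the gradient takes the least-squares form noted above; the function name is ours, and you should treat this as a starting point rather than a reference implementation.

    % Projected gradient descent for min (1/(D*T))*||M - N*W||_F^2, W >= 0.
    % Assumes dE/dW = (2/(D*T)) * N' * (N*W - M). Save as nnls_pgd.m.
    function [W, Ehist] = nnls_pgd(M, N, eta, niter)
        [D, T] = size(M);
        W      = ones(size(N, 2), T);              % W^0: all ones, as specified below
        Ehist  = zeros(niter, 1);
        for n = 1:niter
            grad = (2/(D*T)) * (N' * (N*W - M));   % dE/dW evaluated at W^n
            W    = max(W - eta*grad, 0);           % gradient step, then project
            Ehist(n) = norm(M - N*W, 'fro')^2 / (D*T);
        end
    end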
Implement the above algorithm. Initialize $\mathbf{W}$ to a matrix of all $1$s. Run the algorithm for $\eta$ values $(0.0001, 0.001, 0.01, 0.1)$. Run 250 iterations in each case. Plot $E$ as a function of iteration number $n$. Return this plot and the final matrix $\mathbf{W}$. Also show a plot of best error $E$ as a function of $\eta$.
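One way to run the sweep and produce the requested plots, reusing the hypothetical nnls_pgd sketch above (labels and layout are up to you):

    etas  = [0.0001 0.001 0.01 0.1];
    bestE = zeros(size(etas));
    figure; hold on;
    for i = 1:numel(etas)
        [W, Ehist] = nnls_pgd(M, N, etas(i), 250);
        plot(1:250, Ehist, 'DisplayName', sprintf('\\eta = %g', etas(i)));
        bestE(i) = min(Ehist);                  % best error for this eta
    end
    xlabel('iteration n'); ylabel('E'); legend('show');
    figure; semilogx(etas, bestE, 'o-');
    xlabel('\eta'); ylabel('best E');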
For the best $\eta$ (the one that resulted in the lowest error), recreate the music using this transcription as $\widehat{\mathbf{M}} = \mathbf{N}\mathbf{W}$, and resynthesize the music from $\widehat{\mathbf M}$. What does it sound like? You may return the resynthesized music to impress us (although we won't score you on it).
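If the linked instructions do not already provide a resynthesis routine, one common trick (our assumption, not part of the assignment) is to pair the approximated magnitudes with the phase of the original STFT and invert; in recent MATLAB releases, istft can perform the inversion given the same analysis parameters used above.

    Mhat = N * W;                        % transcription-based magnitudes
    X = Mhat .* exp(1j * angle(C));      % borrow the phase of the original STFT C
    y = istft(X, 'Window', win, 'OverlapLength', nfft - hop, ...
              'FFTLength', nfft, 'FrequencyRange', 'onesided');
    soundsc(real(y), fs);                % listen to the result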
Detailed instructions on how to submit the results are given here.
Solutions must be emailed to both TAs, and cc-ed to Bhiksha. The message must have the subject line "MLSP assignment 1". Remember to include your generated results. Don't delete them.
Solutions are due before Oct 4th, 2016 (i.e. by 23:59:59 on Oct 3rd).