[This] is a recording of "Polyushka Polye", played on the harmonica. It has been downloaded with permission from the artist
Below are a set of notes from a harmonica. You are required to transcribe the music. For transcription you must determine the note or set of notes being played at each instant.
Note | Wav file |
---|---|
E | e.wav |
F | f.wav |
G | g.wav |
A | a.wav |
B | b.wav |
C | c.wav |
D | d.wav |
E2 | e2.wav |
F2 | f2.wav |
G2 | g2.wav |
A2 | a2.wav |
To transcribe the music you must obtain a matrix of 1s and 0s, where rows correspond to notes and columns correspond to time. A note-time $(n,t)$ entry in the matrix will be 1 if the correspnding note $n$ is present at time $t$.
To do so, you must determine the weights with which the notes must be combined to obtain the closest approximation to the music at that time. The matrix must be assigned an entry of 1 for each of the notes which have a weight greater than 0.
Download the following matlab files: [stft.m]
You can read a wav file into matlab as follows:
[s,fs] = wavread('filename');
s = resample(s,16000,fs);
The recordings of the notes can be computed to a spectrum as follows:
spectrum = mean(abs(stft(s',2048,256,0,hann(2048))),2);
'spectrum' will be a $1025 \times 1$
The recordings of the complete music can be read just as you read the notes. To convert it to a spectrogram, do the following:
sft = stft(s',2048,256,0,hann(2048));
sphase = sft./(abs(sft)+eps);
smag = abs(sft);
The 'eps' used here is a very tiny Matlab value to prevent sphase from getting 'NAN' at some locations. The 'smag' will be a $1025\times K$ matrix where $K$ is the number of spectral vectors in the matrix. We will also need 'sphase' to reconstruct the signal later.
Compute the spectrum for each of the notes. Compute the spectrogram matrix 'smag' for the music signal. This matrix is composed of $K$ spectral vectors. Each vector represents 16 milliseconds of the signal.
The output should be in the form of a matrix:
1 1 0 0 0 0 0 1 ...
0 0 0 1 1 0 1 1 ...
0 1 1 1 0 1 1 1 ...
...........................
Each row of the matrix represents one note. Hence there will be as many rows as you have notes in table 1.
Each column represents one of the columns in the spectrogram for the music. So if there are $K$ vectors in the spectrogram, there will be $K$ vectors in your output.
Each entry will denote if a note was found in that vector or not. For instance, if matrix entry $(4,25) = 0$, then the fourth note (d) was not found in the 25th spectral vector of the signal.
This is not for credit. However you may find it instructive in understanding the results from part 3 of this homework.
You can use the notes and the transcription matrix thus obtained to synthesize audio. Note that matrix multiplying the notes and the transcription will simply give you the magnitude spectrum. In order to create meaningful audio, you will need to use the phases as well. Once you have the phases included, you can use the stft to synthesize a signal from the matrix. Submit the synthesized audio along with the matrix.
Deliverables, Formats and Bonus Question
You should strictly follow the instructions and the given Matlab template to submit your code and results. The organization of submission files as well the structure of code are NOT meant to be changed. Use the given variable names and function arguments rather than changing them. Don't move the data files to other places. When storage of results is required, store your results of each problem in the 'results' folder in each problem's folder. Do name your root folder as well as report with your Andrew ID.
In particular, here are what we expected for Problem 1:
1. Fill the function ''load_data.m" with your code to load the notes and the music. Output their (mean) spectrum magnitudes and phases as instructed in the code.
2. Fill the function "transcribe_music.m" with your code to compute the following 4 matrices: 1. transcribe matrix; 2. thresholded transcribe matrix; 3. projection matrix; 4. reconstructed music spectrum magnitude with thresholded transcribe matrix. You may play with different threshold (T) values, but when submitting your code and results, use 0.1 as the threshold.
3. (Optional) Fill the function "synthesize_music.m" with your code to return a synthesized music. Although it's optional, we do strongly recommend you finish it, as it is very simple and only takes two lines of matlab code. Once you finish it, you will get a better idea how stft and basis projection works. If you have implemented your code correctly, you should generate a synthesized music that sounds reasonable.
4. Fill "Run_Problem1.m" to run it as the 'main' function calling other functions. You should: 1. Output your thresholded transcribed matrix as 'transMatT.mat' and save it to the 'results' folder in 'Problem1'. 2. (Optional) Output your synthesized music as 'polyushka_syn.wav' also to this folder.
5. (Bonus) Prove the following question in your report: Suppose $\mathbf{W}$ is the $1025\times 11$ note spectrum magnitude matrix, $\mathbf{x}_k$ is the $k$th column of the nonthresholded transcribed matrix and $\mathbf{m}_k$ is the music spectrum magnitude at time $k$, then your implementation of computing $\mathbf{x}_k$ in Question 2 is a minimizer of $\|\mathbf{W}\mathbf{x}_k - \mathbf{m}_k\|^2$. Then consider the projection matrix learned in the class, what's the connection? Now do you understand why the projection matrix you learned is formed in this way?
You do not need to paste your code into the report. But make sure you include all codes and the data files. Your submitted code should be runable with one click.
Let's warm up with a simple problem.
A rotation in 3-D space is characterized by two angles. We will characterize them as a rotation along the $X-Y$ plane, and a rotation along the $Y-Z$ plane. Derive the equations that transform a vector $[x, y, z]^\top$ to a new vector $[x', y', z']^\top$ by rotating it counterclockwize by angle $\theta$ along the $X-Y$ plane and by an angle $\delta$ along the $Y-Z$ plane. Represent this as a matrix transformation of the column vector $[x, y, z]^\top$ to the column vector $[x', y', z']^\top$. The matrix that transforms the former into the latter is a rotation matrix.
A $4\times 1$ vector $v$ of length (l-2 norm) 4 is transformed by $A$ as $u = Av$. What is the longest length (l-2 norm) that $u$ can be? What is the shortest length of $u$?
1. You must prove the three questions algebraically in your report and show detailed steps.
2. It is optional to include simulation codes in your submitted files. However you must present your simulation results in your report.
We warm up for problem 3 with basic matrix calculus.
We begin by stating a few rules we will use. Everywhere the notation $x_i$ refers to the $i^{\rm th}$ component of a vector ${\mathbf x}$ and $x_{i,j}$ refers to the $(i,j)^{\rm th}$ component of a matrix ${\mathbf X}$.
Note the order of the indices in the following
The transpose of any $N_1 \times N_2$ matrix is an $N_2 \times N_1$ matrix. We will represent the transposition operation by the superscript $\top$. Let ${\mathbf Y} = {\mathbf X}^\top$. Then $y_{i,j} = x_{j,i}$ (i.e. the $(i,j)$-th component of ${\mathbf Y}$ is the $(j,i)$-th component of ${\mathbf X}$).
For the purposes of the computation here, we will expand the notion of transposes to tensors. Let ${\mathbf Y}$ be any $N_1 \times N_2 \times \cdots \times N_K$ tensor. By our definition ${\mathbf X} = {\mathbf Y}^\top$ is an $N_K \times N_{K-1} \times \cdots \times N_1$ tensor, whose components are given by $x_{i_1,i_2,\cdots,i_K} = y_{i_K,i_{K-1},\cdots,i_i}$.
Using the above definitions, we can also write the following chain rule.
Because of the zero-valued off diagonal elements, tensor transposition has no effect on the tensor. Multiplying this tensor with any $M\times 1$ vector ${\mathbf t}$ has the following geometric interpretation. The figure below illustrates the operations.
The depth dimension of the tensor, which is its trailing dimension, multiplies the vector. The highlighted points and the arrows indicate corresponding elements that are multiplied together. The products are summed.Clearly, since only the diagonal elements have non-zero values, only one of the component-wise products has a non-zero value, and the sum takes this value.
This has the effect that for any $M \times 1$ vector ${\mathbf t}$, the matrix ${\mathbf S} = {\mathbf V}^\top{\mathbf t}$ is an $M \times N$ matrix whose components (according to rule 9) are given by \[ s_{k,j} = \sum_{i} v_{i,j,k}t_i = t_ky_j \] The indices can be verified by referring to the illustration above.
Simple projection of music magnitude spectrograms (which are non-negative) onto a set of notes will result in negative weights for some notes. To explain, let $\mathbf{M}$ be the (magnitude) spectrogram of the music. It is a matrix of size $D \times T$, where $D$ is the size of the Fourier transform and $T$ is the number of spectral vectors in the signal. Let $\mathbf{N}$ be a matrix of notes. Each column of $\mathbf{N}$ is the magnitude spectral vector for one note. $\mathbf{N}$ has size $D \times K$, where $K$ is the number of notes.
Conventional projection of $\mathbf{M}$ onto the notes $\mathbf{N}$ computes the approximation \[ \widehat{\mathbf{M}} = \mathbf{N} \mathbf{T} \]
such that $||\mathbf{M} - \widehat{\mathbf{M}}||_F^2 = \sum_{i,j} (M_{i,j} - \widehat{M}_{i,j})^2$ is minimized. Here $||\mathbf{M} - \widehat{\mathbf{M}}||_F$ is known as the Frobenius norm of $\mathbf{M} - \widehat{\mathbf{M}}$. $M_{i,j}$ is the $(i,j)^{\rm th}$ entry of $\mathbf{M}$ and $\widehat{M}_{i,j}$ is similarly the $(i,j)^{\rm th}$ entry of $\widehat{\mathbf{M}}$. Please note the definition of the Frobenius norm; we will use it later.
$\widehat{\mathbf{M}}$ is the projection of $\mathbf{M}$ onto $\mathbf{N}$. $\mathbf{T}$, of course, is given by $\mathbf{T} = pinv(\mathbf{N}) \mathbf{M}$. $\mathbf{T}$ can be viewed as the transcription of $\mathbf{M}$ in terms of the notes in $\mathbf{N}$. So, the $j^{\rm th}$ column of $\mathbf{M}$, which we represent as $M_j$ and is the spectrum in the $j^{\rm th}$ frame of the music, is approximated by the notes in $\mathbf{N}$ as \[ M_j = \sum_i N_i T_{i,j} \]
where $N_i$, the $i^{\rm th}$ column of $\mathbf{N}$ and represents the $i^{\rm th}$ note and $T_{i,j}$ is the weight assigned to the $i^{\rm th}$ note in composing the $j^{\rm th}$ frame of the music.
The problem is that in this computation, we will frequently find $T_{i,j}$ values to be negative. In other words, this model requires you to subtract some notes — since $T_{i,j} N_i$ will have negative entries if $T_{i,j}$ is negative, this is equivalent to subtracting note the weighted note $|T_{i,j}|N_i$ in the $j^{\rm th}$ frame. Clearly, this is an unreasonable operation intuitively — when we actually play music, we never unplay a note (which is what playing a negative note would be).
Also, $\widehat{\mathbf{M}}$ may have negative entries. In other words, our projection of $\mathbf{M}$ onto the notes in $\mathbf{N}$ can result in negative spectral magnitudes in some frequencies at certain times. Again, this is meaningless physically -- spectral magnitudes cannot, by definition, be negative.
In this homework problem we will try to fix this anomaly.
We will do this by computing the approximation $\widehat{\mathbf{M}} = \mathbf{N} \mathbf{T}$ with the constraint that the entries of $\mathbf{T}$ must always be greater than or equal to $0$, i.e. they must be non-negative. To do so we will use a simple gradient descent algorithm which minimizes the error $||\mathbf{M} - \mathbf{N}\mathbf{T}||_F^2$ subject to the constraint that all entries in $\mathbf{T}$ are non-negative.
We define the following error function: \[ E = ||\mathbf{M} - \mathbf{N}\mathbf{T}||_F^2. \]
Derive the formula for $\frac{dE}{d\mathbf{T}}$.
We define the following gradient descent rule to estimate $\mathbf{T}$. It is an iterative estimate. Let $\mathbf{T}^0$ be the initial estimate of $\mathbf{T}$ and $\mathbf{T}^n$ the estimate after $n$ iterations.
We use the following update rule \[ \widehat{\mathbf{T}}^{n+1} = \mathbf{T}^n - \eta \frac{dE}{d\mathbf{T}}|_{\mathbf{T}^n} \\ \mathbf{T}^{n+1} = \max(\widehat{\mathbf{T}}^{n+1},0) \]
where $\frac{dE}{d\mathbf{T}}|_{\mathbf{T}^n}$ is the derivative of $E$ with respect to $\mathbf{T}$ computed at $\mathbf{T} = \mathbf{T}^n$, and $\max(\widehat{\mathbf{T}}^{n+1},0)$ is a component-wise flooring operation that sets all negative entries in $\widehat{\mathbf{T}}^{n+1}$ to 0.
3.b.II.i. Implement the above algorithm. Initialize $\mathbf{T}$ to a matrix of all $1$s. Set $\eta$ to 0.01. Run 100 iterations. Plot $E$ as a function of iteration number $n$. Return this plot and the final matrix $\mathbf{T}$. Revise the value of $\eta$ if $E$ did not converge. Also add the $E$ plot using your revised $\eta$ value in the report.
3.b.II.ii. No points for this one, but try this. Transcribe the music: set all values of $\mathbf{T}$ above the threshold 0.1 to 1.0 and the rest to 0, and return the resulting binary transcription. Let this matrix be $\mathbf{T}_b$ (the subscript $b$ representing binary). Now recreate the music using this transcription as $\mathbf{M}_b = \mathbf{N} {\mathbf{T}}_b$. You can regenerate audio from ${\mathbf M}_b$. What does it sound like? You may return the resynthesized music to impress us (although we won't score you on it).
The assignment is due at the beginning of class on Oct 6th, 2015. The assignment is worth 15 points. Each day of delay thereafter will automatically deduct 3 points from your score. Everyone should strictly follow the instructions and templates, or you may lose considerable amount of hw scores.
Solutions must be emailed to both TAs, and cc-ed to Bhiksha. The message must have the subject line "MLSP assignment 1". Remember to include your generated results. Don't delete them.
The template for assignment 1 is available here: hw1_template.zip