11-755 MLSP Homework 1

11-755 MLSP Homework: Linear Algebra Refresher

Problem

Below are links to pieces of music and recordings of several notes. You are required to "transcribe" the music.

For transcription, you will have to determine the note or set of notes being played at each time.

Blowin' in the wind

This tgz file contains a recording of a harmonica piece rendering the song "blowin' in the wind". Also given are a collection of notes, and an example musical scale. Transcribe both the musical scale and the main song in terms of the notes.

Polyushka Polye

This is a recording of "Polyshka Polye", played on the harmonica. It has been downloaded from youtube (with permission from the artist).

Below are a set of notes from a harmonica.

Note Wav File

E e.wav

F f.wav

G g.wav

A a.wav

B b.wav

C c.wav

D d.wav

E2 e2.wav

F2 f2.wav

G2 g2.wav

A2 a2.wav

Download the following matlab files: stft.m

Matlab instructions

You can read a wav file into matlab as follows:
[s,fs] = wavread('filename'); s = resample(s,16000,fs);

The recordings of the notes can be computed to a spectrum as follows:
spectrum = mean(abs(stft(s',2048,256,0,hann(2048))),2);
“spectrum” will be a 1025 x 1 vector.

The recordings of the complete music can be read just as you read the notes. To convert it to a spectrogram do the following:
sft = stft(s',2048,256,0,hann(2048)); sphase = sft./abs(sft); smag = abs(sft);
“smag” will be a 1025 x K matrix (K is the number of spectral vectors in the matrix. We will also need “sphase” to reconstruct the signal later.

Additional Info

Compute the spectrum for each of the notes. Compute the spectrogram matrix “smag” for the music signal. This matrix is composed of K spectral vectors. Each vector represents 16 milliseconds of the signal.

You may find, projections, pseudo inverses, and dot products useful. If you know of any other techniques, you can use those too. Tricks like thresholding (setting all values of some variable that fall below a threshold to 0) might also help.

The output should be of the form of a matrix :

1 1 0 0 0 0 0 1 . . .

0 0 0 1 1 0 1 1 . . .

0 1 1 1 0 1 1 1 . . .

. . . . . . . . . . .

Each row of the matrix represents one note. Hence there will be as many rows as you have notes in table 1.

Each column represents one of the columns in the spectrogram for the music. So if there are K vectors in the spectrogram, there will be K vectors in your output.

Each entry will denote if a note was found in that vector or not. For instance, if matrix entry (4,25) = 0, then the fourth note (d) was not found in the 25th spectral vector of the signal.

Synthesizing Audio

You can use the notes and the transcription matrix thus obtained to synthesize audio. Note that matrix multiplying the notes and the transcription will simply give you the magnitude spectrum. In order to create meaningful audio, you will need to use the phases as well. Once you have the phases included, you can use the stft to synthesize a signal from the matrix. Submit the synthesized audio along with the matrix.

Linear Algebra

Lets warm up with a simple problem:

Rotation Matrices

A rotation in 3-D space is characterized by two angles. We will characterize them as a rotation along the X-Y plane, and a rotation along the Y-Z plane. Derive the equations that transform a vector (x, y, z) to a new vector (x', y', z') by rotating it counterclockwize by angle θ along the X-Y plane and by an angle δ along the Y-Z plane. Represent this as a matrix transformation of the column vector (x, y, z)' to the column vector (x', y', z')'. The matrix that transforms the former into the latter is a rotation matrix.

Projecting Instrument Notes

For this problem you will transform the harmonica notes of problem 1 to piano notes, by a matrix transform. The piano notes can be downloaded from here. Note that, in this case, you don't know which piano notes correspond to which notes from the harmonica. There are 3 parts to this problem:

Find the piano note corresponding to each note from the harmonica. The dot product is your friend.
Find a transformation that converts the harmonica notes to piano notes. To do so, you must list the spectra for all hamonica notes as a matrix H. List the correspnding piano notes as a matrix P. There must be a one-to-one correspondence between the notes represented by the columns of H and those represented by the columns of P. The and provide the resulting rotation matrix in your solution. The desired transformation is a matrix M such that MH ≈ P. Provide the matrix M.
Synthesize the music piece from Problem 1, using both the actual piano notes and those obtained by transforming the harmonica notes. Submit both synthesized recordings.

Exercising Audio Characterization

In the previous problems, the harmonica (or piano) notes that produce music when combined in some manner can be thought of as basis vectors for the music. That is to say, a linear combination of those vectors produces the music. Later in the class, we will look at methods that try to learn such bases automatically for different types of audio, such as a person's speech, music from a certain instrument, background noise etc.

One of the problems with learning this representation automatically is that the audio used to learn the bases for a certain kind of sound must contain that sound only and no other. If it contains other sound, or even an absence of sound, then the model that is learnt will capture something other than the intended source. It is easy to see that gathering real data for a task like this will result in non-optimal data sets.

Consider the case where the system is attempting to learn a set of basis vectors for a particular speaker. Listen to the data here for an example of data for the speaker collected in a natural environment. You will notice that the speaker is not speaking in the initial and end portions of the audio file. Also, there is background noise from an idling automobile throughout the video. Even if the recording was controlled in order to ensure that the speaker was speaking for the duration of the audio, the background noise would still appear. However, at the very least, we would like to extract the segments of audio where the speaker is present. One simple heuristic that is used for this kind of task is called thresholding.

Thresholding

In this problem, you will implement thresholding and look at the characteristics of the audio.

Read in the signal, and accept a numeric input parameter, cutoff (which is the threshold cutoff).
Compute the spectrogram for the signal
- Use an analysis window of 60ms, with a hop size of 15ms, to convert converting signals to its spectrographic representation.
- Given the window size, hop size and the frequency of the audio signal, you can now compute which signal elements would be a part of each spectral frame.
Compute the average power in db of each spectral frame. Suppose the variable e contains the elements in the frame in the signal domain. Then the average power in db is given by p = 10log10(mean(Σ_i e_i²))
Plot the average power for each frame.
Create a histogram for the average power and using 50 bins (for Matlab's hist command). You should see 2 peaks at different points in the histogram (i.e. the frequency of occurrence increases form an initial point to a peak, then drops for a while, before increasing again and reaching another peak before starting to drop once again).
Return a matrix containing only those frames whose average power is above the cutoff.

Interpreting thresholding

Attach your plots of the average power and the histogram to your solutions.
Based on what you saw in the plot of the power, can you suggest a way for extracting only those frames that contain the speaker?
Why are there 2 peaks in the histogram for the average power?
. This recording is a segment of audio, which has a short region of “no speech” at the beginning and the end. The leader and trailer do not have much noise. This other recording is similar, except that the leading and training segments have loud noise in the background. Run your code on these signals and plot the average power and the histogram, as earlier.

Comment briefly on your observations. What differences do you see between the 2 sets of plots, why do you think there is/should be a difference, how do these differences change with the level of background noise, etc.

Due date

The assignment is due in 3 weeks (September 27th). The assignment is worth 15 points. Each day of delay thereafter will automatically deduct 0.5 points from your score.

Solutions may be emailed to me, Sourish or Sohail. The message must have the subject line "MLSP assignment 1". It should include a 1 page report of what you did (can be longer), and the resulting matrix, as well as the synthesized audoi.

Note	Wav File
E	e.wav
F	f.wav
G	g.wav
A	a.wav
B	b.wav
C	c.wav
D	d.wav
E2	e2.wav
F2	f2.wav
G2	g2.wav
A2	a2.wav