15-826 HW3 Spring 2007

CARNEGIE MELLON UNIVERSITY

15-826 - Multimedia databases and data mining

Spring 2007

Homework 3 - Due: Tuesday April 24, 1:30pm, in class.

Important:

Due date: Tuesday April 24, 1:30pm, hard copy in class and soft copy e-mailed to the TA in a single e-mail.
For all questions please BOTH:
- e-mail the TA a soft-copy of all code, output (including images of plots) and auxiliary scripts tar'ed up in a single file. Please name this file <your_andrew_id>_hw3.tar
- hand in a hard-copy of all code, output and auxiliary scripts
Please turn in a typed report - handwritten material may not be graded, at the grader's discretion.
All homeworks, including this one, are to be done INDIVIDUALLY.
For all plots, in addition to a hardcopy printout, please submit an electronic version in postscript, pdf, gif, jpeg, or png format.
Wherever numerical answers are asked for please provide up to 3 significant digits of accuracy.
Expected effort for this homework:
- Q1: 1 hour
- Q2: 1 hour
- Q3: 1 hour
- Q4: 2 hours
- Q5: 1 hour
- Q6: 2 hours
- Q7: 4 hours

Q1: 1-D DFT [10 pts]

Apply DFT to the following datasets.

For each dataset hand in:

[1 pt] The time-plot (value vs time) of the original dataset,
[5 pt] The amplitude-frequency ("amplitude spectrum") plot for DFT.
[2 pt] What can you say given these plots, e.g. how many strong frequencies are there?
[1 pt] Give the sum of squares of the time values.
[1 pt] Give the sum of squares of the amplitudes.
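The two sums of squares asked for above are related by Parseval's theorem. The following is a minimal sketch, not the course's reference code: it assumes a DFT scaled by 1/sqrt(N) (other scalings change the amplitude sum by a constant factor), and the toy signal, the names `dft`, `signal`, `strong`, and the "a quarter of the maximum" threshold are all illustrative assumptions.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) DFT, scaled by 1/sqrt(N) (an assumed convention)."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]

# Hypothetical toy signal: two sinusoids sampled at 64 points.
n = 64
signal = [
    math.sin(2 * math.pi * 4 * t / n) + 0.5 * math.cos(2 * math.pi * 10 * t / n)
    for t in range(n)
]

spectrum = dft(signal)
amplitudes = [abs(c) for c in spectrum]  # values for the amplitude spectrum plot

energy_time = sum(v * v for v in signal)      # sum of squares of the time values
energy_freq = sum(a * a for a in amplitudes)  # sum of squares of the amplitudes

# "Strong" frequencies: bins above a quarter of the largest amplitude (arbitrary cutoff).
strong = [f for f, a in enumerate(amplitudes) if a > 0.25 * max(amplitudes)]

print(energy_time, energy_freq)
print(strong)
```

With this scaling the two energies agree up to rounding, and each real sinusoid shows up as a pair of mirrored bins f and N-f.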

Q2: 1-D DWT [10 pts]

Apply DWT using Haar wavelets (code available here) and DFT to this simulated heart beat.

What to hand in:

[1 pt] The time-plot of the original dataset,
[1 pt] The amplitude-frequency ("amplitude spectrum") plot for DFT.
[1 pt] What can you say given these plots?
[2 pt] Plot the scalogram for DWT as a set of histograms (1 per scale). Do the full-level decomposition of the signal. The link to xwpl seems to be broken, so feel free to use anything you wish, like matlab's wavelet toolbox, or see BBoard for alternatives. The histograms should look like the examples starting on slide #100 in these slides, that is:
```
  for (scale in 0 to 9){
  	give the 2^scale coefficients as a histogram over time
  }
  also give the smooth coefficient
  
```
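The per-scale layout in the pseudocode above can be sketched as follows. This is an illustrative orthonormal Haar decomposition, not the Haar code linked from the assignment; the function name `haar_dwt` and the 8-point spike signal are assumptions made for the example. For a 1024-point signal it yields scales 0 to 9 with 2^scale coefficients each, plus one smooth coefficient.

```python
import math

def haar_dwt(signal):
    """Full Haar decomposition of a length-2^k signal.

    Returns (smooth, details), where details[s] holds the 2**s detail
    coefficients at scale s (scale 0 is the coarsest), in time order.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(signal)
    levels = []  # collected finest scale first
    r = 1.0 / math.sqrt(2.0)
    while len(approx) > 1:
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        levels.append([(a - b) * r for a, b in pairs])  # detail coefficients
        approx = [(a + b) * r for a, b in pairs]        # smooth part, halved in length
    levels.reverse()  # coarsest first, so details[s] has 2**s entries
    return approx[0], levels

# Hypothetical 8-point signal with a single spike.
smooth, details = haar_dwt([0, 0, 0, 4, 0, 0, 0, 0])
for scale, coeffs in enumerate(details):
    print("scale", scale, [round(c, 3) for c in coeffs])
print("smooth", round(smooth, 3))
```

Each printed row corresponds to one histogram of the scalogram; a spike in time shows up as one non-zero coefficient per scale, all at the same relative position.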

[1 pt] What can you say given these plots? Any dominant frequencies? Any spikes?

[1 pt] The time plot of the reconstructed version using only the top 5 DFT coefficients
Specifically, pick the 5 best complex numbers of the DFT, that can help you reconstruct the original signal. For the upcoming comparison with DWT, we shall use the concept of 'degrees of freedom' DOF, counted as follows:
- every complex number X_f contributes 2 towards the degrees of freedom
- except for X₀ and X₅₁₂, which by theorem have zero imaginary part
- Also, when you choose X_f, you can (and should) use X_{N-f} for free.
For the top 5 DFT coefficients, the degrees of freedom DOF would be approximately DOF=10 (or a bit less).

[1 pt] Give the root mean squared error (RMSE) between this reconstructed version and the original, where RMSE is defined as the square root of the average squared differences between the true values, y_i, and the reconstructed values, \hat{y}_i, over all N time ticks:

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}

[1 pt] Give the time plot of the reconstructed version using only the top DOF DWT coefficients, where DOF is the degrees of freedom for DFT, defined above. (Notice that for DWT all coefficients are real, and thus each coefficient corresponds to 1 degree of freedom).

[1 pt] The root mean squared error (RMSE) between this reconstructed version and the original
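The DFT side of the comparison in Q2 (top coefficients, degrees of freedom, RMSE) can be sketched as below. It is one possible reading, under stated assumptions: N is even, "top" means largest amplitude among f = 0..N/2, and the DFT is scaled by 1/sqrt(N). The function names are hypothetical.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)) / math.sqrt(n)
        for f in range(n)
    ]

def inverse_dft(coeffs):
    n = len(coeffs)
    return [
        sum(coeffs[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / math.sqrt(n)
        for t in range(n)
    ]

def top_k_reconstruction(x, k):
    """Keep the k strongest coefficients among f = 0..N/2 plus their mirrors X_{N-f}.

    Returns (reconstructed signal, degrees of freedom). Assumes N is even.
    """
    n = len(x)
    spectrum = dft(x)
    keep = sorted(range(n // 2 + 1), key=lambda f: abs(spectrum[f]), reverse=True)[:k]
    kept = [0j] * n
    dof = 0
    for f in keep:
        kept[f] = spectrum[f]
        kept[(n - f) % n] = spectrum[(n - f) % n]  # conjugate mirror, used "for free"
        dof += 1 if f in (0, n // 2) else 2        # X_0 and X_{N/2} are real
    return [v.real for v in inverse_dft(kept)], dof

def rmse(y, y_hat):
    """Square root of the average squared difference."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

# Hypothetical signal: a constant offset plus one cosine.
x = [3 + 2 * math.cos(2 * math.pi * 3 * t / 32) for t in range(32)]
rec, dof = top_k_reconstruction(x, 2)
print(dof, rmse(x, rec))
```

The returned `dof` is the budget that the DWT reconstruction would then be allowed to spend, one real coefficient per degree of freedom.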

Q3: 2D DFT and DWT [10 pts]

Download this 2D black and white image of 16*16 pixels, each with intensity 0 or 1. The origin of the image (x=1, y=1) is in the top left corner (like a matrix). The format for the data is "x y value" where value is the black-white intensity (0 = black, 1 = white).

Hand in the following:

[2 pt] The 2-d image of the dataset,
[2 pt] Draw the 2-d grey-scale image of the reconstructed version using only the top 3 DFT coefficients as defined in Q2 above. The degrees of freedom (DOF) should be about 6.

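[2 pt] What is the root mean squared error (RMSE) between this reconstructed version and the original, where RMSE is defined as before, but this time averaging over all pixels (ie, all values of i and j) of the 16*16 image:

RMSE = \sqrt{\frac{1}{16 \cdot 16} \sum_{i=1}^{16} \sum_{j=1}^{16} (y_{ij} - \hat{y}_{ij})^2}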

[2 pt] Draw the reconstructed version using only the top DOF DWT coefficients, as defined in Q2 above. The idea, again, is to make it a fair comparison between DFT and DWT. (Use Haar wavelets: DWT each row, then each column).

[2 pt] What is the root mean squared error (RMSE) between this reconstructed version and the original?
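The "DWT each row, then each column" recipe from Q3 can be sketched as follows. This is an assumed orthonormal Haar variant with hypothetical function names; it covers only the forward transform, and the 4x4 all-white image is a stand-in for the 16*16 dataset.

```python
import math

def haar_1d(v):
    """Full orthonormal Haar transform of a length-2^k list.

    Output layout: [smooth, coarsest detail, ..., finest details].
    """
    out = list(v)
    n = len(out)
    r = 1.0 / math.sqrt(2.0)
    while n > 1:
        half = n // 2
        sums = [(out[2 * i] + out[2 * i + 1]) * r for i in range(half)]
        diffs = [(out[2 * i] - out[2 * i + 1]) * r for i in range(half)]
        out[:n] = sums + diffs  # smooth part first, details after it
        n = half
    return out

def haar_2d(image):
    """Transform every row, then every column of the row-transformed image."""
    rows = [haar_1d(row) for row in image]
    cols = [haar_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to row-major

# Hypothetical 4x4 all-white image.
image = [[1.0] * 4 for _ in range(4)]
coeffs = haar_2d(image)
print([[round(c, 3) for c in row] for row in coeffs])
```

Because the transform is orthonormal, the sum of squared coefficients equals the sum of squared pixel values, so zeroing the smallest coefficients is a natural way to pick the "top" ones.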

Q4: Iterated Function Systems [15 pts]

Dragon Curve

Write code to generate and plot the following curve for any order n:

The picture below (figure 1) shows four iterations of the dragon curve, from order 0 on the left, to order 3 on the right. The white lines in each image are the current state of the curve. At each iteration, each line segment is split into two line segments by forming a 45-45-90 degree triangle (right angles shown as yellow boxes) with the original line segment (in red) as the hypotenuse. The orientation of this new triangle oscillates, so in the order-2 curve, the lower triangle is formed to the right of the red line, while the upper triangle is formed to the left of the red line.

(Figure 1: the dragon curve at Order 0, Order 1, Order 2 and Order 3, from left to right)
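The splitting rule described for the dragon curve can be sketched as below. Which side the first triangle falls on is an assumption (left of the first segment here); flipping the initial `side` mirrors the curve. The function name and the unit starting segment are illustrative.

```python
def dragon_segments(order, start=(0.0, 0.0), end=(1.0, 0.0)):
    """Return the 2**order line segments of the dragon curve as ((x1, y1), (x2, y2)) pairs."""
    segments = [(start, end)]
    for _ in range(order):
        refined = []
        side = 1.0  # +1: apex to the left of the segment, -1: to the right
        for (px, py), (qx, qy) in segments:
            dx, dy = qx - px, qy - py
            # Apex of the 45-45-90 triangle whose hypotenuse is p->q:
            # the midpoint plus half of the segment rotated by 90 degrees.
            mx = (px + qx) / 2.0 - side * dy / 2.0
            my = (py + qy) / 2.0 + side * dx / 2.0
            refined.append(((px, py), (mx, my)))
            refined.append(((mx, my), (qx, qy)))
            side = -side  # the orientation alternates from one segment to the next
        segments = refined
    return segments

segments = dragon_segments(6)
# One "x y" line per endpoint, with a blank line between segments (gnuplot-friendly).
for (x1, y1), (x2, y2) in segments[:2]:
    print(x1, y1)
    print(x2, y2)
    print()
```

Each iteration doubles the number of segments and shrinks their length by a factor of sqrt(2).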

Please submit:
1. [3 pts] The plot of the curve at order 6 (your code should generate a set of 2-d line segments, which you can plot using your favorite utility, eg, gnuplot, xgraph, etc)
2. [2 pts] The fractal dimension of the curve in the limit (as n goes to infinity)
3. [2 pts] Your code

X-shape

Get the IFS code here. Now use it to plot the following X-shape. The picture below (figure 2) shows one iteration of the X-shape. The square is subdivided into nine sub-squares, with the four corners and center square filled, and the rest left empty. For the next iteration, each filled sub-square is further subdivided and filled according to the same pattern.

(Figure 2)
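Barnsley's random-iteration idea for this X-shape can be sketched as below. This is not the course's IFS code or its `input.parms` format; it is a standalone illustration that assumes the unit square, five maps that each shrink by 1/3, and hypothetical names (`FILLED_CELLS`, `x_shape_points`, `burn_in`).

```python
import random

# (column, row) of the five filled sub-squares in the 3x3 grid: four corners and the center.
FILLED_CELLS = [(0, 0), (2, 0), (1, 1), (0, 2), (2, 2)]

def x_shape_points(n_points, seed=0, burn_in=20):
    """Generate points on the X-shape by applying a randomly chosen map at every step."""
    rng = random.Random(seed)
    x, y = 0.5, 0.5  # arbitrary starting point inside the unit square
    points = []
    for i in range(n_points + burn_in):
        cx, cy = rng.choice(FILLED_CELLS)
        # Shrink the current point by 1/3 and move it into the chosen sub-square.
        x, y = (x + cx) / 3.0, (y + cy) / 3.0
        if i >= burn_in:  # drop the first few points, which still depend on the start
            points.append((x, y))
    return points

points = x_shape_points(1000)
print(points[:3])
```

Every generated point lands in one of the five filled sub-squares, and the same holds recursively inside each of them.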

Please submit:
1. [2 pts] The plot of 1,000 points that you generated using Barnsley's IFS algorithm (in the ifs code)
2. [2 pts] The fractal dimension of the x-shape
3. [2 pts] The correlation integral (distance vs # pairs) of your 1,000 points (you can use the code here)
4. [2 pts] Your code (ie, the parts of the IFS code that you modified. Eg, input.parms)
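The correlation integral mentioned above counts, for each distance r, how many pairs of points lie within r of each other. A minimal quadratic-time sketch follows; the function name and the four test points are hypothetical, and it is separate from the code linked in the assignment.

```python
import math

def correlation_integral(points, radii):
    """For each radius r (ascending), count the unordered pairs of points within distance r."""
    dists = sorted(
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )
    counts = []
    k = 0
    for r in sorted(radii):
        while k < len(dists) and dists[k] <= r:
            k += 1
        counts.append((r, k))
    return counts

# Hypothetical example: the four corners of the unit square.
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(correlation_integral(corners, [0.5, 1.0, 1.5]))
```

Plotting the counts against r on log-log axes gives the usual picture; the slope of the linear part is an estimate of the correlation fractal dimension.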

Q5: SVD [10 pts]

Consider the following Document by Term matrix: DocTerm

The file is in csv (comma-separated-values) format with Windows end-of-line termination. In this file each row denotes a document, and the columns denote terms. An entry (i,j) denotes the number of occurrences of the j^th term in the i^th document. Based on their occurrence in the documents, the terms can be clustered as belonging to a particular Topic. Run SVD on the dataset to determine the number of topics present. You can use any standard software (e.g., Matlab).

Hand in:

[3 pts] The SVD components (the U, lambda, V matrices, specifying which is which).
[1 pt] Your estimate of the number of topics for the DocTerm dataset.
[6 pts] For every topic, give a sorted list of documents (specified by row-id) that participate in this topic, along with their weight in the topic. For example:
```
Topic A:
--------
doc 1, weight .9
doc 2, weight .2
doc 3, weight .8
Topic B:
--------
doc 1, weight .1
doc 2, weight .8
doc 3, weight .2
```
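Counting topics amounts to counting the singular values that are clearly non-zero. A standard SVD routine is the practical choice; the sketch below only illustrates the idea in plain Python, using power iteration with deflation on A^T A. The tiny `doc_term` matrix, the function names and the 1e-6 cutoff are made-up assumptions, and a fixed start vector can miss a component in unlucky cases.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def singular_values(a, iters=500, tol=1e-12):
    """Singular values of `a`, largest first, via power iteration with deflation on A^T A."""
    cols = len(a[0])
    # Gram matrix G = A^T A; its eigenvalues are the squared singular values of A.
    g = [[sum(row[i] * row[j] for row in a) for j in range(cols)] for i in range(cols)]
    values = []
    for _component in range(cols):
        v = [1.0 / math.sqrt(cols)] * cols  # fixed start vector (a simplification)
        eig = 0.0
        for _step in range(iters):
            w = matvec(g, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < tol:  # nothing left in the deflated matrix
                eig = 0.0
                break
            v = [x / norm for x in w]
            eig = norm  # converges to the largest remaining eigenvalue
        values.append(math.sqrt(max(eig, 0.0)))
        # Deflate: subtract the component just found.
        g = [[g[i][j] - eig * v[i] * v[j] for j in range(cols)] for i in range(cols)]
    return values

# Hypothetical document-by-term matrix with two disjoint groups of terms.
doc_term = [
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 3.0],
]
values = singular_values(doc_term)
n_topics = sum(1 for s in values if s > 1e-6)
print([round(s, 3) for s in values], n_topics)
```

On real data the small singular values are rarely exactly zero, so the cutoff is a judgment call, typically made by looking for a gap in the sorted values.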

Q6: 3-D SVD [15 pts]

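Recall the multivariate Gaussian distribution in N dimensions:

p(x) = (2\pi)^{-N/2} |\Sigma|^{-1/2} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

where Sigma (\Sigma) is an NxN covariance matrix, and mu (\mu) is a length-N mean vector.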

This dataset contains 10,000 points constructed in the following manner:

Points are sampled from a three dimensional (N=3) spherical (Sigma is now the 3x3 identity matrix) Gaussian centered at the origin (mu is now the vector: <0, 0, 0>).
1. The first coordinate of the sampled point is multiplied by a scalar a
2. The second coordinate of the sampled point is multiplied by a scalar b
3. The third coordinate of the sampled point is multiplied by a scalar c
These warped points are all finally rotated and translated by some unknown affine transformation (which is the same for all points).

[15 pts] Given this data, recover a, b and c.

HINT: Inspect the 2-d scatterplots.
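One way to see why SVD is relevant here is to write the construction as a formula. The symbols R (the unknown rotation), t (the unknown translation), D and z are notation introduced for this sketch and do not appear in the assignment:

```latex
x = R\,D\,z + t, \qquad z \sim \mathcal{N}(0, I_3), \qquad D = \mathrm{diag}(a, b, c)

\mathbb{E}[x] = t, \qquad
\mathrm{Cov}(x) = R\,D\,\mathbb{E}[z z^{T}]\,D^{T} R^{T} = R\,D^{2}\,R^{T}
```

Since R is orthogonal, the eigenvalues of Cov(x) are a^2, b^2 and c^2. Equivalently, the singular values of the mean-centered n x 3 data matrix are approximately sqrt(n-1) times |a|, |b| and |c|. Only the magnitudes can be recovered this way, and only up to their ordering, and the estimates carry sampling noise.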

Q7: Regression and Recursive Least Squares [30 pts]

Implement the method of Recursive Least Squares (RLS), with forgetting, from the paper by [Yi et al, ICDE 2000] (pdf, ps.gz), or from [Chen & Roussopoulos, SIGMOD 94] (pdf, ps.gz).

Language: You may choose among C/C++, perl, python, Java, R, Matlab. However, your algorithm must be incremental, processing one row of data at a time.
Input format: n+1 numbers per line, blank separated; the last column is the dependent variable
```
<x1> <x2> ... <y>
```

Output format

Note: your program must process the input one line at a time.

Apply the recursive least squares (RLS) algorithm to this simulated, fictitious dataset. The data contains 1024 rows of three blank-separated values. Each row represents the weekly measurement of the same patient over 20 years. The format is:

<calories consumed> <hours of exercise performed> <cholesterol>

The first two variables are the independent variables, x1 and x2, while the third variable (cholesterol) is the dependent variable, y. We believe the data is generated by some linear function, whose parameters w1 and w2 we are trying to estimate:

y = w1*x1 + w2*x2 + noise

At some point in time, the patient began taking a medicine to lower her cholesterol. We would like to identify a) when the patient began taking the medicine and b) what effect (if any) this medicine had on the relationship between the patient's calories, exercise and cholesterol. We also want to investigate how much the results differ between ordinary least squares (OLS, read more here or see property C1 in slide #17 in these slides, which is equivalent to equation 4 in this paper) and RLS. For every week t we estimate w1(t) and w2(t) that best fit the data up to and including week t.
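A textbook form of the RLS update with a forgetting factor can be sketched as below. The notation in the cited papers may differ; the names `rls_update` and `run_rls`, the zero initial weights and the large initial matrix `delta * I` are assumptions of this sketch.

```python
def rls_update(w, p, x, y, lam=1.0):
    """One recursive-least-squares step with forgetting factor `lam`.

    w: current weights, p: current inverse-correlation matrix (symmetric),
    x: independent variables of the new row, y: its dependent variable.
    """
    n = len(w)
    px = [sum(p[i][j] * x[j] for j in range(n)) for i in range(n)]  # P x
    denom = lam + sum(x[i] * px[i] for i in range(n))               # lam + x^T P x
    gain = [v / denom for v in px]                                  # gain vector k
    err = y - sum(w[i] * x[i] for i in range(n))                    # prediction error
    w_new = [w[i] + gain[i] * err for i in range(n)]
    # P <- (P - k x^T P) / lam; P is symmetric, so x^T P equals (P x)^T.
    p_new = [[(p[i][j] - gain[i] * px[j]) / lam for j in range(n)] for i in range(n)]
    return w_new, p_new

def run_rls(rows, lam=1.0, delta=1e6):
    """Process rows one at a time; return the weight estimate after each row."""
    n = len(rows[0]) - 1
    w = [0.0] * n
    p = [[delta if i == j else 0.0 for j in range(n)] for i in range(n)]
    history = []
    for row in rows:
        w, p = rls_update(w, p, list(row[:-1]), row[-1], lam)
        history.append(list(w))
    return history

# Hypothetical noise-free rows generated by y = 2*x1 - 3*x2.
rows = [(t % 7 + 1.0, (3 * t) % 5 + 1.0) for t in range(50)]
rows = [(x1, x2, 2 * x1 - 3 * x2) for x1, x2 in rows]
print(run_rls(rows)[-1])
```

Each step costs a fixed amount of work regardless of how many rows came before, which is the point of the recursive formulation. With `lam` below 1 the older rows fade out, so the estimates can follow a change in w1 and w2.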

What to hand in:

Plot both the time-series of how w1(t) and w2(t) change over time, and the 2-d scatter plot of <w1(t), w2(t)> over time t, as estimated by:

[4 pts] RLS with no forgetting (ie, lambda = 1)
[4 pts] RLS with the forgetting factor lambda = 0.95
[4 pts] OLS with no forgetting (ie, lambda = 1)
[4 pts] OLS with the forgetting factor lambda = 0.95
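For the OLS experiments, one possible reading (an assumption, not something the assignment spells out) is to re-solve the normal equations from scratch at every week t, weighting the row from week i by lambda^(t-i). A sketch for the two-variable case, with hypothetical function names:

```python
def weighted_ols_2d(rows, lam=1.0):
    """Solve the 2-variable weighted normal equations from scratch for the given rows.

    Returns (w1, w2), or None while the rows do not determine a unique solution.
    """
    s11 = s12 = s22 = b1 = b2 = 0.0
    last = len(rows) - 1
    for i, (x1, x2, y) in enumerate(rows):
        wgt = lam ** (last - i)  # older rows get exponentially less weight
        s11 += wgt * x1 * x1
        s12 += wgt * x1 * x2
        s22 += wgt * x2 * x2
        b1 += wgt * x1 * y
        b2 += wgt * x2 * y
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        return None  # not enough independent rows yet
    # Closed-form inverse of the 2x2 matrix [[s11, s12], [s12, s22]].
    return ((s22 * b1 - s12 * b2) / det, (s11 * b2 - s12 * b1) / det)

def ols_history(rows, lam=1.0):
    """Estimates after each week t, each one recomputed from rows 0..t."""
    return [weighted_ols_2d(rows[: t + 1], lam) for t in range(len(rows))]

# Hypothetical noise-free rows generated by y = 2*x1 - 3*x2.
demo = [(t % 7 + 1.0, (3 * t) % 5 + 1.0) for t in range(20)]
demo = [(x1, x2, 2 * x1 - 3 * x2) for x1, x2 in demo]
print(ols_history(demo)[-1])
```

Recomputing at week t touches all t rows, so the total work grows quadratically with the number of weeks, whereas an incremental method does a constant amount of work per row. That difference is what the wall-clock comparison is meant to expose.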

[2 pts] Based on these plots, when (which week) did the patient begin taking her medication?
[2 pts] Give the running times (wall clock) of each of the four experiments, and the specifications of the machine you ran it on (eg, win xp, 256 mb ram, intel dual core 3 ghz)
[10 pts] Submit your code for both RLS and OLS

Last modified: April 19, 2007