CARNEGIE MELLON UNIVERSITY
15-826 - Multimedia databases and data mining
Spring 2007
Homework 1 - Due: Tuesday Feb. 6, 1:30pm, in class.
Important:
- Due date: Tuesday Feb. 6, 1:30pm: hard copy in class and soft
copy e-mailed to the TA in a single e-mail.
- This homework is MANDATORY, meaning you must get at least
a passing grade: 50 points or higher.
- For all questions please BOTH:
- e-mail the TA a soft-copy of all code, output and auxiliary scripts,
tar'ed up in a single file
- hand in a hard-copy of all code, output and auxiliary scripts
- Please turn in a typed report - handwritten material may not be graded,
at the grader's discretion.
- All homeworks, including this one, are to be done INDIVIDUALLY
- Expected effort for this homework:
- Q1: 1-2 hours
- Q2: 1-2 hours: .5 h to design the algorithm and .5-1.5 hours
for coding and debugging.
- Q3: 10-15 hours: 1-2h to compile and run the code library,
4-6h for coding and debugging each problem, and 1h for
generating results
Q1: SQL [30 pts]
Retrieve the following tables and load them in a DBMS of your
choice. MS-Access is recommended, but any other is acceptable, for example:
- MySQL
- postgres
- You can use this script to load the
data into the server. Use the command '\i load.sql' at the psql prompt.
These tables (in comma-separated ASCII format, derived from Sean Lahman's Baseball
Database) describe historical data about American baseball teams
and players:
- PLAYERS (playerID, first_name, last_name, rightHanded): Describes
each player's first and last name, and whether they bat right- or
left-handed.
- TEAMS (yearID, teamID, teamName, stadium): Describes which team
played in which stadium during which year.
- SALARIES (year, teamID, playerID, salary): Describes which player
played for which team during which year for how much money.
Given this data, please answer the following queries (feel free to
use views):
- [10 pts] Find the single highest salary ever paid, and
report the teamID, year and amount. In case of tie, report all tied
entries.
- [10 pts] Report the first and last name of all
left-handed players who played for PIT in 1985.
- [10 pts] List the names of all stadiums the New York
Yankees have ever played in.
For each query both print out and e-mail:
- the resulting table [half the points]
- the corresponding SQL statement [half the points]
Caution: The ASCII data files have MS-Windows/DOS end-of-line
termination (CR LF). On unix/linux, use "dos2unix" to convert them.
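
As an illustration of the tie-handling requirement and the end-of-line caution above, here is a minimal sketch in Python using the standard csv and sqlite3 modules. The rows are invented for the example and are not taken from the course data, and SQLite is an assumption made only to keep the sketch self-contained; the same query shape should carry over to other DBMSs.

```python
import csv
import io
import sqlite3

# Hypothetical CRLF-terminated sample in the SALARIES layout
# (year, teamID, playerID, salary). Not real course data.
raw = "1999,AAA,p1,500\r\n2000,BBB,p2,900\r\n2001,CCC,p3,900\r\n"

# csv.reader accepts DOS line endings, so no dos2unix step is needed here.
rows = [(int(r[0]), r[1], r[2], int(r[3])) for r in csv.reader(io.StringIO(raw))]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE salaries (year INTEGER, teamID TEXT, playerID TEXT, salary INTEGER)"
)
conn.executemany("INSERT INTO salaries VALUES (?, ?, ?, ?)", rows)

# Comparing against the MAX subquery keeps every tied row,
# whereas ORDER BY ... LIMIT 1 would report only one of them.
top = conn.execute(
    "SELECT teamID, year, salary FROM salaries "
    "WHERE salary = (SELECT MAX(salary) FROM salaries) "
    "ORDER BY year"
).fetchall()
print(top)  # [('BBB', 2000, 900), ('CCC', 2001, 900)]
```

Both rows with the maximum salary are reported, which is the behavior the first query asks for in case of a tie.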
Q2: Z-order [10 pts]
Write the code to compute the z-value of a 2-d point,
as well as the inverse. You may use C/C++, perl, java or python. If
you'd like to use a different language, please ask the TA first.
- [2.5 pts] zorder should return the z-value of the
given (x,y) point. The command-line syntax should be:
zorder -n <order-of-curve> <xvalue> <yvalue>
Thus:
zorder -n 2 0 0 # should return '0'
zorder -n 3 0 1 # should return '1'
- Note that these examples determine the orientation of the
Z-curve.
- [2.5 pts] izorder should give the inverse. The
command-line syntax should be:
izorder -n <order-of-curve> <zvalue>
Thus:
izorder -n 5 0 # should return the 'x' and 'y' values, i.e., '0 0'
izorder -n 2 15 # should return '3 3'
- [2.5 pts] Give the results of your programs on this input file.
Make sure you echo the input, so that it is clear which answer refers
to which question.
- [2.5 pts] Using your izorder, plot a z-curve
(perhaps using gnuplot) of order 7 (128x128
grid) and hand in the plot.
Hand in your source code on hard copy, and also e-mail it in.
NOTES:
- It is fine to copy and/or modify code from the web or from existing
papers, but you have to make sure that your programs follow all the above
specifications.
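
The bit interleaving behind zorder and izorder in Q2 can be sketched as follows. This is one possible reading, not a reference solution: the function names mirror the command names, curve_points is a hypothetical helper, and the orientation (y supplies the least significant bit of each pair) is inferred from the examples 'zorder -n 3 0 1' giving 1 and 'izorder -n 2 15' giving '3 3'. Command-line parsing is left out.

```python
def zorder(n, x, y):
    """Interleave the n low bits of x and y into a z-value.

    Assumed orientation: bit i of y goes to bit 2i of z,
    and bit i of x goes to bit 2i+1 of z.
    """
    if not (0 <= x < (1 << n) and 0 <= y < (1 << n)):
        raise ValueError("coordinates must lie in [0, 2^n)")
    z = 0
    for i in range(n):
        z |= ((y >> i) & 1) << (2 * i)
        z |= ((x >> i) & 1) << (2 * i + 1)
    return z


def izorder(n, z):
    """Inverse of zorder: split a z-value back into (x, y)."""
    if not 0 <= z < (1 << (2 * n)):
        raise ValueError("z-value must lie in [0, 4^n)")
    x = y = 0
    for i in range(n):
        y |= ((z >> (2 * i)) & 1) << i
        x |= ((z >> (2 * i + 1)) & 1) << i
    return x, y


def curve_points(n):
    """Grid points in the order the z-curve of order n visits them."""
    return [izorder(n, z) for z in range(1 << (2 * n))]


print(zorder(2, 0, 0))   # 0
print(zorder(3, 0, 1))   # 1
print(izorder(5, 0))     # (0, 0)
print(izorder(2, 15))    # (3, 3)
```

Writing the output of curve_points(7) to a file, one "x y" pair per line, should give data that gnuplot can draw with a 'with lines' plot.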
Q3: R trees [60 pts]
You are required to extend the capabilities of an R-tree package,
and implement two different algorithms for closest pair
queries. You may use C/C++, perl, java or python. If you'd like to use
a different language, please ask the TA first.
- Code: You may use any R-tree code from the internet you like, but
if you do so, please let the TA know. We recommend this R-Tree package. It is in C;
unpack it with 'gunzip' and 'tar xvf', then do 'make demo'. This creates the
bin/DRmain program and runs it on some small datasets. It has been tested on
unix on the andrew machines, and under cygwin on windows xp.
- Do 'bin/DRmain' and insert some points of your own,
to become familiar with the package.
- the program expects rectangles - treat points as degenerate
rectangles (x_low = x_high etc).
- in order to keep the distances between points as ints,
by default the program calculates the squared euclidean
distance. Squaring preserves the ordering of non-negative distances, so
this should not affect the results of your nearest neighbor
calculations. The program also assigns some meaning to negative
distances, which you can ignore.
- in case of unexplained errors, do 'make spotless'
to delete all libraries etc.
- Also note the program deliberately maintains state
(in the file default.idx) across invocations. This means once
you insert a point, you do not need to insert it again. To reset your
tree, you can just delete this file.
- Implement: Currently the R-tree package supports 's'
for range (s)earch, 'i' for (i)nsertion, 'knn' for
(k)-nearest-neighbor search, 'print' for (p)rinting the tree,
etc. You have to implement:
- [30 pts] 'n' for '(n)aive closest pair': your
program should print the node ids of the two closest points, along with
the (positive) euclidean distance between them.
- Hint: You may use the package's knn function and
issue one knn query per point, i.e., as many queries as there are points.
- [30 pts] 'v' for 'recursi(v)e closest pair':
your program should again print the node ids of the two closest points,
along with the (positive) euclidean distance between them. This time
you will need to test on a much larger data set, so the naive approach
will not work.
- Input Data: You can test your program with a small script and a large
script, and you may add more data (points
specified as nodeid, x and y coordinate values). Please take into
account that these files have ms-windows/dos end-of-line termination.
- Turn in :
- Hard copy: a printout of your source code (you are welcome to
give only the parts that you have added/modified).
- Email the whole tar-ball to aarnold+15826 at cs. Please make
sure that the program correctly compiles, runs the experiments and
generates all output with a single make command. If you have
used an R-tree package other than the one provided, please highlight
your alterations to the source, for example by prefacing the modified
sections with /** TA: MODIFIED BELOW **/
- Please note:
- Your program should be able to handle points in 2-d, 3-d and
higher dimensions
- Your program should handle ties and duplicate points
intelligently and gracefully.
- Test the code thoroughly, even the package we provided, or
one you may have downloaded from the web.
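
One way to check a closest-pair implementation against the notes above is a small brute-force reference that works in any dimension and reports all tied pairs. The sketch below is independent of the R-tree package; every name in it is hypothetical. It also includes the squared minimum distance between two axis-aligned rectangles, which is one common lower bound for pruning pairs of tree nodes. Using it in the recursive variant is an assumption, since the assignment does not prescribe an algorithm.

```python
from itertools import combinations


def dist_sq(p, q):
    """Squared euclidean distance; stays an int for int coordinates."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def closest_pairs_bruteforce(points):
    """points maps node id -> coordinate tuple of any dimension.

    Returns (smallest squared distance, list of all id pairs attaining it).
    Duplicate points show up as pairs at distance 0.
    """
    best, ties = None, []
    for (i, p), (j, q) in combinations(sorted(points.items()), 2):
        d = dist_sq(p, q)
        if best is None or d < best:
            best, ties = d, [(i, j)]
        elif d == best:
            ties.append((i, j))
    return best, ties


def mindist_sq(rect_a, rect_b):
    """Squared minimum distance between two boxes given as (low, high) tuples.

    No point in rect_a can be closer than this to any point in rect_b,
    so a pair of nodes whose bound exceeds the best distance found so far
    can be skipped. Overlapping boxes give 0.
    """
    total = 0
    for a_lo, a_hi, b_lo, b_hi in zip(rect_a[0], rect_a[1], rect_b[0], rect_b[1]):
        gap = max(b_lo - a_hi, a_lo - b_hi, 0)
        total += gap * gap
    return total


sample = {1: (0, 0), 2: (0, 2), 3: (5, 5), 4: (5, 7)}
print(closest_pairs_bruteforce(sample))            # (4, [(1, 2), (3, 4)])
print(mindist_sq(((0, 0), (1, 1)), ((4, 5), (6, 6))))  # 25
```

The brute-force routine takes time quadratic in the number of points, so it is only meant for validating results on small inputs, not for the large data set.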
Last updated by Christos Faloutsos, Jan. 24, 2007.