Carnegie Mellon University
15-826 Multimedia Databases and Data Mining
Fall 2012 - C. Faloutsos
List of suggested projects
The projects are grouped according to their general theme. We also
list the data and software available, to leverage your effort. More
links and resources may be added in the future. Reminders:
- URL for this very page (internal to CMU - please treat it
'confidentially'):
www.cs.cmu.edu/~christos/courses/826.F12/CMU-ONLY/projlist.html
- The default project is the 'UCR insect classification contest' - strongly recommended for the majority of the students.
- You may propose projects outside this list, as long
as they have to do with mining and indexing large datasets.
In that case, contact the instructor as early as possible.
- A [P] in the project title signifies that this project is
related to the PhD dissertation of the contact person.
- Please form groups of 3-4
people.
- Please check the 'blackboard' system, where we
will create one thread for each of the projects below. Please
indicate your interest, by posting in the appropriate thread(s), so
that you can find partners.
SUGGESTED TOPICS
People who take the class for their master's degree are strongly
recommended to choose one of the two default projects, with the first
one being the most recommended. They are both well defined, with a lot of implementation, and rather predictable outcomes.
The rest of the projects are more open-ended, and they are more suitable for people who want to do research in data mining.
1. DEFAULT PROJECTS - recommended for people in M.Sc. programs.
1.1 Default project #1: UCR insect dataset
- Problem: See the project web site. The default is to do classification, but you are welcome to do visualization, feature extraction, clustering, etc. (A small starter sketch appears at the end of this project entry.)
- Data: from the project web site. 500 sound clips, each with a class label.
- Introductory papers: Start from the award-winning paper of [Rakthanmanon+, KDD'12]; check the SIGMOD'07 tutorial by Keogh, or the SIGMOD'04 tutorial by the instructor
- Comments: very well defined project - extremely suitable for the majority of people in the class.
- Contact person(s): instructor
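- Starter sketch: a minimal, hedged example of one possible pipeline (coarse spectral features plus 1-nearest-neighbor); the file names, CSV layout, and feature choice below are assumptions for illustration, not the actual contest format.

    # Minimal sketch: spectral features + 1-nearest-neighbor classification.
    # Assumes each clip is a WAV file listed in 'train.csv' / 'test.csv' as
    # <filename,label>; these file names and the CSV layout are hypothetical.
    import csv
    import numpy as np
    from scipy.io import wavfile

    def spectral_features(path, n_bins=64):
        """Return a coarse, length-normalized FFT magnitude profile of one clip."""
        _, signal = wavfile.read(path)
        mag = np.abs(np.fft.rfft(signal.astype(float)))
        # Average the spectrum into n_bins bins so clips of different lengths compare.
        feat = np.array([b.mean() for b in np.array_split(mag, n_bins)])
        return feat / (np.linalg.norm(feat) + 1e-12)

    def load(csv_path):
        feats, labels = [], []
        with open(csv_path) as f:
            for fname, label in csv.reader(f):
                feats.append(spectral_features(fname))
                labels.append(label)
        return np.vstack(feats), np.array(labels)

    def one_nn(train_X, train_y, test_X):
        """Classify each test row by its nearest training row (Euclidean distance)."""
        d = np.linalg.norm(test_X[:, None, :] - train_X[None, :, :], axis=2)
        return train_y[d.argmin(axis=1)]

    if __name__ == "__main__":
        X, y = load("train.csv")      # hypothetical file list
        Xt, yt = load("test.csv")
        pred = one_nn(X, y, Xt)
        print("accuracy:", (pred == yt).mean())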
1.2 Default project #2: Graph mining using RDBMS
- Problem: Do we need an
additional language to do graph manipulations? Show that SQL is
enough to answer all the questions we want. Given a graph of (source,
destination) pairs on a disk, write the SQL queries to answer numerous
questions of interest, like 'which are the most important nodes', 'find
the radius of each node', etc.
- Data: Any graph dataset - the emphasis of this project is on implementation.
- Introductory papers: The PEGASUS paper with GIM-V; the follow-up paper of GBASE; the rest of the papers on the pegasus project web site
- Comments: May lead to a publication. Degree distribution and PageRank have been implemented; radius and eigenvalues are still missing. A small SQL sketch appears after this project entry.
- Contact person: instructor.
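- Starter sketch: one way to phrase two of the graph questions in plain SQL, here driven through Python's sqlite3 module; the table layout edges(src, dst) and the toy data are assumptions, not part of the project hand-out.

    # Sketch: graph questions answered with plain SQL (here via sqlite3).
    # The table 'edges(src, dst)' and the sample rows are made-up assumptions.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE edges (src INTEGER, dst INTEGER);
    INSERT INTO edges VALUES (1,2),(1,3),(2,3),(3,1),(4,3);
    """)

    # Degree distribution: for each out-degree value, how many nodes have it.
    print(conn.execute("""
        SELECT degree, COUNT(*) AS num_nodes
        FROM (SELECT src, COUNT(*) AS degree FROM edges GROUP BY src)
        GROUP BY degree ORDER BY degree
    """).fetchall())

    # One PageRank iteration as a join + aggregation (damping factor 0.85).
    conn.executescript("""
    CREATE TABLE rank   AS SELECT DISTINCT src AS node, 1.0 AS pr FROM edges;
    CREATE TABLE outdeg AS SELECT src, COUNT(*) AS d FROM edges GROUP BY src;
    """)
    print(conn.execute("""
        SELECT e.dst AS node,
               0.15 + 0.85 * SUM(r.pr / o.d) AS pr
        FROM edges e JOIN rank r   ON e.src = r.node
                     JOIN outdeg o ON e.src = o.src
        GROUP BY e.dst
    """).fetchall())

A full solution would repeat the PageRank step (materializing intermediate tables on disk), but the join-plus-aggregation pattern stays the same.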
2. GRAPH MINING
2.1 Anomaly detection and attribution
- Problem: How can we
automatically find anomalies (e.g. spikes) in datasets? And more
importantly, how can we do attribution? For instance, if there are too
many nodes of degree 256, can we say something more about them? That
is, if there is a
spike in the count-vs-degree plot (assume a power-law-like
distribution), what can we say about the nodes that are causing the
spike in the plot? Do they belong to some specific structure, e.g. star
or chain? This project aims at helping make sense of big graphs: we
want to find automatically the properties that make some nodes in the
graph anomalous, instead of just reporting that there is 'some type' of
anomaly.
- Data: 'Stack overflow' - The data is described briefly in the paper: OPAvion: Mining and
Visualization in Large Graphs. Leman Akoglu, Duen Horng Chau, U
Kang, Danai Koutra, and Christos Faloutsos. SIGMOD'12,
Arizona, USA, May 2012. Any other graph dataset would be suitable, too.
- Introductory material:
For automatically detecting spikes in power-law graphs,
start from the 'median filtering' method (check Wikipedia to get
familiar with the denoising algorithm).
- Comments: There are
several anomaly detection algorithms in the literature. Here we focus
on spotting some types of anomalies in plots, and mainly on explaining
them. There is no prior work on doing anomaly attribution automatically. A small median-filtering sketch appears after this project entry.
- Contact Person: Danai Koutra; instructor.
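- Starter sketch: a minimal illustration of the median-filtering idea on a count-vs-degree histogram; the synthetic power law, the injected spike, and the threshold below are assumptions for illustration only.

    # Sketch: flag spikes in a count-vs-degree plot by comparing each count
    # to a median-filtered version of the (log-scale) curve.
    # The synthetic power-law data and the 3-sigma threshold are assumptions.
    import numpy as np
    from scipy.signal import medfilt

    degrees = np.arange(1, 201)
    counts = (1e5 * degrees ** -2.0).astype(int) + 1   # rough power law
    counts[127] += 500                                 # injected anomaly: too many degree-128 nodes

    log_counts = np.log1p(counts.astype(float))
    smoothed = medfilt(log_counts, kernel_size=5)      # median filter removes isolated spikes
    residual = log_counts - smoothed

    threshold = 3 * residual.std()
    print("suspicious degrees:", degrees[residual > threshold])

The project would then go further: given the flagged nodes (e.g., all nodes of degree 128), characterize the structure they belong to.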
2.2 Belief propagation in large graphs
- Problem: Belief
Propagation is a powerful algorithm that has been used successfully in
numerous fields, such as computer vision, fraud detection, malware
detection, and LDPC codes. In this project, we will focus on a fast
approximation of belief propagation (as presented in the first paper
given below) which currently handles only two different classes (e.g.,
guilty/non-guilty people). The goal of the project is to extend the
algorithm to multiple classes and derive the more general matrix
multiplication equation (instead of the initial iterative equations of
the method), so that the belief propagation approximation is more
widely applicable (e.g., in the paper-citation graph we have 4 classes
(areas of research): AI, DB, IR, DM).
- Data: DBLP network (annotated, with 4 different classes)
- Introductory material:
- Unifying Guilt-by-Association Approaches:
Theorems and Fast Algorithms. Danai Koutra, Tai-You Ke, U Kang,
Duen Horng (Polo) Chau, Hsing-Kuo Kenneth Pao, and Christos
Faloutsos. ECML PKDD, Athens, Greece, Sep. 2011
- Understanding belief propagation and its generalizations. J.
Yedidia, W. Freeman, and Y. Weiss. Exploring Artificial Intelligence in
the New Millennium, 8:236-239, 2003.
- Polonium:
Tera-scale graph mining and inference for malware detection. D. Chau,
C. Nachenberg, J. Wilhelm, A. Wright, and C. Faloutsos. SDM, 2011.
- Comments: There are
multiple implementations of BP in the literature. Here we start from
the formulas given in the second paper and, following an analysis similar
to (but trickier than) that of the first paper, we will try to derive a
generalized matrix formula for BP, which handles more than 2 classes. A toy linearized iteration appears after this project entry.
- Contact Person: Danai Koutra; instructor.
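- Starter sketch: a toy, simplified linearized guilt-by-association iteration for the 2-class case, in the spirit of (but not identical to) the FaBP equations of the first paper; the graph, priors, and coupling strength below are made up. The project's goal would be, roughly, to replace the belief vector b with a node-by-class belief matrix B and derive the corresponding matrix equation.

    # Sketch: a simplified linearized guilt-by-association iteration (2 classes),
    # in the spirit of the FaBP paper -- NOT the paper's exact equations.
    # The toy graph, priors, and coupling strength h are assumptions.
    import numpy as np

    A = np.array([[0, 1, 1, 0],          # adjacency matrix of a tiny undirected graph
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    phi = np.array([0.1, 0.0, 0.0, -0.1])  # prior beliefs, centered around 0
    h = 0.1                                 # homophily / coupling strength (small, for convergence)

    b = phi.copy()
    for _ in range(50):                     # power-iteration-style update: b = phi + h * A b
        b_new = phi + h * A.dot(b)
        if np.abs(b_new - b).max() < 1e-9:
            break
        b = b_new
    print("final beliefs (sign = class):", b)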
2.3 Parallel graph mining using hadoop
- Problem: Given a large
graph with billions of nodes and tens of billions of edges, and
several shared-nothing machines, parallelize the typical graph
mining algorithms, to be as fast as you can. Our 'pegasus' system already
computes the in- and out-degree distributions, the diameter of the
graph, the first several eigenvalues, and runs on top of hadoop.
'hadoop' allows relatively
easy parallel execution,
implementing the map-reduce
system of Google [Dean + Ghemawat, OSDI'04]. 'Hadoop' is
open source; we have a small cluster where we can give you an
account, or make some other arrangement.
- The first step is to do timing of several possible
architectures: with or without a relational DBMS; with or without
replication of the data; using the PIG system;
using 'hbase'.
- Also, what is the best way to store the data (e.g., as
<from,to> pairs in a flat file; as an adjacency list, hashed
on the 'from' node-id; or as something else)?
- Data: We shall start
with synthetic data, using an existing generator [Leskovec+,
PAKDD'05]. Then, DBLP, IMDB, etc. We could also get data on real
CMU IP traffic (will need an NDA). Finally, we also have a
who-talks-to-whom social network with 270 million nodes and 8
billion edges (60 GB of data).
- Introductory paper(s):
The generator above; the Gamma database machine papers [Dewitt+,
IEEE TKDE'90]; papers on hash-joins [Kitsuregawa+,
VLDB'90]; the RMAT paper [Chakrabarti+,
SIAM-DM'04]; the connection sub-graph paper [Faloutsos+,
KDD'04]. If you plan to use 'hadoop', get the map-reduce paper
[Dean + Ghemawat, OSDI'04] and the documentation about the
add-ons to hadoop, PIG and hbase.
- Comments: Very high
practical interest, with hard problems from both the algorithmic and
the systems side. There is a lot of room, even for 4 or
more people. A toy map-reduce sketch of one PageRank round appears after this project entry.
- Contact person:
instructor.
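- Starter sketch: one PageRank round written as explicit map and reduce steps, simulated locally in Python to illustrate the message-grouping pattern before porting to hadoop; this is not the PEGASUS code, and the toy edge list is made up.

    # Sketch: one GIM-V / PageRank round expressed as map and reduce steps,
    # simulated locally.  Only the message-grouping pattern is illustrated.
    from collections import defaultdict

    edges = [(1, 2), (1, 3), (2, 3), (3, 1)]   # toy <from,to> pairs (assumption)
    rank = {1: 1.0, 2: 1.0, 3: 1.0}
    outdeg = defaultdict(int)
    for s, _ in edges:
        outdeg[s] += 1

    def map_phase(edges, rank):
        """Each edge emits (dst, contribution) -- what a mapper would output."""
        for s, d in edges:
            yield d, rank[s] / outdeg[s]

    def reduce_phase(pairs, damping=0.85):
        """Group contributions by destination and combine -- the reducer."""
        acc = defaultdict(float)
        for node, contrib in pairs:
            acc[node] += contrib
        return {n: (1 - damping) + damping * c for n, c in acc.items()}

    for _ in range(10):
        new_rank = reduce_phase(map_phase(edges, rank))
        # nodes with no in-links keep only the teleport term
        rank = {n: new_rank.get(n, 1 - 0.85) for n in rank}
    print(rank)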
2.4 Non-negative matrix factorization with hadoop and SGD
- Problem: Non-negative
Matrix Factorization, and Matrix Factorizations in general, have proved
useful for many data mining tasks such as matrix completion, concept
discovery, and latent semantic indexing. How well do present
state-of-the-art algorithms scale? Is there a best choice among the existing
algorithms (e.g. "Multiplicative Updates" or (Stochastic) Gradient
Descent) in terms of parallelizability and, ultimately, scalability? In
this project, your task will be to 1) investigate existing algorithms
with respect to their scalability potential, 2) implement your
choice in MapReduce/Hadoop and 3) experiment with one or more real
world datasets (and possibly a synthetic one) in order to a) describe
the findings of the algorithm, and b) demonstrate that your
implementation scales.
- Data: IMDB dataset, DBLP dataset; also come up with a way to generate synthetic data.
- Introductory papers:
- DD Lee and HS Seung. Learning the parts of objects by non-negative matrix factorization, Nature 1999
- DD Lee and HS Seung. Algorithms for Non-negative Matrix Factorization, Advances in Neural Information Processing Systems, 2001
- Rainer Gemulla, Peter J Haas, Erik Nijkamp, and Yannis Sismanis. Large Scale Matrix Factorization with Distributed Stochastic Gradient Descent, ACM KDD 2011
- Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min
Wang. Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic
Data Analysis on MapReduce, ACM WWW 2010
- Scott Deerwester, Susan T. Dumais, George W. Furnas,
Thomas K. Landauer, Richard Harshman. Indexing by Latent Semantic
Analysis, Journal of the American Society for Information Science,
1990
- Comments: The paper of Gemulla+ is especially interesting. A single-machine sketch of the multiplicative updates appears after this project entry.
- Contact Person: Vagelis Papalexakis, instructor.
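- Starter sketch: the Lee & Seung multiplicative updates (Frobenius-norm objective) on a single machine with numpy, as a correctness and speed baseline before any MapReduce implementation; the matrix sizes and rank below are arbitrary.

    # Sketch: Lee & Seung multiplicative updates for NMF (Frobenius-norm objective),
    # single machine, numpy.  Matrix sizes and rank are arbitrary assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    V = rng.random((100, 80))        # non-negative data matrix (e.g., doc-term counts)
    k = 10                           # factorization rank
    W = rng.random((100, k))
    H = rng.random((k, 80))
    eps = 1e-9                       # avoid division by zero

    for it in range(200):
        # Multiplicative updates: guaranteed not to increase ||V - WH||_F^2
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)

    print("reconstruction error:", np.linalg.norm(V - W @ H))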
3. SPATIO/TEMPORAL AND STREAM MINING
3.1 Guess the next flu spike: Co-evolving time series mining
- Problem: Given time series of patients (blood pressure
over time, etc), and class labels ('healthy', 'unhealthy') extract
features and do classification. Or, given a set of sequences of,
say, BGP updates, find correlations and anomalies (BGP = Border
Gateway Protocol, in computer networks). In yet another scenario,
consider monitoring a data center (like the Self-* system or the
Data Center Observatory, both at CMU/PDL). Another application is
monitoring environmental data, to spot, say, global warming,
deforestation, etc. - see the web page of Prof. Vipin
Kumar.
- Data:
- Very interesting dataset: from the Tycho project - epidemiology time series, with # of infected people per unit time per US city per disease. Other data include:
- From the physionet.org collection
- Introductory paper(s): For spikes in epidemiology data, check the 'spikeM' model [KDD'12]. For BGP, check [Prakash+,
KDD'09] (or here, for a more detailed version). For data center monitoring, check the
SPIRIT project, and the corresponding publication OSR06.
Also the lag-correlation paper [Sakurai+,
SIGMOD'05], and the DynaMMo method (Kalman filters for
missing values [Li+, KDD'09]).
- Comments: Start with Fourier and wavelets, for features; a small Fourier-feature sketch appears after this project entry.
For the 'tycho' data, try the 'spikeM' method. Check the 'DynaMMo' and 'PLiF' methods. For the physionet data, one
challenge is how to handle the several wrong recordings (e.g.,
blood pressure ~ 0). Depending on the composition of the team, the
project could focus on any of the above settings (environment only;
datacenter only; etc.).
- Contact person: instructor.
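- Starter sketch: low-frequency Fourier magnitudes as features, followed by leave-one-out 1-nearest-neighbor classification; the synthetic 'healthy'/'unhealthy' series below are stand-ins for real patient data.

    # Sketch: low-frequency Fourier coefficients as features for time-series
    # classification ('healthy' vs 'unhealthy').  Synthetic series are assumptions.
    import numpy as np

    def fourier_features(series, n_coeffs=8):
        """Magnitudes of the first few DFT coefficients, normalized by length."""
        coeffs = np.fft.rfft(series - series.mean())
        return np.abs(coeffs[1:n_coeffs + 1]) / len(series)

    rng = np.random.default_rng(1)
    t = np.linspace(0, 10, 500)
    healthy   = [np.sin(2 * np.pi * 0.3 * t) + 0.2 * rng.standard_normal(t.size) for _ in range(20)]
    unhealthy = [np.sin(2 * np.pi * 0.7 * t) + 0.2 * rng.standard_normal(t.size) for _ in range(20)]

    X = np.vstack([fourier_features(s) for s in healthy + unhealthy])
    y = np.array([0] * 20 + [1] * 20)

    # Leave-one-out 1-nearest-neighbor classification on the Fourier features.
    correct = 0
    for i in range(len(y)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        correct += (y[d.argmin()] == y[i])
    print("LOO accuracy:", correct / len(y))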
DATASETS
Unless explicitly mentioned, the datasets are either
'public' or 'owned' by the instructor; for the rest, we need to
discuss 'non-disclosure agreements' (NDAs).
Time sequences
- Time series
repository at UCR.
- KURSK
dataset of multiple time sequences: time series from
seismological sensors by the explosion site of the 'Kursk'
submarine.
- Truck traffic data, from our Civil Engineering
Department: number of trucks, weight, etc., per day per highway lane.
Find patterns, outliers; do data cleansing.
- River-level / hydrology data: multiple,
correlated time series. Do data cleansing; find correlations
between these series. Excellent project for people that like
canoeing!
- Sunspots:
number of sunspots per unit time. Some data are here.
Sunspots seem to have an 11-year periodicity, with high
spikes.
- Time sequences from the
Santa Fe Institute forecasting competition (financial data,
laser-beam oscillation data, patients' apnea data, etc.)
- Disk access
traces, from HP Labs (we have local copies at CMU).
For each disk access, we have the timestamp, the block-id, and the
type ('read'/'write'). Here is a
snippet of the data, aggregated per 30 minutes.
- Network traffic data from datapository.net at CMU
- Motion-capture data from CMU mocap.cmu.edu
Spatial data
- Astrophysics data - thousands of galaxies, with
coordinates, red-shift, spectra, photographs.
Small snippet of the data. More data are in the 'skyserver' web
site, where you can ask SQL queries and
get data in HTML or CSV format
- Synthetic astrophysics
data: 1K (x, y, z, weight) tuples, from Prof. Rupert Croft
(CMU). The full dataset is 200 MB compressed - contact the
instructor.
- Road segments: several datasets with line segments
(roads of U.S. counties, Montgomery MD, Long Beach CA, x-y
coordinates of stars in the sky from NASA, etc).
Snippet of data (roads from California, from TIGER).
Graph data
Miscellaneous:
SOFTWARE
Notes for the software: Before you modify any code,
please contact the instructor - ideally, we would like to use these
packages as black boxes.
- Readily available:
- ACCESS METHODS
- DR-tree: R-tree code; searches for range and nearest-neighbor
queries. In C.
- kd-tree code
- OMNI
trees - a faster version of metric trees.
- B-tree code, for text (should be easily changed to handle
numbers, too). In C.
- SVD AND TENSORS:
- FRACTALS
- GRAPHS
- the PEGASUS
package for graph mining on hadoop.
- the
NetMine network topology analysis package
- GMine:
interactive graph visualization package and graph manipulation
library (by Junio (Jose Fernandez Rodrigues Junior) and Jure
Leskovec)
- the 'crossAssociation' package for graph partitioning.
- Outside CMU:
- GiST package from
Hellerstein at UC Berkeley: A general spatial access method, which
is easy to customize. It is already customized to yield
R-trees.
- hadoop, PIG and hbase
- pajek,
jung, graphviz, guess, cytoscape, for (small) graph
visualization
- METIS,
for graph partitioning
BIBLIOGRAPHICAL RESOURCES:
Last modified Sept. 17, 2012, by Christos Faloutsos.