Carnegie Mellon University
15-826 Multimedia Databases and Data Mining
Fall 2012 - C. Faloutsos
List of suggested projects
The projects are grouped according to their general theme. We also
list the data and software available, to leverage your effort. More
links and resources may be added in the future. Reminders:
- URL for this very page (internal to CMU - please treat it
'confidentially'):
www.cs.cmu.edu/~christos/courses/826.F12/CMU-ONLY/projlist.html
- The default project is the 'UCR insect classification contest' - strongly recommended for the majority of the students.
- You may propose projects outside this list, as long
as they have to do with mining and indexing large datasets.
In that case, contact the instructor as early as possible.
- A [P] in the project title signifies that this project is
related to the PhD dissertation of the contact person.
- Please form groups of 3-4
people.
- Please check the 'blackboard' system, where we
will create one thread for each of the projects below. Please
indicate your interest, by posting in the appropriate thread(s), so
that you can find partners.
SUGGESTED TOPICS
People who take the class for their master's degree are strongly
recommended to choose one of the two default projects, with the first
one being the most recommended. They are both well defined, with a lot of implementation, and rather predictable outcomes.
The rest of the projects are more open-ended, and they are more suitable for people who want to do research in data mining.
1. DEFAULT PROJECTS - recommended for people in M.Sc. programs.
1.1 Default project #1: UCR insect dataset
- Problem: See the project web site. The default is to do classification, but you are welcome to do visualization, feature extraction, clustering, etc. (A small starter sketch appears at the end of this project entry.)
- Data: from the project web site. 500 sound clips, each with a class label.
- Introductory papers: Start from the award-winning paper of [Rakthanmanon+, KDD'12]; check the SIGMOD'07 tutorial by Keogh, or the SIGMOD'04 tutorial by the instructor
- Comments: very well defined project - extremely suitable for the majority of people in the class.
- Contact person(s): instructor
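- Starter sketch: a minimal, hedged example of one possible pipeline (coarse spectral features plus 1-nearest-neighbor); the file names, CSV layout, and feature choice below are assumptions for illustration, not the actual contest format.

    # Minimal sketch: spectral features + 1-nearest-neighbor classification.
    # Assumes each clip is a WAV file listed in 'train.csv' / 'test.csv' as
    # <filename,label>; these file names and the CSV layout are hypothetical.
    import csv
    import numpy as np
    from scipy.io import wavfile

    def spectral_features(path, n_bins=64):
        """Return a coarse, length-normalized FFT magnitude profile of one clip."""
        _, signal = wavfile.read(path)
        mag = np.abs(np.fft.rfft(signal.astype(float)))
        # Average the spectrum into n_bins bins so clips of different lengths compare.
        feat = np.array([b.mean() for b in np.array_split(mag, n_bins)])
        return feat / (np.linalg.norm(feat) + 1e-12)

    def load(csv_path):
        feats, labels = [], []
        with open(csv_path) as f:
            for fname, label in csv.reader(f):
                feats.append(spectral_features(fname))
                labels.append(label)
        return np.vstack(feats), np.array(labels)

    def one_nn(train_X, train_y, test_X):
        """Classify each test row by its nearest training row (Euclidean distance)."""
        d = np.linalg.norm(test_X[:, None, :] - train_X[None, :, :], axis=2)
        return train_y[d.argmin(axis=1)]

    if __name__ == "__main__":
        X, y = load("train.csv")      # hypothetical file list
        Xt, yt = load("test.csv")
        pred = one_nn(X, y, Xt)
        print("accuracy:", (pred == yt).mean())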
1.2 Default project #2: Graph mining using RDBMS
- Problem: Do we need an
additional language to do graph manipulations? Show that SQL is
enough to answer all the questions we want. Given a graph of (source,
destination) pairs on a disk, write the SQL queries to answer numerous
questions of interest, like 'which are the most important nodes', 'find
the radius of each node', etc.
- Data: Any graph dataset - the emphasis of this project is on implementation.
- Introductory papers: The PEGASUS paper with GIM-V; the follow-up paper of GBASE; the rest of the papers on the pegasus project web site
- Comments: May lead to a publication. Degree distribution and PageRank have been implemented; radius and eigenvalues are still missing. A small SQL sketch appears after this project entry.
- Contact person: instructor.
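- Starter sketch: one way to phrase two of the graph questions in plain SQL, here driven through Python's sqlite3 module; the table layout edges(src, dst) and the toy data are assumptions, not part of the project hand-out.

    # Sketch: graph questions answered with plain SQL (here via sqlite3).
    # The table 'edges(src, dst)' and the sample rows are made-up assumptions.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE edges (src INTEGER, dst INTEGER);
    INSERT INTO edges VALUES (1,2),(1,3),(2,3),(3,1),(4,3);
    """)

    # Degree distribution: for each out-degree value, how many nodes have it.
    print(conn.execute("""
        SELECT degree, COUNT(*) AS num_nodes
        FROM (SELECT src, COUNT(*) AS degree FROM edges GROUP BY src)
        GROUP BY degree ORDER BY degree
    """).fetchall())

    # One PageRank iteration as a join + aggregation (damping factor 0.85).
    conn.executescript("""
    CREATE TABLE rank   AS SELECT DISTINCT src AS node, 1.0 AS pr FROM edges;
    CREATE TABLE outdeg AS SELECT src, COUNT(*) AS d FROM edges GROUP BY src;
    """)
    print(conn.execute("""
        SELECT e.dst AS node,
               0.15 + 0.85 * SUM(r.pr / o.d) AS pr
        FROM edges e JOIN rank r   ON e.src = r.node
                     JOIN outdeg o ON e.src = o.src
        GROUP BY e.dst
    """).fetchall())

A full solution would repeat the PageRank step (materializing intermediate tables on disk), but the join-plus-aggregation pattern stays the same.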
2. GRAPH MINING
2.1 Anomaly detection and attribution
- Problem: How can we
automatically find anomalies (e.g. spikes) in datasets? And more
importantly, how can we do attribution? For instance, if there are too
many nodes of degree 256, can we say something more about them? That
is, if there is a
spike in the count-vs-degree plot (assume a power-law-like
distribution), what can we say about the nodes that are causing the
spike in the plot? Do they belong to some specific structure, e.g. star
or chain? This project aims at helping make sense of big graphs: we
want to find automatically the properties that make some nodes in the
graph anomalous, instead of just reporting that there is 'some type' of
anomaly.
- Data: 'Stack overflow' - The data is described briefly in the paper: OPAvion: Mining and
Visualization in Large Graphs. Leman Akoglu, Duen Horng Chau, U
Kang, Danai Koutra, and Christos Faloutsos. SIGMOD'12,
Arizona, USA, May 2012. Any other graph dataset would be suitable, too.
- Introductory material:
For automatically detecting spikes in power-law graphs,
start from the 'median filtering' method (check Wikipedia to get
familiar with the denoising algorithm).
- Comments: There are
several anomaly detection algorithms in the literature. Here we focus
on spotting some types of anomalies in plots, and mainly on explaining
them. There is no prior work on doing anomaly attribution automatically. A small median-filtering sketch appears after this project entry.
- Contact Person: Danai Koutra; instructor.
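- Starter sketch: a minimal illustration of the median-filtering idea on a count-vs-degree histogram; the synthetic power law, the injected spike, and the threshold below are assumptions for illustration only.

    # Sketch: flag spikes in a count-vs-degree plot by comparing each count
    # to a median-filtered version of the (log-scale) curve.
    # The synthetic power-law data and the 3-sigma threshold are assumptions.
    import numpy as np
    from scipy.signal import medfilt

    degrees = np.arange(1, 201)
    counts = (1e5 * degrees ** -2.0).astype(int) + 1   # rough power law
    counts[127] += 500                                 # injected anomaly: too many degree-128 nodes

    log_counts = np.log1p(counts.astype(float))
    smoothed = medfilt(log_counts, kernel_size=5)      # median filter removes isolated spikes
    residual = log_counts - smoothed

    threshold = 3 * residual.std()
    print("suspicious degrees:", degrees[residual > threshold])

The project would then go further: given the flagged nodes (e.g., all nodes of degree 128), characterize the structure they belong to.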
2.2 Belief propagation in large graphs
- Problem: Belief
Propagation is a powerful algorithm that has been used successfully in
numerous fields, such as computer vision, fraud detection, malware
detection, and LDPC codes. In this project, we will focus on a fast
approximation of belief propagation (as presented in the first paper
given below) which currently handles only two different classes (e.g.,
guilty/non-guilty people). The goal of the project is to extend the
algorithm to multiple classes and derive the more general matrix
multiplication equation (instead of the initial iterative equations of
the method), so that the belief propagation approximation is more
widely applicable (e.g., in the paper-citation graph we have 4 classes
(areas of research): AI, DB, IR, DM).
- Data: DBLP network (annotated, with 4 different classes)
- Introductory material:
- Unifying Guilt-by-Association Approaches:
Theorems and Fast Algorithms. Danai Koutra, Tai-You Ke, U Kang,
Duen Horng (Polo) Chau, Hsing-Kuo Kenneth Pao, and Christos
Faloutsos. ECML PKDD, Athens, Greece, Sep. 2011
- Understanding belief propagation and its generalizations. J.
Yedidia, W. Freeman, and Y. Weiss. Exploring Artificial Intelligence in
the New Millennium, 8:236-239, 2003.
- Polonium:
Tera-scale graph mining and inference for malware detection. D. Chau,
C. Nachenberg, J. Wilhelm, A. Wright, and C. Faloutsos. SDM, 2011.
- Comments: There are
multiple implementations of BP in the literature. Here we start from
the formulas given in the second paper and, following an analysis similar
to (but trickier than) that of the first paper, we will try to derive a
generalized matrix formula for BP, which handles more than 2 classes. A toy linearized iteration appears after this project entry.
- Contact Person: Danai Koutra; instructor.
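- Starter sketch: a toy, simplified linearized guilt-by-association iteration for the 2-class case, in the spirit of (but not identical to) the FaBP equations of the first paper; the graph, priors, and coupling strength below are made up. The project's goal would be, roughly, to replace the belief vector b with a node-by-class belief matrix B and derive the corresponding matrix equation.

    # Sketch: a simplified linearized guilt-by-association iteration (2 classes),
    # in the spirit of the FaBP paper -- NOT the paper's exact equations.
    # The toy graph, priors, and coupling strength h are assumptions.
    import numpy as np

    A = np.array([[0, 1, 1, 0],          # adjacency matrix of a tiny undirected graph
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    phi = np.array([0.1, 0.0, 0.0, -0.1])  # prior beliefs, centered around 0
    h = 0.1                                 # homophily / coupling strength (small, for convergence)

    b = phi.copy()
    for _ in range(50):                     # power-iteration-style update: b = phi + h * A b
        b_new = phi + h * A.dot(b)
        if np.abs(b_new - b).max() < 1e-9:
            break
        b = b_new
    print("final beliefs (sign = class):", b)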
2.3 Parallel graph mining using hadoop
- Problem: Given a large
graph with billions of nodes and tens of billions of edges, and
several shared-nothing machines, parallelize the typical graph
mining algorithms, to be as fast as you can. Our 'pegasus' system already
computes the in- and out-degree distributions, the diameter of the
graph, the first several eigenvalues, and runs on top of hadoop.
'hadoop' allows relatively
easy parallel execution,
implementing the map-reduce
system of Google [Dean + Ghemawat, OSDI'04]. 'Hadoop' is
open source; we have a small cluster where we can give you an
account, or make some other arrangement.
- The first step is to do timing of several possible
architectures: with or without a relational DBMS; with or without
replication of the data; using the PIG system;
using 'hbase'.
- Also, what is the best way to store the data (e.g., as
<from,to> pairs in a flat file; as an adjacency list, hashed
on the 'from' node-id; or as something else)?
- Data: We shall start
with synthetic data, using an existing generator [Leskovec+,
PAKDD'05]. Then, DBLP, IMDB, etc. We could also get data on real
CMU IP traffic (will need an NDA). Finally, we also have a
who-talks-to-whom social network with 270 million nodes and 8
billion edges (60 GB of data).
- Introductory paper(s):
The generator above; the Gamma database machine papers [Dewitt+,
IEEE TKDE'90]; papers on hash-joins [Kitsuregawa+,
VLDB'90]; the RMAT paper [Chakrabarti+,
SIAM-DM'04]; the connection sub-graph paper [Faloutsos+,
KDD'04]. If you plan to use 'hadoop', get the map-reduce paper
[Dean + Ghemawat, OSDI'04] and the documentation about the
add-ons to hadoop, PIG and hbase.
- Comments: Very high
practical interest, with hard problems from both the algorithmic and
the systems side. There is a lot of room, even for 4 or
more people. A toy map-reduce sketch of one PageRank round appears after this project entry.
- Contact person:
instructor.
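- Starter sketch: one PageRank round written as explicit map and reduce steps, simulated locally in Python to illustrate the message-grouping pattern before porting to hadoop; this is not the PEGASUS code, and the toy edge list is made up.

    # Sketch: one GIM-V / PageRank round expressed as map and reduce steps,
    # simulated locally.  Only the message-grouping pattern is illustrated.
    from collections import defaultdict

    edges = [(1, 2), (1, 3), (2, 3), (3, 1)]   # toy <from,to> pairs (assumption)
    rank = {1: 1.0, 2: 1.0, 3: 1.0}
    outdeg = defaultdict(int)
    for s, _ in edges:
        outdeg[s] += 1

    def map_phase(edges, rank):
        """Each edge emits (dst, contribution) -- what a mapper would output."""
        for s, d in edges:
            yield d, rank[s] / outdeg[s]

    def reduce_phase(pairs, damping=0.85):
        """Group contributions by destination and combine -- the reducer."""
        acc = defaultdict(float)
        for node, contrib in pairs:
            acc[node] += contrib
        return {n: (1 - damping) + damping * c for n, c in acc.items()}

    for _ in range(10):
        new_rank = reduce_phase(map_phase(edges, rank))
        # nodes with no in-links keep only the teleport term
        rank = {n: new_rank.get(n, 1 - 0.85) for n in rank}
    print(rank)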
2.4 Non-negative matrix factorization with hadoop and SGD
- Problem: Non-negative
Matrix Factorization, and Matrix Factorizations in general, have proved
useful for many data mining tasks such as matrix completion, concept
discovery, and latent semantic indexing. How well do present
state-of-the-art algorithms scale? Is there a best choice among the existing
algorithms (e.g. "Multiplicative Updates" or (Stochastic) Gradient
Descent) in terms of parallelizability and, ultimately, scalability? In
this project, your task will be to 1) investigate existing algorithms
with respect to their scalability potential, 2) implement your
choice in MapReduce/Hadoop and 3) experiment with one or more real
world datasets (and possibly a synthetic one) in order to a) describe
the findings of the algorithm, and b) demonstrate that your
implementation scales.
- Data: IMDB dataset, DBLP dataset; also come up with a way to generate synthetic data.
- Introductory papers:
- DD Lee and HS Seung. Learning the parts of objects by non-negative matrix factorization, Nature 1999
- DD Lee and HS Seung. Algorithms for Non-negative Matrix Factorization, Advances in Neural Information Processing Systems, 2001
- Rainer Gemulla, Peter J Haas, Erik Nijkamp, and Yannis Sismanis. Large Scale Matrix Factorization with Distributed Stochastic Gradient Descent, ACM KDD 2011
- Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min
Wang. Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic
Data Analysis on MapReduce, ACM WWW 2010
- Scott Deerwester, Susan T. Dumais, George W. Furnas,
Thomas K. Landauer, Richard Harshman. Indexing by Latent Semantic
Analysis, Journal of the American Society for Information Science,
1990
- Comments: The paper of Gemulla+ is especially interesting. A single-machine sketch of the multiplicative updates appears after this project entry.
- Contact Person: Vagelis Papalexakis, instructor.
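- Starter sketch: the Lee & Seung multiplicative updates (Frobenius-norm objective) on a single machine with numpy, as a correctness and speed baseline before any MapReduce implementation; the matrix sizes and rank below are arbitrary.

    # Sketch: Lee & Seung multiplicative updates for NMF (Frobenius-norm objective),
    # single machine, numpy.  Matrix sizes and rank are arbitrary assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    V = rng.random((100, 80))        # non-negative data matrix (e.g., doc-term counts)
    k = 10                           # factorization rank
    W = rng.random((100, k))
    H = rng.random((k, 80))
    eps = 1e-9                       # avoid division by zero

    for it in range(200):
        # Multiplicative updates: guaranteed not to increase ||V - WH||_F^2
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)

    print("reconstruction error:", np.linalg.norm(V - W @ H))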
3. SPATIO/TEMPORAL AND STREAM MINING
3.1 Guess the next flu spike: Co-evolving time series mining
- Problem: Given time series of patients (blood pressure
over time, etc), and class labels ('healthy', 'unhealthy') extract
features and do classification. Or, given a set of sequences of,
say, BGP updates, find correlations and anomalies (BGP = Border
Gateway Protocol, in computer networks). In yet another scenario,
consider monitoring a data center (like the Self-* system or the
Data Center Observatory, both at CMU/PDL). Another application is
monitoring environmental data, to spot, say, global warming,
deforestation, etc. - see the web page of Prof. Vipin
Kumar.
- Data:
- Very interesting dataset: from the Tycho project - epidemiology time series, with # of infected people per unit time per US city per disease. Other data include:
- From the physionet.org collection
- Introductory paper(s): For spikes in epidemiology data, check the 'spikeM' model [KDD'12]. For BGP, check [Prakash+,
KDD'09] (or here, for a more detailed version). For data center monitoring, check the
SPIRIT project, and the corresponding publication OSR06.
Also the lag-correlation paper [Sakurai+,
SIGMOD'05], and the DynaMMo method (Kalman filters for
missing values [Li+, KDD'09]).
- Comments: Start with Fourier and wavelets, for features; a small Fourier-feature sketch appears after this project entry.
For the 'tycho' data, try the 'spikeM' method. Check the 'DynaMMo' and 'PLiF' methods. For the physionet data, one
challenge is how to handle the several wrong recordings (e.g.,
blood pressure ~ 0). Depending on the composition of the team, the
project could focus on any of the above settings (environment only;
datacenter only; etc.).
- Contact person: instructor.
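- Starter sketch: low-frequency Fourier magnitudes as features, followed by leave-one-out 1-nearest-neighbor classification; the synthetic 'healthy'/'unhealthy' series below are stand-ins for real patient data.

    # Sketch: low-frequency Fourier coefficients as features for time-series
    # classification ('healthy' vs 'unhealthy').  Synthetic series are assumptions.
    import numpy as np

    def fourier_features(series, n_coeffs=8):
        """Magnitudes of the first few DFT coefficients, normalized by length."""
        coeffs = np.fft.rfft(series - series.mean())
        return np.abs(coeffs[1:n_coeffs + 1]) / len(series)

    rng = np.random.default_rng(1)
    t = np.linspace(0, 10, 500)
    healthy   = [np.sin(2 * np.pi * 0.3 * t) + 0.2 * rng.standard_normal(t.size) for _ in range(20)]
    unhealthy = [np.sin(2 * np.pi * 0.7 * t) + 0.2 * rng.standard_normal(t.size) for _ in range(20)]

    X = np.vstack([fourier_features(s) for s in healthy + unhealthy])
    y = np.array([0] * 20 + [1] * 20)

    # Leave-one-out 1-nearest-neighbor classification on the Fourier features.
    correct = 0
    for i in range(len(y)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        correct += (y[d.argmin()] == y[i])
    print("LOO accuracy:", correct / len(y))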
DATASETS
Unless explicitly mentioned, the datasets are either
'public' or 'owned' by the instructor; for the rest, we need to
discuss 'non-disclosure agreements' (NDAs).
Time sequences
- Time series
repository at UCR.
- KURSK
dataset of multiple time sequences: time series from
seismological sensors by the explosion site of the 'Kursk'
submarine.
- Truck traffic data, from our Civil Engineering
Department: number of trucks, weight, etc., per day per highway lane.
Find patterns, outliers; do data cleansing.
- River-level / hydrology data: multiple,
correlated time series. Do data cleansing; find correlations
between these series. Excellent project for people that like
canoeing!
- Sunspots:
number of sunspots per unit time. Some data are here.
Sunspots seem to have an 11-year periodicity, with high
spikes.
- Time sequences from the
Santa Fe Institute forecasting competition (financial data,
laser-beam oscillation data, patients' apnea data, etc.)
- Disk access
traces, from HP Labs (we have local copies at CMU).
For each disk access, we have the timestamp, the block-id, and the
type ('read'/'write'). Here is a
snippet of the data, aggregated per 30 minutes.
- Network traffic data from datapository.net at CMU
- Motion-capture data from CMU mocap.cmu.edu
Spatial data
- Astrophysics data - thousands of galaxies, with
coordinates, red-shift, spectra, photographs.
Small snippet of the data. More data are in the 'skyserver' web
site, where you can ask SQL queries and
get data in HTML or CSV format
- Synthetic astrophysics
data: 1K (x, y, z, weight) tuples, from Prof. Rupert Croft
(CMU). The full dataset is 200 MB compressed - contact the
instructor.
- Road segments: several datasets with line segments
(roads of U.S. counties, Montgomery MD, Long Beach CA, x-y
coordinates of stars in the sky from NASA, etc).
Snippet of data (roads from California, from TIGER).
Graph data
Miscellaneous:
SOFTWARE
Notes for the software: Before you modify any code,
please contact the instructor - ideally, we would like to use these
packages as black boxes.
- Readily available:
- ACCESS METHODS
- DR-tree: R-tree code; searches for range and nearest-neighbor
queries. In C.
- kd-tree code
- OMNI
trees - a faster version of metric trees.
- B-tree code, for text (should be easily changed to handle
numbers, too). In C.
- SVD AND TENSORS:
- FRACTALS
- GRAPHS
- the PEGASUS
package for graph mining on hadoop.
- the
NetMine network topology analysis package
- GMine:
interactive graph visualization package and graph manipulation
library (by Junio (Jose Fernandez Rodrigues Junior) and Jure
Leskovec)
- the 'crossAssociation' package for graph partitioning.
- Outside CMU:
- GiST package from
Hellerstein at UC Berkeley: A general spatial access method, which
is easy to customize. It is already customized to yield
R-trees.
- hadoop, PIG and hbase
- pajek,
jung, graphviz, guess, cytoscape, for (small) graph
visualization
- METIS,
for graph partitioning
BIBLIOGRAPHICAL RESOURCES:
Last modified Sept. 17, 2012, by Christos Faloutsos.