Carnegie Mellon University
15-826 Multimedia Databases and Data Mining
Fall 2013 - C. Faloutsos
List of suggested projects
The projects are grouped according to their general theme. We also
list the data and software available, to leverage your effort. More
links and resources may be added in the future. Reminders:
- URL for this very page (internal to CMU - please treat it
'confidentially'):
www.cs.cmu.edu/~christos/courses/826.F13/CMU-ONLY/projlist.html
- The default projects are strongly recommended for the majority of the students.
- Please form groups of 2 people.
- Please check the 'blackboard' system, where we will create one thread for
each of the projects below. Please indicate your interest by posting in the
appropriate thread(s), so that you can find partners.
SUGGESTED TOPICS
People who take the class for their master's degree are strongly
recommended to choose one of the two default projects, with the first one
being the most recommended. They are both well defined, with a lot of
implementation and rather predictable outcomes.
The rest of the projects are more open-ended; they are more suitable for
people who want to do research in data mining.
1. DEFAULT PROJECTS - for people in M.Sc. programs.
1.1 Default project #1: UCR insect dataset
Given a large collection of labeled insect sound-clips,
design a good distance function to distinguish
malaria-carrying mosquitoes from other insects.
See the full description of the Insect Mining project here, in pdf.
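A possible warm-up, in Python (not part of the official hand-out; the toy signals below are hypothetical stand-ins for real clips): compare clips by the Euclidean distance between their normalized frequency spectra, and classify with one nearest neighbor.

    import numpy as np

    def spectrum_features(clip, n_bins=1024):
        # magnitude spectrum, normalized to unit length for rough scale invariance
        mag = np.abs(np.fft.rfft(np.asarray(clip, dtype=float)))
        mag = mag / (np.linalg.norm(mag) + 1e-12)
        out = np.zeros(n_bins)
        out[:min(n_bins, len(mag))] = mag[:n_bins]
        return out

    def spectral_distance(clip_a, clip_b):
        # candidate distance function between two sound clips
        return np.linalg.norm(spectrum_features(clip_a) - spectrum_features(clip_b))

    def classify_1nn(query_clip, labeled_clips):
        # labeled_clips: list of (clip, label); return the label of the nearest clip
        return min(labeled_clips, key=lambda cl: spectral_distance(query_clip, cl[0]))[1]

    if __name__ == "__main__":
        # toy stand-ins: a 600 Hz 'mosquito' tone vs. a 200 Hz 'other' tone
        rng = np.random.default_rng(0)
        t = np.linspace(0, 1, 8000)
        mosquito = np.sin(2 * np.pi * 600 * t) + 0.1 * rng.standard_normal(t.size)
        other    = np.sin(2 * np.pi * 200 * t) + 0.1 * rng.standard_normal(t.size)
        query    = np.sin(2 * np.pi * 600 * t + 0.5) + 0.2 * rng.standard_normal(t.size)
        print(classify_1nn(query, [(mosquito, "mosquito"), (other, "other")]))  # 'mosquito'

A real distance function would need to cope with different clip lengths, recording conditions and harmonics; this is only a baseline to beat.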
1.2 Default project #2: Graph mining using RDBMS
Given about 100 real graphs, do we see common trends?
Do they all have small diameter ('six degrees')?
If not, which ones deviate, and why?
Answer all these questions using traditional SQL, which, as it turns out,
is powerful enough to answer a long list of graph-mining queries
(with query optimization coming for free!).
Implement PageRank, diameter, connected components, etc.,
in SQL, and apply your code to a long list of graph
datasets, to spot general patterns and deviations.
See the full description of the Graph Mining project here, in pdf.
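To make the 'graphs via SQL' idea concrete, here is a minimal sketch using Python's built-in sqlite3, so it runs anywhere; the table and column names are our own choices, not prescribed by the project hand-out. It runs a fixed number of PageRank power iterations as plain SQL (dangling nodes are ignored, so the scores are only approximate).

    import sqlite3

    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    cur.executescript("""
    CREATE TABLE edges(src INTEGER, dst INTEGER);
    INSERT INTO edges VALUES (1,2),(2,3),(3,1),(3,4);
    CREATE TABLE pr(node INTEGER PRIMARY KEY, score REAL);
    -- initial score 1/|V| (4 nodes in this toy graph)
    INSERT INTO pr SELECT src, 0.25 FROM edges UNION SELECT dst, 0.25 FROM edges;
    CREATE TABLE outdeg AS SELECT src AS node, COUNT(*) AS d FROM edges GROUP BY src;
    """)

    DAMPING = 0.85
    for _ in range(20):                      # fixed number of power iterations
        cur.executescript(f"""
        CREATE TABLE pr_new AS
          SELECT p.node,
                 (1.0 - {DAMPING}) / (SELECT COUNT(*) FROM pr)
                 + {DAMPING} * COALESCE(SUM(q.score / o.d), 0.0) AS score
          FROM pr p
          LEFT JOIN edges e  ON e.dst = p.node
          LEFT JOIN pr q     ON q.node = e.src
          LEFT JOIN outdeg o ON o.node = e.src
          GROUP BY p.node;
        DROP TABLE pr;
        ALTER TABLE pr_new RENAME TO pr;
        """)

    # node 3 ends up with the highest score in this toy graph
    print(cur.execute("SELECT node, ROUND(score,3) FROM pr ORDER BY score DESC").fetchall())

Connected components and diameter estimates can be written in the same style (iterated joins plus aggregates); the project hand-out is the authoritative specification.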
2. OPEN-ENDED PROJECTS - GRAPH MINING
2.1 Spam Detection for Review Data
- Problem: Review
data provides valuable information about products and services. Review
data is ubiquities on websites as Amazon, Yelp or Tripadvisor, and is
being frequently used by customers to
choose among competing products or services. Since reviews highly
affect the buying behaviour of customers, spammers try to mislead the
users by writing fake reviews. The goal of this project is to develop
methods to detect users showing spamming behaviour. We want to start
with a feature based detection of spammers: What are the
characteristics of a spammer? Which features can be used to
discriminate between spammers and non-spammers? Are these features
useful for all users or only for a subset of users? Based on this
feature representation, automatic methods to classify/rank the users
regarding their spamming behaviour should be developed exploiting,
e.g., the principles of subspace clustering/co-clustering or low rank
matrix factorization.
- Data: The participants can test their methods on multiple review datasets such as Amazon (6M reviews) and Yelp (300K reviews).
- Introductory material:
- Paper on review spam: Arjun Mukherjee, Abhinav Kumar, Bing
Liu, Junhui Wang, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh.
Spotting Opinion Spammers using Behavioral Footprints. SIGKDD
International Conference on Knowledge Discovery and Data Mining
(KDD-2013), August 11-14 2013 in Chicago, USA.
- Overview of subspace clustering techniques: Hans-Peter Kriegel,
Peer Kroeger, Arthur Zimek: Clustering high-dimensional data: A survey
on subspace clustering, pattern-based clustering, and correlation
clustering. TKDD 3(1) (2009)
- Contact Person: Dr. Stephan Guennemann.
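- Starter sketch: a minimal, illustrative example in Python of per-user behavioural features that could feed such a classifier; the review tuples and the 'suspiciousness' score below are our own toy choices, not the actual Amazon/Yelp format or any published method.

    import numpy as np
    from collections import defaultdict

    # each review: (user_id, product_id, stars, text) -- hypothetical layout
    reviews = [
        ("u1", "p1", 5, "great great great"),
        ("u1", "p2", 5, "great great great"),
        ("u1", "p3", 5, "great great great"),
        ("u2", "p1", 3, "decent product, battery life could be better"),
        ("u2", "p4", 4, "works as advertised, shipping was slow"),
    ]

    def user_features(reviews):
        per_user = defaultdict(list)
        for user, _prod, stars, text in reviews:
            per_user[user].append((stars, text))
        feats = {}
        for user, rs in per_user.items():
            stars = np.array([s for s, _ in rs], dtype=float)
            texts = [t for _, t in rs]
            feats[user] = {
                "n_reviews":      len(rs),
                "frac_extreme":   float(np.mean((stars == 5) | (stars == 1))),
                "rating_std":     float(stars.std()),
                "avg_len":        float(np.mean([len(t.split()) for t in texts])),
                "dup_text_ratio": 1.0 - len(set(texts)) / len(texts),
            }
        return feats

    # crude suspiciousness score: extreme, low-variance, duplicated reviews
    for user, f in user_features(reviews).items():
        score = f["frac_extreme"] + f["dup_text_ratio"] - 0.1 * f["rating_std"]
        print(user, round(score, 2), f)

The project proper would replace the hand-made score with a learned model (or subspace clustering / matrix factorization over the feature matrix).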
2.2. Is modern spam detection research actually working? (Bipartite core detection)
- Motivation: Many researchers in data mining focus on detecting spam by
finding groups of users acting together to spam some item online. For
example, we have found users on eBay working together to boost each other's
reputation, accounts on Facebook all liking similar Pages to boost their
credibility, and accounts on Twitter all following certain accounts to
boost their appearance of being famous. A quick search on eBay will find
examples of this. As a result, data mining methods look for certain graph
patterns, large dense bipartite cores in particular, to detect such
behavior. Unfortunately, some honest, good users can create these graph
patterns inadvertently.
- Problem: What are the sizes and densities of naturally occurring
bipartite cores in different data sets? Knowing the distributions of
bipartite cores would be interesting for community-detection research, for
understanding user behavior in different contexts (buying products vs.
following on Twitter), and for quantifying the robustness of
state-of-the-art spam detection methods in the real world. If there are
many large, naturally occurring groups of users acting together, then much
of the academic research on spam detection would have to be rethought. (A
crude measurement sketch is given at the end of this project's
description.)
- Data: Amazon (6M reviews), Yelp (300k reviews), possibly Twitter
graph, and any other data sets you could scrape.
- Introductory Material:
- CopyCatch: Stopping Group Attacks by Spotting Lockstep Behavior in Social Networks.
Alex Beutel, Wanhong Xu, Venkatesan Guruswami, Christopher Palow,
Christos Faloutsos. Proceedings of the 22nd International Conference on
World Wide Web (WWW), 2013.
- MAFIA: A Maximal Frequent Itemset Algorithm. Doug Burdick, Manuel Calimlim, Jason Flannick, Johannes Gehrke, and Tomi Yiu.
- Itemset mining in noisy contexts: a hybrid approach. Karima Mouhoubi, Lucas Letocart, Celine Rouveirol.
- Flexible Fault Tolerant Subspace Clustering for Data with Missing Values.
Stephan Guennemann, Emmanuel Muller, Sebastian Raubach, Thomas Seidl.
- Contact: Alex Beutel (TA for class)
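- Starter sketch: one crude proxy for 'how large do natural dense bipartite cores get' (our own simplification, not CopyCatch or any cited method): iteratively prune a user-item graph down to its (s,t)-core and measure what survives.

    from collections import Counter

    def bipartite_core(edges, s=2, t=2):
        # edges: iterable of (user, item) pairs.
        # Keep only edges whose user has >= s surviving edges and whose item
        # has >= t surviving edges; repeat until stable ((s,t)-core).
        live = set(edges)
        while True:
            udeg = Counter(u for u, _ in live)
            ideg = Counter(i for _, i in live)
            pruned = {(u, i) for (u, i) in live if udeg[u] >= s and ideg[i] >= t}
            if pruned == live:
                return {u for u, _ in live}, {i for _, i in live}
            live = pruned

    if __name__ == "__main__":
        edges = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"),
                 ("u3", "a"), ("u3", "b"), ("u4", "c")]          # toy graph
        users, items = bipartite_core(edges, s=2, t=2)
        print(users, items)   # u4 and item c are pruned away

Sweeping s and t over a real review graph, and recording the surviving core sizes, already gives a first answer to the question above.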
2.3. Adversarial Spam Injection
- Problem: Many researchers in data mining focus on detecting spam in data
sets from the internet, particularly focusing on unusual graph structures
left by spammers. However, researchers often focus on the strengths of
their algorithms rather than their vulnerability to smart attackers. How
well can you get around state-of-the-art machine learning and data mining
methods for detecting spam? How much spam can you add to a dataset without
being caught? (A toy spam-injection sketch is given at the end of this
project's description.)
- Data: Try injecting spam into Amazon (6M reviews), Yelp (300k
reviews), possibly Twitter graph, and any other data sets you could
scrape.
- Introductory Material:
- CopyCatch: Stopping Group Attacks by Spotting Lockstep Behavior in Social Networks.
Alex Beutel, Wanhong Xu, Venkatesan Guruswami, Christopher Palow,
Christos Faloutsos. Proceedings of the 22nd International Conference on
World Wide Web (WWW), 2013.
- EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs. B. Aditya Prakash, Ashwin Sridharan, Mukund Seshadri, Sridhar Machiraju, Christos Faloutsos.
- NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks.
Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos
Faloutsos. International Conference on World Wide Web (WWW) 2007. May
8-12, 2007. Banff, Alberta, Canada.
- Comment: Use algorithms from the above methods and compare how
robust they are to different spamming techniques in different data sets.
(Depending on the scope, I may be able to provide source code for some of
the above methods.)
- Contact: Alex Beutel (TA for class)
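- Starter sketch: a toy 'attacker side' injector (our own construction, not taken from any of the papers above): it adds a block of fake accounts that all review a few target items, plus random 'camouflage' reviews to look less blocky; the detection algorithms above would then be run on the before/after edge lists.

    import random

    def inject_spam(edges, n_fake_users=50, targets=("target1", "target2"),
                    camouflage_per_user=3, seed=0):
        # edges: list of (user, item). Returns edges plus injected fake reviews.
        rng = random.Random(seed)
        existing_items = list({item for _, item in edges})
        injected = list(edges)
        for k in range(n_fake_users):
            fake = f"fake_user_{k}"
            for item in targets:                      # the actual boosting edges
                injected.append((fake, item))
            # camouflage: random existing items (a real attacker would pick popular ones)
            for item in rng.sample(existing_items,
                                   min(camouflage_per_user, len(existing_items))):
                injected.append((fake, item))
        return injected

    if __name__ == "__main__":
        real = [("u1", "a"), ("u2", "b"), ("u3", "a"), ("u3", "c")]
        print(len(inject_spam(real)))   # 4 real edges + 50 * (2 targets + 3 camouflage) = 254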
2.4. Outliers: Scalable low-rank plus sparse matrix decompositions using Hadoop
- Problem: Decomposing a large data matrix into the superposition of a
low-rank component plus a sparse component has found widespread
applicability in diverse data mining tasks. Some exciting examples include
foreground/background separation from video surveillance streams, unveiling
Internet traffic anomalies, singing voice separation from music
accompaniment, matrix completion with outliers, and latent semantic
indexing, to name a few. Check also this website for more applications. In
this context, the goal of the present project is to investigate and
empirically demonstrate how these decomposition models scale to modern
massive datasets. Current algorithms seeking the desired parallelizability
either involve incremental stochastic gradient descent iterations, or
distributed proximal algorithms such as the alternating-direction method of
multipliers. The main task in this project will be to implement your
favorite algorithm in MapReduce/Hadoop (we have a small cluster to which we
can give you access), and try it out on a few of the (synthetic and
real-world) datasets listed next. (A tiny serial prototype is sketched at
the end of this project's description.)
- Data: For starters, synthetic data is always useful to test the
algorithms for correctness and to perform controlled scalability
experiments. Some useful real datasets include video streams to perform
background modeling and foreground extraction, Internet traffic data for
anomaly identification (more data here), and the MIR-1K dataset for singing
voice separation. We also have a who-talks-to-whom social network dataset
(will need NDA) involving 270 million nodes and 8 billion edges, where one
would like to identify structure (low rank) and (sparse) outliers.
- Introductory papers: Basic papers on low-rank plus sparse matrix
decompositions are, e.g., Candes et al'11 and Mateos-Giannakis'12; the
tutorial/short course by Yi Ma could also be useful. Distributed (but not
performance-optimized) algorithms for matrix decomposition were developed
in this paper.
- Comments: There is no need to spend time on the theoretical and
performance aspects studied in the aforementioned introductory papers; stay
focused on the algorithms and applications. Sequential (online) algorithms
have also been proposed for scalability and real-time mining of streaming
data; see, e.g., this paper and this website. Tensor (multi-way array
generalizations of matrices) extensions are also worth pursuing and
implementing. We can of course discuss these if you are interested.
- Contact Persons: Dr. Gonzalo Mateos; instructor.
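- Starter sketch: a tiny serial NumPy prototype (not the Hadoop version, and not any specific paper's algorithm) of the low-rank plus sparse split, via naive alternation between a truncated SVD for the low-rank part and entrywise soft-thresholding for the sparse part; the rank, threshold and iteration count are ad hoc.

    import numpy as np

    def lowrank_plus_sparse(M, rank=2, sparse_thresh=0.5, n_iter=50):
        # return (L, S) with M ~ L + S, L of the given rank, S entrywise sparse
        S = np.zeros_like(M)
        for _ in range(n_iter):
            # best rank-r approximation of (M - S) via truncated SVD
            U, sig, Vt = np.linalg.svd(M - S, full_matrices=False)
            L = (U[:, :rank] * sig[:rank]) @ Vt[:rank, :]
            # soft-threshold the residual to get the sparse part
            R = M - L
            S = np.sign(R) * np.maximum(np.abs(R) - sparse_thresh, 0.0)
        return L, S

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        L_true = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 40))
        S_true = np.zeros((50, 40))
        idx = rng.choice(50 * 40, size=60, replace=False)
        S_true.flat[idx] = 10 * rng.standard_normal(60)      # gross outliers
        L_hat, S_hat = lowrank_plus_sparse(L_true + S_true)
        print("relative error on L:",
              np.linalg.norm(L_hat - L_true) / np.linalg.norm(L_true))

The Hadoop version would distribute the two update steps (e.g., block-wise gradient or ADMM updates) rather than call a full SVD, following the papers above.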
3. OPEN ENDED PROJECTS - STREAM MINING
3.1 Change Detection for Product Ratings
- Problem: Many websites such as Amazon, Tripadvisor, or Yelp allow the
users to rate the quality of products or services. These ratings are not
static but change over time. Given such a time-series of product ratings,
how can we detect and explain a change of the product's evaluation? In this
project, we want to analyse the correlation between the rating of a product
and its corresponding review text. Does a significant change in the
product's ratings also induce a change of the discussed topics? We start by
using existing change detection methods to detect points of inflection, and
text analysis measures such as tf-idf (term frequency-inverse document
frequency) to study the potential correlations. As a further step, we
envision the development of extended techniques based on topic mining and
methods integrating the change detection step into the text analysis
process. (A toy starter sketch is given at the end of this project's
description.)
- Data: Data from the Amazon (6M reviews) and Yelp (300K reviews) websites can be made available.
- Introductory material:
-
Li-Chen Cheng, Zhi-Han Ke, Bang-Min Shiue: Detecting changes of opinion from customer reviews. FSKD 2011:1798-1802
- Ludmila I. Kuncheva: Change Detection in Streaming Multivariate
Data Using Likelihood Detectors. IEEE Trans. Knowl. Data Eng. (TKDE)
25(5):1175-1180 (2013)
- Contact Persons: Nikou Guennemann, Dr. Stephan Guennemann.
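- Starter sketch: a toy baseline in Python (our own simplification, not one of the cited detectors): flag the point where the means of the ratings before and after differ the most, then compare word frequencies of the reviews before vs. after that point; tf-idf or topic models would be the natural refinement.

    import numpy as np
    from collections import Counter

    def change_point(ratings, w=20, z_thresh=3.0):
        # index where the means of the w ratings before and after differ the
        # most (in standard errors), provided the difference exceeds z_thresh
        r = np.asarray(ratings, dtype=float)
        best_t, best_z = None, z_thresh
        for t in range(w, len(r) - w + 1):
            before, after = r[t - w:t], r[t:t + w]
            se = np.sqrt(before.var() / w + after.var() / w) + 1e-9
            z = abs(after.mean() - before.mean()) / se
            if z > best_z:
                best_t, best_z = t, z
        return best_t

    def top_terms(texts, k=5):
        # plain term counts; swapping in tf-idf is the obvious next step
        counts = Counter(word for txt in texts for word in txt.lower().split())
        return [w for w, _ in counts.most_common(k)]

    if __name__ == "__main__":
        ratings = [5] * 40 + [2] * 40                   # toy series with a drop
        reviews = ["love the battery"] * 40 + ["screen cracked fast"] * 40
        t = change_point(ratings)
        print("change detected at index", t)            # expect 40
        print("before:", top_terms(reviews[:t]), "after:", top_terms(reviews[t:]))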
3.2 Guess the next flu spike: Co-evolving time series mining
- Problem: Given time series of patients (blood pressure over time, etc.)
and class labels ('healthy', 'unhealthy'), extract features and do
classification. Or, given a set of sequences of, say, BGP updates, find
correlations and anomalies (BGP = Border Gateway Protocol, in computer
networks). In yet another scenario, consider monitoring a data center (like
the Self-* system or the Data Center Observatory, both at CMU/PDL). Another
application is monitoring environmental data, to spot, say, global warming,
deforestation, etc. - see the web page of Prof. Vipin Kumar. (A small
lag-correlation sketch is given at the end of this project's description.)
- Data
- Very interesting dataset, from the Tycho project: epidemiology time
series, with the number of infected people per unit time, per US city, per
disease.
- Other data include the physionet.org collection.
- Introductory paper(s): For spikes in epidemiology data, check the
'spikeM' model [KDD'12]. For BGP, check [Prakash+, KDD'09] (or here, for a
more detailed version). For data center monitoring, check the SPIRIT
project, and the corresponding publication OSR06. Also see the
lag-correlation paper [Sakurai+ SIGMOD'05], and the DynaMMo method (Kalman
filters for missing values [Li+ KDD'09]).
- Comments: Start with Fourier and wavelets, for features. For the Tycho
data, try the 'spikeM' method. Check the 'DynaMMo' and 'PLiF' methods. For
the physionet data, one challenge is how to handle the several wrong
recordings (e.g., blood pressure ~ 0). Depending on the composition of the
team, the project could focus on any of the above settings (environment
only; datacenter only; etc.).
- Contact person: instructor.
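- Starter sketch: a small warm-up for the co-evolving-series setting (our own toy, not the SPIRIT or lag-correlation algorithms from the papers above): estimate the lag at which two series are most correlated, via brute-force normalized cross-correlation.

    import numpy as np

    def best_lag(x, y, max_lag=50):
        # return (lag, corr) maximizing the correlation of x[t] with y[t + lag]
        x = (np.asarray(x, float) - np.mean(x)) / (np.std(x) + 1e-12)
        y = (np.asarray(y, float) - np.mean(y)) / (np.std(y) + 1e-12)
        best = (0, -np.inf)
        for lag in range(-max_lag, max_lag + 1):
            if lag >= 0:
                a, b = x[:len(x) - lag], y[lag:]
            else:
                a, b = x[-lag:], y[:len(y) + lag]
            n = min(len(a), len(b))
            if n > 1:
                c = float(np.mean(a[:n] * b[:n]))
                if c > best[1]:
                    best = (lag, c)
        return best

    if __name__ == "__main__":
        t = np.arange(500)
        rng = np.random.default_rng(0)
        city_a = np.sin(2 * np.pi * t / 52)                   # toy weekly counts
        city_b = np.roll(city_a, 7) + 0.05 * rng.standard_normal(500)  # lags by 7
        print(best_lag(city_a, city_b, max_lag=20))           # lag should be 7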
4. OPEN ENDED PROJECTS - TENSORS
4.1 Tensors on Hadoop - 'sparse-3'
- Problem: Tensor decompositions are increasingly popular in data mining
applications. Applying them at web scale, however, is still a challenging
problem; several approaches attempt to tackle this scalability issue [1,2].
A recent line of work [2] uses biased sampling in order to create multiple
tensor sketches, operates on sketch space, and merges the final results. In
this project you will investigate such methods and implement [2] (or a
hybrid) on a distributed storage environment such as Hadoop/MapReduce. The
main idea is what we call `sparse-3' decomposition: (a) starting from a
sparse tensor, (b) we want to derive a sparse decomposition, and (c) have
sparse intermediate results. We hope that a careful implementation of this
kind will have tremendous speed-ups over the traditional methods. (A tiny
in-memory CP-ALS sketch is given at the end of this project's description.)
- Evaluation criteria: For [2], there already exists a Java/Matlab
implementation. The first step to assess whether your implementation is
correct is to verify that the results you obtain are comparable to those of
the original implementation (note: they don't have to be identical).
- Datasets: NELL dataset, Phonecalls (need NDA)
- Introductory Material:
- Contact Persons: Vagelis Papalexakis (TA in the class); instructor.
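- Starter sketch: a tiny in-memory CP-ALS on a coordinate-format sparse tensor, NumPy only and with made-up toy data; the actual project would re-express the MTTKRP step below as distributed/sampled jobs and keep all intermediate results sparse.

    import numpy as np

    def cp_als(coords, vals, shape, rank=2, n_iter=30, seed=0):
        # coords: (nnz, 3) int array; vals: (nnz,) values; returns factors A, B, C
        rng = np.random.default_rng(seed)
        factors = [rng.standard_normal((dim, rank)) for dim in shape]
        for _ in range(n_iter):
            for mode in range(3):
                others = [m for m in range(3) if m != mode]
                # MTTKRP over the nonzeros only (this is the step to distribute)
                M = np.zeros((shape[mode], rank))
                kr = factors[others[0]][coords[:, others[0]]] * \
                     factors[others[1]][coords[:, others[1]]]
                np.add.at(M, coords[:, mode], vals[:, None] * kr)
                # normal equations: Hadamard product of the other Gram matrices
                G = np.ones((rank, rank))
                for m in others:
                    G *= factors[m].T @ factors[m]
                factors[mode] = M @ np.linalg.pinv(G)
        return factors

    if __name__ == "__main__":
        # toy 'who-called-whom-on-which-day' tensor with one heavy block
        coords = np.array([(i, j, k) for i in range(5) for j in range(5) for k in range(3)])
        vals = np.ones(len(coords))
        A, B, C = cp_als(coords, vals, shape=(10, 10, 7), rank=1)
        print(np.round(A[:, 0], 2))   # rows 0-4 (the heavy block) are nonzero; rows 5-9 are zero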
4.2. Tensor decomposition using RDBMS
- Problem: Let's take the 2nd default project a step further. Can SQL be
used to manipulate temporally evolving graphs? We are particularly
interested in applying SQL to the tensor decomposition problem: given a
3-way tensor (for instance, indicating whether person i contacted person j
on day k), we want to find heavy blocks in the tensor. Using the previous
example, we are looking for a set of people that called a set of other
people on a set of days (the output would be a set of these 3 vectors).
There are many algorithms that can be applied to solve this problem, but
can any of them be implemented in SQL (and thus be easily parallelizable)?
(A warm-up SQL sketch is given at the end of this project's description.)
- Data: Any temporal graph will do; we have phone-network, computer-communication-network and email-network data available.
- Introductory material:
- The Pegasus paper with GIM-V is a good starting point to understand how
common matrix operations can be applied in SQL.
- Navasca's presentation is a simple introduction to the CP decomposition
and the ALS method.
- Tamara Kolda and Brett Bader's survey is a more detailed alternative to
understand all the notation and the most common algorithms.
- Comments: This project combines a fair amount of implementation with interesting mathematical problems, and can definitely lead to a publication.
- Contact Persons: Miguel Araujo; Vagelis Papalexakis (TA for class); instructor.
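- Starter sketch: a warm-up for the SQL route, using Python's sqlite3 for portability (the table names are our own): a sparse matrix-vector multiply written as one SQL aggregate, in the spirit of the GIM-V primitive from the Pegasus paper above. A 3-way version would add a third index column and join against two factor tables.

    import sqlite3

    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    cur.executescript("""
    CREATE TABLE mat(i INTEGER, j INTEGER, val REAL);   -- sparse matrix entries
    CREATE TABLE vec(j INTEGER PRIMARY KEY, val REAL);  -- dense vector
    INSERT INTO mat VALUES (0,0,1.0),(0,1,2.0),(1,1,3.0),(2,0,4.0);
    INSERT INTO vec VALUES (0,1.0),(1,10.0);
    """)
    rows = cur.execute("""
        SELECT m.i, SUM(m.val * v.val) AS y
        FROM mat m JOIN vec v ON m.j = v.j
        GROUP BY m.i
        ORDER BY m.i
    """).fetchall()
    print(rows)   # [(0, 21.0), (1, 30.0), (2, 4.0)]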
5. OPEN ENDED PROJECT - BIO-INFORMATICS
- Problem: The goal of this project is to classify individuals into
disease/healthy status based on gene expression profiles. Specifically, in
the training data set, each individual is represented by a vector of real
numbers (gene expression) with a label (sick or healthy). Given the test
data set, you will predict whether an individual is healthy or sick. The
default is to build a classifier, but you can also do feature extraction,
clustering or visualization. (A baseline sketch is given at the end of this
project's description.)
- Data: Alzheimer's disease data set by the Harvard Brain Tissue Research
Center and Merck Research Laboratories.
- Introductory material:
- iPcc: a novel feature extraction method for accurate disease class
discovery and prediction. Xianwen Ren, Yong Wang, Xiang-Sun Zhang, and Qi
Jin. Nucleic Acids Research, 2013.
- Gene expression profiling predicts clinical outcome of breast cancer.
Laura J. van 't Veer et al. Nature, 2002.
- Boosting for tumor classification with gene expression data. Marcel
Dettling and Peter Buhlmann. Bioinformatics, 2002.
- Comments: For this project, feature selection is very useful for
classification, because gene expression profile data has very high
dimensionality (>20000) and only a small number of genes might be truly
associated with the disease. That is, identifying candidate genes useful
for classification as a preprocessing step would be a good idea.
- Contact Person: Seunghak Lee (TA in the class).
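- Starter sketch: a baseline on synthetic data (the real Harvard/Merck set needs its own loader; the effect sizes below are made up): keep the genes with the largest two-sample t-statistics, then classify with a nearest-centroid rule on that reduced feature set.

    import numpy as np

    def select_genes(X, y, k=50):
        # X: (n_samples, n_genes); y: 0/1 labels. Return indices of the top-k genes.
        sick, healthy = X[y == 1], X[y == 0]
        se = np.sqrt(sick.var(axis=0) / len(sick) + healthy.var(axis=0) / len(healthy))
        t = np.abs(sick.mean(axis=0) - healthy.mean(axis=0)) / (se + 1e-12)
        return np.argsort(t)[-k:]

    def nearest_centroid_predict(X_train, y_train, X_test, genes):
        c0 = X_train[y_train == 0][:, genes].mean(axis=0)
        c1 = X_train[y_train == 1][:, genes].mean(axis=0)
        d0 = np.linalg.norm(X_test[:, genes] - c0, axis=1)
        d1 = np.linalg.norm(X_test[:, genes] - c1, axis=1)
        return (d1 < d0).astype(int)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        n, p, informative = 100, 2000, 20               # p >> n, like expression data
        y = rng.integers(0, 2, n)
        X = rng.standard_normal((n, p))
        X[:, :informative] += 1.5 * y[:, None]          # only a few genes matter
        genes = select_genes(X[:80], y[:80], k=20)
        acc = np.mean(nearest_centroid_predict(X[:80], y[:80], X[80:], genes) == y[80:])
        print("held-out accuracy:", acc)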
DATASETS
Unless explicitly mentioned, the datasets are either 'public' or 'owned' by
the instructor; for the rest, we need to discuss 'Non-disclosure
agreements' (NDAs).
Time sequences
- Time series repository at UCR.
- KURSK dataset of multiple time sequences: time series from seismological
sensors by the explosion site of the 'Kursk' submarine.
- Truck traffic data, from our Civil Engineering Department. Number of
trucks, weight, etc., per day per highway lane. Find patterns, outliers; do
data cleansing.
- River-level / hydrology data: multiple, correlated time series. Do data
cleansing; find correlations between these series. Excellent project for
people who like canoeing!
- Sunspots:
number of sunspots per unit time. Some data are here.
Sunspots seem to have an 11-year periodicity, with high
spikes.
- Time sequences from the Santa Fe Institute forecasting competition
(financial data, laser-beam oscillation data, patients' apnea data, etc.).
- Disk access traces, from HP Labs (we have local copies at CMU). For each
disk access, we have the timestamp, the block-id, and the type
('read'/'write'). Here is a snippet of the data, aggregated per 30 minutes.
- Network traffic data from datapository.net at CMU
- Motion-capture data from CMU mocap.cmu.edu
Spatial data
- Astrophysics data - thousands of galaxies, with coordinates, red-shift,
spectra, photographs. Small snippet of the data. More data are in the
'skyserver' web site, where you can ask SQL queries and get data in HTML or
CSV format.
- Synthetic astrophysics
data: 1K of (x,y,z, weight) tuples, from Prof. Rupert Croft
(CMU). The full dataset is 200Mb compressed - contact
instructor.
- Road segments: several datasets with line segments
(roads of U.S. counties, Montgomery MD, Long Beach CA, x-y
coordinates of stars in the sky from NASA, etc).
Snippet of data (roads from California, from TIGER).
Graph data - need NDA
- YahooWeb crawl (120Gb, 1B nodes, 6B edges). Needs a mild NDA.
- Web-log and click-stream data (NDA needed).
- Call graphs: snapshots of anonymized (and anonymous) who-calls-whom
graphs (NDA).
Graph Data - public
Miscellaneous:
SOFTWARE
Notes for the software: Before you modify any code,
please contact the instructor - ideally, we would like to use these
packages as black boxes.
- Readily available:
- ACCESS METHODS
- DR-tree: R-tree code; searches for range and nearest-neighbor queries. In C.
- kd-tree code
- OMNI trees - a faster version of metric trees.
- B-tree code, for text (should be easily changed to handle numbers, too). In C.
- SVD AND TENSORS:
- FRACTALS
- GRAPHS
- the PEGASUS package for graph mining on Hadoop.
- the NetMine network topology analysis package
- GMine: interactive graph visualization package and graph manipulation
library (by Junio (Jose Fernandez Rodrigues Junior) and Jure Leskovec)
- the 'crossAssociation' package for graph partitioning.
- Outside CMU:
- GiST package from Hellerstein at UC Berkeley: a general spatial access
method, which is easy to customize. It is already customized to yield
R-trees.
- Hadoop, Pig and HBase
- Pajek, JUNG, Graphviz, GUESS, Cytoscape, for (small) graph visualization
- METIS, for graph partitioning
BIBLIOGRAPHICAL RESOURCES:
Last modified Sept. 16, 2013, by Christos Faloutsos.