Carnegie Mellon University
15-826 Multimedia Databases and Data Mining
Fall 2013 - C. Faloutsos
List of suggested projects
The projects are grouped according to their general theme. We also
list the data and software available, to leverage your effort. More
links and resources may be added in the future. Reminders:
- URL for this very page (internal to CMU - please treat it
'confidentially'):
www.cs.cmu.edu/~christos/courses/826.F13/CMU-ONLY/projlist.html
- The default projects are strongly recommended for the majority of the students.
- Please form groups of 2 people.
- Please check the 'blackboard' system, where we will create one thread for
each of the projects below. Please indicate your interest by posting in the
appropriate thread(s), so that you can find partners.
SUGGESTED TOPICS
People who take the class for their master's degree are strongly
recommended to choose one of the two default projects, with the first one
being the most recommended. They are both well defined, with a lot of
implementation and rather predictable outcomes.
The rest of the projects are more open-ended; they are more suitable for
people who want to do research in data mining.
1. DEFAULT PROJECTS - for people in M.Sc. programs.
1.1 Default project #1: UCR insect dataset
Given a large collection of labeled insect sound-clips,
design a good distance function to distinguish
malaria-carrying mosquitoes from other insects.
See the full description of the Insect Mining project here, in pdf.
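A possible warm-up, in Python (not part of the official hand-out; the toy signals below are hypothetical stand-ins for real clips): compare clips by the Euclidean distance between their normalized frequency spectra, and classify with one nearest neighbor.

    import numpy as np

    def spectrum_features(clip, n_bins=1024):
        # magnitude spectrum, normalized to unit length for rough scale invariance
        mag = np.abs(np.fft.rfft(np.asarray(clip, dtype=float)))
        mag = mag / (np.linalg.norm(mag) + 1e-12)
        out = np.zeros(n_bins)
        out[:min(n_bins, len(mag))] = mag[:n_bins]
        return out

    def spectral_distance(clip_a, clip_b):
        # candidate distance function between two sound clips
        return np.linalg.norm(spectrum_features(clip_a) - spectrum_features(clip_b))

    def classify_1nn(query_clip, labeled_clips):
        # labeled_clips: list of (clip, label); return the label of the nearest clip
        return min(labeled_clips, key=lambda cl: spectral_distance(query_clip, cl[0]))[1]

    if __name__ == "__main__":
        # toy stand-ins: a 600 Hz 'mosquito' tone vs. a 200 Hz 'other' tone
        rng = np.random.default_rng(0)
        t = np.linspace(0, 1, 8000)
        mosquito = np.sin(2 * np.pi * 600 * t) + 0.1 * rng.standard_normal(t.size)
        other    = np.sin(2 * np.pi * 200 * t) + 0.1 * rng.standard_normal(t.size)
        query    = np.sin(2 * np.pi * 600 * t + 0.5) + 0.2 * rng.standard_normal(t.size)
        print(classify_1nn(query, [(mosquito, "mosquito"), (other, "other")]))  # 'mosquito'

A real distance function would need to cope with different clip lengths, recording conditions and harmonics; this is only a baseline to beat.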
1.2 Default project #2: Graph mining using RDBMS
Given about 100 real graphs, do we see common trends?
Do they all have small diameter ('six degrees')?
If not, which ones deviate, and why?
Answer all these questions using traditional SQL, which, as it turns out,
is powerful enough to answer a long list of graph-mining queries
(with query optimization coming for free!).
Implement PageRank, diameter, connected components, etc.,
in SQL, and apply your code to a long list of graph
datasets, to spot general patterns and deviations.
See the full description of the Graph Mining project here, in pdf.
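To make the 'graphs via SQL' idea concrete, here is a minimal sketch using Python's built-in sqlite3, so it runs anywhere; the table and column names are our own choices, not prescribed by the project hand-out. It runs a fixed number of PageRank power iterations as plain SQL (dangling nodes are ignored, so the scores are only approximate).

    import sqlite3

    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    cur.executescript("""
    CREATE TABLE edges(src INTEGER, dst INTEGER);
    INSERT INTO edges VALUES (1,2),(2,3),(3,1),(3,4);
    CREATE TABLE pr(node INTEGER PRIMARY KEY, score REAL);
    -- initial score 1/|V| (4 nodes in this toy graph)
    INSERT INTO pr SELECT src, 0.25 FROM edges UNION SELECT dst, 0.25 FROM edges;
    CREATE TABLE outdeg AS SELECT src AS node, COUNT(*) AS d FROM edges GROUP BY src;
    """)

    DAMPING = 0.85
    for _ in range(20):                      # fixed number of power iterations
        cur.executescript(f"""
        CREATE TABLE pr_new AS
          SELECT p.node,
                 (1.0 - {DAMPING}) / (SELECT COUNT(*) FROM pr)
                 + {DAMPING} * COALESCE(SUM(q.score / o.d), 0.0) AS score
          FROM pr p
          LEFT JOIN edges e  ON e.dst = p.node
          LEFT JOIN pr q     ON q.node = e.src
          LEFT JOIN outdeg o ON o.node = e.src
          GROUP BY p.node;
        DROP TABLE pr;
        ALTER TABLE pr_new RENAME TO pr;
        """)

    # node 3 ends up with the highest score in this toy graph
    print(cur.execute("SELECT node, ROUND(score,3) FROM pr ORDER BY score DESC").fetchall())

Connected components and diameter estimates can be written in the same style (iterated joins plus aggregates); the project hand-out is the authoritative specification.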
2. OPEN-ENDED PROJECTS - GRAPH MINING
2.1 Spam Detection for Review Data
- Problem: Review
data provides valuable information about products and services. Review
data is ubiquities on websites as Amazon, Yelp or Tripadvisor, and is
being frequently used by customers to
choose among competing products or services. Since reviews highly
affect the buying behaviour of customers, spammers try to mislead the
users by writing fake reviews. The goal of this project is to develop
methods to detect users showing spamming behaviour. We want to start
with a feature based detection of spammers: What are the
characteristics of a spammer? Which features can be used to
discriminate between spammers and non-spammers? Are these features
useful for all users or only for a subset of users? Based on this
feature representation, automatic methods to classify/rank the users
regarding their spamming behaviour should be developed exploiting,
e.g., the principles of subspace clustering/co-clustering or low rank
matrix factorization.
- Data: The participants can test their methods on multiple review datasets such as Amazon (6M reviews) and Yelp (300K reviews).
- Introductory material:
- Paper on review spam: Arjun Mukherjee, Abhinav Kumar, Bing
Liu, Junhui Wang, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh.
Spotting Opinion Spammers using Behavioral Footprints. SIGKDD
International Conference on Knowledge Discovery and Data Mining
(KDD-2013), August 11-14 2013 in Chicago, USA.
- Overview of subspace clustering techniques: Hans-Peter Kriegel,
Peer Kroeger, Arthur Zimek: Clustering high-dimensional data: A survey
on subspace clustering, pattern-based clustering, and correlation
clustering. TKDD 3(1) (2009)
- Contact Person: Dr. Stephan Guennemann.
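- Starter sketch: a minimal, illustrative example in Python of per-user behavioural features that could feed such a classifier; the review tuples and the 'suspiciousness' score below are our own toy choices, not the actual Amazon/Yelp format or any published method.

    import numpy as np
    from collections import defaultdict

    # each review: (user_id, product_id, stars, text) -- hypothetical layout
    reviews = [
        ("u1", "p1", 5, "great great great"),
        ("u1", "p2", 5, "great great great"),
        ("u1", "p3", 5, "great great great"),
        ("u2", "p1", 3, "decent product, battery life could be better"),
        ("u2", "p4", 4, "works as advertised, shipping was slow"),
    ]

    def user_features(reviews):
        per_user = defaultdict(list)
        for user, _prod, stars, text in reviews:
            per_user[user].append((stars, text))
        feats = {}
        for user, rs in per_user.items():
            stars = np.array([s for s, _ in rs], dtype=float)
            texts = [t for _, t in rs]
            feats[user] = {
                "n_reviews":      len(rs),
                "frac_extreme":   float(np.mean((stars == 5) | (stars == 1))),
                "rating_std":     float(stars.std()),
                "avg_len":        float(np.mean([len(t.split()) for t in texts])),
                "dup_text_ratio": 1.0 - len(set(texts)) / len(texts),
            }
        return feats

    # crude suspiciousness score: extreme, low-variance, duplicated reviews
    for user, f in user_features(reviews).items():
        score = f["frac_extreme"] + f["dup_text_ratio"] - 0.1 * f["rating_std"]
        print(user, round(score, 2), f)

The project proper would replace the hand-made score with a learned model (or subspace clustering / matrix factorization over the feature matrix).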
2.2. Is modern spam detection research actually working? (Bipartite core detection)
- Motivation: Many researchers in data mining focus on detecting spam by
finding groups of users acting together to spam some item online. For
example, we have found users on eBay working together to boost each other's
reputation, accounts on Facebook all liking similar Pages to boost their
credibility, and accounts on Twitter all following certain accounts to
boost their appearance of being famous. A quick search on eBay will find
examples of this. As a result, data mining methods look for certain graph
patterns, large dense bipartite cores in particular, to detect such
behavior. Unfortunately, some honest, good users can create these graph
patterns inadvertently.
- Problem: What are the sizes and densities of naturally occurring
bipartite cores in different data sets? Knowing the distributions of
bipartite cores would be interesting for community-detection research, for
understanding user behavior in different contexts (buying products vs.
following on Twitter), and for quantifying the robustness of
state-of-the-art spam detection methods in the real world. If there are
many large, naturally occurring groups of users acting together, then much
of the academic research on spam detection would have to be rethought. (A
crude measurement sketch is given at the end of this project's
description.)
- Data: Amazon (6M reviews), Yelp (300k reviews), possibly Twitter
graph, and any other data sets you could scrape.
- Introductory Material:
- CopyCatch: Stopping Group Attacks by Spotting Lockstep Behavior in Social Networks.
Alex Beutel, Wanhong Xu, Venkatesan Guruswami, Christopher Palow,
Christos Faloutsos. Proceedings of the 22nd International Conference on
World Wide Web (WWW), 2013.
- MAFIA: A Maximal Frequent Itemset Algorithm. Doug Burdick, Manuel Calimlim, Jason Flannick, Johannes Gehrke, and Tomi Yiu.
- Itemset mining in noisy contexts: a hybrid approach. Karima Mouhoubi, Lucas Letocart, Celine Rouveirol.
- Flexible Fault Tolerant Subspace Clustering for Data with Missing Values.
Stephan Guennemann, Emmanuel Muller, Sebastian Raubach, Thomas Seidl.
- Contact: Alex Beutel (TA for class)
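- Starter sketch: one crude proxy for 'how large do natural dense bipartite cores get' (our own simplification, not CopyCatch or any cited method): iteratively prune a user-item graph down to its (s,t)-core and measure what survives.

    from collections import Counter

    def bipartite_core(edges, s=2, t=2):
        # edges: iterable of (user, item) pairs.
        # Keep only edges whose user has >= s surviving edges and whose item
        # has >= t surviving edges; repeat until stable ((s,t)-core).
        live = set(edges)
        while True:
            udeg = Counter(u for u, _ in live)
            ideg = Counter(i for _, i in live)
            pruned = {(u, i) for (u, i) in live if udeg[u] >= s and ideg[i] >= t}
            if pruned == live:
                return {u for u, _ in live}, {i for _, i in live}
            live = pruned

    if __name__ == "__main__":
        edges = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"),
                 ("u3", "a"), ("u3", "b"), ("u4", "c")]          # toy graph
        users, items = bipartite_core(edges, s=2, t=2)
        print(users, items)   # u4 and item c are pruned away

Sweeping s and t over a real review graph, and recording the surviving core sizes, already gives a first answer to the question above.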
2.3. Adversarial Spam Injection
- Problem: Many researchers in data mining focus on detecting spam in data
sets from the internet, particularly focusing on unusual graph structures
left by spammers. However, researchers often focus on the strengths of
their algorithms rather than their vulnerability to smart attackers. How
well can you get around state-of-the-art machine learning and data mining
methods for detecting spam? How much spam can you add to a dataset without
being caught? (A toy spam-injection sketch is given at the end of this
project's description.)
- Data: Try injecting spam into Amazon (6M reviews), Yelp (300k
reviews), possibly Twitter graph, and any other data sets you could
scrape.
- Introductory Material:
- CopyCatch: Stopping Group Attacks by Spotting Lockstep Behavior in Social Networks.
Alex Beutel, Wanhong Xu, Venkatesan Guruswami, Christopher Palow,
Christos Faloutsos. Proceedings of the 22nd International Conference on
World Wide Web (WWW), 2013.
- EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs. B. Aditya Prakash, Ashwin Sridharan, Mukund Seshadri, Sridhar Machiraju, Christos Faloutsos.
- NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks.
Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos
Faloutsos. International Conference on World Wide Web (WWW) 2007. May
8-12, 2007. Banff, Alberta, Canada.
- Comment: Use algorithms from the above methods and compare how
robust they are to different spamming techniques in different data sets.
(Depending on the scope, I may be able to provide source code for some of
the above methods.)
- Contact: Alex Beutel (TA for class)
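- Starter sketch: a toy 'attacker side' injector (our own construction, not taken from any of the papers above): it adds a block of fake accounts that all review a few target items, plus random 'camouflage' reviews to look less blocky; the detection algorithms above would then be run on the before/after edge lists.

    import random

    def inject_spam(edges, n_fake_users=50, targets=("target1", "target2"),
                    camouflage_per_user=3, seed=0):
        # edges: list of (user, item). Returns edges plus injected fake reviews.
        rng = random.Random(seed)
        existing_items = list({item for _, item in edges})
        injected = list(edges)
        for k in range(n_fake_users):
            fake = f"fake_user_{k}"
            for item in targets:                      # the actual boosting edges
                injected.append((fake, item))
            # camouflage: random existing items (a real attacker would pick popular ones)
            for item in rng.sample(existing_items,
                                   min(camouflage_per_user, len(existing_items))):
                injected.append((fake, item))
        return injected

    if __name__ == "__main__":
        real = [("u1", "a"), ("u2", "b"), ("u3", "a"), ("u3", "c")]
        print(len(inject_spam(real)))   # 4 real edges + 50 * (2 targets + 3 camouflage) = 254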
2.4. Outliers: Scalable low-rank plus sparse matrix decompositions using Hadoop
- Problem: Decomposing a large data matrix into the superposition of a
low-rank component plus a sparse component has found widespread
applicability in diverse data mining tasks. Some exciting examples include
foreground/background separation from video surveillance streams, unveiling
Internet traffic anomalies, singing voice separation from music
accompaniment, matrix completion with outliers, and latent semantic
indexing, to name a few. Check also this website for more applications. In
this context, the goal of the present project is to investigate and
empirically demonstrate how these decomposition models scale to modern
massive datasets. Current algorithms seeking the desired parallelizability
either involve incremental stochastic gradient descent iterations, or
distributed proximal algorithms such as the alternating-direction method of
multipliers. The main task in this project will be to implement your
favorite algorithm in MapReduce/Hadoop (we have a small cluster to which we
can give you access), and try it out on a few of the (synthetic and
real-world) datasets listed next. (A tiny serial prototype is sketched at
the end of this project's description.)
- Data: For starters, synthetic data is always useful to test the
algorithms for correctness and to perform controlled scalability
experiments. Some useful real datasets include video streams to perform
background modeling and foreground extraction, Internet traffic data for
anomaly identification (more data here), and the MIR-1K dataset for singing
voice separation. We also have a who-talks-to-whom social network dataset
(will need NDA) involving 270 million nodes and 8 billion edges, where one
would like to identify structure (low rank) and (sparse) outliers.
- Introductory papers: Basic papers on low-rank plus sparse matrix
decompositions are, e.g., Candes et al'11 and Mateos-Giannakis'12; the
tutorial/short course by Yi Ma could also be useful. Distributed (but not
performance-optimized) algorithms for matrix decomposition were developed
in this paper.
- Comments: There is no need to spend time on the theoretical and
performance aspects studied in the aforementioned introductory papers; stay
focused on the algorithms and applications. Sequential (online) algorithms
have also been proposed for scalability and real-time mining of streaming
data; see, e.g., this paper and this website. Tensor (multi-way array
generalizations of matrices) extensions are also worth pursuing and
implementing. We can of course discuss these if you are interested.
- Contact Persons: Dr. Gonzalo Mateos; instructor.
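- Starter sketch: a tiny serial NumPy prototype (not the Hadoop version, and not any specific paper's algorithm) of the low-rank plus sparse split, via naive alternation between a truncated SVD for the low-rank part and entrywise soft-thresholding for the sparse part; the rank, threshold and iteration count are ad hoc.

    import numpy as np

    def lowrank_plus_sparse(M, rank=2, sparse_thresh=0.5, n_iter=50):
        # return (L, S) with M ~ L + S, L of the given rank, S entrywise sparse
        S = np.zeros_like(M)
        for _ in range(n_iter):
            # best rank-r approximation of (M - S) via truncated SVD
            U, sig, Vt = np.linalg.svd(M - S, full_matrices=False)
            L = (U[:, :rank] * sig[:rank]) @ Vt[:rank, :]
            # soft-threshold the residual to get the sparse part
            R = M - L
            S = np.sign(R) * np.maximum(np.abs(R) - sparse_thresh, 0.0)
        return L, S

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        L_true = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 40))
        S_true = np.zeros((50, 40))
        idx = rng.choice(50 * 40, size=60, replace=False)
        S_true.flat[idx] = 10 * rng.standard_normal(60)      # gross outliers
        L_hat, S_hat = lowrank_plus_sparse(L_true + S_true)
        print("relative error on L:",
              np.linalg.norm(L_hat - L_true) / np.linalg.norm(L_true))

The Hadoop version would distribute the two update steps (e.g., block-wise gradient or ADMM updates) rather than call a full SVD, following the papers above.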
3. OPEN ENDED PROJECTS - STREAM MINING
3.1 Change Detection for Product Ratings
- Problem: Many websites such as Amazon, Tripadvisor, or Yelp allow the
users to rate the quality of products or services. These ratings are not
static but change over time. Given such a time-series of product ratings,
how can we detect and explain a change of the product's evaluation? In this
project, we want to analyse the correlation between the rating of a product
and its corresponding review text. Does a significant change in the
product's ratings also induce a change of the discussed topics? We start by
using existing change detection methods to detect points of inflection, and
text analysis measures such as tf-idf (term frequency-inverse document
frequency) to study the potential correlations. As a further step, we
envision the development of extended techniques based on topic mining and
methods integrating the change detection step into the text analysis
process. (A toy starter sketch is given at the end of this project's
description.)
- Data: Data from the Amazon (6M reviews) and Yelp (300K reviews) websites can be made available.
- Introductory material:
-
Li-Chen Cheng, Zhi-Han Ke, Bang-Min Shiue: Detecting changes of opinion from customer reviews. FSKD 2011:1798-1802
- Ludmila I. Kuncheva: Change Detection in Streaming Multivariate
Data Using Likelihood Detectors. IEEE Trans. Knowl. Data Eng. (TKDE)
25(5):1175-1180 (2013)
- Contact Persons: Nikou Guennemann, Dr. Stephan Guennemann.
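- Starter sketch: a toy baseline in Python (our own simplification, not one of the cited detectors): flag the point where the means of the ratings before and after differ the most, then compare word frequencies of the reviews before vs. after that point; tf-idf or topic models would be the natural refinement.

    import numpy as np
    from collections import Counter

    def change_point(ratings, w=20, z_thresh=3.0):
        # index where the means of the w ratings before and after differ the
        # most (in standard errors), provided the difference exceeds z_thresh
        r = np.asarray(ratings, dtype=float)
        best_t, best_z = None, z_thresh
        for t in range(w, len(r) - w + 1):
            before, after = r[t - w:t], r[t:t + w]
            se = np.sqrt(before.var() / w + after.var() / w) + 1e-9
            z = abs(after.mean() - before.mean()) / se
            if z > best_z:
                best_t, best_z = t, z
        return best_t

    def top_terms(texts, k=5):
        # plain term counts; swapping in tf-idf is the obvious next step
        counts = Counter(word for txt in texts for word in txt.lower().split())
        return [w for w, _ in counts.most_common(k)]

    if __name__ == "__main__":
        ratings = [5] * 40 + [2] * 40                   # toy series with a drop
        reviews = ["love the battery"] * 40 + ["screen cracked fast"] * 40
        t = change_point(ratings)
        print("change detected at index", t)            # expect 40
        print("before:", top_terms(reviews[:t]), "after:", top_terms(reviews[t:]))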
3.2 Guess the next flu spike: Co-evolving time series mining
- Problem: Given time series of patients (blood pressure over time, etc.)
and class labels ('healthy', 'unhealthy'), extract features and do
classification. Or, given a set of sequences of, say, BGP updates, find
correlations and anomalies (BGP = Border Gateway Protocol, in computer
networks). In yet another scenario, consider monitoring a data center (like
the Self-* system or the Data Center Observatory, both at CMU/PDL). Another
application is monitoring environmental data, to spot, say, global warming,
deforestation, etc. - see the web page of Prof. Vipin Kumar. (A small
lag-correlation sketch is given at the end of this project's description.)
- Data
- Very interesting dataset, from the Tycho project: epidemiology time
series, with the number of infected people per unit time, per US city, per
disease.
- Other data include the physionet.org collection.
- Introductory paper(s): For spikes in epidemiology data, check the
'spikeM' model [KDD'12]. For BGP, check [Prakash+, KDD'09] (or here, for a
more detailed version). For data center monitoring, check the SPIRIT
project, and the corresponding publication OSR06. Also see the
lag-correlation paper [Sakurai+ SIGMOD'05], and the DynaMMo method (Kalman
filters for missing values [Li+ KDD'09]).
- Comments: Start with Fourier and wavelets, for features. For the Tycho
data, try the 'spikeM' method. Check the 'DynaMMo' and 'PLiF' methods. For
the physionet data, one challenge is how to handle the several wrong
recordings (e.g., blood pressure ~ 0). Depending on the composition of the
team, the project could focus on any of the above settings (environment
only; datacenter only; etc.).
- Contact person: instructor.
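- Starter sketch: a small warm-up for the co-evolving-series setting (our own toy, not the SPIRIT or lag-correlation algorithms from the papers above): estimate the lag at which two series are most correlated, via brute-force normalized cross-correlation.

    import numpy as np

    def best_lag(x, y, max_lag=50):
        # return (lag, corr) maximizing the correlation of x[t] with y[t + lag]
        x = (np.asarray(x, float) - np.mean(x)) / (np.std(x) + 1e-12)
        y = (np.asarray(y, float) - np.mean(y)) / (np.std(y) + 1e-12)
        best = (0, -np.inf)
        for lag in range(-max_lag, max_lag + 1):
            if lag >= 0:
                a, b = x[:len(x) - lag], y[lag:]
            else:
                a, b = x[-lag:], y[:len(y) + lag]
            n = min(len(a), len(b))
            if n > 1:
                c = float(np.mean(a[:n] * b[:n]))
                if c > best[1]:
                    best = (lag, c)
        return best

    if __name__ == "__main__":
        t = np.arange(500)
        rng = np.random.default_rng(0)
        city_a = np.sin(2 * np.pi * t / 52)                   # toy weekly counts
        city_b = np.roll(city_a, 7) + 0.05 * rng.standard_normal(500)  # lags by 7
        print(best_lag(city_a, city_b, max_lag=20))           # lag should be 7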
4. OPEN ENDED PROJECTS - TENSORS
4.1 Tensors on Hadoop - 'sparse-3'
- Problem: Tensor decompositions are increasingly popular in data mining
applications. Applying them at web scale, however, is still a challenging
problem; several approaches attempt to tackle this scalability issue [1,2].
A recent line of work [2] uses biased sampling in order to create multiple
tensor sketches, operates on sketch space, and merges the final results. In
this project you will investigate such methods and implement [2] (or a
hybrid) on a distributed storage environment such as Hadoop/MapReduce. The
main idea is what we call `sparse-3' decomposition: (a) starting from a
sparse tensor, (b) we want to derive a sparse decomposition, and (c) have
sparse intermediate results. We hope that a careful implementation of this
kind will have tremendous speed-ups over the traditional methods. (A tiny
in-memory CP-ALS sketch is given at the end of this project's description.)
- Evaluation criteria: For [2], there already exists a Java/Matlab
implementation. The first step to assess whether your implementation is
correct is to verify that the results you obtain are comparable to those of
the original implementation (note: they don't have to be identical).
- Datasets: NELL dataset, Phonecalls (need NDA)
- Introductory Material:
- Contact Persons: Vagelis Papalexakis (TA in the class); instructor.
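- Starter sketch: a tiny in-memory CP-ALS on a coordinate-format sparse tensor, NumPy only and with made-up toy data; the actual project would re-express the MTTKRP step below as distributed/sampled jobs and keep all intermediate results sparse.

    import numpy as np

    def cp_als(coords, vals, shape, rank=2, n_iter=30, seed=0):
        # coords: (nnz, 3) int array; vals: (nnz,) values; returns factors A, B, C
        rng = np.random.default_rng(seed)
        factors = [rng.standard_normal((dim, rank)) for dim in shape]
        for _ in range(n_iter):
            for mode in range(3):
                others = [m for m in range(3) if m != mode]
                # MTTKRP over the nonzeros only (this is the step to distribute)
                M = np.zeros((shape[mode], rank))
                kr = factors[others[0]][coords[:, others[0]]] * \
                     factors[others[1]][coords[:, others[1]]]
                np.add.at(M, coords[:, mode], vals[:, None] * kr)
                # normal equations: Hadamard product of the other Gram matrices
                G = np.ones((rank, rank))
                for m in others:
                    G *= factors[m].T @ factors[m]
                factors[mode] = M @ np.linalg.pinv(G)
        return factors

    if __name__ == "__main__":
        # toy 'who-called-whom-on-which-day' tensor with one heavy block
        coords = np.array([(i, j, k) for i in range(5) for j in range(5) for k in range(3)])
        vals = np.ones(len(coords))
        A, B, C = cp_als(coords, vals, shape=(10, 10, 7), rank=1)
        print(np.round(A[:, 0], 2))   # rows 0-4 (the heavy block) are nonzero; rows 5-9 are zero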
4.2. Tensor decomposition using RDBMS
- Problem: Let's take the 2nd default project a step further. Can SQL be
used to manipulate temporally evolving graphs? We are particularly
interested in applying SQL to the tensor decomposition problem: given a
3-way tensor (for instance, indicating whether person i contacted person j
on day k), we want to find heavy blocks in the tensor. Using the previous
example, we are looking for a set of people that called a set of other
people on a set of days (the output would be a set of these 3 vectors).
There are many algorithms that can be applied to solve this problem, but
can any of them be implemented in SQL (and thus be easily parallelizable)?
(A warm-up SQL sketch is given at the end of this project's description.)
- Data: Any temporal graph will do; we have phone-network, computer-communication-network and email-network data available.
- Introductory material:
- The Pegasus paper with GIM-V is a good starting point to understand how
common matrix operations can be applied in SQL.
- Navasca's presentation is a simple introduction to the CP decomposition
and the ALS method.
- Tamara Kolda and Brett Bader's survey is a more detailed alternative to
understand all the notation and the most common algorithms.
- Comments: This project combines a fair amount of implementation with interesting mathematical problems, and can definitely lead to a publication.
- Contact Persons: Miguel Araujo; Vagelis Papalexakis (TA for class); instructor.
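- Starter sketch: a warm-up for the SQL route, using Python's sqlite3 for portability (the table names are our own): a sparse matrix-vector multiply written as one SQL aggregate, in the spirit of the GIM-V primitive from the Pegasus paper above. A 3-way version would add a third index column and join against two factor tables.

    import sqlite3

    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    cur.executescript("""
    CREATE TABLE mat(i INTEGER, j INTEGER, val REAL);   -- sparse matrix entries
    CREATE TABLE vec(j INTEGER PRIMARY KEY, val REAL);  -- dense vector
    INSERT INTO mat VALUES (0,0,1.0),(0,1,2.0),(1,1,3.0),(2,0,4.0);
    INSERT INTO vec VALUES (0,1.0),(1,10.0);
    """)
    rows = cur.execute("""
        SELECT m.i, SUM(m.val * v.val) AS y
        FROM mat m JOIN vec v ON m.j = v.j
        GROUP BY m.i
        ORDER BY m.i
    """).fetchall()
    print(rows)   # [(0, 21.0), (1, 30.0), (2, 4.0)]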
5. OPEN ENDED PROJECT - BIO-INFORMATICS
- Problem: The goal of this project is to classify individuals into
disease/healthy status based on gene expression profiles. Specifically, in
the training data set, each individual is represented by a vector of real
numbers (gene expression) with a label (sick or healthy). Given the test
data set, you will predict whether an individual is healthy or sick. The
default is to build a classifier, but you can also do feature extraction,
clustering or visualization. (A baseline sketch is given at the end of this
project's description.)
- Data: Alzheimer's disease data set by the Harvard Brain Tissue Research
Center and Merck Research Laboratories.
- Introductory material:
- iPcc: a novel feature extraction method for accurate disease class
discovery and prediction. Xianwen Ren, Yong Wang, Xiang-Sun Zhang, and Qi
Jin. Nucleic Acids Research, 2013.
- Gene expression profiling predicts clinical outcome of breast cancer.
Laura J. van 't Veer et al. Nature, 2002.
- Boosting for tumor classification with gene expression data. Marcel
Dettling and Peter Buhlmann. Bioinformatics, 2002.
- Comments: For this project, feature selection is very useful for
classification, because gene expression profile data has very high
dimensionality (>20000) and only a small number of genes might be truly
associated with the disease. That is, identifying candidate genes useful
for classification as a preprocessing step would be a good idea.
- Contact Person: Seunghak Lee (TA in the class).
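- Starter sketch: a baseline on synthetic data (the real Harvard/Merck set needs its own loader; the effect sizes below are made up): keep the genes with the largest two-sample t-statistics, then classify with a nearest-centroid rule on that reduced feature set.

    import numpy as np

    def select_genes(X, y, k=50):
        # X: (n_samples, n_genes); y: 0/1 labels. Return indices of the top-k genes.
        sick, healthy = X[y == 1], X[y == 0]
        se = np.sqrt(sick.var(axis=0) / len(sick) + healthy.var(axis=0) / len(healthy))
        t = np.abs(sick.mean(axis=0) - healthy.mean(axis=0)) / (se + 1e-12)
        return np.argsort(t)[-k:]

    def nearest_centroid_predict(X_train, y_train, X_test, genes):
        c0 = X_train[y_train == 0][:, genes].mean(axis=0)
        c1 = X_train[y_train == 1][:, genes].mean(axis=0)
        d0 = np.linalg.norm(X_test[:, genes] - c0, axis=1)
        d1 = np.linalg.norm(X_test[:, genes] - c1, axis=1)
        return (d1 < d0).astype(int)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        n, p, informative = 100, 2000, 20               # p >> n, like expression data
        y = rng.integers(0, 2, n)
        X = rng.standard_normal((n, p))
        X[:, :informative] += 1.5 * y[:, None]          # only a few genes matter
        genes = select_genes(X[:80], y[:80], k=20)
        acc = np.mean(nearest_centroid_predict(X[:80], y[:80], X[80:], genes) == y[80:])
        print("held-out accuracy:", acc)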
DATASETS
Unless explicitly mentioned, the datasets are either 'public' or 'owned' by
the instructor; for the rest, we need to discuss 'Non-disclosure
agreements' (NDAs).
Time sequences
- Time series repository at UCR.
- KURSK dataset of multiple time sequences: time series from seismological
sensors by the explosion site of the 'Kursk' submarine.
- Truck traffic data, from our Civil Engineering Department. Number of
trucks, weight, etc., per day per highway lane. Find patterns, outliers; do
data cleansing.
- River-level / hydrology data: multiple, correlated time series. Do data
cleansing; find correlations between these series. Excellent project for
people who like canoeing!
- Sunspots:
number of sunspots per unit time. Some data are here.
Sunspots seem to have an 11-year periodicity, with high
spikes.
- Time sequences from the Santa Fe Institute forecasting competition
(financial data, laser-beam oscillation data, patients' apnea data, etc.).
- Disk access traces, from HP Labs (we have local copies at CMU). For each
disk access, we have the timestamp, the block-id, and the type
('read'/'write'). Here is a snippet of the data, aggregated per 30 minutes.
- Network traffic data from datapository.net at CMU
- Motion-capture data from CMU mocap.cmu.edu
Spatial data
- Astrophysics data - thousands of galaxies, with coordinates, red-shift,
spectra, photographs. Small snippet of the data. More data are in the
'skyserver' web site, where you can ask SQL queries and get data in HTML or
CSV format.
- Synthetic astrophysics
data: 1K of (x,y,z, weight) tuples, from Prof. Rupert Croft
(CMU). The full dataset is 200Mb compressed - contact
instructor.
- Road segments: several datasets with line segments
(roads of U.S. counties, Montgomery MD, Long Beach CA, x-y
coordinates of stars in the sky from NASA, etc).
Snippet of data (roads from California, from TIGER).
Graph data - need NDA
- YahooWeb crawl (120Gb, 1B nodes, 6B edges). Needs a mild NDA.
- Web-log and click-stream data (NDA needed).
- Call graphs: snapshots of anonymized (and anonymous) who-calls-whom
graphs (NDA).
Graph Data - public
Miscellaneous:
SOFTWARE
Notes for the software: Before you modify any code,
please contact the instructor - ideally, we would like to use these
packages as black boxes.
- Readily available:
- ACCESS METHODS
- DR-tree: R-tree code; searches for range and nearest-neighbor queries. In C.
- kd-tree code
- OMNI trees - a faster version of metric trees.
- B-tree code, for text (should be easily changed to handle numbers, too). In C.
- SVD AND TENSORS:
- FRACTALS
- GRAPHS
- the PEGASUS package for graph mining on Hadoop.
- the NetMine network topology analysis package
- GMine: interactive graph visualization package and graph manipulation
library (by Junio (Jose Fernandez Rodrigues Junior) and Jure Leskovec)
- the 'crossAssociation' package for graph partitioning.
- Outside CMU:
- GiST package from Hellerstein at UC Berkeley: a general spatial access
method, which is easy to customize. It is already customized to yield
R-trees.
- Hadoop, Pig and HBase
- Pajek, JUNG, Graphviz, GUESS, Cytoscape, for (small) graph visualization
- METIS, for graph partitioning
BIBLIOGRAPHICAL RESOURCES:
Last modified Sept. 16, 2013, by Christos Faloutsos.