Carnegie Mellon University
15-826 Multimedia Databases and Data Mining
Fall 2014 - C. Faloutsos
List of suggested non-default projects
The projects are grouped according to their general theme. We also
list the data and software available, to leverage your effort. More
links and resources may be added in the future. Reminders:
- URL for this very page (internal to CMU - please treat it
'confidentially'):
www.cs.cmu.edu/~christos/courses/826.F14/CMU-ONLY/projlist.html
- The default project is strongly recommended for the M.Sc. students.
- Please form groups of 2
people.
- Please check the 'blackboard' system, where we
will create one thread for each of the projects below. Please
indicate your interest, by posting in the appropriate thread(s), so
that you can find partners.
SUGGESTED TOPICS
As mentioned earlier, people who take the class for their
master's degree, are expected to choose the default project. It is well
defined, with a lot of implementation, and rather predictable outcomes. See full description of the Graph Mining project here, in pdf.
Here we give a list of suggested, non-default projects. They are more
open-ended, and they are more suitable for people who want to do
research in data mining.
1. GRAPH / TENSOR MINING
1.1. Adversarial Spam Injection
- Problem: Many researchers in data mining focus on detecting spam
in data sets from the internet, particulary focusing on unusual graph
structures left by spammers. However, researchers often focus on the
strengths of their algorithms rather than their vulnerability to smart
attackers. How well can you get around state of the art machine learning
and data mining methods to detect spam? How much spam can you add to a
dataset without being caught?
- Data: Try injecting spam into Amazon (6M reviews), Yelp (300k
reviews), possibly Twitter graph, and any other data sets you could
scrape.
- Introductory Material:
- CopyCatch: Stopping Group Attacks by Spotting Lockstep Behavior in Social Networks.
Alex Beutel, Wanhong Xu, Venkatesan Guruswami, Christopher Palow,
Christos Faloutsos. Proceedings of the 22nd International Conference on
World Wide Web (WWW), 2013.
- EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs. B. Aditya Prakash, Ashwin Sridharan, Mukund Seshadri, Sridhar Machiraju, Christos Faloutsos.
- NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks.
Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos
Faloutsos. International Conference on World Wide Web (WWW) 2007. May
8-12, 2007. Banff, Alberta, Canada.
- Comment: Use algorithms from the above methods and compare how
robust they are to different spamming techniques in different data sets.
(Depending on the scope, I may be able to provide source code for some of
the above methods.)
- Contact: Mr. Neil Shah (TA for class), Mr. Alex Beutel.
1.2. Spam Detection for Review Data
- Problem: Review
data provides valuable information about products and services. Review
data is ubiquities on websites as Amazon, Yelp or Tripadvisor, and is
being frequently used by customers to
choose among competing products or services. Since reviews highly
affect the buying behaviour of customers, spammers try to mislead the
users by writing fake reviews. The goal of this project is to develop
methods to detect users showing spamming behaviour. We want to start
with a feature based detection of spammers: What are the
characteristics of a spammer? Which features can be used to
discriminate between spammers and non-spammers? Are these features
useful for all users or only for a subset of users? Based on this
feature representation, automatic methods to classify/rank the users
regarding their spamming behaviour should be developed exploiting,
e.g., the principles of subspace clustering/co-clustering or low rank
matrix factorization.
- Data: The participants can test their methods on multiple review datasets such as Amazon (6M reviews) and Yelp (300K reviews).
- Introductory material:
- Paper on review spam: Arjun Mukherjee, Abhinav Kumar, Bing
Liu, Junhui Wang, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh.
Spotting Opinion Spammers using Behavioral Footprints. SIGKDD
International Conference on Knowledge Discovery and Data Mining
(KDD-2013), August 11-14 2013 in Chicago, USA.
- Overview of subspace clustering techniques: Hans-Peter Kriegel,
Peer Kroeger, Arthur Zimek: Clustering high-dimensional data: A survey
on subspace clustering, pattern-based clustering, and correlation
clustering. TKDD 3(1) (2009)
- Contact Person: Dr. Stephan Guennemann; instructor; Mr. Shi HU , Mr. Alex Beutel.
1.3. Random walk on tensors
- Problem: Consider a (huge) set of facts as
subject-verb-object triplets - like 'Obama' 'is-president-of' 'USA';
What is the most interesting verb and subject you can offer, for
object='Pittsburgh'? The hardest part of the problem is to give a
measure of 'goodness' of a fact. We would also like a fast algorithm to
compute (sub-quadratic on the count of triplets)
- Data:
- NELL data (=Never Ending Language Learner) - about 100M triplets, 20M objects and subjects, and 10M verbs
- Introductory paper(s) : First two papers are in the reading list:
- Comments : very high risk - and very high pay-off.
- Contact person: Mr. Jay Yoon LEE (TA for the class), Mr. Vagelis Papalexakis
1.4. Tensors on hadoop - 'sparse-3'
-
Problem:
Tensor decompositions are increasingly popular in data mining
applications. Applying them on web scale, however, is still a
challenging problem; several approaches attempt to tackle this
scalability issue [1,2]. A recent line of work [2] uses biased sampling
in order to create multiple tensor sketches, operates on sketch space
and merges the final results. In this project you will investigate such
methods and implement [2] (or a hybrid) on a distributed storage
environment such as Hadoop/MapReduce. The main idea is what we call
`sparse-3' decomposition:
(a) starting from a sparse tensor, (b) we want to derive a sparse
decomposition,
and (c) have sparse intermediate results.
We hope that a careful such implementation will have tremendous
speed-ups over the traditional methods.
-
Evaluation criteria: For [2], there already exists a Java/Matlab
implementation. The first step to assess whether your implementation is
correct is to verify that results you obtain are comparable to the
original implementation (note: They don't have to be identical).
-
Datasets: NELL dataset, Phonecalls (need NDA)
-
Introductory Material:
-
Contact Persons: Mr.
Vagelis Papalexakis; Instructor.
1.5. Tensor decomposition using RDBMS
- Problem: Let's take
the 2nd default project a step further. Can SQL be used to manipulate
temporal evolving graphs? We are particularly interested in applying
SQL to the tensor decomposition problem: given a 3-way tensor (for
instance, indicating if person i contacted person j on day k)
we want to find heavy blocks in the tensor. Using the previous example,
we are looking for a set of people that called a set of other people on
a set of days (the output would be a set of these 3 vectors). There are
many algorithms that can be applied to solve this problem, but can any
of them be implemented in SQL (and thus be easily parallelizable)?
- Data: Any temporal graph will do, we have phone networks, computer communications network and email network data available.
- Introductory material:
-
The Pegasus paper with GIM-V is a good starting point to understand how common matrix operations can be applied in SQL.
-
Navasca's presentation is a simple introduction to CP decomposition and the ALS method.
-
Tamara Kolda and Brett Bader's survey is a more detailed alternative to understand all the notation and the most common algorithms.
- Comments: This project combines a fair amount of implementation and mathematical problems and can definitely lead to a publication.
- Contact Persons:
Mr. Vagelis Papalexakis; instructor.
2. MODELING
2.1 Skewed, 2-d distributions - ``Almond-G'' and extensions
- Problem: Can you guess the
number of followers, if I tell you the count of tweets, for a twitter
user? In general, there seems to be a super-linear relationship (like
'double the followers, triple the tweets'). The goal is to derive a
process like CRP (='chinese restaurant process', related to
'rich-get-richer', related to the 'Yule-Simon' process). CRP
creates a 1-dimensional distribution (# people in each table), while we
want a 2-d distribution (#-followers and #-tweets, for each person).
Once you create such a mechanism, try to (a) fit it in real data (b)
spot outliers (c) do clustering.
- Data: from Danai (the contact person), and her paper, below
- Introductory material: papers and lecture foils:
- Patterns amongst Competing Task Frequencies: Super-Linearities, and the Almond-DG model
Danai Koutra, Vasileios Koutras, B. Aditya Prakash, Christos Faloutsos.
PAKDD 2013, Gold Coast, Queensland, Australia, April 2013. pdf and foils
- Comments: high-risk / high-payoff. A good such model will have high impact.
- Contact Persons: Ms. Danai Koutra; instructor.
2.2 'Brain in a box'
- Problem:
Can you design a neural network,
to mimick the level of energy
activities of a real brain, when it is performing some tasks?
Start with a survey on ``recurrent neural networks'',
and the pointers on ``system identification'' in the introductory
paper below. Design a GUI, so that we can add/delete/modify neurons,
and see the reactions of the resulting ``brain''.
- Data: From Vagelis.
- Introductory Material: the paper below, and its citations
- Evangelos E. Papalexakis, Alona Fyshe, Nicholas Sidiropoulos,Partha Pratim Talukdar, Tom Mitchell,Christos Faloutsos, Good-Enough Brain Model: Challenges, Algorithms and Discoveries in Multi-Subject Experiments, ACM SIGKDD 2014, New York City, USA
- Comments:
Hard problem in general,
but the GUI should be do-able within a semester.
- Contact Persons: Mr. Vagelis Papalexakis; instructor.
DATASETS
Unless explicitly mentioned, the datasets are either
'public' or 'owned' by the instructor; for the rest, we need to
discuss about 'Non-disclosure agreements' (NDAs).
Time sequences
- Time series
repository at UCR.
- KURSK
dataset of multipe time sequences: time series from
seismological sensors by the explosion site of the 'Kursk'
submarine.
- Track traffic data, from our Civil Engineering
Department. Number of trucks, weight etc per day per highway-lane.
Find patterns, outliers; do data cleansing.
- River-level / hydrology data: multiple,
correlated time series. Do data cleansing; find correlations
between these series. Excellent project for people that like
canoeing!
- Sunspots:
number of sunspots per unit time. Some data are here.
Sunspots seem to have an 11-year periodicity, with high
spikes.
- Time sequences from the
Sante-Fe Institute forecasting competition (financial data,
laser-beam oscillation data, patients' apnea data etc)
- Disk access
traces, from HP Labs (we have local copies at CMU).
For each disk access, we have the timestamp, the block-id, and the
type ('read'/'write'). Here is a
snippet of the data, aggregated per 30'.
- Network traffic data from datapository.net at CMU
- Motion-capture data from CMU mocap.cmu.edu
Spatial data
- Astrophysics data - thousands of galaxies, with
coordinates, red-shift, spectra, photographs.
Small snippet of the data. More data are in the 'skyserver' web
site, where you can ask SQL queries and
get data in html or csv format
- Synthetic astrophysics
data: 1K of (x,y,z, weight) tuples, from Prof. Rupert Croft
(CMU). The full dataset is 200Mb compressed - contact
instructor.
- Road segments: several datasets with line segments
(roads of U.S. counties, Montgomery MD, Long Beach CA, x-y
coordinates of stars in the sky from NASA, etc).
Snippet of data (roads from California, from TIGER).
Graph data - need NDA
- YahooWeb crawl (120Gb, 1B nodes, 6B edges). Needs mild
NDA
- Web-log and click-stream data (NDA: needed).
- call-graphs Snapshots of anonymized (and anonymous)
who-calls-whom graphs (NDA)
Graph Data - public
Miscellaneous:
SOFTWARE
Notes for the software: Before you modify any code,
please contact the instructor - ideally, we would like to use these
packages as black boxes.
- Readily available:
- ACCESS METHODS
-
DR-tree : R-tree code; searches for range and nearest-neighbor
queries. In C.
-
kd-tree code
- OMNI
trees - a faster version of metric trees.
- B-tree code, for text (should be easily changed to handle
numbers, too). In C.
- SVD AND TENSORS:
- FRACTALS
- GRAPHS
- the PEGASUS
package for graph mining on hadoop.
- the
NetMine network topology analysis package
- GMine:
interactive graph visualization package and graph manipulation
library (by Junio (Jose Fernandez Rodrigues Junior) and Jure
Leskovec)
- the '
crossAssociation' package for graph partitioning.
- Outside CMU:
- GiST package from
Hellerstein at UC Berkeley: A general spatial access method, which
is easy to customize. It is already customized to yield
R-trees.
- hadoop, PIG and hbase
- pajek,
jung, graphviz, guess, cytoscape , for (small) graph
visualization
-
METIS,
for graph partitioning
BIBLIOGRAPHICAL RESOURCES:
Last modified Sept. 15, 2014, by Christos Faloutsos.