Carnegie Mellon University
15-826 Multimedia Databases and Data Mining
Spring 2007 - C. Faloutsos
List of suggested projects
The projects are grouped according to their general theme. We also list
the data and software available, to leverage your effort. More links
and
resources may be added in the future. Reminders:
- URL for this very page (internal to CMU - please treat it
'confidentially'): www.cs.cmu.edu/~christos/courses/826.S07/CMU-ONLY/projlist.html
- Feel free to propose projects outside this
list, as long
as they have to do with mining and indexing large
datasets.
- Asterisks [*] in the project title signify active, ongoing
work on this project, with several more collaborators. But feel free to
consider non-asterisked projects, too, if they are related to your
interests or your dissertation.
SUGGESTED TOPICS
1. SPATIO/TEMPORAL AND STREAM MINING
1.1. [*] Disk access traffic patterns, and the
Self-*
project
- Problem:
Given traces
from real workstations
(tuples of the form <disk-id, track-id, R/W-flag,
timestamp>),
find
patterns; do predictions; use them to design better buffering and
prefetching
algorithms. Try 'blind signal separation'/ICA, to distinguish between
'reads'
and 'writes', or between interactive and database accesses. A related
problem is to forecast the response time for a given disk request,
given a training set with their response times. The main problem is to
extract good features. In fact, this is a small part of the the Self-*
project, which also has numerous co-evolving time sequences
from a prototype data center with multiple 'intelligent' storage units:
cpu utilizations, network traffic measurements, room temperature sensor
measurements, humidity measurements, etc. The goal is to find patterns,
correlations, lag-correlations, anomalies, to help the data center self-organize,
self-detect upcoming (or existing) failures and attacks, to
self-optimize its performance.
- Data:
See the 'Disk Access Traces'
below. Also, the
web site of the Self-*
project, with a lot of measurement data, that we already
have.
- Introductory
paper(s): the
'PQRS' model [Wang
et
al, PEVA 2001]; see also the use of CART [Wang
et al SIGMETRICS 2004] and the follow-up work [Mesnier+,
'05]. Check the live Intemon
system, and the corresponding publications (VLDB06,
OSR06)
on Jimeng's page under 'InteMon'. Also the lag-correlation paper [Sakurai+ SIGMOD'05].
- Comments:
The general
case is hard, and in fact, is the topic
of dissertations (Mengzhi Wang,
Mike Mesnier). However,
there are a lot of initial ideas that you could try
within a semester, and a
lot
of industrial interest in the topic.
- Contact
person: Jimeng
Sun
1.2. [*] Similarity search in motion-capture sequences
- Problem:
Given a
collection of motion-capture sequences, find clips that are similar to
a desirable query clip. The first step is to design a good distance
function. Motion capture sequences record 40-80 numbers about 60 times
per second. The numbers are the angles of the joints of the person that
is videotaped.
- Data: Free
on the web: mocap.cs.cmu.edu
from our CMU graphics group.
- Introductory
paper(s):
a paper on mo-cap segmentation [Barbic
et al, GI'04]; see also [Keogh+
kdd04] and [Keogh+,
VLDB04]
- Comments:
Part of a large joint effort between db and graphics at CMU. Start with
the
Euclidean distance, then with the time warping distance, and
proceed to more complicated ones. Lei Li (below) is
considering Kalman-filter based ones. We also need to implement fast
indexing methods, like OMNI-trees (contact instructor for code).
- Contact person: Lei Li
2. GRAPHS - LARGE GRAPH MINING
2.1 Handling Large Graphs
- Problem:
The goal
is to build and test various strategies of handling large disk resident
graphs. Given a large graph (web graph or a large
who-knows-whom social network) which does not fit into main memory, we
want to best organize the data on disk to support at least
two
operations: sequential scan over all nodes, and random node access.
One straigtforward solution is to store the graph in a Relational
Database,
another simple solution is to have a flat file. One would build and
test
the performance of various strategies of handling large graphs.
- Data:
We have very large real graphs: a who-talks-to-whom social network with
270 million nodes and 8 billion edges (70Gb of data). We also have
smaller but still large (few gigabytes) networks.
- Introductory
paper(s): [Chiang
et al, 1995], [Wang
et al, KDD'04], with many more.
- Comments:
Very high practical interest; rather well defined and do-able within a
semester.
- Contact
person: Jure
Leskovec
2.2. Parallel graph mining
- Problem:
As in the previous problem, but for even larger graphs: Given a large
graph with billions of edges and tens of billions of nodes, and several
share-nothing machines, parallelize the typical graph mining
algorithms, to be as fast as you can. We want to compute the
in-
and out-degree distributions, the diameter of the graph, the first
several eigenvalues, the 'network value' of each node, the 'clustering
coefficient', the node- and edge-betweeness. The first step is to do
timing of several possible architectures: with, or without a relational
DBMS; with, or without replication of the data. Also, what is the best
way to store the data (e.g., as <from,to> pairs in a flat
file; as an
adjacency list, hashed on the 'from' node-id, or as something else.)
- Data:
We shall start
with synthetic data, using an existing generator [Leskovec+,
PAKDD'05]. Then, DBLP, IMDB etc. We could also get data from Prof. Dave Andersen at CMU, on real CMU IP traffic (will need NDA)
- Introductory
paper(s):
The generator above; the Gamma database machine papers [Dewitt+,
IEEE TKDE'90]; papers on hash-joins [Kitsuregawa+,
vldb'90] the RMAT paper [Chakrabarti+
SIAM-DM'04], the connection sub-graph paper [Faloutsos+,
KDD'04]
- Comments:
Very high practical interest, with hard problems from both the
algorithmic as well as the
system side. There is a lot of room, even for 4 or more people.
2.3. Finding frequent sub-graphs
- Problem:
The goal
of the project is to find building blocks of graphs. Namely, given a
graph we want to find frequently occuring sub-graphs. For example, in a
collection of chemical molecules (drug design), we want to find which
subgraph occurs too often (e.g. a benzolium hexagon). The task includes
enumerating all possible subgraphs, and counting them. One of the
challenges is to develop a fast heuristic for sub-graph isomorphism.
Another research direction is to extend for near-matches; a third to
focus on un-labeled graphs, where the heuristics for graph isomorphism
become slower.
- Data:
social networks, biological networks.
- Introductory
paper(s): [Yan
et al, ICDM 2002] and related work (Xifeng Yan, Jiawei Han,
George Karypis)
- Comments:
High practical interest (drug design, social networks, computer
networks).. The challenge is to handle large, unlabeled graphs.
- Contact
person: Jure
Leskovec
2.4. Fast implementations of RWR (for gCap)
- Problem:
In the 'gCap'
paper (see below), we need to compute the steady-state probability of
each node in a random walk with restarts, when the restarting node is
unknown. Thus, naively, we need n2
steady-state probabilities.
How can we do better than that?
- Data:
DBLP, Corel image data, and more.
- Introductory paper(s): Probably the
fastest algorithm so far is by Hanghang Tong, in ICDM'06.
The
goal is to make it even faster, and/or to implement it in C/C++, so
that it can run on huge graphs (Gb-size). Related papers:
gCap [Pan
et al, KDD'04]; topic sensitive PageRank [Haveliwala
WWW'02] ; fast algorithms for topic sensitive PageRank [Hawelivala
+ '03]; graph partitioning [Sun+,
'05].
- Comments:
Mainly, implementation - but it has room for
innovation. Closely related to the dissertation of Hanghang Tong, who
will help along.
- Contact
person: Hanghang
Tong.
2.5. Large Graph Visualization
- Problem:
Given a huge graph (say, millions of nodes), help visualize it. Start
with the ICML03 paper below; then, try to extend it for huge
graphs: try some partitioning/grouping method, and/or some fish-eye
ideas.
- Data: epinions.com; DBLP
citation information, etc
- Introductory paper(s): Check the recent
paper from our group and USP colleagues on GMine.
Also [Takeshi
Yamada, Kazumi
Saito,
Naonori Ueda: Cross-Entropy Directed Embedding
of
Network Data. ICML'03]. Also, the visualization
tools
from CAIDA, and specifically 'walrus'.
- Comments: Open-ended
problem, but very useful - could lead to
publication. The first
step is to implement the method by [Yamada et al].
- Contact
person: Jure
Leskovec
2.6. [*] Relational databases as graphs, and 'fuzzy queries'.
- Problem:
Relatively
recent work
argues that Relational databases can be
treated as a graph, and queries do not need to use SQL at all. More
specifically, consider the DBLP graph, and queries of the form 'who is
the most central person with respect to "recovery"', or 'what is the
most influential paper by person X'. Colleagues at Yahoo would like to
find who is the person that asks the best questions; who is the one
that gives the best answers, etc. Other collaborators want to answer
fuzzy queries, like 'find authors who are related to KDD
conference, and to the University of XYZ' (that means, they
have a lot of papers in KDD, and a lot of colleagues in XYZ
university).
- Data:
DBLP database; maybe data from Yahoo (but will be under NDA, at best).
- Introductory
paper(s):
the BANKS system
[Bhalotia
et
al, ICDE'02]; see the work
at UCSD
[Balmin
et al, VLDB'04] and also [Geerts
et al,
VLDB'04]. Also see the 'electricity' based algorithms of [Palmer+
PAKDD'03]
- Comments:
It will be
very interesting to compare the above
approaches, and maybe to merge them, to create a hybrid with even
better performance. It will need user studies - we could get 'ground
truth' from UCSD. It will probably require a lot of implementation.
However, it is of high research and industrial interest.
- Contact
person: Hanghang
Tong.
2.7. [*] E-bay fraud detection
- Problem: Given historical user data (who bought what, from
whom, and feedbacks), find fraud. Apparently, the most important type
of fraud is 'account hijacking': I steal your id somehow, and I
exploit your good reputation, and sell non-existing products to
unsuspecting victims. The goal of the project is to spot anomalies in a
user's selling/bying behavior, which may signify account
hijacking.
- Data: We have a crawl of 60,000 users and ~1M transactions, at CMU.
- Introductory papers: See the paper by Polo Chau [Chau+ PKDD06] - a follow-up, unpublished version will be available from the instructor.
- Comments: very practical - may make headlines in popular press!
- Contact person: Duen Horng ('Polo') Chau
3. GRAPHS - INFLUENCE PROPAGATION, GENERATORS,
MODELS
3.1. [*] Propagation of Influence/Information in Networks and
weblogs ('blogs')
- Problem:
We want
to find patters of propagation of information (or viruses, influence,
etc.) in a network. We can start by limiting ouselves to trees. For
example, in a web-log influence tree, what is the most
typical form of
influence: a 'star' topology? a 'string' topology? something
in-between? How to generate such realistic patterns, from first
principles? ALSO, we want to model the temporal aspects: how often do
bloggers post messages? are the posts uniformly
distributed over time? (probably not, probably bursty). How
can we spot abnormal/surprising patterns?
- Data:
social networks, citation networks, weblog influence data.
- Introductory
paper(s): [Leskovec
et al, PAKDD 2006] and (internal to CMU) [Leskovec+
SDM07 - full version], [Leskovec+07
submitted]
- Contact
persons: Mary McGlohon,
Jure
Leskovec
3.2. Virus propagation - SIS model
- Problem: Consider an SIS model
('Susceptible-Infected-Susceptible', like people getting sick with the
flu, and recovering, and getting re-infected and so on).
Given a graph and the virus properties, determine whether the system
reaches a steady state (as opposed to oscillating, or as opposed to
chaotic behavior). If it does reach a steady state, estimate
the
number of infected nodes.
- Data: not needed, since it is mainly a
theoretical
project.
- Introductory paper(s): the virus
propagation paper by [Wang
et al, SRDS' 2003]
- Comments: There are some initial ideas
on how to
approximate the problem (contact instructor). Should be
enough for 1 person, and could lead to a publication.
3.3. Virus propagation - SIR model
- Problem:
The goal is to
estimate the expected population of Susceptible, Infected and
Recovered, as a function of time, when the infections happen along the
edges of a graph. Past work usually assumes a complete clique topology
(everybody can infect everybody else).
- Data:
We have AIDS
infection numbers, of two different countries in Africa, with different
social structures.
- Introductory
paper(s): [Zou,
Gong, Towsley, CCS'02],
with many more (say, from Pastor-Satorras and Vespignani; Tanya
Vassilevska-Kostova).
- Comments:
Mainly a
theoretical project, which could involve large scale simulation. The
conjecture is that the initial growth and the
final decay depend on the first eigenvalue of the adjacency matrix.
3.4. Finding models for Time Evolving Graphs
- Problem:The
goal is to
find a model for the creation and evolution of
networks. Given a time evolving network we want to fit a
network
evolution model, and specifically the one based on 'Kronecker graphs'
(see below), or some other model that you may design. The model itself
and its parameters would give us
further insight into the network evolution.
- Data:
Large time evolving social and citation networks (scientific
publications, email graph, recommendation network, etc.).
- Introductory
paper(s): Kronecker graphs paper [Leskovec
et al, PKDD 2005]; the thesis of
Deepay Chakrabarti, and recent graph topology surveys
[Chakrabarti+ 06, ACM Comp. Surveys].
- Comments:
Hard problem. Could lead to a publication.
- Contact
person: Jure
Leskovec
3.5. Generation of Realistic Labeled Graphs
- Problem:
The goal
of the project is to extend current graph generator models to also
handle labeled graphs. The intuition is that nodes in densely linked
parts of the graph are more likely to have same label. So
besides generating
a realistic topology we also want to assign labels to nodes in a
realistic and principled way. A related, but completely different
problem is to consider edge labels, too. For
example, people/nodes in a social network could have financial links,
friendship links, parent-child links; the question is whether one type
of link increases or decreases the chances for another type of link.
The initial idea is to treat the edge labels as one more dimension, and
represent such a graph with a 'tensor'. Then, you could try tensor
analysis tools (PARAFAC,
multilinear
decompositions).
- Data:
labeled social networks for evaluation.
- Introductory
paper(s): [Chakrabarti+
SIAM-DM'04], [Leskovec
et al, PKDD 2005], and the tensor decomposition paper [Sun+,
KDD06]
- Comments:
Mainly theoretical (tensors etc). A lot of recent research interest on
the topic.
- Contact
persons: Jure
Leskovec, Jimeng
Sun
4. MULTIMEDIA - BIOLOGICAL IMAGES
4.1. Indexing and Clustering for bio images
- Problem:
Given a
collection drosophila embryos (2-d), find a good
similarity
function, summarize
them, and report patterns and outliers.
- Data:
drosophila embryo
images
- Introductory
paper(s): Paper by [Pan+
KDD06] on the topic; papers on PCA (say, from the textbook),
on ICA [Pan+,
ICDM'05].
- Comments:
In collaboration with Prof. Eric Xing at CMU.
4.2. Distance function for 3-d protein images
- Problem:
Given a
collection protein
localization images (3-d), find a good similarity function, summarize
them, and report patterns and outliers.
- Data:
3-d protein
images, from Prof. Bob
Murphy.
- Introductory
paper(s):
See the Murphy
lab site, for
papers and datasets, and specifically [Huang+'04]
and [Chen+,
'04]
- Comments:
The challenge
is to find a good distance function, even in the case that we have no
labels.
5. MISCELLANEOUS
5.1. Fraud detection in on-line auctions - hijacked accounts
- Problem:
The goal of this project is to detect hijacked accounts for online auctions.
It is quite common these days to hear news that perpetrators hijack
online auction users' accounts,
often through fake emails pretending to be sent from the official auction
websites (such as eBay).
The hijackers do this to steal whatever good reputation
that those users have already establish and use that
as a ``proof of trustworthiness''
for the next fraudulent sales that they are going to carry out,
which often involve non-delivery of auction items.
This will explore methods to detect such hijacking,
specifically through anomaly detection,
detecting the changes in user account behaviors
(such as sales frequency, time periods of the day, etc).
- Data:
crawled eBay data (crawler will be available),
information of various confirmed hijacked accounts,
as ground truth, will need to be gathered.
- Comments:
This is an important real problem to solve, but an open-ended one. Could lead to publication.
- Contact
person: Polo Chau
5.2. Auction fraud - detecting networks of 1-cent auctions
- Problem:
The goal of this project is to discover the (fraudulent) relationship
between groups of people who create ``fake'' reputation.
In online auctions, such as eBay,
there is the phenomenon that some ``feedback groups''
try to artificially boost each others' apparent reputation
(or trustworthiness) by buying or selling very cheap items,
usually 1-cent items.
From reports of real auction users and even some new articles,
it seems that a lot of those 1-cent groups
are actually nicely managed and probably even have automated systems
that back the item posting and transaction processes.
This project will try and find out what kind
of networks exist among those groups,
who are involved,
and also whatever patterns of activities
that might exist in their activities,
(and any other interesting issues.)
- Data:
crawled eBay data (crawler will be available)
- Comments:
This is an important real problem to solve, but an open-ended one.
Could lead to publication.
- Contact
person: Polo Chau
DATASETS
Unless explicitly mentioned, the datasets are either
'public'
or 'owned' by the instructor; for the rest, we need to discuss about
'Non-disclosure
agreements' (NDAs).
Time sequences
- Time
series
repository at UCR.
- KURSK
dataset of multipe time sequences: time series from
seismological
sensors by the explosion site of the 'Kursk' submarine.
- Track traffic data, from our Civil
Engineering Department.
Number
of trucks, weight etc per day per highway-lane. Find patterns,
outliers;
do data cleansing.
- River-level / hydrology data: multiple,
correlated time series. Do data cleansing; find correlations
between
these series. Excellent project for people that like canoeing!
- Sunspots: number
of
sunspots
per unit time. Some data are here.
Sunspots seem to have an 11-year periodicity, with high spikes.
- Time sequences from the Sante-Fe
Institute forecasting competition (financial data, laser-beam
oscillation
data, patients' apnea data etc)
- Disk
access
traces, from HP Labs (we have local
copies at CMU). For
each disk access, we have the timestamp, the
block-id, and the type ('read'/'write'). Here is a snippet
of the data, aggregated per 30'.
Spatial data
- Astrophysics data - thousands of
galaxies, with
coordinates, red-shift,
spectra, photographs. Small
snippet of the data. More data are in the 'skyserver' web
site,
where
you can ask
SQL
queries
and get data in html or csv format
- Road segments: several datasets with
line segments (roads
of U.S.
counties, Montgomery MD, Long Beach CA, x-y coordinates of stars in the
sky from NASA, etc). Snippet
of data (roads from California, from TIGER).
Images/video
- Biological
data: images of proteins, with ~50 attributes each.
- 'Owner': Prof. Bob Murphy.
- Video/image/sound data, from Informedia.
2Tb of video,
segmented; 1M images with features; 10^4 faces. Extract features;
design good similarity functions; do the named-entity analysis.
Graph-like data
- Web-log and click-stream data (NDA:
needed).
- Visit patterns for a large web site: for
300 pages, and
thousands
of users, we record how many times a user visited a specific site. Find
patterns, clusters, fractal dimensions, regularities in the SVD etc.
- Enron
email dataset
(400 MB compressed)
- Movie-actor data from imdb.com
(we have a cleaned-up snapshot of it)
- DBLP author-paper-conference data from the DBLP site of
Mike Ley (records
in XML, and their
DTD). For 'ego-surfing', try this java app
or the java
applet at U.
Alberta.
Miscellaneous:
SOFTWARE
Notes for the software: Before you modify
any code, please
contact
the instructor - ideally, we would like to use these packages as black
boxes.
- Readily available:
- DR-tree
: R-tree code; searches for range and nearest-neighbor queries. In C.
- kd-tree
code
- OMNI trees - a faster version of metric trees.
- B-tree code, for text (should be easily changed to
handle
numbers,
too).
In C.
- Code for SVD in
`mathematica'.
- Code for computing the fractal dimension in Perl
and C, by Leejay Wu
- Barnsley's algorithm
for Iterated Function Systems in `C'.
- the NetMine
network topology analysis package
- GMine: interactive graph visualization package and
graph
manipulation library (by Jure and Junio)
- the 'crossAssociation'
package for graph partitioning.
- Outside CMU:
- GiST
package from
Hellerstein
at UC Berkeley: A general spatial access method, which is easy to
customize.
It is already customized to yield R-trees.
BIBLIOGRAPHICAL RESOURCES:
Last modified Feb. 6, 2007, by Christos Faloutsos.