Carnegie Mellon University
15-826 Multimedia Databases and Data Mining
Spring 2008 - C. Faloutsos
List of suggested projects
The projects are grouped according to their general theme. We also
list the data and software available, to leverage your effort. More
links and resources may be added in the future. Reminders:
- URL for this very page (internal to CMU - please treat it
'confidentially'):
www.cs.cmu.edu/~christos/courses/826.S08/CMU-ONLY/projlist.html
- Feel free to propose projects outside this list, as long
as they have to do with mining and indexing large
datasets.
- An asterisk [*] in the project title signify that this
project is related to the phd dissertation of the contact person.
Two asterisks [**] means that this is a group project, with
several potential collaborators. But feel free to consider
non-asterisked projects, too, if they are related to your interests
or your dissertation.
SUGGESTED TOPICS
0. HADOOP AND PARALLELISM
The projects below are mainly designed for a traditional,
single-machine architecture. However, 'hadoop' allows relatively
easy parallel execution, implementing the map-reduce
system of Google [Dean + Ghemawat, OSDI'04]. 'Hadoop' is
open source; we can provide some lab machines for you to install
'hadoop', or we can give you access to a 50-node hadoop cluster at
INTEL-Pittsburgh, and maybe access to the 'M45' of Yahoo (1000
machines, 4 cores each, 1Tb total RAM and over 3Pb storage - see
the press release at Yahoo, CMU,
Scientific American, etc). You are welcome to try any of these
projects below, on a hadoop cluster.
1. SPATIO/TEMPORAL AND STREAM MINING
1.1. [*] Disk access traffic patterns, and the Self-*
project
- Problem: Given traces
from real workstations (tuples of the form <disk-id, track-id,
R/W-flag, timestamp>), find patterns; do predictions; use them
to design better buffering and prefetching algorithms. Try 'blind
signal separation'/ICA, to distinguish between 'reads' and
'writes', or between interactive and database accesses. A related
problem is to forecast the response time for a given disk request,
given a training set with their response times. The main problem is
to extract good features. In fact, this is a small part of the the
Self-* project,
which also has numerous co-evolving time sequences from a prototype
data center with multiple 'intelligent' storage units: cpu
utilizations, network traffic measurements, room temperature sensor
measurements, humidity measurements, etc. The goal is to find
patterns, correlations, lag-correlations, anomalies, to help the
data center self-organize, self-detect upcoming (or existing)
failures and attacks, to self-optimize its performance.
- Data: See the
'Disk Access Traces' below.
Also, the web site of the Self-* project, with a lot
of measurement data, that we already have. We also have traces from
an MS SQL Server.
- Introductory
paper(s): the 'PQRS' model [Wang et
al, PEVA 2001]; see also the use of CART [
Wang et al SIGMETRICS 2004] and the follow-up work [Mesnier+,
'05]. Check the SPIRIT
project, or the live Intemon
system, and the corresponding publications (VLDB06,
OSR06) on
Jimeng's page under
'InteMon'. Also the lag-correlation paper [
Sakurai+ SIGMOD'05].
- Comments: The general
case is hard, and in fact, is the topic of dissertations
(Dr. Mengzhi
Wang, Dr. Mike
Mesnier, Dr. Jimeng
Sun). However, there are a lot of initial ideas that you could
try within a semester, and a
lot of industrial interest in the topic.
- Contact
person: Lei Li
2. GRAPHS - LARGE GRAPH MINING
2.1. Large/parallel graph mining, possibly using 'hadoop'
- Problem: Given a large
graph with billions of edges and tens of billions of nodes, and
several share-nothing machines, parallelize the typical graph
mining algorithms, to be as fast as you can. We want to
compute the in- and out-degree distributions, the diameter of the
graph, the first several eigenvalues, the 'network value' of each
node, the 'clustering coefficient', the node- and edge-betweeness.
The first step is to do timing of several possible architectures:
with, or without a relational DBMS; with, or without replication of
the data. Also, what is the best way to store the data (e.g., as
<from,to> pairs in a flat file; as an adjacency list, hashed
on the 'from' node-id, or as something else.)
- Data: We shall start
with synthetic data, using an existing generator [Leskovec+,
PAKDD'05]. Then, DBLP, IMDB etc. We could also get data on real
CMU IP traffic (will need NDA). Finally, we also have a
who-talks-to-whom social network with 270 million nodes and 8
billion edges (60Gb of data)
- Introductory paper(s):
The generator above; the Gamma database machine papers [Dewitt+,
IEEE TKDE'90]; papers on hash-joins [Kitsuregawa+,
vldb'90] the RMAT paper [Chakrabarti+
SIAM-DM'04], the connection sub-graph paper [
Faloutsos+, KDD'04]. If you plan to use 'hadoop', get the map-reduce paper
[Dean + Ghemawat, OSDI'04].
- Comments: Very high
practical interest, with hard problems from both the algorithmic as
well as the system side. There is a lot of room, even for 4 or
more people.
- Contact person:
Jure Leskovec
2.2. [**] Large Graph Visualization
- Problem: Given a huge graph (say, millions of nodes),
help visualize it. Start with the ICML03 paper below; then, try to
extend it for huge graphs: try some partitioning/grouping method,
and/or some fish-eye ideas.
- Data: epinions.com; DBLP citation information,
etc
- Introductory paper(s): Check the recent paper from our
group and USP colleagues on GMine. Also [Takeshi
Yamada, Kazumi Saito, Naonori Ueda:
Cross-Entropy Directed Embedding of Network Data.
ICML'03]. Also, the visualization tools
from CAIDA, and specifically 'walrus'.
Check the graphdrawing
organization and the corresponding sequence of conferences
on Graph Drawing (GD) from there. Our goal here is different,
though: we want to large graphs, that don't fit in memory,
nor on the screen, nor in the human mind, unless we summarize them
somehow.
- Comments: Open-ended problem, but very useful
- could lead to publication. The first step is to implement
the method by [Yamada et al].
- Contact person:
Polo Chau (also,
Jure Leskovec and
Mary McGlohon)
2.3. [*] Best -Effort Pattern Match
- Problem: Given a large attributed graph, how can we find
subgraphs that match a user pattern query. For example,
“find a CEO who has strong interactions with a Manager, a
Lawyer, and an Accountant, or another structure as close to that as
possible”. Similarly, a ‘loop’ query could
help us spot, eg., a money laundering ring.
- Data: DBLP database; IMDB datasets.
- Introductory paper(s): the G-Ray, [Tong et al
KDD 2007], the BANKS system
[
Bhalotia et al, ICDE'02].
- Comments: There are several possible extensions of
G-Ray: labelled edges, weighted edges, time-stamps. Might lead to
publications.
- Contact person: Hanghang
Tong.
2.4. [*] Relational databases as graphs, 'fuzzy queries' and
Center-Piece Subgraphs (CePS)
- Problem: Relatively
recent work argues that Relational databases can be treated as a
graph, and queries do not need to use SQL at all. More
specifically, consider the DBLP graph, and queries of the form 'who
is the most central person with respect to "recovery"', or 'what is
the most influential paper by person X'. Colleagues at Yahoo would
like to find who is the person that asks the best questions; who is
the one that gives the best answers, etc. Other collaborators want
to answer fuzzy queries, like 'find authors who are related to
KDD conference, and to the University of XYZ' (that means, they
have a lot of papers in KDD, and a lot of colleagues in the XYZ
university). Similarly, given Q nodes in a social network
(say, authorship network), how can we find the node/author that is
the centerpiece, and has direct or indirect connections to all, or
most of them? For example, this node could be the common advisor,
or someone who started the research area that the Q nodes belong
to.
- Data: DBLP database;
maybe data from Yahoo (but will be under NDA, at best).
- Introductory paper(s):
start from the CePS paper [Tong et al KDD
2006]; also see the BANKS system [
Bhalotia et al, ICDE'02]; see the work at UCSD
[Balmin
et al, VLDB'04] and also [Geerts et al,
VLDB'04]. Also see the 'electricity' based algorithms of
[
Palmer+ PAKDD'03]
- Comments: There
are several possible extensions of CePS, like negation ('find
people related to X and Y, but not to Z'). Might lead to
publications. You may also choose to do an implementation-mainly
project, where you could set up a web site with the CePS algorithm
above, so that people can discover their scientific environment,
their connections to others etc. See the off-line browser, the
visualization tool at DBLP, to get some initial ideas.
- Contact person: Hanghang Tong.
2.5. [*] Cross-Association/Co-clustering
- Problem: Given a contingency table (i.e. large sparse
bipartite graph w/ positive edge weights), how can we
simultaneously group the row/column into some clustering? How do we
do that in a parameter-free way? Recently, researchers try to solve
this problem from information theoretic point of view, by exploring
the duality between data mining and compression. One challenge is
how to generalize it to heterogeneous settings, for example, when
we have 3 types of nodes (in DBLP: authors, papers, keywords), and
we want to find groups and clusters.
- Data: DBLP database; the ENRON dataset.
- Introductory paper(s): the co-clustering paper,
[Dhillon et al KDD 2003], the cross-association
paper
[Chakrabarti et al KDD 2004], the GraphScope paper
[Sun et al
KDD 2007], the paper by Bo Long [Long
et al ICML 2006].
- Comments: There are several possible extensions of the
current methods. Might lead to publications. There is also some
practical issues, for example, efficient implementation of the
current methods, in a conventional machine, or on a 'hadoop'
cluster.
- Contact person: Hanghang
Tong.
3. GRAPHS - INFLUENCE PROPAGATION, GENERATORS,
MODELS
3.1. [*] Propagation of Influence/Information in Networks and
weblogs ('blogs')
- Problem: We want to
find patters of propagation of information (or viruses, influence,
etc.) in a network. We can start by limiting ouselves to trees. For
example, in a web-log influence tree, what is the most typical
form of influence: a 'star' topology? a 'string' topology?
something in-between? How to generate such realistic patterns, from
first principles? Also, we want to model the temporal aspects: how
often do bloggers post messages? are the posts uniformly
distributed over time? (probably not, probably bursty). How can
we spot abnormal/surprising patterns?
- Data: social networks,
citation networks, weblog influence data.
- Introductory paper(s):
[McGlohon+07],
[Leskovec
et al, PAKDD 2006] and (internal to CMU) [
Leskovec+ SDM07 - full version],
- Contact persons:
Mary
McGlohon (also, Jure
Leskovec)
3.2. Generation of Realistic Labeled Graphs
- Problem: The goal of
the project is to extend current graph generator models to also
handle labeled graphs. The intuition is that nodes in densely
linked parts of the graph are more likely to have same
label. So besides generating a realistic topology we also want
to assign labels to nodes in a realistic and principled way. A
related, but completely different problem is to consider edge
labels, too. For example, people/nodes in a social network could
have financial links, friendship links, parent-child links; the
question is whether one type of link increases or decreases the
chances for another type of link. The initial idea is to treat the
edge labels as one more dimension, and represent such a graph with
a 'tensor'. Then, you could try tensor analysis tools (PARAFAC, multilinear
decompositions).
- Data: labeled social
networks for evaluation.
- Introductory paper(s):
[Chakrabarti+
SIAM-DM'04], [Leskovec
et al, PKDD 2005], and the tensor decomposition paper [Sun+,
KDD06]
- Comments: Mainly
theoretical (tensors etc). A lot of recent research interest on the
topic.
- Contact persons:
Hanghang Tong, Jure Leskovec
4. MULTIMEDIA - BIOLOGICAL IMAGES
4.1. [**] Feature Extraction for analyzing Drosophila Embryo
Images
- Problem: Given a set
of annotated 2D images of Drosophila embryos (352*160 gray scale),
how can we extract good numerical features that capture the
characteristics of each image? Is there a proper distance function
that combine local features and global features to determine the
"closeness" between two images? Good feature extraction algorithms
would help to improve the performance of a number of mining tasks
such as automatic captioning and multi-modal querying
- Data: Check the
BDGP site
here ; we also have preprocessed data available in which low
quality images were removed and fly embryos were already scaled and
aligned.
- Introductory paper(s):
FEMine (Our previous work published in KDD'06, details on data
preprocessing; baseline for feature extraction algorithm design);
Zhou and Peng, 2007 , and Peng
et al, 2007 (Recent work on automatic fly embryo image
analysis); also
Tomancak et al, 2002 (Papers by the BDGP group; Good references
if you want to know more about the dataset)
- Comments: The dataset
contains more than 10k images, you may start from a smaller subset,
and write your own algorithm to do further data cleaning.
- Contact Persons:
Fan Guo and
Lei Li
4.2. Fast implementations of RWR (for gCap)
- Problem: In the 'gCap'
paper (see below), and in the Drosophila Embryo project above, we
need to compute the steady-state probability of each node in a
random walk with restarts, when the restarting node is unknown.
Thus, naively, we need n2 steady-state
probabilities. How can we do better than that?
- Data: DBLP, Corel
image data, and more.
- Introductory paper(s): Probably the fastest algorithm so
far is by Hanghang Tong, in ICDM'06.
The goal is to make it even faster, and/or to implement it in
C/C++, or 'hadoop', so that it can run on huge graphs
(Gb-size). Related papers: gCap [Pan
et al, KDD'04]; topic sensitive PageRank [
Haveliwala WWW'02] ; fast algorithms for topic sensitive
PageRank [Hawelivala
+ '03]; graph partitioning [Sun+,
'05].
- Comments: Mainly,
implementation - but it has room for innovation. Closely related to
the dissertation of Hanghang Tong, who will help along. Also, it
could be used immediately by the Drosophila Embryo project
above.
- Contact person:
Hanghang Tong.
5. MISCELLANEOUS
5.1. Fraud detection in on-line auctions - hijacked
accounts
- Problem: The goal of
this project is to detect hijacked accounts for online auctions. It
is quite common these days to hear news that perpetrators hijack
online auction users' accounts, often through fake emails
pretending to be sent from the official auction websites (such as
eBay). The hijackers do this to steal whatever good reputation that
those users have already establish and use that as a ``proof of
trustworthiness'' for the next fraudulent sales that they are going
to carry out, which often involve non-delivery of auction items.
This will explore methods to detect such hijacking, specifically
through anomaly detection, detecting the changes in user account
behaviors (such as sales frequency, time periods of the day,
etc).
- Data: crawled eBay
data (crawler will be available), information of various confirmed
hijacked accounts, as ground truth, will need to be gathered.
- Comments: This is an
important real problem to solve, but an open-ended one. Could lead
to publication.
- Contact person:
Polo Chau
5.2. Auction fraud - detecting networks of 1-cent auctions
- Problem: The goal of
this project is to discover the (fraudulent) relationship between
groups of people who create ``fake'' reputation. In online
auctions, such as eBay, there is the phenomenon that some
``feedback groups'' try to artificially boost each others' apparent
reputation (or trustworthiness) by buying or selling very cheap
items, usually 1-cent items. From reports of real auction users and
even some new articles, it seems that a lot of those 1-cent groups
are actually nicely managed and probably even have automated
systems that back the item posting and transaction processes. This
project will try and find out what kind of networks exist among
those groups, who are involved, and also whatever patterns of
activities that might exist in their activities, (and any other
interesting issues.)
- Data: crawled eBay
data (crawler will be available)
- Comments: This is an
important real problem to solve, but an open-ended one. Could lead
to publication.
- Contact person:
Polo Chau
DATASETS
Unless explicitly mentioned, the datasets are either
'public' or 'owned' by the instructor; for the rest, we need to
discuss about 'Non-disclosure agreements' (NDAs).
Time sequences
- Time series
repository at UCR.
- KURSK
dataset of multipe time sequences: time series from
seismological sensors by the explosion site of the 'Kursk'
submarine.
- Track traffic data, from our Civil Engineering
Department. Number of trucks, weight etc per day per highway-lane.
Find patterns, outliers; do data cleansing.
- River-level / hydrology data: multiple,
correlated time series. Do data cleansing; find correlations
between these series. Excellent project for people that like
canoeing!
- Sunspots: number
of sunspots per unit time. Some data are here.
Sunspots seem to have an 11-year periodicity, with high
spikes.
- Time sequences from the
Sante-Fe Institute forecasting competition (financial data,
laser-beam oscillation data, patients' apnea data etc)
- Disk access
traces, from HP Labs (we have local copies at CMU).
For each disk access, we have the timestamp, the block-id, and the
type ('read'/'write'). Here is a
snippet of the data, aggregated per 30'.
Spatial data
- Astrophysics data - thousands of galaxies, with
coordinates, red-shift, spectra, photographs.
Small snippet of the data. More data are in the 'skyserver' web
site, where you can ask SQL queries and
get data in html or csv format
- Road segments: several datasets with line segments
(roads of U.S. counties, Montgomery MD, Long Beach CA, x-y
coordinates of stars in the sky from NASA, etc).
Snippet of data (roads from California, from TIGER).
Images/video
-
Biological data: images of proteins, with ~50 attributes
each.
- 'Owner': Prof. Bob Murphy.
- Video/image/sound data, from Informedia. 2Tb of video,
segmented; 1M images with features; 10^4 faces. Extract features;
design good similarity functions; do the named-entity
analysis.
Graph-like data
- Web-log and click-stream data (NDA: needed).
- Snapshots of 2 anonymized (and anonymous) social networks
(NDA)
- Visit patterns for a large web site: for 300 pages, and
thousands of users, we record how many times a user visited a
specific site. Find patterns, clusters, fractal dimensions,
regularities in the SVD etc.
- Netflix competition
dataset (users, movies and ratings) - needs easy registration - we
have a cleaned-up version available, locally.
- Enron email
dataset (400 MB compressed)
- Movie-actor data from imdb.com (we have a cleaned-up snapshot
of it)
- DBLP author-paper-conference data from the DBLP site of Mike
Ley (records in XML,
and their
DTD). For 'ego-surfing', try this java app or the
java applet
at U. Alberta.
- Graph
datasets at U.Mass (Amherst), by Prof. Dave Jensen.
Miscellaneous:
SOFTWARE
Notes for the software: Before you modify any code,
please contact the instructor - ideally, we would like to use these
packages as black boxes.
- Readily available:
- ACCESS METHODS
-
-
DR-tree : R-tree code; searches for range and nearest-neighbor
queries. In C.
-
kd-tree code
- OMNI
trees - a faster version of metric trees.
- B-tree code, for text (should be easily changed to handle
numbers, too). In C.
- SVD AND TENSORS:
-
- Code for SVD in
`mathematica'.
- Code for
SPIRIT (incremental SVD on streams)
- FRACTALS
-
- GRAPHS
-
- the
NetMine network topology analysis package
- GMine:
interactive graph visualization package and graph manipulation
library (by Junio (Jose Fernandez Rodrigues Junior) and Jure
Leskovec)
- the '
crossAssociation' package for graph partitioning.
- Outside CMU:
- GiST package from
Hellerstein at UC Berkeley: A general spatial access method, which
is easy to customize. It is already customized to yield
R-trees.
BIBLIOGRAPHICAL RESOURCES:
Last modified Jan. 20, 2008, by Christos Faloutsos.