project list

Carnegie Mellon University
15-826 Multimedia Databases and Data Mining
Spring 2007 - C. Faloutsos

List of suggested projects

The projects are grouped according to their general theme. We also list the data and software available, to leverage your effort. More links and resources may be added in the future. Reminders:

URL for this very page (internal to CMU - please treat it 'confidentially'): www.cs.cmu.edu/~christos/courses/826.S07/CMU-ONLY/projlist.html
Feel free to propose projects outside this list, as long as they have to do with mining and indexing large datasets.
Asterisks [*] in the project title signify active, ongoing work on this project, with several more collaborators. But feel free to consider non-asterisked projects, too, if they are related to your interests or your dissertation.

3. GRAPHS - INFLUENCE PROPAGATION, GENERATORS, MODELS

3.1. [*] Propagation of Influence/Information in Networks and weblogs ('blogs')

Problem: We want to find patterns of propagation of information (or viruses, influence, etc.) in a network. We can start by limiting ourselves to trees. For example, in a web-log influence tree, what is the most typical form of influence: a 'star' topology? a 'string' topology? something in-between? How can we generate such realistic patterns from first principles? We also want to model the temporal aspects: how often do bloggers post messages? Are the posts uniformly distributed over time (probably not; they are probably bursty)? How can we spot abnormal/surprising patterns?
Data: social networks, citation networks, weblog influence data.
Introductory paper(s): [Leskovec et al, PAKDD 2006] and (internal to CMU) [Leskovec+ SDM07 - full version], [Leskovec+07 submitted]
Contact persons: Mary McGlohon, Jure Leskovec
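
One way to make the 'star' versus 'string' question concrete is to summarize each influence tree by its depth and its largest fan-out: a star has depth 1 and a large fan-out, a string has fan-out 1 and a large depth. The following is only a minimal sketch; the child-to-parent dictionary format and the function name `tree_shape` are illustrative assumptions, not part of the course data or software.

```python
from collections import defaultdict


def tree_shape(parent):
    """Return (depth, max_fanout) for a rooted tree given as {child: parent}.

    The root is the single node that appears as a parent but never as a child.
    The input format is an assumption made for this sketch.
    """
    children = defaultdict(list)
    for child, par in parent.items():
        children[par].append(child)

    roots = [node for node in children if node not in parent]
    if len(roots) != 1:
        raise ValueError("expected exactly one root")

    depth, max_fanout = 0, 0
    stack = [(roots[0], 0)]  # (node, distance from root)
    while stack:
        node, dist = stack.pop()
        kids = children.get(node, [])
        depth = max(depth, dist)
        max_fanout = max(max_fanout, len(kids))
        for kid in kids:
            stack.append((kid, dist + 1))
    return depth, max_fanout


# Hypothetical toy cascades: post 0 is the original post in both.
star = {1: 0, 2: 0, 3: 0, 4: 0}
chain = {1: 0, 2: 1, 3: 2, 4: 3}
```

A histogram of (depth, fan-out) pairs over many cascades would then show which shapes dominate.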

3.2. Virus propagation - SIS model

Problem: Consider an SIS model ('Susceptible-Infected-Susceptible', like people getting sick with the flu, and recovering, and getting re-infected and so on). Given a graph and the virus properties, determine whether the system reaches a steady state (as opposed to oscillating, or as opposed to chaotic behavior). If it does reach a steady state, estimate the number of infected nodes.
Data: not needed, since it is mainly a theoretical project.
Introductory paper(s): the virus propagation paper by [Wang et al, SRDS' 2003]
Comments: There are some initial ideas on how to approximate the problem (contact instructor). Should be enough for 1 person, and could lead to a publication.
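
To get a feel for the steady-state question, the SIS dynamics can be iterated numerically. The sketch below assumes a discrete-time model in which every infected neighbor transmits with probability `beta` per step, an infected node recovers with probability `delta` per step, and neighbors' infection states are treated as independent. These modeling choices, the adjacency-list format and the function names are assumptions of this sketch; consult the introductory paper for the model that is actually analyzed there.

```python
def sis_step(adj, p, beta, delta):
    """One step of an independence-approximation SIS update.

    adj[i] lists the neighbors of node i; p[i] is the probability that
    node i is infected. A node is healthy after the step if it receives no
    infection from its neighbors and it was either healthy already or
    infected and cured during this step.
    """
    new_p = []
    for i, nbrs in enumerate(adj):
        no_infection = 1.0
        for j in nbrs:
            no_infection *= 1.0 - beta * p[j]
        healthy = (1.0 - p[i]) * no_infection + delta * p[i] * no_infection
        new_p.append(1.0 - healthy)
    return new_p


def sis_run(adj, p0, beta, delta, steps):
    """Iterate sis_step `steps` times starting from probabilities p0."""
    p = list(p0)
    for _ in range(steps):
        p = sis_step(adj, p, beta, delta)
    return p


# Hypothetical example: a 5-node clique with 10% initial infection probability.
clique = [[j for j in range(5) if j != i] for i in range(5)]
steady = sis_run(clique, [0.1] * 5, beta=0.5, delta=0.1, steps=60)
expected_infected = sum(steady)  # estimate of the number of infected nodes
```

Summing the per-node probabilities gives a rough estimate of the number of infected nodes once the iteration stops changing.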

3.3. Virus propagation - SIR model

Problem: The goal is to estimate the expected population of Susceptible, Infected and Recovered, as a function of time, when the infections happen along the edges of a graph. Past work usually assumes a complete clique topology (everybody can infect everybody else).
Data: We have AIDS infection numbers, of two different countries in Africa, with different social structures.
Introductory paper(s): [Zou, Gong, Towsley, CCS'02], with many more (say, from Pastor-Satorras and Vespignani; Tanya Vassilevska-Kostova).
Comments: Mainly a theoretical project, which could involve large scale simulation. The conjecture is that the initial growth and the final decay depend on the first eigenvalue of the adjacency matrix.
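
To experiment with the conjecture, one needs the first (largest) eigenvalue of the adjacency matrix. A minimal sketch using power iteration follows. It assumes an undirected graph given as adjacency lists, and it iterates on A + I rather than A so that bipartite graphs (such as stars) also converge; the shift, the input format and the name `largest_eigenvalue` are choices made for this sketch only.

```python
import math


def largest_eigenvalue(adj, iters=200):
    """Estimate the largest adjacency eigenvalue of an undirected graph.

    adj[i] lists the neighbors of node i. Power iteration runs on A + I,
    and the returned value is the Rayleigh quotient x^T A x of the final
    unit-length vector x.
    """
    n = len(adj)
    x = [1.0] * n
    for _ in range(iters):
        y = [x[i] + sum(x[j] for j in adj[i]) for i in range(n)]  # (A + I) x
        norm = math.sqrt(sum(v * v for v in y))
        if norm == 0.0:
            return 0.0
        x = [v / norm for v in y]
    return sum(x[i] * sum(x[j] for j in adj[i]) for i in range(n))


# Hypothetical toy graphs: a star with 4 leaves and a 5-node clique.
star = [[1, 2, 3, 4], [0], [0], [0], [0]]
clique = [[j for j in range(5) if j != i] for i in range(5)]
```

For large sparse graphs a library eigensolver would be the practical choice; the loop above is only meant to show the idea.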

3.4. Finding models for Time Evolving Graphs

Problem: The goal is to find a model for the creation and evolution of networks. Given a time-evolving network, we want to fit a network evolution model, specifically the one based on 'Kronecker graphs' (see below), or some other model that you may design. The model itself and its parameters would give us further insight into the network evolution.
Data: Large time evolving social and citation networks (scientific publications, email graph, recommendation network, etc.).
Introductory paper(s): Kronecker graphs paper [Leskovec et al, PKDD 2005]; the thesis of Deepay Chakrabarti, and recent graph topology surveys [Chakrabarti+ 06, ACM Comp. Surveys].
Comments: Hard problem. Could lead to a publication.
Contact person: Jure Leskovec
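
As a starting point for the Kronecker model, the deterministic construction takes a small 'initiator' adjacency matrix and repeatedly forms its Kronecker product with itself. The sketch below shows only this deterministic growth step, not the fitting of the model to data; the list-of-lists matrix format and the function names are assumptions of the sketch.

```python
def kronecker_product(a, b):
    """Kronecker product of two square matrices given as lists of lists."""
    n, m = len(a), len(b)
    size = n * m
    return [
        [a[i // m][j // m] * b[i % m][j % m] for j in range(size)]
        for i in range(size)
    ]


def kronecker_power(initiator, k):
    """k-th Kronecker power of the initiator matrix (k >= 1)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    result = initiator
    for _ in range(k - 1):
        result = kronecker_product(result, initiator)
    return result


# Hypothetical 2x2 initiator; the k-th power has 2**k nodes.
initiator = [[1, 1], [1, 0]]
g2 = kronecker_power(initiator, 2)
g3 = kronecker_power(initiator, 3)
```

The number of nonzero entries grows as a power of the initiator's nonzero count, which is one reason the parameters of such a model are easy to interpret.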

3.5. Generation of Realistic Labeled Graphs

Problem: The goal of the project is to extend current graph generator models to also handle labeled graphs. The intuition is that nodes in densely linked parts of the graph are more likely to have the same label. So, besides generating a realistic topology, we also want to assign labels to nodes in a realistic and principled way. A related but distinct problem is to consider edge labels, too. For example, people/nodes in a social network could have financial links, friendship links, parent-child links; the question is whether one type of link increases or decreases the chances for another type of link. The initial idea is to treat the edge labels as one more dimension, and represent such a graph with a 'tensor'. Then, you could try tensor analysis tools (PARAFAC, multilinear decompositions).
Data: labeled social networks for evaluation.
Introductory paper(s): [Chakrabarti+ SIAM-DM'04], [Leskovec et al, PKDD 2005], and the tensor decomposition paper [Sun+, KDD06]
Comments: Mainly theoretical (tensors etc.). There is a lot of recent research interest in the topic.
Contact persons: Jure Leskovec, Jimeng Sun
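
A simple check of the intuition about node labels is to compare the fraction of edges that join same-label nodes against the fraction expected if labels were unrelated to the topology. The following sketch is one possible evaluation measure, not a generator; the edge-list and label-dictionary formats and both function names are assumptions made for illustration.

```python
from collections import Counter


def same_label_fraction(edges, label):
    """Fraction of edges whose two endpoints carry the same label."""
    if not edges:
        raise ValueError("need at least one edge")
    same = sum(1 for u, v in edges if label[u] == label[v])
    return same / len(edges)


def chance_level(label):
    """Probability that two nodes drawn independently share a label."""
    n = len(label)
    counts = Counter(label.values())
    return sum((c / n) ** 2 for c in counts.values())


# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
label = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
```

A generated labeled graph could be called realistic in this respect if its same-label fraction exceeds the chance level by about as much as in the real networks.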

4. MULTIMEDIA - BIOLOGICAL IMAGES

4.1. Indexing and Clustering for bio images

Problem: Given a collection of drosophila embryo images (2-d), find a good similarity function, summarize them, and report patterns and outliers.
Data: drosophila embryo images
Introductory paper(s): Paper by [Pan+ KDD06] on the topic; papers on PCA (say, from the textbook), on ICA [Pan+, ICDM'05].
Comments: In collaboration with Prof. Eric Xing at CMU.

4.2. Distance function for 3-d protein images

Problem: Given a collection of protein localization images (3-d), find a good similarity function, summarize them, and report patterns and outliers.
Data: 3-d protein images, from Prof. Bob Murphy.
Introductory paper(s): See the Murphy lab site, for papers and datasets, and specifically [Huang+'04] and [Chen+, '04]
Comments: The challenge is to find a good distance function, even in the case that we have no labels.

5. MISCELLANEOUS

5.1. Fraud detection in on-line auctions - hijacked accounts

Problem: The goal of this project is to detect hijacked accounts in online auctions. It is quite common these days to hear that perpetrators hijack online auction users' accounts, often through fake emails that pretend to come from the official auction websites (such as eBay). The hijackers do this to steal whatever good reputation those users have already established, and use it as a 'proof of trustworthiness' for the fraudulent sales that they are going to carry out next, which often involve non-delivery of auction items. This project will explore methods to detect such hijacking, specifically through anomaly detection: detecting changes in user account behavior (such as sales frequency, time periods of the day, etc.).
Data: crawled eBay data (crawler will be available). Information on confirmed hijacked accounts, to serve as ground truth, will need to be gathered.
Comments: This is an important real problem to solve, but an open-ended one. Could lead to publication.
Contact person: Polo Chau
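
As a baseline for the anomaly-detection idea, one could score how far an account's latest activity deviates from its own history. The sketch below uses a plain z-score on weekly sales counts; the feature (sales per week), the function name `change_score` and any alert threshold are assumptions of this sketch, and a real detector would likely combine several behavioral features.

```python
from statistics import mean, stdev


def change_score(history, recent):
    """Number of standard deviations by which `recent` departs from `history`.

    `history` is a list of at least two past observations (for example,
    sales per week); `recent` is the newest observation.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly constant history: any deviation at all is maximal.
        return 0.0 if recent == mu else float("inf")
    return abs(recent - mu) / sigma


# Hypothetical account that sold 2-3 items per week, then suddenly 20.
weekly_sales = [2, 3, 2, 3, 2, 3]
score = change_score(weekly_sales, 20)
```

An account whose score exceeds some chosen threshold (say 3) would be flagged for closer inspection.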

5.2. Auction fraud - detecting networks of 1-cent auctions

Problem: The goal of this project is to discover the (fraudulent) relationships between groups of people who create 'fake' reputation. In online auctions, such as eBay, some 'feedback groups' try to artificially boost each other's apparent reputation (or trustworthiness) by buying or selling very cheap items, usually 1-cent items. From reports of real auction users and even some news articles, it seems that a lot of those 1-cent groups are well managed and probably even have automated systems that back the item posting and transaction processes. This project will try to find out what kinds of networks exist among those groups, who is involved, what patterns their activities follow, and any other interesting issues.
Data: crawled eBay data (crawler will be available)
Comments: This is an important real problem to solve, but an open-ended one. Could lead to publication.
Contact person: Polo Chau
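
A first cut at finding candidate 'feedback groups' is to keep only the very cheap transactions and look at the connected components of the resulting user graph. The sketch below does this with a union-find structure; the (buyer, seller, price) record format, the price cutoff and the function name are assumptions made for illustration, and the crawled data may be organized differently.

```python
def feedback_groups(transactions, max_price=0.01):
    """Group users linked by transactions priced at or below max_price.

    transactions: iterable of (buyer, seller, price) tuples.
    Returns a list of sorted user lists, largest group first.
    """
    parent = {}

    def find(user):
        parent.setdefault(user, user)
        while parent[user] != user:
            parent[user] = parent[parent[user]]  # path halving
            user = parent[user]
        return user

    for buyer, seller, price in transactions:
        if price <= max_price:
            root_buyer, root_seller = find(buyer), find(seller)
            parent[root_buyer] = root_seller

    groups = {}
    for user in list(parent):
        groups.setdefault(find(user), set()).add(user)
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Hypothetical records: two 1-cent rings and one ordinary sale.
records = [("a", "b", 0.01), ("b", "c", 0.01),
           ("d", "e", 0.01), ("a", "z", 25.0)]
groups = feedback_groups(records)
```

Connected components are a coarse notion of group; denser structures (near-cliques, bipartite cores) would be natural refinements.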

DATASETS

Unless explicitly mentioned, the datasets are either 'public' or 'owned' by the instructor; for the rest, we need to discuss 'Non-disclosure agreements' (NDAs).

Time sequences

Time series repository at UCR.
KURSK dataset of multiple time sequences: time series from seismological sensors by the explosion site of the 'Kursk' submarine.
Truck traffic data, from our Civil Engineering Department. Number of trucks, weight etc. per day per highway-lane. Find patterns, outliers; do data cleansing.
River-level / hydrology data: multiple, correlated time series. Do data cleansing; find correlations between these series. Excellent project for people that like canoeing!
Sunspots: number of sunspots per unit time. Some data are here. Sunspots seem to have an 11-year periodicity, with high spikes.
Time sequences from the Santa Fe Institute forecasting competition (financial data, laser-beam oscillation data, patients' apnea data etc.)
Disk access traces, from HP Labs (we have local copies at CMU). For each disk access, we have the timestamp, the block-id, and the type ('read'/'write'). Here is a snippet of the data, aggregated per 30'.
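
For periodic series such as the sunspot counts listed above, a quick way to check for a dominant period is to look for the lag with the highest autocorrelation. The sketch below is one simple way to do this and is tested only on a synthetic sine wave; the function name, the lag range and the synthetic data are assumptions of the sketch, not results on the actual datasets.

```python
import math


def dominant_period(series, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the largest autocorrelation."""
    n = len(series)
    if not 1 <= min_lag <= max_lag < n:
        raise ValueError("need 1 <= min_lag <= max_lag < len(series)")
    mu = sum(series) / n
    centered = [x - mu for x in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        raise ValueError("series is constant")

    best_lag, best_r = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        r = sum(centered[t] * centered[t + lag] for t in range(n - lag)) / denom
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag


# Synthetic stand-in for a yearly series with an 11-step cycle.
synthetic = [math.sin(2 * math.pi * t / 11) for t in range(220)]
period = dominant_period(synthetic, 2, 40)
```

Starting the search at lag 2 avoids the trivially high correlation at lag 1; real, noisy data may need smoothing or a spectral method instead.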

Spatial data

Astrophysics data - thousands of galaxies, with coordinates, red-shift, spectra, photographs. Small snippet of the data. More data are in the 'skyserver' web site, where you can ask SQL queries and get data in html or csv format
Road segments: several datasets with line segments (roads of U.S. counties, Montgomery MD, Long Beach CA, x-y coordinates of stars in the sky from NASA, etc). Snippet of data (roads from California, from TIGER).

Images/video

Biological data: images of proteins, with ~50 attributes each.
- 'Owner': Prof. Bob Murphy.
Video/image/sound data, from Informedia. 2Tb of video, segmented; 1M images with features; 10^4 faces. Extract features; design good similarity functions; do the named-entity analysis.

Graph-like data

Web-log and click-stream data (NDA: needed).
Visit patterns for a large web site: for 300 pages, and thousands of users, we record how many times a user visited a specific page. Find patterns, clusters, fractal dimensions, regularities in the SVD etc.
Enron email dataset (400 MB compressed)
Movie-actor data from imdb.com (we have a cleaned-up snapshot of it)
DBLP author-paper-conference data from the DBLP site of Mike Ley (records in XML, and their DTD). For 'ego-surfing', try this java app or the java applet at U. Alberta.

Miscellaneous:

Several collections of training data for machine learning algorithms, from the UC-Irvine repository and from KDnuggets.
Demographic data from the U.S. Bureau of Census

SOFTWARE

Notes for the software: Before you modify any code, please contact the instructor - ideally, we would like to use these packages as black boxes.

Readily available:
- DR-tree : R-tree code; searches for range and nearest-neighbor queries. In C.
- kd-tree code
- OMNI trees - a faster version of metric trees.
- B-tree code, for text (should be easily changed to handle numbers, too). In C.
- Code for SVD in Mathematica.
- Code for computing the fractal dimension in Perl and C, by Leejay Wu
- Barnsley's algorithm for Iterated Function Systems in C.
- the NetMine network topology analysis package
- GMine: interactive graph visualization package and graph manipulation library (by Jure and Junio)
- the 'crossAssociation' package for graph partitioning.
Outside CMU:
- GiST package from Hellerstein at UC Berkeley: A general spatial access method, which is easy to customize. It is already customized to yield R-trees.

BIBLIOGRAPHICAL RESOURCES:

Last modified Feb. 6, 2007, by Christos Faloutsos.

SUGGESTED TOPICS

1. SPATIO/TEMPORAL AND STREAM MINING

1.1. [*] Disk access traffic patterns, and the Self-* project

1.2. [*] Similarity search in motion-capture sequences

2. GRAPHS - LARGE GRAPH MINING

2.1. Handling Large Graphs

2.2. Parallel graph mining

2.3. Finding frequent sub-graphs

2.4. Fast implementations of RWR (for gCap)

2.5. Large Graph Visualization

2.6. [*] Relational databases as graphs, and 'fuzzy queries'.

2.7. [*] E-bay fraud detection

3. GRAPHS - INFLUENCE PROPAGATION, GENERATORS, MODELS

3.1. [*] Propagation of Influence/Information in Networks and weblogs ('blogs')

3.2. Virus propagation - SIS model

3.3. Virus propagation - SIR model

3.4. Finding models for Time Evolving Graphs

3.5. Generation of Realistic Labeled Graphs

4. MULTIMEDIA - BIOLOGICAL IMAGES

4.1. Indexing and Clustering for bio images

4.2. Distance function for 3-d protein images

5. MISCELLANEOUS

5.1. Fraud detection in on-line auctions - hijacked accounts

5.2. Auction fraud - detecting networks of 1-cent auctions

DATASETS

Time sequences

Spatial data

Images/video

Graph-like data

Miscellaneous:

SOFTWARE

BIBLIOGRAPHICAL RESOURCES:
