project list

Carnegie Mellon University
15-826 Multimedia Databases and Data Mining
Spring 2008 - C. Faloutsos

List of suggested projects

The projects are grouped according to their general theme. We also list the data and software available, to leverage your effort. More links and resources may be added in the future. Reminders:

URL for this very page (internal to CMU - please treat it 'confidentially'): www.cs.cmu.edu/~christos/courses/826.S08/CMU-ONLY/projlist.html
Feel free to propose projects outside this list, as long as they have to do with mining and indexing large datasets.
An asterisk [*] in the project title signify that this project is related to the phd dissertation of the contact person. Two asterisks [**] means that this is a group project, with several potential collaborators. But feel free to consider non-asterisked projects, too, if they are related to your interests or your dissertation.

SUGGESTED TOPICS

0. HADOOP AND PARALLELISM

The projects below are mainly designed for a traditional, single-machine architecture. However, 'hadoop' allows relatively easy parallel execution, implementing the map-reduce system of Google [Dean + Ghemawat, OSDI'04]. 'Hadoop' is open source; we can provide some lab machines for you to install 'hadoop', or we can give you access to a 50-node hadoop cluster at INTEL-Pittsburgh, and maybe access to the 'M45' of Yahoo (1000 machines, 4 cores each, 1Tb total RAM and over 3Pb storage - see the press release at Yahoo, CMU, Scientific American, etc). You are welcome to try any of these projects below, on a hadoop cluster.

1. SPATIO/TEMPORAL AND STREAM MINING

**1.1. [] Disk access traffic patterns, and the Self- project**

Problem: Given traces from real workstations (tuples of the form <disk-id, track-id, R/W-flag, timestamp>), find patterns; do predictions; use them to design better buffering and prefetching algorithms. Try 'blind signal separation'/ICA, to distinguish between 'reads' and 'writes', or between interactive and database accesses. A related problem is to forecast the response time for a given disk request, given a training set with their response times. The main problem is to extract good features. In fact, this is a small part of the the Self-* project, which also has numerous co-evolving time sequences from a prototype data center with multiple 'intelligent' storage units: cpu utilizations, network traffic measurements, room temperature sensor measurements, humidity measurements, etc. The goal is to find patterns, correlations, lag-correlations, anomalies, to help the data center self-organize, self-detect upcoming (or existing) failures and attacks, to self-optimize its performance.
Data: See the 'Disk Access Traces' below. Also, the web site of the Self-* project, with a lot of measurement data, that we already have. We also have traces from an MS SQL Server.
Introductory paper(s): the 'PQRS' model [Wang et al, PEVA 2001]; see also the use of CART [ Wang et al SIGMETRICS 2004] and the follow-up work [Mesnier+, '05]. Check the SPIRIT project, or the live Intemon system, and the corresponding publications (VLDB06, OSR06) on Jimeng's page under 'InteMon'. Also the lag-correlation paper [ Sakurai+ SIGMOD'05].
Comments: The general case is hard, and in fact, is the topic of dissertations (Dr. Mengzhi Wang, Dr. Mike Mesnier, Dr. Jimeng Sun). However, there are a lot of initial ideas that you could try within a semester, and a lot of industrial interest in the topic.
Contact person: Lei Li

2. GRAPHS - LARGE GRAPH MINING

2.1. Large/parallel graph mining, possibly using 'hadoop'

Problem: Given a large graph with billions of edges and tens of billions of nodes, and several share-nothing machines, parallelize the typical graph mining algorithms, to be as fast as you can. We want to compute the in- and out-degree distributions, the diameter of the graph, the first several eigenvalues, the 'network value' of each node, the 'clustering coefficient', the node- and edge-betweeness. The first step is to do timing of several possible architectures: with, or without a relational DBMS; with, or without replication of the data. Also, what is the best way to store the data (e.g., as <from,to> pairs in a flat file; as an adjacency list, hashed on the 'from' node-id, or as something else.)
Data: We shall start with synthetic data, using an existing generator [Leskovec+, PAKDD'05]. Then, DBLP, IMDB etc. We could also get data on real CMU IP traffic (will need NDA). Finally, we also have a who-talks-to-whom social network with 270 million nodes and 8 billion edges (60Gb of data)
Introductory paper(s): The generator above; the Gamma database machine papers [Dewitt+, IEEE TKDE'90]; papers on hash-joins [Kitsuregawa+, vldb'90] the RMAT paper [Chakrabarti+ SIAM-DM'04], the connection sub-graph paper [ Faloutsos+, KDD'04]. If you plan to use 'hadoop', get the map-reduce paper [Dean + Ghemawat, OSDI'04].
Comments: Very high practical interest, with hard problems from both the algorithmic as well as the system side. There is a lot of room, even for 4 or more people.
Contact person: Jure Leskovec

2.2. [**] Large Graph Visualization

Problem: Given a huge graph (say, millions of nodes), help visualize it. Start with the ICML03 paper below; then, try to extend it for huge graphs: try some partitioning/grouping method, and/or some fish-eye ideas.
Data: epinions.com; DBLP citation information, etc
Introductory paper(s): Check the recent paper from our group and USP colleagues on GMine. Also [Takeshi Yamada, Kazumi Saito, Naonori Ueda: Cross-Entropy Directed Embedding of Network Data. ICML'03]. Also, the visualization tools from CAIDA, and specifically 'walrus'. Check the graphdrawing organization and the corresponding sequence of conferences on Graph Drawing (GD) from there. Our goal here is different, though: we want to large graphs, that don't fit in memory, nor on the screen, nor in the human mind, unless we summarize them somehow.
Comments: Open-ended problem, but very useful - could lead to publication. The first step is to implement the method by [Yamada et al].
Contact person: Polo Chau (also, Jure Leskovec and Mary McGlohon)

2.3. [*] Best -Effort Pattern Match

Problem: Given a large attributed graph, how can we find subgraphs that match a user pattern query. For example, “find a CEO who has strong interactions with a Manager, a Lawyer, and an Accountant, or another structure as close to that as possible”. Similarly, a ‘loop’ query could help us spot, eg., a money laundering ring.
Data: DBLP database; IMDB datasets.
Introductory paper(s): the G-Ray, [Tong et al KDD 2007], the BANKS system [ Bhalotia et al, ICDE'02].
Comments: There are several possible extensions of G-Ray: labelled edges, weighted edges, time-stamps. Might lead to publications.
Contact person: Hanghang Tong.

2.4. [*] Relational databases as graphs, 'fuzzy queries' and Center-Piece Subgraphs (CePS)

Problem: Relatively recent work argues that Relational databases can be treated as a graph, and queries do not need to use SQL at all. More specifically, consider the DBLP graph, and queries of the form 'who is the most central person with respect to "recovery"', or 'what is the most influential paper by person X'. Colleagues at Yahoo would like to find who is the person that asks the best questions; who is the one that gives the best answers, etc. Other collaborators want to answer fuzzy queries, like 'find authors who are related to KDD conference, and to the University of XYZ' (that means, they have a lot of papers in KDD, and a lot of colleagues in the XYZ university). Similarly, given Q nodes in a social network (say, authorship network), how can we find the node/author that is the centerpiece, and has direct or indirect connections to all, or most of them? For example, this node could be the common advisor, or someone who started the research area that the Q nodes belong to.
Data: DBLP database; maybe data from Yahoo (but will be under NDA, at best).
Introductory paper(s): start from the CePS paper [Tong et al KDD 2006]; also see the BANKS system [ Bhalotia et al, ICDE'02]; see the work at UCSD [Balmin et al, VLDB'04] and also [Geerts et al, VLDB'04]. Also see the 'electricity' based algorithms of [ Palmer+ PAKDD'03]
Comments: There are several possible extensions of CePS, like negation ('find people related to X and Y, but not to Z'). Might lead to publications. You may also choose to do an implementation-mainly project, where you could set up a web site with the CePS algorithm above, so that people can discover their scientific environment, their connections to others etc. See the off-line browser, the visualization tool at DBLP, to get some initial ideas.
Contact person: Hanghang Tong.

2.5. [*] Cross-Association/Co-clustering

Problem: Given a contingency table (i.e. large sparse bipartite graph w/ positive edge weights), how can we simultaneously group the row/column into some clustering? How do we do that in a parameter-free way? Recently, researchers try to solve this problem from information theoretic point of view, by exploring the duality between data mining and compression. One challenge is how to generalize it to heterogeneous settings, for example, when we have 3 types of nodes (in DBLP: authors, papers, keywords), and we want to find groups and clusters.
Data: DBLP database; the ENRON dataset.
Introductory paper(s): the co-clustering paper, [Dhillon et al KDD 2003], the cross-association paper [Chakrabarti et al KDD 2004], the GraphScope paper [Sun et al KDD 2007], the paper by Bo Long [Long et al ICML 2006].
Comments: There are several possible extensions of the current methods. Might lead to publications. There is also some practical issues, for example, efficient implementation of the current methods, in a conventional machine, or on a 'hadoop' cluster.
Contact person: Hanghang Tong.

3. GRAPHS - INFLUENCE PROPAGATION, GENERATORS, MODELS

3.1. [*] Propagation of Influence/Information in Networks and weblogs ('blogs')

Problem: We want to find patters of propagation of information (or viruses, influence, etc.) in a network. We can start by limiting ouselves to trees. For example, in a web-log influence tree, what is the most typical form of influence: a 'star' topology? a 'string' topology? something in-between? How to generate such realistic patterns, from first principles? Also, we want to model the temporal aspects: how often do bloggers post messages? are the posts uniformly distributed over time? (probably not, probably bursty). How can we spot abnormal/surprising patterns?
Data: social networks, citation networks, weblog influence data.
Introductory paper(s): [McGlohon+07], [Leskovec et al, PAKDD 2006] and (internal to CMU) [ Leskovec+ SDM07 - full version],
Contact persons: Mary McGlohon (also, Jure Leskovec)

3.2. Generation of Realistic Labeled Graphs

Problem: The goal of the project is to extend current graph generator models to also handle labeled graphs. The intuition is that nodes in densely linked parts of the graph are more likely to have same label. So besides generating a realistic topology we also want to assign labels to nodes in a realistic and principled way. A related, but completely different problem is to consider edge labels, too. For example, people/nodes in a social network could have financial links, friendship links, parent-child links; the question is whether one type of link increases or decreases the chances for another type of link. The initial idea is to treat the edge labels as one more dimension, and represent such a graph with a 'tensor'. Then, you could try tensor analysis tools (PARAFAC, multilinear decompositions).
Data: labeled social networks for evaluation.
Introductory paper(s): [Chakrabarti+ SIAM-DM'04], [Leskovec et al, PKDD 2005], and the tensor decomposition paper [Sun+, KDD06]
Comments: Mainly theoretical (tensors etc). A lot of recent research interest on the topic.
Contact persons: Hanghang Tong, Jure Leskovec

4. MULTIMEDIA - BIOLOGICAL IMAGES

4.1. [**] Feature Extraction for analyzing Drosophila Embryo Images

Problem: Given a set of annotated 2D images of Drosophila embryos (352*160 gray scale), how can we extract good numerical features that capture the characteristics of each image? Is there a proper distance function that combine local features and global features to determine the "closeness" between two images? Good feature extraction algorithms would help to improve the performance of a number of mining tasks such as automatic captioning and multi-modal querying
Data: Check the BDGP site here ; we also have preprocessed data available in which low quality images were removed and fly embryos were already scaled and aligned.
Introductory paper(s): FEMine (Our previous work published in KDD'06, details on data preprocessing; baseline for feature extraction algorithm design); Zhou and Peng, 2007 , and Peng et al, 2007 (Recent work on automatic fly embryo image analysis); also Tomancak et al, 2002 (Papers by the BDGP group; Good references if you want to know more about the dataset)
Comments: The dataset contains more than 10k images, you may start from a smaller subset, and write your own algorithm to do further data cleaning.
Contact Persons: Fan Guo and Lei Li

4.2. Fast implementations of RWR (for gCap)

Problem: In the 'gCap' paper (see below), and in the Drosophila Embryo project above, we need to compute the steady-state probability of each node in a random walk with restarts, when the restarting node is unknown. Thus, naively, we need n² steady-state probabilities. How can we do better than that?
Data: DBLP, Corel image data, and more.
Introductory paper(s): Probably the fastest algorithm so far is by Hanghang Tong, in ICDM'06. The goal is to make it even faster, and/or to implement it in C/C++, or 'hadoop', so that it can run on huge graphs (Gb-size). Related papers: gCap [Pan et al, KDD'04]; topic sensitive PageRank [ Haveliwala WWW'02] ; fast algorithms for topic sensitive PageRank [Hawelivala + '03]; graph partitioning [Sun+, '05].
Comments: Mainly, implementation - but it has room for innovation. Closely related to the dissertation of Hanghang Tong, who will help along. Also, it could be used immediately by the Drosophila Embryo project above.
Contact person: Hanghang Tong.

5. MISCELLANEOUS

5.1. Fraud detection in on-line auctions - hijacked accounts

Problem: The goal of this project is to detect hijacked accounts for online auctions. It is quite common these days to hear news that perpetrators hijack online auction users' accounts, often through fake emails pretending to be sent from the official auction websites (such as eBay). The hijackers do this to steal whatever good reputation that those users have already establish and use that as a ``proof of trustworthiness'' for the next fraudulent sales that they are going to carry out, which often involve non-delivery of auction items. This will explore methods to detect such hijacking, specifically through anomaly detection, detecting the changes in user account behaviors (such as sales frequency, time periods of the day, etc).
Data: crawled eBay data (crawler will be available), information of various confirmed hijacked accounts, as ground truth, will need to be gathered.
Comments: This is an important real problem to solve, but an open-ended one. Could lead to publication.
Contact person: Polo Chau

5.2. Auction fraud - detecting networks of 1-cent auctions

Problem: The goal of this project is to discover the (fraudulent) relationship between groups of people who create ``fake'' reputation. In online auctions, such as eBay, there is the phenomenon that some ``feedback groups'' try to artificially boost each others' apparent reputation (or trustworthiness) by buying or selling very cheap items, usually 1-cent items. From reports of real auction users and even some new articles, it seems that a lot of those 1-cent groups are actually nicely managed and probably even have automated systems that back the item posting and transaction processes. This project will try and find out what kind of networks exist among those groups, who are involved, and also whatever patterns of activities that might exist in their activities, (and any other interesting issues.)
Data: crawled eBay data (crawler will be available)
Comments: This is an important real problem to solve, but an open-ended one. Could lead to publication.
Contact person: Polo Chau

DATASETS

Unless explicitly mentioned, the datasets are either 'public' or 'owned' by the instructor; for the rest, we need to discuss about 'Non-disclosure agreements' (NDAs).

Time sequences

Time series repository at UCR.
KURSK dataset of multipe time sequences: time series from seismological sensors by the explosion site of the 'Kursk' submarine.
Track traffic data, from our Civil Engineering Department. Number of trucks, weight etc per day per highway-lane. Find patterns, outliers; do data cleansing.
River-level / hydrology data: multiple, correlated time series. Do data cleansing; find correlations between these series. Excellent project for people that like canoeing!
Sunspots: number of sunspots per unit time. Some data are here. Sunspots seem to have an 11-year periodicity, with high spikes.
Time sequences from the Sante-Fe Institute forecasting competition (financial data, laser-beam oscillation data, patients' apnea data etc)
Disk access traces, from HP Labs (we have local copies at CMU). For each disk access, we have the timestamp, the block-id, and the type ('read'/'write'). Here is a snippet of the data, aggregated per 30'.

Spatial data

Astrophysics data - thousands of galaxies, with coordinates, red-shift, spectra, photographs. Small snippet of the data. More data are in the 'skyserver' web site, where you can ask SQL queries and get data in html or csv format
Road segments: several datasets with line segments (roads of U.S. counties, Montgomery MD, Long Beach CA, x-y coordinates of stars in the sky from NASA, etc). Snippet of data (roads from California, from TIGER).

Images/video

Biological data: images of proteins, with ~50 attributes each.
- 'Owner': Prof. Bob Murphy.
Video/image/sound data, from Informedia. 2Tb of video, segmented; 1M images with features; 10^4 faces. Extract features; design good similarity functions; do the named-entity analysis.

Graph-like data

Web-log and click-stream data (NDA: needed).
Snapshots of 2 anonymized (and anonymous) social networks (NDA)
Visit patterns for a large web site: for 300 pages, and thousands of users, we record how many times a user visited a specific site. Find patterns, clusters, fractal dimensions, regularities in the SVD etc.
Netflix competition dataset (users, movies and ratings) - needs easy registration - we have a cleaned-up version available, locally.
Enron email dataset (400 MB compressed)
Movie-actor data from imdb.com (we have a cleaned-up snapshot of it)
DBLP author-paper-conference data from the DBLP site of Mike Ley (records in XML, and their DTD). For 'ego-surfing', try this java app or the java applet at U. Alberta.
Graph datasets at U.Mass (Amherst), by Prof. Dave Jensen.

Miscellaneous:

Several collections of training data from the UC-Irvine repository (check the larger ones) and from KDD-nuggets for machine learning algorithms.
Demographic data from the U.S. Bureau of Census

SOFTWARE

Notes for the software: Before you modify any code, please contact the instructor - ideally, we would like to use these packages as black boxes.

Readily available:
- ACCESS METHODS
- - DR-tree : R-tree code; searches for range and nearest-neighbor queries. In C.
  - kd-tree code
  - OMNI trees - a faster version of metric trees.
  - B-tree code, for text (should be easily changed to handle numbers, too). In C.
- SVD AND TENSORS:
- - Code for SVD in `mathematica'.
  - Code for SPIRIT (incremental SVD on streams)
- FRACTALS
- - Code for computing the fractal dimension in Perl and C, by Leejay Wu
  - Barnsley's algorithm for Iterated Function Systems in `C'.
- GRAPHS
- - the NetMine network topology analysis package
  - GMine: interactive graph visualization package and graph manipulation library (by Junio (Jose Fernandez Rodrigues Junior) and Jure Leskovec)
  - the ' crossAssociation' package for graph partitioning.
Outside CMU:
- GiST package from Hellerstein at UC Berkeley: A general spatial access method, which is easy to customize. It is already customized to yield R-trees.

BIBLIOGRAPHICAL RESOURCES:

Last modified Jan. 20, 2008, by Christos Faloutsos.

Carnegie Mellon University 15-826 Multimedia Databases and Data Mining Spring 2008 - C. Faloutsos