Large graph mining: patterns, tools and case studies

Tutorial proposal for ICDE 2009, Shanghai, China

Christos Faloutsos and Hanghang Tong,
Carnegie Mellon University

Abstract

How do graphs look like? How do they evolve over time? How can we find patterns, anomalies and regularities in them? How to find influential nodes in the network? We will present both theoretical results and algorithms as well as case studies on several real applications. Our emphasis is on the intuition behind each method, and on guidelines for the practitioner.

The tutorial has the following parts: (a) Statistical properties and models and graph generators of static and evolving networks. (b) Tools for the analysis of static and dynamic graphs, like the Singular Value Decomposition, tensor decomposition for community detection, HITS/PageRank etc. (c) Proximity measurements on graphs, the main ideas to quantify the closeness of two nodes of the graph, fast algorithms to compute the proximity scores, applications of proximity, like CenterPiece subgraphs, pattern match, trend analysis etc. (d) Case studies of how a virus or information or influence spreads through the network, how to find influential bloggers or nodes to target for viral marketing, how to find fraudsters on eBay, how to find communities on graphs.

Keywords: Graph mining, linear algebra, SVD, tensors, pageRank

Foils | Part I | Part II | Part III | Part IV |

Outline - Description of topics

The goal of the tutorial is to cover the most powerful tools for the analysis of large, real graphs. The tutorial starts old and new patterns that most real graphs obey (small diameter, power laws etc). It continues with powerful, traditional tools from linear algebra (singular value decomposition SVD, eigenvalue analysis); it shows that they form the basis for the extremely successful PageRank and HITS algorithms; and it concludes with more advanced tools, namely, sparse low rank approximations ('CUR' and derivatives).
The next part focuses on proximity of two nodes on a graph, and how to assess it. We describe several measures (electric current, maximum flow, escape probability), we compare them and we focus on the most successful ones and on fast algorithms to compute them.
The tutorial concludes with several case studies: influence propagation, fraud detection on e-bay, a survey of algorithms for community detection and graph partitioning, and a description of the map/reduce method for the analysis of Tera- and Peta-byte scale graphs.

The proposed format is 3 hours.

· Part I: Patterns [0.5h - Faloutsos]

patterns on static graphs
patterns on time-evolving graphs
patterns on weighted graphs

Part II: Power Tools [1h - Faloutsos]

SVD/PCA
HITS, PageRank
example-based decomposition: CUR, CMD and Colibri
tensors - concepts and applications

Part III: Proximity [1h - Tong]

main ideas, properties & generalizations (random walks, escape probabilities, etc)
fast algorithms
applications: (center-piece subgraphs, pattern match etc)

Part IV: Case Studies [0.5 hour - Faloutsos]

virus/influence propagation
blog analysis
eBay fraud detection
community detection and tracking (co-clustering, cross-associations, graphScope etc)
tensor for web mining: TOPHITS
scalability: map/reduce and graph mining

Conclusions

Target Audience

The target audience is data management, data mining and machine learning researchers and professionals who work on static or time-evolving graphs and want to know about tools and models when dealing with large network datasets.
Prerequisites: Computer science background (B.Sc. or equivalent); familiarity with undergraduate linear algebra (eigenvectors). The tutorial will focus on intuition and examples, carefully introducing only the minimal necessary mathematical tools, and always focusing on practical applications.

About the instructors

Christos Faloutsos is a Professor at Carnegie Mellon University. He has received the Presidential Young Investigator Award by the National Science Foundation (1989), the Research
Contributions Award in ICDM 2006, twelve ``best paper'' awards, and several teaching awards. He has served as a member of the executive committee of SIGKDD; he has published over 160 refereed articles, 11 book chapters and one monograph. He holds five patents and he has
given over 20 tutorials and 10 invited distinguished lectures. His research interests include data mining for streams and graphs, fractals, database performance, and indexing for multimedia and
bio-informatics data. (Full CV at www.cs.cmu.edu/~christos/webvitae.pdf )

Hanghang Tong is a senior Ph.D. student in the Machine Learning Department at Carnegie Mellon University. He has received best paper awards from SIAM-DM 2008 and ICDM 2006, and he has 25 refereed publications. He holds an M.S. degree and a B.S. degree from Tsinghua University, P.R. China. His research interests include data mining for multimedia and for graphs. (Full CV at www.cs.cmu.edu/~htong/pdf/cv_Tong.pdf )