Data Mining
the Internet: What we know,
what we don't and how we can learn more
CONTENT
What do we know about the Internet?
How can we learn more about the Internet?
In this tutorial, we address these two questions: what and how. First,
we present the state of the art of WHAT we know about modeling and simulating
the Internet. Second, we present cutting edge techniques of HOW to
further our understanding of the network. The motivation is that
despite the significant research efforts, we know very little about the
Internet. Furthermore, most network researchers are unaware of the wealth
of analysis tools from the areas of data mining and statistics. Data analysis
based on averages, standard deviation and Poisson processes has exhausted
its capabilities. We present two scenarios that describe the two
main thrusts of this tutorial.
-
Scenario one (WHAT): You want to simulate your new protocol. What topology
should you use? What is the distribution of sources and destinations? What
is the traffic intensity of each connection? What kind of background traffic
should you use?
-
Scenario two (HOW): You have just obtained a large set of measured round-trip
delays among several node pairs over a few hours. How can you characterize
them? How do you compare the delays between different end-points? How do
you cluster "similar" round-trip behavior? How can you identify abnormal
behavior such as a Distributed Denial of Service (DDoS) attack?
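As a rough illustration of the last question in scenario two, the following
Python sketch flags abnormal round-trip samples using the median and the
median absolute deviation (MAD) rather than the mean and standard deviation,
since a single large spike distorts the latter. The function name, the sample
values, and the 3.5 threshold are assumptions chosen for illustration; they
are not taken from the tutorial material.

```python
import statistics


def robust_outliers(rtts_ms, threshold=3.5):
    """Return indices of samples whose modified z-score exceeds `threshold`.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation from the median.
    """
    med = statistics.median(rtts_ms)
    deviations = [abs(x - med) for x in rtts_ms]
    mad = statistics.median(deviations)
    if mad == 0:
        # All samples (or at least half) are identical; no scale to compare to.
        return []
    return [i for i, d in enumerate(deviations)
            if 0.6745 * d / mad > threshold]


# Hypothetical round-trip times in milliseconds for one node pair.
sample_rtts = [50, 52, 49, 51, 50, 48, 300, 51, 50, 49]
print(robust_outliers(sample_rtts))  # → [6]
```

A real trace would need more than this: delays vary with time of day, so a
per-window median is usually a better baseline than one global median.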
In a nutshell, the main goal of this tutorial is to present what we know
about modeling the Internet, and how we can learn more. The tutorial aims
to bridge the gap between networking research and data mining research.
FOILS
partB,
in PDF
INTENDED AUDIENCE
This tutorial is targeted at
network researchers who want to
-
conduct realistic simulations
-
analyze real data by identifying patterns and abnormal behavior
-
get up to date quickly with the latest data mining tools
PREREQUISITES
None. The tutorial is self-contained so that it is accessible
to industry people and graduate students, while also containing
useful material for seasoned network researchers.
BENEFITS TO PARTICIPANTS
The participants will learn:
-
the state of the art in network modeling
(topology and power-law networks; traffic,
self-similarity, and long-range dependence; end-to-end behavior)
-
how to conduct realistic simulations
-
how to analyze real data using advanced data mining tools,
such as (a) Singular Value Decomposition (SVD),
(b) Autoregressive and ARIMA forecasting,
(c) fractals, multifractals, and self-similarity.
The tutorial mainly emphasizes
the intuition behind these tools, while giving enough mathematical detail
and citations to enable further, deeper study.
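To give a flavor of the autoregressive forecasting mentioned above, here is
a minimal Python sketch of a first-order autoregressive model, AR(1), fitted
by ordinary least squares. It assumes the model x[t] = c + phi * x[t-1]; the
function names and the delay values are hypothetical and serve only as an
illustration, not as the method presented in the tutorial.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    if len(series) < 3:
        raise ValueError("need at least three samples")
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    # Sum of squares of the lagged values around their mean.
    sxx = sum((p - mean_prev) ** 2 for p in prev)
    if sxx == 0:
        raise ValueError("series is constant; phi is not identifiable")
    # Cross products between lagged and current values.
    sxy = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    phi = sxy / sxx
    c = mean_curr - phi * mean_prev
    return c, phi


def forecast_next(series, c, phi):
    """One-step-ahead forecast from the last observed value."""
    return c + phi * series[-1]


# Hypothetical delay measurements (ms) that follow x[t] = 5 + 0.5 * x[t-1].
delays_ms = [2.0, 6.0, 8.0, 9.0, 9.5, 9.75]
c, phi = fit_ar1(delays_ms)
print(c, phi, forecast_next(delays_ms, c, phi))
```

ARIMA models extend this idea with differencing and moving-average terms;
measured traffic is noisy, so the fit on real data will not be exact as it
is for this constructed series.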
INSTRUCTORS' BIOGRAPHICAL NOTES
The instructors have collaborated for four years and have written multiple joint
papers.
MICHALIS FALOUTSOS received the B.Sc. degree in Electrical Engineering
(1993) from the National Technical University of Athens, Greece and the
M.Sc. and Ph.D. degrees in Computer Science from the University of Toronto,
Canada (1999). He is currently an assistant professor at the University
of California, Riverside. He has received the CAREER award from NSF (2000)
and two major DARPA grants. He co-authored, with Christos and
Petros Faloutsos, the highly cited paper "On Power-Law Relationships of the Internet Topology"
(SIGCOMM'99), which renewed the interest of the community in modeling the
Internet topology. His interests include Internet measurements, multicast
protocols, real-time communications, and wireless networks.
CHRISTOS FALOUTSOS received the B.Sc. degree in Electrical Engineering
(1981) from the National Technical University of Athens, Greece and the
M.Sc. and Ph.D. degrees in Computer Science from the University of Toronto,
Canada. He is currently a professor at Carnegie Mellon University. He has
received the Presidential Young Investigator Award from the National Science
Foundation (1989), multiple "best paper" awards (SIGMOD 94, VLDB 97, KDD 01
(runner-up), Performance Evaluation 02),
and four teaching awards. He has published over 100 refereed
articles and one monograph, and holds four patents. His research interests
include data mining, network analysis, and indexing in relational and multimedia
databases.