Data streams are an important model for many applications: network traffic analysis, sensor network monitoring, moving object tracking, and financial data analysis. There are two main challenges: 1) speed: the data arrive in real time (e.g., stock index quotes, sensor measurements, network traffic) and must be processed as they come in; 2) space: the data are usually unbounded, which poses the further challenge of storing and processing them efficiently. Many data stream applications are monitoring applications, in which a huge volume of real-time data streaming into the system needs to be monitored for trend analysis and anomaly detection. It is crucial to detect patterns and correlations that may exist in co-evolving data streams. Streams are often inherently correlated (e.g., temperatures in the same building, host traffic in the same network, prices in the same market), and it is often possible to reduce hundreds of numerical streams to just a handful of hidden variables that compactly describe the key trends and dramatically reduce the complexity of further data processing.
The figure on the left illustrates the idea: the top panel plots the original data streams (potentially a large number of them); the middle panel shows the hidden variables computed from the original streams; the bottom panel plots the reconstruction of the original streams from the hidden variables. It is hard to monitor a large number of data streams (top panel), but it is feasible to watch a few hidden variables (middle panel). Furthermore, how well the hidden variables capture the original data streams is reflected in the reconstruction: the closer the reconstruction is to the original, the better the hidden variables capture the streams. This is a very important observation, since it gives a hint on 1) how to determine the number of hidden variables and 2) how to detect abnormal behavior.
The ultimate goal is to have online algorithms that do the following for a large number of streams:
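The reconstruction idea can be sketched offline on synthetic data. The example below is only an illustration, not the project's MATLAB code: it uses a batch SVD rather than an online algorithm, and the function names (`hidden_variables`, `relative_error`), the synthetic streams, and the injected spike are all assumptions made for the example.

```python
import numpy as np

def hidden_variables(X, k):
    """Summarize streams X (T time ticks x n streams) with k hidden variables."""
    # Rows of Vt are orthonormal directions in stream space, ordered by energy.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt[:k].T        # n x k weights of each stream in each hidden variable
    Y = X @ W           # T x k hidden variables
    X_hat = Y @ W.T     # T x n reconstruction from the hidden variables
    return Y, X_hat

def relative_error(X, X_hat):
    """Fraction of the total energy of X that the reconstruction misses."""
    return np.sum((X - X_hat) ** 2) / np.sum(X ** 2)

# Synthetic example (assumed data): 20 streams that follow one shared trend.
rng = np.random.default_rng(0)
t = np.arange(500)
trend = np.sin(2 * np.pi * t / 50)
X = np.outer(trend, rng.uniform(0.5, 2.0, size=20))
X += 0.01 * rng.standard_normal(X.shape)
X[300, 7] += 3.0        # inject an abnormal reading into one stream

Y, X_hat = hidden_variables(X, k=1)
err = relative_error(X, X_hat)

# Per-tick reconstruction error: a tick that the hidden variable
# cannot explain stands out as a candidate anomaly.
tick_error = np.sum((X - X_hat) ** 2, axis=1)
suspect_tick = int(np.argmax(tick_error))
```

A small overall `err` suggests that one hidden variable is enough for these streams, while a persistently large error would suggest adding another; the tick with the largest `tick_error` is where the injected spike sits.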
automatically detect the patterns
FUNDING ACKNOWLEDGEMENTS:
This material is based upon work supported by the National Science Foundation
under Grants No. IIS-0209107, IIS-0205224, INT-0318547, SENSOR-0329549,
IIS-0326322. This work is also supported in part by the Pennsylvania
Infrastructure Technology Alliance (PITA), a partnership of Carnegie Mellon,
Lehigh University and the Commonwealth of Pennsylvania's Department of Community
and Economic Development (DCED). Additional funding was provided by donations
from Intel, NTT and Hewlett-Packard. Any opinions, findings, and conclusions or
recommendations expressed in this material are those of the author(s) and do not
necessarily reflect the views of the National Science Foundation, or other
funding parties.
Given n numerical data streams, all of whose values we observe at each time tick t, SPIRIT can incrementally find n-dimensional correlations and hidden variables, which summarize the key trends in the entire stream collection. Related publications: [2, 3]
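To give a feel for what "incrementally" can mean here, the sketch below tracks hidden variables one time tick at a time, without storing past data. This page does not spell out SPIRIT's update rule, so the code uses a generic projection-approximation style subspace tracking step with exponential forgetting as a stand-in; the class name `IncrementalTracker`, the forgetting factor, and the synthetic streams are all assumptions for illustration.

```python
import numpy as np

class IncrementalTracker:
    """Hypothetical online tracker of k hidden variables over n streams."""

    def __init__(self, n, k, forgetting=0.96):
        self.W = np.eye(n)[:, :k].copy()  # n x k directions, start at unit vectors
        self.d = np.full(k, 1e-3)         # small positive energy per direction
        self.lam = forgetting             # weight given to past ticks

    def update(self, x):
        """Consume one tick x (length n); return the k hidden variables."""
        x = np.asarray(x, dtype=float).copy()
        k = self.W.shape[1]
        y = np.zeros(k)
        for i in range(k):
            w = self.W[:, i]              # view: in-place edits update self.W
            y[i] = w @ x                  # project onto the current direction
            self.d[i] = self.lam * self.d[i] + y[i] ** 2
            e = x - y[i] * w              # part of x this direction misses
            w += (y[i] / self.d[i]) * e   # nudge the direction toward the data
            x -= y[i] * w                 # remove the explained part before the next direction
        return y

# Synthetic example (assumed data): 20 streams that follow one shared trend.
rng = np.random.default_rng(0)
g = rng.uniform(0.5, 2.0, size=20)        # how strongly each stream follows the trend
tracker = IncrementalTracker(n=20, k=1)
for t in range(500):
    x_t = np.sin(2 * np.pi * t / 50) * g + 0.01 * rng.standard_normal(20)
    y = tracker.update(x_t)

# Cosine between the tracked direction and the true mixing weights.
w0 = tracker.W[:, 0]
alignment = abs(w0 @ g) / (np.linalg.norm(w0) * np.linalg.norm(g))
```

Each update costs O(nk) time and memory, independent of how many ticks have been seen, which is the property that matters for unbounded streams. In this synthetic setup `alignment` should approach 1 as the tracked direction settles on the shared trend.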
Related publication [1]
Chlorine data: generated using EPANET (dataset courtesy of Prof. Jeanne M. VanBriesen)
Sensor motes: sensor measurements collected from wireless sensors (dataset courtesy of Prof. Carlos Guestrin, CMU). If you use it, please also reference the paper:
Amol Deshpande, Carlos Guestrin, Samuel Madden, Joseph M. Hellerstein, Wei Hong: Model-Driven Data Acquisition in Sensor Networks. VLDB 2004: 588-599.
Software: standalone MATLAB source code [download]
Simple Web Demo (requires a Sun JVM to be installed):
IntMon: intelligent monitoring system for large clusters [currently under construction, please come back later]