SPIRIT: Streaming Pattern Discovery

Overview

Data stream is an important model for many different applications: network traffic analysis, sensor network monitoring, moving object tracking, financial data analysis. The main challenge is that 1) speed challenge: the data are coming into the system for processing in real-time (e.g. stock index quotes, sensor measurements, network traffic); 2) space challenge: the data are usually unbounded which posts another challenge on how to efficiently store and process the data. A lot of applications on data streams are monitoring applications, where huge volume of real-time data streaming into the system need to be monitored for trend analysis and anomaly detection. It is crucial to detect patterns and correlations that may exist in co-evolving data streams. Streams often are inherently correlated (e.g., temperatures in the same building, host traffic in the same network, prices in the same market, etc.) and it is possible to reduce hundreds of numerical streams into just a handful of hidden variables that compactly describe the key trends and dramatically reduce the complexity of further data processing.

Figure on left illustrates the idea: the top panel plots the original data streams (potentially a large number of them); the middle panel shows the hidden variables computed from the original streams; the bottom panel plots the reconstruction of original stream based on the hidden variables. The idea is that it is hard to monitor a large number of data streams (in the top panel), but it is probably OK to look at the hidden variables (in the middle panel). Furthermore, how good the hidden variables capture the original data streams is reflected in the reconstruction. The closer the original and reconstruction are, the better the hidden variables capture the streams. This is a very important observation since it gives a hint on 1) how to determine the number of hidden variables and 2) how to detect abnormal behavior.

The ultimate goal is to have online algorithms that does the following for a large number of streams:

automatically detect the patterns
adapt to the changes in real-time
no magic parameter for users to set
work in distributed environments

FUNDING ACKNOWLEDGEMENTS:
This material is based upon work supported by the National Science Foundation under Grants No. IIS-0209107, IIS-0205224, INT-0318547, SENSOR-0329549, IIS-0326322, This work is also supported in part by the Pennsylvania
Infrastructure Technology Alliance (PITA),a partnership of Carnegie Mellon, Lehigh University and the Commonwealth of Pennsylvania's Department of Community and Economic Development (DCED). Additional funding was provided by donations from Intel, NTT and Hewlett-Packard. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, or other funding parties.

Technique

Centralized SPIRIT

Given n numerical data streams, all of whose values we observe at each time tick t, SPIRIT can incrementally n-dimensional correlations and hidden variables, which summarize the key trends in the entire stream collection. Related publication [2,3]

Distributed SPIRIT

Related publication [1]

People

CMU:
- Christos Faloutsos - professor from CSD
- Jimeng Sun - PhD student from CSD
- Evan Hoke - undergrad student from CSD
IBM Waston:
- Spiros Papadimitriou - researcher (CMU alumni)

Publication

Sun, J., Papadimitriou, S., Faloutsos, C. Distributed Pattern Discovery in Multiple Streams, To appear in the Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Singapore, Apr 9-12, 2006 [short version/full version][bib]
Papadimitriou, S., Sun, J., Faloutsos, C. Streaming Pattern Discovery in Multiple Time-Series, Proceedings of the Very Large Data Bases Conference (VLDB), Trondheim, Norway, 2005[pdf][bib]
- Abstract: In this paper, we introduce SPIRIT (Streaming Pattern dIscoveRy in multIple Timeseries). Given n numerical data streams, all of whose values we observe at each time tick t, SPIRIT can incrementally nd correlations and hidden variables, which summarise the key trends in the entire stream collection. It can do this quickly, with no buffering of stream values and without comparing pairs of streams. Moreover, it is any-time, single pass, and it dynamically detects changes. The discovered trends can also be used to immediately spot potential anomalies, to do efficient forecasting and, more generally, to dramatically simplify further data processing. Our experimental evaluation and case studies show that SPIRIT can incrementally capture correlations and discover trends, efficiently and effectively.
Sun, J., Papadimitriou, S., Faloutsos, C. Online latent variable detection in sensor networks, To appear in Proceedings of 21th IEEE International Conference on Data Engineering (ICDE), 2005 (demo).[pdf][bib]
- Abstract: Sensor networks attract increasing interest, for a broad range of applications. Given a sensor network, one key issue becomes how to utilize it efficiently and effectively. In particular, how can we detect the underlying correlations (latent variables) among many co-evolving sensor measurements? Can we do it incrementally? We present a system that can (1) collect the measurements from the real wireless sensors; (2) process them in real-time; and (3) determine the correlations (latent variables) among the sensor streams on the fly.
Evan Hoke, Jimeng Sun, John D. Strunk, Gregory R. Ganger, Christos Faloutsos. InteMon: Continuous Mining of Sensor Data in Large-scale Self-* Infrastructures. ACM SIGOPS Operating Systems Review, 40(3):38-44. ACM Press, July 2006 [pdf][bib]
- Abstract: Modern data centers have a large number of components that must be monitored, including servers, switches/routers, and environmental control systems. This paper describes InteMon, a prototype monitoring and mining system for data centers. It uses the SNMP protocol to monitor a new data center at Carnegie Mellon. It stores the monitoring data in a MySQL database, allowing visualization of the time-series data using a JSP web-based frontend interface for system administrators. What sets InteMon apart from other cluster monitoring systems is its ability to automatically analyze correlations in the monitoring data in real time and alert administrators of potential anomalies. It uses efficient, state of the art stream mining methods to report broken correlations among input streams. It also uses these methods to intelligently compress historical data and avoid the need for administrators to configure threshold-based monitoring bands.
Evan Hoke, Jimeng Sun, Christos Faloutsos. InteMon: Intelligent System Monitoring on Large Clusters, Proceedings of VLDB, 2006 (demo).

Download

Datasets (please gently acknowledge us in your resulting work)
- Chlorine data: generated using EPANET (dataset courtesy of Prof. Jeanne M. VanBriesen)
- Sensor motes: sensor measurements collecting from wireless sensors (dataset courtesy of Prof. Carlos Guestrin (CMU) -if you use it, please also reference the paper:
  Amol Deshpande, Carlos Guestrin, Samuel Madden, Joseph M. Hellerstein, Wei Hong: Model-Driven Data Acquisition in Sensor Networks. VLDB 2004: 588-599)
Software: it is a standard-alone matlab source code [download]

Demo

Simple Web Demo (it requires Sun JVM installed):

IntMon: intelligent monitoring system for large clusters [currently under construction, please come back later]