Syllabus
DESCRIPTION
The course covers advanced algorithms for learning, analysis, data
management and visualization of large datasets. Topics include
indexing for text and DNA databases, searching medical and
multimedia databases by content, fundamental signal processing
methods, compression, fractals in databases, data mining, privacy
and security issues, rule discovery, data visualization, graph
mining, and stream mining.
TOPICS TO BE COVERED
- Database topics:
    - Traditional databases: Advanced hashing and multi-key
      access methods, for main-memory and for disk-based data.
    - Text databases: indexing text and DNA strings,
      clustering, information filtering, LSI (singular value
      decomposition).
    - Multimedia databases: Searching by content in
      signals: time sequences, photographs and medical
      images, video clips, feature extraction, continuous
      media storage and delivery.
- Tools:
    - Fundamental signal processing methods: Discrete Fourier
      Transform, wavelets, JPEG and MPEG compression.
    - Singular Value Decomposition: revisited.
    - Fractals in databases: Self-similarity/non-uniformity of
      real datasets, fractal dimensions, selectivity using
      fractals and multifractals, fractal image compression,
      self-similarity in web-traffic patterns.
- Data Mining:
    - Graph mining: "Laws" in large graphs (power laws; "small
      world" phenomena); graph generators; social networks.
    - Sensor and time series mining: linear and non-linear
      forecasting.
    - Review of statistical methods.
    - Review of AI methods.
    - Database methods - Massive datasets: Association rules;
      Frequent sets; Single-pass learning algorithms;
      Information compression and reconstruction; Sampling;
      Condensed data representations; Datacubes; Cube-trees;
      Function finding.
- Security and Privacy Protection.
- Visualization of large data sets
- More tools: approximate counting algorithms; Independent
Component Analysis.
- OVERVIEW OF RECENT TOPICS: trust and influence propagation;
Future directions.
PREREQUISITES: Introductory database course 15-415/615
or 15-445/645 (familiarity with B-trees and Hashing), or permission
of the instructor.
UNIVERSITY UNITS: 12
CORE UNITS: 1
TEXT
Copies of the instructor's transparencies and notes, as well as copies
of selected articles, will be made available. The required texts are:
Recommended, but not required, texts:
- William H. Press, Saul A. Teukolsky, William T. Vetterling and
  Brian P. Flannery, Numerical Recipes in C, 2nd edition,
  Cambridge University Press, 1992.
- Raghu Ramakrishnan and Johannes Gehrke, Database Management
  Systems, 3rd edition, McGraw-Hill, 2002.
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and
  Techniques, 3rd edition, 2011.
METHOD OF EVALUATION
The course involves:
- A midterm (20%)
- Homeworks (10%; hw1: 1%; hw2, hw3, hw4: 3% each)
- A project (40%)
- A final exam (30%)
Clarifications:
- A detailed handout about the project will be distributed at
the beginning of the course, along with a list of suggested
projects. The goal of the project is to give the participants
the opportunity to tackle a large, interesting problem, which
may lead to a publication and/or a large software system.
Last updated: Sept. 10, 2024, by Christos Faloutsos