Syllabus
DESCRIPTION
The course covers advanced algorithms for learning, analysis, data
management and visualization of large datasets. Topics include
indexing for text and DNA databases, searching medical and
multimedia databases by content, fundamental signal processing
methods, compression, fractals in databases, data mining, privacy
and security issues, rule discovery, data visualization, graph
mining, and stream mining.
TOPICS TO BE COVERED
- Database topics:
- Traditional databases: Advanced hashing and multi-key access
methods, for main-memory and for disk-based data.
- Text databases: indexing text and DNA strings,
clustering, information filtering, LSI (singular value
decomposition).
- Multimedia databases: Searching by content in signals: Time
sequences, photographs and medical images, video clips, feature
extraction, continuous media storage and delivery.
- Tools:
- Fundamental signal processing methods: Discrete Fourier
Transform, wavelets, JPEG and MPEG compression.
- Singular Value Decomposition: revisited
- Fractals in databases: Self-similarity/non-uniformity of real
datasets, fractal dimensions, selectivity using fractals and
multifractals, fractal image compression, self-similarity in
web-traffic patterns.
- Data Mining:
- Graph mining: "Laws" in large graphs (power laws; "small
world" phenomena); graph generators; social networks.
- Sensor and time series mining: linear and non-linear forecasting
- Review of statistical methods
- Review of AI methods
- Database methods - Massive datasets: Association rules;
Frequent sets; Single-pass learning algorithms; Information
compression and reconstruction; Sampling; Condensed data
representations; Datacubes; Cube-trees; Function finding.
- Security and Privacy Protection.
- Visualization of large data sets
- More tools: approximate counting algorithms; Independent
Component Analysis.
- OVERVIEW OF RECENT TOPICS: trust and influence propagation;
Future directions.
PREREQUISITES: Introductory database course 15-415/615
or 15-445/645 (familiarity with B-trees and Hashing), or permission
of the instructor.
UNIVERSITY UNITS: 12
CORE UNITS: 1
TEXT
Copies of the instructor's transparencies and notes, as well as copies
of selected articles, will be made available. The required texts
are:
Recommended, but not required texts:
- William H. Press, Saul A. Teukolsky, William T. Vetterling and
Brian P. Flannery, Numerical Recipes in C, Cambridge University
Press, 1992, 2nd Edition.
- Raghu Ramakrishnan, Johannes Gehrke, "Database Management
Systems," McGraw-Hill 2002 (3rd ed).
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and
Techniques, 3rd edition, 2011.
METHOD OF EVALUATION
The course involves:
- A midterm (20%)
- Homeworks (10%) (hw1: 1%, hw2,3,4: 3% each)
- A Project (40%)
- A Final exam (30%)
Clarifications:
- Projects will be carried out in teams of 2. A detailed handout
about the project will be distributed at the beginning of the
course, along with a list of suggested projects. The goal of the
project is to give the participants the opportunity to tackle a
large, interesting problem, which may lead to a publication and/or
a large-scale software system.
Last updated: Sept. 2, 2019, by Christos Faloutsos