CS core unit / LTI core
Welcome to the IR course for CMU graduate-level students!
Sit-in: You need approval from the instructor. The current
classroom is overcrowded due to a large number of non-registered
participants, leaving too little space for the registered students and
the approved sit-in members. While we are still working with the
registrar on a possible classroom swap, at present we must restrict
participation to the registered list and the approved
sit-in list (check the participation
approval, and please inform the instructor if you are registered
but not included in the list). Further approvals are possible if we
find a larger classroom, or if space becomes available from withdrawals.
Approvals are made based on the following priorities, ordered from
higher to lower:
Course Description: The graduate IR core course focuses on
fundamental techniques for information retrieval, and new research
challenges in the field. The fundamental techniques include: document
and query representation, retrieval models using vector spaces and
probabilistic ranking, term weighting schemes based on corpus
statistics, indexing and search techniques, evaluation measures and
methodology. The new challenges include statistical methods and
machine learning techniques applied to text retrieval,
categorization, clustering, summarization, information extraction and
discovery, cross-language and multi-media information retrieval.
Prerequisites :
Grading : Grades will be based primarily (60%) on a multi-step
project. The other components consist of a midterm (20%) and homework
assignments (20%). No final exam.
Workload : Moderate. 12 hours/week estimate.
Homework Assignments (anticipated) : 2 problem sets, 1
hands-on programming task (homework). Information about
the various IR software packages (for your homework) is now available on the
web.
Projects: Projects are intended to give each student hands-on
experience with state-of-the-art methods on challenging but tractable
problems in Information Retrieval. Students will choose from a set of
projects designed by the instructor
(projects designed). Students will also have the option of
designing their own projects, subject to instructor approval. More
than one student may work on a particular topic, but all students will
work independently, each choosing an approach that is different from
the approaches used by the other students working on the same
problem. A library of software and formatted data collections will be
available on AFS to lighten the programming load. The software
library will include several search engines and statistical
classifiers.
General Information
Teaching Assistant
Please email the instructor if you would like to be added to
the waiting list.
Additionally, students are encouraged (but not required) to have taken
AI or Machine Learning courses.
Syllabus (anticipated)
| Day | Important Events | Subject | Lecturer |
|---|---|---|---|
| Introduction | | | |
| lec 1. 1/13 | | Course outline: materials, importance of text retrieval, technical challenges | Yiming |
| lec 2. 1/15 | | Text Representation: tokenization (word stemming and phrasing), term weighting, document indexing (inverted files), query processing | David Evans |
| lec 3. 1/20 | | Benchmark systems, evaluation and scaling issues (TREC and its tracks) | David Evans |
| Text Retrieval | | | |
| lec 4. 1/22 | HW1 | Vector Space Models (VSM, RF, PRF) | Yiming |
| lec 5. 1/27 | | Generalized Vector Space Models (GVSM) | Yiming |
| lec 6. 1/29 | HW1 due, HW2 out | Latent Semantic Indexing (LSI) | Yiming |
| lec 7. 2/3 | | Translingual retrieval methods (GVSM, LSI, PRF, EBMT) | Yiming |
| lec 8. 2/5 | HW2 due, HW3 out | Multivariate Regression (LLSF) and kNN | Yiming |
| Text Categorization | | | |
| lec 9. 2/10 | Project info out | Nearest Neighbor classification (kNN) and LLSF (and LSP) | Yiming |
| lec 10. 2/12 | HW3 due | k-D tree for nearest neighbor classification | Andrew Moore |
| lec 11. 2/17 | Projects assigned | Evaluation issues | Yiming |
| lec 12. 2/19 | | On-line learning (Sleeping Experts) | Avrim Blum |
| lec 13. 2/24 | | Evaluation issues (Rocchio, Naive Bayes) | Yiming |
| lec 14. 2/26 | | Scaling issues and problem decomposition | Yiming |
| Navigation, Summarization and Discovery | | | |
| lec 15. 3/3 | | Automated Summarization | Jaime Carbonell |
| lec 16. 3/5 | | Text clustering and topic detection | Yiming |
| lec 17. 3/10 | | Machine learning for information extraction from text | Dayne Freitag |
| lec 18. 3/12 | | Extracting knowledge from the World Wide Web | Tom Mitchell |
| lec 19. 3/17 | Projects preliminary results | Review & Q/A | Klaus/Yiming |
| 3/19 | Midterm Exam | No in-class exam; paper reviews due by 3/20 instead | Yiming |
| 3/24 | Spring Break!!! | | |
| 3/26 | Spring Break!!! | | |
| lec 20. 3/31 | | Topic tracking | Yiming |
| lec 21. 4/2 | | Learning from unlabeled documents; hierarchical text categorization | Andrew McCallum |
| Multimedia IR | | | |
| lec 22. 4/7 | | Spoken document retrieval | Alex Hauptmann |
| lec 23. 4/9 | | Topic-coherent segmentation/clustering | John Lafferty |
| lec 24. 4/14 | | Visual content-based image retrieval | Yihong Gong |
| lec 25. 4/16 | | Indexing methods, large scale | Chris Faloutsos |
| lec 26. 4/21 | | Research challenges in digital libraries | Howard Wactlar |
| Projects | | | |
| 4/23 | Projects due | Project presentation | Yiming |
| 4/28 | | Project presentation | Yiming |
| 4/30 | | Project presentation | Yiming |
Yiming Yang (yiming@cs.cmu.edu)