11741/15866, IR course

Original version by Yiming

15-886/11-741: Information Retrieval

CS core unit / LTI core

Welcome to the IR course for CMU graduate-level students!

General Information

Time: Tuesdays and Thursdays, 1:30-2:50pm
Location: Cyert Hall, Blue Conference Room
Instructor: Yiming Yang (yiming@cs.cmu.edu, Cyert Hall 260, office hours by appointment).
Materials: There is no official textbook. Course notes and papers will be distributed instead. Many of the handouts are or will be available online; handouts that are not online will be made available as hard copies outside Yiming Yang's office (Cyert Hall 260), with an announcement. A reading list is specified, including some online papers and a package of hard copies prepared for registered students. The package can be obtained from Jen Potter for $20 to cover the copying expense. For further reference, a few textbooks are recommended:
- van Rijsbergen (1979) "Information Retrieval" available on-line;
- Salton, G. (1989), "Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer", Addison-Wesley, Reading, Massachusetts (out of print, unfortunately);
- Frakes, W.B. and Baeza-Yates, R.B. (1992) "Information Retrieval: Data Structures & Algorithms", Prentice Hall, Englewood Cliffs, New Jersey (currently not available);
- Sparck Jones, K. and Willett, P. Ed. (1997) "Readings in Information Retrieval", Morgan Kaufmann, San Francisco, California;
- Kowalski, G. (1997) "Information Retrieval Systems: Theory and Implementation", Kluwer Academic Publishers, Boston/Dordrecht/London;
- Yu, C. and Meng, W. "Principles of Database Query Processing for Advanced Applications", Morgan Kaufmann, San Francisco, California.
Teaching Assistant
- Klaus Zechner (email - Klaus.Zechner@cs.cmu.edu)
- Office: SC 224
- Office hours: Tuesday 3-4pm; Wednesday 4-5pm; exceptional cases by appointment
Sit-in: You need approval from the instructor. The current classroom is overcrowded because of a large number of non-registered participants, leaving too little space for the registered students and the approved sit-in members. While we are still working with the registrar on a possible classroom swap, at present we must restrict participation to the registered list and the approved sit-in list (check the participation approval, and please inform the instructor if you are registered but not included in the list). Further approvals are possible if we find a larger classroom, or if space becomes available through withdrawals. Approvals are made based on the following priorities, ordered from higher to lower:
- registered students;
- CMU faculty members;
- CMU students fully registered but not in this class;
- LTI employees;
- order in the waiting list (first-in-first-assigned).
Please email the instructor if you would like to be added to the waiting list.
Course Description: The graduate IR core course focuses on fundamental techniques for information retrieval and on new research challenges in the field. The fundamental techniques include document and query representation, retrieval models using vector spaces and probabilistic ranking, term weighting schemes based on corpus statistics, indexing and search techniques, and evaluation measures and methodology. The new challenges include statistical methods and machine learning techniques applied to text retrieval, categorization, clustering, summarization, information extraction and discovery, and cross-language and multimedia information retrieval.
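For a concrete flavor of the vector-space retrieval and corpus-statistics term weighting named above, here is a minimal sketch. It assumes one common weighting (raw term frequency times log inverse document frequency) and cosine similarity for ranking; the function names and toy documents are hypothetical, and this is not the specific formulation or software used in the course.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; real systems also stem and detect phrases.
    return text.lower().split()

def build_index(docs):
    """Return per-document tf-idf vectors (as dicts) and the idf table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    vectors, idf = build_index(docs)
    q_tf = Counter(tokenize(query))
    # Query terms that never occur in the collection are ignored.
    q_vec = {t: tf * idf[t] for t, tf in q_tf.items() if t in idf}
    scores = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))

docs = [
    "text retrieval with vector spaces",
    "image retrieval by visual content",
    "spoken document retrieval",
]
ranked = rank("vector text", docs)
```

Note that a term occurring in every document (here "retrieval") gets an idf of zero under this weighting, so it contributes nothing to the ranking.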
Prerequisites:
- Programming and data-structures at the level of 15-212 or higher.
- Algorithms comparable to the undergraduate CS algorithms course (15-451) or higher.
- Basic linear algebra (21-241 or 21-341) and basic statistics (36-202) or higher.
Additionally, students are encouraged (but not required) to have taken AI or Machine Learning courses.
Grading: Grades will be based primarily (60%) on a multi-step project. The other components are a midterm (20%) and homework assignments (20%). There is no final exam.
Workload: Moderate, estimated at 12 hours/week.
Homework Assignments (anticipated): 2 problem sets and 1 hands-on programming task. Information about the various IR software packages (for your homework) is now available on the web.
Projects: Projects are intended to give each student hands-on experience with state-of-the-art methods on challenging but tractable problems in Information Retrieval. Students will choose from a set of projects designed by the instructor (see the list of designed projects). Students will also have the option of designing their own projects, subject to instructor approval. More than one student may work on a particular topic, but all students will work independently, each choosing an approach that is different from the approaches used by the other students working on the same problem. A library of software and formatted data collections will be available on AFS to lighten the programming load. The software library will include several search engines and statistical classifiers.

Syllabus (anticipated)
Day	Important Events	Subject	Lecturer
Introduction
lec 1. 1/13		Course outline: materials, importance of text retrieval, technical challenges	Yiming
lec 2. 1/15		Text Representation: tokenization (word stemming and phrasing), term weighting, document indexing (inverted files), query processing	David Evans
lec 3. 1/20		Benchmark systems, evaluation and scaling issues (TREC and its tracks)	David Evans
Text Retrieval
lec 4. 1/22	HW1	Vector Space Models (VSM, RF, PRF)	Yiming
lec 5. 1/27		Generalized Vector Space Models (GVSM)	Yiming
lec 6. 1/29	HW1 due, HW2 out	Latent Semantic Indexing (LSI)	Yiming
lec 7. 2/3		Translingual retrieval methods (GVSM, LSI, PRF, EBMT)	Yiming
lec 8. 2/5	HW2 due, HW3 out	Multivariate Regression (LLSF) and kNN	Yiming
Text Categorization
lec 9. 2/10	Project info out	Nearest Neighbor classification (kNN) and LLSF (and LSP)	Yiming
lec 10. 2/12	HW3 due	k-D tree for nearest neighbor classification	Andrew Moore
lec 11. 2/17	Projects assigned	Evaluation issues	Yiming
lec 12. 2/19		On-line learning (Sleeping Experts)	Avrim Blum
lec 13. 2/24		Evaluation issues (Rocchio, Naive Bayes)	Yiming
lec 14. 2/26		Scaling issues and problem decomposition	Yiming
Navigation, Summarization and Discovery
lec 15. 3/3		Automated Summarization	Jaime Carbonell
lec 16. 3/5		Text clustering and topic detection	Yiming
lec 17. 3/10		Machine learning for information extraction from text	Dayne Freitag
lec 18. 3/12		Extracting knowledge from the World Wide Web	Tom Mitchell
lec 19. 3/17	Projects preliminary results	Review & Q/A	Klaus/Yiming
3/19	Midterm Exam	No in-class exam; paper reviews due by 3/20 instead	Yiming
3/24	Spring Break!!!
3/26	Spring Break!!!
lec 20. 3/31		Topic tracking	Yiming
lec 21. 4/2		Learning from unlabeled documents; hierarchical text categorization	Andrew McCallum
Multimedia IR
lec 22. 4/7		Spoken document retrieval	Alex Hauptmann
lec 23. 4/9		Topic-coherent segmentation/clustering	John Lafferty
lec 24. 4/14		Visual content-based image retrieval	Yihong Gong
lec 25. 4/16		Indexing methods, large scale	Chris Faloutsos
lec 26. 4/21		Research challenges in digital libraries	Howard Wactlar
Projects
4/23	Projects due	Project presentation	Yiming
4/28		Project presentation	Yiming
4/30		Project presentation	Yiming

Yiming Yang (yiming@cs.cmu.edu)
