Original version by Yiming

15-886/11-741: Information Retrieval

CS core unit / LTI core

Welcome to the IR course for CMU graduate-level students!

General Information

  • Time: Tuesdays and Thursdays, 1:30-2:50pm
  • Location: Cyert Hall, Blue Conference Room
  • Instructor: Yiming Yang (yiming@cs.cmu.edu, Cyert Hall 260, office hours by appointment).
  • Materials: There is no official textbook. Course notes and papers will be distributed instead. Many of the handouts are or will be available online; when a handout is not online, hard copies will be placed outside Yiming Yang's office (Cyert Hall 260) and announced. The reading list includes some online papers and a package of hard copies prepared for registered students. The package can be obtained from Jen Potter for $20, which covers the copying expense. For further reference, a few textbooks are recommended:
    • van Rijsbergen (1979) "Information Retrieval" available on-line;
    • Salton, G. (1989), "Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer", Addison-Wesley, Reading, Massachusetts (out of print, unfortunately);
    • Frakes, W.B. and Baeza-Yates, R.B. (1992) "Information Retrieval: Data Structures & Algorithms", Prentice Hall, Englewood Cliffs, New Jersey (currently not available);
    • Sparck Jones, K. and Willett, P. Eds. (1997) "Readings in Information Retrieval", Morgan Kaufmann, San Francisco, California;
    • Kowalski, G. (1997) "Information Retrieval Systems: Theory and Implementation", Kluwer Academic Publishers, Boston/Dordrecht/London;
    • Yu, C. and Meng, W. "Principles of Database Query Processing for Advanced Applications", Morgan Kaufmann, San Francisco, California.

    Teaching Assistant

    Sit-in: You need approval from the instructor. The current classroom is overcrowded because of the large number of non-registered participants, leaving not enough space for registered students and approved sit-in members. While we are still working with the registrar on a possible classroom swap, at present we must restrict participation to the registered list and the approved sit-in list (check the participation approval, and please inform the instructor if you are registered but not included in the list). Further approvals are possible if we find a larger classroom or if space becomes available from withdrawals. Approvals are made based on the following priorities, ordered from highest to lowest:

    • registered students;
    • CMU faculty members;
    • CMU students fully registered but not in this class;
    • LTI employees;
    • order in the waiting list (first-in-first-assigned).
    Please email the instructor if you would like to be added to the waiting list.

    Course Description: The graduate IR core course focuses on fundamental techniques for information retrieval and on new research challenges in the field. The fundamental techniques include document and query representation, retrieval models using vector spaces and probabilistic ranking, term weighting schemes based on corpus statistics, indexing and search techniques, and evaluation measures and methodology. The new challenges include statistical methods and machine learning techniques applied to text retrieval, categorization, clustering, summarization, information extraction and discovery, and cross-language and multimedia information retrieval.

    Prerequisites :

    • Programming and data-structures at the level of 15-212 or higher.
    • Algorithms comparable to the undergraduate CS algorithms course (15-451) or higher.
    • Basic linear algebra (21-241 or 21-341) and basic statistics (36-202) or higher.
    Additionally, students are encouraged (but not required) to have taken AI or Machine Learning courses.

    Grading : Grades will be based primarily (60%) on a multi-step project. The other components consist of a midterm (20%) and homework assignments (20%). No final exam.

    Workload : Moderate, estimated at 12 hours/week.

    Homework Assignments (anticipated) : 2 problem sets and 1 hands-on programming task. Information about the various IR software packages (for your homework) is now available on the web.

    Projects: Projects are intended to give each student hands-on experience with state-of-the-art methods on challenging but tractable problems in Information Retrieval. Students will choose from a set of projects designed by the instructor (projects designed). Students will also have the option of designing their own projects, subject to instructor approval. More than one student may work on a particular topic, but all students will work independently, each choosing an approach that is different from the approaches used by the other students working on the same problem. A library of software and formatted data collections will be available on AFS to lighten the programming load. The software library will include several search engines and statistical classifiers.

Syllabus (anticipated)

Day | Important Events | Subject | Lecturer
Introduction
lec 1. 1/13 | | Course outline: materials, importance of text retrieval, technical challenges | Yiming
lec 2. 1/15 | | Text Representation: tokenization (word stemming and phrasing), term weighting, document indexing (inverted files), query processing | David Evans
lec 3. 1/20 | | Benchmark systems, evaluation and scaling issues (TREC and its tracks) | David Evans
Text Retrieval
lec 4. 1/22 | HW1 | Vector Space Models (VSM, RF, PRF) | Yiming
lec 5. 1/27 | | Generalized Vector Space Models (GVSM) | Yiming
lec 6. 1/29 | HW1 due, HW2 out | Latent Semantic Indexing (LSI) | Yiming
lec 7. 2/3 | | Translingual retrieval methods (GVSM, LSI, PRF, EBMT) | Yiming
lec 8. 2/5 | HW2 due, HW3 out | Multivariate Regression (LLSF) and kNN | Yiming
Text Categorization
lec 9. 2/10 | Project info out | Nearest Neighbor classification (kNN) and LLSF (and LSP) | Yiming
lec 10. 2/12 | HW3 due | k-D tree for nearest neighbor classification | Andrew Moore
lec 11. 2/17 | Projects assigned | Evaluation issues | Yiming
lec 12. 2/19 | | On-line learning (Sleeping Experts) | Avrim Blum
lec 13. 2/24 | | Evaluation issues (Rocchio, Naive Bayes) | Yiming
lec 14. 2/26 | | Scaling issues and problem decomposition | Yiming
Navigation, Summarization and Discovery
lec 15. 3/3 | | Automated Summarization | Jaime Carbonell
lec 16. 3/5 | | Text clustering and topic detection | Yiming
lec 17. 3/10 | | Machine learning for information extraction from text | Dayne Freitag
lec 18. 3/12 | | Extracting knowledge from the World Wide Web | Tom Mitchell
lec 19. 3/17 | Project preliminary results | Review & Q/A | Klaus/Yiming
3/19 | Midterm Exam | No in-class exam; paper reviews due by 3/20 instead | Yiming
3/24 | | Spring Break!!! |
3/26 | | Spring Break!!! |
lec 20. 3/31 | | Topic tracking | Yiming
lec 21. 4/2 | | Learning from unlabeled documents; hierarchical text categorization | Andrew McCallum
Multimedia IR
lec 22. 4/7 | | Spoken document retrieval | Alex Hauptmann
lec 23. 4/9 | | Topic-coherent segmentation/clustering | John Lafferty
lec 24. 4/14 | | Visual content-based image retrieval | Yihong Gong
lec 25. 4/16 | | Indexing methods, large scale | Chris Faloutsos
lec 26. 4/21 | | Research challenges in digital libraries | Howard Wactlar
Projects
4/23 | Projects due | Project presentation | Yiming
4/28 | | Project presentation | Yiming
4/30 | | Project presentation | Yiming


Yiming Yang (yiming@cs.cmu.edu)