Advanced Statistical Language Processing: Reading the Web (10-709)

Tom Mitchell
Machine Learning Department
School of Computer Science, Carnegie Mellon University

Fall 2009

Instructor: Tom Mitchell, GHC 8211, x8-2611

Course administrative assistant: Sharon Cavlovich, GHC 8215, x8-5196

Class lectures: Thursdays 3:00pm-4:50pm, Gates-Hillman Center, 4211

This is an advanced, research-oriented course on statistical natural language processing. Students and the instructor will work together to understand and extend state-of-the-art machine learning algorithms for information extraction, named entity extraction, co-reference resolution, and related natural language processing tasks. The course will involve two primary activities: reading and discussing current research papers in this area, and developing a novel approach to continuous learning for natural language processing. More specifically, as a class we will work together to build components of a computer system that runs for many days on a large computer cluster that contains 200 million web pages, to perform two tasks: (1) extracting factual content from unstructured and semi-structured web pages, and (2) continuously learning to improve its competence at information extraction. We will begin the course with a running prototype system, as described at http://rtw.ml.cmu.edu/readtheweb.html. During the course, students will help extend and populate this system with additional statistical learning methods that enable it to extract additional kinds of information from the web, and to continuously learn to improve its capabilities.

Class Wiki:

Wiki for student projects, discussions of research issues, etc.

Homeworks:

HW1: show your creativity using our NP-context data. Due in class Sept 17.
HW2: propose and justify your course project. Due in class Oct 1.
HW3: read and write about active learning. Due in class Oct 15.
HW4: discuss the projects of other students.

Data and Software:

Data sets and software to access the knowledge base of extracted beliefs are available here.

Class slides:

Sept 10. The research goal, current system, project ideas
Sept 24. Semi-supervised learning I
Oct 8. Semi-supervised learning II
Oct 15. Active learning (Burr Settles)
Oct 22. Multi-task active learning (Yi Zhang)
Oct 29. Exploiting unlabeled data and lexical/ontological structure for frame-semantic parsing (Nathan Schneider); Learning morphological patterns for NER (Reza Zadeh)
Nov 5. Semantic feature vectors and modeling the impact of adjectives on them (Jayant Krishnamurthy); Rich knowledge for coreference resolution (Brendan O'Connor)
Nov 12. Ontology extension (Mohamed Thahir)