Tom Mitchell
Machine Learning Department
School of Computer Science, Carnegie Mellon University
Instructor: Tom Mitchell, GHC 8211, x8-2611
Course administrative assistant: Sharon Cavlovich, GHC 8215, x8-5196
Class lectures: Thursdays, 3:00pm-4:50pm, Gates-Hillman Center 4211
This is an advanced, research-oriented course on statistical natural
language processing. Students and the instructor will work together to
understand and extend state-of-the-art machine learning algorithms for
information extraction, named entity extraction, co-reference
resolution, and related natural language processing tasks. The course
will involve two primary activities: reading and discussing current
research papers in this area, and developing a novel approach to
continuous learning for natural language processing. More specifically,
as a class we will work together to build components of a computer
system that runs for many days on a large computer cluster that
contains 200 million web pages, to perform two tasks: (1) extracting
factual content from unstructured and semi-structured web pages, and
(2) continuously learning to improve its competence at information
extraction. We will begin the course with a running prototype system,
as described at http://rtw.ml.cmu.edu/readtheweb.html. During the
course, students will help extend and populate this system with
additional statistical learning methods that enable it to extract
additional kinds of information from the web, and to continuously learn
to improve its capabilities.
Class Wiki:
Homeworks:
Data and Software: