This is an advanced, research-oriented course on statistical natural language processing. Students and the instructors will work together to understand, implement, and extend state-of-the-art machine learning algorithms for information extraction, named entity extraction, co-reference resolution, and related natural language processing tasks. The course will involve two primary activities: reading and discussing current research papers in this area, and developing a novel approach to continuous learning for natural language processing. More specifically, as a class we will work together toward designing and building a computer system that runs 24 hours/day, 7 days/week, performing two tasks: (1) extracting factual content from unstructured and semi-structured web pages, and (2) continuously learning to improve its competence at information extraction.
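To make these two tasks concrete, here is a minimal, hypothetical sketch of such an extract-then-learn loop, in the spirit of the bootstrap-learning work on the reading list below. The seed pattern, toy pages, and 15-character context window are invented for illustration; this is not the ReadTheWeb design, which is what the course will produce.

    import re

    def extract(pages, patterns):
        """Task 1: apply every known pattern to every page, collecting facts."""
        facts = set()
        for page in pages:
            for pattern in patterns:
                facts.update(re.findall(pattern, page))
        return facts

    def learn(pages, facts, min_support=2):
        """Task 2: induce new patterns from contexts where known facts occur."""
        support = {}                                  # candidate pattern -> facts seen
        for page in pages:
            for fact in facts:
                for m in re.finditer(re.escape(fact), page):
                    left = page[max(0, m.start() - 15):m.start()]
                    if left:
                        cand = re.escape(left) + r"(\w+)"
                        support.setdefault(cand, set()).add(fact)
        # Keep only contexts attested by several distinct facts, to limit drift.
        return {p for p, fs in support.items() if len(fs) >= min_support}

    pages = [
        "cities such as Pittsburgh have rivers",
        "cities such as Boston have harbors",
        "the mayor of Pittsburgh spoke",
        "the mayor of Boston spoke",
        "the mayor of Seattle spoke",
    ]
    patterns = {r"cities such as (\w+)"}              # seed knowledge

    for iteration in range(3):                        # a real system would run 24/7
        facts = extract(pages, patterns)
        patterns |= learn(pages, facts)
        print(iteration, sorted(facts))

The min_support threshold hints at a real design issue for a system that runs around the clock: without some check on newly learned patterns, extraction errors compound from one iteration to the next.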
Resources: a list of candidate resources (e.g., search software, WordNet) that may be useful in our system.
Task List:
To succeed as a group we'll have to accomplish a number of important tasks beyond learning the material and developing our individual components of the system. These tasks range from serving as a class consultant to help others use the Minorthird system, to helping manage the course website. Everybody is expected to sign up for something. Please sign up now.
Round 1 Projects: look here for advice on round 1 project proposals, due January 26.
Tentative course schedule:
During most class meetings we will spend part of the class studying one or more approaches to semi-supervised learning, and part of the class on design and design reviews of the ReadTheWeb system we're building. The following is a partial outline of topics/assignments/handouts for upcoming class sessions, to be updated as we go.
Assignment out: (1) read the above two papers; (2) form teams of 2-4 students. Each team will give a 5-minute presentation on Jan 26, describing their proposed learning task and semi-supervised approach.
April 6. We reviewed and discussed the draft APIs posted on Kiva, and went over the overall timeline for finishing the course. A summary of action items was captured in a set of PowerPoint slides. Each team will tabulate the steps in their module's operation, following the example provided by Andy Schlaikjer, and document those steps in more detail (see slides for more).
Here is some of the research we may cover:
-Large-scale web information extraction [Etzioni et al., 05]
-Bootstrap learning from the web [Brin, 99]
-Co-training for web classification [Blum & Mitchell, 98]
-Bootstrapping for natural language learning [Eisner & Karakos, 05]
-Semi-supervised learning for named entity extraction [Collins & Singer, 99; Jones, 05]
-Automatic learning of hypernyms [Ng, 05]
-Extracting information about people and publications from the web [McCallum, 05]
-Wrapper induction for extraction from structured web pages [Muslea et al., 01; Mohapatra et al., 04]
-Learning to disambiguate word senses [Yarowsky, 96]
-Discovering new word senses [Pantel & Lin, 02]
-Synonym and ontology discovery [Lin et al., 03]
-Relation extraction [Yangarber et al., 00]
-Statistical parsing [Collins et al., 05]
-Graphical models for information extraction [Rosario, 05]
-Latent Dirichlet Allocation [Blei, 02]
more coming soon...
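As one concrete taste of the semi-supervised techniques on this list, the following is a small, hypothetical co-training sketch in the spirit of [Blum & Mitchell, 98]. The toy pages, the two feature views, and the seed labels are all invented here; a real web-page classifier would use far richer features.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # View 1: words on the page itself. Label 1 = course page, 0 = other.
    page_text = [
        "course syllabus lecture schedule homework",   # labeled 1
        "store cart checkout shipping price",          # labeled 0
        "lecture notes homework course staff",         # unlabeled
        "cart price shipping returns order",           # unlabeled
        "course schedule readings exams",              # unlabeled
        "store checkout deals sale",                   # unlabeled
    ]
    # View 2: anchor text of links pointing at the page.
    anchor_text = ["cs course fall", "online store", "course homepage",
                   "store front", "course info", "shop store"]
    labels = np.array([1, 0, -1, -1, -1, -1])          # -1 = unlabeled

    views = [CountVectorizer().fit_transform(t) for t in (page_text, anchor_text)]

    for _ in range(2):                                 # a few co-training rounds
        for X in views:                                # each view labels for the other
            L, U = labels != -1, labels == -1
            if not U.any():
                break
            clf = MultinomialNB().fit(X[L], labels[L])
            proba = clf.predict_proba(X[U])
            best = proba.max(axis=1).argmax()          # most confident unlabeled page
            labels[np.flatnonzero(U)[best]] = clf.classes_[proba[best].argmax()]

    print(labels)                                      # all six pages now carry labels

The key assumption co-training exploits is that each view is, by itself, sufficient to classify a page, and the two views are roughly independent given the label; each classifier's confident predictions then act as fresh training data for the other.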
This file is located at /afs/cs/project/theo-21/www/index.html.
Tom Mitchell, January 20, 2006.