This is an advanced, research-oriented course on statistical natural language processing. Students and the instructors will work together to understand, implement, and extend state-of-the-art machine learning algorithms for information extraction, named entity extraction, co-reference resolution, and related natural language processing tasks. The course will involve two primary activities: reading and discussing current research papers in this area, and developing a novel approach to continuous learning for natural language processing. More specifically, as a class we will work together toward designing and building a computer system that runs 24 hours/day, 7 days/week, performing two tasks: (1) extracting factual content from unstructured and semi-structured web pages, and (2) continuously learning to improve its competence at information extraction.
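To make these two tasks concrete, here is a minimal, hypothetical sketch of such an extract-then-learn loop, in the spirit of the bootstrap-learning work on the reading list below. The seed pattern, toy pages, and 15-character context window are invented for illustration; this is not the ReadTheWeb design, which is what the course will produce.

    import re

    def extract(pages, patterns):
        """Task 1: apply every known pattern to every page, collecting facts."""
        facts = set()
        for page in pages:
            for pattern in patterns:
                facts.update(re.findall(pattern, page))
        return facts

    def learn(pages, facts, min_support=2):
        """Task 2: induce new patterns from contexts where known facts occur."""
        support = {}                                  # candidate pattern -> facts seen
        for page in pages:
            for fact in facts:
                for m in re.finditer(re.escape(fact), page):
                    left = page[max(0, m.start() - 15):m.start()]
                    if left:
                        cand = re.escape(left) + r"(\w+)"
                        support.setdefault(cand, set()).add(fact)
        # Keep only contexts attested by several distinct facts, to limit drift.
        return {p for p, fs in support.items() if len(fs) >= min_support}

    pages = [
        "cities such as Pittsburgh have rivers",
        "cities such as Boston have harbors",
        "the mayor of Pittsburgh spoke",
        "the mayor of Boston spoke",
        "the mayor of Seattle spoke",
    ]
    patterns = {r"cities such as (\w+)"}              # seed knowledge

    for iteration in range(3):                        # a real system would run 24/7
        facts = extract(pages, patterns)
        patterns |= learn(pages, facts)
        print(iteration, sorted(facts))

The min_support threshold hints at a real design issue for a system that runs around the clock: without some check on newly learned patterns, extraction errors compound from one iteration to the next.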
Resources: a list of candidate resources (e.g., search software, WordNet) that may be useful in our system.
Task List:
To succeed as a group we'll have to accomplish a number of important tasks beyond learning the material and developing our individual components of the system. These tasks range from serving as a class consultant to help others use the Minorthird system, to helping manage the course website. Everybody is expected to sign up for something. Please sign up now.
Round 1 Projects: look here for advice on round 1 project proposals, due January 26.
Tentative course schedule:
During most class meetings we will spend part of the class studying one or more approaches to semi-supervised learning, and part of the class on design and design reviews of the ReadTheWeb system we're building. The following is a partial outline of topics/assignments/handouts for upcoming class sessions, to be updated as we go.
Assignment out: (1) read the above two papers; (2) form teams of 2-4 students. Each team will give a 5-minute presentation on Jan 26, describing their proposed learning task and semi-supervised approach.
April 6. We reviewed and discussed the draft APIs posted on Kiva, and went over the overall timeline for finishing the course. A summary of action items was captured in a set of PowerPoint slides. Each team will tabulate the steps in their module's operation, following the example provided by Andy Schlaikjer, and document those steps in more detail (see slides for more).
Here is some of the research we may cover:
-Large-scale web information extraction [Etzioni et al., 05]
-Bootstrap learning from the web [Brin, 99]
-Co-training for web classification [Blum & Mitchell, 98]
-Bootstrapping for natural language learning [Eisner & Karakos, 05]
-Semi-supervised learning for named entity extraction [Collins & Singer, 99; Jones, 05]
-Automatic learning of hypernyms [Ng, 05]
-Extracting information about people and publications from the web [McCallum, 05]
-Wrapper induction for extraction from structured web pages [Muslea et al., 01; Mohapatra et al., 04]
-Learning to disambiguate word senses [Yarowsky, 96]
-Discovering new word senses [Pantel & Lin, 02]
-Synonym and ontology discovery [Lin et al., 03]
-Relation extraction [Yangarber et al., 00]
-Statistical parsing [Collins et al., 05]
-Graphical models for information extraction [Rosario, 05]
-Latent Dirichlet Allocation [Blei, 02]
more coming soon...
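As one concrete taste of the semi-supervised techniques on this list, the following is a small, hypothetical co-training sketch in the spirit of [Blum & Mitchell, 98]. The toy pages, the two feature views, and the seed labels are all invented here; a real web-page classifier would use far richer features.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # View 1: words on the page itself. Label 1 = course page, 0 = other.
    page_text = [
        "course syllabus lecture schedule homework",   # labeled 1
        "store cart checkout shipping price",          # labeled 0
        "lecture notes homework course staff",         # unlabeled
        "cart price shipping returns order",           # unlabeled
        "course schedule readings exams",              # unlabeled
        "store checkout deals sale",                   # unlabeled
    ]
    # View 2: anchor text of links pointing at the page.
    anchor_text = ["cs course fall", "online store", "course homepage",
                   "store front", "course info", "shop store"]
    labels = np.array([1, 0, -1, -1, -1, -1])          # -1 = unlabeled

    views = [CountVectorizer().fit_transform(t) for t in (page_text, anchor_text)]

    for _ in range(2):                                 # a few co-training rounds
        for X in views:                                # each view labels for the other
            L, U = labels != -1, labels == -1
            if not U.any():
                break
            clf = MultinomialNB().fit(X[L], labels[L])
            proba = clf.predict_proba(X[U])
            best = proba.max(axis=1).argmax()          # most confident unlabeled page
            labels[np.flatnonzero(U)[best]] = clf.classes_[proba[best].argmax()]

    print(labels)                                      # all six pages now carry labels

The key assumption co-training exploits is that each view is, by itself, sufficient to classify a page, and the two views are roughly independent given the label; each classifier's confident predictions then act as fresh training data for the other.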
This file is located at /afs/cs/project/theo-21/www/index.html.
Tom Mitchell, January 20, 2006.