Information Extraction and Integration: 10-707

Learning to Turn Words into Data:
Machine Learning Approaches to Information Extraction and Information Integration

Notice: this page is for the course as taught in spring 2004. An updated syllabus for 2007 is available.

Instructor and Venue

Instructor: William Cohen, CALD
When/where: MW 10:30-11:50, Wean Hall 4601
Course Number: 10-707, cross-listed in LTI as 11-748

Announcements: On Wed April 28, all project groups will give a 10-minute presentation of their work.

Supplemental material: Instructions for installing minorthird; Vitor's also maintaining an FAQ for minorthird.

Description

Information extraction is finding names of entities in unstructured or partially structured text, and determining the relationships that hold between these entities. Information integration is reasoning with data taken from multiple sources. Together these techniques let one automatically perform the tremendously challenging task of deriving structured information from text, and relating it to previously-known facts.

The course will discuss many of the sub-problems involved in information extraction and integration, and the techniques required to solve them. We will consider the problems of text segmentation, relational learning, classification of text segments, finding and clustering of similar records, and reasoning with objects whose identity is uncertain. We will survey a variety of learning techniques that have been used on these problems, including rule-learning, boosting, semi-supervised learning, finite-state sequential classification methods (such as conditional Markov models and conditional random fields), character-based edit distances and adaptive generative models for modifying them, and other topics as time allows.
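As a concrete illustration of the character-based edit distances mentioned above, here is a minimal sketch of the standard Levenshtein distance, computed by dynamic programming. It is only an illustration under simple assumptions (unit cost for every insertion, deletion, and substitution) and is not taken from any of the course readings. The function name `edit_distance` is made up for this example.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions needed to turn s into t.
    Unit costs are an assumption of this sketch."""
    # prev[j] is the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

The adaptive methods mentioned above replace these fixed unit costs with costs learned from data. This sketch does not attempt that.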

Readings will be based on research papers. Grades will be based on class participation, paper presentations, and a project.

More specifically, students will be expected to:

Prepare short summaries of the papers being discussed.
Present one or more "optional" papers from the syllabus (or some other mutually agreeable paper) to the class.
Do a course project in a group of 2-3 people. Typical course projects might be: systematically comparing two or more existing extraction or integration methods on an existing dataset; exploring a new extraction or integration application, by collecting a dataset and evaluating an existing method; or rigorous formal analysis of a course-related topic. The end result of the project will be a written report, with format and length appropriate for a conference publication.

Prerequisites: a machine learning course (e.g., 15-781 or 15-681) or consent of the instructor.

Syllabus

Overview/Survey

Lecture: (Jan 12) Overview of IE. Some longer overview slides are available on my web page, from researcher tutorials given at NIPS-2002 and KDD-2003. (Jan 14) Overviews of some of my own older work on information integration, and also some more of my recent work comparing different string distance metrics. Don't miss the example of TFIDF matching that didn't fit in my old PDF presentation.
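For readers who have not seen TFIDF matching before, the following is a minimal sketch of the general idea: each record is treated as a bag of tokens, rare tokens are weighted more heavily than common ones, and records are compared by cosine similarity. It is not the example from the lecture slides. The record strings, the function names, and the smoothed IDF formula are all assumptions made for this illustration.

```python
import math
from collections import Counter


def tfidf_vectors(records):
    """Return one unit-length TFIDF vector (a dict) per record.
    Tokenization is plain lowercase whitespace splitting (an assumption)."""
    tokenized = [r.lower().split() for r in records]
    n = len(tokenized)
    # Document frequency: number of records containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF, so tokens that occur in every record keep some weight.
        vec = {tok: count * (math.log((1 + n) / (1 + df[tok])) + 1)
               for tok, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({tok: w / norm for tok, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * v[tok] for tok, w in u.items() if tok in v)


if __name__ == "__main__":
    # Made-up records, for illustration only.
    records = ["Acme Widget Corp", "Acme Widget Corporation",
               "Globex Corp", "Initech Corp"]
    vecs = tfidf_vectors(records)
    for other in (1, 2):
        print(records[0], "~", records[other], cosine(vecs[0], vecs[other]))
```

In this toy collection, the first two records share the relatively rare tokens "acme" and "widget", so they score higher than the first and third, which share only the common token "corp".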

Readings:

Adaptive information extraction: Core technologies for information agents, Kushmerick and Thomas.
Machine Learning for Sequential Data: A Review, Tom Dietterich.
Adaptive Name-Matching in Information Integration, Bilenko et al.
Comparative Experiments on Learning Information Extractors for Proteins and their Interactions, Bunescu et al.

Information Extraction as Classifying Text Segments

Lecture: (Jan 19) A discussion of key points from Jansche and Abney, and Cohen et al. (Jan 21) Overviews of the Califf and Mooney paper and the Cohen et al paper. Pradeep will also present a summary of the Collins and Singer paper.

Readings:

Information Extraction from Voicemail Transcripts, Jansche and Abney. (For background, here's the Huang et al 2001 paper they compare to.)
Understanding Captions in Biomedical Publications, Cohen et al.
Bottom-Up Relational Learning of Pattern Matching Rules for Information Extraction, Califf and Mooney.
A Flexible Learning System for Wrapping Tables and Lists in HTML Documents, Cohen et al.
(Optional) Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron, Collins.
(Optional) Relational Learning via Propositional Algorithms: An Information Extraction Case Study, Roth and Yih.
(Optional) Unsupervised Models for Named Entity Classification, Collins and Singer.
(Optional) Information Extraction from HTML: Application of a General Machine Learning Approach, Freitag.

IE as Boundary Detection

Lecture: (Jan 27) A discussion of Kushmerick's AIJ 2000 journal paper and Kushmerick and Freitag's BWI paper; I'll also try again to get to a presentation of the Cohen et al wrapper-learning paper.

Readings:

Wrapper induction: Efficiency and expressiveness, Kushmerick
Boosted wrapper induction, Freitag and Kushmerick

IE as Sequential Token Classification: HMMs

Lecture: (Feb 4) A guest lecture by Sunita Sarawagi, focusing on the Borkar et al paper. (Notice that I've added this to the readings for this week, and made the Leek paper "optional".)

Readings: