William Cohen, CALD
When/where: MW 10:30-11:50, Wean Hall 4601
Course Number: 10-707, cross-listed in LTI as 11-748
Announcements: On Wed April 28, all project
groups will give a 10-minute presentation of their work.
Supplemental material: Instructions for installing minorthird;
Vitor's also maintaining an FAQ
for minorthird.
Description
Information extraction is finding names of entities in unstructured or
partially structured text, and determining the relationships that hold
between these entities. Information integration is reasoning with
data taken from multiple sources. Together these techniques let one
automatically perform the tremendously challenging task of deriving
structured information from text, and relating it to previously-known
facts.
The course will discuss many of the sub-problems involved in
information extraction and integration, and the techniques required to
solve them. We will consider the problems of text segmentation,
relational learning, classification of text segments, finding and
clustering of similar records, and reasoning with objects whose
identity is uncertain. We will survey a variety of learning
techniques that have been used on these problems, including
rule-learning, boosting, semi-supervised learning, finite-state
sequential classification methods (such as conditional Markov models
and conditional random fields), character-based edit distances and
adaptive generative models for modifying them, and other topics as
time allows.
Readings will be based on research papers. Grades will be based on
class participation, paper presentations, and a project.
More specifically, students will be expected to:
- Prepare short summaries of the papers being discussed.
- Present one or more "optional" papers from the syllabus
(or some other mutually agreeable paper) to the class.
- Do a course project in a group of 2-3 people. Typical
course projects might be: systematically comparing two or more
existing extraction or integration methods on an existing
dataset; exploring a new extraction or integration
application, by collecting a dataset and evaluating an
existing method; or rigorous formal analysis of a
course-related topic. The end result of the project will be a
written report, with format and length appropriate for a
conference publication.
Prerequisites: a machine learning course (e.g., 15-781 or 15-681) or
consent of the instructor.
Syllabus
Overview/Survey
Lecture: (Jan 12) Overview of IE.
Some longer overview slides are available on my web page, from
researcher tutorials given at NIPS-2002
and KDD-2003.
(Jan 14) Overviews of some of my own older
work on information integration, and also some more of my recent work comparing
different string distance metrics. Don't miss the example of TFIDF matching that didn't
fit in my old PDF presentation.
Readings:
Information Extraction as Classifying Text Segments
Lecture: (Jan 19) A discussion of
key points from Jansche and Abney, and Cohen et
al. (Jan 21) Overviews of the Califf
and Mooney paper and the Cohen et
al paper. Pradeep will also present a summary of the
Collins and Singer paper.
Readings:
IE as Boundary Detection
Lecture: (Jan 27) A discussion of
Kushmerick's AIJ 2000 journal paper and Kushmerick and Freitag's
BWI paper. I'll also try again to get to a presentation of the Cohen et al
wrapper-learning paper.
Readings:
IE as Sequential Token Classification: HMMs
Lecture: (2/4) A guest lecture by Sunita Sarawagi,
focusing on the Borkar et al paper. (Notice that I've
added this to the readings for this week, and made the Leek
paper "optional".)
Readings:
IE as Sequential Token Classification: Other Directed Graphical Models
Lecture: (2/9) Comments on the Ratnaparkhi
paper and the Freitag et al paper; presentation from Tal
Blum.
Readings:
Lecture: (2/11) Comments on
the Borthwick et al paper and the Mikheev et al
papers; presentation from Bing Zhao.
Readings:
IE as Sequential Token Classification: "Undirected" Graphical Models
Lecture: (2/18) Comments on
the Lafferty et al paper and the Sha and Pereira
paper; presentation from Luca.
Lecture: (2/23) Comments on
the Klein and Manning et al paper and the Toutanova
paper
Readings:
IE as Sequential Token Classification: Margin-based Methods
Lecture: (2/25) finishing up Klein and Manning
et al and some background on max-margin learning
Lecture: (3/1) Guest lecture from Russ Greiner (U Alberta) on
Web-IC.
Lecture: (3/3) Comments on
the Collins paper and Altun et al paper.
Readings:
Information Integration: Distance Metrics for Text
Lecture: (3/15)
An overview of edit-distance computations and comments on the
Monge-Elkan paper
Lecture: (3/17)
More on edit-distance computations; TFIDF distances for
data integration
Lecture: (3/22) Review of
various distance metrics and comparative experiments with
different metrics
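For readers new to the two families of metrics named in these lectures, here is
a minimal sketch, not taken from the course materials. It shows a unit-cost
(Levenshtein) edit distance and a cosine similarity over TFIDF-weighted tokens.
The function names, the whitespace tokenizer, and the smoothed IDF formula are
all assumptions chosen for illustration; the papers discussed in class use
several different cost models and weighting schemes.

```python
import math
from collections import Counter


def levenshtein(s, t):
    """Unit-cost edit distance between strings s and t (dynamic programming)."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # substitute, or match at cost 0
        prev = curr
    return prev[-1]


def tfidf_cosine(a, b, corpus):
    """Cosine similarity of TFIDF vectors for strings a and b.

    Tokens are lowercased whitespace-separated words. Document frequencies
    come from `corpus`, a list of strings. The IDF is smoothed so that
    tokens absent from the corpus still get a finite, positive weight
    (an assumption of this sketch, one of several common variants).
    """
    docs = [set(d.lower().split()) for d in corpus]
    n = len(docs)

    def vector(text):
        counts = Counter(text.lower().split())
        vec = {}
        for w, c in counts.items():
            df = sum(1 for d in docs if w in d)
            vec[w] = c * (math.log((n + 1) / (df + 1)) + 1.0)
        return vec

    va, vb = vector(a), vector(b)
    dot = sum(weight * vb.get(w, 0.0) for w, weight in va.items())
    norm_a = math.sqrt(sum(x * x for x in va.values()))
    norm_b = math.sqrt(sum(x * x for x in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty string shares nothing with any other string
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical name records, invented for this example.
    corpus = ["alice cohen", "alice smith", "alice jones", "bob lee"]
    print(levenshtein("kitten", "sitting"))
    print(tfidf_cosine("alice cohen", "a. cohen", corpus))
    print(tfidf_cosine("alice cohen", "alice smith", corpus))
```

The two measures behave differently on name matching. Edit distance works
character by character and tolerates typos, while the TFIDF score works token
by token and gives more weight to rare tokens, so sharing a rare surname counts
for more than sharing a first name that appears throughout the corpus.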
Readings:
IE with Large Dictionaries
Lecture: (3/25) Guest lecture from Carlos Guestrin on
Max Margin Markov networks.
Lecture: (3/29) Review of
previous remarks, and comments on Krauthammer et al
paper.
Lecture: (3/31) Comments on
Bunescu et al and Cohen and Sarawagi papers
Lecture: (4/5) Additional comments
on Cohen and Sarawagi paper
Readings:
Information Integration: Learning Distance Metrics
Lecture: (4/7 and 4/12) Learning
Edit Distances with Pair HMMs
Readings:
- Learning
String Edit Distance, Ristad and Yianilos
- Adaptive
Duplicate Detection Using Learnable String Similarity
Measures, Bilenko and Mooney
- Learning
to Match and Cluster Large High-Dimensional Data Sets For
Data Integration, Cohen and Richman
- (Optional) Fellegi and Sunter, "A theory for record
linkage", Journal of the American Statistical Association,
64:1183--1210, 1969. Available on-line as part of the
background
material provided with the Record
Linkage Techniques - 1985 Workshop.
- (Optional) Data
Cleaning Methods, Winkler (invited talk; slides
are also available).
- (Optional) Unsupervised
learning of name structure from coreference data, Charniak.
- (Optional) Preparation
of name and address data for record linkage using hidden
Markov models, Churches et al.
Information Integration: Reasoning with Uncertain Objects and/or Extracting Facts
Readings:
Last modified: Wed Oct 26 12:51:13 Eastern Daylight Time 2011