Spring 2006

In class on Jan 26, student teams will present 5-minute proposals describing some NLP learning task for which they will implement a self-supervised learning algorithm.
Project purpose: The goal of these projects is for us to explore a variety of NLP learning tasks where redundancy can be used to support self-supervised, "bootstrap" learning. Therefore, it is best if teams explore different approaches, taking advantage of different types of redundancy.
Below are some possible learning tasks to consider. PLEASE NOTE THIS IS ONLY TO SPUR YOUR OWN THINKING - FEEL FREE TO SUGGEST A DIFFERENT PROJECT THAT YOU FIND INTERESTING. Be creative!
Some possible tasks:
- Named entity recognition:
learn to recognize which strings represent an entity of type "person", "university", "department", "publication", "event", etc. There are several kinds of redundancy available here to support self-supervised learning. For example, given the sentence "Satya graduated from CMU with a CS degree." we might infer that CMU is a university in several redundant ways, such as: (a) CMU is on a list of universities we have previously learned; (b) CMU is a potential abbreviation for "Carnegie Mellon University", which is on the list of known universities; (c) "graduated from X" suggests X is a university; (d) there is a website of the form "www.XXX.edu" where XXX=CMU; or (e) there is an HTML list somewhere on the web which lists several known universities, and CMU is also an item in this same list (i.e., infer this from a rule about HTML lists such as "if item X is on the same list as item Y, and Y is of type T, then believe X is also of type T"). Of course none of these methods is perfect, so some kind of probabilistic or confidence score will need to be associated with each inference.
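The evidence-combining step described above can be sketched as follows. The cue functions, their reliability weights, and the noisy-or combination are illustrative assumptions, not part of the course description:

```python
# Sketch: combining redundant, imperfect cues for typing a candidate
# entity. All cue reliabilities here are made-up numbers.

KNOWN_UNIVERSITIES = {"Carnegie Mellon University", "Stanford University"}
ABBREVIATIONS = {"CMU": "Carnegie Mellon University"}

def cue_known_list(candidate):
    """Cue (a): candidate is already on our learned list."""
    return 0.9 if candidate in KNOWN_UNIVERSITIES else 0.0

def cue_abbreviation(candidate):
    """Cue (b): candidate abbreviates a known university name."""
    return 0.8 if ABBREVIATIONS.get(candidate) in KNOWN_UNIVERSITIES else 0.0

def cue_pattern(sentence, candidate):
    """Cue (c): the pattern 'graduated from X' suggests X is a university."""
    return 0.6 if f"graduated from {candidate}" in sentence else 0.0

def believe_university(candidate, sentence):
    """Noisy-or: probability that at least one firing cue is correct."""
    disbelief = 1.0
    for p in (cue_known_list(candidate),
              cue_abbreviation(candidate),
              cue_pattern(sentence, candidate)):
        disbelief *= (1.0 - p)
    return 1.0 - disbelief

conf = believe_university("CMU", "Satya graduated from CMU with a CS degree.")
```

Any scoring scheme would do here; the point is that each inference carries a confidence, and redundant cues raise it.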
- Relation extraction:
learn to pick out relations such as Attends(<person>,<university>) or CollaboratesWith(<person1>,<person2>). On the web there may often be multiple sources of evidence. For example, to infer that Attends("Mary Smith","CMU") we might find evidence such as (a) a home page associated with Mary Smith exists on the CMU website; (b) the CMU LTI website contains a list of students, including Mary's name; (c) a sentence on the web matches the pattern "<person> joined the PhD program at <university> in " and has "Mary Smith"=<person>; or (d) a CMU CS technical report citation lists Mary Smith as the author. One interesting aspect of extracting such biographical information is that many home pages use words like "I" instead of "Mary Smith" (i.e., most people do not speak about themselves in the third person).
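A minimal sketch of the pattern-matching evidence source (c); the regular expressions and the tuple representation of a belief are hypothetical choices, not part of the assignment:

```python
import re

# Sketch: pattern-based extraction of Attends(<person>,<university>)
# facts. The patterns below are toy examples.

ATTENDS_PATTERNS = [
    re.compile(r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+)"
               r" joined the PhD program at (?P<univ>[A-Z]\w+)"),
    re.compile(r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+)"
               r" is a student at (?P<univ>[A-Z]\w+)"),
]

def extract_attends(text):
    """Return Attends(person, university) tuples found by any pattern."""
    found = set()
    for pat in ATTENDS_PATTERNS:
        for m in pat.finditer(text):
            found.add(("Attends", m.group("person"), m.group("univ")))
    return found

facts = extract_attends("Mary Smith joined the PhD program at CMU in 2005.")
```

In a real bootstrap loop, facts extracted by one pattern would be used to induce new patterns from other sentences mentioning the same pair.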
- Resolving pronoun references:
learn to determine the correct referent (person) for words such as "I, my, she, he, it, her, ...". Pronoun resolution can be difficult in the general case, but it is probably easier on home pages. For example, Satya's home page (http://www.cs.cmu.edu/afs/cs/user/satya/Web/home.html) contains sentences involving first-person words like "I", "my", and "I'm". Within a web page it is reasonable to infer that these three words refer to the owner of the web page, so perhaps we should seed this learner with a rule stating "If the token " I " occurs on the homepage of person P, then assume this token represents person P." Furthermore, Satya's page points to a number of other pages that contain his full name and other synonyms for his name. One of them is his "bio" page, which is mostly written in the third person and which redundantly states some of the same assertions found on his home page. Many of these first-person assertions (e.g., "I'm a CMU faculty member.") represent beliefs that can also be inferred from other places on the web.
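The seed rule above might be sketched like this; the token list and the page-owner lookup are assumed inputs that a crawler would supply:

```python
# Sketch: seed rule "a first-person token on person P's home page
# refers to P". The FIRST_PERSON set is an illustrative assumption.

FIRST_PERSON = {"I", "my", "I'm", "me", "mine"}

def resolve_first_person(tokens, page_owner):
    """Map each first-person token on the page to the page's owner."""
    return [(tok, page_owner) for tok in tokens if tok in FIRST_PERSON]

refs = resolve_first_person(["I'm", "a", "CMU", "faculty", "member"], "Satya")
```

Resolutions made by this seed rule could then provide labeled examples for learning to resolve harder cases, such as third-person pronouns on the bio page.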
- Web page classification:
learn to classify university web pages into categories such as the home page affiliated with a 'person', 'department', 'project', 'course', 'publication', 'hobby', etc. This will be important for our larger effort, because many entity extraction or relation extraction problems become much easier if we know the type of web page the sentence or sentences occur in (e.g., the token " I " on a person's homepage, as opposed to a class web page). In class we discussed two redundant sources for web page classification, but there are many more. We could classify http://www.cs.cmu.edu/~wcohen/ as a person's home page based on (a) the bag of words found on the page, (b) the text on hyperlinks pointing into the page, (c) the text on hyperlinks pointing OUT of the page and the types of pages it points to, (d) the fact that the HTML headings on the page include the heading "Biography", (e) the page contains numerous sentences written in the first person, (f) this page is pointed to by the CALD Faculty list webpage, where it is in a long list of personal home pages, (g) this page begins with the heading "William W. Cohen" and "William" is on a list of common first names, (h) ....
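One simple way to combine several of these redundant sources is a weighted vote. The evidence functions, their weights, and the dict-based page representation below are all illustrative assumptions:

```python
# Sketch: scoring the hypothesis "this is a person's home page" by a
# weighted vote over redundant evidence sources (d), (e), and (g) above.
# Weights are made-up numbers.

COMMON_FIRST_NAMES = {"William", "Mary", "Satya"}

def evidence_biography_heading(page):
    """Evidence (d): an HTML heading reads 'Biography'."""
    return "Biography" in page.get("headings", [])

def evidence_first_person(page):
    """Evidence (e): several first-person sentences on the page."""
    text = page.get("text", "")
    return text.count(" I ") + text.count("I'm") >= 3

def evidence_name_heading(page):
    """Evidence (g): a heading starts with a common first name."""
    return any(h.split()[0] in COMMON_FIRST_NAMES
               for h in page.get("headings", []) if h)

def score_person_homepage(page):
    checks = [(evidence_biography_heading, 0.4),
              (evidence_first_person, 0.3),
              (evidence_name_heading, 0.3)]
    return sum(w for fn, w in checks if fn(page))

page = {"headings": ["William W. Cohen", "Biography"],
        "text": "I'm a CMU faculty member. I work on ML. I teach. I advise."}
score = score_person_homepage(page)
```

A naive Bayes or logistic model over the same features would be a natural next step once labeled pages accumulate.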
What type of information should you try to extract?
For this project, you can do almost anything -- make this choice based on what you think will offer plenty of redundancy to support self-supervised learning. However, if possible try to focus on extracting beliefs about universities, their people, departments, activities, publications, conferences, research, etc. After these first projects we are likely to focus on CS and Biology departments (the former because it's fun and we understand them, the latter because we might be able to take advantage of other ongoing research at CMU, ontologies such as MeSH, and other resources such as Medline publications).
What should be in your proposal presentation?
Describe very specifically the type of knowledge you want your system to learn and the types of beliefs to be learned. (e.g., We intend to learn extraction rules of the following two types:
- text extraction rules such as: "If string = '<event> will be held on <date>', Then believe 'DateOf(<event>,<date>)'"
- table extraction rules such as: "If table has column X = event name, and column Y = location, Then for each row extract LocOf(<event>,<location>)"
to extract the class of beliefs
- TimeOf(<event>,<time>), and
- DateOf(<event>,<date>), and
- LocationOf(<event>,<location>))
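A table extraction rule of the second kind might be sketched as follows, assuming a column-typing step has already labeled the columns (the row data and room names are made up):

```python
# Sketch: "If table has a column typed 'event' and a column typed
# 'location', Then for each row extract LocOf(event, location)".

def extract_from_table(rows, col_types):
    """Apply the rule above to one table, given its typed columns."""
    try:
        ev = col_types.index("event")
        loc = col_types.index("location")
    except ValueError:
        return []  # rule does not fire: a required column type is absent
    return [("LocOf", row[ev], row[loc]) for row in rows]

beliefs = extract_from_table(
    [["ML Lunch", "NSH 1507"], ["LTI Seminar", "WeH 5409"]],
    ["event", "location"])
```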
List the various redundant sources of information/inference that you intend to rely on. (e.g., We will use
- extraction from tables based on typing the columns
- extraction from multiple text phrases found on multiple web pages
- active search of the web for seminars, which often have HTML-formatted home pages)
Try to prepare by 'hand simulating' your algorithm on some simple examples, and try to summarize your experience (perhaps you can show a simple sequence of web pages visited, and the rules/beliefs created during your hand simulation).
Presentation logistics:
Send tom.mitchell@cs.cmu.edu your presentation (powerpoint, pdf, or ps) by noon on Thursday, Jan 26. He will assemble all presentations on his laptop in advance to save time. Appoint a single speaker for your group, and have her/him practice the presentation in front of the group. Be sure you can present your ideas in 5 minutes - avoid spending a lot of time on motivation - get directly to the specifics. Keep in mind the point of your presentation is to stimulate the class into providing their own ideas to help your team.

This page is located in the file
/afs/cs/project/theo-21/www/project1ideas.html.
It is writable by any member of the course.
Tom Mitchell, January 21, 2006.