Spring 2006

In class on Jan 26, student teams will present 5-minute proposals describing some NLP learning task for which they will implement a self-supervised learning algorithm.
Project purpose: The goal of these projects is for us to explore a variety of NLP learning tasks where redundancy can be used to support self-supervised, "bootstrap" learning. Therefore, it is best if teams explore different approaches, taking advantage of different types of redundancy.
Below are some possible learning tasks to consider. PLEASE NOTE THIS IS ONLY TO SPUR YOUR OWN THINKING - FEEL FREE TO SUGGEST A DIFFERENT PROJECT THAT YOU FIND INTERESTING. Be creative!
Some possible tasks:
- Named entity recognition:
learn to recognize which strings represent an entity of type "person", "university", "department", "publication", "event", etc. There are several kinds of redundancy available here to support self-supervised learning. For example, given the sentence "Satya graduated from CMU with a CS degree." we might infer that CMU is a university in several redundant ways, such as: (a) CMU is on a list of universities we have previously learned; (b) CMU is a potential abbreviation for "Carnegie Mellon University", which is on the list of known universities; (c) "graduated from X" suggests X is a university; (d) there is a website of the form "www.XXX.edu" where XXX=CMU; or (e) there is an HTML list somewhere on the web which lists several known universities, and CMU is also an item in this same list (i.e., infer this from a rule about HTML lists such as "if item X is on the same list as item Y, and Y is of type T, then believe X is also of type T"). Of course none of these methods is perfect, so some kind of probabilistic or confidence score will need to be associated with each inference.
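The evidence-combining step described above can be sketched as follows. The cue functions, their reliability weights, and the noisy-or combination are illustrative assumptions, not part of the course description:

```python
# Sketch: combining redundant, imperfect cues for typing a candidate
# entity. All cue reliabilities here are made-up numbers.

KNOWN_UNIVERSITIES = {"Carnegie Mellon University", "Stanford University"}
ABBREVIATIONS = {"CMU": "Carnegie Mellon University"}

def cue_known_list(candidate):
    """Cue (a): candidate is already on our learned list."""
    return 0.9 if candidate in KNOWN_UNIVERSITIES else 0.0

def cue_abbreviation(candidate):
    """Cue (b): candidate abbreviates a known university name."""
    return 0.8 if ABBREVIATIONS.get(candidate) in KNOWN_UNIVERSITIES else 0.0

def cue_pattern(sentence, candidate):
    """Cue (c): the pattern 'graduated from X' suggests X is a university."""
    return 0.6 if f"graduated from {candidate}" in sentence else 0.0

def believe_university(candidate, sentence):
    """Noisy-or: probability that at least one firing cue is correct."""
    disbelief = 1.0
    for p in (cue_known_list(candidate),
              cue_abbreviation(candidate),
              cue_pattern(sentence, candidate)):
        disbelief *= (1.0 - p)
    return 1.0 - disbelief

conf = believe_university("CMU", "Satya graduated from CMU with a CS degree.")
```

Any scoring scheme would do here; the point is that each inference carries a confidence, and redundant cues raise it.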
- Relation extraction:
learn to pick out relations such as Attends(<person>,<university>) or CollaboratesWith(<person1>,<person2>). On the web there may often be multiple sources of evidence. For example, to infer that Attends("Mary Smith","CMU") we might find evidence such as (a) a home page associated with Mary Smith exists on the CMU website; (b) the CMU LTI website contains a list of students, including Mary's name; (c) a sentence on the web matches the pattern "<person> joined the PhD program at <university> in " and has "Mary Smith"=<person>; or (d) a CMU CS technical report citation lists Mary Smith as the author. One interesting aspect of extracting such biographical information is that many home pages use words like "I" instead of "Mary Smith" (i.e., most people do not speak about themselves in the third person).
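A minimal sketch of the pattern-matching evidence source (c); the regular expressions and the tuple representation of a belief are hypothetical choices, not part of the assignment:

```python
import re

# Sketch: pattern-based extraction of Attends(<person>,<university>)
# facts. The patterns below are toy examples.

ATTENDS_PATTERNS = [
    re.compile(r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+)"
               r" joined the PhD program at (?P<univ>[A-Z]\w+)"),
    re.compile(r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+)"
               r" is a student at (?P<univ>[A-Z]\w+)"),
]

def extract_attends(text):
    """Return Attends(person, university) tuples found by any pattern."""
    found = set()
    for pat in ATTENDS_PATTERNS:
        for m in pat.finditer(text):
            found.add(("Attends", m.group("person"), m.group("univ")))
    return found

facts = extract_attends("Mary Smith joined the PhD program at CMU in 2005.")
```

In a real bootstrap loop, facts extracted by one pattern would be used to induce new patterns from other sentences mentioning the same pair.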
- Resolving pronoun references:
learn to determine the correct referent (person) for words such as "I, my, she, he, it, her, ...". Pronoun resolution can be difficult in the general case, but it is probably easier on home pages. For example, Satya's home page (http://www.cs.cmu.edu/afs/cs/user/satya/Web/home.html) contains sentences involving first-person words like "I", "my", and "I'm". Within a web page it is reasonable to infer that these three words refer to the owner of the web page, so perhaps we should seed this learner with a rule stating "If the token " I " occurs on the homepage of person P, then assume this token represents person P." Furthermore, Satya's page points to a number of other pages that contain his full name and other synonyms for his name. One of them is his "bio" page, which is mostly written in the third person and which redundantly states some of the same assertions found on his home page. Many of these first-person assertions (e.g., "I'm a CMU faculty member.") represent beliefs that can also be inferred from other places on the web.
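The seed rule above might be sketched like this; the token list and the page-owner lookup are assumed inputs that a crawler would supply:

```python
# Sketch: seed rule "a first-person token on person P's home page
# refers to P". The FIRST_PERSON set is an illustrative assumption.

FIRST_PERSON = {"I", "my", "I'm", "me", "mine"}

def resolve_first_person(tokens, page_owner):
    """Map each first-person token on the page to the page's owner."""
    return [(tok, page_owner) for tok in tokens if tok in FIRST_PERSON]

refs = resolve_first_person(["I'm", "a", "CMU", "faculty", "member"], "Satya")
```

Resolutions made by this seed rule could then provide labeled examples for learning to resolve harder cases, such as third-person pronouns on the bio page.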
- Web page classification:
learn to classify university web pages into categories such as the home page affiliated with a 'person', 'department', 'project', 'course', 'publication', 'hobby', etc. This will be important for our larger effort, because many entity extraction or relation extraction problems become much easier if we know the type of web page the sentence or sentences occur in (e.g., the token " I " on a person's homepage, as opposed to a class web page). In class we discussed two redundant sources for web page classification, but there are many more. We could classify http://www.cs.cmu.edu/~wcohen/ as a person's home page based on (a) the bag of words found on the page, (b) the text on hyperlinks pointing into the page, (c) the text on hyperlinks pointing OUT of the page and the types of pages it points to, (d) the fact that the HTML headings on the page include the heading "Biography", (e) the page contains numerous sentences written in the first person, (f) this page is pointed to by the CALD Faculty list webpage, where it is in a long list of personal home pages, (g) this page begins with the heading "William W. Cohen" and "William" is on a list of common first names, (h) ....
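One simple way to combine several of these redundant sources is a weighted vote. The evidence functions, their weights, and the dict-based page representation below are all illustrative assumptions:

```python
# Sketch: scoring the hypothesis "this is a person's home page" by a
# weighted vote over redundant evidence sources (d), (e), and (g) above.
# Weights are made-up numbers.

COMMON_FIRST_NAMES = {"William", "Mary", "Satya"}

def evidence_biography_heading(page):
    """Evidence (d): an HTML heading reads 'Biography'."""
    return "Biography" in page.get("headings", [])

def evidence_first_person(page):
    """Evidence (e): several first-person sentences on the page."""
    text = page.get("text", "")
    return text.count(" I ") + text.count("I'm") >= 3

def evidence_name_heading(page):
    """Evidence (g): a heading starts with a common first name."""
    return any(h.split()[0] in COMMON_FIRST_NAMES
               for h in page.get("headings", []) if h)

def score_person_homepage(page):
    checks = [(evidence_biography_heading, 0.4),
              (evidence_first_person, 0.3),
              (evidence_name_heading, 0.3)]
    return sum(w for fn, w in checks if fn(page))

page = {"headings": ["William W. Cohen", "Biography"],
        "text": "I'm a CMU faculty member. I work on ML. I teach. I advise."}
score = score_person_homepage(page)
```

A naive Bayes or logistic model over the same features would be a natural next step once labeled pages accumulate.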
What type of information should you try to extract?
For this project, you can do almost anything -- make this choice based on what you think will offer plenty of redundancy to support self-supervised learning. However, if possible try to focus on extracting beliefs about universities, their people, departments, activities, publications, conferences, research, etc. After these first projects we are likely to focus on CS and Biology departments (the former because it's fun and we understand them, the latter because we might be able to take advantage of other ongoing research at CMU, ontologies such as MeSH, and other resources such as Medline publications).
What should be in your proposal presentation?
Describe very specifically the type of knowledge you want your system to learn and the types of beliefs to be learned. (e.g., We intend to learn extraction rules of the following two types:
- text extraction rules such as: "If string = '<event> will be held on <date>', Then believe 'DateOf(<event>,<date>)'"
- table extraction rules such as: "If table has column X = event name, and column Y = location, Then for each row extract LocOf(<event>,<location>)"
to extract the class of beliefs
- TimeOf(<event>,<time>), and
- DateOf(<event>,<date>), and
- LocationOf(<event>,<location>))
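A table extraction rule of the second kind might be sketched as follows, assuming a column-typing step has already labeled the columns (the row data and room names are made up):

```python
# Sketch: "If table has a column typed 'event' and a column typed
# 'location', Then for each row extract LocOf(event, location)".

def extract_from_table(rows, col_types):
    """Apply the rule above to one table, given its typed columns."""
    try:
        ev = col_types.index("event")
        loc = col_types.index("location")
    except ValueError:
        return []  # rule does not fire: a required column type is absent
    return [("LocOf", row[ev], row[loc]) for row in rows]

beliefs = extract_from_table(
    [["ML Lunch", "NSH 1507"], ["LTI Seminar", "WeH 5409"]],
    ["event", "location"])
```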
List the various redundant sources of information/inference that you intend to rely on. (e.g., We will use
- extraction from tables based on typing the columns
- extraction from multiple text phrases found on multiple web pages
- active search of the web for seminars, which often have HTML-formatted home pages)
Try to prepare by 'hand simulating' your algorithm on some simple examples, and try to summarize your experience (perhaps you can show a simple sequence of web pages visited, and the rules/beliefs created during your hand simulation).
Presentation logistics:
Send tom.mitchell@cs.cmu.edu your presentation (powerpoint, pdf, or ps) by noon on Thursday, Jan 26. He will assemble all presentations on his laptop in advance to save time. Appoint a single speaker for your group, and have her/him practice the presentation in front of the group. Be sure you can present your ideas in 5 minutes - avoid spending a lot of time on motivation - get directly to the specifics. Keep in mind the point of your presentation is to stimulate the class into providing their own ideas to help your team.

This page is located in the file
/afs/cs/project/theo-21/www/project1ideas.html.
It is writable by any member of the course.
Tom Mitchell, January 21, 2006.