Models for Natural Language Learning using Unlabeled Data



Here is a lightly-annotated bibliography of papers on learning from labeled and unlabeled data. It focuses on methods especially relevant to bootstrap learning for natural language analysis, and on theoretical models for how and when we should expect unlabeled data to be helpful.

Please edit this file ( /afs/cs/project/theo-21/www/semisupervised.html) to add more citations.

Natural language bootstrap learning:

Yarowsky wrote an early paper describing how to learn to disambiguate word senses.  It relies on the assumption that every occurrence of a word (e.g., "bank") within a single document has the same meaning (e.g., river bank or financial bank).  Abney's paper is a more recent formal analysis of why Yarowsky's algorithm works.
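The heart of Yarowsky's algorithm is a self-training loop: train on a small seed set, label the unlabeled examples, and fold the most confident predictions back into the training set.  Here is a minimal sketch of that loop; the base classifier, confidence threshold, and stopping rule are illustrative assumptions, and the sketch omits Yarowsky's one-sense-per-document relabeling step.

```python
# Self-training loop in the style of Yarowsky's algorithm.
# The classifier, threshold, and round limit are illustrative
# assumptions, not the paper's actual choices.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_seed, y_seed, X_pool, threshold=0.95, max_rounds=10):
    """Grow the labeled set by repeatedly adding confident predictions."""
    X_lab, y_lab = X_seed, y_seed
    clf = LogisticRegression()
    for _ in range(max_rounds):
        clf.fit(X_lab, y_lab)
        if len(X_pool) == 0:
            break
        probs = clf.predict_proba(X_pool)
        confident = probs.max(axis=1) >= threshold   # predictions we trust
        if not confident.any():
            break                                    # nothing new to add
        new_y = clf.classes_[probs[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_pool[confident]])
        y_lab = np.concatenate([y_lab, new_y])
        X_pool = X_pool[~confident]                  # shrink the unlabeled pool
    return clf
```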
Co-training uses unlabeled data together with labeled data to learn f(x)=y when x can be expressed as a pair of features x=<x1,x2> such that x1 and x2 are each individually sufficient to predict y.  This has been used to train web page classifiers, named entity recognizers, image classifiers, and more.  The idea was introduced in 1998 by Blum & Mitchell and has been applied and extended in several directions.
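A minimal sketch of the co-training loop follows, assuming the two views arrive as separate feature matrices X1 and X2 for the same examples.  The base classifier, threshold, and round count are illustrative assumptions; Blum & Mitchell instead grew the pool with a fixed number of positive and negative examples per round.

```python
# Co-training sketch: two classifiers, one per view, each labeling
# confident unlabeled examples for the joint training set.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X1_lab, X2_lab, y_lab, X1_pool, X2_pool,
             threshold=0.9, rounds=10):
    c1, c2 = GaussianNB(), GaussianNB()
    for _ in range(rounds):
        c1.fit(X1_lab, y_lab)
        c2.fit(X2_lab, y_lab)
        if len(X1_pool) == 0:
            break
        p1, p2 = c1.predict_proba(X1_pool), c2.predict_proba(X2_pool)
        # An example is adopted if either view is confident about it;
        # the more confident view supplies the label.
        conf1, conf2 = p1.max(axis=1), p2.max(axis=1)
        confident = (conf1 >= threshold) | (conf2 >= threshold)
        if not confident.any():
            break
        winner = np.where(conf1 >= conf2, p1.argmax(axis=1), p2.argmax(axis=1))
        new_y = c1.classes_[winner[confident]]
        X1_lab = np.vstack([X1_lab, X1_pool[confident]])
        X2_lab = np.vstack([X2_lab, X2_pool[confident]])
        y_lab = np.concatenate([y_lab, new_y])
        X1_pool, X2_pool = X1_pool[~confident], X2_pool[~confident]
    return c1, c2
```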
Another line of bootstrapping algorithms was started by Sergey Brin's paper on using web search as a subroutine for bootstrap learning, with the web itself serving as the training corpus.
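Brin's system alternates between two steps: find occurrences of known pairs in the corpus and record the surrounding text as extraction patterns, then apply those patterns to harvest new pairs.  A toy sketch of that alternation is below; the real system induced richer patterns (including URL prefixes and text before and after each match) over crawled web pages, and scored patterns to suppress noise.

```python
# DIPRE-style pattern bootstrapping over an in-memory corpus.
# The pattern representation (just the text between the two
# arguments) is a deliberate oversimplification of Brin's patterns.
import re

def bootstrap_pairs(corpus, seed_pairs, rounds=3):
    pairs, patterns = set(seed_pairs), set()
    for _ in range(rounds):
        # Step 1: turn occurrences of known pairs into patterns.
        for sent in corpus:
            for a, b in pairs:
                m = re.search(re.escape(a) + r"(.{1,30}?)" + re.escape(b), sent)
                if m:
                    patterns.add(m.group(1))
        # Step 2: apply patterns to the corpus to extract new pairs.
        for sent in corpus:
            for pat in patterns:
                for m in re.finditer(r"(\w[\w ]*?)" + re.escape(pat) + r"(\w[\w ]*)", sent):
                    pairs.add((m.group(1).strip(), m.group(2).strip()))
    return pairs

print(bootstrap_pairs(
    ["Herman Melville wrote Moby Dick in 1851.",
     "Jane Austen wrote Pride and Prejudice."],
    {("Herman Melville", "Moby Dick")}))
```

Even this toy version shows how spurious pairs creep in as patterns are reapplied, which is why pattern scoring (and the accuracy estimation discussed at the end of this page) matters in practice.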
Etzioni's group has pushed very large-scale extraction from the web, based on bootstrap learning of named entity extractors and relation extractors.

Theoretical models for bootstrap learning:

The above papers contain a number of theoretical models, especially the Blum & Mitchell 1998 paper, and the Abney paper. Following are more recent theoretical models for how and when unlabeled data can improve learning.

These papers provide PAC-style bounds on co-training and related learning settings that go beyond those provided in the original co-training paper.
This paper extends the co-training theory model to capture the iterative expansion of the domain of the learned function.
This paper considers multiple function approximators instead of multiple views on the data, leading to a Boosting-style approach.

Whereas the above papers focus on PAC bounds, the following paper has a very different focus.  It presents a statistical model for estimating accuracy for bootstrap learning of named entity and relation extractors, under the assumption that correct entities and relations will be repeatedly extracted from a large corpus, and that correct extractions will recur more frequently than incorrect ones.  This is used by Etzioni's system described above.
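To illustrate the redundancy intuition behind such models: if correct extractions recur at a higher rate than errors, the repetition count of a candidate carries evidence about its correctness.  The sketch below computes a posterior under a simple two-rate binomial model; the rates and prior are made-up parameters, and the actual model in the paper is considerably richer.

```python
# Toy redundancy model: score a candidate extraction by the posterior
# probability that it is correct, given it was extracted k times in n
# opportunities. p_true, p_false, and prior are illustrative numbers.
from math import comb

def p_correct(k, n, p_true=0.01, p_false=0.001, prior=0.5):
    like_true = comb(n, k) * p_true**k * (1 - p_true)**(n - k)
    like_false = comb(n, k) * p_false**k * (1 - p_false)**(n - k)
    return prior * like_true / (prior * like_true + (1 - prior) * like_false)

print(p_correct(5, 1000))   # repeatedly extracted: roughly 0.92
print(p_correct(1, 1000))   # extracted once: roughly 0.001
```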