NIPS 2001 by Drew Bagnell (CMU RI), David Blei (Berkeley CS), Andrew McCallum (WhizBang Labs)
Presented by Drew Bagnell
Abstract
A fascinating aspect of classification and extraction from the Web is the richness of the information carried by its formatting, layout, directory structure, and links. Within a particular web site these structural regularities are often powerfully indicative features for common classification and extraction tasks. For example, if one wishes to extract the titles of all the books on Amazon.com, one can rely on the fact that the book title appears in the same location on each book's home page and in the same font. The difficulty with using this information is that each web site has its own, different structural regularities, and thus one cannot successfully apply models tuned for one site to extraction from another. In response, many researchers have built tools to facilitate hand-tuning of site-specific extractors.
Most statistical models assume that the modeled data are independent and identically distributed. However, as in the above example, it is often the case that certain proper subsets of the data share identifiable regularities that do not occur throughout the data as a whole. Other examples of subsets that may have local regularity include patients from a particular hospital, voice sounds from a particular speaker, or vibration data from a particular airplane. In other words, patterns may differ in scope: some hold globally, while others hold only within a particular locale. Our thesis is that leveraging these local regularities can significantly improve the performance of a learner because local features are often both simple and highly indicative. The central difficulty is that in practical problems our trained algorithm will be applied to novel locales not encountered in our labeled dataset. The only knowledge directly applicable to the new data comes from the traditionally used global regularities, i.e., those features that are independent and identically distributed across locales. We will discuss a generative probabilistic model for features with varying scope, along with appropriate algorithms to leverage those features. Finally, we will demonstrate the effectiveness of the approach by showing dramatically improved performance on an information extraction problem.
Charles Rosenberg Last modified: Tue Mar 12 18:00:47 EST 2002