The following brief opinion piece appears in AI Magazine, Fall 2005,
AAAI Press, Menlo Park, CA.

Reading the Web: A Breakthrough Goal for AI

Tom M. Mitchell
Carnegie Mellon University
Fall 2005
I believe AI has an opportunity to achieve a true breakthrough over the
coming decade by at last solving the problem of reading natural
language text to extract its factual content. In fact, I
hereby offer to bet anyone a lobster dinner that by 2015 we will have a
computer program capable of automatically reading at least 80% of the
factual content across the entire English-speaking web, and placing
those facts in a structured knowledge base. The significance
of this AI achievement would be tremendous: it would immediately
increase by many orders of magnitude the volume, breadth, and depth of
ground facts and general knowledge accessible to knowledge-based AI
programs. In essence, computers would be harvesting in
structured form the huge volume of knowledge that millions of humans
are entering daily on the web in the form of unstructured text.

Why do I believe this breakthrough will occur in the coming
decade? Because of the fortunate confluence of three
trends. First, there has been substantial progress over the
past several years in natural language processing for automatically
extracting named entities (e.g., person names, locations, dates,
products, ...) and facts relating these entities (e.g.,
WorksFor(Bill, Microsoft)). Much of this progress has come
from new natural language processing approaches, many based on machine
learning algorithms, and progress here shows no sign of
slowing. Second, there has been substantial progress in
machine learning over the past decade, most significantly on "bootstrap
learning" algorithms that learn from a small volume of labeled data,
and huge volumes of unlabeled data, so long as there is a certain kind
of redundancy in the facts expressed in this data. To
illustrate, when I type the query "birthday of Elvis Presley" into
Google, it returns 437,000 hits. Scanning them, I estimate
the web has tens of thousands of redundant expressions of the fact that
"Elvis Presley was born on January 8, 1935." Importantly, the
different statements of this fact are expressed in widely varying
syntactic forms (e.g., "Presley, born January 8, 1935, was the son of
..."). Bootstrap learning algorithms can take great advantage
of this kind of redundant, unlabelled data, to learn both the facts and
how to extract facts from different linguistic forms. The
third important trend is that the data needed for learning to read
factual statements is finally available: for the first time in history
every computer has access to a virtually limitless and growing text
corpus (e.g., the web), and this corpus happens to contain just the
kind of factual redundancies needed by the bootstrap learning
approaches mentioned above. These three trends (progress in
natural language analysis, progress in machine learning, and the availability
of a sufficiently rich text corpus with tremendous redundancy) together
make this the right time for AI researchers to go back to one of the
key problems of AI, natural language understanding, and solve it (at
least for the factual content of language).
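
To make the bootstrapping idea concrete, here is a toy sketch in Python of
one simple flavor of such an algorithm: a single trusted seed fact is used to
induce textual extraction patterns, those patterns harvest new facts from
redundant text, and the new facts in turn induce new patterns. The entity
recognizers, seed fact, and miniature corpus below are made-up illustrations,
not the actual algorithms or data of the systems discussed above.

    import re

    # Hypothetical entity recognizers: crude regexes standing in for real
    # named-entity extractors for person names and dates.
    PERSON = r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
    DATE = r"(?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})"

    # A tiny labeled seed: one (person, birth date) fact we trust.
    seed_facts = {("Elvis Presley", "January 8, 1935")}

    # A toy stand-in for the web's redundant, unlabeled text.
    corpus = [
        "Elvis Presley was born on January 8, 1935, in Tupelo, Mississippi.",
        "Isaac Newton was born on January 4, 1643, by the Gregorian calendar.",
        "Isaac Newton (born January 4, 1643) formulated the laws of motion.",
        "Marie Curie (born November 7, 1867) won two Nobel Prizes.",
    ]

    def learn_patterns(facts, sentences):
        """Generalize each sentence mentioning a known fact into a regex,
        keeping the text between the two entity mentions as the context."""
        patterns = set()
        for person, date in facts:
            for s in sentences:
                i, j = s.find(person), s.find(date)
                if i != -1 and j != -1 and i < j:   # person-then-date phrasing only
                    middle = s[i + len(person):j]   # e.g. " was born on "
                    patterns.add(PERSON + re.escape(middle) + DATE)
        return patterns

    def extract_facts(patterns, sentences):
        """Apply the learned patterns to harvest new (person, birth date) facts."""
        found = set()
        for pat in patterns:
            for s in sentences:
                m = re.search(pat, s)
                if m:
                    found.add((m.group("person"), m.group("date")))
        return found

    # The bootstrap loop: facts -> patterns -> more facts -> more patterns ...
    facts = set(seed_facts)
    for _ in range(2):
        facts |= extract_facts(learn_patterns(facts, corpus), corpus)

    print(facts)   # the seed fact plus the Newton and Curie facts

Real bootstrap learners in this spirit (DIPRE- and Snowball-style systems, for
example) add confidence scores and pattern filtering so that occasional bad
extractions do not compound across iterations; the redundancy emphasized above
is what makes such cross-checking work.
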
Some have mentioned to me that this is a large goal. I agree,
and propose we approach it by forming a shared web repository where
facts that are extracted from the web by different researchers' efforts
are accumulated and made accessible to all. This
open-source shared repository should also accumulate and share learned
rules that extract content from different linguistic
forms. Working as a research community in this
fashion seems the best way to achieve this ambitious goal.
And I'd hate to have to leave the field to open a lobster fishery.