The following brief opinion piece appears in AI Magazine, Fall 2005, AAAI Press, Menlo Park, CA.





Reading the Web: A Breakthrough Goal for AI

Tom M. Mitchell
Carnegie Mellon University
Fall 2005


I believe AI has an opportunity to achieve a true breakthrough over the coming decade by at last solving the problem of reading natural language text to extract its factual content.  In fact, I hereby offer to bet anyone a lobster dinner that by 2015 we will have a computer program capable of automatically reading at least 80% of the factual content across the entire English-speaking web, and placing those facts in a structured knowledge base.  The significance of this AI achievement would be tremendous: it would immediately increase by many orders of magnitude the volume, breadth, and depth of ground facts and general knowledge accessible to knowledge-based AI programs.  In essence, computers would be harvesting in structured form the huge volume of knowledge that millions of humans are entering daily on the web in the form of unstructured text.
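
To make concrete what "placing those facts in a structured knowledge base" might mean, the following minimal sketch (in Python; the relation names, entries, and lookup function are purely illustrative assumptions, not a description of any existing system) stores each extracted fact as a relation(entity, value) triple that a program can query directly:

    # A toy structured knowledge base: each extracted fact is stored as a
    # (relation, entity, value) triple.  The relations and entries here are
    # purely illustrative.
    knowledge_base = {
        ("WorksFor", "Bill", "Microsoft"),
        ("BornOn", "Elvis Presley", "January 8, 1935"),
    }

    def lookup(kb, relation, entity):
        """Return every value the knowledge base records for relation(entity, ?)."""
        return {value for (rel, ent, value) in kb if rel == relation and ent == entity}

    # A knowledge-based program can then answer factual queries directly:
    print(lookup(knowledge_base, "BornOn", "Elvis Presley"))   # {'January 8, 1935'}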

Why do I believe this breakthrough will occur in the coming decade?  Because of the fortunate confluence of three trends.  First, there has been substantial progress over the past several years in natural language processing for automatically extracting named entities (e.g., person names, locations, dates, products, ...) and facts relating these entities (e.g., WorksFor(Bill, Microsoft)).  Much of this progress has come from new natural language processing approaches, many based on machine learning algorithms, and progress here shows no sign of slowing.  Second, there has been substantial progress in machine learning over the past decade, most significantly on "bootstrap learning" algorithms that learn from a small volume of labeled data together with huge volumes of unlabeled data, so long as the facts expressed in that data exhibit a certain kind of redundancy.  To illustrate, when I type the query "birthday of Elvis Presley" into Google, it returns 437,000 hits.  Scanning them, I estimate the web contains tens of thousands of redundant expressions of the fact that "Elvis Presley was born on January 8, 1935."  Importantly, the different statements of this fact are expressed in widely varying syntactic forms (e.g., "Presley, born January 8, 1935, was the son of ...").  Bootstrap learning algorithms can take great advantage of this kind of redundant, unlabeled data to learn both the facts themselves and how to extract facts from different linguistic forms.  The third important trend is that the data needed for learning to read factual statements is finally available: for the first time in history, every computer has access to a virtually limitless and growing text corpus (e.g., the web), and this corpus happens to contain just the kind of factual redundancy needed by the bootstrap learning approaches mentioned above.  These three trends - progress in natural language analysis, progress in machine learning, and the availability of a sufficiently rich text corpus with tremendous redundancy - together make this the right time for AI researchers to go back to one of the key problems of AI, natural language understanding, and solve it (at least for the factual content of language).
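
To convey the flavor of how bootstrap learning exploits this redundancy, here is a minimal sketch of one bootstrap iteration for a single relation, BornOn(person, date); the seed fact, toy corpus, and string-based patterns are illustrative assumptions rather than a description of any particular published algorithm:

    import re

    # Seed knowledge and a toy corpus; in practice the corpus would be billions
    # of web sentences and the seeds a handful of hand-entered facts.
    seed_facts = {("Elvis Presley", "January 8, 1935")}
    corpus = [
        "Elvis Presley was born on January 8, 1935, in Tupelo, Mississippi.",
        "Johnny Cash was born on February 26, 1932, in Kingsland, Arkansas.",
        "Rosa Parks was born on February 4, 1913, in Tuskegee, Alabama.",
    ]

    # Crude stand-ins for a named-entity tagger and a date recognizer.
    PERSON = r"(?P<person>(?:[A-Z][a-z]+ )+[A-Z][a-z]+)"
    DATE = r"(?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})"

    def learn_patterns(facts, sentences):
        """Each sentence stating a known fact contributes the text linking the
        person to the date (e.g., " was born on ") as an extraction pattern."""
        patterns = set()
        for person, date in facts:
            for s in sentences:
                if person in s and date in s and s.index(person) < s.index(date):
                    patterns.add(s[s.index(person) + len(person):s.index(date)])
        return patterns

    def extract_facts(patterns, sentences):
        """Apply each learned pattern everywhere, proposing a new fact wherever
        a person-like string, the pattern, and a date-like string line up."""
        facts = set()
        for s in sentences:
            for middle in patterns:
                for m in re.finditer(PERSON + re.escape(middle) + DATE, s):
                    facts.add((m.group("person"), m.group("date")))
        return facts

    # One bootstrap iteration: seed facts -> patterns -> new facts, which could
    # in turn seed the next pass over a larger, more redundant corpus.
    patterns = learn_patterns(seed_facts, corpus)
    print(extract_facts(patterns, corpus))

On a corpus as redundant as the web, each newly acquired fact exposes still more phrasings of the same relation (e.g., "Presley, born January 8, 1935, ..."), which is exactly the self-reinforcing loop that makes the bootstrap approach attractive.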

Some have mentioned to me that this is a large goal.  I agree, and propose we approach it by forming a shared web repository where facts extracted from the web by different researchers' efforts are accumulated and made accessible to all.  This open-source shared repository should also accumulate and share learned rules that extract content from different linguistic forms.  Working as a research community in this fashion seems the best way to achieve this ambitious goal.  And I'd hate to have to leave the field to open a lobster fishery.