CMU World Wide Knowledge Base (Web->KB) project

Goal:
To develop a probabilistic, symbolic knowledge base that mirrors the content of the World Wide Web. If successful, this will make text information on the Web available in computer-understandable form, enabling much more sophisticated information retrieval and problem solving.

Approach:

We are developing a system that can be trained to extract symbolic knowledge from hypertext, using a variety of machine learning methods.

Datasets:

The first experiments involved extracting knowledge about computer science departments. We have assembled the following data sets for this task:

A data set consisting of classified Web pages.
A relational data set describing both pages and hyperlinks.
A subset of the 4 Universities dataset containing web pages and hyperlink data. Used for the Co-training experiments in COLT 98 by Blum & Mitchell.

Other Datasets used by the WebKB Group

Related research on machine learning and text:

See the other research on text learning by our research group.

Publications:

Overview of the Project:

Learning to Extract Symbolic Knowledge from the World Wide Web.
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam and S. Slattery.
Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98).
(8 pages)
Learning to Construct Knowledge Bases from the World Wide Web.
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam and S. Slattery.
To appear in Artificial Intelligence.
(57 pages)
Data Mining on Symbolic Knowledge Extracted from the Web
Rayid Ghani, Rosie Jones, Dunja Mladenic, Kamal Nigam and Sean Slattery.
In KDD-2000 Workshop on Text Mining. 2000.

Overview of Cora, a related project:

Building Domain-Specific Search Engines with Machine Learning Techniques.
Andrew McCallum, Kamal Nigam, Jason Rennie and Kristie Seymore.
AAAI-99 Spring Symposium on Intelligent Agents in Cyberspace. A related paper will also appear in IJCAI'99.

First paper from the related BioKB Project:

Constructing Biological Knowledge Bases by Extracting Information from Text Sources.
Mark Craven and Johan Kumlien.
Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology (ISMB-99).

Text and Hypertext Classification:

Using Error-Correcting Codes for Text Classification.
Rayid Ghani.
To appear in the Proceedings of the 17th International Conference on Machine Learning (ICML 2000)
Analyzing the Effectiveness and Applicability of Co-training.
Kamal Nigam and Rayid Ghani.
To appear in Ninth International Conference on Information and Knowledge Management (CIKM-2000). 2000.
Text Classification from Labeled and Unlabeled Documents using EM.
Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell.
To appear in the Machine Learning Journal. 2000. Draft.
Using Maximum Entropy for Text Classification.
Kamal Nigam, John Lafferty and Andrew McCallum.
IJCAI-99 Workshop on Machine Learning for Information Filtering. 1999.
Text Classification by Bootstrapping with Keywords, EM and Shrinkage.
Andrew McCallum and Kamal Nigam.
ACL '99 Workshop for Unsupervised Learning in Natural Language Processing. 1999.
Improving Text Classification by Shrinkage in a Hierarchy of Classes.
Andrew McCallum, Ronald Rosenfeld, Tom Mitchell and Andrew Ng.
Proceedings of the 15th International Conference on Machine Learning (ICML-98).
Learning to Classify Text from Labeled and Unlabeled Documents.
Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell.
Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98).
Employing EM and Pool-Based Active Learning for Text Classification.
Andrew McCallum and Kamal Nigam.
Proceedings of the 15th International Conference on Machine Learning (ICML-98).
Combining Labeled and Unlabeled Data with Co-Training.
Avrim Blum and Tom Mitchell.
Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT-98).
The Role of Unlabeled Data in Supervised Learning.
T. Mitchell.
Proceedings of the Sixth International Colloquium on Cognitive Science, San Sebastian, Spain, 1999 (invited paper).
A Case Study in Using Linguistic Phrases for Text Categorization on the WWW.
J. Fürnkranz, T. Mitchell and E. Riloff.
Working Notes of the 1998 AAAI/ICML Workshop on Learning for Text Categorization.
A Comparison of Event Models for Naive Bayes Text Classification.
Andrew McCallum and Kamal Nigam.
Working Notes of the 1998 AAAI/ICML Workshop on Learning for Text Categorization.

Relational Learning for Hypertext Domains:

Unsupervised Structural Inference for Web Page Classification.
S. Slattery.
Accepted submission to the International Conference on Machine Learning, 2000.
Combining Statistical and Relational Methods for Learning in Hypertext Domains.
S. Slattery and M. Craven.
Proceedings of the 8th International Conference on Inductive Logic Programming (ILP-98).
First-Order Learning for Web Mining.
M. Craven, S. Slattery and K. Nigam.
Proceedings of the 10th European Conference on Machine Learning (ECML-98).

Automatic Corpus Construction from the Web

Learning a Monolingual Language Model from a Multilingual Text Database.
Rayid Ghani & Rosie Jones.
Proceedings of the Ninth International Conference on Information and Knowledge Management (CIKM 2000).
Automatically Building a Corpus for a Minority Language From the Web.
Rosie Jones & Rayid Ghani.
In Proceedings of the Student Workshop at the 38th Annual Meeting of the Association for Computational Linguistics (ACL-2000).

Spidering:

Using Reinforcement Learning to Spider the Web Efficiently.
Jason Rennie and Andrew McCallum.
Proceedings of the 16th International Conference on Machine Learning (ICML-99).

Information Extraction:

Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping
Ellen Riloff and Rosie Jones.
Proceedings of the 16th National Conference on Artificial Intelligence (AAAI-99).
Bootstrapping for Text Learning Tasks.
Rosie Jones, Andrew McCallum, Kamal Nigam and Ellen Riloff.
IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications. 1999.
Information Extraction with HMMs and Shrinkage
Dayne Freitag and Andrew McCallum.
Draft accepted to AAAI'99 Workshop on Machine Learning for Information Extraction.
Learning Hidden Markov Model Structure for Information Extraction
Kristie Seymore, Andrew McCallum, Roni Rosenfeld.
Draft accepted to AAAI'99 Workshop on Machine Learning for Information Extraction.
Multistrategy Learning for Information Extraction.
D. Freitag.
Proceedings of the 15th International Conference on Machine Learning (ICML-98).
Toward General-Purpose Learning for Information Extraction.
D. Freitag.
Proceedings of the Seventeenth International Conference on Computational Linguistics (COLING-ACL-98).
Information Extraction From HTML: Application of a General Learning Approach.
D. Freitag.
Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98).
Using Grammatical Inference to Improve Precision in Information Extraction.
D. Freitag.
Working Notes of the ICML-97 Workshop on Automata Induction, Grammatical Inference, and Language Acquisition.

Student projects and unpublished reports:

Using HTML Formatting to Aid in Natural Language Processing on the World Wide Web.
D. DiPasquo.
Senior Honors Thesis, School of Computer Science, CMU, May 1998.
Classification of World Wide Web Documents.
C.Y. Quek.
Senior Honors Thesis, School of Computer Science, CMU, May 1997.

Researchers:

Rayid Ghani
Rosie Jones
Andrew McCallum
Tom Mitchell (Faculty Principal Investigator)
Dunja Mladenic
Kamal Nigam
Sean Slattery

Project Alumni:

Amine Bensaid
Mark Craven (now at University of Wisconsin)
Dan DiPasquo (now at ISI)
Dayne Freitag
Johannes Fürnkranz
Paul Hsiung
Hooman Katirau
Johan Kumlien
Carsten Lanquillon
Choon Yang Quek
Jason Rennie (now at MIT)
Nathan Willett

Internal project page visible to project members only.

theo-11-last update: Jan 2001 by Rayid Ghani

this web page is stored at /afs/cs.cmu.edu/project/theo-11/www/wwkb/index.html
