CMU World Wide Knowledge Base (Web->KB) project
Goal:
To develop a probabilistic, symbolic
knowledge base that mirrors the content of the world wide web. If successful,
this will make text information on the web available in
computer-understandable form, enabling much more sophisticated information
retrieval and problem solving.
Approach:
We are developing a system that can be trained to extract symbolic
knowledge from hypertext, using a variety of machine learning methods.
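As a rough illustration of what "trained to extract symbolic knowledge" can mean, the sketch below trains a small naive Bayes text classifier on labeled page text and emits a probabilistic fact of the form instance_of(url, class). It is a minimal sketch under stated assumptions, not the project's actual system: the class names, the toy training pages, the example URL, and all function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real HTML tokenization)."""
    return text.lower().split()


def train_naive_bayes(labeled_pages):
    """Collect per-class page counts, per-class word counts, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_pages:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return {label: posterior probability} for one page."""
    class_counts, word_counts, vocab = model
    total_pages = sum(class_counts.values())
    log_scores = {}
    for label, n_pages in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_pages / total_pages)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                # Laplace (add-one) smoothing over the vocabulary.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        log_scores[label] = score
    # Convert log scores to normalized probabilities.
    top = max(log_scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {label: value / norm for label, value in exp_scores.items()}


def extract_fact(url, text, model):
    """Turn the most probable class into a symbolic fact with its probability."""
    probs = classify(text, model)
    best = max(probs, key=probs.get)
    return ("instance_of", url, best, probs[best])


# Hypothetical toy training data; real experiments would use labeled web pages.
TRAINING_PAGES = [
    ("homework syllabus lecture notes exam", "course"),
    ("lecture schedule assignments grading", "course"),
    ("professor research publications teaching", "faculty"),
    ("professor students advising publications", "faculty"),
]

model = train_naive_bayes(TRAINING_PAGES)
fact = extract_fact(
    "http://example.edu/~smith/",
    "professor of computer science publications",
    model,
)
print(fact)
```

Each emitted tuple pairs a symbolic assertion with the classifier's confidence, which is one simple way to read the "probabilistic, symbolic" goal stated above.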
Datasets:
Our first experiments focused on extracting knowledge about computer
science departments. We have assembled two data sets for this task:
Other Datasets used by the WebKB Group
Related research on machine learning and text:
See the other research on text learning by our research group.
Publications:
Overview of the Project:
-
Learning to Extract Symbolic Knowledge from the World Wide Web.
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam and S. Slattery.
Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98).
(8 pages)
-
Learning to Construct Knowledge Bases from the World Wide Web.
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam and S. Slattery.
To appear in Artificial Intelligence.
(57 pages)
- Data Mining on Symbolic Knowledge Extracted from the Web.
Rayid Ghani, Rosie Jones, Dunja Mladenic, Kamal Nigam and Sean Slattery.
In KDD-2000 Workshop on Text Mining. 2000.
Overview of Cora, a related project:
- Building Domain-Specific Search Engines with Machine Learning Techniques.
Andrew McCallum, Kamal Nigam, Jason Rennie and Kristie Seymore.
AAAI-99 Spring Symposium on Intelligent Agents in Cyberspace. A related paper will also appear in IJCAI'99.
First paper from the related BioKB Project:
-
Constructing Biological Knowledge Bases by Extracting
Information from Text Sources.
Mark Craven and Johan Kumlien.
Proceedings of the 7th International Conference on Intelligent
Systems for Molecular Biology (ISMB-99).
Text and Hypertext Classification:
-
Using Error-Correcting Codes for Text Classification.
Rayid Ghani.
To appear in the Proceedings of the 17th International Conference on
Machine Learning (ICML 2000).
- Analyzing the Effectiveness and Applicability of Co-training.
Kamal Nigam and Rayid Ghani.
To appear in the Ninth International Conference on Information and Knowledge
Management (CIKM-2000). 2000.
-
Text Classification from Labeled and Unlabeled Documents using EM.
Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell.
To appear in the Machine Learning Journal. 2000.
Draft.
- Using Maximum Entropy for Text Classification.
Kamal Nigam, John Lafferty and Andrew McCallum.
IJCAI-99 Workshop on Machine Learning for Information Filtering. 1999.
-
Text Classification by Bootstrapping with Keywords, EM and Shrinkage.
Andrew McCallum and Kamal Nigam.
ACL '99 Workshop for Unsupervised Learning in Natural Language Processing. 1999.
- Improving Text Classification by Shrinkage in a Hierarchy of Classes.
Andrew McCallum, Ronald Rosenfeld, Tom Mitchell and Andrew Ng.
Proceedings of the 15th International Conference on Machine Learning (ICML-98).
- Learning to Classify Text from Labeled and Unlabeled Documents.
Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell.
Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98).
- Employing EM and Pool-Based Active Learning for Text Classification.
Andrew McCallum and Kamal Nigam.
Proceedings of the 15th International Conference on Machine Learning (ICML-98).
-
Combining Labeled and Unlabeled Data with Co-Training.
Avrim Blum and Tom Mitchell.
Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT-98).
-
The Role of Unlabeled Data in Supervised Learning.
T. Mitchell.
Proceedings of the Sixth International Colloquium on
Cognitive Science, San Sebastian, Spain, 1999 (invited paper).
- A Case Study in Using Linguistic Phrases for Text Categorization on the WWW.
J. Fürnkranz, T. Mitchell and E. Riloff.
Working Notes of the 1998 AAAI/ICML Workshop on Learning for Text Categorization.
- A Comparison of Event Models for Naive Bayes Text Classification.
Andrew McCallum and Kamal Nigam.
Working Notes of the 1998 AAAI/ICML Workshop on Learning for Text Categorization.
Relational Learning for Hypertext Domains:
- Unsupervised Structural Inference for Web Page Classification.
S. Slattery.
Accepted submission to the International Conference on Machine Learning, 2000.
-
Combining Statistical and Relational Methods for Learning in Hypertext Domains.
S. Slattery and M. Craven.
Proceedings of the 8th International Conference on Inductive Logic Programming (ILP-98).
-
First-Order Learning for Web Mining.
M. Craven, S. Slattery and K. Nigam.
Proceedings of the 10th European Conference on Machine Learning (ECML-98).
Automatic Corpus Construction from the Web
Spidering:
- Using Reinforcement Learning to Spider the Web Efficiently.
Jason Rennie and Andrew McCallum.
Proceedings of the 16th International Conference on Machine Learning (ICML-99).
Information Extraction:
- Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping.
Ellen Riloff and Rosie Jones.
Proceedings of the 16th National Conference on Artificial Intelligence (AAAI-99).
- Bootstrapping for Text Learning Tasks.
Rosie Jones, Andrew McCallum, Kamal Nigam and Ellen Riloff.
IJCAI-99 Workshop on Text Mining: Foundations, Techniques and
Applications. 1999.
- Information Extraction with HMMs and Shrinkage.
Dayne Freitag and Andrew McCallum.
Draft accepted to AAAI'99 Workshop on Machine Learning for
Information Extraction.
- Learning Hidden Markov Model Structure for Information Extraction.
Kristie Seymore, Andrew McCallum and Roni Rosenfeld.
Draft accepted to AAAI'99 Workshop on
Machine Learning for Information Extraction.
-
Multistrategy Learning for Information Extraction.
D. Freitag.
Proceedings of the 15th International Conference on Machine Learning (ICML-98).
-
Toward General-Purpose Learning for Information Extraction.
D. Freitag.
Proceedings of the Seventeenth International Conference
on Computational Linguistics (COLING-ACL-98).
-
Information Extraction From HTML: Application of a General Learning Approach.
D. Freitag.
Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98).
-
Using Grammatical Inference to Improve Precision in Information Extraction.
D. Freitag.
Working Notes of the ICML-97 Workshop on Automata Induction, Grammatical Inference,
and Language Acquisition.
Student projects and unpublished reports:
-
Using HTML Formatting to Aid in Natural Language Processing on the World Wide Web.
D. DiPasquo.
Senior Honors Thesis, School of Computer Science, CMU, May 1998.
-
Classification of World Wide Web Documents.
C.Y. Quek.
Senior Honors Thesis, School of Computer Science, CMU, May 1997.
Researchers:
Project Alumni:
Internal project page visible to project members only.
Last update: Jan 2001 by Rayid Ghani
this web page is stored at /afs/cs.cmu.edu/project/theo-11/www/wwkb/index.html