CMU World Wide Knowledge Base project
Choon's Thesis
Abstract
The rich variety of knowledge available on the World Wide Web makes it an attractive target for data mining. A first step in this process is to classify web pages according to some predetermined ontology. Current machine learning techniques for text classification deal primarily with flat text documents and do not take advantage of the richer structure offered by the World Wide Web, such as hyperlinks, titles, and paragraph headings. This paper investigates several methods designed specifically to classify web pages, compares their relative merits, and shows that using structural information produces classifiers with different performance characteristics.
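To make the idea of structural information concrete, the following is a minimal sketch of how a page's title, headings, and hyperlinks might be separated from its plain body text before classification. It is an illustration only, not the method used in the thesis; the names StructureExtractor and extract_structure are hypothetical, and it assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser


class StructureExtractor(HTMLParser):
    """Hypothetical helper: splits a page into title, heading, link, and body fields."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "headings": [], "links": [], "body": []}
        self._stack = []  # currently open <title> or heading tags

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.HEADINGS:
            self._stack.append(tag)
        elif tag == "a":
            # Record the link target; the anchor text itself is left in the body.
            href = dict(attrs).get("href")
            if href:
                self.fields["links"].append(href)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._stack:
            key = "title" if self._stack[-1] == "title" else "headings"
        else:
            key = "body"
        self.fields[key].append(text)


def extract_structure(html):
    """Return a dict mapping each structural field to a list of strings."""
    parser = StructureExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields
```

A classifier could then treat each field as its own source of features (for example, one bag of words per field) instead of flattening the whole page into a single text document.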
Paper
Postscript File
Microsoft Word Format
Diagrams
Graph 1
Graph 2
Graph 3
last update: Jan 1997