CMU World Wide Knowledge Base project

Choon's Thesis

Abstract

The rich variety of knowledge available on the World Wide Web makes it an attractive target for data mining. A first step in this data mining operation is to classify web pages according to some predetermined ontology. Current machine learning techniques for text classification deal primarily with flat text documents and do not take advantage of the richer structure offered by the World Wide Web, such as hyperlinks, titles, and paragraph headings. This paper investigates several methods designed specifically to classify web pages, compares their relative merits, and shows that using structural information produces classifiers with different performance characteristics.
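As a rough illustration of what "structural information" could mean in practice, the sketch below separates the words of a page by where they occur (title, heading, hyperlink anchor text, or plain body text) and collects the link targets. It is not taken from the thesis: the names `StructureExtractor` and `extract_structure` and the four field labels are assumptions made for this example, and it uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class StructureExtractor(HTMLParser):
    """Hypothetical helper: buckets page words by the structure they occur in."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        # One word list per structural field; "body" is everything else.
        self.fields = {"title": [], "heading": [], "anchor": [], "body": []}
        self.links = []    # href targets of hyperlinks, in document order
        self._stack = []   # currently open structural elements

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._stack.append("title")
        elif tag in self.HEADINGS:
            self._stack.append("heading")
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
            self._stack.append("anchor")

    def handle_endtag(self, tag):
        # Assumes reasonably well-nested markup; real pages may need more care.
        if (tag == "title" or tag in self.HEADINGS or tag == "a") and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        words = data.lower().split()
        if not words:
            return
        field = self._stack[-1] if self._stack else "body"
        self.fields[field].extend(words)


def extract_structure(html):
    """Return (fields, links) for one HTML page given as a string."""
    parser = StructureExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields, parser.links
```

A flat-text classifier would see all of these words as one bag; keeping the fields apart lets a learner weight, say, title words differently from body words, or follow `links` to look at neighboring pages.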

Paper

Postscript File

Microsoft Word Format

Diagrams

Graph 1

Graph 2

Graph 3

last update: Jan 1997