Hypertext Categorization using Hyperlink Patterns and Meta Data
Joint work with Sean Slattery and Yiming Yang.
Abstract
Hypertext poses new research challenges for text classification: hyperlinks, the content of linked documents, and meta data about related web sites all provide rich sources of information that are not available in traditional text classification. We investigate the use of such information for representing web sites, and the effectiveness of different classifiers (Naive Bayes, Nearest Neighbor, and FOIL) in exploiting those representations. We find that using the words in web pages alone often yields suboptimal classifier performance compared to exploiting additional sources of information beyond document content. On the other hand, we also observe that linked pages can be more harmful than helpful when the linked neighborhoods are highly "noisy", and that links have to be used with care. More importantly, our investigation suggests that meta data, which is often available or can be acquired using information extraction techniques, can be extremely useful for improving classification accuracy. Finally, the relative performance of the classifiers tested gives us insight into the strengths and limitations of our algorithms for hypertext classification.