IR term project of Tze Sing Eugene Ng

Basic Information

Abstract

Text classification is the problem of mapping documents in free text to the most appropriate class among a set of classes expressed in a controlled vocabulary. To handle large corpora, feature selection schemes such as Information Gain and Document Frequency have been used to greatly reduce the complexity of classification. Unfortunately, as the number of target classes increases, scalability becomes a significant problem for traditional text classification methods. One possible solution is to organize the classes in a hierarchy and perform text classification hierarchically. In this project, we explore hierarchical classification methods using k Nearest Neighbors (kNN) as the classification scheme and Distributional Word Clustering (DWC) for feature selection. We chose this combination because kNN is known to be one of the best text classification schemes, while DWC is a feature selection method that is interesting and yet not very well understood. We show that DWC is very computationally expensive and in general does not perform better than Information Gain as a feature selection method. In addition, hierarchical text classification using kNN classifiers can result in degraded performance compared to its non-hierarchical counterpart, despite a possible reduction in complexity.

Documents and Timelines

Click here for the Project Proposal in Postscript

Click here for the Project Preliminary in Postscript

Click here for my paper reviews in Postscript

Click here for my final report in Postscript

Task | To be done by | Status
Fully understand mechanisms in question by reading relevant papers | 3/4 | Closed
Familiarize with the available tools (kNN, X^2, etc.) | 3/6 | Closed
Familiarize with the data collections, construct hierarchies | 3/8 | Closed
Extract cate90.train and cate90.test subsets from the ApteMod collection | 3/10 | Closed
Implement distributional word clustering algorithm | 3/16 | Closed
Complete DWC optimization | 3/17 | Closed
Perform additional experiments with DWC, ensure correctness | 3/19 | Closed
Perform experiments with flat kNN using DWC, find optimal strategy | 3/23 | Closed
Perform experiments with hierarchical kNN using DWC | 3/25 | Closed
Prototype system implemented for debugging | 3/12 | Closed
Completion of entire system and basic evaluation | 4/3 | Closed
Propose, if possible, improvements to the basic scheme | 4/8 | Closed
Implement the proposed mechanisms | 4/12 | Insufficient time
Evaluate the proposed mechanisms | 4/17 | Insufficient time
Delivery of presentation and project report | 4/23 | Closed

System Description

Distributional Word Clustering (DWC) is an interesting feature selection method. Intuitively, DWC measures the similarity between words in a document collection by comparing the probability distributions over the class variable that the words induce. Words whose distributions over the class variable are similar are clustered together and treated as a single event. The goal of DWC is, given the V vocabulary words of the corpus, to condense them into M clusters.
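To make the clustering idea concrete, here is a minimal sketch of one way to realize it: an agglomerative scheme that repeatedly merges the two clusters whose class-conditional distributions are most similar under a weighted Jensen-Shannon divergence, stopping when only M clusters remain. This is an illustration under those assumptions, not the project's actual code; the names js_divergence, cluster_words, word_class_counts, and num_clusters are hypothetical.

    # Sketch of distributional word clustering (illustrative only):
    # greedily merge the two clusters whose class distributions P(c | w)
    # are closest under weighted Jensen-Shannon divergence.
    import numpy as np

    def js_divergence(p, q, w_p=0.5, w_q=0.5):
        """Weighted Jensen-Shannon divergence between two class distributions."""
        m = w_p * p + w_q * q
        def kl(a, b):
            mask = a > 0
            return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
        return w_p * kl(p, m) + w_q * kl(q, m)

    def cluster_words(word_class_counts, num_clusters):
        """word_class_counts: |V| x |C| word/class co-occurrence counts.
        Returns a list of clusters, each a list of word indices."""
        counts = np.asarray(word_class_counts, dtype=float)
        priors = counts.sum(axis=1) / counts.sum()                 # P(w)
        dists = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)  # P(c | w)
        clusters = [[i] for i in range(len(counts))]

        while len(clusters) > num_clusters:
            # Find the pair of clusters with the smallest divergence.
            best, best_pair = float("inf"), None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    w = priors[i] + priors[j]
                    d = js_divergence(dists[i], dists[j],
                                      priors[i] / w, priors[j] / w)
                    if d < best:
                        best, best_pair = d, (i, j)
            i, j = best_pair
            # Merge cluster j into cluster i: combine distributions and priors.
            w = priors[i] + priors[j]
            dists[i] = (priors[i] * dists[i] + priors[j] * dists[j]) / w
            priors[i] = w
            clusters[i].extend(clusters.pop(j))
            dists = np.delete(dists, j, axis=0)
            priors = np.delete(priors, j)
        return clusters

Note that this naive version searches all cluster pairs at every merge, so its cost grows very quickly with the vocabulary size, which is consistent with the observation above that DWC is computationally expensive.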




The idea of hierarchical text classification is to organize classes in a hierarchy and to classify documents first into groups of classes rather than against all the leaf classes at once. Doing so may allow us to select very few features that are sufficient for classifying into the immediate subsets. This part of the system takes advantage of the hierarchy in the Reuters-21578 ApteMod collection and the corpus subsets generated by Photina Jang. The hierarchy has two levels. At level 1 there are 8 subsets: Commodity, Corporate, Currency, Economy, Energy, Interest, Money-Fx, and Ship; level 2 (the leaf level) contains the document classes. To make the basic kNN classifier work with a hierarchy, we need to perform several steps, all of which are independent of the feature selection mechanism used; a rough sketch of the resulting two-level routing appears below.
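The sketch below is not the system's code; it is a minimal illustration, assuming a generic cosine-similarity kNN classifier, of how a document can first be routed to one of the eight level-1 subsets and then classified among the leaf classes of the winning subset. The names KNNClassifier, build_hierarchical_knn, and leaf_to_subset are hypothetical placeholders.

    # Sketch of two-level hierarchical kNN routing (illustrative only).
    from collections import Counter
    import numpy as np

    class KNNClassifier:
        """Minimal cosine-similarity kNN over dense feature vectors."""
        def __init__(self, k=30):
            self.k = k

        def fit(self, vectors, labels):
            vectors = np.asarray(vectors, dtype=float)
            norms = np.linalg.norm(vectors, axis=1, keepdims=True)
            self.vectors = vectors / np.maximum(norms, 1e-12)
            self.labels = list(labels)
            return self

        def predict(self, vector):
            v = np.asarray(vector, dtype=float)
            v = v / max(np.linalg.norm(v), 1e-12)
            sims = self.vectors @ v
            top = np.argsort(sims)[::-1][:self.k]           # k nearest training documents
            votes = Counter(self.labels[i] for i in top)    # majority vote among neighbors
            return votes.most_common(1)[0][0]

    def build_hierarchical_knn(train_vectors, leaf_labels, leaf_to_subset, k=30):
        """Train one level-1 classifier over the 8 subsets and one level-2
        classifier per subset over that subset's leaf classes."""
        train_vectors = np.asarray(train_vectors, dtype=float)
        subset_labels = [leaf_to_subset[y] for y in leaf_labels]
        level1 = KNNClassifier(k).fit(train_vectors, subset_labels)
        level2 = {}
        for subset in set(subset_labels):
            idx = [i for i, s in enumerate(subset_labels) if s == subset]
            level2[subset] = KNNClassifier(k).fit(
                train_vectors[idx], [leaf_labels[i] for i in idx])
        return level1, level2

    def classify(doc_vector, level1, level2):
        subset = level1.predict(doc_vector)         # route to a level-1 subset
        return level2[subset].predict(doc_vector)   # classify within that subset

In a real system of this kind, each level-2 classifier could also use its own, much smaller feature set selected only from its subset's training documents, which is where the potential reduction in complexity comes from.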

Experiments & Results

Some example clusters generated by the DWC algorithm.

Clustering time of DWC.

Flat kNN classification performance.

Hierarchical kNN classification performance.


last update: Mar 3rd, 1998