IR term project of Yue Pan

Basic Information

Contents


Abstract

The challenge is to apply a high-accuracy classification method (kNN, DTree, or LLSF) to a very large category space. The idea is to use a divide-and-conquer strategy to make the large problem more tractable, ideally without a significant loss of classification accuracy.
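
The divide-and-conquer idea can be illustrated with a two-level sketch: a top-level classifier first routes a document to a branch, and a classifier trained only on that branch then picks the final category. This is a minimal sketch under assumptions, not the project's code. The names `classify_hierarchically`, `toy_top`, and `toy_commodity` are hypothetical, and the keyword rules merely stand in for a real kNN, DTree, or LLSF model.

```python
from typing import Callable, Dict, Sequence, Tuple

# A classifier is any function mapping a tokenized document to a label.
# These aliases are illustrative assumptions, not part of the project.
Document = Sequence[str]
Classifier = Callable[[Document], str]


def classify_hierarchically(
    doc: Document,
    top_level: Classifier,
    per_branch: Dict[str, Classifier],
) -> Tuple[str, str]:
    """Route a document to a branch, then classify it inside that branch."""
    branch = top_level(doc)
    leaf_classifier = per_branch.get(branch)
    if leaf_classifier is None:
        # No sub-classifier registered (e.g. a branch with one category):
        # the branch label itself is returned as the final answer.
        return branch, branch
    return branch, leaf_classifier(doc)


# Toy keyword rules standing in for real classifiers (hypothetical).
def toy_top(doc: Document) -> str:
    return "COMMODITY" if "wheat" in doc or "gold" in doc else "SHIP"


def toy_commodity(doc: Document) -> str:
    return "wheat" if "wheat" in doc else "gold"


result = classify_hierarchically(
    ["wheat", "harvest"], toy_top, {"COMMODITY": toy_commodity}
)
print(result)
```

Each branch classifier only has to separate the categories under its own node, which is what keeps the individual problems small. The cost is that a routing mistake at the top level cannot be repaired further down.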

Proposal and Timelines

 
Task                                  To be done by   Status
Preparing Reuters corpus              March 5         done
Writing IG computing module           March 12        done
Using Sampling module / Report        March 17        ?
Using kNN & Flat Categorization       March 24        done
Writing Hierarchical Classifier       March 31        done
Testing Hierarchical Categorization   April 7         done
Improving Hierarchical Categorization April 14        n/a
Result Comparison and Analysis        April 21        n/a
Project Presentation                  April 28        n/a

System Description

Experiments

System Hierarchy:
Layer 1

Root (100 %)
  |
  +-- CORPORATE  (62.42 %)
  +-- COMMODITY  (13.46 %)
  +-- ECONOMY    (8.03 %)
  +-- ENERGY     (5.98 %)
  +-- MONEY-FX   (5.33 %)
  +-- INTEREST   (2.41 %)
  +-- SHIP       (2.33 %)
  +-- CURRENCY   (0.04 %)
 
 
CORPORATE COMMODITY ECONOMY ENERGY MONEY-FX INTEREST SHIP CURRENCY
#doc 4511 1241 867 463 156
#cat 2 54 15 9 1 1 1 7
Entropy -0.65 -4.62 -2.21 -1.08 -0.88
In-Set 97.56 % 78.20 % 84.62 % 83.17 % 76.19 %
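
For readers unfamiliar with the Entropy row, the sketch below shows one common way to measure how evenly documents are spread over the subcategories of a branch. It is an illustration under assumptions: the function name `category_entropy` is hypothetical, the base-2 logarithm is a guess, and the `negate` flag exists only because the table lists negative values, which would be consistent with reporting the sum of p * log(p) without the usual sign flip. The source does not state which convention was used.

```python
import math
from typing import Iterable


def category_entropy(doc_counts: Iterable[int], negate: bool = True) -> float:
    """Shannon entropy (in bits) of a distribution given by per-category counts.

    With negate=False the raw sum of p * log2(p) is returned, which is
    zero or negative.
    """
    counts = [c for c in doc_counts if c > 0]  # empty categories contribute 0
    total = sum(counts)
    if total == 0:
        return 0.0
    raw = sum((c / total) * math.log2(c / total) for c in counts)
    return -raw if negate else raw


# Two equally sized subcategories carry exactly one bit of uncertainty.
print(category_entropy([50, 50]))
# A skewed split is easier to predict, so its entropy is lower.
print(category_entropy([95, 5]))
```

A branch whose documents concentrate in one subcategory has entropy near zero and is comparatively easy to classify, while a branch with many evenly populated subcategories has high entropy.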

Experiment Parameters:

Tokens: top 2000 words ranked by Information Gain
Stemming: no
Stop Word Elimination: no
kNN: top 30 nearest neighbors
Similarity Measure: ltc ltc
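
The parameters above can be sketched as follows. In SMART notation, `ltc` denotes logarithmic term frequency, inverse document frequency, and cosine normalization. The code weights both the query document and the training documents that way and scores categories by summing the similarities of the nearest neighbors. This is a minimal sketch, not the project's implementation: the function names, the natural logarithm, and the similarity-sum voting rule are assumptions, and the vocabulary is assumed to have been reduced by Information Gain beforehand.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Dict[str, float]


def ltc_vector(tokens: Sequence[str], doc_freq: Dict[str, int], n_docs: int) -> Vector:
    """Weight terms as (1 + ln tf) * ln(N / df), then scale to unit length."""
    weights: Vector = {}
    for term, tf in Counter(tokens).items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # term is outside the selected vocabulary
        weights[term] = (1.0 + math.log(tf)) * math.log(n_docs / df)
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return {}
    return {term: w / norm for term, w in weights.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Dot product of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def knn_scores(query: Vector, training: List[Tuple[Vector, List[str]]], k: int = 30) -> Dict[str, float]:
    """Sum the similarities of the k nearest training documents per category."""
    ranked = sorted(
        ((cosine(query, vec), cats) for vec, cats in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    scores: Dict[str, float] = defaultdict(float)
    for sim, cats in ranked[:k]:
        for cat in cats:
            scores[cat] += sim
    return dict(scores)


# Toy corpus (hypothetical data, not taken from the Reuters collection).
raw_docs = [
    (["wheat", "crop", "wheat"], ["grain"]),
    (["oil", "crude"], ["crude"]),
    (["wheat", "export"], ["grain"]),
]
doc_freq = Counter(term for tokens, _ in raw_docs for term in set(tokens))
training = [(ltc_vector(tokens, doc_freq, len(raw_docs)), cats) for tokens, cats in raw_docs]

query_vec = ltc_vector(["wheat", "crop"], doc_freq, len(raw_docs))
scores = knn_scores(query_vec, training, k=2)
print(max(scores, key=scores.get))
```

Because every vector is already cosine-normalized, the similarity reduces to a sparse dot product. Turning the per-category scores into final assignments still needs a thresholding rule, which the parameter list does not specify.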

Results

Demo

 


last update: Apr 13, 1998