IR term project of Yue Pan

Basic Information

Project Title: Hierarchical Text Categorization
Name: Yue Pan (email: ypan@cs.cmu.edu)
Presentation Date: Tue Apr 28th, 1998
Demo Date:

Abstract
Proposal and Timelines
System Description
Experiments
Results
Demo

Abstract

The challenge is to apply a high-accuracy classification method (kNN, DTree, or LLSF) to a very large category space. The idea is to use a divide-and-conquer strategy to make a large problem more tractable, ideally without significant loss of classification accuracy.
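The divide-and-conquer idea can be sketched as a two-stage routing step: a top-level classifier picks one or more branches, and a per-branch classifier assigns the final categories. This is a minimal sketch, not the project's actual code. The function names, the keyword-based toy classifiers, and the rule that a branch without its own classifier returns the branch name as the label are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# A classifier maps a document (here, a list of tokens) to a list of labels.
Classifier = Callable[[List[str]], List[str]]


def classify_hierarchically(
    doc: List[str],
    top_level: Classifier,
    per_branch: Dict[str, Classifier],
) -> List[str]:
    """Route the document to top-level branches, then classify within each."""
    labels: List[str] = []
    for branch in top_level(doc):
        branch_classifier = per_branch.get(branch)
        if branch_classifier is None:
            # Assumption: a branch with no classifier of its own holds a
            # single category, so the branch name itself is the label.
            labels.append(branch)
        else:
            labels.extend(branch_classifier(doc))
    return labels


# Toy stand-ins (hypothetical keyword rules, not trained classifiers).
def toy_top_level(doc: List[str]) -> List[str]:
    return ["COMMODITY"] if "wheat" in doc else ["SHIP"]


def toy_commodity(doc: List[str]) -> List[str]:
    return ["wheat", "grain"] if "wheat" in doc else []


branch_classifiers = {"COMMODITY": toy_commodity}
result_a = classify_hierarchically(["wheat", "export"], toy_top_level, branch_classifiers)
result_b = classify_hierarchically(["vessel"], toy_top_level, branch_classifiers)
```

Each per-branch classifier only has to separate the categories inside its own branch, which is what makes the large category space more tractable. An error at the top level cannot be recovered further down, which is one source of possible accuracy loss.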

Proposal and Timelines

Project Proposal

Paper Review

Task	to be done by	status
Preparing Reuters corpus	March 5	done
Writing IG computing module	March 12	done
Using Sampling module / Report	March 17	?
Using KNN & Flat Categorization	March 24	done
Writing Hierarchical Classifier	March 31	done
Testing Hierarchical Categorization	April 7	done
Improving Hierarchical Categorization	April 14	n.a
Result Comparing and Analysis	April 21	n.a
Project Presentation	April 28	n.a

System Description

Experiments

System Hierarchy:

Layer 1

Root (100 %)
  |
  +--- CORPORATE (62.42 %)
  +--- COMMODITY (13.46 %)
  +--- ECONOMY   (8.03 %)
  +--- ENERGY    (5.98 %)
  +--- MONEY-FX  (5.33 %)
  +--- INTEREST  (2.41 %)
  +--- SHIP      (2.33 %)
  +--- CURRENCY  (0.04 %)

	CORPORATE	COMMODITY	ECONOMY	ENERGY	MONEY-FX	INTEREST	SHIP	CURRENCY
#doc	4511	1241	867	463				156
#cat	2	54	15	9	1	1	1	7
Entropy	-0.65	-4.62	-2.21	-1.08				-0.88
In-Set	97.56 %	78.20 %	84.62 %	83.17 %				76.19 %
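The Entropy row above is reported with a negative sign. One reading, which is an assumption and not stated on this page, is that each value is the sum of p * log2(p) over the categories in a branch, without the usual leading minus. Base 2 is at least consistent with the COMMODITY value: 4.62 is above ln 54 (about 3.99) but below log2 54 (about 5.75). A minimal sketch under that assumption, with made-up counts:

```python
import math
from typing import Sequence


def signed_entropy(counts: Sequence[int]) -> float:
    """Return sum(p * log2(p)) over non-empty categories (always <= 0).

    Assumption: this sign convention and log base match the table;
    the page itself does not define the Entropy row.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Hypothetical per-category document counts, not taken from the corpus.
two_equal_categories = signed_entropy([50, 50])
single_category = signed_entropy([100])
skewed_categories = signed_entropy([90, 10])
```

Under this reading a branch with a single category has entropy 0, and values further from 0 indicate a branch whose documents are spread over more categories.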

Experiment Parameters:
Tokens: top 2000 words ranked by Information Gain
Stemming: no
Stop Word Elimination: no
kNN: top 30 nearest neighbors
Similarity Measure: ltc ltc
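For readers unfamiliar with the notation, "ltc" in the SMART scheme means logarithmic term frequency, idf weighting, and cosine normalization, applied here to both training and test documents. The sketch below shows one way to combine it with a top-k neighbor vote. It is not the project's code: the function names, the toy documents, and the choice to score a category by summing neighbor similarities are assumptions. How scores are thresholded into final category assignments is not described on this page and is left out.

```python
import math
from collections import Counter, defaultdict


def document_frequencies(docs):
    """Count, for each term, the number of training documents containing it."""
    return Counter(term for tokens in docs for term in set(tokens))


def ltc_vector(tokens, df, n_docs):
    """Weight = (1 + ln tf) * ln(N / df), then scale the vector to unit length."""
    tf = Counter(t for t in tokens if t in df)  # terms unseen in training are dropped
    raw = {t: (1.0 + math.log(f)) * math.log(n_docs / df[t]) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm > 0 else {}


def cosine(a, b):
    """Dot product of two sparse vectors that are already unit length."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_scores(query_vec, train_vecs, train_labels, k=30):
    """Sum the similarities of the k nearest training documents per category."""
    ranked = sorted(
        ((cosine(query_vec, vec), labels) for vec, labels in zip(train_vecs, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    scores = defaultdict(float)
    for sim, labels in ranked[:k]:
        for label in labels:
            scores[label] += sim
    return dict(scores)


# Toy data (hypothetical, not from the Reuters corpus).
train_docs = [["wheat", "grain", "export"], ["wheat", "harvest"],
              ["oil", "crude", "barrel"], ["oil", "opec"]]
train_labels = [["grain"], ["grain"], ["crude"], ["crude"]]

df = document_frequencies(train_docs)
train_vecs = [ltc_vector(d, df, len(train_docs)) for d in train_docs]
query_vec = ltc_vector(["wheat", "grain"], df, len(train_docs))
scores = knn_scores(query_vec, train_vecs, train_labels, k=2)
best = max(scores, key=scores.get)
```

With the parameters listed above, k would be 30 and the vocabulary would first be restricted to the 2000 selected words.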

Results

IG result file is now ready
Program Description
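The IG result file itself is not reproduced here. As a rough illustration of what an information gain computation for term selection can look like, the sketch below scores each term by how much knowing its presence or absence reduces the entropy of the category distribution. It assumes one category per document for simplicity, which may differ from the project's IG module; all names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0) if total else 0.0


def information_gain(docs, labels):
    """Return {term: IG}, assuming exactly one category label per document."""
    n = len(docs)
    class_counts = Counter(labels)
    h_classes = entropy(class_counts.values())

    # For every term, count the categories of the documents that contain it.
    present_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            present_counts[term][label] += 1

    gains = {}
    for term, present in present_counts.items():
        n_present = sum(present.values())
        absent = class_counts - present  # category counts of documents without the term
        gains[term] = (h_classes
                       - (n_present / n) * entropy(present.values())
                       - ((n - n_present) / n) * entropy(absent.values()))
    return gains


def top_terms(gains, k=2000):
    """Keep the k terms with the highest information gain."""
    return sorted(gains, key=gains.get, reverse=True)[:k]


# Toy data (hypothetical, not from the Reuters corpus).
docs = [["wheat", "export"], ["wheat", "harvest"], ["oil", "export"], ["oil", "opec"]]
labels = ["grain", "grain", "crude", "crude"]
gains = information_gain(docs, labels)
selected = top_terms(gains, k=2)
```

In the toy data "wheat" and "oil" separate the two categories perfectly and get the maximum gain of 1 bit, while "export" appears equally in both and gets a gain of 0.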

Demo

last update: Apr 13, 1998
