IR term project of Photina Jang

Basic Information

Contents


Abstract

This project uses feature selection methods (information gain (IG), document frequency (DF), and the chi-square statistic (Chi^2)) together with the kNN classifier for text categorization of large corpora. It tests whether applying a hierarchical category structure to a large corpus yields higher categorization performance than a flat structure alone. It also explores inducing new sets of unlabeled categories from the given hierarchy, formed as subsets or combinations of the given categories, that can improve categorization performance.
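As a rough illustration of the three feature selection metrics named above, the sketch below scores a term using the standard textbook definitions of DF, Chi^2, and IG. It is not the project's implementation (the metrics used here were provided by Yue Pan). It assumes each document is a set of tokens with exactly one category label; the function names and the toy_docs data are made up for the example.

```python
import math

# Toy corpus (invented for illustration): (set of tokens, single category label).
toy_docs = [
    ({"wheat", "farm"}, "grain"),
    ({"wheat", "export"}, "grain"),
    ({"oil", "export"}, "crude"),
    ({"oil", "barrel"}, "crude"),
]

def df_score(docs, term):
    """Document frequency: number of documents that contain the term."""
    return sum(1 for tokens, _ in docs if term in tokens)

def chi2_score(docs, term, category):
    """Chi-square statistic from the 2x2 term/category contingency table."""
    n = len(docs)
    a = sum(1 for t, c in docs if term in t and c == category)          # term, in category
    b = sum(1 for t, c in docs if term in t and c != category)          # term, other category
    c_ = sum(1 for t, c in docs if term not in t and c == category)     # no term, in category
    d = sum(1 for t, c in docs if term not in t and c != category)      # no term, other category
    denom = (a + c_) * (b + d) * (a + b) * (c_ + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c_ * b) ** 2 / denom

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k > 0)

def info_gain(docs, term):
    """Information gain: drop in category entropy after splitting on term presence."""
    labels = sorted({c for _, c in docs})
    all_labels = [c for _, c in docs]
    with_t = [c for t, c in docs if term in t]
    without_t = [c for t, c in docs if term not in t]

    def h(subset):
        return entropy([subset.count(label) for label in labels])

    n = len(docs)
    return (h(all_labels)
            - len(with_t) / n * h(with_t)
            - len(without_t) / n * h(without_t))
```

On the toy corpus, "wheat" separates the two categories perfectly while "export" is split evenly between them, so "wheat" scores high on Chi^2 and IG even though both terms have the same document frequency.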

Proposal

proposal.ps

Timelines

Task | to be done by | status
Baseline: apply kNN to flat structure to obtain baseline | March 6, 1998 | done
Subsetting: assign documents of training and test corpora separately to the given 2-level category hierarchy | March 13, 1998 | done
Feature Selection: apply IG, DF, Chi^2 feature selection metrics (provided by Yue Pan) to each subcategory of the hierarchy | March 20, 1998 | done
Data Analysis: observe the data distribution and the behavior of features in each category and subcategory | April 3, 1998 | done
Implementation: program a text categorization system using these feature selection methods | April 17, 1998 | done
Testing: on Reuters-ApteMod and Reuters-21578 | April 20, 1998 | done
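The difference between the flat baseline and the 2-level hierarchy in the tasks above can be sketched as follows. This is only an illustrative outline under stated assumptions, not the system built for the project: it assumes cosine similarity over raw term counts, a similarity-weighted vote among the k nearest neighbors, and one label per document. The names knn_predict, hierarchical_predict, parent_of, and the toy data are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_predict(train, query, k=3):
    """Flat kNN: label with the largest summed similarity among the k nearest docs."""
    scored = sorted(
        ((cosine(query, vec), label) for vec, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = Counter()
    for sim, label in scored:
        votes[label] += sim
    return votes.most_common(1)[0][0]

def hierarchical_predict(train, query, parent_of, k=3):
    """Two-level kNN: pick a top-level category, then a subcategory inside it."""
    top_train = [(vec, parent_of[label]) for vec, label in train]
    top = knn_predict(top_train, query, k)
    # Second stage only sees training documents from the chosen branch.
    branch = [(vec, label) for vec, label in train if parent_of[label] == top]
    return top, knn_predict(branch, query, k)

# Toy data, invented for illustration only.
toy_train = [
    (Counter("wheat harvest farm".split()), "wheat"),
    (Counter("corn harvest farm".split()), "corn"),
    (Counter("crude oil barrel".split()), "crude"),
    (Counter("natural gas pipeline".split()), "gas"),
]
toy_parent_of = {"wheat": "grain", "corn": "grain", "crude": "energy", "gas": "energy"}
toy_query = Counter("wheat farm export".split())
```

Calling knn_predict(toy_train, toy_query) gives the flat decision, while hierarchical_predict(toy_train, toy_query, toy_parent_of) first routes the query to a top-level category and then compares it only against documents in that branch, which is also where per-subcategory feature selection could be applied.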

Flat vs Hierarchical Structure:

System Description

Experiments

Preliminary Results

Results

Demo


last update: April 29, 1998