Use feature selection methods (IG, DF, Chi^2) with a kNN classifier for text categorization of large corpora. Investigate whether applying a hierarchical category structure to the large corpus yields higher text categorization performance than using only a flat structure. In addition, induce new sets of unlabeled categories from the given hierarchy, formed as subsets or combinations of the given categories, that may improve categorization performance.
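As a minimal sketch of the flat-structure pipeline described above, the following combines Chi^2 feature selection with a kNN classifier. The toy documents and labels are hypothetical stand-ins for a Reuters corpus, and scikit-learn is assumed as the implementation library (the project's actual feature selection code is provided by Yue Pan, not shown here):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for Reuters documents.
docs = [
    "oil prices rise on supply fears",
    "crude oil output cut by opec",
    "stocks rally as earnings beat forecasts",
    "shares climb on strong quarterly earnings",
]
labels = ["energy", "energy", "markets", "markets"]

pipeline = Pipeline([
    ("vec", CountVectorizer()),        # bag-of-words term counts
    ("sel", SelectKBest(chi2, k=2)),   # keep the 2 highest-scoring terms by Chi^2
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])
pipeline.fit(docs, labels)

print(pipeline.predict(["crude oil supply falls"])[0])  # → energy
```

On this toy data, "oil" and "earnings" receive the highest Chi^2 scores because each occurs only within one class, so they survive selection and kNN classifies unseen documents by their nearest training neighbor in the reduced feature space.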
| Task | to be done by | status |
|------|---------------|--------|
| Baseline: apply kNN to the flat category structure to obtain a baseline | | |
| Subsetting: assign the documents of the training and test corpora separately to the given 2-level category hierarchy | | |
| Feature Selection: apply the IG, DF, and Chi^2 feature selection metrics (provided by Yue Pan) to each subcategory of the hierarchy | | |
| Data Analysis: examine the data distribution and the behavior of features in each category and subcategory | | |
| Implementation: program a text categorization system that uses these feature selection methods | | |
| Testing: evaluate on the Reuters-ApteMod and Reuters-21578 corpora | | |
Flat vs Hierarchical Structure:
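A top-down pass over a 2-level hierarchy can be sketched as follows: one kNN routes a document to a top-level category, then a per-category kNN trained only on that category's documents picks the subcategory. The hierarchy, category names, and training snippets below are hypothetical illustrations, not the actual Reuters category tree:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: (text, top-level category, subcategory).
train = [
    ("crude oil output cut", "commodities", "energy"),
    ("gold futures edge higher", "commodities", "metals"),
    ("bank raises interest rates", "finance", "rates"),
    ("quarterly earnings beat forecasts", "finance", "earnings"),
]
texts = [t for t, _, _ in train]
tops = [top for _, top, _ in train]

vec = CountVectorizer().fit(texts)
X = vec.transform(texts)

# Level 1: route a document to a top-level category.
top_clf = KNeighborsClassifier(n_neighbors=1).fit(X, tops)

# Level 2: one classifier per top-level category,
# trained only on that category's documents.
sub_clf = {}
for top in set(tops):
    idx = [i for i, t in enumerate(tops) if t == top]
    sub_clf[top] = KNeighborsClassifier(n_neighbors=1).fit(
        X[idx], [train[i][2] for i in idx])

def categorize(doc):
    x = vec.transform([doc])
    top = top_clf.predict(x)[0]
    return top, sub_clf[top].predict(x)[0]

print(categorize("oil output rises"))  # → ('commodities', 'energy')
```

The point of comparison with the flat baseline is that each level-2 classifier sees only the documents of its own branch, so per-subcategory feature selection can pick terms that discriminate within the branch rather than across the whole corpus.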