Use feature selection methods (IG, DF, Chi^2) with a kNN classifier for text categorization of large corpora. Investigate whether applying a hierarchical category structure to the large corpus yields higher text categorization performance than using only a flat structure. In addition, induce new sets of unlabeled categories from the given hierarchy, formed as subsets or combinations of the given categories, that may improve categorization performance.
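As a minimal sketch of the flat-structure pipeline described above, the following combines Chi^2 feature selection with a kNN classifier. The toy documents and labels are hypothetical stand-ins for a Reuters corpus, and scikit-learn is assumed as the implementation library (the project's actual feature selection code is provided by Yue Pan, not shown here):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for Reuters documents.
docs = [
    "oil prices rise on supply fears",
    "crude oil output cut by opec",
    "stocks rally as earnings beat forecasts",
    "shares climb on strong quarterly earnings",
]
labels = ["energy", "energy", "markets", "markets"]

pipeline = Pipeline([
    ("vec", CountVectorizer()),        # bag-of-words term counts
    ("sel", SelectKBest(chi2, k=2)),   # keep the 2 highest-scoring terms by Chi^2
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])
pipeline.fit(docs, labels)

print(pipeline.predict(["crude oil supply falls"])[0])  # → energy
```

On this toy data, "oil" and "earnings" receive the highest Chi^2 scores because each occurs only within one class, so they survive selection and kNN classifies unseen documents by their nearest training neighbor in the reduced feature space.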
| Task | to be done by | status |
|------|---------------|--------|
| Baseline: apply kNN to the flat category structure to obtain a baseline | | |
| Subsetting: assign the documents of the training and test corpora separately to the given 2-level category hierarchy | | |
| Feature Selection: apply the IG, DF, and Chi^2 feature selection metrics (provided by Yue Pan) to each subcategory of the hierarchy | | |
| Data Analysis: examine the data distribution and the behavior of features in each category and subcategory | | |
| Implementation: program a text categorization system that uses these feature selection methods | | |
| Testing: evaluate on the Reuters-ApteMod and Reuters-21578 corpora | | |
Flat vs Hierarchical Structure:
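A top-down pass over a 2-level hierarchy can be sketched as follows: one kNN routes a document to a top-level category, then a per-category kNN trained only on that category's documents picks the subcategory. The hierarchy, category names, and training snippets below are hypothetical illustrations, not the actual Reuters category tree:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: (text, top-level category, subcategory).
train = [
    ("crude oil output cut", "commodities", "energy"),
    ("gold futures edge higher", "commodities", "metals"),
    ("bank raises interest rates", "finance", "rates"),
    ("quarterly earnings beat forecasts", "finance", "earnings"),
]
texts = [t for t, _, _ in train]
tops = [top for _, top, _ in train]

vec = CountVectorizer().fit(texts)
X = vec.transform(texts)

# Level 1: route a document to a top-level category.
top_clf = KNeighborsClassifier(n_neighbors=1).fit(X, tops)

# Level 2: one classifier per top-level category,
# trained only on that category's documents.
sub_clf = {}
for top in set(tops):
    idx = [i for i, t in enumerate(tops) if t == top]
    sub_clf[top] = KNeighborsClassifier(n_neighbors=1).fit(
        X[idx], [train[i][2] for i in idx])

def categorize(doc):
    x = vec.transform([doc])
    top = top_clf.predict(x)[0]
    return top, sub_clf[top].predict(x)[0]

print(categorize("oil output rises"))  # → ('commodities', 'energy')
```

The point of comparison with the flat baseline is that each level-2 classifier sees only the documents of its own branch, so per-subcategory feature selection can pick terms that discriminate within the branch rather than across the whole corpus.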