Kavita Thomas' IR Term Project

Basic Information

Project Title: Text categorization: multiple approaches
Name: Kavita Thomas (email: kavita@cs.cmu.edu)

Presentation Date: Tuesday 28 April 1998
Demo Date: <...>

Contents

Abstract

Proposal and Timelines

System Description

Experiments

Results

Demo

Abstract

This project will empirically explore text categorisation with decision tree (DTree), naive Bayes (Rainbow), Support Vector Machine (SVM) and kNN classifiers over a large corpus (Reuters 21578) with multiple, often overlapping categories. Specifically, the parameters of each classifier will first be optimised, so that the classifiers can be compared with each one performing at its best. Different feature selection schemes on the corpus will then be explored. Results will be compared with and without labelled data, and with and without separate training and testing examples for each category. The importance of the correlation between a human-assigned category label and the appearance of that label in a given document will be examined by comparing performance (using optimal parameters and feature selection schemes) across two disparate corpora: Reuters 21578 and Ohsumed (which has less direct correspondence between the words in a document and its category label).
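To make the idea of parameter optimisation over overlapping categories concrete, here is a minimal sketch in Python. It is not the project's actual setup (DTree, Rainbow and SVM are separate tools): it is a toy kNN categoriser with cosine similarity over raw term counts, where the number of neighbours k is picked on held-out documents by micro-averaged F1. All function names, the 0.5 score threshold and the toy documents are assumptions made up for illustration.

```python
from collections import Counter
import math

def tf_vector(text):
    """Term-frequency vector for a whitespace-tokenised document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def knn_predict(doc, train, k, threshold=0.5):
    """Return every category whose share of the k nearest neighbours'
    similarity mass is at least `threshold` (an assumed cut-off).
    A document may therefore receive several labels, or none."""
    vec = tf_vector(doc)
    neighbours = sorted(
        ((cosine(vec, tf_vector(text)), labels) for text, labels in train),
        key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return set()
    scores = Counter()
    for sim, labels in neighbours:
        for label in labels:
            scores[label] += sim
    return {label for label, s in scores.items() if s / total >= threshold}

def micro_f1(predicted, gold):
    """Micro-averaged F1 over all (document, category) decisions."""
    tp = sum(len(p & g) for p, g in zip(predicted, gold))
    fp = sum(len(p - g) for p, g in zip(predicted, gold))
    fn = sum(len(g - p) for p, g in zip(predicted, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_k(train, held_out, candidates):
    """Score each candidate k on held-out documents; return the best k
    together with the full table of scores."""
    gold = [labels for _, labels in held_out]
    scores = {}
    for k in candidates:
        predicted = [knn_predict(text, train, k) for text, _ in held_out]
        scores[k] = micro_f1(predicted, gold)
    return max(scores, key=scores.get), scores

# Toy documents (invented); each carries a set of category labels.
train = [
    ("wheat grain harvest", {"grain", "wheat"}),
    ("oil crude barrel", {"crude"}),
    ("wheat export grain", {"grain"}),
]
held_out = [
    ("wheat harvest", {"grain", "wheat"}),
    ("crude oil", {"crude"}),
]

if __name__ == "__main__":
    k, table = best_k(train, held_out, candidates=[1, 2, 3])
    print("best k:", k)
    print("micro-F1 per k:", table)
```

The same loop shape would apply to any classifier parameter: fix a held-out set, sweep the candidate values, and keep the value with the best score. Micro-averaging is only one possible choice here; macro-averaging would weight rare categories more heavily.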

Here's a link to my <Midterm>

Proposal and Timelines

<Proposal>
Revised Progress Chart:

Task	to be done by	status
Personal Project Web-Page. Do all relevant reading (references).	Tu 3 Mar	done
Initial evaluation of web-based classification schemes and data-set; i.e., how much pre-processing will I need to do before I can run the data through.	F 6 Mar	done 15 Mar
<Preliminary Results> Determine what experiments need to be performed and how to make sure they are valid tests.	Tu 17 Mar	done
Compare results with literature, draw conclusions, and design further tests. Write this up in Experiments section.	Tu 24 Mar	done 25 Mar
Complete parameter optimisation on all classifiers on both data sets. Determine best feature selection method for each classifier on both Reuters21578 and Ohsumed data sets. Write up results and prepare tables/graphs.	Su 29 Mar	done on DTree--rest to be completed by Mon 20 Apr latest
Compare both data sets with optimal parameters and feature selectors on classifiers with and without labelled data. Write up results and prepare tables/graphs.	M 6 April	to be done ASAP!!
Demo decision (hard date)	Tu 14 Apr	done--poster presentation
Perform any last-minute experiments and write up results.	M 20 Apr
Final Project Report (hand-in)	Th 23 Apr
Final Presentation	Tu 28 Apr

System Description

...forthcoming...

<Experiments>

<Results>

Demo

...coming by 14 April...

last update: 13 April 1998
