Kavita Thomas' IR Term Project

Basic Information

Contents

Abstract
Proposal and Timelines
System Description
Experiments
Results
Demo

Abstract

This project will empirically explore text categorisation with decision tree (DTree), naive Bayes (Rainbow), Support Vector Machine (SVM) and kNN classifiers over a large corpus (Reuters 21578) with multiple, often overlapping categories. Specifically, the parameters of each classifier will be optimised so that all classifiers are compared at their best performance. Different feature selection schemes on the corpus will then be explored. Results will be compared with and without labelled data, and with and without separate training and testing examples for each category. Finally, the importance of the correlation between a human-assigned category label and the words appearing in the document will be examined by comparing performance (using the optimal parameters and feature selection schemes) across two disparate corpora: Reuters 21578 and Ohsumed, which has a less direct correspondence between the words in a document and its category label.
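
As a rough illustration of the experimental design only (not the project's actual code or tooling, which relies on period packages such as Rainbow), the sketch below compares the four classifier families over a shared feature-selection step with a small parameter grid per classifier. The corpus loader, feature counts, and parameter grids are assumptions chosen for illustration, and a single-label stand-in corpus replaces the multi-label Reuters 21578 / Ohsumed setup described above.

# Illustrative sketch of the comparison pipeline; all concrete choices here
# (corpus, k for feature selection, parameter grids) are assumptions.
from sklearn.datasets import fetch_20newsgroups          # stand-in corpus
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

# One entry per classifier family, each with a small parameter grid
# standing in for the per-classifier parameter optimisation.
classifiers = {
    "naive_bayes": (MultinomialNB(), {"clf__alpha": [0.1, 1.0]}),
    "knn": (KNeighborsClassifier(), {"clf__n_neighbors": [5, 15, 45]}),
    "svm": (LinearSVC(), {"clf__C": [0.1, 1.0, 10.0]}),
    "dtree": (DecisionTreeClassifier(), {"clf__max_depth": [20, None]}),
}

for name, (clf, grid) in classifiers.items():
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("select", SelectKBest(chi2, k=2000)),   # shared feature-selection step
        ("clf", clf),
    ])
    search = GridSearchCV(pipe, grid, cv=3)
    search.fit(train.data, train.target)
    # Report the best parameters found and held-out accuracy for each classifier.
    print(name, search.best_params_, search.score(test.data, test.target))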

Here's a link to my <Midterm>

Proposal and Timelines

<Proposal>

Revised Progress Chart:

Task | To be done by | Status
Personal project web-page. Do all relevant reading (references). | Tu 3 Mar | done
Initial evaluation of web-based classification schemes and data set, i.e. how much pre-processing I will need to do before I can run the data through. | F 6 Mar | done 15 Mar
<Preliminary Results> Determine what experiments need to be performed and how to make sure they are valid tests. | Tu 17 Mar | done
Compare results with the literature, draw conclusions, and design further tests. Write this up in the Experiments section. | Tu 24 Mar | done 25 Mar
Complete parameter optimisation on all classifiers on both data sets. Determine the best feature selection method for each classifier on both the Reuters 21578 and Ohsumed data sets. Write up results and prepare tables/graphs. | Su 29 Mar | done for DTree; rest to be completed by Mon 20 Apr at the latest
Compare both data sets with optimal parameters and feature selectors on classifiers, with and without labelled data. Write up results and prepare tables/graphs. | M 6 Apr | to be done ASAP
Demo decision (hard date) | Tu 14 Apr | done (poster presentation)
Perform any last-minute experiments and write up results. | M 20 Apr |
Final Project Report (hand-in) | Th 23 Apr |
Final Presentation | Tu 28 Apr |

System Description

...forthcoming...

<Experiments>

<Results>

Demo

...coming by 14 April...


last update: 13 April 1998