This project will empirically explore text categorisation with decision tree (DTree), naive Bayes (Rainbow), Support Vector Machine (SVM) and kNN classifiers over a large corpus (Reuters 21578) with multiple, often overlapping categories. Specifically, the parameters of each classifier will be optimised, so that each classifier is compared at its best-performing settings rather than at arbitrary defaults. Different feature selection schemes on the corpus will then be explored. Results will be compared with and without labelled data, and with and without separate training and testing examples for each category. The importance of the correlation between a human-assigned category label and its appearance in a given document will be examined by comparing performance (using optimal parameters and feature selection schemes) across two disparate corpora: Reuters 21578 and Ohsumed (which has a less direct correspondence between the words in a document and its category label).
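As a rough illustration of the parameter optimisation step described above, here is a minimal sketch for the kNN case. It is not the code used in this project: the function names, the toy documents and the candidate values of k are all hypothetical, and it simplifies to one label per document scored by plain accuracy, whereas Reuters 21578 documents can carry several overlapping labels.

```python
from collections import Counter
import math

def bag_of_words(text):
    """Lowercased whitespace tokens -> term-frequency Counter."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(train, doc, k):
    """Majority label among the k training documents most similar to doc."""
    ranked = sorted(train, key=lambda ex: cosine(ex[0], doc), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def accuracy(train, held_out, k):
    """Fraction of held-out documents whose label is predicted correctly."""
    hits = sum(knn_predict(train, d, k) == y for d, y in held_out)
    return hits / len(held_out)

def best_k(train, held_out, candidates):
    """Candidate k with the highest held-out accuracy (ties go to the smallest k)."""
    return max(sorted(candidates), key=lambda k: accuracy(train, held_out, k))

# Hypothetical toy data standing in for a real corpus split.
train = [(bag_of_words(t), y) for t, y in [
    ("wheat corn harvest", "grain"),
    ("corn wheat export", "grain"),
    ("oil crude barrel", "crude"),
    ("crude oil price", "crude"),
]]
held_out = [(bag_of_words(t), y) for t, y in [
    ("wheat harvest export", "grain"),
    ("oil barrel price", "crude"),
]]

chosen_k = best_k(train, held_out, [1, 3])
```

The same pattern (score each candidate setting on held-out data, keep the best) would apply to the other classifiers, with the parameter grid and the evaluation measure swapped for whatever suits the classifier and the multi-label setting.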
Here's a link to my <Midterm>
Proposal and Timelines
<Proposal>
Revised Progress Chart:
Task | To be done by | Status |
done | ||
done 15 Mar | ||
done | ||
done 25 Mar | ||
done on DTree; rest to be completed by Mon 20 Apr at the latest | ||
to be done ASAP!! | ||
done--poster presentation | ||
...forthcoming...
...coming by 14 April...