IR term project of Cenk Gazen

Basic Information

Abstract

The topic-tracking task can be described as follows: the system is given a sequence of news stories, and a target topic is defined by a few training stories on that topic. The task of the system is to decide whether each incoming story is on the target topic or not. A key system parameter, named Nt, is the number of stories used to define the target topic. My main goal is to investigate whether the performance of the system can be improved by some form of unsupervised learning.

Proposal and Timelines

My proposal and progress report for this project are available in PostScript format.

Task                                  By      Status
Get classifiers running on corpus     Mar 10  Completed.
Develop unsupervised learning system  Mar 31  Experiments with a preliminary system based on DTree started.
                                      Apr 14  The framework for running classifiers in 'unsupervised' mode has been developed. Currently running fresh experiments with all the classifiers.
Run experiments                       Apr 14

System Description

I will develop a system that takes Nt training examples, uses the kNN approach (or something similar with a threshold on 'similarity') to find similar stories (say, k of them), and thereby grows the training set from Nt to Nt+k examples. My hypothesis is that this system will perform better than the system using only Nt training examples, but most probably not as well as a system using Nt+k human-classified examples. The idea is similar in concept to the pseudo-relevance feedback mechanism in MLIR, as sketched below.
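
The following is only a rough sketch of this expansion step, assuming TF-IDF vectors and cosine similarity stand in for the 'similarity' measure; the function names, the centroid comparison, and the threshold value are all illustrative, not part of the actual system.

    # Sketch of pseudo-relevance expansion: grow the Nt on-topic stories
    # to Nt+k by pulling in unlabeled stories similar to the topic.
    # Assumes TF-IDF + cosine similarity as the similarity measure.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def expand_training_set(on_topic, unlabeled, threshold=0.3):
        vectorizer = TfidfVectorizer()
        vectors = vectorizer.fit_transform(on_topic + unlabeled)
        topic_vectors = vectors[:len(on_topic)]
        candidate_vectors = vectors[len(on_topic):]
        # Compare each unlabeled story to the centroid of the Nt topic stories.
        centroid = np.asarray(topic_vectors.mean(axis=0))
        scores = cosine_similarity(candidate_vectors, centroid).ravel()
        # The k stories above the threshold become pseudo-relevant examples.
        return on_topic + [s for s, sc in zip(unlabeled, scores) if sc >= threshold]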

As a side goal, I would like to empirically compare other text classification methods such as DTree, kNN, Naive Bayes, and Support Vector Machines.

Experiments

The general setup for the experiments will be to feed each system with a varying number of training examples on one topic and observe its performance. Detailed information on the experiments and evaluation is available at the Linguistic Data Consortium's webpage.

In principle, it should be possible to use the pseudo-relevance approach with all of the classifiers mentioned above. So I will test each system twice: once with and once without unsupervised learning. Besides the usual evaluation measures of recall and precision, I also plan to use DET curves to compare the results.
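
As a sketch of what goes into a DET curve (this reflects the standard definition, not code from the evaluation package): the miss and false-alarm rates are computed at every decision threshold, and both axes are warped by the inverse normal CDF.

    # Compute (false alarm, miss) operating points for a DET curve from
    # system scores. Assumes binary labels (1 = on-topic) and real-valued
    # scores where higher means more likely on-topic.
    import numpy as np
    from scipy.stats import norm

    def det_points(scores, labels):
        order = np.argsort(scores)[::-1]        # accept highest scores first
        labels = np.asarray(labels)[order]
        hits = np.cumsum(labels)                # on-topic stories accepted
        false_alarms = np.cumsum(1 - labels)    # off-topic stories accepted
        p_miss = 1.0 - hits / labels.sum()
        p_fa = false_alarms / (len(labels) - labels.sum())
        return p_fa, p_miss

    # A DET plot uses normal-deviate axes, e.g.
    # plot(norm.ppf(p_fa), norm.ppf(p_miss)).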

Framework for using unlabeled data

To be able to compare different classifiers accurately, I developed a framework that ensures all the systems are tested on the same corpus splits and all the outputs are evaluated in exactly the same way. A high-level description of how the system works is as follows:

[diagram: high-level view of the framework's components]

To use a different classifier, only the sub-systems shown in italic type in the diagram need to be modified.
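
In rough outline, the framework can be thought of as something like the sketch below; the class and function names are illustrative, not the framework's own. The classifier-specific train/classify pieces are the parts that get swapped out per system, while the corpus splits and scoring are shared. The expand_training_set step is the pseudo-relevance sketch from the System Description section.

    # Illustrative outline of the framework: every system runs over the
    # same corpus splits and is scored the same way; only the
    # classifier-specific methods differ between systems.
    class TopicTracker:
        def train(self, stories):
            """Classifier-specific: fit to the (possibly expanded) training set."""
            raise NotImplementedError
        def classify(self, story):
            """Classifier-specific: return True if the story is on-topic."""
            raise NotImplementedError

    def run_experiment(tracker_cls, splits, unsupervised=False):
        """Each split is (labeled stories, unlabeled stories, test stories)."""
        all_decisions = []
        for labeled, unlabeled, test in splits:
            tracker = tracker_cls()
            training = expand_training_set(labeled, unlabeled) if unsupervised else labeled
            tracker.train(training)
            all_decisions.append([tracker.classify(story) for story in test])
        return all_decisions  # evaluated identically for every classifier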

Results

Detailed DTree results

Detailed kNN results

Detailed Rainbow (Naive Bayes) results; unsupervised learning results

Detailed SVM results

Summary of results

Demo


last update: 4/14/1998