IR term project of Cenk Gazen

Basic Information

Abstract

The topic-tracking task can be described as follows: the system is given a sequence of news stories, and a target topic is defined by a few training stories on that topic. The task of the system is to decide whether each incoming story is on the target topic or not. A key system parameter, named Nt, is the number of stories used to define the target topic. My main goal is to investigate whether the performance of the system can be improved by some form of unsupervised learning.

Proposal and Timelines

My proposal and progress report for this project are available in PostScript format.

Task                                  By      Status
Get classifiers running on corpus     Mar 10  Completed.
Develop unsupervised learning system  Mar 31  Experiments with a preliminary system based on DTree started.
                                      Apr 14  The framework for running classifiers in 'unsupervised' mode has been developed. Currently running fresh experiments with all the classifiers.
Run experiments                       Apr 14

System Description

I will develop a system that takes Nt training examples, uses the kNN approach (or something similar with a threshold on 'similarity') to find similar stories (say, k of them), and thereby grows the training set from Nt to Nt+k examples. My hypothesis is that this system will perform better than the system using only Nt training examples, but most probably not as well as a system using Nt+k human-classified examples. The idea is similar in concept to the pseudo-relevance feedback mechanism in MLIR, as sketched below.
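
The following is only a rough sketch of this expansion step, assuming TF-IDF vectors and cosine similarity stand in for the 'similarity' measure; the function names, the centroid comparison, and the threshold value are all illustrative, not part of the actual system.

    # Sketch of pseudo-relevance expansion: grow the Nt on-topic stories
    # to Nt+k by pulling in unlabeled stories similar to the topic.
    # Assumes TF-IDF + cosine similarity as the similarity measure.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def expand_training_set(on_topic, unlabeled, threshold=0.3):
        vectorizer = TfidfVectorizer()
        vectors = vectorizer.fit_transform(on_topic + unlabeled)
        topic_vectors = vectors[:len(on_topic)]
        candidate_vectors = vectors[len(on_topic):]
        # Compare each unlabeled story to the centroid of the Nt topic stories.
        centroid = np.asarray(topic_vectors.mean(axis=0))
        scores = cosine_similarity(candidate_vectors, centroid).ravel()
        # The k stories above the threshold become pseudo-relevant examples.
        return on_topic + [s for s, sc in zip(unlabeled, scores) if sc >= threshold]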

As a side goal, I would like to empirically compare other text classification methods such as DTree, kNN, Naive Bayes, and Support Vector Machines.

Experiments

The general setup for the experiments will be to feed each system with a varying number of training examples on one topic and observe its performance. Detailed information on the experiments and evaluation is available at the Linguistic Data Consortium's webpage.

In principle, it should be possible to use the pseudo-relevance approach with all of the classifiers mentioned above. So I will test each system twice: once with and once without unsupervised learning. Besides the usual evaluation measures of recall and precision, I also plan to use DET curves to compare the results.
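
As a sketch of what goes into a DET curve (this reflects the standard definition, not code from the evaluation package): the miss and false-alarm rates are computed at every decision threshold, and both axes are warped by the inverse normal CDF.

    # Compute (false alarm, miss) operating points for a DET curve from
    # system scores. Assumes binary labels (1 = on-topic) and real-valued
    # scores where higher means more likely on-topic.
    import numpy as np
    from scipy.stats import norm

    def det_points(scores, labels):
        order = np.argsort(scores)[::-1]        # accept highest scores first
        labels = np.asarray(labels)[order]
        hits = np.cumsum(labels)                # on-topic stories accepted
        false_alarms = np.cumsum(1 - labels)    # off-topic stories accepted
        p_miss = 1.0 - hits / labels.sum()
        p_fa = false_alarms / (len(labels) - labels.sum())
        return p_fa, p_miss

    # A DET plot uses normal-deviate axes, e.g.
    # plot(norm.ppf(p_fa), norm.ppf(p_miss)).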

Framework for using unlabeled data

To be able to compare different classifiers accurately, I developed a framework that ensures all the systems are tested on the same corpus splits and all the outputs are evaluated in exactly the same way. A high-level description of how the system works is as follows:

[diagram: high-level view of the framework's components]

To use a different classifier, only the sub-systems shown in italic type in the diagram need to be modified.
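
In rough outline, the framework can be thought of as something like the sketch below; the class and function names are illustrative, not the framework's own. The classifier-specific train/classify pieces are the parts that get swapped out per system, while the corpus splits and scoring are shared. The expand_training_set step is the pseudo-relevance sketch from the System Description section.

    # Illustrative outline of the framework: every system runs over the
    # same corpus splits and is scored the same way; only the
    # classifier-specific methods differ between systems.
    class TopicTracker:
        def train(self, stories):
            """Classifier-specific: fit to the (possibly expanded) training set."""
            raise NotImplementedError
        def classify(self, story):
            """Classifier-specific: return True if the story is on-topic."""
            raise NotImplementedError

    def run_experiment(tracker_cls, splits, unsupervised=False):
        """Each split is (labeled stories, unlabeled stories, test stories)."""
        all_decisions = []
        for labeled, unlabeled, test in splits:
            tracker = tracker_cls()
            training = expand_training_set(labeled, unlabeled) if unsupervised else labeled
            tracker.train(training)
            all_decisions.append([tracker.classify(story) for story in test])
        return all_decisions  # evaluated identically for every classifier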

Results

Detailed DTree results

Detailed kNN results

Detailed Rainbow (Naive Bayes) results; unsupervised learning results

Detailed SVM results

Summary of results

Demo


last update: 4/14/1998