IR term project of Hua Yu

Basic Information


Abstract

- Discover structures in a collection of text
- Automatically determine the number of clusters

Proposal and Timelines

proposal

... to keep track of progress...:

| Task | To be done by | Status |
|---|---|---|
| Run GAC on the SWB collection to see how the results differ with different numbers of target clusters | | Done |
| Cluster in 1d using the BIC criterion | Fri Mar.6 | Done |
| Cluster in 2d using the EM algorithm | | Done |
| Cluster in 2d using annealed EM | | Done |
| Study how the log-likelihood ratio behaves as the number of Gaussian mixture components varies | Apr.7 | Done |
| Experiment with the TDT corpus to see how to set the right number of clusters | Apr.19 | TBD |
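
To make the 1d BIC task above concrete, here is a minimal sketch of choosing the number of clusters with BIC. It is an illustration only, not the code used in this project. Everything in it is an assumption: the function names (`fit_gmm_1d`, `choose_k_by_bic`), the initialisation from random data points, the variance floor, the use of a few restarts, and the synthetic data in the usage part. Each candidate number of clusters k is fitted as a 1d Gaussian mixture with plain EM, scored with BIC = -2 log L + p log n (p = 3k - 1 free parameters), and the k with the lowest score is kept.

```python
import math
import random


def log_gauss(x, mean, var):
    """Log density of a 1d Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def point_log_terms(x, weights, means, variances):
    """Per-component log(weight * density) for one point."""
    return [math.log(w) + log_gauss(x, m, v)
            for w, m, v in zip(weights, means, variances)]


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def fit_gmm_1d(xs, k, iters=100, seed=0):
    """Fit a k-component 1d Gaussian mixture with plain EM; return its log-likelihood."""
    rng = random.Random(seed)
    n = len(xs)
    grand_mean = sum(xs) / n
    grand_var = sum((x - grand_mean) ** 2 for x in xs) / n
    # Assumption: a variance floor keeps components from collapsing onto single points.
    var_floor = max(1e-3 * grand_var, 1e-12)
    # Assumption: initial means are random data points, with a shared initial variance.
    means = rng.sample(xs, k)
    variances = [max(grand_var, var_floor)] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            terms = point_log_terms(x, weights, means, variances)
            norm = log_sum_exp(terms)
            resp.append([math.exp(t - norm) for t in terms])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-9)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, var_floor)
    return sum(log_sum_exp(point_log_terms(x, weights, means, variances)) for x in xs)


def bic(loglik, k, n):
    """BIC = -2 log L + p log n, with p = 3k - 1 free parameters; lower is better."""
    return -2.0 * loglik + (3 * k - 1) * math.log(n)


def choose_k_by_bic(xs, max_k=4, restarts=3):
    """Return (best_k, {k: BIC}) for k = 1..max_k, keeping the best of a few EM restarts."""
    scores = {}
    for k in range(1, max_k + 1):
        best_ll = max(fit_gmm_1d(xs, k, seed=s) for s in range(restarts))
        scores[k] = bic(best_ll, k, len(xs))
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Synthetic data (an assumption): two well-separated 1d clusters.
    rng = random.Random(1)
    data = [rng.gauss(0, 1) for _ in range(150)] + [rng.gauss(10, 1) for _ in range(150)]
    best_k, scores = choose_k_by_bic(data, max_k=3)
    print(best_k, scores)
```

The penalty term p log n is what stops the score from always favouring more clusters: adding a component can only raise the likelihood, so the number of clusters is chosen where the likelihood gain no longer pays for the extra parameters.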

System Description

For the final text clustering system, I'll use the same preprocessed document vector representation that GAC uses. On top of this representation, I'll show how the clusters are obtained.

Experiments

Refer to the timetable under Proposal and Timelines.

Results

Some nice pictures of Gaussian mixture modelling of 2d data are available; contact me if you would like to see them.

Demo

... plans about what to demo...


last update: Apr.14, 1998