
IR term project of Hua Yu

Basic Information

Project Title: Text Clustering
Name: Hua Yu (email: hyu@cs)
Presentation Date: Thu Apr 23rd
Demo Date: TBA

Abstract
Proposal and Timelines
System Description
Experiments
Results
Demo

Abstract

- Discover structure in a collection of text documents
- Automatically determine the number of clusters

Proposal and Timelines

proposal

... to keep track of progress...:

Task	To be done by	Status

Run GAC on the SWB collection to see how the number of target clusters affects the result		Done
Cluster in 1d using the BIC criterion	Fri Mar. 6	Done
Cluster in 2d using the EM algorithm		Done
Cluster in 2d using annealed EM		Done
Study how the log-likelihood ratio behaves as the number of Gaussian mixture components varies	Apr. 7	Done
Experiment with the TDT corpus to see how to set the right number of clusters	Apr. 19	TBD

System Description

For the final text clustering system, I'll use the same preprocessed document-vector representation that GAC uses. On top of this representation I'll show how the clusters are obtained.
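To make the idea of clustering over document vectors concrete, here is a minimal sketch. It is not the GAC implementation used in this project: it assumes plain term-frequency vectors, cosine similarity, and a simple average-link merge rule, and every function name in it (`tf_vectors`, `cosine`, `group_average`, `cluster`) is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tf_vectors(docs):
    """Turn each document into a unit-length term-frequency vector (a dict)."""
    vectors = []
    for text in docs:
        counts = Counter(text.lower().split())
        # "or 1.0" avoids dividing by zero for an empty document.
        norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
        vectors.append({term: c / norm for term, c in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def group_average(a, b, vectors):
    """Mean pairwise similarity between the members of clusters a and b."""
    total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def cluster(docs, target_clusters):
    """Merge the most similar pair of clusters until target_clusters remain."""
    vectors = tf_vectors(docs)
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > target_clusters:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: group_average(clusters[p[0]], clusters[p[1]], vectors),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return [sorted(c) for c in clusters]
```

Note that `target_clusters` has to be supplied by the caller, which is exactly the parameter this project tries to determine automatically.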

Experiments

See the timetable under Proposal and Timelines.

Results

1d segmentation using the BIC criterion works pretty well.
EM clustering of 2d data works pretty well if the number of clusters is known beforehand.
EM clustering of 2d data when the number of clusters has to be set automatically: this is more complicated because of sampling noise, overfitting, and the EM convergence rate, but it is doable.
Running GAC over the SWB collection with different numbers of target clusters validated the intuition that, to get a meaningful clustering, we have to tell GAC the right number of clusters.
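As an illustration of how the number of clusters might be chosen from the data, the sketch below fits 1-D Gaussian mixtures with EM for several component counts and keeps the count with the lowest BIC. It is a simplified stand-in, not the code behind the results above: the quantile-based initialization, the variance floor, the fixed iteration count, and the names `fit_mixture`, `bic`, and `choose_k` are all assumptions made for this example.

```python
import math
from statistics import NormalDist


def fit_mixture(xs, k, iters=200):
    """Run EM for a k-component 1-D Gaussian mixture; return (loglik, params)."""
    n = len(xs)
    ordered = sorted(xs)
    mean_all = sum(xs) / n
    var_all = sum((x - mean_all) ** 2 for x in xs) / n
    floor = 1e-6 * var_all + 1e-12  # variance floor: stops a component collapsing
    # Deterministic start: means at evenly spaced quantiles, shared variance.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    variances = [max(var_all, floor)] * k
    weights = [1.0 / k] * k
    loglik = -math.inf
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        comps = [NormalDist(m, math.sqrt(v)) for m, v in zip(means, variances)]
        resp, loglik = [], 0.0
        for x in xs:
            dens = [w * c.pdf(x) for w, c in zip(weights, comps)]
            total = max(sum(dens), 1e-300)  # guard against log(0)
            loglik += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj,
                floor,
            )
    # loglik is the log-likelihood of the parameters entering the last iteration.
    return loglik, list(zip(weights, means, variances))


def bic(loglik, k, n):
    """BIC = -2 log L + p log n, with p = 3k - 1 free parameters."""
    return -2.0 * loglik + (3 * k - 1) * math.log(n)


def choose_k(xs, max_k=4):
    """Return the component count with the lowest BIC, plus all the scores."""
    scores = {k: bic(fit_mixture(xs, k)[0], k, len(xs)) for k in range(1, max_k + 1)}
    return min(scores, key=scores.get), scores
```

Calling `choose_k(data)` returns the selected count together with the BIC score for each candidate, so the scores can be inspected as well. The penalty term is what counters overfitting here; sampling noise and slow EM convergence can still mislead the selection, which is why a single deterministic start like this one is only a rough guide.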

Some nice pictures of the Gaussian mixture modelling of 2d data are available; contact me if you would like to see them.

Demo

... plans about what to demo...

last update: Apr.14, 1998