IR term project of Hua Yu
Basic Information
- Project Title: Text Clustering
- Name: Hua Yu (email: hyu@cs)
- Presentation Date: Thu Apr 23rd
- Demo Date: TBA
Contents
Abstract
- Discover structures in a collection of text
- Automatically determine # clusters
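One standard way to choose the number of clusters automatically is the Bayesian Information Criterion (BIC). The page does not state the exact form used in this project, so the block below is only the textbook definition, with assumed symbols: X is the data, N the number of data points, K the number of clusters, \hat{\theta}_K the maximum-likelihood parameters of the K-cluster model, and d_K its number of free parameters.

```latex
\mathrm{BIC}(K) = \log L\bigl(X \mid \hat{\theta}_K\bigr) - \frac{d_K}{2}\,\log N,
\qquad
\hat{K} = \arg\max_{K}\ \mathrm{BIC}(K)
```

The first term rewards fit and the second penalizes model size, so adding a cluster is accepted only when the likelihood gain outweighs the extra parameters.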
Proposal and Timelines
proposal
... to keep track of progress...:
Task | to be done by | status
Run GAC on the SWB collection to see the effect of using different numbers of target clusters | | Done
Cluster in 1d using the BIC criterion | Fri Mar.6 | Done
Cluster in 2d using the EM algorithm | | Done
Cluster in 2d using annealed EM | | Done
Study log-likelihood ratio behavior with varying numbers of Gaussian mixtures | Apr.7 | Done
Experiment with the TDT corpus to see how to set the right number of clusters | Apr.19 | TBD
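The "annealed EM" task above is not described further on this page. As a minimal sketch, assuming it means EM with tempered posteriors (the E-step responsibilities are raised to a power beta that grows towards 1), the 2d case could look like the code below. The spherical covariance, the beta schedule, and the names `em_step` and `annealed_em` are illustrative assumptions, not the project's code.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a spherical 2-d Gaussian (same variance on both axes)."""
    d2 = (x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2
    return -math.log(2 * math.pi * var) - d2 / (2 * var)

def em_step(points, weights, means, variances, beta=1.0):
    """One EM iteration; beta < 1 flattens the posteriors (annealing)."""
    k, n = len(weights), len(points)
    resp = []
    for x in points:
        # E-step: tempered log posteriors, normalised via log-sum-exp.
        logs = [beta * (math.log(weights[j]) + log_gauss(x, means[j], variances[j]))
                for j in range(k)]
        top = max(logs)
        exps = [math.exp(v - top) for v in logs]
        total = sum(exps)
        resp.append([e / total for e in exps])
    new_w, new_m, new_v = [], [], []
    for j in range(k):
        # M-step: responsibility-weighted weight, mean and variance.
        nj = max(sum(r[j] for r in resp), 1e-12)  # guard against empty components
        mx = sum(r[j] * x[0] for r, x in zip(resp, points)) / nj
        my = sum(r[j] * x[1] for r, x in zip(resp, points)) / nj
        sq = sum(r[j] * ((x[0] - mx) ** 2 + (x[1] - my) ** 2)
                 for r, x in zip(resp, points))
        new_w.append(nj / n)
        new_m.append((mx, my))
        new_v.append(max(sq / (2 * nj), 1e-6))  # variance floor
    return new_w, new_m, new_v

def annealed_em(points, init_means, betas=(0.6, 0.8, 1.0), iters=25):
    """Run EM at each beta in turn, ending with ordinary EM (beta = 1)."""
    k, n = len(init_means), len(points)
    cx = sum(x[0] for x in points) / n
    cy = sum(x[1] for x in points) / n
    # Start every component with the overall per-axis variance of the data.
    var0 = sum((x[0] - cx) ** 2 + (x[1] - cy) ** 2 for x in points) / (2 * n)
    weights = [1.0 / k] * k
    means = list(init_means)
    variances = [max(var0, 1e-6)] * k
    for beta in betas:
        for _ in range(iters):
            weights, means, variances = em_step(points, weights, means,
                                                variances, beta)
    return weights, means, variances
```

Calling `annealed_em(points, [(2.0, 2.0), (8.0, 8.0)])` on a list of `(x, y)` tuples returns the mixture weights, means, and per-axis variances. The number of components is fixed by the length of `init_means`, so this sketch does not by itself decide how many clusters to use.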
System Description
For the final text clustering system, I'll use the same preprocessed
document vector representation that GAC uses. On top of this I'll show
how the clusters are obtained.
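To make the setup concrete, here is a rough sketch that assumes GAC stands for group-average agglomerative clustering over unit-length term vectors compared by cosine similarity. The page does not spell out the preprocessing, so the whitespace tokenizer, the plain term-frequency weighting, and the function names `tf_vector`, `cosine`, `group_average`, and `gac` are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def tf_vector(text):
    """Unit-length term-frequency vector as a sparse dict (text must be non-empty)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()}

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(term, 0.0) for term, w in u.items())

def group_average(a, b, vectors):
    """Mean pairwise similarity between the members of clusters a and b."""
    total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
    return total / (len(a) * len(b))

def gac(vectors, target_clusters):
    """Merge the most similar pair of clusters until target_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > target_clusters:
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: group_average(clusters[p[0]], clusters[p[1]],
                                               vectors))
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters
```

For example, `gac([tf_vector(d) for d in docs], 2)` returns two lists of document indices. The target number of clusters has to be supplied by the caller, which is exactly the parameter this project tries to determine automatically. The naive pair search is cubic in the number of documents, so this is only suitable for small illustrations.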
Experiments
Refer to the timetable above.
Results
- 1d segmentation using the BIC criterion gives good results.
- EM clustering of 2d data gives good results if the number of clusters is
known beforehand.
- EM clustering of 2d data when the number of clusters has to be set
automatically is more complicated because of sampling noise, overfitting,
and the EM convergence rate, but it is doable.
- Running GAC over the SWB collection with different numbers of target
clusters validated the intuition that, to get a meaningful clustering, GAC
must be told the right number of clusters.
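The 1d BIC segmentation result above can be illustrated with a small sketch. It assumes each segment is modelled by a single 1-d Gaussian and that a split is accepted only when the log-likelihood gain exceeds the BIC penalty for the two extra parameters (one mean and one variance). The function names and the `min_len` parameter are hypothetical, and the actual experiment may have used a different model.

```python
import math

def gaussian_loglik(xs):
    """Log-likelihood of xs under a 1-d Gaussian fitted by maximum likelihood."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    var = max(ss / n, 1e-6)  # floor avoids log(0) on constant segments
    return -0.5 * n * math.log(2 * math.pi * var) - ss / (2 * var)

def best_split(xs, min_len=3):
    """Best single change point and its BIC gain over the unsplit model."""
    n = len(xs)
    penalty = 0.5 * 2 * math.log(n)  # two extra parameters: mean and variance
    whole = gaussian_loglik(xs)
    best_i, best_gain = None, float("-inf")
    for i in range(min_len, n - min_len + 1):
        gain = gaussian_loglik(xs[:i]) + gaussian_loglik(xs[i:]) - whole - penalty
        if gain > best_gain:
            best_i, best_gain = i, gain
    return best_i, best_gain

def segment(xs, offset=0, min_len=3):
    """Recursively split while the BIC gain is positive; return the boundaries."""
    if len(xs) < 2 * min_len:
        return []
    i, gain = best_split(xs, min_len)
    if i is None or gain <= 0:
        return []
    left = segment(xs[:i], offset, min_len)
    right = segment(xs[i:], offset + i, min_len)
    return left + [offset + i] + right
```

`segment(xs)` returns the indices at which a new segment starts, so the number of segments is decided by the criterion rather than supplied in advance.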
Some nice pictures of Gaussian mixture modelling of 2d data are
available by contacting me.
Demo
... plans about what to demo...
last update: Apr.14, 1998