March 2002

3.20.02:
Still trying to figure out what's going on with the pruning stuff. Looked more closely at the queries. Could it be that trec6 topics are full of rare terms? That isn't really the case. The main differences are that trec6 title queries are on average 1 word shorter than trec123 topics, and we have 100 queries for trec123 but only 50 for trec6. That means if we do badly on one or two queries, they get averaged over fewer queries, which could make a difference. For trec6, there were 2 queries that were entirely not found after the pruning. If I remove them from the results file and run the evaluation again, the results are almost identical. So maybe those queries aren't really what's causing the performance hit. Then what is? Should maybe try full queries and larger individual dbs for the trec6 datasets. That's easy enough for the bysource data, but how do I combine the kmeans data? Wish there were more relevance judgments for more topics for trec6. Should I try the trec4 dataset? Maybe that would be too many things to look at. I also got some stats from Paul relating to Heaps' law and Zipf's law for trec6, but haven't really looked at them yet. Also working on the Lemur 1.1 release, which is due in just over a week.
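
To make the averaging argument concrete, a quick sketch with my own toy numbers (the 0.30 AP value below is hypothetical, not from the actual runs): MAP is just the mean of per-query average precision, so a query that drops to zero costs its AP divided by the number of queries in the set.

# Rough arithmetic for the averaging point above (hypothetical AP values).
def map_drop(lost_ap, num_queries):
    """Drop in MAP caused by one query whose AP falls to zero."""
    return lost_ap / num_queries

# Suppose a query lost to pruning would otherwise have scored AP = 0.30.
print(map_drop(0.30, 50))    # trec6:   0.006 -> 0.6 points of MAP
print(map_drop(0.30, 100))   # trec123: 0.003 -> 0.3 points of MAP

So the same two failed queries hurt the trec6 average twice as much as they would the trec123 average, though that alone doesn't seem to explain the gap given the near-identical results after removing them.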

3.10.02:
The trec6 kmeans set does not yield results as good as the trec123 dataset (about a 3% performance hit). Thought maybe trec6 by source would do better. It is also worse than trec123, though not as bad as kmeans. Why? Tried to look at stats of the dbs. The kmeans dbs vary a lot in size, while the trec123 dbs are almost all equal. However, bysource is also fairly equal in size. trec123 has a smaller percentage of its terms occurring as unique terms; does that mean it should do better or worse? Another difference is that the individual trec123 dbs are just bigger, almost double, according to average total word count. Could be that when there are fewer words, they're all important. If so, taking the top x terms should do better than taking the top x% of terms, but x% does better for both trec6 sets, and for trec123 they are the same. Another thing: the kmeans baseline is a lot higher than the other two. Does that make a difference?
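
For reference, a minimal sketch of the two pruning policies I'm comparing, assuming each db's vocabulary is ranked by some per-db importance score (the actual pruning criterion in the runs may differ; these function names are my own):

# Two ways to prune a db's vocabulary.
def prune_top_k(term_scores, k):
    """Keep a fixed number of terms per db."""
    ranked = sorted(term_scores, key=term_scores.get, reverse=True)
    return set(ranked[:k])

def prune_top_percent(term_scores, pct):
    """Keep the same fraction of each db's vocabulary."""
    k = max(1, int(len(term_scores) * pct))
    return prune_top_k(term_scores, k)

# A small db keeps fewer terms in absolute numbers under the percentage
# policy, which is the worry above: if a db has few words, they may all matter.
small_db = {"heap": 40, "zipf": 25, "law": 10, "rare": 2}
print(prune_top_k(small_db, 3))          # keeps 3 terms
print(prune_top_percent(small_db, 0.5))  # keeps only 2 terms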

Ran some more stuff using modified cw counts for the CORI eval, using counts taken after the pruning was done. This made almost no difference. Kind of makes sense, since cw is only used as a ratio, but it was worth a try. What about term counts for the collection index? Currently I still use the original counts. What does that affect besides average document length? The average document length value should definitely not change.
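
A sketch of why rescaling cw barely matters, using my understanding of the CORI belief formula (Callan et al.) with its usual default constants; the exact Lemur settings and variable names here are my own, not taken from the code. Collection size enters the T component only through the ratio cw_i / avg_cw, so if pruning shrinks every db's cw by a similar factor, the ratio, and hence the score, barely moves.

from math import log

def cori_belief(df, cw, avg_cw, cf, num_dbs, b=0.4):
    # T uses cw only relative to the average cw across dbs.
    t = df / (df + 50.0 + 150.0 * cw / avg_cw)
    i = log((num_dbs + 0.5) / cf) / log(num_dbs + 1.0)
    return b + (1.0 - b) * t * i

# Same df/cf, but cw and avg_cw both halved by pruning: identical belief.
print(cori_belief(df=200, cw=1_000_000, avg_cw=800_000, cf=40, num_dbs=100))
print(cori_belief(df=200, cw=500_000, avg_cw=400_000, cf=40, num_dbs=100))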
