March 2002

3.20.02: Still trying to figure out what's going on with the pruning stuff. Looked more closely at the queries. Could it be that trec6 topics are full of rare terms? That isn't really the case. The main differences are that trec6 title queries are on average 1 word shorter than trec123 topics, and we have 100 queries for trec123 but only 50 for trec6. That means doing badly on one or two queries drags the average down more for trec6.. that could make a difference. For trec6, there were 2 queries that retrieved nothing at all after the pruning. If I remove them from the results file and run that again, the results are almost identical. So maybe those queries aren't really what's causing the performance hit. So what is? Should maybe try full queries and larger individual dbs for the trec6 datasets. Easy enough for the bysource data, but how do I combine the kmeans data? Wish there were relevance judgments for more topics for trec6. Should I try the trec4 dataset? Maybe that will be too many things to look at. I also got some stats from Paul relating to Heaps' law and Zipf's law for trec6. Haven't really looked at them yet. Also working on the Lemur 1.1 release, which is due in just over a week.
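Quick back-of-the-envelope on that averaging point (hypothetical per-query scores; assuming the metric is a plain mean over queries, like MAP): two zero-score queries cost twice as much at 50 topics as at 100.

```python
# How much does a zero-score query drag down the mean,
# depending on the size of the topic set?
# Hypothetical score of 0.25 for every other query.

def mean_with_bad_queries(n_queries, n_bad, good_score=0.25):
    """Mean over n_queries where n_bad queries score 0.0."""
    scores = [0.0] * n_bad + [good_score] * (n_queries - n_bad)
    return sum(scores) / n_queries

for n in (50, 100):  # trec6 has 50 topics, trec123 has 100
    base = mean_with_bad_queries(n, 0)
    hit = mean_with_bad_queries(n, 2)
    print(f"{n} queries: {base:.4f} -> {hit:.4f} "
          f"(drop of {100 * (base - hit) / base:.1f}%)")

# 50 queries:  0.2500 -> 0.2400 (drop of 4.0%)
# 100 queries: 0.2500 -> 0.2450 (drop of 2.0%)
```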
3.10.02: Ran some more stuff using modified cw counts for the CORI eval, with counts taken after the pruning was done. This made almost no difference. Kind of makes sense, since cw is only used as a ratio. But it was worth a try. What about term counts for the collection index? Currently I still use the original counts. What does that affect.. besides average document length? The average doc length value should definitely not change.
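For reference, the ratio in question: in the standard CORI belief formula, cw only enters as cw/avg_cw, so a prune that shrinks every collection by roughly the same factor leaves the score nearly untouched. A minimal sketch with the usual constants (50, 150, default b = 0.4) and hypothetical counts:

```python
import math

# Sketch of the CORI collection-scoring formula (Callan et al.).
# df: docs in this collection containing the term
# cf: number of collections containing the term
# cw: word count of this collection; avg_cw: mean cw over collections
# n_coll: total number of collections

def cori_belief(df, cf, cw, avg_cw, n_coll, b=0.4):
    T = df / (df + 50 + 150 * cw / avg_cw)
    I = math.log((n_coll + 0.5) / cf) / math.log(n_coll + 1.0)
    return b + (1 - b) * T * I

# cw appears only as cw/avg_cw, so shrinking every collection by
# the same factor (here, halving) leaves the ratio -- and the score --
# unchanged:
before = cori_belief(df=120, cf=30, cw=2_000_000, avg_cw=1_500_000, n_coll=100)
after = cori_belief(df=120, cf=30, cw=1_000_000, avg_cw=750_000, n_coll=100)
print(before, after)  # identical
```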