January 2002

1.31.02: So I did build the collection selection index without position information, and it is only 1/10th the size of the other one. I ran the retrieval experiment on it just to make sure I could get the same results. Sure enough, it must be accurate because the results from the 2 are identical. However, in the process of all of this I discovered a bug in my pruning code. While I thought I was pruning each list by dropping entries at the document level, I was actually dropping whole inverted lists whenever a list did not have at least one document (db) with a termcount above the threshold. That explains why I did not get better results from my second set of experiments; the results would be similar if the dataset were such that the 100 dbs have similar word distributions, and I guess that must be the case. Since I was rewriting whole lists out, I was not writing out the newly created pruned lists. Now that I've changed it to write out the pruned lists, I see that the pruned lists are not being created properly. I'm getting very bad results from the pruned index (as if all termcounts were 0), so I need to look into that.
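For my own notes, here's roughly what the entry-level pruning is supposed to do (a minimal sketch; InvList, Posting, and pruneList are made-up names, not the actual pruning code):

    // Sketch of entry-level pruning: keep a posting only if its termcount
    // clears the threshold, and only drop a whole list if nothing survives.
    #include <vector>

    struct Posting { int docId; int termCount; };
    struct InvList { int termId; std::vector<Posting> postings; };

    // Prunes one inverted list in place; returns true if anything survived
    // and the (pruned) list should be written to the new index.
    bool pruneList(InvList &list, int threshold) {
      std::vector<Posting> kept;
      for (size_t i = 0; i < list.postings.size(); i++) {
        if (list.postings[i].termCount >= threshold)
          kept.push_back(list.postings[i]);   // drop entries, not whole lists
      }
      list.postings.swap(kept);               // write out the pruned list
      return !list.postings.empty();
    }
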
1.30.02: In the meantime, I've been offering a lot of support to Lemur users. One is a person in the group having
a lot of trouble getting the web data built. Another is someone trying to use an old version of VC++ (which
just won't work); I think Cheng gave him some hacks. And finally, I've built all the libs and apps on win2000,
zipped it up and put it on our download page. We've had to add some notes on our download page to clarify
which versions of windows and vc++ people should use.
The SIGIR deadline has passed so obviously I didn't submit anything for that. Currently we're thinking
of submitting to the VLDB conference. If not, there's still CIKM after that. Hopefully I'll have something
in time for VLDB, but Jamie thinks it's 50/50. I still need to measure the query time to see if the smaller
collection index improves retrieval time. I also want to tweak those counts.
1.24.02: Also, Jamie told me to just focus on precision at 3, 5, and 10 databases searched. If that's the
case, then our performance difference becomes even smaller. I was thinking it might be the case that
we'd only care about fewer than 10 dbs searched, but I picked 50 before because that was the point with
the largest difference in performance. I also found out that Jamie was actually hoping to get better
performance, not just the same. I think this might happen if I tweak the counts file.
Also happening in the last couple of days: the AFRL people came to get the newest YFilter system.
I made some minor changes to the GUI code; they were all fairly simple. I'm also talking to the
purchasing people about getting a new server for the group. I sent them an email yesterday.
1.21.02: At first I was still quite discouraged by the results, because the new method did not perform any better than
the old method. For the same small but noticeable performance hit (about a .02 hit in ave. prec. for 50 dbs
searched), the db is only 10% smaller. The 2 sets of experiments have roughly the same performance level
for the same reduction in db size.
Currently, the thresholds are based on getting the top N terms in
each database regardless of database size. I was thinking of changing it to take the top N% of each database.
I don't expect this to greatly improve things, though. A secondary thought is possibly tweaking the
counts. Currently, when I run the cori eval, I give it term counts and collection counts based on the original
database. I thought this was accurate since the actual data has not changed. But perhaps we could get
better performance if we tweak the counts to represent the pruned database.
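To make the top-N% idea above concrete, a rough sketch (the function and variable names are mine, not the real threshold code):

    // Pick a per-database frequency cutoff that keeps roughly the top
    // `fraction` of that database's terms, so the cutoff scales with
    // database size instead of being a fixed top-N.
    #include <algorithm>
    #include <vector>

    int thresholdForFraction(std::vector<int> termFreqs, double fraction) {
      if (termFreqs.empty()) return 0;
      std::sort(termFreqs.begin(), termFreqs.end());        // ascending
      size_t keep = (size_t)(termFreqs.size() * fraction);  // e.g. 0.10 for top 10%
      if (keep == 0) keep = 1;
      if (keep > termFreqs.size()) keep = termFreqs.size();
      return termFreqs[termFreqs.size() - keep];            // min freq among kept terms
    }
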
I reread the paper and it turns out that we were measuring the amount of pruning differently than they were.
They were measuring it by the number of entries that get thrown out, completely ignoring the actual db size on disk.
So our 10% reduction is actually, by their definition, either 95% or 85% pruned depending on whether "entries"
should mean inverted lists, or each entry in an inverted list. The difference in index size does not change
significantly for us, probably because we keep position information. (Keeping entries with higher df means
keeping the entries with the longest lists of positions.) So while nothing has changed, all of a sudden the
results look much better. I'm pretty sure Jamie will not be happy about this though. I'm pretty sure he would
want to show that we can save disk space.
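A quick arithmetic illustration of the two definitions (the posting counts below are made up, chosen only to reproduce the 95% vs. 10% split described above):

    // The fraction of entries removed can be huge while the on-disk savings
    // stay small, because the surviving entries carry long position lists.
    #include <cstdio>

    int main() {
      double entriesBefore = 1000000.0, entriesAfter = 50000.0;  // postings kept
      double mbBefore = 100.0, mbAfter = 90.0;                   // index size on disk
      printf("pruned, by entry count: %.0f%%\n", 100.0 * (1.0 - entriesAfter / entriesBefore));
      printf("reduced, on disk:       %.0f%%\n", 100.0 * (1.0 - mbAfter / mbBefore));
      return 0;  // prints 95% and 10% for these illustrative numbers
    }
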
1.18.02: Ran into a big problem today. I ran Paul's newer cori evaluation stuff on the baseline collsel db
and got terrible results. I ran the same eval on the older collsel db I had lying around from the old
experiments and got equally bad results. This led me to believe that the fault lay in the eval method.
I tried to find the old eval method I used so I could test it on the old and new dbs, but I could not find it
and could not recreate it either. The retrieval code has changed significantly since the last time I ran these
experiments. This is very discouraging for me because, without results to compare, there is no way to evaluate
the pruned dbs. I was hoping to get first round results today, which would not have been a problem if eval
was working properly. I've sent a message off to Paul to see if I've done something stupid that he can
easily spot. One strange thing is that the tids for the 2 collsel dbs do not match, though they should. They
were built using the same data and the same stopword and acronym lists. (The stemming might be different, but stemming
wouldn't account for the differences I see). They were not built using the same indexer application though.
I'm not sure it's something I should be worried about. For the pruning stuff, as long as I compare performances
of the pruned indexes to what it was pruned from, the original db matters less (as long as the results are
acceptable.)
In toolkit news, I did download the toolkit yesterday and built everything on win2000. I built the sample
data using PushIndexer and ran a simple tfidf experiment. The results look good.
1.15.02:
1.9.02: In John's release email he just said that it works on windows and not specifically NT. So people have been
trying or wanting to use it with XP or 2000 or 98. That brings up some interesting questions. XP and 2000 are
NT-based, so it shouldn't be a problem. People also wanted a binary release, which is pretty reasonable. I have
to run it on 2000, see if it works, and then zip it all up for release. Jamie has been talking about possibly putting
a GUI on it for people who just want to tinker with it. We won't get to that until Feb.
1.7.02: I figured out why the wordindex was not being built in release mode. It was because the call to create it
was enclosed in an assert statement, and assert statements are ignored in release mode. John uses a lot
of assert statements, but the others are all pure error checks with no side effects. We decided that we should write our own version of
"assert" which just prints out an error message and exits. Now everything works in both debug and release modes.
After basicindex worked, we made sure all the applications would work. It turns out that many of them still
had header problems (still including the old-style .h headers and not using "common_headers"). We made those fixes. We also tried to
get rid of more of the warnings, like removing pragma statements. Finally, I remade some of the windows
.mak files and we were ready for release.
1.4.02: We were still seeing problems in that the indexes being created would think they had reached
EOF after a few records, even though the file itself appeared to be of the correct size. We rewrote the code
to bypass the compression but found that the error still occurred. In the end, we finally discovered that
the problem was due to the files not being explicitly opened for reading and writing in binary mode. That fix
solved the problem.
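For the record, the shape of the fix (function names made up): on Windows, a file opened in text mode gets newline translation and treats a Ctrl-Z byte as end-of-file on read, which matches the early-EOF symptom; opening in binary mode avoids both.

    // Open index files in binary mode so the Windows runtime does not translate
    // line endings or stop at a Ctrl-Z byte when reading the data back.
    #include <fstream>

    void writeBlock(const char *name, const char *buf, std::streamsize len) {
      std::ofstream out(name, std::ios::out | std::ios::binary);
      out.write(buf, len);
    }

    void readBlock(const char *name, char *buf, std::streamsize len) {
      std::ifstream in(name, std::ios::in | std::ios::binary);
      in.read(buf, len);
    }
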
Meanwhile, John wanted to replace the BitArray class, since it was old and used some values specific to 32-bit
platforms. Even though it works on all our machines, I guess he wanted a more elegant solution. He changed
to a BitVector class, but this caused the code to crash when running in Debug mode. There was an error accessing
NORMAL_BLOCK when the BitVector was being de-allocated. I did some research and apparently this is some kind of
bug with the VC debug version of free. The error surfaces when using derived classes (BitVector is derived from
the STL vector class). I tried various ways to work around it, but with no success. This bug does not appear in
Release mode. However, the wordindex does not get built in Release mode. I tried it with the old BitArray
version and the same problem occurs in Release mode.
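For context, the rough shape of the class in question as I understand it (the element type and methods here are guesses, not the actual toolkit code):

    // A bit vector built on top of the STL vector specialization for bool.
    #include <cstddef>
    #include <vector>

    class BitVector : public std::vector<bool> {
    public:
      BitVector(std::size_t n) : std::vector<bool>(n, false) {}
      void setBit(std::size_t i)       { (*this)[i] = true; }
      bool getBit(std::size_t i) const { return (*this)[i]; }
    };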