January 2002

1.31.02:
So I did build the collection selection index without position information, and it is only 1/10th the size of the other one. I ran the retrieval experiment on it just to make sure I could get the same results. Sure enough, it must be accurate, because the results from the two are identical. However, in the process of all this I discovered a bug in my pruning code. When I thought I was pruning each list and dropping entries at the document level, I was actually dropping whole inverted lists whenever a list did not have at least one document (db) with the required termcount (above threshold), and keeping whole lists otherwise. That explains why I did not get better results from my second set of experiments: the results would be similar if the dataset was such that the 100 dbs have similar word distributions, and I guess that must be the case. Since I was writing whole lists back out, I was never writing out the newly created pruned lists. Now that I've changed it to write out the pruned lists, I see that the pruned lists are not being created properly. I'm getting very bad results from the pruned index (as if all termcounts were 0), so I need to look into that.
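
For reference, a minimal sketch of what the pruning was supposed to do; the Entry struct and pruneList name are hypothetical stand-ins, not the real code:

    #include <vector>

    // Hypothetical per-document entry in an inverted list: (db id, term count).
    struct Entry {
      int docId;     // here: one of the 100 dbs
      int termCount; // the term's count within that db
    };

    // Intended behavior: prune at the entry (document) level and write out
    // whatever survives, rather than keeping or dropping a whole list based
    // on whether any single entry clears the threshold.
    std::vector<Entry> pruneList(const std::vector<Entry>& list, int threshold) {
      std::vector<Entry> pruned;
      for (size_t i = 0; i < list.size(); ++i) {
        if (list[i].termCount >= threshold)
          pruned.push_back(list[i]); // keep only qualifying documents
      }
      return pruned; // may be empty; the buggy version kept/dropped whole lists
    }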

1.30.02:
Made a lot of progress today. I've spent the last couple of days changing the code to build an index without position information. It's less of a hack now than it was before. I've even modified PushIndexer so you can just pass a parameter saying whether you want positions. I've made it as simple as possible, so if you don't want position info, it doesn't build a docterm index either. Maybe that's not the best thing, but so far I've only ever used the inverted list for anything; docterm support could probably be added later. In fact, I could probably write a PushIndex that only builds a docterm index, then write a TextHandler for it so you would just chain it in if you wanted one. For now I've written a new PushIndex and changed InvFPTextHandler to instantiate the correct PushIndex based on the parameter. The PushIndex abstract API has finally come in handy. So far I've run a small test on one 50MB file; the index is half the size. I'm building the collection index now, but it's not finished yet. Hopefully it won't crash.
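
Roughly, the dispatch now has the shape below. The stand-in class bodies, the parameter name, and NoPosPushIndex are all illustrative; only PushIndex and InvFPPushIndex are real Lemur names, and the real signatures differ:

    #include <string>

    // Minimal stand-ins for the classes involved.
    class PushIndex { public: virtual ~PushIndex() {} /* ... */ };
    class InvFPPushIndex : public PushIndex {        // keeps position info
     public: InvFPPushIndex(const std::string& prefix) { /* ... */ }
    };
    class NoPosPushIndex : public PushIndex {        // hypothetical: no positions
     public: NoPosPushIndex(const std::string& prefix) { /* ... */ }
    };

    // The TextHandler picks the concrete indexer from a single parameter.
    PushIndex* makePushIndex(const std::string& prefix, bool withPositions) {
      if (withPositions)
        return new InvFPPushIndex(prefix);
      return new NoPosPushIndex(prefix); // observed: roughly half the index size
    }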

In the meantime, I've been offering a lot of support to lemur users. One is a person in the group having a lot of trouble getting the web data built. Another is someone trying to use an old version of VC++ (which just won't work); I think Cheng gave him some hacks. And finally, I've built all the libs and apps on win2000, zipped it up, and put it on our download page. We've had to add some notes to the download page to clarify which versions of windows and VC++ people should use.

The SIGIR deadline has passed, so obviously I didn't submit anything for it. Currently we're thinking of submitting to VLDB; if not, there's still CIKM after that. Hopefully I'll have something in time for VLDB, but Jamie thinks it's 50/50. I still need to measure query time to see whether the smaller collection selection index actually improves retrieval speed. I also want to tweak those counts.

1.24.02:
So for the past few days I've been trying to hack the code to leave out position information, without much success; it keeps crashing. At this point, I think I'll have to start over. I have a feeling this code will become permanent, so I shouldn't hack it anyway. It turns out the selection db shouldn't have position information in the first place: Paul was only padding it with fake numbers because the InvFPPushIndex he was using already stores position info.

Also, Jamie told me to just focus on precision at 3, 5, and 10 databases searched. If that's the case, then our performance difference becomes even smaller. I had suspected we might only care about fewer than 10 dbs searched, but I picked 50 before because that was the point with the largest difference in performance. I also found out that Jamie was actually hoping to get better performance, not just the same. I think this might happen if I tweak the counts file.

Also happening in the last couple of days: the AFRL people came to get the newest YFilter system. I made some minor changes to the GUI code, all fairly simple. I'm also talking to the purchasing people about getting a new server for the group; sent them an email yesterday.

1.21.02:
Well, I got Paul to take a look for me, and it turns out the bug was actually in the old eval program, not the new one. In any event it didn't matter much, because after you run the output through the perl script, you get quite similar rhat scores. I also had a dumb mistake in my perl file: it was pointing to the wrong judgment file.

At first I was still quite discouraged by the results, because the new method did not perform any better than the old one. For the same small but noticeable performance hit (about a .02 drop in avg. precision at 50 dbs searched), the db is only 10% smaller. The two sets of experiments have roughly the same performance level for the same reduction in db size.

Currently, the thresholds are based on taking the top N terms in each database regardless of database size. I was thinking of changing it to take the top N% of each database, though I don't expect this to greatly improve things. A secondary thought is tweaking the counts. Currently, when I run the cori eval, I give it term counts and collection counts based on the original database. I had assumed this was right, since the actual data has not changed, but perhaps we could get better performance if we tweak the counts to represent the pruned database.
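
A minimal sketch of the top-N% idea; the helper name is hypothetical and assumes we have each db's per-term counts available:

    #include <algorithm>
    #include <functional>
    #include <vector>

    // Given the count of every term in one db, return the cutoff that keeps
    // roughly the top `percent` of its terms, so the threshold scales with
    // db size instead of being a fixed top-N.
    int thresholdForTopPercent(std::vector<int> counts, double percent) {
      std::sort(counts.begin(), counts.end(), std::greater<int>());
      size_t keep = static_cast<size_t>(counts.size() * percent);
      if (keep == 0) return 0;           // nothing would survive; no cutoff
      if (keep > counts.size()) keep = counts.size();
      return counts[keep - 1];           // count of the last surviving term
    }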

I reread the paper, and it turns out we were measuring the amount of pruning differently than they were. They measure it by the number of entries that get thrown out, completely ignoring the actual db size on disk. So our 10% reduction is actually, by their definition, either 95% or 85% pruned, depending on whether "entries" means whole inverted lists or individual entries within a list. The index size on disk probably doesn't shrink much for us because we keep position information: keeping the entries with higher df means keeping the entries with the longest lists of positions. So while nothing has changed, all of a sudden the results look much better. I'm pretty sure Jamie won't be happy about this, though; he would rather show that we can save disk space.
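
Just to keep the two metrics straight; the numbers below are illustrative only, chosen to match the percentages above:

    // Two definitions of "percent pruned" applied to the same run.
    double fractionPruned(double before, double after) {
      return 1.0 - after / before;
    }
    // By entry count (their definition): fractionPruned(100.0, 5.0) == 0.95
    // By size on disk (our measure):     fractionPruned(10.0, 9.0)  == 0.10
    //   (units arbitrary; position lists dominate the surviving entries)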

1.18.02:
Despite many bugs and problems, I did manage to build all 100 dbs and their collection selection db over the last two days. Everything appeared to be right. I also modified my code to create a separate threshold for each of the 100 dbs (rather than using one threshold based on the ctf of the collection selection db). This code works as far as I can tell.

Ran into a big problem today. I ran Paul's newer cori evaluation stuff on the baseline collsel db and got terrible results. I ran the same eval on the older collsel db I had lying around from the old experiments and got equally bad results. This led me to believe that the fault lay in the eval method. I tried to find the old eval method I used so I could test it on the old and new dbs, but I could not find it and could not recreate it either; the retrieval code has changed significantly since the last time I ran these experiments. This is very discouraging, because without results to compare against, there is no way to evaluate the pruned dbs. I was hoping to get first-round results today, which would not have been a problem if eval were working properly. I've sent a message off to Paul to see if I've done something stupid that he can easily spot. One strange thing is that the term ids for the two collsel dbs do not match, though they should: they were built using the same data and the same stopword and acronym lists. (The stemming might be different, but stemming wouldn't account for the differences I see.) They were not built using the same indexer application, though. I'm not sure it's something I should worry about. For the pruning work, as long as I compare the performance of each pruned index to whatever it was pruned from, the original db matters less (as long as the results are acceptable).

In toolkit news, I did download the toolkit yesterday and built everything on win2000. I indexed the sample data using PushIndexer and ran a simple tfidf experiment. The results look good.

1.15.02:
Fixed all the Filter GUI stuff today, just some small changes; I need to give them the updated version next week. Also talked to Paul and got his latest stuff. I need to run it and start the pruning experiments (the right way) with it. I've also been supporting the new guy on using the lemur toolkit. Lots of small stuff.

1.9.02:
So we released the toolkit yesterday and already got a bug report: the utility makefile fails when someone tries to build it on windows. The problem was an ambiguous error in String.cpp. The same makefile works on my machine, of course. I tried it on another machine with the same OS and version of VC++, and it did not work there either. So at first I thought it was some configuration setting within the VC++ project, but I could not find any differences. Finally, with some help from Martin, we discovered that the problem occurs when using VC++ with no service packs installed. In any event, there was a real problem in the code, and once fixed it compiles under VC with no service packs too. The code was overriding a method of the string library class, and inside that method it created an object of the child class instead of the parent string class. So the method basically called itself recursively instead of calling the parent method. The fix is just to make that object a string instead of a String (or to cast it when making the method call).
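
A reconstructed sketch of the bug pattern; the class shape and the substring method are hypothetical, and the real code in String.cpp differs:

    #include <string>

    class String : public std::string {
     public:
      String(const std::string& s) : std::string(s) {}

      String substring(int pos, int n) {
        // Buggy version: declaring tmp as String made the call below
        // re-enter this very method, i.e. infinite recursion:
        //   String tmp(*this);
        //   return tmp.substring(pos, n);

        // Fix: use the parent type (or a cast) so the base class's
        // implementation does the actual work.
        std::string tmp(*this);
        return String(tmp.substr(pos, n));
      }
    };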

In John's release email he just said that it works on windows, not specifically NT. So people have been trying, or wanting, to use it with XP, 2000, or 98. That raises some interesting questions. XP and 2000 are NT-based, so they shouldn't be a problem. People also wanted a binary release, which is pretty reasonable. I have to run it on 2000, see if it works, then zip it all up for release. Jamie has been talking about possibly putting a GUI on it for people who just want to tinker with it. We won't get to that until Feb.

1.7.02:
Still trying to make basicindex work on windows. John tried to bypass the BitVector completely by having the compression code use a vector of ints or bools directly. There was still a problem during deallocation of the vector, but it was a different error than the one we had seen previously with free. Cheng and I finally figured out that John was constructing the vector with a size and then assigning values into it using array-like syntax. Neither of us had ever used vector that way before, and it seemed not to work here. We changed the code to construct the vector without a size, letting it manage its own growth, and to use the STL push_back method instead of direct assignments through []. That worked. I then tried calling resize after construction (since we already know the size) to see if it would be more efficient; that made it fail again. I'm not really sure what resize and the sized constructor are for. Since we already knew the size, I thought maybe we should just use an array, but one part of the code needs the size to be dynamic, so that change would not work.
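
For the record, the two patterns in sketch form (sizes and values illustrative; in standard C++ both are legal, but only the second behaved on our setup):

    #include <vector>
    using std::vector;

    int main() {
      const int n = 1024;

      // Pattern that crashed for us under VC++: size the vector at
      // construction, then assign through operator[].
      //   vector<bool> bits(n);
      //   for (int i = 0; i < n; ++i) bits[i] = (i % 2 == 0);

      // Pattern that worked: default-construct and let it grow.
      vector<bool> bits;
      for (int i = 0; i < n; ++i)
        bits.push_back(i % 2 == 0);   // dynamic growth, no presizing

      return 0;
    }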

I figured out why the wordindex was not being built in release mode: the call that creates it was enclosed in an assert statement, and assert statements are compiled out in release mode. John uses a lot of assert statements, but the others were all pure error checks. We decided to write our own version of assert, which just prints an error message and exits. Now everything works in both debug and release modes.
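
A minimal sketch of such an assert replacement; the macro name and message format are just illustrative:

    #include <cstdio>
    #include <cstdlib>

    // Unlike the standard assert(), which compiles away under NDEBUG
    // (silently discarding any side effects placed inside it), this
    // always evaluates the condition, in debug and release alike.
    #define LEMUR_ASSERT(cond)                                       \
      do {                                                           \
        if (!(cond)) {                                               \
          std::fprintf(stderr, "assertion failed: %s (%s:%d)\n",     \
                       #cond, __FILE__, __LINE__);                   \
          std::exit(1);                                              \
        }                                                            \
      } while (0)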

After basicindex worked, we made sure all the applications would build and run. It turns out many of them still had header problems, like including the .h forms instead of using "common_headers". We made those fixes. We also tried to get rid of more of the warnings, for example by removing pragma statements. Finally, I remade some of the windows .mak files, and we were ready for release.

1.4.02:
Been working relentlessly with John and Cheng on getting the BasicIndexer to work on Windows. Most of the bugs have been worked out. There were some problems with the code shelling out to /bin/mv, and Cheng finally changed it to use rename. However, there was still a problem: I remembered that rename does not work if the target file already exists (this is well documented in the VC help). So we added a remove before each rename.
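
The fix in sketch form; the helper name and paths are illustrative:

    #include <cstdio>

    // On Windows, rename() fails when the target already exists, so remove
    // the target first. This is the portable replacement for /bin/mv.
    bool replaceFile(const char* src, const char* dst) {
      std::remove(dst);                  // harmless if dst doesn't exist yet
      return std::rename(src, dst) == 0; // 0 means success
    }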

We were still seeing a problem where a newly created index would think it had reached EOF after a few records, even though the file itself appeared to be the correct size. We rewrote the code to bypass the compression, but the error still occurred. In the end, we discovered that the problem was the files not being explicitly written and opened in binary mode. That fix solved the problem.
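
Sketch of the fix, with an illustrative filename. In text mode the Windows runtime translates line endings and treats a 0x1A byte as EOF, which matches the premature-EOF symptom; ios::binary disables both behaviors:

    #include <fstream>

    void writeAndReadIndex() {
      // Write in binary mode: bytes go to disk exactly as given.
      std::ofstream out("index.dat", std::ios::out | std::ios::binary);
      // ... write records ...
      out.close();

      // Read in binary mode: no early EOF from a stray 0x1A byte.
      std::ifstream in("index.dat", std::ios::in | std::ios::binary);
      // ... read records ...
    }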

Meanwhile, John wanted to replace the BitArray class, since it was old and used some values specific to 32-bit platforms. Even though it works on all our machines, I guess he wanted a more elegant solution. He changed it to a BitVector class, but this caused the code to crash when running in Debug mode: an error accessing NORMAL_BLOCK when the BitVector was being deallocated. I did some research, and apparently this is some kind of bug in the VC debug version of free that surfaces when using derived classes (BitVector is derived from the STL vector class). I tried various ways to work around it, with no success. The bug does not appear in Release mode; however, the wordindex does not get built in Release mode. I tried the old BitArray version, and the same Release-mode problem occurs there too.
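
Roughly the shape of the change that triggered the crash; the real class surely differs, this just shows the derivation pattern involved:

    #include <cstddef>
    #include <vector>

    // Deriving from std::vector is fragile (it has no virtual destructor),
    // and it was the deallocation of exactly this kind of derived object
    // that the VC debug heap flagged in Debug mode.
    class BitVector : public std::vector<bool> {
     public:
      explicit BitVector(std::size_t n) : std::vector<bool>(n, false) {}
    };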
