tnt homepage - work log

February 2003

2.21.03:
Well, I really dropped the ball on this journal, but I'm going to try to start again. I underestimated how useful it is for keeping my thoughts clear and for later lookup. I just used it to look up what I did for Windows debugging over a year ago. Very helpful.

Recently I've been working solely on the toolkit and various administrative-type things. There's been a decent amount of traffic on the Lemur forum, so I try to look at that, although I'm not capable of answering all the questions. I've completely stopped working on the pruning stuff, which is really too bad. I reread what I'd managed to write down about the pruning experiments. It really is interesting work. I might try to pick that up again, but it would require me to rebuild all the databases, which is what deterred me in the second round of experiments.

My most recent accomplishment is writing another index which uses btrees to store the term/doc dictionaries and index lookup tables. I really didn't want to write my own btree package, so I'm using one from libgist that was developed at Berkeley. It actually has more functionality than we really need. I went through and tried to grab only the bare minimum we would need to get btrees to work. Usually I find digging through other people's code to be really painful, but this was not too bad. The btrees load quickly, so the index can now load in less than a second. However, retrieval is slower than if everything is pre-loaded. I guess that's obvious, but I don't fully understand how this btree version will help with getting a server or interactive version of Lemur, since retrieval is slow. Actually, as far as the btree goes, finding a term is fairly fast, so I don't really know why retrieval is so much slower or how to speed it up. As far as getting stuff off disk, both indexes do it the same way. Though it is slow to load, the quickest retrieval comes from using hash_map (instead of map) with the old inverted index. I found out that using hash_map is 75% faster. Windows is finally going to support hash_map, but CMU can't get their act together to get us a licensing agreement to use the latest VC++. GCC already includes hash_map support in its regular distribution. Another big difference is that these btree files take up more space than the ASCII files I was using before. They are about 60% larger. Not only that, for each dictionary I have to keep another map around, so that I have one each for id-to-string and string-to-id lookups.
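To make the "one map per direction" point concrete, here is a minimal in-memory sketch of a two-way term dictionary. It is only an illustration, not the toolkit's or libgist's actual code: the class name TermDictionary, its methods, and the choice to reserve id 0 for unknown terms are all assumptions made for this example. It uses std::unordered_map, the standardized successor to hash_map, for the string-to-id direction, and a plain vector for id-to-string on the assumption that ids are handed out densely.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical two-way term dictionary (illustration only).
class TermDictionary {
public:
    // Returns the id for `term`, assigning the next free id if it is new.
    int add(const std::string& term) {
        auto it = str_to_id_.find(term);
        if (it != str_to_id_.end()) return it->second;
        // Assumption: ids start at 1 so that 0 can mean "unknown term".
        const int id = static_cast<int>(id_to_str_.size()) + 1;
        str_to_id_.emplace(term, id);
        id_to_str_.push_back(term);
        return id;
    }

    // String-to-id lookup; returns 0 when the term has never been added.
    int id(const std::string& term) const {
        auto it = str_to_id_.find(term);
        return it == str_to_id_.end() ? 0 : it->second;
    }

    // Id-to-string lookup; throws std::out_of_range for an invalid id.
    const std::string& term(int id) const {
        return id_to_str_.at(static_cast<std::size_t>(id - 1));
    }

private:
    std::unordered_map<std::string, int> str_to_id_;  // hashed, average O(1)
    std::vector<std::string> id_to_str_;              // index = id - 1
};
```

Calling add("lemur") and then add("index") on an empty dictionary hands out ids 1 and 2, after which id("index") and term(1) answer the two directions independently. Both structures hold a copy of every term, which is the extra space cost mentioned above.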

Meanwhile, we've also been discussing how to come up with a naming standard for all the indexes we have, as our library keeps accumulating more of them. Jamie thinks it's very important not to make the toolkit too complicated. On one hand, we should make everything as simple as possible. On the other, the toolkit is not a simple program, so people shouldn't expect it to be an off-the-shelf system. The original idea of the toolkit (I thought) was to provide a general framework for indexing and retrieval using different indexes and different retrieval algorithms. I think we've definitely supplied that, along with additional parsing capabilities and a bunch of other things we never said we'd do.

My next big task (expected to be a small pain) is to get the new btree stuff working on Windows. I'm anticipating that I might have problems with certain #pragma statements. I never really figured out what they're for. The Berkeley people claim that their package works on Windows, so I'm hoping there won't be too many problems. Another thing is that their code uses a lot of assert statements, and I remember encountering problems with these in John's code. (That's what I was looking through these logs for.) I looked at the toolkit and there is other existing code that uses assert statements. I can't remember why we never changed those. Most likely it's because everything still worked. :) It all depends on the nature of those statements. If they're only for debug purposes, then it's ok. If the code actually relies on them, which I think is the case with libgist, then it can be a problem. The libgist code also produces a few warnings related to deleting void pointers. From what I can tell, deleting void pointers is a very bad idea. I've removed them from a couple of places already, but I think the rest of them are actually necessary. Or at least I can't figure out a way to get around them. In general, I don't like releasing code that produces warnings, so this doesn't sit well with me. But I might not have a choice.
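A small sketch of the two worries above, written for illustration only. None of it is taken from libgist or the toolkit: write_page, flush_risky, flush_safe, Node, and destroy_node are hypothetical names. The first part shows why an assert that the code relies on is a problem: when NDEBUG is defined (common in release builds), the expression inside assert() is not evaluated at all. The second part shows the usual way around deleting a void pointer, which is to cast back to the real type first so the destructor can run. That only works where the real type is known at the point of deletion, which may be exactly what is missing in the remaining libgist cases.

```cpp
#include <cassert>

// --- Part 1: asserts that do real work ---------------------------------

static int pages_written = 0;

// Hypothetical stand-in for a call that does real work and reports success.
bool write_page() {
    ++pages_written;
    return true;
}

// Risky: the call lives inside assert(). If NDEBUG is defined, the whole
// expression is compiled out and write_page() never runs.
void flush_risky() {
    assert(write_page());
}

// Safer: always do the work, then check the result separately.
bool flush_safe() {
    const bool ok = write_page();
    assert(ok);  // debug-only sanity check; harmless if compiled out
    return ok;
}

// --- Part 2: releasing an object held through a void pointer -----------

struct Node {          // hypothetical type, for illustration only
    static int live;   // number of Node objects currently alive
    Node()  { ++live; }
    ~Node() { --live; }
};
int Node::live = 0;

// `delete p` on a void* cannot run ~Node(). Casting back to the real type
// first lets the destructor run normally.
void destroy_node(void* p) {
    delete static_cast<Node*>(p);
}
```

With flush_safe, pages_written goes up by one whether or not asserts are enabled; with flush_risky it only does so in a debug build. With destroy_node, Node::live drops back to zero after the call, which shows the destructor actually ran.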
