February 2002

2.25.02:
Found the "linux" bug, although the problem might be more related to different versions of gcc. Our linux machines run a newer version. The problem was that during the merge, when I iterate through the lists, there's a for loop to increment the iterator. But I don't want to start at the first item, I want the loop to increment itself first so the start case is says iter++, which works. But for some bone-headed reason, I have some that say iter=iter++, which doesn't increment. So all that "junk" being added was the values for the first item merged into the list a second time. iter+=1 and iter=iter+1 also work.

Also revamped the pruning code. It now does as many passes as necessary to get each db within an x% error rate of the desired rank (roughly the loop sketched below). It can also do runs with the top N% of terms instead of just the top n terms. Need to make runs on the kmeans data to get generalizable results.
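The multi-pass loop is roughly this shape. The prune-and-measure step here is a made-up stand-in, "rank" is just a number to hit, and the proportional nudge to the cutoff is my shorthand for however a real pass adjusts things:

    #include <cmath>
    #include <cstdio>

    // Stand-in for one pruning pass: prune keeping roughly a `cutoff`
    // fraction of terms and return the rank measure that results.
    double pruneAndMeasure(double cutoff) {
        return 120.0 * cutoff;   // pretend rank scales with the cutoff
    }

    int main() {
        double desiredRank = 100.0;
        double tolerancePct = 2.0;   // the x% error rate
        double cutoff = 1.0;

        double achieved = pruneAndMeasure(cutoff);
        int passes = 1;
        // Keep making passes, nudging the cutoff, until the db lands
        // within tolerancePct of the desired rank.
        while (std::fabs(achieved - desiredRank) / desiredRank * 100.0 > tolerancePct) {
            cutoff *= desiredRank / achieved;
            achieved = pruneAndMeasure(cutoff);
            ++passes;
        }
        std::printf("converged after %d passes: rank %.2f at cutoff %.3f\n",
                    passes, achieved, cutoff);
        return 0;
    }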

2.20.02:
Been putting more effort into debugging why the large build doesn't work on Suwon. I've discovered that the merging code does not work on linux, regardless of size. Bill had said that he was able to build small web databases, but the queries hang. I have not had the same experience. If I build a small db (only cd1/wtx001), it flies through the queries no problem for both okapi and simple tfidf. If I build a bigger one (all of cd1), it actually core dumps. The difference is that the small db was small enough not to need more than one temp file. If I run the same small dataset (cd1/wtx001) using a smaller cache, forcing it to go through the merge, I get a corrupt inverted list db (the docterm db is still good). I've tried the same merge vs. no merge comparison using trec data, and it also gets corrupted. The index that is built from the merge is larger than it should be.
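For reference, the merge is basically a multi-way merge of the sorted temp runs into the final inverted index. The sketch below is only that generic shape, with a made-up record layout and in-memory "files" instead of the real temp file format, but it's the piece of the code I'm suspicious of:

    #include <cstdio>
    #include <queue>
    #include <vector>

    // Made-up record layout: (termID, docID, count) plus which run it came from.
    struct Rec { int termID, docID, count; size_t src; };

    // Comparator so the priority queue pops the smallest (termID, docID) first.
    struct Later {
        bool operator()(const Rec& a, const Rec& b) const {
            return a.termID != b.termID ? a.termID > b.termID : a.docID > b.docID;
        }
    };

    int main() {
        // Three "temp files" as in-memory vectors, each sorted by (termID, docID).
        std::vector<std::vector<Rec> > runs = {
            { {1, 2, 3, 0}, {5, 1, 1, 0} },
            { {1, 7, 2, 1}, {9, 4, 2, 1} },
            { {5, 3, 4, 2} }
        };
        std::vector<size_t> pos(runs.size(), 0);
        std::priority_queue<Rec, std::vector<Rec>, Later> pq;
        for (size_t i = 0; i < runs.size(); ++i) {
            Rec r = runs[i][0];
            r.src = i;
            pq.push(r);
        }

        // Repeatedly emit the smallest record and refill from the run it came from.
        while (!pq.empty()) {
            Rec r = pq.top(); pq.pop();
            std::printf("term %d doc %d count %d\n", r.termID, r.docID, r.count);
            if (++pos[r.src] < runs[r.src].size()) {
                Rec next = runs[r.src][pos[r.src]];
                next.src = r.src;
                pq.push(next);
            }
        }
        return 0;
    }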

On LA, the indexes are exactly the same whether they go through the merge or not. But the same problem occurs on Ringtail, which leads me to believe that it is a platform issue. The sample data, being small, worked beautifully. Offhand I don't know what the problem is. When we first tried to compile on linux, there were problems with the setbuf functionality of fstream (forcing it to use a given buffer). I commented that stuff out (letting it manage its own buffer), but the same problem occurs. I need to step through the merge code line by line to see what the problem is. The doc frequency and ctf counts are correct. The read also reads in the correct length list, but the data is wrong. When I compare inverted lists from the 2 dbs, the beginnings look the same; then the corrupted one has a bunch of garbage (wrong-value integers), then continues on the right path again. So some wrong values got spliced into the middle of the list.
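The comparison itself is nothing fancy; something along these lines, with toy data standing in for the real lists:

    #include <cstdio>
    #include <vector>

    // Compare two inverted lists (flattened to vectors of ints, one pulled
    // from the good db and one from the corrupted db) and report where they
    // first diverge.
    void compareLists(const std::vector<int>& good, const std::vector<int>& bad) {
        size_t n = good.size() < bad.size() ? good.size() : bad.size();
        for (size_t i = 0; i < n; ++i) {
            if (good[i] != bad[i]) {
                std::printf("diverge at position %lu: %d vs %d\n",
                            (unsigned long)i, good[i], bad[i]);
                return;
            }
        }
        if (good.size() != bad.size())
            std::printf("same up to %lu entries, but lengths differ (%lu vs %lu)\n",
                        (unsigned long)n, (unsigned long)good.size(),
                        (unsigned long)bad.size());
        else
            std::printf("lists are identical\n");
    }

    int main() {
        int g[] = {3, 1, 2, 7, 1, 4};
        int b[] = {3, 1, 2, 999, 1, 4};   // garbage spliced in at position 3
        std::vector<int> good(g, g + 6);
        std::vector<int> bad(b, b + 6);
        compareLists(good, bad);
        return 0;
    }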

Could there be a memory leak on linux that is not on solaris? I have downloaded some Purify-type programs for linux, but have not tried them out. I think I should step through the merge first to make sure there isn't a glaring error. Before that, I have to get gdb to work predictably.

2.04.02:
So last Friday I spent debugging with Bill to see what's wrong with building the web data using lemur on linux. We did manage to build a small index with webdata and tried to run queries through it. The good news is it only took 30 minutes to build 2.5 gigs; the queries, though, hung for a long time. I ran the program through the debugger to see where it was hanging. It was happening somewhere in the scoring function, which as far as I can tell is somewhat removed from the index. From what I could tell, the only place it accessed anything from the index was to get a termlist. (This is a bit surprising because I thought that the tfidf stuff would be interested in the doclist.) Anyway, we wrote a tiny test program to grab termlists from the index (sketched below), and that appeared to behave correctly. It isn't too hard to get an accurate docterm index since it never goes through any kind of merging like the inverted index needs, so for the most part it should be pretty bug-free. Today, Bill tells me that if he lets it run for a whole day, it actually gets through the queries; it's just impossibly slow. I'm not sure there is much I can do to figure out why this is happening since I'm not really too familiar with the retrieval part of the code. Perhaps what we'll need to do is ask Cheng. I hate being asked this stuff, so I'm hesitant to bother Cheng. Meanwhile, I've asked Bill to run the indexer on the full webdata, keeping the tempfiles around so that when it crashes, I'll have the tempfiles to play with and figure out what's wrong with the merge.
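The termlist test was roughly this shape. FakeIndex is just a stand-in so the sketch is self-contained; the real program goes through the lemur Index interface, whose exact calls I'm not going to reproduce from memory here:

    #include <cstdio>
    #include <vector>

    struct TermEntry { int termID; int count; };

    // Stand-in for the real index; the actual program opens the db we built.
    class FakeIndex {
    public:
        int docCount() const { return 3; }
        std::vector<TermEntry> termList(int docID) const {
            // dummy data: pretend each doc contains a couple of terms
            std::vector<TermEntry> tl;
            TermEntry a = { docID, 1 };
            TermEntry b = { docID + 1, 2 };
            tl.push_back(a);
            tl.push_back(b);
            return tl;
        }
    };

    int main() {
        FakeIndex ind;
        // Walk every document and dump its termlist; if the docterm side of
        // the index is healthy, this should fly through without hanging.
        for (int d = 1; d <= ind.docCount(); ++d) {
            std::vector<TermEntry> tl = ind.termList(d);
            std::printf("doc %d: %lu terms\n", d, (unsigned long)tl.size());
            for (size_t i = 0; i < tl.size(); ++i)
                std::printf("  term %d x %d\n", tl[i].termID, tl[i].count);
        }
        return 0;
    }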

Jamie's guess is that there is a memory leak somewhere, which is causing a problem on linux and not on solaris. I ran the pushindexer through purify over the weekend (on solaris, since there is no purify for linux) and there aren't really any memory leaks. There are a few UMRs and array-out-of-bounds errors, all in Paul's text handling code. I guess I'll have to either grab Paul or fix them myself. There are some small memory leaks relating to the param stuff. I kind of remember Cheng mentioning something about them, but then dismissing them because they are so small. There was one significant memory leak from my code: not releasing all the document id strings. But this would happen at the very end of the process anyway, so I can't see how that would make much of a difference. In any case, it is easy to fix. Currently I'm looking into whether or not I need to allocate that memory at all. I can't remember if I have to do my own copying before I push into a vector or if the vector takes care of all of that; I was doing strdup just to be safe. I'm guessing that I do have to, since I vaguely remember looking into it and I would have changed it if that weren't the case. Anyway, I'm running that test now, and it's taking forever because there are people on LA.
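To settle the copying question, a quick standalone check (not the indexer code, and the doc id is just a placeholder): a vector of std::string copies the characters itself, while a vector of char* only stores the pointer, so the strdup and a matching free are only needed in the second case.

    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <string>
    #include <vector>

    int main() {
        char docid[] = "some-doc-id";

        // Case 1: vector<string>. The string made by push_back copies the
        // characters, so it survives changes to (or freeing of) the buffer.
        std::vector<std::string> byValue;
        byValue.push_back(docid);
        docid[0] = 'X';
        std::printf("%s\n", byValue[0].c_str());   // still "some-doc-id"

        // Case 2: vector<char*>. Only the pointer is stored, so the memory
        // has to be owned (strdup) and released by hand -- forgetting the
        // release is the kind of doc id leak purify flagged.
        std::vector<char*> byPointer;
        byPointer.push_back(strdup(docid));
        // ... use the ids ...
        for (size_t i = 0; i < byPointer.size(); ++i)
            free(byPointer[i]);
        return 0;
    }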

Speaking of that, Jamie and I are in the process of looking into new servers to buy. I guess we will go with linux servers because they are cheaper. We're still waiting on some quotes from purchasing and some answers from facilities regarding cross-mounting of disks. Since it looks like we're moving over to linux, I also checked to see what memory leak detectors are available on linux. I downloaded 2 of them to try out at some point.

I'll need to re-run the indexing using a smaller cache so that it uses the hierarchical merge. Although that code is almost identical to the final merge method, I should still check to make sure there isn't anything weird going on there. It just takes so long to run one of these processes.
