December 2001

This is not a true log entry, since I'm writing it quite a bit later than the real date. But on this day, Yi and I finished the filtering engine stuff, put it all onto the laptop, and gave it to Jamie to deliver. There were a few lingering bugs with Java and the JNI. One of them I want to document: we get an error that crashes the Java VM when we try to abort the thread that is running the engine. It shouldn't crash. It should exit the thread normally, since the way we abort is by having the engine check for an abort flag and then exit normally. I looked on Java's developer site and found that this is some kind of bug that is common when using threads with the JNI. I also sent a bug report to Java. Their most useful comment was for us to try the latest version of Java, jdk1.3.1 (we are using the one before, jdk1.3.0). My guess is that it wouldn't solve the problem. I found that it behaves as expected if I run it using java -classic, which uses the classic VM instead of the "hotspot" VM, which is where the crash occurs (HotSpot Virtual Machine Error; EXCEPTION_ACCESS_VIOLATION; abnormal program termination). The hotspot VM is supposed to perform more efficiently, but I have not really noticed a performance difference. This might not be a good long-term solution, however, since the Java person told me that in the next release of the JDK they will probably eliminate the -classic option. Well, I would hope that in their next release of the hotspot VM they eliminate these bugs as well.

Met with Cheng today to see if we could get identical results for the 2 indexers. We did manage to get this right before the previous release. Cheng ran the test on both indexers again, using no stopping or stemming for anything. The results were mostly the same except for a few tests where they were slightly off. We examined the indexes and discovered that my index reports one fewer term. Could this be because of the mysterious 0-length string? This is what we concluded, since we ran shorter data and both indexes got the same count. However, I think that the basicindexer relies on one of Paul's parsing programs to read the datafile and output it in the right input format for basicindexer. So doesn't that mean we are both using the same parsing? Why doesn't the basicindexer have problems with an empty string? Maybe it does have an empty string indexed, but it just gets ignored. Would that cause the results to be different? I guess it could if the calculations use the total term count, which probably gets used for average doc length or something. Anyway, Paul does not have time to look into his parsing code now. I'm not sure I do either. I need to make sure all the documentation is ok.

Another big advancement today is getting the toolkit to compile on gcc 3.01. John and Cheng did this. There is a problem with the setbuf stuff again. A while back I figured out how to use setbuf on Windows and how to do it differently on Unix. Apparently the Unix way is just for gcc 2.96; gcc 3.01 does it the Windows way. So my ifdef win32 doesn't work anymore. We need an ifdef that figures out what compiler version is being used. I don't know how to do that or if it's possible. Cheng's looking into it. He claims that if he does not find the "right" way of doing it, he can at least hack the makefile somehow to make it work. We're so close to releasing this thing. I'd really like to get it over with.
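
For reference, GCC predefines the macros __GNUC__ and __GNUC_MINOR__, so a compiler-version check in the preprocessor is possible. The following is only a minimal sketch of such a check: the macro name USE_WINDOWS_STYLE_SETBUF and the function setbufStyle() are made up for illustration, and the actual setbuf calls for each branch are left out because they are not recorded in this log.

```cpp
#include <string>

// Hypothetical macro: 1 selects the code path this log calls "the Windows way",
// 0 selects the older gcc 2.x path.
#if defined(_WIN32)
#  define USE_WINDOWS_STYLE_SETBUF 1
#elif defined(__GNUC__) && (__GNUC__ >= 3)
#  define USE_WINDOWS_STYLE_SETBUF 1
#else
#  define USE_WINDOWS_STYLE_SETBUF 0
#endif

// Reports which branch the preprocessor picked, as a quick sanity check.
std::string setbufStyle() {
#if USE_WINDOWS_STYLE_SETBUF
    return "windows-style";
#else
    return "gcc-2.x-style";
#endif
}
```

The same switch could presumably also be set from the makefile by passing a -D definition to the compiler, which would be the fallback if the in-source check turns out not to be enough.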

Meanwhile I'm also trying to get us a laptop to install all this filtering engine stuff on so Jamie can deliver it. I looked into whether Java can run on XP, but it was a bit unclear. I can't figure out from just reading what's on the internet whether the JVM for XP is a full JVM or only good as a plug-in. The download page really stresses it as a plug-in, but I read an article somewhere else that says it's a real JVM.

I did run the 1.2 gig data again on Windows today and it finished fine. The index files looked about right. So that's a relief. I still had to test it on the sample data that we are distributing, though. Then I can run the retrieval tests and compare the results with the basicindexer. If the source is the same and handled the same way, we should get identical results. The sample data file we have is not exactly in TREC format, which is the only thing my application handles. So I used Paul's application, which can also do parsing and stemming. It built fine, but when I tried to run the retrieval test, I noticed it didn't really do anything for over an hour. I finally realized that the index never finished loading. It just hangs while trying to load in all the termids.

So I tried loading an index I'd built with other data, and it loaded with no problem. It only has a problem with this test data. I finally discovered it was due to a term that has a string length of 0, so nothing actually gets written for the term. I'm still not really sure why this causes the load to go into an infinite loop (never finding EOF) instead of just not properly loading the terms from that point on. Anyway, I changed the InvFPPushIndex code to check at the beginning of addTerm and ignore any term that has length less than 1. This seems like a reasonable thing to do, and it fixed the problem. I was able to run the retrieval tests and get reasonable results. However, they were not the same as Cheng's, since he apparently didn't do any stopping or stemming.
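
The guard itself is tiny. Below is a minimal sketch of the idea using a toy class; ToyPushIndex, its members, and the bool return value are all made up for illustration and are not the real InvFPPushIndex interface.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for the indexer; NOT the real InvFPPushIndex interface.
class ToyPushIndex {
public:
    // Returns false and records nothing for zero-length terms.
    bool addTerm(const std::string& term) {
        if (term.length() < 1) {
            return false;  // skip: an empty term would write nothing to the term list
        }
        terms_.push_back(term);
        return true;
    }

    // Number of terms actually stored.
    std::size_t termCount() const { return terms_.size(); }

private:
    std::vector<std::string> terms_;
};
```

With this check, adding "", "apple", and "" in sequence leaves exactly one term in the index, so an empty token from the parser can no longer produce an empty record in the term file.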

After timing my application (30 minutes to do 1.2 gigs on LA), I wanted to time Paul's application to see how much stopping and stemming slows it down. The string length problem only occurred once with the sample data, but it occurred thousands of times with the 1.2 gig data. This raised a concern for me that perhaps something is wrong with the parsing code and it is not tokenizing correctly.

Worked a bit with Yi today. She had a "bug list" for me, which I went through and fixed. I had to add more stuff to the GUI because the more I test it, the more arguments I discover her program really needs. I asked her to tell me how to calculate the precision and recall based on the information that she returns. We decided it would be easier if she just returns the values I need so I don't have to calculate them. She also asked for another callback method to send me messages. These are not error messages; rather, they can be used for status updates, etc. We discovered a bug with the counter, but it was just that I thought she was passing back a number I had to keep adding up to get the total, when she was in fact already returning the total. Less work for me. She still needs to change her code so that it returns an error message and exits instead of crashing the program (C code throwing an exception). I think there is a way with JNI to throw a Java exception that I can catch, but I don't really want to look into that right now. The callback should work.

Cheng tells me the reason assert failed is that it really couldn't find the ifstream. I had thought assert was causing some kind of run-time error. I'm a bit confused in general about why one should use assert, which just lets the program abort, instead of regular error checking statements that throw exceptions with meaningful messages. Anyway, the reason it couldn't find the ifstream is some hard-coded /bin/mv commands, which obviously will not work on Windows. I looked to see what builtin library stuff there is to move files. I did find some things like ffilecopy and movefile that are part of the stdio library, but not on Windows. Finally I found a rename() function that will move files. It is in stdio.h and should work on both Unix and Windows. I passed this info on to Cheng.
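
A minimal sketch of replacing a shelled-out /bin/mv with rename() might look like the following. The helper names moveFile and fileExists are made up for illustration. rename() returns 0 on success; it may fail if the destination already exists or sits on a different filesystem, depending on the platform, so the return value is worth checking rather than asserting on.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// True if the file can be opened for reading.
bool fileExists(const std::string& path) {
    std::ifstream in(path.c_str());
    return in.good();
}

// Moves (renames) a file using the standard C library; returns true on success.
bool moveFile(const std::string& from, const std::string& to) {
    return std::rename(from.c_str(), to.c_str()) == 0;
}
```

A caller can then report a meaningful error when moveFile() returns false instead of discovering the missing file later through a failed assert.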

Meanwhile I finally got the thread stuff in Java figured out. The way to do it is with a while(true) loop in the run method of the class that extends Thread. In the GUI constructor, I instantiate the thread and call start. This starts the thread, but it immediately waits. When the action command for the button gets called, I call notify on the same thread handle (inside a synchronized(thread){} section). This wakes up the thread, which does its stuff and then goes back to wait (wait() also needs to be inside a synchronized(this){} section). I handed the code off to Yi today so she can do some testing.

Today, I also ran all kinds of tests with the indexer. I need to rebuild the pushindex on Windows and on Unix. I should probably time it too. I ran a small test on Windows on Saturday and it crashed. Today when I looked at it again, I realized it was because of the changes I had made to nextEntry to make it consistent with Cheng's code. I use this API in my merge phase when I look at the temp files. Since nextEntry now returns static memory, I don't have to delete it, so I removed all the delete statements. It compiled but did not run. There was an error claiming the program was trying to delete some memory it doesn't have permission to. I didn't really understand the error, but I thought it might have something to do with the static_cast statement I use to downcast it. I remembered the dynamic casts that Cheng uses in his code, so I tried that. That did not work either. Finally I just used the nextEntry method I have where you pass in a pointer to memory that you manage yourself. This worked fine. I ran it on Unix a couple of times with seg faults and panicked until I finally realized it was just because of a full disk. So it all builds ok. The small datasets on Windows run ok. When I ran the 1.2 gig data, the machine complained about running out of memory. Later when I checked the index files, I saw that they were corrupt. I'm not sure if this is a problem with Windows. I'll have to run it again later. It really hangs up my system, though, and I can't do anything else in the meantime, so I'll do it later.

Tackling both the Java stuff and the toolkit stuff today. First off, the Java stuff: I tried running the data in a different process, but the loop that waits to read the output holds things up just as much as not running the data in a different process. I talked to Daniel about it and it seems that running it in a different thread should do the trick. It's not supposed to wait until the thread finishes. It turns out there was a bug in my code: I called run to start the thread instead of calling start. Doing it the right way, I can start off the thread and update in real time without the GUI holding up. The process is pretty intensive, though, and slows down my computer in general.

Then I discovered another problem, which is that once the thread has started, calling start again does not restart it. That means you cannot do Run multiple times from the application; you'd have to quit. What I had done before is construct the thread in the constructor, because then I can pass the GUI (this) into the thread for callback reasons. But the action loop for the Run button is where the thread actually needs to run/start the process. So it turns out that you cannot start/stop the same thread after it's started. If I wait until the action loop to construct the thread, it does not wait for the thread to finish, but it takes a really long time until the GUI takes control back. The lag is unacceptable. I tried various ways, making the data class implement Runnable vs. extending Thread, but neither way worked. I'm now experimenting with different ways of keeping the run loop of the thread "running" at all times, so that it wait()s after it finishes its job until the action loop notifies it to wake up. This theory works, but the implementation does not work yet. When I try to notify it, it continues to sleep.

Today I worked at length with Cheng to compile the lemur toolkit on Windows. Not just removing errors, but trying to eliminate warnings as well. After changing the headers to the new standard, most everything compiled OK, even the BasicIndexer. We discovered another class that nobody uses (CUtil) and removed that. (BTW, I did manage to compile and link John's code. The problem was that for whatever reason, someone had checked in a local version of stdarg.h, so when Windows compiled, it tried to link that instead of the builtin library. I just removed that file and it worked ok.) There were also some minor problems with some methods not returning anything when they were supposed to. And with the for loop. On Unix, I guess you can have multiple for loops declaring the same variable (for (int i=0; ...) etc.) and it'll be kept local to each loop. But on Windows, you can't do that. I guess the scope goes one level higher, into the method. There were also some warnings related to unsigned int mismatches resulting from strlen. And some warnings about using dynamic cast. We didn't fix all of the warnings. I just emailed them to Cheng.
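
One portable way around the for-loop scoping difference is to declare the loop variable once in the enclosing scope, which compiles the same way whether the compiler scopes the variable to the loop (the standard rule) or to the enclosing block (the older rule some Windows compilers follow). The sketch below is only an illustration; sumTwice is a made-up function. Using std::size_t for the index also avoids the signed/unsigned comparison warnings of the kind strlen produces.

```cpp
#include <cstddef>
#include <vector>

// Sums the values twice using two loops that share one index variable.
int sumTwice(const std::vector<int>& values) {
    int total = 0;
    std::size_t i;  // declared once, so it is valid under either scoping rule
    for (i = 0; i < values.size(); ++i) {
        total += values[i];
    }
    for (i = 0; i < values.size(); ++i) {
        total += values[i];
    }
    return total;
}
```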

So after we got everything compiled, we tried to build a basicindexer. But it didn't work. The parsing does not work correctly, and we discovered that the problem was in BasicDocStream. There is a method in there that peeks ahead to see if it is at the end of file yet. To do that, it notes where the stream pointer is, reads something, then tries to put the pointer back where it was. For some reason the pointer is not where we expect it to be anymore (off by a couple of characters). We did manage to get around that problem by hard-coding some stuff for the dataset we were testing, just to see if there were other problems. We then ran into assert(ifs) problems, where the assertion failed. Instead of using assert, we could do an "if" check and continue around it if the check fails. This allows the program to finish without aborting, but it does not build a real index. I think the right way of checking for end of file is using peek and the builtin EOF stuff. So I made that change, and it manages to finish parsing all the way through the file once without error. However, the second time through the data (I think the basicindexer does a 2-pass method) it fails with the same type of error (not finding the proper begin-new-document marker). I'm not sure why this is. Perhaps a similar thing is going on somewhere else. I've passed my findings on to Cheng.
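
A minimal sketch of the peek-based end-of-file check, assuming nothing about the real BasicDocStream code: atEndOfStream, readAllLines, and rewindStream are made-up helper names. peek() looks at the next character without consuming it, so no tell/seek round trip is needed. The rewind helper clears the stream state before seeking back to the start; whether leftover EOF state has anything to do with the second-pass failure is only a guess, but it is a cheap thing to rule out.

```cpp
#include <istream>
#include <sstream>  // only needed for in-memory usage such as std::istringstream
#include <string>
#include <vector>

// True when no more characters can be read; peek() does not consume input.
bool atEndOfStream(std::istream& in) {
    return in.peek() == std::char_traits<char>::eof();
}

// Reads every remaining line, stopping cleanly at end of stream.
std::vector<std::string> readAllLines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (!atEndOfStream(in) && std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}

// Prepares the stream for another pass from the beginning.
void rewindStream(std::istream& in) {
    in.clear();                  // drop the EOF state left over from the first pass
    in.seekg(0, std::ios::beg);  // go back to the start of the data
}
```

Calling readAllLines(), then rewindStream(), then readAllLines() again on the same stream should return the same lines both times.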

Yesterday I spent some time with Yi and managed to get the callbacks between Java and C working. I couldn't load the JNI library directly from the GUI because of the subclasses created for actionPerformed (for button clicks). So I created a JNI layer class that only talks between the JNI and the GUI. It all seems to work OK, as far as being able to print to the command screen that I received the callbacks. (I implemented 3 different callbacks for different types of messages.) There are 2 issues related to this. 1) When there is an error, the program crashes. I think Yi throws an exception from her code, but I really didn't want to figure out how to pass exceptions back and forth, even though it's supposed to be possible. An easy solution is to create another callback for errors, and Yi's program should then exit after calling the error method. 2) The process takes a long time to run, which hangs the Java GUI. At first I thought this was a performance issue, with the process taking up a lot of memory/ram/etc. We thought the problem might be alleviated by asking Yi's program to wait after each callback to give the screen a chance to redraw. I implemented a class to generate fake data to test out this theory. It seems that the GUI just does not update or draw until the process is finished. I tried a few different things to see if I could get something to redraw. I used a JDialog so it would open another window. I used a JFrame to open another window. I did that using a new Thread. Eventually, the only thing that worked was to actually exec a child process. In that case it's a separate process, so I don't know how to pass information to it. Apparently the parent has a handle on the process and can send and receive information via the process's stdin/stdout/stderr. Instead of making the outputGUI the forked process, perhaps it would be better to make the JNI layer the process and have it send its callbacks to stdout. Then the GUI should be able to read the stdout and update itself. This is a theory that I have not yet had a chance to try out today. Today I also helped Paul get all his code compiled and tested it on Windows.
