10.30.01:
John gave me a stripped-down version of param.c today. Param.hpp does nothing except include param.h. There are some methods in Param.cpp that are used by ParamManager. I just added their signatures to Param.hpp and it compiled fine. (Some are also used by main.cpp.)
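
Roughly what the header ends up looking like (a sketch with made-up function names, since I don't have the real signatures in front of me):

    // Hypothetical sketch of Param.hpp -- the function names here are invented.
    // The point is just that the header forwards to param.h and declares the
    // handful of Param.cpp routines that ParamManager and main.cpp call.
    #ifndef PARAM_HPP
    #define PARAM_HPP

    #include "param.h"

    // Declarations only; the definitions stay in Param.cpp.
    void   ParamPushFile(const char *filename);                    // assumed name
    double ParamGetDouble(const char *key, double defaultValue);   // assumed name

    #endif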

unistd.h is included by util.c. Generally, you can get by with just removing that include and it will compile fine on Windows. When I did that, it told me it couldn't find F_OK, which is used by qfilef. F_OK is just 0, so I defined that, and it compiles fine. But util.c also needs fork(), so it won't link. fork is used in sleepx, erasef, and mv. I also have trouble finding __builtin_next_arg at link time. I'm not sure what that is; it is wanted by error.
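
The F_OK piece amounts to a guard like the one below (a sketch assuming the VC++ build defines WIN32 and that all util.c wants from unistd.h here is access()/F_OK for the file-exists test); the fork() and __builtin_next_arg problems are different, since those are missing functionality rather than a missing constant.

    // Sketch of the unistd.h workaround on Windows (assumes WIN32 is defined
    // by the VC++ build).
    #ifdef WIN32
      #include <io.h>            // _access() lives here on Windows
      #ifndef F_OK
        #define F_OK 0           // "file exists" mode for access()
      #endif
      #define access _access
    #else
      #include <unistd.h>        // access, F_OK, fork, ...
    #endif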

I also looked into whether a Java GUI can talk with C++ code with callbacks. The answer is yes and no. Using the Java Native Interface, you can create Java classes whose methods are implemented in C or C++. From that method, you can access Java methods, variables, etc. (because you have a handle on the JNI environment). However, everything that happens between C++ and Java needs to happen within that layer. I'm not sure how easy it would be to cleanly create such a layer for Yi's code. I guess we'll discuss on Thurs. I am kind of excited about doing some UI work, but I'm not really sure I'll be able to understand what it is this GUI needs to do. I looked at the current version and it's very confusing. It'll also probably feel strange to "think Java" now too.
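
A toy example of the kind of layer I mean (all the names are made up; this is not Yi's code): the Java side declares a native method plus a callback, and the C++ side implements the native method and calls back into the same object through the JNI environment.

    // Toy JNI bridge. Java side (hypothetical): a class RetrievalPanel with
    //   public native void runQuery(String q);   and   void onResult(String r);
    // This C++ side implements runQuery and calls onResult back on the object.
    #include <jni.h>

    extern "C" JNIEXPORT void JNICALL
    Java_RetrievalPanel_runQuery(JNIEnv *env, jobject self, jstring query) {
      const char *q = env->GetStringUTFChars(query, 0);
      // ... hand q off to the C++ retrieval code here ...
      env->ReleaseStringUTFChars(query, q);

      // Call back into Java: look up onResult(String) on the calling object.
      jclass cls = env->GetObjectClass(self);
      jmethodID mid = env->GetMethodID(cls, "onResult", "(Ljava/lang/String;)V");
      if (mid != 0)
        env->CallVoidMethod(self, mid, env->NewStringUTF("placeholder result"));
    }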

Found out today that my CORI index is wrong. It thinks it only has 17 indexes, because of the data I used. I am supposed to use the trec123 CDs, but from a special directory that has them split up into 100 files. Paul tried to rebuild one today and it crashed. Don't know why. It might be that something is wrong in his new code. So I am trying to build it again with his old code. Hopefully it isn't something wrong with my merging code. I believe he's built this database before (and so has Luo). It's taking a while though, as so many people are using the server. Jamie's asked me to look into what kind of server we might want if we were to buy one. I haven't even gotten around to doing that yet.

10.25.01:
Ran the RetEval on the web database. This is for the cron job. It ran fine and created a results file. I think I need to also run treceval on the results file with relevance judgements. But I don't really know what the results are supposed to be. I don't know if it's good or bad... I don't think that I have the proper relevance judgements for it anyway since the one Cheng included is probably for the source file he included. So I guess that means I should look in the data directory to see if there are qrel files for the web data. Does it even have the same topics? Probably not. I probably have to create some queries too.

Offered to make the Windows-happy changes to Cheng's retrieval and langmod code. I think he'll take me up on it even though Jamie didn't think he would. Those changes are pretty easy, but I definitely don't want to be responsible for their indexing code. I do want to get the whole port done soon though. It's really dragging on. Since when did I become the C++ Windows expert? I really have no idea.

10.23.01:
Finished writing the documentation for Paul's text-handling stopper, stemmer, parser stuff. The idea is that all these things inherit from the same text handler interface. You can attach another TH to each one so that it passes the text on to the next; that way you can chain them and run the same text through many handlers. However, his implementation does not force the chaining. It's up to each implementation to remember to make the call that passes the text along. I think it might be better to have the base class own that call, so that an implementation only adds the code that happens right before the chaining call. I'll suggest it to him and see what he says. I also need him to read over the documentation.
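
Roughly what I have in mind (class and method names are mine, not Paul's actual API): the base class guarantees the hand-off, and subclasses only fill in the work that happens before it.

    // Sketch of the chaining idea; names are invented, not Paul's real classes.
    // The base class always forwards the text; subclasses only implement
    // process(), the work that happens before the hand-off.
    #include <string>

    class TextHandler {
    public:
      TextHandler() : next(0) {}
      virtual ~TextHandler() {}

      void setNext(TextHandler *th) { next = th; }

      // Non-virtual entry point: do this handler's work, then always forward.
      void handle(std::string &text) {
        process(text);
        if (next) next->handle(text);
      }

    protected:
      virtual void process(std::string &text) = 0;  // stemmer, stopper, parser...

    private:
      TextHandler *next;
    };

A Stemmer or Stopper would then just override process() and could never forget to pass the text along.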

Played around more with the .mak files yesterday. On Kevin's machine, it looked like opening a .mak file with VC++ would create a very nice project automatically. However, I tested this by deleting all other files from VC++ and it does not make a project at all. (It needs the project file for that.) It just created a new project with one file, the makefile. When you build, it says it's building the .exe but it doesn't actually. It builds whatever the makefile says to build (in this case the .lib). I was able to figure out in the settings how to tell it where to put the file, etc. and get the paths all right. By default, it creates the debug version. There is a place in the settings where you can specify a command line, and it will build the release version. However, it would be nicer if it just built the release version automatically. I made the .mak file in the /lemur root directory. If the makefile is there, it finds everything and puts everything in the right place regardless of where you tell it to create the project. I've been thinking about getting CodeWarrior and seeing if I can use the .mak file with it.

I guess Jamie has other people in the group using the toolkit now too, because I've been getting many emails about it. I hope that my job won't eventually turn into a support job. Ick. But for now it's ok, even if I've been accused of having buggy code. So I said bug reports are welcome, but have received none thus far. Anyway, I looked again at the code I wrote for InvFPTermList that converts it to bag-of-words. During that process I basically throw away the position information instead of putting it into a nice little list. So when you iterate through the TermList and get the term, you can't get positions. (You can still get the list of positions through the inverted list and through sequence-of-words.) It's amazing how much I've forgotten of my own code in such a short time. I really had to dig through it again. Anyway, if I have spare time maybe I should fix that. So far it's more trouble than it's worth. There is no "infrastructure" to support it yet; I'd really have to push things around. Also, at some point, I really need to buckle down and figure out why, during indexing, words that are supposed to already be in the map sometimes don't get found. It doesn't appear to have an effect on the quality of the index though. I think this error occurs less frequently with larger caches.
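
The conversion is essentially the collapse below (a rough sketch, not the real InvFPTermList code): each document's term sequence turns into (termID, count) pairs, and the positions are simply dropped along the way.

    // Rough sketch of the bag-of-words collapse (not the actual code):
    // only counts survive; the positions are thrown away here.
    #include <map>
    #include <vector>

    struct Occurrence { int termID; int position; };

    std::map<int, int> toBagOfWords(const std::vector<Occurrence> &seq) {
      std::map<int, int> counts;
      for (size_t i = 0; i < seq.size(); ++i)
        counts[seq[i].termID]++;        // seq[i].position is discarded
      return counts;
    }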

10.11.01:
Met with Cheng today to go over some Windows issues. He has not had time to make the changes to his code. He does not use any of the complicated code in param.c, but doesn't know for sure whether John does. I have to talk to John. Cheng tells me that the Unix make creates a separate library for each of the directories: utility, index, langmod, retrieval. So I will do the same for Windows, one .mak file each.

I ran the 2-pass method using Zipf's Law to compute topN on the web data, which is stopped and stemmed. The 2nd pass is definitely more accurate than using 0.1 for A, but I'm not sure it's close enough. For smaller numbers like 5000, it is pretty close, but for larger numbers, 15K, 30K, 60K, it's off by 10-20%. It gets closer with 3 passes and probably with more and more passes until it converges on the N we want. Played around some with trying to start with a different A, or trying to get a sample rank and ctf using the top N% of the collection (like 35%), or using some variation on the given N. But those methods make it close for only some of the numbers, whichever happens to be close to the N we want to begin with, and just a bit better for all the others. This just means that if the sample rank & ctf we get is close to the real N we want, then the results are better. But isn't that like saying that if we already have what we want, then we can get what we want? So I don't really know what to do there.
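
Writing the estimate out roughly for my own reference (my shorthand, not Jamie's exact derivation):

    % Zipf's law: the rank-r term's collection frequency is roughly
    %   ctf(r) \approx \frac{A\,T}{r}, \qquad T = \text{total term occurrences}
    % Pass 1 assumes A = 0.1, so the cutoff frequency for keeping ~N terms is
    %   ctf_{\mathrm{cut}} \approx \frac{0.1\,T}{N}
    % Pass 2 takes an observed (rank, ctf) pair from the first pass and
    % re-estimates the constant, then recomputes the cutoff:
    %   \hat{A} = \frac{r_{\mathrm{obs}}\; ctf(r_{\mathrm{obs}})}{T}, \qquad
    %   ctf_{\mathrm{cut}} \approx \frac{\hat{A}\,T}{N}
    % More passes keep refining A until the kept count converges on N.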

10.10.01:
Finally checked in all the code that is NT-happy. That included making all the changes that I talked about in the September logs (unistd, open_max, setbuf, and headers). I wrote a local method called setbuf that calls the correct line of code to actually set the buffer for fstream depending on the OS. Also, I deleted map.hpp and vector.hpp from the cvs tree and basically replaced them with common_headers.hpp. So all the common headers that need to be included are in one place.
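
common_headers.hpp is basically the sketch below (not the actual checked-in header, which may pull in more than this): one place for the STL includes every file needs, so any per-platform quirks only have to be handled once.

    // Hypothetical common_headers.hpp sketch -- the real header may differ.
    // One place for the STL includes, so platform quirks live here only.
    #ifndef COMMON_HEADERS_HPP
    #define COMMON_HEADERS_HPP

    #include <map>
    #include <vector>
    #include <string>
    #include <iostream>
    #include <fstream>

    using namespace std;   // assumption: the existing code doesn't use std:: prefixes

    #endif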

Changed the HTML for the file upload to include a place for a job ID (to be the subject in the return email) and to allow people to choose which retrieval alg they want to use. Now, do I want to let people choose more than one at a time? Does that make things too complicated for me? How would the return email look? Maybe I can make the return email subject be "ID+retmethod". I wonder if anyone would use this service... Maybe it would be a good idea to have an explanation of the system and how it was built, stopword list, etc.

Set up a meeting with Cheng for tomorrow at 3pm. Need to talk to him about param.cpp. Does it really need to be that complicated? Why does it need param.c? Why do they have the same name? It causes compilation errors on NT. Can we write a simpler one? Need to ask him how he's doing with the "port". Need to go over any other issues for the release.

Chatted with Kevin about VC and projects and makefiles. I think distributing projects is a bad idea because of the path issues. We can force VC to export .mak files, which are makefiles for use with nmake (DOS make). If you open the .mak file with VC, it creates a project for you and includes all the correct source files using relative paths. You can also use nmake on the command line. I would guess that this file would be more usable with other applications as well, such as CodeWarrior. Jamie thinks that it would be best to create libraries parallel to the ones we create on Unix. I guess that would be best. I'll have to recheck the details for the relative paths and make sure that the .mak files are distributed in such a way that all someone has to do is open one.

10.09.01:
Writing the prune code today. Looked at Zipf's Law and it didn't make any sense, but worked it out with Jamie. I have most of it written and there don't seem to be any snags in it yet. Ran it without the actual copying just to see if the estimated "get top N words" gets close to N words using Zipf's Law. The second-pass algorithm Jamie came up with is definitely more accurate than using just some default value for the constant. However, it gets further off the higher N is. For example, at 5000 I pretty much get 5000 words, but at 30000 I get closer to 24000 words. I'm not sure if it's because my index is neither stemmed nor stopped. I want to run it on an index that has been stemmed and stopped. I'll run it with the web data when that index is finished.
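
The dry run is basically the sketch below (my own reconstruction with invented names, not the checked-in code): compute the cutoff frequency Zipf's Law predicts for rank N and count how many unique terms clear it.

    // Sketch of the dry run (no copying); names and structure are invented.
    // Given a Zipf constant A, count how many unique terms clear the cutoff
    // frequency predicted for rank N. The second pass just re-estimates A
    // from the (count, cutoff) pair this returns and calls it again.
    #include <vector>

    struct TermStat { long ctf; };   // collection term frequency of one unique term

    long estimatedTopN(const std::vector<TermStat> &terms, long totalOccurrences,
                       long N, double A /* 0.1 on the first pass */) {
      double cutoff = A * totalOccurrences / N;   // predicted ctf of the rank-N term
      long kept = 0;
      for (size_t i = 0; i < terms.size(); ++i)
        if (terms[i].ctf >= cutoff) ++kept;
      return kept;                                // compare against the N we asked for
    }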

SCS announced that we got a school-wide license for CodeWarrior. I'm trying to decide whether or not I want to get it and see if anything compiles using that. I wonder if it's any better than VC++. If I even want to bother with it. I'm really, really tempted though and I'm not sure why. Maybe because VC++ is so irritating.

10.05.01:
Helping Kevin get stuff running on NT. Good news is that he got the CVS NT client to work rather effortlessly. However, we don't know yet if all features work and if updating and all that will be accurate. I guess only time will tell.

Other good news is that we were able to build the retrieval code and my index. It was evident though that he could not have done this without my presence, which is a bad thing, as I do not wish to be present any time anyone wants to run this thing on NT. Kevin suggested we check in project files for libraries that people will want to build, so that they don't need to know which files to grab to create a project. Is this really the only way to distribute things on Windows? Obviously we would then be requiring people to use VC++. Are other compilers able to import other types of project files? I'm not sure. There is another concern, which is that when I create projects, the path info for the source files is encoded into the file. Does this mean that we need to expect people to "install" in the same path as I have? That would be ludicrous. There should be a way to specify source files and then specify the path. There *should* be a way, but knowing VC++, I'm not sure it's easy to find.

Running Paul's CORI index code. I see in his temp directories that he creates one collection selection db and a separate index for each of the data files. When I run it, it only creates the CORI index and none of the individual ones. I must not be running something else I need to. Have to ask Paul about it. I'm also still waiting on Paul to build a web TREC index. He's been having problems with CVS.

Waiting for Cheng to give me usr directory on ringtail so I can run stuff there. LA's been getting pounded.

10.03.01:
Got the CGI file upload to work. It was easy (without error and security checks). Not quite sure exactly how much security we would need. Ended up doing it in Perl, which was why it was easy. The question now is how to associate the email with the query file. I think I'll name the query files using a timestamp. That way I won't have to worry about overwriting other files. Also, then we can process the files in order without having to actually look at the time and worry about parsing that. But what are the chances that 2 people will upload at exactly the same time, to the second? I was thinking of adding the email in as the first line of the file, but that might make it harder to pass the file on to the retrieval code. If so, then maybe another filename like timestamp.email or something.
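
The naming idea is roughly the sketch below (the real script is Perl; this is just the shape of it, using the timestamp.email variant): epoch seconds as the filename so files sort in arrival order, with the email kept in the name rather than inside the query file.

    // Sketch of the timestamp.email naming idea (the actual CGI is Perl).
    #include <cstdio>
    #include <ctime>
    #include <string>

    std::string queryFileName(const std::string &email) {
      char stamp[32];
      std::sprintf(stamp, "%ld", (long)std::time(0));   // seconds since the epoch
      return std::string(stamp) + "." + email;          // e.g. "1002748800.user@host"
    }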

10.02.01:
Spent some time last week finding all sorts of things about uploading files through CGI with perl. Jamie says today though that I should use C or java. So I guess it will be java.

Got RetEval.cpp (of app/src, Cheng's code) to compile on NT, barely... and only if I change everything. It includes all sorts of stuff that does not compile on NT, mainly the BasicIndex and params.c stuff (used for managing the parameter file). I'm not sure when Cheng will get a chance to windows-ize everything. He has been quite busy lately. One good thing is that I got to link in the static library I built with the retrieval code. That seems to work. One thing I also discovered is that it would be useful to have a library built with the InvFPIndex stuff so I don't need to pull everything into the project every time I want to build an application that wants to use this index (which is all of them).

Met with Paul and got the scoop on all his code and how to run the collection selection indexes. Need to actually run it now.

Been thinking about how to most efficiently "prune" an index. One way is to walk down the index file, copying over what we want and skipping the rest. Another is to walk down the lookup file, seek to what we need, and copy that over. The questions are whether skipping over what we don't need is effectively a seek, and what percentage of the index we'll end up copying. Seeks are expensive, but might be worth it if we end up skipping a lot of things. Also, it's a bit cleaner (interface-wise) to use the lookup, as an index could be spread out over many files. Jamie thinks it's more robust to use the lookup method. Another thing to remember is creating a lookup table during the index pruning task. Will this pose any problems? How likely is the pruned index to span multiple files? Is it worth it now to write a separate class that deals with a "very large file" and internally takes care of spanning the index over multiple files? If I did this, it would also help John and Cheng's index overcome the max file size limitation. A while ago we had a meeting, and John was sure that a package like this already exists. No one has looked into it. Writing that functionality into the pruning code won't be difficult. But how many times does this same code need to be written in the long run? This will be the third time. (The other two are in InvFPPushIndex for the doc-term index and in InvFPIndexMerge for the inverted index.)
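
The lookup-driven version would look roughly like the sketch below (invented names, not real toolkit classes): walk the lookup table, and for each term we're keeping, seek to its entry in the old index, copy it into the new one, and build the new lookup table as we go.

    // Sketch of the lookup-driven prune (invented names, not the real classes).
    #include <cstdio>
    #include <vector>

    struct LookupEntry { int termID; long offset; long length; };

    void pruneByLookup(const std::vector<LookupEntry> &lookup,
                       const std::vector<bool> &keep,      // indexed by termID
                       std::FILE *oldIndex, std::FILE *newIndex,
                       std::vector<LookupEntry> &newLookup) {
      std::vector<char> buf;
      long newOffset = 0;
      for (size_t i = 0; i < lookup.size(); ++i) {
        const LookupEntry &e = lookup[i];
        if (!keep[e.termID] || e.length <= 0) continue;   // skipping costs no read at all
        buf.resize(e.length);
        std::fseek(oldIndex, e.offset, SEEK_SET);         // the seek we were debating
        std::fread(&buf[0], 1, e.length, oldIndex);
        std::fwrite(&buf[0], 1, e.length, newIndex);
        LookupEntry n = { e.termID, newOffset, e.length };
        newLookup.push_back(n);                           // new lookup built during the prune
        newOffset += e.length;
      }
    }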
