10.30.01: John gave me a stripped-down version of param.c today. Param.hpp does nothing except include param.h. There are some methods in Param.cpp that are used by ParamManager (some are also used by main.cpp); I just included their signatures in Param.hpp and it compiled fine.
unistd.h is included by util.c. Generally you can get by with just removing that include and it will compile fine on Windows. When I did that, it told me that it cannot find F_OK, which is used by qfilef. F_OK is just 0, so I defined it, and it compiles fine. But util.c also needs fork(), so it won't link; fork is used in sleepx, erasef, and mv. I'm also having trouble finding __builtin_next_arg at link time. I'm not sure what that is; it's wanted by error.
I also looked into whether a Java GUI can talk with C++ code via callbacks. The answer is yes and no. Using the Java Native Interface, you can create Java classes whose methods are implemented in C or C++. From such a method, you can access Java methods, variables, etc. (because you have a handle on the JNI environment). However, everything that happens between C++ and Java needs to happen within that layer. I'm not sure how easy it would be to cleanly create such a layer for Yi's code. I guess we'll discuss on Thurs. I am kind of excited about doing some UI work, but I'm not really sure I'll be able to understand what it is this GUI needs to do. I looked at the current version and it's very confusing. It'll also probably feel strange to "think Java" now too.
Found out today that my CORI index is wrong. It thinks it only has 17 indexes because of the data I used. I am supposed to use the trec123 CDs, but from a special directory that has them split up into 100 files. Paul tried to rebuild one today and it crashed. Don't know why. It might be that something is wrong in his new code, so I am trying to build it again with his old code. Hopefully it isn't something wrong with my merging code.
I believe he's built this database before (and so has Luo). It's taking a while though as so many people are using the server. Jamie's asked me to look into what kind of server we might want if we were to buy one. I haven't even gotten around to doing that yet.
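The F_OK fix above can be wrapped in a small portability shim so util.c compiles unchanged on both platforms. This is only a sketch of the idea, assuming qfilef just needs an existence test; `file_exists` is an illustrative stand-in, not Lemur's actual function, and it does nothing for the fork() problem, which needs the callers rewritten.

```cpp
// Portability sketch for the unistd.h / F_OK issue described above.
// Assumption: the only thing needed from unistd.h here is access()/F_OK.
#ifdef _WIN32
  #include <io.h>          // MSVC's _access lives here
  #ifndef F_OK
    #define F_OK 0         // existence check; matches the POSIX value
  #endif
  #define access _access   // MSVC spells it with a leading underscore
#else
  #include <unistd.h>      // access() and F_OK on unix
#endif

// Hypothetical stand-in for util.c's qfilef: does the file exist?
bool file_exists(const char* path) {
    return access(path, F_OK) == 0;
}
```

With this, the `#include <unistd.h>` line in util.c could be replaced by the shim instead of being deleted by hand on each Windows build.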
10.25.01: Offered to make the windows happy changes to Cheng's retrieval and langmod code. I think he'll take me
up on it even though Jamie didn't think he would. Those changes are pretty easy but I definitely don't want
to be responsible for their indexing code. I do want to get the whole port done with soon though. It's really
dragging on. Since when did I become the C++ Windows expert? I really have no idea.
10.23.01: Played around more with the .mak files yesterday. On Kevin's machine, it looked like opening a .mak file with VC++
would create a very nice project automatically. However, I tested this by deleting all other files from VC++
and it does not make a project at all. (It needs the project file for that.) It just created a new project
with one file, the makefile. When you build, it says it's building the .exe, but it doesn't actually; it
builds whatever the makefile says to build (in this case the .lib). I was able to figure out in the settings
how to tell it where to put the file, etc. and get the paths all right. By default, it creates the debug version.
There is a place in the settings where you can specify, like a command line, and it will build the release version.
However, it would be nicer if it would just build the release version. I made the .mak file in the /lemur root directory.
If the makefile is there, it finds everything and puts everything in the right place regardless of where you
tell it to create the project. I've been thinking about getting CodeWarrior and seeing if I can use the .mak
file with it.
I guess Jamie has other people in the group using the toolkit now too because I've been getting many emails
about it. I hope that my job won't eventually turn into a support job. Ick. But for now, it's ok even if I've
been accused of having buggy code. So I said bug reports are welcome, but have received none thus far. Anyway,
I looked again at the code I wrote for InvFPTermList that converts it to bag-of-words. During that process I basically
throw away the position information instead of putting it into a nice little list. So when you iterate through
the TermList and get the term, you can't get positions. (You can still get list of positions through inverted list
and through sequence-of-words.) It's amazing how much I've forgotten of my own code in such a short time.
I really had to dig through it again. Anyway, if I have spare time maybe I should fix that. So far it's more
trouble than it's worth. There is no "infrastructure" to support it yet. I'd really have to push things around.
Also, at some point, I really need to buckle down and figure out why sometimes during indexing, words that are
supposed to be in the map already don't get found. It doesn't appear to have an effect on the quality of the
index though. I think this error occurs less frequently with larger caches.
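The bag-of-words conversion described above can be sketched as a simple collapse from positions to counts, which is exactly where the position information gets thrown away. The types here are illustrative, not InvFPTermList's actual interface.

```cpp
#include <map>
#include <vector>

// Sketch of converting a sequence-of-words term list (a termID at each
// position) into bag-of-words counts. The positions vector is consumed
// only for its termIDs, so position info is discarded -- which is why
// the resulting TermList can no longer hand positions back.
std::map<int, int> toBagOfWords(const std::vector<int>& positions) {
    std::map<int, int> counts;
    for (int termID : positions) ++counts[termID];
    return counts;
}
```

Keeping the positions would mean building a per-term position list here instead of just a counter, which is the "infrastructure" that doesn't exist yet.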
10.11.01: I ran the 2-pass method using Zipf's Law to compute topN on the web data, which is stopped and stemmed.
The 2nd pass is definitely more accurate than using 0.1 for A, but I'm not sure it's close enough. For
smaller numbers like 5000, it is pretty close, but for larger numbers, 15K, 30K, 60K, it's off by 10-20%.
It gets closer with 3 passes and probably with more and more passes until it converges on the N we want.
Played around some with trying to start with a different A or trying to get a sample rank and ctf using the top
N% of the collection (like 35%) or using some variation on the given N. But those methods make it close for only
some of the numbers, whichever happens to be close to the N we want to begin with, and just a bit better for all
the others. This just means that if the sample rank & CTF we get is close to the real N we want, then
the results are better. But isn't that like saying if we already have what we want then we can get what we want?
So I don't really know what to do there.
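The multi-pass estimate above can be sketched as follows, assuming the classic rank-frequency form ctf(r) ≈ C / r^A. This is my guess at the shape of the method, not the actual code: the first pass picks a threshold from one sampled (rank, ctf) point and an assumed A; the next pass counts how many terms actually clear it and re-solves for A through both points.

```cpp
#include <cmath>

// First pass: given one sampled term's rank and collection term frequency
// and an assumed Zipf exponent A, estimate the ctf cutoff for the top-N
// terms, using ctf(r) = C / r^A.
double topNThreshold(double sampleRank, double sampleCtf,
                     double A, double N) {
    double C = sampleCtf * std::pow(sampleRank, A);   // solve for C
    return C / std::pow(N, A);                        // ctf at rank N
}

// Later pass: after a scan counts how many terms actually exceed the
// threshold (observedN), re-fit A so the curve passes through both the
// sample point and (observedN, threshold), then re-estimate.
double refineA(double sampleRank, double sampleCtf,
               double observedN, double threshold) {
    return std::log(sampleCtf / threshold) /
           std::log(observedN / sampleRank);
}
```

Each pass costs a scan over the vocabulary, which is why convergence speed matters; if the sample point already sits near rank N, the first pass is nearly exact, which matches the circularity complained about above.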
10.10.01: Changed the HTML for the file upload to include a place for a job ID (to be the subject of the return email) and to
allow people to choose which retrieval alg they want to use. Now, do I want to let people choose more than
one at a time? Does that make things too complicated for me? How would the return email look? Maybe I can
make the return email subject be "ID+retmethod". I wonder if anyone would use this service... Maybe it would
be a good idea to have an explanation of the system and how it was built, stopword list, etc.
Set up a meeting with Cheng for tomorrow at 3pm. Need to talk to him about param.cpp. Does it really need
to be that complicated? Why does it need param.c? Why are they the same name? It causes compilation errors on
NT. Can we write a simpler one? Need to ask him how he's doing with the "port". Need to go over any other
issues for the release.
Chat with Kevin about VC and projects and makefiles. I think distributing projects is a bad idea because of
the paths issues. We can force VC to export .mak files, which are makefiles for use with nmake (dos make). If
you open the .mak file with VC, it creates a project for you and includes all the correct source files using
relative paths. You can also use nmake on the command line. I would guess that this file would be more usable
with other applications as well, such as CodeWarrior. Jamie thinks that it would be best to create libraries
parallel to the ones we create on unix. I guess that would be best. I'll have to recheck the details for the
relative paths and make sure that the .mak files are distributed in such a way that all someone has to do is
open it.
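For the command-line route, VC-exported .mak files take a CFG argument selecting the build configuration, so the release build can be requested directly without touching the project settings. The filename `lemur.mak` and the target name are assumptions here; the actual names come from whatever project exported the file (the exported .mak prints its valid CFG values in a header comment).

```shell
# Build the release configuration from a VC-exported makefile with nmake.
# "lemur" is a placeholder project name; check the .mak header for the
# exact CFG strings it accepts.
nmake /f "lemur.mak" CFG="lemur - Win32 Release"
```

Leaving CFG off builds whatever default the .mak declares, which is usually the debug configuration, matching the behavior described above.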
10.09.01: SCS announced that we got a school-wide license for CodeWarrior. I'm trying to decide whether or not I want
to get it and see if anything compiles using that. I wonder if it's any better than VC++. If I even want to bother
with it. I'm really, really tempted though and I'm not sure why. Maybe because VC++ is so irritating.
10.05.01:
Other good news is that we were able to build the retrieval code and my index. It was evident though that he
could not have done this without my presence, which is a bad thing, as I do not wish to be present any time anyone
wants to run this thing on NT. Kevin suggested we check in project files for libraries that people will want to
build so that they don't need to know which files to grab to create a project. Is this really the only way to
distribute things on windows? Obviously we would then be requiring people to use VC++. Are other compilers able to
import other types of project files? I'm not sure. There is another concern, which is that when I create projects,
the path info for the source files is encoded into the file. This means that we need to expect people to "install"
in the same path as I have? That would be ludicrous. There should be a way to specify source files and then specify
the path. There *should* be a way, but I'm not sure it's easy to find, knowing VC++.
Running Paul's CORI index code. I see in his temp directories that he creates one collection selection db and a
separate index for each of the data files. When I run it, it only creates the CORI index and none of the individual
ones. I must not be running something else I need to run. Have to ask Paul about it. I'm also still waiting on Paul to
build a web trec index. He's been having problems with CVS.
Waiting for Cheng to give me usr directory on ringtail so I can run stuff there. LA's been getting pounded.
10.03.01:
10.02.01: Got RetEval.cpp (of app/src, Cheng's code) to compile on NT, barely.. and only if I change everything. It
includes all sorts of stuff that does not compile on NT, mainly the BasicIndex and params.c stuff (used for managing
parameter file). I'm not sure when Cheng will get a chance to windows-ize everything. He has been quite busy
lately. One good thing is that I got to link in the static library I built with the retrieval code. That seems
to work. One thing I also discovered is that it would be useful to have a library built with the InvFPIndex stuff
so I don't need to pull everything into the project every time I want to build an application that wants to use
this index (which is all of them).
Met with Paul and got the scoop on all his code and how to run the collection selection indexes. Need to
actually run it now.
Been thinking about how to most efficiently "prune" an index. One way is to walk down the index file, copying
over what we want, and skipping the rest. Another is to walk down the lookup file, seek to what we need and copy
that over. The questions are whether skipping over what we don't need is effectively a seek, and what percentage
of the index we'll end up copying. Seeks are expensive, but might be worth it if we end up skipping a lot of things.
Also, it's a bit cleaner (interface wise) to use the lookup, as an index could be spread out over many files. Jamie
thinks it's more robust to use lookup method. Another thing to remember is creating a lookup table during the index
pruning task. Will this pose any problems? How likely is the pruned index going to span multiple files? Is it
worth it now to write a separate class that deals with a "very large file" and internally takes care of spanning the
index over multiple files? If I do this, it would also help John and Cheng's index overcome the max file size
limitation. A while ago, we had a meeting, and John was sure that a package like this already exists. No one has
looked into it. Writing that functionality into the pruning code won't be difficult. But how many times does this
same code need to be written in the long run? This will be the third time. (The other two are in InvFPPushIndex for the
docterm index and in InvFPIndexMerge for the inverted index.)
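The "very large file" class floated above could look something like this on the write side: one logical append stream that transparently rolls over to a numbered segment file when a size cap is hit, and exposes the segment list so a reader or lookup table can find things later. All names and the rollover policy are illustrative; this is a sketch of the idea, not an existing package.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Sketch: a single logical write stream spanning base.0, base.1, ...
// A write that would push the current segment past maxBytes starts a
// new segment first (an oversized single write still lands in one
// segment -- a real version would need to split or reject it).
class SpanningWriter {
public:
    SpanningWriter(std::string base, long maxBytes)
        : base_(std::move(base)), maxBytes_(maxBytes) { open(); }
    ~SpanningWriter() { if (fp_) std::fclose(fp_); }

    void write(const char* buf, long n) {
        if (written_ + n > maxBytes_) open();   // roll over to next segment
        std::fwrite(buf, 1, static_cast<size_t>(n), fp_);
        written_ += n;
    }

    // Segment filenames produced so far, for a later reader/lookup table.
    const std::vector<std::string>& segments() const { return names_; }

private:
    void open() {
        if (fp_) std::fclose(fp_);
        names_.push_back(base_ + "." + std::to_string(names_.size()));
        fp_ = std::fopen(names_.back().c_str(), "wb");
        written_ = 0;
    }
    std::string base_;
    long maxBytes_ = 0;
    long written_ = 0;
    std::FILE* fp_ = nullptr;
    std::vector<std::string> names_;
};
```

A matching reader that maps a logical offset to (segment, local offset) is the other half; with both in one class, the pruning code, InvFPPushIndex, and InvFPIndexMerge could all share it instead of writing the spanning logic a third time.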