8.29.01: This is where I'll keep notes to myself relating to the work that I'm doing. I'm guessing that it will be mainly for the Lemur Toolkit. I got this idea from someone else's webpage because it seemed cool.
9.30.03: With Cheng's help, I wrote an application that takes a list of queries and compares each query's language model (based on which documents contain that query) to the background model (the language model of all documents in the collection). From this we get a sorted list of which phrases appear in documents that are most different from the collection vocabulary. At first we thought it would be good to just run the top 100 most frequent bigrams, but it turns out that less frequent bigrams can be more different. So we tried it with all bigrams that occur more than X times, like 10 or 40. The top of the KL results list is usually full of names, but they are indeed different from the background. Running these is pretty fun. I think I have a bug in my co-occurrence code, though, because it occasionally core dumps. But I'm not sure yet whether the problem is in my application or in Lemur's structured queries code, because the core dump happens during de-allocation of query nodes for the structured queries.
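For my own reference, the comparison boils down to something like this standalone sketch. The function name, the smoothing constant, and the probability floor are all mine, not the actual application's, and the real code may smooth differently:

```cpp
#include <cmath>
#include <map>
#include <string>

// A phrase or background language model: term -> P(term | model).
typedef std::map<std::string, double> LangModel;

// KL(phrase || background) over the terms the phrase model assigns mass to.
// The linear smoothing with lambda and the 1e-10 floor are guesses.
double klDivergence(const LangModel &phraseLM, const LangModel &backgroundLM,
                    double lambda = 0.9) {
  double kl = 0.0;
  for (LangModel::const_iterator it = phraseLM.begin(); it != phraseLM.end();
       ++it) {
    LangModel::const_iterator bg = backgroundLM.find(it->first);
    double pBg = (bg == backgroundLM.end()) ? 1e-10 : bg->second;
    // Smooth the phrase probability with the background so log() stays sane.
    double pPhrase = lambda * it->second + (1.0 - lambda) * pBg;
    kl += pPhrase * std::log(pPhrase / pBg);
  }
  return kl;
}
```

Sorting all qualifying bigrams by this score is what produces the ranked list, with rare-but-distinctive phrases (like names) floating to the top.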
We are finally going to put out another Lemur release. This will be mainly for bug fixes. We wanted to push out a release with the btree stuff, but now that UMass is revamping the indexing code it doesn't make sense to do that. I think we're aiming for some time in the next 2 weeks.
I just realized that I never finished the "introduction to Lemur" type beginner's guide. So I should do that. We have a new student joining the group who needs to use Lemur so I'll try to get a sense from him about where the documentation needs help.
I guess that's it. Here's my updated to-do list:
9.10.03: Next step was to improve FlattextDocMgr. I made all the changes that I proposed in the forum, but they only improved the time by about a second, or I guess I could say they improved it by 50%; I don't really know which framing is more accurate. Either way, it wasn't good enough, so I ended up using a btree for the document lookups. Is this the long-term solution? Quite possibly, but so far I've only changed the reading part of FlattextDocMgr and not the building part.
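The shape of the lookup the btree replaces looks roughly like this; the class and field names are illustrative, not Lemur's actual FlattextDocMgr internals, and std::map is standing in for the on-disk btree:

```cpp
#include <map>
#include <string>

// Where a document lives in the flat text files.
struct DocLocation {
  std::string file;  // source file containing the document
  long offset;       // byte offset of the document in that file
  long length;       // how many bytes to read
};

class DocLookup {
public:
  // The building side would call this once per document while parsing.
  void add(const std::string &docID, const DocLocation &loc) {
    table_[docID] = loc;
  }

  // O(log n) lookup instead of scanning or preloading a flat table.
  bool find(const std::string &docID, DocLocation &loc) const {
    std::map<std::string, DocLocation>::const_iterator it = table_.find(docID);
    if (it == table_.end()) return false;
    loc = it->second;
    return true;
  }

private:
  std::map<std::string, DocLocation> table_;  // really a btree on disk
};
```

The building part is the half I haven't converted yet.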
I haven't touched the co-occurrence code, but lots of people are asking me for the noun bigrams code. I don't even know what state that's in, but I'll have to package it up. I also need to run the co-occurrence stuff on a dataset other than the public comment data to see if we can get anything interesting out of it. Which brings me back to the problem of Brill's tagger munging the original text when I try to add tags. I'm currently trying to get a "good" version of Brill's tagger from someone.
Meanwhile, there are lots of other miscellaneous things: buying a server, working with someone at UMass to use a better btree package and to look into how to improve Lemur's speed in general, and looking into why/how we would use Perl with Lemur. Oh, and running something new to see how the distribution of terms might differ for subsets of documents that contain a specific phrase, using KL divergence. So here's my massive to-do list as of today (not sure if I've got the priorities right). This of course doesn't even include the long list of stuff associated with releasing/enhancing Lemur.
8.27.03: I've abandoned the btree stuff for now since a release is not imminent. The last thing I did with it was build a simple cache of common terms to see if that would speed things up. There was no real improvement. I was just taking the top 15K or 30K terms or something like that and stuffing them into a hash before indexing. Jamie says maybe I'm not taking enough terms, but if I take more, it uses more memory, so it goes back to the same trade-off. Still, Jamie thinks a cache might buy more speed for less memory; I haven't had time to experiment with what a good cache size might be. Recently the gcc version on the servers was upgraded to 3.2.2. As far as I can tell, the btree stuff does not work past gcc 3.1. No idea why; I haven't looked into it. Jamie says someone he knows might write a specialized btree package for Lemur, so I'll keep my fingers crossed.
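The cache idea itself is simple: check a hash of the most frequent terms before going to the btree. A sketch, with the names as placeholders and lookupInBtree stubbed in for the real slow path:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

typedef int TermId;

// Stub for the slow path: the real version hits the on-disk btree.
TermId lookupInBtree(const std::string & /*term*/) { return -1; }

class TermCache {
public:
  explicit TermCache(std::size_t capacity) : capacity_(capacity) {}

  // Pre-load with the K most frequent terms before indexing starts.
  void preload(const std::string &term, TermId id) {
    if (cache_.size() < capacity_) cache_[term] = id;
  }

  // Hits skip the btree entirely; misses pay the usual cost.
  TermId lookup(const std::string &term) const {
    std::unordered_map<std::string, TermId>::const_iterator it =
        cache_.find(term);
    return (it != cache_.end()) ? it->second : lookupInBtree(term);
  }

private:
  std::size_t capacity_;
  std::unordered_map<std::string, TermId> cache_;
};
```

The open question is just whether some capacity exists where the hit rate pays for the memory, which is the experiment I haven't run.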
I've also been trying to put Lemur behind a CGI interface that Jamie currently has hooked up to Inquery. It uses the btree version of Lemur. So far I've managed to "convert" the C cgi stuff to C++ and link it up with Lemur. I ran into some trouble, but I've fixed it. The index needs world read and write permission because the btree needs write permission to open. Also, since I compiled the btree stuff with an older version of gcc than the server uses, I need to remember to link it statically. The biggest chunk of work required now is to implement structured queries, which I guess one can just take for granted in Inquery. It's a bit of a pain to do in Lemur. There are no nice utilities for doing any kind of retrieval interactively; all the tools are geared towards running batch queries from files. The structured queries are particularly painful since they have to be parsed from a file, converted into an intermediate format, and then put into the representation the retrieval method uses. Why the intermediate format is needed is a mystery to me. There have been many questions about this on the forum in the last few months as well, so it's an issue we should probably address in the near future. Another thing I should fix for the CGI stuff is the DocumentManager being super slow. I know there's a memory bug and other things in FlattextDocMgr that can/should be fixed to improve this. I made a long reply about it in the forum, so I just need to reread it and do what it says.
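Stripped of the retrieval hookup, the CGI side is basically this; runQuery is a placeholder for whatever Lemur call ends up behind it, not a real Lemur function:

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

// Placeholder for the actual Lemur retrieval hookup.
std::string runQuery(const std::string &q) { return "results for: " + q; }

int main() {
  // CGI delivers GET parameters via the QUERY_STRING environment variable.
  // (A real version would also URL-decode %XX escapes and '+' signs.)
  const char *qs = std::getenv("QUERY_STRING");
  std::string query = qs ? qs : "";

  // HTTP header first, then a blank line, then the body.
  std::cout << "Content-type: text/html\r\n\r\n";
  std::cout << "<html><body><pre>" << runQuery(query)
            << "</pre></body></html>\n";
  return 0;
}
```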
So, to sum up, when I return I'll need to (in roughly this order):
1) Implement BTPIndex.
2) Do StructQueries interactively.
3) Fix FlattextDocMgr.
4) Merge in noun phrase detection for the co-occurrence application.
5) "Beautify" the co-occurrence code.
7.21.03: Dave Fisher thinks we should consider using berkeleydb btrees. We did consider this before choosing libgist. The problem is that berkeleydb is in C, and the code is difficult to read (which is probably what makes it fast). Switching to berkeleydb would mean more investigation of the code and more experiments. We've invested enough time in libgist already, and it seems at least good enough to stick with for now. I need to run some experiments using a simple cache of 15,000 terms or so to see how the memory/speed trade-off goes.
In other Lemur news, I've finished integrating the new, improved TextHandler interface. It's completely backward compatible, so we don't have to change all the old code that implements it, but new code can take advantage of additional properties passed in with tokens. I've modified a WebParser to count positions and pass the position in as a property. This is so that we can index multiple "terms" at the same position; that way we can do part-of-speech tagging and such. I've written a separate TextHandler that can accept "Brill tokens": terms that have been through a Brill part-of-speech tagger. I didn't make this part of the Parser so that different parsers can use the same BrillPOSParser. It passes the POS tag along as a property. An Indexer TH needs to look for the properties and pass them into the index. If you want, you can chain an old indexer on and it will create an index that ignores the POS tags. This way you don't need multiple versions of the data hanging around to create different indexes. The modified WebParser also looks for EOS markers and passes them along as the term [eos]. I think we're converging on a standard of putting non-real terms in the index inside brackets, like [oov].
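To remind myself later what the chain looks like: here's the general shape of the property-passing idea. The class and method names are illustrative, not the actual TextHandler interface:

```cpp
#include <map>
#include <string>

typedef std::map<std::string, std::string> Properties;

// A link in the handler chain. Old-style handlers can ignore props
// entirely and still work, which is the backward-compatibility point.
class Handler {
public:
  Handler() : next(0) {}
  virtual ~Handler() {}
  virtual void handleToken(const std::string &term, const Properties &props) {
    if (next) next->handleToken(term, props);
  }
  Handler *next;
};

// Splits Brill output like "dog/NN" and attaches the tag as a property.
// A POS-aware indexer downstream can index term and tag at the same
// position; an old indexer just indexes the term and drops the tag.
class BrillTokenHandler : public Handler {
public:
  void handleToken(const std::string &brillToken, const Properties &props) {
    std::string::size_type slash = brillToken.rfind('/');
    std::string term = brillToken.substr(0, slash);
    Properties p = props;
    if (slash != std::string::npos)
      p["postag"] = brillToken.substr(slash + 1);
    if (next) next->handleToken(term, p);
  }
};
```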
I also wrote some bigram and noun phrase counting applications on top of the POS indexer. (It's getting confusing having both a pos (position) indexer and a POS (part-of-speech) indexer.) At first I thought it would be better to have the tag come first and then the term, but we might add many more tags in the future, like for named entities. So would those come first too, or last? Since any one term may have zero or more tags, perhaps it's better to have the term come first and then all of its tags, including any end-of-sentence marker; we know they apply to that term as long as they have the same position. I'm not really sure how to decide which way is better. Maybe it doesn't matter.
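Under the term-first convention, adding one tagged term would go roughly like this; addTerm is a stand-in for the real indexer call:

```cpp
#include <iostream>
#include <string>

// Stand-in for the real "put term t at position p in the index" call;
// here it just prints what would be indexed.
void addTerm(const std::string &t, int position) {
  std::cout << "position " << position << ": " << t << "\n";
}

// Term-first convention: the real term goes in first, then its tags,
// all sharing the same position so a consumer knows they belong together.
void addTaggedTerm(const std::string &term, const std::string &tag,
                   int position, bool endOfSentence) {
  addTerm(term, position);
  addTerm("[" + tag + "]", position);
  if (endOfSentence) addTerm("[eos]", position);
}

int main() {
  addTaggedTerm("arsenic", "NN", 41, true);  // arsenic, [NN], [eos] at 41
  return 0;
}
```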
Next step for applications of the POS indexer is to do co-occurrence of phrases (and named entities in the future). I haven't started on this yet, but the thought I've given it so far makes me think that having multiple iterators on the same terminfolist would be useful. This reminds me of the STL-style iterators I tried to implement for the docinfolists and the memory issues I ran into. Unfortunately I don't think I documented them in these logs, but I might have records of the issues in email exchanges with Dave Fisher. Actually, I think I do have a final version that more or less works. I didn't want to release it because of some downcasting required to use it with InvFPDocList. But that's certainly better than memory leaks, so I might have to learn to live with it. What I really need is a super duper C++ guru to help me, but the problem is rather complicated. I tried posting it on a C++ forum, which helped a little.
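The compromise version is roughly this shape, with toy classes standing in for DocInfoList and InvFPDocList; the static_cast in operator* is the downcast I don't like:

```cpp
#include <cstddef>
#include <vector>

struct DocInfo { int docID; };                              // base entry type
struct InvFPDocInfo : DocInfo { std::vector<int> positions; };

// Toy list interface: hands back base-class pointers only.
class DocInfoList {
public:
  virtual ~DocInfoList() {}
  virtual DocInfo *entryAt(std::size_t i) = 0;
  virtual std::size_t size() const = 0;
};

// STL-style iterator templated on the concrete entry type. Each iterator
// keeps its own cursor, so two of them can walk the same list independently,
// which is what the co-occurrence code needs. The downcast is the ugly part.
template <class Entry>
class DocInfoIterator {
public:
  DocInfoIterator(DocInfoList *l, std::size_t pos) : list(l), i(pos) {}
  Entry &operator*() { return *static_cast<Entry *>(list->entryAt(i)); }
  DocInfoIterator &operator++() { ++i; return *this; }
  bool operator!=(const DocInfoIterator &o) const { return i != o.i; }
private:
  DocInfoList *list;
  std::size_t i;
};
```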
This brings me to a final point I want to make for today: I have too many Lemur directories. At some point I will start merging things into the main tree, and it will be a really big pain. Oh, I should maybe also mention that last month I made some limited Lemur applications for an e-commerce class project at CMU West. It seems to have gone without any problems. Actually, someone from the class even wrote a web interface to it. One day I might actually start working on a GUI for the full Lemur system.
All right, now really for the last point! It is becoming more and more apparent that the documentation for Lemur is inadequate. I've been working on a novice guide to using Lemur, and I should really wrap it up and post it. I should also include a "known problems" list on the download page. We have a few known bugs in the current version.