8.29.01: This is where I'll keep notes to myself relating to the work that I'm doing. I'm guessing that it will be mainly for the Lemur Toolkit. I got this idea from someone else's webpage because it seemed cool.
9.30.03: With Cheng's help, I wrote an application that takes a list of queries and compares each query's language model (based on which documents contain that query) to the background model (the language model of all documents in the collection). From this we get a sorted list of which phrases appear in documents that are most different from the collection vocabulary. At first we thought it would be good to just run the top 100 most frequent bigrams, but it turns out that less frequent bigrams can be more different. So we tried it with all bigrams that occur more than X times, like 10 or 40. The top of the KL results list is usually full of names, but they are indeed different from the background. Running these is pretty fun. I think I have a bug in my co-occurrence code, though, because it occasionally core dumps. But I'm not sure yet whether the problem is in my application or in Lemur's structured queries code, because the core dump happens during de-allocation of query nodes for the structured queries.
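For my own reference, the comparison boils down to something like this standalone sketch. The function name, the smoothing constant, and the probability floor are all mine, not the actual application's, and the real code may smooth differently:

```cpp
#include <cmath>
#include <map>
#include <string>

// A phrase or background language model: term -> P(term | model).
typedef std::map<std::string, double> LangModel;

// KL(phrase || background) over the terms the phrase model assigns mass to.
// The linear smoothing with lambda and the 1e-10 floor are guesses.
double klDivergence(const LangModel &phraseLM, const LangModel &backgroundLM,
                    double lambda = 0.9) {
  double kl = 0.0;
  for (LangModel::const_iterator it = phraseLM.begin(); it != phraseLM.end();
       ++it) {
    LangModel::const_iterator bg = backgroundLM.find(it->first);
    double pBg = (bg == backgroundLM.end()) ? 1e-10 : bg->second;
    // Smooth the phrase probability with the background so log() stays sane.
    double pPhrase = lambda * it->second + (1.0 - lambda) * pBg;
    kl += pPhrase * std::log(pPhrase / pBg);
  }
  return kl;
}
```

Sorting all qualifying bigrams by this score is what produces the ranked list, with rare-but-distinctive phrases (like names) floating to the top.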
We are finally going to put out another Lemur release. This will be mainly for bug fixes. We wanted to push out a release with the btree stuff, but now that UMass is revamping the indexing code it doesn't make sense to do that. I think we're aiming for some time in the next 2 weeks.
I just realized that I never finished the "introduction to Lemur" type beginner's guide. So I should do that. We have a new student joining the group who needs to use Lemur so I'll try to get a sense from him about where the documentation needs help.
I guess that's it. Here's my updated to-do list:
9.10.03: Next step was to improve FlattextDocMgr. I made all the changes that I proposed in the forum, but they only improved the time by about a second, or I guess I could say they improved it by 50%; I don't really know which framing is more accurate. Either way, it wasn't good enough, so I ended up using a btree for the document lookups. Is this the long-term solution? Quite possibly, but so far I've only changed the reading part of FlattextDocMgr and not the building part.
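The shape of the lookup the btree replaces looks roughly like this; the class and field names are illustrative, not Lemur's actual FlattextDocMgr internals, and std::map is standing in for the on-disk btree:

```cpp
#include <map>
#include <string>

// Where a document lives in the flat text files.
struct DocLocation {
  std::string file;  // source file containing the document
  long offset;       // byte offset of the document in that file
  long length;       // how many bytes to read
};

class DocLookup {
public:
  // The building side would call this once per document while parsing.
  void add(const std::string &docID, const DocLocation &loc) {
    table_[docID] = loc;
  }

  // O(log n) lookup instead of scanning or preloading a flat table.
  bool find(const std::string &docID, DocLocation &loc) const {
    std::map<std::string, DocLocation>::const_iterator it = table_.find(docID);
    if (it == table_.end()) return false;
    loc = it->second;
    return true;
  }

private:
  std::map<std::string, DocLocation> table_;  // really a btree on disk
};
```

The building part is the half I haven't converted yet.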
I haven't touched the co-occurrence code, but lots of people are asking me for the noun bigrams code. I don't even know what state that's in, but I'll have to package it up. I also need to run the co-occurrence stuff on a dataset other than the public comment data to see if we can get anything interesting out of it. Which brings me back to the problem of Brill's tagger munging the original text when I try to add tags. I'm currently trying to get a "good" version of Brill's tagger from someone.
Meanwhile, there are lots of other miscellaneous things: buying a server, working with someone at UMass to use a better btree package and to look into how to improve Lemur's speed in general, and looking into why/how we would use Perl with Lemur. Oh, and running something new to see how the distribution of terms might differ for subsets of documents that contain a specific phrase, using KL divergence. So here's my massive to-do list as of today (not sure if I've got the priorities right). This of course doesn't even include the long list of stuff associated with releasing/enhancing Lemur.
8.27.03: I've abandoned the btree stuff for now since a release is not imminent. The last thing I did with it was build a simple cache of common terms to see if that would speed things up. There was no real improvement. I was just taking the top 15K or 30K terms or something like that and stuffing them into a hash before indexing. Jamie says maybe I'm not taking enough terms, but if I take more, it uses more memory, so it goes back to the same trade-off. Still, Jamie thinks a cache might buy more speed for less memory; I haven't had time to experiment with what a good cache size might be. Recently the gcc version on the servers was upgraded to 3.2.2. As far as I can tell, the btree stuff does not work past gcc 3.1. No idea why; I haven't looked into it. Jamie says someone he knows might write a specialized btree package for Lemur, so I'll keep my fingers crossed.
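The cache idea itself is simple: check a hash of the most frequent terms before going to the btree. A sketch, with the names as placeholders and lookupInBtree stubbed in for the real slow path:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

typedef int TermId;

// Stub for the slow path: the real version hits the on-disk btree.
TermId lookupInBtree(const std::string & /*term*/) { return -1; }

class TermCache {
public:
  explicit TermCache(std::size_t capacity) : capacity_(capacity) {}

  // Pre-load with the K most frequent terms before indexing starts.
  void preload(const std::string &term, TermId id) {
    if (cache_.size() < capacity_) cache_[term] = id;
  }

  // Hits skip the btree entirely; misses pay the usual cost.
  TermId lookup(const std::string &term) const {
    std::unordered_map<std::string, TermId>::const_iterator it =
        cache_.find(term);
    return (it != cache_.end()) ? it->second : lookupInBtree(term);
  }

private:
  std::size_t capacity_;
  std::unordered_map<std::string, TermId> cache_;
};
```

The open question is just whether some capacity exists where the hit rate pays for the memory, which is the experiment I haven't run.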
I've also been trying to put Lemur behind a CGI interface that Jamie currently has hooked up to Inquery. It uses the btree version of Lemur. So far I've managed to "convert" the C cgi stuff to C++ and link it up with Lemur. I ran into some trouble, but I've fixed it. The index needs world read and write permission because the btree needs write permission to open. Also, since I compiled the btree stuff with an older version of gcc than the server uses, I need to remember to link it statically. The biggest chunk of work required now is to implement structured queries, which I guess one can just take for granted in Inquery. It's a bit of a pain to do in Lemur. There are no nice utilities for doing any kind of retrieval interactively; all the tools are geared towards running batch queries from files. The structured queries are particularly painful since they have to be parsed from a file, converted into an intermediate format, and then put into the representation the retrieval method uses. Why the intermediate format is needed is a mystery to me. There have been many questions about this on the forum in the last few months as well, so it's an issue we should probably address in the near future. Another thing I should fix for the CGI stuff is the DocumentManager being super slow. I know there's a memory bug and other things in FlattextDocMgr that can/should be fixed to improve this. I made a long reply about it in the forum, so I just need to reread it and do what it says.
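Stripped of the retrieval hookup, the CGI side is basically this; runQuery is a placeholder for whatever Lemur call ends up behind it, not a real Lemur function:

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

// Placeholder for the actual Lemur retrieval hookup.
std::string runQuery(const std::string &q) { return "results for: " + q; }

int main() {
  // CGI delivers GET parameters via the QUERY_STRING environment variable.
  // (A real version would also URL-decode %XX escapes and '+' signs.)
  const char *qs = std::getenv("QUERY_STRING");
  std::string query = qs ? qs : "";

  // HTTP header first, then a blank line, then the body.
  std::cout << "Content-type: text/html\r\n\r\n";
  std::cout << "<html><body><pre>" << runQuery(query)
            << "</pre></body></html>\n";
  return 0;
}
```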
So, to sum up, when I return I'll need to (in roughly this order):
1) Implement BTPIndex.
2) Do StructQueries interactively.
3) Fix FlattextDocMgr.
4) Merge in noun phrase detection for the co-occurrence application.
5) "Beautify" the co-occurrence code.
7.21.03: Dave Fisher thinks we should consider using berkeleydb btrees. We did consider this before choosing libgist. The problem is that berkeleydb is in C, and the code is difficult to read (which is probably what makes it fast). Switching to berkeleydb would mean more investigation of the code and more experiments. We've invested enough time in libgist already, and it seems at least good enough to stick with for now. I need to run some experiments using a simple cache of 15,000 terms or so to see how the memory/speed trade-off goes.
In other Lemur news, I've finished integrating the new, improved TextHandler interface. It's completely backward compatible, so we don't have to change all the old code that implements it, but new code can take advantage of additional properties passed in with tokens. I've modified a WebParser to count positions and pass the position in as a property. This is so that we can index multiple "terms" at the same position; that way we can do part-of-speech tagging and such. I've written a separate TextHandler that can accept "Brill tokens": terms that have been through a Brill part-of-speech tagger. I didn't make this part of the Parser so that different parsers can use the same BrillPOSParser. It passes the POS tag along as a property. An Indexer TH needs to look for the properties and pass them into the index. If you want, you can chain an old indexer on and it will create an index that ignores the POS tags. This way you don't need multiple versions of the data hanging around to create different indexes. The modified WebParser also looks for EOS markers and passes them along as the term [eos]. I think we're converging on a standard of putting non-real terms in the index inside brackets, like [oov].
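To remind myself later what the chain looks like: here's the general shape of the property-passing idea. The class and method names are illustrative, not the actual TextHandler interface:

```cpp
#include <map>
#include <string>

typedef std::map<std::string, std::string> Properties;

// A link in the handler chain. Old-style handlers can ignore props
// entirely and still work, which is the backward-compatibility point.
class Handler {
public:
  Handler() : next(0) {}
  virtual ~Handler() {}
  virtual void handleToken(const std::string &term, const Properties &props) {
    if (next) next->handleToken(term, props);
  }
  Handler *next;
};

// Splits Brill output like "dog/NN" and attaches the tag as a property.
// A POS-aware indexer downstream can index term and tag at the same
// position; an old indexer just indexes the term and drops the tag.
class BrillTokenHandler : public Handler {
public:
  void handleToken(const std::string &brillToken, const Properties &props) {
    std::string::size_type slash = brillToken.rfind('/');
    std::string term = brillToken.substr(0, slash);
    Properties p = props;
    if (slash != std::string::npos)
      p["postag"] = brillToken.substr(slash + 1);
    if (next) next->handleToken(term, p);
  }
};
```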
I also wrote some bigram and noun phrase counting applications on top of the POS indexer. (It's getting confusing having both a pos (position) indexer and a POS (part-of-speech) indexer.) At first I thought it would be better to have the tag come first and then the term, but we might add many more tags in the future, like for named entities. So would those come first too, or last? Since any one term may have zero or more tags, perhaps it's better to have the term come first and then all of its tags, including any end-of-sentence marker; we know they apply to that term as long as they have the same position. I'm not really sure how to decide which way is better. Maybe it doesn't matter.
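Under the term-first convention, adding one tagged term would go roughly like this; addTerm is a stand-in for the real indexer call:

```cpp
#include <iostream>
#include <string>

// Stand-in for the real "put term t at position p in the index" call;
// here it just prints what would be indexed.
void addTerm(const std::string &t, int position) {
  std::cout << "position " << position << ": " << t << "\n";
}

// Term-first convention: the real term goes in first, then its tags,
// all sharing the same position so a consumer knows they belong together.
void addTaggedTerm(const std::string &term, const std::string &tag,
                   int position, bool endOfSentence) {
  addTerm(term, position);
  addTerm("[" + tag + "]", position);
  if (endOfSentence) addTerm("[eos]", position);
}

int main() {
  addTaggedTerm("arsenic", "NN", 41, true);  // arsenic, [NN], [eos] at 41
  return 0;
}
```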
Next step for applications of the POS indexer is to do co-occurrence of phrases (and named entities in the future). I haven't started on this yet, but the thought I've given it so far makes me think that having multiple iterators on the same terminfolist would be useful. This reminds me of the STL-style iterators I tried to implement for the docinfolists and the memory issues I ran into. Unfortunately I don't think I documented them in these logs, but I might have records of the issues in email exchanges with Dave Fisher. Actually, I think I do have a final version that more or less works. I didn't want to release it because of some downcasting required to use it with InvFPDocList. But that's certainly better than memory leaks, so I might have to learn to live with it. What I really need is a super duper C++ guru to help me, but the problem is rather complicated. I tried posting it on a C++ forum, which helped a little.
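The compromise version is roughly this shape, with toy classes standing in for DocInfoList and InvFPDocList; the static_cast in operator* is the downcast I don't like:

```cpp
#include <cstddef>
#include <vector>

struct DocInfo { int docID; };                              // base entry type
struct InvFPDocInfo : DocInfo { std::vector<int> positions; };

// Toy list interface: hands back base-class pointers only.
class DocInfoList {
public:
  virtual ~DocInfoList() {}
  virtual DocInfo *entryAt(std::size_t i) = 0;
  virtual std::size_t size() const = 0;
};

// STL-style iterator templated on the concrete entry type. Each iterator
// keeps its own cursor, so two of them can walk the same list independently,
// which is what the co-occurrence code needs. The downcast is the ugly part.
template <class Entry>
class DocInfoIterator {
public:
  DocInfoIterator(DocInfoList *l, std::size_t pos) : list(l), i(pos) {}
  Entry &operator*() { return *static_cast<Entry *>(list->entryAt(i)); }
  DocInfoIterator &operator++() { ++i; return *this; }
  bool operator!=(const DocInfoIterator &o) const { return i != o.i; }
private:
  DocInfoList *list;
  std::size_t i;
};
```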
This brings me to a final point I want to make for today: I have too many Lemur directories. At some point I will start merging things into the main tree, and it will be a really big pain. Oh, I should maybe also mention that last month I made some limited Lemur applications for an e-commerce class project at CMU West. It seems to have gone without any problems. Actually, someone from the class even wrote a web interface to it. One day I might actually start working on a GUI for the full Lemur system.
All right, now really for the last point! It is becoming more and more apparent that the documentation for Lemur is inadequate. I've been working on a novice guide to using Lemur, and I should really wrap it up and post it. I should also include a "known problems" list on the download page. We have a few known bugs in the current version.