Introduction
What is Lemur?
How can it be useful?
Indexing
What is an index?
What kind of data/documents can Lemur index?
Do the parsers add all words into the index?
How do I build an index?
How can I tell if the index built anything?
What type of indexes does Lemur have?
How do I add documents to an index I already have?
Retrieval
How can I run queries?
How do I evaluate the results?
How can I view the original document as part of the results?
Documents
What is a document manager?
How do I build a document manager?
How can I add a document manager to an existing index?
What is Lemur?
Lemur is a toolkit designed to facilitate
research in language modeling and information retrieval (IR), where IR
is broadly interpreted to include such technologies as ad hoc and
distributed retrieval, with structured queries, cross-language IR,
summarization, filtering, and categorization. The system's
underlying architecture was built to support the technologies
above. We provide many useful sample applications, but have
designed the toolkit to allow you to easily program your own
customizations and applications.
How can it be useful?
Lemur is particularly useful for researchers in
language modeling and information retrieval who do not want to write
their own indexers but would rather focus on developing new techniques
and algorithms. In addition to indexing, we provide some baseline
retrieval algorithms, such as TFIDF and Okapi, for use and
comparison. We have implemented basic ad hoc IR, distributed IR,
IR using structured queries, IR using distributed indexes, and
summarization. Others have used Lemur for filtering tasks, webpage
finding, passage finding, and simple web search engines.
What is an index?
An index, or database, is basically a collection
of information that can be quickly accessed, using some piece of
information as a point of reference or key (what it's indexed by).
In our case, we index information about the terms in a collection of
documents, which you can access later using either a term or a document
as the reference. Specifically, we collect term frequency, term
position, and document length statistics because those are most commonly
needed for information retrieval. For example, from the index, you
can find out how many times a certain term occurred in the collection of
documents, or how many times it occurred in just one specific
document. Retrieval algorithms that decide which documents to
return for a given query use the collected information in the index in
their scoring calculations. (See Index.hpp for full index access API.)
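For example, a minimal program using that API might look like the sketch below. It assumes the usual pattern of opening an index through IndexManager and then calling the accessor methods declared in Index.hpp; treat the exact method names and signatures as assumptions to verify against your version of the toolkit.

#include "IndexManager.hpp"
#include "Index.hpp"
#include <iostream>

int main(int argc, char *argv[]) {
  // argv[1]: the index's table of contents file, argv[2]: a term to look up
  Index *ind = IndexManager::openIndex(argv[1]);
  int termID = ind->term(argv[2]);   // map the term to its internal id
  std::cout << "occurrences in collection: " << ind->termCount(termID) << std::endl;
  std::cout << "documents containing term: " << ind->docCount(termID) << std::endl;
  std::cout << "total documents in index:  " << ind->docCount() << std::endl;
  delete ind;
  return 0;
}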
What kind of data/documents can Lemur index?
Actually, you can create your own parsers for whatever
text documents you have, as long as your parser takes whatever it wants
to recognize as a term and "pushes" it into the index. (See
PushIndex.hpp for API.) However, we do provide several
parsers with the toolkit.
Lemur is primarily a research system so the included
parsers were designed to facilitate indexing many documents that are in
the same file. In order for the index to know where the document
boundaries are within files, each document must have begin document and
end document tags. These tags are similar to HTML or XML tags and
are actually the format for NIST's Text REtrieval Conference (TREC)
documents.
The 2 most frequently used parsers are the TrecParser
and WebParser.
TrecParser: This parser recognizes text in the
TEXT, HL, HEAD, HEADLINE, TTL, and LP fields. For example:
<DOC>
<DOCNO> document_number </DOCNO>
<TEXT>
Index this document text.
</TEXT>
</DOC>
WebParser: This parser removes HTML tags, text within SCRIPT tags, as well as text in HTML comments. Document boundaries are specified with NIST style format:
<DOC>
<DOCNO> document_number </DOCNO>
Document text here could be in HTML.
</DOC>
In addition to these parsers, Lemur also provides parsers for Chinese (GB2312 encoding) and Arabic (CP1256 encoding). (See "Parsing in Lemur" for more information.)
If your documents are not already in one of these formats, you can either convert them into a format recognized by one of the provided parsers, or write your own parser that pushes terms into the index as described above.
How do I build an index?
The easiest way to build an index is to get your documents into a
format recognized by an existing Lemur parser, then use BuildInvertedIndex.
BuildInvertedIndex will build an index with inverted lists of document
frequencies for terms (DocInfoLists). It also builds a database of
term frequencies for documents (TermInfoLists). This will allow
you to run queries and get back a list of document ids as results, but
not the original document text. For that capability, you will need to
build a document manager (see What is a document
manager?).
Like all the applications provided with Lemur, BuildInvertedIndex
takes at least one argument, the first being the name of a file containing a
list of parameters, or settings, for the application.
Usage is:
BuildInvertedIndex paramfile [datafile1] [datafile2] [...]
The parameter file given to BuildInvertedIndex must contain
parameters in the following format:
paramname = value;
param2 = value;
Each parameter must be on a separate line, and each line must end
with a semicolon. There are many parameters you can use in this
file; two of them, index and docFormat, are required. (A complete example
parameter file is shown after the parameter descriptions below.)
index: This is what you
want to name the index that will be created. For example, index =
/lemur/data/myindex; This will create an index called myindex in the /lemur/data
directory. You must put the full path here if you want to load
this index later from other directories. If you do not put a full
path, the index will be created in the current directory, and you must
be in that directory to use it again.
docFormat: This says what
format your document files are in. For example, if they are in the
NIST TREC format described above, you would put docFormat = trec;
The available options are trec for
general NIST TREC documents, web for
TREC web HTML documents, chinese
for segmented Chinese text in TREC format using GB encoding, chinesechar for unsegmented Chinese
text in TREC format using GB encoding, and arabic for Arabic text in TREC
format using Windows CP1256 encoding.
Here is the list of optional parameters:
memory: The amount of memory,
in bytes, that the index will use for its caching purposes.
Unfortunately, this is not a guarantee for how much total working
memory the indexing process will take. You should set this value
depending on how much RAM your machine has. The more memory that
you can use, the faster the indexing process will be. The default
value is 96000000 if you do not set this parameter. If your
machine can handle it, 256000000 is generally a good value.
position: This specifies
whether or not you want the index to store information about the
positions of terms within documents (where they occur in each
document). Position information is necessary for certain
retrieval applications, such as using structured queries for phrase
detection, or looking for bigrams. Use 0 for no positions and 1 for
storing positions. The default value is 1. Storing positions
makes the index bigger than not storing them, which requires more disk
space. It also makes the indexing and retrieval process a
little bit slower.
stopwords: The name of the file
containing the list of words you want to ignore. If you do not
provide a list, all words will be indexed.
countStopWords: This can be true or false. If true, stopwords
will be included when counting terms for document length. The
default is to not include stopwords in the document length count.
acronyms: The name of the file
containing the list of acronyms that you want the index to recognize.
If you do not provide a list, all terms will be converted to lower
case before being added to the index. This means that acronyms
which have the same spelling as common words will be counted as if they
were the same term.
stemmer:
This specifies what kind of stemming to do on the words before
adding them to the index. If this parameter is not specified, no
stemming will be done. The options are porter for the Porter stemmer; krovetz for the Krovetz stemmer,
which requires the additional parameter KstemmerDir (path to the directory of
data files used by Krovetz's stemmer); and arabic for the Arabic stemmer, which
requires the additional parameters arabicStemDir
(path to the directory of data files used by the Arabic stemmers) and arabicStemFunc (which stemming
algorithm to apply). The choices for arabicStemFunc are arabic_stop (remove stop words),
arabic_norm2 (table normalization), arabic_norm2_stop (table
normalization with stopping), arabic_light10 (light9 plus ll prefix), and
arabic_light10_stop (light10 plus stop word removal).
dataFiles: This is the
name of a file that lists all the actual data files that you want to
index, with the name of each data file on a separate line. If you don't use this parameter, you can just list all the
data files on the command line after the parameter file. So, if you want to index datafile1 and datafile2, you can either use the command
BuildInvertedIndex paramfile datafile1 datafile2
or leave the data files off the command line and use the dataFiles parameter instead. In that case your parameter file would contain the line
dataFiles = fileslist;
and fileslist would be a file that contains the following:
datafile1
datafile2
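Putting it all together, a complete parameter file for BuildInvertedIndex might look like the following sketch (the paths, file names, and values are only illustrative):

index = /lemur/data/myindex;
docFormat = trec;
memory = 128000000;
position = 1;
stopwords = /lemur/data/stopwordlist;
stemmer = porter;
dataFiles = fileslist;

If this file were saved as buildparams, and fileslist listed your data files as shown above, you would build the index by running: BuildInvertedIndex buildparams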
How can I tell if the index built anything?
Each index has a "table of contents" file which has some summary
statistics on what's in the index as well as which files are needed to
load the index. This table of contents file is an ASCII file that
you can view in a normal text viewer. When you want to use an
index, you will need its table of contents file to load it. Each
index in Lemur has its own unique extension for its table of contents
file (see the next question on index types).
What type of indexes does Lemur have?
Lemur currently has the following indexes: InvIndex, InvFPIndex,
and BasicIndex. (BTIndex is in development.) The indexes differ in what
data they store and how they represent that data on disk.
As described in the previous question, each index has a "table of
contents" file that you need in order to load it, and each index type in
Lemur has its own unique extension for its table of contents file:
Index Name | Extension | File Limit | Stores positions | Loads fast | Disk space usage | Application
InvIndex   | .inv      | no         | no               | no         | less             | BuildInvertedIndex
InvFPIndex | .ifp      | no         | yes              | no         | more             | BuildInvertedIndex
BTIndex*   | .bti      | no         | no               | yes        | more             |
BTPIndex*  | .btp      | no         | yes              | yes        | even more        |
BasicIndex | .bsc      | yes        | no               | no         |                  | BuildBasicIndex
*NOTE: BT(P)Index have not yet been released.
How do I add documents to an index I already have?
You can add documents to an existing index only if you have an
InvFPIndex (table of contents file ends with .ifp). Use the IncIndexer application to add new
documents.
How can I run queries?
To run a set of queries, use the RetEval application. This application
allows you to make use of several retrieval algorithms already
implemented in Lemur, including simple TFIDF, Okapi, and a KL-divergence
language-model-based method. There are many parameters involved with
running RetEval. They are explained here: http://www-2.cs.cmu.edu/~lemur/2.0/app.html#RetEval
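As a rough sketch, a RetEval parameter file might contain something like the following (the parameter names and values here are assumptions for illustration; consult the RetEval documentation linked above for the authoritative list):

index = /lemur/data/myindex;
retModel = 0;
textQuery = queryfile;
resultFile = results;
resultCount = 1000;

where index is the index built earlier, retModel selects the retrieval algorithm (TFIDF, Okapi, or KL-divergence), textQuery names the file of queries described next, resultFile is where the ranked results are written, and resultCount limits how many documents are returned per query.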
RetEval takes a file of queries in the following format:
<DOC #queryid>
term1
term2
...
termN
</DOC>
RetEval will create a result file containing the queryid, a docid,
and the score for that document with respect to that query. It should
be sorted by score within each query. As Lemur is primarily a research
tool, this is the main retrieval application included with Lemur. But
it should be fairly easy to write your own application or modify
RetEval to output results in a format more suitable for your needs.
How do I evaluate the results?
There is a perl script included with Lemur called ireval.pl that calculates performance measurements in terms of precision and recall scores
given the result file from running RetEval and a judgement file. A
judgement file tells you, for each query, which documents are relevant.
The judgement file can either be in the 3-column format:
queryid docid score (1 for relevant)
or in the same format as the TREC conference relevance files. See http://www-2.cs.cmu.edu/~lemur/2.0/app.html#ireval
for more information.
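For instance, a small judgement file in the 3-column format might look like this (the query and document ids are purely illustrative):

101 WSJ870213-0053 1
101 AP880412-0009 0
102 FT911-3032 1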
How can I view the original document as part of the results?
In order to see the original document, your index needs to have a
document manager (see Documents
section).
What is a document manager?
Since all parsing and tokenizing of a document is done outside
of the index, you need a document manager to get back the original
source of a document. This makes it possible for documents in the same
index to be from multiple sources, such as text files, databases, web
sources, etc., as long as there is a document manager to handle that
type of document. You can request a document's manager from the index.
However, these managers have to be registered with the index
during indexing time. In most normal usage, all documents in an index
will probably be from the same source.
How do I build a document manager?
Lemur has one document manager implemented, called
FlattextDocMgr, which manages documents that are in flat text files.
FlattextDocMgr can handle multiple files and can remember which Parser
was used. You can build an index with a flat text document manager at
the same time using the BuildDocMgr application. Usage is the
same as BuildInvertedIndex with the exception of an additional parameter
for what to call your document manager.
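For example, if BuildDocMgr accepts a parameter naming the manager (the parameter name manager below is an assumption; check the BuildDocMgr documentation for the actual name), the parameter file could be the BuildInvertedIndex example above plus one extra line:

manager = /lemur/data/mymanager;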
It is possible to attach an already existing document manager to an
index that you are building. So if you have multiple indexes using the
same data source, you don't have to rebuild the document manager. Lemur
does not have an application that includes this feature, but you could
easily modify BuildInvertedIndex by adding a line after the index
object is created, calling the method
PushIndex::setDocManager(DocMgrID). DocMgrID should be the same
string you'd get back from calling DocumentManager::getMyID.
It is also possible to build a document manager without building an index, but there is no application provided to do that. What you'd need to do is write an application that creates a FlattextDocMgr object and calls FlattextDocMgr::buildMgr.
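A minimal sketch of such an application is shown below. The constructor arguments (a name for the manager, a parser type, and a file listing the source data files) and their order are assumptions about the FlattextDocMgr interface, so check FlattextDocMgr.hpp for the actual signature; only buildMgr itself is named above.

#include "FlattextDocMgr.hpp"

int main(int argc, char *argv[]) {
  // argv[1]: name for the new manager, argv[2]: parser type (e.g. "trec"),
  // argv[3]: file listing the source data files (assumed argument order)
  FlattextDocMgr mgr(argv[1], argv[2], argv[3]);
  mgr.buildMgr();   // parse the sources and build the manager's lookup tables
  return 0;
}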
How can I add a document manager to an existing index?
You can't. You can add existing document managers to a new index that
you're building, but not the other way around. TIP: If you are unsure
whether you'll need a document manager, one thing you can do is add a
non-existent document manager to your index. This is okay as long as you
don't try to use the non-existent document manager. This way, the index
will have references to a document manager. As long as all the
documents in your index use the same document manager, later you can
build a document manager using the same name as the non-existent one you
had pointed your index to use.