Introduction
What is Lemur?
How can it be useful?
Indexing
What is an index?
What kind of data/documents can Lemur index?
Do the parsers add all words into the index?
How do I build an index?
How can I tell if the index built anything?
What type of indexes does Lemur have?
How do I add documents to an index I already have?
Retrieval
How can I run queries?
How do I evaluate the results?
How can I view the original document as part of the results?
Documents
What is a document manager?
How do I build a document manager?
How can I add a document manager to an existing index?
What is Lemur?
Lemur is a toolkit designed to facilitate
research in language modeling and information retrieval (IR), where IR
is broadly interpreted to include such technologies as ad hoc and
distributed retrieval, with structured queries, cross-language IR,
summarization, filtering, and categorization. The system's
underlying architecture was built to support the technologies
above. We provide many useful sample applications, but have
designed the toolkit to allow you to easily program your own
customizations and applications.
How can it be useful?
Lemur is particularly useful for researchers in
language modeling and information retrieval who do not want to write
their own indexers but would rather focus on developing new techniques
and algorithms. In addition to indexing, we provide some baseline
retrieval algorithms, such as TFIDF and Okapi, for use and
comparison. We have implemented basic ad hoc IR, distributed IR,
IR using structured queries, IR using distributed indexes, and
summarization. Others have used Lemur for filtering tasks, webpage
finding, passage finding, and simple web search engines.
What is an index?
An index, or database, is basically a collection
of information that can be quickly accessed, using some piece of
information as a point of reference or key (what it's indexed by).
In our case, we index information about the terms in a collection of
documents, which you can access later using either a term or a document
as the reference. Specifically, we collect term frequency, term
position, and document length statistics because those are most commonly
needed for information retrieval. For example, from the index, you
can find out how many times a certain term occurred in the collection of
documents, or how many times it occurred in just one specific
document. Retrieval algorithms that decide which documents to
return for a given query use the collected information in the index in
their scoring calculations. (See Index.hpp for full index access API.)
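For example, a minimal program using that API might look like the sketch below. It assumes the usual pattern of opening an index through IndexManager and then calling the accessor methods declared in Index.hpp; treat the exact method names and signatures as assumptions to verify against your version of the toolkit.

#include "IndexManager.hpp"
#include "Index.hpp"
#include <iostream>

int main(int argc, char *argv[]) {
  // argv[1]: the index's table of contents file, argv[2]: a term to look up
  Index *ind = IndexManager::openIndex(argv[1]);
  int termID = ind->term(argv[2]);   // map the term to its internal id
  std::cout << "occurrences in collection: " << ind->termCount(termID) << std::endl;
  std::cout << "documents containing term: " << ind->docCount(termID) << std::endl;
  std::cout << "total documents in index:  " << ind->docCount() << std::endl;
  delete ind;
  return 0;
}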
What kind of data/documents can Lemur index?
Actually, you can create your own parsers for whatever
text documents you have, as long as your parser takes whatever it wants
to recognize as a term and "pushes" it into the index. (See
PushIndex.hpp for API.) However, we do provide several
parsers with the toolkit.
Lemur is primarily a research system so the included
parsers were designed to facilitate indexing many documents that are in
the same file. In order for the index to know where the document
boundaries are within files, each document must have begin document and
end document tags. These tags are similar to HTML or XML tags and
are actually the format for NIST's Text REtrieval Conference (TREC)
documents.
The 2 most frequently used parsers are the TrecParser
and WebParser.
TrecParser: This parser recognizes text in the
TEXT, HL, HEAD, HEADLINE, TTL, and LP fields. For example:
<DOC>
<DOCNO> document_number </DOCNO>
<TEXT>
Index this document text.
</TEXT>
</DOC>
WebParser: This parser removes HTML tags, text within SCRIPT tags, as well as text in HTML comments. Document boundaries are specified with NIST style format:
<DOC>
<DOCNO> document_number </DOCNO>
Document text here could be in HTML.
</DOC>
In addition to these parsers, Lemur also provides parsers for Chinese (GB2312 encoding) and Arabic (CP1256 encoding). (See "Parsing in Lemur" for more information.)
If your documents are not already in one of these formats, you can either convert them into a format recognized by one of the provided parsers, or write your own parser that pushes terms into the index as described above.
How do I build an index?
The easiest way to build an index is to get your documents into a
format recognized by an existing Lemur parser, then use BuildInvertedIndex.
BuildInvertedIndex will build an index with inverted lists of document
frequencies for terms (DocInfoLists). It also builds a database of
term frequencies for documents (TermInfoLists). This will allow
you to run queries and get back a list of document ids as results, but
not the original document text. For that capability, you will need to
build a document manager (see What is a document
manager?).
Like all the applications provided with Lemur, BuildInvertedIndex
takes at least one argument, the first being the name of a file containing a
list of parameters, or settings, for the application.
Usage is:
BuildInvertedIndex paramfile [datafile1] [datafile2] [...]
The parameter file given to BuildInvertedIndex must contain
parameters in the following format:
paramname = value;
param2 = value;
Each parameter must be on a separate line, and each line must end
with a semicolon. There are many parameters you can use in this
file; two of them, index and docFormat, are required. (A complete example
parameter file is shown after the parameter descriptions below.)
index: This is what you
want to name the index that will be created. For example, index =
/lemur/data/myindex; This will create an index called myindex in the /lemur/data
directory. You must put the full path here if you want to load
this index later from other directories. If you do not put a full
path, the index will be created in the current directory, and you must
be in that directory to use it again.
docFormat: This says what
format your document files are in. For example, if they are in the
NIST TREC format described above, you would put docFormat = trec;
The available options are trec for
general NIST TREC documents, web for
TREC web HTML documents, chinese
for segmented Chinese text in TREC format using GB encoding, chinesechar for unsegmented Chinese
text in TREC format using GB encoding, and arabic for Arabic text in TREC
format using Windows CP1256 encoding.
Here is the list of optional parameters:
memory: The amount of memory,
in bytes, that the index will use for its caching purposes.
Unfortunately, this is not a guarantee for how much total working
memory the indexing process will take. You should set this value
depending on how much RAM your machine has. The more memory that
you can use, the faster the indexing process will be. The default
value is 96000000 if you do not set this parameter. If your
machine can handle it, 256000000 is generally a good value.
position: This specifies
whether or not you want the index to store information about the
positions of terms within documents (where they occur in each
document). Position information is necessary for certain
retrieval applications, such as using structured queries for phrase
detection, or looking for bigrams. Use 0 for no positions and 1 for
storing positions. The default value is 1. Storing positions
makes the index bigger than not storing them, which requires more disk
space. It also makes the indexing and retrieval process a
little bit slower.
stopwords: The name of the file
containing the list of words you want to ignore. If you do not
provide a list, all words will be indexed.
countStopWords: This can be true or false. If true, stopwords
will be included when counting terms for document length. The
default is to not include stopwords in the document length count.
acronyms: The name of the file
containing the list of acronyms that you want the index to recognize.
If you do not provide a list, all terms will be converted to lower
case before being added to the index. This means that acronyms
which have the same spelling as common words will be counted as if they
were the same term.
stemmer:
This specifies what kind of stemming to do on the words before
adding them to the index. If this parameter is not specified, no
stemming will be done. The options are porter for the Porter stemmer; krovetz for the Krovetz stemmer,
which requires the additional parameter KstemmerDir (path to the directory of
data files used by Krovetz's stemmer); and arabic for the Arabic stemmer, which
requires the additional parameters arabicStemDir
(path to the directory of data files used by the Arabic stemmers) and arabicStemFunc (which stemming
algorithm to apply). The choices for arabicStemFunc are arabic_stop (remove stop words),
arabic_norm2 (table normalization), arabic_norm2_stop (table
normalization with stopping), arabic_light10 (light9 plus ll prefix), and
arabic_light10_stop (light10 plus stop word removal).
dataFiles: This is the
name of a file that lists all the actual data files that you want to
index, with the name of each data file on a separate line. If you don't use this parameter, you can just list all the
data files on the command line after the parameter file. So, if you want to index datafile1 and datafile2, you can either use the command
BuildInvertedIndex paramfile datafile1 datafile2
or leave the data files off the command line and use the dataFiles parameter instead. In that case your parameter file would contain the line
dataFiles = fileslist;
and fileslist would be a file that contains the following:
datafile1
datafile2
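Putting it all together, a complete parameter file for BuildInvertedIndex might look like the following sketch (the paths, file names, and values are only illustrative):

index = /lemur/data/myindex;
docFormat = trec;
memory = 128000000;
position = 1;
stopwords = /lemur/data/stopwordlist;
stemmer = porter;
dataFiles = fileslist;

If this file were saved as buildparams, and fileslist listed your data files as shown above, you would build the index by running: BuildInvertedIndex buildparams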
How can I tell if the index built anything?
Each index has a "table of contents" file which has some summary
statistics on what's in the index as well as which files are needed to
load the index. This table of contents file is an ASCII file that
you can view in a normal text viewer. When you want to use an
index, you will need its table of contents file to load it. Each
index in Lemur has its own unique extension for its table of contents
file (see the next question on index types).
What type of indexes does Lemur have?
Lemur currently has the following indexes: InvIndex, InvFPIndex,
and BasicIndex. (BTIndex is in development.) The indexes differ in what
data they store and how they represent that data on disk.
As described in the previous question, each index has a "table of
contents" file that you need in order to load it, and each index type in
Lemur has its own unique extension for its table of contents file:
Index Name | Extension | File Limit | Stores positions | Loads fast | Disk space usage | Application
InvIndex   | .inv      | no         | no               | no         | less             | BuildInvertedIndex
InvFPIndex | .ifp      | no         | yes              | no         | more             | BuildInvertedIndex
BTIndex*   | .bti      | no         | no               | yes        | more             |
BTPIndex*  | .btp      | no         | yes              | yes        | even more        |
BasicIndex | .bsc      | yes        | no               | no         |                  | BuildBasicIndex
*NOTE: BT(P)Index have not yet been released.
How do I add documents to an index I already have?
You can add documents to an existing index only if you have an
InvFPIndex (table of contents file ends with .ifp). Use the IncIndexer application to add new
documents.
How can I run queries?
To run a set of queries, use the RetEval application. This application
allows you to make use of several retrieval algorithms already
implemented in Lemur, including simple TFIDF, Okapi, and a KL-divergence
language-model-based method. There are many parameters involved with
running RetEval. They are explained here: http://www-2.cs.cmu.edu/~lemur/2.0/app.html#RetEval
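As a rough sketch, a RetEval parameter file might contain something like the following (the parameter names and values here are assumptions for illustration; consult the RetEval documentation linked above for the authoritative list):

index = /lemur/data/myindex;
retModel = 0;
textQuery = queryfile;
resultFile = results;
resultCount = 1000;

where index is the index built earlier, retModel selects the retrieval algorithm (TFIDF, Okapi, or KL-divergence), textQuery names the file of queries described next, resultFile is where the ranked results are written, and resultCount limits how many documents are returned per query.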
RetEval takes a file of queries in the following format:
<DOC #queryid>
term1
term2
...
termN
</DOC>
RetEval will create a result file containing the queryid, a docid,
and the score for that document with respect to that query. It should
be sorted by score within each query. As Lemur is primarily a research
tool, this is the main retrieval application included with Lemur. But
it should be fairly easy to write your own application or modify
RetEval to output results in a format more suitable for your needs.
How do I evaluate the results?
There is a perl script included with Lemur called ireval.pl that calculates performance measurements in terms of precision and recall scores
given the result file from running RetEval and a judgement file. A
judgement file tells you, for each query, which documents are relevant.
The judgement file can either be in the 3-column format:
queryid docid score (1 for relevant)
or in the same format as the TREC conference relevance files. See http://www-2.cs.cmu.edu/~lemur/2.0/app.html#ireval
for more information.
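For instance, a small judgement file in the 3-column format might look like this (the query and document ids are purely illustrative):

101 WSJ870213-0053 1
101 AP880412-0009 0
102 FT911-3032 1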
How can I view the original document as part of the results?
In order to see the original document, your index needs to have a
document manager (see Documents
section).
What is a document manager?
Since all parsing and tokenizing of a document is done outside
of the index, you need a document manager to get back the original
source of a document. This makes it possible for documents in the same
index to be from multiple sources, such as text files, databases, web
sources, etc., as long as there is a document manager to handle that
type of document. You can request a document's manager from the index.
However, these managers have to be registered with the index
during indexing time. In most normal usage, all documents in an index
will probably be from the same source.
How do I build a document manager?
Lemur has one document manager implemented, called
FlattextDocMgr, which manages documents that are in flat text files.
FlattextDocMgr can handle multiple files and can remember which Parser
was used. You can build an index with a flat text document manager at
the same time using the BuildDocMgr application. Usage is the
same as BuildInvertedIndex with the exception of an additional parameter
for what to call your document manager.
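For example, if BuildDocMgr accepts a parameter naming the manager (the parameter name manager below is an assumption; check the BuildDocMgr documentation for the actual name), the parameter file could be the BuildInvertedIndex example above plus one extra line:

manager = /lemur/data/mymanager;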
It is possible to attach an already existing document manager to an
index that you are building. So if you have multiple indexes using the
same data source, you don't have to rebuild the document manager. Lemur
does not have an application that includes this feature, but you could
easily modify BuildInvertedIndex by adding a line after the index
object is created, calling the method
PushIndex::setDocManager(DocMgrID). DocMgrID should be the same
string you'd get back from calling DocumentManager::getMyID.
It is also possible to build a document manager without building an index, but there is no application provided to do that. What you'd need to do is write an application that creates a FlattextDocMgr object and calls FlattextDocMgr::buildMgr.
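A minimal sketch of such an application is shown below. The constructor arguments (a name for the manager, a parser type, and a file listing the source data files) and their order are assumptions about the FlattextDocMgr interface, so check FlattextDocMgr.hpp for the actual signature; only buildMgr itself is named above.

#include "FlattextDocMgr.hpp"

int main(int argc, char *argv[]) {
  // argv[1]: name for the new manager, argv[2]: parser type (e.g. "trec"),
  // argv[3]: file listing the source data files (assumed argument order)
  FlattextDocMgr mgr(argv[1], argv[2], argv[3]);
  mgr.buildMgr();   // parse the sources and build the manager's lookup tables
  return 0;
}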
How can I add a document manager to an existing index?
You can't. You can add existing document managers to a new index that
you're building, but not the other way around. TIP: If you are unsure
whether you'll need a document manager, one thing you can do is add a
non-existent document manager to your index. This is okay as long as you
don't try to use the non-existent document manager. This way, the index
will have references to a document manager. As long as all the
documents in your index use the same document manager, later you can
build a document manager using the same name as the non-existent one you
had pointed your index to use.