Distributed Information Retrieval in Lemur

Contents

Overview
Applications
CollSelIndex
DistRetEval
QryBasedSample

Distributed Search and Merge API
Sampling API
Adding New Database Systems

1. Overview
Distributed retrieval in Lemur is built around the RetrievalMethod API. The DistSearchMethod class searches multiple indexes using the same query and stores the results. These results are then passed to a DistMergeMethod, which merges the scores based on the index ranking score and each individual document score. DistMergeMethod is an abstract API that supports the implementation of different merging techniques.
Some distributed retrieval methods require the use of a central database that one might not have access to. For this purpose, Lemur includes support for query-based sampling. The query-based sampling application and utility classes provide an extensible tool for creating descriptions of text databases. The QryBasedSample application allows for sampling from text databases. The QryBasedSampler utility class gives an API for building other applications that require a query-based sampling component.
2. Applications

CollSelIndex
CollSelIndex builds a collection selection database using either document frequency or collection term frequency for the database's term frequency counts.
Usage: CollSelIndex paramfile [datfile1]* [datfile2] ...
Summary of parameters in paramfile:

dfIndex Name of the index to build using document frequency counts (without the .ifp extension).

ctfIndex Name of the index to build using collection term frequency (without the index extension)

dfCounts Name of the file to write out counts (needed for ranking)

dfDocs

countStopWords

memory Memory (in bytes) of PushIndex (def = 96000000).

stopwords Name of file containing stopword list. Words in this file should be one per line. If this parameter is not specified, all words are indexed.

acronyms Name of file containing acronym list (one word per line). Uppercase words recognized as acronyms (e.g. USA U.S.A. USAs USA's U.S.A.) are left uppercase if in the acronym list. If no acronym list is specified, acronyms will not be recognized.

docFormat Specify "trec" for standard TREC formatted documents or "web" for web TREC formatted documents. The default is "trec".

stemmer Specify "porter" to use Porter's stemmer. If no stemmer is specified, no stemmer will be used.
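
As an illustration, a CollSelIndex parameter file might look like the fragment below. It assumes the parameter = value; form used by other Lemur applications. All file and index names are hypothetical placeholders, and only parameters described above are used.

```
dfIndex = collsel_df;            /* hypothetical index name, no extension */
dfCounts = collsel_df.counts;    /* hypothetical counts file */
memory = 96000000;               /* the documented default */
stopwords = stoplist.txt;        /* hypothetical stopword file */
docFormat = trec;
stemmer = porter;
```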

DistRetEval
This is a sample application that does distributed retrieval, using a resource selection index and individual indexes. Resource selection is done using CORI_CS (the only resource selection method implemented thus far). Results merging uses either CORIMergeMethod, SingleRegrMergeMethod, or MultiRegrMergeMethod. (If using CORIMergeMethod, use INQUERY as the retrieval method for each individual database.)

index the collection selection database

collCounts collection counts file for the collection selection index (needed by CORI)

ranksFile Name of the file to write ranking results (optional)

resultFile file to write final results

resultCount maximum number of results to output for each query (default to 1000)

textQuery file of text queries in docstream format

cutoff maximum number of databases to search (default to 10)

"dbids" = "db's param file" Required for each database in the collection selection index. Key should be the database character id string and value should be the name of the file that has parameters for that database:

index = the individual database
retModel = the retrieval model to use
"modelvals" - whatever parameters are required for that retModel

CSTF_factor TFfactor parameter in the CORI_CS resource selection method

CSTF_baseline TFbaseline parameter in the CORI_CS resource selection method

mergeMethod resource merging method (0 for CORI results merging method, 1 for single regression results merging method)

Merging Method-specific parameters:
For CORI merging method: None

For Single regression merging method:

csDbDataBaseIndex The centralized sampling database index
DOCTF_factor The TFfactor parameter in the CORI_DOC retrieval method for the centralized sampling database.
DOCTF_baseline The TFbaseline parameter in the CORI_DOC retrieval method for the centralized sampling database.
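
A DistRetEval parameter file using CORI merging could be sketched as follows. It assumes the usual parameter = value; form. Every path and database id here (collsel_df, db1, db2 and so on) is a hypothetical placeholder, and whether an index name needs an extension is not stated above, so treat those values as illustrative only.

```
index = collsel_df;              /* hypothetical collection selection database */
collCounts = collsel_df.counts;  /* hypothetical counts file */
resultFile = dist.results;       /* hypothetical output file */
resultCount = 1000;              /* the documented default */
textQuery = queries.txt;         /* hypothetical query file */
cutoff = 10;                     /* the documented default */
mergeMethod = 0;                 /* 0 selects the CORI results merging method */
db1 = db1.param;                 /* one entry per database id (hypothetical ids) */
db2 = db2.param;
```

Each referenced per-database file (db1.param here) would in turn hold the index, retModel, and model-specific parameters described above.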

QryBasedSample
The application QryBasedSample performs query-based sampling on text databases. The output of the application is documents and database profiles. QryBasedSample takes a single command line argument, which is a parameter file. As with other Lemur applications, lines in the parameter file have the form:
 parameter = value; /* comment */ 
Summary of parameters:
dbManager Use to indicate which database manager to use. Specify lemur to sample Lemur databases and mind to sample MIND databases.

numDocs Terminate probe when the specified number of unique docs from the database have been seen.

numWords Terminate probe when the specified number of unique words from the database have been seen.

numQueries Terminate probe when the specified number of unique queries have been run.

docsPerQuery Use the specified number of documents per query to build the database description.

queryMode Selects the mode for query selection:

unif Words are chosen with equal probability from the documents seen so far.
avetf Words are chosen with probability proportional to their average term frequency.
ctf Words are chosen with probability proportional to their collection term frequency.
df Words are chosen with probability proportional to their document frequency.
listFile Use to specify the file containing list of databases to probe and their output prefixes. The file format is:
 db      prefix      dbname
where the items are separated by tabs and there is one tuple per line. For Lemur databases, the db field contains a parameter file specifying the retrieval parameters as in RetEval. The output prefix is used by the query-based sampler to create the filenames for outputting documents and database profiles. Documents are written to "prefixdocs" and profiles are written to "prefixmodel". For a MIND database, the db field contains a list of semicolon-separated items. The items are the xml urn name for the proxy, the url for the proxy, the xml urn name for the proxy's interface, the xml urn name for the proxy's construction text component, and the number of documents in the database. Example:
urn:proxy.Google;http://mind.proxy.url;urn:proxy-interface.Google;urn:proxy-construction-text.Google;2073418204
When sampling MIND databases, the documents are not stored locally, but a model built from the document sample field is stored.
initModel A language model to use for initial query selection. Words are selected using the specified query mode. The initial model has the same format as models generated by the sampler:
 word      ctf      df
where ctf is the collection term frequency of the word and df is the document frequency.
MindRegistry If sampling MIND databases, this parameter is required. It should contain the url of the MIND Registry.
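For example, a parameter file that samples Lemur databases until 300 unique documents have been seen might look like this. The numeric values and file names are hypothetical choices for illustration, not recommended settings.

```
dbManager = lemur;
numDocs = 300;            /* hypothetical stopping point */
docsPerQuery = 4;         /* hypothetical value */
queryMode = unif;
listFile = dblist.txt;    /* hypothetical list of databases and prefixes */
initModel = init.model;   /* hypothetical initial language model */
```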
3. Distributed Search and Merge API
RetrievalMethod
Collection selection, or database ranking, uses the TextRetrievalMethod API, as implemented by CORIRetMethod. It treats a collection selection index like a regular index in which each "document" is actually a database.
DistSearchMethod
The main method in this class is scoreIndexSet(Query &qry, IndexedRealVector &indexset, DocScoreVector** results). The indexes in indexset should correspond to the indexes in the collection selection database passed into the constructor or set by setIndex. This method loads each individual database's parameter file and scores its documents against the given query. It uses whichever RetrievalMethod is specified in the parameter file, or falls back to a default. Set the default with setDefaultRetMethod(RetMethodManager::RetModel rt). Although the method does not actually use the ranking scores from indexset, it accepts the same data structure that the ranking method returns, for convenience. There is another scoreIndexSet method that accepts a vector of database id strings.
A DocScoreVector is allocated for each index in indexset and stored in results. The caller should free this memory. Unlike a RetrievalMethod which returns the scores according to the index's internal document ids, DistSearchMethod converts the internal document ids to external document character ids. This is so there are no id conflicts when the scores are later merged into one list.
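The id conversion described above can be illustrated with a small, self-contained sketch. SketchIndex, InternalScores, ExternalScores, and toExternal are hypothetical stand-ins invented for this example; they are not Lemur classes. The point is only that two indexes may both use internal id 0, while their external ids stay distinct.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one searched index (not a Lemur class).
// The position in externalIds plays the role of the internal document id.
struct SketchIndex {
  std::vector<std::string> externalIds;
};

using InternalScores = std::vector<std::pair<int, double>>;
using ExternalScores = std::vector<std::pair<std::string, double>>;

// Replace each internal id with the index's external id string so that
// score lists from different indexes can later be merged without clashes.
inline ExternalScores toExternal(const SketchIndex& index,
                                 const InternalScores& scores) {
  ExternalScores out;
  out.reserve(scores.size());
  for (const auto& s : scores) {
    assert(s.first >= 0);
    const std::size_t id = static_cast<std::size_t>(s.first);
    assert(id < index.externalIds.size());
    out.emplace_back(index.externalIds[id], s.second);
  }
  return out;
}
```

Calling toExternal once per searched index yields lists that can be concatenated safely, because the external id strings identify documents across all databases.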
DistMergeMethod
This is an abstract interface for the merging of scores from individual databases. These databases should have ranking scores. Applications using this should call mergeScoreSet(IndexedRealVector &indexset, DocScoreVector** scoreset, DocScoreVector &results), where indexset is the same one used for DistSearchMethod and scoreset is the results from DistSearchMethod. Implementing classes should override the score(double dbscore, double docscore) method. For each document in each index, mergeScoreSet creates a merged score using the score method and stores it in the results vector. The returned results are not sorted, but can be sorted using the Sort method in DocScoreVector. CORIMergeMethod is one implementation of this interface. SingleRegrMergeMethod is another.
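The override pattern can be sketched as follows. The types here (ScoredDoc, SketchMergeMethod, SketchCoriMerge, sortByScore) are hypothetical stand-ins, not Lemur's classes. The scoring formula is a CORI-style heuristic as commonly described in the literature, applied to scores assumed to be normalized to [0, 1]; it is not confirmed to match what CORIMergeMethod computes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for this sketch; these are NOT Lemur's classes.
struct ScoredDoc {
  std::string docId;  // external document id
  double score;
};
using ScoredList = std::vector<ScoredDoc>;

// Mirrors the shape described above: subclasses override score().
class SketchMergeMethod {
public:
  virtual ~SketchMergeMethod() = default;

  // dbScores[i] is the ranking score of database i; docLists[i] holds the
  // document scores returned by that database.
  ScoredList mergeScoreSet(const std::vector<double>& dbScores,
                           const std::vector<ScoredList>& docLists) const {
    assert(dbScores.size() == docLists.size());
    ScoredList merged;
    for (std::size_t i = 0; i < docLists.size(); ++i) {
      for (const ScoredDoc& d : docLists[i]) {
        merged.push_back({d.docId, score(dbScores[i], d.score)});
      }
    }
    return merged;  // left unsorted, as in the description above
  }

protected:
  virtual double score(double dbScore, double docScore) const = 0;
};

// Assumed heuristic: CORI-style combination of normalized scores.
class SketchCoriMerge : public SketchMergeMethod {
protected:
  double score(double dbScore, double docScore) const override {
    return (docScore + 0.4 * docScore * dbScore) / 1.4;
  }
};

// Sort a merged list by descending score.
inline void sortByScore(ScoredList& list) {
  std::sort(list.begin(), list.end(),
            [](const ScoredDoc& a, const ScoredDoc& b) {
              return a.score > b.score;
            });
}
```

A different merging technique would only need another subclass with its own score override; the loop over databases and documents stays in the base class.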
4. Sampling API
The class central to the sampling API is QryBasedSampler. It performs query-based sampling on a database, and outputs the profile and documents to disk. Other important classes include FreqCounter which builds the database profile and DBManager which gives an API for simple text database access. This section gives a brief description of these classes. For more detailed descriptions of functions, refer to the source code documentation.
QryBasedSampler
This class uses a DBManager and a FreqCounter to sample documents from a database and build a profile of the database's vocabulary. The probe function does this; its single argument is an initial query. If the initial query does not retrieve any documents, probe returns false. Before probe can be called, the application must create and set the sampler's database manager and frequency counter.
After the initial query, the sampler selects random query terms from the frequency counter. The means for selecting words is determined by the frequency counter's random mode. See the FreqCounter class for more information.
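The overall sampling loop can be sketched in a self-contained way. Everything below is a hypothetical stand-in: QueryFn plays the role of the database manager, NextTermFn plays the role of the frequency counter's random word selection, and ProbeLimits mirrors the numDocs, numWords, and numQueries parameters. The sketch counts every query that is run and identifies documents by their text, both simplifying assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical callbacks; neither is Lemur's API.
// QueryFn returns the texts of the documents retrieved for a query.
using QueryFn = std::function<std::vector<std::string>(const std::string&)>;
// NextTermFn picks the next query term from the vocabulary seen so far.
using NextTermFn = std::function<std::string(const std::set<std::string>&)>;

struct ProbeLimits {
  std::size_t numDocs;
  std::size_t numWords;
  std::size_t numQueries;
};

struct ProbeState {
  std::set<std::string> docs;   // unique documents seen (keyed by text)
  std::set<std::string> vocab;  // unique words seen
  std::size_t queriesRun = 0;
};

// Returns false if the initial query retrieves nothing; otherwise samples
// until any one of the limits is reached.
inline bool probe(const std::string& initialQuery, const QueryFn& runQuery,
                  const NextTermFn& nextTerm, const ProbeLimits& limits,
                  ProbeState& state) {
  std::string query = initialQuery;
  bool first = true;
  while (state.docs.size() < limits.numDocs &&
         state.vocab.size() < limits.numWords &&
         state.queriesRun < limits.numQueries) {
    const std::vector<std::string> hits = runQuery(query);
    ++state.queriesRun;
    if (first && hits.empty()) return false;
    first = false;
    for (const std::string& text : hits) {
      if (state.docs.insert(text).second) {  // only new documents update the profile
        std::istringstream words(text);
        std::string w;
        while (words >> w) state.vocab.insert(w);
      }
    }
    if (state.vocab.empty()) break;
    query = nextTerm(state.vocab);
  }
  return true;
}
```

In a real application the callbacks would be backed by a database manager and a frequency counter; here they are left abstract so the control flow is visible on its own.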
FreqCounter
This class builds a profile or simple language model from a TextHandler stream. In order to have the model updated properly when sampling, the application must build a TextHandler chain with the database manager's parser as the source and the frequency counter as the destination. See Parsing in the Lemur Toolkit for more details. The use of the TextHandler class here allows easy inclusion of a stemmer or indexing components. That is, a sampling application could easily build a collection selection database or normal retrieval database while sampling from a database.
A frequency counter can use an internal stopword list (Stopper class) specified in the constructor to filter out stopwords. A frequency counter can load its frequencies from a file using input and write frequencies to a file using output.
Frequency counters can also return random words. The randomWord function returns a word guaranteed unique since the last call to clear. The method used for selecting the random word is one of the following: R_CTF, R_DF, R_AVETF, or R_UNIF. R_CTF selects words with probability proportional to the terms' collection term frequency. R_DF selects words with probability proportional to the terms' document frequency. R_AVETF selects words with probability proportional to the terms' average term frequency (ctf/df). R_UNIF selects words with equal probability. The mode for selection is set using the setRandomMode function.
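The four selection modes amount to weighted random sampling with different weights. The sketch below shows one way to express that with the standard library. RandomMode, TermStats, termWeight, and this randomWord are hypothetical names for illustration, not Lemur's FreqCounter, and the bookkeeping that keeps returned words unique is omitted. It assumes at least one term has a positive weight.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <string>
#include <vector>

// Hypothetical stand-ins; the modes mirror those above but this is not
// Lemur's FreqCounter.
enum class RandomMode { Ctf, Df, AveTf, Unif };

struct TermStats {
  long ctf;  // collection term frequency
  long df;   // document frequency
};

// Weight a term receives under the given selection mode.
inline double termWeight(const TermStats& s, RandomMode mode) {
  switch (mode) {
    case RandomMode::Ctf:
      return static_cast<double>(s.ctf);
    case RandomMode::Df:
      return static_cast<double>(s.df);
    case RandomMode::AveTf:
      return s.df > 0 ? static_cast<double>(s.ctf) / s.df : 0.0;
    case RandomMode::Unif:
      return 1.0;
  }
  return 0.0;
}

// Draw one word with probability proportional to its weight.
// Assumes the profile is non-empty and some weight is positive.
template <typename Rng>
std::string randomWord(const std::map<std::string, TermStats>& profile,
                       RandomMode mode, Rng& rng) {
  assert(!profile.empty());
  std::vector<std::string> words;
  std::vector<double> weights;
  for (const auto& entry : profile) {
    words.push_back(entry.first);
    weights.push_back(termWeight(entry.second, mode));
  }
  std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());
  return words[pick(rng)];
}
```

Passing a seeded generator such as std::mt19937 makes a sampling run repeatable, which can help when comparing query modes.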
DBManager
The DBManager provides a simplified API for querying a database and retrieving documents. The goal in providing this class is to supply only the functionality needed for query-based sampling in a simple, contained class.
The query function takes a string (char *) and a number of documents to retrieve. The function must return a results_t structure, which has two fields: num, the number of results in the list, and docs, an array of docid_t (char *) containing the results. The caller is responsible for freeing the structure and its contents.
The getDoc function takes a docid_t and returns a doc_t structure. The doc_t structure consists of a docid_t called docid, a char * called doc, and an integer len which indicates the number of characters in the document. The caller should make no assumptions about the format of the data in the doc field. The caller is responsible for freeing the structure and the contents of the structure.
The getParser function returns a MemParser that is capable of parsing the contents of the doc field of a doc_t.
The output function writes a document to file, which is specified using setOutputFile.
There are currently two implementations of the DBManager interface: LemurDBManager and MindDbManager. The LemurDBManager provides an example of communicating with a local database using an API, while the MindDbManager uses XML to communicate with a remote database.
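To make the shape of such a wrapper concrete, here is a simplified sketch of a query and getDoc pair backed by an in-memory toy database. SketchDBManager, SketchResults, SketchDoc, and InMemoryDBManager are hypothetical names. Unlike the char * structures described above, the sketch uses std::string and std::vector so that no manual freeing is needed; the parser and output functions are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins (not Lemur's DBManager types).
struct SketchResults {
  std::vector<std::string> docs;  // external document ids
};

struct SketchDoc {
  std::string docid;
  std::string doc;  // raw text; callers should not assume a format
};

class SketchDBManager {
public:
  virtual ~SketchDBManager() = default;
  virtual SketchResults query(const std::string& q, std::size_t numDocs) = 0;
  virtual SketchDoc getDoc(const std::string& docid) = 0;
};

// Toy database: a document matches when it contains the query string.
class InMemoryDBManager : public SketchDBManager {
public:
  void add(const std::string& docid, const std::string& text) {
    docs_[docid] = text;
  }

  SketchResults query(const std::string& q, std::size_t numDocs) override {
    SketchResults r;
    for (const auto& entry : docs_) {
      if (r.docs.size() >= numDocs) break;  // honor the requested result count
      if (entry.second.find(q) != std::string::npos) {
        r.docs.push_back(entry.first);
      }
    }
    return r;
  }

  SketchDoc getDoc(const std::string& docid) override {
    const auto it = docs_.find(docid);
    assert(it != docs_.end());
    return {it->first, it->second};
  }

private:
  std::map<std::string, std::string> docs_;
};
```

A wrapper for a real system would replace the substring match with calls into that system's own search interface, keeping the two-function surface the sampler relies on.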
MemParser
The MemParser class extends the Parser class. It adds a parse function that takes a doc_t. It is not required that you override the existing functions of Parser. Most important is that it is a TextHandler. See Parsing in the Lemur Toolkit for more details on TextHandlers.
5. Adding New Database Systems
Adding a new database system requires that you:

Wrap the database in a class that inherits from DBManager.
Provide a parser which inherits from MemParser.
Integrate the database wrapper into the QryBasedSample application:

Use the new dbManager parameter to check which database manager the application should use.
Modify AppMain so that the QryBasedSampler object is passed the DBManager you created.
Modify the program so that when the program is terminating, it will free any memory you may have allocated.
Update the usage function in QryBasedSample to reflect the changes you've made.

The Lemur Project
Last modified: Fri Feb 13 18:28:07 EST 2004