Distributed Information Retrieval in Lemur


Contents

  1. Overview
  2. Applications
  3. Distributed Search and Merge API
  4. Sampling API
  5. Adding New Database Systems

1. Overview

Distributed retrieval in Lemur is built around the RetrievalMethod API. The DistSearchMethod class searches multiple indexes using the same query and stores the results. These results are then passed to a DistMergeMethod for score merging, based on the index ranking score and each individual document score. DistMergeMethod is an abstract API that supports the implementation of different merging techniques.

Some distributed retrieval methods require the use of a central database that one might not have access to. For this purpose, Lemur includes support for query-based sampling. The query-based sampling application and utility classes provide an extensible tool for creating descriptions of text databases. The QryBasedSample application allows for sampling from text databases. The QryBasedSampler utility class gives an API for building other applications that require a query-based sampling component.

2. Applications

CollSelIndex

CollSelIndex builds a collection selection database using either document frequency or collection term frequency for the database's term frequency counts.

Usage: CollSelIndex paramfile [datfile1]* [datfile2] ...

Summary of parameters in paramfile:

  1. dfIndex Name of the index to build using document frequency counts (without the .ifp extension).


  2. ctfIndex Name of the index to build using collection term frequency (without the index extension)


  3. dfCounts Name of the file to write out counts (needed for ranking)


  4. dfDocs


  5. countStopWords


  6. memory Memory (in bytes) for PushIndex (default 96000000).


  7. stopwords Name of file containing stopword list. Words in this file should be one per line. If this parameter is not specified, all words are indexed.


  8. acronyms Name of file containing acronym list (one word per line). Uppercase words recognized as acronyms (e.g. USA U.S.A. USAs USA's U.S.A.) are left uppercase if in the acronym list. If no acronym list is specified, acronyms will not be recognized.


  9. docFormat Specify "trec" for standard TREC formatted documents or "web" for web TREC formatted documents. The default is "trec".


  10. stemmer Specify "porter" to use Porter's stemmer. If no stemmer is specified, no stemmer will be used.

DistRetEval

This is a sample application that performs distributed retrieval using a resource selection index and individual indexes. Resource selection is done using CORI_CS (the only resource selection method implemented thus far). Results merging uses either CORIMergeMethod, SingleRegrMergeMethod, or MultiRegrMergeMethod. (If using CORIMergeMethod, use INQUERY as the retrieval method for each individual database.)

  1. index the collection selection database


  2. collCounts collection counts file for the collection selection index (needed by CORI)


  3. ranksFile Name of the file to write ranking results (optional)


  4. resultFile file to write final results


  5. resultCount maximum number of results to output for each query (default to 1000)


  6. textQuery file of text queries in docstream format


  7. cutoff maximum number of databases to search (default to 10)


  8. "dbids" = "db's param file" Required for each database in the collection selection index. Key should be the database character id string and value should be the name of the file that has parameters for that database:
    index = the individual database
    retModel = the retrieval model to use
    "modelvals" - whatever parameters are required for that retModel
  9. CSTF_factor TFfactor parameter in the CORI_CS resource selection method


  10. CSTF_baseline TFbaseline parameter in the CORI_CS resource selection method


  11. mergeMethod resource merging method (0 for CORI results merging method, 1 for single regression results merging method)


  12. Merging Method-specific parameters:
    For CORI merging method: None

    For Single regression merging method:
    1. csDbDataBaseIndex The centralized sampling database index
    2. DOCTF_factor The TFfactor parameter in the CORI_DOC retrieval method for the centralized sampling database.
    3. DOCTF_baseline The TFbaseline parameter in the CORI_DOC retrieval method for the centralized sampling database.

QryBasedSample

The QryBasedSample application performs query-based sampling on text databases. Its output is the sampled documents and the database profiles. QryBasedSample takes a single command line argument, which is a parameter file. As with other Lemur applications, lines in the parameter file have the form:
 parameter = value; /* comment */ 
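
As a rough illustration of the line format above, the following sketch reads such lines into a key/value map. It is not Lemur's own parameter reader; parseParams and trim are hypothetical helpers, and the sketch assumes one parameter per line with at most one trailing comment.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Trim leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    const std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Parse lines of the form `parameter = value; /* comment */`.
// Hypothetical helper for illustration; not part of Lemur.
std::map<std::string, std::string> parseParams(const std::string& text) {
    std::map<std::string, std::string> params;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        // Drop a trailing /* ... */ comment, if any.
        const std::size_t c = line.find("/*");
        if (c != std::string::npos) line.erase(c);
        const std::size_t eq = line.find('=');
        if (eq == std::string::npos) continue;  // not a parameter line
        const std::string key = trim(line.substr(0, eq));
        std::string value = trim(line.substr(eq + 1));
        // Strip the terminating semicolon.
        if (!value.empty() && value.back() == ';') value.pop_back();
        value = trim(value);
        if (!key.empty()) params[key] = value;
    }
    return params;
}
```

For example, parsing the two lines "index = db1; /* the database */" and "cutoff = 10;" yields a map with the entries index -> db1 and cutoff -> 10. The values here are made up for the example.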
Summary of parameters:

3. Distributed Search and Merge API

RetrievalMethod

Collection selection, or database ranking, uses the TextRetrievalMethod API, as implemented by CORIRetMethod. It treats a collection selection index like a regular index in which each "document" is actually a database.

DistSearchMethod

The main method in this class is scoreIndexSet(Query &qry, IndexedRealVector &indexset, DocScoreVector** results). The indexes in indexset should correspond to the indexes in the collection selection database passed into the constructor or set by setIndex. This method loads each individual database's parameter file and scores its documents against the given query. It uses whichever RetrievalMethod is specified in the parameter file, or otherwise the configured default. Set the default with setDefaultRetMethod(RetMethodManager::RetModel rt). The method does not actually use the ranking scores in indexset; it accepts this data structure for convenience, because it is the same one returned by the ranking method. A second scoreIndexSet method accepts a vector of database id strings instead.

A DocScoreVector is allocated for each index in indexset and stored in results. The caller should free this memory. Unlike a RetrievalMethod, which returns scores keyed by the index's internal document ids, DistSearchMethod converts the internal document ids to external document character ids. This avoids id conflicts when the scores are later merged into one list.
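
The two points above (caller-owned result vectors and the conversion to external ids) can be sketched without any Lemur headers. ToyIndex, DocScore, DocScoreVec, and this simplified scoreIndexSet signature are stand-ins invented for the example; the real method takes a Query and fills a DocScoreVector** out-parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for Lemur's result types.
struct DocScore {
    std::string id;  // external document id
    double score;
};
typedef std::vector<DocScore> DocScoreVec;

// Toy index: the position in externalIds is the internal document id.
struct ToyIndex {
    std::vector<std::string> externalIds;
    // (internal id, score) pairs that a retrieval run would have produced.
    std::vector<std::pair<int, double> > hits;
};

// Allocates one result vector per index and rewrites internal ids as
// external ids. The caller owns the returned array and must delete[] it.
DocScoreVec* scoreIndexSet(const std::vector<ToyIndex>& indexes) {
    DocScoreVec* results = new DocScoreVec[indexes.size()];
    for (std::size_t i = 0; i < indexes.size(); ++i) {
        const ToyIndex& idx = indexes[i];
        for (std::size_t h = 0; h < idx.hits.size(); ++h) {
            DocScore ds;
            ds.id = idx.externalIds[idx.hits[h].first];
            ds.score = idx.hits[h].second;
            results[i].push_back(ds);
        }
    }
    return results;
}
```

Two toy indexes can both contain internal id 0, yet their entries stay distinguishable after conversion because the external ids differ (for instance "AP-1" and "WSJ-1", which are made-up names).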

DistMergeMethod

This is an abstract interface for merging scores from individual databases. These databases should have ranking scores. Applications using this interface should call mergeScoreSet(IndexedRealVector &indexset, DocScoreVector** scoreset, DocScoreVector &results), where indexset is the same one used for DistSearchMethod and scoreset is the results from DistSearchMethod. Implementing classes should override the score(double dbscore, double docscore) method. For each document in each index, mergeScoreSet creates a merged score using the score method and stores it in the results vector. The returned results are not sorted, but they can be sorted with the Sort method in DocScoreVector. CORIMergeMethod is one implementation of this interface. SingleRegrMergeMethod is another.
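
The override pattern described above can be sketched in isolation as follows. MergeMethod, DbResults, DocScore, and sortByScore are simplified stand-ins rather than the Lemur classes, and ProductMerge uses a made-up combination rule purely for illustration; it is not the CORI or regression merge formula.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for Lemur's result types.
struct DocScore {
    std::string id;
    double score;
};

struct DbResults {
    double dbScore;              // ranking score of the database
    std::vector<DocScore> docs;  // per-document scores from that database
};

class MergeMethod {
public:
    virtual ~MergeMethod() {}
    // Combine every per-database list into one unsorted list.
    std::vector<DocScore> mergeScoreSet(const std::vector<DbResults>& sets) const {
        std::vector<DocScore> merged;
        for (const DbResults& db : sets) {
            for (const DocScore& d : db.docs) {
                merged.push_back({d.id, score(db.dbScore, d.score)});
            }
        }
        return merged;
    }

protected:
    // Subclasses decide how a database score and a document score combine.
    virtual double score(double dbscore, double docscore) const = 0;
};

// Illustrative rule only: NOT the formula used by CORIMergeMethod.
class ProductMerge : public MergeMethod {
protected:
    double score(double dbscore, double docscore) const override {
        return dbscore * docscore;
    }
};

// Sort by descending merged score, since mergeScoreSet leaves the list unsorted.
void sortByScore(std::vector<DocScore>& v) {
    std::sort(v.begin(), v.end(),
              [](const DocScore& a, const DocScore& b) { return a.score > b.score; });
}
```

Keeping the combination rule in a single virtual method means a new merging technique only has to supply score; the iteration over databases and documents stays in the base class.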

4. Sampling API

The class central to the sampling API is QryBasedSampler. It performs query-based sampling on a database, and outputs the profile and documents to disk. Other important classes include FreqCounter which builds the database profile and DBManager which gives an API for simple text database access. This section gives a brief description of these classes. For more detailed descriptions of functions, refer to the source code documentation.

QryBasedSampler

This class uses a DBManager and a FreqCounter to sample documents from a database and build a profile of the database's vocabulary. The probe function does this; its single argument is an initial query. If the initial query does not retrieve any documents, probe returns false. Before probe can be called, the application must create and set the sampler's database manager and frequency counter.

After the initial query, the sampler selects random query terms from the frequency counter. The means for selecting words is determined by the frequency counter's random mode. See the FreqCounter class for more information.
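
The probing loop described above can be sketched with a toy in-memory database. Everything here is hypothetical: ToyDatabase and ToySampler are not Lemur classes, the stopping rule (a target number of documents), the docsPerQuery limit, and the uniform choice of the next query term are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Toy in-memory database standing in for a DBManager (hypothetical).
struct ToyDatabase {
    std::vector<std::string> docs;
    // Return ids of up to n documents that contain the whole word `term`.
    std::vector<int> query(const std::string& term, std::size_t n) const {
        std::vector<int> hits;
        for (std::size_t i = 0; i < docs.size() && hits.size() < n; ++i) {
            std::istringstream words(docs[i]);
            std::string w;
            while (words >> w) {
                if (w == term) { hits.push_back(static_cast<int>(i)); break; }
            }
        }
        return hits;
    }
};

struct ToySampler {
    const ToyDatabase& db;
    std::size_t docsPerQuery;
    std::size_t targetDocs;
    std::set<int> sampled;             // documents seen so far
    std::set<std::string> vocabulary;  // words learned from sampled documents
    std::set<std::string> usedQueries;
    std::mt19937 rng;

    ToySampler(const ToyDatabase& d, std::size_t perQuery, std::size_t target)
        : db(d), docsPerQuery(perQuery), targetDocs(target), rng(42) {}

    // Record newly retrieved documents and add their words to the profile.
    void absorb(const std::vector<int>& ids) {
        for (int id : ids) {
            if (!sampled.insert(id).second) continue;  // already sampled
            std::istringstream words(db.docs[id]);
            std::string w;
            while (words >> w) vocabulary.insert(w);
        }
    }

    // Returns false if the initial query retrieves nothing.
    bool probe(const std::string& initialQuery) {
        const std::vector<int> hits = db.query(initialQuery, docsPerQuery);
        if (hits.empty()) return false;
        usedQueries.insert(initialQuery);
        absorb(hits);
        while (sampled.size() < targetDocs) {
            // Candidate query terms: learned words not yet used as queries.
            std::vector<std::string> candidates;
            for (const std::string& w : vocabulary) {
                if (usedQueries.count(w) == 0) candidates.push_back(w);
            }
            if (candidates.empty()) break;  // nothing left to ask
            std::uniform_int_distribution<std::size_t> pick(0, candidates.size() - 1);
            const std::string term = candidates[pick(rng)];
            usedQueries.insert(term);
            absorb(db.query(term, docsPerQuery));
        }
        return true;
    }
};
```

Because every new query term comes from documents already retrieved, the sampler needs no prior knowledge of the database beyond one initial query that matches something.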

FreqCounter

This class builds a profile or simple language model from a TextHandler stream. In order to have the model updated properly when sampling, the application must build a TextHandler chain with the database manager's parser as the source and the frequency counter as the destination. See Parsing in the Lemur Toolkit for more details. The use of the TextHandler class here allows easy inclusion of a stemmer or indexing components. That is, a sampling application could easily build a collection selection database or normal retrieval database while sampling from a database.
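
The chain arrangement described above, with a parser as the source and a counter as the destination, can be sketched with a minimal handler interface. Handler, ToyParser, Lowercaser, and ToyCounter are hypothetical classes written for this example; they do not reproduce the actual TextHandler method names or signatures.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <sstream>
#include <string>

// Minimal stand-in for a TextHandler: each stage may transform a word
// and pass it to the next stage. Hypothetical interface.
class Handler {
public:
    Handler() : next_(nullptr) {}
    virtual ~Handler() {}
    void setNext(Handler* next) { next_ = next; }
    virtual void handleWord(const std::string& word) { forward(word); }

protected:
    void forward(const std::string& word) {
        if (next_) next_->handleWord(word);
    }

private:
    Handler* next_;
};

// Source of the chain: splits document text into words.
class ToyParser : public Handler {
public:
    void parse(const std::string& text) {
        std::istringstream in(text);
        std::string w;
        while (in >> w) forward(w);
    }
};

// Optional middle stage, standing in for a stemmer or similar component.
class Lowercaser : public Handler {
public:
    void handleWord(const std::string& word) override {
        std::string out = word;
        for (char& c : out) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
        forward(out);
    }
};

// Counts term occurrences, then passes the word on so that a further
// stage (for example an indexer) could follow it in the chain.
class ToyCounter : public Handler {
public:
    std::map<std::string, int> ctf;
    void handleWord(const std::string& word) override {
        ++ctf[word];
        forward(word);
    }
};
```

Wiring parser -> lowercaser -> counter and parsing "Lemur lemur toolkit" leaves the counter with lemur = 2 and toolkit = 1. Inserting or removing a middle stage does not require changes to the source or the counter.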

A frequency counter can use an internal stopword list (Stopper class) specified in the constructor to filter out stopwords. A frequency counter can load its frequencies from a file using input and write frequencies to a file using output.

Frequency counters can also return random words. The randomWord function returns a word guaranteed to be unique since the last call to clear. The method used for selecting the random word is one of the following: R_CTF, R_DF, R_AVETF, or R_UNIF. R_CTF selects words with probability proportional to the terms' collection term frequency. R_DF selects words with probability proportional to the terms' document frequency. R_AVETF selects words with probability proportional to the terms' average term frequency (ctf/df). R_UNIF selects words with equal probability. The mode for selection is set using the setRandomMode function.
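
The four selection modes can be illustrated with a small weighted-sampling sketch. The mode names come from the text above; TermStats, weight, ToyFreqCounter, and clearReturned are hypothetical, and clearReturned only forgets which words were handed out, whereas the real clear function may reset more state.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <string>
#include <utility>
#include <vector>

enum RandomMode { R_CTF, R_DF, R_AVETF, R_UNIF };

struct TermStats {
    std::string word;
    long ctf;  // collection term frequency
    long df;   // document frequency
};

// Unnormalized sampling weight of one term under the given mode.
double weight(const TermStats& t, RandomMode mode) {
    switch (mode) {
        case R_CTF:   return static_cast<double>(t.ctf);
        case R_DF:    return static_cast<double>(t.df);
        case R_AVETF: return t.df > 0 ? static_cast<double>(t.ctf) / t.df : 0.0;
        case R_UNIF:  return 1.0;
    }
    return 0.0;
}

class ToyFreqCounter {
public:
    explicit ToyFreqCounter(std::vector<TermStats> terms)
        : terms_(std::move(terms)), mode_(R_UNIF), rng_(7) {}

    void setRandomMode(RandomMode m) { mode_ = m; }
    void clearReturned() { returned_.clear(); }

    // Returns a word not handed out since the last clearReturned call,
    // or an empty string if no eligible word remains.
    std::string randomWord() {
        std::vector<std::size_t> candidates;
        std::vector<double> weights;
        for (std::size_t i = 0; i < terms_.size(); ++i) {
            const double w = weight(terms_[i], mode_);
            if (w > 0.0 && returned_.count(terms_[i].word) == 0) {
                candidates.push_back(i);
                weights.push_back(w);
            }
        }
        if (candidates.empty()) return "";
        std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
        const std::string word = terms_[candidates[dist(rng_)]].word;
        returned_.insert(word);
        return word;
    }

private:
    std::vector<TermStats> terms_;
    RandomMode mode_;
    std::set<std::string> returned_;
    std::mt19937 rng_;
};
```

For a term with ctf = 10 and df = 2, the weights are 10 under R_CTF, 2 under R_DF, 5 under R_AVETF, and 1 under R_UNIF. R_AVETF therefore favors terms that occur many times within the few documents that contain them.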

DBManager

The DBManager provides a simplified API for querying a database and retrieving documents. The goal in providing this class is to supply only the functionality needed for query-based sampling in a simple, contained class.

The query function takes a string (char *) and a number of documents to retrieve. The function must return a results_t structure, which has two fields: num, the number of results in the list, and docs, an array of docid_t (char *) containing the results. The caller is responsible for freeing the structure and its contents.

The getDoc function takes a docid_t and returns a doc_t structure. The doc_t structure consists of a docid_t called docid, a char * called doc, and an integer len, which indicates the number of characters in the document. The caller should make no assumptions about the format of the data in the doc field. The caller is responsible for freeing the structure and the contents of the structure.
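
The ownership rules for results_t and doc_t can be sketched as follows. The field names follow the description above, but the exact declarations in Lemur's headers are an assumption, as is the use of new[] and delete[]; a real caller must free with whatever matches the allocation used by the DBManager implementation. The make and free helpers are hypothetical.

```cpp
#include <cassert>
#include <cstring>

namespace sketch {

typedef char* docid_t;

// Assumed layouts, based only on the field names given in the text.
struct results_t {
    int num;        // number of results in the list
    docid_t* docs;  // array of num document ids
};

struct doc_t {
    docid_t docid;
    char* doc;  // document contents, format unspecified
    int len;    // number of characters in doc
};

// Duplicate a C string into freshly allocated memory.
char* copyString(const char* s) {
    char* out = new char[std::strlen(s) + 1];
    std::strcpy(out, s);
    return out;
}

// Stand-in for what a query call might hand back (hypothetical).
results_t* makeResults(const char* const* ids, int n) {
    results_t* r = new results_t;
    r->num = n;
    r->docs = new docid_t[n];
    for (int i = 0; i < n; ++i) r->docs[i] = copyString(ids[i]);
    return r;
}

// The caller frees each id, then the array, then the structure.
void freeResults(results_t* r) {
    if (!r) return;
    for (int i = 0; i < r->num; ++i) delete[] r->docs[i];
    delete[] r->docs;
    delete r;
}

// Stand-in for what a getDoc call might hand back (hypothetical).
doc_t* makeDoc(const char* id, const char* text) {
    doc_t* d = new doc_t;
    d->docid = copyString(id);
    d->doc = copyString(text);
    d->len = static_cast<int>(std::strlen(text));
    return d;
}

// The caller frees the contents first, then the structure.
void freeDoc(doc_t* d) {
    if (!d) return;
    delete[] d->docid;
    delete[] d->doc;
    delete d;
}

}  // namespace sketch
```

Wrapping the cleanup in one helper per structure keeps the freeing order in a single place, which matters because both structures own nested allocations.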

The getParser function returns a MemParser that is capable of parsing the contents of the doc field of a doc_t.

The output function writes a document to file, which is specified using setOutputFile.

There are currently two implementations of the DBManager interface: LemurDBManager and MindDbManager. LemurDBManager provides an example of communicating with a local database through an API, while MindDbManager uses XML to communicate with a remote database.

MemParser

The MemParser class extends the Parser class. It adds a parse function that takes a doc_t. You are not required to override the existing functions of Parser. The most important point is that a MemParser is a TextHandler. See Parsing in the Lemur Toolkit for more details on TextHandlers.

5. Adding New Database Systems

Adding a new database system requires that you:


The Lemur Project
Last modified: Fri Feb 13 18:28:07 EST 2004