Lemur Retrieval Applications

Contents

RetEval
RelFBEval
QueryModelEval
TwoStageRetEval
GenerateSmoothSupport
GenerateQueryModel
EstimateDirPrior
QueryClarity
GenL2Norm
ireval.pl

RetEval

This application runs batched retrieval experiments (with/without feedback) to evaluate different retrieval models as well as different parameter settings for those models.
Scoring is either done over a working set of documents (essentially re-ranking), or over the whole collection. This is indicated by the parameter "useWorkingSet". When "useWorkingSet" has a non-zero (integer) value, scoring will be on a working set specified in a file given by "workSetFile". The file should have three columns. The first is the query id; the second the document id; and the last a numerical value, which is ignored. The reason for having a third column of numerical values is so that any retrieval result of the simple format (i.e., non-TREC format) generated by Lemur could be directly used as a "workSetFile" for the purpose of re-ranking, which is convenient. Also, the third column could be used to provide a prior probability value for each document, which could be useful for some algorithms. By default, scoring is on the whole collection.
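As a rough illustration of the working-set format described above, the following sketch groups document ids by query id. The function name read_working_set and the whitespace-separated layout are assumptions made for this example; it is not part of Lemur.

```python
from collections import defaultdict

def read_working_set(lines):
    """Group document ids by query id; the third column is checked but unused."""
    working_set = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        query_id, doc_id, value = fields
        float(value)  # the third column must be numeric, but is otherwise ignored
        working_set[query_id].append(doc_id)
    return dict(working_set)
```

Because the third column is ignored, the same reader would accept a simple-format result file from an earlier run.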
RetEval currently supports seven different models:

The popular TFIDF retrieval model
The Okapi BM25 retrieval function
The KL-divergence language model based retrieval method
The InQuery (CORI) retrieval model
CORI collection selection
Cosine similarity model
Indri structured query language

The parameter to select the model is retModel. Valid values are:

tfidf or 0 for TFIDF
okapi or 1 for Okapi
kl or 2 for Simple KL
inquery or 3 for InQuery
cori_cs or 4 for CORI collection selection
cos or 5 for cosine similarity
indri or 7 for the Indri structured query language
(It is suspected that there is a bug in the implementation of the feedback for Okapi BM25 retrieval function, because the performance is not as expected.)
Other common parameters (for all retrieval methods) are:

index: The complete name of the index table-of-content file for the database index (i.e., index.key or index.ifp).
textQuery: The file with queries to run (batched). The queries should be in the format below. This file can be generated using the application ParseQuery:
<DOC 1>
my first query with id 1
</DOC>
<DOC 2>
my second query with id 2
is on more than one line
</DOC>
...
<DOC N>
my last query id N
</DOC>

resultFile: The file to save the results to
TRECResultFormat: whether the results are written in the TREC format (i.e., six-column) or in a simple three-column format. Integer value: zero for non-TREC format, non-zero for TREC format. Default: 1 (i.e., TREC format)
resultCount: the number of documents to return as result for each query
feedbackDocCount: the number of docs to use for pseudo-feedback (0 means no-feedback)
feedbackTermCount: the number of terms to add to a query when doing feedback. Note that in the KL-divergence approach, the actual number of terms is also affected by two other parameters. (See below.)
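The query layout expected by textQuery can be sketched as follows. This only reproduces the <DOC id> ... </DOC> wrapping shown above; format_queries is a hypothetical helper, and any processing that ParseQuery applies to the query text is not modelled here.

```python
def format_queries(queries):
    """Render (query_id, text) pairs in the <DOC id> ... </DOC> layout."""
    parts = []
    for query_id, text in queries:
        parts.append(f"<DOC {query_id}>\n{text.strip()}\n</DOC>\n")
    return "".join(parts)

if __name__ == "__main__":
    print(format_queries([(1, "my first query with id 1")]), end="")
```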

Model-specific parameters are:

For TFIDF:

feedbackPosCoeff: the coefficient for positive terms in (positive) Rocchio feedback. We only implemented the positive part and non-relevant documents are ignored.
doc.tfMethod: document term TF weighting method: rawtf for RawTF, logf for log-TF, and bm25 for BM25TF
doc.bm25K1: BM25 k1 for doc term TF
doc.bm25B : BM25 b for doc term TF
query.tfMethod: query term TF weighting method: rawtf for RawTF, logf for log-TF, and bm25 for BM25TF
query.bm25K1: BM25 k1 for query term TF. bm25B is set to zero for query terms
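
To make the role of feedbackPosCoeff concrete, here is a minimal sketch of positive-only Rocchio feedback: the query vector is moved toward the centroid of the feedback documents. The function name, the dictionary representation of vectors, and the way the centroid is truncated to feedbackTermCount terms are assumptions for illustration, not a description of Lemur's implementation.

```python
from collections import Counter

def positive_rocchio(query_vec, feedback_vecs, pos_coeff, term_count):
    """Add pos_coeff times the centroid of the feedback documents to the query."""
    if not feedback_vecs:
        return dict(query_vec)
    centroid = Counter()
    for vec in feedback_vecs:
        for term, weight in vec.items():
            centroid[term] += weight / len(feedback_vecs)
    # Keep only the term_count highest-weighted centroid terms (assumed truncation).
    top = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))[:term_count]
    updated = dict(query_vec)
    for term, weight in top:
        updated[term] = updated.get(term, 0.0) + pos_coeff * weight
    return updated
```

There is no negative term in the update, which matches the note that non-relevant documents are ignored.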

For Okapi:

BM25K1 : BM25 K1
BM25B : BM25 B
BM25K3: BM25 K3
BM25QTF: The TF for expanded terms in feedback (the original paper about the Okapi system is not clear about how this is set, so it's implemented as a parameter.)

For KL-divergence:

Document model smoothing parameters:
smoothSupportFile: The name of the smoothing support file (e.g., one generated by GenerateSmoothSupport).
smoothMethod: One of: Jelinek-Mercer (jm), Dirichlet prior (dir), Absolute discounting (ad), or Two stage (twostage)
smoothStrategy: Either interpolate (interpolate) or backoff (backoff)
JelinekMercerLambda: The collection model weight in the JM interpolation method. Default: 0.5
DirichletPrior: The prior parameter in the Dirichlet prior smoothing method. Default: 1000
discountDelta: The delta (discounting constant) in the absolute discounting method. Default 0.7.
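
For orientation, the three basic smoothing methods can be sketched in their standard interpolated forms. The function and argument names are chosen for this example, and the backoff strategy and two-stage method are not shown; the Lemur source remains the reference for the exact computation.

```python
def jelinek_mercer(count, doc_len, p_coll, lam=0.5):
    """Linear interpolation; lam is the collection model weight."""
    return (1.0 - lam) * count / doc_len + lam * p_coll

def dirichlet_prior(count, doc_len, p_coll, mu=1000.0):
    """Dirichlet prior smoothing; mu is the prior parameter."""
    return (count + mu * p_coll) / (doc_len + mu)

def absolute_discount(count, doc_len, unique_terms, p_coll, delta=0.7):
    """Subtract delta from each seen count and give the freed mass to the collection model."""
    return max(count - delta, 0.0) / doc_len + delta * unique_terms / doc_len * p_coll
```

Here count is the term's frequency in the document, doc_len the document length, unique_terms the number of distinct terms in the document, and p_coll the collection language model probability of the term.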
Query model updating method (i.e., pseudo feedback):
queryUpdateMethod: feedback method: mixture model (mixture), divergence minimization (divmin), Markov chain (mc), relevance model 1 (rm1), or relevance model 2 (rm2).
Method-specific feedback parameters:
For all interpolation-based approaches (i.e., the new query model is an interpolation of the original model with a (feedback) model computed based on the feedback documents), the following four parameters apply:

feedbackCoefficient: the coefficient of the feedback model for interpolation. The value is in [0,1], with 0 meaning using only the original model (thus no updating/feedback) and 1 meaning using only the feedback model (thus ignoring the original model).
feedbackTermCount: Truncate the feedback model to no more than a given number of words/terms.
feedbackProbThresh: Truncate the feedback model to include only words with a probability higher than this threshold. Default value: 0.001.
feedbackProbSumThresh: Truncate the feedback model until the sum of the probability of the included words reaches this threshold. Default value: 1.

Parameters feedbackTermCount, feedbackProbThresh, and feedbackProbSumThresh work conjunctively to control the truncation, i.e., the truncated model must satisfy all three constraints.
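
The truncation and interpolation steps described above can be sketched as follows. This is an illustration under assumptions: words are considered in decreasing order of probability, the boundary conditions (strict versus non-strict comparisons) are guesses, and any renormalization is left out. The function names are hypothetical.

```python
def truncate_model(model, term_count, prob_thresh=0.001, prob_sum_thresh=1.0):
    """Keep the most probable words while all three limits are satisfied."""
    kept = {}
    total = 0.0
    for word, prob in sorted(model.items(), key=lambda kv: (-kv[1], kv[0])):
        if len(kept) >= term_count or prob <= prob_thresh or total >= prob_sum_thresh:
            break  # words are sorted, so no later word can qualify
        kept[word] = prob
        total += prob
    return kept

def interpolate(original, feedback, coeff):
    """Mix the two models; coeff is the weight of the feedback model."""
    words = set(original) | set(feedback)
    return {w: (1.0 - coeff) * original.get(w, 0.0) + coeff * feedback.get(w, 0.0)
            for w in words}
```

With coeff set to 0 the result equals the original model, and with coeff set to 1 it equals the feedback model, matching the description of feedbackCoefficient.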
The mixture model, divergence minimization, and Markov chain feedback methods also recognize the parameter feedbackMixtureNoise (default value: 0.5), but with different interpretations.

For the collection mixture model method, feedbackMixtureNoise is the collection model selection probability in the mixture model. That is, with this probability, a word is picked according to the collection language model, when a feedback document is "generated".
For the divergence minimization method, feedbackMixtureNoise means the weight of the divergence from the collection language model. (The higher it is, the farther the estimated model is from the collection model.)
For the Markov chain method, feedbackMixtureNoise is the probability of not stopping, i.e., 1 - alpha, where alpha is the stopping probability while walking through the chain.

In addition, the collection mixture model also recognizes the parameter emIterations, which is the maximum number of iterations the EM algorithm will run. Default: 50. (The EM algorithm can terminate earlier if the log-likelihood converges quickly, where convergence is measured by some hard-coded criterion. See the source code in SimpleKLRetMethod.cpp for details.)
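
The interplay of feedbackMixtureNoise and emIterations in the collection mixture model can be sketched with a small EM loop. This is a simplified illustration, not the code in SimpleKLRetMethod.cpp: the function name, the starting point, and the convergence tolerance tol are assumptions, and it requires noise below 1 and a positive collection probability for every word.

```python
import math

def estimate_feedback_model(counts, p_coll, noise=0.5, em_iterations=50, tol=1e-6):
    """EM for a two-component mixture of a feedback model and the collection model."""
    total = sum(counts.values())
    theta = {w: c / total for w, c in counts.items()}  # start from relative frequencies
    prev_ll = -math.inf
    for _ in range(em_iterations):
        # E-step: probability that each word occurrence came from the feedback model.
        post = {w: (1 - noise) * theta[w] / ((1 - noise) * theta[w] + noise * p_coll[w])
                for w in counts}
        # M-step: re-estimate the feedback model from the expected counts.
        norm = sum(counts[w] * post[w] for w in counts)
        theta = {w: counts[w] * post[w] / norm for w in counts}
        ll = sum(c * math.log((1 - noise) * theta[w] + noise * p_coll[w])
                 for w, c in counts.items())
        if ll - prev_ll < tol:
            break  # log-likelihood has stopped improving
        prev_ll = ll
    return theta
```

Here counts holds the term frequencies pooled over the feedback documents. Words that are common in the collection are largely explained by the collection component, so the estimated feedback model concentrates on words that are distinctive for the feedback documents.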

RelFBEval

This application (RelFBEval.cpp) runs retrieval experiments with relevance feedback. Different retrieval models can be used with different settings for the corresponding parameters. Although this program is designed for relevance feedback, it can be easily used for pseudo feedback -- you just need to set the parameter feedbackDocuments to a result file, i.e., interpreting a result file as if all the entries represent relevant documents.
Two important notes:

All the feedback algorithms currently in Lemur assume that all entries in a judgment file are relevant documents, so you must remove all the entries of judged non-relevant documents. However, the judgment status is recorded in the internal representation of judgments, so that it is possible to distinguish judged relevant documents from judged non-relevant documents in a feedback algorithm.
The format of the judgment file, when used for feedback, must be of three columns, i.e., with the second column removed so that each line has a query id, a document id, and a judgment value. This is to be consistent with the format of a result file. An alternative would be to use the original four-column format directly, but then we would need to add a parameter to distinguish this four-column format from the three-column format of a result file.
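
Both notes can be handled in one preprocessing pass, sketched below. The helper name is hypothetical, and treating a positive judgment value as "relevant" is an assumption about the judgment file.

```python
def to_feedback_judgments(lines):
    """Drop the second column and keep only entries judged relevant."""
    out = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # not a four-column judgment line
        query_id, _unused, doc_id, judgment = fields
        if float(judgment) > 0:  # assumption: positive value means relevant
            out.append(f"{query_id} {doc_id} {judgment}")
    return out
```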

Scoring is either done over a working set of documents (essentially re-ranking), or over the whole collection. This is indicated by the parameter "useWorkingSet". When "useWorkingSet" has a non-zero (integer) value, scoring will be on a working set specified in a file given by "workSetFile". The file should have three columns. The first is the query id; the second the document id; and the last a numerical value, which is ignored. The reason for having a third column of numerical values is so that any retrieval result of the simple format (i.e., non-TREC format) generated by Lemur could be directly used as a "workSetFile" for the purpose of re-ranking, which is convenient. Also, the third column could be used to provide a prior probability value for each document, which could be useful for some algorithms. By default, scoring is on the whole collection.
It currently supports three different models:

The popular TFIDF retrieval model
The Okapi BM25 retrieval function
The KL-divergence language model based retrieval method

The parameter to select the model is retModel (with value 0 for TFIDF, 1 for Okapi, and 2 for KL). It is suspected that there is a bug in the implementation of the feedback for Okapi BM25 retrieval function, because the performance is not as expected.
Other common parameters (for all retrieval methods) are:

index: The complete name of the index table-of-content file for the database index.
textQuerySet: the query text stream
resultFile: the result file
resultCount: the number of documents to return as result for each query
feedbackDocuments : the file of feedback documents to be used for feedback. In the case of pseudo feedback, this can be a result file generated from an initial retrieval process. In the case of relevance feedback, this is usually a 3-column relevance judgment file. Note that this means you can NOT use a TREC-style judgment file directly; you must remove the second column to convert it to three-column.
feedbackDocCount: the number of docs to use for feedback (negative value means using all judged documents for feedback). The documents in the feedbackDocuments are sorted in decreasing order according to the numerical value in the third column, and then the top documents are used for feedback.
feedbackTermCount: the number of terms to add to a query when doing feedback. Note that in the KL-divergence approach, the actual number of terms is also affected by two other parameters. (See below.)
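
The selection rule for feedbackDocCount can be sketched as below: entries are sorted by the third-column value in decreasing order and the top documents are kept, with a negative count meaning all of them. The function name and the tuple representation are assumptions, and how ties are ordered in Lemur is not specified here.

```python
def select_feedback_docs(entries, feedback_doc_count):
    """entries: (doc_id, value) pairs for one query, taken from feedbackDocuments."""
    ranked = sorted(entries, key=lambda e: e[1], reverse=True)
    if feedback_doc_count < 0:
        return [doc_id for doc_id, _ in ranked]  # negative: use every entry
    return [doc_id for doc_id, _ in ranked[:feedback_doc_count]]
```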

Model-specific parameters are:

For TFIDF:

feedbackPosCoeff: the coefficient for positive terms in (positive) Rocchio feedback. We only implemented the positive part and non-relevant documents are ignored.
doc.tfMethod: document term TF weighting method: 0 for RawTF, 1 for log-TF, and 2 for BM25TF
doc.bm25K1: BM25 k1 for doc term TF
doc.bm25B : BM25 b for doc term TF
query.tfMethod: query term TF weighting method: 0 for RawTF, 1 for log-TF, and 2 for BM25TF
query.bm25K1: BM25 k1 for query term TF. bm25B is set to zero for query terms

For Okapi:

BM25K1 : BM25 K1
BM25B : BM25 B
BM25K3: BM25 K3
BM25QTF: The TF for expanded terms in feedback (the original paper about the Okapi system is not clear about how this is set, so it's implemented as a parameter.)

For KL-divergence:

Document model smoothing parameters:
smoothSupportFile: The name of the smoothing support file (e.g., one generated by GenerateSmoothSupport).
smoothMethod: One of the three: Jelinek-Mercer (0), Dirichlet prior (1), and Absolute discounting (2)
smoothStrategy: Either interpolate (0) or backoff (1)
JelinekMercerLambda: The collection model weight in the JM interpolation method. Default: 0.5
DirichletPrior: The prior parameter in the Dirichlet prior smoothing method. Default: 1000
discountDelta: The delta (discounting constant) in the absolute discounting method. Default 0.7.
Query model updating method (i.e., pseudo feedback):
queryUpdateMethod: feedback method (0, 1, 2 for mixture model, divergence minimization, and Markov chain respectively).
Method-specific feedback parameters:
For all interpolation-based approaches (i.e., the new query model is an interpolation of the original model with a (feedback) model computed based on the feedback documents), the following four parameters apply:

feedbackCoefficient: the coefficient of the feedback model for interpolation. The value is in [0,1], with 0 meaning using only the original model (thus no updating/feedback) and 1 meaning using only the feedback model (thus ignoring the original model).
feedbackTermCount: Truncate the feedback model to no more than a given number of words/terms.
feedbackProbThresh: Truncate the feedback model to include only words with a probability higher than this threshold. Default value: 0.001.
feedbackProbSumThresh: Truncate the feedback model until the sum of the probability of the included words reaches this threshold. Default value: 1.

Parameters feedbackTermCount, feedbackProbThresh, and feedbackProbSumThresh work conjunctively to control the truncation, i.e., the truncated model must satisfy all three constraints.
All three feedback methods also recognize the parameter feedbackMixtureNoise (default value: 0.5), but with different interpretations.

For the collection mixture model method, feedbackMixtureNoise is the collection model selection probability in the mixture model. That is, with this probability, a word is picked according to the collection language model, when a feedback document is "generated".
For the divergence minimization method, feedbackMixtureNoise means the weight of the divergence from the collection language model. (The higher it is, the farther the estimated model is from the collection model.)
For the Markov chain method, feedbackMixtureNoise is the probability of not stopping, i.e., 1 - alpha, where alpha is the stopping probability while walking through the chain.

In addition, the collection mixture model also recognizes the parameter emIterations, which is the maximum number of iterations the EM algorithm will run. Default: 50. (The EM algorithm can terminate earlier if the log-likelihood converges quickly, where convergence is measured by some hard-coded criterion. See the source code in SimpleKLRetMethod.cpp for details.)

QueryModelEval

This application loads an expanded query model (e.g., one computed by GenerateQueryModel), and evaluates it with the KL-divergence retrieval model.
Parameters:

index: The complete name of the index table-of-content file for the database index.
smoothSupportFile: The name of the smoothing support file (e.g., one generated by GenerateSmoothSupport).
queryModel: the file of the query model to be evaluated
resultFile: the result file
TRECResultFormat: whether the result format should be of the TREC format (i.e., six-column) or just a simple three-column format <queryID, docID, score>. Integer value, zero for non-TREC format, and non-zero for TREC format. Default: 1 (i.e., TREC format)
resultCount: the number of documents to return as result for each query
The following are document model smoothing parameters:
smoothMethod: One of the three: Jelinek-Mercer (0), Dirichlet prior (1), and Absolute discounting (2)
smoothStrategy: Either interpolate (0) or backoff (1)
JelinekMercerLambda: The collection model weight in the JM interpolation method. Default: 0.5
DirichletPrior: The prior parameter in the Dirichlet prior smoothing method. Default: 1000
discountDelta: The delta (discounting constant) in the absolute discounting method. Default 0.7.

TwoStageRetEval

This application (TwoStageRetEval.cpp) runs retrieval experiments (with/without feedback) in exactly the same way as the application RetEval.cpp, except that it always uses the two-stage smoothing method for the initial retrieval and the KL-divergence model for feedback. It thus ignores the parameter retModel.
It recognizes all the parameters relevant to the KL-divergence retrieval model, except for the smoothing method parameter smoothMethod, which is forced to two-stage smoothing (value of 3), and JelinekMercerLambda, which is ignored because the application estimates its value automatically using a mixture model. For details on all the parameters, see the documentation for RetEval.
To achieve the effect of the completely automatic two-stage smoothing method, the parameter DirichletPrior should be set to the value estimated by the application EstimateDirPrior, which computes a maximum likelihood estimate of DirichletPrior based on "leave-one-out".

GenerateSmoothSupport

This application generates two support files for retrieval using the language modeling approach. Both files contain some pre-computed quantities that are needed to speed up the retrieval process.
One file (name given by the parameter smoothSupportFile, see below) is needed by retrieval using smoothed unigram language model. Each entry in this support file corresponds to one document and records two pieces of information: (a) the count of unique terms in the document; (b) the sum of collection language model probabilities for the words in the document.
The other file (with an extra suffix ".mc") is needed if you run feedback based on the Markov chain query model. Each line in this file contains a term and the sum of the probability of the word given each document in the collection (i.e., a sum of p(w|d) over all possible d's).
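
The quantities stored in the two support files can be illustrated on a toy collection. This sketch is not GenerateSmoothSupport itself: the function name is hypothetical, documents are plain term lists, the collection probabilities are summed over the distinct words of each document, and p(w|d) is taken as the maximum likelihood estimate, all of which are assumptions.

```python
from collections import Counter

def smoothing_support(docs):
    """docs: {doc_id: list of terms}. Returns (per-document entries, per-term sums)."""
    coll_counts = Counter(t for terms in docs.values() for t in terms)
    coll_len = sum(coll_counts.values())
    p_coll = {t: c / coll_len for t, c in coll_counts.items()}

    per_doc = {}
    term_sums = Counter()
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        # (a) unique term count, (b) collection probability mass of the document's words
        per_doc[doc_id] = (len(tf), sum(p_coll[t] for t in tf))
        for t, c in tf.items():
            term_sums[t] += c / len(terms)  # maximum likelihood p(w|d), an assumption
    return per_doc, dict(term_sums)
```

Precomputing these values once means that scoring a query does not have to revisit every term of every document.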
To run the application, follow the general steps of running a Lemur application and set the following variables in the parameter file:

index: the table-of-content (TOC) record file of the index (e.g., the .bsc file created by BuildBasicIndex or the .ifp file created by PushIndexer.)
smoothSupportFile: file path for the support file (e.g., /usr0/mydata/index.supp)
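
As an illustration only, the two variables could be written to a parameter file with a helper like the one below. The "name = value;" entry syntax is an assumption about Lemur parameter files that this page does not state, render_parameters is a hypothetical helper, and the index path is made up for the example.

```python
def render_parameters(params):
    """Render a dict as one `name = value;` entry per line (assumed syntax)."""
    return "".join(f"{name} = {value};\n" for name, value in params.items())

if __name__ == "__main__":
    print(render_parameters({
        "index": "/usr0/mydata/index.bsc",  # hypothetical index path
        "smoothSupportFile": "/usr0/mydata/index.supp",
    }), end="")
```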

This application is also a good example of using the doc index (i.e., doc->term index).

GenerateQueryModel

This application (GenerateQueryModel.cpp) computes an expanded query model based on feedback documents and the original query model for the KL-divergence retrieval method. It can be regarded as performing a feedback in the language modeling approach to retrieval. The original query model can be computed based on the original query text (when the parameter "initQuery" is not set, or set to a null string), or based on a previously saved query model (the model is given by the parameter "initQuery"). Expanding a saved query model makes it possible to do iterative feedback. Feedback can be based on true relevance judgments or any previously returned retrieval results.
Two important notes:

All the feedback algorithms currently in Lemur assume that all entries in a judgment file are relevant documents, so you must remove all the entries of judged non-relevant documents. However, the judgment status is recorded in the internal representation of judgments, so that it is possible to distinguish judged relevant documents from judged non-relevant documents in a feedback algorithm.
The format of the judgment file, when used for feedback, must be of three columns, i.e., with the second column removed so that each line has a query id, a document id, and a judgment value. This is to be consistent with the format of a result file. An alternative would be to use the original four-column format directly, but, then we would need to add a parameter to distinguish this four-column format from the three-column format of a result file.

Parameters:

index: The complete name of the index table-of-content file for the database index.
smoothSupportFile: The name of the smoothing support file (e.g., one generated by GenerateSmoothSupport).
textQuery: the original query text stream
initQuery: the file with a saved initial query model. When this parameter is set to a non-empty string, the model stored in this file will be used for expansion; otherwise, the original query text is used as the initial query model for expansion.
feedbackDocuments: the file of feedback documents to be used for feedback. In the case of pseudo feedback, this can be a result file generated from an initial retrieval process. In the case of relevance feedback, this is usually a 3-column relevance judgment file. Note that this means you can NOT use a TREC-style judgment file directly; you must remove the second column to convert it to three-column.
TRECResultFormat: whether the feedback document file (given by feedbackDocuments) is of the TREC format (i.e., six-column) or just a simple three-column format. Integer value, zero for non-TREC format, and non-zero for TREC format. Default: 1 (i.e., TREC format). VERY IMPORTANT: For relevance feedback, TRECResultFormat should always be set to 0, since the judgment file is always a simple format.
expandedQuery: the file to store the expanded query model
feedbackDocCount: the number of docs to use for pseudo-feedback (0 means no-feedback)
queryUpdateMethod: feedback method (0, 1, 2 for mixture model, divergence minimization, and Markov chain respectively).
Method-specific feedback parameters:
For all interpolation-based approaches (i.e., the new query model is an interpolation of the original model with a (feedback) model computed based on the feedback documents), the following four parameters apply:

feedbackCoefficient: the coefficient of the feedback model for interpolation. The value is in [0,1], with 0 meaning using only the original model (thus no updating/feedback) and 1 meaning using only the feedback model (thus ignoring the original model).
feedbackTermCount: Truncate the feedback model to no more than a given number of words/terms.
feedbackProbThresh: Truncate the feedback model to include only words with a probability higher than this threshold. Default value: 0.001.
feedbackProbSumThresh: Truncate the feedback model until the sum of the probability of the included words reaches this threshold. Default value: 1.

Parameters feedbackTermCount, feedbackProbThresh, and feedbackProbSumThresh work conjunctively to control the truncation, i.e., the truncated model must satisfy all three constraints.
All three feedback methods also recognize the parameter feedbackMixtureNoise (default value: 0.5), but with different interpretations.

For the collection mixture model method, feedbackMixtureNoise is the collection model selection probability in the mixture model. That is, with this probability, a word is picked according to the collection language model, when a feedback document is "generated".
For the divergence minimization method, feedbackMixtureNoise means the weight of the divergence from the collection language model. (The higher it is, the farther the estimated model is from the collection model.)
For the Markov chain method, feedbackMixtureNoise is the probability of not stopping, i.e., 1 - alpha, where alpha is the stopping probability while walking through the chain.

In addition, the collection mixture model also recognizes the parameter emIterations, which is the maximum number of iterations the EM algorithm will run. Default: 50. (The EM algorithm can terminate earlier if the log-likelihood converges quickly, where convergence is measured by some hard-coded criterion. See the source code in SimpleKLRetMethod.cpp for details.)

EstimateDirPrior

This application (EstimateDirPrior.cpp) uses the leave-one-out method to estimate an optimal setting for the Dirichlet prior smoothing parameter (i.e., the "prior sample size").
To run the application, follow the general steps of running a Lemur application and set the following variables in the parameter file:

index: the table-of-content (TOC) record file of the index (e.g., the .bsc file created by BuildBasicIndex)
initValue: the initial value for the parameter in the Newton method. The default value is 1. In general, you do not need to set this parameter.

After completion, it will print out the estimated parameter value to the standard output.
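
For readers unfamiliar with the leave-one-out estimate, the objective usually associated with it can be written as below. This is the form commonly given for Dirichlet prior estimation and is stated here as an assumption; EstimateDirPrior.cpp remains the authority on what is actually computed. Here c(w,d) is the count of word w in document d, |d| is the document length, p(w|C) is the collection language model, and mu is the Dirichlet prior parameter.

```latex
\ell_{-1}(\mu \mid C) \;=\; \sum_{d \in C} \sum_{w \in d} c(w,d)\,
  \log \frac{c(w,d) - 1 + \mu\, p(w \mid C)}{|d| - 1 + \mu},
\qquad
\mu^{(k+1)} \;=\; \mu^{(k)} - \frac{\ell_{-1}'(\mu^{(k)})}{\ell_{-1}''(\mu^{(k)})}
```

The second expression is the generic Newton update, with the iteration started from initValue.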

GenL2Norm

This application (GenL2Norm.cpp) generates a support file for retrieval using the cosine similarity. The file contains the L2 norms for each document, used to speed up the retrieval process. To run the application, follow the general steps of running a Lemur application and set the following variables in the parameter file:

index: the table-of-content (TOC) record file of the index (e.g., the .bsc file created by BuildBasicIndex or the .ifp file created by PushIndexer.)
L2File: file path for the support file (e.g., /usr0/mydata/index.L2)
This application is also a good example of using the doc index (i.e., doc->term index).
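
Why precomputed norms speed up cosine scoring can be seen in a small sketch. It uses raw term counts as vector weights, which is an assumption (the weighting used by Lemur's cosine model is not described here), and the function names are hypothetical.

```python
import math
from collections import Counter

def l2_norms(docs):
    """docs: {doc_id: list of terms}. Returns the L2 norm of each raw count vector."""
    return {doc_id: math.sqrt(sum(c * c for c in Counter(terms).values()))
            for doc_id, terms in docs.items()}

def cosine(query_terms, doc_terms, doc_norm):
    """Cosine similarity that reuses a precomputed document norm."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    q_norm = math.sqrt(sum(c * c for c in q.values()))
    return dot / (q_norm * doc_norm) if q_norm and doc_norm else 0.0
```

The document norm depends only on the document, so it can be computed once and stored, while the dot product touches only the query terms.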

QueryClarity

This application (QueryClarity.cpp) computes clarity scores for a query model, which could be an expanded model based on feedback documents and the original query model using the KL-divergence retrieval method. The original query model can be computed based on the original query text (when the parameter "initQuery" is not set, or set to a null string), or based on a previously saved query model (the model is given by the parameter "initQuery"). If feedbackDocCount is 0, the application computes the clarity score only for the original or given query files. Clarity scores for each entire query, and for each individual term within each query, are written to the file specified by the parameter "expandedQuery". Feedback can be based on true relevance judgments or any previously returned retrieval results.
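
A clarity score is commonly defined as the KL divergence between the query model and the collection model, which decomposes into one contribution per term. The sketch below follows that definition; the function name, the use of base-2 logarithms, and the dictionary representation are assumptions, not a description of QueryClarity.cpp.

```python
import math

def clarity(query_model, p_coll):
    """Return (total score, per-term contributions) in bits."""
    per_term = {w: p * math.log2(p / p_coll[w])
                for w, p in query_model.items() if p > 0}
    return sum(per_term.values()), per_term
```

A query model that simply mirrors the collection model gets a score of zero, while a model concentrated on words that are rare in the collection scores higher.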
Two important notes:

All the feedback algorithms currently in Lemur assume that all entries in a judgment file are relevant documents, so you must remove all the entries of judged non-relevant documents. However, the judgment status is recorded in the internal representation of judgments, so that it is possible to distinguish judged relevant documents from judged non-relevant documents in a feedback algorithm.
The format of the judgment file, when used for feedback, must be of three columns, i.e., with the second column removed so that each line has a query id, a document id, and a judgment value. This is to be consistent with the format of a result file. An alternative would be to use the original four-column format directly, but then we would need to add a parameter to distinguish this four-column format from the three-column format of a result file.

Parameters:

index: The complete name of the index table-of-content file for the database index.
smoothSupportFile: The name of the smoothing support file (e.g., one generated by GenerateSmoothSupport).
textQuery: the original query text stream
initQuery: the file with a saved initial query model. When this parameter is set to a non-empty string, the model stored in this file will be used for expansion; otherwise, the original query text is used as the initial query model for expansion.
feedbackDocuments: the file of feedback documents to be used for feedback. In the case of pseudo feedback, this can be a result file generated from an initial retrieval process. In the case of relevance feedback, this is usually a 3-column relevance judgment file. Note that this means you can NOT use a TREC-style judgment file directly; you must remove the second column to convert it to three-column.
TRECResultFormat: whether the feedback document file (given by feedbackDocuments) is of the TREC format (i.e., six-column) or just a simple three-column format. Integer value, zero for non-TREC format, and non-zero for TREC format. Default: 1 (i.e., TREC format). VERY IMPORTANT: For relevance feedback, TRECResultFormat should always be set to 0, since the judgment file is always a simple format.
expandedQuery: the file to store the query clarity scores.
feedbackDocCount: the number of docs to use for pseudo-feedback (0 means no-feedback)
queryUpdateMethod: feedback method (0, 1, 2, 3, 4 for mixture model, divergence minimization, Markov chain, relevance model 1, and relevance model 2 respectively).
Method-specific feedback parameters:
For all interpolation-based approaches (i.e., the new query model is an interpolation of the original model with a (feedback) model computed based on the feedback documents), the following four parameters apply:

feedbackCoefficient: the coefficient of the feedback model for interpolation. The value is in [0,1], with 0 meaning using only the original model (thus no updating/feedback) and 1 meaning using only the feedback model (thus ignoring the original model).
feedbackTermCount: Truncate the feedback model to no more than a given number of words/terms.
feedbackProbThresh: Truncate the feedback model to include only words with a probability higher than this threshold. Default value: 0.001.
feedbackProbSumThresh: Truncate the feedback model until the sum of the probability of the included words reaches this threshold. Default value: 1.

Parameters feedbackTermCount, feedbackProbThresh, and feedbackProbSumThresh work conjunctively to control the truncation, i.e., the truncated model must satisfy all three constraints.
The mixture model, divergence minimization, and Markov chain feedback methods also recognize the parameter feedbackMixtureNoise (default value: 0.5), but with different interpretations.

For the collection mixture model method, feedbackMixtureNoise is the collection model selection probability in the mixture model. That is, with this probability, a word is picked according to the collection language model, when a feedback document is "generated".
For the divergence minimization method, feedbackMixtureNoise means the weight of the divergence from the collection language model. (The higher it is, the farther the estimated model is from the collection model.)
For the Markov chain method, feedbackMixtureNoise is the probability of not stopping, i.e., 1 - alpha, where alpha is the stopping probability while walking through the chain.

In addition, the collection mixture model also recognizes the parameter emIterations, which is the maximum number of iterations the EM algorithm will run. Default: 50. (The EM algorithm can terminate earlier if the log-likelihood converges quickly, where convergence is measured by some hard-coded criterion. See the source code in SimpleKLRetMethod.cpp for details.)

ireval.pl

This is a Perl script that does TREC-style retrieval evaluation. The usage is

ireval.pl -j judgmentfile < resultfile

if the resultfile is of a simple three column format (i.e., queryid, docid, score), or

ireval.pl -j judgmentfile -trec < resultfile

if the resultfile is of the six-column TREC format.
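
The difference between the two result layouts can be sketched with a small line parser. The three-column order follows the description above; the six-column order (query id, iteration tag, document id, rank, score, run tag) is the conventional TREC layout and is stated here as an assumption. The function is hypothetical and unrelated to the internals of ireval.pl.

```python
def parse_result_line(line, trec=False):
    """Return (query_id, doc_id, score) from one line of a result file."""
    fields = line.split()
    if trec:
        # Assumed TREC layout: query id, iteration tag, doc id, rank, score, run tag
        query_id, _iteration, doc_id, _rank, score, _run = fields
    else:
        query_id, doc_id, score = fields
    return query_id, doc_id, float(score)
```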

The Lemur Project
Last modified: Mon Feb 28 17:15:11 EST 2005