
Indri Parameter Files

The Indri applications BuildIndriIndex, IndriDaemon, and IndriRunQuery accept parameters from either the command line or a file. The parameter file uses an XML format whose top-level element is named parameters. The command line uses dotted path notation, so a nested element such as <corpus><path>/path/to/file_or_directory</path></corpus> is written as -corpus.path=/path/to/file_or_directory.
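
The relationship between the two notations can be illustrated with a short Python sketch. This is not part of Indri: the function name args_to_parameters is hypothetical, and the rule it uses to group consecutive arguments such as -corpus.path and -corpus.class into a single <corpus> element is an assumption made for the illustration, not a description of how the Indri applications parse their command line.

```python
import xml.etree.ElementTree as ET


def args_to_parameters(args):
    """Turn '-a.b=value' style arguments into a <parameters> element tree.

    Hypothetical helper for illustration only; not an Indri API.
    """
    root = ET.Element("parameters")
    for arg in args:
        # Split '-corpus.path=/data' into the dotted key and the value.
        key, _, value = arg.lstrip("-").partition("=")
        names = key.split(".")
        node = root
        # Walk the container elements (everything except the last name).
        for name, next_name in zip(names, names[1:]):
            siblings = node.findall(name)
            # Assumption: reuse the most recent container unless it already
            # holds a child named next_name, in which case start a new one.
            if siblings and siblings[-1].find(next_name) is None:
                node = siblings[-1]
            else:
                node = ET.SubElement(node, name)
        leaf = ET.SubElement(node, names[-1])
        leaf.text = value
    return root


args = [
    "-index=/path/to/repository",
    "-corpus.path=/path/to/file_or_directory",
    "-corpus.class=trecweb",
]
print(ET.tostring(args_to_parameters(args), encoding="unicode"))
```

Under that grouping assumption, the three arguments produce one <index> element and one <corpus> element containing both <path> and <class>, all wrapped in <parameters>.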

Repository construction parameters

memory
an integer value specifying the number of bytes to use for the indexing process. The value can include a scaling factor by adding a suffix. Valid suffixes (case insensitive) are K = 1000, M = 1000000, and G = 1000000000, so 100M is equivalent to 100000000. The value should contain only decimal digits and the optional suffix. Specified as <memory>100M</memory> in the parameter file and as -memory=100M on the command line.
index
path to where to place the Indri Repository. Specified as <index>/path/to/repository</index> in the parameter file and as -index=/path/to/repository on the command line.

corpus
a complex element containing parameters related to a corpus. This element can be specified multiple times. The parameters are
path
The pathname of the file or directory containing documents to index. Specified as <corpus><path>/path/to/file_or_directory</path></corpus> in the parameter file and as -corpus.path=/path/to/file_or_directory on the command line.
class
The FileClassEnvironment of the file or directory containing documents to index. Specified as <corpus><class>trecweb</class></corpus> in the parameter file and as -corpus.class=trecweb on the command line. The known classes are:
  • html -- web page data.
  • trecweb -- TREC web format, e.g. terabyte track.
  • trectext -- TREC format, e.g. TREC-3 onward.
  • doc -- Microsoft Word format (windows platform only).
  • ppt -- Microsoft Powerpoint format (windows platform only).
  • pdf -- Adobe PDF format.
  • txt -- Plain text format.
Combining these elements, the parameter file would contain:
<corpus>
  <path>/path/to/file_or_directory</path>
  <class>trecweb</class>
</corpus>
metadata
a complex element containing one or more field entries specifying the metadata fields to index, e.g. DOCNO. Specified as <metadata><field>fieldname</field></metadata> in the parameter file and as -metadata.field=fieldname on the command line.
field
a complex element specifying a field to index as data, e.g. TITLE. This parameter can appear multiple times. The subelements are:
name
the field name, specified as <field><name>fieldname</name></field> in the parameter file and as -field.name=fieldname on the command line.
numeric
the symbol true if the field contains numeric data, otherwise the symbol false, specified as <field><numeric>true</numeric></field> in the parameter file and as -field.numeric=true on the command line. This is an optional parameter, defaulting to false. Note that 0 can be used for false and 1 can be used for true.
stemmer
a complex element specifying the stemming algorithm to use in the subelement name. Valid options are Porter or Krovetz (case insensitive). Specified as <stemmer><name>stemmername</name></stemmer> in the parameter file and as -stemmer.name=stemmername on the command line. This is an optional parameter with the default of no stemming.
stopper
a complex element containing one or more subelements named word, specifying the stopword list to use. Specified as <stopper><word>stopword</word></stopper> in the parameter file and as -stopper.word=stopword on the command line. This is an optional parameter with the default of no stopping.
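
The constraints listed above (the top-level element name, the known corpus classes, the stemmer names, and the true/false/0/1 symbols for numeric) can be checked before running BuildIndriIndex. The following Python sketch is one way to do that. The function name check_index_parameters is hypothetical, the check is not part of Indri, and it assumes that corpus class names are matched exactly as listed; a class outside the known list is only reported, since this page does not say how such a value is handled.

```python
import xml.etree.ElementTree as ET

# Values taken from the parameter descriptions on this page.
KNOWN_CLASSES = {"html", "trecweb", "trectext", "doc", "ppt", "pdf", "txt"}
KNOWN_STEMMERS = {"porter", "krovetz"}          # compared case-insensitively
BOOLEAN_SYMBOLS = {"true", "false", "0", "1"}


def check_index_parameters(xml_text):
    """Return a list of problems found in an indexing parameter file.

    Hypothetical helper for illustration only; not an Indri API.
    """
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "parameters":
        problems.append("top-level element must be <parameters>")
    for corpus in root.findall("corpus"):
        cls = corpus.findtext("class")
        # Assumption: class names are matched exactly as documented.
        if cls is not None and cls.strip() not in KNOWN_CLASSES:
            problems.append(f"unknown corpus class: {cls.strip()}")
    for field in root.findall("field"):
        # numeric is optional and defaults to false.
        numeric = field.findtext("numeric", default="false").strip()
        if numeric not in BOOLEAN_SYMBOLS:
            problems.append(f"bad numeric value: {numeric}")
    stemmer = root.findtext("stemmer/name")
    if stemmer is not None and stemmer.strip().lower() not in KNOWN_STEMMERS:
        problems.append(f"unknown stemmer: {stemmer.strip()}")
    return problems


example = """<parameters>
  <memory>100M</memory>
  <index>/path/to/repository</index>
  <corpus>
    <path>/path/to/file_or_directory</path>
    <class>trecweb</class>
  </corpus>
  <metadata><field>DOCNO</field></metadata>
  <field><name>TITLE</name></field>
  <stemmer><name>krovetz</name></stemmer>
</parameters>"""

print(check_index_parameters(example))  # prints []
```

The example string doubles as a complete indexing parameter file that combines the elements described in this section.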

QueryEnvironment Parameters

Retrieval Parameters

memory
an integer value specifying the number of bytes to use for the query retrieval process. The value can include a scaling factor by adding a suffix. Valid suffixes (case insensitive) are K = 1000, M = 1000000, and G = 1000000000, so 100M is equivalent to 100000000. The value should contain only decimal digits and the optional suffix. Specified as <memory>100M</memory> in the parameter file and as -memory=100M on the command line.
index
path to an Indri Repository. Specified as <index>/path/to/repository</index> in the parameter file and as -index=/path/to/repository on the command line. This element can be specified multiple times to combine Repositories.
server
hostname of a host running an Indri server (IndriDaemon). Specified as <server>hostname</server> in the parameter file and as -server=hostname on the command line. The hostname can include an optional port number to connect to, using the form hostname:portnum. This element can be specified multiple times to combine servers.
count
an integer value specifying the maximum number of results to return for a given query. Specified as <count>number</count> in the parameter file and as -count=number on the command line.
rule
specifies the smoothing rule (TermScoreFunction) to apply. Format of the rule is:

( key ":" value ) [ "," key ":" value ]*

Here's an example rule in command line format:

-rule=method:linear,collectionLambda:0.2,field:title

and in parameter file format:
<rule>method:linear,collectionLambda:0.2,field:title</rule>

This corresponds to Jelinek-Mercer smoothing with background lambda equal to 0.2, only for items in a title field.

If nothing is listed for a key, all values are assumed. So, a rule that does not specify a field matches all fields. This makes -rule=method:linear,collectionLambda:0.2 a valid rule.

Valid keys:

method
smoothing method (text)
field
field to apply this rule to
operator
type of item in query to apply to { term, window }

Valid methods:

dirichlet
(also 'd', 'dir') (default mu=2500)
jelinek-mercer
(also 'jm', 'linear') (default collectionLambda=0.4, documentLambda=0.0). collectionLambda can also be written as "lambda"; either will work.
twostage
(also 'two-stage', 'two') (default mu=2500, lambda=0.4)
If the rule doesn't parse correctly, the default is Dirichlet, mu=2500.
stopper
a complex element containing one or more subelements named word, specifying the stopword list to use. Specified as <stopper><word>stopword</word></stopper> in the parameter file and as -stopper.word=stopword on the command line. This is an optional parameter with the default of no stopping.
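
To make the rule format described above concrete, here is a minimal Python sketch that reads a rule string into a dictionary. It is not Indri's parser: the function name parse_rule and the table names are hypothetical, and the sketch assumes that a rule with no method key counts as a rule that does not parse, so it falls back to Dirichlet with mu=2500. The method aliases and default values are the ones listed under "Valid methods".

```python
# Aliases and defaults as listed under "Valid methods" on this page.
METHOD_ALIASES = {
    "dirichlet": "dirichlet", "d": "dirichlet", "dir": "dirichlet",
    "jelinek-mercer": "jelinek-mercer", "jm": "jelinek-mercer",
    "linear": "jelinek-mercer",
    "twostage": "twostage", "two-stage": "twostage", "two": "twostage",
}
METHOD_DEFAULTS = {
    "dirichlet": {"mu": 2500.0},
    "jelinek-mercer": {"collectionLambda": 0.4, "documentLambda": 0.0},
    "twostage": {"mu": 2500.0, "lambda": 0.4},
}
FALLBACK_RULE = {"method": "dirichlet", "mu": 2500.0}


def parse_rule(text):
    """Parse 'key:value,key:value' into a dict (illustrative, not Indri code)."""
    try:
        pairs = {}
        for item in text.split(","):
            key, sep, value = item.partition(":")
            if not sep or not key.strip() or not value.strip():
                raise ValueError(f"malformed pair: {item!r}")
            pairs[key.strip()] = value.strip()
        # Assumption: a missing or unknown method means the rule is invalid.
        method = METHOD_ALIASES[pairs.pop("method").lower()]
        rule = {"method": method, **METHOD_DEFAULTS[method]}
        for key, value in pairs.items():
            # "lambda" is accepted as another name for collectionLambda.
            if method == "jelinek-mercer" and key == "lambda":
                key = "collectionLambda"
            # Smoothing parameters are numeric; field and operator stay text.
            rule[key] = float(value) if key in METHOD_DEFAULTS[method] else value
        return rule
    except (KeyError, ValueError):
        # A rule that does not parse falls back to Dirichlet, mu=2500.
        return dict(FALLBACK_RULE)


print(parse_rule("method:linear,collectionLambda:0.2,field:title"))
```

For the example rule from this section, the result names the method jelinek-mercer with collectionLambda 0.2, the default documentLambda 0.0, and the field title.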

Formatting Parameters

queryOffset
an integer value specifying one less than the starting query number, e.g. 150 when the first query is numbered 151, as used for TREC formatted output. Specified as <queryOffset>number</queryOffset> in the parameter file and as -queryOffset=number on the command line.
runID
a string specifying the id for a query run, used in TREC scorable output. Specified as <runID>someID</runID> in the parameter file and as -runID=someID on the command line.
trecFormat
the symbol true to produce TREC scorable output, otherwise the symbol false. Specified as <trecFormat>true</trecFormat> in the parameter file and as -trecFormat=true on the command line. Note that 0 can be used for false, and 1 can be used for true.

Pseudo-Relevance Feedback Parameters

fbDocs
an integer specifying the number of documents to use for feedback. Specified as <fbDocs>number</fbDocs> in the parameter file and as -fbDocs=number on the command line.
fbTerms
an integer specifying the number of terms to use for feedback. Specified as <fbTerms>number</fbTerms> in the parameter file and as -fbTerms=number on the command line.
fbMu
a floating point value specifying the value of mu to use for feedback. [NB: document the feedback formulae]. Specified as <fbMu>number</fbMu> in the parameter file and as -fbMu=number on the command line.
fbOrigWeight
a floating point value in the range [0.0..1.0] specifying the weight for the original query in the expanded query. Specified as <fbOrigWeight>number</fbOrigWeight> in the parameter file and as -fbOrigWeight=number on the command line.

IndriDaemon Parameters

memory
an integer value specifying the number of bytes to use for the query retrieval process. The value can include a scaling factor by adding a suffix. Valid suffixes (case insensitive) are K = 1000, M = 1000000, and G = 1000000000, so 100M is equivalent to 100000000. The value should contain only decimal digits and the optional suffix. Specified as <memory>100M</memory> in the parameter file and as -memory=100M on the command line.
index
path to the Indri Repository to act as server for. Specified as <index>/path/to/repository</index> in the parameter file and as -index=/path/to/repository on the command line.
port
an integer value specifying the port number to use. Specified as <port>number</port> in the parameter file and as -port=number on the command line.
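
The memory parameter appears in all three parameter groups with the same suffix rule. As a worked illustration of that rule, the Python sketch below converts a value such as 100M into a byte count. The function name parse_memory is hypothetical and the code is not taken from Indri; it only encodes the rule as stated on this page (decimal digits, an optional case-insensitive K, M, or G suffix, and decimal rather than binary multipliers).

```python
import re

# Multipliers as documented: decimal, not powers of two.
SCALE = {"": 1, "k": 1000, "m": 1000000, "g": 1000000000}
_MEMORY_PATTERN = re.compile(r"([0-9]+)([kmg]?)", re.IGNORECASE)


def parse_memory(text):
    """Convert a value like '100M' into a number of bytes (illustrative)."""
    match = _MEMORY_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a valid memory value: {text!r}")
    digits, suffix = match.groups()
    return int(digits) * SCALE[suffix.lower()]


print(parse_memory("100M"))  # prints 100000000
```

Values with anything other than digits and one optional suffix, such as 1.5G or 100MB, are rejected by this sketch because the description allows only decimal digits and the suffix.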


Generated on Wed Nov 3 13:00:02 2004 for Lemur Toolkit by doxygen 1.2.18