Lemur Indexing Applications


Contents

  1. BuildIndex
  2. BuildDocMgr
  3. BuildPropIndex
  4. IndriBuildIndex
  5. PassageIndexer
  6. IncIndexer
  7. IncPassageIndexer


1. BuildIndex

This application builds an Inv(FP)Index, KeyfileIncIndex, or IndriIndex for a collection of documents.

To use it, follow the general steps of running a Lemur application.

The parameters are:

  1. index: name of the index table-of-content file without the .ifp extension. Use the full path so that the index can be used later from other directories, e.g. /lemur/indexes/myindex
  2. indexType: the type of the index you want to build
    • inv for Inv (.inv) or InvFP (.ifp)
    • key for KeyfileIncIndex (.key)
    • indri for IndriIndex (.ind)
  3. memory: memory (in bytes) of Inv(FP)PushIndex (def = 96000000).
  4. position: store position information (def = 1), applicable only for inv indexes. Keyfile and Indri always store positions.
  5. stopwords: name of file containing the stopword list.
  6. acronyms: name of file containing the acronym list, currently not supported by IndriIndex. These acronyms will still be indexed in lowercase by IndriIndex.
  7. countStopWords: If true, count stopwords in document length.
  8. docFormat:
    • trec for standard TREC formatted documents
    • web for web TREC formatted documents
    • chinese for segmented Chinese text (TREC format, GB encoding)
    • chinesechar for unsegmented Chinese text (TREC format, GB encoding)
    • arabic for Arabic text (TREC format, Windows CP1256 encoding)
  9. stemmer:
    • porter Porter stemmer.
    • krovetz Krovetz stemmer, requires additional parameters
      1. KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
    • arabic Arabic stemmer, requires additional parameters
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
  10. dataFiles: name of file containing list of datafiles to index.
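
The relationship between indexType and the table-of-content extensions listed above can be sketched as a small lookup. This is only an illustration: candidate_toc_files and TOC_EXTENSIONS are hypothetical names, not part of Lemur, and the sketch does not try to guess which of the two inv extensions a given run produces.

```python
# Hypothetical helper (not part of Lemur): maps the indexType values
# documented above to the table-of-content extensions they mention.
TOC_EXTENSIONS = {
    "inv": (".inv", ".ifp"),  # Inv or InvFP
    "key": (".key",),         # KeyfileIncIndex
    "indri": (".ind",),       # IndriIndex
}


def candidate_toc_files(index, index_type):
    """Return the TOC file names that may exist for an index base name."""
    try:
        extensions = TOC_EXTENSIONS[index_type]
    except KeyError:
        raise ValueError(f"unknown indexType: {index_type!r}")
    return [index + ext for ext in extensions]
```

For example, candidate_toc_files("/lemur/indexes/myindex", "key") yields a single name ending in .key, which is the file a retrieval application would later be pointed at.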

2. BuildDocMgr

BuildDocMgr builds a document manager. A DocumentManager is necessary for later retrieval of the original documents in an index. If an index name is provided, an inverted index is built at the same time.

Summary of required parameters:

  1. manager: required; name of the document manager (without extension)
  2. managerType: required; name of the document manager type, one of flat (FlatfileDocMgr), bdm (KeyfileDocMgr), or elem (ElemDocMgr)
  3. docFormat:
    • trec for standard TREC formatted documents
    • web for web TREC formatted documents
    • chinese for segmented Chinese text (TREC format, GB encoding)
    • chinesechar for unsegmented Chinese text (TREC format, GB encoding)
    • arabic for Arabic text (TREC format, Windows CP1256 encoding)
  4. dataFiles: name of file containing the list of datafile names (one line per datafile name, use full path)
The following parameters are optional for building an index:
  1. index: name of the index table-of-content file without any extension. Use the full path so that the index can be used later from other directories, e.g. /lemur/indexes/myindex
  2. indexType: the type of index to create, "key" (KeyfileIncIndex) or "inv" (Inv(FP)Index). Default is inv.
  3. memory: memory (in bytes) of Inv(FP)PushIndex (def = 96000000).
  4. position: store position information (def = 1).
  5. stopwords: name of file containing the stopword list. Words in this file should be one per line. If this parameter is not specified, all words are indexed.
  6. acronyms: name of file containing the acronym list.
  7. countStopWords: If true, count stopwords in document length.
  8. stemmer:
    • porter Porter stemmer.
    • krovetz Krovetz stemmer, requires additional parameters
      1. KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
    • arabic Arabic stemmer, requires additional parameters
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
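
The dataFiles parameter above expects a file with one full datafile path per line. A minimal sketch of producing such a list from a directory of documents follows; write_datafiles_list is a hypothetical helper, and it assumes every regular file under the directory is a datafile to index.

```python
import os


def write_datafiles_list(corpus_dir, list_path):
    """Write one absolute path per line for every file under corpus_dir.

    Assumption: all regular files below corpus_dir are datafiles.
    Returns the sorted list of paths that was written.
    """
    paths = []
    for root, _dirs, files in os.walk(corpus_dir):
        for name in files:
            paths.append(os.path.abspath(os.path.join(root, name)))
    paths.sort()
    with open(list_path, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(path + "\n")
    return paths
```

The resulting list file is what would be named in the dataFiles parameter. Write it outside the corpus directory so that a later run does not pick up the list itself as a datafile.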

3. BuildPropIndex

This application builds an InvFPIndex for a collection of documents with properties associated with terms.

Usage: BuildPropIndex paramfile [datfile1]* [datfile2]* ...

* data files can be specified on the command line OR in a metafile specified as the dataFiles parameter

The parameters are:

  1. index: name of the index to create (don't include extension)
  2. indexType: the type of index to create, "key" (KeyfileIncIndex) or "inv" (InvFPIndex). Default is inv.
  3. memory: memory (in bytes) of InvFPPushIndex cache (def = 96000000).
  4. stopwords: name of file containing the stopword list.
  5. acronyms: name of file containing the acronym list.
  6. countStopWords: If true, count stopwords in document length.
  7. docFormat:
    • "brill" for documents with Brill's part of speech tags, still needs DOC separators between documents similar to Lemur's WebParser. This is the default.
    • "identifinder" for documents with Identifinder's named entity tags, still needs DOC separators between documents similar to Lemur's WebParser.
  8. stemmer:
    • "porter" Porter stemmer.
    • "krovetz" Krovetz stemmer, requires additional parameters
      1. KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
    • "arabic" Arabic stemmer, requires additional parameters
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
  9. dataFiles: name of file containing list of datafiles to index.

4. IndriBuildIndex

This application builds an Indri Repository for a collection of documents. The Indri applications, IndriBuildIndex, IndriDaemon, and IndriRunQuery, accept parameters from either the command line or from a file. The parameter file uses an XML format. The command line uses dotted path notation. The top level element in the parameters file is named parameters.

Repository construction parameters

memory
an integer value specifying the number of bytes to use for the indexing process. The value can include a scaling factor by adding a suffix. Valid values are (case insensitive) K = 1000, M = 1000000, G = 1000000000. So 100M would be equivalent to 100000000. The value should contain only decimal digits and the optional suffix. Specified as <memory>100M</memory> in the parameter file and as -memory=100M on the command line.
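
The suffix rule just described can be illustrated with a short sketch. parse_memory is a hypothetical function written for this page, not Indri's own parser; it only mirrors the stated rule (decimal digits plus an optional case-insensitive K, M, or G).

```python
import re

# Decimal scaling factors as documented above (not powers of 1024).
_SCALE = {"": 1, "K": 1000, "M": 1000000, "G": 1000000000}


def parse_memory(value):
    """Convert a memory value such as '100M' to a number of bytes."""
    match = re.fullmatch(r"([0-9]+)([KkMmGg]?)", value.strip())
    if match is None:
        raise ValueError(f"invalid memory value: {value!r}")
    digits, suffix = match.groups()
    return int(digits) * _SCALE[suffix.upper()]
```

With this sketch, parse_memory("100M") returns 100000000, matching the equivalence given above.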
index
path to where to place the Indri Repository. Specified as <index>/path/to/repository</index> in the parameter file and as -index=/path/to/repository on the command line.
corpus
a complex element containing parameters related to a corpus. This element can be specified multiple times. The parameters are
path
The pathname of the file or directory containing documents to index. Specified as <corpus><path>/path/to/file_or_directory</path></corpus> in the parameter file and as -corpus.path=/path/to/file_or_directory on the command line.
class
The FileClassEnvironment of the file or directory containing documents to index. Specified as <corpus><class>trecweb</class></corpus> in the parameter file and as -corpus.class=trecweb on the command line. The known classes are:
  • html -- web page data.
  • trecweb -- TREC web format, e.g. terabyte track.
  • trectext -- TREC format, e.g. TREC-3 onward.
  • doc -- Microsoft Word format (Windows platform only).
  • ppt -- Microsoft PowerPoint format (Windows platform only).
  • pdf -- Adobe PDF format.
  • txt -- Plain text format.
Combining each of these elements, the parameter file would contain:
<corpus>
  <path>/path/to/file_or_directory</path>
  <class>trecweb</class>
</corpus>
metadata
a complex element containing one or more field entries specifying the metadata fields to index, e.g. DOCNO. Specified as <metadata><field>fieldname</field></metadata> in the parameter file and as -metadata.field=fieldname on the command line.
field
a complex element specifying the fields to index as data, e.g. TITLE. This parameter can appear multiple times. The subelements are:
name
the field name, specified as <field><name>fieldname</name></field> in the parameter file and as -field.name=fieldname on the command line.
numeric
integer value of 1 if the field contains numeric data, otherwise 0, specified as <field><numeric>0</numeric></field> in the parameter file and as -field.numeric=0 on the command line. This is an optional parameter, defaulting to 0.
stemmer
a complex element specifying the stemming algorithm to use in the subelement name. Valid options are Porter or Krovetz (case insensitive). Specified as <stemmer><name>stemmername</name></stemmer> in the parameter file and as -stemmer.name=stemmername on the command line. This is an optional parameter with the default of no stemming.
stopper
a complex element containing one or more subelements named word, specifying the stopword list to use. Specified as <stopper><word>stopword</word></stopper> in the parameter file and as -stopper.word=stopword on the command line. This is an optional parameter with the default of no stopping.
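
The elements described in this section can be assembled into a complete parameter file. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; build_parameters and its argument names are hypothetical, and the element names are taken only from the descriptions above.

```python
import xml.etree.ElementTree as ET


def build_parameters(index, corpus_path, corpus_class, memory="100M",
                     metadata_fields=(), fields=(), stemmer=None,
                     stopwords=()):
    """Return the text of an IndriBuildIndex parameter file (a sketch)."""
    root = ET.Element("parameters")  # documented top-level element
    ET.SubElement(root, "memory").text = memory
    ET.SubElement(root, "index").text = index

    corpus = ET.SubElement(root, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path
    ET.SubElement(corpus, "class").text = corpus_class

    if metadata_fields:
        metadata = ET.SubElement(root, "metadata")
        for name in metadata_fields:
            ET.SubElement(metadata, "field").text = name

    # Each data field is its own <field> element with a <name> subelement.
    for name in fields:
        field = ET.SubElement(root, "field")
        ET.SubElement(field, "name").text = name

    if stemmer:
        ET.SubElement(ET.SubElement(root, "stemmer"), "name").text = stemmer

    if stopwords:
        stopper = ET.SubElement(root, "stopper")
        for word in stopwords:
            ET.SubElement(stopper, "word").text = word

    return ET.tostring(root, encoding="unicode")
```

Calling build_parameters("/path/to/repository", "/path/to/file_or_directory", "trecweb", metadata_fields=["DOCNO"], stemmer="krovetz") produces a single-line XML string that can be saved to a file and passed to IndriBuildIndex. The sketch handles one corpus element only; since corpus may be specified multiple times, a fuller version would accept a list of path and class pairs.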

5. PassageIndexer

This application builds an FP passage index for a collection of documents. Documents are segmented into passages of size passageSize with an overlap of passageSize/2 terms per passage.
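
The segmentation described above amounts to a sliding window that advances by half a passage. A minimal sketch follows; split_into_passages is a hypothetical function, and two details are assumptions rather than documented behavior: integer division for an odd passageSize, and keeping a final passage that is shorter than passageSize.

```python
def split_into_passages(terms, passage_size):
    """Split a term list into passages overlapping by passage_size // 2."""
    if passage_size < 2:
        raise ValueError("passage_size must be at least 2")
    step = passage_size // 2  # assumption: integer division for odd sizes
    passages = []
    for start in range(0, len(terms), step):
        passages.append(terms[start:start + passage_size])
        # Stop once a passage reaches the end of the document.
        if start + passage_size >= len(terms):
            break
    return passages
```

For a ten-term document and a passageSize of 4, the sketch yields four passages starting at terms 0, 2, 4, and 6, so each passage shares two terms with its neighbor.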

To use it, follow the general steps of running a Lemur application.

The parameters are:

  1. index: name of the index table-of-content file without the .ifp extension.
  2. memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
  3. stopwords: name of file containing the stopword list.
  4. acronyms: name of file containing the acronym list.
  5. countStopWords: If true, count stopwords in document length.
  6. docFormat:
    • trec for standard TREC formatted documents
    • web for web TREC formatted documents
    • chinese for segmented Chinese text (TREC format, GB encoding)
    • chinesechar for unsegmented Chinese text (TREC format, GB encoding)
    • arabic for Arabic text (TREC format, Windows CP1256 encoding)
  7. stemmer:
    • porter Porter stemmer.
    • krovetz Krovetz stemmer, requires additional parameters
      1. KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
    • arabic Arabic stemmer, requires additional parameters
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
  8. dataFiles: name of file containing list of datafiles to index.
  9. passageSize: Number of terms per passage.

6. IncIndexer

This application builds an FP index for a collection of documents. If the index already exists, new documents are added to that index, otherwise a new index is created.

To use it, follow the general steps of running a Lemur application.

The parameters are:

  1. index: name of the index table-of-content file without the .ifp extension.
  2. memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
  3. stopwords: name of file containing the stopword list.
  4. acronyms: name of file containing the acronym list.
  5. countStopWords: If true, count stopwords in document length.
  6. docFormat:
    • trec for standard TREC formatted documents
    • web for web TREC formatted documents
    • chinese for segmented Chinese text (TREC format, GB encoding)
    • chinesechar for unsegmented Chinese text (TREC format, GB encoding)
    • arabic for Arabic text (TREC format, Windows CP1256 encoding)
  7. stemmer:
    • porter Porter stemmer.
    • krovetz Krovetz stemmer, requires additional parameters
      1. KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
    • arabic Arabic stemmer, requires additional parameters
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
  8. dataFiles: name of file containing list of datafiles to index.

7. IncPassageIndexer

This application builds an FP passage index for a collection of documents. If the index already exists, new documents are added to that index, otherwise a new index is created. Documents are segmented into passages of size passageSize with an overlap of passageSize/2 terms per passage.

To use it, follow the general steps of running a Lemur application.

The parameters are:

  1. index: name of the index table-of-content file without the .ifp extension.
  2. memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
  3. stopwords: name of file containing the stopword list.
  4. acronyms: name of file containing the acronym list.
  5. countStopWords: If true, count stopwords in document length.
  6. docFormat:
    • trec for standard TREC formatted documents
    • web for web TREC formatted documents
    • chinese for segmented Chinese text (TREC format, GB encoding)
    • chinesechar for unsegmented Chinese text (TREC format, GB encoding)
    • arabic for Arabic text (TREC format, Windows CP1256 encoding)
  7. stemmer:
    • porter Porter stemmer.
    • krovetz Krovetz stemmer, requires additional parameters
      1. KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
    • arabic Arabic stemmer, requires additional parameters
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
  8. dataFiles: name of file containing list of datafiles to index.
  9. passageSize: Number of terms per passage.

The Lemur Project
Last modified: Tue Nov 2 11:38:07 EST 2004