Contents
1. BuildIndex
This application builds an Inv(FP)Index, KeyfileIncIndex, or IndriIndex for a collection of documents.
To use it, follow the general steps of running a lemur application.
The parameters are:
- index: name of the index table-of-content file without the .ifp extension. Use the full path here so that the index can be used later from other directories, e.g. /lemur/indexes/myindex
- indexType: the type of the index you want to build
- inv for Inv (.inv) or InvFP (.ifp)
- key for KeyfileIncIndex (.key)
- indri for IndriIndex (.ind)
- memory: memory (in bytes) of Inv(FP)PushIndex (def = 96000000).
- position: store position information (def = 1), applicable only for inv indexes. Keyfile and Indri always store positions.
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list, currently not supported by IndriIndex. These acronyms will still be indexed in lowercase by IndriIndex.
- countStopWords: If true, count stopwords in document length.
- docFormat:
- trec for standard TREC formatted documents
- web for web TREC formatted documents
- chinese for segmented Chinese text (TREC format, GB encoding)
- chinesechar for unsegmented Chinese text (TREC format, GB encoding)
- arabic for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
- porter Porter stemmer.
- krovetz Krovetz stemmer, requires additional parameters
- KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
- arabic Arabic stemmer, requires additional parameters
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
- dataFiles: name of file containing list of datafiles to index.
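The parameters above are normally collected in a parameter file that is passed to the application. The following is a minimal sketch of generating such a file, assuming the classic Lemur `name = value;` line syntax, which this section does not itself spell out. The helper `render_params` and the stopword and datafile-list paths are hypothetical and only illustrate the shape of the file.

```python
def render_params(params):
    """Render a dict as 'name = value;' lines, one parameter per line.

    Assumption: the application reads parameters in this line format.
    """
    lines = []
    for name, value in params.items():
        # Booleans are written as lowercase true/false rather than Python's True/False.
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{name} = {value};")
    return "\n".join(lines) + "\n"


# Parameter names come from the list above; the values are illustrative only.
build_index_params = {
    "index": "/lemur/indexes/myindex",
    "indexType": "key",
    "memory": 96000000,
    "stopwords": "/lemur/data/stopwords.txt",    # hypothetical path
    "countStopWords": False,
    "docFormat": "trec",
    "stemmer": "porter",
    "dataFiles": "/lemur/data/datafiles.list",   # hypothetical path
}

text = render_params(build_index_params)
print(text)
```

Writing `text` to a file and passing that file name to BuildIndex would follow the general steps of running a lemur application.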
2. BuildDocMgr
BuildDocMgr builds a document manager. A DocumentManager is necessary for later retrieval of the original documents in an index. It builds an inverted index simultaneously if an index name is provided.
Summary of required parameters:
- manager: required name of the document manager (without extension)
- managerType: required name of the document manager type, one of flat (FlatfileDocMgr), bdm (KeyfileDocMgr), or elem (ElemDocMgr)
- docFormat:
- trec for standard TREC formatted documents
- web for web TREC formatted documents
- chinese for segmented Chinese text (TREC format, GB encoding)
- chinesechar for unsegmented Chinese text (TREC format, GB encoding)
- arabic for Arabic text (TREC format, Windows CP1256 encoding)
- dataFiles: name of file containing the list of datafile names (one line per datafile name, use full path)
The following parameters are optional for building an index:
- index: name of the index table-of-content file without any extension. Use the full path here so that the index can be used later from other directories, e.g. /lemur/indexes/myindex
- indexType: the type of index to create, "key" (KeyfileIncIndex) or "inv" (Inv(FP)Index). Default is inv.
- memory: memory (in bytes) of Inv(FP)PushIndex (def = 96000000).
- position: store position information (def = 1).
- stopwords: name of file containing the stopword list. Words in this file should be one per line. If this parameter is not specified, all words are indexed.
- acronyms: name of file containing the acronym list.
- countStopWords: If true, count stopwords in document length.
- stemmer:
- porter Porter stemmer.
- krovetz Krovetz stemmer, requires additional parameters
- KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
- arabic Arabic stemmer, requires additional parameters
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
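The dataFiles parameter expects a file that lists one datafile name per line, using full paths. A minimal sketch of producing such a list from a directory of documents follows. The function name `write_datafile_list` and the idea of listing every regular file in one directory are assumptions made for illustration, not part of the toolkit.

```python
import os
import tempfile


def write_datafile_list(doc_dir, list_path):
    """Write the absolute path of each regular file in doc_dir to list_path.

    One path per line, sorted by name. Subdirectories are skipped.
    """
    names = sorted(
        os.path.abspath(os.path.join(doc_dir, entry))
        for entry in os.listdir(doc_dir)
        if os.path.isfile(os.path.join(doc_dir, entry))
    )
    with open(list_path, "w", encoding="utf-8") as out:
        for name in names:
            out.write(name + "\n")
    return names


if __name__ == "__main__":
    # Small demonstration using throwaway directories.
    docs = tempfile.mkdtemp()
    for entry in ("part1.trec", "part2.trec"):
        open(os.path.join(docs, entry), "w").close()
    listing = os.path.join(tempfile.mkdtemp(), "datafiles.list")
    for path in write_datafile_list(docs, listing):
        print(path)
```

The resulting list file is what the dataFiles parameter would point to.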
3. BuildPropIndex
This application builds an InvFPIndex for a collection of documents with properties associated with terms.
Usage: BuildPropIndex paramfile [datfile1]* [datfile2]* ...
* data files can be specified on the command line OR in a metafile specified as the dataFiles parameter
The parameters are:
- index: name of the index to create (don't include extension)
- indexType: the type of index to create, "key" (KeyfileIncIndex) or "inv" (InvFPIndex). Default is inv.
- memory: memory (in bytes) of InvFPPushIndex cache (def = 96000000).
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list.
- countStopWords: If true, count stopwords in document length.
- docFormat:
- "brill" for documents with Brill's part of speech tags, still needs DOC separators between documents similar to Lemur's WebParser. This is the default.
- "identifinder" for documents with Identifinder's named entity tags, still needs DOC separators between documents similar to Lemur's WebParser.
- stemmer:
- "porter" Porter stemmer.
- "krovetz" Krovetz stemmer, requires additional parameters
- KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
- "arabic" Arabic stemmer, requires additional parameters
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
- dataFiles: name of file containing list of datafiles to index.
4. IndriBuildIndex
This application builds an Indri Repository for a collection of documents. The indri applications, IndriBuildIndex, IndriDaemon, and IndriRunQuery, accept parameters from either the command line or from a file. The parameter file uses an XML format. The command line uses dotted path notation. The top level element in the parameters file is named parameters.
Repository construction parameters:
- memory
- an integer value specifying the number of bytes to use for the indexing process. The value can include a scaling factor by adding a suffix. Valid values are (case insensitive) K = 1000, M = 1000000, G = 1000000000. So 100M would be equivalent to 100000000. The value should contain only decimal digits and the optional suffix. Specified as <memory>100M</memory> in the parameter file and as -memory=100M on the command line.
- index
- path to where to place the Indri Repository. Specified as <index>/path/to/repository</index> in the parameter file and as -index=/path/to/repository on the command line.
- corpus
- a complex element containing parameters related to a corpus. This element can be specified multiple times. The parameters are
- path
- The pathname of the file or directory containing documents to index. Specified as <corpus><path>/path/to/file_or_directory</path></corpus> in the parameter file and as -corpus.path=/path/to/file_or_directory on the command line.
- class
- The FileClassEnvironment of the file or directory containing documents to index. Specified as <corpus><class>trecweb</class></corpus> in the parameter file and as -corpus.class=trecweb on the command line. The known classes are:
- html -- web page data.
- trecweb -- TREC web format, e.g. terabyte track.
- trectext -- TREC format, e.g. TREC-3 onward.
- doc -- Microsoft Word format (windows platform only).
- ppt -- Microsoft Powerpoint format (windows platform only).
- pdf -- Adobe PDF format.
- txt -- Plain text format.
Combining each of these elements, the parameter file would contain:
<corpus>
  <path>/path/to/file_or_directory</path>
  <class>trecweb</class>
</corpus>
- metadata
- a complex element containing one or more field entries specifying the metadata fields to index, e.g. DOCNO. Specified as <metadata><field>fieldname</field></metadata> in the parameter file and as -metadata.field=fieldname on the command line.
- field
- a complex element specifying the fields to index as data, e.g. TITLE. This parameter can appear multiple times. The subelements are:
- name
- the field name, specified as <field><name>fieldname</name></field> in the parameter file and as -field.name=fieldname on the command line.
- numeric
- integer value of 1 if the field contains numeric data, otherwise 0, specified as <field><numeric>0</numeric></field> in the parameter file and as -field.numeric=0 on the command line. This is an optional parameter, defaulting to 0.
- stemmer
- a complex element specifying the stemming algorithm to use in the subelement name. Valid options are Porter or Krovetz (case insensitive). Specified as <stemmer><name>stemmername</name></stemmer> and as -stemmer.name=stemmername on the command line. This is an optional parameter with the default of no stemming.
- stopper
- a complex element containing one or more subelements named word, specifying the stopword list to use. Specified as <stopper><word>stopword</word></stopper> and as -stopper.word=stopword on the command line. This is an optional parameter with the default of no stopping.
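The relationship between the dotted command-line notation and the XML parameter file, and the memory scaling suffixes, can be illustrated with a short sketch. It is not Indri's own parser: the function names `parse_memory` and `options_to_xml` are hypothetical, and the sketch places every dotted option into the first container element with a matching name, so it does not model elements such as corpus or field that may be specified multiple times.

```python
import re
import xml.etree.ElementTree as ET

# Scaling factors as described for the memory parameter (case insensitive).
_SCALE = {"K": 1000, "M": 1000000, "G": 1000000000}


def parse_memory(value):
    """Convert a value such as '100M' into a number of bytes."""
    match = re.fullmatch(r"([0-9]+)([KkMmGg]?)", value.strip())
    if match is None:
        raise ValueError(f"not a valid memory value: {value!r}")
    digits, suffix = match.groups()
    return int(digits) * _SCALE.get(suffix.upper(), 1)


def options_to_xml(options):
    """Build a <parameters> tree from options like '-corpus.class=trecweb'.

    Simplification: an existing container (e.g. <corpus>) is reused, so
    repeated containers are not supported by this sketch.
    """
    root = ET.Element("parameters")
    for option in options:
        path, _, value = option.lstrip("-").partition("=")
        parts = path.split(".")
        node = root
        for name in parts[:-1]:
            child = node.find(name)
            if child is None:
                child = ET.SubElement(node, name)
            node = child
        ET.SubElement(node, parts[-1]).text = value
    return root


if __name__ == "__main__":
    tree = options_to_xml([
        "-memory=100M",
        "-index=/path/to/repository",
        "-corpus.path=/path/to/file_or_directory",
        "-corpus.class=trecweb",
    ])
    print(ET.tostring(tree, encoding="unicode"))
    print(parse_memory("100M"))
```

Each dot in an option name corresponds to one level of element nesting beneath the top level parameters element.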
5. PassageIndexer
This application builds an FP passage index for a collection of documents. Documents are segmented into passages of size passageSize with an overlap of passageSize/2 terms per passage.
To use it, follow the general steps of running a lemur application.
The parameters are:
- index: name of the index table-of-content file without the .ifp extension.
- memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list.
- countStopWords: If true, count stopwords in document length.
- docFormat:
- trec for standard TREC formatted documents
- web for web TREC formatted documents
- chinese for segmented Chinese text (TREC format, GB encoding)
- chinesechar for unsegmented Chinese text (TREC format, GB encoding)
- arabic for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
- porter Porter stemmer.
- krovetz Krovetz stemmer, requires additional parameters
- KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
- arabic Arabic stemmer, requires additional parameters
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
- dataFiles: name of file containing list of datafiles to index.
- passageSize: Number of terms per passage.
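The passage segmentation described for this application, windows of passageSize terms that overlap by passageSize/2, can be sketched as a sliding window. This is an illustration of the idea only, not the toolkit's implementation. The use of integer division for odd sizes and the handling of the final, possibly shorter passage are assumptions, and the name `split_passages` is hypothetical.

```python
def split_passages(terms, passage_size):
    """Split a list of terms into overlapping passages.

    Each passage holds up to passage_size terms and starts passage_size // 2
    terms after the previous one, so neighbours share half of their terms.
    """
    if passage_size < 2:
        raise ValueError("passage_size must be at least 2")
    step = passage_size // 2
    passages = []
    for start in range(0, len(terms), step):
        passages.append(terms[start:start + passage_size])
        # Stop once a passage has reached the end of the document.
        if start + passage_size >= len(terms):
            break
    return passages


if __name__ == "__main__":
    document = "a b c d e f g h i j".split()
    for passage in split_passages(document, 4):
        print(" ".join(passage))
```

With a ten-term document and passageSize of 4, this sketch yields four passages, each sharing two terms with its neighbour.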
6. IncIndexer
This application builds an FP index for a collection of documents. If the index already exists, new documents are added to that index, otherwise a new index is created.
To use it, follow the general steps of running a lemur application.
The parameters are:
- index: name of the index table-of-content file without the .ifp extension.
- memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list.
- countStopWords: If true, count stopwords in document length.
- docFormat:
- trec for standard TREC formatted documents
- web for web TREC formatted documents
- chinese for segmented Chinese text (TREC format, GB encoding)
- chinesechar for unsegmented Chinese text (TREC format, GB encoding)
- arabic for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
- porter Porter stemmer.
- krovetz Krovetz stemmer, requires additional parameters
- KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
- arabic Arabic stemmer, requires additional parameters
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
- dataFiles: name of file containing list of datafiles to index.
7. IncPassageIndexer
This application builds an FP passage index for a collection of documents. If the index already exists, new documents are added to that index, otherwise a new index is created. Documents are segmented into passages of size passageSize with an overlap of passageSize/2 terms per passage.
To use it, follow the general steps of running a lemur application.
The parameters are:
- index: name of the index table-of-content file without the .ifp extension.
- memory: memory (in bytes) of InvFPPushIndex (def = 96000000).
- stopwords: name of file containing the stopword list.
- acronyms: name of file containing the acronym list.
- countStopWords: If true, count stopwords in document length.
- docFormat:
- trec for standard TREC formatted documents
- web for web TREC formatted documents
- chinese for segmented Chinese text (TREC format, GB encoding)
- chinesechar for unsegmented Chinese text (TREC format, GB encoding)
- arabic for Arabic text (TREC format, Windows CP1256 encoding)
- stemmer:
- porter Porter stemmer.
- krovetz Krovetz stemmer, requires additional parameters
- KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
- arabic Arabic stemmer, requires additional parameters
- arabicStemDir: Path to directory of data files used by the Arabic stemmers.
- arabicStemFunc: Which stemming algorithm to apply, one of:
- arabic_stop : arabic_stop
- arabic_norm2 : table normalization
- arabic_norm2_stop : table normalization with stopping
- arabic_light10 : light9 plus ll prefix
- arabic_light10_stop : light10 and remove stop words
- dataFiles: name of file containing list of datafiles to index.
- passageSize: Number of terms per passage.
The Lemur Project Last modified: Tue Nov 2 11:38:07 EST 2004