
Incremental FP Passage Indexer

This application builds an FP passage index for a collection of documents. If the index already exists, new documents are added to it; otherwise a new index is created. Documents are segmented into passages of passageSize terms, with consecutive passages overlapping by passageSize/2 terms.
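The segmentation described above can be sketched as a sliding window. This is an illustrative Python sketch, not the toolkit's code: the function name segment_passages is hypothetical, and the use of integer division for odd sizes and the handling of a short final passage are assumptions that the description does not settle.

```python
def segment_passages(terms, passage_size):
    """Split a list of terms into windows of passage_size terms.

    Each window starts passage_size // 2 terms after the previous one,
    so consecutive windows share half of their terms.
    ASSUMPTION: the last window may be shorter than passage_size.
    """
    if passage_size < 2:
        raise ValueError("passage_size must be at least 2")
    if not terms:
        return []
    step = passage_size // 2  # ASSUMPTION: integer division for odd sizes
    passages = []
    start = 0
    while True:
        passages.append(terms[start:start + passage_size])
        # Stop once the current window reaches the end of the document.
        if start + passage_size >= len(terms):
            break
        start += step
    return passages
```

With passage_size set to 4, an eight-term document yields three passages starting at terms 0, 2 and 4.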

To use it, follow the general steps for running a Lemur application.

The parameters are:

index: name of the index table-of-contents file, without the .ifp extension.
memory: memory (in bytes) available to InvFPPushIndex (default = 96000000).
stopwords: name of file containing the stopword list.
acronyms: name of file containing the acronym list.
countStopWords: if true, stopwords are counted in the document length.
docFormat: format of the input documents, one of:
- "trec" for standard TREC formatted documents
- "web" for web TREC formatted documents
- "chinese" for segmented Chinese text (TREC format, GB encoding)
- "chinesechar" for unsegmented Chinese text (TREC format, GB encoding)
- "arabic" for Arabic text (TREC format, Windows CP1256 encoding)
stemmer: stemmer to apply, one of:
- "porter" Porter stemmer.
- "krovetz" Krovetz stemmer; requires an additional parameter:
  1. KstemmerDir: Path to directory of data files used by Krovetz's stemmer.
- "arabic" Arabic stemmer; requires additional parameters:
  1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
  2. arabicStemFunc: Which stemming algorithm to apply, one of:
    - arabic_stop : stopword removal only
    - arabic_norm2 : table normalization
    - arabic_norm2_stop : table normalization with stopping
    - arabic_light10 : light9 plus ll prefix
    - arabic_light10_stop : light10 and remove stop words
dataFiles: name of the file containing the list of data files to index.
passageSize: Number of terms per passage.
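
The parameter dependencies listed above can be checked before a run. The Python sketch below is illustrative only: check_params and render_params are hypothetical helpers, the file names in the example are placeholders, and the "name = value;" line format is an assumption about the toolkit's parameter files that should be confirmed against the Lemur documentation.

```python
DOC_FORMATS = {"trec", "web", "chinese", "chinesechar", "arabic"}
STEMMERS = {"porter", "krovetz", "arabic"}
ARABIC_STEM_FUNCS = {
    "arabic_stop", "arabic_norm2", "arabic_norm2_stop",
    "arabic_light10", "arabic_light10_stop",
}


def check_params(params):
    """Return a list of problems in a parameter dict (empty if none found).

    Only the constraints stated in the parameter list are checked.
    """
    problems = []
    if "docFormat" in params and params["docFormat"] not in DOC_FORMATS:
        problems.append("unknown docFormat: %s" % params["docFormat"])
    stemmer = params.get("stemmer")
    if stemmer is not None and stemmer not in STEMMERS:
        problems.append("unknown stemmer: %s" % stemmer)
    if stemmer == "krovetz" and "KstemmerDir" not in params:
        problems.append("krovetz stemmer requires KstemmerDir")
    if stemmer == "arabic":
        if "arabicStemDir" not in params:
            problems.append("arabic stemmer requires arabicStemDir")
        if params.get("arabicStemFunc") not in ARABIC_STEM_FUNCS:
            problems.append("arabic stemmer requires a valid arabicStemFunc")
    return problems


def render_params(params):
    """Render the dict as text, one entry per line.

    ASSUMPTION: entries are written as "name = value;".
    """
    return "".join("%s = %s;\n" % (name, value) for name, value in params.items())


# Placeholder values for illustration; none of these files are real.
example = {
    "index": "myindex",
    "memory": 96000000,
    "stopwords": "stoplist.txt",
    "docFormat": "trec",
    "stemmer": "krovetz",
    "KstemmerDir": "kstem_data",
    "dataFiles": "files.lst",
    "passageSize": 50,
}
```

Calling check_params(example) returns an empty list; removing KstemmerDir from the dict makes it report the missing Krovetz directory.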

Generated on Wed Nov 3 13:00:02 2004 for Lemur Toolkit by Doxygen 1.2.18