BrillPOSParser Class Reference
#include <BrillPOSParser.hpp>
Detailed Description
Parses documents that use document separation tags similar to NIST's Web format: <DOC></DOC> around documents and <DOCNO></DOCNO> around docids. Recognizes tokens with "/" slashes in them, which is the default separator for Brill's part-of-speech tagger. Use with BrillPOSTokenizer. This parser also recognizes ./. ?/. and !/. as end-of-sentence markers and sends along an [eos] token to be indexed. Does case folding for words that are not in the acronym list. Contraction suffixes and possessive suffixes are stripped.
U.S.A., USA's, and USAs are converted to USA. Does not recognize acronyms with numbers.
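The token handling described above can be illustrated with a small standalone sketch. It is not the toolkit's implementation: the helper names splitTagged, isEndOfSentence, and normalizeAcronym are hypothetical, and splitting at the last "/" is an assumption about how a word is separated from its tag.

```cpp
#include <cctype>
#include <string>
#include <utility>

// Hypothetical helpers for illustration; none of these names come from
// the Lemur Toolkit.

// Split "word/TAG" into {word, tag}. Assumption: the tag follows the
// last '/'. A token with no slash is returned with an empty tag.
std::pair<std::string, std::string> splitTagged(const std::string& tok) {
    std::string::size_type pos = tok.rfind('/');
    if (pos == std::string::npos) return {tok, ""};
    return {tok.substr(0, pos), tok.substr(pos + 1)};
}

// True for the three end-of-sentence markers named in the description.
bool isEndOfSentence(const std::string& tok) {
    return tok == "./." || tok == "?/." || tok == "!/.";
}

// Reduce an acronym variant to its bare form: drop a trailing "'s", or a
// trailing lowercase 's' that directly follows an uppercase letter, then
// drop all periods.
std::string normalizeAcronym(std::string word) {
    std::string::size_type n = word.size();
    if (n >= 2 && word.compare(n - 2, 2, "'s") == 0) {
        word.erase(n - 2);
    } else if (n >= 2 && word[n - 1] == 's' &&
               std::isupper(static_cast<unsigned char>(word[n - 2]))) {
        word.erase(n - 1);
    }
    std::string out;
    for (char c : word) {
        if (c != '.') out += c;
    }
    return out;
}
```

Under these assumptions, U.S.A., USA's, and USAs all normalize to USA, and a parser built this way would emit an [eos] token whenever isEndOfSentence returns true.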
Constructor & Destructor Documentation
BrillPOSParser::BrillPOSParser ()
Member Function Documentation
long BrillPOSParser::fileTell () [virtual]
Returns the current byte position of the file being parsed.
Implements Parser.
void BrillPOSParser::parseBuffer (char *buf, int len) [virtual]
Parse a buffer.
Implements Parser.
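As a rough illustration of the kind of input a buffer in this format holds, the following standalone sketch pulls the docid out of a <DOCNO></DOCNO> pair. The function extractDocno is hypothetical and is not part of BrillPOSParser; it only mirrors the (buf, len) argument shape and assumes len is non-negative.

```cpp
#include <string>

// Hypothetical helper, not part of the Lemur Toolkit: return the text
// between the first <DOCNO> and </DOCNO> in buf[0..len), trimmed of
// surrounding whitespace, or an empty string if the pair is not found.
std::string extractDocno(const char* buf, int len) {
    const std::string open = "<DOCNO>";
    const std::string close = "</DOCNO>";
    const std::string text(buf, static_cast<std::string::size_type>(len));

    std::string::size_type start = text.find(open);
    if (start == std::string::npos) return "";
    start += open.size();

    std::string::size_type end = text.find(close, start);
    if (end == std::string::npos) return "";

    // Trim whitespace around the docid.
    std::string id = text.substr(start, end - start);
    const char* ws = " \t\r\n";
    std::string::size_type first = id.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    std::string::size_type last = id.find_last_not_of(ws);
    return id.substr(first, last - first + 1);
}
```

A sample document in this layout would be a <DOC> block containing a <DOCNO> line followed by tagged text such as The/DT dog/NN barks/VBZ ./. before the closing </DOC>.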
void BrillPOSParser::parseFile (const string &filename) [virtual]
Parse a file.
Implements Parser.
Member Data Documentation
const string BrillPOSParser::identifier = "brill" [static]
The documentation for this class was generated from the following files:
Generated on Wed Nov 3 12:59:25 2004 for Lemur Toolkit by Doxygen 1.2.18