|
Parses documents that use document separation tags similar to NIST's Web format: <DOC></DOC> around each document and <DOCNO></DOCNO> around each docid. This parser recognizes named entity tags from the Identifinder tagger and passes them along as properties. For each tag X, it also adds b_X to the first token and e_X to the last token of each entity. For example, if "Carnegie Mellon University" were identified as a place, it would be parsed with the following properties: Carnegie [b_place] [place], Mellon [place], University [e_place] [place]. A single-token entity, such as Madonna, would be parsed as Madonna [b_person] [person] [e_person]. Words that are not in the acronym list are case-folded. Contraction suffixes and possessive suffixes are stripped.
For example, U.S.A., USA's, and USAs are all converted to USA. Acronyms that contain numbers are not recognized. |
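
The tagging and normalization rules described above can be illustrated with a minimal Python sketch. It is not the parser's actual implementation: the function names `entity_properties` and `normalize`, the `ACRONYMS` set, and the order of properties within each list are assumptions made for illustration only. The real acronym list and the exact suffix-stripping rules may differ.

```python
# Hypothetical acronym list; the real parser loads its own list.
ACRONYMS = {"USA"}


def entity_properties(tokens, tag):
    """Attach properties to the tokens of one entity tagged `tag`.

    Every token gets the tag itself; the first token also gets b_<tag>
    and the last token also gets e_<tag>. The order of properties in
    each list is an assumption and carries no meaning.
    """
    result = []
    last = len(tokens) - 1
    for i, token in enumerate(tokens):
        props = []
        if i == 0:
            props.append("b_" + tag)
        props.append(tag)
        if i == last:
            props.append("e_" + tag)
        result.append((token, props))
    return result


def normalize(token, acronyms=ACRONYMS):
    """Strip suffixes, then keep known acronyms and case-fold the rest."""
    # Assumed rule: a possessive or contraction suffix starts at the apostrophe.
    base = token.split("'", 1)[0]
    # Remove periods so that "U.S.A." becomes "USA".
    candidate = base.replace(".", "")
    # Drop a plural "s" that follows an all-uppercase stem ("USAs" -> "USA").
    if candidate.endswith("s") and candidate[:-1].isupper():
        candidate = candidate[:-1]
    # Only purely alphabetic candidates count, so acronyms with digits are missed.
    if candidate.isalpha() and candidate in acronyms:
        return candidate
    return base.lower()


if __name__ == "__main__":
    for token, props in entity_properties(["Carnegie", "Mellon", "University"], "place"):
        print(token, props)
    print(entity_properties(["Madonna"], "person"))
    print([normalize(t) for t in ["U.S.A.", "USA's", "USAs", "Mellon"]])
```

In this sketch, a token with digits such as B2B is case-folded even when it appears in the acronym set, which mirrors the stated limitation that acronyms with numbers are not recognized.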