File format for annotations, UIMA annotator list, and
annotated data from CMU and Pitt Bio department web sites and MEDLINE abstracts
Annotating spans of text within documents is central to the RTW
project. We use "stand-off" annotation, which means the spans and
annotations are kept in a file or database separate from the document (as
opposed to in-line annotation, such as XML). We use the stand-off
annotation convention developed for Minorthird, as follows (thanks to
William Cohen):
File Format for Annotations:
The simplest format for Minorthird annotations is a file where each line is of the form:
addToType FILE LO-INDEX LENGTH ANNOTATION-TYPE
where FILE is a file name, LO-INDEX is the byte offset in the file of
the start of the span, LENGTH is the byte length of the span, and
ANNOTATION-TYPE is a string naming this type of annotation.
For instance:
addToType yeast_00001_training.txt-char619.txt 0 280 sentence
addToType yeast_00001_training.txt-char619.txt 181 13 process
addToType yeast_00001_training.txt-char619.txt 202 5 gene
addToType yeast_00001_training.txt-char619.txt 202 5 protein
addToType yeast_00003_training.txt-char549.txt 165 2 component
addToType yeast_00003_training.txt-char549.txt 168 8 component
addToType yeast_00003_training.txt-char549.txt 168 8 goTerm
...
The fields are separated by white space, so file names or
annotation types that contain spaces (for instance) will be a
problem. The FILE field should be just the name of a file, not a path
to a file. An annotation type is any string.
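As an illustration of the format above, here is a minimal Python sketch of a reader for addToType lines. It is not part of Minorthird; the names SpanLabel, parse_add_to_type and span_text are made up for this example. It assumes exactly five white-space separated fields per line and treats LO-INDEX and LENGTH as byte counts, as described above.

```python
from typing import NamedTuple, Optional


class SpanLabel(NamedTuple):
    """One addToType line (hypothetical helper type, not a Minorthird class)."""
    file: str       # FILE: a bare file name, not a path
    lo: int         # LO-INDEX: byte offset of the start of the span
    length: int     # LENGTH: byte length of the span
    ann_type: str   # ANNOTATION-TYPE: any string without white space


def parse_add_to_type(line: str) -> Optional[SpanLabel]:
    """Parse one line; return None if it is not a well-formed addToType line."""
    fields = line.split()
    # A file name or annotation type containing a space yields extra fields,
    # so such lines are rejected here rather than guessed at.
    if len(fields) != 5 or fields[0] != "addToType":
        return None
    _, file, lo, length, ann_type = fields
    return SpanLabel(file, int(lo), int(length), ann_type)


def span_text(data: bytes, label: SpanLabel) -> bytes:
    """Return the bytes covered by a span, given the document contents."""
    return data[label.lo:label.lo + label.length]
```

To recover the text of a span, read the document in binary mode (for example open(name, "rb").read()) and pass the bytes to span_text, so that the offsets are counted in bytes rather than decoded characters.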
The easiest way to check a set of annotations is to put them in a file
named foo.labels, put the files they mention in a directory named foo,
and then type the command
% java edu.cmu.minorthird.ui.ViewLabels -labels foo
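Before launching the viewer, it can help to confirm that every file mentioned in foo.labels is actually present in the directory foo. The Python sketch below is one way to check this; it is not a Minorthird tool, and the function name missing_files is hypothetical. It only looks at addToType lines and assumes FILE is the second white-space separated field.

```python
import os


def missing_files(labels_path: str, docs_dir: str) -> list:
    """List file names mentioned in the labels file but absent from docs_dir."""
    mentioned = set()
    with open(labels_path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.split()
            # FILE is the second field of an addToType line.
            if len(fields) >= 2 and fields[0] == "addToType":
                mentioned.add(fields[1])
    present = set(os.listdir(docs_dir))
    return sorted(mentioned - present)
```

For example, missing_files("foo.labels", "foo") returns an empty list when the layout described above is complete.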
The "addToType" keyword is there because there are other types of annotations that can be inserted. For instance,
setSpanProp FILE LO-INDEX LENGTH PROPERTY VALUE
adds a string-valued "property" to a span. For example, to associate an acronym with its expansion you might use the lines
setSpanProp SOMEFILE 44 4 acronym a1
setSpanProp SOMEFILE 37 23 expansion a1
Again, a property, or a property value, can be any string.
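The acronym example works because both spans carry the same property value (a1). A minimal Python sketch of how a reader might group such lines follows; it is an illustration only, not Minorthird code, and the function name link_span_props is hypothetical. It assumes six white-space separated fields per setSpanProp line.

```python
from collections import defaultdict


def link_span_props(lines):
    """Group setSpanProp spans that share a (FILE, VALUE) pair.

    Returns {(file, value): {property: (lo_index, length)}}.
    """
    groups = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) != 6 or fields[0] != "setSpanProp":
            continue  # skip addToType lines and anything malformed
        _, file, lo, length, prop, value = fields
        groups[(file, value)][prop] = (int(lo), int(length))
    return dict(groups)


example = [
    "setSpanProp SOMEFILE 44 4 acronym a1",
    "setSpanProp SOMEFILE 37 23 expansion a1",
]
links = link_span_props(example)
# links[("SOMEFILE", "a1")] holds both the acronym span and its expansion span.
```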
Example annotated data:
WEB PAGES
(thanks to Sophie Wang and Eric Riebling)
Web pages were crawled from the CMU and Pitt Biology dept web sites
(contributed by Sophie Wang, March 16, 2006). They are available
at
/afs/cs/project/theo-21/dataset/BIO_2uni/*
These pages were run through UIMA by Eric R. to preprocess them, removing
HTML tags and replacing each removed character with a space character
(so the span addresses in the original HTML page are identical to those
in the 'whitespace' files). The resulting 604 BIO dept web docs
are in
/afs/cs.cmu.edu/project/theo-21/dataset/BIO_white
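The offset-preserving idea behind the 'whitespace' files can be sketched in a few lines of Python. This is not the UIMA preprocessing that was actually run; it is a simplified illustration that uses a regular expression in place of a real HTML parser (so comments, scripts, and a '>' inside an attribute value are not handled), and the name blank_out_tags is hypothetical.

```python
import re

# Anything from '<' up to the next '>' is treated as a tag (a simplification).
TAG = re.compile(r"<[^>]*>")


def blank_out_tags(html: str) -> str:
    """Replace every character of every tag with a space.

    The output has the same length as the input, so a span offset computed
    on the cleaned text points at the same characters in the original page.
    """
    return TAG.sub(lambda m: " " * len(m.group(0)), html)


page = "<b>Gene</b> YFG1"
clean = blank_out_tags(page)
# The word "YFG1" sits at the same offset in both strings.
```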
UIMA was used to annotate these pages with MXTerminator, BBN Identifinder, and Protein finder. The results are in
/afs/cs.cmu.edu/project/theo-21/dataset/BIO_white.labels
MEDLINE ABSTRACTS
46 UIMA-annotated MEDLINE docs:
/afs/cs.cmu.edu/project/theo-21/dataset/uima-yeast/small
/afs/cs.cmu.edu/project/theo-21/dataset/uima-yeast/small.labels
annotated with:
MXTerminator
BBN Identifinder
Protein finder
BRILL POS tagger
ASSERT
[both also available in UIMA and Annotations Database format]
46 Minorthird-annotated MEDLINE docs:
/afs/cs.cmu.edu/project/theo-21/dataset/yeast/small
/afs/cs.cmu.edu/project/theo-21/dataset/yeast/small.labels
annotated with:
sentences
yeast-specific gene/protein entities (span name "gene")
yeast-specific cellular components ("component")
yeast biological processes ("process")
molecular functions ("function")
1000 UIMA-annotated MEDLINE docs (with the most protein mentions):
/afs/cs.cmu.edu/project/theo-21/dataset/uima-medline-protein/protein.labels
/afs/cs.cmu.edu/project/theo-21/dataset/uima-medline-protein/protein
annotated with:
MXTerminator
BBN Identifinder
Protein finder
BRILL POS tagger
ASSERT
[both also available in UIMA and Annotations Database format]
Table of UIMA annotators:
LEGEND:
o what it extracts
o source of the extractor (e.g., minor third, UIMA, other)
o the name/location/documentation of the extractor
o the type of text it expects to run on (e.g., emails, plain text, HTML,...)
o an estimate of its speed in bytes per second (the most accurate measure, since
documents per second or time per 100 documents depend entirely on document size)
o label/property produced
MXTerminator
o sentences
o UIMA
o http://uima.lti.cs.cmu.edu/resources.html#cmu
o plain text
o 10100 bytes/sec
o SENTENCE
BBN Identifinder
o named entities
o UIMA
o http://uima.lti.cs.cmu.edu/resources.html#cmu
o plain text
o 677 bytes/sec
o NAMED_ENTITY (property e.g. ENAMEX TYPE="SUBSTANCE")
Protein (FST wordlist) annotator
o protein names or anything from a word list
o UIMA
o http://uima.lti.cs.cmu.edu/resources.html#cmu
o any text
o 1180000 bytes/sec
o PROTEIN
Brill POS annotator
o parts of speech - Penn treebank tag set
o UIMA
o http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/parsing/taggers/brill/0.html
o plain text
o 1450 bytes/sec
o BRILL (property e.g. NNP)
ASSERT annotator
o semantic role labeling
o UIMA
o http://uima.lti.cs.cmu.edu/resources.html#cmu
o plain text
o 29 bytes/sec
o ASSERT (property e.g. PREDICATE_ARGUMENT_ROLE ARG1)
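The speed estimates in the table differ by several orders of magnitude, which matters when planning a run over a large corpus. The Python sketch below turns the bytes-per-second figures listed above into rough running times. The speeds are copied from the table; the function name estimated_seconds and the 1,000,000-byte corpus size are illustrative assumptions, and actual times will vary with hardware and text.

```python
# Speed estimates from the table above, in bytes per second.
SPEEDS = {
    "MXTerminator": 10100,
    "BBN Identifinder": 677,
    "Protein finder": 1180000,
    "Brill POS": 1450,
    "ASSERT": 29,
}


def estimated_seconds(corpus_bytes: int, speeds=SPEEDS) -> dict:
    """Rough running time of each annotator over a corpus of the given size."""
    return {name: corpus_bytes / bps for name, bps in speeds.items()}


# Hypothetical corpus of 1,000,000 bytes.
estimate = estimated_seconds(1_000_000)
for name, secs in sorted(estimate.items(), key=lambda kv: kv[1]):
    print(f"{name:18s} {secs / 3600:8.2f} hours")
```

Under these figures ASSERT dominates the total time, so it may be worth running it only on the documents that need semantic role labels.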
Other possibly useful notes:
(from Eric R.)
To run the Minorthird viewer, it is necessary to raise the Java heap size
on the command line from the recommended 500M to 1000M, and you must
supply the name 'BIO.jbs' as the -labels argument, e.g.
cd /afs/cs/project/theo-21/dataset
java -Xmx1000M edu.cmu.minorthird.ui.ViewLabels -labels BIO.jbs
This also assumes you've set your classpath and environment for running Minorthird.
[Apologies for this level of technical detail, but I'm trying to spare
people having to go through the same learning hurdles I did to get this
to work!]
This page is located in the file
/afs/cs/project/theo-21/www/SoftwareDocumentation/fileAnnotationFormat.html.
It is writable by any member of the course.
It was created using NVU, freely available at http://www.nvu.com/
Tom Mitchell, March 30, 2006.