Read The Web

File format for annotations, UIMA annotator list, and
annotated data from CMU and Pitt Biology department web sites and MEDLINE abstracts

Annotating spans of text within documents is central to the RTW project. We use "stand-off" annotation, which means the spans and annotations are kept in a file or database separate from the document (as opposed to in-line annotation, such as XML markup). We use the stand-off annotation convention developed for Minorthird, as follows (thanks to William Cohen):

File Format for Annotations:

The simplest format for Minorthird annotations is a file where each line is of the form:

addToType FILE LO-INDEX LENGTH ANNOTATION-TYPE

where FILE is a file name, LO-INDEX is the byte offset in the file of the start of the span, LENGTH is the byte length of the span, and ANNOTATION-TYPE is a string naming this type of annotation.

For instance:

addToType yeast_00001_training.txt-char619.txt 0 280 sentence
addToType yeast_00001_training.txt-char619.txt 181 13 process
addToType yeast_00001_training.txt-char619.txt 202 5 gene
addToType yeast_00001_training.txt-char619.txt 202 5 protein
addToType yeast_00003_training.txt-char549.txt 165 2 component
addToType yeast_00003_training.txt-char549.txt 168 8 component
addToType yeast_00003_training.txt-char549.txt 168 8 goTerm
...

The fields are separated by whitespace, so file names or annotation types that contain spaces will be a problem. FILE should be just the name of a file, not a path to a file. An annotation type can be any string that contains no whitespace.
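As a rough illustration of the format described above, the following Python sketch splits each line on whitespace and converts the two numeric fields. The names SpanAnnotation and parse_add_to_type are hypothetical helpers invented for this example; they are not part of Minorthird.

```python
from typing import NamedTuple, Optional


class SpanAnnotation(NamedTuple):
    """One addToType record (hypothetical name, not part of Minorthird)."""
    file: str
    lo: int        # LO-INDEX: offset of the start of the span
    length: int    # LENGTH of the span
    type: str      # ANNOTATION-TYPE


def parse_add_to_type(line: str) -> Optional[SpanAnnotation]:
    """Parse one line; return None for blank lines or other keywords."""
    fields = line.split()  # fields are whitespace-separated
    if not fields or fields[0] != "addToType":
        return None
    if len(fields) != 5:
        # Typically caused by a space inside a file name or annotation type.
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    _, file, lo, length, ann_type = fields
    return SpanAnnotation(file, int(lo), int(length), ann_type)


example = """\
addToType yeast_00001_training.txt-char619.txt 0 280 sentence
addToType yeast_00001_training.txt-char619.txt 181 13 process
addToType yeast_00001_training.txt-char619.txt 202 5 gene
"""
annotations = [a for a in map(parse_add_to_type, example.splitlines()) if a]
for a in annotations:
    # Print the file, the start offset, the end offset (exclusive), and the type.
    print(a.file, a.lo, a.lo + a.length, a.type)
```

A line with six fields, such as one whose file name contains a space, raises ValueError here rather than being silently misread.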

The easiest way to check a set of annotations is to put them in a file named foo.labels, put the files they mention in a directory named foo, and then type the command
% java edu.cmu.minorthird.ui.ViewLabels -labels foo
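Before opening the viewer, a quick scripted sanity check can catch spans that point outside their files. The sketch below is not a Minorthird tool: check_labels is a hypothetical helper, and it assumes the foo.labels / foo directory layout described above and that offsets and lengths count bytes.

```python
import os


def check_labels(labels_path: str, docs_dir: str) -> list:
    """Return a list of human-readable problems found in a .labels file."""
    problems = []
    sizes = {}  # cache: file name -> size in bytes, or None if the file is missing
    with open(labels_path, encoding="utf-8") as labels:
        for lineno, line in enumerate(labels, start=1):
            fields = line.split()
            if not fields or fields[0] != "addToType":
                continue  # other keywords (e.g. setSpanProp) are not checked here
            if len(fields) != 5:
                problems.append(f"line {lineno}: expected 5 fields")
                continue
            _, name, lo, length, _ = fields
            path = os.path.join(docs_dir, name)
            if name not in sizes:
                sizes[name] = os.path.getsize(path) if os.path.isfile(path) else None
            if sizes[name] is None:
                problems.append(f"line {lineno}: missing file {name}")
            elif not (lo.isdecimal() and length.isdecimal()):
                problems.append(f"line {lineno}: offset or length is not a non-negative integer")
            elif int(lo) + int(length) > sizes[name]:
                problems.append(f"line {lineno}: span runs past end of {name}")
    return problems
```

For the layout above, the call would be check_labels("foo.labels", "foo"); an empty list means no problems of these kinds were found.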

The "addToType" keyword is there because there are other kinds of annotation that can be inserted. For instance,

setSpanProp FILE LO-INDEX LENGTH PROPERTY VALUE

adds a string-valued "property" to a span. For example, to associate an acronym with its expansion you might use the lines

setSpanProp SOMEFILE 44 4 acronym a1
setSpanProp SOMEFILE 37 23 expansion a1

Again, a property, or a property value, can be any string.
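One way to read such lines back, sketched in Python: parse each setSpanProp line and group spans that share a property value within the same file, so that the acronym and expansion tagged "a1" end up together. The helper name parse_set_span_prop and the grouping scheme are assumptions made for this example, not Minorthird behavior.

```python
from collections import defaultdict


def parse_set_span_prop(line):
    """Return (file, lo, length, prop, value), or None if not a setSpanProp line."""
    fields = line.split()
    if len(fields) != 6 or fields[0] != "setSpanProp":
        return None
    _, file, lo, length, prop, value = fields
    return file, int(lo), int(length), prop, value


lines = [
    "setSpanProp SOMEFILE 44 4 acronym a1",
    "setSpanProp SOMEFILE 37 23 expansion a1",
]

# Group spans by (file, shared value): "a1" links an acronym span to its expansion span.
links = defaultdict(dict)
for line in lines:
    record = parse_set_span_prop(line)
    if record is not None:
        file, lo, length, prop, value = record
        links[(file, value)][prop] = (lo, length)

print(dict(links))
```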

Example annotated data:

WEB PAGES
(thanks to Sophie Wang and Eric Riebling)

Web pages were crawled from the CMU and Pitt Biology dept web sites (contributed by Sophie Wang, March 16, 2006). They are available at
/afs/cs/project/theo-21/dataset/BIO_2uni/*

These pages were run through UIMA by Eric R. to preprocess them: HTML tags were removed, and each removed character was replaced by a space character (so the span addresses in the original HTML page are identical to those in the 'whitespace' files). The resulting 604 BIO dept web docs are in
   /afs/cs.cmu.edu/project/theo-21/dataset/BIO_white
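The offset-preserving idea can be illustrated with a few lines of Python. This is only a sketch of the technique, not the UIMA preprocessing that was actually run: the tag pattern is deliberately naive, blank_tags is a hypothetical name, and the example works on characters of a Python string rather than on bytes.

```python
import re

# Naive tag pattern: "<" up to the next ">". An assumption for illustration;
# it does not handle ">" inside attribute values, scripts, or character entities.
TAG = re.compile(r"<[^>]*>")


def blank_tags(html: str) -> str:
    """Replace every character of each tag with a space, preserving length."""
    return TAG.sub(lambda m: " " * len(m.group(0)), html)


page = "<p>The <b>yeast</b> gene</p>"
white = blank_tags(page)

assert len(white) == len(page)  # total length is unchanged
start = page.index("yeast")
assert white[start:start + 5] == "yeast"  # same offset in both versions
print(repr(white))
```

Because every removed character becomes exactly one space, a span computed on the whitespace version addresses the same text in the original HTML.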

UIMA was used to annotate these pages with MXTerminator, BBN Identifinder, and Protein finder. The results are in
   /afs/cs.cmu.edu/project/theo-21/dataset/BIO_white.labels

MEDLINE ABSTRACTS

46 UIMA-annotated MEDLINE docs:

   /afs/cs.cmu.edu/project/theo-21/dataset/uima-yeast/small
   /afs/cs.cmu.edu/project/theo-21/dataset/uima-yeast/small.labels

     annotated with:
       MXTerminator
       BBN Identifinder
       Protein finder
       BRILL POS tagger
       ASSERT

   [both also available in UIMA and Annotations Database format]

46 Minorthird-annotated MEDLINE docs:

   /afs/cs.cmu.edu/project/theo-21/dataset/yeast/small
   /afs/cs.cmu.edu/project/theo-21/dataset/yeast/small.labels

     annotated with:
       sentences
       yeast-specific gene/protein entities (span name "gene")
       yeast-specific cellular components ("component")
       yeast biological processes ("process")
       molecular functions ("function")

1000 UIMA-annotated MEDLINE docs (with the most protein mentions):

   /afs/cs.cmu.edu/project/theo-21/dataset/uima-medline-protein/protein.labels
   /afs/cs.cmu.edu/project/theo-21/dataset/uima-medline-protein/protein

     annotated with:
       MXTerminator
       BBN Identifinder
       Protein finder
       BRILL POS tagger
       ASSERT

   [both also available in UIMA and Annotations Database format]

Table of UIMA annotators:

LEGEND:
   o what it extracts
   o source of the extractor (e.g., minor third, UIMA, other)
   o the name/location/documentation of the extractor
   o the type of text it expects to run on (e.g., emails, plain text, HTML,...)
   o an estimate of its speed in bytes per second (the most accurate measure,
     since documents per second or time per 100 documents depends entirely on
     document size)
   o label/property produced

MXTerminator
   o sentences
   o UIMA
   o http://uima.lti.cs.cmu.edu/resources.html#cmu
   o plain text
   o 10100 bytes/sec
   o SENTENCE
BBN Identifinder
   o named entities
   o UIMA
   o http://uima.lti.cs.cmu.edu/resources.html#cmu
   o plain text
   o 677 bytes/sec
   o NAMED_ENTITY (property e.g. ENAMEX TYPE="SUBSTANCE")
Protein (FST wordlist) annotator
   o protein names or anything from a word list
   o UIMA
   o http://uima.lti.cs.cmu.edu/resources.html#cmu
   o any text
   o 1180000 bytes/sec
   o PROTEIN
Brill POS annotator
   o parts of speech - Penn treebank tag set
   o UIMA
   o http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/parsing/taggers/brill/0.html
   o plain text
   o 1450 bytes/sec
   o BRILL (property e.g. NNP)
ASSERT annotator
   o semantic role labeling
   o UIMA
   o http://uima.lti.cs.cmu.edu/resources.html#cmu
   o plain text
   o 29 bytes/sec
   o ASSERT (property e.g. PREDICATE_ARGUMENT_ROLE ARG1)
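
To get a feel for what these throughput figures mean in practice, the arithmetic can be sketched as follows. The speeds are copied from the table above; the dictionary keys, the function name estimated_seconds, and the 1,000,000-byte corpus size are illustrative assumptions, and the estimate ignores startup cost and any per-document overhead.

```python
# Throughput figures copied from the table above, in bytes per second.
SPEED = {
    "MXTerminator": 10100,
    "BBN Identifinder": 677,
    "Protein (FST wordlist)": 1180000,
    "Brill POS": 1450,
    "ASSERT": 29,
}


def estimated_seconds(num_bytes: int, annotators) -> float:
    """Time to run the given annotators one after another over num_bytes of text."""
    return sum(num_bytes / SPEED[name] for name in annotators)


# Hypothetical corpus size of 1,000,000 bytes, chosen only for illustration.
for name in SPEED:
    print(f"{name}: {estimated_seconds(1_000_000, [name]):.0f} s")
```

On these figures the slowest annotator dominates any pipeline that includes it, since running annotators in sequence adds their per-byte times.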

Other possibly useful notes:

(from Eric R.)
To run the Minorthird viewer, it is necessary to raise the Java heap size on the command line from the recommended 500M to 1000M, and you must supply the name 'BIO.jbs' as the -labels argument, e.g.

   cd /afs/cs/project/theo-21/dataset
   java -Xmx1000M edu.cmu.minorthird.ui.ViewLabels -labels BIO.jbs

This also assumes you've set your classpath and environment for running Minorthird.

[Apologies for this level of technical detail, but I'm trying to spare people having to go through the same learning hurdles I did to get this to work!]

This page is located in the file /afs/cs/project/theo-21/www/SoftwareDocumentation/fileAnnotationFormat.html.
It is writable by any member of the course.
It was created using NVU, freely available at http://www.nvu.com/
Tom Mitchell, March 30, 2006.