Date: Wed, 24 Aug 94 14:03:06 EDT
From: Mike Garris x2928 <mdg@magi.ncsl.nist.gov>
Subject: Re:  OCR database


                   ANNOUNCEMENT - PUBLIC DOMAIN OCR

             NIST FORM-BASED HANDPRINT RECOGNITION SYSTEM
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

              Michael D. Garris (mdg@magi.ncsl.nist.gov)
     James L. Blue, Gerald T. Candela, Darrin L. Dimmick, Jon Geist,
       Patrick J. Grother, Stanley A. Janet, and Charles L. Wilson

            National Institute of Standards and Technology,
                        Building 225, Room A216
                      Gaithersburg, Maryland 20899
            Phone: (301)975-2928          FAX: (301)840-1357


The National Institute of Standards and Technology (NIST) has developed a
standard reference form-based handprint recognition system for evaluating
optical character recognition (OCR). NIST is making this recognition system
freely available to the general public on an ISO-9660 format CD-ROM. The
recognition system processes the Handwriting Sample Forms distributed with
NIST Special Database 1 and NIST Special Database 3. The system reads
handprinted fields containing digits, lowercase letters, and uppercase
letters, and it reads a text paragraph containing the Preamble to the U.S.
Constitution.

This is a source code distribution written primarily in C and is organized
into 11 libraries. There are approximately 19,000 lines of code supporting
more than 550 subroutines. Source code is provided for form registration,
form removal, field isolation, field segmentation, character normalization,
feature extraction, character classification, and dictionary-based
post-processing. A host of data structures and low-level utilities are also
provided. These utilities include the application of CCITT Group 4
decompression, IHead file manipulation, spatial histograms, least-squares
fitting, spatial zooming, connected components, Karhunen-Loeve (KL) feature
extraction, optimized Probabilistic Neural Network classification,
multiple-key sorting, Levenshtein distance dynamic string alignment, and
dictionary-based post-processing. Two supporting programs are provided that
compute eigenvectors
and KL feature vectors for training classifiers. Unlike the recognition
system (which is written entirely in C), these two programs contain FORTRAN
subroutines. To support these programs, a training set of 168,365 segmented
and labeled character images is provided. About 1000 writers contributed to
this training set.
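
To illustrate the kind of string alignment used in dictionary-based
post-processing, here is a minimal sketch in C of the standard edit-distance
computation and a nearest-word lookup. It is not taken from the NIST
distribution: the function names edit_distance() and closest_word(), the unit
edit costs, and the case-sensitive comparison are all assumptions made for
this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Smallest of three values. */
static size_t min3(size_t a, size_t b, size_t c)
{
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/*
 * Edit distance between strings a and b: the minimum number of
 * single-character insertions, deletions, and substitutions (each
 * costing 1, an assumption of this sketch) that turn a into b.
 * Uses one row of the dynamic-programming table at a time.
 * Returns (size_t)-1 if memory cannot be allocated.
 */
size_t edit_distance(const char *a, const char *b)
{
    size_t la, lb, i, j, result;
    size_t *row;

    assert(a != NULL && b != NULL);
    la = strlen(a);
    lb = strlen(b);

    row = malloc((lb + 1) * sizeof *row);
    if (row == NULL)
        return (size_t)-1;

    /* Distance from the empty prefix of a to each prefix of b. */
    for (j = 0; j <= lb; j++)
        row[j] = j;

    for (i = 1; i <= la; i++) {
        size_t diag = row[0];          /* table entry (i-1, 0) */
        row[0] = i;
        for (j = 1; j <= lb; j++) {
            size_t up = row[j];        /* table entry (i-1, j) */
            size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            row[j] = min3(up + 1,          /* delete a[i-1]       */
                          row[j - 1] + 1,  /* insert b[j-1]       */
                          diag + cost);    /* match or substitute */
            diag = up;
        }
    }

    result = row[lb];
    free(row);
    return result;
}

/*
 * Hypothetical post-processing step: return the dictionary entry with
 * the smallest edit distance to the recognized string, or NULL if the
 * dictionary is empty. Ties go to the earliest entry.
 */
const char *closest_word(const char *field,
                         const char *const *dict, size_t n)
{
    const char *best = NULL;
    size_t best_d = (size_t)-1;
    size_t k;

    for (k = 0; k < n; k++) {
        size_t d = edit_distance(field, dict[k]);
        if (d < best_d) {
            best_d = d;
            best = dict[k];
        }
    }
    return best;
}
```

With a dictionary holding the words "People", "Order", and "Union", a
misrecognized field "Unoin" is two substitutions away from "Union" and
farther from the other two entries, so closest_word() returns "Union". A
production system would likely weight the edit costs by how often the
classifier confuses particular characters; the unit costs here only keep the
sketch short.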

The NIST standard reference recognition system is designed to run on UNIX
workstations and has been successfully compiled and tested on a Digital
Equipment Corporation (DEC) Alpha, Hewlett Packard (HP) Model 712/80, IBM
RS6000, Silicon Graphics Incorporated (SGI) Indigo 2, SGI Onyx, SGI Challenge,
Sun Microsystems (Sun) IPC, Sun SPARCstation 2, Sun 4/470, and a Sun
SPARCstation 10.** Scripts for installation and compilation on these
architectures are provided with this distribution.

A CD-ROM distribution of this standard reference system can be obtained free
of charge by sending a letter of request to Michael D. Garris at the address
above. The paper letter, preferably on company letterhead, should identify the
requesting organization or individuals. This system or any portion of this
system may be used without restrictions. However, redistribution of this
standard reference recognition system is strongly discouraged as any
subsequent corrections or updates will be sent to registered recipients only.
This software was produced by NIST, an agency of the U.S. government, and by
statute is not subject to copyright in the United States. Recipients of this
software assume all responsibilities associated with its operation,
modification, and maintenance.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
** Specific hardware and software products identified were used in order to
   adequately support the development of this technology. In no case does
   such identification imply recommendation or endorsement by the National
   Institute of Standards and Technology, nor does it imply that the
   equipment identified is necessarily the best available for the purpose.