Date: Wed, 24 Aug 94 14:03:06 EDT
From: Mike Garris x2928 <mdg@magi.ncsl.nist.gov>
Subject: Re: OCR database

ANNOUNCEMENT - PUBLIC DOMAIN OCR

NIST FORM-BASED HANDPRINT RECOGNITION SYSTEM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Michael D. Garris (mdg@magi.ncsl.nist.gov)
James L. Blue, Gerald T. Candela, Darrin L. Dimmick, Jon Geist,
Patrick J. Grother, Stanley A. Janet, and Charles L. Wilson

National Institute of Standards and Technology,
Building 225, Room A216
Gaithersburg, Maryland 20899
Phone: (301)975-2928
FAX: (301)840-1357

The National Institute of Standards and Technology (NIST) has developed a
standard reference form-based handprint recognition system for evaluating
optical character recognition (OCR). NIST is making this recognition system
freely available to the general public on an ISO-9660 format CD-ROM. The
recognition system processes the Handwriting Sample Forms distributed with
NIST Special Database 1 and NIST Special Database 3. The system reads
handprinted fields containing digits, lower case letters, and upper case
letters, and reads a text paragraph containing the Preamble to the U.S.
Constitution.

This is a source code distribution written primarily in C and is organized
into 11 libraries. There are approximately 19,000 lines of code supporting
more than 550 subroutines. Source code is provided for form registration,
form removal, field isolation, field segmentation, character normalization,
feature extraction, character classification, and dictionary-based
post-processing.

A host of data structures and low-level utilities are also provided. These
utilities include the application of CCITT Group 4 decompression, IHead file
manipulation, spatial histograms, least-squares fitting, spatial zooming,
connected components, Karhunen-Loeve (KL) feature extraction, optimized
Probabilistic Neural Network classification, multiple-key sorting,
Levenshtein distance dynamic string alignment, and dictionary-based
post-processing.
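
For readers unfamiliar with the string-alignment utility named above, the
following is a minimal sketch of Levenshtein distance and of how it might be
used to pick the nearest dictionary word during post-processing. It is not
taken from the NIST distribution: the function names levenshtein and
closest_word, the unit edit costs, and the nearest-word rule are all
assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch only; names and costs are hypothetical and are
 * not taken from the NIST source code. */

/* Minimum of three size_t values. */
static size_t min3(size_t a, size_t b, size_t c)
{
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Levenshtein distance between s and t: the fewest single-character
 * insertions, deletions, and substitutions that turn s into t.
 * Only one row of the dynamic-programming table is kept in memory.
 * Returns (size_t)-1 if the row cannot be allocated. */
size_t levenshtein(const char *s, const char *t)
{
    assert(s != NULL && t != NULL);

    size_t n = strlen(s), m = strlen(t);
    size_t *row = malloc((m + 1) * sizeof *row);
    if (row == NULL)
        return (size_t)-1;

    for (size_t j = 0; j <= m; j++)
        row[j] = j;                 /* empty prefix of s -> j insertions */

    for (size_t i = 1; i <= n; i++) {
        size_t diag = row[0];       /* table[i-1][j-1] */
        row[0] = i;                 /* i deletions reach the empty prefix of t */
        for (size_t j = 1; j <= m; j++) {
            size_t up = row[j];     /* table[i-1][j] */
            size_t cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            row[j] = min3(up + 1,           /* delete s[i-1] */
                          row[j - 1] + 1,   /* insert t[j-1] */
                          diag + cost);     /* match or substitute */
            diag = up;
        }
    }

    size_t d = row[m];
    free(row);
    return d;
}

/* Index of the dictionary entry with the smallest distance to word.
 * Ties go to the earliest entry; returns -1 for an empty dictionary. */
int closest_word(const char *word, const char *const dict[], int count)
{
    int best = -1;
    size_t best_d = (size_t)-1;

    for (int k = 0; k < count; k++) {
        size_t d = levenshtein(word, dict[k]);
        if (d < best_d) {
            best_d = d;
            best = k;
        }
    }
    return best;
}
```

With a small dictionary of Preamble words such as "people", "perfect", and
"union", a misread field like "pcople" is one substitution away from
"people", so closest_word would select that entry. A production system would
likely weight edits by how often the classifier confuses particular
characters; this sketch charges every edit the same cost.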

Two supporting programs are provided that compute eigenvectors and KL
feature vectors for training classifiers. Unlike the recognition system
(which is written entirely in C), these two programs contain FORTRAN
subroutines. To support these programs, a training set of 168,365 segmented
and labeled character images is provided. About 1000 writers contributed to
this training set.

The NIST standard reference recognition system is designed to run on UNIX
workstations and has been successfully compiled and tested on a Digital
Equipment Corporation (DEC) Alpha, Hewlett Packard (HP) Model 712/80, IBM
RS6000, Silicon Graphics Incorporated (SGI) Indigo 2, SGI Onyx, SGI
Challenge, Sun Microsystems (Sun) IPC, Sun SPARCstation 2, Sun 4/470, and a
Sun SPARCstation 10.** Scripts for installation and compilation on these
architectures are provided with this distribution.

A CD-ROM distribution of this standard reference system can be obtained free
of charge by sending a letter of request to Michael D. Garris at the address
above. The paper letter, preferably on company letterhead, should identify
the requesting organization or individuals.

This system or any portion of this system may be used without restrictions.
However, redistribution of this standard reference recognition system is
strongly discouraged as any subsequent corrections or updates will be sent
to registered recipients only. This software was produced by NIST, an agency
of the U.S. government, and by statute is not subject to copyright in the
United States. Recipients of this software assume all responsibilities
associated with its operation, modification, and maintenance.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

** Specific hardware and software products identified were used in order to
adequately support the development of this technology. In no case does such
identification imply recommendation or endorsement by the National Institute
of Standards and Technology, nor does it imply that the equipment identified
is necessarily the best available for the purpose.