The first component of our system is a filter that receives as input a 20x20 pixel region of the image, and generates an output ranging from 1 to -1, signifying the presence or absence of a face, respectively. To detect faces anywhere in the input, the filter is applied at every location in the image. To detect faces larger than the window size, the input image is repeatedly reduced in size (by subsampling), and the filter is applied at each size. The filter itself must have some invariance to position and scale. The amount of invariance built into the filter determines the number of scales and positions at which the filter must be applied. For the work presented here, we apply the filter at every pixel position in the image, and scale the image down by a factor of 1.2 for each step in the pyramid.
Figure 1:
The basic algorithm used for face detection.
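To make the scanning procedure concrete, the sketch below (in Python with NumPy) applies a 20x20 filter at every pixel position of every level of an image pyramid whose levels are scaled by factors of 1.2. The `classify_window` callable, the nearest-neighbour subsampling, and the coordinate bookkeeping are illustrative assumptions rather than the implementation used in the system.

```python
import numpy as np

WINDOW = 20        # filter input size in pixels
PYRAMID_STEP = 1.2 # scale factor between pyramid levels


def subsample(image, factor):
    """Shrink the image by the given factor using simple nearest-neighbour
    sampling (a stand-in for whatever subsampling scheme is actually used)."""
    h, w = image.shape
    rows = (np.arange(int(h / factor)) * factor).astype(int)
    cols = (np.arange(int(w / factor)) * factor).astype(int)
    return image[np.ix_(rows, cols)]


def scan_pyramid(image, classify_window):
    """Apply the 20x20 filter at every pixel position and every pyramid level.

    `classify_window` is assumed to return a value in [-1, 1], with positive
    values meaning "face".  Detections are reported as (x, y, scale) in the
    coordinates of the original image.
    """
    detections = []
    scale = 1.0
    level = image.astype(np.float64)
    while level.shape[0] >= WINDOW and level.shape[1] >= WINDOW:
        for y in range(level.shape[0] - WINDOW + 1):
            for x in range(level.shape[1] - WINDOW + 1):
                window = level[y:y + WINDOW, x:x + WINDOW]
                if classify_window(window) > 0:
                    detections.append((int(x * scale), int(y * scale), scale))
        level = subsample(level, PYRAMID_STEP)
        scale *= PYRAMID_STEP
    return detections
```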
The filtering algorithm is shown in Figure 1. First, a preprocessing step, adapted from [7], is applied to a window of the image. The window is then passed through a neural network, which decides whether the window contains a face. The preprocessing first attempts to equalize the intensity values across the window. We fit a function which varies linearly across the window to the intensity values in an oval region inside the window. Pixels outside the oval (shown at the top of Figure 2) may represent the background, so those intensity values are ignored in computing the lighting variation across the face. The linear function approximates the overall brightness of each part of the window, and can be subtracted from the window to compensate for a variety of lighting conditions. Then histogram equalization is performed, which non-linearly maps the intensity values to expand the range of intensities in the window. The histogram is computed for pixels inside an oval region in the window. This compensates for differences in camera input gains, as well as improving contrast in some cases. Examples of the results of each of the preprocessing steps are shown in Figure 2.
Figure 2:
The steps in preprocessing a window. First, a linear function is fit
to the intensity values in the window, and then subtracted out,
correcting for some extreme lighting conditions. Then, histogram
equalization is applied, to correct for different camera gains and to
improve contrast. For each of these steps, the mapping is computed
based on pixels inside the oval mask, while the mapping is applied to
the entire window.
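The following is a rough sketch of the two preprocessing steps, assuming a fixed elliptical mask, a least-squares fit of the linear brightness model, and a simple cumulative-histogram equalization. The exact oval shape and the quantization into 256 levels are assumptions made for illustration, not details taken from the system itself.

```python
import numpy as np

WINDOW = 20


def oval_mask(size=WINDOW):
    """Boolean mask selecting an oval region inside the window (the exact
    shape of the oval is an assumption here)."""
    y, x = np.mgrid[0:size, 0:size]
    cy, cx = (size - 1) / 2.0, (size - 1) / 2.0
    return ((x - cx) / (size * 0.45)) ** 2 + ((y - cy) / (size * 0.5)) ** 2 <= 1.0


def correct_lighting(window, mask):
    """Fit I(x, y) ~ a*x + b*y + c to the pixels inside the oval by least
    squares, then subtract the fitted plane from the whole window."""
    y, x = np.mgrid[0:window.shape[0], 0:window.shape[1]]
    A = np.column_stack([x[mask], y[mask], np.ones(mask.sum())])
    coeffs, *_ = np.linalg.lstsq(A, window[mask].astype(np.float64), rcond=None)
    plane = coeffs[0] * x + coeffs[1] * y + coeffs[2]
    return window - plane


def equalize(window, mask, levels=256):
    """Histogram equalization: the mapping is computed from pixels inside
    the oval but applied to every pixel in the window."""
    lo, hi = window[mask].min(), window[mask].max()
    scaled = np.clip((window - lo) / max(hi - lo, 1e-6) * (levels - 1), 0, levels - 1)
    hist, _ = np.histogram(scaled[mask], bins=levels, range=(0, levels))
    cdf = hist.cumsum() / hist.sum()
    return cdf[scaled.astype(int)] * (levels - 1)


def preprocess(window):
    mask = oval_mask(window.shape[0])
    return equalize(correct_lighting(window.astype(np.float64), mask), mask)
```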
The preprocessed window is then passed through a neural network. The network has retinal connections to its input layer; the receptive fields of hidden units are shown in Figure 1. There are three types of hidden units: 4 which look at 10x10 pixel subregions, 16 which look at 5x5 pixel subregions, and 6 which look at overlapping 20x5 pixel horizontal stripes of pixels. Each of these types was chosen to allow the hidden units to represent features that might be important for face detection. In particular, the horizontal stripes allow the hidden units to detect such features as mouths or pairs of eyes, while the hidden units with square receptive fields might detect features such as individual eyes, the nose, or corners of the mouth. Although the figure shows a single hidden unit for each subregion of the input, these units can be replicated. For the experiments which are described later, we use networks with two and three sets of these hidden units. Similar input connection patterns are commonly used in speech and character recognition tasks [5][10]. The network has a single, real-valued output, which indicates whether or not the window contains a face.
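One way to express this sparse connection pattern is as a binary mask over the 20x20 = 400 inputs, with one row per hidden unit. The sketch below only builds that mask for a single set of hidden units (the weights, the replication of the sets, and the training procedure are omitted), and the three-row spacing of the overlapping stripes is an assumption rather than a detail taken from the system.

```python
import numpy as np

WINDOW = 20


def receptive_field_mask():
    """Build a (num_hidden, 400) binary mask in which each hidden unit is
    connected only to the pixels of its receptive field: 4 units over 10x10
    subregions, 16 units over 5x5 subregions, and 6 units over overlapping
    20x5 horizontal stripes (spaced 3 rows apart here, which is an assumption)."""
    rows = []

    def field(r0, c0, h, w):
        m = np.zeros((WINDOW, WINDOW), dtype=np.float32)
        m[r0:r0 + h, c0:c0 + w] = 1.0
        return m.ravel()

    # 4 units, each seeing one 10x10 quadrant of the window.
    for r in (0, 10):
        for c in (0, 10):
            rows.append(field(r, c, 10, 10))

    # 16 units, each seeing one 5x5 subregion.
    for r in range(0, WINDOW, 5):
        for c in range(0, WINDOW, 5):
            rows.append(field(r, c, 5, 5))

    # 6 units, each seeing a 20-wide by 5-tall horizontal stripe;
    # consecutive stripes overlap by two rows.
    for r in range(0, 16, 3):
        rows.append(field(r, 0, 5, 20))

    return np.stack(rows)  # shape (26, 400)
```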
Examples of output from a single network are shown in Figure 3. In the figure, each box represents the position and size of a window to which the neural network gave a positive response. The network has some invariance to position and scale, which results in multiple boxes around some faces. Note also that there are some false detections; they will be eliminated by methods presented in Section 2.2.
Figure 3:
Images with all the above threshold detections indicated by boxes.
To train the neural network used in stage one to serve as an accurate filter, a large number of face and non-face images are needed. Nearly 1050 face examples were gathered from face databases at CMU and Harvard. The images contained faces of various sizes, orientations, positions, and intensities. The eyes and the center of the upper lip of each face were located manually, and these points were used to normalize each face to the same scale, orientation, and position.
In the training set, 15 face examples are generated from each original image, by randomly rotating the images (about their center points) up to 10°, scaling between 90% and 110%, translating up to half a pixel, and mirroring. Each 20x20 window in the set is then preprocessed (by applying lighting correction and histogram equalization). A few example images are shown in Figure 4. The randomization gives the filter invariance to translations of less than a pixel and scalings of 20%. Larger changes in translation and scale are dealt with by applying the filter at every pixel position in an image pyramid, in which the images are scaled by factors of 1.2.
Figure 4:
Example face images, randomly mirrored, rotated, translated, and
scaled by small amounts.
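A sketch of this randomization, assuming each face has already been aligned and cropped to 20x20, is given below; it uses SciPy's `ndimage.affine_transform` with an inverse mapping about the window centre, which is one plausible way to realize the transformations rather than the authors' exact procedure.

```python
import numpy as np
from scipy import ndimage

WINDOW = 20
rng = np.random.default_rng(0)


def random_variation(face):
    """Produce one randomized copy of an aligned 20x20 face: rotate up to
    10 degrees about the centre, scale between 90% and 110%, translate up to
    half a pixel, and mirror with probability 0.5."""
    angle = np.deg2rad(rng.uniform(-10, 10))
    scale = rng.uniform(0.9, 1.1)
    shift = rng.uniform(-0.5, 0.5, size=2)
    mirror = rng.random() < 0.5

    # Map output coordinates back to input coordinates (inverse transform),
    # rotating and scaling about the window centre.
    c, s = np.cos(angle), np.sin(angle)
    matrix = np.array([[c, -s], [s, c]]) / scale        # inverse of rotate+scale
    if mirror:
        matrix = matrix @ np.array([[1.0, 0.0], [0.0, -1.0]])  # flip columns
    centre = np.array([(WINDOW - 1) / 2.0] * 2)
    offset = centre - matrix @ centre + shift
    return ndimage.affine_transform(face.astype(np.float64), matrix,
                                    offset=offset, mode="nearest")


def expand_training_set(faces, copies=15):
    """Generate `copies` randomized variations of every aligned face image."""
    return [random_variation(f) for f in faces for _ in range(copies)]
```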
Practically any image can serve as a non-face example because the space of non-face images is much larger than the space of face images. However, collecting a ``representative'' set of non-faces is difficult. Instead of collecting the images before training is started, the images are collected during training in a ``bootstrap'' manner, adapted from [7]: the partially-trained network is applied to images of scenery which contain no faces, and any windows it incorrectly classifies as faces are added to the set of negative training examples.
Some examples of non-faces that are collected during training are shown in Figure 5. We used 120 images of scenery for collecting negative examples in this bootstrap manner. A typical training run selects approximately 8000 non-face images from the 146,212,178 subimages that are available at all locations and scales in the training scenery images.
Figure 5:
During training, the partially-trained system is applied to images of
scenery which do not contain faces (like the one on the left). Any
regions in the image detected as faces (which are expanded and shown
on the right) are errors, which can be added into the set of negative
training examples.
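A rough sketch of this bootstrap loop appears below. It reuses the `scan_pyramid` sketch from above and assumes a hypothetical `train_one_pass` helper that performs one round of training on the current face and non-face sets; the number of rounds and the window extraction are illustrative only, not the authors' implementation.

```python
import numpy as np

WINDOW = 20


def extract_window(image, x, y, scale):
    """Crop the region behind a detection and resample it back to the 20x20
    filter input (nearest-neighbour resampling keeps this short)."""
    size = max(WINDOW, int(round(WINDOW * scale)))
    patch = image[y:y + size, x:x + size].astype(np.float64)
    idx = np.linspace(0, patch.shape[0] - 1, WINDOW).astype(int)
    jdx = np.linspace(0, patch.shape[1] - 1, WINDOW).astype(int)
    return patch[np.ix_(idx, jdx)]


def bootstrap_negatives(network, face_examples, scenery_images,
                        train_one_pass, rounds=10):
    """Collect non-face examples during training: after each partial training
    pass, run the filter over face-free scenery images and keep every (false)
    detection as a new negative example."""
    negatives = []
    for _ in range(rounds):
        train_one_pass(network, face_examples, negatives)
        for image in scenery_images:
            # Every detection in a scenery image is an error, since the
            # scenery contains no faces.
            for (x, y, scale) in scan_pyramid(image, network):
                negatives.append(extract_window(image, x, y, scale))
    return negatives
```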