In this section, we briefly discuss some methods to improve the speed of the system. The work described is preliminary, and is not intended to be an exhaustive exploration of methods to optimize the execution time.
The dominant factor in the running time of the system described thus far is the number of 20x20 pixel windows which the neural networks must process. Applying two networks to a 320x240 pixel image (193737 windows) on a Sparc 20 takes approximately 590 seconds. The computational cost of the arbitration steps is negligible in comparison, taking less than one second to combine the results of the two networks over all positions in the image.
Recall that the amount of position invariance in the pattern recognition component of our system determines how many windows must be processed. In the related task of license plate detection, Umezaki [8] decreased the number of windows that must be processed. The key idea was to make the neural network invariant to translations of about 25% of the size of a license plate. Instead of a single number indicating the presence of an object in the window (as our networks produce for faces), the output of Umezaki's network is an image with a peak indicating where the network believes a license plate is located. These outputs are accumulated over the entire image, and peaks are extracted to give candidate locations for license plates.
The same idea can be applied to face detection. The original detector was trained to detect a 20x20 face centered in a 20x20 window. We can make the detector more flexible by allowing the same 20x20 face to be off-center by up to 5 pixels in any direction. To make sure the network can still see the whole face, the window size is increased to 30x30 pixels. Thus the center of the face will fall within a 10x10 pixel region at the center of the window. As before, the network has a single output, indicating the presence or absence of a face. This detector can be moved in steps of 10 pixels across the image, and still detect all faces that might be present. The network is trained using the bootstrap procedure described earlier. This first scanning step is illustrated in Figure 14, which shows the input image pyramid, and the 10x10 pixel regions which are classified as containing the centers of faces. An architecture with an image output was also tried. It yielded about the same detection accuracy, but at the expense of more computation.
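The effect of the coarser scan on the number of windows can be sketched as follows. This is a minimal illustration for a single 320x240 pyramid level only, not the implementation used in the system; the function names and the half-open `(x0, y0, x1, y1)` region convention are assumptions made for the example.

```python
def coarse_scan(width, height, window=30, step=10):
    """Yield ((x, y), region) for each coarse detector position.

    (x, y) is the top-left corner of a window x window detector input.
    region is the half-open box (x0, y0, x1, y1) at the center of that
    window in which the center of a face is allowed to fall.
    """
    margin = (window - step) // 2  # 10 pixels for a 30x30 window, step 10
    for y in range(0, height - window + 1, step):
        for x in range(0, width - window + 1, step):
            region = (x + margin, y + margin,
                      x + margin + step, y + margin + step)
            yield (x, y), region


def dense_window_count(width, height, window=20):
    """Number of window positions when the detector moves one pixel at a time."""
    return (width - window + 1) * (height - window + 1)


coarse = list(coarse_scan(320, 240))
print(len(coarse), "coarse positions versus",
      dense_window_count(320, 240), "one-pixel-step positions")
```

Because the step equals the width of the central region, the 10x10 regions tile the image without gaps, so every face center that lies at least 10 pixels from the image border falls in exactly one region.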
Figure 14: Illustration of the steps in the fast version of the face detector. On the left is the input image pyramid, which is scanned with a 30x30 detector which moves in steps of 10 pixels. The center of the figure shows the 10x10 pixel regions (at the center of the 30x30 detection windows) which the detector believes contain the center of a face. These candidates are then verified by the detectors described earlier in the paper, and the final results are shown on the right.
As can be seen from the figure, this network has many more false detections than the detectors described earlier. To improve the accuracy, we treat each detection by the 30x30 detector as a candidate face, and use the 20x20 detectors described earlier to verify it. Since the candidate faces are not precisely located, the center of the verification network's 20x20 window must be scanned over the 10x10 pixel region potentially containing the center of the face. Simple arbitration strategies, such as ANDing, can be used to combine the outputs of two verification networks. The heuristic that faces rarely overlap can also be used to reduce computation, by first scanning the image for large faces, and at smaller scales not processing locations which overlap with any detections found so far. The results of these verification steps are illustrated on the right side of Figure 14.
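The verification step with ANDing can be sketched as below. This is an illustrative outline under stated assumptions rather than the system's code: `verifier_a` and `verifier_b` are hypothetical callables standing in for the two 20x20 networks, each taking the top-left corner of a 20x20 window and returning True for a face, and the window center is taken to be 10 pixels from its top-left corner.

```python
def verify_candidate(region, verifier_a, verifier_b, window=20):
    """Scan a 20x20 verification window over one candidate region.

    region is a half-open box (x0, y0, x1, y1) of possible face centers.
    A position is kept only if both verifiers accept it (ANDing).
    Returns the top-left corners of the accepted windows.
    """
    x0, y0, x1, y1 = region
    half = window // 2
    accepted = []
    for cy in range(y0, y1):
        for cx in range(x0, x1):
            left, top = cx - half, cy - half
            if verifier_a(left, top) and verifier_b(left, top):
                accepted.append((left, top))
    return accepted


# Toy stand-ins for the two networks (hypothetical, for illustration only).
def toy_a(left, top):
    return top == 7 and left in (4, 5, 6)


def toy_b(left, top):
    return top == 7 and left in (5, 6, 7)


hits = verify_candidate((10, 10, 20, 20), toy_a, toy_b)
print(hits)
```

Each candidate costs at most 100 window positions per verification network, so the total verification work grows with the number of candidates rather than with the size of the image.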
With these modifications, the processing time for a typical 320x240 image is about 24 seconds on a Sparc 20. To examine the effect of these changes on the accuracy of the system, we applied it to the three test sets used in the previous section. The results are listed as System 17 in Tables 1 and 2. As can be seen, this system has false detection rates comparable to the most conservative of the other systems, System 10, with detection rates about 4% lower than that system. For applications where near real-time performance is required to process a sequence of images, this is an acceptable degradation; even if a face is missed in one image, it will often be detected in the next image in the sequence.
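The timings quoted in this section can be put side by side with a short calculation. It uses only the figures reported here (590 seconds and 193737 windows for the original system, about 24 seconds for the fast version); the variable names are illustrative.

```python
ORIGINAL_SECONDS = 590.0   # two networks, 320x240 image, Sparc 20
ORIGINAL_WINDOWS = 193737  # 20x20 windows processed by the original system
FAST_SECONDS = 24.0        # fast version, typical 320x240 image

# Average cost of applying both networks to one window in the original system.
ms_per_window = 1000.0 * ORIGINAL_SECONDS / ORIGINAL_WINDOWS

# Overall speedup of the fast version over the original.
speedup = ORIGINAL_SECONDS / FAST_SECONDS

print(f"{ms_per_window:.2f} ms per window, speedup of about {speedup:.1f}x")
```

This gives roughly 3 milliseconds per window for the original system and a speedup of about 25 times for the fast version.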
Further performance improvements can be made if one is analyzing many pictures taken by a stationary camera. By taking a picture of the background scene, one can determine which portions of the picture have changed in a newly acquired image, and analyze only those portions of the image. These techniques, taken together, have proved useful in building an almost real-time version of the system suitable for demonstration purposes.
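One simple way to realize the background comparison is per-pixel differencing against a stored background image. The sketch below is an assumption about how this might be done, not a description of the demonstration system: the absolute-difference test, the `threshold` value, and the function names are all hypothetical, and images are represented as lists of rows of grayscale values.

```python
def changed_pixels(background, frame, threshold=20):
    """Mark pixels whose intensity differs from the background by more
    than threshold (an assumed, tunable value)."""
    return [[abs(f - b) > threshold for f, b in zip(frame_row, bg_row)]
            for frame_row, bg_row in zip(frame, background)]


def window_has_change(mask, x, y, size):
    """True if any changed pixel lies inside the size x size window at (x, y)."""
    return any(any(row[x:x + size]) for row in mask[y:y + size])


# Toy 40x40 example: a uniform background and one bright 5x5 patch.
background = [[0] * 40 for _ in range(40)]
frame = [row[:] for row in background]
for r in range(5, 10):
    for c in range(25, 30):
        frame[r][c] = 100

mask = changed_pixels(background, frame)
active = [(x, y)
          for y in range(0, 40 - 30 + 1, 10)
          for x in range(0, 40 - 30 + 1, 10)
          if window_has_change(mask, x, y, 30)]
print(active)
```

Only the windows listed in `active` would be passed on to the detector; windows that contain no changed pixels are skipped.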