Note that in Figure 3, most faces are detected at multiple nearby positions or scales, while false detections often occur with less consistency. This observation leads to a heuristic which can eliminate many false detections. For each location and scale at which a face is detected, the number of detections within a specified neighborhood of that location can be counted. If the number is above a threshold, then that location is classified as a face. The centroid of the nearby detections defines the location of the detection result, thereby collapsing multiple detections. In the experiments section, this heuristic will be referred to as ``thresholding''.
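The thresholding heuristic can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments. The `Detection` record, the cubic neighborhood of half-width `radius`, and the default parameter values are all assumptions made for the example, and positions are assumed to be expressed in a common coordinate frame across pyramid levels.

```python
from collections import namedtuple

# Hypothetical record: position (x, y) and pyramid level of one detection.
Detection = namedtuple("Detection", ["x", "y", "level"])

def threshold_detections(detections, radius=2, threshold=1):
    """Return (centroid, count) pairs for sufficiently supported detections.

    A detection is kept when the number of detections in its neighborhood
    (itself included) is above `threshold`. The neighborhood is assumed to
    extend `radius` steps in x, y and pyramid level.
    """
    supported = {}
    for d in detections:
        near = [e for e in detections
                if abs(e.x - d.x) <= radius
                and abs(e.y - d.y) <= radius
                and abs(e.level - d.level) <= radius]
        if len(near) > threshold:
            n = len(near)
            centroid = (sum(e.x for e in near) / n,
                        sum(e.y for e in near) / n,
                        sum(e.level for e in near) / n)
            # Detections with identical neighbor sets share one centroid,
            # so keying on the centroid collapses them to a single entry.
            supported[centroid] = n
    # Best-supported locations first.
    return sorted(supported.items(), key=lambda item: item[1], reverse=True)
```

With three mutually close detections and one isolated detection, this sketch returns a single centroid with a count of three and discards the isolated one. It only merges detections whose neighborhoods coincide exactly; partially overlapping clusters still yield several centroids.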
If a particular location is correctly identified as a face, then all other detection locations which overlap it are likely to be errors, and can therefore be eliminated. Based on the above heuristic regarding nearby detections, we preserve the location with the higher number of detections within a small neighborhood, and eliminate locations with fewer detections. Later, in the discussion of the experiments, this heuristic is called ``overlap elimination''. There are relatively few cases in which this heuristic fails; however, one such case is illustrated by the two leftmost faces in Figure 3B, in which one face partially occludes another.
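Overlap elimination can be illustrated as a greedy pass over candidates ordered by their detection counts. The sketch below is an assumption-laden example: the text does not define when two faces overlap, so the hypothetical `boxes_overlap` test treats each candidate as an axis-aligned square window `(x, y, size)` anchored at its top-left corner and calls any intersection an overlap.

```python
def boxes_overlap(a, b):
    """True if two square windows (x, y, size) intersect (assumed criterion)."""
    ax, ay, asz = a
    bx, by, bsz = b
    return (ax < bx + bsz and bx < ax + asz and
            ay < by + bsz and by < ay + asz)

def eliminate_overlaps(candidates):
    """Keep the best-supported candidate among each set of overlapping ones.

    `candidates` is a list of ((x, y, size), count) pairs, where `count` is
    the number of detections found near that location.
    """
    kept = []
    # Visit candidates from the highest count to the lowest.
    for box, count in sorted(candidates, key=lambda c: c[1], reverse=True):
        # Discard a candidate that overlaps one already accepted.
        if not any(boxes_overlap(box, k[0]) for k in kept):
            kept.append((box, count))
    return kept
```

Because any intersection counts as an overlap here, two genuinely overlapping faces, such as the partially occluded pair noted above, would be reduced to one; a looser overlap criterion would trade that failure for more surviving false detections.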
The implementation of these two heuristics is illustrated in Figure 6. Each detection by the network at a particular location and scale is marked in an image pyramid, labelled the ``output'' pyramid. Then, each location in the pyramid is replaced by the number of detections in a specified neighborhood of that location. This has the effect of ``spreading out'' the detections. Normally, the neighborhood extends an equal number of pixels in the dimensions of scale and position, but for clarity in Figure 6 detections are only spread out in position. A threshold is applied to these values, and the centroids (in both position and scale) of all above-threshold regions are computed. All detections contributing to the centroids are collapsed down to single points. Each centroid is then examined in order, starting from the ones which had the highest number of detections within the specified neighborhood. If any other centroid locations represent a face overlapping with the current centroid, they are removed from the output pyramid. All remaining centroid locations constitute the final detection result.
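The spreading, thresholding, and centroid steps can be sketched on a single pyramid level, spreading in position only as Figure 6 does. This is an illustrative sketch, not the original implementation. The function names are hypothetical, a region is assumed to be a 4-connected set of above-threshold cells, and the centroid is assumed to be the unweighted mean of the cell coordinates.

```python
def spread_counts(grid, radius):
    """Replace each cell by the number of detections within `radius` cells."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = sum(
                grid[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1)))
    return out

def region_centroids(counts, threshold):
    """Return ((x, y) centroid, peak count) for each above-threshold region."""
    h, w = len(counts), len(counts[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if seen[y][x] or counts[y][x] <= threshold:
                continue
            # Flood fill one 4-connected region of above-threshold cells.
            stack, cells = [(x, y)], []
            seen[y][x] = True
            while stack:
                cx, cy = stack.pop()
                cells.append((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < w and 0 <= ny < h and not seen[ny][nx]
                            and counts[ny][nx] > threshold):
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            n = len(cells)
            centroid = (sum(c[0] for c in cells) / n,
                        sum(c[1] for c in cells) / n)
            peak = max(counts[cy][cx] for cx, cy in cells)
            regions.append((centroid, peak))
    # Strongest regions first, ready for the overlap check.
    return sorted(regions, key=lambda r: r[1], reverse=True)
```

Extending the sketch to the full pyramid would add a scale index to both the neighborhood sum and the flood fill. The sorted output is the order in which centroids would then be examined for overlaps.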
Figure 6: The framework used for merging multiple detections from a single network: A) The detections are recorded in an image pyramid. B) The detections are ``spread out'' and a threshold is applied. C) The centroids in scale and position are computed, and the regions contributing to each centroid are collapsed to single points. In the example shown, this leaves only two detections in the output pyramid. D) The final step is to check the proposed face locations for overlaps, and E) to remove overlapping detections if they exist. In this example, removing the overlapping detection eliminates what would otherwise be a false positive.