The first segmentation is clearly better than the second one. Is there a principled way to compute a score that would capture that fact?

![](segs_files/intro.jpg)
Unsupervised image segmentation is an important component in many image
understanding algorithms and practical vision systems. However,
evaluation of segmentation algorithms thus far has been largely
subjective, leaving a system designer to judge the effectiveness of a
technique based only on intuition and results in the form of a few
example segmented images. This is largely due to image segmentation
being an ill-defined problem—there is no unique ground-truth
segmentation of an image against which the output of an algorithm may
be compared.
In this work, we propose a new measure of similarity, the
Normalized Probabilistic Rand (NPR) index, which can be used to perform
a quantitative comparison between image segmentation algorithms using a
hand-labeled set of ground-truth segmentations. The measure is a
generalization of the Rand index commonly used in statistics, and it has
several desirable properties that fall broadly into four categories
(a small computational sketch follows the list):
- Nondegeneracy: The
measure does not have degenerate cases where input instances that are
not well represented by the ground-truth segmentations give abnormally
high values of similarity.
- No assumptions about data generation: The measure does not assume that the segmentations being compared have equal numbers of regions or equal region sizes.
- Adaptive accommodation of refinement:
The measure of similarity ignores small differences in pixel-level
granularity in regions that humans find ambiguous and penalizes
differences in refinement elsewhere.
- Comparable scores: The
measure gives scores that permit meaningful comparison between
segmentations of different images and between different segmentations
of the same image.
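Since the NPR index builds on the pairwise label agreement counted by the Rand index, a minimal sketch may make the computation concrete. The sketch below assumes the standard pairwise formulation: for every pixel pair, the test segmentation's agreement (same label or not) is compared against the fraction of ground-truth segmentations in which that pair shares a label, and the resulting score is normalized so that its expected value maps to zero. The function names and the `expected_pr` argument are illustrative assumptions rather than our reference implementation.

```python
import numpy as np

def probabilistic_rand(test, ground_truths):
    """Pairwise agreement between a test label map and a set of ground truths.

    test: (H, W) integer label map; ground_truths: list of (H, W) label maps.
    Note: this sketch materializes n x n pair matrices, so it is only
    practical for small images (or subsampled pixel sets).
    """
    t = test.ravel()
    n = t.size
    # c[i, j] = 1 if pixels i and j share a label in the test segmentation.
    c = (t[:, None] == t[None, :]).astype(float)
    # p[i, j] = fraction of ground truths in which pixels i and j share a label.
    p = np.zeros((n, n))
    for gt in ground_truths:
        g = gt.ravel()
        p += (g[:, None] == g[None, :])
    p /= len(ground_truths)
    # A pair contributes c*p + (1 - c)*(1 - p); average over all pairs i < j.
    agree = c * p + (1.0 - c) * (1.0 - p)
    iu = np.triu_indices(n, k=1)
    return agree[iu].mean()

def normalized_pr(test, ground_truths, expected_pr):
    """Normalize so the expected score maps to 0 (scores can then be negative).

    expected_pr would be estimated from segmentations over the whole image set;
    the maximum attainable score is taken to be 1 here as an assumption.
    """
    pr = probabilistic_rand(test, ground_truths)
    return (pr - expected_pr) / (1.0 - expected_pr)
```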
In practice, the measure of similarity allows principled comparisons
between segmentations created by different algorithms, as well as
between segmentations of different images. It is well suited for
comparing segmentations because it matches the "natural" ordering of
segmentations, i.e., the ordering defined by human subjects. For
example, the NPR index is low for both over- and under-segmentation,
and high for "good" segmentations that match the reference
segmentations well:
| | Input | Segmentation (mean shift) | Manual segmentations (from the Berkeley set) |
| --- | --- | --- | --- |
| Under-segmented, NPR = -0.59 | | ![](segs_files/undersegout.jpg) | ![](segs_files/oversegmanual.jpg) |
| Over-segmented, NPR = -0.74 | | ![](segs_files/oversegout.jpg) | ![](segs_files/oversegmanual.jpg) |
| "Good" segmentation, NPR = 0.85 | ![](segs_files/goodin.jpg) | ![](segs_files/goodout.jpg) | ![](segs_files/goodmanual.jpg) |
This can also be seen by comparing the scores obtained by the NPR index
with those obtained by three leading approaches for segmentation
scoring. In general, our measure conforms to the natural ordering of the
segmentations, and it provides a sharper separation between useful
segmentations and massively over- or under-segmented exemplars:
| Input image and 7 possible segmentations | NPR scores for all 7 segmentations, compared with other popular similarity measures |
| --- | --- |
| ![](segs_files/h1.jpg) ![](segs_files/h2.jpg) | ![](segs_files/graph.jpg) |
We also developed a procedure for algorithm evaluation and demonstrate
it on several familiar algorithms: the mean-shift-based algorithm, an
efficient graph-based segmentation algorithm, a hybrid algorithm that
combines the strengths of both methods, and expectation maximization.
We used the 300 images in the publicly available Berkeley Segmentation
Data Set to validate the measure.
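To make the evaluation procedure concrete, here is a minimal sketch of how one might rank algorithms by their mean NPR score over a dataset such as the Berkeley set. The function name `evaluate_algorithms`, the data layout, and the `score` callable (for instance the `normalized_pr` sketch above) are illustrative assumptions, not a description of our actual pipeline.

```python
def evaluate_algorithms(algorithms, images, ground_truth_sets, score):
    """Rank segmentation algorithms by mean score over a dataset.

    algorithms: dict mapping a name -> callable(image) -> label map.
    images: list of image arrays.
    ground_truth_sets: per-image lists of hand-labeled label maps.
    score: callable(test_labels, ground_truths) -> float, e.g. an NPR score.
    """
    mean_scores = {}
    for name, segment in algorithms.items():
        values = [score(segment(img), gts)
                  for img, gts in zip(images, ground_truth_sets)]
        mean_scores[name] = sum(values) / len(values)
    return mean_scores
```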