Proximity problems ask questions about distances between points
such as various sorts of nearest-neighbor queries. Algorithms and
data structures (typically trees) are studied with theoretical goals
in the field of computational geometry and with practical goals in
many fields such as databases and machine learning.
|
Tutorial: Data Structures for Fast Statistics
| |
Our main interest in proximity methods is a little
different from the usual ones. We are typically not interested in performing,
say, nearest-neighbor queries per se; instead, we use the data
structures that are effective for proximity problems to accelerate
more complicated statistical and machine learning methods: see this
tutorial given at ICML 04.
|
The Proximity Project is a large undertaking begun in summer 2004
which attempts to finally answer the question once and for all "What
is the best nearest-neighbor method?", by implementing and empirically
comparing the best-known published methods from the last three decades
for five variations of the nearest-neighbor searching problem. Source
code and datasets will be available. The results are done but the
tedious writeup is not, though I am doing my best to make this appear
here shortly.
|
|
|
Nearest-Neighbor Classification
| |
Perhaps the most common need for nearest-neighbor search is for the
purpose of nearest-neighbor classification. It turns out,
however, that solving this problem need not be as hard as actually
finding the k nearest neighbors and then finding out what the
majority label is. To determine the correct class label for a query
point, one need only determine whether more of its k nearest
neighbors are of the positive class or of the negative class.
We demonstrate for the first time that this can be achieved exactly using
bounds on distances, in less time than it takes to actually identify
the k points in question using the best state-of-the-art
algorithms. We show, for example, that this can make efficient search
possible for a dataset with over 1 million dimensions and a few
hundred thousand points, something which cannot be achieved with full
nearest-neighbor search. (Liu, Moore, and Gray,
New Algorithms for
Efficient High-dimensional Nonparametric Classification [pdf], [ps] NIPS 2003.) A journal
version has been accepted to JMLR but hasn't appeared yet. Ting Liu has also
developed the multi-class generalization of this algorithm, in another paper.
|
Approximate Nearest-Neighbor
| |
For a finite dataset in high dimension, it may not be possible in practice
to find the nearest neighbor in less time than exhaustive search. So what
can be done?
Much attention has been paid lately to the idea of approximate
nearest-neighbor searching -- instead of finding the exact nearest
neighbor, change the problem to that of finding a neighbor within
1+ε of the distance to the true nearest neighbor. While this may
seem sensible, this criterion says nothing about the rank of
the points returned by such a method -- which may be arbitrarily poor.
In fact the higher the dimension, the more of the points lie within
1+ε of the absolute nearest-neighbor distance -- so it can be that
returning virtually any point is sufficient to solve this problem.
Thus I strongly urge caution regarding using this approach in practice.
Nonetheless, some applications of nearest-neighbor searching can
withstand occasionally egregious errors. Computer vision researchers,
for example, have found this approach useful in some applications.
Our tree-based data structure and associated algorithm yields up to
30x speedups over the best-known methods such as recent hashing approaches.
(Liu, Moore, Gray, and Yang,
An Investigation of
Practical Approximate Nearest-Neighbor Algorithms [pdf], [ps] NIPS 2004.)
|
Although the nearest-neighbor searching problem is posed in terms of a
single query point and a set of reference points to be searched, in
fact it is very often (perhaps most often)
the case that the user has in hand a set
of query points and wants the nearest neighbor for each. It turns out
that this can be done in less time than it takes to process each query
point individually.
See our formulation of this problem (in its most general bichromatic form)
as an N-body problem, which can utilize
virtually any space-partitioning tree. The standard nearest-neighbor
branch-and-bound algorithm is a special case of this algorithm, corresponding
to a query set of size 1.
|
|