Alex Gray - joint work with Andrew Moore.
Abstract
Density estimation is a core operation of virtually all probabilistic
learning methods (as opposed to discriminative methods). Approaches to
density estimation can be divided into two principal classes: parametric
methods, such as Bayesian networks, and nonparametric methods, such as
kernel density estimation and smoothing splines.
While neither choice should be universally preferred for all
situations, a well-known benefit of nonparametric methods is their
ability to achieve estimation optimality for ANY input distribution as
more data are observed, a property that no model with a parametric
assumption can have. The success of nonparametric methods for other
problems, e.g., support vector machines for classification and Gaussian
processes for regression, has also drawn recent attention to
nonparametric learning approaches.
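For orientation, here is a minimal Python sketch (ours, not code from the paper) of the standard Gaussian kernel density estimate: the density at a query point is an average of kernels centered on the N training points, so direct evaluation costs O(N) per query. The function name is illustrative only.

```python
# Minimal sketch of a (naive) Gaussian kernel density estimate.
# Direct evaluation touches every training point: O(N) per query.
import numpy as np

def kde_naive(x, data, h):
    """Gaussian KDE at query x, given (N, d) training data and bandwidth h."""
    n, d = data.shape
    sq_dists = np.sum((data - x) ** 2, axis=1)    # ||x - x_i||^2 for all i
    norm = n * (h * np.sqrt(2.0 * np.pi)) ** d    # Gaussian normalizer
    return np.exp(-0.5 * sq_dists / h ** 2).sum() / norm
```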
To date, however, despite a wealth of advanced underlying statistical theory, the use of nonparametric methods has been limited by their computational intractability on all but the smallest datasets. This intractability stems from the nonparametric representation of a density in terms of the entire training set, causing the cost of estimation to scale with the training set size, a fact which seems inescapable. The cost is further multiplied by the need to reliably find the optimal scale for the estimator.

In this paper, we present an algorithm for kernel density estimation, the chief nonparametric approach, which for the first time reduces the cost of estimating the density at a test point to O(1), or *constant*, time complexity. Empirically and analytically, we show that the method is dramatically faster than previous algorithmic attempts to solve this computational problem, including the FFT and the 'fast-binning' method, in terms of both dataset size and dimensionality. Furthermore, the algorithm provides arbitrarily tight accuracy guarantees, provides anytime convergence, efficiently addresses the computational issue of scale (bandwidth) selection, works for all common kernel choices, and requires no parameter tuning. The algorithm is an instance of a new principle of algorithm design: multi-recursion, or higher-order divide-and-conquer (thesis forthcoming).
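To convey the flavor of the pruning that makes such tree-based schemes fast, the following is a deliberately simplified, single-tree Python sketch of ours (the paper's algorithm is multi-recursive, recursing over *pairs* of tree nodes, and its details are not reproduced here): a space-partitioning tree gives lower and upper bounds on every point's kernel contribution, and a whole node is approximated in one step whenever those bounds fall within a user-set tolerance. All names are illustrative, not from the paper.

```python
# Single-tree sketch of bound-based pruning for kernel summation.
import numpy as np

def gauss(dist, h):
    """Unnormalized Gaussian kernel, a decreasing function of distance."""
    return np.exp(-0.5 * (dist / h) ** 2)

class Node:
    """kd-tree-style node storing its points and their bounding box."""
    def __init__(self, points, leaf_size=16):
        self.points = points
        self.lo, self.hi = points.min(axis=0), points.max(axis=0)
        self.left = self.right = None
        if len(points) > leaf_size:
            dim = np.argmax(self.hi - self.lo)       # split widest dimension
            order = np.argsort(points[:, dim])
            mid = len(points) // 2
            self.left = Node(points[order[:mid]], leaf_size)
            self.right = Node(points[order[mid:]], leaf_size)

def dist_bounds(node, q):
    """Min and max Euclidean distance from query q to node's bounding box."""
    nearest = np.clip(q, node.lo, node.hi)
    farthest = np.where(np.abs(q - node.lo) > np.abs(q - node.hi),
                        node.lo, node.hi)
    return np.linalg.norm(q - nearest), np.linalg.norm(q - farthest)

def kernel_sum(node, q, h, eps):
    """Sum of kernel values over node's points, to within eps/2 per point."""
    dmin, dmax = dist_bounds(node, q)
    kmax, kmin = gauss(dmin, h), gauss(dmax, h)      # kernel is decreasing
    if kmax - kmin <= eps:                           # prune: bounds tight
        return len(node.points) * 0.5 * (kmin + kmax)
    if node.left is None:                            # leaf: sum exactly
        return gauss(np.linalg.norm(node.points - q, axis=1), h).sum()
    return (kernel_sum(node.left, q, h, eps) +
            kernel_sum(node.right, q, h, eps))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 2))                 # toy 2-D dataset
    root, q, h = Node(X), np.zeros(2), 0.5
    s = kernel_sum(root, q, h, eps=1e-3)             # |error| <= N * eps / 2
    density = s / (len(X) * (h * np.sqrt(2 * np.pi)) ** X.shape[1])
```

Far from the data the bounds collapse almost immediately, which is where the bulk of the savings comes from; turning such node-level bounds into a guaranteed, anytime, constant-time-per-query scheme is the subject of the paper.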