Multiresolution kd-tree representations of data clouds
In instance-based learning (also known as memory-based or non-parametric learning), predictions are made by querying high-dimensional numerical data clouds and making inferences from local clusters, locally polynomial patches, and local kernel density estimates. When we make a prediction, we must gather up the effects induced by each datapoint and combine them statistically. Done naively, this is so slow as to render instance-based learning impractical for datasets larger than around 10,000 to 100,000 records. Previous researchers have addressed this either by throwing away data (known as "editing the dataset") or by performing range searches over kd-trees. Range searching cannot help, however, in problems where all datapoints have non-zero effect (such as kernel density estimation), or where the ranges cover a large fraction of the data.
We have been working on new algorithms that avoid these problems. They use a new kind of kd-tree in which statistics about regions of space are maintained at many resolutions simultaneously. For fixed kernel width or range size, the cost of a query grows sub-logarithmically in the number of datapoints. And, unlike "editing", once built, the same tree can be used for arbitrary and changing distance metrics and kernels. As a result, massive model selection (autonomously picking the best out of thousands of computationally expensive learning algorithms) becomes tractable, and we know of no other instance-based systems able to do nearly as much model selection search as ours.
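To make the idea concrete, here is a minimal sketch (in Python, with NumPy) of such a multiresolution tree, under simplifying assumptions: every node caches only a count and a bounding box, the kernel is Gaussian, and a query charges a whole node at once whenever the kernel weight cannot vary by more than a tolerance eps across the node's region. The class, function and parameter names are illustrative, not those of our implementation.

import numpy as np

class MRKDNode:
    """One node of the tree: cached statistics for a region, at every depth."""
    def __init__(self, points, leaf_size=16):
        self.count = len(points)              # number of datapoints in the region
        self.lo = points.min(axis=0)          # bounding box of the region
        self.hi = points.max(axis=0)
        self.points = None
        self.left = self.right = None
        if self.count <= leaf_size:
            self.points = points              # small regions keep their points
        else:
            dim = np.argmax(self.hi - self.lo)        # split the widest dimension
            order = np.argsort(points[:, dim])
            mid = self.count // 2
            self.left = MRKDNode(points[order[:mid]], leaf_size)
            self.right = MRKDNode(points[order[mid:]], leaf_size)

def kde(node, q, h, eps=1e-3):
    """Approximate sum of exp(-||x - q||^2 / (2 h^2)) over all datapoints x."""
    # Closest and farthest any point in the node's box can be from the query q.
    nearest = np.maximum(node.lo - q, np.maximum(q - node.hi, 0.0))
    farthest = np.maximum(np.abs(q - node.lo), np.abs(q - node.hi))
    wmax = np.exp(-np.dot(nearest, nearest) / (2 * h * h))
    wmin = np.exp(-np.dot(farthest, farthest) / (2 * h * h))
    if wmax - wmin < eps:
        # Every datapoint in this region contributes nearly the same weight,
        # so charge the whole node at once using its cached count.
        return node.count * 0.5 * (wmax + wmin)
    if node.points is not None:               # leaf: weigh each point exactly
        d2 = ((node.points - q) ** 2).sum(axis=1)
        return np.exp(-d2 / (2 * h * h)).sum()
    return kde(node.left, q, h, eps) + kde(node.right, q, h, eps)

# Example use on synthetic data:
rng = np.random.default_rng(0)
data = rng.normal(size=(100000, 3))
tree = MRKDNode(data)
print(kde(tree, np.zeros(3), h=0.5))

In this sketch the tree itself never looks at the kernel, the bandwidth or the tolerance; only the query function does, which is why the same structure can be reused as kernels and metrics are varied during model selection.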
Paper on Multiresolution Kernel Methods
More recent paper on Multiresolution methods for locally weighted polynomial regression
Constant-time "counting": ADtree representations of cached sufficient statistics
The ADtree (all-dimensions tree) is a new data structure that can empirically give a 50-fold to 1000-fold computational speedup for machine learning methods such as Bayes net structure finders, rule learners and feature selectors operating on large datasets. Many machine learning algorithms need to do vast numbers of counting queries (e.g. "How many records have Color=Blue, Nationality=British, Smoker=False, Status=Married, Nose=Big?"). Similarly, many algorithms need to build and test many contingency tables. We are researching methods for which, subject to certain assumptions, the costs of these kinds of operations can be shown to be independent of the number of records in the dataset and loglinear in the number of non-zero entries in the contingency table.
The ADtree is a very sparse data structure designed to minimize memory use. We can provide analytical worst-case memory and construction-cost bounds for this structure under several models of data distribution. We have empirically demonstrated that tractably sized data structures can be produced for large real-world datasets by (a) using a sparse tree structure that never allocates memory for counts of zero, (b) never allocating memory for counts that can be deduced from other counts, and (c) not expanding the tree fully near its leaves. We are interested in investigating possible uses of ADtrees in other machine learning methods, and in comparing the merits of ADtrees with alternative representations such as kd-trees, R-trees and frequent sets.
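To make the structure concrete, below is a minimal Python sketch of an ADtree-style count cache over a dataset of categorical records (tuples of small integers). It covers optimisations (a) and (b) above; the leaf-list optimisation (c) and the analytical bounds are omitted, and the class and function names are illustrative rather than taken from our implementation.

class VaryNode:
    """For one attribute below an ADNode: one child per value, except the
    most common value (MCV), whose counts are deduced by subtraction."""
    def __init__(self, records, dataset, attr, n_attrs):
        by_value = {}
        for r in records:                     # partition records by this attribute
            by_value.setdefault(dataset[r][attr], []).append(r)
        self.mcv = max(by_value, key=lambda v: len(by_value[v]))
        # (a) values that never occur get no node; (b) the MCV gets no node either
        self.children = {v: ADNode(rs, dataset, attr + 1, n_attrs)
                         for v, rs in by_value.items() if v != self.mcv}

class ADNode:
    """Caches the count for one conjunctive query prefix, plus refinements
    over every attribute that may still be added to the query."""
    def __init__(self, records, dataset, first_attr, n_attrs):
        self.count = len(records)
        self.vary = {a: VaryNode(records, dataset, a, n_attrs)
                     for a in range(first_attr, n_attrs)}

def query(node, q):
    """Count the records matching q, a dict {attribute index: value}."""
    if node is None:                          # a missing node means a zero count
        return 0
    if not q:
        return node.count
    a = min(q)                                # consume attributes in index order
    rest = {b: w for b, w in q.items() if b != a}
    vary = node.vary[a]
    if q[a] == vary.mcv:
        # Deduce the MCV count: the total for 'rest' minus every other value of a.
        return query(node, rest) - sum(query(c, rest) for c in vary.children.values())
    return query(vary.children.get(q[a]), rest)

# Example: four records over three binary attributes.
data = [(0, 1, 1), (0, 1, 0), (1, 0, 1), (0, 0, 1)]
tree = ADNode(list(range(len(data))), data, 0, 3)
print(query(tree, {0: 0, 2: 1}))              # -> 2

Once the tree is built, the dataset is never rescanned: a query's cost depends on how many non-MCV branches the subtraction step must visit, not on the number of records.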
Probabilistic reasoning and belief networks
We show how the ADtree can be used to accelerate Bayes net structure finding algorithms, and we are currently exploiting this technology (with which structure searches can now happen at interactive speeds) in a number of new projects.
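As a rough illustration of why fast counting matters here, the sketch below (reusing the illustrative query helper from the ADtree sketch above) assembles the contingency table for one candidate parent set and computes the log-likelihood term from which typical structure-scoring functions are built; the function and argument names are our own, not the project's.

from itertools import product
from math import log

def loglik(tree, child, child_arity, parents, parent_arities):
    """Sum over the contingency table of N(child=v, parents=u) *
    log(N(child=v, parents=u) / N(parents=u)), using only counting queries."""
    total = 0.0
    for u in product(*(range(k) for k in parent_arities)):
        parent_q = dict(zip(parents, u))
        n_u = query(tree, parent_q)
        if n_u == 0:
            continue                          # empty parent configuration
        for v in range(child_arity):
            n_uv = query(tree, {**parent_q, child: v})
            if n_uv:
                total += n_uv * log(n_uv / n_u)
    return total

# Example, reusing the toy tree built in the ADtree sketch above:
# score attribute 2 with attribute 0 as its only (binary) parent.
print(loglik(tree, child=2, child_arity=2, parents=[0], parent_arities=[2]))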
Fast Association Rule Learning
We have also applied ADtrees to learning association rules quickly.
A new paper on ADtrees applied to rule learning algorithms such as CN2.
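For example, the two quantities a rule miner needs, the support and confidence of a candidate rule X -> Y, reduce to a pair of counting queries; the sketch below is again written against the illustrative query helper from the ADtree sketch above, with hypothetical names.

def rule_stats(tree, X, Y, n_records):
    """Support and confidence of the rule X -> Y, where X and Y are dicts of
    {attribute index: value} over disjoint attributes."""
    n_x = query(tree, X)
    n_xy = query(tree, {**X, **Y})
    support = n_xy / n_records
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence

# Example, reusing the toy tree from the ADtree sketch above:
print(rule_stats(tree, {0: 0}, {2: 1}, n_records=4))   # -> (0.5, 0.666...)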