Ting Liu's Ph.D. thesis

Thesis Proposal

I did my thesis proposal at 11:00 EST Feb. 2nd, 2005
The printable version of my thesis proposal: [PDF] [PS] [PROPOSAL SLIDES]

Thesis Oral

Title

Fast Nonparametric Machine Learning Algorithms for High-dimensional Massive Data and Applications

Time and Date

14:00 EDT March 17th, 2006

Place

3305 Newell Simon Hall

Thesis Committee

Andrew W. Moore (chair)

Martial Hebert

Jeff Schneider

Trevor Darrell (MIT, CS & AI Laboratory)

Abstract

Nonparametric methods have become increasingly popular in statistics and probabilistic AI communities. One well-known nonparametric method is ``nearest-neighbor''. It uses the observations in the training set T closest in input space to a query q to form the prediction of q. Specifically, when k of the observations in T are considered, it is called k-nearest-neighbor (or knn).

Despite its simplicity, knn and its variants have been successful in many machine learning problems, including pattern recognition, text categorization, information retrieval, computational statistics, database and data mining. It is also used for estimating sample distributions and Bayes error.

However, knn and many related nonparametric methods remain hampered by their computational complexity. Many spatial methods, such as metric-trees, have been proposed to alleviate the computational cost, but the effectiveness of these methods decreases as the number of dimensions of feature vectors increases. From another direction, researchers are trying to develop ways to find approximate answers. The premise of this research is that in many cases it is not necessary to insist on the exact answers; instead, determining an approximate answer should be sufficient. In fact, some approximate methods show good performance in a number of applications, and some methods enjoy very good theoretical soundness. However, when facing hundreds or thousands dimensions, many algorithms do not work well in reality.

I propose four new spatial methods for fast knn and its variants, namely KNS2, KNS3, IOC and spill-tree. The first three algorithms are designed to speed up knn classification problems, and they all share the same insight that finding the majority class among the knn of q need not to explicitly find those k-nearest-neighbors. Spill-tree is designed for approximate knn search. By adapting metric-trees to a more flexible data structure, spill-tree is able to adapt to the distribution of data and it scales well even for huge high-dimensional data sets. Significant efficiency improvement has been observed comparing to LSH (localify sensitive hashing), the state of art approximate knn algorithm. We applied spill-tree to three real-world applications: shot video segmentation, drug activity detection and image clustering, which I will explain in the thesis.

Links

[thesis (PDF)] [thesis (PS)]