Brigham Anderson, Andrew Moore and David Cohn (2000). A Nonparametric Approach to Noisy and Costly Optimization. Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, CA.
Low rank approximation techniques are widespread in pattern recognition research --- they include Latent Semantic Analysis (LSA), Probabilistic LSA, Principal Components Analysis (PCA), the Generative Aspect Model, and many forms of bibliometric analysis. All make use of a low-dimensional manifold onto which data are projected.
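The shared mechanism can be sketched in a few lines: center the data, take a truncated SVD, and reconstruct from the top singular directions. A minimal PCA-style sketch in NumPy; the rank `k` and the random data matrix are illustrative assumptions, not data from the paper.

```python
import numpy as np

def low_rank_project(X, k):
    """Return the rank-k reconstruction of the data matrix X (PCA-style)."""
    mu = X.mean(axis=0)
    Xc = X - mu                                   # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Keep only the top-k singular directions (the low-dimensional manifold).
    return mu + (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X5 = low_rank_project(X, 5)                       # rank-5 approximation
```

Larger `k` can only reduce the reconstruction error, which is why these methods trade approximation quality against dimensionality.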
Such techniques are generally "unsupervised," which allows them to model data in the absence of labels or categories. In many practical problems, however, some prior knowledge is available in the form of context. In this paper, I describe a principled approach to incorporating such information, and demonstrate its application to PCA-based approximations of several data sets.
We describe a joint probabilistic model for modeling the contents and inter-connectivity of document collections such as sets of web pages or research paper archives. The model is based on a probabilistic factor decomposition and allows identifying principal topics of the collection as well as authoritative documents within those topics. Furthermore, the relationships between topics are mapped out in order to build a predictive model of link content. Among the many applications of this approach are information retrieval and search, topic identification, query disambiguation, focused web crawling, web authoring, and bibliometric analysis.
This paper investigates whether a machine can automatically learn the task of finding, within a large collection of candidate responses, the answers to questions. The learning process consists of inspecting a collection of answered questions and characterizing the relation between question and answer with a statistical model. For the purpose of learning this relation, we propose two sources of data: Usenet FAQ documents and customer service call-center dialogues from a large retail company. We will show that the task of "answer-finding" differs from both document retrieval and traditional question-answering, presenting challenges different from those found in these problems. The central aim of this work is to discover, through theoretical and empirical investigation, those statistical techniques best suited to the answer-finding problem.
We describe a model of document citation that learns to identify hubs and authorities in a set of linked documents, such as pages retrieved from the world wide web, or papers retrieved from a research paper archive. Unlike the popular HITS algorithm, which relies on dubious statistical assumptions, our model provides probabilistic estimates that have clear semantics. We also find that in general, the identified authoritative documents correspond better to human intuition.
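For contrast, the standard HITS iteration that the abstract refers to can be sketched as a mutually reinforcing power iteration over the link matrix; the tiny adjacency matrix below is an illustrative assumption, not data from the paper.

```python
import numpy as np

def hits(adj, iters=50):
    """Classic HITS: adj[i, j] = 1 if document i links to document j."""
    n = adj.shape[0]
    hubs = np.ones(n)
    auths = np.ones(n)
    for _ in range(iters):
        auths = adj.T @ hubs      # authorities are pointed to by good hubs
        hubs = adj @ auths        # hubs point to good authorities
        auths /= np.linalg.norm(auths)
        hubs /= np.linalg.norm(hubs)
    return hubs, auths

# Documents 0 and 1 both link to 2; 2 links to nothing.
adj = np.array([[0, 0, 1],
                [0, 0, 1],
                [0, 0, 0]], dtype=float)
hubs, auths = hits(adj)
```

The scores here are eigenvector quantities with no probabilistic interpretation, which is the gap the abstract's model is meant to close.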
The proliferation of hypertext and the popularity of Kleinberg's HITS algorithm have brought about an increased interest in link analysis. While HITS and its older relatives from the Bibliometrics literature provide a method for finding authoritative sources on a particular topic, they do not allow individual users to inject their own opinions about what sources are authoritative. This paper presents a learning technique which incorporates user feedback to adjust its model of document authority so that it better corresponds to the user's internal model. We demonstrate how this technique can be used to generate user-specific authority lists. We present experimental results based on a database of about one million references collected as part of the Cora on-line index of the computer science literature.
We describe a simple active learning heuristic which greatly enhances the generalization behavior of support vector machines (SVMs) on several practical document classification tasks. We observe a number of benefits, the most surprising of which is that an SVM trained on a well-chosen subset of the available corpus frequently performs better than one trained on all available data. The heuristic for choosing this subset is simple to compute, and makes no use of information about the test set. Given that the training time of SVMs depends heavily on the training set size, our heuristic not only offers better performance with fewer data, it frequently does so in less time than the naive approach of training on all available data.
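A minimal sketch of a margin-based heuristic of this kind, assuming (as one natural reading) that the subset is grown by repeatedly labeling the unlabeled example closest to the current decision boundary; the synthetic 2-D data, seed set, and query budget are all illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # linearly separable toy labels

# Seed with a few labeled examples from each class.
labeled = list(np.flatnonzero(y == 1)[:5]) + list(np.flatnonzero(y == 0)[:5])
pool = [i for i in range(len(X)) if i not in set(labeled)]

clf = SVC(kernel="linear", C=1.0)
for _ in range(20):                        # 20 queries
    clf.fit(X[labeled], y[labeled])
    margins = np.abs(clf.decision_function(X[pool]))
    pick = pool.pop(int(margins.argmin()))  # closest to the hyperplane
    labeled.append(pick)

clf.fit(X[labeled], y[labeled])
acc = clf.score(X, y)
```

Only 30 of 200 examples ever get labeled, yet the margin-queried subset concentrates exactly where the decision boundary is uncertain.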
This paper describes pairwise bisection, a nonparametric approach to optimizing a noisy function with few function evaluations. The algorithm uses nonparametric reasoning about simple geometric relationships to find minima efficiently. Two factors often frustrate optimization: noise and cost. Output can contain significant quantities of noise or error, while time or money allow for only a handful of experiments. Pairwise bisection is used here to attempt to automate the process of robust and efficient experiment design. Real world functions also tend to violate traditional assumptions of continuity and Gaussian noise. Since nonparametric statistics do not depend on these assumptions, this algorithm can optimize a wide variety of phenomena with fewer restrictions placed on noise. The algorithm's performance is compared to that of three competing algorithms, Amoeba, PMAX and Q2, on several different test functions. Results on these functions indicate competitive performance and superior resistance to noise.
We are frequently called upon to perform multiple tasks that compete for our attention and resources. Often we know the optimal solution to each task in isolation; in this paper, we describe how this knowledge can be exploited to efficiently find good solutions for doing the tasks in parallel. We formulate this problem as that of dynamically merging multiple Markov decision processes (MDPs) into a composite MDP, and present a new theoretically-sound dynamic programming algorithm for finding an optimal policy for the composite MDP. We analyze various aspects of our algorithm and illustrate its use on a simple merging problem.
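Since the merged process is itself an MDP, the dynamic programming at its core can be illustrated with standard value iteration; the tiny two-state, two-action MDP below is an illustrative stand-in for a composite process, not the paper's algorithm.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[a, s, s'] = transition probability; R[a, s] = expected reward."""
    n = P.shape[1]
    V = np.zeros(n)
    while True:
        Q = R + gamma * (P @ V)                  # Q[a, s]
        V_new = Q.max(axis=0)                    # best action per state
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=0)       # values, greedy policy
        V = V_new

P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay put
              [[0.0, 1.0], [1.0, 0.0]]])  # action 1: switch states
R = np.array([[0.0, 1.0],                 # staying in state 1 pays off
              [0.0, 0.0]])
V, policy = value_iteration(P, R)
```

The optimal policy switches out of state 0 and then stays in state 1, whose value converges to 1/(1 - gamma) = 10.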
I describe a querying criterion that attempts to minimize the error of a learner by minimizing its estimated squared bias. I describe experiments with locally-weighted regression on two simple problems, and observe that this `bias-only' approach outperforms the more common `variance-only' exploration approach, even in the presence of noise.
Predictions of lifetimes of dynamically allocated objects can be, and have been, used to improve the time and space efficiency of dynamic memory management in computer programs. Barrett and Zorn (1993) used a simple lifetime predictor and demonstrated this improvement on a variety of computer programs. In this paper, we use decision trees to do lifetime prediction on the same programs and show significantly better prediction. Our method also has the advantage that during training we can use a large number of features and let the decision tree automatically choose the relevant subset.
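The approach can be sketched with an off-the-shelf decision tree: predict from features available at allocation time whether an object will be short-lived. The synthetic features (request size, call-site identifier) and the labeling rule below are illustrative assumptions, not the paper's programs or feature set.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 1000
size = rng.integers(8, 4096, n)          # requested bytes
call_site = rng.integers(0, 20, n)       # call-site identifier
# Pretend small allocations from low-numbered call sites die young.
short_lived = ((size < 256) & (call_site < 10)).astype(int)

X = np.column_stack([size, call_site])
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, short_lived)
acc = tree.score(X, short_lived)
```

Because the tree selects its own split features, irrelevant inputs can be thrown into `X` freely, which is the practical advantage the abstract highlights.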
For many types of machine learning algorithms, one can compute the statistically `optimal' way to select training data. In this paper, we review how optimal data selection techniques have been used with feedforward neural networks. We then show how the same principles may be used to select data for two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression. While the techniques for neural networks are computationally expensive and approximate, the techniques for mixtures of Gaussians and locally weighted regression are both efficient and accurate. Empirically, we observe that the optimality criterion sharply decreases the number of training examples the learner needs in order to achieve good performance.
Active learning differs from passive "learning from examples" in that the learning algorithm assumes at least some control over what part of the input domain it receives information about. In some situations, active learning is provably more powerful than learning from examples alone, giving better generalization for a fixed number of training examples. In this paper, we consider the problem of learning a binary concept in the absence of noise (Valiant 1984). We describe a formalism for active concept learning called selective sampling, and show how it may be approximately implemented by a neural network. In selective sampling, a learner receives distribution information from the environment and queries an oracle on parts of the domain it considers "useful." We test our implementation, called an SG-network, on three domains, and observe significant improvement in generalization.
We examine how the performance of a memoryless vector quantizer changes as a function of its training set size. Specifically, we study how well the training set distortion predicts test distortion when the training set is a randomly drawn subset of blocks from the test or training image(s). Using the Vapnik-Chervonenkis dimension, we derive formal bounds for the difference of test and training distortion of vector quantizer codebooks. We then describe extensive empirical simulations that test these bounds for a variety of bit rates and vector dimensions, and give practical suggestions for determining the training set size necessary to achieve good generalization from a codebook. We conclude that, by using training sets comprising only a small fraction of the available data, one can produce results that are close to the results obtainable when all available data are used.
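The empirical question can be reproduced in miniature: design a codebook on a random subset and compare training distortion to held-out distortion as the subset grows. Here k-means stands in for codebook design, and the Gaussian "blocks" and subset sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 4))        # 4-dimensional "blocks"
test = rng.normal(size=(2000, 4))        # held-out blocks

def distortion(codebook, X):
    """Mean squared distance from each vector to its nearest codeword."""
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean()

gaps = []
for m in (50, 500):                      # training set sizes
    vq = KMeans(n_clusters=16, n_init=5, random_state=0).fit(data[:m])
    gaps.append(distortion(vq.cluster_centers_, test)
                - distortion(vq.cluster_centers_, data[:m]))
```

With only 50 training blocks the 16-codeword codebook overfits and the train/test gap is large; at 500 blocks the gap shrinks, which is the qualitative behavior the VC-style bounds quantify.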
Consider the problem of learning input/output mappings through exploration, e.g. learning the kinematics or dynamics of a robotic manipulator. If actions are expensive and computation is cheap, then we should explore by selecting a trajectory through the input space which gives us the most information in the fewest steps. I discuss how results from the field of optimal experiment design may be used to guide such exploration, and demonstrate its use on a simple kinematics problem.
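A minimal sketch of experiment-design-guided exploration for a linear-in-features model: among candidate inputs, greedily query the one with the largest prediction variance x^T (X^T X)^(-1) x under the current design. The 1-D quadratic features, candidate grid, and ridge term are illustrative assumptions.

```python
import numpy as np

def features(x):
    """Quadratic feature map [1, x, x^2]."""
    return np.stack([np.ones_like(x), x, x**2], axis=-1)

candidates = np.linspace(-1.0, 1.0, 21)
X = features(np.array([0.0, 0.1]))        # a small initial design
ridge = 1e-6 * np.eye(3)                  # keeps the inverse well-defined

queries = []
for _ in range(5):
    cov = np.linalg.inv(X.T @ X + ridge)
    F = features(candidates)
    var = np.einsum("ij,jk,ik->i", F, cov, F)   # x^T cov x per candidate
    pick = candidates[var.argmax()]             # most informative query
    queries.append(pick)
    X = np.vstack([X, features(np.array([pick]))])
```

Because the initial design is clustered near zero, the criterion immediately sends the learner to an extreme of the input range, where prediction variance is highest.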
I consider the question "How should one act when the only goal is to learn as much as possible?" Building on the theoretical results of Fedorov and MacKay, I apply techniques from Optimal Experiment Design (OED) to guide the query/action selection of a neural network learner. I demonstrate that these techniques allow the learner to minimize its generalization error by exploring its domain efficiently and completely. I conclude that, while not a panacea, OED-based query/action selection has much to offer, especially in domains where its high computational costs can be tolerated.