Talk abstracts

Francis Bach (INRIA & ENS)
Title: Sharp Analysis of Random Feature Expansions
Abstract: Random feature expansions provide a simple way to avoid the usual quadratic running-time complexity of kernel methods. In this talk, I will present recent results about the approximation properties of these expansions. In particular, I will provide improved bounds on the number of features needed for a given approximation quality.
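
To make the object of study concrete, here is a minimal sketch of a random feature expansion, using random Fourier features for the Gaussian (RBF) kernel; the function names and parameter values are illustrative, not taken from the talk:

    import numpy as np

    def random_fourier_features(X, n_features=500, gamma=1.0, seed=0):
        """Map X (n_samples, d) to random features whose inner products
        approximate the RBF kernel exp(-gamma * ||x - y||^2)."""
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        # Frequencies sampled from the kernel's spectral density (a
        # Gaussian for the RBF kernel), plus uniform phase shifts.
        W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
        b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

    # z(x)^T z(y) concentrates around k(x, y) as n_features grows, so a
    # linear method on the features costs O(n * n_features) rather than
    # the O(n^2) needed to form the full kernel matrix; bounds of the
    # kind the talk discusses quantify how large n_features must be.
    X = np.random.default_rng(1).normal(size=(6, 3))
    Z = random_fourier_features(X, n_features=5000, gamma=0.5)
    K_exact = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))
    print(np.abs(Z @ Z.T - K_exact).max())  # shrinks as n_features grows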

Michael Mahoney (Berkeley)
Title: Using Local Spectral Methods in Theory and in Practice
Abstract: Local spectral methods are algorithms that touch only a small part of a large data graph and yet come with locally-biased versions of the Cheeger-like quality-of-approximation guarantees that make the usual global spectral methods so popular. Since they touch only a small part of a large data graph, these methods come with strong scalability guarantees, and they can be applied to graphs with hundreds of millions or billions of nodes. Moreover, due to implicit regularization, they also come with interesting statistical guarantees, and they perform quite well in many practical situations. We will describe the basic ideas underlying these methods, how they tend to perform in practice at identifying different types of structure in data, and how an understanding of their implicit regularization properties leads to novel methods for robustifying graph-based learning algorithms to the peculiarities of data preprocessing decisions.
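
As one concrete example of the kind of local algorithm meant here, below is a sketch of the push procedure for approximate personalized PageRank (in the style of Andersen, Chung, and Lang); the graph encoding and parameter values are illustrative assumptions:

    from collections import defaultdict, deque

    def approximate_ppr(graph, seed, alpha=0.15, eps=1e-4):
        """Push method for approximate personalized PageRank.
        `graph` maps each node to its list of neighbors. Only nodes
        near `seed` are ever touched, so the cost depends on eps and
        alpha, not on the total size of the graph."""
        p = defaultdict(float)   # approximate PageRank mass
        r = defaultdict(float)   # residual mass still to be pushed
        r[seed] = 1.0
        queue = deque([seed])
        while queue:
            u = queue.popleft()
            deg = len(graph[u])
            if r[u] < eps * deg:
                continue                       # stale queue entry
            p[u] += alpha * r[u]               # keep a fraction at u
            push = (1.0 - alpha) * r[u] / 2.0
            r[u] = push                        # half the rest stays at u...
            for v in graph[u]:                 # ...half goes to neighbors
                was_below = r[v] < eps * len(graph[v])
                r[v] += push / deg
                if was_below and r[v] >= eps * len(graph[v]):
                    queue.append(v)
            if r[u] >= eps * deg:
                queue.append(u)
        return dict(p)

    # Sweeping over nodes ordered by p[u] / deg(u) then yields a
    # locally-biased cluster with a Cheeger-like conductance guarantee.
    g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
    print(approximate_ppr(g, seed=0))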

Yee Whye Teh (Oxford)
Title: Random Tensor Decompositions for Regression and Collaborative Filtering
Abstract: In this talk I will present some ongoing work by Xiaoyu Lu, Hyunjik Kim, Seth Flaxman and myself on approximations for efficiently learning Gaussian processes and kernel methods. Our approximation is applicable when the kernel has Kronecker structure, but the data need not lie on a grid. The idea is to make use of random feature expansions, low-rank tensors, and recent advances in stochastic gradient MCMC / Variational Inference / Descent. We will also present how this can be used in a novel formulation for collaborative filtering with side information using Gaussian processes, arguing that it is more natural than current proposals for using GPs in collaborative filtering, and showing interesting connections between our approximations and low-rank matrix factorization approaches to collaborative filtering.
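
To unpack "Kronecker structure": if inputs lie on a grid of pairs and the kernel factorizes as a product, the Gram matrix is a Kronecker product and never needs to be formed explicitly. A small illustration of that identity follows (the kernels and sizes are made up, and the talk's contribution is precisely the harder case where the data are not on a grid):

    import numpy as np

    def rbf(X, Y, gamma=1.0):
        return np.exp(-gamma * ((X[:, None] - Y[None]) ** 2).sum(-1))

    # A grid of inputs: all pairs (a_i, b_j). With a product kernel
    # k((a, b), (a', b')) = k1(a, a') * k2(b, b'), the full Gram matrix
    # is the Kronecker product K1 kron K2.
    A = np.linspace(0.0, 1.0, 4)[:, None]   # 4 "row" inputs
    B = np.linspace(0.0, 1.0, 5)[:, None]   # 5 "column" inputs
    K1, K2 = rbf(A, A), rbf(B, B)
    v = np.random.default_rng(0).normal(size=4 * 5)

    # Naive matrix-vector product: materialize the 20 x 20 Gram matrix.
    slow = np.kron(K1, K2) @ v

    # Kronecker identity: (K1 kron K2) vec(V) = vec(K1 V K2^T) for
    # row-major vec, so GP computations on grids avoid the full matrix.
    fast = (K1 @ v.reshape(4, 5) @ K2.T).ravel()
    print(np.allclose(slow, fast))  # True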

Fei Sha (UCLA)
Title: Do shallow kernel methods match deep neural networks -- and if not, what can the shallow ones learn from the deep ones?
Abstract: Deep neural networks (DNNs) and other types of deep learning architectures have been hugely successful in a large number of applications. By contrast, kernel methods, which were once exceedingly popular, have become lackluster. The crippling obstacle is the computational complexity of those methods. Nonetheless, there has been a resurgence of interest in these methods. In particular, several research groups have studied how to scale kernel methods to cope with large-scale learning problems. Despite such progress, there has not been a systematic, head-on comparison between kernel methods and DNNs. Specifically, while recent approaches have shown exciting promise, we are still left with at least one nagging unanswered question: can kernel methods, after being scaled up for large datasets, truly match DNN performance? In this talk, I will describe our efforts in (partially) answering that question. I will present extensive empirical studies comparing kernel methods and DNNs for automatic speech recognition, a key field to which DNNs have been applied. Our investigative studies highlight the similarities and differences between these two paradigms. I will leave our main conclusion as a surprise.

Jean-Philippe Vert (Mines ParisTech & Curie Institute)
Title: Learning from Rankings
Abstract: In many applications such as genomics, high-dimensional data are often subject to technical variability, such as noise or batch effects, which is difficult to remove or model. If the variability approximately preserves the relative order of the features within each sample, then one can keep only the relative order of features to characterize each sample, resulting in a representation of each sample as a permutation over the set of features. In this talk, I will discuss several new methods for supervised and unsupervised classification of such permutations, including new positive definite kernels on the symmetric group and a new method for supervised full-quantile normalization, illustrating the benefits of these techniques on cancer patient stratification from noisy gene expression and mutation data.
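
As a sketch of the representation being described, the snippet below converts each sample to the permutation induced by ranking its features, and compares two samples with the Kendall tau statistic, one known example of a positive definite kernel on permutations; the data are invented for illustration:

    import numpy as np
    from itertools import combinations

    def to_ranks(x):
        """Replace each feature value by its rank within the sample,
        keeping only the relative order of the features."""
        return np.argsort(np.argsort(x))

    def kendall_kernel(x, y):
        """Kendall tau between the rankings induced by x and y (no
        ties assumed): fraction of concordant minus discordant feature
        pairs. This is a positive definite kernel on permutations."""
        rx, ry = to_ranks(x), to_ranks(y)
        n = len(x)
        s = sum(np.sign(rx[i] - rx[j]) * np.sign(ry[i] - ry[j])
                for i, j in combinations(range(n), 2))
        return 2.0 * s / (n * (n - 1))

    # The representation is invariant to any monotone, sample-wise
    # distortion: b is a rescaled, shifted copy of a, so both samples
    # induce the same permutation.
    a = np.array([0.1, 2.0, 0.5, 3.0])
    b = 10.0 * a + 7.0
    print(kendall_kernel(a, b))  # 1.0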

Amr Ahmed (Google)
Title: Dirichlet-Hawkes Processes with Applications to Clustering Continuous-Time Document Streams
Abstract: Clustering in document streams, such as online news articles, can be induced by their textual contents as well as by the temporal dynamics of their arrival patterns. Can we leverage both sources of information to obtain a better clustering of the documents, and distill information that is not possible to extract using contents alone? In this talk, I will describe a novel random process, referred to as the Dirichlet-Hawkes process, that takes both kinds of information into account in a unified framework. A distinctive feature of the proposed model is that the preferential attachment of items to clusters according to cluster sizes, present in Dirichlet processes, is now driven by the intensities of cluster-wise self-exciting temporal point processes, the Hawkes processes. This new model establishes a previously unexplored connection between Bayesian nonparametrics and temporal point processes: the number of clusters grows to accommodate the increasing complexity of online streaming contents, while at the same time the model adapts to the ever-changing dynamics of the continuous arrival times. Large-scale experiments on both synthetic and real-world news articles showed that Dirichlet-Hawkes processes can recover both meaningful topics and temporal dynamics, which leads to better predictive performance in terms of content perplexity and arrival times of future documents.
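
A minimal sketch of the seating rule this describes, under illustrative assumptions (an exponential triggering kernel, made-up parameters, and the textual likelihood term omitted): an arriving document joins an existing cluster with probability proportional to that cluster's Hawkes intensity, or opens a new cluster with probability proportional to a concentration parameter.

    import numpy as np

    def hawkes_intensity(t, event_times, alpha=0.8, beta=1.0):
        """Self-exciting intensity: each past event at time s < t
        contributes a decaying bump alpha * exp(-beta * (t - s))."""
        past = event_times[event_times < t]
        return alpha * np.exp(-beta * (t - past)).sum()

    def dhp_cluster_probs(t, clusters, theta=1.0):
        """Dirichlet-Hawkes seating rule (sketch): a document arriving
        at time t joins existing cluster k with probability proportional
        to that cluster's intensity at t, or starts a new cluster with
        probability proportional to the concentration theta. The full
        model also weighs each choice by the document's text likelihood."""
        rates = [hawkes_intensity(t, times) for times in clusters]
        weights = np.append(rates, theta)
        return weights / weights.sum()

    # Two clusters of past arrival times: the recently active cluster
    # attracts the new document far more strongly than the stale one,
    # replacing the size-based preferential attachment of a plain
    # Dirichlet process.
    clusters = [np.array([0.5, 1.0, 9.5, 9.8]), np.array([2.0, 3.0])]
    print(dhp_cluster_probs(t=10.0, clusters=clusters))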