Talk abstracts

Francis Bach (INRIA & ENS)
Title: Sharp Analysis of Random Feature Expansions
Abstract: Random feature expansions provide a simple way to avoid the usual quadratic running-time complexity of kernel methods. In this talk, I will present recent results about the approximation properties of these expansions. In particular, I will provide improved bounds on the number of features needed for a given approximation quality.
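
To make the object of study concrete, here is a minimal sketch of a random feature expansion, using random Fourier features for the Gaussian (RBF) kernel; the function names and parameter values are illustrative, not taken from the talk:

    import numpy as np

    def random_fourier_features(X, n_features=500, gamma=1.0, seed=0):
        """Map X (n_samples, d) to random features whose inner products
        approximate the RBF kernel exp(-gamma * ||x - y||^2)."""
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        # Frequencies sampled from the kernel's spectral density (a
        # Gaussian for the RBF kernel), plus uniform phase shifts.
        W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
        b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

    # z(x)^T z(y) concentrates around k(x, y) as n_features grows, so a
    # linear method on the features costs O(n * n_features) rather than
    # the O(n^2) needed to form the full kernel matrix; bounds of the
    # kind the talk discusses quantify how large n_features must be.
    X = np.random.default_rng(1).normal(size=(6, 3))
    Z = random_fourier_features(X, n_features=5000, gamma=0.5)
    K_exact = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))
    print(np.abs(Z @ Z.T - K_exact).max())  # shrinks as n_features grows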

Michael Mahoney (Berkeley)
Title: Using Local Spectral Methods in Theory and in Practice
Abstract: Local spectral methods are algorithms that touch only a small part of a large data graph and yet come with locally-biased versions of the Cheeger-like quality-of-approximation guarantees that make the usual global spectral methods so popular. Since they touch only a small part of a large data graph, these methods come with strong scalability guarantees, and they can be applied to graphs with hundreds of millions or billions of nodes. Moreover, due to implicit regularization, they also come with interesting statistical guarantees, and they perform quite well in many practical situations. We will describe the basic ideas underlying these methods, how they tend to perform in practice at identifying different types of structure in data, and how an understanding of their implicit regularization properties leads to novel methods for robustifying graph-based learning algorithms to the peculiarities of data preprocessing decisions.
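
As one concrete example of the kind of local algorithm meant here, below is a sketch of the push procedure for approximate personalized PageRank (in the style of Andersen, Chung, and Lang); the graph encoding and parameter values are illustrative assumptions:

    from collections import defaultdict, deque

    def approximate_ppr(graph, seed, alpha=0.15, eps=1e-4):
        """Push method for approximate personalized PageRank.
        `graph` maps each node to its list of neighbors. Only nodes
        near `seed` are ever touched, so the cost depends on eps and
        alpha, not on the total size of the graph."""
        p = defaultdict(float)   # approximate PageRank mass
        r = defaultdict(float)   # residual mass still to be pushed
        r[seed] = 1.0
        queue = deque([seed])
        while queue:
            u = queue.popleft()
            deg = len(graph[u])
            if r[u] < eps * deg:
                continue                       # stale queue entry
            p[u] += alpha * r[u]               # keep a fraction at u
            push = (1.0 - alpha) * r[u] / 2.0
            r[u] = push                        # half the rest stays at u...
            for v in graph[u]:                 # ...half goes to neighbors
                was_below = r[v] < eps * len(graph[v])
                r[v] += push / deg
                if was_below and r[v] >= eps * len(graph[v]):
                    queue.append(v)
            if r[u] >= eps * deg:
                queue.append(u)
        return dict(p)

    # Sweeping over nodes ordered by p[u] / deg(u) then yields a
    # locally-biased cluster with a Cheeger-like conductance guarantee.
    g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
    print(approximate_ppr(g, seed=0))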

Yee Whye Teh (Oxford)
Title: Random Tensor Decompositions for Regression and Collaborative Filtering
Abstract: In this talk I will present some ongoing work by Xiaoyu Lu, Hyunjik Kim, Seth Flaxman and myself on approximations for efficiently learning Gaussian processes and kernel methods. Our approximation is applicable when the kernel has Kronecker structure, but the data need not lie on a grid. The idea is to make use of random feature expansions, low-rank tensors, and recent advances in stochastic gradient MCMC / Variational Inference / Descent. We will also present how this can be used in a novel formulation for collaborative filtering with side information using Gaussian processes, arguing that it is more natural than current proposals for using GPs in collaborative filtering, and showing interesting connections between our approximations and low-rank matrix factorization approaches to collaborative filtering.
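
To unpack "Kronecker structure": if inputs lie on a grid of pairs and the kernel factorizes as a product, the Gram matrix is a Kronecker product and never needs to be formed explicitly. A small illustration of that identity follows (the kernels and sizes are made up, and the talk's contribution is precisely the harder case where the data are not on a grid):

    import numpy as np

    def rbf(X, Y, gamma=1.0):
        return np.exp(-gamma * ((X[:, None] - Y[None]) ** 2).sum(-1))

    # A grid of inputs: all pairs (a_i, b_j). With a product kernel
    # k((a, b), (a', b')) = k1(a, a') * k2(b, b'), the full Gram matrix
    # is the Kronecker product K1 kron K2.
    A = np.linspace(0.0, 1.0, 4)[:, None]   # 4 "row" inputs
    B = np.linspace(0.0, 1.0, 5)[:, None]   # 5 "column" inputs
    K1, K2 = rbf(A, A), rbf(B, B)
    v = np.random.default_rng(0).normal(size=4 * 5)

    # Naive matrix-vector product: materialize the 20 x 20 Gram matrix.
    slow = np.kron(K1, K2) @ v

    # Kronecker identity: (K1 kron K2) vec(V) = vec(K1 V K2^T) for
    # row-major vec, so GP computations on grids avoid the full matrix.
    fast = (K1 @ v.reshape(4, 5) @ K2.T).ravel()
    print(np.allclose(slow, fast))  # True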

Fei Sha (UCLA)
Title: Do shallow kernel methods match deep neural networks -- and if not, what can the shallow ones learn from the deep ones?
Abstract: Deep neural networks (DNNs) and other types of deep learning architectures have been hugely successful in a large number of applications. By contrast, kernel methods, which were once exceedingly popular, have become lackluster. The crippling obstacle is the computational complexity of those methods. Nonetheless, there has been a resurgence of interest in these methods. In particular, several research groups have studied how to scale kernel methods to cope with large-scale learning problems. Despite such progress, there has not been a systematic, head-on comparison between kernel methods and DNNs. Specifically, while recent approaches have shown exciting promise, we are still left with at least one nagging unanswered question: can kernel methods, after being scaled up for large datasets, truly match DNN performance? In this talk, I will describe our efforts in (partially) answering that question. I will present extensive empirical studies comparing kernel methods and DNNs for automatic speech recognition, a key field to which DNNs have been applied. Our investigative studies highlight the similarities and differences between these two paradigms. I will leave our main conclusion as a surprise.

Jean-Philippe Vert (Mines ParisTech & Curie Institute)
Title: Learning from Rankings
Abstract: In many applications such as genomics, high-dimensional data are often subject to technical variability, such as noise or batch effects, which is difficult to remove or model. If the variability approximately preserves the relative order of the features within each sample, then one can keep only the relative order of features to characterize each sample, resulting in a representation of each sample as a permutation over the set of features. In this talk, I will discuss several new methods for supervised and unsupervised classification of such permutations, including new positive definite kernels on the symmetric group and a new method for supervised full-quantile normalization, illustrating the benefits of these techniques on cancer patient stratification from noisy gene expression and mutation data.
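
As a sketch of the representation being described, the snippet below converts each sample to the permutation induced by ranking its features, and compares two samples with the Kendall tau statistic, one known example of a positive definite kernel on permutations; the data are invented for illustration:

    import numpy as np
    from itertools import combinations

    def to_ranks(x):
        """Replace each feature value by its rank within the sample,
        keeping only the relative order of the features."""
        return np.argsort(np.argsort(x))

    def kendall_kernel(x, y):
        """Kendall tau between the rankings induced by x and y (no
        ties assumed): fraction of concordant minus discordant feature
        pairs. This is a positive definite kernel on permutations."""
        rx, ry = to_ranks(x), to_ranks(y)
        n = len(x)
        s = sum(np.sign(rx[i] - rx[j]) * np.sign(ry[i] - ry[j])
                for i, j in combinations(range(n), 2))
        return 2.0 * s / (n * (n - 1))

    # The representation is invariant to any monotone, sample-wise
    # distortion: b is a rescaled, shifted copy of a, so both samples
    # induce the same permutation.
    a = np.array([0.1, 2.0, 0.5, 3.0])
    b = 10.0 * a + 7.0
    print(kendall_kernel(a, b))  # 1.0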

Amr Ahmed (Google)
Title: Dirichlet-Hawkes Processes with Applications to Clustering Continuous-Time Document Streams
Abstract: Clustering in document streams, such as online news articles, can be induced by their textual contents as well as by the temporal dynamics of their arrival patterns. Can we leverage both sources of information to obtain a better clustering of the documents, and distill information that is not possible to extract using contents alone? In this talk, I will describe a novel random process, referred to as the Dirichlet-Hawkes process, that takes both kinds of information into account in a unified framework. A distinctive feature of the proposed model is that the preferential attachment of items to clusters according to cluster sizes, present in Dirichlet processes, is now driven by the intensities of cluster-wise self-exciting temporal point processes, the Hawkes processes. This new model establishes a previously unexplored connection between Bayesian nonparametrics and temporal point processes: the number of clusters grows to accommodate the increasing complexity of online streaming contents, while at the same time the model adapts to the ever-changing dynamics of the continuous arrival times. Large-scale experiments on both synthetic and real-world news articles showed that Dirichlet-Hawkes processes can recover both meaningful topics and temporal dynamics, which leads to better predictive performance in terms of content perplexity and arrival times of future documents.
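
A minimal sketch of the seating rule this describes, under illustrative assumptions (an exponential triggering kernel, made-up parameters, and the textual likelihood term omitted): an arriving document joins an existing cluster with probability proportional to that cluster's Hawkes intensity, or opens a new cluster with probability proportional to a concentration parameter.

    import numpy as np

    def hawkes_intensity(t, event_times, alpha=0.8, beta=1.0):
        """Self-exciting intensity: each past event at time s < t
        contributes a decaying bump alpha * exp(-beta * (t - s))."""
        past = event_times[event_times < t]
        return alpha * np.exp(-beta * (t - past)).sum()

    def dhp_cluster_probs(t, clusters, theta=1.0):
        """Dirichlet-Hawkes seating rule (sketch): a document arriving
        at time t joins existing cluster k with probability proportional
        to that cluster's intensity at t, or starts a new cluster with
        probability proportional to the concentration theta. The full
        model also weighs each choice by the document's text likelihood."""
        rates = [hawkes_intensity(t, times) for times in clusters]
        weights = np.append(rates, theta)
        return weights / weights.sum()

    # Two clusters of past arrival times: the recently active cluster
    # attracts the new document far more strongly than the stale one,
    # replacing the size-based preferential attachment of a plain
    # Dirichlet process.
    clusters = [np.array([0.5, 1.0, 9.5, 9.8]), np.array([2.0, 3.0])]
    print(dhp_cluster_probs(t=10.0, clusters=clusters))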