This is a full-day virtual workshop on February 8, 2021. Real-time events will begin at 6am Pacific Time and continue to 2pm Pacific Time (9am to 5pm Eastern U.S. time).
· The page for this workshop and the Zoom channel on the AAAI-21 website is here.
· The main page for the AAAI-21 workshop program is here.
Workshop Description
A fundamental problem in the use of artificial neural networks is that the first step is to guess the network architecture. Fine-tuning a neural network by hand is very time-consuming and far from optimal. Hyperparameters such as the number of layers, the number of nodes in each layer, the pattern of connectivity, and the presence and placement of elements such as memory cells, recurrent connections, and convolutional elements are all manually selected. If it turns out that the architecture is not appropriate for the task, the user must repeatedly adjust the architecture and retrain the network until an acceptable architecture has been obtained.
There is now a great deal of interest in finding better alternatives to this scheme. Options include pruning a trained network or training many networks automatically. In this workshop we focus on a contrasting approach: learning the architecture during training. This topic encompasses forms of Neural Architecture Search (NAS) in which the performance properties of each architecture, after some training, are used to guide the selection of the next architecture to be tried. It also encompasses techniques that augment or alter a network as the network is trained. Examples of the latter include the Cascade Correlation algorithm and other methods that incrementally build or modify a neural network during training, as needed for the problem at hand.
Our goal is to build a stronger community of researchers exploring these methods, and to find synergies among these related approaches and alternatives. Eliminating the need to guess the right topology in advance of training is a prominent benefit of learning network architecture during training. Additional advantages are possible, including decreased computational resources to solve a problem, reduced time for the trained network to make predictions, reduced requirements for training set size, and avoiding “catastrophic forgetting.” We would especially like to highlight approaches that are qualitatively different from current popular, but computationally intensive, NAS methods.
As deep learning problems become increasingly complex, network sizes must increase and other architectural decisions become critical to success. The deep learning community must often confront serious time and hardware costs that arise from suboptimal architectural decisions. The growing popularity of NAS methods demonstrates the community's hunger for better ways of choosing or evolving network architectures that are well-matched to the problem at hand.
During the workshop, there will be two breaks for discussion and social time, a chance to talk with the organizers, authors, presenters, and other attendees. This will be done on the conference Gather.Town site.
Schedule (all times Pacific)
06:00 Welcome and talk: Scott E. Fahlman, "Cascade Correlation Thirty Years Later: Lessons and Open Questions"
07:00 Invited Talk: Sindy Löwe, "Putting an End to End-to-End"
07:45 Talk: Misha Khodak, "Geometry-Aware Gradient Algorithms for Neural Architecture Search"
08:30 Four 5-minute spotlight presentations of other accepted papers. (Video)
09:00 Break and Gather.Town discussions. Chance to talk with workshop organizers and authors.
09:45 Talk: Edouard Oyallon, "Greedy Training & Asynchronous Optimization"
10:30 Talk: Nicholas Roberts, "Searching for Convolutions and a More Ambitious NAS"
11:15 Talk: Dean Alderucci, "Some Intuitions in the Algorithmic Analysis of Cascade Correlation Learning"
12:00 Invited Talk: Maithra Raghu, "Beyond Performance Measures: Representational Insights for Machine Learning Design"
12:45 - 14:00 Gather.Town discussions and social time, all participants.
Invited Talk Abstracts
Putting an End to End-to-End
Modern deep learning models are typically optimized with end-to-end backpropagation. As a result, when following the most naive approach for Neural Architecture Search, we need to train every candidate network from scratch, resulting in a very expensive evaluation strategy. In this talk, I will present Greedy InfoMax, a self-supervised representation learning approach that could provide a solution to this problem. I will demonstrate how Greedy InfoMax enables us to train a neural network without labels and without end-to-end backpropagation, while achieving highly competitive results on downstream classification tasks. Finally, I will outline how such local learning could be used to tremendously cut down the evaluation cost of candidate networks during Neural Architecture Search.
Sindy Löwe (AMLab, University of Amsterdam)
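The key idea above, replacing a single end-to-end loss with local, per-module objectives, can be sketched in a few lines. The snippet below is a minimal illustration only: it uses a supervised local head as a stand-in for the self-supervised InfoNCE-style objective of Greedy InfoMax, and all sizes and names are made up for the example.

```python
# Minimal sketch of greedy module-wise training: each module is optimized with
# its own local loss, and gradients are blocked between modules, so no
# end-to-end backpropagation is required. (Supervised local heads are used here
# only as a stand-in for a self-supervised objective.)
import torch
import torch.nn as nn

modules = nn.ModuleList([
    nn.Sequential(nn.Linear(784, 256), nn.ReLU()),
    nn.Sequential(nn.Linear(256, 128), nn.ReLU()),
])
heads = nn.ModuleList([nn.Linear(256, 10), nn.Linear(128, 10)])
opts = [torch.optim.Adam(list(m.parameters()) + list(h.parameters()), lr=1e-3)
        for m, h in zip(modules, heads)]
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y):
    h = x
    for module, head, opt in zip(modules, heads, opts):
        h = module(h.detach())          # block gradients from flowing between modules
        loss = loss_fn(head(h), y)      # purely local objective for this module
        opt.zero_grad()
        loss.backward()                 # updates only this module and its head
        opt.step()

# Example usage with random data:
train_step(torch.randn(32, 784), torch.randint(0, 10, (32,)))
```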
Beyond Performance Measures: Representational Insights for Machine Learning Design
Over the past several years, rapid advances in modelling and algorithms, coupled with increased data and compute, have led to fundamental breakthroughs for machine learning across many different domains. However, the advanced capabilities of these machine learning systems come at the cost of greater complexity. Machine learning design is becoming increasingly laborious, computationally expensive and opaque, sometimes resulting in catastrophic failures at deployment. In this talk, I give an overview of steps towards an insight-driven design of machine learning systems. I introduce techniques that go beyond standard evaluation measures and enable quantitative analysis of the complex hidden representations of machine learning systems. I discuss the resulting insights into the underlying deep neural network models, and the principled way in which this informs many aspects of their design, from the effects of varying architecture width and depth to signs of overfitting and catastrophic forgetting.
Maithra Raghu (Google Brain)
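As one concrete example of analyzing hidden representations quantitatively (a widely used measure, not necessarily a technique covered in this talk), linear centered kernel alignment (CKA) compares the activations that two layers produce on the same inputs:

```python
# Illustrative representation-level analysis: linear CKA between two layers'
# activations on the same examples. Shown only as a concrete instance of going
# beyond standard evaluation measures; the data here is random.
import numpy as np

def linear_cka(X, Y):
    """X: (n, d1), Y: (n, d2) activation matrices for the same n examples."""
    X = X - X.mean(axis=0)                      # center each feature
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, 'fro') ** 2
    den = np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')
    return num / den

# e.g. compare a narrow and a wide layer's responses to the same 500 inputs
acts_a = np.random.randn(500, 64)
acts_b = np.random.randn(500, 256)
print(linear_cka(acts_a, acts_b))   # value in [0, 1]; higher means more similar
```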
Accepted Papers
(Listed alphabetically by first author name. The two papers marked with a * were chosen to be presented in the plenary stream. All papers are on the workshop materials website.)
Joint Search of Data Augmentation Policies and Network Architectures
The common pipeline for training deep neural networks consists of several building blocks such as data augmentation and network architecture selection. AutoML is a research field that aims at automatically designing those parts, but most methods explore each part independently because it is more challenging to search all the parts simultaneously. In this paper, we propose a joint optimization method for data augmentation policies and network architectures to bring more automation to the design of the training pipeline. The core idea of our approach is to make the whole pipeline differentiable. The proposed method combines differentiable methods for augmentation policy search and network architecture search to jointly optimize them in an end-to-end manner. The experimental results show our method achieves performance competitive with or superior to independently searched results.
Taiga Kashima kashima@nlab.ci.i.u-tokyo.ac.jp (University of Tokyo)*; Yoshihiro Yamada (Preferred Networks, Inc.); Shunta Saito (Preferred Networks, Inc.)
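The core idea of the paper, making both the augmentation policy and the architecture choice differentiable so that one gradient-based procedure can optimize them jointly, can be illustrated with a toy example. The sketch below uses a learnable noise magnitude as a stand-in augmentation policy and a DARTS-style softmax mixture over two candidate operations; the single-level optimization and all names are simplifications for illustration, not the authors' method.

```python
# Toy sketch of jointly optimizing a differentiable augmentation parameter and
# DARTS-style architecture mixing weights with ordinary gradient descent.
import torch
import torch.nn as nn
import torch.nn.functional as F

aug_magnitude = torch.zeros(1, requires_grad=True)   # "augmentation policy" parameter
arch_logits = torch.zeros(2, requires_grad=True)     # mixing weights over 2 candidate ops
ops = nn.ModuleList([
    nn.Linear(784, 10),
    nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10)),
])
opt = torch.optim.Adam([aug_magnitude, arch_logits, *ops.parameters()], lr=1e-3)

def step(x, y):
    x_aug = x + F.softplus(aug_magnitude) * torch.randn_like(x)  # differentiable "augmentation"
    weights = F.softmax(arch_logits, dim=0)                      # continuous architecture relaxation
    logits = sum(w * op(x_aug) for w, op in zip(weights, ops))   # mixture of candidate operations
    loss = F.cross_entropy(logits, y)
    opt.zero_grad(); loss.backward(); opt.step()                 # one joint gradient update

step(torch.randn(32, 784), torch.randint(0, 10, (32,)))
```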
* Geometry-Aware Gradient Algorithms for Neural Architecture Search (Video)
Recent state-of-the-art methods for neural architecture search (NAS) exploit gradient-based optimization by relaxing the problem into continuous optimization over architectures and shared weights, a noisy process that remains poorly understood. We argue for the study of single-level empirical risk minimization to understand NAS with weight-sharing, reducing the design of NAS methods to devising optimizers and regularizers that can quickly obtain high-quality solutions to this problem. Invoking the theory of mirror descent, we present a geometry-aware framework that exploits the underlying structure of this optimization to return sparse architectural parameters, leading to simple yet novel algorithms that enjoy fast convergence guarantees and achieve state-of-the-art accuracy on the latest NAS benchmarks in computer vision. Notably, we exceed the best published results for both CIFAR and ImageNet on both the DARTS search space and NAS-Bench-201; on the latter we achieve near-oracle-optimal performance on CIFAR-10 and CIFAR-100. Together, our theory and experiments demonstrate a principled way to codesign optimizers and continuous relaxations of discrete NAS search spaces.
Liam Li me@liamcli.com (Carnegie Mellon University)*; Mikhail Khodak (Carnegie Mellon University); Maria-Florina Balcan (Carnegie Mellon University); Ameet Talwalkar (CMU)
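The mirror-descent machinery referenced in the abstract has a classical concrete instance: the exponentiated-gradient update, which keeps simplex-constrained architecture weights positive and normalized and tends to concentrate mass on a few operations. The standalone sketch below illustrates only that update; it is not the paper's full algorithm, and the gradient values are invented for the example.

```python
# Geometry-aware (mirror descent) update with an entropic mirror map:
# exponentiated gradient on simplex-constrained architecture weights.
import numpy as np

def exponentiated_gradient_step(alpha, grad, lr=0.1):
    """alpha: architecture weights on the simplex; grad: dLoss/dalpha."""
    alpha = alpha * np.exp(-lr * grad)   # multiplicative, geometry-aware update
    return alpha / alpha.sum()           # renormalize back onto the simplex

alpha = np.full(4, 0.25)                 # uniform weights over 4 candidate operations
grad = np.array([0.3, -0.1, 0.8, 0.05])  # e.g. gradient of validation loss w.r.t. alpha
for _ in range(50):
    alpha = exponentiated_gradient_step(alpha, grad)
print(alpha)   # mass concentrates on the operation with the most negative gradient
```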
* Searching for Convolutions and a More Ambitious NAS
An important goal of neural architecture search (NAS) is to automate away the design of neural networks on new tasks in under-explored domains, thus helping to democratize machine learning. However, current NAS research largely focuses on search spaces consisting of existing operations, such as different types of convolution, that are already known to work well on well-studied problems, often in computer vision. Our work is motivated by the following question: can we enable users to build their own search spaces and discover the right neural operations given data from their specific domain? We make progress towards this broader vision for NAS by introducing a space of operations generalizing the convolution that enables search over a large family of parameterizable linear-time matrix-vector functions. Our flexible construction allows users to design their own search spaces adapted to the nature and shape of their data, to warm-start search methods using convolutions when they are known to perform well, or to discover new operations from scratch when they do not. We evaluate our approach on several novel search spaces over vision and text data, on all of which simple NAS search algorithms can find operations that perform better than baseline layers.
Nicholas C Roberts ncrobert@cs.cmu.edu (Carnegie Mellon University)*; Mikhail Khodak (Carnegie Mellon University); Liam Li (Carnegie Mellon University); Maria-Florina Balcan (Carnegie Mellon University); Ameet Talwalkar (CMU); Tri Dao (Stanford University); Christopher Re (Stanford University)
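To make the idea of a search space "generalizing the convolution" concrete, the toy module below mixes a standard convolution with an unconstrained learned linear map under softmax architecture weights. The dense linear map is only a stand-in: the paper's actual operation family consists of parameterizable linear-time matrix-vector functions, and every name and size here is illustrative.

```python
# Toy mixed operation over a family of linear maps that contains convolution
# as one member. A dense learned linear map stands in for a "free-form" op.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedLinearOp(nn.Module):
    def __init__(self, channels, size):
        super().__init__()
        n = channels * size * size
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)   # known-good operation
        self.dense = nn.Linear(n, n)                              # unconstrained linear op (stand-in)
        self.arch_logits = nn.Parameter(torch.zeros(2))           # searchable mixture weights

    def forward(self, x):
        b, c, h, w = x.shape
        w_mix = F.softmax(self.arch_logits, dim=0)
        out_conv = self.conv(x)
        out_dense = self.dense(x.flatten(1)).view(b, c, h, w)
        return w_mix[0] * out_conv + w_mix[1] * out_dense

op = MixedLinearOp(channels=3, size=8)
print(op(torch.randn(2, 3, 8, 8)).shape)   # torch.Size([2, 3, 8, 8])
```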
Two Novel Performance Improvements for Evolving CNN Topologies
Convolutional Neural Networks (CNNs) are the state-of-the-art algorithms for the processing of images. However, the configuration and training of these networks is a complex task requiring deep domain knowledge, experience, and much trial and error. Using genetic algorithms, competitive CNN topologies for image recognition can be produced for any specific purpose; however, in previous work this has come at high computational cost. In this work, two novel approaches to the use of these algorithms are presented, effective in reducing complexity and training time by nearly 20%. This is accomplished via regularisation directly on training time, and via the use of partial training to enable early ranking of individual architectures. Both approaches are validated on the benchmark CIFAR-10 data set and maintain accuracy.
Yaron Strauch y.strauch@soton.ac.uk (University of Southampton)*; Jo Grundy (University of Southampton)
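The two performance ideas, regularising fitness directly on training time and ranking candidates after only partial training, slot into a standard genetic-algorithm loop. The self-contained toy below shows only that control flow: the genome, the simulated accuracy, and the lambda_time constant are invented stand-ins rather than the authors' setup.

```python
# Toy GA loop illustrating (1) a fitness regularised by measured training time
# and (2) ranking candidates after only partial training. Training is simulated
# so the example runs on its own; a real implementation would train a CNN.
import random, time

def partial_train_and_score(genome, partial_epochs=3, lambda_time=0.01):
    start = time.time()
    time.sleep(0.0005 * partial_epochs * len(genome))        # pretend bigger genomes train slower
    simulated_accuracy = random.uniform(0.5, 0.9) - 0.0005 * sum(genome)
    train_time = time.time() - start
    return simulated_accuracy - lambda_time * train_time     # time-regularised fitness

def mutate(genome):
    g = list(genome)
    g[random.randrange(len(g))] += random.choice([-16, 16])  # tweak one layer width
    return [max(16, width) for width in g]

population = [[64, 64, 128] for _ in range(8)]               # genomes = lists of layer widths
for _ in range(10):
    scored = sorted(population, key=partial_train_and_score, reverse=True)
    parents = scored[: len(scored) // 2]                     # keep the best half
    population = parents + [mutate(random.choice(parents)) for _ in parents]
print(sorted(population, key=partial_train_and_score, reverse=True)[0])
```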
Training Integer-Valued Neural Networks with Mixed Integer Programming
Recent work has shown potential in using Mixed Integer Programming (MIP) solvers to optimize certain aspects of neural networks (NNs). However, little research has gone into training NNs with MIP solvers. State-of-the-art methods to train NNs are typically gradient-based and require significant data, computation on GPUs, and extensive hyper-parameter tuning. In contrast, training with MIP solvers does not require GPUs or heavy hyper-parameter tuning, but likely cannot handle large amounts of data. This paper builds on recent advances that train binarized NNs using MIP solvers. We go beyond current work by formulating new MIP models which improve training efficiency and which can train the important class of integer-valued neural networks (INNs). We provide two novel methods to further the potential significance of using MIP to train NNs. The first method optimizes the number of neurons in the NN while training, which reduces the need to decide on the network architecture before training. The second method addresses the amount of training data which MIP can feasibly handle: we provide a batch training method that dramatically increases the amount of data that MIP solvers can use to train. Our methodology is proficient at training NNs when minimal training data is available, and at training with minimal memory requirements, which is potentially valuable for deploying to low-memory devices. We also take a first step toward using much more data than before when training NNs with MIP models.
Neil Yorke-Smith n.yorke-smith@tudelft.nl (Delft University of Technology)*; Tómas Þorbjarnarson (Delft University of Technology)
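The general notion of "training as a mixed integer program" can be seen in miniature by fitting a single linear classifier with integer weights using an off-the-shelf MIP solver. The toy model below (margin constraints with slack variables, solved with PuLP's bundled CBC solver) only illustrates that notion; it is not the paper's INN formulation or its batching method, and the data and variable bounds are invented.

```python
# Toy "training as a MIP": find integer weights for a linear classifier by
# minimizing margin violations with a MIP solver.
import pulp

X = [(1, 2), (2, 1), (-1, -2), (-2, -1)]     # tiny 2-D dataset
y = [1, 1, -1, -1]                           # labels in {-1, +1}

prob = pulp.LpProblem("integer_weight_classifier", pulp.LpMinimize)
w = [pulp.LpVariable(f"w{j}", lowBound=-3, upBound=3, cat="Integer") for j in range(2)]
b = pulp.LpVariable("b", lowBound=-3, upBound=3, cat="Integer")
slack = [pulp.LpVariable(f"xi{i}", lowBound=0) for i in range(len(X))]

prob += pulp.lpSum(slack)                    # minimize total margin violation
for i, (x_i, y_i) in enumerate(zip(X, y)):
    score = pulp.lpSum(w[j] * x_i[j] for j in range(2)) + b
    prob += y_i * score + slack[i] >= 1      # correct classification with margin 1

prob.solve(pulp.PULP_CBC_CMD(msg=0))
print([v.value() for v in w], b.value())     # integer weights found by the solver
```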
Curriculum based Cascade Learning for Image Classification
Cascade Learning (CL) provides a novel bottom-up training paradigm for efficient training. Despite generalisation performance similar to traditional training, recent work shows that the bottleneck of CL lies in the poor separability in the early stages of training. In this work we investigate the feasibility of using Curriculum Learning (CuL) together with CL to trade off generalisation against training resource consumption when training the early layers of the CL network. We order the data from "easy" to "hard" using prior knowledge with two proposed methods: an entropy measurement on pixel intensity and a teacher network based on knowledge distillation. We show that a meaningful curriculum leads to fast convergence when training the early layers of CL. We demonstrate that this advantage leads to an improvement in final generalisation under carefully controlled training resources. Finally, we provide mathematical proof that a CuL-CL framework is able to further improve training in the early stages, due to the need for less data.
Junwen Wang jw7u18@soton.ac.uk (University of Southampton)*; Katayoun Farrahi (University of Southampton); Mahesan Niranjan (University of Southampton)
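One of the two proposed orderings, an entropy measurement on pixel intensity, is straightforward to sketch: score each image by the Shannon entropy of its intensity histogram and present low-entropy images first. Treating low entropy as "easy", and the synthetic data below, are assumptions made for illustration.

```python
# Minimal sketch of an entropy-based curriculum: rank images by the Shannon
# entropy of their pixel-intensity histograms and order the data easy-to-hard.
import numpy as np

def intensity_entropy(image, bins=256):
    hist, _ = np.histogram(image, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()            # Shannon entropy of the intensity histogram

images = np.random.rand(100, 28, 28)          # stand-in dataset of grayscale images in [0, 1]
order = np.argsort([intensity_entropy(im) for im in images])   # low entropy first = "easy"
curriculum = images[order]                    # feed batches in this order when training early layers
```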
Organizers
Scott E. Fahlman
School of Computer Science, Carnegie Mellon University
Kate Farrahi
Electronics and Computer Science Department, University of Southampton
George Magoulas
Department of Computer Science and Information Systems, Birkbeck College, University of London
Edouard Oyallon
Sorbonne Université – LIP6
Bhiksha Raj Ramakrishnan
School of Computer Science, Carnegie Mellon University
Dean Alderucci
School of Computer Science, Carnegie Mellon University dalderuc@cs.cmu.edu