This is a full-day virtual workshop on February 8, 2021. Real-time events will begin at 6am Pacific Time and continue to 2pm Pacific Time (9am to 5pm Eastern U.S. time).
· The page for this workshop and the Zoom channel on the AAAI-21 website is here.
· The main page for the AAAI-21 workshop program is here.
Workshop Description
A fundamental problem in the use of artificial neural networks is that the first step is to guess the network architecture. Fine-tuning a neural network by hand is very time-consuming and far from optimal. Hyperparameters such as the number of layers, the number of nodes in each layer, the pattern of connectivity, and the presence and placement of elements such as memory cells, recurrent connections, and convolutional elements are all manually selected. If it turns out that the architecture is not appropriate for the task, the user must repeatedly adjust the architecture and retrain the network until an acceptable architecture has been obtained.
There is now a great deal of interest in finding better alternatives to this scheme. Options include pruning a trained network or training many networks automatically. In this workshop we focus on a contrasting approach: learning the architecture during training. This topic encompasses forms of Neural Architecture Search (NAS) in which the performance properties of each architecture, after some training, are used to guide the selection of the next architecture to be tried. It also encompasses techniques that augment or alter a network as the network is trained. Examples of the latter include the Cascade Correlation algorithm and other methods that incrementally build or modify a neural network during training, as needed for the problem at hand.
Our goal is to build a stronger community of researchers exploring these methods, and to find synergies among these related approaches and alternatives. Eliminating the need to guess the right topology in advance of training is a prominent benefit of learning network architecture during training. Additional advantages are possible, including decreased computational resources to solve a problem, reduced time for the trained network to make predictions, reduced requirements for training set size, and avoiding “catastrophic forgetting.” We would especially like to highlight approaches that are qualitatively different from current popular, but computationally intensive, NAS methods.
As deep learning problems become increasingly complex, network sizes must increase and other architectural decisions become critical to success. The deep learning community must often confront serious time and hardware costs that arise from suboptimal architectural decisions. The growing popularity of NAS methods demonstrates the community's hunger for better ways of choosing or evolving network architectures that are well-matched to the problem at hand.
During the workshop, there will be two breaks for discussion and social time, a chance to talk with the organizers, authors, presenters, and other attendees. This will be done on the conference Gather.Town site.
Schedule (all times Pacific)
06:00 Welcome and talk: Scott E. Fahlman, "Cascade Correlation Thirty Years Later: Lessons and Open Questions"
07:00 Invited Talk: Sindy Löwe, "Putting an End to End-to-End"
07:45 Talk: Misha Khodak, "Geometry-Aware Gradient Algorithms for Neural Architecture Search"
08:30 Four 5-minute spotlight presentations of other accepted papers. (Video)
09:00 Break and Gather.Town discussions. Chance to talk with workshop organizers and authors.
09:45 Talk: Edouard Oyallon, "Greedy Training & Asynchronous Optimization"
10:30 Talk: Nicholas Roberts, "Searching for Convolutions and a More Ambitious NAS"
11:15 Talk: Dean Alderucci, "Some Intuitions in the Algorithmic Analysis of Cascade Correlation Learning"
12:00 Invited Talk: Maithra Raghu, "Beyond Performance Measures: Representational Insights for Machine Learning Design"
12:45 - 14:00 Gather.Town discussions and social time, all participants.
Invited Talk Abstracts
Putting an End to End-to-End
Modern deep learning models are typically optimized with end-to-end backpropagation. As a result, when following the most naive approach for Neural Architecture Search, we need to train every candidate network from scratch, resulting in a very expensive evaluation strategy. In this talk, I will present Greedy InfoMax, a self-supervised representation learning approach that could provide a solution to this problem. I will demonstrate how Greedy InfoMax enables us to train a neural network without labels and without end-to-end backpropagation, while achieving highly competitive results on downstream classification tasks. Finally, I will outline how such local learning could be used to tremendously cut down the evaluation cost of candidate networks during Neural Architecture Search.
Sindy Löwe (AMLab, University of Amsterdam)
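The key idea above, replacing a single end-to-end loss with local, per-module objectives, can be sketched in a few lines. The snippet below is a minimal illustration only: it uses a supervised local head as a stand-in for the self-supervised InfoNCE-style objective of Greedy InfoMax, and all sizes and names are made up for the example.

```python
# Minimal sketch of greedy module-wise training: each module is optimized with
# its own local loss, and gradients are blocked between modules, so no
# end-to-end backpropagation is required. (Supervised local heads are used here
# only as a stand-in for a self-supervised objective.)
import torch
import torch.nn as nn

modules = nn.ModuleList([
    nn.Sequential(nn.Linear(784, 256), nn.ReLU()),
    nn.Sequential(nn.Linear(256, 128), nn.ReLU()),
])
heads = nn.ModuleList([nn.Linear(256, 10), nn.Linear(128, 10)])
opts = [torch.optim.Adam(list(m.parameters()) + list(h.parameters()), lr=1e-3)
        for m, h in zip(modules, heads)]
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y):
    h = x
    for module, head, opt in zip(modules, heads, opts):
        h = module(h.detach())          # block gradients from flowing between modules
        loss = loss_fn(head(h), y)      # purely local objective for this module
        opt.zero_grad()
        loss.backward()                 # updates only this module and its head
        opt.step()

# Example usage with random data:
train_step(torch.randn(32, 784), torch.randint(0, 10, (32,)))
```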
Beyond Performance Measures: Representational Insights for Machine Learning Design
Over the past several years, rapid advances in modelling and algorithms, coupled with increased data and compute, have led to fundamental breakthroughs for machine learning across many different domains. However, the advanced capabilities of these machine learning systems come at the cost of greater complexity. Machine learning design is becoming increasingly laborious, computationally expensive and opaque, sometimes resulting in catastrophic failures at deployment. In this talk, I give an overview of steps towards an insight-driven design of machine learning systems. I introduce techniques that go beyond standard evaluation measures and enable quantitative analysis of the complex hidden representations of machine learning systems. I discuss the resulting insights into the underlying deep neural network models, and the principled way in which this informs many aspects of their design, from the effects of varying architecture width and depth to signs of overfitting and catastrophic forgetting.
Maithra Raghu (Google Brain)
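As one concrete example of analyzing hidden representations quantitatively (a widely used measure, not necessarily a technique covered in this talk), linear centered kernel alignment (CKA) compares the activations that two layers produce on the same inputs:

```python
# Illustrative representation-level analysis: linear CKA between two layers'
# activations on the same examples. Shown only as a concrete instance of going
# beyond standard evaluation measures; the data here is random.
import numpy as np

def linear_cka(X, Y):
    """X: (n, d1), Y: (n, d2) activation matrices for the same n examples."""
    X = X - X.mean(axis=0)                      # center each feature
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, 'fro') ** 2
    den = np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')
    return num / den

# e.g. compare a narrow and a wide layer's responses to the same 500 inputs
acts_a = np.random.randn(500, 64)
acts_b = np.random.randn(500, 256)
print(linear_cka(acts_a, acts_b))   # value in [0, 1]; higher means more similar
```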
Accepted Papers
(Listed alphabetically by first author name. The two papers marked with a * were chosen to be presented in the plenary stream. All papers are on the workshop materials website.)
Joint Search of Data Augmentation Policies and Network Architectures
The common pipeline for training deep neural networks consists of several building blocks such as data augmentation and network architecture selection. AutoML is a research field that aims at automatically designing those parts, but most methods explore each part independently because it is more challenging to search all the parts simultaneously. In this paper, we propose a joint optimization method for data augmentation policies and network architectures to bring more automation to the design of the training pipeline. The core idea of our approach is to make the whole pipeline differentiable. The proposed method combines differentiable methods for augmentation policy search and network architecture search to jointly optimize them in an end-to-end manner. The experimental results show our method achieves performance competitive with or superior to independently searched results.
Taiga Kashima kashima@nlab.ci.i.u-tokyo.ac.jp (University of Tokyo)*; Yoshihiro Yamada (Preferred Networks, Inc.); Shunta Saito (Preferred Networks, Inc.)
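The core idea of the paper, making both the augmentation policy and the architecture choice differentiable so that one gradient-based procedure can optimize them jointly, can be illustrated with a toy example. The sketch below uses a learnable noise magnitude as a stand-in augmentation policy and a DARTS-style softmax mixture over two candidate operations; the single-level optimization and all names are simplifications for illustration, not the authors' method.

```python
# Toy sketch of jointly optimizing a differentiable augmentation parameter and
# DARTS-style architecture mixing weights with ordinary gradient descent.
import torch
import torch.nn as nn
import torch.nn.functional as F

aug_magnitude = torch.zeros(1, requires_grad=True)   # "augmentation policy" parameter
arch_logits = torch.zeros(2, requires_grad=True)     # mixing weights over 2 candidate ops
ops = nn.ModuleList([
    nn.Linear(784, 10),
    nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10)),
])
opt = torch.optim.Adam([aug_magnitude, arch_logits, *ops.parameters()], lr=1e-3)

def step(x, y):
    x_aug = x + F.softplus(aug_magnitude) * torch.randn_like(x)  # differentiable "augmentation"
    weights = F.softmax(arch_logits, dim=0)                      # continuous architecture relaxation
    logits = sum(w * op(x_aug) for w, op in zip(weights, ops))   # mixture of candidate operations
    loss = F.cross_entropy(logits, y)
    opt.zero_grad(); loss.backward(); opt.step()                 # one joint gradient update

step(torch.randn(32, 784), torch.randint(0, 10, (32,)))
```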
* Geometry-Aware Gradient Algorithms for Neural Architecture Search (Video)
Recent state-of-the-art methods for neural architecture search (NAS) exploit gradient-based optimization by relaxing the problem into continuous optimization over architectures and shared weights, a noisy process that remains poorly understood. We argue for the study of single-level empirical risk minimization to understand NAS with weight-sharing, reducing the design of NAS methods to devising optimizers and regularizers that can quickly obtain high-quality solutions to this problem. Invoking the theory of mirror descent, we present a geometry-aware framework that exploits the underlying structure of this optimization to return sparse architectural parameters, leading to simple yet novel algorithms that enjoy fast convergence guarantees and achieve state-of-the-art accuracy on the latest NAS benchmarks in computer vision. Notably, we exceed the best published results for both CIFAR and ImageNet on both the DARTS search space and NAS-Bench-201; on the latter we achieve near-oracle-optimal performance on CIFAR-10 and CIFAR-100. Together, our theory and experiments demonstrate a principled way to codesign optimizers and continuous relaxations of discrete NAS search spaces.
Liam Li me@liamcli.com (Carnegie Mellon University)*; Mikhail Khodak (Carnegie Mellon University); Maria-Florina Balcan (Carnegie Mellon University); Ameet Talwalkar (CMU)
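The mirror-descent machinery referenced in the abstract has a classical concrete instance: the exponentiated-gradient update, which keeps simplex-constrained architecture weights positive and normalized and tends to concentrate mass on a few operations. The standalone sketch below illustrates only that update; it is not the paper's full algorithm, and the gradient values are invented for the example.

```python
# Geometry-aware (mirror descent) update with an entropic mirror map:
# exponentiated gradient on simplex-constrained architecture weights.
import numpy as np

def exponentiated_gradient_step(alpha, grad, lr=0.1):
    """alpha: architecture weights on the simplex; grad: dLoss/dalpha."""
    alpha = alpha * np.exp(-lr * grad)   # multiplicative, geometry-aware update
    return alpha / alpha.sum()           # renormalize back onto the simplex

alpha = np.full(4, 0.25)                 # uniform weights over 4 candidate operations
grad = np.array([0.3, -0.1, 0.8, 0.05])  # e.g. gradient of validation loss w.r.t. alpha
for _ in range(50):
    alpha = exponentiated_gradient_step(alpha, grad)
print(alpha)   # mass concentrates on the operation with the most negative gradient
```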
* Searching for Convolutions and a More Ambitious NAS
An important goal of neural architecture search (NAS) is to automate away the design of neural networks on new tasks in under-explored domains, thus helping to democratize machine learning. However, current NAS research largely focuses on search spaces consisting of existing operations, such as different types of convolution, that are already known to work well on well-studied problems, often in computer vision. Our work is motivated by the following question: can we enable users to build their own search spaces and discover the right neural operations given data from their specific domain? We make progress towards this broader vision for NAS by introducing a space of operations generalizing the convolution that enables search over a large family of parameterizable linear-time matrix-vector functions. Our flexible construction allows users to design their own search spaces adapted to the nature and shape of their data, to warm-start search methods using convolutions when they are known to perform well, or to discover new operations from scratch when they do not. We evaluate our approach on several novel search spaces over vision and text data, on all of which simple NAS search algorithms can find operations that perform better than baseline layers.
Nicholas C Roberts ncrobert@cs.cmu.edu (Carnegie Mellon University)*; Mikhail Khodak (Carnegie Mellon University); Liam Li (Carnegie Mellon University); Maria-Florina Balcan (Carnegie Mellon University); Ameet Talwalkar (CMU); Tri Dao (Stanford University); Christopher Re (Stanford University)
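To make the idea of a search space "generalizing the convolution" concrete, the toy module below mixes a standard convolution with an unconstrained learned linear map under softmax architecture weights. The dense linear map is only a stand-in: the paper's actual operation family consists of parameterizable linear-time matrix-vector functions, and every name and size here is illustrative.

```python
# Toy mixed operation over a family of linear maps that contains convolution
# as one member. A dense learned linear map stands in for a "free-form" op.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedLinearOp(nn.Module):
    def __init__(self, channels, size):
        super().__init__()
        n = channels * size * size
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)   # known-good operation
        self.dense = nn.Linear(n, n)                              # unconstrained linear op (stand-in)
        self.arch_logits = nn.Parameter(torch.zeros(2))           # searchable mixture weights

    def forward(self, x):
        b, c, h, w = x.shape
        w_mix = F.softmax(self.arch_logits, dim=0)
        out_conv = self.conv(x)
        out_dense = self.dense(x.flatten(1)).view(b, c, h, w)
        return w_mix[0] * out_conv + w_mix[1] * out_dense

op = MixedLinearOp(channels=3, size=8)
print(op(torch.randn(2, 3, 8, 8)).shape)   # torch.Size([2, 3, 8, 8])
```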
Two Novel Performance Improvements for Evolving CNN Topologies
Convolutional Neural Networks (CNNs) are the state-of-the-art algorithms for the processing of images. However, the configuration and training of these networks is a complex task requiring deep domain knowledge, experience, and much trial and error. Using genetic algorithms, competitive CNN topologies for image recognition can be produced for any specific purpose; however, in previous work this has come at high computational cost. In this work, two novel approaches to the use of these algorithms are presented, effective in reducing complexity and training time by nearly 20%. This is accomplished via regularisation directly on training time, and via the use of partial training to enable early ranking of individual architectures. Both approaches are validated on the benchmark CIFAR-10 data set and maintain accuracy.
Yaron Strauch y.strauch@soton.ac.uk (University of Southampton)*; Jo Grundy (University of Southampton)
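The two performance ideas, regularising fitness directly on training time and ranking candidates after only partial training, slot into a standard genetic-algorithm loop. The self-contained toy below shows only that control flow: the genome, the simulated accuracy, and the lambda_time constant are invented stand-ins rather than the authors' setup.

```python
# Toy GA loop illustrating (1) a fitness regularised by measured training time
# and (2) ranking candidates after only partial training. Training is simulated
# so the example runs on its own; a real implementation would train a CNN.
import random, time

def partial_train_and_score(genome, partial_epochs=3, lambda_time=0.01):
    start = time.time()
    time.sleep(0.0005 * partial_epochs * len(genome))        # pretend bigger genomes train slower
    simulated_accuracy = random.uniform(0.5, 0.9) - 0.0005 * sum(genome)
    train_time = time.time() - start
    return simulated_accuracy - lambda_time * train_time     # time-regularised fitness

def mutate(genome):
    g = list(genome)
    g[random.randrange(len(g))] += random.choice([-16, 16])  # tweak one layer width
    return [max(16, width) for width in g]

population = [[64, 64, 128] for _ in range(8)]               # genomes = lists of layer widths
for _ in range(10):
    scored = sorted(population, key=partial_train_and_score, reverse=True)
    parents = scored[: len(scored) // 2]                     # keep the best half
    population = parents + [mutate(random.choice(parents)) for _ in parents]
print(sorted(population, key=partial_train_and_score, reverse=True)[0])
```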
Training Integer-Valued Neural Networks with Mixed Integer Programming
Recent work has shown potential in using Mixed Integer Programming (MIP) solvers to optimize certain aspects of neural networks (NNs). However, little research has gone into training NNs with MIP solvers. State-of-the-art methods to train NNs are typically gradient-based and require significant data, computation on GPUs, and extensive hyper-parameter tuning. In contrast, training with MIP solvers does not require GPUs or heavy hyper-parameter tuning, but likely cannot handle large amounts of data. This paper builds on recent advances that train binarized NNs using MIP solvers. We go beyond current work by formulating new MIP models which improve training efficiency and which can train the important class of integer-valued neural networks (INNs). We provide two novel methods to further the potential significance of using MIP to train NNs. The first method optimizes the number of neurons in the NN while training, which reduces the need to decide on the network architecture before training. The second method addresses the amount of training data which MIP can feasibly handle: we provide a batch training method that dramatically increases the amount of data that MIP solvers can use to train. Our methodology is proficient at training NNs when minimal training data is available, and at training with minimal memory requirements, which is potentially valuable for deploying to low-memory devices. We also take a first step toward using much more data than before when training NNs with MIP models.
Neil Yorke-Smith n.yorke-smith@tudelft.nl (Delft University of Technology)*; Tómas Þorbjarnarson (Delft University of Technology)
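The general notion of "training as a mixed integer program" can be seen in miniature by fitting a single linear classifier with integer weights using an off-the-shelf MIP solver. The toy model below (margin constraints with slack variables, solved with PuLP's bundled CBC solver) only illustrates that notion; it is not the paper's INN formulation or its batching method, and the data and variable bounds are invented.

```python
# Toy "training as a MIP": find integer weights for a linear classifier by
# minimizing margin violations with a MIP solver.
import pulp

X = [(1, 2), (2, 1), (-1, -2), (-2, -1)]     # tiny 2-D dataset
y = [1, 1, -1, -1]                           # labels in {-1, +1}

prob = pulp.LpProblem("integer_weight_classifier", pulp.LpMinimize)
w = [pulp.LpVariable(f"w{j}", lowBound=-3, upBound=3, cat="Integer") for j in range(2)]
b = pulp.LpVariable("b", lowBound=-3, upBound=3, cat="Integer")
slack = [pulp.LpVariable(f"xi{i}", lowBound=0) for i in range(len(X))]

prob += pulp.lpSum(slack)                    # minimize total margin violation
for i, (x_i, y_i) in enumerate(zip(X, y)):
    score = pulp.lpSum(w[j] * x_i[j] for j in range(2)) + b
    prob += y_i * score + slack[i] >= 1      # correct classification with margin 1

prob.solve(pulp.PULP_CBC_CMD(msg=0))
print([v.value() for v in w], b.value())     # integer weights found by the solver
```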
Curriculum based Cascade Learning for Image Classification
Cascade Learning (CL) provides a novel bottom-up training paradigm for efficient training. Despite generalisation performance similar to traditional training, recent work shows that the bottleneck of CL lies in the poor separability in the early stages of training. In this work we investigate the feasibility of using Curriculum Learning (CuL) together with CL to trade off generalisation against training resource consumption when training the early layers of the CL network. We order the data from "easy" to "hard" using prior knowledge with two proposed methods: an entropy measurement on pixel intensity and a teacher network based on knowledge distillation. We show that a meaningful curriculum leads to fast convergence when training the early layers of CL. We demonstrate that this advantage leads to an improvement in final generalisation under carefully controlled training resources. Finally, we provide mathematical proof that a CuL-CL framework is able to further improve training in the early stages, due to the need for less data.
Junwen Wang jw7u18@soton.ac.uk (University of Southampton)*; Katayoun Farrahi (University of Southampton); Mahesan Niranjan (University of Southampton)
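One of the two proposed orderings, an entropy measurement on pixel intensity, is straightforward to sketch: score each image by the Shannon entropy of its intensity histogram and present low-entropy images first. Treating low entropy as "easy", and the synthetic data below, are assumptions made for illustration.

```python
# Minimal sketch of an entropy-based curriculum: rank images by the Shannon
# entropy of their pixel-intensity histograms and order the data easy-to-hard.
import numpy as np

def intensity_entropy(image, bins=256):
    hist, _ = np.histogram(image, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()            # Shannon entropy of the intensity histogram

images = np.random.rand(100, 28, 28)          # stand-in dataset of grayscale images in [0, 1]
order = np.argsort([intensity_entropy(im) for im in images])   # low entropy first = "easy"
curriculum = images[order]                    # feed batches in this order when training early layers
```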
Organizers
Scott E. Fahlman
School of Computer Science, Carnegie Mellon University
Kate Farrahi
Electronics and Computer Science Department, University of Southampton
George Magoulas
Department of Computer Science and Information Systems, Birkbeck College, University of London
Edouard Oyallon
Sorbonne Université – LIP6
Bhiksha Raj Ramakrishnan
School of Computer Science, Carnegie Mellon University
Dean Alderucci
School of Computer Science, Carnegie Mellon University dalderuc@cs.cmu.edu