Why Should We Add Early Exits to Neural Networks?
Simone Scardapane1 · Michele Scarpiniti1 · Enzo Baccarelli1 · Aurelio Uncini1

1 Department of Information Engineering, Electronics and Telecommunications (DIET), “Sapienza” University of Rome, Via Eudossiana 18, 00184 Rome, Italy
Received: 25 February 2020 / Accepted: 14 May 2020 © Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Deep neural networks are generally designed as a stack of differentiable layers, in which a prediction is obtained only after running the full stack. Recently, some contributions have proposed techniques to endow the networks with early exits, allowing predictions to be obtained at intermediate points of the stack. These multi-output networks have a number of advantages, including (i) significant reductions of the inference time, (ii) a reduced tendency to overfitting and vanishing gradients, and (iii) the capability of being distributed over multi-tier computation platforms. In addition, they connect to the wider themes of biological plausibility and layered cognitive reasoning. In this paper, we provide a comprehensive introduction to this family of neural networks, describing in a unified fashion the way these architectures can be designed, trained, and actually deployed in time-constrained scenarios. We also describe in depth their application scenarios in 5G and Fog computing environments, as well as some of the open research questions connected to them.

Keywords Deep learning · Conditional computation · Early exit · Fog computing · Distributed optimization
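To make the idea of a multi-output network concrete, the following is a minimal PyTorch sketch of a backbone with one auxiliary classifier ("early exit") attached after every block. This is not the authors' implementation: the layer sizes, the number of exits, and the confidence threshold used for fast inference are illustrative assumptions.

```python
# Minimal sketch of an early-exit network (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitMLP(nn.Module):
    def __init__(self, in_dim=784, hidden=256, num_classes=10, num_blocks=3):
        super().__init__()
        # Backbone: a stack of differentiable blocks, as in a standard deep network.
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim if i == 0 else hidden, hidden), nn.ReLU())
             for i in range(num_blocks)]
        )
        # One auxiliary classifier ("early exit") after each block.
        self.exits = nn.ModuleList(
            [nn.Linear(hidden, num_classes) for _ in range(num_blocks)]
        )

    def forward(self, x):
        # Training: return the predictions of all exits, so that a joint
        # (e.g., weighted) loss can be back-propagated through the stack.
        outputs = []
        for block, exit_head in zip(self.blocks, self.exits):
            x = block(x)
            outputs.append(exit_head(x))
        return outputs

    @torch.no_grad()
    def fast_inference(self, x, threshold=0.9):
        # Inference: stop at the first exit whose confidence (max softmax
        # probability) exceeds a chosen threshold, saving computation.
        # Assumes a single input sample (batch size 1).
        for block, exit_head in zip(self.blocks, self.exits):
            x = block(x)
            probs = F.softmax(exit_head(x), dim=-1)
            if probs.max(dim=-1).values.item() >= threshold:
                return probs
        return probs  # fall back to the final exit
```

At training time all exits are optimized jointly; at inference time the thresholded early stopping is what yields the reduced inference time mentioned above.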
Introduction
The success of deep networks can be attributed in large part to their extreme modularity and compositionality, coupled with the power of automatic optimization routines such as stochastic gradient descent [73]. While a number of innovative components have been proposed recently (such as attention layers [66], neural ODEs [16], and graph modules [58]), the vast majority of deep networks are designed as a sequential stack of (differentiable) layers, trained by propagating the gradient from the final layer inwards. Even if the optimization of very large stacks of layers can today be greatly improved with modern techniques such as residual connections [72], their implementation still brings forth a number of possible drawbacks. Firstly, very deep networks are hard to parallelize because of the gradient locking problem [49] and the purely sequential nature of their information flow. Secondly, in the inference phase, these networks are complex to implement in resource-constrained or distributed scenarios [32, 53].
Thirdly, overfitting and vanishing gradient phenomena can still happen even with strong regularization, due to the possibility of raw memorization intrinsic to these architectures [27]. When discussing overfitting and model selection, these are generally considered properties of a given network applied to a full dataset. However, recently, a large number of contributions, e.g., [3, 7, 36, 45,