Why Should We Add Early Exits to Neural Networks?


Simone Scardapane¹ · Michele Scarpiniti¹ · Enzo Baccarelli¹ · Aurelio Uncini¹

Received: 25 February 2020 / Accepted: 14 May 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Deep neural networks are generally designed as a stack of differentiable layers, in which a prediction is obtained only after running the full stack. Recently, some contributions have proposed techniques to endow the networks with early exits, which allow predictions to be obtained at intermediate points of the stack. These multi-output networks have a number of advantages, including (i) significant reductions in inference time, (ii) a reduced tendency toward overfitting and vanishing gradients, and (iii) the ability to be distributed over multi-tier computation platforms. In addition, they connect to the wider themes of biological plausibility and layered cognitive reasoning. In this paper, we provide a comprehensive introduction to this family of neural networks, describing in a unified fashion how these architectures can be designed, trained, and actually deployed in time-constrained scenarios. We also describe in depth their application scenarios in 5G and Fog computing environments, as well as some of the open research questions connected to them.

Keywords Deep learning · Conditional computation · Early exit · Fog computing · Distributed optimization

Simone Scardapane
[email protected]

¹ Department of Information Engineering, Electronics and Telecommunications (DIET), “Sapienza” University of Rome, Via Eudossiana 18, 00184 Rome, Italy

Introduction

The success of deep networks can be attributed in large part to their extreme modularity and compositionality, coupled with the power of automatic optimization routines such as stochastic gradient descent [73]. While a number of innovative components have been proposed recently (such as attention layers [66], neural ODEs [16], and graph modules [58]), the vast majority of deep networks are designed as a sequential stack of (differentiable) layers, trained by propagating the gradient from the final layer inwards. Even if the optimization of very large stacks of layers can today be greatly improved with modern techniques such as residual connections [72], their implementation still brings forth a number of possible drawbacks. Firstly, very deep networks are hard to parallelize because of the gradient locking problem [49] and the purely sequential nature of their information flow. Secondly, in the inference phase, these networks are complex to implement in resource-constrained or distributed scenarios [32, 53]. Thirdly, overfitting and vanishing gradient phenomena can still happen even with strong regularization, due to the possibility of raw memorization intrinsic to these architectures [27]. Overfitting and model selection are generally considered properties of a given network applied to a full dataset. However, recently, a large number of contributions, e.g., [3, 7, 36, 45,