Derivation and analysis of parallel-in-time neural ordinary differential equations
E. Lorin 1,2
© Springer Nature Switzerland AG 2020
Abstract

The introduction in 2015 of Residual Neural Networks (RNN) and ResNet allowed for outstanding improvements in the performance of learning algorithms for evolution problems containing a "large" number of layers. Continuous-depth RNN-like models called Neural Ordinary Differential Equations (NODE) were then introduced in 2019. The latter have a constant memory cost and avoid the a priori specification of the number of hidden layers. In this paper, we derive and analyze a parallel (in-parameter and in-time) version of the NODE, which potentially allows for a more efficient implementation than a standard/naive parallelization of NODEs with respect to the parameters only. We expect this approach to be relevant whenever a very large number of processors is available, or when we are dealing with high-dimensional ODE systems. Moreover, when implicit ODE solvers are used, linear systems with up to cubic complexity must be solved within, for instance, Newton's algorithm for the underlying nonlinear systems; as the proposed approach reduces the overall number of time-steps thanks to an iterative increase of the accuracy order of the ODE solvers, it also reduces the number of linear systems to solve, hence benefiting from a scaling effect.

Keywords Residual Neural Network · Neural Ordinary Differential Equations · Parareal method · Parallelism-in-time

Mathematics Subject Classification (2010) 65Y05 · 65L20 · 68T01 · 68T20
E. Lorin
[email protected]

1 Centre de Recherches Mathématiques, Université de Montréal, Montréal, H3T 1J4, Canada

2 School of Mathematics and Statistics, Carleton University, Ottawa, K1S 5B6, Canada

1 Introduction

Recently, differential-equation-based neural networks were successfully developed, allowing for a natural extension of residual networks [1] with a continuum of hidden layers. That is, from an "arbitrary" sequence of transformations
$$h_{t+1} = h_t + f(h_t, \theta_t),$$
with f given and T ∈ N∗ hidden units (t ∈ {0, · · · , T − 1}), it is suggested [2] that neural networks could be modeled by a (continuous) differential equation
$$\frac{dh_\theta}{dt} = f(h_\theta(t), t, \theta), \qquad (1)$$
where t ∈ R+ is typically a "time" variable and θ represents a scalar parameter variable. For f locally Lipschitz continuous with respect to its first variable and continuous with respect to the second one, this system, together with an initial condition, has a unique maximal solution. Later on, we will even require f to be differentiable with respect to its first and third variables.

In order to get a broader picture of the derived approach, let us specify the framework a bit more. Traditional deep-learning algorithms are based on the minimization of parameter-dependent (the parameters being denoted θ) loss functions L. The latter are constructed i) using training data, and ii) as a composition of parameter-dependent transfer functions from one layer to the next. Minimization is traditionally obtained with gradient-based methods.
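To make the correspondence between the discrete residual update and the continuous model (1) concrete, the following minimal sketch (not taken from the paper; the transfer function f, the parameters θ, and the step size dt are illustrative assumptions) checks numerically that a residual network with unit step size coincides with a forward-Euler discretization of (1).

```python
# Minimal sketch: residual update h_{t+1} = h_t + f(h_t, theta) versus a
# forward-Euler discretization of dh/dt = f(h, t, theta).  The choice of f,
# theta, and dt is illustrative, not taken from the paper.
import numpy as np

def f(h, t, theta):
    # Toy parameter-dependent transfer function (a tanh layer); t is kept in
    # the signature to match f(h, t, theta) but is unused in this toy example.
    W, b = theta
    return np.tanh(W @ h + b)

def resnet_forward(h0, theta, T):
    # Discrete residual network with T hidden units.
    h = h0
    for t in range(T):
        h = h + f(h, t, theta)
    return h

def node_forward_euler(h0, theta, T, dt=1.0):
    # Forward-Euler integration of (1) over T steps; with dt = 1 this
    # coincides with the residual update above.
    h = h0
    for t in range(T):
        h = h + dt * f(h, t * dt, theta)
    return h

rng = np.random.default_rng(0)
d = 4
theta = (0.1 * rng.standard_normal((d, d)), np.zeros(d))
h0 = rng.standard_normal(d)
print(np.allclose(resnet_forward(h0, theta, 5), node_forward_euler(h0, theta, 5)))  # True
```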
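As a complement, here is a minimal sketch (again not from the paper; the toy data, network, learning rate, and finite-difference gradient are illustrative assumptions) of the setup just described: a loss L(θ) built from training data and a composition of parameter-dependent transfer functions, minimized by plain gradient descent.

```python
# Minimal sketch: a parameter-dependent loss built as a composition of
# transfer functions over toy training data, minimized by gradient descent
# with a finite-difference gradient.  All concrete choices are illustrative.
import numpy as np

rng = np.random.default_rng(1)
d, T, n = 3, 4, 32                        # state dimension, layers, samples
X = rng.standard_normal((n, d))           # toy training inputs
Y = np.roll(X, 1, axis=1)                 # toy targets

def forward(h, theta):
    # Composition of T parameter-dependent transfer functions (residual layers).
    for W in theta:
        h = h + np.tanh(h @ W.T)
    return h

def loss(theta):
    # Mean-squared loss over the training data.
    return np.mean((forward(X, theta) - Y) ** 2)

def num_grad(theta, eps=1e-6):
    # Finite-difference gradient of the loss with respect to each layer's parameters.
    base = loss(theta)
    grads = []
    for k, W in enumerate(theta):
        G = np.zeros_like(W)
        for idx in np.ndindex(*W.shape):
            Wp = W.copy()
            Wp[idx] += eps
            theta_p = [Wp if j == k else Wj for j, Wj in enumerate(theta)]
            G[idx] = (loss(theta_p) - base) / eps
        grads.append(G)
    return grads

theta = [0.1 * rng.standard_normal((d, d)) for _ in range(T)]
for it in range(200):                     # plain gradient descent
    g = num_grad(theta)
    theta = [W - 0.1 * G for W, G in zip(theta, g)]
print(f"final loss: {loss(theta):.4f}")
```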