

REGULAR PAPER

A new approach for the vanishing gradient problem on sigmoid activation

Matías Roodschild¹ · Jorge Gotay Sardiñas¹ · Adrián Will¹

Matías Roodschild, [email protected] · Jorge Gotay Sardiñas, [email protected] · Adrián Will, [email protected]

¹ Grupo de Investigación en Tecnologías Avanzadas (GITIA), Facultad Regional Tucumán, Universidad Tecnológica Nacional, Tucumán, Argentina

Received: 6 March 2020 / Accepted: 27 September 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract

The vanishing gradient problem (VGP) is an important issue when training multilayer neural networks with the backpropagation algorithm. The problem is worse when sigmoid transfer functions are used in a network with many hidden layers. However, the sigmoid function is very important in several architectures, such as recurrent neural networks and autoencoders, where the VGP might also appear. In this article, we propose a modification of the backpropagation algorithm for training sigmoid neurons. It consists of adding a small constant to the calculation of the sigmoid's derivative, so that the proposed training direction differs slightly from the gradient while the original sigmoid function is kept in the network. Our results suggest that the modified derivative reaches the same accuracy in fewer training steps on most datasets. Moreover, due to the VGP, training with the original derivative does not converge when sigmoid functions are used in more than five hidden layers, whereas the modification allows backpropagation to train two extra hidden layers in feedforward neural networks.

Keywords Vanishing gradient problem · Sigmoid function · Feedforward neural networks · Backpropagation algorithm
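As a rough sketch of the idea described in the abstract, the following Python snippet keeps the standard sigmoid in the forward pass and adds a small constant to its derivative in the backward pass. The constant name `eps` and its value 0.01 are placeholders chosen for illustration, not the value used in the paper.

```python
import numpy as np

def sigmoid(x):
    """Standard logistic activation; the forward pass is unchanged."""
    return 1.0 / (1.0 + np.exp(-x))

def modified_sigmoid_derivative(a, eps=0.01):
    """Backward-pass derivative: the usual a * (1 - a) plus a small constant.

    `a` is the sigmoid output of the layer. The added `eps` (value assumed
    here) keeps this factor away from zero when `a` saturates near 0 or 1,
    so the backpropagated signal does not vanish as quickly.
    """
    return a * (1.0 - a) + eps

# Sketch of one backward step through a sigmoid layer with weights W:
#   delta_prev = (W.T @ delta) * modified_sigmoid_derivative(a_prev)
```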

1 Introduction

For a long time, the sigmoid and hyperbolic tangent functions were the most commonly used activation functions for neural networks, mainly due to their useful properties: both are differentiable and strictly increasing, they map the real axis onto (0, 1) and (−1, 1), respectively, and they represent an elegant balance between linear and nonlinear behavior. Nevertheless, the hyperbolic tangent (tansig) is usually preferred, because sigmoid neurons present a risk when training multilayer networks with the backpropagation algorithm, namely the appearance of the well-known vanishing gradient problem (VGP) [1–3]. This problem occurs when many components of the gradient of the loss function (in particular, the partial derivatives associated with the parameters of the layers closer to the input) get very close to zero. This stalls the updates of the parameters associated with these layers, since the algorithm uses the gradient to compute its next step. In other words, the parameters of the shallower layers do not change as much as they should, and they change less and less as the network gets deeper. The problem therefore gets worse as the number of hidden layers increases.
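To make this decay concrete: the sigmoid's derivative, σ'(z) = σ(z)(1 − σ(z)), never exceeds 0.25, so the error signal backpropagated through each sigmoid layer is scaled by this factor (together with the weights). The following Python sketch, with hypothetical layer sizes and randomly initialized weights, shows the norm of the backpropagated signal typically shrinking by orders of magnitude after a few layers; it illustrates the phenomenon and is not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical network: 8 sigmoid layers of 20 units each (sizes assumed).
sizes = [20] * 9
weights = [rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]

# Forward pass on a random input, keeping every activation.
a = rng.normal(size=sizes[0])
activations = [a]
for W in weights:
    a = sigmoid(W @ a)
    activations.append(a)

# Backward pass: start from a unit-norm error signal at the output and
# propagate it toward the input, layer by layer.
delta = rng.normal(size=sizes[-1])
delta /= np.linalg.norm(delta)
for W, act in zip(reversed(weights[1:]), reversed(activations[1:-1])):
    # sigma'(z) = act * (1 - act) <= 0.25, so the signal tends to shrink.
    delta = (W.T @ delta) * act * (1.0 - act)
    print(f"backpropagated signal norm: {np.linalg.norm(delta):.2e}")
```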

Several approaches have been proposed for this problem. Some authors suggested replacing the activation function with