

REGULAR PAPER

A new approach for the vanishing gradient problem on sigmoid activation

Matías Roodschild¹ · Jorge Gotay Sardiñas¹ · Adrián Will¹

Matías Roodschild, [email protected] · Jorge Gotay Sardiñas, [email protected] · Adrián Will, [email protected]

¹ Grupo de Investigación en Tecnologías Avanzadas (GITIA), Facultad Regional Tucumán, Universidad Tecnológica Nacional, Tucumán, Argentina

Received: 6 March 2020 / Accepted: 27 September 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract

The vanishing gradient problem (VGP) is an important issue when training multilayer neural networks with the backpropagation algorithm. The problem is worse when sigmoid transfer functions are used in a network with many hidden layers. However, the sigmoid function is very important in several architectures, such as recurrent neural networks and autoencoders, where the VGP might also appear. In this article, we propose a modification of the backpropagation algorithm for training sigmoid neurons. It consists of adding a small constant to the calculation of the sigmoid's derivative, so that the proposed training direction differs slightly from the gradient while the original sigmoid function is kept in the network. Our results suggest that the modified derivative reaches the same accuracy in fewer training steps on most datasets. Moreover, due to the VGP, training with the original derivative does not converge when sigmoid functions are used in more than five hidden layers, whereas the modification allows backpropagation to train two extra hidden layers in feedforward neural networks.

Keywords Vanishing gradient problem · Sigmoid function · Feedforward neural networks · Backpropagation algorithm
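As a rough sketch of the idea described in the abstract, the following Python snippet keeps the standard sigmoid in the forward pass and adds a small constant to its derivative in the backward pass. The constant name `eps` and its value 0.01 are placeholders chosen for illustration, not the value used in the paper.

```python
import numpy as np

def sigmoid(x):
    """Standard logistic activation; the forward pass is unchanged."""
    return 1.0 / (1.0 + np.exp(-x))

def modified_sigmoid_derivative(a, eps=0.01):
    """Backward-pass derivative: the usual a * (1 - a) plus a small constant.

    `a` is the sigmoid output of the layer. The added `eps` (value assumed
    here) keeps this factor away from zero when `a` saturates near 0 or 1,
    so the backpropagated signal does not vanish as quickly.
    """
    return a * (1.0 - a) + eps

# Sketch of one backward step through a sigmoid layer with weights W:
#   delta_prev = (W.T @ delta) * modified_sigmoid_derivative(a_prev)
```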

1 Introduction

For a long time, the sigmoid and hyperbolic tangent functions were the most commonly used activation functions for neural networks, mainly due to their useful properties: both are differentiable and strictly increasing, they map the real axis onto (0, 1) and (−1, 1), respectively, and they represent an elegant balance between linear and nonlinear behavior. Nevertheless, the hyperbolic tangent (tansig) is usually preferred, because sigmoid neurons present a risk when training multilayer networks with the backpropagation algorithm, namely the appearance of the well-known vanishing gradient problem (VGP) [1–3]. This problem occurs when many components of the gradient of the loss function (in particular, the partial derivatives associated with the parameters of the layers closer to the input) get very close to zero. This stalls the updates of the parameters associated with these layers, since the algorithm uses the gradient to compute its next step. In other words, the parameters of the shallower layers do not change as much as they should, and they change less and less as the network gets deeper. The problem therefore gets worse as the number of hidden layers increases.
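To make this decay concrete: the sigmoid's derivative, σ'(z) = σ(z)(1 − σ(z)), never exceeds 0.25, so the error signal backpropagated through each sigmoid layer is scaled by this factor (together with the weights). The following Python sketch, with hypothetical layer sizes and randomly initialized weights, shows the norm of the backpropagated signal typically shrinking by orders of magnitude after a few layers; it illustrates the phenomenon and is not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical network: 8 sigmoid layers of 20 units each (sizes assumed).
sizes = [20] * 9
weights = [rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]

# Forward pass on a random input, keeping every activation.
a = rng.normal(size=sizes[0])
activations = [a]
for W in weights:
    a = sigmoid(W @ a)
    activations.append(a)

# Backward pass: start from a unit-norm error signal at the output and
# propagate it toward the input, layer by layer.
delta = rng.normal(size=sizes[-1])
delta /= np.linalg.norm(delta)
for W, act in zip(reversed(weights[1:]), reversed(activations[1:-1])):
    # sigma'(z) = act * (1 - act) <= 0.25, so the signal tends to shrink.
    delta = (W.T @ delta) * act * (1.0 - act)
    print(f"backpropagated signal norm: {np.linalg.norm(delta):.2e}")
```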

Several approaches have been proposed for this problem. Some authors suggested replacing the activation function with