Learning flat representations with artificial neural networks

Vlad Constantinescu¹,² · Costin Chiru¹,² · Tudor Boloni³ · Adina Florea² · Robi Tacutu¹,⁴

Accepted: 21 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

In this paper, we propose a method for learning representation layers with squashing activation functions within a deep artificial neural network, which directly addresses the vanishing gradients problem. The proposed solution is derived from solving the maximum likelihood estimator for components of the posterior representation, which are approximately Beta-distributed, formulated in the context of variational inference. This approach not only improves the performance of deep neural networks with squashing activation functions on some of the hidden layers, including in discriminative learning, but can also be employed to produce sparse codes.

Keywords Learning representations · Infomax · Beta distribution · Vanishing gradients

Vlad Constantinescu: [email protected]
Robi Tacutu: [email protected]

¹ Systems Biology of Aging Group, Institute of Biochemistry of the Romanian Academy, Bucharest, Romania
² Computer Science and Engineering Department, University Politehnica of Bucharest, Bucharest, Romania
³ AITIAOne Inc., 2531 Piedmont Ave., Montrose, CA, United States
⁴ Chronos Biosystems SRL, Bucharest, Romania

1 Introduction

Currently, most deep neural network models tend to avoid squashing activation functions (AFs), such as the logistic sigmoid or the hyperbolic tangent [1]. This is because optimization with Stochastic Gradient Descent (SGD), with gradients computed by back-propagation, has been found to be ineffective: the derivatives vanish when the function reaches its saturation region [2]. For feed-forward neural networks, the typical approach relies on various non-squashing AFs, which are less prone to vanishing gradients for activations in (0, ∞), such as the Rectified Linear Unit (ReLU) [3, 4], the softplus activation [4], or the Exponential Linear Unit (ELU) [5]. However, these approaches share one main disadvantage: the gradients may explode in the positive region of the AFs' inputs, so additional measures are needed to regularize them. Moreover, although ReLU is the most popular AF, it has an additional problem: its gradient is 0 in the negative region, so some weights stop being adjusted. This means that parts of the network will stop responding to error variations at some point during training.

By contrast, if a technical solution to the vanishing gradient problem existed, squashing functions could present significant advantages, mainly due to their nonlinear nature, with a smooth analog activation and a bounded output range.

Considering the above, in this work we revisit the technical problem of vanishing gradients and present a method able to alleviate the main disadvantage of the squashing functions. The method proposes an alternative formulation of the canonically-described vanishing