Learning flat representations with artificial neural networks
Vlad Constantinescu 1,2 · Costin Chiru 1,2 · Tudor Boloni 3 · Adina Florea 2 · Robi Tacutu 1,4

Accepted: 21 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract

In this paper, we propose a method of learning representation layers with squashing activation functions within a deep artificial neural network, which directly addresses the vanishing gradients problem. The proposed solution is derived from solving the maximum likelihood estimator for the components of the posterior representation, which are approximately Beta-distributed, formulated in the context of variational inference. This approach not only improves the performance of deep neural networks with squashing activation functions on some of the hidden layers, including in discriminative learning, but can also be employed to produce sparse codes.

Keywords Learning representations · Infomax · Beta distribution · Vanishing gradients
Vlad Constantinescu
[email protected]

Robi Tacutu
[email protected]

1 Systems Biology of Aging Group, Institute of Biochemistry of the Romanian Academy, Bucharest, Romania
2 Computer Science and Engineering Department, University Politehnica of Bucharest, Bucharest, Romania
3 AITIAOne Inc., 2531 Piedmont Ave., Montrose, CA, United States
4 Chronos Biosystems SRL, Bucharest, Romania

1 Introduction

Currently, most deep neural network models tend to avoid squashing activation functions (AFs), such as the logistic sigmoid or the hyperbolic tangent [1]. This is because Stochastic Gradient Descent (SGD) optimization, implemented as back-propagation, has been found to be ineffective due to the vanishing derivatives when the function reaches its saturation region [2]. For feed-forward neural networks, the typical approach relies on various non-squashing AFs, which are less prone to vanishing gradients for activations in (0, ∞), such as the Rectified Linear Unit (ReLU) [3, 4], the softplus activation [4] or the Exponential Linear Unit [5]. However, these approaches come with a main disadvantage: the gradients may explode in the positive region of the AFs' inputs, and thus additional measures have to be taken to regularize them. Moreover, although ReLU is the most popular AF, it has an additional problem: its gradients become 0 in the negative region, so some weights stop being adjusted. This means that parts of the network stop responding to error variations at some point during training. By contrast, if a technical solution to the vanishing gradients problem existed, squashing functions could present significant advantages, mainly due to their nonlinear nature, with a smooth analog activation and a bounded output range.
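To make the saturation argument above concrete, the short NumPy sketch below (our own illustration, not code from the paper) evaluates the derivatives of the logistic sigmoid, the hyperbolic tangent and ReLU at a few pre-activation values: the derivatives of the squashing functions collapse toward zero for large |x|, while the ReLU derivative is exactly zero for all negative inputs.

```python
# Minimal illustration (assumption: not part of the paper's method) of why
# squashing activations suffer from vanishing gradients in their saturation
# regions, and why ReLU units can "die" for negative pre-activations.
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)          # peaks at 0.25, decays exponentially in |x|

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2  # peaks at 1.0, decays even faster

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 0 for x <= 0, 1 for x > 0

x = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
print("sigmoid'(x):", sigmoid_grad(x))  # ~4.5e-05 at |x| = 10 -> vanishing
print("tanh'(x):   ", tanh_grad(x))     # ~8.2e-09 at |x| = 10 -> vanishing
print("relu'(x):   ", relu_grad(x))     # [0, 0, 0, 1, 1] -> zero in the negative region
```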
Considering the above, in this work we revisit the technical problem of vanishing gradients and present a method able to alleviate this main disadvantage of squashing functions. The method proposes an alternative formulation of the canonically-described vanishing