Natural Gradient Learning and Its Dynamics in Singular Regions

Learning takes place in a parameter space, which in general is not Euclidean but Riemannian. Therefore, we need to take the Riemannian structure into account when designing a learning method. The natural gradient method, a version of stochastic descent learning that uses the Riemannian gradient, is proposed for this purpose. It is a Fisher-efficient on-line method of estimation. Its performance is excellent in general, and it has been used in various types of learning problems such as neural learning, policy gradients in reinforcement learning, optimization by means of stochastic relaxation, independent component analysis, Markov chain Monte Carlo (MCMC) in a Riemannian manifold, and others. Some statistical models are singular, meaning that their parameter spaces include singular regions. The multilayer perceptron (MLP) is a typical singular model. Since supervised learning of the MLP is involved in deep learning, it is important to study the dynamical behavior of learning in singular regions, in which learning is very slow. This is known as the plateau phenomenon. The natural gradient method overcomes this difficulty.
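To make the update rule concrete, here is a minimal sketch of natural gradient learning for a simple case, assuming a linear regression model with unit-variance Gaussian noise so that the Fisher information matrix is G = E[x x^T]. The model, the helper names (fisher_information, natural_gradient_step) and the NumPy implementation are illustrative choices, not the chapter's notation.

```python
import numpy as np

# Minimal sketch (illustrative assumptions): a linear model y = xi . x + noise
# with unit noise variance, for which the Fisher information matrix is
# G = E[x x^T].  The natural gradient update premultiplies the ordinary
# gradient of the instantaneous loss by G^{-1}.

def fisher_information(X):
    """Empirical Fisher information G = E[x x^T] (unit noise variance assumed)."""
    return X.T @ X / len(X)

def natural_gradient_step(xi, x, y, G, eta=0.1):
    """One on-line update: xi <- xi - eta * G^{-1} grad_xi l(x, y; xi)."""
    grad = -(y - x @ xi) * x              # gradient of l = (1/2)(y - xi.x)^2
    return xi - eta * np.linalg.solve(G, grad)

# Toy usage: one pass over synthetic data drives xi toward the true parameter.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_xi = np.array([1.0, -2.0, 0.5])
Y = X @ true_xi + 0.1 * rng.normal(size=500)

G = fisher_information(X)
xi = np.zeros(3)
for x, y in zip(X, Y):
    xi = natural_gradient_step(xi, x, y, G)
print(xi)   # close to true_xi
```

In this linear-Gaussian case G is constant and can be estimated once; in general the Riemannian metric depends on the parameter ξ and must be re-estimated or approximated as learning proceeds.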

12.1 Natural Gradient Stochastic Descent Learning

12.1.1 On-Line Learning and Batch Learning

Huge amounts of data exist in the real world. Consider a set of data which are generated randomly subject to a fixed but unknown probability distribution. A typical example is the regression problem, where an input signal x is generated randomly, accompanied by a desired response f(x). A teacher signal y, which is a noisy version of the desired output f(x),

$$
y = f(x) + \varepsilon, \qquad (12.1)
$$

is given together with x, where ε is random noise. The task of a learning machine is, in this case, to estimate the desired output mapping f(x) by using the available examples of input–output pairs D = {(x_i, y_i), i = 1, 2, ..., T}, called training examples. They are subject to an unknown joint probability distribution,

$$
p(x, y) = q(x)\,\mathrm{Prob}\{y \mid x\} = q(x)\, p_\varepsilon\{y - f(x)\}, \qquad (12.2)
$$

where q(x) is the probability distribution of x and p_ε(ε) is the probability distribution of the noise ε, typically Gaussian. This is the usual scheme of supervised learning. We use a parameterized family f(x, ξ) of functions as candidates for the desired output, where ξ is a vector parameter. The set of ξ is a parameter space, and we search for the optimal ξ̂ that approximates the true f(x) by using the training examples D. When y takes an analog value, this is a regression problem. When y is discrete, say binary, this is pattern recognition. In order to evaluate the performance of machine f(x, ξ), we define a loss function or cost function. The instantaneous loss of processing x by machine f(x, ξ) is typically given by

$$
l(x, y; \xi) = \frac{1}{2}\,\{y - f(x, \xi)\}^2 \qquad (12.3)
$$

in the case of regression.
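To illustrate this scheme end to end, the following sketch draws training examples D from a noisy target, fits a parameterized model f(x, ξ) by batch gradient descent on the average of the squared loss (12.3), and prints the fitted parameter. The particular target, the linear form of f(x, ξ), and the plain gradient loop are illustrative assumptions rather than the chapter's example.

```python
import numpy as np

# Illustrative sketch of the supervised learning scheme: training examples
# D = {(x_i, y_i)} are drawn from y = f(x) + noise, and a parameterized model
# f(x, xi) is fitted by minimizing the average of the squared loss (12.3).
# The target, the linear model, and batch gradient descent are assumptions
# chosen for brevity.

rng = np.random.default_rng(1)

def target_f(X):
    """Unknown true mapping f(x) (here a fixed linear map, for illustration)."""
    return X @ np.array([2.0, -1.0])

# Training examples D = {(x_i, y_i), i = 1, ..., T}
T = 200
X = rng.normal(size=(T, 2))
Y = target_f(X) + 0.1 * rng.normal(size=T)

def model_f(X, xi):
    """Parameterized candidate family f(x, xi)."""
    return X @ xi

def loss(X, Y, xi):
    """Average of the instantaneous loss (1/2)(y - f(x, xi))^2 over D."""
    return 0.5 * np.mean((Y - model_f(X, xi)) ** 2)

# Batch gradient descent on the average loss.
xi = np.zeros(2)
eta = 0.1
for _ in range(200):
    grad = -(Y - model_f(X, xi)) @ X / T   # gradient of the average loss
    xi -= eta * grad

print(xi, loss(X, Y, xi))   # xi close to [2, -1], loss near the noise floor
```

The batch loop above uses the whole of D at every step; the on-line alternative discussed in this section instead updates ξ from one example (x_t, y_t) at a time, which is the setting in which the natural gradient method is formulated.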