Optimal transport natural gradient for statistical manifolds with continuous sample space

Yifan Chen¹ · Wuchen Li²

Received: 21 May 2018 / Revised: 22 March 2020
© Springer Nature Singapore Pte Ltd. 2020

Abstract
We study the Wasserstein natural gradient in parametric statistical models with continuous sample spaces. Our approach is to pull back the L²-Wasserstein metric tensor from the probability density space to a parameter space, equipping the latter with a positive definite metric tensor under which it becomes a Riemannian manifold, named the Wasserstein statistical manifold. In general, it is not a totally geodesic sub-manifold of the density space, so its geodesics differ from the Wasserstein geodesics; the well-known Gaussian case is an exception, a fact that can also be verified within our framework. We use the sub-manifold geometry to derive a gradient flow and a natural gradient descent method in the parameter space. When the parametrized densities are supported on ℝ, the induced metric tensor admits an explicit formula. In optimization problems, we observe that natural gradient descent outperforms standard gradient descent when the Wasserstein distance is the objective function. In this case, we prove that the resulting algorithm behaves similarly to the Newton method in the asymptotic regime. The proof computes the exact Hessian of the Wasserstein distance, which further motivates another preconditioner for the optimization process. Finally, we present examples illustrating the effectiveness of the natural gradient in several parametric statistical models, including the Gaussian measure, Gaussian mixture, Gamma distribution, and Laplace distribution.

Keywords Optimal transport · Information geometry · Wasserstein statistical manifold · Wasserstein natural gradient
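As a concrete illustration of the one-dimensional case, the sketch below numerically assembles the Wasserstein metric tensor and performs a natural gradient update for a Gaussian family. This is a minimal sketch, not the authors' code: it assumes the standard one-dimensional expression G_W(θ)_ij = ∫ ∂_θi F(x, θ) ∂_θj F(x, θ) / ρ(x, θ) dx with F the cumulative distribution function, and the grid, step size, finite-difference scheme, and helper names (wasserstein_metric, natural_gradient_step) are illustrative choices of ours.

```python
# Minimal sketch, not the authors' code. It assumes the one-dimensional expression
#   G_W(theta)_ij = \int d_{theta_i} F(x, theta) d_{theta_j} F(x, theta) / rho(x, theta) dx,
# with F the CDF, and illustrates a Wasserstein natural-gradient update for the
# Gaussian family N(mu, sigma^2). Grid, step size, and finite differences are
# illustrative numerical choices.
import numpy as np
from scipy.stats import norm

def wasserstein_metric(theta, x):
    """Numerically assemble G_W(theta) on the grid x for theta = (mu, sigma)."""
    mu, sigma = theta
    rho = norm.pdf(x, loc=mu, scale=sigma)          # density rho(x, theta)
    eps = 1e-5
    # central finite differences of the CDF F(x, theta) in each parameter
    dF = np.stack([
        (norm.cdf(x, mu + eps, sigma) - norm.cdf(x, mu - eps, sigma)) / (2 * eps),
        (norm.cdf(x, mu, sigma + eps) - norm.cdf(x, mu, sigma - eps)) / (2 * eps),
    ])
    dx = x[1] - x[0]
    # G_ij = sum_k dF_i(x_k) dF_j(x_k) / rho(x_k) * dx  (trapezoid-like Riemann sum)
    return np.einsum('ik,jk,k->ij', dF, dF, dx / rho)

def natural_gradient_step(theta, euclidean_grad, x, lr=0.1):
    """One natural-gradient update: theta <- theta - lr * G_W(theta)^{-1} grad."""
    G = wasserstein_metric(theta, x)
    return theta - lr * np.linalg.solve(G, euclidean_grad)

x = np.linspace(-10.0, 10.0, 4001)                  # truncated grid for the integral
theta = np.array([0.0, 1.0])
print(wasserstein_metric(theta, x))                 # approximately the 2x2 identity
```

For the Gaussian family the computed tensor is, up to discretization error, the identity in the (μ, σ) coordinates, consistent with the closed-form Wasserstein distance between one-dimensional Gaussians; for other families, such as the Gamma and Laplace distributions appearing in the examples, G_W(θ) is non-trivial and the natural gradient update differs from plain gradient descent.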

Yifan Chen
[email protected]

Wuchen Li
[email protected]

¹ Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91106, USA

² Department of Mathematics, UCLA, Los Angeles, CA 90095, USA


1 Introduction

The statistical distance between probability measures plays an important role in many fields, such as data analysis and machine learning, which usually involve minimizing a loss function of the form

    minimize d(ρ, ρ_e)   subject to   ρ ∈ P_θ.

Here P_θ is a parameterized subset of the probability density space, and ρ_e is the target density, often an empirical realization of a ground-truth distribution. The function d quantifies the difference between the densities ρ and ρ_e. An important example of d is the Kullback–Leibler (KL) divergence, also known as the relative entropy, which is closely related to maximum likelihood estimation in statistics and to the field of information geometry [2,7]. The Hessian operator of the KL divergence endows P_θ with the structure of a statistical manifold, whose Riemannian metric is the Fisher–Rao metric. By a result of Chentsov [15], the Fisher–Rao metric is, up to scaling, the only metric invariant under sufficient statistics. Usin