Near-Optimal Coresets of Kernel Density Estimates


Jeff M. Phillips1 · Wai Ming Tai1

Received: 22 June 2018 / Revised: 14 August 2019 / Accepted: 27 August 2019
© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract
We construct near-optimal coresets for kernel density estimates for points in $\mathbb{R}^d$ when the kernel is positive definite. Specifically, we provide a polynomial-time construction for a coreset of size $O(\sqrt{d}/\varepsilon \cdot \sqrt{\log(1/\varepsilon)})$, and we show a near-matching lower bound of size $\Omega(\min\{\sqrt{d}/\varepsilon,\, 1/\varepsilon^2\})$. When $d \ge 1/\varepsilon^2$, it is known that the coreset can have size $O(1/\varepsilon^2)$. The upper bound is a polynomial-in-$(1/\varepsilon)$ improvement when $d \in [3, 1/\varepsilon^2)$, and the lower bound is the first known lower bound to depend on $d$ for this problem. Moreover, the restriction in the upper bound that the kernel be positive definite is significant in that it still covers a wide variety of kernels, specifically those most important for machine learning. This includes kernels for information distances and the sinc kernel, which can be negative.

Keywords Coreset · Kernel density estimate · Discrepancy theory

1 Introduction

Kernel density estimates are pervasive objects in data analysis. They are the classic way to estimate a continuous distribution from a finite sample of points [46,47]. With some points negatively weighted, they are the prediction function in kernel SVM classifiers [44]. They are the core of many robust topological reconstruction approaches [7,17,41]; and they arise in many other applications including mode estimation [1], outlier detection [45], regression [16], and clustering [42].

Editor in Charge: Kenneth Clarkson

J.M. Phillips thanks the support of NSF CCF-1350888, IIS-1251019, ACI-1443046, CNS-1514520, and CNS-1564287.

1 University of Utah, Salt Lake City, USA


Consider a dataset $P \subset \mathbb{R}^d$ of size $n$, and a kernel $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, for instance the Gaussian kernel $K(x, p) = \exp(-\alpha^2 \|x - p\|^2)$ with $1/\alpha$ as a bandwidth parameter. Then a kernel density estimate is defined at any point $x \in \mathbb{R}^d$ as
$$\mathrm{kde}_P(x) = \frac{1}{n} \sum_{p \in P} K(x, p).$$
Given that it takes $O(n)$ time to evaluate $\mathrm{kde}_P$, and that data sets are growing to massive sizes, a common approach for continuing to use these powerful modeling objects is to replace $P$ with a much smaller data set $Q$ so that $\mathrm{kde}_Q$ approximates $\mathrm{kde}_P$. While statisticians have classically studied various sorts of average deviation ($L_2$ [46,47] or $L_1$ error [13]), for most modern data modeling purposes a worst-case $L_\infty$ guarantee is more relevant (e.g., for preserving classification margins [44], density estimates [51], topology [41], and hypothesis testing on distributions [22]). Specifically, this error guarantee ensures
$$\|\mathrm{kde}_P - \mathrm{kde}_Q\|_\infty = \sup_{x \in \mathbb{R}^d} |\mathrm{kde}_P(x) - \mathrm{kde}_Q(x)| \le \varepsilon.$$
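To make these definitions concrete, the following is a minimal Python sketch (not from the paper): it evaluates the Gaussian kernel density estimate $\mathrm{kde}_P$ and approximates the $L_\infty$ error between $\mathrm{kde}_P$ and $\mathrm{kde}_Q$ by taking the maximum deviation over a finite set of query points, since the supremum over all of $\mathbb{R}^d$ cannot be computed by enumeration. The random subsample used for $Q$ is only a baseline, not the coreset construction studied in this paper, and all function names and parameters are illustrative.

```python
import numpy as np

def kde(P, x, alpha=1.0):
    # Gaussian KDE: kde_P(x) = (1/n) * sum_{p in P} exp(-alpha^2 * ||x - p||^2)
    diffs = P - x                               # shape (n, d)
    sq_dists = np.einsum('ij,ij->i', diffs, diffs)
    return np.exp(-(alpha ** 2) * sq_dists).mean()

def linf_error(P, Q, queries, alpha=1.0):
    # Approximate ||kde_P - kde_Q||_inf by the maximum deviation over a finite query set.
    return max(abs(kde(P, x, alpha) - kde(Q, x, alpha)) for x in queries)

# Baseline only: Q is a uniform random subsample of P, not the paper's construction.
rng = np.random.default_rng(0)
P = rng.normal(size=(10000, 3))
Q = P[rng.choice(len(P), size=200, replace=False)]
queries = rng.normal(size=(500, 3))
print(linf_error(P, Q, queries))
```

Evaluating the error only at sampled query points gives a lower bound on the true $L_\infty$ error; the coreset constructions discussed below guarantee the bound over all of $\mathbb{R}^d$.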

We call such a set $Q$ an $\varepsilon$-KDE coreset of $P$. In this paper we study how small $Q$ can be as a function of the error $\varepsilon$, the dimension $d$, and properties of the kernels.1

1.1 Background on Kernels and Related Coresets

Tradit