The Intrinsic Dimensionality of Data



Subhash Kak
Received: 29 June 2020 / Revised: 18 October 2020 / Accepted: 22 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

We consider the problem of determining the intrinsic dimensionality of data, which is important for optimizing the organization and processing of large data sets in classical machines, quantum decision theory, and observations of natural phenomena. We prove a theorem that determines the minimum dimensions associated with the data, and this result is consistent with the result that base-e is optimal for number representation. The dimension value may be viewed as coding the structure in the most efficient representation of the data and has relevance for natural and engineered systems. Since the optimal intrinsic dimensionality is shown to be noninteger, this paper provides a rationale for fractals in natural data.

Keywords Intrinsic dimensions · Noninteger dimensions · Fractal data · Information theory

1 Introduction

In human–machine interaction associated with engineered and natural systems, the decisions of the human agents must be based on logic that is optimized with respect to the nature of the data [21, 31]. This logic is an element of data analysis that is essential for the discovery of association rules, intrinsic data classes, and behavior in engineered and physical systems. For physical systems we are also motivated to discover the optimal representation, which is likely to provide insight into how fundamentally simple processes at an elementary level lead to much more complex observed behavior [32]. Many methods used for data analysis are motivated by computational considerations rather than by the question of intrinsic dimensionality. Data from a source or sensor is assigned to a single bin with which a separate dimension is associated. For natural data, dimensions are mapped from the corresponding physical system. In the cases

Subhash Kak, [email protected]
Oklahoma State University, Stillwater, USA

Circuits, Systems, and Signal Processing

of financial or economic data [30], if the data is a single-variable time series and the problem is that of predicting the next point given the previous n points, the issue of dimensionality may be related to the number of parameters of the underlying neural network map [10, 27].

Let us consider the problem of data analysis from a single set of data that satisfies a distribution. Typically, this requires searching within a set of examples E and accepting those that satisfy certain constraints ϕ. Such a process entails evaluating large numbers of examples t ∈ E against a set of sample data D, seeking those ρ that satisfy ϕ. In such a situation, it is possible for ρ to satisfy ϕ within D even though it does not do so for the distribution. Let the probability of such misclassification be κ; then, in choosing n different patterns, the probability of error will be 1 − (1 − κ)^n, which can be large even for small values of κ once n grows. Given such error, one may end up with redunda
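The compounding of the per-pattern misclassification probability κ can be illustrated numerically; the following is a minimal sketch (the function name and the sample values of κ and n are illustrative choices, not from the paper):

```python
def error_probability(kappa: float, n: int) -> float:
    """Probability that at least one of n chosen patterns is misclassified,
    assuming an independent per-pattern misclassification probability kappa.
    This implements the expression 1 - (1 - kappa)^n from the text."""
    return 1.0 - (1.0 - kappa) ** n

# Even a 1% per-pattern error compounds quickly as n grows:
for n in (1, 10, 50, 100):
    print(n, error_probability(0.01, n))
```

For κ = 0.01, the error probability already exceeds 0.63 at n = 100, showing how a small per-pattern error can dominate when many patterns are evaluated.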