SADIRE: a context-preserving sampling technique for dimensionality reduction visualizations

REGULAR PAPER

Wilson Estécio Marcilio-Jr

Danilo Medeiros Eler


Received: 29 November 2019 / Revised: 17 April 2020 / Accepted: 23 June 2020
© The Visualization Society of Japan 2020

Abstract Sampling techniques are widely used to reduce the complexity and improve the interpretability of datasets. Given the enormous volume of available data, these techniques try to select representative data points that inherently reflect the data structure. In this work, we propose a novel sampling technique that preserves the structures imposed by dimensionality reduction techniques when visualized as scatter plots. In the experiments, we demonstrate how our technique reflects the class boundaries and layout structures while decreasing the redundancy of datasets visualized as scatter plots. We also provide a user experiment regarding the perception of sampling in scatter plot visualizations.

Keywords Visualization · Multidimensional projection · Scatter plot · Overplotting · Sampling · Context-preserving

1 Introduction

During analytical tasks, one of the most common challenges is dealing with multidimensional data. Applications that can benefit from multidimensional data, such as fraud detection (Leite et al. 2018), machine learning (Pezzotti et al. 2018), and image analysis (Eler et al. 2009), have drawn considerable attention to research on dimensionality reduction (DR) techniques (van der Maaten and Hinton 2008; Sarikaya et al. 2018; McInnes et al. 2018). These techniques reduce the dimensionality of a dataset while preserving the similarity and neighborhood relations of the multidimensional space as much as possible. Formally, a DR technique can be defined as a function $f$ that minimizes $|d(x_i, x_j) - d(f(x_i), f(x_j))|$, where $x_i$ and $x_j$ are dataset instances, and $d(x_i, x_j)$ and $d(f(x_i), f(x_j))$ measure the distance between $x_i$ and $x_j$ in the multidimensional space and in the projected space, respectively. The result of a dimensionality reduction is usually visualized with scatter plots, in which similarity is depicted by spatial proximity.

Besides representing the structures and relations of multidimensional datasets in lower dimensions, scaling computational methods to massive datasets is also a widely discussed problem, and it has become increasingly important. Researchers have long recognized that the interactivity of methods cannot be guaranteed merely by adding raw processing power (Hellerstein et al. 1999). To address this problem, sampling mechanisms are often applied, since approximate answers based on samples are often as useful as exact answers (Kwon et al. 2017). The visualization community also takes advantage of sampling techniques, proposing summary visualizations to depict large amounts of data (Sarikaya et al. 2018).
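
The objective stated above, the discrepancy between pairwise distances before and after projection, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, Euclidean distance is assumed for $d$ in both spaces, and a plain PCA computed by SVD stands in for $f$ purely as an example projection.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance matrix between all rows of `points` (assumed choice of d)."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def distance_discrepancy(data, projection):
    """Mean of |d(x_i, x_j) - d(f(x_i), f(x_j))| over all pairs i < j."""
    d_high = pairwise_distances(data)
    d_low = pairwise_distances(projection)
    i, j = np.triu_indices(len(data), k=1)  # each unordered pair once
    return float(np.abs(d_high[i, j] - d_low[i, j]).mean())

def pca_project(data, n_components=2):
    """Illustrative f (an assumption): project onto the top principal components."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Synthetic data, used only to exercise the functions.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 10))
projection = pca_project(data, n_components=2)
print(distance_discrepancy(data, projection))
```

A lower value means the projection distorts pairwise distances less. Keeping all ten components amounts to a rotation of the centered data, so the discrepancy drops to numerical zero in that case, which makes a convenient sanity check.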

W. E. Marcilio-Jr (✉) · D. M. Eler
São Paulo State University, Presidente Prudente, São Paulo, Bra