Hash-tree PCA: accelerating PCA with hash-based grouping

  • PDF / 1,416,165 Bytes
  • 17 Pages / 439.37 x 666.142 pts Page_size
  • 58 Downloads / 202 Views

DOWNLOAD

REPORT


Hash‑tree PCA: accelerating PCA with hash‑based grouping Lkhagvadorj Battulga1 · Sang‑Hyun Lee2 · Aziz Nasridinov3 · Kwan‑Hee Yoo3

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract In data mining or machine learning, one of the most commonly used feature extraction techniques is principal component analysis (PCA). However, it performs poorly on a large dataset. In this paper, we propose a new method of accelerating conventional PCA, named hash-tree PCA. It samples the objects that are similar to each other without losing the original data distribution. First, it explores similar objects and stores them in hash tables. Afterward, it samples a certain number of the objects from each hash table and creates a new dataset with a reduced number of objects. Finally, it executes PCA on the sampled dataset. Experimental results show that our method outperforms the PCA and fast PCA methods. Keywords  Fast PCA · Hash table · Eigenvalue decomposition (EVD) · Data grouping · Feature extraction

1 Introduction Recently, artificial intelligence (AI) has become an essential part of daily life due to its technological development [1]. However, algorithms that work behind AI do not scale up to high-dimensional datasets. Thus, conventional methods extract the most valuable variables from the dataset before AI algorithms are applied. This preprocessing step is * Kwan‑Hee Yoo [email protected] Lkhagvadorj Battulga [email protected] Sang‑Hyun Lee [email protected] Aziz Nasridinov [email protected] 1

Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands

2

YURA Co., Ltd, Seongnam, Gyeunggi, South Korea

3

Department of Computer Science, Chungbuk National University, Cheongju, South Korea



13

Vol.:(0123456789)



L. Battulga et al.

known as feature extraction or dimension reduction. One of the most commonly used feature extraction or dimension reduction techniques is principal component analysis (PCA). English mathematician Karl Pearson originated PCA theory in his research [2], and ever since, it has been used in various fields, from statistics to computer science. Generally, in computer science, PCA is used for data compression, dimension reduction, and image processing. Current systems and applications collect a large amount of data on a daily basis [3]. Executing PCA on this collected data would consume a massive amount of time on commodity hardware. Therefore, several researchers have proposed accelerating the execution time of PCA for large datasets. Nobuhiro and Kuroki [4] accelerated execution time of PCA with parallel programming by using a graphics processing unit (GPU) instead of a central processing unit (CPU). Vogt and Tacke [5] achieved PCA execution–time acceleration by using wavelet transformation. In their work, the authors claimed that if the data distribution is kept to the fewest number of objects, executing PCA on the reduced dataset is faster, and the result is identical to executing conventional PCA on the original