Performance and power consumption analysis of Arm Scalable Vector Extension


Tetsuya Odajima¹ · Yuetsu Kodama¹ · Mitsuhisa Sato¹

Accepted: 26 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Modern CPUs not only have multiple cores but also support wide single instruction multiple data (SIMD) instructions. This trend is expected to continue. In this paper, we examine the effect of the vector length and the number of out-of-order resources on the performance and power consumption of programs run with multiple vector lengths using the Arm Scalable Vector Extension. Based on our evaluation, we conclude that using a longer vector length with multicycle vector units leads to up to approximately 30% improvement in performance and a 21% decrease in power consumption compared with using a shorter vector length.

Keywords: Arm Scalable Vector Extension · gem5 · McPAT · Simulation

¹ RIKEN Center for Computational Science, Kobe, Japan

1 Introduction

The latest CPU architectures not only contain multiple cores but also support wide single instruction multiple data (SIMD) instructions with extended vector lengths. Because a single instruction can execute several operations, high processing performance can be attained. Arm has introduced the Scalable Vector Extension (SVE) [1–4], which is an extension of its SIMD instruction set. The most important feature of SVE is its uniform support for vector lengths from 128 bits to 2048 bits. To enable this feature, SVE adopts the vector length agnostic (VLA) programming model, which allows the vector length to be determined dynamically. Generally, a longer vector length improves the peak computing performance because more elements can be processed in parallel. However, implementing this feature requires a significant amount of hardware resources.

In this study, we examined the impact of scaling the vector length and the number of out-of-order resources on performance and power consumption. We employed Arm SVE for our evaluations because its VLA programming model allows the same binaries to be executed with different vector lengths. We evaluated the performance for multiple vector lengths using the gem5 processor simulator [5–7], which supports cycle-accurate out-of-order pipeline simulation. Additionally, we analyzed the power consumption for multiple vector lengths using the McPAT framework [8], which is a tool for estimating processor area and power consumption.

To perform these experiments, we extended gem5 and McPAT. We developed the “gem5-sve” simulator [9] to support out-of-order execution with Arm SVE and set its parameters based on Marvell’s ThunderX2 processor [10, 11]. In addition, we modified McPAT and created its templates to support Arm SVE and SIMD instructions for calculating processor area and powe