Filter cache: filtering useless cache blocks for a small but efficient shared last-level cache
Han Jun Bae1 · Lynn Choi1
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract

Although the shared last-level cache (SLLC) occupies a significant portion of the multicore CPU chip die area, more than 59% of SLLC cache blocks are not reused during their lifetime. If we can filter out these useless blocks from the SLLC, we can effectively reduce its size without sacrificing performance. For this purpose, we classify the reuse of cache blocks into temporal and spatial reuse and further analyze it using reuse interval and reuse count. From our experiments, we found that most spatially reused cache blocks are reused only once, with a short reuse interval, so it is inefficient to manage them in the SLLC. In this paper, we propose a small additional cache for the SLLC, called the Filter Cache, which can not only check for temporal reuse but also prevent spatially reused blocks from entering the SLLC. Thus, we do not maintain data for non-reused blocks or spatially reused blocks in the SLLC, dramatically reducing its size. Through detailed simulation on the PARSEC benchmarks, we show that our new SLLC design with the Filter Cache exhibits performance comparable to the conventional SLLC with only 24.21% of the SLLC area across a variety of workloads. This is achieved by its faster access and the high reuse rates in the small SLLC with the Filter Cache.

Keywords Shared last-level cache · Reuse rate · Temporal reuse · Spatial reuse · Multicore CPU · Cache organization
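The reuse analysis described in the abstract can be illustrated with a small trace-driven sketch. The definitions below are assumptions for illustration only, not the paper's exact methodology: a block is counted as temporally reused when the same word is accessed again, and spatially reused when a different word of the same block is accessed; the reuse interval is measured in accesses since the block's previous access. The block size of 64 bytes and the function name `classify_reuse` are likewise hypothetical.

```python
# Hypothetical sketch of per-block reuse classification from an address trace.
# Assumed definitions (not taken verbatim from the paper):
#   - temporal reuse: the same word of a block is accessed again
#   - spatial reuse:  a different word of the same block is accessed
#   - reuse interval: number of accesses since the block's previous access

BLOCK_SIZE = 64  # bytes per cache block (assumed)

def classify_reuse(trace):
    """trace: iterable of byte addresses.
    Returns {block_number: {"reuse_count", "temporal", "spatial", "intervals"}}."""
    stats = {}
    last_access = {}  # block -> (trace index, word offset) of its last access
    for i, addr in enumerate(trace):
        block = addr // BLOCK_SIZE
        word = addr % BLOCK_SIZE
        if block in last_access:
            prev_i, prev_word = last_access[block]
            s = stats[block]
            s["reuse_count"] += 1
            s["intervals"].append(i - prev_i)  # reuse interval in accesses
            if word == prev_word:
                s["temporal"] += 1
            else:
                s["spatial"] += 1
        else:
            # First touch: the block is not (yet) reused.
            stats[block] = {"reuse_count": 0, "temporal": 0,
                            "spatial": 0, "intervals": []}
        last_access[block] = (i, word)
    return stats
```

A block whose `reuse_count` stays at zero corresponds to the "useless" blocks the paper proposes to filter out of the SLLC; blocks with a single spatial reuse at a short interval correspond to the category the Filter Cache is meant to intercept.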
1 School of Electrical Engineering, Korea University, Seoul, Korea

Table 1 The die areas and the access latencies of SLLCs on recent commercial multicore CPU chips

Core                          | Area | Latency
Intel i7 (3.4 GHz) [1]        | 37%  | 36-cycle
AMD Ryzen 7 (4.0 GHz) [2]     | 29%  | 39-cycle
Intel Itanium 2 (1.3 GHz) [3] | 51%  | 12-cycle
SUN SPARC M7 (4.13 GHz) [4]   | 53%  | 28-cycle
IBM POWER 8 (3.7 GHz) [5]     | 33%  | 27-cycle

Fig. 1 SLLC reuse rates in PARSEC benchmarks

1 Introduction

To bridge the speed gap between the off-chip memory and the private caches on multicore CPU chips, on-chip shared last-level caches (SLLCs) are expected to grow in size. However, increasing the size of the SLLC has the disadvantage of increased area and latency. These SLLCs are usually organized as multiple independent banks to reduce access time, yet the latency is still expected to grow due to the wire delay caused by the increasing number of sub-banks. In addition, on-chip SLLCs already occupy a significant portion of the CPU chip die area, as shown in Table 1. For example, in the 22-nm Haswell version of the 3.4 GHz Intel i7 chip, the SLLC occupies 37% of the total chip area and its latency is 36 cycles. Table 1 shows the die area and the access latency in CPU cycles for recent commercial multicore CPUs. Despite their huge size, SLLCs are not efficiently managed