Filter cache: filtering useless cache blocks for a small but efficient shared last-level cache
Han Jun Bae1 · Lynn Choi1
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract

Although the shared last-level cache (SLLC) occupies a significant portion of the multicore CPU chip die area, more than 59% of SLLC cache blocks are not reused during their lifetime. If we can filter out these useless blocks from the SLLC, we can effectively reduce its size without sacrificing performance. For this purpose, we classify the reuse of cache blocks into temporal and spatial reuse and further analyze it using reuse interval and reuse count. From our experiments, we found that most spatially reused cache blocks are reused only once, with a short reuse interval, so it is inefficient to manage them in the SLLC. In this paper, we propose a small additional cache for the SLLC, called the Filter Cache, which can not only check for temporal reuse but also prevent spatially reused blocks from entering the SLLC. Thus, we do not maintain data for non-reused blocks or spatially reused blocks in the SLLC, dramatically reducing its size. Through detailed simulation on the PARSEC benchmarks, we show that our new SLLC design with the Filter Cache exhibits performance comparable to the conventional SLLC with only 24.21% of the SLLC area across a variety of workloads. This is achieved by its faster access and the high reuse rates in the small SLLC with the Filter Cache.

Keywords Shared last-level cache · Reuse rate · Temporal reuse · Spatial reuse · Multicore CPU · Cache organization
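The reuse analysis described in the abstract can be illustrated with a small trace-driven sketch. The definitions below are assumptions for illustration only, not the paper's exact methodology: a block is counted as temporally reused when the same word is accessed again, and spatially reused when a different word of the same block is accessed; the reuse interval is measured in accesses since the block's previous access. The block size of 64 bytes and the function name `classify_reuse` are likewise hypothetical.

```python
# Hypothetical sketch of per-block reuse classification from an address trace.
# Assumed definitions (not taken verbatim from the paper):
#   - temporal reuse: the same word of a block is accessed again
#   - spatial reuse:  a different word of the same block is accessed
#   - reuse interval: number of accesses since the block's previous access

BLOCK_SIZE = 64  # bytes per cache block (assumed)

def classify_reuse(trace):
    """trace: iterable of byte addresses.
    Returns {block_number: {"reuse_count", "temporal", "spatial", "intervals"}}."""
    stats = {}
    last_access = {}  # block -> (trace index, word offset) of its last access
    for i, addr in enumerate(trace):
        block = addr // BLOCK_SIZE
        word = addr % BLOCK_SIZE
        if block in last_access:
            prev_i, prev_word = last_access[block]
            s = stats[block]
            s["reuse_count"] += 1
            s["intervals"].append(i - prev_i)  # reuse interval in accesses
            if word == prev_word:
                s["temporal"] += 1
            else:
                s["spatial"] += 1
        else:
            # First touch: the block is not (yet) reused.
            stats[block] = {"reuse_count": 0, "temporal": 0,
                            "spatial": 0, "intervals": []}
        last_access[block] = (i, word)
    return stats
```

A block whose `reuse_count` stays at zero corresponds to the "useless" blocks the paper proposes to filter out of the SLLC; blocks with a single spatial reuse at a short interval correspond to the category the Filter Cache is meant to intercept.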
1 School of Electrical Engineering, Korea University, Seoul, Korea

Table 1 The die areas and the access latencies of SLLCs on recent commercial multicore CPU chips

Core                          | Area | Latency
Intel i7 (3.4 GHz) [1]        | 37%  | 36-cycle
AMD Ryzen 7 (4.0 GHz) [2]     | 29%  | 39-cycle
Intel Itanium 2 (1.3 GHz) [3] | 51%  | 12-cycle
SUN SPARC M7 (4.13 GHz) [4]   | 53%  | 28-cycle
IBM POWER 8 (3.7 GHz) [5]     | 33%  | 27-cycle

Fig. 1 SLLC reuse rates in PARSEC benchmarks

1 Introduction

To bridge the speed gap between the off-chip memory and the private caches on multicore CPU chips, on-chip shared last-level caches (SLLCs) are expected to grow in size. However, increasing the size of the SLLC has the disadvantage of increased area and latency. These SLLCs are usually organized as multiple independent banks to reduce access time, yet the latency is still expected to grow due to the wire delay caused by the increasing number of sub-banks. In addition, on-chip SLLCs already occupy a significant portion of the CPU chip die area, as shown in Table 1. For example, in the 22-nm Haswell version of the 3.4 GHz Intel i7 chip, the SLLC occupies 37% of the total chip area and its latency is 36 cycles. Table 1 shows the die area and the access latency in CPU cycles for recent commercial multicore CPUs. Despite their huge size, SLLCs are not efficiently managed