Minsik Kim1, Yunho Oh2, Won Woo Ro1
1 (Department of Electrical and Electronic Engineering, Yonsei University, Seoul 03722, Republic of Korea)
2 (School of Electrical Engineering, Korea University, Seoul 02841, Republic of Korea)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Index Terms
GPU, register file, register compression, entropy.
I. INTRODUCTION
The number of registers in a Graphics Processing Unit (GPU) has grown continuously
over the last decade. One driving force behind this growth is that while the number
of cores must increase for better performance, the number of registers per core
must be maintained to provide sufficient thread parallelism for hiding memory latency;
sustaining this is generally considered challenging even with better process technology
in the near future. Another aspect is that as GPUs cover broader areas from graphics
to artificial intelligence, applications will perform more complex work per thread,
and thus will likely demand more registers.
GPU register data should remain resident on chip as much as possible, both to enable
fast context switching of threads and to feed massively parallel execution units.
As a result, GPU register files consume substantial energy as well as on-chip storage.
For instance, an NVIDIA GV100 GPU has a 256 KB register file per Streaming Multiprocessor
(SM) and contains 80 SMs, which corresponds to 20 MB of registers [1]. According to previous studies, the register file consumes about 15 percent of SM
energy [2]. Deep learning workloads, which are widely run on recent GPUs, show similar
energy consumption tendencies [3]: the energy consumption of register files for various deep learning algorithms is
between 15 and 20 percent of SM energy.
A recent idea to improve both the energy and storage efficiency of the GPU register
file is to compress register data [4,5]. Unique aspects of GPUs make register compression a feasible option
beyond widely adopted memory traffic compression [1,6]. A typical GPU executes a group of 32 threads on a vector execution unit. The grouped
threads are collectively called a warp. In the vector unit, each scalar lane runs
a thread. A vector register used in the vector unit is called a warp register. Previous
studies have observed that the thread registers within a warp register have strong
value similarity with each other [5,7]. Because of this value similarity, simple but efficient data compression algorithms
such as base-delta-immediate (BDI) [5] are expected to be quite effective for compressing a warp register as a whole. Furthermore,
GPUs' great latency tolerance allows additional compression stages in the register
access path while minimizing the performance overhead of increased pipeline depth.
Compressed registers consume less energy for data transfer, and also require
fewer bits to be stored in the register file, which provides power gating opportunities
for register banks.
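To make the warp-level value similarity concrete, below is a minimal Python sketch of
the base-delta idea for a single warp register, assuming 32 threads and 32-bit values.
The function, the fixed 1-byte delta width, and the example values are illustrative
assumptions, not the exact encoding of [5], which supports multiple delta widths and
additional modes.

    def bdi_compress_size(warp_reg, delta_bytes=1):
        # Try to encode 32 thread values as one 4-byte base plus small deltas.
        base = warp_reg[0]
        deltas = [v - base for v in warp_reg]
        limit = 1 << (8 * delta_bytes - 1)  # signed range of a delta
        if all(-limit <= d < limit for d in deltas):
            return 4 + delta_bytes * len(warp_reg)  # compressed size in bytes
        return 4 * len(warp_reg)  # incompressible: store raw

    # Thread registers holding consecutive addresses compress well:
    warp = [0x10000000 + 4 * lane for lane in range(32)]
    print(bdi_compress_size(warp))  # 36 bytes instead of 128, about 3.6x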
Although previous studies have explored register compression to some degree,
a limit study is required to investigate its potential gain, given that the compression
ratio varies significantly with the choice of algorithm. As an indicator of
how efficient a register compression algorithm is, this paper proposes an analytical
model to calculate the ideal compression ratio, which is the mathematical upper
bound according to entropy theory. Our evaluation shows that the theoretical maximum
compression ratio is 3.99x, significantly higher than the 2.33x achieved by a previously
proposed BDI-based register compression algorithm [5].
II. RELATED WORK
Previous papers have proposed several data compression techniques. Caches and memory
systems have been known to be good locations for data compression for many reasons, such
as energy reduction [8], throughput improvement [9], or better cache or memory efficiency [10]. Several papers leverage compression to improve efficiency or performance
in the context of GPUs. For instance, Pekhimenko et al. proposed a toggle-aware compression
scheme to improve memory bandwidth efficiency in GPUs [11]. Lal et al. utilized Huffman coding for GPU memory compression [12].
Lee et al. [5] proposed compressing register data to improve GPU energy efficiency. The
proposed compression uses a variant of base-delta-immediate (BDI) compression, which
was originally designed for cache compression [13]. GPU register compression is motivated by the similarity between threads within a warp,
which leads to the observation that BDI-like algorithms are well suited to exploiting value
similarity across thread lanes. Compressing registers can be leveraged to reduce power
or to improve energy efficiency by reducing the number of register banks needed to store
or access a warp register, thus saving dynamic energy for register fetching as well
as static energy when combined with power gating of unused register banks.
There have also been studies using STT-MRAM to improve energy efficiency. Li et al.
[14] proposed an STT-MRAM register file architecture with an SRAM buffer. Jeon et al. [15] proposed an STT-MRAM register file with SRAM organized as a cache. GPU register
file implementations using STT-MRAM can be combined with other data compression techniques,
allowing further energy efficiency improvements.
III. ANALYSIS OF REGISTER DATA COMPRESSION ON GPUS
1. Lossless Compression
According to Shannon's source coding theorem [16], the compression ratio of lossless data compression is closely related to the entropy
of the data, which can be derived from how likely each source symbol is to appear. The
theorem states that for a vector consisting of $N$ elements, where each element is drawn from the
source symbol set $E$, if the occurrences of source symbols in the
vector are independent and identically distributed and the entropy of the information
the vector carries is $H(X)$ bits, then the lower bound on the data size when the vector
is losslessly compressed is $N \cdot H(X)$ bits as $N \to \infty$.
Fig. 1 is an example illustrating how this principle can be applied to obtain an ideal
compression ratio for register accesses. The first row shows an access stream consisting
of seven 2-bit thread register accesses; thus, 14 bits are needed to store the entire
trace without compression. The second row shows that only 13 bits are needed when
the trace is compressed with a hypothetical algorithm that maps the source
symbols $E = \{0$, $1$, $2$, $3\}$ to the code words $c(X) = \{$``0'': 110, ``1'':
0, ``2'': 10, ``3'': 111$\}$. The lower bound of a code-word-based compression can
be obtained from the results of entropy theory. Assuming the entropy $H(X)$ is given
as prior knowledge, the upper bound of the compression ratio, $C(X)$, can be derived
from $E$ and $H(X)$ as follows:

$$C(X) = \frac{\log_2 \left| E \right|}{H(X)} \quad\quad (1)$$

where $\left| E \right|$ represents the number of elements in the set $E$. The last
row shows that the lower bound of the total code word length, $N \cdot H(X)$, is 12.897 bits
for the given $H(X)$ shown in Fig. 1, which corresponds to a compression ratio of $14/12.897 = 1.086$.
Fig. 1. Example of the lower bound of the code word length.
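The numbers in Fig. 1 can be verified with a short Python sketch. The exact ordering of
the seven symbols below is an assumption; any stream with one ``0'', three ``1''s, two
``2''s, and one ``3'' reproduces the 13-bit encoded length and the 12.897-bit entropy bound.

    import math
    from collections import Counter

    stream = [1, 0, 2, 1, 3, 2, 1]  # hypothetical ordering of the Fig. 1 symbols
    code = {0: "110", 1: "0", 2: "10", 3: "111"}

    raw_bits = 2 * len(stream)                      # 14 bits, uncompressed
    coded_bits = sum(len(code[s]) for s in stream)  # 13 bits with the code words

    n = len(stream)
    counts = Counter(stream)
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    print(coded_bits, n * entropy)   # 13, 12.897... (Shannon lower bound)
    print(raw_bits / (n * entropy))  # 1.086, the ideal compression ratio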
2. Entropy Model
The key to our model is how to obtain $H(X)$ for register values. We first need assumptions
about the compression granularity. The choice in this paper is to view each thread
register value as a symbol. The insight behind this decision is that there is structural
value similarity between thread registers within a warp register [5]; our selection can thus capture such similarity as well as the benefits of optimal
entropy coding. Further, we assume each thread register is 32 bits wide, which is quite
a common choice as most GPUs are optimized for 32-bit operations. Regardless of these
choices, we believe the presented methodology is general enough to work with other
parameters.
Assume that $E$ is the set of all possible register values, thus $E=\{0$, $1$, $2$,
$\dots$, $2^{32}-1 \}$ for a 32-bit wide register. We also define the sample space
$\Omega$, which is a trace of register values observed during program execution. $\Omega$
can be constructed by collecting the register values written back to the register
file while running applications on a simulator.
A discrete random variable for a register value observation event $X$ can be defined
as $X :$ $\Omega \to E$. Since register values are discrete by nature, we can
define the probability mass function (PMF) $f_X(x)$ of $X$ as

$$f_X(x) = P(X = x), \quad x \in E \quad\quad (2)$$

$f_X(x)$ can be obtained by counting the number of occurrences of each register value
$e \in E$ in $\Omega$. To derive $H(X)$, we first define the information content $I(x)$
as follows:

$$I(x) = -\log_2 f_X(x) \quad\quad (3)$$

$I(x)$ is the amount of information, in bits, conveyed when a particular register value $x$
is observed. Finally, the entropy $H(X)$ can be calculated from $I(x)$ as follows:

$$H(X) = \sum_{x \in E} f_X(x) \, I(x) = -\sum_{x \in E} f_X(x) \log_2 f_X(x) \quad\quad (4)$$
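As a concrete illustration of Eqs. (2)-(4), the sketch below computes the empirical
entropy of a register value trace and the resulting upper-bound compression ratio from
Eq. (1). The function names and the flat-list trace format are our assumptions about
how $\Omega$ is stored.

    import math
    from collections import Counter

    def register_entropy(trace):
        # trace is Omega: register values written back during simulation.
        # f_X(x) = count(x)/|Omega|; I(x) = -log2 f_X(x); H(X) = sum f_X(x)*I(x)
        counts = Counter(trace)
        n = len(trace)
        return sum((c / n) * -math.log2(c / n) for c in counts.values())

    def ideal_compression_ratio(trace, width=32):
        # Upper bound C(X) = log2|E| / H(X) = width / H(X), per Eq. (1).
        h = register_entropy(trace)
        return width / h if h > 0 else float("inf")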
3. Optimizing for Application and Data Types
Eqs. (1) and (4) imply that the ideal compression efficiency is determined by $f_X(x)$: as the
distribution of $f_X(x)$ gets closer to a uniform distribution, the entropy $H(X)$
increases and the maximum achievable compression ratio $C(X)$ approaches 1.0.
At the extreme, if $f_X(x)$ is a discrete uniform distribution, all register values
occur equally likely. In this case, the entropy is calculated as follows:

$$H(X) = -\sum_{x \in E} \frac{1}{2^{32}} \log_2 \frac{1}{2^{32}} = 32 \text{ bits} \quad\quad (5)$$

This is identical to the register width; thus no benefit is expected even with a hypothetical
algorithm using ideal entropy coding. In contrast, if $f_X(x)$ is a binomial distribution
with $p = 0.5$ over the $2^{32}$ possible values, the entropy $H(X)$ becomes

$$H(X) \approx \frac{1}{2} \log_2 \left( 2 \pi e \cdot 2^{32} \cdot \frac{1}{4} \right) = 17.047 \text{ bits} \quad\quad (6)$$

In this case, the ideal compression ratio is $32/17.047 = 1.87$. Because the entropy $H(X)$
varies with $f_X(x)$, a different $f_X(x)$ is needed for each
application to better model the ideal performance, as $f_X(x)$ varies with
application characteristics. Our model uses a different $f_X(x)$ for each application
by collecting register access traces separately for each benchmark.
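Both closed-form cases can be checked numerically. The sketch below assumes the
17.047-bit figure comes from the standard Gaussian approximation of a binomial
distribution, whose entropy is $\frac{1}{2}\log_2(2\pi e \, np(1-p))$.

    import math

    n = 2 ** 32
    h_uniform = math.log2(n)  # Eq. (5): every value equally likely, 32 bits

    # Eq. (6): binomial with p = 0.5 approximated by a Gaussian with
    # variance n * p * (1 - p).
    variance = n * 0.25
    h_binomial = 0.5 * math.log2(2 * math.pi * math.e * variance)

    print(h_uniform, 32 / h_uniform)    # 32.0, ratio 1.0 (no headroom)
    print(h_binomial, 32 / h_binomial)  # ~17.047, ratio ~1.87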
For similar reasons, our model can be further improved by accounting for data types,
leveraging the fact that $f_X(x)$ also varies with a register's data type.
Fig. 2 shows the portion of register values for each of four data types: signed
integer (S32), unsigned integer (U32), floating point (F32), and untyped bits (B32).
Registers with different data types will have different value patterns; thus, optimizing
the compression algorithm for each data type will likely lead to a higher compression
ratio. This result also shows that the most frequent data type differs across
applications, which further supports the argument that the optimal algorithm has
to consider the variation of $f_X(x)$ across applications.
Our model uses four conditional entropy values for the four data types, $H(X\mid S)$,
$H(X\mid U)$, $H(X\mid F)$, and $H(X\mid B)$, where the type $T \in \{ S$, $U$, $F$, $B \}$
denotes S32, U32, F32, and B32, respectively. The aggregated entropy
$H(X)$ can be derived from the conditional entropy values and the portion of register
accesses of each data type. The derivation is as follows:

$$H(X) = \sum_{t \in T} P(T = t) \, H(X \mid T = t) \quad\quad (7)$$
Fig. 2. Ratio of register value data types (a) Fermi configuration, (b) Volta configuration.
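The aggregation in Eq. (7) can be sketched as follows, assuming the trace is annotated
with the data type of each written-back value (the (type, value) pair format is
hypothetical):

    import math
    from collections import Counter, defaultdict

    def typed_entropy(trace):
        # trace: list of (dtype, value) pairs, dtype in {"S32","U32","F32","B32"}
        by_type = defaultdict(list)
        for dtype, value in trace:
            by_type[dtype].append(value)
        total = len(trace)
        h = 0.0
        for values in by_type.values():
            n = len(values)
            counts = Counter(values)
            h_cond = sum((c / n) * -math.log2(c / n) for c in counts.values())
            h += (n / total) * h_cond  # P(T = t) * H(X | T = t), per Eq. (7)
        return h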
4. Latency Model
The program characteristic can be classified to compute-bounded or memory-bounded
by analyzing the average stalled cycle of SM [2]. Under the phase of compute-bounded workloads, the core's average stalled cycle decreases,
since execution units of SM are fully utilized. In this section, capability of SM
directly influences the overall performance. On the contrary, SM's average stalled
cycle increases when GPU encounters memory-bounded phase of the application. Because
of the latency of the memory access or preparing the data for computation, the pipeline
should wait until the access or preparation is finished.
Fig. 3 shows the ratio of average stall cycles during the heartwall application from
the Rodinia benchmark suite [17]. As shown in the figure, the stall cycle ratio can
be divided into two types of phases, highly stalled phases and nearly non-stalled phases.
These two phases alternate during the execution of the application. To
classify whether an application's characteristics are compute-bounded or memory-bounded,
we use the measured stall cycle ratios. When the average compute intensity
is larger than the memory intensity, we classify the phase as compute-bounded.
Register compression introduces additional latency when fetching values from registers
in compute-bounded phases. On the other hand, memory work is the bottleneck of
memory-bounded phases; therefore, memory-bounded phases are not influenced by register
compression. Eq. (8) shows the relative performance of the GPU when Amdahl's law
[18] is applied to the compute-bounded and memory-bounded phases:

$$P_{relative} = \frac{1}{(1 - p) + p \cdot (1 + l)} \quad\quad (8)$$

where $p$ is the portion of the compute-bounded phase ($0 \le p\le 1$) and $l$ is the
additional latency of register compression, expressed as a relative slowdown of the
compute-bounded phase.
Fig. 3. Average stall cycle.
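A minimal sketch of the latency model in Eq. (8); interpreting $l$ as a fractional
slowdown applied only to the compute-bounded portion is our reading of the model:

    def relative_performance(p, l):
        # Eq. (8): only the compute-bounded fraction p is slowed by the extra
        # register access latency l; the memory-bounded fraction is unaffected.
        return 1.0 / ((1.0 - p) + p * (1.0 + l))

    # e.g. 60% compute-bounded with 5% extra latency -> ~3% overall slowdown
    print(relative_performance(p=0.6, l=0.05))  # ~0.971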
IV. EVALUATION
1. Methodology
We used 16 GPU applications selected from the Rodinia [17], Parboil [19], and GPGPU-Sim [20] benchmark suites. We captured register access traces from GPGPU-Sim [20] and ran a post-simulation on the data. The post-simulation includes our compression
model as well as an implementation of BDI-based register compression [5]. Our simulation approximates the NVIDIA Fermi architecture [21] and Volta architecture [1] with the microarchitectural parameters listed in Table 1.
Table 1. GPU microarchitectural parameters.
Configuration | Fermi | Volta
Clock Frequency | 1.4 GHz | 1.2 GHz
SMs / GPU | 15 | 80
Warp Scheduling Policy | Greedy-Then-Oldest | Greedy-Then-Oldest
SIMT Lane Width | 32 | 32
Max # Warps / SM | 48 | 64
Max # Threads / SM | 1536 | 2048
Register File Size | 128 KB | 256 KB
# of 32-bit Registers / SM | 32768 | 65536
# of Register Banks | 32 | 32
2. Entropy Values and Upper Bound of Compression Ratio
Fig. 4 shows the entropy values of the 32-bit thread register values measured from the 16 applications.
The $y$-axis shows the entropy value in bits. We first measured the entropy values for
each of the four data types explained in Section III.3. The modeled upper bound
is derived from Eqs. (1) and (7) together with the collected per-type entropy values and $f_X(x)$ data.
The entropy can be understood as the number of bits required to store a register.
The entropy value averaged globally over all data types is 9.45 bits on the Fermi configuration
and 9.08 bits on the Volta configuration, and there is significant application variation,
ranging from minima of 5.22 bits (LPS) and 5.40 bits (hot) to maxima of 19.59 bits (sgemm)
and 17.14 bits (sgemm), respectively. The entropy values of the integer data type and the
untyped bit type, which are around 5 bits, are about half that of the floating point data type. The average
entropy value for each data type is 5.30 bits, 8.76 bits, 4.59 bits, and 11.09 bits
on the Fermi configuration and 7.32 bits, 5.31 bits, 5.16 bits, and 14.6 bits on the Volta
configuration for S32, U32, B32, and F32, respectively. Because a typical use case
of 32-bit floating point data is the processing of rational or large numbers,
the entropy value of F32 is relatively larger than those of the
other data types, which are typically used for control or for indexing data structures and
are thus expected to show more regular value patterns. However, the data also shows that this
trend is difficult to generalize; for instance, in LPS the integer type registers have higher
entropy than the float registers. Fig. 5 shows the upper bound of compression efficiency
for each application, derived from Eq. (1). The global average is 3.39x on Fermi and 3.52x
on Volta. This bound is inversely proportional to the entropy values in Fig. 4; for instance,
the minimum is 1.63x (sgemm) and 1.87x (sgemm) and the maximum is 6.13x (LPS) and 5.9x (back),
respectively.
Fig. 4. Measured entropy value of the GPU register (a) Fermi configuration, (b) Volta
configuration.
Fig. 5. Upper bound for the register value compression (a) Fermi configuration, (b)
Volta configuration.
3. Compression Ratio with Existing Technique
Fig. 6 compares the compression ratios for the integer type registers of the evaluated applications.
The GPU microarchitectural parameters from the BDI compression study [5] were used. The bar labeled BDI shows the compression ratio when the BDI compression algorithm
is applied to the register file [5]. With all benchmarks combined, the calculated ideal-case compression ratio is $3.99\times$,
which is significantly higher than the $2.33\times$ of the compared BDI register compression.
We noticed that LIB is the most outstanding case, where the BDI efficiency is close to the ideal
case. On a closer look at the actual register values, we observed that the majority of
register values are zero, which is the case where BDI compression achieves its
highest efficiency, close to ideal. In most other applications, the efficiency
of BDI is from 21% to 57% of the upper bound from our ideal
compression model. A hypothetical algorithm may therefore perform 2% to 623% better
than the BDI compression, which implies headroom for further improvement.
Fig. 6. Compression ratio comparison for the integer data type with existing GPU register
compression method.
4. Energy Efficiency
Fig. 7 shows the dynamic energy of the individual hardware components for the LIB application,
obtained from GPUWattch [2]. The GPU microarchitectural parameters of the NVIDIA Fermi architecture [21] were used. Dynamic energy was measured for each kernel within the application. The
percentage of energy consumed by the register file under the ideal compression ratio in
the LIB application was measured to be 1.27% and 0.9%, respectively. Since the register
file consumes about 15 percent of SM energy, the energy efficiency improvement from
compression cannot exceed the energy consumed by the register file.
Fig. 7. Breakdown of average dynamic energy consumption.
V. CONCLUSIONS
This paper proposed an entropy-based compression model to study the theoretical limits
of GPU register compression algorithms. We provided a method to obtain the entropy
function from empirically profiled register value traces. The model is further
refined to account for application and data type variations. Experimental results
show that an existing algorithm still underperforms compared with our model.
A hypothetical algorithm may achieve a higher compression ratio, which would lead to
more energy efficient GPUs.
ACKNOWLEDGMENTS
This work has been supported by the Global Education and Research Center for Convergence
Technology, a Brain Korea 21 FOUR program, Yonsei University.
References
[1] NVIDIA, NVIDIA Tesla V100 GPU Architecture, 2017.

[2] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, ``GPUWattch: Enabling energy optimizations in GPGPUs,'' ACM SIGARCH Computer Architecture News, vol. 41, no. 3, pp. 487-498, 2013.

[3] V. Kandiah, S. Peverelle, M. Khairy, J. Pan, A. Manjunath, T. G. Rogers, T. M. Aamodt, and N. Hardavellas, ``AccelWattch: A power modeling framework for modern GPUs,'' Proc. of the 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-54), pp. 738-753, 2021.

[4] N. Vijaykumar, G. Pekhimenko, A. Jog, A. Bhowmick, R. Ausavarungnirun, C. Das, M. Kandemir, T. C. Mowry, and O. Mutlu, ``A case for core-assisted bottleneck acceleration in GPUs: Enabling flexible data compression with assist warps,'' Proc. of the 42nd Annual International Symposium on Computer Architecture, pp. 41-53, 2015.

[5] S. Lee, K. Kim, G. Koo, H. Jeon, W. W. Ro, and M. Annavaram, ``Warped-compression: Enabling power efficient GPUs through register compression,'' Proc. of the 42nd Annual International Symposium on Computer Architecture, pp. 502-514, 2015.

[6] AMD, The Polaris Architecture, 2016.

[7] D. Wong, N. S. Kim, and M. Annavaram, ``Approximating warps with intra-warp operand value similarity,'' Proc. of 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016.

[8] S. Sardashti and D. A. Wood, ``Decoupled compressed cache: Exploiting spatial locality for energy-optimized compressed caching,'' Proc. of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 62-73, 2013.

[9] T. M. Nguyen and D. Wentzlaff, ``MORC: A manycore-oriented compressed cache,'' Proc. of the 48th International Symposium on Microarchitecture, pp. 76-88, 2015.

[10] S. Hong, B. Abali, A. Buyuktosunoglu, M. B. Healy, and P. J. Nair, ``Touché: Towards ideal and efficient cache compression by mitigating tag area overheads,'' Proc. of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 453-465, 2019.

[11] G. Pekhimenko, E. Bolotin, N. Vijaykumar, O. Mutlu, T. C. Mowry, and S. W. Keckler, ``A case for toggle-aware compression for GPU systems,'' Proc. of 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016.

[12] S. Lal, J. Lucas, and B. Juurlink, ``E$^2$MC: Entropy encoding based memory compression for GPUs,'' Proc. of 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2017.

[13] G. Pekhimenko, V. Seshadri, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, ``Base-delta-immediate compression: Practical data compression for on-chip caches,'' Proc. of the 21st International Conference on Parallel Architectures and Compilation Techniques, pp. 377-388, 2012.

[14] G. Li, X. Chen, G. Sun, H. Hoffmann, Y. Liu, Y. Wang, and H. Yang, ``A STT-RAM-based low-power hybrid register file for GPGPUs,'' Proc. of the 52nd Annual Design Automation Conference, pp. 1-6, 2015.

[15] W. Jeon, J. H. Park, Y. Kim, G. Koo, and W. W. Ro, ``Hi-End: Hierarchical, endurance-aware STT-MRAM-based register file for energy-efficient GPUs,'' IEEE Access, vol. 8, pp. 127768-127780, 2020.

[16] C. E. Shannon, ``A mathematical theory of communication,'' Bell System Technical Journal, vol. 27, no. 3, pp. 379-423, July 1948.

[17] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, ``Rodinia: A benchmark suite for heterogeneous computing,'' Proc. of 2009 IEEE International Symposium on Workload Characterization (IISWC), 2009.

[18] G. M. Amdahl, ``Validity of the single-processor approach to achieving large scale computing capabilities,'' Proc. of the Spring Joint Computer Conference, pp. 483-485, 1967.

[19] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-m. W. Hwu, ``Parboil: A revised benchmark suite for scientific and commercial throughput computing,'' Center for Reliable and High-Performance Computing, vol. 127, 2012.

[20] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, ``Analyzing CUDA workloads using a detailed GPU simulator,'' Proc. of 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009.

[21] NVIDIA, NVIDIA's Fermi: The First Complete GPU Computing Architecture, 2009.

Minsik Kim received his B.S. degree in electrical and electronic engineering from
Yonsei University, Seoul, South Korea, in 2013, and his M.S. and Ph.D. degrees in
electrical and electronic engineering from Yonsei University in 2019 through the
integrated M.S.-Ph.D. program. He was a Senior Researcher with the Supercomputing
Infrastructure Center, Korea Institute of Science and Technology Information, Daejeon,
South Korea. He is currently working with the Global Education and Research Center for
Convergence Technology, Yonsei University, as a Research Professor. His research interests
include GPU microarchitectures, high performance computing, and quantum computing.
Yunho Oh received his B.S., M.S., and Ph.D. degrees from the School of Electrical
and Electronic Engineering, Yonsei University, Seoul, South Korea, in 2009, 2011,
and 2018, respectively. From 2011 to 2014, he was a Software Engineer in the mobile
communications business with Samsung Electronics. From 2019 to 2021, he was a Postdoctoral
Researcher with the Parallel Systems Architecture Laboratory (PARSA), EPFL, Switzerland.
He is currently an Assistant Professor with the School of Electrical Engineering,
Korea University. Prior to joining Korea University, he was an Assistant Professor
with Sungkyunkwan University. His research interests include hardware and software
architectures for energy-efficient data centers, processor architectures (CPUs, GPUs,
and neural network accelerators), in-storage processing, memory systems, and high-performance
computing.
Won Woo Ro received his B.S. degree in electrical engineering from Yonsei University,
Seoul, Korea, in 1996, and his M.S. and Ph.D. degrees in electrical engineering from
the University of Southern California, in 1999 and 2004, respectively. He worked as
a research scientist with the Electrical Engineering and Computer Science Department,
University of California, Irvine. He currently works as a professor with the School
of Electrical and Electronic Engineering, Yonsei University. Prior to joining Yonsei
University, he worked as an assistant professor with the Department of Electrical
and Computer Engineering, California State University, Northridge. His industry experience
includes a college internship with Apple Computer, Inc. and a contract software engineer
with ARM, Inc. His current research interests include high-performance microprocessor
design, GPU microarchitectures, neural network accelerators, and memory hierarchy
design. He is a senior member of the IEEE.