Park Jaehyun¹, Shin Donghwa², Lee Hyung Gyu³,*
¹ School of Electrical Engineering, University of Ulsan, Ulsan 44610, Korea
² Department of Smart Systems Software, Soongsil University, Seoul 06978, Korea
³ School of Computer and Communication Engineering / Computer & Communication Research Center, Daegu University, Gyeongsangbukdo 38453, Korea
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Index Terms
Phase change memory (PCM), LPDDR2-NVM, row buffer, prefetch, memory interface
I. INTRODUCTION
Memory and storage have emerged as critical issues in accommodating the explosive data generation from innovative services such as the Internet of Things (IoT), social network services (SNS), and private internet broadcasting [1,2]. While the main challenge for traditional storage systems has been enlarging capacity, the above-mentioned applications also require fine-grained data management in which latency and random accessibility are equally important. To meet these requirements, many new storage and memory device technologies have been developed. Among them, phase change memory (PCM) has been actively researched as a way to achieve high performance and large capacity in memory/storage devices simultaneously. It is expected to replace conventional DRAM devices owing to its ability to scale deeply into the low-nanometer regime and its low power consumption with non-volatility [3]. However, there are two major drawbacks in adopting PCM technology in conventional memory/storage architectures: poor write performance and limited long-term endurance. Various architectural techniques have been proposed to mitigate these drawbacks while maximally exploiting the benefits of PCM [4].
Most previous approaches have tried to reduce the number of memory accesses or to optimize the internal architecture of the PCM cell array under the assumption that the internal architecture and the interface to the memory controller of PCM are very similar to those of DRAM [5,6]. Although the proposed techniques have significantly improved the performance and energy consumption of PCM memory systems, most of them have paid little attention to, or overlooked, the characteristics of industry-announced PCM devices. The internal architectures and interfaces of PCM devices considered in industry are significantly different from those of conventional DRAMs. For example, the LPDDR2-NVM standard interface has been announced to accommodate the distinct characteristics of newly announced non-volatile memory devices including PCM [7], and major PCM manufacturers have announced PCM prototypes compatible with this LPDDR2-NVM standard interface [8,9]. Although this standard interface inherits many features from the conventional double data rate (DDR) interfaces of DRAMs, it also includes many distinctive features such as a three-phase addressing mechanism, different row buffer and bank architectures, and asymmetric read and write operations through an overlay window. Detailed information about this standard interface is provided in Section II.
Among the several distinctive features of LPDDR2-NVM-compliant PCMs, the row buffer architecture has been significantly revised from the conventional DRAM architecture in terms of the number of row buffers, the unit size of a single row buffer, and the buffer management policy. The LPDDR2-NVM interface defines 4 or 8 pairs of row buffers, where each pair consists of a row address buffer (RAB) and a row data buffer (RDB). These multiple row buffers can be arbitrarily selected by the memory controller regardless of the physical memory address, because the row buffers are not tightly coupled with the physical memory address in the LPDDR2-NVM interface. In addition, the unit size of a single RDB (typically 32 bytes) is much smaller than the unit size of a single row buffer in DRAM. All the differences mentioned above indicate that we have more flexibility in controlling the PCM's row buffers, and a sophisticated mechanism for controlling these row buffers is desirable to maximize the performance of PCM-based memory systems.
In this paper, we investigate how the performance of the memory system is affected by the row buffer architecture and its management policy, targeting LPDDR2-NVM-compatible devices. To this end, we devise a proactive row buffer architecture to enhance the performance of PCM memory systems. The proposed scheme efficiently traces real-time memory access characteristics and adaptively controls the number of prefetched rows accordingly. Our trace-driven simulations, using real workloads and practical timing parameters extracted from industry-announced PCM prototypes, demonstrate that the proposed LPDDR2-NVM-aware row buffer architecture improves the system-level memory performance and energy consumption by 12.2% and 0.3% on average, respectively, compared with the conventional row buffer architecture under the same cost (area) restriction.
The rest of this paper is organized as follows. Section II presents the background of this work, including a brief introduction to the LPDDR2-NVM industry standard interface and related work. Reconfiguration of the row buffer architecture and its motivational example are introduced in Section III. We then introduce the proactive row data buffer management in Section IV. Section V evaluates the proposed scheme, and Section VI concludes this work.
II. BACKGROUND
1. LPDDR2-NVM Interface
The LPDDR2-NVM interface includes many features that differ from the conventional DDR interface in order to support the distinct behaviors of non-volatile memory devices, including asymmetric read and write operations. The representative features, compared with the conventional DDR interface, are:
${-}$ Three-phase addressing mechanism to support large memory capacities (up to 32 Gb).
${-}$ No multi-bank architecture.
${-}$ Multiple RABs and RDBs which are arbitrarily selected by the memory controller
regardless of the physically accessed address.
${-}$ Smaller unit size of RDB (typically 32 bytes).
${-}$ Indirect write operations via overlay window.
${-}$ Dual operation, which enables a read operation in one partition while cell programming is performed in another partition.
Fig. 1 shows the internal structure of an LPDDR2-NVM-compatible PCM device and its interface. In LPDDR2-NVM, addresses and commands are transferred through the command/address (CA) pins, whereas conventional DRAMs have 12 to 16 dedicated pins for transferring the address and command separately. The LPDDR2-NVM specifies 10 CA pins, and they operate in a DDR fashion even in the address phase, as shown in Fig. 2. This means that the memory controller can transfer up to 20 bits of command and/or address data per memory clock cycle. In addition, a three-phase addressing mechanism is used to support larger memory devices than conventional DRAMs, which use a two-phase addressing mechanism. As shown in Fig. 2(a), three-phase addressing consists of the preactive, activate, and read/write phases. In the preactive phase, only the upper 3 to 12 bits of the row address are transferred, and this partial row address is stored in the designated RAB. In the activate phase, the remaining row address bits are transferred; the complete row address, obtained by combining them with the upper row address stored in the RAB, is delivered to the memory array. The corresponding row data is then transferred from the memory array to the designated RDB. Finally, the data is transferred from the RDB to the memory controller in the last phase. The numbers of upper and lower row address bits are determined by the device density and the unit size of the RDB, respectively.
Fig. 1. Functional block diagram of JEDEC LPDDR2-NVM standard-compatible PCM device.
Fig. 2. Addressing comparison of LPDDR2-NVM and the conventional DRAM (1Gb device
with 16-bit data width).
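To make the split between the preactive and activate phases concrete, the following minimal Python sketch shows how a controller could partition a row address and drive the three phases of a read. This is not the JEDEC-defined encoding; the bit widths and the issue() stub are illustrative assumptions.

ROW_ADDR_BITS = 17          # assumed total row address width
UPPER_ROW_BITS = 5          # assumed bits sent in the preactive phase (3 to 12 per the standard)

def split_row_address(row_addr):
    """Split a row address into an upper part (preactive) and a lower part (activate)."""
    lower_bits = ROW_ADDR_BITS - UPPER_ROW_BITS
    upper = row_addr >> lower_bits               # stored in the selected RAB
    lower = row_addr & ((1 << lower_bits) - 1)   # combined with the RAB contents later
    return upper, lower

def issue(cmd, **fields):
    """Stand-in for driving the CA bus; here we only log the command."""
    print(cmd, fields)

def three_phase_read(row_addr, buf_id):
    upper, lower = split_row_address(row_addr)
    issue("PREACTIVE", ba=buf_id, addr=upper)    # upper row bits -> RAB[buf_id]
    issue("ACTIVATE", ba=buf_id, addr=lower)     # full row (RAB + lower bits) -> array -> RDB[buf_id]
    issue("READ", ba=buf_id)                     # RDB[buf_id] -> memory controller

three_phase_read(0x1A2B3, buf_id=2)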
LPDDR2-NVM also supports multiple pairs (4 or 8) of row buffers, where each pair consists of an RAB and an RDB. Unlike the row buffers in traditional DRAMs, each row buffer can be arbitrarily selected by the memory controller. The BA signals, which are originally used to select a bank in conventional DRAM, are used to select a row buffer. Note that these BA signals are only intended to select a row buffer, not a physical bank of the memory array [7]. In each phase, the memory controller selects a proper RAB and/or RDB by controlling these BA signals, regardless of the physically accessed memory address.
PCM has a relatively long program latency because of its operating principle. This long program latency may also degrade read performance if a read request arrives during a program operation. Similar to the multi-bank architecture in traditional DRAM devices, the multi-partition architecture and parallel operation in LPDDR2-NVM alleviate this read performance degradation: data in one partition can be read while another partition is being programmed. However, parallel program operations are not allowed.
Another distinctive feature of LPDDR2-NVM is its support for asymmetric read and write operations. The read operation is very similar to that of conventional DRAMs except for the three-phase addressing and row buffer management. The write operation, strictly speaking non-volatile cell programming, is however completely different from that of conventional DRAMs. A write is performed indirectly through a special region of memory-mapped registers, called the overlay window, similar to NOR flash. A single write operation requires several overlay window accesses to complete the non-volatile cell programming. The overlay window is 4 KB in size and consists of several memory-mapped registers, such as a command address register, a command code register, a command execution register, and program buffers, which are used to properly control LPDDR2-NVM devices and write operations. Single-word overwrites, buffer overwrites, suspend, and other cell programming operations are supported through this overlay window.
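Since only the general sequence is described above, the following Python sketch illustrates one indirect buffered write. The register offsets, command code, and mem_write() stub are hypothetical placeholders, not values from JESD209-F; only the overall order of overlay window accesses follows the text.

OW_BASE = 0x00000000            # assumed base address the overlay window is mapped to
CMD_ADDR_REG = OW_BASE + 0x00   # hypothetical offsets within the 4 KB window
CMD_CODE_REG = OW_BASE + 0x08
CMD_EXEC_REG = OW_BASE + 0x10
PROG_BUFFER = OW_BASE + 0x800

BUFFER_PROGRAM = 0xE9           # hypothetical command code

def mem_write(addr, value):
    """Stand-in for a bus write issued by the memory controller."""
    print(f"WRITE [{addr:#010x}] <- {value:#x}")

def pcm_buffered_write(target_addr, data_words):
    """One logical write = several overlay window accesses."""
    for i, word in enumerate(data_words):
        mem_write(PROG_BUFFER + 4 * i, word)     # stage data in the program buffer
    mem_write(CMD_ADDR_REG, target_addr)         # where the data should be programmed
    mem_write(CMD_CODE_REG, BUFFER_PROGRAM)      # select the buffered-program command
    mem_write(CMD_EXEC_REG, 1)                   # trigger cell programming

pcm_buffered_write(0x12345600, [0xDEADBEEF, 0x0BADF00D])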
2. Related Work
Compared with the many system-level approaches to enhancing the performance and energy consumption of PCM memory systems, relatively little research has been conducted on optimizing the row buffer architecture and its management.
Lee et al. analyzed the row buffer architecture under the assumption that the baseline buffer architecture of PCM is similar to that of conventional DRAMs [5]. Instead of using a single large 2-KB buffer, they reorganized the row buffer into multiple small row buffers, which significantly reduces energy consumption by reducing the number of sense amplifiers. Performance is enhanced as well through this reorganization. However, their architecture does not consider the asymmetric read and write characteristics of LPDDR2-NVM, where write operations in industrial PCM devices are performed only through overlay window accesses.
Yoon et al. distinguished hot rows from cold rows so that hot rows are cached in DRAM [10]. This approach decreases the number of hot-row misses, which results in performance and energy improvements. However, their approach is a system-level technique that does not address the row buffer architecture itself or its optimization; it only exploits the locality information of the row buffers. Their analysis also assumes an internal buffer architecture similar to that of DRAM.
Li et al. considered the LPDDR2-NVM interface in their research; however, they only utilized the channel and bus model of LPDDR2-NVM to design a photonic-channel-based memory communication infrastructure for PCM [11]. Park et al. enhanced the performance of an LPDDR2-NVM memory system using an address-phase skipping technique, but this only omits the address phase passively, similar to the open-row policy in conventional DRAM [12]. The configuration of the row buffer architecture affects the performance of LPDDR2-NVM [13]. Hence, a row buffer prefetch technique has been proposed, but it does not consider write operations in LPDDR2-NVM [14].
III. RECONFIGURATION OF ROW BUFFER ARCHITECTURE
The LPDDR2-NVM standard provides more flexibility in designing and managing the row buffer architecture than the traditional LPDDR2 standard. For example, the memory controller can select a row buffer arbitrarily in LPDDR2-NVM, like a fully-associative cache, whereas the row buffer selection in conventional DRAM is fixed by the internal architecture of the DRAM, similar to a direct-mapped cache. This flexibility enables us to design various row buffer management schemes that consider the access patterns of applications. Differences in the management and configuration policies lead to different RDB hit ratios during read and write operations, which in turn causes performance variations of the memory system.
1. Motivational Example
In designing the row buffer architecture, determining the unit size of an RDB and the number of RDBs is as important as determining the total number of bytes dedicated to RDBs. Fig. 3 shows a motivational example of this work. We simply compare the number of RDB hits under three different configurations: (a) the largest-RDB configuration, (b) the highest-number-of-RDBs configuration, and (c) the adaptive RDB configuration. In the figure, a box with a thick solid line represents one physical RDB, which consists of one or more basic units (boxes with dotted lines). The size of one basic unit is equal to the size of one cacheline in the microprocessor. The largest-RDB configuration has only one physical RDB consisting of 4 basic units, while the highest-number-of-RDBs configuration has four physical RDBs, each of which is the size of one basic unit.
The example memory access pattern is presented at the top of the figure. The first half of the pattern is sequential, while the second half is random. A grayed box and a horizontally-lined box represent an RDB miss and an RDB hit, respectively. For a fair comparison, all configurations start with the same initial state: Cachelines 4, 5, 6, and 7 are stored in the RDBs.
In the largest-RDB configuration, the request for Cacheline 0 incurs an RDB miss at time $T0$. This RDB miss evicts all cachelines in the RDB, and then Cachelines 0 to 3 are fetched from the memory array, as shown in Fig. 3(a). Since the next three memory accesses are sequential, all three requests incur RDB hits. However, the remaining memory accesses from $T4$ to $T7$ again incur consecutive RDB misses because only one physical RDB is available. In total, 5 RDB misses and 3 RDB hits occur during the example memory accesses. In contrast, the highest-number-of-RDBs configuration handles the random memory access pattern efficiently because each RDB stores only one cacheline. However, this configuration performs poorly on sequential memory access patterns. The requests for Cachelines 0 to 3 continuously incur RDB misses from $T0$ to $T3$, as shown in Fig. 3(b). In total, we observe 6 RDB misses and 2 RDB hits.
Fig. 3. RDB hit ratio varying on row buffer architecture and memory access patterns.
Based on the observations above, both the largest-RDB configuration and the highest-number-of-RDBs configuration provide only limited capability for the given example memory access pattern. Each configuration has clear advantages but also clear disadvantages depending on the memory access pattern. The characteristics of memory access patterns vary across applications, and even within the same application they may vary over time. Thus, changing the RDB configuration dynamically, even within the same application, is desirable to increase the chance of RDB hits, as shown in Fig. 3(c). The number of RDB misses can be reduced by reconfiguring the row buffer from 3 RDBs into one RDB consisting of 4 basic units, and then filling it with four consecutive cachelines at $T0$. This turns the next three memory accesses into RDB hits. The adaptive RDB configuration again modifies its configuration at $T4$ into three RDBs: one RDB the size of two basic units and two RDBs the size of one basic unit each. This reconfiguration turns the remaining random memory accesses from $T5$ to $T7$ into RDB hits. In total, 2 RDB misses and 6 RDB hits occur in this adaptive configuration. The adaptive RDB reconfiguration clearly shows the best RDB hit ratio when information about the incoming memory access pattern is available. However, it is important to predict the characteristics of incoming memory accesses accurately, because adaptive RDB configuration with inaccurate predictions may incur even more RDB misses.
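As a small illustration of the counting above, the following sketch replays an assumed stand-in access pattern (the figure's exact random addresses are not given in the text) against the two fixed configurations under simple LRU replacement. It is only illustrative, not the simulator used in Section V.

from collections import OrderedDict

def count_hits(accesses, num_rdbs, lines_per_rdb):
    """Each RDB caches one aligned block of `lines_per_rdb` cachelines; LRU replacement."""
    rdbs = OrderedDict()                      # block tag -> None, ordered by recency
    hits = 0
    for line in accesses:
        tag = line // lines_per_rdb           # aligned block containing this cacheline
        if tag in rdbs:
            hits += 1
            rdbs.move_to_end(tag)
        else:
            if len(rdbs) >= num_rdbs:
                rdbs.popitem(last=False)      # evict the least recently used RDB
            rdbs[tag] = None
    return hits, len(accesses) - hits

pattern = [0, 1, 2, 3, 9, 1, 14, 3]           # sequential first half, then scattered accesses
print("largest RDB (1 x 4 lines):", count_hits(pattern, num_rdbs=1, lines_per_rdb=4))
print("most RDBs   (4 x 1 line) :", count_hits(pattern, num_rdbs=4, lines_per_rdb=1))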
IV. PROACTIVE ROW BUFFER CONTROL POLICY
A reconfigurable RDB architecture is an attractive way to increase RDB hits and, in turn, to enhance the performance of the memory system. By exploiting the LPDDR2-NVM's flexibility of selecting any RDB regardless of the requested physical memory address, we propose a proactive row buffer control method that enables dynamic reconfiguration of RDBs without requiring any hardware modification to the LPDDR2-NVM specification. The proposed method consists mainly of a row buffer prefetch technique and an overlay-window-aware RDB pinning technique.
1. Row Buffer Prefetch
In traditional cache memory architectures, prefetching has been used to maximize utilization of the limited cache capacity by proactively moving data from main memory to the cache before it is explicitly requested. Similarly, we propose a row buffer prefetch technique that moves data to a row buffer in advance while the memory device is idle. By doing this in an LPDDR2-NVM device, we realize the adaptive RDB architecture. Fig. 4 shows the key concept of the logical RDB reconfiguration architecture using row buffer prefetch. The row buffer consists of 4 physical RDBs. When an RDB miss occurs, one physical RDB is allocated to serve the request. In addition to this basic operation, we allocate more physical RDBs for prefetching consecutive row data if the next request is expected to be sequential. This prefetch operation implicitly allocates two RDBs for one memory request, which creates an effect similar to increasing the size of a single RDB. Depending on the number of RDBs used for prefetching, the logical (rather than physical) size of a single RDB can vary, and this is the basic principle of our proposed dynamic RDB reconfiguration. The number of RDBs used for prefetching is increased if the access pattern is expected to be strongly sequential, and decreased when a random access pattern is expected. To implement the operations described above, no additional control logic is required in the memory device; the memory controller simply issues a row buffer prefetch to the command queue when it predicts that the incoming memory access pattern will be sequential.
Fig. 4. Prefetch-based dynamic RDB reconfiguration.
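The controller-side behavior just described can be sketched as follows. The command queue, the predictor interface, and the victim choice are our own simplifications; the only point carried over from the text is that a prefetch enqueues the preactive and activate phases but no read/write phase.

from collections import deque

class AlwaysSequential:
    """Trivial stand-in predictor; Sections IV.2 and IV.3 describe the real ones."""
    def expects_sequential(self, row_addr):
        return True

class PrefetchingController:
    def __init__(self, predictor, num_rdbs=4):
        self.cmd_queue = deque()
        self.predictor = predictor
        self.num_rdbs = num_rdbs

    def on_read_request(self, row_addr, rdb_id):
        # Regular three-phase read for the demanded row.
        self.cmd_queue += [("PREACTIVE", rdb_id, row_addr),
                           ("ACTIVATE", rdb_id, row_addr),
                           ("READ", rdb_id, row_addr)]
        if self.predictor.expects_sequential(row_addr):
            pf_rdb = (rdb_id + 1) % self.num_rdbs        # assumed simple victim choice
            # Prefetch the next row: no READ phase, the data just waits in its RDB.
            self.cmd_queue += [("PREACTIVE", pf_rdb, row_addr + 1),
                               ("ACTIVATE", pf_rdb, row_addr + 1)]

ctrl = PrefetchingController(AlwaysSequential())
ctrl.on_read_request(row_addr=0x40, rdb_id=0)
print(list(ctrl.cmd_queue))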
Some commercial DRAM controllers offer a memory access reordering feature to increase the row buffer hit ratio. It changes the order of memory accesses in the memory controller using a reordering buffer and then returns the results to the processor in order. Reordering may increase the RDB hit ratio in LPDDR2-NVM as well, but we do not consider reordering because it is orthogonal to our proposed scheme. This paper focuses only on the relation between the characteristics of the memory access pattern and the RDB reconfiguration.
We give row buffer prefetch requests a higher priority than regular memory accesses from the processor when requests conflict, because a row buffer prefetch takes less time than a regular memory access: it skips the read/write phase (the last phase) of three-phase addressing. Although this may slightly and temporarily increase the response time of an incoming memory request, we found that the long-term benefit of this policy outweighs the temporary response time degradation.
Since the number of row buffers is very limited in most memory devices, the performance enhancement depends heavily on the accuracy of prediction. In the following subsections, we propose a simple but efficient system-level row buffer prediction and management policy for dynamic RDB reconfiguration.
2. Tagged Row Buffer Prefetch
In our design, the decision to prefetch depends mainly on detecting whether the current memory access pattern is sequential or not. To detect the characteristics of memory access patterns efficiently, we devise a tagged row buffer prefetch scheme, $TPRE$, similar to tagged prefetching in a cache [15]. As shown in Fig. 5(a), $TPRE$ uses one tag bit per entry in the RDB tracking table to decide when to issue a prefetch command. The tag bit is set when the corresponding RDB is first activated. The bit is cleared once the corresponding RDB is re-referenced, at which point a row buffer prefetch is requested to move the consecutive data into another RDB, as shown in Fig. 5(b). The underlying assumption of $TPRE$ is that the memory access pattern is likely to be sequential if an RDB is referenced again after its initial activation. This assumption is justified by the fact that the unit size of a single RDB is larger than the cacheline size.
Fig. 5. Operation of tagged row buffer prefetch, $TPRE$.
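A minimal sketch of the $TPRE$ rule follows, assuming a small per-RDB tracking table kept in the memory controller; the field names and the issue_prefetch callback are ours.

class TaggedPrefetch:
    """One tag bit per RDB: set on activation, cleared on the first re-reference."""
    def __init__(self, num_rdbs=4):
        self.row = [None] * num_rdbs
        self.tag = [False] * num_rdbs

    def access(self, rdb_id, row_addr, issue_prefetch):
        if self.row[rdb_id] == row_addr:          # RDB hit
            if self.tag[rdb_id]:
                self.tag[rdb_id] = False
                issue_prefetch(row_addr + 1)      # move the consecutive row into another RDB
        else:                                     # RDB miss: activate this RDB with the new row
            self.row[rdb_id] = row_addr
            self.tag[rdb_id] = True

tpre = TaggedPrefetch()
tpre.access(0, 10, issue_prefetch=print)          # miss: tag set
tpre.access(0, 10, issue_prefetch=print)          # re-reference: prefetch of row 11 requested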
3. Multiple Row Buffer Prefetch
Row buffer prefetch is initiated by a prediction, and thus the accuracy of the prediction largely determines the reduction in total execution time. Since $TPRE$ assumes that memory access patterns are sequential, it may incur unnecessary row buffer prefetches that are evicted without ever being referenced. To minimize these unnecessary row buffer prefetches, we propose a multiple row buffer prefetch technique, $MPRE$.
$MPRE$ uses a two-bit saturating counter to more accurately predict the characteristics of the incoming memory access pattern. The saturating counter changes its mode between STRONG RANDOM and STRONG SEQUENTIAL according to the recent activity of each RDB, as shown in Fig. 6. When an RDB hit occurs, $MPRE$ decides that the incoming memory access is likely to be sequential and promotes the mode of the saturating counter toward STRONG SEQUENTIAL. When an RDB miss occurs, $MPRE$ demotes the mode of the saturating counter toward STRONG RANDOM, because the incoming memory access is highly likely to be random.
Fig. 6. Mode transition and initial allocation for predicting incoming memory accesses
in $MPRE$.
A single global saturating counter for all RDBs would be a simple solution; however, it turns out that the accuracy of a single global counter is very poor, especially for mixed patterns that contain both sequential and random accesses. In such a case, RDB misses caused by random memory accesses demote the mode of the global saturating counter too quickly, because it aggregates information from all successive memory accesses. As a result, the sequential portion of the memory access pattern is frequently predicted as random. To avoid this misprediction, $MPRE$ uses one saturating counter per RDB so that it can track multiple interleaved memory access streams. When an RDB hit occurs, $MPRE$ promotes the mode of the corresponding RDB while the modes of the other RDBs remain unchanged. In the opposite case, an RDB miss, $MPRE$ demotes the modes of all RDBs at once, because the requested memory access is not part of any sequential stream tracked by the individual saturating counters.
It is also important to decide the initial mode of the saturating counter when new data is allocated to an RDB. Since a row buffer prefetch is performed as a result of predicting a sequential access pattern, $MPRE$ assigns the WEAK SEQUENTIAL mode to an RDB allocated by a row buffer prefetch, as shown in Fig. 6. In the other case, where the RDB allocation is caused by the processor after an RDB miss, $MPRE$ assigns WEAK RANDOM to the corresponding RDB.
The overhead of keeping and managing the two-bit saturating counters is not significant, because the memory controller must already keep the address and validity of the data stored in each RDB to decide whether the addressing phase in LPDDR2-NVM can be skipped. For selecting a victim, $MPRE$ uses a simple least recently used (LRU) policy.
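The per-RDB counter logic can be sketched as follows. The mode encoding and the prefetch threshold are our reading of the description above and of Algorithm 1's "higher than WEAK SEQUENTIAL" condition, not a verbatim reproduction of the controller.

STRONG_RANDOM, WEAK_RANDOM, WEAK_SEQUENTIAL, STRONG_SEQUENTIAL = 0, 1, 2, 3

class MPREPredictor:
    def __init__(self, num_rdbs=8):
        self.mode = [WEAK_RANDOM] * num_rdbs      # one two-bit counter per RDB

    def on_hit(self, rdb_id):
        # Promote only the RDB that hit; returns True when a prefetch should be issued.
        self.mode[rdb_id] = min(self.mode[rdb_id] + 1, STRONG_SEQUENTIAL)
        return self.mode[rdb_id] > WEAK_SEQUENTIAL

    def on_miss(self):
        # Demote every counter: the access is not part of any tracked sequential stream.
        self.mode = [max(m - 1, STRONG_RANDOM) for m in self.mode]

    def on_allocate(self, rdb_id, by_prefetch):
        # Initial mode depends on why the RDB was filled (Fig. 6).
        self.mode[rdb_id] = WEAK_SEQUENTIAL if by_prefetch else WEAK_RANDOM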
4. RDB Pinning for Overlay Window Access
As described in Section II, a write operation in LPDDR2-NVM is translated into several overlay window accesses, and the address of the overlay window does not change regardless of the target address of the write operation. This means that write memory accesses cause intensive accesses to the overlay window, which occupies only a specific range of memory addresses. If there are multiple intensive write requests, the RDBs that contain the overlay window are more likely to be referenced again before they are evicted. However, due to the very limited number of RDBs in LPDDR2-NVM, the RDBs that contain the overlay window can still be selected as victims by a conventional replacement policy such as LRU. To avoid this situation, we propose a simple overlay window pinning scheme, $MPRE+OW$, which reserves a certain number of RDBs exclusively for overlay window accesses. This pinning method is implemented with negligible overhead because address comparison is already required to decide an RDB hit in a conventional memory controller for LPDDR2-NVM.
In the proposed pinning scheme, we do not pin all RDBs that contain the overlay window. Among the several types of overlay window accesses, program buffer accesses show a low RDB hit ratio because the address of a program buffer access changes according to the address of the write request. Therefore, the proposed scheme does not pin the RDBs that contain the program buffer.
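Victim selection with pinning can be sketched as follows. The address-range test and the program-buffer offset are illustrative assumptions; the only point taken from the text is that pinned RDBs (holding overlay window registers, but not the program buffer) are skipped by the LRU scan.

OW_SIZE = 4096                 # the overlay window is 4 KB
PROG_BUF_OFFSET = 0x800        # assumed location of the program buffer inside the window

def should_pin(addr, ow_base):
    """Pin rows holding overlay window registers, but not the program buffer."""
    if not (ow_base <= addr < ow_base + OW_SIZE):
        return False
    return not (ow_base + PROG_BUF_OFFSET <= addr < ow_base + OW_SIZE)

def select_victim(pinned, lru_order):
    """pinned: per-RDB flags; lru_order: RDB ids from least to most recently used."""
    for rdb_id in lru_order:
        if not pinned[rdb_id]:
            return rdb_id       # conventional LRU, restricted to unpinned RDBs
    return lru_order[0]         # fallback if every RDB happens to be pinned

pinned = [True, False, False, False, False, False, False, False]
print(select_victim(pinned, lru_order=[0, 3, 1, 2, 4, 5, 6, 7]))   # -> 3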
Table 1. Simulated system configuration details

Number of cores | 4
Processor | UltraSPARC-III+, 2 GHz (OoO)
L1 cache (private) | I/D-cache: 32 KB, 4-way, 64 B block
L2 cache (shared) | 2 MB, 4-way, 64 B block
PCM main memory | 4 GB, LPDDR2-800, 64-bit wide
Preactive to Activate ($t_{RP}$) | 3 $t_{CK}$$^1$
Activate to Read/Write ($t_{RCD}$) | 120 ns
Read/write latency | 6 $t_{CK}$ / 3 $t_{CK}$$^1$
Cell program time ($t_{program}$) | 150 ns
Number of partitions | 16

$^1$ $t_{CK}$ is one memory clock cycle (2.5 ns at LPDDR2-800)
Algorithm 1 describes how $MPRE+OW$ manages the row buffers proactively in combination with the previously proposed $MPRE$. When a new memory read request arrives, it first checks whether the request is an RDB hit or a miss. If an RDB hit occurs, the mode of the corresponding RDB is promoted. After the promotion, if its mode is higher than WEAK SEQUENTIAL, a new row buffer prefetch is requested. If an RDB miss occurs, all RDBs are demoted and one of the unpinned RDBs is selected as a victim for the new request; the mode of the victim RDB is then set to WEAK RANDOM. If a prefetch is requested as a result of the RDB hit handling, a victim for the new prefetch request is also selected from the unpinned RDBs, and the mode of that victim RDB is set to WEAK SEQUENTIAL.
No prefetch is requested for RDBs that contain the overlay window, so their modes are temporarily set to WEAK RANDOM and updated only when they are selected as victims for read memory accesses. A write memory access also does not change the modes of the other RDBs. The address phase of a write memory access starts from the READ/WRITE phase if the access hits a pinned RDB; otherwise, it starts from the PREACTIVE phase.
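Since the listing of Algorithm 1 is not reproduced in this text, the following compact sketch follows the read-path description above. The data structures, helper names, and the static LRU order in the example are our own simplifications.

STRONG_RANDOM, WEAK_RANDOM, WEAK_SEQUENTIAL, STRONG_SEQUENTIAL = 0, 1, 2, 3

class RDB:
    def __init__(self, pinned=False):
        self.row, self.mode, self.pinned = None, WEAK_RANDOM, pinned

def select_unpinned_victim(rdbs, lru):
    # lru lists indices from least to most recently used; pinned RDBs are skipped.
    return next(i for i in lru if not rdbs[i].pinned)

def handle_read(row, rdbs, lru):
    hit = next((i for i, b in enumerate(rdbs) if b.row == row), None)
    if hit is not None:                                     # RDB hit: promote, maybe prefetch
        rdbs[hit].mode = min(rdbs[hit].mode + 1, STRONG_SEQUENTIAL)
        if rdbs[hit].mode > WEAK_SEQUENTIAL:                # "higher than WEAK SEQUENTIAL"
            v = select_unpinned_victim(rdbs, lru)
            rdbs[v].row, rdbs[v].mode = row + 1, WEAK_SEQUENTIAL   # row buffer prefetch
    else:                                                   # RDB miss: demote all, refill a victim
        for b in rdbs:
            b.mode = max(b.mode - 1, STRONG_RANDOM)
        v = select_unpinned_victim(rdbs, lru)
        rdbs[v].row, rdbs[v].mode = row, WEAK_RANDOM
    return hit is not None

# Example: 8 RDBs, one pinned for the overlay window, simple static LRU order.
rdbs = [RDB(pinned=(i == 0)) for i in range(8)]
print(handle_read(42, rdbs, lru=list(range(8))))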
V. EXPERIMENTAL RESULTS
1. Evaluation Setup
We developed a cycle-accurate trace-driven simulator using SystemC to evaluate the total execution time and the total execution energy. The traces were extracted from the Simics full-system simulator [16] together with the processor clock cycle at which each memory access is issued. We calculate the total execution time of a trace by combining the idle time of the memory system, obtained from the processor clock cycles, with the simulated memory access latency.
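Although the exact bookkeeping in the simulator is not spelled out here, a simple way to express this calculation, under the assumption that idle gaps and memory latencies simply accumulate, is

$T_{total} = \sum_{i=1}^{N} \left( T_{idle,i} + T_{mem,i} \right),$

where $N$ is the number of memory requests in the trace, $T_{idle,i}$ is the idle gap before request $i$ derived from its issue cycle, and $T_{mem,i}$ is the simulated latency of request $i$, which depends on whether it hits an RDB and on which addressing phases are required.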
We simulate a 4-core out-of-order processor system operating at a 2 GHz clock frequency with a shared last-level cache. The main memory system has a 64-bit bus with four LPDDR2-NVM-compatible PCM chips. The timing parameters of the non-volatile memory (phase-change memory in our experiments) are extracted from the JEDEC LPDDR2-NVM standard and an industrial prototype [17]. The details of the simulation setup are summarized in Table 1.
Ten multi-threaded benchmarks from the PARSEC benchmark suite [18] are selected. Table 2 summarizes the characteristics of each benchmark in terms of the read-to-write (R/W) ratio and the frequency of memory accesses. Based on this setup, we intensively evaluate the proposed prefetch-based proactive row buffer management schemes: $TPRE$, $MPRE$, and $MPRE+OW$.
Table 2. Memory access characteristics of the benchmarks

Applications | R/W ratio | Mem. accesses / 1K CPU cycles
blackscholes | 3.02 | 4.2
bodytrack | 2.80 | 1.2
facesim | 1.57 | 7.6
ferret | 2.71 | 6.3
freqmine | 2.20 | 4.7
raytrace | 1.73 | 2.5
streamcluster | 2.53 | 2.2
swaptions | 3.28 | 1.2
vips | 1.77 | 4.7
x264 | 2.87 | 3.7
We evaluate them in terms of the total execution time and the total energy consumption. As a baseline, we use the static optimum RDB configuration, i.e., the configuration with the minimum execution time among all possible RDB configurations. Note that we assume the same row activation time regardless of the RDB size. From an extensive design space exploration, the 8×128-byte RDB configuration is selected as the static optimum RDB configuration for all benchmarks.
2. Performance Evaluations
Before evaluating the performance and energy consumption, we first analyze the RDB hit ratio and the prefetch ratio, which directly affect the latency and energy consumption of the memory devices. Table 3 compares the RDB hit ratios of $TPRE$, $MPRE$, and $MPRE+OW$. We separately present the RDB hit ratio of read accesses, $r_{RD}$, and of overlay window accesses, $r_{OW}$, to clearly show the effects of each row buffer management scheme.
As expected, $r_{OW}$ is generally higher than $r_{RD}$ in all applications. This means that overlay window accesses show higher spatial and temporal locality than other types of memory accesses in LPDDR2-NVM. Compared with the baseline configuration, the most naïve scheme, $TPRE$, shows a higher $r_{RD}$ because it prefetches row buffers aggressively. However, these aggressive row buffer prefetches also decrease $r_{OW}$, which negatively affects the total execution time. The $r_{RD}$ and $r_{OW}$ of $MPRE$ are improved in most applications except $bodytrack$. By exploiting the history of memory accesses, $MPRE$ efficiently reduces unnecessary prefetches and evictions of RDBs that contain high-locality overlay window data. We observe a further improvement of $r_{OW}$ in $MPRE+OW$, because $MPRE+OW$ tries to keep the RDBs that contain the overlay window as long as possible when the memory controller predicts that there will be several write accesses among the upcoming memory requests. Overall, compared with the static optimum RDB configuration, $MPRE+OW$ enhances $r_{RD}$ and $r_{OW}$ by 16.0% and 3.0% on average, respectively.
Table 3. Comparison of the RDB hit ratio (%)

Applications | Static ($r_{RD}$ / $r_{OW}$) | $TPRE$ ($r_{RD}$ / $r_{OW}$) | $MPRE$ ($r_{RD}$ / $r_{OW}$) | $MPRE+OW$ ($r_{RD}$ / $r_{OW}$)
blackscholes | 27.0 / 73.8 | 39.2 / 64.4 | 42.2 / 71.7 | 42.1 / 79.0
bodytrack | 21.5 / 74.7 | 31.5 / 65.9 | 30.9 / 73.2 | 30.8 / 79.0
facesim | 41.7 / 83.7 | 56.1 / 81.5 | 69.9 / 83.0 | 69.8 / 84.1
ferret | 35.8 / 77.2 | 48.4 / 70.3 | 56.8 / 75.3 | 56.7 / 80.4
freqmine | 30.3 / 78.5 | 39.0 / 68.1 | 43.6 / 77.0 | 43.5 / 80.6
raytrace | 25.7 / 80.7 | 36.3 / 70.6 | 38.9 / 79.8 | 38.9 / 81.7
streamcluster | 36.7 / 79.0 | 50.0 / 74.8 | 56.9 / 77.5 | 56.8 / 81.6
swaptions | 28.7 / 73.9 | 44.4 / 63.9 | 49.0 / 71.6 | 48.9 / 79.2
vips | 14.0 / 86.3 | 19.2 / 71.9 | 21.8 / 85.7 | 21.8 / 87.6
x264 | 25.2 / 74.4 | 34.5 / 64.6 | 38.1 / 72.8 | 38.0 / 79.2
average | 28.7 / 78.2 | 39.9 / 69.6 | 44.8 / 76.8 | 44.7 / 81.2
To further analyze the effect of row buffer prefetching, we first define the row buffer prefetch ratio, $r_{PF}$, as the fraction of the number of RDB allocations caused by row buffer prefetches over the total number of RDB allocations. We also define a good row buffer prefetch ratio to evaluate the prediction accuracy of each scheme. A good row buffer prefetch is one whose prefetched row data is referenced more than once before the RDB is evicted; otherwise, we consider it a bad row buffer prefetch. We define the good row buffer prefetch ratio, $r_{G.PF}$, as the fraction of the number of good row buffer prefetches over the total number of row buffer prefetches. $r_{PF}$ and $r_{G.PF}$ are good indicators of the prediction accuracy of the proposed schemes.
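In equation form (the count symbols $N$ below are introduced here only for compactness), these definitions read

$r_{PF} = \dfrac{N_{alloc,PF}}{N_{alloc}}, \qquad r_{G.PF} = \dfrac{N_{G.PF}}{N_{PF}},$

where $N_{alloc}$ is the total number of RDB allocations, $N_{alloc,PF}$ is the number of allocations caused by row buffer prefetches, $N_{PF}$ is the total number of row buffer prefetches, and $N_{G.PF}$ is the number of good row buffer prefetches.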
Table 4. Comparison of the prefetch ratio and the good prefetch ratio (%)

Applications | $TPRE$ ($r_{PF}$ / $r_{G.PF}$) | $MPRE$ ($r_{PF}$ / $r_{G.PF}$) | $MPRE+OW$ ($r_{PF}$ / $r_{G.PF}$)
blackscholes | 34.5 / 24.9 | 18.3 / 73.1 | 20.0 / 73.2
bodytrack | 35.0 / 17.4 | 14.1 / 53.9 | 15.1 / 53.7
facesim | 30.9 / 40.6 | 31.5 / 85.4 | 32.3 / 85.4
ferret | 33.0 / 30.7 | 25.0 / 81.2 | 26.9 / 81.1
freqmine | 32.3 / 20.0 | 16.7 / 69.8 | 17.7 / 69.9
raytrace | 30.6 / 20.0 | 14.8 / 71.7 | 15.4 / 71.6
streamcluster | 32.8 / 32.3 | 25.8 / 74.3 | 27.5 / 74.3
swaptions | 33.8 / 32.5 | 23.5 / 78.0 | 25.6 / 78.0
vips | 35.6 / 8.3 | 9.4 / 70.5 | 9.7 / 70.6
x264 | 35.2 / 17.4 | 15.5 / 72.7 | 16.7 / 72.5
average | 33.4 / 24.4 | 19.5 / 73.1 | 20.7 / 73.0
Fig. 7 compares the total execution times of $TPRE$, $MPRE$, and $MPRE+OW$. The total execution time of each scheme is normalized to that of the static optimum RDB configuration. $TPRE$ mostly results in a longer execution time than the static optimum configuration. As shown in Table 3, $TPRE$ successfully increases the RDB hit ratio in all applications. However, the aggressive prefetches in $TPRE$ increase the number of unnecessary evictions of row data with high temporal and spatial locality. For example, $TPRE$ frequently evicts RDBs that contain overlay window data with high temporal and spatial locality. As a result, the RDB hit ratio for overlay window accesses is degraded, as shown in Table 3. Only in $bodytrack$, $raytrace$, and $swaptions$, which show lower R/W ratios and fewer memory accesses per 1K CPU cycles than the other benchmarks, does $TPRE$ reduce the memory access time. Overall, $TPRE$ increases the total execution time by 5.2% on average.
Fig. 7. Comparison of the total execution time (normalized to the static optimum RDB
configuration).
Compared with $TPRE$, $MPRE$ is designed to minimize unnecessary row buffer prefetches by exploiting the history of memory access patterns. As shown in Tables 3 and 4, $MPRE$ significantly improves both the RDB hit ratio and the good prefetch ratio for all applications. These improvements translate directly into reductions of the total execution time ranging from 2.4% to 21.6% across all applications. On average, $MPRE$ reduces the total execution time by 8.0%.
Finally, $MPRE+OW$ reduces the total execution time even further than $MPRE$ by pinning the RDBs dedicated to overlay window accesses. Compared with $MPRE$, $MPRE+OW$ shows higher reduction ratios in the total execution time for applications with a high R/W ratio, such as $blackscholes$ and $swaptions$, than for applications with a low R/W ratio, such as $facesim$, $raytrace$, and $vips$. Our analysis is that $MPRE$ frequently evicts the RDBs that contain overlay window data even when a write request will arrive soon. These unnecessary evictions are efficiently prevented by $MPRE+OW$, which leads to performance enhancements ranging from 4.7% to 23.2%. In summary, $MPRE+OW$ reduces the total execution time by 12.2% on average compared with the static optimum RDB configuration.
3. Energy Consumption Evaluations
We also analyze the energy consumption of the proposed row buffer management schemes. Although the proposed row buffer prefetch schemes reduce the total execution time by hiding the row buffer activation time, they may consume additional energy when a prefetch is not a good row buffer prefetch. To analyze the energy consumption, we model the energy consumption of LPDDR2-NVM devices based on the manufacturer's datasheet. Our energy model mainly focuses on capturing the differences in energy consumption during row and row buffer activation, which are the dominant sources of memory energy consumption.
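Since the datasheet-based model itself is not reproduced here, the trade-off just described can be sketched to first order, under the assumption that activation energy and background power dominate, as

$E_{total} \approx N_{act} E_{act} + P_{bg} T_{total},$

where $N_{act}$ counts all row and row buffer activations, including those triggered by prefetches, $E_{act}$ is the per-activation energy taken from the datasheet, and $P_{bg}$ is the background power drawn over the total execution time $T_{total}$. In this view, a bad prefetch adds one $E_{act}$ without shortening $T_{total}$, while a good prefetch pays the same $E_{act}$ but reduces the $P_{bg} T_{total}$ term by hiding activation latency.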
Fig. 8. Comparison of the total execution energy (normalized to the static optimum
RDB configuration).
Fig. 9. Sensitivity analysis by changing the size of RDB (normalized to the static
optimum RDB configuration).
Fig. 8 shows the energy consumption of $TPRE$, $MPRE$, and $MPRE+OW$, normalized to that of the static optimum RDB configuration. The energy consumption of $TPRE$ is higher than that of the static optimum RDB configuration for all applications. The additional energy from the many unnecessary RDB prefetches, combined with the increased execution time, clearly increases the total energy consumption. As shown in Table 4, only 24.4% of the total prefetches in $TPRE$ are classified as good prefetches. $MPRE$ also shows higher energy consumption than the static optimum RDB configuration for all applications, even though its total execution time is decreased. Similar to $TPRE$, 26.9% of the prefetches in $MPRE$ are still useless. Our analysis is that this additional energy consumption slightly exceeds the energy savings in most applications. Finally, $MPRE+OW$ shows slightly lower or almost the same energy consumption as the baseline configuration in most applications. Unlike $MPRE$, the energy benefit of $MPRE+OW$ from reducing the total execution time exceeds the energy overhead of its unnecessary prefetches. This demonstrates that managing RDBs proactively can reduce the total execution energy as well as the total execution time.
4. Sensitivity Analysis
In the experiments above, the physical size of the RDB is fixed based on the LPDDR2-NVM specification. Since the physical size of the RDB may significantly affect the memory access time and energy consumption, we analyze the total execution time and the total execution energy of the proposed schemes as the physical RDB size increases. We change the physical size of a single RDB while fixing the number of RDBs to 8 for all configurations. As shown in Fig. 9, even the simple $TPRE$ reduces the total execution time once the RDB size grows to 8×512 bytes or more. The enhancement ratios of the total execution time in $MPRE$ and $MPRE+OW$ increase significantly until the physical RDB size reaches 8×512 bytes. Beyond 8×512 bytes, the enhancement ratios do not increase further or even decrease slightly. This means that an RDB configuration of 8×512 bytes gives the best performance.
As described previously, the proposed prefetch-based RDB management affects the energy consumption of the memory devices both positively and negatively at the same time. As the physical RDB size increases, the positive effect of reducing the total execution time grows, while the negative effect of additional energy consumption due to unnecessary prefetches also worsens. We observe that the positive effect exceeds the negative effect only in the 8×256-byte configuration. This means that 8×256 bytes is the best configuration from the energy consumption perspective, which differs from the best configuration from the performance perspective.
VI. CONCLUSIONS
The memory interface significantly affects the performance of a memory system, but it has been less addressed or even overlooked. This paper focused on the role of the memory interface, targeting LPDDR2-NVM-compatible non-volatile memory devices, because it has quite different mechanisms compared with conventional LPDDR interfaces. Based on our observations, we proposed a proactive row buffer management scheme that enables logical reconfiguration of the row buffer architecture at runtime using prefetch techniques. Extensive evaluations using traces from a full-system simulator demonstrate that the proposed method improves the performance and energy consumption of the memory system by 12.2% and 0.3% on average, respectively, compared with a design-time optimization technique, without any memory device modification.
ACKNOWLEDGMENTS
This work was supported by the 2018 Research Fund of University of Ulsan.
REFERENCES
IDC, The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things, EMC Digital Universe with Research & Analysis, April 2014.
Zypryme, Global Smart Meter Forecasts, 2012-2020, Smart Grid Insights, November 2013.
Raoux, S., et al., Phase-Change Random Access Memory: A Scalable Technology, IBM Journal of Research and Development, Vol. 52, pp. 465-479, 2008.
Zilberberg, O., Weiss, S., and Toledo, S., Phase-Change Memory: An Architectural Perspective, Comput. Surveys, Vol. 45, 2013.
Lee, B. C., Ipek, E., Mutlu, O., and Burger, D., Architecting Phase Change Memory as a Scalable DRAM Alternative, ISCA, 2009.
Qureshi, M. K., Srinivasan, V., and Rivers, J. A., Scalable High-Performance Main Memory System Using Phase-Change Memory Technology, ISCA, 2009.
JEDEC, Low-Power Double Data Rate 2 Non-Volatile Memory, JESD209-F, 2013.
Clarke, P., Samsung Preps 8-Gbit Phase-Change Memory, EE Times, Nov. 2011.
Choi, Y., et al., A 20nm 1.8V 8Gb PRAM with 40MB/s Program Bandwidth, ISSCC.
Yoon, H., et al., DynRBLA: A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories, SAFARI Technical Report No. 2011-005, 2011.
Li, Z., Zhou, R., and Li, T., Exploring High-Performance and Energy Proportional Interface for Phase Change Memory Systems, HPCA, 2013.
Park, J., et al., Accelerating Memory Access with Address Phase Skipping in LPDDR2-NVM, JSTS, Vol. 14, No. 6, pp. 741-749, 2014.
Park, J., Shin, D., and Lee, H. G., Design Space Exploration of Row Buffer Architecture for Phase Change Memory with LPDDR2-NVM Interface, VLSI-SoC, 2015.
Park, J., Shin, D., and Lee, H. G., Prefetch-Based Dynamic Row Buffer Management for LPDDR2-NVM Devices, VLSI-SoC, 2015.
Srinivasan, V., Davidson, E. S., and Tyson, G. S., A Prefetch Taxonomy, IEEE Trans. Comput., Vol. 53, No. 2, pp. 126-140, Feb. 2004.
Magnusson, P. S., et al., Simics: A Full System Simulation Platform, Computer, Vol. 35, No. 2, pp. 50-58, 2002.
Bienia, C., Kumar, S., Singh, J. P., and Li, K., The PARSEC Benchmark Suite: Characterization and Architectural Implications, PACT, 2008.
Author
Jaehyun Park received the B.S. degree in Electrical Engineering and the Ph.D. degree in Electrical Engineering and Computer Science from Seoul National University, Seoul, Korea, in 2006 and 2015, respectively. From 2009 to 2010, he was a Visiting Scholar with the University of Southern California, Los Angeles, CA. He was an Exchange Scholar with the School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ, from 2015 to 2018. He is currently an Assistant Professor at the School of Electrical Engineering, University of Ulsan, Ulsan, Korea. Dr. Park received the 2007 and 2012 ISLPED Low Power Design Contest Awards and the 2017 13th ACM/IEEE ESWEEK Best Paper Award. His current research interests include energy harvesting and management, low-power IoT systems, and non-volatile memory systems.
Donghwa Shin received the B.S. degree in computer engineering and the M.S. and Ph.D. degrees in computer science and electrical engineering from Seoul National University, Seoul, South Korea, in 2005, 2007, and 2012, respectively. He joined the Dipartimento di Automatica e Informatica, Politecnico di Torino, Turin, Italy, as a Researcher. He is currently an Assistant Professor at the Department of Smart Systems Software, Soongsil University, Seoul, South Korea. His research interests have covered system-level low-power techniques, and he is currently focusing on energy-aware neuromorphic computing. Dr. Shin serves (and has served) as a reviewer for the IEEE Transactions on Computers, TCAD, and TVLSI, the ACM TODAES and TECS, and others. He serves on the technical program committees of IEEE and ACM conferences, including DATE, ISLPED, and ASP-DAC.
Hyung Gyu Lee received the Ph.D. degree in Computer Science and Engineering from Seoul National University, Seoul, Korea, in 2007. He was a senior engineer with Samsung Electronics from 2007 to 2010. He also worked as research faculty with the Georgia Institute of Technology, Atlanta, GA, from 2010 to 2012. He is currently an Associate Professor with the School of Computer and Communication Engineering, Daegu University, Korea. His research interests include embedded system design, low-power systems, and memory system design focusing on emerging non-volatile memory and storage technologies. Energy harvesting and wearable IoT applications are also among his current research interests. He received three best paper awards, from HPCC 2011, ESWEEK 2017, and ESWEEK 2019, and one design contest award from ISLPED 2014.