1. Experimental Methodology
1) Configurations: To establish an experimental environment for DCM, we built a client node and a donor node connected by 100 Gb/s InfiniBand. The detailed specifications are shown in Table 1.
2) Workloads: We implemented widely used memory-intensive applications, written in either C or Java. For workloads written in C, the user manages the memory lifecycle by calling the malloc and free APIs. In contrast, for workloads written in Java, the JVM manages the memory lifecycle [18], allowing application development without explicit concern for memory management.
To investigate potential performance differences resulting from memory management by the user versus the JVM, we conducted experiments using both unmanaged and managed workloads. The details of the workloads can be found in Table 2.
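For reference, the unmanaged workloads follow the explicit allocate-use-free pattern sketched below; this is a minimal illustration, not the benchmarks' actual code.

```c
#include <stdlib.h>
#include <string.h>

int main(void) {
    size_t n = 1UL << 20;      /* 1 MiB working buffer */
    char *buf = malloc(n);     /* the user explicitly allocates */
    if (buf == NULL)
        return 1;
    memset(buf, 'a', n);       /* use: touching the pages maps them in */
    free(buf);                 /* the user explicitly releases */
    return 0;
}
```

In the managed (Java) workloads, by contrast, an equivalent buffer would simply become unreachable and be reclaimed later by the garbage collector.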
For the managed workloads, we utilized applications provided by the Intel HiBench benchmark suite [17], executed on Apache Spark [9], a widely used distributed processing system for big-data workloads. The garbage collection policy was set to ParallelGC [19].
3) Experimental Environment: To quantify the performance improvement of running applications in DCM compared to the virtualized computing environment, we designated the SSD shown in Table 1 as the swap space. The experimental environments for DCM and disk swap are illustrated in Fig. 3.
An application executing in a VM loads its working set into memory during runtime. If the working set exceeds the local memory, the operating system evicts pages mapped to the client node's memory area into swap space to create free space, and subsequently retrieves those pages from the swap space on demand, as depicted in Fig. 3(a). In contrast, DCM can fetch memory pages from the donor node via remote fetching, effectively enlarging the available memory capacity (Fig. 3(b)).
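To make the paging behavior concrete, the sketch below shows the simplest way a process enters this regime: allocating and touching a buffer larger than local memory. The 24 GiB size is an arbitrary illustration, not our experimental configuration.

```c
#include <stdio.h>
#include <stdlib.h>

#define GIB (1UL << 30)

int main(void) {
    size_t size = 24 * GIB;   /* deliberately larger than local memory */
    char *buf = malloc(size);
    if (buf == NULL) {
        perror("malloc");
        return 1;
    }
    /* Touch every page. Once resident pages exceed local memory, the
       kernel evicts cold pages to the swap device (disk swap) or, under
       DCM, to the donor node's memory; later touches fault them back. */
    for (size_t off = 0; off < size; off += 4096)
        buf[off] = 1;
    free(buf);
    return 0;
}
```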
4) Comparison Targets and Evaluation Criteria: To denote the proportion of the working set held in local memory, we define L(N) as the configuration in which local memory holds N% of the working set. For instance, at L(50), half of the application's working set is loaded into local memory, while the remainder is mapped to either remote memory or swap space. As the performance metric, we use makespan: the total time taken to complete a set of tasks. We also categorized three evaluation criteria to characterize the impact on DCM: page fault cost, garbage collection, and page caching. These criteria let us explore in depth the performance implications of each factor in the context of DCM.
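Stated formally, with $M$ the local memory available to the application and $W$ its working set size (our notation, restating the definition above), an L(N) configuration satisfies

$$ \frac{M}{W} = \frac{N}{100}, $$

so at L(50) half of the working set resides locally and the other half is served from remote memory or swap space.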
Fig. 3. Experimental Environment.
Table 1. Specifications of server in DCM

| Component | Specification |
| --- | --- |
| CPU | Intel Xeon Gold 6330, 2.00 GHz, 28 cores × 2 |
| Memory | 16 GB (DDR4, 3200 MHz) × 8 |
| Network | Mellanox ConnectX-5 100 Gb/s EDR HCA |
| SSD | Intel NVMe SSD 750 [16] (R/W: 2.2 GB/s / 0.9 GB/s) |
| OS | Linux kernel 4.18, CentOS 8.4 |
Table 2. Experimental Workloads
Application
|
Workload Type
|
Description
|
Grep [12]
|
Unmanaged
|
Entails searching through a large dataset to identify and extract specific words.
|
GroupByAggregation [12] (GAG)
|
Unmanaged
|
Determines the sum of values corresponding to identical keys within a file.
|
PageRank [17] (PR)
|
Managed
|
Calculates the importance of a webpage based on the number and incoming links.
|
Bayesian Classification [17] (BC)
|
Managed
|
Predicts class membership probabilities belongs to a particular group.
|
2. Makespan
Fig. 4 shows the makespan for each workload against the proportion of the working set residing in local memory. Across all workloads, applications running on DCM outperform those using disk swap. The performance gap widens as the fraction of the working set located in local memory decreases, with DCM exhibiting up to a 3.5× speedup in execution time. This substantial disparity stems from the contrasting I/O performance of DCM and disk swap when handling page faults.
Interestingly, the performance variation within DCM is contingent upon whether garbage collection is engaged during application execution. In Fig. 4(a), the unmanaged workloads running on DCM exhibit no significant performance variation as the ratio of local memory to working set diminishes. In Grep, the slowdown at L(60) is approximately 6% compared to L(100).
In contrast to the unmanaged applications, the managed applications show a different pattern: performance degrades modestly as the ratio of the working set residing in local memory decreases, as shown in Fig. 4(b). This is because the software overhead of marking and copying objects during garbage collection counteracts the benefits of fast remote paging. We describe this further in the next section.
Fig. 4. Makespan of applications in each experimental environment. The X-axis represents the ratio of local memory to the working set size, in the order L(100), L(80), and L(60).
3. Influence Factors
In this section, we demonstrate how fast remote paging and the careful victim selection policy in DCM impact overall performance. We selected an unmanaged application as the experimental workload in order to clearly isolate the overhead associated with page faults, minimizing any interference from software overheads such as garbage collection. Among the unmanaged workloads, we chose Grep because it shows the most distinct performance pattern. To quantitatively compare the page fault overhead between the two environments, we define the page fault cost as shown in Eq. (1):

$$ PFC = \overline{PFL} \times N \qquad (1) $$

where PFC denotes the page fault cost, $\overline{PFL}$ the average page fault latency, and $N$ the total number of page faults. In short, PFC is the total time spent fetching faulted pages from remote memory or disk swap space.
1) Page Fault Cost: Fig. 5 shows a time breakdown for each operation in the Grep workload. In all DCM scenarios, the ToLower and PtrFree operations account for less than 10% of the total execution time. Under disk swap, however, these two operations become significantly more time-consuming, taking up over 45% of the total execution time.
This significant difference is due to the memory paging that occurs while processing these two operations, which quickly depletes local memory and causes frequent page faults. First, the ToLower operation converts uppercase text to lowercase, which requires loading each string into memory. If sufficient memory is not available, a page fault occurs, blocking the CPU until the faulted page is brought into local memory. Second, the PtrFree operation performs memory deallocation, which triggers page-out activity. If a memory page to be freed resides in swap space, a page fault is triggered, halting the free operation until the page is relocated to local memory.
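For intuition, ToLower is essentially a full read-modify-write pass over the dataset, so every evicted page it touches must fault back in. The sketch below illustrates the access pattern only, not the benchmark's actual implementation.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Every byte is read and written, so each page of the buffer must be
   resident; pages evicted to swap or remote memory fault back in one
   by one, blocking the CPU for the full fault latency each time. */
static void to_lower(char *buf, size_t len) {
    for (size_t i = 0; i < len; i++)
        buf[i] = (char)tolower((unsigned char)buf[i]);
}

int main(void) {
    char line[] = "GREP Scans LARGE Datasets";
    to_lower(line, strlen(line));
    puts(line);   /* prints: grep scans large datasets */
    return 0;
}
```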
As illustrated by the red dotted line in Fig. 5, DCM exhibits negligible performance degradation relative to L(100) even at L(60). At L(60), the average page fault latency for DCM is 72 μs, while that of a page fault in the disk swap is roughly 1 ms. The number of page faults is also reduced by about 37% on DCM (20660) compared to disk swap (32659). Calculated as the page fault cost, DCM took a total of 1.4 seconds, while disk swap took 33.3 seconds: about 23 times less overhead under DCM. This implies that the remote paging and optimized page-victim selection policy in DCM handle page fault events more rapidly and reduce the time during which the CPU is blocked, thereby minimizing the performance degradation of the application.
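As a sanity check on Eq. (1), dividing each total cost by its fault count recovers the per-fault latencies:

$$ \overline{PFL}_{\mathrm{DCM}} = \frac{1.4\ \mathrm{s}}{20660} \approx 68\ \mu\mathrm{s}, \qquad \overline{PFL}_{\mathrm{disk}} = \frac{33.3\ \mathrm{s}}{32659} \approx 1.02\ \mathrm{ms}, $$

consistent, after rounding, with the measured averages of 72 μs and roughly 1 ms quoted above.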
2) Garbage Collection: To investigate the relationship between garbage collection and memory paging, we performed the following experiment. We configured the local memory, remote memory, and disk swap space to 16 GB each. As detailed in Table 2, the experimental workloads consisted of Java applications from the Intel HiBench benchmark suite [17], built on Apache Spark. Apache Spark initiates a JVM process (i.e., an executor) with heap memory to execute tasks. Based on this framework configuration, we allocated 32 GB of memory to the executor, encompassing both local memory and swap space. This configuration facilitated the observation of performance patterns in DCM and disk swap when OS swapping coincides with garbage collection, that is, when the working set size surpasses the local memory (16 GB).
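The executor settings above map onto standard Spark configuration properties; a sketch of the relevant spark-defaults.conf lines is shown below (the property names are standard Spark options; all other settings are omitted).

```
spark.executor.memory            32g
spark.executor.extraJavaOptions  -XX:+UseParallelGC
```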
In a managed application, objects are managed by the JVM. Consequently, even if the input size is small, the working set can grow as the application runs due to object creation. In our study, we categorized the working sets into four types, as presented in Table 3. This classification is based on the observation that swap space or remote memory is accessed when the working set created at runtime exceeds the local memory capacity.
There is no significant performance difference between DCM and disk swap at L(100), where the entire working set is loaded into local memory. We found that the working set at L(92) approximately matches the local memory size. Below L(92), OS swapping was triggered, causing the makespan to increase as shown in Fig. 6. At L(90), where OS swapping and remote paging overlap in earnest, we observed performance degradation under both DCM and disk swap.
This contradicts our earlier observation that unmanaged applications suffer minimal performance degradation even when parts of the working set are evicted from local memory. The contradiction arises from the garbage collection overhead, which dilutes the benefits of fast remote paging. Garbage collection entails marking live objects and temporarily copying them to a different memory space before the remaining garbage is cleared. When objects involved in this cleanup are evicted from local memory to the backing store, the cleanup operation comes to a halt and can only resume once the collector retrieves those objects back into local memory. As depicted in Fig. 2, all application threads are temporarily halted until the garbage collection process completes. Consequently, the accumulated slowdown caused by OS swapping during garbage collection can significantly degrade performance.
DCM still outperforms disk swap, of course. However, as shown in Fig. 7, the overhead is not negligible: garbage collection time makes up 15% of the total execution time at L(70). Examining the garbage collection time in more detail, we observe that object copy time constitutes about 85% of it in both environments (Fig. 8(a)). This indicates that evicting the marked and copied objects to the backing store and subsequently retrieving them is the dominant operation when page faults occur.
To investigate how the two environments impact object copy time when page faults and garbage collection overlap, we also plotted the tail latency as a Cumulative Distribution Function (CDF). As shown in Fig. 8(b), DCM exhibits a shorter tail latency, while disk swap displays a far more extended tail. DCM suffers less performance degradation because it executes page replacements quickly via fast remote paging, minimizing the time during which application threads are stopped.
3) Page Caching: Although DCM enables fast remote paging, it inevitably incurs a non-negligible communication overhead on each fault event. DCM mitigates this overhead by recognizing memory access patterns and preventing the eviction of pages that are likely to be accessed in the near future.
In this section, we explore how this selective page eviction impacts performance in DCM. For this, we configured a ramdisk as the swap space because it is much faster than the SSD; in this setting, both the ramdisk swap and DCM have similar memory access latencies. The key difference lies in the handling of eviction: DCM employs a policy that defers eviction by giving pages multiple opportunities to be referenced during paging events. Through these experiments, we aim to validate the effectiveness of DCM's optimized page eviction policy.
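The text does not give pseudocode for this policy; a second-chance (clock) scheme is one standard way to realize "multiple opportunities before eviction," and the sketch below illustrates that general idea only, not DCM's actual implementation.

```c
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 1024

/* Per-frame metadata: the reference bit is set on each access. */
static bool referenced[NPAGES];
static size_t hand;   /* clock hand sweeping the frames */

/* Called on a page fault when a frame must be freed. A page whose
   reference bit is set gets another chance (bit cleared, skipped);
   only pages that stay untouched across sweeps are evicted, which
   keeps recently used pages cached locally. */
size_t pick_victim(void) {
    for (;;) {
        if (!referenced[hand]) {
            size_t victim = hand;
            hand = (hand + 1) % NPAGES;
            return victim;
        }
        referenced[hand] = false;       /* spend one chance */
        hand = (hand + 1) % NPAGES;
    }
}
```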
Fig. 9 compares the makespan of DCM against environments using different types of disks as swap space. The performance of ramdisk swap and DCM is similar at L(80). Intriguingly, starting at L(70), ramdisk performance deteriorates relative to DCM. Given that ramdisks generally perform I/O operations at memory speed, one might expect a makespan comparable to DCM. The disparity stems from the fact that DCM's more careful page-victim selection, which provides a form of caching effect, is more efficient than the kernel's LRU page replacement approach. The impact of this careful eviction policy becomes more significant as the working set grows and more of it is evicted from local memory.
Fig. 5. Breakdown of task time in the Grep workload. In the figure, LD/LS denote the local memory ratio in DCM and disk swap, respectively.
Fig. 6. Makespan of the managed application (PageRank). The X-axis represents the ratio of local memory to working set, decreasing from left to right.
Fig. 7. Breakdown of computation time and GC time as a percentage of the total execution time.
Fig. 8. (a) shows how much time each garbage collection task accounts for in the total garbage collection time; (b) shows a Cumulative Distribution Function (CDF) of object copy times at L(70) for both environments.
Fig. 9. Makespan of the managed application (PageRank) in DCM and disk swap (NVMe SSD, ramdisk).
Table 3. Working set sizes generated at runtime by different PageRank input data sizes

| Input Size / Working Set Size | Case |
| --- | --- |
| 0.36 GB / 7.97 GB | L(100) |
| 0.72 GB / 17.6 GB | L(90) |
| 1.08 GB / 19.3 GB | L(80) |
| 1.44 GB / 21.3 GB | L(70) |