I. INTRODUCTION
Distributed clusters are essential for running complex workloads such as big data
analytics and deep learning, which have recently been applied in various fields. In particular,
in HPC (High Performance Computing)-grade clusters, the network communication that
connects the nodes has a significant impact on overall performance. InfiniBand is
a prominent in-network device used for high-speed transmission in such high-performance
distributed processing environments [1]. It has gained increased attention recently, especially with the emergence of high-speed
DPU (Data Processing Unit) chips like NVIDIA BlueField [2-4].
To maximize the performance of InfiniBand, we need to utilize a recent communication
protocol called RDMA (Remote Direct Memory Access) [5]. RDMA is a technology that sends and receives data directly to and from memory between
nodes, bypassing the CPU, which can significantly reduce CPU usage and data transfer
latency. Typically, RDMA applications are developed with a programming API called Verbs [6]. However, Verbs is complex and requires many low-level definitions when developing real-world
applications. Moreover, the Verbs manual is based on one-to-one communication, so adapting
RDMA to a distributed environment takes much trial and error. Therefore,
we propose a new RDMA library, called D-RDMALib, which generalizes the existing programming
process of RDMA to reduce the development complexity and can be easily adapted to
multi-node environments. Furthermore, to demonstrate the efficiency of D-RDMALib,
we implement the communication-intensive PageRank [7] using both D-RDMALib and MPI (Message Passing Interface) [8], and evaluate the processing performance.
D-RDMALib consists of the existing RDMA operation functions and a newly defined connection
information manager and RDMA manager for distributed cluster support. RDMA communication
is performed in four steps using the components of each manager: 1) network information
configuration, 2) network connection, 3) RDMA execution, and 4) RDMA completion. In
the network information configuration and network connection steps, we set up the
necessary network information and parameters for many-to-many, one-to-many, and many-to-one
communications. Based on the configured information, D-RDMALib then performs data
transmission, and once the transmission is complete, the configuration information
and resources are reset. In order to apply the existing Verbs API to a many-to-many
environment, it is not enough simply to replicate the one-to-one model to increase
the number of communication connections. Instead, additional tasks such as setting
up the connection information required for communication, and allocating/connecting
the buffers used for communication are required. At the same time, as the number of
physical nodes in the cluster increases and the number of communication connections
within the application increases, the configuration complexity increases significantly.
In this paper, we present the efficient and easy-to-use D-RDMALib, suitable for multi-node
distributed environments, which minimizes the unnecessary parts of the existing configuration process.
To demonstrate the efficiency of D-RDMALib, we conduct two main experiments. First,
for functional evaluation, we verify the normal operation of many-to-many communication
in distributed nodes. Second, for performance evaluation, we design and implement
D-RDMALib-based PageRank (DRDMA-PR) with master-worker structure and MPI-based PageRank
(MPI-PR), respectively, and compare the processing performance of the two models. The
functional evaluation shows that multi-node RDMA communication using D-RDMALib can
be performed normally, and messages up to 64MB in size are transmitted stably. The
performance evaluation also shows that DRDMA-PR has faster processing times compared
to MPI-PR regardless of the input graph size.
The contributions of this paper are as follows. First, we highlight the problems in
implementing RDMA communication in multi-node environments. Second, we propose D-RDMALib,
which generalizes the existing RDMA development process and applies it to many-to-many
communication environments. Third, for the functional and performance evaluations
of D-RDMALib, we choose a communication-intensive many-to-many environment and the
PageRank algorithm, and implement two PageRank models. Fourth, we conduct comparative
experiments based on the implemented applications and demonstrate the efficiency of
D-RDMALib. As a result, we believe that the proposed D-RDMALib is an efficient technique
that lowers the high barriers to entry of InfiniBand and RDMA technologies and that it
can be easily utilized in existing distributed processing environments as well as with
emerging ultra-high-speed in-network technologies such as Quantum InfiniBand [9,10].
The rest of the paper is organized as follows. Section 2 describes related work. Section
3 introduces the overall operations and detailed architecture of D-RDMALib. Section
4 presents the D-RDMALib-based PageRank algorithm as an application of the proposed
library. Section 5 verifies the functionality and efficiency of D-RDMALib through
extensive experiments. Finally, Section 6 summarizes and concludes the paper.
II. RELATED WORK
InfiniBand [1] is a communication network standard that employs high-performance DPUs to ensure
high bandwidth and low latency. With rapid advances in semiconductor technology [10,11], the latest InfiniBand switches support up to 1.2 Tbps of bandwidth at NDR (Next Data
Rate). RDMA [5] is a communication protocol that maximizes the power of InfiniBand by transferring
data directly to remote host memory, bypassing the CPU. These characteristics of RDMA
can reduce latency, CPU consumption, and memory bandwidth bottlenecks, and enable
high-speed transmission by sending and receiving data without CPU intervention [12]. As a result, RDMA-based performance enhancements have recently been attempted in
various fields [12-16].
The Verbs API provided by the OpenFabrics Enterprise Distribution (OFED) is required to
implement these InfiniBand applications [5,6]. OFED is open-source middleware for using RDMA, which performs InfiniBand-based
RDMA operations via Verbs functions. The RDMA operations are provided by the MLNX_OFED
[17] driver. Currently, the following operations are available for RDMA communications:
SEND, SEND w/Imm (with immediate data), RECEIVE, WRITE, WRITE w/Imm, and Atomic; all of
these except Atomic are supported in this study.
For RDMA communication, both the sender and receiver must specify the memory regions
(MRs) to be used in advance. RDMA communication is performed via the InfiniBand HCA (Host
Channel Adapter), and each node has a queue pair (QP) for sending and receiving, like a socket. Sending and receiving
tasks are managed through work requests (WRs), and completed WRs can be confirmed
through completion queue entries (CQEs). Based on these properties, the process of RDMA
communication works as follows (a minimal code sketch follows the list):
1) Create an InfiniBand context: connect to the HCA through the driver and create the
context required for communication.
2) Create a protection domain: allocate the domain that groups the resources used for
RDMA communication, such as MRs and QPs.
3) Create a completion queue (CQ): create the queue used to confirm the completion of communication tasks.
4) Create a QP: create the QP that actually performs message sending and receiving.
5) Establish a connection with exchanged QP identifier information: determine whom
to communicate with through the QP's identifier.
6) Change the QP state: change the QP from the initial RESET state to the communicable
states RTR (Ready to Receive) and RTS (Ready to Send).
7) MR registration: after the initialization process up to step (6), allocate the
memory to be used for actual communication through memory region registration.
8) Exchange MR information and send/receive data: send and receive data based
on the registered MRs and QPs through the RDMA communication operations.
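To make this burden concrete, the following minimal C++ sketch walks through steps (1)-(8) using the standard Verbs API. The device index, queue depths, buffer size, and access flags are illustrative assumptions, all error handling is omitted, and the out-of-band exchanges of steps (5) and (8) are only indicated by comments.

    #include <cstdint>
    #include <infiniband/verbs.h>

    // Minimal setup for ONE one-to-one connection; all error checks omitted.
    void setup_and_send_once() {
        // (1) InfiniBand context: open the first available HCA (device choice assumed).
        ibv_device **devs = ibv_get_device_list(nullptr);
        ibv_context *ctx = ibv_open_device(devs[0]);

        // (2) Protection domain grouping the RDMA resources (MRs, QPs).
        ibv_pd *pd = ibv_alloc_pd(ctx);

        // (3) Completion queue; a depth of 16 is an arbitrary example value.
        ibv_cq *cq = ibv_create_cq(ctx, 16, nullptr, nullptr, 0);

        // (4) Reliable-connection QP sharing the CQ for sends and receives.
        ibv_qp_init_attr init_attr = {};
        init_attr.send_cq = cq;
        init_attr.recv_cq = cq;
        init_attr.qp_type = IBV_QPT_RC;
        init_attr.cap.max_send_wr = init_attr.cap.max_recv_wr = 16;
        init_attr.cap.max_send_sge = init_attr.cap.max_recv_sge = 1;
        ibv_qp *qp = ibv_create_qp(pd, &init_attr);

        // (5) Exchange QP number, LID, and PSN with the peer out of band (e.g., sockets).

        // (6) Drive the QP through INIT -> RTR -> RTS; only INIT is shown here.
        ibv_qp_attr attr = {};
        attr.qp_state = IBV_QPS_INIT;
        attr.pkey_index = 0;
        attr.port_num = 1;
        attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ;
        ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                                 IBV_QP_PORT | IBV_QP_ACCESS_FLAGS);
        // ... analogous ibv_modify_qp() calls set RTR (remote QPN/LID) and then RTS ...

        // (7) Register the memory region actually used for communication.
        char *buf = new char[4096];
        ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
                                IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);

        // (8) After exchanging MR information, post a work request to send data.
        ibv_sge sge = {};
        sge.addr = reinterpret_cast<uintptr_t>(buf);
        sge.length = 4096;
        sge.lkey = mr->lkey;
        ibv_send_wr wr = {}, *bad = nullptr;
        wr.opcode = IBV_WR_SEND;   // or IBV_WR_RDMA_WRITE(_WITH_IMM), etc.
        wr.sg_list = &sge;
        wr.num_sge = 1;
        wr.send_flags = IBV_SEND_SIGNALED;
        ibv_post_send(qp, &wr, &bad);
    }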
As shown in the above steps, implementing RDMA applications using the Verbs API is
highly inefficient due to the complex initialization steps (1-7) required for communication.
To address these shortcomings of the existing API, research on lightweight RDMA
programming has been conducted. Infinity [18] is a lightweight C++ RDMA library for InfiniBand. It supports existing Verbs API-based
RDMA implementations and aims to make it easier to develop RDMA applications without
sacrificing performance. T. Kim et al. [19] attempted to reduce the difficulty of implementing RDMA by designing the configuration
details to resemble socket programming. Similar to our work, these previous studies have focused on reducing
the difficulty of implementing RDMA communication, but all of them support only one-to-one
communication, which is still a major inconvenience for multi-node environments. Therefore,
in this paper, we propose D-RDMALib, which can be easily utilized not only in one-to-one
communication but also in many-to-one, one-to-many, and many-to-many communication
environments.
To evaluate the performance of the proposed D-RDMALib, we design and implement the
PageRank algorithm. We also use MPI as a comparison model to see the effect of real-world
distributed RDMA communication performance. MPI [8] is a widely used communication interface for parallel computing that allows distributed
computers or processors to send and receive messages and distribute and process tasks.
This structure allows for high performance by utilizing multiple nodes or processors
simultaneously. Various MPI implementations exist; in this paper, we use Open MPI (see Table 1) to implement MPI-based PageRank.
III. DISTRIBUTED RDMA LIBRARY
1. Overall Architecture
This section describes D-RDMALib, which enables many-to-many communication, and the communication
modules based on it. Fig. 1 shows the operating process of the RDMA communication modules using D-RDMALib. As
discussed in the introduction, using traditional RDMA communication mechanisms can
greatly increase development complexity in distributed clusters consisting of many
nodes. In particular, as the number of nodes increases and the number of applications
requiring RDMA communication within the nodes increases, the traditional development
approach of setting up connection pairs at the code level becomes very inefficient.
Therefore, in this study, we simplify this initialization process by organizing it into
four steps: 1) network information configuration, 2) network connection, 3) RDMA execution,
and 4) RDMA completion.
We briefly describe the operation of each step as follows. First, in the network information
configuration step, we set variables such as the number of nodes, server IP addresses,
port numbers, and send/receive buffer addresses required for RDMA communication. Second,
in the network connection step, D-RDMALib creates and registers the information and
buffers related to data transmission and reception, such as QPs and CQs. For this
purpose, the variables specified in the network information configuration step are
registered, and the nodes are connected through socket communication. Once the connection
is established, the queues used for RDMA communication are created, and the communication
buffers are registered as MRs. Each node uses socket communication to exchange information about the queues
used to send and receive data, and to change the state of the queues. Third, the RDMA
execution step performs one-to-many, many-to-one, and many-to-many communications
based on the information set in the previous step and the Verbs API. In this case,
D-RDMALib supports SEND, SEND w/Imm, RECEIVE, WRITE, and WRITE w/Imm operations. Once
all data communication is completed, it proceeds to the fourth step, RDMA completion.
In the RDMA completion step, all the various queues created for communication are
deleted, and the areas responsible for managing the state and registration of the
queues are also reset. After this process is complete, all communication processes
are terminated.
Fig. 1. Operating process of D-RDMALib and its managers.
2. Main Functions of D-RDMALib
This section describes the main functions used in each step of D-RDMALib in turn.
All the functions appear in Fig. 1 with their corresponding steps.
First, in the network information configuration step, the initialize_rdma_connection()
function registers the user-declared variables and connects the nodes via socket communication.
This function takes the current node's IP, the IPs of the nodes to use, the number
of nodes to use, the port number, and the send_buffer and recv_buffer parameters, and
connects the server on each node to the clients on the other nodes.
Second, in the network connection step, create_rdma_info() creates the RDMA communication
information. It generates the information needed for RDMA communication, such as contexts,
QPs, and CQs, and registers the memory regions of the buffers to be used. Using this
function, we can create and register at once the complex information that previously
had to be defined step by step. Since the state of the queues is later modified using
the information generated at each node, the information is stored in per-node vectors
so that it can be shared among the nodes.
Also in the network connection step, send_info_change_qp() exchanges the RDMA communication
information stored in the vectors between the nodes and changes the state of the QPs.
The information exchange is implemented using socket communication. Information exchange
and QP state changes occur based on the send_buffer and recv_buffer values, respectively,
because after the buffer values are exchanged, the QP of each buffer must change its
state to match the buffer. A QP has three relevant states: INIT, the initialization state;
RTR, in which it can receive; and RTS, in which it can send. The send_info_change_qp()
function receives the information sent by each node and then uses it to change the QP
of send_buffer to RTS, the state in which data can be sent, and the QP of recv_buffer
to RTR, the state in which data can be received.
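Putting these first two steps together, the setup phase of a D-RDMALib application looks roughly like the following sketch. The call sequence follows the description above, but the exact argument types and order are simplified assumptions rather than the library's verbatim signatures (the GitHub repository cited at the end of this section is authoritative).

    #include <string>
    #include <vector>
    #include "d_rdma_lib.h"   // assumed header name for D-RDMALib

    int main() {
        // Assumed example values for a four-node cluster.
        std::string my_ip = "10.0.0.1";
        std::vector<std::string> node_ips = {"10.0.0.1", "10.0.0.2",
                                             "10.0.0.3", "10.0.0.4"};
        int port = 5555;
        const size_t BUF_SIZE = 64 * 1024 * 1024;   // up to 64 MB, as in Experiment 1
        char *send_buffer = new char[BUF_SIZE];
        char *recv_buffer = new char[BUF_SIZE];

        // Step 1 (network information configuration): register the variables and
        // connect the server on each node to the clients on the other nodes via sockets.
        initialize_rdma_connection(my_ip, node_ips, node_ips.size(), port,
                                   send_buffer, recv_buffer);

        // Step 2 (network connection): create contexts/QPs/CQs and register MRs at
        // once, then exchange the per-node QP information and switch the send-side
        // QPs to RTS and the receive-side QPs to RTR.
        create_rdma_info();
        send_info_change_qp();
        return 0;
    }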
Third, rdma_comm() handles the actual many-to-many data transmission using RDMA
in the execution step and defines the detailed operations. SEND, SEND w/Imm,
and WRITE w/Imm send data from the local node to the remote node, and they are
designed to transmit both the data and a 32-bit integer value used to acknowledge
receipt. The remote node can detect that reception is complete by polling for the
integer value accumulated in the CQ. WRITE, unlike the other operations, gives the
local node no way of determining whether the data has been successfully received
after it is sent. Therefore, we design the module to transmit a ``send'' signal over
socket communication when performing a WRITE so that successful reception can be
confirmed. The rdma_comm() function takes as arguments the operation to be used and
the data to be sent. The operation is specified as a string: ``send'',
``send_with_imm'', ``write'', or ``write_with_imm''. The data to be transmitted can be
specified as either a string or a buffer of doubles. Once the operation is determined
by the arguments, data is sent and received concurrently using threads, and the
received data is stored in recv_buffer when reception completes.
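As a usage illustration, a many-to-many exchange then reduces to a single call whose operation is selected by the string argument; the payload below is a placeholder:

    // Step 3 (RDMA execution): many-to-many exchange with the SEND operation.
    rdma_comm("send", "hello from this node");
    // A double buffer can be passed instead of a string, e.g., for numeric data;
    // on completion, the data received from the other nodes is stored in recv_buffer.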
Fourth, rdma_one_to_many_send_msg() and rdma_one_to_many_recv_msg() are the functions
used for one-to-many communication in the RDMA execution step. In one-to-many
communication, the node whose IP is the server IP defined in the network information
configuration step sends messages, and the nodes with non-server IPs receive them.
That is, the node corresponding to the server IP passes the operation to be used and
the message to be sent as arguments to rdma_one_to_many_send_msg(), while non-server
nodes pass only the operation to be used to rdma_one_to_many_recv_msg().
Fifth, rdma_many_to_one_send_msg() and rdma_many_to_one_recv_msg() are the functions
used for many-to-one communication in the RDMA execution step. In many-to-one
communication, the non-server nodes send messages, and the server node receives the
messages from each node. That is, the server node calls rdma_many_to_one_recv_msg(),
passing the operation to be used as an argument, and the non-server nodes call
rdma_many_to_one_send_msg(), passing the operation and the message
to be sent as arguments.
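The role split of the two patterns can be summarized in the following sketch, where is_server is an assumed flag indicating whether the node's IP matches the configured server IP and message is a placeholder payload:

    #include <string>

    // Hedged sketch; is_server and message are assumed to come from the application.
    void broadcast_and_gather(bool is_server, const std::string &message) {
        // One-to-many: the server node broadcasts, the other nodes receive.
        if (is_server) rdma_one_to_many_send_msg("send", message);
        else           rdma_one_to_many_recv_msg("send");

        // Many-to-one: the other nodes send, the server node gathers.
        if (is_server) rdma_many_to_one_recv_msg("send");
        else           rdma_many_to_one_send_msg("send", message);
    }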
Finally, the exit_rdma() function provides the ability to clean up the information
generated to end communication in the RDMA completion step. This means deleting any
queues used for communication and deallocating any memory regions created for queue
management. Therefore, this function is called after all RDMA communication has ended
and before program termination. The implemented D-RDMALib is open-source and publicly
available at https://github.com/laewonJeong/D-RDMALib. We use D-RDMALib
to implement and experiment with PageRank in Section 4.
IV. D-RDMALIB-BASED PAGERANK
To demonstrate the effectiveness of D-RDMALib, we present both a simple data sending
and receiving application and a PageRank-based distributed application with heavy
inter-node communication. This section describes the overall architecture of DRDMA-PR,
which distributes PageRank in a master-worker structure. Fig. 2 shows the overall architecture of DRDMA-PR. The operating process of DRDMA-PR consists
of three steps:
(1) graph creation, (2) network configuration, and (3) PageRank calculation. The graph
creation step reads the graph data file and stores the graph in vectors, and each
worker node partitions the generated vertices. In this paper, we use two vertex
partitioning methods to analyze how performance varies with per-node throughput. The
first method simply partitions the vertices by the number of worker nodes. It gives
each worker node a nearly equal number of vertices for which to compute PageRank, but
the amount of computation can still vary with the number of outgoing edges of each
vertex. This can lead to throughput imbalances in which the other nodes wait for one
node's computation to finish. Therefore, as a second method, we employ a partitioning
scheme that counts the number of outgoing edges of each vertex and evenly distributes
both the vertices and the outgoing edges to be processed among the nodes.
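The sketch below is our illustrative reading of this second method, assuming a greedy least-loaded assignment that weights each vertex by one plus its outgoing-edge count; the actual partitioning code may differ in detail.

    #include <cstddef>
    #include <vector>

    // Assign each vertex to the currently least-loaded worker, where the load of a
    // worker counts both its assigned vertices and their outgoing edges.
    std::vector<int> partition_by_out_degree(const std::vector<size_t> &out_degree,
                                             int num_workers) {
        std::vector<int> owner(out_degree.size());
        std::vector<size_t> load(num_workers, 0);
        for (size_t v = 0; v < out_degree.size(); ++v) {
            int best = 0;
            for (int w = 1; w < num_workers; ++w)
                if (load[w] < load[best]) best = w;
            owner[v] = best;
            load[best] += 1 + out_degree[v];  // balance vertices and edges together
        }
        return owner;
    }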
The network configuration step connects all the nodes via socket communication using
the initialize_rdma_connection() function of D-RDMALib. We then use the
create_rdma_info() function to generate the information needed for RDMA communication.
Finally, we use the send_info_change_qp() function to share the generated information
among the nodes and change the states of the QPs.
The roles of the master node and the worker nodes differ in the PageRank calculation
step. Worker nodes compute PageRank for each vertex they own and store the results in
a double-type buffer. Each worker then uses the many-to-one communication function
rdma_many_to_one_send_msg() to send its PageRank buffer to the master node. The master
node receives the PageRank values sent by all worker nodes using the many-to-one
communication function rdma_many_to_one_recv_msg(). After receiving all the PageRank
values from the workers, the master merges them into a single double-type buffer and
sends the result back to the workers using the one-to-many communication function
rdma_one_to_many_send_msg(), and the workers receive it using the one-to-many
communication function rdma_one_to_many_recv_msg(). We repeat this process until the
difference between the updated and previous PageRank values is
less than a threshold (0.00001 in our experiments).
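Concretely, each worker applies the standard PageRank update from [7], PR(v) = (1-d)/N + d * Σ_{u ∈ In(v)} PR(u)/outdeg(u), where N is the number of vertices, In(v) is the set of in-neighbors of v, and d is the damping factor. The loop below is a hedged sketch of the worker side of one DRDMA-PR round; the damping factor, the local state, and the helpers incoming_mass() and max_diff() are assumptions, not D-RDMALib functions.

    #include <cstddef>
    #include <vector>

    // Hedged worker-side sketch; incoming_mass() and max_diff() are assumed helpers,
    // and pr_buffer/recv_buffer are the D-RDMALib buffers registered earlier.
    void worker_pagerank(const std::vector<size_t> &my_vertices,
                         double *pr_buffer, double *recv_buffer,
                         double d, size_t N) {
        const double THRESHOLD = 0.00001;  // convergence threshold in our experiments
        do {
            // Compute PageRank for the vertices this worker owns.
            for (size_t v : my_vertices)
                pr_buffer[v] = (1.0 - d) / N + d * incoming_mass(v);
            rdma_many_to_one_send_msg("send", pr_buffer);  // workers -> master
            rdma_one_to_many_recv_msg("send");             // master -> workers
        } while (max_diff(recv_buffer, pr_buffer) >= THRESHOLD);
    }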
In this paper, we also implement MPI-based PageRank (MPI-PR) as a comparison model to
evaluate the performance of DRDMA-PR. Like DRDMA-PR, MPI-PR first reads the input file
to generate a graph and partitions the vertices across the nodes. Each node then
computes the PageRank of its assigned vertices and gathers the results into a buffer
using MPI_Allgatherv(), a collective communication function of MPI. It repeats this
process and ends the calculation when the difference between the updated and previous
PageRank values is less than the threshold. However, MPI applications are generally
structured so that all the nodes in the cluster work in parallel as peers. Because we
implement MPI-PR using the collective communication function, in this parallel
structure one worker also plays the role of the master. This node bears the overhead
of handling both its assigned PageRank computation and resource/communication
management, which ultimately degrades overall processing performance.
In the proposed DRDMA-PR, the worker nodes are responsible for the actual computation,
and the master is responsible for aggregating and redistributing the results.
This structure eliminates the need to send and receive data between worker nodes,
greatly reducing the amount of communication. Also, although the master node
communicates with all worker nodes, it performs only simple merge-and-distribute
operations, so it does not strain overall performance. If MPI-PR were organized in
such a master-worker structure, the inter-node communication would have to be
implemented with additional functions beyond the collective communication functions;
that is, a master-worker structure could incur more overhead than the parallel
structure. Therefore, in this paper, we do not adopt a master-worker structure for
MPI-PR but construct the MPI-PR comparison model as a parallel structure.
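For reference, the core communication step of MPI-PR can be sketched as follows; apart from MPI_Allgatherv() itself, the surrounding names are assumptions, and the counts and displacements are derived from the vertex partitioning.

    #include <mpi.h>
    #include <vector>

    // Every rank contributes the PageRank values of its own vertices and receives
    // the full, globally ordered vector back; no dedicated master rank is involved.
    void allgather_pagerank(const std::vector<double> &local_pr,
                            std::vector<double> &global_pr,
                            const std::vector<int> &counts,
                            const std::vector<int> &displs) {
        MPI_Allgatherv(local_pr.data(), static_cast<int>(local_pr.size()), MPI_DOUBLE,
                       global_pr.data(), counts.data(), displs.data(), MPI_DOUBLE,
                       MPI_COMM_WORLD);
    }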
Fig. 2. DRDMA-PR overall architecture.
V. PERFORMANCE EVALUATION
In this section, we perform two experimental evaluations with D-RDMALib-based applications.
In Experiment 1, we validate the communication performance of D-RDMALib by implementing
many-to-many, one-to-many, and many-to-one communication applications. In Experiment
2, we implement PageRank using D-RDMALib and MPI to evaluate the processing performance
using real graph data and compare the results. For the experiments, we use nodes with
an Intel Xeon 2.4 GHz 10-core CPU, 128 GB RAM, and a Mellanox network switch as the
hardware, as shown in Table 1; the software consists of the MLNX_OFED_LINUX-4.9-6.0.6.0
driver and Open MPI 4.0.3 on Ubuntu 20.04. The PageRank experiments use three input
datasets [21], as shown in Table 2.
1. Experiment 1: Functional Evaluation
In Experiment 1, to validate the communication performance, we perform a scenario
where we send and receive messages of sizes 1MB, 16MB, and 64MB on four nodes. In
this case, we have configured many-to-many, one-to-many, and many-to-one communications,
and we use RDMA's SEND operation. In Experiment 1, for many-to-many communication,
each node acts as both a server and a client because all nodes send and receive together.
It also operates in a mesh topology where all nodes are fully connected to each other.
Fig. 3 and 4 show the results of each communication topology configured based on D-RDMALib.
Fig. 3 shows the results of many-to-many communication, and Fig. 4 shows the results of one-to-many and many-to-one communication. As shown in the figures,
we can observe that all nodes successfully send and receive messages in the
many-to-many, one-to-many, and many-to-one communications. In many-to-many
communication, all nodes exchange data of various sizes (1MB, 16MB, and 64MB) with
each other through the SEND operation. In this case, all nodes can send data without any problem as the
message size increases. In the case of one-to-many communication, when node 1 sends
data, the rest of the nodes receive it normally. Also, in the case of many-to-one
communication, nodes 2-4 send data, and node 1 receives all the data normally.
Fig. 3. Result of RDMA many-to-many communication.
Fig. 4. Result of RDMA one-to-many and many-to-one communications.
2. Experiment 2: Performance Evaluation
In Experiment 2, we compare the data processing performance of D-RDMALib-based PageRank
(DRDMA-PR) and MPI-based PageRank (MPI-PR). DRDMA-PR uses five nodes (one master and
four workers), MPI-PR uses four nodes, and Single-PR, a single-node baseline, uses one node.
Fig. 5 shows a comparison of the execution times of Single-PR and DRDMA-PR. We can see
that DRDMA-PR is up to 2.71 times faster than Single-PR across all datasets. This means
that distributed processing under DRDMA-PR is much more effective. In the figure, the
gain for soc-LiveJournal1 is smaller than for the other datasets because this
experiment simply partitions the total number of vertices by the number of worker
nodes, without considering the outgoing edges of the vertices, when partitioning
the graph.
Fig. 6 shows a comparison of the MPI-PR and DRDMA-PR execution times. We can observe
that DRDMA-PR is up to 1.29 times faster than MPI-PR across all datasets. This is
because, unlike in DRDMA-PR, one node of the MPI cluster, which is a parallel
structure, bears the overhead of resource and communication management, which degrades
the total performance. The overall comparison of the execution times of Single-PR,
MPI-PR, and DRDMA-PR is shown in Table 3.
Fig. 7 shows the average time per step for each node when computing PageRank for soc-LiveJournal1.
From the figure, we can see that the execution times of nodes 2-4 are similar for both
MPI-PR and DRDMA-PR. However, node 1 takes much longer with MPI-PR compared to DRDMA-PR.
These results demonstrate the overhead issue arising from a specific worker in the
MPI cluster performing the role of the master concurrently. Also, the difference in
the execution time of each node is due to the different number of outgoing edges of
the vertices partitioned by each node, which results in different amounts of PageRank
computations. In other words, when the total number of vertices is simply partitioned
and allocated according to the number of nodes, nodes 1 and 2 have more outgoing edges
per vertex than nodes 3 and 4.
When allocating vertices to nodes, we count the number of outgoing edges of each vertex
to balance the numbers of vertices and outgoing edges to be computed. Fig. 8(a) and
Table 4 show the MPI-PR and DRDMA-PR execution times for soc-LiveJournal1 with this
improved vertex partitioning method. The improved DRDMA-PR achieves performance
improvements of up to 3.04 times over Single-PR, up to 1.54 times over DRDMA-PR with
simple allocation, and 1.33 times over the improved MPI-PR. This shows that the
improved vertex partitioning method improves the overall performance of the
distributed PageRank computation. Fig. 8(b) and Table 5 show the average execution
time per node after applying the improved vertex partitioning method. Compared with
Fig. 7, we can observe that DRDMA-PR has nearly identical average execution times
across all nodes, while MPI-PR still exhibits overhead on node 1. Consequently, we
find that DRDMA-PR performs distributed parallel processing efficiently using the
proposed D-RDMALib and is more efficient than MPI-PR regardless of the vertex
partitioning method.
Fig. 5. Single-PR vs. DRDMA-PR execution time comparison.
Fig. 6. MPI-PR vs. DRDMA-PR execution time comparison.
Fig. 7. Average execution time per step for each node in MPI-PR and DRDMA-PR.
Fig. 8. MPI-PR and DRDMA-PR with the improved vertex partitioning: (a) average execution time; (b) average execution time per step for each node.
Table 1. Experimental environment
| Category | Specification |
| Hardware | Node: Intel Xeon 2.4 GHz 10-core, 128 GB RAM; HCA: Mellanox MT4099; InfiniBand: Mellanox ConnectX®-3 FDR 56 Gbps |
| Software | OS: Ubuntu 20.04; InfiniBand driver: MLNX_OFED_LINUX-4.9-6.0.6.0; MPI: Open MPI 4.0.3 |
Table 2. Experimental datasets
| Name | # of vertices | # of edges |
| twitter_combined | 81,306 | 1,768,149 |
| wiki-Talk | 2,394,385 | 5,021,410 |
| soc-LiveJournal1 | 4,847,571 | 68,993,773 |
Table 3. Execution times of Single-PR, MPI-PR, and DRDMA-PR
| Dataset | Single-PR (sec) | MPI-PR (sec) | DRDMA-PR (sec) |
| twitter_combined | 1.297 | 0.552 | 0.478 |
| wiki-Talk | 14.194 | 6.665 | 5.804 |
| soc-LiveJournal1 | 121.729 | 79.657 | 61.837 |
Table 4. Total execution times of MPI-PR and DRDMA-PR by partitioning technique
| Partitioning | MPI-PR (sec) | DRDMA-PR (sec) |
| Simple partitioning | 79.657 | 61.837 |
| Improved partitioning | 53.253 | 40.069 |
Table 5. Average execution time per step for each node in MPI-PR and DRDMA-PR by partitioning technique
| Partitioning | Model | Node 1 (sec) | Node 2 (sec) | Node 3 (sec) | Node 4 (sec) |
| Simple partitioning | MPI-PR | 1.228 | 0.473 | 0.302 | 0.174 |
| Simple partitioning | DRDMA-PR | 0.892 | 0.47 | 0.299 | 0.17 |
| Improved partitioning | MPI-PR | 0.812 | 0.456 | 0.447 | 0.454 |
| Improved partitioning | DRDMA-PR | 0.443 | 0.448 | 0.469 | 0.46 |
VI. CONCLUSIONS
In this paper, we proposed D-RDMALib, which can be utilized to improve the performance
of high-performance distributed cluster environments using InfiniBand. RDMA is a network
communication technology optimized for high-speed communication, but it is difficult
to apply to cluster environments consisting of multiple nodes due to its high
development complexity and its one-to-one communication-based programming
model. To solve these problems, we proposed D-RDMALib, an efficient and easy-to-use
library that generalizes the existing programming model and can be easily adapted
to multi-node environments. Furthermore, we designed and implemented the PageRank
algorithm as an application for the performance evaluation of the proposed D-RDMALib.
In Experiment 1, we confirmed that the D-RDMALib-based communication modules send and
receive data normally in a multi-node environment for all of the many-to-many,
one-to-many, and many-to-one configurations. In Experiment 2, through
the performance evaluation comparing the graph processing speeds of DRDMA-PR and MPI-PR,
we showed that D-RDMALib-based PageRank is faster overall. Based on these results,
we believe that D-RDMALib is a novel programming library that reduces the complexity
of developing RDMA applications and can be easily adapted to distributed processing
environments with multiple nodes. Also, D-RDMALib is an efficient technique that can
be readily applied to existing distributed processing environments as well as emerging
ultra-fast in-network technologies such as Quantum InfiniBand.
ACKNOWLEDGMENTS
This work was partly supported by Institute of Information & communications Technology
Planning & evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-00859,
Development of A Distributed Graph DBMS for Intelligent Processing of Big Graphs)
and National Research Foundation of Korea (NRF) grant funded by the Korea government
(MSIT) (No. NRF-2022R1A2C1003067).
References
[1] P. Grun, "Introduction to InfiniBand for End Users," InfiniBand Trade Association, 2010.
[2] NVIDIA BlueField DPU, https://www.nvidia.com/en-us/networking/products/data-processing-unit/.
[3] I. Burstein, "NVIDIA Data Center Processing Unit (DPU) Architecture," In Proc. of the IEEE Hot Chips 33 Symp., Palo Alto, CA, pp. 1-20, Aug. 2021.
[4] G. Shainer, R. Graham, C. J. Newburn, O. Hernandez, G. Bloch, T. Gibbs, and J. C. Wells, "NVIDIA's Cloud Native Supercomputing," In Proc. of the Smoky Mountains Computational Sciences and Engineering Conf., Virtual Event, pp. 340-357, Mar. 2022.
[5] Mellanox Technologies, RDMA Aware Networks Programming User Manual, Rev 1.7, 2015.
[6] P. MacArthur, Q. Liu, R. D. Russell, F. Mizero, M. Veeraraghavan, and J. M. Dennis, "An Integrated Tutorial on InfiniBand, Verbs, and MPI," IEEE Communications Surveys & Tutorials, Vol. 19, No. 4, pp. 2894-2926, Aug. 2017.
[7] S. Brin and L. Page, "The Anatomy of a Large-scale Hypertextual Web Search Engine," Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pp. 107-117, Apr. 1998.
[8] MPI Forum, https://www.mpi-forum.org.
[9] NVIDIA Quantum InfiniBand, https://www.nvidia.com/en-us/networking/quantum2/.
[10] Y. Shpigelman, G. Shainer, R. Graham, Y. Qin, G. Cisneros-Stoianowski, and C. Stunkel, "NVIDIA's Quantum InfiniBand Network Congestion Control Technology and its Impact on Application Performance," In Proc. of the ISC High Performance, Hamburg, Germany, pp. 26-43, May 2022.
[11] S. J. Mazivila, J. X. Soares, and J. L. Santos, "A Tutorial on Multi-way Data Processing of Excitation-emission Fluorescence Matrices Acquired from Semiconductor Quantum Dots Sensing Platforms," Analytica Chimica Acta, Vol. 1211, ID 339216, June 2022.
[12] S. Yang, S. Son, M.-J. Choi, and Y.-S. Moon, "Performance Improvement of Apache Storm using InfiniBand RDMA," The Journal of Supercomputing, Vol. 75, No. 10, pp. 6804-6830, June 2019.
[13] E. Zamanian, X. Yu, M. Stonebraker, and T. Kraska, "Rethinking Database High Availability with RDMA Networks," Proc. of the VLDB Endowment, Vol. 12, No. 11, pp. 1637-1650, July 2019.
[14] C. Link, J. Sarran, G. Grigoryan, M. Kwon, M. M. Rafique, and W. R. Carithers, "Container Orchestration by Kubernetes for RDMA Networking," In Proc. of the IEEE 27th Int'l Conf. on Network Protocols, Chicago, IL, pp. 1-2, Oct. 2019.
[15] J. Wang, et al., "Fargraph+: Excavating the Parallelism of Graph Processing Workload on RDMA-based Far Memory System," Journal of Parallel and Distributed Computing, Vol. 177, pp. 144-159, July 2023.
[16] Z. Yao, R. Chen, B. Zang, and H. Chen, "Wukong+G: Fast and Concurrent RDF Query Processing using RDMA-assisted GPU Graph Exploration," IEEE Trans. on Parallel and Distributed Systems, Vol. 33, No. 7, pp. 1619-1635, July 2022.
[17] MLNX_OFED, https://developer.nvidia.com/networking/infiniband-software/.
[18] Infinity, https://github.com/claudebarthels/infinity.
[19] T. Kim, B. Kim, and H. Jung, "Design and Implementation of Easily Applicable and Lightweight RDMA API for InfiniBand Cluster," In Proc. of the Symp. of KICS, Jeju, Korea, pp. 1483-1484, June 2015.
[20] OpenMP, https://www.openmp.org/.
[21] Stanford Large Network Dataset Collection, https://snap.stanford.edu/data/.
Laewon Jeong received his B.S. degree (2022) in Computer Science from Kangwon National
University, where he is currently a master's student. His research interests include
distributed and parallel processing, high-speed data processing, and container systems.
Myeong-Seon Gil received her B.S. (2007), M.S. (2009), and Ph.D. (2021) degrees in
Computer Science from Kangwon National University, where she is currently a
postdoctoral researcher. From 2010 to 2012, she worked at the Central Information and
Computing Center of Kangwon National University, where she participated in developing
mobile applications and university portal systems. Her research interests include data
mining, big data analysis, distributed and parallel processing, and high-speed stream
processing.
Yang-Sae Moon received his B.S. (1991), M.S. (1993), and Ph.D. (2001) degrees in
Computer Science from the Korea Advanced Institute of Science and Technology (KAIST).
From 1993 to 1997, he was a research engineer at Hyundai Syscomm, Inc., where he
participated in developing 2G and 3G mobile communication systems. From 2002 to 2005,
he was a technical director at Infravalley, Inc., where he participated in planning,
designing, and developing CDMA and W-CDMA mobile network services and systems. He is
currently a professor in the Dept. of Computer Science and the dean of the Graduate
School of Data Science at Kangwon National University. He was a visiting scholar at
Purdue University from 2008 to 2009. His research interests include data mining, big
data systems and analytics, storage systems, distributed and parallel processing, open
source systems, and high-speed stream processing.