I. INTRODUCTION
Distributed clusters are essential for running complex workloads such as big data
analytics and deep learning, which have recently been applied in various fields. In particular,
in HPC (High Performance Computing)-grade clusters, the network communication that
connects the nodes has a significant impact on overall performance. InfiniBand is
a prominent in-network device used for high-speed transmission in such high-performance
distributed processing environments [1]. It has gained increased attention recently, especially with the emergence of high-speed
DPU (Data Processing Unit) chips like NVIDIA BlueField [2-4].
To maximize the performance of InfiniBand, we need to utilize a recent communication
protocol called RDMA (Remote Direct Memory Access) [5]. RDMA is a technology that sends and receives data directly to and from memory between
nodes, bypassing the CPU, which can significantly reduce CPU usage and data transfer
latency. Typically, RDMA applications are developed with a programming API called Verbs [6]. However, Verbs is complex and requires many low-level definitions when developing real-world
applications. Moreover, the Verbs manual is based on one-to-one communication, so adapting
RDMA to a distributed environment takes much trial and error. Therefore,
we propose a new RDMA library, called D-RDMALib, which generalizes the existing programming
process of RDMA to reduce the development complexity and can be easily adapted to
multi-node environments. Furthermore, to demonstrate the efficiency of D-RDMALib,
we implement the communication-intensive PageRank [7] using both D-RDMALib and MPI (Message Passing Interface) [8], and evaluate the processing performance.
D-RDMALib consists of the existing RDMA operation functions and a newly defined connection
information manager and RDMA manager for distributed cluster support. RDMA communication
is performed in four steps using the components of each manager: 1) network information
configuration, 2) network connection, 3) RDMA execution, and 4) RDMA completion. In
the network information configuration and network connection steps, we set up the
necessary network information and parameters for many-to-many, one-to-many, and many-to-one
communications. Based on the configured information, D-RDMALib then performs data
transmission, and once the transmission is complete, the configuration information
and resources are reset. In order to apply the existing Verbs API to a many-to-many
environment, it is not enough simply to replicate the one-to-one model to increase
the number of communication connections. Instead, additional tasks such as setting
up the connection information required for communication, and allocating/connecting
the buffers used for communication are required. At the same time, as the number of
physical nodes in the cluster increases and the number of communication connections
within the application increases, the configuration complexity increases significantly.
In this paper, we present the efficient and easy-to-use D-RDMALib, suitable for multi-node
distributed environments, which minimizes the unnecessary parts of the existing configuration process.
To demonstrate the efficiency of D-RDMALib, we conduct two main experiments. First,
for functional evaluation, we verify the normal operation of many-to-many communication
in distributed nodes. Second, for performance evaluation, we design and implement
D-RDMALib-based PageRank (DRDMA-PR) with master-worker structure and MPI-based PageRank
(MPI-PR), respectively, and compare the processing performance of the two models. The
functional evaluation shows that multi-node RDMA communication using D-RDMALib can
be performed normally, and messages up to 64MB in size are transmitted stably. The
performance evaluation also shows that DRDMA-PR has faster processing times compared
to MPI-PR regardless of the input graph size.
The contributions of this paper are as follows. First, we highlight the problems in
implementing RDMA communication in multi-node environments. Second, we propose D-RDMALib,
which generalizes the existing RDMA development process and applies it to many-to-many
communication environments. Third, for the functional and performance evaluations
of D-RDMALib, we choose a communication-intensive many-to-many environment and the
PageRank algorithm, and implement two PageRank models. Fourth, we conduct comparative
experiments based on the implemented applications and demonstrate the efficiency of
D-RDMALib. As a result, we believe that the proposed D-RDMALib is an efficient technique
that lowers the high barriers to entry of InfiniBand and RDMA technologies and that it
can be easily utilized in existing distributed processing environments as well as with
emerging ultra-high-speed in-network technologies such as Quantum InfiniBand [9,10].
The rest of the paper is organized as follows. Section 2 describes related work. Section
3 introduces the overall operations and detailed architecture of D-RDMALib. Section
4 presents the D-RDMALib-based PageRank algorithm as an application of the proposed
library. Section 5 verifies the functionality and efficiency of D-RDMALib through
extensive experiments. Finally, Section 6 summarizes and concludes the paper.
II. RELATED WORK
InfiniBand [1] is a communication network standard that employs high-performance DPUs to ensure
high bandwidth and low latency. With rapid advances in semiconductor technology [10,11], the latest InfiniBand switches support up to 1.2 Tbps of bandwidth at NDR (Next Data
Rate). RDMA [5] is a communication protocol that maximizes the power of InfiniBand by transferring
data directly to remote host memory, bypassing the CPU. These characteristics of RDMA
can reduce latency, CPU consumption, and memory bandwidth bottlenecks, and enable
high-speed transmission by sending and receiving data without CPU intervention [12]. As a result, RDMA-based performance enhancements have recently been attempted in
various fields [12-16].
The Verbs API provided by the OpenFabrics Enterprise Distribution (OFED) is required to
implement these InfiniBand applications [5,6]. OFED is open-source middleware for using RDMA, which performs InfiniBand-based
RDMA operations via Verbs functions. The RDMA operations are provided by the MLNX_OFED
[17] driver. Currently, the following operations are available for RDMA communications:
SEND, SEND w/Imm (with immediate data), RECEIVE, WRITE, WRITE w/Imm, and Atomic; all of
these except Atomic are supported in this study.
For RDMA communication, both the sender and receiver must specify the memory regions
(MRs) to be used in advance. RDMA communication is performed via the InfiniBand HCA (Host
Channel Adapter), and each node has a queue pair (QP) for sending and receiving, like a socket. Sending and receiving
tasks are managed through work requests (WRs), and completed WRs can be confirmed
through completion queue entries (CQEs). Based on these properties, the process of RDMA
communication works as follows (a minimal code sketch follows the list):
1) Create an InfiniBand context: connect to the HCA through the driver and create the
context required for communication.
2) Create a protection domain: allocate the domain that groups the resources used for
RDMA communication, such as MRs and QPs.
3) Create a completion queue (CQ): create the queue used to confirm the completion of communication tasks.
4) Create a QP: create the QP that actually performs message sending and receiving.
5) Establish a connection with exchanged QP identifier information: determine whom
to communicate with through the QP's identifier.
6) Change the QP state: change the QP from the initial RESET state to the communicable
states RTR (Ready to Receive) and RTS (Ready to Send).
7) MR registration: after the initialization process up to step (6), allocate the
memory to be used for actual communication through memory region registration.
8) Exchange MR information and send/receive data: send and receive data based
on the registered MRs and QPs through the RDMA communication operations.
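To make this burden concrete, the following minimal C++ sketch walks through steps (1)-(8) using the standard Verbs API. The device index, queue depths, buffer size, and access flags are illustrative assumptions, all error handling is omitted, and the out-of-band exchanges of steps (5) and (8) are only indicated by comments.

    #include <cstdint>
    #include <infiniband/verbs.h>

    // Minimal setup for ONE one-to-one connection; all error checks omitted.
    void setup_and_send_once() {
        // (1) InfiniBand context: open the first available HCA (device choice assumed).
        ibv_device **devs = ibv_get_device_list(nullptr);
        ibv_context *ctx = ibv_open_device(devs[0]);

        // (2) Protection domain grouping the RDMA resources (MRs, QPs).
        ibv_pd *pd = ibv_alloc_pd(ctx);

        // (3) Completion queue; a depth of 16 is an arbitrary example value.
        ibv_cq *cq = ibv_create_cq(ctx, 16, nullptr, nullptr, 0);

        // (4) Reliable-connection QP sharing the CQ for sends and receives.
        ibv_qp_init_attr init_attr = {};
        init_attr.send_cq = cq;
        init_attr.recv_cq = cq;
        init_attr.qp_type = IBV_QPT_RC;
        init_attr.cap.max_send_wr = init_attr.cap.max_recv_wr = 16;
        init_attr.cap.max_send_sge = init_attr.cap.max_recv_sge = 1;
        ibv_qp *qp = ibv_create_qp(pd, &init_attr);

        // (5) Exchange QP number, LID, and PSN with the peer out of band (e.g., sockets).

        // (6) Drive the QP through INIT -> RTR -> RTS; only INIT is shown here.
        ibv_qp_attr attr = {};
        attr.qp_state = IBV_QPS_INIT;
        attr.pkey_index = 0;
        attr.port_num = 1;
        attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ;
        ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                                 IBV_QP_PORT | IBV_QP_ACCESS_FLAGS);
        // ... analogous ibv_modify_qp() calls set RTR (remote QPN/LID) and then RTS ...

        // (7) Register the memory region actually used for communication.
        char *buf = new char[4096];
        ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
                                IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);

        // (8) After exchanging MR information, post a work request to send data.
        ibv_sge sge = {};
        sge.addr = reinterpret_cast<uintptr_t>(buf);
        sge.length = 4096;
        sge.lkey = mr->lkey;
        ibv_send_wr wr = {}, *bad = nullptr;
        wr.opcode = IBV_WR_SEND;   // or IBV_WR_RDMA_WRITE(_WITH_IMM), etc.
        wr.sg_list = &sge;
        wr.num_sge = 1;
        wr.send_flags = IBV_SEND_SIGNALED;
        ibv_post_send(qp, &wr, &bad);
    }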
As shown in the above steps, implementing RDMA applications using the Verbs API is
highly inefficient due to the complex initialization steps (1-7) required for communication.
To address these shortcomings of the existing API, research on lightweight RDMA
programming has been conducted. Infinity [18] is a lightweight C++ RDMA library for InfiniBand. It supports existing Verbs API-based
RDMA implementations and aims to make it easier to develop RDMA applications without
sacrificing performance. T. Kim et al. [19] attempted to reduce the difficulty of implementing RDMA by designing the configuration
details to resemble socket programming. Similar to our work, these previous studies have focused on reducing
the difficulty of implementing RDMA communication, but all of them support only one-to-one
communication, which is still a major inconvenience for multi-node environments. Therefore,
in this paper, we propose D-RDMALib, which can be easily utilized not only in one-to-one
communication but also in many-to-one, one-to-many, and many-to-many communication
environments.
To evaluate the performance of the proposed D-RDMALib, we design and implement the
PageRank algorithm. We also use MPI as a comparison model to see the effect of real-world
distributed RDMA communication performance. MPI [8] is a widely used communication interface for parallel computing that allows distributed
computers or processors to send and receive messages and distribute and process tasks.
This structure allows for high performance by utilizing multiple nodes or processors
simultaneously. Various MPI implementations exist; in this paper, we use Open MPI (see Table 1) to implement MPI-based PageRank.
III. DISTRIBUTED RDMA LIBRARY
1. Overall Architecture
This section describes D-RDMALib, which enables many-to-many communication, and the communication
modules based on it. Fig. 1 shows the operating process of the RDMA communication modules using D-RDMALib. As
discussed in the introduction, using traditional RDMA communication mechanisms can
greatly increase development complexity in distributed clusters consisting of many
nodes. In particular, as the number of nodes increases and the number of applications
requiring RDMA communication within the nodes increases, the traditional development
approach of setting up connection pairs at the code level becomes very inefficient.
Therefore, in this study, we simplify this initialization process by organizing it into
four steps: 1) network information configuration, 2) network connection, 3) RDMA execution,
and 4) RDMA completion.
We briefly describe the operation of each step as follows. First, in the network information
configuration step, we set variables such as the number of nodes, server IP addresses,
port numbers, and send/receive buffer addresses required for RDMA communication. Second,
in the network connection step, D-RDMALib creates and registers the information and
buffers related to data transmission and reception, such as QPs and CQs. For this
purpose, the variables specified in the network information configuration step are
registered, and the nodes are connected through socket communication. Once the connection
is established, the queues used for RDMA communication are created, and the communication
buffers are registered as MRs. Each node uses socket communication to exchange information about the queues
used to send and receive data, and to change the state of the queues. Third, the RDMA
execution step performs one-to-many, many-to-one, and many-to-many communications
based on the information set in the previous step and the Verbs API. In this case,
D-RDMALib supports SEND, SEND w/Imm, RECEIVE, WRITE, and WRITE w/Imm operations. Once
all data communication is completed, it proceeds to the fourth step, RDMA completion.
In the RDMA completion step, all the various queues created for communication are
deleted, and the areas responsible for managing the state and registration of the
queues are also reset. After this process is complete, all communication processes
are terminated.
Fig. 1. Operating process of D-RDMALib and its managers.
2. Main Functions of D-RDMALib
This section describes the main functions used in each step of D-RDMALib in turn.
All the functions appear in Fig. 1 with their corresponding steps.
First, in the network information configuration step, the initialize_rdma_connection()
function registers the user-declared variables and connects the nodes via socket communication.
This function takes the current node's IP, the IPs of the nodes to use, the number
of nodes to use, the port number, and the send_buffer and recv_buffer parameters, and
connects the server on each node to the clients on the other nodes.
Second, in the network connection step, create_rdma_info() creates the RDMA communication
information. It generates the information needed for RDMA communication, such as contexts,
QPs, and CQs, and registers the memory regions of the buffers to be used. Using this
function, we can create and register at once the complex information that previously
had to be defined step by step. Since the state of the queues is later modified using
the information generated at each node, the information is stored in per-node vectors
so that it can be shared among the nodes.
Also in the network connection step, send_info_change_qp() exchanges the RDMA communication
information stored in the vectors between the nodes and changes the state of the QPs.
The information exchange is implemented using socket communication. Information exchange
and QP state changes occur based on the send_buffer and recv_buffer values, respectively,
because after the buffer values are exchanged, the QP of each buffer must change its
state to match the buffer. A QP has three relevant states: INIT, the initialization state;
RTR, in which it can receive; and RTS, in which it can send. The send_info_change_qp()
function receives the information sent by each node and then uses it to change the QP
of send_buffer to RTS, the state in which data can be sent, and the QP of recv_buffer
to RTR, the state in which data can be received.
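Putting these first two steps together, the setup phase of a D-RDMALib application looks roughly like the following sketch. The call sequence follows the description above, but the exact argument types and order are simplified assumptions rather than the library's verbatim signatures (the GitHub repository cited at the end of this section is authoritative).

    #include <string>
    #include <vector>
    #include "d_rdma_lib.h"   // assumed header name for D-RDMALib

    int main() {
        // Assumed example values for a four-node cluster.
        std::string my_ip = "10.0.0.1";
        std::vector<std::string> node_ips = {"10.0.0.1", "10.0.0.2",
                                             "10.0.0.3", "10.0.0.4"};
        int port = 5555;
        const size_t BUF_SIZE = 64 * 1024 * 1024;   // up to 64 MB, as in Experiment 1
        char *send_buffer = new char[BUF_SIZE];
        char *recv_buffer = new char[BUF_SIZE];

        // Step 1 (network information configuration): register the variables and
        // connect the server on each node to the clients on the other nodes via sockets.
        initialize_rdma_connection(my_ip, node_ips, node_ips.size(), port,
                                   send_buffer, recv_buffer);

        // Step 2 (network connection): create contexts/QPs/CQs and register MRs at
        // once, then exchange the per-node QP information and switch the send-side
        // QPs to RTS and the receive-side QPs to RTR.
        create_rdma_info();
        send_info_change_qp();
        return 0;
    }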
Third, rdma_comm() handles the actual many-to-many data transmission using RDMA
in the execution step and defines the detailed operations. SEND, SEND w/Imm,
and WRITE w/Imm send data from the local node to the remote node, and they are
designed to transmit both the data and a 32-bit integer value used to acknowledge
receipt. The remote node can detect that reception is complete by polling for the
integer value accumulated in the CQ. WRITE, unlike the other operations, gives the
local node no way of determining whether the data has been successfully received
after it is sent. Therefore, we design the module to transmit a ``send'' signal over
socket communication when performing a WRITE so that successful reception can be
confirmed. The rdma_comm() function takes as arguments the operation to be used and
the data to be sent. The operation is specified as a string: ``send'',
``send_with_imm'', ``write'', or ``write_with_imm''. The data to be transmitted can be
specified as either a string or a buffer of doubles. Once the operation is determined
by the arguments, data is sent and received concurrently using threads, and the
received data is stored in recv_buffer when reception completes.
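As a usage illustration, a many-to-many exchange then reduces to a single call whose operation is selected by the string argument; the payload below is a placeholder:

    // Step 3 (RDMA execution): many-to-many exchange with the SEND operation.
    rdma_comm("send", "hello from this node");
    // A double buffer can be passed instead of a string, e.g., for numeric data;
    // on completion, the data received from the other nodes is stored in recv_buffer.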
Fourth, rdma_one_to_many_send_msg() and rdma_one_to_many_recv_msg() are the functions
used for one-to-many communication in the RDMA execution step. In one-to-many
communication, the node whose IP is the server IP defined in the network information
configuration step sends messages, and the nodes with non-server IPs receive them.
That is, the node corresponding to the server IP passes the operation to be used and
the message to be sent as arguments to rdma_one_to_many_send_msg(), while non-server
nodes pass only the operation to be used to rdma_one_to_many_recv_msg().
Fifth, rdma_many_to_one_send_msg() and rdma_many_to_one_recv_msg() are the functions
used for many-to-one communication in the RDMA execution step. In many-to-one
communication, the non-server nodes send messages, and the server node receives the
messages from each node. That is, the server node calls rdma_many_to_one_recv_msg(),
passing the operation to be used as an argument, and the non-server nodes call
rdma_many_to_one_send_msg(), passing the operation and the message
to be sent as arguments.
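The role split of the two patterns can be summarized in the following sketch, where is_server is an assumed flag indicating whether the node's IP matches the configured server IP and message is a placeholder payload:

    #include <string>

    // Hedged sketch; is_server and message are assumed to come from the application.
    void broadcast_and_gather(bool is_server, const std::string &message) {
        // One-to-many: the server node broadcasts, the other nodes receive.
        if (is_server) rdma_one_to_many_send_msg("send", message);
        else           rdma_one_to_many_recv_msg("send");

        // Many-to-one: the other nodes send, the server node gathers.
        if (is_server) rdma_many_to_one_recv_msg("send");
        else           rdma_many_to_one_send_msg("send", message);
    }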
Finally, the exit_rdma() function provides the ability to clean up the information
generated to end communication in the RDMA completion step. This means deleting any
queues used for communication and deallocating any memory regions created for queue
management. Therefore, this function is called after all RDMA communication has ended
and before program termination. The implemented D-RDMALib is open-source and publicly
available at https://github.com/laewonJeong/D-RDMALib. We use D-RDMALib
to implement and experiment with PageRank in Section 4.
IV. D-RDMALIB-BASED PAGERANK
To demonstrate the effectiveness of D-RDMALib, we present both a simple data sending
and receiving application and a PageRank-based distributed application with heavy
inter-node communication. This section describes the overall architecture of DRDMA-PR,
which distributes PageRank in a master-worker structure. Fig. 2 shows the overall architecture of DRDMA-PR. The operating process of DRDMA-PR consists
of three steps:
(1) graph creation, (2) network configuration, and (3) PageRank calculation. The graph
creation step reads the graph data file and stores the graph in vectors, and each
worker node partitions the generated vertices. In this paper, we use two vertex
partitioning methods to analyze how performance varies with per-node throughput. The
first method simply partitions the vertices by the number of worker nodes. It gives
each worker node a nearly equal number of vertices for which to compute PageRank, but
the amount of computation can still vary with the number of outgoing edges of each
vertex. This can lead to throughput imbalances in which the other nodes wait for one
node's computation to finish. Therefore, as a second method, we employ a partitioning
scheme that counts the number of outgoing edges of each vertex and evenly distributes
both the vertices and the outgoing edges to be processed among the nodes.
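The sketch below is our illustrative reading of this second method, assuming a greedy least-loaded assignment that weights each vertex by one plus its outgoing-edge count; the actual partitioning code may differ in detail.

    #include <cstddef>
    #include <vector>

    // Assign each vertex to the currently least-loaded worker, where the load of a
    // worker counts both its assigned vertices and their outgoing edges.
    std::vector<int> partition_by_out_degree(const std::vector<size_t> &out_degree,
                                             int num_workers) {
        std::vector<int> owner(out_degree.size());
        std::vector<size_t> load(num_workers, 0);
        for (size_t v = 0; v < out_degree.size(); ++v) {
            int best = 0;
            for (int w = 1; w < num_workers; ++w)
                if (load[w] < load[best]) best = w;
            owner[v] = best;
            load[best] += 1 + out_degree[v];  // balance vertices and edges together
        }
        return owner;
    }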
The network configuration step connects all the nodes via socket communication using
the initialize_rdma_connection() function of D-RDMALib. We then use the
create_rdma_info() function to generate the information needed for RDMA communication.
Finally, we use the send_info_change_qp() function to share the generated information
among the nodes and change the states of the QPs.
The roles of the master node and the worker nodes differ in the PageRank calculation
step. Worker nodes compute PageRank for each vertex they own and store the results in
a double-type buffer. Each worker then uses the many-to-one communication function
rdma_many_to_one_send_msg() to send its PageRank buffer to the master node. The master
node receives the PageRank values sent by all worker nodes using the many-to-one
communication function rdma_many_to_one_recv_msg(). After receiving all the PageRank
values from the workers, the master merges them into a single double-type buffer and
sends the result back to the workers using the one-to-many communication function
rdma_one_to_many_send_msg(), and the workers receive it using the one-to-many
communication function rdma_one_to_many_recv_msg(). We repeat this process until the
difference between the updated and previous PageRank values is
less than a threshold (0.00001 in our experiments).
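Concretely, each worker applies the standard PageRank update from [7], PR(v) = (1-d)/N + d * Σ_{u ∈ In(v)} PR(u)/outdeg(u), where N is the number of vertices, In(v) is the set of in-neighbors of v, and d is the damping factor. The loop below is a hedged sketch of the worker side of one DRDMA-PR round; the damping factor, the local state, and the helpers incoming_mass() and max_diff() are assumptions, not D-RDMALib functions.

    #include <cstddef>
    #include <vector>

    // Hedged worker-side sketch; incoming_mass() and max_diff() are assumed helpers,
    // and pr_buffer/recv_buffer are the D-RDMALib buffers registered earlier.
    void worker_pagerank(const std::vector<size_t> &my_vertices,
                         double *pr_buffer, double *recv_buffer,
                         double d, size_t N) {
        const double THRESHOLD = 0.00001;  // convergence threshold in our experiments
        do {
            // Compute PageRank for the vertices this worker owns.
            for (size_t v : my_vertices)
                pr_buffer[v] = (1.0 - d) / N + d * incoming_mass(v);
            rdma_many_to_one_send_msg("send", pr_buffer);  // workers -> master
            rdma_one_to_many_recv_msg("send");             // master -> workers
        } while (max_diff(recv_buffer, pr_buffer) >= THRESHOLD);
    }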
In this paper, we also implement MPI-based PageRank (MPI-PR) as a comparison model to
evaluate the performance of DRDMA-PR. Like DRDMA-PR, MPI-PR first reads the input file
to generate a graph and partitions the vertices across the nodes. Each node then
computes the PageRank of its assigned vertices and gathers the results into a buffer
using MPI_Allgatherv(), a collective communication function of MPI. It repeats this
process and ends the calculation when the difference between the updated and previous
PageRank values is less than the threshold. However, MPI applications are generally
structured so that all the nodes in the cluster work in parallel as peers. Because we
implement MPI-PR using the collective communication function, in this parallel
structure one worker also plays the role of the master. This node bears the overhead
of handling both its assigned PageRank computation and resource/communication
management, which ultimately degrades overall processing performance.
In the proposed DRDMA-PR, the worker nodes are responsible for the actual computation,
and the master is responsible for aggregating and redistributing the results.
This structure eliminates the need to send and receive data between worker nodes,
greatly reducing the amount of communication. Also, although the master node
communicates with all worker nodes, it performs only simple merge-and-distribute
operations, so it does not strain overall performance. If MPI-PR were organized in
such a master-worker structure, the inter-node communication would have to be
implemented with additional functions beyond the collective communication functions;
that is, a master-worker structure could incur more overhead than the parallel
structure. Therefore, in this paper, we do not adopt a master-worker structure for
MPI-PR but construct the MPI-PR comparison model as a parallel structure.
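For reference, the core communication step of MPI-PR can be sketched as follows; apart from MPI_Allgatherv() itself, the surrounding names are assumptions, and the counts and displacements are derived from the vertex partitioning.

    #include <mpi.h>
    #include <vector>

    // Every rank contributes the PageRank values of its own vertices and receives
    // the full, globally ordered vector back; no dedicated master rank is involved.
    void allgather_pagerank(const std::vector<double> &local_pr,
                            std::vector<double> &global_pr,
                            const std::vector<int> &counts,
                            const std::vector<int> &displs) {
        MPI_Allgatherv(local_pr.data(), static_cast<int>(local_pr.size()), MPI_DOUBLE,
                       global_pr.data(), counts.data(), displs.data(), MPI_DOUBLE,
                       MPI_COMM_WORLD);
    }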
Fig. 2. DRDMA-PR overall architecture.
V. PERFORMANCE EVALUATION
In this section, we perform two experimental evaluations with D-RDMALib-based applications.
In Experiment 1, we validate the communication performance of D-RDMALib by implementing
many-to-many, one-to-many, and many-to-one communication applications. In Experiment
2, we implement PageRank using D-RDMALib and MPI to evaluate the processing performance
using real graph data and compare the results. For the experiments, we use nodes with
an Intel Xeon 2.4 GHz 10-core CPU, 128 GB RAM, and a Mellanox network switch as the
hardware, as shown in Table 1; the software consists of the MLNX_OFED_LINUX-4.9-6.0.6.0
driver and Open MPI 4.0.3 on Ubuntu 20.04. The PageRank experiments use three input
datasets [21], as shown in Table 2.
1. Experiment 1: Functional Evaluation
In Experiment 1, to validate the communication performance, we perform a scenario
where we send and receive messages of sizes 1MB, 16MB, and 64MB on four nodes. In
this case, we have configured many-to-many, one-to-many, and many-to-one communications,
and we use RDMA's SEND operation. In Experiment 1, for many-to-many communication,
each node acts as both a server and a client because all nodes send and receive together.
It also operates in a mesh topology where all nodes are fully connected to each other.
Fig. 3 and 4 show the results of each communication topology configured based on D-RDMALib.
Fig. 3 shows the results of many-to-many communication, and Fig. 4 shows the results of one-to-many and many-to-one communication. As shown in the figures,
we can observe that all nodes successfully send and receive messages in the
many-to-many, one-to-many, and many-to-one communications. In many-to-many
communication, all nodes exchange data of various sizes (1MB, 16MB, and 64MB) with
each other through the SEND operation. In this case, all nodes can send data without any problem as the
message size increases. In the case of one-to-many communication, when node 1 sends
data, the rest of the nodes receive it normally. Also, in the case of many-to-one
communication, nodes 2-4 send data, and node 1 receives all the data normally.
Fig. 3. Result of RDMA many-to-many communication.
Fig. 4. Result of RDMA one-to-many and many-to-one communications.
2. Experiment 2: Performance Evaluation
In Experiment 2, we compare the data processing performance of D-RDMALib-based PageRank
(DRDMA-PR) and MPI-based PageRank (MPI-PR). DRDMA-PR uses five nodes (one master and
four workers), MPI-PR uses four nodes, and Single-PR, a single-node baseline, uses one node.
Fig. 5 shows a comparison of the execution times of Single-PR and DRDMA-PR. We can see
that DRDMA-PR is up to 2.71 times faster than Single-PR across all datasets. This means
that distributed processing under DRDMA-PR is much more effective. In the figure, the
gain for soc-LiveJournal1 is smaller than for the other datasets because this
experiment simply partitions the total number of vertices by the number of worker
nodes, without considering the outgoing edges of the vertices, when partitioning
the graph.
Fig. 6 shows a comparison of the MPI-PR and DRDMA-PR execution times. We can observe
that DRDMA-PR is up to 1.29 times faster than MPI-PR across all datasets. This is
because, unlike in DRDMA-PR, one node of the MPI cluster, which is a parallel
structure, bears the overhead of resource and communication management, which degrades
the total performance. The overall comparison of the execution times of Single-PR,
MPI-PR, and DRDMA-PR is shown in Table 3.
Fig. 7 shows the average time per step for each node when computing PageRank for soc-LiveJournal1.
From the figure, we can see that the execution times of nodes 2-4 are similar for both
MPI-PR and DRDMA-PR. However, node 1 takes much longer with MPI-PR compared to DRDMA-PR.
These results demonstrate the overhead issue arising from a specific worker in the
MPI cluster performing the role of the master concurrently. Also, the difference in
the execution time of each node is due to the different number of outgoing edges of
the vertices partitioned by each node, which results in different amounts of PageRank
computations. In other words, when the total number of vertices is simply partitioned
and allocated according to the number of nodes, nodes 1 and 2 have more outgoing edges
per vertex than nodes 3 and 4.
When allocating vertices to nodes, we count the number of outgoing edges of each vertex
to balance the numbers of vertices and outgoing edges to be computed. Fig. 8(a) and
Table 4 show the MPI-PR and DRDMA-PR execution times for soc-LiveJournal1 with this
improved vertex partitioning method. The improved DRDMA-PR achieves performance
improvements of up to 3.04 times over Single-PR, up to 1.54 times over DRDMA-PR with
simple allocation, and 1.33 times over the improved MPI-PR. This shows that the
improved vertex partitioning method improves the overall performance of the
distributed PageRank computation. Fig. 8(b) and Table 5 show the average execution
time per node after applying the improved vertex partitioning method. Compared with
Fig. 7, we can observe that DRDMA-PR has nearly identical average execution times
across all nodes, while MPI-PR still exhibits overhead on node 1. Consequently, we
find that DRDMA-PR performs distributed parallel processing efficiently using the
proposed D-RDMALib and is more efficient than MPI-PR regardless of the vertex
partitioning method.
Fig. 5. Single-PR vs. DRDMA-PR execution time comparison.
Fig. 6. MPI-PR vs. DRDMA-PR execution time comparison.
Fig. 7. Average execution time per step for each node in MPI-PR and DRDMA-PR.
Fig. 8. MPI-PR and DRDMA-PR with the improved vertex partitioning: (a) average execution time; (b) average execution time per step for each node.
Table 1. Experimental environment
| Category | Specification |
| Hardware | Node: Intel Xeon 2.4 GHz 10-core, 128 GB RAM; HCA: Mellanox MT4099; InfiniBand: Mellanox ConnectX®-3 FDR 56 Gbps |
| Software | OS: Ubuntu 20.04; InfiniBand driver: MLNX_OFED_LINUX-4.9-6.0.6.0; MPI: Open MPI 4.0.3 |
Table 2. Experimental datasets
| Name | # of vertices | # of edges |
| twitter_combined | 81,306 | 1,768,149 |
| wiki-Talk | 2,394,385 | 5,021,410 |
| soc-LiveJournal1 | 4,847,571 | 68,993,773 |
Table 3. Execution times of Single-PR, MPI-PR, and DRDMA-PR
| Dataset | Single-PR (sec) | MPI-PR (sec) | DRDMA-PR (sec) |
| twitter_combined | 1.297 | 0.552 | 0.478 |
| wiki-Talk | 14.194 | 6.665 | 5.804 |
| soc-LiveJournal1 | 121.729 | 79.657 | 61.837 |
Table 4. Total execution times of MPI-PR and DRDMA-PR by partitioning technique
| Partitioning | MPI-PR (sec) | DRDMA-PR (sec) |
| Simple partitioning | 79.657 | 61.837 |
| Improved partitioning | 53.253 | 40.069 |
Table 5. Average execution time per step for each node in MPI-PR and DRDMA-PR by partitioning technique
| Partitioning | Model | Node 1 (sec) | Node 2 (sec) | Node 3 (sec) | Node 4 (sec) |
| Simple partitioning | MPI-PR | 1.228 | 0.473 | 0.302 | 0.174 |
| Simple partitioning | DRDMA-PR | 0.892 | 0.47 | 0.299 | 0.17 |
| Improved partitioning | MPI-PR | 0.812 | 0.456 | 0.447 | 0.454 |
| Improved partitioning | DRDMA-PR | 0.443 | 0.448 | 0.469 | 0.46 |
VI. CONCLUSIONS
In this paper, we proposed D-RDMALib, which can be utilized to improve the performance
of high-performance distributed cluster environments using InfiniBand. RDMA is a network
communication technology optimized for high-speed communication, but it is difficult
to apply to cluster environments consisting of multiple nodes due to its high
development complexity and its one-to-one communication-based programming
model. To solve these problems, we proposed D-RDMALib, an efficient and easy-to-use
library that generalizes the existing programming model and can be easily adapted
to multi-node environments. Furthermore, we designed and implemented the PageRank
algorithm as an application for the performance evaluation of the proposed D-RDMALib.
In Experiment 1, we confirmed that the D-RDMALib-based communication modules send and
receive data normally in a multi-node environment for all of the many-to-many,
one-to-many, and many-to-one configurations. In Experiment 2, through
the performance evaluation comparing the graph processing speeds of DRDMA-PR and MPI-PR,
we showed that D-RDMALib-based PageRank is faster overall. Based on these results,
we believe that D-RDMALib is a novel programming library that reduces the complexity
of developing RDMA applications and can be easily adapted to distributed processing
environments with multiple nodes. Also, D-RDMALib is an efficient technique that can
be readily applied to existing distributed processing environments as well as emerging
ultra-fast in-network technologies such as Quantum InfiniBand.
ACKNOWLEDGMENTS
This work was partly supported by Institute of Information & communications Technology
Planning & evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-00859,
Development of A Distributed Graph DBMS for Intelligent Processing of Big Graphs)
and National Research Foundation of Korea (NRF) grant funded by the Korea government
(MSIT) (No. NRF-2022R1A2C1003067).
References
[1] P. Grun, "Introduction to InfiniBand for End Users," InfiniBand Trade Association, 2010.
[2] NVIDIA BlueField DPU, https://www.nvidia.com/en-us/networking/products/data-processing-unit/.
[3] I. Burstein, "NVIDIA Data Center Processing Unit (DPU) Architecture," In Proc. of the IEEE Hot Chips 33 Symp., Palo Alto, CA, pp. 1-20, Aug. 2021.
[4] G. Shainer, R. Graham, C. J. Newburn, O. Hernandez, G. Bloch, T. Gibbs, and J. C. Wells, "NVIDIA's Cloud Native Supercomputing," In Proc. of the Smoky Mountains Computational Sciences and Engineering Conf., Virtual Event, pp. 340-357, Mar. 2022.
[5] Mellanox Technologies, RDMA Aware Networks Programming User Manual, Rev 1.7, 2015.
[6] P. MacArthur, Q. Liu, R. D. Russell, F. Mizero, M. Veeraraghavan, and J. M. Dennis, "An Integrated Tutorial on InfiniBand, Verbs, and MPI," IEEE Communications Surveys & Tutorials, Vol. 19, No. 4, pp. 2894-2926, Aug. 2017.
[7] S. Brin and L. Page, "The Anatomy of a Large-scale Hypertextual Web Search Engine," Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pp. 107-117, Apr. 1998.
[8] MPI Forum, https://www.mpi-forum.org.
[9] NVIDIA Quantum InfiniBand, https://www.nvidia.com/en-us/networking/quantum2/.
[10] Y. Shpigelman, G. Shainer, R. Graham, Y. Qin, G. Cisneros-Stoianowski, and C. Stunkel, "NVIDIA's Quantum InfiniBand Network Congestion Control Technology and its Impact on Application Performance," In Proc. of the ISC High Performance, Hamburg, Germany, pp. 26-43, May 2022.
[11] S. J. Mazivila, J. X. Soares, and J. L. Santos, "A Tutorial on Multi-way Data Processing of Excitation-emission Fluorescence Matrices Acquired from Semiconductor Quantum Dots Sensing Platforms," Analytica Chimica Acta, Vol. 1211, ID 339216, June 2022.
[12] S. Yang, S. Son, M.-J. Choi, and Y.-S. Moon, "Performance Improvement of Apache Storm using InfiniBand RDMA," The Journal of Supercomputing, Vol. 75, No. 10, pp. 6804-6830, June 2019.
[13] E. Zamanian, X. Yu, M. Stonebraker, and T. Kraska, "Rethinking Database High Availability with RDMA Networks," Proc. of the VLDB Endowment, Vol. 12, No. 11, pp. 1637-1650, July 2019.
[14] C. Link, J. Sarran, G. Grigoryan, M. Kwon, M. M. Rafique, and W. R. Carithers, "Container Orchestration by Kubernetes for RDMA Networking," In Proc. of the IEEE 27th Int'l Conf. on Network Protocols, Chicago, IL, pp. 1-2, Oct. 2019.
[15] J. Wang, et al., "Fargraph+: Excavating the Parallelism of Graph Processing Workload on RDMA-based Far Memory System," Journal of Parallel and Distributed Computing, Vol. 177, pp. 144-159, July 2023.
[16] Z. Yao, R. Chen, B. Zang, and H. Chen, "Wukong+G: Fast and Concurrent RDF Query Processing using RDMA-assisted GPU Graph Exploration," IEEE Trans. on Parallel and Distributed Systems, Vol. 33, No. 7, pp. 1619-1635, July 2022.
[17] MLNX_OFED, https://developer.nvidia.com/networking/infiniband-software/.
[18] Infinity, https://github.com/claudebarthels/infinity.
[19] T. Kim, B. Kim, and H. Jung, "Design and Implementation of Easily Applicable and Lightweight RDMA API for InfiniBand Cluster," In Proc. of the Symp. of KICS, Jeju, Korea, pp. 1483-1484, June 2015.
[20] OpenMP, https://www.openmp.org/.
[21] Stanford Large Network Dataset Collection, https://snap.stanford.edu/data/.
Laewon Jeong received his B.S. degree (2022) in Computer Science from Kangwon National
University, where he is currently a master's student. His research interests include
distributed and parallel processing, high-speed data processing, and container systems.
Myeong-Seon Gil received her B.S. (2007), M.S. (2009), and Ph.D. (2021) degrees in
Computer Science from Kangwon National University, where she is currently a
postdoctoral researcher. From 2010 to 2012, she worked at the Central Information and
Computing Center of Kangwon National University, where she participated in developing
mobile applications and university portal systems. Her research interests include data
mining, big data analysis, distributed and parallel processing, and high-speed stream
processing.
Yang-Sae Moon received his B.S. (1991), M.S. (1993), and Ph.D. (2001) degrees in
Computer Science from the Korea Advanced Institute of Science and Technology (KAIST).
From 1993 to 1997, he was a research engineer at Hyundai Syscomm, Inc., where he
participated in developing 2G and 3G mobile communication systems. From 2002 to 2005,
he was a technical director at Infravalley, Inc., where he participated in planning,
designing, and developing CDMA and W-CDMA mobile network services and systems. He is
currently a professor in the Dept. of Computer Science and the dean of the Graduate
School of Data Science at Kangwon National University. He was a visiting scholar at
Purdue University from 2008 to 2009. His research interests include data mining, big
data systems and analytics, storage systems, distributed and parallel processing, open
source systems, and high-speed stream processing.