
  1. (Department of Computer Engineering, Kwangwoon University, Seoul 01897, Korea)
  2. (School of Computer and Information Engineering, Kwangwoon University, Seoul 01897, Korea)



Keywords: NVMe SSDs, I/O scheduling, Linux, multi-queue block I/O layer, priority

I. INTRODUCTION

Non-volatile memory express (NVMe) is an open logical device interface specification for accessing non-volatile storage media such as solid-state drives (SSDs) (18). Because NVMe is designed to capitalize on the low latency and internal parallelism of SSDs, it reduces I/O overhead and improves I/O bandwidth. Unlike SATA SSDs, which use native command queuing (NCQ) with a single I/O queue, NVMe SSDs employ multiple submission queues (SQs) that can process I/O requests received from the host in parallel.

To support NVMe SSDs efficiently, a multi-queue block I/O layer has been introduced in the Linux kernel. It uses two sets of queues, software queues and hardware queues, to improve scalability (1). The software queues alleviate the lock contention problem in multi-core environments, and the hardware queues exploit the internal parallelism of SSDs. With these queues, the multi-queue block I/O layer can provide high I/O bandwidth, but it still lacks differentiated service for I/O requests.

Although every process is executed with a priority, that priority applies only to the CPU scheduler, not to the I/O scheduler. As a result, I/O-intensive processes cannot be serviced in proportion to their priorities, even when those priorities are very high. In this paper, we present a scheme that efficiently supports process priorities in the multi-queue block I/O layer. To service processes exactly according to their priorities, we bring the idea of the completely fair scheduler (CFS), the CPU scheduler used in the current Linux kernel, into the I/O layer. Between the software queues and the hardware queues, we insert another layer consisting of priority queues. For each priority queue, we maintain a token count; a token is the right for an I/O request to move to the hardware queue. Like the weights in the CFS scheduler, by allocating the token counts of the priority queues appropriately according to the process priorities, we ensure that processes are serviced according to their priorities regardless of whether they are I/O-intensive or not. Our scheme also considers the request type: it services read requests preferentially over write requests. By appropriately controlling the number of requests submitted to the SSD in the priority queue layer, our scheme provides a differentiated I/O service.

The remainder of this paper is organized as follows. Section II describes the background and related work. Section III explains the proposed scheme for supporting priorities in the multi-queue block I/O layer. The experimental results and concluding remarks are presented in Sections IV and V, respectively.

II. RELATED WORK

NVMe is a high-performance, scalable host controller interface designed for PCIe-based SSDs and non-volatile memories (2). To provide high-speed I/O, it supports up to 65,535 submission and completion queues, each of which can hold up to 64K commands (3,4,9,13). Such a scalable architecture enables the internal parallelism of SSDs to be fully utilized. The host submits I/O requests to the submission queues, and the SSD dispatches them from the submission queues in a round-robin fashion. After processing a request, the SSD places a completion entry in the corresponding completion queue and notifies the host of the completion.

The multi-queue block I/O layer of Linux is a structure designed for computer systems with multiple CPU cores and storage devices that support multiple queues, such as NVMe SSDs. In the previous single-queue block I/O layer, all I/O requests issued by processes running on the CPU cores were sent to a single request queue (Fig. 1(a)). As a result, this queue became a performance bottleneck because of lock contention. To alleviate this problem, the multi-queue block I/O layer employs two levels of queues: software queues and hardware queues (Fig. 1(b)). I/O requests issued by a process running on a CPU core are first sent to the software queue attached to that core and then move to the hardware queue mapped to the software queue. If there are fewer hardware queues than software queues, two or more software queues can be mapped to one hardware queue, as sketched below. The I/O requests are finally sent to the submission queues in the NVMe controller, which are mapped to the hardware queues.
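
The following is a rough sketch of this mapping, assuming a simple modulo assignment; the structure and function names are illustrative, not the actual kernel code, which provides its own mapping helpers.

/* Sketch: mapping per-CPU software queues to hardware queues.  When the
 * device exposes fewer hardware queues than there are CPU cores, several
 * software queues share one hardware queue.  Illustrative names only;
 * assumes nr_hw_queues > 0 and nr_sw_queues <= 64. */
struct queue_map {
    unsigned int nr_sw_queues;   /* one software queue per CPU core     */
    unsigned int nr_hw_queues;   /* limited by what the device supports */
    unsigned int sw_to_hw[64];   /* indexed by software queue id        */
};

static void map_sw_to_hw(struct queue_map *map)
{
    for (unsigned int sw = 0; sw < map->nr_sw_queues; sw++)
        map->sw_to_hw[sw] = sw % map->nr_hw_queues;
}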

Fig. 1. Block I/O layers in Linux (a) Single-queue block I/O layer, (b) Multi-queue block I/O layer.


Research on optimizing the Linux I/O stack has been actively performed. Kim et al. proposed NVMeDirect, a user-space I/O framework for directly accessing the submission queues and completion queues of NVMe SSDs (5). Because it is implemented at the user level, it can easily be optimized for target applications by utilizing various SSD-related techniques and by bypassing several layers of the Linux kernel. Josephson et al. proposed an optimized file system for SSDs called the direct file system (DFS) (6). DFS places a virtual address space in the block I/O layer to manage the entire space of the SSD, and it maps the virtual addresses to the physical addresses of the SSD. Since it detects I/O requests that can be processed in the host and manages them directly, it improves the lifetime as well as the performance. In (7), Lee et al. proposed F2FS, a file system for SSDs. It is designed to optimize the performance and lifetime of SSDs with a generic block interface and outperforms EXT4 by up to 3.1 times. Matias et al. proposed LightNVM, a method to offload internal operations of SSDs to the host (8). Because SSDs supporting LightNVM allow the host to access their internals, the host kernel can perform write buffering, wear leveling, and garbage collection.

There have also been studies on I/O scheduling for NVMe SSDs. In (10), Joshi et al. implemented a mechanism that supports several I/O priorities for NVMe SSDs by using the weighted round-robin (WRR) feature of the NVMe interface. They designed four kinds of submission queues, Urgent, High, Medium, and Low, and I/O requests are sent to one of them according to their priority. Lee et al. solved the write interference problem by isolating read and write requests into independent software queues (12). They showed that read operations could be processed faster than in other systems by up to 33%. Ahn et al. proposed a weight-based dynamic throttling (WDT) scheme for the Linux cgroup mechanism, which allocates I/O resources more efficiently than the existing policy on the multi-queue block I/O layer (11). By intelligently predicting future I/O bandwidth requirements based on past I/O service rates, they achieved highly accurate I/O resource sharing while reducing wasted I/O bandwidth.

III. SUPPORTING THE PRIORITIES IN THE MULTI-QUEUE BLOCK I/O LAYER

When a process is executed with a given priority, it may not actually run as expected. The priority is interpreted only by the CPU scheduler, not by the system as a whole, and the priority information is not delivered to the I/O layer in the current Linux kernel. Even if the priority information were delivered to the I/O layer through some additional mechanism, it is doubtful whether I/O requests could be serviced according to that priority, because they pass through many queues: software queues, hardware queues, and submission queues.

To dispel these doubts, we performed a simple experiment with a CPU-intensive program and an I/O-intensive program: $\textit{sysbench}$, which searches for 8,000 prime numbers, and $\textit{fio}$, which randomly reads 5 GB of data from the SSD (14,17). We first executed 40 $\textit{sysbench}$ processes whose nice values range from -20 to 19. Note that the lower the nice value, the higher the actual priority. As seen in Fig. 2, the runtimes of the $\textit{sysbench}$ processes increase as the nice value increases, and the average performance difference between two adjacent processes in the priority order is about 1.223 times. This is because the CFS scheduler allocates 1.25 times as much CPU weight to a process with priority n as to a process with priority n+1. In conclusion, because sysbench processes are CPU-intensive and mainly use the CPU resource, they are serviced sufficiently by the CFS policy.
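
The 1.25 ratio comes from the weight table of CFS, in which each nice level receives roughly 1.25 times the weight of the next lower-priority level and nice 0 corresponds to a weight of 1024. A minimal sketch of this relationship follows; the kernel actually uses a precomputed lookup table rather than evaluating it at runtime.

/* Sketch: CFS-style weight as a geometric function of the nice value.
 * Nice values range from -20 (highest priority) to 19 (lowest); nice 0
 * maps to a weight of 1024, and each step changes the weight by roughly
 * a factor of 1.25. */
#include <math.h>

static double cfs_weight(int nice)
{
    return 1024.0 * pow(1.25, -nice);
}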

Fig. 2. Performance of CPU-intensive and I/O-intensive processes.


On the other hand, the experiment with $\textit{fio}$ under the same conditions shows different results. The average performance difference between two adjacent processes in the priority order is just 1.007 times; the runtime is rarely affected by the priority. This is an expected result because fio is an I/O-intensive process that rarely uses the CPU resource, and there is no mechanism that supports the priority in the I/O layer. These experiments show that how much the priority affects the execution of a process depends on its CPU usage.

In this section, we present a scheme that provides consistent services to all processes according to their priorities, regardless of whether they are CPU-intensive or I/O-intensive. To this end, we first deliver the CPU priority information to the I/O layer by adding it to bio and request, the basic structures used for I/O processing in the multi-queue block I/O layer. As shown in Fig. 3, we add another layer, consisting of priority queues corresponding to the CPU priority levels, between the software queues and the hardware queues. Since the I/O request type is also an important consideration in designing an I/O scheduler, we divide the priority queues into two types: read and write.
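
The sketch below only illustrates what information has to travel with each request down to the priority queue layer; the structure and field names are hypothetical, not the actual kernel members added to bio and request.

/* Sketch: the priority information carried with each I/O request so the
 * priority queue layer can classify it.  Hypothetical names; in the real
 * implementation the fields would be added to struct bio and struct
 * request. */
struct io_prio_tag {
    int nice;      /* nice value of the submitting task: -20 (highest) .. 19 */
    int is_read;   /* request type: 1 = read, 0 = write                      */
};

/* Called when an I/O request is created on behalf of the current task. */
static void tag_request(struct io_prio_tag *tag, int task_nice, int is_read)
{
    tag->nice = task_nice;   /* copied from the CPU scheduler's view of the task */
    tag->is_read = is_read;  /* used to select the read or write priority queue  */
}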

Fig. 3. A priority-supported I/O scheduler.


As in the original multi-queue block I/O layer, I/O requests issued by processes are first sent to the software queues; they then move to the priority queues according to their priority and request type. The priority queue layer determines whether an I/O request may pass on to the hardware queues. For each priority queue, we maintain a token count, which plays a vital role in our scheme. A token is the right for an I/O request to move to the next queue. If the token count is greater than 0, an I/O request can move to the next hardware queue with a token, and the token count decreases. Conversely, when the I/O processing of a request is finished in the SSD and the completion is notified via the completion queue, the token count increases again.
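
The token bookkeeping can be sketched as follows, assuming one atomic counter per priority queue. The names are illustrative, and the sketch omits the queues themselves; it only shows when a token is taken and when it is returned.

/* Sketch of the token mechanism: a request may advance to the hardware
 * queue only if a token is available, and the token is returned when the
 * SSD reports completion via the completion queue.  Illustrative only. */
#include <stdatomic.h>
#include <stdbool.h>

struct prio_queue {
    atomic_int tokens;   /* remaining rights to dispatch to the hardware queue */
};

/* Try to take a token before moving a request to the hardware queue. */
static bool try_dispatch(struct prio_queue *pq)
{
    int t = atomic_load(&pq->tokens);
    while (t > 0) {
        if (atomic_compare_exchange_weak(&pq->tokens, &t, t - 1))
            return true;   /* token taken: the request may be dispatched */
    }
    return false;          /* no token: the request waits in the priority queue */
}

/* Called when the completion queue signals that the SSD finished a request.
 * If another request is waiting in the priority queue, the freed token can
 * be handed to it directly instead of being returned (see Fig. 4). */
static void complete_request(struct prio_queue *pq)
{
    atomic_fetch_add(&pq->tokens, 1);
}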

Fig. 4 shows an example of how I/O requests move to the next queues using tokens. In Fig. 4(a), when a read request with priority n arrives at the priority queue from the software queue, it can move to the hardware queue immediately because the token count is 3, which is greater than 0. The token count is then reduced to 2. On the contrary, in Fig. 4(b), a read request cannot move to the hardware queue because the token count is 0. In this case, the read request must wait in the priority queue for priority n until requests previously submitted to the SSD are completed and the tokens occupied by them are returned. Finally, when the SSD finishes processing some I/O requests and notifies the host of the completion by putting messages into the completion queues, the waiting I/O requests in the priority queues move to the hardware queues or the token count is increased by the number of completed requests.

Fig. 4. Priority I/O scheduling through token counts (a) if the token count > 0, (b) if the token count is 0.


Since only requests holding a token can move to the next hardware queue, we can control the amount of I/O requests processed in the priority queue layer through the token counts. In other words, the higher the token count, the more I/O requests can be serviced within a given time. To allocate more SSD resources to processes with higher priorities, it is therefore sufficient to give them larger token counts. This is possible because I/O requests in the hardware queue are immediately submitted to the submission queue if there is enough space in the submission queues, and NVMe SSDs usually dispatch I/O requests from all submission queues in a round-robin fashion.

Consequently, the most important decision in our scheme is to choose appropriate initial token counts for the priority queues. They are determined based on the process priority, the I/O type, and the number of processes with the same priority. First, we determine the initial token counts of the priority queues by referencing the CFS scheduler. Just as CFS allocates 1.25 times as much weight to a process with priority n as to a process with priority n+1, we allocate the initial token counts of the priority queues in the same ratio. For example, if the initial token count of the priority queue for priority 1 is set to 10, the initial token count of the priority queue for priority 0 becomes 12 or 13. Likewise, the initial token count of the priority queue for priority 2 becomes 8.
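
Under this policy, the initial token count can be derived from the nice value with the same 1.25 ratio used for the CFS weights. The sketch below reproduces the numbers in the example above; the base value of 10 tokens at priority 1 is only illustrative.

/* Sketch: deriving the initial token count of a priority queue from the
 * process nice value, using the same 1.25 ratio as the CFS weights.  With
 * a base of 10 tokens at priority 1, priority 0 gets 13 and priority 2
 * gets 8, as in the example in the text. */
#include <math.h>

static int initial_tokens(int nice)
{
    double base = 10.0;                         /* tokens at priority 1     */
    double count = base * pow(1.25, 1 - nice);  /* x1.25 per priority step  */
    return (int)(count + 0.5);                  /* round to nearest integer */
}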

Second, we consider the I/O request type when determining the initial token counts. Since read requests should be serviced preferentially over write requests, higher token counts are given to read requests. Like existing Linux I/O schedulers such as the deadline scheduler, our scheme sets the initial token counts for read requests twice as high as those for write requests.

Finally, the number of processes running with the same priority must also be considered when determining the initial token counts. If it is ignored, processes with a higher priority can be serviced less than processes with a lower priority. Assume that the initial token count of the priority queue for read requests with priority 0 is 12 and that of the priority queue for read requests with priority 1 is 10. Suppose three processes A, B, and C all have priority 0 and issue only read requests. Since their priorities and request types are the same, they share one priority queue and its token count, and because the initial token count is 12, each process can submit only four requests to the SSD on average. Meanwhile, if a process D with priority 1 is running, it can submit up to 10 requests by itself. To remove this anomaly, we give each process its own priority queue and token count. Since the priority queues are managed dynamically and each process can issue two types of requests, the number of priority queues required is at most 2*n when n processes are running. Because each process has its own priority queue and token count, lock contention does not occur in the priority queue layer.
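
A rough sketch of this per-process queue allocation follows. It combines the priority-based initial token count from the earlier sketch with the doubled count for read queues, and keeps the queues in a simple fixed-size table; the names, the table, and the absence of locking are illustrative simplifications, not the actual implementation.

/* Sketch: one priority queue (and one token count) per process and per
 * request type, so at most 2*n queues exist for n running processes. */
#include <math.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_PQ 256

struct proc_prio_queue {
    int in_use;
    int pid;            /* owning process                         */
    int is_read;        /* 1 = read queue, 0 = write queue        */
    atomic_int tokens;  /* initialized from priority and I/O type */
};

static struct proc_prio_queue pq_table[MAX_PQ];

/* Same 1.25-per-step rule as in the earlier sketch (10 tokens at priority 1). */
static int initial_tokens(int nice)
{
    return (int)(10.0 * pow(1.25, 1 - nice) + 0.5);
}

/* Look up, or create, the priority queue for a process and request type. */
static struct proc_prio_queue *get_prio_queue(int pid, int is_read, int nice)
{
    for (int i = 0; i < MAX_PQ; i++)
        if (pq_table[i].in_use && pq_table[i].pid == pid &&
            pq_table[i].is_read == is_read)
            return &pq_table[i];

    for (int i = 0; i < MAX_PQ; i++) {
        if (!pq_table[i].in_use) {
            int t = initial_tokens(nice);
            pq_table[i].in_use = 1;
            pq_table[i].pid = pid;
            pq_table[i].is_read = is_read;
            /* Read queues start with twice the tokens of write queues. */
            atomic_init(&pq_table[i].tokens, is_read ? 2 * t : t);
            return &pq_table[i];
        }
    }
    return NULL;   /* table full; not handled in this sketch */
}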

IV. PERFORMANCE EVALUATION

We implemented our I/O scheduler in the Linux kernel by modifying Kyber, one of the Linux I/O schedulers (19). Table 1 shows the experimental environment. In our experiments, eight software queues are created according to the number of CPU cores. The numbers of hardware queues and submission queues are also eight, because Linux creates these queues according to the number of software queues if the NVMe SSD supports it. To generate I/O-intensive processes, we used $\textit{fio}$ with the parameters described in Table 2.

Table 1. Experimental environments

CPU Intel i7-7700K (8 cores)
Memory 32GB
Storage Samsung 960 PRO 512GB
OS Ubuntu 16.04 (Kernel 4.13.10)

Table 2. Parameter setting for $\textit{fio}$

Block size 4KB
Total I/O size 8GB
IO engine libaio (iodepth: 2048)

We first examined whether the proposed I/O scheduler provides sufficiently differentiated services based on the process priority and I/O type. To this end, we executed 80 $\textit{fio}$ processes at the same time: 40 processes issue only read requests, and the other 40 processes issue only write requests. Their priorities (nice values) range from -20 to 19.

Fig. 5 shows the runtimes of the fio processes as a function of priority. To measure the effect of our I/O scheduler exactly, the fio processes are executed repeatedly in this experiment. In the original kernel, which considers neither the I/O type (read or write) nor the priority of the process issuing the I/O in the block I/O layer, there is little performance difference between read-intensive and write-intensive processes.

Fig. 5. Performance as a function of priority.


Moreover, the priority rarely affects performance, because the fio processes are I/O-intensive and the original kernel has no support for I/O prioritization. On the contrary, we can observe that the proposed scheme provides differentiated I/O service based on the process priority and I/O type. For both read-intensive and write-intensive processes, the runtimes increase as the nice value increases. The read-intensive processes show better performance than the write-intensive ones because we give higher initial token counts to the priority queues for read requests.

Fig. 6 shows the average performance difference between two adjacent processes in the priority order. In the original kernel, the performance difference for both read-intensive and write-intensive processes is about 1.01, and they show almost constant performance irrespective of priority and request type. In the proposed scheme, however, the performance difference between two adjacent processes in the priority order is about 1.08; that is, a process with priority n performs better than one with priority n+1 by a factor of about 1.08.

Fig. 6. Performance gap between two adjacent processes in the priority order.


Fig. 7 shows the overall performance of our scheme, namely the average runtime of all fio processes. This experiment was performed in the same environment as Figs. 5 and 6, except that terminated fio processes are not re-executed. With our scheme, the average runtimes of the read-intensive and write-intensive processes improve by 33% and 15%, respectively. We attribute this to the fact that processes with relatively lower priorities can occupy the system resources released by terminated processes with higher priorities. Note that the overhead of our scheme is included in this result. In another experiment, performed with a single fio process that issues only random read requests, the average latency increased by 8.7% when our I/O scheduler was employed.

Fig. 7. Average runtime of fio processes.


Since our scheme allocates a priority queue per process, processes are not penalized even when many of them share the same priority. Fig. 8 shows the average runtime when several groups of processes run with the same priority, as described in Table 3. The runtime of each $\textit{fio}$ process again increases as the priority value increases, regardless of how many processes share that priority. For example, although there are many processes with priority -15, they do not share any priority queue or token count, so their average runtime is still lower than those of the other processes.

Table 3. Number of running processes per priority

Priority -15 -10 -5 0 5 10 15
# of processes 64 32 16 8 4 2 1

Fig. 8. Performance of processes with the same priority.


V. CONCLUSIONS

Although the multi-queue block I/O layer provides high I/O bandwidth, it still lacks differentiated I/O services. In this paper, we presented a scheme that supports process priorities in the multi-queue block I/O layer. By employing the concept of CFS, the CPU scheduler of Linux, our scheme provides differentiated services according to the process priority even for I/O-intensive processes.

ACKNOWLEDGMENTS

The present research was supported by the Research Grant of Kwangwoon University in 2019. It was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP; Ministry of Science, ICT & Future Planning) (No. 2017R1A2B4008536).

REFERENCES

1 
Matias B., Axboe J., Nellans D., Bonnet P., Jun 2013, Linux block IO: introducing multi-queue SSD access on multi-core systems, Systems and Storage, SYSTOR '13, 6th ACM International Conference on, No. 22.
2 
NVMe overview, https://www.nvmexpress.org/wp-content/uploads/NVMe_Overview.pdf
3 
Zhang J., et al., 2018, FlashShare: Punching through server storage stack from kernel to firmware for ultra-low latency SSDs, Operating Systems Design and Implementation, OSDI '18, 13th USENIX Symposium on, pp. 477-492.
4 
Huffman A., NVM Express Base Specification Revision 1.3c, https://nvmexpress.org/wp-content/uploads/NVM-Express-1_3c-2018.05.24-Ratified.pdf
5 
Kim H., Lee Y., Kim J., 2016, NVMeDirect: A User-space I/O Framework for Application-specific Optimization on NVMe SSDs, Hot Topics in Storage and File Systems, HotStorage '16, 8th USENIX Workshop on.
6 
Josephson W. K., Bongo L. A., Li K., Flynn D., Sep 2010, DFS: A File System for Virtualized Flash Storage, Transactions on Storage, ACM Journal of, Vol. 6, No. 3.
7 
Lee C., Sim D., Hwang J., 2015, F2FS: A New File System for Flash Storage, File and Storage Technologies, FAST '15, 13th USENIX Conference on, pp. 273-286.
8 
Matias B., González J., Bonnet P., 2017, LightNVM: The Linux Open-Channel SSD Subsystem, File and Storage Technologies, FAST '17, 15th USENIX Conference on, pp. 359-374.
9 
Peng B., Zhang H., Yao J., Dong Y., Xu Y., Guan H., 2018, MDev-NVMe: A NVMe Storage Virtualization Solution with Mediated Pass-Through, Annual Technical Conference, ATC '18, USENIX Conference on, pp. 665-676.
10 
Joshi K., Yadav K., Choudhary P., 2017, Enabling NVMe WRR support in Linux Block Layer, Hot Topics in Storage and File Systems, HotStorage '17, 9th USENIX Workshop on.
11 
Ahn S., La K., Kim J., 2016, Improving I/O Resource Sharing of Linux Cgroup for NVMe SSDs on Multi-core Systems, Hot Topics in Storage and File Systems, HotStorage '16, 8th USENIX Workshop on.
12 
Lee M., Kang D., Lee M., Eom Y., 2017, Improving Read Performance by Isolating Multiple Queues in NVMe SSDs, Ubiquitous Information Management and Communication, IMCOM '17, 11th ACM International Conference on, No. 36, pp. 1-6.
13 
Kim S., Yang J., 2018, Optimized I/O Determinism for Emerging NVM-based NVMe SSD in an Enterprise System, Design Automation Conference, DAC '18, 55th ACM Conference on, No. 56, pp. 1-6.
14 
sysbench, https://github.com/akopytov/sysbench
15 
Kim T., Kang D., Lee D., Eom Y., 2015, Improving performance by bridging the semantic gap between multi-queue SSD and I/O virtualization framework, Mass Storage Systems and Technologies (MSST), 31st IEEE Symposium on, pp. 1-11.
16 
Oh M., Eom H., Yeom H., Dec 2014, Enhancing the I/O system for virtual machines using high performance SSDs, Performance Computing and Communications Conference (IPCCC), 33rd IEEE International Conference on, pp. 1-8.
17 
Flexible I/O Tester, https://github.com/axboe/fio
18 
Eshghi K., Micheloni R., 2013, SSD Architecture and PCI Express Interface, Inside Solid State Drives (SSDs), Springer, Dordrecht, Vol. 37, pp. 19-45.
19 
Sandoval O., Kyber multi-queue i/o scheduler, https://lwn.net/Articles/720071/

Author

Kyusik Kim

Kyusik Kim received the BS degree in computer engineering from Kwangwoon University, Korea, in 2014.

He is currently working toward the combined MS and PhD degree at the School of Computer Engineering, Kwangwoon University.

His research interests include storage systems, operating systems, embedded systems, solid-state drives, and next-generation nonvolatile memories.

Seungkyu Hong

Seungkyu Hong received the BS degree in computer engineering from Kwangwoon University, Korea, in 2017.

He is currently working toward the MS degree at the School of Computer Engineering, Kwangwoon University.

His research interests include operating systems, flash memories, and next-generation nonvolatile memories.

Taeseok Kim

Taeseok Kim received the B.S. degree in computer science and the M.S. and Ph.D. degrees in computer science and engineering from Seoul National University, South Korea, in 2000, 2002, and 2007, respectively.

In 2007, he was a Senior Research Engineer with Samsung Electronics, Suwon, South Korea. In 2008, he joined the Department of Computer Engineering, Kwangwoon University, Seoul, South Korea, where he is currently a Professor.

His research interests are primarily in computer systems such as operating systems, storage systems, multimedia systems, and embedded systems.