I. INTRODUCTION
Non-volatile memory express (NVMe) is an open logical device interface specification
for accessing non-volatile storage media such as solid-state drives (SSDs) (18). As NVMe has been designed to capitalize on the low latency and internal parallelism
of SSDs, it can reduce I/O overhead and improve I/O bandwidth. Unlike
SATA SSDs that use native command queuing (NCQ) with a single I/O queue, NVMe SSDs
employ multiple submission queues (SQs) that can process I/O requests received from
the host in parallel.
To support NVMe SSDs efficiently, a multi-queue block I/O layer has been introduced
in the Linux kernel. It uses two sets of queues, software queues and hardware queues,
in order to improve scalability (10). The software queues alleviate the lock contention problem in multi-core
environments, and the hardware queues exploit the parallelism of SSDs. With these many
queues, the multi-queue block I/O layer can provide high I/O bandwidth, but it still
lacks differentiated service for I/O requests. Although every process is executed with
a priority, that priority is honored only by the CPU scheduler, not by the I/O scheduler.
As a result, I/O-intensive processes are not serviced according to their priorities,
even when those priorities are very high. In this paper, we present a scheme that
efficiently supports process priorities in the multi-queue block I/O layer. To service
processes exactly according to their priorities, we bring the weighting technique of
the completely fair scheduler (CFS), the CPU scheduler used in the current Linux kernel,
into the I/O layer. Between the software queues and the hardware queues, we insert
another layer consisting of priority queues. For each priority queue, we maintain a
token count; a token is the right for an I/O request to move on to the hardware queue.
By allocating token counts to the priority queues according to the process priorities,
in the same way that CFS allocates weights, processes are serviced according to their
priorities regardless of whether they are I/O-intensive or not. Our scheme also considers
the request type: it services read requests preferentially over write requests.
By appropriately controlling the number of requests submitted to the SSD in the
priority queue layer, our scheme provides differentiated I/O service.
The remainder of this paper is organized as follows. First, we describe the background
and related work in Section II. Then, we explain the proposed scheme for supporting
priorities in the multi-queue block I/O layer in Section III. The experimental
results and concluding remarks are presented in Sections IV and V, respectively.
II. RELATED WORK
NVMe is a high-performance, scalable host controller interface designed for PCIe-based
SSDs and non-volatile memories (2). To provide high-speed I/O, it supports up to 65,535 submission and completion queues,
each of which can hold up to 64K commands (3,4,9,13). Such a scalable architecture enables the internal parallelism of SSDs to be fully
utilized. The host submits I/O requests to the submission queues, and the SSD
dispatches them from the submission queues in a round-robin fashion. After the SSD
has processed a request, it inserts a completion entry into the corresponding completion
queue and notifies the host of the completion.
The multi-queue block I/O layer of Linux is a structure designed for computer systems
with multiple CPU cores and storage devices supporting multiple queues, such as NVMe SSDs. In
the previous single-queue block I/O layer, all I/O requests issued by processes
running on each CPU core are sent to a single request queue (Fig. 1(a)). As a result, the request queue becomes a performance bottleneck because of lock contention.
To alleviate this problem, the multi-queue block I/O layer employs
two levels of queues: software queues and hardware queues (Fig. 1(b)). I/O requests issued by a process running on a CPU core are first sent to the
software queue attached to that CPU core, and they then move to the hardware queue mapped
to that software queue. If there are fewer hardware queues than software queues,
two or more software queues can be mapped to one hardware queue. The I/O requests
are finally sent to the submission queues of the NVMe controller, which are mapped
to the hardware queues.
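This many-to-one mapping can be illustrated with a small sketch; the modulo policy and the names below are simplifying assumptions for illustration, not the actual blk-mq mapping code of the kernel.

/* A minimal sketch of the many-to-one mapping between software queues
 * (one per CPU core) and hardware queues, assuming a simple modulo
 * policy; the real Linux blk-mq mapping is more elaborate. */
#include <stdio.h>

static unsigned int map_sw_to_hw(unsigned int cpu, unsigned int nr_hw_queues)
{
    return cpu % nr_hw_queues;   /* several CPUs may share one hardware queue */
}

int main(void)
{
    unsigned int nr_cpus = 8, nr_hw_queues = 4;
    for (unsigned int cpu = 0; cpu < nr_cpus; cpu++)
        printf("software queue of CPU %u -> hardware queue %u\n",
               cpu, map_sw_to_hw(cpu, nr_hw_queues));
    return 0;
}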
Fig. 1. Block I/O layers of Linux (a) Single-queue block I/O layer, (b) Multi-queue
block I/O layer.
Research on optimizing the Linux I/O stack has been actively conducted. Kim et
al. proposed NVMeDirect, a user-space I/O framework for
directly accessing the submission and completion queues of NVMe SSDs (5). As it is implemented at the user level, it can be easily optimized for target applications
by exploiting SSD-specific techniques and by bypassing various layers of the
Linux kernel. Josephson et al. proposed an optimized file system for SSDs called the
direct file system (DFS) (6). DFS places a virtual address space in the block I/O layer to manage the entire space
of the SSD, and it maps the virtual addresses to the physical addresses of the SSD. Since
it detects I/O requests that can be processed in the host and manages them directly,
it improves both lifetime and performance. In (7), Lee et al. proposed F2FS, a file system for SSDs. It is designed to
optimize the performance and lifetime of SSDs with a generic block interface, and it
outperforms EXT4 by up to 3.1 times. Matias et al. proposed LightNVM,
a method to offload internal operations of SSDs to the host (8). Because SSDs supporting LightNVM let the host access their internals,
the host kernel can perform write buffering, wear leveling, and garbage collection.
There have been several studies on I/O scheduling for NVMe SSDs. In (10), Joshi et al. implemented a mechanism that supports several I/O priorities for NVMe
SSDs by using the weighted round-robin (WRR) feature of the NVMe interface. They designed
four kinds of submission queues, Medium, High, Low, and Urgent, and I/O requests are
sent to one of them according to their priority. Lee et al. addressed the write interference
problem by isolating read and write requests into independent software queues (12). They showed that read operations could be processed up to 33% faster than in other
systems. Ahn et al. proposed weight-based dynamic throttling (WDT) for the Linux cgroup,
which allocates I/O resources more efficiently and replaces the existing multi-queue block
I/O layer (11). By intelligently predicting future I/O bandwidth requirements based on past I/O
service rates, they achieved highly accurate I/O resource sharing while reducing
wasted I/O bandwidth.
III. SUPPORTING THE PRIORITIES IN THE
MULTI-QUEUE BLOCK I/O LAYER
When a process is executed with a given priority, it may not actually run as expected.
The priority is a priority for the CPU scheduler, not a system-wide priority, and the
CPU priority information is not delivered to the I/O layer in the current Linux
kernel. Even if the CPU priority information were delivered to the I/O layer through some
additional mechanism, it is questionable whether I/O requests could be serviced according
to that priority, because there are so many queues involved: software queues, hardware
queues, and submission queues.
To dispel these doubts, we performed a simple experiment with a CPU-intensive program
and an I/O-intensive program: $\textit{sysbench}$, which searches for 8,000 prime numbers,
and $\textit{fio}$, which randomly reads 5GB of data from the SSD (14,17). We first executed 40 $\textit{sysbench}$ processes whose nice values range from -20
to 19. Note that the lower the nice value, the higher the actual priority. As seen
in Fig. 2, the runtimes of the $\textit{sysbench}$ processes increase with the nice value,
and the average performance difference between two adjacent processes in the priority
order is about 1.223 times. This is because the CFS scheduler allocates 1.25 times more
CPU weight to a process with priority $\textit{n}$ than to a process with
priority $\textit{n+1}$. In conclusion, since sysbench processes are CPU-intensive and
mainly use the CPU resource, they are sufficiently serviced by the CFS policy.
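For reference, the weight that CFS assigns to a task decreases roughly geometrically with its nice value $n$; the kernel keeps these weights in a precomputed table, and the closed form below is only an approximation used here for illustration:

$$w(n) \approx \frac{1024}{1.25^{\,n}}, \qquad \frac{w(n)}{w(n+1)} \approx 1.25,$$

where each runnable task receives a CPU share proportional to $w_i / \sum_j w_j$. Two adjacent nice levels therefore differ by about 25% in CPU time, which is consistent with the measured runtime ratio of about 1.223.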
Fig. 2. Performances of CPU-intensive and I/O intensive processes.
On the other hand, the experiment with $\textit{fio}$ under the same conditions shows
different results. The average performance difference between two adjacent processes
in the priority order is just 1.007 times; the runtime is rarely affected by the priority.
This result is expected, because fio is an I/O-intensive workload that rarely uses the
CPU resource and no mechanism supports the priority in the I/O layer. These experiments
show that how much the priority affects the execution of a process depends on its CPU usage.
In this section, we present a scheme that provides all processes with consistent service
according to their priorities, regardless of whether they are CPU-intensive or I/O-intensive.
To this end, we first deliver the CPU priority information to the I/O layer by adding
it to bio and request, which are the basic structures used for I/O processing in the
multi-queue block I/O layer. As can be seen in Fig. 3, we add another layer, consisting of priority queues corresponding to the CPU
priority levels, between the software queues and the hardware queues. As the I/O request type
is also an important consideration in designing an I/O scheduler, we divide all the
priority queues into two types: read and write.
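The sketch below illustrates how the priority and request type could travel with the block-layer structures; the structure and helper names (my_bio, my_request, and so on) are hypothetical stand-ins for the kernel's bio and request, not their actual fields.

/* A minimal sketch (not the actual kernel structures) of propagating
 * the CPU priority into the block layer: the nice value of the
 * submitting task is stored in the bio when it is created and is
 * inherited by the request built from that bio. */
struct my_bio {
    int prio;        /* nice value of the submitting process */
    int is_write;    /* request type: 0 = read, 1 = write    */
    /* ... sector, length, payload, ...                      */
};

struct my_request {
    int prio;        /* inherited from the bio               */
    int is_write;
    /* ...                                                   */
};

static void my_bio_set_prio(struct my_bio *bio, int task_nice, int is_write)
{
    bio->prio = task_nice;       /* -20 (highest) .. 19 (lowest)      */
    bio->is_write = is_write;
}

static void my_request_init(struct my_request *rq, const struct my_bio *bio)
{
    rq->prio = bio->prio;        /* priority travels with the request */
    rq->is_write = bio->is_write;
}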
Fig. 3. A priority-supported I/O scheduler.
As in the original multi-queue block I/O layer, the I/O requests issued by processes
are first sent to the software queues, and they are then moved to the priority queues
according to their priority and request type. The priority queue layer determines
whether to pass the I/O requests on to the hardware queues. For each priority
queue, we maintain a token count, which plays a vital role in our scheme. A token
is the right for an I/O request to move to the next queue. If the token count
is greater than 0, an I/O request can move to the next hardware queue with a token,
and the token count decreases accordingly. On the other hand, when the I/O processing of a request
is finished in the SSD and the completion is notified via the completion queue, the token count
increases again.
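A minimal sketch of this token accounting, assuming a single integer counter per priority queue, is given below; the function names are hypothetical and the locking of the real implementation is omitted.

/* Token accounting for one priority queue (sketch). */
#include <stdbool.h>

struct prio_queue {
    int tokens;      /* remaining tokens of this priority queue */
    /* ... list of pending requests ...                         */
};

/* Called when a request tries to leave the priority queue for the
 * hardware queue: it succeeds only if a token is available. */
static bool try_dispatch(struct prio_queue *pq)
{
    if (pq->tokens > 0) {
        pq->tokens--;        /* the request takes a token with it */
        return true;         /* move it to the hardware queue     */
    }
    return false;            /* no token: the request must wait   */
}

/* Called from the completion path: the SSD finished a request, so its
 * token is returned and a waiting request (if any) can be dispatched. */
static void complete_request(struct prio_queue *pq)
{
    pq->tokens++;
}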
Fig. 4 shows an example of how I/O requests move to the next queues using tokens.
In Fig. 4(a), when a read request with priority n arrives at the priority queue from the software
queue, it can move to the hardware queue immediately because the token count is 3,
which is greater than 0. Naturally, the token count is reduced to 2. On the contrary,
in Fig. 4(b), a read request cannot move to the hardware queue because the token count is 0. In
this case, the read request must wait in the priority queue corresponding to
priority n until other requests previously submitted to the SSD are completed and
the tokens occupied by them are returned. Finally, when the SSD finishes processing
some I/O requests and notifies the host of the completion by putting messages
in the completion queues, the waiting I/O requests in the priority queues move to the
hardware queues, or the token count is increased by the number of finished requests.
Fig. 4. A priority I/O scheduling through token counts (a) if token count > 0, (b) if token count is 0.
As only requests holding a token can move to the next hardware queue, we can control
the amount of I/O requests processed in the priority queue layer using the token
counts. In other words, the higher the token count, the more I/O requests can be
serviced within a given time period. To allocate more SSD resources to processes with
higher priorities, it is sufficient to give them larger token counts. This works
because I/O requests in the hardware queue are immediately submitted to the submission
queue if there is enough space in the submission queues, and NVMe SSDs usually dispatch
I/O requests from all submission queues in a round-robin fashion.
Eventually, the most important issue in our scheme is deciding appropriate initial
token counts for all the priority queues. They are determined based on the process
priority, the I/O type, and the number of processes with the same priority. First, we
determine the initial token counts for every priority queue by referencing the
CFS scheduler. Just as CFS allocates 1.25 times more weight to a process with priority
n than to a process with priority n+1, we allocate the initial token counts of the
priority queues in the same ratio. For example, if we initialize the token count of the
priority queue for priority 1 as 10, we can set the initial token count of the priority
queue for priority 0 as 12 or 13. Likewise, the initial token count of the priority queue
for priority 2 can be 8.
Second, we consider the I/O request type in determining the initial token counts.
Since read requests should be serviced preferentially over write requests, higher
token counts are given to read requests. Like existing Linux I/O schedulers
such as the deadline scheduler, our scheme sets the initial token counts for read requests
twice as high as those for write requests.
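The two rules above can be combined into a simple initialization routine, sketched below; the base value of 10 tokens for priority 1 follows the example given earlier, while the rounding and the minimum of one token are our own illustrative assumptions.

/* Initial token-count assignment (sketch).  Only the two rules from the
 * text are encoded: adjacent priorities differ by a factor of about
 * 1.25, and read queues get twice the count of write queues. */
#include <math.h>
#include <stdio.h>

static int initial_tokens(int prio, int is_read)
{
    double t = 10.0 * pow(1.25, 1 - prio);   /* 10 tokens at priority 1 (assumed base) */
    if (is_read)
        t *= 2.0;                            /* reads get twice as many tokens */
    if (t < 1.0)
        t = 1.0;                             /* every queue can still make progress */
    return (int)lround(t);
}

int main(void)
{
    for (int prio = -20; prio <= 19; prio++)  /* nice-value range */
        printf("priority %3d: read %5d, write %5d\n",
               prio, initial_tokens(prio, 1), initial_tokens(prio, 0));
    return 0;
}

For example, this yields 10 write tokens for priority 1, 13 for priority 0, and 8 for priority 2, matching the values above.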
Finally, the number of processes running with the same priority should also be considered
in determining the initial token counts. If we do not consider it, processes with
higher priority may be serviced less than processes with lower priority. Assume
that the initial token count of the priority queue for read requests with priority 0 is 12 and
that of the priority queue for read requests with priority 1 is 10.
Suppose there are three processes A, B, and C, all with priority 0 and all issuing read
requests. Since their priorities and request types are the same, they share one priority
queue and its token count. As the initial token count for them is 12, each process can
submit only four requests to the SSD on average. At the same time, if a process D with
priority 1 is running, it may submit up to 10 requests. To eliminate this irrationality,
we give each process its own priority queue and token count. Since the priority queues
are managed dynamically and each process can issue two types of requests, if the number
of running processes is n, the number of priority queues required is at most 2*n. As each
process has its own priority queue and token count, lock contention does not occur in the
priority queue layer.
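As a rough sketch of this per-process management, the priority queue of a request can be found or created from the pair (process ID, request type); the hash table and the names below are illustrative assumptions, since the real implementation allocates and frees the queues dynamically inside the kernel.

/* Per-process priority queue lookup (sketch), keyed by (pid, type). */
enum req_type { REQ_READ = 0, REQ_WRITE = 1 };

struct prio_queue {
    int pid;         /* owning process          */
    int type;        /* REQ_READ or REQ_WRITE   */
    int prio;        /* nice value of the owner */
    int tokens;      /* per-queue token count   */
    int used;
};

#define TABLE_SIZE 1024          /* >= 2 * maximum number of processes (assumed) */
static struct prio_queue table[TABLE_SIZE];

static struct prio_queue *get_queue(int pid, int type, int prio, int init_tokens)
{
    unsigned int h = ((unsigned int)pid * 2u + (unsigned int)type) % TABLE_SIZE;

    for (unsigned int i = 0; i < TABLE_SIZE; i++) {
        struct prio_queue *pq = &table[(h + i) % TABLE_SIZE];
        if (!pq->used) {                 /* empty slot: create the queue here */
            pq->pid = pid;
            pq->type = type;
            pq->prio = prio;
            pq->tokens = init_tokens;    /* e.g., from initial_tokens() above */
            pq->used = 1;
            return pq;
        }
        if (pq->pid == pid && pq->type == type)
            return pq;                   /* this process already has a queue  */
    }
    return 0;                            /* table full (not handled in sketch) */
}

Because each process only ever touches its own entries, no lock needs to be shared between processes in the priority queue layer, as noted above.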
IV. PERFORMANCE EVALUATION
We implemented our I/O scheduler in the Linux kernel by modifying Kyber, which is
one of the Linux I/O schedulers (19). Table 1 shows the experimental environment. In our experiments, eight software queues are
created according to the number of CPU cores. The numbers of hardware queues and submission
queues are also eight, because Linux generates these queues according to the number
of software queues if the NVMe SSD supports it. To generate I/O-intensive processes,
we used $\textit{fio}$ with the parameters described in Table 2.
Table 1. Experimental environment
CPU     | Intel i7-7700K (8 cores)
Memory  | 32GB
Storage | Samsung 960 PRO 512GB
OS      | Ubuntu 16.04 (Kernel 4.13.10)
Table 2. Parameter setting for $\textit{fio}$
Block size     | 4KB
Total I/O size | 8GB
IO engine      | libaio (iodepth: 2048)
We first examined whether the proposed I/O scheduler provides sufficiently differentiated
services based on the process priority and I/O type. To this end, we executed 80 $\textit{fio}$
processes at the same time: 40 processes issue only read requests, and the other 40
processes issue only write requests. Their priorities (nice values) range from -20 to 19.
Fig. 5 shows the runtimes of the fio processes as a function of priority. To measure
the effect of our I/O scheduler exactly, the fio processes are executed repeatedly in this experiment.
In the original kernel, which considers neither the I/O type (read or write) nor the priority
of the process that issues the I/Os in the block I/O layer, there is little performance
difference between read-intensive processes and write-intensive processes.
Fig. 5. Performance as a function of a priority.
Moreover, the priority rarely affects the performance, because the fio processes are I/O
intensive and the original kernel provides no support for I/O prioritization. On the
contrary, we can observe that the proposed scheme provides differentiated
I/O service based on the process priority and I/O type. For both read-intensive and write-intensive
processes, the runtimes increase as a function of priority.
The read-intensive processes show better performance than the write-intensive ones because
we give higher initial token counts to the priority queues for read requests.
Fig. 6 shows the average performance difference between two adjacent processes in the priority
order. In the original kernel, the performance difference for both read-intensive
and write-intensive processes is about 1.01, and they show almost constant performance
irrespective of priority and request type. With the proposed scheme, however, the
performance difference between two adjacent processes in the priority order is about
1.08; that is, a process with priority n performs 1.08 times better than one
with priority n+1.
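As a back-of-the-envelope consequence of these ratios (our own arithmetic, not a measured value), the per-step gap compounds over the 40 nice levels roughly as

$$1.08^{39} \approx 20 \qquad \text{versus} \qquad 1.01^{39} \approx 1.5,$$

so under the proposed scheme the highest-priority process would run roughly 20 times faster than the lowest-priority one, whereas in the original kernel the gap is only about 1.5 times.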
Fig. 6. Performance gap between two adjacent processes in the priority order.
Fig. 7 shows the overall performance of our scheme, namely the average runtime of all fio
processes. This experiment was performed in the same environment as Figs. 5 and 6, except that terminated fio processes are not re-executed. When our scheme
is used, the average runtimes of read-intensive and write-intensive processes are
improved by 33% and 15%, respectively. We believe this is because processes with
relatively lower priorities can occupy the remaining system resources released by
the terminated processes with higher priorities. Note that the overhead
of our scheme is included in this result. In another experiment performed with a single
fio process that issues only random read requests, the average latency increased
by 8.7% when our I/O scheduler was employed.
Fig. 7. Average runtime of fio processes.
Since our scheme allocates a priority queue per process, processes are not penalized
even when many processes share the same priority. Fig. 8 shows the average runtime when there are several groups of processes with the same
priority, as described in Table 3. We can see that the runtime of each $\textit{fio}$ process again increases as a function
of priority, regardless of how many processes have the same priority.
For example, although there are many processes with priority -15, they do not share
any priority queue or token count, so their average runtime is still lower
than those of the other processes.
Table 3. Number of running processes per priority
Priority
|
-15
|
-10
|
-5
|
0
|
5
|
10
|
15
|
# of processes
|
64
|
32
|
16
|
8
|
4
|
2
|
1
|
Fig. 8. Performance of processes with same priorities.
V. CONCLUSIONS
Although the multi-queue block I/O layer provides
high I/O bandwidth, it still lacks differentiated I/O
services. In this paper, we presented a scheme that supports
process priorities in the multi-queue block I/O layer. By
employing the weighting concept of the CFS CPU scheduler,
our scheme provides differentiated services according to the
process priority even for I/O-intensive processes.
ACKNOWLEDGMENTS
The present research has been conducted by the
Research Grant of Kwangwoon University in 2019. It
was also supported by the National Research Foundation
of Korea (NRF) grant funded by the Korea government
(MSIP; Ministry of Science, ICT & Future Planning) (No.
2017R1A2B4008536).
REFERENCES
(1) Matias B., Axboe J., Nellans D., Bonnet P., Jun 2013, Linux block IO: introducing multi-queue SSD access on multi-core systems, Systems and Storage, SYSTOR '13, 6th ACM International Conference on, No. 22
(2) NVMe overview, https://www.nvmexpress.org/wp-content/uploads/NVMe_Overview.pdf
(3) Zhang J., et al., 2018, FlashShare: Punching through server storage stack from kernel to firmware for ultra-low latency SSDs, Operating Systems Design and Implementation, OSDI '18, 13th USENIX Symposium on, pp. 477-492
(4) Huffman A., NVM Express Base Specification Revision 1.3c, https://nvmexpress.org/wp-content/uploads/NVM-Express-1_3c-2018.05.24-Ratified.pdf
(5) Kim H., Lee Y., Kim J., 2016, NVMeDirect: A User-space I/O Framework for Application-specific Optimization on NVMe SSDs, Hot Topics in Storage and File Systems, HotStorage '16, 8th USENIX Workshop on
(6) Josephson W. K., Bongo L. A., Li K., Flynn D., Sep 2010, DFS: A File System for Virtualized Flash Storage, Transactions on Storage, ACM Journal of, Vol. 6, No. 3
(7) Lee C., Sim D., Hwang J., 2015, F2FS: A New File System for Flash Storage, File and Storage Technologies, FAST '15, 13th USENIX Conference on, pp. 273-286
(8) Matias B., González J., Bonnet P., 2017, LightNVM: The Linux Open-Channel SSD Subsystem, File and Storage Technologies, FAST '17, 15th USENIX Conference on, pp. 359-374
(9) Peng B., Zhang H., Yao J., Dong Y., Xu Y., Guan H., 2018, MDev-NVMe: A NVMe Storage Virtualization Solution with Mediated Pass-Through, Annual Technical Conference, ATC '18, USENIX Conference on, pp. 665-676
(10) Joshi K., Yadav K., Choudhary P., 2017, Enabling NVMe WRR support in Linux Block Layer, Hot Topics in Storage and File Systems, HotStorage '17, 9th USENIX Workshop on
(11) Ahn S., La K., Kim J., 2016, Improving I/O Resource Sharing of Linux Cgroup for NVMe SSDs on Multi-core Systems, Hot Topics in Storage and File Systems, HotStorage '16, 8th USENIX Workshop on
(12) Lee M., Kang D., Lee M., Eom Y., 2017, Improving Read Performance by Isolating Multiple Queues in NVMe SSDs, Ubiquitous Information Management and Communication, IMCOM '17, 11th ACM International Conference on, No. 36, pp. 1-6
(13) Kim S., Yang J., 2018, Optimized I/O Determinism for Emerging NVM-based NVMe SSD in an Enterprise System, Design Automation Conference, DAC '18, 55th ACM Conference on, No. 56, pp. 1-6
(14) sysbench, https://github.com/akopytov/sysbench
(15) Kim T., Kang D., Lee D., Eom Y., 2015, Improving performance by bridging the semantic gap between multi-queue SSD and I/O virtualization framework, Mass Storage Systems and Technologies (MSST), 31st IEEE Symposium on, pp. 1-11
(16) Oh M., Eom H., Yeom H., Dec 2014, Enhancing the I/O system for virtual machines using high performance SSDs, Performance Computing and Communications Conference (IPCCC), 33rd IEEE International Conference on, pp. 1-8
(17) Flexible I/O Tester, https://github.com/axboe/fio
(18) Eshghi K., Micheloni R., 2013, SSD Architecture and PCI Express Interface, Inside Solid State Drives (SSDs), Springer, Dordrecht, Vol. 37, pp. 19-45
(19) Sandoval O., Kyber multi-queue I/O scheduler, https://lwn.net/Articles/720071/
Author
Kyusik Kim received the BS
degree in computer engineering
from Kwangwoon University, Korea,
in 2014.
He is currently working
toward the combined MS and PhD
degree at the School of Computer
Engineering, Kwangwoon University.
His research interests include storage system, operating
system, embedded system, solid-state drives, and next-generation
nonvolatile memories.
Seungkyu Hong received the BS
degree in computer engineering from
Kwangwoon University, Korea, in
2017.
He is currently working toward
the combined MS degree at the
School of Computer Engineering,
Kwangwoon University.
His research
interests include operating systems, flash memories, and
next-generation nonvolatile memories.
Taeseok Kim received the B.S.
degree in computer science and the
M.S. and Ph.D. degrees in computer
science and engineering from Seoul
National University, South Korea, in
2000, 2002, and 2007, respectively.
In 2007, he was a Senior Research
Engineer with Samsung Electronics, Suwon, South
Korea. In 2008, he joined the Department of Computer
Engineering, Kwangwoon University, Seoul, South
Korea, where he is currently a Professor.
His research
interests are primarily in computer systems such as
operating systems, storage systems, multimedia systems,
and embedded systems.