I. INTRODUCTION
Non-volatile memory express (NVMe) is an open logical device interface specification
for accessing non-volatile storage media such as solid-state drives (SSDs) (18). As NVMe has been designed to capitalize on the low latency and internal parallelism
of SSDs, it can reduce I/O overhead and improve I/O bandwidth. Unlike
SATA SSDs that use native command queuing (NCQ) with a single I/O queue, NVMe SSDs
employ multiple submission queues (SQs) that can process I/O requests received from
the host in parallel.
To support NVMe SSDs efficiently, a multi-queue block I/O layer has been introduced
in the Linux kernel. It uses two sets of queues, software queues and hardware queues,
in order to improve scalability (10). The software queues alleviate the lock contention problem in multi-core
environments, and the hardware queues exploit the parallelism of SSDs. With these many
queues, the multi-queue block I/O layer can provide high I/O bandwidth, but it still
lacks differentiated service for I/O requests. Although every process is executed with
a priority, that priority is honored only by the CPU scheduler, not by the I/O scheduler.
As a result, I/O-intensive processes are not serviced according to their priorities,
even when those priorities are very high. In this paper, we present a scheme that
efficiently supports process priorities in the multi-queue block I/O layer. To service
processes exactly according to their priorities, we bring the weighting technique of
the completely fair scheduler (CFS), the CPU scheduler used in the current Linux kernel,
into the I/O layer. Between the software queues and the hardware queues, we insert
another layer consisting of priority queues. For each priority queue, we maintain a
token count; a token is the right for an I/O request to move on to the hardware queue.
By allocating token counts to the priority queues according to the process priorities,
in the same way that CFS allocates weights, processes are serviced according to their
priorities regardless of whether they are I/O-intensive or not. Our scheme also considers
the request type: it services read requests preferentially over write requests.
By appropriately controlling the number of requests submitted to the SSD in the
priority queue layer, our scheme provides differentiated I/O service.
The remainder of this paper is organized as follows. First, we describe the background
and related work in Section II. Then, we explain the proposed scheme for supporting
priorities in the multi-queue block I/O layer in Section III. The experimental
results and concluding remarks are presented in Sections IV and V, respectively.
II. RELATED WORK
NVMe is a high-performance, scalable host controller interface designed for PCIe-based
SSDs and non-volatile memories (2). To provide high-speed I/O, it supports up to 65,535 submission and completion queues,
each of which can hold up to 64K commands (3,4,9,13). Such a scalable architecture enables the internal parallelism of SSDs to be fully
utilized. The host submits I/O requests to the submission queues, and the SSD
dispatches them from the submission queues in a round-robin fashion. After the SSD
has processed a request, it inserts a completion entry into the corresponding completion
queue and notifies the host of the completion.
The multi-queue block I/O layer of Linux is a structure designed for computer systems
with multiple CPU cores and storage devices supporting multiple queues, such as NVMe SSDs. In
the previous single-queue block I/O layer, all I/O requests issued by processes
running on each CPU core are sent to a single request queue (Fig. 1(a)). As a result, the request queue becomes a performance bottleneck because of lock contention.
To alleviate this problem, the multi-queue block I/O layer employs
two levels of queues: software queues and hardware queues (Fig. 1(b)). I/O requests issued by a process running on a CPU core are first sent to the
software queue attached to that CPU core, and they then move to the hardware queue mapped
to that software queue. If there are fewer hardware queues than software queues,
two or more software queues can be mapped to one hardware queue. The I/O requests
are finally sent to the submission queues of the NVMe controller, which are mapped
to the hardware queues.
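This many-to-one mapping can be illustrated with a small sketch; the modulo policy and the names below are simplifying assumptions for illustration, not the actual blk-mq mapping code of the kernel.

/* A minimal sketch of the many-to-one mapping between software queues
 * (one per CPU core) and hardware queues, assuming a simple modulo
 * policy; the real Linux blk-mq mapping is more elaborate. */
#include <stdio.h>

static unsigned int map_sw_to_hw(unsigned int cpu, unsigned int nr_hw_queues)
{
    return cpu % nr_hw_queues;   /* several CPUs may share one hardware queue */
}

int main(void)
{
    unsigned int nr_cpus = 8, nr_hw_queues = 4;
    for (unsigned int cpu = 0; cpu < nr_cpus; cpu++)
        printf("software queue of CPU %u -> hardware queue %u\n",
               cpu, map_sw_to_hw(cpu, nr_hw_queues));
    return 0;
}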
Fig. 1. Block I/O layers of Linux (a) Single-queue block I/O layer, (b) Multi-queue
block I/O layer.
Research on optimizing the Linux I/O stack has been actively conducted. Kim et
al. proposed NVMeDirect, a user-space I/O framework for
directly accessing the submission and completion queues of NVMe SSDs (5). As it is implemented at the user level, it can be easily optimized for target applications
by exploiting SSD-specific techniques and by bypassing various layers of the
Linux kernel. Josephson et al. proposed an optimized file system for SSDs called the
direct file system (DFS) (6). DFS places a virtual address space in the block I/O layer to manage the entire space
of the SSD, and it maps the virtual addresses to the physical addresses of the SSD. Since
it detects I/O requests that can be processed in the host and manages them directly,
it improves both lifetime and performance. In (7), Lee et al. proposed F2FS, a file system for SSDs. It is designed to
optimize the performance and lifetime of SSDs with a generic block interface, and it
outperforms EXT4 by up to 3.1 times. Matias et al. proposed LightNVM,
a method to offload internal operations of SSDs to the host (8). Because SSDs supporting LightNVM let the host access their internals,
the host kernel can perform write buffering, wear leveling, and garbage collection.
There have been several studies on I/O scheduling for NVMe SSDs. In (10), Joshi et al. implemented a mechanism that supports several I/O priorities for NVMe
SSDs by using the weighted round-robin (WRR) feature of the NVMe interface. They designed
four kinds of submission queues, Medium, High, Low, and Urgent, and I/O requests are
sent to one of them according to their priority. Lee et al. addressed the write interference
problem by isolating read and write requests into independent software queues (12). They showed that read operations could be processed up to 33% faster than in other
systems. Ahn et al. proposed weight-based dynamic throttling (WDT) for the Linux cgroup,
which allocates I/O resources more efficiently and replaces the existing multi-queue block
I/O layer (11). By intelligently predicting future I/O bandwidth requirements based on past I/O
service rates, they achieved highly accurate I/O resource sharing while reducing
wasted I/O bandwidth.
III. SUPPORTING THE PRIORITIES IN THE
MULTI-QUEUE BLOCK I/O LAYER
When a process is executed with a given priority, it may not actually run as expected.
The priority is a priority for the CPU scheduler, not a system-wide priority, and the
CPU priority information is not delivered to the I/O layer in the current Linux
kernel. Even if the CPU priority information were delivered to the I/O layer through some
additional mechanism, it is questionable whether I/O requests could be serviced according
to that priority, because there are so many queues involved: software queues, hardware
queues, and submission queues.
To dispel these doubts, we performed a simple experiment with a CPU-intensive program
and an I/O-intensive program: $\textit{sysbench}$, which searches for 8,000 prime numbers,
and $\textit{fio}$, which randomly reads 5GB of data from the SSD (14,17). We first executed 40 $\textit{sysbench}$ processes whose nice values range from -20
to 19. Note that the lower the nice value, the higher the actual priority. As seen
in Fig. 2, the runtimes of the $\textit{sysbench}$ processes increase with the nice value,
and the average performance difference between two adjacent processes in the priority
order is about 1.223 times. This is because the CFS scheduler allocates 1.25 times more
CPU weight to a process with priority $\textit{n}$ than to a process with
priority $\textit{n+1}$. In conclusion, since sysbench processes are CPU-intensive and
mainly use the CPU resource, they are sufficiently serviced by the CFS policy.
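For reference, the weight that CFS assigns to a task decreases roughly geometrically with its nice value $n$; the kernel keeps these weights in a precomputed table, and the closed form below is only an approximation used here for illustration:

$$w(n) \approx \frac{1024}{1.25^{\,n}}, \qquad \frac{w(n)}{w(n+1)} \approx 1.25,$$

where each runnable task receives a CPU share proportional to $w_i / \sum_j w_j$. Two adjacent nice levels therefore differ by about 25% in CPU time, which is consistent with the measured runtime ratio of about 1.223.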
Fig. 2. Performances of CPU-intensive and I/O intensive processes.
On the other hand, the experiment with $\textit{fio}$ under the same conditions shows
different results. The average performance difference between two adjacent processes
in the priority order is just 1.007 times; the runtime is rarely affected by the priority.
This result is expected, because fio is an I/O-intensive workload that rarely uses the
CPU resource and no mechanism supports the priority in the I/O layer. These experiments
show that how much the priority affects the execution of a process depends on its CPU usage.
In this section, we present a scheme that provides all processes with consistent service
according to their priorities, regardless of whether they are CPU-intensive or I/O-intensive.
To this end, we first deliver the CPU priority information to the I/O layer by adding
it to bio and request, which are the basic structures used for I/O processing in the
multi-queue block I/O layer. As can be seen in Fig. 3, we add another layer, consisting of priority queues corresponding to the CPU
priority levels, between the software queues and the hardware queues. As the I/O request type
is also an important consideration in designing an I/O scheduler, we divide all the
priority queues into two types: read and write.
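The sketch below illustrates how the priority and request type could travel with the block-layer structures; the structure and helper names (my_bio, my_request, and so on) are hypothetical stand-ins for the kernel's bio and request, not their actual fields.

/* A minimal sketch (not the actual kernel structures) of propagating
 * the CPU priority into the block layer: the nice value of the
 * submitting task is stored in the bio when it is created and is
 * inherited by the request built from that bio. */
struct my_bio {
    int prio;        /* nice value of the submitting process */
    int is_write;    /* request type: 0 = read, 1 = write    */
    /* ... sector, length, payload, ...                      */
};

struct my_request {
    int prio;        /* inherited from the bio               */
    int is_write;
    /* ...                                                   */
};

static void my_bio_set_prio(struct my_bio *bio, int task_nice, int is_write)
{
    bio->prio = task_nice;       /* -20 (highest) .. 19 (lowest)      */
    bio->is_write = is_write;
}

static void my_request_init(struct my_request *rq, const struct my_bio *bio)
{
    rq->prio = bio->prio;        /* priority travels with the request */
    rq->is_write = bio->is_write;
}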
Fig. 3. A priority-supported I/O scheduler.
As in the original multi-queue block I/O layer, the I/O requests issued by processes
are first sent to the software queues, and they are then moved to the priority queues
according to their priority and request type. The priority queue layer determines
whether to pass the I/O requests on to the hardware queues. For each priority
queue, we maintain a token count, which plays a vital role in our scheme. A token
is the right for an I/O request to move to the next queue. If the token count
is greater than 0, an I/O request can move to the next hardware queue with a token,
and the token count decreases accordingly. On the other hand, when the I/O processing of a request
is finished in the SSD and the completion is notified via the completion queue, the token count
increases again.
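A minimal sketch of this token accounting, assuming a single integer counter per priority queue, is given below; the function names are hypothetical and the locking of the real implementation is omitted.

/* Token accounting for one priority queue (sketch). */
#include <stdbool.h>

struct prio_queue {
    int tokens;      /* remaining tokens of this priority queue */
    /* ... list of pending requests ...                         */
};

/* Called when a request tries to leave the priority queue for the
 * hardware queue: it succeeds only if a token is available. */
static bool try_dispatch(struct prio_queue *pq)
{
    if (pq->tokens > 0) {
        pq->tokens--;        /* the request takes a token with it */
        return true;         /* move it to the hardware queue     */
    }
    return false;            /* no token: the request must wait   */
}

/* Called from the completion path: the SSD finished a request, so its
 * token is returned and a waiting request (if any) can be dispatched. */
static void complete_request(struct prio_queue *pq)
{
    pq->tokens++;
}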
Fig. 4 shows an example of how I/O requests move to the next queues using tokens.
In Fig. 4(a), when a read request with priority n arrives at the priority queue from the software
queue, it can move to the hardware queue immediately because the token count is 3,
which is greater than 0. Naturally, the token count is reduced to 2. On the contrary,
in Fig. 4(b), a read request cannot move to the hardware queue because the token count is 0. In
this case, the read request must wait in the priority queue corresponding to
priority n until other requests previously submitted to the SSD are completed and
the tokens occupied by them are returned. Finally, when the SSD finishes processing
some I/O requests and notifies the host of the completion by putting messages
in the completion queues, the waiting I/O requests in the priority queues move to the
hardware queues, or the token count is increased by the number of finished requests.
Fig. 4. A priority I/O scheduling through token counts (a) if token count > 0, (b) if token count is 0.
As only requests holding a token can move to the next hardware queue, we can control
the amount of I/O requests processed in the priority queue layer using the token
counts. In other words, the higher the token count, the more I/O requests can be
serviced within a given time period. To allocate more SSD resources to processes with
higher priorities, it is sufficient to give them larger token counts. This works
because I/O requests in the hardware queue are immediately submitted to the submission
queue if there is enough space in the submission queues, and NVMe SSDs usually dispatch
I/O requests from all submission queues in a round-robin fashion.
Eventually, the most important issue in our scheme is deciding appropriate initial
token counts for all the priority queues. They are determined based on the process
priority, the I/O type, and the number of processes with the same priority. First, we
determine the initial token counts for every priority queue by referencing the
CFS scheduler. Just as CFS allocates 1.25 times more weight to a process with priority
n than to a process with priority n+1, we allocate the initial token counts of the
priority queues in the same ratio. For example, if we initialize the token count of the
priority queue for priority 1 as 10, we can set the initial token count of the priority
queue for priority 0 as 12 or 13. Likewise, the initial token count of the priority queue
for priority 2 can be 8.
Second, we consider the I/O request type in determining the initial token counts.
Since read requests should be serviced preferentially over write requests, higher
token counts are given to read requests. Like existing Linux I/O schedulers
such as the deadline scheduler, our scheme sets the initial token counts for read requests
twice as high as those for write requests.
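The two rules above can be combined into a simple initialization routine, sketched below; the base value of 10 tokens for priority 1 follows the example given earlier, while the rounding and the minimum of one token are our own illustrative assumptions.

/* Initial token-count assignment (sketch).  Only the two rules from the
 * text are encoded: adjacent priorities differ by a factor of about
 * 1.25, and read queues get twice the count of write queues. */
#include <math.h>
#include <stdio.h>

static int initial_tokens(int prio, int is_read)
{
    double t = 10.0 * pow(1.25, 1 - prio);   /* 10 tokens at priority 1 (assumed base) */
    if (is_read)
        t *= 2.0;                            /* reads get twice as many tokens */
    if (t < 1.0)
        t = 1.0;                             /* every queue can still make progress */
    return (int)lround(t);
}

int main(void)
{
    for (int prio = -20; prio <= 19; prio++)  /* nice-value range */
        printf("priority %3d: read %5d, write %5d\n",
               prio, initial_tokens(prio, 1), initial_tokens(prio, 0));
    return 0;
}

For example, this yields 10 write tokens for priority 1, 13 for priority 0, and 8 for priority 2, matching the values above.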
Finally, the number of processes running with the same priority should also be considered
in determining the initial token counts. If we do not consider it, processes with
higher priority may be serviced less than processes with lower priority. Assume
that the initial token count of the priority queue for read requests with priority 0 is 12 and
that of the priority queue for read requests with priority 1 is 10.
Suppose there are three processes A, B, and C, all with priority 0 and all issuing read
requests. Since their priorities and request types are the same, they share one priority
queue and its token count. As the initial token count for them is 12, each process can
submit only four requests to the SSD on average. At the same time, if a process D with
priority 1 is running, it may submit up to 10 requests. To eliminate this irrationality,
we give each process its own priority queue and token count. Since the priority queues
are managed dynamically and each process can issue two types of requests, if the number
of running processes is n, the number of priority queues required is at most 2*n. As each
process has its own priority queue and token count, lock contention does not occur in the
priority queue layer.
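As a rough sketch of this per-process management, the priority queue of a request can be found or created from the pair (process ID, request type); the hash table and the names below are illustrative assumptions, since the real implementation allocates and frees the queues dynamically inside the kernel.

/* Per-process priority queue lookup (sketch), keyed by (pid, type). */
enum req_type { REQ_READ = 0, REQ_WRITE = 1 };

struct prio_queue {
    int pid;         /* owning process          */
    int type;        /* REQ_READ or REQ_WRITE   */
    int prio;        /* nice value of the owner */
    int tokens;      /* per-queue token count   */
    int used;
};

#define TABLE_SIZE 1024          /* >= 2 * maximum number of processes (assumed) */
static struct prio_queue table[TABLE_SIZE];

static struct prio_queue *get_queue(int pid, int type, int prio, int init_tokens)
{
    unsigned int h = ((unsigned int)pid * 2u + (unsigned int)type) % TABLE_SIZE;

    for (unsigned int i = 0; i < TABLE_SIZE; i++) {
        struct prio_queue *pq = &table[(h + i) % TABLE_SIZE];
        if (!pq->used) {                 /* empty slot: create the queue here */
            pq->pid = pid;
            pq->type = type;
            pq->prio = prio;
            pq->tokens = init_tokens;    /* e.g., from initial_tokens() above */
            pq->used = 1;
            return pq;
        }
        if (pq->pid == pid && pq->type == type)
            return pq;                   /* this process already has a queue  */
    }
    return 0;                            /* table full (not handled in sketch) */
}

Because each process only ever touches its own entries, no lock needs to be shared between processes in the priority queue layer, as noted above.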
IV. PERFORMANCE EVALUATION
We implemented our I/O scheduler in the Linux kernel by modifying Kyber, which is
one of the Linux I/O schedulers (19). Table 1 shows the experimental environment. In our experiments, eight software queues are
created according to the number of CPU cores. The numbers of hardware queues and submission
queues are also eight, because Linux generates these queues according to the number
of software queues if the NVMe SSD supports it. To generate I/O-intensive processes,
we used $\textit{fio}$ with the parameters described in Table 2.
Table 1. Experimental environment
CPU     | Intel i7-7700K (8 cores)
Memory  | 32GB
Storage | Samsung 960 PRO 512GB
OS      | Ubuntu 16.04 (Kernel 4.13.10)
Table 2. Parameter setting for $\textit{fio}$
Block size     | 4KB
Total I/O size | 8GB
IO engine      | libaio (iodepth: 2048)
We first examined whether the proposed I/O scheduler provides sufficiently differentiated
services based on the process priority and I/O type. To this end, we executed 80 $\textit{fio}$
processes at the same time: 40 processes issue only read requests, and the other 40
processes issue only write requests. Their priorities (nice values) range from -20 to 19.
Fig. 5 shows the runtimes of the fio processes as a function of priority. To measure
the effect of our I/O scheduler exactly, the fio processes are executed repeatedly in this experiment.
In the original kernel, which considers neither the I/O type (read or write) nor the priority
of the process that issues the I/Os in the block I/O layer, there is little performance
difference between read-intensive processes and write-intensive processes.
Fig. 5. Performance as a function of a priority.
Moreover, the priority rarely affects the performance, because the fio processes are I/O
intensive and the original kernel provides no support for I/O prioritization. On the
contrary, we can observe that the proposed scheme provides differentiated
I/O service based on the process priority and I/O type. For both read-intensive and write-intensive
processes, the runtimes increase as a function of priority.
The read-intensive processes show better performance than the write-intensive ones because
we give higher initial token counts to the priority queues for read requests.
Fig. 6 shows the average performance difference between two adjacent processes in the priority
order. In the original kernel, the performance difference for both read-intensive
and write-intensive processes is about 1.01, and they show almost constant performance
irrespective of priority and request type. With the proposed scheme, however, the
performance difference between two adjacent processes in the priority order is about
1.08; that is, a process with priority n performs 1.08 times better than one
with priority n+1.
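As a back-of-the-envelope consequence of these ratios (our own arithmetic, not a measured value), the per-step gap compounds over the 40 nice levels roughly as

$$1.08^{39} \approx 20 \qquad \text{versus} \qquad 1.01^{39} \approx 1.5,$$

so under the proposed scheme the highest-priority process would run roughly 20 times faster than the lowest-priority one, whereas in the original kernel the gap is only about 1.5 times.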
Fig. 6. Performance gap between two adjacent processes in the priority order.
Fig. 7 shows the overall performance of our scheme, namely the average runtime of all fio
processes. This experiment was performed in the same environment as Figs. 5 and 6, except that terminated fio processes are not re-executed. When our scheme
is used, the average runtimes of read-intensive and write-intensive processes are
improved by 33% and 15%, respectively. We believe this is because processes with
relatively lower priorities can occupy the remaining system resources released by
the terminated processes with higher priorities. Note that the overhead
of our scheme is included in this result. In another experiment performed with a single
fio process that issues only random read requests, the average latency increased
by 8.7% when our I/O scheduler was employed.
Fig. 7. Average runtime of fio processes.
Since our scheme allocates a priority queue per process, processes are not penalized
even when many processes share the same priority. Fig. 8 shows the average runtime when there are several groups of processes with the same
priority, as described in Table 3. We can see that the runtime of each $\textit{fio}$ process again increases as a function
of priority, regardless of how many processes have the same priority.
For example, although there are many processes with priority -15, they do not share
any priority queue or token count, so their average runtime is still lower
than those of the other processes.
Table 3. Number of running processes per priority
Priority
|
-15
|
-10
|
-5
|
0
|
5
|
10
|
15
|
# of processes
|
64
|
32
|
16
|
8
|
4
|
2
|
1
|
Fig. 8. Performance of processes with same priorities.
V. CONCLUSIONS
Although the multi-queue block I/O layer provides
high I/O bandwidth, it still lacks differentiated I/O
services. In this paper, we presented a scheme that supports
process priorities in the multi-queue block I/O layer. By
employing the weighting concept of the CFS CPU scheduler,
our scheme provides differentiated services according to the
process priority even for I/O-intensive processes.
ACKNOWLEDGMENTS
The present research has been conducted by the
Research Grant of Kwangwoon University in 2019. It
was also supported by the National Research Foundation
of Korea (NRF) grant funded by the Korea government
(MSIP; Ministry of Science, ICT & Future Planning) (No.
2017R1A2B4008536).
REFERENCES
(1) Matias B., Axboe J., Nellans D., Bonnet P., Jun 2013, Linux block IO: introducing multi-queue SSD access on multi-core systems, Systems and Storage, SYSTOR '13, 6th ACM International Conference on, No. 22
(2) NVMe overview, https://www.nvmexpress.org/wp-content/uploads/NVMe_Overview.pdf
(3) Zhang J., et al., 2018, FlashShare: Punching through server storage stack from kernel to firmware for ultra-low latency SSDs, Operating Systems Design and Implementation, OSDI '18, 13th USENIX Symposium on, pp. 477-492
(4) Huffman A., NVM Express Base Specification Revision 1.3c, https://nvmexpress.org/wp-content/uploads/NVM-Express-1_3c-2018.05.24-Ratified.pdf
(5) Kim H., Lee Y., Kim J., 2016, NVMeDirect: A User-space I/O Framework for Application-specific Optimization on NVMe SSDs, Hot Topics in Storage and File Systems, HotStorage '16, 8th USENIX Workshop on
(6) Josephson W. K., Bongo L. A., Li K., Flynn D., Sep 2010, DFS: A File System for Virtualized Flash Storage, Transactions on Storage, ACM Journal of, Vol. 6, No. 3
(7) Lee C., Sim D., Hwang J., 2015, F2FS: A New File System for Flash Storage, File and Storage Technologies, FAST '15, 13th USENIX Conference on, pp. 273-286
(8) Matias B., González J., Bonnet P., 2017, LightNVM: The Linux Open-Channel SSD Subsystem, File and Storage Technologies, FAST '17, 15th USENIX Conference on, pp. 359-374
(9) Peng B., Zhang H., Yao J., Dong Y., Xu Y., Guan H., 2018, MDev-NVMe: A NVMe Storage Virtualization Solution with Mediated Pass-Through, Annual Technical Conference, ATC '18, USENIX Conference on, pp. 665-676
(10) Joshi K., Yadav K., Choudhary P., 2017, Enabling NVMe WRR support in Linux Block Layer, Hot Topics in Storage and File Systems, HotStorage '17, 9th USENIX Workshop on
(11) Ahn S., La K., Kim J., 2016, Improving I/O Resource Sharing of Linux Cgroup for NVMe SSDs on Multi-core Systems, Hot Topics in Storage and File Systems, HotStorage '16, 8th USENIX Workshop on
(12) Lee M., Kang D., Lee M., Eom Y., 2017, Improving Read Performance by Isolating Multiple Queues in NVMe SSDs, Ubiquitous Information Management and Communication, IMCOM '17, 11th ACM International Conference on, No. 36, pp. 1-6
(13) Kim S., Yang J., 2018, Optimized I/O Determinism for Emerging NVM-based NVMe SSD in an Enterprise System, Design Automation Conference, DAC '18, 55th ACM Conference on, No. 56, pp. 1-6
(14) sysbench, https://github.com/akopytov/sysbench
(15) Kim T., Kang D., Lee D., Eom Y., 2015, Improving performance by bridging the semantic gap between multi-queue SSD and I/O virtualization framework, Mass Storage Systems and Technologies (MSST), 31st IEEE Symposium on, pp. 1-11
(16) Oh M., Eom H., Yeom H., Dec 2014, Enhancing the I/O system for virtual machines using high performance SSDs, Performance Computing and Communications Conference (IPCCC), 33rd IEEE International Conference on, pp. 1-8
(17) Flexible I/O Tester, https://github.com/axboe/fio
(18) Eshghi K., Micheloni R., 2013, SSD Architecture and PCI Express Interface, Inside Solid State Drives (SSDs), Springer, Dordrecht, Vol. 37, pp. 19-45
(19) Sandoval O., Kyber multi-queue I/O scheduler, https://lwn.net/Articles/720071/
Author
Kyusik Kim received the BS
degree in computer engineering
from Kwangwoon University, Korea,
in 2014.
He is currently working
toward the combined MS and PhD
degree at the School of Computer
Engineering, Kwangwoon University.
His research interests include storage system, operating
system, embedded system, solid-state drives, and next-generation
nonvolatile memories.
Seungkyu Hong received the BS
degree in computer engineering from
Kwangwoon University, Korea, in
2017.
He is currently working toward
the combined MS degree at the
School of Computer Engineering,
Kwangwoon University.
His research
interests include operating systems, flash memories, and
next-generation nonvolatile memories.
Taeseok Kim received the B.S.
degree in computer science and the
M.S. and Ph.D. degrees in computer
science and engineering from Seoul
National University, South Korea, in
2000, 2002, and 2007, respectively.
In 2007, he was a Senior Research
Engineer with Samsung Electronics, Suwon, South
Korea. In 2008, he joined the Department of Computer
Engineering, Kwangwoon University, Seoul, South
Korea, where he is currently a Professor.
His research
interests are primarily in computer systems such as
operating systems, storage systems, multimedia systems,
and embedded systems.