Son Suho1
Oh Gijun1
Ahn Sungyong1*
(School of Computer Science and Engineering, Pusan National University, Busan 26241,
Korea)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Index Terms
NVMe SSDs, proportional-share I/O scheduler, multi-queue block layer, Linux, Cgroups
I. INTRODUCTION
In cloud computing services, it is a critical problem that system resources must be isolated and allocated among multiple tenants according to their different service requirements
(3). In particular, as high-performance storage devices such as SSDs are introduced into cloud
computing, sharing the I/O resources of SSDs among multiple tenants with little
overhead has become an important problem. Linux Cgroups (5,6) is a resource control framework of Linux that allocates system resources such as
CPU, memory, and I/O bandwidth to each process group (called a cgroup). Specifically,
it provides a proportional I/O resource sharing scheme that distributes I/O bandwidth
to each cgroup in proportion to its I/O weight through an I/O scheduler.
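As an illustration of this interface, the short user-space sketch below shows how an I/O weight might be assigned to a cgroup by writing to the cgroup filesystem. The cgroup path is hypothetical, and the weight file name (io.weight, as in cgroup v2) and valid range depend on the kernel version and on which proportional-share scheduler is active; this is a hedged example, not part of the proposed scheduler.

```c
/* Minimal sketch: assign an I/O weight to a cgroup through the cgroup
 * filesystem. The cgroup path and the weight file name (io.weight, as in
 * cgroup v2) are assumptions; the exact interface depends on the kernel
 * version and the active proportional-share scheduler. */
#include <stdio.h>

static int set_io_weight(const char *cgroup_path, int weight)
{
    char path[512];
    FILE *fp;

    snprintf(path, sizeof(path), "%s/io.weight", cgroup_path);
    fp = fopen(path, "w");
    if (!fp)
        return -1;

    /* Weights in this paper range from 100 to 1000. */
    fprintf(fp, "%d\n", weight);
    fclose(fp);
    return 0;
}

int main(void)
{
    /* Hypothetical cgroup created beforehand by the administrator. */
    return set_io_weight("/sys/fs/cgroup/tenant_a", 1000);
}
```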
NAND flash memory-based SSDs (Solid-State Drives) have recently become very popular as secondary
storage devices because of their outstanding I/O performance compared to HDDs (Hard-Disk
Drives). However, the legacy storage interface, Serial ATA (SATA), supports only a single
I/O queue and therefore provides insufficient maximum bandwidth, so it cannot take full
advantage of the internal parallelism of SSDs. NVM Express (NVMe) (4) is the state-of-the-art storage interface that provides multiple submission queues for
SSDs, so it can achieve more than one million IOPS by maximizing the internal parallelism
of SSDs. However, I/O bandwidth sharing is not efficiently supported for NVMe SSDs
with Linux Cgroups.
Some studies have proposed proportional-share I/O schedulers for multi-queue
storage (13,14). However, they cannot fully utilize the performance of NVMe SSDs because of scalability
problems. In particular, the BFQ scheduler (13), the only proportional-share scheduler in Linux, suffers from performance overhead
due to its single dispatch queue. According to our preliminary performance evaluation
results, it degrades the I/O throughput of an NVMe SSD by up to 63% compared to the None scheduler.
In this paper, we present a scalable multi-queue I/O scheduler that supports proportional
I/O bandwidth sharing for NVMe SSDs without performance degradation. The proposed
scheduler employs a cgroup-dedicated request queue set for each CPU for parallel I/O
request dispatching. I/O bandwidth is shared by periodically allocating I/O
tokens to each cgroup in proportion to its I/O weight. Consequently, thanks to this
scalable architecture, our scheduler can dispatch I/O requests in parallel and share
I/O bandwidth without any global synchronization between multiple cgroups. Moreover,
the total amount of I/O tokens is updated periodically to adjust to the fluctuating
performance of NVMe SSDs caused by internal operations such as garbage collection. The proposed I/O
scheduler is implemented on Linux kernel 5.2.0 and evaluated with both synthetic and
real workloads. The evaluation results show that our I/O scheduler improves I/O bandwidth
by up to 3.61 times over the BFQ scheduler while allocating I/O bandwidth accurately in
proportion to I/O weights. The scalability evaluation also shows that, unlike other
multi-queue proportional-share schedulers, our scheduler has no scalability problem
even with a high degree of multi-threading.
The remainder of this paper is organized as follows. Section II introduces previous
studies on I/O resource sharing with SSDs. Then, we show the preliminary evaluation
results for the BFQ scheduler in Section III. Section IV describes the proposed proportional-share
I/O scheduler. Experimental results are presented in Section V. The conclusion is
given in Section VI.
II. RELATED WORK
There have been several studies on proportional I/O sharing schemes for multi-queue
SSDs. Ahn et al. (12) implemented the weight-based dynamic throttling (WDT) technique in the Linux Cgroups
layer to throttle bandwidth to match the I/O weight of each cgroup. However, although
it guarantees fairness in the Linux Cgroups layer, the fairness can be degraded by
unexpected scheduling because WDT is implemented above the multi-queue block
layer (7). Moreover, the responsibility for guaranteeing fairness has traditionally been carried
by the scheduler. Therefore, it is hierarchically appropriate that I/O
schedulers in the multi-queue block layer carry out proportional I/O sharing.
The Budget Fair Queueing (BFQ) scheduler proposed by Valente et al. (13) is currently the default proportional-share scheduler for the Linux multi-queue block
layer. The BFQ scheduler allocates a budget, the number of sectors that can be used for
dispatching I/O requests, to each cgroup according to its I/O weight to achieve proportional
I/O sharing. However, in the BFQ scheduler, the internal parallelism of multi-queue NVMe
SSDs cannot be fully utilized because only one request queue can dispatch I/O requests
to the storage device at a time.
M. Hedayati et al. (14) proposed the MQFQ scheduler, which adapts a classical fair queueing algorithm to multi-queue
designs and employs scalable data structures to minimize synchronization overhead.
However, even with these attempts, MQFQ shows a scalability problem under extremely
highly concurrent workloads because a global data structure must be consulted
whenever an I/O request is dispatched. In addition, since MQFQ provides a request
queue for each CPU, the scalability problem also occurs when the number of threads
exceeds the number of CPUs.
Kim et al. (15) proposed an I/O differentiation scheme based on Linux process priority. However, it
focuses on improving the performance of I/O-intensive processes by considering process
priority in I/O scheduling, not on sharing I/O bandwidth proportionally.
Proportional I/O sharing schemes for multi-queue SSDs should be implemented somewhat
differently from the traditional schemes for single-queue SSDs (9-11). First, the parallelism of the multi-queue block layer should not be harmed: global
synchronization should be minimized, and each I/O queue should be able to perform
proportional I/O sharing without interacting with other cgroups. Unlike in the single-queue
block layer, attempting to maintain bandwidth fairness between multiple cgroups through
global synchronization would break the parallel processing of multiple queues and cause
severe I/O performance degradation.
Second, work conservation should also be considered when implementing proportional
I/O sharing schemes for multi-queue NVMe SSDs. Work conservation in a proportional
I/O bandwidth sharing scheme means that the storage device should be kept busy rather
than left idle, even if all cgroups have consumed their allocated I/O bandwidth.
Fig. 1. Evaluation results of proportional I/O bandwidth sharing for four cgroups
with FIO benchmark.
III. MOTIVATION
In this section, we show the performance degradation of the BFQ scheduler through
experimental results. To evaluate the fairness and performance overhead of the
BFQ scheduler, we measured the I/O throughput of four cgroups with different I/O
weights (1000:500:250:100) under the None and BFQ schedulers. Note that the None scheduler
is a FIFO-style I/O scheduler with little scheduling overhead, and it provides a baseline
for the maximum performance of the NVMe SSD (17). Fig. 1 shows the I/O throughput of both the None and BFQ schedulers measured with a synthetic
mixed 4 KB random workload with a 0.5 read ratio generated by the FIO benchmark (16). As the figure shows, the None scheduler does not provide any proportional
bandwidth sharing among cgroups, while the BFQ scheduler accurately shares the
bandwidth among the four cgroups according to their I/O weights. However, with the BFQ
scheduler, the aggregate throughput is seriously degraded, to only about 0.38 times that
of the None scheduler. Moreover, the throughput of the cgroup with the highest weight
under the BFQ scheduler is lower than that of any cgroup under the None scheduler. This
means that the BFQ scheduler cannot fully utilize the performance of NVMe SSDs.
The reason for the performance degradation of the BFQ scheduler is the serialized request
dispatching process used to guarantee fairness between multiple cgroups (14). Therefore, in this paper, we propose a new proportional-share I/O scheduler that
can accurately share I/O bandwidth proportionally without wasting the high performance
of multi-queue SSDs.
Fig. 2. (a) No cgroup dedicated request queue, (b) serialized request dispatching
process, (c) per-cgroup request queue and no global synchronization.
IV. KYBER-FAIRNESS SCHEDULER
In this section, we describe the Kyber-fairness scheduler, the novel proportional-share
I/O scheduler for the multi-queue block layer proposed in this paper.
1. Overview of Kyber-fairness Scheduler
The Kyber-fairness scheduler is implemented by extending the existing multi-queue I/O
scheduler of Linux, the Kyber scheduler (19). Kyber tries to achieve a target I/O latency by tuning the number of requests in the
dispatch queue. However, it has no proportional bandwidth sharing mechanism because
it does not employ a separate request queue for each cgroup, as can be seen in Fig. 2(a). We decided to modify the Kyber scheduler because of (1) its small scheduling overhead
and (2) its simple architecture. The proposed scheduler is implemented according to the
following design principles, illustrated in Fig. 2(c).
First, Kyber-fairness employs a cgroup-dedicated request queue set for each submission
queue of the NVMe SSD to manage fairness among multiple cgroups. The previous multi-queue
schedulers suffer from a lack of fairness or from scalability problems due to, respectively,
the absence of separate per-cgroup request queues and a serialized dispatching mechanism,
as described in Fig. 2(a) and (b). In contrast, the proposed scheduler can provide proportional bandwidth sharing for
cgroups while maintaining scalability on multi-core systems.
Second, Kyber-fairness distributes the bandwidth of the NVMe SSD by allocating tokens to
each cgroup proportionally. Note that a token represents I/O bandwidth allocated to
a cgroup for a certain amount of time. Therefore, each cgroup needs no synchronization
with other cgroups to dispatch requests. As a result, scalable proportional I/O
sharing can be guaranteed even under a high degree of multi-threading.
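To make the per-cgroup, per-queue design concrete, the following sketch outlines the kind of state such a scheduler could maintain. It is a simplified, self-contained illustration; the structure and field names (kf_cgroup_queue, tokens, and so on) are hypothetical and do not reproduce the authors' actual kernel implementation.

```c
/* Simplified, self-contained sketch of the per-cgroup, per-hardware-queue
 * state a token-based proportional-share scheduler could keep. All names
 * are hypothetical; this does not reproduce the authors' kernel code. */
#include <stddef.h>

struct kf_request;                    /* stand-in for a block I/O request  */

struct kf_cgroup_queue {
    struct kf_request *head, *tail;   /* requests inserted by this cgroup  */
    long               tokens;        /* tokens left in the current period */
    unsigned int       weight;        /* I/O weight (100-1000 in this paper)*/
    unsigned long      used;          /* tokens consumed this period       */
    unsigned long      throttled_ns;  /* time spent throttled this period  */
};

struct kf_hctx {
    /* One queue set per hardware submission queue; each set holds one
     * entry per cgroup, so dispatching never touches another queue's state. */
    struct kf_cgroup_queue *cgroups;
    size_t                  nr_cgroups;
};
```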
Fig. 3. Overview of Kyber-fairness scheduler architecture.
2. Proportional I/O Sharing in Kyber-fairness
Fig. 3 describes the proportional bandwidth sharing scheme of Kyber-fairness, which is composed
of the following three steps: (1) determining the total amount of tokens, (2) allocating
tokens to cgroups in proportion to their I/O weights, and (3) consuming tokens to dispatch
I/O requests.
First, the total amount of tokens represents the maximum I/O bandwidth that the
underlying NVMe SSD can serve. It can be predicted based on the average number of tokens
used in previous periods, as described in the next section. In the second step, tokens are
allocated to each cgroup in proportion to its I/O weight; the token manager assigns tokens
periodically to each cgroup accordingly. The last step is to dispatch I/O requests from a
request queue dedicated to a certain cgroup to the NVMe SSD by consuming tokens. If there
are not enough tokens, the request cannot be dispatched and is throttled in the request
queue. Because each cgroup does not need to be concerned with other cgroups, our proposed
scheduler is guaranteed to have no scalability problem even with a large number of cgroups.
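The dispatch-side behavior described above can be sketched as follows. This is a hedged illustration built on the hypothetical structures introduced earlier; kf_request_sectors() stands in for the real request-size accessor, and the token cost is taken here simply as the request size in sectors (the read/write distinction is added next).

```c
/* Sketch of the per-queue dispatch path: consume tokens for a request or
 * throttle it. Assumes the hypothetical kf_cgroup_queue structure above;
 * kf_request_sectors() stands in for the real request-size accessor. */
unsigned long kf_request_sectors(const struct kf_request *rq);

/* Returns 1 if the request may be dispatched, 0 if it must stay queued. */
static int kf_try_dispatch(struct kf_cgroup_queue *cgq,
                           const struct kf_request *rq)
{
    long cost = (long)kf_request_sectors(rq);

    if (cgq->tokens < cost)
        return 0;            /* not enough tokens: throttle in the queue */

    cgq->tokens -= cost;     /* local state only, no global lock needed  */
    cgq->used   += cost;
    return 1;
}
```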
When consuming tokens, writing is more costly than reading, so if reads and writes
consume the same number of tokens, read-dominant cgroups can be disadvantaged compared
to write-dominant cgroups. To address this inequality, we scale the token consumption of
a write request by the ratio of the write latency to the read latency recorded in the
histogram buckets of the Kyber scheduler.
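A minimal sketch of such a cost function is shown below. The paper specifies only that the write cost is scaled by the write/read latency ratio, so the exact formula here (write cost = sectors × write-latency/read-latency, never less than the read cost) is an assumption for illustration.

```c
/* Illustrative token cost: scale writes by the observed write/read latency
 * ratio. The latencies would come from Kyber's latency histograms; here
 * they are plain parameters, and the exact formula is an assumption. */
static long kf_token_cost(unsigned long sectors, int is_write,
                          unsigned long read_lat_ns, unsigned long write_lat_ns)
{
    long cost;

    if (!is_write || read_lat_ns == 0)
        return (long)sectors;

    /* Writes cost proportionally more, but never less than a read. */
    cost = (long)(sectors * write_lat_ns / read_lat_ns);
    return cost > (long)sectors ? cost : (long)sectors;
}
```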
The token manager refills the tokens of each cgroup if one of the following two conditions
is met: (1) a certain period (e.g., 100 ms) has elapsed, or (2) the tokens of all cgroups
in the active state have been exhausted. A cgroup becomes inactive if no I/O requests were
inserted into its queues during the previous period.
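The refill trigger can be summarized in code as below. Again this is a hedged sketch over the hypothetical structures above, using the 100 ms period from the text as a default and a separate active flag array for the cgroups.

```c
/* Sketch of the token manager's refill trigger: refill when the allocation
 * period (100 ms in the paper's example) has elapsed, or when every active
 * cgroup has run out of tokens. Types and helpers follow the earlier sketch. */
#define KF_PERIOD_NS (100UL * 1000 * 1000)   /* 100 ms */

static int kf_should_refill(const struct kf_cgroup_queue *cgroups,
                            size_t nr_cgroups, const int *active,
                            unsigned long now_ns, unsigned long period_start_ns)
{
    size_t i;

    if (now_ns - period_start_ns >= KF_PERIOD_NS)
        return 1;                         /* condition (1): period elapsed   */

    for (i = 0; i < nr_cgroups; i++)
        if (active[i] && cgroups[i].tokens > 0)
            return 0;                     /* an active cgroup still has tokens */

    return 1;                             /* condition (2): all exhausted    */
}
```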
3. Prediction of NVMe SSD Bandwidth
As mentioned above, the total amount of tokens represents the I/O throughput that the
underlying NVMe SSD can serve. However, it is difficult to predict the I/O performance of
an NVMe SSD accurately because it is greatly affected by the characteristics of the I/O
workload and by internal operations such as garbage collection and wear leveling.
If the predicted token amount exceeds the available bandwidth of the NVMe SSD, the
fairness between cgroups deteriorates. Conversely, if the amount of required tokens
is underestimated, the overhead of Kyber-fairness increases due to frequent token
reassignment. In this section, we explain how the number of tokens can be predicted
accurately.
Algorithm 1 describes the bandwidth prediction and token allocation process. First, the
Kyber-fairness scheduler records the number of used tokens ($u_{i}^{k}$) and the throttled
time ($t_{i}^{k}$) of each cgroup over the periods and calculates the average number of
tokens used by each cgroup during an allocation period ($u_{i}^{avg}$). This causes only
negligible overhead because no global synchronization is required between multiple cgroups.
Then, the number of tokens required for the next period is predicted based on the sum of
the average used tokens ($U_{k}$). If the throttled time of a cgroup ($t_{i}^{k}$) is
shorter than the allocation period, the predicted number is increased; otherwise, it is
decreased. The predicted tokens are distributed over the active cgroups in proportion to
their weights.
Algorithm 1. Token Prediction and Reallocation for the Next Period
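Since the algorithm listing itself is given as a figure, the following is a hedged reconstruction of the prediction step based only on the description above. The per-cgroup adjustment applied when a cgroup was or was not throttled is an assumption (a simple ±10% step), the last-period usage stands in for the multi-period average $u_{i}^{avg}$, and the function name is hypothetical; the authors' Algorithm 1 may use different rules.

```c
/* Hedged sketch of token prediction and reallocation for the next period,
 * following the textual description of Algorithm 1. The +/-10% adjustment
 * step is an illustrative assumption, not the authors' exact rule, and
 * "used" stands in for the average usage over recent periods. */
static void kf_predict_and_allocate(struct kf_cgroup_queue *cgroups,
                                    size_t nr_cgroups, const int *active,
                                    unsigned long period_ns)
{
    unsigned long total_tokens = 0, weight_sum = 0;
    size_t i;

    for (i = 0; i < nr_cgroups; i++) {
        unsigned long contrib;

        if (!active[i])
            continue;

        /* Start from the tokens the cgroup used (u_i); grow the estimate
         * if it was throttled for less than the whole period, otherwise
         * shrink it (illustrative 10% step). */
        contrib = cgroups[i].used;
        if (cgroups[i].throttled_ns < period_ns)
            contrib += contrib / 10;
        else
            contrib -= contrib / 10;

        total_tokens += contrib;          /* accumulates U_k               */
        weight_sum   += cgroups[i].weight;
    }

    /* Distribute the predicted tokens to active cgroups in proportion to
     * their weights and reset the per-period accounting. */
    for (i = 0; i < nr_cgroups; i++) {
        if (!active[i] || weight_sum == 0)
            continue;
        cgroups[i].tokens       = (long)(total_tokens * cgroups[i].weight
                                         / weight_sum);
        cgroups[i].used         = 0;
        cgroups[i].throttled_ns = 0;
    }
}
```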
4. Conformance to Multi-queue Scheme
The Kyber-fairness scheduler assigns individual tokens to each cgroup, which can be consumed
without global synchronization of any proportional I/O sharing information.
In the proposed scheduler, global synchronization is required only when allocating tokens
to each cgroup. In other words, the Kyber-fairness scheduler does not refer to any
global structure when dispatching I/O requests. As a result, scalable I/O performance
can be delivered by fully utilizing the internal parallelism of NVMe SSDs.
Table 1. Experimental environments
CPU | Intel E5-2620 (8-core) x 2
Memory | 32 GB
Storage | Intel DC P4500 2TB
OS | Ubuntu 18.04 (Kernel 5.2.0)
Table 2. Parameters for FIO benchmark
IO engine | libaio (I/O depth 32)
Block size | Random: 4 KB / Sequential: 1 MB
Cgroups | 4 cgroups (4 threads per cgroup)
Direct I/O | Enabled
File size | 10 GB (60 sec runtime)
V. EXPERIMENTAL RESULTS
To evaluate the performance and fairness of the Kyber-fairness scheduler, we implemented
and evaluated it on Linux kernel 5.2.0 in the experimental environment described
in Table 1 (8). FIO and trace-replay (20) were used to measure I/O performance for synthetic and real workloads, respectively.
The FIO benchmark performs four different synthetic I/O workloads: random read, random
write, sequential read, and sequential write. Four cgroups are generated in total, and
each one has four threads. The I/O weights of the cgroups are 100, 250, 500, and 1000,
respectively. Table 2 shows the parameters used for FIO.
Fig. 4 shows the results of the FIO benchmark. Fig. 4(a) shows that the Kyber-fairness scheduler delivers I/O throughput close to that of the None
scheduler for all types of workloads. It improves I/O performance by 2.68x and 3.61x
over the BFQ scheduler for the random read and random write workloads, respectively.
However, the performance improvement is not impressive for the sequential workloads.
The reason is that the scalability problem of BFQ is hidden for sequential workloads,
which do not require high scalability. Moreover, as shown in Fig. 4(b), the proposed Kyber-fairness scheduler ensures fairness among multiple cgroups with
different I/O weights. Note that the Fairness Index (FI) (18) indicates how accurately the I/O bandwidth is shared in proportion to the I/O
weights; the closer it is to 1, the more accurate the sharing.
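For reference, the FI cited from (18) is Jain's fairness index; assuming it is computed over weight-normalized throughputs, which is the standard way to apply it to weighted sharing, it takes the form
$$FI = \frac{\left(\sum_{i=1}^{n} x_{i}\right)^{2}}{n \sum_{i=1}^{n} x_{i}^{2}}, \qquad x_{i} = \frac{B_{i}}{w_{i}},$$
where $B_{i}$ is the measured bandwidth of cgroup $i$ and $w_{i}$ is its I/O weight; FI equals 1 exactly when every cgroup receives bandwidth proportional to its weight.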
Fig. 4. Evaluation results of proportional I/O bandwidth sharing for four cgroups
with synthetic workloads (a) Total I/O Throughput (MB/s), (b) Fairness Index (FI).
To generate real workloads, the block I/O trace replay tool trace-replay (20) is used with block I/O traces from SNIA (21). Tables 3 and 4 show the parameters used for trace-replay and the characteristics
of the I/O workloads, respectively. In our experiments, four cgroups are created, and each
cgroup replays one of the block I/O traces described in Table 4 with a different I/O weight. Note that all traces are replayed
by four concurrent threads for 60 seconds without considering the timing of the I/O
requests.
Table 3. Parameters for trace-replay
Traces | SYSTOR '17
IO depth | 32
Cgroups | 4 cgroups
Runtime | 60 sec
Num. of threads | 4 threads
Table 4. Characteristics of I/O workloads
LUN
|
Request size
(total / average)
|
Required Bandwidth
(MB/s)
|
R : W
|
LUN0
|
1.12 GB / 31.1 KB
|
462.45
|
3.44 : 1
|
LUN1
|
505 MB / 24.8 KB
|
540.66
|
3.19 : 1
|
LUN2
|
2.88 GB / 50.4 KB
|
364.98
|
2.78 : 1
|
LUN3
|
1.87 GB / 63.7KB
|
476.54
|
1.60 : 1
|
Fig. 5 shows the I/O performance and fairness of the I/O schedulers for real workloads.
Fig. 5(a) shows that the Kyber-fairness scheduler improves I/O performance by up to 1.33 times
over the BFQ scheduler. Moreover, the Kyber-fairness scheduler shows a slightly
better fairness index than the BFQ scheduler, as shown in
Fig. 5(b). However, in
Fig. 5(a), the performance gap between BFQ and Kyber-fairness narrows
when LUN3 has the highest weight. The reason is that LUN3 does not require scalable
performance compared to the other LUNs, so the scalability problem of the BFQ scheduler
is not significantly exposed.
Fig. 5. Evaluation results of proportional I/O bandwidth sharing for four cgroups
with real-world workloads (a) Total I/O Throughput (MB/s), (b) Fairness Index (FI).
Fig. 6 shows the scalability evaluation results of the I/O schedulers. To evaluate scalability,
we use small (1 KB) random read requests with an increasing number of cgroups. As the
figure shows, Kyber-fairness delivers performance close to that of the None scheduler
even as the number of cgroups increases, whereas the throughput of the BFQ scheduler
drops sharply. This experimental result shows that our proposed scheduler does not harm
the parallelism of the multi-queue block design even with a high degree of multi-threading.
Fig. 6. Scalability evaluation results with increasing number of cgroups.
VI. CONCLUSIONS
We proposed a scalable, low-overhead proportional-share I/O scheduler for the multi-queue
block layer. The proposed I/O scheduler efficiently supports proportional I/O
bandwidth sharing for NVMe SSDs by removing global synchronization and by employing a
token amount prediction technique. According to the experimental results, our I/O scheduler
shows up to 3.61 times better I/O performance than the existing proportional-share I/O
scheduler while also guaranteeing fairness accurately. Moreover, the scalability evaluation
shows that the proposed Kyber-fairness scheduler can deliver scalable performance
even with an extremely high degree of multi-threading.
ACKNOWLEDGMENTS
This work was supported by the National Research Foundation of Korea (NRF) grant funded
by the Korea government (MSIT) (No. 2018R1D1A3B07050034).
REFERENCES
(1) Popek G. J., Goldberg R. P., July 1974, Formal requirements for virtualizable third generation architectures, Communications of the ACM, Vol. 17, No. 7, pp. 412-421
(2) Li Z., Kihl M., Lu Q., Andersson J. A., Mar 2017, Performance Overhead Comparison between Hypervisor and Container Based Virtualization, 31st IEEE International Conference on Advanced Information Networking and Applications (AINA 2017), pp. 955-962
(3) Xavier M. G., Mar 2015, A Performance Isolation Analysis of Disk-Intensive Workloads on Container-Based Clouds, 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP 2015), pp. 253-260
(4) NVMe overview, 2020, https://www.nvmexpress.org/wp-content/uploads/NVMe_Overview.pdf
(5) Cgroups, 2016, https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt
(6) Linux Containers (LXC), 2016, https://linuxcontainers.org/lxc/introduction/
(7) Bjørling M., June 2013, Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems, 6th ACM International Systems and Storage Conference (SYSTOR 2013), pp. 1-10
(8) Intel DC P4500 2TB NVMe SSD, 2017, https://ark.intel.com/content/www/us/en/ark/products/99031/intel-ssd-dc-p4500-series-2-0tb-2-5in-pcie-3-1-x4-3d1-tlc.html
(9) Demers A., Keshav S., Shenker S., Aug 1989, Analysis and simulation of a fair queueing algorithm, ACM SIGCOMM Computer Communication Review, Vol. 19, No. 4, pp. 1-12
(10) Axboe J., July 2004, Linux block IO - present and future, Ottawa Linux Symposium, pp. 51-61
(11) Jin W., Chase J. S., Kaur J., June 2004, Interposed Proportional Sharing for a Storage Service Utility, ACM SIGMETRICS Performance Evaluation Review, Vol. 32, No. 1, pp. 37-48
(12) Ahn S., La K., Kim J., June 2016, Improving I/O Resource Sharing of Linux Cgroup for NVMe SSDs on Multi-core Systems, 8th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 2016)
(13) Valente P., Checconi F., Sep 2010, High throughput disk scheduling with fair bandwidth distribution, IEEE Transactions on Computers, Vol. 59, No. 9, pp. 1172-1186
(14) Hedayati M., July 2019, Multi-Queue Fair Queueing, USENIX Annual Technical Conference (ATC 2019), pp. 301-314
(15) Kim K., Hong S., Kim T., Feb 2020, Supporting the Priorities in the Multi-queue Block I/O Layer for NVMe SSDs, Journal of Semiconductor Technology and Science, Vol. 20, No. 1, pp. 55-62
(16) Axboe J., Flexible I/O tester, https://github.com/axboe/fio
(17) Open Benchmarking, 2020, Linux 5.6 IO Scheduler Benchmarks, https://openbenchmarking.org/result/2003265-NI-LINUX56IO44
(18) Jain R., Chiu D., Hawe W., Sep 1984, A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer System, DEC Eastern Research Lab, Tech. Rep. TR-301
(19) Sandoval O., 2017, Kyber Multi-queue I/O Scheduler, https://lwn.net/Articles/720071
(20) Oh Y., trace-replay, https://github.com/yongseokoh/trace-replay
(21) Lee C., 2017, Understanding Storage Traffic Characteristics on Enterprise Virtual Desktop Infrastructure, 10th ACM International Systems and Storage Conference (SYSTOR 2017), pp. 1-11
Author
Suho Son was born in Suncheon, Korea, in 1995.
He received the B.S. degree in the School of Computer Science and Engineering from
Pusan National University, Korea, in 2020.
He is currently pursuing the M.S. degree in the School of Computer Science and Engineering
from Pusan National University, Korea.
His interests include operating systems, cloud platforms, and non-volatile memory storage.
Gijun Oh was born in Ulsan, Korea, in 1995.
He received the B.S. degree in the School of Computer Science and Engineering from
Pusan National University, Korea, in 2020.
He is currently pursuing the M.S. degree in the School of Computer Science and Engineering
from Pusan National University, Korea.
His interests include NVMe device driver, Flash-Translation Layer and Zoned Namespace
SSD.
Sungyong Ahn received the B.S. and Ph.D. degrees in Computer Science and Engineering
from Seoul National University, Korea, in 2003 and 2012, respectively.
He worked at Samsung Electronics as a Senior Engineer from 2012 to 2017.
In 2017, he joined the School of Computer Science and Engineering, Pusan National
University, Busan, South Korea, where he is currently a Professor.
His interests include operating systems, solid-state drives, and cloud platforms.