KimByungJun1
MunHan-Gyeol1
KimShinwoong2
LeeJongMin1
SimJae-yoon1
-
(Department of Electrical Engineering, POSTECH, Pohang, Gyeongbuk, Korea 37673)
-
(Department of Communication Engineering, Handong Global University, Pohang, Gyeongbuk,
Korea 37554)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Index Terms
Post-quantum cryptography, lattice-based cryptography, security, accelerator, learning-with errors
I. INTRODUCTION
As large-scale quantum computer development progresses [1], threats have been raised about the vulnerability of traditional public-key based
cryptography. The National Institute of Standards and Technology (NIST) has been standardizing
on post-quantum cryptography (PQC) (Fig. 1). After several rounds, in August 2023, NIST published the initial drafts (FIPS-203,
FIPS-204, and FIPS-205) for PQC [2]. Two of them are ML-KEM and ML-DSA algorithms based on module learning with errors
(MLWE), which are based on lattice-based cryptography that is robust against quantum
attacks.
Because ML-KEM and ML-DSA require complex computations, their software implementation
introduces a lot of latency. In particular, sampling based on cryptographically-secure
pseudo random number generator (CS-PRNG) and number-theoretic transforms (NTT) based
on modular arithmetic are very complex for microcontroller units (MCUs) to handle.
To solve these problems, MLWE accelerator implementations on ASICs or FPGAs have been
reported [3-6]. Especially considering that IoT devices are more vulnerable to security attacks,
a compactly designed crypto processor is needed in the IoT environment.
The followings are the contributions of this work. 1) It reduces lots of latency of
sampling and NTT operations by using parallel hardware structure. 2) A modular multiplier
design optimized for the specific prime numbers of ML-KEM and ML-DSA. 3) This work
supports both of number systems for ML-KEM and ML-DSA efficiently by using clock-gating
method.
This work presents the fastest PQC crypto-processor that processes MLWE-based PQC
algorithms ML-KEM and ML-DSA. It is implemented and estimated under 28nm CMOS technology
@0.9V supply voltage. The estimation is done by using Synopsys Design Compiler syn_R-2020.09,
IC Compiler 2020.09-SP1, Primetime prime-R-2020.09-SP1.
Fig. 1. A summary of PQC standardization
II. ARCHITECTURE
Fig. 2 shows the overall architecture of the proposed PQC processor. It consists of memory
and operator modules for sampling and arithmetic.
The memory module has an instruction cache (2 KB), banks for polynomials (6 KB), and
banks for twiddle factors (1.5 KB). 32-bit instructions can be stored in the instruction
cache. The on-chip memory for 8 polynomials is required to process the PQC algorithms
on chip. For each polynomial, a bank consists of 16 polynomial coefficient memories,
enabling polynomial operations to be accelerated 16x faster. The twiddle factors stored
in the twiddle banks are used in the NTT computation\textcolor{color-1}{.} We could
reduce the size of ${\omega}$-bank by merging the NTT and multiplication by powers
of ${\psi}$ (${\omega}$ = ${\psi}$$^{2}$).
The Arithmetic module performs the polynomial operations required by PQC. It handles
modular arithmetic of polynomials, NTT operations, etc. Arithmetic Logic Elements
(ALEs) array efficiently handle the operations, and each ALE contains necessary arithmetic
units.
The Keccak-based CS-PRNG generates a random bitstream for sampling. The bitstream
is passed through specific sampling algorithm filters to generate polynomial coefficients
with a uniform, Gaussian distribution. With parallel structure of memory and sampling
filters, the latency for sampling is reduced.
Fig. 2. Overall architecture of the proposed crypto-processor.
III. SAMPLER
In software implementation, sampling operations account for 70% of the computation
latency [4]. Fig. 3 shows the architecture of the sampler module. CS-PRNG generates a random bitstream,
and 5 different sampling filters produce polynomial coefficients with uniform or gaussian
distribution.
Fig. 3. Block diagram of sampler module.
1. Real-time Psuedo Random Number Generator
The proposed CS-PRNG uses the 1600b Keccak-f algorithm to compute SHA-3 functions.
It performs four hash functions, called SHA3-224, SHA3-256, SHA3-384, and SHA3-512,
and two extendable-output functions (XOFs), called SHAKE128 and SHAKE256. The random
bitstream used for sampling are generated by SHAKE128 and SHAKE256. Since these functions
are based on very complex operations, hardware implementation of SHA-3 significantly
reduces latency and energy consumption, compared to software implementation.
2. Sampling Filters
ML-KEM and ML-DSA use uniform rejection (UR), centered binomial distribution (CBD),
uniform eta (UE), uniform gamma (UG), and sample in ball (SIB) algorithms for sampling
operations. The UR filter samples a random number with equal probability within [0,
q-1]. It discards the number that doesn’t satisfy the condition ({\textless} q) of
the rejection bound in order to make data with uniform probability from the incoming
bitstream. CBD filter generates polynomial coefficients with Gaussian distribution.
It samples the result of subtraction after obtaining two Hamming Weights (HW) from
random bitstream. UE and UG filters sample the number within the range [-${\eta}$,
${\eta}$] and [-${\gamma}$+1, ${\gamma}$] respectively. The filter is designed through
a comparator with ${\eta}$ and ${\gamma}$ constants set according to the PQC protocol.
The SIB filter is implemented using a comparator and a mux. Since it needs to be operated
by checking the polynomial coefficient one by one, it is implemented without parallelism.
3. Evaluation
Fig. 4 and Table 1 show a comparison of the latency generated in sampling by the software and hardware
implementations. When performing UR sampling, which consumes the most clock cycles
in software implementation, the proposed work reduces the latency by 99% compared
to software. For CBD and UE sampling, the latency reduction is 92% and 87% respectively.
This improvement is achieved by implementing the efficient CS-PRNG hardware design
and parallelization of the polynomial banks. The software benchmark is based on an
Intel i7-7700 environment.
Fig. 4. Latency of sampler for ML-KEM-512 and ML-DSA-44.
Table 1. Clock cycles of sampling operations
Latency of Sampler
|
ML-KEM-512
|
Intel i7-7700
|
Intel i7-7700
(w/AVX2)
|
This work
|
UR (x4)
|
72,256
|
15,588
|
808
|
CBD (x1)
|
5,251
|
6,876
|
99
|
ML-DSA-44
|
Intel i7-7700
|
Intel i7-7700
(w/AVX2)
|
This work
|
UR (x16)
|
241,688
|
87,480
|
2,752
|
UE (x1)
|
4,850
|
2,934
|
155
|
IV. ARITHMETIC LOGIC ELEMENT ARRAY
Fig. 5 shows the structure of an ALE array. The ALE array consists of 32 ALEs. Each ALE
contains an arithmetic, modular arithmetic, and logical arithmetic operator. All of
32 ALEs operates when the crypto-core processes NTT or INTT instructions. For modular
multiplier, we implemented Barrett reduction as hardware.
Fig. 5. Block diagram of ALE array module.
1. Modular Multiplier
The modular multiplication is the heaviest part in the ALE. It involves arithmetic
multiplication and calculating the remainder on the q values. To reduce the complexity
of the operation, Montgomery, Barrett reduction, etc. are used. However, the reduction
operation also requires high power consumption and latency.
On the other hand, modulo q of the PQC protocols is a fixed prime number. Under these
conditions, Barrett reduction can be implemented through a simple algorithm (Fig. 6). We implemented the reduction using only a few adders, subtractors, and shifters.
Fig. 6. Block diagram of modular multiplier.
2. NTT/INTT
Fig. 5 shows the data flow of NTT/INTT. The operation is processed through Cooley-Tukey
(CT) and Gentleman-Sande (GS) butterfly configurations, respectively. Each ALE uses
modular operators to implement the appropriate butterfly unit according to the operation.
We reduced memory accesses by using an 8x4 ALE array. Since the dimension of the polynomial
in both ML-KEM and ML-DSA is 256, only 16x2 16-to-16 bijection operations are required
to complete the NTT/INTT operation. The implementation is realized by inserting a
reordering MUX stage. The reordering block configures any bijection according to the
operation type and NTT/INTT stage.
3. Polynomial Arithmetic
The key, message, and ciphertext of the MLWE algorithm are all coefficients of a polynomial.
Except for sampling, pointwise arithmetic between polynomials accounts for most of
the computation in both ML-KEM and ML-DSA. We use ALE arrays to parallelize the polynomial
arithmetic. Since we have 16-parallel memory of coefficient, 16 ALEs out of 32 are
used for polynomial arithmetic, which are shown as gray box in Fig. 5.
4. Evaluation
Table 2 shows a comparison of the latency of NTT/INTT in the software and hardware implementations.
Compared to the Intel i7-7700 environment, the proposed hardware reduces the running
time of NTT/INTT by 99%.
Table 2. Clock cycles of NTT & INTT operations
Latency of NTT & INTT of a polynomial
|
ML-KEM
|
Intel i7-7700
|
Intel i7-7700
(w/AVX2)
|
This work
|
NTT
|
9,138
|
365
|
61
|
INTT
|
12,301
|
382
|
61
|
ML-DSA
|
Intel i7-7700
|
Intel i7-7700
(w/AVX2)
|
This work
|
NTT
|
6,750
|
2,270
|
61
|
INTT
|
9,535
|
2,115
|
61
|
V. PARALLEL OPERATION AND CLOCK GATING
To reduce the execution time of the protocol, we designed the instructions that make
the operation of the sampler and ALE array to be performed in parallel. We also reduced
the power consumption of the ML-KEM by clock gating.
1. Parallel Operation
Depending on the order of the instructions, the sampler and ALE array can operate
independently. If there is a situation where both sampling and NTT operations are
required for independent polynomials, the latency can be reduced by performing the
operations in parallel (Fig. 7).
Fig. 7. Effect of instruction for parallel operations.
2. Clock Gating
In the proposed implementation, both ML-KEM and ML-DSA are performed on the same hardware.
In both of them, all operations are performed on the quotient ring, but modulo q values
are different (3329 for ML-KEM and 8380417 for ML-DSA). Operating ML-KEM, about half
of registers consume power just by clock network. We applied clock gating for removing
the unnecessary power consumption. Fig. 8 shows a comparison of the power consumption of the ALE array when performing NTT
operations on ML-KEM. The power consumption of ALE array is reduced by 16% by clock
gating.
Fig. 8. Effect of clock gating.
VI. IMPLEMENTATION
The PQC processor is implemented on a 28 nm CMOS technology process using 9.5 KB of
on-chip memory and 511.8k NAND gate equivalents. Fig. 9 shows the layout of the proposed design. Table 3 compares the performances with previous works. This implementation consumes 16~24
mW from a supply voltage of 0.9 V while processing ML-KEM-512 and ML-DSA-44. Compared
to state-of-the-art [6], the implementation shows 2.91x and 2.62x higher power efficiency for ML-KEM-512
and ML-DSA-44, respectively.
Fig. 9. Layout of PQC processor.
Table 3. Performance Comparison
|
[4]
|
[5]
|
[6]
|
This Work
|
Process
|
40 nm
|
28 nm
|
28 nm
|
28 nm
|
Frequency [MHz]
|
12~72
|
300
|
500
|
83*
|
Voltage [V]
|
0.68~1.1
|
0.9
|
0.9
|
0.9
|
Area [mm2]
|
0.28
|
-
|
3.6
|
0.83
|
Logic Gates
|
106k
|
942k
|
-
|
560k
|
SRAM [KB]
|
40.25
|
12
|
448
|
7.49
|
Power [mW]
|
5~7.5
|
-
|
163~304
|
16~24*
|
ML-KEM
|
Support
|
Support
|
Support
|
Support
|
ML-DSA
|
Support
|
-
|
Support
|
Support
|
ML-KEM-512 (Keygen + Encaps. + Decaps.)
|
Clock Cycles
|
348,526
|
144,431
|
10,433
|
5,794
|
Power. Eff. [OPS/W]
|
37,617
|
78,125
|
294,018
|
856,768*
|
Area. Eff. [GOPS/Wm2]
|
134
|
-
|
82
|
1,032
|
ML-DSA-44 (Keygen + Sign + Verify)
|
Clock Cycles
|
829,201
|
-
|
43,376
|
27,426
|
Power. Eff. [OPS/W]
|
11,472
|
-
|
48,637
|
127,477*
|
Area. Eff. [GOPS/Wm2]
|
41
|
-
|
14
|
153
|
* Measured by Primetime prime-R-2020.09-SP1.
V. CONCLUSIONS
This work proposes a hardware architecture for a PQC processor that supports ML-KEM
and ML-DSA. The proposed design enables efficient modular multiplication using Barrett
reduction, which is configurable specifically for the PQC algorithm. It also utilizes
a parallel memory structure and a real-time sampler module. The architecture is implemented
in 28nm CMOS technology. The proposed design uses the smallest on-chip memory and
achieves 2.91x higher power efficiency compared to the previous state-of-the-art.
References
J. Park, et al., “A Fully Integrated Cryo-CMOS SoC for Qubit Control in Quantum Computers
Capable of State Manipulation, Readout and High-Speed Gate Pulsing of Spin Qubits
in Intel 22nm FFL FinFET Technology,” ISSCC, pp. 208-209, Feb. 2021.
National Institute of Standards and Technology (NIST), “Module-Lattice-Based Key-Encapsulation
Mechanism Standard,” Federal Information Processing Standards Publications (FIPS PUBS)
203, Aug. 2023.
L. Ma, X. Wu and G. Bai, "Parallel polynomial multiplication optimized scheme for
CRYSTALS-KYBER Post-Quantum Cryptosystem based on FPGA," 2021 International Conference
on Communications, Information System and Computer Engineering (CISCE), Beijing, China,
2021, pp. 361-365.
U. Banerjee, A. Pathak, and A. P. Chandrakasan, “An Energy-Efficient Configurable Lattice
Cryptography Processor for the Quantum-Secure Internet of Things,” ISSCC, pp. 46-47,
Feb. 2019.
G. Xin, et al., “VPQC: A Domain-Specific Vector Processor for Post-Quantum Cryptography
Based on RISC-V Architecture,” TCAS-I, vol. 67, no. 8, pp. 2672-2684, Aug. 2020.
Y. Zhu, et al., “A 28nm 48KOPS 3.4µJ/Op Agile Crypto-Processor for Post-Quantum Cryptography
on Multi-Mathematical Problems,” ISSCC, pp. 514-516, 2022.
ByungJun Kim received the B.S. and M.S. degrees in electronic and electrical engineering
from the Pohang University of Science and Technology (POSTECH), Pohang, South Korea,
in 2016 and 2018, respectively, where he is currently pursuing the Ph.D. degree. His
research interests include post-quantum cryptography processor and homomorphic encryption
accelerator.
Han-Gyeol Mul received the B.S. degree in electrical engineering and the Ph.D.
degree in convergence IT engineering from the Pohang University of Science and Technology
(POSTECH), Pohang, South Korea, in 2015 and 2023, respectively. He is currently a
Post-Doctoral Researcher with the Department of Electronic and Electrical Engineering,
POSTECH. His research primarily focuses on developing low-power and energy-efficient
deeplearning hardware solutions specifically tailored for edge IoT devices.
Shinwoong Kim received the B.S. and M.S. degrees in electrical engineering and
information and communication engineering from Handong Global University, Pohang,
South Korea, in 2009 and 2011, respectively, and the Ph.D. degree in electronic and
electrical engineering from the Pohang University of Science and Technology, Pohang,
South Korea, in 2016. From 2016 to 2022, he was a Senior Engineer at Samsung Electronics,
Hwasung, South Korea, where he involved in the design of Local Oscillator (LO) domain
including all-digital phase-locked loop for RF communication systems. In 2022, he
joined Handong Global University, Pohang, South Korea, where he is currently a Assistnat
Professor. His current research interests include analog/digital frequency synthesizer
and low-power clock generation.
JongMin Lee received the B.S. degrees in electronic and electrical engineering
from Pohang University of Science and Technology (POSTECH), Pohang, South Korea, in
2023, where he is currently pursuing the M.S. degree. His research interests include
Low DropOut regulator and hardware accelerator for machine learning.
Jae-Yoon Sim (Senior Member, IEEE) received the B.S., M.S., and Ph.D. degrees
in electronic and electrical engineering from the Pohang University of Science and
Technology (POSTECH), Pohang, South Korea, in 1993, 1995, and 1999, respectively.
From 1999 to 2005, he was a Senior Engineer with Samsung Electronics, Hwasung, South
Korea. From 2003 to 2005, he was a Post-Doctoral Researcher at the University of Southern
California, Los Angeles, CA, USA. From 2011 to 2012, he was a Visiting Scholar with
the University of Michigan, Ann Arbor, MI, USA. In 2005, he joined POSTECH, where
he is currently a Professor. Since 2019, he has been the Director of the Scalable
Quantum Computer Technology Platform Center, Pohang, South Korea, nominated by the
Korean Ministry of Science and Information and Communication Technology (ICT). His
research interests include sensor interface circuits, high-speed serial/parallel links,
phase-locked loops, data converters, braininspired computing, and quantum computing.
Dr. Sim was a co-recipient of the Takuo Sugano Award from the IEEE International Solid-State
Circuits Conference (ISSCC) in 2001 and a recipient of the Special Author Recognition
Award from ISSCC 2013. In 2020, he received the Scientist of the Month Award sponsored
by the Ministry of Science and ICT of Korea. He has been an IEEE Distinguished Lecturer
from 2018 to 2019. He has served on the Technical Program Committees for the ISSCC,
the IEEE Symposium on VLSI Circuits, and the IEEE Asian Solid-State Circuits Conference.