Mobile QR Code QR CODE

Main Menu

The Journal of Semiconductor Technology and Science (JSTS) is an international, peer-reviewed, and open-access journal that is published bimonthly.
- Scope: semiconductor processes, devices, circuits, and MEMS.
- Editor-in-Chief: Prof. Woo Young Choi (ECE, Seoul National University)
- Indexed within Science Citation Index Expanded (SCIE), SCOPUS, Korea Citation Index (KCI), and other databases.

Journal Search

[

Research article

]

JSTS(Journal of Semiconductor Technology and Science)

IEIE Vol. 24, No. 01, p.55-61

ISSN (print) :

1598-1657

ISSN (online) :

2233-4866

Received : 15 November 2023Revised : 21 December 2023Accepted : 27 December 2023

DOI :

https://doi.org/10.5573/JSTS.2024.24.1.55

A 1.03MOPS/W Lattice-based Post-quantum Cryptography Processor for IoT Devices

KimByungJun¹ MunHan-Gyeol¹ KimShinwoong² LeeJongMin¹ SimJae-yoon¹

(Department of Electrical Engineering, POSTECH, Pohang, Gyeongbuk, Korea 37673)
(Department of Communication Engineering, Handong Global University, Pohang, Gyeongbuk, Korea 37554)

^*E-mail : kbj1213@postech.ac.kr

License :

This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.(www.theieie.org).

Abstract

This work introduces a configurable lattice-based post-quantum cryptography processor designed specifically for lightweight IoT devices. It accelerates the computation of Key-Encapsulation Mechanism (KEM) and Digital Signature Algorithm (DSA) based on module learning with errors algorithm (MLWE). In order to minimize both hardware cost and energy consumption, the processor incorporates a Barrett reduction algorithm method for efficient number-theoretic transform calculations and implements real-time processing for polynomial sampling. The chip is fabricated on a 28 nm CMOS technology process. It achieves the state-of-the-art power efficiencies and latency in MLWE-based PQC.

Index Terms

Post-quantum cryptography, lattice-based cryptography, security, accelerator, learning-with errors

I. INTRODUCTION

As large-scale quantum computer development progresses ^[1], threats have been raised about the vulnerability of traditional public-key based cryptography. The National Institute of Standards and Technology (NIST) has been standardizing on post-quantum cryptography (PQC) (Fig. 1). After several rounds, in August 2023, NIST published the initial drafts (FIPS-203, FIPS-204, and FIPS-205) for PQC ^[2]. Two of them are ML-KEM and ML-DSA algorithms based on module learning with errors (MLWE), which are based on lattice-based cryptography that is robust against quantum attacks.

Because ML-KEM and ML-DSA require complex computations, their software implementation introduces a lot of latency. In particular, sampling based on cryptographically-secure pseudo random number generator (CS-PRNG) and number-theoretic transforms (NTT) based on modular arithmetic are very complex for microcontroller units (MCUs) to handle. To solve these problems, MLWE accelerator implementations on ASICs or FPGAs have been reported ^[3-^6]. Especially considering that IoT devices are more vulnerable to security attacks, a compactly designed crypto processor is needed in the IoT environment.

The followings are the contributions of this work. 1) It reduces lots of latency of sampling and NTT operations by using parallel hardware structure. 2) A modular multiplier design optimized for the specific prime numbers of ML-KEM and ML-DSA. 3) This work supports both of number systems for ML-KEM and ML-DSA efficiently by using clock-gating method.

This work presents the fastest PQC crypto-processor that processes MLWE-based PQC algorithms ML-KEM and ML-DSA. It is implemented and estimated under 28nm CMOS technology @0.9V supply voltage. The estimation is done by using Synopsys Design Compiler syn_R-2020.09, IC Compiler 2020.09-SP1, Primetime prime-R-2020.09-SP1.

Fig. 1. A summary of PQC standardization

II. ARCHITECTURE

Fig. 2 shows the overall architecture of the proposed PQC processor. It consists of memory and operator modules for sampling and arithmetic.

The memory module has an instruction cache (2 KB), banks for polynomials (6 KB), and banks for twiddle factors (1.5 KB). 32-bit instructions can be stored in the instruction cache. The on-chip memory for 8 polynomials is required to process the PQC algorithms on chip. For each polynomial, a bank consists of 16 polynomial coefficient memories, enabling polynomial operations to be accelerated 16x faster. The twiddle factors stored in the twiddle banks are used in the NTT computation\textcolor{color-1}{.} We could reduce the size of ${\omega}$-bank by merging the NTT and multiplication by powers of ${\psi}$ (${\omega}$ = ${\psi}$$^{2}$).

The Arithmetic module performs the polynomial operations required by PQC. It handles modular arithmetic of polynomials, NTT operations, etc. Arithmetic Logic Elements (ALEs) array efficiently handle the operations, and each ALE contains necessary arithmetic units.

The Keccak-based CS-PRNG generates a random bitstream for sampling. The bitstream is passed through specific sampling algorithm filters to generate polynomial coefficients with a uniform, Gaussian distribution. With parallel structure of memory and sampling filters, the latency for sampling is reduced.

Fig. 2. Overall architecture of the proposed crypto-processor.

III. SAMPLER

In software implementation, sampling operations account for 70% of the computation latency ^[4]. Fig. 3 shows the architecture of the sampler module. CS-PRNG generates a random bitstream, and 5 different sampling filters produce polynomial coefficients with uniform or gaussian distribution.

Fig. 3. Block diagram of sampler module.

1. Real-time Psuedo Random Number Generator

The proposed CS-PRNG uses the 1600b Keccak-f algorithm to compute SHA-3 functions. It performs four hash functions, called SHA3-224, SHA3-256, SHA3-384, and SHA3-512, and two extendable-output functions (XOFs), called SHAKE128 and SHAKE256. The random bitstream used for sampling are generated by SHAKE128 and SHAKE256. Since these functions are based on very complex operations, hardware implementation of SHA-3 significantly reduces latency and energy consumption, compared to software implementation.

2. Sampling Filters

ML-KEM and ML-DSA use uniform rejection (UR), centered binomial distribution (CBD), uniform eta (UE), uniform gamma (UG), and sample in ball (SIB) algorithms for sampling operations. The UR filter samples a random number with equal probability within [0, q-1]. It discards the number that doesn’t satisfy the condition ({\textless} q) of the rejection bound in order to make data with uniform probability from the incoming bitstream. CBD filter generates polynomial coefficients with Gaussian distribution. It samples the result of subtraction after obtaining two Hamming Weights (HW) from random bitstream. UE and UG filters sample the number within the range [-${\eta}$, ${\eta}$] and [-${\gamma}$+1, ${\gamma}$] respectively. The filter is designed through a comparator with ${\eta}$ and ${\gamma}$ constants set according to the PQC protocol. The SIB filter is implemented using a comparator and a mux. Since it needs to be operated by checking the polynomial coefficient one by one, it is implemented without parallelism.

3. Evaluation

Fig. 4 and Table 1 show a comparison of the latency generated in sampling by the software and hardware implementations. When performing UR sampling, which consumes the most clock cycles in software implementation, the proposed work reduces the latency by 99% compared to software. For CBD and UE sampling, the latency reduction is 92% and 87% respectively. This improvement is achieved by implementing the efficient CS-PRNG hardware design and parallelization of the polynomial banks. The software benchmark is based on an Intel i7-7700 environment.

Fig. 4. Latency of sampler for ML-KEM-512 and ML-DSA-44.

Table 1. Clock cycles of sampling operations

Latency of Sampler
ML-KEM-512	Intel i7-7700	Intel i7-7700 (w/AVX2)	This work
UR (x4)	72,256	15,588	808
CBD (x1)	5,251	6,876	99
ML-DSA-44	Intel i7-7700	Intel i7-7700 (w/AVX2)	This work
UR (x16)	241,688	87,480	2,752
UE (x1)	4,850	2,934	155

IV. ARITHMETIC LOGIC ELEMENT ARRAY

Fig. 5 shows the structure of an ALE array. The ALE array consists of 32 ALEs. Each ALE contains an arithmetic, modular arithmetic, and logical arithmetic operator. All of 32 ALEs operates when the crypto-core processes NTT or INTT instructions. For modular multiplier, we implemented Barrett reduction as hardware.

Fig. 5. Block diagram of ALE array module.

1. Modular Multiplier

The modular multiplication is the heaviest part in the ALE. It involves arithmetic multiplication and calculating the remainder on the q values. To reduce the complexity of the operation, Montgomery, Barrett reduction, etc. are used. However, the reduction operation also requires high power consumption and latency.

On the other hand, modulo q of the PQC protocols is a fixed prime number. Under these conditions, Barrett reduction can be implemented through a simple algorithm (Fig. 6). We implemented the reduction using only a few adders, subtractors, and shifters.

Fig. 6. Block diagram of modular multiplier.

2. NTT/INTT

Fig. 5 shows the data flow of NTT/INTT. The operation is processed through Cooley-Tukey (CT) and Gentleman-Sande (GS) butterfly configurations, respectively. Each ALE uses modular operators to implement the appropriate butterfly unit according to the operation. We reduced memory accesses by using an 8x4 ALE array. Since the dimension of the polynomial in both ML-KEM and ML-DSA is 256, only 16x2 16-to-16 bijection operations are required to complete the NTT/INTT operation. The implementation is realized by inserting a reordering MUX stage. The reordering block configures any bijection according to the operation type and NTT/INTT stage.

3. Polynomial Arithmetic

The key, message, and ciphertext of the MLWE algorithm are all coefficients of a polynomial. Except for sampling, pointwise arithmetic between polynomials accounts for most of the computation in both ML-KEM and ML-DSA. We use ALE arrays to parallelize the polynomial arithmetic. Since we have 16-parallel memory of coefficient, 16 ALEs out of 32 are used for polynomial arithmetic, which are shown as gray box in Fig. 5.

4. Evaluation

Table 2 shows a comparison of the latency of NTT/INTT in the software and hardware implementations. Compared to the Intel i7-7700 environment, the proposed hardware reduces the running time of NTT/INTT by 99%.

Table 2. Clock cycles of NTT & INTT operations

Latency of NTT & INTT of a polynomial
ML-KEM	Intel i7-7700	Intel i7-7700 (w/AVX2)	This work
NTT	9,138	365	61
INTT	12,301	382	61
ML-DSA	Intel i7-7700	Intel i7-7700 (w/AVX2)	This work
NTT	6,750	2,270	61
INTT	9,535	2,115	61

V. PARALLEL OPERATION AND CLOCK GATING

To reduce the execution time of the protocol, we designed the instructions that make the operation of the sampler and ALE array to be performed in parallel. We also reduced the power consumption of the ML-KEM by clock gating.

1. Parallel Operation

Depending on the order of the instructions, the sampler and ALE array can operate independently. If there is a situation where both sampling and NTT operations are required for independent polynomials, the latency can be reduced by performing the operations in parallel (Fig. 7).

Fig. 7. Effect of instruction for parallel operations.

2. Clock Gating

In the proposed implementation, both ML-KEM and ML-DSA are performed on the same hardware. In both of them, all operations are performed on the quotient ring, but modulo q values are different (3329 for ML-KEM and 8380417 for ML-DSA). Operating ML-KEM, about half of registers consume power just by clock network. We applied clock gating for removing the unnecessary power consumption. Fig. 8 shows a comparison of the power consumption of the ALE array when performing NTT operations on ML-KEM. The power consumption of ALE array is reduced by 16% by clock gating.

Fig. 8. Effect of clock gating.

VI. IMPLEMENTATION

The PQC processor is implemented on a 28 nm CMOS technology process using 9.5 KB of on-chip memory and 511.8k NAND gate equivalents. Fig. 9 shows the layout of the proposed design. Table 3 compares the performances with previous works. This implementation consumes 16~24 mW from a supply voltage of 0.9 V while processing ML-KEM-512 and ML-DSA-44. Compared to state-of-the-art ^[6], the implementation shows 2.91x and 2.62x higher power efficiency for ML-KEM-512 and ML-DSA-44, respectively.

Fig. 9. Layout of PQC processor.

Table 3. Performance Comparison

	[4]	[5]	[6]	This Work
Process	40 nm	28 nm	28 nm	28 nm
Frequency [MHz]	12~72	300	500	83*
Voltage [V]	0.68~1.1	0.9	0.9	0.9
Area [mm²]	0.28	-	3.6	0.83
Logic Gates	106k	942k	-	560k
SRAM [KB]	40.25	12	448	7.49
Power [mW]	5~7.5	-	163~304	16~24*
ML-KEM	Support	Support	Support	Support
ML-DSA	Support	-	Support	Support
ML-KEM-512 (Keygen + Encaps. + Decaps.)
Clock Cycles	348,526	144,431	10,433	5,794
Power. Eff. [OPS/W]	37,617	78,125	294,018	856,768*
Area. Eff. [GOPS/Wm²]	134	-	82	1,032
ML-DSA-44 (Keygen + Sign + Verify)
Clock Cycles	829,201	-	43,376	27,426
Power. Eff. [OPS/W]	11,472	-	48,637	127,477*
Area. Eff. [GOPS/Wm²]	41	-	14	153

* Measured by Primetime prime-R-2020.09-SP1.

V. CONCLUSIONS

This work proposes a hardware architecture for a PQC processor that supports ML-KEM and ML-DSA. The proposed design enables efficient modular multiplication using Barrett reduction, which is configurable specifically for the PQC algorithm. It also utilizes a parallel memory structure and a real-time sampler module. The architecture is implemented in 28nm CMOS technology. The proposed design uses the smallest on-chip memory and achieves 2.91x higher power efficiency compared to the previous state-of-the-art.

References

J. Park, et al., “A Fully Integrated Cryo-CMOS SoC for Qubit Control in Quantum Computers Capable of State Manipulation, Readout and High-Speed Gate Pulsing of Spin Qubits in Intel 22nm FFL FinFET Technology,” ISSCC, pp. 208-209, Feb. 2021.

National Institute of Standards and Technology (NIST), “Module-Lattice-Based Key-Encapsulation Mechanism Standard,” Federal Information Processing Standards Publications (FIPS PUBS) 203, Aug. 2023.

L. Ma, X. Wu and G. Bai, "Parallel polynomial multiplication optimized scheme for CRYSTALS-KYBER Post-Quantum Cryptosystem based on FPGA," 2021 International Conference on Communications, Information System and Computer Engineering (CISCE), Beijing, China, 2021, pp. 361-365.

U. Banerjee, A. Pathak, and A. P. Chandrakasan, “An Energy-Efﬁcient Conﬁgurable Lattice Cryptography Processor for the Quantum-Secure Internet of Things,” ISSCC, pp. 46-47, Feb. 2019.

G. Xin, et al., “VPQC: A Domain-Specific Vector Processor for Post-Quantum Cryptography Based on RISC-V Architecture,” TCAS-I, vol. 67, no. 8, pp. 2672-2684, Aug. 2020.

Y. Zhu, et al., “A 28nm 48KOPS 3.4µJ/Op Agile Crypto-Processor for Post-Quantum Cryptography on Multi-Mathematical Problems,” ISSCC, pp. 514-516, 2022.

ByungJun Kim

ByungJun Kim received the B.S. and M.S. degrees in electronic and electrical engineering from the Pohang University of Science and Technology (POSTECH), Pohang, South Korea, in 2016 and 2018, respectively, where he is currently pursuing the Ph.D. degree. His research interests include post-quantum cryptography processor and homomorphic encryption accelerator.

Han-Gyeol Mul

Han-Gyeol Mul received the B.S. degree in electrical engineering and the Ph.D. degree in convergence IT engineering from the Pohang University of Science and Technology (POSTECH), Pohang, South Korea, in 2015 and 2023, respectively. He is currently a Post-Doctoral Researcher with the Department of Electronic and Electrical Engineering, POSTECH. His research primarily focuses on developing low-power and energy-efficient deeplearning hardware solutions specifically tailored for edge IoT devices.

Shinwoong Kim

Shinwoong Kim received the B.S. and M.S. degrees in electrical engineering and information and communication engineering from Handong Global University, Pohang, South Korea, in 2009 and 2011, respectively, and the Ph.D. degree in electronic and electrical engineering from the Pohang University of Science and Technology, Pohang, South Korea, in 2016. From 2016 to 2022, he was a Senior Engineer at Samsung Electronics, Hwasung, South Korea, where he involved in the design of Local Oscillator (LO) domain including all-digital phase-locked loop for RF communication systems. In 2022, he joined Handong Global University, Pohang, South Korea, where he is currently a Assistnat Professor. His current research interests include analog/digital frequency synthesizer and low-power clock generation.

JongMin Lee

JongMin Lee received the B.S. degrees in electronic and electrical engineering from Pohang University of Science and Technology (POSTECH), Pohang, South Korea, in 2023, where he is currently pursuing the M.S. degree. His research interests include Low DropOut regulator and hardware accelerator for machine learning.

Jae-Yoon Sim

Jae-Yoon Sim (Senior Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electronic and electrical engineering from the Pohang University of Science and Technology (POSTECH), Pohang, South Korea, in 1993, 1995, and 1999, respectively. From 1999 to 2005, he was a Senior Engineer with Samsung Electronics, Hwasung, South Korea. From 2003 to 2005, he was a Post-Doctoral Researcher at the University of Southern California, Los Angeles, CA, USA. From 2011 to 2012, he was a Visiting Scholar with the University of Michigan, Ann Arbor, MI, USA. In 2005, he joined POSTECH, where he is currently a Professor. Since 2019, he has been the Director of the Scalable Quantum Computer Technology Platform Center, Pohang, South Korea, nominated by the Korean Ministry of Science and Information and Communication Technology (ICT). His research interests include sensor interface circuits, high-speed serial/parallel links, phase-locked loops, data converters, braininspired computing, and quantum computing. Dr. Sim was a co-recipient of the Takuo Sugano Award from the IEEE International Solid-State Circuits Conference (ISSCC) in 2001 and a recipient of the Special Author Recognition Award from ISSCC 2013. In 2020, he received the Scientist of the Month Award sponsored by the Ministry of Science and ICT of Korea. He has been an IEEE Distinguished Lecturer from 2018 to 2019. He has served on the Technical Program Committees for the ISSCC, the IEEE Symposium on VLSI Circuits, and the IEEE Asian Solid-State Circuits Conference.

JSTSJournal of Semiconductor Technology and Science

Journal Search

Journal XML

Journal Information

A 1.03MOPS/W Lattice-based Post-quantum Cryptography Processor for IoT Devices

Abstract

Index Terms

I. INTRODUCTION

Fig. 1. A summary of PQC standardization

II. ARCHITECTURE

Fig. 2. Overall architecture of the proposed crypto-processor.

III. SAMPLER

Fig. 3. Block diagram of sampler module.

1. Real-time Psuedo Random Number Generator

2. Sampling Filters

3. Evaluation

Fig. 4. Latency of sampler for ML-KEM-512 and ML-DSA-44.

Table 1. Clock cycles of sampling operations

IV. ARITHMETIC LOGIC ELEMENT ARRAY

Fig. 5. Block diagram of ALE array module.

1. Modular Multiplier

Fig. 6. Block diagram of modular multiplier.

2. NTT/INTT

3. Polynomial Arithmetic

4. Evaluation

Table 2. Clock cycles of NTT & INTT operations

V. PARALLEL OPERATION AND CLOCK GATING

1. Parallel Operation

Fig. 7. Effect of instruction for parallel operations.

2. Clock Gating

Fig. 8. Effect of clock gating.

VI. IMPLEMENTATION

Fig. 9. Layout of PQC processor.

Table 3. Performance Comparison

V. CONCLUSIONS

References

ByungJun Kim

Han-Gyeol Mul

Shinwoong Kim

JongMin Lee

Jae-Yoon Sim

Article Information (continued)

Index Terms

JSTS
Journal of Semiconductor Technology and Science