LeeChang-Hyeon1
LeeJae-Hyeok1
JungHaesung1
LeeHanyoung1
LeeHanho1
-
(Dept. of Electrical and Computer Engineering, Inha University, Incheon, 22212, Korea)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Index Terms
Post-quantum cryptography, high-level synthesis, HW/SW co-design, hybrid HLS-RTL design, FPGA
I. INTRODUCTION
Artificial intelligence (AI) and cloud services are prominent fields that will gain
increased importance in the future. These technologies are expected to be applied
in a wide range of fields as advances in network technology enable rapid data communication.
According to the trends in these technologies, experts believe that future services
will be in the form of combined cloud services and AI. The traditional approach involves
on-demand availability, in which the required data is immediately accessed from cloud
servers. However, by leveraging the data provided by AI, it is possible to achieve
more diverse and personalized services, thereby enabling more efficient operations
between operators and users.
Computer technology has also exhibited rapid progress. Quantum computers, which utilize
principles of quantum mechanics such as entanglement and superposition, are being
actively researched by global IT companies for the purpose of commercialization. When
quantum computers become commercially available, these are expected to provide overwhelmingly
superior computational power compared with existing computers, enabling the quick
resolution of time-consuming or even unsolvable mathematical problems.
Therefore, it is necessary to establish new secure cryptographic systems for quantum
computers in preparation for their commercialization. Existing cryptographic systems
(e.g., RSA and ECC) used in AI and cloud technologies are vulnerable to the Shor algorithm
[1] in quantum computers, which increases the need for post-quantum cryptography (PQC).
To address this issue, the National Institute of Standards and Technology (NIST) organized
an algorithm competition [2] for PQC in 2016. Through three rounds of evaluation, the Crystals–Kyber cryptographic
algorithm [3], which was included among lattice-based candidates, was selected as the final candidate
in the public key encryption category scheduled for July 2022 [4].
Crystals–Kyber is a cryptographic system based on mathematical problems (e.g., shortest
vector problem and closest vector problem) [5]. It offers high security and is resistant to quantum computers. The use of the ring
learning with errors (Ring-LWE) technique has gained significant attention owing to
its relatively short key and ciphertext sizes. In this paper, we propose methodologies
for the high-level synthesis (HLS)-based hardware/ software (HW/SW) co-design and
hybrid HLS and register transfer level (RTL) design (HLS-RTL) of secure PQC systems.
We conducted a comparative analysis of the performance and results to evaluate the
effectiveness of these approaches.
II. BACKGROUND
1. Ring Learning with Errors
The Ring-LWE technique was proposed by Oded Regev in 2010 [6] to address the complexity of multiplication operations (e.g., convolution) and the
limitations of long encryption keys in the LWE technique. It performs all the operations
and encryptions on a polynomial ring in a finite field, R$_{\mathrm{q}}$ = Z$_{\mathrm{q}}$[x]
/ f(x). Typically, f(x) is defined as f(x) = (X$^{\mathrm{N}}$ + 1), where N is a
power of 2 and q is a prime number that satisfies q (1 mod 2N).
The Ring-LWE technique adopts a public key encryption scheme with two asymmetric keys:
a public key for encryption and a secret key for decryption. A public key is generated
during the key-generation process (a, b), where ``a'' is a random number, ``s'' represents
the secret key, and ``e'' is the value generated by a discrete Gaussian distribution
representing small noise (error). Using the public key (a, b), the process of encrypting
a message (m) to obtain (c$_{1}$, c$_{2}$) can be verified through Eq. (2).
The decryption process uses the secret key (s) to recover the ciphertext (c$_{1}$,
c$_{2}$) from the original message (m). In the key-generation step, the secret key
(s) corresponding to the public key (a, b) values used during encryption must also
be used to eliminate the small noise (error) inserted during the encryption process
and obtain the original message (m) value.
In the Ring-LWE technique, polynomial multiplication is the key operation, which is
typically performed by convolution. However, by leveraging the number theoretic transform
(NTT), the convolution operation (*) can be replaced with a point-wise multiplication
operation (•) to reduce the overall time complexity from O(N$^{2}$) to O(NlogN) based
on Big-O notation.
2. High-level Synthesis
HLS consists of several stages, as shown in Fig. 1. It begins with the C simulation process, in which the testbench code is written
to validate the C logic against the C/C++ source code. The next stage is C synthesis,
where directives such as UNROLL and PIPELINE provided by the HLS tool are applied
for optimization, and the interface (AXI-Stream) is incorporated to generate the RTL
source code. The synthesized RTL code is then verified through C/RTL co-simulation
using Vivado simulation (waveform) to ensure the equivalence of the C and RTL logics.
Replacing the RTL coding and verification steps in the conventional hardware design
process with HLS offers the advantage of higher productivity.
Fig. 1. High-level synthesis design flow.
3. Xilinx PYNQ Platform
The Xilinx PYNQ platform is an open-source project provided by Xilinx to facilitate
the easy design of embedded systems using the All Programmable Systems on Chip (APSoC)
field-programmable gate array (FPGA).
By leveraging the Python language and libraries, developers can write the host code
to be executed in the processing system (PS) domain, allowing collaborative design
between the ARM processor in the ZYNQ FPGA's PS domain and the programmable logic
(PL) domain. This enables the implementation of various embedded systems in a single
FPGA.
Fig. 2 shows the PYNQ framework that runs on the CPU in the PS domain when the PYNQ boot
image (provided by Xilinx) is loaded onto an SD card, and the ZYNQ 104 FPGA is booted
in the SD-card boot mode.
III. ARCHITECTURE
1. HLS-based HW/SW Co-design
The Xilinx Vitis-HLS tool was used with the ZYNQ 104 FPGA set as the target board
for the design process. The tool provides various pragma directives, such as loops,
block random-access memory (BRAM), and resource allocation, which are required for
optimization. These directives were utilized to design the Crystals–Kyber cryptographic
algorithm in HW. The Advanced eXtensible Interface (AXI) Stream was applied to the
input and output ports in the top function.
1.1 Optimization Using Directives
The directives primarily utilized in this design are PIPELINE and UNROLL, which optimize
the declared loops and assign specific resources to certain operations when synthesizing
from C to RTL. These directives were applied within the main functions of Crystals–Kyber.
The "#pragma HLS PIPELINE" directive optimizes the declared functions or loops by
introducing concurrency in ongoing tasks and operations, reducing the initial interval
(II) [7]. As shown in Fig. 3, based on the specified II value, a new input can be received at each clock cycle
to speed up the output generation. Specifically, the core operations of the NTT function
are described, and the PIPELINE directive is applied to the loops and internal functions
of the hash function for optimization.
Fig. 3. PIPELINE directive.
By default, the "#pragma HLS UNROLL" directive maintains the rolling state of loops
in the C/C++ functions, in which the logic for one iteration is synthesized and sequentially
executed for each subsequent iteration. However, as shown in Fig. 4, applying the UNROLL directive allows the loop to be unrolled and transformed into
multiple independent operations [7]. Consequently, multiple copies of the loop can be generated, enabling parallel computation
for all iterations. In addition, partial UNROLL can be applied by specifying a factor
value. In this study, the UNROLL directive was applied to the loops in the top function,
specifically in sections where the output values during the encryption and decryption
stages are transmitted via the interface (AXI-Stream), thus achieving parallelization.
Fig. 4. UNROLL directive.
The "#pragma HLS BIND_OP" directive specifies the resources to be mapped when synthesizing
the variables and operations declared within the function. By assigning resources
to the desired variables and operations (e.g., mul, add, sub, div), the synthesis
process incorporates the specified mappings. If no explicit directive is provided,
then the tool automatically determines the resource allocation. The options include
digital signal processing (DSP) and Fabric (non-DSP), in which DSP utilization results
in a synthesis that utilize DSP resources, while the Fabric designation excludes the
use of DSP resources. Notably, the BIND_OP directive was applied to the operations
(e.g., multiplication and reduction) within a finite field (Montgomery reduction and
Barrett reduction), and polynomial multiplication in the NTT function, enabling synthesis
with DSP resource utilization.
1.2 Interface (AXI-Stream)
The AXI (Advanced eXtensible Interface) applied to the input and output ports in the
top function of Crystals–Kyber is one of the protocols developed by ARM and designed
for arbitrary unidirectional data transfer [7]. Fig. 5 shows the signals and communication channels of the AXI-Stream, demonstrating the
utilization of a ready/valid-based handshake communication.
Fig. 5. AXI-Stream signal and communication channel.
To effectively apply the AXI-Stream in HLS, specific headers must be included, such
as ap_axi_sdata.h, hls_stream.h, and hls_vector.h. These headers define the dedicated
data structures for streams, which to convert variables of the defined stream structure
data-type to regular data-types (Interface ${\rightarrow}$ input variables) and vice
versa (output variables ${\rightarrow}$ Interface).
Therefore, by utilizing the stream headers and dedicated structures, additional code
was added to the top function to establish the connection between the input/output
signals of Crystals–Kyber and the Interface, facilitating the synthesis process.
1.3 HW/SW Co-design
The block design in the PL area was configured using the Xilinx Vivado intellectual
property (IP) integrator. It incorporates the PS IP of the ZYNQ 104 FPGA, optimized
Crystals–Kyber IP designed with HLS, and direct memory access (DMA) IP. The PS and
PL areas communicate through two channels: stream-to-memory-mapped (S2MM) and memory-mapped
to stream (MM2S). Fig. 6 shows the block diagram of the co-designed HW/SW (PS-PL).
Fig. 6. HW/SW co-design block diagram and PS part design.
The design of the PS area utilized the PYNQ platform described in Section 3. Accordingly,
the ARM core in the PS area operates within the PYNQ framework. The Python host code
performs the following steps. It programs the bitstream file of the final HW onto
the target board (ZYNQ 104) using the Overlay Library; imports information on the
DMA IP using the DMA Library; declares the S2MM and MM2S channels; instantiates the
DMA; and assigns the input message to be encrypted to the MM2S channel of the DMA.
As a result, the input message is transferred from the PS area to the PL area, where
the Crystals–Kyber IP performs encryption and decryption. The resulting output is
assigned to the S2MM channel of the DMA and communicated back from the PL area to
the PS area. The message used for validation in the PYNQ environment is "Inha_DIS_Lab.''
2. Hybrid HLS-RTL Design
The Verilog HDL code of Crystals–Kyber, generated through the optimization, synthesis,
and verification processes in HLS, was imported into Xilinx Vivado. A testbench code
for the AXI-Stream was developed, and the simulation waveform was analyzed to explore
the modules that consumed a significant number of cycles and exhibited poor data throughput.
Specifically, for the complex NTT module responsible for polynomial multiplication
in Crystals–Kyber, a hybrid HLS-RTL design was devised to replace it with a hand-coded
NTT (RTL) module.
The hybrid design approach was adopted to deal with the complexity of the NTT module
and improve its performance. Replacing the original HLS-generated module with a hand-coded
RTL implementation is expected to improve the data throughput and reduce the number
of cycles required for processing.
2.1 Number Theoretic Transform
The internal structure of the hand-coded NTT module applied in this hybrid design
involved storing the twiddle factor values as read-only memory (ROM) to multiply them
with the coefficients of the input polynomial. As the size of incoming data increases,
Montgomery reduction becomes more advantageous. Montgomery is used after multiplication
operations, while Barrett is used after addition or subtraction operations. Modulo
operations were performed using Montgomery to reduce the values within the parameter
q (3329) of Crystals–Kyber. In addition, the butterfly unit was employed to perform
Cooley–Turkey butterfly operations, followed by Barrett. These operations were combined
into a single processing element. The structure follows a total of N (2$^{8}$) stages,
determined by the parameter N of Crystals-Kyber. To parallelize the process, we designed
a 2-parallel MDF structure. A block diagram of this process is shown in Fig. 7.
Eq. (4) calculates the data throughput. Comparing the data throughput of the NTT module generated
by the HLS and the hand-coded NTT (RTL) module, the former achieved a throughput of
approximately 0.225 Mbps, while the latter achieved a throughput of approximately
3.351 Mbps, representing an approximately 15 times increase. Because the NTT module
was used in all stages of key generation, encryption, and decryption within Crystals–Kyber,
its optimization and design using HLS improved the overall latency by 40% compared
with that of the original Crystals–Kyber design.
Fig. 7. Hand-coded NTT module block diagram.
2.2 Hybrid Design
In the HW generated by the HLS, communication between modules was based on the following
signals: ap_clk (clock), ap_rst (reset), ap_start (indicating the execution state
and waiting for the next input), ap_idle (indicating the idle state when no execution
is in progress), ap_done (indicating the completion of the operation), and ap_ready
(indicating the ability to receive the next input). These four signals were utilized
to communicate through the ap_ctrl_hs (handshake) interface. In the top code of the
NTT module, the input, output, and control signals of the hand-coded NTT (RTL) module
were divided into four states: Idle, Read, Run, and Write. Within each state, the
necessary operations and logic were declared and connected to design the finite-state
machine.
Fig. 8 shows the top diagram of the hybrid design, where the topmost layer represents the
key-generation process responsible for generating public and secret keys. The bottom
layer consists of the encryption module which takes the message to be encrypted and
public key as inputs and the decryption module, which takes the secret key as an input
and performs the decryption process. It is worth noting that within the entire set
of modules generated by HLS for Crystals–Kyber, the HLS-based NTT module was replaced
with a hand-coded Verilog HDL-based NTT module in the hybrid structure.
Fig. 8. Hybrid HLS-RTL design block diagram.
IV. IMPLEMENTATION RESULTS
In this study, we compared and analyzed the performance of HLS-based optimization
and the design results of a proposed post-quantum cryptographic system. The Xilinx
tools Vitis-HLS (2021.1) and Vivado (2020.1) were utilized, and the target board was
set as the ZYNQ 104 FPGA for implementation.
As shown in Table 1, Hand-coded NTT requires more LUTs compared to HLS but offers higher data throughput
and lower latency. The HLS-based NTT module does not apply pipeline, parallel, etc.
and no optimization using HLS directives, therefore the area may look smaller, but
the latency is 22 times worse. If it makes an HLS-based NTT using similar HW resources
as an RTL-based NTT, it will perform well enough, but an RTL-based NTT will perform
better.
As shown in Table 2, the hybrid HLS-RTL design utilized in the ZYNQ 104 FPGA. By utilizing the hand-coded
NTT module, a shorter latency was achieved compared with the results obtained from
the HLS-based NTT module, resulting in improved performance in terms of the area ${\times}$
time (ATP) metric. In particular, the productivity benefits of HLS significantly reduced
implementation time compared to a Verilog HDL-based implementation that was 100% hand-coded.
These results suggest that the hybrid design has achieved performance gains, particularly
in the Key Generation, Encryption, and Decryption phases, a speedup of approximately
1.8x and 2.1x, respectively.
Table 1. Performance comparison between NTT generated with Hand-coded and HLS NTT
|
Hand-coded
|
HLS
|
Throughput (Mbps)
|
3.351
|
0.225
|
LUTs
|
3,732
|
232
|
Latency (μs)
|
5.1
|
112.7
|
Area × Time
(LUTs × Latency)
|
0.019
|
0.026
|
FPGA
|
ZYNQ104
|
ZYNQ104
|
Table 2. Implementation result and performance comparisons (Key + Enc + Dec)
|
Hybrid design
|
HLS design
|
LUTs
|
39,942 (17.3%)
|
31,807 (13.8%)
|
FFs
|
26,877 (5.8%)
|
19,820 (4.3%)
|
BRAMs
|
30.5 (9.8%)
|
32 (10.3%)
|
DSPs
|
282 (16.3%)
|
290 (16.8%)
|
Clock Freq (MHz)
|
201.2
|
207.7
|
Latency (μs) (Key/Enc/Dec)
|
201.95/ 224.91/ 74.44
|
363.81/ 410.39/ 157.5
|
Area × Time
(LUTs × Latency)
|
20
|
29.6
|
FPGA
|
ZYNQ104
|
ZYNQ104
|
V. CONCLUSIONS
In this study, the proposed HLS-based HW/SW co-design allowed us to implement and
validate a system that controls and communicates with the Crystals–Kyber HW IP, which
was designed using HLS, utilizing the ARM processor (Cortex-A53) and memory in the
PS domain of the ZYNQ 104 FPGA on a single board. In the hybrid HLS-RTL design, the
specific modules generated by the HLS were replaced with hand-coded RTL modules that
offered better data throughput, resulting in the improved overall latency for Crystals–Kyber.
ACKNOWLEDGMENTS
This work was supported by INHA UNIVERSITY Research Grant.
References
P. W. Shor, “Polynomial-time algorithms for prime factorization and discrete logarithms
on a quantum computer,” SIAM J. Comput., vol. 26, pp. 1484-1509, Oct. 1997.
National Institute of Standards and Technology, “PQC (Post-Quantum Cryptography) 3rd
Round Submissions,” last modified June 14, 2021, accessed https://csrc.nist.gov/projects/post-quantum-
cryptography/round-3-, Oct. 18. 2021.
R. Avanzi, et. al. “CRYSTALS-Kyber Algorithm Specifications And Supporting Documentation
(version 3.0),” NIST PQC Round 3 submission, Oct. 2020.
National Institute of Standards and Technology, “Status Report on the Third Round
of the NIST Post-Quantum Cryptography Standardization Process,” accessed https://doi.org/10.6028/NIST.IR.8413-upd1,
July 2022.
O. Regev, “On lattices, learning with errors, random linear codes, and cryptography,”
in Proceedings of the ACM Symposium on Theory of Computing, Baltimore, MD, USA, pp.
84-93, May. 2005.
V. Lyubashevsky, “On ideal lattices and learning with errors over rings,” Annual international
conference on the theory and applications of cryptographic techniques, pp. 1-23, 2010.
Vitis High-Level Synthesis, “User Guide,” UG 1399, v2021.2, October 22, 2021.
Chang-Hyeon Lee received a B.S. degree in Electronic Engineering from Tech University
of Korea in 2020, and an M.S. degree in Electrical and Computer Engineering from Inha
University, South Korea, in 2023. He is currently a System-on-Chip designer at Next
Chip, South Korea. His research interests include algorithms and architectures for
post-quantum cryptography.
Jae-Hyeok Lee received a B.S. degree in Information and Communi-cation Engineering
from Inha University, South Korea, in 2021, and an M.S. degree in Electrical and Computer
Engineering from Inha University, South Korea, in 2023. He is currently a System-on-chip
digital IC designer at Realtek Korea. His research interests include algorithm and
architecture for lattice-based homomorphic encryption.
Haesung Jung received a B.S. degree in Information and Communi-cation Engineering
from Inha University, South Korea, in 2023. Since March 2023, he has been pursuing
the M.S. degree in Electrical and Computer Engineering from Inha University, South
Korea. He currently specializes in post-quantum cryptography hardware architecture
design, and his research interests include VLSI architecture design for cryptography.
Hanyoung Lee received a B.S. degree in Information and Communi-cation Engineering
from Inha University, South Korea, in 2023. Since March 2023, he has been pursuing
the M.S. degree in Electrical and Computer Engineering from Inha University, South
Korea. He currently specializes in homomorphic encryption hardware architecture design,
and his research interests include VLSI architecture design for cryptography.
Hanho Lee received M.Sc. and Ph.D. degrees in Electrical and Computer Engineering
from the University of Minnesota, MN, USA, in 1996 and 2000, respectively. He was
a Technical Staff Member at Lucent Technologies (Bell Laboratories Innovations), Allentown,
PA, USA, from April 2000 to August 2002, and an Assistant Professor of the Department
of Electrical and Computer Engineering, University of Connecticut, USA, from August
2002 to August 2004. He was a Visiting Scholar at Bell Laboratories, Alcatel Lucent,
Murray Hill, NJ, USA, from August 2010 to August 2011. He has been with the Department
of Information and Communication Engineering, Inha University, South Korea, since
August 2004, where he is currently a professor. His research interests include algorithms
and architectures design for post-quantum cryptography, homomorphic encryption, artificial
intelligence, forward error correction, and digital signal processing.