Hardware-software Co-design for Vector Similarity Search on HBM-PIM
Nahyeon Kim1,*
Sujin Kim2,*
Min Jung1
Haechannuri Noh2
Ji-Hoon Kim1
1 (Department of Electronic Engineering, Hanyang University, 222, Wangsimni-ro, Seongdong-gu,
Seoul, Republic of Korea)
2 (Division of Electronic & Semiconductor Engineering, Ewha Womans University, 52, Ewhayeodae-gil,
Seodaemun-gu, Seoul, Republic of Korea)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Index terms
Processing-in-memory (PIM), retrieval-augmented generation (RAG), vector similarity search, distance computation, instruction set extension, hardware-software co-design, PIM simulator
I. INTRODUCTION
Large Language Models (LLMs), such as GPT-4, have driven remarkable advances in natural
language processing (NLP). Despite their capabilities, LLMs still face fundamental
challenges, including hallucination and a lack of long-term memory [1]. Retrieval-Augmented Generation (RAG) [2] has been proposed to address these limitations by enabling models to dynamically
retrieve relevant external information. A key enabler of RAG is the vector similarity
search [3], which retrieves the most relevant data by computing similarities between query vectors
and large-scale vector databases. Vector similarity search involves intensive memory
access, typically implemented as Level 2 BLAS (matrix-vector) operations. As datasets
grow larger, these computations become memory-bound, with performance limited by memory
bandwidth. Processing-In-Memory (PIM) architectures, which integrate compute capabilities
directly within memory [4-7], offer a promising solution by reducing data transfers between the memory and the
processor, thereby alleviating the memory bandwidth bottleneck.
In this work, we address these challenges through a holistic approach that spans both
the software and hardware domains. We initially implemented vector similarity search
applications and developed vector distance calculation libraries supporting Euclidean
distance and cosine similarity. These libraries directly access PIM memory and calculate
distances, reducing data transfer latency between memory and host.
While software-level optimization yielded some improvements, our experiments identified
significant performance bottlenecks in distance computations. Although distance operations
could be implemented using the existing PIM instructions [8], this method proved highly inefficient because it required frequent data movement
and intermediate buffering between memory banks and the PIM processing unit, thereby
negating the inherent advantages of in-memory computation. To overcome these constraints,
we extended our work to the hardware level by modifying the PIM datapath and introducing
two new instructions [9] optimized for distance computations: one targeting Euclidean distance and another
supporting Manhattan distance. Fig. 1 shows the overall system architecture, including the distance computation library
and the PIM-based hardware with custom instructions. This combined software-hardware
optimization resulted in substantial performance improvements, as demonstrated through
implementation and validation on an FPGA-based PIM platform and PIM simulator.
Fig. 1. Overview of the distance computation PIM platform with custom instructions
for vector similarity.
The remainder of this paper is organized as follows. Section II introduces the background
of distance computation and related PIM architectures including the baseline HBM-PIM
architecture. Section III describes the software and hardware implementation, including
the proposed instruction set extensions. Section IV presents evaluation results from
both DRAM-based and HBM-based PIM platforms, as well as area analysis. Finally, the
paper is concluded in Section V.
II. BACKGROUND
1. Operation Description for Distance Calculation
Distance calculation is a fundamental operation in vector similarity search, used
to quantify the similarity or dissimilarity between vectors in high-dimensional spaces.
This work focuses on three commonly used distance metrics: Manhattan distance, Euclidean
distance, and cosine similarity.
Manhattan distance, also known as the L1 distance, measures the sum of the absolute
differences between the corresponding elements of vectors p and q. It is defined as

$d_{L1}(p, q) = \sum_{i=1}^{n} |p_i - q_i|$.   (1)
Euclidean distance, also referred to as L2 distance, represents the straight-line
distance between two points p and q in Euclidean space. It is calculated as

$d_{L2}(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$.   (2)
Cosine similarity measures the cosine of the angle θ between two vectors p and q,
indicating their directional alignment regardless of magnitude. It is computed as

$\cos\theta = \dfrac{p \cdot q}{\|p\| \, \|q\|}$,   (3)
where p and q are vectors, $p \cdot q$ is the dot product, and $\|p\|$, $\|q\|$ are
their norms. A value closer to 1 implies a higher similarity.
Euclidean distance measures the separation between two vectors, but vector search
typically omits the square root, since only relative distances matter for ranking.
Likewise, most datasets intended for cosine-similarity search store vectors already
normalized to unit length, so we implemented the cosine similarity library as a
plain dot product.
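The three metrics, together with the square-root omission and the unit-norm simplification described above, can be sketched as follows. This is a minimal NumPy illustration, not the PIM library's actual implementation.

```python
import numpy as np

def manhattan(p, q):
    # L1 distance: sum of absolute element-wise differences
    return float(np.sum(np.abs(p - q)))

def squared_euclidean(p, q):
    # L2 distance with the square root omitted: sqrt is monotonic,
    # so the neighbor ranking is unchanged
    d = p - q
    return float(np.dot(d, d))

def cosine_similarity(p, q):
    # For unit-normalized vectors this reduces to the dot product
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))
```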
2. Related PIM Architectures
Recent years have witnessed the emergence of diverse DRAM-based Processing-in-Memory
(PIM) architectures. SK hynix AiM extends the HBM/GDDR interface with all-bank parallelism
and in-memory GEMV support, targeting matrix-vector multiplication workloads [10,11]. UPMEM’s DIMM-PIM integrates lightweight processing units (DPUs) inside DDR4 DIMMs,
offering flexible near-memory acceleration for a wide range of tasks [12-14]. While these efforts demonstrate the broad potential of DRAM-PIM, our focus differs:
distance computation in vector similarity search requires repeated execution of simple
arithmetic (ADD, MUL). For this purpose, Samsung’s HBM-PIM, which natively exposes
such primitive operations, provides a more direct and efficient baseline. We therefore
adopt HBM-PIM to align with architectures recently offered by leading memory vendors
while ensuring suitability for distance-calculation workloads [15,16]. Based on this rationale, the next subsection describes the baseline Samsung HBM-PIM
architecture that serves as the foundation of our work.
3. Baseline HBM-PIM Architecture
Fig. 2 shows an overview of the baseline Samsung PIM architecture [15,16]. The architecture consists of Even and Odd banks that share a single PIM unit, so
they cannot perform operations simultaneously. The PIM execution unit is composed
of a Floating-Point Unit (FPU), Command Register Files (CRF), General Register Files
(GRF), and Scalar Register Files (SRF). The architecture supports three modes: single
bank mode, all bank mode, and all bank-PIM mode. In single bank mode, it operates
in standard DRAM mode, performing conventional DRAM commands. In all-bank mode, it
can access multiple banks simultaneously, while in all-bank PIM mode, it triggers
PIM commands and executes PIM operations. Currently, Samsung PIM architecture supports
nine instructions: four ALU instructions (ADD, MUL, MAC, MAD), two Data instructions
(MOV, FILL), and three Control instructions (NOP, JUMP, EXIT).
Fig. 2. Baseline PIM architecture [15,16].
As shown in Fig. 3, the programmable computing unit, referred to as the PIM unit, consists of an FPU,
CRF, GRF, and SRF. The FPU can execute 16 operations on 16-bit data simultaneously.
PIM instructions are stored in the CRF, so PIM commands are triggered as specified
by the CRF. The GRF is divided into GRF_A and GRF_B, which are connected to the Even
and Odd banks, respectively. Each GRF consists of eight 16-bit registers, allowing
storage of eight 16-bit elements. Lastly, the SRF is divided into SRF_A, for addition
operations, and SRF_M, for multiplication, with each section containing eight 16-bit
registers, similar to the GRF.
Fig. 3. The micro-architecture of programmable computing unit.
This work adopts the Samsung HBM-PIM architecture as the baseline to ensure alignment
with prior work and provide a fair reference point for performance evaluation. Consequently,
16-bit data types are used throughout the system. Although many public datasets are
originally provided in 32-bit floating point (FP32) format, we confirm that quantization
to 16-bit precision preserves accuracy in our target workloads [17]. Specifically, the SIFT dataset consists of image descriptors with integer values
ranging from 0 to 255, which are naturally robust to reduced precision and show no
degradation even under BF16 representation. For the GIST dataset, we empirically evaluated
vector search accuracy after BF16 quantization and observed no measurable drop compared
to FP32. Therefore, implementing the PIM unit with FP16 capability efficiently supports
both neural network operations and vector similarity search, while remaining practical
for real-world datasets.
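The BF16 quantization discussed above can be emulated in software by truncating the low 16 bits of the FP32 bit pattern (real hardware may instead round to nearest; this sketch uses simple truncation for illustration). Since SIFT descriptors are integers in [0, 255] and BF16 provides 8 significand bits, every such value is represented exactly.

```python
import numpy as np

def to_bf16(x):
    # BF16 keeps the FP32 sign, exponent, and top 7 stored mantissa bits;
    # masking off the low 16 bits of the FP32 word emulates the format.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# SIFT-like integer descriptors survive the quantization losslessly.
sift_like = np.arange(256, dtype=np.float32)
assert np.array_equal(to_bf16(sift_like), sift_like)
```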
III. IMPLEMENTATION
1. Software Implementation
The most commonly used distance metrics for calculating vector similarity are Euclidean distance,
cosine similarity, and dot product. However, in datasets designed for cosine similarity-based
vector search, the vectors are often normalized. Therefore, in this study, we considered
cosine similarity and dot product as one metric and developed libraries for PIM devices
to compute Euclidean distance and cosine similarity [18-20]. Since the goal is to compare relative distances rather than absolute ones, square
root operations can be omitted in Euclidean distance calculation.
Vector similarity search compares the query vector with database vectors and finds
nearest neighbor vectors. In this study, we implemented distance operation libraries
for the vector similarity search in PIM environments. First, the database vectors
and query vectors are written to memory regions referred to as srcA and srcB, which
are then mapped to the PIM address space. The PIM unit performs distance calculations
between these regions, and the results are stored in dstC. Finally, the host uses
dstC to classify nearest neighbors. Since these libraries access the datasets directly
in the PIM address space and compute the distances within the PIM unit, they avoid
data transfer latency between the memory and the host and require no additional memory
space. Fig. 4 shows the resulting distance computation library with Euclidean distance
and cosine similarity, used for vector similarity search.
Fig. 4. Distance computation library for Euclidean distance and cosine similarity.
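The host-side flow described above can be sketched as follows. The names srcA, srcB, dstC mirror the regions named in the text, but pim_distance and nearest_neighbors are hypothetical stand-ins, not the library's real API; the in-memory computation is modeled here with plain NumPy.

```python
import numpy as np

def pim_distance(srcA, srcB):
    # Stand-in for the in-memory computation: squared L2 distance
    # between every database vector in srcA and the query in srcB.
    diff = srcA - srcB
    return np.sum(diff * diff, axis=1)

def nearest_neighbors(database, query, k):
    srcA = np.ascontiguousarray(database, dtype=np.float16)  # PIM region A
    srcB = np.ascontiguousarray(query, dtype=np.float16)     # PIM region B
    dstC = pim_distance(srcA, srcB)                          # PIM result region
    return np.argsort(dstC)[:k]                              # host-side ranking
```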
2. System Architecture
We initially implemented the distance calculation at the software level. However,
experimental results revealed performance bottlenecks due to the limitations of the
existing PIM instruction set. Therefore, we performed hardware-level optimization
by modifying the datapath and extending the instruction set [21]. Accordingly, we implemented the Euclidean distance computation, and through the
datapath modification, we found that it was also possible to optimize the Manhattan
distance computation. As a result, we implemented two instructions to support both
Euclidean and Manhattan distance computations. Since cosine similarity is already
efficiently supported by the existing MAC instruction, no additional optimization
was necessary. The proposed Instruction Set Architecture is shown in Table 1.
Table 1. The instruction set of the proposed PIM architecture.
| Type | Command | Result | Operand (src0) | Operand (src1) |
|---|---|---|---|---|
| Control | NOP | - | - | - |
| Control | JUMP | - | - | - |
| Control | EXIT | - | - | - |
| Data | FILL | GRF, BANK | GRF, BANK | - |
| Data | MOV | GRF, SRF | GRF, BANK | - |
| Arithmetic | ADD | GRF | GRF, BANK, SRF | GRF, BANK, SRF |
| Arithmetic | MUL | GRF | GRF, BANK, SRF | GRF, BANK, SRF |
| Arithmetic | MAC | GRF_B | GRF, BANK | GRF, BANK, SRF |
| Arithmetic | MAD | GRF | GRF, BANK | GRF, BANK, SRF |
| Arithmetic | AMC | GRF_B | GRF, BANK, SRF | GRF, BANK, SRF |
| Arithmetic | MAN | GRF_B | GRF, BANK | GRF, BANK, SRF |
As shown in Fig. 5, we customized the instruction format by adding a 1-bit 'm' field that determines
whether to take the absolute value. Architecturally, an absolute value unit is needed in the
FPU_ADD unit. The absolute value unit functions by changing the most significant bit
(MSB) to 0. When the 'MAN' operation is triggered, the subtraction between the two
vectors is calculated, the absolute value is taken, and the result is accumulated
in the GRF.
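The absolute value unit's sign-bit trick has a direct software analogue: clearing the MSB of the 16-bit floating-point word yields |x| without any arithmetic. The sketch below emulates it on FP16 bit patterns (illustrative only; the hardware operates on the FPU's internal 16-bit words).

```python
import numpy as np

def fp16_abs(x):
    # Reinterpret the FP16 words as raw 16-bit integers, clear the
    # sign bit (MSB), and reinterpret back as FP16.
    bits = np.asarray(x, dtype=np.float16).view(np.uint16)
    return (bits & np.uint16(0x7FFF)).view(np.float16)

v = np.array([-3.5, 2.0, -0.0], dtype=np.float16)
assert np.array_equal(fp16_abs(v), np.array([3.5, 2.0, 0.0], dtype=np.float16))
```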
As shown in Eq. (2), the L2 operation calculates the subtraction between the two vectors, squares that
value, and then accumulates the result. However, since the result needed to be stored
in the GRF after the ADD, only 8 elements could be stored at a time, requiring frequent
data transfers to the banks. To address this, an operation that performs the multiplication
immediately after the addition, without the need to store the intermediate result,
was proposed. This led to a new instruction, "AMC" (Add-Multiply-Accumulate), which
resembles MAC (Multiply-Accumulate) but with the order of operations reversed (ADD
before MUL). To implement this, a new path connecting the FPU_ADD to the FPU_MUL is required.
By implementing this, the operation could be performed without transferring data between
the banks.
Fig. 5. Instruction format to support multiple similarity metrics.
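The benefit of the AMC fusion can be modeled in a few lines. With the baseline ISA, the subtraction result must be written back to the 8-element GRF before it can be squared, while AMC forwards it from FPU_ADD straight into FPU_MUL. The spill counting below is an illustrative simplification, not a cycle-accurate model of the hardware.

```python
import numpy as np

GRF_WIDTH = 8  # each GRF holds eight 16-bit elements

def l2_existing_isa(p, q):
    spills, acc = 0, 0.0
    for i in range(0, len(p), GRF_WIDTH):
        d = p[i:i + GRF_WIDTH] - q[i:i + GRF_WIDTH]  # ADD -> stored in GRF
        spills += 1                                   # intermediate write-back
        acc += float(np.sum(d * d))                   # MUL + accumulate
    return acc, spills

def l2_with_amc(p, q):
    # Fused subtract-square-accumulate: no intermediate storage needed.
    d = p - q
    return float(np.sum(d * d)), 0
```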
IV. EVALUATIONS
This study evaluated the proposed vector similarity search system across two different
environments, reflecting the progression of the research from application-level implementation
to instruction-level optimization. Initially, the vector similarity search application
was developed and evaluated on a DRAM-based PIM system. However, we identified that
instruction-level improvements were difficult to achieve on the DRAM-based platform,
leading us to transition the evaluation environment to an HBM-based PIM simulator
to analyze the effectiveness of instruction-level enhancements.
1. Application-level Evaluation on DRAM-based PIM System
The initial evaluation was conducted on a VMK180 board, configured with LPDDR4 as
system memory and two GDDR6 modules as PIM memory [22]. A dual-core ARM Cortex-A53 processor from Zynq was used, with only a single core
utilized during the experiments. The vector similarity search application was implemented
using a brute-force algorithm under the K-Nearest Neighbor (KNN) search category,
and the proposed distance computation library was evaluated within this framework.
The experiments employed random datasets of various sizes. Figs. 6 and 7 present the results of brute-force search using Euclidean distance and cosine similarity,
respectively, on both CPU and PIM environments. In the brute-force search, distances
between the query vector and all vectors in the database were computed, and the top
100 closest neighbors were returned after sorting.
The results showed that the PIM outperformed the CPU in both Euclidean distance and
cosine similarity computation. Specifically, for Euclidean distance, the PIM achieved
an average speed improvement of 44.2% over the CPU, while for cosine similarity, the
improvement was 59.0%. The performance gap between PIM and CPU increased as the dataset
size grew, with cosine similarity exhibiting a relatively higher improvement rate
compared to Euclidean distance.
Fig. 6. Computing time with log scale for brute-force with Euclidean distance.
Fig. 7. Computing time with log scale for brute-force with cosine similarity.
2. Instruction-level Optimization on HBM-based PIM Simulator
Although application-level implementation and evaluation were performed on the DRAM-based
PIM system, it exposed inherent limitations [23] of the existing instruction set, which was not optimized for vector distance computations.
While it was technically feasible to implement distance computations using the existing
instructions, frequent data movement and intermediate storage across memory banks
resulted in performance bottlenecks, preventing full utilization of PIM’s inherent
performance benefits. To address these challenges, we transitioned the evaluation
platform to an HBM-based PIM simulator environment [24,25], which provided greater flexibility for instruction set design and datapath modification.
The HBM-based PIM simulator was configured by integrating DRAMSim2, and the performance
of the proposed AMC and MAN instructions (optimized for Euclidean distance and Manhattan
distance computations, respectively) was evaluated. The simulator architecture is
illustrated in Fig. 8. The host processor generated transactions required for PIM operations, which were
executed by the PIM unit within the simulator. DRAMSim2 measured the number of cycles
required for each transaction, and upon completion of the simulation, both the computation
results and cycle counts were returned to the host.
Fig. 8. The system organized with PIM simulator with DRAMSim2.
As shown in Table 2, the total cycle count for Euclidean distance computation using the AMC instruction
was reduced by up to 44% compared to using the existing instruction. Additionally,
Fig. 9 shows that the performance improvement increased as the input matrix size grew. Notably,
for matrix sizes of 4096 × 4096 and above, the performance of the Manhattan distance
computation using the MAN instruction in the PIM-enabled state surpassed that of the
PIM-disabled state, as shown in Table 3.
Table 2. Total cycle count when using the baseline instruction and AMC.
| Matrix size | L2_existing | L2_AMC | Cycles reduced (%) |
|---|---|---|---|
| 256 × 256 | 3471 | 1937 | 44.19 |
| 512 × 512 | 5844 | 3800 | 34.98 |
| 1024 × 1024 | 11323 | 7209 | 36.33 |
| 2048 × 2048 | 21544 | 13690 | 36.46 |
| 4096 × 4096 | 42738 | 27321 | 36.07 |
| 8192 × 8192 | 168331 | 108028 | 35.82 |
| 16384 × 16384 | 671081 | 429884 | 35.94 |
Fig. 9. Performance gain when using the baseline instruction and AMC.
Table 3. Total cycle count of MAN instruction executed on CPU and PIM.
| Matrix size | PIM_disabled | PIM_enabled |
|---|---|---|
| 256 × 256 | 194 | 1937 |
| 512 × 512 | 578 | 3800 |
| 1024 × 1024 | 2494 | 7209 |
| 2048 × 2048 | 9034 | 13690 |
| 4096 × 4096 | 36082 | 27321 |
| 8192 × 8192 | 143276 | 108028 |
| 16384 × 16384 | 570801 | 429884 |
This progression from application-level evaluation on a DRAM-based platform to instruction-level
optimization on an HBM-based platform demonstrates that architectural improvements
are necessary to fully exploit the potential of Processing-In-Memory for workloads
such as distance computations.
To assess the implementability of the proposed PIM datapath, we conducted an area
comparison using a commercial digital library. Both the baseline HBM-PIM datapath
and our proposed datapath were implemented at the RTL level and synthesized using
the TSMC 65nm standard cell library. The synthesis results show that the proposed
design incurs less than 5% area overhead compared to the baseline. This minimal overhead
suggests that the proposed datapath is practically implementable without significant
hardware cost.
V. CONCLUSIONS
In this paper, we developed distance calculation libraries for Euclidean distance
and cosine similarity in a PIM environment and evaluated their performance during
vector search. For Euclidean distance, PIM achieved up to 46.2% reduction in computation
time compared to the CPU, while for cosine similarity, up to 60.6% of computation
time was reduced. Although PIM demonstrated better application-level performance,
we identified a performance bottleneck caused by limitations in the existing PIM instruction
set. To address this, we introduced the AMC and MAN instructions through datapath
modification and extension. The AMC instruction reduced cycle count by up to 44%,
and the MAN instruction outperformed the PIM-disabled baseline for input sizes larger
than 4096 × 4096 elements.
These developments demonstrate that vector search can be effectively optimized on
PIM through both software and hardware-level enhancements, significantly improving
performance for memory-bound tasks and highlighting the potential of PIM as a scalable
solution for data-intensive applications.
ACKNOWLEDGMENTS
This work was partly supported by the Institute of Information & Communications Technology
Planning & Evaluation (IITP) under the Artificial Intelligence Semiconductor Support
Program to nurture the best talents (IITP-(2025)-RS-2023-00253914), funded by the
Korea government (MSIT), and partly supported by the IITP grant funded by the Korea
government (MSIT) (No. 2022-0-00441, Memory-Centric Architecture Using the Reconfigurable
PIM Devices). The EDA tool was supported by the IC Design Education Center (IDEC),
Korea.
REFERENCES
Kim H., Kim T., Park T., Kim D., Yu Y., Kim H., Park Y., 2025, Accelerating LLMs using
an efficient GEMM library and target-aware optimizations on real-world PIM devices,
Proc. of CGO ’25: 23rd ACM/IEEE International Symposium on Code Generation and Optimization

Lewis P., Perez E., Piktus A., Petroni F., Karpukhin V., Goyal N., Küttler H., Lewis
M., Yih W., Rocktäschel T., Riedel S., Kiela D., 2020, Retrieval-augmented generation
for knowledge-intensive NLP tasks, Advances in Neural Information Processing Systems
(NeurIPS), Vol. 33, pp. 9459-9474

Johnson J., Douze M., Jégou H., 2021, Billion-scale similarity search with GPUs, IEEE
Transactions on Big Data, Vol. 7, No. 3, pp. 535-547

Jang J.-H., Shin J., 2023, In-depth survey of processing-in-memory architectures for
deep neural networks, Journal of Semiconductor Technology and Science, Vol. 23, No.
5, pp. 322-339

Cho S., 2022, Volatile and nonvolatile memory devices for neuromorphic and processing-in-memory
applications, Journal of Semiconductor Technology and Science, Vol. 22, No. 1, pp.
30-46

Lee J., Cha M., 2023, Charge trap flash structure with feedback field effect transistor
for processing in memory, Journal of Semiconductor Technology and Science, Vol. 23,
No. 5, pp. 295-302

Asifuzzaman K., Miniskar N. R., Young A. R., Liu F., Vetter J. S., 2023, A survey
on processing-in-memory techniques: Advances and challenges, Memories - Materials,
Devices, Circuits and Systems, Vol. 4, pp. 100022

Hu H., Wang W.-C., Chang Y.-H., Lee Y.-C., Lin B.-R., Wang H.-M., 2022, ICE: An intelligent
cognition engine with 3D NAND-based in-memory computing for vector similarity search
acceleration, Proc. of 55th IEEE/ACM International Symposium on Microarchitecture
(MICRO)

Verma V., Stan M. R., 2022, AI-PiM-extending the RISC-V processor with processing-in-memory
functional units for AI inference at the edge of IoT, Frontiers in Electronics, Vol.
3

Choi H., Kim G., Shin W., Won J., Kim C., Joo H., An B., Shin G., Yun J. D., 2024,
AiMX: Accelerator-in-memory based accelerator for cost-effective large language model
inference, Proc. of 2024 IEEE International Electron Devices Meeting (IEDM), pp. 1-4

Kwon Y., Vladimir K., Kim N., Shin W., Won J., Lee M., Joo H., Choi H., Kim G., An
B., 2022, System architecture and software stack for GDDR6-AiM, Proc. of 2022 IEEE
Hot Chips 34 Symposium (HCS), pp. 1-25

Gómez-Luna J., Guo Y., Brocard S., Legriel J., Cimadomo R., Oliveira G. F., Singh
G., Mutlu O., 2022, An experimental evaluation of machine learning training on a real
processing-in-memory system, arXiv preprint arXiv:2207.07886

Ortega C., Falevoz Y., Ayrignac R., 2024, PIM-AI: A novel architecture for high-efficiency
LLM inference, arXiv preprint arXiv:2411.17309

Falevoz Y., Legriel J., 2023, Energy efficiency impact of processing in memory: A
comprehensive review of workloads on the UPMEM architecture, Proc. of European Conference
on Parallel Processing, Springer, pp. 155-166

Lee S., Kang S., Lee J., Kim H., Kim E., Seo S., 2021, Hardware architecture and software
stack for PIM based on commercial DRAM technology, Proc. of the 48th Annual International
Symposium on Computer Architecture (ISCA)

Kwon Y.-C., Lee S. H., Lee J., Kwon S.-H., Ryu J. M., Son J.-P., 2021, 25.4 A 20nm
6GB function-in-memory DRAM, based on HBM2 with a 1.2TFLOPS programmable computing
unit using bank-level parallelism, for machine learning applications, Proc. of IEEE
International Solid-State Circuits Conference (ISSCC)

Jegou H., Douze M., Schmid C., 2010, Product quantization for nearest neighbor search,
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 33, No. 1, pp.
117-128

Chen J., Gómez-Luna J., Hajj I. E., Guo Y., Mutlu O., 2023, SimplePIM: A software
framework for productive and efficient processing-in-memory, Proc. of International
Conference on Parallel Architectures and Compilation Techniques (PACT)

Noh S. U., Hong J., Lim C., Park S., Kim J., Kim H., Kim Y., Lee J., 2024, PID-Comm:
A fast and flexible collective communication framework for commodity processing-in-DIMM
devices, Proc. of ACM/IEEE International Symposium on Computer Architecture (ISCA)

Item M., Gómez-Luna J., Guo Y., Oliveira G. F., Sadrosadati M., Mutlu O., 2023, TransPimLib:
A library for efficient transcendental functions on processing-in-memory systems,
Proc. of IEEE International Symposium on Performance Analysis of Systems and Software
(ISPASS)

Ahn J., Yoo S., Mutlu O., Choi K., 2015, PIM-enabled instructions: A low-overhead,
locality-aware processing-in-memory architecture, Proc. of ACM/IEEE Annual International
Symposium on Computer Architecture (ISCA), pp. 336-348

Kim C. H., Lee W. J., Paik Y., Kwon K., Kim S. Y., Park I., Kim S. W., 2022, Silent-PIM:
Realizing the processing-in-memory computing with standard memory requests, IEEE Transactions
on Parallel and Distributed Systems, Vol. 33, No. 2, pp. 251-262

Ryu S., 2024, Resource analysis on FPGA for functional verification of digital SRAM
PIM, Journal of Semiconductor Technology and Science, Vol. 24, No. 3, pp. 218-225

Karunamurthy P., Alhady S. S. N., Wahab A. A. A., Othman W. A. F. W., 2022, Integration
of gem5 and DRAMSim2 for DDR4 simulation, International Journal of Advanced Trends
in Computer Science and Engineering, Vol. 9, No. 1, pp. 698-793

Christ D., Steiner L., Jung M., Wehn N., 2024, PIMSys: A virtual prototype for processing
in memory, Proc. of International Symposium on Memory Systems, pp. 26-33

Nahyeon Kim received her B.S. degree in electronic and electrical engineering from
Ewha Womans University, Seoul, South Korea, in 2024. She is currently pursuing an
M.S. degree from the Digital System Architecture Laboratory, Hanyang University. Her
current research interests include memory testing and SoC design.
Sujin Kim received her B.S. degree in electronic and electrical engineering from Ewha
Womans University, Seoul, South Korea, in 2019. She is currently pursuing a Ph.D.
degree at the same university. Her research interests include domain-specific accelerator
architectures, with a focus on vector search, large language model inference, and
memory-centric computing.
Min Jung received her B.S. degree from the Department of Electronic and Electrical
Engineering, Ewha Womans University, in 2025. She is currently pursuing an M.S. degree from
the Digital System Architecture Laboratory, Hanyang University. Her research interests
include RISC-V processor and compression/decompression IP design.
Haechannuri Noh received her B.S. degree from the Department of Electronic and Electrical
Engineering, Ewha Womans University, in 2025. Her research interests include RISC-V
processor and SoC design.
Ji-Hoon Kim received his B.S. (summa cum laude) and Ph.D. degrees in Electrical Engineering
and Computer Science from KAIST, Daejeon, South Korea, in 2004 and 2009, respectively.
In 2009, he joined Samsung Electronics, Suwon, South Korea, as a Senior Engineer,
and worked on next-generation architecture for 4G communication modem system-on-chip
(SoC). From 2018 to 2025, he was a professor in the Department of Electronic and Electrical
Engineering, Ewha Womans University, Seoul, South Korea. Since 2025, he has been with
the Department of Electronic Engineering, Hanyang University, Seoul, South Korea.
His current research interests include CPU microarchitecture, domain-specific SoC,
and deep neural network accelerators. Dr. Kim served on the Technical Program Committee
and Organizing Committee for various international conferences, including the IEEE
International Conference on Computer Design (ICCD), the IEEE Asian Solid-State Circuits
Conference (A-SSCC), and the IEEE International Solid-State Circuits Conference (ISSCC).
He was a co-recipient of the Distinguished Design Award at the 2019 IEEE A-SSCC, and
a recipient of the Best Design Award at 2007 Dongbu HiTek IP Design Contest, the First
Place Award at 2008 International SoC Design Conference (ISOCC) Chip Design Contest,
and the IEEE/IEIE Joint Award for Young Scientist and Engineer. He also serves as
an Associate Editor for the IEEE Transactions on Circuits and System-II: Express Briefs.