
  1. (Department of Electronic Engineering, Hanyang University, 222, Wangsimni-ro, Seongdong-gu, Seoul, Republic of Korea)
  2. (Division of Electronic & Semiconductor Engineering, Ewha Womans University, 52, Ewhayeodae-gil, Seodaemun-gu, Seoul, Republic of Korea)



Keywords: Processing-in-memory (PIM), retrieval-augmented generation (RAG), vector similarity search, distance computation, instruction set extension, hardware-software co-design, PIM simulator

I. INTRODUCTION

Large Language Models (LLMs), such as GPT-4, have driven remarkable advances in natural language processing (NLP). Despite their capabilities, LLMs still face fundamental challenges, including hallucination and a lack of long-term memory [1]. Retrieval-Augmented Generation (RAG) [2] has been proposed to address these limitations by enabling models to dynamically retrieve relevant external information. A key enabler of RAG is vector similarity search [3], which retrieves the most relevant data by computing similarities between query vectors and large-scale vector databases. Vector similarity search involves intensive memory access, typically implemented as Level 2 BLAS (matrix-vector) operations. As datasets grow larger, these computations become memory-bound, with performance limited by memory bandwidth. Processing-In-Memory (PIM) architectures, which integrate compute capabilities directly within memory [4-7], offer a promising solution by reducing data transfers between the memory and the processor, thereby alleviating the memory bandwidth bottleneck.

In this work, we address these challenges through a holistic approach that spans both the software and hardware domains. We initially implemented vector similarity search applications and developed vector distance calculation libraries supporting Euclidean distance and cosine similarity. These libraries directly access PIM memory and calculate distances, reducing data transfer latency between memory and host.

While software-level optimization yielded some improvements, our experiments identified significant performance bottlenecks in distance computations. Although distance operations could be implemented using the existing PIM instructions [8], this method proved highly inefficient because it required frequent data movement and intermediate buffering between memory banks and the PIM processing unit, thereby negating the inherent advantages of in-memory computation. To overcome these constraints, we extended our work to the hardware level by modifying the PIM datapath and introducing two new instructions [9] optimized for distance computations: one targeting Euclidean distance and another supporting Manhattan distance. Fig. 1 shows the overall system architecture, including the distance computation library and the PIM-based hardware with custom instructions. This combined software-hardware optimization resulted in substantial performance improvements, as demonstrated through implementation and validation on an FPGA-based PIM platform and PIM simulator.

Fig. 1. Overview of the distance computation PIM platform with custom instructions for vector similarity.

../../Resources/ieie/JSTS.2025.25.6.662/fig1.png

The remainder of this paper is organized as follows. Section II introduces the background of distance computation and related PIM architectures including the baseline HBM-PIM architecture. Section III describes the software and hardware implementation, including the proposed instruction set extensions. Section IV presents evaluation results from both DRAM-based and HBM-based PIM platforms, as well as area analysis. Finally, the paper is concluded in Section V.

II. BACKGROUND

1. Operation Description for Distance Calculation

Distance calculation is a fundamental operation in vector similarity search, used to quantify the similarity or dissimilarity between vectors in high-dimensional spaces. This work focuses on three commonly used distance metrics: Manhattan distance, Euclidean distance, and cosine similarity.

Manhattan distance, also known as the L1 distance, measures the sum of the absolute differences between the corresponding elements of two vectors p and q. It is defined as

(1)
$L1(p, q) = \sum_{i} |p_i - q_i|.$

Euclidean distance, also referred to as L2 distance, represents the straight-line distance between two points p and q in Euclidean space. It is calculated as

(2)
$L2(p, q) = \sqrt{\sum_{i} (p_i - q_i)^2}.$

Cosine similarity measures the cosine of the angle θ between two vectors p and q, indicating their directional alignment regardless of magnitude. It is computed as

(3)
$\cos \theta = \frac{p \cdot q}{\|p\| \|q\|},$

where p and q are vectors, $p \cdot q$ is the dot product, and $\|p\|$, $\|q\|$ are their norms. A value closer to 1 implies a higher similarity.

Euclidean distance measures the separation between two vectors, but vector search typically omits the square root since only relative distances matter for ranking.

Since most vector-search datasets intended for cosine similarity are already normalized so that each vector has unit length, we implemented the cosine similarity library as a dot product.
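The three metrics above can be sketched in plain Python (a minimal illustration for reference, not the PIM library itself). Note that the Euclidean variant drops the square root, since ranking by squared distance preserves nearest-neighbor order, and that cosine similarity reduces to a dot product when the vectors are unit-norm:

```python
import math

def l1_distance(p, q):
    # Eq. (1): sum of absolute element-wise differences
    return sum(abs(a - b) for a, b in zip(p, q))

def l2_distance_squared(p, q):
    # Eq. (2) without the square root: squared distance preserves
    # the nearest-neighbor ranking, so sqrt can be omitted
    return sum((a - b) ** 2 for a, b in zip(p, q))

def cosine_similarity(p, q):
    # Eq. (3); for unit-norm vectors this reduces to the dot product
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)
```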

2. Related PIM Architectures

Recent years have witnessed the emergence of diverse DRAM-based Processing-in-Memory (PIM) architectures. SK hynix AiM extends the HBM/GDDR interface with all-bank parallelism and in-memory GEMV support, targeting matrix-vector multiplication workloads [10,11]. UPMEM’s DIMM-PIM integrates lightweight processing units (DPUs) inside DDR4 DIMMs, offering flexible near-memory acceleration for a wide range of tasks [12-14]. While these efforts demonstrate the broad potential of DRAM-PIM, our focus differs: distance computation in vector similarity search requires repeated execution of simple arithmetic (ADD, MUL). For this purpose, Samsung’s HBM-PIM, which natively exposes such primitive operations, provides a more direct and efficient baseline. We therefore adopt HBM-PIM to align with architectures recently offered by leading memory vendors while ensuring suitability for distance-calculation workloads [15,16]. Based on this rationale, the next subsection describes the baseline Samsung HBM-PIM architecture that serves as the foundation of our work.

3. Baseline HBM-PIM Architecture

Fig. 2 shows an overview of the baseline Samsung PIM architecture [15,16]. The architecture consists of Even and Odd banks that share a single PIM unit, so they cannot perform operations simultaneously. The PIM execution unit is composed of a Floating-Point Unit (FPU), Command Register Files (CRF), General Register Files (GRF), and Scalar Register Files (SRF). The architecture supports three modes: single-bank mode, all-bank mode, and all-bank PIM mode. In single-bank mode, the device operates as standard DRAM, performing conventional DRAM commands. In all-bank mode, it can access multiple banks simultaneously, while in all-bank PIM mode, it triggers PIM commands and executes PIM operations. Currently, the Samsung PIM architecture supports nine instructions: four ALU instructions (ADD, MUL, MAC, MAD), two Data instructions (MOV, FILL), and three Control instructions (NOP, JUMP, EXIT).

Fig. 2. Baseline PIM architecture [15,16].

../../Resources/ieie/JSTS.2025.25.6.662/fig2.png

As shown in Fig. 3, the programmable computing unit, referred to as the PIM unit, consists of an FPU, CRF, GRF, and SRF. The FPU can execute 16 operations on 16-bit data simultaneously. PIM instructions are stored in the CRF, which dictates the sequence of PIM commands that are triggered. The GRF is divided into GRF_A and GRF_B, which are connected to the Even and Odd banks, respectively. Each GRF consists of eight 16-bit registers, allowing storage of eight 16-bit elements. Lastly, the SRF is divided into SRF_A, for addition operations, and SRF_M, for multiplication, with each section containing eight 16-bit registers, similar to the GRF.

Fig. 3. The micro-architecture of programmable computing unit.

../../Resources/ieie/JSTS.2025.25.6.662/fig3.png

This work adopts the Samsung HBM-PIM architecture as the baseline to ensure alignment with prior work and provide a fair reference point for performance evaluation. Consequently, 16-bit data types are used throughout the system. Although many public datasets are originally provided in 32-bit floating point (FP32) format, we confirm that quantization to 16-bit precision preserves accuracy in our target workloads [17]. Specifically, the SIFT dataset consists of image descriptors with integer values ranging from 0 to 255, which are naturally robust to reduced precision and show no degradation even under BF16 representation. For the GIST dataset, we empirically evaluated vector search accuracy after BF16 quantization and observed no measurable drop compared to FP32. Therefore, implementing the PIM unit with FP16 capability efficiently supports both neural network operations and vector similarity search, while remaining practical for real-world datasets.
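The FP32-to-BF16 conversion mentioned above amounts to keeping only the upper 16 bits of the 32-bit word (sign, 8 exponent bits, and the top 7 mantissa bits). A minimal sketch, using truncation for clarity (hardware converters may round to nearest):

```python
import struct

def fp32_to_bf16_bits(x):
    # BF16 keeps the sign, the 8 exponent bits, and the top 7
    # mantissa bits of FP32, i.e. the upper 16 bits of the word
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return bits >> 16

def bf16_bits_to_fp32(b):
    # widen a BF16 pattern back to FP32 by zero-padding the mantissa
    return struct.unpack('<f', struct.pack('<I', b << 16))[0]
```

Since every integer in 0-255 fits within BF16's 8 significant bits, SIFT-style descriptors survive this round trip exactly, consistent with the observation above.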

III. IMPLEMENTATION

1. Software Implementation

The most commonly used distance metrics for vector similarity are Euclidean distance, cosine similarity, and dot product. However, in datasets designed for cosine similarity-based vector search, the vectors are often normalized. Therefore, in this study, we treated cosine similarity and dot product as one metric and developed libraries for PIM devices to compute Euclidean distance and cosine similarity [18-20]. Since the goal is to compare relative distances rather than absolute ones, square root operations can be omitted in the Euclidean distance calculation.

Vector similarity search compares the query vector with database vectors and finds nearest neighbor vectors. In this study, we implemented distance operation libraries for the vector similarity search in PIM environments. First, the database vectors and query vectors are written to memory regions referred to as srcA and srcB, which are then mapped to the PIM address space. The PIM unit performs distance calculations between these regions, and the results are stored in dstC. Finally, the host uses dstC to classify nearest neighbors. Since these libraries access datasets from PIM space and compute distances in its unit, they save data transfer latency between memory and host and do not require additional memory space. Fig. 4 shows the resulting distance computation library with Euclidean distance and cosine similarity, used for vector similarity search.
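The srcA/srcB/dstC flow above can be sketched as follows. This is an illustrative host-side model, not the actual PIM library: plain Python lists stand in for the PIM-mapped regions, and the function name `pim_knn_search` is hypothetical.

```python
import heapq

def pim_knn_search(database, query, k=100):
    # srcA: database vectors, srcB: query vector (PIM-mapped regions)
    src_a, src_b = database, query
    # dstC: distances, computed by the PIM unit in the real system
    dst_c = [sum((a - b) ** 2 for a, b in zip(vec, src_b))
             for vec in src_a]
    # host side: rank dstC and return the k nearest database indices
    return heapq.nsmallest(k, range(len(dst_c)), key=dst_c.__getitem__)
```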

Fig. 4. Distance computation library for Euclidean distance and cosine similarity.

../../Resources/ieie/JSTS.2025.25.6.662/fig4.png

2. System Architecture

We initially implemented the distance calculation at the software level. However, experimental results revealed performance bottlenecks due to the limitations of the existing PIM instruction set. Therefore, we performed hardware-level optimization by modifying the datapath and extending the instruction set [21]. Accordingly, we implemented the Euclidean distance computation, and through the datapath modification, we found that it was also possible to optimize the Manhattan distance computation. As a result, we implemented two instructions to support both Euclidean and Manhattan distance computations. Since cosine similarity is already efficiently supported by the existing MAC instruction, no additional optimization was necessary. The proposed Instruction Set Architecture is shown in Table 1.

Table 1. The instruction set of the proposed PIM architecture.

Type        Command   Result      Operand (src0)    Operand (src1)
Control     NOP       -           -                 -
Control     JUMP      -           -                 -
Control     EXIT      -           -                 -
Data        FILL      GRF, BANK   GRF, BANK         -
Data        MOV       GRF, SRF    GRF, BANK         -
Arithmetic  ADD       GRF         GRF, BANK, SRF    GRF, BANK, SRF
Arithmetic  MUL       GRF         GRF, BANK, SRF    GRF, BANK, SRF
Arithmetic  MAC       GRF_B       GRF, BANK         GRF, BANK, SRF
Arithmetic  MAD       GRF         GRF, BANK         GRF, BANK, SRF
Arithmetic  AMC       GRF_B       GRF, BANK, SRF    GRF, BANK, SRF
Arithmetic  MAN       GRF_B       GRF, BANK         GRF, BANK, SRF

As shown in Fig. 5, we customized the instruction format by adding a 1-bit 'm' field that determines whether to take the absolute value. Architecturally, an absolute value unit is needed in the FPU_ADD unit. The absolute value unit operates by clearing the most significant bit (the sign bit) to 0. When the 'MAN' operation is triggered, the subtraction between the two vectors is calculated, the absolute value is taken, and the result is accumulated in the GRF.
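The absolute-value trick and one MAN lane can be sketched as below. This is an illustrative model, not RTL; the helper names `abs16` and `man_lane` are hypothetical, and the second function uses plain floats where the hardware operates on 16-bit words:

```python
def abs16(word):
    # clearing the MSB (sign bit) of a 16-bit floating-point word
    # yields exactly |x| for sign-magnitude formats like FP16/BF16
    return word & 0x7FFF

def man_lane(acc, p, q):
    # one lane of the MAN instruction: acc += |p - q|
    return acc + abs(p - q)
```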

As shown in Eq. (2), the L2 operation calculates the subtraction between the two vectors, squares that value, and then accumulates the result. However, since the result needed to be stored in the GRF after the ADD, only eight elements could be held at a time, requiring frequent data transfers to the banks. To address this, we proposed an operation that performs the multiplication immediately after the addition, without storing the intermediate result. This led to a new instruction, "AMC" (Add-Multiply-Accumulate), similar to MAC (Multiply-Accumulate) but with the order of operations reversed (ADD before MUL). To implement this, a new datapath connecting the FPU_ADD to the FPU_MUL is required. With this path, the operation can be performed without transferring intermediate data to the banks.
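The per-lane behavior of AMC can be sketched as below (an illustrative model with hypothetical names, not RTL): the FPU_ADD result is forwarded directly to FPU_MUL and squared, then accumulated, so the difference never needs a write-back to the GRF or banks.

```python
def amc_lane(acc, p, q):
    diff = p - q              # FPU_ADD stage (subtraction)
    return acc + diff * diff  # forwarded to FPU_MUL, then accumulated

def l2_squared_amc(p_vec, q_vec):
    # in hardware, 16 such lanes execute per PIM command
    acc = 0.0
    for p, q in zip(p_vec, q_vec):
        acc = amc_lane(acc, p, q)
    return acc
```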

Fig. 5. Instruction format to support multiple similarity metrics.

../../Resources/ieie/JSTS.2025.25.6.662/fig5.png

IV. EVALUATIONS

This study evaluated the proposed vector similarity search system across two different environments, reflecting the progression of the research from application-level implementation to instruction-level optimization. Initially, the vector similarity search application was developed and evaluated on a DRAM-based PIM system. However, we identified that instruction-level improvements were difficult to achieve on the DRAM-based platform, leading us to transition the evaluation environment to an HBM-based PIM simulator to analyze the effectiveness of instruction-level enhancements.

1. Application-level Evaluation on DRAM-based PIM System

The initial evaluation was conducted on a VMK180 board, configured with LPDDR4 as system memory and two GDDR6 modules as PIM memory [22]. A dual-core ARM Cortex-A53 processor from Zynq was used, with only a single core utilized during the experiments. The vector similarity search application was implemented using a brute-force algorithm under the K-Nearest Neighbor (KNN) search category, and the proposed distance computation library was evaluated within this framework. The experiments employed random datasets of various sizes. Figs. 6 and 7 present the results of brute-force search using Euclidean distance and cosine similarity, respectively, on both CPU and PIM environments. In the brute-force search, distances between the query vector and all vectors in the database were computed, and the top 100 closest neighbors were returned after sorting.

The results showed that the PIM outperformed the CPU in both Euclidean distance and cosine similarity computation. Specifically, for Euclidean distance, the PIM achieved an average speed improvement of 44.2% over the CPU, while for cosine similarity, the improvement was 59.0%. The performance gap between PIM and CPU increased as the dataset size grew, with cosine similarity exhibiting a relatively higher improvement rate compared to Euclidean distance.

Fig. 6. Computing time with log scale for brute-force with Euclidean distance.

../../Resources/ieie/JSTS.2025.25.6.662/fig6.png

Fig. 7. Computing time with log scale for brute-force with cosine similarity.

../../Resources/ieie/JSTS.2025.25.6.662/fig7.png

2. Instruction-level Optimization on HBM-based PIM Simulator

Although application-level implementation and evaluation were performed on the DRAM-based PIM system, it exposed inherent limitations [23] of the existing instruction set, which was not optimized for vector distance computations. While it was technically feasible to implement distance computations using the existing instructions, frequent data movement and intermediate storage across memory banks resulted in performance bottlenecks, preventing full utilization of PIM’s inherent performance benefits. To address these challenges, we transitioned the evaluation platform to an HBM-based PIM simulator environment [24,25], which provided greater flexibility for instruction set design and datapath modification. The HBM-based PIM simulator was configured by integrating DRAMSim2, and the performance of the proposed AMC and MAN instructions (optimized for Euclidean distance and Manhattan distance computations, respectively) was evaluated. The simulator architecture is illustrated in Fig. 8. The host processor generated transactions required for PIM operations, which were executed by the PIM unit within the simulator. DRAMSim2 measured the number of cycles required for each transaction, and upon completion of the simulation, both the computation results and cycle counts were returned to the host.

Fig. 8. The system organized with PIM simulator with DRAMSim2.

../../Resources/ieie/JSTS.2025.25.6.662/fig8.png

As shown in Table 2, the total cycle count for Euclidean distance computation using the AMC instruction was reduced by up to 44% compared to using the existing instruction. Additionally, Fig. 9 shows that the performance improvement increased as the input matrix size grew. Notably, for matrix sizes of 4096 × 4096 and above, the performance of the Manhattan distance computation using the MAN instruction in the PIM-enabled state surpassed that of the PIM-disabled state, as shown in Table 3.

Table 2. Total cycle count when using the baseline instruction and AMC.

Matrix size      L2_existing   L2_AMC   Cycle reduced (%)
256 × 256        3471          1937     44.19
512 × 512        5844          3800     34.98
1024 × 1024      11323         7209     36.33
2048 × 2048      21544         13690    36.46
4096 × 4096      42738         27321    36.07
8192 × 8192      168331        108028   35.82
16384 × 16384    671081        429884   35.94
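The "Cycle reduced" column of Table 2 follows directly from the two cycle counts; a one-line sketch (the function name is ours, for illustration):

```python
def cycle_reduction_pct(existing, optimized):
    # reduction as reported in Table 2: (existing - optimized) / existing
    return (existing - optimized) / existing * 100.0
```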

Fig. 9. Performance gain when using the baseline instruction and AMC.

../../Resources/ieie/JSTS.2025.25.6.662/fig9.png

Table 3. Total cycle count of MAN instruction executed on CPU and PIM.

Matrix size      PIM_disabled   PIM_enabled
256 × 256        194            1937
512 × 512        578            3800
1024 × 1024      2494           7209
2048 × 2048      9034           13690
4096 × 4096      36082          27321
8192 × 8192      143276         108028
16384 × 16384    570801        429884
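The crossover point noted above can be read mechanically from Table 3 (data copied from the table; the helper name is ours, for illustration):

```python
# (matrix size, PIM_disabled cycles, PIM_enabled cycles) from Table 3
table3 = [
    (256, 194, 1937),
    (512, 578, 3800),
    (1024, 2494, 7209),
    (2048, 9034, 13690),
    (4096, 36082, 27321),
    (8192, 143276, 108028),
    (16384, 570801, 429884),
]

def crossover_size(rows):
    # smallest matrix size at which PIM-enabled MAN beats PIM-disabled
    for size, disabled, enabled in rows:
        if enabled < disabled:
            return size
    return None
```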

This progression from application-level evaluation on a DRAM-based platform to instruction-level optimization on an HBM-based platform demonstrates that architectural improvements are necessary to fully exploit the potential of Processing-In-Memory for workloads such as distance computations.

To assess the implementability of the proposed PIM datapath, we conducted an area comparison based on synthesis with a commercial standard cell library. Both the baseline HBM-PIM datapath and our proposed datapath were implemented at the RTL level and synthesized using the TSMC 65nm standard cell library. The synthesis results show that the proposed design incurs less than 5% area overhead compared to the baseline. This minimal overhead suggests that the proposed datapath is practically implementable without significant hardware cost.

V. CONCLUSIONS

In this paper, we developed distance calculation libraries for Euclidean distance and cosine similarity in a PIM environment and evaluated their performance during vector search. For Euclidean distance, PIM achieved up to 46.2% reduction in computation time compared to the CPU, while for cosine similarity, up to 60.6% of computation time was reduced. Although PIM demonstrated better application-level performance, we identified a performance bottleneck caused by limitations in the existing PIM instruction set. To address this, we introduced the AMC and MAN instructions through datapath modification and extension. The AMC instruction reduced cycle count by up to 44%, and the MAN instruction outperformed the PIM-disabled baseline for input sizes larger than 4096 × 4096 elements.

These developments demonstrate that vector search can be effectively optimized on PIM through both software and hardware-level enhancements, significantly improving performance for memory-bound tasks and highlighting the potential of PIM as a scalable solution for data-intensive applications.

ACKNOWLEDGMENTS

This work was partly supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the Artificial Intelligence Semiconductor Support Program to nurture the best talents (IITP-(2025)-RS-2023-00253914), funded by the Korea government (MSIT), and partly supported by the IITP grant funded by the Korea government (MSIT) (No. 2022-0-00441, Memory-Centric Architecture Using the Reconfigurable PIM Devices). The EDA tool was supported by the IC Design Education Center (IDEC), Korea.

REFERENCES

1 
Kim H., Kim T., Park T., Kim D., Yu Y., Kim H., Park Y., 2025, Accelerating LLMs using an efficient GEMM library and target-aware optimizations on real-world PIM devices, Proc. of CGO ’25: 23rd ACM/IEEE International Symposium on Code Generation and OptimizationDOI
2 
Lewis P., Perez E., Piktus A., Petroni F., Karpukhin V., Goyal N., Küttler H., Lewis M., Yih W., Rocktäschel T., Riedel S., Kiela D., 2020, Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances in Neural Information Processing Systems (NeurIPS), Vol. 33, pp. 9459-9474DOI
3 
Johnson J., Douze M., Jégou H., 2021, Billion-scale similarity search with GPUs, IEEE Transactions on Big Data, Vol. 7, No. 3, pp. 535-547DOI
4 
Jang J.-H., Shin J., 2023, In-depth survey of processing-in-memory architectures for deep neural networks, Journal of Semiconductor Technology and Science, Vol. 23, No. 5, pp. 322-339DOI
5 
Cho S., 2022, Volatile and nonvolatile memory devices for neuromorphic and processing-in-memory applications, Journal of Semiconductor Technology and Science, Vol. 22, No. 1, pp. 30-46DOI
6 
Lee J., Cha M., 2023, Charge trap flash structure with feedback field effect transistor for processing in memory, Journal of Semiconductor Technology and Science, Vol. 23, No. 5, pp. 295-302DOI
7 
Asifuzzaman K., Miniskar N. R., Young A. R., Liu F., Vetter J. S., 2023, A survey on processing-in-memory techniques: Advances and challenges, Memories - Materials, Devices, Circuits and Systems, Vol. 4, pp. 100022DOI
8 
Hu H., Wang W.-C., Chang Y.-H., Lee Y.-C., Lin B.-R., Wang H.-M., 2022, ICE: An intelligent cognition engine with 3D NAND-based in-memory computing for vector similarity search acceleration, Proc. of 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)DOI
9 
Verma V., Stan M. R., 2022, AI-PiM-extending the RISC-V processor with processing-in-memory functional units for AI inference at the edge of IoT, Frontiers in Electronics, Vol. 3DOI
10 
Choi H., Kim G., Shin W., Won J., Kim C., Joo H., An B., Shin G., Yun J. D., 2024, AiMX: Accelerator-in-memory based accelerator for cost-effective large language model inference, Proc. of 2024 IEEE International Electron Devices Meeting (IEDM), pp. 1-4DOI
11 
Kwon Y., Vladimir K., Kim N., Shin W., Won J., Lee M., Joo H., Choi H., Kim G., An B., 2022, System architecture and software stack for GDDR6-AiM, Proc. of 2022 IEEE Hot Chips 34 Symposium (HCS), pp. 1-25DOI
12 
Gómez-Luna J., Guo Y., Brocard S., Legriel J., Cimadomo R., Oliveira G. F., Singh G., Mutlu O., 2022, An experimental evaluation of machine learning training on a real processing-in-memory system, arXiv preprint arXiv:2207.07886DOI
13 
Ortega C., Falevoz Y., Ayrignac R., 2024, PIM-AI: A novel architecture for high-efficiency LLM inference, arXiv preprint arXiv:2411.17309DOI
14 
Falevoz Y., Legriel J., 2023, Energy efficiency impact of processing in memory: A comprehensive review of workloads on the upmem architecture, Proc. of European Conference on Parallel Processing, Springer, pp. 155-166DOI
15 
Lee S., Kang S., Lee J., Kim H., Kim E., Seo S., 2021, Hardware architecture and software stack for PIM based on commercial DRAM technology, Proc. of the 48th Annual International Symposium on Computer Architecture (ISCA)DOI
16 
Kwon Y.-C., Lee S. H., Lee J., Kwon S.-H., Ryu J. M., Son J.-P., 2021, 25.4 A 20nm 6GB function-in-memory DRAM, based on HBM2 with a 1.2TFLOPS programmable computing unit using bank-level parallelism, for machine learning applications, Proc. of IEEE International Solid-State Circuits Conference (ISSCC)DOI
17 
Jegou H., Douze M., Schmid C., 2010, Product quantization for nearest neighbor search, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 33, No. 1, pp. 117-128DOI
18 
Chen J., Gómez-Luna J., Hajj I. E., Guo Y., Mutlu O., 2023, Simplepim: A software framework for productive and efficient processing-in-memory, Proc. of International Conference on Parallel Architectures and Compilation Techniques (PACT)DOI
19 
Noh S. U., Hong J., Lim C., Park S., Kim J., Kim H., Kim Y., Lee J., 2024, PID-Comm: A fast and flexible collective communication framework for commodity processing-in-DIMM devices, Proc. of ACM/IEEE International Symposium on Computer Architecture (ISCA)DOI
20 
Item M., Gómez-Luna J., Guo Y., Oliveira G. F., Sadrosadati M., Mutlu O., 2023, TransPimLib: A library for efficient transcendental functions on processing-in-memory systems, Proc. of IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)DOI
21 
Ahn J., Yoo S., Mutlu O., Choi K., 2015, PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture, Proc. of ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), pp. 336-348DOI
22 
Kim C. H., Lee W. J., Paik Y., Kwon K., Kim S. Y., Park I., Kim S. W., 2022, Silent-PIM: Realizing the processing-in-memory computing with standard memory requests, IEEE Transactions on Parallel and Distributed Systems, Vol. 33, No. 2, pp. 251-262DOI
23 
Ryu S., 2024, Resource analysis on FPGA for functional verification of digital SRAM PIM, Journal of Semiconductor Technology and Science, Vol. 24, No. 3, pp. 218-225DOI
24 
Karunamurthy P., Alhady S. S. N., Wahab A. A. A., Othman W. A. F. W., 2022, Integration of Gem5 and Dramsim2 for DDR4 simulation, International Journal of Advanced Trends in Computer Science and Engineering, Vol. 9, No. 1, pp. 698-793DOI
25 
Christ D., Steiner L., Jung M., Wehn N., 2024, PIMSys: A virtual prototype for processing in memory, Proc. of International Symposium on Memory Systems, pp. 26-33DOI
Nahyeon Kim
../../Resources/ieie/JSTS.2025.25.6.662/au1.png

Nahyeon Kim received her B.S. degree in electronic and electrical engineering from Ewha Womans University, Seoul, South Korea, in 2024. She is currently pursuing an M.S. degree from the Digital System Architecture Laboratory, Hanyang University. Her current research interests include memory testing and SoC design.

Sujin Kim
../../Resources/ieie/JSTS.2025.25.6.662/au2.png

Sujin Kim received her B.S. degree in electronic and electrical engineering from Ewha Womans University, Seoul, South Korea, in 2019. She is currently pursuing a Ph.D. degree at the same university. Her research interests include domain-specific accelerator architectures, with a focus on vector search, large language model inference, and memory-centric computing.

Min Jung
../../Resources/ieie/JSTS.2025.25.6.662/au3.png

Min Jung received her B.S. degree from the Department of Electronic and Electric Engineering from Ewha Womans University in 2025. She is currently pursuing an M.S. degree from the Digital System Architecture Laboratory, Hanyang University. Her research interests include RISC-V processor and compression/decompression IP design.

Haechannuri Noh
../../Resources/ieie/JSTS.2025.25.6.662/au4.png

Haechannuri Noh received her B.S. degree from the Department of Electronic and Electric Engineering from Ewha Womans University in 2025. Her research interests include RISC-V processor and SoC design.

Ji-Hoon Kim
../../Resources/ieie/JSTS.2025.25.6.662/au5.png

Ji-Hoon Kim received his B.S. (summa cum laude) and Ph.D. degrees in Electrical Engineering and Computer Science from KAIST, Daejeon, South Korea, in 2004 and 2009, respectively. In 2009, he joined Samsung Electronics, Suwon, South Korea, as a Senior Engineer, and worked on next-generation architecture for 4G communication modem system-on-chip (SoC). From 2018 to 2025, he was a professor in the Department of Electronic and Electrical Engineering, Ewha Womans University, Seoul, South Korea. Since 2025, he has been with the Department of Electronic Engineering, Hanyang University, Seoul, South Korea. His current research interests include CPU microarchitecture, domain-specific SoC, and deep neural network accelerators. Dr. Kim served on the Technical Program Committee and Organizing Committee for various international conferences, including the IEEE International Conference on Computer Design (ICCD), the IEEE Asian Solid-State Circuits Conference (A-SSCC), and the IEEE International Solid-State Circuits Conference (ISSCC). He was a co-recipient of the Distinguished Design Award at the 2019 IEEE A-SSCC, and a recipient of the Best Design Award at 2007 Dongbu HiTek IP Design Contest, the First Place Award at 2008 International SoC Design Conference (ISOCC) Chip Design Contest, and the IEEE/IEIE Joint Award for Young Scientist and Engineer. He also serves as an Associate Editor for the IEEE Transactions on Circuits and System-II: Express Briefs.