Hardware-software Co-design for Vector Similarity Search on HBM-PIM
Nahyeon Kim1,*
Sujin Kim2,*
Min Jung1
Haechannuri Noh2
Ji-Hoon Kim1
1 (Department of Electronic Engineering, Hanyang University, 222, Wangsimni-ro, Seongdong-gu,
Seoul, Republic of Korea)
2 (Division of Electronic & Semiconductor Engineering, Ewha Womans University, 52, Ewhayeodae-gil,
Seodaemun-gu, Seoul, Republic of Korea)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Index terms
Processing-in-memory (PIM), retrieval-augmented generation (RAG), vector similarity search, distance computation, instruction set extension, hardware-software co-design, PIM simulator
I. INTRODUCTION
Large Language Models (LLMs), such as GPT-4, have driven remarkable advances in natural
language processing (NLP). Despite their capabilities, LLMs still face fundamental
challenges, including hallucination and a lack of long-term memory [1]. Retrieval-Augmented Generation (RAG) [2] has been proposed to address these limitations by enabling models to dynamically
retrieve relevant external information. A key enabler of RAG is the vector similarity
search [3], which retrieves the most relevant data by computing similarities between query vectors
and large-scale vector databases. Vector similarity search involves intensive memory
access, typically implemented as Level 2 BLAS (matrix-vector) operations. As datasets
grow larger, these computations become memory-bound, with performance limited by memory
bandwidth. Processing-In-Memory (PIM) architectures, which integrate compute capabilities
directly within memory [4-7], offer a promising solution by reducing data transfers between the memory and the
processor, thereby alleviating the memory bandwidth bottleneck.
In this work, we address these challenges through a holistic approach that spans both
the software and hardware domains. We initially implemented vector similarity search
applications and developed vector distance calculation libraries supporting Euclidean
distance and cosine similarity. These libraries directly access PIM memory and calculate
distances, reducing data transfer latency between memory and host.
While software-level optimization yielded some improvements, our experiments identified
significant performance bottlenecks in distance computations. Although distance operations
could be implemented using the existing PIM instructions [8], this method proved highly inefficient because it required frequent data movement
and intermediate buffering between memory banks and the PIM processing unit, thereby
negating the inherent advantages of in-memory computation. To overcome these constraints,
we extended our work to the hardware level by modifying the PIM datapath and introducing
two new instructions [9] optimized for distance computations: one targeting Euclidean distance and another
supporting Manhattan distance. Fig. 1 shows the overall system architecture, including the distance computation library
and the PIM-based hardware with custom instructions. This combined software-hardware
optimization resulted in substantial performance improvements, as demonstrated through
implementation and validation on an FPGA-based PIM platform and PIM simulator.
Fig. 1. Overview of the distance computation PIM platform with custom instructions
for vector similarity.
The remainder of this paper is organized as follows. Section II introduces the background
of distance computation and related PIM architectures including the baseline HBM-PIM
architecture. Section III describes the software and hardware implementation, including
the proposed instruction set extensions. Section IV presents evaluation results from
both DRAM-based and HBM-based PIM platforms, as well as area analysis. Finally, the
paper is concluded in Section V.
II. BACKGROUND
1. Operation Description for Distance Calculation
Distance calculation is a fundamental operation in vector similarity search, used
to quantify the similarity or dissimilarity between vectors in high-dimensional spaces.
This work focuses on three commonly used distance metrics: Manhattan distance, Euclidean
distance, and cosine similarity.
Manhattan distance, also known as the L1 distance, measures the sum of the absolute
differences between the corresponding elements of vectors p and q. It is defined as

$d_{L1}(p, q) = \sum_{i=1}^{n} |p_i - q_i|$.   (1)
Euclidean distance, also referred to as L2 distance, represents the straight-line
distance between two points p and q in Euclidean space. It is calculated as

$d_{L2}(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$.   (2)
Cosine similarity measures the cosine of the angle θ between two vectors p and q,
indicating their directional alignment regardless of magnitude. It is computed as

$\cos\theta = \dfrac{p \cdot q}{\|p\| \, \|q\|}$,   (3)
where p and q are vectors, $p \cdot q$ is the dot product, and $\|p\|$, $\|q\|$ are
their norms. A value closer to 1 implies a higher similarity.
Euclidean distance measures the separation between two vectors, but vector search
typically omits the square root, since only relative distances matter for ranking.
Likewise, most datasets intended for cosine-similarity search store vectors already
normalized to unit length, so we implemented the cosine similarity library as a
plain dot product.
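The three metrics, together with the square-root omission and the unit-norm simplification described above, can be sketched as follows. This is a minimal NumPy illustration, not the PIM library's actual implementation.

```python
import numpy as np

def manhattan(p, q):
    # L1 distance: sum of absolute element-wise differences
    return float(np.sum(np.abs(p - q)))

def squared_euclidean(p, q):
    # L2 distance with the square root omitted: sqrt is monotonic,
    # so the neighbor ranking is unchanged
    d = p - q
    return float(np.dot(d, d))

def cosine_similarity(p, q):
    # For unit-normalized vectors this reduces to the dot product
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))
```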
2. Related PIM Architectures
Recent years have witnessed the emergence of diverse DRAM-based Processing-in-Memory
(PIM) architectures. SK hynix AiM extends the HBM/GDDR interface with all-bank parallelism
and in-memory GEMV support, targeting matrix-vector multiplication workloads [10,11]. UPMEM’s DIMM-PIM integrates lightweight processing units (DPUs) inside DDR4 DIMMs,
offering flexible near-memory acceleration for a wide range of tasks [12-14]. While these efforts demonstrate the broad potential of DRAM-PIM, our focus differs:
distance computation in vector similarity search requires repeated execution of simple
arithmetic (ADD, MUL). For this purpose, Samsung’s HBM-PIM, which natively exposes
such primitive operations, provides a more direct and efficient baseline. We therefore
adopt HBM-PIM to align with architectures recently offered by leading memory vendors
while ensuring suitability for distance-calculation workloads [15,16]. Based on this rationale, the next subsection describes the baseline Samsung HBM-PIM
architecture that serves as the foundation of our work.
3. Baseline HBM-PIM Architecture
Fig. 2 shows an overview of the baseline Samsung PIM architecture [15,16]. The architecture consists of Even and Odd banks that share a single PIM unit, so
they cannot perform operations simultaneously. The PIM execution unit is composed
of a Floating-Point Unit (FPU), Command Register Files (CRF), General Register Files
(GRF), and Scalar Register Files (SRF). The architecture supports three modes: single
bank mode, all bank mode, and all bank-PIM mode. In single bank mode, it operates
in standard DRAM mode, performing conventional DRAM commands. In all-bank mode, it
can access multiple banks simultaneously, while in all-bank PIM mode, it triggers
PIM commands and executes PIM operations. Currently, Samsung PIM architecture supports
nine instructions: four ALU instructions (ADD, MUL, MAC, MAD), two Data instructions
(MOV, FILL), and three Control instructions (NOP, JUMP, EXIT).
Fig. 2. Baseline PIM architecture [15,16].
As shown in Fig. 3, the programmable computing unit, referred to as the PIM unit, consists of an FPU,
CRF, GRF, and SRF. The FPU can execute 16 operations on 16-bit data simultaneously.
PIM instructions are stored in the CRF, so PIM commands are triggered as specified
by the CRF. The GRF is divided into GRF_A and GRF_B, which are connected to the Even
and Odd banks, respectively. Each GRF consists of eight 16-bit registers, allowing
storage of eight 16-bit elements. Lastly, the SRF is divided into SRF_A, for addition
operations, and SRF_M, for multiplication, with each section containing eight 16-bit
registers, similar to the GRF.
Fig. 3. The micro-architecture of programmable computing unit.
This work adopts the Samsung HBM-PIM architecture as the baseline to ensure alignment
with prior work and provide a fair reference point for performance evaluation. Consequently,
16-bit data types are used throughout the system. Although many public datasets are
originally provided in 32-bit floating point (FP32) format, we confirm that quantization
to 16-bit precision preserves accuracy in our target workloads [17]. Specifically, the SIFT dataset consists of image descriptors with integer values
ranging from 0 to 255, which are naturally robust to reduced precision and show no
degradation even under BF16 representation. For the GIST dataset, we empirically evaluated
vector search accuracy after BF16 quantization and observed no measurable drop compared
to FP32. Therefore, implementing the PIM unit with FP16 capability efficiently supports
both neural network operations and vector similarity search, while remaining practical
for real-world datasets.
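The BF16 quantization discussed above can be emulated in software by truncating the low 16 bits of the FP32 bit pattern (real hardware may instead round to nearest; this sketch uses simple truncation for illustration). Since SIFT descriptors are integers in [0, 255] and BF16 provides 8 significand bits, every such value is represented exactly.

```python
import numpy as np

def to_bf16(x):
    # BF16 keeps the FP32 sign, exponent, and top 7 stored mantissa bits;
    # masking off the low 16 bits of the FP32 word emulates the format.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# SIFT-like integer descriptors survive the quantization losslessly.
sift_like = np.arange(256, dtype=np.float32)
assert np.array_equal(to_bf16(sift_like), sift_like)
```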
III. IMPLEMENTATION
1. Software Implementation
The most commonly used distance metrics for calculating vector similarity are Euclidean distance,
cosine similarity, and dot product. However, in datasets designed for cosine similarity-based
vector search, the vectors are often normalized. Therefore, in this study, we considered
cosine similarity and dot product as one metric and developed libraries for PIM devices
to compute Euclidean distance and cosine similarity [18-20]. Since the goal is to compare relative distances rather than absolute ones, square
root operations can be omitted in Euclidean distance calculation.
Vector similarity search compares the query vector with database vectors and finds
nearest neighbor vectors. In this study, we implemented distance operation libraries
for the vector similarity search in PIM environments. First, the database vectors
and query vectors are written to memory regions referred to as srcA and srcB, which
are then mapped to the PIM address space. The PIM unit performs distance calculations
between these regions, and the results are stored in dstC. Finally, the host uses
dstC to classify nearest neighbors. Since these libraries access the datasets directly
in the PIM address space and compute the distances within the PIM unit, they avoid
data transfer latency between the memory and the host and require no additional memory
space. Fig. 4 shows the resulting distance computation library with Euclidean distance
and cosine similarity, used for vector similarity search.
Fig. 4. Distance computation library for Euclidean distance and cosine similarity.
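The host-side flow described above can be sketched as follows. The names srcA, srcB, dstC mirror the regions named in the text, but pim_distance and nearest_neighbors are hypothetical stand-ins, not the library's real API; the in-memory computation is modeled here with plain NumPy.

```python
import numpy as np

def pim_distance(srcA, srcB):
    # Stand-in for the in-memory computation: squared L2 distance
    # between every database vector in srcA and the query in srcB.
    diff = srcA - srcB
    return np.sum(diff * diff, axis=1)

def nearest_neighbors(database, query, k):
    srcA = np.ascontiguousarray(database, dtype=np.float16)  # PIM region A
    srcB = np.ascontiguousarray(query, dtype=np.float16)     # PIM region B
    dstC = pim_distance(srcA, srcB)                          # PIM result region
    return np.argsort(dstC)[:k]                              # host-side ranking
```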
2. System Architecture
We initially implemented the distance calculation at the software level. However,
experimental results revealed performance bottlenecks due to the limitations of the
existing PIM instruction set. Therefore, we performed hardware-level optimization
by modifying the datapath and extending the instruction set [21]. Accordingly, we implemented the Euclidean distance computation, and through the
datapath modification, we found that it was also possible to optimize the Manhattan
distance computation. As a result, we implemented two instructions to support both
Euclidean and Manhattan distance computations. Since cosine similarity is already
efficiently supported by the existing MAC instruction, no additional optimization
was necessary. The proposed Instruction Set Architecture is shown in Table 1.
Table 1. The instruction set of the proposed PIM architecture.
| Type | Command | Result | Operand (src0) | Operand (src1) |
|---|---|---|---|---|
| Control | NOP | - | - | - |
| Control | JUMP | - | - | - |
| Control | EXIT | - | - | - |
| Data | FILL | GRF, BANK | GRF, BANK | - |
| Data | MOV | GRF, SRF | GRF, BANK | - |
| Arithmetic | ADD | GRF | GRF, BANK, SRF | GRF, BANK, SRF |
| Arithmetic | MUL | GRF | GRF, BANK, SRF | GRF, BANK, SRF |
| Arithmetic | MAC | GRF_B | GRF, BANK | GRF, BANK, SRF |
| Arithmetic | MAD | GRF | GRF, BANK | GRF, BANK, SRF |
| Arithmetic | AMC | GRF_B | GRF, BANK, SRF | GRF, BANK, SRF |
| Arithmetic | MAN | GRF_B | GRF, BANK | GRF, BANK, SRF |
As shown in Fig. 5, we customized the instruction format by adding a 1-bit 'm' field that determines
whether to take the absolute value. Architecturally, an absolute value unit is needed in the
FPU_ADD unit. The absolute value unit functions by changing the most significant bit
(MSB) to 0. When the 'MAN' operation is triggered, the subtraction between the two
vectors is calculated, the absolute value is taken, and the result is accumulated
in the GRF.
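The absolute value unit's sign-bit trick has a direct software analogue: clearing the MSB of the 16-bit floating-point word yields |x| without any arithmetic. The sketch below emulates it on FP16 bit patterns (illustrative only; the hardware operates on the FPU's internal 16-bit words).

```python
import numpy as np

def fp16_abs(x):
    # Reinterpret the FP16 words as raw 16-bit integers, clear the
    # sign bit (MSB), and reinterpret back as FP16.
    bits = np.asarray(x, dtype=np.float16).view(np.uint16)
    return (bits & np.uint16(0x7FFF)).view(np.float16)

v = np.array([-3.5, 2.0, -0.0], dtype=np.float16)
assert np.array_equal(fp16_abs(v), np.array([3.5, 2.0, 0.0], dtype=np.float16))
```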
As shown in Eq. (2), the L2 operation calculates the subtraction between the two vectors, squares that
value, and then accumulates the result. However, since the result needed to be stored
in the GRF after the ADD, only 8 elements could be stored at a time, requiring frequent
data transfers to the banks. To address this, an operation that performs the multiplication
immediately after the addition, without the need to store the intermediate result,
was proposed. This led to a new instruction, "AMC" (Add-Multiply-Accumulate), which
resembles MAC (Multiply-Accumulate) but with the order of operations reversed (ADD
before MUL). To implement this, a new path connecting the FPU_ADD to the FPU_MUL is required.
By implementing this, the operation could be performed without transferring data between
the banks.
Fig. 5. Instruction format to support multiple similarity metrics.
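The benefit of the AMC fusion can be modeled in a few lines. With the baseline ISA, the subtraction result must be written back to the 8-element GRF before it can be squared, while AMC forwards it from FPU_ADD straight into FPU_MUL. The spill counting below is an illustrative simplification, not a cycle-accurate model of the hardware.

```python
import numpy as np

GRF_WIDTH = 8  # each GRF holds eight 16-bit elements

def l2_existing_isa(p, q):
    spills, acc = 0, 0.0
    for i in range(0, len(p), GRF_WIDTH):
        d = p[i:i + GRF_WIDTH] - q[i:i + GRF_WIDTH]  # ADD -> stored in GRF
        spills += 1                                   # intermediate write-back
        acc += float(np.sum(d * d))                   # MUL + accumulate
    return acc, spills

def l2_with_amc(p, q):
    # Fused subtract-square-accumulate: no intermediate storage needed.
    d = p - q
    return float(np.sum(d * d)), 0
```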
IV. EVALUATIONS
This study evaluated the proposed vector similarity search system across two different
environments, reflecting the progression of the research from application-level implementation
to instruction-level optimization. Initially, the vector similarity search application
was developed and evaluated on a DRAM-based PIM system. However, we identified that
instruction-level improvements were difficult to achieve on the DRAM-based platform,
leading us to transition the evaluation environment to an HBM-based PIM simulator
to analyze the effectiveness of instruction-level enhancements.
1. Application-level Evaluation on DRAM-based PIM System
The initial evaluation was conducted on a VMK180 board, configured with LPDDR4 as
system memory and two GDDR6 modules as PIM memory [22]. A dual-core ARM Cortex-A53 processor from Zynq was used, with only a single core
utilized during the experiments. The vector similarity search application was implemented
using a brute-force algorithm under the K-Nearest Neighbor (KNN) search category,
and the proposed distance computation library was evaluated within this framework.
The experiments employed random datasets of various sizes. Figs. 6 and 7 present the results of brute-force search using Euclidean distance and cosine similarity,
respectively, on both CPU and PIM environments. In the brute-force search, distances
between the query vector and all vectors in the database were computed, and the top
100 closest neighbors were returned after sorting.
The results showed that the PIM outperformed the CPU in both Euclidean distance and
cosine similarity computation. Specifically, for Euclidean distance, the PIM achieved
an average speed improvement of 44.2% over the CPU, while for cosine similarity, the
improvement was 59.0%. The performance gap between PIM and CPU increased as the dataset
size grew, with cosine similarity exhibiting a relatively higher improvement rate
compared to Euclidean distance.
Fig. 6. Computing time with log scale for brute-force with Euclidean distance.
Fig. 7. Computing time with log scale for brute-force with cosine similarity.
2. Instruction-level Optimization on HBM-based PIM Simulator
Although application-level implementation and evaluation were performed on the DRAM-based
PIM system, it exposed inherent limitations [23] of the existing instruction set, which was not optimized for vector distance computations.
While it was technically feasible to implement distance computations using the existing
instructions, frequent data movement and intermediate storage across memory banks
resulted in performance bottlenecks, preventing full utilization of PIM’s inherent
performance benefits. To address these challenges, we transitioned the evaluation
platform to an HBM-based PIM simulator environment [24,25], which provided greater flexibility for instruction set design and datapath modification.
The HBM-based PIM simulator was configured by integrating DRAMSim2, and the performance
of the proposed AMC and MAN instructions (optimized for Euclidean distance and Manhattan
distance computations, respectively) was evaluated. The simulator architecture is
illustrated in Fig. 8. The host processor generated transactions required for PIM operations, which were
executed by the PIM unit within the simulator. DRAMSim2 measured the number of cycles
required for each transaction, and upon completion of the simulation, both the computation
results and cycle counts were returned to the host.
Fig. 8. The system organized with PIM simulator with DRAMSim2.
As shown in Table 2, the total cycle count for Euclidean distance computation using the AMC instruction
was reduced by up to 44% compared to using the existing instruction. Additionally,
Fig. 9 shows that the performance improvement increased as the input matrix size grew. Notably,
for matrix sizes of 4096 × 4096 and above, the performance of the Manhattan distance
computation using the MAN instruction in the PIM-enabled state surpassed that of the
PIM-disabled state, as shown in Table 3.
Table 2. Total cycle count when using the baseline instruction and AMC.
| Matrix size | L2_existing | L2_AMC | Cycles reduced (%) |
|---|---|---|---|
| 256 × 256 | 3471 | 1937 | 44.19 |
| 512 × 512 | 5844 | 3800 | 34.98 |
| 1024 × 1024 | 11323 | 7209 | 36.33 |
| 2048 × 2048 | 21544 | 13690 | 36.46 |
| 4096 × 4096 | 42738 | 27321 | 36.07 |
| 8192 × 8192 | 168331 | 108028 | 35.82 |
| 16384 × 16384 | 671081 | 429884 | 35.94 |
Fig. 9. Performance gain when using the baseline instruction and AMC.
Table 3. Total cycle count of MAN instruction executed on CPU and PIM.
| Matrix size | PIM_disabled | PIM_enabled |
|---|---|---|
| 256 × 256 | 194 | 1937 |
| 512 × 512 | 578 | 3800 |
| 1024 × 1024 | 2494 | 7209 |
| 2048 × 2048 | 9034 | 13690 |
| 4096 × 4096 | 36082 | 27321 |
| 8192 × 8192 | 143276 | 108028 |
| 16384 × 16384 | 570801 | 429884 |
This progression from application-level evaluation on a DRAM-based platform to instruction-level
optimization on an HBM-based platform demonstrates that architectural improvements
are necessary to fully exploit the potential of Processing-In-Memory for workloads
such as distance computations.
To assess the implementability of the proposed PIM datapath, we conducted an area
comparison using a commercial digital library. Both the baseline HBM-PIM datapath
and our proposed datapath were implemented at the RTL level and synthesized using
the TSMC 65nm standard cell library. The synthesis results show that the proposed
design incurs less than 5% area overhead compared to the baseline. This minimal overhead
suggests that the proposed datapath is practically implementable without significant
hardware cost.
V. CONCLUSIONS
In this paper, we developed distance calculation libraries for Euclidean distance
and cosine similarity in a PIM environment and evaluated their performance during
vector search. For Euclidean distance, PIM achieved up to 46.2% reduction in computation
time compared to the CPU, while for cosine similarity, up to 60.6% of computation
time was reduced. Although PIM demonstrated better application-level performance,
we identified a performance bottleneck caused by limitations in the existing PIM instruction
set. To address this, we introduced the AMC and MAN instructions through datapath
modification and extension. The AMC instruction reduced cycle count by up to 44%,
and the MAN instruction outperformed the PIM-disabled baseline for input sizes larger
than 4096 × 4096 elements.
These developments demonstrate that vector search can be effectively optimized on
PIM through both software and hardware-level enhancements, significantly improving
performance for memory-bound tasks and highlighting the potential of PIM as a scalable
solution for data-intensive applications.
ACKNOWLEDGMENTS
This work was partly supported by the Institute of Information & Communications Technology
Planning & Evaluation (IITP) under the Artificial Intelligence Semiconductor Support
Program to nurture the best talents (IITP-(2025)-RS-2023-00253914), funded by the
Korea government (MSIT), and partly supported by the IITP grant funded by the Korea
government (MSIT) (No. 2022-0-00441, Memory-Centric Architecture Using the Reconfigurable
PIM Devices). The EDA tool was supported by the IC Design Education Center (IDEC),
Korea.
REFERENCES
Kim H., Kim T., Park T., Kim D., Yu Y., Kim H., Park Y., 2025, Accelerating LLMs using
an efficient GEMM library and target-aware optimizations on real-world PIM devices,
Proc. of CGO ’25: 23rd ACM/IEEE International Symposium on Code Generation and Optimization

Lewis P., Perez E., Piktus A., Petroni F., Karpukhin V., Goyal N., Küttler H., Lewis
M., Yih W., Rocktäschel T., Riedel S., Kiela D., 2020, Retrieval-augmented generation
for knowledge-intensive NLP tasks, Advances in Neural Information Processing Systems
(NeurIPS), Vol. 33, pp. 9459-9474

Johnson J., Douze M., Jégou H., 2021, Billion-scale similarity search with GPUs, IEEE
Transactions on Big Data, Vol. 7, No. 3, pp. 535-547

Jang J.-H., Shin J., 2023, In-depth survey of processing-in-memory architectures for
deep neural networks, Journal of Semiconductor Technology and Science, Vol. 23, No.
5, pp. 322-339

Cho S., 2022, Volatile and nonvolatile memory devices for neuromorphic and processing-in-memory
applications, Journal of Semiconductor Technology and Science, Vol. 22, No. 1, pp.
30-46

Lee J., Cha M., 2023, Charge trap flash structure with feedback field effect transistor
for processing in memory, Journal of Semiconductor Technology and Science, Vol. 23,
No. 5, pp. 295-302

Asifuzzaman K., Miniskar N. R., Young A. R., Liu F., Vetter J. S., 2023, A survey
on processing-in-memory techniques: Advances and challenges, Memories - Materials,
Devices, Circuits and Systems, Vol. 4, pp. 100022

Hu H., Wang W.-C., Chang Y.-H., Lee Y.-C., Lin B.-R., Wang H.-M., 2022, ICE: An intelligent
cognition engine with 3D NAND-based in-memory computing for vector similarity search
acceleration, Proc. of 55th IEEE/ACM International Symposium on Microarchitecture
(MICRO)

Verma V., Stan M. R., 2022, AI-PiM-extending the RISC-V processor with processing-in-memory
functional units for AI inference at the edge of IoT, Frontiers in Electronics, Vol.
3

Choi H., Kim G., Shin W., Won J., Kim C., Joo H., An B., Shin G., Yun J. D., 2024,
AiMX: Accelerator-in-memory based accelerator for cost-effective large language model
inference, Proc. of 2024 IEEE International Electron Devices Meeting (IEDM), pp. 1-4

Kwon Y., Vladimir K., Kim N., Shin W., Won J., Lee M., Joo H., Choi H., Kim G., An
B., 2022, System architecture and software stack for GDDR6-AiM, Proc. of 2022 IEEE
Hot Chips 34 Symposium (HCS), pp. 1-25

Gómez-Luna J., Guo Y., Brocard S., Legriel J., Cimadomo R., Oliveira G. F., Singh
G., Mutlu O., 2022, An experimental evaluation of machine learning training on a real
processing-in-memory system, arXiv preprint arXiv:2207.07886

Ortega C., Falevoz Y., Ayrignac R., 2024, PIM-AI: A novel architecture for high-efficiency
LLM inference, arXiv preprint arXiv:2411.17309

Falevoz Y., Legriel J., 2023, Energy efficiency impact of processing in memory: A
comprehensive review of workloads on the UPMEM architecture, Proc. of European Conference
on Parallel Processing, Springer, pp. 155-166

Lee S., Kang S., Lee J., Kim H., Kim E., Seo S., 2021, Hardware architecture and software
stack for PIM based on commercial DRAM technology, Proc. of the 48th Annual International
Symposium on Computer Architecture (ISCA)

Kwon Y.-C., Lee S. H., Lee J., Kwon S.-H., Ryu J. M., Son J.-P., 2021, 25.4 A 20nm
6GB function-in-memory DRAM, based on HBM2 with a 1.2TFLOPS programmable computing
unit using bank-level parallelism, for machine learning applications, Proc. of IEEE
International Solid-State Circuits Conference (ISSCC)

Jegou H., Douze M., Schmid C., 2010, Product quantization for nearest neighbor search,
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 33, No. 1, pp.
117-128

Chen J., Gómez-Luna J., Hajj I. E., Guo Y., Mutlu O., 2023, SimplePIM: A software
framework for productive and efficient processing-in-memory, Proc. of International
Conference on Parallel Architectures and Compilation Techniques (PACT)

Noh S. U., Hong J., Lim C., Park S., Kim J., Kim H., Kim Y., Lee J., 2024, PID-Comm:
A fast and flexible collective communication framework for commodity processing-in-DIMM
devices, Proc. of ACM/IEEE International Symposium on Computer Architecture (ISCA)

Item M., Gómez-Luna J., Guo Y., Oliveira G. F., Sadrosadati M., Mutlu O., 2023, TransPimLib:
A library for efficient transcendental functions on processing-in-memory systems,
Proc. of IEEE International Symposium on Performance Analysis of Systems and Software
(ISPASS)

Ahn J., Yoo S., Mutlu O., Choi K., 2015, PIM-enabled instructions: A low-overhead,
locality-aware processing-in-memory architecture, Proc. of ACM/IEEE Annual International
Symposium on Computer Architecture (ISCA), pp. 336-348

Kim C. H., Lee W. J., Paik Y., Kwon K., Kim S. Y., Park I., Kim S. W., 2022, Silent-PIM:
Realizing the processing-in-memory computing with standard memory requests, IEEE Transactions
on Parallel and Distributed Systems, Vol. 33, No. 2, pp. 251-262

Ryu S., 2024, Resource analysis on FPGA for functional verification of digital SRAM
PIM, Journal of Semiconductor Technology and Science, Vol. 24, No. 3, pp. 218-225

Karunamurthy P., Alhady S. S. N., Wahab A. A. A., Othman W. A. F. W., 2022, Integration
of gem5 and DRAMSim2 for DDR4 simulation, International Journal of Advanced Trends
in Computer Science and Engineering, Vol. 9, No. 1, pp. 698-793

Christ D., Steiner L., Jung M., Wehn N., 2024, PIMSys: A virtual prototype for processing
in memory, Proc. of International Symposium on Memory Systems, pp. 26-33

Nahyeon Kim received her B.S. degree in electronic and electrical engineering from
Ewha Womans University, Seoul, South Korea, in 2024. She is currently pursuing an
M.S. degree from the Digital System Architecture Laboratory, Hanyang University. Her
current research interests include memory testing and SoC design.
Sujin Kim received her B.S. degree in electronic and electrical engineering from Ewha
Womans University, Seoul, South Korea, in 2019. She is currently pursuing a Ph.D.
degree at the same university. Her research interests include domain-specific accelerator
architectures, with a focus on vector search, large language model inference, and
memory-centric computing.
Min Jung received her B.S. degree from the Department of Electronic and Electrical
Engineering, Ewha Womans University, in 2025. She is currently pursuing an M.S. degree from
the Digital System Architecture Laboratory, Hanyang University. Her research interests
include RISC-V processor and compression/decompression IP design.
Haechannuri Noh received her B.S. degree from the Department of Electronic and Electrical
Engineering, Ewha Womans University, in 2025. Her research interests include RISC-V
processor and SoC design.
Ji-Hoon Kim received his B.S. (summa cum laude) and Ph.D. degrees in Electrical Engineering
and Computer Science from KAIST, Daejeon, South Korea, in 2004 and 2009, respectively.
In 2009, he joined Samsung Electronics, Suwon, South Korea, as a Senior Engineer,
and worked on next-generation architecture for 4G communication modem system-on-chip
(SoC). From 2018 to 2025, he was a professor in the Department of Electronic and Electrical
Engineering, Ewha Womans University, Seoul, South Korea. Since 2025, he has been with
the Department of Electronic Engineering, Hanyang University, Seoul, South Korea.
His current research interests include CPU microarchitecture, domain-specific SoC,
and deep neural network accelerators. Dr. Kim served on the Technical Program Committee
and Organizing Committee for various international conferences, including the IEEE
International Conference on Computer Design (ICCD), the IEEE Asian Solid-State Circuits
Conference (A-SSCC), and the IEEE International Solid-State Circuits Conference (ISSCC).
He was a co-recipient of the Distinguished Design Award at the 2019 IEEE A-SSCC, and
a recipient of the Best Design Award at 2007 Dongbu HiTek IP Design Contest, the First
Place Award at 2008 International SoC Design Conference (ISOCC) Chip Design Contest,
and the IEEE/IEIE Joint Award for Young Scientist and Engineer. He also serves as
an Associate Editor for the IEEE Transactions on Circuits and System-II: Express Briefs.