Sunyoung Park1, Hannah Yang2, Hana Kim2, Hyunji Kim1, Ji-Hoon Kim2*
1 Department of Electronic and Electrical Engineering, Ewha Womans University, Seoul, Korea
2 Department of Electronic Engineering, Hanyang University, Seoul, Korea
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Index Terms
Neural networks, systolic array, functional safety, fault-tolerant, fault mitigation
I. Introduction
The systolic array, a foundational architecture for general matrix multiplication
(GEMM) acceleration, has been widely adopted in artificial intelligence (AI) applications
[1]. It consists of interconnected processing elements (PEs) that perform Multiply-Accumulate
(MAC) operations in parallel [2]. Due to its regular structure and high data reuse capability via localized inter-PE
communication, the systolic array architecture is especially well-suited for linear
algebra operations such as matrix multiplication and convolution. This has led to
its integration into numerous commercial hardware accelerators [3], particularly in AI systems.
However, the architectural characteristic of continuous data exchange between neighboring
PEs creates a vulnerability in fault scenarios. Specifically, when a fault occurs
in a single PE, its erroneous data may propagate to adjacent PEs, even those that
are otherwise fault-free, resulting in cascading errors and overall accuracy degradation.
In safety-critical applications such as autonomous driving systems and medical devices
[4,5,6], GEMM accelerators must be designed with robust fault mitigation mechanisms that
can ensure safe and reliable operation without excessive hardware overhead. As nanometer-scale
CMOS technologies advance, systems become increasingly susceptible to both permanent
and transient faults, making the incorporation of such mechanisms indispensable [7,8,9,10,11,12,13].
Although transient faults may cause sporadic computation errors, their impact on the
overall accuracy tends to be limited---even at relatively high fault rates. In contrast,
permanent faults continuously affect system behavior, requiring more deliberate and
targeted mitigation strategies. Various approaches have been proposed to address this
challenge. For example, Fault-Aware Pruning (FAP) and its retraining-based extension
(FAP+T) attempt to minimize accuracy loss by masking faulty weights [14,15]. However, these techniques do not account for the relative importance of individual
weights, often resulting in significant accuracy degradation and incurring high retraining
costs, which limit their practicality in real-time applications.
Redundancy-based methods, which use redundant PEs to replace faulty ones, have also been proposed [16,17]. However, these approaches suffer
from poor area efficiency. To address this issue, more area-efficient redundancy architectures
have been proposed, such as the one presented in [18]. Nevertheless, both types of solutions suffer from scalability issues: once all redundant
PEs are exhausted, the system can no longer recover from subsequent faults. Moreover,
such solutions are ill-suited for resource-constrained edge devices, where minimizing
area and power consumption is critical.
To overcome these limitations, we argue that severity-adaptive fault mitigation techniques
are essential. Such mechanisms activate only when a substantial degradation in accuracy
is anticipated, allowing for intelligent trade-offs between reliability and resource
usage.
In this paper, we propose two lightweight fault mitigation techniques specifically
targeting permanent faults in systolic array-based GEMM accelerators. Based on microarchitectural
fault characterization, we introduce:
· High-low bit swapping (HL-Swap) mechanism, which redirects data from faulty high-bit
registers to error-free low-bit registers, thereby mitigating accuracy degradation
caused by bit-level defects;
· Row-column deactivation (RC-Off) technique, which selectively disables specific
rows or columns in the systolic array that are identified to cause significant performance
degradation, based on fault severity analysis.
These two techniques are designed to maximize fault resilience while minimizing hardware
overhead, enabling robust GEMM acceleration even under harsh reliability constraints.
II. Proposed Fault-Tolerant GEMM Accelerator
In order to overcome the limitations of the prior works which promptly address faults
upon their occurrence, we propose two fault mitigation techniques which adaptively
operate according to the severity of the accuracy drop based on microarchitectural
fault analysis.
1. Microarchitectural Fault Analysis
To evaluate the impact of faults on accuracy degradation in output-stationary systolic
array architectures, we define $D_{\rm fault}$ as the distance from the array boundary (where
input data is initially supplied) to the processing element (PE) at which the
fault occurs, as illustrated in Fig. 1(c).
This parameter allows us to quantify the spatial propagation of faults within the
systolic array. Our microarchitectural analysis considers multiple fault-related parameters,
including the fault type, bit index, faulty PE rate (FPR), and the aforementioned
$D_{\rm fault}$.
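To make the role of $D_{\rm fault}$ concrete, the following is a minimal functional sketch, not the authors' simulator. It assumes unsigned 8-bit operands, models only the operand stream that travels left to right along a row, and takes $D_{\rm fault}$ to be the column index of the faulty PE. The names `stuck_at`, `systolic_matmul`, and `corrupted_outputs` are hypothetical.

```python
def stuck_at(value, bit_index, stuck_value):
    """Force one bit of an 8-bit value (bit_index 1 = LSB, 8 = MSB)."""
    mask = 1 << (bit_index - 1)
    return (value | mask) if stuck_value else (value & ~mask & 0xFF)


def systolic_matmul(a, b, fault=None):
    """Functional (not cycle-accurate) model of an output-stationary array.

    PE(i, j) accumulates c[i][j]; operands from `a` travel left to right
    along row i. `fault` = (row, col, bit_index, stuck_value) marks the
    register of one PE that latches the operand arriving from the left.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            operand = a[i][k]
            for j in range(n):
                if fault is not None and (i, j) == (fault[0], fault[1]):
                    # The corrupted value is used here and forwarded onward.
                    operand = stuck_at(operand, fault[2], fault[3])
                c[i][j] += operand * b[k][j]
    return c


def corrupted_outputs(n, d_fault, row=0):
    """Count outputs that differ from the fault-free result (MSB stuck-at-1)."""
    ones = [[1] * n for _ in range(n)]
    golden = systolic_matmul(ones, ones)
    faulty = systolic_matmul(ones, ones, fault=(row, d_fault, 8, 1))
    return sum(golden[i][j] != faulty[i][j] for i in range(n) for j in range(n))
```

Under these assumptions, a fault at distance `d_fault` in an `n`-wide row corrupts the `n - d_fault` PEs from the faulty one to the far edge, so a fault at the input boundary touches the whole row while a fault at the far boundary touches a single PE.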
The experimental setup for the fault analysis is as follows. We perform fault injection
on a $16 \times 16$ systolic array using the MNIST dataset [19]. Between 1 and 16 permanent bit faults are injected into the 8-bit registers
that receive data from neighboring PEs. Accordingly, the bit index spans from 1 (least
significant bit, LSB) to 8 (most significant bit, MSB). To simulate worst-case conditions
at the microarchitectural level, all fault parameters other than the one under evaluation
are fixed at their most adverse settings. This setup isolates the individual effect
of each fault parameter.
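The stuck-at fault model used in this kind of injection can be sketched as follows. This is an illustration only: it treats register contents as unsigned 8-bit values, which is an assumption, and `inject_stuck_at` and `mean_abs_error` are hypothetical helper names. The bit index follows the 1 (LSB) to 8 (MSB) convention above.

```python
def inject_stuck_at(value, bit_index, stuck_value):
    """Return the value read back from an 8-bit register with one stuck bit."""
    if not 1 <= bit_index <= 8:
        raise ValueError("bit_index must be in 1..8")
    mask = 1 << (bit_index - 1)
    value &= 0xFF
    # Stuck-at-1 forces the bit high; stuck-at-0 forces it low.
    return (value | mask) if stuck_value == 1 else (value & ~mask)


def mean_abs_error(bit_index, stuck_value):
    """Average absolute error over all 256 register values (uniform input)."""
    errors = [abs(inject_stuck_at(v, bit_index, stuck_value) - v)
              for v in range(256)]
    return sum(errors) / len(errors)
```

For uniformly distributed register contents, the average error of a single stuck bit doubles with each step in bit index, which is one way to see why the bit position matters so much in fixed-point arithmetic.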
We compute the accuracy degradation using the Mean Absolute Percentage Error (MAPE),
and we record both the maximum and minimum observed degradations along with their
corresponding parameter settings. As shown in Table 1, faults of the Stuck-at-1 type result in approximately 23% greater accuracy degradation
compared to Stuck-at-0 faults. This disparity may be attributed to the asymmetry in
value distributions within activation functions like ReLU, which tend to suppress
negative outputs and amplify positive ones.
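For reference, MAPE can be computed as in the sketch below. The text does not state which quantities are compared or how zero-valued references are handled; here the fault-free values serve as the reference and zero references are skipped, both of which are assumptions. The function name `mape` is illustrative.

```python
def mape(reference, observed):
    """Mean absolute percentage error of `observed` against `reference`."""
    # Assumption: entries whose reference is zero are skipped, since the
    # percentage error is undefined for them.
    pairs = [(r, o) for r, o in zip(reference, observed) if r != 0]
    if not pairs:
        raise ValueError("no non-zero reference values")
    return 100.0 * sum(abs((r - o) / r) for r, o in pairs) / len(pairs)
```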
Moreover, faults at the MSB level cause significantly greater degradation than those
at the LSB level---by up to 46.7%---highlighting the critical influence of bit significance
in fixed-point arithmetic. Similarly, increasing the number of injected faults from
1 to 16 results in a 21.5% rise in error rates, indicating the cumulative effect of
fault density. Spatially, faults closer to the input boundary ($D_{\rm fault} = 0$)
cause approximately 29.1% more degradation than those occurring at the farthest boundary
($D_{\rm fault} = 15$), which supports the hypothesis that early-stage faults propagate
more aggressively through the systolic flow.
Among the evaluated parameters, the bit index and $D_{\rm fault}$ emerged as the most
dominant contributors to accuracy degradation. Accordingly, we target these two
factors in our proposed mitigation techniques. The HL-Swap mechanism addresses bit-level
sensitivity by redirecting high-bit values to low-bit locations, while the RC-Off
technique mitigates spatial fault propagation by selectively disabling critical rows
or columns based on fault severity analysis.
Fig. 1. Motivation of fault-tolerant GEMM accelerator: (a) accuracy degradation due
to permanent faults and transient faults, (b) limitations of previous works, and (c)
accuracy drop with two major fault factors.
Table 1. Microarchitectural fault analysis: Impact of fault factors on accuracy drop
with MNIST.
2. HL-Swap: High-Low Bit Swapping
According to the fault analysis described earlier, computational accuracy can degrade
by up to 50.2% depending on the bit index where the fault occurs. In particular, faults
at higher bit positions have a significantly greater impact on the output compared
to lower bits, as higher bits carry more weight in fixed-point arithmetic. To address
this issue, we propose a fault mitigation technique called HL-Swap, which is designed
to reduce the impact of bit-index-dependent faults. HL-Swap operates by exchanging
the upper bits with the lower bits in a register when a fault is detected in the
upper half. More specifically, as illustrated in Fig. 2, when one of the two 8-bit registers receiving data from neighboring PEs encounters
a fault in its upper 4 bits, HL-Swap swaps them with the lower 4 bits within the same
register. After the swap, a shifter realigns the data path to ensure that the accumulation
process functions correctly despite the modified bit positions.
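The effect of the swap can be illustrated with a small behavioral sketch. It reflects one reading of the description rather than the actual RTL: the data nibbles are exchanged before being stored, the stuck bits then corrupt what is physically held in the upper half, and the realignment step is modeled as a second nibble exchange on read-out. The names `swap_nibbles` and `read_register`, and the mask-based fault encoding, are assumptions.

```python
def swap_nibbles(byte):
    """Exchange the upper and lower 4 bits of an 8-bit value."""
    return ((byte & 0x0F) << 4) | ((byte >> 4) & 0x0F)


def read_register(value, stuck_mask, stuck_bits, hl_swap):
    """Model a write followed by a read of one faulty 8-bit register.

    stuck_mask marks the stuck bit positions; stuck_bits gives the level
    each of those positions is stuck at.
    """
    stored = swap_nibbles(value) if hl_swap else (value & 0xFF)
    # Physical fault: stuck positions ignore whatever was written.
    stored = (stored & ~stuck_mask) | (stuck_bits & stuck_mask)
    # Assumed realignment: undo the exchange so downstream logic sees
    # the original bit order.
    return swap_nibbles(stored) if hl_swap else stored
```

With the value `0x35` and the MSB stuck at 1 (`stuck_mask = stuck_bits = 0x80`), reading without the swap returns `0xB5`, an error of 128, whereas reading with the swap returns `0x3D`, an error of 8. The fault is still present, but it now lands on a low-significance data bit.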
HL-Swap supports four operational modes depending on the location of the fault: No
Swap, Reg0 Swap, Reg1 Swap, and Reg0,1 Swap. The No Swap mode is selected when no
fault is present or when only the lower 4 bits are affected. Reg0 Swap is applied
when the upper 4 bits of Reg0 are faulty, while Reg1 Swap is used if the upper 4 bits
of Reg1 are affected. Reg0,1 Swap is activated when faults exist in the upper 4 bits
of both Reg0 and Reg1.
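The mode selection described above amounts to a two-bit decision, sketched below. The per-register fault masks and the name `select_mode` are hypothetical, and the case where both halves of a register are faulty is not covered by the text; this sketch swaps whenever the upper half is affected.

```python
UPPER_NIBBLE = 0xF0  # bit positions 5..8 of an 8-bit register


def select_mode(reg0_fault_mask, reg1_fault_mask):
    """Pick an HL-Swap mode from the faulty-bit masks of Reg0 and Reg1."""
    swap0 = bool(reg0_fault_mask & UPPER_NIBBLE)
    swap1 = bool(reg1_fault_mask & UPPER_NIBBLE)
    if swap0 and swap1:
        return "Reg0,1 Swap"
    if swap0:
        return "Reg0 Swap"
    if swap1:
        return "Reg1 Swap"
    # No fault at all, or faults confined to the lower 4 bits.
    return "No Swap"
```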
These swap modes enable adaptive fault handling based on the location of the fault,
and the mechanism provides effective mitigation without requiring redundant hardware.
By using only simple internal bit manipulation and a lightweight shifter, HL-Swap
offers a practical solution with minimal area overhead while preserving computational
accuracy.
Fig. 2. Proposed HL-Swap fault mitigation method operation scheme.
3. RC-Off: Location-Aware Row-Column Off with Scoring
As illustrated in Fig. 3, we propose a fault mitigation mechanism called RC-Off, which selectively deactivates
specific rows or columns in the systolic array based on their estimated contribution
to accuracy degradation. Unlike conventional redundancy-based methods that uniformly
replicate resources regardless of fault severity, RC-Off dynamically adapts its response
according to the microarchitectural location and fault characteristics. This allows
the system to avoid unnecessary overhead while effectively suppressing high-impact
faults. The method operates through three distinct phases: fault analysis, monitoring,
and adaptive handling.
In the first phase, an offline fault analysis is conducted to assess the correlation
between fault-induced degradations and their spatial and logical properties. Since
accuracy degradation is highly sensitive to both the AI model architecture and the
dataset in use, this phase begins by injecting faults into the systolic array under
controlled conditions. We simulate various fault types, bit indices, positions, and
quantities to build a comprehensive characterization. The results are aggregated into
a score table that captures how much each fault configuration impacts output accuracy.
This score table serves as a statistical foundation for runtime decision-making.
In the second phase, real-time monitoring is activated. Upon fault detection, each
row and column in the systolic array is assigned a fault severity score. This score
is computed based on the predefined score table and the attributes of the detected
faults, including their type, bit position, spatial distance from the array boundary,
and frequency. The scoring policy is defined as
In this equation, $n$ indexes each detected fault, $ft$ represents the fault
type (e.g., stuck-at-0, stuck-at-1), $bi$ is the bit index within the register,
$d$ corresponds to the fault distance $D_{\rm fault}$, and $num$ is the total
number of faults observed in the given row or column. Each term is weighted according
to its relative influence on output degradation, based on prior analysis.
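A weighted score of this kind could be organized as in the sketch below. It is only an illustration of the symbols defined above ($n$, $ft$, $bi$, $d$, $num$): the additive form, the weights, and every table entry are hypothetical placeholders, not values from the score table used in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    ft: str  # fault type: "sa0" (stuck-at-0) or "sa1" (stuck-at-1)
    bi: int  # bit index, 1 (LSB) to 8 (MSB)
    d: int   # D_fault, distance from the input boundary


# Hypothetical placeholders standing in for offline-characterized entries.
FT_SCORE = {"sa0": 0.5, "sa1": 1.0}
W_FT, W_BI, W_D, W_NUM = 1.0, 1.0, 1.0, 1.0


def severity_score(faults, array_size=16, bit_width=8):
    """Score one row or column from the faults detected in it."""
    num = len(faults)
    per_fault = sum(
        W_FT * FT_SCORE[f.ft]                       # fault type term
        + W_BI * f.bi / bit_width                   # higher bits weigh more
        + W_D * (array_size - f.d) / array_size     # nearer the input weighs more
        for f in faults
    )
    return per_fault + W_NUM * num                  # fault-count term
```

With these placeholder weights, a stuck-at-1 fault on the MSB at the input boundary scores higher than a stuck-at-0 fault on the LSB at the far boundary, matching the ordering reported in the fault analysis.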
In the final phase, once a row or column exceeds a predefined score threshold---indicating
that it significantly degrades system accuracy---it is selectively disabled. This
is implemented through a binary enable signal: a value of 1 maintains normal operation,
while 0 indicates that the row or column is excluded from subsequent computations.
The input data fed into the systolic array is dynamically reshaped to bypass these
disabled elements, ensuring consistent data flow and structural integrity. This design
ensures that computational resources are efficiently reallocated and fault impact
is minimized without requiring full redundancy or complex reconfiguration logic.
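The enable signal and the reshaping step can be sketched as follows. The mapping policy shown here (pack logical rows onto the enabled physical rows in order and defer the remainder to a later pass) is an assumption made for illustration; the names `enable_signals` and `map_rows` are hypothetical.

```python
def enable_signals(scores, threshold):
    """Return 1 (keep) or 0 (disable) per row or column."""
    # A line is disabled only when its score exceeds the threshold.
    return [0 if score > threshold else 1 for score in scores]


def map_rows(logical_rows, row_enable):
    """Assign logical data rows to enabled physical rows, skipping disabled ones.

    Returns (placement, leftover): placement maps a physical row index to
    its data, and leftover holds rows that must wait for another pass.
    """
    free = [p for p, enabled in enumerate(row_enable) if enabled == 1]
    placement = dict(zip(free, logical_rows))
    leftover = list(logical_rows[len(free):])
    return placement, leftover
```

For example, with scores `[0.2, 3.1, 0.4, 0.0]` and a threshold of `2.0`, the second row is disabled, three of four logical rows are placed on physical rows 0, 2, and 3, and the fourth is deferred. This makes the throughput cost of disabling a row explicit.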
Overall, RC-Off enables location-aware and model- sensitive fault mitigation, achieving
a favorable balance between reliability and resource efficiency. The simplicity of
its runtime enforcement makes it suitable for deployment in edge systems where area
and power constraints are stringent.
Fig. 3. Overall scheme of the proposed RC-Off fault mitigation method based on the
microarchitectural location of faults.
4. Fault-Tolerant GEMM Accelerator
The accelerator consists of systolic arrays, an SRAM buffer, a data reshaper, a score table,
and a fault monitor. It employs four $16\times 16$ systolic arrays for computation.
To mitigate the accuracy drop incurred by the most significant fault factor, bit index,
as shown in Fig. 1(c), HL-Swap is applied to convert faulty high-bit registers to fault-free low-bit
registers. Also, to support RC-Off, the proposed GEMM accelerator measures the accuracy
degradation rate based on the microarchitectural location and type of fault occurrence,
which is stored in the score table as depicted in Fig. 4. The fault monitor continuously computes severity scores for RC-Off and bypasses any row or column whose score exceeds a threshold value. This approach effectively mitigates
the impact of the propagation chain in the systolic array, in which a permanent fault
in one PE corrupts not only that PE but also the other PEs that receive its erroneous
data.
The SRAM buffer preloads data to enhance data retrieval efficiency. The data reshaper
adjusts systolic array data feeding according to RC-Off criteria to mitigate accuracy
drop. The score table updates fault-related information injected via fault injection
commands and accommodates changes in fault bit indices through HL-Swap. The fault
monitor computes fault severity based on the score table.
Fig. 4. Overall architecture of the proposed fault-tolerant systolic array for GEMM accelerator.
III. Experimental Results
The proposed fault-tolerant GEMM accelerator was designed at the RTL and synthesized
using a standard 28 nm Samsung FD-SOI technology. The target operating conditions
included a supply voltage of 1.0 V and a nominal clock frequency of 250 MHz under
typical process-voltage-temperature (PVT) corners. After logic synthesis, the resulting
design comprised approximately 4.1 million gate equivalents (GE), which demonstrates
the feasibility of the proposed design for lightweight edge devices with strict area
and power constraints.
To evaluate fault resilience under realistic scenarios, we injected a total of 1,000
randomly generated permanent faults into four $16 \times 16$ systolic arrays. Each
fault was assigned to random bit positions within the 8-bit registers
that store intermediate PE outputs. The simulation was conducted under the assumption
that faults had been pre-identified using a built-in self-test (BIST) mechanism,
allowing for appropriate fault-aware activation of mitigation techniques during
runtime.
The hardware overhead of the proposed techniques was quantitatively measured after
synthesis. The RC-Off mechanism incurred an area overhead of 2.4%, while the HL-Swap
introduced an 8.9% increase in gate count. These overheads are modest
when weighed against the improvement in system reliability and inference
accuracy.
Fig. 5 presents the accuracy degradation trends as a function of the FPR. Three configurations
were evaluated: a baseline conventional GEMM accelerator, GEMM with RC-Off applied,
and GEMM with both RC-Off and HL-Swap applied. For reference, the degradation values
are normalized to the accuracy drop of the baseline system operating under a 6% FPR.
Under this condition (6% FPR), the conventional GEMM accelerator exhibited an accuracy
drop of 25%. When the RC-Off technique was applied, the degradation was reduced to
9.3%, corresponding to an improvement of approximately 62.8%. Moreover, the simultaneous
application of both RC-Off and HL-Swap further reduced the error, achieving a total
improvement of approximately 63% relative to the baseline. These results demonstrate
that the proposed techniques operate synergistically and effectively mitigate the
impact of permanent faults with minimal hardware cost.
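The 62.8% figure follows directly from the two reported accuracy drops, as the short check below shows. The function name `relative_improvement` is illustrative.

```python
def relative_improvement(baseline_drop, mitigated_drop):
    """Percentage reduction of the accuracy drop relative to the baseline."""
    return 100.0 * (baseline_drop - mitigated_drop) / baseline_drop


# Reported values at 6% FPR: 25% drop for the baseline, 9.3% with RC-Off.
rc_off_gain = relative_improvement(25.0, 9.3)  # approximately 62.8
```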
Fig. 5. Accuracy drop comparison of the proposed fault mitigation schemes at various FPRs.
IV. Conclusions
In this study, we proposed two fault mitigation techniques to reduce accuracy degradation
in GEMM accelerators under permanent faults. The first, HL-Swap, mitigates bit-index-dependent
errors by swapping the upper and lower bits in a register when high-bit faults are
detected. The second, RC-Off, selectively disables rows or columns that significantly
impact accuracy, based on a fault scoring mechanism derived from microarchitectural
analysis.
Implemented in Samsung 28 nm FD-SOI technology, the proposed methods incur low hardware
overhead (2.4% for RC-Off and 8.9% for HL-Swap) while offering substantial reliability
improvements. Under a 6% faulty PE rate, the combined application of HL-Swap and RC-Off
reduces accuracy degradation by up to 63%, demonstrating their effectiveness and efficiency.
ACKNOWLEDGMENTS
This work was supported by the R&D Program of the Ministry of Trade, Industry,
and Energy (MOTIE) and Korea Evaluation Institute of Industrial Technology (KEIT)
(RS-2023-00232192).
References
S. Ryu and J.-J. Kim, ``High-performance sparsity-aware NPU with reconfigurable comparator-multiplier
architecture,'' Journal of Semiconductor Technology and Science, vol. 24, no. 6, pp.
572-577, 2024.

H.-T. Kung, ``Why systolic architectures?'' Computer, vol. 15, no. 1, pp. 37-46, 1982.

N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S.
Bhatia, N. Boden, A. Borchers, et al., ``In-datacenter performance analysis of a tensor
processing unit,'' Proc. of 44th Annual International Symposium on Computer Architecture
(ISCA), pp. 1-12, 2017.

F. Yu, Z. Qin, C. Liu, D. Wang, and X. Chen, ``REIN the RobuTS: Robust DNN-based image
recognition in autonomous driving systems,'' IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, vol. 40, no. 6, pp. 1258-1271, 2020.

H. A. Glory, C. Vigneswaran, S. S. Jagtap, R. Shruthi, G. Hariharan, and V. S. S.
Sriram, ``AHW-BGOA-DNN: A novel deep learning model for epileptic seizure detection,''
Neural Computing and Applications, vol. 33, pp. 6065-6093, 2021.

S. J. Yoon, T. Talluri, A. Angani, H. T. Chung, and K. J. Shin, ``Development of battery
management system with PCM using neural network based aging algorithm for electric
vehicle,'' IEIE Transactions on Smart Processing and Computing, vol. 14, no. 2, pp.
280-296, 2025.

S. S. Sahoo, A. Kumar, and B. Veeravalli, ``Design and evaluation of reliability-oriented
task re-mapping in MP-SoCs using time-series analysis of intermittent faults,'' Proc.
of Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 798-803,
2016.

S. Borkar, ``Design perspectives on 22nm CMOS and beyond,'' Proc. of 46th Annual Design
Automation Conference (DAC), pp. 93-94, 2009.

C. Constantinescu, ``Trends and challenges in VLSI circuit reliability,'' IEEE Micro,
vol. 23, no. 4, pp. 14-19, 2003.

H. Nan and K. Choi, ``High performance, low cost, and robust soft error tolerant latch
designs for nanoscale CMOS technology,'' IEEE Transactions on Circuits and Systems
I: Regular Papers, vol. 59, no. 7, pp. 1445-1457, 2012.

J. Henkel, L. Bauer, N. Dutt, P. Gupta, S. Nassif, M. Shafique, M. Tahoori, and N.
Wehn, ``Reliable on-chip systems in the nano-era: Lessons learnt and future trends,''
Proc. of 50th Annual Design Automation Conference (DAC), pp. 1-10, 2013.

H. Lee, H.-J. Lee, and H. Kim, ``A read disturbance tolerant phase change memory system
for CNN inference workloads,'' Journal of Semiconductor Technology and Science, vol.
22, no. 4, pp. 216-223, 2022.

M. Pandey and A. Islam, ``Radiation tolerant by design 12-transistor static random
access memory,'' Journal of Semiconductor Technology and Science, vol. 24, no. 5,
pp. 410-423, 2024.

J. J. Zhang, K. Basu, and S. Garg, ``Fault-tolerant systolic array based accelerators
for deep neural network execution,'' IEEE Design & Test, vol. 36, no. 5, pp. 44-53,
2019.

M. A. Hanif and M. Shafique, ``Salvagednn: Salvaging deep neural network accelerators
with permanent faults through saliency-driven fault-aware mapping,'' Philosophical
Transactions of the Royal Society A, vol. 378, no. 2164, 20190164, 2020.

K. Cho, I. Lee, H. Lim, and S. Kang, ``Efficient systolic-array redundancy architecture
for offline/online repair,'' Electronics, vol. 9, no. 2, 338, 2020.

L.-C. Chu and B. W. Wah, ``Fault tolerant neural networks with hybrid redundancy,''
Proc. of IJCNN International Joint Conference on Neural Networks, pp. 639-649, 1990.

H. Lee, J. Park, and S. Kang, ``An area-efficient systolic array redundancy architecture
for reliable AI accelerator,'' IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 32, no. 10, pp. 1950-1954, 2024.

L. Deng, ``The MNIST database of handwritten digit images for machine learning research,''
IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141-142, 2012.

J. J. Zhang, T. Gu, K. Basu, and S. Garg, ``Analyzing and mitigating the impact of
permanent faults on a systolic array based neural network accelerator,'' Proc. of
IEEE 36th VLSI Test Symposium (VTS), pp. 1-6, 2018.

Sunyoung Park received her B.S. degree in electronic and electrical engineering
from Ewha Womans University, Seoul, South Korea, in 2019, and an M.S. degree from
the Digital System Architecture Laboratory, Ewha Womans University, where she is currently
pursuing a Ph.D. degree with the Digital System Architecture Laboratory. Her current
research interests include run-time test architecture, functional safety and digital
system architecture design.
Hannah Yang received her B.S. degree in electronic and electrical engineering from
Ewha Womans University, Seoul, South Korea, in 2022, and an M.S. degree from the same
university conducting research in the Digital System Architecture Laboratory with
a focus on memory system architecture optimization for the VESA VDC-M Decoder IP,
in 2024. She is currently pursuing a Ph.D. degree at Hanyang University. Her current
research interests include RISC-V processors and domain-specific accelerators.
Hana Kim received her B.S. degree in electronic engineering in 2020 and an M.S.
degree from the Department of Electronic and Electrical Engineering, Ewha Womans University,
Seoul, South Korea, in 2022. She is currently pursuing a Ph.D. degree with the Digital
System Architecture Laboratory at Hanyang University, Seoul. Her current research
interests include artificial intelligence (AI) accelerators for various applications,
data types, system-on-chip, and digital system architecture design.
Hyunji Kim received her B.S. and M.S. degrees in electronic and electrical engineering
from Ewha Womans University, Seoul, South Korea, in 2019 and 2021, respectively. She
is currently pursuing a Ph.D. degree at the same university with Digital System Architecture
Lab. Her current research interests include domain-specific SoC architecture.
Ji-Hoon Kim received his B.S. (summa cum laude) and Ph.D. degrees in Electrical
Engineering and Computer Science from KAIST, Daejeon, South Korea, in 2004 and 2009,
respectively. In 2009, he joined Samsung Electronics, Suwon, South Korea, as a Senior
Engineer, and worked on next-generation architecture for 4G communication modem system-on-chip
(SoC). From 2018 to 2025, he was a professor in the Department of Electronic and Electrical
Engineering, Ewha Womans University, Seoul, South Korea. Since 2025, he has been with
the Department of Electronic Engineering, Hanyang University, Seoul, South Korea.
His current research interests include CPU microarchitecture, domain-specific SoC,
and deep neural network accelerators.
Dr. Kim served on the Technical Program Committee and Organizing Committee for
various international conferences, including the IEEE International Conference on
Computer Design (ICCD), the IEEE Asian Solid-State Circuits Conference (A-SSCC), and
the IEEE International Solid-State Circuits Conference (ISSCC). He was a co-recipient
of the Distinguished Design Award at the 2019 IEEE A-SSCC, and a recipient of the
Best Design Award at 2007 Dongbu HiTek IP Design Contest, the First Place Award at
2008 International SoC Design Conference (ISOCC) Chip Design Contest, and the IEEE/IEIE
Joint Award for Young Scientist and Engineer. He also serves as an Associate Editor
for the IEEE Transactions on Circuits and Systems II: Express Briefs.