Moonsoo Kim1, Joohan Yi1, Hyun Kim2*, Hyuk-Jae Lee1

1 Department of Electrical and Computer Engineering, Seoul National University, Seoul, Korea
  {kimms213, jhyi, hyuk_jae_lee}@capp.snu.ac.kr

2 Department of Electrical and Information Engineering and Research Center for Electrical and Information Technology, Seoul National University of Science and Technology, Seoul, Korea
  hyunkim@seoultech.ac.kr
                        
 
            
            
             Copyright © The Institute of Electronics and Information Engineers (IEIE)
            
            
            
            
            
               
                  
Keywords
               
               Non-volatile memory, Phase-change memory, Read disturbance errors, On-demand scrubbing
             
            
          
         
            
                  1. Introduction
               
                   				A modern computer system requires a large amount of main memory owing to its multi-core
                   structure and complex applications. In particular, data-intensive applications such
                   as big data analytics and deep learning require large main memory capacities to handle
                   their data [1-3]. As a result, large-capacity main memory with low power consumption
                   and high reliability has become increasingly important [4,5], and studies on the use of phase-change memory (PCM) as main memory have been actively
                   conducted [6-8]. The cell size of PCM is smaller than that of DRAM, so a module can be denser, enabling
                   a larger memory capacity [9]. Furthermore, owing to its non-volatile characteristics, PCM is more advantageous
                   than DRAM in terms of power efficiency and data retention time.
                  			
               
               
                  				Despite these advantages, PCM suffers from low reliability, which needs to be
                  addressed in order to use PCM as main memory. One of the main causes of reliability
                  issues in PCM is read disturbance errors (RDEs) [10,11]. An RDE is a phenomenon whereby cells that are repeatedly read are damaged by thermal
                  energy. RDEs occur when the number of reads exceeds a certain threshold. A conventional
                  solution for RDEs is to scrub the cells in a word before the number of reads reaches
                  the threshold. Memory scrubbing first reads a word, corrects bit errors with error-correcting
                  code (ECC), and writes the corrected word back to the same location. Periodically
                  scrubbing a word therefore prevents RDEs in that word.
                  			
               
               
                   				However, periodic scrubbing requires read counters, which incur significant
                   resource overhead because the number of reads must be counted in order to trigger
                   scrubbing. In this paper, an on-demand memory scrubbing method that does not require read counters
                   is proposed. Under the given RDE model with ECC, the probability distribution of the
                   number of errors that occur with an additional read is derived. Using this
                   probability distribution, the proposed method decides whether to scrub based
                   only on the current number of errors. Because the proposed solution requires only the number
                   of errors, it does not need read counters, thereby eliminating nearly 1GB of resource
                   overhead. The contributions of this paper are summarized as follows.
                  			
               
               
                  
                   				· A probabilistic model for RDEs is mathematically derived, and the optimal on-demand
                   scrubbing policy is obtained from this model.
                  			
               
               
                  				· Monte-Carlo (MC) simulation is conducted to verify the probabilistic model.
                  			
               
               
                   				· The proposed on-demand scrubbing eliminates the more than 1GB of storage needed for read counters,
                   while fixing more than 99.99% of RDEs.
                  			
               
               
                  				The remainder of this paper is organized as follows. Section 2 introduces the
                  background, and Section 3 presents the proposed on-demand scrubbing method. In Section
                  4, experimental results are given. Finally, Section 5 concludes the paper.
                  
                  			
               
             
            
                  2. Background
               
                   				In this section, the background on error models for RDEs and RDE-mitigating schemes
                   is presented.
                  			
               
               
                     2.1 Error Models for RDEs
                  
                      					Typically, a counter-based error model is used for RDEs [9]. Under this model, each cell has an RDE threshold, and when the number of reads reaches
                      the threshold, an RDE occurs. The RDE threshold values follow a Gaussian distribution,
                      $N\left(m,\sigma ^{2}\right)$. For the following discussion, $\textit{m}$ = 3,000 and various
                      $\sigma$ values are assumed.
                     				
                  
                  
                      					The word size for PCM typically ranges from 64B to 256B [7]. In this paper, 128B words are assumed. For ECC, a Reed-Solomon code that can correct
                      up to 21 of the 176 symbols in a word is assumed. The 1,408 cells
                      in a word (176 symbols of 8 bits each) are assumed to have independent RDE threshold values following the Gaussian
                      distribution above.
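
                      As a concrete illustration of this model, the minimal Python sketch below samples independent per-cell thresholds for one 1,408-cell word and reports how many cells have been disturbed after a given number of reads. The parameter values mirror the assumptions above; counting disturbed cells (rather than erroneous ECC symbols) is a simplification made only for brevity.

```python
import numpy as np

# Assumed parameters from the text: m = 3,000, sigma varied,
# 1,408 cells per 128B word (176 ECC symbols x 8 bits).
M = 3000
SIGMA = 20
CELLS_PER_WORD = 176 * 8

rng = np.random.default_rng(0)

# Counter-based error model: each cell gets an independent Gaussian
# RDE threshold; a cell is disturbed once the read count reaches it.
thresholds = rng.normal(M, SIGMA, size=CELLS_PER_WORD)

def disturbed_cells(read_count: int) -> int:
    """Number of cells whose RDE threshold has been reached."""
    return int(np.count_nonzero(thresholds <= read_count))

for k in (2900, 2950, 3000, 3050):
    print(f"after {k} reads: {disturbed_cells(k)} disturbed cells")
```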
                     
                     				
                  
                
               
                     2.2 RDE Mitigating Schemes
                  
                     					To mitigate RDE occurrences, a memory scrubbing method is used [12]. In conventional methods, each word utilizes a read counter. If the counter value
                     reaches a certain threshold, the method reads a whole word and checks for errors via
                     ECC. If errors are found, they are corrected and the word is rewritten. This read-and-fix
                     process is called memory scrubbing. Conventional memory scrubbing, which uses a read
                     counter, can remove RDEs effectively as long as the scrubbing threshold value is well
                     chosen. However, as shown in Fig. 1, it requires a read counter per word, which means an extra 2B of storage must be
                     allocated per word. Taking into account that the word size in PCM is typically between
                     64B and 256B, the extra storage takes about 1/32 to 1/128 of the total PCM capacity.
                      Moreover, these read counters are updated frequently, and thus, DRAM should be used
                      for them. Assuming 512GB of PCM capacity and a 128B word size, nearly 8GB of DRAM
                      is used only for read counters, which is a significant overhead.
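
                      For reference, a minimal sketch of this conventional counter-based scheme is shown below. The scrub threshold value, the class interface, and the `memory` object are illustrative assumptions made for the sketch, not details taken from [12].

```python
SCRUB_THRESHOLD = 2500  # example scrub threshold, chosen below the mean RDE threshold

class CounterBasedScrubber:
    """Conventional scheme: one read counter per word (2B each, kept in DRAM)."""

    def __init__(self, num_words: int):
        self.read_counters = [0] * num_words

    def on_read(self, word_addr: int, memory) -> None:
        # memory is an assumed interface with read_and_correct() and write().
        self.read_counters[word_addr] += 1
        if self.read_counters[word_addr] >= SCRUB_THRESHOLD:
            # Scrub: read the word, correct errors with ECC, write it back.
            data = memory.read_and_correct(word_addr)
            memory.write(word_addr, data)
            self.read_counters[word_addr] = 0
```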
                     				
                  
                  
                        Fig. 1. Diagram of PCM and its read counters.
 
                
             
            
                  3. On-demand Memory Scrubbing
               
                  				In this section, an on-demand memory scrubbing method that effectively eliminates
                  read counters is described. First, the probability distribution under the Gaussian
                  counter-based error model is derived, and then, an efficient on-demand scrubbing policy
                  under the given probability distribution is suggested.
                  			
               
               
                     3.1 Probability Distribution for the Number of Errors 
                  
                     					Denote as $\textit{L}$ the number of errors, with $\textit{K}$ as the number
                     of reads, and $\textit{T}$ as the RDE threshold. The first probability to derive is
                     $e_{k}$, which is the probability that an error occurs when the number of reads is
                     $\textit{k}$ $\left(K=k\right)$. $e_{k}$ is derived as follows:
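
                      Writing the threshold event explicitly, with $T$ modeled as $N\left(m,\sigma ^{2}\right)$,

                      $$e_{k}=P\left(k\geq T\right)=P\left(k\geq N\left(m,\sigma ^{2}\right)\right). \tag{1}$$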
                     				
                  
                  
                  
                      					Because errors occur when the number of reads exceeds the RDE threshold, the
                      first equality in (1) holds. The second equality follows from the Gaussian modeling of $\textit{T}$. Note
                      that $P\left(k\geq N\left(m,\sigma ^{2}\right)\right)$ can be easily calculated
                      from the standard normal distribution table.
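
                      As a small sketch of this calculation, $e_{k}$ can be evaluated with the standard normal CDF using only the Python standard library (the function name is illustrative):

```python
import math

def e_k(k: int, m: float = 3000.0, sigma: float = 20.0) -> float:
    """Probability that a threshold drawn from N(m, sigma^2) is <= k,
    i.e., the per-symbol error probability at read count k (Eq. (1))."""
    z = (k - m) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(e_k(2950))  # roughly 0.0062 for sigma = 20
```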
                     				
                  
                  
                      					When $K=k$, the probability that $L$ equals $l$ is given by the binomial distribution $B\left(l;176,e_{k}\right)$.
                      More specifically, each of the 176 symbols is in error with probability $e_{k}$, so
                      the number of erroneous symbols is binomially distributed as follows:
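
                      For $l$ erroneous symbols out of the 176 in a word,

                      $$P\left(L=l\,|\,K=k\right)=B\left(l;176,e_{k}\right)=\binom{176}{l}e_{k}^{\,l}\left(1-e_{k}\right)^{176-l}. \tag{2}$$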
                     				
                  
                  
                  
                     					However, we are interested in probability $P(k|l)$, not $P(l|k)$, because the
                     value that can be observed is $\textit{l}$, not $\textit{k}$. By using Bayes’s rule,
                     $P(k|l)$ can be derived as follows:
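
                      With $P(k)$ denoting the prior over the read count (not specified further here),

                      $$P\left(K=k\,|\,L=l\right)=\frac{P\left(l|k\right)P\left(k\right)}{\sum _{k'}P\left(l|k'\right)P\left(k'\right)}. \tag{3}$$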
                     				
                  
                  
                  
                     					This means that when the observed number of errors is $\textit{l}$, the hidden
                     number of reads, $\textit{k,}$ has the probability distribution derived from (3).
                     				
                  
                  
                      					Lastly, probability $P(L'=l'|L=l)$ is derived, where $L'$ represents the number of
                      errors after one additional read operation is performed. Therefore, probability $P(l'|l)$
                      is the probability that the number of errors becomes $l'$, starting from $l$, due to an additional
                      read. It is derived as follows:
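
                      Averaging the additional-error distribution over the posterior in (3),

                      $$P\left(L'=l'\,|\,L=l\right)=\sum _{k}P\left(k|l\right)B\left(\left(l'-l\right);176-l,e_{k+1}\right). \tag{4}$$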
                     				
                  
                  
                  
                      					$B\left(\left(l'-l\right);176-l,e_{k+1}\right)$ is the probability that
                      the number of additional errors caused by one more read is $\left(l'-l\right)$.
                      Because the number of reads is now $k+1$, $e_{k+1}$ is used as the error probability.
                      By summing the term over $k$, the desired $P\left(l'|l\right)$ is obtained.
                     
                     				
                  
                
               
                     3.2 On-demand Scrubbing Policy
                  
                      					So far, the probability distribution $P\left(l'|l\right)$ has been obtained in (4). This distribution is used to derive the on-demand scrubbing policy. Because the assumed
                      ECC can correct up to 21 symbols, scrubbing must be done before the number of errors
                      exceeds 21. This means that the probability $P\left(L'>21|L=l\right)$ must be small enough
                      when scrubbing is performed. It can be calculated as follows:
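
                      Summing (4) over all uncorrectable outcomes,

                      $$P\left(L'>21\,|\,L=l\right)=\sum _{l'=22}^{176}P\left(L'=l'\,|\,L=l\right). \tag{5}$$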
                     				
                  
                  
                  
                      					Table 1 shows the value of $P\left(L'>21|L=l\right)$ for various values of $\sigma$. When
                      $\sigma =10$ and $L=6$, the probability is 6.5E-7%. This means that when the current
                      number of errors is 6, the probability that the number of errors becomes larger than
                      21 due to one more read operation is 6.5E-7%. Obviously, this probability must
                      be very small to prevent an uncorrectable error (UE). Therefore, performing scrubbing
                      at a small value of $\textit{L}$ is always better in terms of removing UEs. However,
                      a small $\textit{L}$ value causes scrubbing to be triggered too frequently. Table 2 shows the expected number of reads ($\textit{K}$) with respect to $\textit{L}$.
                      We can see that $\textit{K}$ increases as $\textit{L}$ increases. Moreover, the difference
                      in the value is relatively large when $\textit{L}$ is small (from 0 to 6). On the
                      other hand, when $\textit{L}$ is relatively large, the difference becomes smaller.
                      To sum up, it is best to delay scrubbing while the UE probability is small enough,
                      but once the $\textit{L}$ value becomes relatively large, delaying scrubbing does
                      not give much benefit because the difference in the $\textit{K}$ value is minor.
                      For example, when the reliability goal is 99.999%, which means the UE probability
                      should be less than 0.001%, the selected $\textit{L}$ values that trigger scrubbing
                      are 7, 10, and 13 when $\sigma$ is 10, 20, and 50, respectively. The average numbers of reads
                      at the selected points are 2984, 2969, and 2920, respectively.
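
                      As an illustration, the Python sketch below evaluates $P\left(L'>21|L=l\right)$ numerically from Eqs. (1)-(5) and picks the scrubbing trigger for a given reliability goal. The uniform prior over the read count $k$, the truncated $k$ range, and the helper names are assumptions made only for this sketch, not choices specified by the derivation above.

```python
import math

M, SIGMA = 3000, 20           # Gaussian RDE-threshold parameters (assumed)
SYMBOLS, ECC_LIMIT = 176, 21  # 176 symbols per word, up to 21 correctable
K_RANGE = range(2800, 3101)   # truncated range of read counts (assumption)
GOAL = 1e-5                   # UE probability goal of 0.001%

def e_k(k):
    """Eq. (1): per-symbol error probability at read count k."""
    return 0.5 * (1.0 + math.erf((k - M) / (SIGMA * math.sqrt(2.0))))

def binom_pmf(x, n, p):
    """Eq. (2): binomial probability mass B(x; n, p)."""
    if x < 0 or x > n:
        return 0.0
    return math.comb(n, x) * (p ** x) * ((1.0 - p) ** (n - x))

def violation_prob(l):
    """Eq. (5): P(L' > 21 | L = l), with a uniform prior over K_RANGE."""
    # Eq. (3): posterior over the hidden read count k given l observed errors.
    joint = [binom_pmf(l, SYMBOLS, e_k(k)) for k in K_RANGE]
    total = sum(joint)
    if total == 0.0:
        return 0.0
    post = [j / total for j in joint]
    # Eq. (4), summed over all l' beyond the ECC limit.
    prob = 0.0
    for p_k, k in zip(post, K_RANGE):
        ek1 = e_k(k + 1)
        tail = sum(binom_pmf(lp - l, SYMBOLS - l, ek1)
                   for lp in range(ECC_LIMIT + 1, SYMBOLS + 1))
        prob += p_k * tail
    return prob

# Largest observed error count at which one more read is still "safe enough".
scrub_at = 0
for l in range(ECC_LIMIT + 1):
    if violation_prob(l) >= GOAL:
        break
    scrub_at = l
print("scrub once the observed error count reaches", scrub_at)
```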
                     				
                  
                  
                      					The proposed on-demand scrubbing provides a significant reduction in hardware overhead
                      because it does not require read counters. As mentioned above, a read counter for
                      a single word takes up 2B. Assuming a word size of 128B, 1/64 of the total PCM capacity
                      must be allocated to read counters. For example, if the PCM capacity is 64GB, a total
                      of 1GB of storage must be allocated to read counters. Using on-demand
                      scrubbing therefore reduces storage overhead by about 1/64 of the total capacity.
                     
                     				
                  
                  
                        Table 1. The probability of violation with respect to L.
                     
                           
                              
                                  | L   | σ=10     | σ=20     | σ=50     |
                                  |-----|----------|----------|----------|
                                  | 0-6 | <6.5E-7% | <4.8E-7% | <2.4E-8% |
                                  | 7   | 2.9E-4%  | 6.1E-6%  | 1.6E-7%  |
                                  | 8   | 5.0E-3%  | 2.0E-5%  | 1.1E-6%  |
                                  | 9   | 0.03%    | 8.4E-5%  | 8.7E-6%  |
                                  | 10  | 0.08%    | 2.8E-4%  | 2.2E-5%  |
                                  | 11  | 0.31%    | 3.4E-3%  | 9.1E-5%  |
                                  | 12  | 0.82%    | 0.02%    | 3.3E-4%  |
                        
                     
                   
                  
                        Table 2. The expected value of K with respect to L.
                     
                           
                              
                                  | L   | σ=10      | σ=20      | σ=50      |
                                  |-----|-----------|-----------|-----------|
                                  | 0-6 | 2956-2982 | 2913-2963 | 2783-2908 |
                                  | 7   | 2983      | 2965      | 2912      |
                                  | 8   | 2983      | 2967      | 2915      |
                                  | 9   | 2984      | 2968      | 2917      |
                                  | 10  | 2984      | 2969      | 2918      |
                                  | 11  | 2984      | 2970      | 2919      |
                                  | 12  | 2985      | 2970      | 2920      |
                        
                     
                   
                
             
            
                  4. Simulation Results
               
                   				In this section, MC simulation results for the proposed on-demand memory
                   scrubbing are presented. For the MC simulations, the RDE threshold value $\textit{T}$ for
                   each bit in a word was randomly sampled from the Gaussian distribution, $N\left(3000,~
                   \sigma ^{2}\right)$. With the generated $\textit{T}$ values in a word, the number
                   of errors ($\textit{L}$) was tracked as the read count ($\textit{K}$) increased.
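
                   A condensed version of this procedure is sketched below in Python. The per-cell Gaussian sampling follows the description above; grouping eight cells into one ECC symbol (a symbol is counted as erroneous once any of its cells is disturbed) and the reduced trial count are simplifications assumed for the sketch.

```python
import numpy as np

M, SIGMA = 3000, 20                # Gaussian RDE-threshold parameters
SYMBOLS, BITS_PER_SYMBOL = 176, 8  # 176 symbols x 8 cells = 1,408 cells
ECC_LIMIT = 21                     # up to 21 erroneous symbols correctable
TRIALS = 10_000                    # far fewer than the 1,000,000 trials in Table 3

rng = np.random.default_rng(0)
violated_from = {}                 # L value just before the violating read -> count

for _ in range(TRIALS):
    # Per-cell RDE thresholds for one word; a symbol becomes erroneous once
    # any of its cells has been read at least its threshold number of times.
    t = rng.normal(M, SIGMA, size=(SYMBOLS, BITS_PER_SYMBOL))
    symbol_threshold = np.ceil(t.min(axis=1)).astype(int)

    prev_errors = 0
    for k in range(int(symbol_threshold.min()), int(symbol_threshold.max()) + 1):
        errors = int(np.count_nonzero(symbol_threshold <= k))
        if errors > ECC_LIMIT:
            # One read pushed the word from prev_errors past the ECC limit.
            violated_from[prev_errors] = violated_from.get(prev_errors, 0) + 1
            break
        prev_errors = errors

for l in sorted(violated_from):
    print(f"L={l:2d}: {violated_from[l]} violated reads "
          f"({100.0 * violated_from[l] / TRIALS:.4f}%)")
```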
                  			
               
               
                  				Table 3 shows the MC simulation results with respect to various standard deviation values.
                  The number of MC trials was 1,000,000. The first column of Table 3 shows the number of errors in the current state. From the current state, if one more
                  read causes a violation ($\textit{i.e.}$, the number of errors exceeds 21), the case
                  is counted as a $\textit{violated read}$. The second, fifth, and eighth columns in
                  Table 3 show the number of violated reads for standard deviations of 10, 20, and 50, respectively.
                  For example, when the standard deviation was 10, the violated read value was 4 for
                  an $\textit{L}$ of 7. This means that in these four cases, the error number changed
                   directly from 7 to a value larger than 21. The columns labeled Probability of VR show
                   the ratio of violated reads to the total number of trials. The columns labeled
                   Average Read Threshold represent the average read count among the violated reads. For
                   example, when the standard deviation was 10, the average read threshold was 2,973
                   for an $\textit{L}$ of 7. This means that the average read count over those four violations
                   was 2,973.
                  			
               
               
                   				From the MC simulation results in Table 3, two observations can be drawn. First, the distribution of violated reads changed
                   with respect to the standard deviation. As the standard deviation of the underlying
                   Gaussian model increased, violations occurred at larger values of $\textit{L}$. In other words, when the standard
                   deviation was relatively large, more errors could be tolerated before scrubbing. The
                   other observation is that the average read threshold value remained almost the same
                   as $\textit{L}$ increased. This means that even if more errors are accumulated before
                   scrubbing in order to reduce the scrubbing frequency, the actual reduction in scrubbing
                   frequency is negligible. From this observation, the optimal on-demand scrubbing point
                   for $\textit{L}$ is the point where a violation first occurs. Therefore, the optimal
                   on-demand scrubbing points for standard deviations of 10, 20, and 50 are $\textit{L}$ values of
                   7, 10, and 13, respectively.
                  			
               
               
                  				It should be noted that the MC simulation results presented in Table 3 verify the probability distribution described in Section 3.1. The probabilities of
                  violations derived from the probability distributions shown in Table 1 give values similar to those in Table 3. For the average read threshold value drawn from the probability distribution in
                  Table 2, the results verify that the value does not increase significantly when $\textit{L}$
                  is greater than 6.
                  			
               
               
                   				In addition to RDEs, there are other reliability issues in PCM, such as
                   write disturbance errors (WDEs). For example, if the current number of errors
                   is 8, it is possible that four of them are RDEs and four are WDEs. Under this circumstance,
                   consider a case where the proposed on-demand scrubbing policy initiates scrubbing.
                   The policy assumes that all eight errors were caused by RDEs, whereas in this case only
                   four errors were caused by RDEs. Therefore, the probability that an additional read
                   will cause an ECC violation is even lower when other error types are considered. In summary,
                   the proposed approach, which considers only RDEs, is conservative, and
                   other errors do not compromise the reliability of the proposed on-demand scrubbing.
                  			
               
               
                     Table 3. MC simulation results showing violated read and average read threshold values.
                  
                        
                           
                               | L   | $N(3000,10^{2})$: Violated Reads | Prob. of VR (%) | Avg. Read Threshold | $N(3000,20^{2})$: Violated Reads | Prob. of VR (%) | Avg. Read Threshold | $N(3000,50^{2})$: Violated Reads | Prob. of VR (%) | Avg. Read Threshold |
                               |-----|--------|---------|------|--------|---------|------|--------|---------|------|
                               | 0-6 | 0      | 0       | 0    | 0      | 0       | 0    | 0      | 0       | 0    |
                               | 7   | 4      | 0.0004  | 2973 | 0      | 0       | 0    | 0      | 0       | 0    |
                               | 8   | 37     | 0.0037  | 2973 | 0      | 0       | 0    | 0      | 0       | 0    |
                               | 9   | 197    | 0.0197  | 2973 | 0      | 0       | 0    | 0      | 0       | 0    |
                               | 10  | 706    | 0.0706  | 2973 | 3      | 0.0003  | 2952 | 0      | 0       | 0    |
                               | 11  | 2462   | 0.2462  | 2973 | 25     | 0.0025  | 2949 | 0      | 0       | 0    |
                               | 12  | 7604   | 0.7604  | 2973 | 160    | 0.016   | 2948 | 0      | 0       | 0    |
                               | 13  | 18692  | 1.8692  | 2973 | 866    | 0.0866  | 2948 | 3      | 0.0003  | 2881 |
                               | 14  | 40413  | 4.0413  | 2973 | 3669   | 0.3669  | 2949 | 47     | 0.0047  | 2876 |
                               | 15  | 73756  | 7.3756  | 2973 | 12916  | 1.2916  | 2949 | 432    | 0.0432  | 2875 |
                               | 16  | 115931 | 11.5931 | 2973 | 38812  | 3.8812  | 2949 | 2880   | 0.288   | 2876 |
                               | 17  | 157267 | 15.7267 | 2973 | 95337  | 9.5337  | 2949 | 17173  | 1.7173  | 2876 |
                               | 18  | 186598 | 18.6598 | 2974 | 187397 | 18.7397 | 2949 | 78581  | 7.8581  | 2876 |
                               | 19  | 198617 | 19.8617 | 2974 | 293767 | 29.3767 | 2949 | 267358 | 26.7358 | 2877 |
                               | 20  | 197716 | 19.7716 | 2974 | 367048 | 36.7048 | 2950 | 633526 | 63.3526 | 2877 |
                     
                  
                
             
            
                  5. Conclusion 
               
                   				To efficiently address RDEs, this paper proposes an on-demand scrubbing policy
                   that does not require read counters. Assuming a PCM capacity of 64GB,
                   the per-word read counters take nearly 1GB of storage in total, which is a significant resource
                   overhead. The proposed method removes this resource overhead while fixing more than 99.99%
                   of read disturbance errors. From a mathematically derived probability
                   distribution model of RDE occurrence, the optimal on-demand scrubbing policy is derived
                   with respect to the standard deviation of the RDE threshold values. MC simulation
                   results also verified the derived probability distribution model. It should be noted
                   that the probability of preventing violations can be increased by scrubbing more often.
                   This means that there is a trade-off between reliability and performance, and the
                   user can adaptively select the configuration according to the application.
                  			
               
             
          
         
            
                  ACKNOWLEDGMENTS
               
                   				This paper was supported in part by the Technology Innovation Program (10080613,
                   DRAM/PRAM heterogeneous memory architecture and controller IC design technology research
                   and development) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea)
                   and in part by the Basic Science Research Program through the National Research Foundation
                   of Korea (NRF) funded by the Ministry of Education under Grant NRF-2019R1A6A1A03032119.
                  			
               
             
            
                  
                     REFERENCES
                  
                     
                        
                         [1] B. Kim et al., "PCM: Precision-Controlled Memory System for Energy Efficient Deep Neural Network Training," in Proc. Design, Automation & Test in Europe Conference & Exhibition (DATE), Apr. 2020.

                         [2] D. T. Nguyen, N. H. Hung, H. Kim, and H.-J. Lee, "An Approximate Memory Architecture for Energy Saving in Deep Learning Applications," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 67, No. 5, pp. 1588-1601, May 2020.

                         [3] C. Lee and H. Lee, "Effective Parallelization of a High-Order Graph Matching Algorithm for GPU Execution," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 29, No. 2, pp. 560-571, Feb. 2019.

                         [4] M. Kim, J. Choi, H. Kim, and H. Lee, "An Effective DRAM Address Remapping for Mitigating Rowhammer Errors," IEEE Transactions on Computers, Vol. 68, No. 10, pp. 1428-1441, Oct. 2019.

                         [5] M. Kim, I. Chang, and H. Lee, "Segmented Tag Cache: A Novel Cache Organization for Reducing Dynamic Read Energy," IEEE Transactions on Computers, Vol. 68, No. 10, pp. 1546-1552, 2019.

                         [6] H. Lee, M. Kim, H. Kim, H. Kim, and H. Lee, "Integration and Boost of a Read-Modify-Write Module in Phase Change Memory System," IEEE Transactions on Computers, Vol. 68, No. 12, pp. 1772-1784, 2019.

                         [7] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative," in Proc. 36th Annual International Symposium on Computer Architecture (ISCA '09), 2009.

                         [8] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, "Scalable High Performance Main Memory System Using Phase-Change Memory Technology," in Proc. 36th Annual International Symposium on Computer Architecture (ISCA '09), 2009.

                         [9] H.-S. P. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg, B. Rajendran, M. Asheghi, and K. E. Goodson, "Phase Change Memory," Proceedings of the IEEE, Vol. 98, No. 12, pp. 2201-2227, 2010.

                         [10] P. J. Nair, C. Chou, B. Rajendran, and M. K. Qureshi, "Reducing Read Latency of Phase Change Memory via Early Read and Turbo Read," in Proc. 21st IEEE International Symposium on High Performance Computer Architecture (HPCA), Burlingame, CA, 2015.

                         [11] S. Rashidi, M. Jalili, and H. Sarbazi-Azad, "Improving MLC PCM Performance through Relaxed Write and Read for Intermediate Resistance Levels," ACM Transactions on Architecture and Code Optimization, Vol. 15, No. 1, Article 12, 31 pages, Apr. 2018.

                         [12] M. Awasthi, M. Shevgoor, K. Sudan, B. Rajendran, R. Balasubramonian, and V. Srinivasan, "Efficient Scrub Mechanisms for Error-Prone Emerging Memories," in Proc. IEEE International Symposium on High-Performance Computer Architecture (HPCA), New Orleans, LA, 2012.

 
                   
                
             
            Author
             
             
             
            
            
                			Moonsoo Kim received B.S. and Ph.D. degrees in electrical and computer engineering
                from Seoul National University, Seoul, Korea, in 2014 and 2020, respectively. In
                2020, he joined the Inter-University Semiconductor Research Center at Seoul National
                University, Seoul, Korea, as a post-doctoral researcher. His research interests include
                SoC design of video/image applications, and low-power, reliable design of the memory hierarchy.
                		
            
            
            
                			Joohan Yi received the B.S. degree in electrical engineering from
                Korea University, Seoul, Korea, in 2018. He is currently working toward the integrated
                M.S. and Ph.D. degree in electrical and computer engineering at Seoul National University,
                Seoul, Korea. His research interests include the memory hierarchy, deep neural network
                processors, and image processing for robot systems.
               		
            
            
            
               			Hyun Kim received the B.S., M.S. and Ph.D. degrees in Electrical Engineering and
               Computer Science from Seoul National University, Seoul, Korea, in 2009, 2011 and 2015,
               respectively. From 2015 to 2018, he was with the BK21 Creative Research Engineer Development
               for IT, Seoul National University, Seoul, Korea, as a Research Professor. In 2018,
               he joined the Department of Electrical and Information Engineering, Seoul National
               University of Science and Technology, Seoul, Korea, where he is currently working
                as an Assistant Professor. His research interests are in the areas of algorithms, computer
                architecture, and SoC design for low-complexity multimedia applications.
               		
            
            
            
               			Hyuk-Jae Lee received B.S. and M.S. degrees in Electronics Engineering from Seoul
               National University, Korea, in 1987 and 1989, respectively, and the Ph.D. degree in
               Electrical and Computer Engineering from Purdue University at West Lafayette, Indiana,
               in 1996. From 1998 to 2001, he worked at the Server and Workstation Chipset Division
               of Intel Corporation in Hillsboro, Oregon as a senior component design engineer. From
               1996 to 1998, he was on the faculty of the Department of Computer Science of Louisiana
               Tech University at Ruston, Louisiana. In 2001, he joined the School of Electrical
               Engineering and Computer Science at Seoul National University, Korea, where he is
               currently working as a Professor. He is a founder of Mamurian Design, Inc., a fabless
               SoC design house for multimedia applications. His research interests are in the areas
               of computer architecture and SoC design for multimedia applications.