KimMinseong1,2
                     ChoiKyu Hyun1,2
                     PaikYoonah1
                     KimSeon Wook1
               
                  - 
                           
                        (School of Electrical and Computer Engineering, Korea University / 145 Anam-ro, Seongbuk-gu,
                        Seoul 02841, Korea
                        							{minseong, nurlonn, yoonpaik, seon}@korea.ac.kr
                        						)
                        
- 
                           
                        (SK Hynix Inc. / 2091 Gyeongchung-daero, Bubaleup, Icheon-si, Gyeonggi-do 17336, Korea.
                        This work was done when Minseong Kim and Kyu Hyun Choi were with Korea University
                        )
                        
 
            
            
            Copyright © The Institute of Electronics and Information Engineers(IEIE)
            
            
            
            
            
               
                  
Keywords
               
               TensorFlow, Profiling, Optimization, Batch
             
            
          
         
            
                  1. Introduction
               
                  				TensorFlow [1] is one of the most well-known deep learning frameworks, providing high-level APIs
                  for defining, training, and inferencing deep learning models on various machines,
                  such as multicores and GPUs. TensorFlow performs numerical computations based on data
                  flow graphs that are a series of TensorFlow operations. In the past few years, image
                  recognition and classification studies have made tremendous progress due to the collaborative
                  development of such frameworks as TensorFlow, Caffe [2], Theano [3,4], Keras [5], CNTK [6], Torch [7], and PyTorch [8]. Besides, deep learning techniques have been actively studied and developed in various
                  application fields, such as image recognition [9,10], speech recognition [11,12], public health [13,14], load monitoring [15,16], and so on. In particular, the convolutional neural network (CNN), one of the deep
                  learning models, achieves higher performance on image recognition tasks than humans
                  in some domains [17-19].
                  			
               
               
                  
                  				Among various CNN operations, the convolution operation is the key component,
                  many variants of which are currently available [20-22]. Because the performance of these algorithms depends on hardware configuration, input
                  characteristics, and model behavior, TensorFlow utilizes various algorithms, i.e.,
                  it selects the best algorithm for each convolution operation, using profiling to achieve
                  optimal performance. However, the profiling overhead is considerably large, because
                  all available algorithms are examined for each convolution node, and the examination
                  incurs overhead in execution time and memory usage. In addition, the current profiling
                  process is performed whenever the application is launched; thus, redundant profiling
                  processes might be executed.
                  			
               
               
                  
                  				In this paper, we present a novel profiling method to reduce overhead by storing
                  and reusing the profiling results. We conducted a comprehensive study on TensorFlow
                  and analyzed memory usage over time that limits data parallelism and fails to deliver
                  maximum performance. We observed that memory capacity overshoots during the profiling
                  phase, and we found opportunities to eliminate this memory overshoot by recording
                  profiling results and reusing them without losing accuracy. Thus, we constructed a
                  novel profiling framework to store profiling results during the first run and reuse
                  the profiling results from the second run on, regardless of the TensorFlow detail
                  configurations, on NVidia GPUs using cuDNN [23], which is a GPU-accelerated library of primitives for deep neural networks. From
                  the measurement results, we could achieve up to 1.12 times and 1.11 times higher throughput
                  than vanilla TensorFlow and TensorFlow with XLA JIT compilation, respectively, using
                  Inception-V3, which is a widely used image classification model that is trained with
                  ImageNet datasets [24].
                  			
               
               
                  
                  				The rest of the paper is organized as follows. In Section~2, we describe the background
                  to and motivation for our study. In Section 3, we introduce the overall architecture
                  of our proposed profiling method. We evaluate the performance of the method (in terms
                  of throughput) in Section 4, and present memory usage over time. The paper concludes
                  in Section 5.
                  
                  			
               
             
            
                  2. TensorFlow and Its Profiling Execution
               
                     2.1 TensorFlow
                  
                     					TensorFlow is an open-source software library developed by Google [1], which performs numerical computations based on data flow graphs to conduct machine
                     learning and deep neural network research [1]. TensorFlow provides high-level APIs that make it easy to construct a neural network
                     with high flexibility that can deploy computations either on CPUs or GPUs on desktops,
                     servers, or mobile devices. The basic unit of TensorFlow is a computational graph
                     where each node represents an operation, and each edge, called a tensor, represents
                     data. The data comprise a set of primitive values shaped into an array of any dimension.
                     The computation node takes zero or more tensors as input and produces a tensor as
                     output.
                     				
                  
                  
                     					The CNN is the current state-of-the-art image classification architecture, and
                     it applies a series of convolution filters to raw images in order to extract and learn
                     higher-level features. The CNN receives input and transforms it through a series of
                     hidden layers composed of convolutional, pooling, and dense fully connected layers.
                     The convolutional layer is the core of the CNN and consumes most of the computational
                     capacity from among all the layers. TensorFlow employs a few convolution algorithms
                     and selects one of them to achieve optimal performance in terms of memory usage and
                     execution time with respect to the input parameters. The selection is performed in
                     the profiling phase.
                     
                     				
                  
                
               
                     2.2 Profiling
                  
                     
                     					Algorithm 1 presents TensorFlow's convolution profiling algorithm. TensorFlow
                     examines whether each convolution node was profiled when triggering the node. If the
                     node was profiled, the selected algorithms are chosen. Otherwise, profiling is performed
                     to find the best algorithm for the node.
                     				
                  
                  
                     					There are two reasons for profiling to enhance performance: identifying an algorithm
                     to minimize memory usage and one to minimize execution time. To find the algorithms,
                     all available convolutions are executed, depending on the convolution type, such as
                     forward, backward input, and backward filter. Table 1 shows the available algorithms
                     from TensorFlow with NVidia GPUs. In the forward convolution type, IMPLICIT_GEMM,
                     IMPLICIT_PRECOMP_GEMM, GEMM, and FFT algorithms are available. All the algorithms
                     perform a matrix-to-matrix product, but their memory usage and execution times differ.
                     An implicit algorithm like IMPLICIT_GEMM requires less memory, but executes for a
                     longer time than explicit ones, such as GEMM and FFT [23,25].
                     				
                  
                  
                     					Fig. 1 shows an example record from profiling. The first four lines represent the convolution
                     node's ID (name: starting address), an applied algorithm number, and its execution
                     time in $ms$. Line 5 shows the convolution type. In this example, the type is $\textit{backward_filter}$,
                     and the algorithm numbers on lines 1 to 4 represent the algorithms in Table 1, in order. In other words, algorithm numbers 0, 1, 2, and 3 imply $\textit{ALGO_0}$,
                     $\textit{ALGO_1}$, $\textit{FFT}$, and $\textit{ALGO_3}$, respectively. The fields
                     in the input and output shapes in lines 6 and 8, respectively, present the data type
                     (F32: 32-bit floating-point), batch size, image width, image height, and image channel.
                     For the filter shape in line 7, the fields are the filter width, filter height, input
                     channel, and output channel. Finally, the last line shows the two selected algorithms
                     from the profile: the first (3), achieves the minimum execution time, and the second
                     (0) requires the least memory.
                     				
                  
                  
                     					Basically, the profiled information includes convolution type, input shape, filter
                     shape, and output shape. We show in Section 3.1 that the batch size has no significant
                     effect on algorithm selection, so we can use the profiled information from the first
                     run for future runs.
                     
                     				
                  
                  
                        Algorithm 1. TensorFlow’s convolution profiling.
 
                  
                        Fig. 1. Example record from TensorFlow profiling.
 
                  
                        Table 1. Applicable convolution algorithms according to convolution type [23].
                     
                           
                              
                                 | Algorithm | Meaning | Memory usage | 
                        
                        
                              
                                 | IMPLICIT_GEMM | Matrix product | Low (no extra memory workspace) | 
                           
                                 | IMPLICIT_PRECOMP_GEMM | Matrix product | Medium (needs some memory workspace) | 
                           
                                 | GEMM | Explicit matrix product | High (needs significant memory workspace) | 
                           
                                 | FFT | Fast Fourier transform approach | High (stores intermediate results of FFT) | 
                           
                                 | FFT_TILING | Fast Fourier transform approach, but splits the input into 32x32 tiles | Medium (needs significantly less than FFT for large images) | 
                           
                                 | (a) Forward | 
                        
                        
                              
                                 | Algorithm | Meaning | Memory usage | 
                        
                        
                              
                                 | ALGO_0 | Sum of matrix product (non-deterministic) | Medium | 
                           
                                 | ALGO_1 | Sum of matrix product (deterministic) | Medium | 
                           
                                 | FFT | Fast Fourier transform approach (deterministic) | High (stores intermediate results of FFT) | 
                           
                                 | FFT_TILING | Fast Fourier transform approach, but splits the input into 32x32 tiles (deterministic) | Medium (needs significantly less than FFT for large images) | 
                           
                                 | (b) Backward input | 
                        
                        
                              
                                 | Algorithm | Meaning | Memory usage | 
                        
                        
                              
                                 | ALGO_0 | Sum of matrix product (non-deterministic) | Medium | 
                           
                                 | ALGO_1 | Sum of matrix product (deterministic) | Medium | 
                           
                                 | FFT | Fast Fourier transform approach (deterministic) | High (stores intermediate results of FFT) | 
                           
                                 | ALGO_3 | Sum of matrix product (precomputation, non-deterministic) | Medium | 
                           
                                 | (c) Backward filter | 
                        
                     
                   
                
               
                     2.3 Profiling Drawbacks
                  
                     					Fig. 2 presents the memory usage of our GPU platform (detailed in Section 4) over time in
                     TensorFlow when the batch sizes (the numbers of processed images in one execution)
                     are 16 and 32 for the Inception-V3 application, which is a widely used image classification
                     model trained with the ImageNet Large Visual Recognition Challenge 2012 datasets.
                     From the experiment, we identified drawbacks in profiling execution. Memory usage
                     overshoots approximately 10 seconds after the start of execution, and the overshoot
                     was more than 70% of the total memory, after which memory usage was constant at less
                     than 40\%.
                     				
                  
                  
                     					We found that the overshoot resulted from memory allocation at the beginning
                     of program execution, i.e., at the profiling phase. The memory required for profiling
                     all the algorithms was allocated initially. The allocated memory for the selected
                     algorithm was used for the future execution, and the other regions were deallocated.
                     More memory for processing more batches cannot be allocated for future runs of the
                     selected algorithm. Therefore, profiling limits the exploitation of higher data-level
                     parallelism.
                     
                     				
                  
                  
                        Fig. 2. Memory usage in GPUs over time for the Inception-V3 application in TensorFlow.
 
                
             
            
                  3. Architecture Proposal
               
                     3.1 Algorithm Selection
                  
                     					Table 2 compares profiling information for minimum execution time while running Inception-V3
                     on batches of 16 and 32. We categorize the results into three cases: $\textit{Case~1}
                     $represents two batches for which the same algorithm was selected; $\textit{Case 2}$
                     represents a situation with the same number of different algorithm selections; $\textit{Case
                     3}$ represents a situation with different numbers of the same algorithm being selected.
                     In this example, $\textit{Case 1} covers 90% of the total selections, and the other
                     cases cover the remaining 10\%.
                     				
                  
                  
                     					In $\textit{Case 1}$, there is no difference between, and no obstacles to storing
                     and reusing, the results, but $\textit{Case 2}$ and $\textit{Case 3}$ require careful
                     approaches. In Fig. 1, which shows the profile information, algorithm 3 was selected for minimum execution
                     time. However, the execution times of algorithm 3 and algorithm 1 are 0.809984 $ms$
                     and 0.816864 $ms$, respectively, a difference of only 0.8%. Between $\textit{Case
                     2}$ and $\textit{Case 3}$, the execution time difference is negligible, and therefore,
                     there is no significant difference in the overall performance when using either algorithm.
                     In the case of algorithms with minimum memory usage, the profiles are always the same.
                     				
                  
                  
                     					Table 3 presents some comparison results of profile information obtained by running
                     the algorithm twice with a batch size of 32. We compared the results to confirm there
                     were no obstacles in storing and reusing the information when there is repeated execution
                     under the same conditions. As a result, most of them belonged to $\textit{Case 1}$,
                     and the remaining belonged to $\textit{Case 3}$, but there was no problem in storing
                     and reusing them, because the performance of the two algorithms does not differ noticeably,
                     as mentioned above. Therefore, the profile information can be saved to the profile
                     record, and the profile record can be reused for four classification criteria: convolution
                     type, input shape, filter shape, and output shape (excluding batch size).
                     
                     
                     				
                  
                  
                        Table 2. Comparison of the profiling results for different batch sizes with minimum execution times.
                     
                           
                              
                                 | CT | Input shape | Filter shape | Output shape | Size 16 batch | Size 32 batch | 
                           
                                 | Selected algorithm | # | Selected algorithm | # | 
                        
                        
                              
                                 | F | Nx73x73x80 | 1x1x64x80 | Nx73x73x64 | IMP_P_GEMM | 2 | IMP_P_GEMM | 2 | 
                           
                                 | F | Nx73x73x64 | 1x1x64x80 | Nx73x73x80 | IMP_P_GEMM | 4 | IMP_P_GEMM | 4 | 
                           
                                 | F | Nx5x5x768 | 1x1x768x128 | Nx5x5x128 | IMP_GEMM | 4 | IMP_GEMM | 4 | 
                           
                                 | F | Nx5x5x128 | 5x5x128x768 | Nx1x1x128 | GEMM | 4 | GEMM | 4 | 
                           
                                 | BI | Nx8x8x448 | 3x3x448x384 | Nx8x8x384 | FFT_TILING | 4 | FFT_TILING | 4 | 
                           
                                 | BI | Nx8x8x384 | 3x1x384x384 | Nx8x8x384 | ALGO_1 | 8 | ALGO_1 | 8 | 
                           
                                 | BI | Nx8x8x384 | 1x3x384x384 | Nx8x8x384 | ALGO_1 | 8 | ALGO_1 | 8 | 
                           
                                 | BI | Nx73x73x80 | 3x3x80x192 | Nx71x71x192 | FFT_TILING | 2 | FFT_TILING | 2 | 
                           
                                 | BF | Nx35x35x48 | 5x5x48x64 | Nx35x35x64 | ALGO_3 | 6 | ALGO_3 | 6 | 
                           
                                 | BF | Nx35x35x288 | 3x3x288x384 | Nx17x17x384 | ALGO_1 | 2 | ALGO_1 | 2 | 
                           
                                 | (a) Case 1 | 
                        
                        
                              
                                 | CT | Input shape | Filter shape | Output shape | Size 16 batch | Size 32 batch | 
                           
                                 | Selected algorithm | # | Selected algorithm | # | 
                        
                        
                              
                                 | F | Nx8x8x2048 | 1x1x2048x192 | Nx8x8x192 | IMP_GEMM | 4 | IMP_P_GEMM | 4 | 
                           
                                 | (b) Case 2 | 
                        
                        
                              
                                 | CT | Input shape | Filter shape | Output shape | Size 16 batch | Size 32 batch | 
                           
                                 | Selected algorithm | # | Selected algorithm | # | 
                        
                        
                              
                                 | BF | Nx35x35x96 | 3x3x96x96 | Nx35x35x96 | ALGO_0 | 15 | ALGO_0 | 9 | 
                           
                                 | ALGO_1 | 5 | ALGO_1 | 0 | 
                           
                                 | ALGO_3 | 4 | ALGO_3 | 15 | 
                           
                                 | BF | Nx35x35x64 | 3x3x64x96 | Nx35x35x96 | ALGO_1 | 5 | ALGO_1 | 7 | 
                           
                                 | ALGO_3 | 3 | ALGO_3 | 1 | 
                           
                                 | (b) Case 3 | 
                        
                     
                     					CT: Convolution type; F: Forward; BI: Backward Input; BF: Backward Filter; #:
                     the number of selections.
                     				
                  
 
                  
                        Table 3. Some comparison results of profile information when running twice in Case 3 and with a backward filter.
                     
                           
                              
                                 | Input shape | Filter shape | Output shape | 1st run | 2nd run | 
                           
                                 | Selected algorithm | # | Selected algorithm | # | 
                        
                        
                              
                                 | 32x8x8x448 | 3x3x448x384 | 32x8x8x384 | ALGO_1 | 3 | ALGO_1 | 4 | 
                           
                                 | ALGO_3 | 1 | ALGO_3 | 0 | 
                           
                                 | 32x17x17x768 | 1x1x768x192 | 32x17x17x192 | ALGO_0 | 9 | ALGO_0 | 8 | 
                           
                                 | ALGO_3 | 15 | ALGO_3 | 16 | 
                        
                     
                     					#: the number of selections.
                     				
                  
 
                
               
                     3.2 Overall Architecture
                  
                     					Algorithm 2 shows our proposed convolution profiling algorithm. For each convolution
                     node, the algorithm examines whether the convolution node was profiled. If the node
                     was not profiled, it checks whether a profile record is available from the previous
                     run. If a profile record is available, we select the algorithm from the profile record.
                     Otherwise, we perform profiling to find the best algorithm, and save the profile information
                     in the profile record file for future runs.
                     				
                  
                  
                        Algorithm 2. Our proposed convolution profiling.
 
                
             
            
                  4. Experimental Evaluation
               
                  				The performance evaluation of our proposed architecture was done on the platform
                  detailed in Table 4. For the evaluation, TensorFlow r1.4 and the Inception-V3 application were used.
                  We trained the application with ImageNet datasets, and used batch sizes of 32, 48,
                  and 56.
                  			
               
               
                  				Fig. 3 compares the performance in terms of seconds per execution, and Fig. 4 does so in terms of images per second, i.e., the throughput between vanilla TensorFlow,
                  TensorFlow with XLA JIT compilation, and our approach with XLA.
                  			
               
               
                  				The execution time of our approach is similar to that of XLA for batch sizes 32
                  and 48. With a batch size of 56, our technique requires 9.8% less execution time than
                  the others. The throughput of our approach is also similar to that of XLA for batch
                  sizes 32 and 48. With the batch size of 56, our technique outperformed the vanilla
                  algorithm by 1.12 times. For the vanilla TensorFlow, throughput increased to 16457,
                  17658, and 18397 when the batch size was 32, 48, and 56, respectively, but it did
                  not reach the maximum throughput of the proposed architecture. With XLA, the throughput
                  increased from 18120 to 20147 as the batch size increased from 32 to 48, but decreased
                  to 18438 for batch size 56. The reason for the decrease in throughput is that the
                  larger batch size implies higher memory usage, and thus, the memory insufficiency
                  prefers an algorithm needing minimum memory usage over minimum execution time. Our
                  proposed architecture eliminates the high memory usage, and the throughput steadily
                  increased: 18179, 19820, and 20511 for batch sizes of 32, 48, and 56, respectively.
                  			
               
               
                  				In order to quantitatively observe the reduced memory usage, Figs. 5 and 6 present
                  the memory usage by GPUs over time for the Inception-V3 application on TensorFlow
                  with XLA JIT compilation and for our architecture with batch sizes of 32, 48, and
                  56. TensorFlow with XLA JIT incurs a memory usage overshoot at the initial execution
                  instance, i.e., the profiling phase. Otherwise, the memory usage of the proposed architecture
                  was constant over time, i.e., our proposed architecture eliminates memory usage overshoot
                  in the profiling phase. Thus, we confirmed that our approach has higher performance
                  than existing schemes, and we can expect our approach to achieve higher performance
                  for larger batch sizes.
                  
                  			
               
               
                     Fig. 3. Performance comparison in terms of seconds per execution. Lower is better.
 
               			F
               			
               
                     Fig. 4. Throughput comparison in terms of images per second. Higher is better.
 
               
                     Fig. 5. The memory usage of GPUs over time for the Inception-V3 application on TensorFlow with XLA JIT compilation.
 
               
                     Fig. 6. The memory usage of GPUs over time for the Inception-V3 application on the proposed architecture.
 
               
                     Table 4. Server specifications for the experiment.
                  
                        
                           
                              | Feature | Description | 
                     
                     
                           
                              | CPU | Intel Xeon CPU E5-1620 v4 @ 3.50 GHz | 
                        
                              | CPU memory | 32 GB | 
                        
                              | GPU | NVidia Titan XP (2 ea) | 
                        
                              | GPU memory | 12 GB per GPU | 
                     
                  
                
             
            
                  5. Conclusion 
               
                  				In this paper, we proposed a novel profiling method to reduce overhead by storing
                  profile information from the first run and reusing it on subsequent runs. During the
                  first run, algorithms that are efficient in terms of memory or runtime in a specific
                  execution environment are determined from among the various algorithms. As a result,
                  in the same execution environment, the same execution result for the algorithms could
                  be expected. In such cases, profiling at every execution is generally not the right
                  way. Also, in some cases, the difference between the results for memory- or runtime-efficient
                  algorithms is minimal. Therefore, our proposed method can eliminate the related overhead
                  by profiling the previous result when the execution environment does not change. We
                  analyzed memory usage over time in detail, and observed that memory overshoots during
                  profiling. The memory overshoot limits data-level parallelism, thus degrading execution
                  throughput.
                  			
               
               
                  				Based on this knowledge, we devised a new profiling method to eliminate the overshoot
                  and improve performance. Using Inception-V3 trained with ImageNet datasets, we achieved
                  higher throughput than vanilla TensorFlow and TensorFlow with XLA JIT compilation
                  by up to 1.12 times and 1.11 times, respectively, without losing accuracy.
                  
                  
                  			
               
             
          
         
            
                  ACKNOWLEDGMENTS
               
                  				This work was partially supported by SK Telecom Co., LTD.
                  			
               
             
            
                  
                     REFERENCES
                  
                     
                        
                        Abadi M., et al. , 2016, Tensorflow: A System for Large-scale Machine Learning, in
                           Proc. of OSDI'16, berkeley, ca, usa, pp. 265-283

 
                     
                        
                        Jia Y., et al. , 2014, Caffe: Convolutional Architecture for Fast Feature Embedding,
                           in Proc. of MM'14, new york, ny, usa, acm, pp. 675-678

 
                     
                        
                        Bastien F., et al. , 2012, Theano: New Features and Speed Improvements, CoRR

 
                     
                        
                        Al-Rfou R., et al. , 2016, Theano: A Python Framework for Fast Computation of Mathematical
                           Expressions, CoRR

 
                     
                        
                        2015, Keras

 
                     
                        
                        Seide F., Agarwal A., 2016, CNTK: Microsoft's Open-source Deep-learning Toolkit, in
                           Proc. of KDD'16, New York, NY, USA, ACM, pp. 2135-2135

 
                     
                        
                        Collobert R., Bengio S., Mariéthoz J., 2002, Torch: A Modular Machine Learning Software
                           Library, Technical Report, IDIAP

 
                     
                        
                        2017, PyTorch., 

 
                     
                        
                        Salazar N. D., et al. , 2018, Application of Transfer Learning for Object Recognition
                           Using Convolutional Neural Networks, in Proc. of 2018 IEEE Colombian Conference on
                           Applications in Computational Intelligence, Medellin, Colombia, pp. 14-25

 
                     
                        
                        Spanhol F. A., et al. , 2016, Breast Cancer Histopathological Image Classification
                           Using Convolutional Neural Networks, in Proc. of IJCNN'16, Vancouver, BC, Canada,
                           pp. 2560-2567

 
                     
                        
                        Siniscalchi S. M., Salerno V. M., 2016, Adaptation to New Microphones Using Artificial
                           Neural Networks with Trainable Activation Functions, IEEE Transactions on Neural Networks
                           and Learning Systems, Vol. 28, No. 8, pp. 1959-1965

 
                     
                        
                        Hautamäki V., et al. , 2015, Boosting Universal Speech Attributes Classification with
                           Deep Neural Network for Foreign Accent Characterization, in Proc. of INTERSPEECH'15,
                           Dresden, Germany, pp. 408-412

 
                     
                        
                        Garzón-Alfonso C. C., Rodríguez-Martínez M., 2018, Twitter Health Surveillance (THS)
                           System, in Proc. of 2018 IEEE International Conference on Big Data, Seattle, W,A USA,
                           pp. 1647-1654

 
                     
                        
                        Ravì D., et al. , 2016, Deep Learning for Health Informatics, IEEE Journal of Biomedical
                           and Health Informatics, Vol. 21, No. 1, pp. 4-21

 
                     
                        
                        Salerno V., Rabbeni G., 2018, An Extreme Learning Machine Approach to Effective Energy
                           Disaggregation, Electronics, Vol. 7, No. 10, pp. 235

 
                     
                        
                        Zhang C., et al. , 2018, Sequence-to-point Learning with Neural Networks for Non-intrusive
                           Load Monitoring, in Proc. of AAAI'18, New Orleans, LA, USA, pp. 2604-2611

 
                     
                        
                        He K., Zhang X., Ren S., Sun J., 2015, Delving Deep into Rectifiers: Surpassing Human-level
                           Performance on Imagenet Classification, in Proc. of ICCV'15, Santiago, Chile, pp. 1026-1034

 
                     
                        
                        Dodge S. F., Karam L. J., 2017, A Study and Comparison of Human and Deep Learning
                           Recognition Performance Under Visual Distortions, in Proc. of ICCCN'17, Vancouver,
                           BC, Canada, pp. 1-7

 
                     
                        
                        He K., Zhang X., Ren S., Sun J., 2016, Deep Residual Learning for Image Recognition,
                           in Proc. of CVPR'16, Las Vegas, NV, USA, pp. 770-778

 
                     
                        
                        Vasilache N., et al. , 2014, Fast Convolutional Nets with fbfft: A GPU Performance
                           Evaluation, CoRR

 
                     
                        
                        Winograd S., 1980, Arithmetic Complexity of Computations, Society for Industrial and
                           Applied Mathematics

 
                     
                        
                        Lavin A., Gray S., 2016, Fast Algorithms for Convolutional Neural Networks, in Proc.
                           of CVPR'16, Las Vegas, NV, USA, pp. 4013-4021

 
                     
                        
                        Chetlur S., et al. , 2014, cuDNN: Efficient Primitives for Deep Learning, CoRR

 
                     
                        
                        Deng J., et al. , 2009, ImageNet: A Large-scale Hierarchical Image Database, in Proc.
                           of CVPR'09, Miami, FL, USA, pp. 248-255

 
                     
                        
                        Rhu M., et al. , 2016, vDNN: Virtualized Deep Neural Networks for Scalable, Memory-efficient
                           Neural Network Design, in Proc. of MICRO'16, Taipei, Taiwan, pp. 1-13

 
                   
                
             
            Author
             
             
             
            
            
               			Minseong Kim received a BSc in electrical engineering and a PhD in electronics,
               electrical and computer engineering from Korea University in 2011 and 2018, respectively.
               He is currently with SK hynix, South Korea. His research interests include compiler
               construction, microarchitectures, memory systems, and parallel and distributed architectural
               simulator designs.
               		
            
            
            
               			Kyu Hyun Choi received a PhD in electronics, electrical and computer engineering
               from Korea University in 2018. He is currently a senior engineer with SK hynix, South
               Korea. His research interests include microarchitectures, GPUs, parallel computing,
               memory system design, and processing in memory.
               		
            
            
            
               			Yoonah Paik received her B.Eng. in Electrical Engineering from Korea University,
               Seoul, Korea, in 2015, and she is currently working toward an integrated MSc and PhD
               in the Compiler and Microarchitecture Lab at Korea University. Her current research
               interests include microarchitectures, memory system design, and system software.
               		
            
            
            
               			Seon Wook Kim received his BSc in Electronics and Computer Engineering from Korea
               University in 1988, an MSc in Electrical Engineering from Ohio State University in
               1990, and a PhD in Electrical and Computer Engineering from Purdue University in 2001.
               He was a senior researcher at the Agency for Defense Development (1990-1995) and was
               a staff software engineer at Intel/KSL (2001-2002). Currently, he is a professor for
               the School of Electrical and Computer Engineering at Korea University. His research
               interests include compiler construction, microarchitectures, system optimization,
               and SoC design. He is a senior member of ACM and IEEE.