Minseong Kim 1,2
Kyu Hyun Choi 1,2
Yoonah Paik 1
Seon Wook Kim 1

1 School of Electrical and Computer Engineering, Korea University / 145 Anam-ro, Seongbuk-gu, Seoul 02841, Korea
{minseong, nurlonn, yoonpaik, seon}@korea.ac.kr

2 SK Hynix Inc. / 2091 Gyeongchung-daero, Bubal-eup, Icheon-si, Gyeonggi-do 17336, Korea.
This work was done when Minseong Kim and Kyu Hyun Choi were with Korea University.
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
TensorFlow, Profiling, Optimization, Batch
1. Introduction
TensorFlow [1] is one of the most well-known deep learning frameworks, providing high-level APIs
for defining, training, and running inference with deep learning models on various machines,
such as multicores and GPUs. TensorFlow performs numerical computations based on data
flow graphs composed of TensorFlow operations. In the past few years, image
recognition and classification studies have made tremendous progress thanks to the collaborative
development of frameworks such as TensorFlow, Caffe [2], Theano [3,4], Keras [5], CNTK [6], Torch [7], and PyTorch [8]. In addition, deep learning techniques have been actively studied and deployed in various
application fields, such as image recognition [9,10], speech recognition [11,12], public health [13,14], load monitoring [15,16], and so on. In particular, the convolutional neural network (CNN), one of the deep
learning models, achieves higher performance on image recognition tasks than humans
in some domains [17-19].
Among the various CNN operations, convolution is the key component, and many variant
algorithms for it are currently available [20-22]. Because the performance of these algorithms depends on the hardware configuration, input
characteristics, and model behavior, TensorFlow profiles the candidates and selects the best
algorithm for each convolution operation to achieve optimal performance. However, the
profiling overhead is considerable, because all available algorithms are executed for
each convolution node, and this examination incurs overhead in both execution time and
memory usage. In addition, profiling is performed every time the application is launched;
thus, redundant profiling might be executed.
In this paper, we present a novel profiling method that reduces this overhead by storing
and reusing the profiling results. We conducted a comprehensive study of TensorFlow
and analyzed its memory usage over time, which limits data parallelism and prevents the
framework from delivering maximum performance. We observed that memory usage overshoots
during the profiling phase, and we found an opportunity to eliminate this overshoot by
recording profiling results and reusing them without losing accuracy. Accordingly, we
constructed a novel profiling framework that stores profiling results during the first run
and reuses them from the second run on, regardless of the detailed TensorFlow
configuration, on NVidia GPUs using cuDNN [23], a GPU-accelerated library of primitives for deep neural networks. In our
measurements, we achieved up to 1.12 times and 1.11 times higher throughput
than vanilla TensorFlow and TensorFlow with XLA JIT compilation, respectively, using
Inception-V3, a widely used image classification model trained with
ImageNet datasets [24].
The rest of the paper is organized as follows. In Section 2, we describe the background
to and motivation for our study. In Section 3, we introduce the overall architecture
of our proposed profiling method. We evaluate the performance of the method (in terms
of throughput) in Section 4, along with its memory usage over time. The paper concludes
in Section 5.
2. TensorFlow and Its Profiling Execution
2.1 TensorFlow
TensorFlow is an open-source software library developed by Google that performs numerical
computations based on data flow graphs to support machine learning and deep neural network
research [1]. TensorFlow provides high-level APIs that make it easy to construct a neural network
with high flexibility and to deploy computations on CPUs or GPUs in desktops,
servers, or mobile devices. The basic unit of TensorFlow is a computational graph
in which each node represents an operation and each edge, called a tensor, represents
data. The data comprise a set of primitive values shaped into an array of arbitrary dimension.
A computation node takes zero or more tensors as input and produces a tensor as
output.
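As a minimal illustration, the sketch below builds such a graph with a single convolution node using the TensorFlow 1.x API (the version used in this paper); the shapes and values are arbitrary and chosen only for demonstration.

```python
import numpy as np
import tensorflow as tf

# Edges (tensors): a 4-D input and a 4-D filter.
x = tf.placeholder(tf.float32, [None, 299, 299, 3])   # batch x height x width x channels
w = tf.Variable(tf.random_normal([3, 3, 3, 32]))      # filter: kH x kW x in_ch x out_ch

# Node (operation): a 2-D convolution that consumes two tensors and produces one.
y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="SAME")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(y, feed_dict={x: np.zeros([1, 299, 299, 3], np.float32)})
    print(out.shape)  # (1, 299, 299, 32)
```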
The CNN is the current state-of-the-art image classification architecture, and
it applies a series of convolution filters to raw images in order to extract and learn
higher-level features. The CNN receives input and transforms it through a series of
hidden layers composed of convolutional, pooling, and dense fully connected layers.
The convolutional layer is the core of the CNN and consumes most of the computational
capacity among all the layers. TensorFlow employs several convolution algorithms
and selects one of them to achieve optimal performance in terms of memory usage and
execution time with respect to the input parameters. The selection is performed in
the profiling phase.
2.2 Profiling
Algorithm 1 presents TensorFlow's convolution profiling algorithm. When a convolution
node is triggered, TensorFlow examines whether the node has already been profiled. If it
has, the previously selected algorithms are used. Otherwise, profiling is performed
to find the best algorithms for the node.
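The control flow described above can be sketched as follows. This is a simplified, self-contained Python rendering of Algorithm 1, not TensorFlow's actual internal code; autotune_map, run_all_algorithms, and Result are illustrative names, and the measured values are dummies.

```python
from collections import namedtuple

# Simplified sketch of TensorFlow's per-node convolution profiling (Algorithm 1).
Result = namedtuple("Result", ["algo", "elapsed_ms", "workspace_bytes"])
autotune_map = {}  # node parameters -> (fastest_algo, lowest_memory_algo)

def run_all_algorithms(params):
    # Placeholder: in TensorFlow this executes every available cuDNN algorithm once
    # and measures its time and workspace size. Dummy numbers are used here.
    return [Result("ALGO_0", 1.2, 4 << 20),
            Result("ALGO_1", 0.9, 8 << 20),
            Result("FFT",    0.7, 64 << 20)]

def select_conv_algorithms(params):
    if params in autotune_map:                    # node already profiled in this run
        return autotune_map[params]
    results  = run_all_algorithms(params)         # profile: run every candidate
    fastest  = min(results, key=lambda r: r.elapsed_ms)
    smallest = min(results, key=lambda r: r.workspace_bytes)
    autotune_map[params] = (fastest.algo, smallest.algo)
    return autotune_map[params]
```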
Profiling enhances performance in two ways: it identifies the algorithm that minimizes
memory usage and the one that minimizes execution time. To find these algorithms,
all available convolution algorithms for the given convolution type (forward, backward
input, or backward filter) are executed. Table 1 shows the algorithms available
in TensorFlow with NVidia GPUs. For the forward convolution type, the IMPLICIT_GEMM,
IMPLICIT_PRECOMP_GEMM, GEMM, and FFT algorithms are available. All of these algorithms
perform a matrix-to-matrix product, but their memory usage and execution times differ.
An implicit algorithm like IMPLICIT_GEMM requires less memory but executes longer
than explicit ones such as GEMM and FFT [23,25].
Fig. 1 shows an example record from profiling. The first four lines represent the convolution
node's ID (name: starting address), an applied algorithm number, and its execution
time in ms. Line 5 shows the convolution type; in this example, the type is backward filter,
and the algorithm numbers on lines 1 to 4 correspond to the algorithms in Table 1, in order. In other words, algorithm numbers 0, 1, 2, and 3 denote ALGO_0,
ALGO_1, FFT, and ALGO_3, respectively. The fields
of the input and output shapes in lines 6 and 8 give the data type
(F32: 32-bit floating point), batch size, image width, image height, and image channels.
For the filter shape in line 7, the fields are the filter width, filter height, input
channels, and output channels. Finally, the last line shows the two algorithms selected
by the profile: the first (3) achieves the minimum execution time, and the second
(0) requires the least memory.
Basically, the profiled information includes convolution type, input shape, filter
shape, and output shape. We show in Section 3.1 that the batch size has no significant
effect on algorithm selection, so we can use the profiled information from the first
run for future runs.
Algorithm 1. TensorFlow’s convolution profiling.
Fig. 1. Example record from TensorFlow profiling.
Table 1. Applicable convolution algorithms according to convolution type [23].
(a) Forward
Algorithm | Meaning | Memory usage
IMPLICIT_GEMM | Matrix product | Low (no extra memory workspace)
IMPLICIT_PRECOMP_GEMM | Matrix product | Medium (needs some memory workspace)
GEMM | Explicit matrix product | High (needs significant memory workspace)
FFT | Fast Fourier transform approach | High (stores intermediate results of FFT)
FFT_TILING | Fast Fourier transform approach, but splits the input into 32x32 tiles | Medium (needs significantly less than FFT for large images)

(b) Backward input
Algorithm | Meaning | Memory usage
ALGO_0 | Sum of matrix products (non-deterministic) | Medium
ALGO_1 | Sum of matrix products (deterministic) | Medium
FFT | Fast Fourier transform approach (deterministic) | High (stores intermediate results of FFT)
FFT_TILING | Fast Fourier transform approach, but splits the input into 32x32 tiles (deterministic) | Medium (needs significantly less than FFT for large images)

(c) Backward filter
Algorithm | Meaning | Memory usage
ALGO_0 | Sum of matrix products (non-deterministic) | Medium
ALGO_1 | Sum of matrix products (deterministic) | Medium
FFT | Fast Fourier transform approach (deterministic) | High (stores intermediate results of FFT)
ALGO_3 | Sum of matrix products (precomputation, non-deterministic) | Medium
2.3 Profiling Drawbacks
Fig. 2 presents the memory usage of our GPU platform (detailed in Section 4) over time in
TensorFlow when the batch size (the number of images processed in one execution)
is 16 or 32 for the Inception-V3 application, a widely used image classification
model trained with the ImageNet Large Scale Visual Recognition Challenge 2012 datasets.
From this experiment, we identified drawbacks in the profiling execution. Memory usage
overshoots approximately 10 seconds after the start of execution, the overshoot exceeds
70% of the total memory, and afterwards memory usage stays constant at less than 40%.
We found that the overshoot results from memory allocation at the beginning of program
execution, i.e., in the profiling phase. The memory required to profile all the algorithms
is allocated up front. The memory allocated for the selected algorithm is then used for
the remaining execution, and the other regions are deallocated. Because the profiling phase
must hold all candidate algorithms in memory at once, no additional memory can be allocated
to process larger batches with the selected algorithm. Therefore, profiling limits the
exploitation of higher data-level parallelism.
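Memory-usage traces like the one in Fig. 2 (and in Figs. 5 and 6 later) can be collected with a simple polling loop. The sketch below is one way to do it, assuming an NVIDIA driver with the standard nvidia-smi query options; it samples the used device memory once per second.

```python
import subprocess
import time

# Poll used GPU memory with nvidia-smi's standard query flags
# (assumes an NVIDIA driver is installed); returns one value in MiB per GPU.
def sample_gpu_memory():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"]).decode()
    return [int(v) for v in out.split() if v.strip()]

if __name__ == "__main__":
    start = time.time()
    for _ in range(120):                      # sample for about two minutes
        print("%6.1f s  %s MiB" % (time.time() - start, sample_gpu_memory()))
        time.sleep(1)
```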
Fig. 2. Memory usage in GPUs over time for the Inception-V3 application in TensorFlow.
3. Architecture Proposal
3.1 Algorithm Selection
Table 2 compares the profiling information for minimum execution time while running Inception-V3
with batch sizes of 16 and 32. We categorize the results into three cases: Case 1
represents nodes for which both batch sizes selected the same algorithm; Case 2
represents nodes for which different algorithms were selected the same number of times;
Case 3 represents nodes for which the same algorithms were selected but a different number
of times. In this example, Case 1 covers 90% of the total selections, and the other
cases cover the remaining 10%.
In Case 1, there is no difference between the two runs and hence no obstacle to storing
and reusing the results, but Case 2 and Case 3 require a more careful look. In the profile
information of Fig. 1, algorithm 3 was selected for minimum execution time. However, the
execution times of algorithm 3 and algorithm 1 are 0.809984 ms and 0.816864 ms,
respectively, a difference of only 0.8%. In Case 2 and Case 3, the execution time
difference between the competing algorithms is similarly negligible, and therefore,
there is no significant difference in overall performance whichever algorithm is used.
For the algorithms selected for minimum memory usage, the profiles are always the same.
Table 3 presents some comparison results of profile information obtained by running
the application twice with a batch size of 32. We compared the results to confirm there
are no obstacles to storing and reusing the information when execution is repeated
under the same conditions. Most of the nodes belonged to Case 1,
and the remainder belonged to Case 3, but there was no problem in storing
and reusing them, because the performance of the two algorithms does not differ noticeably,
as mentioned above. Therefore, the profile information can be saved to the profile
record, and the profile record can be reused, keyed by four classification criteria: convolution
type, input shape, filter shape, and output shape (excluding batch size).
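Concretely, the profile-record key can be formed as in the following sketch, where the batch dimension is masked out of the input and output shapes before lookup; record_key is an illustrative name, and the shapes come from the first row of Table 2.

```python
# Illustrative sketch of the profile-record key: convolution type, input shape,
# filter shape, and output shape, with the batch dimension excluded.
def record_key(conv_type, input_shape, filter_shape, output_shape):
    # input_shape / output_shape are (batch, height, width, channels); the batch
    # dimension is replaced with a wildcard so that runs with different batch
    # sizes map to the same record.
    return (conv_type,
            ("N",) + tuple(input_shape[1:]),
            tuple(filter_shape),
            ("N",) + tuple(output_shape[1:]))

# Example: a forward convolution profiled with batch 16 and reused with batch 32.
k16 = record_key("forward", (16, 73, 73, 80), (1, 1, 64, 80), (16, 73, 73, 64))
k32 = record_key("forward", (32, 73, 73, 80), (1, 1, 64, 80), (32, 73, 73, 64))
assert k16 == k32
```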
Table 2. Comparison of the profiling results for different batch sizes with minimum execution times.
(a) Case 1
CT | Input shape | Filter shape | Output shape | Batch 16: selected algorithm (#) | Batch 32: selected algorithm (#)
F | Nx73x73x80 | 1x1x64x80 | Nx73x73x64 | IMP_P_GEMM (2) | IMP_P_GEMM (2)
F | Nx73x73x64 | 1x1x64x80 | Nx73x73x80 | IMP_P_GEMM (4) | IMP_P_GEMM (4)
F | Nx5x5x768 | 1x1x768x128 | Nx5x5x128 | IMP_GEMM (4) | IMP_GEMM (4)
F | Nx5x5x128 | 5x5x128x768 | Nx1x1x128 | GEMM (4) | GEMM (4)
BI | Nx8x8x448 | 3x3x448x384 | Nx8x8x384 | FFT_TILING (4) | FFT_TILING (4)
BI | Nx8x8x384 | 3x1x384x384 | Nx8x8x384 | ALGO_1 (8) | ALGO_1 (8)
BI | Nx8x8x384 | 1x3x384x384 | Nx8x8x384 | ALGO_1 (8) | ALGO_1 (8)
BI | Nx73x73x80 | 3x3x80x192 | Nx71x71x192 | FFT_TILING (2) | FFT_TILING (2)
BF | Nx35x35x48 | 5x5x48x64 | Nx35x35x64 | ALGO_3 (6) | ALGO_3 (6)
BF | Nx35x35x288 | 3x3x288x384 | Nx17x17x384 | ALGO_1 (2) | ALGO_1 (2)

(b) Case 2
CT | Input shape | Filter shape | Output shape | Batch 16: selected algorithm (#) | Batch 32: selected algorithm (#)
F | Nx8x8x2048 | 1x1x2048x192 | Nx8x8x192 | IMP_GEMM (4) | IMP_P_GEMM (4)

(c) Case 3
CT | Input shape | Filter shape | Output shape | Batch 16: selected algorithm (#) | Batch 32: selected algorithm (#)
BF | Nx35x35x96 | 3x3x96x96 | Nx35x35x96 | ALGO_0 (15), ALGO_1 (5), ALGO_3 (4) | ALGO_0 (9), ALGO_1 (0), ALGO_3 (15)
BF | Nx35x35x64 | 3x3x64x96 | Nx35x35x96 | ALGO_1 (5), ALGO_3 (3) | ALGO_1 (7), ALGO_3 (1)

CT: convolution type; F: forward; BI: backward input; BF: backward filter; #: the number of selections.
Table 3. Some comparison results of profile information when running twice in Case 3 and with a backward filter.
Input shape | Filter shape | Output shape | 1st run: selected algorithm (#) | 2nd run: selected algorithm (#)
32x8x8x448 | 3x3x448x384 | 32x8x8x384 | ALGO_1 (3), ALGO_3 (1) | ALGO_1 (4), ALGO_3 (0)
32x17x17x768 | 1x1x768x192 | 32x17x17x192 | ALGO_0 (9), ALGO_3 (15) | ALGO_0 (8), ALGO_3 (16)

#: the number of selections.
3.2 Overall Architecture
Algorithm 2 shows our proposed convolution profiling algorithm. For each convolution
node, the algorithm examines whether the node has already been profiled. If it has not,
it checks whether a profile record is available from a previous run. If a profile record
is available, we select the algorithms from the record. Otherwise, we perform profiling
to find the best algorithms, and save the profile information in the profile record file
for future runs.
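A minimal sketch of this flow is given below. It is illustrative Python, not the actual TensorFlow patch; PROFILE_RECORD is an assumed file name, and it reuses the hypothetical record_key() and run_all_algorithms() helpers sketched in Sections 2.2 and 3.1.

```python
import json
import os

# Illustrative sketch of the proposed flow (Algorithm 2): consult an on-disk
# profile record before falling back to full profiling.
PROFILE_RECORD = "conv_profile_record.json"
profile_record = json.load(open(PROFILE_RECORD)) if os.path.exists(PROFILE_RECORD) else {}
autotune_map = {}  # in-memory cache for the current run

def select_conv_algorithms(params):
    if params in autotune_map:                    # already profiled in this run
        return autotune_map[params]
    key = str(record_key(*params))                # conv type + shapes, batch excluded
    if key in profile_record:                     # reuse the record from a previous run
        autotune_map[params] = tuple(profile_record[key])
    else:                                         # first run: profile and save
        results  = run_all_algorithms(params)
        fastest  = min(results, key=lambda r: r.elapsed_ms)
        smallest = min(results, key=lambda r: r.workspace_bytes)
        autotune_map[params] = (fastest.algo, smallest.algo)
        profile_record[key] = list(autotune_map[params])
        with open(PROFILE_RECORD, "w") as f:
            json.dump(profile_record, f)
    return autotune_map[params]
```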
Algorithm 2. Our proposed convolution profiling.
4. Experimental Evaluation
The performance evaluation of our proposed architecture was done on the platform
detailed in Table 4. For the evaluation, TensorFlow r1.4 and the Inception-V3 application were used.
We trained the application with ImageNet datasets, and used batch sizes of 32, 48,
and 56.
Fig. 3 compares performance in terms of seconds per execution, and Fig. 4 compares throughput
in terms of images per second, among vanilla TensorFlow, TensorFlow with XLA JIT compilation,
and our approach with XLA.
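For reference, XLA JIT compilation can be enabled at the session level in TensorFlow 1.x through the standard ConfigProto option; the sketch below shows only this configuration step, not the full training script.

```python
import tensorflow as tf

# Enable XLA JIT compilation at the session level (TensorFlow 1.x API).
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

# The session then JIT-compiles eligible subgraphs; graph construction and the
# input pipeline are unchanged.
sess = tf.Session(config=config)
```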
The execution time of our approach is similar to that of XLA for batch sizes 32
and 48. With a batch size of 56, our technique requires 9.8% less execution time than
the others. The throughput of our approach is also similar to that of XLA for batch
sizes 32 and 48. With the batch size of 56, our technique outperformed the vanilla
algorithm by 1.12 times. For vanilla TensorFlow, throughput increased to 16457,
17658, and 18397 as the batch size grew to 32, 48, and 56, respectively, but it did
not reach the maximum throughput of the proposed architecture. With XLA, the throughput
increased from 18120 to 20147 as the batch size increased from 32 to 48, but decreased
to 18438 for batch size 56. The throughput decreases because a larger batch size implies
higher memory usage, and under memory pressure the algorithm requiring minimum memory
is preferred over the one with minimum execution time. Our proposed architecture eliminates
the high memory usage, and its throughput increased steadily: 18179, 19820, and 20511
for batch sizes of 32, 48, and 56, respectively.
In order to observe the reduced memory usage quantitatively, Figs. 5 and 6 present
the memory usage of the GPUs over time for the Inception-V3 application on TensorFlow
with XLA JIT compilation and on our architecture, with batch sizes of 32, 48, and
56. TensorFlow with XLA JIT incurs a memory usage overshoot at the beginning of execution,
i.e., in the profiling phase. In contrast, the memory usage of the proposed architecture
remains constant over time, i.e., our proposed architecture eliminates the memory usage
overshoot of the profiling phase. Thus, we confirmed that our approach achieves higher
performance than existing schemes, and we can expect it to achieve even higher performance
for larger batch sizes.
Fig. 3. Performance comparison in terms of seconds per execution. Lower is better.
Fig. 4. Throughput comparison in terms of images per second. Higher is better.
Fig. 5. The memory usage of GPUs over time for the Inception-V3 application on TensorFlow with XLA JIT compilation.
Fig. 6. The memory usage of GPUs over time for the Inception-V3 application on the proposed architecture.
Table 4. Server specifications for the experiment.
Feature | Description
CPU | Intel Xeon CPU E5-1620 v4 @ 3.50 GHz
CPU memory | 32 GB
GPU | NVidia Titan XP (2 ea)
GPU memory | 12 GB per GPU
5. Conclusion
In this paper, we proposed a novel profiling method that reduces overhead by storing
profile information from the first run and reusing it on subsequent runs. During the
first run, the algorithms that are efficient in terms of memory or runtime in a specific
execution environment are determined from among the available candidates. As a result,
the same selections can be expected in the same execution environment, so profiling at
every execution is generally unnecessary. Moreover, in some cases, the difference between
the results of the memory- and runtime-efficient algorithms is minimal. Therefore, our
proposed method can eliminate the related overhead by reusing the previous profiling
result when the execution environment does not change. We analyzed memory usage over
time in detail and observed that memory usage overshoots during profiling. This overshoot
limits data-level parallelism, thus degrading execution throughput.
Based on this knowledge, we devised a new profiling method that eliminates the overshoot
and improves performance. Using Inception-V3 trained with ImageNet datasets, we achieved
higher throughput than vanilla TensorFlow and TensorFlow with XLA JIT compilation
by up to 1.12 times and 1.11 times, respectively, without losing accuracy.
ACKNOWLEDGMENTS
This work was partially supported by SK Telecom Co., LTD.
REFERENCES
Abadi M., et al., 2016, TensorFlow: A System for Large-scale Machine Learning, in
Proc. of OSDI'16, Berkeley, CA, USA, pp. 265-283
Jia Y., et al., 2014, Caffe: Convolutional Architecture for Fast Feature Embedding,
in Proc. of MM'14, New York, NY, USA, ACM, pp. 675-678
Bastien F., et al. , 2012, Theano: New Features and Speed Improvements, CoRR
Al-Rfou R., et al. , 2016, Theano: A Python Framework for Fast Computation of Mathematical
Expressions, CoRR
2015, Keras
Seide F., Agarwal A., 2016, CNTK: Microsoft's Open-source Deep-learning Toolkit, in
Proc. of KDD'16, New York, NY, USA, ACM, pp. 2135-2135
Collobert R., Bengio S., Mariéthoz J., 2002, Torch: A Modular Machine Learning Software
Library, Technical Report, IDIAP
2017, PyTorch
Salazar N. D., et al. , 2018, Application of Transfer Learning for Object Recognition
Using Convolutional Neural Networks, in Proc. of 2018 IEEE Colombian Conference on
Applications in Computational Intelligence, Medellin, Colombia, pp. 14-25
Spanhol F. A., et al. , 2016, Breast Cancer Histopathological Image Classification
Using Convolutional Neural Networks, in Proc. of IJCNN'16, Vancouver, BC, Canada,
pp. 2560-2567
Siniscalchi S. M., Salerno V. M., 2016, Adaptation to New Microphones Using Artificial
Neural Networks with Trainable Activation Functions, IEEE Transactions on Neural Networks
and Learning Systems, Vol. 28, No. 8, pp. 1959-1965
Hautamäki V., et al. , 2015, Boosting Universal Speech Attributes Classification with
Deep Neural Network for Foreign Accent Characterization, in Proc. of INTERSPEECH'15,
Dresden, Germany, pp. 408-412
Garzón-Alfonso C. C., Rodríguez-Martínez M., 2018, Twitter Health Surveillance (THS)
System, in Proc. of 2018 IEEE International Conference on Big Data, Seattle, WA, USA,
pp. 1647-1654
Ravì D., et al. , 2016, Deep Learning for Health Informatics, IEEE Journal of Biomedical
and Health Informatics, Vol. 21, No. 1, pp. 4-21
Salerno V., Rabbeni G., 2018, An Extreme Learning Machine Approach to Effective Energy
Disaggregation, Electronics, Vol. 7, No. 10, pp. 235
Zhang C., et al. , 2018, Sequence-to-point Learning with Neural Networks for Non-intrusive
Load Monitoring, in Proc. of AAAI'18, New Orleans, LA, USA, pp. 2604-2611
He K., Zhang X., Ren S., Sun J., 2015, Delving Deep into Rectifiers: Surpassing Human-level
Performance on Imagenet Classification, in Proc. of ICCV'15, Santiago, Chile, pp. 1026-1034
Dodge S. F., Karam L. J., 2017, A Study and Comparison of Human and Deep Learning
Recognition Performance Under Visual Distortions, in Proc. of ICCCN'17, Vancouver,
BC, Canada, pp. 1-7
He K., Zhang X., Ren S., Sun J., 2016, Deep Residual Learning for Image Recognition,
in Proc. of CVPR'16, Las Vegas, NV, USA, pp. 770-778
Vasilache N., et al. , 2014, Fast Convolutional Nets with fbfft: A GPU Performance
Evaluation, CoRR
Winograd S., 1980, Arithmetic Complexity of Computations, Society for Industrial and
Applied Mathematics
Lavin A., Gray S., 2016, Fast Algorithms for Convolutional Neural Networks, in Proc.
of CVPR'16, Las Vegas, NV, USA, pp. 4013-4021
Chetlur S., et al. , 2014, cuDNN: Efficient Primitives for Deep Learning, CoRR
Deng J., et al. , 2009, ImageNet: A Large-scale Hierarchical Image Database, in Proc.
of CVPR'09, Miami, FL, USA, pp. 248-255
Rhu M., et al. , 2016, vDNN: Virtualized Deep Neural Networks for Scalable, Memory-efficient
Neural Network Design, in Proc. of MICRO'16, Taipei, Taiwan, pp. 1-13
Author
Minseong Kim received a BSc in electrical engineering and a PhD in electronics,
electrical and computer engineering from Korea University in 2011 and 2018, respectively.
He is currently with SK hynix, South Korea. His research interests include compiler
construction, microarchitectures, memory systems, and parallel and distributed architectural
simulator designs.
Kyu Hyun Choi received a PhD in electronics, electrical and computer engineering
from Korea University in 2018. He is currently a senior engineer with SK hynix, South
Korea. His research interests include microarchitectures, GPUs, parallel computing,
memory system design, and processing in memory.
Yoonah Paik received her B.Eng. in Electrical Engineering from Korea University,
Seoul, Korea, in 2015, and she is currently working toward an integrated MSc and PhD
in the Compiler and Microarchitecture Lab at Korea University. Her current research
interests include microarchitectures, memory system design, and system software.
Seon Wook Kim received his BSc in Electronics and Computer Engineering from Korea
University in 1988, an MSc in Electrical Engineering from Ohio State University in
1990, and a PhD in Electrical and Computer Engineering from Purdue University in 2001.
He was a senior researcher at the Agency for Defense Development (1990-1995) and was
a staff software engineer at Intel/KSL (2001-2002). Currently, he is a professor for
the School of Electrical and Computer Engineering at Korea University. His research
interests include compiler construction, microarchitectures, system optimization,
and SoC design. He is a senior member of ACM and IEEE.