Minseong Kim 1,2
Kyu Hyun Choi 1,2
Yoonah Paik 1
Seon Wook Kim 1

1 School of Electrical and Computer Engineering, Korea University / 145 Anam-ro, Seongbuk-gu, Seoul 02841, Korea
{minseong, nurlonn, yoonpaik, seon}@korea.ac.kr

2 SK Hynix Inc. / 2091 Gyeongchung-daero, Bubal-eup, Icheon-si, Gyeonggi-do 17336, Korea.
This work was done when Minseong Kim and Kyu Hyun Choi were with Korea University.
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
TensorFlow, Profiling, Optimization, Batch
1. Introduction
TensorFlow [1] is one of the most well-known deep learning frameworks, providing high-level APIs
for defining, training, and running inference with deep learning models on various machines,
such as multicores and GPUs. TensorFlow performs numerical computations based on data
flow graphs composed of TensorFlow operations. In the past few years, image
recognition and classification studies have made tremendous progress thanks to the collaborative
development of frameworks such as TensorFlow, Caffe [2], Theano [3,4], Keras [5], CNTK [6], Torch [7], and PyTorch [8]. In addition, deep learning techniques have been actively studied and deployed in various
application fields, such as image recognition [9,10], speech recognition [11,12], public health [13,14], load monitoring [15,16], and so on. In particular, the convolutional neural network (CNN), one of the deep
learning models, achieves higher performance on image recognition tasks than humans
in some domains [17-19].
Among the various CNN operations, convolution is the key component, and many variant
algorithms for it are currently available [20-22]. Because the performance of these algorithms depends on the hardware configuration, input
characteristics, and model behavior, TensorFlow profiles the candidates and selects the best
algorithm for each convolution operation to achieve optimal performance. However, the
profiling overhead is considerable, because all available algorithms are executed for
each convolution node, and this examination incurs overhead in both execution time and
memory usage. In addition, profiling is performed every time the application is launched;
thus, redundant profiling might be executed.
In this paper, we present a novel profiling method that reduces this overhead by storing
and reusing the profiling results. We conducted a comprehensive study of TensorFlow
and analyzed its memory usage over time, which limits data parallelism and prevents the
framework from delivering maximum performance. We observed that memory usage overshoots
during the profiling phase, and we found an opportunity to eliminate this overshoot by
recording profiling results and reusing them without losing accuracy. Accordingly, we
constructed a novel profiling framework that stores profiling results during the first run
and reuses them from the second run on, regardless of the detailed TensorFlow
configuration, on NVidia GPUs using cuDNN [23], a GPU-accelerated library of primitives for deep neural networks. In our
measurements, we achieved up to 1.12 times and 1.11 times higher throughput
than vanilla TensorFlow and TensorFlow with XLA JIT compilation, respectively, using
Inception-V3, a widely used image classification model trained with
ImageNet datasets [24].
The rest of the paper is organized as follows. In Section 2, we describe the background
to and motivation for our study. In Section 3, we introduce the overall architecture
of our proposed profiling method. We evaluate the performance of the method (in terms
of throughput) in Section 4, along with its memory usage over time. The paper concludes
in Section 5.
2. TensorFlow and Its Profiling Execution
2.1 TensorFlow
TensorFlow is an open-source software library developed by Google that performs numerical
computations based on data flow graphs to support machine learning and deep neural network
research [1]. TensorFlow provides high-level APIs that make it easy to construct a neural network
with high flexibility and to deploy computations on CPUs or GPUs in desktops,
servers, or mobile devices. The basic unit of TensorFlow is a computational graph
in which each node represents an operation and each edge, called a tensor, represents
data. The data comprise a set of primitive values shaped into an array of arbitrary dimension.
A computation node takes zero or more tensors as input and produces a tensor as
output.
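As a minimal illustration, the sketch below builds such a graph with a single convolution node using the TensorFlow 1.x API (the version used in this paper); the shapes and values are arbitrary and chosen only for demonstration.

```python
import numpy as np
import tensorflow as tf

# Edges (tensors): a 4-D input and a 4-D filter.
x = tf.placeholder(tf.float32, [None, 299, 299, 3])   # batch x height x width x channels
w = tf.Variable(tf.random_normal([3, 3, 3, 32]))      # filter: kH x kW x in_ch x out_ch

# Node (operation): a 2-D convolution that consumes two tensors and produces one.
y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="SAME")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(y, feed_dict={x: np.zeros([1, 299, 299, 3], np.float32)})
    print(out.shape)  # (1, 299, 299, 32)
```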
The CNN is the current state-of-the-art image classification architecture, and
it applies a series of convolution filters to raw images in order to extract and learn
higher-level features. The CNN receives input and transforms it through a series of
hidden layers composed of convolutional, pooling, and dense fully connected layers.
The convolutional layer is the core of the CNN and consumes most of the computational
capacity among all the layers. TensorFlow employs several convolution algorithms
and selects one of them to achieve optimal performance in terms of memory usage and
execution time with respect to the input parameters. The selection is performed in
the profiling phase.
2.2 Profiling
Algorithm 1 presents TensorFlow's convolution profiling algorithm. When a convolution
node is triggered, TensorFlow examines whether the node has already been profiled. If it
has, the previously selected algorithms are used. Otherwise, profiling is performed
to find the best algorithms for the node.
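The control flow described above can be sketched as follows. This is a simplified, self-contained Python rendering of Algorithm 1, not TensorFlow's actual internal code; autotune_map, run_all_algorithms, and Result are illustrative names, and the measured values are dummies.

```python
from collections import namedtuple

# Simplified sketch of TensorFlow's per-node convolution profiling (Algorithm 1).
Result = namedtuple("Result", ["algo", "elapsed_ms", "workspace_bytes"])
autotune_map = {}  # node parameters -> (fastest_algo, lowest_memory_algo)

def run_all_algorithms(params):
    # Placeholder: in TensorFlow this executes every available cuDNN algorithm once
    # and measures its time and workspace size. Dummy numbers are used here.
    return [Result("ALGO_0", 1.2, 4 << 20),
            Result("ALGO_1", 0.9, 8 << 20),
            Result("FFT",    0.7, 64 << 20)]

def select_conv_algorithms(params):
    if params in autotune_map:                    # node already profiled in this run
        return autotune_map[params]
    results  = run_all_algorithms(params)         # profile: run every candidate
    fastest  = min(results, key=lambda r: r.elapsed_ms)
    smallest = min(results, key=lambda r: r.workspace_bytes)
    autotune_map[params] = (fastest.algo, smallest.algo)
    return autotune_map[params]
```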
Profiling enhances performance in two ways: it identifies the algorithm that minimizes
memory usage and the one that minimizes execution time. To find these algorithms,
all available convolution algorithms for the given convolution type (forward, backward
input, or backward filter) are executed. Table 1 shows the algorithms available
in TensorFlow with NVidia GPUs. For the forward convolution type, the IMPLICIT_GEMM,
IMPLICIT_PRECOMP_GEMM, GEMM, and FFT algorithms are available. All of these algorithms
perform a matrix-to-matrix product, but their memory usage and execution times differ.
An implicit algorithm like IMPLICIT_GEMM requires less memory but executes longer
than explicit ones such as GEMM and FFT [23,25].
Fig. 1 shows an example record from profiling. The first four lines represent the convolution
node's ID (name: starting address), an applied algorithm number, and its execution
time in ms. Line 5 shows the convolution type; in this example, the type is backward filter,
and the algorithm numbers on lines 1 to 4 correspond to the algorithms in Table 1, in order. In other words, algorithm numbers 0, 1, 2, and 3 denote ALGO_0,
ALGO_1, FFT, and ALGO_3, respectively. The fields
of the input and output shapes in lines 6 and 8 give the data type
(F32: 32-bit floating point), batch size, image width, image height, and image channels.
For the filter shape in line 7, the fields are the filter width, filter height, input
channels, and output channels. Finally, the last line shows the two algorithms selected
by the profile: the first (3) achieves the minimum execution time, and the second
(0) requires the least memory.
Basically, the profiled information includes convolution type, input shape, filter
shape, and output shape. We show in Section 3.1 that the batch size has no significant
effect on algorithm selection, so we can use the profiled information from the first
run for future runs.
Algorithm 1. TensorFlow’s convolution profiling.
Fig. 1. Example record from TensorFlow profiling.
Table 1. Applicable convolution algorithms according to convolution type [23].
(a) Forward
Algorithm | Meaning | Memory usage
IMPLICIT_GEMM | Matrix product | Low (no extra memory workspace)
IMPLICIT_PRECOMP_GEMM | Matrix product | Medium (needs some memory workspace)
GEMM | Explicit matrix product | High (needs significant memory workspace)
FFT | Fast Fourier transform approach | High (stores intermediate results of FFT)
FFT_TILING | Fast Fourier transform approach, but splits the input into 32x32 tiles | Medium (needs significantly less than FFT for large images)

(b) Backward input
Algorithm | Meaning | Memory usage
ALGO_0 | Sum of matrix products (non-deterministic) | Medium
ALGO_1 | Sum of matrix products (deterministic) | Medium
FFT | Fast Fourier transform approach (deterministic) | High (stores intermediate results of FFT)
FFT_TILING | Fast Fourier transform approach, but splits the input into 32x32 tiles (deterministic) | Medium (needs significantly less than FFT for large images)

(c) Backward filter
Algorithm | Meaning | Memory usage
ALGO_0 | Sum of matrix products (non-deterministic) | Medium
ALGO_1 | Sum of matrix products (deterministic) | Medium
FFT | Fast Fourier transform approach (deterministic) | High (stores intermediate results of FFT)
ALGO_3 | Sum of matrix products (precomputation, non-deterministic) | Medium
2.3 Profiling Drawbacks
Fig. 2 presents the memory usage of our GPU platform (detailed in Section 4) over time in
TensorFlow when the batch size (the number of images processed in one execution)
is 16 or 32 for the Inception-V3 application, a widely used image classification
model trained with the ImageNet Large Scale Visual Recognition Challenge 2012 datasets.
From this experiment, we identified drawbacks in the profiling execution. Memory usage
overshoots approximately 10 seconds after the start of execution, the overshoot exceeds
70% of the total memory, and afterwards memory usage stays constant at less than 40%.
We found that the overshoot results from memory allocation at the beginning of program
execution, i.e., in the profiling phase. The memory required to profile all the algorithms
is allocated up front. The memory allocated for the selected algorithm is then used for
the remaining execution, and the other regions are deallocated. Because the profiling phase
must hold all candidate algorithms in memory at once, no additional memory can be allocated
to process larger batches with the selected algorithm. Therefore, profiling limits the
exploitation of higher data-level parallelism.
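Memory-usage traces like the one in Fig. 2 (and in Figs. 5 and 6 later) can be collected with a simple polling loop. The sketch below is one way to do it, assuming an NVIDIA driver with the standard nvidia-smi query options; it samples the used device memory once per second.

```python
import subprocess
import time

# Poll used GPU memory with nvidia-smi's standard query flags
# (assumes an NVIDIA driver is installed); returns one value in MiB per GPU.
def sample_gpu_memory():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"]).decode()
    return [int(v) for v in out.split() if v.strip()]

if __name__ == "__main__":
    start = time.time()
    for _ in range(120):                      # sample for about two minutes
        print("%6.1f s  %s MiB" % (time.time() - start, sample_gpu_memory()))
        time.sleep(1)
```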
Fig. 2. Memory usage in GPUs over time for the Inception-V3 application in TensorFlow.
3. Architecture Proposal
3.1 Algorithm Selection
Table 2 compares the profiling information for minimum execution time while running Inception-V3
with batch sizes of 16 and 32. We categorize the results into three cases: Case 1
represents nodes for which both batch sizes selected the same algorithm; Case 2
represents nodes for which different algorithms were selected the same number of times;
Case 3 represents nodes for which the same algorithms were selected but a different number
of times. In this example, Case 1 covers 90% of the total selections, and the other
cases cover the remaining 10%.
In Case 1, there is no difference between the two runs and hence no obstacle to storing
and reusing the results, but Case 2 and Case 3 require a more careful look. In the profile
information of Fig. 1, algorithm 3 was selected for minimum execution time. However, the
execution times of algorithm 3 and algorithm 1 are 0.809984 ms and 0.816864 ms,
respectively, a difference of only 0.8%. In Case 2 and Case 3, the execution time
difference between the competing algorithms is similarly negligible, and therefore,
there is no significant difference in overall performance whichever algorithm is used.
For the algorithms selected for minimum memory usage, the profiles are always the same.
Table 3 presents some comparison results of profile information obtained by running
the application twice with a batch size of 32. We compared the results to confirm there
are no obstacles to storing and reusing the information when execution is repeated
under the same conditions. Most of the nodes belonged to Case 1,
and the remainder belonged to Case 3, but there was no problem in storing
and reusing them, because the performance of the two algorithms does not differ noticeably,
as mentioned above. Therefore, the profile information can be saved to the profile
record, and the profile record can be reused, keyed by four classification criteria: convolution
type, input shape, filter shape, and output shape (excluding batch size).
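Concretely, the profile-record key can be formed as in the following sketch, where the batch dimension is masked out of the input and output shapes before lookup; record_key is an illustrative name, and the shapes come from the first row of Table 2.

```python
# Illustrative sketch of the profile-record key: convolution type, input shape,
# filter shape, and output shape, with the batch dimension excluded.
def record_key(conv_type, input_shape, filter_shape, output_shape):
    # input_shape / output_shape are (batch, height, width, channels); the batch
    # dimension is replaced with a wildcard so that runs with different batch
    # sizes map to the same record.
    return (conv_type,
            ("N",) + tuple(input_shape[1:]),
            tuple(filter_shape),
            ("N",) + tuple(output_shape[1:]))

# Example: a forward convolution profiled with batch 16 and reused with batch 32.
k16 = record_key("forward", (16, 73, 73, 80), (1, 1, 64, 80), (16, 73, 73, 64))
k32 = record_key("forward", (32, 73, 73, 80), (1, 1, 64, 80), (32, 73, 73, 64))
assert k16 == k32
```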
Table 2. Comparison of the profiling results for different batch sizes with minimum execution times.
(a) Case 1
CT | Input shape | Filter shape | Output shape | Batch 16: selected algorithm (#) | Batch 32: selected algorithm (#)
F | Nx73x73x80 | 1x1x64x80 | Nx73x73x64 | IMP_P_GEMM (2) | IMP_P_GEMM (2)
F | Nx73x73x64 | 1x1x64x80 | Nx73x73x80 | IMP_P_GEMM (4) | IMP_P_GEMM (4)
F | Nx5x5x768 | 1x1x768x128 | Nx5x5x128 | IMP_GEMM (4) | IMP_GEMM (4)
F | Nx5x5x128 | 5x5x128x768 | Nx1x1x128 | GEMM (4) | GEMM (4)
BI | Nx8x8x448 | 3x3x448x384 | Nx8x8x384 | FFT_TILING (4) | FFT_TILING (4)
BI | Nx8x8x384 | 3x1x384x384 | Nx8x8x384 | ALGO_1 (8) | ALGO_1 (8)
BI | Nx8x8x384 | 1x3x384x384 | Nx8x8x384 | ALGO_1 (8) | ALGO_1 (8)
BI | Nx73x73x80 | 3x3x80x192 | Nx71x71x192 | FFT_TILING (2) | FFT_TILING (2)
BF | Nx35x35x48 | 5x5x48x64 | Nx35x35x64 | ALGO_3 (6) | ALGO_3 (6)
BF | Nx35x35x288 | 3x3x288x384 | Nx17x17x384 | ALGO_1 (2) | ALGO_1 (2)

(b) Case 2
CT | Input shape | Filter shape | Output shape | Batch 16: selected algorithm (#) | Batch 32: selected algorithm (#)
F | Nx8x8x2048 | 1x1x2048x192 | Nx8x8x192 | IMP_GEMM (4) | IMP_P_GEMM (4)

(c) Case 3
CT | Input shape | Filter shape | Output shape | Batch 16: selected algorithm (#) | Batch 32: selected algorithm (#)
BF | Nx35x35x96 | 3x3x96x96 | Nx35x35x96 | ALGO_0 (15), ALGO_1 (5), ALGO_3 (4) | ALGO_0 (9), ALGO_1 (0), ALGO_3 (15)
BF | Nx35x35x64 | 3x3x64x96 | Nx35x35x96 | ALGO_1 (5), ALGO_3 (3) | ALGO_1 (7), ALGO_3 (1)

CT: convolution type; F: forward; BI: backward input; BF: backward filter; #: the number of selections.
Table 3. Some comparison results of profile information when running twice in Case 3 and with a backward filter.
Input shape | Filter shape | Output shape | 1st run: selected algorithm (#) | 2nd run: selected algorithm (#)
32x8x8x448 | 3x3x448x384 | 32x8x8x384 | ALGO_1 (3), ALGO_3 (1) | ALGO_1 (4), ALGO_3 (0)
32x17x17x768 | 1x1x768x192 | 32x17x17x192 | ALGO_0 (9), ALGO_3 (15) | ALGO_0 (8), ALGO_3 (16)

#: the number of selections.
3.2 Overall Architecture
Algorithm 2 shows our proposed convolution profiling algorithm. For each convolution
node, the algorithm examines whether the node has already been profiled. If it has not,
it checks whether a profile record is available from a previous run. If a profile record
is available, we select the algorithms from the record. Otherwise, we perform profiling
to find the best algorithms, and save the profile information in the profile record file
for future runs.
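A minimal sketch of this flow is given below. It is illustrative Python, not the actual TensorFlow patch; PROFILE_RECORD is an assumed file name, and it reuses the hypothetical record_key() and run_all_algorithms() helpers sketched in Sections 2.2 and 3.1.

```python
import json
import os

# Illustrative sketch of the proposed flow (Algorithm 2): consult an on-disk
# profile record before falling back to full profiling.
PROFILE_RECORD = "conv_profile_record.json"
profile_record = json.load(open(PROFILE_RECORD)) if os.path.exists(PROFILE_RECORD) else {}
autotune_map = {}  # in-memory cache for the current run

def select_conv_algorithms(params):
    if params in autotune_map:                    # already profiled in this run
        return autotune_map[params]
    key = str(record_key(*params))                # conv type + shapes, batch excluded
    if key in profile_record:                     # reuse the record from a previous run
        autotune_map[params] = tuple(profile_record[key])
    else:                                         # first run: profile and save
        results  = run_all_algorithms(params)
        fastest  = min(results, key=lambda r: r.elapsed_ms)
        smallest = min(results, key=lambda r: r.workspace_bytes)
        autotune_map[params] = (fastest.algo, smallest.algo)
        profile_record[key] = list(autotune_map[params])
        with open(PROFILE_RECORD, "w") as f:
            json.dump(profile_record, f)
    return autotune_map[params]
```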
Algorithm 2. Our proposed convolution profiling.
4. Experimental Evaluation
The performance evaluation of our proposed architecture was done on the platform
detailed in Table 4. For the evaluation, TensorFlow r1.4 and the Inception-V3 application were used.
We trained the application with ImageNet datasets, and used batch sizes of 32, 48,
and 56.
Fig. 3 compares performance in terms of seconds per execution, and Fig. 4 compares throughput
in terms of images per second, among vanilla TensorFlow, TensorFlow with XLA JIT compilation,
and our approach with XLA.
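For reference, XLA JIT compilation can be enabled at the session level in TensorFlow 1.x through the standard ConfigProto option; the sketch below shows only this configuration step, not the full training script.

```python
import tensorflow as tf

# Enable XLA JIT compilation at the session level (TensorFlow 1.x API).
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

# The session then JIT-compiles eligible subgraphs; graph construction and the
# input pipeline are unchanged.
sess = tf.Session(config=config)
```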
The execution time of our approach is similar to that of XLA for batch sizes 32
and 48. With a batch size of 56, our technique requires 9.8% less execution time than
the others. The throughput of our approach is also similar to that of XLA for batch
sizes 32 and 48. With the batch size of 56, our technique outperformed the vanilla
algorithm by 1.12 times. For vanilla TensorFlow, throughput increased to 16457,
17658, and 18397 as the batch size grew to 32, 48, and 56, respectively, but it did
not reach the maximum throughput of the proposed architecture. With XLA, the throughput
increased from 18120 to 20147 as the batch size increased from 32 to 48, but decreased
to 18438 for batch size 56. The throughput decreases because a larger batch size implies
higher memory usage, and under memory pressure the algorithm requiring minimum memory
is preferred over the one with minimum execution time. Our proposed architecture eliminates
the high memory usage, and its throughput increased steadily: 18179, 19820, and 20511
for batch sizes of 32, 48, and 56, respectively.
In order to observe the reduced memory usage quantitatively, Figs. 5 and 6 present
the memory usage of the GPUs over time for the Inception-V3 application on TensorFlow
with XLA JIT compilation and on our architecture, with batch sizes of 32, 48, and
56. TensorFlow with XLA JIT incurs a memory usage overshoot at the beginning of execution,
i.e., in the profiling phase. In contrast, the memory usage of the proposed architecture
remains constant over time, i.e., our proposed architecture eliminates the memory usage
overshoot of the profiling phase. Thus, we confirmed that our approach achieves higher
performance than existing schemes, and we can expect it to achieve even higher performance
for larger batch sizes.
Fig. 3. Performance comparison in terms of seconds per execution. Lower is better.
Fig. 4. Throughput comparison in terms of images per second. Higher is better.
Fig. 5. The memory usage of GPUs over time for the Inception-V3 application on TensorFlow with XLA JIT compilation.
Fig. 6. The memory usage of GPUs over time for the Inception-V3 application on the proposed architecture.
Table 4. Server specifications for the experiment.
Feature | Description
CPU | Intel Xeon CPU E5-1620 v4 @ 3.50 GHz
CPU memory | 32 GB
GPU | NVidia Titan XP (2 ea)
GPU memory | 12 GB per GPU
5. Conclusion
In this paper, we proposed a novel profiling method that reduces overhead by storing
profile information from the first run and reusing it on subsequent runs. During the
first run, the algorithms that are efficient in terms of memory or runtime in a specific
execution environment are determined from among the available candidates. As a result,
the same selections can be expected in the same execution environment, so profiling at
every execution is generally unnecessary. Moreover, in some cases, the difference between
the results of the memory- and runtime-efficient algorithms is minimal. Therefore, our
proposed method can eliminate the related overhead by reusing the previous profiling
result when the execution environment does not change. We analyzed memory usage over
time in detail and observed that memory usage overshoots during profiling. This overshoot
limits data-level parallelism, thus degrading execution throughput.
Based on this knowledge, we devised a new profiling method that eliminates the overshoot
and improves performance. Using Inception-V3 trained with ImageNet datasets, we achieved
higher throughput than vanilla TensorFlow and TensorFlow with XLA JIT compilation
by up to 1.12 times and 1.11 times, respectively, without losing accuracy.
ACKNOWLEDGMENTS
This work was partially supported by SK Telecom Co., LTD.
REFERENCES
Abadi M., et al., 2016, TensorFlow: A System for Large-scale Machine Learning, in
Proc. of OSDI'16, Berkeley, CA, USA, pp. 265-283
Jia Y., et al., 2014, Caffe: Convolutional Architecture for Fast Feature Embedding,
in Proc. of MM'14, New York, NY, USA, ACM, pp. 675-678
Bastien F., et al. , 2012, Theano: New Features and Speed Improvements, CoRR
Al-Rfou R., et al. , 2016, Theano: A Python Framework for Fast Computation of Mathematical
Expressions, CoRR
2015, Keras
Seide F., Agarwal A., 2016, CNTK: Microsoft's Open-source Deep-learning Toolkit, in
Proc. of KDD'16, New York, NY, USA, ACM, pp. 2135-2135
Collobert R., Bengio S., Mariéthoz J., 2002, Torch: A Modular Machine Learning Software
Library, Technical Report, IDIAP
2017, PyTorch
Salazar N. D., et al. , 2018, Application of Transfer Learning for Object Recognition
Using Convolutional Neural Networks, in Proc. of 2018 IEEE Colombian Conference on
Applications in Computational Intelligence, Medellin, Colombia, pp. 14-25
Spanhol F. A., et al. , 2016, Breast Cancer Histopathological Image Classification
Using Convolutional Neural Networks, in Proc. of IJCNN'16, Vancouver, BC, Canada,
pp. 2560-2567
Siniscalchi S. M., Salerno V. M., 2016, Adaptation to New Microphones Using Artificial
Neural Networks with Trainable Activation Functions, IEEE Transactions on Neural Networks
and Learning Systems, Vol. 28, No. 8, pp. 1959-1965
Hautamäki V., et al. , 2015, Boosting Universal Speech Attributes Classification with
Deep Neural Network for Foreign Accent Characterization, in Proc. of INTERSPEECH'15,
Dresden, Germany, pp. 408-412
Garzón-Alfonso C. C., Rodríguez-Martínez M., 2018, Twitter Health Surveillance (THS)
System, in Proc. of 2018 IEEE International Conference on Big Data, Seattle, WA, USA,
pp. 1647-1654
Ravì D., et al. , 2016, Deep Learning for Health Informatics, IEEE Journal of Biomedical
and Health Informatics, Vol. 21, No. 1, pp. 4-21
Salerno V., Rabbeni G., 2018, An Extreme Learning Machine Approach to Effective Energy
Disaggregation, Electronics, Vol. 7, No. 10, pp. 235
Zhang C., et al. , 2018, Sequence-to-point Learning with Neural Networks for Non-intrusive
Load Monitoring, in Proc. of AAAI'18, New Orleans, LA, USA, pp. 2604-2611
He K., Zhang X., Ren S., Sun J., 2015, Delving Deep into Rectifiers: Surpassing Human-level
Performance on Imagenet Classification, in Proc. of ICCV'15, Santiago, Chile, pp. 1026-1034
Dodge S. F., Karam L. J., 2017, A Study and Comparison of Human and Deep Learning
Recognition Performance Under Visual Distortions, in Proc. of ICCCN'17, Vancouver,
BC, Canada, pp. 1-7
He K., Zhang X., Ren S., Sun J., 2016, Deep Residual Learning for Image Recognition,
in Proc. of CVPR'16, Las Vegas, NV, USA, pp. 770-778
Vasilache N., et al. , 2014, Fast Convolutional Nets with fbfft: A GPU Performance
Evaluation, CoRR
Winograd S., 1980, Arithmetic Complexity of Computations, Society for Industrial and
Applied Mathematics
Lavin A., Gray S., 2016, Fast Algorithms for Convolutional Neural Networks, in Proc.
of CVPR'16, Las Vegas, NV, USA, pp. 4013-4021
Chetlur S., et al. , 2014, cuDNN: Efficient Primitives for Deep Learning, CoRR
Deng J., et al. , 2009, ImageNet: A Large-scale Hierarchical Image Database, in Proc.
of CVPR'09, Miami, FL, USA, pp. 248-255
Rhu M., et al. , 2016, vDNN: Virtualized Deep Neural Networks for Scalable, Memory-efficient
Neural Network Design, in Proc. of MICRO'16, Taipei, Taiwan, pp. 1-13
Author
Minseong Kim received a BSc in electrical engineering and a PhD in electronics,
electrical and computer engineering from Korea University in 2011 and 2018, respectively.
He is currently with SK hynix, South Korea. His research interests include compiler
construction, microarchitectures, memory systems, and parallel and distributed architectural
simulator designs.
Kyu Hyun Choi received a PhD in electronics, electrical and computer engineering
from Korea University in 2018. He is currently a senior engineer with SK hynix, South
Korea. His research interests include microarchitectures, GPUs, parallel computing,
memory system design, and processing in memory.
Yoonah Paik received her B.Eng. in Electrical Engineering from Korea University,
Seoul, Korea, in 2015, and she is currently working toward an integrated MSc and PhD
in the Compiler and Microarchitecture Lab at Korea University. Her current research
interests include microarchitectures, memory system design, and system software.
Seon Wook Kim received his BSc in Electronics and Computer Engineering from Korea
University in 1988, an MSc in Electrical Engineering from Ohio State University in
1990, and a PhD in Electrical and Computer Engineering from Purdue University in 2001.
He was a senior researcher at the Agency for Defense Development (1990-1995) and was
a staff software engineer at Intel/KSL (2001-2002). Currently, he is a professor for
the School of Electrical and Computer Engineering at Korea University. His research
interests include compiler construction, microarchitectures, system optimization,
and SoC design. He is a senior member of ACM and IEEE.