| Title |
Communication-optimized Tensor Parallelism for Efficient Multi-GPU Training of Complex-valued CNNs |
| Authors |
김선우(Sunwoo Kim); 이제인(Jane Rhee); 윤명국(Myung Kuk Yoon)
| DOI |
https://doi.org/10.5573/ieie.2026.63.4.53 |
| Keywords |
Complex-valued convolutional neural networks; Multi-GPU training; Tensor parallelism |
| Abstract |
Complex-valued convolutional neural networks (CV-CNNs) have gained increasing attention in the signal processing field due to their ability to preserve both phase and magnitude information. However, because complex values consist of real and imaginary components, the complex matrix multiplications at the core of CV-CNNs quadruple the computational cost and double the memory requirements compared to real-valued convolutional neural networks. This motivates the need for effective distributed training across multiple GPUs, for which various parallelism strategies have been proposed, with Tensor Parallelism (TP) being among the most efficient. Nevertheless, directly applying TP strategies to CV-CNNs introduces substantial communication overhead, as intermediate results must be exchanged to reconstruct complete complex outputs. In this paper, we propose two novel TP techniques that exploit the mathematical independence between the real and imaginary components of complex matrix multiplications. First, Split TP eliminates intermediate result exchanges by assigning each GPU to exclusively compute either the real or the imaginary outputs within a single layer. Building on this idea, Direct TP extends this independence across consecutive convolutional layers, eliminating inter-layer output sharing and overlapping the necessary communication with computation to effectively hide its latency. Across eight representative CNN models, Split TP and Direct TP achieve average training time reductions of 35.19% and 37.49%, respectively, over the baseline Distributed Data Parallel approach.
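The independence that both techniques exploit comes from the standard decomposition of a complex matrix product into four real products, (A + iB)(C + iD) = (AC - BD) + i(AD + BC), where the real part (AC - BD) and the imaginary part (AD + BC) share no intermediate results. The sketch below is a minimal single-process illustration of this idea (plain NumPy, with a hypothetical `rank` argument standing in for two GPUs); it is not the paper's implementation of Split TP, only a demonstration of the partitioning it relies on.

```python
import numpy as np

def complex_matmul_split(A, B, C, D, rank):
    """Illustrative Split-TP-style partitioning of a complex matmul.

    The complex product (A + iB) @ (C + iD) decomposes into four real
    matmuls whose real part (A@C - B@D) and imaginary part (A@D + B@C)
    are mathematically independent. A hypothetical rank 0 computes only
    the real output and rank 1 only the imaginary output, so neither
    rank needs the other's partial results inside the layer.
    """
    if rank == 0:
        return A @ C - B @ D   # real component of the output
    else:
        return A @ D + B @ C   # imaginary component of the output

# Single-process demonstration standing in for two GPUs.
rng = np.random.default_rng(0)
A, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
C, D = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))

real_part = complex_matmul_split(A, B, C, D, rank=0)   # "GPU 0"
imag_part = complex_matmul_split(A, B, C, D, rank=1)   # "GPU 1"

# Check against NumPy's native complex matrix multiplication.
reference = (A + 1j * B) @ (C + 1j * D)
assert np.allclose(real_part, reference.real)
assert np.allclose(imag_part, reference.imag)
```

In an actual multi-GPU setting, each rank would hold only its own output component; Direct TP, as described in the abstract, carries this separation into the next convolutional layer rather than reassembling the full complex output between layers.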