| Title |
Design and Performance Analysis of a Cross-attention Transformer Model for Single-person 3D Keypoint Detection |
| Authors |
신인영(In-Yeong Shin) ; 이승호(Seung-Ho Lee) |
| DOI |
https://doi.org/10.5573/ieie.2026.63.4.78 |
| Keywords |
Feature fusion; 3D; Pose estimation; Deep learning; Transformers |
| Abstract |
In this paper, we propose a novel Spatio-Temporal Feature Fusion Transformer model based on a cross-attention mechanism that simultaneously maximizes the accuracy and computational efficiency of single-person 3D pose estimation. Conventional 2D keypoint-based approaches often suffer from instability in 3D reconstruction due to noise and jitter inherent in the input 2D keypoint sequences. To address this issue, we introduce the Discrete Cosine Transform (DCT) as a preprocessing step: by filtering out unnecessary high-frequency components of the time-series data, this method effectively suppresses noise and preserves the temporal continuity of motion. Furthermore, we use a spatial transformer to embed the geometric relationships among human joints into vectors and apply a cross-attention structure to integrate these with temporal features. This fusion model enhances estimation precision by learning spatial structural information and temporal dynamic information in a mutually complementary manner. Consequently, this study aims to demonstrate that the synergy between frequency-domain preprocessing and the spatio-temporal integrated attention mechanism leads to significant improvements in both the robustness and accuracy of single-person 3D pose estimation. |
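The DCT preprocessing described in the abstract can be illustrated with a minimal sketch. The paper does not specify its exact parameters, so the function name `dct_lowpass`, the array layout `(T, J, 2)`, and the number of retained coefficients are assumptions for illustration; the idea is simply to transform each joint trajectory to the frequency domain, zero out the high-frequency coefficients that carry jitter, and invert the transform.

```python
import numpy as np
from scipy.fft import dct, idct

def dct_lowpass(keypoints, keep):
    """Suppress jitter in a 2D keypoint sequence via DCT truncation.

    keypoints: (T, J, 2) array of 2D joint coordinates over T frames
               (layout assumed for this sketch).
    keep: number of low-frequency DCT coefficients to retain per trajectory.
    """
    # Transform each joint/coordinate trajectory along the time axis.
    coeffs = dct(keypoints, axis=0, norm="ortho")
    # Zero out high-frequency components, which mostly carry noise.
    coeffs[keep:] = 0.0
    # Inverse transform back to a smoothed time-domain trajectory.
    return idct(coeffs, axis=0, norm="ortho")
```

Because motion trajectories are dominated by low-frequency content, truncating the spectrum removes frame-to-frame jitter while preserving the overall motion, which is what gives the downstream 3D lifting a temporally continuous input.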
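The cross-attention fusion of spatial and temporal features can likewise be sketched in simplified form. The paper does not publish its layer definitions, so this single-head NumPy version, with temporal tokens as queries and per-joint spatial embeddings as keys and values, is an assumed illustration of the general mechanism rather than the authors' architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(temporal, spatial, Wq, Wk, Wv):
    """Single-head cross-attention fusing temporal and spatial features.

    temporal: (T, d) per-frame temporal features (queries).
    spatial:  (J, d) per-joint spatial embeddings (keys/values).
    Wq, Wk, Wv: (d, d) projection matrices.
    Returns (T, d): each frame's feature enriched with joint structure.
    """
    Q = temporal @ Wq
    K = spatial @ Wk
    V = spatial @ Wv
    # Scaled dot-product attention: each frame attends over all joints.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V
```

Letting the temporal stream query the spatial embeddings is one way to realize the "mutually complementary" learning the abstract describes: each frame's representation is re-weighted by its compatibility with the geometric structure of the skeleton.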