Title A Hybrid 3D CNN-ViT Model for Korean Sign Language Recognition with Heatmap Representation
Authors 임수현(SuHyeon Lim) ; 최재용(Andrew Jaeyoung Choi)
DOI https://doi.org/10.5573/ieie.2025.62.12.13
Page pp.13-21
ISSN 2287-5026
Keywords Sign language; Video; Heatmap; 3D convolutional network; Vision transformer
Abstract Sign language is a visual language composed of hand movements and facial expressions. This study proposes a heatmap-based sign language recognition model. By representing skeleton keypoints as heatmaps, the model improves spatiotemporal feature learning, is more robust to pose estimation noise, and generalizes better than 3D CNN, ViT-only, and graph-based methods. The proposed model consists of a 3D Convolutional Neural Network (3D CNN) that extracts spatiotemporal features from heatmap sequences, followed by a Vision Transformer (ViT) that classifies the sign classes. The experimental results show recognition accuracies of 98% on the TEST set and 70% on the non-expert TEST set. These results demonstrate the model’s effectiveness in capturing the fine-grained dynamics of hand movements and facial expressions in sign language and suggest its potential applicability to broader domains of human action recognition.
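
As an illustration of the heatmap representation mentioned in the abstract, the following is a minimal sketch of how skeleton keypoints might be rendered as per-keypoint heatmaps and stacked over time. The abstract does not specify the rendering method, so the Gaussian kernel, the `sigma` value, the `(x, y, confidence)` keypoint format, and the function names `keypoints_to_heatmap` and `sequence_to_volume` are all assumptions made for this example, not the authors' implementation.

```python
import math

def keypoints_to_heatmap(keypoints, height, width, sigma=2.0):
    """Render each (x, y, confidence) keypoint as its own 2D Gaussian heatmap.

    Assumption: one channel per keypoint, peak value scaled by confidence.
    Returns a nested list shaped [K][H][W].
    """
    heatmaps = []
    for x, y, conf in keypoints:
        heatmap = [
            [
                conf * math.exp(-((col - x) ** 2 + (row - y) ** 2) / (2 * sigma ** 2))
                for col in range(width)
            ]
            for row in range(height)
        ]
        heatmaps.append(heatmap)
    return heatmaps

def sequence_to_volume(frames, height, width, sigma=2.0):
    """Stack per-frame heatmaps into a [T][K][H][W] volume.

    A volume of this kind is the sort of input a 3D CNN could consume
    after the axes are arranged as the chosen framework expects.
    """
    return [keypoints_to_heatmap(kps, height, width, sigma) for kps in frames]

# Hypothetical input: two frames, two keypoints per frame.
frames = [
    [(3.0, 4.0, 1.0), (10.0, 2.0, 0.5)],
    [(4.0, 4.0, 1.0), (11.0, 3.0, 0.5)],
]
volume = sequence_to_volume(frames, height=8, width=16)
```

Because each keypoint is spread over neighboring pixels rather than stored as a single coordinate, a small error in the estimated position shifts the heatmap only slightly, which is one plausible reading of the robustness to pose estimation noise that the abstract reports.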