| Title |
Cross-Attention Fusion for Audio-Visual Multimodal Emotion Recognition |
| Authors |
김정윤(Jeong-Yoon Kim) ; 이승호(Seung-Ho Lee) |
| DOI |
https://doi.org/10.5573/ieie.2026.63.4.110 |
| Keywords |
Multimodal emotion recognition; Transformer; Convolutional neural network; Self-supervised learning; Affective computing |
| Abstract |
In this paper, we propose a cross-attention fusion architecture for audio-visual multimodal emotion recognition that uses both face images and audio signals. The visual input is normalized to 224×224×3 through face detection and alignment with RetinaFace, and the audio input is converted into a time-dependent embedding sequence of shape (Batch, T, 1024) using the wav2vec2.0-large-robust pre-trained model. Each modality learns sequence-level features through a transformer encoder, and the cross-attention module selectively combines complementary information to produce richer multimodal representations than simple merging methods. To validate the proposed method, we conducted experiments on the CREMA-D dataset, which was split into training and test sets at a ratio of 80% to 20%. Accuracy and weighted F1-score were adopted as the main evaluation metrics in consideration of the class imbalance in the emotion data. The weighted F1-score is computed by taking the F1-score (the harmonic mean of precision and recall) for each class, weighting it by the proportion of samples belonging to that class, and summing the results. It therefore evaluates classification performance in a balanced manner even when certain emotion classes appear less frequently. Experimental results show that the proposed cross-attention-based multimodal model achieved an accuracy of 88.3% and a weighted F1-score of 0.883, a significant improvement over single-modality models and simple early/late fusion methods.
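The abstract does not give implementation details of the fusion module, so the following PyTorch-style sketch only illustrates one plausible wiring of a bidirectional cross-attention fusion block over per-modality encoder outputs. The audio feature dimension of 1024 follows the wav2vec2.0-large-robust embeddings mentioned above; the visual feature dimension (512), the shared model dimension, the head count, and the module names are assumptions for illustration, not the authors' actual configuration.

```python
# Hypothetical sketch of a bidirectional cross-attention fusion block.
# Dimensions (512 for the visual stream, 1024 for the wav2vec2.0 audio
# stream, 6 emotion classes as in CREMA-D) are assumptions for illustration.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, audio_dim=1024, visual_dim=512, d_model=256,
                 n_heads=4, num_classes=6):
        super().__init__()
        # Project both modalities into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Audio queries attend over visual keys/values, and vice versa.
        self.a2v_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.v2a_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.LayerNorm(2 * d_model),
            nn.Linear(2 * d_model, num_classes),
        )

    def forward(self, audio_seq, visual_seq):
        # audio_seq:  (Batch, T_a, 1024) wav2vec2.0 embedding sequence
        # visual_seq: (Batch, T_v, 512)  frame-level visual features
        a = self.audio_proj(audio_seq)
        v = self.visual_proj(visual_seq)
        # Cross-attention in both directions.
        a_enriched, _ = self.a2v_attn(query=a, key=v, value=v)
        v_enriched, _ = self.v2a_attn(query=v, key=a, value=a)
        # Temporal average pooling, then concatenation of the two streams.
        fused = torch.cat([a_enriched.mean(dim=1), v_enriched.mean(dim=1)], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = CrossAttentionFusion()
    audio = torch.randn(2, 49, 1024)   # e.g., roughly 1 s of wav2vec2.0 frames
    visual = torch.randn(2, 16, 512)   # e.g., 16 sampled video frames
    print(model(audio, visual).shape)  # torch.Size([2, 6])
```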
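Likewise, the weighted F1-score described above is the standard support-weighted average of per-class F1-scores. The short snippet below shows that computation explicitly; the use of scikit-learn and the toy labels are assumptions for illustration, not part of the paper's experimental setup.

```python
# Minimal illustration of the support-weighted F1-score: each class's F1
# is weighted by its share of the true labels, then the results are summed.
import numpy as np
from sklearn.metrics import f1_score, precision_recall_fscore_support

y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])  # toy, imbalanced labels
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 2, 0, 2])

# Per-class F1 and support (number of true samples per class).
_, _, f1_per_class, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1, 2], zero_division=0)

weighted_f1_manual = np.sum(f1_per_class * support / support.sum())
weighted_f1_sklearn = f1_score(y_true, y_pred, average="weighted")

print(weighted_f1_manual, weighted_f1_sklearn)  # the two values match
```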