| Title |
Cross-Attention Fusion for Audio-Visual Multimodal Emotion Recognition |
| Authors |
김정윤(Jeong-Yoon Kim) ; 이승호(Seung-Ho Lee) |
| DOI |
https://doi.org/10.5573/ieie.2026.63.4.110 |
| Keywords |
Multimodal emotion recognition; Transformer; Convolutional neural network; Self-supervised learning; Affective computing |
| Abstract |
In this paper, we propose a cross-attention fusion architecture for audio-visual multimodal emotion recognition that uses both face images and audio signals. The visual input is normalized to 224×224×3 through face detection and alignment with RetinaFace, and the audio input is converted into a time-dependent embedding sequence of shape (Batch, T, 1024) using the wav2vec2.0-large-robust pre-trained model. Each modality learns sequence-level features through a transformer encoder, and the cross-attention module selectively combines complementary information to produce richer multimodal representations than simple merging methods. To validate the proposed method, we conducted experiments on the CREMA-D dataset, which was split into training and test sets at a ratio of 80% to 20%. Accuracy and weighted F1-score were adopted as the main evaluation metrics in consideration of the class imbalance in the emotion data. The weighted F1-score is computed by taking the F1-score (the harmonic mean of precision and recall) for each class, weighting it by the proportion of samples belonging to that class, and summing the results. It therefore evaluates classification performance in a balanced manner even when certain emotion classes appear less frequently. Experimental results show that the proposed cross-attention-based multimodal model achieved an accuracy of 88.3% and a weighted F1-score of 0.883, a significant improvement over single-modality models and simple early/late fusion methods.
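The abstract does not give implementation details of the fusion module, so the following PyTorch-style sketch only illustrates one plausible wiring of a bidirectional cross-attention fusion block over per-modality encoder outputs. The audio feature dimension of 1024 follows the wav2vec2.0-large-robust embeddings mentioned above; the visual feature dimension (512), the shared model dimension, the head count, and the module names are assumptions for illustration, not the authors' actual configuration.

```python
# Hypothetical sketch of a bidirectional cross-attention fusion block.
# Dimensions (512 for the visual stream, 1024 for the wav2vec2.0 audio
# stream, 6 emotion classes as in CREMA-D) are assumptions for illustration.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, audio_dim=1024, visual_dim=512, d_model=256,
                 n_heads=4, num_classes=6):
        super().__init__()
        # Project both modalities into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Audio queries attend over visual keys/values, and vice versa.
        self.a2v_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.v2a_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.LayerNorm(2 * d_model),
            nn.Linear(2 * d_model, num_classes),
        )

    def forward(self, audio_seq, visual_seq):
        # audio_seq:  (Batch, T_a, 1024) wav2vec2.0 embedding sequence
        # visual_seq: (Batch, T_v, 512)  frame-level visual features
        a = self.audio_proj(audio_seq)
        v = self.visual_proj(visual_seq)
        # Cross-attention in both directions.
        a_enriched, _ = self.a2v_attn(query=a, key=v, value=v)
        v_enriched, _ = self.v2a_attn(query=v, key=a, value=a)
        # Temporal average pooling, then concatenation of the two streams.
        fused = torch.cat([a_enriched.mean(dim=1), v_enriched.mean(dim=1)], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = CrossAttentionFusion()
    audio = torch.randn(2, 49, 1024)   # e.g., roughly 1 s of wav2vec2.0 frames
    visual = torch.randn(2, 16, 512)   # e.g., 16 sampled video frames
    print(model(audio, visual).shape)  # torch.Size([2, 6])
```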
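Likewise, the weighted F1-score described above is the standard support-weighted average of per-class F1-scores. The short snippet below shows that computation explicitly; the use of scikit-learn and the toy labels are assumptions for illustration, not part of the paper's experimental setup.

```python
# Minimal illustration of the support-weighted F1-score: each class's F1
# is weighted by its share of the true labels, then the results are summed.
import numpy as np
from sklearn.metrics import f1_score, precision_recall_fscore_support

y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])  # toy, imbalanced labels
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 2, 0, 2])

# Per-class F1 and support (number of true samples per class).
_, _, f1_per_class, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1, 2], zero_division=0)

weighted_f1_manual = np.sum(f1_per_class * support / support.sum())
weighted_f1_sklearn = f1_score(y_true, y_pred, average="weighted")

print(weighted_f1_manual, weighted_f1_sklearn)  # the two values match
```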