Title |
Multi-resolution Audio Feature Analysis for Speech Emotion Recognition |
Authors |
서지영(Jiyoung Seo) ; 윤수연(Suyeon Yoon) ; 김태용(Taeyong Kim) ; 이보원(Bowon Lee) |
DOI |
https://doi.org/10.5573/ieie.2023.60.3.69 |
Keywords |
Speech emotion recognition; Categorical emotion; Short-time Fourier transform; MFCC; Mel spectrogram |
Abstract |
Recent emotion recognition studies have relied extensively on deep learning algorithms, especially convolutional and recurrent neural networks. Compared with the vast interest in deep learning model architectures and applications, the intrinsic features of the audio signal itself have been relatively less explored for speech emotion recognition (SER). In this paper, we explore the effect of various time and frequency resolutions of commonly used audio features, including the short-time Fourier transform, Mel spectrogram, and MFCC, on SER. The experiments are conducted on two publicly available datasets, namely EmoDB and IEMOCAP, using a deep neural network based on Conformer. Experimental results are presented for single features at various time and frequency resolutions as well as for combinations of multiple resolutions. The highest unweighted accuracies (UAs) are obtained with a multi-resolution short-time Fourier transform with a hop size of 10 ms for EmoDB, and with a Mel spectrogram with a window size of 32 ms and a hop size of 10 ms for IEMOCAP. |
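To make the notion of multiple time and frequency resolutions concrete, the following is a minimal sketch, not the authors' implementation. It computes magnitude short-time Fourier transforms with several window sizes and a shared 10 ms hop. The 32 ms window and 10 ms hop come from the abstract; the 16 ms and 64 ms windows, the 16 kHz sample rate, the Hann window, the reflect padding, and the function names are assumptions chosen for illustration.

```python
import numpy as np


def stft_magnitude(signal, sample_rate, window_ms, hop_ms):
    """Magnitude STFT with a Hann window; window and hop are given in ms."""
    win = int(round(sample_rate * window_ms / 1000))
    hop = int(round(sample_rate * hop_ms / 1000))
    # Centre-pad so that every window size yields the same number of frames
    # for a given hop (an assumption that makes resolutions easy to combine).
    padded = np.pad(signal, win // 2, mode="reflect")
    n_frames = 1 + (len(padded) - win) // hop
    window = np.hanning(win)
    frames = np.stack(
        [padded[i * hop : i * hop + win] * window for i in range(n_frames)]
    )
    # One-sided spectrum: win // 2 + 1 frequency bins per frame.
    return np.abs(np.fft.rfft(frames, axis=1))


def multi_resolution_stft(signal, sample_rate, window_sizes_ms=(16, 32, 64), hop_ms=10):
    """Return a dict mapping window size (ms) to its magnitude spectrogram."""
    return {
        w: stft_magnitude(signal, sample_rate, w, hop_ms) for w in window_sizes_ms
    }


# Example: one second of a 440 Hz tone sampled at 16 kHz (hypothetical input).
sr = 16000
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
feats = multi_resolution_stft(x, sr)
for w, spec in feats.items():
    print(w, spec.shape)
```

A longer window gives finer frequency resolution (more bins per frame) at the cost of coarser time localisation, while the shared hop keeps the frame count identical across resolutions, so the spectrograms can be stacked or fed to parallel model branches. How the paper actually combines the resolutions is not stated in the abstract.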