Title Zero-shot Voice Cloning based Emotion-preserving Video Dubbing
Authors 서준혁(Junhyuk Seo) ; 김태민(Taemin Kim) ; 고혜정(Hyejung Ko) ; 오희석(Heeseok Oh)
DOI https://doi.org/10.5573/ieie.2026.63.3.53
Page pp.53-62
ISSN 2287-5026
Keywords Zero-shot voice cloning; Dubbing; Video translation; Emotion preservation
Abstract This paper presents an intelligent voice translation system designed to translate foreign-language audiovisual content into another language while preserving the original speaker’s voice identity and emotional expressivity. To achieve this, we employ a deep learning-based zero-shot voice cloning framework that overcomes the limitations inherent in conventional dubbing and subtitle approaches. The proposed system constitutes a multimodal processing pipeline integrating Demucs-based source separation, Whisper and Pyannote for speech segmentation and transcription, a CLAP-based Emotion Conditioning Module, and a ZONOS TTS model for zero-shot emotional speech synthesis. A major contribution of this work lies in the Emotion Conditioning Module, which leverages a CLAP text-audio encoder to embed the emotional state of the source speech into an 8-dimensional latent vector. This embedding serves as a conditioning signal for the ZONOS TTS, enabling emotion-preserving synthesis. In a user study conducted with 51 participants using clips from "The Simpsons," 82% of respondents rated the emotion-conditioned outputs as exhibiting "more natural and clearly expressed emotions." Furthermore, comparative evaluations against leading commercial services demonstrated a distinct advantage, achieving a 62.2% user preference rate for emotion expressivity. These findings empirically demonstrate that the proposed system effectively preserves the vocal individuality and emotional nuances of the original speaker across languages. A demonstration video is available at [https://m.site.naver.com/1T3Nv].
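Illustrative sketch The abstract does not specify how the Emotion Conditioning Module turns CLAP embeddings into the 8-dimensional vector. One plausible reading, shown below purely as an assumption, is zero-shot scoring: compare the audio embedding of a speech segment with the text embeddings of eight emotion prompts, then normalize the similarities with a softmax. The label names, the `emotion_vector` helper, the temperature value, and the toy embeddings are all hypothetical; real inputs would come from a CLAP audio encoder and text encoder, and the authors' actual method may differ.

```python
import math

# Hypothetical label set: the abstract only states that the vector has 8 dimensions.
EMOTIONS = ["happiness", "sadness", "disgust", "fear",
            "surprise", "anger", "other", "neutral"]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def emotion_vector(audio_emb, text_embs, temperature=0.1):
    """Softmax over audio-to-prompt similarities (an assumed scheme, not the paper's)."""
    if len(text_embs) != len(EMOTIONS):
        raise ValueError("expected one text embedding per emotion label")
    scores = [cosine(audio_emb, t) / temperature for t in text_embs]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy stand-ins for CLAP outputs: one-hot "text" embeddings and an
# "audio" embedding that lies closest to the prompt for "anger".
text_embs = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
audio_emb = [0.1] * 8
audio_emb[EMOTIONS.index("anger")] = 0.9

vec = emotion_vector(audio_emb, text_embs)
dominant = EMOTIONS[vec.index(max(vec))]
```

In a pipeline like the one described, a vector of this kind would be computed per speech segment and passed to the TTS model as its emotion conditioning input; the temperature controls how sharply the distribution concentrates on the dominant emotion.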