Title Motion-disentangled Diffusion Model for High-fidelity Talking Head Generation
Authors 김세연 (Se-yeon Kim); 박상헌 (Seong-hyun Park); 김해문 (Hae-moon Kim); 이태영 (Tae-young Lee); 김승룡 (Seung-ryong Kim)
DOI https://doi.org/10.5573/ieie.2024.61.11.92
Pages 92-98
ISSN 2287-5026
Keywords Talking head generation; Video generation; Diffusion-based generative model
Abstract Conventional GAN-based models for talking head generation often suffer from limited quality and unstable training. Recent diffusion-based approaches aim to address these limitations and improve fidelity, but they face challenges of their own, including long sampling times and difficulty maintaining temporal consistency due to the high stochasticity of the diffusion process. To overcome these challenges, we propose a novel motion-disentangled diffusion model for high-quality talking head generation, dubbed MoDiTalker. We introduce two modules: audio-to-motion (AToM), which generates lip motion synchronized with the input audio, and motion-to-video (MToV), which produces a high-quality talking head video that follows the generated motion. AToM captures subtle lip movements through an audio attention mechanism, while MToV enhances temporal consistency with an efficient tri-plane representation. Experiments on standard benchmarks demonstrate that our model outperforms existing models, and we provide comprehensive ablation studies and user-study results.
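For intuition, below is a minimal, hypothetical PyTorch sketch of the two ideas the abstract names: an AToM-style denoiser whose motion queries cross-attend to audio features, and a toy tri-plane factorization of a spatio-temporal feature volume of the kind MToV conditions on. All module internals, dimensions (e.g., 68-point landmarks, 128-d audio features), and the TriPlaneVolume helper are illustrative assumptions, not the authors' implementation; the MToV video denoiser itself is omitted.

```python
import torch
import torch.nn as nn

class AToM(nn.Module):
    """Audio-to-Motion sketch: predicts diffusion noise on a lip-motion
    sequence, with motion queries cross-attending to audio features
    (the abstract's "audio attention" mechanism)."""
    def __init__(self, motion_dim=136, audio_dim=128, hidden=256):
        super().__init__()
        self.motion_in = nn.Linear(motion_dim, hidden)
        self.audio_in = nn.Linear(audio_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.out = nn.Linear(hidden, motion_dim)

    def forward(self, noisy_motion, audio_feats):
        # noisy_motion: (B, T, motion_dim); audio_feats: (B, T, audio_dim)
        q = self.motion_in(noisy_motion)
        kv = self.audio_in(audio_feats)
        attended, _ = self.attn(q, kv, kv)  # lip motion attends to audio
        return self.out(attended)           # predicted per-frame noise

class TriPlaneVolume(nn.Module):
    """Toy tri-plane: a (ch, T, H, W) spatio-temporal feature volume stored
    as three 2-D planes (HW, TH, TW) and recombined by broadcast addition,
    illustrating the efficient representation the abstract credits to MToV."""
    def __init__(self, ch=16, T=25, H=64, W=64):
        super().__init__()
        self.hw = nn.Parameter(torch.randn(ch, H, W) * 0.01)  # spatial plane
        self.th = nn.Parameter(torch.randn(ch, T, H) * 0.01)  # time-height plane
        self.tw = nn.Parameter(torch.randn(ch, T, W) * 0.01)  # time-width plane

    def forward(self):
        # Three O(N^2) planes stand in for one O(N^3) volume.
        return (self.hw[:, None, :, :]     # (ch, 1, H, W)
                + self.th[:, :, :, None]   # (ch, T, H, 1)
                + self.tw[:, :, None, :])  # (ch, T, 1, W)

if __name__ == "__main__":
    atom = AToM()
    audio = torch.randn(2, 25, 128)    # 25 frames of audio features
    motion = torch.randn(2, 25, 136)   # noisy 68-point (x, y) landmarks
    print(atom(motion, audio).shape)   # torch.Size([2, 25, 136])
    print(TriPlaneVolume()().shape)    # torch.Size([16, 25, 64, 64])
```

Presumably, disentangling generation this way keeps the heavily stochastic diffusion sampling in a low-dimensional motion space, so the video stage only has to render a fixed motion trajectory, which is consistent with the temporal-consistency gains the abstract reports.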