Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

Xiuzhe Wu1 Pengfei Hu2 Yang Wu3,4 Xiaoyang Lyu1 Yan-Pei Cao3 Ying Shan3 Wenming Yang2 Zhongqian Sun4 Xiaojuan Qi1
1The University of Hong Kong, 2Tsinghua University, 3ARC Lab, Tencent PCG, 4Tencent AI Lab
ICCV 2023

Overview

Given speech as input, our model generates high-quality talking-head videos and supports pose-controllable synthesis. The decomposition and synthesis modules make learning from a short video more effective, and the composition module enables us to synthesize high-fidelity videos.


Abstract

Synthesizing realistic videos from a given speech is still an open challenge. Previous works have been plagued by issues such as inaccurate lip shape generation and poor image quality. The key reason is that only motions and appearances on limited facial areas (e.g., lip area) are mainly driven by the input speech. Therefore, directly learning a mapping function from speech to the entire head image is prone to ambiguity, particularly when using a short video for training. We thus propose a decomposition-synthesis-composition framework named Speech-to-Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance to facilitate effective learning from limited training data, resulting in the generation of natural-looking videos. First, given a fixed head pose (i.e., canonical space), we present a speech-driven implicit model for lip image generation which concentrates on learning speech-sensitive motion and appearance. Next, to model the major speech-insensitive motion (i.e., head movement), we introduce a geometry-aware mutual explicit mapping (GAMEM) module that establishes geometric mappings between different head poses. This allows us to paste generated lip images at the canonical space onto head images with arbitrary poses and synthesize talking videos with natural head movements. In addition, a Blend-Net and a contrastive sync loss are introduced to enhance the overall synthesis performance. Quantitative and qualitative results on three benchmarks demonstrate that our model can be trained on a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.



Motivation

Our motivation lies in the fact that talking videos contain multiple types of motion, including lip, head, and torso movements, yet only specific areas, such as the lips, are closely tied to speech. Simplifying the modeling of motion and appearance in these speech-sensitive areas is crucial for effective learning from short videos. To accomplish this, we warp captured images with varying head poses (observed views) to a fixed head pose (canonical view) using conventional structure-from-motion techniques. Please see the figure below for a detailed depiction of the warping process.


Computing speech-sensitivity heatmaps in both the observed and canonical views illustrates why this helps. In the left (observed-view) heatmap, many areas appear to be influenced by speech because other motions are entangled with it. In the right (canonical-view) heatmap, eliminating those other motions reveals that only limited areas are highly speech-sensitive.
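
To make the warping step concrete, below is a minimal sketch of inverse-warping an observed-view frame into the canonical view given a per-pixel depth map, camera intrinsics, and a relative head pose. It uses a standard back-project / transform / re-project formulation with illustrative variable names (img_obs, depth_canon, K, R, t); it is an assumption-laden approximation for exposition, not the paper's released implementation.

# A minimal sketch of warping an observed-view frame into the canonical view
# using a per-pixel depth map, camera intrinsics, and a relative head pose.
# Variable names are illustrative assumptions; the paper's GAMEM module
# establishes this geometric mapping, which is approximated here with a
# standard inverse-warping formulation.
import torch
import torch.nn.functional as F

def warp_to_canonical(img_obs, depth_canon, K, R, t):
    """img_obs: (1, 3, H, W) observed frame; depth_canon: (1, 1, H, W) canonical-view depth;
    K: (3, 3) intrinsics; R, t: pose mapping canonical-view points into the observed view."""
    _, _, H, W = img_obs.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().reshape(3, -1)  # (3, H*W)
    # Back-project canonical pixels to 3D, move them into the observed camera, re-project.
    pts_canon = torch.linalg.inv(K) @ pix * depth_canon.reshape(1, -1)
    pts_obs = R @ pts_canon + t.reshape(3, 1)
    proj = K @ pts_obs
    uv = proj[:2] / proj[2:].clamp(min=1e-6)
    # Normalize to [-1, 1] for grid_sample and pull colors from the observed frame.
    grid = torch.stack([uv[0] / (W - 1) * 2 - 1, uv[1] / (H - 1) * 2 - 1], dim=-1)
    grid = grid.reshape(1, H, W, 2)
    return F.grid_sample(img_obs, grid, align_corners=True)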


Overall Framework

Our framework first decomposes speech-sensitive and speech-insensitive motions/appearances, models them individually (the major part of (a)), and then composes the final output image (b). The synchronization performance enhancement module and the GAMEM are illustrated in (c) and (d), respectively. The inputs include continuous pixel coordinates, speech audio signals, and timestamps. The speech-driven implicit model generates speech-sensitive canonical-view lip images, which are then transformed into the observed space to composite the final output image. A full-head depth map is learned during training, supporting pose-controllable synthesis.
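
As a rough illustration of how the decomposition-synthesis-composition pieces fit together, the sketch below wires a speech-driven implicit model, a geometric canonical-to-observed warp, and a blending network into one forward pass. The class and argument names (Speech2LipSketch, implicit_model, warp_fn, blend_net) are hypothetical stand-ins chosen for clarity; the actual interfaces in the paper's code may differ.

# A minimal sketch of the decomposition-synthesis-composition forward pass.
# The sub-modules are injected as arguments; their names and signatures are
# assumptions for illustration, not the paper's actual implementation.
import torch.nn as nn

class Speech2LipSketch(nn.Module):
    def __init__(self, implicit_model: nn.Module, blend_net: nn.Module, warp_fn):
        super().__init__()
        self.implicit_model = implicit_model  # speech + pixel coords -> canonical-view lip image
        self.blend_net = blend_net            # blends the warped lip region into the head frame
        self.warp_fn = warp_fn                # GAMEM-style canonical-to-observed geometric mapping

    def forward(self, speech_feat, pixel_coords, head_frame, depth, pose):
        # (a) Synthesis in canonical space: only speech-sensitive motion/appearance is modeled here.
        lip_canon = self.implicit_model(speech_feat, pixel_coords)
        # (b) Composition: map the canonical lip image to the observed head pose, then blend it
        # into the full head frame to produce the final output image.
        lip_obs = self.warp_fn(lip_canon, depth, pose)
        return self.blend_net(lip_obs, head_frame)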

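The synchronization enhancement in (c) relies on a contrastive sync loss between audio and lip-image features. A common way to realize such a loss is an InfoNCE-style objective over time-aligned audio/lip pairs; the sketch below shows that formulation under assumed encoder outputs and an assumed temperature value, and is not necessarily the exact loss used in the paper.

# A minimal sketch of a contrastive audio-visual sync loss (InfoNCE-style).
# The inputs and the temperature value are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_sync_loss(audio_feat, lip_feat, temperature=0.07):
    """audio_feat, lip_feat: (B, D) embeddings of time-aligned audio windows and lip crops.
    Matched pairs (same index) are positives; all other pairs in the batch are negatives."""
    audio_feat = F.normalize(audio_feat, dim=-1)
    lip_feat = F.normalize(lip_feat, dim=-1)
    logits = audio_feat @ lip_feat.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(audio_feat.size(0), device=logits.device)
    # Symmetric cross-entropy: audio-to-lip and lip-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
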

Main Results

Qualitative Results


Quantitative Results

BibTeX

@article{wu2023speech2lip,
  title={Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video},
  author={Wu, Xiuzhe and Hu, Pengfei and Wu, Yang and Lyu, Xiaoyang and Cao, Yan-Pei and Shan, Ying and Yang, Wenming and Sun, Zhongqian and Qi, Xiaojuan},
  journal={arXiv preprint arXiv:2309.04814},
  year={2023}
}