Motivation
Our motivation stems from the observation that talking videos contain several types of motion, including lip, head, and torso movements, yet only specific regions, such as the lips, are closely tied to speech. Simplifying the modeling of motion and appearance in these speech-sensitive regions is crucial for effective learning from short videos. To this end, we warp captured images with varying head poses (observed views) into alignment with a fixed head pose (canonical view) using conventional Structure-from-Motion (SfM) techniques. Please see the figure below for a detailed depiction of the warping process.
This alignment lets us compute speech-sensitivity heatmaps in both the observed and canonical views, giving a clearer picture of the video content. In the left image (observed view), many areas appear correlated with speech because head and torso motions are entangled with lip motion. In the right image (canonical view), removing these other motions exposes a small set of regions with high speech sensitivity.
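As a rough illustrative sketch only (not the method used here, which relies on Structure-from-Motion correspondences), the observed-to-canonical warp can be approximated with a rotation-only homography when the pose difference is dominated by head rotation. The function name `canonicalize` and the parameters `K` (camera intrinsics) and `R_obs` (observed head rotation relative to the canonical pose) are hypothetical, and the planar, rotation-only approximation is an assumption for illustration:

```python
import numpy as np

def canonicalize(image, K, R_obs):
    """Warp an observed-view image toward a canonical head pose.

    Illustrative sketch: assumes the observed/canonical pose difference
    is a pure rotation R_obs, so the warp reduces to the homography
    H = K @ R_obs @ inv(K). A real pipeline would instead use dense 3D
    correspondences recovered by Structure-from-Motion.
    """
    H, W = image.shape[:2]
    Hmat = K @ R_obs @ np.linalg.inv(K)  # maps canonical pixels to observed pixels
    ys, xs = np.mgrid[0:H, 0:W]
    ones = np.ones_like(xs)
    # Homogeneous pixel coordinates of the canonical (output) grid.
    pix = np.stack([xs, ys, ones], axis=-1).reshape(-1, 3).T
    src = Hmat @ pix                      # sample locations in the observed view
    src = (src[:2] / src[2]).round().astype(int)
    x_s, y_s = src[0].reshape(H, W), src[1].reshape(H, W)
    valid = (x_s >= 0) & (x_s < W) & (y_s >= 0) & (y_s < H)
    out = np.zeros_like(image)
    out[valid] = image[y_s[valid], x_s[valid]]  # nearest-neighbour inverse warp
    return out
```

With the identity rotation the warp is a no-op, which is a convenient sanity check; in practice one would substitute the head pose estimated per frame.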