Unsupervised Audiovisual Synthesis via Exemplar Autoencoders
Kangle Deng
Aayush Bansal
Deva Ramanan
[GitHub]
[Paper]

We train Exemplar Autoencoders for arbitrarily many target speakers using only ~3 minutes of speech per speaker, without any additional information.


Once trained, an Exemplar Autoencoder can be used for various applications, such as assistive speech synthesis, stylized text-to-speech, voice conversion, and audiovisual synthesis (all demonstrated below):



Abstract

We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially-infinitely many output speakers. Our approach builds on simple autoencoders that project out-of-sample data onto the distribution of the training set. We use exemplar autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target exemplar speech. In contrast to existing methods, the proposed approach can be easily extended to an arbitrarily large number of speakers and styles using only 3 minutes of target audio-video data, without requiring any training data for the input speaker.
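
The mechanism can be made concrete with a minimal sketch. The code below is illustrative rather than the paper's exact architecture: a plain autoencoder over log-mel spectrograms is trained to reconstruct a single exemplar speaker, and its limited capacity and learned prior are what project out-of-sample inputs back onto that speaker's distribution.

# Minimal exemplar-autoencoder sketch (illustrative layer sizes, not the
# paper's exact design). Trained on ~3 minutes of one speaker's audio.
import torch
import torch.nn as nn

class ExemplarAutoencoder(nn.Module):
    def __init__(self, n_mels=80, bottleneck=64):
        super().__init__()
        # Encoder compresses the spectrogram into a narrow bottleneck.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, bottleneck, kernel_size=5, padding=2),
        )
        # Decoder reconstructs a spectrogram in the exemplar's voice.
        self.decoder = nn.Sequential(
            nn.Conv1d(bottleneck, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, mel):  # mel: (batch, n_mels, time)
        return self.decoder(self.encoder(mel))

def train_step(model, optimizer, mel):
    # One reconstruction step; at test time, any speaker's mel can be fed in.
    optimizer.zero_grad()
    loss = nn.functional.l1_loss(model(mel), mel)
    loss.backward()
    optimizer.step()
    return loss.item()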


K. Deng, A. Bansal, D. Ramanan
Unsupervised Audiovisual Synthesis via Exemplar Autoencoders.
In ICLR, 2021.

[Bibtex]

Summary Video




Assistive Tool for the Speech Impaired

Towards a Natural Voice for the Speech Impaired

Here is a woman who lost her speaking voice to throat cancer and now must rely on an electrolarynx to produce utterances. Our system provides her with a way to speak more naturally.

Stylize Text-To-Speech (TTS) Output

Our system can be used to stylize text-to-speech output without any transcribed speech data.
Input text: "This sentence is generated by a TTS system."
[Audio: TTS output vs. our stylized output]


Beyond Language Constraints

We can even input speech in a completely different language, e.g., Chinese or Hindi.
[Audio: Chinese and Hindi inputs and the corresponding converted outputs]

Don't use your voice as a password!




Voice Conversion


Our method applies to in-the-wild audio, whereas embedding-based voice conversion methods (e.g., AutoVC) struggle with voices that differ from their training set. We contrast our method with the off-the-shelf AutoVC model by converting input audio to John Oliver's voice.
[Audio: two input clips, each converted to John Oliver's voice by AutoVC and by our method]
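
Inference-time conversion is just a forward pass through the target speaker's trained autoencoder, as the sketch below shows. wav_to_mel and vocoder again denote assumed standard components (any mel extractor and neural vocoder), not the paper's exact ones.

# Voice-conversion sketch: project an out-of-sample utterance through the
# target speaker's autoencoder; helper components are assumptions.
import torch

@torch.no_grad()
def convert(model, input_wav, wav_to_mel, vocoder):
    mel = wav_to_mel(input_wav)  # (1, n_mels, time), any speaker
    projected = model(mel)       # projected onto the exemplar's distribution
    return vocoder(projected)    # waveform in the target voice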


Our exemplar autoencoders can be trained with only a modest amount of data, as little as a few minutes of speech. Below we show results for a list of speakers from our CelebAudio dataset (for research purposes only).

[Audio samples: Takeo Kanade, Oprah Winfrey, Carl Sagan, Alan Kay, Claude Shannon, Stephen Hawking]



Audiovisual Synthesis

We can convert the input speech of any individual into an audiovisual stream of any learned speaker, e.g., John Oliver.
[Video: input speech and the synthesized audiovisual output]
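
The structural idea can be sketched as one shared bottleneck feeding two decoders, one for audio and one for video. The layer shapes and the simple per-frame video head below are illustrative assumptions, not the paper's exact design.

# Audiovisual exemplar-autoencoder sketch (illustrative; see caveats above).
import torch
import torch.nn as nn

class AudiovisualExemplarAE(nn.Module):
    def __init__(self, n_mels=80, bottleneck=64, frame_hw=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, 256, 5, padding=2), nn.ReLU(),
            nn.Conv1d(256, bottleneck, 5, padding=2),
        )
        # Audio head reconstructs the spectrogram in the exemplar's voice.
        self.audio_decoder = nn.Sequential(
            nn.Conv1d(bottleneck, 256, 5, padding=2), nn.ReLU(),
            nn.Conv1d(256, n_mels, 5, padding=2),
        )
        # Video head maps each timestep's code to a low-res grayscale frame.
        self.video_decoder = nn.Sequential(
            nn.Conv1d(bottleneck, 512, 5, padding=2), nn.ReLU(),
            nn.Conv1d(512, frame_hw * frame_hw, 1),
        )
        self.frame_hw = frame_hw

    def forward(self, mel):  # mel: (batch, n_mels, time)
        z = self.encoder(mel)
        mel_out = self.audio_decoder(z)
        frames = self.video_decoder(z)  # (batch, H*W, time)
        b, _, t = frames.shape
        frames = frames.permute(0, 2, 1).reshape(b, t, self.frame_hw, self.frame_hw)
        return mel_out, frames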


Audio-to-Video Synthesis

If we constrain the input to the training speaker even at inference time, the network becomes capable of audio-to-video synthesis for that specific speaker. This application is useful for restoring video records of historical figures.


Take Winston Churchill's famous "end of the beginning" speech as an example: only audio recordings of the speech survive, with no accompanying video. With this technology, we can reconstruct a talking-head video of Churchill from the speech audio alone.



Broader Impact

Our work falls in line with a body of work on content generation that retargets video content, often considered in the context of facial puppeteering. While there exist many applications in entertainment, there is also significant potential for serious abuse. Our paper includes a discussion of recommended policies for content generation (Appendix A), as well as a forensic study (Appendix B) suggesting that synthetic audio content can be identified as such with high accuracy.

Acknowledgements

We thank Maneesh Agrawala and Fred Baik for motivating the use of electrolarynx samples for assistive applications. We thank Alan Black and David Forsyth for various discussions. We thank the authors of AutoVC for their related work. We also thank the members of Deva's Lab for helpful discussions. Finally, we thank the authors of Colorful Image Colorization for this webpage design.