Audio Samples from "StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech"

[arXiv] [GitHub Repo] [Gradio Demo]
Abstract: Text-to-Speech (TTS) has recently seen great progress in synthesizing high-quality speech owing to the rapid development of parallel TTS systems, but producing speech with naturalistic prosodic variations, speaking styles, and emotional tones remains challenging. Moreover, since duration and speech are generated separately, parallel TTS models still have problems finding the best monotonic alignments that are crucial for naturalistic speech synthesis. Here, we propose StyleTTS, a style-based generative model for parallel TTS that can synthesize diverse speech with natural prosody from a reference speech utterance. With a novel Transferable Monotonic Aligner (TMA) and duration-invariant data augmentation schemes, our method significantly outperforms state-of-the-art models on both single- and multi-speaker datasets in subjective tests of speech naturalness and speaker similarity. Through self-supervised learning of speaking styles, our model can synthesize speech with the same prosodic and emotional tone as any given reference speech without the need for explicitly labeling these categories.

As a first demonstration, the abstract above is fed to each model as one long input (see Section 4 for more on long-text robustness):

StyleTTS VITS FastSpeech 2 Tacotron 2

This page contains a set of audio samples in support of the paper. Some examples are randomly selected directly from the sets we used for evaluation.

All utterances were unseen during training, and the results are uncurated (NOT cherry-picked) unless otherwise specified.

For more samples, you can download our metadata here; it contains all the audio files used for evaluation and the survey results.

Contents

1. Single Speaker (LJSpeech)
2. Style-Enabled Diverse Speech Synthesis
3. Emotional Speech Synthesis
4. Additional Example for Robustness
5. Multi Speaker (LibriTTS)
6. Zero-Shot Speaker Adaptation
7. Any-to-Any Voice Conversion
8. Ablation Study

1. Single Speaker (LJSpeech)


Text: After this the other conspirators traveled to obtain genuine bills and master the system of the leading houses at home and abroad.

GT StyleTTS VITS FastSpeech 2 Tacotron 2

Text: This is proved by contemporary accounts, especially one graphic and realistic article which appeared in the 'Times,'

GT StyleTTS VITS FastSpeech 2 Tacotron 2

Text: Solomons, while waiting to appear in court, persuaded the turnkeys to take him to a public-house, where all might "refresh."

GT StyleTTS VITS FastSpeech 2 Tacotron 2

2. Style-Enabled Diverse Speech Synthesis


In this section, we show the variation in our synthesized speech using the single-speaker model trained on the LJSpeech dataset. Examples 2 and 3 are used in Figure 3 of our paper. Results in this section are cherry-picked, since we needed references different enough from one another to showcase the diversity of our model; the reference-driven sampling is sketched below.
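Concretely, each "Synthesized i" clip pairs the same input text with the style extracted from reference i. The loop is sketched here; `load_styletts`, `style_encoder`, `synthesize`, `reference_wavs`, and `save_wav` are hypothetical stand-ins for the trained model's components, not the repo's actual API.

```python
import torch

# Hypothetical names throughout -- see the GitHub repo for the real
# inference code; only the structure of the loop is the point here.
model = load_styletts("Models/LJSpeech")           # assumed checkpoint path
text = "How much variation is there? Let's find out."

with torch.no_grad():
    for i, ref in enumerate(reference_wavs, 1):    # the five reference utterances
        style = model.style_encoder(ref)           # one fixed-length style vector per reference
        wav = model.synthesize(text, style)        # same text, different prosody
        save_wav(wav, f"synthesized_{i}.wav")      # hypothetical helper
```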

How much variation is there? Let's find out.

Synthesized 1 Synthesized 2 Synthesized 3 Synthesized 4 Synthesized 5
Reference 1 Reference 2 Reference 3 Reference 4 Reference 5

3. Emotional Speech Synthesis


This section contains samples of one speaker from the Emotional Speech Dataset (ESD) in five different emotions. For each emotion, the reference was randomly chosen from the training set (a sketch of this draw is given below). The same references and the same text were also given to the single-speaker model trained on LJSpeech. Note that the single-speaker model can recognize emotions from speakers unseen during training and synthesize speech matching the emotion of the reference. Five samples synthesized using single-speaker VITS with the stochastic duration predictor (SDP) are provided for comparison.
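For concreteness, the per-emotion random draw can be written as below. The directory layout (one folder per emotion under a speaker's directory) is an assumption about a local copy of ESD, and filtering to the training split is omitted; adjust the paths to your setup.

```python
import random
from pathlib import Path

# Assumed layout: ESD/<speaker_id>/<Emotion>/*.wav
ESD_SPEAKER = Path("ESD/0015")
EMOTIONS = ["Surprise", "Angry", "Neutral", "Happy", "Sad"]

random.seed(0)  # fix the seed so the draw is reproducible
references = {
    emo: random.choice(sorted((ESD_SPEAKER / emo).glob("*.wav")))
    for emo in EMOTIONS
}
for emo, wav in references.items():
    print(f"{emo}: {wav.name}")
```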

Text: In which fox loses a tail and its elder sister finds one.

Surprise

GT Reference StyleTTS (ESD) StyleTTS (LJSpeech) VITS with SDP (LJSpeech)

Angry

GT Reference StyleTTS (ESD) StyleTTS (LJSpeech) VITS with SDP (LJSpeech)

Neutral

GT Reference StyleTTS (ESD) StyleTTS (LJSpeech) VITS with SDP (LJSpeech)

Happy

GT Reference StyleTTS (ESD) StyleTTS (LJSpeech) VITS with SDP (LJSpeech)

Sad

GT Reference StyleTTS (ESD) StyleTTS (LJSpeech) VITS with SDP (LJSpeech)

4. Additional Example for Robustness


Our model is more robust to long text inputs. The example of our abstract at the top of this page is a long text fed directly into each model without being split into sentences; a sketch of the two input-handling strategies follows the samples below. The following example was synthesized in the same manner. The text is the first paragraph of our paper:

Text-to-speech, also known as speech synthesis, aims to synthesize natural and intelligible speech from a given text. The recent advances in deep learning have resulted in great progress in TTS technologies, to the extent that several recent studies claim to have synthesized speech qualitatively similar to real human speech. However, it remains a challenge to synthesize expressive speech that can accurately capture the extremely rich diversity occurring naturally in the prosodic, temporal, and spectral characteristics of speech, which together encode the paralinguistic information. For example, the same text can be spoken in many ways depending on the context, the emotional tone, and the dialectal and habitual speaking patterns of a speaker. Hence, TTS is by nature a one-to-many mapping problem that needs to be addressed as such.

StyleTTS VITS FastSpeech 2 Tacotron 2
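To make the comparison concrete, the sketch below contrasts the usual workaround (sentence splitting) with what this page does (one long input). The NLTK tokenizer is real; `synthesize` is a hypothetical stand-in for a model's inference call, which is why those lines are left commented out.

```python
import nltk

nltk.download("punkt", quiet=True)  # sentence-tokenizer data

paragraph = open("first_paragraph.txt").read()  # the paragraph quoted above

# Common workaround for fragile models: synthesize sentence by sentence,
# then concatenate the resulting waveforms.
sentences = nltk.sent_tokenize(paragraph)
# waveform = concatenate(synthesize(s) for s in sentences)

# What is demonstrated here instead: the entire paragraph in a single call,
# with no splitting. `synthesize` is hypothetical, not the repo's API.
# waveform = synthesize(paragraph)
```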

5. Multi Speaker (LibriTTS)


Text: Out of the heart of one of the new lights There came a voice, that needle to the star Made me appear in turning thitherward.

GT StyleTTS VITS FastSpeech 2

Text: In aristocratic countries there are few public officers who do not affect to serve their country without interested motives.

GT StyleTTS VITS FastSpeech 2

Text: In his youth his gun had been his best friend; but the chase demands much of legs and muscles and heart.

GT StyleTTS VITS FastSpeech 2

6. Zero-Shot Speaker Adaptation


In this section, we show some examples of zero-shot speaker adaptation using our model trained on LibriTTS for speakers from the VCTK dataset, unseen during training. We compare our model with YourTTS and Meta-StyleSpeech (Min et al.).
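Our paper evaluates speaker similarity with human raters; as an informal sanity check, one can instead compare speaker embeddings of the reference and the synthesized output, for example with the third-party Resemblyzer package. This is not part of our evaluation protocol, and the file names below are placeholders.

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

# Embed the unseen speaker's reference and the zero-shot synthesis.
ref_embed = encoder.embed_utterance(preprocess_wav("vctk_reference.wav"))
syn_embed = encoder.embed_utterance(preprocess_wav("styletts_zeroshot.wav"))

# Resemblyzer embeddings are L2-normalized, so the dot product is the
# cosine similarity (closer to 1.0 = more similar voices).
print(f"speaker similarity: {float(np.dot(ref_embed, syn_embed)):.3f}")
```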

Text: The difference in the rainbow depends considerably upon the size of the drops, and the width of the colored band increases as the size of the drops increases.

GT StyleTTS YourTTS Meta-StyleSpeech

Text: Aristotle thought that the rainbow was caused by reflection of the sun's rays by the rain.

GT StyleTTS YourTTS Meta-StyleSpeech

Text: The actual primary rainbow observed is said to be the effect of super-imposition of a number of bows.

GT StyleTTS YourTTS Meta-StyleSpeech

7. Any-to-Any Voice Conversion


In this section, we show some examples of any-to-any voice conversion using our model trained on LibriTTS. The source and target speakers are from LJSpeech and VCTK and were unseen during training. Leading and trailing silence was trimmed for better alignment; a minimal trimming snippet is given below.
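The trimming step is standard; a minimal version using librosa is sketched here. The exact threshold we used is not recorded on this page, so top_db=30 is an assumption.

```python
import librosa
import soundfile as sf

# Load at the file's native sample rate and trim leading/trailing silence.
wav, sr = librosa.load("source.wav", sr=None)
trimmed, _ = librosa.effects.trim(wav, top_db=30)  # threshold is an assumption
sf.write("source_trimmed.wav", trimmed, sr)
```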

Source Reference Converted

8. Ablation Study


Text: The essential point to be remembered is that the ornament, whatever it is, whether picture or pattern-work, should form part of the page,

GT Baseline w/ 100% hard alignment w/ 0% hard alignment w/o monotonic loss
w/o S2S loss w/o pitch extractor w/o pre-trained aligner w/o augmentation w/o discriminator
w/o residual AdaIN -> AdaLN AdaIN -> Concat. AdaIN -> IN

Text: The boy declared he saw no one, and accordingly passed through without paying the toll of a penny.

GT Baseline w/ 100% hard alignment w/ 0% hard alignment w/o monotonic loss
w/o S2S loss w/o pitch extractor w/o pre-trained aligner w/o augmentation w/o discriminator
w/o residual AdaIN -> AdaLN AdaIN -> Concat. AdaIN -> IN