StyleTTS | VITS | FastSpeech 2 | Tacotron 2 |
---|---|---|---|
This page contains a set of audio samples in support of the paper. Some examples are randomly selected directly from the sets we used for evaluation.
All utterances were unseen during training, and the results are uncurated (NOT cherry-picked) unless otherwise specified.
For more samples, you can download our metadata, which contains all audio samples used for evaluation along with the survey results, here.
Text: After this the other conspirators traveled to obtain genuine bills and master the system of the leading houses at home and abroad.
GT | StyleTTS | VITS | FastSpeech 2 | Tacotron 2 |
---|---|---|---|---|
Text: This is proved by contemporary accounts, especially one graphic and realistic article which appeared in the 'Times,'
GT | StyleTTS | VITS | FastSpeech 2 | Tacotron 2 |
---|---|---|---|---|
Text: Solomons, while waiting to appear in court, persuaded the turnkeys to take him to a public-house, where all might "refresh."
GT | StyleTTS | VITS | FastSpeech 2 | Tacotron 2 |
---|---|---|---|---|
In this section, we show the variation in speech synthesized by our single-speaker model trained on the LJSpeech dataset. Examples 2 and 3 are used in Figure 3 of our paper. Results in this section are cherry-picked, since we needed references different enough from one another to illustrate the diversity of our model.
How much variation is there? Let's find out.
Synthesized 1 | Synthesized 2 | Synthesized 3 | Synthesized 4 | Synthesized 5 |
---|---|---|---|---|
Reference 1 | Reference 2 | Reference 3 | Reference 4 | Reference 5 |
---|---|---|---|---|
This section contains samples of one speaker from ESD (Emotional Speech Dataset) in five different emotions. The reference for each emotion was randomly chosen from the training set. The same references and the same text were also given to the single-speaker model trained on LJSpeech. Note that the single-speaker model can recognize emotions from speakers unseen during training and synthesize speech matching the emotion of the reference. Five audio clips synthesized with single-speaker VITS (with SDP) are provided for comparison.
Text: In which fox loses a tail and its elder sister finds one.
Surprise
GT | Reference | StyleTTS (ESD) | StyleTTS (LJSpeech) | VITS with SDP (LJSpeech) |
---|---|---|---|---|
Angry
GT | Reference | StyleTTS (ESD) | StyleTTS (LJSpeech) | VITS with SDP (LJSpeech) |
---|---|---|---|---|
Neutral
GT | Reference | StyleTTS (ESD) | StyleTTS (LJSpeech) | VITS with SDP (LJSpeech) |
---|---|---|---|---|
Happy
GT | Reference | StyleTTS (ESD) | StyleTTS (LJSpeech) | VITS with SDP (LJSpeech) |
---|---|---|---|---|
Sad
GT | Reference | StyleTTS (ESD) | StyleTTS (LJSpeech) | VITS with SDP (LJSpeech) |
---|---|---|---|---|
Our model is more robust to long text input. In the example above, our full abstract was fed directly into the different models without breaking it up. The following example was synthesized in the same manner, using the first paragraph of our paper as input:
Text-to-speech, also known as speech synthesis, aims to synthesize natural and intelligible speech from a given text. The recent advances in deep learning have resulted in great progress in TTS technologies, to the extent that several recent studies claim to have synthesized speech qualitatively similar to real human speech. However, it still remains a challenge to synthesize expressive speech that can accurately capture the extremely rich diversity occurring naturally in the prosodic, temporal, and spectral characteristics of speech, which together encode the paralinguistic information. For example, the same text can be spoken in many ways depending on the context, the emotional tone, and the dialectic and habitual speaking patterns of a speaker. Hence, TTS is by nature a one-to-many mapping problem that needs to be addressed as such.
StyleTTS | VITS | FastSpeech 2 | Tacotron 2 |
---|---|---|---|
Text: Out of the heart of one of the new lights There came a voice, that needle to the star Made me appear in turning thitherward.
GT | StyleTTS | VITS | FastSpeech 2 |
---|---|---|---|
Text: In aristocratic countries there are few public officers who do not affect to serve their country without interested motives.
GT | StyleTTS | VITS | FastSpeech 2 |
---|---|---|---|
Text: In his youth his gun had been his best friend; but the chase demands much of legs and muscles and heart.
GT | StyleTTS | VITS | FastSpeech 2 |
---|---|---|---|
In this section, we show some examples of zero-shot speaker adaptation using our model trained on LibriTTS, evaluated on speakers from the VCTK dataset. We compare our model with YourTTS and Meta-StyleSpeech (Min et al.).
Text: The difference in the rainbow depends considerably upon the size of the drops, and the width of the colored band increases as the size of the drops increases.
GT | StyleTTS | YourTTS | Meta-StyleSpeech |
---|---|---|---|
Text: Aristotle thought that the rainbow was caused by reflection of the sun's rays by the rain.
GT | StyleTTS | YourTTS | Meta-StyleSpeech |
---|---|---|---|
Text: The actual primary rainbow observed is said to be the effect of super-imposition of a number of bows.
GT | StyleTTS | YourTTS | Meta-StyleSpeech |
---|---|---|---|
Source | Reference | Converted |
---|---|---|
Text: The essential point to be remembered is that the ornament, whatever it is, whether picture or pattern-work, should form part of the page,
GT | Baseline | w/ 100% hard alignment | w/ 0% hard alignment | w/o monotonic loss |
---|---|---|---|---|
w/o S2S loss | w/o pitch extractor | w/o pre-trained aligner | w/o augmentation | w/o discriminator |
---|---|---|---|---|
w/o residual | AdaIN -> AdaLN | AdaIN -> Concat. | AdaIN -> IN |
---|---|---|---|
Text: The boy declared he saw no one, and accordingly passed through without paying the toll of a penny.
GT | Baseline | w/ 100% hard alignment | w/ 0% hard alignment | w/o monotonic loss |
---|---|---|---|---|
w/o S2S loss | w/o pitch extractor | w/o pre-trained aligner | w/o augmentation | w/o discriminator |
---|---|---|---|---|
w/o residual | AdaIN -> AdaLN | AdaIN -> Concat. | AdaIN -> IN |
---|---|---|---|