diff --git a/docs/source/models/vits.md b/docs/source/models/vits.md
index 5c0e92f6..0c303f7a 100644
--- a/docs/source/models/vits.md
+++ b/docs/source/models/vits.md
@@ -3,10 +3,15 @@
 VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
 ) is an End-to-End (encoder -> vocoder together) TTS model that takes advantage of SOTA DL techniques like GANs, VAE,
 Normalizing Flows. It does not require external alignment annotations and learns the text-to-audio alignment
-using MAS as explained in the paper. The model architecture is a combination of GlowTTS encoder and HiFiGAN vocoder.
+using MAS, as explained in the paper. The model architecture is a combination of GlowTTS encoder and HiFiGAN vocoder.
 It is a feed-forward model with x67.12 real-time factor on a GPU.
 
+🐸 YourTTS is a multi-speaker and multi-lingual TTS model that can perform voice conversion and zero-shot speaker adaptation.
+It can also learn a new language or voice from an audio clip of roughly one minute. This opens the door to training
+TTS models for low-resource languages. 🐸 YourTTS uses VITS as the backbone architecture, coupled with a speaker encoder model.
+
 ## Important resources & papers
+- 🐸 YourTTS: https://arxiv.org/abs/2112.02418
 - VITS: https://arxiv.org/pdf/2106.06103.pdf
 - Neural Spline Flows: https://arxiv.org/abs/1906.04032
 - Variational Autoencoder: https://arxiv.org/pdf/1312.6114.pdf
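
The added paragraph describes zero-shot speaker adaptation with 🐸 YourTTS. For context, below is a minimal, hedged sketch of how this is typically driven from the 🐸TTS Python API; the model name `tts_models/multilingual/multi-dataset/your_tts`, the reference clip path, and the exact keyword arguments are assumptions and may differ between releases.

```python
# Minimal sketch: zero-shot speaker adaptation with YourTTS via the 🐸TTS Python API.
# The model name, file paths, and keyword arguments below are assumptions and may
# vary across package versions.
from TTS.api import TTS

# Load the multilingual YourTTS checkpoint (VITS backbone + speaker encoder).
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")

# Zero-shot speaker adaptation: condition synthesis on a short reference clip
# (about a minute or less) of an unseen speaker.
tts.tts_to_file(
    text="This sentence is synthesized in the voice of the reference speaker.",
    speaker_wav="reference_speaker.wav",  # hypothetical path to the reference clip
    language="en",
    file_path="yourtts_output.wav",
)
```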