diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index d4a8cf00..2b3a9737 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -11,30 +11,25 @@ You can contribute not only with code but with bug reports, comments, questions, If you'd like to contribute code or squash a bug but don't know where to start, here are some pointers. -- [Development Road Map](https://github.com/coqui-ai/TTS/issues/378) - - You can pick something out of our road map. We keep the progess of the project in this simple issue thread. It has new model proposals or developmental updates etc. - - [Github Issues Tracker](https://github.com/idiap/coqui-ai-TTS/issues) This is a place to find feature requests, bugs. - Issues with the ```good first issue``` tag are good place for beginners to take on. - -- ✨**PR**✨ [pages](https://github.com/idiap/coqui-ai-TTS/pulls) with the ```🚀new version``` tag. - - We list all the target improvements for the next version. You can pick one of them and start contributing. + Issues with the ```good first issue``` tag are a good place for beginners to + take on. Issues tagged with `help wanted` are suited for more experienced + outside contributors. - Also feel free to suggest new features, ideas and models. We're always open to new things. -## Call for sharing language models +## Call for sharing pretrained models If possible, please consider sharing your pre-trained models in any language (if the licences allow you to do so). We will include them in our model catalogue for public use and give the proper attribution, whether it be your name, company, website or any other source specified. Models can be shared in two ways: 1. Share the model files with us and we serve them with the next 🐸 TTS release. 2. Upload your models on GDrive and share the link. -Models are served under `.models.json` file and any model is available under TTS CLI or Server end points. +Models are served via the `.models.json` file and any model is available through the +TTS CLI and Python API endpoints. Whichever way you choose, please make sure you send the models [here](https://github.com/coqui-ai/TTS/discussions/930). @@ -135,7 +130,8 @@ curl -LsSf https://astral.sh/uv/install.sh | sh 13. Let's discuss until it is perfect. 💪 - We might ask you for certain changes that would appear in the ✨**PR**✨'s page under 🐸TTS[https://github.com/idiap/coqui-ai-TTS/pulls]. + We might ask you for certain changes that would appear on the + [Github ✨**PR**✨ page](https://github.com/idiap/coqui-ai-TTS/pulls). 14. Once things look perfect, we merge it to the ```dev``` branch and make it ready for the next version. @@ -143,9 +139,9 @@ curl -LsSf https://astral.sh/uv/install.sh | sh If you prefer working within a Docker container as your development environment, you can do the following: -1. Fork 🐸TTS[https://github.com/idiap/coqui-ai-TTS] by clicking the fork button at the top right corner of the project page. +1. Fork the 🐸TTS [Github repository](https://github.com/idiap/coqui-ai-TTS) by clicking the fork button at the top right corner of the page. -2. Clone 🐸TTS and add the main repo as a new remote named ```upsteam```. +2. Clone 🐸TTS and add the main repo as a new remote named ```upstream```. ```bash git clone git@github.com:/coqui-ai-TTS.git diff --git a/README.md b/README.md index 7dddf3a3..5ab60dd3 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,10 @@ - -## 🐸Coqui TTS News +# 🐸Coqui TTS +## News - 📣 Fork of the [original, unmaintained repository](https://github.com/coqui-ai/TTS).
New PyPI package: [coqui-tts](https://pypi.org/project/coqui-tts) - 📣 [OpenVoice](https://github.com/myshell-ai/OpenVoice) models now available for voice conversion. - 📣 Prebuilt wheels are now also published for Mac and Windows (in addition to Linux as before) for easier installation across platforms. -- 📣 ⓍTTSv2 is here with 17 languages and better performance across the board. ⓍTTS can stream with <200ms latency. -- 📣 ⓍTTS fine-tuning code is out. Check the [example recipes](https://github.com/idiap/coqui-ai-TTS/tree/dev/recipes/ljspeech). -- 📣 [🐶Bark](https://github.com/suno-ai/bark) is now available for inference with unconstrained voice cloning. [Docs](https://coqui-tts.readthedocs.io/en/latest/models/bark.html) +- 📣 XTTSv2 is here with 17 languages and better performance across the board. XTTS can stream with <200ms latency. +- 📣 XTTS fine-tuning code is out. Check the [example recipes](https://github.com/idiap/coqui-ai-TTS/tree/dev/recipes/ljspeech). - 📣 You can use [Fairseq models in ~1100 languages](https://github.com/facebookresearch/fairseq/tree/main/examples/mms) with 🐸TTS. ## @@ -21,6 +20,7 @@ ______________________________________________________________________ [![Discord](https://img.shields.io/discord/1037326658807533628?color=%239B59B6&label=chat%20on%20discord)](https://discord.gg/5eXr5seRrv) +![PyPI - Python Version](https://img.shields.io/pypi/pyversions/coqui-tts) [![License]()](https://opensource.org/licenses/MPL-2.0) [![PyPI version](https://badge.fury.io/py/coqui-tts.svg)](https://badge.fury.io/py/coqui-tts) [![Downloads](https://pepy.tech/badge/coqui-tts)](https://pepy.tech/project/coqui-tts) @@ -63,71 +63,65 @@ repository are also still a useful source of information. | 🚀 **Released Models** | [Standard models](https://github.com/idiap/coqui-ai-TTS/blob/dev/TTS/.models.json) and [Fairseq models in ~1100 languages](https://github.com/idiap/coqui-ai-TTS#example-text-to-speech-using-fairseq-models-in-1100-languages-)| ## Features -- High-performance Deep Learning models for Text2Speech tasks. See lists of models below. -- Fast and efficient model training. -- Detailed training logs on the terminal and Tensorboard. -- Support for Multi-speaker TTS. -- Efficient, flexible, lightweight but feature complete `Trainer API`. +- High-performance text-to-speech and voice conversion models, see list below. +- Fast and efficient model training with detailed training logs on the terminal and Tensorboard. +- Support for multi-speaker and multilingual TTS. - Released and ready-to-use models. -- Tools to curate Text2Speech datasets under```dataset_analysis```. -- Utilities to use and test your models. +- Tools to curate TTS datasets under ```dataset_analysis/```. +- Command line and Python APIs to use and test your models. - Modular (but not too much) code base enabling easy implementation of new ideas. 
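The features above mention command line and Python APIs; as a minimal sketch of the Python side (the model name is just one entry from the released-models catalogue and may change, so check `tts --list_models` for the current list):

```python
from TTS.api import TTS

# List the released models, then load one and synthesize speech to a file.
print(TTS().list_models())
tts = TTS(model_name="tts_models/en/ljspeech/glow-tts")
tts.tts_to_file(text="Hello from Coqui TTS!", file_path="hello.wav")
```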
## Model Implementations ### Spectrogram models -- Tacotron: [paper](https://arxiv.org/abs/1703.10135) -- Tacotron2: [paper](https://arxiv.org/abs/1712.05884) -- Glow-TTS: [paper](https://arxiv.org/abs/2005.11129) -- Speedy-Speech: [paper](https://arxiv.org/abs/2008.03802) -- Align-TTS: [paper](https://arxiv.org/abs/2003.01950) -- FastPitch: [paper](https://arxiv.org/pdf/2006.06873.pdf) -- FastSpeech: [paper](https://arxiv.org/abs/1905.09263) -- FastSpeech2: [paper](https://arxiv.org/abs/2006.04558) -- SC-GlowTTS: [paper](https://arxiv.org/abs/2104.05557) -- Capacitron: [paper](https://arxiv.org/abs/1906.03402) -- OverFlow: [paper](https://arxiv.org/abs/2211.06892) -- Neural HMM TTS: [paper](https://arxiv.org/abs/2108.13320) -- Delightful TTS: [paper](https://arxiv.org/abs/2110.12612) +- [Tacotron](https://arxiv.org/abs/1703.10135), [Tacotron2](https://arxiv.org/abs/1712.05884) +- [Glow-TTS](https://arxiv.org/abs/2005.11129), [SC-GlowTTS](https://arxiv.org/abs/2104.05557) +- [Speedy-Speech](https://arxiv.org/abs/2008.03802) +- [Align-TTS](https://arxiv.org/abs/2003.01950) +- [FastPitch](https://arxiv.org/pdf/2006.06873.pdf) +- [FastSpeech](https://arxiv.org/abs/1905.09263), [FastSpeech2](https://arxiv.org/abs/2006.04558) +- [Capacitron](https://arxiv.org/abs/1906.03402) +- [OverFlow](https://arxiv.org/abs/2211.06892) +- [Neural HMM TTS](https://arxiv.org/abs/2108.13320) +- [Delightful TTS](https://arxiv.org/abs/2110.12612) ### End-to-End Models -- ⓍTTS: [blog](https://coqui.ai/blog/tts/open_xtts) -- VITS: [paper](https://arxiv.org/pdf/2106.06103) -- 🐸 YourTTS: [paper](https://arxiv.org/abs/2112.02418) -- 🐢 Tortoise: [orig. repo](https://github.com/neonbjb/tortoise-tts) -- 🐶 Bark: [orig. repo](https://github.com/suno-ai/bark) - -### Attention Methods -- Guided Attention: [paper](https://arxiv.org/abs/1710.08969) -- Forward Backward Decoding: [paper](https://arxiv.org/abs/1907.09006) -- Graves Attention: [paper](https://arxiv.org/abs/1910.10288) -- Double Decoder Consistency: [blog](https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/) -- Dynamic Convolutional Attention: [paper](https://arxiv.org/pdf/1910.10288.pdf) -- Alignment Network: [paper](https://arxiv.org/abs/2108.10447) - -### Speaker Encoder -- GE2E: [paper](https://arxiv.org/abs/1710.10467) -- Angular Loss: [paper](https://arxiv.org/pdf/2003.11982.pdf) +- [XTTS](https://arxiv.org/abs/2406.04904) +- [VITS](https://arxiv.org/pdf/2106.06103) +- 🐸[YourTTS](https://arxiv.org/abs/2112.02418) +- 🐢[Tortoise](https://github.com/neonbjb/tortoise-tts) +- 🐶[Bark](https://github.com/suno-ai/bark) ### Vocoders -- MelGAN: [paper](https://arxiv.org/abs/1910.06711) -- MultiBandMelGAN: [paper](https://arxiv.org/abs/2005.05106) -- ParallelWaveGAN: [paper](https://arxiv.org/abs/1910.11480) -- GAN-TTS discriminators: [paper](https://arxiv.org/abs/1909.11646) -- WaveRNN: [origin](https://github.com/fatchord/WaveRNN/) -- WaveGrad: [paper](https://arxiv.org/abs/2009.00713) -- HiFiGAN: [paper](https://arxiv.org/abs/2010.05646) -- UnivNet: [paper](https://arxiv.org/abs/2106.07889) +- [MelGAN](https://arxiv.org/abs/1910.06711) +- [MultiBandMelGAN](https://arxiv.org/abs/2005.05106) +- [ParallelWaveGAN](https://arxiv.org/abs/1910.11480) +- [GAN-TTS discriminators](https://arxiv.org/abs/1909.11646) +- [WaveRNN](https://github.com/fatchord/WaveRNN/) +- [WaveGrad](https://arxiv.org/abs/2009.00713) +- [HiFiGAN](https://arxiv.org/abs/2010.05646) +- [UnivNet](https://arxiv.org/abs/2106.07889) ### Voice Conversion -- FreeVC: 
[paper](https://arxiv.org/abs/2210.15418) -- OpenVoice: [technical report](https://arxiv.org/abs/2312.01479) +- [FreeVC](https://arxiv.org/abs/2210.15418) +- [OpenVoice](https://arxiv.org/abs/2312.01479) + +### Others +- Attention methods: [Guided Attention](https://arxiv.org/abs/1710.08969), + [Forward Backward Decoding](https://arxiv.org/abs/1907.09006), + [Graves Attention](https://arxiv.org/abs/1910.10288), + [Double Decoder Consistency](https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/), + [Dynamic Convolutional Attention](https://arxiv.org/pdf/1910.10288.pdf), + [Alignment Network](https://arxiv.org/abs/2108.10447) +- Speaker encoders: [GE2E](https://arxiv.org/abs/1710.10467), + [Angular Loss](https://arxiv.org/pdf/2003.11982.pdf) You can also help us implement more models. ## Installation -🐸TTS is tested on Ubuntu 22.04 with **python >= 3.9, < 3.13.**. +🐸TTS is tested on Ubuntu 24.04 with **python >= 3.9, < 3.13.**, but should also +work on Mac and Windows. -If you are only interested in [synthesizing speech](https://coqui-tts.readthedocs.io/en/latest/inference.html) with the released 🐸TTS models, installing from PyPI is the easiest option. +If you are only interested in [synthesizing speech](https://coqui-tts.readthedocs.io/en/latest/inference.html) with the pretrained 🐸TTS models, installing from PyPI is the easiest option. ```bash pip install coqui-tts @@ -172,14 +166,9 @@ make system-deps # intended to be used on Ubuntu (Debian). Let us know if you h make install ``` -If you are on Windows, 👑@GuyPaddock wrote installation instructions -[here](https://stackoverflow.com/questions/66726331/how-can-i-run-mozilla-tts-coqui-tts-training-with-cuda-on-a-windows-system) -(note that these are out of date, e.g. you need to have at least Python 3.9). - - ## Docker Image -You can also try TTS without install with the docker image. -Simply run the following command and you will be able to run TTS without installing it. +You can also try out Coqui TTS without installation with the docker image. +Simply run the following command and you will be able to run TTS: ```bash docker run --rm -it -p 5002:5002 --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu @@ -281,11 +270,12 @@ api.tts_to_file( -Synthesize speech on command line. +Synthesize speech on the command line. You can either use your trained model or choose a model from the provided list. -If you don't specify any models, then it uses LJSpeech based English model. +If you don't specify any models, then it uses a Tacotron2 English model trained +on LJSpeech. #### Single Speaker Models diff --git a/TTS/model.py b/TTS/model.py index c3707c85..779b1775 100644 --- a/TTS/model.py +++ b/TTS/model.py @@ -12,7 +12,7 @@ from trainer import TrainerModel class BaseTrainerModel(TrainerModel): """BaseTrainerModel model expanding TrainerModel with required functions by 🐸TTS. - Every new 🐸TTS model must inherit it. + Every new Coqui model must inherit it. """ @staticmethod diff --git a/TTS/tts/models/bark.py b/TTS/tts/models/bark.py index ced8f60e..c52c541b 100644 --- a/TTS/tts/models/bark.py +++ b/TTS/tts/models/bark.py @@ -206,12 +206,14 @@ class Bark(BaseTTS): speaker_wav (str): Path to the speaker audio file for cloning a new voice. It is cloned and saved in `voice_dirs` with the name `speaker_id`. Defaults to None. voice_dirs (List[str]): List of paths that host reference audio files for speakers. Defaults to None. 
- **kwargs: Model specific inference settings used by `generate_audio()` and `TTS.tts.layers.bark.inference_funcs.generate_text_semantic(). + **kwargs: Model specific inference settings used by `generate_audio()` and + `TTS.tts.layers.bark.inference_funcs.generate_text_semantic()`. Returns: - A dictionary of the output values with `wav` as output waveform, `deterministic_seed` as seed used at inference, - `text_input` as text token IDs after tokenizer, `voice_samples` as samples used for cloning, `conditioning_latents` - as latents used at inference. + A dictionary of the output values with `wav` as output waveform, + `deterministic_seed` as seed used at inference, `text_input` as text token IDs + after tokenizer, `voice_samples` as samples used for cloning, + `conditioning_latents` as latents used at inference. """ speaker_id = "random" if speaker_id is None else speaker_id diff --git a/TTS/tts/models/base_tts.py b/TTS/tts/models/base_tts.py index ccb023ce..33a75598 100644 --- a/TTS/tts/models/base_tts.py +++ b/TTS/tts/models/base_tts.py @@ -80,15 +80,17 @@ class BaseTTS(BaseTrainerModel): raise ValueError("config must be either a *Config or *Args") def init_multispeaker(self, config: Coqpit, data: List = None): - """Initialize a speaker embedding layer if needen and define expected embedding channel size for defining - `in_channels` size of the connected layers. + """Set up for multi-speaker TTS. + + Initialize a speaker embedding layer if needed and define expected embedding + channel size for defining `in_channels` size of the connected layers. This implementation yields 3 possible outcomes: - 1. If `config.use_speaker_embedding` and `config.use_d_vector_file are False, do nothing. + 1. If `config.use_speaker_embedding` and `config.use_d_vector_file` are False, do nothing. 2. If `config.use_d_vector_file` is True, set expected embedding channel size to `config.d_vector_dim` or 512. 3. If `config.use_speaker_embedding`, initialize a speaker embedding layer with channel size of - `config.d_vector_dim` or 512. + `config.d_vector_dim` or 512. You can override this function for new models. diff --git a/TTS/tts/models/overflow.py b/TTS/tts/models/overflow.py index ac09e406..1c146b2e 100644 --- a/TTS/tts/models/overflow.py +++ b/TTS/tts/models/overflow.py @@ -33,32 +33,33 @@ class Overflow(BaseTTS): Paper abstract:: Neural HMMs are a type of neural transducer recently proposed for - sequence-to-sequence modelling in text-to-speech. They combine the best features - of classic statistical speech synthesis and modern neural TTS, requiring less - data and fewer training updates, and are less prone to gibberish output caused - by neural attention failures. In this paper, we combine neural HMM TTS with - normalising flows for describing the highly non-Gaussian distribution of speech - acoustics. The result is a powerful, fully probabilistic model of durations and - acoustics that can be trained using exact maximum likelihood. Compared to - dominant flow-based acoustic models, our approach integrates autoregression for - improved modelling of long-range dependences such as utterance-level prosody. - Experiments show that a system based on our proposal gives more accurate - pronunciations and better subjective speech quality than comparable methods, - whilst retaining the original advantages of neural HMMs. Audio examples and code - are available at https://shivammehta25.github.io/OverFlow/. + sequence-to-sequence modelling in text-to-speech. 
They combine the best features + of classic statistical speech synthesis and modern neural TTS, requiring less + data and fewer training updates, and are less prone to gibberish output caused + by neural attention failures. In this paper, we combine neural HMM TTS with + normalising flows for describing the highly non-Gaussian distribution of speech + acoustics. The result is a powerful, fully probabilistic model of durations and + acoustics that can be trained using exact maximum likelihood. Compared to + dominant flow-based acoustic models, our approach integrates autoregression for + improved modelling of long-range dependences such as utterance-level prosody. + Experiments show that a system based on our proposal gives more accurate + pronunciations and better subjective speech quality than comparable methods, + whilst retaining the original advantages of neural HMMs. Audio examples and code + are available at https://shivammehta25.github.io/OverFlow/. Note: - - Neural HMMs uses flat start initialization i.e it computes the means and std and transition probabilities - of the dataset and uses them to initialize the model. This benefits the model and helps with faster learning If you change the dataset or want to regenerate the parameters change the `force_generate_statistics` and - `mel_statistics_parameter_path` accordingly. + - Neural HMM TTS uses flat start initialization, i.e. it computes the means, + stds and transition probabilities of the dataset and uses them to initialize + the model. This benefits the model and helps with faster learning. If you change + the dataset or want to regenerate the parameters, change + `force_generate_statistics` and `mel_statistics_parameter_path` accordingly. - To enable multi-GPU training, set `use_grad_checkpointing=False` in the config. - This will significantly increase the memory usage. This is because to compute - the actual data likelihood (not an approximation using MAS/Viterbi) we must use - all the states at the previous time step during the forward pass to decide the - probability distribution at the current step i.e the difference between the forward - algorithm and viterbi approximation. + This will significantly increase the memory usage. This is because to compute + the actual data likelihood (not an approximation using MAS/Viterbi) we must use + all the states at the previous time step during the forward pass to decide the + probability distribution at the current step, i.e. the difference between the forward + algorithm and the Viterbi approximation. Check :class:`TTS.tts.configs.overflow.OverFlowConfig` for class arguments. """ diff --git a/TTS/tts/models/tortoise.py b/TTS/tts/models/tortoise.py index 01629b5d..738e9dd9 100644 --- a/TTS/tts/models/tortoise.py +++ b/TTS/tts/models/tortoise.py @@ -423,7 +423,9 @@ class Tortoise(BaseTTS): Transforms one or more voice_samples into a tuple (autoregressive_conditioning_latent, diffusion_conditioning_latent). These are expressive learned latents that encode aspects of the provided clips like voice, intonation, and acoustic properties. - :param voice_samples: List of arbitrary reference clips, which should be *pairs* of torch tensors containing arbitrary kHz waveform data. + + :param voice_samples: List of arbitrary reference clips, which should be *pairs* + of torch tensors containing arbitrary kHz waveform data.
:param latent_averaging_mode: 0/1/2 for following modes: 0 - latents will be generated as in original tortoise, using ~4.27s from each voice sample, averaging latent across all samples 1 - latents will be generated using (almost) entire voice samples, averaged across all the ~4.27s chunks @@ -671,7 +673,7 @@ class Tortoise(BaseTTS): As cond_free_k increases, the output becomes dominated by the conditioning-free signal. diffusion_temperature: (float) Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0 are the "mean" prediction of the diffusion network and will sound bland and smeared. - hf_generate_kwargs: (**kwargs) The huggingface Transformers generate API is used for the autoregressive transformer. + hf_generate_kwargs: (`**kwargs`) The huggingface Transformers generate API is used for the autoregressive transformer. Extra keyword args fed to this function get forwarded directly to that API. Documentation here: https://huggingface.co/docs/transformers/internal/generation_utils diff --git a/TTS/tts/models/xtts.py b/TTS/tts/models/xtts.py index f05863ae..395208cc 100644 --- a/TTS/tts/models/xtts.py +++ b/TTS/tts/models/xtts.py @@ -178,7 +178,7 @@ class XttsArgs(Coqpit): class Xtts(BaseTTS): - """ⓍTTS model implementation. + """XTTS model implementation. ❗ Currently it only supports inference. @@ -460,7 +460,7 @@ class Xtts(BaseTTS): gpt_cond_chunk_len: (int) Chunk length used for cloning. It must be <= `gpt_cond_len`. If gpt_cond_len == gpt_cond_chunk_len, no chunking. Defaults to 6 seconds. - hf_generate_kwargs: (**kwargs) The huggingface Transformers generate API is used for the autoregressive + hf_generate_kwargs: (`**kwargs`) The huggingface Transformers generate API is used for the autoregressive transformer. Extra keyword args fed to this function get forwarded directly to that API. Documentation here: https://huggingface.co/docs/transformers/internal/generation_utils diff --git a/docs/source/configuration.md b/docs/source/configuration.md index ada61e16..220c96c3 100644 --- a/docs/source/configuration.md +++ b/docs/source/configuration.md @@ -1,6 +1,6 @@ # Configuration -We use 👩‍✈️[Coqpit] for configuration management. It provides basic static type checking and serialization capabilities on top of native Python `dataclasses`. Here is how a simple configuration looks like with Coqpit. +We use 👩‍✈️[Coqpit](https://github.com/idiap/coqui-ai-coqpit) for configuration management. It provides basic static type checking and serialization capabilities on top of native Python `dataclasses`. Here is how a simple configuration looks like with Coqpit. ```python from dataclasses import asdict, dataclass, field @@ -36,7 +36,7 @@ class SimpleConfig(Coqpit): check_argument("val_c", c, restricted=True) ``` -In TTS, each model must have a configuration class that exposes all the values necessary for its lifetime. +In Coqui, each model must have a configuration class that exposes all the values necessary for its lifetime. It defines model architecture, hyper-parameters, training, and inference settings. For our models, we merge all the fields in a single configuration class for ease. It may not look like a wise practice but enables easier bookkeeping and reproducible experiments. 
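To make that round trip concrete, here is a minimal sketch of saving and restoring such a config; the config class, field names and file path below are made up for illustration, and it only relies on Coqpit's `to_dict()`/`from_dict()` plus the standard `json` module:

```python
import json
from dataclasses import dataclass

from coqpit import Coqpit


@dataclass
class MyModelConfig(Coqpit):
    # Hypothetical fields - a real model config merges model, training and inference settings.
    hidden_channels: int = 256
    learning_rate: float = 1e-3
    run_name: str = "my_experiment"


config = MyModelConfig(learning_rate=3e-4)

# Serialize to plain JSON for bookkeeping and reproducibility.
with open("my_config.json", "w", encoding="utf-8") as f:
    json.dump(config.to_dict(), f, indent=2)

# Later, restore exactly the same settings for a reproducible run.
restored = MyModelConfig()
with open("my_config.json", encoding="utf-8") as f:
    restored.from_dict(json.load(f))
assert restored.learning_rate == 3e-4
```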
diff --git a/docs/source/docker_images.md b/docs/source/docker_images.md index 58d96120..042f9f8e 100644 --- a/docs/source/docker_images.md +++ b/docs/source/docker_images.md @@ -1,20 +1,20 @@ (docker_images)= -## Docker images +# Docker images We provide docker images to be able to test TTS without having to setup your own environment. -### Using premade images +## Using premade images You can use premade images built automatically from the latest TTS version. -#### CPU version +### CPU version ```bash docker pull ghcr.io/coqui-ai/tts-cpu ``` -#### GPU version +### GPU version ```bash docker pull ghcr.io/coqui-ai/tts ``` -### Building your own image +## Building your own image ```bash docker build -t tts . ``` diff --git a/docs/source/faq.md b/docs/source/faq.md index 1090aaa3..e0197cf7 100644 --- a/docs/source/faq.md +++ b/docs/source/faq.md @@ -1,4 +1,4 @@ -# Humble FAQ +# FAQ We tried to collect common issues and questions we receive about 🐸TTS. It is worth checking before going deeper. ## Errors with a pre-trained model. How can I resolve this? @@ -7,7 +7,7 @@ We tried to collect common issues and questions we receive about 🐸TTS. It is - If you feel like it's a bug to be fixed, then prefer Github issues with the same level of scrutiny. ## What are the requirements of a good 🐸TTS dataset? -* {ref}`See this page ` +- [See this page](what_makes_a_good_dataset.md) ## How should I choose the right model? - First, train Tacotron. It is smaller and faster to experiment with. If it performs poorly, try Tacotron2. @@ -61,7 +61,8 @@ We tried to collect common issues and questions we receive about 🐸TTS. It is - SingleGPU training: ```CUDA_VISIBLE_DEVICES="0" python train_tts.py --config_path config.json``` - MultiGPU training: ```python3 -m trainer.distribute --gpus "0,1" --script TTS/bin/train_tts.py --config_path config.json``` -**Note:** You can also train your model using pure 🐍 python. Check ```{eval-rst} :ref: 'tutorial_for_nervous_beginners'```. +**Note:** You can also train your model using pure 🐍 python. Check the +[tutorial](tutorial_for_nervous_beginners.md). ## How can I train in a different language? - Check steps 2, 3, 4, 5 above. @@ -104,7 +105,7 @@ The best approach is to pick a set of promising models and run a Mean-Opinion-Sc - Check the 4th step under "How can I check model performance?" ## How can I test a trained model? -- The best way is to use `tts` or `tts-server` commands. For details check {ref}`here `. +- The best way is to use `tts` or `tts-server` commands. For details check [here](inference.md). - If you need to code your own ```TTS.utils.synthesizer.Synthesizer``` class. ## My Tacotron model does not stop - I see "Decoder stopped with 'max_decoder_steps" - Stopnet does not work. diff --git a/docs/source/finetuning.md b/docs/source/finetuning.md index 548e385e..9c9f2c8d 100644 --- a/docs/source/finetuning.md +++ b/docs/source/finetuning.md @@ -1,4 +1,4 @@ -# Fine-tuning a 🐸 TTS model +# Fine-tuning a model ## Fine-tuning @@ -21,8 +21,9 @@ them and fine-tune it for your own dataset. This will help you in two main ways: Fine-tuning comes to the rescue in this case. You can take one of our pre-trained models and fine-tune it on your own speech dataset and achieve reasonable results with only a couple of hours of data. - However, note that, fine-tuning does not ensure great results. The model performance still depends on the - {ref}`dataset quality ` and the hyper-parameters you choose for fine-tuning. 
Therefore, + However, note that, fine-tuning does not ensure great results. The model + performance still depends on the [dataset quality](what_makes_a_good_dataset.md) + and the hyper-parameters you choose for fine-tuning. Therefore, it still takes a bit of tinkering. @@ -31,7 +32,7 @@ them and fine-tune it for your own dataset. This will help you in two main ways: 1. Setup your dataset. You need to format your target dataset in a certain way so that 🐸TTS data loader will be able to load it for the - training. Please see {ref}`this page ` for more information about formatting. + training. Please see [this page](formatting_your_dataset.md) for more information about formatting. 2. Choose the model you want to fine-tune. @@ -47,7 +48,8 @@ them and fine-tune it for your own dataset. This will help you in two main ways: You should choose the model based on your requirements. Some models are fast and some are better in speech quality. One lazy way to test a model is running the model on the hardware you want to use and see how it works. For - simple testing, you can use the `tts` command on the terminal. For more info see {ref}`here `. + simple testing, you can use the `tts` command on the terminal. For more info + see [here](inference.md). 3. Download the model. diff --git a/docs/source/formatting_your_dataset.md b/docs/source/formatting_your_dataset.md index 23c497d0..7376ff66 100644 --- a/docs/source/formatting_your_dataset.md +++ b/docs/source/formatting_your_dataset.md @@ -1,5 +1,5 @@ (formatting_your_dataset)= -# Formatting Your Dataset +# Formatting your dataset For training a TTS model, you need a dataset with speech recordings and transcriptions. The speech must be divided into audio clips and each clip needs transcription. @@ -49,7 +49,7 @@ The format above is taken from widely-used the [LJSpeech](https://keithito.com/L Your dataset should have good coverage of the target language. It should cover the phonemic variety, exceptional sounds and syllables. This is extremely important for especially non-phonemic languages like English. -For more info about dataset qualities and properties check our [post](https://github.com/coqui-ai/TTS/wiki/What-makes-a-good-TTS-dataset). +For more info about dataset qualities and properties check [this page](what_makes_a_good_dataset.md). ## Using Your Dataset in 🐸TTS diff --git a/docs/source/implementing_a_new_language_frontend.md b/docs/source/implementing_a_new_language_frontend.md index 2041352d..0b3ef59b 100644 --- a/docs/source/implementing_a_new_language_frontend.md +++ b/docs/source/implementing_a_new_language_frontend.md @@ -1,6 +1,6 @@ -# Implementing a New Language Frontend +# Implementing new language front ends -- Language frontends are located under `TTS.tts.utils.text` +- Language front ends are located under `TTS.tts.utils.text` - Each special language has a separate folder. - Each folder contains all the utilities for processing the text input. - `TTS.tts.utils.text.phonemizers` contains the main phonemizer for a language. This is the class that uses the utilities diff --git a/docs/source/implementing_a_new_model.md b/docs/source/implementing_a_new_model.md index 1bf7a882..a2721a1c 100644 --- a/docs/source/implementing_a_new_model.md +++ b/docs/source/implementing_a_new_model.md @@ -1,4 +1,4 @@ -# Implementing a Model +# Implementing new models 1. Implement layers. @@ -36,7 +36,7 @@ There is also the `callback` interface by which you can manipulate both the model and the `Trainer` states. 
Callbacks give you an infinite flexibility to add custom behaviours for your model and training routines. - For more details, see {ref}`BaseTTS ` and :obj:`TTS.utils.callbacks`. + For more details, see [BaseTTS](main_classes/model_api.md#base-tts-model) and :obj:`TTS.utils.callbacks`. 6. Optionally, define `MyModelArgs`. @@ -62,7 +62,7 @@ We love you more when you document your code. ❤️ -# Template 🐸TTS Model implementation +## Template 🐸TTS Model implementation You can start implementing your model by copying the following base class. diff --git a/docs/source/index.md b/docs/source/index.md index 79993eec..8924fdc8 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -5,58 +5,57 @@ ---- # Documentation Content -```{eval-rst} -.. toctree:: - :maxdepth: 2 - :caption: Get started - - tutorial_for_nervous_beginners - installation - faq - contributing - -.. toctree:: - :maxdepth: 2 - :caption: Using 🐸TTS - - inference - docker_images - implementing_a_new_model - implementing_a_new_language_frontend - training_a_model - finetuning - configuration - formatting_your_dataset - what_makes_a_good_dataset - tts_datasets - marytts - -.. toctree:: - :maxdepth: 2 - :caption: Main Classes - - main_classes/trainer_api - main_classes/audio_processor - main_classes/model_api - main_classes/dataset - main_classes/gan - main_classes/speaker_manager - -.. toctree:: - :maxdepth: 2 - :caption: `tts` Models - - models/glow_tts.md - models/vits.md - models/forward_tts.md - models/tacotron1-2.md - models/overflow.md - models/tortoise.md - models/bark.md - models/xtts.md - -.. toctree:: - :maxdepth: 2 - :caption: `vocoder` Models +```{toctree} +:maxdepth: 1 +:caption: Get started +tutorial_for_nervous_beginners +installation +docker_images +faq +contributing +``` + +```{toctree} +:maxdepth: 1 +:caption: Using Coqui + +inference +training_a_model +finetuning +implementing_a_new_model +implementing_a_new_language_frontend +formatting_your_dataset +what_makes_a_good_dataset +tts_datasets +marytts +``` + + +```{toctree} +:maxdepth: 1 +:caption: Main Classes + +configuration +main_classes/trainer_api +main_classes/audio_processor +main_classes/model_api +main_classes/dataset +main_classes/gan +main_classes/speaker_manager +``` + + +```{toctree} +:maxdepth: 1 +:caption: TTS Models + +models/glow_tts.md +models/vits.md +models/forward_tts.md +models/tacotron1-2.md +models/overflow.md +models/tortoise.md +models/bark.md +models/xtts.md ``` diff --git a/docs/source/inference.md b/docs/source/inference.md index 4cb8f45a..4556643c 100644 --- a/docs/source/inference.md +++ b/docs/source/inference.md @@ -1,5 +1,5 @@ (synthesizing_speech)= -# Synthesizing Speech +# Synthesizing speech First, you need to install TTS. We recommend using PyPi. You need to call the command below: @@ -136,7 +136,7 @@ wav = tts.tts(text="Hello world!", speaker_wav="my/cloning/audio.wav", language= tts.tts_to_file(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav") ``` -#### Here is an example for a single speaker model. +### Single speaker model. 
```python # Init TTS with the target model name @@ -145,7 +145,7 @@ tts = TTS(model_name="tts_models/de/thorsten/tacotron2-DDC", progress_bar=False) tts.tts_to_file(text="Ich bin eine Testnachricht.", file_path=OUTPUT_PATH) ``` -#### Example voice cloning with YourTTS in English, French and Portuguese: +### Voice cloning with YourTTS in English, French and Portuguese: ```python tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts", progress_bar=False).to("cuda") @@ -154,14 +154,14 @@ tts.tts_to_file("C'est le clonage de la voix.", speaker_wav="my/cloning/audio.wa tts.tts_to_file("Isso é clonagem de voz.", speaker_wav="my/cloning/audio.wav", language="pt", file_path="output.wav") ``` -#### Example voice conversion converting speaker of the `source_wav` to the speaker of the `target_wav` +### Voice conversion from the speaker of `source_wav` to the speaker of `target_wav` ```python tts = TTS(model_name="voice_conversion_models/multilingual/vctk/freevc24", progress_bar=False).to("cuda") tts.voice_conversion_to_file(source_wav="my/source.wav", target_wav="my/target.wav", file_path="output.wav") ``` -#### Example voice cloning by a single speaker TTS model combining with the voice conversion model. +### Voice cloning by combining single speaker TTS model with the voice conversion model. This way, you can clone voices by using any model in 🐸TTS. @@ -174,7 +174,7 @@ tts.tts_with_vc_to_file( ) ``` -#### Example text to speech using **Fairseq models in ~1100 languages** 🤯. +### Text to speech using **Fairseq models in ~1100 languages** 🤯. For these models use the following name format: `tts_models//fairseq/vits`. You can find the list of language ISO codes [here](https://dl.fbaipublicfiles.com/mms/tts/all-tts-languages.html) and learn about the Fairseq models [here](https://github.com/facebookresearch/fairseq/tree/main/examples/mms). diff --git a/docs/source/installation.md b/docs/source/installation.md index 405c4366..5becc28b 100644 --- a/docs/source/installation.md +++ b/docs/source/installation.md @@ -1,6 +1,7 @@ # Installation -🐸TTS supports python >=3.9 <3.13.0 and was tested on Ubuntu 22.04. +🐸TTS supports python >=3.9 <3.13.0 and was tested on Ubuntu 24.04, but should +also run on Mac and Windows. ## Using `pip` @@ -33,8 +34,3 @@ make install # Same as above + dev dependencies and pre-commit make install_dev ``` - -## On Windows -If you are on Windows, 👑@GuyPaddock wrote installation instructions -[here](https://stackoverflow.com/questions/66726331/) (note that these are out -of date, e.g. you need to have at least Python 3.9) diff --git a/docs/source/main_classes/model_api.md b/docs/source/main_classes/model_api.md index 71b3d416..bb7e9d1a 100644 --- a/docs/source/main_classes/model_api.md +++ b/docs/source/main_classes/model_api.md @@ -1,22 +1,22 @@ # Model API Model API provides you a set of functions that easily make your model compatible with the `Trainer`, -`Synthesizer` and `ModelZoo`. +`Synthesizer` and the Coqui Python API. -## Base TTS Model +## Base Trainer Model ```{eval-rst} .. autoclass:: TTS.model.BaseTrainerModel :members: ``` -## Base tts Model +## Base TTS Model ```{eval-rst} .. autoclass:: TTS.tts.models.base_tts.BaseTTS :members: ``` -## Base vocoder Model +## Base Vocoder Model ```{eval-rst} .. 
autoclass:: TTS.vocoder.models.base_vocoder.BaseVocoder diff --git a/docs/source/main_classes/trainer_api.md b/docs/source/main_classes/trainer_api.md index 335294aa..bdb6048e 100644 --- a/docs/source/main_classes/trainer_api.md +++ b/docs/source/main_classes/trainer_api.md @@ -1,3 +1,3 @@ # Trainer API -We made the trainer a separate project on https://github.com/eginhard/coqui-trainer +We made the trainer a separate project: https://github.com/idiap/coqui-ai-Trainer diff --git a/docs/source/marytts.md b/docs/source/marytts.md index 9091ca33..11cf4a2b 100644 --- a/docs/source/marytts.md +++ b/docs/source/marytts.md @@ -1,4 +1,4 @@ -# Mary-TTS API Support for Coqui-TTS +# Mary-TTS API support for Coqui TTS ## What is Mary-TTS? diff --git a/docs/source/models/xtts.md b/docs/source/models/xtts.md index 7c0f1c4a..96f5bb7c 100644 --- a/docs/source/models/xtts.md +++ b/docs/source/models/xtts.md @@ -1,25 +1,25 @@ -# ⓍTTS -ⓍTTS is a super cool Text-to-Speech model that lets you clone voices in different languages by using just a quick 3-second audio clip. Built on the 🐢Tortoise, -ⓍTTS has important model changes that make cross-language voice cloning and multi-lingual speech generation super easy. +# XTTS +XTTS is a super cool Text-to-Speech model that lets you clone voices in different languages by using just a quick 3-second audio clip. Built on the 🐢Tortoise, +XTTS has important model changes that make cross-language voice cloning and multi-lingual speech generation super easy. There is no need for an excessive amount of training data that spans countless hours. -### Features +## Features - Voice cloning. - Cross-language voice cloning. - Multi-lingual speech generation. - 24khz sampling rate. -- Streaming inference with < 200ms latency. (See [Streaming inference](#streaming-inference)) +- Streaming inference with < 200ms latency. (See [Streaming inference](#streaming-manually)) - Fine-tuning support. (See [Training](#training)) -### Updates with v2 +## Updates with v2 - Improved voice cloning. - Voices can be cloned with a single audio file or multiple audio files, without any effect on the runtime. - Across the board quality improvements. -### Code +## Code Current implementation only supports inference and GPT encoder training. -### Languages +## Languages XTTS-v2 supports 17 languages: - Arabic (ar) @@ -40,15 +40,15 @@ XTTS-v2 supports 17 languages: - Spanish (es) - Turkish (tr) -### License +## License This model is licensed under [Coqui Public Model License](https://coqui.ai/cpml). -### Contact +## Contact Come and join in our 🐸Community. We're active on [Discord](https://discord.gg/fBC58unbKE) and [Github](https://github.com/idiap/coqui-ai-TTS/discussions). 
-### Inference +## Inference -#### 🐸TTS Command line +### 🐸TTS Command line You can check all supported languages with the following command: @@ -64,7 +64,7 @@ You can check all Coqui available speakers with the following command: --list_speaker_idx ``` -##### Coqui speakers +#### Coqui speakers You can do inference using one of the available speakers using the following command: ```console @@ -75,10 +75,10 @@ You can do inference using one of the available speakers using the following com --use_cuda ``` -##### Clone a voice +#### Clone a voice You can clone a speaker voice using a single or multiple references: -###### Single reference +##### Single reference ```console tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \ @@ -88,7 +88,7 @@ You can clone a speaker voice using a single or multiple references: --use_cuda ``` -###### Multiple references +##### Multiple references ```console tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \ --text "Bugün okula gitmek istemiyorum." \ @@ -106,12 +106,12 @@ or for all wav files in a directory you can use: --use_cuda ``` -#### 🐸TTS API +### 🐸TTS API -##### Clone a voice +#### Clone a voice You can clone a speaker voice using a single or multiple references: -###### Single reference +##### Single reference Splits the text into sentences and generates audio for each sentence. The audio files are then concatenated to produce the final audio. You can optionally disable sentence splitting for better coherence but more VRAM and possibly hitting models context length limit. @@ -129,7 +129,7 @@ tts.tts_to_file(text="It took me quite a long time to develop a voice, and now t ) ``` -###### Multiple references +##### Multiple references You can pass multiple audio files to the `speaker_wav` argument for better voice cloning. @@ -154,7 +154,7 @@ tts.tts_to_file(text="It took me quite a long time to develop a voice, and now t language="en") ``` -##### Coqui speakers +#### Coqui speakers You can do inference using one of the available speakers using the following code: @@ -172,11 +172,11 @@ tts.tts_to_file(text="It took me quite a long time to develop a voice, and now t ``` -#### 🐸TTS Model API +### 🐸TTS Model API To use the model API, you need to download the model files and pass config and model file paths manually. -#### Manual Inference +### Manual Inference If you want to be able to `load_checkpoint` with `use_deepspeed=True` and **enjoy the speedup**, you need to install deepspeed first. @@ -184,7 +184,7 @@ If you want to be able to `load_checkpoint` with `use_deepspeed=True` and **enjo pip install deepspeed==0.10.3 ``` -##### inference parameters +#### Inference parameters - `text`: The text to be synthesized. - `language`: The language of the text to be synthesized. @@ -199,7 +199,7 @@ pip install deepspeed==0.10.3 - `enable_text_splitting`: Whether to split the text into sentences and generate audio for each sentence. It allows you to have infinite input length but might loose important context between sentences. Defaults to True. -##### Inference +#### Inference ```python @@ -231,7 +231,7 @@ torchaudio.save("xtts.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000) ``` -##### Streaming manually +#### Streaming manually Here the goal is to stream the audio as it is being generated. This is useful for real-time applications. Streaming inference is typically slower than regular inference, but it allows to get a first chunk of audio faster. 
@@ -275,9 +275,9 @@ torchaudio.save("xtts_streaming.wav", wav.squeeze().unsqueeze(0).cpu(), 24000) ``` -### Training +## Training -#### Easy training +### Easy training To make `XTTS_v2` GPT encoder training easier for beginner users we did a gradio demo that implements the whole fine-tuning pipeline. The gradio demo enables the user to easily do the following steps: - Preprocessing of the uploaded audio or audio files in 🐸 TTS coqui formatter @@ -286,7 +286,7 @@ To make `XTTS_v2` GPT encoder training easier for beginner users we did a gradio The user can run this gradio demo locally or remotely using a Colab Notebook. -##### Run demo on Colab +#### Run demo on Colab To make the `XTTS_v2` fine-tuning more accessible for users that do not have good GPUs available we did a Google Colab Notebook. The Colab Notebook is available [here](https://colab.research.google.com/drive/1GiI4_X724M8q2W-zZ-jXo7cWTV7RfaH-?usp=sharing). @@ -302,7 +302,7 @@ If you are not able to acess the video you need to follow the steps: 5. Soon the training is done you can go to the third Tab (3 - Inference) and then click on the button "Step 3 - Load Fine-tuned XTTS model" and wait until the fine-tuned model is loaded. Then you can do the inference on the model by clicking on the button "Step 4 - Inference". -##### Run demo locally +#### Run demo locally To run the demo locally you need to do the following steps: 1. Install 🐸 TTS following the instructions available [here](https://coqui-tts.readthedocs.io/en/latest/installation.html). @@ -319,7 +319,7 @@ If you are not able to access the video, here is what you need to do: 4. Go to the third Tab (3 - Inference) and then click on the button "Step 3 - Load Fine-tuned XTTS model" and wait until the fine-tuned model is loaded. 5. Now you can run inference with the model by clicking on the button "Step 4 - Inference". -#### Advanced training +### Advanced training A recipe for `XTTS_v2` GPT encoder training using `LJSpeech` dataset is available at https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/xtts_v1/train_gpt_xtts.py @@ -393,6 +393,6 @@ torchaudio.save(OUTPUT_WAV_PATH, torch.tensor(out["wav"]).unsqueeze(0), 24000) ## XTTS Model ```{eval-rst} -.. autoclass:: TTS.tts.models.xtts.XTTS +.. autoclass:: TTS.tts.models.xtts.Xtts :members: ``` diff --git a/docs/source/training_a_model.md b/docs/source/training_a_model.md index 989a5704..6f612dc0 100644 --- a/docs/source/training_a_model.md +++ b/docs/source/training_a_model.md @@ -1,4 +1,4 @@ -# Training a Model +# Training a model 1. Decide the model you want to use. @@ -132,7 +132,7 @@ In the example above, we trained a `GlowTTS` model, but the same workflow applies to all the other 🐸TTS models. -# Multi-speaker Training +## Multi-speaker Training Training a multi-speaker model is mostly the same as training a single-speaker model. You need to specify a couple of configuration parameters, initiate a `SpeakerManager` instance and pass it to the model. 
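A condensed sketch of that setup, loosely modelled on the multi-speaker recipes shipped under `recipes/` (the dataset path is a placeholder, and exact import paths or argument names may differ between versions):

```python
from trainer import Trainer, TrainerArgs

from TTS.config.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits
from TTS.tts.utils.speakers import SpeakerManager
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

# Placeholder dataset definition - point it at your own multi-speaker corpus.
dataset_config = BaseDatasetConfig(formatter="vctk", meta_file_train="", path="/data/VCTK/")

config = VitsConfig(output_path="/tmp/vits_vctk/", datasets=[dataset_config])
config.model_args.use_speaker_embedding = True  # enable the speaker embedding layer

ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

# Collect speaker IDs from the loaded samples and hand the manager to the model.
speaker_manager = SpeakerManager()
speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")
config.model_args.num_speakers = speaker_manager.num_speakers

model = Vits(config, ap, tokenizer, speaker_manager=speaker_manager)
trainer = Trainer(
    TrainerArgs(), config, config.output_path,
    model=model, train_samples=train_samples, eval_samples=eval_samples,
)
trainer.fit()
```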
diff --git a/docs/source/tts_datasets.md b/docs/source/tts_datasets.md index 11da1b76..3a0bcf11 100644 --- a/docs/source/tts_datasets.md +++ b/docs/source/tts_datasets.md @@ -1,4 +1,4 @@ -# TTS Datasets +# TTS datasets Some of the known public datasets that we successfully applied 🐸TTS: diff --git a/docs/source/tutorial_for_nervous_beginners.md b/docs/source/tutorial_for_nervous_beginners.md index b417c4c4..5df56fc6 100644 --- a/docs/source/tutorial_for_nervous_beginners.md +++ b/docs/source/tutorial_for_nervous_beginners.md @@ -1,20 +1,29 @@ -# Tutorial For Nervous Beginners +# Tutorial for nervous beginners -## Installation +First [install](installation.md) Coqui TTS. -User friendly installation. Recommended only for synthesizing voice. +## Synthesizing Speech + +You can run `tts` and synthesize speech directly on the terminal. ```bash -$ pip install coqui-tts +$ tts -h # see the help +$ tts --list_models # list the available models. ``` -Developer friendly installation. +![cli.gif](https://github.com/idiap/coqui-ai-TTS/raw/main/images/tts_cli.gif) + + +You can call `tts-server` to start a local demo server that you can open on +your favorite web browser and 🗣️ (make sure to install the additional +dependencies with `pip install coqui-tts[server]`). ```bash -$ git clone https://github.com/idiap/coqui-ai-TTS -$ cd coqui-ai-TTS -$ pip install -e . +$ tts-server -h # see the help +$ tts-server --list_models # list the available models. ``` +![server.gif](https://github.com/idiap/coqui-ai-TTS/raw/main/images/demo_server.gif) + ## Training a `tts` Model @@ -99,25 +108,3 @@ We still support running training from CLI like in the old days. The same traini ``` ❗️ Note that you can also use ```train_vocoder.py``` as the ```tts``` models above. - -## Synthesizing Speech - -You can run `tts` and synthesize speech directly on the terminal. - -```bash -$ tts -h # see the help -$ tts --list_models # list the available models. -``` - -![cli.gif](https://github.com/idiap/coqui-ai-TTS/raw/main/images/tts_cli.gif) - - -You can call `tts-server` to start a local demo server that you can open on -your favorite web browser and 🗣️ (make sure to install the additional -dependencies with `pip install coqui-tts[server]`). - -```bash -$ tts-server -h # see the help -$ tts-server --list_models # list the available models. -``` -![server.gif](https://github.com/idiap/coqui-ai-TTS/raw/main/images/demo_server.gif)
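Tying the tutorial together, a complete end-to-end CLI run might look like the following; the model name is just one of the released models (see `tts --list_models`), and the server needs the extra dependencies mentioned above:

```bash
# Synthesize a sentence with a released model and write it to a wav file.
$ tts --text "Hello from Coqui TTS!" \
      --model_name "tts_models/en/ljspeech/glow-tts" \
      --out_path hello.wav

# Serve the same model locally (by default on http://localhost:5002).
$ tts-server --model_name "tts_models/en/ljspeech/glow-tts"
```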