docs: improve documentation

pull/4115/head^2
Enno Hermann 2024-12-12 00:37:48 +01:00
parent 236e4901d8
commit 849e75e967
25 changed files with 249 additions and 271 deletions


@ -11,30 +11,25 @@ You can contribute not only with code but with bug reports, comments, questions,
If you'd like to contribute code or squash a bug but don't know where to start, here are some pointers.
- [Development Road Map](https://github.com/coqui-ai/TTS/issues/378)
You can pick something out of our road map. We keep the progress of the project in this simple issue thread. It has new model proposals, development updates, etc.
- [Github Issues Tracker](https://github.com/idiap/coqui-ai-TTS/issues)
This is the place to find feature requests and bug reports.
Issues with the ```good first issue``` tag are good place for beginners to take on.
- ✨**PR**✨ [pages](https://github.com/idiap/coqui-ai-TTS/pulls) with the ```🚀new version``` tag.
We list all the target improvements for the next version. You can pick one of them and start contributing.
Issues with the ```good first issue``` tag are a good place for beginners to
take on. Issues tagged with `help wanted` are suited for more experienced
outside contributors.
- Also feel free to suggest new features, ideas and models. We're always open to new things.
## Call for sharing language models
## Call for sharing pretrained models
If possible, please consider sharing your pre-trained models in any language (if the licences allow you to do so). We will include them in our model catalogue for public use and give proper attribution, whether it be your name, company, website or any other source specified.
Your model can be shared in two ways:
1. Share the model files with us and we serve them with the next 🐸 TTS release.
2. Upload your models on GDrive and share the link.
Models are served under `.models.json` file and any model is available under TTS CLI or Server end points.
Models are served under the `.models.json` file and any model is available through the TTS
CLI and Python API endpoints.
Whichever way you choose, please make sure you share the models [here](https://github.com/coqui-ai/TTS/discussions/930).
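For example, once a model is registered in `.models.json`, users can load it by name through the documented Python API (a hedged sketch; the model name is just an example):

```python
from TTS.api import TTS

# List the models registered in .models.json, then load one by name.
print(TTS().list_models())

tts = TTS(model_name="tts_models/en/ljspeech/glow-tts", progress_bar=False)
tts.tts_to_file(text="Hello from a shared model!", file_path="output.wav")
```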
@ -135,7 +130,8 @@ curl -LsSf https://astral.sh/uv/install.sh | sh
13. Let's discuss until it is perfect. 💪
We might ask you for certain changes that would appear in the ✨**PR**✨'s page under 🐸TTS[https://github.com/idiap/coqui-ai-TTS/pulls].
We might ask you for certain changes that would appear in the
[Github ✨**PR**✨'s page](https://github.com/idiap/coqui-ai-TTS/pulls).
14. Once things look perfect, we merge it into the ```dev``` branch and make it ready for the next version.
@ -143,9 +139,9 @@ curl -LsSf https://astral.sh/uv/install.sh | sh
If you prefer working within a Docker container as your development environment, you can do the following:
1. Fork 🐸TTS[https://github.com/idiap/coqui-ai-TTS] by clicking the fork button at the top right corner of the project page.
1. Fork the 🐸TTS [Github repository](https://github.com/idiap/coqui-ai-TTS) by clicking the fork button at the top right corner of the page.
2. Clone 🐸TTS and add the main repo as a new remote named ```upsteam```.
2. Clone 🐸TTS and add the main repo as a new remote named ```upstream```.
```bash
git clone git@github.com:<your Github name>/coqui-ai-TTS.git

README.md

@ -1,11 +1,10 @@
## 🐸Coqui TTS News
# 🐸Coqui TTS
## News
- 📣 Fork of the [original, unmaintained repository](https://github.com/coqui-ai/TTS). New PyPI package: [coqui-tts](https://pypi.org/project/coqui-tts)
- 📣 [OpenVoice](https://github.com/myshell-ai/OpenVoice) models now available for voice conversion.
- 📣 Prebuilt wheels are now also published for Mac and Windows (in addition to Linux as before) for easier installation across platforms.
- 📣 ⓍTTSv2 is here with 17 languages and better performance across the board. ⓍTTS can stream with <200ms latency.
- 📣 ⓍTTS fine-tuning code is out. Check the [example recipes](https://github.com/idiap/coqui-ai-TTS/tree/dev/recipes/ljspeech).
- 📣 [🐶Bark](https://github.com/suno-ai/bark) is now available for inference with unconstrained voice cloning. [Docs](https://coqui-tts.readthedocs.io/en/latest/models/bark.html)
- 📣 XTTSv2 is here with 17 languages and better performance across the board. XTTS can stream with <200ms latency.
- 📣 XTTS fine-tuning code is out. Check the [example recipes](https://github.com/idiap/coqui-ai-TTS/tree/dev/recipes/ljspeech).
- 📣 You can use [Fairseq models in ~1100 languages](https://github.com/facebookresearch/fairseq/tree/main/examples/mms) with 🐸TTS.
## <img src="https://raw.githubusercontent.com/idiap/coqui-ai-TTS/main/images/coqui-log-green-TTS.png" height="56"/>
@ -21,6 +20,7 @@
______________________________________________________________________
[![Discord](https://img.shields.io/discord/1037326658807533628?color=%239B59B6&label=chat%20on%20discord)](https://discord.gg/5eXr5seRrv)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/coqui-tts)
[![License](<https://img.shields.io/badge/License-MPL%202.0-brightgreen.svg>)](https://opensource.org/licenses/MPL-2.0)
[![PyPI version](https://badge.fury.io/py/coqui-tts.svg)](https://badge.fury.io/py/coqui-tts)
[![Downloads](https://pepy.tech/badge/coqui-tts)](https://pepy.tech/project/coqui-tts)
@ -63,71 +63,65 @@ repository are also still a useful source of information.
| 🚀 **Released Models** | [Standard models](https://github.com/idiap/coqui-ai-TTS/blob/dev/TTS/.models.json) and [Fairseq models in ~1100 languages](https://github.com/idiap/coqui-ai-TTS#example-text-to-speech-using-fairseq-models-in-1100-languages-)|
## Features
- High-performance Deep Learning models for Text2Speech tasks. See lists of models below.
- Fast and efficient model training.
- Detailed training logs on the terminal and Tensorboard.
- Support for Multi-speaker TTS.
- Efficient, flexible, lightweight but feature complete `Trainer API`.
- High-performance text-to-speech and voice conversion models, see list below.
- Fast and efficient model training with detailed training logs on the terminal and Tensorboard.
- Support for multi-speaker and multilingual TTS.
- Released and ready-to-use models.
- Tools to curate Text2Speech datasets under```dataset_analysis```.
- Utilities to use and test your models.
- Tools to curate TTS datasets under ```dataset_analysis/```.
- Command line and Python APIs to use and test your models.
- Modular (but not too much) code base enabling easy implementation of new ideas.
## Model Implementations
### Spectrogram models
- Tacotron: [paper](https://arxiv.org/abs/1703.10135)
- Tacotron2: [paper](https://arxiv.org/abs/1712.05884)
- Glow-TTS: [paper](https://arxiv.org/abs/2005.11129)
- Speedy-Speech: [paper](https://arxiv.org/abs/2008.03802)
- Align-TTS: [paper](https://arxiv.org/abs/2003.01950)
- FastPitch: [paper](https://arxiv.org/pdf/2006.06873.pdf)
- FastSpeech: [paper](https://arxiv.org/abs/1905.09263)
- FastSpeech2: [paper](https://arxiv.org/abs/2006.04558)
- SC-GlowTTS: [paper](https://arxiv.org/abs/2104.05557)
- Capacitron: [paper](https://arxiv.org/abs/1906.03402)
- OverFlow: [paper](https://arxiv.org/abs/2211.06892)
- Neural HMM TTS: [paper](https://arxiv.org/abs/2108.13320)
- Delightful TTS: [paper](https://arxiv.org/abs/2110.12612)
- [Tacotron](https://arxiv.org/abs/1703.10135), [Tacotron2](https://arxiv.org/abs/1712.05884)
- [Glow-TTS](https://arxiv.org/abs/2005.11129), [SC-GlowTTS](https://arxiv.org/abs/2104.05557)
- [Speedy-Speech](https://arxiv.org/abs/2008.03802)
- [Align-TTS](https://arxiv.org/abs/2003.01950)
- [FastPitch](https://arxiv.org/pdf/2006.06873.pdf)
- [FastSpeech](https://arxiv.org/abs/1905.09263), [FastSpeech2](https://arxiv.org/abs/2006.04558)
- [Capacitron](https://arxiv.org/abs/1906.03402)
- [OverFlow](https://arxiv.org/abs/2211.06892)
- [Neural HMM TTS](https://arxiv.org/abs/2108.13320)
- [Delightful TTS](https://arxiv.org/abs/2110.12612)
### End-to-End Models
- ⓍTTS: [blog](https://coqui.ai/blog/tts/open_xtts)
- VITS: [paper](https://arxiv.org/pdf/2106.06103)
- 🐸 YourTTS: [paper](https://arxiv.org/abs/2112.02418)
- 🐢 Tortoise: [orig. repo](https://github.com/neonbjb/tortoise-tts)
- 🐶 Bark: [orig. repo](https://github.com/suno-ai/bark)
### Attention Methods
- Guided Attention: [paper](https://arxiv.org/abs/1710.08969)
- Forward Backward Decoding: [paper](https://arxiv.org/abs/1907.09006)
- Graves Attention: [paper](https://arxiv.org/abs/1910.10288)
- Double Decoder Consistency: [blog](https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/)
- Dynamic Convolutional Attention: [paper](https://arxiv.org/pdf/1910.10288.pdf)
- Alignment Network: [paper](https://arxiv.org/abs/2108.10447)
### Speaker Encoder
- GE2E: [paper](https://arxiv.org/abs/1710.10467)
- Angular Loss: [paper](https://arxiv.org/pdf/2003.11982.pdf)
- [XTTS](https://arxiv.org/abs/2406.04904)
- [VITS](https://arxiv.org/pdf/2106.06103)
- 🐸[YourTTS](https://arxiv.org/abs/2112.02418)
- 🐢[Tortoise](https://github.com/neonbjb/tortoise-tts)
- 🐶[Bark](https://github.com/suno-ai/bark)
### Vocoders
- MelGAN: [paper](https://arxiv.org/abs/1910.06711)
- MultiBandMelGAN: [paper](https://arxiv.org/abs/2005.05106)
- ParallelWaveGAN: [paper](https://arxiv.org/abs/1910.11480)
- GAN-TTS discriminators: [paper](https://arxiv.org/abs/1909.11646)
- WaveRNN: [origin](https://github.com/fatchord/WaveRNN/)
- WaveGrad: [paper](https://arxiv.org/abs/2009.00713)
- HiFiGAN: [paper](https://arxiv.org/abs/2010.05646)
- UnivNet: [paper](https://arxiv.org/abs/2106.07889)
- [MelGAN](https://arxiv.org/abs/1910.06711)
- [MultiBandMelGAN](https://arxiv.org/abs/2005.05106)
- [ParallelWaveGAN](https://arxiv.org/abs/1910.11480)
- [GAN-TTS discriminators](https://arxiv.org/abs/1909.11646)
- [WaveRNN](https://github.com/fatchord/WaveRNN/)
- [WaveGrad](https://arxiv.org/abs/2009.00713)
- [HiFiGAN](https://arxiv.org/abs/2010.05646)
- [UnivNet](https://arxiv.org/abs/2106.07889)
### Voice Conversion
- FreeVC: [paper](https://arxiv.org/abs/2210.15418)
- OpenVoice: [technical report](https://arxiv.org/abs/2312.01479)
- [FreeVC](https://arxiv.org/abs/2210.15418)
- [OpenVoice](https://arxiv.org/abs/2312.01479)
### Others
- Attention methods: [Guided Attention](https://arxiv.org/abs/1710.08969),
[Forward Backward Decoding](https://arxiv.org/abs/1907.09006),
[Graves Attention](https://arxiv.org/abs/1910.10288),
[Double Decoder Consistency](https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/),
[Dynamic Convolutional Attention](https://arxiv.org/pdf/1910.10288.pdf),
[Alignment Network](https://arxiv.org/abs/2108.10447)
- Speaker encoders: [GE2E](https://arxiv.org/abs/1710.10467),
[Angular Loss](https://arxiv.org/pdf/2003.11982.pdf)
You can also help us implement more models.
## Installation
🐸TTS is tested on Ubuntu 22.04 with **python >= 3.9, < 3.13.**.
🐸TTS is tested on Ubuntu 24.04 with **Python >= 3.9, < 3.13**, but should also
work on Mac and Windows.
If you are only interested in [synthesizing speech](https://coqui-tts.readthedocs.io/en/latest/inference.html) with the released 🐸TTS models, installing from PyPI is the easiest option.
If you are only interested in [synthesizing speech](https://coqui-tts.readthedocs.io/en/latest/inference.html) with the pretrained 🐸TTS models, installing from PyPI is the easiest option.
```bash
pip install coqui-tts
@ -172,14 +166,9 @@ make system-deps # intended to be used on Ubuntu (Debian). Let us know if you h
make install
```
If you are on Windows, 👑@GuyPaddock wrote installation instructions
[here](https://stackoverflow.com/questions/66726331/how-can-i-run-mozilla-tts-coqui-tts-training-with-cuda-on-a-windows-system)
(note that these are out of date, e.g. you need to have at least Python 3.9).
## Docker Image
You can also try TTS without install with the docker image.
Simply run the following command and you will be able to run TTS without installing it.
You can also try out Coqui TTS without installation by using the Docker image.
Simply run the following command and you will be able to run TTS:
```bash
docker run --rm -it -p 5002:5002 --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu
@ -281,11 +270,12 @@ api.tts_to_file(
<!-- begin-tts-readme -->
Synthesize speech on command line.
Synthesize speech on the command line.
You can either use your trained model or choose a model from the provided list.
If you don't specify any models, then it uses LJSpeech based English model.
If you don't specify a model, it uses a Tacotron2 English model trained
on LJSpeech.
#### Single Speaker Models


@ -12,7 +12,7 @@ from trainer import TrainerModel
class BaseTrainerModel(TrainerModel):
"""BaseTrainerModel model expanding TrainerModel with required functions by 🐸TTS.
Every new 🐸TTS model must inherit it.
Every new Coqui model must inherit it.
"""
@staticmethod


@ -206,12 +206,14 @@ class Bark(BaseTTS):
speaker_wav (str): Path to the speaker audio file for cloning a new voice. It is cloned and saved in
`voice_dirs` with the name `speaker_id`. Defaults to None.
voice_dirs (List[str]): List of paths that host reference audio files for speakers. Defaults to None.
**kwargs: Model specific inference settings used by `generate_audio()` and `TTS.tts.layers.bark.inference_funcs.generate_text_semantic().
**kwargs: Model specific inference settings used by `generate_audio()` and
`TTS.tts.layers.bark.inference_funcs.generate_text_semantic()`.
Returns:
A dictionary of the output values with `wav` as output waveform, `deterministic_seed` as seed used at inference,
`text_input` as text token IDs after tokenizer, `voice_samples` as samples used for cloning, `conditioning_latents`
as latents used at inference.
A dictionary of the output values with `wav` as output waveform,
`deterministic_seed` as seed used at inference, `text_input` as text token IDs
after tokenizer, `voice_samples` as samples used for cloning,
`conditioning_latents` as latents used at inference.
"""
speaker_id = "random" if speaker_id is None else speaker_id


@ -80,15 +80,17 @@ class BaseTTS(BaseTrainerModel):
raise ValueError("config must be either a *Config or *Args")
def init_multispeaker(self, config: Coqpit, data: List = None):
"""Initialize a speaker embedding layer if needen and define expected embedding channel size for defining
`in_channels` size of the connected layers.
"""Set up for multi-speaker TTS.
Initialize a speaker embedding layer if needed and define expected embedding
channel size for defining `in_channels` size of the connected layers.
This implementation yields 3 possible outcomes:
1. If `config.use_speaker_embedding` and `config.use_d_vector_file are False, do nothing.
1. If `config.use_speaker_embedding` and `config.use_d_vector_file` are False, do nothing.
2. If `config.use_d_vector_file` is True, set expected embedding channel size to `config.d_vector_dim` or 512.
3. If `config.use_speaker_embedding`, initialize a speaker embedding layer with channel size of
`config.d_vector_dim` or 512.
`config.d_vector_dim` or 512.
You can override this function for new models.
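A minimal, illustrative sketch of those three outcomes (not the actual 🐸TTS implementation; function name and defaults are assumptions):

```python
from typing import Optional, Tuple

from torch import nn


def resolve_speaker_embedding(
    use_speaker_embedding: bool,
    use_d_vector_file: bool,
    d_vector_dim: Optional[int] = None,
    num_speakers: int = 0,
) -> Tuple[int, Optional[nn.Embedding]]:
    """Return (embedding channel size, optional embedding layer) for the three cases above."""
    if use_d_vector_file:
        # 2. external d-vector files: only the expected channel size is needed
        return d_vector_dim or 512, None
    if use_speaker_embedding:
        # 3. learnable speaker embedding layer with the same channel size
        dim = d_vector_dim or 512
        return dim, nn.Embedding(num_speakers, dim)
    # 1. neither flag is set: single-speaker model, nothing to do
    return 0, None
```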


@ -33,32 +33,33 @@ class Overflow(BaseTTS):
Paper abstract::
Neural HMMs are a type of neural transducer recently proposed for
sequence-to-sequence modelling in text-to-speech. They combine the best features
of classic statistical speech synthesis and modern neural TTS, requiring less
data and fewer training updates, and are less prone to gibberish output caused
by neural attention failures. In this paper, we combine neural HMM TTS with
normalising flows for describing the highly non-Gaussian distribution of speech
acoustics. The result is a powerful, fully probabilistic model of durations and
acoustics that can be trained using exact maximum likelihood. Compared to
dominant flow-based acoustic models, our approach integrates autoregression for
improved modelling of long-range dependences such as utterance-level prosody.
Experiments show that a system based on our proposal gives more accurate
pronunciations and better subjective speech quality than comparable methods,
whilst retaining the original advantages of neural HMMs. Audio examples and code
are available at https://shivammehta25.github.io/OverFlow/.
sequence-to-sequence modelling in text-to-speech. They combine the best features
of classic statistical speech synthesis and modern neural TTS, requiring less
data and fewer training updates, and are less prone to gibberish output caused
by neural attention failures. In this paper, we combine neural HMM TTS with
normalising flows for describing the highly non-Gaussian distribution of speech
acoustics. The result is a powerful, fully probabilistic model of durations and
acoustics that can be trained using exact maximum likelihood. Compared to
dominant flow-based acoustic models, our approach integrates autoregression for
improved modelling of long-range dependences such as utterance-level prosody.
Experiments show that a system based on our proposal gives more accurate
pronunciations and better subjective speech quality than comparable methods,
whilst retaining the original advantages of neural HMMs. Audio examples and code
are available at https://shivammehta25.github.io/OverFlow/.
Note:
- Neural HMMs uses flat start initialization i.e it computes the means and std and transition probabilities
of the dataset and uses them to initialize the model. This benefits the model and helps with faster learning
If you change the dataset or want to regenerate the parameters change the `force_generate_statistics` and
`mel_statistics_parameter_path` accordingly.
- Neural HMMs use flat start initialization, i.e. they compute the means,
stds and transition probabilities of the dataset and use them to initialize
the model. This benefits the model and helps with faster learning. If you change
the dataset or want to regenerate the parameters, change
`force_generate_statistics` and `mel_statistics_parameter_path` accordingly.
- To enable multi-GPU training, set `use_grad_checkpointing=False` in the config.
This will significantly increase the memory usage. This is because to compute
the actual data likelihood (not an approximation using MAS/Viterbi) we must use
all the states at the previous time step during the forward pass to decide the
probability distribution at the current step i.e the difference between the forward
algorithm and viterbi approximation.
This will significantly increase the memory usage. This is because to compute
the actual data likelihood (not an approximation using MAS/Viterbi) we must use
all the states at the previous time step during the forward pass to decide the
probability distribution at the current step, i.e. the difference between the
forward algorithm and the Viterbi approximation.
Check :class:`TTS.tts.configs.overflow.OverFlowConfig` for class arguments.
"""


@ -423,7 +423,9 @@ class Tortoise(BaseTTS):
Transforms one or more voice_samples into a tuple (autoregressive_conditioning_latent, diffusion_conditioning_latent).
These are expressive learned latents that encode aspects of the provided clips like voice, intonation, and acoustic
properties.
:param voice_samples: List of arbitrary reference clips, which should be *pairs* of torch tensors containing arbitrary kHz waveform data.
:param voice_samples: List of arbitrary reference clips, which should be *pairs*
of torch tensors containing arbitrary kHz waveform data.
:param latent_averaging_mode: 0/1/2 for following modes:
0 - latents will be generated as in original tortoise, using ~4.27s from each voice sample, averaging latent across all samples
1 - latents will be generated using (almost) entire voice samples, averaged across all the ~4.27s chunks
@ -671,7 +673,7 @@ class Tortoise(BaseTTS):
As cond_free_k increases, the output becomes dominated by the conditioning-free signal.
diffusion_temperature: (float) Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0
are the "mean" prediction of the diffusion network and will sound bland and smeared.
hf_generate_kwargs: (**kwargs) The huggingface Transformers generate API is used for the autoregressive transformer.
hf_generate_kwargs: (`**kwargs`) The huggingface Transformers generate API is used for the autoregressive transformer.
Extra keyword args fed to this function get forwarded directly to that API. Documentation
here: https://huggingface.co/docs/transformers/internal/generation_utils


@ -178,7 +178,7 @@ class XttsArgs(Coqpit):
class Xtts(BaseTTS):
"""TTS model implementation.
"""XTTS model implementation.
Currently it only supports inference.
@ -460,7 +460,7 @@ class Xtts(BaseTTS):
gpt_cond_chunk_len: (int) Chunk length used for cloning. It must be <= `gpt_cond_len`.
If gpt_cond_len == gpt_cond_chunk_len, no chunking. Defaults to 6 seconds.
hf_generate_kwargs: (**kwargs) The huggingface Transformers generate API is used for the autoregressive
hf_generate_kwargs: (`**kwargs`) The huggingface Transformers generate API is used for the autoregressive
transformer. Extra keyword args fed to this function get forwarded directly to that API. Documentation
here: https://huggingface.co/docs/transformers/internal/generation_utils


@ -1,6 +1,6 @@
# Configuration
We use 👩‍✈️[Coqpit] for configuration management. It provides basic static type checking and serialization capabilities on top of native Python `dataclasses`. Here is how a simple configuration looks like with Coqpit.
We use 👩‍✈️[Coqpit](https://github.com/idiap/coqui-ai-coqpit) for configuration management. It provides basic static type checking and serialization capabilities on top of native Python `dataclasses`. Here is what a simple configuration looks like with Coqpit.
```python
from dataclasses import asdict, dataclass, field
@ -36,7 +36,7 @@ class SimpleConfig(Coqpit):
check_argument("val_c", c, restricted=True)
```
In TTS, each model must have a configuration class that exposes all the values necessary for its lifetime.
In Coqui, each model must have a configuration class that exposes all the values necessary for its lifetime.
It defines model architecture, hyper-parameters, training, and inference settings. For our models, we merge all the fields in a single configuration class for ease. It may not look like a wise practice but enables easier bookkeeping and reproducible experiments.
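As an illustration, such a configuration class bundles architecture, training and inference fields in one dataclass (a hedged sketch; the class and field names are made up):

```python
from dataclasses import dataclass, field
from typing import List

from coqpit import Coqpit


@dataclass
class MyModelConfig(Coqpit):
    """All the knobs for a hypothetical model, bundled in one place."""

    model: str = "my_model"
    hidden_channels: int = 256
    num_layers: int = 4
    learning_rate: float = 1e-3
    # training / inference settings live in the same class for easy bookkeeping
    batch_size: int = 32
    test_sentences: List[str] = field(default_factory=lambda: ["Hello world."])
```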


@ -1,20 +1,20 @@
(docker_images)=
## Docker images
# Docker images
We provide docker images to be able to test TTS without having to set up your own environment.
### Using premade images
## Using premade images
You can use premade images built automatically from the latest TTS version.
#### CPU version
### CPU version
```bash
docker pull ghcr.io/coqui-ai/tts-cpu
```
#### GPU version
### GPU version
```bash
docker pull ghcr.io/coqui-ai/tts
```
### Building your own image
## Building your own image
```bash
docker build -t tts .
```


@ -1,4 +1,4 @@
# Humble FAQ
# FAQ
We tried to collect common issues and questions we receive about 🐸TTS. It is worth checking before going deeper.
## Errors with a pre-trained model. How can I resolve this?
@ -7,7 +7,7 @@ We tried to collect common issues and questions we receive about 🐸TTS. It is
- If you feel like it's a bug to be fixed, then prefer Github issues with the same level of scrutiny.
## What are the requirements of a good 🐸TTS dataset?
* {ref}`See this page <what_makes_a_good_dataset>`
- [See this page](what_makes_a_good_dataset.md)
## How should I choose the right model?
- First, train Tacotron. It is smaller and faster to experiment with. If it performs poorly, try Tacotron2.
@ -61,7 +61,8 @@ We tried to collect common issues and questions we receive about 🐸TTS. It is
- SingleGPU training: ```CUDA_VISIBLE_DEVICES="0" python train_tts.py --config_path config.json```
- MultiGPU training: ```python3 -m trainer.distribute --gpus "0,1" --script TTS/bin/train_tts.py --config_path config.json```
**Note:** You can also train your model using pure 🐍 python. Check ```{eval-rst} :ref: 'tutorial_for_nervous_beginners'```.
**Note:** You can also train your model using pure 🐍 python. Check the
[tutorial](tutorial_for_nervous_beginners.md).
## How can I train in a different language?
- Check steps 2, 3, 4, 5 above.
@ -104,7 +105,7 @@ The best approach is to pick a set of promising models and run a Mean-Opinion-Sc
- Check the 4th step under "How can I check model performance?"
## How can I test a trained model?
- The best way is to use `tts` or `tts-server` commands. For details check {ref}`here <synthesizing_speech>`.
- The best way is to use `tts` or `tts-server` commands. For details check [here](inference.md).
- If you need to integrate it in your own code, use the ```TTS.utils.synthesizer.Synthesizer``` class.
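A hedged sketch of the latter (paths are placeholders; check the class docstring for the full argument list):

```python
from TTS.utils.synthesizer import Synthesizer

synthesizer = Synthesizer(
    tts_checkpoint="path/to/model.pth",
    tts_config_path="path/to/config.json",
    use_cuda=False,
)
wav = synthesizer.tts("This is a test.")
synthesizer.save_wav(wav, "output.wav")
```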
## My Tacotron model does not stop - I see "Decoder stopped with 'max_decoder_steps'" - Stopnet does not work.


@ -1,4 +1,4 @@
# Fine-tuning a 🐸 TTS model
# Fine-tuning a model
## Fine-tuning
@ -21,8 +21,9 @@ them and fine-tune it for your own dataset. This will help you in two main ways:
Fine-tuning comes to the rescue in this case. You can take one of our pre-trained models and fine-tune it on your own
speech dataset and achieve reasonable results with only a couple of hours of data.
However, note that, fine-tuning does not ensure great results. The model performance still depends on the
{ref}`dataset quality <what_makes_a_good_dataset>` and the hyper-parameters you choose for fine-tuning. Therefore,
However, note that fine-tuning does not ensure great results. The model
performance still depends on the [dataset quality](what_makes_a_good_dataset.md)
and the hyper-parameters you choose for fine-tuning. Therefore,
it still takes a bit of tinkering.
@ -31,7 +32,7 @@ them and fine-tune it for your own dataset. This will help you in two main ways:
1. Setup your dataset.
You need to format your target dataset in a certain way so that 🐸TTS data loader will be able to load it for the
training. Please see {ref}`this page <formatting_your_dataset>` for more information about formatting.
training. Please see [this page](formatting_your_dataset.md) for more information about formatting.
2. Choose the model you want to fine-tune.
@ -47,7 +48,8 @@ them and fine-tune it for your own dataset. This will help you in two main ways:
You should choose the model based on your requirements. Some models are fast and some have better speech quality.
One lazy way to test a model is running the model on the hardware you want to use and see how it works. For
simple testing, you can use the `tts` command on the terminal. For more info see {ref}`here <synthesizing_speech>`.
simple testing, you can use the `tts` command on the terminal. For more info
see [here](inference.md).
3. Download the model.
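One hedged way to do this from Python: loading a model by name through the API downloads its checkpoint and config into the local model directory on first use (the model name is just an example).

```python
from TTS.api import TTS

# Downloads the checkpoint and config into the local model directory on first use.
tts = TTS(model_name="tts_models/en/ljspeech/glow-tts", progress_bar=True)
```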


@ -1,5 +1,5 @@
(formatting_your_dataset)=
# Formatting Your Dataset
# Formatting your dataset
For training a TTS model, you need a dataset with speech recordings and transcriptions. The speech must be divided into audio clips and each clip needs a transcription.
@ -49,7 +49,7 @@ The format above is taken from widely-used the [LJSpeech](https://keithito.com/L
Your dataset should have good coverage of the target language. It should cover the phonemic variety, exceptional sounds and syllables. This is especially important for non-phonemic languages like English.
For more info about dataset qualities and properties check our [post](https://github.com/coqui-ai/TTS/wiki/What-makes-a-good-TTS-dataset).
For more info about dataset qualities and properties check [this page](what_makes_a_good_dataset.md).
## Using Your Dataset in 🐸TTS
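For example, an LJSpeech-style dataset can be described with a dataset config and loaded with the dataset utilities (a hedged sketch; paths are placeholders):

```python
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples

dataset_config = BaseDatasetConfig(
    formatter="ljspeech",
    meta_file_train="metadata.csv",
    path="path/to/your/dataset/",
)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)
print(f"{len(train_samples)} training and {len(eval_samples)} eval samples")
```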


@ -1,6 +1,6 @@
# Implementing a New Language Frontend
# Implementing new language front ends
- Language frontends are located under `TTS.tts.utils.text`
- Language front ends are located under `TTS.tts.utils.text`
- Each special language has a separate folder.
- Each folder contains all the utilities for processing the text input.
- `TTS.tts.utils.text.phonemizers` contains the main phonemizer for a language. This is the class that uses the utilities


@ -1,4 +1,4 @@
# Implementing a Model
# Implementing new models
1. Implement layers.
@ -36,7 +36,7 @@
There is also the `callback` interface by which you can manipulate both the model and the `Trainer` states. Callbacks give you
infinite flexibility to add custom behaviours for your model and training routines.
For more details, see {ref}`BaseTTS <Base tts Model>` and :obj:`TTS.utils.callbacks`.
For more details, see [BaseTTS](main_classes/model_api.md#base-tts-model) and :obj:`TTS.utils.callbacks`.
6. Optionally, define `MyModelArgs`.
@ -62,7 +62,7 @@
We love you more when you document your code. ❤️
# Template 🐸TTS Model implementation
## Template 🐸TTS Model implementation
You can start implementing your model by copying the following base class.


@ -5,58 +5,57 @@
----
# Documentation Content
```{eval-rst}
.. toctree::
:maxdepth: 2
:caption: Get started
tutorial_for_nervous_beginners
installation
faq
contributing
.. toctree::
:maxdepth: 2
:caption: Using 🐸TTS
inference
docker_images
implementing_a_new_model
implementing_a_new_language_frontend
training_a_model
finetuning
configuration
formatting_your_dataset
what_makes_a_good_dataset
tts_datasets
marytts
.. toctree::
:maxdepth: 2
:caption: Main Classes
main_classes/trainer_api
main_classes/audio_processor
main_classes/model_api
main_classes/dataset
main_classes/gan
main_classes/speaker_manager
.. toctree::
:maxdepth: 2
:caption: `tts` Models
models/glow_tts.md
models/vits.md
models/forward_tts.md
models/tacotron1-2.md
models/overflow.md
models/tortoise.md
models/bark.md
models/xtts.md
.. toctree::
:maxdepth: 2
:caption: `vocoder` Models
```{toctree}
:maxdepth: 1
:caption: Get started
tutorial_for_nervous_beginners
installation
docker_images
faq
contributing
```
```{toctree}
:maxdepth: 1
:caption: Using Coqui
inference
training_a_model
finetuning
implementing_a_new_model
implementing_a_new_language_frontend
formatting_your_dataset
what_makes_a_good_dataset
tts_datasets
marytts
```
```{toctree}
:maxdepth: 1
:caption: Main Classes
configuration
main_classes/trainer_api
main_classes/audio_processor
main_classes/model_api
main_classes/dataset
main_classes/gan
main_classes/speaker_manager
```
```{toctree}
:maxdepth: 1
:caption: TTS Models
models/glow_tts.md
models/vits.md
models/forward_tts.md
models/tacotron1-2.md
models/overflow.md
models/tortoise.md
models/bark.md
models/xtts.md
```


@ -1,5 +1,5 @@
(synthesizing_speech)=
# Synthesizing Speech
# Synthesizing speech
First, you need to install TTS. We recommend installing from PyPI with the command below:
@ -136,7 +136,7 @@ wav = tts.tts(text="Hello world!", speaker_wav="my/cloning/audio.wav", language=
tts.tts_to_file(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav")
```
#### Here is an example for a single speaker model.
### Single speaker model.
```python
# Init TTS with the target model name
@ -145,7 +145,7 @@ tts = TTS(model_name="tts_models/de/thorsten/tacotron2-DDC", progress_bar=False)
tts.tts_to_file(text="Ich bin eine Testnachricht.", file_path=OUTPUT_PATH)
```
#### Example voice cloning with YourTTS in English, French and Portuguese:
### Voice cloning with YourTTS in English, French and Portuguese:
```python
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts", progress_bar=False).to("cuda")
@ -154,14 +154,14 @@ tts.tts_to_file("C'est le clonage de la voix.", speaker_wav="my/cloning/audio.wa
tts.tts_to_file("Isso é clonagem de voz.", speaker_wav="my/cloning/audio.wav", language="pt", file_path="output.wav")
```
#### Example voice conversion converting speaker of the `source_wav` to the speaker of the `target_wav`
### Voice conversion from the speaker of `source_wav` to the speaker of `target_wav`
```python
tts = TTS(model_name="voice_conversion_models/multilingual/vctk/freevc24", progress_bar=False).to("cuda")
tts.voice_conversion_to_file(source_wav="my/source.wav", target_wav="my/target.wav", file_path="output.wav")
```
#### Example voice cloning by a single speaker TTS model combining with the voice conversion model.
### Voice cloning by combining a single speaker TTS model with the voice conversion model
This way, you can clone voices by using any model in 🐸TTS.
@ -174,7 +174,7 @@ tts.tts_with_vc_to_file(
)
```
#### Example text to speech using **Fairseq models in ~1100 languages** 🤯.
### Text to speech using **Fairseq models in ~1100 languages** 🤯.
For these models use the following name format: `tts_models/<lang-iso_code>/fairseq/vits`.
You can find the list of language ISO codes [here](https://dl.fbaipublicfiles.com/mms/tts/all-tts-languages.html) and learn about the Fairseq models [here](https://github.com/facebookresearch/fairseq/tree/main/examples/mms).
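For instance, following the name format above (a hedged example; "deu" is the ISO 639-3 code for German):

```python
from TTS.api import TTS

api = TTS("tts_models/deu/fairseq/vits")
api.tts_to_file(
    "Wie sage ich auf Italienisch, dass ich dich liebe?",
    file_path="output.wav",
)
```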


@ -1,6 +1,7 @@
# Installation
🐸TTS supports python >=3.9 <3.13.0 and was tested on Ubuntu 22.04.
🐸TTS supports Python >= 3.9, < 3.13.0 and was tested on Ubuntu 24.04, but should
also run on Mac and Windows.
## Using `pip`
@ -33,8 +34,3 @@ make install
# Same as above + dev dependencies and pre-commit
make install_dev
```
## On Windows
If you are on Windows, 👑@GuyPaddock wrote installation instructions
[here](https://stackoverflow.com/questions/66726331/) (note that these are out
of date, e.g. you need to have at least Python 3.9)


@ -1,22 +1,22 @@
# Model API
Model API provides a set of functions that easily make your model compatible with the `Trainer`,
`Synthesizer` and `ModelZoo`.
`Synthesizer` and the Coqui Python API.
## Base TTS Model
## Base Trainer Model
```{eval-rst}
.. autoclass:: TTS.model.BaseTrainerModel
:members:
```
## Base tts Model
## Base TTS Model
```{eval-rst}
.. autoclass:: TTS.tts.models.base_tts.BaseTTS
:members:
```
## Base vocoder Model
## Base Vocoder Model
```{eval-rst}
.. autoclass:: TTS.vocoder.models.base_vocoder.BaseVocoder


@ -1,3 +1,3 @@
# Trainer API
We made the trainer a separate project on https://github.com/eginhard/coqui-trainer
We made the trainer a separate project: https://github.com/idiap/coqui-ai-Trainer


@ -1,4 +1,4 @@
# Mary-TTS API Support for Coqui-TTS
# Mary-TTS API support for Coqui TTS
## What is Mary-TTS?


@ -1,25 +1,25 @@
# TTS
TTS is a super cool Text-to-Speech model that lets you clone voices in different languages by using just a quick 3-second audio clip. Built on the 🐢Tortoise,
TTS has important model changes that make cross-language voice cloning and multi-lingual speech generation super easy.
# XTTS
XTTS is a super cool Text-to-Speech model that lets you clone voices in different languages by using just a quick 3-second audio clip. Built on the 🐢Tortoise,
XTTS has important model changes that make cross-language voice cloning and multi-lingual speech generation super easy.
There is no need for an excessive amount of training data that spans countless hours.
### Features
## Features
- Voice cloning.
- Cross-language voice cloning.
- Multi-lingual speech generation.
- 24 kHz sampling rate.
- Streaming inference with < 200ms latency. (See [Streaming inference](#streaming-inference))
- Streaming inference with < 200ms latency. (See [Streaming inference](#streaming-manually))
- Fine-tuning support. (See [Training](#training))
### Updates with v2
## Updates with v2
- Improved voice cloning.
- Voices can be cloned with a single audio file or multiple audio files, without any effect on the runtime.
- Across the board quality improvements.
### Code
## Code
The current implementation only supports inference and GPT encoder training.
### Languages
## Languages
XTTS-v2 supports 17 languages:
- Arabic (ar)
@ -40,15 +40,15 @@ XTTS-v2 supports 17 languages:
- Spanish (es)
- Turkish (tr)
### License
## License
This model is licensed under [Coqui Public Model License](https://coqui.ai/cpml).
### Contact
## Contact
Come and join our 🐸Community. We're active on [Discord](https://discord.gg/fBC58unbKE) and [Github](https://github.com/idiap/coqui-ai-TTS/discussions).
### Inference
## Inference
#### 🐸TTS Command line
### 🐸TTS Command line
You can check all supported languages with the following command:
@ -64,7 +64,7 @@ You can check all Coqui available speakers with the following command:
--list_speaker_idx
```
##### Coqui speakers
#### Coqui speakers
You can run inference with one of the available speakers using the following command:
```console
@ -75,10 +75,10 @@ You can do inference using one of the available speakers using the following com
--use_cuda
```
##### Clone a voice
#### Clone a voice
You can clone a speaker voice using a single or multiple references:
###### Single reference
##### Single reference
```console
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
@ -88,7 +88,7 @@ You can clone a speaker voice using a single or multiple references:
--use_cuda
```
###### Multiple references
##### Multiple references
```console
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
--text "Bugün okula gitmek istemiyorum." \
@ -106,12 +106,12 @@ or for all wav files in a directory you can use:
--use_cuda
```
#### 🐸TTS API
### 🐸TTS API
##### Clone a voice
#### Clone a voice
You can clone a speaker voice using a single or multiple references:
###### Single reference
##### Single reference
Splits the text into sentences and generates audio for each sentence. The audio files are then concatenated to produce the final audio.
You can optionally disable sentence splitting for better coherence, at the cost of more VRAM and possibly hitting the model's context length limit.
@ -129,7 +129,7 @@ tts.tts_to_file(text="It took me quite a long time to develop a voice, and now t
)
```
###### Multiple references
##### Multiple references
You can pass multiple audio files to the `speaker_wav` argument for better voice cloning.
@ -154,7 +154,7 @@ tts.tts_to_file(text="It took me quite a long time to develop a voice, and now t
language="en")
```
##### Coqui speakers
#### Coqui speakers
You can run inference with one of the available speakers using the following code:
@ -172,11 +172,11 @@ tts.tts_to_file(text="It took me quite a long time to develop a voice, and now t
```
#### 🐸TTS Model API
### 🐸TTS Model API
To use the model API, you need to download the model files and pass config and model file paths manually.
#### Manual Inference
### Manual Inference
If you want to be able to `load_checkpoint` with `use_deepspeed=True` and **enjoy the speedup**, you need to install deepspeed first.
@ -184,7 +184,7 @@ If you want to be able to `load_checkpoint` with `use_deepspeed=True` and **enjo
pip install deepspeed==0.10.3
```
##### inference parameters
#### Inference parameters
- `text`: The text to be synthesized.
- `language`: The language of the text to be synthesized.
@ -199,7 +199,7 @@ pip install deepspeed==0.10.3
- `enable_text_splitting`: Whether to split the text into sentences and generate audio for each sentence. It allows you to have infinite input length but might lose important context between sentences. Defaults to True.
##### Inference
#### Inference
```python
@ -231,7 +231,7 @@ torchaudio.save("xtts.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```
##### Streaming manually
#### Streaming manually
Here the goal is to stream the audio as it is being generated. This is useful for real-time applications.
Streaming inference is typically slower than regular inference, but it lets you get the first chunk of audio faster.
@ -275,9 +275,9 @@ torchaudio.save("xtts_streaming.wav", wav.squeeze().unsqueeze(0).cpu(), 24000)
```
### Training
## Training
#### Easy training
### Easy training
To make `XTTS_v2` GPT encoder training easier for beginners, we provide a Gradio demo that implements the whole fine-tuning pipeline. The Gradio demo enables the user to easily do the following steps:
- Preprocessing of the uploaded audio file(s) with the 🐸TTS Coqui formatter
@ -286,7 +286,7 @@ To make `XTTS_v2` GPT encoder training easier for beginner users we did a gradio
The user can run this gradio demo locally or remotely using a Colab Notebook.
##### Run demo on Colab
#### Run demo on Colab
To make the `XTTS_v2` fine-tuning more accessible for users that do not have good GPUs available, we provide a Google Colab Notebook.
The Colab Notebook is available [here](https://colab.research.google.com/drive/1GiI4_X724M8q2W-zZ-jXo7cWTV7RfaH-?usp=sharing).
@ -302,7 +302,7 @@ If you are not able to acess the video you need to follow the steps:
5. As soon as the training is done, you can go to the third Tab (3 - Inference), click on the button "Step 3 - Load Fine-tuned XTTS model" and wait until the fine-tuned model is loaded. Then you can run inference with the model by clicking on the button "Step 4 - Inference".
##### Run demo locally
#### Run demo locally
To run the demo locally you need to do the following steps:
1. Install 🐸 TTS following the instructions available [here](https://coqui-tts.readthedocs.io/en/latest/installation.html).
@ -319,7 +319,7 @@ If you are not able to access the video, here is what you need to do:
4. Go to the third Tab (3 - Inference) and then click on the button "Step 3 - Load Fine-tuned XTTS model" and wait until the fine-tuned model is loaded.
5. Now you can run inference with the model by clicking on the button "Step 4 - Inference".
#### Advanced training
### Advanced training
A recipe for `XTTS_v2` GPT encoder training using the `LJSpeech` dataset is available at https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/xtts_v1/train_gpt_xtts.py
@ -393,6 +393,6 @@ torchaudio.save(OUTPUT_WAV_PATH, torch.tensor(out["wav"]).unsqueeze(0), 24000)
## XTTS Model
```{eval-rst}
.. autoclass:: TTS.tts.models.xtts.XTTS
.. autoclass:: TTS.tts.models.xtts.Xtts
:members:
```


@ -1,4 +1,4 @@
# Training a Model
# Training a model
1. Decide the model you want to use.
@ -132,7 +132,7 @@
In the example above, we trained a `GlowTTS` model, but the same workflow applies to all the other 🐸TTS models.
# Multi-speaker Training
## Multi-speaker Training
Training a multi-speaker model is mostly the same as training a single-speaker model.
You need to specify a couple of configuration parameters, instantiate a `SpeakerManager` and pass it to the model.
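A hedged sketch of that setup for VITS (dataset path and details are placeholders; other models wire the `SpeakerManager` in the same way):

```python
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsArgs
from TTS.tts.utils.speakers import SpeakerManager
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

dataset_config = BaseDatasetConfig(formatter="vctk", path="path/to/VCTK/")
config = VitsConfig(model_args=VitsArgs(use_speaker_embedding=True), datasets=[dataset_config])

ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

# Collect the speaker IDs from the data and tell the model how many speakers to expect.
speaker_manager = SpeakerManager()
speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")
config.model_args.num_speakers = speaker_manager.num_speakers

model = Vits(config, ap, tokenizer, speaker_manager=speaker_manager)
```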


@ -1,4 +1,4 @@
# TTS Datasets
# TTS datasets
Some of the known public datasets that we have successfully applied 🐸TTS to:


@ -1,20 +1,29 @@
# Tutorial For Nervous Beginners
# Tutorial for nervous beginners
## Installation
First [install](installation.md) Coqui TTS.
User friendly installation. Recommended only for synthesizing voice.
## Synthesizing Speech
You can run `tts` and synthesize speech directly on the terminal.
```bash
$ pip install coqui-tts
$ tts -h # see the help
$ tts --list_models # list the available models.
```
Developer friendly installation.
![cli.gif](https://github.com/idiap/coqui-ai-TTS/raw/main/images/tts_cli.gif)
You can call `tts-server` to start a local demo server that you can open on
your favorite web browser and 🗣️ (make sure to install the additional
dependencies with `pip install coqui-tts[server]`).
```bash
$ git clone https://github.com/idiap/coqui-ai-TTS
$ cd coqui-ai-TTS
$ pip install -e .
$ tts-server -h # see the help
$ tts-server --list_models # list the available models.
```
![server.gif](https://github.com/idiap/coqui-ai-TTS/raw/main/images/demo_server.gif)
## Training a `tts` Model
@ -99,25 +108,3 @@ We still support running training from CLI like in the old days. The same traini
```
❗️ Note that you can also use ```train_vocoder.py``` in the same way as the ```tts``` models above.
## Synthesizing Speech
You can run `tts` and synthesize speech directly on the terminal.
```bash
$ tts -h # see the help
$ tts --list_models # list the available models.
```
![cli.gif](https://github.com/idiap/coqui-ai-TTS/raw/main/images/tts_cli.gif)
You can call `tts-server` to start a local demo server that you can open on
your favorite web browser and 🗣️ (make sure to install the additional
dependencies with `pip install coqui-tts[server]`).
```bash
$ tts-server -h # see the help
$ tts-server --list_models # list the available models.
```
![server.gif](https://github.com/idiap/coqui-ai-TTS/raw/main/images/demo_server.gif)