diff --git a/docs/source/index.md b/docs/source/index.md
index d5f77ad4..756cea8e 100644
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -47,6 +47,7 @@
     models/glow_tts.md
     models/vits.md
     models/forward_tts.md
+    models/tacotron1-2.md
 
 .. toctree::
     :maxdepth: 2
diff --git a/docs/source/models/tacotron1-2.md b/docs/source/models/tacotron1-2.md
new file mode 100644
index 00000000..90833ecb
--- /dev/null
+++ b/docs/source/models/tacotron1-2.md
@@ -0,0 +1,63 @@
+# 🌮 Tacotron 1 and 2
+
+Tacotron is one of the first successful DL-based text-to-mel models and opened up the whole TTS field for more DL research.
+
+Tacotron is mainly an encoder-decoder model with attention.
+
+The encoder takes input tokens (characters or phonemes) and the decoder outputs mel-spectrogram frames. The attention module in between learns to align the input tokens with the output mel-spectrograms.
+
+Tacotron 1 and 2 are built on the same encoder-decoder architecture but they use different layers. Additionally, Tacotron 1 uses a Postnet module to convert mel-spectrograms to higher-resolution linear spectrograms before the vocoder.
+
+Vanilla Tacotron models are slow at inference due to their auto-regressive nature, which prevents the model from processing all the inputs in parallel. One trick is to use a higher "reduction rate", which lets the model predict multiple frames at once. That is, a reduction rate of 2 halves the number of decoder iterations.
+
+Tacotron also uses a Prenet module with Dropout that projects the model's previous output before feeding it to the decoder again. The paper and most implementations keep the Dropout layer enabled even at inference and report that the attention fails or the voice quality degrades otherwise. The downside is that you get a slightly different output speech every time you run the model.
+
+Training the attention is notoriously problematic in Tacotron models. Especially at inference, the alignment may fail for some input sequences, causing the model to produce unexpected results. Many different methods have been proposed to improve the attention.
+
+After hundreds of experiments, at 🐸TTS we suggest Double Decoder Consistency, which leads to the most robust model performance.
+
+If you have limited VRAM, you can try using the Guided Attention Loss or the Dynamic Convolutional Attention. You can also combine the two.
+
+
+## Important resources & papers
+- Tacotron: https://arxiv.org/abs/1703.10135
+- Tacotron2: https://arxiv.org/abs/1712.05884
+- Double Decoder Consistency: https://coqui.ai/blog/tts/solving-attention-problems-of-tts-models-with-double-decoder-consistency
+- Guided Attention Loss: https://arxiv.org/abs/1710.08969
+- Forward & Backward Decoder: https://arxiv.org/abs/1907.09006
+- Forward Attention: https://arxiv.org/abs/1807.06736
+- Gaussian Attention: https://arxiv.org/abs/1910.10288
+- Dynamic Convolutional Attention: https://arxiv.org/pdf/1910.10288.pdf
+
+
+## BaseTacotron
+```{eval-rst}
+.. autoclass:: TTS.tts.models.base_tacotron.BaseTacotron
+    :members:
+```
+
+## Tacotron
+```{eval-rst}
+.. autoclass:: TTS.tts.models.tacotron.Tacotron
+    :members:
+```
+
+## Tacotron2
+```{eval-rst}
+.. autoclass:: TTS.tts.models.tacotron2.Tacotron2
+    :members:
+```
+
+## TacotronConfig
+```{eval-rst}
+.. autoclass:: TTS.tts.configs.tacotron_config.TacotronConfig
+    :members:
+```
+
+## Tacotron2Config
+```{eval-rst}
+.. autoclass:: TTS.tts.configs.tacotron2_config.Tacotron2Config
+    :members:
+```
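As a quick illustration of the knobs discussed in the new Tacotron page (reduction rate, Double Decoder Consistency, attention type), here is a minimal config sketch. The field names `r`, `double_decoder_consistency`, `ddc_r` and `attention_type` are assumptions based on the `TacotronConfig`/`Tacotron2Config` classes referenced above; verify them against the rendered API docs for your version.

```python
# Sketch only: ties the prose above to concrete config fields.
# Field names are assumptions -- check TTS.tts.configs.tacotron2_config.Tacotron2Config.
from TTS.tts.configs.tacotron2_config import Tacotron2Config

config = Tacotron2Config(
    r=2,                              # reduction rate: 2 frames per decoder step -> half the decoder iterations
    double_decoder_consistency=True,  # DDC, the most robust setup in our experiments
    ddc_r=6,                          # reduction rate of the coarse (second) decoder used by DDC
    attention_type="original",        # "dynamic_convolution" is an alternative on limited VRAM
)
```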
diff --git a/docs/source/training_a_model.md b/docs/source/training_a_model.md
index deb94e85..3e781461 100644
--- a/docs/source/training_a_model.md
+++ b/docs/source/training_a_model.md
@@ -1,18 +1,19 @@
 # Training a Model
 
-1. Decide what model you want to use.
+1. Decide the model you want to use.
 
     Each model has a different set of pros and cons that define the run-time efficiency and the voice quality.
    It is up to you to decide what model servers your needs. Other than referring to the papers, one easy way is to test the 🐸TTS
    community models and see how fast and good each of the models. Or you can start a discussion on our communication channels.
 
-2. Understand the configuration, its fields and values of your model.
+2. Understand the configuration, its fields and values.
 
     For instance, if you want to train a `Tacotron` model then see the `TacotronConfig` class and make sure you understand it.
 
-3. Go to the recipes and check the recipe of your target model.
+3. Check the recipes.
 
-    Recipes do not promise perfect models but they provide a good start point for `Nervous Beginners`. A recipe script for
-    `GlowTTS` using `LJSpeech` dataset looks like below. Let's be creative and call this `train_glowtts.py`.
+    Recipes are located under `TTS/recipes/`. They do not promise perfect models but they provide a good starting point for
+    `Nervous Beginners`.
+    A recipe for `GlowTTS` using the `LJSpeech` dataset is shown below. Let's be creative and call this `train_glowtts.py`.
 
     ```python
     # train_glowtts.py
@@ -20,7 +21,8 @@
     import os
 
     from TTS.trainer import Trainer, TrainingArgs
-    from TTS.tts.configs import BaseDatasetConfig, GlowTTSConfig
+    from TTS.tts.configs.shared_configs import BaseDatasetConfig
+    from TTS.tts.configs.glow_tts_config import GlowTTSConfig
     from TTS.tts.datasets import load_tts_samples
     from TTS.tts.models.glow_tts import GlowTTS
     from TTS.utils.audio import AudioProcessor
@@ -183,3 +185,80 @@
 8. Return to the step 1 and reiterate for training a `vocoder` model.
 
 In the example above, we trained a `GlowTTS` model, but the same workflow applies to all the other 🐸TTS models.
+
+
+# Multi-speaker Training
+
+Training a multi-speaker model is mostly the same as training a single-speaker model.
+You need to set a couple of configuration parameters, initialize a `SpeakerManager` instance and pass it to the model.
+
+The configuration parameters define whether you want to train the model with a speaker-embedding layer or pre-computed
+d-vectors. To use d-vectors, you first need to compute them with the `SpeakerEncoder`.
+
+The same Glow-TTS model above can be trained on a multi-speaker VCTK dataset with the script below.
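Before the full script, here is a minimal sketch of the d-vector alternative mentioned above, i.e. training from pre-computed speaker vectors instead of a trainable speaker-embedding layer. The field names `use_d_vector_file`, `d_vector_file` and `d_vector_dim`, as well as the `speakers.json` path, are assumptions; check `GlowTTSConfig` for the exact fields in your version.

```python
# Sketch: d-vector-based multi-speaker config (instead of use_speaker_embedding=True).
# Field names and the d-vector file path are assumptions -- verify against GlowTTSConfig.
from TTS.tts.configs.glow_tts_config import GlowTTSConfig

config = GlowTTSConfig(
    batch_size=32,
    use_speaker_embedding=False,            # no trainable speaker-embedding layer
    use_d_vector_file=True,                 # read pre-computed d-vectors instead
    d_vector_file="path/to/speakers.json",  # produced beforehand with the SpeakerEncoder
    d_vector_dim=256,                       # must match the speaker encoder output size
)
```

The full training script, which instead uses a trainable speaker-embedding layer (`use_speaker_embedding=True`), follows.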
+
+```python
+import os
+
+from TTS.config.shared_configs import BaseAudioConfig
+from TTS.trainer import Trainer, TrainingArgs
+from TTS.tts.configs.glow_tts_config import GlowTTSConfig
+from TTS.tts.configs.shared_configs import BaseDatasetConfig
+from TTS.tts.datasets import load_tts_samples
+from TTS.tts.models.glow_tts import GlowTTS
+from TTS.tts.utils.speakers import SpeakerManager
+from TTS.utils.audio import AudioProcessor
+
+# define dataset config for VCTK
+output_path = os.path.dirname(os.path.abspath(__file__))
+dataset_config = BaseDatasetConfig(name="vctk", meta_file_train="", path=os.path.join(output_path, "../VCTK/"))
+
+# init audio processing config
+audio_config = BaseAudioConfig(sample_rate=22050, do_trim_silence=True, trim_db=23.0)
+
+# init training config
+config = GlowTTSConfig(
+    batch_size=64,
+    eval_batch_size=16,
+    num_loader_workers=4,
+    num_eval_loader_workers=4,
+    run_eval=True,
+    test_delay_epochs=-1,
+    epochs=1000,
+    text_cleaner="phoneme_cleaners",
+    use_phonemes=True,
+    phoneme_language="en-us",
+    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
+    print_step=25,
+    print_eval=False,
+    mixed_precision=True,
+    output_path=output_path,
+    audio=audio_config,
+    datasets=[dataset_config],
+    use_speaker_embedding=True,
+)
+
+# init audio processor
+ap = AudioProcessor(**config.audio.to_dict())
+
+# load training samples
+train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)
+
+# ONLY FOR MULTI-SPEAKER: init speaker manager for multi-speaker training
+speaker_manager = SpeakerManager()
+speaker_manager.set_speaker_ids_from_data(train_samples + eval_samples)
+config.num_speakers = speaker_manager.num_speakers
+
+# init model
+model = GlowTTS(config, speaker_manager)
+
+# init the trainer and 🚀
+trainer = Trainer(
+    TrainingArgs(),
+    config,
+    output_path,
+    model=model,
+    train_samples=train_samples,
+    eval_samples=eval_samples,
+    training_assets={"audio_processor": ap},
+)
+trainer.fit()
+```
diff --git a/docs/source/tutorial_for_nervous_beginners.md b/docs/source/tutorial_for_nervous_beginners.md
index dc5e9a6c..828314ad 100644
--- a/docs/source/tutorial_for_nervous_beginners.md
+++ b/docs/source/tutorial_for_nervous_beginners.md
@@ -29,10 +29,10 @@ each line.
     import os
 
     # GlowTTSConfig: all model related values for training, validating and testing.
-    from TTS.tts.configs import GlowTTSConfig
+    from TTS.tts.configs.glow_tts_config import GlowTTSConfig
 
     # BaseDatasetConfig: defines name, formatter and path of the dataset.
-    from TTS.tts.configs import BaseDatasetConfig
+    from TTS.tts.configs.shared_configs import BaseDatasetConfig
 
     # init_training: Initialize and setup the training environment.
     # Trainer: Where the ✨️ happens.
@@ -79,7 +79,7 @@ each line.
 
     # Initiate the Trainer.
     # Trainer provides a generic API to train all the 🐸TTS models with all its perks like mixed-precision training,
-    # distributed training etc.
+    # distributed training, etc.
    trainer = Trainer(
        TrainingArgs(),
        config,
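Once a recipe like the ones above has produced a checkpoint, a quick way to sanity-check it is to synthesize a sentence from Python. This is only a sketch: the run-directory paths are hypothetical, and the `Synthesizer` constructor arguments (`tts_checkpoint`, `tts_config_path`, `use_cuda`) should be verified against `TTS/utils/synthesizer.py` in your version.

```python
# Sketch: load a trained checkpoint and synthesize one test sentence.
# Paths are hypothetical; argument names are assumptions to verify against
# TTS.utils.synthesizer.Synthesizer.
from TTS.utils.synthesizer import Synthesizer

synthesizer = Synthesizer(
    tts_checkpoint="output/<run_dir>/best_model.pth.tar",  # hypothetical run directory
    tts_config_path="output/<run_dir>/config.json",
    use_cuda=False,
)
wav = synthesizer.tts("This is a test sentence from a freshly trained model.")
synthesizer.save_wav(wav, "test_output.wav")
```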