Add preliminary sphinx documentation

pull/506/head
Eren Gölge 2021-06-27 20:55:20 +02:00
parent 6b265ae8e3
commit 65958eaa41
27 changed files with 1200 additions and 198 deletions

2
.gitignore vendored
View File

@ -140,7 +140,7 @@ events.out*
old_configs/*
model_importers/*
model_profiling/*
docs/*
docs/source/TODO/*
.noseids
.dccache
log.txt

View File

@ -6,16 +6,6 @@ help:
target_dirs := tests TTS notebooks
system-deps: ## install linux system deps
sudo apt-get install -y libsndfile1-dev
dev-deps: ## install development deps
pip install -r requirements.dev.txt
pip install -r requirements.tf.txt
deps: ## install 🐸 requirements.
pip install -r requirements.txt
test_all: ## run tests and don't stop on an error.
nosetests --with-cov -cov --cover-erase --cover-package TTS tests --nologcapture --with-id
./run_bash_tests.sh
@ -34,5 +24,21 @@ style: ## update code style.
lint: ## run pylint linter.
pylint ${target_dirs}
system-deps: ## install linux system deps
sudo apt-get install -y libsndfile1-dev
dev-deps: ## install development deps
pip install -r requirements.dev.txt
pip install -r requirements.tf.txt
doc-deps: ## install docs dependencies
pip install -r docs/requirements.txt
hub-deps: ## install deps for torch hub use
pip install -r requirements.hub.txt
deps: ## install 🐸 requirements.
pip install -r requirements.txt
install: ## install 🐸 TTS for development.
pip install -e .[all]

203
README.md
View File

@ -5,6 +5,7 @@
[![CircleCI](https://github.com/coqui-ai/TTS/actions/workflows/main.yml/badge.svg)]()
[![License](<https://img.shields.io/badge/License-MPL%202.0-brightgreen.svg>)](https://opensource.org/licenses/MPL-2.0)
[![Docs](<https://readthedocs.org/projects/tts/badge/?version=latest&style=plastic>)](https://tts.readthedocs.io/en/latest/)
[![PyPI version](https://badge.fury.io/py/TTS.svg)](https://badge.fury.io/py/TTS)
[![Covenant](https://camo.githubusercontent.com/7d620efaa3eac1c5b060ece5d6aacfcc8b81a74a04d05cd0398689c01c4463bb/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436f6e7472696275746f72253230436f76656e616e742d76322e3025323061646f707465642d6666363962342e737667)](https://github.com/coqui-ai/TTS/blob/master/CODE_OF_CONDUCT.md)
[![Downloads](https://pepy.tech/badge/tts)](https://pepy.tech/project/tts)
@ -16,12 +17,10 @@
📢 [English Voice Samples](https://erogol.github.io/ddc-samples/) and [SoundCloud playlist](https://soundcloud.com/user-565970875/pocket-article-wavernn-and-tacotron2)
👩🏽‍🍳 [TTS training recipes](https://github.com/erogol/TTS_recipes)
📄 [Text-to-Speech paper collection](https://github.com/erogol/TTS-papers)
## 💬 Where to ask questions
Please use our dedicated channels for questions and discussion. Help is much more valuable if it's shared publicly, so that more people can benefit from it.
Please use our dedicated channels for questions and discussion. Help is much more valuable if it's shared publicly so that more people can benefit from it.
| Type | Platforms |
| ------------------------------- | --------------------------------------- |
@ -40,14 +39,11 @@ Please use our dedicated channels for questions and discussion. Help is much mor
## 🔗 Links and Resources
| Type | Links |
| ------------------------------- | --------------------------------------- |
| 💼 **Documentation** | [ReadTheDocs](https://tts.readthedocs.io/en/latest/)
| 💾 **Installation** | [TTS/README.md](https://github.com/coqui-ai/TTS/tree/dev#install-tts)|
| 👩‍💻 **Contributing** | [CONTRIBUTING.md](https://github.com/coqui-ai/TTS/blob/main/CONTRIBUTING.md)|
| 📌 **Road Map** | [Main Development Plans](https://github.com/coqui-ai/TTS/issues/378)
| 👩🏾‍🏫 **Tutorials and Examples** | [TTS/Wiki](https://github.com/coqui-ai/TTS/wiki/%F0%9F%90%B8-TTS-Notebooks,-Examples-and-Tutorials) |
| 🚀 **Released Models** | [TTS Releases](https://github.com/coqui-ai/TTS/releases) and [Experimental Models](https://github.com/coqui-ai/TTS/wiki/Experimental-Released-Models)|
| 🖥️ **Demo Server** | [TTS/server](https://github.com/coqui-ai/TTS/tree/master/TTS/server)|
| 🤖 **Synthesize speech** | [TTS/README.md](https://github.com/coqui-ai/TTS#example-synthesizing-speech-on-terminal-using-the-released-models)|
| 🛠️ **Implementing a New Model** | [TTS/Wiki](https://github.com/coqui-ai/TTS/wiki/Implementing-a-New-Model-in-%F0%9F%90%B8TTS)|
## 🥇 TTS Performance
<p align="center"><img src="https://raw.githubusercontent.com/coqui-ai/TTS/main/images/TTS-performance.png" width="800" /></p>
@ -56,20 +52,19 @@ Underlined "TTS*" and "Judy*" are 🐸TTS models
<!-- [Details...](https://github.com/coqui-ai/TTS/wiki/Mean-Opinion-Score-Results) -->
## Features
- High performance Deep Learning models for Text2Speech tasks.
- High-performance Deep Learning models for Text2Speech tasks.
- Text2Spec models (Tacotron, Tacotron2, Glow-TTS, SpeedySpeech).
- Speaker Encoder to compute speaker embeddings efficiently.
- Vocoder models (MelGAN, Multiband-MelGAN, GAN-TTS, ParallelWaveGAN, WaveGrad, WaveRNN)
- Fast and efficient model training.
- Detailed training logs on console and Tensorboard.
- Support for multi-speaker TTS.
- Efficient multi-GPU training.
- Detailed training logs on the terminal and Tensorboard.
- Support for Multi-speaker TTS.
- Efficient, flexible, lightweight but feature complete `Trainer API`.
- Ability to convert PyTorch models to Tensorflow 2.0 and TFLite for inference.
- Released models in PyTorch, Tensorflow and TFLite.
- Released and ready-to-use models.
- Tools to curate Text2Speech datasets under ```dataset_analysis```.
- Demo server for model testing.
- Notebooks for extensive model benchmarking.
- Modular (but not too much) code base enabling easy testing for new ideas.
- Utilities to use and test your models.
- Modular (but not too much) code base enabling easy implementation of new ideas.
## Implemented Models
### Text-to-Spectrogram
@ -98,8 +93,9 @@ Underlined "TTS*" and "Judy*" are 🐸TTS models
- WaveRNN: [origin](https://github.com/fatchord/WaveRNN/)
- WaveGrad: [paper](https://arxiv.org/abs/2009.00713)
- HiFiGAN: [paper](https://arxiv.org/abs/2010.05646)
- UnivNet: [paper](https://arxiv.org/abs/2106.07889)
You can also help us implement more models. Some 🐸TTS related work can be found [here](https://github.com/erogol/TTS-papers).
You can also help us implement more models.
## Install TTS
🐸TTS is tested on Ubuntu 18.04 with **python >= 3.6, < 3.9**.
@ -110,7 +106,7 @@ If you are only interested in [synthesizing speech](https://github.com/coqui-ai/
pip install TTS
```
By default this only installs the requirements for PyTorch. To install the tensorflow dependencies as well, use the `tf` extra.
By default, this only installs the requirements for PyTorch. To install the tensorflow dependencies as well, use the `tf` extra.
```bash
pip install TTS[tf]
@ -123,12 +119,6 @@ git clone https://github.com/coqui-ai/TTS
pip install -e .[all,dev,notebooks,tf] # Select the relevant extras
```
We use ```espeak-ng``` to convert graphemes to phonemes. You might need to install it separately.
```bash
sudo apt-get install espeak-ng
```
If you are on Ubuntu (Debian), you can also run the following commands for installation.
```bash
@ -137,6 +127,7 @@ $ make install
```
If you are on Windows, 👑@GuyPaddock wrote installation instructions [here](https://stackoverflow.com/questions/66726331/how-can-i-run-mozilla-tts-coqui-tts-training-with-cuda-on-a-windows-system).
## Directory Structure
```
|- notebooks/ (Jupyter Notebooks for model evaluation, parameter selection and data analysis.)
@ -147,6 +138,7 @@ If you are on Windows, 👑@GuyPaddock wrote installation instructions [here](ht
|- distribute.py (train your TTS model using Multiple GPUs.)
|- compute_statistics.py (compute dataset statistics for normalization.)
|- convert*.py (convert target torch model to TF.)
|- ...
|- tts/ (text to speech models)
|- layers/ (model layer definitions)
|- models/ (model definitions)
@ -156,167 +148,4 @@ If you are on Windows, 👑@GuyPaddock wrote installation instructions [here](ht
|- (same)
|- vocoder/ (Vocoder models.)
|- (same)
```
## Sample Model Output
Below you can see the state of a Tacotron model after 16K iterations with batch size 32 on the LJSpeech dataset.
> "Recent research at Harvard has shown meditating for as little as 8 weeks can actually increase the grey matter in the parts of the brain responsible for emotional regulation and learning."
Audio examples: [soundcloud](https://soundcloud.com/user-565970875/pocket-article-wavernn-and-tacotron2)
<img src="images/example_model_output.png?raw=true" alt="example_output" width="400"/>
## Datasets and Data-Loading
🐸TTS provides a generic dataloader that is easy to use with your custom dataset.
You just need to write a simple function to format the dataset. Check ```datasets/preprocess.py``` to see some examples.
After that, you need to set ```dataset``` fields in ```config.json```.
Some of the public datasets to which we successfully applied 🐸TTS:
- [LJ Speech](https://keithito.com/LJ-Speech-Dataset/)
- [Nancy](http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/)
- [TWEB](https://www.kaggle.com/bryanpark/the-world-english-bible-speech-dataset)
- [M-AI-Labs](http://www.caito.de/2019/01/the-m-ailabs-speech-dataset/)
- [LibriTTS](https://openslr.org/60/)
- [Spanish](https://drive.google.com/file/d/1Sm_zyBo67XHkiFhcRSQ4YaHPYM0slO_e/view?usp=sharing) - thx! @carlfm01
## Example: Synthesizing Speech on Terminal Using the Released Models.
<img src="images/tts_cli.gif"/>
After the installation, 🐸TTS provides a CLI interface for synthesizing speech using pre-trained models. You can either use your own model or the released models under 🐸TTS.
Listing released 🐸TTS models.
```bash
tts --list_models
```
Run a TTS model, from the release models list, with its default vocoder. (Simply copy and paste the full model names from the list as arguments for the command below.)
```bash
tts --text "Text for TTS" \
--model_name "<type>/<language>/<dataset>/<model_name>" \
--out_path folder/to/save/output.wav
```
Run a tts and a vocoder model from the released model list. Note that not every vocoder is compatible with every TTS model.
```bash
tts --text "Text for TTS" \
--model_name "<type>/<language>/<dataset>/<model_name>" \
--vocoder_name "<type>/<language>/<dataset>/<model_name>" \
--out_path folder/to/save/output.wav
```
Run your own TTS model (Using Griffin-Lim Vocoder)
```bash
tts --text "Text for TTS" \
--model_path path/to/model.pth.tar \
--config_path path/to/config.json \
--out_path folder/to/save/output.wav
```
Run your own TTS and Vocoder models
```bash
tts --text "Text for TTS" \
--config_path path/to/config.json \
--model_path path/to/model.pth.tar \
--out_path folder/to/save/output.wav \
--vocoder_path path/to/vocoder.pth.tar \
--vocoder_config_path path/to/vocoder_config.json
```
Run a multi-speaker TTS model from the released models list.
```bash
tts --model_name "<type>/<language>/<dataset>/<model_name>" --list_speaker_idxs # list the possible speaker IDs.
tts --text "Text for TTS." --out_path output/path/speech.wav --model_name "<language>/<dataset>/<model_name>" --speaker_idx "<speaker_id>"
```
**Note:** You can use ```./TTS/bin/synthesize.py``` if you prefer running ```tts``` from the TTS project folder.
## Example: Using the Demo Server for Synthesizing Speech
<!-- <img src="https://raw.githubusercontent.com/coqui-ai/TTS/main/images/demo_server.gif" height="56"/> -->
<img src="images/demo_server.gif"/>
You can boot up a demo 🐸TTS server to run inference with your models. Note that the server is not optimized for performance
but gives you an easy way to interact with the models.
The demo server provides pretty much the same interface as the CLI command.
```bash
tts-server -h # see the help
tts-server --list_models # list the available models.
```
Run a TTS model, from the release models list, with its default vocoder.
If the model you choose is a multi-speaker TTS model, you can select different speakers on the Web interface and synthesize
speech.
```bash
tts-server --model_name "<type>/<language>/<dataset>/<model_name>"
```
Run a TTS and a vocoder model from the released model list. Note that not every vocoder is compatible with every TTS model.
```bash
tts-server --model_name "<type>/<language>/<dataset>/<model_name>" \
--vocoder_name "<type>/<language>/<dataset>/<model_name>"
```
## Example: Training and Fine-tuning LJ-Speech Dataset
Here you can find a [Colab](https://gist.github.com/erogol/97516ad65b44dbddb8cd694953187c5b) notebook for a hands-on example of training LJSpeech. Or you can manually follow the guideline below.
To start with, split ```metadata.csv``` into train and validation subsets, respectively ```metadata_train.csv``` and ```metadata_val.csv```. Note that for text-to-speech, validation performance might be misleading, since the loss value does not directly measure voice quality to the human ear, nor does it measure the attention module performance. Therefore, running the model with new sentences and listening to the results is the best way to go.
```
shuf metadata.csv > metadata_shuf.csv
head -n 12000 metadata_shuf.csv > metadata_train.csv
tail -n 1100 metadata_shuf.csv > metadata_val.csv
```
To train a new model, you need to define your own ```config.json``` that sets the model details, training configuration and more (check the examples). Then call the corresponding train script.
For instance, in order to train a tacotron or tacotron2 model on LJSpeech dataset, follow these steps.
```bash
python TTS/bin/train_tacotron.py --config_path TTS/tts/configs/config.json
```
To fine-tune a model, use ```--restore_path```.
```bash
python TTS/bin/train_tacotron.py --config_path TTS/tts/configs/config.json --restore_path /path/to/your/model.pth.tar
```
To continue an old training run, use ```--continue_path```.
```bash
python TTS/bin/train_tacotron.py --continue_path /path/to/your/run_folder/
```
For multi-GPU training, call ```distribute.py```. It runs any provided train script in multi-GPU setting.
```bash
CUDA_VISIBLE_DEVICES="0,1,4" python TTS/bin/distribute.py --script train_tacotron.py --config_path TTS/tts/configs/config.json
```
Each run creates a new output folder accommodating the used ```config.json```, model checkpoints and Tensorboard logs.
In case of any error or interrupted execution, if there is no checkpoint yet under the output folder, the whole folder is removed.
You can also enjoy Tensorboard by pointing its ```--logdir``` argument to the experiment folder.
## [Contribution guidelines](https://github.com/coqui-ai/TTS/blob/main/CONTRIBUTING.md)
### Acknowledgement
- https://github.com/keithito/tacotron (Dataset pre-processing)
- https://github.com/r9y9/tacotron_pytorch (Initial Tacotron architecture)
- https://github.com/kan-bayashi/ParallelWaveGAN (GAN based vocoder library)
- https://github.com/jaywalnut310/glow-tts (Original Glow-TTS implementation)
- https://github.com/fatchord/WaveRNN/ (Original WaveRNN implementation)
- https://arxiv.org/abs/2010.05646 (Original HiFiGAN implementation)
```

20
docs/Makefile Normal file
View File

@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

0
docs/README.md Normal file
View File

5
docs/requirements.txt Normal file
View File

@ -0,0 +1,5 @@
furo
myst-parser == 0.15.1
sphinx == 4.0.2
sphinx_inline_tabs
sphinx_copybutton

Binary file not shown.

After

Width:  |  Height:  |  Size: 38 KiB

View File

@ -0,0 +1,25 @@
# AudioProcessor
`TTS.utils.audio.AudioProcessor` is the core class for all the audio processing routines. It provides an API for
- Feature extraction.
- Sound normalization.
- Reading and writing audio files.
- Sampling audio signals.
- Normalizing and denormalizing audio signals.
- Griffin-Lim vocoder.
The `AudioProcessor` needs to be initialized with `TTS.config.shared_configs.BaseAudioConfig`. Any model config
must also inherit or instantiate `BaseAudioConfig`.
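Below is a minimal usage sketch. It assumes the `AudioProcessor` accepts the audio config fields as keyword arguments and that `load_wav`/`melspectrogram` are the relevant helpers; check the class references below for the authoritative API.
```python
# Hedged sketch: the field names and the keyword-argument constructor pattern are assumptions.
from TTS.config.shared_configs import BaseAudioConfig
from TTS.utils.audio import AudioProcessor

audio_config = BaseAudioConfig(sample_rate=22050, num_mels=80)
ap = AudioProcessor(**audio_config.to_dict())

wav = ap.load_wav("path/to/audio.wav")  # read an audio file
mel = ap.melspectrogram(wav)            # extract mel-spectrogram features
```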
## AudioProcessor
```{eval-rst}
.. autoclass:: TTS.utils.audio.AudioProcessor
:members:
```
## BaseAudioConfig
```{eval-rst}
.. autoclass:: TTS.config.shared_configs.BaseAudioConfig
:members:
```

102
docs/source/conf.py Normal file
View File

@ -0,0 +1,102 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
sys.path.insert(0, os.path.abspath('../../TTS'))
autodoc_mock_imports = ["tts"]
# -- Project information -----------------------------------------------------
project = 'TTS'
copyright = "2021 Coqui GmbH, 2020 TTS authors"
author = 'Coqui GmbH'
with open("../../TTS/VERSION", "r") as ver:
version = ver.read().strip()
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
release = version
# The main toctree document.
master_doc = "index"
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store', 'TODO/*']
source_suffix = [".rst", ".md"]
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'furo'
html_title = "TTS"
html_theme_options = {
"light_logo": "logo.png",
"dark_logo": "logo.png",
"sidebar_hide_name": True,
}
html_sidebars = {
'**': [
"sidebar/scroll-start.html",
"sidebar/brand.html",
"sidebar/search.html",
"sidebar/navigation.html",
"sidebar/ethical-ads.html",
"sidebar/scroll-end.html",
]
}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
# using markdown
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.doctest',
'sphinx.ext.intersphinx',
'sphinx.ext.todo',
'sphinx.ext.coverage',
'sphinx.ext.napoleon',
'sphinx.ext.viewcode',
'sphinx.ext.autosectionlabel',
'myst_parser',
"sphinx_copybutton",
"sphinx_inline_tabs",
]
# 'sphinxcontrib.katex',
# 'sphinx.ext.autosectionlabel',

View File

@ -0,0 +1,59 @@
# Configuration
We use 👩‍✈️[Coqpit] for configuration management. It provides basic static type checking and serialization capabilities on top of native Python `dataclasses`. Here is what a simple configuration looks like with Coqpit.
```python
from dataclasses import asdict, dataclass, field
from typing import List, Union
from coqpit.coqpit import MISSING, Coqpit, check_argument
@dataclass
class SimpleConfig(Coqpit):
val_a: int = 10
val_b: int = None
val_d: float = 10.21
val_c: str = "Coqpit is great!"
vol_e: bool = True
# mandatory field
# raise an error when accessing the value if it is not changed. It is a way to define a mandatory field.
val_k: int = MISSING
# optional field
val_dict: dict = field(default_factory=lambda: {"val_aa": 10, "val_ss": "This is in a dict."})
# list of list
val_listoflist: List[List] = field(default_factory=lambda: [[1, 2], [3, 4]])
val_listofunion: List[List[Union[str, int, bool]]] = field(
default_factory=lambda: [[1, 3], [1, "Hi!"], [True, False]]
)
def check_values(
self,
): # you can define explicit constraints manually or by `check_argument()`
"""Check config fields"""
c = asdict(self) # avoid unexpected changes on `self`
check_argument("val_a", c, restricted=True, min_val=10, max_val=2056)
check_argument("val_b", c, restricted=True, min_val=128, max_val=4058, allow_none=True)
check_argument("val_c", c, restricted=True)
```
In TTS, each model must have a configuration class that exposes all the values necessary for its lifetime.
It defines the model architecture, hyper-parameters, training, and inference settings. For our models, we merge all the fields into a single configuration class for simplicity. It may not look like good practice, but it enables easier bookkeeping and reproducible experiments.
The general configuration hierarchy looks like this:
```
ModelConfig()
|
| -> ... # model specific configurations
| -> ModelArgs() # model class arguments
| -> BaseDatasetConfig() # only for tts models
| -> BaseXModelConfig() # Generic fields for `tts` and `vocoder` models.
|
| -> BaseTrainingConfig() # trainer fields
| -> BaseAudioConfig() # audio processing fields
```
In the example above, ```ModelConfig()``` is the final configuration that the model receives and it has all the fields necessary for the model.
We host pre-defined model configurations under ```TTS/<model_class>/configs/```. Although we recommend a unified config class, you can decompose it as you like for your custom models, as long as all the fields for the trainer, model, and inference APIs are provided.
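As a hedged sketch (the `BaseTTSConfig` import path is an assumption based on the hierarchy above; the field names below are placeholders), a custom model config could look like this:
```python
from dataclasses import dataclass

from TTS.tts.configs.shared_configs import BaseTTSConfig  # assumed import path


@dataclass
class MyModelConfig(BaseTTSConfig):
    """Illustrative final config for a hypothetical `MyModel`."""

    model: str = "my_model"
    # model-specific fields (placeholder names)
    hidden_channels: int = 256
    num_layers: int = 6
```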

View File

@ -0,0 +1,3 @@
```{include} ../../CONTRIBUTING.md
:relative-images:
```

View File

@ -0,0 +1,21 @@
# Converting Torch Tacotron to TF 2
Currently, 🐸TTS supports the vanilla Tacotron2 and MelGAN models in TF 2. It does not support advanced attention methods and other small tricks used by the Torch models. You can convert any Torch model trained after v0.0.2.
You can also export TF 2 models to TFLite for even faster inference.
## How to convert from Torch to TF 2.0
Make sure you have installed Tensorflow v2.2. It is not installed by 🐸TTS by default.
All the TF related code stays under ```tf``` folder.
To convert a **compatible** Torch model, run the following command with the right arguments:
```bash
python TTS/bin/convert_tacotron2_torch_to_tf.py \
--torch_model_path /path/to/torch/model.pth.tar \
--config_path /path/to/model/config.json \
--output_path /path/to/output/tf/model
```
This will create a TF model file. Notice that our model format is not compatible with the official TF checkpoints. We created our custom format to match Torch checkpoints we use. Therefore, use the ```load_checkpoint``` and ```save_checkpoint``` functions provided under ```TTS.tf.generic_utils```.

25
docs/source/dataset.md Normal file
View File

@ -0,0 +1,25 @@
# Datasets
## TTS Dataset
```{eval-rst}
.. autoclass:: TTS.tts.datasets.TTSDataset
:members:
```
## Vocoder Dataset
```{eval-rst}
.. autoclass:: TTS.vocoder.datasets.gan_dataset.GANDataset
:members:
```
```{eval-rst}
.. autoclass:: TTS.vocoder.datasets.wavegrad_dataset.WaveGradDataset
:members:
```
```{eval-rst}
.. autoclass:: TTS.vocoder.datasets.wavernn_dataset.WaveRNNDataset
:members:
```

114
docs/source/faq.md Normal file
View File

@ -0,0 +1,114 @@
# Humble FAQ
We tried to collect common issues and questions we receive about 🐸TTS. It is worth checking before going deeper.
## Errors with a pre-trained model. How can I resolve this?
- Make sure you use the right commit version of 🐸TTS. Each pre-trained model has its corresponding version that needs to be used. It is defined on the model table.
- If it is still problematic, post your problem on [Discussions](https://github.com/coqui-ai/TTS/discussions). Please give as many details as possible (error message, your TTS version, your TTS model and config.json etc.)
- If you feel like it's a bug to be fixed, then prefer Github issues with the same level of scrutiny.
## What are the requirements of a good 🐸TTS dataset?
* https://github.com/coqui-ai/TTS/wiki/What-makes-a-good-TTS-dataset
## How should I choose the right model?
- First, train Tacotron. It is smaller and faster to experiment with. If it performs poorly, try Tacotron2.
- Tacotron models produce the most natural voice if your dataset is not too noisy.
- If both models do not perform well and especially the attention does not align, then try AlignTTS or GlowTTS.
- If you need faster models, consider SpeedySpeech, GlowTTS or AlignTTS. Keep in mind that SpeedySpeech requires a pre-trained Tacotron or Tacotron2 model to compute text-to-speech alignments.
## How can I train my own `tts` model?
0. Check your dataset with notebooks in [dataset_analysis](https://github.com/coqui-ai/TTS/tree/master/notebooks/dataset_analysis) folder. Use [this notebook](https://github.com/coqui-ai/TTS/blob/master/notebooks/dataset_analysis/CheckSpectrograms.ipynb) to find the right audio processing parameters. A better set of parameters results in a better audio synthesis.
1. Write your own dataset `formatter` in `datasets/formatters.py` or format your dataset as one of the supported datasets, like LJSpeech.
A `formatter` parses the metadata file and converts it into a list of training samples.
2. If you have a dataset with a different alphabet than English, you need to set your own character list in the ```config.json```.
- If you use phonemes for training and your language is supported [here](https://github.com/rhasspy/gruut#supported-languages), you don't need to set your character list.
- You can use `TTS/bin/find_unique_chars.py` to get characters used in your dataset.
3. Write your own text cleaner in ```utils.text.cleaners```. It is not always necessary, but it is when you have a different alphabet or language-specific requirements.
- A `cleaner` performs number and abbreviation expansion and text normalization. Basically, it converts the written text to its spoken format.
- If you want to keep it simple, you can try using ```basic_cleaners```.
4. Fill in a ```config.json```. Go over each parameter one by one and consider it with respect to its explanation.
- Check the `Coqpit` class created for your target model. Coqpit classes for `tts` models are under `TTS/tts/configs/`.
- You just need to define fields you need/want to change in your `config.json`. For the rest, their default values are used.
- 'sample_rate', 'phoneme_language' (if phoneme enabled), 'output_path', 'datasets', 'text_cleaner' are the fields you need to edit in most of the cases.
- Here is a sample `config.json` for training a `GlowTTS` network.
```json
{
"model": "glow_tts",
"batch_size": 32,
"eval_batch_size": 16,
"num_loader_workers": 4,
"num_eval_loader_workers": 4,
"run_eval": true,
"test_delay_epochs": -1,
"epochs": 1000,
"text_cleaner": "english_cleaners",
"use_phonemes": false,
"phoneme_language": "en-us",
"phoneme_cache_path": "phoneme_cache",
"print_step": 25,
"print_eval": true,
"mixed_precision": false,
"output_path": "recipes/ljspeech/glow_tts/",
"test_sentences": ["Test this sentence.", "This test sentence.", "Sentence this test."],
"datasets":[{"name": "ljspeech", "meta_file_train":"metadata.csv", "path": "recipes/ljspeech/LJSpeech-1.1/"}]
}
```
5. Train your model.
- SingleGPU training: ```CUDA_VISIBLE_DEVICES="0" python train_tts.py --config_path config.json```
- MultiGPU training: ```CUDA_VISIBLE_DEVICES="0,1,2" python distribute.py --script train_tts.py --config_path config.json```
- This command uses all the GPUs given in ```CUDA_VISIBLE_DEVICES```. If you don't specify, it uses all the GPUs available.
**Note:** You can also train your model using pure 🐍 Python. Check the {ref}`Tutorial For Nervous Beginners`.
## How can I train in a different language?
- Check steps 2, 3, 4, 5 above.
## How can I train multi-GPUs?
- Check step 5 above.
## How can I check model performance?
- You can inspect model training and performance using ```tensorboard```. It will show you loss, attention alignment, model output. Go with the order below to measure the model performance.
1. Check ground truth spectrograms. If they do not look as they are supposed to, then check audio processing parameters in ```config.json```.
2. Check train and eval losses and make sure that they all decrease smoothly in time.
3. Check model spectrograms. Especially, training outputs should look similar to ground truth spectrograms after ~10K iterations.
4. Your model would not work well at test time until the attention has a near diagonal alignment. This is the sublime art of TTS training.
- Attention should converge diagonally after ~50K iterations.
- If the attention does not converge, the likely reasons are:
- Your dataset is too noisy or small.
- Samples are too long.
- Batch size is too small (with batch_size < 32, the model has a hard time converging).
- You can also try other attention algorithms like 'graves', 'bidirectional_decoder', 'forward_attn'.
- 'bidirectional_decoder' is your ultimate savior, but it trains 2x slower and demands 1.5x more GPU memory.
- You can also try the other models like AlignTTS or GlowTTS.
## How do I know when to stop training?
There is no single objective metric to decide the end of a training since the voice quality is a subjective matter.
In our model trainings, we follow these steps:
- Check the test-time audio outputs and see whether they stop improving.
- Check the test-time attention maps and see whether they look clear and diagonal.
- Check the validation loss to see whether it converged and went down smoothly, or started to go up (overfitting).
- If the answer is YES for all of the above, then test the model with a set of complex sentences. For English, you can use the `TestAttention` notebook.
Keep in mind that the approach above only validates model robustness. It is hard to estimate the voice quality without asking actual people.
The best approach is to pick a set of promising models and run a Mean-Opinion-Score study asking actual people to score the models.
## My model does not learn. How can I debug?
- Go over the steps under "How can I check model performance?"
## Attention does not align. How can I make it work?
- Check the 4th step under "How can I check model performance?"
## How can I test a trained model?
- The best way is to use `tts` or `tts-server` commands. For details check {ref}`here <Synthesizing Speech>`.
- If you need to run synthesis from your own code, use the ```TTS.utils.synthesizer.Synthesizer``` class.
## My Tacotron model does not stop - I see "Decoder stopped with 'max_decoder_steps'" - Stopnet does not work.
- In general, all of the above relates to the `stopnet`. It is the part of the model telling the `decoder` when to stop.
- In general, a poor `stopnet` relates to something else that is broken in your model or dataset. Especially the attention module.
- One common reason is the silent parts in the audio clips at the beginning and the ending. Check ```trim_db``` value in the config. You can find a better value for your dataset by using ```CheckSpectrogram``` notebook. If this value is too small, too much of the audio will be trimmed. If too big, then too much silence will remain. Both will curtail the `stopnet` performance.

View File

@ -0,0 +1,82 @@
# Formatting Your Dataset
For training a TTS model, you need a dataset with speech recordings and transcriptions. The speech must be divided into audio clips, and each clip needs a transcription.
If you have a single audio file and you need to split it into clips, there are different open-source tools for you. We recommend Audacity, a free and open-source audio editor.
It is also important to use a lossless audio file format to prevent compression artifacts. We recommend using `wav` file format.
Let's assume you created the audio clips and their transcription. You can collect all your clips under a folder. Let's call this folder `wavs`.
```
/wavs
| - audio1.wav
| - audio2.wav
| - audio3.wav
...
```
You can either create separate transcription files for each clip or create a text file that maps each audio clip to its transcription. In this file, each line must use a special delimiter character to separate the audio file name from the transcription. Make sure that the delimiter is not used in the transcription text.
We recommend the following format delimited by `|`.
```
# metadata.txt
audio1.wav | This is my sentence.
audio2.wav | This is maybe my sentence.
audio3.wav | This is certainly my sentence.
audio4.wav | Let this be your sentence.
...
```
In the end, we have the following folder structure
```
/MyTTSDataset
|
| -> metadata.txt
| -> /wavs
| -> audio1.wav
| -> audio2.wav
| ...
```
The format above is taken from the widely-used [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) dataset. You can also download and inspect it. 🐸TTS already provides tooling for LJSpeech, so if you use the same format, you can start training your models right away.
## Dataset Quality
Your dataset should have good coverage of the target language. It should cover the phonemic variety, exceptional sounds, and syllables. This is especially important for non-phonemic languages like English.
For more info about dataset qualities and properties check our [post](https://github.com/coqui-ai/TTS/wiki/What-makes-a-good-TTS-dataset).
## Using Your Dataset in 🐸TTS
After you collect and format your dataset, you need to check two things: whether you need a `formatter` and whether you need a `text_cleaner`. The `formatter` loads the text file (created above) as a list and the `text_cleaner` performs a sequence of text normalization operations that converts the raw text into the spoken representation (e.g. converting numbers to text, acronyms, and symbols to the spoken format).
If you use a dataset format different than LJSpeech or the other public datasets that 🐸TTS supports, then you need to write your own `formatter`.
If your dataset is in a new language or it needs special normalization steps, then you need a new `text_cleaner`.
What you get out of a `formatter` is a `List[List[]]` in the following format.
```
>>> formatter(metafile_path)
[["audio1.wav", "This is my sentence.", "MyDataset"],
["audio1.wav", "This is maybe a sentence.", "MyDataset"],
...
]
```
Each sub-list is parsed as ```["<filename>", "<transcription>", "<speaker_name">]```.
```<speaker_name>``` is the dataset name for single-speaker datasets, and it is mainly used
in multi-speaker models to map the speaker of each sample. For now, we only focus on single-speaker datasets.
The purpose of a `formatter` is to parse your metafile and load the audio file paths and transcriptions. Then, its output passes to a `Dataset` object. It computes features from the audio signals, calls text normalization routines, and converts raw text to
phonemes if needed.
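Here is a minimal `formatter` sketch for the `metadata.txt` format shown above. The function and argument names are illustrative; check `TTS/tts/datasets/formatters.py` for the exact signature your version expects.
```python
import os
from typing import List


def my_dataset(root_path: str, meta_file: str) -> List[List[str]]:
    """Parse lines of the form `audio1.wav | This is my sentence.`"""
    items = []
    with open(os.path.join(root_path, meta_file), "r", encoding="utf-8") as f:
        for line in f:
            wav_name, text = [col.strip() for col in line.split("|", maxsplit=1)]
            wav_path = os.path.join(root_path, "wavs", wav_name)
            # [<filename>, <transcription>, <speaker_name>] as described above
            items.append([wav_path, text, "MyDataset"])
    return items
```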
See `TTS.tts.datasets.TTSDataset`, a generic `Dataset` implementation for the `tts` models.
See `TTS.vocoder.datasets.*`, for different `Dataset` implementations for the `vocoder` models.
See `TTS.utils.audio.AudioProcessor` that includes all the audio processing and feature extraction functions used in a
`Dataset` implementation. Feel free to add things as you need.

View File

@ -0,0 +1,61 @@
# Implementing a Model
1. Implement layers.
You can either implement the layers under `TTS/tts/layers/new_model.py` or in the model file `TTS/tts/models/new_model.py`.
You can also reuse layers already implemented.
2. Test layers.
We keep tests under `tests` folder. You can add `tts` layers tests under `tts_tests` folder.
Basic tests are checking input-output tensor shapes and output values for a given input. Consider testing extreme cases that are more likely to cause problems like `zero` tensors.
3. Implement loss function.
We keep loss functions under `TTS/tts/layers/losses.py`. You can also mix-and-match implemented loss functions as you like.
A loss function returns a dictionary in the format ```{loss: loss, loss1: loss1, ...}```, and the dictionary must at least define the `loss` key, which is the actual value used by the optimizer. All the items in the dictionary are automatically logged on the terminal and the Tensorboard.
4. Test the loss function.
As we do for the layers, you need to test the loss functions too. You need to check input/output tensor shapes,
expected output values for a given input tensor. For instance, certain loss functions have upper and lower limits and
it is a wise practice to test with the inputs that should produce these limits.
5. Implement `MyModel`.
In 🐸TTS, a model class is a self-sufficient implementation of a model directing all the interactions with the other
components. It is enough to implement the API provided by the `BaseModel` class to comply.
A model interacts with the `Trainer API` for training, `Synthesizer API` for inference and testing.
A 🐸TTS model must return a dictionary from the `forward()` and `inference()` functions. This dictionary must also include the `model_outputs` key, which is considered the main model output by the `Trainer` and `Synthesizer` (see the sketch after this list).
You can place your `tts` model implementation under `TTS/tts/models/new_model.py` then inherit and implement the `BaseTTS`.
There is also the `callback` interface, which lets you manipulate both the model and the `Trainer` states. Callbacks give you
the flexibility to add custom behaviours to your model and training routines.
For more details, see {ref}`BaseTTS <Base TTS Model>` and `TTS/utils/callbacks.py`.
6. Optionally, define `MyModelArgs`.
`MyModelArgs` is a 👩‍✈️Coqpit class that sets all the class arguments of `MyModel`. It should be enough to pass
a `MyModelArgs` instance to initialize `MyModel`.
7. Test `MyModel`.
As with the layers and the loss functions, it is recommended to test your model. One smart way of testing is to
create two models with exactly the same weights, run a training step with one of them, and
compare its weights with the other model's. All the weights need to be different in a passing test. Otherwise, it
is likely that a part of the model is malfunctioning or not even attached to the model's computational graph.
8. Define `MyModelConfig`.
Place the `MyModelConfig` file under `TTS/models/configs`. It is enough to inherit the `BaseTTSConfig` to make your
config compatible with the `Trainer`. You should also include `MyModelArgs` as a field if it is defined. The rest of the fields should define the
model-specific values and parameters.
9. Write Docstrings.
We love you more when you document your code. ❤️
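To tie the steps together, here is a hedged, minimal sketch of the output-dictionary contract from step 5. It subclasses `nn.Module` to stay self-contained; in 🐸TTS you would inherit `BaseTTS` as described above, and the exact signatures may differ.
```python
import torch
from torch import nn


class MyModel(nn.Module):  # in 🐸TTS, inherit BaseTTS instead (see step 5)
    """Toy model illustrating the `model_outputs` contract."""

    def __init__(self, channels: int = 80):
        super().__init__()
        self.layer = nn.Linear(channels, channels)  # placeholder layer

    def forward(self, x: torch.Tensor) -> dict:
        # `model_outputs` is the key read by the Trainer and the Synthesizer.
        return {"model_outputs": self.layer(x)}

    @torch.no_grad()
    def inference(self, x: torch.Tensor) -> dict:
        return {"model_outputs": self.layer(x)}


model = MyModel()
out = model(torch.randn(4, 50, 80))
print(out["model_outputs"].shape)  # torch.Size([4, 50, 80])
```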

40
docs/source/index.md Normal file
View File

@ -0,0 +1,40 @@
```{include} ../../README.md
:relative-images:
```
----
# Documentation Content
```{eval-rst}
.. toctree::
:maxdepth: 2
:caption: Get started
tutorial_for_nervous_beginners
installation
faq
contributing
.. toctree::
:maxdepth: 2
:caption: Using 🐸TTS
inference
implementing_a_new_model
training_a_model
configuration
formatting_your_dataset
what_makes_a_good_dataset
tts_datasets
.. toctree::
:maxdepth: 2
:caption: Main Classes
trainer_api
audio_processor
model_api
configuration
dataset
```

103
docs/source/inference.md Normal file
View File

@ -0,0 +1,103 @@
(synthesizing_speech)=
# Synthesizing Speech
First, you need to install TTS. We recommend installing from PyPI with the command below:
```bash
$ pip install TTS
```
After the installation, 2 terminal commands are available.
1. TTS Command Line Interface (CLI). - `tts`
2. Local Demo Server. - `tts-server`
## On the Commandline - `tts`
![cli.gif](https://github.com/coqui-ai/TTS/raw/main/images/tts_cli.gif)
After the installation, 🐸TTS provides a CLI interface for synthesizing speech using pre-trained models. You can either use your own model or the released models under 🐸TTS.
Listing released 🐸TTS models.
```bash
tts --list_models
```
Run a TTS model, from the release models list, with its default vocoder. (Simply copy and paste the full model names from the list as arguments for the command below.)
```bash
tts --text "Text for TTS" \
--model_name "<type>/<language>/<dataset>/<model_name>" \
--out_path folder/to/save/output.wav
```
Run a tts and a vocoder model from the released model list. Note that not every vocoder is compatible with every TTS model.
```bash
tts --text "Text for TTS" \
--model_name "<type>/<language>/<dataset>/<model_name>" \
--vocoder_name "<type>/<language>/<dataset>/<model_name>" \
--out_path folder/to/save/output.wav
```
Run your own TTS model (Using Griffin-Lim Vocoder)
```bash
tts --text "Text for TTS" \
--model_path path/to/model.pth.tar \
--config_path path/to/config.json \
--out_path folder/to/save/output.wav
```
Run your own TTS and Vocoder models
```bash
tts --text "Text for TTS" \
--config_path path/to/config.json \
--model_path path/to/model.pth.tar \
--out_path folder/to/save/output.wav \
--vocoder_path path/to/vocoder.pth.tar \
--vocoder_config_path path/to/vocoder_config.json
```
Run a multi-speaker TTS model from the released models list.
```bash
tts --model_name "<type>/<language>/<dataset>/<model_name>" --list_speaker_idxs # list the possible speaker IDs.
tts --text "Text for TTS." --out_path output/path/speech.wav --model_name "<language>/<dataset>/<model_name>" --speaker_idx "<speaker_id>"
```
**Note:** You can use ```./TTS/bin/synthesize.py``` if you prefer running ```tts``` from the TTS project folder.
## On the Demo Server - `tts-server`
<!-- <img src="https://raw.githubusercontent.com/coqui-ai/TTS/main/images/demo_server.gif" height="56"/> -->
![server.gif](https://github.com/coqui-ai/TTS/raw/main/images/demo_server.gif)
You can boot up a demo 🐸TTS server to run inference with your models. Note that the server is not optimized for performance
but gives you an easy way to interact with the models.
The demo server provides pretty much the same interface as the CLI command.
```bash
tts-server -h # see the help
tts-server --list_models # list the available models.
```
Run a TTS model, from the release models list, with its default vocoder.
If the model you choose is a multi-speaker TTS model, you can select different speakers on the Web interface and synthesize
speech.
```bash
tts-server --model_name "<type>/<language>/<dataset>/<model_name>"
```
Run a TTS and a vocoder model from the released model list. Note that not every vocoder is compatible with every TTS model.
```bash
tts-server --model_name "<type>/<language>/<dataset>/<model_name>" \
--vocoder_name "<type>/<language>/<dataset>/<model_name>"
```
## TorchHub
You can also use [this simple colab notebook](https://colab.research.google.com/drive/1iAe7ZdxjUIuN6V4ooaCt0fACEGKEn7HW?usp=sharing) using TorchHub to synthesize speech.

View File

@ -0,0 +1,39 @@
# Installation
🐸TTS supports Python >=3.6, <=3.9 and is tested on Ubuntu 18.10, 19.10 and 20.10.
## Using `pip`
`pip` is recommended if you want to use 🐸TTS only for inference.
You can install from PyPI as follows:
```bash
pip install TTS # from PyPI
```
By default, this only installs the requirements for PyTorch. To install the tensorflow dependencies as well, use the `tf` extra.
```bash
pip install TTS[tf]
```
Or install from Github:
```bash
pip install git+https://github.com/coqui-ai/TTS # from Github
```
## Installing From Source
This is recommended for development and more control over 🐸TTS.
```bash
git clone https://github.com/coqui-ai/TTS/
cd TTS
make system-deps # only on Linux systems.
make install
```
## On Windows
If you are on Windows, 👑@GuyPaddock wrote installation instructions [here](https://stackoverflow.com/questions/66726331/how-can-i-run-mozilla-tts-coqui-tts-training-with-cuda-on-a-windows-system).

35
docs/source/make.bat Normal file
View File

@ -0,0 +1,35 @@
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd

24
docs/source/model_api.md Normal file
View File

@ -0,0 +1,24 @@
# Model API
The Model API provides a set of functions that make your model compatible with the `Trainer`,
`Synthesizer` and `ModelZoo`.
## Base TTS Model
```{eval-rst}
.. autoclass:: TTS.model.BaseModel
:members:
```
## Base `tts` Model
```{eval-rst}
.. autoclass:: TTS.tts.models.base_tts.BaseTTS
:members:
```
## Base `vocoder` Model
```{eval-rst}
.. autoclass:: TTS.tts.models.base_vocoder.BaseVocoder
:members:
```

View File

@ -0,0 +1,17 @@
# .readthedocs.yml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
# Required
version: 2
# Build documentation in the docs/ directory with Sphinx
sphinx:
builder: html
configuration: docs/source/conf.py
# Optionally set the version of Python and requirements required to build your docs
python:
version: 3.8
install:
- requirements: docs/requirements.txt

View File

@ -0,0 +1,17 @@
# Trainer API
The {class}`TTS.trainer.Trainer` provides a lightweight, extensible, and feature-complete training run-time. We optimized it for 🐸, but it
can also be used for any DL training in other domains. It supports distributed multi-gpu and mixed-precision (apex or torch.amp) training.
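A minimal usage sketch following the training recipes elsewhere in these docs; the exact argument handling may differ in your version.
```python
from TTS.trainer import init_training, Trainer, TrainingArgs
from TTS.tts.configs import GlowTTSConfig  # any model config works here

config = GlowTTSConfig()  # fill in the fields for your run (dataset, batch size, ...)
args, config, output_path, _, c_logger, tb_logger = init_training(TrainingArgs(), config)
trainer = Trainer(args, config, output_path, c_logger, tb_logger)
trainer.fit()
```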
## Trainer
```{eval-rst}
.. autoclass:: TTS.trainer.Trainer
:members:
```
## TrainingArgs
```{eval-rst}
.. autoclass:: TTS.trainer.TrainingArgs
:members:
```

View File

@ -0,0 +1,165 @@
# Training a Model
1. Decide what model you want to use.
Each model has a different set of pros and cons that define its run-time efficiency and voice quality. It is up to you to decide which model serves your needs. Other than referring to the papers, one easy way is to test the 🐸TTS
community models and see how fast and good each of them is. Or you can start a discussion on our communication channels.
2. Understand the configuration class, its fields and values of your model.
For instance, if you want to train a `Tacotron` model then see the `TacotronConfig` class and make sure you understand it.
3. Go to the recipes and check the recipe of your target model.
Recipes do not promise perfect models but they provide a good starting point for `Nervous Beginners`. A recipe script that trains
a `GlowTTS` model on the `LJSpeech` dataset looks like the following. Let's be creative and call this script `train_glowtts.py`.
```python
# train_glowtts.py
import os
from TTS.tts.configs import GlowTTSConfig
from TTS.tts.configs import BaseDatasetConfig
from TTS.trainer import init_training, Trainer, TrainingArgs
output_path = os.path.dirname(os.path.abspath(__file__))
dataset_config = BaseDatasetConfig(name="ljspeech", meta_file_train="metadata.csv", path=os.path.join(output_path, "../LJSpeech-1.1/"))
config = GlowTTSConfig(
batch_size=32,
eval_batch_size=16,
num_loader_workers=4,
num_eval_loader_workers=4,
run_eval=True,
test_delay_epochs=-1,
epochs=1000,
text_cleaner="english_cleaners",
use_phonemes=False,
phoneme_language="en-us",
phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
print_step=25,
print_eval=True,
mixed_precision=False,
output_path=output_path,
datasets=[dataset_config]
)
args, config, output_path, _, c_logger, tb_logger = init_training(TrainingArgs(), config)
trainer = Trainer(args, config, output_path, c_logger, tb_logger)
trainer.fit()
```
You need to change fields of the `BaseDatasetConfig` to match your own dataset and then update `GlowTTSConfig`
fields as you need.
4. Run the training.
You need to call the python training script.
```bash
$ CUDA_VISIBLE_DEVICES="0" python train_glowtts.py
```
Notice that you select the GPU you want to use by setting the `CUDA_VISIBLE_DEVICES` environment variable.
To see the available GPUs on your system, you can use the `nvidia-smi` command on the terminal.
If you would like to run multi-gpu training:
```bash
$ CUDA_VISIBLE_DEVICES="0, 1, 2" python TTS/bin/distribute.py --script <path_to_your_script>/train_glowtts.py
```
The example above runs a multi-gpu training using GPUs `0, 1, 2`.
The beginning of a training run looks like this:
```console
> Experiment folder: /your/output_path/-Juni-23-2021_02+52-78899209
> Using CUDA: True
> Number of GPUs: 1
> Setting up Audio Processor...
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:0
| > mel_fmax:None
| > spec_gain:20.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:45
| > do_sound_norm:False
| > stats_path:None
| > base:10
| > hop_length:256
| > win_length:1024
| > Found 13100 files in /your/dataset/path/ljspeech/LJSpeech-1.1
> Using model: glow_tts
> Model has 28356129 parameters
> EPOCH: 0/1000
> DataLoader initialization
| > Use phonemes: False
| > Number of instances : 12969
| > Max length sequence: 187
| > Min length sequence: 5
| > Avg length sequence: 98.3403500655409
| > Num. instances discarded by max-min (max=500, min=3) seq limits: 0
| > Batch group size: 0.
> TRAINING (2021-06-23 14:52:54)
--> STEP: 0/405 -- GLOBAL_STEP: 0
| > loss: 2.34670
| > log_mle: 1.61872
| > loss_dur: 0.72798
| > align_error: 0.52744
| > current_lr: 2.5e-07
| > grad_norm: 5.036039352416992
| > step_time: 5.8815
| > loader_time: 0.0065
...
```
5. Run the Tensorboard.
```bash
$ tensorboard --logdir=<path to your training directory>
```
6. Check the logs and the Tensorboard and monitor the training.
On the terminal and Tensorboard, you can monitor the losses and their changes over time. Also Tensorboard provides certain figures and sample outputs.
Note that different models have different metrics, visuals and outputs to be displayed.
You should also check the [FAQ page](https://github.com/coqui-ai/TTS/wiki/FAQ) for common problems and solutions
that occur during training.
7. Use your best model for inference.
Use `tts` or `tts-server` commands for testing your models.
```bash
$ tts --text "Text for TTS" \
--model_path path/to/checkpoint_x.pth.tar \
--config_path path/to/config.json \
--out_path folder/to/save/output.wav
```
8. Return to step 1 and repeat the process to train a `vocoder` model.
In the example above, we trained a `GlowTTS` model, but the same workflow applies to all the other 🐸TTS models.

View File

@ -0,0 +1,16 @@
# TTS Datasets
Some of the known public datasets that we successfully applied 🐸TTS:
- [English - LJ Speech](https://keithito.com/LJ-Speech-Dataset/)
- [English - Nancy](http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/)
- [English - TWEB](https://www.kaggle.com/bryanpark/the-world-english-bible-speech-dataset)
- [English - LibriTTS](https://openslr.org/60/)
- [English - VCTK](https://datashare.ed.ac.uk/handle/10283/2950)
- [Multilingual - M-AI-Labs](http://www.caito.de/2019/01/the-m-ailabs-speech-dataset/)
- [Spanish](https://drive.google.com/file/d/1Sm_zyBo67XHkiFhcRSQ4YaHPYM0slO_e/view?usp=sharing) - thx! @carlfm01
- [German - Thorsten OGVD](https://github.com/thorstenMueller/deep-learning-german-tts)
- [Japanese - Kokoro](https://www.kaggle.com/kaiida/kokoro-speech-dataset-v11-small/version/1)
- [Chinese](https://www.data-baker.com/open_source.html)
Let us know if you use 🐸TTS on a different dataset.

View File

@ -0,0 +1,175 @@
# Tutorial For Nervous Beginners
## Installation
User friendly installation. Recommended only for synthesizing voice.
```bash
$ pip install TTS
```
Developer friendly installation.
```bash
$ git clone https://github.com/coqui-ai/TTS
$ cd TTS
$ pip install -e .
```
## Training a `tts` Model
A breakdown of a simple script that trains a GlowTTS model on the LJSpeech dataset. See the comments for the explanation of
each line.
### Pure Python Way
```python
import os
# GlowTTSConfig: all model related values for training, validating and testing.
from TTS.tts.configs import GlowTTSConfig
# BaseDatasetConfig: defines name, formatter and path of the dataset.
from TTS.tts.configs import BaseDatasetConfig
# init_training: Initialize and setup the training environment.
# Trainer: Where the ✨️ happens.
# TrainingArgs: Defines the set of arguments of the Trainer.
from TTS.trainer import init_training, Trainer, TrainingArgs
# we use the same path as this script as our training folder.
output_path = os.path.dirname(os.path.abspath(__file__))
# set LJSpeech as our target dataset and define its path so that the Trainer knows what data formatter it needs.
dataset_config = BaseDatasetConfig(name="ljspeech", meta_file_train="metadata.csv", path=os.path.join(output_path, "../LJSpeech-1.1/"))
# Configure the model. Every config class inherits the BaseTTSConfig to have all the fields defined for the Trainer.
config = GlowTTSConfig(
batch_size=32,
eval_batch_size=16,
num_loader_workers=4,
num_eval_loader_workers=4,
run_eval=True,
test_delay_epochs=-1,
epochs=1000,
text_cleaner="english_cleaners",
use_phonemes=False,
phoneme_language="en-us",
phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
print_step=25,
print_eval=True,
mixed_precision=False,
output_path=output_path,
datasets=[dataset_config]
)
# Take the config and the default Trainer arguments, setup the training environment and override the existing
# config values from the terminal. So you can do the following.
# >>> python train.py --coqpit.batch_size 128
args, config, output_path, _, _, _= init_training(TrainingArgs(), config)
# Initiate the Trainer.
# Trainer provides a generic API to train all the 🐸TTS models with all its perks like mixed-precision training,
# distributed training etc.
trainer = Trainer(args, config, output_path)
# And kick it 🚀
trainer.fit()
```
### CLI Way
We still support running training from CLI like in the old days. The same training can be started as follows.
1. Define your `config.json`
```json
{
"model": "glow_tts",
"batch_size": 32,
"eval_batch_size": 16,
"num_loader_workers": 4,
"num_eval_loader_workers": 4,
"run_eval": true,
"test_delay_epochs": -1,
"epochs": 1000,
"text_cleaner": "english_cleaners",
"use_phonemes": false,
"phoneme_language": "en-us",
"phoneme_cache_path": "phoneme_cache",
"print_step": 25,
"print_eval": true,
"mixed_precision": false,
"output_path": "recipes/ljspeech/glow_tts/",
"datasets":[{"name": "ljspeech", "meta_file_train":"metadata.csv", "path": "recipes/ljspeech/LJSpeech-1.1/"}]
}
```
2. Start training.
```bash
$ CUDA_VISIBLE_DEVICES="0" python TTS/bin/train_tts.py --config_path config.json
```
## Training a `vocoder` Model
```python
import os
from TTS.vocoder.configs import HifiganConfig
from TTS.trainer import init_training, Trainer, TrainingArgs
output_path = os.path.dirname(os.path.abspath(__file__))
config = HifiganConfig(
batch_size=32,
eval_batch_size=16,
num_loader_workers=4,
num_eval_loader_workers=4,
run_eval=True,
test_delay_epochs=-1,
epochs=1000,
seq_len=8192,
pad_short=2000,
use_noise_augment=True,
eval_split_size=10,
print_step=25,
print_eval=True,
mixed_precision=False,
lr_gen=1e-4,
lr_disc=1e-4,
# `vocoder` models only need a data path; they recursively read all the `.wav` files underneath.
data_path=os.path.join(output_path, "../LJSpeech-1.1/wavs/"),
output_path=output_path,
)
args, config, output_path, _, c_logger, tb_logger = init_training(TrainingArgs(), config)
trainer = Trainer(args, config, output_path, c_logger, tb_logger)
trainer.fit()
```
❗️ Note that you can also start the training run from the CLI, as with the `tts` model above.
## Synthesizing Speech
You can run `tts` and synthesize speech directly on the terminal.
```bash
$ tts -h # see the help
$ tts --list_models # list the available models.
```
![cli.gif](https://github.com/coqui-ai/TTS/raw/main/images/tts_cli.gif)
You can call `tts-server` to start a local demo server that you can open in
your favorite web browser and 🗣️.
```bash
$ tts-server -h # see the help
$ tts-server --list_models # list the available models.
```
![server.gif](https://github.com/coqui-ai/TTS/raw/main/images/demo_server.gif)

View File

@ -0,0 +1,19 @@
# What makes a good TTS dataset
## What Makes a Good Dataset
* **Gaussian like distribution on clip and text lengths**. So plot the distribution of clip lengths and check if it covers enough short and long voice clips.
* **Mistake free**. Remove any wrong or broken files. Check annotations, compare transcript and audio length.
* **Noise free**. Background noise might lead your model to struggle, especially at learning a good alignment. Even if it learns the alignment, the final result is likely to be suboptimal.
* **Compatible tone and pitch among voice clips**. For instance, if you are using audiobook recordings for your project, it might have impersonations for different characters in the book. These differences between samples downgrade the model performance.
* **Good phoneme coverage**. Make sure that your dataset covers a good portion of the phonemes, di-phonemes, and in some languages tri-phonemes.
* **Naturalness of recordings**. For your model WISIAIL (What it sees is all it learns). Therefore, your dataset should accommodate all the attributes you want to hear from your model.
## Preprocessing Dataset
If you want to use a bespoke dataset, you might want to perform a couple of quality checks before training. 🐸TTS provides a couple of notebooks (CheckSpectrograms, AnalyzeDataset) to expedite this part for you.
* **AnalyzeDataset** is for checking dataset distribution in terms of clip and transcript lengths. It is good to find outlier instances (too long, short text but long voice clip, etc.) and remove them before training. Keep in mind that we like to have a good balance between long and short clips to prevent any bias in training. If you have only short clips (1-3 secs), then your model might suffer with long sentences, and if your instances are long, then it might not learn the alignment or might take too long to train.
* **CheckSpectrograms** is to measure the noise level of the clips and find good audio processing parameters. The noise level can be observed by checking the spectrograms. If the spectrograms look cluttered, especially in silent parts, this dataset might not be a good candidate for a TTS project. If your voice clips are too noisy in the background, it makes things harder for your model to learn the alignment, and the final result might be different from the voice you are given.
If the spectrograms look good, then the next step is to find a good set of audio processing parameters, defined in ```config.json```. In the notebook, you can compare different sets of parameters and see the resynthesis results in relation to the given ground-truth. Find the best parameters that give the best possible synthesis performance.
Another practical detail is the quantization level of the clips. If your dataset has a very high bit-rate, that might cause slow data-load time and consequently slow training. It is better to reduce the sample-rate of your dataset to around 16000-22050.
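If you need to downsample your clips, a minimal sketch using `librosa` and `soundfile` (common audio libraries, not prescribed by 🐸TTS) could look like this:
```python
# Hedged sketch: resample every clip under `wavs/` to 22050 Hz in place.
import glob

import librosa
import soundfile as sf

target_sr = 22050
for path in glob.glob("wavs/*.wav"):
    wav, _ = librosa.load(path, sr=target_sr)  # load and resample in one step
    sf.write(path, wav, target_sr)             # overwrite with the resampled clip
```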