typos and minor fixes (#2508)

* Update tacotron1-2.md

* Update README.md

* Update Tutorial_2_train_your_first_TTS_model.ipynb

* Update synthesizer.py

There is no arg called --speaker_name

* Update formatting_your_dataset.md

* Update AnalyzeDataset.ipynb

* Update AnalyzeDataset.ipynb

* Update AnalyzeDataset.ipynb

* Update finetuning.md

* Update train_yourtts.py

* Update train_yourtts.py

* Update train_yourtts.py

* Update finetuning.md
prakharpbuf 2023-04-26 09:22:57 -04:00 committed by GitHub
parent 2071088bab
commit c1875f68df
8 changed files with 18 additions and 18 deletions

README.md

@@ -312,7 +312,7 @@ tts.tts_to_file(text="This is a test.", file_path=OUTPUT_PATH, emotion="Happy",
 #### Multi-speaker Models
-- List the available speakers and choose as <speaker_id> among them:
+- List the available speakers and choose a <speaker_id> among them:
 ```
 $ tts --model_name "<language>/<dataset>/<model_name>" --list_speaker_idxs
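
For context, the same speaker selection can be done from Python; a minimal sketch, assuming the `TTS.api.TTS` wrapper, with an example model name and output path that are not part of this change:

```python
# Sketch: list the <speaker_id> values of a multi-speaker model and synthesize with one.
# "tts_models/en/vctk/vits" and "output.wav" are example values only.
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/vctk/vits")
print(tts.speakers)  # same ids that `tts ... --list_speaker_idxs` prints
tts.tts_to_file(
    text="This is a test.",
    speaker=tts.speakers[0],  # any listed <speaker_id>
    file_path="output.wav",
)
```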

synthesizer.py

@@ -269,8 +269,8 @@ class Synthesizer(object):
 elif not speaker_name and not speaker_wav:
     raise ValueError(
-        " [!] Look like you use a multi-speaker model. "
-        "You need to define either a `speaker_name` or a `speaker_wav` to use a multi-speaker model."
+        " [!] Looks like you are using a multi-speaker model. "
+        "You need to define either a `speaker_idx` or a `speaker_wav` to use a multi-speaker model."
     )
 else:
     speaker_embedding = None
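
As the corrected message notes, the alternative to a speaker id is a reference clip. A hedged sketch of that `speaker_wav` route through the high-level API, with a placeholder reference file:

```python
# Sketch: satisfy the check above with a reference clip instead of a speaker id.
# "my_reference.wav" is a placeholder; YourTTS also expects a `language` argument.
from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")
tts.tts_to_file(
    text="Voice cloned from a reference clip.",
    speaker_wav="my_reference.wav",
    language="en",
    file_path="cloned.wav",
)
```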

finetuning.md

@@ -11,7 +11,7 @@ them and fine-tune it for your own dataset. This will help you in two main ways:
     Since a pre-trained model has already learned features that are relevant for the task, it will converge faster on
     a new dataset. This will reduce the cost of training and let you experiment faster.
-2. Better resutls with small datasets
+2. Better results with small datasets
     Deep learning models are data hungry and they give better performance with more data. However, it is not always
     possible to have this abundance, especially in specific domains. For instance, the LJSpeech dataset, that we released most of
@@ -19,7 +19,7 @@ them and fine-tune it for your own dataset. This will help you in two main ways:
     the help of a voice actor.
     Fine-tuning comes to the rescue in this case. You can take one of our pre-trained models and fine-tune it on your own
-    speech dataset and achive reasonable results with only a couple of hours of data.
+    speech dataset and achieve reasonable results with only a couple of hours of data.
     However, note that, fine-tuning does not ensure great results. The model performance is still depends on the
     {ref}`dataset quality <what_makes_a_good_dataset>` and the hyper-parameters you choose for fine-tuning. Therefore,
@@ -35,7 +35,7 @@ them and fine-tune it for your own dataset. This will help you in two main ways:
 2. Choose the model you want to fine-tune.
-    You can list the availabe models in the command line with
+    You can list the available models in the command line with
     ```bash
     tts --list_models
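
The listing can also be done from Python; a small sketch, assuming `ModelManager` (the class behind the CLI) keeps its `list_models()` method:

```python
# Sketch: Python counterpart of `tts --list_models` (assumed ModelManager API).
from TTS.utils.manage import ModelManager

manager = ModelManager()
model_names = manager.list_models()  # prints the catalog and returns the model names
print([name for name in model_names if name.startswith("tts_models/en")])
```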

formatting_your_dataset.md

@@ -125,4 +125,4 @@ See `TTS.tts.datasets.TTSDataset`, a generic `Dataset` implementation for the `tts` models.
 See `TTS.vocoder.datasets.*`, for different `Dataset` implementations for the `vocoder` models.
 See `TTS.utils.audio.AudioProcessor` that includes all the audio processing and feature extraction functions used in a
-`Dataset` implementation. Feel free to add things as you need.passed
+`Dataset` implementation. Feel free to add things as you need.
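
A short sketch of the `AudioProcessor` usage referred to above; the config path and wav file are placeholders, not files shipped with 🐸TTS:

```python
# Sketch: run the audio/feature pipeline a Dataset implementation would use.
from TTS.config import load_config
from TTS.utils.audio import AudioProcessor

config = load_config("config.json")          # placeholder: any 🐸TTS model config
ap = AudioProcessor.init_from_config(config)
wav = ap.load_wav("sample.wav")              # placeholder wav, resampled to the configured rate
mel = ap.melspectrogram(wav)                 # the features a TTSDataset item carries
print(mel.shape)
```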

tacotron1-2.md

@@ -12,7 +12,7 @@ Vanilla Tacotron models are slow at inference due to the auto-regressive* nature
 Tacotron also uses a Prenet module with Dropout that projects the models previous output before feeding it to the decoder again. The paper and most of the implementations use the Dropout layer even in inference and they report the attention fails or the voice quality degrades otherwise. But the issue with that, you get a slightly different output speech every time you run the model.
-Tsraining the attention is notoriously problematic in Tacoron models. Especially, in inference, for some input sequences, the alignment fails and causes the model to produce unexpected results. There are many different methods proposed to improve the attention.
+Training the attention is notoriously problematic in Tacoron models. Especially, in inference, for some input sequences, the alignment fails and causes the model to produce unexpected results. There are many different methods proposed to improve the attention.
 After hundreds of experiments, @ 🐸TTS we suggest Double Decoder Consistency that leads to the most robust model performance.
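
For reference, a hedged sketch of enabling Double Decoder Consistency on a Tacotron2 config; the field names and the reduction factor are assumptions based on the 🐸TTS Tacotron configs:

```python
# Sketch: turn on Double Decoder Consistency (DDC) for Tacotron2 (assumed field names).
from TTS.tts.configs.tacotron2_config import Tacotron2Config

config = Tacotron2Config(
    double_decoder_consistency=True,  # add the coarse second decoder
    ddc_r=6,                          # reduction factor of the coarse decoder (example value)
)
```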

Tutorial_2_train_your_first_TTS_model.ipynb

@@ -44,7 +44,7 @@
    "\n",
    "### **First things first**: we need some data.\n",
    "\n",
-   "We're training a Text-to-Speech model, so we need some _text_ and we need some _speech_. Specificially, we want _transcribed speech_. The speech must be divided into audio clips and each clip needs transcription. More details about data requirements such as recording characteristics, background noise abd vocabulary coverage can be found in the [🐸TTS documentation](https://tts.readthedocs.io/en/latest/formatting_your_dataset.html).\n",
+   "We're training a Text-to-Speech model, so we need some _text_ and we need some _speech_. Specificially, we want _transcribed speech_. The speech must be divided into audio clips and each clip needs transcription. More details about data requirements such as recording characteristics, background noise and vocabulary coverage can be found in the [🐸TTS documentation](https://tts.readthedocs.io/en/latest/formatting_your_dataset.html).\n",
    "\n",
    "If you have a single audio file and you need to **split** it into clips. It is also important to use a lossless audio file format to prevent compression artifacts. We recommend using **wav** file format.\n",
    "\n",

AnalyzeDataset.ipynb

@@ -45,7 +45,7 @@
    "source": [
    "NUM_PROC = 8\n",
    "DATASET_CONFIG = BaseDatasetConfig(\n",
-   "    name=\"ljspeech\", meta_file_train=\"metadata.csv\", path=\"/absolute/path/to/your/dataset/\"\n",
+   "    formatter=\"ljspeech\", meta_file_train=\"metadata.csv\", path=\"/absolute/path/to/your/dataset/\"\n",
    ")"
    ]
   },
@@ -64,7 +64,7 @@
    "        cols = line.split(\"|\")\n",
    "        wav_file = os.path.join(root_path, \"wavs\", cols[0] + \".wav\") \n",
    "        text = cols[1]\n",
-   "        items.append({\"text\": text, \"audio_file\": wav_file, \"speaker_name\": speaker_name})\n",
+   "        items.append({\"text\": text, \"audio_file\": wav_file, \"speaker_name\": speaker_name, \"root_path\": root_path})\n",
    "    return items"
    ]
   },
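
Putting the `root_path` change in context, a complete formatter might look like the sketch below, assuming an LJSpeech-style `metadata.csv` of `file|text` rows and a single placeholder speaker name:

```python
# Sketch of a custom formatter returning the item dicts 🐸TTS loaders expect,
# including the "root_path" key added above. "my_speaker" is a placeholder.
import os

def formatter(root_path, meta_file, **kwargs):
    items = []
    speaker_name = "my_speaker"
    with open(os.path.join(root_path, meta_file), "r", encoding="utf-8") as f:
        for line in f:
            cols = line.strip().split("|")
            wav_file = os.path.join(root_path, "wavs", cols[0] + ".wav")
            items.append(
                {"text": cols[1], "audio_file": wav_file, "speaker_name": speaker_name, "root_path": root_path}
            )
    return items
```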

train_yourtts.py

@@ -32,7 +32,7 @@ OUT_PATH = os.path.dirname(os.path.abspath(__file__)) # "/raid/coqui/Checkpoint
 # If you want to do transfer learning and speedup your training you can set here the path to the original YourTTS model
 RESTORE_PATH = None # "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth"
-# This paramter is usefull to debug, it skips the training epochs and just do the evaluation and produce the test sentences
+# This paramter is useful to debug, it skips the training epochs and just do the evaluation and produce the test sentences
 SKIP_TRAIN_EPOCH = False
 # Set here the batch size to be used in training and evaluation
@@ -78,7 +78,7 @@ vctk_config = BaseDatasetConfig(
     ], # Ignore the test speakers to full replicate the paper experiment
 )
-# Add here all datasets configs, in our case we just want to train with the VCTK dataset then we need to add just VCTK. Note: If you want to added new datasets just added they here and it will automatically compute the speaker embeddings (d-vectors) for this new dataset :)
+# Add here all datasets configs, in our case we just want to train with the VCTK dataset then we need to add just VCTK. Note: If you want to add new datasets, just add them here and it will automatically compute the speaker embeddings (d-vectors) for this new dataset :)
 DATASETS_CONFIG_LIST = [vctk_config]
 ### Extract speaker embeddings
@@ -123,7 +123,7 @@ audio_config = VitsAudioConfig(
     num_mels=80,
 )
-# Init VITSArgs setting the arguments that is needed for the YourTTS model
+# Init VITSArgs setting the arguments that are needed for the YourTTS model
 model_args = VitsArgs(
     d_vector_file=D_VECTOR_FILES,
     use_d_vector_file=True,
@@ -131,15 +131,15 @@ model_args = VitsArgs(
     num_layers_text_encoder=10,
     speaker_encoder_model_path=SPEAKER_ENCODER_CHECKPOINT_PATH,
     speaker_encoder_config_path=SPEAKER_ENCODER_CONFIG_PATH,
-    resblock_type_decoder="2", # On the paper, we accidentally trained the YourTTS using ResNet blocks type 2, if you like you can use the ResNet blocks type 1 like the VITS model
-    # Usefull parameters to enable the Speaker Consistency Loss (SCL) discribed in the paper
+    resblock_type_decoder="2", # In the paper, we accidentally trained the YourTTS using ResNet blocks type 2, if you like you can use the ResNet blocks type 1 like the VITS model
+    # Useful parameters to enable the Speaker Consistency Loss (SCL) described in the paper
     # use_speaker_encoder_as_loss=True,
-    # Usefull parameters to the enable multilingual training
+    # Useful parameters to enable multilingual training
     # use_language_embedding=True,
     # embedded_language_dim=4,
 )
-# General training config, here you can change the batch size and others usefull parameters
+# General training config, here you can change the batch size and others useful parameters
 config = VitsConfig(
     output_path=OUT_PATH,
     model_args=model_args,
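
For completeness, a sketch of the same `VitsArgs` block with the commented-out options above switched on; all paths are placeholders and the values mirror the recipe's own comments:

```python
# Sketch: the recipe's VitsArgs with Speaker Consistency Loss and multilingual
# training enabled (the options commented out above). Paths are placeholders.
from TTS.tts.models.vits import VitsArgs

model_args = VitsArgs(
    d_vector_file=["path/to/speakers_d_vectors.pth"],
    use_d_vector_file=True,
    num_layers_text_encoder=10,
    speaker_encoder_model_path="path/to/SE_checkpoint.pth.tar",
    speaker_encoder_config_path="path/to/SE_config.json",
    resblock_type_decoder="2",
    use_speaker_encoder_as_loss=True,  # Speaker Consistency Loss (SCL) from the paper
    use_language_embedding=True,       # multilingual training
    embedded_language_dim=4,
)
```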