
(formatting_your_dataset)=
# Formatting Your Dataset

To train a TTS model, you need a dataset of speech recordings and their transcriptions. The speech must be divided into audio clips, and each clip needs a transcription.

If you have a single audio file that you need to split into clips, several open-source tools can help. We recommend Audacity, a free and open-source audio editor.

It is also important to use a lossless audio file format to prevent compression artifacts. We recommend the `wav` file format.

Let's assume you created the audio clips and their transcription. You can collect all your clips in a folder. Let's call this folder `wavs`.

```
/wavs
  | - audio1.wav
  | - audio2.wav
  | - audio3.wav
  ...
```

You can either create a separate transcription file for each clip or create a single text file that maps each audio clip to its transcription. In this file, the columns hold the audio file name, the transcription, and the normalized transcription, separated by a delimiter character. Make sure that the delimiter does not appear in the transcription text.

We recommend the following format delimited by `|`. In the following example, `audio1`, `audio2` refer to files `audio1.wav`, `audio2.wav` etc.

```
# metadata.txt

audio1|This is my sentence.|This is my sentence.
audio2|1469 and 1470|fourteen sixty-nine and fourteen seventy
audio3|It'll be $16 sir.|It'll be sixteen dollars sir.
...
```
*If you don't have normalized transcriptions, you can use the same transcription for both columns. In that case, we recommend applying normalization later in the pipeline, either in the text cleaner or in the phonemizer.*
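
The column rules above can be sketched in a few lines of Python. This is only an illustration and not part of 🐸TTS: the helper names `parse_metadata_line` and `format_metadata_line` are hypothetical, and treating a two-column line as "same transcription for both columns" is an assumption based on the note above.

```python
from typing import Tuple

DELIMITER = "|"  # the delimiter recommended above


def parse_metadata_line(line: str) -> Tuple[str, str, str]:
    """Split one metadata line into (file_id, transcription, normalized_transcription)."""
    cols = line.rstrip("\n").split(DELIMITER)
    if len(cols) == 2:
        # Assumption: no normalized column means the raw transcription is reused.
        cols.append(cols[1])
    if len(cols) != 3:
        raise ValueError(f"expected 2 or 3 columns, got {len(cols)}: {line!r}")
    file_id, text, normalized = (c.strip() for c in cols)
    if not file_id or not text:
        raise ValueError(f"empty file name or transcription: {line!r}")
    return file_id, text, normalized


def format_metadata_line(file_id: str, text: str, normalized: str = "") -> str:
    """Build one metadata line, refusing fields that contain the delimiter."""
    normalized = normalized or text
    for field in (file_id, text, normalized):
        if DELIMITER in field or "\n" in field:
            raise ValueError(f"field contains the delimiter or a newline: {field!r}")
    return DELIMITER.join((file_id, text, normalized))
```

Running every line of your metadata file through a check like this before training catches stray delimiters early, when they are still cheap to fix.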


In the end, we have the following folder structure:
```
/MyTTSDataset
      |
      | -> metadata.txt
      | -> /wavs
              | -> audio1.wav
              | -> audio2.wav
              | ...
```
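
Before training, it can help to confirm that every entry in `metadata.txt` has a matching clip under `wavs`. The sketch below assumes the folder structure shown above and that the first column holds the file name without the `.wav` extension; `find_missing_clips` is a hypothetical helper, not a 🐸TTS function.

```python
from pathlib import Path
from typing import List


def find_missing_clips(dataset_root: str, metadata_name: str = "metadata.txt") -> List[str]:
    """Return the clip IDs listed in the metadata file that have no matching wav file."""
    root = Path(dataset_root)
    missing = []
    with open(root / metadata_name, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            clip_id = line.split("|")[0]
            # Assumption: the metadata stores `audio1`, the file is `wavs/audio1.wav`.
            if not (root / "wavs" / f"{clip_id}.wav").is_file():
                missing.append(clip_id)
    return missing
```

An empty list from `find_missing_clips("/MyTTSDataset")` means every transcription has an audio file.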

The format above is taken from the widely used [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) dataset. You can also download the dataset to see it for yourself. 🐸TTS already provides tooling for LJSpeech, so if you use the same format, you can start training your models right away.

## Dataset Quality

Your dataset should have good coverage of the target language. It should cover the phonemic variety, exceptional sounds, and syllables. This is especially important for non-phonemic languages like English.

For more info about dataset qualities and properties check our [post](https://github.com/coqui-ai/TTS/wiki/What-makes-a-good-TTS-dataset).
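
As a rough first look at coverage, you can count how often each character appears in your transcriptions. This is only a proxy: real phonemic coverage would have to be measured on phonemes rather than characters, which this sketch does not attempt. The function name `rare_characters` and the default threshold are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rare_characters(transcripts: Iterable[str], min_count: int = 5) -> List[Tuple[str, int]]:
    """Return (character, count) pairs seen fewer than `min_count` times, rarest first."""
    counts = Counter()
    for text in transcripts:
        # Case-insensitive count; whitespace is ignored.
        counts.update(ch for ch in text.lower() if not ch.isspace())
    return sorted(
        ((ch, n) for ch, n in counts.items() if n < min_count),
        key=lambda item: (item[1], item[0]),
    )
```

Characters that show up in this list are candidates for recording additional sentences.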

## Using Your Dataset in 🐸TTS

After you collect and format your dataset, you need to check two things: whether you need a `formatter` and whether you need a `text_cleaner`. The `formatter` loads the text file (created above) as a list, and the `text_cleaner` performs a sequence of text normalization operations that convert the raw text into its spoken representation (e.g. converting numbers, acronyms, and symbols to their spoken form).

If your dataset format differs from LJSpeech and the other public datasets that 🐸TTS supports, you need to write your own `formatter`.

If your dataset is in a new language or it needs special normalization steps, then you need a new `text_cleaner`.
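
To make the idea of a text cleaner concrete, here is a toy example. It is not one of the cleaners that ship with 🐸TTS, and how a custom cleaner is plugged into the library is not shown here. The abbreviation table and the name `my_text_cleaner` are assumptions made purely for illustration.

```python
# Hypothetical abbreviation table; extend it for your own data.
_ABBREVIATIONS = {"mr.": "mister", "dr.": "doctor", "st.": "saint"}


def my_text_cleaner(text: str) -> str:
    """Lowercase, spell out `&`, expand a few abbreviations, and collapse whitespace."""
    text = text.lower().replace("&", " and ")
    # str.split() with no arguments also drops repeated and leading/trailing whitespace.
    words = [_ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)
```

A real cleaner for a new language would also need to handle numbers, dates, currency, and similar cases that this sketch leaves out.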

What you get out of a `formatter` is a `List[Dict]` in the following format.

```
>>> formatter(metafile_path)
[
    {"audio_file":"audio1.wav", "text":"This is my sentence.", "speaker_name":"MyDataset", "language": "lang_code"},
    {"audio_file":"audio2.wav", "text":"This is maybe a sentence.", "speaker_name":"MyDataset", "language": "lang_code"},
    ...
]
```

Each dictionary holds the audio file path (`audio_file`), the transcription (`text`), the speaker name (`speaker_name`), and the language code (`language`).
`speaker_name` is the dataset name for single-speaker datasets; it is mainly used
in multi-speaker models to map each sample to its speaker. For now, we focus only on single-speaker datasets.

The purpose of a `formatter` is to parse your manifest file and load the audio file paths and transcriptions.
Then, the output is passed to the `Dataset`. It computes features from the audio signals, calls text normalization routines, and converts raw text to
phonemes if needed.
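
When you write your own `formatter`, a quick sanity check on its output can save a failed training run. The sketch below only checks the keys shown in the example output above; `check_samples` is a hypothetical helper, and the set of keys a given model needs may be larger.

```python
from typing import Dict, List

# Keys taken from the example formatter output above.
REQUIRED_KEYS = ("audio_file", "text", "speaker_name")


def check_samples(samples: List[Dict]) -> List[str]:
    """Return a human-readable list of problems found in formatter output."""
    problems = []
    for i, sample in enumerate(samples):
        for key in REQUIRED_KEYS:
            if not sample.get(key):
                problems.append(f"sample {i}: missing or empty '{key}'")
    return problems
```

An empty list means every sample carries a non-empty audio path, transcription, and speaker name.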

## Loading your dataset

Load one of the datasets supported by 🐸TTS.

```python
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples


# dataset config for one of the pre-defined datasets
dataset_config = BaseDatasetConfig(
    formatter="vctk", meta_file_train="", language="en-us", path="dataset-path"
)

# load training samples
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)
```

Load a custom dataset with a custom formatter.

```python
import os

from TTS.tts.datasets import load_tts_samples


# custom formatter implementation
def formatter(root_path, manifest_file, **kwargs):  # pylint: disable=unused-argument
    """Assumes each line is formatted as `<filename>|<transcription>`."""
    txt_file = os.path.join(root_path, manifest_file)
    items = []
    speaker_name = "my_speaker"
    with open(txt_file, "r", encoding="utf-8") as ttf:
        for line in ttf:
            # strip the trailing newline so it does not end up in the text
            cols = line.strip().split("|")
            # cols[0] must be the full file name as stored under `wavs`
            wav_file = os.path.join(root_path, "wavs", cols[0])
            text = cols[1]
            items.append({"text": text, "audio_file": wav_file, "speaker_name": speaker_name, "root_path": root_path})
    return items


# load training samples
# `dataset_config` is a `BaseDatasetConfig` like the one in the previous example,
# with `path` and `meta_file_train` pointing at your dataset
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True, formatter=formatter)
```

See `TTS.tts.datasets.TTSDataset`, a generic `Dataset` implementation for the `tts` models.

See `TTS.vocoder.datasets.*`, for different `Dataset` implementations for the `vocoder` models.

See `TTS.utils.audio.AudioProcessor`, which includes all the audio processing and feature extraction functions used in a
`Dataset` implementation. Feel free to add things as you need.