mimic2/TRAINING_DATA.md

# Training Data


This repo supports the following speech datasets:
  * [LJ Speech](https://keithito.com/LJ-Speech-Dataset/) (Public Domain)
  * [Blizzard 2012](http://www.cstr.ed.ac.uk/projects/blizzard/2012/phase_one) (Creative Commons Attribution Share-Alike)

You can use any other dataset if you write a preprocessor for it.


### Writing a Preprocessor

Each training example consists of:
  1. The text that was spoken
  2. A mel-scale spectrogram of the audio
  3. A linear-scale spectrogram of the audio

The preprocessor is responsible for generating these. See [ljspeech.py](datasets/ljspeech.py) for a
commented example.

For each training example, a preprocessor should:

  1. Load the audio file:
     ```python
     wav = audio.load_wav(wav_path)
     ```

  2. Compute linear-scale and mel-scale spectrograms (float32 numpy arrays):
     ```python
     spectrogram = audio.spectrogram(wav).astype(np.float32)
     mel_spectrogram = audio.melspectrogram(wav).astype(np.float32)
     ```

  3. Save the spectrograms to disk:
     ```python
     np.save(os.path.join(out_dir, spectrogram_filename), spectrogram.T, allow_pickle=False)
     np.save(os.path.join(out_dir, mel_spectrogram_filename), mel_spectrogram.T, allow_pickle=False)
     ```
     Note that the transposes of the matrices returned by `audio.spectrogram` and
     `audio.melspectrogram` are saved, so the arrays on disk are in time-major format
     (time is the first axis).

  4. Generate a tuple `(spectrogram_filename, mel_spectrogram_filename, n_frames, text)` to
     write to `train.txt`. `n_frames` is the length of the time axis of the spectrogram, that is,
     the number of frames.
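
Step 4 can be sketched as follows. This is a minimal illustration, not the repo's actual code:
the helper name `write_metadata` and the pipe-delimited line layout are assumptions, so check
[preprocess.py](preprocess.py) for the format the training code really expects.

```python
import os


def write_metadata(metadata, out_dir, filename="train.txt"):
    """Write one line per training example and return the total frame count.

    `metadata` is a list of
    (spectrogram_filename, mel_spectrogram_filename, n_frames, text) tuples.
    The '|' separator is an assumption made for this sketch.
    """
    path = os.path.join(out_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        for example in metadata:
            f.write("|".join(str(field) for field in example) + "\n")
    # n_frames is the third field of every tuple.
    return sum(example[2] for example in metadata)
```

With this layout, a transcript that itself contains `|` would produce an ambiguous line, so such
characters would need to be removed or escaped first.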


After you've written your preprocessor, you can add it to [preprocess.py](preprocess.py) by
following the example of the other preprocessors in that file.
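
As a rough idea of what a dataset-level entry point might look like, here is a hedged sketch.
Everything in it is hypothetical: the function name `build_from_path`, the `metadata.csv` file
with `id|...|text` lines, and the `wavs/` folder are assumptions loosely modeled on the LJ Speech
layout, and `process_utterance` stands for a function that performs steps 1 to 4 above for one
example.

```python
import os


def build_from_path(in_dir, out_dir, process_utterance):
    """Call `process_utterance` for every entry of a pipe-delimited metadata file.

    Assumed (hypothetical) layout: in_dir/metadata.csv holds lines of the form
    `id|...|text`, and the audio for each id lives in in_dir/wavs/<id>.wav.
    `process_utterance(out_dir, index, wav_path, text)` must return the tuple
    (spectrogram_filename, mel_spectrogram_filename, n_frames, text).
    """
    metadata = []
    with open(os.path.join(in_dir, "metadata.csv"), encoding="utf-8") as f:
        # index is the 1-based line number, so skipped lines leave gaps.
        for index, line in enumerate(f, start=1):
            parts = line.strip().split("|")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            wav_path = os.path.join(in_dir, "wavs", parts[0] + ".wav")
            # The last field is taken as the text to train on.
            metadata.append(process_utterance(out_dir, index, wav_path, parts[-1]))
    return metadata
```

Passing the per-example function in as an argument keeps the file walking separate from the
audio processing, which makes the walker easy to try out with a stub before any audio is read.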


### Non-English Data

If your training data is in a language other than English, you will probably want to change the
text cleaners by setting the `cleaners` hyperparameter.

  * If your text is in a Latin script or can be transliterated to ASCII using the
    [Unidecode](https://pypi.python.org/pypi/Unidecode) library, you can use the transliteration
    cleaners by setting the hyperparameter `cleaners=transliteration_cleaners`.

  * If you don't want to transliterate, you can define a custom character set.
    This allows you to train directly on the character set used in your data.

    To do so, edit [symbols.py](text/symbols.py) and change the `_characters` variable to be a
    string containing the characters used in your data. Then set the hyperparameter
    `cleaners=basic_cleaners`.

  * If you're not sure which option to use, you can evaluate the transliteration cleaners like this:

    ```python
    from text import cleaners
    # Replace the string with the text you want to try
    print(cleaners.transliteration_cleaners('Здравствуйте'))
    ```
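
If you go the custom character set route, a small helper can list which characters your
transcripts actually contain. The sketch below uses only the standard library; the function names
`collect_characters` and `non_ascii_characters` are hypothetical and not part of this repo.

```python
def collect_characters(transcripts):
    """Return the distinct characters in `transcripts` as one sorted string."""
    return "".join(sorted(set("".join(transcripts))))


def non_ascii_characters(characters):
    """Return only the characters that fall outside the ASCII range."""
    return "".join(c for c in characters if ord(c) > 127)
```

The output of `collect_characters` could serve as a starting point for `_characters`; review it
before pasting, since stray symbols in the transcripts will show up in it too. An empty result
from `non_ascii_characters` suggests the text is already ASCII and needs no transliteration.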