mimic2/TRAINING_DATA.md

# Training Data


This repo supports the following speech datasets:
  * [LJ Speech](https://keithito.com/LJ-Speech-Dataset/) (Public Domain)
  * [Blizzard 2012](http://www.cstr.ed.ac.uk/projects/blizzard/2012/phase_one) (Creative Commons Attribution Share-Alike)

You can use any other dataset if you write a preprocessor for it.


### Writing a Preprocessor

Each training example consists of:
  1. The text that was spoken
  2. A mel-scale spectrogram of the audio
  3. A linear-scale spectrogram of the audio

The preprocessor is responsible for generating these. See [ljspeech.py](datasets/ljspeech.py) for a
commented example.

For each training example, a preprocessor should:

  1. Load the audio file:
     ```python
     wav = audio.load_wav(wav_path)
     ```

  2. Compute linear-scale and mel-scale spectrograms (float32 numpy arrays):
     ```python
     spectrogram = audio.spectrogram(wav).astype(np.float32)
     mel_spectrogram = audio.melspectrogram(wav).astype(np.float32)
     ```

  3. Save the spectrograms to disk:
     ```python
     np.save(os.path.join(out_dir, spectrogram_filename), spectrogram.T, allow_pickle=False)
     np.save(os.path.join(out_dir, mel_spectrogram_filename), mel_spectrogram.T,  allow_pickle=False)
     ```
     Note that the transpose of the matrix returned by `audio.spectrogram` is saved so that it's
     in time-major format.

  4. Generate a tuple `(spectrogram_filename, mel_spectrogram_filename, n_frames, text)` to
     write to train.txt. n_frames is just the length of the time axis of the spectrogram.


After you've written your preprocessor, you can add it to [preprocess.py](preprocess.py) by
following the example of the other preprocessors in that file.


### Non-English Data

If your training data is in a language other than English, you will probably want to change the
text cleaners by setting the `cleaners` hyperparameter.

  * If your text is in a Latin script or can be transliterated to ASCII using the
    [Unidecode](https://pypi.python.org/pypi/Unidecode) library, you can use the transliteration
    cleaners by setting the hyperparameter `cleaners=transliteration_cleaners`.

  * If you don't want to transliterate, you can define a custom character set.
    This allows you to train directly on the character set used in your data.

    To do so, edit [symbols.py](text/symbols.py) and change the `_characters` variable to be a
    string containing the UTF-8 characters in your data. Then set the hyperparameter `cleaners=basic_cleaners`.

  * If you're not sure which option to use, you can evaluate the transliteration cleaners like this:

    ```python
    from text import cleaners
    cleaners.transliteration_cleaners('Здравствуйте')   # Replace with the text you want to try
    ```