Training Data

This repo supports the following speech datasets:

You can use any other dataset if you write a preprocessor for it.

Writing a Preprocessor

Each training example consists of:

  1. The text that was spoken
  2. A mel-scale spectrogram of the audio
  3. A linear-scale spectrogram of the audio

The preprocessor is responsible for generating these. See ljspeech.py for a commented example.

For each training example, a preprocessor should:

  1. Load the audio file:

    wav = audio.load_wav(wav_path)
    
  2. Compute linear-scale and mel-scale spectrograms (float32 numpy arrays):

    spectrogram = audio.spectrogram(wav).astype(np.float32)
    mel_spectrogram = audio.melspectrogram(wav).astype(np.float32)
    
  3. Save the spectrograms to disk:

    np.save(os.path.join(out_dir, spectrogram_filename), spectrogram.T, allow_pickle=False)
    np.save(os.path.join(out_dir, mel_spectrogram_filename), mel_spectrogram.T, allow_pickle=False)
    

    Note that the transpose of each matrix returned by audio.spectrogram and audio.melspectrogram is saved, so the arrays on disk are in time-major format (frames along the first axis).

  4. Generate a tuple (spectrogram_filename, mel_spectrogram_filename, n_frames, text) to write to train.txt. n_frames is the length of the time axis of the spectrogram, which is its second axis before the transpose in step 3.
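
Steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the code in ljspeech.py: the helper names save_example and write_metadata, the example-spec/example-mel file naming, and the pipe-delimited layout of train.txt are all assumptions made for the example. The spectrograms are taken as arguments so the sketch does not depend on the repo's audio module.

```python
import os
import numpy as np


def save_example(out_dir, index, spectrogram, mel_spectrogram, text):
    """Save one example's spectrograms and return its metadata tuple.

    Both arrays are assumed to be shaped (n_freq, n_frames), as in step 2.
    """
    # The time axis is the second axis before transposing.
    n_frames = spectrogram.shape[1]

    # Hypothetical file naming; any unique names will do.
    spectrogram_filename = 'example-spec-%05d.npy' % index
    mel_spectrogram_filename = 'example-mel-%05d.npy' % index

    # Save the transposes so the arrays on disk are time-major.
    np.save(os.path.join(out_dir, spectrogram_filename),
            spectrogram.T, allow_pickle=False)
    np.save(os.path.join(out_dir, mel_spectrogram_filename),
            mel_spectrogram.T, allow_pickle=False)

    return (spectrogram_filename, mel_spectrogram_filename, n_frames, text)


def write_metadata(metadata, out_dir):
    """Write one line per example to train.txt.

    The '|' delimiter is an assumption; match whatever the training
    code in your checkout expects.
    """
    with open(os.path.join(out_dir, 'train.txt'), 'w', encoding='utf-8') as f:
        for m in metadata:
            f.write('|'.join(str(x) for x in m) + '\n')
```

A preprocessor would call save_example once per audio file, collect the returned tuples in a list, and pass that list to write_metadata at the end.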

After you've written your preprocessor, you can add it to preprocess.py by following the example of the other preprocessors in that file.

Non-English Data

If your training data is in a language other than English, you will probably want to change the text cleaners by setting the cleaners hyperparameter.

  • If your text is in a Latin script or can be transliterated to ASCII using the Unidecode library, you can use the transliteration cleaners by setting the hyperparameter cleaners=transliteration_cleaners.

  • If you don't want to transliterate, you can define a custom character set. This allows you to train directly on the character set used in your data.

    To do so, edit symbols.py and change the _characters variable to a string containing every character that appears in your data. Then set the hyperparameter cleaners=basic_cleaners.

  • If you're not sure which option to use, you can evaluate the transliteration cleaners like this:

    from text import cleaners
    cleaners.transliteration_cleaners('Здравствуйте')   # Replace with the text you want to try
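
If you go with a custom character set, one way to build the string for _characters is to collect every distinct character in your transcripts. The sketch below is only an illustration: collect_characters is a hypothetical helper, not part of this repo, and its lowercase option assumes that basic_cleaners lowercases text before it is mapped to symbols.

```python
def collect_characters(texts, lowercase=True):
    """Return a sorted string of the distinct characters found in texts.

    lowercase=True assumes the cleaner lowercases its input, so uppercase
    symbols would never be used; pass False if that does not hold.
    """
    chars = set()
    for text in texts:
        # Drop line endings so they do not end up in the character set.
        text = text.rstrip('\r\n')
        if lowercase:
            text = text.lower()
        chars.update(text)
    return ''.join(sorted(chars))
```

For example, collect_characters(open('transcripts.txt', encoding='utf-8')) would scan a hypothetical transcripts.txt line by line; the result can be reviewed and pasted into symbols.py.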