# Training Data
This repo supports the following speech datasets:
* [LJ Speech](https://keithito.com/LJ-Speech-Dataset/) (Public Domain)
* [Blizzard 2012](http://www.cstr.ed.ac.uk/projects/blizzard/2012/phase_one) (Creative Commons Attribution Share-Alike)

You can use any other dataset if you write a preprocessor for it.

### Writing a Preprocessor
Each training example consists of:
1. The text that was spoken
2. A mel-scale spectrogram of the audio
3. A linear-scale spectrogram of the audio
The preprocessor is responsible for generating these. See [ljspeech.py](datasets/ljspeech.py) for a
commented example.
For each training example, a preprocessor should:
1. Load the audio file:
```python
wav = audio.load_wav(wav_path)
```
2. Compute linear-scale and mel-scale spectrograms (float32 numpy arrays):
```python
spectrogram = audio.spectrogram(wav).astype(np.float32)
mel_spectrogram = audio.melspectrogram(wav).astype(np.float32)
```
3. Save the spectrograms to disk:
```python
np.save(os.path.join(out_dir, spectrogram_filename), spectrogram.T, allow_pickle=False)
np.save(os.path.join(out_dir, mel_spectrogram_filename), mel_spectrogram.T, allow_pickle=False)
```
Note that the transposes of the matrices returned by `audio.spectrogram` and
`audio.melspectrogram` are saved, so that the arrays on disk are in time-major format.
4. Generate a tuple `(spectrogram_filename, mel_spectrogram_filename, n_frames, text)` to
write to `train.txt`. `n_frames` is the length of the time axis of the spectrogram.
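As a rough sketch of step 4, the tuples could be collected and written out one per line. The `write_metadata` helper and the pipe-delimited layout shown here are assumptions for illustration; check [preprocess.py](preprocess.py) for the format the training code actually expects.

```python
import os


def write_metadata(metadata, out_dir):
    """Write one line per training example to train.txt.

    `metadata` is a list of tuples:
    (spectrogram_filename, mel_spectrogram_filename, n_frames, text).
    The '|' separator is an assumption made for this sketch.
    """
    path = os.path.join(out_dir, 'train.txt')
    with open(path, 'w', encoding='utf-8') as f:
        for row in metadata:
            # n_frames is an int, so convert every field to str before joining.
            f.write('|'.join(str(field) for field in row) + '\n')
    return path
```

With a single-character separator like this, transcripts that contain the separator themselves would need to be escaped or cleaned first.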
After you've written your preprocessor, you can add it to [preprocess.py](preprocess.py) by
following the example of the other preprocessors in that file.
### Non-English Data
If your training data is in a language other than English, you will probably want to change the
text cleaners by setting the `cleaners` hyperparameter.
* If your text is in a Latin script or can be transliterated to ASCII using the
[Unidecode](https://pypi.python.org/pypi/Unidecode) library, you can use the transliteration
cleaners by setting the hyperparameter `cleaners=transliteration_cleaners`.
* If you don't want to transliterate, you can define a custom character set.
This allows you to train directly on the character set used in your data.
To do so, edit [symbols.py](text/symbols.py) and change the `_characters` variable to be a
string containing the UTF-8 characters in your data. Then set the hyperparameter `cleaners=basic_cleaners`.
* If you're not sure which option to use, you can evaluate the transliteration cleaners like this:
```python
from text import cleaners
cleaners.transliteration_cleaners('Здравствуйте') # Replace with the text you want to try
```
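If you go with a custom character set, one way to find the characters to put in `_characters` is to scan your transcripts. This is only a minimal sketch: `collect_characters` is a hypothetical helper, not part of this repo, and it assumes you already have the transcripts loaded as a list of strings.

```python
def collect_characters(texts):
    """Return every distinct character found in `texts` as one sorted string."""
    seen = set()
    for text in texts:
        seen.update(text)  # adds each character of the string to the set
    return ''.join(sorted(seen))


transcripts = ['Привет', 'Пока']  # Replace with your own transcripts
print(collect_characters(transcripts))
```

The result includes any spaces and punctuation that appear in the data, so review it before pasting it into [symbols.py](text/symbols.py).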