# Training Data

This repo supports the following speech datasets:

* [LJ Speech](https://keithito.com/LJ-Speech-Dataset/) (Public Domain)
* [Blizzard 2012](http://www.cstr.ed.ac.uk/projects/blizzard/2012/phase_one) (Creative Commons Attribution Share-Alike)

You can use any other dataset if you write a preprocessor for it.

### Writing a Preprocessor

Each training example consists of:

1. The text that was spoken
2. A mel-scale spectrogram of the audio
3. A linear-scale spectrogram of the audio

The preprocessor is responsible for generating these. See [ljspeech.py](datasets/ljspeech.py) for a
commented example.

For each training example, a preprocessor should:

1. Load the audio file:
   ```python
   wav = audio.load_wav(wav_path)
   ```

2. Compute linear-scale and mel-scale spectrograms (float32 NumPy arrays):
   ```python
   spectrogram = audio.spectrogram(wav).astype(np.float32)
   mel_spectrogram = audio.melspectrogram(wav).astype(np.float32)
   ```

3. Save the spectrograms to disk:
   ```python
   np.save(os.path.join(out_dir, spectrogram_filename), spectrogram.T, allow_pickle=False)
   np.save(os.path.join(out_dir, mel_spectrogram_filename), mel_spectrogram.T, allow_pickle=False)
   ```

   Note that the transposes of the matrices returned by `audio.spectrogram` and
   `audio.melspectrogram` are saved, so that the arrays on disk are in time-major format.

4. Generate a tuple `(spectrogram_filename, mel_spectrogram_filename, n_frames, text)` to
   write to `train.txt`. `n_frames` is the length of the time axis of the spectrogram.

After you've written your preprocessor, you can add it to [preprocess.py](preprocess.py) by
following the example of the other preprocessors in that file.
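
The metadata step (step 4 above) can be sketched roughly as follows. This is a minimal illustration, not the repo's own code: the helper name `write_metadata`, the example file names and frame counts, and the choice of `|` as the field separator are all assumptions, so check [preprocess.py](preprocess.py) for the layout your checkout actually expects.

```python
import os
import tempfile


def write_metadata(rows, out_dir, filename="train.txt"):
    """Write one delimited line per training example and return the file path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            # Assumption: fields are joined with "|", so the text must not contain it.
            f.write("|".join(str(field) for field in row) + "\n")
    return path


# Hypothetical file names and frame counts, for illustration only.
rows = [
    ("example-spec-00001.npy", "example-mel-00001.npy", 412, "Hello world."),
    ("example-spec-00002.npy", "example-mel-00002.npy", 655, "A second utterance."),
]

out_dir = tempfile.mkdtemp()
metadata_path = write_metadata(rows, out_dir)

# Read the file back and total the frame counts as a quick sanity check.
with open(metadata_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
total_frames = sum(int(line.split("|")[2]) for line in lines)
print(len(lines), total_frames)
```

Summing `n_frames` over all rows is a cheap way to confirm that every example was written and to estimate how much audio the dataset contains.
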
### Non-English Data

If your training data is in a language other than English, you will probably want to change the
text cleaners by setting the `cleaners` hyperparameter.

* If your text is in a Latin script or can be transliterated to ASCII using the
  [Unidecode](https://pypi.python.org/pypi/Unidecode) library, you can use the transliteration
  cleaners by setting the hyperparameter `cleaners=transliteration_cleaners`.

* If you don't want to transliterate, you can define a custom character set.
  This allows you to train directly on the character set used in your data.

  To do so, edit [symbols.py](text/symbols.py) and change the `_characters` variable to be a
  string containing the characters that appear in your data. Then set the hyperparameter
  `cleaners=basic_cleaners`.

* If you're not sure which option to use, you can evaluate the transliteration cleaners like this:
  ```python
  from text import cleaners

  # Replace the sample string with the text you want to try.
  print(cleaners.transliteration_cleaners('Здравствуйте'))
  ```
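
If you go with a custom character set, it helps to know exactly which characters occur in your transcripts before editing `_characters`. The sketch below is one way to collect them using only the standard library. The function name `character_inventory` and the sample transcripts are hypothetical, and it assumes you already have the transcripts as a list of strings.

```python
from collections import Counter


def character_inventory(transcripts):
    """Count every character that occurs in the transcripts, ignoring line endings."""
    counts = Counter()
    for line in transcripts:
        counts.update(line.rstrip("\n"))
    return counts


# Hypothetical sample transcripts; replace with the text column of your dataset.
transcripts = ["Здравствуйте", "Добрый день"]
counts = character_inventory(transcripts)

# Candidate value for `_characters`: every distinct character, sorted by code point.
candidate_characters = "".join(sorted(counts))
print(candidate_characters)

# Rare characters are often typos or stray symbols worth cleaning up first.
for ch, n in sorted(counts.items(), key=lambda item: item[1])[:5]:
    print(repr(ch), n)
```

Review the result by hand before pasting it into [symbols.py](text/symbols.py): if the cleaner you select changes case or whitespace, the inventory should be built from text that has been through the same normalization.
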