mirror of https://github.com/MycroftAI/mimic2.git

Add documentation on preprocessing training data.

parent 516ff9db55
commit 3c211e7a19

@@ -59,8 +59,7 @@ pip install -r requirements.txt

   * [LJ Speech](https://keithito.com/LJ-Speech-Dataset/) (Public Domain)
   * [Blizzard 2012](http://www.cstr.ed.ac.uk/projects/blizzard/2012/phase_one) (Creative Commons Attribution Share-Alike)

-  You can use other datasets if you convert them to the right format. See
-  [ljspeech.py](datasets/ljspeech.py) for an example.
+  You can use other datasets if you convert them to the right format. See [TRAINING_DATA.md](TRAINING_DATA.md) for more info.

2. **Unpack the dataset into `~/tacotron`**

@@ -0,0 +1,66 @@

# Training Data

This repo supports the following speech datasets:
* [LJ Speech](https://keithito.com/LJ-Speech-Dataset/) (Public Domain)
* [Blizzard 2012](http://www.cstr.ed.ac.uk/projects/blizzard/2012/phase_one) (Creative Commons Attribution Share-Alike)

You can use any other dataset if you write a preprocessor for it.

### Writing a Preprocessor

Each training example consists of:
1. The text that was spoken
2. A mel-scale spectrogram of the audio
3. A linear-scale spectrogram of the audio

The preprocessor is responsible for generating these. See [ljspeech.py](datasets/ljspeech.py) for a
heavily-commented example.

For each training example, a preprocessor should:

1. Load the audio file:
   ```python
   wav = audio.load_wav(wav_path)
   ```

2. Compute linear-scale and mel-scale spectrograms (float32 numpy arrays):
   ```python
   spectrogram = audio.spectrogram(wav).astype(np.float32)
   mel_spectrogram = audio.melspectrogram(wav).astype(np.float32)
   ```

3. Save the spectrograms to disk:
   ```python
   np.save(os.path.join(out_dir, spectrogram_filename), spectrogram.T, allow_pickle=False)
   np.save(os.path.join(out_dir, mel_spectrogram_filename), mel_spectrogram.T, allow_pickle=False)
   ```
   Note that the transpose of the matrix returned by `audio.spectrogram` is saved so that it's
   in time-major format.

4. Generate a tuple `(spectrogram_filename, mel_spectrogram_filename, n_frames, text)` to
   write to `train.txt`. `n_frames` is just the length of the time axis of the spectrogram.
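
Putting step 4 into code: the doc doesn't specify how the tuples are serialized to `train.txt`, so the `'|'` field separator and the `write_metadata` helper below are assumptions for illustration, not the repo's actual format:

```python
import os
import tempfile

def write_metadata(metadata, out_dir):
  # metadata: list of (spectrogram_filename, mel_spectrogram_filename, n_frames, text) tuples.
  # Assumption: '|' as the field separator, since the text can contain spaces and commas.
  with open(os.path.join(out_dir, 'train.txt'), 'w', encoding='utf-8') as f:
    for row in metadata:
      f.write('|'.join(str(x) for x in row) + '\n')

out_dir = tempfile.mkdtemp()
metadata = [('ljspeech-spec-00001.npy', 'ljspeech-mel-00001.npy', 425, 'Printing was invented.')]
write_metadata(metadata, out_dir)
```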

After you've written your preprocessor, you can add it to [preprocess.py](preprocess.py) by
following the example of the other preprocessors in that file.
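
The wiring in preprocess.py isn't shown in this commit; as a hedged sketch (the function and dataset names below are hypothetical, not the actual mimic2 code), a dispatch table keyed on dataset name is one common shape for it:

```python
# Hypothetical sketch of preprocessor registration; the real preprocess.py
# may be structured differently -- follow the examples in that file.

def build_ljspeech(in_dir, out_dir):
  # Stand-in for the existing preprocessor: returns (spec, mel, n_frames, text) tuples.
  return [('ljspeech-spec-00001.npy', 'ljspeech-mel-00001.npy', 425, 'hello')]

def build_mydataset(in_dir, out_dir):
  # Your new preprocessor goes here, returning tuples in the same format.
  return [('mydataset-spec-00001.npy', 'mydataset-mel-00001.npy', 300, 'hi')]

PREPROCESSORS = {
  'ljspeech': build_ljspeech,
  'mydataset': build_mydataset,
}

def preprocess(dataset, in_dir, out_dir):
  # Dispatch on the dataset name chosen on the command line.
  return PREPROCESSORS[dataset](in_dir, out_dir)
```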

### Text Processing During Training and Eval

Some additional processing is done to the text during training and eval. The text is run
through the `to_sequence` function in [textinput.py](util/textinput.py).

This performs several transformations:
1. Leading and trailing whitespace and quotation marks are removed.
2. Text is converted to ASCII by removing diacritics (e.g. "Crème brûlée" becomes "Creme brulee").
3. Numbers are converted to strings using the heuristics in [numbers.py](util/numbers.py).
   *This is specific to English.*
4. Abbreviations are expanded (e.g. "Mr" becomes "Mister"). *This is also specific to English.*
5. Characters outside the input alphabet (ASCII characters and some punctuation) are removed.
6. Whitespace is collapsed so that every sequence of whitespace becomes a single ASCII space.

**Several of these steps are inappropriate for non-English text and you may want to disable or
modify them if you are not using English training data.**
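
Steps 1, 2, and 6 can be sketched with the Python standard library (this is an illustrative approximation, not the actual `to_sequence` code; number expansion, abbreviation handling, and alphabet filtering are omitted):

```python
import re
import unicodedata

def normalize_text(text):
  # Step 1: strip surrounding whitespace and quotation marks.
  text = text.strip().strip('"\'')
  # Step 2: convert to ASCII by NFD-decomposing and dropping combining marks (diacritics).
  text = ''.join(c for c in unicodedata.normalize('NFD', text)
                 if unicodedata.category(c) != 'Mn')
  # Step 6: collapse each run of whitespace to a single ASCII space.
  return re.sub(r'\s+', ' ', text)

print(normalize_text('  "Crème  brûlée" '))  # Creme brulee
```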

@@ -6,6 +6,20 @@ from util import audio

 def build_from_path(in_dir, out_dir, num_workers=1, tqdm=lambda x: x):
   '''Preprocesses the LJ Speech dataset from a given input path into a given output directory.

     Args:
       in_dir: The directory where you have downloaded the LJ Speech dataset
       out_dir: The directory to write the output into
       num_workers: Optional number of worker processes to parallelize across
       tqdm: You can optionally pass tqdm to get a nice progress bar

     Returns:
       A list of tuples describing the training examples. This should be written to train.txt
   '''

   # We use ProcessPoolExecutor to parallelize across processes. This is just an optimization and you
   # can omit it and just call _process_utterance on each input if you want.
   executor = ProcessPoolExecutor(max_workers=num_workers)
   futures = []
   index = 1
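
The hunk is truncated before the loop that feeds the executor; the generic submit-and-collect pattern it relies on can be sketched like this (`_square` is an illustrative stand-in for `_process_utterance`, and the explicit `fork` context is an assumption for this self-contained sketch, not part of the repo's code):

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

def _square(n):
  # Stand-in for _process_utterance: any picklable top-level function works.
  return n * n

def run_parallel(values, num_workers=2):
  # Submit one future per input, then collect results in submission order.
  ctx = multiprocessing.get_context('fork')
  with ProcessPoolExecutor(max_workers=num_workers, mp_context=ctx) as executor:
    futures = [executor.submit(_square, v) for v in values]
    return [f.result() for f in futures]

print(run_parallel(range(5)))  # [0, 1, 4, 9, 16]
```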

@@ -20,12 +34,36 @@ def build_from_path(in_dir, out_dir, num_workers=1, tqdm=lambda x: x):

 def _process_utterance(out_dir, index, wav_path, text):
   '''Preprocesses a single utterance audio/text pair.

   This writes the mel and linear scale spectrograms to disk and returns a tuple to write
   to the train.txt file.

   Args:
     out_dir: The directory to write the spectrograms into
     index: The numeric index to use in the spectrogram filenames.
     wav_path: Path to the audio file containing the speech input
     text: The text spoken in the input audio file

   Returns:
     A (spectrogram_filename, mel_filename, n_frames, text) tuple to write to train.txt
   '''

   # Load the audio to a numpy array:
   wav = audio.load_wav(wav_path)

   # Compute the linear-scale spectrogram from the wav:
   spectrogram = audio.spectrogram(wav).astype(np.float32)
   n_frames = spectrogram.shape[1]

   # Compute a mel-scale spectrogram from the wav:
   mel_spectrogram = audio.melspectrogram(wav).astype(np.float32)

   # Write the spectrograms to disk:
   spectrogram_filename = 'ljspeech-spec-%05d.npy' % index
   mel_filename = 'ljspeech-mel-%05d.npy' % index
   np.save(os.path.join(out_dir, spectrogram_filename), spectrogram.T, allow_pickle=False)
   np.save(os.path.join(out_dir, mel_filename), mel_spectrogram.T, allow_pickle=False)

   # Return a tuple describing this training example:
   return (spectrogram_filename, mel_filename, n_frames, text)
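
A sanity check on what `_process_utterance` writes: the arrays go to disk transposed, so the first axis of a loaded `.npy` file is time, and its length should match the `n_frames` recorded in `train.txt`. A self-contained round-trip with a random stand-in array (the 1025×250 shape is illustrative, not mandated by the repo):

```python
import os
import tempfile
import numpy as np

# Stand-in for audio.spectrogram output: frequency-major, [num_freqs, n_frames]
spectrogram = np.random.rand(1025, 250).astype(np.float32)
n_frames = spectrogram.shape[1]

out_dir = tempfile.mkdtemp()
spectrogram_filename = 'ljspeech-spec-%05d.npy' % 1
np.save(os.path.join(out_dir, spectrogram_filename), spectrogram.T, allow_pickle=False)

loaded = np.load(os.path.join(out_dir, spectrogram_filename))
print(loaded.shape)  # (250, 1025): the time axis comes first after the transpose
```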