mirror of https://github.com/MycroftAI/mimic2.git

Add documentation on preprocessing training data.

parent 516ff9db55
commit 3c211e7a19

@@ -59,8 +59,7 @@ pip install -r requirements.txt

   * [LJ Speech](https://keithito.com/LJ-Speech-Dataset/) (Public Domain)
   * [Blizzard 2012](http://www.cstr.ed.ac.uk/projects/blizzard/2012/phase_one) (Creative Commons Attribution Share-Alike)

-  You can use other datasets if you convert them to the right format. See
-  [ljspeech.py](datasets/ljspeech.py) for an example.
+  You can use other datasets if you convert them to the right format. See [TRAINING_DATA.md](TRAINING_DATA.md) for more info.

2. **Unpack the dataset into `~/tacotron`**

@@ -0,0 +1,66 @@

# Training Data

This repo supports the following speech datasets:
* [LJ Speech](https://keithito.com/LJ-Speech-Dataset/) (Public Domain)
* [Blizzard 2012](http://www.cstr.ed.ac.uk/projects/blizzard/2012/phase_one) (Creative Commons Attribution Share-Alike)

You can use any other dataset if you write a preprocessor for it.

### Writing a Preprocessor

Each training example consists of:
1. The text that was spoken
2. A mel-scale spectrogram of the audio
3. A linear-scale spectrogram of the audio

The preprocessor is responsible for generating these. See [ljspeech.py](datasets/ljspeech.py) for a
heavily-commented example.

For each training example, a preprocessor should:

1. Load the audio file:
   ```python
   wav = audio.load_wav(wav_path)
   ```

2. Compute linear-scale and mel-scale spectrograms (float32 numpy arrays):
   ```python
   spectrogram = audio.spectrogram(wav).astype(np.float32)
   mel_spectrogram = audio.melspectrogram(wav).astype(np.float32)
   ```

3. Save the spectrograms to disk:
   ```python
   np.save(os.path.join(out_dir, spectrogram_filename), spectrogram.T, allow_pickle=False)
   np.save(os.path.join(out_dir, mel_spectrogram_filename), mel_spectrogram.T, allow_pickle=False)
   ```
   Note that the transpose of the matrix returned by `audio.spectrogram` is saved so that it's
   in time-major format.

4. Generate a tuple `(spectrogram_filename, mel_spectrogram_filename, n_frames, text)` to
   write to `train.txt`. `n_frames` is just the length of the time axis of the spectrogram.
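
Putting step 4 into code: the doc doesn't specify how the tuples are serialized to `train.txt`, so the `'|'` field separator and the `write_metadata` helper below are assumptions for illustration, not the repo's actual format:

```python
import os
import tempfile

def write_metadata(metadata, out_dir):
  # metadata: list of (spectrogram_filename, mel_spectrogram_filename, n_frames, text) tuples.
  # Assumption: '|' as the field separator, since the text can contain spaces and commas.
  with open(os.path.join(out_dir, 'train.txt'), 'w', encoding='utf-8') as f:
    for row in metadata:
      f.write('|'.join(str(x) for x in row) + '\n')

out_dir = tempfile.mkdtemp()
metadata = [('ljspeech-spec-00001.npy', 'ljspeech-mel-00001.npy', 425, 'Printing was invented.')]
write_metadata(metadata, out_dir)
```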

After you've written your preprocessor, you can add it to [preprocess.py](preprocess.py) by
following the example of the other preprocessors in that file.
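
The wiring in preprocess.py isn't shown in this commit; as a hedged sketch (the function and dataset names below are hypothetical, not the actual mimic2 code), a dispatch table keyed on dataset name is one common shape for it:

```python
# Hypothetical sketch of preprocessor registration; the real preprocess.py
# may be structured differently -- follow the examples in that file.

def build_ljspeech(in_dir, out_dir):
  # Stand-in for the existing preprocessor: returns (spec, mel, n_frames, text) tuples.
  return [('ljspeech-spec-00001.npy', 'ljspeech-mel-00001.npy', 425, 'hello')]

def build_mydataset(in_dir, out_dir):
  # Your new preprocessor goes here, returning tuples in the same format.
  return [('mydataset-spec-00001.npy', 'mydataset-mel-00001.npy', 300, 'hi')]

PREPROCESSORS = {
  'ljspeech': build_ljspeech,
  'mydataset': build_mydataset,
}

def preprocess(dataset, in_dir, out_dir):
  # Dispatch on the dataset name chosen on the command line.
  return PREPROCESSORS[dataset](in_dir, out_dir)
```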

### Text Processing During Training and Eval

Some additional processing is done to the text during training and eval. The text is run
through the `to_sequence` function in [textinput.py](util/textinput.py).

This performs several transformations:
1. Leading and trailing whitespace and quotation marks are removed.
2. Text is converted to ASCII by removing diacritics (e.g. "Crème brûlée" becomes "Creme brulee").
3. Numbers are converted to strings using the heuristics in [numbers.py](util/numbers.py).
   *This is specific to English.*
4. Abbreviations are expanded (e.g. "Mr" becomes "Mister"). *This is also specific to English.*
5. Characters outside the input alphabet (ASCII characters and some punctuation) are removed.
6. Whitespace is collapsed so that every sequence of whitespace becomes a single ASCII space.

**Several of these steps are inappropriate for non-English text and you may want to disable or
modify them if you are not using English training data.**
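
Steps 1, 2, and 6 can be sketched with the Python standard library (this is an illustrative approximation, not the actual `to_sequence` code; number expansion, abbreviation handling, and alphabet filtering are omitted):

```python
import re
import unicodedata

def normalize_text(text):
  # Step 1: strip surrounding whitespace and quotation marks.
  text = text.strip().strip('"\'')
  # Step 2: convert to ASCII by NFD-decomposing and dropping combining marks (diacritics).
  text = ''.join(c for c in unicodedata.normalize('NFD', text)
                 if unicodedata.category(c) != 'Mn')
  # Step 6: collapse each run of whitespace to a single ASCII space.
  return re.sub(r'\s+', ' ', text)

print(normalize_text('  "Crème  brûlée" '))  # Creme brulee
```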

@@ -6,6 +6,20 @@ from util import audio

 def build_from_path(in_dir, out_dir, num_workers=1, tqdm=lambda x: x):
   '''Preprocesses the LJ Speech dataset from a given input path into a given output directory.

     Args:
       in_dir: The directory where you have downloaded the LJ Speech dataset
       out_dir: The directory to write the output into
       num_workers: Optional number of worker processes to parallelize across
       tqdm: You can optionally pass tqdm to get a nice progress bar

     Returns:
       A list of tuples describing the training examples. This should be written to train.txt
   '''

   # We use ProcessPoolExecutor to parallelize across processes. This is just an optimization and you
   # can omit it and just call _process_utterance on each input if you want.
   executor = ProcessPoolExecutor(max_workers=num_workers)
   futures = []
   index = 1
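
The hunk is truncated before the loop that feeds the executor; the generic submit-and-collect pattern it relies on can be sketched like this (`_square` is an illustrative stand-in for `_process_utterance`, and the explicit `fork` context is an assumption for this self-contained sketch, not part of the repo's code):

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

def _square(n):
  # Stand-in for _process_utterance: any picklable top-level function works.
  return n * n

def run_parallel(values, num_workers=2):
  # Submit one future per input, then collect results in submission order.
  ctx = multiprocessing.get_context('fork')
  with ProcessPoolExecutor(max_workers=num_workers, mp_context=ctx) as executor:
    futures = [executor.submit(_square, v) for v in values]
    return [f.result() for f in futures]

print(run_parallel(range(5)))  # [0, 1, 4, 9, 16]
```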

@@ -20,12 +34,36 @@ def build_from_path(in_dir, out_dir, num_workers=1, tqdm=lambda x: x):

 def _process_utterance(out_dir, index, wav_path, text):
   '''Preprocesses a single utterance audio/text pair.

   This writes the mel and linear scale spectrograms to disk and returns a tuple to write
   to the train.txt file.

   Args:
     out_dir: The directory to write the spectrograms into
     index: The numeric index to use in the spectrogram filenames.
     wav_path: Path to the audio file containing the speech input
     text: The text spoken in the input audio file

   Returns:
     A (spectrogram_filename, mel_filename, n_frames, text) tuple to write to train.txt
   '''

   # Load the audio to a numpy array:
   wav = audio.load_wav(wav_path)

   # Compute the linear-scale spectrogram from the wav:
   spectrogram = audio.spectrogram(wav).astype(np.float32)
   n_frames = spectrogram.shape[1]

   # Compute a mel-scale spectrogram from the wav:
   mel_spectrogram = audio.melspectrogram(wav).astype(np.float32)

   # Write the spectrograms to disk:
   spectrogram_filename = 'ljspeech-spec-%05d.npy' % index
   mel_filename = 'ljspeech-mel-%05d.npy' % index
   np.save(os.path.join(out_dir, spectrogram_filename), spectrogram.T, allow_pickle=False)
   np.save(os.path.join(out_dir, mel_filename), mel_spectrogram.T, allow_pickle=False)

   # Return a tuple describing this training example:
   return (spectrogram_filename, mel_filename, n_frames, text)
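
A sanity check on what `_process_utterance` writes: the arrays go to disk transposed, so the first axis of a loaded `.npy` file is time, and its length should match the `n_frames` recorded in `train.txt`. A self-contained round-trip with a random stand-in array (the 1025×250 shape is illustrative, not mandated by the repo):

```python
import os
import tempfile
import numpy as np

# Stand-in for audio.spectrogram output: frequency-major, [num_freqs, n_frames]
spectrogram = np.random.rand(1025, 250).astype(np.float32)
n_frames = spectrogram.shape[1]

out_dir = tempfile.mkdtemp()
spectrogram_filename = 'ljspeech-spec-%05d.npy' % 1
np.save(os.path.join(out_dir, spectrogram_filename), spectrogram.T, allow_pickle=False)

loaded = np.load(os.path.join(out_dir, spectrogram_filename))
print(loaded.shape)  # (250, 1025): the time axis comes first after the transpose
```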