Keith Ito e61fa839f9 Remove dead code 2017-09-26 08:46:21 -07:00
datasets Use single quotes 2017-09-04 21:20:25 -07:00
models Simplify text processing, make it easier to adapt to non-English data. 2017-09-04 20:04:59 -07:00
tests Rename "pipeline" to "cleaners" 2017-09-04 21:54:23 -07:00
text Rename "pipeline" to "cleaners" 2017-09-04 21:54:23 -07:00
util Remove dead code 2017-09-26 08:46:21 -07:00
.gitignore Initial commit 2017-07-08 13:08:26 -04:00
LICENSE Initial commit 2017-07-08 13:08:26 -04:00
README.md Remove TF from requirements and add to install instructions 2017-09-12 20:43:01 -07:00
TRAINING_DATA.md Rename "pipeline" to "cleaners" 2017-09-04 21:54:23 -07:00
demo_server.py Trim silence from output 2017-09-25 10:03:03 -07:00
eval.py Trim silence from output 2017-09-25 10:03:03 -07:00
hparams.py Rename "pipeline" to "cleaners" 2017-09-04 21:54:23 -07:00
preprocess.py Use single quotes 2017-09-04 21:20:25 -07:00
requirements.txt Remove TF from requirements and add to install instructions 2017-09-12 20:43:01 -07:00
synthesizer.py Trim silence from output 2017-09-25 10:03:03 -07:00
train.py Use single quotes 2017-09-04 21:20:25 -07:00

README.md

Tacotron

An implementation of Tacotron speech synthesis in TensorFlow.

Audio Samples

  • Audio Samples from models trained using this repo.
    • The first set was trained for 877K steps on the LJ Speech Dataset.
      • Speech started to become intelligible around 20K steps.
      • Although loss continued to decrease, there wasn't much noticeable improvement after ~250K steps.
    • The second set was trained by @MXGray for 140K steps on the Nancy Corpus.

Background

Earlier this year, Google published a paper, Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model, where they present a neural text-to-speech model that learns to synthesize speech directly from (text, audio) pairs. However, they didn't release their source code or training data. This is an attempt to provide an open-source implementation of the model described in their paper.

The quality isn't as good as Google's demo yet, but hopefully it will get there someday :-). Pull requests are welcome!

Quick Start

Installing dependencies

  1. Install Python 3.

  2. Install TensorFlow 1.3. Install with GPU support if it's available for your platform.

  3. Install requirements:

    pip install -r requirements.txt
    

Using a pre-trained model

  1. Download and unpack a model:

    curl http://data.keithito.com/data/speech/tacotron-20170720.tar.bz2 | tar xjC /tmp
    
  2. Run the demo server:

    python3 demo_server.py --checkpoint /tmp/tacotron-20170720/model.ckpt
    
  3. Point your browser at localhost:9000

    • Type what you want to synthesize

Training

Note: you need at least 40GB of free disk space to train a model.

  1. Download a speech dataset.

    The following are supported out of the box:

      • LJ Speech
      • Blizzard 2012

    You can use other datasets if you convert them to the right format. See TRAINING_DATA.md for more info.

  2. Unpack the dataset into ~/tacotron

    After unpacking, your tree should look like this for LJ Speech:

    tacotron
      |- LJSpeech-1.0
          |- metadata.csv
          |- wavs
    

    or like this for Blizzard 2012:

    tacotron
      |- Blizzard2012
          |- ATrampAbroad
          |   |- sentence_index.txt
          |   |- lab
          |   |- wav
          |- TheManThatCorruptedHadleyburg
              |- sentence_index.txt
              |- lab
              |- wav
    
  3. Preprocess the data

    python3 preprocess.py --dataset ljspeech
    
    • Use --dataset blizzard for Blizzard data
  4. Train a model

    python3 train.py
    

    Tunable hyperparameters are found in hparams.py. You can adjust these at the command line using the --hparams flag, for example --hparams="batch_size=16,outputs_per_step=2". Hyperparameters should generally be set to the same values at both training and eval time.

  5. Monitor with TensorBoard (optional)

    tensorboard --logdir ~/tacotron/logs-tacotron
    

    The trainer dumps audio and alignments every 1000 steps. You can find these in ~/tacotron/logs-tacotron.

  6. Synthesize from a checkpoint

    python3 demo_server.py --checkpoint ~/tacotron/logs-tacotron/model.ckpt-185000
    

    Replace "185000" with the checkpoint number that you want to use, then open a browser to localhost:9000 and type the text you want to synthesize. Alternatively, you can run eval.py at the command line:

    python3 eval.py --checkpoint ~/tacotron/logs-tacotron/model.ckpt-185000
    

    If you set the --hparams flag when training, set the same value here.
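
To illustrate the --hparams override format used in the training step above, here is a minimal sketch of how a string such as "batch_size=16,outputs_per_step=2" could be applied on top of a set of defaults. The function parse_hparams and the DEFAULTS dictionary are hypothetical and exist only for this example; the default values shown are placeholders, and the parsing code in this repo may behave differently.

```python
def parse_hparams(overrides, defaults):
    """Apply 'name=value,name=value' overrides to a dict of defaults.

    Hypothetical helper for illustration; not the repo's own parser.
    """
    result = dict(defaults)
    if not overrides:
        return result
    for pair in overrides.split(','):
        name, sep, raw = pair.partition('=')
        name = name.strip()
        if not sep or name not in result:
            raise ValueError('Unknown or malformed hyperparameter: %r' % pair)
        default = result[name]
        if isinstance(default, bool):
            # bool('False') is True in Python, so booleans are handled explicitly.
            result[name] = raw.strip().lower() in ('true', '1')
        else:
            # Convert the override to the type of the default value.
            result[name] = type(default)(raw.strip())
    return result


# Placeholder defaults (assumed values, for illustration only).
DEFAULTS = {'batch_size': 32, 'outputs_per_step': 5, 'use_cmudict': False}

train_hparams = parse_hparams('batch_size=16,outputs_per_step=2', DEFAULTS)
print(train_hparams)
```

Passing the same override string at eval time yields the same values, which is why the flag should match between train.py and eval.py.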

Miscellaneous Notes

  • TCMalloc seems to improve training speed and avoid the occasional slowdowns seen with the default allocator. You can enable it by installing it and setting LD_PRELOAD=/usr/lib/libtcmalloc.so.

  • You can train with CMUDict by downloading the dictionary to ~/tacotron/training and then passing the flag --hparams="use_cmudict=True" to train.py. This will allow you to pass ARPAbet phonemes enclosed in curly braces at eval time to force a particular pronunciation, e.g. Turn left on {HH AW1 S S T AH0 N} Street.

  • If you pass a Slack incoming webhook URL as the --slack_url flag to train.py, it will send you progress updates every 1000 steps.

  • Occasionally, you may see a spike in loss and the model will forget how to attend (the alignments will no longer make sense). Although it will recover eventually, it may save time to restart at a checkpoint prior to the spike by passing the --restore_step=150000 flag to train.py (replacing 150000 with a step number prior to the spike). Update: a recent fix to gradient clipping by @candlewill may have fixed this.
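
For the CMUDict note above, the curly-brace convention can be pictured as splitting the input into ordinary text and ARPAbet phoneme runs. The following is a rough sketch under that assumption; split_arpabet is a hypothetical helper written for this example and is not taken from the repo's text-processing code, which may handle edge cases differently.

```python
import re

# Matches one {...} group; the braces themselves are not captured.
_CURLY = re.compile(r'\{([^{}]+)\}')


def split_arpabet(text):
    """Split text into ('text', str) and ('arpabet', [phonemes]) segments.

    Hypothetical helper for illustration only.
    """
    segments = []
    pos = 0
    for match in _CURLY.finditer(text):
        if match.start() > pos:
            # Plain text that precedes this curly-brace group.
            segments.append(('text', text[pos:match.start()]))
        # Phonemes inside the braces are separated by whitespace.
        segments.append(('arpabet', match.group(1).split()))
        pos = match.end()
    if pos < len(text):
        # Trailing plain text after the last group.
        segments.append(('text', text[pos:]))
    return segments


segments = split_arpabet('Turn left on {HH AW1 S S T AH0 N} Street.')
for kind, value in segments:
    print(kind, value)
```

Keeping the phonemes as a separate segment type lets the plain-text parts go through the usual text cleaning while the forced pronunciation is passed through unchanged.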

Other Implementations