Add a note about training with cmudict

pull/2/head
Keith Ito 2017-07-08 18:40:30 -04:00
parent 5a51708b7b
commit 0868767148
1 changed file with 45 additions and 28 deletions

An implementation of Google's Tacotron speech synthesis model in TensorFlow.
## Background
Earlier this year, Google published a paper, [Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model](https://arxiv.org/pdf/1703.10135.pdf),
where they present a neural text-to-speech model that learns to synthesize speech directly from
(text, audio) pairs. However, they didn't release their source code or training data. This is an
attempt to provide an open-source implementation of the model described in their paper.
### Sample Output
Output after training for 185K steps (~2 days):
* [Audio Samples](https://keithito.github.io/audio-samples/)
The quality isn't as good as Google's demo yet, but hopefully it will get there someday :-).
### Using a pre-trained model
1. **Download and unpack a model**:
```
curl http://data.keithito.com/data/speech/tacotron-20170708.tar.bz2 | tar xjC /tmp
```
2. **Run the demo server**:
```
python3 demo_server.py --checkpoint /tmp/tacotron-20170708/model.ckpt
```
3. **Point your browser at [localhost:9000](http://localhost:9000)**
* Type what you want to synthesize (or script requests directly; see the sketch below)
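
If you'd rather script requests than use the browser form, here is a minimal sketch. It assumes the demo server exposes a `/synthesize` endpoint that takes a `text` query parameter and returns WAV bytes; check demo_server.py for the actual route before relying on this.
```python
# Minimal sketch: request synthesized speech over HTTP and save it as a WAV.
# The /synthesize endpoint and its 'text' parameter are assumptions; verify
# them against demo_server.py.
import urllib.parse
import urllib.request

text = 'Hello, world.'
url = 'http://localhost:9000/synthesize?' + urllib.parse.urlencode({'text': text})
with urllib.request.urlopen(url) as response:
    wav_bytes = response.read()

with open('output.wav', 'wb') as f:
    f.write(wav_bytes)
print('Wrote %d bytes to output.wav' % len(wav_bytes))
```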
### Training
1. **Download a speech dataset.**
The following are supported out of the box:
* [LJ Speech](https://keithito.com/LJ-Speech-Dataset) (Public Domain)
* [Blizzard 2012](http://www.cstr.ed.ac.uk/projects/blizzard/2012/phase_one) (Creative Commons Attribution Share-Alike)
You can use other datasets by writing your own loader; see
[ljspeech.py](datasets/ljspeech.py) for an example. (A hypothetical loader sketch follows below.)
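
To support a new dataset, a loader's job is essentially to walk your corpus and produce (audio, transcript) pairs for preprocessing. Below is a hypothetical sketch using the LJ Speech layout (a pipe-delimited metadata.csv whose third field is the normalized transcript); it is not the project's actual loader interface, so mirror datasets/ljspeech.py for the real one.
```python
# Hypothetical loader sketch: yield (wav_path, transcript) pairs from an
# LJ Speech-style corpus. This is not the project's real interface; see
# datasets/ljspeech.py for that.
import os

def load_metadata(in_dir):
    """Yield (wav_path, text) pairs from a pipe-delimited metadata.csv."""
    with open(os.path.join(in_dir, 'metadata.csv'), encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split('|')
            wav_path = os.path.join(in_dir, 'wavs', parts[0] + '.wav')
            yield wav_path, parts[2]  # third field: normalized transcript

# Example (assumes the dataset is already unpacked, as in step 2):
print(next(load_metadata(os.path.expanduser('~/tacotron/LJSpeech-1.0'))))
```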
2. **Unpack the dataset into `~/tacotron`**
After unpacking, your tree should look like this for LJ Speech:
```
tacotron
|- LJSpeech-1.0
    |- metadata.csv
    |- wavs
```
3. **Preprocess the data**
```
python3 preprocess.py --dataset ljspeech
```
* Use `--dataset blizzard` for Blizzard data
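
As a quick sanity check, you can peek at the preprocessing output. The layout assumed here (a train.txt metadata file under ~/tacotron/training) is a guess at preprocess.py's defaults; adjust the path if your run writes elsewhere.
```python
# Sanity-check sketch: count preprocessed training examples. The train.txt
# location is an assumption about preprocess.py's output directory.
import os

train_txt = os.path.expanduser('~/tacotron/training/train.txt')
with open(train_txt, encoding='utf-8') as f:
    examples = f.readlines()
print('%d examples; first line: %s' % (len(examples), examples[0].strip()))
```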
4. **Train a model**
```
python3 train.py
```
5. **Monitor with TensorBoard** (optional)
```
tensorboard --logdir ~/tacotron/logs-tacotron
```
The trainer dumps audio and alignments every 1000 steps. You can find these in
`~/tacotron/logs-tacotron`.
6. **Synthesize from a checkpoint**
```
python3 demo_server.py --checkpoint ~/tacotron/logs-tacotron/model.ckpt-185000
```
Replace "185000" with the checkpoint number that you want to use, then open a browser
to [localhost:9000](http://localhost:9000) and type what you want to speak.
## Miscellaneous Notes
* [TCMalloc](http://goog-perftools.sourceforge.net/doc/tcmalloc.html) seems to improve
training speed and avoids occasional slowdowns seen with the default allocator. You
can enable it by installing it and setting `LD_PRELOAD=/usr/lib/libtcmalloc.so`.
* You can train with [CMUDict](http://www.speech.cs.cmu.edu/cgi-bin/cmudict) by downloading the
dictionary to `~/tacotron/training` and then passing the flag `--hparams="use_cmudict=True"` to
train.py. This will allow you to pass ARPAbet phonemes enclosed in curly braces at eval
time to force a particular pronunciation, e.g. `Turn left on {HH AW1 S S T AH0 N} Street.`
(A sketch of generating this form automatically follows this list.)
* If you pass a Slack incoming webhook URL as the `--slack_url` flag to train.py, it will send
you progress updates every 1000 steps. (A minimal webhook example also follows below.)
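
As a companion to the CMUDict note above, here is a hedged sketch of wrapping dictionary words in the curly-brace ARPAbet form at eval time. The file name `cmudict-0.7b` and its two-space "WORD  PH ON EMES" line format are assumptions about the standard CMUDict distribution; the dictionary yields its listed pronunciation, which you can then hand-edit to force a different one, as in the Houston example above.
```python
# Sketch: replace words with {ARPAbet} transcriptions using a local CMUDict
# file. The file name and two-space "WORD  PHONEMES" line format are
# assumptions about the standard cmudict-0.7b distribution.
import os

def load_cmudict(path):
    entries = {}
    with open(path, encoding='latin-1') as f:
        for line in f:
            if line.startswith(';;;'):  # comment lines
                continue
            word, _, phonemes = line.strip().partition('  ')
            if '(' not in word:  # skip alternate entries like WORD(1)
                entries[word] = phonemes
    return entries

cmudict = load_cmudict(os.path.expanduser('~/tacotron/training/cmudict-0.7b'))

def to_arpabet(text):
    """Wrap each word found in the dictionary in {curly braces}."""
    out = []
    for word in text.split():
        key = word.upper()
        out.append('{%s}' % cmudict[key] if key in cmudict else word)
    return ' '.join(out)

print(to_arpabet('Turn left on Houston Street'))
```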
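
Likewise for the Slack note, a minimal sketch of an incoming-webhook post. The URL is a placeholder and the message is illustrative, not the format train.py actually sends; Slack incoming webhooks accept a JSON body with a `text` field.
```python
# Sketch: post a training-progress message to a Slack incoming webhook.
# The webhook URL is a placeholder; the message text is illustrative only.
import json
import urllib.request

webhook_url = 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder
payload = {'text': 'Step 1000: checkpoint saved'}
request = urllib.request.Request(
    webhook_url,
    data=json.dumps(payload).encode('utf-8'),
    headers={'Content-Type': 'application/json'})
urllib.request.urlopen(request)
```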
## Other Implementations