# mimic2

An implementation of Tacotron speech synthesis in TensorFlow.

This is a fork of [keithito/tacotron](https://github.com/keithito/tacotron) with changes specific to Mimic 2 applied.

### Audio Samples

  * **[Audio Samples](https://keithito.github.io/audio-samples/)** from models trained using this repo.
    * The first set was trained for 877K steps on the [LJ Speech Dataset](https://keithito.com/LJ-Speech-Dataset/).
      * Speech started to become intelligible around 20K steps.
      * Although loss continued to decrease, there wasn't much noticeable improvement after ~250K steps.
    * The second set was trained by [@MXGray](https://github.com/MXGray) for 140K steps on the [Nancy Corpus](http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/).

## Background

In early 2017, Google published a paper, [Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model](https://arxiv.org/pdf/1703.10135.pdf), in which they present a neural text-to-speech model that learns to synthesize speech directly from (text, audio) pairs. However, they didn't release their source code or training data. This is an attempt to provide an open-source implementation of the model described in their paper.

The quality isn't as good as Google's demo yet, but hopefully it will get there someday :-). Pull requests are welcome!

## Quick Start

### Installing dependencies

1. Install Python 3.

2. Install the latest version of [TensorFlow](https://www.tensorflow.org/install/) for your platform. For better performance, install with GPU support if it's available. This code works with TensorFlow 1.3 or 1.4.

3. Install requirements:
   ```
   pip install -r requirements.txt
   ```

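To confirm the TensorFlow install (and whether a GPU is visible to it), a quick check like the following sketch can help; it only reports what TensorFlow itself sees and doesn't cover CUDA setup.

```
# Minimal sanity check for the TensorFlow install (TF 1.3/1.4).
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
# True only if TensorFlow was built with GPU support and a GPU is visible.
print("GPU available:", tf.test.is_gpu_available())
```
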
### Using a pre-trained model

1. **Download and unpack a model**:
   ```
   curl http://data.keithito.com/data/speech/tacotron-20170720.tar.bz2 | tar xjC /tmp
   ```

2. **Run the demo server**:
   ```
   python3 demo_server.py --checkpoint /tmp/tacotron-20170720/model.ckpt
   ```

3. **Point your browser at localhost:9000**
   * Type what you want to synthesize

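The demo page is the simplest way to synthesize, but you can also call the server from a script. The sketch below assumes the server exposes a `/synthesize` endpoint that takes the text as a `text` query parameter and returns WAV audio; this matches the upstream keithito demo server, but check `demo_server.py` if your copy behaves differently.

```
# Assumes a GET /synthesize?text=... endpoint returning WAV bytes (see demo_server.py).
import requests  # third-party: pip install requests

resp = requests.get("http://localhost:9000/synthesize",
                    params={"text": "Hello from Tacotron."})
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)
print("Wrote output.wav (%d bytes)" % len(resp.content))
```
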
### Training

*Note: you need at least 40GB of free disk space to train a model.*

1. **Download a speech dataset.**

   The following are supported out of the box:
    * [LJ Speech](https://keithito.com/LJ-Speech-Dataset/) (Public Domain)
    * [Blizzard 2012](http://www.cstr.ed.ac.uk/projects/blizzard/2012/phase_one) (Creative Commons Attribution Share-Alike)

   You can use other datasets if you convert them to the right format. See [TRAINING_DATA.md](TRAINING_DATA.md) for more info.

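   To get a feel for the expected format, here is a small illustrative sketch that walks LJ Speech's `metadata.csv` (pipe-delimited: file id, raw text, normalized text) and pairs each row with its wav file. TRAINING_DATA.md remains the authoritative reference for adding your own dataset.
   ```
   # Illustrative only: list (wav, text) pairs in an LJ Speech-style dataset.
   import os

   base = os.path.expanduser("~/tacotron/LJSpeech-1.0")
   with open(os.path.join(base, "metadata.csv"), encoding="utf-8") as f:
       for line in f:
           wav_id, _raw_text, norm_text = line.rstrip("\n").split("|")
           wav_path = os.path.join(base, "wavs", wav_id + ".wav")
           print(wav_path, "->", norm_text[:60])
   ```
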
2. **Unpack the dataset into `~/tacotron`**

   After unpacking, your tree should look like this for LJ Speech:
   ```
   tacotron
     |- LJSpeech-1.0
       |- metadata.csv
       |- wavs
   ```

   or like this for Blizzard 2012:
   ```
   tacotron
     |- Blizzard2012
       |- ATrampAbroad
       | |- sentence_index.txt
       | |- lab
       | |- wav
       |- TheManThatCorruptedHadleyburg
         |- sentence_index.txt
         |- lab
         |- wav
   ```

3. **Preprocess the data**
   ```
   python3 preprocess.py --dataset ljspeech
   ```
   * Use `--dataset blizzard` for Blizzard data

4. **Train a model**
   ```
   python3 train.py
   ```

   Tunable hyperparameters are found in [hparams.py](hparams.py). You can adjust these at the command line using the `--hparams` flag, for example `--hparams="batch_size=16,outputs_per_step=2"`. Hyperparameters should generally be set to the same values at both training and eval time.

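   The override string is just a comma-separated list of `name=value` pairs. As a rough illustration (using `tf.contrib.training.HParams` as a stand-in for the project's hparams object, with made-up default values), this is how such a string is applied on top of defaults:
   ```
   # Illustration of how an --hparams override string is parsed; defaults are examples only.
   import tensorflow as tf

   hparams = tf.contrib.training.HParams(batch_size=32, outputs_per_step=5)
   hparams.parse("batch_size=16,outputs_per_step=2")  # same format as --hparams="..."
   print(hparams.batch_size, hparams.outputs_per_step)  # -> 16 2
   ```
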
5. **Monitor with Tensorboard** (optional)
   ```
   tensorboard --logdir ~/tacotron/logs-tacotron
   ```

   The trainer dumps audio and alignments every 1000 steps. You can find these in `~/tacotron/logs-tacotron`.

6. **Synthesize from a checkpoint**
   ```
   python3 demo_server.py --checkpoint ~/tacotron/logs-tacotron/model.ckpt-185000
   ```
   Replace "185000" with the checkpoint number that you want to use, then open a browser to `localhost:9000` and type what you want to speak. Alternatively, you can run [eval.py](eval.py) at the command line:
   ```
   python3 eval.py --checkpoint ~/tacotron/logs-tacotron/model.ckpt-185000
   ```
   If you set the `--hparams` flag when training, set the same value here.

## Notes and Common Issues

  * [TCMalloc](http://goog-perftools.sourceforge.net/doc/tcmalloc.html) seems to improve training speed and avoids occasional slowdowns seen with the default allocator. You can enable it by installing it and setting `LD_PRELOAD=/usr/lib/libtcmalloc.so`.

  * You can train with [CMUDict](http://www.speech.cs.cmu.edu/cgi-bin/cmudict) by downloading the dictionary to ~/tacotron/training and then passing the flag `--hparams="use_cmudict=True"` to train.py. This will allow you to pass ARPAbet phonemes enclosed in curly braces at eval time to force a particular pronunciation, e.g. `Turn left on {HH AW1 S S T AH0 N} Street.`

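    If you want to generate the curly-brace form programmatically, a rough sketch follows. It assumes the plain-text cmudict-0.7b file, whose entries look like `LEFT  L EH1 F T` and whose comment lines start with `;;;`; the file path and helper name are hypothetical.
    ```
    # Hypothetical helper: look up a word in a local cmudict file and wrap it in the
    # curly-brace ARPAbet form accepted at eval time.
    import os

    def arpabet(word, path="~/tacotron/training/cmudict-0.7b"):
        with open(os.path.expanduser(path), encoding="latin-1") as f:
            for line in f:
                if line.startswith(";;;"):
                    continue  # comment lines
                entry, _, phonemes = line.rstrip().partition("  ")
                if entry == word.upper():
                    return "{%s}" % phonemes
        return word  # fall back to plain text if the word isn't in the dictionary

    print(arpabet("left"))  # e.g. -> {L EH1 F T}
    ```
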
  * If you pass a Slack incoming webhook URL as the `--slack_url` flag to train.py, it will send you progress updates every 1000 steps.

  * Occasionally, you may see a spike in loss and the model will forget how to attend (the alignments will no longer make sense). Although it will recover eventually, it may save time to restart at a checkpoint prior to the spike by passing the `--restore_step=150000` flag to train.py (replacing 150000 with a step number prior to the spike). **Update**: a recent [fix](https://github.com/keithito/tacotron/pull/7) to gradient clipping by @candlewill may have fixed this.

  * During eval and training, audio length is limited to `max_iters * outputs_per_step * frame_shift_ms` milliseconds. With the defaults (max_iters=200, outputs_per_step=5, frame_shift_ms=12.5), this is 12.5 seconds.

    If your training examples are longer, you will see an error like this:
    `Incompatible shapes: [32,1340,80] vs. [32,1000,80]`

    To fix this, you can set a larger value of `max_iters` by passing `--hparams="max_iters=300"` to train.py (replace "300" with a value based on how long your audio is and the formula above).

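    As a quick sanity check on that formula (a sketch using the default values quoted above), you can compute the current limit and the `max_iters` needed for a longer clip:
    ```
    # Arithmetic behind the audio length limit; values mirror the defaults above.
    import math

    max_iters, outputs_per_step, frame_shift_ms = 200, 5, 12.5

    limit_ms = max_iters * outputs_per_step * frame_shift_ms
    print(limit_ms / 1000.0, "seconds")  # -> 12.5 seconds

    # max_iters needed for, say, an 18-second training example:
    needed = math.ceil(18000 / (outputs_per_step * frame_shift_ms))
    print(needed)  # -> 288, so --hparams="max_iters=300" leaves some headroom
    ```
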
## Other Implementations
  * By Alex Barron: https://github.com/barronalex/Tacotron
  * By Kyubyong Park: https://github.com/Kyubyong/tacotron

Copyright (c) 2017 Keith Ito