TTS/recipes/blizzard2013
README.md Capacitron (#977) 2022-05-20 16:17:11 +02:00

README.md

How to get the Blizzard 2013 Dataset

Capacitron is a variational-encoder extension of standard Tacotron-based models, designed to model prosody.

To take full advantage of the model, train it on a dataset whose utterances carry a significant amount of prosodic information. A tested candidate for such applications is the blizzard2013 dataset from the Blizzard Challenge, which contains many hours of high-quality audiobook recordings.

To get a license and download link for this dataset, you need to visit the website of the Centre for Speech Technology Research of the University of Edinburgh.

You get access to the raw dataset in a couple of days. Before you can use the high-fidelity dataset, you need to complete a few preprocessing steps:

  1. Get the forced time alignments for the blizzard2013 dataset from here.
  2. Segment the high-fidelity audiobook files based on the instructions here.
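
The segmentation step can be sketched roughly as follows. This is only a minimal illustration, not the procedure from the linked instructions: it assumes the forced alignments have already been parsed into a list of `(start_sec, end_sec)` pairs, and the function name `cut_segments` and the output naming scheme are hypothetical. It uses only Python's standard `wave` module, so it handles uncompressed PCM WAV files only.

```python
import wave


def cut_segments(src_wav, segments, out_prefix):
    """Write one WAV file per (start_sec, end_sec) pair and return the output paths.

    `segments` is a hypothetical, already-parsed form of the forced alignments;
    the real alignment files need their own parsing step.
    """
    out_paths = []
    with wave.open(src_wav, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        for i, (start_sec, end_sec) in enumerate(segments):
            # Convert seconds to frame indices and clamp them to the file bounds.
            start = max(0, int(round(start_sec * rate)))
            end = min(total, int(round(end_sec * rate)))
            if end <= start:
                continue  # skip empty or inverted intervals
            src.setpos(start)
            frames = src.readframes(end - start)
            out_path = f"{out_prefix}_{i:04d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Reuse the source format; the frame count in the header is
                # corrected automatically when the file is closed.
                dst.setparams(params)
                dst.writeframes(frames)
            out_paths.append(out_path)
    return out_paths
```

For example, `cut_segments("chapter_01.wav", [(0.0, 4.2), (4.2, 9.8)], "segments/chapter_01")` would write two clips next to each other, assuming the `segments/` directory exists; the file names here are placeholders.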