{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"Collapsed": "false",
"colab_type": "text",
"id": "6LWsNd3_M3MP"
},
"source": [
"# Mozilla TTS on CPU Real-Time Speech Synthesis with Tensorflow"
]
},
{
"cell_type": "markdown",
"metadata": {
"Collapsed": "false",
"colab_type": "text",
"id": "FAqrSIWgLyP0"
},
"source": [
"**These models are converted from released [PyTorch models](https://colab.research.google.com/drive/1u_16ZzHjKYFn1HNVuA4Qf_i2MMFB9olY?usp=sharing) using our TF utilities provided in Mozilla TTS.**\n",
"\n",
"These TF models support TF 2.2 and for different versions you might need to\n",
"regenerate them. \n",
"\n",
"We use Tacotron2 and MultiBand-Melgan models and LJSpeech dataset.\n",
"\n",
"Tacotron2 is trained using [Double Decoder Consistency](https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/) (DDC) only for 130K steps (3 days) with a single GPU.\n",
"\n",
"MultiBand-Melgan is trained 1.45M steps with real spectrograms.\n",
"\n",
"Note that both model performances can be improved with more training.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"Collapsed": "false",
"colab_type": "text",
"id": "Ku-dA4DKoeXk"
},
"source": [
"### Download Models"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"Collapsed": "false",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 162
},
"colab_type": "code",
"id": "jGIgnWhGsxU1",
"outputId": "08b0dddd-4edf-48c9-e8e5-a419b36a5c3d",
"tags": []
},
"outputs": [],
"source": [
"!gdown --id 1p7OSEEW_Z7ORxNgfZwhMy7IiLE1s0aH7 -O data/tts_model.pkl\n",
"!gdown --id 18CQ6G6tBEOfvCHlPqP8EBI4xWbrr9dBc -O data/config.json"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"Collapsed": "false",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 235
},
"colab_type": "code",
"id": "4dnpE0-kvTsu",
"outputId": "2fe836eb-c7e7-4f1e-9352-0142126bb19f",
"tags": []
},
"outputs": [],
"source": [
"!gdown --id 1rHmj7CqD3Sfa716Y3ub_vpIBrQg_b1yF -O data/vocoder_model.pkl\n",
"!gdown --id 1Rd0R_nRCrbjEdpOwq6XwZAktvugiBvmu -O data/config_vocoder.json\n",
"!gdown --id 11oY3Tv0kQtxK_JPgxrfesa99maVXHNxU -O data/scale_stats.npy"
]
},
{
"cell_type": "markdown",
"metadata": {
"Collapsed": "false",
"colab_type": "text",
"id": "Zlgi8fPdpRF0"
},
"source": [
"### Define TTS function"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"Collapsed": "false",
"colab": {},
"colab_type": "code",
"id": "f-Yc42nQZG5A"
},
"outputs": [],
"source": [
"def tts(model, text, CONFIG, p):\n",
" t_1 = time.time()\n",
" waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens, inputs = synthesis(model, text, CONFIG, use_cuda, ap, speaker_id, style_wav=None,\n",
" truncated=False, enable_eos_bos_chars=CONFIG.enable_eos_bos_chars,\n",
" backend='tf')\n",
" waveform = vocoder_model.inference(torch.FloatTensor(mel_postnet_spec.T).unsqueeze(0))\n",
" waveform = waveform.numpy()[0, 0]\n",
" rtf = (time.time() - t_1) / (len(waveform) / ap.sample_rate)\n",
" tps = (time.time() - t_1) / len(waveform)\n",
" print(waveform.shape)\n",
" print(\" > Run-time: {}\".format(time.time() - t_1))\n",
" print(\" > Real-time factor: {}\".format(rtf))\n",
" print(\" > Time per step: {}\".format(tps))\n",
" IPython.display.display(IPython.display.Audio(waveform, rate=CONFIG.audio['sample_rate'])) \n",
" return alignment, mel_postnet_spec, stop_tokens, waveform"
]
},
{
"cell_type": "markdown",
"metadata": {
"Collapsed": "false",
"colab_type": "text",
"id": "ZksegYQepkFg"
},
"source": [
"### Load Models"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"Collapsed": "false",
"colab": {},
"colab_type": "code",
"id": "oVa0kOamprgj"
},
"outputs": [],
"source": [
"import os\n",
"import torch\n",
"import time\n",
"import IPython\n",
"\n",
"from TTS.tts.tf.utils.generic_utils import setup_model\n",
"from TTS.tts.tf.utils.io import load_checkpoint\n",
"from TTS.utils.io import load_config\n",
"from TTS.tts.utils.text.symbols import symbols, phonemes\n",
"from TTS.utils.audio import AudioProcessor\n",
"from TTS.tts.utils.synthesis import synthesis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"Collapsed": "false",
"colab": {},
"colab_type": "code",
"id": "EY-sHVO8IFSH"
},
"outputs": [],
"source": [
"# runtime settings\n",
"use_cuda = False"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"Collapsed": "false",
"colab": {},
"colab_type": "code",
"id": "_1aIUp2FpxOQ"
},
"outputs": [],
"source": [
"# model paths\n",
"TTS_MODEL = \"data/tts_model.pkl\"\n",
"TTS_CONFIG = \"data/config.json\"\n",
"VOCODER_MODEL = \"data/vocoder_model.pkl\"\n",
"VOCODER_CONFIG = \"data/config_vocoder.json\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"Collapsed": "false",
"colab": {},
"colab_type": "code",
"id": "CpgmdBVQplbv"
},
"outputs": [],
"source": [
"# load configs\n",
"TTS_CONFIG = load_config(TTS_CONFIG)\n",
"VOCODER_CONFIG = load_config(VOCODER_CONFIG)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"Collapsed": "false",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 471
},
"colab_type": "code",
"id": "zmrQxiozIUVE",
"outputId": "fa71bd05-401f-4e5b-a6f7-60ae765966db",
"tags": []
},
"outputs": [],
"source": [
"# load the audio processor\n",
"TTS_CONFIG.audio['stats_path'] = 'data/scale_stats.npy'\n",
"ap = AudioProcessor(**TTS_CONFIG.audio) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"Collapsed": "false",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 72
},
"colab_type": "code",
"id": "8fLoI4ipqMeS",
"outputId": "595d990f-930d-4698-ee14-77796b5eed7d",
"tags": []
},
"outputs": [],
"source": [
"# LOAD TTS MODEL\n",
"# multi speaker \n",
"speaker_id = None\n",
"speakers = []\n",
"\n",
"# load the model\n",
"num_chars = len(phonemes) if TTS_CONFIG.use_phonemes else len(symbols)\n",
"model = setup_model(num_chars, len(speakers), TTS_CONFIG)\n",
"model.build_inference()\n",
"model = load_checkpoint(model, TTS_MODEL)\n",
"model.decoder.set_max_decoder_steps(1000)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"Collapsed": "false",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 489
},
"colab_type": "code",
"id": "zKoq0GgzqzhQ",
"outputId": "2cc3deae-144f-4465-da3b-98628d948506"
},
"outputs": [],
"source": [
"from TTS.vocoder.tf.utils.generic_utils import setup_generator\n",
"from TTS.vocoder.tf.utils.io import load_checkpoint\n",
"\n",
"# LOAD VOCODER MODEL\n",
"vocoder_model = setup_generator(VOCODER_CONFIG)\n",
"vocoder_model.build_inference()\n",
"vocoder_model = load_checkpoint(vocoder_model, VOCODER_MODEL)\n",
"vocoder_model.inference_padding = 0\n",
"\n",
"ap_vocoder = AudioProcessor(**VOCODER_CONFIG['audio']) "
]
},
{
"cell_type": "markdown",
"metadata": {
"Collapsed": "false",
"colab_type": "text",
"id": "Ws_YkPKsLgo-"
},
"source": [
"## Run Inference"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"Collapsed": "false",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 134
},
"colab_type": "code",
"id": "FuWxZ9Ey5Puj",
"outputId": "07ede6e5-06e6-4612-f687-7984d20e5254"
},
"outputs": [],
"source": [
"sentence = \"Bill got in the habit of asking himself “Is that thought true?” and if he wasnt absolutely certain it was, he just let it go.\"\n",
"align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, ap)"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "DDC-TTS_and_MultiBand-MelGAN_TF_Example.ipynb",
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}