{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "6LWsNd3_M3MP"
},
"source": [
"# Mozilla TTS on CPU: Real-Time Speech Synthesis with TFLite"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "FAqrSIWgLyP0"
},
"source": [
"**These models are converted from released [PyTorch models](https://colab.research.google.com/drive/1u_16ZzHjKYFn1HNVuA4Qf_i2MMFB9olY?usp=sharing) using our TF utilities provided in Mozilla TTS.**\n",
"\n",
"#### **Notebook Details**\n",
"These TFLite models target TF 2.3rc0; for other TensorFlow versions you might need to regenerate them.\n",
"\n",
"TFLite optimizations degrade the quality of the TTS model, and for the same reason we apply\n",
"no optimization to the vocoder model. If you want to\n",
"keep the full quality, consider regenerating the TTS TFLite model without optimization.\n",
"\n",
"Models optimized with TFLite can be slow on a regular CPU, since the optimizations\n",
"specifically target lower-end systems.\n",
"\n",
"---\n",
"\n",
"\n",
"\n",
"#### **Model Details** \n",
"We use Tacotron2 and MultiBand-MelGAN models trained on the LJSpeech dataset.\n",
"\n",
"Tacotron2 is trained with [Double Decoder Consistency](https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/) (DDC) for only 130K steps (3 days) on a single GPU.\n",
"\n",
"MultiBand-MelGAN is trained for 1.45M steps with real spectrograms.\n",
"\n",
"Note that the performance of both models can be improved with more training.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Ku-dA4DKoeXk"
},
"source": [
"### Download TF Models and configs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 162
},
"colab_type": "code",
"id": "jGIgnWhGsxU1",
"outputId": "57af701e-77ec-400d-fee5-64aa7603d357"
},
"outputs": [],
"source": [
"!gdown --id 17PYXCmTe0el_SLTwznrt3vOArNGMGo5v -O tts_model.tflite\n",
"!gdown --id 18CQ6G6tBEOfvCHlPqP8EBI4xWbrr9dBc -O config.json"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 235
},
"colab_type": "code",
"id": "4dnpE0-kvTsu",
"outputId": "6aab0622-9add-4ee4-b9f8-177d6ddc0e86"
},
"outputs": [],
"source": [
"!gdown --id 1aXveT-NjOM1mUr6tM4JfWjshq67GvVIO -O vocoder_model.tflite\n",
"!gdown --id 1Rd0R_nRCrbjEdpOwq6XwZAktvugiBvmu -O config_vocoder.json\n",
"!gdown --id 11oY3Tv0kQtxK_JPgxrfesa99maVXHNxU -O scale_stats.npy"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "_ZuDrj_ioqHE"
},
"source": [
"### Setup Libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 964
},
"colab_type": "code",
"id": "X2axt5BYq7gv",
"outputId": "aa53986f-f218-4d17-8667-0d74bb90c927"
},
"outputs": [],
"source": [
"# espeak is needed for character-to-phoneme conversion\n",
"!sudo apt-get install -y espeak"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 144
},
"colab_type": "code",
"id": "ZduAf-qYYEIT",
"outputId": "c1fcac0d-b8f8-442c-d598-4f549c42b698"
},
"outputs": [],
"source": [
"!git clone https://github.com/mozilla/TTS"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"colab_type": "code",
"id": "ofPCvPyjZEcT",
"outputId": "f3d3ea73-eae5-473c-db19-276bd0e721cc"
},
"outputs": [],
"source": [
"%cd TTS\n",
"!git checkout c7296b3\n",
"!pip install -r requirements.txt\n",
"!python setup.py install\n",
"!pip install tensorflow==2.3.0rc0\n",
"%cd .."
]
},
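{
"cell_type": "markdown",
"metadata": {},
"source": [
"The TFLite models in this notebook target TF 2.3rc0. As an optional sanity check (a sketch that is not part of the original notebook), the cell below prints the installed TensorFlow version and warns when it is not a 2.3 release. Treating any 2.3.x build as compatible is an assumption, and on Colab the runtime may need a restart before a newly installed version is picked up."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"\n",
"# Assumption: any 2.3.x build is close enough to 2.3.0rc0 for these models.\n",
"if not tf.__version__.startswith('2.3'):\n",
"    print('Warning: models target TF 2.3rc0, found', tf.__version__)\n",
"else:\n",
"    print('TensorFlow version:', tf.__version__)"
]
},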
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Zlgi8fPdpRF0"
},
"source": [
"### Define TTS function"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "f-Yc42nQZG5A"
},
"outputs": [],
"source": [
"def run_vocoder(mel_spec):\n",
"    # add a batch dimension\n",
"    vocoder_inputs = mel_spec[None, :, :]\n",
"    input_details = vocoder_model.get_input_details()\n",
"    # resize the input tensor to match this spectrogram, then reallocate\n",
"    vocoder_model.resize_tensor_input(input_details[0]['index'], vocoder_inputs.shape)\n",
"    vocoder_model.allocate_tensors()\n",
"    vocoder_model.set_tensor(input_details[0]['index'], vocoder_inputs)\n",
"    # run the model\n",
"    vocoder_model.invoke()\n",
"    # collect outputs\n",
"    output_details = vocoder_model.get_output_details()\n",
"    waveform = vocoder_model.get_tensor(output_details[0]['index'])\n",
"    return waveform\n",
"\n",
"\n",
"def tts(model, text, CONFIG, ap):\n",
"    # note: use_cuda, speaker_id and vocoder_model are globals defined in later cells\n",
"    t_1 = time.time()\n",
"    waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens, inputs = synthesis(\n",
"        model, text, CONFIG, use_cuda, ap, speaker_id, style_wav=None,\n",
"        truncated=False, enable_eos_bos_chars=CONFIG.enable_eos_bos_chars,\n",
"        backend='tflite')\n",
"    waveform = run_vocoder(mel_postnet_spec.T)\n",
"    waveform = waveform[0, 0]\n",
"    # measure elapsed time once so all reported metrics are consistent\n",
"    run_time = time.time() - t_1\n",
"    rtf = run_time / (len(waveform) / ap.sample_rate)\n",
"    tps = run_time / len(waveform)\n",
"    print(waveform.shape)\n",
"    print(\" > Run-time: {}\".format(run_time))\n",
"    print(\" > Real-time factor: {}\".format(rtf))\n",
"    print(\" > Time per step: {}\".format(tps))\n",
"    IPython.display.display(IPython.display.Audio(waveform, rate=CONFIG.audio['sample_rate']))\n",
"    return alignment, mel_postnet_spec, stop_tokens, waveform"
]
},
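{
"cell_type": "markdown",
"metadata": {},
"source": [
"The real-time factor printed by `tts` is the run time divided by the duration of the generated audio, so a value below 1 means synthesis is faster than playback. The cell below is a minimal worked example with made-up numbers; the helper name `real_time_factor` is hypothetical and not part of Mozilla TTS."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def real_time_factor(run_time_s, num_samples, sample_rate):\n",
"    # duration of the generated audio in seconds\n",
"    audio_duration_s = num_samples / sample_rate\n",
"    return run_time_s / audio_duration_s\n",
"\n",
"\n",
"# illustrative values only: 1.5 s of compute for 3 s of audio at 22050 Hz\n",
"print(real_time_factor(1.5, 66150, 22050))  # 0.5"
]
},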
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ZksegYQepkFg"
},
"source": [
"### Load TF Models"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "oVa0kOamprgj"
},
"outputs": [],
"source": [
"import os\n",
"import torch\n",
"import time\n",
"import IPython\n",
"\n",
"from TTS.tf.utils.tflite import load_tflite_model\n",
"from TTS.tf.utils.io import load_checkpoint\n",
"from TTS.utils.io import load_config\n",
"from TTS.utils.text.symbols import symbols, phonemes\n",
"from TTS.utils.audio import AudioProcessor\n",
"from TTS.tts.utils.synthesis import synthesis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "EY-sHVO8IFSH"
},
"outputs": [],
"source": [
"# runtime settings\n",
"use_cuda = False"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "_1aIUp2FpxOQ"
},
"outputs": [],
"source": [
"# model paths\n",
"TTS_MODEL = \"tts_model.tflite\"\n",
"TTS_CONFIG = \"config.json\"\n",
"VOCODER_MODEL = \"vocoder_model.tflite\"\n",
"VOCODER_CONFIG = \"config_vocoder.json\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "CpgmdBVQplbv"
},
"outputs": [],
"source": [
"# load configs\n",
"TTS_CONFIG = load_config(TTS_CONFIG)\n",
"VOCODER_CONFIG = load_config(VOCODER_CONFIG)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 471
},
"colab_type": "code",
"id": "zmrQxiozIUVE",
"outputId": "ca7e9016-4c28-4cef-efe7-0613d399aa4c"
},
"outputs": [],
"source": [
"# load the audio processor\n",
"ap = AudioProcessor(**TTS_CONFIG.audio) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "8fLoI4ipqMeS"
},
"outputs": [],
"source": [
"# LOAD TTS MODEL\n",
"# multi-speaker settings (not used here: LJSpeech is a single-speaker dataset)\n",
"speaker_id = None\n",
"speakers = []\n",
"\n",
"# load the models\n",
"model = load_tflite_model(TTS_MODEL)\n",
"vocoder_model = load_tflite_model(VOCODER_MODEL)"
]
},
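{
"cell_type": "markdown",
"metadata": {},
"source": [
"`run_vocoder` relies on the interpreter's input and output tensor details. As an optional check, the cell below lists the name, shape, and dtype of each tensor for both models. It is a sketch that assumes `load_tflite_model` returns a standard `tf.lite.Interpreter`, as the calls in `run_vocoder` suggest."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Assumption: both objects expose the tf.lite.Interpreter API.\n",
"for name, interpreter in [('tts', model), ('vocoder', vocoder_model)]:\n",
"    for kind, details in [('input', interpreter.get_input_details()),\n",
"                          ('output', interpreter.get_output_details())]:\n",
"        for d in details:\n",
"            print(name, kind, d['name'], d['shape'], d['dtype'])"
]
},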
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Ws_YkPKsLgo-"
},
"source": [
"## Run Inference"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 134
},
"colab_type": "code",
"id": "FuWxZ9Ey5Puj",
"outputId": "d1888ebd-3208-42a4-aaf9-78d0e3ec987d"
},
"outputs": [],
"source": [
"sentence = \"Bill got in the habit of asking himself “Is that thought true?” and if he wasn't absolutely certain it was, he just let it go.\"\n",
"align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, ap)"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "DDC-TTS_and_MultiBand-MelGAN_TFLite_Example.ipynb",
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}