TTS/notebooks/DDC_TTS_and_MultiBand_MelGA...

388 lines
10 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters!

This file contains ambiguous Unicode characters that may be confused with others in your current locale. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to highlight these characters.

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "6LWsNd3_M3MP"
},
"source": [
"# Mozilla TTS on CPU Real-Time Speech Synthesis with TFLite"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "FAqrSIWgLyP0"
},
"source": [
"**These models are converted from released [PyTorch models](https://colab.research.google.com/drive/1u_16ZzHjKYFn1HNVuA4Qf_i2MMFB9olY?usp=sharing) using our TF utilities provided in Mozilla TTS.**\n",
"\n",
"#### **Notebook Details**\n",
"These TFLite models support TF 2.3rc0 and for different versions you might need to regenerate them. \n",
"\n",
"TFLite optimizations degrades the TTS model performance and we do not apply\n",
"any optimization for the vocoder model due to the same reason. If you like to\n",
"keep the quality, consider to regenerate TFLite model accordingly.\n",
"\n",
"Models optimized with TFLite can be slow on a regular CPU since it is optimized\n",
"specifically for lower-end systems.\n",
"\n",
"---\n",
"\n",
"\n",
"\n",
"#### **Model Details** \n",
"We use Tacotron2 and MultiBand-Melgan models and LJSpeech dataset.\n",
"\n",
"Tacotron2 is trained using [Double Decoder Consistency](https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/) (DDC) only for 130K steps (3 days) with a single GPU.\n",
"\n",
"MultiBand-Melgan is trained 1.45M steps with real spectrograms.\n",
"\n",
"Note that both model performances can be improved with more training.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Ku-dA4DKoeXk"
},
"source": [
"### Download TF Models and configs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 162
},
"colab_type": "code",
"id": "jGIgnWhGsxU1",
"outputId": "57af701e-77ec-400d-fee5-64aa7603d357"
},
"outputs": [],
"source": [
"!gdown --id 17PYXCmTe0el_SLTwznrt3vOArNGMGo5v -O tts_model.tflite\n",
"!gdown --id 18CQ6G6tBEOfvCHlPqP8EBI4xWbrr9dBc -O config.json"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 235
},
"colab_type": "code",
"id": "4dnpE0-kvTsu",
"outputId": "6aab0622-9add-4ee4-b9f8-177d6ddc0e86"
},
"outputs": [],
"source": [
"!gdown --id 1aXveT-NjOM1mUr6tM4JfWjshq67GvVIO -O vocoder_model.tflite\n",
"!gdown --id 1Rd0R_nRCrbjEdpOwq6XwZAktvugiBvmu -O config_vocoder.json\n",
"!gdown --id 11oY3Tv0kQtxK_JPgxrfesa99maVXHNxU -O scale_stats.npy"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "_ZuDrj_ioqHE"
},
"source": [
"### Setup Libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 964
},
"colab_type": "code",
"id": "X2axt5BYq7gv",
"outputId": "aa53986f-f218-4d17-8667-0d74bb90c927"
},
"outputs": [],
"source": [
"# need it for char to phoneme conversion\n",
"! sudo apt-get install espeak"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 144
},
"colab_type": "code",
"id": "ZduAf-qYYEIT",
"outputId": "c1fcac0d-b8f8-442c-d598-4f549c42b698"
},
"outputs": [],
"source": [
"!git clone https://github.com/mozilla/TTS"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"colab_type": "code",
"id": "ofPCvPyjZEcT",
"outputId": "f3d3ea73-eae5-473c-db19-276bd0e721cc"
},
"outputs": [],
"source": [
"%cd TTS\n",
"!git checkout c7296b3\n",
"!pip install -r requirements.txt\n",
"!python setup.py install\n",
"!pip install tensorflow==2.3.0rc0\n",
"%cd .."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Zlgi8fPdpRF0"
},
"source": [
"### Define TTS function"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "f-Yc42nQZG5A"
},
"outputs": [],
"source": [
"def run_vocoder(mel_spec):\n",
" vocoder_inputs = mel_spec[None, :, :]\n",
" # get input and output details\n",
" input_details = vocoder_model.get_input_details()\n",
" # reshape input tensor for the new input shape\n",
" vocoder_model.resize_tensor_input(input_details[0]['index'], vocoder_inputs.shape)\n",
" vocoder_model.allocate_tensors()\n",
" detail = input_details[0]\n",
" vocoder_model.set_tensor(detail['index'], vocoder_inputs)\n",
" # run the model\n",
" vocoder_model.invoke()\n",
" # collect outputs\n",
" output_details = vocoder_model.get_output_details()\n",
" waveform = vocoder_model.get_tensor(output_details[0]['index'])\n",
" return waveform \n",
"\n",
"\n",
"def tts(model, text, CONFIG, p):\n",
" t_1 = time.time()\n",
" waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens, inputs = synthesis(model, text, CONFIG, use_cuda, ap, speaker_id, style_wav=None,\n",
" truncated=False, enable_eos_bos_chars=CONFIG.enable_eos_bos_chars,\n",
" backend='tflite')\n",
" waveform = run_vocoder(mel_postnet_spec.T)\n",
" waveform = waveform[0, 0]\n",
" rtf = (time.time() - t_1) / (len(waveform) / ap.sample_rate)\n",
" tps = (time.time() - t_1) / len(waveform)\n",
" print(waveform.shape)\n",
" print(\" > Run-time: {}\".format(time.time() - t_1))\n",
" print(\" > Real-time factor: {}\".format(rtf))\n",
" print(\" > Time per step: {}\".format(tps))\n",
" IPython.display.display(IPython.display.Audio(waveform, rate=CONFIG.audio['sample_rate'])) \n",
" return alignment, mel_postnet_spec, stop_tokens, waveform"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ZksegYQepkFg"
},
"source": [
"### Load TF Models"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "oVa0kOamprgj"
},
"outputs": [],
"source": [
"import os\n",
"import torch\n",
"import time\n",
"import IPython\n",
"\n",
"from TTS.tf.utils.tflite import load_tflite_model\n",
"from TTS.tf.utils.io import load_checkpoint\n",
"from TTS.utils.io import load_config\n",
"from TTS.utils.text.symbols import symbols, phonemes\n",
"from TTS.utils.audio import AudioProcessor\n",
"from TTS.tts.utils.synthesis import synthesis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "EY-sHVO8IFSH"
},
"outputs": [],
"source": [
"# runtime settings\n",
"use_cuda = False"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "_1aIUp2FpxOQ"
},
"outputs": [],
"source": [
"# model paths\n",
"TTS_MODEL = \"tts_model.tflite\"\n",
"TTS_CONFIG = \"config.json\"\n",
"VOCODER_MODEL = \"vocoder_model.tflite\"\n",
"VOCODER_CONFIG = \"config_vocoder.json\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "CpgmdBVQplbv"
},
"outputs": [],
"source": [
"# load configs\n",
"TTS_CONFIG = load_config(TTS_CONFIG)\n",
"VOCODER_CONFIG = load_config(VOCODER_CONFIG)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 471
},
"colab_type": "code",
"id": "zmrQxiozIUVE",
"outputId": "ca7e9016-4c28-4cef-efe7-0613d399aa4c"
},
"outputs": [],
"source": [
"# load the audio processor\n",
"ap = AudioProcessor(**TTS_CONFIG.audio) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "8fLoI4ipqMeS"
},
"outputs": [],
"source": [
"# LOAD TTS MODEL\n",
"# multi speaker \n",
"speaker_id = None\n",
"speakers = []\n",
"\n",
"# load the models\n",
"model = load_tflite_model(TTS_MODEL)\n",
"vocoder_model = load_tflite_model(VOCODER_MODEL)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Ws_YkPKsLgo-"
},
"source": [
"## Run Inference"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 134
},
"colab_type": "code",
"id": "FuWxZ9Ey5Puj",
"outputId": "d1888ebd-3208-42a4-aaf9-78d0e3ec987d"
},
"outputs": [],
"source": [
"sentence = \"Bill got in the habit of asking himself “Is that thought true?” and if he wasnt absolutely certain it was, he just let it go.\"\n",
"align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, ap)"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "DDC-TTS_and_MultiBand-MelGAN_TFLite_Example.ipynb",
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}