mirror of https://github.com/coqui-ai/TTS.git
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "6LWsNd3_M3MP"
   },
   "source": [
    "# Mozilla TTS on CPU Real-Time Speech Synthesis with TFLite"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "FAqrSIWgLyP0"
   },
   "source": [
    "**These models are converted from the released [PyTorch models](https://colab.research.google.com/drive/1u_16ZzHjKYFn1HNVuA4Qf_i2MMFB9olY?usp=sharing) using the TF utilities provided in Mozilla TTS.**\n",
    "\n",
    "#### **Notebook Details**\n",
    "These TFLite models target TF 2.3rc0; for other versions you might need to regenerate them.\n",
    "\n",
    "TFLite optimizations degrade the output quality of the TTS model, and for the same\n",
    "reason we do not apply any optimization to the vocoder model. If you want to\n",
    "keep the quality, consider regenerating the TFLite model without these optimizations.\n",
    "\n",
    "Models optimized with TFLite can be slow on a regular CPU since TFLite is optimized\n",
    "specifically for lower-end systems.\n",
    "\n",
    "---\n",
    "\n",
    "#### **Model Details**\n",
    "We use Tacotron2 and MultiBand-MelGAN models trained on the LJSpeech dataset.\n",
    "\n",
    "Tacotron2 is trained using [Double Decoder Consistency](https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/) (DDC) for only 130K steps (3 days) on a single GPU.\n",
    "\n",
    "MultiBand-MelGAN is trained for 1.45M steps with real spectrograms.\n",
    "\n",
    "Note that the performance of both models can be improved with more training.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Ku-dA4DKoeXk"
   },
   "source": [
    "### Download TFLite Models and Configs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 162
    },
    "colab_type": "code",
    "id": "jGIgnWhGsxU1",
    "outputId": "57af701e-77ec-400d-fee5-64aa7603d357"
   },
   "outputs": [],
   "source": [
    "!gdown --id 17PYXCmTe0el_SLTwznrt3vOArNGMGo5v -O tts_model.tflite\n",
    "!gdown --id 18CQ6G6tBEOfvCHlPqP8EBI4xWbrr9dBc -O config.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 235
    },
    "colab_type": "code",
    "id": "4dnpE0-kvTsu",
    "outputId": "6aab0622-9add-4ee4-b9f8-177d6ddc0e86"
   },
   "outputs": [],
   "source": [
    "!gdown --id 1aXveT-NjOM1mUr6tM4JfWjshq67GvVIO -O vocoder_model.tflite\n",
    "!gdown --id 1Rd0R_nRCrbjEdpOwq6XwZAktvugiBvmu -O config_vocoder.json\n",
    "!gdown --id 11oY3Tv0kQtxK_JPgxrfesa99maVXHNxU -O scale_stats.npy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "_ZuDrj_ioqHE"
   },
   "source": [
    "### Setup Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 964
    },
    "colab_type": "code",
    "id": "X2axt5BYq7gv",
    "outputId": "aa53986f-f218-4d17-8667-0d74bb90c927"
   },
   "outputs": [],
   "source": [
    "# espeak is needed for character-to-phoneme conversion\n",
    "!sudo apt-get install -y espeak"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 144
    },
    "colab_type": "code",
    "id": "ZduAf-qYYEIT",
    "outputId": "c1fcac0d-b8f8-442c-d598-4f549c42b698"
   },
   "outputs": [],
   "source": [
    "!git clone https://github.com/mozilla/TTS"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 1000
    },
    "colab_type": "code",
    "id": "ofPCvPyjZEcT",
    "outputId": "f3d3ea73-eae5-473c-db19-276bd0e721cc"
   },
   "outputs": [],
   "source": [
    "%cd TTS\n",
    "!git checkout c7296b3\n",
    "!pip install -r requirements.txt\n",
    "!python setup.py install\n",
    "!pip install tensorflow==2.3.0rc0\n",
    "%cd .."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Zlgi8fPdpRF0"
   },
   "source": [
    "### Define TTS function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "f-Yc42nQZG5A"
   },
   "outputs": [],
   "source": [
    "def run_vocoder(mel_spec):\n",
    "    # add a batch dimension\n",
    "    vocoder_inputs = mel_spec[None, :, :]\n",
    "    # get input details\n",
    "    input_details = vocoder_model.get_input_details()\n",
    "    # reshape input tensor for the new input shape\n",
    "    vocoder_model.resize_tensor_input(input_details[0]['index'], vocoder_inputs.shape)\n",
    "    vocoder_model.allocate_tensors()\n",
    "    detail = input_details[0]\n",
    "    vocoder_model.set_tensor(detail['index'], vocoder_inputs)\n",
    "    # run the model\n",
    "    vocoder_model.invoke()\n",
    "    # collect outputs\n",
    "    output_details = vocoder_model.get_output_details()\n",
    "    waveform = vocoder_model.get_tensor(output_details[0]['index'])\n",
    "    return waveform\n",
    "\n",
    "\n",
    "def tts(model, text, CONFIG, ap):\n",
    "    # use_cuda and speaker_id are globals defined in later cells\n",
    "    t_1 = time.time()\n",
    "    waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens, inputs = synthesis(\n",
    "        model, text, CONFIG, use_cuda, ap, speaker_id, style_wav=None,\n",
    "        truncated=False, enable_eos_bos_chars=CONFIG.enable_eos_bos_chars,\n",
    "        backend='tflite')\n",
    "    waveform = run_vocoder(mel_postnet_spec.T)\n",
    "    waveform = waveform[0, 0]\n",
    "    # measure the elapsed time once so all reported metrics are consistent\n",
    "    run_time = time.time() - t_1\n",
    "    rtf = run_time / (len(waveform) / ap.sample_rate)\n",
    "    tps = run_time / len(waveform)\n",
    "    print(waveform.shape)\n",
    "    print(\" > Run-time: {}\".format(run_time))\n",
    "    print(\" > Real-time factor: {}\".format(rtf))\n",
    "    print(\" > Time per step: {}\".format(tps))\n",
    "    IPython.display.display(IPython.display.Audio(waveform, rate=CONFIG.audio['sample_rate']))\n",
    "    return alignment, mel_postnet_spec, stop_tokens, waveform"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ZksegYQepkFg"
   },
   "source": [
    "### Load TFLite Models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "oVa0kOamprgj"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import torch\n",
    "import time\n",
    "import IPython\n",
    "\n",
    "from TTS.tf.utils.tflite import load_tflite_model\n",
    "from TTS.tf.utils.io import load_checkpoint\n",
    "from TTS.utils.io import load_config\n",
    "from TTS.utils.text.symbols import symbols, phonemes\n",
    "from TTS.utils.audio import AudioProcessor\n",
    "from TTS.tts.utils.synthesis import synthesis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "EY-sHVO8IFSH"
   },
   "outputs": [],
   "source": [
    "# runtime settings\n",
    "use_cuda = False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "_1aIUp2FpxOQ"
   },
   "outputs": [],
   "source": [
    "# model paths\n",
    "TTS_MODEL = \"tts_model.tflite\"\n",
    "TTS_CONFIG_PATH = \"config.json\"\n",
    "VOCODER_MODEL = \"vocoder_model.tflite\"\n",
    "VOCODER_CONFIG_PATH = \"config_vocoder.json\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "CpgmdBVQplbv"
   },
   "outputs": [],
   "source": [
    "# load configs (separate *_PATH names keep this cell safe to re-run)\n",
    "TTS_CONFIG = load_config(TTS_CONFIG_PATH)\n",
    "VOCODER_CONFIG = load_config(VOCODER_CONFIG_PATH)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 471
    },
    "colab_type": "code",
    "id": "zmrQxiozIUVE",
    "outputId": "ca7e9016-4c28-4cef-efe7-0613d399aa4c"
   },
   "outputs": [],
   "source": [
    "# load the audio processor\n",
    "ap = AudioProcessor(**TTS_CONFIG.audio)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "8fLoI4ipqMeS"
   },
   "outputs": [],
   "source": [
    "# LOAD TTS MODEL\n",
    "# multi speaker\n",
    "speaker_id = None\n",
    "speakers = []\n",
    "\n",
    "# load the models\n",
    "model = load_tflite_model(TTS_MODEL)\n",
    "vocoder_model = load_tflite_model(VOCODER_MODEL)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Ws_YkPKsLgo-"
   },
   "source": [
    "## Run Inference"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 134
    },
    "colab_type": "code",
    "id": "FuWxZ9Ey5Puj",
    "outputId": "d1888ebd-3208-42a4-aaf9-78d0e3ec987d"
   },
   "outputs": [],
   "source": [
    "sentence = \"Bill got in the habit of asking himself “Is that thought true?” and if he wasn’t absolutely certain it was, he just let it go.\"\n",
    "align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, ap)"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "DDC-TTS_and_MultiBand-MelGAN_TFLite_Example.ipynb",
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}