Neural network-based TTS model

We present the training process of an end-to-end neural network-based TTS model and its optimization.

ML Model

BME-DNN-TTS-M6deliv-all-RA73.zip

Developed by

Budapest University of Technology and Economics (BME)

License

3-clause BSD license (BSD-3-Clause)

Main Characteristic

Tacotron2 is one of the more recent neural network-based text-to-speech (TTS) methods, and it is quite flexible as an adaptive TTS system. It is the first component of an end-to-end system, but its authors tested and published it with a 22 kHz training dataset. To use it as a TTS component in a real environment, we started to optimize it. Two goals were set at the beginning: first, to reach real-time or faster synthesis, and second, to determine a resource-efficient training environment. This asset presents the training process.
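
The real-time goal can be made concrete with the real-time factor (RTF): synthesis time divided by the duration of the generated audio, where a value of 1.0 or less means real-time or faster. The following is a minimal sketch, not the measurement code used for this asset. The function names and the 22050 Hz default (the sampling rate of 22 kHz datasets such as LJSpeech) are assumptions made for illustration.

```python
def real_time_factor(synthesis_seconds: float, num_samples: int,
                     sample_rate: int = 22050) -> float:
    """Return synthesis time divided by the duration of the generated audio.

    sample_rate defaults to 22050 Hz, an assumed value for a 22 kHz dataset.
    """
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


def is_real_time(rtf: float) -> bool:
    """A system is real-time or faster when its RTF does not exceed 1.0."""
    return rtf <= 1.0


# Hypothetical measurement: 2.0 s of computation for 110250 samples (5 s of audio).
rtf = real_time_factor(2.0, 110250)
print(f"RTF = {rtf:.2f}, real-time: {is_real_time(rtf)}")
```

In practice the synthesis time would be measured around the model's inference call, for example with `time.perf_counter()`, and averaged over several utterances.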

Research areas

Collaborative AI

Technical Categories

Audio processing, Machine learning, Natural language processing

Keywords

Last updated

12.06.2021 - 12:33

Detailed Description

Format: PyTorch

Install & Run: Please ensure that PyTorch is installed.
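
One way to verify the installation before running the model is a short check such as the sketch below. The helper name `pytorch_status` is hypothetical and not part of the asset. It only reports whether the `torch` package is importable, which version is present, and whether a CUDA device is visible.

```python
import importlib.util


def pytorch_status() -> dict:
    """Report whether PyTorch is importable, its version, and CUDA availability."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed in the current environment.
        return {"installed": False, "version": None, "cuda_available": False}
    import torch
    return {
        "installed": True,
        "version": str(torch.__version__),
        "cuda_available": torch.cuda.is_available(),
    }


print(pytorch_status())
```

If `installed` is reported as false, PyTorch should be installed by following the official installation instructions for the target platform.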

Additional information: Three systems were used: a small desktop system, a large desktop system, and a server. The oldest GPU is based on the NVIDIA® Maxwell™ architecture, the second on the Pascal™ architecture, and the server GPU on the Volta™ architecture. The training dataset was LJSpeech 1.1. The test environment was a server configuration with the following main components:

* GPU: NVIDIA Tesla V100 (16 GB)
* CPU: 2.0 GHz Intel® Xeon® Platinum 8167M
* Number of GPUs: 8
* Number of CPU cores: 52
* System memory: 768 GB
* OS: Ubuntu 16.04
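
To compare a local GPU with the three architectures named above, its CUDA compute capability can be mapped to an architecture name. This is a minimal sketch: the function name is hypothetical, and the mapping covers only the generations mentioned in this description (Maxwell 5.x, Pascal 6.x, Volta 7.0 and 7.2), returning "other" for anything else.

```python
def architecture_name(major: int, minor: int) -> str:
    """Map a CUDA compute capability to one of the architectures named above."""
    if major == 5:
        return "Maxwell"
    if major == 6:
        return "Pascal"
    if major == 7 and minor in (0, 2):
        return "Volta"
    # Generations not covered by this description.
    return "other"


# The Tesla V100 has compute capability 7.0.
print(architecture_name(7, 0))
```

With PyTorch installed and a CUDA device present, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair that this function expects.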

References:
https://www.ai4europe.eu/sites/default/files/2021-05/BME-DNN-TTS-M6deliv-all-RA73.pdf

Documents

General documentation

Trustworthy AI

The provided information is trustworthy. The quality depends on user data.

GDPR Requirements

This is a general framework, and GDPR-related data are not included.