Tacotron2 is one of the recent neural network-based text-to-speech (TTS) methods, and it is quite flexible as an adaptive TTS system. It is the first component of an end-to-end system, but its authors tested and published it with a 22 kHz training dataset. To use it as a TTS component in a real environment, we set out to optimize it. Two goals were set at the beginning: first, to reach a system that runs in real time or faster, and second, to determine a resource-efficient training environment. This asset presents the training process.
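One common way to quantify the "real time or faster" goal is the real-time factor (RTF): the time spent synthesizing divided by the duration of the audio produced, where a value of 1.0 or lower means real time or faster. The sketch below is illustrative only. The function name `real_time_factor` is hypothetical, and the default sample rate of 22050 Hz is an assumption based on the 22 kHz dataset mentioned above.

```python
def real_time_factor(synthesis_seconds: float,
                     num_samples: int,
                     sample_rate: int = 22050) -> float:
    """Return synthesis time divided by the duration of the generated audio.

    sample_rate defaults to 22050 Hz, an assumed value for a 22 kHz dataset.
    A result <= 1.0 means the audio was produced in real time or faster.
    """
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds
```

For example, producing 44100 samples (2 seconds of audio at 22050 Hz) in 0.5 seconds gives an RTF of 0.25, that is, four times faster than real time.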
Install & Run: Please ensure that PyTorch is installed.
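A quick way to confirm the installation is to check whether PyTorch can be imported and which GPUs it sees. This is a minimal sketch, not part of the asset itself. The helper name `describe_pytorch_environment` is hypothetical, and it uses only standard `torch.cuda` queries.

```python
import importlib.util


def describe_pytorch_environment() -> dict:
    """Report whether PyTorch is installed and which CUDA GPUs are visible."""
    info = {
        "torch_installed": False,
        "torch_version": None,
        "cuda_available": False,
        "gpus": [],
    }
    # Return early without importing if the torch package cannot be found.
    if importlib.util.find_spec("torch") is None:
        return info

    import torch

    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        # List the name of every GPU visible to this process.
        for index in range(torch.cuda.device_count()):
            info["gpus"].append(torch.cuda.get_device_name(index))
    return info
```

Calling `print(describe_pytorch_environment())` before starting a training run shows at a glance whether the GPUs are available to PyTorch.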
Additional information: Three systems were used: a small desktop, a large desktop, and a server. The oldest GPU is based on the NVIDIA® Maxwell™ architecture, the second on the Pascal™ architecture, and the GPU in the server machine on the Volta™ architecture. The training dataset was LJSpeech 1.1. The test environment was a server configuration with the following main components:
* GPU: NVIDIA Tesla V100 (16 GB)
* CPU: 2.0 GHz Intel® Xeon® Platinum 8167M
* Number of GPUs: 8
* Number of CPU cores: 52
* System memory: 768 GB
* OS: Ubuntu 16.04