
The NII speech synthesis entry for Blizzard Challenge 2016

Lauri Juvela, Xin Wang, Shinji Takaki, SangJin Kim, Manu Airaksinen, Junichi Yamagishi

This paper describes the NII speech synthesis entry for the Blizzard Challenge 2016, where the task was to build a voice from audiobook data. The synthesis system is built using the NII parametric speech synthesis framework, which uses a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) for acoustic modeling. For this entry, we first built a voice using a large data set and then used the audiobook data to adapt the acoustic model to the target speaker. Additionally, the recent full-band glottal vocoder GlottDNN was used in the system, with a DNN-based excitation model for generating glottal waveforms. The vocoder estimates the vocal tract in a band-wise manner, using Quasi Closed Phase (QCP) inverse filtering in the low band. At the synthesis stage, the excitation model is used to generate the voiced excitation from acoustic features, after which the vocal tract filter is applied to produce synthetic speech. The Blizzard Challenge listening test results show that the proposed system achieves quality comparable to the benchmark parametric synthesis systems.
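
To illustrate the synthesis-stage signal flow described above (acoustic features driving a DNN excitation model, followed by vocal tract filtering), here is a minimal sketch in Python. It is not the GlottDNN implementation; the interface names (excitation_dnn, vt_filters) and the frame-wise all-pole filtering are illustrative assumptions about a generic source-filter synthesizer.

    # Hypothetical sketch of frame-wise source-filter synthesis:
    # acoustic features -> DNN-generated glottal excitation -> vocal tract filter.
    import numpy as np
    from scipy.signal import lfilter

    def synthesize(acoustic_feats, excitation_dnn, vt_filters, frame_len=80):
        """Frame-wise parametric synthesis (illustrative only).

        acoustic_feats : (n_frames, feat_dim) array of acoustic features
                         (e.g. F0 and spectral/vocal-tract parameters).
        excitation_dnn : callable mapping one feature vector to a glottal
                         excitation segment (assumed interface).
        vt_filters     : (n_frames, order+1) all-pole vocal tract filter
                         coefficients A(z) per frame.
        """
        out = []
        for feats, a in zip(acoustic_feats, vt_filters):
            # 1) Generate the voiced excitation segment from acoustic features.
            excitation = excitation_dnn(feats)          # shape: (frame_len,)
            # 2) Shape the excitation with the all-pole vocal tract filter 1/A(z).
            frame = lfilter([1.0], a, excitation)
            out.append(frame[:frame_len])
        return np.concatenate(out)

In the actual system, overlap-add, band-wise vocal tract estimation, and unvoiced excitation handling would also be involved; the sketch only captures the basic excitation-then-filter ordering stated in the abstract.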