ISCA Archive - The Xiaomi-ASLP Text-to-speech System for Blizzard Challenge 2023
ISCA Archive Blizzard 2023

The Xiaomi-ASLP Text-to-speech System for Blizzard Challenge 2023

Yuepeng Jiang, Kun Song, Fengyu Yang, Lei Xie, Meng Meng, Yu Ji, Yujun Wang

This paper describes the Xiaomi-ASLP text-to-speech (TTS) system for the hub task 2023-FH1 of Blizzard Challenge 2023. The goal of the hub task is to build a single-speaker French TTS system trained on (but not limited to) the single-speaker French audiobook corpus released by the Blizzard Challenge 2023 organizers. We present a fully end-to-end TTS system based on VITS. In our implementation, we replace the duration alignment module with a length regulator and a duration predictor. Additionally, we introduce a style adaptor to model the style and prosody of the generated speech. The style adaptor consists of a fine-grained prosody module and a global style module based on a language model. To further enhance the audio quality of the synthesized output, we leverage the super-resolution capability of the vocoder to upsample the 16 kHz synthesized waveform to 48 kHz.
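As a rough illustration of the length regulator mentioned in the abstract, the sketch below expands phoneme-level hidden vectors to frame level by repeating each vector according to its predicted duration. This is the generic FastSpeech-style formulation, not the authors' implementation; the function name `length_regulate`, the array shapes, and the example values are assumptions made for illustration only.

```python
import numpy as np


def length_regulate(hidden, durations):
    """Expand phoneme-level vectors to frame level (hypothetical helper).

    hidden:    array of shape (num_phonemes, dim)
    durations: non-negative integer frame counts, one per phoneme
    Returns an array of shape (sum(durations), dim).
    """
    hidden = np.asarray(hidden)
    durations = np.asarray(durations, dtype=np.int64)
    if hidden.shape[0] != durations.shape[0]:
        raise ValueError("need exactly one duration per phoneme")
    if (durations < 0).any():
        raise ValueError("durations must be non-negative")
    # Repeat row i of `hidden` durations[i] times along the time axis.
    return np.repeat(hidden, durations, axis=0)


# Illustrative values only: three phonemes with 2-dimensional hidden states.
phoneme_states = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
predicted_durations = [2, 0, 3]  # assumed output of a duration predictor
frames = length_regulate(phoneme_states, predicted_durations)
```

Here the second phoneme receives zero frames and is dropped, so `frames` has five rows. In a full system the durations would come from the trained duration predictor at inference time rather than being fixed by hand.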