Abstract
In this paper, we analyze the performance of a modern end-to-end speech synthesis model called Variational Inference with adversarial learning for end-to-end Text-to-Speech (VITS). We build on the original VITS model and examine how different modifications to its architecture affect synthetic speech quality and computational complexity. Experiments were carried out with two Czech voices, one male and one female. The quality of speech synthesized by the modified models was assessed with MUSHRA listening tests, and computational complexity was measured as synthesis speed relative to real time. While the original VITS model is still preferred in terms of speech quality, we present a modification of the original structure that responds significantly faster while still providing acceptable output quality. Such a configuration can be used when system response latency is critical.
This research was supported by the Technology Agency of the Czech Republic (TA CR), project No. TL05000546.
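The complexity figure reported in the abstract is synthesis speed relative to real time, i.e. how many seconds of audio the model produces per second of wall-clock time. A minimal sketch of how such a number can be obtained is given below; the `model.infer` call and the 22.05 kHz sampling rate are illustrative assumptions, not the authors' actual code.

```python
import time

import torch


def synthesis_speed_over_real_time(model, text_inputs, sample_rate=22050):
    """Return how many times faster than real time the model synthesizes.

    A value above 1.0 means the audio is generated faster than it would
    take to play it back; higher is better for low-latency systems.
    """
    total_audio_s = 0.0  # duration of generated speech
    total_wall_s = 0.0   # wall-clock time spent synthesizing
    with torch.no_grad():
        for inputs in text_inputs:
            start = time.perf_counter()
            waveform = model.infer(inputs)  # hypothetical inference call
            total_wall_s += time.perf_counter() - start
            total_audio_s += waveform.shape[-1] / sample_rate
    return total_audio_s / total_wall_s
```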
Acknowledgements
Computational resources were provided by the e-INFRA CZ project (ID:90140), supported by the Ministry of Education, Youth and Sports of the Czech Republic.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Matoušek, J., Tihelka, D. (2023). VITS: Quality Vs. Speed Analysis. In: Ekštein, K., Pártl, F., Konopík, M. (eds.) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science, vol. 14102. Springer, Cham. https://doi.org/10.1007/978-3-031-40498-6_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40497-9
Online ISBN: 978-3-031-40498-6