VITS: Quality Vs. Speed Analysis

  • Conference paper
Text, Speech, and Dialogue (TSD 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14102)

Abstract

In this paper, we analyze the performance of a modern end-to-end speech synthesis model, Variational Inference with adversarial learning for end-to-end Text-to-Speech (VITS). We build on the original VITS model and examine how different modifications to its architecture affect synthetic speech quality and computational complexity. Experiments were carried out with two Czech voices, one male and one female. To assess the quality of speech synthesized by the modified models, MUSHRA listening tests were performed. Computational complexity was measured as synthesis speed relative to real time. While the original VITS model is still preferred in terms of speech quality, we present a modification of the original architecture that responds significantly faster while still providing acceptable output quality. Such a configuration can be used when system response latency is critical.
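
For context, "synthesis speed relative to real time" is typically derived from the real-time factor (RTF): the ratio of synthesis time to the duration of the generated audio. The sketch below is a minimal illustration of such a measurement, not the authors' code; the `synthesize` callable and its return format are assumptions made for the example.

```python
import time

def real_time_factor(synthesize, text, sample_rate):
    """Measure RTF = synthesis time / audio duration.

    RTF < 1 means the model synthesizes faster than real time;
    "speed over real time" is the reciprocal, 1 / RTF.
    `synthesize` is a hypothetical callable that maps text to a
    1-D sequence of audio samples at `sample_rate` Hz.
    """
    start = time.perf_counter()
    samples = synthesize(text)             # generate the waveform
    elapsed = time.perf_counter() - start  # wall-clock synthesis time
    audio_seconds = len(samples) / sample_rate
    return elapsed / audio_seconds
```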

This research was supported by the Technology Agency of the Czech Republic (TA CR), project No. TL05000546.


Notes

  1. https://github.com/coqui-ai/TTS.
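
The footnoted Coqui TTS toolkit exposes a Python API through which a pretrained VITS model can be run end to end. The snippet below is an illustrative sketch using a publicly available English VITS checkpoint; the Czech voices studied in the paper are not part of the public toolkit, so the model name here is a stand-in.

```python
# Illustrative use of the Coqui TTS toolkit referenced above.
# The English LJSpeech VITS checkpoint stands in for the paper's
# Czech voices, which are not publicly distributed.
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/vits")
tts.tts_to_file(
    text="An end-to-end VITS model synthesizes speech directly from text.",
    file_path="vits_sample.wav",
)
```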


Acknowledgements

Computational resources were provided by the e-INFRA CZ project (ID:90140), supported by the Ministry of Education, Youth and Sports of the Czech Republic.

Author information

Corresponding author

Correspondence to Jindřich Matoušek.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Matoušek, J., Tihelka, D. (2023). VITS: Quality Vs. Speed Analysis. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science, vol 14102. Springer, Cham. https://doi.org/10.1007/978-3-031-40498-6_19

  • DOI: https://doi.org/10.1007/978-3-031-40498-6_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40497-9

  • Online ISBN: 978-3-031-40498-6

  • eBook Packages: Computer Science, Computer Science (R0)
