Abstract
Diffusion model-based vocoders have exhibited outstanding performance in speech synthesis. However, owing to the curved nature of their generation paths, they must traverse many sampling steps to guarantee speech quality, which hinders their applicability in real-world scenarios. In this paper, we propose SWave (code and speech samples are available at https://swave-demo.github.io/), a novel vocoder based on rectified flow that improves the efficiency of speech synthesis by Straightening the Waveform generation path. Specifically, we employ rectification to transform the noise distribution into the data distribution along a probability flow that is as straight as possible. We then use distillation and fine-tuning to further enhance generation efficiency and quality, respectively. Experiments on the LJSpeech dataset demonstrate that, compared with other vocoders such as FastDiff and WaveGrad, SWave improves generation efficiency. In particular, with a straightforward sampling schedule, SWave generates speech comparable to WaveGrad with significantly fewer steps (2 steps vs. 25 steps).
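To make the rectified-flow idea concrete, below is a minimal PyTorch sketch of the two core operations the abstract describes: training a velocity field on straight-line interpolations between noise and data, and few-step Euler sampling along the (nearly) straight learned path. The network `velocity_net`, its mel-spectrogram conditioning, and all shapes and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch

def rectified_flow_loss(velocity_net, x1, mel):
    """One rectified-flow training step (in the style of Liu et al., 2022).

    velocity_net: hypothetical module predicting v(x_t, t | mel)
    x1:  real waveforms, shape (batch, samples)
    mel: conditioning mel spectrograms
    """
    x0 = torch.randn_like(x1)                        # noise endpoint of the path
    t = torch.rand(x1.size(0), 1, device=x1.device)  # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1                       # linear interpolation between endpoints
    target = x1 - x0                                 # constant velocity of the straight path
    pred = velocity_net(xt, t, mel)
    return torch.mean((pred - target) ** 2)          # regress velocity onto the straight direction

@torch.no_grad()
def sample(velocity_net, mel, num_steps=2, num_samples=22050):
    """Few-step Euler integration of the learned ODE dx/dt = v(x, t)."""
    x = torch.randn(mel.size(0), num_samples, device=mel.device)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((mel.size(0), 1), i * dt, device=mel.device)
        x = x + dt * velocity_net(x, t, mel)         # straight paths keep coarse Euler steps accurate
    return x
```

The point of rectification is that the straighter the learned trajectories, the smaller the discretization error per Euler step, which is what makes the 2-step schedule quoted above plausible; a subsequent distillation pass, as the paper describes, can reduce the number of network evaluations further.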
References
Chen, N., Zhang, Y., Zen, H., Weiss, R.J., Norouzi, M., Chan, W.: WaveGrad: estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713 (2020)
Chen, Z., et al.: InferGrad: improving diffusion models for vocoder by considering inference in training. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8432–8436. IEEE (2022)
Guan, W., Su, Q., Zhou, H., Miao, S., Xie, X., Li, L., Hong, Q.: ReFlow-TTS: a rectified flow model for high-fidelity text-to-speech. arXiv preprint arXiv:2309.17056 (2023)
Guo, Y., Du, C., Ma, Z., Chen, X., Yu, K.: VoiceFlow: efficient text-to-speech with rectified flow matching. arXiv preprint arXiv:2309.05027 (2023)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020)
Huang, R., Lam, M.W., Wang, J., Su, D., Yu, D., Ren, Y., Zhao, Z.: FastDiff: a fast conditional diffusion model for high-quality speech synthesis. arXiv preprint arXiv:2204.09934 (2022)
Ito, K., Johnson, L.: The LJ Speech Dataset (2017)
Kalchbrenner, N., et al.: Efficient neural audio synthesis. In: International Conference on Machine Learning, pp. 2410–2419. PMLR (2018)
Kingma, D.P., Dhariwal, P.: Glow: generative flow with invertible 1x1 convolutions. Adv. Neural Inf. Process. Syst. 31 (2018)
Kong, J., Kim, J., Bae, J.: HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis. Adv. Neural Inf. Process. Syst. 33, 17022–17033 (2020)
Kong, Z., Ping, W., Huang, J., Zhao, K., Catanzaro, B.: DiffWave: a versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761 (2020)
Kumar, K., et al.: MelGAN: generative adversarial networks for conditional waveform synthesis. Adv. Neural Inf. Process. Syst. 32 (2019)
Liu, X., Gong, C., Liu, Q.: Flow straight and fast: learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)
Liu, X., Zhang, X., Ma, J., Peng, J., Liu, Q.: InstaFlow: one step is enough for high-quality diffusion-based text-to-image generation. arXiv preprint arXiv:2309.06380 (2023)
Mehri, S., Kumar, K., Gulrajani, I., Kumar, R., Jain, S., Sotelo, J., Courville, A., Bengio, Y.: SampleRNN: an unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837 (2016)
Oord, A.v.d., et al.: WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)
Peng, K., Ping, W., Song, Z., Zhao, K.: Non-autoregressive neural text-to-speech. In: International Conference on Machine Learning, pp. 7586–7598. PMLR (2020)
Popov, V., Vovk, I., Gogoryan, V., Sadekova, T., Kudinov, M.: Grad-TTS: a diffusion probabilistic model for text-to-speech. In: International Conference on Machine Learning, pp. 8599–8608. PMLR (2021)
Prenger, R., Valle, R., Catanzaro, B.: WaveGlow: a flow-based generative network for speech synthesis. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617–3621. IEEE (2019)
Rezende, D., Mohamed, S.: Variational inference with normalizing flows. In: International Conference on Machine Learning, pp. 1530–1538. PMLR (2015)
Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models. arXiv preprint arXiv:2303.01469 (2023)
Yamamoto, R., Song, E., Kim, J.M.: Parallel WaveGAN: a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199–6203. IEEE (2020)
Acknowledgment
This work was sponsored by the National Key Research and Development Program of China (Nos. 2023ZD0121402, 2020YFB1708700) and National Natural Science Foundation of China (NSFC) grants (Nos. 62106143, 61922055, 61872233, 62272293).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, P., Zhou, J., Tian, X., Lin, Z. (2025). SWave: Improving Vocoder Efficiency by Straightening the Waveform Generation Path. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15306. Springer, Cham. https://doi.org/10.1007/978-3-031-78172-8_26
DOI: https://doi.org/10.1007/978-3-031-78172-8_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78171-1
Online ISBN: 978-3-031-78172-8
eBook Packages: Computer Science, Computer Science (R0)