A deep learning approaches in text-to-speech system: a systematic review and recent research perspective | Multimedia Tools and Applications Skip to main content
Log in

A deep learning approaches in text-to-speech system: a systematic review and recent research perspective

  • Published:
Multimedia Tools and Applications Aims and scope Submit manuscript

Abstract

Text-to-speech systems (TTS) have come a long way in the last decade and are now a popular research topic for creating various human-computer interaction systems. Although, a range of speech synthesis models for various languages with several motive applications is available based on domain requirements. However, recent developments in speech synthesis have primarily attributed to deep learning-based techniques that have improved a variety of application scenarios, including intelligent speech interaction, chatbots, and conversational artificial intelligence (AI). Text-to-speech systems are discussed in this survey article as an active topic of study that has achieved significant progress in the recent decade, particularly for Indian and non-Indian languages. Furthermore, the study also covers the lifecycle of text-to-speech systems as well as developed platforms in it. We performed an efficient search for published survey articles up to May 2021 in the web of science, PubMed, Scopus, EBSCO(Elton B. Stephens CO (company)) and Google Scholar for Text-to-speech Systems (TTS) in various languages based on different approaches. This survey article offers a study of the contributions made by various researchers in Indian and non-Indian language text-to-speech systems and the techniques used to implement it with associated challenges in designing TTS systems. The work also compared different language text-to-speech systems based on the quality metrics such as recognition rate, accuracy, TTS score, precision, recall, and F1-score. Further, the study summarizes existing ideas and their shortcomings, emphasizing the scope of future research in Indian and non-Indian languages TTS, which may assist beginners in designing robust TTS systems.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price includes VAT (Japan)

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

Data availability

Not applicable.

References

  1. Adam EEB (2020) Deep learning based NLP techniques in text to speech synthesis for communication recognition. J Soft Comput Paradigm (JSCP) 2(04):209–215

  2. Adeeba F, Habib T, Hussain S, Shahid KS (2016) Comparison of Urdu text to speech synthesis using unit selection and HMM based techniques. In: 2016 Conference of The Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA), IEEE, pp 79–83

  3. Ahmad A, Selim MR, Iqbal MZ, Rahman MS (2022) Expressive Speech synthesis by modeling prosody with variational autoencoders for bangla text-to-speech

  4. Alam F, Nath PK, Khan M (2007) Text to speech for Bangla language using festival. BRAC University

  5. Alsharhan E, Ramsay A (2019) Improved Arabic speech recognition system through the automatic generation of fine-grained phonetic transcriptions. Inf Process Manag 56(2):343–353

    Article  Google Scholar 

  6. Amrouche A, Bentrcia Y, Boubakeur KN, Abed A (2022) DNN-based Arabic Speech Synthesis. In: 2022 9th International Conference on Electrical and Electronics Engineering (ICEEE). IEEE, pp 378–382

  7. Anto A, Nisha KK (2016) Text to speech synthesis system for English to Malayalam translation. In: 2016 International Conference on Emerging Technological Trends (ICETT), pp 1–6

  8. Arık SO, Chrzanowski M, Coates A, Diamos G, Gibiansky A, Kang Y, Li X, Miller J, Ng A, Raiman J, Sengupta S (2017) Deep voice: real-time neural text-to-speech. In: International conference on machine learning, pp 195–204

  9. Aryal S, Gutierrez-Osuna R (2016) Data driven articulatory synthesis with deep neural networks. Comput Speech Lang 36:260–273

    Article  Google Scholar 

  10. Bahrampour A, Barkhoda W, Azami BZ (2009) Implementation of three text to speech systems for Kurdish language. In: Iberoamerican congress on pattern recognition. Springer, Berlin, pp 321–328

  11. Barkana BD, Patel A (2020) Analysis of vowel production in Mandarin/Hindi/American-accented English for accent recognition systems. Appl Acoust 162:107203

    Article  Google Scholar 

  12. Bhuyan MP, Sarma SK (2019) A higher-order N-gram model to enhance automatic word prediction for assamese sentences containing ambiguous words. Int J Eng Adv Technol 8(6):2921–2926

  13. Bhuyan MP, Sarma SK, Rahman M (2020) Natural language processing based stochastic model for the correctness of assamese sentences. In: 2020 5th International Conference on Communication and Electronics Systems (ICCES), pp 1179–1182

  14. Birkholz P, Martin L, Xu Y, Scherbaum S, Neuschaefer-Rube C (2017) Manipulation of the prosodic features of vocal tract length, nasality and articulatory precision using articulatory synthesis. Comput Speech Lang 41:116–127

    Article  Google Scholar 

  15. Cataldo E, Leta FR, Lucero J, Nicolato L (2006) Synthesis of voiced sounds using low-dimensional models of the vocal cords and time-varying subglottal pressure. Mech Res Commun 33(2):250–260

    Article  MATH  Google Scholar 

  16. Chan KY, Hall MD (2019) The importance of vowel formant frequencies and proximity in vowel space to the perception of foreign accent. J Phonetics 77:100919

  17. Chauhan A, Chauhan V, Singh SP, Tomar AK, Chauhan H (2011) A text to speech system for hindi using english language. IJCST 2(3)

  18. Chen LW, Rudnicky A (2022) Fine-grained style control in transformer-based text-to-speech synthesis. In: ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp 7907–7911

  19. Chen M, Chen M, Liang S, Ma J, Chen L, Wang S, Xiao J (2019) Cross-lingual, multi-speaker text-to-speech synthesis using neural speaker embedding. In: Interspeech, pp 2105–2109

  20. Dagba TK, Boco C (2014) A text to speech system for phone language using multisyn algorithm. Procedia Comput Sci 35:447–455

  21. Dessai NF, Naik G, Pawar J (2016) Development of Konkani TTS system using concatenative synthesis. In: 2016 International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), pp 344–348

  22. Dhananjaya MS, Krupa BN, Sushma R (2016) Kannada text to speech conversion: a novel approach. In: 2016 International Conference on Electrical, Electronics, Communication, Computer and Optimization Techniques (ICEECCOT), pp 168–172

  23. Dong Y, Zhou T, Dong C-Y, Wang H-L (2010) A two-stage prosodic structure generation strategy for mandarin text-to-speech systems. Acta Automatica Sinica, 36(11):1569–1574

  24. Dootio MA, Wagan AI (2019) Development of Sindhi text corpus. J King Saud Univ Comput Inf Sci

  25. Du C, Guo Y, Chen X, Yu K (2022) VQTTS: high-fidelity text-to-speech synthesis with self-supervised VQ acoustic feature. arXiv preprint arXiv:2204.00768.

  26. Fahmy FK, Abbas HM, Khalil MI (2022) Boosting subjective quality of Arabic text-to-speech (TTS) using end-to-end deep architecture. Int J Speech Technol 25(1):79–88

    Article  Google Scholar 

  27. Gormez Z, Orhan Z (2008) TTTS: Turkish text-to-speech system. In: Proc. 12th WSEAS International Conference on Computers, Heraklion/Crete Island, Greece, pp 977–982

  28. Gupta A, Gaur R, Dhuriya A, Chadha HS, Chhimwal N, Shah P, Raghavan V (2022) Effectiveness of text to speech pseudo labels for forced alignment and cross lingual pretrained models for low resource speech recognition. arXiv preprint arXiv:2203.16823

  29. Gutkin A, Ha L, Jansche M et al (2016) TTS for low resource languages:a bangla synthesizer

  30. Hakan T, Uslu IB, Karamehmet T (2017) Implementation of turkish text-to-speech synthesis on a voice synthesizer card with prosodic features. Anadolu Universitesi Bilim Ve Teknoloji Dergisi A-Uygulamalı Bilimler ve Mühendislik. 18:584–5943

  31. Haq R, Zhang X, Khan W, Feng Z (2022) Urdu named entity recognition system using deep learning approaches. Comput J

  32. Hasnat MA, Chowdhury MR, Khan M (2009) An open source tesseract based optical character recognizer for bangla script. In: 2009 10th international conference on document analysis and recognition, pp 671–675

  33. Hebbi C, Sooraj JS, Mamatha HR (2022) Text to speech conversion of handwritten Kannada Words using various machine learning models. In: Evolution in computational intelligence. Springer, Singapore, pp 21–33

  34. Himmy D, Sharma D (2017) Punjabi text to speech using phoneme concatenation. Int J Adv Res Comput Eng Technol 6(8)

  35. Hossain PS, Chakrabarty A, Kim K, Piran M (2022) Multi-Label Extreme Learning Machine (MLELMs) for Bangla Regional Speech Recognition. Appl Sci 12(11):5463

    Article  Google Scholar 

  36. Htun HM, Zin T, Tun HM (2015) Text to speech conversion using different speech synthesis. Int J Sci Technol Res 4(7):104–108

    Google Scholar 

  37. Ifeanyi N, Ikenna O, Izunna O (2014) Text–To–Speech Synthesis (TTS). Int J Res Inform Technol 2(5):154–163

    Google Scholar 

  38. Ilyes R, Ayed YB (2014) Statistical parametric speech synthesis for Arabic language using ANN. In: 2014 1st International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), pp 452–457

  39. Inoue K, Hara S, Abe M, Hojo N, Ijima Y (2021) Model architectures to extrapolate emotional expressions in DNN-based text-to-speech. Speech Commun 126:35–43

    Article  Google Scholar 

  40. Isewon I, Oyelade OJ, Oladipupo OO (2012) Design and implementation of text to speech conversion for visually impaired people. Int J Appl Inform Syst 7(2):26–30

    Google Scholar 

  41. Jariwala N, Patel B (2018) A system for the conversion of digital Gujarati text-to-speech for visually impaired people. In: Speech and language processing for human-machine communications. Springer, Singapore, pp 67–75

  42. Jia Y, Zhang Y, Weiss R, Wang Q, Shen J, Ren F, Nguyen P, Pang R, Moreno L, Wu Y (2018) Transfer learning from speaker verification to multispeaker text-to-speech synthesis. Advances in neural information processing systems, 31

  43. Karpov A, Krnoul Z, Zelezny M, Ronzhin A (2013) Multimodal synthesizer for Russian and Czech sign languages and audio-visual speech. In: International conference on universal access in human-computer interaction. Springer, Berlin, pp 520–529

  44. Kaur N, Singh P (2022) Speech waveform reconstruction from speech parameters for an effective text to speech synthesis system using minimum phase harmonic sinusoidal model for Punjabi. Multimed Tools Appl:1–20

  45. Kayte S, Gawali B (2015) A text-to-speech synthesis for Marathi language using festival and Festvox. Int J Comput Appl 975:35–41

    Google Scholar 

  46. Koshi B, Bajrami X, Hamiti M (2016) Alternative creation of text to speech technology for the Albanian language. IFAC-PapersOnLine 49(29):259–262

    Article  Google Scholar 

  47. Krnoul Z, Kanis J, Zelezny M, Muller L (2007) Czech text-to-sign speech synthesizer. In: International workshop on machine learning for multimodal interaction. Springer, Berlin, pp 180–191

  48. Kumar Y, Singh N (2017) An automatic speech recognition system for spontaneous Punjabi speech corpus. Int J Speech Technol 20:297–303

    Article  Google Scholar 

  49. Kumar B, Sarungbam JK, Choudhary A (2014) Script identification and language detection of 12 Indian languages using DWT and template matching of frequently occurring character (s). In: 2014 5th international conference-confluence the next generation information technology summit (confluence), pp 669–674

  50. Kumar Y, Singh N, Kumar M et al (2021) AutoSSR: an efficient approach for automatic spontaneous speech recognition model for the Punjabi Language. Soft Comput 25:1617–1630

    Article  Google Scholar 

  51. Kumar Y, Kaur K, Kaur S (2021) Study of automatic text summarization approaches in different languages. Artif Intell Rev 54:1–33

    Article  Google Scholar 

  52. Kumar Y, Koul A, Mahajan S (2022) A deep learning approaches and fastai text classification to predict 25 medical diseases from medical speech utterances, transcription and intent. Soft computing, pp 1–20

  53. Kumari L, Sharma A (2021) A review of deep learning techniques in document image word spotting. Arch Computat Methods Eng

  54. Li R, Wu Z, Liu X, Meng H, Cai L (2017) Multi-task learning of structured output layer bidirectional LSTMs for speech synthesis. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 5510–5514

  55. Li X, Ma D, Yin B (2021) Advance research in agricultural text-to-speech: the word segmentation of analytic language and the deep learning-based end-to-end system. Comput Electron Agric 180:105908

    Article  Google Scholar 

  56. Li X, Liang C, Ma S, Liu C, Chen S, Li R, He H (2022) A new type of Chinese speech synthesis technology and system research. In: International Conference on Electronic Information Engineering, Big Data, and Computer Technology (EIBDCT 2022), vol 12256. SPIE, pp 667–672

  57. Li J, Meng Y, Li C, Wu Z, Meng H, Weng C, Su D (2022) Enhancing speaking styles in conversational text-to-speech synthesis with graph-based multi-modal context modeling. In: ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp 7917–7921

  58. Mache S, Mahender C (2016) Development of text-to-speech synthesizer for Pali language. J Comput Eng 18(3):35–42

  59. Mache SR, Baheti MR, Mahender CN (2015) Review on text-to-speech synthesizer. Int J Adv Res Comput Commun Eng 4(8):54–59

    Google Scholar 

  60. Malloy ML, Nowak RD (2014) Near-optimal adaptive compressed sensing. IEEE Trans Inf Theory 60(7):4001–4012

    Article  MathSciNet  MATH  Google Scholar 

  61. Matousek J, Tihelka D, Romportl J (2006) Current state of Czech text-to-speech system ARTIC. In: International conference on text, speech and dialogue. Springer, Berlin,  pp 439–446

  62. Mitsui K, Zhao T, Sawada K, Hono Y, Nankaku Y, Tokuda K (2022) End-to-end text-to-speech based on latent representation of speaking styles using spontaneous dialogue. arXiv preprint arXiv:2206.12040.

  63. Narendra NP, Rao KS, Ghosh K, Vempada RR, Maity S (2011) Development of syllable-based text to speech synthesis system in Bengali. Int J Speech Technol 14(3):167–181

    Article  Google Scholar 

  64. Ngo T, Akagi M, Birkholz P (2020) Effect of articulatory and acoustic features on the intelligibility of speech in noise: an articulatory synthesis study. Speech Commun 117:13–20

    Article  Google Scholar 

  65. Ni J, Wang L, Gao H, Qian K, Zhang Y, Chang S, Hasegawa-Johnson M (2022) Unsupervised text-to-speech synthesis by unsupervised automatic speech recognition. arXiv preprint arXiv:2203.15796

  66. Ning Y, He S, Wu Z, Xing C, Zhang LJ (2019) A review of deep learning based speech synthesis. Appl Sci 9(19):4050

    Article  Google Scholar 

  67. Nongmeikapam K, RK VR, Singh OI, Bandyopadhyay S (2012) Automatic segmentation of manipuri (Meiteilon) word into syllabic units. arXiv preprint arXiv:1207.3932

  68. Oord AVD, Kalchbrenner N, Vinyals O et al (2016) Conditional image generation with pixelcnn decoders. In: Proceedings of the annual conference on neural information processing systems, Barcelona, Spain, 5–10 December 2016; pp 4790–4798

  69. Oord A, Li Y, Babuschkin I, Simonyan K, Vinyals O et al (2018)Parallel wavenet: fast high-fidelity speech synthesis. In: International conference on machine learning, pp 3918–3926

  70. Panda SP, Nayak AK (2016) A pronunciation rule-based speech synthesis technique for Odia numerals. In: Computational intelligence in data mining, vol 1. Springer, New Delhi, pp 483–491

  71. Panda SP, Nayak AK (2018) A Context-based Numeral Reading Technique for Text to Speech Systems. Int J Electr Comput Eng 8(6):2088–8708

    Google Scholar 

  72. Panda SP, Nayak AK, Rai SC (2020) A survey on speech synthesis techniques in Indian languages. Multimedia Syst 26:453–478

    Article  Google Scholar 

  73. Pellicani AD, Fontes AR, Santos FF, Pellicani AD, Aguiar-Ricz LN (2018) Fundamental frequency and formants before and after prolonged voice use in teachers. J Voice 32(2):177–184

    Article  Google Scholar 

  74. Prafianto H, Nose T, Chiba Y, Ito A (2019) Improving human scoring of prosody using parametric speech synthesis. Speech Commun 111:14–21

    Article  Google Scholar 

  75. Pribilova A, Pribil J (2006) Non-linear frequency scale mapping for voice conversion in text-to-speech system with cepstral description. Speech Commun 48(12):1691–1703

    Article  Google Scholar 

  76. Rahman M, Sarma P, Bhuyan MP, Das A et al (2019) Image to speech synthesizer with reference to Assamese numerals. Int J Innov Technol Explor Eng 9(1):900–905

    Article  Google Scholar 

  77. Raj AA, Sarkar T, Pammi SC, Yuvaraj S, Bansal M, Prahallad K, Black AW (2007) Text processing for text-to-speech systems in Indian languages. In: Ssw, pp 188–193

  78. Rajendran V, Kumar GB (2015) Text processing for developing unrestricted Tamil text to speech synthesis system. Indian J Sci Technol 8(29):112–124

    Article  Google Scholar 

  79. Ramli I, Jamil N, Seman N, Ardi N (2015) An improved syllabification for a better Malay language text-to-speech synthesis (TTS). Procedia Comput Sci 76:417–424

    Article  Google Scholar 

  80. Ramsay A, Mansour H (2008) Towards including prosody in a text-to-speech system for modern standard Arabic. Comput Speech Lang 22(1):84–103

    Article  Google Scholar 

  81. Rashid M, Singh H (2019) Text to speech conversion in Punjabi language using nourish forwarding algorithm. Int J Inf Technol: 1–10

  82. Rebai I, BenAyed Y (2015) Text-to-speech synthesis system with Arabic diacritic recognition system. Comput Speech Lang 34(1):43–60

    Article  Google Scholar 

  83. Reddy MV, Hanumanthappa M (2015) Phoneme-to-speech dictionary for Indian languages. In: 2015 International Conference on Soft-Computing and Networks Security (ICSNS), pp 1–4

  84. Rojc M, Kacic Z (2007) Time and space-efficient architecture for a corpus-based text-to-speech synthesis system. Speech Commun 49(3):230–249

    Article  Google Scholar 

  85. Romportl J, Kala J (2007) Prosody modelling in Czech text-to-speech synthesis

  86. Sak H, Gungor T, Safkan Y (2006) A corpus-based concatenative speech synthesis system for Turkish. Turkish J Electr Eng Comput Sci 14(2):209–223

    Google Scholar 

  87. Samuel Manoharan J (2022) A novel text-to-speech synthesis system using syllable-bBased HMM for Tamil language. In: Proceedings of second international conference on sustainable expert systems. Springer, Singapore, pp 305–314

  88. Sharma B, Adiga N, Prasanna SM (2015) Development of Assamese text-to-speech synthesis system. In: TENCON 2015–2015 IEEE Region 10 Conference, pp 1–6

  89. Sharma P, Abrol V, Sao AK (2018) Reducing footprint of unit selection based text-to-speech system using compressed sensing and sparse representation. Comput Speech Lang 52:191–208

    Article  Google Scholar 

  90. Shen J, Pang R, Weiss RJ, Schuster M, Jaitly N, Yang Z, Chen Z, Zhang Y, Wang Y, Skerrv-Ryan R, Saurous RA (2018) Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 4779–4783

  91. Shetake PS, Patil A, Jadhav P (2014) Review of text to speech conversion methods. Int J Industrial Electron Electr Eng 2(8):29–35

    Google Scholar 

  92. Shivakumar KM, Aravind KG, Anoop TV, Gupta D (2016) Kannada speech to text conversion using CMU Sphinx. In: 2016 International Conference on Inventive Computation Technologies (ICICT), vol 3. IEEE, pp 1–6

  93. Singh P, Lehal GS (2006) Text-to-speech synthesis system for Punjabi language. In: Proceedings of international conference on multidisciplinary information sciences and technologies, Merida, Spain

  94. Smit P, Virpioja S, Kurimo M (2021) Advances in subword-based HMM-DNN speech recognition across languages. Comput Speech Lang 66:101158

    Article  Google Scholar 

  95. Soman A, Kumar SS, Hemanth VK, Manikandan MS, Soman KP (2011) Corpus driven malayalam text-to-speech synthesis for interactive voice response system. Int J Comput Appl 29(4):0975–8887

    Google Scholar 

  96. Sultana T, Abbasi AR, Usmani BA, Khan S, Ahmed W, Qaseem N, Sidra (2016) Towards development of real-time handwritten urdu character to speech conversion system for visually impaired. Int J Adv Comput Sci Appl 7(12)

  97. Sun J, Wang S, Dong Y (2013) Sparse block circulant matrices for compressed sensing. IET Commun 7(13):1412–1418

    Article  Google Scholar 

  98. Sunil ME, Vinay S (2022) Kannada sentiment analysis using vectorization and machine learning. In: Sentimental analysis and deep learning. Springer, Singapore, pp 677–689

  99. Suzuki M, Kuroiwa R, Innami K, Kobayashi et al (2017) Accent sandhi estimation of Tokyo dialect of Japanese using conditional random fields. IEICE Trans Inf Syst 100(4):655–661

    Article  Google Scholar 

  100. Tachibana H, Uenoyama K, Aihara S (2018) Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 4784–4788

  101. Takamichi S, Nakata W, Tanji N, Saruwatari H (2022) J-MAC: Japanese multi-speaker audiobook corpus for speech synthesis. arXiv preprint arXiv:2201.10896

  102. Tan X, Chen J, Liu H, Cong J, Zhang C, Liu Y, … Liu TY (2022) NaturalSpeech:end-to-end text to speech synthesis with human-level quality. arXiv preprint arXiv:2205.04421

  103. Thu CST, Zin T (2014) Implementation of text to speech conversion. Int J Eng Res Technol 3(3):911–915

    Google Scholar 

  104. Tokuda K, Nankaku Y, Toda T, Zen H, Yamagishi J, Oura K (2013) Speech synthesis based on hidden Markov models. Proc IEEE 101(5):1234–1252

    Article  Google Scholar 

  105. Toth B, Nemeth G (2008) Hidden-Markov-Model based speech synthesis in Hungarian. J Info-Commun 7:30–34

    Google Scholar 

  106. Tran DC (2020) The first vietnamese fosd-tacotron-2-based text-to-speech model dataset. Data Br 31:105775

    Article  Google Scholar 

  107. Uliniansyah MT, Nurfadhilah E, Aini LR, Junde J, Ayuningtyas F, Santosa A (2016) A tool to solve sentence segmentation problem on preparing speech database for Indonesian text-to-speech system. Procedia Comput Sci 81:188–193

  108. Van Der Lee C, Gatt A, Van Miltenburg E, Krahmer E (2021) Human evaluation of automatically generated text: current trends and best practice guidelines. Comput Speech Lang 67:101151

    Article  Google Scholar 

  109. Varghese JM, Hande S (2015) Design of Gujarati text-to-speech system. Int J Res 2(5):1017–1019

  110. Veisi H, Hosseini H, MohammadAmini M, Fathy W, Mahmudi A (2022) Jira: a Central Kurdish speech recognition system, designing and building speech corpus and pronunciation lexicon. Lang Resour Eval: 1–25

  111. Venkateswarlu S, Kamesh DBK, Sastry JKR, Rani R (2016) Text to speech conversion. Indian J Sci Technol 9(38):1–3

    Article  Google Scholar 

  112. Vijayarani S, Sakila A (2015) Template matching technique for searching words in document images. Int J Cybern Inform (IJCI) 4(6):25–35

    Google Scholar 

  113. Wang W, Xu S, Xu B (2016) First step towards end-to-end parametric TTS synthesis: generating spectral parameters with neural attention. In: Interspeech, pp 2243–2247

  114. Weiss RJ, Skerry-Ryan RJ, Battenberg E, Mariooryad S, Kingma DP (2021) Wave-tacotron: spectrogram-free end-to-end text-to-speech synthesis. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp 5679–5683

  115. Ye Z, Zhao Z, Ren Y, Wu F (2022) SyntaSpeech: syntax-aware generative adversarial text-to-speech. arXiv preprint arXiv:2204.11792

  116. Yilmaz E, Ganzeboom MS, Beijer LJ et al (2016) A Dutch dysarthric speech database for individualized speech therapy research, pp 792–795

  117. Zelasko P, Ziolko B, Jadczyk T, Skurzok D (2016) AGH corpus of Polish speech. Lang Resour Eval 50(3):585–601

    Article  Google Scholar 

  118. Zen H, Tokuda K, Black AW (2009) Statistical parametric speech synthesis. Speech Commun 51(11):1039–1064

    Article  Google Scholar 

  119. Zhang C, Zhang S, Zhong H (2019) A prosodic mandarin text-to-speech system based on tacotron. In: 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp 165–169

  120. Zhou Y, Song C, Li X, Zhang L, Wu Z, Bian Y, … Meng H (2022) Content-dependent fine-grained speaker embedding for zero-shot speaker adaptation in text-to-speech synthesis. arXiv preprint arXiv:2204.00990

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yogesh Kumar.

Ethics declarations

Ethical approval

All procedures used in research involving human subjects have complied with the institutional and national research committee’s ethical requirements and the 1964 Helsinki Declaration and its subsequent revisions or comparable ethical standards.

Human and animal rights

This survey article does not include any animal experiments conducted by any of the authors.

Informed consent

All study participants were given the opportunity to give their informed consent.

Conflict of interest

There are no conflicts of interest declared by the authors.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Kumar, Y., Koul, A. & Singh, C. A deep learning approaches in text-to-speech system: a systematic review and recent research perspective. Multimed Tools Appl 82, 15171–15197 (2023). https://doi.org/10.1007/s11042-022-13943-4

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11042-022-13943-4

Keywords

Navigation