
Automatic Speech Recognition System for Tonal Languages: State-of-the-Art Survey

  • Original Paper
  • Published:
Archives of Computational Methods in Engineering

Abstract

Natural language based human–machine interaction is a well-traversed yet challenging research domain. The main objective is to obtain systems that can communicate with humans in a well-organized manner, regardless of the operational environment. In this paper, a systematic survey of Automatic Speech Recognition (ASR) for tonal languages spoken around the globe is carried out. Tonal languages of the Asian, Indo-European and African groups are reviewed, whereas tonal languages of the Americas and Australasia are not. The core of the paper presents the work done in previous years on ASR for Asian tonal languages such as Chinese, Thai, Vietnamese, Mandarin, Mizo and Bodo, Indo-European tonal languages such as Punjabi, Lithuanian, Swedish and Croatian, and African tonal languages such as Yoruba and Hausa. Finally, a synthesis analysis based on the findings is presented, and the issues and challenges associated with tonal languages are discussed. It is observed that a substantial amount of work has been done for Asian tonal languages (Chinese, Thai, Vietnamese, Mandarin), whereas comparatively little has been reported for Mizo and Bodo, for Indo-European tonal languages such as Punjabi, Latvian and Lithuanian, and for African tonal languages such as Hausa and Yoruba.



Author information

Corresponding author

Correspondence to Amitoj Singh.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Quality Assessment Form

1.1 Screening Question

(The screening question checklist is provided as a figure in the original article.)

Appendix 2: Data Items Extracted from All Papers

The following data items were extracted from each primary study:

  • Token initialization: unique identifier for the study
  • Data source: author, title and year
  • Category of article: journal, conference or workshop paper
  • Study objectives: the research focus, i.e., the research areas on which the paper concentrates
  • Design of study: category of the study (feature extraction, classification, SR, comparative analysis, etc.)
  • SR technique for the tonal language: how spoken words are transcribed by the computer into readable text
  • Method of comparison: important parameter values for SR, i.e., f0, tone, energy, gain, amplitude, power, frequency response and voiced/unvoiced features (an illustrative F0-estimation sketch follows this list)
  • Data analysis phase: data analysis with respect to the presented source, error rate and recognition accuracy
  • Tool and its usage: the SR tool used, its developer and how it was applied
  • Result of study: results or conclusions drawn from the primary study
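
The "Method of comparison" item above names the acoustic parameters (f0, tone, energy, voicing) on which the surveyed systems are compared. As a purely illustrative sketch, not taken from any of the reviewed papers, the Python snippet below estimates frame-wise F0 with a simple autocorrelation method; the function name estimate_f0, the 32 ms frame / 10 ms hop and the 0.3 voicing threshold are assumptions chosen for the example.

```python
import numpy as np

def estimate_f0(signal, sr=16000, frame_len=0.032, hop=0.010,
                fmin=60.0, fmax=400.0):
    """Frame-wise F0 estimate via autocorrelation (illustrative only)."""
    n = int(frame_len * sr)       # samples per frame (assumed 32 ms)
    h = int(hop * sr)             # hop size (assumed 10 ms)
    lag_min = int(sr / fmax)      # shortest candidate pitch period
    lag_max = int(sr / fmin)      # longest candidate pitch period
    f0 = []
    for start in range(0, len(signal) - n, h):
        frame = signal[start:start + n] - signal[start:start + n].mean()
        ac = np.correlate(frame, frame, mode="full")[n - 1:]
        if ac[0] <= 0:            # silent frame: report unvoiced
            f0.append(0.0)
            continue
        ac /= ac[0]               # normalize so that ac[0] == 1
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
        # weak periodicity is treated as unvoiced (assumed 0.3 threshold)
        f0.append(sr / lag if ac[lag] > 0.3 else 0.0)
    return np.array(f0)

# Example: a synthetic 150 Hz tone should yield frame estimates near 150 Hz.
if __name__ == "__main__":
    sr = 16000
    t = np.arange(0, 1.0, 1.0 / sr)
    print(estimate_f0(np.sin(2 * np.pi * 150 * t), sr)[:5])
```

More robust pitch trackers (e.g., AMDF- or NCC-based methods, both of which appear among the abbreviations in Appendix 3) refine the same frame-and-lag bookkeeping.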

Appendix 3: Abbreviations

A: Tonal Assimilation
ACC: Letter Accuracy
AI: Artificial Intelligence
AMDF: Average Magnitude Difference Function
ANN: Artificial Neural Network
ASR: Automatic Speech Recognition
ATWV: Actual Term Weighted Value
BC: Broadcast Conversational
BCC: Broadcasting Corporation of China
BLSTM: Bidirectional Long Short-Term Memory
BN: Broadcast News
BTEC: Basic Travel Expression Corpus
CAP: Computer-Assisted Pronunciation
CART: Classification and Regression Tree
CD: Context Dependent
CD-T-175: Context-Dependent Model
CE: Consonantal Effects
CER: Character Error Rate
CI: Context Independent
CI-T-5: Context-Independent Model
CNN: Convolutional Neural Network
COR: Letter Correctness
CPM: Compound Pseudo-Morpheme
CRF: Conditional Random Field
CTC: Connectionist Temporal Classification
CTS: Conversational Telephone Speech
CV: Consonant Vowel
D: Declination Normalization
DBNFs: Deep Bottleneck Features
DCTC: Discrete Cosine Transform Coefficient
DE: Differential Evolution
DNN: Deep Neural Network
DR: Detection Rate
DRT: Diagnostic Rhyme Test
DTW: Dynamic Time Warping
EMG: Electromyography
ENV: Environment
F: Average Pitch Value of the Whole Utterance
f0: Fundamental Frequency
F0: Time Period of Successive Vocal-Cord Vibrations
FFT: Fast Fourier Transform
fi: Average Pitch Value of a Voiced Segment
FLP: Full Language Pack
G2P: Grapheme-to-Phoneme
GA: Genetic Algorithm
GFCC: Gammatone Frequency Cepstral Coefficient
GMBM: Gaussian Mixture Bigram Model
GMM: Gaussian Mixture Model
GP: GlobalPhone
HCRF: Hidden Conditional Random Field
HINT: Hearing in Noise Test
HMM: Hidden Markov Model
H-T-30: Half-Tone Model
I2R: Institute for Infocomm Research
IC: InterCorp
IF: Initial/Final
ISDPs: Intra-Syllable Dependent Phone Sets
KLT: Karhunen–Loève Transformation
KNN: K-Nearest Neighbors
KWS: Keyword Search / Keyword Spotting
LCR: Letter Correct Rate
LDA: Linear Discriminant Analysis
LM: Language Model
LM_DP_MI: Language Model with Dynamic-Programming Word Segmentation
LM_MM: Language Model with Polysyllabic Words and Maximum Matching
LM1: Closed-Type Model
LM2: Mix-Type Model
LM3: Open-Type Model
LOTUS: Large Vocabulary Thai Continuous Speech (corpus)
LPCC: Linear Predictive Cepstral Coefficient
LPM: Latent Prosody Model
LR: Logistic Regression
LRN 0.1: Lithuanian Radio News, Prototype Version 0.1
LRN1: Lithuanian Radio News, Version 1
LSTM: Long Short-Term Memory
LSTM-OP: Long Short-Term Memory with Output Projection
LSTMP: Long Short-Term Memory, Projected
LVCSR: Large Vocabulary Continuous Speech Recognition
LVQ: Learning Vector Quantization
MBN: Mandarin Broadcast News
MCE: Minimum Classification Error
MFCC: Mel-Frequency Cepstral Coefficient
MFPLP: Mel Frequency with Perceptual Linear Prediction
MGCPM: Mixed Gaussian Continuous Probability Model
ML: Maximum Likelihood
MLE: Maximum Likelihood Estimation
MLLR: Maximum Likelihood Linear Regression
MLLT: Maximum Likelihood Linear Transformation
MLP: Multi-Layer Perceptron
MMI: Maximum Mutual Information
MSD: Multi-Space Distribution
MT: Machine Translation
NCC: Normalized Cross-Correlation
NECTEC: National Electronics and Computer Technology Center
NECTEC-ATR: National Electronics and Computer Technology Center–Advanced Telecommunications Research
NN: Neural Network
Norm_Log_F0_Mean_Dev: Logarithmic f0 value normalized by the mean and standard deviation of each sentence
OOV: Out of Vocabulary
ORCHID: Open Linguistic Resources Channeled Toward Interdisciplinary Research
PB: Phonetically Balanced
PCR: Phoneme Correct Rate
PD: Phonetically Distributed
PER: Phone Error Rate
PLP: Perceptual Linear Prediction
PM: Pseudo-Morpheme
PNCC: Power-Normalized Cepstral Coefficients
QDA: Quadratic Discriminant Analysis
RLAT: Rapid Language Adaptation Toolkit
RNN: Recurrent Neural Network
SA: Syllable Accuracy
SBN: Stacked Bottleneck
SD: Spoken Documents
SDPBMM: State-Dependent Phoneme-Based Model Merging
SER: Sentence Error Rate
SFNNAM: State Feedback Neural Network Activation Model
SGMM: Subspace Gaussian Mixture Model
SLER: Syllable Error Rate
SLM: Syllable Lattice Matching
sMBR: State-Level Minimum Bayes Risk
SQ: Speech Queries
SR: Speech Recognition
SRU: Speech Recognition Unit
STT: Speech-to-Text
SVM: Support Vector Machine
TAR: Tone Accuracy Rate
TCR: Tone Correct Rate
TD: Text Documents
TDRT-I: Thai Diagnostic Rhyme Test for Initials
TF: Tone Features
Tone 1: High-Level
Tone 2: High-Rising
Tone 3: Dipping
Tone 4: Falling
Tone 5: Undefined
TQ: Text Queries
VERA: Voice Encrypted Recognition Authentication
VLLP: Very Limited Language Pack
VNBN: Vietnamese Broadcast News
VOV: Voice of Vietnam
VRS: Voice Recognition System
VRT: Voice Recognition Technology
VSM: Vector Space Model
WA: Word Accuracy
WDC: Wu Dialect Chinese
WER: Word Error Rate
WFST: Weighted Finite-State Transducer
WU: Word Unit
XIF: Extended Initial/Final
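
Several of the evaluation measures listed above (WER, CER, SER, SLER, LCR) are edit-distance-based rates. As a generic reference sketch, assuming whitespace-tokenized strings rather than any particular paper's scoring tool, word error rate can be computed as the Levenshtein distance between the reference and hypothesis word sequences divided by the reference length; tokenizing into characters or syllables instead gives a character- or syllable-level error rate in the same way.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[-1][-1] / max(len(ref), 1)

# Example: one substitution and one deletion against a 4-word reference -> 0.5
if __name__ == "__main__":
    print(word_error_rate("toned syllable recognition test",
                          "tone syllable test"))
```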


About this article


Cite this article

Kaur, J., Singh, A. & Kadyan, V. Automatic Speech Recognition System for Tonal Languages: State-of-the-Art Survey. Arch Computat Methods Eng 28, 1039–1068 (2021). https://doi.org/10.1007/s11831-020-09414-4

