Abstract
This paper describes in detail the improvements made to the recently implemented Kannada speech recognition system. The Kannada automatic speech recognition (ASR) system consists of ASR models built with Kaldi, an IVRS call flow, and databases of weather and agricultural commodity price information. The task-specific speech data used in the recently developed spoken dialogue system contained high levels of varied background noise. The different types of noise present in the collected speech data adversely affected both online and offline speech recognition performance. Therefore, to improve the recognition accuracy of the Kannada ASR system, a noise reduction algorithm was developed that fuses spectral subtraction with voice activity detection (SS-VAD) and a minimum mean square error spectrum power estimator based on zero crossing (MMSE-SPZC). The noise elimination algorithm is applied before the feature extraction stage. Alternative ASR models were created using subspace Gaussian mixture model (SGMM) and deep neural network (DNN) modeling techniques. The experimental results show that the fusion of the noise elimination technique with SGMM/DNN-based modeling yields a relative accuracy improvement of 7.68% over the recently developed GMM-HMM based ASR system. The acoustic models with the lowest word error rate (WER) can be used in the spoken dialogue system. The developed spoken query system was tested by farmers in Karnataka under uncontrolled conditions.
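The abstract's noise reduction stage combines spectral subtraction with other estimators applied before feature extraction. As an illustrative sketch only (not the authors' implementation), the core magnitude spectral subtraction step can be written as below; the function name, the over-subtraction factor `alpha`, the spectral floor `beta`, and the frame settings are assumptions chosen for the example:

```python
import numpy as np

def spectral_subtraction(noisy, noise_ref, frame_len=256, hop=128,
                         alpha=2.0, beta=0.01):
    """Basic magnitude spectral subtraction (illustrative sketch).

    noisy     : 1-D array of noisy speech samples
    noise_ref : 1-D array containing a noise-only segment
    """
    window = np.hanning(frame_len)
    # Estimate the noise magnitude spectrum from the noise-only segment
    noise_mag = np.abs(np.fft.rfft(noise_ref[:frame_len] * window))
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len, hop):
        frame = noisy[start:start + frame_len] * window
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec), np.angle(spec)
        # Over-subtract the noise estimate; floor the residual at
        # beta * noise_mag to limit musical noise
        clean_mag = np.maximum(mag - alpha * noise_mag, beta * noise_mag)
        # Resynthesize with the noisy phase and overlap-add
        out[start:start + frame_len] += np.fft.irfft(clean_mag * np.exp(1j * phase))
    return out
```

In a full SS-VAD pipeline the noise spectrum would be re-estimated from frames that a voice activity detector marks as non-speech, rather than from a fixed leading segment as assumed here.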
Acknowledgements
This study was supported by the Department of Electronics and Information Technology (DeitY), Ministry of Communications and Information Technology, Government of India.
Thimmaraja Yadava, G., Jayanna, H.S. Enhancements in automatic Kannada speech recognition system by background noise elimination and alternate acoustic modelling. Int J Speech Technol 23, 149–167 (2020). https://doi.org/10.1007/s10772-020-09671-5
Keywords
- Speech
- Speech recognition
- Interactive voice response system (IVRS)
- Automatic speech recognition (ASR)
- Spectral subtraction with voice activity detection (SS-VAD)
- Minimum mean square error spectrum power estimator based on zero crossing (MMSE-SPZC)
- Minimum mean square error spectrum power (MMSE-SP)
- Maximum a Posteriori (MAP)