Abstract
Cepstral mean and variance normalization (CMVN) is an efficient noise compensation technique widely used in speech applications. CMVN eliminates the mismatch between training and test utterances by transforming them to zero mean and unit variance. In this work, we argue that some useful information is lost during normalization, because every utterance is forced to have the same first- and second-order statistics, i.e., zero mean and unit variance. We propose to modify the CMVN methodology so that it retains this useful information while still compensating for noise. The proposed normalization transforms every test utterance to an utterance-specific clean mean (i.e., the mean the utterance would have had in the absence of noise) and clean variance, instead of zero mean and unit variance. We derive expressions to estimate the clean mean and variance from a noisy utterance. The proposed normalization is effective in recognizing voice commands, which are typically short (single words or short phrases) and for which more advanced methods [such as histogram equalization (HEQ)] are not effective. Recognition results show a relative improvement (RI) of \(21\,\%\) in word error rate over conventional CMVN on the Aurora-2 database, and an RI of \(20\,\%\) and \(11\,\%\) over CMVN and HEQ, respectively, on short utterances of the Aurora-2 database.
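To make the contrast concrete, the following is a minimal sketch, not the authors' implementation. It assumes features are stored as a NumPy array of shape (frames, coefficients). The function names `cmvn` and `normalize_to_target` and the parameters `target_mean` and `target_std` are hypothetical. In the proposed method the targets would come from the clean-statistics estimators derived in the paper, which the abstract does not reproduce, so here they are simply passed in.

```python
import numpy as np


def cmvn(features, eps=1e-8):
    """Conventional per-utterance CMVN.

    features: array of shape (num_frames, num_coeffs).
    Each coefficient track is shifted and scaled to zero mean, unit variance.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)


def normalize_to_target(features, target_mean, target_std, eps=1e-8):
    """Map an utterance to given per-coefficient target statistics.

    Assumption: target_mean and target_std stand in for the estimated
    clean mean and clean standard deviation of this utterance. How they
    are estimated from the noisy utterance is not shown here.
    """
    return cmvn(features, eps) * target_std + target_mean
```

With `target_mean` set to zero and `target_std` set to one, `normalize_to_target` reduces to conventional CMVN, so the sketch differs from the baseline only in the statistics each utterance is mapped to.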
Cite this article
Joshi, V., Prasad, N.V. & Umesh, S. Modified Mean and Variance Normalization: Transforming to Utterance-Specific Estimates. Circuits Syst Signal Process 35, 1593–1609 (2016). https://doi.org/10.1007/s00034-015-0129-y
DOI: https://doi.org/10.1007/s00034-015-0129-y