Abstract
In this paper, factored front-end CMLLR (F-FE-CMLLR) is investigated for joint speaker and environment normalization in the framework of DNN-HMM acoustic modeling. It is a feature-space transform comprising the composition of a front-end CMLLR for environment normalization and a global CMLLR for speaker normalization. The transform is applied to the noisy, speaker-independent input features, and the resulting canonical features are passed to the DNN-HMM for training and decoding. Two estimation procedures for F-FE-CMLLR are investigated, namely sequential and iterative training. A key attribute of F-FE-CMLLR is that, under the iterative training paradigm, it is likely to foster acoustic factorization, which enables more effective transfer of the environment transform from one condition to another. Moreover, being a feature-space transform, it is straightforward to use in the context of DNN-HMM acoustic modeling. The performance of the proposed scheme is evaluated on the Aurora-4 noisy speech recognition task, in which the dominant acoustic factors are microphone variability, additive noise at varying SNRs, and speaker variability. It is shown that F-FE-CMLLR yields a large improvement in performance over baseline features processed with CMLLR for speaker adaptive training (SAT), and the improvement is observed in all acoustic conditions present in the test sets. Moreover, the iterative training of F-FE-CMLLR outperforms sequential training under all test conditions. Specifically, when all three types of acoustic condition co-exist, sequential training yields a 13% relative improvement over SAT features, and iterative training provides an additional improvement on top, amounting to an 18% relative gain overall. It is argued that the improvement over sequential training is due to acoustic factorization holding in an implicit sense.
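As a compact sketch of the composed transform described above (the notation here is illustrative and not taken from the paper): with input feature vector $x_t$, an environment front-end CMLLR $(A^{(e)}, b^{(e)})$, and a global speaker CMLLR $(A^{(s)}, b^{(s)})$, the canonical features fed to the DNN-HMM would take the form

$$\hat{x}_t \;=\; A^{(s)}\!\left(A^{(e)} x_t + b^{(e)}\right) + b^{(s)},$$

i.e. the environment transform is applied first and the speaker transform is composed on top of it. Under this reading, sequential training estimates the two affine transforms one after the other, whereas iterative training presumably alternates re-estimation of each transform given the current value of the other until convergence.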
Leggetter, C. J., & Woodland, P. C. (1995). Maximum likelihood linear regression for speaker adaptation of continuous density hidden markov models. Computer Speech and Language, 9, 171–185.
Gales, M. J. F. (1998). Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech and Language, 12, 75–98.
Acero, A., & Deng, L., et al. (2000). HMM adaptation using vector taylor series for noisy speech recognition. In Proceedings of ICSLP.
Gales, M. J. F. (1995). Model-based techniques for noise robust speech recognition. In Ph.D. thesis, Cambridge: Cambridge University.
Hinton, G., Deng, L., et al. (2012). Deep Neural Networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29, 82–97.
Seide, F., Li, G., Chen, X., & Yu, D. (2011). Feature engineering in context-dependent Deep Neural Networks for conversational speech transcription. In Proceedings of IEEE ASRU.
Jaitly, N., Nguyen, P., Senior, A., & Vanhoucke, V. (2012). Application of pretrained deep neural networks to large vocabulary speech recognition. In Proceedings of International Speech.
Liao, H., McDermott, E., & Senior, A. (2013). Large scale Deep Neural Network acoustic modeling with semi-supervised training data for Youtube video transcription. In Proceedings of ASRU.
Abrash, V., Franco, H., Sankar, A., & Cohen, M. (1995). Connectionist speaker normalization and adaptation. In Proceedings of International Speech.
Gemello, R., Mana, F., Scanzio, S., Laface, P., & Mori, R. D. (2006). Adaptation of hybrid ann/hmm models using linear hidden transformations and conservative training. In Proceedings of ICASSP.
Price, R., Iso, K., & Shinoda, K. (2014). Speaker adaptation of deep neural networks using a hierarchy of output layers. In Proceedings of IEEE SLT.
Huang, Z., Li, J., Siniscalchi, S. M., Chen, I.-F., Wu, J., & Lee, C. H. (2015). Rapid adaptation for deep neural networks through multi-task learning. In Proceedings of International Speech.
Yu, D., Yao, K., Su, H., Li, G., & Seide, F. (2013). Kl-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition. In Proceedings of ICASSP.
Huang, Y., Slaney, M., Seltzer, M. L., & Gong, Y. (2014). Towards better performance with heterogeneous training data in acoustic modeling using deep neural networks. In Proceedings of International Speech.
Tan, T., Qian, Y., Yin, M., Zhuang, Y., Yu, & K. (2015). Cluster adaptive training for deep neural network. In Proceedings of ICASSP.
Miao, Y., Jiang, L., Zhang, H., & Metze, F. (2014). Improvements to speaker adaptive training of deep neural networks. In Proceedings of IEEE SLT.
G. Saon, H. Soltau, D. Nahamoo, Picheny, M. (2013) Speaker adaptation of Neural Network Acoustic models using I-vectors. In Proceedings of ASRU.
Senior, A., & Lopez-Moreno, I. (2014). Improving DNN Speaker Independence With I-Vector Inputs.In Proceedings of ICASSP.
Garimella, S., Mandal, A., & Strom, N., et al. (2015). Robust I-vector based adaptation of DNN Acoustic Model for speech recognition. In Proceedings of International Speech.
Variani, E., Lei, X., & McDermott, E., et al. (2014). Deep neural networks for small footprint text-dependent speaker verification. In Proceedings of ICASSP.
Seltzer, M. L., Yu, D., & Wang, Y. (2013). An investigation of deep neural networks for noise robust speech recognition. In Proceedings of ICASSP.
Abdel-Hamid, O., & Jiang, H. (2013). Fast speaker adaptation of hybrid nn/hmm model for speech recognition based on discriminative learning of speaker code. In Proceedings of ICASSP.
Qian, Y., Tan, T., Yu, D., & Zhang, Y. (2016) Integrated adaptation with multi-factor joint-learning for far-field speech recognition. In Proceedings of ICASSP.
Karanasou, P., Wang, M. J. F. G. Y., & Woodland, P. C. (2014). Adaptation of deep neural network acoustic models using factorised i-vectors. In Proceedings of International Speech.
Liao, H., & Gales, M. J. F. (2005). Joint uncertainty decoding for noise robust speech recognition. In Proceedings of International Speech.
Rath, S. P., Burget, L., Karafiat, M., Glembek, O., & Cernocky, J. (2013). A region-specific feature-space transformation for speaker adaptation and singularity analysis of Jacobian matrix. In Proceedings of International Speech.
Gales, M. J. F., & Flego, F. (2012). Model-based approaches for degraded channel modelling in robust ASR. In Proceedings of International Speech.
Rath, S., Sivadas, S., & Ma, B. (2015). Joint environment and speaker normalization using factored front-end CMLLR. In Proceedings of International Speech.
Gales, M. J. F. (2001). Acoustic factorisation. In Proceedings of ASRU.
Wang, Y. Q., & Gales, M. J. F. (2013). An explicit independence constraint for factorised adaptation in speech recognition. In Proceedings of International Speech.
Seltzer, M., & Acero, A. (2011). Separating speaker and environmental variability using factored transforms. In Proceedings of International Speech.
Seltzer, M., & Acero, A. (2012). Factored adaptation using a combination of feature-space and model-space transforms. In Proceedings of International Speech.
Seo, H., Kang, H.-G., & Seltzer, M. L. (2014). Factored adaptation of speaker and environment using orthogonal subspace transforms. In Proceedings of IEEE ICASSP.
Parihar, N., & Picone, J. (2002). Aurora working group: DSR frontend LVCSR evaluation AU/384/02. In Technical Report, Institute for Signal and Information Processing, Mississippi: Mississippi State University.
Povey, D., & Ghoshal, A., et al. (2011). The Kaldi Speech Recognition Toolkit. In Proceedings of IEEE ASRU.
Gopinath, R. (1998). Maximum likelihood modeling with Gaussian distributions for classification. In Proceedings IEEE ICASSP.
Gales, M. J. F. (1999). Semi-tied covariance matrices for hidden Markov models. IEEE Transactions Speech and Audio Proceedings, 7, 272–281.
Hinton, G. (2010). A Practical Guide to Training Restricted Boltzmann Machines. https://www.cs.toronto.edu/~hinton/absps/guideTR.pdf.
Vesely, K., Karafiat, M., & Grezl, F. (2011). Convolutive Bottleneck Network features for LVCSR. In Proceedings of IEEE ASRU.
Cite this article
Rath, S.P. Factored front-end CMLLR for joint speaker and environment normalization under DNN-HMM. Int J Speech Technol 20, 859–867 (2017). https://doi.org/10.1007/s10772-017-9453-x