Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition
- PMID: 22559385
- DOI: 10.1121/1.3699200
Abstract
In an attempt to increase the robustness of automatic speech recognition (ASR) systems, a feature extraction scheme is proposed that takes spectro-temporal modulation frequencies (MF) into account. This physiologically inspired approach uses a two-dimensional filter bank based on Gabor filters, which limits the redundant information between feature components, and also results in physically interpretable features. Robustness against extrinsic variation (different types of additive noise) and intrinsic variability (arising from changes in speaking rate, effort, and style) is quantified in a series of recognition experiments. The results are compared to reference ASR systems using Mel-frequency cepstral coefficients (MFCCs), MFCCs with cepstral mean subtraction (CMS) and RASTA-PLP features, respectively. Gabor features are shown to be more robust against extrinsic variation than the baseline systems without CMS, with relative improvements of 28% and 16% for two training conditions (using only clean training samples or a mixture of noisy and clean utterances, respectively). When used in a state-of-the-art system, improvements of 14% are observed when spectro-temporal features are concatenated with MFCCs, indicating the complementarity of those feature types. An analysis of the importance of specific MF shows that temporal MF up to 25 Hz and spectral MF up to 0.25 cycles/channel are beneficial for ASR.
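To make the idea of a two-dimensional Gabor filter tuned to one spectral and one temporal modulation frequency more concrete, the following is a minimal sketch and not the authors' filter bank. It assumes a log-Mel spectrogram stored as a (channels x frames) array, a Hann-shaped envelope, and hypothetical filter extents of 9 channels by 25 frames; the function names `gabor_filter_2d` and `filter_spectrogram` are illustrative only.

```python
import numpy as np


def gabor_filter_2d(omega_k, omega_n, size_k=9, size_n=25):
    """Complex 2D Gabor filter: plane-wave carrier times a Hann envelope.

    omega_k: spectral modulation frequency in rad/channel (assumed unit)
    omega_n: temporal modulation frequency in rad/frame (assumed unit)
    size_k, size_n: hypothetical odd filter extents (channels, frames)
    """
    k = np.arange(size_k) - size_k // 2  # centered channel offsets
    n = np.arange(size_n) - size_n // 2  # centered frame offsets
    # Hann window without its zero endpoints, one per dimension.
    envelope = np.outer(np.hanning(size_k + 2)[1:-1],
                        np.hanning(size_n + 2)[1:-1])
    carrier = np.exp(1j * (omega_k * k[:, None] + omega_n * n[None, :]))
    g = envelope * carrier
    # Remove the DC component so the filter responds to modulation,
    # not to the mean level of the spectrogram.
    if omega_k != 0 or omega_n != 0:
        g = g - envelope * (g.sum() / envelope.sum())
    return g


def filter_spectrogram(log_mel, g):
    """Correlate the spectrogram with the filter (zero-padded, same size)
    and return the real part of the response."""
    num_k, num_n = log_mel.shape
    pad_k, pad_n = g.shape[0] // 2, g.shape[1] // 2
    padded = np.pad(log_mel, ((pad_k, pad_k), (pad_n, pad_n)))
    out = np.zeros((num_k, num_n), dtype=complex)
    for i in range(g.shape[0]):
        for j in range(g.shape[1]):
            out += g[i, j] * padded[i:i + num_k, j:j + num_n]
    return out.real


# Example: 10 Hz temporal MF at an assumed 100 frames/s, and
# 0.1 cycles/channel spectral MF, applied to random stand-in data.
frame_rate_hz = 100.0  # assumption, not stated in the abstract
g = gabor_filter_2d(omega_k=2 * np.pi * 0.1,
                    omega_n=2 * np.pi * 10.0 / frame_rate_hz)
features = filter_spectrogram(np.random.default_rng(0).normal(size=(23, 200)), g)
```

A full filter bank in this spirit would repeat the filtering over a grid of (omega_k, omega_n) pairs and stack the outputs; how the grid is chosen to limit redundancy between feature components is specific to the paper and is not reproduced here.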