Abstract
This article presents an approach to the automatic recognition of non-native speech. Some non-native speakers tend to pronounce phonemes as they would in their native language. Model adaptation can improve the recognition rate for non-native speakers, but it has difficulty handling pronunciation errors such as phoneme insertions or substitutions. For these pronunciation mismatches, pronunciation modeling can make the recognition system more robust. Our approach is based on acoustic model transformation and pronunciation modeling for multiple non-native accents. For acoustic model transformation, two approaches are evaluated: maximum a posteriori (MAP) adaptation and model re-estimation. For pronunciation modeling, confusion rules (alternate pronunciations) are automatically extracted from a small non-native speech corpus. We present a novel way to introduce these automatically learned confusion rules into the recognition system: the modified HMM of a phoneme of the spoken (foreign) language includes its canonical pronunciation along with all the alternate non-native pronunciations, so that phonemes pronounced correctly by a non-native speaker can still be recognized. We evaluate our approaches on the non-native corpus of the European project HIWIRE, which contains English sentences pronounced by French, Italian, Greek and Spanish speakers. Two cases are studied: the native language of the test speaker is either known or unknown. Our approach gives better recognition results than the classical acoustic adaptation of HMMs when the foreign origin of the speaker is known. We obtain a 22% word error rate (WER) reduction compared to the reference system.
Furthermore, we take into account the written form of the spoken words, since non-native speakers may rely on the spelling of a word to pronounce it. Adding this information does not provide any further improvement.
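The confusion-rule idea described above can be illustrated with a small Python sketch. It is not the authors' implementation: it assumes that each canonical phoneme has already been aligned with the phoneme sequence the speaker actually produced, and it keeps as alternate pronunciations only the realizations whose relative frequency reaches a threshold. The function names, the `min_prob` parameter and the phoneme labels are hypothetical and chosen for illustration only.

```python
from collections import Counter, defaultdict


def extract_confusion_rules(aligned_pairs, min_prob=0.1):
    """Estimate confusion rules from (canonical, realized) alignments.

    aligned_pairs: iterable of (canonical_phoneme, realized_sequence),
    where realized_sequence is a tuple of phoneme labels. The alignment
    itself is assumed to be given; how it is obtained is out of scope here.
    Returns {canonical: {realized_sequence: relative_frequency}}.
    """
    counts = defaultdict(Counter)
    for canonical, realized in aligned_pairs:
        counts[canonical][tuple(realized)] += 1

    rules = {}
    for canonical, realized_counts in counts.items():
        total = sum(realized_counts.values())
        rules[canonical] = {
            realized: n / total
            for realized, n in realized_counts.items()
            # Skip the canonical realization and rare (likely noisy) variants.
            if realized != (canonical,) and n / total >= min_prob
        }
    return rules


def expand_phoneme(canonical, rules):
    """List the parallel paths of a modified phoneme model.

    The canonical pronunciation always comes first, followed by the
    alternate pronunciations in sorted order.
    """
    paths = [(canonical,)]
    paths.extend(sorted(rules.get(canonical, {})))
    return paths


# Hypothetical alignment counts, invented for this example.
pairs = (
    [("th", ("th",))] * 5
    + [("th", ("s",))] * 4
    + [("th", ("z",))] * 1
)
rules = extract_confusion_rules(pairs, min_prob=0.2)
paths = expand_phoneme("th", rules)
```

Here `paths` keeps the canonical pronunciation alongside the retained alternate, which mirrors the idea that a correctly pronounced phoneme must remain recognizable. In an actual recognizer each path would correspond to a sequence of acoustic models rather than a tuple of labels.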
Notes
For native speakers it would be more efficient to use a native ASR.
Acknowledgements
This work was partially funded by the European project HIWIRE (Human Input that Works In Real Environments), contract number 507943, sixth framework program, information society technologies.
About this article
Cite this article
Bouselmi, G., Fohr, D. & Illina, I. Multilingual recognition of non-native speech using acoustic model transformation and pronunciation modeling. Int J Speech Technol 15, 203–213 (2012). https://doi.org/10.1007/s10772-012-9134-8
DOI: https://doi.org/10.1007/s10772-012-9134-8