Bidirectional LSTM Networks for Context-Sensitive Keyword Detection in a Cognitive Virtual Agent Framework

Wöllmer, Martin; Eyben, Florian; Graves, Alex; Schuller, Björn; Rigoll, Gerhard

doi:10.1007/s12559-010-9041-8

Published: 27 April 2010

Cognitive Computation, Volume 2, pages 180–190 (2010)

Abstract

Robustly detecting keywords in human speech is an important precondition for cognitive systems, which aim at intelligently interacting with users. Conventional techniques for keyword spotting usually show good performance when evaluated on well articulated read speech. However, modeling natural, spontaneous, and emotionally colored speech is challenging for today’s speech recognition systems and thus requires novel approaches with enhanced robustness. In this article, we propose a new architecture for vocabulary independent keyword detection as needed for cognitive virtual agents such as the SEMAINE system. Our word spotting model is composed of a Dynamic Bayesian Network (DBN) and a bidirectional Long Short-Term Memory (BLSTM) recurrent neural net. The BLSTM network uses a self-learned amount of contextual information to provide a discrete phoneme prediction feature for the DBN, which is able to distinguish between keywords and arbitrary speech. We evaluate our Tandem BLSTM-DBN technique on both read speech and spontaneous emotional speech and show that our method significantly outperforms conventional Hidden Markov Model-based approaches for both application scenarios.
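
As a rough illustration of the tandem idea described in the abstract, the sketch below turns framewise phoneme scores into a discrete phoneme prediction feature that a downstream model could consume. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `discrete_phoneme_feature`, the example scores, and the use of a per-frame argmax to discretize the network output are all assumptions made for illustration, and the BLSTM and DBN themselves are not modeled here.

```python
from typing import List, Sequence


def discrete_phoneme_feature(scores: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame's phoneme scores to the index of its best-scoring phoneme.

    Assumption: `scores[t][k]` is the network's score for phoneme class k at
    frame t. Taking the per-frame argmax is one simple way to obtain a
    discrete feature; ties resolve to the lowest class index.
    """
    feature = []
    for frame in scores:
        best = max(range(len(frame)), key=lambda k: frame[k])
        feature.append(best)
    return feature


# Hypothetical scores for 4 frames over 3 phoneme classes (illustrative only).
scores = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
]

feature = discrete_phoneme_feature(scores)
print(feature)  # one discrete phoneme index per frame
```

In a tandem setup of this kind, the resulting index sequence would be passed to the graphical model as an additional observation stream alongside the acoustic features.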

Acknowledgements

The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement No. 211486 (SEMAINE).

Author information

Authors and Affiliations

Institute for Human-Machine Communication, Technische Universität München, Arcisstrasse 21, 80290, München, Germany
Martin Wöllmer, Florian Eyben, Björn Schuller & Gerhard Rigoll
Institute for Computer Science VI, Technische Universität München, Boltzmannstrasse 3, 85748, München, Germany
Alex Graves

Corresponding author

Correspondence to Martin Wöllmer.

About this article

Cite this article

Wöllmer, M., Eyben, F., Graves, A. et al. Bidirectional LSTM Networks for Context-Sensitive Keyword Detection in a Cognitive Virtual Agent Framework. Cogn Comput 2, 180–190 (2010). https://doi.org/10.1007/s12559-010-9041-8

Published: 27 April 2010
Issue Date: September 2010
DOI: https://doi.org/10.1007/s12559-010-9041-8
