Abstract
In this paper, we describe the IBM system submitted to the NIST Rich Transcription Spring 2006 (RT06s) evaluation campaign for automatic speech activity detection (SAD). The system was developed and evaluated on CHIL lecture meeting data recorded with far-field microphones, under both the single distant microphone (SDM) and the multiple distant microphone (MDM) conditions. It employs a three-class statistical classifier trained on features that augment traditional signal energy features with features based on acoustic phonetic likelihoods. The latter are obtained using a large speaker-independent acoustic model trained on meeting data. In the detection stage, after feature extraction and classification, the resulting sequence of classified states is passed through two levels of smoothing and then collapsed into segments belonging to only two classes, speech or silence. In the MDM condition, the process is repeated for every available microphone channel, and the per-channel outputs are combined using a simple majority voting rule biased towards speech. The system performed well in the RT06s evaluation campaign, achieving 8.62% and 5.01% “speaker diarization error” in the SDM and MDM conditions, respectively.
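The smoothing and multi-channel voting steps mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the system's actual implementation: the abstract does not specify the smoothing method, the window sizes, or how the vote is biased, so the sliding-window majority filter, the `half_window` parameter, the `min_speech_fraction` threshold, and the tie-goes-to-speech rule below are all assumptions made for illustration. Frames are assumed to be already collapsed to binary labels (1 = speech, 0 = silence).

```python
from typing import List, Sequence


def smooth(labels: Sequence[int], half_window: int = 1) -> List[int]:
    """Sliding-window majority filter over binary frame labels.

    Assumption: stands in for the (unspecified) smoothing stage.
    Ties inside a window are resolved as speech.
    """
    out = []
    for i in range(len(labels)):
        lo = max(0, i - half_window)
        hi = min(len(labels), i + half_window + 1)
        window = labels[lo:hi]
        out.append(1 if 2 * sum(window) >= len(window) else 0)
    return out


def fuse_channels(decisions: Sequence[Sequence[int]],
                  min_speech_fraction: float = 0.5) -> List[int]:
    """Per-frame vote across microphone channels.

    A frame is labeled speech when at least `min_speech_fraction` of the
    channels vote speech. Using `>=` means a tie counts as speech, which
    is one hypothetical way to bias the vote towards speech.
    """
    if not decisions:
        raise ValueError("at least one channel is required")
    n_frames = len(decisions[0])
    if any(len(channel) != n_frames for channel in decisions):
        raise ValueError("all channels must have the same number of frames")
    fused = []
    for frame_votes in zip(*decisions):
        speech_votes = sum(frame_votes)
        threshold = min_speech_fraction * len(frame_votes)
        fused.append(1 if speech_votes >= threshold else 0)
    return fused


if __name__ == "__main__":
    # Four hypothetical channels, four frames each.
    channels = [
        [0, 1, 1, 0],
        [0, 0, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
    ]
    print(fuse_channels(channels))
```

Lowering `min_speech_fraction` below 0.5 would bias the decision further towards speech, at the cost of more false alarms; the value the authors actually used is not stated in the abstract.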
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Marcheret, E., Potamianos, G., Visweswariah, K., Huang, J. (2006). The IBM RT06s Evaluation System for Speech Activity Detection in CHIL Seminars. In: Renals, S., Bengio, S., Fiscus, J.G. (eds) Machine Learning for Multimodal Interaction. MLMI 2006. Lecture Notes in Computer Science, vol 4299. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11965152_29
DOI: https://doi.org/10.1007/11965152_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69267-6
Online ISBN: 978-3-540-69268-3
eBook Packages: Computer Science, Computer Science (R0)