Multi-stage Speaker Diarization for Conference and Lecture Meetings

Zhu, X.; Barras, C.; Lamel, L.; Gauvain, J-L.

doi:10.1007/978-3-540-68585-2_49


Part of the book series: Lecture Notes in Computer Science (LNIP, volume 4625)


Abstract

This paper presents the LIMSI RT-07S speaker diarization system for conference and lecture meetings. The system builds upon the RT-06S diarization system designed for lecture data. The baseline system combines agglomerative clustering based on the Bayesian information criterion (BIC) with a second clustering stage that uses state-of-the-art speaker identification (SID) techniques. Since the baseline system yields a high speech activity detection (SAD) error of around 10% on lecture data, several acoustic representations with various normalization techniques are investigated within the framework of a log-likelihood ratio (LLR) based speech activity detector. UBMs trained on the different types of acoustic features are also examined in the SID clustering stage. All SAD acoustic models and UBMs are trained with the forced alignment segmentations of the conference data. The diarization system integrating the new SAD models and UBM gives comparable results on both the RT-07S conference and lecture evaluation data for the multiple distant microphone (MDM) condition.
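As a rough illustration of the BIC-based agglomerative stage mentioned above, the sketch below computes a generic ΔBIC merge criterion between two clusters of feature frames. It is not the LIMSI implementation: the function names, the diagonal-covariance Gaussian model, the penalty weight `lam`, and the use of the combined frame count in the penalty are all assumptions made for the example. Under this sign convention, a negative value favours merging the two clusters.

```python
import math
import random


def _log_det_diag(frames):
    """Log-determinant of the ML diagonal covariance of a list of frames."""
    n = len(frames)
    d = len(frames[0])
    total = 0.0
    for k in range(d):
        mean = sum(f[k] for f in frames) / n
        var = sum((f[k] - mean) ** 2 for f in frames) / n
        total += math.log(max(var, 1e-10))  # floor the variance to avoid log(0)
    return total


def delta_bic(x, y, lam=1.0):
    """Delta-BIC of modelling x and y with two Gaussians versus one.

    Assumption: diagonal-covariance Gaussians, so each model has 2*d
    free parameters (d means and d variances). Negative => merge.
    """
    n_x, n_y = len(x), len(y)
    n = n_x + n_y
    d = len(x[0])
    # Likelihood gain obtained by keeping the two clusters separate.
    gain = 0.5 * (n * _log_det_diag(x + y)
                  - n_x * _log_det_diag(x)
                  - n_y * _log_det_diag(y))
    # Penalty for the extra 2*d parameters of the second Gaussian.
    penalty = 0.5 * lam * (2 * d) * math.log(n)
    return gain - penalty


def make_cluster(rng, mean, n=500, d=4):
    """Hypothetical helper: draw n synthetic d-dimensional frames."""
    return [[rng.gauss(mean, 1.0) for _ in range(d)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    same_a = make_cluster(rng, 0.0)
    same_b = make_cluster(rng, 0.0)
    other = make_cluster(rng, 5.0)
    print("same source:     ", delta_bic(same_a, same_b))
    print("different source:", delta_bic(same_a, other))
```

In an agglomerative loop, one would typically merge the pair of clusters with the lowest ΔBIC and stop once no pair has a negative value; the value of `lam` controls how readily clusters are merged.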



Author information

Authors and Affiliations

Spoken Language Processing Group, LIMSI-CNRS, BP 133, 91403, Orsay cedex, France
X. Zhu, C. Barras, L. Lamel & J-L. Gauvain
Univ Paris-Sud, F-91405, Orsay, France
X. Zhu & C. Barras


Editor information

Rainer Stiefelhagen, Rachel Bowers, Jonathan Fiscus


About this paper

Cite this paper

Zhu, X., Barras, C., Lamel, L., Gauvain, JL. (2008). Multi-stage Speaker Diarization for Conference and Lecture Meetings. In: Stiefelhagen, R., Bowers, R., Fiscus, J. (eds) Multimodal Technologies for Perception of Humans. RT CLEAR 2007 2007. Lecture Notes in Computer Science, vol 4625. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68585-2_49


DOI: https://doi.org/10.1007/978-3-540-68585-2_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68584-5
Online ISBN: 978-3-540-68585-2
eBook Packages: Computer Science, Computer Science (R0)
