Multi-stage Speaker Diarization for Conference and Lecture Meetings

  • Conference paper
Multimodal Technologies for Perception of Humans (RT 2007, CLEAR 2007)

Abstract

This paper presents the LIMSI RT-07S speaker diarization system for conference and lecture meetings. The system builds upon the RT-06S diarization system designed for lecture data. The baseline system combines agglomerative clustering based on the Bayesian information criterion (BIC) with a second clustering stage using state-of-the-art speaker identification (SID) techniques. Because the baseline system yields a high speech activity detection (SAD) error of around 10% on lecture data, several acoustic representations with various normalization techniques are investigated within the framework of a log-likelihood ratio (LLR) based speech activity detector. Universal background models (UBMs) trained on the different types of acoustic features are also examined in the SID clustering stage. All SAD acoustic models and UBMs are trained using forced-alignment segmentations of the conference data. The diarization system integrating the new SAD models and UBM gives comparable results on both the RT-07S conference and lecture evaluation data for the multiple distant microphone (MDM) condition.
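
The two criteria named in the abstract, a BIC-based merge test and an LLR-based speech/non-speech decision, can be illustrated with a minimal sketch. This is not the LIMSI implementation: it assumes single Gaussians where real systems typically use Gaussian mixture models, and the names `delta_bic`, `llr_speech_mask`, `penalty_weight`, and `threshold` are hypothetical and chosen only for illustration.

```python
import numpy as np


def delta_bic(x, y, penalty_weight=1.0):
    """Delta-BIC for merging two clusters of feature frames.

    Each cluster (rows = frames, columns = feature dimensions) is modelled
    by one full-covariance Gaussian. A negative value means the merged
    model is preferred, so the two clusters may be merged.
    Illustrative assumption: single Gaussians, global BIC penalty.
    """
    z = np.vstack([x, y])
    n_x, n_y, n_z = len(x), len(y), len(z)
    d = z.shape[1]

    def logdet_cov(a):
        # Maximum-likelihood covariance with a small ridge for stability.
        cov = np.cov(a, rowvar=False, bias=True) + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]

    # Free parameters of one full-covariance Gaussian: mean + covariance.
    n_params = d + d * (d + 1) / 2
    penalty = 0.5 * penalty_weight * n_params * np.log(n_z)
    gain = 0.5 * (n_z * logdet_cov(z)
                  - n_x * logdet_cov(x)
                  - n_y * logdet_cov(y))
    return gain - penalty


def diag_gauss_loglik(frames, mean, var):
    """Per-frame log-likelihood under a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (frames - mean) ** 2 / var,
                         axis=1)


def llr_speech_mask(frames, speech, nonspeech, threshold=0.0):
    """Label each frame as speech when the log-likelihood ratio between the
    speech and non-speech models exceeds `threshold`.

    `speech` and `nonspeech` are (mean, variance) pairs; a real detector
    would use GMMs and smooth the decisions over time.
    """
    llr = (diag_gauss_loglik(frames, *speech)
           - diag_gauss_loglik(frames, *nonspeech))
    return llr > threshold
```

In an agglomerative clustering loop, the pair of clusters with the lowest `delta_bic` would be merged repeatedly until no pair scores below zero; raising `penalty_weight` makes merging more likely and so yields fewer clusters.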

Editor information

Rainer Stiefelhagen, Rachel Bowers, Jonathan Fiscus

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhu, X., Barras, C., Lamel, L., Gauvain, J.-L. (2008). Multi-stage Speaker Diarization for Conference and Lecture Meetings. In: Stiefelhagen, R., Bowers, R., Fiscus, J. (eds) Multimodal Technologies for Perception of Humans. RT 2007, CLEAR 2007. Lecture Notes in Computer Science, vol 4625. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68585-2_49

  • DOI: https://doi.org/10.1007/978-3-540-68585-2_49

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68584-5

  • Online ISBN: 978-3-540-68585-2

  • eBook Packages: Computer Science, Computer Science (R0)