Abstract
In this paper we present our contribution to Task 2 of the Short-duration Speaker Verification (SdSV) Challenge. The main goal of the challenge is to find new technologies for text-dependent and text-independent speaker verification in short-duration scenarios. We present some of the approaches we used while participating in the challenge. The speaker verification systems described include a baseline x-vector system with a PLDA backend and score normalization, an x-vector system with a neural PLDA backend, and a fusion of both systems.
The main goal of this paper is to analyze the influence of different score normalization methods on the performance of x-vector based speaker verification systems. We found that the system with a PLDA backend and ZT-normalization (the single system) gives superior performance on Farsi trials but a smaller improvement on English trials. Overall, in terms of minDCF, the single system performs 46.3% better than the baseline x-vector system. We found that enrollment data augmentation does not help the neural PLDA backend: the performance of that system does not improve after augmented enrollment data is added. The single system with ZT-score normalization and additional enrollment audio augmentation performs 14.8% better than the neural PLDA backend system.
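The abstract does not spell out how ZT-normalization is computed, so the following is only a minimal sketch of the commonly used formulation, not the authors' implementation. It assumes raw trial scores are held in NumPy matrices and that two impostor cohorts are available: a set of cohort utterances for Z-norm and a set of cohort speaker models for T-norm. All function and argument names are hypothetical.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero for degenerate cohorts


def z_norm(scores, model_vs_zcohort):
    """Z-norm: normalize each model's scores by that model's impostor statistics.

    scores:           (n_models, n_tests) raw trial scores
    model_vs_zcohort: (n_models, n_z) scores of each model against Z-cohort utterances
    """
    mu = model_vs_zcohort.mean(axis=1, keepdims=True)
    sd = model_vs_zcohort.std(axis=1, keepdims=True)
    return (scores - mu) / (sd + EPS)


def t_norm(scores, tcohort_vs_test):
    """T-norm: normalize each test utterance's scores by cohort-model statistics.

    scores:          (n_models, n_tests) trial scores
    tcohort_vs_test: (n_t, n_tests) scores of T-cohort models against each test utterance
    """
    mu = tcohort_vs_test.mean(axis=0, keepdims=True)
    sd = tcohort_vs_test.std(axis=0, keepdims=True)
    return (scores - mu) / (sd + EPS)


def zt_norm(scores, enroll_vs_z, tcohort_vs_test, tcohort_vs_z):
    """ZT-norm: Z-normalize the trial scores and the T-cohort scores, then T-normalize.

    tcohort_vs_z: (n_t, n_z) scores of T-cohort models against Z-cohort utterances
    """
    z_scores = z_norm(scores, enroll_vs_z)
    # The T-cohort models are treated like enrollment models for the Z-norm step.
    z_tcohort = z_norm(tcohort_vs_test, tcohort_vs_z)
    return t_norm(z_scores, z_tcohort)
```

In this formulation the Z-norm step removes the per-model score offset and scale, while the T-norm step compensates for the variability of each test utterance, which is one reason score normalization is often considered for short-duration trials.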
Acknowledgements
The study was supported by a grant from the Russian Science Foundation (project 16-15-00038).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Rakhmanenko, I., Kostyuchenko, E., Choynzonov, E., Balatskaya, L., Shelupanov, A. (2020). Score Normalization of X-Vector Speaker Verification System for Short-Duration Speaker Verification Challenge. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science, vol 12335. Springer, Cham. https://doi.org/10.1007/978-3-030-60276-5_44
DOI: https://doi.org/10.1007/978-3-030-60276-5_44
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60275-8
Online ISBN: 978-3-030-60276-5
eBook Packages: Computer Science, Computer Science (R0)