{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T13:17:29Z","timestamp":1740143849893,"version":"3.37.3"},"reference-count":47,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,12,1]],"date-time":"2021-12-01T00:00:00Z","timestamp":1638316800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,12,20]],"date-time":"2021-12-20T00:00:00Z","timestamp":1639958400000},"content-version":"vor","delay-in-days":19,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"beijing municipal education commission cooperation beijing natural science foundation","award":["KZ 201910005007"]},{"DOI":"10.13039\/501100001809","name":"national natural science foundation of china","doi-asserted-by":"publisher","award":["61971016"],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J AUDIO SPEECH MUSIC PROC."],"published-print":{"date-parts":[[2021,12]]},"abstract":"Abstract<\/jats:title>With the sharp booming of online live streaming platforms, some anchors seek profits and accumulate popularity by mixing inappropriate content into live programs. After being blacklisted, these anchors even forged their identities to change the platform to continue live, causing great harm to the network environment. Therefore, we propose an anchor voiceprint recognition in live streaming via RawNet-SA and gated recurrent unit (GRU) for anchor identification of live platform. First, the speech of the anchor is extracted from the live streaming by using voice activation detection (VAD) and speech separation. Then, the feature sequence of anchor voiceprint is generated from the speech waveform with the self-attention network RawNet-SA. Finally, the feature sequence of anchor voiceprint is aggregated by GRU to transform into a deep voiceprint feature vector for anchor recognition. Experiments are conducted on the VoxCeleb, CN-Celeb, and MUSAN dataset, and the competitive results demonstrate that our method can effectively recognize the anchor voiceprint in video streaming.<\/jats:p>","DOI":"10.1186\/s13636-021-00234-3","type":"journal-article","created":{"date-parts":[[2021,12,20]],"date-time":"2021-12-20T14:05:17Z","timestamp":1640009117000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["Anchor voiceprint recognition in live streaming via RawNet-SA and gated recurrent unit"],"prefix":"10.1186","volume":"2021","author":[{"given":"Jiacheng","family":"Yao","sequence":"first","affiliation":[]},{"given":"Jing","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Jiafeng","family":"Li","sequence":"additional","affiliation":[]},{"given":"Li","family":"Zhuo","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,12,20]]},"reference":[{"key":"234_CR1","doi-asserted-by":"publisher","first-page":"19","DOI":"10.1006\/dspr.1999.0361","volume":"10","author":"DA Reynolds","year":"2000","unstructured":"D.A. Reynolds, T.F. Quatieri, R.B. Dunn, Speaker verification using adapted Gaussian mixture models. 
Digital Signal Processing 10, 19 (2000)","journal-title":"Digital Signal Processing"},{"key":"234_CR2","doi-asserted-by":"publisher","first-page":"788","DOI":"10.1109\/TASL.2010.2064307","volume":"19","author":"N Dehak","year":"2011","unstructured":"N. Dehak, P.J. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19, 788 (2011)","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"234_CR3","first-page":"4052","volume-title":"Deep neural networks for small footprint text-dependent speaker verification, in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"E Variani","year":"2014","unstructured":"E. Variani, X. Lei, E. McDermott, I.L. Moreno, J. Gonzalez-Dominguez, Deep neural networks for small footprint text-dependent speaker verification, in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, Florence, Italy, 2014), pp. 4052\u20134056"},{"key":"234_CR4","doi-asserted-by":"publisher","first-page":"5329","DOI":"10.1109\/ICASSP.2018.8461375","volume-title":"X-Vectors: Robust DNN embeddings for speaker recognition, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"D Snyder","year":"2018","unstructured":"D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, S. Khudanpur, X-Vectors: Robust DNN embeddings for speaker recognition, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, Calgary, AB, 2018), pp. 5329\u20135333"},{"key":"234_CR5","doi-asserted-by":"crossref","unstructured":"M. McLaren, L. Ferrer, D. Castan, and A. Lawson, The speakers in the wild (SITW) speaker recognition database, in Interspeech 2016 (ISCA, 2016), pp. 818\u2013822.","DOI":"10.21437\/Interspeech.2016-1129"},{"key":"234_CR6","doi-asserted-by":"publisher","first-page":"1942","DOI":"10.1109\/TASLP.2017.2732162","volume":"25","author":"Y Qian","year":"2017","unstructured":"Y. Qian, N. Chen, H. Dinkel, Z. Wu, Deep feature engineering for noise robust spoofing detection. IEEE\/ACM Trans. Audio Speech Lang. Process. 25, 1942 (2017)","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"234_CR7","doi-asserted-by":"crossref","unstructured":"G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemelinin, Audio replay attack detection with deep learning frameworks, in Interspeech 2017 (ISCA, 2017), pp. 82\u201386.","DOI":"10.21437\/Interspeech.2017-360"},{"key":"234_CR8","doi-asserted-by":"publisher","first-page":"1985","DOI":"10.1109\/TASLP.2019.2937413","volume":"27","author":"A Gomez-Alanis","year":"2019","unstructured":"A. Gomez-Alanis, A.M. Peinado, J.A. Gonzalez, A.M. Gomez, A gated recurrent convolutional neural network for robust spoofing detection. IEEE\/ACM Trans. Audio Speech Lang. Process. 27, 1985 (2019)","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"234_CR9","doi-asserted-by":"crossref","unstructured":"A. Gomez-Alanis, A. M. Peinado, J. A. Gonzalez, and A. M. Gomez, A light convolutional GRU-RNN deep feature extractor for ASV spoofing detection, in Interspeech 2019 (ISCA, 2019), pp. 1068\u20131072.","DOI":"10.21437\/Interspeech.2019-2212"},{"key":"234_CR10","doi-asserted-by":"crossref","unstructured":"A. Gomez-Alanis, J. A. Gonzalez-Lopez, S. P. Dubagunta, A. M. Peinado, and M. Magimai.-Doss, On joint optimization of automatic speaker verification and anti-spoofing in the embedding space, IEEE Trans. 
Inform. Forensic Secur. 16, 1579 (2021).","DOI":"10.1109\/TIFS.2020.3039045"},{"key":"234_CR11","doi-asserted-by":"publisher","first-page":"108530","DOI":"10.1109\/ACCESS.2020.3000641","volume":"8","author":"A Gomez-Alanis","year":"2020","unstructured":"A. Gomez-Alanis, J.A. Gonzalez-Lopez, A.M. Peinado, A Kernel density estimation based loss function and its application to ASV-Spoofing Detection. IEEE Access 8, 108530 (2020)","journal-title":"IEEE Access"},{"key":"234_CR12","doi-asserted-by":"crossref","unstructured":"A. Nagrani, J. S. Chung, and A. Zisserman, VoxCeleb: a large-scale speaker identification dataset, in Interspeech 2017 (ISCA, 2017), pp. 2616\u20132620.","DOI":"10.21437\/Interspeech.2017-950"},{"key":"234_CR13","doi-asserted-by":"crossref","unstructured":"A. Hajavi and A. Etemad, A deep neural network for short-segment speaker recognition, in Interspeech 2019 (ISCA, 2019), pp. 2878\u20132882.","DOI":"10.21437\/Interspeech.2019-2240"},{"key":"234_CR14","doi-asserted-by":"crossref","unstructured":"Y. Jiang, Y. Song, I. McLoughlin, Z. Gao, and L.-R. Dai, An effective deep embedding learning architecture for speaker verification, in Interspeech 2019 (ISCA, 2019), pp. 4040\u20134044.","DOI":"10.21437\/Interspeech.2019-1606"},{"key":"234_CR15","first-page":"154","volume":"36","author":"RC Guido","year":"2019","unstructured":"R.C. Guido, Paraconsistent feature engineering [Lecture Notes]. IEEE Signal Process. Mag. 36, 154 (2019)","journal-title":"IEEE Signal Process. Mag."},{"key":"234_CR16","doi-asserted-by":"crossref","unstructured":"J. Jung, H.-S. Heo, J. Kim, H. Shim, and H.-J. Yu, RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification, in Interspeech 2019 (ISCA, 2019), pp. 1268\u20131272.","DOI":"10.21437\/Interspeech.2019-1982"},{"key":"234_CR17","first-page":"1724","volume-title":"Learning phrase representations using RNN encoder-decoder for statistical machine translation, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"K Cho","year":"2014","unstructured":"K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Association for Computational Linguistics, Doha, Qatar, 2014), pp. 1724\u20131734"},{"key":"234_CR18","doi-asserted-by":"crossref","unstructured":"J. Jung, S. Kim, H. Shim, J. Kim, and H.-J. Yu, Improved RawNet with feature map scaling for text-independent speaker verification using raw waveforms, in Interspeech 2020 (ISCA, 2020), pp. 1496\u20131500.","DOI":"10.21437\/Interspeech.2020-1011"},{"key":"234_CR19","first-page":"6000","volume-title":"Attention is all you need, in Proceedings of the 31st International Conference on Neural Information Processing Systems","author":"A Vaswani","year":"2017","unstructured":"A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, \u0141. Kaiser, I. Polosukhin, Attention is all you need, in Proceedings of the 31st International Conference on Neural Information Processing Systems (Curran Associates Inc., Long Beach, California, USA, 2017), pp. 6000\u20136010"},{"key":"234_CR20","doi-asserted-by":"crossref","unstructured":"M. India, P. Safari, and J. Hernando, Self multi-head attention for speaker recognition, in Interspeech 2019 (ISCA, 2019), pp. 
4305\u20134309.","DOI":"10.21437\/Interspeech.2019-2616"},{"key":"234_CR21","doi-asserted-by":"crossref","unstructured":"P. Safari, M. India, and J. Hernando, Self-attention encoding and pooling for speaker recognition, in Interspeech 2020 (ISCA, 2020), pp. 941\u2013945.","DOI":"10.21437\/Interspeech.2020-1446"},{"key":"234_CR22","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S Hochreiter","year":"1997","unstructured":"S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Computation 9, 1735 (1997)","journal-title":"Neural Computation"},{"key":"234_CR23","doi-asserted-by":"publisher","first-page":"1123","DOI":"10.3390\/app9061123","volume":"9","author":"M Jabreel","year":"2019","unstructured":"M. Jabreel, A. Moreno, A deep learning-based approach for multi-label emotion classification in Tweets. Applied Sciences 9, 1123 (2019)","journal-title":"Applied Sciences"},{"key":"234_CR24","doi-asserted-by":"publisher","first-page":"180","DOI":"10.1049\/el:20000192","volume":"36","author":"K-H Woo","year":"2000","unstructured":"K.-H. Woo, T.-Y. Yang, K.-J. Park, C. Lee, Robust voice activity detection algorithm for estimating noise spectrum. Electron. Lett. 36, 180 (2000)","journal-title":"Electron. Lett."},{"key":"234_CR25","first-page":"61","volume":"99","author":"R Chengalvarayan","year":"1999","unstructured":"R. Chengalvarayan, Robust energy normalization using speech\/nonspeech discriminator for German connected digit recognition, in EUROSPEECH, Vol. 99, 61\u201364 (1999)","journal-title":"EUROSPEECH"},{"key":"234_CR26","doi-asserted-by":"crossref","unstructured":"A. Benyassine, E. Shlomot, H.-Y. Su, D. Massaloux, C. Lamblin, and J.-P. Petit, ITU-T Recommendation G.729 Annex B: a silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications, IEEE Communications Magazine 35, 64 (1997).","DOI":"10.1109\/35.620527"},{"key":"234_CR27","doi-asserted-by":"crossref","unstructured":"J. Wagner, D. Schiller, A. Seiderer, and E. Andr\u00e9, Deep learning in paralinguistic recognition tasks: are hand-crafted features still relevant? in Interspeech 2018 (ISCA, 2018), pp. 147\u2013151.","DOI":"10.21437\/Interspeech.2018-1238"},{"key":"234_CR28","doi-asserted-by":"publisher","first-page":"2154","DOI":"10.21105\/joss.02154","volume":"5","author":"R Hennequin","year":"2020","unstructured":"R. Hennequin, A. Khlif, F. Voituret, M. Moussallam, Spleeter: A fast and efficient music source separation tool with pre-trained models. JOSS 5, 2154 (2020)","journal-title":"JOSS"},{"key":"234_CR29","doi-asserted-by":"crossref","unstructured":"O. Ronneberger, P. Fischer, and T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in Medical Image Computing and Computer-Assisted Intervention\u2014MICCAI 2015, edited by N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi (Springer International Publishing, Cham, 2015), pp. 234\u2013241.","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"234_CR30","unstructured":"A. D\u00e9fossez, N. Usunier, L. Bottou, and F. Bach, Music source separation in the waveform domain, ArXiv:1911.13254 [Cs, Eess, Stat] (2019)."},{"key":"234_CR31","unstructured":"M. Ravanelli and Y. Bengio, Interpretable convolutional filters with SincNet, ArXiv:1811.09725 [Cs, Eess] (2019)."},{"key":"234_CR32","doi-asserted-by":"crossref","unstructured":"X. Wang, R. Girshick, A. Gupta, and K. 
He, Non-local neural networks, in 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (2018), pp. 7794\u20137803.","DOI":"10.1109\/CVPR.2018.00813"},{"key":"234_CR33","doi-asserted-by":"publisher","first-page":"1437","DOI":"10.1109\/TPAMI.2017.2711011","volume":"40","author":"R Arandjelovic","year":"2018","unstructured":"R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, J. Sivic, NetVLAD: CNN architecture for weakly supervised place recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40, 1437 (2018)","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"234_CR34","doi-asserted-by":"publisher","first-page":"35","DOI":"10.1007\/978-3-030-20890-5_3","volume-title":"Computer Vision\u2014ACCV 2018","author":"Y Zhong","year":"2019","unstructured":"Y. Zhong, R. Arandjelovi\u0107, A. Zisserman, in Computer Vision\u2014ACCV 2018, ed. by C. V. Jawahar, H. Li, G. Mori, K. Schindler. GhostVLAD for set-based face recognition (Springer International Publishing, Cham, 2019), pp. 35\u201350"},{"key":"234_CR35","doi-asserted-by":"crossref","unstructured":"J. S. Chung, A. Nagrani, and A. Zisserman, VoxCeleb2: deep speaker recognition, in Interspeech 2018 (ISCA, 2018), pp. 1086\u20131090.","DOI":"10.21437\/Interspeech.2018-1929"},{"key":"234_CR36","doi-asserted-by":"crossref","unstructured":"Y. Fan, J. W. Kang, L. T. Li, K. C. Li, H. L. Chen, S. T. Cheng, P. Y. Zhang, Z. Y. Zhou, Y. Q. Cai, and D. Wang, CN-Celeb: A challenging Chinese speaker recognition dataset, in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, Barcelona, Spain, 2020), pp. 7604\u20137608.","DOI":"10.1109\/ICASSP40776.2020.9054017"},{"key":"234_CR37","unstructured":"D. Snyder, G. Chen, and D. Povey, MUSAN: a music, speech, and noise corpus, ArXiv:1510.08484 [Cs] (2015)."},{"key":"234_CR38","doi-asserted-by":"crossref","unstructured":"W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, Utterance-level aggregation for speaker recognition in the wild, in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, Brighton, United Kingdom, 2019), pp. 5791\u20135795.","DOI":"10.1109\/ICASSP.2019.8683120"},{"key":"234_CR39","doi-asserted-by":"publisher","first-page":"101027","DOI":"10.1016\/j.csl.2019.101027","volume":"60","author":"A Nagrani","year":"2020","unstructured":"A. Nagrani, J.S. Chung, W. Xie, A. Zisserman, Voxceleb: large-scale speaker verification in the wild. Computer Speech & Language 60, 101027 (2020)","journal-title":"Computer Speech & Language"},{"key":"234_CR40","unstructured":"N. R. Koluguri, J. Li, V. Lavrukhin, and B. Ginsburg, SpeakerNet: 1D depth-wise separable convolutional network for text-independent speaker recognition and verification, ArXiv:2010.12653 [Eess] (2020)."},{"key":"234_CR41","doi-asserted-by":"crossref","unstructured":"J. Deng, J. Guo, J. Yang, N. Xue, I. Kotsia, and S. P. Zafeiriou, ArcFace: additive angular margin loss for deep face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 1 (2021).","DOI":"10.1109\/TPAMI.2021.3087709"},{"key":"234_CR42","doi-asserted-by":"crossref","unstructured":"M. India, P. Safari, and J. Hernando, Double multi-head attention for speaker verification, in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, Toronto, ON, Canada, 2021), pp. 6144\u20136148.","DOI":"10.1109\/ICASSP39728.2021.9414877"},{"key":"234_CR43","doi-asserted-by":"crossref","unstructured":"K. Okabe, T. 
Koshinaka, and K. Shinoda, Attentive statistics pooling for deep speaker embedding, in Interspeech 2018 (ISCA, 2018), pp. 2252\u20132256.","DOI":"10.21437\/Interspeech.2018-993"},{"key":"234_CR44","doi-asserted-by":"publisher","first-page":"294","DOI":"10.1109\/SLT48900.2021.9383565","volume-title":"Cross attentive pooling for speaker verification, in 2021 IEEE Spoken Language Technology Workshop (SLT)","author":"S Min Kye","year":"2021","unstructured":"S. Min Kye, Y. Kwon, J. Son Chung, Cross attentive pooling for speaker verification, in 2021 IEEE Spoken Language Technology Workshop (SLT) (IEEE, Shenzhen, China, 2021), pp. 294\u2013300"},{"key":"234_CR45","doi-asserted-by":"crossref","unstructured":"S. Shon, H. Tang, and J. Glass, VoiceID Loss: Speech enhancement for speaker verification, in Interspeech 2019 (ISCA, 2019), pp. 2888\u20132892.","DOI":"10.21437\/Interspeech.2019-1496"},{"key":"234_CR46","doi-asserted-by":"publisher","first-page":"531","DOI":"10.1007\/11744085_41","volume-title":"Computer Vision \u2013 ECCV 2006","author":"S Ioffe","year":"2006","unstructured":"S. Ioffe, in Computer Vision \u2013 ECCV 2006, ed. by A. Leonardis, H. Bischof, A. Pinz. Probabilistic linear discriminant analysis (Springer, Berlin, Heidelberg, 2006), pp. 531\u2013542"},{"key":"234_CR47","first-page":"1660","volume-title":"Speaker verification using kernel-based binary classifiers with binary operation derived features, in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"H-S Lee","year":"2014","unstructured":"H.-S. Lee, Y. Tso, Y.-F. Chang, H.-M. Wang, S.-K. Jeng, Speaker verification using kernel-based binary classifiers with binary operation derived features, in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, Florence, Italy, 2014), pp. 
1660\u20131664"}],"container-title":["EURASIP Journal on Audio, Speech, and Music Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-021-00234-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13636-021-00234-3\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-021-00234-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,12,20]],"date-time":"2021-12-20T14:19:23Z","timestamp":1640009963000},"score":1,"resource":{"primary":{"URL":"https:\/\/asmp-eurasipjournals.springeropen.com\/articles\/10.1186\/s13636-021-00234-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,12]]},"references-count":47,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["234"],"URL":"https:\/\/doi.org\/10.1186\/s13636-021-00234-3","relation":{},"ISSN":["1687-4722"],"issn-type":[{"type":"electronic","value":"1687-4722"}],"subject":[],"published":{"date-parts":[[2021,12]]},"assertion":[{"value":"4 August 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 December 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"20 December 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"45"}}