{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2023,5,6]],"date-time":"2023-05-06T04:31:24Z","timestamp":1683347484312},"reference-count":39,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2023,5,5]],"date-time":"2023-05-05T00:00:00Z","timestamp":1683244800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"ANID\/FONDECYT Iniciaci\u00f3n","award":["11230129"]},{"name":"Competition for Research Regular Projects, year 2021","award":["LPR21-02"]},{"name":"Universidad Tecnol\u00f3gica Metropolitana, and beca Santander Movilidad Internacional Profesores CONVOCATORIA","award":["2021\u20132022\/2023"]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"Speech processing algorithms, especially sound source localization (SSL), speech enhancement, and speaker tracking are considered to be the main fields in this application. Most speech processing algorithms require knowing the number of speakers for real implementation. In this article, a novel method for estimating the number of speakers is proposed based on the hive shaped nested microphone array (HNMA) by wavelet packet transform (WPT) and 2D sub-band adaptive steered response power (SB-2DASRP) with phase transform (PHAT) and maximum likelihood (ML) filters, and, finally, the agglomerative classification and elbow criteria for obtaining the number of speakers in near-field scenarios. The proposed HNMA is presented for aliasing and imaging elimination and preparing the proper signals for the speaker counting method. In the following, the Blackman\u2013Tukey spectral estimation method is selected for detecting the proper frequency components of the recorded signal. The WPT is considered for smart sub-band processing by focusing on the frequency bins of the speech signal. 
In addition, the SRP method is implemented in 2D format and adaptively by ML and PHAT filters on the sub-band signals. The SB-2DASRP peak positions are extracted on various time frames based on the standard deviation (SD) criteria, and the final number of speakers is estimated by unsupervised agglomerative clustering and elbow criteria. The proposed HNMA-SB-2DASRP method is compared with the frequency-domain magnitude squared coherence (FD-MSC), i-vector probabilistic linear discriminant analysis (i-vector PLDA), ambisonics features of the correlational recurrent neural network (AF-CRNN), and speaker counting by density-based classification and clustering decision (SC-DCCD) algorithms on noisy and reverberant environments, which represents the superiority of the proposed method for real implementation.","DOI":"10.3390\/s23094499","type":"journal-article","created":{"date-parts":[[2023,5,5]],"date-time":"2023-05-05T07:57:31Z","timestamp":1683273451000},"page":"4499","source":"Crossref","is-referenced-by-count":0,"title":["Speaker Counting Based on a Novel Hive Shaped Nested Microphone Array by WPT and 2D Adaptive SRP Algorithms in Near-Field Scenarios"],"prefix":"10.3390","volume":"23","author":[{"ORCID":"http:\/\/orcid.org\/0000-0002-6391-6863","authenticated-orcid":false,"given":"Ali","family":"Dehghan Firoozabadi","sequence":"first","affiliation":[{"name":"Department of Electricity, Universidad Tecnol\u00f3gica Metropolitana, Av. Jos\u00e9 Pedro Alessandri 1242, Santiago 7800002, Chile"}]},{"ORCID":"http:\/\/orcid.org\/0000-0003-2500-3294","authenticated-orcid":false,"given":"Pablo","family":"Adasme","sequence":"additional","affiliation":[{"name":"Electrical Engineering Department, Universidad de Santiago de Chile, Av. 
Victor Jara 3519, Santiago 9170124, Chile"}]},{"ORCID":"http:\/\/orcid.org\/0000-0002-5692-5673","authenticated-orcid":false,"given":"David","family":"Zabala-Blanco","sequence":"additional","affiliation":[{"name":"Department of Computing and Industries, Universidad Cat\u00f3lica del Maule, Talca 3466706, Chile"}]},{"ORCID":"http:\/\/orcid.org\/0000-0002-3958-503X","authenticated-orcid":false,"given":"Pablo","family":"Palacios J\u00e1tiva","sequence":"additional","affiliation":[{"name":"Department of Electrical Engineering, Universidad de Chile, Santiago 8370451, Chile"},{"name":"Escuela de Inform\u00e1tica y Telecomunicaciones, Universidad Diego Portales, Santiago 8370190, Chile"}]},{"ORCID":"http:\/\/orcid.org\/0000-0003-3461-4484","authenticated-orcid":false,"given":"Cesar","family":"Azurdia-Meza","sequence":"additional","affiliation":[{"name":"Department of Electrical Engineering, Universidad de Chile, Santiago 8370451, Chile"}]}],"member":"1968","published-online":{"date-parts":[[2023,5,5]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"612750","DOI":"10.3389\/frobt.2021.612750","article-title":"Speech Interaction to Control a Hands-Free Delivery Robot for High-Risk Health Care Scenarios","volume":"8","author":"Grasse","year":"2021","journal-title":"Front. Robot. AI"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"782","DOI":"10.1109\/LRA.2020.2965417","article-title":"Multiple Sound Source Position Estimation by Drone Audition Based on Data Association Between Sound Source Localization and Identification","volume":"5","author":"Wakabayashi","year":"2020","journal-title":"IEEE Robot. Autom. Lett."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"76","DOI":"10.1109\/JSTSP.2019.2903492","article-title":"Speaker Tracking Based on Distributed Particle Filter and Iterative Covariance Intersection in Distributed Microphone Networks","volume":"13","author":"Wang","year":"2019","journal-title":"IEEE J. Sel. Top. 
Signal Process."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"125","DOI":"10.1109\/TCE.2020.2986003","article-title":"Speech Enhancement Parameter Adjustment to Maximize Accuracy of Automatic Speech Recognition","volume":"66","author":"Kawase","year":"2020","journal-title":"IEEE Trans. Consum. Electron."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"32187","DOI":"10.1109\/ACCESS.2020.2973541","article-title":"Text-Independent Speaker Identification Through Feature Fusion and Deep Neural Network","volume":"8","author":"Jahangir","year":"2020","journal-title":"IEEE Access"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1378","DOI":"10.1109\/TCSI.2019.2960843","article-title":"Low-Energy Voice Activity Detection via Energy-Quality Scaling from Data Conversion to Machine Learning","volume":"67","author":"Teo","year":"2020","journal-title":"IEEE Trans. Circuits Syst. I Regul. Pap."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"6458","DOI":"10.1109\/TSP.2018.2876349","article-title":"Source Counting and Separation Based on Simplex Analysis","volume":"66","author":"Talmon","year":"2018","journal-title":"IEEE Trans. Signal Process."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Wang, Z.Q., and Wang, D. (2021, January 6\u201311). Count and Separate: Incorporating Speaker Counting for Continuous Speaker Separation. Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021), Toronto, ON, Canada.","DOI":"10.1109\/ICASSP39728.2021.9414677"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1031","DOI":"10.1109\/TASLP.2019.2892895","article-title":"A Geometric Model for Prediction of Spatial Aliasing in 2.5D Sound Field Synthesis","volume":"27","author":"Winter","year":"2019","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. 
Process."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"537","DOI":"10.1109\/TASLP.2020.3045556","article-title":"Speaker Separation Using Speaker Inventories and Estimated Speech","volume":"29","author":"Wang","year":"2021","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Rouvier, M., Bousquet, P.M., and Favre, B. (September, January 31). Speaker diarization through speaker embeddings. Proceedings of the 23rd European Signal Processing Conference (EUSIPCO 2015), Nice, France.","DOI":"10.1109\/EUSIPCO.2015.7362751"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"255","DOI":"10.1016\/j.aej.2016.12.009","article-title":"Speaker diarization system using HXLPS and deep neural network","volume":"57","author":"Ramaiah","year":"2018","journal-title":"Alex. Eng. J."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Yin, R., Bredin, H., and Barras, C. (2017, January 20\u201324). Speaker change detection in broadcast TV using bidirectional long short-term memory networks. Proceedings of the Interspeech Conference, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-65"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"356","DOI":"10.1109\/TASL.2011.2125954","article-title":"Speaker Diarization: A Review of Recent Research","volume":"20","author":"Anguera","year":"2012","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Huijbregts, M., Leeuwen, D.A., and Jong, F. (2009, January 6\u201310). Speech overlap detection in a two-pass speaker diarization system. 
Proceedings of the Interspeech Conference, Brighton, UK.","DOI":"10.21437\/Interspeech.2009-326"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1035","DOI":"10.1109\/TASLP.2017.2678684","article-title":"Teager\u2013Kaiser Energy Operators for Overlapped Speech Detection","volume":"25","author":"Shokouhi","year":"2017","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Andrei, V., Cucuand, H., and Burileanu, C. (2017, January 20\u201324). Detecting overlapped speech on short timeframes using deep learning. Proceedings of the Interspeech Conference, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-188"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Lef\u00e8vre, A., Bach, F., and F\u00e9votte, C. (2011, January 22\u201327). Itakura-Saito nonnegative matrix factorization with group sparsity. Proceedings of the 36th International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic.","DOI":"10.1109\/ICASSP.2011.5946318"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Bregman, A.S. (1994). Auditory Scene Analysis: The Perceptual Organization of Sound, MIT Press.","DOI":"10.1121\/1.408434"},{"key":"ref_20","unstructured":"Kumar, P.V.A., Balakrishna, J., Prakash, C., and Gangashetty, S.V. (2011, January 16\u201318). Bessel features for estimating number of speakers from multi speaker speech signals. Proceedings of the 18th International Conference on Systems, Signals and Image Processing (IWSSIP), Sarajevo, Bosnia and Herzegovina."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Maka, T., and Lazoryszczak, M. (2018, January 19\u201321). Detecting the Number of Speakers in Speech Mixtures by Human and Machine. 
Proceedings of the 25th Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), Poznan, Poland.","DOI":"10.23919\/SPA.2018.8563405"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"268","DOI":"10.1109\/TASLP.2018.2877892","article-title":"CountNet: Estimating the Number of Concurrent Speakers Using Supervised Learning","volume":"27","author":"Chakrabarty","year":"2019","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"850","DOI":"10.1109\/JSTSP.2019.2910759","article-title":"Overlapped Speech Detection and Competing Speaker Counting\u2014Humans Versus Deep Learning","volume":"13","author":"Andrei","year":"2019","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Pasha, S., Donley, J., and Ritz, C. (2017, January 12\u201315). Blind speaker counting in highly reverberant environments by clustering coherence features. Proceedings of the 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, Malaysia.","DOI":"10.1109\/APSIPA.2017.8282303"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Vinals, I., Gimeno, P., Ortega, A., Miguel, A., and Lleida, E. (2018, January 2\u20136). Estimation of the Number of Speakers with Variational Bayesian PLDA in the DIHARD Diarization Challenge. Proceedings of the Interspeech Conference, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1841"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Grumiaux, P.A., Kiti\u0107, S., Girin, L., and Gu\u00e9rin, A. (2021, January 18\u201321). High-Resolution Speaker Counting in Reverberant Rooms Using CRNN with Ambisonics Features. 
Proceedings of the 28th European Signal Processing Conference (EUSIPCO), Amsterdam, The Netherlands.","DOI":"10.23919\/Eusipco47968.2020.9287637"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"176541","DOI":"10.1109\/ACCESS.2019.2956772","article-title":"Estimating Number of Speakers via Density-Based Clustering and Classification Decision","volume":"7","author":"Yang","year":"2019","journal-title":"IEEE Access"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Firoozabadi, A.D., Irarrazaval, P., Adasme, P., Zabala-Blanco, D., Palacios-J\u00e1tiva, P., Durney, H., Sanhueza, M., and Azurdia-Meza, C.A. (2021, January 23\u201327). Speakers counting by proposed nested microphone array in combination with limited space SRP. Proceedings of the 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland.","DOI":"10.23919\/EUSIPCO54536.2021.9616309"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"777","DOI":"10.1109\/TIM.2004.827304","article-title":"Experimental evaluation of a nested microphone array with adaptive noise cancellers","volume":"53","author":"Zheng","year":"2004","journal-title":"IEEE Trans. Instrum. Meas."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Niu, Y., Chen, J., and Li, B. (2014, January 26\u201328). Novel PSD estimation algorithm based on compressed sensing and Blackman-Tukey approach. Proceedings of the 4th IEEE International Conference on Information Science and Technology, Shenzhen, China.","DOI":"10.1109\/ICIST.2014.6920383"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Rickard, S., and Yilmaz, O. (2002, January 13\u201317). On the approximate W-disjoint orthogonality of speech. 
Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Orlando, FL, USA.","DOI":"10.1109\/ICASSP.2002.1005793"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"4041","DOI":"10.1109\/TSP.2020.3006742","article-title":"Novel Fractional Wavelet Packet Transform: Theory, Implementation, and Applications","volume":"68","author":"Shi","year":"2020","journal-title":"IEEE Trans. Signal Process."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Wang, Z., and Li, S. (2012, January 16\u201318). Discrete Fourier Transform and Discrete Wavelet Packet Transform in speech denoising. Proceedings of the 5th International Congress on Image and Signal Processing, Chongqing, China.","DOI":"10.1109\/CISP.2012.6469868"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Zhuo, D.B., and Cao, H. (2021). Fast Sound Source Localization Based on SRP-PHAT Using Density Peaks Clustering. Appl. Sci., 11.","DOI":"10.3390\/app11010445"},{"key":"ref_35","unstructured":"Firoozabadi, A.D., and Abutalebi, H.R. (2010, January 11\u201313). SRP-ML: A Robust SRP-based speech source localization method for Noisy environments. Proceedings of the 18th Iranian Conference on Electrical Engineering (ICEE), Isfahan, Iran."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Babichev, S., Taif, M.A., and Lytvynenko, V. (2016, January 23\u201327). Inductive model of data clustering based on the agglomerative hierarchical algorithm. Proceedings of the First International Conference on Data Stream Mining & Processing (DSMP), Lviv, Ukraine.","DOI":"10.1109\/DSMP.2016.7583499"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Wang, J., and Wichakool, W. (2017, January 7\u20138). Artificial elbow joint classification using upper arm based on surface-EMG signal. 
Proceedings of the 3rd International Conference on Engineering Technologies and Social Sciences (ICETSS), Bangkok, Thailand.","DOI":"10.1109\/ICETSS.2017.8324198"},{"key":"ref_38","unstructured":"Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S., Dahlgren, N.L., and Zue, V. (1993). TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1, Linguistic Data Consortium. Available online: https:\/\/catalog.ldc.upenn.edu\/LDC93S1."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"943","DOI":"10.1121\/1.382599","article-title":"Image method for efficiently simulating small room acoustics","volume":"65","author":"Allen","year":"1979","journal-title":"J. Acoust. Soc. Am."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/9\/4499\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,5,5]],"date-time":"2023-05-05T08:07:49Z","timestamp":1683274069000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/9\/4499"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,5,5]]},"references-count":39,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2023,5]]}},"alternative-id":["s23094499"],"URL":"https:\/\/doi.org\/10.3390\/s23094499","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,5,5]]}}}