Abstract
In this paper, examples of intelligent audio signal processing are briefly described. The focus, however, is on the machine learning approach and the datasets it requires, especially for deep learning models. Years of intense research have produced many important results in this area; however, the goal of fully intelligent signal processing, characterized by autonomous operation, has not yet been achieved. Therefore, a review of the state of the art in this area is given. The importance of acquiring an appropriate dataset of audio samples dedicated to the task at hand is also emphasized. The paper starts with examples of audio-related datasets returned by a search engine inquiry. Then, examples of research studies, along with their results, are given. Several works carried out by the author and her collaborators are also presented. Finally, some thoughts on future work are included, together with an answer to the question of whether annotated datasets are still needed.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Kostek, B. (2022). Intelligent Audio Signal Processing – Do We Still Need Annotated Datasets? In: Nguyen, N.T., Tran, T.K., Tukayev, U., Hong, T.P., Trawiński, B., Szczerbicki, E. (eds.) Intelligent Information and Database Systems. ACIIDS 2022. Lecture Notes in Computer Science, vol. 13758. Springer, Cham. https://doi.org/10.1007/978-3-031-21967-2_55
Print ISBN: 978-3-031-21966-5
Online ISBN: 978-3-031-21967-2