Abstract
Spectrograms provide rich feature representations of music data, and significant progress has been made in music classification using spectrograms and Convolutional Neural Networks (CNNs). However, the softmax loss commonly used in existing CNNs lacks sufficient discriminative power for deep music features. To overcome this limitation, we propose a Combined Angular Margin and Cosine Margin Softmax Loss (AMCM-Softmax) that enhances intra-class compactness and inter-class discrepancy simultaneously. Specifically, the weight vectors and feature vectors are normalized to eliminate radial variations. Then, an angular margin parameter and a cosine margin parameter are introduced to maximize the decision margin by enforcing angular and cosine margin constraints. Consequently, feature discrimination is enhanced through normalization and margin maximization. The decision boundary and the target logit curve of AMCM-Softmax admit a clear geometric interpretation. Extensive experiments on music datasets show that AMCM-Softmax consistently outperforms current state-of-the-art approaches in classifying genre and emotion. Our work also shows that a margin-based loss function can improve performance and can be incorporated into advanced CNN models for music classification.
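The abstract does not give the exact parameterization, but losses that combine an additive angular margin with an additive cosine margin are commonly written with the target-class logit s(cos(θ + m1) − m2), where θ is the angle between the normalized feature and the normalized class weight. The NumPy sketch below illustrates that general form under this assumption; the function name, margin values, and scale s are illustrative, not taken from the paper.

```python
import numpy as np

def amcm_softmax_loss(features, weights, labels, s=30.0, m1=0.2, m2=0.1):
    """Sketch of a combined angular + cosine margin softmax loss.

    Assumes the target logit takes the form s * (cos(theta + m1) - m2);
    the paper's exact parameterization may differ.
    """
    # L2-normalize features (rows) and class weights (columns) to remove
    # radial variations, so logits depend only on the angle theta.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos_theta = f @ w                        # (batch, num_classes), in [-1, 1]

    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    target_logit = np.cos(theta + m1) - m2   # angular + cosine margins

    # Apply the margin only to the true-class logit, then scale by s.
    logits = s * cos_theta
    rows = np.arange(len(labels))
    logits[rows, labels] = s * target_logit[rows, labels]

    # Standard cross-entropy over the margin-adjusted logits.
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[rows, labels].mean()
```

Setting m1 = m2 = 0 recovers the plain normalized softmax loss; with positive margins the true-class logit is strictly reduced, which forces the network to learn more compact, better-separated features.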
Acknowledgements
This work was supported by the Jiangsu Provincial Key Constructive Laboratory for Big Data of Psychology and Cognitive Science under Grant No. 72592062003G, the Natural Science Foundation of the Colleges and Universities in Anhui Province of China under Grant No. KJ2020A0035 and No. KJ2021A0640, and the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA).
Funding
Jiangsu Provincial Key Constructive Laboratory for Big Data of Psychology and Cognitive Science, Grant No. 72592062003G (Xiaofeng Yuan)
Natural Science Foundation of the Colleges and Universities in Anhui Province of China, Grant No. KJ2020A0035 (Yi Yang)
Natural Science Foundation of the Colleges and Universities in Anhui Province of China, Grant No. KJ2021A0640 (Yang Wang)
Hong Kong Innovation and Technology Commission, InnoHK Project CIMDA (Hong Yan)
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Li, J., Han, L., Wang, Y. et al. Combined angular margin and cosine margin softmax loss for music classification based on spectrograms. Neural Comput & Applic 34, 10337–10353 (2022). https://doi.org/10.1007/s00521-022-06896-0