Abstract
Methods for identifying a cover song typically compare the chroma features of the query song with those of every other song in the data set. These pairwise comparisons require considerable time. In addition, most songs in a data set are stored in a compressed format to save disk space. To avoid part of the decoding procedure, this study extracted music information directly from the modified discrete cosine transform (MDCT) coefficients of advanced audio coding (AAC) and mapped these coefficients to 12-dimensional chroma features. The chroma features were segmented to preserve the melodies. Each chroma feature segment was then learned by a sparse autoencoder, a deep learning architecture of artificial neural networks, which transformed the chroma features into an intermediate representation of reduced dimension. Experimental results on the covers80 data set showed that the mean reciprocal rank increased to 0.5 and the matching time was reduced by over 94% compared with traditional approaches.
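To make two ideas from the abstract concrete, the following is a minimal Python sketch of folding one frame of MDCT coefficients into a 12-dimensional chroma vector and of computing the mean reciprocal rank. It is not the authors' implementation: the bin-centre frequency formula, the 27.5 to 4200 Hz range, the 440 Hz reference, the energy accumulation, and all function names are assumptions chosen for illustration. The 1024-coefficient frame corresponds to an AAC long window.

```python
import math


def mdct_bin_to_pitch_class(k, n_bins=1024, sample_rate=44100.0, ref_hz=440.0):
    """Map MDCT bin k to a pitch class 0..11 (0 = A), or None if out of range."""
    # Assumed centre frequency of bin k for an MDCT with n_bins coefficients.
    freq = (k + 0.5) * sample_rate / (2.0 * n_bins)
    # Assumption: keep only bins roughly inside the piano range.
    if freq < 27.5 or freq > 4200.0:
        return None
    semitones_from_ref = 12.0 * math.log2(freq / ref_hz)
    return int(round(semitones_from_ref)) % 12


def chroma_from_mdct(coeffs, sample_rate=44100.0):
    """Fold the energy of one frame of MDCT coefficients into 12 chroma bins."""
    chroma = [0.0] * 12
    n_bins = len(coeffs)
    for k, c in enumerate(coeffs):
        pitch_class = mdct_bin_to_pitch_class(k, n_bins, sample_rate)
        if pitch_class is not None:
            chroma[pitch_class] += c * c  # accumulate energy per pitch class
    peak = max(chroma)
    # Normalise so the strongest pitch class equals 1 (assumed normalisation).
    return [v / peak for v in chroma] if peak > 0 else chroma


def mean_reciprocal_rank(ranks):
    """Average of 1/rank, where rank is the 1-based position of the
    first correct cover in each query's ranked result list."""
    return sum(1.0 / r for r in ranks) / len(ranks)


if __name__ == "__main__":
    frame = [0.0] * 1024
    frame[20] = 3.0  # bin centred near 441 Hz, close to A4
    print(chroma_from_mdct(frame))
    print(mean_reciprocal_rank([1, 2, 4]))
```

Under this definition, a mean reciprocal rank of 0.5 is the value obtained if the correct cover were returned in second place for every query.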
Cite this article
Fang, JT., Chang, YR. & Chang, PC. Deep learning of chroma representation for cover song identification in compression domain. Multidim Syst Sign Process 29, 887–902 (2018). https://doi.org/10.1007/s11045-017-0476-x
DOI: https://doi.org/10.1007/s11045-017-0476-x