Abstract
With the rise of Convolutional Neural Networks (CNNs), neural style transfer on images has become a very active research topic: a CNN can transfer the style of one image to another so that the result retains its own content while taking on the style of the other image. In this work, we propose an algorithm for audio style transfer that uses the power of CNNs to generate new audio from a style audio. We use the Continuous Wavelet Transform (CWT) to convert the audio into a spectrogram, treat the spectrogram as an image representation of the audio, apply an image style transfer method to obtain a new image, and finally generate audio using iterative phase reconstruction with the Griffin-Lim algorithm. We succeeded in transferring audio such as light music but had difficulty transferring audio that has lyrics or high-level characteristics such as emotion or tone. We propose several measures to improve audio quality, and extensive experimental results show that our method outperforms other methods in terms of sound quality.
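The style-matching step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single layer of untrained random 1-D filters applied along the time axis of a magnitude spectrogram (frequency bins treated as input channels) and a Gram-matrix style loss in the manner of image style transfer. The function names, filter count, and filter width are hypothetical, and any 2-D magnitude array, whether from a CWT or another transform, can stand in for the spectrogram.

```python
import numpy as np

def random_conv_features(spec, n_filters=64, width=11, seed=0):
    """One layer of random (untrained) 1-D filters along time, then ReLU.

    spec: array of shape (n_bins, n_frames); frequency bins act as channels.
    Returns features of shape (n_filters, n_frames - width + 1).
    """
    rng = np.random.default_rng(seed)
    n_bins, _ = spec.shape
    # Assumed initialization: zero-mean Gaussian scaled by the fan-in.
    std = np.sqrt(2.0 / (n_bins * width))
    filters = rng.normal(0.0, std, size=(n_filters, n_bins, width))
    # Sliding windows over time: shape (n_bins, n_out, width).
    windows = np.lib.stride_tricks.sliding_window_view(spec, width, axis=1)
    # Valid (unpadded) cross-correlation of every filter with every window.
    feats = np.einsum("fbw,btw->ft", filters, windows)
    return np.maximum(feats, 0.0)

def gram_matrix(feats):
    """Time-averaged correlations between filter responses."""
    return feats @ feats.T / feats.shape[1]

def style_loss(spec_a, spec_b, **kwargs):
    """Mean squared difference between the Gram matrices of two spectrograms.

    The same seed is used for both inputs, so both see identical filters.
    """
    g_a = gram_matrix(random_conv_features(spec_a, **kwargs))
    g_b = gram_matrix(random_conv_features(spec_b, **kwargs))
    return float(np.mean((g_a - g_b) ** 2))
```

In a full pipeline this loss would be minimized with respect to the generated spectrogram, and the resulting magnitudes would then be inverted to a waveform with an iterative phase reconstruction such as Griffin-Lim. Because the Gram matrix averages over time, spectrograms of different lengths can be compared directly.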










References
Aytar Y, Vondrick C, Torralba A (2016) Soundnet: Learning sound representations from unlabeled video[C]. Advances in Neural Information Processing Systems:892–900
Barry S, Kim Y (2018) Style transfer for musical audio using multiple time-frequency representations. Unpublished article available at: https://tinyurl.com/y7nu7r9s
Brunner G, Konrad A, Wang Y et al (2018) MIDI-VAE: Modeling dynamics and instrumentation of music with applications to style transfer[J]. arXiv preprint arXiv:1809.07600
Ephrat A, Mosseri I, Lang O et al (2018) Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. ACM T Graphic. https://doi.org/10.1145/3197517.3201357
Gatys LA, Ecker AS, Bethge M (2015) A neural algorithm of artistic style[J]. arXiv preprint arXiv:1508.06576
Gatys LA, Ecker AS, Bethge M (2016) Image style transfer using convolutional neural networks[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition:2414–2423
Giurgiutiu V, Yu L (2003) Comparison of short-time Fourier transform and wavelet transform of transient and tone burst wave propagation signals for structural health monitoring[C]. Proceedings of 4th International Workshop on Structural Health Monitoring:1267–1274
Griffin D, Lim J (1984) Signal estimation from modified short-time Fourier transform[J]. IEEE Trans Acoust Speech Signal Process 32(2):236–243
Grinstein E, Duong NQK, Ozerov A et al (2018) Audio style transfer[C]//2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE:586–590
He K, Wang Y, Hopcroft J (2016) A powerful generative model using random weights for the deep image representation[C]. Advances in Neural Information Processing Systems:631–639
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks[C]. Advances in Neural Information Processing Systems:1097–1105
Lu H, Li Y, Chen M, Kim H, Serikawa S (2018) Brain intelligence: go beyond artificial intelligence. Mobile Networks and Applications 23:368–375
Lu H, Li Y, Uemura T et al (2018) FDCNet: filtering deep convolutional network for marine organism classification[J]. Multimed Tools Appl 77(17):21847–21860
Lu H, Li Y, Uemura T, Kim H, Serikawa S (2018) Low illumination underwater light field images reconstruction using deep convolutional neural networks. Futur Gener Comput Syst 82:142–148
Lu H, Li Y, Mu S, Wang D, Kim H, Serikawa S (2018) Motor anomaly detection for unmanned aerial vehicles using reinforcement learning. IEEE Internet Things J 5(4):2315–2322
Lu H, Wang D, Li Y et al (2019) CONet: a cognitive ocean network[J]. IEEE Wireless Communications 26(3):90–96
Mital PK (2017) Time domain neural audio style transfer[J]. arXiv preprint arXiv:1711.11160
Nash J (1951) Non-cooperative games[J]. Annals of Mathematics (Second Series) 54(2):286–295
Shih Y, Paris S, Durand F et al (2013) Data-driven hallucination of different times of day from a single outdoor photo[J]. ACM Transactions on Graphics (TOG) 32(6):200
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556
Ulyanov D, Lebedev V (2016) Audio texture synthesis and style transfer[J]. https://dmitryulyanov.github.io/audio-texture-synthesis-and-style-transfer
Ustyuzhaninov I, Brendel W, Gatys LA et al (2016) Texture synthesis using shallow convolutional networks with random filters[J]. arXiv preprint arXiv:1606.00021
Verma P, Smith JO (2018) Neural style transfer for audio spectrograms[J]. arXiv preprint arXiv:1801.01589
Wyse L (2017) Audio spectrogram representations for processing with convolutional neural networks[J]. arXiv preprint arXiv:1706.09559
Xu X, He L, Shimada A et al (2016) Learning unified binary codes for cross-modal retrieval via latent semantic hashing[J]. Neurocomputing 213:191–203
Xu X, Shen F, Yang Y et al (2017) Learning discriminative binary codes for large-scale cross-modal retrieval[J]. IEEE Transactions on Image Processing 26(5):2494–2507
Xu X, Zhou X, Shen F et al (2019) Fusion by synthesizing: a multi-view deep neural network for zero-shot recognition[J]. Signal Processing 164:354–367
Zhu C, Byrd RH, Lu P et al (1997) Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization[J]. ACM Transactions on Mathematical Software (TOMS) 23(4):550–560
Acknowledgements
This work was supported by the National Natural Science Foundation of China (61503128, 61772179), the Science and Technology Plan Project of Hunan Province (2016TP1020), the Scientific Research Fund of Hunan Provincial Education Department (16C0226, 17C0223, and 18A333), and the Scientific Research Fund of Hunan Provincial Key Laboratory of Intelligent Information Processing and Application (IIPA19K05). We would like to thank NVIDIA for the GPU donation.
Cite this article
Chen, J., Yang, G., Zhao, H. et al. Audio style transfer using shallow convolutional networks and random filters. Multimed Tools Appl 79, 15043–15057 (2020). https://doi.org/10.1007/s11042-020-08798-6
DOI: https://doi.org/10.1007/s11042-020-08798-6