Audio style transfer using shallow convolutional networks and random filters

Chen, Jiyou; Yang, Gaobo; Zhao, Huihuang; Ramasamy, Manimaran

doi:10.1007/s11042-020-08798-6

Published: 06 April 2020

Volume 79, pages 15043–15057 (2020)

Multimedia Tools and Applications

Jiyou Chen¹,², Gaobo Yang¹, Huihuang Zhao² & Manimaran Ramasamy²

1000 Accesses
9 Citations

Abstract

Since the advent of Convolutional Neural Networks (CNNs), neural style transfer on images has become a very active research topic: a CNN can transfer the style of one image to another, so that the result retains its own content while taking on the style of the other image. In this work, we propose an algorithm for audio style transfer that uses the power of CNNs to generate new audio from a style audio. We use the Continuous Wavelet Transform (CWT) to convert the audio into a spectrogram, treat the spectrogram as an image representation of the audio, apply an image style transfer method to obtain a new image, and finally generate audio using iterative phase reconstruction with the Griffin-Lim algorithm. We succeeded in transferring audio such as light music but had difficulty transferring audio that has lyrics and high-level characteristics such as emotion or tone. We propose several measures to improve audio quality, and extensive experimental results show that our method is better than other methods in terms of sound quality.
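
The abstract does not spell out the style objective, so the following is only a minimal sketch of the idea named in the title: a single convolutional layer with untrained random filters, followed by a Gram-matrix comparison of the filter responses. It assumes a Gram-matrix style loss in the manner of image style transfer, a spectrogram laid out as frequency bins (channels) by time frames, and 1-D filtering along time. The function names, filter count, filter width, and the synthetic spectrograms are all hypothetical choices for illustration and are not taken from the article.

```python
import random

def random_filters(n_filters, n_channels, width, seed=0):
    """Draw untrained Gaussian filters, shape [n_filters][n_channels][width]."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(width)]
             for _ in range(n_channels)]
            for _ in range(n_filters)]

def conv1d_relu(spec, filters):
    """Valid 1-D cross-correlation along time, followed by ReLU.

    spec: [n_channels][n_frames] magnitude spectrogram (assumed layout).
    Returns features of shape [n_filters][n_frames - width + 1].
    """
    n_channels, n_frames = len(spec), len(spec[0])
    width = len(filters[0][0])
    out = []
    for f in filters:
        row = []
        for t in range(n_frames - width + 1):
            acc = 0.0
            for c in range(n_channels):
                for k in range(width):
                    acc += f[c][k] * spec[c][t + k]
            row.append(max(acc, 0.0))  # ReLU
        out.append(row)
    return out

def gram(features):
    """Gram matrix of filter responses, averaged over time steps."""
    n_steps = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n_steps for fj in features]
            for fi in features]

def style_loss(spec_a, spec_b, filters):
    """Sum of squared differences between the two Gram matrices."""
    ga = gram(conv1d_relu(spec_a, filters))
    gb = gram(conv1d_relu(spec_b, filters))
    return sum((x - y) ** 2 for ra, rb in zip(ga, gb) for x, y in zip(ra, rb))

# Synthetic toy spectrograms (4 frequency bins, 16 frames), not real audio.
rng = random.Random(1)
style = [[abs(rng.gauss(0, 1)) for _ in range(16)] for _ in range(4)]
other = [[abs(rng.gauss(0, 1)) for _ in range(16)] for _ in range(4)]
filters = random_filters(n_filters=8, n_channels=4, width=3)

loss_same = style_loss(style, style, filters)   # identical inputs
loss_diff = style_loss(style, other, filters)   # different inputs
print(loss_same, loss_diff)
```

In a full pipeline of the kind the abstract describes, a loss like this would be minimized with respect to the generated spectrogram, and the result would then be passed to a phase reconstruction step such as Griffin-Lim; neither step is shown here.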

References

Aytar Y, Vondrick C, Torralba A (2016) Soundnet: Learning sound representations from unlabeled video. Advances in Neural Information Processing Systems:892–900
Barry S, Kim Y (2018) Style transfer for musical audio using multiple time-frequency representations. Unpublished article available at: https://tinyurl.com/y7nu7r9s
Brunner G, Konrad A, Wang Y et al (2018) MIDI-VAE: Modeling dynamics and instrumentation of music with applications to style transfer. arXiv preprint arXiv:1809.07600
Ephrat A, Mosseri I, Lang O et al (2018) Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. ACM T Graphic. https://doi.org/10.1145/3197517.3201357
Gatys LA, Ecker AS, Bethge M (2015) A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576
Gatys LA, Ecker AS, Bethge M (2016) Image style transfer using convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition:2414–2423
Giurgiutiu V, Yu L (2003) Comparison of short-time Fourier transform and wavelet transform of transient and tone burst wave propagation signals for structural health monitoring. Proceedings of 4th International Workshop on Structural Health Monitoring:1267–1274
Griffin D, Lim J (1984) Signal estimation from modified short-time Fourier transform. IEEE Trans Acoust Speech Signal Process 32(2):236–243
Grinstein E, Duong NQK, Ozerov A et al (2018) Audio style transfer. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE:586–590
He K, Wang Y, Hopcroft J (2016) A powerful generative model using random weights for the deep image representation. Advances in Neural Information Processing Systems:631–639
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems:1097–1105
Lu H, Li Y, Chen M, Kim H, Serikawa S (2018) Brain intelligence: go beyond artificial intelligence. Mobile Networks and Applications 23:368–375
Lu H, Li Y, Uemura T et al (2018) FDCNet: filtering deep convolutional network for marine organism classification. Multimed Tools Appl 77(17):21847–21860
Lu H, Li Y, Uemura T, Kim H, Serikawa S (2018) Low illumination underwater light field images reconstruction using deep convolutional neural networks. Futur Gener Comput Syst 82:142–148
Lu H, Li Y, Mu S, Wang D, Kim H, Serikawa S (2018) Motor anomaly detection for unmanned aerial vehicles using reinforcement learning. IEEE Internet Things J 5(4):2315–2322
Lu H, Wang D, Li Y et al (2019) CONet: a cognitive ocean network. IEEE Wireless Communications 26(3):90–96
Mital PK (2017) Time domain neural audio style transfer. arXiv preprint arXiv:1711.11160
Nash J (1951) Non-cooperative games. Annals of Mathematics (Second Series) 54(2):286–295
Shih Y, Paris S, Durand F et al (2013) Data-driven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics (TOG) 32(6):200
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Ulyanov D, Lebedev V (2016) Audio texture synthesis and style transfer. https://dmitryulyanov.github.io/audio-texture-synthesis-and-style-transfer
Ustyuzhaninov I, Brendel W, Gatys LA et al (2016) Texture synthesis using shallow convolutional networks with random filters. arXiv preprint arXiv:1606.00021
Verma P, Smith JO (2018) Neural style transfer for audio spectrograms. arXiv preprint arXiv:1801.01589
Wyse L (2017) Audio spectrogram representations for processing with convolutional neural networks. arXiv preprint arXiv:1706.09559
Xu X, He L, Shimada A et al (2016) Learning unified binary codes for cross-modal retrieval via latent semantic hashing. Neurocomputing 213:191–203
Xu X, Shen F, Yang Y et al (2017) Learning discriminative binary codes for large-scale cross-modal retrieval. IEEE Transactions on Image Processing 26(5):2494–2507
Xu X, Zhou X, Shen F et al (2019) Fusion by synthesizing: a multi-view deep neural network for zero-shot recognition. Signal Processing 164:354–367
Zhu C, Byrd RH, Lu P et al (1997) Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS) 23(4):550–560

Acknowledgements

This work was supported by the National Natural Science Foundation of China (61503128, 61772179), the Science and Technology Plan Project of Hunan Province (2016TP1020), the Scientific Research Fund of Hunan Provincial Education Department (16C0226, 17C0223, and 18A333), and the Scientific Research Fund of Hunan Provincial Key Laboratory of Intelligent Information Processing and Application (IIPA19K05). We would like to thank NVIDIA for the GPU donation.

Author information

Authors and Affiliations

College of Information Science and Engineering, Hunan University, Changsha, 410082, China
Jiyou Chen & Gaobo Yang
Hunan Provincial Key Laboratory of Intelligent Information Processing and Application, Hengyang Normal University, Hengyang, 421002, China
Jiyou Chen, Huihuang Zhao & Manimaran Ramasamy

Corresponding author

Correspondence to Huihuang Zhao.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Chen, J., Yang, G., Zhao, H. et al. Audio style transfer using shallow convolutional networks and random filters. Multimed Tools Appl 79, 15043–15057 (2020). https://doi.org/10.1007/s11042-020-08798-6

Received: 19 May 2019
Revised: 05 January 2020
Accepted: 28 February 2020
Published: 06 April 2020
Issue Date: June 2020
DOI: https://doi.org/10.1007/s11042-020-08798-6
