
Human-Like Storyteller: A Hierarchical Network with Gated Memory for Visual Storytelling

  • Conference paper
  • In: Computational Science – ICCS 2021 (ICCS 2021)

Abstract

Unlike visual captioning, which describes a single image concretely, visual storytelling aims to generate an imaginative paragraph from a deep understanding of a given image stream. It is more challenging because it requires inferring the contextual relationships among images. Intuitively, humans tend to tell a story around a central idea that is expressed continually as the storytelling unfolds. We therefore propose the Human-Like StoryTeller (HLST), a hierarchical neural network with a gated memory module that imitates the human storytelling process. First, we use a hierarchical decoder to integrate contextual information effectively. Second, we introduce a memory module that represents the story's central idea and enhances the coherence of the generated stories; a multi-head attention mechanism with a self-adjusting query initializes this memory by distilling the salient information from the visual-semantic features. Finally, we equip the memory module with a gating mechanism that guides story generation dynamically: during generation, information already expressed is erased from the memory under the control of read and write gates. Experimental results indicate that our approach significantly outperforms state-of-the-art (SOTA) methods.

L. Zhang and Y. Kong—Equal contribution.
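The page carries no implementation detail beyond the abstract, but the mechanism it describes, a memory vector initialized by multi-head attention over the visual-semantic features and then gated during decoding, can be sketched. The PyTorch module below (PyTorch is the framework named in note 2) is a hypothetical reconstruction: the name GatedMemory, the learned self-adjusting query, the dimensions, and the exact gate equations are all assumptions, not the authors' published code.

```python
import torch
import torch.nn as nn


class GatedMemory(nn.Module):
    """Hypothetical sketch of HLST's gated memory, reconstructed from the abstract.

    The memory is initialized with multi-head attention over the visual-semantic
    features using a learned ("self-adjusting") query, then partially erased and
    rewritten at each decoding step under the control of read and write gates.
    """

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        # Learned query that adjusts itself during training (assumed form).
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.read_gate = nn.Linear(2 * dim, dim)
        self.write_gate = nn.Linear(2 * dim, dim)

    def init_memory(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, n_images, dim) features of the image stream.
        q = self.query.expand(visual_feats.size(0), -1, -1)
        memory, _ = self.attn(q, visual_feats, visual_feats)
        return memory.squeeze(1)  # (batch, dim): the distilled "central idea"

    def step(self, memory: torch.Tensor, hidden: torch.Tensor):
        # hidden: (batch, dim) current decoder state.
        joint = torch.cat([memory, hidden], dim=-1)
        read = torch.sigmoid(self.read_gate(joint))    # how much memory to expose
        write = torch.sigmoid(self.write_gate(joint))  # how much memory to erase
        context = read * memory                        # guidance fed to the decoder
        # Erase information that has already been expressed; write in the new state.
        memory = (1.0 - write) * memory + write * torch.tanh(hidden)
        return context, memory
```

In this reading, a hierarchical decoder would call init_memory once per photo stream and step at every sentence (or word) step, feeding context back as extra input; the write gate implements the abstract's claim that expressed information is erased from the memory as the story unfolds. The actual paper may parameterize the gates differently.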



Notes

  1. https://github.com/lichengunc/vist_eval.

  2. https://pytorch.org/.


Acknowledgement

This research was supported by the National Key R&D Program of China (Nos. 2017YFC0820700 and 2018YFB1004700).

Author information

Corresponding author

Correspondence to Cong Cao.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Zhang, L., et al. (2021). Human-Like Storyteller: A Hierarchical Network with Gated Memory for Visual Storytelling. In: Paszynski, M., Kranzlmüller, D., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds) Computational Science – ICCS 2021. ICCS 2021. Lecture Notes in Computer Science, vol 12743. Springer, Cham. https://doi.org/10.1007/978-3-030-77964-1_21

  • DOI: https://doi.org/10.1007/978-3-030-77964-1_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-77963-4

  • Online ISBN: 978-3-030-77964-1

