Abstract
Unlike visual captioning, which describes a single image concretely, visual storytelling aims to generate an imaginative paragraph based on a deep understanding of a given image stream. It is more challenging because it requires inferring contextual relationships among the images. Intuitively, humans tend to tell a story around a central idea that is continually expressed as the storytelling unfolds. We therefore propose the Human-Like StoryTeller (HLST), a hierarchical neural network with a gated memory module that imitates the human storytelling process. First, we use a hierarchical decoder to integrate contextual information effectively. Second, we introduce a memory module that serves as the story's central idea to enhance the coherence of the generated stories; a multi-head attention mechanism with a self-adjusting query is employed to initialize the memory module by distilling the salient information from the visual semantic features. Finally, we equip the memory module with a gating mechanism to guide story generation dynamically: during generation, the information already expressed is erased from the memory under the control of read and write gates. Experimental results show that our approach significantly outperforms state-of-the-art (SOTA) methods.
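To make the gated memory described in the abstract more concrete, below is a minimal PyTorch-style sketch, not the authors' released implementation: the memory is initialized by multi-head attention over the visual semantic features using a learnable ("self-adjusting") query, and at each decoding step a read gate selects how much of the memory guides generation while a write gate erases the portion that has already been expressed. All class, method, and parameter names (GatedMemory, init_memory, step, dim, num_heads) are hypothetical, and the exact gate formulations are assumptions inferred from the abstract.

```python
import torch
import torch.nn as nn


class GatedMemory(nn.Module):
    """Minimal sketch of a gated memory that stores the story's central idea."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        # Learnable ("self-adjusting") query used to distill salient
        # information from the visual semantic features into the memory.
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.read_gate = nn.Linear(2 * dim, dim)
        self.write_gate = nn.Linear(2 * dim, dim)

    def init_memory(self, feats):
        # feats: (batch, num_images, dim) visual semantic features of the stream.
        q = self.query.expand(feats.size(0), -1, -1)
        mem, _ = self.attn(q, feats, feats)   # multi-head attention read-out
        return mem.squeeze(1)                 # (batch, dim) initial memory

    def step(self, mem, hidden):
        # hidden: current decoder hidden state, (batch, dim).
        gate_in = torch.cat([mem, hidden], dim=-1)
        read = torch.sigmoid(self.read_gate(gate_in))
        context = read * mem                  # memory content guiding this step
        write = torch.sigmoid(self.write_gate(gate_in))
        mem = (1.0 - write * read) * mem      # erase what has just been expressed
        return context, mem
```

In this reading, a hierarchical decoder would call init_memory once per image stream and step once per generated sentence, feeding the returned context into its hidden state so that later sentences rely less on information the story has already covered.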
L. Zhang and Y. Kong—Equal contribution.
Acknowledgement
This research is supported by the National Key R&D Program of China (Nos. 2017YFC0820700 and 2018YFB1004700).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, L. et al. (2021). Human-Like Storyteller: A Hierarchical Network with Gated Memory for Visual Storytelling. In: Paszynski, M., Kranzlmüller, D., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds.) Computational Science – ICCS 2021. Lecture Notes in Computer Science, vol. 12743. Springer, Cham. https://doi.org/10.1007/978-3-030-77964-1_21
DOI: https://doi.org/10.1007/978-3-030-77964-1_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-77963-4
Online ISBN: 978-3-030-77964-1