Textbook Question Answering with Multi-type Question Learning and Contextualized Diagram Representation

He, Jianwei; Fu, Xianghua; Long, Zi; Wang, Shuxin; Liang, Chaojie; Lin, Hongbin

doi:10.1007/978-3-030-86380-7_8

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12894)

Included in the following conference series:

International Conference on Artificial Neural Networks

Abstract

Textbook question answering (TQA) is a multi-modal task that requires complex parsing and reasoning over scientific diagrams and long text to answer various types of questions, including true/false questions, reading comprehension, and diagram questions, making TQA a superset of question answering (QA) and visual question answering (VQA). In this paper, we introduce a Multi-Head TQA architecture (MHTQA) for solving the TQA task. To overcome the long-text issue, we apply the open-source search engine Solr to select sentences from lesson essays. To answer questions that have different input formats while sharing knowledge across them, we build a bottom-shared model with a transformer and three QA networks. For diagram questions, previous approaches did not incorporate the textual context when producing the diagram representation, resulting in insufficient use of the diagram's semantic information. To address this issue, we learn a contextualized diagram representation through the novel Contextualized Iterative Dual Fusion network (CIDF), using the visual and semantic features of the diagram image and the lesson essays. We jointly train on different types of questions in a multi-task learning manner for knowledge sharing, using the efficient sampling strategy of Multi-type Question Learning (MQL). The experimental results show that our model outperforms the existing single model on all question types, with accuracy margins of 4.6%, 1.7%, 1%, and 1.9% on Text T/F, Text MC, Diagram, and overall, respectively.
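
The abstract does not spell out how the MQL sampling strategy works. As a rough illustration of what sampling across question types during multi-task training can look like, the sketch below picks the question type for each training step with probability proportional to its dataset size. This size-proportional rule, the function names, and the dataset sizes are all assumptions made for illustration; they are not taken from the paper and may differ from the authors' strategy.

```python
import random
from collections import Counter

# Hypothetical training-set sizes per question type (illustrative only,
# not figures from the paper).
TASK_SIZES = {"text_tf": 1000, "text_mc": 3000, "diagram": 6000}


def make_task_sampler(task_sizes, seed=0):
    """Return a function that picks the question type for the next batch.

    Assumption: each type is drawn with probability proportional to its
    dataset size. The paper's actual MQL strategy may differ.
    """
    rng = random.Random(seed)
    names = list(task_sizes)
    weights = [task_sizes[name] for name in names]

    def next_task():
        # random.choices returns a list of k items; take the single draw.
        return rng.choices(names, weights=weights, k=1)[0]

    return next_task


def schedule(task_sizes, num_steps, seed=0):
    """Produce the sequence of question types for num_steps training steps."""
    sampler = make_task_sampler(task_sizes, seed)
    return [sampler() for _ in range(num_steps)]


if __name__ == "__main__":
    steps = schedule(TASK_SIZES, num_steps=10_000)
    # Larger question types should appear more often in the schedule.
    print(Counter(steps))
```

In a bottom-shared setup, each sampled type would route a batch through the shared encoder and then through the QA network for that type, so every step updates the shared parameters regardless of which type is drawn.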

Acknowledgments

This work was supported by the Stable Support Projects for Shenzhen Higher Education Institutions (SZWD2021011) and the Scientific Research Platforms and Projects in Universities in Guangdong Province (2019KTSCX204).

Author information

Authors and Affiliations

Shenzhen Technology University, Shenzhen, China
Jianwei He, Xianghua Fu, Zi Long, Shuxin Wang, Chaojie Liang & Hongbin Lin
Shenzhen University, Shenzhen, China
Jianwei He, Chaojie Liang & Hongbin Lin

Corresponding author

Correspondence to Xianghua Fu.

Editor information

Editors and Affiliations

Comenius University in Bratislava, Bratislava, Slovakia
Igor Farkaš
iMotions A/S, Copenhagen, Denmark
Paolo Masulli
University of Tübingen, Tübingen, Baden-Württemberg, Germany
Sebastian Otte
Universität Hamburg, Hamburg, Germany
Stefan Wermter

About this paper

Cite this paper

He, J., Fu, X., Long, Z., Wang, S., Liang, C., Lin, H. (2021). Textbook Question Answering with Multi-type Question Learning and Contextualized Diagram Representation. In: Farkaš, I., Masulli, P., Otte, S., Wermter, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2021. Lecture Notes in Computer Science, vol 12894. Springer, Cham. https://doi.org/10.1007/978-3-030-86380-7_8

DOI: https://doi.org/10.1007/978-3-030-86380-7_8
Published: 07 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86379-1
Online ISBN: 978-3-030-86380-7
eBook Packages: Computer Science, Computer Science (R0)
