Computer Science > Computer Vision and Pattern Recognition
[Submitted on 18 Jan 2024 (v1), last revised 18 Jul 2024 (this version, v2)]
Title: Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation
Abstract: The increasing demand for intelligent systems capable of interpreting and reasoning about visual content calls for large Vision-and-Language Models (VLMs) that are not only accurate but also capable of explicit reasoning. This paper presents a novel approach to developing a VLM that can conduct explicit reasoning grounded in visual content and textual instructions. We introduce a system that can ask questions to acquire necessary knowledge, thereby enhancing the robustness and explainability of the reasoning process. To this end, we construct a novel dataset, generated by a Large Language Model (LLM), designed to promote chain-of-thought reasoning combined with a question-asking mechanism. The dataset covers a range of tasks, from common ones such as caption generation to specialized VQA tasks that require expert knowledge. Furthermore, we fine-tune an existing VLM on this dataset, enabling the model to generate questions and perform iterative reasoning during inference. The results demonstrate a stride toward a more robust, accurate, and interpretable VLM, capable of reasoning explicitly and seeking information proactively when confronted with ambiguous visual input.
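The inference procedure the abstract describes (a reasoning chain that can pause to ask a question, receive an answer, and continue) can be pictured with a short loop. The sketch below is an assumption based only on the abstract; `vlm_step`, `knowledge_source`, and the `QUESTION:`/`ANSWER:`/`FINAL:` markers are hypothetical stand-ins, not the authors' actual interface.

```python
# Minimal sketch (an assumption, not the paper's code) of iterative
# reasoning with question asking: the fine-tuned VLM reasons step by step
# and may emit a question; an external knowledge source answers it, and the
# answer is appended to the context before the next reasoning step.

QUESTION_TAG = "QUESTION:"  # assumed marker for a generated question
ANSWER_TAG = "ANSWER:"      # assumed marker for the returned answer
FINAL_TAG = "FINAL:"        # assumed marker for the final response

def vlm_step(image, context):
    """Stand-in for one decoding pass of the fine-tuned VLM.
    Here it scripts a fixed two-step trace for demonstration."""
    if QUESTION_TAG not in context:
        return f"{QUESTION_TAG} What breed is the dog in the image?"
    return f"{FINAL_TAG} The image shows a Shiba Inu resting on grass."

def knowledge_source(question):
    """Stand-in for whatever answers the model's questions
    (a retriever, a human, or another model)."""
    return "It appears to be a Shiba Inu."

def iterative_reasoning(image, instruction, max_steps=5):
    """Run the reason-ask-answer loop until a final answer appears
    or the step budget is exhausted."""
    context = instruction
    for _ in range(max_steps):
        step = vlm_step(image, context)
        context += "\n" + step
        if step.startswith(FINAL_TAG):
            return step[len(FINAL_TAG):].strip(), context
        if step.startswith(QUESTION_TAG):
            question = step[len(QUESTION_TAG):].strip()
            context += f"\n{ANSWER_TAG} {knowledge_source(question)}"
    return None, context  # no final answer within the step budget

answer, trace = iterative_reasoning(image=None, instruction="Describe the image.")
print(answer)
```

In a real system, the loop would keep appending generated questions and the acquired answers to the model's context until a final answer is produced, which is what makes the reasoning trace both interpretable and able to recover from ambiguous visual input.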
Submission history
From: Kohei Uehara
[v1] Thu, 18 Jan 2024 14:21:56 UTC (1,789 KB)
[v2] Thu, 18 Jul 2024 02:35:30 UTC (6,527 KB)
References & Citations
Bibliographic and Citation Tools
Bibliographic Explorer (What is the Explorer?)
Connected Papers (What is Connected Papers?)
Litmaps (What is Litmaps?)
scite Smart Citations (What are Smart Citations?)
Code, Data and Media Associated with this Article
alphaXiv (What is alphaXiv?)
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub (What is DagsHub?)
Gotit.pub (What is GotitPub?)
Hugging Face (What is Huggingface?)
Papers with Code (What is Papers with Code?)
ScienceCast (What is ScienceCast?)
Demos
Recommenders and Search Tools
Influence Flower (What are Influence Flowers?)
CORE Recommender (What is CORE?)
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.