Abstract
In this paper, we propose a novel large-scale Visual Question Answering (VQA) dataset aimed at real-environment applications. Existing VQA datasets either require high construction labor costs or offer only limited power for evaluating the complicated scene-understanding abilities involved in VQA tasks. Moreover, most VQA datasets do not tackle scenes containing object occlusion, which can be crucial for real-world applications. In this work, we propose a synthetic multi-view VQA dataset along with a dataset generation process. We build our dataset from three real-object model datasets. Each scene is observed from multiple virtual cameras, so answering questions often requires multi-view scene understanding. Our dataset requires relatively low labor cost while containing highly complicated visual information. In addition, it can be further adapted to users’ requirements by extending the dataset setup. We evaluated two previous multi-view VQA methods on our dataset. The results show that both 3D understanding and appearance understanding are crucial to achieving high performance on our dataset, and there is still room for future methods to improve. Our dataset provides a possible way to bridge VQA methods developed on CG datasets with real-world applications, such as robot picking tasks.
1 Introduction
The ability to interact with human operators through natural language plays an essential role in real-environment Human-Robot Interaction (HRI) applications. For domestic robot applications, for example, this ability makes it possible to operate robots through natural language commands. For video surveillance systems, the ability to describe changes that occur in the videos through natural language can decrease labor costs significantly.
The visual question answering task is one of the key vision-and-language tasks: given an image and a question querying information about that image, the goal is to predict the answer. The VQA task allows a machine to observe its surroundings and offer the required information to human operators.
Existing VQA datasets are built either from crowd-sourced real photos [1,2,3] or from CG-generated images [4]. Real-image VQA datasets are relatively suitable for testing the image-understanding ability of VQA methods. However, constructing them requires high labor costs, they often suffer from human reporting biases, and they have limited ability to precisely diagnose individual VQA abilities. In contrast, CG VQA datasets require less labor but often consist only of simple geometric objects; thus, they are unsuitable for evaluating image-content understanding. Additionally, it is unclear how to adapt such CG VQA datasets to real-world applications. Moreover, most existing VQA datasets do not discuss scenes containing object occlusion. These issues make it difficult to adapt existing VQA datasets to real application situations.
To solve the above problems, we propose a multi-view VQA dataset consisting of real daily-supply objects in occluded scene settings. Our dataset is practical for training and evaluating the VQA abilities needed in real-environment applications, such as robot picking. Inspired by the dataset generation process introduced in [4], we propose a four-step dataset generation process. We first obtained real daily-supply object models from several previously published object model datasets. Next, we annotated the objects with their attributes and labels. Following that, we created scenes by randomly placing object models in simulated scenes and photographing the scenes from multiple virtual cameras. Finally, we generated question-answer pairs for each scene based on pre-defined question templates along with the recorded scene information. Our dataset is useful for evaluating a range of critical VQA abilities, such as multi-view scene understanding, hierarchical object recognition, attribute recognition, and counting. In addition, the automatic dataset generation process allows users to adapt our dataset to a new environment with additional objects and question types.
We evaluated two previously proposed multi-view VQA methods on our dataset. Compared to the previous synthetic VQA dataset [4], the experiment results show that our dataset is challenging for various question types, especially for spatial-related questions that require an understanding of object spatial relationships. The results also indicate that both 3D and appearance understanding could be critical for obtaining high performance on the dataset; thus, our dataset provides a useful testbed for future VQA research. Our dataset is suitable for training and evaluating various VQA skills aimed at real-world applications. The expandable automatic dataset generation process makes it possible to bridge VQA methods developed on CG datasets with various real-world applications.
2 Related Works
2.1 VQA Dataset
Real Image VQA Datasets. The VQA_v2 and GQA datasets are two representative, widely used real-image VQA datasets. VQA_v2 consists of crowd-sourced images and question-answer pairs. It contains images ranging from indoor to outdoor scenes, often with massive visual information, can be used for evaluating various VQA skills, and also serves as the evaluation dataset for VQA challenges [5]. However, its human-made nature means it contains human reporting biases [4]. Similar to VQA_v2, the GQA dataset is also built from a crowd-sourced image dataset, using the Visual Genome dataset [6] as its source. Detailed image information is recorded in scene graphs, and GQA generates question-answer pairs based on the recorded scene graphs and a pre-defined grammar; thus, it is relatively less biased. However, both datasets require high labor costs and cannot avoid latent human-centered biases.
CG Image VQA Datasets. CLEVR is one of the representative CG VQA datasets. CLEVR defines an automatic scene generation engine that generates scenes with randomly placed geometric objects, and proposes a question generation program that generates question-answer pairs based on the recorded scene information. CLEVR strictly controls dataset bias and provides detailed diagnosis of various VQA abilities. However, the use of purely geometric objects makes it ill-suited to evaluating complicated image understanding. Moreover, it cannot be directly adapted to real-world applications with complicated visual information.
Most existing VQA datasets do not tackle scenes with object occlusion, which is common in real-world applications. Based on the above, we propose a novel synthetic multi-view VQA dataset with more realistic and complicated objects than CLEVR and lower labor costs than conventional real-image VQA datasets. The multiple-virtual-camera setting makes our dataset suitable for training and evaluating VQA methods for real applications.
2.2 Object Models Dataset
The YCB [7], BigBIRD [8], and NEDO item database [9] are well-known scanned real-object datasets. YCB was created for robot manipulation research. It consists of daily-supply objects (e.g., hammer, tennis ball, bowl) with different shapes, sizes, and textures; the authors created the YCB object models with a high-resolution RGB-D scanner. BigBIRD has object classes and a construction process similar to YCB’s but contains more packaged food and bottle-shaped object models (e.g., shampoo, detergent). The NEDO item database consists of daily-supply object models with complete shapes; in terms of object classes, it contains more food classes and office supplies. We adapted the above three datasets into our dataset generation process by re-annotating object class labels and placing the objects in CG scenes. It is noteworthy that our dataset can be further extended by adding more object models.
2.3 Multi-view VQA
Conventional VQA methods [10,11,12] predict answers from single-view images and questions. However, single-view images are inadequate for answering questions in various situations, such as under object occlusion or severe lighting conditions. Qiu et al. [13] proposed a multi-view VQA framework that predicts answers in a multi-view image scene setting. The authors divided the multi-view VQA process into two separate components: multi-view image representation and question answering. In our work, we used the methods of Qiu et al. to benchmark performance on our dataset.
3 Real Object Multi-view CLEVR Dataset
3.1 Dataset Generation Process
Inspired by the dataset generation process introduced in [4], we propose a four-step dataset generation process. We show an overview of these four steps in Fig. 1. Each sample of our dataset consists of a CG scene observed from a multiple-virtual-camera setup and a question-answer pair about the scene contents. We started our dataset generation process by collecting 3D models of real daily-supply objects from three previously reported object model datasets. After the collecting process, we labeled the object models based on the WordNet [14] hierarchical definition and annotated attributes for the objects. Following that, we created CG scenes with those annotated objects using an automatic generation engine. Finally, we generated question-answer pairs based on a series of pre-defined templates and the recorded scene information. In the following sections, we describe these steps in detail.
3.2 Object Models Collecting
In order to obtain realistic object models, we selected three open-sourced datasets introduced in Sect. 2.2 as our object model sources: YCB, BigBIRD, and the NEDO item database. All three datasets consist of daily-supply object models, ranging from food, toys, and washing materials to kitchenware and sports equipment. The YCB and BigBIRD datasets were collected with an RGB-D scanner, which causes part of their models to have incomplete shapes, making them unsuitable for recognition-related tasks. In contrast, models in the NEDO item database have relatively complete shapes; however, a considerable portion of them are packed in boxes, which makes it difficult to recognize those objects from the models’ appearance. Considering the above problems, we removed models with incomplete shapes along with unrecognizable models packed in packages. After this step, we obtained a clean object model set with 134 object models in total.
3.3 Object Models Annotation
In order to generate meaningful question-answer pairs, we labeled the 134 object models with class labels and attributes. We followed the hierarchical object class definition in WordNet to label objects: for each object model, we inspected its appearance and assigned a leaf class label according to the WordNet hierarchical class structure. In addition, we added zero to three levels of inherited hypernyms in depth to further enable referring to objects through their hypernyms. For example, for the question “Are there any foods visible?”, if there is an apple, the answer will be “yes,” as “apple” is one of the hyponyms of “food.” We also labeled each object with its dominant color. After this step, we constructed a hierarchical class definition with a total of 62 classes, containing 35 leaf classes and 27 hypernym classes. We show one object annotation example in Fig. 2 (left) and the object instance distribution in Fig. 2 (right). All object classes, hypernyms, and their hierarchical relationships are shown in Fig. 3.
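To make the hypernym annotation concrete, the following is a minimal sketch of how up to three levels of hypernyms could be retrieved programmatically with NLTK's WordNet interface (assuming NLTK and its WordNet corpus are installed). It only illustrates the idea; the paper's 62-class hierarchy was assigned by inspecting each model's appearance, and the exact chains depend on the chosen word sense and WordNet version.

```python
# Sketch: look up to `max_depth` levels of WordNet hypernyms above a leaf label.
# Requires: pip install nltk; python -c "import nltk; nltk.download('wordnet')"
from nltk.corpus import wordnet as wn

def hypernym_chain(leaf_label, max_depth=3):
    """Return up to `max_depth` hypernym names above the first noun sense of the leaf label."""
    synset = wn.synsets(leaf_label, pos=wn.NOUN)[0]  # take the first noun sense
    chain = []
    for _ in range(max_depth):
        parents = synset.hypernyms()
        if not parents:
            break
        synset = parents[0]                  # follow the first hypernym path
        chain.append(synset.lemma_names()[0])
    return chain

# e.g. hypernym_chain("apple") walks apple -> edible_fruit -> ...
# (the exact chain depends on the WordNet version and the selected sense)
```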
3.4 Scene Generation
Our scene generation process is based on the CLEVR scene generation engine. In detail, we created a base scene containing a ground plane along with ambient and spotlight lighting. Various scenes can be generated automatically by placing objects on the ground plane.
While creating a single scene, we placed a random number of objects (from three to ten) on the ground plane and arranged them randomly without object intersections. Unlike the original CLEVR setting, we adopted the multi-view CLEVR setup proposed in [13] and placed four virtual cameras above the ground plane. These cameras take photos from evenly spaced viewpoints around the center of each scene. Moreover, in order to create scenes with high occlusion, we set a minimum pixel-count threshold and required each scene to contain objects that fall below this threshold from at least two camera viewpoints. This setup makes our dataset difficult to solve from single-view information alone.
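The sketch below illustrates these two constraints under simplifying assumptions: objects are approximated by bounding circles on the ground plane, and the per-view visible pixel counts are assumed to be supplied by the renderer. The function names and the exact acceptance criterion are illustrative, not the released implementation.

```python
# Sketch of (1) random, non-intersecting placement and (2) the occlusion check.
import math
import random

def place_objects(radii, plane_half_size=3.0, min_gap=0.1, max_tries=100):
    """Return (x, y) positions so that no two bounding circles intersect."""
    placed = []  # list of (x, y, r)
    for r in radii:
        for _ in range(max_tries):
            x = random.uniform(-plane_half_size + r, plane_half_size - r)
            y = random.uniform(-plane_half_size + r, plane_half_size - r)
            if all(math.hypot(x - px, y - py) >= r + pr + min_gap
                   for px, py, pr in placed):
                placed.append((x, y, r))
                break
        else:
            raise RuntimeError("could not place object without intersection")
    return [(x, y) for x, y, _ in placed]

def scene_is_occluded_enough(visible_pixels, threshold, min_views=2):
    """visible_pixels[v][o]: pixel count of object o in rendered view v.
    Accept the scene only if some object falls below the threshold in at least
    `min_views` viewpoints, so single-view information is insufficient."""
    num_objects = len(visible_pixels[0])
    for o in range(num_objects):
        views_below = sum(1 for view in visible_pixels if view[o] < threshold)
        if views_below >= min_views:
            return True
    return False

# Usage: sample a scene layout with five objects of radius 0.3.
positions = place_objects([0.3] * 5)
```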
Through the above process, we obtained four images of each scene, observed from the four viewpoints, along with a scene graph that records the scene information (object positions and attributes) needed by the subsequent question-answer pair generation process.
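For illustration, one recorded scene graph entry might look like the following Python dictionary; the field names and layout are hypothetical and may differ from the released files.

```python
# Illustrative (hypothetical) structure of one recorded scene graph.
scene_graph = {
    "scene_id": 12,
    "objects": [
        {"leaf_class": "apple",
         "hypernyms": ["fruit", "food"],     # zero to three inherited hypernyms
         "color": "red",                      # dominant color attribute
         "position": [0.42, -1.10, 0.0]},     # location on the ground plane
        # ... one entry per placed object (three to ten per scene)
    ],
    "camera_views": ["view_0.png", "view_1.png", "view_2.png", "view_3.png"],
}
```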
3.5 Question-Answer Pairs Generation
We introduced four question types: exist questions (querying object existence in a scene), query-color questions, query-class questions, and counting questions. In addition, based on the presence of spatial-relationship words (e.g., “left,” “right,” “front,” “behind”), questions can be further divided into spatial-related and non-spatial questions. In order to create questions, we first designed a series of question templates (78 in total) that provide the basic structure and question type. We show eight templates in Fig. 4. Questions are instantiated by randomly choosing words for the “colored” parts (e.g., <C>, <H>). Next, we computed the answer for each question based on the pre-defined function programs proposed in [4] and the recorded scene information. Through the above process, we created 20 question-answer pairs for each scene. Then, we adjusted the overall dataset so that the answers form a uniform distribution. This adjustment makes our dataset hard to answer without understanding the image information; a minimal sketch of the instantiation and balancing steps follows.
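The sketch below illustrates template instantiation and answer balancing under simple assumptions. The template texts, placeholder vocabularies, and the per-answer limit are illustrative only; the answer computation itself follows the functional programs of [4] and is omitted here.

```python
# Sketch: instantiate question templates and balance the answer distribution.
import random
from collections import defaultdict

TEMPLATES = [
    {"type": "exist", "text": "Are there any <C> <H>s visible?"},
    {"type": "count", "text": "How many <H>s are there?"},
    {"type": "query_color", "text": "What color is the <H> left of the <H2>?"},
]
COLORS = ["red", "green", "blue", "white", "yellow"]
CLASSES = ["apple", "food", "bottle", "kitchenware"]  # leaf classes and hypernyms

def instantiate(template):
    """Fill the placeholder parts of a template with random words."""
    text = template["text"]
    text = text.replace("<C>", random.choice(COLORS))
    text = text.replace("<H2>", random.choice(CLASSES))
    text = text.replace("<H>", random.choice(CLASSES))
    return {"type": template["type"], "question": text}

def balance_by_answer(qa_pairs, per_answer_limit):
    """Keep at most `per_answer_limit` pairs per answer so the answer
    distribution is roughly uniform and cannot be exploited as a prior."""
    kept, counts = [], defaultdict(int)
    random.shuffle(qa_pairs)
    for qa in qa_pairs:
        if counts[qa["answer"]] < per_answer_limit:
            kept.append(qa)
            counts[qa["answer"]] += 1
    return kept
```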
3.6 Dataset Statistics
We built ROM_CLEVR_v1 based on the above setup. We also created ROM_CLEVR_v0.5 at a smaller scale. We show the detailed statistics of these two versions in Table 1 and several examples from ROM_CLEVR_v1 in Fig. 5. Our dataset provides a way to train and evaluate VQA methods for real-world applications, such as robot picking. Moreover, the dataset can be adapted to user requirements by modifying the object models and question settings. It can also act as a pre-training dataset for real-world vision-and-language AI applications.
4 Experiments
4.1 Experimental Setup
In this section, we benchmark the two multi-view VQA methods proposed in [13] on our dataset. We also discuss possible approaches to improving accuracy.
Implementation Details. Answering questions in our dataset involves two sub-tasks. The first is multi-view image recognition, which we implemented with two approaches: the view pooling operation (VP) [15], which combines multi-view image features (CNN features) via max or average pooling, and the scene representation network (SRN) [16], a conditional variational autoencoder-based method that embeds multi-view image information into a continuous scene representation. Second, we used FiLM [12] to predict answers from the questions and the integrated multi-view image information. In all experiments, we pre-trained the SRN network for 200 epochs with batch size 36 and a starting learning rate of 0.0005, and trained the FiLM network for 30 epochs with batch size 64 and a starting learning rate of 0.0005.
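As a reference for the VP baseline, the following PyTorch sketch shows how per-view CNN features can be merged by max or average pooling before FiLM conditioning. The backbone and tensor shapes are illustrative, not the exact configuration used in our experiments.

```python
# Sketch of the view pooling (VP) operation over a shared CNN backbone.
import torch
import torch.nn as nn

class ViewPooling(nn.Module):
    def __init__(self, backbone: nn.Module, mode: str = "max"):
        super().__init__()
        self.backbone = backbone   # any CNN feature extractor, shared across views
        self.mode = mode           # "max" or "avg"

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (batch, num_views, 3, H, W)
        b, v = views.shape[:2]
        feats = self.backbone(views.flatten(0, 1))      # (b*v, C, h, w)
        feats = feats.view(b, v, *feats.shape[1:])      # (b, v, C, h, w)
        if self.mode == "max":
            return feats.max(dim=1).values              # (b, C, h, w)
        return feats.mean(dim=1)                        # (b, C, h, w)

# Example: a toy convolutional backbone and four views per scene.
backbone = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
pooled = ViewPooling(backbone, mode="max")(torch.randn(2, 4, 3, 128, 128))
print(pooled.shape)  # torch.Size([2, 64, 64, 64]) -- fed to the FiLM network
```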
In the following subsections, we discuss the experiment results on datasets of different scales, the effect of different input image resolutions, and different approaches to integrating multi-view image information.
4.2 Results on Dataset with Different Scales
We first evaluated the two approaches, VP_FiLM and SRN_FiLM, on the v0.5 and v1.0 versions of our dataset introduced in Sect. 3.6. We show the results in Table 2. Both approaches obtained relatively lower accuracy on ROM_CLEVR_v1.0, especially for spatial-related questions. This result indicates that the previous approaches may have limited ability on large-scale scenes with more complicated visual information and object arrangements, and suggests that more powerful models might be necessary for more realistic dataset setups.
4.3 Results on Different Input Image Sizes
We conducted experiments with different input image resolutions and multi-view fusion approaches on ROM_CLEVR_v1.0. The results are shown in Table 3. In this subsection, we first analyze the effect of input image resolution.
At an input image resolution of 64×64, both approaches obtained their lowest accuracy compared to the other resolutions, especially for spatial-related questions. This result indicates that, for the proposed dataset, a 64×64 input resolution might not provide enough information.
In contrast, there was no apparent performance gap for either approach between the 128×128 and 256×256 resolutions. This result indicates that, once a minimum resolution is met, a performance boost cannot always be obtained by simply increasing the input image resolution. One possible reason is that hyperparameter tuning tends to be more difficult for higher-resolution inputs.
4.4 Results on Different Input Image Features
In this subsection, we discuss the performance of the two multi-view image integration approaches. SRN_FiLM outperforms VP_FiLM by a large margin at the 64×64 input resolution; the gap is especially large for spatial-related questions. Both approaches achieved similar performance at the 128×128 and 256×256 resolutions, which might be because the SRN network is difficult to apply directly to higher image resolutions such as 128×128.
For non-spatial questions, VP_FiLM performs slightly better than SRN_FiLM, while for spatial-related questions, SRN_FiLM performs far better. This is because the SRN network can extract latent 3D information from multi-view images, which is challenging for the view pooling structure. Integrating these two approaches might therefore help improve performance, as sketched below. Both approaches perform poorly on counting questions, which indicates that there is still room for future methods to improve.
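As one hypothetical way to integrate the two approaches, the SRN scene vector could be broadcast over the spatial grid of the view-pooled feature map and concatenated along the channel axis before FiLM conditioning. This is our own illustration of the idea, not an implementation from [13] or [16], and the tensor sizes are assumed.

```python
# Sketch: fuse the SRN scene representation with view-pooled CNN features.
import torch

def fuse_srn_and_vp(scene_vec: torch.Tensor, pooled_feats: torch.Tensor) -> torch.Tensor:
    # scene_vec:    (batch, D)        -- continuous scene representation from SRN
    # pooled_feats: (batch, C, h, w)  -- view-pooled CNN feature map
    b, _, h, w = pooled_feats.shape
    tiled = scene_vec[:, :, None, None].expand(b, scene_vec.shape[1], h, w)
    return torch.cat([pooled_feats, tiled], dim=1)   # (batch, C + D, h, w)

fused = fuse_srn_and_vp(torch.randn(2, 256), torch.randn(2, 64, 16, 16))
print(fused.shape)  # torch.Size([2, 320, 16, 16]) -- input to FiLM conditioning
```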
4.5 Qualitative Results
We show several result examples in Fig. 6. For the query and exist question types (e.g., Questions (1), (2), and (3)), both methods give the correct answer. Query and exist questions tend to be less challenging for these approaches; one possible reason is that they require only dominant color or texture features, which are relatively easy to obtain through CNN structures. For counting questions, both methods perform relatively worse (e.g., Question (4)). Improving counting ability in scenes containing complicated object models remains a challenge.
5 Conclusion
We proposed a large-scale multi-view VQA dataset consisting of CG scenes with realistic daily-supply object models and automatically generated questions. Existing CG VQA datasets are often built from simple geometric objects, which makes it difficult to evaluate complicated scene-understanding ability, while VQA datasets with crowd-sourced images tend to suffer from human-centric biases and often require high labor costs for generating related questions. Compared to these datasets, ours consists of various realistic object models and can be generated automatically with low labor costs. The hierarchical class definition of our dataset enables hierarchical object recognition, which is important in real-world environments. The occlusion setting also makes our dataset more suitable for real-world applications, which often require multi-view understanding. We evaluated two previous multi-view VQA approaches on our dataset. The experiment results show that our dataset remains challenging for spatial-related and counting questions. We also found that ensembling scene representation approaches with traditional image feature extractors (e.g., CNNs) might provide a path toward better performance.
References
Antol, S., et al.: VQA: visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433 (2015)
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in Visual Question Answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913 (2017)
Hudson, D.A., Manning, C.D.: GQA: a new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6700–6709 (2019)
Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910 (2017)
VQA Challenge Homepage. https://visualqa.org/challenge.html. Accessed 31 Jan 2020
Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123(1), 32–73 (2017). https://doi.org/10.1007/s11263-016-0981-7
Calli, B., Singh, A., Walsman, A., Srinivasa, S., Abbeel, P., Dollar, A.M.: The YCB object and model set: towards common benchmarks for manipulation research. In: 2015 International Conference on Advanced Robotics (ICAR), pp. 510–517. IEEE (2015)
Singh, A., Sha, J., Narayan, K.S., Achim, T., Abbeel, P.: BigBIRD: a large-scale 3D database of object instances. In: 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 509–516. IEEE (2014)
Araki, R., Yamashita, T., Fujiyoshi, H.: ARC 2017 RGB-D dataset for object detection and segmentation. In: Late Breaking Results Poster on International Conference on Robotics and Automation (2018)
Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847 (2016)
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2018)
Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: FiLM: visual reasoning with a general conditioning layer. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
Qiu, Y., Satoh, Y., Suzuki, R., Kataoka, H.: Incorporating 3D information into visual question answering. In: 2019 International Conference on 3D Vision (3DV), Québec City, QC, Canada, pp. 756–765 (2019)
Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)
Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.: Multi-view convolutional neural networks for 3D shape recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 945–953 (2015)
Eslami, S.A., et al.: Neural scene representation and rendering. Science 360(6394), 1204–1210 (2018)