1 Introduction

The ability to interact with human operators through natural language plays an essential role in real-environment Human-Robot Interaction (HRI) applications. For domestic robots, this ability makes it possible to direct the robot through natural language commands. For video surveillance systems, the ability to describe changes that occur in the videos through natural language can significantly reduce labor costs.

The visual question answering (VQA) task is one of the critical vision-and-language tasks; it is defined as predicting an answer given an image and a question querying information about that image. The VQA task allows a machine to observe its surroundings and offer the required information to human operators.

Existing VQA datasets are built either from crowd-sourced real photos [1,2,3] or from CG-generated images [4]. Real-image VQA datasets are relatively suitable for testing the image understanding ability of VQA methods. However, constructing such datasets requires high labor costs, and they often suffer from human reporting biases. In addition, they have limited ability to precisely diagnose individual VQA abilities. On the contrary, CG VQA datasets require less labor but often consist only of simple geometric objects; thus, they are unsuitable for evaluating image content understanding. Additionally, it is unclear how to adapt such CG VQA datasets to real-world applications. Moreover, most existing VQA datasets do not consider scenes containing object occlusion. These issues make it difficult to adapt existing VQA datasets to real application settings.

To address these problems, we propose a multi-view VQA dataset consisting of real daily supply objects in occluded scene settings. Our dataset is practical for training and evaluating VQA abilities for real-environment applications such as robot picking. Inspired by the dataset generation process introduced in [4], we propose a four-step dataset generation process. We first obtained real daily supply object models from several previously reported object model datasets. Next, we annotated the objects with their attributes and class labels. We then created scenes by randomly placing object models in simulated scenes and photographing the scenes from multiple virtual cameras. Finally, we generated question-answer pairs for each scene based on pre-defined question templates and the recorded scene information. Our dataset is useful for evaluating a range of critical VQA abilities, such as multi-view scene understanding, hierarchical object recognition, attribute recognition, and counting. In addition, the automatic dataset generation process allows users to adapt our dataset to a new environment with additional objects and question types.

We evaluated two previously proposed multi-view VQA methods on our dataset. Compared to the previous synthetic VQA dataset [4], the experimental results show that our dataset is challenging for various question types, especially for spatial-related questions that require an understanding of object spatial relationships. The results also indicate that both 3D and appearance understanding could be critical for obtaining high performance; thus, our dataset provides a useful testbed for future VQA research. Our dataset is suitable for training and evaluating various VQA skills aimed at real-world applications, and the expandable automatic generation process makes it possible to bridge VQA methods developed on CG datasets with various real-world applications.

2 Related Work

2.1 VQA Dataset

Real Image VQA Datasets. VQA_v2 and GQA are two representative, widely used real-image VQA datasets. VQA_v2 consists of crowd-sourced images and question-answer pairs; its images range from indoor to outdoor scenes and often contain massive amounts of visual information. VQA_v2 can be used to evaluate various VQA skills and also serves as the evaluation dataset for VQA challenges [5]. However, its crowd-sourced construction means it contains human-reported biases [4]. Similar to VQA_v2, the GQA dataset is also built from a crowd-sourced image dataset; it uses the Visual Genome dataset [6] as its source, with detailed image information recorded in scene graphs. GQA generates question-answer pairs from the recorded scene graphs and a pre-defined grammar; thus, it is relatively less biased. However, both datasets require high labor costs and cannot avoid latent human-centered biases.

CG Image VQA Datasets. CLEVR is one of the representative CG VQA datasets. CLEVR defines an automatic scene generation engine that generates scenes with randomly placed geometric objects, together with a question generation program that produces question-answer pairs from the recorded scene information. CLEVR strictly controls dataset bias and provides a detailed diagnosis of various VQA abilities. However, its use of purely geometric objects makes it ill-suited for evaluating complicated image understanding, and it cannot be directly adapted to real-world applications with complicated visual information.

Most existing VQA datasets do not tackle scenes with object occlusion, which are common in real-world applications. Based on the above, we propose a novel synthetic multi-view VQA dataset with more realistic and complicated objects than CLEVR and lower labor costs than conventional real-image VQA datasets. The multiple-virtual-camera setting makes our dataset suitable for training and evaluating VQA methods for real applications.

2.2 Object Model Datasets

The YCB [7], Bigbird [8], and NEDO item database [9] are well-known scanned real-object datasets. YCB was created for robot manipulation; it consists of daily supply objects (e.g., hammer, tennis ball, bowl) with different shapes, sizes, and textures, whose models were created with a high-resolution RGB-D scanner. Bigbird has object classes and a construction process similar to YCB, but contains more packaged food and bottle-shaped object models (e.g., shampoo, detergent). The NEDO item database consists of daily supply object models with complete shapes; regarding object classes, it contains more food classes and office supplies. We adapted these three datasets into our dataset generation process by re-annotating object class labels and placing the objects in CG scenes. It is noteworthy that our dataset can be further extended by adding more object models.

2.3 Multi-view VQA

Conventional VQA methods [10,11,12] predict answers from single-view images and questions. However, single-view images are inadequate for answering questions in various situations, such as object occlusion or severe lighting conditions. Qiu et al. [13] proposed a multi-view VQA framework that predicts answers under a multi-view image scene setting, dividing the multi-view VQA process into two separate components: multi-view image representation and question answering. In our work, we used the methods of Qiu et al. to benchmark performance on our dataset.
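
A minimal sketch of this two-component decomposition is given below; the module names, tensor shapes, and PyTorch framing are illustrative assumptions, not the implementation of [13].

```python
# Sketch of the two-stage multi-view VQA decomposition: a scene encoder
# builds a multi-view image representation, and a question-answering
# module predicts the answer from it. Submodules are passed in by the user.
import torch
import torch.nn as nn

class MultiViewVQA(nn.Module):
    def __init__(self, scene_encoder: nn.Module, answerer: nn.Module):
        super().__init__()
        self.scene_encoder = scene_encoder  # e.g., view pooling or an SRN-style encoder
        self.answerer = answerer            # e.g., a FiLM-based QA network

    def forward(self, views: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # views: (batch, n_views, channels, height, width)
        # question: (batch, max_question_length) token indices
        scene_repr = self.scene_encoder(views)      # multi-view image representation
        return self.answerer(scene_repr, question)  # answer logits
```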

3 Real Object Multi-view CLEVR Dataset

Fig. 1. Dataset generation process of the proposed dataset. This process allows us to extend the dataset by adding object models and question types.

3.1 Dataset Generation Process

Inspired by the dataset generation process introduced in [4], we propose a four-step dataset generation process; an overview of the four steps is shown in Fig. 1. Each sample of our dataset consists of a CG scene observed from a multiple-virtual-camera setup and a question-answer pair about the scene contents. We started the generation process by collecting 3D models of real daily supply objects from three previously reported object model datasets. After the collection process, we labeled the object models based on the WordNet [14] hierarchical definition and annotated their attributes. We then created CG scenes with these annotated objects using an automatic generation engine. Finally, we generated question-answer pairs based on a series of pre-defined templates and the recorded scene information. In the following sections, we describe these steps in detail.
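
A minimal sketch of how the four steps chain together is given below; the step callables and record fields are hypothetical placeholders, not the actual generation scripts.

```python
# Sketch of the four-step pipeline: collect models, annotate them,
# render multi-view scenes, and generate question-answer pairs.
# Each step is supplied as a callable, so the sketch stays agnostic
# to the concrete tools used for rendering and annotation.
def generate_dataset(collect_models, annotate_models, render_scene,
                     generate_questions, model_sources, templates, n_scenes):
    models = annotate_models(collect_models(model_sources))  # steps 1-2
    dataset = []
    for _ in range(n_scenes):
        scene = render_scene(models, n_cameras=4)             # step 3: place objects, render 4 views
        qa_pairs = generate_questions(scene, templates)       # step 4: instantiate templates, compute answers
        dataset.append({"images": scene["images"],
                        "scene_graph": scene["graph"],
                        "questions": qa_pairs})
    return dataset
```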

3.2 Object Model Collection

To obtain realistic object models, we selected the three open-source datasets introduced in Sect. 2.2, YCB, Bigbird, and the NEDO item database, as our object model sources. All three datasets consist of daily supply object models, ranging from food, toys, and cleaning supplies to kitchenware and sports equipment. The YCB and Bigbird datasets were collected with an RGB-D scanner, which leaves some of their models incomplete in shape and therefore unsuitable for recognition-related tasks. On the contrary, models in the NEDO item dataset are relatively complete in shape; however, a considerable portion of them are packed in boxed packages, which makes it difficult to recognize those objects from their appearance. Considering these problems, we removed models with incomplete shapes along with unrecognizable models packed in packages. After this step, we obtained a clean object model set with 134 object models in total.

Fig. 2. Object annotation example (left) and the number of object instances (models) per object class (right).

Fig. 3. Hierarchical object class definition of the ROM_CLEVR_v1.0 dataset. Object classes are shown in gray ovals; hypernyms are shown in white ovals.

3.3 Object Model Annotation

To generate meaningful question-answer pairs, we labeled the 134 object models with class labels and attributes. We followed the hierarchical object class definition in WordNet: for each object model, we observed its appearance and assigned a leaf class label according to the WordNet hierarchical class structure. In addition, we added up to three levels of inherited hypernyms to further enable referring to objects through their hypernyms. For example, for the question "Are there any foods visible?", if there is an apple in the scene, the answer is "yes" because "apple" is a hyponym of "food." We also labeled each object with its dominant color. After this step, we obtained a hierarchical class definition with 62 classes in total, comprising 35 leaf classes and 27 hypernym classes. We show one object annotation example in Fig. 2 (left) and the object instance distribution in Fig. 2 (right). All object classes, hypernyms, and their hierarchical relationships are shown in Fig. 3.
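
As a rough illustration of this hypernym labeling, the sketch below collects up to three hypernym levels for a leaf synset using the NLTK WordNet interface; the synset name, helper function, and scene-object structure are illustrative assumptions rather than our annotation tooling.

```python
# Collect a leaf class plus up to three levels of WordNet hypernyms,
# so questions can refer to an object through its hypernyms as well.
# Requires: import nltk; nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def hypernym_labels(leaf_synset_name: str, max_depth: int = 3) -> set:
    """Return the leaf class name plus up to `max_depth` levels of hypernym names."""
    synset = wn.synset(leaf_synset_name)
    labels = {synset.name().split(".")[0]}
    frontier = [synset]
    for _ in range(max_depth):
        frontier = [h for s in frontier for h in s.hypernyms()]
        labels.update(h.name().split(".")[0] for h in frontier)
    return labels

# Example: an "apple" object can also be matched via its hypernyms.
apple_labels = hypernym_labels("apple.n.01")
scene_objects = [{"class": "apple", "labels": apple_labels, "color": "red"}]
print(any("edible_fruit" in obj["labels"] for obj in scene_objects))
```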

3.4 Scene Generation

Our scene generation process is based on the CLEVR scene generation engine. In detail, we created a base scene containing a ground plane together with ambient and spot lighting. Various scenes can then be generated automatically by placing objects on the ground plane.

When creating a single scene, we placed a random number of objects, ranging from three to ten, on the ground plane and arranged them randomly without object intersections. Unlike the original CLEVR setting, we adopted the multi-view CLEVR setup proposed in [4] and placed four virtual cameras above the ground plane, which photograph the scene from evenly spaced viewpoints around its center. Moreover, to create scenes with high occlusion, we set a minimum pixel-count threshold and required each scene to contain objects that fall below this threshold from at least two camera viewpoints. This setup makes our dataset difficult to solve from single-view information.
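
The sketch below illustrates the placement and camera arrangement in simplified form, using a distance-based overlap test in place of the engine's mesh intersection check and omitting the pixel-count occlusion test; all dimensions and thresholds are illustrative assumptions.

```python
# Randomly place 3-10 objects on the ground plane without overlaps,
# and compute four evenly spaced camera positions around the scene center.
import math
import random

def place_objects(radii, plane_half_size=3.0, max_tries=100):
    """Rejection-sample non-overlapping 2D positions for the given object radii."""
    placed = []
    for r in radii:
        for _ in range(max_tries):
            x = random.uniform(-plane_half_size + r, plane_half_size - r)
            y = random.uniform(-plane_half_size + r, plane_half_size - r)
            if all(math.hypot(x - px, y - py) > r + pr for px, py, pr in placed):
                placed.append((x, y, r))
                break
    return placed

def camera_positions(n_cameras=4, distance=10.0, height=6.0):
    """Evenly spaced viewpoints around the scene center."""
    return [(distance * math.cos(2 * math.pi * i / n_cameras),
             distance * math.sin(2 * math.pi * i / n_cameras),
             height) for i in range(n_cameras)]

n_objects = random.randint(3, 10)
positions = place_objects([random.uniform(0.3, 0.8) for _ in range(n_objects)])
cameras = camera_positions()
```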

Through the above process, we obtained four images of each scene, observed from the four viewpoints, along with a scene graph that records the scene information (object positions and attributes) to enable the subsequent question-answer pair generation process.
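
An illustrative example of what such a per-scene record might look like, loosely following the CLEVR scene-graph format, is shown below; all field names and values are assumptions for illustration only.

```python
# Example per-scene record used to drive question-answer generation:
# image filenames, per-object class/hypernym/attribute labels, and
# spatial relationships between object indices.
scene_graph = {
    "image_filenames": ["scene0001_view0.png", "scene0001_view1.png",
                        "scene0001_view2.png", "scene0001_view3.png"],
    "objects": [
        {"class": "apple", "hypernyms": ["edible_fruit", "produce", "food"],
         "color": "red", "3d_coords": [0.4, -1.2, 0.0]},
        {"class": "mug", "hypernyms": ["container", "vessel"],
         "color": "blue", "3d_coords": [-0.8, 0.5, 0.0]},
    ],
    "relationships": {          # relationships[rel][i] lists objects in relation rel to object i
        "left": [[], [0]],      # object 0 is to the left of object 1
        "front": [[1], []],     # object 1 is in front of object 0
    },
}
```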

Fig. 4. Question template examples for exist and query class questions.

3.5 Question-Answer Pairs Generation

We introduced four question types: exist questions (querying object existence in a scene), query color, query class, and counting questions. In addition, based on the presence of spatial relationship words (e.g., "left," "right," "front," "behind"), the questions can be further divided into spatial-related and non-spatial questions. To create questions, we first designed a series of question templates (78 in total) that provide the basic structure and type of each question; eight of these templates are shown in Fig. 4. Questions are instantiated by randomly choosing words for the colored placeholders (e.g., <C>, <H>). Next, we computed the answer for each question based on the pre-defined function programs proposed in [4] and the recorded scene information. Through the above process, we created 20 question-answer pairs for each scene. We then adjusted the overall dataset to keep the answer distribution uniform, which makes our dataset difficult to answer without understanding the image content.
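
The sketch below illustrates this template instantiation and answer computation for a single exist template; the template string, word lists, and helper function are simplified stand-ins for the 78 templates and the function programs of [4].

```python
# Fill a template's placeholders (<C> for color, <H> for class/hypernym)
# at random, then compute the answer from the recorded scene graph.
import random

COLORS = ["red", "blue", "green", "yellow"]
CLASSES = ["apple", "mug", "food", "container"]

def instantiate_exist_question(scene_graph):
    color = random.choice(COLORS)   # fills the <C> placeholder
    cls = random.choice(CLASSES)    # fills the <H> placeholder
    question = f"Are there any {color} {cls}s visible?"
    answer = "yes" if any(
        obj["color"] == color and (obj["class"] == cls or cls in obj["hypernyms"])
        for obj in scene_graph["objects"]
    ) else "no"
    return question, answer
```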

Table 1. Dataset statistics: Object (Obj).
Fig. 5. Four examples from the ROM_CLEVR_v1.0 dataset, observed from the default view, 90°, 180°, and 270° (left to right).

3.6 Dataset Statistics

We built ROM_CLEVR_v1.0 based on the above setup, and we also created ROM_CLEVR_v0.5 at a smaller scale. Detailed statistics of the two versions are given in Table 1, and several examples of ROM_CLEVR_v1.0 are shown in Fig. 5. Our dataset provides a way to train and evaluate VQA methods for real-world applications such as robot picking. Moreover, the dataset can be adapted to user requirements by modifying the object models and question settings, and it can also act as a pre-training dataset for real-world vision-and-language AI applications.

4 Experiments

4.1 Experimental Setup

In this section, we benchmark the two multi-view VQA methods proposed in [13] on our dataset. We also discuss possible approaches to improve accuracy.

Implementation Details. Answering questions in our dataset involves two sub-tasks. First, multi-view image recognition is necessary. We implemented this with two approaches: a view pooling operation (VP) [15], which combines multi-view image features (CNN features) via max or average pooling, and a scene representation network (SRN) [16], a conditional variational autoencoder-based method that embeds multi-view image information into a continuous scene representation. Second, we used FiLM [12] to predict answers from the questions and the integrated multi-view image information. In all experiments, we pre-trained the SRN network for 200 epochs with batch size 36 and an initial learning rate of 0.0005, and trained the FiLM network for 30 epochs with batch size 64 and an initial learning rate of 0.0005.
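
The sketch below shows simplified versions of these two building blocks: a view pooling operation over per-view CNN feature maps and a FiLM-conditioned block that modulates image features with question-dependent scale and shift. Layer sizes, shapes, and the exact block structure are illustrative, not the settings used in our experiments.

```python
import torch
import torch.nn as nn

class ViewPooling(nn.Module):
    """Max- or average-pool CNN feature maps across the view dimension."""
    def __init__(self, mode: str = "max"):
        super().__init__()
        self.mode = mode

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n_views, channels, height, width)
        return feats.max(dim=1).values if self.mode == "max" else feats.mean(dim=1)

class FiLMBlock(nn.Module):
    """One FiLM-conditioned residual block, following the idea of [12]."""
    def __init__(self, channels: int, question_dim: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(channels, affine=False)
        self.film = nn.Linear(question_dim, 2 * channels)  # predicts gamma and beta

    def forward(self, x: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.film(q).chunk(2, dim=1)
        h = self.bn(self.conv(x))
        h = gamma[:, :, None, None] * h + beta[:, :, None, None]
        return x + torch.relu(h)

# Example: pool four views, then condition on a question embedding.
views = torch.randn(2, 4, 128, 14, 14)
question = torch.randn(2, 256)
pooled = ViewPooling("max")(views)           # (2, 128, 14, 14)
out = FiLMBlock(128, 256)(pooled, question)  # (2, 128, 14, 14)
```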

In the following subsections, we discuss the experimental results on datasets of different scales, the effect of different input image resolutions, and the multi-view image information integration approaches.

Table 2. Accuracy on the ROM_CLEVR_v0.5 and ROM_CLEVR_v1.0 datasets: Spatial-related question accuracy (S); Non-spatial question accuracy (NS).

4.2 Results on Dataset with Different Scales

We first evaluated the two approaches, VP_FiLM and SRN_FiLM, on v0.5 and v1.0 of our dataset, introduced in Sect. 3.6. The results are shown in Table 2. Both approaches obtained relatively lower accuracy on ROM_CLEVR_v1.0, especially for spatial-related questions. This indicates that the previous approaches may have limited ability to handle large-scale scenes with more complicated visual information and object arrangements, and that more powerful models might be necessary for more realistic dataset setups.

Table 3. Overall and per-question-type accuracy on the ROM_CLEVR_v1.0 dataset: Resolution (Reso); Spatial-related question accuracy (S); Non-spatial question accuracy (NS).

4.3 Results on Different Input Image Sizes

We conducted experiments with different input image resolutions and multi-view fusion approaches on ROM_CLEVR_v1.0; the results are shown in Table 3. In this subsection, we first analyze the effect of input image resolution.

At an input image resolution of 64×64, both approaches obtained the lowest accuracy compared to the other resolutions, and this trend is especially pronounced for spatial-related questions. This result indicates that, for the proposed dataset, a 64×64 input resolution may cause an information deficiency.

Fig. 6. Example results on the ROM_CLEVR_v1.0 dataset. Incorrect answer predictions are shown in red. (Color figure online)

In contrast, there were no apparent performance gaps for either approach between resolutions of 128×128 and 256×256. This indicates that, once a minimum resolution is satisfied, a performance boost cannot always be obtained by simply increasing the input image resolution. One possible reason is that hyperparameter tuning for higher-resolution inputs tends to be more difficult.

4.4 Results on Different Input Image Features

In this subsection, we discuss the performance of the two multi-view image integration approaches. SRN_FiLM outperforms VP_FiLM by a large margin at an input resolution of 64×64, and the gap is especially significant for spatial-related questions. The two approaches achieve similar performance at input resolutions of 128×128 and 256×256. These results may arise because the SRN network is relatively difficult to apply directly to higher image resolutions, such as 128×128.

For non-spatial questions, VP_FiLM performs slightly better than SRN_FiLM, while for spatial-related questions, SRN_FiLM performs far better. This result stems from the SRN network's ability to extract latent 3D information from multi-view images, which is challenging for the view pooling structure. Integrating these two approaches might therefore help improve performance. Both approaches perform poorly on counting questions, which indicates that there is still room for future methods to improve.
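
As one speculative direction for such integration, the sketch below fuses the two representations by concatenating a view-pooled appearance feature with an SRN-style scene representation before answering; the module interfaces and the assumption that both encoders return flat feature vectors are ours, and this fusion was not evaluated in our experiments.

```python
# Late fusion of appearance (view-pooled) and latent 3D (SRN-style) features:
# encode the same multi-view input with both encoders, concatenate, and
# project to a shared scene representation for the answering module.
import torch
import torch.nn as nn

class FusedSceneEncoder(nn.Module):
    def __init__(self, vp_encoder: nn.Module, srn_encoder: nn.Module,
                 vp_dim: int, srn_dim: int, out_dim: int):
        super().__init__()
        self.vp_encoder = vp_encoder
        self.srn_encoder = srn_encoder
        self.project = nn.Linear(vp_dim + srn_dim, out_dim)

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        appearance = self.vp_encoder(views)   # 2D appearance feature vector
        geometry = self.srn_encoder(views)    # latent 3D scene representation vector
        return self.project(torch.cat([appearance, geometry], dim=-1))
```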

4.5 Qualitative Results

We show several example results in Fig. 6. For query and exist question types (e.g., Questions (1), (2), and (3)), both methods give correct answers; these question types tend to be less challenging for the two approaches. One possible reason is that such questions mainly require dominant color or texture features, which are relatively easy to obtain through CNN structures. For counting questions, both methods perform relatively poorly (e.g., Question (4)). Improving counting ability in scenes containing complicated object models therefore remains a challenge.

5 Conclusion

We proposed a large-scale multi-view VQA dataset consisting of CG scenes with realistic daily supply object models and automatically generated questions. Existing CG VQA datasets are often built from simple geometric objects, which makes it difficult to evaluate complicated scene understanding, while VQA datasets with crowd-sourced images tend to suffer from human-centric biases and often require high labor costs to generate questions. Compared to these datasets, ours consists of various realistic object models and can be generated automatically at low labor cost. The hierarchical class definition of our dataset enables hierarchical object recognition, which is important in real-world environments, and the occlusion setting makes our dataset more suitable for real-world applications, which often require multi-view understanding. We evaluated two previous multi-view VQA approaches on our dataset. The experimental results show that our dataset remains challenging for spatial-related and counting questions. We also found that ensembling scene representation approaches with traditional image feature extractors (e.g., CNNs) might provide a path toward better performance.