Abstract
Human–Object Interaction (HOI) detection is important to human-centric scene understanding tasks. Existing works tend to assume that the same verb has similar visual characteristics in different HOI categories, an approach that ignores the diverse semantic meanings of the verb. To address this issue, in this paper, we propose a novel Polysemy Deciphering Network (PD-Net) that decodes the visual polysemy of verbs for HOI detection in three distinct ways. First, we refine features for HOI detection to be polysemy-aware through the use of two novel modules: namely, Language Prior-guided Channel Attention (LPCA) and Language Prior-based Feature Augmentation (LPFA). LPCA highlights important elements in human and object appearance features for each HOI category to be identified; moreover, LPFA augments human pose and spatial features for HOI detection using language priors, enabling the verb classifiers to receive language hints that reduce intra-class variation for the same verb. Second, we introduce a novel Polysemy-Aware Modal Fusion module, which guides PD-Net to make decisions based on feature types deemed more important according to the language priors. Third, we propose to relieve the verb polysemy problem through sharing verb classifiers for semantically similar HOI categories. Furthermore, to expedite research on the verb polysemy problem, we build a new benchmark dataset named HOI-VerbPolysemy (HOI-VP), which includes common verbs (predicates) that have diverse semantic meanings in the real world. Finally, through deciphering the visual polysemy of verbs, our approach is demonstrated to outperform state-of-the-art methods by significant margins on the HICO-DET, V-COCO, and HOI-VP databases. Code and data in this paper are available at https://github.com/MuchHair/PD-Net.
1 Introduction
In recent years, researchers working in the field of computer vision have begun to pay increasing attention to scene understanding tasks (Zheng et al. 2015; Lu et al. 2016; Zhang et al. 2017; Zhao et al. 2020; Lin et al. 2020). Since human beings are often central to real-world scenes, Human–Object Interaction (HOI) detection has become a fundamental problem in scene understanding. HOI detection involves not only identifying the classes and locations of objects in images, but also recognizing the interactions (verbs) between each human–object pair. As shown in Fig. 1, an interaction between a human–object pair can be represented by a triplet \({<}{} \textit{person verb object}{>}\), herein referred to as one HOI category. One human–object pair may comprise multiple triplets, e.g. \({<}{} \textit{person fly airplane}{>}\) and \({<}{} \textit{person ride airplane}{>}\).
Fig. 1 Examples reflecting the verb polysemy problem in HOI detection. a and b present HOI examples of “play”: the “feet” are more important in (a), while the “hands” are more important in (b). c and d illustrate HOI examples of “hold”: the human–object pairs in (c) and (d) are characterized by dramatically different human–object spatial features, i.e. the relative location between the two bounding boxes. e and f illustrate HOI examples of “fly”: the “person” in (e) exhibits discriminative pose features, while the “person” in (f) does not
The HOI detection task is notably challenging (Chao et al. 2018; Gao et al. 2018). One major reason is that verbs can be polysemic. As illustrated in Fig. 1, a verb may convey substantially different semantic meanings and visual characteristics with respect to different objects, as these objects may have diverse functions and attributes. One pair of examples can be found in Fig. 1a, b. Here, the “feet” are the more discriminative parts of the human figure for \({<}{} \textit{person play soccer}\_\textit{ball}{>}\), while the “hands” are more important for describing \({<}{} \textit{person play frisbee}{>}\). A second pair of examples is presented in Fig. 1c, d. The human–object pairs in (c) and (d), despite being tagged with the same verb, present dramatically different human–object spatial features. A more serious consideration is that the importance of the same type of visual feature may vary dramatically as the objects of interest change. For example, the human pose plays a vital role in describing \({<}{} \textit{person fly kite}{>}\) in Fig. 1e; by contrast, the human pose is invisible and therefore useless for characterizing \({<}{} \textit{person fly airplane}{>}\) in Fig. 1f. Verb polysemy therefore presents a significant challenge in HOI detection.
The problem of verb polysemy is relatively underexplored, and sometimes even ignored, in existing works (Xu et al. 2019; Li et al. 2019b; Liao et al. 2020; Wang et al. 2020). Most contemporary approaches tend to assume that the same verb will have similar visual characteristics across different HOI categories, and accordingly opt to design object-shared verb classifiers. When the verb classifier is shared among all objects, each verb obtains more training samples, thereby promoting the robustness of the classification for HOI categories with a small sample size. However, due to the polysemic nature of the verbs, a dramatic semantic gap may exist between instances of the same verb across different HOI categories. Chao et al. (2018) constructed object-specific verb classifiers for each HOI category, which are able to overcome the polysemy problem for HOI categories that have sufficient training samples. However, this approach lacks few- and zero-shot learning abilities for HOI categories for which little or no training data is available.
Fig. 2 Visual features of each human–object pair are duplicated multiple times so that polysemy-aware visual features can be obtained under the guidance of language priors. Each polysemy-aware feature is sent to a specific verb classifier, which can be any of the three verb classifier types discussed in this paper. To reduce the number of duplicated human–object pairs, meaningless HOI categories (e.g. \({<}{} \textit{person eat book}{>}\) and \({<}{} \textit{person ride book}{>}\)) are ignored. Meaningful and common HOI categories (e.g. \({<}{} \textit{person hold book}{>}\) and \({<}{} \textit{person read book}{>}\)) are available in each popular HOI detection database. Multiple verbs are “checked” in this figure because HOI detection performs multi-label verb classification for each human–object pair
In this paper, we propose a novel Polysemy Deciphering Network (PD-Net) to address the challenging verb polysemy problem. As illustrated in Fig. 2, PD-Net transforms the multi-label verb classifications for each human–object pair into a set of binary classification problems. Here, each binary classifier is used for the verification of one verb category. The classifiers share the majority of their parameters; the main difference lies in the input features. Next, we decode the verb polysemy in the following three ways.
First, we enable the features sent to each binary classifier to be polysemy-aware using two novel modules, namely Language Prior-guided Channel Attention (LPCA) and Language Prior-based Feature Augmentation (LPFA). The language prior is the concatenation of two word embeddings, one for a verb and one for an object. The object class is predicted by an object detector; the verb is the one to be verified by the specific binary verb classifier. For its part, LPCA is applied to both the human and object appearance features. The two appearance features are usually redundant, as only part of their information is relevant to one specific HOI category (see Fig. 1). Therefore, LPCA is used to highlight important elements in the appearance features for each binary classifier. Moreover, both human–object spatial and human pose features are often vague and can vary dramatically for the same verb, as shown in Fig. 1a–d; we therefore propose LPFA, which concatenates each of these two features with the language prior. In this way, the classifiers receive hints that reduce the intra-class variation of the same verb for the pose and spatial features.
We further design a novel Polysemy-Aware Modal Fusion module (PAMF), which produces attention scores based on the above language priors in order to dynamically fuse multiple feature types. The language priors provide hints regarding the importance of the features for each HOI category. As can be seen in Fig. 1, the human pose feature is discriminative when the language prior is “fly kite” (Fig. 1e), but is less useful when the language prior is “fly airplane” (Fig. 1f). Therefore, our proposed PAMF deciphers the verb polysemy problem by highlighting the features that are more important for each HOI category.
Moreover, as mentioned above, both object-shared and object-specific verb classifiers have limitations. We therefore propose a novel clustering-based object-specific verb classifier, which combines the advantages of object-shared and object-specific classifiers. The main motivation is to ensure that semantically similar HOI categories containing the same verb, e.g. \({<}{} \textit{person hold cow}{>}\) and \({<}{} \textit{person hold elephant}{>}\), share the same verb classifier. HOIs that are semantically very different (e.g. \({<}{} \textit{person hold book}{>}\) and \({<}\)person hold backpack\({>}\)) are identified using another verb classifier. In this way, the verb polysemy problem is mitigated. Meanwhile, clustering-based object-specific classifiers have the capacity to handle the few- and zero-shot learning problems that arise in HOI detection, since we merge the training data of semantically similar HOI categories.
To the best of our knowledge, our proposed PD-Net is the first approach to explicitly handle the verb polysemy problem in HOI detection. More impressively, our experimental results on three databases demonstrate that our approach consistently outperforms state-of-the-art methods by considerable margins. A preliminary version of this paper has been published in (Zhong et al. 2020). Compared with the conference version, this version further proposes a novel Language Prior-guided Channel Attention module, simplifies the architecture of PD-Net by using Clustering-based Object-Specific classifiers, builds a new database (named HOI-VP) to facilitate the research on the verb polysemy problem, and includes further experimental investigations.
The remainder of this paper is organized as follows. Section 2 briefly reviews related works. The details of the proposed components of PD-Net are described in Sect. 3. The databases and implementation details are introduced in Sect. 4, while the experimental results are presented in Sect. 5. Finally, we conclude the paper in Sect. 6.
2 Related Works
2.1 Human–Object Interaction Detection
HOI detection performs multi-label verb classification for each human–object pair, meaning that the interaction between the same human–object pair may be described using multiple verbs. Depending on the order of verb classification and target object association, existing HOI detection approaches can be divided into two categories. The first category of methods infers the verbs (actions) being performed by one person and then associates each verb with a single object in the image. Multiple target object association approaches have been proposed. For example, Shen et al. (2018) proposed an approach based on the value of object detection scores, while Gkioxari et al. (2018) fitted a distribution density function of the target object locations based on the human appearance feature. Moreover, Qi et al. (2018) adopted a graph parsing network to associate the target objects. Liao et al. (2020) and Wang et al. (2020) first defined interaction points for HOI detection; they then located the interaction points and associated each point with one human–object pair.
The second category of methods first pairs each human instance with all object instances as candidate human–object pairs and then recognizes the verb for each candidate pair (Gupta et al. 2019). Many types of features have been employed to promote the verb classification performance. For example, Wan et al. (2019) employed both human parts and pose-aware features for verb classification, while Xu et al. (2020) exploited human gaze and intention to assist HOI detection. Furthermore, Wang et al. (2019a) extracted context-aware human and object appearance features to promote HOI detection performance. Li et al. (2020a) utilized 3D pose models and 3D object locations to assist HOI detection. Moreover, Li et al. (2020b) annotated large amounts of part-level human–object interactions and trained PaStaNet, which helps HOI detection models make use of fine-grained human part features. A large number of novel model architectures for HOI detection have also been developed. For example, Li et al. (2019b) introduced a Transferable Interactiveness Network that suppresses candidate pairs without interactions. Peyre et al. (2019) constructed a multi-stream model that projects visual features and word embeddings to a joint space, which is helpful for unseen HOI category detection. Xu et al. (2019) constructed a graph neural network to promote the quality of word embeddings by utilizing the correlation between semantically similar verbs, while Zhou et al. (2020) proposed a cascade architecture that facilitates coarse-to-fine HOI detection.
In addition, HOI recognition (Chao et al. 2015) is a task closely related to HOI detection. Briefly, HOI recognition methods predict all possible HOI categories in an image, but do not detect the locations of the involved human–object pairs. For example, Kato et al. (2018) proposed to compose classifiers for unseen verb-noun pairs by leveraging an external knowledge graph and graph convolutional networks.
2.2 The Polysemy Problem
The polysemy problem is very common in daily life. Several studies have explored it in other fields, e.g. natural language processing (Ma et al. 2020; Huang et al. 2012; Oomoto et al. 2017) and recommendation systems (Liu et al. 2019). For example, the polysemy problem in natural language processing mainly arises from the different grammatical usages of a word, e.g. a word functioning as either a verb or a noun (Ma et al. 2020). Similarly, each node in a recommendation system may have multiple facets due to the different links with its neighbour nodes, which gives rise to the node polysemy problem (Liu et al. 2019).
In this paper, we address the verb polysemy problem in HOI detection, i.e., a verb may convey substantially different visual characteristics when associated with different objects. There are also some related areas that may face the same verb polysemy problem, e.g. Action Recognition (Simonyan and Zisserman 2014; Tran et al. 2015; Damen et al. 2018) and Visual Relationship Detection (Lu et al. 2016; Krishna et al. 2017; Kuznetsova et al. 2020; Ji et al. 2020).
Action recognition methods aim to recognize human actions from an image or a video. In particular, the EPIC-KITCHENS dataset (Damen et al. 2018) is a large-scale egocentric video benchmark for action recognition tasks. Besides recognizing the human action from the video, the task on this dataset also involves identifying the category of the object being interacted with. It includes 125 verb and 352 noun categories. Many of its verbs also suffer from the polysemy problem, e.g. “hold”, “open”, and “close”.
Visual relationship detection involves detecting and localizing pairs of objects in an image and classifying the predicate or interaction between each subject-object pair. Different from HOI detection, the subject of each subject-object pair in visual relationship detection can be any object category, not only a human. Polysemic verbs, e.g. “carry”, “ride”, and “hold”, are also common predicates in visual relationship detection, which therefore suffers from the polysemy problem as well.
Furthermore, as the vast majority of HOI detection methods are based on single images (Gao et al. 2018; Xu et al. 2019; Li et al. 2019b; Gupta et al. 2019; Peyre et al. 2019; Qi et al. 2018; Liao et al. 2020; Wang et al. 2020; Ulutan et al. 2020), we only consider the verb polysemy problem for image-based HOI detection. However, image-based methods ignore the temporal information, which also provides rich cues to decipher the verb polysemy problem. Therefore, we expect more works on video-based HOI detection in the future.
2.3 The Exploitation of Language Priors
Language priors have also been successfully utilized in many computer vision-related fields, including Scene Graph Generation (Lu et al. 2016; Zhang et al. 2017; Gu et al. 2019; Wang et al. 2019b), Image Captioning (Zhou et al. 2019; Yao et al. 2019), and Visual Question Answering (Zhou et al. 2019; Gao et al. 2019; Marino et al. 2019). Moreover, several works (Xu et al. 2019; Peyre et al. 2019) have adopted language priors for HOI detection. All of these approaches project visual features and word embeddings to a joint space, which improves HOI detection by exploiting the semantic relationship between similar verbs or HOI categories (e.g. “drink” and “sip” or “ride horse” and “ride cow”). However, these works do not employ language priors to solve the challenging verb polysemy problem. Compared with the above methods, PD-Net aims to solve the verb polysemy problem by using three novel language prior-based components: Language Prior-guided Channel Attention, Language Prior-based Feature Augmentation, and Polysemy-Aware Modal Fusion.
Fig. 3 Overview of the Polysemy Deciphering Network. For the sake of simplicity, only one binary CSP classifier (for “hold”) is illustrated here. PD-Net takes four feature streams as input: the human appearance stream (H stream), the object appearance stream (O stream), the human–object spatial stream (S stream), and the human pose stream (P stream). These four feature streams are first processed by either LPCA or LPFA to be polysemy-aware. They are then sent to the H, O, S, and P blocks for binary classification, respectively. Subsequently, the classification scores from the four feature streams are fused using the attention scores produced by \(\mathbf{PAMF}\). Here, \(\otimes \) and \(\oplus \) denote the element-wise multiplication and addition operations, respectively
2.4 Attention Models
Attention mechanisms are becoming a popular component in computer vision tasks, including Image Captioning (Chen et al. 2017; Xu et al. 2015; You et al. 2016), Action Recognition (Girdhar and Ramanan 2017; Meng et al. 2019), and Pose Estimation (Li et al. 2019a; Ye et al. 2016). Existing studies on attention mechanisms can be roughly divided into three categories: namely, hard regional attention (Jaderberg et al. 2015; Li et al. 2018; Wang et al. 2021), soft spatial attention (Wang et al. 2017; Pereira et al. 2019; Zhu et al. 2018), and channel attention (Pereira et al. 2019; Hu et al. 2018; Ding et al. 2020). Hard regional attention methods typically predict regions of interest (ROIs) first, and then only utilize features in ROIs for subsequent tasks. In comparison, soft spatial attention and channel attention (CA) models use soft weights to highlight important features in the spatial and channel dimensions, respectively. There have also been existing works that adopted attention models to assist with HOI detection tasks. For example, Gao et al. (2018) and Wang et al. (2019a) employed an attention mechanism to enhance the human and object features by aggregating contextual information. Wan et al. (2019) proposed the PMFNet model, which adopts human pose and spatial features as cues to infer the importance of each human part. Ulutan et al. (2020) used cues derived from a human–object spatial configuration to highlight the important elements in appearance features.
To the best of our knowledge, few works thus far have made use of the attention mechanism to solve the verb polysemy problem in HOI detection. Moreover, existing attention models for HOI detection usually employ visual features (e.g. the appearance and pose features) as cues. By contrast, our proposed Language Prior-guided Channel Attention and Polysemy-Aware Modal Fusion adopt language priors as cues; these priors have clear semantic meanings, and are therefore well-suited to resolving the verb polysemy problem.
3 Method
We first formulate a basic HOI detection scheme that is adopted by most existing works (Gupta et al. 2019; Li et al. 2019b; Wan et al. 2019; Zhou and Chi 2019; Li et al. 2020a). Then, we explain the verb polysemy problem and formulate the PD-Net framework. Finally, we describe each of the key components in PD-Net.
3.1 Problem Formulation
Given an image I, human and object proposals are generated using Faster R-CNN (Ren et al. 2015). Each human proposal h and each object proposal o are paired as a candidate for verb classification. HOI detection models (Gupta et al. 2019; Li et al. 2019b; Wan et al. 2019; Zhou and Chi 2019; Li et al. 2020a) then produce a set of verb classification scores for each candidate pair (h, o). The classification scores for verb v can be represented as follows:
\({\mathcal {S}}_{(h,o,v)}=\sigma \Big (\sum \nolimits _{i} T_{(i,v)}\big (G_{i}(h,o)\big )\Big ),\)   (1)
where i is a subscript that stands for the i-th feature type. Function \(G_{i}(\cdot )\) denotes the model that produces the i-th type of features (Gupta et al. 2019; Li et al. 2019b; Wan et al. 2019; Zhou and Chi 2019; Li et al. 2020a). \(T_{(i,v)}(\cdot )\) represents the classifier for the verb v using features produced by \(G_{i}(\cdot )\). \(\sigma (\cdot )\) denotes the sigmoid activation function.
As described in Sect. 1, existing works (Gupta et al. 2019; Li et al. 2019b; Wan et al. 2019; Zhou and Chi 2019; Li et al. 2020a) suffer from the verb polysemy problem. First, owing to the large intra-class appearance variance for one verb, it is challenging for function \(G_{i}(\cdot )\) to learn discriminative visual features for a polysemic verb. Second, the importance of the same feature type may vary dramatically as the objects of interest change. Third, \(T_{(i,v)}(\cdot )\) is often shared by different HOI categories with the same verb, which makes it difficult for \(T_{(i,v)}(\cdot )\) to capture important visual cues to identify one polysemic verb. Therefore, Eq. (1) cannot adequately address the verb polysemy problem.
Accordingly, in this paper, we propose PD-Net to address the above three problems. Given one human–object pair (h, o), the classification scores for verb v predicted by PD-Net can be represented as follows:
\({\mathcal {S}}^{\mathbf {PD}}_{(h,o,v)}=\sigma \Big (\sum \nolimits _{i} a_{(i,o,v)}\cdot T_{(i,o,v)}\big (G_{i}(h,o,w_{o},w_{v})\big )\Big ),\)   (2)
where \(w_{o}\) and \(w_{v}\) represent the word embeddings for the object and verb categories, respectively. \(G_{i}(\cdot )\) can leverage word embeddings of one verb-object pair (HOI category) to generate polysemy-aware features. \(a_{(i,o,v)}\) denotes attention scores produced by the Polysemy-Aware Modal Fusion module. \(T_{(i,o,v)}\) represents the clustering-based object-specific classifiers that are shared by semantically similar HOI categories with the same verb.
In the following, we introduce the framework of PD-Net based on Eq. (2).
3.2 Overview of PD-Net
The framework of PD-Net is illustrated in Fig. 3. Similar to existing works (Li et al. 2019b; Gupta et al. 2019; Wan et al. 2019; Zhou and Chi 2019; Li et al. 2020a), four types of visual features are adopted, i.e., human appearance, object appearance, human–object spatial, and human pose features. We construct these four types of features for PD-Net following (Gupta et al. 2019). In particular, the human and object appearance features are \(K_{A}\)-dimensional vectors extracted from the Faster R-CNN model using the human and object bounding boxes, respectively. The human–object spatial feature is a 42-dimensional vector encoded using the bounding box coordinates of one human–object pair. Moreover, we use a pose estimation model (Fang et al. 2017) to obtain the coordinates of 17 keypoints for each human instance. Following (Gupta et al. 2019), the human keypoints and the bounding box coordinates of the object proposal are then encoded into a 272-dimensional pose feature vector.
As outlined in Fig. 2, we transform the multi-label verb classification into a set of binary classification problems. Each of the binary classifiers is used to verify one verb category and only processes features whose language prior contains this verb. Moreover, each binary classifier includes a set of H, O, S, and P blocks; apart from the final layer, which is used for verb prediction, the parameters of the other layers in each respective block are shared across different binary classifiers. Therefore, the overall model size is comparable to that of an ordinary multi-label classifier. The binary classifiers mainly differ in terms of their input features and the way in which they combine predictions from the four feature streams.
In the following, we propose four novel components to handle the verb polysemy problem in HOI detection. First, we introduce the Language Prior-guided Channel Attention and Language Prior-based Feature Augmentation modules, which make the four types of features polysemy-aware. Second, we design the Polysemy-Aware Modal Fusion module, which adaptively fuses the prediction scores produced by the four feature streams and obtains the final prediction score for each binary classifier. Finally, we propose a Clustering-based Object-Specific classification scheme that strikes a balance between resolving the verb polysemy problem and reducing the number of binary classifiers in PD-Net.
3.3 Polysemy-Aware Feature Generation
We now introduce two novel components, i.e. Language Prior-guided Channel Attention and Language Prior-based Feature Augmentation, that generate polysemy-aware features. The two components are denoted as \(G_{i}(\cdot )\) in Eq. (2). The language prior used in this paper is the concatenation of two word embeddings: one for the verb to be identified (\(w_{v}\) in Eq. (2)) and one for the detected object in the human–object pair (\(w_{o}\) in Eq. (2)). The word embeddings are generated using the word2vec tool, which was trained on the Google News dataset (Mikolov et al. 2013). The dimension of the language prior is 600.
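As a concrete illustration, the following Python sketch assembles the 600-dimensional language prior using the gensim interface to the publicly released Google News word2vec vectors; the file path is illustrative, and the handling of multi-word categories (e.g. “soccer_ball”) is an assumption rather than part of our method.

```python
import numpy as np
from gensim.models import KeyedVectors

# Pre-trained 300-d word2vec embeddings trained on Google News
# (the file path below is illustrative).
word2vec = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def language_prior(verb: str, obj: str) -> np.ndarray:
    """Concatenate the verb and object word embeddings into a 600-d language prior."""
    w_v = word2vec[verb]  # 300-d verb embedding, e.g. "hold"
    w_o = word2vec[obj]   # 300-d object embedding, e.g. "book"
    return np.concatenate([w_v, w_o])  # shape: (600,)

prior = language_prior("hold", "book")
```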
3.3.1 Language Prior-guided Channel Attention
Both the human and object appearance features are usually redundant as only part of their information is involved for a specific HOI category. One example can be found in Fig. 1a, b, where the human body parts most relevant to “play soccer_ball” and “play frisbee” are the “feet” and “hands”, respectively. We therefore propose Language Prior-guided Channel Attention (LPCA) to highlight the important elements in human and object appearance features, based on the channel attention scheme and the guidance of language priors. LPCA is realized through two steps, which are outlined below.
First, we infer the important elements in the human or object appearance feature \(F_{A}\) using language priors. \(F_{A}\) is extracted from the Faster R-CNN model using a bounding box. As can be seen from Fig. 4, the language prior is projected to a \(K_{A}\)-dimensional vector (denoted as \(L_{A}\)) via two successive fully-connected layers. The dimension of the first one is set to \(\frac{K_{A}}{2}\); \(K_{A}\) is equal to the dimension of \(F_{A}\). Similar to (Peyre et al. 2019), \(L_{A}\) is normalized via its L2 norm. To drive \(L_{A}\) to pay attention to the important elements in \(F_{A}\) with respect to the verb-object pair, we perform element-wise multiplication between \(L_{A}\) and \(F_{A}\), as follows:
\(L_{B}=L_{A}\otimes F_{A},\)   (3)
and further compute the summation of elements in \(L_{B}\):
\({\mathcal {S}}_{au}=\sigma \big (Sum(L_{B})\big ),\)   (4)
where \(\sigma \) denotes the sigmoid activation function and \(Sum(\cdot )\) denotes the summation of all elements in one vector. During the training stage, we minimize the binary cross-entropy loss between \({\mathcal {S}}_{au}\) and the binary label for the verb to verify. During inference, the operation in Eq. (4) can be ignored. By optimizing the fully-connected layers via the verb verification goal, the value of elements in \(L_{A}\) can reflect the importance of the corresponding elements in \(F_{A}\).
It is worth noting that the quality of both \(L_{A}\) and \(L_{B}\) can be affected by the discrepancy between visual features and word embeddings, since the word embeddings are not specifically designed for computer vision tasks (Xu et al. 2019). Therefore, directly using \(L_{B}\) as representation for verb verification may be suboptimal. Consequently, to handle this problem, we propose the following strategy to further enhance the quality of the representations.
Second, we obtain attention scores based on \(L_{B}\) via a plain channel attention module:
\(C_{att}=\sigma \big (D(L_{B})\big ),\)   (5)
where \( C_{att}\) stands for the final channel attention scores for \(F_{A}\). \(D(\cdot )\) is realized by two successive fully-connected layers. By imposing a supervision on \({\mathcal {S}}_{au}\), the elements in \(L_{B}\) that are relevant to the verb to verify have large values, which facilitates the learning of an effective channel attention. For its part, the plain CA module then makes use of the correlation between elements in \(L_{B}\) to promote the quality of the attention scores. Finally, the polysemy-aware human or object appearance features can be obtained via the following equation:
\({\tilde{F}}_{A}=C_{att}\otimes F_{A}.\)   (6)
Existing channel attention models, e.g. the SE network (Hu et al. 2018), usually utilize appearance features to generate channel attention scores for the features themselves. Besides, the VSGNet model (Ulutan et al. 2020) adopted the human–object spatial features as cues to infer channel attention scores for the appearance feature. Compared with the above two works, LPCA incorporates clearer semantic information for each HOI. First, the input of LPCA includes language priors, which have clear semantic meaning. Second, LPCA imposes an auxiliary binary cross-entropy loss between \({\mathcal {S}}_{au}\) and the binary label of the verb to verify. This extra loss enables the visual features for different HOI categories with the same verb to be polysemy-aware according to their specific language priors. Moreover, language priors are adopted in (Peyre et al. 2019) to construct verb classifiers, in which the human and object features are fixed. In comparison, we utilize language priors to generate polysemy-aware features, which are adjusted for each verb-object pair to be identified.
Finally, LPCA can also be regarded as cross-modal attention (Lu et al. 2019) or a conditioning module (Perez et al. 2018). The differences between LPCA and existing cross-modal attention methods (e.g. ViLBERT (Lu et al. 2019)) and conditioning modules (e.g. FiLM (Perez et al. 2018)) are as follows. First, ViLBERT utilizes a set of image region features and a sequence of word embeddings as input. Then, it conducts cross-modal attention based on the standard transformer blocks (Vaswani et al. 2017), which compute query, key, and value matrices to produce the attention-pooled features. In comparison, LPCA only uses one visual feature vector and one text segment (the concatenation of the word embeddings of the verb and object category) as the input. The Hadamard product between the features \(F_{A}\) and \(L_{A}\) is utilized to generate the channel attention scores. Second, FiLM (Perez et al. 2018) carries out a feature-wise affine transformation on a deep model’s intermediate features, directly conditioned on the language prior. In comparison, in order to address the discrepancy between visual features and word embeddings, LPCA generates channel attention scores based on the correlation between the visual features and word embeddings.
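To make the two-step computation concrete, the following PyTorch sketch implements Eqs. (3)–(6). The layer sizes follow the description above; the ReLU between the two projection layers, the hidden width and final sigmoid of \(D(\cdot )\), and the initialization are assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LPCA(nn.Module):
    """Language Prior-guided Channel Attention (sketch)."""
    def __init__(self, prior_dim=600, feat_dim=2048):
        super().__init__()
        # Project the 600-d language prior to K_A dims via two FC layers,
        # the first of which has dimension K_A / 2.
        self.project = nn.Sequential(
            nn.Linear(prior_dim, feat_dim // 2), nn.ReLU(),
            nn.Linear(feat_dim // 2, feat_dim))
        # Plain channel attention module D(.), realized by two FC layers.
        self.D = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 2), nn.ReLU(),
            nn.Linear(feat_dim // 2, feat_dim))

    def forward(self, F_A, prior):
        L_A = F.normalize(self.project(prior), p=2, dim=-1)  # L2-normalized projection
        L_B = L_A * F_A                                      # Eq. (3): element-wise product
        S_au = torch.sigmoid(L_B.sum(dim=-1))                # Eq. (4): auxiliary score
        C_att = torch.sigmoid(self.D(L_B))                   # Eq. (5): channel attention
        F_tilde = C_att * F_A                                # Eq. (6): polysemy-aware feature
        return F_tilde, S_au

# Usage: one 2048-d appearance feature and one 600-d "hold book" prior.
lpca = LPCA()
F_tilde, S_au = lpca(torch.randn(1, 2048), torch.randn(1, 600))
```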
3.3.2 Language Prior-Based Feature Augmentation
Language Prior-based Feature Augmentation (LPFA) is applied to the human–object spatial and human pose features. As illustrated in Fig. 1c–f, the spatial and pose features are often vague and vary dramatically for the same verb, meaning that they contain insufficient information. Therefore, we propose LPFA to augment the pose and spatial features. More specifically, we concatenate each of the two features with the 600-dimensional language prior. As a result of this concatenation, the classifiers receive hints that can aid in reducing the intra-class variation of the same verb for the pose and spatial features.
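Because LPFA is a plain concatenation, it can be sketched in a few lines (the dimensions follow the text):

```python
import torch

def lpfa(feature: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
    """Language Prior-based Feature Augmentation: concatenate a spatial (42-d)
    or pose (272-d) feature with the 600-d language prior."""
    return torch.cat([feature, prior], dim=-1)

spatial_aug = lpfa(torch.randn(1, 42), torch.randn(1, 600))   # shape: (1, 642)
pose_aug = lpfa(torch.randn(1, 272), torch.randn(1, 600))     # shape: (1, 672)
```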
3.4 Polysemy-Aware Modal Fusion
As illustrated in Fig. 3, the four feature streams are sent to the H, O, S and P blocks, respectively. The H and O blocks are constructed using two successive fully-connected layers, while the S and P blocks are constructed using three successive fully-connected layers. In the interests of simplicity, the dimension of each hidden fully-connected layer is set to the dimension of its input feature vector. The output dimension of these four blocks is set to \(K_{C}\), which is the number of binary classifiers in PD-Net.
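A minimal sketch of one such block is given below; the ReLU activations between layers are an assumption, and the S and P input dimensions assume the LPFA-augmented spatial (42 + 600) and pose (272 + 600) features.

```python
import torch.nn as nn

def make_block(in_dim: int, num_layers: int, out_dim: int) -> nn.Sequential:
    """One feature-stream block: fully-connected layers whose hidden width equals
    the input dimension, ending with a K_C-dimensional output (one logit per
    CSP classifier)."""
    layers = []
    for _ in range(num_layers - 1):
        layers += [nn.Linear(in_dim, in_dim), nn.ReLU()]
    layers.append(nn.Linear(in_dim, out_dim))
    return nn.Sequential(*layers)

K_C = 187                                              # number of CSP classifiers on HICO-DET
h_block = make_block(2048, num_layers=2, out_dim=K_C)  # H and O blocks: two FC layers
s_block = make_block(642, num_layers=3, out_dim=K_C)   # S block: three FC layers
p_block = make_block(872, num_layers=3, out_dim=K_C)   # P block: three FC layers
```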
As discussed in Sect. 1, one major challenge posed by the verb polysemy problem is that the relative importance of each of the four feature streams to the identification of the same verb may vary dramatically as the objects change. As shown in Fig. 1e, f, the human appearance and pose features are most important for detecting \({<}{} \textit{person fly kite}{>}\); by contrast, these features are almost invisible and therefore less useful for detecting \({<}{} \textit{person fly airplane}{>}\). Therefore, we propose Polysemy-Aware Modal Fusion (PAMF) to generate attention scores that dynamically fuse the predictions of the four feature streams. To the best of our knowledge, PAMF is the first attention module that effectively uses language priors to address the verb polysemy problem in HOI detection. In more detail, we use the same 600-dimensional language prior (e.g. “hold book”) as that employed in LPCA and LPFA. The language prior is fed into two successive fully-connected layers, the dimensions of which are 48 and 4, respectively. The first fully-connected layer is followed by a ReLU layer, while the second one is followed by a sigmoid activation function. The output of PAMF is used as attention scores for the four feature streams. In this way, the feature streams that are important for each HOI category are highlighted, while those that are less important are suppressed. PAMF is a lightweight module that can be effectively optimized even with limited training data. Moreover, we use the pre-trained word embeddings (Mikolov et al. 2013) as input for PAMF. These word embeddings encode semantic relationships as priors, which further reduces the difficulty of optimizing PAMF.
Therefore, Eq. (2) can be rewritten as follows:
\({\mathcal {S}}^{\mathbf {PD}}_{(h,o,v)}=\sigma \Big (\sum \nolimits _{i} a_{(i,o,v)}\cdot s_{(i,o,v)}\Big ),\)   (7)
where i denotes one feature stream, while \(a_{(i,o,v)}\) is the attention score generated by PAMF for the i-th feature stream. \(s_{(i,o,v)}\) is the output for verb v generated by the i-th feature stream.
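The following PyTorch sketch shows PAMF (600 → 48 → 4 with ReLU and sigmoid, as described above) together with the weighted fusion of the four stream scores for one verb; applying the sigmoid to the weighted sum follows Eq. (7) as written, and the per-stream scores are assumed to be the logits selected from the corresponding blocks.

```python
import torch
import torch.nn as nn

class PAMF(nn.Module):
    """Polysemy-Aware Modal Fusion (sketch): maps the 600-d language prior to
    one attention score per feature stream (H, O, S, P)."""
    def __init__(self, prior_dim=600, hidden_dim=48, num_streams=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(prior_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_streams), nn.Sigmoid())

    def forward(self, prior):
        return self.fc(prior)  # attention scores a_{(i,o,v)}, shape (..., 4)

def fuse_stream_scores(stream_scores, attention):
    """Eq. (7): sigmoid of the attention-weighted sum of the four stream scores."""
    return torch.sigmoid((attention * stream_scores).sum(dim=-1))

pamf = PAMF()
a = pamf(torch.randn(1, 600))  # attention over the H, O, S, and P streams
s = torch.randn(1, 4)          # the four streams' logits for the verb to verify
verb_score = fuse_stream_scores(s, a)
```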
3.5 Clustering-Based Object-Specific Verb Classifiers
Although the above proposed components, i.e. LPCA, LPFA, and PAMF, help object-shared verb classifiers relieve the verb polysemy problem, the defects of object-shared verb classifiers still remain. The essential reason is that HOIs with different objects share the same verb classifier. Under ideal circumstances, object-specific verb classifiers can overcome this problem if sufficient training data exists for each HOI category. However, if we assume that the number of object categories is |O| and the number of verb categories is |V|, the total number of their combinations is \(|O|\times |V|\), which is usually very large even after meaningless HOI categories are removed. It is therefore too difficult to obtain sufficient training samples for each HOI category. Moreover, due to the class imbalance problem among HOI categories, the object-specific classifiers lack few- and zero-shot learning abilities for HOI categories that have small amounts of training data. Therefore, both types of verb classifiers have limitations.
In this subsection, we introduce novel verb classifiers, named Clustering-based object-SPecific (CSP) verb classifiers, which are denoted as \(T_{(i,o,v)}\) in Eq. (2). CSP classifiers strike a balance between overcoming the verb polysemy problem and handling the zero- or few-shot learning problems. The main motivation behind CSP classifiers is that some HOIs tagged with the same verb are both semantically and visually similar, e.g. \({<}{} \textit{person hold sheep}{>}\), \({<}{} \textit{person hold horse}{>}\), and \({<}{} \textit{person hold cow}{>}\); they can therefore share the same verb classifier, which reduces the number of object-specific classifiers. In more detail, we first obtain all meaningful and common HOI categories for each verb, which are available in popular databases such as HICO-DET (Chao et al. 2018) and V-COCO (Gupta and Malik 2015). The number of meaningful HOI categories containing the verb v is denoted by \({O}_{v}\). We then use the K-means method (MacQueen et al. 1967) to cluster the HOI categories with the same verb v into \({C}_{v}\) clusters according to the cosine distance between the word embeddings of the objects. We empirically set \({C}_{v}\) for each verb to the rounded square root of \({O}_{v}\).
We provide visualizations of the clustering results for some polysemic verbs in the supplementary file. During both training and inference, only one CSP classifier is adopted to predict whether the verb is positive for one verb-object pair. The adopted CSP classifier is determined by the object category in the verb-object pair. This clustering strategy is also capable of handling the few- and zero-shot HOI detection problems (Bansal et al. 2020). For example, during testing, a new HOI category \({<}{} \textit{person hold elephant}{>}\) can share the same classifier with other HOI categories that have similar semantic meanings (e.g. \({<}{} \textit{person hold horse}{>}\)).
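The clustering step can be sketched as follows; since scikit-learn's KMeans uses Euclidean distance, L2-normalizing the object embeddings is used here as a common approximation to cosine-distance clustering, which is an assumption rather than the exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def build_csp_clusters(verb_to_objects, object_embeddings):
    """Cluster the objects paired with each verb into round(sqrt(O_v)) groups.

    verb_to_objects: dict mapping a verb to the list of object names it occurs with.
    object_embeddings: dict mapping an object name to its 300-d word2vec vector.
    """
    verb_to_cluster = {}
    for verb, objects in verb_to_objects.items():
        n_clusters = max(1, int(round(np.sqrt(len(objects)))))  # C_v = round(sqrt(O_v))
        embs = normalize(np.stack([object_embeddings[o] for o in objects]))
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=0).fit_predict(embs)
        verb_to_cluster[verb] = dict(zip(objects, labels.tolist()))
    return verb_to_cluster

# At train/test time, the CSP classifier for a (verb, object) pair is selected
# via verb_to_cluster[verb][object].
```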
Besides our automatic approach to clustering semantically similar HOI categories, one alternative is to identify all possible semantic meanings of a verb from an English dictionary, which usually elaborates the different senses of each verb. The number of clusters for a verb can thus be determined, after which one may manually associate each HOI category with one semantic meaning.
One recent work (Bansal et al. 2020) also utilized clustering to achieve zero-shot HOI detection. There are two differences between the clustering strategies in this work and in (Bansal et al. 2020). First, we utilize clustering to build new verb classifiers, whereas the clustering strategy in (Bansal et al. 2020) is used to generate new training data. Second, we cluster the available HOI categories in a database for each respective verb, whereas clustering is conducted only once in (Bansal et al. 2020). Our rationale is that some HOI categories are meaningless; by clustering the available HOI categories for each verb, we obtain more meaningful and fine-grained clustering results.
3.6 Training and Testing
3.6.1 Training
PD-Net can be conceptualized as a multi-task network. Its loss for the verification of the verb v in one HOI category (h, v, o) can be represented as follows:
\({\mathcal {L}}={\mathcal {L}}_{BCE}\big ({\mathcal {S}}^{\mathbf {PD}}_{(h,o,v)}, {l}_{v}\big )+{\mathcal {L}}_{BCE}\big ({\mathcal {S}}^{{\mathbf {H}}}_{au}, {l}_{v}\big )+{\mathcal {L}}_{BCE}\big ({\mathcal {S}}^{{\mathbf {O}}}_{au}, {l}_{v}\big ),\)   (8)
where \({\mathcal {L}}_{BCE}\) represents binary cross-entropy loss, while \({l}_{v}\) denotes a binary label (\({l}_{v} \in \) \(\{ 0,1 \}\)) for one verb to verify. Moreover, \({\mathcal {S}}^{{\mathbf {H}}}_{au}\) and \({\mathcal {S}}^{{\mathbf {O}}}_{au}\) denote the output of Eq. (4) for the human and object appearance features, respectively.
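A sketch of this multi-task loss, with the three binary cross-entropy terms weighted equally as in Eq. (8) (the equal weighting is our reading of the text), is given below.

```python
import torch
import torch.nn.functional as F

def pdnet_loss(s_pd, s_au_h, s_au_o, label):
    """Binary cross-entropy on the fused verb score plus the two auxiliary
    LPCA scores (human and object appearance streams); all scores are
    post-sigmoid probabilities."""
    return (F.binary_cross_entropy(s_pd, label)
            + F.binary_cross_entropy(s_au_h, label)
            + F.binary_cross_entropy(s_au_o, label))

label = torch.tensor([1.0])  # l_v in {0, 1} for the verb to verify
loss = pdnet_loss(torch.tensor([0.7]), torch.tensor([0.6]),
                  torch.tensor([0.8]), label)
```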
3.6.2 Testing
During testing, we use the same method as that utilized in the training stage to obtain the language priors. Here, the object category in the prior is predicted using Faster R-CNN (rather than the ground truth); the verb category in the prior varies for each binary verb classifier. Following (Li et al. 2019b, 2020a; Ulutan et al. 2020; Wan et al. 2019), we also construct an Interactiveness Network (INet) capable of suppressing pairs without interaction. Finally, the prediction score for one HOI category (h, v, o) is represented as follows:
\({\mathcal {S}}_{(h,v,o)}={\mathcal {S}}_h\cdot {\mathcal {S}}_o\cdot {\mathcal {S}}^{{\mathbf {I}}}_{(h,o)}\cdot {\mathcal {S}}^{\mathbf {PD}}_{(h,o,v)},\)   (9)
where \({\mathcal {S}}_h\) and \({\mathcal {S}}_o\) are the detection scores of human and object proposals, respectively, while \({\mathcal {S}}^{{\mathbf {I}}}_{(h,o)}\) denotes the prediction score generated by the pre-trained INet. In the experimental section below, we demonstrate that INet slightly promotes the performance of PD-Net.
4 Experimental Setup
4.1 Datasets
HICO-DET (Chao et al. 2018) is a large-scale dataset for HOI detection, containing a total of 47,776 images; of these, 38,118 images are assigned to the training set, while the remaining 9568 images are used as the testing set. There are 117 verb categories, 80 object categories, and 600 common HOI categories overall; moreover, these 600 HOI categories are divided into 138 rare and 462 non-rare categories. Each rare HOI category contains fewer than 10 training samples. Each verb is included in an average of five HOI categories.
V-COCO (Gupta and Malik 2015) is a subset of MS-COCO (Lin et al. 2014) and contains 2533, 2867, and 4946 images used for training, validation and testing, respectively. There are 24 verb categories and 259 HOI categories in total. Each verb is included in 10 HOI categories on average.
HOI-VerbPolysemy (HOI-VP) is a new database constructed in this paper. To the best of our knowledge, this is the first database to be designed explicitly for the verb polysemy problem in HOI detection. In more detail, it consists of 15 common verbs (predicates) that have rich and diverse semantic meanings. It also contains 517 common objects in real-world scenarios. Each verb is included in an average of 55 HOI categories, as detailed in Table 1. In particular, “in” and “on” are two highly common predicates that are also polysemic in visual relationship detection tasks (Lu et al. 2016; Krishna et al. 2017; Kuznetsova et al. 2020; Ji et al. 2020) and are thus both included in the HOI-VP database. There are 21,928 and 7262 images used for training and testing, respectively. All images are collected from the VG database (Krishna et al. 2017), while the corresponding annotations are provided by the HCVRD database (Zhuang et al. 2017). In the HOI-VP dataset, we only use images that were labelled with the 15 predicates listed in Table 1. These images and their labels are collected based on the HCVRD dataset. Therefore, the images of HOI-VP can be considered as a subset of HCVRD.
It is worth noting here that the annotations in HCVRD contain noise. For example, the same verb may be annotated with different words, e.g. “hold”, “holds”, and “holding”, while a similar problem exists for the objects, e.g. “camera”, “digital camera”, and “video camera”. We therefore merge the different annotations for the same verb or object category. In the following, we take “hold” as an example to explain the correction of annotations. We first search for labels that are highly relevant to the keyword “hold”. We then manually check the images with these labels to make sure that the labels indeed have the same semantic meaning. Finally, we merge the labels that share the semantic meaning of “hold”. The annotation noise for each object category mainly stems from its fine-grained attributes. For example, the object “shirt” may be labelled as “black shirt”, “blue shirt”, and “stripe shirt”. The merging steps for object labels are the same as those for the verbs. Some sample images from HOI-VP are illustrated in Fig. 9. This database will be made publicly available to expedite research into the verb polysemy problem.
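A minimal sketch of the label-merging step is shown below; the mapping entries are illustrative, since the real merging was built by keyword search followed by manual inspection of the matched images.

```python
# Illustrative canonical-label maps (the actual maps were curated manually).
VERB_MAP = {"holds": "hold", "holding": "hold"}
OBJECT_MAP = {"black shirt": "shirt", "blue shirt": "shirt", "stripe shirt": "shirt",
              "digital camera": "camera", "video camera": "camera"}

def canonicalize(label: str, mapping: dict) -> str:
    """Map a raw annotation to its merged category; unknown labels are kept as-is."""
    key = label.strip().lower()
    return mapping.get(key, key)

assert canonicalize("Holding", VERB_MAP) == "hold"
assert canonicalize("blue shirt", OBJECT_MAP) == "shirt"
```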
4.2 Evaluation Metrics
According to the official protocols (Chao et al. 2018; Gupta and Malik 2015), mean average precision (mAP) is used as the evaluation metric for HOI detection on both the HICO-DET and V-COCO datasets. A positive human–object pair must meet the following requirements: first, the predicted HOI category must match the ground truth; second, both the human and object proposals must have an Intersection over Union (IoU) with the ground-truth proposals of more than 0.5. Moreover, there are two mAP modes in HICO-DET, namely the Default (DT) mode and the Known-Object (KO) mode. In the DT mode, we calculate the average precision (AP) for each HOI category over all testing images. In the KO mode, the object categories in all images are known; therefore, we only need to compute the AP for each HOI category from images containing the object of interest. For example, we evaluate the AP of \({<}{} \textit{person ride horse}{>}\) using only those testing images that contain a “horse”. Since the images that contain the object category of interest are known, the KO mode better reflects the verb classification ability. For V-COCO, the role mAP (Gupta and Malik 2015) (\(AP_{role}\)) is used for evaluation.
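For reference, the IoU test used to match a predicted pair against a ground-truth pair can be sketched as follows (the box format and the absence of a +1 offset are assumptions):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def boxes_match(pred_h, pred_o, gt_h, gt_o, threshold=0.5):
    """A predicted pair is positive only if both its boxes overlap the ground
    truth with IoU > 0.5 (the HOI category match is checked separately)."""
    return iou(pred_h, gt_h) > threshold and iou(pred_o, gt_o) > threshold
```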
For the HOI-VP database, we use an evaluation protocol similar to that of HICO-DET. As there are as many as 517 object categories in HOI-VP, object detection becomes a challenging task. Accordingly, to reduce the impact of object detection errors, the ground-truth bounding boxes and categories for both human and object instances are provided. This strategy is similar to the Predicate Classification (PREDCLS) protocol, which has been widely adopted in scene graph generation tasks (Zellers et al. 2018; Lin et al. 2020). It facilitates a clean comparison of verb classification ability between different HOI detection models.
4.3 Implementation Details
To facilitate fair comparison with existing works, we consider two popular object detection models for PD-Net. The first of these, Faster R-CNN (Ren et al. 2015) with ResNet-50-FPN (Lin et al. 2017) backbone, attaches a Feature Pyramid Network (FPN) to ResNet-50 (He et al. 2016) and generates object proposals from the FPN. Based on these proposals, instance appearance features are extracted from the ResNet-50 model. The second model is Faster R-CNN with ResNet-152 backbone (He et al. 2016). Here, both instance proposals and appearance features are obtained from the ResNet-152 model. The above two object detectors are trained on the COCO database (Lin et al. 2014). As shown in Table 2, the two object detectors achieve comparable detection performance. Moreover, to facilitate fair comparison with the majority of existing works (Ulutan et al. 2020; Gupta et al. 2019; Li et al. 2019b; Peyre et al. 2019; Qi et al. 2018), we fix the parameters of both object detectors. Following (Gupta et al. 2019), for both human and each object category, we first select the top 10 proposals according to the detection scores after non-maximum suppression. Moreover, the bounding boxes whose detection scores are lower than 0.01 are removed. The dimension of appearance features for both object detectors, i.e. \(K_{A}\), is 2,048.
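The proposal selection described above (at most the top 10 proposals per category after non-maximum suppression, with a 0.01 score threshold) can be sketched as follows, assuming the detections are already grouped by category:

```python
def select_proposals(detections, top_k=10, score_threshold=0.01):
    """Keep at most the top-10 proposals per category and drop boxes whose
    detection score falls below 0.01.

    detections: dict mapping a category name to a list of (box, score) tuples
    that have already been processed by non-maximum suppression.
    """
    selected = {}
    for category, dets in detections.items():
        dets = [d for d in dets if d[1] >= score_threshold]
        dets.sort(key=lambda d: d[1], reverse=True)
        selected[category] = dets[:top_k]
    return selected
```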
Following existing works (Gao et al. 2018; Xu et al. 2019; Li et al. 2019b; Gupta et al. 2019; Peyre et al. 2019; Qi et al. 2018; Liao et al. 2020; Wang et al. 2020; Ulutan et al. 2020; Zhong et al. 2021), the HOI categories that appear in the training set are taken as the meaningful and common HOI categories of each HOI database. The output dimension of the H, O, S, and P blocks, i.e. \(K_{C}\), is set to 187, 45, and 83 on the HICO-DET, V-COCO, and HOI-VP databases, respectively; these figures are equal to the number of CSP classifiers on each respective database. We train PD-Net for 6 (10) epochs using the Adam optimizer (Kingma et al. 2014) with a learning rate of 1e-3 (1e-4) on HICO-DET (V-COCO), while on HOI-VP, we train PD-Net for 12 epochs using a learning rate of 1e-3. During testing, we rank the HOI candidate pairs according to their detection scores (obtained via Eq. (9)) and calculate mAP for evaluation purposes.
5 Experimental Results and Discussion
5.1 Ablation Studies
To demonstrate the effectiveness of each proposed component in PD-Net, we perform ablation studies on the HICO-DET database. In Table 3, the baseline is constructed by removing Language Prior-guided Channel Attention (LPCA), Language Prior-based Feature Augmentation (LPFA), and Polysemy-Aware Modal Fusion (PAMF) from PD-Net; we also replace Clustering-based object-SPecific (CSP) classifiers with object-SHared (SH) classifiers. The other settings for the baseline remain the same as in PD-Net. For both models, the Faster R-CNN with ResNet-152 backbone is used for object detection. Experimental results are summarized in Table 3. From these results, we can make the following observations.
5.1.1 Effectiveness of PAMF
Polysemy-Aware Modal Fusion is designed to decipher the verb polysemy by assigning larger weights to more important feature types for each HOI category. As shown in Table 3, PAMF promotes the performance of the baseline by 1.29% and 1.36% mAP in DT and KO modes, respectively.
5.1.2 Effectiveness of LPFA
Language Prior-based Feature Augmentation is used to provide hints for the classifier in order to reduce the intra-class variation of the pose and spatial features by augmenting them with language priors. When LPFA is incorporated, HOI detection performance is promoted by 0.52% and 0.21% mAP in DT and KO modes, respectively.
5.1.3 Effectiveness of LPCA
The appearance features are redundant for HOI detection. Language Prior-guided Channel Attention is proposed to generate polysemy-aware appearance features. As can be seen from Table 3, LPCA promotes the HOI detection performance by 1.33% and 0.21% mAP in DT and KO modes, respectively.
5.1.4 Effectiveness of CSP Classifiers
Clustering-based object-SPecific classifiers can relieve the verb polysemy problem by assigning the same verb classifier to semantically similar HOI categories. As shown in Table 3, CSP classifiers improve the HOI detection performance by 1.06% and 2.13% mAP in DT and KO modes, respectively.
5.1.5 Drop-One-Out Study
We further perform a drop-one-out study in which each proposed component is removed individually. These experimental results further demonstrate that each component is indeed helpful to promote HOI detection performance.
Finally, when INet is integrated, the mAP of PD-Net in the DT mode is further promoted by 0.60%. However, the mAP in the KO mode does not improve. This is because INet can assist PD-Net by suppressing candidate pairs without interactions, which are usually caused by incorrect or redundant object proposals in the DT mode. However, the KO mode is comparatively less affected by object detection errors; therefore, PD-Net can achieve high performance without the assistance of INet in this mode. This experiment demonstrates that the strong performance of PD-Net is primarily a result of its excellent verb classification ability.
5.2 Comparisons with Variants of PD-Net
5.2.1 Comparisons with Variants of the Language Prior
In this experiment, we remove the word embedding of the object category from the language prior so that only the word embedding of the verb category to identify is used as input for PAMF, LPFA, and LPCA. As shown in Table 4, without the word embedding of the object category, the performance of PD-Net drops by a large margin of 2.39% (1.87%) mAP in DT (KO) mode. These experimental results indicate that the word embedding of the object category in the language prior is an important hint to decipher the verb polysemy problem.
5.2.2 Comparisons with Variants of LPCA
In this experiment, we compare the performance of Language Prior-guided Channel Attention (LPCA) with five possible variants: namely, Plain Channel Attention (CA), ‘w/o \({\mathcal {S}}_{au}\)’, ‘w/o \({\mathcal {C}}_{att}\)’, ‘\(D([L_{A}, F_{A}])\)’ and FiLM (Perez et al. 2018). The other implementation details of PD-Net are kept the same for different variants. Plain CA means that we feed the appearance feature \(F_{A}\) directly into a plain CA module, i.e. \(D(\cdot )\) in Fig. 4, and obtain \({\tilde{F}}_{A}\). ‘w/o \(S_{au}\)’ involves removing the extra supervision signal \(S_{au}\) from LPCA, while ‘w/o \({C}_{att}\)’ means that we directly use \(L_{B}\) in Fig. 4 as the input of the \({\mathbf {H}}\) and \({\mathbf {O}}\) blocks in Fig. 3, without the further processing by the plain CA module. ‘\(D([L_{A}, F_{A}])\)’ means that we use the concatenation of \(L_{A}\) and \(F_{A}\) as the input for function \(D(\cdot )\) to generate channel attention scores in Eq. (5). FiLM means that we replace LPCA with a FiLM layer (Perez et al. 2018). Experimental results are tabulated in Table 5. In this table, ‘w/o LPCA’ is a baseline that removes the entire LPCA module from PD-Net. From these results, we can make the following observations.
First, the plain CA module alone slightly promotes the performance of PD-Net. One main reason for this is that the plain CA module has very little ability to identify important elements in the appearance features for each HOI category.
Second, without supervision from \(S_{au}\), the performance of LPCA degrades dramatically. Compared to the plain CA module, this setting adopts language priors to provide cues regarding the channel-wise importance of \(F_{A}\) for each HOI category. However, it receives only implicit supervision from the binary score \({\mathcal {S}}^{\mathbf {PD}}_{(h,o,v)}\) in Eq. (9), which is too weak to optimize LPCA’s parameters. We therefore observe degraded performance after the extra supervision \({\mathcal {S}}_{au}\) is removed.
Third, ‘w/o \(C_{att}\)’ obtains better performance than both the ‘Plain CA’ and ‘w/o \(S_{au}\)’ settings. However, its performance is still lower than that of our proposed LPCA by 0.38% and 0.52% mAP in DT and KO modes, respectively. This may be because \(L_{A}\) is obtained via projection from the language prior. As word embeddings are not specifically designed for computer vision tasks, \(L_{A}\) may not always be reliable and the quality of \(L_{B}\) is affected (Xu et al. 2019). Therefore, further processing \(L_{B}\) using the plain CA module is helpful.
Fourth, LPCA outperforms \(D([L_{A}, F_{A}])\) by significant margins in both DT and KO modes. There are two main reasons for this. First, the concatenation operation significantly increases the model size of the channel attention module, which makes the model more difficult to train. Second, with the optimization on \({\mathcal {S}}_{au}\), \(L_{B}\) can provide more direct hints about important channels in \(F_{A}\) to the verb to verify than \([L_{A}, F_{A}]\).
Fifth, LPCA also outperforms FiLM (Perez et al. 2018) by significant margins in both DT and KO modes. This is because the feature-wise affine transformation in (Perez et al. 2018) is directly conditioned on the language prior; therefore, it can be affected by the semantic misalignments between visual features and word embeddings. In comparison, LPCA can better address this discrepancy because it produces channel attention scores conditioned on the correlation between language priors and visual features. Moreover, FiLM (Perez et al. 2018) is only supervised by the final classification loss of the model, while LPCA is also optimized by an auxiliary supervision on \(S_{au}\).
In comparison, our proposed LPCA achieves the best performance for the following reasons. First, it adopts language priors to provide hints regarding the channel-wise importance of \(F_{A}\) for each HOI category. Second, it imposes direct supervision to the attention module, which helps to more effectively optimize the model parameters. Third, it refines the attention vector obtained from the language priors using a plain CA module, which enhances the quality of the channel attention vectors. The above experimental results and analysis demonstrate the effectiveness of LPCA.
5.2.3 Comparisons with Variants of Verb Classifiers
To further demonstrate the advantages of clustering-based object-specific (CSP) classifiers, we compare their performance with that of object-SHared (SH) and object-SPecific (SP) verb classifiers. To facilitate fair comparison, other settings of PD-Net remain unchanged. Experimental results are tabulated in Table 6. It is shown that SH classifiers outperform SP classifiers by 1.42% (2.06%) mAP in DT (KO) mode for rare HOI categories. This is because SH classifiers enable these rare HOI categories to share verb classifiers with other HOI categories that have sufficient training data. By comparison, SP classifiers are better able to relieve the verb polysemy problem for the HOI categories that have sufficient training data. Therefore, the SP classifiers outperform SH classifiers by 0.23% (0.41%) mAP in DT (KO) mode for non-rare HOI categories.
In comparison, CSP classifiers achieve superior performance on both rare and non-rare HOI categories. This is due to the same verb classifiers being assigned to semantically similar HOI categories, enabling HOI categories with few training samples to share verb classifiers with those HOI categories that have sufficient training data. Moreover, different verb classifiers are adopted for semantically different HOI categories, which is helpful to overcome the verb polysemy problem. Overall, CSP classifiers outperform SH and SP classifiers by 1.31% (2.03%) and 1.46% (2.19%) mAP in DT (KO) mode on the full HOI categories, respectively. The superior performance on rare HOI categories demonstrates the few-shot learning capability of CSP classifiers. We further justify the effectiveness of CSP classifiers in terms of zero-shot HOI detection in Section B of the supplementary file.
5.3 Comparisons with State-of-the-Art Methods
We compare the performance of PD-Net with state-of-the-art methods on three databases, namely HICO-DET, V-COCO, and HOI-VP. Experimental results are summarized in Table 7, Table 8, and Table 10, respectively.
5.3.1 Performance Comparisons on HICO-DET
As shown in Table 7, PD-Net outperforms state-of-the-art methods by significant margins with both object detector backbones. It is worth noting that one of the most recent methods, PPDM (Liao et al. 2020), adopts CenterNet with an Hourglass-104 backbone (Zhou et al. 2019b) as its object detector. As shown in Table 2, this object detector significantly outperforms the two Faster R-CNN object detectors utilized in our model. To facilitate fair comparison, we mainly compare PPDM with PD-Net in the KO mode, as this mode is less affected by object detection quality. As shown in Table 7, PD-Net outperforms PPDM in KO mode by significant margins of 2.28%, 5.05%, and 1.60% mAP on the full, rare, and non-rare HOI categories, respectively. Moreover, PD-Net also outperforms PPDM by 0.64% mAP in the DT mode on the full HOI categories.
Moreover, as shown in Table 7 and Table 2, the object detector adopted by another recent work (Ulutan et al. 2020) is also much stronger than ours. Nevertheless, PD-Net still outperforms this model by large margins of 2.57% (22.37–19.80%), 1.56% (17.61–16.05%), and 2.88% (23.79–20.91%) mAP in the DT mode on the full, rare, and non-rare HOI categories, respectively.
Finally, with a similar multi-stream representation network and object detector backbone (ResNet-50-FPN), PD-Net outperforms the very recent model 2D-RN (Li et al. 2020a) by 3.03% (25.59–22.56%) and 0.78% (20.76–19.98%) mAP on the full HOI categories in the KO and DT modes, respectively. Another advantage of PD-Net over 2D-RN is that PD-Net requires no extra human annotation. Moreover, 3D human pose and 3D object locations are utilized to improve 2D-RN during inference in (Li et al. 2020a). To facilitate fair comparison, we only compare the performance of PD-Net with methods that utilize 2D human pose and 2D object locations during inference.
To further illustrate the advantage of PD-Net in deciphering the verb polysemy problem, in Fig. 5 we present the top 10 verbs (out of the 117 verbs in HICO-DET) ranked by the number of HOI categories in which each verb appears. The largest number of HOI categories associated with a single verb (“hold”) is 61. As these verbs are more likely to be affected by the visual polysemy problem, we compare the performance of PD-Net with one state-of-the-art method (Gupta et al. 2019) on them; this method is chosen because it is very similar to our baseline. The results show that PD-Net achieves superior performance on all of these top 10 verbs.
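For reference, such a ranking can be computed directly from the list of annotated HOI categories; the snippet below is a minimal sketch that counts, for each verb, how many HOI categories it appears in. The `hoi_categories` input format is an assumption for illustration.

```python
from collections import Counter

def top_polysemic_verbs(hoi_categories, k=10):
    """Rank verbs by the number of HOI categories they appear in.

    `hoi_categories` is assumed to be a list of (verb, object) pairs,
    one per HOI category (e.g. the 600 categories of HICO-DET).
    """
    counts = Counter(verb for verb, _ in hoi_categories)
    return counts.most_common(k)

# Illustrative call (toy pairs only):
# top_polysemic_verbs([("hold", "apple"), ("hold", "cup"), ("ride", "horse")])
# -> [("hold", 2), ("ride", 1)]
```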
5.3.2 Performance Comparisons on V-COCO
To boost performance on V-COCO, we add another appearance feature stream to both our baseline and PD-Net, following (Wan et al. 2019); there are consequently a total of five feature streams in the experiments on V-COCO. This new stream extracts appearance features from the union box of each human–object pair, and we further apply LPCA to this feature stream in PD-Net. As shown in Table 8, PD-Net outperforms state-of-the-art methods by clear margins with both object detectors. In particular, PD-Net outperforms one of the most recently developed methods, VSGNet (Ulutan et al. 2020). As shown in Table 2, the object detector (Huang et al. 2017) utilized by VSGNet is much stronger than ours; nevertheless, PD-Net still outperforms VSGNet by clear margins, as indicated in Table 8.
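For clarity, the union box mentioned above is simply the smallest box enclosing both the human box and the object box; the helper below is a minimal sketch assuming (x1, y1, x2, y2) pixel coordinates.

```python
def union_box(human_box, object_box):
    """Smallest box enclosing a human box and an object box.

    Boxes are assumed to be (x1, y1, x2, y2) tuples in pixel coordinates;
    the appearance feature of this union region is what the extra stream
    described above operates on.
    """
    x1 = min(human_box[0], object_box[0])
    y1 = min(human_box[1], object_box[1])
    x2 = max(human_box[2], object_box[2])
    y2 = max(human_box[3], object_box[3])
    return (x1, y1, x2, y2)

# e.g. union_box((50, 40, 120, 200), (100, 150, 260, 240)) -> (50, 40, 260, 240)
```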
Moreover, PD-Net outperforms another particularly strong model, PMFNet (Wan et al. 2019), by 0.3% (52.3–52.0%) mAP. The excellent performance of PMFNet may partly stem from its use of human part features. Therefore, we adopt the same five feature streams, including the human part features used in PMFNet, as input for PD-Net; this model is denoted as PD-Net\(^{\dagger }\) in Table 8, and the proposed components of PD-Net remain unchanged. PD-Net\(^{\dagger }\) outperforms PMFNet by a large margin of 1.3% (53.3–52.0%) mAP. Moreover, as shown in Table 9, we compare the performance of PD-Net\(^{\dagger }\) and PMFNet on each of the 24 verbs in V-COCO; our method demonstrates superior performance on the vast majority of verb classes.
5.3.3 Performance Comparisons on HOI-VP
We next compare the performance of PD-Net with several recent open-source methods, i.e. iCAN (Gao et al. 2018), TIN (Li et al. 2019b), No-Frills (Gupta et al. 2019), and PMFNet (Wan et al. 2019), on the new HOI-VP database. We also reproduce the method presented in (Peyre et al. 2019), which achieves high performance on the HICO-DET database. To facilitate fair comparison, we compare the performance of PD-Net with each of these methods using the same feature extraction backbone. As shown in Table 10, PD-Net consistently achieves the best performance among all compared methods. In particular, PD-Net outperforms the recent strong model PMFNet by a clear margin of 1.36% (63.66–62.30%) mAP. As the verbs (predicates) in the HOI-VP database are common and polysemic in real-world scenarios, the experimental results on this database demonstrate the superiority of PD-Net in overcoming the verb polysemy problem.
Visualization of PD-Net’s advantage in deciphering the verb polysemy problem on HICO-DET. We randomly select three verbs affected by the polysemy problem: “hold” (top row), “ride” (middle row), and “open” (bottom row). The green and red numbers denote the AP of our baseline and PD-Net respectively for the same HOI category (Color figure online)
5.4 Qualitative Visualization Results
Figure 6 illustrates the attention scores produced by PAMF for the four types of features. The HOI categories in this figure share the verb “ride”, but differ dramatically in semantic meaning. The “person” proposal in Fig. 6a is very small and severely occluded while the “airplane” proposal is very large; therefore, the object appearance feature is much more important for verb classification than the human appearance feature. In Fig. 6b, both the spatial feature and the object appearance feature play important roles in determining the verb. The attention scores for Fig. 6c, d are similar, as \({<}{} \textit{person ride horse}{>}\) and \({<}{} \textit{person ride elephant}{>}\) are indeed close in semantics.
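To illustrate the kind of fusion being visualized, the sketch below weights per-stream verb scores with softmax attention derived from the language prior. The attention head, its hidden size, and the assumption that each stream has already been reduced to a per-verb score vector are illustrative choices, not the exact PAMF design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalFusionSketch(nn.Module):
    """Minimal sketch of language-guided fusion over feature types.

    Assumptions: four streams (human appearance, object appearance,
    spatial, pose), each producing a (B, num_verbs) score; d_w is the
    language-prior dimension.
    """

    def __init__(self, d_w=600, num_streams=4):
        super().__init__()
        self.att_head = nn.Sequential(
            nn.Linear(d_w, 128), nn.ReLU(inplace=True),
            nn.Linear(128, num_streams))

    def forward(self, stream_scores, lang_prior):
        # stream_scores: (B, num_streams, num_verbs) per-stream verb scores
        # lang_prior:    (B, d_w) language prior of the candidate HOI
        att = F.softmax(self.att_head(lang_prior), dim=-1)  # (B, num_streams)
        # Weight each stream by its language-derived importance and sum.
        fused = (att.unsqueeze(-1) * stream_scores).sum(dim=1)
        return fused, att
```

In this sketch, the returned `att` vector corresponds to the per-stream scores visualized in Fig. 6: streams deemed uninformative for a given HOI category (e.g. an occluded human appearance) receive low weights.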
Figures 7, 8, and 9 provide more examples that demonstrate PD-Net’s advantages in deciphering the verb polysemy problem on HICO-DET, V-COCO, and HOI-VP, respectively. The performance gain of PD-Net over our baseline reaches 10.6%, 3.33%, and 48.1% AP for the “open microwave”, “carry backpack”, and “play drum” categories on the three datasets, respectively.
6 Conclusion
The verb polysemy problem is relatively underexplored and sometimes even ignored in existing works on HOI detection. Accordingly, in this paper, we propose a novel model named PD-Net, which significantly mitigates the challenging verb polysemy problem. PD-Net includes four novel components: Language Prior-guided Channel Attention, Language Prior-based Feature Augmentation, Polysemy-Aware Modal Fusion, and Clustering-based Object-Specific classifiers. Language Prior-guided Channel Attention and Language Prior-based Feature Augmentation are introduced to generate polysemy-aware visual features. Polysemy-Aware Modal Fusion highlights important feature types for each HOI category. The Clustering-based Object-Specific classifiers not only relieve the verb polysemy problem, but are also capable of handling zero- and few-shot learning problems. Extensive ablation studies are performed to demonstrate the effectiveness of these components. We further develop and present a new dataset, named HOI-VP, that is specifically designed to expedite research on the verb polysemy problem for HOI detection. Finally, by decoding the verb polysemy, we achieve state-of-the-art performance on the three HOI detection benchmarks. In the future, we will study the verb polysemy problem in tasks related to HOI detection, e.g. visual relationship detection and action recognition.
References
Bansal, A., Rambhatla, S., Shrivastava, A., & Chellappa, R. (2020). Detecting human–object interactions via functional generalization. In AAAI (pp. 10460–10469).
Chao, Y., Liu, Y., Liu, X., Zeng, H., & Deng, J. (2018). Learning to detect human–object interactions. In WACV (pp. 381–389).
Chao, Y., Wang, Z., He, Y., Wang, J., & Deng, J. (2015). Hico: A benchmark for recognizing human–object interactions in images. In ICCV (pp. 1017–1025).
Chen, X., & Gupta, A. (2017). An implementation of faster rcnn with study for region sampling. arXiv:1702.02138.
Chen, L., Zhang, H., Xiao, J., Nie, L., Shao, J., Liu, W., & Chua, T. (2017). Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In CVPR (pp. 5659–5667).
Damen, D., Doughty, H., Maria Farinella, G., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., & Price, W. (2018). Scaling egocentric vision: The epic-kitchens dataset. In ECCV (pp. 720–736).
Ding, C., Wang, K., Wang, P., & Tao, D. (2020). Multi-task learning with coarse priors for robust part-aware person re-identification. TPAMI.
Fang, H., Xie, S., Tai, Y., & Lu, C. (2017). Rmpe: Regional multi-person pose estimation. In ICCV (pp. 382–391).
Gao, P., Jiang, Z., You, H., Lu, P., Hoi, S., Wang, X., & Li, H. (2019). Dynamic fusion with intra-and inter-modality attention flow for visual question answering. In CVPR (pp. 6639–6648).
Gao, H., Zou, Y., & Huang, J. (2018). iCAN: Instance-Centric Attention Network for Human–Object Interaction Detection. In BMVC (p. 41).
Girdhar, R., & Ramanan, D. (2017). Attentional pooling for action recognition. In NeurIPS (pp. 34–45).
Girshick, R., Radosavovic, I., Gkioxari, G., Dollár, P., & He, K. (2018). Detectron. https://github.com/facebookresearch/detectron.
Gkioxari, G., Girshick, R., Dollár, P., & He, K. (2018). Detecting and recognizing human–object interactions. In CVPR (pp. 8359–8367).
Gu, J., Zhao, H., Lin, Z., Li, S., Cai, J., & Ling, M. (2019). Scene graph generation with external knowledge and image reconstruction. In CVPR (pp. 1969–1978).
Gupta, S., & Malik, J. (2015). Visual semantic role labeling. arXiv:1505.04474.
Gupta, T., Schwing, A., & Hoiem, D. (2019). No-frills human–object interaction detection: Factorization, layout encodings, and training techniques. In ICCV (pp. 9677–9685).
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. In ICCV (pp. 2961–2969).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR (pp. 770–778).
Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In CVPR (pp. 7132–7141).
Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., & Guadarrama, S., et al. (2017). Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR (pp. 7310–7311).
Huang, E., Socher, R., Manning, C., & Ng, A. (2012). Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th annual meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 873–882).
Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). Spatial transformer networks. In NeurIPS (pp. 2017–2025).
Ji, J., Krishna, R., Fei-Fei, L., & Niebles, J.C. (2020). Action genome: Actions as compositions of spatio-temporal scene graphs. In CVPR (pp. 10236–10247).
Kato, K., Li, Y., & Gupta, A. (2018). Compositional learning for human object interaction. In ECCV (pp. 234–251).
Kingma, D.P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., et al. (2017). Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1), 32–73.
Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., & Ferrari, V. (2020). The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 128(7), 1956–1981.
Li, B., Liang, J., & Wang, Y. (2019a). Compression artifact removal with stacked multi-context channel-wise attention network. In ICIP (pp. 3601–3605).
Li, Y., Liu, X., Lu, H., Wang, S., Liu, J., Li, J., & Lu, C. (2020a). Detailed 2d-3d joint representation for human–object interaction. In CVPR (pp. 10166–10175).
Li, Y., Xu, L., Liu, X., Huang, X., Xu, Y., Wang, S., Fang, HS., Ma, Z., Chen, M., & Lu, C. (2020b). Pastanet: Toward human activity knowledge engine. In CVPR (pp. 382–391).
Li, Y., Zhou, S., Huang, X., Xu, L., Ma, Z., Fang, HS., Wang, Y., & Lu, C. (2019b). Transferable interactiveness knowledge for human–object interaction detection. In CVPR (pp. 3585–3594).
Li, W., Zhu, X., & Gong, S. (2018). Harmonious attention network for person re-identification. In CVPR (pp. 2285–2294).
Liao, Y., Liu, S., Wang, F., Chen, Y., Qian, C., & Feng, J. (2020). Ppdm: Parallel point detection and matching for real-time human–object interaction detection. In CVPR (pp. 482–490).
Lin, X., Ding, C., Zeng, J., & Tao, D. (2020). Gps-net: Graph property sensing network for scene graph generation. In CVPR (pp. 3746–3753).
Lin, T., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In CVPR (pp. 2117–2125).
Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.L. (2014). Microsoft coco: Common objects in context. In ECCV (pp. 740–755).
Liu, N., Tan, Q., Li, Y., Yang, H., Zhou, J., & Hu, X. (2019). Is a single vector enough? exploring node polysemy for network embedding. In ACM SIGKDD (pp. 932–940).
Lu, J., Batra, D., Parikh, D., & Lee, S. (2019). Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS (pp. 13–23).
Lu, C., Krishna, R., Bernstein, M., & Fei-Fei, L. (2016). Visual relationship detection with language priors. In ECCV (pp. 852–869).
Ma, R., Jin, L., Liu, Q., Chen, L., & Yu, K. (2020). Addressing the polysemy problem in language modeling with attentional multi-sense embeddings. In ICASSP (pp. 8129–8133).
MacQueen, J., et al. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (pp. 281–297).
Marino, K., Rastegari, M., Farhadi, A., & Mottaghi, R. (2019). Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR (pp. 3195–3204).
Massa, F., & Girshick, R. (2018). maskrcnn-benchmark: Fast, modular reference implementation of instance segmentation and object detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark.
Meng, L., Zhao, B., Chang, B., Huang, G., Sun, W., Tung, F., & Sigal, L. (2019). Interpretable spatio-temporal attention for video action recognition. In ICCV workshops (pp. 1513–1522).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In NeurIPS (pp. 3111–3119).
Newell, A., Yang, K., & Deng, J. (2016). Stacked hourglass networks for human pose estimation. In ECCV (pp. 483–499).
Oomoto, K., Oikawa, H., Yamamoto, E., Yoshida, M., Okabe, M., & Umemura, K. (2017). Polysemy detection in distributed representation of word sense. In KST (pp. 28–33).
Pereira, S., Pinto, A., Amorim, J., Ribeiro, A., Alves, V., & Silva, C. A. (2019). Adaptive feature recombination and recalibration for semantic segmentation with fully convolutional networks. IEEE Transactions on Medical Imaging, 38(12), 2914–2925.
Perez, E., Strub, F., de Vries, H., Dumoulin, V., & Courville, A. (2018). FiLM: Visual reasoning with a general conditioning layer. In AAAI.
Peyre, J., Laptev, I., Schmid, C., & Sivic, J. (2019). Detecting unseen visual relations using analogies. In ICCV (pp. 1981–1990).
Qi, S., Wang, W., Jia, B., Shen, J., & Zhu, S.C. (2018). Learning human–object interactions by graph parsing neural networks. In ECCV (pp. 401–417).
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS (pp. 91–99).
Shen, L., Yeung, S., Hoffman, J., Mori, G., & Li, F. (2018). Scaling human–object interaction recognition through zero-shot learning. In WACV (pp. 1568–1576).
Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In NeurIPS (pp. 568–576).
Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3d convolutional networks. In ICCV (pp. 4489–4497).
Ulutan, O., Iftekhar, A., & Manjunath, B. (2020). Vsgnet: Spatial attention network for detecting human object interactions using graph convolutions. In CVPR (pp. 13617–13626).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., & Gomez, A., (2017). Attention is all you need. In NeurIPS (pp. 5998–6008).
Wan, B., Zhou, D., Liu, Y., Li, R., & He, X. (2019). Pose-aware multi-level feature network for human object interaction detection. In ICCV (pp. 9469–9478).
Wang, T., Anwer, RM., Khan, MH., Khan, FS., Pang, Y., Shao, L., & Laaksonen, J. (2019a). Deep contextual attention for human–object interaction detection. In ICCV (pp. 5694–5702).
Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., & Tang, X. (2017). Residual attention network for image classification. In CVPR (pp. 3156–3164).
Wang, W., Wang, R., Shan, S., & Chen, X. (2019b). Exploring context and visual pattern of relationship for scene graph generation. In CVPR (pp. 8188–8197).
Wang, T., Yang, T., Danelljan, M., Khan, FS., Zhang, X., & Sun, J. (2020). Learning human–object interaction detection using interaction points. In CVPR (pp. 4116–4125).
Wang, N., Zhang, Y., & Zhang, L. (2021). Dynamic selection network for image inpainting. TIP, 30, 1784–1798.
Xu, K., Ba, J, Kiros, R, Cho, K, Courville, A., Salakhudinov, R., Zemel, R., & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In ICML (pp. 2048–2057).
Xu, B., Wong, Y., Li, J., Zhao, Q., & Kankanhalli, M.S. (2019). Learning to detect human–object interactions with knowledge. In CVPR (pp. 2019–2028).
Xu, B., Li, J., Wong, Y., Zhao, Q., & Kankanhalli, M. S. (2020). Interact as you intend: Intention-driven human-object interaction detection. TMM, 22(6), 1423–1432.
Yao, T., Pan, Y., Li, Y., & Mei, T. (2019). Hierarchy parsing for image captioning. arXiv:1909.03918.
Ye, Q., Yuan, S., & Kim, T. (2016). Spatial attention deep net with partial pso for hierarchical hybrid hand pose estimation. In ECCV (pp. 346–361).
You, Q., Jin, H., Wang, Z., Fang, C., & Luo, J. (2016). Image captioning with semantic attention. In CVPR (pp. 4651–4659).
Zellers, R., Yatskar, M., Thomson, S., & Choi, Y. (2018). Neural motifs: Scene graph parsing with global context. In CVPR (pp. 5831–5840).
Zhang, H., Kyaw, Z., Chang, SF., & Chua, T.S. (2017). Visual translation embedding network for visual relation detection. In CVPR (pp. 5532–5540).
Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., & Lin, D. (2020). Temporal action detection with structured segment networks. IJCV, 128(1), 74–95.
Zheng, B., Zhao, Y., Yu, J., Ikeuchi, K., & Zhu, S. C. (2015). Scene understanding by reasoning stability and safety. IJCV, 112(2), 221–238.
Zhong, X., Ding, C., Qu, X., & Tao, D. (2020). Polysemy deciphering network for human–object interaction detection. In ECCV (pp. 69–85).
Zhong, X., Qu, X., Ding, C., & Tao, D. (2021). Glance and Gaze: Inferring action-aware points for one-stage human–object interaction detection. In CVPR.
Zhou, P., & Chi, M. (2019). Relation parsing neural network for human–object interaction detection. In ICCV (pp. 843–851).
Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J., & Gao, J. (2019). Unified vision-language pre-training for image captioning and VQA. arXiv:1909.11059.
Zhou, X., Wang, D., & Krähenbühl, P. (2019b). Objects as points. arXiv:1904.07850.
Zhou, T., Wang, W., Qi, S., Ling, H., & Shen, J. (2020). Cascaded human–object interaction recognition. In CVPR (pp. 4263–4272).
Zhuang, B., Wu, Q., Shen, C., Reid, I., & van den Hengel, A. (2017). Care about you: Towards large-scale human-centric visual relationship detection. arXiv:1705.09892.
Zhu, Y., Zhao, C., Guo, H., Wang, J., Zhao, X., & Lu, H. (2018). Attention couplenet: Fully convolutional attention coupling network for object detection. TIP, 28(1), 113–126.
Zoph, B., Vasudevan, V., Shlens, J., & Le, Q.V. (2018). Learning transferable architectures for scalable image recognition. In CVPR (pp. 8697–8710).
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant 62076101, 61702193, and U1801262, the Program for Guangdong Introducing Innovative and Entrepreneurial Teams under Grant 2017ZT07X183, the Natural Science Fund of Guangdong Province under Grant 2018A030313869, the Science and Technology Program of Guangzhou under Grant 201804010272, the Guangzhou Key Laboratory of Body Data Science under Grant 201605030011, and the Fundamental Research Funds for the Central Universities of China under Grant 2019JQ01.
Communicated by Dima Damen.