HOME: 3D Human–Object Mesh Topology-Enhanced Interaction Recognition in Images
Abstract
1. Introduction
- We provide the first perspective that human–object interaction can be derived from HOM geometric topology, and we propose a novel interaction detection method that exploits bottom-up topological cues.
- We propose the HOME framework, which fuses visual cues extracted by a CNN with topological cues extracted by MeshCNN, and it approaches the state-of-the-art level.
2. Related Work
2.1. HOI Detection
2.2. Graph Models
2.3. 3D Perception
3. HOME
3.1. HOM Modeling
- (1) Calculate the spatial distances from the object center to the centers of all human body mesh triangles.
- (2) Find the closest human body mesh triangle according to this distance set, together with its three corresponding vertices.
- (3) Calculate the spatial distances from the selected human body triangle to the object mesh triangles, and find the closest object triangle and its corresponding key points by the same principle.
- (4) Eliminate the two selected triangles, and then merge the human and object meshes into the final HOM model.
3.2. Framework of HOME
3.2.1. Visual Cues
- Appearance Feature Extraction. In this section, we describe how to extract appearance features from the human and object instances, respectively. After the detection step, we first extract the global feature of the entire image, and then we use ROI Pooling to extract the instance features of the human and the object from the global feature map (a minimal sketch of both feature-extraction steps follows this list).
- 2D Spatial Feature Extraction. We use a double-channel binary map to represent the 2D spatial relation between a human and an object. In the human channel, the value is set to 1 at locations inside the human bounding box and to 0 elsewhere. Likewise, in the object channel, the value is set to 1 inside the object region and to 0 elsewhere. The double-channel map is fed into a convolutional neural network to extract the 2D spatial feature.
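The following PyTorch sketch illustrates both visual cues under simplifying assumptions: torchvision's `roi_align` stands in for the ROI Pooling step; the 7 × 7 output size, the feature-map stride of 16, and the 64 × 64 map resolution are placeholders; and the boxes passed to `spatial_binary_map` are assumed to be already rescaled to that resolution.

```python
import torch
from torchvision.ops import roi_align

def instance_appearance_features(feat_map, human_box, object_box, scale=1.0 / 16):
    """ROI-pool human/object instance features from the global feature map.
    feat_map: (1, C, H, W); boxes are (x1, y1, x2, y2) in image coordinates."""
    boxes = torch.tensor([[0, *human_box], [0, *object_box]], dtype=torch.float32)
    pooled = roi_align(feat_map, boxes, output_size=(7, 7), spatial_scale=scale)
    f_h, f_o = pooled[0], pooled[1]            # human / object appearance features
    return f_h, f_o

def spatial_binary_map(human_box, object_box, size=64):
    """Double-channel binary map encoding the 2D human/object spatial relation;
    boxes are assumed to be rescaled to the size x size map resolution."""
    sp = torch.zeros(2, size, size)
    for ch, (x1, y1, x2, y2) in enumerate([human_box, object_box]):
        sp[ch, int(y1):int(y2), int(x1):int(x2)] = 1.0
    return sp                                   # fed to a small CNN afterwards
```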
3.2.2. Topological Cues
- Bottom-up feature extraction. As the interaction semantics are related to body shape, pose, object orientation, and size, the topological cues lie on a low-dimensional manifold of the HOM space. Therefore, the information extraction is carried out in a bottom-up manner. For each image I, HOM models are built for all detected person–object pairs. To feed a HOM into MeshCNN, it is translated into an edge-based feature of size E × 5, where E is the number of edges of the mesh and 5 is the number of features associated with each central edge. These features cover the dihedral angle between the two adjacent triangle faces, the respective opposite-vertex angles of the two adjacent triangles, and the ratios of edge length to height in the two adjacent triangles.
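Below is a minimal NumPy sketch of this five-dimensional edge descriptor, computed for a single edge given its two adjacent triangles; the ordering of the features and the helper names are illustrative assumptions (MeshCNN additionally sorts paired features for orientation invariance, which is omitted here).

```python
import numpy as np

def angle(u, v):
    """Angle between two vectors in radians."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def edge_descriptor(a, b, v1, v2):
    """5-dimensional MeshCNN-style descriptor for the edge (a, b), whose two
    adjacent triangles are (a, b, v1) and (a, b, v2): the dihedral angle, the
    two opposite-vertex angles, and the two edge-length/height ratios."""
    edge_len = np.linalg.norm(b - a)
    feats = []
    n1 = np.cross(b - a, v1 - a)                 # normals of the two faces
    n2 = np.cross(v2 - a, b - a)
    feats.append(angle(n1, n2))                  # dihedral angle between the faces
    for v in (v1, v2):
        feats.append(angle(a - v, b - v))        # angle at the opposite vertex
    for v in (v1, v2):
        area2 = np.linalg.norm(np.cross(b - a, v - a))   # 2 * triangle area
        height = area2 / (edge_len + 1e-9)               # triangle height over the edge
        feats.append(edge_len / (height + 1e-9))         # edge-length / height ratio
    return np.array(feats)                       # shape (5,); stacking over all edges gives (E, 5)
```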
3.3. Topology-Enhanced Fusion
3.4. Training and Inference
- Loss for Training. Since HOI detection is a multi-label classification task, we choose the binary cross-entropy loss function during the training stage. The loss terms corresponding to the human, object, and spatial-map branches are L_h, L_o, and L_sp, respectively, and the total loss for training detection is their sum, L = L_h + L_o + L_sp (a minimal loss-and-inference sketch follows this list).
- Inference. For a given input image, the final interaction classification is computed based on the scores s_h, s_o, and s_sp produced by the different branches; the score used for inference is obtained by combining these three branch scores.
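The sketch below gives one plausible PyTorch formulation under stated assumptions: the total loss is taken as the unweighted sum of the three binary cross-entropy terms, and the inference score is taken as the average of the per-branch sigmoid scores; the paper's exact weighting and combination rule may differ.

```python
import torch
import torch.nn.functional as F

def hoi_training_loss(logits_h, logits_o, logits_sp, labels):
    """Multi-label BCE over the HOI classes for each branch; labels is a
    multi-hot float tensor of the same shape as the logits. The total loss
    is taken here as the plain sum of the three terms."""
    l_h = F.binary_cross_entropy_with_logits(logits_h, labels)
    l_o = F.binary_cross_entropy_with_logits(logits_o, labels)
    l_sp = F.binary_cross_entropy_with_logits(logits_sp, labels)
    return l_h + l_o + l_sp

def hoi_inference_score(logits_h, logits_o, logits_sp):
    """One plausible inference-time combination: average the per-branch
    sigmoid scores to obtain the final interaction score."""
    s_h, s_o, s_sp = map(torch.sigmoid, (logits_h, logits_o, logits_sp))
    return (s_h + s_o + s_sp) / 3.0
```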
4. Experimental Results
4.1. Implementation Details
4.2. Dataset and Metric
- Dataset. We adopt the widely used HOI benchmark HICO-DET [19] to validate the effectiveness of HOME. HICO-DET is an instance-level benchmark consisting of 47,776 images (38,118 for training and 9658 for testing) and 600 HOI categories. It contains 80 object categories from COCO [40], 117 verbs, and more than 150 thousand annotated HOI pairs. Following previous work [11,16,22], we use 80% of the data for training and the other 20% for validation. We reconstruct the meshes of humans and objects and build a HOM model for each human–object pair. In particular, the MeshCNN is pre-trained on the HOM training set before end-to-end training for fast convergence.
- Evaluation Metrics. To evaluate the performance of the methods, we adopt the commonly used mAP (mean average precision) as in previous works [11,12,19,21]. A prediction is counted as valid when it satisfies: (i) the predicted bounding boxes locate the person and the object with an IoU of at least 0.5 with respect to the ground truth, and (ii) the interaction is classified correctly.
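A minimal sketch of this matching criterion is shown below; the dictionary keys and helper names are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def is_true_positive(pred, gt, verb_correct):
    """A detection counts only if both the human and the object box match the
    ground truth with IoU >= 0.5 and the interaction class is correct."""
    return (iou(pred["human"], gt["human"]) >= 0.5 and
            iou(pred["object"], gt["object"]) >= 0.5 and
            verb_correct)
```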
4.3. Quantitative Evaluation
- Performance. On this dataset, we select several classical algorithms for quantitative comparison. Table 1 shows the results of our method compared with other methods [10,11,13,14,19,22,41]. To make a fair comparison, we adopt the same object detection results as DRG [22], ICAN [11], and PMFNet [14]. Unlike other approaches, GPNN [10] utilizes additional knowledge graphs to detect human–object pair interactions. We adopt the same backbone network, ResNet50-FPN, for visual feature extraction as in PMFNet [14] and DRG [22]. Among them, PMFNet uses human pose information to amplify local areas of the human body and obtain fine-grained information. DRG [22] makes use of the heterogeneity of nodes to construct two sparse subgraphs centered on people and objects, respectively. The 2D-baseline method is a pruned DRG [22] that excludes the two sparse subgraphs. HOME* is an extended version that fuses the HOM topological feature into the DRG method following the HOME framework. We can see that HOME improves over the 2D-baseline by 0.37, 0.71, and 0.34 mAP on the Full, Rare, and Non-Rare settings, and that HOME* improves over DRG by 0.26, 0.32, and 0.21. Both HOME and HOME* achieve state-of-the-art performance, validating the significance of the HOM topological cue for interaction recognition.
4.4. Ablation Studies
- Multiple branches. To validate the effectiveness of multi-branch fusion, we ablate the spatial features, the topological features, and both of them for comparison. First, we remove the 2D spatial feature branch during inference. Second, we remove the 3D human–object topological cue branch. Third, we remove both the spatial feature branch and the topological cue branch for testing. The results are reported in Table 2.
- Fusion mode with HOM. We test two fusion methods: early fusion and late fusion of the 2D human visual features and the 3D human–object topological features (see the sketch after this list). Early fusion first fuses the two feature vectors and then feeds them into the interaction classifier. Late fusion combines the detection results after obtaining the scores from the different branches. The evaluation results are reported in Table 3. We can see that both fusion modes achieve improvement over the 2D-baseline, and that early fusion performs the best. Late fusion yields a smaller improvement than early fusion because the human–object topologies of some different interactive actions may be similar, which results in ambiguity. To some extent, the early fusion method used in HOME complements appearance features with human–object topological features in a coarse-to-fine manner, showing better generalization.
- Failure Case. Although HOME assists rough visual features and improves the judgment of interactive behavior by utilizing 3D human–object topology information, there are still specific scenes that cannot be handled, as shown in Figure 9. Firstly, in a scene with multiple humans and objects, if two people stand next to each other, the category may be confused since they have a similar spatial relation to the object, as shown in Figure 9a,b. Secondly, due to the lack of temporal information, it is difficult to determine whether both persons are washing the same car, resulting in missed detections, as shown in Figure 9c. In addition, the object detector fails to detect objects that are occluded by the body, e.g., the phone cannot be detected in Figure 9d.
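The sketch below contrasts the two fusion modes discussed above; the layer dimensions, the 512-unit hidden layer, and the equal-weight score mixing are illustrative assumptions rather than the configuration used in the paper.

```python
import torch
import torch.nn as nn

class EarlyFusionHead(nn.Module):
    """Early fusion: concatenate the 2D visual feature and the HOM topological
    feature, then classify interactions from the joint representation."""
    def __init__(self, dim_vis, dim_topo, num_classes):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim_vis + dim_topo, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, f_vis, f_topo):
        return self.fc(torch.cat([f_vis, f_topo], dim=-1))

def late_fusion(score_vis, score_topo, alpha=0.5):
    """Late fusion: each branch predicts on its own; the scores are mixed
    afterwards with an assumed weighting alpha."""
    return alpha * score_vis + (1.0 - alpha) * score_topo
```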
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| DRG | Dual Relation Graph |
| GPNN | Graph Parsing Neural Networks |
| HICO-DET | Humans Interacting with Common Objects, detection benchmark |
| HOI | Human–Object Interaction |
| HOM | Human–Object Mesh |
| HOME | Human–Object Mesh Topology-Enhanced interaction recognition method |
| HORCNN | Human–Object Region-based CNN |
| ICAN | Instance-Centric Attention Network |
| InteractNet | The method of Detecting and Recognizing Human–Object Interactions [13] |
| IoU | Intersection over Union |
| mAP | mean Average Precision |
| No-Frills | The no-frills approach for HOI detection [41] |
| PMFNet | Pose-aware Multi-level Feature Network |
References
1. Yu, Y.; Ko, H.; Choi, J.; Kim, G. End-to-end concept word detection for video captioning, retrieval, and question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3165–3173.
2. Yu, Y.; Kim, J.; Kim, G. A joint sequence fusion model for video question answering and retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 471–487.
3. Dzabraev, M.; Kalashnikov, M.; Komkov, S.; Petiushko, A. Mdmmt: Multidomain multimodal transformer for video retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3354–3363.
4. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308.
5. Liu, K.; Liu, W.; Gan, C.; Tan, M.; Ma, H. T-C3D: Temporal convolutional 3D network for real-time action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
6. Yu, S.; Cheng, Y.; Xie, L.; Luo, Z.; Huang, M.; Li, S. A novel recurrent hybrid network for feature fusion in action recognition. J. Vis. Commun. Image Represent. 2017, 49, 192–203.
7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems 28, Montreal, QC, Canada, 7–12 December 2015.
8. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
9. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
10. Qi, S.; Wang, W.; Jia, B.; Shen, J.; Zhu, S.C. Learning human-object interactions by graph parsing neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 401–417.
11. Gao, C.; Zou, Y.; Huang, J.B. ican: Instance-centric attention network for human-object interaction detection. arXiv 2018, arXiv:1808.10437.
12. Li, Y.L.; Zhou, S.; Huang, X.; Xu, L.; Ma, Z.; Fang, H.S.; Wang, Y.; Lu, C. Transferable interactiveness knowledge for human-object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3585–3594.
13. Gkioxari, G.; Girshick, R.; Dollár, P.; He, K. Detecting and recognizing human-object interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8359–8367.
14. Wan, B.; Zhou, D.; Liu, Y.; Li, R.; He, X. Pose-aware multi-level feature network for human object interaction detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 9469–9478.
15. Li, Y.L.; Xu, L.; Liu, X.; Huang, X.; Xu, Y.; Wang, S.; Fang, H.S.; Ma, Z.; Chen, M.; Lu, C. Pastanet: Toward human activity knowledge engine. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 382–391.
16. Ulutan, O.; Iftekhar, A.; Manjunath, B.S. Vsgnet: Spatial attention network for detecting human object interactions using graph convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13617–13626.
17. Hanocka, R.; Hertz, A.; Fish, N.; Giryes, R.; Fleishman, S.; Cohen-Or, D. Meshcnn: A network with an edge. ACM Trans. Graph. (TOG) 2019, 38, 1–12.
18. Gupta, S.; Malik, J. Visual semantic role labeling. arXiv 2015, arXiv:1505.04474.
19. Chao, Y.W.; Liu, Y.; Liu, X.; Zeng, H.; Deng, J. Learning to detect human-object interactions. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 381–389.
20. Girdhar, R.; Ramanan, D. Attentional pooling for action recognition. In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 4–9 December 2017.
21. Fang, H.S.; Cao, J.; Tai, Y.W.; Lu, C. Pairwise body-part attention for recognizing human-object interactions. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 51–67.
22. Gao, C.; Xu, J.; Zou, Y.; Huang, J.B. Drg: Dual relation graph for human-object interaction detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 696–712.
23. Zhong, X.; Ding, C.; Qu, X.; Tao, D. Polysemy deciphering network for robust human–object interaction detection. Int. J. Comput. Vis. 2021, 129, 1910–1929.
24. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
25. Wang, H.; Zheng, W.S.; Yingbiao, L. Contextual heterogeneous graph network for human-object interaction detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 248–264.
26. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 4–9 December 2017.
27. Zhang, M.; Wang, Y.; Kadam, P.; Liu, S.; Kuo, C.C.J. Pointhop++: A lightweight learning model on point sets for 3d classification. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 3319–3323.
28. Zhang, M.; You, H.; Kadam, P.; Liu, S.; Kuo, C.C.J. Pointhop: An explainable machine learning method for point cloud classification. IEEE Trans. Multimed. 2020, 22, 1744–1755.
29. Jiang, M.; Wu, Y.; Zhao, T.; Zhao, Z.; Lu, C. Pointsift: A sift-like network module for 3d point cloud semantic segmentation. arXiv 2018, arXiv:1807.00652.
30. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660.
31. Qi, C.R.; Litany, O.; He, K.; Guibas, L.J. Deep hough voting for 3d object detection in point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 9277–9286.
32. Shi, S.; Wang, X.; Li, H. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 770–779.
33. Bogo, F.; Romero, J.; Loper, M.; Black, M.J. FAUST: Dataset and evaluation for 3D mesh registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3794–3801.
34. Bucki, M.; Lobos, C.; Payan, Y. A fast and robust patient specific finite element mesh registration technique: Application to 60 clinical cases. Med. Image Anal. 2010, 14, 303–317.
35. Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299.
36. Pavlakos, G.; Choutas, V.; Ghorbani, N.; Bolkart, T.; Osman, A.A.; Tzionas, D.; Black, M.J. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10975–10985.
37. Li, Y.L.; Liu, X.; Lu, H.; Wang, S.; Liu, J.; Li, J.; Lu, C. Detailed 2d-3d joint representation for human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10166–10175.
38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
39. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
40. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
41. Gupta, T.; Schwing, A.; Hoiem, D. No-frills human-object interaction detection: Factorization, layout encodings, and training techniques. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 9677–9685.
Table 1. Comparison with state-of-the-art methods on HICO-DET (mAP, %).

| Method | Visual Feature Backbone | Full | Rare | Non-Rare |
|---|---|---|---|---|
| HORCNN [19] | CaffeNet | 7.81 | 5.37 | 8.54 |
| InteractNet [13] | ResNet50-FPN | 9.94 | 7.16 | 10.77 |
| GPNN [10] | ResNet101 | 13.11 | 9.34 | 14.23 |
| ICAN [11] | ResNet50 | 14.84 | 10.45 | 16.15 |
| No-Frills [41] | ResNet152 | 17.18 | 12.17 | 18.68 |
| PMFNet [14] | ResNet50-FPN | 17.46 | 15.65 | 18.00 |
| DRG [22] | ResNet50-FPN | 19.26 | 17.74 | 19.71 |
| 2D-baseline | ResNet50-FPN | 18.78 | 16.52 | 19.32 |
| HOME | ResNet50-FPN | 19.15 | 17.23 | 19.66 |
| HOME* | ResNet50-FPN | 19.52 | 18.06 | 19.92 |
Table 2. Ablation study of the spatial and topological branches on HICO-DET (mAP, %).

| Method | Full | Rare | Non-Rare |
|---|---|---|---|
| HOME | 19.15 | 17.23 | 19.66 |
| w/o spatial & topology | 15.86 | 13.06 | 16.70 |
| w/o spatial | 16.19 | 14.24 | 16.87 |
| w/o topology | 18.78 | 16.52 | 19.32 |
Table 3. Comparison of fusion modes for the HOM topological feature on HICO-DET (mAP, %).

| Method | Full | Rare | Non-Rare |
|---|---|---|---|
| 2D-baseline | 18.78 | 16.52 | 19.32 |
| HOME-late fusion | 18.96 | 17.03 | 19.57 |
| HOME-early fusion | 19.15 | 17.23 | 19.66 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).