Article
Peer-Review Record

Part-Wise Adaptive Topology Graph Convolutional Network for Skeleton-Based Action Recognition

Electronics 2023, 12(9), 1992; https://doi.org/10.3390/electronics12091992
by Jiale Wang 1, Lian Zou 1,*, Cien Fan 1 and Ruan Chi 2
Reviewer 1:
Reviewer 2:
Reviewer 3:
Reviewer 4: Anonymous
Submission received: 16 March 2023 / Revised: 13 April 2023 / Accepted: 18 April 2023 / Published: 25 April 2023
(This article belongs to the Section Artificial Intelligence)

Round 1

Reviewer 1 Report

The paper at hand proposes a skeleton-based action recognition approach based on graph convolutional networks (GCNs). The paper's novelty lies in the introduction of a Part-wise Adaptive Topology Graph Convolutional block that enables the segmentation of different body parts in a dynamic way.

 

The manuscript is well-organized and well-written.

 

The literature review is comprehensive and organized.

 

The method is clearly presented and Figure 3 highly aids the comprehension of the methodology.

 

Experiments are conducted on three datasets and include an ablation study, parameter configuration experiments, and extensive comparative results that demonstrate the superiority of the introduced model.

 

Comments

1) In Section 2.3: The authors are recommended to discuss two techniques that are related to the philosophy behind the implementation of part-wise topology and the segmentation of body parts in classical contemporary action recognition tasks, that is (i) the attention-based and (ii) joint-aware action recognition. Please, discuss the references below.

Li, Jun, et al. "Spatio-temporal attention networks for action recognition and detection." IEEE Transactions on Multimedia 22.11 (2020): 2990-3001.

Santavas, Nicholas, et al. "Attention! a lightweight 2d hand pose estimation approach." IEEE Sensors Journal 21.10 (2020): 11488-11496.

Shah, Anshul, et al. "Pose and joint-aware action recognition." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2022.

Oikonomou, Katerina Maria, et al. "Joint-Aware Action Recognition for Ambient Assisted Living." 2022 IEEE International Conference on Imaging Systems and Techniques (IST). IEEE, 2022.

 

2) In Section 3.2, given that the adaptive topology block constitutes one of the main contributions of the manuscript, please provide a separate clear depiction of the block's structure and link with the description within the sub-section to enhance its comprehension.

 

3) In Section 4, although a division of the datasets into training and testing sets is provided, it is not clear whether the authors followed any cross-validation and/or subject-invariant validation strategy, i.e., specific subjects are entirely left out of the training set and used for testing (an example of this strategy can be found here: Singh, Suriya, et al. "First person action recognition using deep learned descriptors." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016). Please describe the exact validation protocol adopted and refer to the works that followed exactly the same strategy for each dataset or, in case no similar work exists, describe the reason for applying a different strategy.

 

4) What is the computational complexity and the required inference time for the proposed part-wise adaptive topology block? Would it be suitable for real-time applications? Please, discuss the above property of the introduced method.

 

5) The conclusion section is very short. Please, further discuss the findings and benefits of the proposed method as well as ideas for future work.

 

Overall, the work is innovative and the manuscript well-written. Hence, given the above enhancements, I am happy to recommend the publication of the paper.


Author Response

Thank you for your detailed evaluation of the manuscript. We appreciate your positive feedback and constructive comments. Here are our responses to your specific comments and suggestions:

 

  • We have added a discussion of joint-aware and attention-based action recognition techniques to Section 2.3, along with the references you suggested. We use a multi-scale design to obtain more informative joints and connections and a part-wise approach to determine the importance of each part. This is similar to the motivation behind them. We believe that this addition will strengthen the section and provide a better context for our proposed approach.
  • The structure of the part-wise adaptive topology block is shown separately in Section 3.2. We have covered the feature transformation and topology generation aspects of the PAT-GC block's structure. The most recent manuscript contains the added content.
  • In Section 4, we apologize for the lack of clarity regarding the validation protocol. For NTU RGB+D 60 and NTU RGB+D 120, we followed a cross-validation strategy. For each dataset, the two benchmarks use the same video clips but divide them into training and testing sets by subject or by view (setup). For Kinetics 400, we use a subject-invariant validation strategy. We have re-described the dataset in accordance with the original paper. “The dataset is divided into three parts, one for training with 250-1000 videos per class, one for validation with 50 videos per class, and one for testing with 100 videos per class.”
  • In our latest manuscript, we have provided the number of parameters of the model and the inference time in our experimental environment. Real-time application of the model is possible. It should be noted that the model's parameter count depends on the multi-stream ensemble strategy. Some efficient models* combine the data from multiple streams early in the model and use one main stream to extract discriminative features. Because multi-stream aggregation is not the paper's main focus, our approach uses the aggregation method adopted by the majority of approaches**, and the model parameters are provided only as a general guide.

*Song, Y.F.; Zhang, Z.; Shan, C.; Wang, L. Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022, 45, 1474–1488.

** Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 183–192.

Chen, T.; Zhou, D.; Wang, J.; Wang, S.; Guan, Y.; He, X.; Ding, E. Learning multi-granular spatio-temporal graph network for skeleton-based action recognition. In Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 4334–4

Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12026–12035.

  • We appreciate your suggestion to expand the conclusion section. We have discussed the findings and benefits of the proposed method, and potential directions for future research.
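
To make the subject-based protocol discussed in the validation response above more concrete, the following is a minimal sketch of a split in which every subject falls entirely into either the training set or the testing set. The function name `cross_subject_split`, the sample fields, and the toy subject IDs are hypothetical and chosen for illustration only; they are not the official benchmark split lists and not the authors' code.

```python
from typing import Dict, List, Set, Tuple

Sample = Dict[str, object]


def cross_subject_split(
    samples: List[Sample], train_subjects: Set[int]
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples so that no subject appears in both sets.

    A sample goes to the training set if its subject ID is listed in
    train_subjects; every other sample goes to the testing set.
    """
    train: List[Sample] = []
    test: List[Sample] = []
    for sample in samples:
        if sample["subject"] in train_subjects:
            train.append(sample)
        else:
            test.append(sample)
    return train, test


# Hypothetical toy data: four clips performed by three subjects.
samples = [
    {"clip": "a", "subject": 1, "label": 0},
    {"clip": "b", "subject": 2, "label": 1},
    {"clip": "c", "subject": 1, "label": 1},
    {"clip": "d", "subject": 3, "label": 0},
]

# Assumed assignment: subjects 1 and 3 train, subject 2 is held out.
train, test = cross_subject_split(samples, train_subjects={1, 3})
```

A view-based (setup-based) split would follow the same pattern with a camera or setup ID in place of the subject ID.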

 

Thank you again for your evaluation and valuable feedback. We will carefully consider and address all of your comments and suggestions.


Reviewer 2 Report

This paper proposes and analyses the use of the Part-wise Adaptive Topology Graph Convolution (PAT-GC) for activity recognition. The novelty is not very high, but the analyses and the presentation are very interesting. Also, the comparison to previous works is very interesting. I think the paper can be accepted after some improvements. My comments are:

·       Regarding the datasets, I think it would be useful to clarify whether you are using subject-wise cross-validation when dividing the data into training and testing sets. Also, this division must be the same as in previous works.

·       Can you clarify how you fine-tuned the system? Did you use a validation subset?

·       Regarding the results presentation, I’d suggest including confidence intervals to see the significance of the results.
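
As an illustration of the confidence-interval suggestion above, the sketch below computes a Wilson score interval for a top-1 accuracy measured on a fixed test set. It is one possible choice rather than a prescribed method: it assumes that test samples are independent and captures only sampling uncertainty, not variation across training runs. The function name `wilson_interval` and the example counts are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(
    num_correct: int, num_total: int, z: float = 1.96
) -> Tuple[float, float]:
    """Wilson score interval for a proportion such as top-1 accuracy.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    p = num_correct / num_total
    z2 = z * z
    denom = 1.0 + z2 / num_total
    center = (p + z2 / (2.0 * num_total)) / denom
    half_width = (
        z * math.sqrt(p * (1.0 - p) / num_total + z2 / (4.0 * num_total**2))
    ) / denom
    return center - half_width, center + half_width


# Hypothetical example: 90 correct predictions out of 100 test clips.
low, high = wilson_interval(90, 100)
```

With test sets of tens of thousands of clips the interval becomes narrow, so reporting it mainly helps to judge whether small accuracy gaps between methods exceed sampling noise.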

Author Response

Thank you for your valuable comments on our paper. We have addressed your feedback as follows:

  • We apologize for the lack of clarity. For NTU RGB+D 60 and NTU RGB+D 120, we followed a cross-validation strategy. For each dataset, the two benchmarks use the same video clips but divide them into training and testing sets by subject or by view (setup). We have clarified this in the revised manuscript.
  • Our model was trained from scratch on the datasets without fine-tuning on pre-trained models.
  • Regarding the results presentation, to compare with the state-of-the-art methods, we employed the same result presentation form as these methods. We would appreciate it if you could provide additional relevant information.

We appreciate your constructive comments. Please let us know if you have any other questions or require any clarification.

Reviewer 3 Report

The paper is well structured. The experiments must be extended with the following aspects:

- Is it suitable for real-time applications? A discussion of inference time must be added.

- What about the number of weights? A discussion of network memory compared with the other solutions (presented in Table 3) should be added.

- Which activities are recognised with higher or lower accuracy?

Author Response

Thank you for the constructive comments on our paper. We have addressed the feedback as follows:

  • Yes, it is suitable for real-time applications. We have added a discussion of inference time in our experimental environment.
  • We have added the number of parameters of the model in the latest manuscript. Note, however, that the parameter count depends on the multi-stream ensemble strategy. Some efficient models* fuse the multi-stream data at an early stage of the model and use one main stream to extract discriminative features. Since multi-stream aggregation is not the focus of the paper, our method follows the aggregation method used by the majority of approaches**, so the model parameters are provided only as a rough reference.

*Song, Y.F.; Zhang, Z.; Shan, C.; Wang, L. Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022, 45, 1474–1488.

** Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 183–192.

Chen, T.; Zhou, D.; Wang, J.; Wang, S.; Guan, Y.; He, X.; Ding, E. Learning multi-granular spatio-temporal graph network for skeleton-based action recognition. In Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 4334–4

Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12026–12035.

  • In Figure 5, we added some qualitative results, which show several actions with lower accuracy. The difficult actions are usually easily confused with other actions.
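
The remark above that the parameter count depends on the multi-stream ensemble strategy can be illustrated with a small sketch of late score fusion, where each stream is a separately trained model and only the per-class scores are combined at the end. The stream names, the weights, and the score values below are hypothetical; this is an assumption-laden illustration of the general idea, not the exact aggregation used in the paper.

```python
from typing import Dict, List


def fuse_scores(
    stream_scores: Dict[str, List[float]], weights: Dict[str, float]
) -> List[float]:
    """Weighted sum of per-class scores from independently trained streams."""
    num_classes = len(next(iter(stream_scores.values())))
    fused = [0.0] * num_classes
    for name, scores in stream_scores.items():
        if len(scores) != num_classes:
            raise ValueError(f"stream {name!r} has a different class count")
        for class_index, score in enumerate(scores):
            fused[class_index] += weights[name] * score
    return fused


def predict(fused: List[float]) -> int:
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical softmax outputs of three streams for one clip, three classes.
scores = {
    "joint": [0.6, 0.3, 0.1],
    "bone": [0.2, 0.7, 0.1],
    "joint_motion": [0.3, 0.5, 0.2],
}
# Assumed fusion weights, chosen only for illustration.
weights = {"joint": 1.0, "bone": 1.0, "joint_motion": 0.5}

fused = fuse_scores(scores, weights)
label = predict(fused)
```

Because every stream in this scheme is a full model, the total parameter count and inference cost grow roughly with the number of streams, whereas early fusion keeps a single main stream.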

We appreciate your constructive comments. Please let us know if you have any other questions or require any clarification.

Reviewer 4 Report

In this paper the authors introduce an algorithm called Part-wise Adaptive Topology Graph Convolutional Network (PAT-GCN) to perform human action recognition using skeleton data. This method employs hierarchical partitioning and adaptive learning to model complex relationships between body parts. The authors also claim state-of-the-art performance on three public large datasets.

One of the contributions, in lines 107-109 is "We propose a hierarchical approach to partition the skeleton topology into multiple parts at two different scales. This method enables the exploration of movement patterns for various body parts, as well as their interrelationships during motion.". Please, explain the contribution in the context of other papers that also employ fine-coarse distinctions, as the following ones, and discuss them in the related work section:

Dang, Lingwei, Yongwei Nie, Chengjiang Long, Qing Zhang, and Guiqing Li. "Msr-gcn: Multi-scale residual graph convolution networks for human motion prediction." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11467-11476. 2021.

Yan, Zigeng, Di-Hua Zhai, and Yuanqing Xia. "DMS-GCN: dynamic multiscale spatiotemporal graph convolutional networks for human motion prediction." arXiv preprint arXiv:2112.10365 (2021).

Gou, Ruru, Wenzhu Yang, Zifei Luo, Yunfeng Yuan, and Andong Li. "Tohjm-Trained Multiscale Spatial Temporal Graph Convolutional Neural Network for Semi-Supervised Skeletal Action Recognition." Electronics 11, no. 21 (2022): 3498.

Lines 119-120: "Finally, in section 5, we conclude the paper." You explain your conclusions, I guess.

Lines 122-123: "Convolutional Neural Networks (CNNs) have been remarkably successful at processing Euclidean data, such as images." Could you elaborate on images as Euclidean data? Do you mean that the pixels have a spatial relationship among them? In other places, like in line 145, you say that the human skeleton can be seen as a "non-Euclidean graph structure". Do you mean that there is a non-Euclidean distance or that there is not a notion of distance but of relative location?

The paper states that the approach builds on top of other approaches, as the work of Liu et al. [19], or in line 289 is said "Building on previous work". In the first case it is not explicitly said if this work is just an improvement over [19], or the two approaches differ in more aspects; in the second case, it is not clear what is that previous work, as it is not referenced. In general, it is not clear what are the differences and/or novelties with respect to other works, as [18], [19], or other adaptive approaches as the following ones:

Shi, Lei, Yifan Zhang, Jian Cheng, and Hanqing Lu. "Skeleton-based action recognition with multi-stream adaptive graph convolutional networks." IEEE Transactions on Image Processing 29 (2020): 9532-9545.

Alsarhan, Tamam, Usman Ali, and Hongtao Lu. "Enhanced discriminative graph convolutional network with adaptive temporal modelling for skeleton-based action recognition." Computer Vision and Image Understanding 216 (2022): 103348.

Yu, Lubin, Lianfang Tian, Qiliang Du, and Jameel Ahmed Bhutto. "Multi‐stream adaptive spatial‐temporal attention graph convolutional network for skeleton‐based action recognition." IET Computer Vision 16, no. 2 (2022): 143-158.

Hang, Rui, and Minxian Li. "Spatial-Temporal Adaptive Graph Convolutional Network for Skeleton-based Action Recognition." In Proceedings of the Asian Conference on Computer Vision, pp. 1265-1281. 2022.

Zhang, Zhitao, Zhengyou Wang, Shanna Zhuang, and Fuyu Huang. "Structure-feature fusion adaptive graph convolutional networks for skeleton-based action recognition." IEEE Access 8 (2020): 228108-228117.

I believe that the Related Work section has to be significantly expanded and make clear the similarities and differences of your method with respect to the literature, in order to rightly assess the contributions.

Author Response

Thank you for your constructive comments on our paper "Part-wise Adaptive Topology Graph Convolutional Network for Skeleton-Based Action Recognition". We appreciate your time and feedback. Please find below our responses:

 

1) Regarding the hierarchical partitioning contribution, we agree that we should discuss more related works that employ similar coarse-fine methods. We have added discussions of Dang et al. (2021), Yan et al. (2021), and Gou et al. (2022) to Section 2.3. These methods cluster joints to obtain coarser poses. Our method does not produce any pseudo joints or coarse poses. To find more informative joints and connections, we learn the communication relationships between joints at coarse and fine levels. The joints are shared by regions of importance, and the hierarchical representation provides a more flexible representation.

2) Yes, when we refer to images as Euclidean data, we mean that the pixels have a spatial relationship among them. Images can be represented as a grid of pixels, where each pixel has a specific position in a 2D space. In this 2D grid, the distance between pixels follows the rules of Euclidean geometry. Neighboring pixels are spatially close and have a well-defined Euclidean distance between them. On the other hand, when we mention that the human skeleton can be seen as a "non-Euclidean graph structure," we are referring to the fact that the spatial relationships between the joints in a human skeleton are not easily described by Euclidean distances. Instead, the human skeleton can be represented as a graph, where the nodes represent the joints and the edges represent the connections between the joints. The connections between nodes are described by an adjacency matrix, and the node data is specific to the task. In skeleton action recognition, the input data of the nodes is their spatial positions. However, in other tasks, such as data mining, the nodes may represent other types of information. The connection relationship is a type of prior information. Although the skeletal joints are physically connected in a certain way, another relationship exists between them because of the control exerted by the human brain. For example, there is a relationship between the arm and the leg, which needs to be considered when analyzing human movement.

3) We apologize for the lack of relevant references. In our latest manuscript, we have added references. Previous methods used multiple streams, which is a common approach and is not related to adaptivity. Some efficient models* fuse the multi-stream data at an early stage of the model and use one main stream to extract discriminative features. Since multi-stream aggregation is not the focus of the paper, our method follows the aggregation method used by the majority of approaches**.

*Song, Y.F.; Zhang, Z.; Shan, C.; Wang, L. Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022, 45, 1474–1488.

** Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 183–192.

Chen, T.; Zhou, D.; Wang, J.; Wang, S.; Guan, Y.; He, X.; Ding, E. Learning multi-granular spatio-temporal graph network for skeleton-based action recognition. In Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 4334–4

Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12026–12035.

Shi, Lei, Yifan Zhang, Jian Cheng, and Hanqing Lu. "Skeleton-based action recognition with multi-stream adaptive graph convolutional networks." IEEE Transactions on Image Processing 29 (2020): 9532-9545.

Alsarhan, Tamam, Usman Ali, and Hongtao Lu. "Enhanced discriminative graph convolutional network with adaptive temporal modelling for skeleton-based action recognition." Computer Vision and Image Understanding 216 (2022): 103348.
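
The graph view of the skeleton described in response 2) can be sketched as follows: joints are nodes, bones are edges, and the connections are stored in an adjacency matrix. The five-joint toy skeleton and the helper name `build_adjacency` are hypothetical and far simpler than the skeletons in the benchmark datasets; the sketch only illustrates that physically unconnected joints, such as a hand and a foot, have no entry in the fixed physical topology.

```python
from typing import List, Tuple


def build_adjacency(
    num_joints: int, edges: List[Tuple[int, int]], self_loops: bool = True
) -> List[List[float]]:
    """Build a symmetric adjacency matrix from a list of bone connections."""
    adjacency = [[0.0] * num_joints for _ in range(num_joints)]
    for i, j in edges:
        adjacency[i][j] = 1.0
        adjacency[j][i] = 1.0
    if self_loops:
        # Each joint is also connected to itself.
        for i in range(num_joints):
            adjacency[i][i] = 1.0
    return adjacency


# Hypothetical toy skeleton, not an actual dataset layout.
JOINTS = ["head", "torso", "left_hand", "right_hand", "left_foot"]
BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]

A = build_adjacency(len(JOINTS), BONES)

# The physical topology has no direct hand-foot link (A[2][4] is 0.0),
# which is the kind of relationship an adaptive topology would have to learn.
hand_foot_linked = A[2][4] != 0.0
```

An adaptive approach would replace or augment this fixed matrix with learned entries, so that relationships beyond the physical bones can contribute to the graph convolution.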

In summary, we appreciate all the comments and agree that we should strengthen our paper by clarifying our contributions over existing works and improving our explanations. Please let us know if you have any other questions or concerns.

Round 2

Reviewer 3 Report

Since all my comments were addressed, I recommend to publish the paper.

Reviewer 4 Report

I believe the authors have addressed my comments. The paper should be revised for minor typos or inconsistencies. For example, in line 343 the word "privous" appears instead of "previous", and numbers in the thousands are written in the 56,880 format (line 367), as well as in the 63026 format (line 374).
