Submit to this Journal Review for this Journal Propose a Special Issue

Article Menu

Share Help Cite Discuss in SciProfiles

Open AccessArticle

Peer-Review Record

Deep Clustering Efficient Learning Network for Motion Recognition Based on Self-Attention Mechanism

Appl. Sci. 2023, 13(5), 2996; https://doi.org/10.3390/app13052996

by Tielin Ru^1,*

and Ziheng Zhu²

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Anonymous

Appl. Sci. 2023, 13(5), 2996; https://doi.org/10.3390/app13052996

Submission received: 12 January 2023 / Revised: 17 February 2023 / Accepted: 21 February 2023 / Published: 26 February 2023

(This article belongs to the Special Issue Current Trends and Future Perspectives on Computer Vision and Pattern Recognition)

Round 1

Reviewer 1 Report

In this work, the author proposed a deep clustering learning network of motion recognition under the self-attention mechanism, which can effectively solve the problems of accuracy and efficiency of sports event analysis and judgment, and proved the effectiveness and feasibility of this method in sports video analysis and reasoning through experiments. In response to the following comments, my decision is to make major changes. If the following problems can be solved, I will suggest that the manuscript be published in this journal.

#Strength:

+(1) Through the LSTM, this network can not only solve the problem of gradient disappearance and explosion in the recurrent neural network (RNN), but also capture the internal correlation between multiple people in the sports field for identification, etc;

+(2) On the basis of (1), the DEC is added to integrate the motion coding information in key frames to improve the judgment efficiency;

+(3) With the self-attention mechanism, it can not only analyze the whole process of the whole sports video macroscopically but also focus on the specific attributes of the movement to capture more important details, extract the key posture features of athletes and further enhance the features, effectively reducing the parameters of the self-attention mechanism in the calculation process, reducing the computational complexity while maintaining the ability to capture details, Improve the accuracy and efficiency of reasoning and judgment. Through verification on large video datasets of mainstream sports, we have achieved high accuracy and improved the efficiency of detection and recognition.

#Weaknesses:

-(1) Please provide an example of the paper on the detection and recognition of multi-person behavior events, and describe the method of the paper.

-(2) In Figure 1, there should be a general title rather than a direct description.

-(3) In Tables 2 and 3, the experimental conclusion data of this paper should be bold.

-(4) Line 48, Literature[11,12]->The literature[11,12]

-(5) Line 210, , And iteratively->, and iteratively

-(6) Line 229, , At the same time->, at the same time

-(7) Line 323, ,M describes->, M describes

-(8) Line 325, ,->,

-(9) Line 330, ,and->, and .Finally->. Finally

Author Response

Reviewer 1：

(1)

This article describes the local group relationship analysis method for group activity identification. In this article, the author proposes a method to solve the problem of human group activity identification by using local group relationship. Specifically, instead of analyzing the motion of each human, we first group individual human objects into local groups to represent the relationships in the entire scene. By modeling each human motion and local group relationship, we can maximize the important motion information. The author uses the gated recursive cell model to deal with the trajectory information of any length with nonlinear hidden cells. In the experiment on the data set of public human group activities, the performance of the proposed method is compared with other competitive methods, and it is shown that the proposed algorithm is superior to other methods. At the end of the paper, a new feature descriptor is proposed to identify group activities in surveillance video. Multiple human objects are grouped into local groups, and their relationships are significantly considered. The GRU model captures the time dynamics of multiple relationships with different lengths. This paper proves that the proposed local group feature is effective by superior to competition method, and will further expand the proposed method to deal with appearance features and various scenes.

(2)-(9)

Corrected.

Reviewer 2 Report

This paper proposes a deep clustering learning network for motion recognition under the self-attention mechanism, which is used to solve the accuracy and efficiency problems of sports event analysis and judgment. This work is interesting. However, it requires significant improvements with respect to the following points.

#1- Please read the whole paper again and correct the possible typos. Some of them are highlighted in the attached file.

#2- The mathematical formulation of the considered problem is missing or unclear!

#3- The authors are encouraged to provide the pseudocode of the given method.

#4- All experimental parameters should be provided.

#5- Experimental results and their comparison is not seen with statistical evaluation.

#6- Some remarks on the main results would be necessary and helpful!

Comments for author File: Comments.pdf

Author Response

Reviewer 2：

（1）#1- Please read the whole paper again and correct the possible typos. Some of them are highlighted in the attached file.

Reply:Thank you very much for your suggestion. We have reviewed the full paper and made changes.

(2) The mathematical formulation of the considered problem is missing or unclear!

Reply:Thank you very much for your suggestions. We have read a lot of references and combined with the methods of the literature to deduce the formula in this paper one by one again, and explained the meaning of each parameter, making the article fuller and smoother.

(3) The authors are encouraged to provide the pseudocode of the given method.

Reply:Thank you very much for your suggestion. We will supplement the pseudocode part of the method in this article and emphasize it to make the method in this article more clear and easy for readers to read.

(4) All experimental parameters should be provided.

Reply:Thank you very much for your suggestion. We have sorted out the experimental part again, explained the indicators of the experimental comparison, and supplemented the explanation of the formula principle in the original text to make the article more convincing and make the readers read more smoothly.

(5) Experimental results and their comparison is not seen with statistical evaluation.

Reply:Thank you very much for your suggestion. In the experiment part, we mainly focus on the work of motion detection and recognition, and from the simple pedestrian motion detection analogy to the more complex motion behavior detection, the targeted comparison indicators have been improved in the accuracy rate and recall rate.(Line 452)

(6) Some remarks on the main results would be necessary and helpful!

Reply:Thank you very much for your suggestion. For the main experimental results, we made necessary comments in the following discussion and conclusion, which is very important for the integrity of the article.

Reviewer 3 Report

This paper provides only very brief illustration of the proposed method, without details on specific architecture or design. There is no way to reproduce the method presented in this paper.

The experiments are designed with very abnormal training settings. Misuse of key terms, such as automatic encoder, epochal, 100 times of training, etc, makes the study unreliable. More clarification is required.

The accuracies of the proposed method presented in Table 2 & 3 don't agree with the confusion matrix in Figure 8. How are the accuracies for the benchmarking models obtained for Table 2 & 3? I will suggest the authors to provide the source of implementations for the benchmakring models.

Author Response

Reviewer 3：

(1) This paper provides only very brief illustration of the proposed method, without details on specific architecture or design. There is no way to reproduce the method presented in this paper.

Reply: Thank you very much for your suggestion. In response to this problem, we have learned from some top visual publications and top-level conferences, and will supplement the experimental description of the method and carry out pseudocode annotation in the article(Line 362), in order to further ensure the reproducibility of the method we proposed. We put the code and data set in https://github.com/rtl2023/Behavior-detection.

(2) The experiments are designed with very abnormal training settings. Misuse of key terms, such as automatic encoder, epochal, 100 times of training, etc, makes the study unreliable. More clarification is required.

Reply: Thank you very much for your suggestion. As you said, we had misunderstandings in the use of key terms, and there were words or numbers missing. We rechecked the experimental part and corrected the errors in time. Some of the experimental descriptions were incorrectly stated. We revised them and invited Professor Chen to polish our paper.

(3) The accuracies of the proposed method presented in Table 2 & 3 don't agree with the confusion matrix in Figure 8. How are the accuracies for the benchmarking models obtained for Table 2 & 3? I will suggest the authors to provide the source of implementations for the benchmakring models.

Reply: Thank you very much for your suggestion. We have conducted experiments on four data sets and checked the data in detail. The reproduction of the benchmark model is based on the official and third-party open source links or the experimental settings in the method paper.

Reference：

[1] Zhang, Yunhua, et al. "Audio-adaptive activity recognition across video domains." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[2] Shu, Xiangbo, et al. "Hierarchical long short-term concurrent memory for human interaction recognition." IEEE transactions on pattern analysis and machine intelligence 43.3 (2019): 1110-1118.

[3] Ijaz, Momal, Renato Diaz, and Chen Chen. "Multimodal transformer for nursing activity recognition." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[4] Kim, Dongkeun, et al. "Detector-free weakly supervised group activity recognition." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[5] Han, Mingfei, et al. "Dual-AI: dual-path actor interaction learning for group activity recognition." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.

[6] Doshi, Keval, and Yasin Yilmaz. "Federated learning-based driver activity recognition for edge devices." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[7] Vats, Arpita, and David C. Anastasiu. "Key point-based driver activity recognition." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[8] https://github.com/rtl2023/Behavior-detection

Round 2

Reviewer 2 Report

The authors have replied to my comments properly. However, there are still many grammatical errors. Some of these grammatical errors are listed below, indicated by line number:

- Line 13, fusing-> using

- Line 17 , The-> . The

- Line 83 However,-> however,

- Line 44 , Researchers-> , researchers

- Line 95 , Improve-> , improve

- Line 120 ; By-> . By

- Line 138 .In-> . In

- Line 158 , Finally,-> , finally,

- Line 175 , From-> . From

- Line 178 .In-> . In

- Line 201 .The-> . The

- Line 206 , The-> . The

- Line 217 ,b1-> , b1

- Line 217 ,W1-> , W1

- Line 236 ,In-> . In

- Line 240 ,which-> , which

- Line 278 ; Channel-> . Channel

- Line 291 .A feature-> . A feature

- Line 299 Relu-> ??

- Line 321 ,where-> , where

- Line 321 ,Then-> , then

- Line 321 ,and-> , and

- Line 322 ,where-> , where

- Line 331 ,and-> , and

- Line 335 ,In-> . In

- Line 340 :Concat()-> : Concat()

- Line 347 .At-> . At

- Line 352 .E-> . E

- Line 355 self- attention-> self-attention

- Line 367 2.40GHz-> 2.40 GHz

- Line 367 8GB-> 8 GB

- Line 369 .The-> . The

- Line 372 118.8GB-> 118.8 GB

- Line 372 ,including-> , including

- Line 385 set 0.001-> set to 0.001

- Line 431 Figure 6:-> Figure 6.

- Line 450 .We->. We

- Line 452 ,the-> , the

- Line 453 Figure 7:-> Figure 7.

- Line 466 , By-> . By

- Line 486 , This-> . This

- Line 490 ; In order to-> . In order to

- Line 502 ; The-> . The

Author Response

Dear Editors and Reviewers,

Thank you for your letter and for the reviewers’ comments. Those comments are all valuable and very helpful for revising and improving our paper and also are very important for guiding our research. We have studied the comments carefully and have made corrections which we hope meet with approval. Revised parts are marked in the new manuscript.

#Response to Reviewers

Reviewer 2：

- Line 13, fusing-> using