Article
Peer-Review Record

Learning Effective Skeletal Representations on RGB Video for Fine-Grained Human Action Quality Assessment

by Qing Lei 1,2,3, Hong-Bo Zhang 1,2,3, Ji-Xiang Du 1,2,3, Tsung-Chih Hsiao 1,3,* and Chih-Cheng Chen 4,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Electronics 2020, 9(4), 568; https://doi.org/10.3390/electronics9040568
Submission received: 23 January 2020 / Revised: 23 March 2020 / Accepted: 26 March 2020 / Published: 28 March 2020
(This article belongs to the Special Issue Intelligent Electronic Devices)

Round 1

Reviewer 1 Report

This study proposed an integrated action classification and regression learning framework for fine-grained human action quality assessment from RGB video. The algorithm's performance was comprehensively evaluated on three different datasets, and the results showed better performance than other methods.

The work is relevant and can contribute significantly to human action quality assessment research; however, some minor improvements are needed before publication in the journal.

  1. The judgement of the human actions (e.g., diving, skating) needs to be clearly explained: did the authors recruit professional experts, diving/skating coaches, etc.?
  2. Table 1 and Figure 7 show similar information about the results; the same can be noticed for Table 2/Figure 9 and Table 3/Figure 11. It is better to present the results in table form. In addition, the superior performances should be highlighted in bold.

Author Response

Journal: Electronics (ISSN 2079-9292)

Manuscript ID: electronics-714964

Type: Article   Number of Pages: 19

Title: Learning Effective Skeletal Representations on RGB Video for Fine-grained Human Action Quality Assessment

Dear Editor,

Thank you very much for your letter and for the comments by the reviewers. These comments are very valuable and helpful for our paper.

We appreciate the careful, constructive, and generally favorable reviews given to our paper by the reviewers.

We believe we have adequately addressed all the excellent advice and questions raised by the reviewers. Furthermore, we checked the manuscript and made sure the submitted manuscript is correct.

Please contact us if any further questions remain.

Sincerely yours,

 

Prof. Chih-Cheng Chen

Response to the comments of reviewers:

Reviewer 1:

 

Comments and Suggestions for Authors

This study proposed an integrated action classification and regression learning framework for fine-grained human action quality assessment from RGB video. The algorithm's performance was comprehensively evaluated on three different datasets, and the results showed better performance than other methods.

The work is relevant and can contribute significantly to human action quality assessment research; however, some minor improvements are needed before publication in the journal.

Q1: The judgement of the human actions (e.g., diving, skating) needs to be clearly explained: did the authors recruit professional experts, diving/skating coaches, etc.?

Answer:

Thanks to the reviewer for raising this question. The construction of a human action evaluation dataset indeed requires professional experts to annotate the training samples according to their knowledge. That is why several limitations, such as insufficient training data and a limited number of action categories, exist in action quality assessment research.

In this work, we validate our proposed method on the publicly available MIT Olympic Scoring dataset and UNLV Olympic Scoring dataset. All diving, figure skating and vaulting videos in these two datasets were collected from Olympic competition and World Championship events on YouTube. The ground-truth score of each action video is obtained by extracting the judges' scores released publicly in the sports footage. These statements have been added on pages 11 and 16 of the revised manuscript.

 

 

Q2: Table 1 and Figure 7 show similar information about the results; the same can be noticed for Table 2/Figure 9 and Table 3/Figure 11. It is better to present the results in table form. In addition, the superior performances should be highlighted in bold.

Answer:

Thanks to the reviewer for these constructive opinions. To explain why the class-specific learning strategy is employed to construct regression models in the proposed framework, we compared the performance of different evaluation methods and illustrated the results for the three action categories in Figures 7, 9 and 11. This duplicates the results already presented in Tables 1, 2 and 3, respectively. Therefore, we have deleted Figures 7, 9 and 11 in the revised manuscript. In addition, the superior performances have been highlighted in bold in all tables.

 

Author Response File: Author Response.pdf

Reviewer 2 Report

This work proposes a framework for action quality assessment. Considering its versatile applications, the problem that the authors tackle definitely has merit. The proposed method uses multiple existing machine learning/computer vision solutions as building blocks. For instance, the input video is primarily handled and featurized by OpenPose [21]. By weaving multiple frames into data instances of frame sequences and applying BoW, the data are classified into action classes. Using the action class as a sort of latent variable, the proposed method internally chooses a regression model and assesses the action quality.

Although it is clear that the proposed method involves a certain amount of design effort, the submitted manuscript raises questions. Why do the authors use OpenPose? Why does the method only use 14 body joints while other signals are available? How does the BoW featurization apply to the data? What is the specific role of the k-means clustering algorithm? What is y for the max-margin classifier? Did the SVM with the linear kernel classify the action categories well? How do the authors supply the values of y? There are multiple seemingly important components that are neither explained nor described sufficiently in the current description of the method. The above-mentioned questions should be answered and resolved in the manuscript.

I also would like to raise my concern about the choice of regressor. There are potentially better choices of regressor out there; to name a few, even a simple multi-layered neural network or random forest regressor may perform better than a linear SVR (please forgive my rough guess, but the authors should have crunched this question by experiments).

The Pearson correlation coefficient used as the evaluation metric might not be sufficient, as it is only able to capture linear correlations between predictions and ground truth. Why do the authors assume that good predictions are linearly correlated with the true observations?

Lastly, the manuscript has many fragmented sentences and grammatical mistakes, which hinder the readability to a great degree. Such errors should be eliminated.

Author Response

Journal: Electronics (ISSN 2079-9292)

Manuscript ID: electronics-714964

Type: Article   Number of Pages: 19

Title: Learning Effective Skeletal Representations on RGB Video for Fine-grained Human Action Quality Assessment

 

Dear Editor,

Thank you very much for your letter and for the comments by the reviewers. These comments are very valuable and helpful for our paper.

We appreciate the careful, constructive, and generally favorable reviews given to our paper by the reviewers.

We believe we have adequately addressed all the excellent advice and questions raised by the reviewers. Furthermore, we checked the manuscript and made sure the submitted manuscript is correct.

Please contact us if any further questions remain.

Sincerely yours,

 

Prof. Chih-Cheng Chen

Response to the comments of reviewers:

Reviewer 2:

Comments and Suggestions for Authors

This work proposes a framework for action quality assessment. Considering its versatile applications, the problem that the authors tackle definitely has merit. The proposed method uses multiple existing machine learning/computer vision solutions as building blocks. For instance, the input video is primarily handled and featurized by OpenPose [21]. By weaving multiple frames into data instances of frame sequences and applying BoW, the data are classified into action classes. Using the action class as a sort of latent variable, the proposed method internally chooses a regression model and assesses the action quality.

     Thanks to the reviewer for the accurate summary of our work.

 

Although it is clear that the proposed method involves a certain amount of design effort, the submitted manuscript raises questions.

Q1: Why do the authors use OpenPose?

Answer: Thanks to the reviewer for raising this question. Since we take skeleton data captured from RGB video as the input of our learning framework for action quality assessment, a state-of-the-art pose estimation method is needed to detect skeleton data in RGB images or videos. In recent years, traditional skeletonization models, including the deformable part model and the flexible mixtures-of-parts model, have been replaced by deep neural network-based approaches such as DeepPose, OpenPose, and DensePose. OpenPose is an effective pose estimation method developed by the Perceptual Computing Lab at Carnegie Mellon University. It is the first real-time multi-person skeleton detection system and works well on RGB videos. Therefore, to obtain the skeleton data of an action performer in RGB video, we employed the state-of-the-art OpenPose algorithm to detect the joint positions in each frame and extracted the joint trajectories to represent the action video.

These statements have been mentioned on page 5 of the revised manuscript.

Q2: Why does the method only use 14 body joints while other signals are available?

Answer: Thanks to the reviewer for raising this question. In OpenPose's keypoint detection, the 18-keypoint skeleton model is composed of 18 human body joints, as illustrated in Figure 2(a): nose, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee, left ankle, right eye, left eye, right ear, and left ear. The 25-keypoint skeleton model additionally includes six keypoints on the left and right feet and one keypoint at the center of the hips.

Action quality assessment of diving, figure skating and vaulting is highly dependent on the changing positions and configuration of human body parts, whereas the motion changes of the keypoints on the eyes, ears and feet are not obvious. Consequently, we only use the 14 body joints (N = 0~13), excluding the eyes and ears, for analysis. These statements have been mentioned on page 5 of the revised manuscript.
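For concreteness, this joint selection can be sketched as follows, assuming pose sequences stored as a NumPy array in OpenPose's 18-keypoint (COCO) layout; the array shape and placeholder values are illustrative assumptions, not data from the paper:

```python
import numpy as np

# Illustrative placeholder: T frames of 18 OpenPose keypoints, each with
# (x, y, confidence). In the COCO layout, indices 0-13 are the body joints
# used here; indices 14-17 (eyes and ears) are discarded as described above.
keypoints = np.zeros((100, 18, 3))    # stand-in pose sequence, T = 100
body_joints = keypoints[:, :14, :2]   # keep the 14 joints, (x, y) only
```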

Q3: How does the BoW featurization apply to the data? What is the specific role of the k-means clustering algorithm?

Answer: In the BoW-based action representation, the unsupervised K-means algorithm is first performed on all joint motion volume features extracted from the training videos to obtain K clusters, which constitute the action codebook. Each cluster center is called a visual word, and the K centers together form the visual codebook for action modeling. Each original feature is then projected onto its closest visual word in the codebook, and all the features of an action video are aggregated into an occurrence-frequency histogram of visual words. This histogram forms the final BoW representation of the action video.

These statements have been mentioned on page 7 of the revised manuscript.
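To make the pipeline concrete, here is a minimal sketch of the codebook construction and histogram encoding described above, using scikit-learn's KMeans; the function names and the codebook size K are illustrative assumptions, not values from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_descriptors, k=200, seed=0):
    """Cluster all joint-motion-volume descriptors from the training videos;
    the k cluster centers form the visual codebook."""
    return KMeans(n_clusters=k, random_state=seed).fit(train_descriptors)

def encode_video(codebook, video_descriptors):
    """Project each descriptor onto its nearest visual word and aggregate
    them into a normalized occurrence-frequency histogram (the BoW feature)."""
    words = codebook.predict(video_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()
```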

Q4: What is y for the max-margin classifier? Did the SVM with the linear kernel classify the action categories well? How do the authors supply the values of y?

Answer: In the max-margin classifier for action classification, as formulated in Equation 2, y denotes the ground-truth action category of each video. It is provided by manual annotation during the construction of the benchmark dataset. These statements have been mentioned on page 7 of the revised manuscript.

In the action classification component, we proposed to extract spatio-temporal volumes centered at each detected human body joint and to learn the local motion patterns of these joint motion volumes to train the action classifier. The output of the action classifier supervises the choice of the class-specific regression model for assessing the quality of the action video at the testing stage. We validated this component on classifying diving, figure skating and vaulting videos. However, since action classification is not the main target or contribution of our action quality assessment framework, the detailed experimental results were not discussed in the submitted manuscript. We have supplemented the average accuracy of the linear-kernel SVM classifier in estimating the class labels of diving, figure skating and vaulting videos. These statements have been mentioned on pages 12 and 13 of the revised manuscript.
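The classification step can be sketched as follows; the feature dimensionality, number of videos, and the value of C are illustrative assumptions, with random histograms standing in for the real BoW features:

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative stand-ins: 90 training videos with 200-dim BoW histograms and
# manually annotated labels (0 = diving, 1 = figure skating, 2 = vaulting).
rng = np.random.default_rng(0)
X_train = rng.random((90, 200))
y_train = rng.integers(0, 3, size=90)

clf = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

# At test time, the predicted category selects the class-specific regressor.
predicted_category = clf.predict(rng.random((1, 200)))
```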

 

There are multiple seemingly important components that are neither explained nor described sufficiently in the current description of the method. The above-mentioned questions should be answered and resolved in the manuscript.

 

Q5: I also would like to raise my concern about the choice of regressor. There are potentially better choices of regressor out there; to name a few, even a simple multi-layered neural network or random forest regressor may perform better than a linear SVR (please forgive my rough guess, but the authors should have crunched this question by experiments).

Answer: Thanks to the reviewer for this constructive opinion. With the popularity of deep neural networks for recognition problems in computer vision, a great deal of research has taken an interest in developing deep learning methods for human action recognition. However, there are two main reasons why a multi-layered neural network based regression method was not adopted in this work. One reason is that we assume the quality of a human action is directly dependent on the dynamic changes of human body movement, which can be represented by the motion trajectories and relative location relationships of the joints, while it is less influenced by environmental backgrounds or scenes. Therefore, one of our motivations is to extract robust motion patterns from skeleton data and develop an effective feature representation of the human pose sequence for fine-grained action quality assessment. Combining our proposed feature method with a multi-layer neural network based regression framework would make the learning process quite complicated and time-consuming, due to the large number of parameters and the high cost of storage capacity. Another reason is that training a deep neural network regressor requires large-scale annotated training samples and high computational resources. However, massive annotated datasets with diverse actions and subjects for validating action quality assessment methods are still lacking at the present stage: most public datasets are oriented toward a particular application, and only a limited number of training samples have been collected.
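As a hedged illustration of how the reviewer's suggestion could be tested empirically, the sketch below compares a linear SVR with a random forest regressor under the Spearman metric used in Section 4; the feature matrix, scores, and train/test split are random placeholders, not data from the paper:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.svm import LinearSVR
from sklearn.ensemble import RandomForestRegressor

# Placeholder features and judge scores; a real experiment would use the
# skeletal feature vectors and ground-truth scores from the datasets.
rng = np.random.default_rng(0)
X = rng.random((100, 64))
scores = rng.random(100) * 100

for model in (LinearSVR(max_iter=10000), RandomForestRegressor(n_estimators=100)):
    model.fit(X[:80], scores[:80])                       # train split
    rho, _ = spearmanr(model.predict(X[80:]), scores[80:])  # test split
    print(type(model).__name__, round(rho, 3))
```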

Q6: The Pearson correlation coefficient used as the evaluation metric might not be sufficient, as it is only able to capture linear correlations between predictions and ground truth. Why do the authors assume that good predictions are linearly correlated with the true observations?

Answer: Thanks to the reviewer for pointing out this issue. We sincerely apologize for the writing mistake regarding the evaluation criterion used to validate the proposed action quality assessment method, as formulated in Equation 9. It should be the Spearman rank correlation coefficient rather than the Pearson correlation coefficient. Rank correlation is also the evaluation criterion established with the benchmark MIT Olympic Scoring dataset and followed by all the compared feature methods in Section 4. The evaluation result is obtained by ranking both the ground-truth score vector and the predicted score vector, and then accumulating the squared ranking differences between the ground truths and the estimations over all testing videos. The relevant information has been corrected on page 13 of our revised manuscript.
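A small sketch of the corrected criterion, computing the Spearman rank correlation both from the rank-difference formula described above and with SciPy; the score vectors are illustrative, not taken from the experiments:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

gt   = np.array([88.5, 92.0, 75.3, 81.0, 95.6])  # illustrative judge scores
pred = np.array([85.0, 93.2, 70.1, 96.0, 83.4])  # illustrative predictions

d = rankdata(gt) - rankdata(pred)                # per-video ranking difference
n = len(gt)
rho_manual = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))
rho_scipy, _ = spearmanr(gt, pred)
assert np.isclose(rho_manual, rho_scipy)         # holds when there are no ties
print(rho_manual)                                # 0.1 for these stand-in values
```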

 

Q7: Lastly, the manuscript has many fragmented sentences and grammatical mistakes, which hinder the readability to a great degree. Such errors should be eliminated.

Answer: Thanks to the reviewer for pointing out our writing errors and giving us constructive opinions. We have checked the grammar, spelling, and punctuation of all sentences in the manuscript. The errors have been corrected in the revised manuscript.

 

 

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

I appreciate that the authors have made efforts in responding to the previous review within a short period of time. However, my concerns with respect to the following issues still remain, and I have to request that they be resolved before the manuscript is accepted.

1) Although the proposed method makes several interesting points, I still consider that the current experimental setting does not suffice to prove the merits of the proposed method -- which needs to be fixed. While the authors claim that the proposed method is compared against state-of-the-art feature extraction methods, the compared methods seem rather outdated [28,29,22,1]. Albeit it could be a valid option, I am NOT suggesting the authors add new baseline methods published in 2016-2020 to their experiments. Rather, a proper justification of the choice of baseline methods has to be provided. Neither the current Related Work section nor the Experiments fulfill this.

2) Explanations of some key components of the manuscript seem missing or need improvement. What is the "tracking algorithm" (line #482)? How do STIP and Dense sampling work (lines #917-918)? The notation of Section 3.2 is very hard to understand (page 6). How are the action classifier and assessment regression models trained simultaneously (lines #120-121; they look like two independent models that are pipelined back-to-back and cannot be simultaneously optimized)?

3) Although the authors have put a certain amount of work into minimizing grammatical mistakes and improving the manuscript, a notable number of errors still remains, which hinders the readability of the manuscript a lot. These include incorrect uses of singular/plural, adjective/adverb, and tenses; omissions of articles; inappropriate word choices; abuse of the possessive case (apostrophe abuse); and uses of FANBOYS. Due to these errors and mistakes, many parts of the current manuscript are hard to follow.

Author Response

Journal: Electronics (ISSN 2079-9292)

Manuscript ID: electronics-714964

Type: Article   Number of Pages: 19

Title: Learning Effective Skeletal Representations on RGB Video for Fine-grained Human Action Quality Assessment

 

Dear Editor,

Thank you very much for your letter and for the comments by the reviewers. These comments are very valuable and helpful for our paper.

We appreciate the careful, constructive, and generally favorable reviews given to our paper by the reviewers.

We believe we have adequately addressed all the excellent advice and questions raised by reviewers. Furthermore, we checked the manuscript and made sure the submitted manuscript is correct.

Please contact us if any further questions remain.

Sincerely yours,

 

Prof. Chih-Cheng Chen

Response to the comments of reviewers:

Reviewer 2:

Comments and Suggestions for Authors

I appreciate that the authors have made efforts in responding to the previous review within a short period of time. However, my concerns with respect to the following issues still remain, and I have to request that they be resolved before the manuscript is accepted.

Answer: Thanks to the reviewer for giving us careful and constructive reviews.

Q1: Although the proposed method makes several interesting points, I still consider that the current experimental setting does not suffice to prove the merits of the proposed method -- which needs to be fixed. While the authors claim that the proposed method is compared against state-of-the-art feature extraction methods, the compared methods seem rather outdated [28,29,22,1]. Albeit it could be a valid option, I am NOT suggesting the authors add new baseline methods published in 2016-2020 to their experiments. Rather, a proper justification of the choice of baseline methods has to be provided. Neither the current Related Work section nor the Experiments fulfill this.

Answer: Thanks to the reviewer for these constructive opinions. The benchmark MIT Olympic Scoring dataset that we employed to evaluate the proposed method was published in reference [1], and a similar solution strategy is employed in both [1] and our proposed learning framework: a handcrafted feature engine is first built for feature representation, and a regression model is then learned from the action features for quality assessment. Therefore, we chose reference [1] as the baseline method. In the work of [1], the authors extracted both low-level spatial and temporal filtering features that capture edges and velocities, and high-level pose features obtained from the discrete cosine transform of joint displacement vectors, and they estimated a regression model that predicts the scores of actions. They compared their performance with the space-time interest points (STIP) method [28] and DFT pose features. However, it was shown in [29] that dense sampling achieves better classification performance than the original STIP for human action recognition in realistic scenes. Therefore, we compared the performance of our feature method with STIP [28] and dense sampling [29] on the benchmark MIT Olympic Scoring dataset.

On the other hand, we developed a self-similarity feature representation extracted from joint trajectories and joint displacement sequences to describe the motion patterns of the joints and the posture changes. Self-similarity matrices for encoding human actions were first proposed in [22], which only accumulated coordinate offsets over all body joints for a single frame and neglected the individual motion dynamics of each joint and the relative positions of the body joints. Different from [22], we encode the motion dynamics of each body joint independently and further use the displacement sequences between body joints to build temporal self-similarity matrices. Therefore, on the basis of skeleton data representation for human actions, we compared our proposed feature method with the baseline self-similarity matrix feature [22].

These statements have been added on page 13 of the revised manuscript.
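As a hedged illustration of the per-joint encoding described above, the sketch below computes a temporal self-similarity matrix for a single joint trajectory; the Euclidean distance and the function name are illustrative assumptions:

```python
import numpy as np

def joint_ssm(trajectory):
    """trajectory: (T, 2) array of one joint's (x, y) positions over T frames.
    Returns the T x T matrix of pairwise Euclidean distances; its patterns
    encode the motion dynamics of that joint independently of the others."""
    diff = trajectory[:, None, :] - trajectory[None, :, :]
    return np.linalg.norm(diff, axis=-1)
```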

Q2: Explanations on some key components of the manuscript seem missing or need improvement. What is the "tracking algorithm" (line#482)?

Answer: Thanks to the reviewer for raising this question. The tracking algorithm mentioned on page 5 of our manuscript refers to the noise-filtering treatment of the estimated skeleton data, which deals with occlusion and cluttered backgrounds in realistic scenes. For a more accurate description, the presentation has been modified as follows: in the case of occlusion or self-occlusion, zero values of the joint coordinates are obtained due to the failed detection of the human body; we then apply a linear interpolation algorithm to the pose estimation results of the previous and next frames to recover the missing skeleton data. These statements have been added at lines 227~230 on page 5 of the revised manuscript.
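A minimal sketch of this gap-filling step, assuming missing detections are marked with zero coordinates; the function name and the one-coordinate-per-frame layout are illustrative:

```python
import numpy as np

def fill_missing_joint(coords):
    """coords: 1-D array of one joint coordinate per frame, where 0 marks a
    failed detection. Missing values are linearly interpolated from the
    neighboring valid frames."""
    xs = np.array(coords, dtype=float)
    valid = xs != 0
    missing = np.flatnonzero(~valid)
    xs[missing] = np.interp(missing, np.flatnonzero(valid), xs[valid])
    return xs
```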

Q3: How do STIP and Dense sampling work (lines #917-918)?

Answer: Thanks to the reviewer for raising this question. STIP is the abbreviation of Spatio-Temporal Interest Points, a feature detection method developed in traditional action recognition research. It is based on the observation that actions frequently occur at positions with sharp changes in both the spatial and temporal domains, and it employs a space-time extension of the Harris corner detector to extract such prominent positions from the action video. Histograms of oriented gradients (HOG) and optical flow (HOF) are then computed and concatenated for each local spatio-temporal volume centered at a prominent position, and all local features are aggregated to form the feature representation of the whole video. Dense sampling instead extracts video blocks at regular intervals and scales in space and time using a sliding window moving throughout the whole video. The HOG and HOF descriptors of each video block are computed and concatenated to represent the local feature at each spatio-temporal position, and all local descriptors are aggregated to form the final feature representation of the whole video. These statements have been added at lines 529~555 on page 13 of the revised manuscript.
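To illustrate the dense sampling scheme just described, here is a rough sketch of regular-grid spatio-temporal block extraction; the block size and stride are illustrative, and the HOG/HOF descriptor computation per block is omitted:

```python
import numpy as np

def dense_blocks(video, size=(32, 32, 10), stride=(16, 16, 5)):
    """video: (H, W, T) grayscale volume. Yields every spatio-temporal block
    of the given size sampled on a regular grid; in the full pipeline, HOG
    and HOF descriptors would be computed for each block and aggregated."""
    H, W, T = video.shape
    for y in range(0, H - size[0] + 1, stride[0]):
        for x in range(0, W - size[1] + 1, stride[1]):
            for t in range(0, T - size[2] + 1, stride[2]):
                yield video[y:y + size[0], x:x + size[1], t:t + size[2]]
```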

Q4: The notation of Section 3.2 is very hard to understand (page 6). How are the action classifier and assessment regression models trained simultaneously (lines #120-121; they look like two independent models that are pipelined back-to-back and cannot be simultaneously optimized)?

Answer: Thanks to the reviewer for pointing out our writing error. "Simultaneously" here should be "respectively", since the action classification model and the action quality assessment model are trained in different components. The word was misused; we meant that both components are contained in the proposed framework. We have revised this writing error at line 95 on page 2 of the revised manuscript.

Q5: Although the authors have put a certain amount of work into minimizing grammatical mistakes and improving the manuscript, a notable number of errors still remains, which hinders the readability of the manuscript a lot. These include incorrect uses of singular/plural, adjective/adverb, and tenses; omissions of articles; inappropriate word choices; abuse of the possessive case (apostrophe abuse); and uses of FANBOYS. Due to these errors and mistakes, many parts of the current manuscript are hard to follow.

Answer: Thanks to the reviewer for pointing out our writing errors. We have used the English editing service provided by MDPI to check the writing, grammar, spelling, and punctuation of our manuscript, and the manuscript has been revised accordingly.

 

The English Editing certification has been uploaded.

 

Author Response File: Author Response.pdf
