Real-Time Hand Gesture Spotting and Recognition Using RGB-D Camera and 3D Convolutional Neural Network
Round 1
Reviewer 1 Report
This paper proposes a real-time system that does two things: (1) detecting fingertips and (2) recognizing gestures from RGB-D images. This is an interesting research topic with a lot of previous work in the area. This work uses a Kinect v2 to capture RGB-D images and, in each frame, detects fingertip positions using the K-cosine algorithm. A 3DCNN takes a short video of a gesture performance as input and outputs the recognized gesture. The method is reasonable and the exposition is generally clear.
Here are my comments:
The authors cited work [13] as research that uses gloves with sensors. Actually, work [13] uses a vision-based method: the color glove has color blocks but no sensors on it. For capturing hand gestures with gloves, the authors may consider including
"Data-driven Glove Calibration for Hand Motion Capture", Yingying Wang et al., SCA 2013
In addition, there are several related papers from this year's CVPR and BMVC.
"3D Hand Shape and Pose Estimation from a Single RGB Image", Liuhao Ge et al., CVPR 2019
"End-to-End 3D Hand Pose Estimation from Stereo Cameras", Yuncheng Li et al., BMVC 2019
For more work related to capturing hand motion, the authors can also refer to
"State of the Art in Hand and Finger Modeling and Animation", Nkenge Wheatland et al., Eurographics 2015
I am a little confused by equation (1). What value is used for k?
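For context on the reviewer's question, the K-cosine curvature measure at a contour point is conventionally defined as the cosine of the angle between the vectors from p[i] to p[i-k] and from p[i] to p[i+k], where k is the spacing parameter being asked about. A minimal sketch under that standard definition (the contour and helper function here are illustrative, not from the paper):

```python
import math

def k_cosine(contour, i, k):
    """K-cosine at contour point i: cosine of the angle between the
    vectors from p[i] to p[i-k] and from p[i] to p[i+k].
    Fingertip candidates are high-curvature points along the hand contour."""
    n = len(contour)
    xi, yi = contour[i]
    xa, ya = contour[(i - k) % n]  # point k steps behind (wraps around)
    xb, yb = contour[(i + k) % n]  # point k steps ahead (wraps around)
    ax, ay = xa - xi, ya - yi
    bx, by = xb - xi, yb - yi
    dot = ax * bx + ay * by
    norm = math.hypot(ax, ay) * math.hypot(bx, by)
    return dot / norm

# Example: a sharp corner on a small synthetic contour.
contour = [(0, 0), (1, 2), (2, 4), (3, 2), (4, 0), (2, -2)]
print(k_cosine(contour, 2, 2))  # 0.6, i.e. an angle of about 53 degrees at the "tip" (2, 4)
```

The choice of k trades off noise sensitivity (small k) against smoothing over genuine fingertip peaks (large k), which is presumably why the reviewer asks for its value.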
The 3D-CNN comes in when recognizing gestures, but not for detecting fingertips. Why not make the 3D-CNN end-to-end to perform both tasks? From the paper, gesture recognition does not rely on fingertip detection, right?
As for the result, it is understandable that 3D-CNN takes more time to train and test, and it is good to see that its accuracy is much higher than SVM and 2D-CNN. It is also good that the authors list the confusion matrix of all the seven gesture categories. I am actually curious about the false-positive recognition.
In conclusion, this is a solid paper, using a combination of a geometric algorithm and a deep learning method to achieve the fingertip detection and gesture recognition tasks.
Author Response
The authors would like to express many thanks to the reviewers for their invaluable comments. Based on them, we have revised the previous manuscript as follows.
Point 1: "Data-driven Glove Calibration for Hand Motion Capture", Yingying Wang et al., SCA 2013
Response 1: We replaced ref [13] with this paper.
Point 2: In addition, there are several related papers from this year's CVPR and BMVC.
"3D Hand Shape and Pose Estimation from a Single RGB Image", Liuhao Ge et al., CVPR 2019
"End-to-End 3D Hand Pose Estimation from Stereo Cameras", Yuncheng Li et al., BMVC 2019
Response 2: We included these papers as references [27,28].
Point 3: For more work related to capturing hand motion, the authors can also refer to
"State of the Art in Hand and Finger Modeling and Animation", Nkenge Wheatland et al., Eurographics 2015
I am a little confused by equation (1). What value is used for k?
Response 3: We rewrote this part in detail (page 10, lines 178-191).
Point 4: The 3D-CNN comes in when recognizing gestures, but not for detecting fingertips. Why not make the 3D-CNN end-to-end to perform both tasks? From the paper, gesture recognition does not rely on fingertip detection, right?
Response 4: Making an end-to-end hand gesture recognition system using deep learning is one of our long-term goals. However, applying a 3DCNN to both tasks is very challenging: 3DCNNs are generally used for classification, whereas segmentation tasks would require much more data and other models such as U-Net or SegNet. Combining the two steps might also overload the system. In fact, gesture recognition still relies on fingertip detection; the model recognizes hand gestures more easily with the fingertips available.
Point 5: As for the result, it is understandable that 3D-CNN takes more time to train and test, and it is good to see that its accuracy is much higher than SVM and 2D-CNN. It is also good that the authors list the confusion matrix of all the seven gesture categories. I am actually curious about the false-positive recognition.
Response 5: We updated the confusion matrix with per-class accuracy.
Point 6: In conclusion, this is a solid paper, using a combination of geometry algorithm and deep learning method to achieve fingertip detection and gesture recognition tasks.
Response 6: We included this comment in the discussion.
Author Response File: Author Response.pdf
Reviewer 2 Report
The authors should better highlight their contribution to the state of the art. The most significant aspect of the work seems to be its better performance; however, the comparison with other methods is superficial, and more details on the results should be reported.
I suggest reporting more data about the implementation with a greater level of details.
Author Response
The authors would like to express many thanks to the reviewers for their invaluable comments. Based on them, we have revised the previous manuscript as follows.
Point 1: I suggest reporting more data about the implementation with a greater level of details.
Response 1: In general, accuracy is used as the evaluation metric. Therefore, we updated the confusion matrix with per-class accuracy for a more detailed evaluation in Table 3. We intend to collect more data and handle more hand gestures in future work.
Table 3. Confusion matrix of the proposed method.

| Target \ Selected | SL  | SR  | SU  | SD  | ZI  | ZO  | M   | Acc  |
|-------------------|-----|-----|-----|-----|-----|-----|-----|------|
| SL                | 149 | 0   | 0   | 0   | 0   | 0   | 1   | 0.99 |
| SR                | 1   | 148 | 0   | 0   | 0   | 0   | 1   | 0.99 |
| SU                | 1   | 0   | 145 | 0   | 2   | 0   | 2   | 0.97 |
| SD                | 0   | 0   | 0   | 146 | 1   | 1   | 2   | 0.97 |
| ZI                | 0   | 0   | 0   | 0   | 143 | 4   | 3   | 0.95 |
| ZO                | 0   | 0   | 1   | 0   | 4   | 141 | 4   | 0.94 |
| M                 | 1   | 0   | 0   | 0   | 0   | 1   | 148 | 0.99 |
| Overall           |     |     |     |     |     |     |     | 0.97 |
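The per-class and overall accuracies in Table 3 follow directly from the confusion-matrix counts: each row sums to 150 test samples, the diagonal holds the correct classifications, and overall accuracy is the trace over the grand total. A quick sketch reproducing the table's figures (the label abbreviations are the gesture classes from the paper):

```python
import numpy as np

# Confusion matrix from Table 3 (rows = target gesture, columns = selected gesture).
labels = ["SL", "SR", "SU", "SD", "ZI", "ZO", "M"]
cm = np.array([
    [149,   0,   0,   0,   0,   0,   1],
    [  1, 148,   0,   0,   0,   0,   1],
    [  1,   0, 145,   0,   2,   0,   2],
    [  0,   0,   0, 146,   1,   1,   2],
    [  0,   0,   0,   0, 143,   4,   3],
    [  0,   0,   1,   0,   4, 141,   4],
    [  1,   0,   0,   0,   0,   1, 148],
])

# Per-class accuracy: correct predictions divided by that class's row total.
per_class = np.diag(cm) / cm.sum(axis=1)

# Overall accuracy: trace (all correct) over the grand total of samples.
overall = np.trace(cm) / cm.sum()

for name, acc in zip(labels, per_class):
    print(f"{name}: {acc:.2f}")
print(f"Overall: {overall:.2f}")  # Overall: 0.97
```

This also answers Reviewer 1's curiosity about false positives: reading down a column instead of across a row gives the false-positive counts for each selected class (e.g. the M column collects most of the misclassifications).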
Author Response File: Author Response.pdf
Reviewer 3 Report
This paper presents a novel method for real-time hand gesture recognition based on RGB-D video processing using a 3D convolutional neural network (3DCNN).
The technical description of this approach is clear, and both the testing and evaluation are appropriate.
The quality of the presentation (English grammar and style, and graphics) still needs to be improved, and the following are my comments:
- Acronyms should be mentioned with their expansions at their first occurrence. For example: the 1st occurrence of HCI is at line 29.
- The plural form of an acronym has to end with an "s". For example: the acronym for "convolutional neural networks" is CNNs not CNN (see line 51).
- The parts of the text mentioned in lines 64-65 are subsections, not sections.
- Consistency should be improved in using capitalization, especially for names. For example: "convolutional neural networks" is sometimes written in lowercase (line 51), sometimes written in title-case (line 65).
- The title of subsection 2.1 should be: "Sensors Used for Hand Gesture Recognition Interface".
- Use "glove-based" instead of "gloves-based".
- The major drawbacks of vision-based systems are not mentioned: occlusion, resolution, accuracy etc.
- In lines 86-87 the authors say that systems like the MS Kinect do not provide hand gesture recognition. This statement is not true, since most of these systems integrate basic hand gesture recognition (e.g. open/close hand in the MS Kinect v2 SDK).
- Subsection titles should be written in title-case.
- Be consistent in using: "fingertip detection-based" or "fingertip-based".
- The correct acronym for "3D convolutional neural networks" is 3DCNNs, with no space between D and C.
- Be consistent with the name of devices. For example: Microsoft Kinect v2.
- Names of stages/modules should be mentioned using title-case. For example: write "Hand Region Extraction" instead of "hand region extraction".
- Use punctuation in all figure captions.
- All figures and diagrams should be centered (alignment).
- In Figure 2, use title-case for all blocks (see Skeleton Information block).
- Figure 5 does not include any part (a) or (b), but these parts are referenced in the text (see lines 143 and 169).
- Fix the style and alignment of all equations.
- Anchor titles to the first paragraph of the relative body text.
- The labels for the layers of the CNN should be listed in line 208, not later.
- Figure 7 exceeds page margins.
- In line 216, replace "jth" with "ith".
- When referencing equations, "Equation" should always be capitalized (see line 235).
- The sentence at lines 236-237 looks incomplete.
- The acronyms for hand gestures should be listed in lines 195-196, instead of lines 247-248.
- Fix the arrangement of pictures in Figure 10.
- Be consistent in using English verb tenses.
- Sentence in lines 292-293 is not clear.
- Fix the labels of all tables (remove boldface).
Author Response
The authors would like to express many thanks to the reviewers for their invaluable comments. Based on them, we have revised the previous manuscript as follows.
Point 1: Acronyms should be mentioned with their expansions at their first occurrence. For example: the 1st occurrence of HCI is at line 29.
Response 1: We revised it.
Point 2: The plural form of an acronym has to end with an "s". For example: the acronym for "convolutional neural networks" is CNNs not CNN (see line 51).
Response 2: We revised it.
Point 3: The parts of the text mentioned in lines 64-65 are subsections, not sections.
Response 3: We revised it.
Point 4: Consistency should be improved in using capitalization, especially for names. For example: "convolutional neural networks" is sometimes written in lowercase (line 51), sometimes written in title-case (line 65).
Response 4: We revised it.
Point 5: The title of subsection 2.1 should be: "Sensors Used for Hand Gesture Recognition Interface".
Response 5: We revised it.
Point 6: Use "glove-based" instead of "gloves-based".
Response 6: We revised it.
Point 7: The major drawbacks of vision-based systems are not mentioned: occlusion, resolution, accuracy etc.
Response 7: We revised it.
Point 8: In lines 86-87 the authors say that systems like the MS Kinect do not provide hand gesture recognition. This statement is not true, since most of these systems integrate basic hand gesture recognition (e.g. open/close hand in the MS Kinect v2 SDK).
Response 8: We revised it.
Point 9: Subsection titles should be written in title-case.
Response 9: We revised it.
Point 10: Be consistent in using: "fingertip detection-based" or "fingertip-based".
Response 10: We revised it.
Point 11: The correct acronym for "3D convolutional neural networks" is 3DCNNs, with no space between D and C.
Response 11: We revised it.
Point 12: Be consistent with the name of devices. For example: Microsoft Kinect v2.
Response 12: We revised it.
Point 13: Names of stages/modules should be mentioned using title-case. For example: write "Hand Region Extraction" instead of "hand region extraction".
Response 13: We revised it.
Point 14: Use punctuation in all figure captions.
Response 14: We revised it.
Point 15: All figures and diagrams should be centered (alignment).
Response 15: We revised it.
Point 16: In Figure 2, use title-case for all blocks (see Skeleton Information block).
Response 16: We revised it.
Point 17: Figure 5 does not include any part (a) or (b), but these parts are referenced in the text (see lines 143 and 169).
Author Response File: Author Response.pdf