Figure 1. Recently developed systems for controlling a UAV in different ways.
Figure 2. Human–UAV interaction involves more than flying a UAV, as in (a,b); in the case of UAV rescue (c), the UAV needs to understand the actions of the person who needs to be rescued.
Figure 3. Gesture/action recognition is an intuitive way of human–robot interaction in many tasks.
Figure 4. Conventional action or gesture recognition algorithms are difficult to apply directly to human–robot interaction. (a) Action recognition result of AlphaPose [28]; (b) gesture recognition result of YOLOv3 [27].
Figure 5. Structure diagram of the system.
Figure 6. Under different working conditions, a UAV should adopt different flight strategies.
Figure 7. The structure of ResNet-101 [29].
Figure 8. Diagram of scene understanding used to automatically adjust the flight strategy under different working conditions.
Figure 9. Diagram of the pilot action detection module.
Figure 10. The real UAV pilot and his hands are correctly extracted, while other people and background noise are ignored.
Figure 11. Illustration of ST-GCN [19], where the spatial-temporal graph of a skeleton sequence is used to classify the action of a human.
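As a rough illustration of the idea behind ST-GCN [19], the sketch below builds a toy spatial-temporal graph convolution: joint coordinates over time are mixed spatially through a normalized skeleton adjacency matrix and temporally through a 1-D filter along the frame axis. This is only a minimal sketch; the 5-joint skeleton, layer sizes, and random weights are illustrative assumptions, not the trained, partition-based model used in the paper.

```python
import numpy as np

# Toy skeleton: 5 joints (head, chest, left hand, right hand, hip); edges are bones.
# Joints, edges, and all layer sizes below are illustrative assumptions.
V, T, C_in, C_out = 5, 30, 3, 16          # joints, frames, input/output channels
edges = [(0, 1), (1, 2), (1, 3), (1, 4)]

# Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2.
A = np.eye(V)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_norm = D_inv_sqrt @ A @ D_inv_sqrt

def st_gcn_block(x, W, t_kernel=9):
    """One simplified spatial-temporal graph convolution block.
    x: (T, V, C_in) skeleton sequence; W: (C_in, C_out) spatial weights."""
    # Spatial graph convolution: mix joints via A_norm, then mix channels via W.
    h = np.einsum("uv,tvc,cd->tud", A_norm, x, W)
    # Temporal convolution: simple averaging filter over the time axis per joint/channel.
    pad = t_kernel // 2
    h_pad = np.pad(h, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    h = np.stack([h_pad[t:t + t_kernel].mean(axis=0) for t in range(T)])
    return np.maximum(h, 0.0)  # ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=(T, V, C_in))                 # a fake skeleton sequence
W = rng.normal(size=(C_in, C_out)) * 0.1
features = st_gcn_block(x, W)                     # (T, V, C_out)
logits = features.mean(axis=(0, 1)) @ rng.normal(size=(C_out, 6))  # 6 action classes
print("predicted action class:", int(np.argmax(logits)))
```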
Figure 12. Gesture recognition module.
Figure 13. Description of the features.
Figure 14. Some gestures and actions are quite similar to each other. In (a), the two gestures are quite similar; in (b), among five successive images, the human action in four of them looks almost the same.
Figure 15. Confusion ring sequence of actions and gestures. Gestures: Rock, Vertical palm, Vertical knife, Paper, Horizontal palm, Horizontal knife. Actions: Waving left hand, Hands up, Draw circles in front of chest, Swing hand up and down, Waving right hand, Draw circles over head.
Figure 16. Sparse coding of gesture-action combinations (the orange blocks are selected into our codebook).
Figure 17. Illustration of the cross-validation process. (a) The probability distributions of the action and gesture recognition results; (b) how our cross-validation process corrects the errors in (a). Here, the yellow points represent wrong coding combinations that occur when either the action or the gesture recognition result is wrong; black points refer to the correct coding that should be used to control the UAV.
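To make the idea of Figures 16 and 17 concrete, here is a minimal sketch (under stated assumptions, not the paper's exact procedure) of cross-validating the two recognizers against a sparse codebook: only gesture-action combinations present in the codebook are considered, and the valid combination with the highest joint probability is selected, so a wrong top-1 result from one modality can be corrected by the other. The class names, codebook entries, and probabilities are made-up examples.

```python
# Hedged sketch: cross-validating action and gesture probabilities against a
# sparse codebook of valid (action, gesture) combinations. Names and values are
# illustrative assumptions, not the actual codebook of the paper.
ACTIONS = ["wave_left", "hands_up", "draw_circle_chest", "swing_up_down"]
GESTURES = ["rock", "paper", "vertical_palm", "horizontal_knife"]

# Sparse codebook: only these combinations map to UAV commands (cf. Figure 16 / Table 2).
CODEBOOK = {
    ("wave_left", "rock"): "fly_left",
    ("wave_left", "paper"): "fly_right",
    ("hands_up", "vertical_palm"): "fly_upward",
    ("draw_circle_chest", "horizontal_knife"): "hover",
}

def cross_validate(p_action, p_gesture):
    """Pick the valid codebook entry with the highest joint probability."""
    best, best_score = None, -1.0
    for (act, ges), command in CODEBOOK.items():
        score = p_action[act] * p_gesture[ges]   # independence assumption
        if score > best_score:
            best, best_score = command, score
    return best, best_score

# Example: the gesture recognizer's top-1 ("rock") is wrong, but combining it with
# the action probabilities steers the decision back to a valid, more likely coding.
p_action = {"wave_left": 0.05, "hands_up": 0.80, "draw_circle_chest": 0.10, "swing_up_down": 0.05}
p_gesture = {"rock": 0.40, "paper": 0.05, "vertical_palm": 0.35, "horizontal_knife": 0.20}
print(cross_validate(p_action, p_gesture))   # -> ('fly_upward', ~0.28)
```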
Figure 18. Virtual fence to restrict the flight area.
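As an illustration of how a virtual fence can restrict the flight area, the sketch below clamps each commanded target position to an axis-aligned box before it is sent to the flight controller. The fence geometry, limits, and function names are assumptions for illustration, not the parameters used in the paper.

```python
from dataclasses import dataclass

@dataclass
class VirtualFence:
    """Axis-aligned virtual fence; the limits (in metres) are illustrative assumptions."""
    x_min: float = -20.0
    x_max: float = 20.0
    y_min: float = -20.0
    y_max: float = 20.0
    z_min: float = 0.0
    z_max: float = 10.0   # cf. the per-scene maximum height in Table 1

    def contains(self, x: float, y: float, z: float) -> bool:
        return (self.x_min <= x <= self.x_max and
                self.y_min <= y <= self.y_max and
                self.z_min <= z <= self.z_max)

    def clamp(self, x: float, y: float, z: float) -> tuple:
        """Project a commanded position back inside the fence."""
        return (min(max(x, self.x_min), self.x_max),
                min(max(y, self.y_min), self.y_max),
                min(max(z, self.z_min), self.z_max))

fence = VirtualFence()
target = (25.0, 3.0, 12.0)                 # command that would leave the fence
if not fence.contains(*target):
    target = fence.clamp(*target)          # -> (20.0, 3.0, 10.0)
print("safe target:", target)
```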
Figure 19. Hardware of our system.
Figure 20. The RPC analysis results based on the NYU hand gesture dataset [47].
Figure 21. The RPC analysis results based on our own dataset.
Figure 22. Scene understanding results of ResNet-101 [29] on real aerial images.
Figure 23. Simulation results of our system when multiple people perform similar actions.
Figure 24. Experimental results of controlling the flight of a real UAV with the proposed system. (a) In a real urban scene; (b) under strong illumination conditions.
Figure 25. Examples of false alarms arising in all compared algorithms on our test dataset. The yellow rectangles denote correct recognitions; the red ones denote false alarms (solid line) or missed detections (dotted line).
Figure 26. Noise in the test dataset may degrade the performance of gesture recognition algorithms. (a,b) The original depth image and the corresponding binarized image from the NYU dataset; (c,d) the input image and the segmented hand region from our dataset.
Figure 27. Complex background conditions of our dataset. (a–d) include indoor, outdoor, multi-person, and cluttered-background scenes.
Table 1. UAV flight strategies corresponding to the scene understanding results in our research.
| Scene Label | Top Speed (m/s) | Acceleration (m/s²) | Maximum Height (m) | Possible to Land |
|---|---|---|---|---|
| indoor | 1 | 0.25 | 2 | Yes |
| buildings | 2 | 0.50 | 10 | Yes |
| woods | 3 | 1.00 | 20 | Yes |
| factory | 2 | 0.75 | 20 | Yes |
| square | 4 | 1.00 | 50 | Yes |
| water surface | 5 | 1.50 | 50 | No |
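A minimal sketch of how the scene label predicted by the scene understanding module could be mapped to the flight strategy of Table 1. The data class simply encodes the table; the function and class names are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FlightStrategy:
    top_speed: float      # m/s
    acceleration: float   # m/s^2
    max_height: float     # m
    can_land: bool

# Direct encoding of Table 1; one entry per scene label.
STRATEGIES = {
    "indoor":        FlightStrategy(1.0, 0.25, 2.0,  True),
    "buildings":     FlightStrategy(2.0, 0.50, 10.0, True),
    "woods":         FlightStrategy(3.0, 1.00, 20.0, True),
    "factory":       FlightStrategy(2.0, 0.75, 20.0, True),
    "square":        FlightStrategy(4.0, 1.00, 50.0, True),
    "water surface": FlightStrategy(5.0, 1.50, 50.0, False),
}

def select_strategy(scene_label: str) -> FlightStrategy:
    """Return the flight strategy for a recognized scene; fall back to the most
    conservative ('indoor') limits if the label is unknown (an assumed default)."""
    return STRATEGIES.get(scene_label, STRATEGIES["indoor"])

print(select_strategy("woods"))
# FlightStrategy(top_speed=3.0, acceleration=1.0, max_height=20.0, can_land=True)
```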
Table 2. UAV flight actions obtained through the cross-validation process.
| Hybrid Coding | UAV Corresponding Action | Hybrid Coding | UAV Corresponding Action |
|---|---|---|---|
| | Fly towards left | | Fly towards right |
| | Rotate left | | Rotate right |
| | Fly upward | | Fly down |
| | Fly forward | | Fly backward |
| | Hover | | Circling |
| | Draw square | | Draw S |
| | Speed up | | Speed down |
| | (Undefined) | | (Undefined) |
| | (Undefined) | | (Undefined) |
Table 3. Comparative experiments on the NTU dataset (57 k training samples and 5 k test samples).
| Algorithm | Accuracy Rate |
|---|---|
| ST-LSTM + TS [41] | 69.2% |
| Temporal Conv [42] | 74.3% |
| ST-GCN [19] | 82.5% |
Table 4. Comparative experiments on our 6-class action dataset (5.2 k training samples and 600 test samples).
| Algorithm | Accuracy Rate |
|---|---|
| AlphaPose [28] + ST-LSTM + TS [41] | 89.2% |
| AlphaPose [28] + Temporal Conv [42] | 93.5% |
| AlphaPose [28] + ST-GCN [19] | 95.8% |
| OpenPose [21] + ST-LSTM + TS [41] | 88.8% |
| OpenPose [21] + Temporal Conv [42] | 94.2% |
| OpenPose [21] + ST-GCN [19] | 95.5% |
| P-CNN [24] | 79.5% |
Table 5. Recall rates of the compared algorithms on the NYU test data when precision is set to 0.98 ("P" = gesture Paper, "HP" = Horizontal Palm, "HK" = Horizontal Knife, "VP" = Vertical Palm, "VK" = Vertical Knife, "R" = Rock).
| Algorithm | P | HP | HK | VP | VK | R |
|---|---|---|---|---|---|---|
| FPN [44] | 0.71 | 0.65 | 0.62 | 0.66 | 0.70 | 0.75 |
| RefineDet [45] | 0.78 | 0.88 | 0.75 | 0.78 | 0.79 | 0.74 |
| Faster RCNN [31] | 0.62 | 0.58 | 0.65 | 0.55 | 0.65 | 0.63 |
| Handpose [46] + ResNet-101 [29] | 0.75 | 0.84 | 0.67 | 0.72 | 0.69 | 0.66 |
| YOLOv3 [27] | 0.84 | 0.86 | 0.89 | 0.82 | 0.85 | 0.80 |
| Our algorithm | 0.85 | 0.83 | 0.88 | 0.92 | 0.86 | 0.88 |
Table 6. Recall rates of the compared algorithms when precision is 0.98 (trained and tested on our dataset).
| Algorithm | P | HP | HK | VP | VK | R |
|---|---|---|---|---|---|---|
| FPN [44] | 0.78 | 0.77 | 0.80 | 0.74 | 0.82 | 0.77 |
| RefineDet [45] | 0.78 | 0.81 | 0.79 | 0.83 | 0.79 | 0.85 |
| Faster RCNN [31] | 0.60 | 0.61 | 0.58 | 0.69 | 0.65 | 0.57 |
| Handpose [46] + ResNet-101 [29] | 0.71 | 0.75 | 0.68 | 0.69 | 0.79 | 0.71 |
| YOLOv3 [27] | 0.81 | 0.85 | 0.76 | 0.83 | 0.82 | 0.86 |
| Our algorithm | 0.97 | 0.97 | 0.94 | 0.98 | 0.95 | 0.96 |
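For reference, a short sketch of how the operating point used in Tables 5 and 6 can be computed: the detection threshold is swept along the recall-precision curve and the reported recall is the best one whose precision still reaches 0.98. This is a generic sketch with dummy scores, not the evaluation code of the paper.

```python
import numpy as np

def recall_at_precision(scores, labels, target_precision=0.98):
    """Sweep the score threshold and return the highest recall whose precision
    is at least `target_precision`. `labels` are 1 for true gestures, 0 otherwise."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    n_pos = labels.sum()
    best_recall = 0.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = int(np.sum(pred & (labels == 1)))
        fp = int(np.sum(pred & (labels == 0)))
        if tp + fp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / n_pos
        if precision >= target_precision:
            best_recall = max(best_recall, recall)
    return best_recall

# Dummy detector outputs: higher score = more confident the gesture is present;
# positives tend to score higher but the two classes overlap.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=500)
scores = 0.35 * labels + 0.65 * rng.random(500)
print("recall @ precision 0.98:", round(recall_at_precision(scores, labels), 3))
```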
Table 7. Comparative experimental results of scene understanding algorithms on the Place365 dataset (365 classes).
| CNN Model | Accuracy Top-1 | Accuracy Top-5 |
|---|---|---|
| AlexNet [48] | 53.17% | 82.89% |
| Inception-v3 [49] | 53.63% | 83.88% |
| VGG-19 [50] | 55.24% | 84.91% |
| ResNet-101 [29] | 54.74% | 85.08% |
Table 8. Comparative experimental results of scene understanding algorithms when the Place365 dataset is categorized into 6 classes according to Table 1.
| CNN Model | Accuracy Top-1 |
|---|---|
| AlexNet [48] | 93.76% |
| Inception-v3 [49] | 95.28% |
| VGG-19 [50] | 96.41% |
| ResNet-101 [29] | 97.02% |
Table 9. The recognition accuracy of each module in our system when the recall rate is 0.98 (trained and tested on our dataset; the accuracy here refers to the recall rate of Equation (7)).
| Action Accuracy | Gesture Accuracy | Action-Gesture Combination | Action-Gesture Cross-Validation |
|---|---|---|---|
| 95.8% | 96.7% | 92.5% | 99.4% |