1. Introduction
The evolution of the Internet of Things (IoT) has made intelligent devices more widely available, offering new possibilities to facilitate people's lives [1,2]. In aging societies, one focus of the smart infrastructure field is assisting the elderly and the disabled with advanced IoT devices. There is therefore a strong demand for robots that can tackle problems resulting from the aging population, such as the shortage of caregivers for nursing and accompanying the elderly, and this demand is promoting the development of nursing-care assistant robots [3]. Providing more natural and intelligent interaction modes [4,5] with nursing-care assistant robots [6] is one of the frontiers of smart infrastructure development [7]. Hand gesture recognition offers an appropriate way [8] to capture people's intentions for the control of smart devices, and some progress has been made in previous studies. Elderly people can express their intentions by making corresponding gestures and thereby remotely control various smart devices [9] at home [10]. Therefore, natural human–robot interaction based on hand gestures [11] is likely to become a popular research topic in the near future.
Hand gestures are the most common means of nonverbal communication [12]. Generally speaking, gestures are divided into two types: static gestures and dynamic gestures. The former mainly concern the fingers' flex angles and poses [13,14], while the latter pay more attention to the hand motion trajectory (HMT) [15]. In previous studies, the sensors used for these two types of gesture recognition fell mainly into two categories: image-based sensors [16] and non-image-based sensors [17]. Most previous studies of static hand gesture recognition have used non-image-based sensors (integrated into wearable gloves and bands [14]), while studies of HMT gesture recognition have relied on fixed image-based sensors (for example, the depth information provided by Kinect [18]). Inertial sensors are commonly applied in non-image-based HMT gesture recognition. Xu et al. and Xie et al. used accelerometers for HMT gesture recognition, reporting mean recognition accuracies of 95.6% and 98.9%, respectively [19,20]. However, their methods were based on acceleration features, which are susceptible to the sensor's posture. Besides, acceleration is not as intuitive as velocity or displacement when representing a trajectory gesture, which might further limit performance on more diverse and complex gestures.
In recent years, significant efforts have been devoted to developing image-based sensors for HMT gesture recognition. Plouffe et al. used the Kinect sensor to recognize static and dynamic hand gestures in real time, achieving an average accuracy of 92.4% [21]. Tang et al. proposed an approach for continuous hand trajectory recognition based on depth data collected by the Kinect2 sensor [22]. In addition, Zhang et al. proposed a novel system for dynamic continuous hand gesture recognition based on a frequency-modulated continuous wave radar sensor [23], which achieved a high recognition rate of 96%. However, these methods for dynamic gesture recognition rely on position-fixed sensors, which limits the spatial flexibility of gesture actions and is not a suitable human–robot interaction [24] mode for the elderly. There has also been work on gesture recognition without spatial position restrictions: Kim et al. recovered the full three-dimensional (3D) pose of the user's hand using a wrist-worn sensor [25]. It should be pointed out that the work of Kim et al. is only effective for static finger gesture recognition and cannot achieve HMT gesture recognition.
In this article, we propose a novel HMT gesture recognition system based on a wearable wrist-worn camera and apply it to intelligent interaction with a nursing-care assistant robot, as shown in Figure 1. To the best of our knowledge, this is the first study of HMT gesture recognition using a wearable wrist-worn camera based on background velocity analysis, which imposes no workspace restrictions. In addition, we propose a reliable method to detect the start/end points of effective HMT gestures for continuous gesture segmentation, achieved by detecting the fist motion and the hand motion velocity. Furthermore, we construct an algorithm framework composed of hand region segmentation, background velocity calculation, continuous gesture segmentation, and gesture type classification. We also design a prototype of the HMT gesture recognition system and carry out experimental verification and results analysis. To further demonstrate the practicability of the proposed system, we design a prototype of a nursing-care assistant robot for aged care at home and define 10 special gestures to interact with it.
The remainder of this article is organized as follows: the architecture of the wearable wrist-worn camera and the new nursing-care assistant robot is described in Section 2. The algorithm framework for gesture recognition and the gesture principle designed for the navigation of the nursing-care assistant robot are presented in Section 3. In Section 4, the experimental process and the evaluation of the proposed gesture recognition system are described, and the application to the interaction with the nursing-care assistant robot is presented. Finally, Section 5 gives the discussion and conclusion of our work.
2. System Architecture
The architecture of the HMT gesture recognition system consists of three parts: data acquisition, data processing, and a natural human–robot interface for smart infrastructure. During data acquisition, the subject wears the wrist-worn camera on the right wrist and performs gestures. The camera records the original video of the background, which reflects the HMT. The data are then transmitted to the host computer through Wi-Fi and processed. During data processing, a series of algorithms recognize the hand region, calculate the velocity of the background by matching Speeded-Up Robust Features (SURF) keypoints, and segment the continuous gestures to obtain the velocity data of each effective gesture. A classification algorithm is then used to recognize the target gesture. The recognition results correspond to predefined control commands, so that various smart IoT devices in smart homes can be controlled remotely by gesture. The highlighted application of this study is the interaction between human and robot: a nursing-care assistant robot is designed for the assistance of elderly people at home, and the completed prototype supports two working modes, the man-in-seat interaction mode and the remote interaction mode, based on the proposed HMT gesture recognition system.
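Purely as an illustration of the background-velocity step described above, the following Python/OpenCV sketch matches SURF keypoints between two consecutive frames and takes the median keypoint displacement as the per-frame background motion. The function name, the Hessian threshold, and the ratio-test constant are illustrative assumptions (the data processing in this study actually runs in MATLAB), and SURF is only available in the opencv-contrib build.

```python
# Sketch only: estimate per-frame background motion from matched SURF keypoints.
# Requires opencv-contrib-python; parameter values are illustrative, not the paper's.
import cv2
import numpy as np

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
matcher = cv2.BFMatcher(cv2.NORM_L2)

def background_velocity(prev_gray, curr_gray, fps=12.0):
    """Median SURF-keypoint displacement (pixels per second) between two frames."""
    kp1, des1 = surf.detectAndCompute(prev_gray, None)
    kp2, des2 = surf.detectAndCompute(curr_gray, None)
    if des1 is None or des2 is None:
        return np.zeros(2)
    # Lowe's ratio test keeps only reliable matches.
    matches = matcher.knnMatch(des1, des2, k=2)
    good = [m[0] for m in matches if len(m) == 2 and m[0].distance < 0.7 * m[1].distance]
    if not good:
        return np.zeros(2)
    disp = np.array([np.subtract(kp2[m.trainIdx].pt, kp1[m.queryIdx].pt) for m in good])
    # The median is robust to a few mismatched keypoints.
    return np.median(disp, axis=0) * fps
```

In the full pipeline, keypoints would presumably be taken only from the background (after masking out the segmented hand region), and the gesture trajectory velocity is then obtained by mapping this background velocity back to the hand motion, as detailed in Section 3.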
2.1. The Wearable Camera Architecture
For the HMT gesture recognition in this study, we designed a new wearable wrist-worn camera. The hardware structure is shown in Figure 2. We first used SolidWorks to design the lightweight and foldable structure, as shown in Figure 2a. The device is designed to be lightweight, unobtrusive, and portable, as a wearable device for the daily use of the elderly [26]. The base, shell, and camera hood are all manufactured by 3D printing of nylon (the printed shell is 1 mm thick), which keeps the device lightweight (the final prototype weighs 114 g). The foldable camera structure gives the device a compact form. The base and an elastic fabric wristband are bonded with hot-melt adhesive (a dispensing gun is used to heat the adhesive and attach the base to the fabric). The foldable structure also keeps the device unobtrusive for users. A drawer-type structure between the shell and the base allows easy disassembly and installation, satisfying the requirement of portability.
The selected image-based sensor is the Raspberry Pi Camera Module, a CMOS (Complementary Metal Oxide Semiconductor) 175-degree wide-angle camera that is directly compatible with the Raspberry Pi and has a resolution of five million pixels (2592 × 1944 pixels). As the control unit, the Raspberry Pi Zero W integrates a 1-GHz single-core central processing unit (CPU) and 512 MB of RAM, with support for 802.11 b/g/n wireless LAN connectivity. The module is suitable for prototype development and verification of smart infrastructure under Internet of Things technology due to its small size (65 mm × 30 mm × 5 mm) and wireless transmission capability. In addition, compared with other controller modules such as Arduino, its higher clock frequency makes it better suited to fast image acquisition and processing. To keep the integrated design small, the camera module and the Raspberry Pi are connected by a flexible flat cable (FFC). According to the power supply requirement and the size limitation of the integrated design, two rechargeable lithium batteries with a rated voltage of 3.7 V are connected in parallel, giving a total capacity of 2000 mAh. A boost converter provides the 5 V output required by the Raspberry Pi and the camera module. The integrated implementation of the wearable wrist-worn camera and the prototype are shown in Figure 2b.
To collect the original video data, the Raspberry Pi runs a Python script on the Raspbian system (a Debian GNU/Linux-based system for Raspberry Pi hardware) that first establishes a TCP (Transmission Control Protocol) server and then accepts a connection from the TCP client on the computer. After the connection is established, the Raspberry Pi collects video data from the camera module and transmits it to the computer at a resolution of 320 × 240 and a frame rate of 12 frames per second (FPS). These parameters, chosen through experimental optimization, reduce packet loss and satisfy the data processing requirements of the subsequent algorithms. The algorithmic details of data processing are introduced in Section 3.
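A minimal sketch of the acquisition side is given below, assuming the standard picamera library and an arbitrary port number; it is not the prototype's actual script.

```python
# Minimal sketch of the Raspberry Pi acquisition script (assumes the picamera
# library and an arbitrary port 8000; hostnames, ports, and timing are illustrative).
import socket
import picamera

with picamera.PiCamera(resolution=(320, 240), framerate=12) as camera:
    server = socket.socket()
    server.bind(('0.0.0.0', 8000))   # act as TCP server, wait for the host PC
    server.listen(0)
    conn = server.accept()[0].makefile('wb')
    try:
        # Stream H.264-encoded video to the connected client over Wi-Fi.
        camera.start_recording(conn, format='h264')
        camera.wait_recording(600)   # record for up to 10 minutes in this sketch
        camera.stop_recording()
    finally:
        conn.close()
        server.close()
```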
The exact dimensions of the wearable camera are given in Figure 2a. The camera hood carries the camera module and can rotate 0–150 degrees relative to the shell around a rotating shaft. Irregular grooves at the bottom of the base distribute the melt adhesive evenly. The angle θ is set to 80 degrees in the working state of this study, and the hood can be folded flat in the non-working state. Tests showed a working current of 0.21 A during video transmission and 0.11 A in the booted state without data transmission. According to the battery capacity and practical tests, the device can work continuously for more than four hours, which fully meets the requirements of household use [27]. Furthermore, the system's power consumption can be reduced further by monitoring a motion velocity threshold that triggers a sleep mode.
2.2. The Nursing-Care Assistant Robot
With the advent of an aging society, many robots, such as mental commitment robots dedicated to mental healing [28] and smart wheelchairs [29], have been proposed to support on-site caregivers. Mukai et al. developed an assistant robot, RIBA, that can lift a human in its arms [30]. These nursing robots are mainly oriented toward hospitals and clinics. In this study, we designed a nursing-care assistant robot for aged care at home that can not only carry a person, similar to a wheelchair, but also grasp target objects. As shown in Figure 3, the mechanical structure of the nursing robot consists of four parts: an omnidirectional mobile chassis, a lift adjusting mechanism, a dual manipulator above the chassis, and a seat at the front. The YuMi collaborative robot produced by Asea Brown Boveri Ltd. (ABB) was chosen as the dual manipulator. The frame of the remaining structure was made entirely of aluminum profiles. All of the mechanical structures of this cooperative robot were designed in SolidWorks, and the structural stability was checked to ensure reliability and safety in the household environment. Limited by the length of this article, the details of the check are provided in the supplementary material, Check S1. The dimensions of the robot are shown in Figure 3, including the maximum longitudinal lift of 550 mm.
The user can sit on the seat at the front of the nursing cooperative robot, and the dual manipulator performs the corresponding nursing actions behind the user, such as helping the user get up from the seat, fetching a target object, and so on. In detail, the robot can move its dual arms to a suitable position to provide support points, like chair arms, for the elderly user to get up from the seat. An electric lift adjusting mechanism is installed between the dual manipulator and the mobile chassis, so that the height of the manipulator relative to the user can be adjusted to suit users with different body shapes and to guarantee the workspace needed for different manipulator movements. Similarly, a sliding rail mechanism between the seat and the mobile chassis adjusts the relative distance between the user and the manipulator, which ensures the user's comfort and broad applicability to different people.
As shown in Figure 3, the communication architecture of the nursing-care assistant robot mainly concerns the communication between the omnidirectional mobile chassis and the dual manipulator. In this design, the STM32F405RGT6 provided by STMicroelectronics is selected as the microcontroller unit (MCU) on the main control board. Motion control of the dual manipulator is based on ABB's IRC5 controller. Each manipulator consists of a mechanical arm and a gripper and communicates independently with the main control board via Ethernet. The omnidirectional mobile chassis is driven by four servo motors, which are controlled by the corresponding servo motor controllers. Controller Area Network (CAN-bus) communication is used between the mobile chassis servo controllers and the MCU. The proposed nursing-care assistant robot aims to assist and nurse the elderly at home, and two working modes are designed to meet these nursing needs: the man-in-seat interaction mode and the remote interaction mode. The man-in-seat interaction mode refers to near-field control: when the user sits on the robot's seat, this mode helps the elderly user move to a designated destination and fetch necessary objects such as medicines. The remote interaction mode refers to interacting with the robot remotely, with the aim of assisting the elderly in picking up distant objects.
4. Experiments and Results
4.1. Dataset
Before the experiment, approval from the Ethics Committee of the 117th Hospital of the People's Liberation Army of China (PLA) was obtained, and all of the subjects signed a consent form. In this study, the performance of the HMT gesture recognition system was tested and verified with five subjects in two postures (sitting and standing); we then conducted the application on the nursing-care assistant robot by controlling the robot's movement with the predefined 10 gestures using the wrist-worn camera.
To verify the performance of the gesture recognition system, five subjects (four males and one female), aged 20–30 and in good physical health, were included in the experiment. Each subject performed the predefined 10 gestures continuously in one round as a gesture combination, and repeated the combination 10 times in the sitting condition and 10 times in the standing condition. In other words, 200 gesture samples were collected from each subject; 500 gesture samples were obtained in each condition, for a total of 1000 gesture samples. During the experiment, the participant put the wearable wrist-worn camera on their right wrist, and the camera was positioned to ensure a suitable proportion of the hand area in the image. The participant then performed the predefined 10 HMT gestures one by one, following the predefined rules: the participant first performed the fist-bobbing motion, then made a single effective HMT gesture, and after finishing it remained motionless for more than half a second. The subjects were guided by these rules and performed the gesture combination once as pretraining, which was not included in the experimental samples to be analyzed. After all of the HMT gestures were finished, the original video data were collected and processed to extract the velocity data of each effective single HMT gesture. A dynamic time warping (DTW) algorithm was then used to measure the similarity between gestures, and three different cross-validation methods were used to classify them; the recognition accuracies obtained with the different cross-validation methods were used for performance verification. After the performance-verification experiments, we applied the HMT gesture recognition system to interact with the nursing-care assistant robot: the participant wore the camera and controlled the robot in its man-in-seat interaction mode and remote interaction mode, respectively, using the predefined 10 gestures. The difference in this application is that the HMT gesture recognition runs in real time, based on a specified training set chosen with a representative cross-validation method. The data processing and algorithm verification in this experiment were all carried out in MATLAB 2018a on a computer with 8 GB of memory and a 2.6 GHz CPU.
4.2. Results of the HMT Gesture Recognition
4.2.1. Results of the Continuous Gesture Segmentation
According to the segmentation algorithm for continuous gestures defined in Section 3, the 10 consecutive gestures in every group were segmented to extract the effective single gestures. In the experiment, 94 of the 100 sets of consecutive gestures from the five subjects were correctly segmented into the 10 corresponding gestures, and the start and end points of the segmentation were in line with the expectations of the algorithm. Among the six groups with incorrect segmentation, three were caused by missing start or end points due to non-standardized gestures (either a too-small fist-bobbing motion before the single effective gesture, or the absence of a motionless state after finishing it). The remaining three groups of segmentation errors were caused by fist-bobbing motion occurring during an effective HMT gesture, which led to an extra, incorrect segmentation. In total, 992 of the 1000 gestures were segmented correctly, giving a segmentation accuracy of 99.20%. The effective gestures that were not correctly segmented were processed manually by specifying their start and end points.
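The segmentation rule itself is defined in Section 3; purely to illustrate the velocity-based half of that rule, the sketch below marks a gesture as ended once the hand speed stays below a threshold for more than half a second after a detected fist-bobbing event. The threshold value, the fist-bobbing frame indices, and the function name are placeholders, not the values used in this study.

```python
import numpy as np

def segment_gestures(speed, bob_frames, fps=12, v_thresh=5.0, still_sec=0.5):
    """Illustrative start/end detection: a gesture starts right after a detected
    fist-bobbing event and ends once speed stays below v_thresh for still_sec.
    speed: per-frame speed magnitude; bob_frames: frame indices of fist bobbing."""
    still_len = int(still_sec * fps)
    segments = []
    for start in bob_frames:
        end = None
        for t in range(start + 1, len(speed) - still_len):
            if np.all(speed[t:t + still_len] < v_thresh):
                end = t
                break
        if end is not None:
            segments.append((start, end))
    return segments
```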
4.2.2. Results of the Background Velocity
According to the above results of continuous gesture segmentation and the calculation of background velocity based on matching the SURF keypoints, a group of velocity curves for the 10 predefined gestures is shown in Figure 9. The velocity curves of Vx and Vy are mapped from the background velocity to the gesture trajectory velocity. As seen from the figure, the simple movements (gestures 1 to 4) take less time to make, with an effective duration of about one second, while the complex movements (gestures 5 to 10) last about three seconds, roughly two to three times as long. To reduce the influence of the different gesture durations on classification, all of the hand gesture data were normalized by the method described in Section 3.
4.2.3. Results of the HMT Gesture Recognition
After obtaining the effective velocity data of the corresponding gestures, the 1000 gestures were classified with three different cross-validation methods, based on the distances between the velocity data of the gestures calculated by the DTW algorithm. To meet the data requirements of the DTW algorithm, the original velocity data were normalized and resampled to 30 sampling points by linear interpolation before classification.
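As a sketch of this preprocessing and of the DTW distance on which the classification relies (the actual implementation is in MATLAB, and the normalization follows Section 3, which is not reproduced here), one might write:

```python
import numpy as np

def resample(seq, n_points=30):
    """Linear interpolation of a (T, 2) velocity sequence to n_points samples."""
    seq = np.asarray(seq, dtype=float)
    t_old = np.linspace(0.0, 1.0, len(seq))
    t_new = np.linspace(0.0, 1.0, n_points)
    return np.column_stack([np.interp(t_new, t_old, seq[:, d]) for d in range(seq.shape[1])])

def dtw_distance(a, b):
    """Classic dynamic-programming DTW with a Euclidean local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```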
The three cross-validation methods are as follows: (1) leave-one-subject-out (LOSO) cross-validation selects the sample data of one subject as the test set and the sample data of the other subjects as the training set; (2) leave-other-subject-out (LPO) cross-validation selects the sample data of one subject as the training set and the sample data of the other subjects as the test set; (3) leave-one-group-out within one subject (LOOWS) cross-validation selects one group of samples from one subject as the test set and the other groups of samples from the same subject as the training set. The type of each test gesture was determined according to the shortest DTW distance to the training set, and the type most frequently assigned to the test gesture was taken as the recognition result.
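A sketch of the LOSO protocol with a nearest-template decision is shown below; it assumes each sample is a resampled velocity sequence tagged with a gesture label and a subject ID, and it takes the sequence distance (e.g., the dtw_distance sketched above) as an argument. Note that it uses a plain nearest-neighbor decision, whereas the procedure described above additionally takes the most frequently assigned type as the final result.

```python
def loso_accuracy(samples, dist):
    """samples: list of (subject_id, label, sequence); dist: a sequence distance,
    e.g. the dtw_distance sketched above. Each held-out subject's samples are
    classified by the label of the nearest training sample."""
    subjects = {s for s, _, _ in samples}
    correct, total = 0, 0
    for held_out in sorted(subjects):
        train = [(lab, seq) for s, lab, seq in samples if s != held_out]
        test = [(lab, seq) for s, lab, seq in samples if s == held_out]
        for true_lab, seq in test:
            pred_lab = min(train, key=lambda pair: dist(seq, pair[1]))[0]
            correct += int(pred_lab == true_lab)
            total += 1
    return correct / total
```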
Finally, the accuracies of gesture recognition using the different cross-validation methods are shown in Table 4. The mean recognition accuracy with LOSO cross-validation reached 97.34%, which verifies the system's performance on a previously unseen subject. LPO cross-validation achieved a mean accuracy of 96.55%, lower than LOSO, which reflects the variation between subjects and the diversity of our data. The LOOWS method achieved a mean accuracy of 98%, higher than LOSO, which indicates that users can easily add their own characteristic gestures to the system and have them recognized efficiently. As shown in Table 4, the HMT gesture recognition accuracies in the standing condition are slightly higher than those in the sitting condition. Apart from random influences of the external environment, the gesture features in the standing condition are more evident than those in the sitting condition, since the action space while standing is much larger than while sitting.
To match the two interaction modes of the nursing-care assistant robot, the gesture velocity data collected in the standing and sitting conditions were analyzed separately, based on LOSO cross-validation. The confusion matrices of the recognition results under the two conditions are shown in Figure 10. In addition, the precision, recall, and F-measure of the 10-gesture classification were calculated from the confusion matrices to further verify the recognition performance, as shown in Figure 11. The mean F-measure is 0.984 under the sitting condition and 0.963 under the standing condition; the higher F-measure in the sitting condition indicates better classification performance than in the standing condition.
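For reference, the per-gesture metrics reported in Figure 11 follow the usual definitions; the sketch below computes them from a confusion matrix whose rows are true gestures and columns are predicted gestures (this row/column convention is an assumption).

```python
import numpy as np

def per_class_metrics(cm):
    """cm[i, j]: count of gestures of true class i predicted as class j.
    Returns per-class precision, recall, and F-measure."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / np.maximum(cm.sum(axis=0), 1e-12)
    recall = tp / np.maximum(cm.sum(axis=1), 1e-12)
    f_measure = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision, recall, f_measure
```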
The overall recognition results for the sitting and standing conditions based on LOSO cross-validation are shown in the confusion matrices in Figure 10. Among the confused gestures, we analyzed the causes of the most frequent misclassifications. Gesture 7 was misclassified as gesture 4 in 4.67% of cases under the sitting condition and 3.11% under the standing condition because the horizontal motion was missed; the missed detection was caused by an incorrect gesture end point due to unexpected pauses within a single gesture period. For similar reasons, gesture 8 was misclassified as gesture 4 in 4.67% of cases, and gesture 10 was misclassified as gesture 2 in 2.67% of cases. In another situation, gesture 8 was misclassified as gesture 2 in 2.67% of cases, which was caused by an insufficiently distinct longitudinal motion in gesture 8; for example, some subjects started the gesture immediately, before the fist-bobbing motion was completed, resulting in insufficient longitudinal motion.
4.3. Interaction with Nursing-Care Assistant Robot
HMT gesture recognition was applied to intelligent interaction with the nursing-care assistant robot at home. Based on the control system and interaction modes of the nursing-care assistant robot proposed in Section 2, we extended the HMT gesture recognition system into a real-time system, which is more efficient for the interaction application with the robot. Different from the algorithm framework used for the verification in Section 3, the real-time HMT gesture recognition system uses specified training templates based on the LOSO cross-validation, which is more practical and faster during the interaction process.
The interaction application with the nursing-care assistant robot was carried out under the robot's two interaction modes. As shown in Figure 12a, under the man-in-seat interaction mode, the user wore the wearable camera on his right wrist, sat on the seat at the front of the nursing-care assistant robot, and then made the corresponding predefined gestures to guide the robot. Because the robot itself moves while the user interacts with it in the man-in-seat interaction mode, an additional background velocity is introduced, so in theory there is a deviation between the experimental results obtained in the sitting condition and the actual control interaction. However, the subsequent application showed that this additional velocity hardly affected the recognition accuracy, which illustrates the robustness of the proposed HMT gesture recognition system. The HMT gesture recognition in the standing condition corresponds to the remote interaction mode of the nursing-care assistant robot, shown in Figure 12b. The user remotely guides the robot to a specified location and then switches to the command mode for controlling the dual manipulator by performing the corresponding gesture command; the manipulator is then remotely operated to assist in grabbing the specified objects. During the interaction, the original gesture video collected by the wearable camera is transmitted to the host computer through Wi-Fi for processing and recognition, and the recognition result is sent to the robot's control system as the predefined control command corresponding to the recognized HMT gesture.
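To illustrate this last step, the sketch below maps a recognized gesture index to a placeholder command string and sends it to the robot's control system over a TCP socket. The command names, IP address, and port are hypothetical, since the article does not specify the robot-side command protocol.

```python
import socket

# Hypothetical mapping from recognized gesture index to a control command string;
# the actual command set of the robot controller is not specified in this article.
GESTURE_TO_COMMAND = {1: "MOVE_FORWARD", 2: "MOVE_BACKWARD", 3: "TURN_LEFT",
                      4: "TURN_RIGHT", 5: "STOP"}

def send_command(gesture_id, host="192.168.1.50", port=9000):
    """Send the command for a recognized gesture to the robot's main control board."""
    command = GESTURE_TO_COMMAND.get(gesture_id)
    if command is None:
        return
    with socket.create_connection((host, port), timeout=1.0) as sock:
        sock.sendall((command + "\n").encode("ascii"))
```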
5. Discussion and Conclusions
In this paper, an innovative HMT gesture recognition system based on background velocity features and a wearable wrist-worn camera was proposed and applied to intelligent interaction with a nursing-care assistant robot. Environment image data were collected by the wearable wrist-worn camera during the user's hand motion. The velocity of the HMT gesture was reflected by the background velocity, which was calculated from the displacement of matched SURF points between adjacent frames. In addition, we defined a reliable rule to segment continuous gestures by detecting the fist-bobbing motion and the background velocity, and the gesture segmentation accuracy reached 99.2% over the 1000 effective gestures obtained.
Ten gesture command rules for interacting with the nursing-care assistant robot were defined in this study. More importantly, to evaluate the performance of the proposed HMT gesture recognition system, we collected 1000 effective gestures from five test subjects. Gesture classification and recognition were achieved using three different cross-validation methods based on the DTW algorithm. An average recognition accuracy of 97.34% was achieved with LOSO cross-validation, and the recognition accuracies in the sitting and standing conditions were analyzed and compared. In addition, the application of interaction with the nursing-care assistant robot under the man-in-seat interaction mode and the remote interaction mode was conducted. Furthermore, a demonstration video is provided as supplementary material for clearer illustration.
Although the HMT gesture recognition system proposed in this study significantly improves flexibility and reliability compared with traditional fixed-sensor gesture recognition methods, the current work still has some drawbacks. While the fist-bobbing detection proposed in this study segments the effective gestures reliably, the continuous gesture segmentation is time-consuming because the hand region segmentation is based on the lazy snapping algorithm, which affects the efficiency of the whole pipeline. In future work, the hand region segmentation algorithm can be improved or replaced to reduce the computation time.