1. Introduction
Hand gestures are a critically important form of non-verbal communication. The interpretation of hand gestures using wearable sensors [1,2] or cameras [3,4] aims to transform the movement of the hand into meaningful instructions; this interaction is also known as hand gesture recognition. The field of hand gesture recognition has seen significant improvements over the past few years [5] and, most recently, bundled with the latest advancements in computer vision, has encouraged the development of new technologies to support rehabilitation [6,7], robot control, and home automation [8]. Amongst other techniques, deep learning and computer vision methods have aimed to reach a complex understanding of the dynamic behaviours of hand motion, with the advantage of being more sensitive to learning rapid time-varying features.
Computer vision techniques rely on convolutional neural networks (CNNs) to extract two-dimensional (appearance-based) and three-dimensional (motion-based) array features. CNNs are generally used in image recognition to process pixel data. They take raw pixel data as input, train the designed architecture, and automatically extract features. These models have been divided into static (two-dimensional) and dynamic (three-dimensional) based on the model’s output features. Several investigations [9,10,11] have implemented two-dimensional static appearance-based hand gesture recognition models (also known as two-dimensional CNN models), intending to develop a computationally inexpensive classifier to extract stable shapes of the human hand. However, these models do not consider the spatio-temporal parameters that occur from sequential frames of a video recording, and appearance alone cannot accurately identify the gesture signature [12]. Therefore, new approaches, known as three-dimensional dynamic hand gesture recognition, have emerged to fill this gap.
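As a minimal illustration of this distinction (not part of the study's code), the following sketch contrasts the input dimensionality of an appearance-based two-dimensional convolution with a spatio-temporal three-dimensional convolution; the tensor shapes and filter counts are arbitrary examples.

```python
# Illustrative only: the input dimensionality of appearance-based (2D) and
# spatio-temporal (3D) convolutions. Shapes and filter counts are arbitrary.
import tensorflow as tf

# A 2D convolution processes a single frame: (height, width, channels).
frame = tf.random.normal([1, 64, 64, 3])              # one RGB frame
conv2d = tf.keras.layers.Conv2D(8, kernel_size=3, padding="same")
appearance_features = conv2d(frame)                   # -> (1, 64, 64, 8)

# A 3D convolution processes a clip: (frames, height, width, channels),
# so its kernel also slides along the time axis.
clip = tf.random.normal([1, 32, 64, 64, 3])           # a 32-frame RGB clip
conv3d = tf.keras.layers.Conv3D(8, kernel_size=3, padding="same")
spatio_temporal_features = conv3d(clip)               # -> (1, 32, 64, 64, 8)

print(appearance_features.shape, spatio_temporal_features.shape)
```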
Three-dimensional dynamic hand gesture recognition models also rely on CNNs, act similarly to conventional two-dimensional CNNs, and have spatio-temporal filters. Since their introduction in 2015 [13], these models have been primarily embraced for hand gesture recognition [13,14,15], presenting excellent characteristics in recognizing hand actions from both appearance and spatio-temporal features. However, they require more parameters than two-dimensional CNNs, meaning that vast datasets are needed and making them more challenging to train [16]. Furthermore, these approaches have additional drawbacks that include cost, the logistical challenges of dealing with complex and lengthy datasets, and the requisite quality of captured images needed for appropriate training. To overcome these drawbacks, previous research has leveraged a technique known as transfer learning [17].
Transfer learning is a methodology in which an architecture is implemented and trained on a generic activity and is then adapted for a different but related activity (Figure 1). This technique is often employed to tackle the issue of a deficiency of training data [18,19]. The usual objective of transfer learning techniques is to learn visual features from the initial assignment [19]; the technique then allows a subsequent, related task to be learned from fewer data samples. Transfer learning is adopted when the novel dataset is smaller than the dataset used to train the pre-trained architecture.
Another hurdle in dynamic gesture recognition for three-dimensional CNNs is recognizing specific actions when dealing with continuous video streams [20]. Identifying human activities within video sequences is difficult because of the vast irregularity of hand actions on a time scale, the unclear number of frames, and the variable distribution and boundaries of gesture signatures [21,22,23]. Furthermore, hand motions are often intricate and articulated and, when performed in an uncontrolled environment, can lead to occlusions that limit tracking. However, the ability to track and segment hand gestures in the real world can meet the need to apply these models to more realistic and generalizable tasks.
Manual segmentation of continuous video recordings is the most widely adopted technique when training hand gesture recognition models [24]. However, the process is lengthy, and often a large proportion of frames is left unlabeled, causing indexing issues in the training of novel classification methods. The ability to automatically detect actions in video recordings is essential for applications that require end-to-end process automation. However, while much work has been produced on increasing the accuracy of hand gesture recognition models and enhancing the robustness of these approaches [3,5,25], only a few attempts have been presented for temporal segmentation [26,27].
Attempts at temporal segmentation have focused on motion trajectory [28] and skeletal tracking [29] from depth cameras. However, these systems were sensitive to image backgrounds and lighting conditions. A different approach, presented by Camgoz et al., suggested windowing the continuous video stream for segmentation [30]. However, the length of the sliding volume was fixed, often cutting out part of the critical features of the gestures. Moreover, appearance and hand motion information complement a temporal segmentation classifier [27]; however, Camgoz et al. used only time-series data detected from hand motion, with no appearance information [30]. Kuehne et al. [31] proposed an end-to-end generative framework that uses hidden Markov modelling for video segmentation and recognition of human activities. This has the drawback of an intensive processing time, reducing the ease of applying the approach in real time. Ni et al. [32] presented an approach based on recurrent neural networks (RNNs) to perform sliding-window detection and temporally segment continuous actions. The issue with this methodology is that it identifies only peripheral boundaries, with no global overview of the temporal events.
To overcome these disadvantages, recent approaches have suggested distinguishing between gestural frames, when the action is taking place, and translation frames, by merging both shape and spatio-temporal parameters. Such an approach was presented by Wang [27], whose segmentation method contained both action and appearance-based information and used both RGB and depth capture modalities, driven by a dual architecture for hand gesture classification and segmentation. This approach requires dual-modality acquisition, which does not leverage the ubiquity of standard monocular RGB cameras. Similarly, most recently, Sahoo et al. [33] presented an end-to-end fine-tuning method using a pre-trained CNN for hand gesture recognition; however, their model was also driven by dual modalities and multiple architectures.
Increasingly, enormous datasets of human movement are publicly available, as researchers seek to pool resources and work more openly. The 20BN Jester dataset is a state-of-the-art dataset and the largest collection of human hand gestures captured from monocular RGB cameras. It contains a total of 148,092 videos corresponding to 5,331,312 frames [15]. Each video is, on average, three seconds long, and the dataset contains a total of 27 classes.
This paper aims to present a novel pipeline based on the training of a CNN using a small set of data for the development of a narrow architecture that can run efficiently during continuous video recordings of hand gestures to effectively recognize different gesture interactions. The key contributions of this paper include:
- (a)
The implementation and testing of a novel pipeline that leverages a three-dimensional CNN model combined with a long short-term memory (LSTM) unit to reliably classify and temporally segment continuous video recordings. This novel pipeline enables improved accuracy compared with previously presented methodologies.
- (b)
The introduction of a model trained on a large-scale dataset and then fine-tuned on a small-scale dataset, which enables generalizability to different types of gestures, participants, and hand shapes.
- (c)
The introduction of a small-scale architecture that lays the foundations for a real-time model capable of executing tasks reliably in real-world scenarios. This paves the way to a broader and optimised application that can be used to automatically detect tasks in different domains.
To deliver these contributions, we proceed as follows. In Section 2, the experimental set-up, data collection, and pre- and post-processing steps implemented for the action recognition detector are explained. Section 3 discusses the experimental results, and Section 4 summarizes the main implications of these findings and addresses future directions. Section 5 concludes the proposed work. The key novelty of the presented methodology (with evaluation) is a temporal segmentation classifier driven by monocular video sequences that outperforms previous investigations in terms of accuracy and enables fine-tuning on a small-scale dataset trained on a single, low-complexity architecture.
2. Materials and Methods
2.1. Experimental Set-Up
Twenty-two volunteers (twelve female, ten male) participated in this experiment. All the participants were healthy, presenting with no hand pathology, no loss in mobility, and no experience of upper limb joint surgery or fracture in the six months preceding the data collection. All participants were informed, both verbally and in writing, of their right to withdraw from the study at any time. Written informed consent was obtained from each participant. The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Imperial College Research Ethics Committee (ICREC) of Imperial College London. Video data were captured using an Oqus RGB camera (Qualisys AB, Göteborg, Sweden) at a 30 Hz frame rate. The entire pipeline adopted in the study is illustrated in Figure 2.
2.2. Data Collection
Ground truth composition is an essential matter in CNN-based design. Given the absence of an available hand gestures dataset suitable for clinical hand applications, a novel recorded hand gesture dataset was introduced. While we acknowledge that many gestures can be performed by one person, to generate the hand gesture dataset, we included more participants to increase the population diversity (e.g., hand shapes, skin color) and the generalizability of the proposed methodology. To enable comparison with other proposed hand gesture models, the accuracy was initially tested for 12 participants. To increase the performance of the model, 10 more participants were subsequently added, for a total of 22 participants.
Participants were asked to record one video sequence during online video meetings. A timed PowerPoint (Microsoft, Redmond, WA, USA) show was used to make the video acquisition consistent, to support participants in the activities to be performed during the recordings, and to inform participants regarding the way to position themselves relative to the device for the recordings.
To perform the hand gestures, participants were asked to use a standard device camera to capture the required hand exercises using any laptop, smartphone, or desktop computer. A standard camera was defined as a camera developed from 2012 onwards that was able to capture video recordings at a rate of thirty frames per second. To assess if the data were captured from an acceptable browser and operating system, participants were asked to check the specifications of their recording system.
The hand activities performed by participants included abduction and adduction, metacarpophalangeal joint flexion, and thumb opposition. Each was performed four times with both the left and right hands. During these exercises, participants were asked to hold static poses for five seconds. Four classes of gestures were defined based on the trials (Figure 3).
The hand gesture sequences were captured from continuous video recordings of 250 s. The continuous video sequences were then manually segmented and labelled. Examples representing the data collected from twelve representative participants are illustrated in Figure 4.
In addition to the captured data, the 20BN Jester dataset acquired by Materzynska et al. [15] was used. The classes of interest in this study, “no gesture”, “abduction and adduction”, “MCP flexion”, and “thumb opposition”, were not present in the Jester dataset. Therefore, out of the 27 classes of the 20BN Jester dataset, five hand activities were considered: counting to five, swiping down, swiping left, thumb up, and thumb down. These activities were selected to include different image frames of isolated digits and of the palm with all the digits, for both the left and right hands.
2.3. Pre-Processing
The captured frames were normalized to ensure that each input to the three-dimensional CNN had the same distribution, and each class had the same number of frames. This was particularly important as, although the timing of the participants’ actions was marked by the PowerPoint presentation, individuals could execute hand gestures at different speeds. Ideally, a three-dimensional CNN input should always be balanced, making the model converge faster. If the input frames were not normalized, the weights could have had different calibrations across features, making the cost function converge ineffectively.
The frame length was set to be equal for all the acquisitions, for which the hand gestures were at the centre of the video [9,34]. Following the structure of the 20BN Jester dataset, normalization was applied to impose a fixed length, set to 32 frames. If the number of frames was higher or lower, a down-sampling or a padding function was applied, respectively, to generate fixed-length videos. Given the sequence of RGB frames $S$, the length of the sequence $L$, and the fixed length $N = 32$, the padding and down-sampling techniques were defined as:

$$S' = \begin{cases} \mathrm{pad}(S, N), & L < N \\ \mathrm{downsample}(S, N), & L > N \\ S, & L = N \end{cases}$$
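The exact padding and down-sampling functions are not reproduced here, so the following is a minimal sketch under stated assumptions: padding repeats the final frame and down-sampling keeps uniformly spaced frames.

```python
# Minimal sketch of the fixed-length normalization step. The paper does not
# specify the exact pad/down-sample functions; here padding repeats the last
# frame and down-sampling keeps uniformly spaced frames (an assumption).
import numpy as np

def normalize_length(frames: np.ndarray, target_len: int = 32) -> np.ndarray:
    """frames: array of shape (L, H, W, 3); returns shape (target_len, H, W, 3)."""
    length = frames.shape[0]
    if length < target_len:                            # pad short clips
        pad = np.repeat(frames[-1:], target_len - length, axis=0)
        return np.concatenate([frames, pad], axis=0)
    if length > target_len:                            # down-sample long clips
        idx = np.linspace(0, length - 1, target_len).astype(int)
        return frames[idx]
    return frames                                      # already the right length

clip = np.zeros((47, 64, 64, 3), dtype=np.uint8)       # e.g. a 47-frame clip
assert normalize_length(clip).shape == (32, 64, 64, 3)
```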
Following normalization, the images were resampled to 64 × 64 pixels to expedite classification. The labels were assigned manually, and the videos were manually trimmed for input into the segmentation classifier. Finally, the datasets were split into training, validation, and testing sets with a 70:20:10 ratio.
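As an illustrative sketch of this resampling and splitting step (the interpolation method and the splitting code are assumptions, not details from the study), one possible implementation is:

```python
# Sketch of the spatial resampling and the 70:20:10 split. The interpolation
# method (bilinear) and the splitting code are assumptions, not study details.
import numpy as np
import tensorflow as tf

def resize_clip(clip: np.ndarray, size=(64, 64)) -> np.ndarray:
    """clip: (32, H, W, 3) -> (32, 64, 64, 3); per-frame bilinear resize."""
    return tf.image.resize(clip, size).numpy()

def split_indices(n_samples: int, seed: int = 0):
    """Return shuffled index arrays for ~70% train, 20% validation, 10% test."""
    order = np.random.default_rng(seed).permutation(n_samples)
    n_train, n_val = int(0.7 * n_samples), int(0.2 * n_samples)
    return np.split(order, [n_train, n_train + n_val])

train_idx, val_idx, test_idx = split_indices(2812)
print(len(train_idx), len(val_idx), len(test_idx))     # 1968 562 282
```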
Of the collected video data, a total of 2812 short video sequences of healthy volunteers performing three different hand activities were used, of which 1968 (≈70% of the dataset) were used for training and 845 (≈30% of the dataset) were used for validation and testing. Each short video sequence contained 32 frames, for 89,984 frames in total. In total, 5155 short video sequences were collected, of which 3609 (≈70% of the dataset) were used for training and 1546 (≈30% of the dataset) were used for validation and testing. Each short video sequence contained 32 frames, for a total of 113,410 frames for training and 6784 for validation.
2.4. Model Design, Training and Evaluation
After the data pre-processing, the architecture was implemented based on an existing model originally introduced by Tran et al. [35], known as C3D. Specifically, a modified version of the C3D network, similar to the multimodal RGB-D-based network by Hakim et al. [12], was considered. Furthermore, to ensure that the three-dimensional CNN model was able to learn longer sequences, it was combined with an additional unit able to acquire long-term temporal features, an LSTM unit. The final architecture (Figure 5) consisted of a three-dimensional CNN with three convolutional layers, Rectified Linear Unit (ReLU) activation functions in the hidden layers to mitigate vanishing gradients, one LSTM layer, a flatten layer, a fully connected dense layer, and a Softmax activation layer.
The multi-dimensional input tensors were flattened into a single dimension. A flatten layer is often employed in the presence of a multi-dimensional output; it produces a linear output that can be passed to a dense layer. A dense (fully connected) layer joins every neuron in the preceding layer to every output neuron. Finally, the Softmax function produced a vector denoting the probabilities of the possible output classes. Based on the output from the Softmax, the frames were then segmented into those where the activities occurred and those where there was no gesture. The class “no gesture” was provided in case no activity was performed, but also for frames without a hand, such as when participants placed the hand down following a performed activity.
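A minimal Keras sketch of an architecture of this type is given below; the filter counts, LSTM width, and dense-layer size are assumptions, since the exact hyperparameters (Figure 5) are not reproduced here.

```python
# A minimal sketch of a 3D CNN + LSTM classifier of the kind described above.
# Filter counts, LSTM width, and dense sizes are assumptions, not the study's
# exact hyperparameters.
from tensorflow.keras import layers, models

def build_model(num_classes: int = 4) -> models.Model:
    inputs = layers.Input(shape=(32, 64, 64, 3))        # 32 RGB frames, 64x64
    x = inputs
    # Three 3D convolutional blocks with ReLU activations.
    for filters in (16, 32, 64):
        x = layers.Conv3D(filters, kernel_size=3, padding="same",
                          activation="relu")(x)
        x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)  # pool space, keep time
    # One feature vector per frame, fed to the LSTM for long-term dependencies.
    x = layers.Reshape((32, -1))(x)
    x = layers.LSTM(64, return_sequences=True)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_model()
model.summary()
```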
The baseline model was pre-trained on the selected five classes of the 20BN Jester dataset. Starting from the pre-trained architecture, a technique known as transfer learning [18] was then used to fine-tune the model to the activities performed in this study. The technique took the parameters from the previously trained model, froze the last layers to avoid their weights being updated, and then added new trainable layers, which were trained with new data to fine-tune the model.
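A hedged sketch of this fine-tuning step, reusing the hypothetical build_model function from the architecture sketch above, might look as follows; which layers are frozen, the weights file name, and the size of the new head are assumptions.

```python
from tensorflow.keras import layers, models

# Reuse the weights learned on the five 20BN Jester classes, freeze them, and
# attach a new trainable head for the four study classes. The weights file
# name, the cut point, and the head size are assumptions.
base = build_model(num_classes=5)                      # baseline architecture
base.load_weights("jester_baseline.h5")                # hypothetical file

trunk = models.Model(base.input, base.layers[-3].output)  # drop the old head
trunk.trainable = False                                # freeze reused weights

x = layers.Dense(128, activation="relu")(trunk.output)
outputs = layers.Dense(4, activation="softmax")(x)     # four study classes
fine_tuned = models.Model(trunk.input, outputs)
fine_tuned.compile(optimizer="adam", loss="categorical_crossentropy",
                   metrics=["accuracy"])
```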
A total of four tests were performed. During the first two tests, transfer learning was used with three convolutional layers. Then, to increase performance, an additional convolutional layer and an increased sample size were considered. The first two tests were evaluated over 13 training epochs, following the segmentation classifier proposed by Wang [27]. The last two tests were evaluated over 64 epochs, a training configuration also presented in Wang’s investigation [27]. A 12-gigabyte (GB) NVIDIA Tesla K80 graphics processing unit provided by Google Colaboratory was used to train the baseline model on the 20BN Jester dataset, TensorFlow [36] was used to deploy the model, and the training took approximately nine and a half hours. For the first and second tests, the training times were one and a half hours and two and a half hours, respectively, whereas for the last two tests, they were two and four hours.
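For illustration only, and continuing the fine-tuning sketch above, a training call consistent with the epoch counts described here could look like the following; the mini-batch size and the placeholder arrays are assumptions.

```python
import numpy as np

# Dummy arrays standing in for the pre-processed clips; only the tensor shapes
# reflect the pipeline above. Epoch counts follow the tests described in the
# text; the mini-batch size is an assumption.
x_train = np.zeros((8, 32, 64, 64, 3), dtype="float32")
y_train = np.zeros((8, 4), dtype="float32")
y_train[:, 0] = 1.0
x_val, y_val = x_train[:2], y_train[:2]

history = fine_tuned.fit(x_train, y_train,
                         validation_data=(x_val, y_val),
                         epochs=13,                    # 64 for the later tests
                         batch_size=4, verbose=0)
```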
The metric used to evaluate the performance of the model was the Jaccard index, or intersection-over-union value [37,38]. The index is often used for segmentation classifiers and was computed to compare a set of predicted labels with the corresponding set of true labels. Letting $A$ and $B$ be the set of predicted frames and the set of manually labelled ground-truth frames, respectively, the index is defined as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
The Jaccard index varies from zero to one; the larger the index, the higher the accuracy of the temporal segmentation classifier. The mean Jaccard index was used as a similarity measure to compare the performance of the proposed model with that of comparative studies. Training and validation accuracies were tested for 13 and 64 epochs, for a small sub-portion of 12 participants and for all 22 participants, to evaluate how variations in population size can improve training and validation performance.
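A small sketch of this per-frame evaluation, treating the predicted and ground-truth gesture frames as sets of frame indices (an assumed representation), is shown below.

```python
def jaccard_index(pred_frames: set, true_frames: set) -> float:
    """Intersection over union between predicted and ground-truth frame sets."""
    union = pred_frames | true_frames
    return len(pred_frames & true_frames) / len(union) if union else 1.0

# Frame indices predicted as gestural vs. the manually labelled ground truth.
predicted = set(range(10, 40))
ground_truth = set(range(12, 44))
print(round(jaccard_index(predicted, ground_truth), 3))   # 0.824
```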
4. Discussion
This work illustrates a CNN that automatically classifies and segments videos containing specific hand exercises including no gesture, abduction and adduction, MCP flexion, and thumb opposition. The segmentation of continuous video recordings was based upon a classifier that identified when the label “no gesture” was present. The presented pipeline addressed the challenge of hand gesture recognition from long video sequences captured using a monocular RGB camera.
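A minimal sketch of this segmentation logic, under the assumption that contiguous runs of frames not predicted as “no gesture” form the gesture segments, is shown below; the exact post-processing used in the study may differ.

```python
from itertools import groupby

def segment_by_label(frame_labels, background="no gesture"):
    """Group per-frame predictions into (label, start, end) gesture segments,
    skipping runs predicted as the background ("no gesture") class."""
    segments, start = [], 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        if label != background:
            segments.append((label, start, start + length - 1))
        start += length
    return segments

labels = (["no gesture"] * 5 + ["MCP flexion"] * 8 +
          ["no gesture"] * 4 + ["thumb opposition"] * 6)
print(segment_by_label(labels))
# [('MCP flexion', 5, 12), ('thumb opposition', 17, 22)]
```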
The implementation of the three-dimensional CNN was based on a model known as C3D, proposed by Tran et al. [35] and composed of a high-resolution and a low-resolution sub-architecture, each trained individually. Even though the C3D model presented good performance, the cost of training two different models is high, so a modified version, which incorporated the two networks into one, was used in this work. This modified C3D, however, could only detect short temporal characteristics from short video sequences, whereas this work aimed to introduce a network that detects short-term temporal features from long video sequences. Therefore, the final CNN was combined with an LSTM unit, capable of learning the long-term dependencies in long video sequences.
Previous studies that combined a three-dimensional CNN with LSTM units for hand activity recognition used both RGB and depth modalities to extract the motion signature [12,27], whereas the three-dimensional architecture implemented in this work was based only on RGB sequences, showing that a similar level of accuracy (93.95%) can also be reached from a single acquisition modality. Furthermore, the proposed network outperformed the 82% accuracy presented by Hakim et al. [12]. The overfitting observed after 64 epochs was similar to that of other investigations that used dual modalities [25,26]. The use of transfer learning to reach an acceptable (above 80%) level of accuracy enables the possibility of scaling this approach to include different hand gesture activities, showing how the model can be trained effectively on a small dataset to create an effective small-size segmentation classifier.
The mean Jaccard index, recorded and used as a comparative index also in similar investigations, was benchmarked against similar approaches. In the studies of Wang [27] and Wang et al. [39], the Jaccard index was lower than the one presented in this investigation. The mean Jaccard index presented in this study reached 0.794 for the same number of participants, outperforming the values presented in previous investigations (Table 2). However, Wang’s accuracy was based on the Montalbano Gesture Dataset [39], which contains different hand activities from those implemented in this investigation. Therefore, further investigations would be needed to compare the performance of this network using this metric. Furthermore, no inconsistency was shown across the segmented video recordings for actions or participants, meaning that segmentation accuracy did not depend on specific actions or specific participants.
To adopt and scale this application in real-world scenarios where multiple classes are considered, future directions could include testing this approach for real-time application using a finite-state-machine system that decreases the number of classes under inspection and increases the accuracy for real-time deployment. To further improve the model’s performance for real-time applications, the input image size or the number of layers could be increased. In addition to the 20BN Jester dataset, a further dataset could be used to enhance the model’s performance. The Jester dataset was developed by actors and did not provide numerous occlusion cases; in realistic circumstances, however, occlusion exists. A foreseen limitation of the results reported here is the absence of edge cases for recordings captured in unconstrained scenarios. Ambiguous appearance may lead to tracking errors, and capture methods relying solely on two-dimensional appearance information could struggle in scenarios where images are blurry, out-of-plane or rotated, distant, or small. Visual tracking methods may be incorporated to handle such interference (e.g., blurry hand gestures if the participant or the camera moves suddenly during the acquisition) with the goal of disambiguating the recognition target. Recovering identifiable appearance cues under image interference for a real-time hand recognition model, for instance with image blur classification and blur removal, would be an attractive research direction.
Even given the limitations of the monocular technique, when incorporated into a pipeline intended for further processing, the temporal segmentation results are still usable when viewed against the alternative of performing manual temporal segmentation. One intended use case would be a patient assessment setting, where hand exercises could be monitored, particularly when they are intended for use as therapy; this potentially extends suggested approaches of home exercise monitoring [41].
Furthermore, while the supervised transfer learning produced the expected outcomes, the approach presented in this work could be transferred to unsupervised learning and could support the automated labelling and segmentation of long video recordings, increasing the model’s generalizability. Hybrid deep learning models, such as the work by Nasser et al. [42], which combine recurrent networks to also model the temporal dependencies in high-dimensional sequences, are an interesting area to explore further.
Adapting current gesture recognition techniques to specific mobility exercises would have benefits that go beyond this single application. A real-time device that requires minimal manual processing could process and identify multiple gestures as soon as an image frame is received. This approach could be deployed in online hand gesture recognition studies for advanced assistance systems, surveillance, aided robotics, and clinical applications. For instance, the pipeline illustrated here could be integrated into remote monitoring clinical solutions, since it trains a model on a smaller dataset and a small architecture that can run efficiently to solve the classification problem of hand temporal segmentation. This would pave the way to a broader application of hand tracking models, incorporating other hand activity categories and obtaining a more generalizable approach that would include different hand exercise programs and different hand conditions.