1. Introduction
Nowadays, ultrasound is broadly applied in several fields; the major areas include medical and structural diagnoses and range-finding-related applications. For example, Doppler radar can be used in aviation, meteorology, speed guns, motion detection, intruder warning, and light control. In particular, Doppler radar has been used for posture and gesture recognition in motion sensing [1,2,3,4,5].
Posture or gesture recognition commonly uses mechanical or light-stimulation sensors, such as cameras, infrared sensors, and wearable devices. Although their installation may not be difficult and the obtained results can be intuitively analyzed, these devices have inherent limitations [6]. Cameras can infringe on users’ privacy; infrared sensors are susceptible to misjudgment; wearable sensors must be secured to the body for functional operation [7]. In contrast, acoustic sensing, such as sonar, possesses several unique advantages, particularly for monitoring the home environment, and can be fully operated in the dark [8]. In addition, no visuals need to be recorded, so the privacy of users is not affected [9]. These characteristics make sonar a superior sensing system for gesture recognition applications.
A hand gesture detection system possesses several advantages. The contactless interaction mode means users need not touch a control panel, which avoids potential cross-contamination among multiple users via a shared touch panel/screen, as well as the damage or fatigue of a physical control device due to inappropriate or intensive operation. In addition, the hand gestures proposed in this study are aimed at robot manipulation: they intuitively correspond to motions of the robot, which helps users not only operate the robot but also collaborate with it to execute a task.
Liu et al. used a single speaker and microphone in a smartphone, together with the Doppler effect, to study gesture recognition via ultrasound; vectors containing +1 and −1 indicate the direction of gesture movement in each time period [10]. The SoundWave team also used the microphone and speaker of a mobile device. The gestures they investigated consisted of scrolling, single or double tapping, forward and backward hand gestures, and continuous motion; these gestures can be judged from the amplitude of the reflected signal [11]. Przybyla et al. developed a chip with a three-dimensional range-finding function; its searchable range is up to 1 m within a 45° field of view, and low power consumption is its main feature [12]. Zhou et al. reported an ultrasonic hand gesture recognition system capable of recognizing 3D micro-hand gestures with a high cross-user recognition rate; this system possesses the potential to develop into a practical low-power human–machine interaction system [13].
On the other hand, researchers working in the field of computer vision have broadly employed image processing methods such as feature detection, description, and matching for object recognition, image classification, and retrieval. Features, such as points, edges, or corners, can be considered the information of interest within the investigated image [14]. A good feature detector finds features with strong information changes that are repeatedly recognized among several images taken under different viewing angles and lighting conditions [15]. A feature descriptor is an algorithm that processes image feature points into meaningful feature vectors. Feature descriptors encode interesting image features into a series of numbers and act as a type of numerical identifier that can be employed to differentiate one feature from another.
Researchers have conducted a considerable number of studies on image feature extraction and description and have proposed several classic methods, such as the scale-invariant feature transform (SIFT), speeded-up robust features (SURF), features from accelerated segment test (FAST), and binary robust independent elementary features (BRIEF) [16,17,18,19]. These methods obtain image feature points and their descriptors by finding local extrema in the image and describing the features using the luminance information of their neighborhoods.
Feature matching is performed by calculating the distances between feature points in different images. For example, the SIFT feature descriptor uses the Euclidean distance as the judgment standard between descriptors, whereas the BRIEF descriptor [20] is a binary descriptor that uses the Hamming distance as the judgment standard to describe the correspondence between two feature points [21,22].
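The two distance measures can be illustrated with a minimal nearest-neighbor matcher (a generic sketch, not the reference implementation of any descriptor; the toy 8-bit descriptors below are invented for illustration):

```python
import numpy as np

def euclidean_match(desc_a, desc_b):
    """Index of the descriptor in desc_b closest to each row of desc_a
    (SIFT-style float descriptors, Euclidean distance)."""
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    return d.argmin(axis=1)

def hamming_match(desc_a, desc_b):
    """Same idea for BRIEF-style binary descriptors (bit arrays),
    using the Hamming distance (number of differing bits)."""
    d = (desc_a[:, None, :] != desc_b[None, :, :]).sum(axis=2)
    return d.argmin(axis=1)

# toy 8-bit binary descriptors
a = np.array([[1, 0, 1, 1, 0, 0, 1, 0]])
b = np.array([[1, 0, 1, 1, 0, 0, 1, 1],   # differs from a in 1 bit
              [0, 1, 0, 0, 1, 1, 0, 1]])  # differs from a in 8 bits
print(hamming_match(a, b))  # → [0]
```

In practice a ratio test or a distance threshold is usually added on top of the nearest-neighbor search to reject ambiguous matches.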
As described above, existing air sonar systems applied to hand gesture recognition decode the level or amplitude of the time-domain reflected signal; environmental noise and the distance between the user and the speaker/microphone can therefore decrease the accuracy rate of hand gesture recognition. Moreover, most image recognition techniques deal with images obtained directly from a camera. The large number of data points in such images, together with frame-to-frame changes, requires both higher-cost hardware and more computational effort than our proposed air sonar approach. In this study, we constructed an air sonar with an ultrasonic emitter–receiver pair to investigate hand gesture recognition. With cost-effective circuitry, the acquired Doppler signals were processed for the study of hand gesture recognition. Two algorithms were developed: one judges the starting time of a hand gesture by continuous short-time Fourier transform, and the other recognizes the hand gesture by image processing of the spectrogram. Superior recognition results were obtained using the proposed scheme. Further details are described below.
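The first of the two algorithms, judging the starting time of a gesture from successive short-time spectra, can be sketched as follows (a simplified illustration with invented window, band, and threshold parameters, not the exact procedure used in the study):

```python
import numpy as np

def detect_gesture_start(x, fs, win=256, hop=128, band=(50, 500), k=5.0):
    """Flag the first STFT frame whose energy inside the Doppler band
    exceeds k times the energy of the first (assumed idle) frame.
    Window length, band, and threshold are illustrative choices."""
    w = np.hanning(win)
    freqs = np.fft.rfftfreq(win, 1 / fs)
    sel = (freqs >= band[0]) & (freqs <= band[1])
    ref = None
    for start in range(0, len(x) - win, hop):
        frame = np.abs(np.fft.rfft(x[start:start + win] * w))
        energy = np.sum(frame[sel] ** 2)
        if ref is None:
            ref = energy + 1e-12        # idle reference level
        elif energy > k * ref:
            return start / fs           # gesture start time [s]
    return None

# synthetic test signal: noise only, then a 200 Hz Doppler tone from 0.5 s
np.random.seed(0)
fs = 8000
t = np.arange(0, 1.0, 1 / fs)
x = 0.01 * np.random.randn(len(t))
x[t >= 0.5] += np.sin(2 * np.pi * 200 * t[t >= 0.5])
start_time = detect_gesture_start(x, fs)
print(start_time)  # close to 0.5, within one frame of resolution
```

Sliding the window frame by frame is what makes the transform "continuous" in time; the hop size sets the temporal resolution of the detected start.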
2. Linear and Rotational Doppler Effect on Hand Gesture Motion
The common hand gestures include translational and rotational motions of the hand. The signals from the air sonar, with linear and rotational Doppler effects, are applied to our hand gesture recognition. Consider the constructed ultrasonic transmitter and receiver pair as an air sonar system fixed at the origin Q of the radar coordinate system (U, V, W) (Figure 1). The hand is described in the local coordinates (x, y, z) and has a translation and rotation with respect to the radar coordinates. A reference coordinate system (X, Y, Z) is introduced, which has the same translation as the coordinates (x, y, z) but no rotation with respect to the (U, V, W) coordinates.
Suppose the hand has a translation velocity $\mathbf{v}$ with respect to the radar and an angular rotation velocity $\boldsymbol{\omega}$ represented in the reference coordinates as $\boldsymbol{\omega} = (\omega_X, \omega_Y, \omega_Z)^T$. A point scatterer M on the hand, located at $\mathbf{r}_0 = (X_0, Y_0, Z_0)^T$ at time $t = 0$, will move to M′ at time $t$. This movement can be considered as a translation from M to M″ with velocity $\mathbf{v}$ and a rotation from M″ to M′ with an angular velocity $\boldsymbol{\omega}$, described by the rotation matrix $\mathcal{R}_t$. At time $t$, the range vector from the radar to the scatterer at M′ becomes
$$\mathbf{R}_t = \mathbf{R}_0 + \mathbf{v}t + \mathcal{R}_t \mathbf{r}_0 .$$
The scalar range of $\mathbf{R}_t$ can be expressed as
$$R_t = \lVert \mathbf{R}_t \rVert = \lVert \mathbf{R}_0 + \mathbf{v}t + \mathcal{R}_t \mathbf{r}_0 \rVert .$$
When the air sonar emits a continuous sinusoidal wave with a carrier frequency $f_0$, the echo signal $s(t)$ received from the scatterer on the hand at the position $(x, y, z)$ is expressed as a function of $R_t$:
$$s(t) = \rho(x, y, z)\,\exp\!\left\{ j\left[ 2\pi f_0 t + \Phi(t) \right] \right\},$$
where $\rho(x, y, z)$ is the reflectivity function of the point scatterer M described in the target local coordinates $(x, y, z)$, $c$ is the speed of sound, and the phase of the signal is $\Phi(t) = 4\pi f_0 R_t / c$. The Doppler frequency shift by hand motion can be obtained by taking the time derivative of the phase as [23]
$$f_D = \frac{1}{2\pi}\frac{d\Phi(t)}{dt} = \frac{2 f_0}{c}\frac{dR_t}{dt} = \frac{2 f_0}{c}\left[ \mathbf{v} + \boldsymbol{\omega} \times \mathbf{r} \right]^T \mathbf{n},$$
where $\mathbf{n} = \mathbf{R}_t / \lVert \mathbf{R}_t \rVert$ is the unit vector from the radar to the scatterer and $\mathbf{r} = \mathcal{R}_t \mathbf{r}_0$.
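The Doppler frequency shift expression can be evaluated numerically; the sketch below (with illustrative velocity and geometry values, not measurements from the study) shows the shift magnitude expected for a palm moving at 0.5 m/s along the line of sight at the study's 40.8 kHz carrier:

```python
import numpy as np

def doppler_shift(f0, c, v, omega, r, R):
    """f_D = (2 f0 / c) * (v + omega x r)^T n, with n = R / |R|
    (sign convention: positive for increasing range)."""
    n = R / np.linalg.norm(R)
    return 2 * f0 / c * np.dot(v + np.cross(omega, r), n)

f0 = 40.8e3                     # carrier frequency [Hz]
c = 343.0                       # speed of sound in air [m/s]
v = np.array([0.0, 0.0, 0.5])   # scatterer moving at 0.5 m/s along the line of sight
omega = np.zeros(3)             # pure translation (as in the push gesture)
r = np.zeros(3)
R = np.array([0.0, 0.0, 0.2])   # scatterer 20 cm from the radar
fd = doppler_shift(f0, c, v, omega, r, R)
print(round(fd, 1))  # → 119.0 [Hz]
```

A shift of roughly 119 Hz for a 0.5 m/s hand speed is consistent with the sub-500 Hz Doppler band reported later for the studied gestures.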
Thus, the Doppler frequency shift includes the effects of translation and rotation, as well as the direction from the radar to the scatterer on the hand. Four hand gestures, namely, “push”, “wrist motion from flexion to extension”, “pinch out”, and “hand rotation”, were performed with the right hand for this study.
(1) Push: As shown in Figure 2a, the push gesture involves only a translational motion of the palm. While the hand moved forward from an initial position to a stop position, the normal direction of the palm was toward the radar, and the palm underwent acceleration, near-constant velocity, and deceleration stages. The maximum Doppler frequency shift was contributed by the center of the palm because the velocity direction was aligned with the direction from the radar to the center of the palm. The minimum Doppler frequency shift resulted from the corners of the moving palm.
(2) Wrist motion from flexion to extension: As shown in Figure 2b, the hand moved from the position with the palm facing away from the radar to that with the palm facing the radar. Initially, the wrist was flexed. Then, in terms of angular displacement, the hand underwent acceleration, constant-speed, and deceleration stages. Finally, the wrist was extended. The pivot region, which can be considered the wrist joint, possesses zero velocity during the entire motion, thus resulting in a zero Doppler frequency shift. The largest Doppler frequency shift was contributed by the scatterer point on the hand with the largest value of velocity projected onto the unit direction vector from that scatterer to the radar.
(3) Pinch out: This motion is mainly performed by the thumb and index finger. In the initial state, the front ends of the index finger and thumb contacted each other, and the remaining fingers were clenched. The motion is illustrated in Figure 2c. The front ends of the two fingers gradually opened up along with the forward movement of the hand. The thumb and index finger underwent acceleration from the initial position and then decelerated to a stop position, thus completing this motion. The index finger rotated in a clockwise direction, and the thumb rotated in a counterclockwise direction. These two motions provide a negative Doppler frequency shift whose magnitude gradually increases, reaches a maximum, and then decreases to zero. Meanwhile, the remaining fingers formed a partial fist and underwent acceleration and deceleration phases. This caused a Doppler frequency shift similar to that of the push motion, but over a smaller scattering area.
(4) Hand rotation: The motion of hand rotation is similar to the wrist motion from flexion to extension, with the major difference being the axis of rotation. Regarding hand rotation, the hand was aligned with the lower arm, and all the fingers closed to form a plane with the palm (Figure 2d). The axis of rotation was the centerline of the lower arm. The rotation started with the palm toward the radar and ended with the palm facing away from the radar. The hand motion underwent two phases of acceleration and deceleration. The axis of rotation had zero velocity, and thus contributed zero frequency shift. During the motion, the part of the hand from the axis of rotation to the little finger edge provided a positive Doppler frequency shift, followed by a negative Doppler frequency shift. Meanwhile, the rest of the hand produced a negative Doppler frequency shift, followed by a small positive Doppler frequency shift.
Hand gestures can serve as an important tool for human interaction. Compared with existing interfaces, hand gestures have the advantages of being intuitive and easy to use. We investigated four different hand gestures that could be applied to human–robot interaction. The significance of these hand gestures is as follows: the push motion commands the robot arm to move forward, the wrist motion from flexion to extension asks the robot gripper to make a right turn, the pinch out motion instructs the gripper of the robot to open, and the hand rotation motion requests the robot body/platform to turn left (Figure 3).
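The gesture-to-command correspondence could be expressed as a simple lookup table (a hypothetical sketch; the command names are invented placeholders, not identifiers from the study):

```python
# Hypothetical lookup table pairing the four studied gestures with the
# robot commands described above; the command names are invented labels.
GESTURE_COMMANDS = {
    "push": "arm_forward",
    "wrist_flexion_to_extension": "gripper_turn_right",
    "pinch_out": "gripper_open",
    "hand_rotation": "platform_turn_left",
}

def command_for(gesture: str) -> str:
    """Return the robot command for a recognized gesture (hold if unknown)."""
    return GESTURE_COMMANDS.get(gesture, "hold")

print(command_for("pinch_out"))  # → gripper_open
```

Defaulting to a "hold" command for unrecognized gestures is a conservative design choice for a collaborative robot.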
3. Hardware Setup of Air Sonar
3.1. Ultrasonic Emitter and Receiver
We employed the MA40S4S as the emitter and the MA40S4R as the receiver (Murata Co., Nagaokakyo, Japan), both with an operating frequency of approximately 40 kHz, in the air sonar system. A sinusoidal signal from a function generator actuated the ultrasonic emitter, and an NI-USB6351 data acquisition card acquired the electrical signal from the ultrasonic receiver. To avoid distortion of the acquired sinusoidal wave, we set the sampling frequency to 250 kHz, more than five times the operating frequency, to examine the performance of the selected pair of ultrasonic transducers.
Two tests were performed under a sinusoidal voltage input of 20 Vpp: (1) the emitter was fixed, and the receiver was placed 3, 18, and 38 cm away such that it was facing the emitter. The amplitudes of the acquired sinusoidal signals were 6.52, 0.736, and 0.188 Vpp, respectively. (2) The emitter and receiver were soldered on a circuit board 1 cm away from each other. A large glass plate was held parallel to the circuit board 7 and 40 cm away from the ultrasonic transducers as a reflector of the ultrasonic wave. The resulting amplitudes of the acquired signals were 2.3 and 0.424 Vpp, respectively. These two simple tests verify the capability of the selected transducer pair to obtain a sufficiently large signal compared with the environmental noise level during the subsequent hand gesture experiment performed approximately 10 to 30 cm away from the transducer pair.
3.2. Circuit for Air Sonar Operation
The air sonar was designed with a transmission frequency of 40 kHz. To acquire the reflected signal directly, the minimum sampling rate should be greater than 80 kHz, and preferably higher than 200 kHz to obtain a better waveform of the reflected signal. To make the hand gesture recognition technology easily implementable online, the sampling rate and the amount of data for algorithm processing should be reduced.
As we utilized the Doppler effect of the received signal for our hand gesture recognition, the required sampling frequency is much lower than that needed to acquire the signal emitted from the ultrasonic emitter directly. This can be realized through a frequency-mixing technique. Consider f1 as the frequency of the received ultrasonic wave, given by the frequency of the transmitted ultrasonic wave plus the Doppler frequency shift, and f2 as the frequency of the locally generated mixing signal. When signals of frequencies f1 and f2 enter the mixer, mixed signals of frequencies f1 + f2 and f1 − f2 emerge at the output terminals. We employed the signal with the frequency f1 − f2 for investigating the Doppler effect.
Based on the preliminary experiment, the magnitude of the Doppler frequency shift of interest is less than 500 Hz. We constructed the mixed signal with the frequency f1 − f2 = (40.8 kHz + Doppler frequency shift) − 37.6 kHz = 3.2 kHz + Doppler frequency shift. The center frequency of 3.2 kHz was selected because it is more than five times higher than the Doppler frequency shift in this study.
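The mixing step can be verified with a short simulation (a sketch with assumed unit amplitudes; a 200 Hz Doppler shift is added for illustration), confirming that the product of the two sinusoids contains components at f1 − f2 and f1 + f2:

```python
import numpy as np

fs = 1_000_000                      # simulation sampling rate [Hz]
t = np.arange(0, 0.02, 1 / fs)      # 20 ms record (50 Hz bin spacing)
f1 = 40_800 + 200                   # received wave with a 200 Hz Doppler shift
f2 = 37_600                         # locally generated mixing signal
mixed = np.sin(2 * np.pi * f1 * t) * np.sin(2 * np.pi * f2 * t)

# the product equals 0.5*cos(2*pi*(f1-f2)*t) - 0.5*cos(2*pi*(f1+f2)*t),
# so the spectrum shows exactly two peaks
spec = np.abs(np.fft.rfft(mixed))
freqs = np.fft.rfftfreq(len(t), 1 / fs)
peaks = sorted(float(f) for f in freqs[np.argsort(spec)[-2:]])
print(peaks)  # → [3400.0, 78600.0], i.e., f1 - f2 and f1 + f2
```

A subsequent low-pass filter keeps only the 3.4 kHz difference component (3.2 kHz plus the Doppler shift), which can then be sampled at a far lower rate than the 40 kHz carrier.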
The oscillators, mixer, and filters were employed to construct the circuitry for reducing the carrier frequency [24]. Figure 4 shows the circuit implementation. Two oscillators were realized using Wien bridge sine-wave oscillators: one operated at 40.8 kHz (f1) to drive the ultrasonic emitter, and the other operated at 37.6 kHz (f2) as the mixing signal. The analog multiplier AD633 served as the mixer. A bandpass filter was applied to remove unwanted noise and obtain a signal of reasonable frequency bandwidth. A low-pass filter was employed to remove the component at the carrier frequency f1 + f2. In addition, a voltage follower was added to avoid the loading effect of the voltage signal from the ultrasonic receiver on the bandpass filter. After bandpass filtering, a voltage amplifier adjusted the gain so that the signal sent to the mixer had a proper amplitude during hand gesture operation.
6. Recognition Results of the Same Gestures and Different Gestures
Using the aforementioned methods, we investigated the four types of gestures described in the previous section. For each studied gesture, we randomly took two samples from five hand gesture operations. The samples were performed by the same person. We first obtained the recognition results for the same gestures. For each of the four studied gestures, we took the samples at different times as motions “A” and “B” and compared them with each other to obtain the matching points of the extracted features.
Table 1 lists the test results of the matching points.
Figure 9a shows that the push motion “A” tested against itself yielded 245 matching points and the push motion “B” tested against itself yielded 251 matching points. The small yellow circles indicate the matching points. Figure 9b shows that, by using the push motion “A” to test the push motion “B”, we could obtain 189 matching feature points. When we switched the matching sequence, that is, using the push motion “B” to test motion “A”, we obtained 179 matching points. The pinch out motion “A” tested against itself yielded 210 matching feature points, and the pinch out motion “B” tested against itself yielded 240 matching points. Using the pinch out motion “A” to test “B”, we obtained 165 matching points, and 149 matching points with the test order switched.
A similar test was also performed for the hand rotation motion. The rotation “A” tested against itself yielded 320 matching points, and rotation “B” tested against itself yielded 311 matching points. The cross-validation of using rotation “A” to test “B” yielded 192 matching points and the reverse order yielded 188 matching points. As for the gesture of wrist motion from flexion to extension, the motion “A” tested against itself yielded 267 matching points; the motion “B” tested against itself yielded 250 matching points. Using motion “A” to test “B” yielded 156 matching points, and the reverse order yielded 140 matching points.
From the above analysis, we observe that stronger matching occurs when a feature description is compared against itself for each investigated case. The rotation motion showed the highest matching number of 320 points, and the pinch out motion showed the lowest matching number of 210 points. Although the matching points of cross-validation for the same gestures were not as high as those for testing with an identical gesture, they still exhibited a largest value of 192 and a smallest value of 140 for the four investigated gestures. In addition, we observed that the comparison order could affect the number of matching points for the same gesture motions. The differences of the cross-validation comparisons were 10, 16, 16, and 4 for the push motion, wrist motion from flexion to extension, pinch out, and rotation, respectively. The disparity is approximately one order of magnitude less than the number of matching points, which indicates that the effect of the test order can be reasonably neglected.
Subsequently, different gestures were examined. For every two cases of the same gestures discussed above, we tested the other three cases of gestures. We selected some examples to illustrate the matching results.
Figure 10a–e shows the results of using the rotation motion “A” to test the push motion “A”, the wrist motion from flexion to extension “A” to test the pinch out motion “A”, the wrist motion from flexion to extension to test the rotation motion “A”, the pinch out motion “A” to test the wrist motion from flexion to extension “A”, and the push motion “A” to test the rotation motion “A”, respectively. The corresponding numbers of matching points were 30, 37, 25, 26, and 26, respectively. The matching points marked with yellow circles in Figure 10 are much fewer than those of the same gestures. Therefore, a borderline number of matching points can be set to distinguish the same gestures from different gestures using our proposed method.
To correlate the hand gestures and matching results quantitatively, we categorized the matching results into three groups. The first group consisted of identical hand gestures with the number of matching points between 320 and 210. The second group had the same hand gestures, but they were performed at different times. The number of matching points was between 140 and 189. The third group consisted of different hand gestures, and the number of matching points was between 48 and 11.
If we define the maximum number of matching points in the aforementioned cases, which is 320, as the denominator of a matching ratio for hand gesture recognition, then identical hand gestures are indicated by a matching ratio above 0.66 (i.e., greater than 210 divided by 320), the same gestures performed at different times are indicated by matching ratios above 0.44, and different hand gestures can be determined by a matching ratio of less than 0.15. The results indicate that there is a large ratio interval of 0.29 to prevent the misjudgment of hand gestures.
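The resulting decision rule can be written as a small classifier (a sketch built from the reported counts and thresholds; the label strings are invented for illustration):

```python
# Decision rule derived from the reported matching counts: ratios are taken
# against the maximum observed count (320); thresholds 0.66 / 0.44 / 0.15
# come from the three groups identified above.
MAX_MATCHES = 320

def classify(matches: int) -> str:
    ratio = matches / MAX_MATCHES
    if ratio >= 0.66:
        return "identical gesture"
    if ratio >= 0.44:
        return "same gesture, different trial"
    if ratio <= 0.15:
        return "different gesture"
    return "undecided"  # falls inside the 0.15-0.44 guard interval

print(classify(245), classify(156), classify(30), sep=" | ")
# → identical gesture | same gesture, different trial | different gesture
```

The wide guard interval between 0.15 and 0.44 is what makes the rule robust: an observation landing there is flagged as undecided rather than forced into a class.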
Furthermore, if we directly fit probability density functions to the experimental matching points shown in Table 1 for the same hand gestures and the different hand gestures, the probability of correct recognition can be found. The mean value (μ) and standard deviation (σ) are 214.44 and 53.43 for the identical gestures, and 26.73 and 9.02 for the different gestures. Using the probability of the normal distribution, the accuracy rate can be estimated as below:
Let $X_1$ be the matching points for the cases of failure of hand gesture recognition, possessing the normal distribution $N(\mu_1, \sigma_1^2)$ with probability density function (pdf) $f_1(x)$ and cumulative distribution function (cdf) $F_1(x)$, and let $X_2$ be the matching points for the cases of correct hand gesture recognition, possessing the normal distribution $N(\mu_2, \sigma_2^2)$ with pdf $f_2(x)$ and cdf $F_2(x)$, where $\mu_1 < \mu_2$. The area of the intersection zone, which indicates the probability of faulty hand gesture recognition, can be found by
$$P_{\text{fault}} = \int_{-\infty}^{c} f_2(x)\,dx + \int_{c}^{\infty} f_1(x)\,dx = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{c - \mu_2}{\sigma_2\sqrt{2}}\right)\right] + \frac{1}{2}\left[1 - \operatorname{erf}\!\left(\frac{c - \mu_1}{\sigma_1\sqrt{2}}\right)\right],$$
where $\operatorname{erf}(\cdot)$ denotes the error function and $c$ is the $x$-value for which $f_1(x) = f_2(x)$; it can be obtained by solving
$$\frac{1}{\sigma_1\sqrt{2\pi}} \exp\!\left[-\frac{(c - \mu_1)^2}{2\sigma_1^2}\right] = \frac{1}{\sigma_2\sqrt{2\pi}} \exp\!\left[-\frac{(c - \mu_2)^2}{2\sigma_2^2}\right].$$
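As a numerical cross-check (assuming the normal-distribution model with the reported means and standard deviations), the intersection point and the resulting accuracy can be computed directly; taking logarithms of the pdf equality reduces it to a quadratic in c:

```python
import math

mu1, s1 = 26.73, 9.02       # different gestures (faulty recognition)
mu2, s2 = 214.44, 53.43     # same or identical gestures (correct recognition)

# intersection point c of the two pdfs: f1(c) = f2(c) reduces to a quadratic
A = 1 / (2 * s1**2) - 1 / (2 * s2**2)
B = mu2 / s2**2 - mu1 / s1**2
C = mu1**2 / (2 * s1**2) - mu2**2 / (2 * s2**2) - math.log(s2 / s1)
c = (-B + math.sqrt(B**2 - 4 * A * C)) / (2 * A)   # root between mu1 and mu2

def cdf(x, mu, s):
    """Normal cumulative distribution function via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (s * math.sqrt(2))))

# overlap area = P(X2 < c) + P(X1 > c): probability of faulty recognition
err = cdf(c, mu2, s2) + (1 - cdf(c, mu1, s1))
print(round(c, 1), round(100 * (1 - err), 1))  # → 58.1 99.8
```

The computed accuracy of about 99.8% agrees with the value reported for the developed scheme.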
Figure 11 shows the analyzed results. The accuracy rate of hand gesture recognition can achieve 99.8% using the developed scheme.
We estimated the time required to execute the proposed algorithm for hand gesture recognition. The calculation was based on MATLAB code run on a personal computer with an Intel Core i5 CPU @ 3.10 GHz and 8 GB RAM. Using the MATLAB timer functions tic (which starts a stopwatch timer) and toc (which prints the elapsed time since tic was called), the time required to complete one hand gesture recognition was measured to be about 0.7 s. The computational cost could be further reduced by converting the MATLAB code to the C or Python language for better performance.
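In a Python port, the same kind of measurement could use time.perf_counter (a generic sketch; the summation workload is a stand-in for the recognition pipeline, which is not reproduced here):

```python
import time

def time_call(fn, *args):
    """Analogue of MATLAB's tic/toc: time a single call to fn."""
    t0 = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - t0

# stand-in workload in place of one hand gesture recognition
total, dt = time_call(sum, range(1_000_000))
print(f"elapsed: {dt:.4f} s")
```

perf_counter is preferred over time.time for such measurements because it is monotonic and has the highest available resolution.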
In the future, we will explore more types of hand gestures for recognition, and more data from different testers will be analyzed. In addition, the parameters used in the algorithm, for example, the window functions used in the STFT analysis and the threshold values in the MATLAB matchFeatures function, could be further investigated to obtain optimized results.
The proposed methods in this study could be applied to other problems, such as structural health monitoring or fault diagnosis of machines [28,29,30]. For example, using the received acoustic signal along with the presented signal processing scheme possesses the advantages of a low-cost hardware setup and non-destructive detection. The described image processing scheme could also be employed for thermal imaging data analysis. For instance, using a specific fusion method to extract features, along with a nearest neighbor classifier and a support vector machine, has been studied as an effective way to diagnose faults in an angle grinder [30]. Further investigation using our proposed image processing scheme could be of interest.