Article

Millimeter-Wave Radar Point Cloud Gesture Recognition Based on Multiscale Feature Extraction

1 School of Information Science and Technology, North China University of Technology, Beijing 100144, China
2 Department of Electronic and Optical Engineering, Shijiazhuang Campus, Army Engineering University, Shijiazhuang 050003, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(2), 371; https://doi.org/10.3390/electronics14020371
Submission received: 15 December 2024 / Revised: 15 January 2025 / Accepted: 15 January 2025 / Published: 18 January 2025
(This article belongs to the Special Issue Machine Learning for Radar and Communication Signal Processing)

Abstract: A gesture recognition method based on millimeter-wave radar point clouds is proposed in this paper, primarily for identifying six basic human gestures. First, the raw radar signals collected by the MIMO millimeter-wave radar are converted into 3D point cloud sequences using a microcontroller integrated into the radar's baseband processor. Next, building on the SequentialPointNet network, a multiscale feature extraction module is proposed, which enhances the network's ability to extract local and global features through convolutional layers at different scales and compensates for the limited feature understanding of single-scale convolution kernels. Moreover, the Convolutional Block Attention Module (CBAM) in the network is replaced with the Global Attention Mechanism (GAM), which models global contextual information more precisely and thereby increases the network's focus on global features. Finally, a separable MLP structure is introduced into the network: separable MLP operations extract local point features and neighborhood features separately and then fuse them, significantly improving the model's performance. The effectiveness of the proposed method is confirmed through experiments, achieving 99.5% accuracy in recognizing six fundamental human gestures, effectively distinguishing between gesture categories and confirming the potential of millimeter-wave radar 3D point clouds for gesture recognition.

1. Introduction

With the continuous advancement of technology, human–computer interaction has garnered increasing attention, and gesture recognition is an important method of human–computer interaction [1]. For example, in fields such as operating rooms [2], smart automobiles [3], and smart homes [4], natural gestures enable interaction with devices without the need to touch hardware, which can significantly improve the user experience.
Gesture recognition methods can be divided into those based on wearable devices and those based on non-wearable devices. Wearable devices obtain gesture information primarily through sensors worn on the body, typically accelerometers and gyroscopes [5]. These devices tend to have high sampling rates and accuracy. However, they require continuous wearing, which can cause discomfort to users and significantly affect their daily lives. Non-wearable approaches mainly include ordinary cameras, WiFi [6], depth cameras, and radar. Ordinary cameras and depth cameras detect gesture information from the captured images or depth point clouds, extracting features such as shape and posture. Ordinary cameras have high resolution and can identify gesture information with relatively high accuracy [7,8]; however, they cannot capture distance and velocity information and pose a risk of privacy leakage. WiFi-based methods [9] have poor anti-interference capabilities and low accuracy when dealing with complex gestures. Depth cameras can capture high-precision depth images and model the hands to achieve gesture recognition [10]; however, they require good lighting conditions, have limited applicable scenarios [11], and also pose privacy risks. LiDAR [12] and ultra-wideband radar are susceptible to extreme weather and are expensive.
Compared to other non-contact devices, millimeter-wave radar offers advantages such as high measurement accuracy [13], the ability to operate continuously in all weather conditions, cost-effectiveness, and protection of personal privacy [14]. It has a wide range of applications in human detection. Radar can effectively detect cardiac motion signals [15] and perform non-contact measurements of heartbeats and respiration [16]. Additionally, it is capable of detecting human actions and positions [17]. In light of these characteristics, this paper opts to utilize millimeter-wave radar for gesture recognition research.
Currently, numerous gesture recognition approaches leveraging millimeter-wave radar data are based on deep learning. In [18], micro-Doppler feature maps corresponding to different gestures are obtained by analyzing and processing the received gesture echoes, and a support vector machine classifies the resulting dataset. In [19], finger movements are captured using millimeter-wave radar; feature extraction and analysis are performed on the micro-Doppler spectrum and Range-Doppler spectrum, and gesture classification is carried out using a Convolutional Neural Network (CNN). This approach enables gesture recognition and thereby gesture-based text input. In [20], a Multi-feature Fusion Temporal Neural Network is utilized to extract the spatiotemporal information of gestures from Range-Doppler Mapping (RDM) and Angle–Time Mapping (ATM), thereby classifying the gestures. Compared to 2D images, 3D point clouds are more intuitive and contain richer target information. In the recognition of single-frame point clouds, the PointNet [21] network uses shared multi-layer perceptrons (MLPs) to independently extract features for each point in the point cloud data and then integrates the features of each point into global features through global max pooling, thus extracting the overall structural information of the point cloud. PointNet++ [22] adds sampling and grouping operations on top of PointNet, iteratively achieving global feature extraction. The 3DV-PointNet++ framework [23] introduces 3D dynamic voxels (3DVs) as a method to represent 3D motion; PointNet++ processes a subset of points extracted from the 3DV for 3D action recognition, enabling end-to-end learning. Nonetheless, converting the point cloud sequence into a static 3D point set results in a significant loss of spatiotemporal details and incurs higher computational overhead. The Point 4D Transformer (P4Transformer) [24] network combines 4D convolution and the Transformer: 4D convolution is used to embed the spatiotemporal local structures present in the point cloud video, and by applying self-attention to the embedded local features, the Transformer effectively captures both appearance and motion details across the entire video. The SequentialPointNet [25] network combines the PointNet++ network with the concept of positional encoding from Transformers [26], where PointNet++ is used to extract features from static point clouds and positional encoding generates a coordinate component to embed time information.
Inspired by the deep learning feature extraction methods for point cloud sequences mentioned above, this paper proposes the MSFE-GAM-SPointNet network based on the SequentialPointNet network. First, the network’s ability to extract local and global features at multiple scales is enhanced through the introduction of a multiscale feature extraction module. Next, the GAM [27] attention mechanism is used to replace CBAM, compensating for the potential loss of contextual information in the network. Finally, a separable MLP structure is introduced to optimize the feature processing workflow.

2. Gesture Recognition Principles

2.1. Gesture Recognition System Process

A method for gesture recognition based on millimeter-wave radar point clouds is proposed in this study, consisting of three key steps: collecting raw data of human gestures based on the millimeter-wave radar module; processing the collected raw data using the millimeter-wave radar baseband processor to generate 3D point cloud information of the target, including distance, velocity, and angle; and utilizing deep learning networks to extract features from the 3D point clouds and recognize gestures based on the extracted point cloud features. The detailed system process is shown in Figure 1.

2.2. Principle of Millimeter-Wave Radar

2.2.1. Radar Working Principle

This paper uses MIMO millimeter-wave radar to obtain the target’s distance, velocity, and angle information. The working principle of the FMCW radar is shown in Figure 2.
The FMCW radar transmitting antenna (TX) emits a linearly frequency-modulated signal at a specific frequency, while a corresponding signal is also sent to the mixer. Upon encountering the target, an echo signal is generated and received by the receiving antenna (RX). By mixing the echo with the transmitted signal, an intermediate frequency (IF) signal is produced, which is subsequently digitized through an analog-to-digital converter.
The FMCW radar-transmitted signal can be expressed as follows:
$x_{TX}(t) = A_T \cos\left(2\pi f_c t + \pi S t^2\right)$
where $A_T$ corresponds to the amplitude of the transmitted signal; $f_c$ represents the carrier frequency of the radar's transmitted signal; $S = B/T_c$ is the frequency modulation slope; $B$ is the modulation bandwidth; and $T_c$ is the period of the linear frequency-modulated signal.
When the radar-transmitted signal encounters a moving target, the radar antenna captures the signal after it is reflected. The received echo signal can be expressed as follows:
$x_{RX}(t) = A_R \cos\left(2\pi f_c (t - t_d) + \pi S (t - t_d)^2\right)$
where $A_R$ is the amplitude of the echo signal, $t_d = 2R/c$ is the time delay of the echo signal received by the receiving antenna, $R$ is the distance to the detected target, and $c$ is the speed of light.
As shown in Figure 2, the transmitted signal and the received signal are mixed through the mixer, and the high-frequency signals are filtered out using a low-pass filter to obtain the intermediate frequency (IF) signal. The IF signal is expressed as follows:
$x_{IF}(t) = \frac{1}{2} A_T A_R \cos\left(2\pi S \left(t_d t - \frac{1}{2} t_d^2\right) + 2\pi f_c t_d\right)$
In practical applications, $t_d$ is very small and the $t_d^2$ term can be ignored, so the frequency of the IF signal can be expressed as follows:
$f_{IF} = S t_d = \frac{B}{T_c} \cdot \frac{2R}{c}$
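As a minimal numerical illustration of this relation, the following Python sketch computes the beat frequency for a point target; the chirp slope follows Table 1, while the 2 m target range is a hypothetical example rather than a value from the experiments.

# Minimal sketch of the relation f_IF = S * t_d = (B/T_c) * (2R/c).
# The slope follows Table 1; the 2 m target range is a hypothetical example.
c = 3e8                              # speed of light (m/s)
S = 58.545e6 / 1e-6                  # frequency modulation slope (Hz/s), i.e., 58.545 MHz/us
R = 2.0                              # hypothetical target range (m)
t_d = 2 * R / c                      # round-trip delay (s)
f_IF = S * t_d                       # beat (IF) frequency (Hz)
print(f"t_d = {t_d * 1e9:.2f} ns, f_IF = {f_IF / 1e6:.3f} MHz")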

2.2.2. Radar Detection Principle

Millimeter-wave radar can obtain information such as the target’s distance, velocity, and angle relative to the radar by processing the intermediate frequency (IF) signal generated by the mixer.
Millimeter-wave radar primarily extracts parameters such as distance, velocity, and angle of the target through the intermediate frequency (IF) signal. The IF signal frequency generated by the radar differs for targets at different distances. Therefore, when multiple targets are identified by the radar, an FFT is performed on the IF signal to separate the different intermediate frequencies $f_{IF}$ and their corresponding amplitudes. By performing a range-domain FFT on the millimeter-wave radar IF information, the target's distance can be calculated as follows:
$R = \frac{c T_c f_{IF}}{2B}$
The maximum measurement distance of millimeter-wave radar is related to the maximum frequency of the IF signal, and the IF bandwidth is limited by the sampling rate, so the maximum distance can be expressed as follows:
$R_{\max} = \frac{F_s c}{2S}$
where $F_s$ is the ADC sampling rate.
The velocity of a target is determined by the FMCW millimeter-wave radar using the phase difference of the intermediate frequency (IF) signal. If the target moves with a radial velocity $v$ relative to the radar and the time interval of the transmitted chirp signal is $T_c$, the phase difference of the IF signal is given by the following:
$\Delta\varphi = \varphi_2 - \varphi_1 = \frac{4\pi v T_c}{\lambda}$
From the above equation, the target's radial velocity relative to the radar is expressed as follows:
$v = \frac{\lambda \Delta\varphi}{4\pi T_c}$
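A small Python sketch of this velocity relation is given below; the carrier frequency and chirp period follow the radar configuration used in this paper, while the measured phase difference is a hypothetical value.

import numpy as np

# Sketch of the velocity relation v = delta_phi * lambda / (4 * pi * T_c).
# Carrier frequency and chirp period follow the radar configuration in this paper;
# the inter-chirp phase difference is a hypothetical measurement.
c, f_c, T_c = 3e8, 77e9, 55e-6
wavelength = c / f_c                      # ~3.9 mm at 77 GHz
delta_phi = 0.5                           # hypothetical phase difference (rad)
v = delta_phi * wavelength / (4 * np.pi * T_c)
print(f"radial velocity = {v:.3f} m/s")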
In order to describe the position of the target in space, in addition to distance and velocity, the azimuth angle and elevation angle are also required. The azimuth angle and elevation angle are important parameters for determining the target’s direction and altitude. The azimuth angle is the angle on the horizontal plane between the target and true north, while the elevation angle is the vertical angle relative to the radar. This paper extracts the angle information of the target from the azimuth and elevation angles.
To obtain angle information from an FMCW millimeter-wave radar, at least two receiving antennas are needed. The wavepath difference between the target and each receiving antenna causes a phase shift, and by using the phase difference, the target's angle information can be determined. The angle measurement principle of the FMCW millimeter-wave radar is shown in Figure 3.
Assuming that the target's angle relative to the radar is $\theta$, the wavepath difference between the echo signals received by RX1 and RX2 in the diagram is $d\sin\theta$. The phase difference of the echo signals captured by the receiving antennas is expressed as follows:
$\Delta\Phi = \frac{2\pi d \sin\theta}{\lambda}, \quad \left|\Delta\Phi\right| < 180^{\circ}$
where $d$ is the distance between the receiving antennas RX1 and RX2, and $\theta$ is the angle of arrival. The angle of arrival $\theta$ can be expressed as follows:
$\theta = \sin^{-1}\left(\frac{\lambda \Delta\Phi}{2\pi d}\right)$
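The following sketch evaluates this angle-of-arrival relation under the assumption of half-wavelength antenna spacing; the phase difference is again a hypothetical measurement.

import numpy as np

# Sketch of the angle-of-arrival relation theta = arcsin(lambda * delta_Phi / (2 * pi * d)).
# Half-wavelength RX spacing is assumed; the phase difference is a hypothetical value.
wavelength = 3e8 / 77e9
d = wavelength / 2                        # assumed RX antenna spacing
delta_Phi = np.pi / 4                     # hypothetical measured phase difference (rad)
theta = np.arcsin(wavelength * delta_Phi / (2 * np.pi * d))
print(f"angle of arrival = {np.degrees(theta):.1f} deg")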
After signal processing of the intermediate frequency (IF) signal, the target’s distance, velocity, angle, and other information can be obtained. Currently, the data are in spherical coordinates with the radar position as the origin. To make the 3D point cloud format more intuitive for human vision, a Cartesian coordinate transformation is used to convert the obtained spherical coordinates into a rectangular coordinate system. The transformation formula for the Cartesian coordinate system can be expressed as follows:
$x = r\cos\theta\cos\varphi, \quad y = r\cos\theta\sin\varphi, \quad z = r\sin\theta$
where $r$ is the target distance, $\varphi$ is the azimuth angle, and $\theta$ is the elevation angle.
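A short Python sketch of this coordinate transformation is given below; the detection used as input is a hypothetical example.

import numpy as np

# Sketch: convert a radar detection from spherical (range, azimuth, elevation)
# to Cartesian coordinates using the transformation given above.
def spherical_to_cartesian(r, azimuth, elevation):
    """r in meters, azimuth/elevation in radians."""
    x = r * np.cos(elevation) * np.cos(azimuth)
    y = r * np.cos(elevation) * np.sin(azimuth)
    z = r * np.sin(elevation)
    return x, y, z

# Hypothetical detection: 1.2 m away, 10 deg azimuth, 5 deg elevation.
print(spherical_to_cartesian(1.2, np.radians(10.0), np.radians(5.0)))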

2.3. Point Cloud Signal Processing

Figure 4 shows the radar module used in this paper. The module uses the RC7711C-AIP radar chip from Skyrelay (Beijing, China). The module consists primarily of two chips: a 77 GHz RF transceiver chip and a radar baseband processor chip.
The radar chip integrates a 4 × 4 antenna array. Figure 5 shows the distribution of the antennas used in the radar module and the virtual antenna array.

2.3.1. Radar Signal Processing Procedure

This paper uses the radar baseband processor to perform processing of the radar signal. The hardware signal processing flow is shown in Figure 6.
After the intermediate frequency (IF) signal is converted into a digital signal by the ADC, it is input into the baseband processor for subsequent signal processing. The baseband processor integrates modules such as FFT, CFAR, and digital beamforming technology.
The raw data collected by the radar’s ADC are organized into a radar data cube. The three dimensions of the cube are the number of sampling points in a single Chirp, the number of Chirps, and the number of transmitting and receiving antennas. First, an FFT is performed on the sampling points of a single Chirp to obtain the distance information. Then, an FFT is performed on the Chirp dimension of the radar data cube to obtain the velocity information and generate the Range-Doppler Map (RDM).
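The following Python sketch illustrates this two-stage FFT processing for a single RX channel; the ADC samples are random placeholders, and only the chirp and sample counts follow Table 2.

import numpy as np

# Sketch: build a Range-Doppler Map (RDM) from one RX channel of the radar data cube.
# adc_data stands in for real ADC samples; its shape (64 chirps x 512 samples) follows Table 2.
num_chirps, num_samples = 64, 512
adc_data = np.random.randn(num_chirps, num_samples)

range_fft = np.fft.fft(adc_data, axis=1)            # FFT along fast time -> range bins
doppler_fft = np.fft.fft(range_fft, axis=0)         # FFT along slow time -> Doppler bins
rdm = np.abs(np.fft.fftshift(doppler_fft, axes=0))  # magnitude, zero Doppler centered
print(rdm.shape)                                    # (64, 512): Doppler bins x range bins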
Next, the SO-CFAR algorithm is applied to the Range-Doppler Map, with the guard cells and reference cells of the CFAR configured. This allows the distance and velocity indices of the target points to be determined, along with the corresponding energy values.
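As an illustration of the smallest-of CFAR principle, a one-dimensional Python sketch is given below; the guard-cell and reference-cell counts and the threshold factor are illustrative choices, not the values configured on the baseband processor.

import numpy as np

# Sketch of 1-D SO-CFAR (smallest-of CFAR): the noise level at each cell is the
# smaller of the leading/lagging reference-window averages; guard cells are excluded.
# Window sizes and the threshold factor are illustrative, not the on-chip settings.
def so_cfar_1d(power, num_guard=2, num_ref=8, threshold_factor=4.0):
    detections = []
    n = len(power)
    for i in range(num_guard + num_ref, n - num_guard - num_ref):
        lead = power[i - num_guard - num_ref : i - num_guard]
        lag = power[i + num_guard + 1 : i + num_guard + 1 + num_ref]
        noise = min(lead.mean(), lag.mean())           # "smallest-of" noise estimate
        if power[i] > threshold_factor * noise:
            detections.append(i)
    return detections

profile = np.abs(np.random.randn(128)) ** 2
profile[40] += 50.0                                    # inject a strong synthetic target
print(so_cfar_1d(profile))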
After obtaining the index information of the point cloud, the corresponding complex values can be extracted from the Range-Doppler Map for subsequent azimuth angle calculation. Based on the antenna arrangement, an FFT operation is performed to obtain the 3D FFT (FFT3D) results. Then, peak detection is performed on the results using the DPK module to determine the azimuth angle index of the target.
To improve the signal-to-noise ratio, beamforming must first be performed based on the azimuth angle index before calculating the elevation angle. The beamforming process involves multiplying and summing the complex values corresponding to the direction vectors of the azimuth angle index and the antenna positions, ultimately obtaining three sets of complex values for elevation angle calculation. Then, the elevation angle information is obtained by calculating the maximum index value in the 4D FFT. At this point, the point cloud information is represented in 3D spherical coordinates, which can be converted into a rectangular coordinate system for further processing and analysis.

2.3.2. Radar Parameters

The parameters of the frequency-modulated wave are set as shown in Table 1.
The parameters for data sampling are set as shown in Table 2.
The distance resolution of the FMCW millimeter-wave radar is as follows:
$R_{res} = \frac{c}{2B}$
The radar’s distance resolution is enhanced by increasing the effective bandwidth of the millimeter-wave radar. Based on the bandwidth parameters of the radar configuration in this paper, the distance resolution is 0.05 m.
The maximum measurable distance of the FMCW millimeter-wave radar is 12.8 m.
The velocity resolution of the FMCW millimeter-wave radar can be expressed as follows:
$v_{res} = \frac{\lambda}{2 T_f}$
where $T_f = N T_c$ represents the time of each frame of the signal. Based on the radar configuration parameters in this paper, the velocity resolution is 0.1101 m/s.
The angle resolution of the millimeter-wave radar is given by the following:
$\theta_{res} = \frac{\lambda}{N_A d \cos\theta}$
where $N_A$ represents the number of antennas and $d$ is the antenna spacing; for half-wavelength spacing at boresight, this reduces to $2/N_A$ rad. The antennas used for the azimuth angle are numbered 9–16, as shown in Figure 5, so the azimuth angle resolution is 2/8 rad. For the targets detected in the azimuth dimension, digital beamforming technology is applied and the signals are combined in the elevation dimension. In this case, all antennas are used for the elevation dimension, and the elevation angle resolution is 2/3 rad.
Due to the short frame interval time, there is no significant difference between consecutive frames, and the accumulated point clouds can represent the continuity of motion. In the experiment, three frames of point clouds are combined into one frame, reducing the number of frames and increasing the number of point clouds in each frame.
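A minimal sketch of this frame accumulation step is shown below; the per-frame point counts are random placeholders.

import numpy as np

# Sketch: accumulate every three consecutive radar frames into one denser frame.
# frames stands in for a recorded sequence, each element a (P_i, 3) array of point coordinates.
frames = [np.random.randn(np.random.randint(20, 60), 3) for _ in range(30)]
merged = [np.concatenate(frames[i:i + 3], axis=0) for i in range(0, len(frames) - 2, 3)]
print(len(frames), "->", len(merged), "frames")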

2.4. Point Cloud Gesture Recognition Model

2.4.1. SequentialPointNet Network

The SequentialPointNet network is an improvement and extension of PointNet++, designed to extract features in both the time and space domains of 3D point cloud sequences. Spatial features within a single point cloud frame are extracted using the PointNet++ network. Positional encoding is introduced into the network, where a position vector is embedded within each point cloud frame in the 3D point cloud sequence, enhancing the ability to express temporal information and enabling the extraction of temporal features.
Point cloud sequences are processed using the network. The network takes a multi-frame point cloud sequence as input, with each frame represented as a five-dimensional tensor of shape (B, F, d, N, k). Here, B denotes the batch size, F the number of frames, d the feature dimension, N the number of sampled points, and k the number of k-nearest neighbors. The entire network consists of multiple submodules, including a sampling layer, grouping layer, PointNet layer, max pooling layer, and fully connected layer. After grouping and sampling the point clouds, the point cloud’s spatial and temporal features are extracted in sequence.
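The following PyTorch sketch illustrates how a single frame can be sampled and grouped into the (d, N, k) layout described above; random sampling stands in for the network's actual sampling layer, and the sample and neighbor counts are illustrative.

import torch

# Sketch: turn one point cloud frame into the grouped (d, N, k) representation.
# Random sampling replaces the network's actual sampling layer; sizes are illustrative.
def sample_and_group(points, num_samples=128, k=16):
    """points: (P, 3) tensor of xyz coordinates for one frame."""
    idx = torch.randperm(points.shape[0])[:num_samples]
    centroids = points[idx]                                   # (N, 3) sampled centers
    dists = torch.cdist(centroids, points)                    # (N, P) pairwise distances
    knn_idx = dists.topk(k, largest=False).indices            # (N, k) nearest neighbors
    groups = points[knn_idx] - centroids.unsqueeze(1)         # (N, k, 3) local coordinates
    return groups.permute(2, 0, 1)                            # (d=3, N, k)

frame = torch.randn(512, 3)                                   # one frame with 512 points
print(sample_and_group(frame).shape)                          # torch.Size([3, 128, 16])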
As seen in Figure 7, first, spatial feature extraction is performed frame by frame through the netR_T_S1 and netR_T_S2 modules, with the features weighted using CBAM between the modules. Then, the features weighted by the attention mechanism are merged to form a new input. The net4DV_T1 module is then used to extract inter-frame features, obtaining higher-level temporal features. Subsequently, a max pooling layer and positional encoding (PE) are applied to strengthen the temporal feature representation, and temporal features are further refined by the net4DV_T2 module. After several average pooling layers and max pooling layers, the extracted features are aggregated. Finally, the spatial features, the first extracted temporal features, and the second temporal features are integrated and classified using fully connected layers.
The structures of the netR_T_S1, netR_T_S2, and net4DV_T1 modules are similar, as shown in Figure 8. Each module is primarily composed of three 2D convolutional layers, with batch normalization applied after each convolutional layer to speed up training and stabilize the process, and ReLU applied to introduce nonlinearity. Finally, to compress and downsample the features without losing key details, a max pooling layer is applied.
Figure 9 illustrates the structure of the net4DV_T2 module. It uses 2D convolutional layers, batch normalization layers, and ReLU activation functions to further extract features and perform nonlinear transformations. The overall network structure is designed to fully extract and integrate the spatial and temporal features of the point cloud data, thereby achieving accurate classification tasks.
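A PyTorch sketch of such a convolution-batch normalization-ReLU block with neighbor max pooling is given below; the channel widths and the pooled dimension are assumptions for illustration rather than the exact configuration of the netR_T_S1 module.

import torch
import torch.nn as nn

# Sketch of a netR_T_S1-style block: three 2D convolutions, each followed by batch
# normalization and ReLU, then max pooling over the neighbor dimension.
# Channel widths and the pooling axis are illustrative assumptions.
class ConvBNReLUBlock(nn.Module):
    def __init__(self, in_ch=3, channels=(64, 64, 128)):
        super().__init__()
        layers, prev = [], in_ch
        for ch in channels:
            layers += [nn.Conv2d(prev, ch, kernel_size=1),
                       nn.BatchNorm2d(ch),
                       nn.ReLU(inplace=True)]
            prev = ch
        self.body = nn.Sequential(*layers)

    def forward(self, x):                 # x: (B, d, N, k)
        x = self.body(x)                  # (B, C, N, k)
        return x.max(dim=-1).values       # max-pool over the k neighbors -> (B, C, N)

block = ConvBNReLUBlock()
print(block(torch.randn(2, 3, 128, 16)).shape)   # torch.Size([2, 128, 128])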

2.4.2. Initial Attention Mechanism Replacement

The CBAM (Convolutional Block Attention Module) has shown excellent performance in enhancing convolutional networks. However, its main drawback lies in its reliance on local information processing, which makes it difficult to capture global contextual information. This limitation may reduce performance on data such as point clouds, which contain rich global information. In this study, the CBAM is replaced with the GAM (Global Attention Mechanism), which enhances global cross-dimension interactions while reducing information dispersion. By computing global context, GAM effectively captures global information in gesture features and strengthens the model's attention to important features.
As shown in Figure 10, the GAM consists of two main submodules, a channel attention submodule and a spatial attention submodule, and is computed as follows:
$F_2 = M_c(F_1) \otimes F_1, \quad F_3 = M_s(F_2) \otimes F_2$
where $F_1$ is the input feature, $F_2$ is the intermediate state, $F_3$ is the output feature, $M_c(\cdot)$ and $M_s(\cdot)$ are the channel and spatial attention maps, and $\otimes$ denotes element-wise multiplication.
The channel attention submodule uses 3D convolution and multi-layer perceptrons (MLPs) to retain information across channels, spatial width, and height, enhancing global interaction features. On the other hand, spatial information is merged by the spatial attention submodule through convolutional layers. GAM addresses the issue of insufficient global feature representation in CBAM when processing point clouds. By simultaneously focusing on the interaction between channel and spatial dimensions, GAM enhances the ability to express global features.
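The following PyTorch sketch outlines a GAM-style block along these lines; it follows the published GAM design only in broad strokes, and the reduction ratio and 7 × 7 spatial kernel are illustrative choices.

import torch
import torch.nn as nn

# Sketch of a GAM-style attention block: a channel submodule (permutation plus a
# two-layer MLP) followed by a spatial submodule (two convolutions). Reduction ratio
# and kernel size are illustrative choices, not the values used in this paper.
class GAM(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = channels // reduction
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, channels))
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=7, padding=3),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=7, padding=3),
            nn.BatchNorm2d(channels))

    def forward(self, x):                                     # x: (B, C, H, W)
        b, c, h, w = x.shape
        ca = self.channel_mlp(x.permute(0, 2, 3, 1).reshape(b, -1, c))
        ca = torch.sigmoid(ca.reshape(b, h, w, c).permute(0, 3, 1, 2))
        x = x * ca                                            # F2 = Mc(F1) (x) F1
        sa = torch.sigmoid(self.spatial(x))
        return x * sa                                         # F3 = Ms(F2) (x) F2

print(GAM(64)(torch.randn(2, 64, 32, 16)).shape)              # torch.Size([2, 64, 32, 16])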

2.4.3. Multiscale Feature Extraction Module

In the SequentialPointNet network, although spatial and temporal features of the point cloud sequence are extracted through a series of convolution operations, there are certain limitations in capturing multiscale features, especially when handling complex scenes with information at different scales, where the model’s adaptability is relatively poor. To tackle this problem, a multiscale feature extraction module is introduced to improve the feature extraction performance for point cloud data, particularly in understanding the context at different scales.
The innovation of this module lies in the introduction of two convolutional branches, each consisting of a convolutional layer and paired with batch normalization (BN) to optimize the feature extraction process. The two convolutional layers use kernels of different sizes, which allows for feature extraction at different scales. This design enables the module to simultaneously capture rich contextual information from both local and global scales, compensating for the limitations of traditional single-scale convolutional kernels in understanding point cloud features. The outputs from the two convolutional branches are merged to create a multiscale feature combination, which is used as the module’s final output. As shown in Figure 11, the improved multiscale feature extraction module effectively expands the network’s receptive field by simultaneously performing 1 × 1 and 3 × 3 convolution operations, enabling it to better capture detailed features and global structural information in the point cloud. The receptive field is determined using the formula below:
$F_i = (F_{i-1} - 1) \times Stride + K_{size}$
In this formula, $F_i$ refers to the receptive field of layer $i$, $Stride$ represents the stride, and $K_{size}$ represents the kernel size.
This method not only improves the feature extraction capability but also enhances the network’s performance in complex scenes, enabling the model to exhibit stronger adaptability and robustness when processing point cloud data with multiscale information.
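A PyTorch sketch of this two-branch design is given below; the channel counts are illustrative assumptions, and only the 1 × 1 and 3 × 3 kernels with batch normalization and the concatenation of the branch outputs follow the description above.

import torch
import torch.nn as nn

# Sketch of the multiscale feature extraction module: two parallel branches with
# 1x1 and 3x3 convolutions, each followed by batch normalization, whose outputs
# are concatenated. Channel counts are illustrative assumptions.
class MultiScaleFeature(nn.Module):
    def __init__(self, in_ch=64, out_ch=64):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=1),
                                     nn.BatchNorm2d(out_ch))
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                                     nn.BatchNorm2d(out_ch))

    def forward(self, x):                       # x: (B, C, N, k) grouped point features
        return torch.cat([self.branch1(x), self.branch3(x)], dim=1)

print(MultiScaleFeature()(torch.randn(2, 64, 128, 16)).shape)   # torch.Size([2, 128, 128, 16])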

2.4.4. Separable MLP

The original SequentialPointNet network contains two Set Abstraction (SA) layers. An SA layer is composed of a sampling layer, a grouping layer, and an enhanced PointNet layer, extracting point cloud features with a multi-layer perceptron (MLP). However, this design has some limitations in feature extraction. All three layers of the MLP are computed on the neighbor features, making it difficult to effectively capture local structural information and leading to the loss of detailed point features. To further improve the network’s performance, this paper introduces a separable MLP to optimize the SA layers, which allows for better capture of local structural information while preserving the details of point features.
The design of the separable MLP decomposes the traditional multi-layer perceptron (MLP) into two parts. Figure 12 shows the detailed structure of the separable MLP. One part is computed on the neighbor features, and the other part is computed on the point features. The single-layer MLP computed on the neighbor features is responsible for capturing local geometric structural information, while the subsequent two layers of MLP focus on further refining and enhancing the point features. Additionally, residual connections are employed to stabilize the training process and improve the quality of feature representation by fusing the local aggregation information with the original point features, thus enhancing model performance. The specific formula is as follows:
$f_i^{res} = \mathrm{MLP}(f_i^l), \quad f_i^{l+1} = f_i^{res} + \mathrm{MLP}\left(\mathcal{R}\left(\left\{ f_j^{res} \mid j \in \mathcal{N}(i) \right\}\right)\right)$
where $f_i^l$ is the feature of point $i$ at layer $l$, $\mathcal{N}(i)$ is the neighborhood of point $i$, and $\mathcal{R}(\cdot)$ is the neighborhood aggregation operation.
By processing different types of features in stages, this design enables more effective extraction and retention of key information in the point cloud data, improving the efficiency and quality of feature extraction and allowing the model to better adapt to complex point cloud data.
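The following PyTorch sketch illustrates the fusion formula above; the layer widths, the use of max pooling as the aggregation $\mathcal{R}(\cdot)$, and the way neighbor indices are supplied are illustrative assumptions.

import torch
import torch.nn as nn

# Sketch of the separable-MLP idea: point features are lifted by a point-wise MLP
# (f_res), the neighbors' f_res features are aggregated (max over k neighbors here)
# and passed through a second MLP, and the two are fused with a residual connection.
# Layer widths and the choice of max pooling as R(.) are illustrative assumptions.
class SeparableMLP(nn.Module):
    def __init__(self, in_ch=64, ch=64):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Conv1d(in_ch, ch, 1), nn.BatchNorm1d(ch), nn.ReLU(inplace=True))
        self.neighbor_mlp = nn.Sequential(
            nn.Conv1d(ch, ch, 1), nn.BatchNorm1d(ch), nn.ReLU(inplace=True),
            nn.Conv1d(ch, ch, 1), nn.BatchNorm1d(ch))

    def forward(self, points, knn_idx):
        # points: (B, C, N) per-point features; knn_idx: (B, N, k) neighbor indices.
        f_res = self.point_mlp(points)                                    # (B, C, N)
        b, c, n = f_res.shape
        idx = knn_idx.reshape(b, -1)                                      # (B, N*k)
        neighbors = torch.gather(f_res, 2, idx.unsqueeze(1).expand(-1, c, -1))
        neighbors = neighbors.reshape(b, c, n, -1).max(dim=-1).values     # R(.): max over k
        return f_res + self.neighbor_mlp(neighbors)                       # residual fusion

layer = SeparableMLP()
out = layer(torch.randn(2, 64, 128), torch.randint(0, 128, (2, 128, 16)))
print(out.shape)                                                          # torch.Size([2, 64, 128])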

3. Experimental Verification

3.1. Dataset Construction

In this paper, six types of human gestures are collected using a 77 GHz FMCW millimeter-wave radar. After processing the raw radar data, a dataset for human gesture recognition is constructed. Training is performed by inputting the dataset into a deep learning network.
The radar is positioned 1.5 m above the ground. Ten experimental subjects sequentially perform six predefined human gestures at distances ranging from 1 m to 1.5 m from the radar. As shown in Figure 13, the gestures are as follows: push (gesture 1), pull (gesture 2), swipe left (gesture 3), swipe right (gesture 4), swipe up (gesture 5), and swipe down (gesture 6). These six gestures are commonly used in daily life: whether operating smart devices, controlling home appliances, or interacting in virtual reality, users typically perform these basic gestures naturally, ensuring that the research results have broad applicability. The experiment participants include 5 males and 5 females, with ages ranging from 18 to 30 years and heights ranging from 155 cm to 185 cm; the amplitude and range of the gestures demonstrated by each participant vary. Each action is repeated 20 times by each participant, resulting in a total of 1200 human gesture recognition samples. Each data sample is named based on the subject number and gesture number. In subsequent experiments, a training set and a test set are created by dividing the data for each gesture, with different experimental subjects assigned proportionally to each group.
During the radar data collection process, the radar module communicates with the host computer via a serial port; the host computer sends commands and collects the radar data. The duration of each action is 1–2 s, and the number of frames is between 20 and 30. The 3D point cloud sequence of the left swipe gesture is shown in Figure 14.
In the subsequent data processing, the 3D point cloud sequence is converted into depth images, reducing the computational complexity of the data while preserving the point cloud coordinates. The depth image sequence of the left swipe gesture is shown in Figure 15.

3.2. Network Training

Using the PyTorch (1.13.0) deep learning framework, the network model is constructed in this paper. The model runs on the Windows 11 64-bit operating system and is trained using an NVIDIA Tesla P100 (NVIDIA Corporation, Santa Clara, CA, USA). The initial value of the learning rate is 0.0001, with 100 iterations for the model and a batch size of 32.
AdamW is selected as the optimizer for the network to update the model parameters. Compared to Adam, AdamW improves the model’s generalization performance and training stability by correctly implementing weight decay, avoiding the negative effects caused by L2 regularization. The formula for its principle is as follows:
$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$
$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$
$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon} + w \theta_t \right)$
where $\beta_1$ and $\beta_2$ are the exponential decay constants, $m_t$ and $v_t$ are the first and second moment estimates ($\hat{m}_t$ and $\hat{v}_t$ their bias-corrected values), $w$ is the weight decay factor, $\eta$ is the learning rate, $g_t$ is the gradient of the parameters, $\varepsilon$ is a small constant for numerical stability, and $\theta_t$ are the model parameters to be learned.
Additionally, label smoothing is applied by smoothing the target labels, which reduces the model’s overconfidence in a single label, alleviates overfitting, improves generalization, and enhances the robustness to data noise and uncertainty. The formula for its principle is as follows:
$y_i^{LS} = y_i (1 - \alpha) + \alpha / K$
where $y_i$ is the original label, $\alpha$ represents the label smoothing parameter, and $K$ denotes the total number of categories in the multiclass classification.
Additionally, the model applies cosine annealing to decrease the learning rate gradually using a cosine function. This smooth change in the learning rate improves the stability and convergence efficiency of the training, helps the model escape from local minima, and allows for finer parameter adjustments in the later stages of training, thus enhancing the overall performance of the model. The formula for its principle is as follows:
$lr = lr_{end} + 0.5 \times (lr_{init} - lr_{end}) \times \left(1 + \cos\left(\frac{step}{total} \times \pi\right)\right)$
In this context, $lr$ denotes the current learning rate, $lr_{init}$ the initial rate, $lr_{end}$ the minimum rate, $step$ the current training step, and $total$ the total number of steps.
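A condensed PyTorch (1.13) sketch of this training configuration is shown below; the classifier is a stand-in for the actual network, and the label smoothing factor, weight decay, and minimum learning rate are illustrative values that are not reported in this paper.

import torch
import torch.nn as nn

# Sketch of the training setup described above: AdamW with weight decay,
# cross-entropy with label smoothing, and cosine-annealing learning-rate decay.
# The model is a placeholder; smoothing factor, weight decay, and eta_min are
# illustrative values rather than the paper's exact settings.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 128, 6))      # placeholder classifier, 6 gestures
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)            # y_LS = y*(1-alpha) + alpha/K
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

for epoch in range(100):
    # inputs/labels would come from the gesture point cloud dataset; random stand-ins here
    inputs, labels = torch.randn(32, 3, 128), torch.randint(0, 6, (32,))
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()                                            # cosine learning-rate decay per epoch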
To determine the ideal ratio between the training set and the test set, three ratios were set for validation: 5:5, 6:4, and 7:3. The results are shown in Table 3. As more data are allocated to the training set, the model’s recognition performance significantly improves. This is because the increase in data directly impacts the model’s performance. The model achieves optimal recognition performance when the training-to-test-set ratio is 7:3, indicating that more training data significantly enhance the model’s generalization ability and accuracy. The model’s accuracy starts to converge and stabilize after approximately 15 iterations.

3.3. Model Performance Analysis

To assess the model’s effectiveness, experiments were performed and compared against PointNet++, MeteorNet, P4Transformer, and SequentialPointNet in this study. In Table 4, the experimental results are presented.
As shown in the table above, the algorithm presented in this paper performs better for the six human gestures—push, pull, swipe left, swipe right, swipe up, and swipe down. Moreover, the accuracy of the proposed MSFE-GAM-SPointNet network shows a significant improvement of 3.2% compared to the SequentialPointNet network. By applying the Global Attention Mechanism (GAM) to point cloud sequence feature extraction, the model’s focus on key features is effectively enhanced, while irrelevant information is ignored. This improves the quality of feature representation and the overall performance of the model. The multiscale feature extraction module uses convolutional kernels of different scales, enhancing the feature extraction capability of the network, spanning local to global levels. This design allows the network to capture and integrate information at multiple scales, leading to a better understanding of the complex structures within the data. The separable MLP, by decomposing the MLP into two parts, is designed to capture local information more effectively. This design allows the network to more accurately capture and utilize local features when processing point cloud data. The experimental accuracy and loss values of the algorithm are shown in Figure 16.
The test set is used to evaluate the model's performance, and the confusion matrix is displayed in Figure 17, which clearly shows the model's recognition of the different human gestures. The overall recognition accuracy reaches 99.5%, with the recognition accuracy of each gesture being relatively close. The recognition accuracy for the swipe left, swipe right, and swipe down gestures is 99%, while the push, pull, and swipe up gestures reach 100% accuracy.

3.4. Ablation Experiment

This study involves four experimental setups to evaluate the proposed method’s effectiveness. The experimental setups are shown in Table 5 and include the original SequentialPointNet network, replacement with the GAM attention mechanism, use of multiscale feature extraction, and adjustment to a separable MLP.
Table 6 demonstrates a clear improvement in model performance achieved through the network optimizations in this study. Comparing the model accuracy of Experiments 1 and 2 shows that replacing the attention mechanism with GAM improved the accuracy by 1.8%. Compared to CBAM, GAM has a stronger advantage in focusing on the global features of the point cloud, effectively enhancing the model's ability to perceive overall structures and patterns, which in turn improves gesture recognition accuracy.
Comparing Experiments 2 and 3, it is found that using the multiscale feature extraction module improved the model’s accuracy by 0.9%. The key improvement lies in setting the convolutional kernels in the multiscale feature extraction module to 1 × 1 and 3 × 3, enabling the network to capture both local and large-scale global features at the same time. Through this multiscale convolution operation, the model is better able to adapt to the complex forms and diverse gesture structures in point cloud data when handling information at different scales. This improvement effectively enhances the network’s feature extraction capability, further boosting gesture recognition accuracy.
Comparing Experiments 3 and 4, the model accuracy improved by 0.3% after modifying the network to use a separable MLP structure. In this network structure, the single-layer convolution is responsible for extracting local point cloud features, effectively capturing the geometric information and relative positional relationships of each point. The two-layer convolution further extracts point cloud features, enhancing the model’s understanding of both local structures and global geometry in the point cloud.

4. Conclusions

This paper proposes a gesture recognition method based on millimeter-wave radar point clouds. By using millimeter-wave radar to generate three-dimensional point cloud data of gestures, the MSFE-GAM-SPointNet network is employed to extract gesture features and classify them. A series of experiments demonstrate the effectiveness of our method, achieving an overall accuracy of 99.5%. Further observation reveals that the recognition accuracy for different gestures is relatively balanced. This proves that the combination of three-dimensional point cloud sequences and deep learning architectures can accurately capture the key features of gesture movements.
Future research will focus on eliminating interference from the surrounding environment to improve the system’s accuracy and reliability. By utilizing signal optimization algorithms, the aim is to reduce the impact of external factors on experimental results, thereby ensuring more precise recognition. This direction will help enhance the system’s adaptability in complex environments and expand its range of applications.

Author Contributions

Conceptualization, W.L., Z.H. and Z.G.; methodology, W.L. and Z.G.; software, Z.G.; validation, W.L. and Z.G.; investigation, W.L. and Z.H.; resources, Z.H. and W.L.; writing—original draft preparation, Z.G.; writing—review and editing, W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are unavailable due to the privacy policy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jiang, W.; Ren, Y.; Liu, Y.; Wang, Z.; Wang, X. Recognition of Dynamic Hand Gesture Based on Mm-Wave Fmcw Radar MicroDoppler Signatures. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 4905–4909. [Google Scholar]
  2. De Crescenzio, F.; Fantini, M.; Persiani, F.; Di Stefano, L.; Azzari, P.; Salti, S. Augmented Reality for Aircraft Maintenance Training and Operations Support. IEEE Comput. Graph. Appl. 2011, 31, 96–101. [Google Scholar] [CrossRef] [PubMed]
  3. Geng, K.; Yin, G. Using Deep Learning in Infrared Images to Enable Human Gesture Recognition for Autonomous Vehicles. IEEE Access 2020, 8, 88227–88240. [Google Scholar] [CrossRef]
  4. Li, A.; Bodanese, E.; Poslad, S.; Hou, T.; Wu, K.; Luo, F. A Trajectory-Based Gesture Recognition in Smart Homes Based on the Ultrawideband Communication System. IEEE Internet Things J. 2022, 9, 22861–22873. [Google Scholar] [CrossRef]
  5. Gupta, H.P.; Chudgar, H.S.; Mukherjee, S.; Dutta, T.; Sharma, K. A Continuous Hand Gestures Recognition Technique for Human-Machine Interaction Using Accelerometer and Gyroscope Sensors. IEEE Sens. J. 2016, 16, 6425–6432. [Google Scholar] [CrossRef]
  6. Rocamora, J.M.; Wang-Hei Ho, I.; Mak, W.; Lau, A.P. Survey of CSI Fingerprinting-based Indoor Positioning and Mobility Tracking Systems. IET Signal Process. 2020, 14, 407–419. [Google Scholar] [CrossRef]
  7. Shotton, J.; Fitzgibbon, A.; Cook, M.; Sharp, T.; Finocchio, M.; Moore, R.; Kipman, A.; Blake, A. Real-Time Human Pose Recognition in Parts from Single Depth Images. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; IEEE: New York, NY, USA, 2011; pp. 1297–1304. [Google Scholar]
  8. Nishida, N.; Nakayama, H. Multimodal Gesture Recognition Using Multi-Stream Recurrent Neural Network. In Proceedings of the Image and Video Technology: 7th Pacific-Rim Symposium, PSIVT 2015, Auckland, New Zealand, 25–27 November 2015; Springer International Publishing: Cham, Switzerland, 2016; Volume 9431, pp. 682–694. [Google Scholar]
  9. Yang, M.; Zhu, H.; Zhu, R.; Wu, F.; Yin, L.; Yang, Y. WiTransformer: A Novel Robust Gesture Recognition Sensing Model with WiFi. Sensors 2023, 23, 2612. [Google Scholar] [CrossRef] [PubMed]
  10. Gatteschi, V.; Lamberti, F.; Montuschi, P.; Sanna, A. Semantics-Based Intelligent Human-Computer Interaction. IEEE Intell. Syst. 2016, 31, 11–21. [Google Scholar] [CrossRef]
  11. Kim, Y.; Toomajian, B. Hand Gesture Recognition Using Micro-Doppler Signatures with Convolutional Neural Network. IEEE Access 2016, 4, 7125–7130. [Google Scholar] [CrossRef]
  12. Malysa, G.; Wang, D.; Netsch, L.; Ali, M. Hidden Markov Model-Based Gesture Recognition with FMCW Radar. In Proceedings of the 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Washington, DC, USA, 7–9 December 2016; IEEE: Washington, DC, USA, 2016; pp. 1017–1021. [Google Scholar]
  13. De Miguel, K.; Brunete, A.; Hernando, M.; Gambao, E. Home Camera-Based Fall Detection System for the Elderly. Sensors 2017, 17, 2864. [Google Scholar] [CrossRef] [PubMed]
  14. Liu, B.; Ma, K.; Fu, H.; Wang, K.; Meng, F. Recent Progress of Silicon-Based Millimeter-Wave SoCs for Short-Range Radar Imaging and Sensing. IEEE Trans. Circuits Syst. II Express Briefs 2022, 69, 2667–2671. [Google Scholar] [CrossRef]
  15. Dong, S.; Zhang, Y.; Ma, C.; Zhu, C.; Gu, Z.; Lv, Q.; Zhang, B.; Li, C.; Ran, L. Doppler Cardiogram: A Remote Detection of Human Heart Activities. IEEE Trans. Microw. Theory Tech. 2020, 68, 1132–1141. [Google Scholar] [CrossRef]
  16. Ko, M.M.; Moriyama, T. Noncontact Monitoring of Respiration and Heartbeat Based on Two-Wave Model Using a Millimeter-Wave MIMO FM-CW Radar. Electronics 2024, 13, 4308. [Google Scholar] [CrossRef]
  17. Cardillo, E.; Li, C.; Caddemi, A. Radar-Based Monitoring of the Worker Activities by Exploiting Range-Doppler and Micro-Doppler Signatures. In Proceedings of the 2021 IEEE International Workshop on Metrology for Industry 4.0 & IoT (MetroInd4.0&IoT), Rome, Italy, 7–9 June 2021; pp. 412–416. [Google Scholar] [CrossRef]
  18. Zhang, S.; Li, G.; Ritchie, M.; Fioranelli, F.; Griffiths, H. Dynamic Hand Gesture Classification Based on Radar Micro-Doppler Signatures. In Proceedings of the 2016 CIE International Conference on Radar (RADAR), Guangzhou, China, 10–13 October 2016; IEEE: Guangzhou, China, 2016; pp. 1–4. [Google Scholar]
  19. Wei, H.; Li, Z.; Galvan, A.D.; Su, Z.; Zhang, X.; Pahlavan, K.; Solovey, E.T. IndexPen: Two-Finger Text Input with Millimeter-Wave Radar. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2022, 6, 79. [Google Scholar] [CrossRef]
  20. Tian, Y.; Cao, Z.; Deng, Y.; Li, J.; Cui, Z. Feature Reconstruction for Multi-Hand Gesture Signals Separation Based on Enhanced Music Using Millimeter-Wave Radar. In Proceedings of the IGARSS 2024–2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 7440–7443. [Google Scholar]
  21. Charles, R.Q.; Su, H.; Kaichun, M.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Honolulu, HI, USA, 2017; pp. 77–85. [Google Scholar]
  22. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 5105–5114. [Google Scholar]
  23. Wang, Y.; Xiao, Y.; Xiong, F.; Jiang, W.; Cao, Z.; Zhou, J.T.; Yuan, J. 3DV: 3D Dynamic Voxel for Action Recognition in Depth Video. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; IEEE: Seattle, WA, USA, 2020; pp. 508–517. [Google Scholar]
  24. Fan, H.; Yang, Y.; Kankanhalli, M. Point 4D Transformer Networks for Spatio-Temporal Modeling in Point Cloud Videos. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Nashville, TN, USA, 2021; pp. 14199–14208. [Google Scholar]
  25. Li, X.; Huang, Q.; Wang, Z.; Yang, T.; Hou, Z.; Miao, Z. Real-Time 3-D Human Action Recognition Based on Hyperpoint Sequence. IEEE Trans. Ind. Inform. 2023, 19, 8933–8942. [Google Scholar] [CrossRef]
  26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  27. Liu, Y.; Shao, Z.; Hoffmann, N. Global Attention Mechanism: Retain Information to Enhance Channel-Spatial Interactions. arXiv 2021, arXiv:2112.05561. [Google Scholar]
Figure 1. Gesture recognition system.
Figure 2. Radar principle.
Figure 3. Millimeter-wave radar angle measurement principle.
Figure 4. Radar hardware component module.
Figure 5. Radar chip antenna and virtual antenna array.
Figure 6. Hardware signal processing workflow.
Figure 7. Network architecture diagram.
Figure 8. NetR_T_S1 module.
Figure 9. net4DV_T2 module.
Figure 10. GAM attention module.
Figure 11. Multiscale feature extraction module.
Figure 12. Separable MLP structure.
Figure 13. Gesture diagram.
Figure 14. Three-dimensional point cloud image of left swipe.
Figure 15. Depth image of left swipe.
Figure 16. Experimental accuracy and loss curve of the model in this paper.
Figure 17. Confusion matrix.
Table 1. Chirp parameter settings.

Parameter                               Value
Initial Frequency (GHz)                 77.000
Frequency Modulation Slope (MHz/μs)     58.545
Effective Bandwidth (GHz)               2.9915
Chirp Time (μs)                         55
Table 2. Sampling parameter settings.

Parameter                               Value
Number of Sample Points                 512
Sampling Rate (ksps)                    10,000
Number of Chirps                        64
Number of Antennas                      16
Frame Interval Time (ms)                50
Table 3. Model accuracy with different dataset ratios.

Training–Testing Split Ratio            Accuracy (%)
5:5                                     82.8
6:4                                     91.6
7:3                                     96.3
Table 4. Accuracy comparison of different networks.

Model                                   Accuracy (%)
PointNet++                              70.5
MeteorNet                               83.7
P4Transformer                           95.0
SequentialPointNet                      96.3
MSFE-GAM-SPointNet (ours)               99.5
Table 5. Experimental setups.

Scheme Number    GAM    Multiscale Feature Extraction Module    Separable MLP
1                –      –                                       –
2                ✓      –                                       –
3                ✓      ✓                                       –
4                ✓      ✓                                       ✓
Table 6. Experimental results.

Scheme Number                           Accuracy (%)
1                                       96.3
2                                       98.1
3                                       99.2
4                                       99.5