Article

Multi-Person Action Recognition Based on Millimeter-Wave Radar Point Cloud

Xiaochao Dang, Kai Fan, Fenfang Li, Yangyang Tang, Yifei Gao and Yue Wang
1 College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China
2 Gansu Province Internet of Things Engineering Research Center, Lanzhou 730070, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(16), 7253; https://doi.org/10.3390/app14167253
Submission received: 12 July 2024 / Revised: 13 August 2024 / Accepted: 16 August 2024 / Published: 17 August 2024
(This article belongs to the Special Issue Advances in HCI: Recognition Technologies and Their Applications)

Featured Application

This research has important applications in areas such as smart homes and human–computer interaction. It promises a more efficient, comfortable living experience as well as new intelligent services.

Abstract

Human action recognition has broad application prospects in human–computer interaction, smart homes, healthcare, and other fields. Traditional human motion recognition methods have limitations regarding privacy protection, complex environments, and multi-person scenarios. Millimeter-wave radar has attracted attention due to its ultra-high resolution and all-weather operation. Many existing studies have discussed the application of millimeter-wave radar in single-person scenarios, but few have addressed action recognition in multi-person scenarios. This paper uses a commercial millimeter-wave radar device for human action recognition in multi-person scenarios. To address the severe interference and complex target segmentation in multi-person scenarios, we propose a filtering method based on millimeter-wave inter-frame differences to filter the collected human point cloud data. We then use the DBSCAN algorithm and the Hungarian algorithm to segment the targets, and finally input the data into a neural network for classification. In experimental tests with the five actions we defined, the proposed system reaches a classification accuracy of 92.2% in multi-person scenarios.

1. Introduction

The demand for human action recognition is increasing with the continuous development of Internet of Things technology. Human action recognition has broad application prospects in human–computer interaction, smart homes, medical and health care, and related fields, and more accurate human activity recognition will give people a more convenient, comfortable lifestyle and a more intelligent experience. Traditional human activity recognition is mainly realized using camera devices [1], wearable devices [2,3], infrared sensors [4,5], and other devices. Camera devices can achieve a high accuracy rate, but they are prone to leaking user privacy and are susceptible to environmental influences such as dark scenes and weather conditions such as rain, snow, and haze. Infrared devices can be used in dark environments, but they are easily disturbed by other heat and light sources. Wearable devices require users to wear them for long periods, which makes it difficult to provide a comfortable and convenient experience; they are usually one device per person, so they are costly and cannot be used for multi-person recognition, as shown in Table 1.
In contrast, researchers are giving more and more attention to wireless sensing devices due to their low cost, non-invasiveness, and privacy protection. In recent years, owing to the ubiquity of Wi-Fi signals [6], many studies have used Wi-Fi signals for human action recognition [7]. Wang et al. showed how to use a Wi-Fi router to detect whether a person has fallen [8]. Wang et al. used differences in channel state information for human action recognition [9]. However, the bandwidth of Wi-Fi signals is narrow, and their resolution is insufficient for high-precision, high-robustness classification. Moreover, most Wi-Fi-based human action recognition relies on channel state information (CSI), which cannot support multi-target human action recognition.
Millimeter-wave radar can provide a higher spatial resolution due to its ultra-high bandwidth (30–300 GHz). Compared to infrared and laser equipment, millimeter-wave radar can penetrate fog, smoke, and dust and is characterized by its all-weather, round-the-clock capability [10]. In addition, millimeter-wave radar also has some imaging capabilities. Fairchild et al. collected data using millimeter-wave radar and extracted micro-Doppler features for human action recognition [11]. RadHAR [12] generated human body point clouds using millimeter-wave radar, voxelized them, and fed them into a deep convolutional neural network for recognition. In addition, several studies have used graph neural networks (GNNs) to classify point clouds; for example, Gong et al. [13] designed a time-distributed MMPoint-GNN + bidirectional LSTM activity recognition framework.
The micro-Doppler feature extraction methods in previous studies do not fully utilize the spatial characteristics of human actions and cannot yet classify the actions of multiple people. Methods that voxelize the point clouds at higher resolutions leave most voxels empty, and the computational overhead of 3D convolutional networks is very high [14]. Designing a millimeter-wave radar system that can recognize multiple people's actions simultaneously at a low recognition cost has therefore become a vital issue.
In this paper, we propose an overall framework for multi-person action recognition using millimeter-wave radar, comprising four parts: data preprocessing, target segmentation, data optimization and dimensionality reduction, and a convolutional neural network + Long Short-Term Memory (CNN + LSTM) deep learning model. First, the point cloud generated by millimeter-wave radar is sparse [15] and cannot match the dense, uniform point clouds produced by LiDAR; moreover, millimeter-wave radar is susceptible to multipath effects in the environment during data acquisition [16], so the generated point cloud data contain much noise. The data preprocessing module therefore combines several traditional filtering methods with the filtering method proposed in this paper, which is based on the inter-frame differences of millimeter-wave point clouds, to denoise the original point cloud. Secondly, multi-person action recognition requires segmenting the multi-person point cloud. This paper uses the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm [17] and the Hungarian algorithm [18] to cluster the points belonging to the same target in each frame and, at the same time, to associate and match the data of the same target across consecutive frames [19], completing the segmentation of the multi-person point cloud. Then, because of the sparsity of the millimeter-wave point cloud, feeding each frame into the deep learning model individually would generate a substantial computational overhead. Therefore, we design a data optimization and dimensionality reduction module that superimposes six frames of data, which not only helps to overcome the sparsity and uncertainty of the millimeter-wave point cloud but also significantly reduces the computational cost. Since most human actions can be accurately described in the XZ dimensions, we discard the Y-axis component; while reducing the dimensionality of the data, the critical information is retained as much as possible, creating favorable conditions for the subsequent CNN + LSTM deep learning model [20]. Finally, the CNN + LSTM deep learning model is used because human action is a process highly correlated with temporal change; this approach not only extracts the spatial features of human action but also fully considers the temporal features. Our proposed method achieves 92.2% recognition accuracy in multi-person scenarios.
This paper makes the following contributions:
  • We propose an integrated framework for multi-person action recognition based on millimeter-wave radar point clouds.
  • Based on the characteristics of the millimeter-wave radar point clouds, we introduce a filtering method based on the inter-frame differences of the millimeter-wave point clouds.
  • Using the collected multi-millimeter-wave radar human point cloud data, we achieved an action classification accuracy of 92.2% in multi-person scenarios.
The sections of this paper are organized as follows: Section 2 describes the system design in detail. In Section 3, our experimental setup and data acquisition are described. Section 4 discusses the robustness and ablation experiments and classification results. In Section 5, we share the conclusions.

2. System Framework Design

2.1. Overview of the System Framework

Human action recognition using a point cloud generated by millimeter-wave radar has to overcome the sparsity and inhomogeneity of the millimeter-wave point cloud. Moreover, the acquired point cloud contains much noise in indoor environments due to multipath effects and other factors. In addition, the data volume of a multi-person point cloud is large, and directly training and recognizing with deep learning models without processing would generate a substantial computational overhead. We therefore designed a targeted system framework to solve these problems, as shown in Figure 1. The first module is the data preprocessing module, which filters the point cloud data through traditional pass-through filters and statistical filters [21] and then further processes the data using the millimeter-wave point cloud inter-frame difference filtering proposed in this paper to obtain higher-quality point cloud data. Next, we use the DBSCAN clustering algorithm and the Hungarian algorithm for inter-frame target matching to separate the point clouds of multiple targets. The third part of the system framework is the data optimization and dimensionality reduction module, which mainly addresses the sparsity of millimeter-wave point clouds [22] and the large volume of multi-person point cloud data by stacking a number of frames and discarding the Y-axis component. The fourth part of the system framework is the CNN + LSTM recognition module, which performs multi-dimensional feature extraction on the human point clouds through the powerful spatial feature extraction capability of the CNN [23] and the temporal feature extraction capability of the LSTM [24]. These four modules are described in detail below.

2.2. Data Preprocessing

Because millimeter-wave radar is sensitive to small vibrations and susceptible to environmental influences, multipath effects readily arise in indoor environments. The generated point cloud data therefore contain much noise, which significantly affects the system's stability and accuracy. In this paper, we initially used pass-through filters and statistical filters, both of which are widely used in point cloud filtering, to filter the point cloud data. A pass-through filter sets a range on the attributes of the point cloud (x, y, z, v, and the signal-to-noise ratio), removes the outliers, and retains the points within the range. The whole process of a human action usually occurs within a specific spatial range, but not all parts of the human body are in motion during the whole process; thus, the four attributes x, y, z, and the signal-to-noise ratio are selected as the conditions for pass-through filtering. The statistical filter identifies and removes outliers or noise by calculating the statistical properties of the data set. The statistical filtering procedure used in this paper is as follows:
  • Calculate the statistical properties of the data set, including the mean and standard deviation. The mean is the average value of the data set, and the standard deviation measures the dispersion of its values.
  • Based on the mean and standard deviation, define a threshold range; data points within this range are considered normal, while data points outside it are considered outliers.
  • Compare each data point against the mean and standard deviation: points that fall within the threshold range are retained, while those that fall outside it are filtered out.
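To make the two filters concrete, the following minimal NumPy sketch shows one possible implementation, assuming each frame is stored as an (m, 5) array with columns (x, y, z, velocity, SNR); the threshold values are illustrative placeholders rather than the exact ranges used in our experiments.

```python
import numpy as np

def pass_through_filter(points, x_range=(-1.5, 1.5), y_range=(0.5, 9.0),
                        z_range=(-0.8, 2.0), snr_min=10.0):
    """Keep points whose x, y, z, and SNR values fall inside the configured ranges."""
    x, y, z, snr = points[:, 0], points[:, 1], points[:, 2], points[:, 4]
    mask = ((x >= x_range[0]) & (x <= x_range[1]) &
            (y >= y_range[0]) & (y <= y_range[1]) &
            (z >= z_range[0]) & (z <= z_range[1]) &
            (snr >= snr_min))
    return points[mask]

def statistical_filter(points, k_sigma=2.0):
    """Drop points whose coordinates deviate from the frame mean by more than
    k_sigma standard deviations on any axis."""
    coords = points[:, :3]
    mean, std = coords.mean(axis=0), coords.std(axis=0)
    mask = np.all(np.abs(coords - mean) <= k_sigma * std, axis=1)
    return points[mask]

# Usage: apply the pass-through filter first, then the statistical filter.
frame = np.random.rand(100, 5) * [3.0, 9.0, 2.0, 1.5, 30.0]
filtered = statistical_filter(pass_through_filter(frame))
```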
The above steps describe the initial filtering of the point cloud data. Human body action is a continuous and smooth process, and the millimeter-wave radar will collect a certain number of frames in a specified time during the data acquisition process. Although there are differences in the spatial distribution of the point cloud between neighboring frames due to variations in human actions, the differences between them are limited and follow certain statistical principles. Based on this idea, we propose a filtering method based on the difference between the frames. The detailed steps are as follows:
  • Calculate the range difference of the point cloud's spatial coordinates between neighboring frames: for each frame, compute the range (maximum minus minimum) of the x, y, and z coordinates, then take the absolute difference of these ranges between neighboring frames. This yields a vector representing the variation between the frames.
  • Select the median as a benchmark: since the human body is usually at rest in the beginning and end frames, these frames may bias the overall variation. The median of the range differences over all neighboring frames is therefore chosen as the benchmark; the median reduces the effect of outliers and more accurately reflects the general trend of change between frames.
  • Flag and delete the point clouds whose variation exceeds the benchmark: iterate through all neighboring frames, flag as noisy those frames whose range of point cloud spatial coordinates changes by more than the benchmark, and delete them. This eliminates frames whose inter-frame variation is too large, as such variation is likely caused by noise or other disturbances.
The pseudocode for this filtering method is as shown in Algorithm 1:
Algorithm 1: Filtering Method Based on Frame Differences
Input: frames—a list containing multiple frames
Output: filtered_frames—a list of frames after filtering
# Calculate the range difference of x, y, z coordinates between adjacent frames
FUNCTION calculate_range_difference(frame1, frame2):
  range1 = calculate_range(frame1)
  range2 = calculate_range(frame2)
  difference = absolute_value(range2 − range1)
  RETURN difference
# Calculate the coordinate range of a frame
FUNCTION calculate_range(frame):
  min_coords = min(frame, axis = 0)
  max_coords = max(frame, axis = 0)
  range = max_coords − min_coords
  RETURN range
# Calculate the differences between all adjacent frames
FUNCTION frame_differences(frames):
  differences = []
  FOR i FROM 1 TO length(frames) − 1:
    difference = calculate_range_difference(frames[i − 1], frames[i])
    differences.append(difference)
  RETURN differences
# Take the median of all adjacent-frame range differences as the threshold
FUNCTION get_median_threshold(differences):
  sorted_differences = sort(differences)
  median_diff = median(sorted_differences)
  RETURN median_diff
# Filter noise in the point cloud data
FUNCTION filter_noise(frames, threshold):
  filtered_frames = [frames[0]]  # the first frame has no predecessor and is kept
  FOR i FROM 1 TO length(frames) − 1:
    difference = calculate_range_difference(frames[i − 1], frames[i])
    IF all_elements_less_or_equal(difference, threshold):
      filtered_frames.append(frames[i])
  RETURN filtered_frames
In this algorithm, for each pair of neighboring frames, it is necessary to compute the range of each coordinate axis by finding its maximum and minimum values. Each such operation requires traversing all the points, giving a time complexity of O(m), assuming there are m points in each frame. With a total of n frames, there are n − 1 pairs of neighboring frames, so the time complexity of calculating the range differences between neighboring frames is

O((n − 1) · 3m) = O(nm)
In point cloud data processing, calculating the range of points (maximum and minimum) in each frame is the primary operation, so the final time complexity of this algorithm is O(nm).
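For reference, a runnable NumPy version of Algorithm 1 could be sketched as follows; it assumes each frame is an (m, 3) array of x, y, z coordinates, and it keeps the first frame, which has no predecessor to compare against.

```python
import numpy as np

def frame_range(frame):
    """Per-axis extent (max - min) of the x, y, z coordinates in one frame."""
    return frame.max(axis=0) - frame.min(axis=0)

def range_difference(frame1, frame2):
    """Absolute change in per-axis extent between two adjacent frames."""
    return np.abs(frame_range(frame2) - frame_range(frame1))

def filter_by_frame_difference(frames):
    """Keep frames whose extent change relative to the previous frame does not
    exceed the per-axis median of all adjacent-frame differences."""
    diffs = [range_difference(frames[i - 1], frames[i]) for i in range(1, len(frames))]
    threshold = np.median(np.stack(diffs), axis=0)   # median difference as the benchmark
    kept = [frames[0]]                               # first frame is retained
    for i in range(1, len(frames)):
        if np.all(diffs[i - 1] <= threshold):
            kept.append(frames[i])
    return kept

# Example: 60 frames (4 s at 15 frames/s), each with a varying number of points
frames = [np.random.rand(np.random.randint(20, 60), 3) for _ in range(60)]
clean = filter_by_frame_difference(frames)
```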
Figure 2a–c show, respectively, the original point cloud, the point cloud after pass-through and statistical filtering, and the final point cloud after inter-frame difference filtering. The inter-frame difference filtering method effectively eliminates the noise caused by multipath interference as well as noise that exceeds the normal range of human action.

2.3. Target Segmentation

To complete multi-person action recognition, it is necessary to segment the multi-person point cloud and perform inter-frame target matching. Since the number of points per frame in the data collected by millimeter-wave radar is not fixed and the shape of a human body point cloud is irregular, the DBSCAN algorithm is more suitable for clustering than algorithms such as K-means, because it does not require the amount of noise in the data set or the number of clusters to be specified in advance. The main idea of the DBSCAN algorithm is to cluster based on the density of the data points [25]; here, clustering and segmentation of the point cloud are performed according to the Euclidean distance in 3D space. The critical parameters of this algorithm are the radius eps (the distance threshold defining the neighborhood) and min_samples (the minimum number of neighbors required for a core point). Its mathematical formulation is as follows:
The data set is denoted X, and each data point is denoted x_i, i = 1, 2, ..., n. Each data point has an associated feature vector representing its position in the feature space.
Neighborhood N_ε(x_i): the neighborhood centered at x_i with radius ε, defined as

N_ε(x_i) = { x_j | dist(x_i, x_j) ≤ ε }

Core point: if N_ε(x_i) contains at least min_samples data points, then x_i is a core point.
Directly density-reachable: if x_j is in N_ε(x_i) and x_i is a core point, then x_j is directly density-reachable from x_i in X.
Density-reachable: if there exists a sequence of data points p_1, p_2, ..., p_k such that p_1 = x_i, p_k = x_j, and p_{m+1} is in N_ε(p_m) for all 1 ≤ m < k, then x_j is density-reachable from x_i in X.
Density-connected: if there exists a data point x_k such that both x_i and x_j are density-reachable from x_k, then x_i and x_j are density-connected in X.
Based on the above definitions, DBSCAN categorizes the data points into core, boundary, and noise points to form density-connected clusters. Core points within a cluster are density-connected to each other, boundary points are not core points but are density-reachable from a core point, and noise points are neither core points nor reachable from any core point. The algorithm builds clusters recursively through density-reachability relationships, starting from the core points.
The Hungarian algorithm is then used to match the targets of the current frame and ensure target continuity. Its core advantages are its simplicity and ease of implementation. Differentiating multi-target point clouds with the Hungarian algorithm typically involves matching the point cloud targets of the current frame with the known targets from the previous frames. The positions of the clusters produced by the DBSCAN algorithm are first extracted, and the distances between the different clusters and the radar are used as costs to construct a cost matrix for matching each target in the current frame with the known targets in the previous frame. The optimal matching of the cost matrix is solved with the Hungarian algorithm, which returns a set of matches representing the correspondence between each target in the current frame and one of the known targets in the previous frame. Based on the matches, the trajectories of the targets are updated or created, and the targets in the current frame are associated with the matched targets in the previous frames to track each target's motion. The final output is point cloud data with target labels.
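A condensed sketch of this clustering and matching step is shown below, using scikit-learn's DBSCAN implementation and SciPy's linear_sum_assignment (the Hungarian algorithm). The eps and min_samples values, and the use of centroid-to-centroid Euclidean distances as matching costs, are illustrative assumptions rather than the exact settings of our system.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def cluster_frame(points, eps=0.3, min_samples=10):
    """Cluster one frame of (x, y, z) points; return cluster centroids and labels."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    centroids = [points[labels == lbl].mean(axis=0)
                 for lbl in set(labels) if lbl != -1]   # label -1 marks DBSCAN noise
    return np.array(centroids), labels

def match_targets(prev_centroids, curr_centroids):
    """Associate current-frame clusters with previous-frame targets (Hungarian algorithm)."""
    cost = cdist(prev_centroids, curr_centroids)        # pairwise distances as matching costs
    prev_idx, curr_idx = linear_sum_assignment(cost)
    return dict(zip(curr_idx, prev_idx))                # current cluster -> previous target ID

# Example: two consecutive frames of a two-person scene
prev_pts, curr_pts = np.random.rand(80, 3), np.random.rand(80, 3)
prev_c, _ = cluster_frame(prev_pts)
curr_c, labels = cluster_frame(curr_pts)
if len(prev_c) and len(curr_c):
    assignment = match_targets(prev_c, curr_c)
```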

2.4. Data Optimization and Dimensionality Reduction

Because most human actions occur within 1–3 s, we set the acquisition period of an action to 4 s to completely cover the entire action cycle, with the millimeter-wave radar collecting 15 frames of point cloud data per second. In our experiments, we found that if the number of superimposed frames is too small, the sparsity of the millimeter-wave radar cannot be overcome, whereas if it is too large, many spatial and temporal features are lost. Because each action yields 60 frames of data in a 4 s acquisition period, the number of superimposed frames must divide 60 evenly so that the data can be distributed uniformly. Ultimately, we aggregate six frames of point cloud data into one frame to generate a denser point cloud of the human body, as shown in Figure 3. After aggregation, a single action generates 10 aggregated point clouds, so that the entire human body action can be accurately depicted without significant information loss due to an overly small number of frames. Since the variations in most daily human actions are mainly manifested as variations in human height (e.g., standing, bowing, sitting, squatting) and variations in the limbs (e.g., boxing, kicking), these actions can be accurately described in the x-axis and z-axis dimensions alone. Thus, we discard the y-axis component to obtain the point cloud projection onto the xz plane, as shown in Figure 4. This approach retains most of the critical information while dramatically reducing the amount of data.
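The aggregation and projection step reduces to a few array operations. The sketch below assumes a list of per-frame (m, 3) arrays of x, y, z coordinates, as produced by the preceding preprocessing and segmentation stages.

```python
import numpy as np

def aggregate_and_project(frames, group_size=6):
    """Stack every group_size consecutive frames into one denser cloud and keep
    only the x and z coordinates (projection onto the xz plane)."""
    projections = []
    for start in range(0, len(frames), group_size):
        group = frames[start:start + group_size]
        if len(group) < group_size:
            break                                  # drop an incomplete trailing group
        merged = np.vstack(group)                  # merge 6 sparse frames into one cloud
        projections.append(merged[:, [0, 2]])      # keep x and z, discard y
    return projections

# A 4 s action captured at 15 frames/s gives 60 frames -> 10 aggregated xz projections
frames = [np.random.rand(np.random.randint(20, 60), 3) for _ in range(60)]
xz_clouds = aggregate_and_project(frames)
assert len(xz_clouds) == 10
```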

2.5. CNN + LSTM Classification Network

Human action is a continuous process that contains both spatial and temporal features. CNNs have achieved remarkable success in image recognition, target detection, and face recognition and have subsequently been widely used for gesture, action, and gait recognition on millimeter-wave point clouds. The LSTM network, through its structure of memory cells and input, forget, and output gates, effectively solves the long-term dependency problem of traditional RNNs [26]. The memory cell allows the network to store and maintain information over long sequences, while the gating mechanism allows the network to selectively read, write, and forget information so as to better capture long-term contextual relationships. Traditional RNNs are prone to vanishing or exploding gradients when dealing with long sequences, making it challenging to learn long-term dependencies efficiently [27]; LSTMs alleviate these problems through their gating mechanisms, especially the forget gate, making the network more stable during training. The CNN + LSTM framework is therefore used in this paper. First, spatial feature extraction is performed separately on each aggregated frame by the CNN, and then the extracted features are fed into the LSTM network in frame (chronological) order, as shown in Figure 5. The LSTM network captures the patterns and relationships in the time series, and its outputs are combined to form a complete feature representation. Finally, the action class is output through a fully connected layer. The advantage of this structure is that each CNN branch focuses on a single time frame, which helps to extract the spatial information of that frame, while the LSTM network integrates the information from these time frames and models the temporal relationships [28]. This approach handles time series data better and reduces the model's over-reliance on the entire sequence.

3. Experimental Setup

3.1. Hardware Setup and Experimental Scenario

In this paper, we use a Texas Instruments IWR1843BOOST board [29] operating at 78–81 GHz, which integrates four receiving antennas (RX) and three transmitting antennas (TX) to track multiple objects with distance and angle information, as shown in Figure 6. This antenna design can estimate both distance and pitch angle, enabling object detection in three-dimensional space. The FMCW radar first transmits a linear FM (chirp) signal from the TX antennas, and the RX antennas then capture the reflection of the chirp generated by the object. An IF signal is subsequently produced by an electronic component called a mixer [30]. The FMCW radar system estimates the angle of the reflected signal with respect to the horizontal plane, also known as the angle of arrival; the angle estimate is obtained from the phase change in the peak of the Doppler Fast Fourier Transform caused by a slight change in the object's distance. The radar point cloud data can then be derived from the angle, distance, velocity, and signal-to-noise ratio information [31]. A CSV file of the point cloud is exported, containing the frame number, the x-y-z coordinates, the Doppler velocity, and the signal-to-noise ratio.
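The sketch below shows one way the exported CSV might be grouped into per-frame arrays with pandas; the column names are assumptions, since the exact header depends on the export tool, and should be adjusted to match the actual file.

```python
import pandas as pd

# Hypothetical column names; adjust to match the actual CSV header.
COLUMNS = ["frame", "x", "y", "z", "doppler", "snr"]

def load_point_cloud(csv_path):
    """Read the exported CSV and return a list of per-frame (m, 5) arrays with
    columns (x, y, z, doppler, snr), ordered by frame number."""
    df = pd.read_csv(csv_path, names=COLUMNS, header=0)
    frames = [group[["x", "y", "z", "doppler", "snr"]].to_numpy()
              for _, group in df.groupby("frame", sort=True)]
    return frames

# frames = load_point_cloud("action_capture.csv")
```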
The radar board is powered through its power connector at startup, and the data exchange interface is connected to the PC. In the configuration, the distance resolution is set to 4.0 cm, the maximum detection range to 9 m, the maximum measurable radial velocity to 1.5 m/s, and the radial velocity resolution to 0.13 m/s, with 15 data frames collected per second. The radar height is set to 0.8 m, the distance between the experimenters and the radar to 4 m, and the distance between the two experimenters to 0.5 m, which is usually considered an average social distance. The experiment recorded five actions (punching, kicking, bowing, going from standing to crouching, and going from sitting to standing), as shown in Figure 7. Nine volunteers, ranging from 155 cm to 185 cm in height and from 42 kg to 91 kg in weight, took part. The volunteers were grouped in pairs and performed the different actions facing the radar with an average period of 4 s. Eventually, 21,600 data frames were measured for each action, giving 108,000 frames for the five actions. Stacking 6 frames produced 3600 point cloud images for each action and 18,000 point cloud images for the 5 actions.

3.2. Model Parameter Settings

The classifier model consists of convolutional layers, an LSTM layer, and a fully connected layer for classification. We first define a series of convolutional layers, activated by the GELU function, and use Batch Normalization (BN) to speed up training and enhance the stability of the model [32]. In the training loop, forward propagation is performed for each batch, losses are computed, backpropagation is performed, and the optimizer updates the weights. For the first convolutional layer, the number of input channels is 2, the number of output channels is 128, the convolutional kernel size is 7, the stride is 2, and the padding is 3. For the subsequent three convolutional layers, the numbers of input and output channels are both 128, the convolutional kernel size is 3, the stride is 2, and the padding is 1.
The CNN part thus builds a feature extractor from these convolutional layers, which extracts useful features from the input time series data to be fed into the LSTM layer for time series modeling and classification. The parameters of the LSTM layer include the input dimension, the hidden layer dimension, and batch-first ordering; here, both the input and hidden layer dimensions are set to 128. The LSTM layer operates on the sequence dimension, so the input tensor must have the dimensions batch_size, seq_length, and input_size, where input_size is the dimension of the input features. After feature extraction by the CNN part, the data are adapted to the input format of the LSTM layer through a transpose operation. The LSTM layer then processes the sequence recursively over the time steps, generating an output and a hidden state at each time step [33]. Here, only the hidden states are used, and they are averaged over all the time steps to obtain a fixed representation of the sequence, which is fed into the fully connected layer for final classification. We divide the data set into 80% training, 10% testing, and 10% validation sets. We set the initial learning rate to 0.0008, the batch size to 128, and the maximum number of training epochs to 100. The optimizer is Adam, and we implement the network in PyTorch.
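A PyTorch sketch consistent with the parameters above is given below. Treating each aggregated frame as a fixed-length 1D sequence of (x, z) points (hence 2 input channels for Conv1d) and padding or sampling each frame to 256 points are our assumptions for illustration; they are not specified in the text above.

```python
import torch
import torch.nn as nn

class CNNLSTMClassifier(nn.Module):
    """Per-frame CNN feature extractor followed by an LSTM over the 10 aggregated frames."""
    def __init__(self, num_classes=5, hidden_size=128):
        super().__init__()
        layers = [nn.Conv1d(2, 128, kernel_size=7, stride=2, padding=3),
                  nn.BatchNorm1d(128), nn.GELU()]
        for _ in range(3):                              # three further conv blocks, 128 -> 128
            layers += [nn.Conv1d(128, 128, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm1d(128), nn.GELU()]
        self.cnn = nn.Sequential(*layers)
        self.lstm = nn.LSTM(input_size=128, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):                               # x: (batch, seq_len=10, 2, num_points)
        b, t, c, n = x.shape
        feats = self.cnn(x.view(b * t, c, n))           # per-frame spatial features
        feats = feats.mean(dim=-1).view(b, t, -1)       # pool over points -> (batch, 10, 128)
        out, _ = self.lstm(feats)                       # temporal modelling over the frames
        return self.fc(out.mean(dim=1))                 # average hidden states, then classify

model = CNNLSTMClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0008)
criterion = nn.CrossEntropyLoss()
# One training step on a dummy batch: 128 samples, 10 frames, 2 channels, 256 points per frame
logits = model(torch.randn(128, 10, 2, 256))
loss = criterion(logits, torch.randint(0, 5, (128,)))
loss.backward()
optimizer.step()
```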

4. Experimental Results Analysis

4.1. Robustness Experiment

Human daily activities are usually not confined to a fixed distance or area, so we analyze the generalization ability and robustness of the system by varying the distance between the human body and the radar and the radar placement height. Since the maximum detection range of the radar is set to 9 m, we fixed the radar height at 0.8 m and conducted experiments at 2 m, 4 m, 6 m, and 8 m from the radar; the results are shown in Figure 8. The accuracy is highest at 4 m from the radar and remains good at the other distances. When the human body is too close to the radar, the emitted millimeter waves cannot fully cover the extremities of the body and the movements of the limbs, so the accuracy decreases. Conversely, because millimeter waves attenuate severely in the atmosphere, the accuracy also declines when the human body is farther from the radar. We therefore find that, in this environment, the recognition accuracy of the system is highest when the human body is 4 m away from the radar.
Since radar placement heights differ across everyday scenarios, we fixed the distance between the human body and the radar at 4 m and experimented with four radar placement heights of 0.4 m, 0.8 m, 1.2 m, and 1.6 m. The experimental results, shown in Figure 9, indicate that the accuracy is highest at a radar placement height of 0.8 m. Analysis showed that when the radar is placed too low or too high, it cannot completely cover the extremities of the human limbs, and the xz-plane projection of the generated point cloud becomes deformed, leading to an obvious decrease in action recognition accuracy. Moreover, the experiments show that the effect of the radar placement height on accuracy exceeds that of the distance between the human body and the radar.

4.2. Ablation Study

The data preprocessing module of this paper proposes a filtering method based on inter-frame differences. To analyze the effect of this method on the final classification results, we performed an ablation study. To compare under multiple conditions, we tested the effect of removing the pass-through filtering, statistical filtering, and inter-frame difference filtering modules on the classification results at different human body-radar distances and different radar placement heights. As shown in Figure 10, across the different human body-radar distances, the maximum contribution of frame difference filtering to the system's overall accuracy reaches 11%, the minimum contribution is 6%, and the average contribution is 8.8%. As shown in Figure 11, across the different radar placement heights, the maximum contribution of frame difference filtering to the overall accuracy reaches 11%, the minimum contribution is 5%, and the average contribution is 8.3%.
This experiment shows that our proposed filtering method based on millimeter-wave radar inter-frame differences effectively eliminates noise caused by multipath effects and other interference, and its benefit is more pronounced when the overall accuracy of the system is already high. Adding frame difference filtering substantially improves the quality of the human action point cloud and thereby significantly improves the classification accuracy of the overall system.

4.3. Time Cost

To test the time cost of the system, we took 50 test cases for each action from the filtered and cluster-segmented data, for a total of 250 test cases, and measured the time cost of recognizing individual actions. The average test time for each action is shown in Table 2: the longest average recognition time for an action is 75 ms, the shortest is 59 ms, and the average across all actions is 66.4 ms. We then processed our captured data using RadHAR's processing flow and, similarly, sampled 50 test cases per action to measure the time cost of individual action recognition; the results are shown in Table 3. The time cost of the proposed system is much lower than RadHAR's average elapsed time of 785.8 ms. Most previous studies used voxelization for point cloud data processing; voxelization not only loses part of the information in the data but also produces many empty voxels, increasing the recognition time cost. In this paper, by overlaying multiple frames and discarding the y-axis data, which carry relatively little information, we significantly reduce the data dimensionality, the data volume, and the time cost required for recognition.

5. Conclusions

This paper proposes a multi-person action recognition system that reconstructs the distribution of human body point clouds in three-dimensional space from the signals reflected by millimeter-wave radar and removes the noise caused by multipath effects and other influences, obtaining high-quality human body point clouds through traditional filtering methods and a filtering method based on inter-frame differences. The sparsity of the millimeter-wave radar point cloud is then overcome, and the amount of data in multi-person scenarios is reduced, by superimposing multiple frames and discarding the y-axis component. Finally, the spatial and temporal features of human actions are extracted and classified by a CNN + LSTM network, achieving a classification accuracy of 92.2% in multi-person scenarios. We implemented the system using commercial millimeter-wave radar and performed robustness, ablation, and recognition time-cost experiments. The time-cost experiments show that the average recognition time of the proposed system is 66.4 ms across the various actions. The ablation experiments show that the average contribution of our proposed frame-difference filtering to the overall accuracy under different conditions reaches 8.5%, which demonstrates the effectiveness of the frame-difference filtering method and the stability and efficiency of the overall system. In the future, we will explore hierarchical classification filtering methods for point clouds and further extract more detailed human action features through a multimodal approach, combining multiple features to improve recognition accuracy and efficiency. This paper's research on human action recognition in multi-person scenarios lays a solid foundation for future work and promotes the practical application and development of millimeter-wave radar in human action recognition.

Author Contributions

Conceptualization, X.D. and K.F.; methodology, K.F.; software, K.F.; validation, Y.T., Y.G. and Y.W.; formal analysis, K.F. and F.L.; investigation, K.F.; data curation, K.F.; writing—original draft preparation, K.F.; writing—review and editing, X.D. and F.L.; supervision, X.D.; project administration, X.D.; funding acquisition, X.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 62162056) and the Industrial Support Foundations of Gansu (Grant No. 2021CYZC-06).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all the subjects involved in this study.

Data Availability Statement

The data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jalal, A.; Kim, Y.H.; Kim, Y.J.; Kamal, S.; Kim, D. Robust human activity recognition from depth video using spatiotemporal multi-fused features. Pattern Recognit. 2017, 61, 295–308. [Google Scholar] [CrossRef]
  2. Attal, F.; Mohammed, S.; Dedabrishvili, M.; Chamroukhi, F.; Oukhellou, L.; Amirat, Y. Physical human activity recognition using wearable sensors. Sensors 2015, 15, 31314–31338. [Google Scholar] [CrossRef] [PubMed]
  3. Chen, C.; Jafari, R.; Kehtarnavaz, N. UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015. [Google Scholar]
  4. Hao, Q.; Hu, F.; Xiao, Y. Multiple human tracking and identification with wireless distributed pyroelectric sensor systems. IEEE Syst. J. 2009, 3, 428–439. [Google Scholar] [CrossRef]
  5. Han, J.; Bhanu, B. Human activity recognition in thermal infrared imagery. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05)-Workshops, San Diego, CA, USA, 21–23 September 2005. [Google Scholar]
  6. Ma, Y.; Zhou, G.; Wang, S. WiFi sensing with channel state information: A survey. ACM Comput. Surv. (CSUR) 2019, 52, 1–36. [Google Scholar] [CrossRef]
  7. Li, C.; Cao, Z.; Liu, Y. Deep AI enabled ubiquitous wireless sensing: A survey. ACM Comput. Surv. (CSUR) 2021, 54, 1–35. [Google Scholar] [CrossRef]
  8. Wang, Y.; Wu, K.; Ni, L.M. Wifall: Device-free fall detection by wireless networks. IEEE Trans. Mob. Comput. 2016, 16, 581–594. [Google Scholar] [CrossRef]
  9. Wang, W.; Liu, A.X.; Shahzad, M.; Ling, K.; Lu, S. Understanding and modeling of wifi signal based human activity recognition. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, Paris, France, 7–11 September 2015; pp. 65–76. [Google Scholar]
  10. Iovescu, C.; Rao, S. The Fundamentals of Millimeter Wave Sensors; Texas Instruments: Dallas, TX, USA, 2017; pp. 1–8. [Google Scholar]
  11. Fairchild, D.P.; Narayanan, R.M. Multistatic micro-Doppler radar for determining target orientation and activity classification. IEEE Trans. Aerosp. Electron. Syst. 2016, 52, 512–521. [Google Scholar] [CrossRef]
  12. Singh, A.D.; Sandha, S.S.; Garcia, L.; Srivastava, M. Radhar: Human activity recognition from point clouds generated through a millimeter-wave radar. In Proceedings of the 3rd ACM Workshop on Millimeter-Wave Networks and Sensing Systems, Los Cabos, Mexico, 25 October 2019; pp. 51–56. [Google Scholar]
  13. Gong, P.; Wang, C.; Zhang, L. Mmpoint-gnn: Graph neural network with dynamic edges for human activity recognition through a millimeter-wave radar. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–7. [Google Scholar]
  14. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 221–231. [Google Scholar] [CrossRef] [PubMed]
  15. Kong, L.; Khan, M.K.; Wu, F.; Chen, G.; Zeng, P. Millimeter-wave wireless communications for IoT-cloud supported autonomous vehicles: Overview, design, and challenges. IEEE Commun. Mag. 2017, 55, 62–68. [Google Scholar] [CrossRef]
  16. Heath, R.W.; Gonzalez-Prelcic, N.; Rangan, S.; Roh, W.; Sayeed, A.M. An overview of signal processing techniques for millimeter wave MIMO systems. IEEE J. Sel. Top. Signal Process. 2016, 10, 436–453. [Google Scholar] [CrossRef]
  17. Schubert, E.; Sander, J.; Ester, M.; Kriegel, H.P.; Xu, X. DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN. ACM Trans. Database Syst. (TODS) 2017, 42, 1–21. [Google Scholar] [CrossRef]
  18. Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97. [Google Scholar] [CrossRef]
  19. Khan, M.; Ahmad, J.; El Saddik, A.; Gueaieb, W.; De Masi, G.; Karray, F. Drone-HAT: Hybrid Attention Transformer for Complex Action Recognition in Drone Surveillance Videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 4713–4722. [Google Scholar]
  20. Mutegeki, R.; Han, D.S. A CNN-LSTM approach to human activity recognition. In Proceedings of the 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Fukuoka, Japan, 19–21 February 2020; pp. 362–366. [Google Scholar]
  21. Han, X.-F.; Jin, J.S.; Wang, M.-J.; Jiang, W.; Gao, L.; Xiao, L. A review of algorithms for filtering the 3D point cloud. Signal Process. Image Commun. 2017, 57, 103–112. [Google Scholar] [CrossRef]
  22. Palipana, S.; Salami, D.; Leiva, L.A.; Sigg, S. Pantomime: Mid-air gesture recognition with sparse millimeter-wave radar point clouds. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2021, 5, 1–27. [Google Scholar] [CrossRef]
  23. Bevilacqua, A.; MacDonald, K.; Rangarej, A.; Widjaya, V.; Caulfield, B.; Kechadi, T. Human activity recognition with convolutional neural networks. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2018, Dublin, Ireland, September 10–14, 2018, Proceedings, Part III 18; Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp. 541–552. [Google Scholar]
  24. Pienaar, S.W.; Malekian, R. Human activity recognition using LSTM-RNN deep neural network architecture. In Proceedings of the 2019 IEEE 2nd Wireless Africa Conference (WAC), Pretoria, South Africa, 18–20 August 2019; pp. 1–5. [Google Scholar]
  25. Duran, B.S.; Odell, P.L. Cluster Analysis: A Survey; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 100. [Google Scholar]
  26. Sherstinsky, A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Phys. D Nonlinear Phenom. 2020, 404, 132306. [Google Scholar]
  27. Wang, F.; Tax, D.M. Survey on the attention based RNN model and its applications in computer vision. arXiv 2016, arXiv:1601.06823. [Google Scholar]
  28. Abdu, F.J.; Zhang, Y.; Fu, M.; Li, Y.; Deng, Z. Application of deep learning on millimeter-wave radar signals: A review. Sensors 2021, 21, 1951. [Google Scholar] [CrossRef] [PubMed]
  29. Texas Instruments. AWR1843BOOST and IWR1843BOOST Single-Chip mmWave Sensing Solution User’s Guide (Rev. B). 2020. Available online: https://www.ti.com/tool/IWR1843BOOST (accessed on 23 October 2023).
  30. Mishra, K.V.; Shankar, M.B.; Koivunen, V.; Ottersten, B.; Vorobyov, S.A. Toward millimeter-wave joint radar communications: A signal processing perspective. IEEE Signal Process. Mag. 2019, 36, 100–114. [Google Scholar] [CrossRef]
  31. Yujiri, L.; Shoucri, M.; Moffa, P. Passive millimeter wave imaging. IEEE Microw. Mag. 2003, 4, 39–50. [Google Scholar] [CrossRef]
  32. Zhang, Q.; Wang, C.; Wu, H.; Xin, C.; Phuong, T.V. GELU-Net: A Globally Encrypted, Locally Unencrypted Deep Neural Network for Privacy-Preserved Learning. In Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 3933–3939. [Google Scholar]
  33. Xia, K.; Huang, J.; Wang, H. LSTM-CNN architecture for human activity recognition. IEEE Access 2020, 8, 56855–56866. [Google Scholar] [CrossRef]
Figure 1. System framework diagram.
Figure 2. The point cloud of two human bodies in motion. (a) The original point cloud; (b) the point cloud after pass-through and statistical filtering; and (c) the final point cloud after inter-frame difference filtering.
Figure 3. Point cloud of human body after superimposing 6 frames.
Figure 4. Point cloud of human body after discarding the y-axis component.
Figure 5. CNN + LSTM model.
Figure 6. IWR1843BOOST.
Figure 7. Multi-person action sample chart.
Figure 8. Confusion matrix of human body and radar at different distances: (a) 2 m from radar; (b) 4 m from radar; (c) 6 m from radar; and (d) 8 m from radar.
Figure 9. Confusion matrix for different radar placement heights: (a) radar height 0.4 m; (b) radar height 0.8 m; (c) radar height 1.2 m; (d) radar height 1.6 m.
Figure 10. Effect of each filter module on overall accuracy at different distances between human body and radar.
Figure 11. Effect of each filter module on overall accuracy for different radar placement heights.
Table 1. Comparison of different human action recognition devices.
Equipment Type          Resolution   Multi-Person   Privacy Protection   Affected by Weather
Camera equipment        Highest      Yes            No                   Yes
Wearable equipment      High         No             Yes                  No
Infrared equipment      Low          Yes            Yes                  No
Millimeter-wave radar   High         Yes            Yes                  No
Table 2. The average recognition time for a single test case for each action of the system in this paper.
Activity      Time/ms
Boxing        62
Leg lift      75
Bowing        67
Squat down    69
Stand up      59
Table 3. The average recognition time for a single test case for each action of the RadHAR system.
Activity      Time/ms
Boxing        756
Leg lift      893
Bowing        844
Squat down    747
Stand up      689