1. Introduction
The detection and classification of ground moving targets have been widely applied in military and civilian scenarios, such as public-square surveillance, airport scene surveillance, post-disaster rescue, anti-terrorism, and auxiliary medical treatment, and have been widely researched [1,2,3,4,5,6,7]. Pedestrian and vehicle targets are the main types of objects monitored by a ground surveillance radar. The Doppler effect occurs when there is relative motion between the target and the radar. When the radial speed of a ground moving target is low, the target Doppler spectrum and the clutter spectrum can easily overlap, which increases the difficulty of detecting and recognizing pedestrian and vehicle targets. At the beginning of this century, Chen's team at the U.S. Naval Research Laboratory systematically studied the micro-Doppler effect in radar [8]. The relative movement (e.g., vibration, swing, or rotation) of a target or some of its parts with respect to the main body is the main cause of the micro-Doppler effect. The micro-Doppler frequency produces sidebands around the Doppler shift and widens the main Doppler spectrum [9]. If micro-Doppler information is introduced, the detection and recognition probabilities of ground moving targets, such as pedestrians and vehicles, can be improved significantly. Therefore, the classification and recognition of ground moving targets based on micro-Doppler characteristics have attracted great attention in recent years and have become one of the main research topics in the area of target classification [10,11].
Pedestrian movement is a highly coordinated non-rigid movement that involves the brain, muscles, nerves, joints, and bones. It has the non-stationary property of a single scattering point and the non-rigid property of multiple scattering points. There are three ways to obtain the radar echo. The first is pedestrian motion modeling based on a biomechanical model. A typical model is the Boulic model [12], which was developed to imitate pedestrian walking and calculate the speed of human joints. In [9], the authors improved the Boulic model, simulated the radar echoes of pedestrian walking, analyzed the micro-Doppler effect of pedestrians through time–frequency analysis, and estimated the arm-swing frequency and gait cycle. However, this method mainly targets an idealized walking model and has low generalization ability. The second is pedestrian motion modeling based on motion capture (MOCAP) data [13]. In the work of Erol et al. [14], several complex motion models were established by directly capturing pedestrian motions, recording the actual motion trajectory of each node of the human body. In the work of Yao et al. [15], the authors simplified the MOCAP model and improved the modeling efficiency while still meeting realistic motion requirements. In this method, the motion modes are diverse, but the scattering characteristics still differ from the actual situation. The third way is to record actual, measured radar echo data. In the work of Björklund et al. [16], the authors used a 77 GHz radar to record pedestrian posture echoes and applied the micro-Doppler characteristics of pedestrian postures to multi-person recognition. In the work of Mehul et al. [17], continuous gaits were collected by a frequency-modulated continuous wave radar and three ultra-wideband (UWB) pulse radars placed at different locations. However, this method is difficult to use in practical situations and places high requirements on the hardware equipment.
Regarding pedestrian activity recognition, in [18], Garreau et al. classified pedestrians, skiers, and cyclists using a classifier based on differences in their micro-Doppler characteristics. In the work of Ishibashi et al. [19], a classifier that can simultaneously recognize a lifting action and its load weight based on the kinematic quantities of body motion was developed using the hidden Markov model framework. In the work of Fairchild et al. [20], different classifiers were used to distinguish various pedestrian movements. In the work of Amin et al. [21], pedestrians on crutches were identified. In the work of McDonald et al. [22], humans seated in a taxiing plane were identified. In the work of Kim et al. [23], the physical characteristics of pedestrians were used to detect them. In the work of Bryan et al. [24], the micro-Doppler characteristics of pedestrians were analyzed using a UWB radar. In addition, many classification methods, such as the support vector machine [25], have been used to achieve better classification of running, walking, armed walking, climbing, and other types of motion. In [15], Yao et al. used a complex-valued neural network to classify pedestrian activities, extracted richer features, and achieved higher classification accuracy under small-sample conditions and in a low signal-to-noise ratio (SNR) environment. Regarding the classification of pedestrian and vehicle targets, in the work of Nanzer et al. [26], the difference in the number of strong scattering points in the time–frequency spectra of pedestrian and vehicle targets was used to classify these two types of targets. In the work of Du et al. [27], the micro-Doppler characteristics of pedestrians and vehicles were studied based on measured continuous wave radar data; noise reduction and clutter suppression of the measured data were performed by a Bayesian learning method, and robust features extracted from the time–frequency spectrum were used to classify pedestrians and vehicles. In [28], Shi et al. analyzed differences in the micro-motion characteristics of pedestrian and vehicle targets and extracted the spatial features of the spectrogram, namely, the texture characteristics; good classification results were achieved.
All the above-mentioned classification methods are based on manual feature extraction, with different classifiers used to classify the ground moving targets. However, these methods rely heavily on prior knowledge, involve strong subjective factors, and can be applied only in specific application scenarios due to their low generalization ability. With the development of hardware, more attention has been paid to deep-learning-based methods, which can automatically extract the inherent characteristics of targets without requiring prior information and can achieve good classification performance. In the works of Kim et al. [29,30], a convolutional neural network based on micro-Doppler analysis was used to classify and recognize pedestrian motions. Although common deep-learning-based methods can achieve good recognition results, they require a large number of training and testing samples; with fewer samples, they are likely to overfit, which reduces their classification performance. Therefore, it is necessary to enhance and augment the training samples. In the work of Du et al. [27], the number of samples was increased through sliding-window processing; however, the generated samples were highly similar. In the work of Bjerrum et al. [31], a data enhancement method based on affine transformations was proposed, which can generate similar samples by enlarging, reducing, translating, and rotating a sample image. In the work of Song et al. [32], an image sharpening method was used to enhance the training sample set. However, these image transformations are global and can neither focus on the diversity of local regions nor extract the intrinsic features of a database. In the work of Goodfellow et al. [33], a generative model based on the game between a generator and a discriminator, namely the generative adversarial network (GAN), was proposed; it can reach the Nash equilibrium and generate new images with features similar to, but different from, the original images. In the work of Seyfioğlu et al. [34], an autoencoder was used to initialize network parameters in the small-sample case, and a convolutional network was then used to classify aided and unaided indoor pedestrian movements, achieving good classification results. In the work of Alnujaim et al. [35], a GAN was used to supplement the pedestrian micro-motion data collected by a Doppler radar, which solved the problem of insufficient training data and improved the deep convolutional neural network (DCNN) classification accuracy. In the work of Erol et al. [36], a pedestrian micro-motion data enhancement method based on the auxiliary conditional GAN was proposed for the classification of pedestrian activities. However, the data generated by this method need to be processed by principal component analysis to eliminate inconsistent samples. In the work of Alnujaim et al. [37], a GAN was used to expand the time–frequency (TF) images obtained from a single angle into images from multiple angles, thus enriching the sample set from multiple perspectives. However, the GAN suffers from convergence instability, which makes it prone to producing samples with high similarity. To solve these problems, the Wasserstein GAN (WGAN) was proposed by Arjovsky et al. [38] and has been applied in several fields. In the work of Magister et al. [39], the WGAN with a semantic image inpainting algorithm was used to generate more realistic retinal images for medical teaching purposes. In the work of Fu et al. [40], a novel framework concatenating a super-resolution GAN and a WGAN was proposed to increase the performance of a backbone detection model.
To achieve good classification performance for ground moving targets under small-sample conditions and to overcome the problem of convergence instability, a WGAN-based sample enhancement method for ground moving target (GMT-WGAN) classification is proposed in this paper.
The proposed method augments the samples using the adversarial learning strategy and increases the richness of information under the condition of small samples. Compared with the existing ground moving target classification methods, the main contributions of the proposed GMT-WGAN method are as follows:
(1) The Wasserstein distance is introduced to measure the difference between the real and generated distributions of ground moving targets in the adversarial network, which makes training more stable; careful balancing of the training degree of the generator and discriminator is no longer needed;
(2) The inverse of the Wasserstein distance is used as the loss function of the discriminator in the generation of ground moving target samples, which can indicate the state of the training process;
(3) Three different databases are established to verify the effectiveness of the proposed method. Experimental results show that the GMT-WGAN method can provide significant improvement in the ground moving target classification under different SNR conditions.
The remainder of this paper is organized as follows.
Section 2 introduces the related work of the GMT-WGAN, including the motion characteristics of ground moving targets, the time–frequency analysis method based on the short-time Fourier transform, and the basic principle of the GAN.
Section 3 presents the proposed GMT-WGAN method.
Section 4 provides detailed experimental results and evaluates the sample image quality of the WGAN.
Section 5 concludes the paper.
2. Related Work
2.1. Ground Moving Target Motion Characteristics
Ground moving targets mainly include pedestrians and vehicles. There are many forms of pedestrian movement, such as walking, running, jumping, armed walking, and walking on crutches. Different postures represent different pedestrian motion intentions. In addition, there are many types of vehicle targets, such as armored vehicles, tanks and other military vehicles, transport vehicles, and bicycles. Different moving mechanisms lead to different motion characteristics.
The motion rules of pedestrians are as follows: (1) the motion of the center of mass of a human body modulates the radar echo Doppler signal and can reflect the speed and direction of a target; (2) in the moving process, the periodic swing of upper and lower limbs micro-modulates the radar echo Doppler signal and reflects the micro-motion of the target. Due to the unique model and motion mechanism of pedestrians, the modulation effect of different parts on the Doppler can be different, which is related to the limb size and relative radial velocity to the radar. The modulation amplitude depends on the radar cross section (RCS) of a limb. Due to the existence of micro-motion, more parameters can be extracted to describe the motion posture of pedestrians.
The motion rules of vehicles are as follows: (1) the motion of a vehicle's body modulates the radar echo Doppler signal; (2) the radar echo Doppler signal is micro-modulated by the periodic rotation of the wheels and the up–down vibration of the vehicle's center of mass caused by objective factors, such as uneven roads. The tires of a wheeled vehicle are generally rubber, whose RCS is small. Therefore, the micro-Doppler modulation of wheeled vehicles is usually not obvious, and their Doppler characteristics are mainly reflected in the movement of the vehicle's body. Meanwhile, a crawler track and its wheels are metal, with large size and smooth areas. If the velocity of the vehicle body's center of mass is $v$, then the speed of the upper crawler track relative to the ground is $2v$, and that of the lower crawler track is zero. Therefore, in an ideal situation, within a certain attitude-angle range, the Doppler spectrum of tracked vehicles has obvious components at $2v$ and zero; however, in practice, the zero component is suppressed in the process of suppressing ground clutter.
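As a numerical illustration of the Doppler components described above, the following sketch uses the standard relation $f_d = 2 v_r / \lambda$. The wavelength (0.03 m, i.e., roughly X-band) and the body speed are illustrative assumptions, not values taken from this paper.

```python
# Doppler shift f_d = 2 * v_r / lam for a radar of wavelength lam (meters).
# Illustrative sketch: the vehicle body moving at v produces f_d(v), while
# the upper crawler track of a tracked vehicle moves at 2v relative to the
# ground and the lower crawler track is stationary.
def doppler_shift(v_radial, lam=0.03):
    """Return the Doppler shift in Hz for radial velocity v_radial (m/s)."""
    return 2.0 * v_radial / lam

v_body = 5.0                          # assumed center-of-mass velocity, m/s
f_body = doppler_shift(v_body)        # body component
f_upper = doppler_shift(2 * v_body)   # upper-crawler component (twice f_body)
f_lower = doppler_shift(0.0)          # lower-crawler component (zero)
print(f_body, f_upper, f_lower)
```

The zero-frequency lower-track component coincides with stationary clutter, which is why it disappears after clutter suppression, as noted above.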
2.2. Time–Frequency Analysis Method Based on Short-Time Fourier Transform
The Fourier transform is a natural tool for the frequency analysis of stationary signals. It is suitable for the global analysis of signals but has certain limitations for non-stationary signals. However, the frequencies of signals in nature are time-varying and non-stationary, and the radar echo from a micro-moving target is a typical non-stationary signal as well. Therefore, it is necessary to introduce joint time–frequency transforms to describe the time-varying characteristics of this type of signal. Common time–frequency representation methods include the short-time Fourier transform (STFT), the continuous wavelet transform, adaptive time–frequency representations, the Wigner–Ville distribution (WVD), and the Cohen class of time–frequency distributions.
As one of the most commonly used time–frequency analysis methods, the STFT computes the Fourier transform of a signal within each sliding time window and thereby obtains a two-dimensional time–frequency distribution of the signal:

$$\mathrm{STFT}(t,f)=\int_{-\infty}^{+\infty} s(\tau)\, w(\tau-t)\, e^{-j2\pi f\tau}\, d\tau$$

where $s(\tau)$ is the signal to be analyzed, $w(\tau)$ is the window function, and $\mathrm{STFT}(t,f)$ is the time–frequency spectrum after transformation. In engineering applications, its discrete form is often used, given by:

$$\mathrm{STFT}(m\Delta t,\, n\Delta f)=\sum_{k} s(k\Delta t)\, w(k\Delta t-m\Delta t)\, e^{-j2\pi n\Delta f\, k\Delta t}$$

where $s(k\Delta t)$ is the discrete form of the signal to be analyzed; $\Delta t$ and $\Delta f$ are the time and frequency sampling intervals, respectively; $m$ and $n$ are the time and frequency sample indices, respectively; and $w(\cdot)$ is the window function.
The STFT has the advantage of not considering the effect of cross terms, which results in a small calculation amount. However, it also has certain disadvantages, for instance, the resolution is limited by the selected window function, and the time and frequency resolutions usually cannot be optimized at the same time. Selecting a wide window can ensure high frequency resolution but can also worsen the time resolution. On the contrary, selecting a narrow window can provide high time resolution but can decrease frequency resolution. Therefore, the window size is the key parameter that significantly affects the effect of time–frequency analysis in the STFT.
The time–frequency spectrogram of a signal is the squared modulus of the STFT, which can be expressed as follows:

$$\mathrm{SPEC}(t,f)=\left|\mathrm{STFT}(t,f)\right|^{2}$$
Generally, the horizontal and vertical directions of the spectrum indicate time and frequency, respectively. The spectrogram contains the frequency information of signals at different times, and it can clearly indicate frequency variations with time.
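As a rough illustration of the window-size trade-off and the spectrogram computation described above, the following Python sketch (assuming NumPy and SciPy are available; the signal is a toy frequency-modulated tone, not real radar data) computes the spectrogram as the squared STFT magnitude:

```python
import numpy as np
from scipy.signal import stft

# Illustrative sketch: spectrogram of a toy micro-Doppler-like signal.
# The window length nperseg trades time resolution for frequency resolution.
fs = 1000.0                          # assumed sampling rate, Hz
t = np.arange(0, 1.0, 1 / fs)
# 100 Hz carrier with a 5 Hz sinusoidal micro-Doppler-like phase modulation
x = np.cos(2 * np.pi * 100 * t + 10 * np.sin(2 * np.pi * 5 * t))

f, tau, Z = stft(x, fs=fs, window="hann", nperseg=128)
spectrogram = np.abs(Z) ** 2         # SPEC = |STFT|^2
peak_bin = f[np.argmax(spectrogram.sum(axis=1))]
print(peak_bin)                      # energy lies in the 50-150 Hz band
```

Doubling `nperseg` halves the frequency-bin width but doubles the window duration, which is precisely the resolution trade-off noted above.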
2.3. Basic Principle of GAN
Inspired by the two-player zero-sum game, Goodfellow et al. [33] proposed the GAN structure. It consists of a generator and a discriminator. The generator is used to capture the distribution of the time–frequency spectrograms of real ground moving targets and to generate new data samples. The discriminator is a binary classifier that determines whether the network input is real or generated data. The generator and discriminator iteratively optimize their parameters by competing with and restricting each other to improve their abilities to generate and discriminate samples. In fact, this optimization process represents a game problem, which is to find an equilibrium point between the generator and discriminator. If the Nash equilibrium is reached, the discriminator cannot determine whether the input data come from the generator or are real samples, and that is when the generator reaches its optimum state.
The structure of the GAN is shown in Figure 1. The input of the generator is a one-dimensional random Gaussian noise vector $z$, and its output is a spurious time–frequency spectrogram $G(z)$. The input of the discriminator is either a real spectrogram sample $x$ of a ground moving target or a spectrogram sample $G(z)$ produced by the generator, and its output is either "1" or "0," where "1" represents true and "0" represents false. The training goal of the GAN is to make the distribution of the spurious spectrograms $G(z)$ generated by the generator as close as possible to that of the real targets. The purpose of the generator is to make the discriminator's output for the generated spurious spectrograms, $D(G(z))$, as consistent as possible with its output for the real targets, $D(x)$. The loss function of the generator is given by:

$$L_{G}=\mathbb{E}_{z\sim p_{z}(z)}\left[\log\left(1-D(G(z))\right)\right]$$

where $D(\cdot)$ indicates the discriminator; $G(\cdot)$ indicates the generator; $p_{z}(z)$ is the random noise distribution; and $\mathbb{E}$ is the expectation operator.
In the process of continuous adversarial learning, the spurious time–frequency images generated by the generator become increasingly close to the real time–frequency images of ground moving targets, and the discriminator's output for $G(z)$ becomes increasingly ambiguous.
The purpose of the discriminator is to perform binary classification of the input images. If the input is a real spectrogram sample, the discriminator outputs "1"; if the input is a spurious spectrogram $G(z)$ generated by the generator, the discriminator outputs "0." The loss function of the discriminator is given by:

$$L_{D}=-\mathbb{E}_{x\sim p_{data}(x)}\left[\log D(x)\right]-\mathbb{E}_{z\sim p_{z}(z)}\left[\log\left(1-D(G(z))\right)\right]$$
Thus, the total objective of the generator and discriminator can be expressed as the following minimax game:

$$\min_{G}\max_{D}V(D,G)=\mathbb{E}_{x\sim p_{data}(x)}\left[\log D(x)\right]+\mathbb{E}_{z\sim p_{z}(z)}\left[\log\left(1-D(G(z))\right)\right]$$

where $p_{data}(x)$ is the time–frequency spectrum distribution of the real targets.
However, the GAN suffers from training instability. In particular, when the discriminator has not converged well, the generator cannot be updated too many times; otherwise, it is prone to mode collapse.
A deep convolutional GAN (DCGAN) introduces a series of constraints into the original GAN structure. The DCGAN improves the GAN mainly from an engineering point of view, increasing the stability of GAN training. Although it offers little theoretical innovation, it provides engineering support for GAN development.
The main changes in the DCGAN structure compared to the ordinary GAN structure can be summarized as follows:
(1) The pooling and fully connected layers are removed from the discriminator, and a fully convoluted network is used to map the input sample to a two-dimensional vector to reduce the network parameters.
(2) In the generator, the transposed convolution layer is used to map the input random Gaussian noise vector to generate samples.
(3) To stabilize the training and convergence of the network, batch processing is used to normalize the convolution and transposed convolution layers.
(4) In the generator, the output layer uses the hyperbolic tangent (tanh) activation function, while the other layers use the ReLU activation function after batch normalization.
(5) In the discriminator, to prevent the problem of gradient dispersion when the loss function of the discriminant network propagates to the generated network, the ReLU activation function is replaced by the leaky ReLU activation function so that the negative value input also has an activation output.
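The upsampling path in change (2) follows from simple size arithmetic. The sketch below is a hedged illustration: the kernel size (4), stride (2), and padding (1) are assumed values typical of DCGAN-style generators, not parameters stated in this paper.

```python
# Output-size arithmetic for a transposed convolution:
#   out = (in - 1) * stride - 2 * padding + kernel.
def tconv_out(size, kernel=4, stride=2, padding=1):
    return (size - 1) * stride - 2 * padding + kernel

# Starting from an assumed 4x4 projected noise map, four stride-2
# transposed-conv layers double the spatial size each time:
# 4 -> 8 -> 16 -> 32 -> 64, matching a [3, 64, 64] output sample.
size = 4
for _ in range(4):
    size = tconv_out(size)
print(size)
```

With these parameters each layer exactly doubles the spatial size, which is why four layers suffice to reach a 64 × 64 output.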
The structure of the generated DCGAN network is shown in
Figure 2, where the input noise vector has a dimension of 100, and an output sample with dimensions [3, 64, 64] is obtained through four layers of transposed convolution with a stride of two. Although the DCGAN has an efficient network architecture, which can improve training stability to a certain extent, there are still the problems of unstable training, mode collapse, and the inability to indicate the training process. To overcome these problems, this paper adopts the WGAN, which improves upon the DCGAN structure. The specific principle and improvements of the WGAN are described below.
3. GMT-WGAN for Target Classification
The flowchart of the proposed GMT-WGAN classification method is presented in
Figure 3. The original time–frequency images of the ground moving targets are divided into two parts. The first part is used to train the WGAN. In the training process, through the adversarial game between the generator and discriminator, the Nash equilibrium is reached, a well-trained WGAN model is obtained, and time–frequency images of ground moving targets are generated. Then, the generated time–frequency images and the first part of the original time–frequency images are mixed to form the DCNN training set, which is fed to the DCNN to train the network parameters and learn the inherent characteristics of the time–frequency images until convergence. Finally, the second part of the original time–frequency images is used as the DCNN test set; it is fed to the trained DCNN, which outputs the classification labels to achieve robust classification of ground moving targets. The specific principles are introduced in detail below.
3.1. Data Expansion Method Based on WGAN
Although common deep-learning-based methods can achieve good classification results, they require a large number of training and testing samples. If there are fewer samples, the overfitting phenomenon can appear, resulting in a poor classification effect. Therefore, it is necessary to enhance and augment data samples. Most traditional image transformation methods are global and can neither focus on the diversity of local regions nor extract the intrinsic features of a database. To alleviate these drawbacks, this paper proposes a data enhancement method based on the WGAN.
The GAN and DCGAN structures have certain problems: training is difficult, the loss functions of the generator and discriminator cannot indicate the training process, and mode collapse can occur. The GAN uses the Jensen–Shannon (JS) divergence (in its first form) and the Kullback–Leibler (KL) divergence (in its second form) to calculate the distance between the generated and real distributions. When the JS divergence is used, if the intersection of the two distributions is very small or lies in a low-dimensional manifold, the discriminator D can easily find a decision surface that separates the generated distribution from the real distribution. Therefore, it cannot provide effective gradient information to the generator during backpropagation; in other words, the loss function is not continuous. Without effective gradient information, the network parameters of the generator G cannot be effectively updated, which makes training difficult. Since the KL divergence is asymmetric, the generator tends to produce samples that pass the discriminator more easily, which readily causes mode collapse. In addition, there is no single indicator that can measure the status of network training.
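The asymmetry of the KL divergence and the symmetry of the JS divergence mentioned above can be checked directly on small discrete distributions. The sketch below (assuming NumPy; the distributions are illustrative toy values) makes both properties explicit:

```python
import numpy as np

# KL divergence is asymmetric: D_KL(p || q) != D_KL(q || p) in general.
def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

# JS divergence is the symmetrized average of KL against the mixture m.
def js(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
print(kl(p, q), kl(q, p))   # differ: KL is asymmetric
print(js(p, q), js(q, p))   # equal: JS is symmetric
```

It is this asymmetry of the KL term that biases the generator toward "safe" samples and contributes to mode collapse, as described above.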
The WGAN introduces the Wasserstein distance to measure the difference between two distributions. The Wasserstein distance, also known as the earth-mover distance, is defined as follows [38]:

$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x,y)\sim\gamma}\left[\lVert x - y \rVert\right]$$

where $\Pi(P_r, P_g)$ represents the set of all possible joint distributions of the real distribution $P_r$ and the generated distribution $P_g$. For each possible joint distribution $\gamma$, a real sample $x$ and a generated sample $y$ can be drawn from $\gamma$, and the distance between the samples $\lVert x - y \rVert$ can be calculated. Therefore, the expected value $\mathbb{E}_{(x,y)\sim\gamma}[\lVert x - y \rVert]$ of the sample distance under the joint distribution $\gamma$ can be calculated. Over all possible joint distributions, the infimum of this expected value is $W(P_r, P_g)$, which is defined as the Wasserstein distance. Intuitively, this can be understood as the minimum "cost" needed to move the pile of "sand" $P_r$ to $P_g$ under the "transport plan" $\gamma$, which gives it the name earth-mover distance; $W(P_r, P_g)$ is the minimum cost under the optimal transport plan. The Wasserstein distance is a better measure than the JS or KL divergence because it quantifies the distance between two distributions even when they show almost no overlap. This advantage enables the WGAN to alleviate gradient vanishing during GAN training. In particular, the discriminator D can still propagate an effective gradient to the generator G when the two distributions almost do not overlap. In practice, mathematical transformations are commonly used to express the Wasserstein distance in a solvable form, and weight clipping is applied to constrain the discriminator's weights within a range so that the maximized discriminator objective approximates the Wasserstein distance. With this approximately optimal discriminator, the generator is optimized to decrease the Wasserstein distance, thus effectively shortening the distance between the generated and real distributions. The network structure of the discriminator of the proposed WGAN is presented in
Figure 4.
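The key property above, namely that the Wasserstein distance remains informative even for nearly disjoint distributions, can be illustrated with SciPy's one-dimensional earth-mover distance. The means (0, 1, 5) and sample sizes below are arbitrary illustrative choices:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Sketch: unlike the JS divergence, which saturates for disjoint supports,
# the Wasserstein (earth-mover) distance still reflects how far apart two
# non-overlapping 1-D sample sets are.
rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=0.1, size=1000)
near = rng.normal(loc=1.0, scale=0.1, size=1000)  # barely overlaps `real`
far = rng.normal(loc=5.0, scale=0.1, size=1000)   # even farther away

w_near = wasserstein_distance(real, near)
w_far = wasserstein_distance(real, far)
print(w_near, w_far)
```

Both pairs are effectively disjoint, yet the distance still grows with the separation (roughly 1 versus roughly 5 here), so a Wasserstein-based critic can supply a useful gradient where a JS-based discriminator could not.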
The main contribution of the WGAN is that the Wasserstein distance, instead of the JS or KL divergence, is used to measure the difference between the real and generated data distributions. Compared with the JS divergence, the Wasserstein distance is continuous, so the discriminator loss is no longer defined by binary values (true or false) but becomes a regression problem. At the same time, due to the continuity of the loss function and weight clipping, training is more stable. Compared with the KL divergence, the Wasserstein distance is symmetric, generator training tends to be stable, and the mode collapse problem is thus overcome. Finally, whereas the GAN cannot produce an indicator of the training process, the Wasserstein distance is the inverse of the discriminator loss function, so the difference between the real and generated distributions can be read from the discriminator loss. The Wasserstein distance does not depend on the discriminator's network structure, can indicate the training process, and is highly correlated with the quality of the generated samples. In conclusion, the advantages of the WGAN can be summarized as follows:
(1) It solves the problems of unstable GAN training and the requirement for balancing the training degree of the generator and discriminator.
(2) It solves the mode collapse problem and ensures the diversity of generated samples.
(3) In the training process, it generates a value that can indicate the training process state (i.e., the inverse of the discriminator loss function). The smaller this value is, the better the GAN’s training performance and the higher the image quality will be.
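The weight clipping mentioned above can be sketched in a few lines. This is a hedged illustration of the mechanism, not the paper's implementation; the clipping threshold c = 0.01 is the value used in the original WGAN paper.

```python
import numpy as np

# WGAN-style weight clipping: after each critic (discriminator) update,
# every weight is clamped to [-c, c] to enforce an approximate Lipschitz
# constraint on the critic.
def clip_weights(weights, c=0.01):
    """Clamp each weight array in `weights` elementwise to [-c, c]."""
    return [np.clip(w, -c, c) for w in weights]

layer_weights = [np.array([[0.5, -0.002], [-0.3, 0.007]])]
clipped = clip_weights(layer_weights)
print(clipped[0])  # large entries saturate at +/-0.01, small ones pass through
```

In a real training loop this step would run after every critic gradient update, before the generator is updated against the clipped critic.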
3.2. DCNN-Based Ground Moving Target Classification
In this study, a VGG-16 network is used to classify ground moving targets. The VGG-16 network structure is shown in
Figure 5. The input samples of the VGG-16 are grayscale or RGB images labeled in one-hot form. The VGG-16 network consists of five convolutional blocks and one fully connected block. Each convolutional block includes two or three convolution layers, each of which extracts a feature map of depth $N$ using convolution kernels of size $3 \times 3$; thus, a convolution layer of this type is denoted as conv3-$N$. After the convolution layers, the ReLU function is used for activation, and the feature map is passed to the pooling layer. The window size of the max-pooling layer is 2 × 2. After the convolutional blocks, the feature map is reshaped into a one-dimensional feature vector by the fully connected layers. Finally, the softmax layer maps this vector into probability values to obtain the category and calculate the classification accuracy. The parameters of the network layers are shown in
Figure 5.
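The size bookkeeping through the five blocks can be sketched as follows. The 224 × 224 input is the standard VGG-16 input size and is assumed here for illustration; the input size used in this paper's experiments may differ.

```python
# Spatial-size bookkeeping through VGG-16 for an assumed 224x224 RGB input:
# conv3-N layers (3x3 kernels, stride 1, padding 1) preserve spatial size,
# and each 2x2 max-pooling layer halves it.
depths = [64, 128, 256, 512, 512]   # conv3-N depth for each of the 5 blocks
size, depth = 224, 3
for d in depths:
    depth = d                       # conv3-d changes depth, keeps size
    size //= 2                      # 2x2 max pooling after each block
print(size, size, depth)            # final feature map before the FC layers
```

Five halvings take 224 down to 7, giving a 7 × 7 × 512 feature map, i.e., a 25,088-dimensional vector entering the fully connected block.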
3.3. Image Quality Evaluation Metrics
Although the generated images can augment and enrich the sample database, if the quality of generated images is not high enough, or there exists a large difference between generated and original images, adding the generated images will not help to improve the performance but instead will decrease the classification accuracy. Therefore, the quality of generated images needs to be evaluated.
Common image quality evaluation metrics include the mean value, variance, information entropy, dynamic range, linear index of fuzziness, average gradient, and gray-level difference. These metrics are introduced in the following.
The mean of an image indicates its total energy. Assume the size of an image $I$ is $M \times N$; then, the mean value of the image is given by:

$$\mu = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N} I(i,j)$$
The variance of an image indicates the deviation degree of the image relative to the mean value, which can be expressed as follows:

$$\sigma^{2} = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(I(i,j)-\mu\right)^{2}$$
Information entropy indicates the amount of information in an image and reflects its degree of focus. The information entropy is calculated by:

$$H = -\sum_{k=0}^{255} p_k \log_2 p_k$$

where $p_k$ is the probability of gray level $k$ among the pixels of image $I$.
The smaller the information entropy value is, the more focused the image is.
The dynamic range denotes the ratio of the maximum to the minimum value of a grayscale image; its logarithmic expression is as follows:

$$DR = 20\log_{10}\frac{I_{\max}}{I_{\min}}$$

where $I_{\max}$ and $I_{\min}$ denote the maximum and minimum values of the grayscale image, respectively.
The larger the dynamic range is, the higher the image contrast is.
The linear index of fuzziness (LIF) describes the fuzzy degree of an image, and it is defined by:

$$\mathrm{LIF} = \frac{2}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\min\left(\bar{I}(i,j),\,1-\bar{I}(i,j)\right)$$

where $\bar{I}(i,j)$ denotes the pixel intensity normalized to the interval [0, 1].
The smaller the LIF value is, the sharper the image is.
The average gradient (AG) of an image is calculated by:

$$\mathrm{AG} = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\sqrt{\frac{G_x^{2}(i,j)+G_y^{2}(i,j)}{2}}$$

where $G_x$ and $G_y$ represent the horizontal and vertical gradients of the image, respectively.
The larger the AG value is, the clearer the edge details of the image are.
The gray-level difference (GLD) of an image indicates the edge sharpness of the target area of interest, and it is obtained by:

$$\mathrm{GLD} = \sum_{i}\sum_{j}\left(\left|I(i,j)-I(i+1,j)\right| + \left|I(i,j)-I(i,j+1)\right|\right)$$
The larger the GLD value is, the clearer the image edge is.
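Several of the metrics above are simple to compute directly. The sketch below (assuming NumPy; the two test images are synthetic toy arrays, not WGAN outputs) implements the mean, variance, information entropy, and average gradient for an 8-bit grayscale image:

```python
import numpy as np

# Sketch implementations of four of the metrics above for a grayscale
# image with values in [0, 255].
def image_mean(img):
    return img.mean()

def image_variance(img):
    return ((img - img.mean()) ** 2).mean()

def image_entropy(img):
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                     # drop empty bins before taking logs
    return -np.sum(p * np.log2(p))

def average_gradient(img):
    gx = np.diff(img.astype(float), axis=1)[:-1, :]  # horizontal gradient
    gy = np.diff(img.astype(float), axis=0)[:, :-1]  # vertical gradient
    return np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2))

flat = np.full((8, 8), 128, dtype=np.uint8)            # constant image
noisy = (np.arange(64, dtype=np.uint8) * 4).reshape(8, 8)
print(image_entropy(flat), image_entropy(noisy))
```

The constant image has zero entropy and zero average gradient, while the 64-level ramp image has a much higher entropy, matching the interpretation of these metrics given above.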