1. Introduction
Owing to its privacy-aware nature and robustness across a variety of operating conditions, radar technology is finding increasing application in healthcare [1,2,3,4,5,6,7,8,9,10,11]. These include remote patient monitoring outside of a hospital setting, rehabilitation interventions with a focus on improving mobility, and eldercare for aging-in-place. From an algorithmic perspective, human activity recognition is a core component of radar-sensing solutions for such applications.
Classification of human activities using radar has recently experienced an influx of deep learning models, owing to their predictive power and ability to automatically learn relevant discriminant features from radar measurements [12,13,14,15,16,17,18]. In particular, convolutional neural networks (CNNs) are being extensively used for learning spatial hierarchies of features from micro-Doppler signatures of human activities [19,20,21,22,23,24,25,26]. In [19], a four-layer CNN-based activity classifier was used with Cepstral heatmaps, which were computed from the real radar spectrograms by applying an optimized filter bank generated on a diversified simulation database. A flexible deep CNN model was proposed in [20] to classify Doppler signatures of humans walking with different arm movements; therein, a Bayesian learning technique was used to optimize the network. In [21], a dot-product attention-augmented convolutional autoencoder was proposed to learn both localized information and global features from micro-Doppler signatures; the attention-augmented model achieved superior classification accuracy compared to its conventional counterpart. In [22], AlexNet was trained with an attention module to learn to highlight salient regions in micro-Doppler signatures, which in turn was shown to enhance the network predictions. A hybrid model comprising a long short-term memory (LSTM) network and a one-dimensional CNN was introduced in [23], which provided enhanced classification of human activities at relatively low complexity compared to two-dimensional (2-D) CNN methods. Complex-valued CNN-based architectures were investigated in [24] with micro-Doppler signatures, range–time plots, and range–Doppler maps as the data formats of choice. Using experimental data of nine human activities, the advantages of complex-valued models over their real-valued counterparts were demonstrated for certain data formats and network architectures. In [25], a multi-view CNN and LSTM hybrid network was proposed for human activity recognition, which fused multiple views of the time–range–Doppler radar data cube. In [26], a millimeter-wave radar was used for real-time contactless fitness tracking via deep CNNs, providing an effective alternative to body-worn fitness trackers.
Most CNN-based solutions for recognizing human activities with radar readily employ batch normalization (BN) [27], which standardizes the activations of each batch in a layer. This renders the loss function considerably smoother, which in turn leads to improved accuracy and training speed for gradient-based methods [28]. Benefits beyond those afforded by BN in terms of model optimization and generalization can be achieved by whitening the hidden layers’ activations [29]. However, to the best of our knowledge, the impact of decorrelating the activations by whitening has not been investigated for the application at hand. In this paper, we propose the use of a whitening-aided CNN to effectively distinguish between radar micro-Doppler signatures of different human activities. We employ the iterative batch normalization (IterNorm) technique [30], which uses Newton’s iterations to efficiently implement whitening, thereby avoiding the high computational load imposed by the eigen-decomposition of the data covariance matrix that would otherwise be required. Convergence of IterNorm is guaranteed by normalizing the eigenvalues of the covariance matrix. Additionally, following the work in [31], we exploit the rotational freedom afforded by the whitening matrix to design an add-on rotation module, which can align different activity classes with orthogonal directions in the latent space. We test two whitening-aided CNN models, one exploiting IterNorm only in lieu of BN layers and the other replacing BN layers with IterNorm + rotation modules, on real data measurements of six different activities, namely, sitting down, standing up, walking, drinking water, bending to pick up an object, and falling. We show that whitening the latent space of a model provides significant enhancements in classification accuracy compared to the CNN architecture with BN layers, with the alignment of the axes along the classes via rotation providing a slight advantage over the IterNorm-only model.
The remainder of the paper is organized as follows. Section 2 describes the radar signal model and the micro-Doppler signatures. The BN and whitening methods are presented in Section 3, while the whitening-aided CNN models for human activity classification are described in Section 4. With the aid of real data examples, we demonstrate in Section 5 the usefulness of the whitening-aided models in achieving higher classification accuracy and also provide insights into the achieved performance enhancements over a base model employing BN layers. Concluding remarks are provided in Section 6.
2. Signal Model and Micro-Doppler Signatures
Consider a frequency-modulated continuous-wave (FMCW) radar, with the transmit signal, $s(t)$, given by

$$s(t) = A_t \cos\left(2\pi f_c t + \pi \gamma t^2\right), \qquad (1)$$

where $A_t$ is the signal amplitude, $f_c$ is the carrier frequency, and $\gamma$ is the chirp rate. For a moving point target, the radar return, $x(t)$, can be expressed as

$$x(t) = A_r \cos\left(2\pi f_c (t-\tau) + \pi \gamma (t-\tau)^2 + 2\pi f_D t\right), \qquad (2)$$

where $A_r$ is the received signal amplitude, $\tau$ is the two-way travel time, and $f_D$ is the Doppler shift. The in-phase ($I$) and quadrature-phase ($Q$) components of the complex baseband signal, $x_b(t)$, can be obtained by demodulating $x(t)$ using the $I$/$Q$ demodulator as

$$x_b(t) = A_b\, e^{j\left(2\pi \gamma \tau t + 2\pi f_c \tau + 2\pi f_D t\right)}, \qquad (3)$$

where $A_b$ is the amplitude of $x_b(t)$.

For the activity recognition problem, the human body can be viewed as a collection of moving point scatterers, which results in the corresponding radar return being a superposition of individual returns of the form of (3), represented by

$$x_b(t) = \sum_i A_i\, e^{j\left(2\pi \gamma \tau_i t + 2\pi f_c \tau_i + 2\pi f_{D,i} t\right)}, \qquad (4)$$

where $A_i$ is the amplitude, $f_{D,i}$ is the Doppler frequency, and $\tau_i$ is the two-way travel time, all corresponding to the $i$th point scatterer.

Once the complex baseband signal has been sampled, it can be arranged as a 2-D matrix, $x_b[n,p]$, with $n$ and $p$ denoting fast-time and slow-time, respectively. To compute the range map, $R[k,p]$, we take the discrete Fourier transform (DFT) along the matrix columns, represented by

$$R[k,p] = \sum_{n=0}^{N-1} x_b[n,p]\, e^{-j 2\pi k n / N}, \qquad (5)$$

where $N$ is the number of samples (range bins) in one pulse repetition interval, $k = 0, 1, \ldots, N-1$, and $p = 0, 1, \ldots, P-1$, with $P$ representing the total number of considered pulse repetition intervals. Next, the corresponding micro-Doppler signature is obtained through a two-step process. First, we sum the data over the range bins of interest as

$$x[p] = \sum_{k=k_{\min}}^{k_{\max}} R[k,p], \qquad (6)$$

with $k_{\min}$ and $k_{\max}$ being the minimum and maximum range bins considered. Then, we apply the Short-Time Fourier Transform (STFT) to $x[p]$ and compute the micro-Doppler signature, $S[p,q]$, as the spectrogram (the squared magnitude of the STFT). That is,

$$S[p,q] = \left| \sum_{m=0}^{M-1} x[ph + m]\, w[m]\, e^{-j 2\pi q m / M} \right|^2, \qquad (7)$$

where $w[\cdot]$ represents the window of length $M$ that determines the trade-off between time and frequency resolutions [32], the integer $h$ determines the step size by which the window is shifted across the signal $x[p]$, $p$ is the time index, and $q$ is the frequency index. These micro-Doppler signatures serve as the input to the CNN-based classifier for human activity recognition.
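The range-map, range-bin summation, and STFT steps described above can be sketched in NumPy as follows. The window length, step size, range-bin limits, and the toy target parameters are illustrative choices, not the experimental settings of the paper.

```python
import numpy as np

def micro_doppler_signature(xb, k_min, k_max, M=64, h=2):
    """Sketch of the range-map/spectrogram pipeline: xb is the sampled
    baseband data arranged as fast-time x slow-time (N x P)."""
    R = np.fft.fft(xb, axis=0)                  # DFT along fast-time: range map
    x = R[k_min:k_max + 1].sum(axis=0)          # sum over range bins of interest
    w = np.hanning(M)                           # analysis window of length M
    P = x.size
    frames = [x[p:p + M] * w for p in range(0, P - M + 1, h)]
    stft = np.fft.fft(np.stack(frames, axis=0), axis=1)
    return np.abs(stft) ** 2                    # spectrogram (time x Doppler)

# toy example: one scatterer with a sinusoidally varying Doppler frequency
N, P = 128, 1024
n = np.arange(N)[:, None]
p = np.arange(P)[None, :]
xb = np.exp(1j * 2 * np.pi * (0.1 * n + 0.05 * np.sin(2 * np.pi * p / P) * p))
S = micro_doppler_signature(xb, k_min=10, k_max=16, M=64, h=2)
```

The spectrogram has one row per window position and one Doppler column per DFT bin; in practice the window choice controls the time–frequency resolution trade-off noted above.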
4. Whitening-Aided CNN-Based Activity Classification
Having described the whitening methods, we are now in a position to present the whitening-aided CNN models for human activity recognition.
We consider a base CNN model consisting of a series of building blocks. Each building block comprises a convolutional layer, followed by a max-pooling layer and then a BN layer, as seen in Figure 1a. Each convolutional layer generates feature maps by convolving its input with 2-D filters in a sliding-window fashion and then feeding the filter outputs to an activation function. Considering a convolutional layer with $L$ filters and denoting the input of the convolutional layer by $\mathbf{X}$, we can express the $l$th convolutional map, $\mathbf{F}_l$, corresponding to the $l$th filter as

$$\mathbf{F}_l = \phi\left(\mathbf{X} * \mathbf{W}_l + b_l\right), \qquad (8)$$

where ‘$*$’ denotes 2-D convolution, $\phi(\cdot)$ is the activation function, $b_l$ is the bias term corresponding to the $l$th map, and $\mathbf{W}_l$ is the $l$th 2-D convolutional filter. Next, the max-pooling layer downsamples the feature maps by taking the maximum over a spatial window for complexity reduction [35]. Finally, the BN layer applies centering and scaling operations to normalize the downsampled feature maps within a batch. We note that the micro-Doppler signature of (7) serves as the input of the first building block, whereas the input of each subsequent block is the output of the previous block.
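A minimal NumPy sketch of one building block follows. The input size, filter values, and pooling window are arbitrary, the batch-dependent normalization layer is omitted (it operates across many samples), and, as is standard in CNN implementations, the “convolution” is realized as cross-correlation (no kernel flip).

```python
import numpy as np

def conv2d(X, W, b):
    """Valid-mode 2-D sliding-window filtering of input X with filter W plus bias b."""
    kh, kw = W.shape
    H, Wd = X.shape
    out = np.empty((H - kh + 1, Wd - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + kh, j:j + kw] * W) + b
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(F, s):
    """Non-overlapping s x s max-pooling."""
    H, W = F.shape
    H2, W2 = H // s, W // s
    return F[:H2 * s, :W2 * s].reshape(H2, s, W2, s).max(axis=(1, 3))

def building_block(X, filters, biases, pool=2):
    """One building block: conv -> activation -> max-pool, one map per filter
    (the batch normalization step is left out of this single-sample sketch)."""
    return [max_pool(relu(conv2d(X, W, b)), pool) for W, b in zip(filters, biases)]

rng = np.random.default_rng(1)
X = rng.standard_normal((16, 16))
filters = [rng.standard_normal((3, 3)) for _ in range(4)]
maps = building_block(X, filters, biases=[0.0] * 4)
```

Each filter yields one downsampled feature map; stacking blocks feeds these maps forward as the next block’s input.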
A whitening-aided CNN model is essentially the same as the base CNN model with the exception that it employs a whitening layer in lieu of BN in its building blocks. We consider two whitening-aided models, namely, whitening-aided models 1 and 2; the former replaces the BN layer with an IterNorm layer as shown in Figure 1b, whereas the latter employs IterNorm + Rotation in place of BN as depicted in Figure 1c.
We note that in Section 3, the activations for the BN and whitening methods are assumed to be vectors. However, the output of a convolutional layer comprises a total of $L$ 2-D feature maps. As such, the batch input to any normalization layer in this case would be of size $h_f \times w_f \times L \times m$, where $h_f$ and $w_f$ indicate the height and width of the downsampled feature maps (output of the max-pooling layer) and $m$ is the number of samples in the batch. Following [27,30,31], we unroll the batch input as $\mathbf{X} \in \mathbb{R}^{L \times (h_f w_f m)}$. The BN and whitening operations can now proceed with the unrolled $\mathbf{X}$ as the batch input.
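The unrolling step can be sketched as follows. The batch and feature-map dimensions are illustrative, and plain per-feature standardization stands in for the full BN layer (the learned scale and shift are omitted).

```python
import numpy as np

# Batch of downsampled feature maps: (m samples, L channels, height, width).
m, L, hf, wf = 10, 8, 5, 5
rng = np.random.default_rng(2)
batch = rng.standard_normal((m, L, hf, wf))

# Unroll so that each channel becomes one row of an L x (m*hf*wf) matrix;
# BN/whitening then treat every spatial position of every sample as one example.
X = batch.transpose(1, 0, 2, 3).reshape(L, m * hf * wf)

# Batch normalization on the unrolled input: standardize each row.
X_bn = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# Roll back to the original layout for the next building block.
batch_bn = X_bn.reshape(L, m, hf, wf).transpose(1, 0, 2, 3)
```

A whitening layer would apply the same unrolling but multiply the centered rows by an (approximate) inverse square root of their covariance instead of scaling each row independently.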
5. Experimental Results
In this section, we evaluate the performance of the whitening-aided CNN models for human activity classification using real data measurements. We compare the classification accuracy of the whitening-aided models with that of the base CNN model.
5.1. Experimental Dataset
We employ the human activity dataset collected at the University of Glasgow, UK [36]. This dataset consists of six smaller subsets, out of which we employ the three subsets collected in 2017 in a laboratory environment. The data were collected using an FMCW radar, model SDR-KIT-580B by Ancortek (Fairfax, VA, USA), with a 5.8 GHz carrier frequency, 400 MHz bandwidth, and a chirp duration of 1 ms, delivering an output power of approximately 18 dBm. Two Yagi antennas, each with a gain of about 17 dB, were used for signal transmission and reception. The number of samples per recorded beat-note signal was set to 128. The dataset contains six activity classes: walking, sitting down, standing up, bending to pick up an object, drinking water, and falling. A total of 33 participants served as test subjects, 31 male and 2 female, ranging in height from 149 cm to 188 cm and in age from 22 to 36 years. Each participant repeated each activity two to three times along the radar’s line of sight, i.e., measurements were made at normal incidence. The spectrograms were computed using a Hanning window of length 256 with 2048 frequency points and an overlap of 254 points, i.e., a window step size of 2 samples in (7). The resulting micro-Doppler signatures were then cropped, downscaled, and converted to grayscale images with pixel values ranging from 0 to 255. The dataset contains a total of 570 micro-Doppler signatures, with 95 signatures per class. Representative signatures of each of the six activities are shown in Figure 2; the horizontal axis represents time while the vertical axis is Doppler frequency.
5.2. CNN Models and Training
For illustration, we employ the learning architecture depicted in Figure 3, where the input to the network is a micro-Doppler signature. The network output is a one-hot encoded length-6 vector such that the location of the ‘1’ indicates a specific human activity. The input is passed through a 3-layer CNN implementing 32, 64, and 128 filters, respectively. A max-pooling layer with a stride of 3 follows each convolutional layer, and a normalization layer is the last module in each building block. A dropout layer (not shown in Figure 3) with a 15% rate is also included before the fully-connected output layer. The ReLU activation function is used for all layers except the output layer, which uses a softmax function. Three variants of this learning architecture are considered, differing in the employed normalization method, as detailed in Figure 1. Specifically, these include the base model with BN layers, whitening-aided model 1 with IterNorm layers, and whitening-aided model 2 with IterNorm + Rotation layers.
We utilize cross-entropy as the loss function for activity classification. To optimize the model, we apply stochastic gradient descent with a batch size of 10. We use an adaptive learning rate with an appropriate initial value for each CNN model, decreased by a factor of 10 after every seven epochs. A maximum of 30 epochs is used for training the base model and whitening-aided model 1, with the number of iterations for IterNorm set to 5. For whitening-aided model 2, we perform a warm start with the pretrained whitening-aided model 1, to which we add the rotation modules and continue the training for five additional epochs.
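The step-decay learning-rate schedule described above (division by 10 after every seven epochs) can be expressed as a small helper; the function name and the initial rate used in the example are illustrative, since the paper tunes the initial value per model.

```python
def step_decay_lr(lr0, epoch, factor=10.0, period=7):
    """Learning rate at a given epoch: the initial rate lr0 is divided by
    `factor` once for every completed `period` of epochs."""
    return lr0 / (factor ** (epoch // period))

# illustrative schedule over 15 epochs with an initial rate of 0.01
schedule = [step_decay_lr(0.01, e) for e in range(15)]
```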
5.3. Classification Accuracy
We first examine the classification accuracy of the proposed whitening-aided models as a function of the number of training samples per class. We let the proportion of training samples vary from 20% to 80% in increments of 30%, with the remaining signatures in each instance utilized for testing. For each considered split, we conduct 30 classification experiments over distinct training and testing datasets using the base CNN model and its whitening-aided counterparts, and calculate the mean and standard deviation of the test-data classification accuracy for all three classifiers. The results are provided in Table 1. We clearly observe that for each training/testing split, both whitening-aided models significantly outperform the base model, especially under limited training samples. This is attributed to the reduced model confusion amongst the six classes resulting from the whitening of the latent space. The addition of the rotation module in whitening-aided model 2, which maximizes the class activations along the latent-space axes, provides an additional 1.5% to 2% increase in average accuracy and relatively lower standard deviation values over whitening-aided model 1. This attests to the further class disentanglement brought about by constraining the latent space to represent the classes. For further illustration of the impact of whitening, we compute the confusion matrices, averaged over 30 trials, corresponding to the base and whitening-aided models for the 50%-50% training/testing data split. These confusion matrices, depicted in Figure 4, clearly demonstrate that the addition of the whitening layers causes a reduction in the model confusion for all six classes, with whitening-aided model 2 providing slightly higher reductions than whitening-aided model 1.
Next, we consider the 50%-50% training/testing data split and investigate the impact of whitening on the classification performance when introduced as a replacement for a single BN layer in the base model, leaving the remaining two BN layers intact. The corresponding average value and standard deviation of the classification accuracy are provided in Table 2, with the values corresponding to the base model under the column labeled “Base Model” and those corresponding to whitening methods 1 and 2 replacing BN in the first, second, and third layers of the network in the respective columns labeled “Layer 1”, “Layer 2”, and “Layer 3”. We observe that, compared to the base model, replacing even one BN layer with either whitening module yields performance enhancements, with progressively higher improvements as the whitening layer is introduced at increasing depth in the network. Again, whitening method 2 provides higher accuracy on average and lower standard deviation than whitening method 1. Comparing the results in Table 1 for the 50%-50% training/testing data split with those in Table 2, we see that while replacing all BN layers with whitening layers yields the best performance, there is considerable value in replacing even a single BN layer with a whitening layer, especially deeper in the network, and more so for whitening method 2 than for method 1.
5.4. Correlation Coefficients
To visually highlight the decorrelation aspect of the whitening layers, we consider the 50%-50% training/testing data split and measure the output of the normalization modules for the test set in each layer of the trained base model, whitening-aided model 1, and whitening-aided model 2. We then calculate the absolute value of the correlation coefficient of every feature pair in each layer of the respective models. As depicted in the top row of Figure 5, the base model with all BN layers exhibits relatively strong correlations. This is expected, since BN only standardizes the activations and does not decorrelate them. On the other hand, when all BN layers are replaced by either IterNorm layers or IterNorm + Rotation layers, the features in every layer indeed become decorrelated, as seen in the middle and bottom rows of Figure 5, thereby leading to improved classification performance.
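Measuring the absolute pairwise correlation coefficients of a set of features, as visualized in Figure 5, reduces to the following sketch; the synthetic features stand in for the normalization-layer outputs.

```python
import numpy as np

def abs_corr(X):
    """Absolute correlation coefficients between feature pairs;
    X holds one feature per row, one example per column."""
    return np.abs(np.corrcoef(X))

rng = np.random.default_rng(3)
Z = rng.standard_normal((3, 500))
Z[2] = 0.9 * Z[0] + 0.1 * Z[2]       # make features 0 and 2 strongly correlated
C = abs_corr(Z)
```

For well-whitened activations, the off-diagonal entries of `C` would be close to zero; the diagonal is 1 by construction.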
5.5. Top Activated Signatures
An important characteristic of whitening method 2 is its alignment of the axes of the latent space with the activity classes, which has been shown to enable an understanding of the learning process across the layers [31]. To this end, in this example, we assess the relationship between the test samples and a class label in the latent space for a trained whitening-aided model 2 with the 50%-50% training/testing data split. We calculate the activation values of the test samples on each axis for each label and identify the top activated signature for each class in each layer, depicted in Figure 6. We observe that in the third layer, the top activated signatures correspond to the correct class labels. However, in the first layer, as the convolutional layers capture low-level information, the alignment is not as accurate as in the higher layers. We also determine the empirical receptive fields of the top activated signatures by identifying those locations in each signature which, when masked, cause the largest reduction in the activation values on different latent-space axes [31]. For this purpose, we apply random masking patches with a stride of 5 to the top activated images. The corresponding results are shown as highlighted regions in Figure 6. Clearly, in the first layer, the extracted features appear to be related to the background, while by the third layer, the learned features are predominantly drawn from the main pattern of the micro-Doppler signature. For example, the “Walking” axis in the third layer focuses on the sinusoidal segments of the signature, while the “Falling” axis converges on the waterfall shape of the corresponding micro-Doppler signature.
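The occlusion-style procedure for estimating empirical receptive fields can be sketched as follows. Here `score_fn` is a placeholder for the activation of a chosen latent-space axis, and the patch size of 8 and the toy energy score are illustrative assumptions (the paper does not specify these details here).

```python
import numpy as np

def occlusion_map(img, score_fn, patch=8, stride=5):
    """Empirical receptive field via occlusion: slide a masking patch over
    the image and record the drop in the activation score at each location."""
    H, W = img.shape
    base = score_fn(img)
    drops = np.zeros_like(img, dtype=float)
    for i in range(0, H - patch + 1, stride):
        for j in range(0, W - patch + 1, stride):
            masked = img.copy()
            masked[i:i + patch, j:j + patch] = 0.0   # mask out one patch
            drops[i:i + patch, j:j + patch] = np.maximum(
                drops[i:i + patch, j:j + patch], base - score_fn(masked))
    return drops  # largest values mark the most influential image regions

# toy score: total energy in the central region of the image
score = lambda x: float(np.sum(x[20:44, 20:44] ** 2))
img = np.ones((64, 64))
heat = occlusion_map(img, score, patch=8, stride=5)
```

Thresholding the resulting map yields the highlighted regions of the kind shown in Figure 6.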
5.6. Performance with Unseen Testing Data
In this final example, we examine the performance of the whitening-aided models on unseen testing data. Specifically, we retrain the networks using the micro-Doppler signatures of 27 of the 33 human subjects (77 samples per class). The signatures of the remaining six subjects (18 samples per class), which were excluded from the training data, are used for testing. This is roughly equivalent to an 80%/20% training/testing data split. The respective classification accuracy values of the base model, whitening-aided model 1, and whitening-aided model 2 are 85.18%, 89.81%, and 92.59%. We note that the accuracy of each model is relatively lower than the corresponding average values reported in Table 1 for the 80%/20% data split. However, even in this case of unseen data, the superiority of the whitening-aided models over the base model is clearly evident, with whitening-aided model 2 outperforming whitening-aided model 1, as in the previous examples.
5.7. Summary of Findings
The above examples clearly demonstrate the superior performance of the whitening-aided CNN models over the base CNN model for human activity classification. The performance enhancements hold irrespective of whether testing uses unseen subjects or samples from subjects the models have seen during training. This superiority is attributed to the ability of the whitening layers to not only standardize but, more importantly, decorrelate the activations, and, in the case of whitening method 2, also to the alignment of the latent-space axes with the activity classes. Further, while the results suggest replacing all BN layers in a CNN model with whitening layers to exploit their benefits to the fullest, considerable performance enhancements over the base model can be realized by using a whitening layer in lieu of even a single BN layer, with the level of improvement increasing with the depth at which this replacement occurs in the network. Furthermore, the performance evaluation of the two whitening methods showed that the addition of the specific rotation module to IterNorm, which maximizes the activation of the classes along the latent-space axes, provides model 2 with an appreciable advantage over model 1 in terms of classification accuracy, albeit at the additional expense of implementing the rotation module.