1. Introduction
In recent years, the use of machine learning techniques to model and interpret sensor data for the analysis of human activity has become a prolific field of research, due to its value in applications such as healthcare, sports, robotics, and security. In healthcare in particular, long-term activity records are valuable for identifying early signs of cognitive aging problems, and the accurate and timely detection of falls or freezing of gait (FOG) from sensor data could be life-saving in elderly care and emergency response.
Sensor-based human activity recognition (HAR) technologies can be divided into three main groups: (i) based on image/video; (ii) based on inertial sensor data in wearable devices; and (iii) based on wireless radio signals. Compared with wearable-based approaches, noninvasive methods based on video or wireless radio signals require less participation from subjects, and therefore are suitable for dependent individuals. Recently, the wireless-radio-signal-based approach [
1] has attracted increasing attention due to its privacy-preserving nature and independence from lighting conditions, compared to video-based counterparts. These characteristics make wireless-signal-based sensing particularly suitable for residential environments, which are the main sites of future healthcare [
2].
WiFi-based HAR, with its ubiquity and privacy-respecting nature, has many strengths, including resilience to environmental factors such as temperature fluctuations and the ability to operate without line of sight. However, interpreting wireless signals, generally through channel state information (CSI), is a complex problem [
3]. While WiFi has demonstrated significant success, particularly in areas like activity recognition [
4] and fall detection [
5], our work on low-resolution infrared (LRIR) sensors offers a complementary solution rather than a competing one. LRIR sensors provide advantages in specific use cases where installation complexity and environmental constraints are minimal concerns, and they offer privacy-preserving benefits inherently tied to their low resolution, in contrast to other video-based solutions. Moreover, LRIR sensors have the added advantage of being less susceptible to electromagnetic interference or noise from other wireless signals, which can be a challenge in crowded or signal-dense environments where WiFi-based systems might face performance degradation. In addition, while WiFi-based systems, particularly those using CSI, can be sensitive to environmental changes due to fluctuations in the wireless channel, LRIR sensors offer more stable performance in dynamic environments by directly detecting thermal patterns, although they still require an unobstructed line of sight.
The use of LRIR sensors to recognize human activities has been investigated in previous works. Jeong et al. [
6] proposed a probabilistic method with multiple image processing techniques interpolating 8 × 8 data. In that work, the original 8 × 8 thermal pixels formed the heat signature of the human subjects, and they were interpolated to 29 × 29 pixels to apply Gaussian filtering to remove high-frequency noise due to temperature fluctuations. Mashiyama et al. [
7] proposed a method for activity recognition using the temperature distribution obtained from an 8 × 8 LRIR sensor. Burns et al. [
8] employed two 32 × 31 LRIR sensors to detect daily activities in a kitchen, comparing ground frames with activity frames and using a random forest model for classification. Fan et al. [
9] made a comparison between gated recurrent units (GRUs) and long short-term memory (LSTM) models to classify fall detection using an 8 × 8 LRIR sensor. Yin et al. [
10] used a noise reduction filter and an LSTM model to automatically extract features from an 8 × 8 LRIR image and build a recognition model. Karayaneva et al. [
11] thoroughly investigated noise removal, feature extraction, and neural network models for LRIR data for HAR.
Since high accuracies have been obtained in recognizing human activities with data from the same scenarios, the next step is to achieve similar accuracies with data from different scenarios, days, and conditions through transferability strategies. Accordingly, recent research efforts have been directed towards achieving transferability of results between different environments, often referred to as cross-domain. In this context, the term ‘domain’ refers to the setup, the environment, and the conditions under which the measurements were taken. Based on this, cross-domain implies that a model trained with data from a specific condition, e.g., a particular day, room, subject, and/or particular conditions, is able to correctly classify input data recorded under other conditions.
Although cross-domain HAR is being extensively explored in WiFi- and video-based systems, its application to LRIR sensors remains underexplored. This represents a significant research gap, as the ability to transfer knowledge across domains is essential for the wider adoption of LRIR-based HAR systems. To address this, our paper introduces a novel model aimed at bridging this gap and enhancing the transferability of LRIR-based HAR.
Some deep learning (DL) strategies have been designed for this objective [
12], and one of the most prolific is few-shot learning (FSL). FSL enables cross-domain adaptation with only a few labeled samples from the target dataset, starting from a model pretrained on an extensive dataset. Yin et al. [
13] used a semi-supervised cross-domain neural network based on 8 × 8 LRIR images to accurately identify human activities. This network consisted of a convolutional neural network (CNN) as a feature extractor, followed by two fully connected (FC) networks: a two-layer FC for domain discrimination and a three-layer FC as a label classifier. This network was a domain adversarial neural network and, in this case, incorporated an FSL strategy in which a small number of the testing samples were labeled for training.
One of the most prominent methods for FSL is prototype networks (PNs). These networks are based on learning a prototypical representation for each class, which allows the classification of new samples by comparing them with these prototypes in an embedding space. To this end, the network employs an embedding function to learn the specific representation of the classes, which is the core of the model. This approach improves model efficiency by requiring less labeled data and facilitates generalization to new classes that were not seen during training [
14]. The PNs most commonly used in the literature employ CNNs as embedding functions to extract features [
15,
16,
17,
18,
19,
20].
A prototype recurrent convolutional network (PRCN) is a type of PN that employs a sequence of convolutional blocks and LSTM cells, combining CNN and LSTM layers as the embedding function. This CNN-LSTM scheme, also called a long-term recurrent convolutional network (LRCN), was first used in [
21] as a feature extractor for video-based activity recognition. That work showed that an LRCN fed with packets of several frames greatly improved on the single-frame CNN baseline previously employed for this task, exploiting the capacity of its LSTM cells to learn to recognize and synthesize temporal dynamics in sequential data. Since then, the LRCN has become a popular model in many fields, mainly those where CNNs had previously obtained good results and the events involved are nonstationary [
22,
23].
PRCNs have been previously used in various works. In [
24], the authors employed a CNN-FC-BiLSTM scheme as an embedding function to extract features from X-band SAR images in the MSTAR dataset. In [
25], the authors used a one-dimensional convolutional layer and two bidirectional gated recurrent unit (Bi-GRU) layers as the embedding function to classify speech imagery data; GRUs are a simplified variant of LSTM cells with fewer gates. In [
26], the authors developed a similar network based on BiLSTM-CNN-FC to classify different types of medical data with temporal sequences.
To address the LRIR cross-domain gap, our work proposes a cross-domain model for LRIR data employing a novel PRCN. This PRCN uses a sequence of convolutional, long short-term memory, and fully connected layers as the embedding function to generate the prototypes, as explained in the following sections. In addition, we compared the results with two other PNs: a prototype convolutional network (PCN) and a prototype recurrent network (PRN), which employ convolutional and recurrent layers, respectively, as embedding functions. The results confirmed that the proposed model is robust and exhibits transferability for HAR data recorded with LRIR sensors. The model achieved an accuracy greater than 90% in activity identification and accurate performance (up to 85%) when the pretrained model was used on a dataset from a different domain.
The remainder of this paper is organized as follows.
Section 2 explains the three prototype networks used in the work.
Section 3 describes the two datasets used to test the models and the cross-domain between them.
Section 4 covers the evaluation method, including the data selection to train the networks and parameter values. Finally,
Section 5 discusses the results and
Section 6 presents the conclusions.
2. Prototypical Networks
This section explains how a prototype network works from a general point of view. Then, in each subsection, we describe the three prototypical networks employed in this work, which take different input formats and extract different features.
In general, a prototype network is based on the idea that it can generate a prototype $\mathbf{c}_k$ that fits each class $k$. To do so, the network searches for the optimal embedding space in which the samples used as inputs in the training stage are closer to the prototype of their own class than to the others. The embedding space is searched by updating the embedding function in each iteration. Each iteration consists of selecting a certain number of samples from each class and dividing them into a support set $S$ and a query set $Q$. The samples in $S$ generate the prototypes; each prototype is the average of the embeddings of its class samples in the support set. Then, the Euclidean distance between each sample of $Q$ and each prototype is measured, obtaining the number of matches and misses. A scheme of this process is shown in
Figure 1.
Mathematically, each prototype $\mathbf{c}_k$ is the mean vector of the embedded points of the support set $S_k$ belonging to class $k$, such that
$$\mathbf{c}_k = \frac{1}{|S_k|} \sum_{(\mathbf{x}_i, y_i) \in S_k} f_{\phi}(\mathbf{x}_i),$$
where $|S_k|$ is the cardinality of the support set $S_k$ and $f_{\phi}$ is the embedding function. Therefore, assuming a query set of unlabeled samples $Q$, classification for a given sample $\mathbf{x} \in Q$ is carried out by finding the minimum distance to the prototypes, as follows:
$$\hat{y} = \underset{k}{\arg\min}\; d\left(f_{\phi}(\mathbf{x}), \mathbf{c}_k\right),$$
where $\hat{y}$ is the estimated class for the sample $\mathbf{x}$. Note that any input $\mathbf{x}$ is classified into one of the classes $k$, even if it does not belong to any of them.
The embedding function in a prototype network is a neural network that searches for the optimal embedding space in which the samples of the query set are clustered together with their prototype. The search is based on updating the weights in each iteration. This means that finding the optimal embedding space depends on the capacity of the neural network used as the embedding function.
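To make these definitions concrete, the following is a minimal PyTorch sketch of the prototype computation and nearest-prototype classification described above; the embed callable stands in for any of the embedding functions in the following subsections, and all names and tensor shapes are illustrative assumptions rather than the exact implementation used in this work.

```python
import torch

def compute_prototypes(embed, support_x, support_y, num_classes):
    """Each prototype c_k is the mean of the embedded class-k support samples."""
    z = embed(support_x)                      # (n_support, 32) embedded points
    return torch.stack([z[support_y == k].mean(dim=0)
                        for k in range(num_classes)])  # (num_classes, 32)

def classify(embed, query_x, prototypes):
    """Assign each query sample to the class of its nearest prototype."""
    zq = embed(query_x)                       # (n_query, 32)
    dists = torch.cdist(zq, prototypes)       # pairwise Euclidean distances
    return dists.argmin(dim=1)                # estimated class per query sample
```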
2.1. Prototypical Convolutional Network
In our work, the PCN model is based on [
15]. The embedding function consists of four convolutional blocks. Each convolutional block consists of a 2D convolutional layer, a dropout layer (10%), a batch normalization layer, and a final 2D max-pooling layer. The convolutional layer extracts features from the input image (or from the output of the previous layer in deeper blocks of the network). The dropout layer randomly deactivates neurons in each training iteration and is used to avoid overfitting the network to the incoming data and to improve generalization. The normalization layer adjusts the mean and variance of the data to the same values for each input batch. The final max-pooling layer reduces the dimensionality of each block's output while preserving the dominant features. The output of the last convolutional block is vectorized, and this vector is used as input to the FC layer, which in our model generates a 32-dimensional output.
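A minimal PyTorch sketch of this embedding function is given below; the kernel size, channel width, and activation placement are not specified above and are therefore assumptions.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # One PCN block: convolution, 10% dropout, batch normalization,
    # and 2x2 max pooling (kernel size and ReLU placement are assumed).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.Dropout2d(0.1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

class PCNEmbedding(nn.Module):
    def __init__(self, ch=32, out_dim=32):
        super().__init__()
        # Four blocks halve the spatial size each time: 64x64 -> 4x4
        self.blocks = nn.Sequential(conv_block(1, ch), conv_block(ch, ch),
                                    conv_block(ch, ch), conv_block(ch, ch))
        self.fc = nn.Linear(ch * 4 * 4, out_dim)  # 32-dimensional embedding

    def forward(self, x):                          # x: (batch, 1, 64, 64)
        return self.fc(self.blocks(x).flatten(1))
```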
In this model, each 8 × 8 LRIR image was resized to 64 × 64 using bicubic interpolation, as shown in
Figure 2, and each resized image was used as an input. This interpolation made it possible to apply more convolutional blocks, as each block sequentially reduces the image dimensions. The main reason for using bicubic interpolation instead of other interpolation methods (such as bilinear) is that bicubic interpolation tends to generate smoother and more accurate images, as it takes into account a wider area around each pixel than bilinear interpolation does. This is especially important when working with low-resolution images such as those from infrared sensors, where details are limited and noise or temperature fluctuations can significantly influence image quality [
27,
28].
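As a brief illustration, the upsampling step could be implemented as in the following sketch, here using OpenCV (the library choice and the placeholder frame are assumptions):

```python
import cv2
import numpy as np

frame_8x8 = np.random.rand(8, 8).astype(np.float32)  # placeholder LRIR frame

# Bicubic interpolation fits a cubic surface over a 4x4 neighborhood of each
# pixel, giving smoother results than the 2x2 neighborhood used by bilinear.
frame_64x64 = cv2.resize(frame_8x8, (64, 64), interpolation=cv2.INTER_CUBIC)
```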
The PCN embedding function is depicted in Figure 4a.
2.2. Prototypical Recurrent Network
The PRN employs two consecutive long short-term memory (LSTM) layers as the embedding function. LSTM networks are a type of recurrent neural network (RNN) designed to capture temporal dependencies and sequence information, which makes them particularly well suited for handling sequential data or time series inputs [
29]. Each LSTM layer in the PRN is equipped with a set of gates (input, forget, and output gates) that regulate the flow of information, allowing the network to maintain long-term dependencies and mitigate the vanishing gradient problem commonly encountered in traditional RNNs. In the PRN, the input data are first fed into the initial LSTM layer. This layer processes the input through its internal mechanisms, generating hidden states that encapsulate both the current input and the accumulated knowledge from previous time steps. The output of the first LSTM layer is then passed to the second LSTM layer, which further refines and processes the information. This stacked LSTM configuration enhances the model's ability to learn complex patterns and dependencies within the data. The final output of the second LSTM layer is a 32-dimensional vector representation that summarizes the entire input sequence. This vector is used as input for the FC layer, which generates the 32-dimensional output.
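A minimal PyTorch sketch of this stacked-LSTM embedding function follows; the hidden size is assumed to match the 32-dimensional output described above.

```python
import torch.nn as nn

class PRNEmbedding(nn.Module):
    """Two stacked LSTM layers followed by an FC layer (32-dim output)."""
    def __init__(self, in_dim=64, hidden=32, out_dim=32):
        super().__init__()
        # num_layers=2 stacks the two consecutive LSTM layers described above
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, out_dim)

    def forward(self, x):             # x: (batch, 20, 64) spatiotemporal window
        out, _ = self.lstm(x)         # hidden states for all 20 time steps
        return self.fc(out[:, -1])    # last step summarizes the whole sequence
```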
Regarding the PRN input, the LRIR images were converted into spatiotemporal matrices of shape 20 × 64 (20 frames by 64 pixels), in which each 8 × 8 LRIR image was flattened into a 64-dimensional vector and 20 of these consecutive vectors were grouped to form a spatiotemporal matrix. The window length of 20 consecutive frames was determined experimentally. To increase the number of samples, consecutive spatiotemporal windows overlap the previous window by half, as shown in Figure 3; a code sketch of this windowing is given at the end of this subsection. In this way, each input contains sequential or temporal information about each activity, in line with the aforementioned capabilities of the LSTM cells. The PRN embedding function is depicted in
Figure 4b.
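The window construction referenced above can be sketched as follows in NumPy; the placeholder frame sequence is an assumption.

```python
import numpy as np

def make_windows(frames, win=20, overlap=0.5):
    """Flatten 8x8 frames and group them into overlapping win x 64 windows."""
    flat = frames.reshape(len(frames), -1)          # (n_frames, 64)
    step = int(win * (1 - overlap))                 # 50% overlap -> step of 10
    return np.stack([flat[i:i + win]
                     for i in range(0, len(flat) - win + 1, step)])

frames = np.random.rand(100, 8, 8)                  # placeholder LRIR sequence
windows = make_windows(frames)                      # shape (9, 20, 64)
```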
2.3. Prototypical Recurrent Convolutional Network
A PRCN combines the power of a CNN to extract spatial characteristics and the ability of LSTM cells to model temporal sequences, improving the capabilities of its embedding function. Specifically, our PRCN consisted of four convolutional blocks followed by two LSTM cells and a final fully connected layer. Each convolutional block consisted of a 2D convolutional layer, a dropout layer (10%), a normalization layer, and a final max-pooling layer for dimensionality reduction. The LSTM block consisted of two consecutive 32-dimensional LSTM cells. Finally, the FC layer generated the 32-dimensional output. The PRCN embedding function is depicted in
Figure 4c.
Each input to the network was a 10 × 64 × 64 matrix, formed by ten consecutive 8 × 8 IR images of the same class, each interpolated to 64 × 64 as in
Figure 2. As with the PRN, the window length of 10 consecutive frames was determined experimentally.
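Combining the two previous sketches, a minimal PyTorch version of this embedding function could look as follows; applying the convolutional blocks frame by frame before the LSTM layers follows the LRCN scheme, and the sketch reuses the hypothetical conv_block helper and hyperparameters assumed in the PCN sketch.

```python
import torch.nn as nn

class PRCNEmbedding(nn.Module):
    """LRCN-style embedding: per-frame conv blocks, two LSTM cells, then FC."""
    def __init__(self, ch=32, hidden=32, out_dim=32):
        super().__init__()
        self.cnn = nn.Sequential(conv_block(1, ch), conv_block(ch, ch),
                                 conv_block(ch, ch), conv_block(ch, ch))
        self.lstm = nn.LSTM(ch * 4 * 4, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, out_dim)

    def forward(self, x):                          # x: (batch, 10, 64, 64)
        b, t, h, w = x.shape
        feats = self.cnn(x.reshape(b * t, 1, h, w)).flatten(1)
        out, _ = self.lstm(feats.reshape(b, t, -1))
        return self.fc(out[:, -1])                 # 32-dim embedding per input
```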
4. Evaluation
In this work, evaluation was carried out in two ways. First, a performance comparison of the three prototypical network architectures described in
Section 2 was performed by training and testing the models on the two datasets, with 75% of the data used for training and 25% for testing, using cross-validation with five iterations and the same number of samples per class in training to prevent overfitting.
Second, the cross-domain performance of the proposed PRCN solution was tested. For this purpose, we separated the two datasets into source and target datasets. The source dataset was the one on which the model was trained, and the target dataset was the one on which the model was validated. Both the InfraADL and Coventry datasets were used as sources and targets. Since our model follows an FSL strategy, the target dataset had to have a small number of labeled samples, L. The comparison was made between the model trained directly on the target dataset using the L samples and the model trained on the source dataset and subsequently retrained on the target dataset using the L samples. In this work, the L samples were randomly selected from the entire target dataset.
In addition, fine-tuning was used for the retraining step. Fine-tuning is a technique that consists of freezing the weights of a model's feature-extraction layers and retraining only the remaining layers, thus avoiding the overfitting that can occur when retraining with few samples while preserving what the frozen layers learned during training on an extensive dataset [
32]. Following our previous work [
4], the weights of the convolutional blocks were frozen, while the weights of the LSTM cells and the FC layer were retrained. With this in mind, we tested whether the model trained on the large dataset was more efficient than the model trained directly on the target dataset using
L samples.
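A minimal sketch of this fine-tuning step, assuming the hypothetical PRCNEmbedding class from the Section 2.3 sketch and an illustrative checkpoint path:

```python
import torch

model = PRCNEmbedding()
model.load_state_dict(torch.load("prcn_source.pt"))  # hypothetical checkpoint

# Freeze the convolutional blocks; only the LSTM cells and the FC layer
# keep requires_grad=True and are retrained on the L target samples.
for p in model.cnn.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad])
```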
Table 3 shows the training parameters for the models. Following [
14,
15], the number of samples in both support and query sets was four, while the other parameters were obtained experimentally.
6. Conclusions
This paper dealt with two LRIR-based HAR datasets to evaluate cross-domain recognition via few-shot learning and proposed a novel prototypical recurrent convolutional network (PRCN).
A first evaluation compared the feature extraction capabilities of the prototype network using three models: a convolutional network (CNN) based on the literature, a recurrent network (LSTM) that takes advantage of the temporal sequence of the data, and a mixture of both using a variant of the long-term recurrent convolutional network (LRCN), which was the model proposed in this work. The results confirmed a relevant improvement for our model and showed high accuracy, above 90% in most classes, verifying that the proposed PRCN is a robust model for HAR recorded with LRIR sensors.
Secondly, a cross-domain evaluation was performed between the two datasets to test the proposed PRCN. Eleven similar activities, performed by one or two people in two different scenarios, were evaluated. In general, the results showed a strong capacity for transferability, mainly for activities performed by two people, as well as for more static activities, such as standing still or sitting. In addition, accuracy improved when the transfer occurred between sensors with the same relative position in the room.
In conclusion, the presented results are promising for the transferability of LRIR-based HAR models between different scenarios. Determining the diversity of measurements a dataset requires for the model to perform as well as possible is a key task in developing this field. As future work, it would be valuable to explore how the more mature development of WiFi-based HAR systems can be combined with and complement LRIR-based approaches, potentially creating hybrid systems that leverage the strengths of both technologies.