1. Introduction
Emotion is an adaptive multidimensional response triggered by meaningful events or stimuli that influence thoughts, feelings, behaviors, and interactions with others in daily life [1,2]. These responses are underpinned by changes in cognitive, physiological, motivational, motor expression, and subjective feeling systems that facilitate effective self-regulation mechanisms that help humans adapt to complex and ever-changing environments [3]. These changes can be expressed in several ways. In addition to the well-known methods of verbal communication, facial expressions, and body movements, they also include alterations in biological markers, such as heart rate (HR), electrical brain signals, and respiration (RSP) [4,5,6].
Closed cabins are typically used to create small environments in which humans can temporarily survive under harsh conditions, thereby facilitating the exploration and research of specific environments [7]. For example, missions in deep-sea submersibles require high cognitive processing from the crew in a high-stress environment. Consequently, changes in emotions can significantly affect mission performance, potentially leading to reduced work efficiency or even severe accidents [8]. Current research on the impact of missions on submersible crew members primarily focuses on the effects of the environment and the comfort of operational postures [9,10]. However, these studies have not considered the influence of emotions on mission performance, and research methods for emotion recognition in closed-cabin environments are scarce.
Studies have found that emotions significantly impact work performance and influence individuals' cognitive abilities and attention distribution. Positive emotional states generally enhance focus and creativity, facilitating problem-solving and task completion [11]. An individual's emotional state affects their decision-making process; negative emotions may lead to impulsive and irrational decisions [12]. Furthermore, emotional states influence work motivation and engagement [12,13]. Positive emotions help boost work motivation, increase enthusiasm, and enhance self-motivation, which, in turn, improves job performance [14].
Currently, various methods are available for emotion recognition, such as those based on computer vision and speech analysis [15]. Computer vision-based methods rely primarily on capturing external changes in facial expressions, eye movements, and body movements to recognize emotions [16,17,18]. However, these external expressions of emotion can be consciously suppressed or exaggerated, and are easily influenced by factors such as culture, environment, age, and gender. Speech-based methods typically utilize acoustic analysis because features such as pitch, tone, and clarity of speech signals are highly correlated with underlying emotions [19]. Inferring emotional states from the semantics of speech is a common approach. Nevertheless, the reliability of these methods in closed spaces such as submarines is questionable. The lighting conditions inside such cabins may not support clear facial expression capture [20], and speech signals during work may not contain sufficient emotional information.
Emotion recognition through physiological signals has also been reported using metrics such as the respiration rate, blood flow, galvanic skin response (GSR), electroencephalography (EEG), and electrocardiography (ECG) [21,22,23,24,25]. Physiological signals cannot be consciously controlled by individuals, which makes these methods more closely related to internal feelings and more accurate in reflecting emotional changes. However, most of these physiological measurement systems rely on direct contact with intrusive sensors and specific precision instruments. This contrasts with the practical conditions of missions in a closed cabin, where continuous data collection cannot be guaranteed and the awareness of being monitored may affect the results. Therefore, because some vital signs can be detected noninvasively, using noncontact systems for emotion recognition is advantageous.
In the field of emotion recognition using radar sensors, Healey [26] demonstrated that physiological information collected from a professional actor under eight different emotional states, combined with feature extraction algorithms, Fisher's linear discriminant analysis, and leave-one-out testing, could distinguish between anger and calm with high accuracy (90–100%). In addition, high and low arousal were significantly more distinguishable than high and low valence. In this context, arousal describes the intensity or activation level of moods and emotions, indicating whether they are relaxed/calm or excited/stimulated. Valence, on the other hand, refers to the positive or negative direction of moods and emotions. Zhao [27] used a millimeter-wave radar to extract and separate heartbeat and respiration signals, manually extracting 27 physiological signal-related features as inputs for two classification models. The average recognition accuracy for the four emotions was 87.0% for the person-dependent and 72.3% for the person-independent classification model. However, current radar-based emotion recognition methods require manual feature identification to achieve high accuracy and have not been optimized for closed-cabin environments, leaving a gap that our research addresses.
In this paper, we hypothesize that millimeter-wave radar sensors can classify individuals' emotions by measuring their respiratory signals in conjunction with machine learning algorithms within a closed-cabin environment. To address the issue of emotion recognition for personnel in closed cabins, this study proposes a method using a millimeter-wave radar to collect respiration signals, combined with a machine-learning framework for emotion classification and recognition. First, we used a continuous-wave radar to measure the respiration signals of the participants in a closed-cabin laboratory while they watched videos designed to elicit different emotions. After processing, we obtained respiration waveforms corresponding to different emotional states. Next, we used a sparse autoencoder (SAE) to extract features from the waveforms. These features were then input into two support vector machines (SVMs) for arousal and valence classification. Our method was compared with FaceReader™, a commercial emotion recognition device that uses audiovisual signals. The results indicate that our method is not only more suitable for deployment in closed cabins but also maintains accuracy and provides objective results. The overall research framework is illustrated in Figure 1.
The contributions of this paper can be summarized as follows. First, we used a millimeter-wave radar to capture human respiration signals in a closed-cabin environment, which is noncontact and interference-free. Second, considering the characteristics of feature extraction and selection in machine learning, SAEs were introduced to extract respiration features for emotion recognition. Third, the experimental results demonstrated that RSP signals can be effectively used for emotion recognition, eliminating the need for more complex physiological signals such as EEG [28]. Although machine learning involves significant computational effort during model training, the time required for testing is minimal. Finally, we preliminarily validated that the system can be applied to closed-cabin environments. Overall, this study provides a foundation for the development of noncontact emotion recognition systems that can operate effectively in challenging environments, ensuring safety and efficiency during critical operations.
The remainder of this paper is organized as follows. In Section 2, the working principle of the millimeter-wave radar is explained, the experimental process is described, and the signal preprocessing algorithms used before emotion classification are introduced; the implementation of the classifier and the feature extraction process are also described. Section 3 presents the experimental results and cross-validation methods. Section 4 provides a discussion of the results. Finally, Section 5 presents the conclusions.
2. Methodology
In this study, we employed a comprehensive methodology to investigate emotion recognition in a closed-cabin environment using noncontact millimeter-wave radar for respiration signal monitoring. Our experimental setup was specifically designed to simulate real-world closed-cabin conditions, ensuring an environment conducive to reliable data collection. A 77 GHz frequency-modulated continuous-wave (FMCW) radar sensor with a 2-transmitter and 4-receiver MIMO configuration was used to capture respiration signals at a sampling rate of 32 Hz. This system, installed within a shielded, soundproof room, allowed us to measure respiration rates accurately, unaffected by external environmental factors such as lighting and airflow.
The participants were seated in a controlled laboratory setting with ambient conditions maintained at 25 °C and lighting set to 6500 K × 600 lx to support a relaxed atmosphere. They watched a series of emotion-inducing video clips designed to elicit specific emotional responses (e.g., high-arousal positive, low-arousal negative) while the radar recorded their respiration signals. In total, 1200 recordings were collected across 20 participants and later divided into 20 s segments; the participants were pre-screened using criteria aligned with operational requirements in closed-cabin environments, such as manned submersibles.
Signal preprocessing included the application of bandpass filters to remove noise and isolate the respiration signal from mixed physiological data. We then applied the variational mode extraction (VME) algorithm to further refine the signal for analysis, allowing accurate identification of respiration patterns associated with different emotional states. These data were subsequently used as input for a sparse autoencoder (SAE) combined with a support vector machine (SVM) for emotion classification. This methodology not only provided a noninvasive approach to emotion recognition but also enabled a practical implementation framework suitable for closed-cabin applications where real-time, continuous monitoring is critical.
2.1. Methods for Collection and Processing of Respiration Signals
2.1.1. Working Principle of Millimeter-Wave Radar for Biosignal Detection
The principle of bio-signal detection with millimeter-wave radar relies on the Doppler effect, where the received signal reflects distance variations between the radar antenna and the participant's chest wall, driven by cardiac and respiratory movements. The radar continuously emits a digitally generated sinusoidal carrier wave, and for objects with periodic motion, the radar receiver component receives echoes that vary periodically over time. This characteristic makes it possible to detect vital signs, such as the periodic movement of the chest caused by breathing or heartbeats, from which physiological information such as heart and respiration rates can be extracted [29].
We selected a 77 GHz millimeter-wave radar sensor (Figure 2) for this study. The radar sensor is based on frequency-modulated continuous-wave (FMCW) signals for radar detection, featuring a 2-transmitter, 4-receiver, multiple-input multiple-output (MIMO) antenna configuration. This system achieves a detection range of 0.1 to 2 m and operates independently of environmental conditions, including temperature, humidity, noise, airflow, dust, and lighting. The high integration and wide bandwidth of this sensor chip enable flexible applications in a compact package. The sampling rate was set to 32 Hz to facilitate data processing.
2.1.2. Valence–Arousal Emotion Model
In this study, Russell's circumplex model, the valence–arousal (VA) two-dimensional model of emotion, was used to aid emotion recognition [30]. This theory suggests that a common, interconnected neurophysiological system is responsible for all emotional states. Emotions can be distributed in a two-dimensional space consisting of arousal and valence dimensions. Arousal is the vertical axis, measuring the intensity of emotional activation, whereas valence is the horizontal axis, describing how positive or negative an emotion is. Figure 3 illustrates how this theory categorizes emotions. The arousal and valence of an emotion can each be classified as high or low, yielding four categories. Thus, the emotion-recognition task can be divided into two binary classification tasks.
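To make the two binary tasks concrete, the following minimal Python sketch (ours, not from the original study) maps a rated (valence, arousal) pair to the two binary labels; the 1–9 rating scale is an assumption, with 5 as the threshold used later in Section 2.2.2.

```python
def va_labels(valence: float, arousal: float, threshold: float = 5.0) -> tuple[int, int]:
    """Map a (valence, arousal) rating pair to the two binary labels
    used by the classifiers: (high/low valence, high/low arousal).
    A 1-9 rating scale with midpoint 5 is assumed here."""
    return int(valence > threshold), int(arousal > threshold)

# The four quadrants of the VA plane:
#   (1, 1) high valence, high arousal  e.g., happiness
#   (0, 1) low  valence, high arousal  e.g., fear or anger
#   (1, 0) high valence, low  arousal  e.g., relaxation
#   (0, 0) low  valence, low  arousal  e.g., sadness
```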
2.1.3. Respiration Data Acquisition Based on Millimeter-Wave Radar
We designed the entire experimental system according to a closed-cabin environment and placed the millimeter-wave radar inside the backrest of a chair. A pre-test confirmed that this placement collects the signal as required while avoiding the interference associated with earlier radar deployments, which had to be positioned in front of the participant. Because the relative distance between the participant and the sensor changes very little, this deployment allows the signal processing method to largely disregard distance. Since the radar sensor can detect only one person at a time, only one participant was allowed to remain in the experimental area once the experiment began, ensuring accurate data collection.
The participants were screened according to the requirements of the diving mission in consultation with experts on manned deep-diving missions, and the experiment was conducted on 20 male participants, aged 21–40 years, at intervals of at least two days. No participant had a history of cardiovascular or psychiatric disorders or other medical contraindications. Additionally, none had consumed alcohol within the previous three days, and all reported adequate sleep. Participants were recruited through convenience sampling from an existing pool [31]. Convenience sampling was deemed appropriate for this preliminary study to gather initial data quickly and evaluate the feasibility of the proposed emotion recognition method. During the experiment, the RSP signals of each participant were measured while they watched video clips that elicited different emotions, as shown in Figure 4.
The experimental setup is illustrated in Figure 5. To test whether the methodology of this study could be applied to a closed-cabin environment, the experiment was conducted in the closed-cabin laboratory at the Department of Industrial Design, College of Electrical and Mechanical Engineering, Northwestern Polytechnical University. The laboratory is an electrically shielded and soundproof room with manually controlled lighting and temperature. The room temperature was set at 25 °C, and the lighting conditions were configured to 6500 K × 600 lx, which, according to a previous study [20], helps create a relaxed atmosphere. The experimental steps and details were explained to all participants prior to the experiment, and written informed consent was obtained from all participants, who were assured that the data obtained from this experiment would not be used for other purposes. The experiment was approved by Northwestern Polytechnical University (No. 202202053) and complied with the Declaration of Helsinki.
A total of 1200 sets of 1 min RSP signal sequences were collected from the 20 participants (60 sets per participant), including 300 sets for each emotion. The data were randomly split into training (80%) and testing (20%) sets.
To achieve rapid onsite emotion recognition, we evaluated RSP signal segments of different lengths, ranging from 5 to 60 s. After considering both effectiveness and efficiency, we chose 20 s segments. Given that our sampling rate was set at 32 Hz, we segmented each recording into 20 s windows, starting from the beginning of the record and advancing the window by one second at a time. Thus, each segment starts 1 s after the start of the previous one, giving a 19 s overlap between adjacent segments. For a 60 s trial video, this yields 41 segments. Each participant completed 60 trials, and each trial produced 41 segments, resulting in a total of 49,200 data samples.
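As an illustration of this segmentation scheme, the following NumPy sketch (our own; the study's code is not reproduced here) produces the 41 overlapping 20 s segments per 60 s trial described above.

```python
import numpy as np

FS = 32       # sampling rate (Hz)
WIN_S = 20    # segment length (s)
STEP_S = 1    # window step (s); gives a 19 s overlap

def segment_signal(rsp: np.ndarray, fs: int = FS,
                   win_s: int = WIN_S, step_s: int = STEP_S) -> np.ndarray:
    """Split a 1-D respiration recording into overlapping segments."""
    win, step = win_s * fs, step_s * fs
    n_seg = (len(rsp) - win) // step + 1
    return np.stack([rsp[i * step : i * step + win] for i in range(n_seg)])

# One 60 s trial at 32 Hz -> (60 - 20) / 1 + 1 = 41 segments of 640 samples
trial = np.zeros(60 * FS)
assert segment_signal(trial).shape == (41, 20 * FS)
```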
2.1.4. Emotion Stimulus Materials
We selected video clips capable of eliciting different emotions [32,33] as stimulus materials and measured the RSP signals of each participant while watching the videos. The comedic clips induced high-arousal and high-valence emotions (such as happiness), whereas the horrific videos elicited high-arousal and low-valence emotions (such as fear or anger).
To induce genuine emotions in the participants, we selected stimuli capable of eliciting four emotional states: high-valence low-arousal, low-valence high-arousal, high-valence high-arousal, and low-valence low-arousal. In total, 120 video clips were used, with 30 clips corresponding to each emotional state. After all 120 clips had been rated by at least 15 participants, we selected the 60 highest and most consistently rated videos to ensure high-quality experimental data, with 15 clips for each emotional state.
For each video, normalized arousal and valence scores were calculated by dividing the mean rating by the standard deviation. We then selected the 15 videos closest to the polar extremities in each quadrant of the normalized valence–arousal space, as illustrated in Figure 6. During the experiment, the videos were played in an order designed to enhance the emotional effect, which continuously and effectively stimulated the participants throughout the process. To avoid potential bias, the participants were not informed in advance of the type of video they would watch. Each test session was approximately 30 min.
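The selection procedure above can be sketched as follows. This is an illustrative reconstruction only: the function names are ours, and centering the normalized VA space on its per-axis means before picking quadrant extremes is our assumption.

```python
import numpy as np

def normalized_scores(ratings: np.ndarray) -> np.ndarray:
    """Per-clip normalized (valence, arousal) score: mean rating / std.
    `ratings` has shape (n_raters, 2), columns = (valence, arousal)."""
    return ratings.mean(axis=0) / ratings.std(axis=0)

def select_extreme_clips(scores: np.ndarray, k: int = 15) -> dict:
    """Pick the k clips farthest from the center in each quadrant of the
    normalized VA space. `scores` has shape (n_clips, 2)."""
    centered = scores - scores.mean(axis=0)  # assumed quadrant origin
    dist = np.linalg.norm(centered, axis=1)
    quadrants = {}
    for sv in (1, -1):          # valence sign of the quadrant
        for sa in (1, -1):      # arousal sign of the quadrant
            idx = np.where((np.sign(centered[:, 0]) == sv) &
                           (np.sign(centered[:, 1]) == sa))[0]
            quadrants[(sv, sa)] = idx[np.argsort(dist[idx])[::-1][:k]]
    return quadrants
```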
Since the primary goal of this work is to validate the use of a millimeter-wave radar system for emotion recognition of personnel in closed cabins, we employed FaceReader to simultaneously measure participants' emotions as a means of verification and comparison. FaceReader 8.1 is professional facial analysis software developed by Noldus that is capable of recognizing facial expressions [34]. FaceReader can identify six basic emotions—happiness, sadness, anger, surprise, fear, and disgust—in addition to the neutral emotional state. When connected to a camera, the software first detects the face and then determines the expression based on facial muscle movements, achieving an accuracy of up to 95% [34]. The analysis results provide the probabilities of the six basic emotions occurring at corresponding time intervals and generate valence and arousal scores. A higher valence score indicates more positive emotions, whereas a lower valence score indicates more negative emotions.
The software's effectiveness in recognizing facial expressions for Chinese faces is reasonably good, particularly for the emotions of "happiness", "surprise", and "neutral". As shown in Figure 5, the measurement results from FaceReader are displayed on a separate computer.
2.1.5. Respiration Signal Processing
Currently, there are many methods for processing bio-signals based on a millimeter-wave radar, such as the variational mode decomposition (VMD) algorithm [35]; the ensemble empirical mode decomposition (EEMD) algorithm, an improvement on the empirical mode decomposition (EMD) algorithm [36]; and the improved complete ensemble empirical mode decomposition with adaptive noise (ICEEMDAN) algorithm [37]. The variational mode extraction (VME) algorithm offers advantages over the VMD algorithm in terms of lower time complexity for extracting specific mode signals and the ability to determine the number of modes without prior specification [38].
As shown in Table 1, respiration and heartbeat signals exhibit different frequency characteristics. Heartbeat signal frequencies typically range from 0.8 to 2.0 Hz, while RSP signal frequencies range from 0.1 to 0.5 Hz. Therefore, a Butterworth bandpass filter was used to remove the signal components in different frequency ranges. By filtering the heartbeat signal components from the mixed vital sign signals, we can extract the RSP signals, thereby separating the respiration and heartbeat signals.
To select the appropriate signal processing method, the same 1 min data segments were sequentially processed using the EMD, EEMD, ICEEMDAN, VMD, and VME algorithms. The estimated average respiration rates obtained using these methods were compared with the actual values. Additionally, the data were divided into 20 s segments, processed using the different methods, and the average processing time required for each method was estimated. The final results are presented in Table 2.
From Table 2, it can be observed that the EEMD algorithm estimates the RSP rate with significantly lower accuracy than the ICEEMDAN, VMD, and VME algorithms. The accuracies of the ICEEMDAN, VMD, and VME algorithms were similar, with ICEEMDAN performing slightly better than VME, although the differences were minimal. The processing times for the EEMD and ICEEMDAN algorithms exceeded the data duration, rendering real-time computation infeasible. Although the processing time of the VMD algorithm was less than 20 s, it was close enough to this limit to risk buffer overflow in the actual system, posing a threat to system stability. Accordingly, the VME algorithm was selected for the final processing.
The specific steps of the final data processing, shown in Figure 7, are as follows:
1. Sample and parse the data acquired from the millimeter-wave radar;
2. Perform three different dimensional analyses on the intermediate frequency signal to obtain target information;
3. Conduct a fast Fourier transform on the received data in the range dimension to obtain positional information;
4. Eliminate stationary objects in the indoor environment to remove clutter;
5. Extract and analyze the phase to obtain vital-sign signals and use bandpass filters for preliminary signal separation;
6. Apply the VME algorithm (see Algorithm 1) to the extracted vital-sign signals to isolate the RSP signals.
Algorithm 1: VME.
Initialize $\hat{u}_d^{1}$, $\omega_d^{1}$, $\hat{\lambda}^{1}$, and set $n \leftarrow 0$
Repeat: $n \leftarrow n + 1$
(1) Update $\hat{u}_d$ for all $\omega \geq 0$:
$\hat{u}_d^{n+1}(\omega) = \dfrac{\hat{f}(\omega) + \alpha^{2}(\omega - \omega_d^{n})^{4}\,\hat{u}_d^{n}(\omega) + \hat{\lambda}^{n}(\omega)/2}{\left[1 + \alpha^{2}(\omega - \omega_d^{n})^{4}\right]\left[1 + 2\alpha(\omega - \omega_d^{n})^{2}\right]}$ (1)
(2) Update $\omega_d$:
$\omega_d^{n+1} = \dfrac{\int_{0}^{\infty} \omega\,\lvert \hat{u}_d^{n+1}(\omega) \rvert^{2}\, d\omega}{\int_{0}^{\infty} \lvert \hat{u}_d^{n+1}(\omega) \rvert^{2}\, d\omega}$ (2)
(3) Dual ascent for all $\omega \geq 0$:
$\hat{\lambda}^{n+1}(\omega) = \hat{\lambda}^{n}(\omega) + \tau\left[\hat{f}(\omega) - \hat{u}_d^{n+1}(\omega)\right]$ (3)
until convergence: $\lVert \hat{u}_d^{n+1} - \hat{u}_d^{n} \rVert_{2}^{2} \, / \, \lVert \hat{u}_d^{n} \rVert_{2}^{2} < \epsilon$
Figure 7.
Respiration signal processing.
The final RSP signals are presented in Figure 8. Using the VME algorithm allows for direct extraction of the corresponding signals, thereby lowering the computational load and improving the execution speed of the program.
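Steps 3–5 of the pipeline in Figure 7 can be sketched as follows. The mean-subtraction clutter removal and strongest-bin target selection shown here are common choices and are our assumptions, not necessarily the exact implementation used in this study.

```python
import numpy as np

def chest_displacement(iq: np.ndarray, wavelength: float = 3.9e-3) -> np.ndarray:
    """Sketch of steps 3-5 in Figure 7: range FFT, static-clutter removal,
    and phase extraction from the target range bin.

    iq: complex IF samples, shape (n_chirps, n_samples_per_chirp).
    wavelength: ~3.9 mm carrier wavelength at 77 GHz.
    """
    rng = np.fft.fft(iq, axis=1)                  # range FFT per chirp
    rng -= rng.mean(axis=0, keepdims=True)        # remove static clutter
    target = np.argmax(np.abs(rng).mean(axis=0))  # strongest remaining bin
    phase = np.unwrap(np.angle(rng[:, target]))   # slow-time phase history
    return phase * wavelength / (4 * np.pi)       # phase -> displacement (m)
```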
After the experiment, the respiration data of the 20 participants in the four emotional states were collected, counted, and tabulated, as shown in Table 3. No further statistical analysis of the data features was performed because we used a machine-learning approach to obtain useful features.
2.2. Emotion Recognition Method Based on Sparse Auto-Encoder and Support Vector Machine
We investigated an effective method for mapping users' physiological signals to their emotional states. As shown in Figure 9, we used an SAE to transform raw signals into extracted features. The sparse representations learned by the SAE were then used as inputs to train the SVM as a classifier for recognizing emotional states.
First, an SAE was trained on the raw input $x^{(k)}$ to learn its primary features $h_1^{(k)}$. These primary features were then used as the input for another SAE to learn secondary features $h_2^{(k)}$. These secondary features were then used as inputs for an SVM classifier, which was trained to map them to the data labels. Finally, all three layers were combined to form a stacked SAE with two hidden layers and an SVM classifier layer that can classify the collected RSP signal dataset as required [39].
This combined approach of using an SAE and an SVM leverages the strengths of both methods to enhance the generalization ability and performance of the model, particularly when the data have high dimensionality, a small sample size, or an unclear feature representation. Our hypothesis is that by automatically extracting features through deep learning, we can produce a more powerful and accurate emotion recognition model.
2.2.1. Feature Selection of Respiration Signals Based on Sparse Auto-Encoder
An SAE is a deep-learning method used to automatically learn features from unlabeled data. As a fundamental element of an SAE, an autoencoder (AE) transforms input data into hidden representations or extracts features through an encoder (
Figure 10). The AE learns to map these features back to the input space using a decoder. Despite the presence of minor reconstruction errors in the training examples, the goal was to match the reconstructed input data to the original input data as closely as possible. The entire structure uses the features extracted by one AE as the input for another AE, thereby hierarchically determining general representations from the input data [
40].
Therefore, the SAE pre-trains the AE and uses it as its first hidden layer. The same technique can be applied to train subsequent hidden layers. Once pre-training is complete, the first layer remains unchanged while the other hidden layers are trained. Generally, the encoder maps the input example $x$ to the hidden representation $h$ as follows:

$h = f(Wx + b)$ (4)

where $W$ is a weight matrix, $b$ is a bias vector, and $f$ is a nonlinear activation function, typically the sigmoid function $f(z) = 1/(1 + e^{-z})$. The decoder maps the hidden representation back to the reconstructed input $\tilde{x}$:

$\tilde{x} = f(W'h + b')$ (5)

where $W'$ is a weight matrix, $b'$ is a bias vector, and $f$ is the same function used in the encoder. To minimize the difference between the original input $x$ and the reconstructed input $\tilde{x}$, we consider reducing the reconstruction gap $\|x - \tilde{x}\|^2$. Given a training dataset $\{x^{(k)}\}_{k=1}^{D}$ with $D$ examples, we adjust the weight matrices $W$ and $W'$ as well as the bias vectors $b$ and $b'$ through backpropagation. In addition, we impose a sparsity constraint on the expected activation of the hidden units by adding a penalty term. This leads to the following optimization problem:

$\min_{W, W', b, b'} \ \dfrac{1}{D} \sum_{k=1}^{D} \left\| x^{(k)} - \tilde{x}^{(k)} \right\|^2 + \beta \sum_{j} \mathrm{KL}\left(\rho \,\|\, \hat{\rho}_j\right)$ (6)

The sparse penalty term is

$\mathrm{KL}\left(\rho \,\|\, \hat{\rho}_j\right) = \rho \log \dfrac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \dfrac{1 - \rho}{1 - \hat{\rho}_j}$ (7)

where $\hat{\rho}_j$ is the average activation of hidden unit $j$, $\rho$ is the sparsity level, and $\beta$ is the weight of the sparsity penalty term. By minimizing the cost function using the L-BFGS (limited-memory BFGS) algorithm [41], we obtain the optimal $W$ and $b$ to determine the internal features $h$.
We trained the SAE to extract features from the RSP signals and emotional states present in the dataset. Through preliminary experiments with one- and two-layer SAEs and by testing with 20 neurons in each hidden layer, we selected the topology of the AE. After pretraining each layer of the SAE, we fine-tuned the deep learning framework to obtain the optimal SAE configuration with two hidden layers. The first and second hidden layers contained 200 and 50 neurons, respectively.
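For concreteness, the following PyTorch sketch implements one SAE layer with the KL sparsity penalty of Equation (7). It is our illustration, not the authors' code: the sparsity level ρ and weight β are placeholders, the 640-dimensional input assumes the 20 s, 32 Hz segments are fed in directly, and the study itself minimized the cost with L-BFGS rather than the optimizer-agnostic loss shown here.

```python
import torch
import torch.nn as nn

class SparseAE(nn.Module):
    """One autoencoder layer with a KL-divergence sparsity penalty
    (a sketch of Equations (4)-(7), not the authors' exact code)."""
    def __init__(self, n_in: int, n_hidden: int, rho: float = 0.05, beta: float = 3.0):
        super().__init__()
        self.enc = nn.Linear(n_in, n_hidden)
        self.dec = nn.Linear(n_hidden, n_in)
        self.rho, self.beta = rho, beta  # assumed sparsity level / penalty weight

    def forward(self, x):
        h = torch.sigmoid(self.enc(x))      # h = f(Wx + b)
        x_hat = torch.sigmoid(self.dec(h))  # x~ = f(W'h + b')
        return h, x_hat

    def loss(self, x, x_hat, h):
        recon = ((x - x_hat) ** 2).sum(dim=1).mean()
        rho_hat = h.mean(dim=0).clamp(1e-6, 1 - 1e-6)  # average activation per unit
        kl = (self.rho * torch.log(self.rho / rho_hat)
              + (1 - self.rho) * torch.log((1 - self.rho) / (1 - rho_hat))).sum()
        return recon + self.beta * kl

# Stacked topology used in this study: 640-sample segments -> 200 -> 50
layer1 = SparseAE(20 * 32, 200)  # first hidden layer, 200 neurons
layer2 = SparseAE(200, 50)       # second hidden layer, 50 neurons
```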
To demonstrate the reconstruction capability of the SAE, we input the same RSP signal (shown in blue) from Figure 8 into the SAE. The red line in Figure 11 represents the output of the SAE (the reconstructed input).
2.2.2. Emotion Classification Method Based on Support Vector Machine
We employed an SVM as the classifier to balance the performance and power consumption of the proposed method. SVMs are nonlinear models that have been widely used in various fields. Because SVMs use support vectors to determine the optimal hyperplane that maximizes the margin, the complexity of finding the hyperplane function is reduced and the generalization ability of the classifier is enhanced, especially with smaller datasets. SVMs have been proven to be effective models for detecting comfort levels and emotions [42,43].
The SVM calculation steps are as follows:

Input: the training dataset $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is a feature vector extracted by the SAE and $y_i \in \{-1, +1\}$ is its class label.

Given the nonlinearity of the RSP dataset samples, we introduced the radial basis function as the kernel function $K$. This kernel function transforms nonlinear data samples in the dataset into linearly separable data samples as follows:

$K(x_i, x_j) = \exp\left(-\gamma \left\| x_i - x_j \right\|^2\right)$ (8)

The results of the linear SVM are determined by the parameters $\omega$ and $b$ and the sign function $\mathrm{sgn}$, where $\omega$ is the weight vector and $b$ is the bias. This process is conducted as follows.

An optimal solution problem for the objective function is constructed:

$\min_{\omega, b} \ \dfrac{1}{2}\left\| \omega \right\|^2 \quad \text{s.t.} \quad y_i\left(\omega^{\mathrm{T}} x_i + b\right) \geq 1, \quad i = 1, \ldots, N$ (9)

The function is optimized by adding the Lagrange multiplier parameters $\alpha_i \geq 0$, taking the form of:

$L(\omega, b, \alpha) = \dfrac{1}{2}\left\| \omega \right\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i\left(\omega^{\mathrm{T}} x_i + b\right) - 1 \right]$ (10)

The formula is simplified to obtain the SVM classifier:

$f(x) = \mathrm{sgn}\left( \sum_{i=1}^{N} \alpha_i y_i\, x_i^{\mathrm{T}} x + b \right)$ (11)

The final classifier is obtained by combining Equations (8) and (11):

$f(x) = \mathrm{sgn}\left( \sum_{i=1}^{N} \alpha_i y_i\, K(x_i, x) + b \right)$ (12)

Because an SVM is a binary classifier, to validate our method and compare it with previous results, we used two SVMs to classify arousal and valence into two binary classes based on the assigned values. We chose a threshold value of 5, making the task a binary classification problem; that is, high/low valence and high/low arousal. In this study, to achieve the best classification results, we set the initial value of C (the regularization parameter) to 1 to avoid overfitting. The Gamma value was set to .
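A minimal scikit-learn sketch of this classification stage is given below; since the exact gamma value is not reproduced above, the library default "scale" setting is used as a stand-in, and the function name is ours.

```python
from sklearn.svm import SVC

def train_va_classifiers(X, y_valence, y_arousal):
    """Train the two binary RBF-kernel SVMs (valence and arousal) with
    C = 1, following Section 2.2.2. X holds the 50-dimensional secondary
    SAE features; labels are 1 for ratings above the threshold of 5."""
    svm_v = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y_valence)
    svm_a = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y_arousal)
    return svm_v, svm_a

# Prediction on a new 20 s segment's features:
# valence_label = svm_v.predict(features)   # 1 = high valence, 0/-1 = low
# arousal_label = svm_a.predict(features)   # 1 = high arousal, 0/-1 = low
```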
4. Discussion
In this study, we propose an effective noncontact emotion recognition system based on RSP signals. The system is intended for deployment in specialized work environments, such as manned submersibles, where emotional changes can lead to decreased performance or even accidents. The emotion recognition accuracy achieved in this study was 68.21%, and for all participants, our method achieved an acceptable accuracy of over 60%. The goal of our study was to investigate the feasibility of using millimeter-wave radar and respiration signals for emotion recognition, demonstrating that emotional states can be detected effectively in a closed-cabin environment by combining the two. Our results indicate that the model integrating millimeter-wave radar and respiration signals shows promising accuracy and feasibility in emotion recognition, supporting our initial hypothesis that this combined approach can enhance the effectiveness of emotion recognition.
Emotions create a motivational tendency and increase the likelihood of engaging in a range of maladaptive behaviors [45]. By identifying a method that minimally affects the normal operations of personnel and can detect emotional states in closed-cabin environments, specific emotions may be prevented from negatively affecting work tasks. Potential application scenarios for the proposed method include various confined spaces, such as aerospace, automotive, and maritime environments. In these settings, emotions can significantly influence task completion and the safety of both personnel and systems.
We also note that the literature [46], similar to our approach, employs the valence–arousal emotion model. However, that study utilizes a VR scene to induce the corresponding emotions and categorizes emotional valence and arousal by collecting EEG signals and using a Gradient Boosted Decision Tree. That research includes more detailed emotion classification labels and demonstrates that EEG can effectively reflect neural activation patterns associated with different emotions. In contrast, our study benefits from the radar sensor we used, which imposes less burden and fewer constraints on individuals and functions effectively in various environmental conditions (e.g., changing light, crowded spaces). This adaptability allows for a broader range of application scenarios, whereas EEG devices require a complex deployment and setup process, along with significant perceptibility during data acquisition, potentially affecting the final results.
4.1. Signal Selection and Emotion Model
The literature [25] systematically reviewed the physiological signals employed in current research and concluded that internal physiological signals, such as EEG and ECG, are involuntarily activated and, therefore, more reliable for emotion recognition than body signals such as facial expressions and posture.
Research has shown that when individuals experience emotions such as anger, fear, or happiness, the frequency and amplitude of their breathing increase involuntarily owing to the heightened energy demand of cells. Conversely, when individuals are in a state of sadness, relaxation, or calmness, their energy demand decreases, resulting in slower and shallower breathing. When individuals are startled, there may be a brief breathing interruption resulting in no RSP signal [47]. The RSP parameters also include vital capacity, respiration rhythm, and tidal volume.
Our study indicates that high-valence emotions exhibit more stable and uniform RSP values than low-valence emotions. In terms of arousal differences, high-arousal states were characterized by a higher breathing frequency, whereas low-arousal states exhibited a lower breathing frequency. Low-arousal emotions are characterized by quick inhalations and slow exhalations with a prolonged ending [48].
In this study, we used FaceReader as a comparative emotion recognition method. FaceReader captures facial images using a camera and detects emotions by analyzing the relative position changes of facial landmarks. This recognition approach differs from the measurement approach of the millimeter-wave radar and is more direct, thereby providing a robust validation of the effectiveness of the proposed method. FaceReader has shown high accuracy in some of the literature [49]. However, in our study, we observed discrepancies between FaceReader's output and the emotions the videos were expected to elicit.
We believe that this discrepancy is due to differences in cultural backgrounds and emotional expression styles, which can lead to recognition errors when FaceReader is used with East Asian individuals. This issue has also been discussed in the literature [50].
To classify emotions, we referred primarily to Russell’s two-dimensional emotion model using valence and arousal as classification parameters. There are three reasons for using this approach:
Consistency with FaceReader: We aimed for our method to be as consistent as possible with FaceReader’s classification method, making the comparison between the two more relevant and informative.
Simplified Modeling: Using valence and arousal simplifies the modeling process, as these dimensions are easily comparable.
Enhanced Research Depth: Our study builds on existing literature and further refines classification by providing a more detailed differentiation of emotions.
4.2. Machine-Learning Frameworks
We used an SAE to transform raw signals into extracted features, and the sparse representations learned by the SAE were used as inputs to train SVMs as classifiers to predict emotion states. The combination of the SAE and SVMs leveraged the feature extraction capabilities of the SAE, reduced the difficulty of handling high-dimensional data, and enhanced the overall model performance. The advantages of combining SAE and SVM are as follows.
Effective feature extraction: The SAE, an unsupervised learning method, automatically learns useful feature representations from raw data. This is particularly beneficial for emotion recognition tasks because physiological signals related to emotions (e.g., RSP signals) often contain redundant and irrelevant information. Using the SAE, we can automatically extract features closely related to emotional states, thereby reducing the complexity of subsequent classification tasks.
Dimensionality reduction: The SAE maps the original high-dimensional data to a lower-dimensional sparse representation through a multilayer encoding and decoding process. This helps mitigate the "curse of dimensionality" that SVMs face when dealing with high-dimensional data, which can reduce their generalization capability. With the low-dimensional features extracted by the SAE, the SVM can classify more effectively.
Robust feature learning: The SAE can learn robust representations, even from limited datasets. This implies that even with a small amount of training data, the features extracted by the SAE can help the SVM achieve better performance.
Nonlinear mapping: The SAE employs nonlinear activation functions to map raw data to the feature space, enabling the model to capture nonlinear relationships within the data. This is crucial for emotion recognition tasks because emotions often have complex nonlinear relationships with physiological signals.
By combining the SAE and SVM, we achieved a more efficient and effective approach to emotion recognition. The SAE reduces the complexity and dimensionality of the input data, making it easier for the SVM to handle and classify the data, thus improving the overall performance of the model in recognizing emotions based on physiological signals.
4.3. Comparison with FaceReader Results
The results show that the proposed system achieves acceptable accuracy. The SAE reduced the complexity of data processing by lowering the data dimensionality, and it automatically selects appropriate RSP signal features for classification through training, thereby enhancing the overall system performance.
Under normal lighting conditions, the accuracy of our method is comparable to that of the FaceReader. However, under poor lighting conditions, the accuracy of the FaceReader decreased significantly owing to its reliance on clear facial images, which could not be obtained. In contrast, our method maintained consistent accuracy regardless of lighting conditions, as visual data are unnecessary.
According to the literature [44], using manually selected features for prediction achieves a higher accuracy than directly using segmented signal waveforms. Compared with our method, the manually selected features resulted in a higher accuracy, validating the findings in the literature. Additionally, the results of our method are close to those obtained using manual feature selection. The manual selection of signal features requires experts with extensive experience and knowledge, which increases the deployment complexity and cost of the entire system. In contrast, our method demonstrates the power of machine learning in this field, with broad potential for application. The automation and efficiency of machine-learning models make them suitable for emotion recognition tasks in various environments, including those with challenging conditions, such as poor lighting.
4.4. Availability in Closed Cabins
To comprehensively evaluate the performance of our study in closed-cabin environments, we consulted human factor experts and compared our method with three existing emotion recognition methods across three dimensions: invasiveness, system complexity, and deployment cost. The comparison results are presented in Figure 15, which highlights the strengths and weaknesses of each approach. Our noncontact method, with its balance of low invasiveness, moderate complexity, and cost-effectiveness, shows great promise for practical deployment in closed-cabin environments.
4.5. Limitation
The limitations of this study are as follows:
Sample bias: The sample was not sufficiently diverse. Because closed-cabin personnel are mostly young males, there was a selection bias in the choice of participants. In addition, the experimental environment was a simulated closed space, which may affect the generalizability of the proposed method.
Comparison with FaceReader: FaceReader was used as the control method. However, previous literature indicates that FaceReader's accuracy is somewhat limited for East Asian populations [50,51,52], which may affect the strength of the evidence provided in this study.
Laboratory setting: The study was conducted in a laboratory environment rather than in a real closed cabin, such as a manned submersible. Real-life electromagnetic environments are more complex and may affect the effectiveness of the proposed method.
Single modality: For convenience of deployment, this study used only respiration signals. According to other studies, employing multimodal signals, such as EEG and skin conductance, can significantly enhance the accuracy of emotion prediction [25,53].
Emotion stimulus material selection: In this study, we used video clips as the emotion stimulus material; however, the literature [46] offers a more effective solution by using VR scenes to evoke the corresponding emotions. This method provides a more controlled and immersive environment for emotion elicitation and assessment, leading to applications in various domains such as mental health, marketing, and entertainment.
We plan to address these issues in future studies. Previous studies have shown that the ability to maintain emotional stability has a significant impact on workplace performance. Therefore, further research is required to understand the patterns and distribution of emotional changes among personnel working in closed-cabin environments.
Future research directions include the following.
Diverse sample population: Include a broader range of participants, to improve the generalizability and robustness of the findings.
Enhanced validation methods: Use more reliable algorithms that account for cultural and demographic variations, to strengthen the validation of the proposed method.
Real-world testing: Conduct experiments in actual closed environments such as manned submersibles, to test the robustness of the method under real-world conditions.
Multimodal signal integration: Incorporate additional physiological signals such as EEG and ECG signals to improve the accuracy and reliability of emotion recognition.
To address these limitations, we aim to refine our method and enhance its applicability to various closed-cabin environments. Further investigation into the emotional stability of closed-cabin personnel during work will also contribute to improving their performance and safety.
5. Conclusions
This study explored the potential of RSP signals for recognizing emotions in closed cabins. The widely used dimensional emotion theory, the valence–arousal theory, was employed to classify the four types of emotions. SAEs were used to extract and select emotion-related features. SVMs were then used to classify high/low arousal and valence. We validated the method from several perspectives. The test results demonstrated that the proposed method achieved acceptable performance. In summary, millimeter-wave radar sensors and respiration signals exhibit significant potential for recognizing emotions in closed cabins.
We believe that there are two main directions for improvement to achieve better recognition results: introducing other physiological signals that can be measured noninvasively, such as heart signals or skin conductance, and adopting multimodal physiological data to enhance performance. In addition, choosing more powerful classifiers, such as random forests, can improve the recognition accuracy. With these improvements, we will be able to monitor the emotions of personnel in closed cabins without interrupting their ongoing activities.
The results of this study not only offer new insights for the development of emotion recognition technology but also contribute to enhancing the safety and work efficiency of individuals in closed environments, demonstrating significant application value.