1. Introduction
Assessing and tracking patients’ pain levels over time is critical in determining the effectiveness of medical treatments [1,2] and preventing the development of chronic pain [2]. The current standard methods of pain assessment, which rely on patient self-reporting or medical staff observations, are highly subjective and imprecise, making it difficult to accurately monitor pain levels in patients, especially those with communication impairments [1,2,3,4]. Furthermore, observational pain assessment requires experienced observers [5], whose scarcity can hinder the continuity of pain assessment monitoring [6]. The need for accurate pain-level assessment has led to a growing demand for automated pain monitoring systems that can be used both in hospitals and at home for elderly and injured patients [5,6]. Such a system would provide a more objective and reliable method of pain assessment, helping to ensure that patients receive appropriate pain management and avoid unnecessary suffering.
Developing such systems is no easy task due to a variety of challenges, including variations in head poses, illumination [7], and occlusion [8]. Despite significant progress in recent years, many deep-learning-based pain assessment models rely on face detection algorithms as a critical data processing step, which may not work accurately for full left/right profile views. Our previous work, the Facial Expressions-based Pain Assessment System (FEAPAS) [9], employs the Multi-Task Cascaded Neural Network (MTCNN) algorithm to detect a face in the input image and then uses the OpenCV library to segment the detected face into two parts in order to extract the upper part of the face (eyes and eyebrows). These parts are then fed to two concurrent pre-trained CNNs to mimic the Prkachin and Solomon Pain Intensity (PSPI) measurement [10], thereby assigning a higher weight to the upper part of the face in the pain evaluation process. FEAPAS achieved an accuracy of 99.10% on 10-fold cross-validation, thus outperforming the state-of-the-art model on the UNBC-McMaster shoulder pain expression archive dataset [11]. It also scored 90.56% on unseen subject data with an average response time of 6.49 s. However, it is restricted by the limitations of the MTCNN algorithm, which cannot detect and extract the upper part of the face from full left/right profile images (note that, as in other referenced pain-assessment work, left/right profile frames are removed from the input images). Consequently, the absence of these input frames hinders the application of the FEAPAS model, rendering it ineffective for left/right profile facial images.
To overcome these challenges and develop a more robust pain assessment model, we turn to a different approach: the Sparse Autoencoders for Facial Expressions-based Pain Assessment (SAFEPA). Unlike traditional models, SAFEPA uses a Sparse Autoencoder (SAE) to reconstruct the upper part of the face from the input image, allowing it to handle variations in head pose effectively. By leveraging this approach, we achieve high recognition performance and surpass the accuracy of state-of-the-art models, even when faced with challenging full left/right profile views.
Autoencoders (AE) have gained popularity as unsupervised neural networks that can reconstruct data while minimizing the error between input and output [12], with an encoder mapping the input to a hidden representation through a nonlinear function and a decoder reconstructing the input [13]. AEs have been shown to perform well in facial reconstruction, resolving issues of partial occlusion [14]. In this paper, we present a novel and advanced model, named Sparse Autoencoders for Facial Expressions-based Pain Assessment (SAFEPA), which builds upon our previous FEAPAS described in [9].
Our contribution lies in:
Developing a new pain assessment system, SAFEPA, that accounts for variations in head poses by utilizing a custom SAE. The SAE can reconstruct the upper part of the face in any pose without requiring face detection or face-splitting steps, overcoming limitations of face detection algorithms and reducing preprocessing steps, thereby improving the accuracy of pain assessment. The SAFEPA model achieved 98.93% on 10-fold cross-validation and 84.06% on unseen subject data on the UNBC-McMaster dataset (Section 4.1). To the best of our knowledge, sparse autoencoders have not been used in this manner in relevant publications.
Extending the capabilities of the SAFEPA system to recognize seven facial expressions on data with different poses, including full left and right profiles. Our results demonstrate SAFEPA’s high accuracy in facial expression recognition, achieving 94.29% on 10-fold cross-validation on the Karolinska Directed Emotional Faces (KDEF) dataset [15] and outperforming state-of-the-art models, highlighting SAFEPA’s efficient performance with various head poses including full left/right profiles (Section 4.2).
Investigating the performance of the SAFEPA system in real-world situations on a new pain assessment dataset, measuring its accuracy and average processing time. We demonstrate that the SAFEPA system generalizes well, as evidenced by its performance on the unseen BioVid Heat Pain dataset [16], processing each video in 17.82 s, making SAFEPA suitable for real-world situations (Section 4.3).
The rest of this article is organized as follows. In Section 2, related works are discussed, while Section 3 provides a detailed description of the proposed framework, including the SAE structure and the entire system. Section 4 presents the results of three main experiments and compares them with existing methods, in addition to an ablation study to show the positive impact of our autoencoder approach. Finally, Section 5 provides conclusions and future work.
2. Related Works
Recent studies in Facial Expression Recognition (FER) and pain assessment have introduced multiple cutting-edge models that achieve acceptable performance. These models have tackled various challenges related to performance improvement by leveraging state-of-the-art Convolutional Neural Networks (CNN) [17,18] or by adopting Long Short-Term Memory (LSTM) [19]. Moreover, researchers have explored the impact of various techniques such as augmentation [20] and batch normalization [21]. Recent studies have turned to fusion models of two [5,9,22] or more CNNs [23]. Furthermore, a recent study implemented IoT in an FER system [24].
Despite this progress, some researchers continue to use handcrafted feature descriptor algorithms such as the Histogram of Oriented Gradients (HOG) [25,26] and the Local Binary Pattern (LBP) [26,27] in their work due to their simplicity and efficiency.
While most studies focus on improving recognition accuracy, challenges such as pose, occlusion, and illumination demand further investigation and more efficient solutions. In order to compare SAFEPA with the state of the art, we have selected four recent studies [17,24,25,26] that use CNN, HOG, and LBP for FER on the same datasets as our study. Additionally, we included three more recent studies that focus on facial expression-based pain assessment [3,22,23].
In [17], Bentoumi and his colleagues developed an advanced facial expression classifier using two of the most common deep learning models, VGG16 and ResNet50, to extract features. Their model employs a multilayer perceptron (MLP) classifier on the features extracted by VGG16, ResNet50, or both (Ensemble). The team conducted testing on the Extended Cohn-Kanade (CK+), JAFFE, and the 980-image frontal-view subset of the KDEF dataset.
In [24], Barra and his colleagues developed a facial expression recognition system based on a social IoT solution. To detect the face region, they applied the tree-structured model. To extract the features, they applied several different texture descriptors such as LBP and HOG. To extract more discriminating and distinctive features, they used a sparse representation technique, followed by the Spatial Pyramid Mapping technique. Finally, they used SVM for the classification task. The performance of the proposed system is tested on the KDEF and GENKI4K datasets.
In [25], Eng and his colleagues proposed a model for FER. Their model utilizes HOG to extract features from multiple cells in the image after detecting and cropping the face, followed by a support vector machine (SVM) for classification. This approach was applied to a subset of 980 images from the KDEF dataset, focusing on frontal-face images and excluding other poses. Additionally, they utilized the Japanese Female Facial Expression (JAFFE) database [28].
Yaddaden in his work [26] combined LBP and HOG to extract features from the image and then used SVM to classify the emotions. The model has been evaluated on JAFFE, KDEF, and the Radboud Faces Database (RaFD).
In [3], the authors proposed a computer-aided pain assessment system based on facial expressions with satisfactory accuracy and reduced computational work. The proposed model consists of two shallow neural networks: a spatial appearance network (SANET) to extract spatial appearance-based features from the RGB image, and a shape descriptors network (SDNET) to extract spatial facial appearance-related features from the landmarks input. The two outputs are then concatenated and sent to a joint feature learning stage that learns from the entire fusion network.
In [22], Bargshady and his colleagues developed an ensemble deep learning model (EDLM) for pain intensity detection. Their model consists of two phases: early fusion and late fusion. In the early fusion phase, features are extracted through a combination of a pre-trained CNN and linear Principal Component Analysis (PCA). In the late fusion phase, an ensemble of a three-stream CNN + RNN hybrid deep learning network is used for classification. They used two databases in their study: the MIntPAIN and UNBC-McMaster databases.
In [23], the authors proposed a pain severity assessment model designed to work in a real environment. For that purpose, the authors collected their own data in addition to a subset of the UNBC-McMaster dataset. The model inputs were obtained by using MTCNN to localize the face in an RGB image and a NumPy matrix slicing operation to crop it. To extract spatial appearance information from the raw RGB input images, they used the CNN-TL network. To extract local features from entropy images, they used ETNet. To learn jointly from RGB and entropy inputs, they used concurrent DSCNN networks. The classification decision is taken based on the fusion of the three outputs.
In [9], we developed a concurrent model consisting of two Convolutional Neural Network (CNN) branches, called the Facial Expression-based Automatic Pain Assessment System (FEAPAS). The first branch processes the detected face, while the second branch focuses specifically on the upper part of the face to improve attention and overall performance. All the models described above rely on face detection algorithms such as Haar or MTCNN as an essential data preprocessing step. These models are unable to process frames that contain faces captured in full left or right profiles, leaving room for the further improvement presented in this paper.
In contrast to the previous work, by developing our SAE and training it on the KDEF dataset that includes images with full left or right profile poses, the SAFEPA improves the facial expression capabilities of FEAPAS. The developed SAE can reconstruct the upper part of the face in any pose, without the need for face detection algorithms. After reconstructing the upper part of the face with the SAE, the resulting image is fed into InceptionV3, while the original image is processed in parallel with another InceptionV3. The outputs of the two InceptionV3 networks are then concatenated and sent to a fully connected layer for classification. The use of the SAE eliminates the need to use face detection algorithms and reduces the data preprocessing steps.
3. Materials and Methods
This section provides a comprehensive overview of our approach, including the datasets we used, the classifier we developed using SAE and concurrent CNNs, the SAFEPA system we built, and the evaluation processes we undertook.
3.1. Datasets
To ensure robustness, we utilized two widely recognized facial expression datasets, namely the Japanese Female Facial Expression (JAFFE) dataset [28] and the Karolinska Directed Emotional Faces (KDEF) dataset [15], in addition to the UNBC-McMaster shoulder pain expression archive dataset [11] and the BioVid Heat Pain dataset [16], which are widely used for pain assessment. These datasets and the purposes for which they are used are described in the following subsections.
Table 1 provides a summary of the various datasets we used in this study along with their characteristics.
3.1.1. JAFFE Dataset
This dataset contains 213 grayscale images that capture 7 distinct facial expressions: Afraid, Angry, Disgusted, Happy, Neutral, Sad, and Surprised. The images were captured from 10 female participants from a frontal view only. As one of the most widely used and trusted datasets in the field of facial expression recognition, the JAFFE dataset was selected to test the performance and capability of extending our previous pain assessment monitoring system, FEAPAS, to facial expression recognition and to compare with state-of-the-art studies [17,25]. The JAFFE dataset cannot be used with our new model SAFEPA, as it comprises non-color images that are not compatible with our system’s requirements.
3.1.2. KDEF Dataset
KDEF contains 4900 colored images from 70 participants, evenly split between 35 females and 35 males. The dataset includes 7 distinct emotional facial expressions (Afraid, Angry, Disgusted, Happy, Neutral, Sad, and Surprised), and each expression is captured from 5 different angles. In Figure 1, we present samples from the KDEF dataset showing different facial expressions captured from a variety of angles: (a) full left profile, (b) full right profile, (c) half left profile, (d) half right profile, and (e) straight-on. Although many studies in FER focus solely on a subset of the KDEF dataset, namely the 980 images that represent the seven emotions from a frontal view, we used the 3920 images that represent the seven emotions from full and half left and right poses (excluding the frontal view) as well as the full dataset. To test the accuracy and effectiveness of our model, we applied SAFEPA to these KDEF subsets as well as to the full dataset.
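As an aside, such pose subsets can be selected directly from the file names. The sketch below assumes the standard KDEF naming convention, in which the trailing letters before the extension encode the camera angle (FL, FR, HL, HR, S); the directory layout is hypothetical.

# Illustrative selection of KDEF pose subsets by filename suffix. Assumes the standard
# KDEF naming (e.g., "AF01ANFL.JPG"), where the trailing letters give the camera angle:
# FL/FR = full left/right, HL/HR = half left/right, S = straight-on.
from pathlib import Path

def kdef_subset(root, angles):
    files = sorted(Path(root).rglob("*.JPG"))                      # hypothetical dataset root
    return [f for f in files if any(f.stem.upper().endswith(a) for a in angles)]

frontal_980 = kdef_subset("KDEF", {"S"})                           # 980-image frontal subset
profiles_3920 = kdef_subset("KDEF", {"FL", "FR", "HL", "HR"})      # 3920-image non-frontal subset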
3.1.3. The UNBC-McMaster Dataset
The UNBC-McMaster shoulder pain expression archive dataset contains 200 sequences of 48,398 colored frames, each measuring 320 × 240 pixels. This dataset captures the facial expressions of 25 adult participants, 12 male and 13 female, who suffer from shoulder pain.
Figure 2 shows samples from the UNBC-McMaster dataset with different levels of pain. This dataset only contains images captured from a frontal perspective or in partial view. The UNBC-McMaster shoulder pain expression dataset is highly imbalanced: 82.71% of the data represents the “No Pain” class (40,029 images), while only 17.29% covers the varying levels of pain (8369 images in total). To address the biased classification [5,22] and ensure maximum accuracy, a subset of 24 participants’ data was randomly selected, resulting in a total of 6000 frames divided into four classes: No Pain, Low Pain, Moderate Pain, and Severe Pain. Each class contains 1500 images, with the Severe Pain class representing images from level 3 and greater. Testing the model with unseen subjects is critical for avoiding overfitting and ensuring generalizability; in this study, we held out all frames belonging to participant 25 (coded “064-ak064”), resulting in a total of 1611 frames for testing. The subset of the UNBC-McMaster dataset used in this study is identical to that used in FEAPAS [9], providing continuity and ensuring comparability across studies.
The BioVid Heat Pain Database is newer than the UNBC-McMaster dataset and offers greater variety; it was therefore chosen to further test the capability of the SAFEPA system in real-world situations. This database was collected through a collaboration between the Neuro-Information Technology group of the University of Magdeburg and the Medical Psychology group of the University of Ulm. The participants were captured in short videos while being exposed to four levels of experimentally induced heat pain at subject-specific stimulation temperatures. The dataset comprises videos from 69 participants, each with 20 baseline (no pain) samples and 20 samples for each of 4 pain levels, resulting in a total of 6900 samples. It is worth noting that the participants were explicitly given the freedom to move their heads, providing an even greater range of motion in the captured footage. In Table 1, we present a range of statistical and functional information for the four datasets described above, providing a comprehensive overview of the scope and detail of each dataset.
3.2. The Model
The SAFEPA model builds on the strengths of our earlier FEAPAS framework [9], which features two concurrent and coupled pretrained Convolutional Neural Networks (CNNs) for high-performance facial expression-based pain assessment. By integrating a custom-designed Sparse Autoencoder (SAE) into the FEAPAS classifier, the SAFEPA model eliminates the facial detection step and enables more accurate pain assessment across a wide range of facial poses and expressions. In this section, we delve into the details of FEAPAS, the SAE, and the combination of these two techniques that underpins the SAFEPA model. Section 3.2.1 focuses on the FEAPAS methodology, which utilizes two concurrent InceptionV3 models to construct the system architecture; we provide a detailed description of this approach, highlighting the key features that make it effective for feature extraction and classification. Section 3.2.2 describes the encoder and decoder layers employed in the SAE model and the data preparation for SAE training. Finally, in Section 3.2.3, we describe the implementation of the SAFEPA classifier, which results from the amalgamation of the SAE and FEAPAS techniques.
3.2.1. FEAPAS
In FEAPAS, we used InceptionV3 as the building block of our classifier. The choice was made because InceptionV3 can provide accurate and efficient results in the absence of a very large amount of training data [29]. The InceptionV3 architecture is inspired by the Network in Network (NIN) model, which employs 1 × 1 filters for feature extraction. InceptionV3 takes this approach a step further by utilizing 1 × 1 filters before the larger filters, which are then replaced with small and asymmetric filters. This design enables the model to effectively capture features at different scales and resolutions, leading to improved performance in various image recognition tasks. The use of 1 × 1 filters in this manner is a key feature of the InceptionV3 architecture and has been shown to be effective in improving accuracy and reducing computational complexity [30]. The FEAPAS model utilizes two concurrent InceptionV3 networks for feature extraction. The first processes the upper part of the detected face, while the second processes the entire detected face. The features extracted from both branches are then concatenated and fed into a dense layer with 1024 neurons, followed by a dropout layer with a probability of 0.25. The resulting features are then fed into the classification layer, as illustrated in Figure 3. This approach enables FEAPAS to capture detailed information from both the upper and whole face, important for accurately recognizing and assessing pain. The use of two concurrent InceptionV3 branches with different inputs allows the model to capture different types of features and improve the robustness of the system. Overall, this methodology represents an effective approach for pain assessment using deep Convolutional Neural Networks.
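For illustration, a minimal Keras sketch of this two-branch design is given below. The input names, the use of ImageNet weights, and the four-class output are our assumptions for readability; the published FEAPAS configuration in [9] remains authoritative.

# Illustrative sketch of a two-branch InceptionV3 classifier (not the exact FEAPAS code).
from tensorflow.keras import Input, Model, layers
from tensorflow.keras.applications import InceptionV3

def build_two_branch_classifier(input_shape=(150, 150, 3), num_classes=4):
    whole_face = Input(shape=input_shape, name="whole_face")       # entire detected face
    upper_face = Input(shape=input_shape, name="upper_face")       # upper part of the face

    # Two concurrent InceptionV3 feature extractors with global average pooling.
    branch_whole = InceptionV3(include_top=False, weights="imagenet", pooling="avg")
    branch_upper = InceptionV3(include_top=False, weights="imagenet", pooling="avg")
    branch_upper._name = "inception_v3_upper"                      # rename so the two sub-models do not clash

    merged = layers.Concatenate()([branch_whole(whole_face), branch_upper(upper_face)])
    x = layers.Dense(1024, activation="relu")(merged)              # dense layer with 1024 neurons
    x = layers.Dropout(0.25)(x)                                    # dropout with probability 0.25
    outputs = layers.Dense(num_classes, activation="softmax")(x)   # classification layer
    return Model(inputs=[whole_face, upper_face], outputs=outputs)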
3.2.2. Sparse Autoencoders SAE
The developed SAE in this study consists of an encoder followed by a decoder. The SAE includes a total of six layers: one input, four hidden, and one output. The input and output layers each contain 150 neurons, which match the shape of the data. The encoder is composed of two dense layers, each followed by a LeakyReLU activation layer. LeakyReLU has been shown to work well in autoencoders, as it helps to prevent the issue of “dead neurons”, which can occur when ReLU activation outputs zero for a given input, leading to zero gradients during backpropagation and effectively stopping learning. The use of LeakyReLU can improve the stability and performance of the network during training. The decoder is developed in a similar but reverse way to the encoder and includes a Sigmoid activation function in the output layer. The goal is to train our SAE to reconstruct the upper part of the face from input images without the help of the face detection step used in other methods.
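A minimal Keras sketch of such an encoder-decoder follows. The paper specifies the 150-unit input and output, the two LeakyReLU dense layers in the encoder, the mirrored decoder, and the sigmoid output; the hidden-layer widths, the L1 activity penalty used here to impose sparsity, and the treatment of the data as a flat vector are our assumptions rather than the published configuration.

# Schematic sparse autoencoder: 150-unit input/output, two-layer LeakyReLU encoder,
# mirrored decoder with a sigmoid output. Hidden widths and the L1 activity penalty
# (one common way to enforce sparsity) are assumptions.
from tensorflow.keras import Input, Model, layers, regularizers

def build_sae(dim=150, hidden1=128, hidden2=64, sparsity=1e-5):
    inputs = Input(shape=(dim,))                                   # input layer (150 units)
    # Encoder: two dense layers, each followed by LeakyReLU.
    x = layers.Dense(hidden1, activity_regularizer=regularizers.l1(sparsity))(inputs)
    x = layers.LeakyReLU()(x)
    x = layers.Dense(hidden2, activity_regularizer=regularizers.l1(sparsity))(x)
    code = layers.LeakyReLU()(x)
    # Decoder: mirror of the encoder, ending in a sigmoid on the 150-unit output layer.
    x = layers.Dense(hidden1)(code)
    x = layers.LeakyReLU()(x)
    outputs = layers.Dense(dim, activation="sigmoid")(x)
    sae = Model(inputs, outputs)
    sae.compile(optimizer="adam", loss="mse")                      # MSE loss, cf. Equation (5)
    return sae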
In general, an autoencoder with input $x$, hidden representation $h$, and output $\hat{x}$ can be represented as in Equation (1):

$\hat{x} = g(f(x))$ (1)

The encoder function $f$ transfers $x$ to $h$, as shown in Equation (2):

$h = f(x)$ (2)

Equation (3) explains how the decoder function $g$ transfers $h$ to $\hat{x}$:

$\hat{x} = g(h)$ (3)

Training the autoencoder aims to minimize $L(x, \hat{x})$, where $L(x, \hat{x})$ represents the difference between the input and the output of the autoencoder, as shown in Equation (4):

$\min_{f,\,g} L\big(x, g(f(x))\big)$ (4)

The mean squared error (MSE) is commonly used as the loss function in the autoencoder. Assuming $n$ equals the number of observations in the training dataset, the MSE measures the distance between the original data and the reconstructed data, as shown in Equation (5) [29,30]:

$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \hat{x}_i\right)^2$ (5)
To prepare the data for the training and testing phase of the SAE, we applied an MTCNN detector on all frontal images in the KDEF dataset, then we used the OpenCV library to extract the upper part of the face. Finally, we matched each image of the same person with the same expression but taken from different angles (full left, full right, half left, half right, and straight-on) to the same extracted upper part of the frontal view image.
Figure 4 shows an example of our data preparation method.
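A rough sketch of this preparation step is shown below, assuming the mtcnn Python package and OpenCV; the file paths, the 50% vertical split used to approximate the upper face, and the 150 × 150 target size are illustrative choices rather than the exact pipeline.

# Illustrative data preparation: crop the upper face from a frontal KDEF image with
# MTCNN + OpenCV, so it can be paired with every pose of the same person/expression.
# The 50% vertical split and the 150 x 150 target size are assumptions.
import cv2
from mtcnn import MTCNN

detector = MTCNN()

def upper_face_from_frontal(frontal_path):
    image = cv2.cvtColor(cv2.imread(frontal_path), cv2.COLOR_BGR2RGB)
    faces = detector.detect_faces(image)         # MTCNN face detection
    if not faces:
        return None
    x, y, w, h = faces[0]["box"]
    upper = image[y:y + h // 2, x:x + w]          # keep the eyes/eyebrows region
    return cv2.resize(upper, (150, 150))

# Pairing (outline): for each person and expression, map every pose image
# (full left/right, half left/right, straight-on) to upper_face_from_frontal()
# applied to the corresponding frontal image.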
The SAE was trained for 100 epochs on the prepared KDEF dataset using the Adaptive Moment Estimation (ADAM) optimizer [31] and a batch size of 46. The validation loss was 0.036, as shown in Figure 5.
An illustration of the SAE input and output is presented in Figure 6. The SAE is designed to process an input image in various poses and generate the upper portion of the face in that image. The output images produced by the SAE led to excellent performance in the SAFEPA model, exhibiting promising levels of accuracy. Note that the blurriness of the output does not impact the network’s ability to assess pain.
3.2.3. The Sparse Autoencoders for Facial Expressions-Based Pain Assessment SAFEPA
The SAFEPA model takes a 150 × 150 colored image as input and utilizes the SAE to reconstruct a colored image of the same size, which will contain the upper part of the face if the input image contains a face in any pose. We did not consider the case in which the input image does not contain a face, since all the datasets used contain faces in all of their samples. This approach enables the SAFEPA model to work on all frames without crashing or malfunctioning when the input image shows a full left or right profile. In contrast, other systems that rely solely on face detection algorithms may fail to detect a face, resulting in no input to the classifier and subsequent system failure. After obtaining the reconstructed upper part of the face and the input image, the SAFEPA model feeds them into two concurrent InceptionV3 models. The resulting outputs from the InceptionV3 models are concatenated into a single feature vector. This concatenated feature vector is then passed through a dense layer with 1024 neurons, followed by a dropout layer with a probability of 0.25; the dropout layer is used to prevent overfitting, which is a common problem in deep learning models. Finally, the features are fed into a fully connected layer for classification. The overall structure of the SAFEPA model is illustrated in Figure 7.
The SAFEPA monitoring system operates by reading online video frames in real time and utilizing the SAFEPA classifier for pain detection. When a video frame is processed, the SAE reconstructs a colored image as explained above. The processed frame and the SAE output are subsequently fed into the InceptionV3 classifier. If the classifier output is not “No Pain”, an alarm is activated, and critical data such as the time, frame, and pain level are recorded. Figure 8 shows a high-level flow chart of the automatic pain assessment system SAFEPA. It is important to note that the SAFEPA model was trained and tested on datasets in which all samples contain faces.
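A simplified sketch of this monitoring loop is given below; the class names, preprocessing, alarm action, and logging format are placeholders rather than the deployed implementation.

# Simplified monitoring loop: read frames, reconstruct the upper face with the SAE,
# classify with the two-branch model, and raise an alarm on any non-"No Pain" output.
# Class order, preprocessing, and the alarm/logging actions are placeholders.
import time
import cv2
import numpy as np

CLASSES = ["No Pain", "Low Pain", "Moderate Pain", "Severe Pain"]

def monitor(stream, sae, classifier):
    cap = cv2.VideoCapture(stream)
    frame_index = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(cv2.resize(frame, (150, 150)), cv2.COLOR_BGR2RGB) / 255.0
        x = np.expand_dims(rgb, 0)
        upper = sae.predict(x, verbose=0)                     # reconstructed upper face
        probs = classifier.predict([x, upper], verbose=0)[0]  # concurrent InceptionV3 branches
        label = CLASSES[int(np.argmax(probs))]
        if label != "No Pain":
            # Record the critical data: time, frame index, and pain level.
            print(f"ALARM {time.strftime('%H:%M:%S')} frame={frame_index} level={label}")
        frame_index += 1
    cap.release()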
3.3. The Evaluation
A CNN model’s performance is often measured with accuracy, precision, recall, and F1-Score, which are calculated from the confusion matrix (CM) [32]. Accuracy is the ratio of the number of samples correctly classified by the classifier to the total number of samples in the testing data, as in Equation (6).
In the equations, TP, TN, FP, and FN refer to true positive (observation is predicted positive and is actually positive), true negative (observation is predicted negative and is actually negative), false positive (observation is predicted positive and is actually negative), and false negative (observation is predicted negative and is actually positive), respectively. Whereas accuracy shows the gap between the real values and the predicted values, precision deals with the fraction of positive predictions, as in Equation (7), and recall deals with the actual positive fraction, as in Equation (8).
The F1-Score combines both precision and recall in a single value as in Equation (9). The F1-Score is indicative of a model’s balanced ability to describe both positive cases (recall) as well as be accurate with the cases that it captures (precision).
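For completeness, the standard definitions corresponding to Equations (6)-(9), restated here in the notation above, are:

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ (6)

$\text{Precision} = \frac{TP}{TP + FP}$ (7)

$\text{Recall} = \frac{TP}{TP + FN}$ (8)

$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ (9)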
The performance evaluation of the SAFEPA system was conducted using various datasets, as follows:
1. UNBC-McMaster Dataset
A 10-fold cross-validation accuracy was measured on a subset of the UNBC-McMaster dataset, consisting of 6000 samples. The obtained results were compared with recently published models designed for pain assessment, as described in references [3,22,23]. Subsequently, the confusion matrix and the Receiver Operating Characteristic (ROC) curve were generated to evaluate the performance of SAFEPA’s best performing model, achieved through 10-fold cross-validation.
2. KDEF Dataset—Frontal View Subset
A 10-fold cross-validation accuracy was measured on a subset of the KDEF dataset, focusing specifically on the frontal view. This subset comprised a total of 980 samples. The obtained results were compared with four recently published models for Facial Expression Recognition (FER) mentioned in references [17,24,25,26].
3. KDEF Dataset—Full and Half Left/Right Poses
The evaluation was performed on a subset of the KDEF dataset, which included full and half left/right poses. This subset contained a total of 3920 samples. The results obtained were compared with the outcomes of the models mentioned in reference [17].
4. KDEF Dataset—Entire Dataset
The evaluation was further performed on the entire KDEF dataset, consisting of 4900 samples. The results obtained were compared with the outcomes of the models mentioned in reference [17].
5. Analysis Metrics on KDEF Subset (Frontal View)
The evaluation involved analyzing a confusion matrix, precision, recall, F1-Score, and accuracy. This analysis was conducted using a 70% training and 30% testing split of the KDEF subset with 980 samples.
6. Evaluation on BioVid Dataset
To provide a comprehensive evaluation of the SAFEPA system’s performance, accuracy and processing time were measured. The system was tested on the entire unseen BioVid dataset. Processing time was considered a critical factor, especially in real-world pain assessment scenarios where efficient and prompt processing is vital for timely pain management.
These evaluations and measurements collectively provide a thorough assessment of the SAFEPA system’s performance across different datasets and scenarios.
4. Experimentation and Results
To implement the deep learning experiments, Anaconda 5.3.1 [33], the Keras library [34], the OpenCV library [35], and the Python 3.7.3 programming language were used. All the experiments were run on the Hercules cluster in our PDS lab [pds.ucdenver.edu] on a Tesla P100-SXM2 GPU [36].
To test the SAFEPA system, we used an Intel Core i7-8750 CPU (2.20 GHz, 8 GB RAM), with a Windows 10 64-bit operating system.
This study describes three main experiments that were conducted to evaluate the proposed SAFEPA system.
4.1. Experiment 1: How Well SAFEPA Performs in Pain Assessment with Frontal and Partial Views
In Experiment 1, we conducted a rigorous evaluation of SAFEPA’s pain level recognition capabilities. Leveraging a subset of the UNBC-McMaster dataset containing 6000 samples, we utilized the same training and testing process as described in [9]. Specifically, we trained SAFEPA for 20 epochs using the SGD optimizer and a batch size of 32 and measured 10-fold cross-validation for each epoch.
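A schematic of this cross-validation loop is shown below; load_unbc_subset() and build_safepa() are hypothetical placeholders, the labels are assumed to be integer-encoded, and evaluating once after 20 epochs is a simplification of the per-epoch measurement described above.

# Schematic 10-fold cross-validation for the pain-level classifier. The data loader
# and model builder are placeholders; SGD, batch size 32, and 20 epochs follow the text.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(build_safepa, load_unbc_subset, folds=10):
    images, labels = load_unbc_subset()          # hypothetical loader: arrays of frames/labels
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=1)
    scores = []
    for train_idx, test_idx in skf.split(images, labels):
        model = build_safepa()                   # fresh model for every fold
        model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(images[train_idx], labels[train_idx],
                  epochs=20, batch_size=32, verbose=0)
        _, accuracy = model.evaluate(images[test_idx], labels[test_idx], verbose=0)
        scores.append(accuracy)
    return float(np.mean(scores))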
In addition, we trained the three distinct models VGG16, ResNet50, and Ensemble from [17] for 500 epochs using the same parameters as mentioned in [17]. Specifically, we classified the features extracted by VGG16 with 50 neurons, those extracted by ResNet50 with 15 neurons, and the concatenated features extracted by VGG16 and ResNet50 with 50 neurons. These models were run on images of size 300 × 300, and the experiment was repeated for size 150 × 150 to match the input size of the SAFEPA models. Once again, we measured 10-fold cross-validation for each epoch. After running the models on the two different image sizes, the best model of each type was selected based on its performance. The results in Table 2 show that SAFEPA demonstrated the highest accuracy of 98.93% for 10-fold cross-validation.
Figure 9 presents the confusion matrix for the best performing model achieved through 10-fold cross-validation of the SAFEPA on the UNBC-McMaster dataset. This matrix provides valuable insights into the model’s performance in classifying four pain levels: ‘No Pain’, ‘Low Pain’, ‘Moderate Pain’, and ‘Severe Pain’. Upon examination, it is observed that the model exhibits a high accuracy in pain classification. Specifically, there is one misclassification between the ‘No Pain’ and ‘Low Pain’ classes and one additional misclassification between the ‘Low Pain’ and ‘No Pain’ classes. Notably, two misclassifications occur between the ‘Moderate Pain’ and the ‘Low Pain’ classes, reflecting a challenge in accurately distinguishing between these categories. However, it is noteworthy that the model demonstrates exceptional performance for the ‘Severe Pain’ class.
Figure 10 presents the Receiver Operating Characteristic (ROC) curve for the best performing model achieved through 10-fold cross-validation of the SAFEPA on the UNBC-McMaster dataset. The ROC curve provides a comprehensive visualization of the model’s performance in distinguishing between different pain levels.
4.2. Experiment 2: How Well SAFEPA Performs in the Facial Expression Recognition Task
In Experiment 2, we explore how well the SAFEPA model can be extended to recognize facial expressions. We first provide a comparison and analysis of all models for the case of FER, restricting our datasets to full frontal face images. Although such a restriction limits our ability to show the full potential of SAFEPA, it demonstrates that SAFEPA is competitive with state-of-the-art FER models even in the restricted cases. We then present results showing SAFEPA’s ability to outperform the models in [17] when various facial poses are present, utilizing a subset of the KDEF dataset (consisting of 3920 samples) that excludes the frontal view, as well as the complete KDEF dataset (4900 samples). To do this, we trained SAFEPA on a subset of the KDEF dataset consisting of 980 samples with frontal faces and an input size of 150 × 150. The model was trained using the Stochastic Gradient Descent (SGD) optimizer with a batch size of 32 for 50 epochs. The 10-fold cross-validation results show that the SAFEPA model achieved a competitive accuracy of 97.94%. These results are particularly noteworthy when compared to those obtained in [17,24,26], where VGG16 and Ensemble were used to extract features from the same subsets of KDEF but with an image size of 300 × 300 and trained for 500 epochs. The results of the experiment are summarized in Table 3.
We also sought to evaluate the performance of SAFEPA and the models described in [17] on facial expression recognition (FER) under different poses, including full left and right profiles. To do so, we trained the models in [17] on a subset of the KDEF dataset (consisting of 3920 samples) that excludes the frontal view, as well as on the complete KDEF dataset (4900 samples), using the same approach as described in Experiment 1 but without the face detection step, which is not applicable for full left and right profiles.
In contrast, the SAFEPA model was trained specifically for FER on the same subset of 3920 samples of KDEF and on the whole KDEF dataset, utilizing an ADAM optimizer and a batch size of 32 for 75 epochs. The input size was set to 150 × 150.
For 10-fold cross-validation, SAFEPA achieved the highest performance, with 95.51% accuracy on the 3920-sample subset of KDEF and 94.29% on the entire KDEF dataset, outperforming the other models.
We conducted a follow-up experiment on the subset of KDEF with 980 samples, using 70% of the data for training and 30% for testing as in [25], with the same parameters as in Experiment 2. SAFEPA achieved an accuracy of 84.19%, which is superior to the accuracy obtained by the model presented in [25], which employed HOG for feature extraction and SVM for classification. The results are presented in Table 4.
The performance of the SAFEPA model was evaluated on a 30% test set from the 980-sample subset of the KDEF dataset. The data were split using a split function, resulting in a total of 291 samples for testing. The evaluation process included the confusion matrix, precision, recall, and F1-Score. The results of the SAFEPA model are presented in Figure 11. Table 5 indicates that SAFEPA achieved high accuracy in predicting the ‘Happy’ emotion with 100.00%. However, the ‘Sad’ emotion was the least accurately predicted by the SAFEPA model, with an accuracy of 53.85%. Notably, the overall accuracy of the SAFEPA model on the test set was higher than that of the model designed for the FER task described in [25]. These results demonstrate the effectiveness of the proposed model in facial emotion recognition on the 980-sample subset of the KDEF dataset.
Furthermore, the FEAPAS model was trained on the JAFFE dataset using the ADAM optimizer and a 150 × 150 input size and achieved a remarkable accuracy of 96.80% in 10-fold cross-validation, which is highly competitive compared to the Ensemble classifier of VGG16 and ResNet50 models used in [17], which achieved an accuracy of 96.40%. Moreover, when trained on 70% training data and 30% testing data using the RMSProp optimizer, FEAPAS achieved an accuracy of 78.13%, outperforming the model in [25], which achieved an accuracy of 76.19%. However, it is important to note that the SAFEPA model cannot be applied to grayscale images, as it was specifically trained on colored images; hence, no result is provided for SAFEPA on the JAFFE dataset.
4.3. Experiment 3: Test in Real-Time: How Well the SAFEPA System Performs with a New Pain Assessment Dataset
In Experiment 3, we tested the SAFEPA system on unseen data: the BioVid dataset, which was collected for pain assessment and contains 6900 videos. Since the BioVid Heat dataset only offers video-level labels, it does not provide frame-level information. To address this, using the SAFEPA system, which is described in Figure 7 and operates on a frame-by-frame basis, we selected the highest-level prediction among all predictions made for the frames belonging to each video. The accuracy of the SAFEPA system on the entire unseen BioVid dataset was 33.28%, which is an improvement over the results reported in [37,38].
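A sketch of this video-level aggregation is given below; frame decoding details and predict_frame() (a per-frame call to the SAFEPA classifier) are placeholders.

# Video-level label from frame-level predictions: classify every frame and keep the
# highest predicted pain level seen in the video. predict_frame() is a placeholder
# for a per-frame call to the SAFEPA classifier.
import cv2
import numpy as np

def predict_video_level(video_path, predict_frame):
    cap = cv2.VideoCapture(video_path)
    highest = 0                                   # index 0 corresponds to "No Pain"
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        probs = predict_frame(frame)              # per-frame class probabilities
        highest = max(highest, int(np.argmax(probs)))
    cap.release()
    return highest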
37], the authors proposed facial activity descriptors to detect pain and estimate its intensity. Their model was applied on the BioVid dataset and for leave-one-subject-out cross-validation; to recognize five levels of pain, an accuracy of 30.80% was achieved. In [
38], Bourou and his colleagues proposed a feature selection and inter-subject variability modeling approach for video-based pain level assessment. They utilized lasso regression to select the most informative features from the extracted geometric and color-based features, and then modeled inter-subject variability using a generalized linear mixed effects probit model. To evaluate their approach, the authors extracted a set of features from the facial regions of each video frame and trained a regression model to predict pain intensity [
38]. They reported an accuracy of 27.13% based on 10-fold cross-validation [
38].
However, SAFEPA’s accuracy is slightly lower than the accuracy of 37.42% reported in [39]. In [39], Xiang and his colleagues proposed a face analysis CNN combined with an LSTM network for pain assessment. They trained and tested their model on selected data belonging to 30 participants, and they do not report which subset of the dataset was used for their result.
Table 6 presents the performance of the SAFEPA model on the entire BioVid dataset, reporting accuracy and processing time alongside the accuracy scores obtained by three other studies (including each method’s main characteristics) conducted on the entire dataset or a subset of it. Notably, previous studies did not take processing time into account and, in contrast with the SAFEPA results, used various validation methods rather than testing on the entire dataset.
The average processing time for the SAFEPA system is 17.82 s per video. It is important to note that each video in the BioVid dataset lasts for 5 s. Thus, the SAFEPA can efficiently process each video in the dataset within a reasonable time, indicating its potential for real-world application.
4.4. Ablation Test
We conducted a decomposition of the SAFEPA into its fundamental models, as depicted in Figure 12. The fundamental model comprises a single branch of InceptionV3, which is subsequently followed by dense and dropout layers. In contrast, the FEAPAS consists of two branches of InceptionV3, where the outputs of these branches are concatenated and forwarded to dense and dropout layers. As an extension of the FEAPAS, the SAFEPA introduces the inclusion of the SAE into the first branch.
To evaluate the performance of each model, we utilized the UNBC-McMaster dataset for pain assessment, along with three sets of KDEF. The KDEF dataset encompassed three distinct variations: (1) only frontal view, comprising 980 samples, (2) full and half left and right poses, amounting to 3920 samples, and (3) the complete dataset encompassing 4900 samples for facial expression recognition (FER).
Table 7 shows the accuracy for each case.
Table 7 does not provide the results for the FEAPAS in the last two cases of the KDEF dataset, as the FEAPAS relies on MTCNN which is unable to handle full left/right profile samples. Consequently, the FEAPAS cannot be applied to these specific samples within the KDEF dataset. However, while the FEAPAS exhibits satisfactory performance with front view samples in both the UNBC-McMaster and KDEF datasets, it fails to deliver successful outcomes in other scenarios.
Overall, the findings presented in Table 7 demonstrate that all models perform optimally when confronted with frontal view samples in both the UNBC-McMaster and KDEF datasets. In the case of full left and right profile samples, the FEAPAS encounters limitations and the base model’s performance declines substantially. In contrast, the SAFEPA operates effectively without any complications and demonstrates superior performance, surpassing the base model’s capabilities.
5. Conclusions and Future Works
The combination of sparse autoencoders (SAE) and concurrent CNNs, as implemented in the SAFEPA system, has demonstrated remarkable efficacy in overcoming the limitations of face detection algorithms that can hinder model performance in certain poses. By leveraging the SAE to reconstruct the upper part of the face, the proposed model can operate on different poses without relying on a face detection algorithm. Our experiments, which utilized the KDEF dataset for facial expression recognition and the UNBC-McMaster and BioVid datasets for pain assessment, have validated the efficacy and versatility of the SAFEPA system. Notably, the SAFEPA model achieved 98.93% accuracy in 10-fold cross-validation for pain assessment and 84.06% accuracy for unseen subjects when applied to a subset of the UNBC-McMaster dataset. Moreover, its competitive performance compared to state-of-the-art studies in facial expression recognition confirms its efficiency in handling such tasks. Importantly, where other systems relying on face detection algorithms, such as FEAPAS, failed to work properly on samples with full left and right poses, the SAFEPA model was able to work smoothly and achieve high accuracy. Finally, the performance of the SAFEPA system on the BioVid dataset, as demonstrated by its accuracy and processing time, highlights its generality and potential for real-world applications. These findings underscore the potential of the SAFEPA to accurately recognize pain levels and facial expressions, process video data efficiently, and benefit various domains such as healthcare, education, and entertainment. Furthermore, considering the significant potential of augmented reality [40], our work holds promising implications in this area. Our future plan is to explore additional scenarios, such as when only partial facial information is available or when the input images include non-facial content. Additionally, we aim to investigate the integration of voice analysis as an auxiliary component to enhance pain detection capabilities.