1. Introduction
Diabetic foot ulcers (DFUs) are among the most frequent and highest-risk complications associated with diabetes mellitus [1,2]. An infection of the wound may require the amputation of the foot or lower limb. Worldwide, a limb amputation is estimated to occur every 20 s [3]. In addition, the recurrence rate remains at about 60% after three years [4]. DFU occurrence can be avoided, reduced, or substantially delayed by early detection, assessment, diagnosis, and tailored treatment [1,5]. The identification of the underlying condition that sustains skin and tissue damage at an early stage, prior to the onset of superficial wounds, is an emerging area of research [6,7,8,9].
Machine learning (ML) and deep learning (DL) approaches based on infrared thermography have been established as a complementary tool for the early identification of superficial tissue damage. Thermography enables real-time visualization of the plantar temperature distribution passively, that is, the surface being measured remains intact [2]. However, the heat pattern of the plantar aspect of the feet and its association with diabetic foot pathologies is subtle and often non-linear [10]. For these reasons, ML and DL models are selected because they offer versatile and highly accurate outputs, lessening the time burden of demanding tasks, the associated costs, and human bias such as subjective interpretations or the inherent limitations of human visual perception. Despite these advantages, the use of such models within clinical decision support systems in real-world scenarios has not yet been achieved [11]. More studies are required before these models can be integrated into the healthcare setting [12]. In the case of DFUs in particular, the use of ML and DL models is hindered by the lack of labeled data, which causes overfitting and poor generalization on new data when the training dataset is not large enough [13]. Techniques exist to mitigate this problem, such as transfer learning [14] or data augmentation [15,16]. Furthermore, these problems are magnified by the current trend towards deeper neural networks [17,18,19], where the problem of vanishing gradients [20] is widespread; however, skip connections have been shown to overcome this limitation and to provide other benefits during the training process [21]. Additionally, the lack of standardization regarding feature extraction may also have an impact.
Ideally, ML and DL models should classify subjects at risk of developing an ulcer from a single thermogram containing the plantar aspect of both feet and, if possible, quantify the severity of the lesion. In the context of healthcare, comprehensive data interpretation is crucial. However, for identifying foot disorders using thermography, many features have been proposed in the state of the art, and it is challenging to determine which ones are the most representative for DFUs. A high number of features can hinder data interpretation. Misinterpretation of the data may lead to inconsistencies among experts when diagnosing a disease, resulting in increased variability in clinical decision-making. Therefore, identifying foot disorders using thermography requires establishing a subset of relevant features to reduce decision variability and data misinterpretation and to provide a better overall cost–performance trade-off for classification [22]. Using a subset of features with relevant information, classifiers with better cost–performance ratios are achieved, as reducing the number of features can lessen both computational and memory requirements [23]. The lack of standardization among thermograms, as well as datasets unbalanced towards diabetic cases, hinders the establishment of this suitable subset of features.
ML and DL models have been explored to determine relevant features for the early detection of DFUs [9,24,25,26,27]. However, except for a few cases, these studies were derived mainly from the only publicly available dataset, the INAOE dataset (Instituto Nacional de Astrofísica, Óptica y Electrónica) [26], which is composed of thermograms containing the plantar aspect of both feet. Recently, a similar dataset was released, STANDUP [28], which provides the means for extending the current state of the art by simply increasing the number of samples available to train ML and DL models. Furthermore, the additional dataset enables the determination of the generalizability of the set of state-of-the-art features previously extracted by classical and DL approaches [27].
In this work, the same methodology previously described was executed in order to extract a state-of-the-art set of features from infrared thermograms [27]. Four input datasets were considered by merging different datasets for feature extraction. A subset of features associated with each input dataset was extracted using classical- and DL-based approaches. The subset of features common to all of the approaches employed was used as an input for both a standard and an optimized support vector machine (SVM) [29] classifier. The SVM classifier was used as a reference to assess and compare the performance of each set of extracted features from the STANDUP and extended databases. In addition, a comparison was performed between the more relevant and robust features extracted in this work and those extracted using solely the INAOE dataset [27], as well as those proposed in previous studies [9].
4. Discussion
Several approaches were considered to extract relevant features for DFU detection based on infrared thermograms, following the same methodology previously described [27]. In this case, an extended and multicenter dataset was created by merging the INAOE, STANDUP, and local databases, which provided a generalization factor to the classification task at hand: determining whether a thermogram corresponded to a healthy or a diabetic person.
To the best of the authors’ knowledge, this is the largest thermogram dataset explored, especially regarding DFU detection at an early stage. As mentioned above, the INAOE dataset has been the only thermogram database publicly available, and the recently released STANDUP dataset provides the opportunity to test the methodology previously established. The STANDUP dataset was considered alone as well as merged with the local dataset, aiming to correct the observed imbalance toward diabetic cases. Furthermore, a more generalized and extended dataset was created by merging all available datasets (ALL).
Classical approaches, such as lasso and random forest, were tested against two DL-based approaches applying dropout techniques: concrete and variational dropout. The dropout techniques, initially designed to address overfitting in DL models, were employed not only for feature selection but also across different layers using a dropout rate of 0.5. For instance, in the case of concrete dropout, the input layer is defined by variational parameters establishing a binomial distribution composed of d independent Bernoulli ‘continuous relaxed’ distributions [27]. This configuration acts as a ‘gate’ to identify irrelevant features by introducing noise [27]. In an ideal scenario, relevant features tend to have a dropout rate close to zero, while irrelevant features tend towards a dropout rate of one. In essence, this restriction in the model implicitly serves to mitigate the overfitting concerns inherent in DL-based models. Furthermore, it is worth noting that the chosen models, particularly the random forest and DL-based approaches, are inherently robust at handling data variability. While preprocessing could mitigate issues related to feature extraction, the focus of this work was to identify the most relevant features within the newly released STANDUP and ALL databases and to compare them with previous results [27]. Therefore, extensive preprocessing of the thermograms was avoided.
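As an illustration of the gating mechanism described above, the forward pass of such ‘continuous relaxed’ Bernoulli gates can be sketched as follows. The temperature, feature values, and dropout rates below are arbitrary placeholders, not the learned parameters from this work.

```python
import numpy as np

def concrete_gates(p_drop, temperature=0.1, seed=None):
    """Sample one 'continuous relaxed' Bernoulli gate per feature.
    p_drop holds per-feature dropout probabilities in (0, 1); the
    returned keep-gates lie in (0, 1), near 1 when p_drop ~ 0
    (relevant feature) and near 0 when p_drop ~ 1 (irrelevant)."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p_drop, dtype=float)
    u = rng.uniform(1e-6, 1 - 1e-6, size=p.shape)        # uniform noise
    # concrete relaxation: sigmoid of (logit(p) + logit(u)) / temperature
    logits = np.log(p) - np.log1p(-p) + np.log(u) - np.log1p(-u)
    drop = 1.0 / (1.0 + np.exp(-logits / temperature))
    return 1.0 - drop                                     # keep mask

# gate a hypothetical feature vector with hypothetical learned drop rates
x = np.array([3.1, 0.7, 2.4])
p_drop = np.array([0.02, 0.95, 0.10])
x_gated = x * concrete_gates(p_drop, seed=0)
```

Averaged over many noise samples, a feature with a drop rate near one is almost always silenced, which is how irrelevant inputs are identified.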
In the context of ML models, where the parameters are denoted as θ, a test could theoretically be established to validate the statistical significance of θ with respect to the conditional distribution P(Y|X), where X is the dataset and Y the prediction. However, it is crucial to note that ML models are commonly evaluated using metrics such as the mean squared error (MSE) or the AUC-ROC. In this work, k-fold cross-validation [42] was employed to validate the SVM model. The dataset is partitioned into k subsets, and the model is trained on k − 1 subsets while being validated on the remaining subset. This process is iterated k times, with each fold serving once as the validation set. The outcome is an estimate of the mean error and its standard deviation, providing a robust assessment of model performance. Specifically, a low standard deviation was observed for the standard SVM classifier with predefined hyperparameter configurations across the different experiments, which discards biased conclusions. This finding leads to the conclusion that the model effectively fits the distribution P(Y|X) and that the provided features contain sufficient information about X for predicting Y. In general, the uncertainty increased with the class-balanced dataset, as noticed by the increase in the standard deviation.
The analysis of the subset of features considered relevant, and the subsequent classification task for each approach, provided sufficient metric values regarding performance. For the dataset with maximum heterogeneity (ALL), the best approach varied depending on whether the classifier was standard or optimized. For the standard SVM, in which a true comparison can be drawn between the different approaches, the best performance metrics were observed for the state-of-the-art features previously reported [27]. These results support the fact that the methodology, and the subset of state-of-the-art features subsequently derived, provide consistent and reliable descriptors to discriminate between healthy and diabetic individuals. Despite the heterogeneity of the dataset, the performance was suitable, although some decreases were observed precisely due to this variability. The best F1-score reported for the DFU dataset was 0.9027 ± 0.104 [27], whereas the same metric was 0.8513 ± 0.0279 for the ALL dataset.
For the optimized SVM, the lasso approach provided the best performance metrics, except for the recall, which was best for the variational dropout approach. In this case, the F1-score for the ALL dataset was 0.7956 ± 0.0291. The decrement in performance may be due to oversampling. For the non-oversampled datasets, the recall performance increases when using the optimized SVM. This can be explained by the fact that some subjects considered as controls may actually be diabetic. Therefore, when applying SMOTE, features corresponding to diabetic subjects are propagated and disrupt the control group. This is particularly noticeable for the STANDUP dataset.
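The oversampling step discussed above follows SMOTE’s core idea of interpolating between minority-class neighbors; a minimal sketch (with arbitrary toy data, not the SMOTE configuration used in the experiments) is:

```python
import numpy as np

def smote(X_minority, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: each synthetic sample lies on the segment
    between a random minority-class sample and one of its k nearest
    minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    n = len(X)
    k = min(k, n - 1)
    # pairwise distances within the minority class (self excluded)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]            # k nearest neighbours
    synth = np.empty((n_new, X.shape[1]))
    for m in range(n_new):
        i = rng.integers(n)                      # a minority sample
        j = nn[i, rng.integers(k)]               # one of its neighbours
        lam = rng.random()                       # interpolation factor
        synth[m] = X[i] + lam * (X[j] - X[i])
    return synth

# oversample a toy "control" class of four 2-D feature vectors
controls = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
extra = smote(controls, n_new=10, k=2)
```

Because every synthetic point is a convex combination of two genuine samples, a mislabeled diabetic subject in the control group is propagated into the synthetic controls, which is the disruption noted above.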
Regarding the set of relevant features, R_MCA_std and R_LCA_kurtosis appeared as relevant independent of the approach and dataset. LPA_ETD appeared in coincidence for three datasets (DFU, STANDUP, and STANDUP merged with the local dataset), whereas R_MCA_kurtosis also appeared in coincidence for three datasets (STANDUP, STANDUP merged with the local dataset, and ALL). In addition, nine features were in coincidence for two datasets. Among all of these features found in coincidence, those that already appeared as relevant in our previous work [27] are as follows: R_MCA_std, LPA_ETD, L_kurtosis, R_kurtosis, L_MCA_std, and R_LPA_std. Thus, these features, mainly associated with the MCA and LPA angiosomes, as well as the kurtosis for each foot, consistently appeared as relevant independent of the input dataset.
A major limitation of the present study is the lack of an associated clinical trial. At this stage, the main aim was to establish the workflow required for data analysis. In this work, as a proof of concept, a relevant set of state-of-the-art features was determined. This provided a tool to successfully discriminate between healthy and pathological subjects by measuring the temperature within the plantar aspect of both feet. Furthermore, some insight was gained regarding the importance of the different angiosomes and their predictive value for classification. However, the presented methodology must be tested and validated in a standard clinical setting in order to assess the clinical relevance of the findings. Glycemic control parameters and other diabetes-specific factors should then be incorporated as additional features, allowing for the assessment of whether underlying biochemical processes relate to inflammation or microvascular changes in diabetic foot disorders.
Further studies will require more balanced datasets to classify thermograms between the two classes, diabetic and healthy. The necessity of additional preprocessing to unify different datasets must also be explored. The lack of improvement noticed in this work when merging datasets, in comparison with our previous work [27], may be a consequence of not applying uniform preprocessing.
Moreover, the STANDUP database provides thermographic images after thermal stress for healthy and diabetic subjects. These could help to gain insight into dynamic thermal changes in diabetic foot disorders and whether such dynamic information could contribute to early detection. Currently, these data are being preprocessed in order to apply the methodology presented in this work. Finally, once a patient has been labeled as diabetic, a new classification task is planned to determine the level of severity within diabetic thermograms.