1. Introduction
Pattern recognition is one of a plethora of intricate cognitive capabilities that have evolved over thousands of years by encoding and integrating information acquired from environmental sensory inputs to guide behavioral responses [
1]. Some pattern-processing functions, such as creating cognitive maps of physical environments, distinguishing individuals of the same species, and using gestures to communicate with other individuals, are common among many animal species, whereas others, such as creativity and invention, spoken and written language, and the imagination and comprehension of the passing of time, are uniquely human, in large part due to the relatively large size of the cerebral cortex [
1]. Specifically for visual information, photoreceptors in the retina activate in response to incoming light, and details are transiently encoded in nerve cell circuits of the visual regions of the cerebral cortex before being transferred to and stored in the hippocampus [
1]. As a result of this evolutionary development, humans have an innate ability to recognize and empirically describe textural properties (coarseness, rippling, contrast, etc.) associated with visual inputs. In 1973, with advancements in computing and digital photography, Haralick et al. anticipated the need for and potential utility of quantifying the textural characteristics of an image, pioneering the field of radiomics with a seminal work that defined a process to calculate a set of features from an image [
2]. Based on the assumption that the textural information of an image is contained in the overall, or "average", relationship between pixel intensities, Haralick et al. computed a spatially dependent probability distribution based on the grayscale intensities of neighboring pixels, known as the gray-level co-occurrence matrix (GLCM), and defined features derived from that matrix [
2]. From that work, other matrices have been developed to quantify relationships between image pixels, including the gray-level run length matrix (GLRLM) [
3], the gray-level size zone matrix (GLSZM) [
4], and the gray-level dependence matrix (GLDM) [
5], all with distinctly defined features. Features can be calculated for entire images or for regions of interest (ROI) within an image.
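To make the GLCM construction concrete, the short Python sketch below (our illustration, not code from the original radiomics literature) builds a co-occurrence matrix for a single pixel offset on a toy quantized image and evaluates two Haralick-style features on it; the function names and the toy image are assumptions for demonstration only.

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=8):
    """Gray-level co-occurrence matrix for one pixel offset.

    `img` is a 2D array already quantized to integer levels in [0, levels).
    Entry (i, j) counts how often a pixel of level i has a neighbor of level j
    at offset (dy, dx); the counts are then normalized into a joint
    probability distribution p(i, j).
    """
    P = np.zeros((levels, levels), dtype=float)
    h, w = img.shape
    for y in range(h - dy):
        for x in range(w - dx):
            P[img[y, x], img[y + dy, x + dx]] += 1
    return P / P.sum()

def joint_energy(P):
    """Sum of squared co-occurrence probabilities; homogeneous patterns score high."""
    return float(np.sum(P ** 2))

def contrast(P):
    """Squared gray-level difference weighted by co-occurrence probability."""
    i, j = np.indices(P.shape)
    return float(np.sum(((i - j) ** 2) * P))

rng = np.random.default_rng(0)
toy = rng.integers(0, 8, size=(32, 32))   # toy quantized "image"
P = glcm(toy)
print(joint_energy(P), contrast(P))
```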
Modern personalized medicine is emerging in parallel with advancements in biomedical imaging, unlocking visual insights beyond what is accessible to the naked eye. Observation of internal anatomy with medical imaging allows physicians to better understand the nature of diseases, plan treatment, and monitor treatment efficacy. Current standard cancer care protocols involve the acquisition of a biopsy sample from the suspected diseased tissue for up-front genetic profiling and confirmation of cancer. However, three major hurdles limit this approach, namely:
(i) acquiring a biopsy sample is an invasive procedure for patients;
(ii) samples can ultimately fall short in adequately characterizing the entirety of a suspicious mass, which is crucial, considering that genetic, anatomic, and physiological heterogeneity drastically influences cancer growth and treatment outcomes [
6,
7]; and
(iii) clinically validated biological biomarkers have yet to be established for most cancer types. Globally, considerable resources are directed towards researching the roles that tumor composition and the tumor microenvironment play in treatment outcomes, disease progression, metastasis, recurrence, and more [
6,
8,
9].
In contrast to biopsies, radiomic analysis of suspicious masses presents several noteworthy potential advantages, namely:
(i) it presents a potential method of characterizing the entirety of a region in question,
(ii) it is non-invasive,
(iii) it allows clinicians to maximize the utility of images that are often already acquired for qualitative diagnostic and treatment-planning purposes (computed tomography (CT) and magnetic resonance imaging (MRI), in particular),
(iv) it can be applied to multiple masses within a single patient, and
(v) it can be used at different time points, providing longitudinal data such as treatment response over time [
10]. Theoretically, radiomics features could reveal promising imaging biomarkers associated with numerous biological and clinical conditions of interest. Identifying and combining meaningful features to distinguish between populations of interest (for example, by recurrence, metastasis, or treatment efficacy) is a task that can be carried out by common machine learning (ML) classifiers after labeling the "mined" textural features with known outcomes and training predictive models [
11].
Epithelial malignancies originating in the oral cavity, pharynx, larynx, paranasal sinuses, nasal cavity, and salivary glands are broadly categorized as head and neck (H&N) cancers [
12], approximately 90% of which are squamous cell carcinomas (SCC) [
13]. A 25-year analysis of cancers in Canada, published in 2022, revealed that the vast majority (~35,000, or 70%) of roughly 48,000 H&N cancer cases occurred in male subjects [
14]. Tobacco and alcohol consumption [
15,
16,
17], human papilloma virus (HPV) infection [
18], and p53 and p16 gene mutations [
19,
20] are established risk factors for H&N cancers, which collectively constitute the seventh most common type of cancer [
18]. The World Health Organization's International Agency for Research on Cancer estimated that, globally in 2020, there would be in excess of 900,000 newly diagnosed H&N cancer patients and just under 500,000 deaths from complications of H&N cancer [
14]. Though distant metastasis is rare at diagnosis (~10%), the majority of HPV-related cancer patients exhibit regional spread of cancerous cells to lymph nodes (LN) by that time [
12]. Five-year mortality rates are approximately 50% but vary depending on clinical factors such as the primary site of disease, tumor stage, and HPV status, as well as non-clinical factors like geographical and socioeconomic status influencing access to healthcare [
21,
22]. It is worth noting that these statistics and estimations do not take into account the impact of the COVID-19 pandemic, which should not be underestimated. One study of the pandemic's impact on H&N cancer diagnosis and treatment reported a decrease in the proportion of H&N patients with an early-stage diagnosis, with the authors attributing these trends to patients' unwillingness to visit a doctor during uncertain times or to a lack of access [
23].
Treatments for H&N cancer are multimodal and may involve a combination of surgery, radiotherapy (RT), and/or systemic therapy, depending on factors related to the tumor and the patient, including tumor site, stage, operability, and the patient's overall health. A typical up-front RT prescription consists of 70 Gy to the areas containing gross disease, 63 Gy to intermediate-risk areas (i.e., areas adjacent to the gross tumor volume [GTV]), and 54–56 Gy to low-risk areas (i.e., electively treated nodal regions), delivered in 33–35 fractions [
24]. It is worth noting that the complexity of H&N cancers, which may originate in a number of sub-sites, warrants careful, site-specific oncological consideration for both treatment planning and recovery [
25]. Despite the shift towards personalized medicine and advancements in precise RT delivery techniques, such as newer treatment-planning software, improved imaging methods, and innovations like intensity-modulated radiation therapy (IMRT) [
26] and volumetric-modulated arc therapy (VMAT) [
27], there remains a subset of patients who do not respond to treatment, experiencing tumor resistance or recurrence. Therefore, reliable and robust predictions of the probability of tumor response to radiotherapy are potentially valuable for tailoring treatment and improving outcomes. For instance, treatment de-escalation, including reductions in dose or treatment volume, could be considered for patients predicted to exhibit complete responses to radiation, while those predicted to have poor responses could be offered treatment intensification.
Based on a working hypothesis that index LNs (detailed in the Methods) of H&N cancer patients exhibit phenotypic signals associated with treatment efficacy, the aims of this study were (i) to determine the feasibility of predicting tumor response to up-front RT by building predictive ML models trained using pre-treatment MRI radiomics features from index LN segmentations, (ii) to identify top-
k features, and (iii) to test previously developed
deep texture analysis (DTA) methodology to improve feature sets and enhance predictive capabilities. In short, DTA is an iterative process to investigate the spatial distributions of top-
k features by creating feature map images and subsequently “mining” deeper-layer texture features from said feature maps. Previously, DTA methodology has shown promise in predicting H&N cancer treatment outcomes with index LN quantitative ultrasound spectroscopic (QUS) parametric maps [
28] and treatment-planning CT scans [
29], but, to our knowledge, it had not yet been explored on MRI images.
Figure 1 presents a graphical abstract for the DTA methodology.
2. Materials and Methods
This constitutes an analysis of a prospective cohort of patients with biopsy-proven de novo cT0-4N1-3M0 primary H&N cancer who underwent a standard course of up-front radiotherapy with 70 Gy in 33–35 fractions, with or without concurrent chemotherapy, between 2015 and 2020 (
ClinicalTrials.gov (accessed on 11 June 2024), identifier: NCT03908684). For the current study, we included a subgroup of patients from that cohort who had enlarged regional lymph nodes measuring ≥15 mm in short axis, as assessed by CT scan, and who underwent baseline magnetic resonance imaging at the time of diagnosis or radiation planning [
30], while excluding patients whose MRI contained motion or dental artifacts.
As mentioned earlier, radiomics models for this patient cohort were trained using ultrasound (US) and CT scans and published previously [
28,
29]. Recruiting patients for the acquisition of US scans—which are not part of the standard treatment pipeline—solely for scientific research purposes proved challenging and limited the number of participants to
. The main objective of this study was to create machine learning algorithms using baseline MRI texture features determined from the index regional lymph node for predicting complete response three months after radical intent radiation therapy. This is a single-institution study conducted at Sunnybrook Health Sciences Center (Toronto, ON, Canada) and approved by the Research Institute Ethics Board (SUN-3047).
2.1. Treatment Approach and Follow-Up Imaging
Patients were treated with radical intent definitive up-front radiotherapy. Contouring and treatment plans were made in line with institutional guidelines and in accordance with standard practice. As part of patient clinical care for radiation treatment planning, the primary and lymph node gross tumor volumes (GTV) were contoured by a clinical team on planning CT scans, co-registered with diagnostic or planning magnetic resonance imaging data that were acquired from a 1.5 T Philips Ingenia Elition X (Philips Medical Systems, Amsterdam, Netherlands) device [
30]. The typical contouring approach consisted of
(i) expanding the GTV by 3–5 mm to generate the high-dose clinical target volume (CTV) and further expanding it by 5 mm to generate the high-dose planning target volume (PTV);
(ii) expanding the GTV by 10 mm and including the elective nodal regions to generate the low-dose CTV. The contouring of an intermediate-risk CTV was at the physician's discretion. The PTV margins consisted of a 5 mm expansion of the respective CTVs. A typical prescription included 70 Gy to the high-risk PTV, 63 Gy to the intermediate-risk PTV, and 56 Gy to the low-risk PTV, delivered in 33–35 fractions. RT was administered using IMRT or VMAT techniques, and cone-beam CT scans were used to verify and adjust the patient's position before each radiation fraction [
30].
After completing treatment, patients were regularly monitored according to institutional clinical guidelines. The standard follow-up regimen typically included physical examinations and restaging imaging every 3 months during the first year post-treatment, every 4 months in the second year, every 6 months in the third year, and annually during years 4 and 5. The initial restaging imaging modality was MRI, and in the case of a complete response, CT scans were performed subsequently.
2.2. Tumor Response Definition and Segmentation
Tumor response was assessed for the primary tumor and regional lymph nodes on the 3-month follow-up contrast-enhanced magnetic resonance imaging after radiotherapy completion and was based on the Response Evaluation Criteria in Solid Tumors (RECIST), version 1.1 [
31]. In brief, complete response (CR) was defined as the disappearance of the primary tumor and a reduction of the involved lymph nodes to less than 10 mm in short axis, whereas partial response (PR) was defined as at least a 30% decrease in the sum of tumor diameters compared to the baseline MR imaging. The protocol also outlined the evaluation of patients exhibiting stable or progressive disease, but the cohort included no such instances, so these categories are not detailed here.
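As a simple illustration of how the response categories above translate into a decision rule, the following Python sketch classifies a toy case; the function and its inputs are hypothetical and greatly simplified relative to a full RECIST 1.1 assessment.

```python
def classify_response(primary_resolved, node_short_axes_mm, baseline_sum_mm, followup_sum_mm):
    """Toy reading of the response definitions used above (illustrative only).

    CR: primary tumor has disappeared and every involved node is < 10 mm in short axis.
    PR: otherwise, at least a 30% decrease in the sum of lesion diameters vs. baseline.
    """
    if primary_resolved and all(d < 10 for d in node_short_axes_mm):
        return "CR"
    if followup_sum_mm <= 0.7 * baseline_sum_mm:
        return "PR"
    return "SD/PD"  # stable/progressive disease (not observed in this cohort)

print(classify_response(True, [6, 8], baseline_sum_mm=42, followup_sum_mm=9))    # CR
print(classify_response(False, [12], baseline_sum_mm=40, followup_sum_mm=26))    # PR
```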
Radiation oncologists with over 5 years' experience (DMP, IP, IR, or AB) used pre-treatment T1-weighted post-contrast MRI for manual segmentation of the region of interest (ROI) in a radiation planning system (Pinnacle, Philips Medical Systems); the segmentation was then copied for analysis using the open-source software ITK-SNAP (version 3.6.0,
www.itksnap.org (accessed on 11 June 2024) [
32]). The ROI consisted of the largest affected regional lymph node, and when multiple lymph nodes were closely adjacent or formed a conglomerate, all were included in the ROI.
2.3. Texture Extraction and Machine Learning Algorithms
LN segmentation ROIs were used as masks on the associated MRI scans. To account for varying and arbitrary pixel intensities between scans, image intensities were normalized to the mean and standard deviation of the pixel values in the whole image [
33]. Subsequently, 24 GLCM, 16 GLRLM, 16 GLDM, and 14 GLSZM 2D 1st-layer texture features (1LTFs) were calculated, forming the S1 feature set, from axial slices using Pyradiomics version 3.0.1, an open-source Python (Python Software Foundation, Wilmington, DE, USA, version 3.7.10) package [
33]. Lastly, before proceeding to ML model building, radiomics features were concatenated with retrospective binary treatment outcomes.
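A minimal sketch of how such a 2D extraction might be configured with Pyradiomics is shown below; the settings, file names, and enabled feature classes are illustrative assumptions, and the study's exact configuration may have differed.

```python
import SimpleITK as sitk
from radiomics import featureextractor

# Illustrative settings: intensity normalization to the whole-image mean/SD,
# per-slice (2D) computation, and a default discretization bin width.
settings = {"normalize": True, "force2D": True, "binWidth": 25}
extractor = featureextractor.RadiomicsFeatureExtractor(**settings)
extractor.disableAllFeatures()
for feature_class in ("glcm", "glrlm", "gldm", "glszm"):  # the four matrix families used for S1
    extractor.enableFeatureClassByName(feature_class)

image = sitk.ReadImage("patient_T1_post_contrast.nii.gz")   # placeholder file names
mask = sitk.ReadImage("index_LN_segmentation.nii.gz")

result = extractor.execute(image, mask)
texture_features = {k: v for k, v in result.items() if not k.startswith("diagnostics")}
print(f"{len(texture_features)} texture features extracted for this ROI")
```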
Models were trained using S1 1LTFs and three common classifiers: Fisher's Linear Discriminant (FLD), Support Vector Machines (SVM), and k-Nearest Neighbors (k-NN). To account for varying magnitudes among the features, z-score scaling was carried out at the feature level. To split test and training data, a leave-one-out approach was used, whereby, iteratively, each sample was left out and models were trained, validated, and subsequently tested on the held-out sample. After the test sample was left out, but before training, the synthetic minority oversampling technique (SMOTE) was utilized to account for the imbalance of the data (23 CR/40 PR) [
34]. During each iteration, a 5-fold split was used on the remaining data to train and validate the models. To avoid overfitting and increase model robustness, a sequential forward selection (SFS) method in a wrapper framework was used to reduce dimensionality and identify valuable S1 features, using balanced accuracy as the selection criterion. Across all of the leave-one-out iterations, the most frequently selected single-feature and multi-feature combinations were identified and tested on the left-out samples for models with 1–10 features.
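The sketch below outlines this model-building loop in Python under several simplifying assumptions: scikit-learn's SequentialFeatureSelector stands in for the wrapper-based SFS, an SVM stands in for all three classifiers, a fixed feature count is used, and synthetic data replace the real feature matrix. It illustrates the structure (leave-one-out split, SMOTE on the training portion only, z-scoring, and balanced-accuracy-driven forward selection) rather than reproducing the study's pipeline.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import LeaveOneOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for 63 patients x 70 1LTFs with roughly a 23/40 class imbalance
X, y = make_classification(n_samples=63, n_features=70, weights=[0.37, 0.63], random_state=0)

predictions = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    # Oversample the minority class on the training data only (never the held-out sample)
    X_tr, y_tr = SMOTE(random_state=0).fit_resample(X[train_idx], y[train_idx])

    # Forward selection in a wrapper framework, scored on 5-fold balanced accuracy
    # (slow but illustrative of the wrapper approach)
    base = make_pipeline(StandardScaler(), SVC())
    sfs = SequentialFeatureSelector(base, n_features_to_select=4, direction="forward",
                                    scoring="balanced_accuracy", cv=5)
    sfs.fit(X_tr, y_tr)
    keep = sfs.get_support()

    model = make_pipeline(StandardScaler(), SVC()).fit(X_tr[:, keep], y_tr)
    predictions[test_idx] = model.predict(X[test_idx][:, keep])

print("Leave-one-out balanced accuracy:", balanced_accuracy_score(y, predictions))
```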
The attention mechanism is one of the most influential developments in deep learning for text- and image-based studies. It amplifies important regions of image or word inputs so that they receive more attention from the learning network, enhancing performance [
35]. Inspired by the attention mechanism, we proposed deep texture analysis based on the extraction of important features within the image. To draw a parallel, in this technique, radiomics features are extracted from images encoded by the important radiomics features selected in previous layers. Whereas the attention mechanism uses the scaled dot product to obtain the importance of each token, here, the output of the classifiers determines the important features.
After identifying the most valuable 1LTFs from the S1 feature set, texture features were calculated once again, this time using a sliding-window technique (window size of 3 × 3 pixels) over sub-ROI windows rather than the entire LN ROI. For each of the three classifiers, S1 feature maps were made for the model that achieved the best balanced-accuracy test score among the 10 models. If multiple models demonstrated the same highest balanced accuracy, preference was given to the model using fewer features.
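The sliding-window step can be pictured with the self-contained toy sketch below, which computes a window-level GLCM "Joint Energy" value at every pixel of a quantized ROI; the helper names and the toy ROI are our own, and the study produced its maps with Pyradiomics rather than this code.

```python
import numpy as np

def window_joint_energy(patch, levels=8):
    """GLCM Joint Energy of a small patch for a single horizontal pixel offset."""
    P = np.zeros((levels, levels))
    for y in range(patch.shape[0]):
        for x in range(patch.shape[1] - 1):
            P[patch[y, x], patch[y, x + 1]] += 1
    P /= max(P.sum(), 1.0)
    return float(np.sum(P ** 2))

def feature_map(img, window=3, levels=8):
    """Slide a window x window neighborhood over the ROI and store the window-level
    feature value at the central pixel, mirroring the 3 x 3 sub-ROI approach above."""
    h, w = img.shape
    r = window // 2
    out = np.full((h, w), np.nan)
    for y in range(r, h - r):
        for x in range(r, w - r):
            out[y, x] = window_joint_energy(img[y - r:y + r + 1, x - r:x + r + 1], levels)
    return out

rng = np.random.default_rng(1)
roi = rng.integers(0, 8, size=(24, 24))        # toy quantized LN ROI
joint_energy_map = feature_map(roi)            # a 1LTF map, the input for deeper-layer mining
```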
Next, using Pyradiomics, 2nd-layer texture features (2LTFs) were “mined” from the S1 feature maps (rather than the MRI scans directly). Previously identified top-k 1LTFs from each classifier’s top model were concatenated with newly determined 2LTFs to create new feature sets, S2,FLD, S2,SVM, and S2,kNN, respectively. Subsequently, identical model building and testing procedures were repeated to evaluate the influence of implementing deeper-layer features on predictive models. In addition to 2LTFs, the iterative DTA methodology was repeated with the aforementioned steps to evaluate the inclusion of 3rd-layer texture features (3LTFs) as well.
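To show how the layers stack, the self-contained toy below repeats the map-then-mine cycle twice and concatenates the layer-wise features for a single ROI; window variance and two summary statistics stand in for the Pyradiomics feature maps and the 70 texture features, so this is a schematic of the DTA bookkeeping, not the study's implementation.

```python
import numpy as np
from scipy.ndimage import generic_filter

def toy_features(img, prefix):
    """Stand-in for the 70 texture features computed per image or feature map."""
    return {f"{prefix}_mean": float(img.mean()), f"{prefix}_var": float(img.var())}

rng = np.random.default_rng(2)
roi = rng.integers(0, 8, size=(24, 24)).astype(float)   # toy quantized LN ROI

features = toy_features(roi, "L1")                 # 1st-layer features (S1-style)
map_1 = generic_filter(roi, np.var, size=3)        # 3 x 3 map of a "selected" 1LTF
features.update(toy_features(map_1, "L2"))         # 2nd-layer features mined from the map
map_2 = generic_filter(map_1, np.var, size=3)      # map of a "selected" 2LTF
features.update(toy_features(map_2, "L3"))         # 3rd-layer features

print(sorted(features))   # concatenated S3-style feature vector for one ROI
```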
To evaluate whether the inclusion of deeper-layer features was worthwhile for model building (independently of model complexity with regard to the number of features), the average performance of models trained using the S1 dataset was compared to that of models trained using the S2 and S3 datasets, using a one-tailed t-test with the significance level (α) set at 0.05, assessing the null hypothesis that there was no difference in average performance metrics.
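The statistical comparison can be sketched with SciPy as below; the layer-wise balanced accuracies are toy numbers, and the paired form of the test is an assumption, since the text does not state whether the comparison across the ten model sizes was paired or unpaired.

```python
import numpy as np
from scipy import stats

# Toy balanced accuracies for the ten models (1-10 features) at two layers
ba_s1 = np.array([0.70, 0.72, 0.74, 0.73, 0.75, 0.74, 0.76, 0.75, 0.74, 0.73])
ba_s2 = np.array([0.72, 0.74, 0.75, 0.76, 0.77, 0.76, 0.78, 0.77, 0.76, 0.75])

# H0: no difference in average performance; H1: deeper-layer models score higher (one-tailed)
t_stat, p_value = stats.ttest_rel(ba_s2, ba_s1, alternative="greater")
print(f"t = {t_stat:.2f}, one-tailed p = {p_value:.4f}, significant at alpha = 0.05: {p_value < 0.05}")
```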
3. Results
3.1. Patient Characteristics
A total of 63 patients were involved in this study. The median age of patients at the time of diagnosis was 61 years (range, 36–80), and 94% were male. Primary tumors were located in the oropharynx ( ), larynx ( ), hypopharynx ( ), and nasopharynx ( ). A total of 13% ( ) of tumors were of unknown or unspecified primary location. A total of 94% of tumors ( ) were squamous cell carcinomas, and a total of 86% of patients received platinum-based concurrent chemotherapy ( ). Tumor staging, p16 status, and alcohol and tobacco consumption habits are summarized in Table 1.
Supplementary Table S1 includes anonymized medical and treatment characteristics for individual patients. A total of 23 patients achieved CR and 40 PR at 3 months after radiotherapy.
Figure 2 provides representative MRI scans and index LN ROIs before and after treatment for a CR and PR patient.
3.2. Models Trained with S1 Feature Set
Seventy 1LTFs were determined from LN segmentations on axial slices of MRI scans to create the S1 feature set used to train models. As presented in Table 2, the FLD, k-NN, and SVM classifier predictive models demonstrated a capability to differentiate between CR- and PR-labeled patients, with varying effectiveness. For each classifier, the best-performing model (highest balanced accuracy with the lowest number of features) was identified and is highlighted in Table 2.
The best performance among the ten SVM classifier models was seen with the four-feature multivariable model, with a sensitivity (%Sn) of 85%, specificity (%Sp) of 70%, accuracy (%Acc) of 79%, precision (%Pre) of 83%, balanced accuracy (%BA) of 77%, and an area under the receiver operating characteristic (ROC) curve (%AUC) of 80%. Of the four selected features, three were GLCM features, including "Difference Entropy", "Joint Entropy", and "Joint Energy", and one was a GLRLM feature, "Short Run Emphasis". Feature maps were made for the aforementioned four features, using the sliding window technique, from which 2LTFs were determined in preparation for the S2,SVM feature set.
The best performance among the ten k-NN (k = 5) classifier models was seen with the nine-feature multivariable model, with %Sn = 80%, %Sp = 74%, %Acc = 78%, %Pre = 84%, %BA = 77%, and %AUC = 72%. The nine selected features were three GLCM features, i.e., “Maximum Probability”, “IDM”, and “Joint Energy”; two GLRLM features, i.e., “Run Percentage” and “Low Gray Level Run Emphasis”; three GLSZM features, i.e., “Small Area Emphasis”, “Small Area Low Gray Level Emphasis”, and “Size Zone Non-Uniformity Normalized”; and one GLDM feature, i.e., “Low Gray Level Emphasis”. Feature maps were made for the aforementioned nine features, using the sliding window technique, from which 2LTFs were determined in preparation for the S2,KNN feature set.
The best performance among the ten FLD classifier models was seen with the six-feature multivariable model, with %Sn = 83%, %Sp = 70%, %Acc = 78%, %Pre = 83%, %BA = 76%, and %AUC = 78%. The six selected features were three GLCM features, i.e., “Difference Average”, “MCC”, and “Inverse Variance”; two GLDM features, i.e., “Small Dependence Low Gray Level Emphasis” and “Large Dependence Low Gray Level Emphasis”; and one GLRLM feature, i.e., “Run Variance”. Feature maps were made for the aforementioned six features, using the sliding window technique, from which 2LTFs were determined in preparation for the S2,FLD feature set.
3.3. Models Trained with S2 Feature Sets
Feature maps were made from the selected features of each classifier's best-performing model. From each of the newly created feature maps, 70 radiomics features were again determined to create second-layer texture features (2LTFs). S2 feature set preparation involved retaining the top-k 1LTFs selected in the first iteration of model building and concatenating them with the newly mined 2LTFs. The S2,SVM feature set comprised 284 total features ((4 × 1LTFs) + (70 2LTFs × 4 sets of 1LTF maps)). The S2,KNN feature set comprised 639 total features ((9 × 1LTFs) + (70 2LTFs × 9 sets of 1LTF maps)). The S2,FLD feature set comprised 426 total features ((6 × 1LTFs) + (70 2LTFs × 6 sets of 1LTF maps)).
After S2 feature set preparation was completed, predictive models were trained, validated, and tested again, with model-building parameters unchanged. Ten models were built for each classifier with 1–10 features, and the results are presented in Table 3. The best performances were seen with the nine-, four-, and six-feature multivariable models for the SVM, k-NN, and FLD classifiers, respectively.
The best performance among the ten SVM classifier models trained with the S2,SVM feature set was seen with the nine-feature multivariable model, with %Sn = 85%, %Sp = 70%, %Acc = 79%, %Pre = 83%, %BA = 77%, and %AUC = 80%. Of the nine selected features, four were the originally selected 1LTFs, and five were newly introduced 2LTFs. Of the selected five 2LTFs, three were determined from the GLCM "Joint Energy" feature map. These features were GLCM "Sum Entropy" and "Joint Energy", as well as GLSZM "Gray Level Non-Uniformity Normalized". Of the remaining two 2LTFs, one was GLCM "Joint Energy" determined from the GLCM "Difference Entropy" feature map, and the other was GLRLM "Long Run Emphasis" determined from the GLRLM "Short Run Emphasis" feature map.
The best performance among the ten k-NN classifier models trained with the S2,kNN feature set was seen with the four-feature multivariable model, with %Sn = 90%, %Sp = 70%, %Acc = 83%, %Pre = 84%, %BA = 80%, and %AUC = 76%. Of the four selected features, two were the originally selected 1LTFs, and two were newly introduced 2LTFs. The selected 1LTFs were GLCM “IDM” and “Maximum Probability”, and the two 2LTFs were GLCM “Maximum Probability” determined from the GLRLM “Run Percentage” feature map and GLSZM “Size Zone Non-Uniformity Normalized” determined from the GLSZM “Small Area Low Gray Level Emphasis” feature map.
The best performance among the ten FLD classifier models trained with the S2,FLD feature set was seen with the six-feature multivariable model, with %Sn = 73%, %Sp = 87%, %Acc = 78%, %Pre = 91%, %BA = 80%, and %AUC = 82%. All six of the selected features were newly introduced 2LTFs. These features were GLCM “MCC” and GLRLM “Run Entropy” determined from the GLCM “MCC” feature map, GLCM “IDN” and GLSZM “Zone Entropy” from the GLCM “Inverse Variance” feature map, and GLCM “Joint Entropy” from both the GLRLM “Run Variance” feature map and the GLDM “Small Dependence Low Gray Level Emphasis” feature map, respectively.
3.4. Models Trained with S3 Feature Sets
Selected features from each classifier's top-performing model were retained for the S3 feature sets. Using Pyradiomics, feature maps were made from the selected 2LTFs and their associated 1LTF maps. Figure 3 shows an example of a CR and a PR patient with their associated MRI, LN ROI, an example 1LTF map, and an example 2LTF map. From each of the newly created 2LTF maps associated with the three classifiers' best-performing models, seventy third-layer texture features (3LTFs) were determined and concatenated with the previously retained first- and/or second-layer features selected by the models trained with the S2 feature sets. The S3,SVM feature set comprised 359 total features ((4 × 1LTFs) + (5 × 2LTFs) + (70 3LTFs × 5 sets of 2LTF maps)). The S3,KNN feature set comprised 144 total features ((2 × 1LTFs) + (2 × 2LTFs) + (70 3LTFs × 2 sets of 2LTF maps)). The S3,FLD feature set comprised 426 total features ((6 × 2LTFs) + (70 3LTFs × 6 sets of 2LTF maps)).
After S3 feature set preparation was completed, predictive models were trained, validated, and tested again, with all model-building parameters unchanged. For each classifier, ten models were built with 1–10 features, and the results can be seen in Table 4. The best performances were seen with the six-, nine-, and four-feature multivariable models for the SVM, k-NN, and FLD classifiers, respectively.
The best performance among the ten SVM classifier models trained with the S3,SVM feature set was seen with the six-feature multivariable model, with %Sn = 85%, %Sp = 70%, %Acc = 79%, %Pre = 83%, %BA = 77%, and . All six of the selected features were first- and second-layer texture features. None of the 350 available 3LTFs were selected, and the top performance of the S3,SVM feature set-trained SVM classifier model matched but did not exceed the performance of the best SVM model trained with the S2,SVM feature set. The selected features were two 1LTFs (GLCM “Joint Energy” and “Joint Entropy”) and four 2LTFs (GLCM “Joint Energy” and “Sum Entropy” from the GLCM “Joint Energy” feature map, GLCM “Joint Energy” from the GLCM “Difference Entropy” feature map, and GLRLM “Long Run Emphasis” from the GLRLM “Short Run Emphasis” feature map).
The best performance among the ten k-NN classifier models trained with the S3,kNN feature set was seen with the nine-feature multivariable model, with %Sn = 93%, %Sp = 74%, %Acc = 86%, %Pre = 86%, %BA = 83%, and %AUC = 75%. This was also the best-performing model in the study. Of the nine selected features, two were 1LTFs, two were 2LTFs, and the remaining five were 3LTFs. The two 1LTFs were GLCM "Maximum Probability" and "IDM". The two 2LTFs were GLCM "Maximum Probability" from the GLRLM "Run Percentage" feature map and GLSZM "Size Zone Non-Uniformity Normalized" from the GLSZM "Small Area Low Gray Level Emphasis" 1LTF map. Of the five 3LTFs, four were determined from the GLRLM "Run Percentage" GLCM "Maximum Probability" 2LTF map; these were GLCM "Joint Energy", "Inverse Variance", "IDMN", and "IDN". The final selected 3LTF was GLRLM "Low Gray Level Run Emphasis" determined from the GLSZM "Small Area Low Gray Level Emphasis" and GLSZM "Size Zone Non-Uniformity Normalized" 2LTF maps.
The best performance among the ten FLD classifier models trained with the S3,FLD feature set was seen with the four-feature multivariable model, with %Sn = 75%, %Sp = 87%, %Acc = 79%, %Pre = 91%, %BA = 81%, and %AUC = 81%. Of the four selected features, two were 2LTFs, and two were 3LTFs. The two 2LTFs were GLCM “MCC” from the GLCM “MCC” 1LTF map and GLDM “Small Dependence Low Gray Level Emphasis” from the GLCM “Joint Entropy” 1LTF map. The two 3LTFs were GLCM “MCC” from the GLCM “MCC” GLRLM “Run Entropy” 2LTF map and GLCM “Difference Average” from the GLRLM “Run Variance” GLCM “Joint Entropy” 2LTF map.
3.5. Evaluating Deeper-Layer Features
Next, an assessment was conducted to determine whether incorporating and processing deeper-layer texture features was beneficial. For the ten models created at each layer, the average performance of models trained using Sn feature sets was compared to that of models trained using Sn+1 and Sn+2 feature sets (when possible) using a one-tailed t-test with the significance level (α) set at 0.05, assessing the null hypothesis that there was no difference in average performance metrics.
Table 5, Table 6, and Table 7 summarize the findings for the inclusion of deeper-layer texture features for the SVM, k-NN, and FLD classifiers, respectively.
The one-tailed t-test to evaluate meaningful improvement suggests that the inclusion of deeper-layer texture features and the DTA method neither significantly improved nor degraded the performance of the SVM models. Figure 4 is a visual representation of the results from Table 5.
As seen in Table 6, comparing the k-NN classifier models trained using the S1 and S2,KNN feature sets showed that average sensitivity increased significantly ( ) from to , whereas average specificity decreased significantly ( ) from to . Accuracy, precision, balanced accuracy, and AUC did not change significantly. After the next iteration of DTA, comparing the average performance of the k-NN classifier models trained using the S2,KNN and S3,KNN feature sets, every metric demonstrated a statistically significant improvement. Average sensitivity increased from to , average specificity from to , average accuracy from to , average precision from to , average balanced accuracy from to , and average AUC from to .
Figure 5 visually presents the results from Table 6. Comparing the performance of the k-NN classifier models trained using S1 to that of models trained using S3,KNN shows that, with the exception of average specificity, all other metrics improved significantly with the inclusion of 2LTFs and 3LTFs. Average sensitivity increased from to , average accuracy from to , average precision from to , average balanced accuracy from to , and average AUC from to .
Finally, the results for the FLD classifier models trained using the S1 and S2,FLD feature sets are shown in Table 7, where it can be seen that average sensitivity decreased significantly from to , whereas average specificity, average precision, average balanced accuracy, and average AUC increased significantly from to , to , to , and to , respectively.
Evaluating the DTA methodology one layer deeper and comparing the average performance of FLD classifier models trained using the S2,FLD feature set to models trained using the S3,FLD feature set, the one-tailed t-test revealed that average sensitivity, average accuracy, and average balanced accuracy improved significantly, from to , to , and to , respectively. Specificity, precision, and AUC did not change significantly.
Figure 6 visually presents the results of Table 7. Comparing average FLD classifier models trained with two iterations of DTA (S3,FLD) to average models trained with the original radiomics features (S1) results in a significant enhancement of average specificity, average accuracy, average precision, and average , while showing a significant decrease in sensitivity.
4. Discussion
With growing appreciation of and a shift toward personalized medicine, clinicians and researchers have come to embrace the idea that a "one-size-fits-all" approach is often suboptimal and that identifying clinically relevant biomarkers can lead to valuable breakthroughs, both at the population level and, by extension, for individual patients' outcomes. These biomarkers can include clinical biomarkers (such as HPV status), genomic biomarkers, and/or imaging biomarkers, referred to as radiomics features [
36]. Though RT and systemic therapy (such as chemotherapy) are two of the most prominent and widely implemented approaches for treating cancer, they can expose patients to several undesirable side effects, such as cardiotoxicity, nephrotoxicity, myelosuppression, neurotoxicity, hepatotoxicity, gastrointestinal toxicity, and mucositis, to name a few, and they may exacerbate the already challenging burden of cancer on patients [
37]. An in-depth review of the physiological mechanisms associated with negative treatment side effects can be found in the work of Liu et al. [
37]. Similarly, Rocha et al. provided sub-site specific insights regarding treatment side effects of H&N cancers and qualitative imaging features that may aid clinicians in identifying said symptoms from medical images [
38]. Although advances in precision delivery and patient-specific customization have markedly improved survival and recurrence rates, the unfortunate reality of cancer treatment is that it does not necessarily yield the desired outcomes, as shown in
Figure 2. Accurate and reliable a priori insight into an individual's response to upcoming treatment could provide clinicians with a valuable tool for both treatment planning and decision making. For example, a patient predicted to respond well to treatment could be reassured regarding the likelihood of success, helping to overcome potential hesitations or fears of ineffective treatment. Conversely, if a patient is predicted to exhibit an insufficient response, physicians may adjust treatment doses or fractionation, or perhaps advise against undergoing treatment, sparing the patient the aforementioned undesirable side effects.
This study explored phenotypic insights from regional involved LNs of H&N cancer patients () by "mining" 2D radiomics features from axial slices of pre-treatment T1-weighted post-contrast MRI scans (the S1 feature set) and using ex post binary treatment outcomes to train and test predictive models with three common ML classifiers (SVM, k-NN, FLD). Crucially, the DTA methodology was explored to evaluate whether the inclusion of deeper-layer features enhanced model performance; the results suggest that the proposed method may be a worthwhile consideration for future radiomics studies.
For each of the three classifiers, ten models were created using an SFS method for 1–10 features. Throughout the study, the best model was considered to be the one with the highest balanced accuracy and the lowest number of features. For the SVM, k-NN, and FLD classifier models trained using the S1 feature set, the outstanding models were the four-, nine-, and six-feature multivariable models, with , and , respectively. Interestingly, "Joint Energy", a GLCM feature, was selected for both the SVM and k-NN classifier models. "Joint Energy" is a measure of homogeneous patterns, with a greater value implying more instances of pixel-intensity pairs in the image that neighbor each other at high frequencies [33]. Taking into account the role of tumor heterogeneity with regard to outcomes, it is interesting that, in this study, the average GLCM "Joint Energy" value for the CR and PR groups was and , respectively (arbitrary units), suggesting that treatment was more effective for patients exhibiting more homogeneous patterns.
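For reference, and as a reminder of why larger values indicate more homogeneous texture, Joint Energy is defined on the normalized co-occurrence matrix p(i, j) (following the Haralick/Pyradiomics convention) as:

```latex
\text{Joint Energy} \;=\; \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} \bigl(p(i,j)\bigr)^{2}
```

where N_g is the number of discretized gray levels; an image whose co-occurrence probability mass is concentrated in a few gray-level pairs (a more homogeneous pattern) therefore yields a larger value.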
In recent years, with advancements in computing power and affordability, research into ML modeling has increased in popularity. Coupled with improvements in imaging techniques, this has resulted in growing interest in radiomics investigations aimed at predicting a variety of biological endpoints. For example, Niu et al. found that, by using contrast-enhanced T1- and T2-weighted MRI images, preoperative prediction of cavernous sinus invasion by pituitary adenomas was possible, with and for exclusively radiomics features and radiomics + clinical features, respectively [39]. Tang et al. found it possible to predict recurrence within two years of locally advanced esophageal SCC with radiomics features, with a sensitivity of and a sample size of [40]. Yet another study reported promising results using H&N patients' pre-operative CT radiomics features to predict metastasis, with and , and extranodal extension, with and , for the model compared with experienced radiologists, respectively [41]. Previously, ML models trained using radiomics features mined from pre-treatment LN QUS parametric maps for the same patient cohort studied in this investigation were found to predict binary treatment outcomes with and using a seven-feature multivariable model trained with the SVM classifier [28]. Whereas CT and MRI radiomics studies mine features from the images directly, it is worth noting the distinction that QUS radiomics studies utilize raw RF data rather than processed B-mode images and pixel intensities [42]. Using the sliding-window technique, RF data can be converted into a QUS spectrum from which various parameters can be extracted [43]. In the QUS study, parametric maps were created for various quantitative US spectral parameters before radiomics features were determined for model building. QUS spectral parameters have proven useful for characterizing cellular conditions such as apoptosis [44,45]. Similarly, for the same patient cohort, radiomics features were mined from treatment-planning CT LN segmentations and used to train ML classifiers to create predictive models, resulting in for a six-feature SVM classifier model trained using only 1LTFs [29].
This study is unique in that the potential of MRI deeper-layer features was explored in order to study the heterogeneity of features at a resolution of 0.5 mm. After building the models trained using 1LTFs, the DTA method was initiated, and 1LTF maps were made of the features selected in each classifier's best model. Subsequently, 2LTF radiomics features were mined from the 1LTF maps. S2 feature sets comprised the newly determined 2LTFs concatenated with the retained 1LTFs selected in the first step. The best k-NN model trained using the S2,KNN feature set (the four-feature multivariable model) outperformed all of the models trained using the S1 feature set, with . Similarly, the best FLD model trained using the S2,FLD feature set (the six-feature multivariable model) outperformed all of the models trained using the S1 feature set, with . The best S2,SVM-trained SVM model matched but did not improve on the best S1-trained SVM model, with .
Another iteration of the DTA method was carried out to mine 3LTFs, and again, the best-performing k-NN and FLD models trained using the S3,KNN and S3,FLD feature sets bested the top performances of models trained using the S2,KNN and S2,FLD feature sets, with and , respectively. The nine-feature multivariable k-NN model trained using the S3,KNN feature set was the best model in the study. The nine selected features were a combination of two 1LTFs, two 2LTFs, and five 3LTFs.
Although the best performance of the SVM models did not improve with the inclusion of 2LTFs and 3LTFs, it is worth emphasizing that it did not degrade either. This is theoretically consistent, mainly because at each iteration of DTA, the best features from the previous iteration are retained, while all other features are discarded. The principle of DTA is such that, after the introduction of deeper-layer features, one of two outcomes is expected: (i) the models improve due to valuable information gained from the new features, or (ii) the new features provide no additive benefit, in which case performance should match the previous iteration due to retention of the previously selected features, as was the case in this study.
Next, for each classifier, to inspect the effect of including deeper-layer features on all the models, average performance metrics were evaluated for the 10 models created in each iteration. As seen in Table 5, average SVM classifier models did not benefit from the inclusion of 2LTFs or 3LTFs, as none of the average performance metrics changed in a statistically significant manner ( ). However, once again, it is worth emphasizing that average performances did not degrade in a statistically significant manner either, which is to be expected due to retention of the best features from the previous iteration. On the other hand, k-NN and FLD classifier models exhibited statistically significant improvements on several metrics, as shown in Table 6 and Table 7 and visually represented in Figure 5 and Figure 6. As previously mentioned, the DTA feature-enhancement methodology was evaluated for this patient cohort with QUS [28] and CT [29] radiomics as well. In the QUS study, models trained with the inclusion of 2LTFs improved the predictive capacity of the seven-feature multivariable model up to and , a marked improvement over the seven-feature multivariable model trained solely on 1LTFs, with and [28]. Similarly, the DTA methodology enhanced the performance of predictive models trained using treatment-planning CT LN segmentations, with the best model performance resulting from the inclusion of 2LTFs and 3LTFs, with a seven-feature SVM classifier model demonstrating and [29]. It is worth mentioning that in both of these studies, the DTA methodology was carried out for the five-feature model in each iteration, a decision made mainly to keep computation time reasonable and not driven by any loss function [28,29]. In this study, multivariable models were created with 1–10 features, and the model with the top BA was selected to proceed with DTA; in the case of multiple models exhibiting the same top BA, the model with the fewest features was selected. To our knowledge, this is the first time that the DTA methodology has been implemented on MRI images.
Computing radiomics features in 2D slices rather than 3D voxels and the use of a small sample size are two factors that reduce the generalizability of the models. Another shortfall involves a necessary step in mining radiomics features, namely the initial discretization of image pixel intensities through binning, which was not optimized in this study and is worth consideration [33]. Similarly, in this study, the window size for calculating the feature maps was 3 × 3 pixels, the smallest window size possible, but future studies could investigate optimizing the window size as well [33]. Finally, a clinically feasible model would likely maximize all available information and include clinical and genomic features in addition to radiomics features. We acknowledge that HPV-related carcinomas may exhibit characteristic inhomogeneities on MRI that are absent in non-HPV cancers, and that in this work, the unknown HPV status may silently be reflected in the radiomic features. The hope is that radiomics features may capture and characterize heterogeneous patterns from involved LNs that impact response to treatment, even in the absence of knowledge about HPV status. In this study, clinical features were not consistently available in institutional database patient notes and, as such, were not included.
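As a sketch of how the discretization step flagged above could be examined in future work, the snippet below (building on the illustrative extractor settings from Section 2.3; the bin-width values are arbitrary) would sweep the Pyradiomics bin width so that the downstream model-building pipeline could be re-run and compared per setting.

```python
from radiomics import featureextractor

# Candidate bin widths to compare (arbitrary values for illustration)
for bin_width in (10, 25, 50):
    extractor = featureextractor.RadiomicsFeatureExtractor(
        normalize=True, force2D=True, binWidth=bin_width)
    # ...extract the 1LTFs with this extractor, repeat the model building,
    # and compare cross-validated balanced accuracy across bin widths.
```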
33]. Finally, a clinically feasible model would likely maximize all available information and include clinical and genomics features in addition to radiomics features. We acknowledge that HPV-related carcinomas may have characteristic inhomogeneities as assessed by MRI, which are absent in non-HPC cancers, and that in this work, the unknown HPV status may silently be reflected in radiomic features. The hope is that radiomics features may capture and characterize heterogeneous patterns from involved LNs that impact response to treatment, even absent of knowledge about HPV status. In this study, clinical features were not consistently available in institutional database patient notes and, as such, were not included.