1. Introduction
One of the main public health concerns impacting populations globally is lung cancer, which is responsible for the highest number of deaths related to cancer and accounts for 12.4% of patients receiving a new cancer diagnosis [
1]. Approximately 85% of patients diagnosed with lung cancer have non-small-cell lung carcinoma (NSCLC); adenocarcinoma constitutes the histological subtype of this heterogeneous entity that occurs most frequently [
2]. However, NSCLC is not a single disease entity; it encompasses a spectrum of distinct molecular subtypes, each characterized by specific oncogenic genetic alterations. The identification of these molecular drivers of malignancy has led to the development of therapies that target the affected pathways, ushering in an era of targeted therapy in cancer management. Gene mutations, rearrangements, and amplifications can initiate NSCLC carcinogenesis. The detection of specific driver mutations has significantly improved NSCLC treatment, particularly for patients whose cancer has metastasized. Targeted therapies for NSCLC are associated with an increased response rate and extended progression-free survival (PFS) and overall survival (OS), making them the preferred treatments for cases involving specific driver mutations, including mutations that affect the epidermal growth factor receptor (
EGFR),
ALK,
BRAF, and
ROS1 genes [
3].
EGFR is one of the leading molecular targets in NSCLC patients. In patients with advanced NSCLC harboring
EGFR mutations (
EGFRm), oral
EGFR-tyrosine kinase inhibitor treatment has been associated with longer PFS and OS, as well as higher objective radiographic response rates, than conventional first-line chemotherapy. In such cases, treatment with osimertinib, a third-generation oral
EGFR-tyrosine kinase inhibitor, has been linked with PFS > 18 months [
4] and OS > 3 years [
5]. Current guidelines, such as those published by the National Comprehensive Cancer Network (NCCN), recommend molecular testing for
EGFR mutations in individuals with metastatic pulmonary adenocarcinoma, large-cell carcinoma, and NSCLC not otherwise specified (NOS) [
3]. Furthermore, for patients with metastatic pulmonary squamous cell carcinoma, such testing is also a potential option. In addition to the recommendation for metastatic patients,
EGFR mutation testing is currently recommended for patients with completely resected stage IB-IIIA NSCLC [
3], because the addition of osimertinib after adjuvant chemotherapy for such patients leads to the significant prolongation of both PFS [
6] and OS [
7]. Hence, when developing treatment plans for NSCLC patients, it is critical that
EGFR mutations are identified.
When
EGFR mutation status is being assessed, next-generation sequencing (NGS), real-time polymerase chain reaction, or Sanger sequencing is generally performed [
3].
NGS is a revolutionary technique for the biomolecular characterization of NSCLC. In efforts to identify
EGFR mutations, tissue and/or cytologic specimens are the gold standard materials for molecular analyses such as NGS. Nevertheless, the acquisition of sufficient tumor samples remains challenging, particularly among patients with advanced-stage NSCLC. Although NGS is effective for the identification of
EGFR mutations, it is hindered by the limited availability of tumor tissues and technical factors [
8]. The most commonly encountered problems include inadequate tumor sample quantity or quality [
9,
10,
11]. Invasive tissue biopsy is associated with several challenges, including inadequate tumor retrieval, an inability to access suitable tissues, and risks of bleeding and pneumothorax [
12,
13]. Therefore, tissue biopsy is not performed in approximately 7–15% of patients with NSCLC [
9,
14]. For around 20% of patients, the feasibility of biopsy may be questionable [
10,
13,
15,
16]. Furthermore, even when biopsy is feasible, the samples obtained may be insufficient for histological and molecular analyses, resulting in biopsy failure rates of 8–43% [
9,
10,
11]. Additionally, substantial time and costs are associated with biopsy and NGS.
Radiomics is a method in which features are systematically extracted from digital medical images and then analyzed for the purpose of constructing comprehensive databases that aid in diagnosis and treatment. This approach utilizes advanced algorithms, often in combination with computer assistance, to quantitatively analyze radiological images and extract numerous features for the detection of tumor phenotypes. Radiogenomics incorporates specific features obtained from radiological images, along with genomic information. The quantitative data generated through radiomics and radiogenomics analyses can be used to predict patient risk, individualize treatment strategies, and facilitate non-invasive biopsy. Computed tomography (CT) is a technique that is commonly utilized for the diagnosis and staging of cancer patients. Numerous studies have explored whether there is an association between CT imaging features and EGFR mutation status in NSCLC; however, these investigations have produced inconsistent results.
The analysis of CT-based radiomic features for the identification of
EGFR mutation status has the potential to act as a crucial clinical determinant in the development of effective treatment plans for lung adenocarcinoma and NSCLC-NOS patients [
17,
18,
19,
20]. Therefore, we evaluated the performance of pretreatment CT radiomics features in terms of their ability to predict
EGFR mutation status in lung adenocarcinoma and NSCLC-NOS patients. While researchers have previously concentrated on biopsy samples in their studies, the inclusion of a large number of surgical cases in the present study ensured the collection of adequate tumor tissue to minimize false negative results and enhance the robustness of our findings. Hence, the current study’s aim was to explore the potential for pretreatment chest computed tomography (CT) radiomics features to act as a predictor of epidermal growth factor receptor (
EGFR) mutation status in NSCLC.
2. Patients and Methods
The original dataset of 1243 patients was reduced to 430 due to several reasons, including the unavailability of sufficient-quality CT scans, unclear tumor boundaries, and artifacts during image processing. Strict quality control ensured that only patients with the clearest images were included, which limited the dataset size.
The study protocol was approved by the Institutional Review Board of the Ankara University Faculty of Medicine (AUFM) (no: I2-97-21), which waived the need for informed consent. Between 2012 and 2020, tumor tissues from 1243 patients with histologically confirmed NSCLC were tested to determine their
EGFR mutation status. We included patients with histopathologically confirmed adenocarcinoma or NSCLC-NOS; thoracic spiral contrast-enhanced CT performed within 4 weeks before surgery or biopsy and available from the electronic archive of the AUFM; a tumor suitable for segmentation; no prior malignancy; and no radiotherapy or chemotherapy before the primary thoracic CT (
Figure A1 and
Figure A2). We excluded patients for whom CT scans of the primary tumor were unavailable, for whom the diagnosis was unclear due to artifacts, who had tumor atelectasis in which tumor boundaries could not be clearly delineated, or whose
EGFR mutation status was unknown. This resulted in the identification of 430 patients in total, who were assigned to either a wild-type (EGFR-WT, n = 372) or an
EGFR mutant (
EGFRm, n = 58) group (
Table 1). Patients identified as having mutations in exons 18, 19, 20, or 21 of the
EGFR gene were categorized into the
EGFRm group.
2.1. Molecular Testing
Between 2012 and 2018, Sanger sequencing or NGS was utilized for detecting EGFR mutations. Sanger sequencing focused on identifying mutations in exons 18–21 of the EGFR gene. The extraction of genomic DNA was conducted from formalin-fixed, paraffin-embedded (FFPE) tissue acquired from biopsy or resection samples utilizing the QIAamp DNA FFPE tissue kit (Qiagen, Hilden, Germany), in compliance with the guidelines provided by the manufacturer. An automated single-capillary genetic analyzer was then used to conduct further sequencing (ABI 310; Applied Biosystems, Foster City, CA, USA) with forward and reverse primers. The identification of nucleotide alterations was carried out according to a comparison of the sequences acquired with the database of the National Center for Biotechnology Information (reference sequence: NM_005228.2). Beginning in 2018, molecular analyses of the samples were conducted to detect insertions/deletions and point mutations in 19 genes, including EGFR, using the Qiagen GeneReader NGS System (Qiagen GeneReader Platform, Hilden, Germany) with the GeneRead QIAact Lung DNA UMI Panel. The analysis process involved using QCI-Analyze to import reads, trim primers, verify quality, map reads to the human reference genome, call variants, and filter according to coverage. QCI-Interpretation was then employed to import and filter variants, followed by annotation of the identified variants on the basis of the clinical outcomes.
Presumed tumors were evaluated in all patients using contrast-enhanced CT. Chest CT examinations were performed on a 16-row detector CT (Siemens Somatom Sensation 16, Forchheim, Germany), a 64-row detector CT (Toshiba Aquilion 64), or a 320-row detector CT (Toshiba Aquilion ONE, Otawara-shi, Japan). The acquisition parameters included a detector collimation of 0.5, 0.5, or 0.625 mm; tube voltage of 120 kVp; gantry rotation time of 0.5 s; reconstructed section thickness of 1, 1, or 1.5 mm; and reconstruction intervals of 0.8, 0.8, or 1 mm. Prior to the examinations, patients received an intravenous injection of 60–100 mL (1–1.5 mL/kg) of nonionic contrast agent (350/100 Omnipaque, GE Healthcare, Oslo, Norway). Multiplanar reformatted images were analyzed on a workstation (GE Healthcare, Waukesha, WI, USA).
2.2. Dataset Management
Imaging and clinical data were managed using a radiomics platform (Huiying Medical Technology Co., Ltd., Huiying, China), and radiomics statistical analyses were then performed. The platform enables radiomics features to be extracted from 2-D and 3-D images, as well as binary masks, for imaging modalities such as magnetic resonance imaging (MRI) and CT. Two senior radiologists with 10 years (reader 1) and 5 years (reader 2) of field experience independently reviewed each of the images. A radiologist with no access to the patients' clinical data then manually delineated the lesions (volumes of interest [VOIs]). The senior radiologist then assessed all contours. Where a discrepancy of ≥5% was identified, the senior radiologist made the final decision on the tumor boundaries. A total of 469 VOIs were segmented from the 430 patients and used for further analysis (
Figure 1). Clinical parameters including blood groups, Rh factor, smoking status, hemoglobin (Hb) level, white blood cell (WBC) count, hematocrit (Hct) level, platelet count (PLT), sedimentation rate, alkaline phosphatase (ALP), lactate dehydrogenase (LDH), calcium level, tumor diameter on CT, and
EGFR mutation status (as the gold standard) were collected for all patients (
Table 1).
Patients were randomly assigned to either the training or testing dataset at an 8:2 ratio. Consequently, the testing set contained 50 patients and the training set 380 patients, and this random allocation ensured that both clinical and radiological features were distributed in a balanced manner.
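As a minimal sketch of such a random 8:2 allocation, the following Python snippet shuffles a list of patient identifiers and splits it. The function name, the fixed seed, and the 100 placeholder identifiers are assumptions made for illustration; the source does not describe how the platform implemented the allocation.

```python
import random

def split_patients(patient_ids, test_fraction=0.2, seed=42):
    """Randomly assign patient IDs to a training list and a testing list."""
    ids = list(patient_ids)
    rng = random.Random(seed)      # fixed seed (assumed) so the split is reproducible
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]

# 100 placeholder IDs, not the study cohort.
train_ids, test_ids = split_patients(range(100))
print(len(train_ids), len(test_ids))
```

Splitting by patient rather than by VOI keeps all lesions from one patient in the same set, which avoids leakage between training and testing data.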
2.3. Outcomes
The primary outcome was the prediction of EGFR mutation status by a machine learning model derived from pretreatment CT images.
2.4. Feature Extraction
The Radcloud platform (version V3.1.0, compatible with Windows) was used to extract a total of 1409 quantitative imaging features from the CT images. The extracted features were then grouped into three categories: Group 1 (first-order statistics) comprised 126 descriptors that quantitatively characterize the voxel intensity distribution in the CT image, employing simple metrics frequently used in statistical analyses. Group 2 (features based on size and shape) included 14 3-D features reflecting the region of interest’s size and shape. Group 3 (texture features) consisted of 525 textural features obtained from gray-level run-length and gray-level co-occurrence texture matrices, which enable differences in region heterogeneity to be quantified (
Table 2).
For feature selection, several methods were applied to reduce dimensionality and select task-specific features for performance optimization. First, a variance threshold of 0.8 was applied, whereby features with a variance < 0.8 were removed to eliminate redundant information. Second, the SelectKBest method was used; this univariate feature selection technique analyzes the relationship between each feature and the classification outcome using p-values. Features with p-values < 0.05 were retained for further analysis. Third, the Least Absolute Shrinkage and Selection Operator (LASSO) technique was applied. In the LASSO model, an L1 regularizer was used in the cost function, the error value for cross-validation was set to 5, and the maximum number of iterations was 1000.
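For reference, one common way to write a LASSO objective with an L1 penalty is given below. This is the textbook least-squares form (as used, for example, by scikit-learn's Lasso estimator) and is an assumption: the source does not state the exact formulation used by the platform. Here n is the number of samples, p the number of candidate features, y_j the outcome label of sample j, x_{ji} the value of feature i in sample j, and α the regularization strength.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2n}\sum_{j=1}^{n}\Bigl(y_j - \beta_0 - \sum_{i=1}^{p}\beta_i x_{ji}\Bigr)^{2}
\;+\;\alpha\sum_{i=1}^{p}\lvert\beta_i\rvert
```

The L1 term drives many coefficients exactly to zero, which is why LASSO acts as a feature selector: only features with nonzero coefficients are retained.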
The variance threshold technique retained 469 of the 1409 features. The SelectKBest technique then selected 26 of these features. Lastly, the LASSO algorithm was applied, resulting in 9 optimal features being identified (
Figure 2 and
Figure 3).
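The first of the three selection steps can be illustrated with a short Python sketch that keeps only the feature columns whose variance reaches the threshold. The toy matrix and the function name are hypothetical, and the sketch uses the population variance from the standard library; the SelectKBest and LASSO steps are not shown, and the platform's own implementation may differ.

```python
from statistics import pvariance

def variance_threshold(rows, threshold=0.8):
    """Return the indices of feature columns whose variance is >= threshold."""
    columns = list(zip(*rows))     # transpose: one tuple per feature
    return [i for i, col in enumerate(columns) if pvariance(col) >= threshold]

# Hypothetical toy matrix: 4 lesions x 3 features (values are made up).
toy = [
    [1.0, 10.0, 0.50],
    [1.1, 12.0, 0.50],
    [0.9, 14.0, 0.51],
    [1.0, 16.0, 0.49],
]
kept = variance_threshold(toy)
print(kept)
```

Only the second column varies enough to pass the filter in this toy example; near-constant features carry little information for separating the two groups.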
We calculated the Radscore for each patient. The Radscore was the comprehensive representation of the optimal features and was subsequently used to construct the model. It was calculated using the following formula:
$$\mathrm{Radscore} = \mathrm{Intercept} + \sum_{i=1}^{k} \beta_i X_i$$
where Intercept represents the LASSO regression intercept, k denotes the overall number of features selected by the LASSO algorithm, $\beta_i$ is the LASSO coefficient of the i-th feature, and $X_i$ is the i-th radiomics feature.
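The definition above, an intercept plus a coefficient-weighted sum of the selected features, can be sketched in a few lines of Python. The intercept, coefficients, and feature values below are hypothetical and are not the values fitted in this study.

```python
def radscore(intercept, coefficients, features):
    """Intercept plus the coefficient-weighted sum of the selected features."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return intercept + sum(b * x for b, x in zip(coefficients, features))

# Hypothetical values for k = 3 features (not the study's coefficients).
score = radscore(0.5, [0.2, -0.1, 0.4], [1.0, 2.0, 0.5])
print(round(score, 3))
```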
2.5. Machine Learning Analysis
The training dataset contained 80% of the patients, who were assigned on the basis of computer-generated random numbers, and the testing dataset contained the remaining 20%. Based on the Radscore, several supervised learning classifiers were used for classification analysis, creating models capable of separating or predicting data relating to a phenotype or outcome (e.g., patient outcome or response). Six classifiers were used to construct radiomics-based models: Decision Tree, Logistic Regression, Random Forest, Support Vector Machine (SVM), k-Nearest Neighbor, and eXtreme Gradient Boosting. Additionally, the test method was used to improve the effectiveness of the model.
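To illustrate how a classifier can operate on a single score, the sketch below implements a k-nearest-neighbor rule, one of the six classifier families named above, in plain Python. The scores, labels, and the choice of k = 3 are hypothetical; the study's models were built on the platform and their hyperparameters are not given in the source.

```python
from collections import Counter

def knn_predict(train_scores, train_labels, query, k=3):
    """Return the majority label among the k training scores closest to `query`."""
    ranked = sorted(zip(train_scores, train_labels),
                    key=lambda pair: abs(pair[0] - query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical Radscores; 1 = EGFR mutant, 0 = wild type.
scores = [0.10, 0.20, 0.25, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(scores, labels, 0.75), knn_predict(scores, labels, 0.15))
```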
2.6. Statistical Analysis
The predictive performances of the classifiers were assessed using ROC curves; separate calculations were performed for the areas under the curve (AUCs) for the respective validation and training datasets. Additionally, classifier performance was comprehensively evaluated using four indicators, namely precision (P = true positives/[true positives + false positives]), recall (R = true positives/[true positives + false negatives]), F1-score (P × R × 2/[P + R]), and support (overall number of instances in the test set).
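The indicators defined above can be computed directly from labels and predictions, as in the following Python sketch. The AUC is computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, which is equivalent to the area under the ROC curve. The toy labels and scores are hypothetical, and support is counted here per class (true instances of the positive class), following the scikit-learn convention; this is an assumption, and summed over both classes it equals the size of the test set.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, F1-score, and support for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    support = sum(t == positive for t in y_true)
    return precision, recall, f1, support

def roc_auc(y_true, scores, positive=1):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test set: 3 mutant (1) and 5 wild-type (0) lesions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
y_pred = [int(s >= 0.5) for s in y_score]   # 0.5 cut-off is an assumption
print(precision_recall_f1(y_true, y_pred), roc_auc(y_true, y_score))
```

Note that the AUC measures how well the scores rank cases across all thresholds; it is a different quantity from classification accuracy at a single threshold.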
4. Discussion
As the molecular mechanisms underlying lung cancer are increasingly understood, the approach to NSCLC treatment has evolved, with a greater emphasis on identifying oncogenic driver mutations. The detection of these driver mutations serves as a crucial guide for physicians, helping them to tailor distinct therapeutic strategies for treating patients. Radiomics is a crucial technique used to achieve accurate medical imaging-based diagnosis and tailor management appropriately [
21]. The core idea of radiomics is that medical images contain quantitative data that human practitioners are unable to discern, and that such data provide valuable information on the underlying pathophysiology of the tissue. Early radiomics research has focused on oncology because success in this field is predominantly influenced by tumor heterogeneity [
22,
23]. Invasive biopsies, which sample only a small part of the tumor, do not consistently reflect the features of the tumor in its entirety. Conversely, non-invasive evaluation of the entire tumor volume using objective radiomics data is a valuable technique for planning treatment tailored to the individual patient and for predicting prognosis [
24]. This assessment is crucial for advancing personalized medicine, where treatments are customized according to the specific characteristics of each patient and their tumors. In this regard, radiomics shows potential for personalized medicine.
In the radiomics imaging analysis approach, a significant number of features are extracted from medical imaging, which are then correlated with genetic or clinical data. In the general workflow, the steps involved include acquiring images, detecting and segmenting the region of interest, extracting features, and then utilizing the features that have been extracted to form a database. While there are constraints and difficulties associated with each of the aforementioned steps, the general challenge throughout the process involves the heterogeneity of the input data. Although it remains in its infancy, the possibilities for applying radiomics should not be ignored, and it is necessary to conduct multicenter prospective studies with larger patient samples to determine how radiomics data can be applied in standard clinical practice.
In the current study, we used CT images and clinicopathologic features from 430 patients with pulmonary adenocarcinoma or NSCLC-NOS to develop a radiomics-based model for predicting tumor EGFR mutation status. The EGFR mutation rate was 13.49% (58/430) in the selected patients. Female patients had an increased prevalence of EGFR mutations (63.8% vs. 36.2%), as did nonsmokers (60.5% vs. 39.5%). Two groups were formed from the dataset, with a training group comprising XX VOIs of 344 patients and a test group comprising the remaining XX VOIs of 86 patients. The Radcloud platform was used to extract 1409 quantitative imaging features from CT images, which were then categorized into three groups. Among the six radiomics-based classifiers (Decision Tree, Random Forest, Logistic Regression, eXtreme Gradient Boosting, SVM, and k-Nearest Neighbor), SVM achieved the highest AUCs in both the testing and training groups, with values of 0.869 and 0.977, respectively. These findings indicate that radiomics-based models can predict tumor EGFR mutation status with an AUC of 0.869 in the testing group.
Previous studies have shown possible advantages of using CT images for the prediction of
EGFR mutation status in NSCLC [18, 19, 20, 25, 26]. For example, the meta-analysis conducted by Cheng et al. [
25] revealed an increased prevalence of
EGFR mutations in NSCLC patients with part-solid ground glass opacities on CT images. Rossi et al. [
26] created a radiomics model based on CT scans from 21
EGFRm and 88
EGFR-WT NSCLC patients, achieving an accuracy of 88.1% with an AUC of 0.85 in terms of
EGFR mutation status prediction, comparable to our study. Similarly, Wu et al. [
18] utilized CT scans from 34
EGFRm and 33
EGFR-WT NSCLC patients, extracting 849 features. The AUC achieved by their clinical and radiomics model was 0.9724, with 85.3% sensitivity and 90.9% specificity for the prediction of
EGFR mutation status [
18], similar to our findings. Omura et al. [
19] established a model using CT scans of 99 patients with early-stage NSCLC to whom surgical resection was applied. Their clinical and radiomics combined model recorded a mean AUC of 0.83 for the prediction of
EGFR mutation status, further supporting the utility of radiomics in this context. According to the findings of the meta-analysis of Felfli et al. [
20], pure radiomics models could predict
EGFR mutation status with an AUC of 0.8 (95% confidence interval: 0.757–0.845), whereas combined models incorporating clinical and radiomics features yielded better results. In our study, in which a radiomics model was used, the AUC in the training group was 0.977, which concurs with the findings of other researchers.
In addition, we employed six different classifier algorithms for model construction. Using multiple algorithms allowed us to compare their strengths and weaknesses and to determine which one is most effective for our specific task. It also provides a benchmark for future studies, helping to establish a standard for evaluating the performance of new methods and indicating which methods are best suited to similar tasks.
According to our study findings, radiomics models offer significant potential as non-invasive instruments for the prediction of EGFR mutation status in lung tumors, with an AUC of 0.869 in the testing group. This non-invasive technique could be particularly valuable for patients with contraindications to invasive tissue biopsy, such as those on strict anticoagulation therapy or those at risk of complications from invasive procedures. Although liquid biopsy represents an alternative method for such patients, it has limitations, such as the possibility of false negative results. In contrast, radiomics offers a non-invasive approach that could limit the necessity for biopsy in cases where tissue samples are insufficient. However, it is important to acknowledge that the current study also has limitations: it was conducted in a single center, and the number of patients in both the training and testing groups was relatively limited. Future multicenter studies involving larger patient cohorts are required to provide a more comprehensive assessment of EGFR mutation status, considering potential ethnic differences.
Although the mutation status predicted by radiomics may not currently directly influence patient management, its potential utility lies in guiding subsequent molecular testing strategies, particularly when tissue availability is limited. Furthermore, algorithms with greater predictive ability can be developed for other mutations. Through the identification of highly relevant mutation tests based on radiomics predictions, clinicians can prioritize and tailor further molecular analyses, thereby reducing test costs. Moreover, a negative initial test result in a tumor predicted by radiomics to harbor a specific mutation can prompt clinicians to repeat the test, potentially detecting mutations with low allele frequencies and ensuring that patients do not lose the opportunity for targeted treatment.
Radiomic features, such as wavelet transforms, capture both spatial and frequency information, providing a quantitative measure of tumor heterogeneity. These features may correlate with variations in tissue density and morphology, offering insights into the biological behavior of tumors. By interpreting these features in the context of tumor biology, we can bridge the gap between machine learning and clinical relevance [
27].
In our study, areas of higher texture complexity indicated more aggressive tumor phenotypes. Additionally, our study indicated a Skewness feature, which can be interpreted as the asymmetry in the distribution of voxel intensities in a region of interest (ROI). This was also interpreted as being related to heterogeneity in the tumor tissue. In this study, Wavelet-LLL/HLL represents the decomposition into low (L) and high (H) frequency components in different directions. The texture in different levels in the image, correlating with structural heterogeneity in the tumor, also indicated an EGFR-positive state. The high kurtosis indicates that there are extreme values which may be linked with the irregular tissue composition in the tumor.
The current NCCN guidelines recommend minimally invasive surgery in suspected cases of stage IA lung carcinoma. However, studies have revealed relationships between
EGFR mutations and increased risks of brain metastasis when the cancer is diagnosed and after curative resection [
28,
29,
30]. The determination of radiomics signatures on CT images for
EGFR mutations may indicate the need for preoperative brain MRI in individuals with stage IA (T1abc, N0) lung cancer.
To summarize, the features that were extracted in this study are correlated with tumor aggressiveness as well as more heterogeneous textures and higher energy values. Features like kurtosis and range provide insights into the variability of tissue composition within the tumor, which may be linked to differences in cellular structure or necrosis. Finally, high-intensity homogeneous zones might reflect necrotic or well-vascularized regions in the tumor, offering clues about the tumor’s viability or growth patterns.
5. Conclusions
We demonstrated that a minimal subset of radiomics features showed significant promise for predicting EGFR mutation status, with an AUC of 0.869. Our findings underscore the potential of radiomics features as non-invasive predictive imaging biomarkers for EGFR mutation status, which may improve personalized treatment in NSCLC. Additionally, radiomics may be a useful non-invasive option for patients for whom biopsy is technically challenging and for patients with inadequate tumor tissue for the detection of EGFR mutations. Considering the results of our study and previous studies, radiomics may serve as a non-invasive approach for identifying EGFR mutations in NSCLC. However, additional studies are required to verify the clinical value of radiomics in NSCLC.
While our study showed considerable promise, it also has several limitations that need to be taken into consideration. First, while the provisional dataset contained 1243 patients, the analysis only focused on 430 of them without using any external validation set. Despite our promising results, the evaluation of a larger cohort of NSCLC patients whose EGFR status has been determined would be beneficial. Lastly, further studies could explore how this radiogenomic model could be applied to more detailed genotypes, such as EGFR-TKI sensitivity or exon-level mutations. However, because such subgroup analyses would further reduce the number of patients, this work is planned for a future study.
Another study limitation is the absence of external validation. While our model performed well internally, it has not yet been validated on external datasets. Future work will address this through a multicenter study covering a broader range of mutations, including EGFR-TKI sensitivity and exon-specific mutations. This step is crucial to confirming the generalizability and clinical utility of our radiomics model. In addition, the reduction of the original dataset to the analyzed cohort may limit the overall statistical power of this study, and future work will involve larger datasets to validate the findings.