1. Introduction
In 1853, John Adams, a surgeon at the London Hospital, diagnosed a cirrhosis of the prostate gland as an orphan disease, the first described case of Prostate Cancer (PCa). In 2020, the disease was responsible for 7.3% of all cancer deaths in men and was the second most frequent malignancy [
1]. The prostate gland is about the size of a walnut and located in the pelvis surrounding the prostatic urethra and below the bladder. Usually, PCa originates from the peripheral zone of the prostate, adjacent to the rectum [
2]. Asymptomatic in its early stages, PCa is usually diagnosed by a Digital Rectal Examination (DRE) and a Prostate Specific Antigen (PSA) blood test.
The aggressiveness of PCa is quantified by the Gleason Grading System [
3]. Based on the glandular architecture of cells, the pathologist assigns a grade of 1 if prostate cells are uniformly packed, up to a grade of 5 depending on pattern irregularity. The predominant pattern and the second most prevalent pattern are identified and graded accordingly, and the two grades are summed to obtain the Gleason Score (GS), which increases with PCa aggressiveness [
4]. In 2014, Epstein et al. [
5] proposed a new Gleason grading system after several studies showed that a GS 7 = 4 + 3 had a worse prognosis than a GS 7 = 3 + 4. A deeper stratification by Grade Group (GG) was then possible, including the most likely prognosis. The low/very low Risk Group (RG) is assigned to patients with a GS
. The intermediate RG includes patients classified with a GS of 7, with both favourable (3+4) and unfavourable (4+3) prognosis. Finally, the high/very high RG includes patients with a GS
.
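The relation between the two Gleason patterns, the GS and the GG can be written as a small lookup. The sketch below assumes the five-tier grouping commonly associated with the 2014 proposal (GG 1 for GS up to 6, GG 2 for 3 + 4, GG 3 for 4 + 3, GG 4 for GS 8, GG 5 for GS 9 to 10). The function name `grade_group` is illustrative and not part of this work.

```python
def grade_group(primary: int, secondary: int) -> int:
    """Map the two Gleason patterns to a Grade Group (assumed five-tier grouping)."""
    if not (1 <= primary <= 5 and 1 <= secondary <= 5):
        raise ValueError("Gleason patterns must be between 1 and 5")
    score = primary + secondary  # Gleason Score (GS)
    if score <= 6:
        return 1
    if score == 7:
        # Same GS, different prognosis: 3 + 4 is favourable, 4 + 3 is unfavourable.
        if (primary, secondary) == (3, 4):
            return 2
        if (primary, secondary) == (4, 3):
            return 3
        raise ValueError("only 3 + 4 and 4 + 3 are handled for GS 7 in this sketch")
    if score == 8:
        return 4
    return 5  # GS 9 or 10


for patterns in [(3, 3), (3, 4), (4, 3), (4, 4), (4, 5)]:
    print(patterns, "-> GS", sum(patterns), "-> GG", grade_group(*patterns))
```

The two GS 7 cases are the ones separated in this work: GG 2 (favourable) and GG 3 (unfavourable).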
GS plays an important role in determining the type of treatment to follow. A low-grade cancer with a GS of 6, together with a low PSA level and a small tumour, may be an indication for active surveillance only. PCa evaluation takes into account PSA levels, GS, TNM staging, patient history and physical examinations in order to provide a decision baseline for proper treatment and success prediction. Current guidelines suggest External Beam Radiotherapy (EBRT) as a curative option for localised and locally advanced disease and as a palliative option for metastatic low-volume disease [
6]. The effectiveness of the treatment usually relies on the monitoring of PSA blood levels, the current reference biological test. Although a high value is associated with an increased risk of PCa, it is not PCa specific. High values of PSA are also associated with Benign Prostatic Hyperplasia (BPH), an enlarged prostate gland [
7]. With the wide availability of PSA tests and the known long-term effects of definitive therapy, overdiagnosis and overtreatment have become a major issue. PCa treatments may cause sexual dysfunction, infertility, bowel and urinary problems [
6], and so more conservative approaches such as active surveillance or watchful waiting have been adopted [
8]. These are valid options, even for intermediate risk patients with a favourable prognosis (GS = 3 + 4).
During the EBRT workflow, many imaging modalities are available. Computed Tomography (CT) provides the Hounsfield Unit (HU) values critical for dose estimations. For PCa, Magnetic Resonance Imaging (MRI), providing superior soft-tissue resolution, is used for volume delineation. Positron Emission Tomography (PET) provides metabolic insights into tumour cells and, finally, Cone Beam Computed Tomography (CBCT), acquired during EBRT sessions, is used for patient positioning and setup verifications. Quantitative analysis of medical images may have a prognostic power similar to that of phenotypes and gene protein signatures [
9]. This is the hypothesis behind radiomics (the extraction of features from radiographic images using data-characterization algorithms). As an emerging field in medicine, radiomics provides the quantification of phenotypic characteristics in medical imaging [
10]. Traditionally, image analysis and characterization of shape, texture or patterns is performed by highly trained human observers, but radiomics can provide quantitative image analysis without inter/intra observer variability.
Radiomic studies have recently triggered the interest of the research community, and for PCa are mainly focused on several predictive outcomes such as staging, grading, detection, Biochemical Recurrence (BCR) or aggressiveness. Furthermore, the most used imaging modality is MRI, which is usually performed in the initial stages and is critical for volume delineation. With the predictive and phenotypic power of radiomics, other imaging modalities besides MRI may provide valuable insights. Mendes et al. [
11] evaluated CT based radiomics to predict PCa aggressiveness with promising results. In a novel attempt, Bosetti et al. [
12] evaluated the use of CBCT radiomics to address tumour staging, GS, PSA, RG and BCR. Monitoring and classifying the outcome prognosis during treatment may help to avoid additional invasive procedures or another MRI. This work intends to evaluate a model for borderline favourable vs. unfavourable (3+4 vs. 4+3) PCa cases, providing a tool that may trigger a more conservative approach, avoiding over-treatment and reducing the side effects of radiation exposure.
Following this introductory section is a summary of some of the work done in PCa radiomics using multiple imaging modalities. The idea is to gather information on the most-used feature selection and classification techniques.
Section 3 describes the dataset and the methods used to build the evaluated pipelines.
Section 4 presents the obtained results, highlighting the six best pipelines (from 98 evaluated). Finally,
Section 5 presents the main conclusions drawn.
2. Related Work
Radiomics, the extraction of quantitative features from medical images using data characterization algorithms, has the potential to provide more relevant information, improve decision outcomes and avoid overdiagnosis and overtreatment. A full radiomic study follows a pipeline initially proposed by Lambin et al. [
13] involving several steps, as exemplified in
Figure 1.
Many imaging modalities are of great value in screening PCa and improving diagnosis and prognosis outcomes [
14]. MRI provides superior soft-tissue contrast resolution when compared to other imaging modalities. It is the imaging modality selected by the Prostate Imaging Reporting and Data System (PIRADS). Most radiomic studies are focused on MRI with the PCa clinical significance as a model endpoint [
15,
16,
17,
18,
19,
20,
21]. The combination of T1 and T2 weighted sequences (multi-parametric Magnetic Resonance Imaging (mpMRI)) allows us to overcome the poor correlation between MRI signal intensities and tissue properties. With this in mind, several authors evaluated the use of MRI for PCa patient stratification with promising results [
16,
17,
18,
19], although the inclusion of clinical outcomes such as PSA or GS raised some issues [
20]. Abraham and Nair [
22] topped the PROSTATEx-2 2017 challenge with a quadratic-weighted kappa score of 0.2772, developing a new feature selection method for PCa aggressiveness assessment. Algohary et al. [
23] sought a model to evaluate Intensity Modulated Radiation Therapy (IMRT) PCa treatment responses using T2w and Apparent Diffusion Coefficient (ADC) maps in an attempt to personalize the PCa treatment evaluation framework.
PET, with the addition of a radio-tracer, provides insights into the pathological responses to some types of cancers and is a viable tool for diagnosing, staging and grading. For radiomic PCa studies, researchers focused on evaluating lymph node involvement, metastasis, GS and extra-capsular extension [
24]. Alongi et al. [
25] evaluated tumour heterogeneity with 18F-Cho-PET/CT radiomics and introduced a novel feature selection method (a mixed descriptive-inferential sequential approach).
CT seems to be a poor candidate for radiomic studies since it lacks metabolic manifestation and soft-tissue contrast. However, the spatial distribution provided by the CT could be used as a virtual biopsy for patient risk stratification [
26]. In a recent work, Mendes et al. [
11] evaluated the use of CT based radiomics for PCa aggressiveness assessment. With a dataset of 44 PCa patients, Mendes et al. [
11] extracted features using pyradiomics [
10] and Local Image Features Extraction (LIFEx) [
27]. Unable to find a radiomic signature for RG stratification, they used Principal Component Analysis (PCA) and evaluated several kernels to build a model with a Support Vector Machine (SVM). The best results were obtained with pyradiomics with a maximum Area Under the Receiver Operating Characteristic (AUROC) value of 0.88 for both low/very low and high/very high RG.
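The PCA plus SVM strategy described above can be sketched as follows. This is a minimal illustration assuming scikit-learn, a synthetic binary dataset from `make_classification` standing in for the radiomic features, and an arbitrary choice of five principal components. None of these settings are taken from the cited study.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a small radiomic dataset (44 samples, NOT real patient data).
X, y = make_classification(
    n_samples=44, n_features=60, n_informative=6, random_state=0
)

results = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    # Standardize, project onto 5 principal components (assumed), then classify.
    model = make_pipeline(StandardScaler(), PCA(n_components=5), SVC(kernel=kernel))
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    results[kernel] = scores.mean()

for kernel, auroc in results.items():
    print(f"{kernel:8s} mean AUROC = {auroc:.2f}")
```

Placing the scaler and the PCA inside the pipeline means both are refitted on the training part of each fold, so no information from the held-out fold leaks into the projection.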
CBCT is used for patient positioning verification procedures before EBRT treatment and therefore is freely available. Bosetti et al. [
12] were the first to study the use of CBCT radiomics to build models predicting tumour staging, GS, PSA levels, risk category and biochemical recurrence with promising results. In this work, we intend to evaluate the use of CBCT radiomics to distinguish between favourable and unfavourable PCa cases, a previously unevaluated scenario that may provide a tool for monitoring EBRT treatment effectiveness.
4. Results and Discussion
The evaluation of the 98 pipelines was performed in a Python environment, computing the AUROC and accuracy scores. Figure 4 shows the obtained AUROC values and Figure 5 shows the obtained accuracy scores. Both present the corresponding mean and standard deviation over the five-fold cross-validation scheme (lighter colors mean a higher value, as indicated in the color bar scale).
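The evaluation of a single pipeline under this scheme can be sketched as follows. This is a minimal illustration assuming scikit-learn and a synthetic feature matrix from `make_classification` in place of the CBCT radiomic features. The percentile value (10) and the default AdaBoost settings are assumptions, not the settings used in this work.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the radiomic feature matrix (NOT the study data).
X, y = make_classification(
    n_samples=60, n_features=100, n_informative=8, random_state=0
)

pipeline = Pipeline([
    ("scale", StandardScaler()),                             # zero mean, unit variance
    ("select", SelectPercentile(f_classif, percentile=10)),  # assumed percentile
    ("clf", AdaBoostClassifier(random_state=0)),
])

# Five stratified folds, scoring each held-out fold with AUROC and accuracy.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(pipeline, X, y, cv=cv, scoring=("roc_auc", "accuracy"))

auroc = scores["test_roc_auc"]
accuracy = scores["test_accuracy"]
print(f"AUROC:    {auroc.mean():.2f} +/- {auroc.std():.2f}")
print(f"Accuracy: {accuracy.mean():.2f} +/- {accuracy.std():.2f}")
```

Swapping the `select` and `clf` steps for other feature selection methods and classifiers gives the other combinations of such a grid.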
The pipelines were built with a very specific subset and goal: to distinguish favourable from unfavourable cases among PCa patients classified as intermediate risk. The specificity of this approach may provide EBRT with a tool capable of monitoring the true effectiveness of the treatment. An initially unfavourable case may become, during treatment, a favourable one, possibly leading to adjustments in the treatment workflow.
From the obtained results, a few pipelines do present good performance. In radiomic studies, there is always an issue with the reproducibility of the features. In the future, other CBCTs of each patient will be included in order to overcome this issue, as already performed by Bosetti et al. [
12].
Still, results seem to suggest that some classifiers are not suited to this particular task. The HistGradientBoosting obtained an AUROC of 0.50 for every feature selection method. From the evaluated feature selection methods, the KBest (ANOVA) and RFECV (SVR) provided poor performances. On the other hand, the percentile feature selection method seems to have an overall decent performance, obtaining its best results when combined with an AdaBoost classifier with an AUROC of and its worst when combined with an LR classifier with an AUROC of .
Considering a threshold value of AUROC of 0.79, six pipelines were capable of obtaining good performance. Selecting features from an LR model and combining it with a Bagging (SVC) classifier or GaussianNB provided a pipeline with an AUROC value of
and
, respectively. Furthermore, the univariate selection method combined with an SVC obtained an AUROC of
and, with an ExtraTrees, of
. The Adaboost classifier combined with the baseline feature selection method, the variance threshold, provided an AUROC of
, and when combined with the percentile feature selection method, an AUROC of
.
Table 6 highlights the results obtained by the selected pipelines.
The presented values are the mean values obtained for the five folds in the cross-validation. For these six pipelines, a deeper analysis was performed in order to evaluate the performance of each.
Figure 6 presents the AUROC curves and corresponding values for each fold as well as the mean value. Furthermore, in grey, are the
standard deviation curves and, in dashed red, the 0.5 AUROC line.
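The per-fold AUROC values and their mean can be computed without any plotting library. The sketch below uses the rank interpretation of the AUROC (the probability that a positive case receives a higher score than a negative one, with ties counted as one half). The labels and scores are hypothetical and serve only to illustrate the computation.

```python
from statistics import mean, pstdev


def auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical held-out labels and classifier scores for three folds.
folds = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.6]),
    ([1, 0, 1, 0], [0.5, 0.5, 0.7, 0.2]),
]
per_fold = [auroc(labels, scores) for labels, scores in folds]
print("per fold:", per_fold)
print("mean:", mean(per_fold), "std:", pstdev(per_fold))
```

A value of 0.5 corresponds to the dashed chance line, since a random scorer ranks half of the pairs correctly.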
In binary classification models, other metrics may provide valuable information on the performance of a classifier. The precision is defined as the ratio of the number of true positives to the sum of true positives and false positives. It is also referred to as the positive predictive value, that is, the ability of the model to avoid labelling a negative case as positive; in this work, the positives are the unfavourable cases (GG = 3). The recall (sensitivity) is calculated as the ratio of the number of true positives to the sum of true positives and false negatives. The f1-score is the harmonic mean of the precision and recall, while the support is the number of occurrences of each class in the ground truth labels [
33].
Table 7 presents these parameters for the selected pipelines and the mean number of features selected in each fold in the cross-validation.
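These definitions translate directly into code. The sketch below computes them for one class from hypothetical labels, where 3 denotes the unfavourable class and 2 the favourable class. The function name `classification_metrics` and the label vectors are illustrative only.

```python
def classification_metrics(y_true, y_pred, positive):
    """Precision, recall, f1-score and support for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}


# Hypothetical ground truth and predictions (3 = unfavourable, 2 = favourable).
y_true = [3, 3, 3, 2, 2, 3, 2, 3]
y_pred = [3, 3, 2, 2, 3, 3, 3, 3]

for label in (2, 3):
    print(label, classification_metrics(y_true, y_pred, positive=label))
```

In this toy example, Class 3 has a recall of 0.8 but a precision of only about 0.67, the pattern of many returned results of which a sizeable share are misclassified.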
A threshold of 0 in the variance threshold method is not enough to perform any feature selection, because the preprocessing standardization removes the mean and scales every feature to unit variance. The AdaBoost classifier fits additional copies of the classifier on the same dataset and re-adjusts the instance weights accordingly. This behaviour may overcome the lack of feature selection when both are combined. Apart from the percentile-AdaBoost combination, most pipelines present a low recall value for Class 2 (favourable), returning very few results. The model (LR)-bagging (SVC) pipeline has a high precision for Class 3 (unfavourable), suggesting those few results are well classified. This combination, however, has the opposite behaviour for Class 3. A high recall returns many results, but a low precision means those results are poorly classified.
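The effect of standardization on a zero variance threshold can be checked numerically. The sketch below uses only the standard library and hypothetical feature values. It assumes the selector keeps features whose variance is strictly greater than the threshold.

```python
from statistics import mean, pvariance


def standardize(values):
    """Remove the mean and scale to unit (population) variance."""
    mu = mean(values)
    sigma = pvariance(values, mu) ** 0.5
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - mu) / sigma for v in values]


def variance_threshold(columns, threshold=0.0):
    """Names of the features whose variance is strictly above the threshold."""
    return [name for name, values in columns.items() if pvariance(values) > threshold]


# Hypothetical feature values on very different scales, plus one constant feature.
features = {
    "skewness": [0.2, 1.5, -0.3, 0.9],
    "complexity": [120.0, 80.0, 95.0, 150.0],
    "zone_variance": [0.001, 0.002, 0.0015, 0.0012],
    "constant": [1.0, 1.0, 1.0, 1.0],
}
scaled = {name: standardize(values) for name, values in features.items()}
kept = variance_threshold(scaled, threshold=0.0)
print(kept)
```

After standardization every non-constant feature has a variance of 1, so only constant features fall at or below the threshold and the selector passes everything else through.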
The bagging (SVC) classifier aggregates the predictions of several SVCs, reducing the variance of the final output and improving accuracy. When using an SVC classifier, we obtained an accuracy of
, but with a bagging strategy of
. The GaussianNB classifier assumes a Gaussian likelihood with a naive assumption of conditional independence between every pair of features. By reducing the number of features to 43, we are also eliminating highly correlated features, a factor that degrades the classifier performance. The decision tree creates a piecewise constant approximation to the decision curve, but when using the entire feature set, it is known to overfit. The ExtraTrees classifier fits randomized decision trees on several sub-samples of the dataset, allowing control of the overfitting and increasing accuracy [
33].
With the percentile feature selection method and an AdaBoost classifier, the obtained results seem to be satisfactory, with relatively high values of precision and recall for both classes. The main advantage of this model is that the number of features used for classification is reduced to 11. This may be a step further in providing explainability and interpretability. In fact, the features selected in each fold of the cross-validation vary, although some are frequently selected.
Figure 7 shows the features that were more frequently selected.
All evaluated feature selection methods select the skewness. Complexity features are selected using a univariate or percentile approach, while the zone variance, busyness and surface to volume ratio are selected when learning from an LR model. The percentile-AdaBoost pipeline used the complexity and skewness features in the cross-validation. Increasing the number of folds in the cross-validation scheme may provide more valuable insights.
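Counting how often each feature is selected across folds is a simple tally. The sketch below uses hypothetical per-fold selections. The feature names echo those mentioned above, but the lists themselves are illustrative and are not the selections obtained in this work.

```python
from collections import Counter

# Hypothetical features selected in each of five cross-validation folds.
selected_per_fold = [
    ["skewness", "complexity", "zone_variance"],
    ["skewness", "complexity"],
    ["skewness", "busyness"],
    ["skewness", "complexity", "surface_to_volume_ratio"],
    ["skewness", "zone_variance"],
]

n_folds = len(selected_per_fold)
# Count each feature at most once per fold.
counts = Counter(name for fold in selected_per_fold for name in set(fold))
frequency = {name: count / n_folds for name, count in counts.most_common()}
always_selected = [name for name, count in counts.items() if count == n_folds]

for name, share in frequency.items():
    print(f"{name:26s} selected in {share:.0%} of folds")
print("selected in every fold:", always_selected)
```

With more folds the selection frequencies become finer-grained, which is one reason a larger number of folds may be informative.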
5. Conclusions
For PCa, EBRT is a curative option for localized and locally advanced disease and a palliative option for metastatic low-volume disease [
6]. Currently, the only triggers for recurrence or treatment effectiveness monitoring are the PSA blood test or redoing a Transrectal Ultrasound Guided Biopsy (TRUS). With radiomics, a quantitative analysis of medical images may have similar prognostic power to phenotypes and gene protein signatures [
9]. Interest in radiomics has been increasing and for PCa, it has been mainly focused on MRI in the initial staging and grading. However, during the EBRT, CBCTs are freely available, as they are used for patient positioning and setup verifications.
The value of CBCT-based radiomics was evaluated to distinguish between favourable and unfavourable prognosis for patients initially classified as intermediate risk. Such a tool may provide added value to monitor and trigger possible changes in EBRT outcomes. Following the current methods of feature selection and classification for PCa radiomics, 98 pipelines were evaluated. The results seem to suggest that selecting features from an LR model, combined with a bagging (SVC) classifier, provided good performance. However, because it uses 43 features, it lacks the potential to offer explainability. In this sense, a better approach seems to be using a percentile feature selection method and an AdaBoost classifier. This pipeline presents an AUROC of and an accuracy of and high values of precision and recall, being the most balanced of the evaluated pipelines.
The skewness seems to be the most frequently selected feature, considering each fold in the cross-validation scheme. Although its true importance is yet to be evaluated, the fact it was selected in every fold of every feature selection method suggests it may provide some insights. Furthermore, the fact that only one CBCT was considered for each patient did not allow the evaluation of feature reproducibility.
Although the obtained results are promising, some improvements need to be made for a deeper evaluation. A grid-search cross-validation may provide fine-tuning of parameters and improved results. Furthermore, the subset used is quite small and may benefit in the future from the inclusion of high-risk patients and the assessment of other models.
Despite its several limitations, CBCT-based radiomics may provide a baseline for a framework assessing EBRT effectiveness during ongoing treatment, improving outcomes and lowering recurrence rates.