Machine Learning Approaches in Multi-Cancer Early Detection

Hajjar, Maryam; Albaradei, Somayah; Aldabbagh, Ghadah

doi:10.3390/info15100627

Open Access Review

by Maryam Hajjar ^1,*, Somayah Albaradei ^1,2 and Ghadah Aldabbagh

^1 Computer Science Department, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 23218, Saudi Arabia

^2 Center of Research Excellence in Artificial Intelligence and Data Science, King Abdulaziz University, Jeddah 21589, Saudi Arabia

^* Author to whom correspondence should be addressed.

Information 2024, 15(10), 627; https://doi.org/10.3390/info15100627

Submission received: 23 July 2024 / Revised: 2 September 2024 / Accepted: 13 September 2024 / Published: 11 October 2024

(This article belongs to the Special Issue Applications of Machine Learning and Convolutional Neural Networks)

Abstract: Cancer is a prominent global cause of mortality, primarily due to delayed detection leading to limited treatment options. Current screening methods are mostly invasive and involve complex, lengthy processes with high costs. Moreover, each screening typically focuses on a single type of cancer. This creates a growing need for innovative, precise, and minimally invasive methods for early cancer detection. With the current advances in assay technologies and data science, multi-cancer early detection (MCED) tests are gaining increased interest in the research community as they offer potential for earlier diagnosis and improved patient outcomes. Different approaches are followed for MCED, and multiple machine learning methods are considered. In this paper, we systematically explore various MCED studies and their applied machine learning (ML) models for different types of biomarker data. We discuss the strengths and limitations of different study designs and compare their performance. Future directions are proposed, emphasizing the importance of integrating multi-omics data, enhancing model transparency, and fostering collaborative efforts to develop robust, cost-effective, and clinically applicable MCED tools.

Keywords: MCED; liquid biopsy; biomarkers; cell-free DNA; machine learning

1. Introduction

Cancer is one of the deadliest diseases worldwide. It ranks as the second leading cause of death globally, accounting for 9.7 million deaths in 2022 [1]. Beyond the profound social and psychological impacts on the lives of millions of people and their families, cancer also imposes a significant economic burden globally, affecting individuals, communities, healthcare systems, and national economies [2]. Reaching an advanced stage of cancer before detection decreases the survival rate by more than half, even when the best clinical therapies are followed [3]. This highlights the importance of early diagnosis, which can significantly increase patients’ chances of survival and reduce treatment costs [4]. According to the American Cancer Society’s 2023 report, only 39% of cancers are diagnosed at an early stage (Stage 1) [5]. Figure 1 shows the top cancers and their early-detection percentages as per [1,5].

Cancer progresses over time from its initial site to the metastasis stage, and detecting cancer during this window is critical to lowering the chances of mortality [1,2]. One major issue is that many cancers lack symptoms in early stages or exhibit only vague, non-specific symptoms such as fatigue or weight loss [2,3]. Additionally, current clinically applied screening and preventative programs for cancer detection face numerous limitations, often resulting in missed opportunities for early detection [3]. For instance, methods such as biopsies, endoscopies, and bone marrow tests are invasive, carrying risks and discomfort [4]. Another limitation is that each screening procedure typically tests for only one type of cancer, and some cancers may lack an effective or widely accepted screening method altogether [5]. Screening also requires multiple appointments and follow-ups, which can be discouraging for many patients, especially when combined with high costs and limited accessibility [1]. Furthermore, the accuracy of these screenings is questionable and may sometimes lead to issues of overdiagnosis or underdiagnosis, complicating treatment and patient outcomes [3,4].

Liquid biopsy is a minimally invasive technique for disease detection that has emerged in recent years as a way to overcome the limitations of existing screening methods [6,7]. This technique analyzes disease markers in body fluids such as blood or urine, including circulating cell-free DNA (cfDNA), cell-free RNA (cfRNA), circulating tumor cells (CTCs), extracellular vesicles (EVs), tumor-educated platelets (TEPs), proteins, and metabolites [6]. Since tumors release these markers early, liquid biopsy can detect cancer even when no symptoms or detectable tumors are present [7]. Liquid biopsy also has the advantage of providing real-time tumor information [8]. A single liquid biopsy test that incorporates different biomarkers and can detect multiple cancers at once would therefore overcome the limitations of current clinical practice discussed above [6,7]. This type of approach is known as a multi-cancer early detection (MCED) test. Such a test should be highly sensitive in detecting early-stage cancers, highly specific to avoid false positives, able to identify the tissue of origin (TOO) of the cancer, and cost-effective [8].

This review paper aims to synthesize and critically evaluate the current state of Machine Learning (ML) approaches in MCED, highlighting key methodologies, performance metrics, and clinical implications. We explore various ML techniques and their applications in analyzing diverse biomarker data such as cfDNA, methylation patterns, glycosaminoglycans, and other genomic and proteomic signatures. Furthermore, we discuss the strengths and limitations of different study designs, including retrospective/prospective case-control studies and real-world data applications. Special attention is given to the sensitivity, specificity, and overall performance of these models in detecting multiple cancer types at early stages.

While numerous review papers have explored MCED studies [8,9,10], they tend to focus primarily on the clinical potential of these tests, their performance, and the differentiation based on laboratory technologies and related biomarkers. In contrast, this review aims to bridge the gap between technical and clinical perspectives by providing a comprehensive analysis of the machine learning models used in multi-cancer detection, including an exploration of novel machine learning techniques such as deep learning and ensemble methods, and their effectiveness in handling diverse datasets. Additionally, we examine the critical impact of data quality and processing on model performance, alongside discussing the clinical implications of these models and their potential to transform current screening paradigms.

2. Review Search Methodology

A search was performed on both the PubMed and ScienceDirect databases with the query (((“multi cancer” OR “multicancer” OR “pan cancer” OR “pancancer”) AND “early detection”) OR “MCED”) AND (“MACHINE LEARNING” OR “DEEP LEARNING” OR “Artificial intelligence”), filtered for studies conducted between 2019 and 2024 that are either research articles or clinical trials. In total, 119 records were found across both platforms. Duplicate articles, studies unrelated to the review scope, and articles that focus mainly on one cancer type were excluded. In total, 18 studies are included in this review. These are compared in Table 1 and Table 2. Figure 2 summarizes the search methodology. The studies fall under four biomarker categories, namely cfDNA [11,12,13,14,15,16,17,18,19,20,21,22], cfRNA [23,24], metabolites [25,26], and proteins [27,28,29]. Each is described in the following sections.

3. Cell-Free DNA Based MCED Tests

3.1. The Circulating Cell-Free Genome Atlas Study [11,12,13]

The Circulating Cell-free Genome Atlas (CCGA) is a large-scale, prospective, case-control study conducted by GRAIL (ClinicalTrials.gov ID NCT02889978). The study aims to develop and validate blood-based MCED tests by exploring various genomic features in cfDNA from blood samples. Approximately 15,000 participants with and without cancer are enrolled across 142 sites in the USA and Canada. Recruiting from multiple clinical sites supports a diverse representation of cancer types and stages. All participants will be followed for at least five years to gather clinical outcome data. CCGA consists of three sub-studies; the scope of each is detailed below.

3.1.1. First CCGA Sub-Study [11]

This study evaluated various cfDNA approaches for MCED, involving 2800 participants, including 1628 cancer patients and 1172 healthy controls (see Table 1). The machine learning models used include Support Vector Machine (SVM) and Gradient Boosting Machines (GBM) to analyze cfDNA patterns and classify participants. Data processing approaches involved normalization of cfDNA levels, feature selection to identify the most relevant cfDNA markers, and dimensionality reduction techniques to manage the high-dimensional data.

To benchmark the different approaches, the study introduced a new metric, the clinical limit of detection (LOD), which is based on the circulating tumor allele fraction (cTAF). Eight cfDNA features taken from three types of MCED assays were considered. An ML classifier was built for each cfDNA feature type, in addition to one classifier that combines all features and another that relies on clinical data only, giving ten classifiers in total. The assays considered are Whole Genome Bisulfite Sequencing (WGBS), targeted sequencing, and Whole Genome Sequencing (WGS). The biomarkers considered are Whole Genome (WG) methylation, Single Nucleotide Variant (SNV), Single Nucleotide Variant in White Blood Cells (SNV-WBC), Somatic Copy Number Alteration (SCNA), Somatic Copy Number Alteration in White Blood Cells (SCNA-WBC), fragment endpoints, fragment lengths, and allelic imbalance.

The best single-feature classifier was the WG methylation classifier, which applied kernel logistic regression. For the pan-feature classifier, eXtreme Gradient Boosting (XGBoost) was used to combine the scores of the individual classifiers. Hyperparameters were optimized using random search on the training data. Subsequently, a final model was trained on the entire training set and then used to predict cancer versus non-cancer on the validation set. For Cancer Signal Origin (CSO), multinomial logistic regression was applied. Robustness was supported in this study by comparing multiple cfDNA approaches and validating the selected model across independent datasets, confirming its effectiveness in detecting cancers at various stages.
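Performance in the studies reviewed here is typically reported as sensitivity at a fixed specificity (see the performance column of Table 1). The following is a minimal sketch of how such a figure can be derived from classifier scores: a cutoff is chosen on the non-cancer scores so that the target specificity is met, and sensitivity is then read off the cancer scores. The function names and the scores are hypothetical and are not taken from CCGA.

```python
import math

def threshold_for_specificity(control_scores, target_specificity):
    """Lowest cutoff at which at least the target share of controls score at or below it."""
    ordered = sorted(control_scores)
    # round() guards against floating-point noise before taking the ceiling
    n_negative = max(1, math.ceil(round(target_specificity * len(ordered), 9)))
    return ordered[n_negative - 1]

def sensitivity_at_specificity(case_scores, control_scores, target_specificity):
    """Return (sensitivity, achieved specificity, cutoff) for a score-based classifier."""
    cutoff = threshold_for_specificity(control_scores, target_specificity)
    detected = sum(1 for s in case_scores if s > cutoff)          # true positives
    true_neg = sum(1 for s in control_scores if s <= cutoff)      # true negatives
    return detected / len(case_scores), true_neg / len(control_scores), cutoff

# Hypothetical classifier scores (higher means more cancer-like)
controls = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.25, 0.30, 0.60]
cases = [0.10, 0.28, 0.35, 0.55, 0.70, 0.80, 0.90, 0.95]

sens, spec, cutoff = sensitivity_at_specificity(cases, controls, 0.90)
print(f"cutoff={cutoff:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

In practice the cutoff would be fixed on training data and then applied unchanged to the validation set, so that the reported specificity is not tuned on the data used for evaluation.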

A key limitation of this study is the variability in cfDNA extraction and processing methods across different clinical sites, which may introduce noise and affect the model’s performance. Additionally, the study’s reliance on cfDNA alone may limit its ability to detect cancers that do not shed sufficient cfDNA into the bloodstream.

3.1.2. Second CCGA Sub-Study [12]

After the first sub-study identified the potential of the targeted methylation-based approach, this study refined and validated that approach using a larger and more diverse cohort, resulting in an improved clinical LOD and better performance of the targeted methylation classifier. The study in [14] is part of the second CCGA sub-study; it addressed the prognosis of participants and linked it to the model's predictions.

3.1.3. Third CCGA Sub-Study [13,14]

The purpose of this study is to assess the long-term performance and clinical implementation of the refined MCED test. It relies on comprehensive, large-scale clinical validation with long-term follow-up and real-world application. The overall objective is to support the clinical implementation of the Galleri test.

3.2. SYMPLIFY [15]

The SYMPLIFY study provides valuable insights into the application of GRAIL’s Galleri MCED test in symptomatic patients. Conducted across 44 NHS hospitals with 6238 participants initially enrolled and 5461 included in the final analysis, SYMPLIFY demonstrated that the Galleri test could effectively detect cancer signals with high specificity (98.4%) and moderate sensitivity (66.3%), particularly in advanced cancer stages.

The machine learning model employed in this study focused on analyzing cfDNA methylation patterns to predict the presence of cancer. Data processing involved the analysis of methylation levels and cross-validation to ensure model stability. The robustness of the study was reinforced by validating the model’s performance across different patient subgroups, including those with varying cancer types and stages, ensuring its applicability to a diverse symptomatic population. The main limitation of the SYMPLIFY study is its focus on symptomatic patients, which may limit the generalizability of the findings to asymptomatic populations. Additionally, while the study is large and well-validated, the variability in symptom presentation and severity could introduce noise into the model, potentially affecting its accuracy in detecting cancer across different symptom clusters.

Although the test showed lower sensitivity in early-stage cancers, its high accuracy in predicting the tumor’s site of origin (Sensitivity of 80.4% at a specificity of 99.1%) suggests significant clinical utility. The findings underscore the need for further research to optimize the test for symptomatic use, particularly in enhancing its negative predictive value, and to assess its impact on patient management and healthcare resource utilization through interventional studies.
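Because negative predictive value depends on how common cancer is in the tested population, it can help to see how NPV and PPV follow from sensitivity, specificity, and cohort composition. The sketch below is a back-of-the-envelope calculation, not a result reported by SYMPLIFY: it plugs in the sensitivity and specificity quoted above and the cohort counts listed in Table 1 (368 cancer, 5093 non-cancer). The function name and the 1% prevalence scenario are assumptions added for illustration.

```python
def predictive_values(sensitivity, specificity, n_cancer, n_non_cancer):
    """Derive PPV and NPV from sensitivity, specificity, and group sizes."""
    tp = sensitivity * n_cancer          # expected true positives
    fn = n_cancer - tp                   # expected false negatives
    tn = specificity * n_non_cancer      # expected true negatives
    fp = n_non_cancer - tn               # expected false positives
    return tp / (tp + fp), tn / (tn + fn)

# Symptomatic cohort, using the figures quoted in the text and Table 1
ppv, npv = predictive_values(0.663, 0.984, 368, 5093)
print(f"PPV = {ppv:.1%}, NPV = {npv:.1%}")

# Hypothetical population with 1% cancer prevalence and the same test characteristics
ppv_low, npv_low = predictive_values(0.663, 0.984, 100, 9900)
print(f"PPV = {ppv_low:.1%}, NPV = {npv_low:.1%}")
```

With the same sensitivity and specificity, the lower-prevalence scenario yields a much lower PPV and a higher NPV, which is one reason results from symptomatic cohorts do not transfer directly to asymptomatic screening.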

3.3. SPOT-MAS [16]

This study applies multimodal analysis, combining methylomics and fragmentomics in plasma cfDNA for MCED, to distinguish between cancer patients and healthy controls and to predict the TOO for detected cancers. Nine types of features extracted from cfDNA are used within the model, including target methylation (TM), genome-wide methylation (GWM), copy number aberrations (CNA), fragment length (FLEN), long fragment count, total fragment count, short-to-long ratio, and end motifs (EM). The study focused on cost reduction by testing shallow sequencing while still achieving high sensitivity and specificity. A total of 2288 participants were enrolled, including 738 cancer patients and 1550 healthy controls (see Table 1). Participants were selected based on the presence of detectable cfDNA and specific inclusion criteria, such as age, gender, and cancer type. The study design followed multiple approaches, as detailed below.

3.3.1. Cancer Prediction Single-Feature Models

For each of the nine features, three machine learning algorithms (random forest (RF), logistic regression (LR), and XGBoost) were tuned and compared to identify the best-performing algorithm. A 20-fold cross-validation was used to evaluate the different models. The EM feature achieved the best Area Under the Curve (AUC), at 90%.
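The evaluation loop described above can be sketched as follows: each candidate feature is scored by k-fold cross-validated AUC, and the feature with the highest mean AUC is selected. This is an illustrative sketch only. The data are synthetic, the feature names are placeholders, and a trivial one-feature rule (learning only the direction of the feature from the training folds) stands in for the RF, LR, and XGBoost models used in the study.

```python
import random
from statistics import mean

def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds, keeping both classes in every fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def cv_auc(values, labels, k=20):
    """Mean held-out AUC of a one-feature rule fitted on the training folds."""
    fold_aucs = []
    for held_out in stratified_folds(labels, k):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        case_mean = mean(values[i] for i in train if labels[i] == 1)
        ctrl_mean = mean(values[i] for i in train if labels[i] == 0)
        # "Training" here only learns whether higher values indicate cancer
        direction = 1.0 if case_mean >= ctrl_mean else -1.0
        scores = [direction * values[i] for i in held_out]
        fold_aucs.append(auc(scores, [labels[i] for i in held_out]))
    return mean(fold_aucs)

# Synthetic cohort: 100 cases (label 1) followed by 100 controls (label 0)
rng = random.Random(42)
labels = [1] * 100 + [0] * 100
features = {
    "informative": [rng.gauss(2.0, 1.0) for _ in range(100)]
                   + [rng.gauss(0.0, 1.0) for _ in range(100)],
    "noise": [rng.gauss(0.0, 1.0) for _ in range(200)],
}

results = {name: cv_auc(vals, labels, k=20) for name, vals in features.items()}
best = max(results, key=results.get)
print(results, "best:", best)
```

Stratifying the folds keeps cases and controls in every held-out fold, which is needed for AUC to be defined when folds are small.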

3.3.2. Cancer Prediction Combined-Feature Models

Concatenated model: all nine features were combined into a single data frame. XGBoost performed best, with an AUC of 88%.
Ensemble model: a stacking ensemble with logistic regression combined the predictions of the single-feature models. This model achieved the best performance, with an AUC of 93%.
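To make the ensemble idea concrete, the sketch below fits a logistic regression meta-learner on the outputs of several base models, which is the core of a stacking ensemble. It is a simplified illustration under stated assumptions: the base-model scores are synthetic and are assumed to be out-of-fold predictions, the meta-learner is fitted with plain batch gradient descent, and none of the names or settings come from the SPOT-MAS implementation.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_meta_logistic(base_scores, labels, lr=0.5, epochs=500):
    """Fit logistic regression on rows of base-model scores by batch gradient descent."""
    n_models = len(base_scores[0])
    weights, bias, n = [0.0] * n_models, 0.0, len(labels)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_models, 0.0
        for row, y in zip(base_scores, labels):
            err = sigmoid(bias + sum(w * s for w, s in zip(weights, row))) - y
            grad_b += err
            for j, s in enumerate(row):
                grad_w[j] += err * s
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

def predict_meta(weights, bias, row):
    """Ensemble probability of cancer for one sample's base-model scores."""
    return sigmoid(bias + sum(w * s for w, s in zip(weights, row)))

def clip(x):
    return min(1.0, max(0.0, x))

# Synthetic out-of-fold scores from three hypothetical single-feature models:
# one strong, one weak, and one carrying no signal.
rng = random.Random(0)
labels = [1] * 100 + [0] * 100
base_scores = []
for y in labels:
    strong = clip(rng.gauss(0.7 if y else 0.3, 0.15))
    weak = clip(rng.gauss(0.6 if y else 0.4, 0.20))
    noise = rng.random()
    base_scores.append([strong, weak, noise])

weights, bias = fit_meta_logistic(base_scores, labels)
ensemble = [predict_meta(weights, bias, row) for row in base_scores]
mean_case = sum(ensemble[:100]) / 100
mean_ctrl = sum(ensemble[100:]) / 100
print("weights:", [round(w, 2) for w in weights], "bias:", round(bias, 2))
```

The meta-learner assigns more weight to the base model that separates the classes best and little to the uninformative one, which is how stacking can outperform simple concatenation of features.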

3.3.3. Tumor of Origin Prediction Models

Three algorithms were tested: RF, convolutional neural network (CNN), and graph convolutional neural network (GCNN). GCNN performed best, with a median accuracy of 73%.

Robustness of SPOT-MAS was further supported by validating the findings on an independent dataset, confirming the model’s effectiveness in detecting and localizing cancers across different stages. However, the study’s focus on specific methylomic and fragmentomic markers may not comprehensively cover all cancer types, as evidenced by the lower detection rates for breast cancer. Additionally, the complexity of the multi-modal approach could pose challenges for its integration into clinical settings, where simpler, more interpretable models are often preferred.

3.4. Bao et al. [17]

This study discusses an ultra-sensitive assay designed for MCED using cfDNA fragmentomics. The study involved 1214 participants, with 971 cancer patients and 243 healthy controls. Participants were selected based on the presence of detectable cfDNA and the absence of prior cancer treatments. The study was designed as a case-control analysis to compare cfDNA fragmentomic patterns between cancer patients and controls.

The ML approach followed in Bao et al. incorporates five algorithms: Generalized Linear Model (GLM), GBM, RF, Deep Learning (DL), and XGBoost. Five distinct features from fragmentomic patterns are used, namely Fragment Size Coverage (FSC), Fragment Size Distribution (FSD), EnD Motif (EDM), BreakPoint Motif (BPM), and Copy Number Variation (CNV). The participants were randomly split into training and test datasets in a 1:1 ratio. The training dataset was used to build the first-level cancer detection model, in which each of the five algorithms was trained on the extracted fragmentomic features to create base models. The predictions from these base models were then combined into a single matrix, which was used to train the final stacked ensemble model. The second-level cancer origin model was then trained on only the cancer samples from the training dataset, again using an ensemble approach. Data processing included normalization of cfDNA fragment lengths, feature selection to identify the most relevant fragmentomic markers, and cross-validation to enhance model robustness. The study covered only three types of cancer, but it detected cancer at very early stages with high accuracy. Another limitation of this study is its reliance on cfDNA fragmentomics alone, which may not capture the full spectrum of cancer-associated cfDNA variations.
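The data flow of this two-level design can be sketched as follows: a random 1:1 split, a first-level detection model trained on all training samples, and a second-level origin model trained on the cancer samples only and applied only to positive calls. This is a schematic of the wiring under stated assumptions, not the authors' pipeline. The cohort is a toy one sized like the study, the cancer-type names are placeholders, and `detect` and `locate` are hypothetical stand-ins for the fitted ensemble models.

```python
import random

def split_one_to_one(samples, seed=0):
    """Randomly split samples into two halves (training and test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def cascade_predict(sample, detect, locate, cutoff=0.5):
    """Run the origin model only when the detection model calls cancer."""
    score = detect(sample)
    if score < cutoff:
        return {"cancer": False, "origin": None}
    return {"cancer": True, "origin": locate(sample)}

# Toy cohort sized like the study (971 cancer, 243 controls); the three
# cancer-type names are placeholders, not the types studied.
cohort = [{"id": i, "label": ("type_A", "type_B", "type_C")[i % 3]} for i in range(971)]
cohort += [{"id": 971 + i, "label": "healthy"} for i in range(243)]

train, test = split_one_to_one(cohort)
level1_train = train                                            # cancer vs. healthy
level2_train = [s for s in train if s["label"] != "healthy"]    # cancer samples only

# Hypothetical stand-ins for the fitted models; they simply read the label,
# so they illustrate the control flow rather than real predictive performance.
detect = lambda s: 0.9 if s["label"] != "healthy" else 0.1
locate = lambda s: s["label"]

calls = [cascade_predict(s, detect, locate) for s in test]
print(len(train), len(test), len(level2_train), sum(c["cancer"] for c in calls))
```

Keeping the test half untouched until both levels are fitted is what allows detection and origin performance to be reported on samples neither model has seen.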

3.5. Moldovan et al. [18]

The study by Moldovan et al. explores a unique combination of cfDNA biomarkers, including fragmentomics and genomic signatures such as trinucleotide end sequences, fragment sizes, and somatic copy number alterations (SCNAs). These biomarkers are retrieved from WGS data and analyzed using the newly developed FrEIA (Fragment End Integrated Analysis) software, which quantitatively evaluates liquid biopsy samples. The study involved 629 patients and 306 controls from three independent cohorts.

The study employs four machine learning approaches (k-nearest neighbors, LR, RF, and SVM) to classify cancer samples, with LR achieving the highest accuracy. The model was validated across three independent cohorts, supporting its robustness and generalizability. The preprocessing steps involved careful extraction and harmonization of cfDNA fragmentomic and genomic data, which were crucial in maintaining the fidelity of the biomarker signals. The study’s machine learning approach achieved an AUC of 0.96, which highlights the potential of multi-modal cfDNA analysis in a clinical setting. In summary, the study by Moldovan et al. takes a comprehensive multimodal approach to integrating cfDNA fragmentomic and genomic features, which not only enhances detection sensitivity but also provides prognostic insights for patient survival and recurrence. However, the study also has an imbalance in the representation of different cancer types within the cohorts and potential biases introduced by pre-analytical variables and computational preprocessing. Moreover, the limited sample size for early-stage cancer cases reduces the statistical power for these critical cases, necessitating further validation in larger, more diverse populations.

Table 1. MCED Tests Main Comparison.

Study	Data Type	Source	Study Type	Biomarker	Assay Method	Tumor Type	Sample Size	Sensitivity/Statistical Analysis	Cancer Detection ML Technique	TOO ML Technique	Cancer Detection Performance (Sensitivity% at Specificity%)	TOO Performance (Accuracy)
CCGA Galleri [11]	cfDNA	Plasma	Prospective case-control	WG methylation, SNV, SNV-WBC, SCNA, SCNA-WBC, fragment endpoints, fragment lengths, & allelic imbalance	WGBS, targeted sequencing, & WGS	22 types	2800: 1628 cancer. 1172 healthy.	Bootstrap analysis, multivariate analysis, & Bayesian likelihood estimation of cTAF	kernel LR & XGBoost	multinomial LR & stacking-ensemble classifier	34% 98%	75%
SYMPLIFY [15]	cfDNA	Blood samples	Prospective observational cohort	Methylation patterns in cfDNA	Targeted methylation-based analysis	24 types	5461: 368 cancer. 5093 healthy.	Post-test probabilities, referral pathway analysis, & cross-validation	isotonic regression	Not specified	66.3% 98.4%	84.8%
SPOT-MAS [16]	ctDNA	Plasma	Retrospective case-control	TM (450 regions), GWM, CNA, FLEN, EM	Shallow WGBS	5 types	2288: 738 cancer. 1550 healthy.	Wilcoxon Rank-Sum Test, t-Test, Kolmogorov-Smirnov Test, Benjamini-Hochberg Correction, & DeLong’s Test	Single feature: RF, LR, XGBoost Combined feature: LR stacking ensemble	GCNN	72% 97%	70%
Bao et al. [17]	cfDNA	Plasma	Prospective case-control	Fragmentomics: FSC, FSD, EDM, BPM, CNV	WGS at varied depths (down to 1×)	3 types	1214: 971 cancer. 243 healthy.	Down-sampling to 1× coverage, Propensity Score Matching, & 10-fold Cross-validation.	Ensemble stacked model including five algorithms: GLM, GBM, RF, DL, and XGBoost	Not specified	96% 95%	93%
Moldovan N. et al. [18]	cfDNA	Plasma	Retrospective & Prospective	Fragmentomics & genomic signatures (trinucleotide end sequences, fragment sizes, SCNAs)	WGS with FrEIA software	21 types	629 cancer. 306 healthy.	Mann-Whitney U test, Spearman correlation, & Kaplan-Meier analysis	k-neighbors, LR, RF, & SVM	LR	72% 95%	(in terms of AUC) 0.96
Zhang Z. et al. [19]	cfDNA	Plasma	Retrospective Case-control	Integrated fragmentomic profile and 5hmC	Capture-based low-pass sequencing	4 types	396: 311 cancer. 85 healthy.	Cross-validation, Wilcoxon Rank-Sum Test, Mann-Whitney Test, ROC-AUC analysis, Feature importance based on training data	RF	RF	88.52% 82.35%	75%
THUNDER [20]	cfDNA	Plasma	Retrospective & prospective	Methylation (161,984 CpG sites)	ELSA-seq	6 types	Retrospective (1693): 735 cancer. 958 healthy. Prospective (1010): 505 cancer. 505 healthy.	10-fold cross-validation, t-SNE analysis, Clopper-Pearson confidence intervals	SVM	Multi-class LR	69.1% 98.9%	83.2%
Hezler et al. [21]	ctDNA	Blood	Retrospective cohort Study	Fragmentation patterns at the first coding exon in targeted cancer gene cfDNA panels	Targeted exon panels	5 types	GRAIL cohort: n = 198. UW cohort: N = 320.	10-fold cross-validation	GLMNET	Elastic-net regression	(in terms of accuracy) GRAIL: 76.3% UW: 86.6%	(in terms of AUC) GRAIL: ≥ 0.83 UW: ≥ 0.89
PATHFINDER [22]	cfDNA	Blood	Prospective cohort study	Methylation signatures in cfDNA	Targeted methylation sequencing	16 types	6621 from seven U.S. health networks.	Two-sided Wilcoxon	Not specified	Not specified	PPV: 38% NPV: 98.6% Specificity: 99.1%	85%
31-miRP Signature [23]	cfRNA (miRNA)	Plasma	Retrospective case-control	31 miRNA Pairs (miRPs)	Microarray analysis	13 types	15,832: 8316 cancer. 7516 healthy.	Youden Index & cross-validation	RF	RF	94.7–100% 92.5–100%	96.1–99.8%
thromboSeq [24]	RNA	Platelets	Retrospective case-control	TEP RNA profiles	RNA sequencing	18 types	2351: 1628 cancer. 723 healthy.	Iterative ANOVA modeling & cross-validation	Swarm intelligence-enhanced classification	RF	64% 99%	85%
GAGomes [25]	Metabolic	Plasma & Urine	Retrospective & prospective	14 GAGome features	High-throughput UHPLC-MS/MS	14 types	979: 553 cancer. 426 healthy.	Bayesian estimation and equivalence testing, & internal validation using bootstrap analysis	Bayesian multivariable LR	BART	66% 95%	89%
3D-EGN + SERS [26]	Metabolic	Urine	Retrospective case-control	Urinary metabolites	SERS	4 types	218: 162 cancer. 56 healthy.	t-test, Pearson correlation, PLSR, & ROC analysis	LR & CNN	CNN	88–100% 82–100%	95.6%
OncoSeek [27]	Protein	Plasma	Retrospective case-control	7 protein tumor markers	ECLIA analyzer	9 types	9382: 1255 cancer. 8127 healthy.	One-at-a-time method sensitivity analysis	GLM with cross-validation	GBM & RF	52% 93%	67%
OneTest [28]	Protein	Serum	Retrospective & prospective real-world (2 hospitals)	8 tumor markers: AFP, CA15-3, CA-125, PSA, SCC, CEA, CYFRA21-1, & CA19-9	Automated analyzers (not sequencing)	28 types	163,174: 785 cancer. 162,389 healthy.	Effect Size, Chi-Squared Test, Fisher’s Exact Test	LSTM model Single time-point and time-series data	Not specified	87% 88%	Not specified
DEcancer [29]	Protein, ctDNA, Epidemiologics	Plasma	Retrospective & prospective	Selected protein panels, DNA omega scores	High-throughput proteomics, DNA analysis	8 types	Cohen: 1005 cancer. 812 healthy. Blume: 61 cancer. 80 healthy.	Monte Carlo cross-validation & hold-out test set validation	RF, SVM, LR, & MLP	RF	95% 99%	81–100%

Table 2. MCED Tests Findings, Limitations, & Future Directions.

Study	Key Findings	Limitations	Future Directions
CCGA Galleri [11]	Whole-genome methylation is most promising for cancer detection and classification. cTAF is a significant predictor of classifier performance, more so than clinical stage or tumor type.	Small sample sizes for some cancer types. Case-control dataset. Verification of clinical diagnosis.	Improve accuracy and sensitivity. Large-scale clinical validation. Cost-effectiveness analysis. Exploration of additional biomarkers.
SYMPLIFY [15]	The Galleri test demonstrated high specificity (98.4%) and moderate sensitivity (66.3%). Sensitivity increased with cancer stage. Accurately predicted the site of origin in 84.8% of cases with a detected cancer signal.	The study’s observational nature limits its ability to assess the direct impact of the MCED test on clinical decision-making and resource utilization. The test showed lower sensitivity in early-stage cancers, particularly stage I.	Focus on optimizing the machine learning algorithm for symptomatic populations, particularly to enhance NPV. Interventional study to evaluate the test’s clinical utility in real-world settings, with a focus on improving patient outcomes and healthcare resource utilization.
SPOT-MAS [16]	Cost reduction by shallow sequencing while achieving high sensitivity and specificity. Efficiency in early-stage cancer detection and localization of TOO.	Imbalance in representation of different cancer types. Missing staging information for 26% of sample size. Retrospective design.	Conduct large, multi-center prospective studies for validation. Include more diverse cancer types. Improve staging data collection and balance cancer type representation.
Bao et al. [17]	Ensemble model demonstrated high sensitivity and specificity. Model performed well even at low sequencing depths, making it cost-effective.	Limited cancer types. Retrospective design.	Cohort expansion. Clinical implementation.
Moldovan N. et al. [18]	Integrated cfDNA measures yielded 72% cancer detection at 95% specificity. Demonstrated prognostic value in predicting survival and recurrence. Cost-effective, non-invasive tool for early cancer detection and monitoring treatment response.	Imbalance in representation across cancer types in cohorts. Potential biases from pre-analytical conditions and computational preprocessing. Limited sample size for early-stage cancer, affecting the statistical power.	Validate findings in larger, multi-center, prospective studies. Address potential biases and enhance generalizability across different populations. Explore integration of cfDNA features with other biomarkers for improved accuracy.
Zhang Z. et al. [19]	High sensitivity and specificity. Successfully integrated fragmentomic and 5hmC data for cancer detection. Demonstrated the utility of ultra-long cfDNA fragments as biomarkers.	Retrospective design. Limited sample size and cancer type representation may affect generalizability. Potential bias due to the imbalance in the representation of different cancer types. Missing staging information for a portion of samples.	Conduct large, multi-center prospective studies to validate findings. Explore the inclusion of more diverse cancer types. Improve staging data collection and balance cancer type representation. Enhance robustness and generalizability by combining other biomarkers or exploring additional fragmentomic and epigenetic markers.
THUNDER [20]	High sensitivity and specificity in detecting six types of cancers. Effective TOO prediction. Potential for significant stage shift and survival benefits in real-world settings.	Limited to six cancer types. Non-cancer status determined only by baseline check-ups, no follow-up. Retrospective sample collection for some cohorts.	Expand to include more cancer types. Conduct follow-up studies for non-cancer participants. Further prospective studies to validate real-world utility.
Hezler et al. [21]	cfDNA fragmentation patterns enable accurate cancer detection across multiple types. GLMNET model effectively classifies cancer types, even with low ctDNA fractions. Validation across two cohorts demonstrates broad applicability in MCED.	Retrospective design limits real-time clinical applicability. Low sensitivity in detecting stage 1 cancers. Limited to five cancer types.	Further validation needed for stage 1 cancer detection. Expand analysis to additional cancer types. Apply in prospective, real-world clinical settings.
PATHFINDER [22]	Solid performance with a 99.1% specificity, 98.6% NPV, and a 38% PPV. The test identified multiple cancers, including those without routine screening, at early stages.	High false-positive rate. Limited ethnic and socioeconomic diversity in the cohort limits generalizability. The study design lacks a control group, which might affect the interpretation of results.	Assess long-term clinical utility and impact on cancer mortality. Investigate MCED testing across more diverse populations and healthcare settings.
31-miRP Signature [23]	The 31-miRP signature achieved AUC values of 0.976–1.000 across 13 cancer types. High accuracy in early-stage cancer detection with AUC up to 0.998. The signature effectively distinguished cancerous from benign lesions with strong specificity.	Retrospective design limits real-time clinical applicability. Technical variability in miRNA detection was addressed, but consistency still needs real-world testing.	Prospective studies needed to confirm clinical utility in broader, diverse populations. Investigate the molecular mechanisms of selected miRNAs to improve model performance.
thromboSeq [24]	Highly specific pan-cancer blood test using TEP RNA profiles. Successful detection of 18 tumor types with 64% overall sensitivity. Accurate tumor site-of-origin determination with 85% accuracy in selected cancers.	Retrospective case-control design. Decreased specificity in symptomatic controls, indicating higher false positives in non-cancerous diseases. Variation in detection accuracy across different cancer types, with lower rates in early-stage cancers.	Conduct large-scale, multi-center prospective studies to validate findings. Include a broader range of non-cancerous diseases in the study to improve specificity. Explore the integration of TEP RNA with other liquid biopsy approaches to enhance early cancer detection.
GAGomes [25]	Could predict poor prognosis cancers and location of cancer with high accuracy. The combined model (plasma & urine) has the highest accuracy. More cost-effective than molecular-based approaches.	Inflammation or metabolic conditions might affect the results. Small sample sizes for certain cancer types increase the risk of overfitting. GAGome profile variability across populations limits generalizability of the test results.	Enhance robustness and generalizability. Combine GAGomes with other biomarkers. Understand the mechanisms linking GAGome alterations to cancer.
3D-EGN + SERS [26]	Successfully demonstrated noninvasive multicancer diagnosis using whole urine samples. High accuracy in distinguishing between normal and cancerous samples across multiple cancer types (Pancreatic, Prostate, Lung, Colorectal). Potential for rapid, high-throughput onsite screening.	Limited sample size and diversity, which may impact generalizability. Potential risk of overfitting in machine learning models due to small and imbalanced datasets. Limited to four cancer types; applicability to other cancers is unknown.	Expand sample size and diversity in future studies to enhance generalizability. Validation in larger, multi-center clinical trials. Explore the applicability of the method to additional cancer types and other biofluids.
OncoSeek [27]	Cost-effective MCED tool. Best performance in pancreas, ovary, and liver. Suboptimal performance in breast cancer, esophageal cancer, and lymphoma.	Sample diversity and platform variability. Bias from case-control study design. Limited cancer types and suboptimal performance in certain cancers (e.g., breast cancer).	Conduct prospective real-world studies. Increase the scope of cancer types. Improve generalizability to different populations.
OneTest [28]	LSTM model can robustly predict cancer risk using incomplete and irregular TM data. Time-series data further enhances the predictive performance of the model.	Class imbalance impacts model accuracy since the dataset is heavily skewed toward non-cancer cases. Dependence on data imputation may introduce bias.	Increase availability of time-series data to enhance model robustness. Improve LSTM model interpretability for better clinical decision-making. Conduct prospective studies to assess real-world effectiveness of the model.
DEcancer [29]	Achieved 90–100% sensitivity at 99% specificity for detecting Stage 1 cancers. Integrated proteomic, ctDNA, and epidemiological data enhances cancer detection and classification accuracy. Accurate tumor site-of-origin determination with up to 100% accuracy in selected cancers.	Small sample size for certain cancer types limits generalizability. Data integration increases clinical complexity and cost.	Validate in large-scale, multi-center studies to improve generalizability. Explore cost-effective integration of multi-modal data for real-world applications.

3.6. Zhang Z. et al. [19]

This study integrated fragmentomic profiles with 5-hydroxymethylcytosine (5hmC) data from capture-based low-pass sequencing to enable pan-cancer detection via cfDNA. The cohort consisted of 396 participants, comprising 311 cancer patients and 85 healthy controls, analyzed through a retrospective case-control design to compare cfDNA profiles.

A key strength of the study is its novel combination of epigenetic and fragmentomic data, which together significantly enhance the sensitivity and specificity of cancer detection across multiple tumor types. Utilizing capture-based low-pass sequencing, the study not only explores the more commonly analyzed short cfDNA fragments but also delves into the potential of ultra-long cfDNA fragments as novel biomarkers, an area that has been underexplored in previous research.

An RF model was employed for both cancer detection and TOO classification, with rigorous cross-validation and feature selection. Balanced data splits between training, validation, and test sets helped maintain consistent representation across cancer and non-cancer samples. Diverse data features were integrated, including fragment size profiles, coverage profiles, preferred ends, and 5hmC signatures. Notable performance metrics were achieved: a sensitivity of 88.52% and specificity of 82.35% for pan-cancer detection, with a TOO classification accuracy of 75%. However, the retrospective nature of the study, along with limitations in sample size and cancer type representation, suggests that further validation through large-scale prospective studies is necessary to fully realize and generalize these findings across broader populations.
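To illustrate what a balanced split of this kind can look like, the following is a minimal sketch of a stratified train/validation/test partition. The cohort sizes (311 cancer, 85 controls) come from the study description above; the 60/20/20 proportions, the function name `stratified_split`, and the random seed are assumptions made for illustration and are not taken from the original study.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split sample indices into train/validation/test lists so that each
    subset keeps roughly the same class proportions as the full cohort.
    The 60/20/20 default is an illustrative assumption."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx, test_idx = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train_idx += indices[:n_train]
        val_idx += indices[n_train:n_train + n_val]
        test_idx += indices[n_train + n_val:]  # remainder goes to the test set
    return train_idx, val_idx, test_idx

# Cohort sizes as reported in the text: 311 cancer patients, 85 healthy controls.
labels = ["cancer"] * 311 + ["control"] * 85
train_idx, val_idx, test_idx = stratified_split(labels)
for name, subset in (("train", train_idx), ("val", val_idx), ("test", test_idx)):
    share = sum(labels[i] == "cancer" for i in subset) / len(subset)
    print(f"{name}: n={len(subset)}, cancer share={share:.3f}")
```

Splitting within each class, rather than across the pooled cohort, is what keeps the cancer-to-control ratio comparable in all three subsets.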

3.7. THUNDER [20]

The THUNDER study focused on non-invasive multi-cancer detection using cfDNA methylation sequencing, involving 1703 participants, including 735 cancer patients and 958 healthy controls in the retrospective phase. In the prospective independent validation phase, 1010 participants were included, with 505 cancer patients and 505 healthy controls. Utilizing a novel technology known as enhanced linear-splinter amplification sequencing (ELSA-seq), the study focused on detecting and localizing six cancer types (colorectal, esophageal, liver, lung, ovarian, and pancreatic) using a customized panel of 161,984 CpG sites.

Training data processing involved 10-fold cross-validation. A custom SVM algorithm was applied for cancer detection, followed by multi-class logistic regression for TOO prediction. The study reported a sensitivity of 69.1% at a specificity of 98.9%, with an accuracy of 83.2% in predicting the tissue of origin.
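As a minimal sketch of how 10-fold cross-validation partitions a training set, the generator below yields ten train/held-out index pairs in which every sample is held out exactly once. The function name `k_fold_indices`, the seed, and the sample count in the usage lines are illustrative assumptions; the study's own implementation is not described in this review.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training_indices, held_out_indices) for each of k folds.
    Every sample appears in exactly one held-out fold."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        held_out = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, held_out

# Hypothetical usage with 200 samples: each fold holds out 20 of them.
for fold_number, (training, held_out) in enumerate(k_fold_indices(200), start=1):
    print(fold_number, len(training), len(held_out))
```

In a two-stage design such as the one described above, the detection model and the TOO model would each be fitted on the training indices and scored on the held-out indices of the same fold.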

While the main cohort was based on a retrospective design, which may limit real-time applicability and introduce potential selection bias, the study overcame this by including a prospective independent validation phase. However, the focus on only six cancer types limits the generalizability of the findings, and the lack of follow-up for the non-cancer cohort raises potential concerns about misclassification.

3.8. K. T. Helzer et al. (GLMNET) [21]

This study focused on the fragmentomic analysis of circulating tumor DNA (ctDNA) using targeted cancer panels to compare cancer patients and controls. Following a case-control retrospective approach, a total of 518 participants were enrolled, including 320 from the University of Wisconsin (UW) cohort and 198 from the GRAIL cohort. The goal was to explore how cfDNA fragmentation patterns, particularly around the first coding exon, could be used for cancer detection and classification.

To achieve this, the study employed a multinomial regression model with an elastic net penalty (GLMNET), a machine learning technique well-suited for high-dimensional data. By combining L1 (lasso) and L2 (ridge) regularization techniques, the model was able to effectively manage complex datasets, selecting the most relevant features while avoiding overfitting. This approach allows the model to accurately classify cancer types and subtypes based on cfDNA fragmentation patterns, even when ctDNA is present in very low fractions.
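The way the elastic net combines the two penalties can be made concrete with a short sketch. The penalty below follows the parameterization commonly used by glmnet, in which `alpha` mixes the L1 and L2 terms and `lam` scales the overall strength; the coefficient values are hypothetical. The `soft_threshold` helper shows the operation through which the L1 term sets small coefficients exactly to zero, which is what performs feature selection.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Elastic net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * squared L2).
    alpha = 1 gives the lasso, alpha = 0 gives ridge regression."""
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

coefs = [1.0, -2.0, 0.0]  # hypothetical coefficients
print(elastic_net_penalty(coefs, lam=0.5, alpha=1.0))  # pure lasso penalty
print(elastic_net_penalty(coefs, lam=0.5, alpha=0.0))  # pure ridge penalty
print([soft_threshold(b, 0.5) for b in (2.0, 0.3, -0.8)])
```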

To ensure precision and robustness, the study included rigorous data preprocessing steps, such as alignment to reference genomes and calculation of Shannon entropy for cfDNA fragments. Moreover, validation across two independent cohorts further contributes to the model’s robustness and applicability in MCED.
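As an illustration of the entropy step, the sketch below computes the Shannon entropy of a fragment-length distribution. Treating the fragment lengths observed in one genomic region as the distribution is an assumption made here for illustration; the exact definition used in the study is not given in this review, and the lengths shown are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(fragment_lengths):
    """Shannon entropy (in bits) of the empirical fragment-length distribution."""
    if not fragment_lengths:
        raise ValueError("at least one fragment length is required")
    counts = Counter(fragment_lengths)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy

# Hypothetical regions: a uniform length profile versus a diverse one.
uniform_region = [167, 167, 167, 167]
diverse_region = [150, 160, 170, 180]
print(shannon_entropy(uniform_region), shannon_entropy(diverse_region))
```

A region in which all fragments share one length has zero entropy, whereas a region with more varied fragment lengths scores higher, so the value summarizes fragmentation diversity in a single feature.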

This integration of machine learning with targeted cfDNA panels represents a scalable, cost-effective solution for early cancer detection. However, while the study highlights the potential of this method, it also faces limitations, including its retrospective design and the need for further validation, particularly in stage 1 cancer detection where sensitivity remains lower compared to later stages.

3.9. PATHFINDER [22]

The PATHFINDER study is a large-scale prospective cohort study that enrolled 6662 participants, including both asymptomatic individuals aged 50 or older, and those with additional cancer risk factors such as smoking history, previous cancer history, and genetic predisposition. The study aimed to evaluate the performance of an MCED test using blood-based biomarkers, tracking participants over a one-year period.

The study employed a machine learning model that integrates cfDNA methylation patterns to predict the presence of cancer. Data processing included the generation of composite scores based on cfDNA methylation signatures, reflecting cancer risk. The robustness of the study was ensured through independent validation cohorts, confirming the test’s ability to detect cancers at various stages, including early stages (I and II). Results show solid performance, with a Positive Predictive Value (PPV) of 38%, a Negative Predictive Value (NPV) of 98.6%, and a specificity of 99.1%. The PPV increased to 43% in participants with additional risk factors.
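The dependence of PPV on how common cancer is in the tested group can be sketched with Bayes' rule. In the example below, the specificity of 99.1% is the value reported above, while the sensitivity and the two prevalence values are hypothetical inputs chosen only to show the direction of the effect; they are not PATHFINDER results.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied to a population with the given
    cancer prevalence. All arguments are fractions between 0 and 1."""
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

SPECIFICITY = 0.991   # reported in the text
SENSITIVITY = 0.50    # hypothetical, for illustration only
for prevalence in (0.01, 0.02):  # hypothetical prevalence values
    ppv, npv = predictive_values(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence={prevalence:.2%}: PPV={ppv:.3f}, NPV={npv:.3f}")
```

With sensitivity and specificity held fixed, a higher prevalence yields a higher PPV, which is consistent with the higher PPV observed in participants with additional risk factors.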

One of the main limitations of the PATHFINDER study is the potential for selection bias, given the focus on individuals with high adherence to cancer screening protocols, which may limit the generalizability of the findings to the broader population. Additionally, while the study shows promise in detecting multiple cancers, the sensitivity varies depending on the cancer type, which could impact its overall effectiveness in a real-world setting.

4. Cell-Free RNA Based Tests

4.1. 31-miRP Signature [23]

The study explores the use of a 31-miRP (microRNA pairs) signature for detecting multiple cancer types, including early-stage cancers. Analyzing data from 15,832 individuals across 13 cancer types, with a focus on nine early-stage cancers, the study employed several machine learning algorithms, with Random Forest (RF) emerging as the most effective for both cancer detection and classification. Data processing involved rigorous feature selection to pinpoint the most relevant microRNA pairs, the use of relative expression values instead of traditional normalization, and cross-validation to enhance model stability. Results showed high accuracy, with AUC values ranging from 0.961 to 0.998, and achieved strong sensitivity and specificity across the nine cancer types. The robustness of the findings was further supported by validating the model on an independent dataset, which confirmed the high predictive power of the microRNA pairs across different cancer types.

However, the study’s retrospective design may limit its generalizability, and the absence of a prospective validation cohort means further research is needed to confirm clinical utility. While miRNAs are promising biomarkers, this study addressed potential variability by using relative expression values, which mitigates some challenges related to stability and consistent detection.
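One way to see why pair-based relative expression is insensitive to sample-level normalization is the following minimal sketch. It encodes each microRNA pair as a binary feature indicating which member is more highly expressed; this binary encoding, the function name `pair_features`, and the miRNA names and values are assumptions for illustration and may differ from the encoding used in the study.

```python
from itertools import combinations

def pair_features(expression):
    """Map each pair (a, b) of miRNAs to 1 if a is expressed above b, else 0."""
    names = sorted(expression)
    return {
        (a, b): int(expression[a] > expression[b])
        for a, b in combinations(names, 2)
    }

# Hypothetical expression values for one sample.
sample = {"miR-A": 5.0, "miR-B": 2.0, "miR-C": 9.0}

# The same sample after an arbitrary increasing rescaling, for example a
# batch effect or a different normalization scheme.
rescaled = {name: 3.0 * value + 1.0 for name, value in sample.items()}

print(pair_features(sample))
print(pair_features(sample) == pair_features(rescaled))
```

Because only the within-sample ordering of each pair matters, any transformation that preserves that ordering leaves the features unchanged.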

4.2. thromboSeq [24]

This study explored the use of platelet RNA as a biomarker for detecting and localizing both early- and late-stage cancers. A total of 2351 participants were included, with 1628 cancer patients at various stages and 723 control samples, including both asymptomatic and symptomatic controls. Participants were selected based on specific inclusion criteria, including cancer stage and the presence of platelet RNA profiles. The study was designed as a case-control analysis to compare RNA profiles between cancer patients and controls, with a diverse sample set from European and North American populations to ensure broader applicability of the results. The study employs a distinctive biomarker, tumor-educated platelet (TEP) RNA profiles, which differentiates it from the more common cfDNA-based approaches. This approach leverages the ability of platelets to alter their RNA content in response to local and systemic cues from tumors, thus providing a rich source of potential biomarkers for cancer detection. The study integrates advanced machine learning techniques, particularly swarm intelligence-enhanced classification algorithms, to optimize the selection of RNA biomarkers from a large dataset. This method not only aids in the accurate detection of cancer but also in the identification of the tumor’s site of origin, achieving an accuracy of 85% in the latter.

This study applied advanced machine learning algorithms, including RF for tumor site-of-origin analysis and a swarm intelligence-enhanced algorithm for cancer detection. These approaches were optimized through iterative ANOVA modeling and cross-validation, ensuring robust model performance. Data validation and preprocessing were meticulously handled, with stringent quality controls applied to the platelet RNA sequencing process.

Study limitations include its retrospective case-control design and decreased specificity in symptomatic controls, which indicates a higher rate of false positives in non-cancerous diseases. Additionally, the variability in detection accuracy across different cancer types, particularly lower rates in early-stage cancers, highlights the need for further validation and refinement of this approach.

5. Metabolites Based Tests

5.1. GAGomes [25]

This study focused on MCED using free glycosaminoglycans (GAGs) in a case-control development study that included 979 participants, 553 of whom were cancer patients. The study used a metabolic type of biomarker called glycosaminoglycan profiles (GAGomes), which exist in both plasma and urine. The assays used to detect such biomarkers are simpler and more cost-effective than molecular-based assays.

Another distinguishing feature of this study is its prospective approach, which is better suited to evaluating MCED tests because participants can be followed over time to monitor the actual incidence of cancer after the test, making the results more applicable to real-world scenarios.

Bayesian multivariable logistic regression was applied for cancer detection, while Bayesian Additive Regression Trees (BART) were applied for putative cancer location (PCL) prediction, which is comparable to TOO prediction in molecular-based tests. Plasma, urine, and combined approaches were all tested and compared; the combined model (plasma & urine) demonstrated the highest diagnostic accuracy. However, the variability in GAG levels across different populations may limit the generalizability of the findings.

5.2. 3D-EGN + SERS [26]

The study explored the potential of using metabolic biomarkers in urine for MCED, including pancreatic, prostate, lung, and colorectal cancers. The study stands out by utilizing a novel three-dimensional evolutionary gold nanoarchitecture (3D-EGN) combined with Surface-Enhanced Raman Scattering (SERS) to detect these biomarkers. This platform enabled the direct analysis of whole urine samples in their liquid state, preserving volatile metabolites that might otherwise be lost during sample preparation processes.

The study included a total of 218 samples with 56 healthy controls and 162 cases from four different cancer types. Logistic regression was employed for binary classification and a convolutional neural network (CNN) for multiclass classification, which allowed for the precise identification of different cancer types from the SERS data. To mitigate the risk of overfitting and to handle the small and imbalanced dataset, the study incorporated rigorous preprocessing steps, including min-max normalization and random under-sampling to balance the training dataset. The validation process involved splitting the dataset into 80% training and 20% test sets to assess the robustness of the model.
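The two preprocessing steps named above can be sketched as follows. The class counts (56 controls, 162 cases) are those reported in the text; the function names, the seed, and the toy spectrum are illustrative assumptions. In the study, under-sampling was applied to the training portion only, whereas the usage lines below apply it to the full label list purely to show the effect.

```python
import random

def min_max_normalize(spectrum):
    """Rescale intensities to the [0, 1] range; a flat spectrum maps to zeros."""
    lowest, highest = min(spectrum), max(spectrum)
    if highest == lowest:
        return [0.0 for _ in spectrum]
    return [(x - lowest) / (highest - lowest) for x in spectrum]

def random_undersample(labels, seed=0):
    """Return sorted indices after randomly down-sampling every class
    to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    smallest = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, smallest))
    return sorted(kept)

print(min_max_normalize([2.0, 4.0, 6.0]))  # toy SERS intensities

labels = ["control"] * 56 + ["cancer"] * 162  # counts reported in the text
kept = random_undersample(labels)
print(len(kept), sum(labels[i] == "cancer" for i in kept))
```

Under-sampling discards majority-class samples, which balances the classes at the cost of a smaller training set; this trade-off matters for a dataset of this size.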

One of the key differentiators of this study is its focus on high-throughput, noninvasive diagnostics using easily accessible biofluids, positioning it as a promising tool for rapid onsite screening. Additionally, the integration of advanced nanomaterial technology with sophisticated machine learning models highlights the potential for developing highly sensitive and specific diagnostic platforms. However, the study’s limitations include a relatively small and homogeneous sample, which may affect the generalizability of the findings. Further validation with larger, more diverse populations and exploration of its applicability to other cancer types and biofluids will be essential to fully realize the potential of this diagnostic approach.

6. Proteins Based Studies

6.1. OncoSeek [27]

This study involved 9382 participants, comprising 1255 cancer patients and 8127 healthy controls. The study employed a case-control design to compare protein marker levels between cancer patients and controls, focusing on seven specific protein tumor markers for nine types of cancer. This focus on protein biomarkers is intended to reduce the cost of MCED tests as opposed to other types of biomarkers.

The GLM algorithm was selected to develop the model for distinguishing cancer from non-cancer individuals. Ten-fold cross-validation was applied and repeated 30 times. The average prediction value from these GLM models was defined as the probability of cancer (POC). The POC value at 90.0% specificity was chosen as the cut-off value. If the test result exceeded the cut-off, it indicated the presence of cancer signals; otherwise, no cancer signals were detected.
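The cut-off rule described above can be sketched in a few lines. The sketch assumes that the cut-off is taken as the control score below or at which at least 90% of controls fall, which is one common way to fix specificity; the function names and all score values are hypothetical and do not come from the study.

```python
import math

def probability_of_cancer(per_model_scores):
    """Average the predictions of the repeated cross-validation models."""
    return sum(per_model_scores) / len(per_model_scores)

def cutoff_at_specificity(control_scores, target_specificity=0.90):
    """Smallest observed control score such that at least the target
    fraction of controls scores at or below it."""
    ordered = sorted(control_scores)
    # The small epsilon guards against floating-point overshoot in the product.
    k = max(1, math.ceil(target_specificity * len(ordered) - 1e-9))
    return ordered[k - 1]

def has_cancer_signal(poc, cutoff):
    """A result counts as positive only when it exceeds the cut-off."""
    return poc > cutoff

# Hypothetical POC values for ten non-cancer controls.
controls = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
cutoff = cutoff_at_specificity(controls)
specificity = sum(not has_cancer_signal(s, cutoff) for s in controls) / len(controls)
print(cutoff, specificity)
```

Fixing the cut-off on controls alone determines the false-positive rate in advance; sensitivity is then whatever fraction of cancer cases happens to score above that value.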

To predict the TOO, true positive patients were used to develop the model utilizing the RF and GBM methods. Due to the imbalanced sample sizes for each cancer type, a down-sampling method was applied to balance the sample sizes. The two organs with the highest prediction probabilities were identified as the potential TOO.
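Reporting the two most probable organs can be illustrated with a short sketch. The organ names and probabilities below are hypothetical, and `top_two_accuracy` is an illustrative helper that counts a prediction as correct when the true organ is among the two candidates; the study's own scoring code is not described in this review.

```python
def top_two_organs(organ_probabilities):
    """Return the two organs with the highest predicted probability."""
    ranked = sorted(organ_probabilities, key=organ_probabilities.get, reverse=True)
    return ranked[:2]

def top_two_accuracy(predicted_probabilities, true_organs):
    """Fraction of patients whose true organ is among the top two candidates."""
    hits = sum(
        true_organ in top_two_organs(probabilities)
        for probabilities, true_organ in zip(predicted_probabilities, true_organs)
    )
    return hits / len(true_organs)

# Hypothetical model outputs for two true-positive patients.
patients = [
    {"liver": 0.5, "pancreas": 0.3, "ovary": 0.2},
    {"liver": 0.1, "pancreas": 0.3, "ovary": 0.6},
]
print(top_two_organs(patients[0]))
print(top_two_accuracy(patients, ["pancreas", "liver"]))
```

Accepting two candidate organs is more lenient than single-label accuracy, so top-two figures are not directly comparable with the single-organ TOO accuracies reported by other studies.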

The key limitation of the study lies in its reliance on a selected set of protein markers, which may limit generalizability across all cancer types or populations. Furthermore, the sample size, while adequate, might not fully capture all potential variations in marker levels across different demographics.

6.2. OneTest [28]

OneTest is a protein-based MCED test developed by 20/20 GeneSystems. It covers 28 types of cancer using eight protein biomarkers. The study utilizes real-world data from two hospitals in both retrospective and prospective approaches. However, data processing was needed to handle irregular and missing data.

The study compared multiple ML models and found that the Long Short-Term Memory (LSTM) model performed best owing to its ability to handle temporal dependencies and irregularities in the data, making it well-suited for the longitudinal analysis of tumor markers. Data processing included the normalization of tumor marker levels, handling of missing data through imputation, and the integration of multiple data sources into a cohesive dataset. The robustness of the study was ensured through cross-validation, and the model’s performance was further validated on an independent dataset to confirm its predictive power across different cancer types and stages. Moreover, the LSTM was tested using both single time-point and time-series data, and time-series data was found to further improve predictive performance.

Limitations of the study include reliance on retrospective data, which may introduce biases due to the variability in how medical records are maintained across different centers. Additionally, while the LSTM model is powerful, its complexity may limit its interpretability, making it difficult to pinpoint the specific markers or patterns most predictive of cancer outcomes.

6.3. DEcancer [29]

The study integrates multiple types of data, including proteomic data (proteins), DNA-based data (ctDNA omega scores), and epidemiological data (age, sex, ethnicity). A total of 1005 participants were included, with 193 cancer patients and 812 healthy controls. This multimodal approach allows the model to leverage the strengths of different data sources for improved detection and classification accuracy. The study concentrates on a minimal, highly predictive set of biomarkers. This contrasts with other methods that often use larger, less efficient biomarker panels, and it makes the test simpler and more clinically feasible. The selection is performed via a recursive feature elimination process based on the Gini index in RF models, which systematically identifies the most relevant biomarkers for cancer detection and classification. Moreover, comprehensive data augmentation and preprocessing are applied, which further improve model robustness.
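The control flow of recursive feature elimination can be sketched independently of the underlying model. In the sketch below, `importance_fn` stands in for refitting an RF on the currently selected features and reading off its Gini-based importances; the toy importance function, the feature names, and the numbers are hypothetical placeholders, not values from DEcancer.

```python
def recursive_feature_elimination(features, importance_fn, n_keep):
    """Repeatedly drop the least important feature until n_keep remain.
    importance_fn(selected) must return {feature_name: importance}."""
    selected = list(features)
    while len(selected) > n_keep:
        importances = importance_fn(selected)  # "refit" on the current subset
        weakest = min(selected, key=lambda name: importances[name])
        selected.remove(weakest)
    return selected

def toy_importance(selected):
    """Hypothetical stand-in for Gini importances from a refitted RF."""
    base = {"protein_1": 0.40, "protein_2": 0.05, "ctDNA_score": 0.35, "age": 0.20}
    total = sum(base[name] for name in selected)
    return {name: base[name] / total for name in selected}

kept = recursive_feature_elimination(
    ["protein_1", "protein_2", "ctDNA_score", "age"], toy_importance, n_keep=2
)
print(kept)
```

Recomputing importances after each removal is the point of the recursion: a feature's importance can change once a correlated feature has been dropped.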

DEcancer does not rely on a single machine learning technique. Instead, it tests and optimizes several models, including RF, SVM, Logistic Regression, and Multilayer Perceptron (MLP). The choice of the best model is data-driven, depending on the performance during validation. Monte Carlo cross-validation is used, which involves repeatedly partitioning the data into training and validation sets at random; this provides a more robust estimate of model performance and helps prevent overfitting.
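A minimal sketch of Monte Carlo cross-validation is given below. Unlike k-fold cross-validation, each repeat draws a fresh random partition, so a sample may appear in several validation sets or in none. The number of repeats, the 20% validation fraction, the function names, and the `fit_and_score` callback are assumptions for illustration; the callback stands in for training a model on the training indices and returning a validation metric.

```python
import random
import statistics

def monte_carlo_splits(n_samples, n_repeats=50, validation_fraction=0.2, seed=0):
    """Yield (training_indices, validation_indices) for repeated random splits."""
    rng = random.Random(seed)
    n_val = max(1, round(validation_fraction * n_samples))
    for _ in range(n_repeats):
        shuffled = rng.sample(range(n_samples), n_samples)  # fresh permutation
        yield shuffled[n_val:], shuffled[:n_val]

def monte_carlo_cv(fit_and_score, n_samples, **split_options):
    """Return the mean and standard deviation of the validation scores."""
    scores = [
        fit_and_score(training, validation)
        for training, validation in monte_carlo_splits(n_samples, **split_options)
    ]
    return statistics.mean(scores), statistics.stdev(scores)

def dummy_fit_and_score(training, validation):
    """Hypothetical placeholder that returns the validation share of the data."""
    return len(validation) / (len(training) + len(validation))

# Cohort size from the text; the scoring function is a placeholder.
print(monte_carlo_cv(dummy_fit_and_score, n_samples=1005, n_repeats=10))
```

The spread of the scores across repeats gives a direct indication of how sensitive the performance estimate is to the particular partition.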

Unlike previous studies that may have relied solely on cross-validation, DEcancer employs hold-out test sets that are separate from the training and validation datasets. This approach provides a more realistic estimate of the model’s performance on unseen data, ensuring that the model is likely to generalize well to new patient populations.

Results for detecting Stage 1 cancer show a sensitivity range of 84–100% at 99% specificity, depending on cancer type. It is worth noting that four cancer types reached 100% sensitivity. For cancer classification, the accuracy is between 97% and 100%. While this demonstrates strong performance for both detection and classification of early-stage cancer, there are also several limitations that should be considered. For instance, the sample size is small for cancer cases. Moreover, the integration of different types of data imposes higher cost and complexity in clinical implementation. Although rigorous validation techniques are applied, the complexity of integrating multiple data types could still pose a risk of overfitting, especially given the relatively small size of the dataset. Overfitting could result in the model performing well on the specific datasets used for training but less well on unseen data.

7. Discussion

Four different types of biomarkers were investigated, namely protein, cfDNA, cfRNA, and metabolic biomarkers. The two studies that used solely protein biomarkers [27,28] showed good performance while being the most cost-effective. On the other hand, cfDNA biomarkers were addressed by most studies, likely because cfDNA carries more cancer biomarkers than the other analytes. The CCGA study by GRAIL showed that WG methylation is the most promising feature; further refinement and validation are under way in subsequent studies, which indicates that the CCGA approach is well established and holds the most potential for clinical implementation. The study by Bao et al. [17] showed the best performance among all cfDNA methods while reducing cost through lower sequencing depth. However, only three types of cancer are considered in that test.

cfRNA biomarkers have demonstrated promising performance, even surpassing cfDNA in some aspects, as shown in the study by Al Ja’farawy et al. [23], covering a wider range of cancer types and a larger cohort. However, its practical use in MCED is currently constrained by several technical, biological, and logistical challenges. Advancements in RNA stabilization techniques, extraction protocols, and computational tools will be crucial in overcoming these limitations and unlocking the full potential of cfRNA in cancer detection.

Using metabolic biomarkers in both plasma and urine tests provided decent performance at a lower cost than cfDNA tests; however, requiring two sample types complicates the screening process. Moreover, results may be affected by inflammation and metabolic conditions.

The DEcancer study by Halner et al. [29] employed a multi-modal approach, integrating protein biomarkers, ctDNA, and epidemiological data. This comprehensive strategy demonstrated superior performance compared to other studies, highlighting the critical value of utilizing diverse biomarkers in cancer detection.

Overall, a total of 25 primary cancer types have been addressed across the studies. Lung cancer has been universally considered, highlighting its prominence as the leading cause of cancer-related deaths worldwide, attributed to both its high fatality rate and low early detection rates [1]. Other top fatal cancers, such as colorectal and pancreatic cancer, have been included in most of the studies due to their significant mortality impact [4]. On the other hand, brain cancer, despite its notably low survival rate compared to other cancers, has not been as widely considered [3]. Notably, it has been observed that as the number of tumor types included in an MCED test increases, there is a corresponding decrease in performance. Table 3 summarizes the cancer types that appeared in the studies.

Various ML methods have been applied throughout the literature, broadly categorized into three groups: traditional machine learning methods, deep learning methods, and ensemble methods. The specific methods and their frequency of use are shown in Table 4. RF and LR are the most commonly applied methods. LR is favored for its simplicity and interpretability, making it ideal for clinical applications. On the other hand, RF, an ensemble method, is highly robust and often performs well without extensive tuning. Its ability to handle non-linear data, missing data, and imbalanced datasets makes it particularly suitable for healthcare settings, where real-world data is often complex and unstructured. Furthermore, as reflected in the studies, stacking ensemble ML models demonstrate superior performance compared to other methods by combining the strengths of multiple algorithms.

In analyzing ML approaches, it is crucial to address the inherent challenges posed by the heterogeneity and high dimensionality of omics data. Robust variable selection methods, such as LASSO, elastic net, and other regularization techniques, have been extensively employed to enhance model stability and performance. These methods help mitigate the risks of overfitting and ensure that the models are capable of generalizing well to new data. As highlighted in the study by Wu and Ma [30], overlooking these aspects can lead to unstable and irreproducible findings. Therefore, a critical evaluation of the robustness of machine learning models is essential in the context of MCED, and future research should prioritize the development of methods that can effectively handle these complexities.

With regard to study design, most studies implemented retrospective case-control approaches. Only OneTest used real-world data, which better fits the purpose of cancer screening and supports generalizability. However, this comes with the complexity of data inconsistency.

MCED tests represent a promising frontier in oncology, offering the potential to revolutionize cancer diagnosis and significantly improve patient outcomes. Despite the significant advancements made to date, future work is critical to address several challenges and enhance the effectiveness of these tests. Future work can be summarized as follows:

Prospective real-world data studies while optimizing irregular and missing data.
Increase scope of cancer types especially top fatal and low survival rate cancers.
Generalizability to different populations.
Improve staging data collection and balance cancer type representation.
Understanding how these biomarkers are linked to cancer by implementing explainable ML.
Large scale clinical validation.
Cohort expansion.
Exploring additional biomarkers.
Combine both cfDNA and cfRNA to improve the overall accuracy.
Cost effectiveness analysis.

In conducting this review, several limitations must be acknowledged. First, the scope and coverage of the literature included were constrained by both the exclusive focus on studies available in the PubMed and ScienceDirect databases and the selection of studies published within the last five years. While these databases are comprehensive, this approach may have excluded relevant research published in other databases or journals, particularly those in languages other than English, potentially impacting the completeness of the review. Additionally, the heterogeneity of the included studies, especially in terms of design, population, and outcomes, presents challenges in drawing generalized conclusions, necessitating a more cautious interpretation of the findings. Furthermore, the rapidly evolving nature of the research field under review means that new studies could emerge that might alter the conclusions drawn here. As such, the findings of this review should be viewed as reflective of the current state of knowledge, acknowledging that this may change with future research.

8. Conclusions

This study emphasizes the need for precise, minimally invasive early cancer detection methods to reduce mortality. MCED tests show promise by utilizing advanced assay technologies and data science, and they have gained considerable interest from the research community in recent years. In this paper, we systematically explore various MCED studies and their applied ML models for different types of biomarker data. We discuss the strengths and limitations of different study designs and compare their performances. Future directions are proposed, emphasizing the importance of enhancing generalizability by expanding to different cohorts, enhancing model transparency, and fostering collaborative efforts to develop robust, cost-effective, and clinically applicable MCED tools.

Author Contributions

Conceptualization, M.H., G.A., and S.A.; methodology, M.H.; validation, M.H.; writing, M.H.; supervision, G.A. and S.A.; project administration, S.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

Siegel, R.L.; Miller, K.D.; Wagle, N.S.; Jemal, A. Cancer statistics, 2023. CA Cancer J. Clin. 2023, 73, 17–48. [Google Scholar] [CrossRef] [PubMed]
Yabroff, K.R.; Bradley, C.J.; Mariotto, A.B.; Brown, M.L.; Feuer, E.J. Estimates and projections of value of life lost from cancer deaths in the United States. JAMA Oncol. 2021, 7, 1446–1454. [Google Scholar] [CrossRef]
Allemani, C.; Matsuda, T.; Di Carlo, V.; Harewood, R.; Matz, M.; Nikšić, M.; Bonaventure, A.; Valkov, M.; Johnson, C.J.; Estève, J.; et al. Global surveillance of trends in cancer survival 2000–14 (CONCORD-3): Analysis of individual records for 37,513,025 patients diagnosed with one of 18 cancers from 322 population-based registries in 71 countries. Lancet 2018, 391, 1023–1075. [Google Scholar] [CrossRef] [PubMed]
Smith, R.A.; Andrews, K.S.; Brooks, D.; Fedewa, S.A.; Manassaram-Baptiste, D.; Saslow, D.; Brawley, O.W.; Wender, R.C. Cancer screening in the United States, 2018: A review of current American Cancer Society guidelines and current issues in cancer screening. CA Cancer J. Clin. 2018, 68, 297–316. [Google Scholar] [CrossRef] [PubMed]
Society, A.C.; Facts, C. Cancer Facts & Figures 2023. Available online: https://www.cancer.org/research/cancer-facts-statistics/all-cancer-facts-figures.html (accessed on 31 August 2024).
Wan, J.C.M.; Massie, C.; Garcia-Corbacho, J.; Mouliere, F.; Brenton, J.D.; Caldas, C.; Pacey, S.; Baird, R.; Rosenfeld, N. Liquid biopsies come of age: Towards implementation of circulating tumour DNA. Nat. Rev. Cancer 2017, 17, 223–238.
Heitzer, E.; Haque, I.S.; Roberts, C.E.S.; Speicher, M.R. Current and future perspectives of liquid biopsies in genomics-driven oncology. Nat. Rev. Genet. 2019, 20, 71–88.
Brito-Rocha, T.; Constâncio, V.; Henrique, R.; Jerónimo, C. Shifting the Cancer Screening Paradigm: The Rising Potential of Blood-Based Multi-Cancer Early Detection Tests. Cells 2023, 12, 935.
Yang, J.; Nittala, M.R.; Velazquez, A.E.; Buddala, V.; Vijayakumar, S. An Overview of the Use of Precision Population Medicine in Cancer Care: First of a Series. Cureus 2023, 15, e37889.
Wang, H.-Y.; Lin, W.-Y.; Zhou, C.; Yang, Z.-A.; Kalpana, S.; Lebowitz, M.S. Integrating Artificial Intelligence for Advancing Multiple-Cancer Early Detection via Serum Biomarkers: A Narrative Review. Cancers 2024, 16, 862.
Jamshidi, A.; Liu, M.C.; Klein, E.A.; Venn, O.; Hubbell, E.; Beausang, J.F.; Gross, S.; Melton, C.; Fields, A.P.; Liu, Q.; et al. Evaluation of cell-free DNA approaches for multi-cancer early detection. Cancer Cell 2022, 40, 1537–1549.e12.
Liu, M.C.; Oxnard, G.R.; Klein, E.A.; Swanton, C.; Seiden, M.V.; CCGA Consortium. Sensitive and specific multi-cancer detection and localization using methylation signatures in cell-free DNA. Ann. Oncol. 2020, 31, 745–759.
Klein, E.; Richards, D.; Cohn, A.; Tummala, M.; Lapham, R.; Cosgrove, D.; Chung, G.; Clement, J.; Gao, J.; Hunkapiller, N.; et al. Clinical validation of a targeted methylation-based multi-cancer early detection test using an independent validation set. Ann. Oncol. 2021, 32, 1167–1177.
Chen, X.; Dong, Z.; Hubbell, E.; Kurtzman, K.N.; Oxnard, G.R.; Venn, O.; Melton, C.; Clarke, C.A.; Shaknovich, R.; Ma, T.; et al. Prognostic significance of blood-based multi-cancer detection in plasma cell-free DNA. Clin. Cancer Res. 2021, 27, 4221–4229.
Nicholson, B.D.; Oke, J.; Virdee, P.S.; Harris, D.A.; O’Doherty, C.; Park, J.E.; Hamady, Z.; Sehgal, V.; Millar, A.; Medley, L.; et al. Multi-cancer early detection test in symptomatic patients referred for cancer investigation in England and Wales (SYMPLIFY): A large-scale, observational cohort study. Lancet Oncol. 2023, 24, 733–743.
Nguyen, V.T.C.; Nguyen, T.H.; Doan, N.N.T.; Pham, T.M.Q.; Nguyen, G.T.H.; Vo, D.L.; Phan, T.H.; Jasmine, T.X.; Nguyen, H.T.; Nguyen, T.V.; et al. Multimodal analysis of methylomics and fragmentomics in plasma cell-free DNA for multi-cancer early detection and localization. eLife 2023, 12, RP89083.
Bao, H.; Wang, Z.; Ma, X.; Guo, W.; Zhang, X.; Tang, W.; Chen, X.; Wang, X.; Chen, Y.; Mo, S.; et al. Letter to the Editor: An ultra-sensitive assay using cell-free DNA fragmentomics for multi-cancer early detection. Mol. Cancer 2022, 21, 129.
Moldovan, N.; van der Pol, Y.; Ende, T.v.D.; Boers, D.; Verkuijlen, S.; Creemers, A.; Ramaker, J.; Vu, T.; Bootsma, S.; Lenos, K.J.; et al. Multi-modal cell-free DNA genomic and fragmentomic patterns enhance cancer survival and recurrence analysis. Cell Rep. Med. 2024, 5, 101349.
Zhang, Z.; Pi, X.; Gao, C.; Zhang, J.; Xia, L.; Yan, X.; Hu, X.; Yan, Z.; Zhang, S.; Wei, A.; et al. Integrated fragmentomic profile and 5-Hydroxymethylcytosine of capture-based low-pass sequencing data enables pan-cancer detection via cfDNA. Transl. Oncol. 2023, 34, 101694.
Gao, Q.; Lin, Y.; Li, B.; Wang, G.; Dong, L.; Shen, B.; Lou, W.; Wu, W.; Ge, D.; Zhu, Q.; et al. Unintrusive multi-cancer detection by circulating cell-free DNA methylation sequencing (THUNDER): Development and independent validation studies. Ann. Oncol. 2023, 34, 486–495.
Helzer, K.; Sharifi, M.; Sperger, J.; Shi, Y.; Annala, M.; Bootsma, M.; Reese, S.; Taylor, A.; Kaufmann, K.; Krause, H.; et al. Fragmentomic analysis of circulating tumor DNA targeted cancer panels. Ann. Oncol. 2023, 34, 813–825.
Schrag, D.; Beer, T.M.; McDonnell, C.H.; Nadauld, L.; Dilaveri, C.A.; Reid, R.; Marinac, C.R.; Chung, K.C.; Lopatin, M.; Fung, E.T.; et al. Blood-based tests for multicancer early detection (PATHFINDER): A prospective cohort study. Lancet 2023, 402, 1251–1260.
Wu, P.; Li, D.; Zhang, C.; Dai, B.; Tang, X.; Liu, J.; Wu, Y.; Wang, X.; Shen, A.; Zhao, J.; et al. A unique circulating microRNA pairs signature serves as a superior tool for early diagnosis of pan-cancer. Cancer Lett. 2024, 588, 216655.
Veld, S.G.I.; Arkani, M.; Post, E.; Antunes-Ferreira, M.; D’ambrosi, S.; Vessies, D.C.; Vermunt, L.; Vancura, A.; Muller, M.; Niemeijer, A.-L.N.; et al. Detection and localization of early- and late-stage cancers using platelet RNA. Cancer Cell 2022, 40, 999–1009.e6.
Bratulic, S.; Limeta, A.; Dabestani, S.; Birgisson, H.; Enblad, G.; Stålberg, K.; Hesselager, G.; Häggman, M.; Höglund, M.; Simonson, O.E.; et al. Noninvasive detection of any-stage cancer using free glycosaminoglycans. Proc. Natl. Acad. Sci. USA 2022, 119, e2115328119.
Al Ja’farawy, M.S.; Linh, V.T.N.; Yang, J.-Y.; Mun, C.; Lee, S.; Park, S.-G.; Han, I.W.; Choi, S.; Lee, M.-Y.; Kim, D.-H.; et al. Whole urine-based multiple cancer diagnosis and metabolite profiling using 3D evolutionary gold nanoarchitecture combined with machine learning-assisted SERS. Sens. Actuators B Chem. 2024, 412, 135828.
Luan, Y.; Zhong, G.; Li, S.; Wu, W.; Liu, X.; Zhu, D.; Feng, Y.; Zhang, Y.; Duan, C.; Mao, M. A panel of seven protein tumour markers for effective and affordable multi-cancer early detection by artificial intelligence: A large-scale and multicentre case–control study. EClinicalMedicine 2023, 61, 102041.
Wu, X.; Wang, H.-Y.; Shi, P.; Sun, R.; Wang, X.; Luo, Z.; Zeng, F.; Lebowitz, M.S.; Lin, W.-Y.; Lu, J.-J.; et al. Long short-term memory model—A deep learning approach for medical data with irregularity in cancer predication with tumor markers. Comput. Biol. Med. 2022, 144, 105362.
Halner, A.; Hankey, L.; Liang, Z.; Pozzetti, F.; Szulc, D.A.; Mi, E.; Liu, G.; Kessler, B.M.; Syed, J.; Liu, P.J. DEcancer: Machine learning framework tailored to liquid biopsy based cancer detection and biomarker signature selection. iScience 2023, 26, 106610.
Wu, C.; Ma, S. A selective review of robust variable selection with applications in bioinformatics. Brief Bioinform. 2014, 16, 873–883.

Figure 1. Top Cancers and Percentages of Stage 1 Detection [1,5].

Figure 2. Summary of Search Methodology.

Table 3. Cancer types applied in reviewed studies.

Cancer Type	[11]	[12]	[13]	[15]	[16]	[17]	[18]	[19]	[20]	[21]	[22]	[23]	[24]	[25]	[26]	[27]	[28]	[29]

Cancer types covered (rows, each marked ✓ under the studies that include it): Lung; Colorectal; Pancreatic; Liver and Biliary; Ovarian; Esophageal; Gastrointestinal; Breast; Prostate; Bladder; Lymphoma; Head and neck; Uterine; Brain; Kidney; Myeloma; Skin; Leukemia; Soft Tissue; Anus; Cervical; Thyroid; Penile; Testicle; Thymoma.

Table 4. Machine Learning Methods Applied in Reviewed Studies.

Category	Machine Learning Method	Studies
Classical Methods	Logistic Regression (LR)	[11,12,16,18,20,23,24,26,28,29]
	Support Vector Machine (SVM)	[11,18,20,23,29]
	Generalized Linear Model (GLM)	[17,27]
	Decision Tree (DT)	[28]
	K-Nearest Neighbor (KNN)	[18,28]
	Naive Bayes (NB)	[28]
	Bayesian Logistic Regression (BLR)	[25]
	Bayesian Additive Regression Trees (BART)	[25]
Deep Learning Methods	Long Short-Term Memory (LSTM)	[28]
	Graph Convolutional Neural Networks (GCNN)	[16]
	Neural Networks	[17]
Ensemble Methods	Random Forest (RF)	[11,16,17,18,19,23,24,27,28,29]
	Gradient Boosting Machine (GBM)	[17,27,28]
	Extreme Gradient Boosting (XGB)	[16,17,23]
	Ensemble Stacked Model	[16,17]

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
