2.2.2. ACMG-AMP Criteria Implementation
MARGINAL software implements 17 out of 28 ACMG-AMP criteria automatically. In particular, according to variant annotation, it can automatically generate predictions on ten and seven criteria that conclude a pathogenic and a benign variant classification, respectively (
Table 1). If a criterion is positive, the algorithm will assign 1, otherwise, it will assign 0. The rest of the ACMG-AMP criteria are based on categories that are specific to each case (such as familial co-segregation data or de novo status and allelic data) and require user input. Thus, these were not included in the proposed process.
PVS1 criterion suggests strong pathogenicity and is recommended for loss-of-function (LOF) variants (represented as frameshift indel, stop-gain, stop-loss and splicing variants in canonical transcripts). LOF is a known mechanism of disease in BRCA1 and BRCA2 genes. Considering that the vast majority of LOF variants in these genes are considered pathogenic, with the exception of those located in the last amino acids of the last exon of BRCA2, PVS1 is assigned as 1 for all these variants; for all other variants, PVS1 is assigned a value equal to 0.
For the implementation of PS1, PM5, PP5, and BP6 criteria, as a reference to identify pathogenic and benign variants, we interrogated ClinVar, excluding variants interpreted from a single submitter and variants with conflicting interpretations (review status ≥ 2). More specifically, when a missense variant is pathogenic, then if a different nucleotide change resulting in the same amino acid change is known to be pathogenic, PS1 is assigned a 1. In contrast, if a new missense amino acid change occurs at the same position as another pathogenic missense change, then PM5 is assigned a 1. We used the annotation information derived from VEP related to CDS_position (relative position of base pair in coding sequence), protein_position (relative position of amino acid in protein), amino_acids (amino acid change in case the variant affects the protein-coding sequence) and GIVEN_REF (reference allele from input) fields [
18]. To ensure that these changes do not impact splicing, we inferred the “pathogenicity” score from MMSplice for missense variants (pathogenicity score < 0.85). Concerning PP5 and BP6 criteria, if a variant has recently been identified as pathogenic by a reputable source, but an independent evaluation cannot be performed, then PP5 is applied. If a variant has recently been identified as benign by a reputable source, but an independent evaluation is not possible, then BP6 is applied. As a result, if a variant is identified as “pathogenic” or “likely pathogenic” by ClinVar, PP5 is assigned as 1, otherwise, if a variant is identified as “benign” or “likely benign” by ClinVar, BP6 is assigned as 1. As a result of the limitation mentioned above, i.e., review status ≥ 2, we can consider the PP5 and BP6 criteria to approach PS3 and BS3, respectively, which is to say that they can be considered to be functional evidence [
8].
BA1, BS1 and PM2 criteria that are related to population frequency were implemented based on ExAC [
27] and gnomAD databases [
28] and more specifically the non-TCGA and non-cancer data sets, respectively, considering all populations. If the variant has an allele frequency > 5%, BA1 is assigned as 1. If the allele frequency in a control population is higher than expected for the disorder (in our case, the threshold for rare variants is 1%), BS1 is assigned as 1. For dominant inheritance, as in the case of
BRCA1 and
BRCA2, when a variant is absent in all control subjects from the above databases, PM2 is assigned as 1, as the rarity of the variant advocates for pathogenicity.
When the prevalence of the variant in affected individuals is significantly higher than in controls, then PS4 is applied. As a result of a case-control study (between the CanVaS population of interest and gnomAD exome control database), all variants whose odds ratio is higher than 2 and p-value is less than 0.05 for cancer risk are coded as 1 for PS4. We inferred the odds ratio and p-value from Fisher’s exact test.
If a protein’s length changes due to in-frame insertions or deletions in a non-repeat region or stop-loss variants, PM4 is applied. Our annotation for the repeat region was based on the ‘‘rmsk’’ database from the UCSC Genome Browser (using the UCSC Table Browser data retrieval tool) [
24,
25]. The database records the results of RepeatMasker, which were generated by screening DNA sequences for repeats. If the variants are in-frame insertions or deletions in the non-repeat region, or stop-loss variants, PM4 is assigned as 1. If the variants are in-frame insertions or deletions in the repeat region, BP3 is assigned as 1.
PM1 criterion includes moderate evidence of pathogenicity if specific protein domains are thought to be essential for protein function and/or mutational hot spots and all missense variants in these domains are known to be pathogenic. In addition, benign variants are not located in these domains. For BRCA1 and BRCA2, there are critical and well-established functional domains, but not without benign variants. Therefore, PM1 is assigned as 0.
If a missense variant occurs in a gene in which missense variants often result in disease and which tends to have fewer benign missense variants, PP2 is applied. However, for a missense variant in a gene where truncating variants are the predominant cause of disease, BP1 is applied. Since there are not insufficient data for BRCA1 and BRCA2 concerning these criteria, PP2 and BP1 are assigned as 0.
If multiple lines of computational evidence point to a deleterious effect of a variant on the gene or gene product (conservation, evolutionary, splicing impact, etc.), then the supporting pathogenic evidence of PP3 is assigned as 1. However, if multiple lines of computational evidence suggest no impact on the gene or gene product, BP4 is assigned as 1 for supporting benign evidence. Considering that all in silico programs agree on the prediction, this evidence can be considered supporting, and PP3 or BP4 is applied. The computational data can be obtained via VEP from the “dbNSFP” database, using MetaSVM and Condel predictions [
29] to predict deleteriousness and the GERP++ score (GERP++_RS) to predict evolutionary conservation. The impact of splicing can be assessed using VEP from the “dbscSNV” database, using adaptive boosting and random forest scores (ada_score and rf_score, respectively). If at least one of the above in silico data exists and MetaSVM and Condel predictions indicate a deleterious variant, GERP++_RS is higher than 2, ada_score and rf_score are higher than 0.6 and PP3 is assigned as 1, otherwise, BP4 is assigned as 1.
In the case of a synonymous variant, known not to alter splicing and the nucleotide position not being highly conserved, BP7 is applied. By using VEP, the effect on splicing can be predicted from the ‘‘dbscSNV’’ database and the conservation information can be inferred from the “dbNSFP” database as above. If ada_score and rf_score are less than 0.6 and GERP++_RS is less than 2, BP7 is assigned as 1.
2.2.3. Machine Learning Modeling
In this study, a machine learning model is proposed in order to discriminate between P/LP, VUS and B/LB variants. After a detailed analysis, we concluded that the optimal classification result is achieved via two two-tier serial classifiers. We compared the performance of eight machine learning algorithms, namely, logistic regression (LR), linear discriminant analysis (LDA), k-nearest neighbors classifier (KNN), decision tree classifier (CART), naive Bayes (BernoulliNB), support vector machine (SVM), random forest (RF), and multi-layer perceptron (MLP) classifier, in a classification scheme based on a serial combination of two classifiers. In particular, instead of a typical three-tier classifier, we propose two serial classifiers to reduce the three-tier to two-tier classification to simplify the machine learning model and thus increase separation power. In this way, as illustrated in
Figure 1, the same feature vectors were used for both classifiers. The first classifier was trained for separating VUS from all other variants (B/LB and P/LP, labeled as “other variants”) and the second classifier was trained to discriminate between the remaining two classes—B/LB and P/LP.
Machine learning algorithms were trained based on the scikit-learn library in Python on version 1.0.2 [
19]. The training set consisted of 80% of the total variants, while the test set consisted of the remaining 20% (no other criteria were applied to select variants). The different models were trained using three-fold cross-validation to evaluate and compare their performances, and this procedure was repeated 20 times. To determine the best approach, we used accuracy, F1 measure, precision, recall and specificity. When implementing algorithms, the default scikit-learn library parameters were used unless otherwise specified.
In addition, an optimization step was performed on the parameters of each algorithm to improve their classification performance. The parameter selected for LR was solver = ‘newton-cg’, for BernoulliNB was fit_prior = none, for SVM was kernel = ‘linear’; the parameters for RF were n_estimators = 200, max_features = 11 and for MLP were hidden_layer_sizes = 11, solver = ‘lbfgs’ and max_iter = 400.
To evaluate the significance of the features used to train the model, we applied feature importance analysis with SHAP (Shapley additive explanations) values [
30] on the training set, using the SHAP Python library. This method allows us to determine precisely how the features contribute to the model output. Furthermore, we computed Spearman’s correlation [
31] among features to examine feature collinearity. As a result of our analysis based on the CanVaS database, we found that RF (nonlinear machine learning classification algorithm) or MLP (fully connected class of feedforward artificial neural network) are the best options for classifier 1, and LR (linear) or BernoulliNB (a variant of naive Bayes, a classification algorithm of machine learning based on Bayes theorem) for classifier 2. The flow diagram of the whole process is presented in
Figure 1.