Figure 1.
The SARS-CoV-2 genome is roughly 29–30 kb in length and encodes structural and non-structural proteins. Open reading frame (ORF) 1ab encodes the non-structural proteins, and the four structural proteins, S (spike), E (envelope), M (membrane), and N (nucleocapsid), are encoded by their respective genes. The spike region is composed of 3821 base pairs, hence coding for 1274 amino acids.
Figure 2.
Example of 4-mers of the amino acid sequence “VLPLVFVFVFM”.
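The sliding-window extraction shown in the figure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the function names `kmers` and `kmer_counts` are hypothetical, and the frequency-vector step is assumed to be a simple count of each distinct k-mer.

```python
from collections import Counter


def kmers(sequence, k):
    """Return every length-k substring, sliding one residue at a time."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_counts(sequence, k):
    """Count how often each distinct k-mer occurs in the sequence."""
    return Counter(kmers(sequence, k))


four_mers = kmers("VLPLVFVFVFM", 4)
print(four_mers)
# ['VLPL', 'LPLV', 'PLVF', 'LVFV', 'VFVF', 'FVFV', 'VFVF', 'FVFM']
print(kmer_counts("VLPLVFVFVFM", 4)["VFVF"])  # 2
```

A sequence of length n yields n - k + 1 overlapping k-mers, so the 11-residue example produces eight 4-mers, one of which (VFVF) occurs twice.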
Figure 3.
The distortion score (blue line) for different numbers of clusters using k-means. The dashed green line shows the runtime (in s). The dashed black line shows the optimal number of clusters computed using the Elbow method [60].
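To make the quantity on the y-axis concrete, the sketch below computes a distortion score for several cluster counts. It assumes the common definition of distortion (the sum of squared distances from each point to its nearest centroid) and uses a plain Lloyd's k-means. The toy points, the seed, and the helper names are illustrative assumptions, not the setup used for the figure.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        updated = []
        for j, members in enumerate(clusters):
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])  # leave an empty cluster's centroid in place
        if updated == centroids:
            break  # assignments no longer change
        centroids = updated
    return centroids


def distortion(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Three tight groups of toy 2-D points (assumed data, for illustration only).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 8), (5, 9), (6, 8), (6, 9)]

scores = {k: distortion(points, kmeans(points, k)) for k in range(1, 7)}
for k, score in scores.items():
    print(k, round(score, 2))
```

The elbow is read off as the cluster count after which the distortion stops dropping sharply; adding clusters beyond that point buys little extra compactness.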
Figure 4.
t-SNE plots for the original data and for different feature selection methods applied on the original data.
Figure 5.
(a) t-SNE plot with the original variants as labels; (b) t-SNE plot with labels obtained after applying HDBSCAN without any feature selection method on the frequency vectors.
Figure 6.
t-SNE plots for HDBSCAN without and with feature selection methods.
Figure 7.
Runtime for different clustering methods (no feature selection method). The x-axis shows number of clusters.
Figure 8.
Runtime for different clustering methods (random Fourier feature selection). The x-axis shows number of clusters.
Figure 9.
Runtime for different clustering methods (Boruta feature selection). The x-axis shows number of clusters.
Figure 10.
Runtime for different clustering methods (LASSO feature selection). The x-axis shows number of clusters.
Table 1.
Variant information and distribution in the dataset. The S/Gen. column represents the number of mutations in the S-region/entire genome. The total number of amino acid sequences in our dataset is 62,657.
Pango Lineage | Region | Label | No. Mutations (S/Gen.) | No. Sequences |
---|---|---|---|---|
B.1.1.7 | UK [12] | Alpha | 8/17 | 13,966 |
B.1.351 | South Africa [12] | Beta | 9/21 | 1727 |
B.1.617.2 | India [13] | Delta | 8/17 | 7551 |
P.1 | Brazil [14] | Gamma | 10/21 | 26,629 |
B.1.427 | California [15] | Epsilon | 3/5 | 12,784 |
Table 2.
Variant-wise (weighted) score with mean, standard deviation (S.D.), and runtime for different clustering methods with the number of clusters set to 5. Best values are shown in bold.
Embed. | Algorithm | Alpha | Beta | Delta | Gamma | Epsilon | Mean | S.D. | Runtime (s) |
---|---|---|---|---|---|---|---|---|---|
OHE | k-means | 0.36 | 0.05 | 0.70 | 0.46 | 0.68 | 0.45 | 0.27 | 553.95 |
OHE | k-means + Boruta | 0.62 | 0.05 | 0.84 | 0.98 | 0.69 | 0.64 | 0.36 | 20.01 |
OHE | k-means + LASSO | 0.50 | 0.05 | 0.70 | 0.69 | 0.68 | 0.52 | 0.28 | 61.78 |
OHE | k-means + RFF | 0.98 | 0.0 | 0.29 | 0.99 | 0.99 | 0.65 | 0.47 | 28.02 |
OHE | k-modes | 0.98 | 0.01 | 0.99 | 0.98 | 0.90 | 0.77 | 0.43 | 198,926.91 |
OHE | k-modes + Boruta | 0.99 | 0.01 | 0.80 | 0.98 | 0.77 | 0.71 | 0.40 | 6646.35 |
OHE | k-modes + LASSO | 0.98 | 0.05 | 0.56 | 0.98 | 0.75 | 0.66 | 0.39 | 36,435.04 |
OHE | k-modes + RFF | 0.99 | 0.0 | 0.23 | 0.99 | 0.99 | 0.64 | 0.49 | 1370.85 |
OHE | Fuzzy | 0.96 | 0.01 | 0.68 | 0.98 | 0.80 | 0.69 | 0.40 | 29,193.81 |
OHE | Fuzzy + Boruta | 0.60 | 0.26 | 0.61 | 0.98 | 0.68 | 0.63 | 0.26 | 664.19 |
OHE | Fuzzy + LASSO | 0.96 | 0.01 | 0.74 | 0.98 | 0.80 | 0.70 | 0.40 | 3958.60 |
OHE | Fuzzy + RFF | 0.44 | 0.0 | 0.0 | 1.00 | 1.00 | 0.49 | 0.50 | 395.95 |
OHE | Hierarchical | 0.55 | 0.04 | 0.69 | 0.70 | 0.68 | 0.53 | 0.28 | 106,092.34 |
OHE | Hierarchical + Boruta | 0.59 | 0.25 | 0.60 | 0.99 | 0.67 | 0.62 | 0.26 | 2761.91 |
OHE | Hierarchical + LASSO | 0.55 | 0.04 | 0.19 | 0.70 | 0.56 | 0.41 | 0.28 | 8650.66 |
OHE | Hierarchical + RFF | 0.98 | 0.00 | 0.29 | 0.98 | 0.98 | 0.65 | 0.47 | 2542.87 |
k-mers | k-means | 0.4 | 0.15 | 0.61 | 0.69 | 0.44 | 0.45 | 0.21 | 66.43 |
k-mers | k-means + Boruta | 0.42 | 0.10 | 0.61 | 0.69 | 0.65 | 0.50 | 0.24 | 15.77 |
k-mers | k-means + LASSO | 0.99 | 0.007 | 0.84 | 0.99 | 0.77 | 0.72 | 0.41 | 9.56 |
k-mers | k-means + RFF | 1.00 | 0.0 | 0.28 | 1.00 | 1.00 | 0.66 | 0.48 | 8.65 |
k-mers | k-modes | 0.99 | 0.005 | 0.87 | 0.99 | 0.77 | 0.73 | 0.42 | 17,580.25 |
k-mers | k-modes + Boruta | 0.99 | 0.31 | 0.86 | 0.99 | 0.85 | 0.81 | 0.28 | 2965.03 |
k-mers | k-modes + LASSO | 0.99 | 0.17 | 0.99 | 0.99 | 0.07 | 0.63 | 0.47 | 784.20 |
k-mers | k-modes + RFF | 1.00 | 0.00 | 0.0 | 0.61 | 1.00 | 0.52 | 0.50 | 794.56 |
k-mers | Fuzzy | 0.35 | 0.10 | 0.61 | 0.69 | 0.44 | 0.44 | 0.23 | 2358.84 |
k-mers | Fuzzy + Boruta | 0.36 | 0.15 | 0.61 | 0.69 | 0.44 | 0.45 | 0.21 | 230.17 |
k-mers | Fuzzy + LASSO | 0.99 | 0.31 | 0.65 | 0.99 | 0.82 | 0.76 | 0.29 | 94.36 |
k-mers | Fuzzy + RFF | 0.44 | 0.0 | 0.0 | 1.00 | 0.0 | 0.29 | 0.44 | 460.82 |
k-mers | Hierarchical | 0.32 | 0.10 | 0.58 | 0.70 | 0.46 | 0.43 | 0.23 | 9934.57 |
k-mers | Hierarchical + Boruta | 0.36 | 0.14 | 0.63 | 0.68 | 0.46 | 0.45 | 0.22 | 1908.75 |
k-mers | Hierarchical + LASSO | 0.99 | 0.58 | 0.58 | 0.99 | 0.83 | 0.80 | 0.21 | 713.07 |
k-mers | Hierarchical + RFF | 1.00 | 0.00 | 0.28 | 1.00 | 1.00 | 0.66 | 0.48 | 1120.14 |
Table 3.
Clustering quality metrics for different clustering methods with the number of clusters set to 5. Best values are shown in bold.
Embedding | Algorithm | Silhouette Coefficient | Calinski–Harabasz Index | Davies–Bouldin Index | Runtime (s) |
---|---|---|---|---|---|
OHE | k-means | 0.48 | 13,089.36 | 1.38 | 553.95 |
OHE | k-means + Boruta | 0.34 | 17,478.82 | 1.4 | 20.01 |
OHE | k-means + LASSO | 0.48 | 13,112.61 | 1.38 | 61.78 |
OHE | k-means + RFF | 0.25 | 4855.80 | 2.04 | 28.02 |
OHE | k-modes | 0.26 | 6626.82 | 2.59 | 198,926.91 |
OHE | k-modes + Boruta | 0.33 | 13,146.09 | 1.89 | 6646.35 |
OHE | k-modes + LASSO | 0.28 | 9596.01 | 1.92 | 36,435.04 |
OHE | k-modes + RFF | | 1389.75 | 1.07 | 1370.85 |
OHE | Fuzzy | 0.27 | 8821.95 | 2.13 | 29,193.81 |
OHE | Fuzzy + Boruta | 0.33 | 13,397.80 | 1.89 | 664.19 |
OHE | Fuzzy + LASSO | 0.26 | 8663.69 | 2.17 | 3958.60 |
OHE | Fuzzy + RFF | 0.19 | 12,069.50 | 1.09 | 395.95 |
OHE | Hierarchical | 0.44 | 10,516.02 | 1.84 | 106,092.34 |
OHE | Hierarchical + Boruta | 0.32 | 13,324.79 | 1.60 | 2761.91 |
OHE | Hierarchical + LASSO | 0.42 | 10,529.10 | 1.61 | 8650.66 |
OHE | Hierarchical + RFF | 0.26 | 5010.25 | 1.00 | 2542.87 |
k-mers | k-means | 0.75 | 329,866.74 | 0.45 | 66.43 |
k-mers | k-means + Boruta | 0.76 | 342,083.63 | 0.43 | 15.77 |
k-mers | k-means + LASSO | 0.42 | 17,269.34 | 1.52 | 9.56 |
k-mers | k-means + RFF | 0.25 | 5251.93 | 1.55 | 8.65 |
k-mers | k-modes | 0.05 | 10,257.1 | 7.11 | 17,580.25 |
k-mers | k-modes + Boruta | 0.07 | 12,058.15 | 6.52 | 2965.03 |
k-mers | k-modes + LASSO | 0.42 | 15,704.43 | 1.54 | 784.20 |
k-mers | k-modes + RFF | | 1366.91 | 1.07 | 794.56 |
k-mers | Fuzzy | 0.75 | 329,410.74 | 0.44 | 2358.84 |
k-mers | Fuzzy + Boruta | 0.76 | 341,678.01 | 0.43 | 230.17 |
k-mers | Fuzzy + LASSO | 0.41 | 15,010.79 | 2.58 | 94.36 |
k-mers | Fuzzy + RFF | 0.19 | 13,293.96 | 0.99 | 460.82 |
k-mers | Hierarchical | 0.72 | 284,726.92 | 0.47 | 9934.57 |
k-mers | Hierarchical + Boruta | 0.73 | 280,129.56 | 0.42 | 1908.75 |
k-mers | Hierarchical + LASSO | 0.41 | 15,218.95 | 2.03 | 713.07 |
k-mers | Hierarchical + RFF | 0.26 | 5258.43 | 1.00 | 1120.14 |
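
When reading these columns, higher is better for the silhouette coefficient and the Calinski–Harabasz index, while lower is better for the Davies–Bouldin index. As a rough illustration of the first metric, the sketch below computes a mean silhouette coefficient with Euclidean distance on toy data. The function name, the singleton-cluster convention, and the sample points are assumptions for illustration, not the implementation behind the table.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # assumed convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Two well-separated 1-D groups (toy data).
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
score = silhouette_score(points, labels)
print(round(score, 4))  # 0.8997
```

Values near 1 indicate compact, well-separated clusters; values near 0 indicate overlapping clusters.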
Table 4.
Contingency tables of variants vs. clusters (no feature selection, k-mers) with the number of clusters set to 5.
Variant | k-means 0 | k-means 1 | k-means 2 | k-means 3 | k-means 4 | k-modes 0 | k-modes 1 | k-modes 2 | k-modes 3 | k-modes 4 |
---|---|---|---|---|---|---|---|---|---|---|
Alpha | 1512 | 8762 | 2926 | 680 | 86 | 8 | 11,492 | 284 | 330 | 1852 |
Beta | 295 | 601 | 626 | 172 | 33 | 64 | 9 | 1604 | 31 | 19 |
Epsilon | 956 | 7848 | 3155 | 638 | 187 | 0 | 1 | 8532 | 613 | 3638 |
Delta | 2706 | 2605 | 1342 | 868 | 30 | 0 | 1 | 3192 | 3491 | 867 |
Gamma | 682 | 22,140 | 3016 | 741 | 50 | 26,519 | 7 | 7 | 61 | 35 |
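
A contingency table of this kind can be built by counting (variant, cluster) pairs. The sketch below is a minimal illustration with hypothetical helper names and toy labels; the final call reuses the Alpha row of the k-modes columns above to show how a row can be summarized by its dominant cluster.

```python
from collections import Counter


def contingency_table(true_labels, cluster_ids):
    """Map each label to its per-cluster counts (clusters in sorted order)."""
    counts = Counter(zip(true_labels, cluster_ids))
    labels = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    return {label: [counts[(label, c)] for c in clusters] for label in labels}


def dominant_cluster(row):
    """Return the cluster ID holding most of the row and its share of the row."""
    best = max(range(len(row)), key=row.__getitem__)
    return best, row[best] / sum(row)


# Toy labels and cluster assignments (assumed data).
variants = ["Alpha", "Alpha", "Beta", "Beta", "Beta"]
assigned = [1, 1, 0, 1, 0]
table = contingency_table(variants, assigned)
print(table)  # {'Alpha': [0, 2], 'Beta': [2, 1]}

# Alpha row of the k-modes columns in the table above.
cluster, share = dominant_cluster([8, 11492, 284, 330, 1852])
print(cluster, round(share, 2))  # 1 0.82
```

A row concentrated in one column means the variant was recovered as a single cluster; a row spread across columns means the method split that variant.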
Table 5.
Contingency tables of variants vs. clusters (no feature selection, one-hot embedding) with the number of clusters set to 5.
Variant | k-means 0 | k-means 1 | k-means 2 | k-means 3 | k-means 4 | k-modes 0 | k-modes 1 | k-modes 2 | k-modes 3 | k-modes 4 |
---|---|---|---|---|---|---|---|---|---|---|
Alpha | 52 | 9716 | 1601 | 2165 | 432 | 11,235 | 13 | 2316 | 25 | 377 |
Beta | 24 | 650 | 1011 | 18 | 24 | 5 | 81 | 30 | 29 | 1582 |
Epsilon | 427 | 7972 | 223 | 3626 | 536 | 8 | 0 | 4111 | 18 | 8647 |
Delta | 5 | 2800 | 484 | 1201 | 3061 | 3 | 0 | 3988 | 3490 | 70 |
Gamma | 513 | 23,032 | 1425 | 70 | 1589 | 6 | 26,502 | 114 | 0 | 7 |
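
For readers unfamiliar with the one-hot embedding (OHE) named in the caption, the sketch below shows one common way to encode an amino acid sequence as a binary vector. The 20-letter alphabet, the all-zero handling of unknown symbols, and the function name are assumptions; the alphabet used to produce the table may differ (for example, it may include a stop symbol).

```python
# Assumed alphabet: the 20 standard amino acids in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence, alphabet=AMINO_ACIDS):
    """Concatenate one binary indicator block per residue."""
    index = {residue: i for i, residue in enumerate(alphabet)}
    vector = []
    for residue in sequence:
        block = [0] * len(alphabet)
        if residue in index:  # unknown symbols stay all-zero (an assumption)
            block[index[residue]] = 1
        vector.extend(block)
    return vector


encoded = one_hot("VLPL")
print(len(encoded))  # 80 = 4 residues x 20 symbols
print(sum(encoded))  # 4, one active bit per residue
```

The vector length grows linearly with sequence length, which is one reason the OHE rows in the runtime columns tend to be more expensive than the k-mers rows.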
Table 6.
Contingency tables of variants vs. clusters (no feature selection, k-mers) with the number of clusters set to 5.
Variant | Fuzzy 0 | Fuzzy 1 | Fuzzy 2 | Fuzzy 3 | Fuzzy 4 | Hierarchical 0 | Hierarchical 1 | Hierarchical 2 | Hierarchical 3 | Hierarchical 4 |
---|---|---|---|---|---|---|---|---|---|---|
Alpha | 666 | 1515 | 78 | 2945 | 8762 | 1772 | 3442 | 650 | 8036 | 66 |
Beta | 171 | 279 | 31 | 627 | 601 | 501 | 491 | 164 | 544 | 27 |
Epsilon | 637 | 942 | 186 | 3172 | 7847 | 1166 | 3804 | 636 | 6994 | 184 |
Delta | 839 | 2725 | 28 | 1354 | 2605 | 2997 | 1292 | 827 | 2411 | 24 |
Gamma | 739 | 669 | 47 | 3034 | 22,140 | 865 | 3501 | 734 | 21,484 | 45 |
Table 7.
Contingency tables of variants vs. clusters (no feature selection, one-hot embedding) with the number of clusters set to 5.
Variant | Fuzzy 0 | Fuzzy 1 | Fuzzy 2 | Fuzzy 3 | Fuzzy 4 | Hierarchical 0 | Hierarchical 1 | Hierarchical 2 | Hierarchical 3 | Hierarchical 4 |
---|---|---|---|---|---|---|---|---|---|---|
Alpha | 629 | 11,005 | 2101 | 226 | 5 | 2845 | 8651 | 32 | 244 | 2194 |
Beta | 213 | 508 | 19 | 987 | 0 | 1002 | 605 | 18 | 35 | 67 |
Epsilon | 713 | 16 | 3597 | 8458 | 0 | 520 | 7324 | 426 | 457 | 4057 |
Delta | 3419 | 74 | 1042 | 3016 | 0 | 1168 | 2547 | 2 | 2479 | 1355 |
Gamma | 1590 | 262 | 88 | 61 | 24,628 | 2008 | 22,333 | 497 | 1522 | 269 |
Table 8.
Contingency tables of variants vs. clusters (random Fourier feature selection, k-mers) with the number of clusters set to 5.
Variant | k-means 0 | k-means 1 | k-means 2 | k-means 3 | k-means 4 | k-modes 0 | k-modes 1 | k-modes 2 | k-modes 3 | k-modes 4 |
---|---|---|---|---|---|---|---|---|---|---|
Alpha | 0 | 12,603 | 0 | 1363 | 0 | 12,603 | 0 | 0 | 1363 | 0 |
Beta | 0 | 1727 | 0 | 0 | 0 | 1727 | 0 | 0 | 0 | 0 |
Epsilon | 0 | 10,348 | 0 | 0 | 2436 | 10,348 | 0 | 2436 | 0 | 0 |
Delta | 0 | 7551 | 0 | 0 | 0 | 7551 | 0 | 0 | 0 | 0 |
Gamma | 13,076 | 12,569 | 984 | 0 | 0 | 25,632 | 13 | 0 | 0 | 984 |
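
Random Fourier features (RFF) are commonly described as a randomized map whose inner products approximate a shift-invariant kernel. The sketch below shows the standard construction for the Gaussian (RBF) kernel exp(-gamma * ||x - y||^2). The kernel choice, `gamma`, the feature count, and the function name are assumptions for illustration; the settings behind the table are not stated here.

```python
import math
import random


def make_rff(input_dim, n_features, gamma=1.0, seed=0):
    """Build a random Fourier feature map for the RBF kernel exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    scale = math.sqrt(2.0 * gamma)  # frequencies are drawn from N(0, 2 * gamma)
    weights = [[rng.gauss(0.0, scale) for _ in range(input_dim)]
               for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    norm = math.sqrt(2.0 / n_features)

    def transform(x):
        return [norm * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return transform


transform = make_rff(input_dim=3, n_features=2000)
x = [0.1, 0.2, 0.3]
y = [0.2, 0.1, 0.4]
approx = sum(a * b for a, b in zip(transform(x), transform(y)))
exact = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)))
print(round(exact, 4))  # 0.9704; approx should land close to this value
```

The transformed vectors have a fixed length (`n_features`) regardless of the input dimension, which is what makes the downstream clustering cheaper.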
Table 9.
Contingency tables of variants vs. clusters (random Fourier feature selection, one-hot embedding) with the number of clusters set to 5.
Variant | k-means 0 | k-means 1 | k-means 2 | k-means 3 | k-means 4 | k-modes 0 | k-modes 1 | k-modes 2 | k-modes 3 | k-modes 4 |
---|---|---|---|---|---|---|---|---|---|---|
Alpha | 0 | 0 | 0 | 1363 | 12,603 | 12,603 | 0 | 0 | 0 | 1363 |
Beta | 0 | 0 | 0 | 0 | 1727 | 1727 | 0 | 0 | 0 | 0 |
Epsilon | 0 | 2436 | 0 | 0 | 10,348 | 10,348 | 0 | 2436 | 0 | 0 |
Delta | 0 | 0 | 637 | 0 | 6914 | 7551 | 0 | 0 | 0 | 0 |
Gamma | 13,076 | 0 | 984 | 0 | 12,569 | 25,632 | 13 | 0 | 984 | 0 |
Table 10.
Contingency tables of variants vs. clusters (random Fourier feature selection, k-mers) with the number of clusters set to 5.
Variant | Fuzzy 0 | Fuzzy 1 | Fuzzy 2 | Fuzzy 3 | Fuzzy 4 | Hierarchical 0 | Hierarchical 1 | Hierarchical 2 | Hierarchical 3 | Hierarchical 4 |
---|---|---|---|---|---|---|---|---|---|---|
Alpha | 0 | 0 | 0 | 13,966 | 0 | 12,603 | 0 | 0 | 1363 | 0 |
Beta | 0 | 0 | 0 | 1727 | 0 | 1727 | 0 | 0 | 0 | 0 |
Epsilon | 0 | 0 | 0 | 12,784 | 0 | 10,348 | 0 | 2436 | 0 | 0 |
Delta | 0 | 0 | 0 | 7551 | 0 | 7551 | 0 | 0 | 0 | 0 |
Gamma | 0 | 0 | 0 | 13,553 | 13,076 | 12,569 | 13,076 | 0 | 0 | 984 |
Table 11.
Contingency tables of variants vs. clusters (random Fourier feature selection, one-hot embedding) with the number of clusters set to 5.
Variant | Fuzzy 0 | Fuzzy 1 | Fuzzy 2 | Fuzzy 3 | Fuzzy 4 | Hierarchical 0 | Hierarchical 1 | Hierarchical 2 | Hierarchical 3 | Hierarchical 4 |
---|---|---|---|---|---|---|---|---|---|---|
Alpha | 0 | 0 | 0 | 13,966 | 0 | 12,603 | 0 | 0 | 1363 | 0 |
Beta | 0 | 0 | 0 | 1727 | 0 | 1727 | 0 | 0 | 0 | 0 |
Epsilon | 0 | 0 | 0 | 12,784 | 0 | 10,348 | 0 | 2436 | 0 | 0 |
Delta | 0 | 0 | 0 | 7551 | 0 | 7551 | 0 | 0 | 0 | 0 |
Gamma | 0 | 0 | 0 | 13,269 | 13,360 | 12,569 | 13,076 | 0 | 0 | 984 |
Table 12.
Contingency tables of variants vs. clusters (LASSO feature selection, k-mers) with the number of clusters set to 5.
Variant | k-means 0 | k-means 1 | k-means 2 | k-means 3 | k-means 4 | k-modes 0 | k-modes 1 | k-modes 2 | k-modes 3 | k-modes 4 |
---|---|---|---|---|---|---|---|---|---|---|
Alpha | 303 | 11,365 | 383 | 1909 | 6 | 8 | 10,958 | 282 | 2660 | 58 |
Beta | 1551 | 4 | 148 | 23 | 1 | 65 | 9 | 1617 | 12 | 24 |
Epsilon | 8536 | 1 | 671 | 3576 | 0 | 0 | 1 | 12,000 | 112 | 671 |
Delta | 3098 | 0 | 3693 | 760 | 0 | 0 | 0 | 3121 | 19 | 4411 |
Gamma | 16 | 13 | 198 | 36 | 26,366 | 26,577 | 7 | 7 | 0 | 38 |
Table 13.
Contingency tables of variants vs. clusters (LASSO feature selection, one-hot embedding) with the number of clusters set to 5.
Variant | k-means 0 | k-means 1 | k-means 2 | k-means 3 | k-means 4 | k-modes 0 | k-modes 1 | k-modes 2 | k-modes 3 | k-modes 4 |
---|---|---|---|---|---|---|---|---|---|---|
Alpha | 53 | 9716 | 431 | 2165 | 1601 | 11,208 | 2295 | 13 | 63 | 387 |
Beta | 24 | 650 | 24 | 18 | 1011 | 3 | 28 | 76 | 28 | 1592 |
Epsilon | 427 | 7972 | 536 | 3626 | 223 | 4 | 3981 | 0 | 427 | 8372 |
Delta | 41 | 4015 | 0 | 9 | 3486 | 41 | 4015 | 0 | 9 | 3486 |
Gamma | 514 | 23,031 | 1586 | 70 | 1428 | 5 | 114 | 25,989 | 514 | 7 |
Table 14.
Contingency tables of variants vs. clusters (LASSO feature selection, k-mers) with the number of clusters set to 5.
Variant | Fuzzy 0 | Fuzzy 1 | Fuzzy 2 | Fuzzy 3 | Fuzzy 4 | Hierarchical 0 | Hierarchical 1 | Hierarchical 2 | Hierarchical 3 | Hierarchical 4 |
---|---|---|---|---|---|---|---|---|---|---|
Alpha | 1344 | 5 | 12,042 | 362 | 213 | 1967 | 606 | 30 | 11,345 | 18 |
Beta | 99 | 1 | 6 | 440 | 1181 | 24 | 1667 | 6 | 22 | 8 |
Epsilon | 3220 | 0 | 0 | 780 | 8784 | 3667 | 509 | 8582 | 26 | 0 |
Delta | 4464 | 0 | 0 | 543 | 2544 | 3892 | 245 | 3367 | 40 | 7 |
Gamma | 202 | 26,169 | 16 | 232 | 10 | 12 | 1053 | 1 | 11 | 25,552 |
Table 15.
Contingency tables of variants vs. clusters (LASSO feature selection, one-hot embedding) with the number of clusters set to 5.
Variant | Fuzzy 0 | Fuzzy 1 | Fuzzy 2 | Fuzzy 3 | Fuzzy 4 | Hierarchical 0 | Hierarchical 1 | Hierarchical 2 | Hierarchical 3 | Hierarchical 4 |
---|---|---|---|---|---|---|---|---|---|---|
Alpha | 2079 | 11,051 | 604 | 227 | 5 | 2671 | 8521 | 32 | 2635 | 107 |
Beta | 19 | 476 | 214 | 1018 | 0 | 92 | 566 | 18 | 1031 | 20 |
Epsilon | 3525 | 15 | 712 | 8532 | 0 | 4650 | 7218 | 426 | 457 | 33 |
Delta | 1040 | 70 | 3385 | 3056 | 0 | 4189 | 2473 | 2 | 690 | 197 |
Gamma | 87 | 251 | 799 | 75 | 25,417 | 352 | 22,158 | 497 | 2155 | 1467 |
Table 16.
Contingency tables of variants vs. clusters (Boruta feature selection, k-mers) with the number of clusters set to 5.
Variant | k-means 0 | k-means 1 | k-means 2 | k-means 3 | k-means 4 | k-modes 0 | k-modes 1 | k-modes 2 | k-modes 3 | k-modes 4 |
---|---|---|---|---|---|---|---|---|---|---|
Alpha | 8762 | 86 | 2925 | 680 | 1513 | 11,403 | 7 | 184 | 1823 | 549 |
Beta | 601 | 33 | 626 | 172 | 295 | 6 | 6 | 640 | 1060 | 15 |
Epsilon | 7848 | 187 | 3155 | 638 | 956 | 1 | 0 | 11,170 | 947 | 666 |
Delta | 2605 | 30 | 1342 | 868 | 2706 | 0 | 0 | 2894 | 690 | 3967 |
Gamma | 22,140 | 50 | 3016 | 741 | 682 | 6 | 25,428 | 6 | 1128 | 61 |
Table 17.
Contingency tables of variants vs. clusters (Boruta feature selection, one-hot embedding) with the number of clusters set to 5.
Variant | k-means 0 | k-means 1 | k-means 2 | k-means 3 | k-means 4 | k-modes 0 | k-modes 1 | k-modes 2 | k-modes 3 | k-modes 4 |
---|---|---|---|---|---|---|---|---|---|---|
Alpha | 10 | 368 | 1586 | 9901 | 2101 | 1829 | 7 | 552 | 188 | 11,390 |
Beta | 22 | 12 | 1022 | 652 | 19 | 1060 | 6 | 15 | 640 | 6 |
Epsilon | 5 | 624 | 218 | 8351 | 3586 | 951 | 0 | 671 | 11,161 | 1 |
Delta | 0 | 3073 | 476 | 2988 | 1014 | 695 | 0 | 3970 | 2886 | 0 |
Gamma | 25,015 | 95 | 1417 | 34 | 68 | 1131 | 25,425 | 61 | 6 | 6 |
Table 18.
Contingency tables of variants vs. clusters (Boruta feature selection, k-mers) with the number of clusters set to 5.
Variant | Fuzzy 0 | Fuzzy 1 | Fuzzy 2 | Fuzzy 3 | Fuzzy 4 | Hierarchical 0 | Hierarchical 1 | Hierarchical 2 | Hierarchical 3 | Hierarchical 4 |
---|---|---|---|---|---|---|---|---|---|---|
Alpha | 668 | 1513 | 78 | 2945 | 8762 | 9373 | 702 | 2641 | 1198 | 52 |
Beta | 171 | 297 | 31 | 627 | 601 | 823 | 170 | 457 | 254 | 23 |
Epsilon | 637 | 943 | 186 | 3170 | 7848 | 8419 | 644 | 2949 | 591 | 181 |
Delta | 851 | 2713 | 28 | 1354 | 2605 | 2847 | 879 | 1563 | 2245 | 17 |
Gamma | 739 | 669 | 47 | 3034 | 22,140 | 22,955 | 743 | 2330 | 560 | 41 |
Table 19.
Contingency tables of variants vs. clusters (Boruta feature selection, one-hot embedding) with the number of clusters set to 5.
Variant | Fuzzy 0 | Fuzzy 1 | Fuzzy 2 | Fuzzy 3 | Fuzzy 4 | Hierarchical 0 | Hierarchical 1 | Hierarchical 2 | Hierarchical 3 | Hierarchical 4 |
---|---|---|---|---|---|---|---|---|---|---|
Alpha | 2172 | 11,083 | 438 | 268 | 5 | 1977 | 1585 | 543 | 9839 | 22 |
Beta | 26 | 46 | 103 | 1552 | 0 | 898 | 7 | 198 | 623 | 1 |
Epsilon | 3682 | 4 | 656 | 8442 | 0 | 546 | 2983 | 641 | 8613 | 1 |
Delta | 1049 | 37 | 3269 | 3196 | 0 | 949 | 1235 | 2635 | 2731 | 1 |
Gamma | 96 | 29 | 389 | 139 | 25,976 | 1703 | 26 | 2017 | 1192 | 21,691 |
Table 20.
Number of clusters for HDBSCAN clustering with different feature selection methods.
Algorithm | Number of Clusters |
---|---|
HDBSCAN | 2016 |
HDBSCAN + Boruta | 1419 |
HDBSCAN + LASSO | 1838 |
HDBSCAN + RFF | 1947 |