Author Contributions
Conceptualization, P.F.Z. and S.G.; methodology, P.F.Z. and S.G.; Software, P.F.Z., I.D., and S.C.R.-K.; investigation, P.F.Z. and A.T.C.; data curation, P.F.Z. and A.T.C.; formal analysis, S.C.R.-K., I.D., P.F.Z., and A.T.C.; writing—original draft preparation, P.F.Z.; writing—review and editing, I.D. and S.C.R.-K.; visualization, P.F.Z., I.D., S.C.R.-K., and A.T.C.; supervision, P.F.Z.; funding acquisition, S.G. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Summary of MIBiG module arrangements which occur at least twice in the MIBiG v2 database. The summary includes: functional classification (
a) as explained in
Section 4.2; module type (
b), where mixed refers to modules which are composed of both PKS and NRPS domains (e.g., C-A-ACP); reduction profile of PKS modules (
c), where the special reduction profile ketone * refers to modules in which reductive domains are present but the reduction cannot be performed because the KR module is missing (e.g., KS-AT-DH-ACP); epimerization of the NRPS module (
d), where L refers to the L-configuration of the substrate and E refers to the R-configuration of the substrate; the presence of MTs (
e); number of domains in the modules (
f); classification of PKS modules into normal and trans-AT acting modules (
g); position in the BGC (
h); and number of overall occurrences in the database (
i).
Figure 2.
Phylogenetic composition of the A-domain training dataset. The composition is shown for each protein (a) and all the A-domains (b). Almost all fungi sequences can be attributed to the phylum of ascomycota. The division of others comprises domains where no taxonomy could be found or small groups, such as streptophyta (No. 6), candidatus tectomicrobia (No. 5), and arthropoda (No. 3).
Figure 3.
Screenshots of SeMPI v2 web server features. (a) The input form allows for the selection of the BGC detection window, the number of natural compounds to compare to the predictions, and the selection of different public and local databases for the screening. (b) The results page provides detailed information about the predicted scaffold as well as links to further analysis resources. (c) The genome browser allows the observation of all biosynthetic relevant domains, genes, and clusters in the surrounding genome. (d) The molecular workbench enables expert users to submit custom molecular queries to the database screening.
Figure 4.
Reduction and methyltransferase domain combinations for described PKS modules. The resulting building block modifications are depicted below the domain combination. The methyltransferase position for cis and trans-AT modules are shown. Each domain combination (X) is encapsulated such as KS-(AT)-X-ACP. The figure is based on the article by Nguyen et al. [
18].
Figure 5.
Nomenclature for PK and NRP building blocks. R1 in alpha position is defined by the substrate choice of the AT- or A-domain. The α-position of PK substrates can be further modified co-synthetically by C-methyltransferases. The stereo configuration of the α-carbon of NRP substrates is defined by the presence of an E domain. The β-position of PK substrates can be further reduced. For NRP substrates, only the nitrogen atom, also referred to as β-position, can be methylated.
Figure 6.
Scaffold generation for a hypothetical mixed cluster. The NRP positions are marked α for the N-C-rest and β for the ketone following the PK nomenclature. The substrate specificity leads to a modification in the α-position. The module arrangement results in a modification in the β-position. Each parameter and corresponding modification is color-coded. Loading modules do not add a carbon atom in the β-position, marked with a dashed box (β1). Terminal modules add an additional carboxyl (TE) or hydroxyl (TD) substructure to the scaffold. Epi. refers to the epimerization of the NRP substrates, resulting in L- or D-forms.
Figure 7.
Demonstration of the module order algorithm. (a) Example blocks which cannot be combined since they are on different strand sides or far apart in the cluster. (b) An all-vs.-all matrix which generates hypothetical combined block pairs. Each block can be either at the beginning or the end of the newly generated block. Unreasonable combinations are excluded, marked with a red line. (c) A hypothetical example of 2 alignments performed for the D-A block combination. Each combination is scored with 252 reference BGCs. (d) Based on the score assigned to each block pair, the best scoring pairs are extracted and combined to the final block.
Figure 8.
Explanation of the sequence encoding scheme. (a) The annotated sequences are aligned. (b) Each position in the alignment is encoded in a binary feature using one-hot encoding. (c) The RF models are trained on the encoded arrays. A very simplified decision is demonstrated. On the live system, many more positions are taken into consideration.
Figure 9.
Example demonstration of the extended MCS algorithm. To simplify the example, only two building blocks are used; the algorithm can potentially scale up to 10 building blocks. (a) Initially, the building blocks are ordered by their number of atoms. The matches of the biggest building blocks are most meaningful. (b) All MCSs of the first block (B1) with the target molecule (Mol) are computed. The example shows only one MCS, but some molecules (especially ring systems) can have large numbers of MCS. The number of MCS to find for each B1 in a target molecule (Mol) is limited to 20. (c) A new molecule is created for each MCS, where the MCS is removed from the scaffold. This new molecule is then submitted to a new MCS search with the next building block (B2). (d) A visual example of the scoring algorithm based on two predicted scaffold fragments. First, the dipeptide (B1) is scored; then, the second molecule (B2) is scored on the remaining part of the target molecule (Mol).
Table 1.
Cross-validation results and detection thresholds for biosynthetic domains used in the pipeline. The detection thresholds are referred to as domT by the HMMER manual [
15].
Abbreviation | Name | F1 | Precision | Recall | Support | Threshold |
---|
ACP | Acyl carrier protein | 0.96 | 1.00 | 0.91 | 4315 | 13.9 |
AT | Acyltransferase | 1.00 | 1.00 | 1.00 | 2893 | 47.4 |
A | Adenylation domain | 1.00 | 1.00 | 0.99 | 3092 | 19.6 |
CAL | Coenzyme A ligase | 0.94 | 0.89 | 1.00 | 180 | 221.8 |
C | Condensation domain | 1.00 | 1.00 | 1.00 | 2734 | 25.9 |
DH2 1 | Dehydratase | 0.98 | 0.95 | 1.00 | 414 | 33.2 |
DH | Dehydratase | 0.99 | 1.00 | 0.99 | 1469 | 31.1 |
DHt 2 | Dehydratase | 0.88 | 0.98 | 0.81 | 113 | 44.3 |
ER | Enoylreductase | 1.00 | 1.00 | 1.00 | 646 | 68.8 |
E | Epimerization domain | 1.00 | 0.99 | 1.00 | 424 | 60.4 |
KR | Ketoreductase | 1.00 | 1.00 | 1.00 | 3374 | 20 |
KS | Keto-synthase | 1.00 | 1.00 | 1.00 | 3970 | 72 |
PCP | Peptide carrier protein | 0.97 | 0.95 | 1.00 | 2785 | 21.9 |
TD | Reductive Thioesterase | 1.00 | 1.00 | 1.00 | 118 | 42.1 |
TE | Thioesterase | 1.00 | 1.00 | 1.00 | 824 | 36.5 |
bACP | β-branching acyl carrier protein | 0.59 | 0.42 | 1.00 | 155 | 28.9 |
cMT | C-Methyltransferase | 1.00 | 0.99 | 1.00 | 326 | 104.7 |
nMT | N-Methyltransferase | 0.99 | 1.00 | 0.99 | 173 | 42.7 |
oMT | O-Methyltransferase | 0.99 | 0.99 | 0.99 | 184 | 70.6 |
tAT_d | Trans-acyltransferase docking domain | 0.98 | 0.97 | 1.00 | 665 | 50 |
Macro avg. | | 0.97 | 0.96 | 0.99 | 30,144 | |
Micro avg. | | 0.99 | 0.99 | 0.99 | 30,144 | |
Weighted avg. | | 0.99 | 0.99 | 0.99 | 30,144 | |
Table 2.
The 20 most common modules found in the MIBiG. The functional classification, normal (N), special (S), and not functional (NF) is explained in
Section 4.2. The modification refers to the reduction profile of PKS modules or epimerization of NRPS modules. The occurrence refers to the overall count in all BGCs of the MIBiG v2.
Domain Order | Functional | Type | Modification | Occurrence |
---|
C-A-PCP | N | NRPS | L | 1429 |
KS-AT-DH-KR-ACP | N | PKS | Enoyl | 895 |
KS-AT-KR-ACP | N | PKS | Hydroxyl | 628 |
C-A-PCP-E | N | NRPS | E | 362 |
KS-AT-DH-ER-KR-ACP | N | PKS | Alkyl | 335 |
KS-AT-ACP | N | PKS | Ketone | 247 |
A-PCP | N | NRPS | L | 235 |
C-A-PCP-TE | N | NRPS | L | 202 |
KS-tAT_d-KR-ACP | S | PKS | Hydroxyl | 139 |
KS-tAT_d-DH-KR-ACP | S | PKS | Enoyl | 125 |
C-A-nMT-PCP | N | NRPS | L | 120 |
KS-tAT_d-ACP | S | PKS | Alkyl | 71 |
cAL-ACP | NF | PKS | - | 62 |
KS-AT-DH-KR-ACP-TE | N | PKS | Enoyl | 61 |
C-A-ACP | N | Mixed | L | 61 |
C-PCP | NF | NRPS | - | 59 |
KS-ACP | S | PKS | Ketone | 52 |
KS-tAT_d | S | PKS | Ketone | 52 |
KS-AT | S | PKS | Ketone | 49 |
KS-tAT_d-DH-KR-cMT-ACP | S | PKS | Enoyl | 49 |
Table 3.
Best grid search parameters for the RF models for AT- and A-domain specificity classification. The error for the micro F1-score is based on the standard deviation over all cross-validation queries.
Parameter | A-Domains | AT-Domains |
---|
Max. depth | 100 | 50 |
Max. features | auto | auto |
N. estimators | 1000 | 100 |
Min. samples split | 10 | 1 |
Bootstrap | No | No |
Criterion | Gini | Gini |
Min. samples leaf | 1 | 2 |
F1-score | 0.75 ± 0.05 | 0.96 ± 0.02 |
Table 4.
Performance for each substrate for AT-domain classification.
Substrate | F1-Score | Precision | Recall | Support |
---|
Ethylmalonyl | 0.5 | 1 | 0.33 | 15 |
Malonyl | 0.98 | 0.99 | 0.98 | 278 |
Methoxymalonyl | 0.59 | 1 | 0.42 | 12 |
Methylmalonyl | 0.94 | 0.9 | 0.99 | 194 |
Micro avg. | 0.95 | 0.95 | 0.95 | 499 |
Macro avg. | 0.75 | 0.97 | 0.68 | 499 |
Table 5.
Performance for each substrate for A-domain classification. The amino acids are named according to the standard amino acid one-letter code. Special substrate abbreviations are: 2-amino-adipic-acid (aad), beta-hydroxy-tyrosine (bht), diaminobutyric acid (dab), 2,3-dihydroxy-benzoic acid (dhb), 2,3-dehydroaminobutyric acid (dhbu), 3,5-dihydroxy-phenyl-glycin (dhpg), Hydroxy-L-ornithine (horn), 4-hydoxy-phenyl-glycine (hpg) ornithine (orn), pipecolic acid (pip).
Substrate | F1-Score | Precision | Recall | Support |
---|
A | 0.91 | 0.92 | 0.90 | 395 |
C | 0.83 | 0.79 | 0.88 | 64 |
D | 0.75 | 0.75 | 0.75 | 56 |
E | 0.46 | 0.7 | 0.35 | 55 |
F | 0.40 | 0.32 | 0.54 | 122 |
G | 0.77 | 0.73 | 0.82 | 78 |
H | 0.15 | 0.33 | 0.10 | 10 |
I | 0.77 | 0.86 | 0.70 | 63 |
K | 0.36 | 0.62 | 0.25 | 20 |
L | 0.70 | 0.73 | 0.67 | 149 |
N | 0.75 | 0.74 | 0.76 | 55 |
P | 0.76 | 0.79 | 0.74 | 61 |
Q | 0.84 | 0.84 | 0.84 | 37 |
R | 0.56 | 0.63 | 0.50 | 24 |
S | 0.77 | 0.76 | 0.79 | 132 |
T | 0.83 | 0.78 | 0.89 | 118 |
V | 0.72 | 0.68 | 0.76 | 119 |
W | 0.77 | 0.79 | 0.76 | 41 |
Y | 0.50 | 0.64 | 0.42 | 72 |
aad | 0.82 | 0.88 | 0.78 | 63 |
bht | 0.48 | 0.43 | 0.55 | 11 |
dab | 0.88 | 0.92 | 0.84 | 44 |
dhb | 0.97 | 0.96 | 0.99 | 155 |
dhbu | 0.27 | 0.50 | 0.18 | 11 |
dhpg | 1.00 | 1.00 | 1.00 | 12 |
horn | 0.73 | 0.80 | 0.67 | 12 |
hpg | 0.91 | 0.89 | 0.93 | 42 |
orn | 0.58 | 0.67 | 0.52 | 31 |
pip | 0.60 | 0.86 | 0.46 | 13 |
Micro avg. | 0.76 | 0.76 | 0.76 | 2065 |
Macro avg. | 0.69 | 0.74 | 0.67 | 2065 |
Table 6.
Performance metrics for the classification of AT- and A-domain substrates. ROC-AUC refers to the area under the receiver operating characteristic curve.
Metric | AT-Domains | A-Domains |
---|
Accuracy | 0.95 | 0.76 |
Error rate | 0.05 | 0.24 |
Matthews correlation coefficient (MCC) | 0.91 | 0.74 |
ROC-AUC (macro) | 1.00 | 0.96 |
ROC-AUC (micro) | 1.00 | 0.97 |
Precision (macro) | 0.97 | 0.74 |
Precision (micro) | 0.95 | 0.76 |
Recall (macro) | 0.68 | 0.67 |
Recall (micro) | 0.95 | 0.76 |
F1-score (macro) | 0.75 | 0.69 |
F1-score (micro) | 0.95 | 0.76 |
Support | 499 | 2065 |
Table 7.
Selected Pfam domains and corresponding PSMs. The number of appearances of a certain PSM in the structure of a secondary metabolite is correlated to the number of Pfam domains found in the producing cluster. The descriptions are derived from the Pfam database. Pfam-derived abbreviations are: deoxythymidine diphosphate (dTDP), nicotinamide adenine dinucleotide (NAD), nucleoside diphosphate (NDP), uridine diphosphate glucose (UDP), guanosine diphosphate (GDP), flavin adenine dinucleotide (FAD), domain of unknown function (DUF).
PSM | Pfam ID | Description | Pearson Correlation |
---|
Glyco | PF00908.16 | dTDP-4-dehydrorhamnose 3,5-epimerase | 0.67 |
Glyco | PF01370.20 | NAD dependent epimerase/dehydratase family | 0.64 |
Glyco | PF03559.13 | NDP-hexose 2,3-dehydratase | 0.61 |
Glyco | PF00201.17 | UDP-glucoronosyl and UDP-glucosyl transferase | 0.59 |
Glyco | PF03033.19 | Glycosyltransferase family 28 N-terminal domain | 0.56 |
Glyco | PF01041.16 | DegT/DnrJ/EryC1/StrS aminotransferase family | 0.52 |
Glyco | PF08421.10 | Putative zinc binding domain | 0.47 |
Glyco | PF16363.4 | GDP-mannose 4,6 dehydratase | 0.44 |
Glyco | PF04101.15 | Glycosyltransferase family 28 C-terminal domain | 0.28 |
Glyco | PF01075.16 | Glycosyltransferase family 9 (heptosyltransferase) | 0.21 |
Glyco | PF00728.21 | Glycosyl hydrolase family 20, catalytic domain | 0.18 |
Glyco | PF02838.14 | Glycosyl hydrolase family 20, domain 2 | 0.18 |
Glyco | PF01915.21 | Glycosyl hydrolase family 3 C-terminal domain | 0.13 |
Glyco | PF00933.20 | Glycosyl hydrolase family 3 N terminal domain | 0.13 |
Glyco | PF14885.5 | Hypothetical glycosyl hydrolase family 15 | 0.1 |
Cl | PF04820.13 | Tryptophan halogenase | 0.66 |
Cl | PF00999.20 | Sodium/hydrogen exchanger family | 0.51 |
Spiroketal | PF12680.6 | SnoaL-like domain | 0.61 |
Spiroketal | PF00890.23 | FAD binding domain | 0.54 |
SS | PF07992.13 | Pyridine nucleotide-disulphide oxidoreductase | 0.46 |
NO2 | PF01678.18 | Diaminopimelate epimerase | 0.67 |
NO2 | PF06722.11 | Protein of unknown function (DUF1205) | 0.27 |
6-Ring | PF16197.4 | Ketoacyl-synthetase C-terminal extension | 0.53 |
6-Ring | PF02801.21 | Beta-ketoacyl synthase, C-terminal domain | 0.52 |
6-Ring | PF08990.10 | Erythronolide synthase docking | 0.51 |
6-Ring | PF00743.18 | Flavin-binding monooxygenase-like | 0.46 |
5-Ring | PF12680.6 | SnoaL-like domain | 0.56 |
5-Ring | PF00890.23 | FAD binding domain | 0.53 |
5-Ring | PF01551.21 | Peptidase family M23 | 0.46 |
5-Ring | PF08990.10 | Erythronolide synthase docking | 0.45 |
5-Ring | PF00486.27 | Transcriptional regulatory protein, C terminal | 0.26 |
5-Ring | PF16197.4 | Ketoacyl-synthetase C-terminal extension | 0.26 |
5-Ring | PF00109.25 | Beta-ketoacyl synthase, N-terminal domain | 0.26 |
Table 8.
R2-score (coefficient of determination) for the prediction of PSMs based on 5-fold cross-validation. The baseline is based on a hypothetical model, assuming that no PSM is predicted at all.
PSM Name | Baseline Model | Regression Model |
---|
Glyco | −0.118 | 0.548 |
Cl | −0.091 | 0.26 |
Spiroketal | −0.031 | 0.427 |
SS | −0.02 | 0.126 |
NO2 | −0.015 | 0.097 |
6-Ring | −0.14 | 0.416 |
5-Ring | −0.052 | 0.374 |
Table 9.
Average (arithmetic mean) Tanimoto similarity between produced SMs and the predictions made by SeMPI v2 and antiSMASH v5.
Cluster Type | Number of Clusters | SeMPI v2 | AntiSMASH v5 |
---|
PKS | 110 | 0.44 | 0.42 |
NRPS | 158 | 0.57 | 0.50 |
Mixed | 219 | 0.45 | 0.42 |
All | 487 | 0.48 | 0.45 |
Table 10.
Best scoring parameter weights for each BGC type. The corresponding percentage of Top 10 and Top 50 ranked instances of the true SMs are shown.
Type | Similarity-Score | MCS-Score | PSM-Score | Top 10 | Top 50 |
---|
PKS | 1 | 1 | 0.3 | 0.53 | 0.72 |
NRPS | 1 | 0.4 | 0.5 | 0.59 | 0.72 |
Mixed | 0.4 | 0.6 | 0.1 | 0.45 | 0.64 |
All | 0.3 | 0.4 | 0.1 | 0.47 | 0.67 |
Table 11.
Percentage of Top 10 and Top 50 ranked instances of true SMs based on different scoring functions. The mixed scoring is based on the parameter weights determined in
Section 2.5.2.
Scoring Function | Cluster Type | SeMPI v2 (Top 10) | AntiSMASH v5 (Top 10) | SeMPI v2 (Top 50) | AntiSMASH v5 (Top 50) |
---|
Similarity | PKS | 25.4 | 20.3 | 45.61 | 48.55 |
MCS | PKS | 36.0 | 22.5 | 56.14 | 49.28 |
Mixed | PKS | 49.1 | - | 72.81 | - |
Similarity | NRPS | 50.3 | 45.9 | 64.57 | 61.01 |
MCS | NRPS | 45.1 | 45.9 | 57.71 | 61.01 |
Mixed | NRPS | 55.4 | - | 68.57 | - |
Similarity | Mixed | 25.1 | 22.6 | 46.91 | 41.03 |
MCS | Mixed | 32.5 | 22.6 | 47.74 | 41.03 |
Mixed | Mixed | 39.9 | - | 62.14 | - |
Similarity | All | 33.5 | 29.5 | 52.44 | 49.59 |
MCS | All | 37.4 | 30.1 | 52.82 | 49.80 |
Mixed | All | 47.0 | - | 66.54 | - |
Table 12.
RF parameters and their configuration used for the grid search to determine the best configuration for substrate specificity prediction. Each parameter is described in detail in the scikit-learn documentation [
48].
Parameter | Configurations |
---|
Max. depth | 10, 50, 100, None |
Max. features | 1, 3, 10, auto, sqrt |
N. estimators | 100, 500, 1000 |
Min. samples split | 2, 5, 10 |
Bootstrap | True, False |
Criterion | gini, entropy |
Min. samples leaf | 1, 5, 10 |
Table 13.
SMARTS patterns used to match PSMs in known BGC products described in the MIBiG.
Name | SMARTS Patterns |
---|
Glyco | [#6]O[D3]1CCCCO1 |
Cl | Cl |
Spiroketal | [C&R]O[C&R]([C&R])([C&R])O[C&R] |
SS | SS |
NO2 | O=[N+][O−] |
6-Ring | CC1CCCC(C)O1 |
5-Ring | C1CCCO1 |
Table 14.
Databases and the number of stored molecules used for the screening of predicted scaffolds. All compounds are linked to their source database. The predicted streptomyces SMs are linked to the preprocessed database on the SeMPI v2 web server.
Database Name | Number of Molecules |
---|
MIBiG | 1528 |
StreptomeDB 2.0 | 3990 |
NANPDB | 6841 |
DrugBank | 9277 |
MolPort NP | 120,555 |
Analyticon Zinc | 663 |
ChEBI (3star) | 46,547 |
Norine | 641 |
Predicted streptomyces SMs | 1519 |