1. Introduction
An unpresented molecular framework such as that in
Figure 1a can be investigated in drug space. In early stages of drug discovery, three-dimensional (3D) similarity between chemicals has been used to find desirable ligands of a chosen therapeutic target in virtual screening (VS;
Figure 1b) [
1,
2]. To our knowledge, chemical similarity is a coarse predictor for filtering out less promising chemicals rather than selecting the most desirable compound. Chemical similarity has also contributed to target screening (in other words, retro-VS) under the chemocentric assumption in
Figure 1c. Chemocentric assumption means if two similar molecules are likely to possess similar properties, they can share biological targets or may show similar pharmacological profiles [
3,
4]. Remarkably, Jain’s group conducted on-target and off-target prediction through the comparison of two-dimensional (2D) and 3D chemical similarity [
5]. Based on this comparison, while dual 2D and 3D similarity-based predictions showed superiority for either 2D or 3D predictions, 3D predictions did not show dramatic improvement over 2D predictions. In addition, the increase of data points, according to the conformer sampling sizes, makes the computing cost of 3D features increase more rapidly than 2D features. However, despite it being less cost-effective, 3D similarity is the best feature for in silico target screening of unprecedented drug scaffolds and new drug-like molecular frameworks [
6] because (1) novel, unprecedented drug scaffolds have very low 2D similarity to known bioactive molecules [
7,
8,
9], (2) novel pharmacological profiles of drugs are more frequently found using 3D similar off-target predictions [
5], and (3) realistic drug properties can be generated from their factual and flexible 3D structures [
10,
11,
12].
The internalization of Michelangelo Buonarroti’s quote, “Every block of stone (chemical) has a statue (utility) inside it, and it is the task of the sculptor (chemist) to discover it”, inspired this research for the ‘chemistry-oriented synthesis’ of an unprecedented drug scaffold [
7,
8,
9] and the chemocentric target profiling of this scaffold [
7]. For this purpose, we have intensively studied the 3D similarity of unprecedented drug scaffolds (the query compounds) with known molecular frameworks (the reference compounds). When comparing query and reference compound pairs, 3D similarity of the pairs depends on (1) conformational sampling of the compounds, (2) the alignment method, (3) the chosen descriptors, and (4) the distance coefficients (e.g., Jaccard–Tanimoto). In addition to the four factors of 3D VS, retro-VS of unprecedented drug scaffolds (query compounds) requires compound–target associations (target class information), as shown in
Figure 1. These associations are the source of the substantial difference between VS and retro-VS in problem-solving in data science, specifically, (1) one-to-one comparison for VS, as shown in
Figure 1b; (2) one-to-group (class) comparison for retro-VS, as shown in
Figure 1c; and (3) group-to-group comparison for typical parametric statistics such as ANOVA and
t-test. When we calculated the similarity of compound pairs in retro-VS, the hope was to ultimately identify the primary target of the query through calculated chemical similarity rather than finding the most similar compound to the query structure. To achieve this, one-to-group comparison must be essentially quantified. To our knowledge, such measurements have not been properly reported in cheminformatics. Notably, 2D similarity distributions with target annotation have been reported using statistical fitting models such as Shoichet’s group [
3], Bajorath’s group [
13], and Nasr’s group [
14]. However, even though the number of studies using 3D similarity is enormous with review articles by Zhang et al. [
15] and Shin et al. [
16], 3D similarity distribution is rarely mentioned in the literature. Other than the distribution, network analysis (edge: similarity, node: chemical) such as that by Torres et al. [
17] or the machine-learning algorithm-based classifiers have also been used [
11,
18]. Most classifiers do not only use chemical similarity, but also use other descriptors together [
18]. Although several studies have treated 3D similarity distribution such as Jain’s group [
5], Medina-Franco’s group [
19], and Pérez-Nueno’s group [
20], the distribution comprised every compound instead of compounds grouped by target [
5,
19]. In addition, it was either visualized without a fitting model [
19] or its statistical model was chosen without parameter optimization [
5]. Exceptionally, although Pérez-Nueno’s group reported Gaussian distribution using 3D similarity, the study assumed Gaussian distribution with only one centroid and fitting parameter was also not optimized, despite the small number of ligands [
20].
In this study, we quantitatively compared a query compound with a target class (one-to-group) using two types of similarity distributions, namely, maximum likelihood (ML) estimation of queries and a Gaussian mixture model (GMM) of target classes (
Figure 1d). As raw data of this study, the Jaccard–Tanimoto similarity coefficients were calculated for (1) query-to-ligand pairs (e.g., the left second row of the
Figure 1d) and (2) ligand pairs within each target class (e.g., the left first row of
Figure 1d). The query-to-ligand similarity was transformed into query distribution via ML estimation, and the ligand pair similarity was also transformed into a representative distribution of a target class using GMM. The difference between two distributions was quantified by Kullback–Leibler (K–L) divergence, which represented the quantitative comparison between a query and a target class. In order to evaluate whether the K–L divergence accurately achieved one-to-group comparison, a query chosen from a group of known ligands for a target was tested to observe discrimination between the original target and other targets. In sequence, the target profiles of an unprecedented drug scaffold was explained by K–L divergence.
2. Theoretical Background
Kullback–Leibler divergence: K–L divergence measures the difference between two statistical or probabilistic distributions. In particular, K–L divergence is employed in various machine learning and deep learning algorithms for statistical inference [
21,
22]. Since K–L divergence implies relative entropy, which is an important concept in understanding statistical phenomena, it applies to statistical physics, chemistry, and social science.
Let us define two probability spaces,
and
, where
is the sample space,
is
–algebra, and
and
are probability distributions. Then, to define Kullback–Leibler divergence, a unique measurable function is devised,
, known as the Radon–Nykodym derivative, so that
For any measurable set,
[
22] when using the measurable function
. The Kullback–Leibler divergence,
, is defined as either
or
where the probability density functions
p(
x) and
q(
x) are defined as
The Kullback–Leibler divergence represents the information for comparing
P(
x) and
Q(
x) distributions [
23]. Hence, the implication of Kullback–Leibler divergence depends on the definitions of
P(
x) and
Q(
x). For example,
Model Inference: If
P(
x) represents the testing distribution based on the model, and
Q(
x) represents the distribution from the raw data, the difference is the error between the model and reality [
24];
Informatics: If
P(
x) and
Q(
x) represent information extracted from two objectives, the divergence is a measurement for the discrimination between two objectives [
13,
25];
Bayesian Statistics: If
P(
x) represents a prior distribution and
Q(
x) represents a posterior distribution, the divergence represents the information gained through updating [
26,
27].
In sequence, let us consider a special example. Assume the probability distributions
P(
x) and
Q(
x) replace the Gaussian distributions
and
, where
Using Equations (3) and (5), the Kullback–Leibler divergence between the two Gaussian distributions
and
in Equation (5) are as follows:
This Kullback–Leibler divergence between the univariate normal distributions (Equation (6)) therefore extends to multivariate distributions [
28].
Gaussian mixture model: The mixture models are methods that analyze compositional data. With
representing a probabilistic density generated from the unknown compositional data,
representing a well-known probability density, and
x representing a random vector, the functional operator,
, is defined as
where for k = 1, 2, …, K,
,
are the weights and vectors of the hyperparameters and
is the
component, which is independently and identically distributed (iid) [
29]. In this work, GMM was adopted to obtain a representative distribution [
30]. Notably, GMM is a model that describes non-Gaussian distributions as well as Gaussian distributions [
31]. The probability density
represents the Gaussian density function
in Equation (5). In the Gaussian mixture model, estimations of the weight (
), the mean (
), and the standard deviation (
) are essential. Herein, the two methods (i.e., the EM algorithm [
32] and ML estimation [
33]) were chosen to estimate the hyperparameters from sparse and incomplete data. The EM algorithm for GMM consists of an initial guess for the GMM parameters and iterative calculation (E-step)–parameter determination (M-step). The iterative steps continue until the set of hyperparameters,
, are less than positive, and infinitesimal number,
, as shown in the ccccccmathematical elucidation (
Supplementary Materials Equations (S1.6)–(S1.12) [
34]. For convenience, when applying the ML estimation,
is transformed into the mixture model and
is replaced by
.
3. Results and Discussion
In this study, a quantitative method was developed to describe discriminative information for target prediction of a query compound only from chemical similarity and known compound–target association information. For this purpose, 3D similarity distributions were acquired from a 3D similarity matrix occupied by Jaccard–Tanimoto coefficients [
35] regarding (1) query-to-ligand pairs and (2) ligand pairs within each target class. The Jaccard–Tanimoto coefficients were calculated from two types of features, molecular shape and pharmacophore features, using the Openeye Toolkit. Query compounds and target classes were compared and quantified according to the following process:
Step 1. EM algorithm-based GMM allowed to obtain a representative distribution (Q-distribution) for a target class, following either Gaussian or non-Gaussian distribution;
Step 2. A query-to-ligand similarity distribution was fitted onto a Gaussian distribution using ML estimation;
Step 3. K–L divergence between the two distributions from Step 1 and Step 2 allowed target predictions of the query compound. Greater deviation of K–L divergence values between target classes indicated that the query compound was a more representative ligand of a class than other query compounds. In addition, the probability, , derived from the K–L divergence values and the feasibility index, , allowed for quantification of discrimination between the target classes.
Dataset: In order to select example target classes for this study, an unprecedented scaffold with structural novelty and its targets were focused. Among our previous studies, bis-
N,
N-dimethylaminophenylamino tetrahydropyran (BNDS-A), which was the most potent to regulate in vitro inflammation (IC
50 of nitric oxide production = 12 μM), was chosen for this quantitative method (
Figure 1a). The association of two targets with BNDS-A, estrogen receptor alpha (ESR), and vitamin D receptor (VDR) was proven by the stepwise approach consisting of (1) 2D similarity search, (2) multiplication of 3D similarity coefficients of every ligand within each target, P(Tc)/C(hits), (3) self/cross-similarity, and (4) western blot analysis in our previous work [
7]. However, despite low predicted probability, capthesin D (CTSD) and cyclooxygenase-2 (COX2) could also be regulated by BNDS-A in the same study. Neither the most similar compound to BNDS-A (one-to-one comparison) nor ANOVA test between target pairs (group-to-group comparison) could suggest the primary target of BNDS-A. Therefore, to quantitatively compare them with BNDS-A, the four targets, ESR, VDR, COX2, and CTSD, were selected. In addition, an additional four targets, HIV-1 protease (HIV1), heat shock protein 90 (HSP90), transient receptor potential cation channel subfamily V4 (TRPV4), DNA topoisomerase I (TOP1), were randomly selected from the target prediction literature [
36] to evaluate our methodology. For convenience, simple numbers denoted the target classes, in other words,
Either
m or
n were called the class number, which was an integer between 1 and 4, as in Equation (8), and
and
represent vectors whose elements are the Tanimoto coefficients of query compounds in the
mth class.
was defined as the Tanimoto matrix operator, so
where
is a scalar operator between the
ith and
jth queries to calculate the Tanimoto coefficient and
and
are unit vectors for the
i-axis and
j-axis, where <, > is the inner product.
Representative distributions Q for target classes: The representative distributions corresponding to each target class using GMM of ligand pair similarity were obtained. First, using the similarity matrix
in Equation (9), where m = n, the following univariate probability densities,
, were defined by
where
is the probability measure;
x is the Tanimoto–Jaccard coefficients;
and the range of
x is [0, 2]; and
. Therefore, the probability densities,
, satisfy the following equation:
Second, to extract representative distributions from
, the Gaussian mixture model was utilized, in which probability densities,
, are expressed as approximated from
, which is the weighted sum of K univariate Gaussian distributions. That is,
where
and
are shown in
Table 1. To estimate the hyperparameters
and
, the EM algorithm was used as described in
Section 4.
Table 1 shows the mean, standard deviation, and weight corresponding to the components of the mixture model.
Figure 2 depicts the difference between the probability densities,
, and
, where
K = 1, 3, and 7. When comparing component
K, raw data were similarly fitted to histograms when K = 3 and K = 7, and normal Gaussian modeling showed insufficient fitting for ESR, COX2, and CTSD (
Figure 2). Commonly, the means and modes of the representative distributions existed near 0.5, and every distribution was skewed to the right.
Gaussian distributions for queries: To quantitatively compare the representative distributions corresponding to ESR, VDR, COX2, and CTSD with the query distributions, Kullback–Leibler divergence was introduced and calculated by building each distribution for each query.
For this purpose,
of Equation (9) was used in a similar way to the described method for the representative distributions of the target classes. When a query was the
lth ligand of
, the
lth column’s elements in the above matrix were used for the
lth column vector,
, as in
where the values of
for
j = 1, 2, …, N were represented by the N × N matrices, for which the elements
satisfied
Using the vector
from Equation (13), the following univariate probability densities,
, were defined as
where the probability measure
was derived from Equation (10).
Before obtaining the probability distribution, two assumptions were made. First, it was assumed that a distribution from one query was not a weighted sum of Gaussian distributions, but rather a simple Gaussian distribution. It was reasonable that a distribution from one query was simpler than the
Q-distribution of a target class with 13,957 queries. Second, to estimate the parameters of the Gaussian distribution, ML estimation was chosen as a general method, in which
where
μ1 and
σ1 are hyperparameters and are maximized log likelihood functions for normal distribution, in other words,
Using definitions Equations (16) and (17), each query resulted in four distributions corresponding to the four classes (i.e., ESR, VDR, COX2, and CTSD). For example, when CHEMBL539392 was chosen as a query (
l) among the ligands of ESR (Class 1), the distributions
,
, and
were obtained under the definitions of Equations (8) and (15). According to Equations (16) and (17), four representative Gaussian distributions of the query compound CHEMBL539392 were acquired from the column vector between CHEMBL539392 and 13,957 ligands of each class, which were
In the same way, univariate normal distributions were obtained of all of the query compounds in each class. Since the number of classes was four and there were 13,957 query compounds in each class, the Gaussian distributions
, derived from
, presented the class number, either
m or
n, which was an integer between 1 and 4, and the query number,
l, which was an integer from 1 to 13,957. As a result, the frequency distributions of the estimates, alongside the means (
) and standard deviations (
), were described as shown in
Figure 3 and
Supplementary Figures S5–S7. ML estimation did not show any difference between self-query (m = n) and cross-query (m ≠ n) with regard to frequency. Even though cathepsin D (CTSD) showed a slightly lower mean than the other classes, self-comparison also showed a low mean, as shown in
Figure 3. Regardless of whether a class or a query compound was used (self/cross), 3D similarity of ligand pairs within a class showed the mode near 0.6, thereby confirming the need for quantitative comparison between queries. Notably, the univariate probability distributions of 3D similarity did not discriminate between target class at all.
Discrimination and K–L divergence: In sequence, 3D similarity distributions of target classes and query compounds were quantitatively compared through K–L divergence calculations. First, the information describing specific Tanimoto–Jaccard coefficients,
x, were defined as
from two probability density distributions,
and
, which were generated from a query compound and a class. Hence, following the expected value from the above information in Equation (19) with respect to one query compound, the K–L divergence,
represented a measurement for the discrimination.
In a one-component GMM (
K = 1), the K–L divergence between Gaussian distributions of every query and the
Q-distributions (
Table 1) are calculated; randomly chosen query compounds are described in
Table 2. To show the calculation process in detail, CHEMBL539392 was chosen as an example. Using the above equation for Kullback–Leibler divergence between normal distributions,
where
We obtained four K–L divergences corresponding to the queries of 2.1493, 4.6939, 2.0810, and 1.6354, respectively (see calculation procedure in the
Supplementary Materials Equations (S2.1–S2.8). As shown in
Table 2 and
Supplementary Table S3, the K–L divergence of every query compound was not always the smallest value from their original targets, as annotated by ChEMBL DB. Even though a considerable number of query compounds showed that the K–L divergence resulting from an original target was smaller than values from other target classes, CHEMBL539392 of ESR, CHEMBL1163237 of COX2, and CHEMBL263810 of CTSD were considered to be less different than other targets, therefore giving a false prediction (
Table 2). When we counted the query compounds that discriminated between the original targets and other targets from the 13,957 query compounds under the four classes via GMM (
K = 1), the correct prediction numbers were 6300, 5200, 4100, and 6400 among each of the 13,957 queries from ESR, VDR, COX2, and CTSD, respectively. When applying GMM (
K = 3) and (
K = 7) for the
Q-distributions, the true positive ratio decreased (ESR: 5100; VDR: 4500; COX2: 3200; CTSD: 4900 (
K = 3); ESR: 4900; VDR: 4500; COX2: 3100; CTSD: 4800 (
K = 7)).
In order to further evaluate the discriminative power of K–L divergence between target classes, an additional four classes as well as the four classes for BNDS-A were compared with the shared ligands in
Table 3 and
Supplementary Table S2. In
Table 3, ritonavir (CHEMBL163) is a clinically approved drug on the HIV1 (human immunodeficiency virus type 1) protease as its primary target. Notably, ritonavir showed the distinct K–L divergence value to discriminate HIV1 with other targets. In addition, the result can rationalize why ritonavir cannot show a distinct difference between VDR and COX2. In contrast, myricetin (CHEMBL 164) showed very disappointing result with poor discrimination between K–L divergence values. However, when we checked every target of myrcetin, the natural compound did not show target specificity on any single protein to explain the result. The annotated activities were limited to the known targets (VDR: 31–40 μM, COX2: 100 μM, HSP90 13.5 μM in cell-based assay, TOP1: IC50 = 11.9 μg mL
−1) in ChEMBL DB. Furthermore, despite the absent data on HIV1 of myrcetin, the flavonoid compound with multiple hydroxyl groups showed experimental activity on ubiquitin-specific protease having functional similarity (peptidase domain) with HIV1 to explain the K–L divergence value of 0.0393. In sequence, because reserpine (CHEMBL772), a clinically approved natural product, has target specificity on vesicular monoamine transporters with trivial activities on the annotated targets (VDR/COX2/TOP1), every target did not show a difference with untested targets (ESR/CPTD/HIV1). In addition, even though CHEMBL1813048 was the ligand of COX2 and TRPV4, K–L divergence could not support the finding. However, the result can be explained by the experimental data: (1) Ki against TRPV4 was more than 10 μM and (2) indirect regulation of COX2 was recorded through the Prostaglandin H2 receptor in ChEMBL DB. When compared with a 2D fingerprint based Top5 prediction of the additional target classes [
36], our method can provide how much each query is quantitatively different with each target class from the raw data without any refinements such as assay, activity index, and duplicated ligands. This point is very important for investigating unprecedented drug scaffolds having weak activity out of the Top5 of a target class.
After the individual K–L divergence comparisons of each query, comparisons between the target classes were quantified. In sequence, the K–L divergence between the Gaussian distributions of 13,957 queries and the
Q-distributions (K = 1, 3, and 7) for the four target classes were presented as a cumulative distribution, as seen in
Figure 4,
Figure 5,
Figure 6 and
Figure 7. To investigate the feasibility of the information, the following distribution was defined:
where
is the query number in class
m and the random variable
represents a class number, so that
If the K–L divergence (Equation (20)) is an ideal measurement for discrimination between target classes, () would satisfy the following conditions:
Sufficient condition: The feasibility index,
, is defined as
The above conditions implied a quantitative measurement for the discrimination. In particular,
in the sufficient condition represents the ratio between two probabilities (i.e., that a query compound belonged to a class of itself as well as belonging to other classes). A larger value of
indicated better feasibility or resolution of discrimination.
Table 4 depicts the probability of the K–L divergence
for
, indicating that, except for example
m = 3 where the class was COX2, the tested classes met the necessary conditions
in Equation (25) with respect to the feasibility index in Equation (26), it was easier to distinguish a query compound in the CTSD class where
m = 4 from every class except itself (
Figure 8). When the feasibility index resulting from the GMM (
K = 1) was compared with the index calculated from the GMM (
K = 3) and (
K = 7) for the
Q-distributions, GMM (
K = 1) showed superior feasibility for class discrimination using GMM (
K = 3) or (
K = 7), as shown in
Table 4.
Representative ligands for better discriminative predictions: According to the results described in
Figure 4,
Figure 5,
Figure 6,
Figure 7 and
Table 4, 3D similarity-based K–L divergence together with
and F
m showed discriminative power with regard to some query–class associations. The question therefore remains regarding the efficient use of the 3D-chemocentric approach under the current discriminative power, where it can be applied to investigate the novel pharmacology of an unprecedented compound. For this purpose, K–L divergence of an unprecedented compound should be calculated to compare known ligands and target classes. In detail, representative ligands within each target class were chosen for the comparison. For example, we selected four representative queries based on their Tanimoto–Jaccard coefficients (
x), and K-L divergence value, namely, (1)
x is the nearest to the mean of the
Q distribution (GMM,
K = 1), (2)
x is the nearest to an outlier of the
Q distribution (mean ± 2SD), (3) the range of K–L divergence between two target classes, and (4) the highest similarity with an unprecedented compound (
Table 4). As an example, BNDS-A, a recently reported in-house compound [
7], was used as the unprecedented compound due to the absence of ChEMBL DB. The first query compound close to the mean of the
Q distribution showed a smaller K–L divergence than the other compounds (
Table 5). The initial assumption and initial selection of the target class of BNDS-A (in other words, the selection of the
Q distribution), resulted in a critical effect on the K–L divergence of BNDS-A as a query compound to predict the target class. When ESR was assumed as the initial target of BNDS-A, BNDS-A was more ESR ligand-like than CHEMBL558943 (at mean − 2SD for the ESR
Q distribution) and CHEMBL604989 (which exhibited the biggest K–L divergence gap), and was less ESR-like than CHEMBL499809 (at the mean for the ESR
Q distribution) and CHEMBL2 (at the mean + 2SD). Under the
Q of ESR assumption, BNDS-A showed the lowest K–L divergence with the VDR ligands (0.0588 of VDR < 0.2116 of ESR), suggesting that BNDS-A was more VDR ligand-like than ESR ligand-like. When the initial target was transferred to VDR or COX2, BNDS-A showed the lowest K–L divergence required to satisfy the assumption (chosen
Q). In all BNDS-A rows of
Table 4, while the order of K–L divergence of BNDS-A (VDR < ESR < CTSD) was retained under the assumed every target class of BNDS-A, COX2 showed the lowest K–L divergence under only COX2
Q distribution and did not show consistent prediction. Therefore, BNDS-A was more VDR ligand-like than COX2 ligand-like. Experimentally, BNDS-A regulated the expression level of targets in a concentration-dependent manner (VDR > CTSD >> ESR) [
7]. Notably, K–L divergence of 3D similarity distributions can be an additional comparison method of known methods to predict the target of a novel compound such as (1) the rank of 3D similarity score [
7,
15,
16] or (2) p-value of one 3D similarity distribution [
20]. Whenever achieving the relevance between a novel query and a target class is the aim, K–L divergence can be used for 3D-chemocentric informatics, as seen in the example of BNDS-A.
4. Materials and Methods
Data collection: All data, except for the in-house compound (BNDS-A), were extracted from the ChEMBL database (1. ESR, VDR, COX2, and CTSD: version 23 through KNIME community node, 2. HIV1, HSP90, TRPV4, and TOP1: version 25 through MySQL) [
37]. Version 23 was available in both the ChEMBL community node of KNIME and in-house MySQL built from the dump file from ChEMBL ftp (
ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/). HIV1 protease, HSP90, TRPV4, and TOP1 data were chosen based on the literature [
36] and downloaded from the ChEMBL 25 version.
Conformational sampling: Extracted compounds were converted from 2D structures into 3D conformation using Omega of the Openeye software [
38] under the following conditions: (1) the MMFF94 force field excluding Coulomb interactions and the attractive part of Van der Waals interactions (option:
mmff94s_Trunc) to retain the forces: bonding stretching, angle bending, stretch-bend interaction, out-of-plane bending at tricooridnate centers, torsion interaction, and the repulsive part of Van der Waals interactions; (2) 15 kcal/mol as the energy window; (3) hydrogen deletion from the input file fragment prior to the substructure search (option:
deleteFixHydrogens); (4) permission to generate stereoisomers; and (5) maximum acceptable number of rotatable bonds of 25 [
39]. Due to computational burden and space limitation to write similarity into a matrix during calculation at posterior work, 3D structures of every compound were merged into the structure files (file extension: sdf) according to target class, and 13,957 3D structures (with duplication due to different conformation) from the files were chosen via stratified sampling in KNIME to produce the dataset for similarity matrices as shown in
Supplementary Table S1.
Alignment method: In order to align the 3D-structures of compound pairs, center of the mass was used [
40]. In detail, it is reported that SIMPLEX algorithm for the alignment is already implemented in ROCS [
15]. Shape Toolkit in the Openeye software [
40,
41] provides ‘
OEBOOrientation’ used in
OEBestOverlay. To optimize the alignment of each paired 3D structures, the starting point should be chosen before finding centers-of-mass of two conformers and
OEBestOverlay uses an inertial frame alignment method to decide on starting positions by default. Under the default condition (‘
OEBOOrientation_Inertial’), the first 3D structure (refmol in the python code in the
Supplementary Materials) was aligned by its principal moments of inertia, then the second structure (fitmol in the python code in the
Supplementary Materials) object was aligned in four positions with the primary and secondary moments of inertia in both possible directions. Therefore, the alignment of a compound pair (A, B) is approximately the same and absolutely not identical with the alignment (B, A).
3D Descriptors: In order to describe a molecular shape, atom-centered Gaussian sphere model was implemented in OE-MPI/ROCS and the Shape Toolkit [
40,
41]. OE-MPI, a kind of MPI (message passing interface), was also provided by Openeye for thread parallel calculation with a high number of CPUs. The Gaussian sphere model describing the 3D shape of compounds used the sum of Gaussian functions of individual heavy atoms except for hydrogen.
f and
g are characteristic functions to present the 3D atomic structure of each compound,
I: self-volume overlaps of each entity, independent;
O: the overlap between the two functions, dependent on orientation of two molecules.
Color features of every query were generated under the default algorithm of the Shape Toolkit. Color features were defined by pharmacophore types (H-bond donor, H-bond acceptor, negative charge, positive charge, hydrophobic, and ring) in a color force field (Implicit Mills Dean) and color atoms were described by Gaussian functions as being relatively hard with a steep gradient.
3D Similarity matrix: The Jaccard–Tanimoto coefficient of two features, shape and color were calculated, combined, and written into 3D similarity matrices using the functions in the supplementary python script [
42].
- -
OEOverlay(): optimization of the alignment(overlap) between query and database
- -
OEBestOverlayScoreIter(): sorting all scores to highest Tanimoto coefficient before writing similarity score into an empty matrix.
In this study, while the dimension of 3D similarity matrices for Q distributions (GMM) was 13,957 by 13,957, the dimension of 3D similarity matrices for query distributions (ML estimation) was 1 by 13,957. Every sampled compound of four target classes (13,957 conformers x four target classes) was used as the query to show the performance of K–L divergence. The BNDS-A compound is only one query not existing in any target class.
Script for K–L divergence. In order to realize (1) the GMM model, (2) the ML estimation, and (3) K–L divergence, python scripts were written using python libraries such as pandas [
43], numpy [
44], and scipy [
45] under anaconda installation [
46], so that every code was uploaded to GitHub [
47].