1. Introduction
Virtual screening refers to the use of a computer-based method to process compounds from a library or database of compounds in order to identify and select the ones that are likely to possess a desired biological activity, such as the ability to inhibit the action of a particular therapeutic target. The selection of molecules with a virtual screening algorithm should yield a higher proportion of active compounds, as assessed by experiment, relative to a random selection of the same number of molecules [1,2].
Many virtual screening (VS) approaches have been implemented for searching chemical databases, such as substructure search, similarity searching, docking and QSAR [3]. Of these, similarity searching is the simplest and one of the most widely used techniques for ligand-based virtual screening (LBVS) [4]. The growing importance of similarity searching is due in particular to its role in lead optimization in drug discovery programs, where the nearest neighbors of an initial lead compound are sought in order to find better compounds.
There are many studies in the literature associated with the measurement of molecular similarity [5,6,7]. Similarity searching scans a chemical database to identify those molecules that are most similar to a user-defined reference structure, using some quantitative measure of intermolecular structural similarity. The most common approaches are based on 2D fingerprints, with the similarity between a reference structure and a database structure computed using association coefficients such as the Tanimoto coefficient [2,8]. Several methods have been used to further optimize the measures of similarity between molecules, including weighting, standardization, and data fusion [9,10,11,12,13].
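To make the fingerprint-based search concrete, the following minimal sketch ranks a small database against a reference structure using the binary Tanimoto coefficient. Fingerprints are represented here as sets of on-bit positions; the molecule names, bit positions and function names are invented for illustration and do not come from the data sets used in this paper.

```python
def tanimoto(a, b):
    """Binary Tanimoto coefficient between two sets of on-bit positions."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def rank_database(reference, database):
    """Return (name, similarity) pairs sorted by decreasing similarity."""
    scored = [(name, tanimoto(reference, fp)) for name, fp in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints (sets of on-bit positions), for illustration only.
reference = {1, 4, 7, 9}
database = {
    "mol_a": {1, 4, 7, 9},
    "mol_b": {1, 4, 8},
    "mol_c": {2, 3, 5},
}

ranking = rank_database(reference, database)
```

The molecules at the top of `ranking` are the nearest neighbors of the reference structure and would be the first candidates passed on for testing.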
Similarity measures play a significant role in quantifying pairwise molecular similarity. In this study, the similarity method that was developed is inspired by quantum mechanics theory. Quantum mechanics was recently employed in the information retrieval field [14,15,16], and because of the many similarities between text and chemical information retrieval, it was adapted for chemoinformatics as well [17]. These analogies provide the basis for the work in this paper, which introduces a new similarity method inspired by quantum theory to calculate the similarity of the molecules in a chemical database to the reference structures. The Standard Quantum-Based (SQB) similarity method requires a re-representation of molecular compounds so that they fit the mathematical space of quantum theory, which is the complex Hilbert space. The molecular compounds are therefore represented in complex number format, which plays a vital role in the development of the SQB method; the use of complex numbers is one of the key concepts of the mathematical formalism of quantum theory. This study also developed three different techniques to re-represent and embed the molecular compounds in complex number format. Finally, the screening experiments were simulated with three popular data sets converted to Pipeline Pilot ECFC_4 2D fingerprints: the MDL Drug Data Report (MDDR), Maximum Unbiased Validation (MUV) and Directory of Useful Decoys (DUD).
2. Related Work
In a broad sense, chemoinformatics is not restricted to molecular searching and drug design, but includes several classical chemical disciplines such as physical chemistry, medicinal chemistry, analytical chemistry, and others. Hence, the interaction among disciplines plays a fundamental role in generating new approaches to chemoinformatic problems, such as quantum-chemistry approaches that deal with the mathematical aspects of chemistry. In general, there is no substitute for quantum-chemistry approaches in describing the behavior of atoms during chemical bonding. This has led several scientists to consider only molecular descriptors derived from quantum chemistry. Quantum-chemical descriptors can be classified by type into orbital-based, energy-based, wave-function-based, and others [18].
The bridge between chemical concepts and quantum chemistry was introduced by the Atoms-in-Molecules (AIM) quantum approach [19]. Quantum-chemical algorithms in QSAR/QSPR play a fundamental role in providing higher accuracy than force-field-based methods. Holder et al. [20] employed quantum-chemical concepts first to verify the graph theory of molecules, and second to predict the refractive index of polymer matrices. The quantum-chemistry descriptors used in the development of QSAR/QSPR models deal with the chemical, physical, biochemical, and pharmacological properties of compounds [21,22]. Combining 2D and 3D connectivity molecular descriptors with quantum-chemistry descriptors proved preferable for QSPR to using each type of connectivity individually [23]. In the same context, quantum-chemical approaches are also used with 3D-QSAR models to calculate stereo-electronic properties [24,25,26]. Another study [27] employed quantum chemistry, relying on properties from molecular orbital calculations, in order to overcome the limitations of classical QSAR approaches.
On the other hand, quantum mechanics, together with increased computing power, has also been used in docking and in the improvement of known lead compounds in order to provide the highest accuracy [28,29,30]. In addition, Quantum Mechanics Methods (QMM) have been extensively applied with linear scaling in order to evaluate the binding enthalpy between ligands and proteins [31,32,33,34]. QMM have also been used to calculate energies and optimize molecular structures at the semi-empirical level [35]. In the drug discovery process, QMM have been used to investigate and describe electronic properties of molecules such as electronic and polarization effects, charge distribution, and bond state (forming/breaking) [36,37].
One of the approaches used for compound similarity searching is the Molecular Quantum Similarity Measure (MQSM), which was proposed by Carbó [38]. The MQSM approach depends on quantum-chemical descriptors to measure the similarity/diversity of molecules through the analysis of the electron density function, which is calculated by QMM. Subsequently, QSAR models were developed based on MQSM [39,40,41]. In addition, Molecular Quantum Self-Similarity Measures (MQSSM) were developed from MQSM by comparing each molecule with itself [37,42,43]. MQSM was also employed to classify molecular quantum objects by using dendrograms [44].
Moreover, Maldonado et al. [7] presented a comprehensive study of the applications and theories of molecular similarity measures in chemoinformatics. Generally, measures of molecular similarity rely on three main factors. The first is the set of features of the molecular structures used to detect the similarity/diversity of molecular compounds, known as descriptors. Molecular descriptors can be derived from 1D, 2D, or 3D molecular structures, and they reflect the chemical, physical and biological properties as well as the order of atoms and chemical bonds [5,7]. The second is the similarity coefficient; these are mostly categorized as distance, correlation and probabilistic coefficients, for instance Tanimoto, Euclidean distance, Pearson and so on. More details about similarity coefficients are given in the recent comprehensive study by Todeschini et al. [8]. The third and last is the degree of importance of molecular fragments, represented by weighting schemes. Various techniques have been used for this purpose, such as the Bayesian inference network [11].
As is evident from the above studies, Quantum Mechanics Methods have been used at the semi-empirical level of computational chemistry, whether for quantum-chemical descriptors or quantum similarity measures. In this study, quantum mechanics concepts are instead adopted as a similarity searching method to measure the similarity/dissimilarity between reference and library molecules.
4. Experimental Design
The experiments were carried out using the most popular chemoinformatics databases: the MDL Drug Data Report (MDDR) [49], Maximum Unbiased Validation (MUV) [50], and Directory of Useful Decoys (DUD) [51]. All molecules in these databases were converted to Pipeline Pilot ECFC_4 fingerprints (extended connectivity fingerprints, folded to a size of 1024 bits) [52]. These data sets have been used recently by our research group [11,12,13,47].
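The folding step can be pictured as follows. This is only a minimal sketch of one common folding scheme (feature identifier modulo the vector length, with the counts of colliding features summed); the exact procedure used by Pipeline Pilot is not described in this paper, and the feature identifiers and function name below are made up.

```python
FOLDED_SIZE = 1024  # fingerprint length stated in the text


def fold_counts(feature_counts, size=FOLDED_SIZE):
    """Fold a sparse {feature_id: count} map into a fixed-length count vector.

    Assumption: a feature is assigned to position feature_id % size and
    colliding features have their counts added together.
    """
    folded = [0] * size
    for feature_id, count in feature_counts.items():
        folded[feature_id % size] += count
    return folded


# Hypothetical sparse ECFC-like features: identifier -> occurrence count.
sparse = {5: 2, 1029: 1, 2048: 3}

vector = fold_counts(sparse)
```

In this toy input, features 5 and 1029 collide at position 5, which shows how folding trades some resolution for a fixed, compact vector length.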
The screening experiments were performed with ten reference structures selected randomly from each activity class. The same reference structures were used for TAN and for the four cases of the SQB similarity method. For the MDDR database, three data sets (MDDR-DS1, MDDR-DS2 and MDDR-DS3) with 102,516 molecules were chosen. MDDR-DS1 contains 11 activity classes, some of which involve actives that are structurally homogeneous and others actives that are structurally heterogeneous (i.e., structurally diverse). MDDR-DS2 contains 10 homogeneous activity classes, while MDDR-DS3 contains 10 heterogeneous activity classes. Details of these three data sets are given in Table 1, Table 2 and Table 3. Each row of a table contains an activity class, the number of molecules belonging to the class, and the class's diversity, which was computed as the mean pairwise Tanimoto similarity calculated across all pairs of molecules in the class using ECFC_4. The second data set, MUV, shown in Table 4, was reported by Rohrer and Baumann [50]. It contains 17 activity classes, with each class containing up to 30 actives and 15,000 inactive molecules. The diversity values show that the activity classes in this data set are highly diverse (heterogeneous). This data set was also used in previous studies by our research group [12,47].
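The diversity figure reported in the tables can be reproduced in outline as below. The sketch assumes the count-vector (continuous) form of the Tanimoto coefficient, since ECFC_4 fingerprints store counts; the paper does not state which variant was used, and the three toy vectors and function names are invented.

```python
from itertools import combinations


def tanimoto_counts(a, b):
    """Continuous Tanimoto coefficient for two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0


def mean_pairwise_similarity(fingerprints):
    """Average Tanimoto similarity over all unordered pairs of molecules."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    return sum(tanimoto_counts(a, b) for a, b in pairs) / len(pairs)


# Hypothetical count fingerprints of one activity class, for illustration only.
activity_class = [
    [1, 0, 2, 0],
    [1, 0, 2, 0],
    [0, 1, 0, 0],
]

diversity = mean_pairwise_similarity(activity_class)
```

A low mean value indicates a structurally heterogeneous class, while a high value indicates a homogeneous one.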
Table 1.
MDDR-DS1 structure activity classes.
Activity Index | Activity Class | Active Molecules | Pairwise Similarity |
---|---|---|---|
31420 | Renin inhibitors | 1130 | 0.290 |
71523 | HIV protease inhibitors | 750 | 0.198 |
37110 | Thrombin inhibitors | 803 | 0.180 |
31432 | Angiotensin II AT1 antagonists | 943 | 0.229 |
42731 | Substance P antagonists | 1246 | 0.149 |
06233 | 5HT3 antagonists | 752 | 0.140 |
06245 | 5HT reuptake inhibitors | 359 | 0.122 |
07701 | D2 antagonists | 395 | 0.138 |
06235 | 5HT1A agonists | 827 | 0.133 |
78374 | Protein kinase C inhibitors | 453 | 0.120 |
78331 | Cyclooxygenase inhibitors | 636 | 0.108 |
Table 2.
MDDR-DS2 structure activity classes.
Activity Index | Activity Class | Active Molecules | Pairwise Similarity |
---|---|---|---|
07707 | Adenosine (A1) agonists | 207 | 0.229 |
07708 | Adenosine (A2) agonists | 156 | 0.305 |
31420 | Renin inhibitors 1 | 1300 | 0.290 |
42710 | CCK agonists | 111 | 0.361 |
64100 | Monocyclic β-lactams | 1346 | 0.336 |
64200 | Cephalosporins | 113 | 0.322 |
64220 | Carbacephems | 1051 | 0.269 |
64500 | Carbapenems | 126 | 0.260 |
64350 | Tribactams | 388 | 0.305 |
75755 | Vitamin D analogues | 455 | 0.386 |
Table 3.
MDDR-DS3 structure activity classes.
Activity Index | Activity Class | Active Molecules | Pairwise Similarity |
---|---|---|---|
09249 | Muscarinic (M1) agonists | 900 | 0.111 |
12455 | NMDA receptor antagonists | 1400 | 0.098 |
12464 | Nitric oxide synthase inhibitors | 505 | 0.102 |
31281 | Dopamine β-hydroxylase inhibitors | 106 | 0.125 |
43210 | Aldose reductase inhibitors | 957 | 0.119 |
71522 | Reverse transcriptase inhibitors | 700 | 0.103 |
75721 | Aromatase inhibitors | 636 | 0.110 |
78331 | Cyclooxygenase inhibitors | 636 | 0.108 |
78348 | Phospholipase A2 inhibitors | 617 | 0.123 |
78351 | Lipoxygenase inhibitors | 2111 | 0.113 |
Table 4.
MUV structure activity classes.
Activity Index | Activity Class | Pairwise Similarity |
---|---|---|
466 | S1P1 rec. (agonists) | 0.117 |
548 | PKA (inhibitors) | 0.128 |
600 | SF1 (inhibitors) | 0.123 |
644 | Rho-Kinase2 (inhibitors) | 0.122 |
652 | HIV RT-RNase (inhibitors) | 0.099 |
689 | Eph rec. A4 (inhibitors) | 0.113 |
692 | SF1 (agonists) | 0.114 |
712 | HSP 90 (inhibitors) | 0.106 |
713 | ER-a-Coact. Bind. (inhibitors) | 0.113 |
733 | ER-b-Coact. Bind. (inhibitors) | 0.114 |
737 | ER-a-Coact. Bind. (potentiators) | 0.129 |
810 | FAK (inhibitors) | 0.107 |
832 | Cathepsin G (inhibitors) | 0.151 |
846 | FXIa (inhibitors) | 0.161 |
852 | FXIIa (inhibitors) | 0.150 |
858 | D1 rec. (allosteric modulators) | 0.111 |
859 | M1 rec. (allosteric inhibitors) | 0.126 |
The third data set used in this study is the Directory of Useful Decoys (DUD), which was compiled as a benchmark data set specifically for docking methods. It was introduced in [51] and has been used recently in molecular virtual screening [53] as well as molecular docking [54]. The decoys for each target were chosen to fulfil a number of criteria that make them relevant and as unbiased as possible. In this study, twelve subsets of DUD with 704 active compounds and 25,828 decoys were used, as shown in Table 5.
This study presents the SQB similarity method, which operates in both complex and real Hilbert space. The complex Hilbert space is generated by three different proposed techniques (i.e., SQB-Complex Tech. 1, SQB-Complex Tech. 2, and SQB-Complex Tech. 3), while the real Hilbert space is generated as a special case of the complex space (i.e., SQB-Real). The retrieval results obtained by these four cases of the SQB method were compared with those of the Tanimoto coefficient (TAN). The TAN coefficient has been used in ligand-based virtual screening for many years and is now considered a reference standard.
Table 5.
Number of active and inactive compounds for twelve DUD sub datasets, where Na: number of active compounds; Ndec: number of decoys.
No. | Data Set | Na | Ndec |
---|---|---|---|
1 | Fgfr1t | 120 | 4550 |
2 | fxa | 146 | 5745 |
3 | gart | 40 | 879 |
4 | gbp | 52 | 2140 |
5 | gr | 78 | 2947 |
6 | hivpr | 62 | 2038 |
7 | hivrt | 43 | 1519 |
8 | hmga | 35 | 1480 |
9 | Hsp90 | 37 | 979 |
10 | mr | 15 | 636 |
11 | na | 49 | 1874 |
12 | pr | 27 | 1041 |
Total | | 704 | 25,828 |
5. Results and Discussion
The experimental results on MDDR-DS1, MDDR-DS2, MDDR-DS3, MUV, and DUD are presented in Table 6, Table 7, Table 8, Table 9 and Table 10, respectively, using cut-offs at 1% and 5%. These tables report the results of the SQB method with four different representations of the molecular compounds, compared with the benchmark TAN. The three complex-space techniques (i.e., Tech. 1–Tech. 3) use different weighting functions to embed the molecules in complex number format, while the real Hilbert space is generated as a special case of the quantum space. Each row in the tables lists the recall of one activity class at the top 1% and 5% cut-offs, and the best recall rate in each row is shaded. The Mean rows correspond to the mean recall averaged over all activity classes (the best average is bolded), and the Shaded cells rows give the total number of shaded cells for each technique across the full set of activity classes.
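The recall figures in these tables follow the usual definition: the percentage of a class's actives that appear in the top fraction of the similarity-ranked database. A minimal sketch is given below; the rounding of the cut-off size, the tie handling, the function name and the toy data are assumptions made for illustration, not details taken from the paper.

```python
def recall_at_cutoff(scores, is_active, fraction):
    """Percentage of all actives found in the top `fraction` of the ranking.

    Assumptions: the cut-off size is rounded to the nearest whole molecule
    (at least one), and tied scores keep their original order.
    """
    total_actives = sum(is_active)
    if total_actives == 0:
        raise ValueError("no active molecules in the database")
    # sorted() is stable, so equal scores stay in input order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_n = max(1, int(round(fraction * len(scores))))
    found = sum(is_active[i] for i in order[:top_n])
    return 100.0 * found / total_actives


# Hypothetical similarity scores and activity labels for ten molecules.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_active = [True, False, True, False, False, True, False, False, False, True]

recall_top20 = recall_at_cutoff(scores, is_active, 0.20)
recall_top50 = recall_at_cutoff(scores, is_active, 0.50)
```

In the experiments described here, this quantity would be computed per reference structure at fractions of 0.01 and 0.05 and then averaged over the ten reference structures of each activity class.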
The recall values for MDDR-DS1, MDDR-DS2 and MDDR-DS3 reported in Table 6, Table 7 and Table 8, respectively, show that the proposed SQB method in complex Hilbert space is clearly superior to TAN, especially at the 1% cut-off for DS1–DS3, and for DS1 and DS2 at the 5% cut-off. The complex Hilbert space is generated by re-representing the molecular compounds in complex number format in three different ways. The Okapi technique (i.e., Tech. 3) gave better retrieval results than TAN in DS1–DS3 for the top 1%. For the 5% cut-off, Tech. 1 and Tech. 2 gave better retrieval results than TAN for DS1 and DS2, respectively. In contrast, the results of the SQB method in real space are slightly inferior to TAN at both the 1% and 5% cut-offs for DS1 and DS3, while it outperformed TAN for DS2 at the top 5%.
For the MUV data set, Table 9 shows the results of SQB in the complex and real cases as well as TAN. The SQB method with complex representation (Tech. 3) gave the best retrieval results for nine activity classes at 1% and for eight activity classes at 5%, and consequently its mean recall is better than that of TAN. The real representation of the SQB method gave the best recall values in five activity classes, and its overall mean outperformed TAN for the top 5%; however, the best mean among the SQB variants was obtained with the Tech. 3 complex representation at both cut-offs. For the DUD data set, reported in Table 10, the best retrieval results at both cut-offs were also obtained by SQB with the Tech. 3 complex representation, in terms of both the number of activity classes and the overall mean. In contrast, SQB with the real-space representation outperformed TAN at the 1% cut-off but was inferior to TAN at the 5% cut-off. Because these differences are mixed, statistical analysis was required to support a firm judgment about the performance of the proposed methods.
Some of the activity classes, such as low-diversity activity classes, may contribute disproportionately to the overall value of mean recall. Therefore, using the mean recall as the evaluation criterion could be fair to some methods but not to others. To avoid this bias, the performance of the different methods was further investigated based on the total number of shaded cells for each method across the full set of activity classes. This is shown in the bottom row of Table 6, Table 7, Table 8, Table 9 and Table 10. According to the total number of shaded cells in these tables, the Tech. 3 representation of the SQB method was the best-performing search at the top 1% across the three MDDR data sets. At the top 5%, TAN equalled Tech. 3 of SQB only in the high-diversity data set (i.e., MDDR-DS3), while for the other data sets SQB was preferable. Moreover, the SQB method was superior in terms of the total number of shaded cells in the MUV and DUD data sets.
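The two evaluation criteria, mean recall and number of shaded cells, can be sketched as follows. The excerpt uses the first four rows of Table 6 at the 1% cut-off; the function names are illustrative, and the rule that every method tied for the best value in a row is credited is an assumption, since the paper does not say how ties were shaded.

```python
def shaded_cells(recall_table, methods):
    """Count, per method, the activity classes in which it has the best recall.

    Assumption: when several methods tie for the best value in a row,
    each of them is credited.
    """
    counts = {m: 0 for m in methods}
    for row in recall_table.values():
        best = max(row)
        for method, value in zip(methods, row):
            if value == best:
                counts[method] += 1
    return counts


def mean_recall(recall_table, methods):
    """Mean recall of each method over all activity classes."""
    n = len(recall_table)
    columns = zip(*recall_table.values())
    return {m: sum(col) / n for m, col in zip(methods, columns)}


methods = ["Tech. 1", "Tech. 2", "Tech. 3", "Real", "TAN"]
# First four rows of Table 6 (MDDR-DS1, top 1%).
table6_top1_excerpt = {
    "31420": [72.18, 72.44, 73.73, 70.03, 69.69],
    "71523": [26.33, 25.41, 26.84, 25.58, 25.94],
    "37110": [18.33, 22.09, 24.73, 9.0, 9.63],
    "31432": [41.61, 37.2, 36.66, 37.34, 35.82],
}

counts = shaded_cells(table6_top1_excerpt, methods)
means = mean_recall(table6_top1_excerpt, methods)
```

Counting row winners weights every activity class equally, whereas the mean is dominated by classes with high recall, which is the bias discussed above.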
The Kendall W test of concordance was used to rank the performance of the similarity methods for the MDDR, MUV and DUD data sets. Here, the recall values for the activity classes (11 classes for DS1, 10 classes each for DS2 and DS3, 17 classes for MUV, and 12 classes for DUD) were treated as judges (raters) ranking the similarity methods (ranked objects). The outputs of this test are the Kendall coefficient (W), the chi-square statistic (χ²) and the significance level (p value). The result is considered significant if p < 0.05, in which case it is possible to give an overall ranking of the similarity methods. For instance, the value of the Kendall coefficient for DS1 in Table 11 is 0.222, the p value is significant (p < 0.05), and the overall ranking of the similarity methods is: SQB(C./T3) > SQB(C./T1) > SQB(C./T2) > TAN > SQB(R.). In Table 11, the results of the Kendall W test for the top 1% show that the associated probability (p) is less than 0.05 for all data sets used. This indicates that the rankings of the methods at the 1% cut-off are statistically significant in all cases. The overall rankings indicate that SQB with the Tech. 3 complex representation is superior to TAN and to SQB with the real representation.
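The test statistics can be computed as sketched below, with activity classes as raters (rows) and similarity methods as ranked objects (columns). The sketch assumes the usual tie-corrected formula for W and the standard relation χ² = m(n − 1)W with n − 1 degrees of freedom, where m is the number of raters and n the number of methods; the paper does not state which software or correction was used, and the function names and toy table are hypothetical. The relation is consistent with Table 11: for DS2, 10 × 4 × 0.452 = 18.08.

```python
import math


def average_ranks(values):
    """Rank values from 1 (smallest) to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution (integer df >= 1)."""
    if df % 2 == 0:
        q, k = math.exp(-x / 2), 2
    else:
        q, k = math.erfc(math.sqrt(x / 2)), 1
    # Recurrence: Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) e^(-x/2) / Gamma(k/2 + 1)
    while k < df:
        q += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q


def kendall_w(table):
    """Return (W, chi-square, p) for rows = raters and columns = ranked objects."""
    m, n = len(table), len(table[0])
    rank_rows = [average_ranks(row) for row in table]
    rank_sums = [sum(r[j] for r in rank_rows) for j in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    # Tie correction: sum of (t^3 - t) over groups of tied values in each row.
    ties = 0
    for row in table:
        for v in set(row):
            t = row.count(v)
            ties += t ** 3 - t
    w = 12 * s / (m ** 2 * (n ** 3 - n) - m * ties)
    chi2 = m * (n - 1) * w
    return w, chi2, chi2_sf(chi2, n - 1)


# Hypothetical recall values: 3 activity classes (rows) x 5 methods (columns).
toy = [
    [10.0, 20.0, 30.0, 40.0, 50.0],
    [11.0, 19.0, 35.0, 38.0, 60.0],
    [5.0, 25.0, 26.0, 45.0, 47.0],
]
w, chi2, p = kendall_w(toy)
```

In the toy table every class ranks the methods identically, so W reaches its maximum of 1; the overall ranking of the methods is then read from the column rank sums.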
The Kendall W test results for the top 5% for the three databases (MDDR, MUV and DUD) are reported in Table 12. The p values for the MUV and DUD data sets are 0.031 and 0.0001, respectively, which indicates that the Tech. 3 representation of the SQB method still outperformed the other methods. For the MDDR data sets, in contrast, the Tech. 1 and Tech. 2 representations significantly outperformed the other methods for DS1 and DS2, respectively, and only for MDDR-DS3 did TAN rank above the other methods.
Table 6.
Retrieval results of top 1% and 5% for MDDR-DS1 dataset.
Activity Index | SQB (Complex) Tech. 1, 1% | SQB (Complex) Tech. 2, 1% | SQB (Complex) Tech. 3, 1% | SQB (Real), 1% | TAN, 1% | SQB (Complex) Tech. 1, 5% | SQB (Complex) Tech. 2, 5% | SQB (Complex) Tech. 3, 5% | SQB (Real), 5% | TAN, 5% |
---|---|---|---|---|---|---|---|---|---|---|
31420 | 72.18 | 72.44 | 73.73 | 70.03 | 69.69 | 87.75 | 87.24 | 87.22 | 84.03 | 83.49 |
71523 | 26.33 | 25.41 | 26.84 | 25.58 | 25.94 | 60.16 | 48.48 | 48.7 | 48.65 | 48.92 |
37110 | 18.33 | 22.09 | 24.73 | 9 | 9.63 | 39.81 | 45.77 | 45.62 | 19.56 | 21.01 |
31432 | 41.61 | 37.2 | 36.66 | 37.34 | 35.82 | 82 | 70.57 | 70.44 | 76.48 | 74.29 |
42731 | 19.06 | 20.49 | 21.17 | 17.34 | 17.77 | 28.77 | 24.58 | 24.35 | 28.19 | 29.68 |
06233 | 12.45 | 12.26 | 12.49 | 10.75 | 13.87 | 20.96 | 19.04 | 20.04 | 21.04 | 27.68 |
06245 | 7.18 | 6.37 | 6.03 | 6.03 | 6.51 | 15.39 | 13.99 | 13.72 | 13.63 | 16.54 |
07701 | 10.33 | 10.91 | 11.35 | 8.25 | 8.63 | 26.9 | 25.41 | 26.73 | 21.85 | 24.09 |
06235 | 10.51 | 10.9 | 10.15 | 9.14 | 9.71 | 22.47 | 23.72 | 22.81 | 19.13 | 20.06 |
78374 | 12.46 | 11.77 | 13.08 | 13.65 | 13.69 | 20.95 | 20.73 | 19.56 | 20.55 | 20.51 |
78331 | 6.08 | 6.54 | 5.92 | 5.78 | 7.17 | 10.31 | 11.48 | 11.37 | 13.1 | 16.2 |
Mean | 21.50 | 21.48 | 22.01 | 19.35 | 19.85 | 37.77 | 35.54 | 35.50 | 33.29 | 34.77 |
Shaded cells | 2 | 1 | 5 | 0 | 3 | 4 | 2 | 0 | 0 | 4 |
Table 7.
Retrieval results of top 1% and 5% for MDDR-DS2 dataset.
Activity Index | SQB (Complex) Tech. 1, 1% | SQB (Complex) Tech. 2, 1% | SQB (Complex) Tech. 3, 1% | SQB (Real), 1% | TAN, 1% | SQB (Complex) Tech. 1, 5% | SQB (Complex) Tech. 2, 5% | SQB (Complex) Tech. 3, 5% | SQB (Real), 5% | TAN, 5% |
---|---|---|---|---|---|---|---|---|---|---|
07707 | 72.62 | 71.31 | 72.09 | 58.5 | 61.84 | 75.15 | 74.22 | 74.37 | 70.39 | 70.39 |
07708 | 95.87 | 96.06 | 95.68 | 55.61 | 47.03 | 99.87 | 100 | 99.61 | 64.97 | 56.58 |
31420 | 72.02 | 71.32 | 78.56 | 62.22 | 65.1 | 95.04 | 95.24 | 94.88 | 87.04 | 88.19 |
42710 | 82.18 | 77.45 | 76.82 | 83 | 81.27 | 91.09 | 93 | 91.09 | 89.18 | 88.09 |
64100 | 88.9 | 87.92 | 87.8 | 80.73 | 80.31 | 99.23 | 98.94 | 99.03 | 94.59 | 93.75 |
64200 | 63.3 | 70 | 70.18 | 53.13 | 53.84 | 95.18 | 98.93 | 99.38 | 81.34 | 77.68 |
64220 | 60.9 | 66.79 | 67.58 | 34.61 | 38.64 | 84.06 | 90.9 | 90.62 | 48.11 | 52.19 |
64500 | 67.36 | 78.64 | 79.2 | 29.04 | 30.56 | 83.28 | 92.72 | 92.48 | 47.68 | 44.8 |
64350 | 82.45 | 80.83 | 81.68 | 81.86 | 80.18 | 96.02 | 93.75 | 90.78 | 87.96 | 91.71 |
75755 | 97.6 | 97.91 | 98.02 | 85.4 | 87.56 | 98.17 | 98.39 | 98.37 | 94.07 | 94.82 |
Mean | 78.32 | 79.82 | 80.76 | 62.41 | 62.63 | 91.70 | 93.60 | 93.06 | 76.53 | 75.82 |
Shaded cells | 3 | 1 | 5 | 1 | 0 | 3 | 6 | 1 | 0 | 0 |
Table 8.
Retrieval results of top 1% and 5% for MDDR-DS3 dataset.
Activity Index | SQB (Complex) Tech. 1, 1% | SQB (Complex) Tech. 2, 1% | SQB (Complex) Tech. 3, 1% | SQB (Real), 1% | TAN, 1% | SQB (Complex) Tech. 1, 5% | SQB (Complex) Tech. 2, 5% | SQB (Complex) Tech. 3, 5% | SQB (Real), 5% | TAN, 5% |
---|---|---|---|---|---|---|---|---|---|---|
09249 | 10.17 | 10.61 | 10.99 | 9.92 | 12.12 | 18.05 | 18.26 | 17.8 | 21.4 | 24.17 |
12455 | 5.65 | 6.65 | 7.03 | 5.12 | 6.57 | 7.59 | 10.23 | 11.42 | 8.1 | 10.29 |
12464 | 5.04 | 6.17 | 6.92 | 5.56 | 8.17 | 12.78 | 16.09 | 16.79 | 10.56 | 15.22 |
31281 | 15.14 | 18.19 | 18.67 | 10.29 | 16.95 | 20.86 | 27.43 | 29.05 | 15.14 | 29.62 |
43210 | 5.77 | 6.93 | 6.83 | 5.31 | 6.27 | 11.83 | 13.54 | 14.12 | 14.47 | 16.07 |
71522 | 4.74 | 6.34 | 6.57 | 3.03 | 3.75 | 10.56 | 13.26 | 13.82 | 9.2 | 12.37 |
75721 | 18.44 | 20.14 | 20.38 | 15.24 | 17.32 | 25.1 | 30.13 | 30.61 | 22.27 | 25.21 |
78331 | 6.16 | 6.03 | 6.16 | 5.48 | 6.31 | 10.16 | 12.11 | 11.97 | 12.03 | 15.01 |
78348 | 8.03 | 8 | 8.99 | 9.67 | 10.15 | 20 | 21.89 | 21.14 | 22.72 | 24.67 |
78351 | 10.87 | 11.98 | 12.5 | 10.03 | 9.84 | 11.8 | 12.63 | 13.3 | 11.95 | 11.71 |
Mean | 9.0 | 10.10 | 10.50 | 7.96 | 9.74 | 14.87 | 17.55 | 18.0 | 14.78 | 18.4 |
Shaded cells | 0 | 1 | 5 | 0 | 4 | 0 | 0 | 5 | 0 | 5 |
Table 9.
Retrieval results of top 1% and 5% for MUV dataset.
Activity Index | SQB (Complex) Tech. 1, 1% | SQB (Complex) Tech. 2, 1% | SQB (Complex) Tech. 3, 1% | SQB (Real), 1% | TAN, 1% | SQB (Complex) Tech. 1, 5% | SQB (Complex) Tech. 2, 5% | SQB (Complex) Tech. 3, 5% | SQB (Real), 5% | TAN, 5% |
---|---|---|---|---|---|---|---|---|---|---|
466 | 2.41 | 1.03 | 1.38 | 2.41 | 3.1 | 5.17 | 8.28 | 8.62 | 6.9 | 5.86 |
548 | 8.28 | 10.34 | 11.38 | 7.59 | 8.62 | 22.07 | 24.14 | 24.14 | 21.03 | 22.76 |
600 | 3.79 | 4.48 | 5.52 | 2.41 | 3.79 | 13.1 | 14.83 | 16.21 | 10.34 | 11.38 |
644 | 7.59 | 8.28 | 8.97 | 7.24 | 7.59 | 14.14 | 17.93 | 17.93 | 17.24 | 17.59 |
652 | 2.76 | 4.14 | 3.79 | 2.07 | 2.76 | 7.59 | 8.97 | 9.66 | 8.62 | 7.93 |
689 | 3.79 | 5.17 | 4.48 | 2.07 | 3.79 | 8.28 | 11.38 | 11.72 | 8.28 | 9.66 |
692 | 0.69 | 1.03 | 1.38 | 0.69 | 0.69 | 3.79 | 5.17 | 4.83 | 6.21 | 4.83 |
712 | 3.45 | 4.48 | 5.17 | 4.14 | 4.14 | 9.31 | 12.41 | 11.03 | 16.9 | 10.34 |
713 | 2.76 | 2.76 | 2.76 | 2.41 | 3.1 | 7.59 | 6.55 | 5.86 | 7.24 | 7.24 |
733 | 3.45 | 4.14 | 4.14 | 1.38 | 3.45 | 9.31 | 8.62 | 8.62 | 8.97 | 8.97 |
737 | 2.41 | 1.72 | 1.72 | 1.38 | 2.41 | 8.97 | 8.62 | 8.28 | 12.41 | 8.28 |
810 | 1.72 | 2.41 | 1.72 | 2.41 | 2.07 | 7.24 | 10.34 | 11.03 | 10.34 | 6.9 |
832 | 6.21 | 7.24 | 8.28 | 4.48 | 6.55 | 13.1 | 14.83 | 14.83 | 11.38 | 13.1 |
846 | 10.34 | 12.76 | 12.41 | 8.97 | 9.66 | 25.86 | 25.86 | 26.9 | 23.45 | 28.62 |
852 | 9.66 | 9.31 | 9.66 | 8.62 | 12.41 | 19.31 | 20 | 20 | 18.62 | 21.38 |
858 | 1.72 | 1.38 | 1.38 | 3.1 | 1.72 | 5.17 | 6.21 | 6.21 | 7.93 | 5.86 |
859 | 1.72 | 2.07 | 2.41 | 2.07 | 1.38 | 7.93 | 10 | 8.62 | 10.69 | 8.97 |
Mean | 4.27 | 4.86 | 5.09 | 3.73 | 4.54 | 11.05 | 12.59 | 12.61 | 12.15 | 11.74 |
Shaded cells | 2 | 6 | 9 | 2 | 3 | 2 | 3 | 8 | 5 | 2 |
Table 10.
Retrieval results of top 1% and 5% for DUD dataset.
Activity Index | SQB (Complex) Tech. 1, 1% | SQB (Complex) Tech. 2, 1% | SQB (Complex) Tech. 3, 1% | SQB (Real), 1% | TAN, 1% | SQB (Complex) Tech. 1, 5% | SQB (Complex) Tech. 2, 5% | SQB (Complex) Tech. 3, 5% | SQB (Real), 5% | TAN, 5% |
---|---|---|---|---|---|---|---|---|---|---|
FGFR1T | 2.67 | 2.92 | 2.92 | 2.33 | 2.5 | 6.75 | 6.5 | 7 | 6.17 | 6.67 |
FXA | 3.15 | 3.36 | 3.36 | 2.26 | 1.92 | 8.84 | 7.74 | 8.29 | 8.08 | 7.88 |
GART | 5.25 | 5.75 | 5.75 | 7.5 | 7.75 | 23 | 22.75 | 23.25 | 22.25 | 22.25 |
GBP | 15.77 | 16.73 | 15.96 | 13.65 | 13.27 | 28.65 | 30.58 | 30.96 | 21.35 | 20.96 |
GR | 2.18 | 3.46 | 3.21 | 2.31 | 2.31 | 6.79 | 8.21 | 8.46 | 6.67 | 6.41 |
HIVPR | 4.52 | 2.74 | 3.55 | 3.39 | 3.55 | 13.23 | 10.97 | 11.29 | 11.45 | 11.77 |
HIVRT | 1.86 | 1.86 | 1.86 | 1.63 | 1.63 | 5.58 | 6.74 | 6.98 | 4.65 | 4.88 |
HMGA | 6.57 | 5.43 | 5.43 | 5.71 | 6.29 | 10.86 | 11.43 | 13.14 | 10.29 | 10.29 |
HSP90 | 3.78 | 4.05 | 4.05 | 2.16 | 1.62 | 9.19 | 9.19 | 8.38 | 8.11 | 8.11 |
MR | 5.33 | 5.33 | 5.33 | 5.33 | 5.33 | 8.67 | 9.33 | 10 | 9.33 | 9.33 |
NA | 3.06 | 4.9 | 5.31 | 2.24 | 2.24 | 6.33 | 9.59 | 9.8 | 4.9 | 5.1 |
PR | 2.22 | 2.22 | 2.22 | 1.85 | 1.85 | 5.19 | 6.67 | 5.19 | 4.44 | 4.81 |
Mean | 4.69 | 4.89 | 4.91 | 4.19 | 4.18 | 11.09 | 11.64 | 11.89 | 9.8 | 9.87 |
Shaded cells | 5 | 7 | 6 | 1 | 2 | 3 | 2 | 9 | 0 | 0 |
Table 11.
Rankings of Similarity Methods Based on Kendall W Test Results using MDDR (DS1–DS3), MUV and DUD datasets for Top 1%.
Data Set | W | χ² | p | Ranking |
---|---|---|---|---|
DS1 | 0.222 | 9.808 | 0.043 | SQB(C./T3) > SQB(C./T1) > SQB(C./T2) > TAN > SQB(R.) |
DS2 | 0.452 | 18.08 | 0.001 | SQB(C./T3) = SQB(C./T1) > SQB(C./T2) > SQB(R.) > TAN |
DS3 | 0.483 | 19.356 | 0.0006 | SQB(C./T3) > SQB(C./T2) = TAN > SQB(C./T1) > SQB(R.) |
MUV | 0.272 | 18.518 | 0.0009 | SQB(C./T3) > SQB(C./T2) > TAN > SQB(C./T1) > SQB(R.) |
DUD | 0.258 | 12.415 | 0.014 | SQB(C./T3) > SQB(C./T2) > SQB(C./T1) > TAN > SQB(R.) |
Table 12.
Rankings of Similarity Methods Based on Kendall W Test Results using MDDR (DS1–DS3), MUV and DUD datasets for Top 5%.
Data Set | W | χ² | p | Ranking |
---|---|---|---|---|
DS1 | 0.525 | 23.127 | 0.0001 | SQB(C./T1) > SQB(C./T2) = TAN > SQB(C./T3) > SQB(R.) |
DS2 | 0.738 | 29.535 | 0.000006 | SQB(C./T2) > SQB(C./T1) > SQB(C./T3) > SQB(R.) = TAN |
DS3 | 0.378 | 15.120 | 0.004 | TAN > SQB(C./T3) > SQB(C./T2) > SQB(R.) > SQB(C./T1) |
MUV | 0.155 | 10.588 | 0.031 | SQB(C./T3) = SQB(C./T2) > SQB(R.) > TAN > SQB(C./T1) |
DUD | 0.478 | 22.961 | 0.0001 | SQB(C./T3) > SQB(C./T2) > SQB(C./T1) > TAN > SQB(R.) |
The results of the MDDR searches in Table 6, Table 7 and Table 8 show that the use of complex number format in the Hilbert space of the SQB method at the 1% cut-off produced the highest mean values and numbers of shaded cells compared with TAN and the real representation of the SQB method. The best p-value at the top 1% was 0.0006, for the MDDR-DS3 data set. The technique that employed the Okapi function to embed the molecules in complex number format (Tech. 3) provided the best retrieval results compared with the other embedding techniques as well as the real representation and TAN: it was the best method for five activity classes and had the best mean for each of DS1–DS3. However, the mean retrieval result of the real SQB method was slightly inferior to TAN for DS1 and DS2. At the 5% cut-off, TAN had a better mean than SQB complex/Tech. 3 on the MDDR-DS3 data set, even though both methods had the same number of shaded cells. For the MDDR-DS1 data set the situation is reversed: the Tech. 1 complex representation of the SQB method outperformed TAN in mean recall, even though both methods had the same number of shaded cells. For the homogeneous data set (MDDR-DS2), in contrast, the mean of the real SQB method at the 5% cut-off was superior to that of TAN. While the MDDR-DS2 data set includes highly similar actives, the MUV and DUD data sets have been carefully designed to include sets of highly dissimilar actives. Most similarity methods, including the methods proposed here, show a very high recall rate for the low-diversity data set and very low recall for the high-diversity data sets, such as MDDR-DS3, MUV and DUD used in this study.
The results for the MUV and DUD data sets are shown in Table 9 and Table 10, respectively. For MUV, the results at both the 1% and 5% cut-offs slightly favor the SQB method over the benchmark TAN method. The mean recall of the complex Tech. 3 SQB method exceeded that of TAN at both cut-offs, and the proposed method was also superior to TAN in terms of shaded cells. The real representation of the SQB method, in contrast, outperformed TAN only at the top 5%. For the DUD data set, the complex representations of the SQB method outperformed the other methods at both cut-offs on both criteria, i.e., mean recall and number of best activity classes (shaded cells). Moreover, the Kendall W test for the DUD data set shows the superiority of the complex cases of the SQB method over TAN, with p-values of 0.014 for the top 1% and 0.0001 for the top 5%. For the MUV data set, the complex Tech. 3 and Tech. 2 cases of the SQB method outperformed the other methods at a significant level, with p-values of 0.0009 and 0.031 for the top 1% and 5%, respectively, while the real SQB method exceeded TAN only at the top 5%.
In summary, the proposed SQB method with four different representations of molecules was investigated in ten cases, covering the three popular databases at both the 1% and 5% cut-offs. The use of complex number format for molecular representation proved superior to the real representation and the benchmark TAN method, with the complex SQB method outperforming TAN in nine cases. The best proposed technique for embedding in complex format is the one based on the Okapi function, which outperformed the other complex techniques, real SQB, and TAN. The real representation of the SQB method, in contrast, was slightly preferable to TAN in three cases: MDDR-DS2 and MUV at the top 5%, and DUD at the top 1%.