Enhancing Conformational Sampling for Intrinsically Disordered and Ordered Proteins by Variational Autoencoder

Zhu, Jun-Jie; Zhang, Ning-Jie; Wei, Ting; Chen, Hai-Feng

doi:10.3390/ijms24086896

by Jun-Jie Zhu ¹,†, Ning-Jie Zhang ¹,†, Ting Wei ¹ and Hai-Feng Chen ¹,²,*

¹ State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic & Developmental Sciences, Department of Bioinformatics and Biostatistics, National Experimental Teaching Center for Life Sciences and Biotechnology, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China

² Shanghai Center for Bioinformation Technology, Shanghai 200240, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Int. J. Mol. Sci. 2023, 24(8), 6896; https://doi.org/10.3390/ijms24086896

Submission received: 3 February 2023 / Revised: 26 March 2023 / Accepted: 27 March 2023 / Published: 7 April 2023

(This article belongs to the Section Molecular Biophysics)

Abstract:

Intrinsically disordered proteins (IDPs), which have no fixed three-dimensional structure under physiological conditions, account for more than 50% of the human proteome and are closely associated with tumors, cardiovascular diseases, and neurodegeneration. Because of their conformational diversity, conventional experimental methods of structural biology, such as NMR, X-ray diffraction, and cryo-EM, are unable to capture their conformational ensembles. Molecular dynamics (MD) simulation can sample dynamic conformations at the atomic level and has become an effective method for studying the structure and function of IDPs. However, its high computational cost prevents MD simulation from being widely used for conformational sampling of IDPs. In recent years, significant progress in artificial intelligence has made it possible to address the conformational reconstruction problem of IDPs with fewer computational resources. Here, based on short MD simulations of different IDP systems, we use variational autoencoders (VAEs) to achieve generative reconstruction of IDP structures that covers a wider range of the conformations sampled in longer simulations. Compared with generative autoencoders (AEs), VAEs add an inference layer between the encoder and decoder in the latent space, which covers the conformational landscape of IDPs more comprehensively and achieves the effect of enhanced sampling. In validation on the 5 IDP test systems, the Cα RMSD between VAE-generated and MD-sampled conformations was significantly lower than that of AE, and the Spearman correlation coefficient on the structure was higher than that of AE. VAEs also achieve excellent performance on structured proteins. In summary, VAEs can be used to effectively sample protein structures.

Keywords:

intrinsically disordered proteins; variational autoencoders; enhancing sampling; autoencoders

1. Introduction

Intrinsically disordered proteins (IDPs) [1] have no fixed three-dimensional structure under physiological conditions. Previous work has shown that IDPs make up more than 40% of eukaryotic proteomes [2]. Owing to their variable conformations, IDPs play key roles in biological processes such as transcription, signal transduction, and protein modification by binding to a wide variety of receptors or ligands [3]. IDPs are also closely associated with complex diseases such as tumors, cardiovascular diseases, and neurodegeneration; examples include the p53 tumor suppressor protein [4], α-synuclein [5], and abnormally phosphorylated Tau protein [6] in Alzheimer’s disease. In addition, many disease-associated proteins contain intrinsically disordered regions in their functional cores, most of which are target sites of specific drug molecules.

In the study of IDPs, the key conformations and the corresponding dynamic processes largely determine the results. Nuclear magnetic resonance (NMR), X-ray crystallography, and cryo-EM are the traditional experimental methods used in structural biology to determine protein structures. However, because of the conformational diversity of IDPs, these methods cannot capture the conformational ensemble [7]. Molecular dynamics (MD) simulations generate a continuous, atomic-level trajectory, which makes them a powerful tool for studying the structural and dynamic properties of disordered regions [8], and they have therefore become an important supplement to experimental methods. However, the high computational cost of MD restricts the simulation time scale to a few microseconds, so there is still an unavoidable risk of under-sampling [9]. Sampling IDP conformations as extensively as possible requires longer simulations, which in turn demand more computational resources.

Recently, deep learning has been proposed to improve the sampling efficiency of protein conformations [10]. By transforming the Cartesian coordinates of three-dimensional conformations into low-dimensional vectors, neural network models can be trained to generate a latent space from which new three-dimensional conformations can be sampled at a relatively low computational cost [11]. Gupta et al. [12] designed an autoencoder (AE) for IDP trajectories and calculated a multivariate Gaussian distribution in the latent space to sample more diverse IDP conformations. However, judging by the Spearman correlation coefficient between the AE-generated conformations and the original ones, the Gaussian distribution might not conform to the true distribution of IDP conformations. Here we designed variational autoencoders (VAEs) to avoid this mismatch and explore the conformational space of IDPs more rationally. In comparison tests on all 5 IDP systems, ranging from RS1 with 24 residues to α-synuclein with 140 residues, VAEs outperformed AEs, generating conformations more similar to the original ones.

2. Results

We built a VAE to generate protein conformations, mainly for IDPs, that reconstruct the conformational ensemble as fully as possible from as few simulated conformations as possible. The VAE model was tested on 5 IDP systems: RS1, Abeta40, PaaA2, R17, and α-synuclein. After training on conformations sampled from short MD simulations, the VAE can effectively generate new conformations resembling those sampled in long MD simulations.

To evaluate performance on the different IDP systems, we first calculated the RMSD between the VAE-generated conformations and the input ones to measure their similarity. Then, to represent the overall conformational distribution, we calculated the Spearman correlation. Next, we used chemical shifts and the radius of gyration to further verify the authenticity and rationality of the structures. We also performed the same tests on 2 structured protein systems, BPTI and ubiquitin; the results suggest that there is still considerable room to improve the current VAE model.

2.1. VAEs Perform Better than AEs on Different Data Split

First, to evaluate the performance of AEs and VAEs in generating new IDP conformations, models were trained with data of different scales. We split the 50,000 conformations of each protein into two parts: the first part (e.g., the first 10% of the conformations) served as the training set and the remainder as the test set. This split exploits a feature of MD simulations, namely that the first part of a trajectory often contains more extended and diverse conformations, while the latter part contains relatively compact ones [13]. Training on the first part lets the model learn sufficiently diverse conformations, and it means that a similar training process can be carried out on a rather short MD trajectory. For the AEs, we set the latent dimension to 2/3 N_{Res} (N_{Res}: number of residues) and built them with four-layer encoders, following the setting designed by Gupta et al. [12]. For the VAEs, because the latent dimension was found not to significantly affect the quality of generated conformations, we set the latent dimension to 2 and built them with four-layer encoders, the same as for the AEs (Table 1). In this comparative test, VAE had clear advantages over AE on the conformation generation task. The results are shown in Figure 1.
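The chronological split described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the function names `chronological_split` and `ae_latent_dim` are hypothetical, frame indices stand in for the saved conformations, and rounding 2/3 N_{Res} to the nearest integer is an assumption, since the text does not say how the fraction is rounded.

```python
from typing import List, Sequence, Tuple


def chronological_split(frames: Sequence, train_fraction: float) -> Tuple[List, List]:
    """Use the first `train_fraction` of the trajectory for training, the rest for testing."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    n_train = int(round(len(frames) * train_fraction))
    return list(frames[:n_train]), list(frames[n_train:])


def ae_latent_dim(n_res: int) -> int:
    """Latent size of the AE baseline: 2/3 of the residue count (rounding is an assumption)."""
    return round(2 * n_res / 3)


# Frame indices stand in for the 50,000 saved conformations of one protein.
frame_ids = range(50_000)
train_ids, test_ids = chronological_split(frame_ids, 0.10)
print(len(train_ids), len(test_ids))  # 5000 45000
print(ae_latent_dim(140))             # alpha-synuclein has 140 residues
```

Because the frames are never shuffled, the test set always comes from later simulation time than the training set, which is what allows the test to mimic extending a short trajectory.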

VAEs performed better than AEs, with lower RMSDs and higher Spearman coefficients, meaning that the VAE-generated conformations better restored the characteristics of the original ones. For example, on the PaaA2 system, when the models were trained with the first 50% of the data set (25,000 conformations), VAE produced an average RMSD of 7.53 Å while AE gave 9.55 Å. The mean RMSDs and Spearman coefficients of VAE differ significantly from those of AE in the Wilcoxon rank sum test (p-value less than 1.58 × 10⁻⁵), with lower RMSDs and higher Spearman coefficients for VAE. Moreover, unlike AE, VAE did not sample extremely deviated conformations with large RMSDs relative to the original set. This is because the inference layer of VAE is based on the established distribution, which ensures more continuous sampling; e.g., for Abeta40, AE gave a maximum RMSD of 38.63 Å while the maximum RMSDs from VAE were all lower than 24 Å. In addition, VAE produced much higher Spearman coefficients than AE. In the Abeta40 and PaaA2 systems, AE gave its lowest coefficients of 0.379 and 0.186, respectively, which suggests that sampling the latent space of AE with a Gaussian distribution may not accurately capture the low-dimensional distribution mapped by the encoder. VAE, on the other hand, produced a more uniform latent space, ensuring that sampled latent vectors better fit the distribution of the input data; its Spearman coefficient was always larger than 0.55.
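For readers unfamiliar with the rank sum test used in this comparison, the sketch below shows how such a test works. It is an illustration only: the text does not state which implementation was used, the version here uses the normal approximation without tie correction, and the two RMSD lists are synthetic placeholders rather than values from this study.

```python
import math
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def rank_sum_test(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Wilcoxon rank sum test (normal approximation, no tie correction)."""
    n_a, n_b = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    w = sum(ranks[:n_a])                       # rank sum of the first sample
    mean_w = n_a * (n_a + n_b + 1) / 2         # expected rank sum under H0
    sd_w = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))       # two-sided p-value
    return z, p


# Synthetic RMSD values in angstroms, for illustration only.
vae_rmsd = [7.1, 7.4, 7.6, 7.3, 7.8, 7.5, 7.2, 7.7]
ae_rmsd = [9.2, 9.6, 9.4, 9.9, 9.1, 9.5, 9.8, 9.3]
z, p = rank_sum_test(vae_rmsd, ae_rmsd)
print(f"z = {z:.2f}, p = {p:.2g}")
```

A negative z here means the first sample tends to rank lower than the second, which corresponds to lower RMSDs.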

2.2. VAEs Can Be Further Optimized by Adjusting Hyperparameters

After clarifying the advantages of the VAE over AE with the same layer parameters, we further adjusted the parameters of the VAE model to check whether the model could be further optimized.

Here we first adjusted the number of layers of the model. The number of neuron units in each layer was adjusted according to the number of layers to balance the compression factor between layers (Table S3). The test results are shown in Figure 2.

For almost all neural network models, more hidden layers make the network more complex and may let it capture more comprehensive features of the input during training. However, a deeper network is also more likely to lose essential biological information through overfitting. According to the above results, changing the number of VAE layers did not significantly affect the RMSDs of the generated IDP conformations; only α-synuclein showed smaller RMSDs, when the VAE had 4 or 5 layers. In terms of the Spearman coefficient, the 4-layer VAE reached the highest coefficient on all IDPs except RS1. Taken together, the 4-layer VAE performed relatively better on our IDP structure generation task.

We next examined the effect of training epochs and the dimension of latent space on VAE performance, using the Spearman coefficient and the average RMSD on the test set as measures. The number of training epochs and the latent dimension were each varied from 2 to 20 in steps of 2. Figure S1 shows that VAEs converged rapidly, within 5 epochs. Meanwhile, the dimension of latent space did not significantly affect the performance of VAE, as shown in Figure 2. The VAE with a 2-dimensional latent space performed slightly better than the others in the tests; e.g., it gave the minimum RMSD of 9.72 Å on Abeta40.

2.3. Visualization of VAE-Generated Conformations

To display representative and diverse VAE-generated conformations, we generated and refined 500 conformations for each IDP. We then clustered the conformations with the k-centroid method according to their RMSD and displayed the conformations identified as cluster centers in PyMOL (Figure 3 and Figure S2). VAE captured the global characteristics of the IDPs, generating conformations similar to the input ones. It generated rather extended conformations for RS1 and reproduced the random coils of α-synuclein well, which means that VAE can generate disordered regions. For Abeta40, PaaA2, and R17, which are more compact and contain several helices in their natural state, the VAE-generated conformations also showed a similar proportion of these structured regions. Although VAE mispredicted the short sheets in α-synuclein as loops of the same direction, which might result from the difficulty deep learning models have in learning long-range interactions, it performed well on the target IDP systems overall. Nevertheless, some VAE-generated conformations had large RMSDs against the input ones and lay far from the original set, as indicated in Figure 1. We consider that these conformations still preserve the essential structural properties of IDPs and represent novel conformations. Principal component analysis showed that the generated conformations span quite a wide range in the first and second principal components and form a distribution similar to that from MD in the same principal component space, especially for RS1 and Abeta40, which further supports the diversity and novelty of the generated conformations (Figure S3).

2.4. Experimental Validation of VAE-Generated Conformations

Finally, we checked whether the VAE-generated conformations were consistent with relevant experimental data. We calculated the secondary chemical shifts of Cα for the generated conformations and the MD trajectories with SPARTA+ [14] and compared them with experimental measurements. We also calculated the radius of gyration (Rg) of each IDP for the VAE-generated conformations and the MD trajectories with MDTraj, and computed Ramachandran plots for the generated ensembles to further check their validity and diversity (Figure S4). Figure 4 compares the secondary chemical shifts from the three datasets, and Figure 5 compares the Rg values. Here, 50,000 conformations were generated for the chemical shift calculation, the same number as in the MD trajectories. The results show that the VAE-generated conformations reproduce the local features of the experimental structures about as well as the MD simulations do (Figure 4): the mean squared error between the chemical shifts of the VAE conformations and the experimental ones did not deviate from that between the MD trajectories and the experimental ones. In addition, the Rg calculated from VAE-generated conformations was slightly higher than that from MD trajectories in three IDP systems (RS1, Abeta40, and α-synuclein) and consistent with that from MD trajectories in the other two systems. In some of the tested systems, the conformations generated by both MD simulation and VAE deviate considerably from the experimental values, for example in the Rg values of PaaA2 and α-synuclein. Because our VAE model is trained entirely on structures from the MD simulations, reducing the error between MD and experimental data is the fundamental strategy for further improving the performance of the VAE model.

2.5. Tests of VAEs on Structured Proteins

After showing that VAE performed well in enhanced conformational sampling of IDPs, we tested it on two structured proteins to check whether it could achieve the same success as on IDPs. The two structured proteins are BPTI and ubiquitin, which contain 58 and 76 residues, respectively (Table 2). Ubiquitin was processed in the same way as the IDPs, simulated with the ESFF1 force field and the TIP3P water model. The BPTI trajectory was taken from the previous work of D.E. Shaw et al. [15]. We conducted the same tests on structured proteins, calculating RMSD and Spearman coefficients between VAE-generated conformations and MD trajectories. Results showed that VAE performed better than AE in the conformational acquisition of structured proteins (Figure 6).

Increasing the proportion of the training set did not significantly change the performance of either AE or VAE, because the conformations of structured proteins are much more stable than those of IDPs, so the conformational distribution in the MD trajectory is also more uniform. VAE and AE differed little in performance on these two structured protein systems. For the BPTI system, the average RMSD of AE was about 1.42 Å, while that of VAE was about 1.70 Å. For the ubiquitin system, the average RMSD between AE-generated and input conformations was about 1.63 Å, while that of VAE was about 1.55 Å.

From visualization and experimental validation, the differences among VAE-generated conformations of structured proteins are small, and the clusters of the generated distribution are more concentrated. For the ubiquitin system in particular, the RMSD difference between conformations was less than 0.05 Å in 92.9% of cases.

3. Discussion

Extensive sampling of protein conformational space is a critical foundation for molecular simulation and docking [16]. In this research, we constructed a VAE model, a generative neural network, which performed well on both IDPs and structured proteins. The VAE is trained to learn the sampled conformations from MD simulations in a relatively short period (Table S4); it not only reduces the input protein structure to a low-dimensional space through a nonlinear transformation but also generates new conformations by perturbing the latent space with Gaussian noise [17,18]. A well-trained VAE can sample conformations of target proteins effectively and accurately, both for structured proteins and IDPs.

Compared to AE, VAE fits the latent space to a more stable distribution, like the Gaussian distribution here, which contributes to continuous sampling from the latent space to avoid generating unreasonable conformations, and also enables smoother property changes between the reconstructed structures.

In this study, the number of VAE-generated conformations was kept equal to the test set size. As VAE applies uniform sampling to the latent space, we can foresee that, as the number of generated conformations increases, the conformational features presented in the generated structures will become more continuous [12], and some high-energy conformations different from the input ones will be predicted. Such conformations are traditionally quite hard to sample with MD simulation but much easier to obtain with VAE. However, it is difficult to measure the quality of this part of the generated conformations. We paid more attention to whether most generated conformations were similar to the input ones, because this indicates whether our model successfully learned the essential features of IDPs. Because most conformations are similar to the input ones, the remaining conformations are likely to be either potential high-energy IDP conformations obtained by changing the sampling distribution or IDP conformations in unsampled potential wells. The results of this study show that VAE can indeed restore the characteristics of the input conformations in a large proportion of the generated ones (Figure 3). Therefore, the performance of VAE is better than that of AE, and the extreme conformations generated by VAE can be considered real unsampled IDP conformations.

From the tests, we found that there was no linear correlation between the VAE reconstruction mean deviation and the size of protein systems. For example, on the Abeta40 system (40 residues), the average RMSD between the reconstructed structure and the input structure reached 7.36 Å, lower than 8.66 Å from the RS1 system (24 residues). While visualizing the generated conformations, we found that structured motifs, such as α-helix or β-sheet, were preserved better than disordered regions.

Clearly, the present VAE training scheme has some limitations in reproducing trajectories of IDPs. Although it performs better than AE, VAE takes only flattened atomic coordinate vectors as input during feature extraction. Some physical and chemical properties of proteins are lost when the coordinates are flattened, which might be why the predictive performance of VAE reaches a certain limit [19]. Therefore, to further improve deep learning networks for protein conformation generation, we can consider retaining as much protein property information as possible. We believe that representing proteins with graph networks [20] is well worth trying, since the atomic positions and chemical bond information of proteins would be reduced and extracted in the form of graphs [21,22]. The dimensionality reduction processing and reduction criteria for such graph-like data need more thought in the next step.

4. Material and Methods

4.1. Molecular Dynamics Simulation

We conducted 1 μs MD simulations for the target IDP systems, including RS1 [23], Abeta40 [24], PaaA2 [25], R17 [26], and α-synuclein [27]; the properties of these IDPs are listed in Table S5. The detailed simulation conditions for these IDPs and the structured proteins, which are consistent with the corresponding experimental conditions, are listed in Table 2. All simulations were conducted in AMBER18 [28] with the ESFF1 force field [29] for the protein and the TIP3P [30] solvent model, starting from the same initial structures as in the work of Mu et al. [13]. As the calculated radius of gyration for MD trajectories agreed with experimental values, these test systems were considered converged and to have correctly captured the features of disordered proteins [13,15].

For MD simulation preparation, we first packed the IDPs in cubic boxes using tleap in AmberTools [28], adding water molecules and neutralizing the system with Na⁺ and Cl⁻ ions. Extra ions were then added to achieve an ionic strength in accordance with experimental conditions. Energy minimization was performed with 500 steps of the steepest descent method and 500 steps of the conjugate gradient method. A total of 10 picoseconds of canonical ensemble (NVT) MD simulation was conducted to heat the systems. Finally, the systems were switched to the NPT ensemble and simulated three times independently with pmemd.cuda GPU acceleration [31]. The electrostatic interactions were calculated using the Particle Mesh Ewald (PME) algorithm [32].

4.2. PDB Data Extraction and Preprocessing

Protein trajectories in AMBER trajectory format were converted into PDB format with CPPTRAJ [33], stripping all solvent molecules and ions. A total of 50,000 frames at 20 ps intervals were saved for each protein. Then, we used the Biobox library in Python to align all frames to the first one according to the RMSD of backbone atoms [34]. The Cartesian coordinates of the heavy atoms were flattened into 1D vectors (i.e., a coordinate matrix of shape [N, 3] is flattened into a vector of length 3N) and further normalized using MinMax scaling as in Equation (1):

c_{i}^{normed} = \frac{c_{i} - \min(c)}{\max(c) - \min(c)}

(1)

where c refers to the input vector and c_{i} is the i-th coordinate of the vector. Normalization of the input data accelerates model convergence and improves the model’s accuracy. Because the minimum and maximum of the original data are retained, the normalized data can be back-transformed precisely for analysis.
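The flattening and MinMax scaling of Equation (1), together with the back-transformation, can be sketched in plain Python as below. The helper names are hypothetical, and the example follows Equation (1) literally by taking the minimum and maximum over a single input vector; whether the original code instead uses extrema over the whole data set is not stated, so that choice is an assumption.

```python
from typing import List, Sequence, Tuple


def flatten(coords: Sequence[Sequence[float]]) -> List[float]:
    """[N, 3] coordinate matrix -> vector of length 3N (x1, y1, z1, x2, ...)."""
    return [value for atom in coords for value in atom]


def unflatten(vector: Sequence[float]) -> List[List[float]]:
    """Vector of length 3N -> [N, 3] coordinate matrix."""
    return [list(vector[i:i + 3]) for i in range(0, len(vector), 3)]


def minmax_scale(c: Sequence[float]) -> Tuple[List[float], float, float]:
    """Equation (1); also returns min(c) and max(c), which the inverse needs."""
    lo, hi = min(c), max(c)
    if hi == lo:
        raise ValueError("cannot scale a constant vector")
    return [(v - lo) / (hi - lo) for v in c], lo, hi


def minmax_inverse(scaled: Sequence[float], lo: float, hi: float) -> List[float]:
    """Map scaled values in [0, 1] back to Cartesian coordinates."""
    return [v * (hi - lo) + lo for v in scaled]


# Two toy "atoms" stand in for the heavy atoms of one aligned frame.
coords = [[1.0, 2.0, 3.0], [4.0, 6.0, -2.0]]
flat = flatten(coords)
scaled, lo, hi = minmax_scale(flat)
restored = unflatten(minmax_inverse(scaled, lo, hi))
print(scaled)  # [0.375, 0.5, 0.625, 0.75, 1.0, 0.0]
```

Keeping `lo` and `hi` alongside the scaled data is what makes the back-transformation exact.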

4.3. Variational Autoencoder Design

A major purpose of training autoencoders with MD trajectories is to obtain latent variables through encoding. These variables can be used as collective variables (CVs) in metadynamics simulations [35], which are a feasible way to explore conformational space. AEs are a kind of unsupervised deep learning model whose architecture includes two parts, an encoder and a decoder [36]. The encoder maps the input data (e.g., flattened Cartesian coordinates) to a low-dimensional latent space, while the decoder, whose structure is completely symmetrical to that of the encoder, generates high-dimensional data (e.g., reconstructed coordinates) from the encoder-reduced latent vectors. However, evidence shows that the latent space of AEs trained on protein conformations fails to capture a useful low-dimensional representation [37,38]. In addition, classical AEs are trained with discrete data, which makes it difficult to generate novel conformations of sufficient credibility. Gupta et al. [12] calculated a multivariate Gaussian distribution in the latent space of AEs to sample new IDP conformations. Nevertheless, the Spearman correlation coefficients between the distribution of AE-generated conformations and that of the original ones show that the latent distribution of AE is sometimes far from Gaussian.

To address these problems, we built variational autoencoder (VAE) models (Figure 7). Compared with the AE model, VAE adds a variational layer between the encoder and decoder, constraining latent vectors to follow a specified distribution (e.g., a Gaussian distribution) during optimization. First, the encoder of the VAE is a deep neural network (DNN) with the rectified linear unit (ReLU) as its activation function, which compresses the input x into a two-dimensional latent space z. Then, the latent space is perturbed by Gaussian noise in the variational layer to sample z′, as described by Kingma and Welling [39]. Finally, the decoder mirrors the encoder, mapping z′ to the final output x′.
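The Gaussian perturbation in the variational layer can be illustrated with the reparameterization step introduced by Kingma and Welling. The sketch below is plain Python rather than the Keras layer used in this work; the name `sample_latent`, the log-variance parameterization, and the example encoder outputs are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence


def sample_latent(mu: Sequence[float], log_var: Sequence[float],
                  rng: random.Random) -> List[float]:
    """Draw z' = mu + sigma * eps with eps ~ N(0, 1) and sigma = exp(log_var / 2)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


rng = random.Random(0)

# Hypothetical encoder outputs for one frame in a two-dimensional latent space.
mu, log_var = [0.3, -1.2], [-2.0, -0.5]

# Repeated draws scatter around mu with a spread set by sigma.
draws = [sample_latent(mu, log_var, rng) for _ in range(20_000)]
mean_z = [sum(d[k] for d in draws) / len(draws) for k in range(2)]
print(mean_z)  # close to mu
```

Writing the sample as a deterministic function of μ, σ, and an independent noise term is what lets gradients flow through the sampling step during training.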

VAE constructs a continuous latent space, unlike the discrete one of AE, from which conformations similar to the input ones can be sampled. The test results on various protein systems indicate that VAE performs better than AE under various criteria, as presented in the Results section.

4.4. VAE Training

In this implementation, we developed the AEs and VAEs in Python 3.8 using Keras (https://keras.io/ (accessed on 23 May 2022)) with the TensorFlow backend v2.4.1 (https://www.tensorflow.org/ (accessed on 23 May 2022)) [40]. The loss function that assesses the VAE model is calculated by summing two parts, the reconstruction loss (L_{R}) and the Kullback–Leibler divergence loss (L_{KL}), as in Equation (2):

L_{VAE} = L_{R} + L_{KL}

(2)

The first part, the reconstruction loss L_{R}, is the mean squared error between the predicted data x′ and the original input x. It quantifies how well the original data x is approximated by the VAE, evaluating the ability of the VAE to reconstruct IDP conformations, as in Equation (3):

L_{R} = E[\lVert x' - x \rVert^{2}] = \frac{1}{n} \sum_{i=1}^{n} (x_{i}' - x_{i})^{2}

(3)

The second part, the Kullback–Leibler divergence loss L_{KL}, is based on variational inference on the latent space of the VAE and takes into account the latent space prior from which z′ is sampled, as in Equation (4):

L_{KL} = -E[\frac{1 + \log σ(z)^{2} - μ(z)^{2} - σ(z)^{2}}{2}]

(4)

where μ and σ, separate latent vectors produced by the encoder, estimate the mean and standard deviation, respectively, of the Gaussian distribution applied on the latent space. The composite VAE loss function allows the model to sample novel conformations of high credibility with the same priors.
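As a small numerical illustration of the composite loss, the sketch below evaluates the reconstruction and Kullback–Leibler terms for single vectors in plain Python. It is not the Keras implementation: the function names are hypothetical, the KL term is written as the non-negative closed-form divergence between a diagonal Gaussian N(μ, σ²) and a standard normal prior so that it can be minimized together with L_{R}, and averaging over latent dimensions is an assumption.

```python
import math
from typing import Sequence


def reconstruction_loss(x_pred: Sequence[float], x: Sequence[float]) -> float:
    """Mean squared error between the decoder output x' and the input x."""
    return sum((p - t) ** 2 for p, t in zip(x_pred, x)) / len(x)


def kl_loss(mu: Sequence[float], sigma: Sequence[float]) -> float:
    """KL divergence of N(mu, sigma^2) from N(0, 1), averaged over latent dimensions."""
    terms = (1 + math.log(s ** 2) - m ** 2 - s ** 2 for m, s in zip(mu, sigma))
    return -0.5 * sum(terms) / len(mu)


def vae_loss(x_pred: Sequence[float], x: Sequence[float],
             mu: Sequence[float], sigma: Sequence[float]) -> float:
    """Composite loss: reconstruction term plus KL term."""
    return reconstruction_loss(x_pred, x) + kl_loss(mu, sigma)


# A latent distribution equal to the prior contributes no KL penalty.
print(kl_loss([0.0, 0.0], [1.0, 1.0]))
# Shifting the mean away from zero is penalized.
print(kl_loss([1.0], [1.0]))  # 0.5
print(vae_loss([1.0, 2.0], [1.0, 4.0], [1.0], [1.0]))  # 2.0 + 0.5 = 2.5
```

The KL term vanishes only when μ = 0 and σ = 1, so it pulls the encoded distribution toward the prior while the reconstruction term keeps the decoded coordinates close to the input.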

4.5. Evaluation Criteria Calculation

Several previous studies have systematically designed the evaluation system for AEs and VAEs on reconstructing the nonlinear representation of tertiary protein structures [41,42,43]. Here we applied a similar strategy to quantify the performance of AE and VAE. There are four evaluation indicators: root-mean-square deviation (RMSD), Spearman correlation coefficient, chemical shift, and radius of gyration (Rg).

4.6. Root-Mean-Square Deviation (RMSD)

RMSD quantifies the differences between generated conformations and the input ones. RMSD is calculated as in Equation (5):

RMSD = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (r_{i}^{0} - r_{i})^{2}}

(5)

where r_{i}^{0} and r_{i} are the Cartesian coordinates of the i-th atom from the reference r^{0} and the structure r, respectively, and N is the number of atoms. Here the coordinates of structure r are all back-transformed into Cartesian space, and all heavy atoms in the protein are considered when calculating RMSD.
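Equation (5) can be written in a few lines of plain Python, as sketched below. The function name `rmsd` is hypothetical, and the example assumes the two structures are already superimposed (as the frames are after alignment in Section 4.2), so no fitting step is performed.

```python
import math
from typing import Sequence


def rmsd(reference: Sequence[Sequence[float]],
         structure: Sequence[Sequence[float]]) -> float:
    """Root-mean-square deviation between two pre-aligned sets of (x, y, z) coordinates."""
    if len(reference) != len(structure):
        raise ValueError("both structures must contain the same number of atoms")
    squared = sum((a - b) ** 2
                  for r0, r in zip(reference, structure)
                  for a, b in zip(r0, r))
    return math.sqrt(squared / len(reference))


# Toy example: every atom is displaced by exactly 1 angstrom along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
generated = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(reference, generated))  # 1.0
```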

4.7. Spearman Correlation Coefficient

The Spearman correlation coefficient quantitatively reflects the difference in data distribution between generated coordinate vectors and original ones. A higher coefficient indicates that the original space is better preserved by the model. The Spearman correlation coefficient is calculated as in Equation (6):

ρ = 1 - \frac{6 \sum d_{i}^{2}}{n (n^{2} - 1)}

(6)

where d_{i} is the difference between the paired ranks and n is the sample size. The calculation of the Spearman correlation coefficient is conducted with SciPy [44].
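To make Equation (6) concrete, the following plain-Python sketch evaluates it directly. It is only a check of the formula, which holds when there are no tied values; the study itself relies on SciPy, which also handles ties. The function names and the two short example vectors are hypothetical.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman coefficient from squared differences of paired ranks."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


original = [1.0, 2.0, 3.0, 4.0, 5.0]
generated = [2.0, 1.0, 4.0, 3.0, 5.0]   # two neighbouring pairs swapped
print(spearman_rho(original, generated))
```

Only the ordering of the values matters, so any monotonic rescaling of either vector leaves ρ unchanged.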

4.8. Chemical Shift

The secondary chemical shift of Cα can be an index for experimental validation. SPARTA+ [14] is used to calculate the chemical shifts of generated proteins and MD trajectories, and the sources of the experimental chemical shifts are listed in Table S1.

4.9. Radius of Gyration

The radius of gyration (Rg) is a global structural characteristic that correlates with the length of the protein. Here, MDTraj [45] was used to calculate the Rg of generated proteins and MD trajectories, and the experimental Rg values are listed in Table S2.
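For reference, the quantity being computed can be sketched from its definition in plain Python. This is not the MDTraj routine used in the study: the function name is hypothetical, and treating all atoms as equally weighted unless masses are supplied is an assumption of this example.

```python
import math
from typing import Optional, Sequence


def radius_of_gyration(coords: Sequence[Sequence[float]],
                       masses: Optional[Sequence[float]] = None) -> float:
    """Root of the (mass-weighted) mean squared distance of atoms from their center."""
    if masses is None:
        masses = [1.0] * len(coords)          # equal weights by default
    total = sum(masses)
    center = [sum(m * atom[k] for m, atom in zip(masses, coords)) / total
              for k in range(3)]
    squared = sum(m * sum((atom[k] - center[k]) ** 2 for k in range(3))
                  for m, atom in zip(masses, coords))
    return math.sqrt(squared / total)


# Two points 2 angstroms apart: each lies 1 angstrom from the center.
print(radius_of_gyration([(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]))  # 1.0
```

Applied to every frame of an ensemble, this gives the Rg distribution that is compared between generated conformations and MD trajectories.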

4.10. Refinement and Visualization of Generated Structures

VAE-generated conformations may contain considerable violations of bond lengths and angles, so it is necessary to complete and refine the resulting structures. First, tleap was used to add missing hydrogens. Then we subjected the structures to energy minimization in water, consisting of 250 steps of the steepest descent method and 250 steps of the conjugate gradient method. The force field used was AMBER ff14SB [46], and the solvent model was TIP3P.

To obtain representative conformations for visualization, we clustered the refined structures according to their all-against-all RMSD. Here we used the k-centroid method with the kclust tool provided in the MMTSB Tool Set [47] and ensured that the top three clusters covered over 80% of the conformations. The structures selected as centroids of the top three clusters were rendered and visualized in PyMOL (http://www.pymol.org/pymol (accessed on 1 April 2021)).

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms24086896/s1. References [48,49,50,51,52,53,54,55,56,57] are cited in the supplementary materials.

Author Contributions

Conceptualization, J.-J.Z. and H.-F.C.; software, J.-J.Z. and N.-J.Z.; validation, J.-J.Z. and N.-J.Z.; writing—original draft preparation, J.-J.Z. and N.-J.Z.; writing—review and editing, T.W. and H.-F.C.; funding acquisition, H.-F.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The VAE codes are deposited at https://github.com/Junjie-Zhu/VAE.git (accessed on 4 March 2023).

Acknowledgments

This work was supported by the Center for HPC at Shanghai Jiao Tong University, the National Key Research and Development Program of China (2020YFA0907700), the National Natural Science Foundation of China (21977068 and 32171242), and the Fundamental Research Funds for the Central Universities (YG2023LC03).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Dunker, A.K.; Oldfield, C.J.; Meng, J.; Romero, P.; Yang, J.Y.; Chen, J.W.; Vacic, V.; Obradovic, Z.; Uversky, V.N. The unfoldomics decade: An update on intrinsically disordered proteins. BMC Genom. 2008, 9 (Suppl. 2), S1.
2. Uversky, V.N. Intrinsically disordered proteins from A to Z. Int. J. Biochem. Cell Biol. 2011, 43, 1090–1103.
3. Uversky, V.N.; Gillespie, J.R.; Fink, A.L. Why are “natively unfolded” proteins unstructured under physiologic conditions? Proteins 2000, 41, 415–427.
4. Zhao, R.; Gish, K.; Murphy, M.; Yin, Y.; Notterman, D.; Hoffman, W.H.; Tom, E.; Mack, D.H.; Levine, A.J. Analysis of p53-regulated gene expression patterns using oligonucleotide arrays. Genes Dev. 2000, 14, 981–993.
5. Singleton, A.B.; Farrer, M.; Johnson, J.; Singleton, A.; Hague, S.; Kachergus, J.; Hulihan, M.; Peuralinna, T.; Dutra, A.; Nussbaum, R.; et al. alpha-Synuclein locus triplication causes Parkinson’s disease. Science 2003, 302, 841.
6. Kolarova, M.; Garcia-Sierra, F.; Bartos, A.; Ricny, J.; Ripova, D. Structure and pathology of tau protein in Alzheimer disease. Int. J. Alzheimers Dis. 2012, 2012, 731526.
7. Jensen, M.R.; Zweckstetter, M.; Huang, J.R.; Blackledge, M. Exploring free-energy landscapes of intrinsically disordered proteins at atomic resolution using NMR spectroscopy. Chem. Rev. 2014, 114, 6632–6660.
8. Castro, T.G.; Munteanu, F.D.; Cavaco-Paulo, A. Electrostatics of Tau Protein by Molecular Dynamics. Biomolecules 2019, 9, 116.
9. Allison, J.R. Computational methods for exploring protein conformations. Biochem. Soc. Trans. 2020, 48, 1707–1724.
10. Tsuchiya, Y.; Taneishi, K.; Yonezawa, Y. Autoencoder-Based Detection of Dynamic Allostery Triggered by Ligand Binding Based on Molecular Dynamics. J. Chem. Inf. Model. 2019, 59, 4043–4051.
11. Tian, H.; Jiang, X.; Trozzi, F.; Xiao, S.; Larson, E.C.; Tao, P. Explore Protein Conformational Space With Variational Autoencoder. Front. Mol. Biosci. 2021, 8, 781635.
12. Gupta, A.; Dey, S.; Hicks, A.; Zhou, H.X. Artificial intelligence guided conformational mining of intrinsically disordered proteins. Commun. Biol. 2022, 5, 610.
13. Mu, J.; Pan, Z.; Chen, H.F. Balanced Solvent Model for Intrinsically Disordered and Ordered Proteins. J. Chem. Inf. Model. 2021, 61, 5141–5151.
14. Shen, Y.; Bax, A. SPARTA+: A modest improvement in empirical NMR chemical shift prediction by means of an artificial neural network. J. Biomol. NMR 2010, 48, 13–22.
15. Shaw, D.E.; Maragakis, P.; Lindorff-Larsen, K.; Piana, S.; Dror, R.O.; Eastwood, M.P.; Bank, J.A.; Jumper, J.M.; Salmon, J.K.; Shan, Y.; et al. Atomic-Level Characterization of the Structural Dynamics of Proteins. Science 2010, 330, 341–346.
16. Glielmo, A.; Husic, B.E.; Rodriguez, A.; Clementi, C.; Noé, F.; Laio, A. Unsupervised Learning Methods for Molecular Simulation Data. Chem. Rev. 2021, 121, 9722–9758.
17. Eguchi, R.R.; Choe, C.A.; Huang, P.S. Ig-VAE: Generative modeling of protein structure by direct 3D coordinate generation. PLoS Comput. Biol. 2022, 18, e1010271.
18. Moritsugu, K. Multiscale Enhanced Sampling Using Machine Learning. Life 2021, 11, 1076.
19. Li, C.; Liu, J.; Chen, J.; Yuan, Y.; Yu, J.; Gou, Q.; Guo, Y.; Pu, X. An Interpretable Convolutional Neural Network Framework for Analyzing Molecular Dynamics Trajectories: A Case Study on Functional States for G-Protein-Coupled Receptors. J. Chem. Inf. Model. 2022, 62, 1399–1410.
20. Zhou, D.; Zheng, L.; Xu, J.; He, J. Misc-GAN: A Multi-scale Generative Model for Graphs. Front. Big Data 2019, 2, 3.
21. Zhou, H.; Wang, W.; Jin, J.; Zheng, Z.; Zhou, B. Graph Neural Network for Protein-Protein Interaction Prediction: A Comparative Study. Molecules 2022, 27, 6135.
22. Sanyal, S.; Anishchenko, I.; Dagar, A.; Baker, D.; Talukdar, P. ProteinGCN: Protein model quality assessment using Graph Convolutional Networks. bioRxiv 2020, bioRxiv:2020.04.06.028266.
23. Xiang, S.; Gapsys, V.; Kim, H.Y.; Bessonov, S.; Hsiao, H.H.; Möhlmann, S.; Klaukien, V.; Ficner, R.; Becker, S.; Urlaub, H.; et al. Phosphorylation drives a dynamic switch in serine/arginine-rich proteins. Structure 2013, 21, 2162–2174.
24. Sgourakis, N.G.; Yan, Y.; McCallum, S.A.; Wang, C.; Garcia, A.E. The Alzheimer’s peptides Abeta40 and 42 adopt distinct conformations in water: A combined MD/NMR study. J. Mol. Biol. 2007, 368, 1448–1457.
25. Sterckx, Y.G.J.; Volkov, A.N.; Vranken, W.F.; Kragelj, J.; Jensen, M.R.; Buts, L.; Garcia-Pino, A.; Jové, T.; Van Melderen, L.; Blackledge, M.; et al. Small-Angle X-ray Scattering- and Nuclear Magnetic Resonance-Derived Conformational Ensemble of the Highly Flexible Antitoxin PaaA2. Structure 2014, 22, 854–865.
26. Wensley, B.G.; Batey, S.; Bone, F.A.; Chan, Z.M.; Tumelty, N.R.; Steward, A.; Kwa, L.G.; Borgia, A.; Clarke, J. Experimental evidence for a frustrated energy landscape in a three-helix-bundle protein family. Nature 2010, 463, 685–688.
27. Bertoncini, C.W.; Jung, Y.S.; Fernandez, C.O.; Hoyer, W.; Griesinger, C.; Jovin, T.M.; Zweckstetter, M. Release of long-range tertiary interactions potentiates aggregation of natively unstructured alpha-synuclein. Proc. Natl. Acad. Sci. USA 2005, 102, 1430–1435.
28. Case, D.A.; Ben-Shalom, I.Y.; Brozell, S.R.; Cerutti, D.S.; Cheatham, T.E., III; Cruzeiro, V.W.D.; Duke, R.E.; Giambasu, G.; Gilson, M.K.; Gohlke, H.; et al. AMBER 2018; University of California: San Francisco, CA, USA, 2018.
29. Song, D.; Liu, H.; Luo, R.; Chen, H.F. Environment-Specific Force Field for Intrinsically Disordered and Ordered Proteins. J. Chem. Inf. Model. 2020, 60, 2257–2267.
30. Jorgensen, W.L.; Chandrasekhar, J.; Madura, J.D.; Impey, R.W.; Klein, M.L. Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 1983, 79, 926–935.
31. Gotz, A.W.; Williamson, M.J.; Xu, D.; Poole, D.; Le Grand, S.; Walker, R.C. Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs. 1. Generalized Born. J. Chem. Theory Comput. 2012, 8, 1542–1555.
32. Essmann, U.; Perera, L.; Berkowitz, M.L.; Darden, T.; Lee, H.; Pedersen, L.G. A smooth particle mesh Ewald method. J. Chem. Phys. 1995, 103, 8577–8593.
33. Roe, D.R.; Cheatham, T.E., III. PTRAJ and CPPTRAJ: Software for Processing and Analysis of Molecular Dynamics Trajectory Data. J. Chem. Theory Comput. 2013, 9, 3084–3095.
34. Degiacomi, M.T. Coupling Molecular Dynamics and Deep Learning to Mine Protein Conformational Space. Structure 2019, 27, 1034–1040.
35. Ketkaew, R.; Creazzo, F.; Luber, S. Machine Learning-Assisted Discovery of Hidden States in Expanded Free Energy Space. J. Phys. Chem. Lett. 2022, 13, 1797–1805.
36. Jin, Y.; Johannissen, L.O.; Hay, S. Predicting new protein conformations from molecular dynamics simulation conformational landscapes and machine learning. Proteins 2021, 89, 915–921.
37. Wetzel, S.J. Unsupervised learning of phase transitions: From principal component analysis to variational autoencoders. Phys. Rev. E 2017, 96, 022140.
38. Song, Z.; Zhou, H.; Tian, H.; Wang, X.; Tao, P. Unraveling the energetic significance of chemical events in enzyme catalysis via machine-learning based regression approach. Commun. Chem. 2020, 3, 134.
39. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114.
40. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M. TensorFlow: A System for Large-Scale Machine Learning; USENIX Association: Savannah, GA, USA, 2016; pp. 265–283.
41. Alam, F.F.; Shehu, A. Variational Autoencoders for Protein Structure Prediction. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, Virtual, 21–24 September 2020; Association for Computing Machinery: New York, NY, USA; p. 27.
42. Alam, F.F.; Rahman, T.; Shehu, A. Evaluating Autoencoder-Based Featurization and Supervised Learning for Protein Decoy Selection. Molecules 2020, 25, 1146.
43. Guo, X.; Du, Y.; Tadepalli, S.; Zhao, L.; Shehu, A. Generating Tertiary Protein Structures via an Interpretative Variational Autoencoder. Bioinform. Adv. 2020, 1, vbab036.
44. Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat. Methods 2020, 17, 261–272.
45. McGibbon, R.T.; Beauchamp, K.A.; Harrigan, M.P.; Klein, C.; Swails, J.M.; Hernández, C.X.; Schwantes, C.R.; Wang, L.P.; Lane, T.J.; Pande, V.S. MDTraj: A Modern Open Library for the Analysis of Molecular Dynamics Trajectories. Biophys. J. 2015, 109, 1528–1532.
46. Maier, J.A.; Martinez, C.; Kasavajhala, K.; Wickstrom, L.; Hauser, K.E.; Simmerling, C. ff14SB: Improving the Accuracy of Protein Side Chain and Backbone Parameters from ff99SB. J. Chem. Theory Comput. 2015, 11, 3696–3713.
47. Feig, M.; Karanicolas, J.; Brooks, C.L., III. MMTSB Tool Set: Enhanced sampling and multiscale modeling methods for applications in structural biology. J. Mol. Graph. Model. 2004, 22, 377–395.
48. Cragnell, C.; Durand, D.; Cabane, B.; Skepö, M. Coarse-grained modeling of the intrinsically disordered protein Histatin 5 in solution: Monte Carlo simulations in combination with SAXS. Proteins 2016, 84, 777–791.
49. Hou, L.; Shao, H.; Zhang, Y.; Li, H.; Menon, N.K.; Neuhaus, E.B.; Brewer, J.M.; Byeon, I.-J.L.; Ray, D.G.; Vitek, M.P.; et al. Solution NMR Studies of the Aβ(1−40) and Aβ(1−42) Peptides Establish that the Met35 Oxidation State Affects the Mechanism of Amyloid Formation. J. Am. Chem. Soc. 2004, 126, 1992–2005.
50. Kang, L.; Janowska, M.K.; Moriarty, G.M.; Baum, J. Mechanistic insight into the relationship between N-terminal acetylation of α-synuclein and fibril formation rates by NMR and fluorescence. PLoS ONE 2013, 8, e75018.
51. Cornilescu, G.; Marquardt, J.L.; Ottiger, M.; Bax, A. Validation of Protein Structure from Anisotropic Carbonyl Chemical Shifts in a Dilute Liquid Crystalline Phase. J. Am. Chem. Soc. 1998, 120, 6836–6837.
52. Otting, G.; Liepinsh, E.; Wüthrich, K. Disulfide bond isomerization in BPTI and BPTI(G36S): An NMR study of correlated mobility in proteins. Biochemistry 1993, 32, 3571.
53. Rauscher, S.; Gapsys, V.; Gajda, M.J.; Zweckstetter, M.; de Groot, B.L.; Grubmüller, H. Structural Ensembles of Intrinsically Disordered Proteins Depend Strongly on Force Field: A Comparison to Experiment. J. Chem. Theory Comput. 2015, 11, 5513–5524.
54. Granata, D.; Baftizadeh, F.; Habchi, J.; Galvagnion, C.; De Simone, A.; Camilloni, C.; Laio, A.; Vendruscolo, M. The inverted free energy landscape of an intrinsically disordered peptide by simulations and experiments. Sci. Rep. 2015, 5, 15449.
55. Hofmann, H.; Soranno, A.; Borgia, A.; Gast, K.; Nettels, D.; Schuler, B. Polymer scaling laws of unfolded and intrinsically disordered proteins quantified with single-molecule spectroscopy. Proc. Natl. Acad. Sci. USA 2012, 109, 16155–16160.
56. Morar, A.S.; Olteanu, A.; Young, G.B.; Pielak, G.J. Solvent-induced collapse of alpha-synuclein and acid-denatured cytochrome c. Protein Sci. 2001, 10, 2195–2199.
57. Schwalbe, M.; Ozenne, V.; Bibow, S.; Jaremko, M.; Jaremko, L.; Gajda, M.; Jensen, M.R.; Biernat, J.; Becker, S.; Mandelkow, E.; et al. Predictive atomic resolution descriptions of intrinsically disordered hTau40 and α-synuclein in solution from NMR and small angle scattering. Structure 2014, 22, 238–249.

Figure 1. RMSDs and the Spearman correlation coefficients between VAE- or AE-generated IDP conformations and original ones. Dashed lines in boxplots represent mean RMSD. Here, both models are trained on 10% to 50% of conformations separately and are evaluated on the test set. (A) RS1. (B) Abeta40. (C) PaaA2. (D) R17. (E) α-synuclein.

Figure 2. Average RMSD and Spearman correlation coefficient between IDP conformations generated by VAEs of different architectures and the corresponding MD trajectories. (A) RMSDs and Spearman correlation coefficients for VAEs with different numbers of layers (1 to 5 layers, in steps of 1). (B) RMSDs and Spearman correlation coefficients for VAEs with latent spaces of different sizes (2 to 22 dimensions, in steps of 5).

Figure 3. VAE-generated structures (right) that are centroids of the top 3 clusters, compared to the conformations with the minimum RMSD in the MD trajectories (left). The N-terminus is shown in blue and the C-terminus in red. (A) RS1. (B) R17. (C) α-synuclein.

Figure 4. Chemical shifts of VAE-generated conformations and MD trajectories compared to experimental data. Experimental chemical shift data of R17 are missing, so R17 is not displayed here. (A) RS1. (B) Abeta40. (C) PaaA2. (D) α-synuclein.

Figure 5. Radii of gyration for VAE-generated conformations (green) and MD trajectories (grey) compared to experimental data (blue line; dashed lines represent the experimental error). (A) RS1. (B) Abeta40. (C) PaaA2. (D) R17. (E) α-synuclein.

Figure 6. Tests of VAEs’ performance on structured proteins. (A) RMSD between original conformations and generated ones. (B) Spearman correlation coefficient. (C) Structures and chemical shifts of BPTI. (D) Structures and chemical shifts of ubiquitin.

Figure 7. An overview of the architecture of variational autoencoders.

Table 1. Structure of AEs and VAEs, where all encoders and decoders are deep neural networks. Here, $N_{Res}$ denotes the number of residues of the input protein.

	Encoder	Latent Space	Decoder
AE	1024, 512, 256	$2/3\,N_{Res}$	256, 512, 1024
VAE	1024, 256, 64, 16	2	16, 64, 256, 1024
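The layer widths in Table 1 can be expanded into the full chain of layer sizes for a given protein. The following is a small sketch for illustration only: the function name and the `input_dim` argument are hypothetical (the table does not state the input dimension), and rounding $2/3\,N_{Res}$ to the nearest integer is an assumption for chain lengths that are not multiples of three.

```python
from typing import List


def layer_widths(model: str, n_res: int, input_dim: int) -> List[int]:
    """Return the widths of all layers, from input to reconstructed output.

    `input_dim` is a caller-supplied assumption, e.g. the number of
    flattened coordinates fed to the network.
    """
    if model == "AE":
        encoder = [1024, 512, 256]
        latent = round(2 * n_res / 3)  # rounding rule is an assumption
    elif model == "VAE":
        encoder = [1024, 256, 64, 16]
        latent = 2
    else:
        raise ValueError(f"unknown model: {model!r}")
    decoder = encoder[::-1]  # the decoders in Table 1 mirror the encoders
    return [input_dim, *encoder, latent, *decoder, input_dim]


if __name__ == "__main__":
    # RS1 has 24 residues (Table 2); input_dim=72 is purely illustrative.
    print(layer_widths("AE", n_res=24, input_dim=72))
    print(layer_widths("VAE", n_res=24, input_dim=72))
```

For a 24-residue system the AE latent space has 16 dimensions under this rounding, whereas the VAE latent space stays at 2 dimensions regardless of chain length.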

Table 2. Protein systems involved in this study. Molecular dynamics simulations were conducted at the temperature and ionic strength listed here.

Protein Systems	Length, aa	Temperature, K	Ionic Strength, mM	Time, ns
Intrinsically Disordered Proteins
RS1	24	298	150	1000
Abeta40	40	277	20
PaaA2	71	298	500
R17	100	295	108
α-synuclein	140	285.5	150
Ordered Proteins
Ubiquitin	76	298	50	1000
BPTI	58	300	50	1000

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
