1. Introduction
Proteins are highly complex systems. Structural features of proteins, such as surfaces and internal cavities, directly determine their behavior and biological activity. The function of a macromolecule often requires that a small molecule or ion be transported through these structures, so transport pathways play an essential role in the functioning of a large number of proteins. Their study is important for (i) uncovering the principles of life [1], (ii) developing new drugs [2], and (iii) industrial applications [3]. Identifying cavities in complex protein structures is a difficult challenge, and many software tools have been developed for this purpose, e.g., CAVER [4], MOLE [5], MolAxis [6], ChExVis [7], or BetaCavityWeb [8]. All these tools are based on computational geometry methods that employ Voronoi diagrams. A set of cavities computed by CAVER and visualized using the software PyMOL [9] can be seen in Figure 1.
Because the properties of transport pathways change significantly over time in molecular dynamics simulations, the analysis of an ensemble of structures, in contrast to a single static structure, is needed for their understanding [4,10]. With current computational capacity, it has become affordable to obtain molecular dynamics (MD) trajectories up to microsecond time scales. This trend requires new approaches to explore the resulting large data sets, as it becomes impracticable to inspect such simulations frame by frame.
Characteristics of individual cavities, such as geometry, physico-chemical properties, and dynamics, are instrumental for identifying and analyzing cavities [4]. Existing cavity representations focus on geometry; they capture only a few topological properties and are not sufficient to directly support the analysis of the topological properties of molecular cavities. To fill this gap, we propose a vectorized representation based on cavity lining-atom sequences, which allows topological attributes to be incorporated when analyzing very large sets of conformations containing hundreds of thousands of cavities. By analyzing the dynamic patterns of molecular cavities, one can infer which amino acids or atoms play a key role in the different patterns of variation that cause differences in cavity shape and bottleneck. This may be helpful for understanding molecular function or developing new drugs.
Our approach builds on several aspects of previous work on molecular cavity analysis and distributed representations. We have divided the related work into two sections accordingly.
1.1. Molecular Cavity Computation and Analysis
Many methods have been developed for cavity extraction and clustering. Almost all algorithms for the extraction of cavities simplify the molecule using the hard-sphere model. Krone et al. [11] classified such methods into four main categories according to the computational method used for cavity detection: methods based on grids, Voronoi diagrams, molecular surfaces, and probes. The first algorithm to detect cavities in molecules was grid-based; it is not suitable for larger structures due to hardware limitations. The Voronoi diagram has proved [12] to be suitable for the detection of paths in molecules and is able to process large molecules as well.
The Voronoi diagram method can also be used for processing molecular dynamics trajectories. The protein is often simplified by its atom positions or the hard-sphere model. Moreover, the edges of Voronoi diagrams automatically provide geometrically optimal molecular paths under the given restrictions. Because the lining-atom sequence is easy to obtain from such a diagram, a Voronoi-based method is adopted to extract cavities in our work.
To estimate the distance between two pathways, a method was proposed that evaluated the distance between each ball of the first pathway and its closest ball in the second pathway. To accelerate the computation of all pairwise pathway similarities, Chovancova et al. [4] designed a more efficient algorithm. The basic idea is to represent each pathway by a sequence of N points and then compute only the distances between pairs of points with the corresponding order, thus performing only N distance computations for each pair of pathways.
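This ordered-point scheme can be illustrated with a minimal sketch, assuming both pathways have already been resampled to the same number N of centerline points (the function name and coordinates below are hypothetical, not taken from CAVER):

```python
import math

def pathway_distance(path_a, path_b):
    """Average distance between corresponding points of two pathways.

    Each pathway is resampled to the same number of ordered 3D points,
    so only N (rather than N*N) distance computations are needed per pair.
    """
    assert len(path_a) == len(path_b), "pathways must be resampled to the same N"
    total = 0.0
    for p, q in zip(path_a, path_b):
        total += math.dist(p, q)  # Euclidean distance between corresponding points
    return total / len(path_a)

# Two toy pathways sampled at N = 3 points, offset by 1 along the y axis:
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
b = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]
print(pathway_distance(a, b))  # 1.0
```

With precomputed correspondences, the all-pairs similarity matrix over many pathways becomes linear in N per pair instead of quadratic.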
For clustering pathways, two main methods are currently used. Clustering of pathways in CAVER 3.0 [4] was based on a precalculated distance matrix; average-link hierarchical clustering [13] was used to construct a tree hierarchy of pathway axes based on their pairwise distances. MOLE 1.2 [14] offers the functionality most similar to the work presented in this paper. It employs clustering to find the correspondence between cavities from different conformations, computing the similarity of cavities by comparing the sets of atoms lining them. Experiments revealed that this clustering depends on the ordering of the cavities. Moreover, the clustering of cavities performed by MOLE 1.2 [14] was not able to clearly separate cavities into clusters corresponding to known transport pathways. To improve the clustering results, we propose a vectorized representation based on molecular cavity lining-atom sequences, which adds topological characteristics.
1.2. Distributed Representation
Distributed representation has proven to be one of the most successful approaches in machine learning [15,16]. The premise of this approach is to encode and store information about an item within a system through its interactions with the other members. Distributed representation was originally inspired by the structure of human memory, where items are stored in a "content-addressable" fashion [17].
Distributed vector representations have recently been established in natural language processing (NLP) as an efficient way to represent semantic/syntactic units, with many applications. In this model, known as word2vec, each word is embedded as a vector in an n-dimensional space. Similar words have close vectors, where similarity is defined in terms of both syntax and semantics. The basic idea behind training such vectors is that the meaning of a word is characterized by its context, i.e., its neighboring words; thus, words and their contexts are considered positive training samples. Such vectors can be trained on large amounts of textual data in a variety of ways, e.g., with neural network architectures such as the Skip-gram model [16]. Interesting patterns were observed when training word vectors with Skip-gram on natural language: similar words can have multiple degrees of similarity. Using a word-offset technique, in which simple algebraic operations are performed on the word vectors, it was shown, for example, that vector("King") − vector("Man") + vector("Woman") results in a vector that is closest to the vector representation of the word Queen [18].
The Word2Vec algorithm can be applied to various NLP tasks, including sentiment analysis, entity linking, and machine translation, and it has also been applied in a variety of other fields [19]. Berger et al. [20] designed a visualization scheme named Cite2Vec to support the dynamic exploration of document collections. Based on an analogy between transportation elements and NLP terms, the Word2Vec model can also be employed to quantify implicit traffic interactions among roads [21]. Similar attempts have been made in the biomedical field to represent data in a continuous vector space. ProtVec [17] applies Word2Vec to protein sequences to obtain distributed representations of 3-gram amino acid sequences: the protein sequence is first split into 3-grams, each having a biological significance and regarded as a "word", and the Word2Vec algorithm is then run using Skip-gram. Seq2Vec [22] extends ProtVec's approach by applying not only the sequential 3-gram elements of ProtVec, but also a Doc2Vec representation that embeds the sequence itself directly. Dna2Vec [23] generalizes the 3-gram structure of ProtVec and Seq2Vec to k-grams. Another method, SNP2Vec [24], embeds single SNPs into a continuous space by using a denoising autoencoder [25] and Diet Networks. Mut2Vec [26] generates distributed representations of mutations, whose efficacy was validated experimentally; these representations can be used in various deep learning applications such as cancer classification and drug sensitivity prediction.
Similarly, in the field of molecular cavity analysis, a cavity sequence can be encoded by its lining atoms. To the best of our knowledge, Word2Vec and Doc2Vec have not previously been used to represent molecular cavity sequences. We propose a novel pipeline to generate distributed representations of topological features for the characterization of molecular cavities. These embeddings are low-dimensional and can greatly simplify downstream modeling. We encode cavity lining atoms as words and lining-atom sequences as sentences; a Word2Vec model is then applied to obtain a distributed representation of cavity lining-atom sequences. As an illustration, we specifically tackle visualization and cavity classification problems.
This paper is organized as follows. In Section 2, we elaborate on the method used in the experiments. In Section 3, we present the experimental results. Finally, we provide a discussion and conclusions in Section 4.
3. Results
To explore the semantic properties of cavities, a clustering view is used to show the differences among clusters in the high-dimensional cavity embedding space. Clustering results are conventionally displayed in two-dimensional coordinates, so the high-dimensional embeddings must be reduced to two dimensions to reveal the structure of the clusters. Here, the t-SNE method is used for the clustering view.
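Using scikit-learn, this reduction step can be sketched as follows (the random matrix is only a stand-in for the real cavity embeddings; the dimensions and perplexity are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy stand-in for the cavity embedding matrix
# (the real data set is 8052 trajectories x embedding dimension).
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(100, 32))

# Reduce to 2D for the clustering view; perplexity must be < n_samples.
coords = TSNE(n_components=2, perplexity=10, random_state=42).fit_transform(embeddings)
print(coords.shape)  # (100, 2)
```

The resulting 2D coordinates are what the clustering view plots, with one dot per cavity trajectory.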
The attributes of the clustering can be observed in the clustering view. From the quality of the clustering, such as the compactness of clusters and the distances between clusters (cluster overlap), we can judge whether the distributed representation of the space is reasonable, and thereby explore the cluster relationships of the high-dimensional cavity embedding space. As Figure 5 shows, 8052 cavity trajectories are embedded in the 2D view; the closer two dots are, the more similar their attributes. We can intuitively see that there are hundreds of clusters, and the clustering result of CAVER Analyst 2.0 [28] contains 246 clusters. It can therefore be concluded that the vectorized representation of the cavities retains the semantic information, and semantically similar cavities lie closer together in the t-SNE clustering view.
Two cavities are considered similar if at least some of their lining atoms are the same and their topological properties are similar. After training the model, we measure the similarity between two cavities by calculating the cosine similarity of their vectors. Given two cavity vectors, $A$ and $B$, the cosine similarity, $\cos(\theta)$, is represented using a dot product and magnitudes as follows:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}},$$

where $A_i$ and $B_i$ are the components of vectors $A$ and $B$, respectively. The resulting similarity ranges from $-1$, meaning exactly opposite, to $1$, meaning exactly the same, with $0$ indicating that the vectors are uncorrelated and in-between values indicating intermediate similarity or dissimilarity.
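A minimal pure-Python version of this cosine similarity, assuming dense embedding vectors of equal length:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (|A| |B|) for two cavity embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (exactly the same direction)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (exactly opposite)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (uncorrelated)
```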
For the selection of semantically similar cavities, the previously defined cosine similarity is used to find the most similar cavities surrounding a query in the cavity embedding space. Here, we explore whether semantically similar cavities are also similar in attributes and geometry. The number of cavities is quite large; if all cavities were chosen for visualization and analysis, the work would be heavy and repetitive. Therefore, we choose some typical cavities to sample the data. First, the cavity with id 2 is chosen as an example, and the cosine similarity of all cavities to it is calculated. To explore the relationship between the semantic similarity of cavities and their attributes and geometric structures, different thresholds are set on the cosine similarity: one selects similarities larger than 0.9, the other similarities between 0 and 0.1. With the first threshold, 13 of the 8052 cavities are selected; with the second, 9 cavities are selected.
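This threshold-based selection can be sketched with NumPy (the random matrix stands in for the 8052 real cavity embeddings; the query id and thresholds mirror the example above):

```python
import numpy as np

def similarity_to_query(embeddings, query_id):
    """Cosine similarity of every cavity embedding to one query cavity."""
    q = embeddings[query_id]
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q)
    return embeddings @ q / norms

# Toy embedding matrix: one row per cavity (real data: 8052 rows).
rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 8))

sims = similarity_to_query(emb, query_id=2)
high = np.flatnonzero(sims > 0.9)                   # highly similar cavities
low = np.flatnonzero((sims > 0.0) & (sims < 0.1))   # barely related cavities
print(high, low)
```

The query cavity itself always has similarity 1, so it appears in the high-similarity set and is excluded or kept depending on the analysis.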
The length–width view is used to explore the relationship between cavity semantic information and cavity properties. As shown in Figure 6A, the length–width view takes the length as the X axis and the radius of the cavity as the Y axis. This representation can help users comprehend the entire cavity from the active site to the outer surface, particularly the positions of bottlenecks. Comparing Figure 6A with Figure 7A reveals many differences. The cavity with id 2 and its high-similarity cavities in Figure 6A share a similar structure; the trajectories of the various cavities are roughly the same, and this view clearly shows the consistency of the cavity changes over time. In contrast, Figure 7A, with low similarity, presents a chaotic picture: at some points on the X axis, some of the cavities have their widest width while others have their narrowest.
To represent the cavities intuitively, the 3D geometric structure view is used to show the true locations of the cavities, where clear distribution characteristics can be seen (Figure 6B and Figure 7B). For the cavities selected with a similarity threshold larger than 0.9 (Figure 6B), the geometric structure and spatial distribution are consistent. In Figure 7B, the cavities selected with a threshold between 0 and 0.1 differ in geometry: there are branches.
Through the comparative analysis above, we find that semantically similar cavities are also similar in attributes and geometry. By comparing the different thresholds for the cavity with id 2, it can be concluded that the larger the threshold, the greater the similarity of the various properties. In this paper, we only consider the cavity structure information (analogous to the semantics of text) when training; we do not consider the actual spatial position. If the similarity threshold is set too low, cavities with low semantic similarity may lie far apart in actual space, as Figure 7B shows.
To analyze clusters of semantically similar cavities, we need to cluster the cavity trajectories. Hierarchical clustering is applied to the 8052 cavity trajectories represented by the vectorization, yielding 330 clusters. Exploratory analysis of the clustering results reveals valuable information, such as the main amino acid distribution at the molecular cavity bottleneck. Further analysis shows that cavities within a given cluster share similar topological properties, but their real positions are not the same. This is because our vectorized cavity representation does not take the actual position of the cavity into account.
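A sketch of hierarchical clustering on vectorized trajectories, using scikit-learn's average-linkage `AgglomerativeClustering` (toy data below; the real input is the 8052 trajectory vectors, and the cluster count is illustrative):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated toy groups of trajectory vectors (30 each, 8-dimensional).
rng = np.random.default_rng(0)
vecs = np.vstack([rng.normal(loc=c, scale=0.1, size=(30, 8)) for c in (-1.0, 1.0)])

# Average linkage, as in the CAVER-style pathway clustering discussed above.
labels = AgglomerativeClustering(n_clusters=2, linkage="average").fit_predict(vecs)
print(np.bincount(labels).tolist())  # [30, 30]
```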
Clustering can also be used to analyze how the pattern of a certain molecular cavity changes over time. To improve the accuracy of the clustering, we use cavity data that have already been classified according to actual position as the input. Cluster1 (from the CAVER clustering results), the set of trajectories of a certain cavity over time, is taken as an example; it contains 500 cavity trajectories. To judge how many patterns exist in cluster1, the t-SNE dimensionality reduction method was applied. From the t-SNE view in Figure 8, it can be seen that the cavities should be grouped into three clusters. The K-means method was therefore used to classify cluster1 into three clusters.
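The K-means step can be sketched as follows (toy vectors stand in for the 500 trajectories of cluster1; three well-separated groups mimic the three patterns suggested by the t-SNE view):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the trajectory vectors of cluster1:
# three separated groups of 50 vectors each, 8-dimensional.
rng = np.random.default_rng(1)
vecs = np.vstack([rng.normal(loc=c, scale=0.2, size=(50, 8)) for c in (-2.0, 0.0, 2.0)])

# k = 3, chosen from the number of groups visible in the t-SNE view.
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(vecs)
print(sorted(np.bincount(labels).tolist()))  # [50, 50, 50]
```

On the real data, the three resulting label sets correspond to cluster00, cluster01, and cluster02 discussed below.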
To better analyze the clustering results, we designed several charts. For example, Figure 9 is the length–width view of cluster1, cluster00, cluster01, and cluster02; it shows how the radius of the cavity varies with the length from the active site. Figure 10 is the 3D geometric structure view of cluster00, cluster01, and cluster02, from which the actual location and geometry of each cavity in the protein can be seen. From Figure 9 and Figure 10, it can be seen that our clusters are well organized and show a good clustering result. Cluster1 mainly contains two major behavior patterns: one comprises cluster00 and cluster01, the other is cluster02. Cluster02 differs significantly from cluster00 and cluster01, and the bottleneck of cluster1 is mainly reflected in cluster02. If only the bottleneck of the cavity needs to be analyzed, attention can be focused on cluster02; using our method can thus reduce the amount of work required to find the cause of the cavity bottleneck. If users need to examine the differences between the clusters, it is necessary to introduce the amino acid map.
To further analyze the amino acid composition of the cavities in each pattern, the type and number of amino acids contained in each cavity of cluster1 are shown in a bar view. As Figure 11 shows, the amino acid distribution of cluster02 for HID is significantly different from that of the other two clusters. After screening, the amino acids that cause the cavities to differ can be identified (Figure 12): the number of the amino acid HID in cluster02 is half that of cluster00 and cluster01, and the number of TRP in cluster01 is half that of cluster00 and cluster02. We can speculate that the differences in the patterns of cavity change are caused by these differing amino acids.
This experiment demonstrates that our method can maintain the topological characteristics of the cavities and can find the change patterns of a given cavity.
4. Discussion and Conclusions
In this paper, a distributed representation method is proposed for protein cavity data. This method uses the word embedding technique to represent cavity paths, providing a new perspective for analyzing sequences of cavities. Compared with traditional methods based on data statistics, this method has two advantages: (i) It provides an abstract representation of molecular cavities, which facilitates operations such as querying, comparison, and classification, and allows the discovery of hidden features and patterns that traditional methods may miss; in addition, because the cavities are represented as word vectors, the representation is easy to combine with other techniques such as machine learning. (ii) It uses molecular cavity data to establish connections between the cavity lining atoms; these connections capture the topological features of the cavity, enabling researchers to discover cavity change patterns that they otherwise might not find.
To explore and verify the usefulness of our method, we used a real dataset to demonstrate its effectiveness. The experiments show that semantically similar cavities are also similar in attributes and geometry, that molecular cavities can be clustered by semantic similarity, and that the change patterns of different cavities can be found.
The method is based on the molecular cavity lining-atom sequence. Emphasis was put on the connections between atoms, while properties of the cavity itself, such as its actual position and width, were ignored, which can cause errors. In the future, our work will focus on combining other properties of the molecular cavity to enhance the representational ability. We also intend to extend the work toward a more general visual analysis system for druggability analysis, which is an interesting direction for future work.