In this section, the data and the experimental methodology used in this work are described. In particular,
Section 3.1 presents an overview of the GNN model, the architecture used for PPI prediction.
Section 3.2 provides a detailed description of the dataset obtained from the preprocessing phase (also described here) and of the subsequent graph generation procedure. Subsequently,
Section 3.3 shows the experimental setup chosen for the study, with specific reference to data setting, model selection, and evaluation.
Section 3.4 presents an ablation study, a series of additional experiments to further investigate the influence of specific node/edge features from the dataset on the model’s predictive performance. Finally,
Section 3.5 shows a practical use case, with a visual representation of the model predictions on a specific protein.
3.1. The GNN Model
A graph is a non-linear, non-Euclidean data structure defined as a tuple G = (N, E), where N represents the set of nodes and E is the set of edges. Nodes describe entities, while edges stand for the relationships between them. Nodes and edges can be associated with values or vectors of values describing their attributes, defined as node feature vectors and edge feature vectors, respectively.
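As a minimal illustration of this data structure (all identifiers and feature values below are hypothetical examples, not taken from the dataset), a graph with node and edge feature vectors can be sketched as:

```python
# Minimal sketch of a graph G = (N, E) with node/edge feature vectors.
# All names and numeric values are illustrative placeholders.
graph = {
    "nodes": [0, 1, 2],                # N: the set of nodes (e.g., residues)
    "edges": [(0, 1), (1, 2)],         # E: the set of undirected edges
    "node_features": {                 # one feature vector per node
        0: [0.12, 1.0, -0.5],
        1: [0.33, 0.0, 0.8],
        2: [0.91, 1.0, 0.1],
    },
    "edge_features": {                 # one feature vector per edge
        (0, 1): [3.8, 1.0],            # e.g., a distance and a bond flag
        (1, 2): [4.1, 0.0],
    },
}
```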
In this context, GNNs represent a well-established paradigm within the field of machine learning, having been successfully applied to an enormous variety of tasks involving graph-structured data [25]. They exhibit a notable capacity for directly handling relational data in graph format while minimizing information loss and offering high flexibility [26]. The initial conceptualization of GNNs appeared in 2005 [27], followed by a comprehensive mathematical formulation in 2009, which also demonstrated that GNNs are universal approximators on graphs [28]. In the original work, GNNs operated by replicating the topology of the input graph and employing a message-passing mechanism between neighboring nodes to produce outputs at specific locations within the graph. This process was achieved by means of two MLP units implementing two parametric functions: a state transition function on each node of the graph and an output function where needed. In this context, problems can be categorized based on their focus on different elements within a graph: node-focused, edge-focused, or graph-focused. In node-focused problems, the model is asked to produce an output in correspondence with each node of the graph, or a subset thereof, which can be used for classification, regression, or clustering purposes. Conversely, edge-focused problems concern tasks in which the output is produced either on each edge of the graph or on a subset of E. Finally, graph-focused tasks deal with problems in which a unique output is produced on the whole graph, and the goal is to predict a property of, or to cluster, the complex object represented by the graph. In this latter case, the output is generally calculated on the nodes and then aggregated by sum or average operations to obtain a single output for each graph.
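A minimal sketch of one round of message passing may help fix ideas. The update below is a simplified mean-aggregation rule, not the learned MLP transition function of the original formulation [28]:

```python
import numpy as np

def message_passing_step(node_states, edges):
    """One simplified message-passing step: each node's new state is the
    mean of its own state and its neighbors' states. This is a toy update;
    the original GNN uses a learned MLP transition function instead."""
    new_states = {}
    for n, state in node_states.items():
        # Collect neighbors on either endpoint of each undirected edge.
        neighbors = [v for (u, v) in edges if u == n] + \
                    [u for (u, v) in edges if v == n]
        messages = [node_states[m] for m in neighbors]
        new_states[n] = np.mean([state] + messages, axis=0)
    return new_states

# Toy 3-node path graph with 2-dimensional node states.
states = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0]), 2: np.array([1.0, 1.0])}
edges = [(0, 1), (1, 2)]
states = message_passing_step(states, edges)
```

Iterating such a step lets information flow across the graph, so that a node-focused output on a residue is influenced by progressively larger neighborhoods.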
3.2. Dataset Description
The first step in building the dataset was to search for biological data in the Protein Data Bank in Europe (PDBe), leveraging the Protein Interfaces, Surfaces and Assemblies (PISA) service (all the information is available at https://www.ebi.ac.uk/pdbe/pisa/, accessed on 12 April 2024), an interactive tool provided by the PDBe database to study the stability of macro-molecular complexes (proteins, DNA/RNA, and ligands). This stability is essentially regulated by certain physicochemical properties, such as the free energy of complex formation, hydrogen bonds, hydrophobic specificity, salt bridges across interfaces, solvation energy gain, and interface area. This information is used by the PISA service to analyze a given structure and to make predictions about a potential complex. It is possible either to search for a single protein by indicating its PDB identification code or to download entire databases. In the latter case, some parameters are available to filter the data, such as the type of interaction, keywords, the presence of salt bridges or disulfide bonds, and the symmetry number. To collect the dataset for the study, generic dimers characterized by a protein–protein interaction were retrieved from the PDBePISA database, for a total of 46,523 protein samples.
To guarantee biological significance, some criteria were enforced, such as a symmetry constraint and the presence of only two interacting protein molecules at each interface. Subsequently, each sample was processed with the Graphein 1.7.0 Python package [29] to build the corresponding residue-based graph. Graphein is designed to streamline computational analysis for biomolecular structures, offering functionalities for constructing graph and surface mesh representations of various biomolecules, including proteins, nucleic acids, and small molecules. Additionally, Graphein facilitates the retrieval of structural data and chemical information from prominent bioinformatics databases such as the Protein Data Bank and ChEMBL. Therefore, a single graph was produced for each protein, with nodes representing the amino acid residues and edges representing the chemical bonds and forces between them, as depicted in
Figure 1.
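The core idea of residue-graph construction can be illustrated with a simplified, self-contained sketch (this is a stand-in, not Graphein’s actual API, and the Cα coordinates below are made up): residues become nodes, and an edge is added when two residues are spatially close.

```python
import math

# Hypothetical C-alpha coordinates (in Å) for a short peptide.
ca_coords = {
    "A1": (0.0, 0.0, 0.0),
    "G2": (3.8, 0.0, 0.0),
    "L3": (7.6, 0.0, 0.0),
    "K4": (20.0, 0.0, 0.0),
}

def build_residue_graph(coords, cutoff=8.0):
    """Connect two residues when their Calpha-Calpha distance (Å) is below
    `cutoff`. A simplified stand-in for Graphein's edge-construction
    functions, which also add chemical-bond edges between residues."""
    edges = {}
    residues = list(coords)
    for i, u in enumerate(residues):
        for v in residues[i + 1:]:
            d = math.dist(coords[u], coords[v])
            if d < cutoff:
                edges[(u, v)] = d  # edge feature: distance in Å
    return edges

edges = build_residue_graph(ca_coords)
```

In the actual pipeline, Graphein augments such distance-based edges with bond-type information and attaches the node feature vectors described below.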
Nodes and edges are associated with feature vectors describing their attributes. In particular, residue-level features—each residue is represented by its alpha carbon (Cα) atom—are collected into a vector describing some physicochemical properties, such as the position vector in three-dimensional space, the accessible surface area, the residue torsion angles, and the DSSP [30], Meiler [31], and ExPASy [32] descriptors. A more detailed description is available at the link in the Supplementary Materials. On the one hand, the Meiler representation [31] is a simplified model employed in computational biology and bioinformatics to depict the structural characteristics of amino acids. The method portrays each amino acid as a collection of geometric shapes approximating its backbone and side chain atoms, enabling the efficient analysis and modeling of protein structures and interactions in computational studies while reducing complexity. On the other hand, the ExPASy service [32], developed by the Swiss Institute of Bioinformatics (SIB), provides general non-residue-specific features. It derives amino acid descriptors along the primary sequence of the protein, such as hydrophobicity, polarity, and molecular weight. DSSP [30] assigns each residue to a secondary structure element (SSE), such as α-helices, β-strands, and turns, by examining their hydrogen bonds and backbone geometries. This represents a standardized approach for characterizing protein secondary structures, with a total of 8 SSE classes, including the special class random coil, a conformation indicating the absence of a regular secondary structure.
A generic edge between two nodes is represented by a feature vector quantifying the distance between their Cα atoms (measured in angstroms, Å) and describing the type of bonds established between the residues. These are represented as a non-categorical vector in which a non-null value indicates the presence of a specific type of bond (e.g., multiple atom-level bonds between two residues are considered as parallel edges between residue-level nodes, and are thus condensed into a single undirected edge). The ground truth for the classification task is obtained from PDBePISA, which defines a generic residue as interacting based on its solvation energy value. In particular, the solvation energy gain of the interface, expressed in kcal/mol, is calculated as the difference in solvation energies of all residues between the dissociated and associated (interfacing) structures. Therefore, a solvation energy equal to zero corresponds to non-interfacing residues, and a solvation energy different from zero corresponds to interfacing residues, i.e., those located within the interfaces on the protein surface.
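This labeling rule is simple enough to state as code. The sketch below (with made-up residue names and energy values, in kcal/mol) derives the binary ground truth from per-residue solvation energy gains:

```python
def interface_labels(solvation_energy_gain):
    """Binary ground truth per residue: a residue is labeled as
    interfacing (1) when its solvation energy gain is non-zero, and
    non-interfacing (0) otherwise. Keys and values are illustrative."""
    return {res: int(dg != 0.0) for res, dg in solvation_energy_gain.items()}

# Hypothetical solvation energy gains for four residues.
labels = interface_labels({"A1": 0.0, "G2": -1.3, "L3": 0.7, "K4": 0.0})
```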
To test the predictive capabilities of the GNN model and gain some insight into chain interactions within proteins, three alternative datasets were considered. The first dataset, which will be referred to as the Whole (protein) dataset, is composed of undirected graphs representing proteins in their entirety. All the interacting chains, possibly interconnected to let information flow from one chain to another, are considered here; this dataset is composed of 31,979 protein samples. Labels were associated with each node, representing the locations of all the interfaces within the protein at once. As shown in
Figure 2, from a single protein graph belonging to this dataset, two other sets of objects can be extracted.
Slightly increasing the resolution of the dataset, in the second setting, each pair of interacting chains belonging to a generic protein was considered as a single sample. Each chain of the pair includes all its residues. In this context, repeated instances of the same chain can occur when it interacts simultaneously with two or more chains within the protein. Given that the interface location changes for each pair of interacting chains, the label associated with each node will be different. This dataset will be referred to as the Interface dataset, since a single graph represents the interaction between two single chains within the same protein; it is composed of 73,427 interface graphs. In the last setting, further increasing the resolution of the graphs, each sample is represented by a single chain of residues, so, for a generic protein, only one instance of each chain can occur in the produced graphs. This dataset will be referred to as the Chain dataset; it is composed of 87,375 chain graphs, each of which includes all of its residues. More details about the datasets are reported in
Table 2.
By treating a protein as a collection of coupled chains (i.e., using the Interface dataset), it is possible to focus the analysis on such critical interaction sites. This approach allows for a deeper understanding of how different protein components come together to form functional complexes. Identifying these interfaces is crucial for deciphering the molecular mechanisms underlying various cellular processes, including signal transduction, gene regulation, and immune responses. Conversely, the Chain dataset considers each chain within a protein independently, disregarding inter-chain interactions. While this simplifies the analysis, it provides insights into the individual behavior and properties of each chain within the protein ensemble. Eventually, considering proteins in their entirety, as in the Whole (protein) dataset, without breaking them down into coupled or independent chains, is essential for understanding their overall structure, function, and behavior within biological systems. As a consequence, each dataset is investigated to answer a different biological question. The Chain dataset can provide an answer to the general question: which amino acids belong to interface regions in this peptide chain? This question is investigated by looking only at the characteristics of the amino acids and the structure of the peptide chain that they belong to. Thanks to the properties of GNNs, parts of the chain closer to the amino acid under analysis will be more influential in the decision than other regions. The Interface dataset is instead ideal for answering the question: which amino acids play a role in the PPI between two peptide chains of the same protein, thus contributing to the formation of its quaternary structure? In this scope, the binary classification of amino acids is conditioned on the particular PPI under analysis.
The decision is based on the amino acid features and the structure of the chain, as well as on the specific structure of the interacting chain and the reciprocal position of the structures. Finally, the Whole (protein) dataset may answer the question: which amino acids play a role in PPIs between the protein complex that they belong to and other protein complexes? The decision is based on the features of the amino acid under analysis and the quaternary structure that it is a part of, additionally influenced by the characteristics of the other interacting structures and the relative positioning of the two protein complexes.
3.3. Experimental Setup
In this work, PPI prediction is modeled as a node-focused binary classification task using a GNN. The goal is to determine whether a generic node (residue) belongs to a protein interface or not. To test the predictive capabilities of the GNN model and evaluate its performance, experiments were carried out on the three different datasets:
Whole, Interface, and
Chain, described in
Section 3.2. To maintain consistency across datasets, a seed was employed to split data into (non-overlapping) training, validation, and test sets in all the experiments in order to allocate the same proteins to the same sets in all three settings. As an example, the protein 3C3P shown in
Figure 1 was assigned to the test set, and so were its extracted components, as described in
Figure 2. Generally, in ML applications, the training set refers to the data used to train the model, and hence to adapt its parameters to the task at hand. The validation set is commonly used during the learning process for early stopping procedures, so as to avoid overfitting, i.e., when the model loses its generalization capabilities and specializes on the training data. Finally, the test set represents a never-before-seen set of data used for the evaluation of the model performance.
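The seeded, protein-consistent split described above can be sketched as follows (the seed, split fractions, and protein IDs other than 3C3P are illustrative): assigning each protein identifier deterministically guarantees that a protein and all its extracted interface and chain graphs end up in the same partition across the three datasets.

```python
import random

def split_protein_ids(protein_ids, seed=42, train=0.8, val=0.1):
    """Deterministically assign protein IDs to train/validation/test sets.
    Applying the same seeded split to the Whole, Interface, and Chain
    datasets keeps each protein (and its extracted components) in the
    same set. Seed and fractions here are illustrative choices."""
    ids = sorted(protein_ids)              # fixed order before shuffling
    random.Random(seed).shuffle(ids)       # reproducible shuffle
    n_train = int(train * len(ids))
    n_val = int(val * len(ids))
    return {
        "train": set(ids[:n_train]),
        "val": set(ids[n_train:n_train + n_val]),
        "test": set(ids[n_train + n_val:]),
    }

split = split_protein_ids(["3C3P", "1ABC", "2XYZ", "4DEF", "5GHI",
                           "6JKL", "7MNO", "8PQR", "9STU", "1VWX"])
```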
Usually, in a binary classification task, the performance of the model is measured in terms of accuracy; this metric, however, is not reliable on unbalanced datasets such as the ones considered in this work. Hence, several other widely used metrics are also considered to evaluate the quality of the predictors, namely balanced accuracy, recall, precision, F-score, and AUC (area under the ROC curve), in addition to accuracy. The different outcomes of predictions made by a generic model compared to the actual ground truth are described by the true positive, true negative, false positive, and false negative instances. On the one hand, a true positive (TP) occurs when the model correctly identifies an instance belonging to the positive class, while a true negative (TN) ensues when it correctly identifies an instance belonging to the negative class. On the other hand, a false positive (FP) arises when the model mistakenly classifies a negative instance as positive, and a false negative (FN) occurs when it incorrectly labels a positive instance as negative. These distinctions are crucial for evaluating the accuracy and reliability of classification models, providing insights into their performance and potential areas for improvement. Therefore, the evaluation metrics correspond to the following:
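The display equations appear to have been lost during extraction; assuming the standard formulations in terms of TP, TN, FP, and FN (with the F-score given here as the usual harmonic mean, i.e., F1), the metrics read:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN},
\\[4pt]
\mathrm{F\text{-}score} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad
\mathrm{Balanced\ accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right).
```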
Observe that the F-score can be interpreted as a weighted average of the precision and the recall, providing a more balanced measure in the case of unbalanced datasets. In addition, balanced accuracy is defined as the average of the recall values obtained on each class. Finally, the AUC is the integral of the ROC curve and is equal to 1 for an ideal classifier and 0.5 for a random one, where the ROC curve corresponds to the plot of the true positive rate against the false positive rate at each threshold value. To find the best-performing architecture, an extensive grid-search procedure was carried out on the GNN hyperparameters; the search space, as well as the best architecture for each setting in terms of F-score, is reported in
Table 3.
All the models were implemented with GNNKeras [33]—a Keras-based library for GNN recurrent architectures—and trained for 500 epochs with a cross-entropy loss function, an Adam optimizer, and an early stopping procedure on the validation set. Moreover, since the ratio of negative to positive examples in the datasets is quite high, a fraction of this ratio was considered in the grid search as a weight for positive examples to balance the learning process. Note that the training set is perfectly balanced when the weight equals the full negative/positive ratio (i.e., the total number of positive examples weighs as much as the total number of negative ones, so a single positive example is more important at training time than a negative one), while a weight equal to 1 means no balancing policy is used, so a positive example is as important as a generic negative example.
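The weighting scheme can be sketched as follows (function name, counts, and the scaling parameter are illustrative; the actual weight values explored in the grid search are those reported by the paper):

```python
def positive_weight(n_negative, n_positive, fraction=1.0):
    """Weight applied to positive examples in the cross-entropy loss.
    `fraction` scales the negative/positive ratio: fraction = 1.0 makes
    the total positive mass equal the total negative mass (perfect
    balancing), while a resulting weight of 1 means no balancing at all.
    The counts used below are hypothetical."""
    return fraction * (n_negative / n_positive)

w_balanced = positive_weight(n_negative=900, n_positive=100)  # ratio 9:1
w_none = 1.0  # no balancing: positives and negatives weigh the same
```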
The best architectures in terms of F-score were further investigated in an ablation study, which evaluates the impact of systematically removing a set of node/edge features from the datasets on the model’s performance in identifying protein interaction sites. This helps in determining which features are most informative for accurate predictions, guiding the refinement of predictive models for protein–protein interface identification. The procedure is described in detail in
Section 3.4.
3.4. Ablation Study
To evaluate the contribution of the different feature components, an ablation study was carried out on the best architecture for each dataset setting. In particular, node features were grouped into four sets (3D position, Meiler, ExPASy, and DSSP components), while edge features were grouped into three sets (distance, weak chemical bonds, and strong chemical bonds). Experiments were conducted so as to eliminate one node/edge feature group at a time from the datasets and evaluate the performance of the model in the absence of that group. The performance gap exhibited by the new model gives an estimate of the importance of the features that were not considered during the learning process. The results of the ablation study on the three different datasets are reported in
Table 4,
Table 5 and
Table 6.
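The bookkeeping of the ablation loop is straightforward and can be sketched as below. The `train_and_evaluate` callable is a hypothetical stand-in for the full training/evaluation pipeline; here a dummy evaluator that merely counts the retained groups is used for illustration.

```python
# Feature groups as defined in the ablation study.
NODE_GROUPS = ["3d_position", "meiler", "expasy", "dssp"]
EDGE_GROUPS = ["distance", "weak_bonds", "strong_bonds"]

def run_ablation(train_and_evaluate):
    """Drop one feature group at a time and re-evaluate the model on the
    remaining groups. `train_and_evaluate` is a placeholder for the real
    training/evaluation pipeline."""
    results = {}
    all_groups = NODE_GROUPS + EDGE_GROUPS
    for dropped in all_groups:
        kept = [g for g in all_groups if g != dropped]
        results[dropped] = train_and_evaluate(kept)
    return results

# Dummy evaluator: returns the number of feature groups actually used.
results = run_ablation(lambda kept: len(kept))
```

In the real study, each entry of `results` would hold the evaluation metrics of a model retrained without the corresponding feature group, and the gap with respect to the full model quantifies that group’s importance.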
Interestingly, the worst model performance is obtained when the DSSP component is excluded at training time, i.e., when no information about the secondary structure elements (SSEs) is provided to the models. In this context, although the recall score is comparable to that obtained in the previous study, the missing information causes a general drop in performance across all the other metrics. Observe that the F-score is significantly affected, suffering a substantial drop in the worst scenario compared to the original model. This outcome is in line with expectations based on biological data, since the backbone torsion angles between adjacent amino acid residues in a protein chain are included among the SSE features. These angles determine the local conformation of the protein backbone, which affects its 3D structure and, consequently, the functional form of the protein. Variations in dihedral angles directly cause a change in the arrangement, affecting the folding pattern of the protein, which leads to the formation of the secondary structures. Additionally, the backbone torsion angles play a crucial role in determining the accessibility of amino acid residues to the surrounding solvent, hence impacting the protein’s accessible surface area and, as a result, its interfaces.