In this study, protein complex detection is described as the optimization of an objective function. In the FCM algorithm, the selection of the initial clustering center significantly affects clustering performance. Hence, the improved DPC algorithm was proposed to determine the initial clustering center and number in the FCM algorithm. Then, protein complexes were mined by using the FCM algorithm. Meanwhile, the FCM algorithm involves considerable parameters, and manual parameter adjustment is tedious and costly. Therefore, an adaptive parameter-adjusting algorithm was proposed in this study to optimize parameters in the FCM algorithm. Substeps of the proposed DFPO algorithm will be described below. In this study, the detection of the protein complex is described as the optimization of an objective function. In the FCM algorithm, the selection of the initial clustering center significantly affects clustering performance. Hence, the improved DPC algorithm was proposed to determine the initial clustering center and number in the FCM algorithm. Then, protein complexes were mined by using the FCM algorithm. Meanwhile, the details of the DPC-FCM algorithm are shown in Algorithm 1.
2.3.1. Preprocessing of PPI Network
A data sample in PPI networks is a set of protein interactions instead of a discrete point. Hence, a PPI network is developed to serve as the input of the proposed detection algorithm. Since discrete data are the input required by the subsequent detection algorithm, while the data in the input PPI network dataset are graph data, it is necessary to preprocess the input PPI network to facilitate the subsequent processing of the detection algorithm.
After modeling and visualizing, the PPI network can be regarded as a Graph
G, which comprises protein aggregate V and edge set
E, and each maximal connected subgraph in the graph is highly likely to be the protein complex to be detected. First, an adjacent matrix with the size of
was established to mark the presence of an edge between the two proteins (1 if an edge is present and 0 if no edge is present). The content of column
is the density peak score of each protein, which can be calculated by the method described in
Section 2.3.2. Upon establishment of the adjacent matrix, if the value at the ith (
) row and the
jth (
) column was 0, no edge is present between the
ith protein and the
jth protein; if the value was 1, an edge is present between the
ith protein and the
jth protein. The vector generated by the
ith row is the vector of the
ith protein and is used to calculate and process subsequent detection algorithms.
2.3.2. Improved DPC Algorithm
The improved DPC algorithm was developed based on the traditional DPC algorithm to determine the clustering center and cluster number in the improved FCM detection algorithm. The data entered in this paper are graph data, which differ from the discrete data used by DPC; thus, data representation conversion is required. The FCM and automatic parameter optimization algorithms also require continuous iterations, resulting in massive time consumption. Hence, the proposed algorithm shall accurately identify the clustering center and cluster number to reduce time consumption and facilitate subsequent algorithm optimization.
In DPC, the local density of each point and its distance from the closest point with a more significant local density were calculated, and the point with a more significant density and longer distance was employed as the clustering center. Similarly, the local density of each point was calculated in this study, and this local density was regarded as the initial score for this point. However, graph data were used in the proposed algorithm, and the determination of local density is different from that in the original DPC algorithm. First, traverse the neighbor protein of this sample point to verify the presence of a protein with a higher local density. If so, this neighbor protein is more suitable as a clustering center. In other words, the effect of this sample point as a clustering center is slightly worse. Hence, the initial score of this point can be reduced, and the reduced result can be used to calculate the final protein score. Based on this score, parameters are set, the distance is determined, and an appropriate clustering center is selected.
- (1)
Calculation of weights of interacting edges
The weight of interacting edges in the PPI network was calculated to calculate protein density based on this weight. Indeed, this weight considers four attributed information of proteins, including subcellular localization information, gene co-expression data information, functional annotation information, and shared neighbor information of two interacting proteins. In this way, interacting edges were weighted and used as the score of the reliability of interacting edges.
In Equation (
9),
represents each protein’s subcellular localization attribute weight. Two proteins that have various attributes of the same location are readily exposed to interaction and are defined in Equation (9):
where
defines the number of subcellular localization attributes shared by the two proteins,
defines the number of subcellular localization attributes of protein
i, and
defines the number of subcellular localization attributes of protein
j.
As the weight of gene expression information of two interacting proteins,
can describe the co-expression similarity of two proteins, which was calculated based on the person correlation coefficient and is defined in Equation (10):
describes the topological structure similarity of two interacting proteins, meaning that the interaction similarity of the two proteins is depicted by the number of neighbor nodes shared by the two proteins, and is defined in Equation (11):
where
denotes the number of neighbor nodes shared by the two proteins,
denotes the number of neighbor nodes the
ith protein, and
denotes the number of neighbor nodes of the
jth protein.
contains a series of functional annotation attributes of proteins. The
values of the two proteins were vectorized, and the cosine similarity of the two protein vectors was calculated to determine the interaction similarity of two interacting proteins, which is defined in Equation (12):
The average weight was obtained by combining the four calculation methods mentioned above, and the method was employed for the calculation of weights of interacting edges in the proposed algorithm and is defined in Equation (13):
- (2)
Calculation of the local density of sample points
In a specific maximal connected subgraph, one protein sample point was selected, as well as the sum of weights of all edges of neighbor proteins adjacent to this protein sample point (see Equation (14)). Then, the sum was divided by the number of all possible edges to obtain the local density of the protein sample point. In other words, the total weight of the edges of the protein sample point and its neighbor proteins was assigned to all possible edges, which is consistent with the definition of density in Equation (15):
where
denotes the weight of each edge in the current neighbor subgraph,
denotes the sum of weights of all edges in the neighbor subgraph consisting of the protein sample point and its direct neighbor proteins, and
denotes the number of proteins in the current neighbor subgraph.
- (3)
Calculation of scores of protein sample points
In the traditional DPC algorithm, a protein sample point with a density more significant than that of the protein sample point and the minimum distance shall be identified after obtaining the local density of the protein sample point, and the distance between the two protein points shall be calculated. The clustering center shall be determined from local density and distance perspectives. However, the calculation of the distance between two protein nodes in the proposed algorithm is complicated as the data used are graph data. Additionally, successive identification of qualified protein points and distance calculation in the case of a dataset containing various protein sample points could be time-consuming. As the FCM and adaptive parameter-adjusting algorithm would be considerably time-consuming, it is essential to propose an appropriate method to reduce time costs.
As discussed, the selection of a clustering center is closely related to the local density. Indeed, a protein with high local density is suitable as the clustering center. However, various proteins in one dense subgraph may have large local densities, and ’low coupling’ would be violated if all of them were used as clustering centers. Hence, a protein with higher local density shall be identified in this dense subgraph as the clustering center. In the proposed algorithm, the distance of each protein pair is not calculated. Hence, all proteins that are mutually adjacent nodes may be selected as the clustering centers, which is irrational if a threshold is used as the criteria for selecting clustering centers (proteins with a density greater than the threshold are selected as clustering centers). It is proposed in this study that for the current protein sample point (
), the presence of a neighbor protein (
) with a higher local density suggests that
is more unsuitable as a clustering center in this dense subgraph compared with
. It has been confirmed that proteins with high local density are suitable as clustering centers. Therefore, it can be deduced that a protein unsuitable for a clustering center is supposed to have a low local density. As a result, the probability of a protein being selected as a clustering center can be reduced by reducing its local density. In this study, the local density of sample point (
) obtained in Step 1 was employed as the initial score of xi and successively compared with the initial scores of its neighbor proteins. If a protein with a higher initial score is found, the score of the current protein is reduced, and the score obtained after traversing all neighbor nodes is the final score of the current protein (
), as defined in Equation (16):
where
is the initial score and
is a parameter reflecting the scaling factor of the density score.
was employed as a parameter of the adaptive parameter-adjusting algorithm and optimized.
- (4)
Selection of the clustering center
was set to be a parameter for selecting the clustering center. Specifically, protein sample points with scores more significant than
were regarded as clustering centers, and it was determined whether these proteins were suitable as clustering centers. For protein
, which has been selected as a clustering center,
, which is the next protein to be selected as a clustering center, cannot be selected as a clustering center if it has an over-small distance from
(<1.0 in this case), in order to ensure that the distance between two clustering centers is not over-small. The low coupling clustering algorithm would eventually select several clustering centers. These protein points would be used as initial clustering centers in the FCM algorithm, and the number of these protein points would be used as an input parameter. The distance of two protein centers can be calculated by Equation (17):
2.3.3. Improved FCM Algorithm Combining the Improved DPC Algorithm
Despite its good performance in tackling specific clustering issues, traditional FCM algorithms exhibit some defects. For instance, the selection of initial points significantly affects the results. Specifically, good clustering performance can be expected in rational selection of random initial points; if random initial points are not appropriately selected (i.e., several initial points at edges), the clustering performance would not be as expected. Additionally, the clustering center number selection significantly affects the algorithm’s performance. This study combined the proposed improved DPC algorithm with the FCM algorithm. Specifically, the clustering center was determined by the improved DPC algorithm, and the number of FCM algorithms for reclustering was used as the initial clustering center.
Additionally, traditional FCM algorithms rely on determining the objective function based on the product of membership and Euclidean distance of the two protein points. For this reason, traditional FCM algorithms only consider the high cohesion state within each community structure; that is, the points in the same community structure are closely connected, while the fact that points in different community structures are sparsely connected is not considered. This study proposes a new novel objective function for clustering so that both high cohesion and low coupling are considered.
- (1)
Determination of the initial clustering center and number of the FCM algorithm
As mentioned in
Section 2.3.2, clustering center and number were determined using the improved DPC algorithm and used as parameter input of the FCM algorithm. Herein, the clustering center set and the number of clustering centers were set to be the initial clustering center and initial cluster number, respectively. This method is superior to the FCM algorithm with a randomly initialized clustering center and number in terms of clustering performance.
- (2)
Design of the objective function
Traditional FCM algorithm considers ‘high cohesion’ but not ‘low coupling’ of different categories. In this study, the design of the objective function would integrate ‘high cohesion, low coupling’ with the detection of the protein complex to optimize an objective function. For ‘high cohesion’, the product of membership and Euclidean distance of the two points (see Equation (6)) was used for calculation. For ‘low coupling’, the following method is proposed: for
, the closest clustering center protein (
) was identified and the offset direction of
can be obtained by subtracting the vector of
from the vector of
. In this way, the distances between clustering centers are large enough to meet the requirements of ‘low coupling’, and it is defined in Equation (18):
where
K represents the number of proteins as clustering center,
represents the vector of current clustering center protein,
represents the vector of the closest clustering center protein, and
represents the distance from
to
.
As a function designed to realize high cohesion in clustering,
J should be as small as possible; as a function designed to realize high cohesion in clustering,
A should be as large as possible, and it can realize low coupling in different clusters. Additionally, an over-large distance between two clustering centers may lead to cases where marginal points of edges serve as clustering centers, resulting in poor clustering performances. In this study, the weights of
J were set as one and
A is set as a value less than 1 in the design of objective function to achieve improved clustering performance, and is defined in Equation (19):
where
is a parameter less than one, and it denotes the weight of
A in the objective function.
as one of the parameters was adjusted in the adaptive parameter-adjusting algorithm.
- (3)
Determination of the category of each protein
In traditional clustering of discrete points, the membership matrix of each protein and all clustering centers was determined, while the clustering center where the maximum membership is located was selected; the membership matrix was classified into the community where the clustering center is located. This is not the case for PPI networks. Indeed, each protein in the PPI network may participate in the generation of multiple protein complexes. In other words, each protein can belong to multiple protein complexes, meaning that this protein could be overlapping. First, the weights of each protein belonging to different clustering centers were calculated according to the membership matrix. Then, all clustering centers were sorted in descending order according to this weight. After that, clustering centers were placed at the top of the list according to the weight, and this protein was added to the protein complexes represented by these clustering centers. Herein, this number was set to be , which reflects the number of protein complexes to which each protein belongs. Clustering of the input PPI network was conducted using the improved FCM algorithm according to the objective function to obtain protein complexes finally.
2.3.4. Adaptive Parameter Adjusting Algorithm
The details of the Adaptive Parameter Adjusting Algorithm are shown in Algorithm 2.
- (1)
Fitness function
Inspired by the ABC algorithm, we proposed an adaptive parameter-adjusting algorithm PO (Parameter Optimization). First, an appropriate fitness function was designed and selected. Each protein complex’s weighted modularity score (modularity) was calculated [
26], and the sum of weighted modularity scores of all protein complexes in the detected protein complexes was determined. Then, the sum of the weighted modularity score was divided by the number of detected protein complexes to obtain the fitness of parameter adjustment in the DFPO algorithm.
First, the sum of weights of all edges in each protein complex
s (
) was calculated. For proteins in the current protein complex
s, if a neighbor node that does not belong to the current protein complex
s is present, the sum of weights of all edges connecting neighbor proteins and the current protein complex
s (
) was calculated. The weighted module score of the current protein complex
s can be obtained by dividing the former by the sum, as shown in Equation (20):
Finally, weighted module scores of all detected protein complexes were determined and divided by the number of protein complexes, and the obtained average modularity score of protein complex was used as the fitness of the DFPO algorithm (
M) is defined in Equation (21):
where
defines the sum of weights of all edges in one detected protein complex,
defines the sum of weights of all edges of proteins in one detected protein complex and neighbor proteins in this detected protein complex, and
N defines the number of detected protein complexes.
Algorithm 2: Adaptive parameter-adjusting algorithm |
- Input:
The weighted PPI network, ; Initial identified protein complexes, ; = 5; = 0.04; = 0.045; = 1.9; = 1; = 0.01; = 0.05; = 0.005; = 3; = 7; = 0.01; = 0.3; = 0.8; = 1.5; = 0.02; = 0.06; = 100; = 200; - Output:
Identified protein complexes, ;
- 2:
while do if then - 4:
= np.random.uniform(0, 8); if < 2 then - 6:
else if < 4 then - 8:
paramdensity = paramdensity + np.random.uniform else if < 6 then - 10:
else - 12:
end if - 14:
else - 16:
- 18:
end if - 20:
if then = 1 - 22:
end if current identified protein complexes (), = DPC-FCM Algorithm 1 () - 24:
if then - 26:
- 28:
else - 30:
- 32:
- 34:
end if i = i + 1; - 36:
end while Obtain the finally identified protein complexes (), = DPC-FCM Algorithm 1 (); - 38:
return Finally Identified protein complexes, .
|
- (2)
Adaptive parameter-adjusting algorithm
As an SIO algorithm, the ABC algorithm categorizes bees during honey collection as foragers, observers, or scouts. Its objective is to identify the nectar source with the most materials. After determining the fitness of each nectar source, foragers would continue to search for new sources based on the greedy strategy. Then, observers would select one nectar source according to the probability. A new nectar source would be acquired by random disturbance at this nectar source, accompanied by a calculation of its fitness. If observers fail to find a better source after multiple trials, the forager who locates this source will become a scout to find a new one.
The proposed DFPO algorithm involves identifying new nectar sources by observer disturbance, as in the ABC algorithm, and global optimization to avoid local optimal solutions. As shown in
Figure 3, there are six steps: (1) the iteration times of the adaptive parameter-adjusting algorithm were determined and used as termination conditions of the proposed algorithm. The iteration times shall be sufficient to make the results convincing. Meanwhile, the FCM algorithm would also iterate multiple times to search for an optimal solution as the improved DPC algorithm identifies clustering centers for the FCM algorithm. Therefore, the iteration times of the FCM algorithm can be a manageable size, as an optimal solution can be obtained after several iterations, and further iterations, which are time-consuming, can barely identify better results. Overall, the iteration times were set to be 100 in this study to balance the time and performance. (2) A constant
was set, and a random number (
) was generated within the designated range. If
, local parameter optimization was executed; otherwise, global parameter optimization was executed. Meanwhile,
decreased as the iteration times increased so that it is guaranteed that the probability of identifying an optimal solution in the global scope decreases during the entire process of parameter optimization. In other words, we hope that the adaptive parameter-adjusting algorithm can first identify an appropriate parameter in the global scope to determine an approximate range before adjusting each parameter to identify a more suitable parameter combination. (3) Global parameter optimization refers to the random disturbance of all parameters in the DFPO algorithm. In local parameter optimization, a random number (
) of 0∼8 is generated: if
, the first parameter
is disturbed; if
, the second parameter
is disturbed; if
, the third parameter
is disturbed; if
, the fourth parameter
is disturbed. After parameter disturbance in each iteration, the DFPO algorithm’s fitness was calculated using this parameter combination. (4) The fitness of the DFPO algorithm was recalculated for parameter combination after each disturbance. Then, the old fitness was substituted by the new fitness, and the new parameter combination substituted the old parameter combination if the new fitness was larger than the old fitness; otherwise, no steps were taken. (5) Iterations are terminated if the iteration times meet the termination conditions; otherwise, iterations are continued. (6) Clustering of the input PPI network was executed with the optimized parameter combination as the input parameter in the DFPO algorithm to obtain the detected protein complex set.