1. Introduction
In the proteomics era, various high-throughput experimental techniques and computational methods have produced enormous amounts of protein interaction data [1], which have contributed to predicting protein function [2,3] and detecting protein complexes from protein–protein interaction (PPI) networks [4]. Prediction of protein complexes can help us understand principles of cellular organization and the biological functions of proteins [5,6,7]. A PPI network can be modeled as an undirected graph, where nodes represent proteins and edges represent interactions between proteins. Proteins usually interact with others as a complex to perform their biological functions in cells, such as DNA replication, transcription and protein degradation [8,9,10], so protein complexes usually appear as dense subgraphs in PPI networks.
Graph clustering [11] is an unsupervised learning technique that groups the nodes of a graph into clusters based on its edge structure, so that there are many edges within each cluster and relatively few between clusters. Clusters in a PPI network are highly interconnected, dense regions that may represent complexes; thus, identifying protein complexes is similar to finding clusters in a graph. Various graph clustering algorithms have been developed to identify protein complexes using the information encoded in the network topology. In general, these methods can be classified into two types, global methods and local methods, according to whether they produce clusters based on a whole view or a partial view of the graph topology.
Global approaches exploit the global structure information of networks. Girvan and Newman proposed the GN algorithm [12] to partition a network by iteratively removing the edges with the highest edge betweenness. The Markov clustering algorithm (MCL) [13,14] starts from an initial flow matrix and identifies complexes by simulating stochastic flows between nodes in PPI networks. Spectral clustering methods [15] construct a similarity graph from the initial PPI network and then determine clusters based on spectral analysis of the similarity graph. Most global methods partition networks into non-overlapping subgraphs and assign all nodes of a subgraph to one cluster. These methods enable identification of all relevant modules within a PPI network, so they can achieve robust and effective performance for protein complex detection. However, global methods are computationally expensive and limited to relatively small PPI networks [16].
Local clustering methods identify protein complexes by considering local neighborhood information in PPI networks instead of global information. A simple strategy is to enumerate all highly connected subgraphs whose density exceeds a specified threshold. The Clique Percolation Method (CPM) [17] finds k-clique communities as unions of all k-cliques that can be reached from each other through a series of adjacent k-cliques. The CFinder method [18] implements this approach and is currently used for complex detection in PPI networks. Clustering-based on Maximal Cliques (CMC) [19] identifies maximal cliques as candidate clusters and then post-processes highly overlapping cliques to generate the final clusters. However, since searching all maximal cliques in a network is an NP-hard problem, these algorithms are computationally expensive; furthermore, they cannot provide satisfactory coverage. To improve computational efficiency, algorithms utilizing local expansion and optimization have been proposed, often classified as “greedy” or “graph growing” algorithms [20]. Most of them start by selecting a highly ranked node as a seed and then expand the seed into a densely connected group of nodes based on a local benefit function; researchers often call these “seed expansion methods”. The Molecular Complex Detection (MCODE) algorithm [21] is one of the classical seed expansion methods that can identify densely connected clusters in PPI networks. It first weights all nodes by their k-core neighborhood density as a local network density, and then expands from the highest-weighted node by adding nodes whose vertex weight percentage (VWP, the weight percentage away from the weight of the seed vertex) is above a given threshold. The weighting scheme of MCODE boosts the weight of densely connected nodes. For a node v, MCODE computes the VWP value of v to check whether v is part of the cluster being considered. The VWP value of a node reflects its neighborhood density relative to that of the seed of the current cluster. However, the VWP value might not be an exact measure of the closeness between a node and the current cluster.
The DPClus algorithm [22] defines the “cluster periphery” of a node with respect to a cluster to address this issue. DPClus first weights an edge by the number of common neighbors between its two ends, and then weights a node as the sum of the weights of the edges incident to it. For a node v, its “periphery” with respect to a cluster C is defined as the ratio of the number of nodes in C adjacent to v to the average link number of nodes in C. However, the “periphery” value only considers the connections between node v and cluster C, without taking into account the neighborhood density of node v itself.
DPClus first chooses the node with the highest weighted degree as a seed that forms an initial cluster. The weighted degree of a node is the sum of the weights of all its adjacent edges, where an edge weight is measured by the number of common neighbors of the interacting proteins. The node weight thus reflects the local density in the node’s immediate neighborhood through the number of triangles on it. Then, DPClus iteratively augments the initial cluster by adding nodes as long as the density and cluster property of the cluster stay above user-defined thresholds.
Based on the observation that many protein complexes have a small diameter and average node distance, IPCA [23] modifies DPClus by considering the subgraph diameter and an interaction probability. The interaction probability of a node to a subgraph is defined as the number of edges between the node and the subgraph normalized by the total number of nodes in the subgraph; it is similar to the cluster property and closely related to subgraph density. The node weighting measure and seed selection strategy are identical to DPClus. For weighted networks, SPICi (Speed and Performance In Clustering) [24] was proposed to handle the computational complexity of clustering large PPI networks. It builds clusters greedily, starting from local seeds with high weighted degree and repeatedly adding the adjacent unclustered node with the highest support score that maintains the density of the cluster. The cluster expansion approach of SPICi is simpler than that of DPClus, and the output is a set of disjoint dense subgraphs.
The study of protein complexes using affinity purification and mass spectrometry [25] suggests that major protein complexes contain a core, in which proteins have relatively many interactions among themselves, and that each attachment protein binds to a subset of the core proteins to form a complex. Based on this observation, ICSC [4] starts with a subgraph as a seed and then greedily adds nodes to find dense subgraphs. Its definition of the closeness of a node to a subgraph is the same as the interaction probability used in IPCA. Other algorithms in this category include Core [26], COACH [10], GC-Coach [27] and WPNCA [28]. Meanwhile, some proteins interact with only one hub protein within a few complexes, which exhibit star-like structures in PPI networks [29,30].
PPI networks obtained from high-throughput biological experiments are noisy, with false positive interactions. Taking the reliability of protein interactions into account, some efforts identify protein complexes using the topology of PPI networks [31,32]. To obtain more robust clustering, several computational approaches detect protein complexes from PPI networks by integrating gene ontology (GO) annotation [33,34], genomic data [35] and other sources.
Various graph clustering approaches use different clustering criteria to find local dense subgraphs and work well in detecting protein complexes from PPI networks. The local seed expansion method is among the most successful strategies for overlapping graph clustering [36]. However, such algorithms still have some limitations: (1) they measure the representability of a node to a cluster using only the density of the subgraph induced by the node and its immediate neighborhood; (2) given a graph with weighted nodes, the clusters are sensitive to the choice of the starting node [20]. Existing seeding strategies usually select the node with the highest weight as the starting node (seed) of a cluster, without any process to adjust the centers of clusters, which leads to a lack of diversity; (3) the existing closeness (interaction probability) of a node to a cluster considers only the candidate node’s density or the connections between the candidate node and the cluster.
In this article, we address the above limitations and propose a new seed-expansion graph clustering algorithm (SEGC) that produces overlapping clusters for protein complex detection. It consists of three main phases: node weighing, seed selection and cluster expansion. In the node weighing stage, SEGC combines different structural attributes of nodes and improves the representability of nodes to a larger local neighborhood by an iterative weighing method, which makes it adaptable to different networks. To enhance the diversity of the proposed algorithm, roulette wheel selection is used to choose the seed nodes of potential clusters. In the cluster expansion phase, a new closeness measure is proposed that considers the influence of the connections between a candidate node and a cluster on both the cluster and the candidate node. We apply this clustering algorithm to several PPI networks of Saccharomyces cerevisiae. The results show that SEGC outperforms other algorithms under full coverage in terms of both F-measure and accuracy against a real benchmark protein complex data set.
2. Preliminary
A protein–protein interaction (PPI) network can be represented by a graph G = (V, E) with node (protein) set V and edge set E that contains the edges (interactions) of the graph G. We consider only simple undirected graphs, which contain no self-loops or multiple edges. Let n = |V| be the number of nodes and m = |E| be the number of edges. We denote an edge in G as an unordered pair (u, v) or uv, where u, v ∈ V. A graph H = (V(H), E(H)) is called a subgraph of G if V(H) ⊆ V and E(H) ⊆ E, denoted as H ⊆ G. The diameter of a subgraph H is the largest length of a shortest path between any two nodes in H, written as D(H). An induced subgraph G[S] is a graph whose node set is S ⊆ V and whose edge set consists of all of the edges in E that have both endpoints in S. We write [S] to denote the subgraph induced by node subset S when no confusion arises.
Table 1 lists the main symbols used in this paper.
Let l be a nonnegative integer. A path of length l from u to v in G is a sequence of l edges e_1, ..., e_l of G for which there exists a sequence x_0 = u, x_1, ..., x_l = v of mutually distinct nodes such that e_i has, for i = 1, ..., l, the endpoints x_{i−1} and x_i. We denote a path by its node sequence. The distance d(u, v) of u and v is the length of a shortest path between u and v in G, i.e., a path between them with the minimum number of edges.
The open neighborhood (or simply neighborhood) of a node v, denoted N(v), is the subgraph induced by all nodes adjacent to v. The closed neighborhood N[v] is defined in the same way but also includes v itself. Unless otherwise stated, we also use N(v) (or N[v]) to represent the corresponding node set.
The 1-neighborhood of a given node v is its neighborhood N(v), and the k-neighborhood of v can then be defined as N_k(v) = {u ∈ V : 1 ≤ d(u, v) ≤ k}, where d(u, v) denotes the distance between u and v.
The degree DC(v) of a node v is the number of elements of N(v), i.e., DC(v) = |N(v)|. The degree DC(H) of a node subset H is the sum of the degrees of the nodes of H, i.e., DC(H) = Σ_{v∈H} DC(v).
The goal of traditional graph clustering is to group the nodes of a given input graph into p disjoint clusters (subgraphs) C_1, ..., C_p such that ∪_{i=1}^{p} V(C_i) = V and V(C_i) ∩ V(C_j) = ∅ for i ≠ j. For the problem of overlapping clustering in complex detection, the goal is to find clusters C_1, ..., C_p such that ∪_{i=1}^{p} V(C_i) = V while V(C_i) ∩ V(C_j) is allowed to be nonempty. A protein complex is usually abstracted as a connected subgraph in a PPI network, so graph clustering is natural for protein complex detection. Here, graph clustering finds clusters within a given graph rather than clustering between graphs.
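These definitions map directly onto an adjacency-set representation. A minimal sketch in Python (the helper names are ours, for illustration only):

```python
# Minimal adjacency-set helpers for the notions defined above.
# Names (build_adj, degree, subset_degree, induced_edges) are illustrative.

def build_adj(edges):
    """Adjacency sets of a simple undirected graph (self-loops discarded)."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # no self-loops in a simple graph
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def degree(adj, v):
    """DC(v) = |N(v)|."""
    return len(adj[v])

def subset_degree(adj, H):
    """DC(H) = sum of DC(v) over v in H."""
    return sum(degree(adj, v) for v in H)

def induced_edges(adj, S):
    """Edge set of the induced subgraph G[S]."""
    S = set(S)
    return {frozenset((u, v)) for u in S for v in adj[u] & S}

edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
adj = build_adj(edges)
```

Because each edge is stored in both endpoints' sets, DC(H) counts edges leaving H as well as edges inside H, matching the definition above.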
3. Method
3.1. Algorithm Overview
We propose a new graph clustering algorithm based on a seed-expansion approach (SEGC) to detect protein complexes using network topology attributes only. It consists of three main phases: node weighing, seed selection and cluster expansion. In the node weighing stage, we compute the weights (i.e., representability) of nodes by a new metric. In the seeding phase, roulette wheel selection is used to pick nodes with higher weights as seeds, with probability proportional to their weights. In the expansion phase, we expand the seeds into dense subgraphs as clusters based on a newly defined closeness measure (see Equation (8)). A cluster is found by executing seed selection and cluster expansion once; the seed of the next cluster is selected among the nodes that have no cluster assignment yet. We do not remove any clustered node or edge, keeping the original input graph complete. SEGC ensures that every node in the PPI network is assigned to at least one predicted complex, and it can produce overlapping clusterings, which means that some nodes may be attached to more than one cluster.
3.2. Node Weighing
In graph clustering, how to measure the representability of a node to a cluster from the connections between nodes is a key issue. Let w(v) be the weight of a node v, usually computed from local information within the subgraph induced by N[v]. A node with higher w(v) is more representative of the subgraph N[v]. The most basic centrality measure is degree centrality (DC), based on the observation that hub nodes usually have more edges [24,37]. There should be good clusters around high-degree nodes in real-world networks with a power-law degree distribution. However, a high degree alone is not enough to reflect a node’s representability to a cluster [36,38]. In addition, existing node importance metrics are mainly based on structural information within a node’s direct neighborhood only. A good node weighing measure should reflect the importance of a node in a larger neighborhood around it.
We propose a new node weight vector W to overcome the above shortcomings. It not only integrates topological attribute information of nodes and edges, but also captures the importance of a node v within the k-neighborhood of v (i.e., {v} ∪ N_k(v)) through k iterations, where k is a predefined parameter. A larger k means that the weight of node v represents the information of a larger neighborhood around it. Given the attribute matrix of the n nodes, with one row of q attribute values per node, and the weight coefficient vector λ of the attributes, the node weight vector in the i-th (1 ≤ i ≤ k) iteration is defined as W_i = A_{i−1} λ, where A_{i−1} is the attribute matrix of the nodes in the (i − 1)-th iteration, whose elements we describe in detail in the following.
Let the weight of an edge e = (u, v) be the number of common neighbors between the two ends of e, that is, w(e) = |N(u) ∩ N(v)|. In order to reflect the importance of a node v more comprehensively, we consider three basic attributes when calculating the weight of node v in this paper: DC(v), the degree of node v; DC(N(v)), the degree of the direct neighbors of v; and Σ_{u∈N(v)} w((u, v)), the sum of the weights of its incident edges. These three attributes reflect not only the degree information of the node itself, but also the neighborhood information around the node. For convenience, we initialize the weight of a node as its degree, i.e., w_0(v) = DC(v). The elements of the attribute vector of node v in the i-th (1 ≤ i ≤ k) iteration are then its previous weight w_{i−1}(v), the sum of its neighbors’ previous weights Σ_{u∈N(v)} w_{i−1}(u), and its incident edge weight sum Σ_{u∈N(v)} w((u, v)).
Since the significances of the attributes mentioned above are quite different from each other, we use the weight coefficient vector λ = (λ1, λ2, λ3) to weigh each attribute; the number of elements in λ equals the number of attributes used in the calculation of node weights. Then, we have

w_i(v) = λ1 w_{i−1}(v) + λ2 Σ_{u∈N(v)} w_{i−1}(u) + λ3 Σ_{u∈N(v)} w((u, v)),   (5)

where λ1 + λ2 + λ3 = 1.
The first item of Equation (5) denotes the centrality information of the node itself in the i-th iteration. The second item reflects the centrality information of its adjacent nodes in the i-th iteration. The third item adds the weights of its incident edges to the centrality information of node v; if these edge weights are relatively high, then node v might be a meaningful starting point for local module searches in functional networks, similar to [24].
From the definition of the node weight, it also follows that nodes with higher weight are more representative of their local topological neighborhood. The number of iterations determines the range that the node weight can reflect. For example, in the first iteration, the node weight reflects the direct neighborhood, including the node’s adjacent nodes and incident edges; in the i-th iteration, the node weight reflects the i-neighborhood of node v. If i equals the diameter of the graph G, then w_i(v) measures the centrality of node v over the whole network.
Since λ1 + λ2 + λ3 = 1, the node weight defined in Equation (5) is a convex combination of the three attributes. If λ = (1, 0, 0), we have w_i(v) = DC(v), and the representability of a node is determined only by its degree. If λ = (0, 1, 0), the representability of a node is determined by the degrees within the i-th neighborhood of node v. If λ = (0, 0, 1), the weights of the direct edges of a node are the key to measuring local importance; in this case, the node weight equals Σ_{u∈N(v)} |N(u) ∩ N(v)|, which is the same as the weight defined by DPClus [22].
The linear combination of the three parts above makes the representability of nodes to a subgraph more complete. As shown in Figure 1, two nodes lie in the same complete subgraph and have the same degree. However, one of them lies in a more important position, since its 2-neighborhood includes some nodes of another dense subgraph. From Equation (5) with weight coefficient vector (0.2, 0.6, 0.2), that node obtains a larger weight after the iterations and therefore has better representability when k = 2.
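The iterative weighting of Equation (5) can be sketched compactly; the code below assumes the edge weight w(e) = |N(u) ∩ N(v)| and the initialization w_0(v) = DC(v) given above (the function name is ours):

```python
# Iterative node weighing following Equation (5):
# w_i(v) = l1*w_{i-1}(v) + l2*sum of neighbors' w_{i-1} + l3*sum of incident edge weights,
# with w_0(v) = DC(v) and edge weight = number of common neighbors.

def node_weights(adj, lam, k):
    l1, l2, l3 = lam  # weight coefficients, l1 + l2 + l3 = 1
    # Incident-edge weight sum per node (third attribute, fixed over iterations).
    edge_term = {v: sum(len(adj[v] & adj[u]) for u in adj[v]) for v in adj}
    w = {v: len(adj[v]) for v in adj}  # w_0(v) = DC(v)
    for _ in range(k):
        w = {v: l1 * w[v]
                + l2 * sum(w[u] for u in adj[v])
                + l3 * edge_term[v]
             for v in adj}
    return w
```

With λ = (1, 0, 0) the weights stay equal to the degrees, and with λ = (0, 0, 1) they coincide with the DPClus node weight, matching the special cases discussed above.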
3.3. Seed Selection
The seed of a cluster should be highly representative of the cluster, which indicates that the weight of the seed node should be relatively larger than those of the other nodes in the cluster. However, the node with the largest weight might not always be the best choice of seed for the considered cluster. In order to improve the diversity of seed selection, SEGC uses a roulette wheel to select seeds probabilistically: the probability of a node v being selected as a seed is proportional to its weight, p(v) = w_k(v) / Σ_{u∈U} w_k(u), where U is the set of candidate (unclustered) nodes. The larger the weight, the larger the probability that the node will be selected as a seed.
At the beginning, our algorithm picks some node v as a seed and extends it to a cluster C(v) using the cluster expansion process described in the next section. Once the cluster C(v) is obtained, we select the seed node for the next cluster. This seed should be away from the existing seeds in order to reduce the generation of redundant clusters; hence, nodes in existing clusters are no longer selected as seed nodes, although every node may still become a member of other clusters to form overlapping clusters. Thus, we choose seed nodes by roulette wheel among the unclustered nodes, i.e., those not yet included in any predicted cluster. The entire procedure terminates when there are no unclustered nodes left.
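Roulette wheel selection over the unclustered nodes is a standard fitness-proportional draw; a sketch (the function name is ours):

```python
import random

def roulette_select(weights, unclustered, rng=random):
    """Pick one seed among the unclustered nodes with probability
    proportional to its weight (roulette wheel selection)."""
    nodes = list(unclustered)
    total = sum(weights[v] for v in nodes)
    if total == 0:
        return rng.choice(nodes)  # degenerate case: uniform pick
    r = rng.uniform(0, total)
    acc = 0.0
    for v in nodes:
        acc += weights[v]
        if r <= acc:
            return v
    return nodes[-1]  # guard against floating-point round-off
```

Passing a seeded `random.Random` instance makes a stochastic run reproducible, which matters when the best of several clusterings is kept.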
3.4. Cluster Expansion
After obtaining a seed node v, we extend it to a cluster C(v), which initially consists of only the node v. The candidate node set for the current C(v) is N(C(v)), the neighbors of C(v). For a candidate node u, we use the adjacent nodes of u in C(v) to determine the priority of whether u can be added to C(v). We take into account both the proportion of N(u) ∩ V(C(v)) in the node set of C(v) and the proportion of N(u) ∩ V(C(v)) in the neighborhood of u. The priority of a candidate node u to cluster C(v) is defined as follows:

NC(u, C(v)) = α |N(u) ∩ V(C(v))| / |V(C(v))| + (1 − α) |N(u) ∩ V(C(v))| / |N(u)|,   (8)

where 0 ≤ α ≤ 1.
NC(u, C(v)) measures how strongly a node u is connected to a cluster C(v). For a dense cluster, a node connects to most of the nodes in the cluster; for nodes lying on the sparse periphery of a cluster, most of their neighbors are in the cluster. The first item of Equation (8) represents the effect of the size of the current cluster C(v), where V(C(v)) is the node set of the subgraph C(v) grown from seed node v. The second item represents the effect of the size of the neighborhood of u. The priority of a candidate node u to a cluster C(v) is thus positively correlated with the number of adjacent nodes of u in C(v), negatively correlated with the number of nodes in C(v), and negatively correlated with the size of the neighborhood of u.
The parameter α in Equation (8) controls the priority of u to C(v) during the expansion process. When α > 0.5, the first item of Equation (8) plays the determining role in NC(u, C(v)). We might obtain a relatively dense cluster with a larger α, since we give preference to nodes with more connections to the current cluster; conversely, we might obtain a sparse cluster with a smaller α, since we give preference to nodes with a low degree. A cluster should be denser around its seed and may be less dense away from the seed, so we set a larger α at the beginning of the expansion. As the number of nodes in the cluster increases, we decrease α so that nodes lying on the periphery of the cluster can also be found; a predefined parameter r controls the reducing rate of α.
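A sketch of Equation (8) together with the shrinking α; the geometric decay per added node is only one plausible reading of the reduce rate r, labeled as an assumption in the code:

```python
def priority(adj, u, cluster, alpha):
    """NC(u, C) as in Equation (8): a convex combination of
    |N(u) & C| / |C|    (connections relative to the cluster size) and
    |N(u) & C| / |N(u)| (fraction of u's neighbors inside the cluster)."""
    inside = len(adj[u] & cluster)
    return alpha * inside / len(cluster) + (1 - alpha) * inside / len(adj[u])

def decayed_alpha(alpha0, r, t):
    """Assumed schedule (not necessarily the paper's exact one): alpha
    shrinks geometrically with the number t of nodes already added,
    at reduce rate r."""
    return alpha0 * (1 - r) ** t
```

With α close to 1 the score rewards candidates that cover much of the cluster; as α decays, low-degree periphery nodes whose few neighbors all lie inside the cluster score higher.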
Consider the network shown in Figure 1 as an example, with the current cluster being a dense subgraph grown from its seed. A candidate node lying on the periphery of this cluster obtains a high priority and should be included in the current cluster, whereas a candidate node only loosely connected to it obtains a low priority. If the threshold of the priority is set to 0.5, the former is added to the cluster while the latter is not.
Based on studies of known complexes in protein networks, most complexes have a very small subgraph diameter [23,39]. Thus, we use two thresholds, a closeness threshold for the node priority and a diameter threshold for the graph diameter. That is to say, for a candidate node u and a cluster C(v), if NC(u, C(v)) is not less than the closeness threshold and the diameter of the subgraph induced by V(C(v)) ∪ {u} does not exceed the diameter threshold, node u is added into cluster C(v) and the cluster is updated accordingly. The expansion process ends when no node u in N(C(v)) satisfies both conditions.
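The two stopping conditions can be combined into a small expansion loop with a BFS-based diameter check; the thresholds and the α schedule below are illustrative defaults, not the paper's tuned values:

```python
from collections import deque

def diameter(adj, S):
    """Diameter of the subgraph induced by S (BFS from every node)."""
    S = set(S)
    best = 0
    for src in S:
        dist = {src: 0}
        q = deque([src])
        while q:
            x = q.popleft()
            for y in adj[x] & S:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    q.append(y)
        best = max(best, max(dist.values()))
    return best

def expand(adj, seed, nc_min=0.4, d_max=2, alpha=0.9, r=0.3):
    """Grow a cluster from `seed`, adding the best candidate while both
    the priority and the diameter thresholds hold."""
    cluster = {seed}
    while True:
        frontier = set().union(*(adj[v] for v in cluster)) - cluster
        scored = []
        for u in frontier:
            inside = len(adj[u] & cluster)
            nc = alpha * inside / len(cluster) + (1 - alpha) * inside / len(adj[u])
            if nc >= nc_min and diameter(adj, cluster | {u}) <= d_max:
                scored.append((nc, u))
        if not scored:
            return cluster
        cluster.add(max(scored)[1])  # accept the highest-priority candidate
        alpha *= 1 - r  # assumed geometric decay of alpha
```

Recomputing the diameter on every trial is quadratic per step; it is acceptable here only because accepted clusters stay small under a tight d_max.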
3.5. Complexity
We repeat the seed selection and cluster expansion process until all nodes in the graph are clustered. The frame of the proposed approach SEGC is given in Algorithm 1. Let G = (V, E) be the graph corresponding to the considered protein interaction network with node set V and edge set E, |V| = n and |E| = m. The average computational cost for computing the edge weights is O(m·d̄), where d̄ = 2m/n is the average degree of G, and it takes O(km) time to obtain the node weights over k iterations.
It takes O(n) time to select one seed, and O(n|C|) time to select the seeds of all |C| clusters. The algorithm obtains a cluster C from its seed. During the expansion of C, it takes O(n log n) time to compute NC(x, C) for each candidate node and sort them. In the worst case, C might include all nodes of the considered network; that is to say, we need O(n² log n) time to obtain one cluster and O(|C| n² log n) time in total for cluster expansion. Thus, the overall time consumed by algorithm SEGC is O(m d̄ + km + |C| n² log n).
Algorithm 1. A seed-expansion graph clustering method (SEGC).
Input: A given graph G = (V, E), the weight coefficients, the number of iterations k, the reduce rate r, the closeness threshold and the diameter threshold.
Output: A set of clusters.
1: Let U = V; // U is the set of unclustered nodes
2: For each node v ∈ V, let w_0(v) = DC(v); // DC(v) is the degree centrality of node v
3: For each edge e = (u, v) ∈ E, let w(e) = |N(u) ∩ N(v)|;
4: for i = 1 to k do // (***Node Weighing***)
5:  For each node v ∈ V, update w_i(v) according to Equation (5);
6: end for
7: For each node v ∈ V, compute the selection probability p(v);
8: while U ≠ ∅ do
9:  Select a seed node v ∈ U using the roulette wheel; // (***Seed Selection***)
10: C(v) = ClusterExpansion({v}); // (***Cluster Expansion***)
11: U = U \ V(C(v));
12: end while

Subroutine ClusterExpansion(C)
1: Let S be the set of nodes u ∈ N(C) whose priority NC(u, C) reaches the closeness threshold and for which the diameter of [V(C) ∪ {u}] does not exceed the diameter threshold;
2: if S ≠ ∅ then
3:  Let u ∈ S be the node with the highest priority;
4:  C = [V(C) ∪ {u}];
5:  C = ClusterExpansion(C)
6: else
7:  Return C
8: end if
4. Experiments and Results
We implemented the proposed SEGC algorithm in C++ with Microsoft Visual Studio 2010 (Redmond, WA, USA). SEGC was executed and tested on the Windows 7 platform (Microsoft Corporation, Redmond, WA, USA), running on a PC with an Intel Core CPU (Santa Clara, CA, USA) and 8 GB RAM.
4.1. PPI Datasets and Metrics
We use Saccharomyces cerevisiae as the experimental organism; it is one of the most popular species because it was one of the earliest research objects and has the most abundant PPI data. Five PPI networks of Saccharomyces cerevisiae are used, denoted Gavin02 [6], Gavin06 [25], Krogan_core, Krogan_extend [40] and BioGrid, respectively. These data sets are widely used in protein complex detection. Gavin02 includes 1352 proteins and 3210 interactions. Gavin06 and Krogan_extend are tandem affinity purification (TAP) data sets that include 1430 proteins with 6531 interactions and 3672 proteins with 14,317 interactions, respectively. Krogan_core contains only the highly reliable interactions of Krogan_extend. BioGrid is constructed from all low-throughput physical interactions in the BioGRID database [41] (version 3.4.137) and includes 4254 proteins and 21,375 interactions.
Table 2 shows the information of the five networks above. The density of a graph G = (V, E) is the ratio of the number of its edges to the number of all possible links between its nodes, defined as Density(G) = 2|E|/(|V|(|V| − 1)). We consider only simple graphs in this paper, so we remove all self-interactions and duplicate interactions.
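The density in Table 2 follows directly from the edge list once self-interactions and duplicates are removed, as described above; a small sketch (the function name is ours):

```python
def simple_density(raw_edges):
    """Density(G) = 2|E| / (|V|(|V| - 1)) on the simplified graph,
    after dropping self-interactions and duplicate interactions."""
    edges = {frozenset((u, v)) for u, v in raw_edges if u != v}
    nodes = {x for e in edges for x in e}
    n = len(nodes)
    return 2 * len(edges) / (n * (n - 1)) if n > 1 else 0.0
```

Using `frozenset` pairs makes (u, v) and (v, u) collapse to one undirected edge, so duplicate interactions are deduplicated for free.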
We take CYC2008 [42] as the gold standard complex set to evaluate the protein complexes predicted by the proposed algorithm SEGC. CYC2008 contains 408 manually curated complexes, each reported by small-scale experiments and of high reliability, so it has been used as a benchmark set by many computational approaches for protein complex prediction.
To assess the quality of results obtained by different algorithms, we use several evaluation criteria including precision, recall, F-measure, clustering-wise positive predictive value (PPV), clustering-wise sensitivity (Sn) and accuracy.
F-measure is the most widely used metric [28,43,44]; it can evaluate both how accurately the predicted clusters match known protein complexes and how accurately the known complexes match the predicted clusters. Given a predicted cluster set P and the gold standard complex set B, the neighborhood affinity score NA(p, b) between a predicted cluster p and a standard complex b in the benchmark set is defined as

NA(p, b) = |V_p ∩ V_b|² / (|V_p| × |V_b|),

for p ∈ P and b ∈ B, where V_p and V_b are the protein sets of p and b.
The neighborhood affinity score NA(p, b) quantifies the closeness between the two complexes p and b: the larger NA(p, b) is, the closer p and b are. If NA(p, b) ≥ ω, then p and b are considered to match, where ω is a predefined threshold usually set to 0.2 [27,43]. We also set ω = 0.2 in this paper.
Let N_cp be the set of predicted clusters that match at least one standard complex in B, i.e., N_cp = {p ∈ P | ∃ b ∈ B, NA(p, b) ≥ ω}, and let N_cb be the set of standard complexes that match at least one predicted cluster in P, i.e., N_cb = {b ∈ B | ∃ p ∈ P, NA(p, b) ≥ ω}. The precision and recall are defined as follows:

precision = |N_cp| / |P|,  recall = |N_cb| / |B|.
F-measure is the harmonic mean of precision and recall and quantifies the closeness between the predicted complex set and the standard complex set:

F-measure = 2 × precision × recall / (precision + recall).
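The whole pipeline from the affinity score to F-measure fits in a few lines; a sketch with ω = 0.2 as the default, where clusters and complexes are plain protein sets (function names are ours):

```python
def na(p, b):
    """Neighborhood affinity NA(p, b) = |p & b|^2 / (|p| * |b|)."""
    inter = len(p & b)
    return inter * inter / (len(p) * len(b))

def f_measure(predicted, benchmark, omega=0.2):
    """Precision, recall and their harmonic mean under NA-matching."""
    ncp = sum(1 for p in predicted if any(na(p, b) >= omega for b in benchmark))
    ncb = sum(1 for b in benchmark if any(na(p, b) >= omega for p in predicted))
    precision = ncp / len(predicted)
    recall = ncb / len(benchmark)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```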
Let T be a |P| × |B| matrix, where row i corresponds to a cluster p_i and column j corresponds to an annotated complex b_j. The element t_{ij} of T is the number of proteins that p_i and b_j have in common, i.e., t_{ij} = |V_{p_i} ∩ V_{b_j}|. The clustering-wise positive predictive value (PPV) is defined as:

PPV = (Σ_{i=1}^{|P|} max_j t_{ij}) / (Σ_{i=1}^{|P|} Σ_{j=1}^{|B|} t_{ij}).

The clustering-wise sensitivity (Sn) is defined as:

Sn = (Σ_{j=1}^{|B|} max_i t_{ij}) / (Σ_{j=1}^{|B|} n_j),

where n_j is the number of proteins in complex b_j.
Accuracy is another important criterion for evaluating the quality of a prediction [33,45]. It is the geometric mean of PPV and Sn:

Accuracy = sqrt(PPV × Sn).
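PPV, Sn and accuracy follow directly from the overlap matrix T; a sketch consistent with the definitions above (the function name is ours):

```python
from math import sqrt

def accuracy(predicted, benchmark):
    """Clustering-wise PPV, Sn and their geometric mean (accuracy).
    T[i][j] = |p_i & b_j|; the Sn denominator is sum of |b_j|."""
    T = [[len(p & b) for b in benchmark] for p in predicted]
    total = sum(sum(row) for row in T)
    ppv = sum(max(row) for row in T) / total if total else 0.0
    sn = sum(max(T[i][j] for i in range(len(predicted)))
             for j in range(len(benchmark))) / sum(len(b) for b in benchmark)
    return ppv, sn, sqrt(ppv * sn)
```

Note the asymmetry: PPV normalizes by all overlaps (penalizing clusters spread over several complexes), while Sn normalizes by complex sizes (penalizing complexes split across clusters).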
It is important for a clustering technique to cover all the nodes of a PPI network, as clusters can be both dense and sparse; this ensures that important functional modules or protein complexes are not missed during the clustering process [16]. The Coverage of an algorithm can be calculated as the fraction of proteins of the network assigned to at least one predicted cluster:

Coverage = |∪_{p∈P} V_p| / |V|.
4.2. Parameter Setting
The proposed algorithm SEGC has six predefined parameters: the weight coefficients λ2 and λ3 of the node attributes (with λ1 = 1 − λ2 − λ3), the number of iterations k, the reduce rate r, the closeness threshold and the diameter threshold.
The weight coefficients λ2 and λ3 are used to compute the weights of nodes through the k iterations. The reduce rate, closeness threshold and diameter threshold are used in the cluster expansion process: we find small dense clusters with fewer periphery nodes using a smaller r or a larger closeness threshold, and the diameter threshold controls the diameter of the found clusters.
BioGrid is a standard protein interaction network data set, in which all interactions are low-throughput physical interactions with high reliability and precision. Thus, we apply an alternating direction method on BioGrid to obtain suggested values of these parameters, using F-measure as the optimization goal. We first fix λ2 = 0, λ3 = 0 and k = 1, and carry out experiments on the BioGrid PPI network with the closeness threshold ranging from 0.1 to 0.9 and r from 0.1 to 0.9 to verify their influence. The F-measure reaches its maximum when the closeness threshold is 0.4 and r is 0.3. Then, we fix the closeness threshold at 0.4 and r at 0.3, and the F-measure is maximized at λ2 = 0.6, λ3 = 0 and k = 3. Next, we fix λ2 = 0.6, λ3 = 0 and k = 3 and, in turn, try different values of the closeness threshold and r; these experiments again obtain the best performance at 0.4 and 0.3. Therefore, in this study, we set λ2 = 0.6, λ3 = 0, k = 3, the closeness threshold to 0.4 and r to 0.3.
Figure 2a shows the effect of the parameters λ2 and λ3 on F-measure with r = 0.3 and the closeness threshold set to 0.4, and the effect of the closeness threshold and r on F-measure is shown in Figure 2b with λ2 = 0.6, λ3 = 0 and k = 3. We set the diameter threshold to 2 since the diameters of most known complexes are relatively small [35]. Thus, we finally set the parameters (λ2, λ3, k, closeness threshold, r, diameter threshold) of SEGC to the default values (0.6, 0, 3, 0.4, 0.3, 2), respectively, in all following experiments unless otherwise noted.
4.3. Effectiveness of Our Strategies
We use algorithm IPCA [23] as the basic frame to test the effectiveness of each of our strategies: the node weighing of Section 3.2, the roulette wheel seed selection of Section 3.3 and the priority definition of Equation (8) in Section 3.4.
We replace the definition of node weights in IPCA with Equation (5) proposed in Section 3.2, with the parameters of Equation (5) set to λ2 = 0.6, λ3 = 0 and k = 3. For convenience, we name the IPCA algorithm with the new node weighing method IPCA-NW (node weighing). We add the roulette wheel method to seed selection in IPCA (named IPCA-RW), and the results are shown in the 5th column of Table 3; because of the stochastic nature of the selection step, we run the procedure 500 times and choose the best clustering solution, as is usual practice. We replace the interaction probability (IN) in IPCA with the priority definition of Equation (8) to obtain algorithm IPCA-NC, with the corresponding parameters set to the default values given in Section 4.2.
Table 3 shows the comparison results against the original IPCA. It can be seen that each strategy improves the performance of IPCA to a certain extent.
4.4. Comparison with Other Algorithms
We compare SEGC with other overlapping protein complex detection methods: CFinder [18], DPClus [22], IPCA [23], Core [26], soft regularized Markov clustering (SR-MCL) [44], PE-measure and weighted clustering coefficient (PEWCC) [31], complex detection based on an uncertain graph model (DCU) [32], weighted COACH (WCOACH) [34] and weighted edge based clustering (WEC) [35].
Table 4 exhibits the parameters of each algorithm, as recommended by their authors.
Table 5 shows the comparison results of all algorithms on five PPI networks: Gavin02 [6], Gavin06 [25], Krogan_core, Krogan_extend [40] and BioGrid.
Algorithms CFinder, SR-MCL and WEC produce fewer clusters, which are so dense that the number of edges in each cluster is nearly the same as in a complete subgraph, so they have comparatively higher precision than the other algorithms. A shortcoming of CFinder, SR-MCL and WEC is the loss of coverage, especially on sparse networks. A small coverage usually yields a small recall.
DPClus adopts a seed expansion strategy to find clusters, where the density of the cluster determines whether a node is included in the current cluster. Thus, DPClus can find many small dense clusters. The average number of nodes in a predicted cluster of DPClus is the smallest among all 10 experimental algorithms and is usually not bigger than five. This leads to the highest PPV among all algorithms and a higher coverage on the sparse network than CFinder and SR-MCL. Since DPClus removes the nodes and related edges from the considered network after obtaining a cluster, there might be some isolated nodes in the remaining network. Hence, DPClus cannot obtain full-coverage results. Core extends a cluster from several core proteins. If a candidate node connects with at least half of the nodes in a considered cluster, it is added into the cluster. Hence, the clusters found by Core are usually larger than those found by DPClus, and their density is lower. Therefore, Core always has a higher coverage than DPClus.
IPCA also adopts a seed expansion strategy, like DPClus. IPCA keeps all nodes and edges in the network during the cluster extension process, and can therefore obtain full-coverage results. Our SEGC tries to find better seeds by using the roulette wheel strategy. It also considers both the density of the cluster and the connections between candidate nodes and the considered cluster during cluster extension. Hence, SEGC improves the effectiveness of IPCA and can also obtain full coverage. It is also clear that SEGC performs better than the other nine methods in terms of F-measure and accuracy: the F-measure of SEGC is the highest on Gavin02, Krogan_core, Krogan_extend and BioGrid, and the accuracy of SEGC on Gavin02, Krogan_core and Krogan_extend is also the highest.
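The greedy extension step shared by these seed-expansion methods can be sketched as follows. This is a simplified stand-in: the `min_density` threshold and the most-connected-neighbor rule are illustrative placeholders, not SEGC's actual priority definition from Equation (8).

```python
def density(nodes, adj):
    """Edge density of the subgraph induced by `nodes`: 2m / (n(n-1))."""
    n = len(nodes)
    if n < 2:
        return 1.0
    m = sum(1 for u in nodes for v in adj[u] if v in nodes) // 2
    return 2.0 * m / (n * (n - 1))

def expand_cluster(seed, adj, min_density=0.4):
    """Grow a cluster from `seed`: at each step, admit the neighbor with
    the most links into the current cluster, as long as the density of
    the enlarged cluster stays above `min_density`."""
    cluster = {seed}
    while True:
        # candidate nodes adjacent to the current cluster
        frontier = {v for u in cluster for v in adj[u]} - cluster
        if not frontier:
            break
        # prefer the candidate with the most connections into the cluster
        best = max(frontier, key=lambda v: sum(1 for u in adj[v] if u in cluster))
        if density(cluster | {best}, adj) < min_density:
            break
        cluster.add(best)
    return cluster
```

Because the density check is applied before every admission, the loop terminates as soon as no candidate can join without diluting the cluster below the threshold.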
DCU and WCOACH produce huge clusters with a good coverage. Since a good fraction of each complex is covered by these huge clusters, DCU and WCOACH have a high Sn. The clusters generated by PEWCC are usually smaller than the ones produced by DCU and WCOACH; thus, PEWCC has a better PPV.
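The clustering-wise criteria discussed here (Sn, PPV and their geometric-mean accuracy) can be computed from the complex–cluster overlap matrix; a minimal sketch, assuming the standard definitions:

```python
from math import sqrt

def sn_ppv_accuracy(complexes, clusters):
    """Clustering-wise sensitivity (Sn), positive predictive value (PPV)
    and accuracy (their geometric mean), computed from the overlap matrix
    T[i][j] = |complex_i ∩ cluster_j|."""
    T = [[len(set(c) & set(k)) for k in clusters] for c in complexes]
    # Sn: for each complex, its best-covering cluster, normalized by
    # the total size of all complexes
    sn = sum(max(row) for row in T) / sum(len(c) for c in complexes)
    # PPV: for each cluster, its best-matched complex, normalized by the
    # cluster's total overlap with all complexes
    col_max = [max(T[i][j] for i in range(len(complexes))) for j in range(len(clusters))]
    col_sum = [sum(T[i][j] for i in range(len(complexes))) for j in range(len(clusters))]
    ppv = sum(col_max) / sum(col_sum)
    return sn, ppv, sqrt(sn * ppv)
```

This makes the trade-offs above concrete: huge clusters cover complexes well (high Sn) but overlap many complexes at once (low PPV), while many tiny dense clusters do the opposite.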
It is worth noting that our SEGC performs poorly on Gavin06. This is because we use the default parameter values on Gavin06, including the parameter that controls the density of the considered clusters, whose default value we adopted by executing experiments on the BioGrid dataset. The density of the network from BioGrid is 0.0023, which is similar to those of the Gavin02 (0.0035), Krogan_core (0.0019) and Krogan_extend (0.0021) datasets. However, the density of the network obtained from Gavin06 is 0.0064, almost triple that of the others. This means that the nodes in the Gavin06 PPI network have more connections among them, and the protein complexes in the Gavin06 PPI network may be denser than the clusters obtained from the other PPI networks. Too low a density threshold cannot accurately measure the closeness within clusters from Gavin06: with the default value, SEGC obtains predicted clusters with many more nodes and performs less well on Gavin06. If we increase this threshold to 0.55, SEGC gets denser clusters and obtains better performance, as shown in
Table 5 (values in parentheses).
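The network densities quoted above follow the usual formula for simple undirected graphs, 2|E| / (|V|(|V| − 1)); for reference:

```python
def graph_density(num_nodes, num_edges):
    """Density of a simple undirected graph: 2|E| / (|V| (|V|-1))."""
    return 2.0 * num_edges / (num_nodes * (num_nodes - 1))
```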
4.5. Stability of SEGC
For seed selection, SEGC repeats the selection procedure a few times with a probabilistic approach, the roulette wheel. The average clustering performance with variances on each data set is summarized in
Table 6. We find that SEGC always gives a very small variance for each criterion, which means that our algorithm has good stability. The stability of SEGC rests on two reasons. First, the seed selection process is not completely random. Second, there is a positive correlation between the weight of a node v and the probability that the node will be selected as a seed. Equation (7) further strengthens this positive correlation by transforming the node weights: it increases the inhomogeneity of the selection probabilities P without disturbing their ordering.
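A minimal sketch of roulette-wheel selection with a weight transformation in the spirit of the scheme above; the power exponent is an illustrative assumption, not the exact transformation of Equation (7):

```python
import random

def roulette_wheel_select(weights, power=2.0, rng=random):
    """Select one index with probability proportional to weight**power.

    Raising each weight to a power > 1 increases the inhomogeneity of the
    selection probabilities without changing their ordering, so heavier
    nodes are favored more strongly while the ranking is preserved.
    """
    boosted = [w ** power for w in weights]
    total = sum(boosted)
    r = rng.uniform(0, total)
    acc = 0.0
    for i, b in enumerate(boosted):
        acc += b
        if r <= acc:
            return i
    return len(weights) - 1  # guard against floating-point drift
```

Because selection is proportional rather than greedy, repeated runs explore different seeds, yet high-weight nodes dominate, which is consistent with the small variances reported in Table 6.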
4.6. Examples of Predicted Complexes
We exhibit some predicted protein complexes obtained by our SEGC in this section.
Figure 3 visualizes five predicted complexes, which completely match standard complexes in CYC2008. In total, SEGC finds 100 predicted complexes that completely match standard complexes in CYC2008.
Figure 3 shows five typical protein complex examples: NuA4, Arp2/3, TRAPP, Transcription factor TFIIIC and Carboxy-terminal domain protein kinase. It can be seen that the proposed algorithm SEGC can find both dense complexes close to complete subgraphs (see
Figure 3a–c) and sparse complexes (see
Figure 3d,e). In particular, SEGC can find complexes with pendant nodes, whose degree is 1 in protein networks, as shown in
Figure 3e.
Figure 4 shows two predicted complexes whose topological structure is similar to that of known protein complexes, which indicates that they might be potential protein complexes. These examples may give useful information for detecting new protein complexes in the future.
Table 7 shows gene ontology annotation information and the corresponding
p-value of the examples shown in
Figure 3 and
Figure 4.
5. Conclusions
Graph clustering is widely used in bioinformatics as well as in data mining research, and is an effective approach for protein complex identification in protein interaction networks. In this article, we proposed a seed expansion graph clustering algorithm, SEGC, for protein complex detection in protein interaction networks. SEGC weights nodes by multi-attribute fusion, selects seed nodes using the roulette wheel, and extends a cluster by considering both the density of the cluster and the connections of each candidate node. It yields a soft clustering with full coverage of the entire network. Compared with other protein complex detection algorithms, SEGC shows competitive performance in terms of precision, recall, F-measure, clustering-wise positive predictive value (PPV), clustering-wise sensitivity (Sn) and accuracy.
There are still some problems that need further study. In large PPI networks, clustering techniques must identify important nodes (e.g., seed nodes) more accurately while keeping their computational complexity manageable. In addition, suitable non-topological information would help to reduce the noise in the data, and combining non-topological and topological attributes might further improve the performance of clustering algorithms.