5.2. Time and Memory Requirements
In Table 3, the total execution time for Chebyshev Walktrap (CW), Markov Walktrap (MW), Fuzzy Walktrap (FW), and Fuzzy Newman–Girvan (FN-G) is shown. The last two algorithms are outlined in [10], whereas Fuzzy Markov was proposed in [15]. The effect of the two escape mechanisms, weight inversion (I) and relocation (R), is also shown for Markov Walktrap and Chebyshev Walktrap. Fuzzy Newman–Girvan is an exhaustive algorithm that serves as the baseline for both the resource requirements and the clustering quality. For the Walktrap algorithms, the times for the two phases, namely random walking (RW) and community building (CB), are recorded separately, whereas for Fuzzy Newman–Girvan only the total time is recorded, as there is a single phase. In addition, the last column of Table 3 lists the number of vertices visited by the random walker.
Fuzzy Newman–Girvan is considerably slower than any member of the Walktrap family of algorithms. This can be attributed to the exhaustive nature of Fuzzy Newman–Girvan as well as to the extensive use of locality by the Walktrap family. Moreover, the probabilistic constraints of Markov Walktrap and Fuzzy Walktrap accelerated both phases of the respective algorithms, with the second-order constraints yielding the lowest times in each case. Concerning the escape strategy of the random walker, the relocation option resulted in a slower walking phase but an accelerated community building phase, and this combination was more efficient than both weight inversion alone and the combination of the two escape strategies. Omitting an escape strategy altogether is not advisable; neither, however, is activating both escape strategies at the same time. At any rate, Chebyshev Walktrap with relocation (CW+R) had the best overall performance, closely followed by Markov Walktrap with relocation (MW+R), while the original Fuzzy Markov was the tardiest member of the family.
An explanation for the times achieved under the relocation strategy is that the teleportation of the random walker results in cache misses, which translate to expensive fetch cycles in the memory hierarchy. This can be seen in the last two columns of Table 3, as there is no clear correspondence between the total number of visits and the total walking-phase time. When relocation is enabled, the mean visit time is clearly higher. At any rate, the number of visits is linear in the vertex set cardinality.
In addition, the selection of h did not appear to have a significant performance impact, although in most cases the random walker was slower, both in terms of time and in terms of total visits, when h was a Poisson random variable. This can be attributed to the larger number of high-cost edges, which forced the walker to bounce more times inside a community before eventually moving to another. On the other hand, the symmetric form of the binomial probability mass function resulted in a larger number of low-cost edges, facilitating the movement of the random walker and making the communities more easily separable compared to the Poisson case.
The memory requirements were monitored with the Ubuntu Watch administrative tool, as presented in Table 4. In contrast to other similar tools such as htop, Watch generates text output which can be parsed and analyzed. It was run periodically every 10 s through a bash script, resulting in records of several thousand entries each.
Fuzzy Newman–Girvan consumes by far more memory than any other algorithm presented in this article. However, it utilizes the memory constantly and consistently, as indicated by the relatively low standard deviation. This is an important performance feature for operating system process schedulers [46]. On the other hand, the Walktrap family exploits graph caching. This in turn translates to lower traffic between the disk and the memory, as Neo4j is not a memory-only database, as well as to fewer synchronization and serialization operations. When the relocation strategy is selected, memory utilization exhibits certain spikes, as can be inferred from the increased maximum memory occupied and the increased standard deviation. This is a direct result of the random walker teleportation, which temporarily annuls any scheduling optimization as well as any caching done at the software or hardware level.
5.3. Community Coherence
The following definition will facilitate further analysis of the experimental results.
Definition 7. The (log)scree plot of a set S is the plot of the (logarithm of the) values of S versus their sorted frequency.
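Definition 7 can be realized in a few lines. The following is a minimal sketch; the function name and the toy data are hypothetical, and plotting is omitted in favor of returning the (log) scree points themselves.

```python
import math
from collections import Counter

def log_scree(values):
    """Return the points of a (log) scree plot: the distinct values of the
    input set paired with their frequencies, sorted by decreasing frequency.
    Per Definition 7, the logarithm is taken on the values themselves."""
    freq = Counter(values)
    # order the distinct values by decreasing frequency
    ordered = sorted(freq.items(), key=lambda kv: kv[1], reverse=True)
    return [(math.log(v), f) for v, f in ordered]

# community sizes of a toy clustering (hypothetical data)
sizes = [8, 8, 8, 4, 4, 2]
points = log_scree(sizes)
```

The returned pairs can then be fed directly to any plotting backend.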
Since Y does not contain ground truth communities, the communities obtained by Fuzzy Newman–Girvan will be used as a baseline reference, since their sizes are closer to a power law distribution, which is an essential feature of large, scale-free graphs. The deviation $E$ of a set of numbers $\{ s_k \}$ from a power law $s_k \approx c_0 k^{-\gamma}$ is quantified by the formula [62,63]

$$E = \sum_{k} \left( \log s_k - \log c_0 + \gamma \log k \right)^2,$$

where the parameters $c_0$ and $\gamma$ can be estimated by, for instance, a least squares method [25]. Additionally, the estimated value of $\gamma$ serves as a quality indicator, as it should be as close as possible to that of the reference distribution.
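The least squares estimation mentioned above amounts to an ordinary linear regression of $\log s_k$ on $\log k$. A minimal self-contained sketch (function name and toy data are hypothetical, stdlib only):

```python
import math

def fit_power_law(sizes):
    """Least-squares fit of sorted sizes s_k ~ c0 * k^(-gamma) in log-log
    space: regress log s_k on log k and return (c0, gamma)."""
    s = sorted(sizes, reverse=True)
    xs = [math.log(k) for k in range(1, len(s) + 1)]
    ys = [math.log(v) for v in s]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # c0, gamma

# synthetic sizes drawn from an exact power law s_k = 100 * k^(-2)
sizes = [100 * k ** -2.0 for k in range(1, 20)]
c0, gamma = fit_power_law(sizes)
```

On exact power-law data the fit recovers the exponent, so $\gamma$ here comes out at 2 up to floating-point error.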
The number of communities for each algorithm is shown in Table 5. Notice that this is not an absolute clustering quality metric, as typically a large number of coherent communities is preferable to a smaller number of sparse ones. Nonetheless, the introduction of the relocation strategy systematically pushes the number of communities towards the reference number, although more evidence is required for determining community coherence. This will be addressed by the two asymmetric indices of this section.
In order to evaluate the clustering quality, the Kullback–Leibler divergence between the sorted sizes of the communities generated by Fuzzy Newman–Girvan and the sorted community sizes of the remaining algorithms was computed. Recall that for two discrete distributions $p$ and $q$, the Kullback–Leibler divergence is defined as

$$KL\left( p \,\|\, q \right) = \sum_{k} p_k \log \frac{p_k}{q_k},$$

where $k$ ranges over the union of discrete events. If $p$ and $q$ have no events in common, then the result is undefined. If for a single event $p_k = 0$ or $q_k = 0$, then the corresponding summand is zero.
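The definition above, including the document's conventions for disjoint supports and zero-probability events, can be sketched as follows (function name is hypothetical; distributions are given as event-to-probability maps):

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence between two discrete distributions given
    as {event: probability} dicts. Following the conventions in the text,
    a summand is dropped whenever p_k or q_k is zero, and the result is
    undefined (None here) when the supports are disjoint."""
    if not (set(p) & set(q)):
        return None  # no common events: undefined
    total = 0.0
    for k in set(p) | set(q):  # k ranges over the union of events
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        if pk > 0.0 and qk > 0.0:
            total += pk * math.log(pk / qk)
    return total

# identical distributions diverge by zero
uniform = {1: 0.5, 2: 0.5}
zero_div = kl_divergence(uniform, uniform)
```

Note that the divergence is asymmetric, which fits the baseline-versus-candidate comparison performed here: the reference distribution is the first argument.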
Table 6 summarizes the divergence for the Poisson and the binomial cases. Chebyshev Walktrap with relocation outperforms the remaining algorithms, as it has the lowest divergence from the reference distribution.
A question at this point is whether a correspondence between the communities returned by each algorithm can be found. The asymmetric Tversky index between two sets $T$ and $V$ is defined as

$$\operatorname{tv}\left( T, V \right) = \frac{\left| T \cap V \right|}{\left| T \cap V \right| + \alpha \left| T \setminus V \right| + \beta \left| V \setminus T \right|}, \qquad \alpha, \beta \geq 0,$$

and it quantifies the distance between the template set $T$ and the variant set $V$. By the very definition of the index, the template set $T$ and the variant set $V$ are not interchangeable, namely $\operatorname{tv}\left( T, V \right) \neq \operatorname{tv}\left( V, T \right)$ in general. This agrees with intuition, as it makes sense to ask how much the heuristic results differ from the ground truth community, whereas there is no point in asking the inverse question. On the contrary, with a symmetric distance metric such as, for instance, the Tanimoto similarity coefficient

$$J\left( T, V \right) = \frac{\left| T \cap V \right|}{\left| T \cup V \right|},$$

no distinction can be made between the template and the variant, which can potentially lead to misleading results.
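Both indices translate directly to set operations. A minimal sketch (function names and toy sets are hypothetical) that also exhibits the asymmetry of the Tversky index when $\alpha \neq \beta$:

```python
def tversky(template, variant, alpha, beta):
    """Asymmetric Tversky index between the template set T and variant V:
    |T & V| / (|T & V| + alpha*|T - V| + beta*|V - T|)."""
    common = len(template & variant)
    return common / (common
                     + alpha * len(template - variant)
                     + beta * len(variant - template))

def tanimoto(a, b):
    """Symmetric Tanimoto (Jaccard) coefficient |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

T, V = {1, 2, 3, 4}, {3, 4, 5}
forward = tversky(T, V, alpha=0.5, beta=0.25)
backward = tversky(V, T, alpha=0.5, beta=0.25)
symmetric = tanimoto(T, V)
```

With unequal weights, `forward` and `backward` differ, whereas the Tanimoto coefficient is the same regardless of argument order.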
At this point, it should be highlighted that Fuzzy Newman–Girvan was executed only once, since it is a deterministic algorithm.
Returning to the Tversky index of (44), the case $\alpha + \beta = 1$ is of particular interest in data mining, as it confines the coefficients to the plane which maximizes the minimum distance of $T$ from $V$. Notice that, algebraically, this asymmetry stems from both the terms $\left| T \setminus V \right|$ and $\left| V \setminus T \right|$, which denote the number of elements of $T$ not found in $V$ and vice versa. Both terms signify in their own way how $V$ differs from $T$: the former corresponds to the part of $T$ which is missing from $V$, whereas the latter corresponds to any additions present in $V$ but not in $T$. As a rule, $\left| T \setminus V \right|$ is more important and, consequently, $\alpha \geq \beta$. As there is no standard rule for selecting $\alpha$ and $\beta$, the following two schemes have been used: a linear one and an exponential one.
Observe that the exponential scheme represents a non-linear scaling of the linear one. Furthermore, the exponential scheme is considerably biased in favor of the template term.
Once the Tversky indices have been computed for each possible pair of the $m$ ground truth communities $T_1, \ldots, T_m$ and the $n$ estimated ones $V_1, \ldots, V_n$, the similarity score $\rho_s$ for a given $s$ is computed as

$$\rho_s = \max_{1 \leq j \leq n} \operatorname{tv}\left( T_s, V_j \right).$$

Summing over the range of $s$ and taking the average, the mean similarity score $\rho$ is obtained:

$$\rho = \frac{1}{m} \sum_{s = 1}^{m} \rho_s.$$
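The scoring just described can be sketched as follows. The best-match reading, scoring each reference community by its highest Tversky index among the estimated communities, is an assumption here, and all names and toy data are hypothetical:

```python
def tversky(template, variant, alpha=0.5, beta=0.5):
    """Asymmetric Tversky index between template set T and variant V."""
    common = len(template & variant)
    return common / (common
                     + alpha * len(template - variant)
                     + beta * len(variant - template))

def mean_similarity(ground_truth, estimated, alpha=0.5, beta=0.5):
    """Score each reference community T_s by its best Tversky match among
    the estimated communities, then average the scores over s."""
    scores = [max(tversky(T, V, alpha, beta) for V in estimated)
              for T in ground_truth]
    return sum(scores) / len(scores)

truth = [{1, 2, 3}, {4, 5}]     # m = 2 reference communities
found = [{1, 2, 3}, {4, 6}]     # n = 2 estimated communities
rho = mean_similarity(truth, found)
```

Here the first reference community is matched exactly (score 1.0) and the second only partially (score 0.5), giving a mean score of 0.75.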
The overall similarity scores are shown in Table 7.
Again, Chebyshev Walktrap with relocation outperforms the remaining algorithms as it has the highest similarity with the reference communities. Note that the exponential weighting scheme sharpens the difference between the algorithms by raising the maximum scores and lowering the minimum ones.
For the experiments of this section, the termination criterion was chosen to be a user-supplied number of iterations. This number of iterations is sufficiently large for generating communities in a reliable way. Moreover, each iteration is very quick, so the overall execution time was kept at an acceptable level despite the large number of iterations.