We implemented the algorithm in Matlab (MathWorks, Beijing, China, 2014) and ran it on a single machine with RedHat 6.4, 16 GB of memory, and a 3.4 GHz CPU; the Matlab version is R2014b. In the simulations of Section 5.1, 4-dimensional pdfs are used to verify the limit theorem and the effectiveness of the TLF method. We also test the resulting method on real networks. We use the TLF method to fuse 4 local similarity indices: CN [19], AA [20], RA, and PA [21,22]. These are 4 simple indices whose computational complexity is low and scales with the average degree ⟨k⟩ of the nodes in a network. The CN index considers only the common neighbors of a node pair; the PA index considers only the degrees of the two nodes; AA and RA consider both common neighbors and node degrees, with different weights. We also compare the method with fusion methods such as naïve Bayes and logistic regression, and with global indices of higher computational complexity.
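The four local indices above can be made concrete with a short sketch. This is illustrative code (not the paper's Matlab implementation): the toy graph, node labels, and function names are hypothetical, but the formulas are the standard definitions of CN, AA, RA, and PA.

```python
import math

# Toy undirected graph stored as an adjacency dict (hypothetical example).
adj = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3},
}

def cn(u, v):
    # Common Neighbors: |Γ(u) ∩ Γ(v)|
    return len(adj[u] & adj[v])

def aa(u, v):
    # Adamic-Adar: sum of 1/log(k_z) over common neighbors z
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v])

def ra(u, v):
    # Resource Allocation: sum of 1/k_z over common neighbors z
    return sum(1.0 / len(adj[z]) for z in adj[u] & adj[v])

def pa(u, v):
    # Preferential Attachment: k_u * k_v
    return len(adj[u]) * len(adj[v])

# The 4-dimensional score vector that a fusion method would take as input.
scores = [f(2, 4) for f in (cn, aa, ra, pa)]
```

Each candidate node pair thus yields a 4-dimensional score vector, which is exactly the input that the fusion methods compared in this section operate on.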
5.1. Simulation Examples
Four types of structural similarity indices were simulated to evaluate node pairs with and without links, and the pdfs of the structural similarity indices are also provided. We construct 3 groups of known distributions for the similarity-index pdfs. Following the appropriate pdfs, we generated 10,000 samples of existing links and 100,000 samples of nonexistent links. One thousand of the existing-link samples serve as the probe set; the 100,000 nonexistent-link samples, together with the 1000 probe links, serve as unknown links during training; and the remaining 9000 existing-link samples serve as the training set of existing links. Each sample has 4 dimensions, simulating 4 similarity scores. We first compute the AUC and Precision for each dimension separately, then use the proposed TLF method to obtain the synthetic score and calculate its AUC and Precision, comparing it with other combination methods such as naïve Bayes and logistic regression. Finally, we calculate the AUC and Precision given by the theoretical limit theorem and compare them with the above methods.
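The sampling and splitting procedure described above can be sketched as follows. The counts come from the text; the generating distributions here are hypothetical stand-ins (the actual group parameters are given later).

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in pdfs: 4-dimensional scores for each class.
existing = rng.normal(1.0, 1.0, (10_000, 4))      # 10,000 existing-link samples
nonexistent = rng.normal(0.0, 1.0, (100_000, 4))  # 100,000 nonexistent-link samples

probe = existing[:1000]            # 1000 existing links form the probe set
train_existing = existing[1000:]   # remaining 9000 existing links for training
# The 100,000 nonexistent samples plus the 1000 probe links are treated
# as "unknown" links during training.
train_unknown = np.vstack([nonexistent, probe])
```

This reproduces the 10,000/1000/9000 bookkeeping of the experimental design; only the generating distributions are placeholders.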
Let random vectors X and Y be the scores of existing and nonexistent links, which follow the pdfs f(x) and g(y), respectively. Let f and g be 4-dimensional normal distributions N(μ1, Σ1) and N(μ2, Σ2), where μi denotes the mean vector and Σi the covariance matrix. Groups 1 and 2 of the simulation examples each specify one such parameter set (μ1, Σ1, μ2, Σ2), with the numerical values differing between the two groups.
The window width of the TLF method in groups 1 and 2 is h = 0.1.
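For two multivariate normal classes, the theoretical fusion limit can be probed numerically. The sketch below uses hypothetical parameters (unit means vs. zero means, identity covariances) and scores each sample by the log-likelihood ratio, which by a Neyman-Pearson-type argument is the optimal fused score; the paper's Theorem 1 is assumed to formalize this limit.

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2 = np.full(4, 1.0), np.zeros(4)  # hypothetical class means
cov = np.eye(4)                          # hypothetical shared covariance

x = rng.multivariate_normal(mu1, cov, 10_000)   # existing-link scores
y = rng.multivariate_normal(mu2, cov, 100_000)  # nonexistent-link scores

def llr(s):
    # Log-likelihood ratio log f(s)/g(s) for equal covariances reduces,
    # up to an additive constant, to a linear score (monotone-equivalent).
    return s @ (mu1 - mu2)

sx, sy = llr(x), llr(y)
# AUC = P(score of an existing link exceeds score of a nonexistent link);
# a subsample of the negatives keeps the pairwise comparison small.
auc = (sx[:, None] > sy[None, :1000]).mean()
```

With these parameters the analytic value is Φ(4/√8) ≈ 0.92, so the Monte-Carlo estimate should land close to that; any fusion method applied to the same data can then be benchmarked against this limit.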
We ignore the constant that makes the integral of the pdf equal to 1. The simulation results of group 3 are shown in Table 2. The window width of the TLF method in group 3 is h = 0.1.
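The window width h suggests a kernel (Parzen-window) density estimate. The paper does not spell out TLF's estimator at this point, so the following is a minimal sketch under the assumption that TLF estimates the two pdfs with a Gaussian product kernel of width h and ranks links by the estimated ratio.

```python
import numpy as np

def parzen(train, query, h=0.1):
    """Gaussian product-kernel density estimate at each query point."""
    d = train.shape[1]
    diff = query[:, None, :] - train[None, :, :]      # shape (q, n, d)
    k = np.exp(-0.5 * (diff / h) ** 2).prod(axis=2)   # product over dimensions
    return k.mean(axis=1) / (np.sqrt(2 * np.pi) * h) ** d

rng = np.random.default_rng(1)
f_train = rng.normal(1.0, 0.3, (500, 2))  # toy existing-link scores (2-d)
g_train = rng.normal(0.0, 0.3, (500, 2))  # toy nonexistent-link scores
query = np.array([[1.0, 1.0], [0.0, 0.0]])

# Fused score: estimated likelihood ratio f̂/ĝ (tiny epsilon avoids 0/0).
ratio = parzen(f_train, query) / (parzen(g_train, query) + 1e-300)
```

A query point near the existing-link cloud gets a ratio above 1, and one near the nonexistent-link cloud gets a ratio below 1, which is the behavior a ratio-based fused score needs for ranking.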
The simulation results in Table 2 and Table 3 show that we can calculate the theoretical limit of a combination method based on Theorem 1, and that the limit AUC and Precision are the highest among all listed methods, although we cannot enumerate all possible conditions. The results also show that the TLF method fuses the information effectively and attains the optimum accuracy. We further verify that a transformation by a monotonically increasing function does not change the theoretical limit. Theorem 1 thus provides a platform on which combination methods can be compared by constructing suitable distributions, and it guides the design of an effective combination method, TLF.
5.2. Experiments in Real Networks
The significance of the simulation is that the theoretical limit can be derived by theoretical or numerical calculation, and all combination methods can be compared against it, revealing shortcomings and gaps that help design a more rational method. However, simulated data differ from real network data, so we use the TLF method to fuse several similarity indices and test it on real networks. The basic similarity indices we use are the Common Neighbor index (CN) [19], Adamic-Adar index (AA) [15], Resource Allocation index (RA), and Preferential Attachment index (PA) [21,22]. These indices are local indices. Several global indices, such as the Katz index [23], Average Commute Time index (ACT), and Cosine Similarity index (Cos+), serve as comparisons [24,25]. The definitions and meanings of the above indices are listed in Table 4.
We use the TLF method to fuse the 4 local similarity indices and compare it with fusion methods such as naïve Bayes and logistic regression, as well as with the global indices. Our experiments are performed on 11 different real networks: (1) Food Web Everglades Web (FWEW) [26]; (2) Food Web Florida Bay (FWFB) [27]; (3) Protein-Protein Interactions Cell (PPI Cell) [28]; (4) CKM-3 [29]; (5) Netscience (NS) [30]; (6) Yeast [31]; (7) Political Blogosphere (PB) [32]; (8) Email [33]; (9) CA-GrQc (CG) [34]; (10) Com-dblp (CD) [35]; (11) Email Enron (EE) [36,37]. The basic topological features of the 11 real networks are listed in Table 5. Each original dataset is randomly divided into a training set containing 90% of the links and a probe set containing the remaining 10%.
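The 90%/10% random split can be sketched in a few lines; the edge list below is a hypothetical placeholder for any of the 11 networks.

```python
import random

# Toy stand-in edge list (a small complete graph); real experiments would
# load one of the 11 networks' edge lists instead.
edges = [(u, v) for u in range(1, 30) for v in range(u + 1, 30)]

random.seed(0)          # fixed seed; the paper averages over 10 random splits
random.shuffle(edges)
cut = int(0.9 * len(edges))
train, probe = edges[:cut], edges[cut:]  # 90% training, 10% probe
```

Repeating this split with different seeds and averaging, as done for the 10 realizations reported below, reduces the variance introduced by any single random partition.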
Table 6 and Table 7 show comparisons between the TLF method and the other combination methods and global indices using the AUC and Precision metrics. Each result is the average of 10 realizations. When calculating the Precision metric in Equation (5), we take L = 100 in datasets 1 to 8 and L = 1000 in datasets 9 to 11. In the large networks, the TLF method needs sampling to save computing resources; in datasets 10 to 11, the under-sampling rate is set to 1000.
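The two evaluation metrics can be sketched as follows. Equation (5) is not reproduced here, so the Precision function assumes the usual top-L definition; the AUC function uses the standard pairwise formula with ties counted as one half.

```python
import numpy as np

def auc_metric(pos_scores, neg_scores):
    """AUC = P(pos > neg) + 0.5 * P(pos == neg) over all pairs."""
    pos, neg = np.asarray(pos_scores, float), np.asarray(neg_scores, float)
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return gt + 0.5 * eq

def precision_at_L(scores, labels, L):
    """Fraction of true links among the L top-scored candidate links."""
    order = np.argsort(scores)[::-1][:L]
    return np.asarray(labels)[order].mean()

# Toy check: top-2 candidates contain one true link -> Precision 0.5.
p = precision_at_L([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0], L=2)
```

For the full pairwise AUC on large networks, comparing every probe link against every nonexistent link is expensive, which is why the text resorts to under-sampling the nonexistent links.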
The results show that the TLF method performs better than the other fusion methods, such as naïve Bayes and logistic regression, regardless of the evaluation metric used. Almost all combination methods are better than the 4 basic indices. According to the limit theorem, a combination method depends on each of its input dimensions, so the improvement achievable by the fused index is limited by the individual similarity indices. The experimental results also expose this problem: if the single similarity indices perform poorly, the fused index cannot significantly improve the accuracy. For example, in the CKM-3 network, although fusing the 4 basic similarity indices with the TLF method clearly improves the AUC, the result is still not better than the Katz index (0.928).