FONDUE: A Framework for Node Disambiguation and Deduplication Using Network Embeddings †
Abstract
:Featured Application
Abstract
1. Introduction
1.1. The Node Disambiguation Problem
1.2. The Node Deduplication Problem
1.3. Contributions
- We propose FONDUE, a framework exploiting the empirical observation that naturally occurring networks can be embedded well using state-of-the-art NE methods, to tackle two distinct tasks: node deduplication (FONDUE-NDD) and node disambiguation (FONDUE-NDA). The former, by identifying nodes as more likely to be duplicated if contracting them enhances the quality of an optimal NE. The latter, by identifying nodes as more likely to be ambiguous if splitting them enhances the quality of an optimal NE;
- In addition this conceptual contribution, substantial challenges had to be overcome to implement this idea in a scalable manner. Specifically for the NDA problem, through a first-order analysis we derive a fast approximation of the expected NE quality improvement after splitting a node;
- We implemented this idea for CNE [6], a recent state-of-the-art NE method, although we demonstrate that the approach can be applied for a broad class of other NE methods as well;
- We tackle the NDA problem, with extensive experiments over a wide range of networks demonstrate the superiority of FONDUE over the state-of-the-art for the identification of ambiguous nodes, and this at comparable computational cost;
- We also empirically observe that, somewhat surprisingly, despite the increase in accuracy for identifying ambiguous nodes, no such improvement was observed for the ambiguous node splitting accuracy. Thus, for NDA, we recommend using FONDUE for the identification of ambiguous nodes, while using an existing state-of-the-art approach for optimally splitting them;
- Experiments on four datasets for NDD demonstrate the viability of FONDUE-NDD for the NDD problem based on only the topological features of a network.
2. Related Work
3. Methods
3.1. Problem Definition
3.1.1. Formalizing the Node Disambiguation Problem
3.1.2. Formalizing the Node Deduplication Problem
3.1.3. Real Graphs Suffer from Both Issues
3.2. FONDUE as a Generic Approach
The FONDUE Induction Bias
3.3. FONDUE-NDA
- 1.
- Estimating the multiplicities of all —i.e., the number of unambiguous nodes from represented by the node from . This essentially amounts to estimation the contraction c. Note that the number of nodes n in V is then equal to the sum of these multiplicities, and arbitrarily assigning these n nodes to the sets defines and, thus, c;
- 2.
- Given c, estimating the edge set E. To ensure that , for each there must exist at least one edge with and . However, this leaves the problem underdetermined (making this problem ill-posed), as there may also exist multiple such edges.
3.3.1. A First-Order Approximation for Computational Tractability
3.3.2. Additional Heuristics for Enhanced Scalability
3.3.3. FONDUE-NDA Using CNE
3.4. FONDUE-NDD
FONDUE-NDD Using CNE
4. Experiments
4.1. Datasets
4.2. Node Disambiguation
4.2.1. Data Processing
4.2.2. Quantitative Evaluation of Node Identification
4.2.3. Qualitative Evaluation of Nodes Identification
4.2.4. Quantitative Evaluation of Nodes Splitting
4.2.5. Parameter Sensitivity
4.2.6. Execution Time Analysis
4.3. Node Deduplication
4.3.1. Quantitative Evaluation of Node Deduplication
- Edge distribution: We employ 2 different ways to reassign the edges, from 1 node to 2 different nodes, either by randomly distributing the edges with at least 1 edge per node, or by ensuring equal distribution for each one;
- Minimum degree: We only choose to split nodes with a degree larger than a specified minimum;
- Overlap: We specify if there is an overlap in the edge reassignment for the different nodes, i.e., percentage of common edges for each node.
4.4. Discussions
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
NDA | Node Disambiguation |
NDD | Node Deduplication |
NE | Network Embeddings |
CNE | Conditional Network Embeddings |
Appendix A. Real-World Example: PubMed Dataset
Appendix B. Tabulated Results for Ambiguous Node Identification
Ambiguity Rate | 10% | 1% | 0.1% | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | FONDUE-NDA | NC | CC | Degree | FONDUE-NDA | NC | CC | Degree | FONDUE-NDA | NC | CC | Degree | |
AUC | fb-sc | 0.951 | 0.793 | 0.163 | 0.737 | 0.968 | 0.836 | 0.517 | 0.713 | 0.989 | 0.936 | 0.747 | 0.690 |
fb-pp | 0.871 | 0.720 | 0.744 | 0.724 | 0.895 | 0.699 | 0.762 | 0.726 | 0.888 | 0.673 | 0.768 | 0.673 | |
0.768 | 0.491 | 0.280 | 0.730 | 0.862 | 0.587 | 0.380 | 0.736 | − | − | − | − | ||
student | 0.785 | 0.555 | 0.549 | 0.724 | 0.878 | 0.316 | 0.708 | 0.839 | − | − | − | − | |
lesmis | 0.833 | 0.556 | 0.366 | 0.757 | − | − | − | − | − | − | − | − | |
polbooks | 0.952 | 0.590 | 0.308 | 0.841 | 1.000 | 0.813 | 0.327 | 0.877 | − | − | − | − | |
ppi | 0.746 | 0.559 | 0.641 | 0.726 | 0.792 | 0.555 | 0.647 | 0.739 | 0.943 | 0.522 | 0.630 | 0.803 | |
netscience | 0.933 | 0.877 | 0.807 | 0.805 | 0.952 | 0.821 | 0.669 | 0.842 | − | − | − | − | |
GrQc | 0.872 | 0.807 | 0.796 | 0.740 | 0.890 | 0.812 | 0.785 | 0.719 | 0.944 | 0.794 | 0.675 | 0.696 | |
CondMat | 0.861 | 0.837 | 0.797 | 0.757 | 0.867 | 0.839 | 0.810 | 0.744 | 0.873 | 0.840 | 0.793 | 0.709 | |
HepTh | 0.853 | 0.754 | 0.780 | 0.743 | 0.861 | 0.746 | 0.775 | 0.733 | 0.902 | 0.741 | 0.799 | 0.764 | |
cm05 | 0.867 | 0.852 | 0.811 | 0.754 | 0.889 | 0.867 | 0.815 | 0.758 | 0.910 | 0.870 | 0.824 | 0.770 | |
cm03 | 0.875 | 0.851 | 0.815 | 0.758 | 0.886 | 0.843 | 0.822 | 0.742 | 0.909 | 0.818 | 0.802 | 0.751 | |
DCG | fb-sc | 56.843 | 57.324 | 45.767 | 46.226 | 8.504 | 9.709 | 6.313 | 4.695 | 1.738 | 1.371 | 0.697 | 0.509 |
fb-pp | 216.187 | 198.421 | 197.420 | 193.395 | 24.629 | 20.177 | 20.028 | 19.358 | 2.798 | 2.005 | 2.008 | 1.894 | |
16.150 | 13.644 | 12.779 | 14.533 | 2.249 | 1.297 | 1.124 | 1.317 | − | − | − | − | ||
student | 8.481 | 7.243 | 6.196 | 7.105 | 0.845 | 0.698 | 0.501 | 0.548 | − | − | − | − | |
lesmis | 3.297 | 2.073 | 1.816 | 2.262 | − | − | − | − | − | − | − | − | |
polbooks | 4.415 | 2.746 | 2.427 | 3.170 | 1.000 | 0.310 | 0.267 | 0.318 | - | - | - | - | |
ppi | 43.601 | 38.479 | 41.609 | 42.611 | 4.504 | 3.778 | 4.086 | 4.152 | 0.414 | 0.294 | 0.313 | 0.323 | |
netscience | 9.324 | 8.255 | 7.696 | 7.594 | 1.083 | 0.806 | 0.688 | 0.613 | − | − | − | − | |
GrQc | 52.236 | 49.072 | 48.513 | 46.8 | 6.797 | 5.041 | 4.928 | 4.676 | 0.639 | 0.499 | 0.471 | 0.431 | |
CondMat | 199.255 | 197.187 | 194.376 | 188.157 | 21.936 | 20.186 | 19.841 | 18.982 | 2.663 | 2.017 | 1.955 | 1.856 | |
HepTh | 93.870 | 87.125 | 89.909 | 86.928 | 10.935 | 8.754 | 9.299 | 8.829 | 1.286 | 0.796 | 0.886 | 0.821 | |
cm05 | 320.094 | 316.111 | 310.849 | 299.216 | 34.507 | 32.485 | 32.009 | 29.989 | 4.871 | 3.230 | 3.171 | 2.973 | |
cm03 | 253.117 | 247.685 | 242.945 | 234.606 | 28.164 | 25.558 | 24.903 | 23.652 | 2.861 | 2.537 | 2.406 | 2.323 | |
F1-Score | fb-sc | 0.744 | 0.856 | 0.759 | 0.251 | 0.550 | 0.748 | 0.528 | 0.065 | 0.250 | 0.300 | 0.013 | 0.013 |
fb-pp | 0.476 | 0.327 | 0.315 | 0.247 | 0.188 | 0.092 | 0.029 | 0.028 | 0.045 | 0.074 | 0.007 | 0.002 | |
0.388 | 0.227 | 0.190 | 0.277 | 0.111 | 0.017 | 0.006 | 0.022 | − | − | − | − | ||
student | 0.410 | 0.226 | 0.139 | 0.266 | 0.333 | 0.107 | 0.013 | 0.04 | − | − | − | − | |
lesmis | 0.571 | 0.297 | 0.223 | 0.331 | − | − | − | − | − | − | − | − | |
polbooks | 0.700 | 0.288 | 0.260 | 0.284 | 1.000 | 0.24 | 0.04 | 0.12 | − | − | − | − | |
ppi | 0.265 | 0.075 | 0.226 | 0.245 | 0.079 | 0.007 | 0.013 | 0.018 | 0 | 0 | 0 | 0 | |
netscience | 0.595 | 0.517 | 0.441 | 0.368 | 0.333 | 0.093 | 0.053 | 0.013 | − | − | − | − | |
GrQc | 0.398 | 0.352 | 0.322 | 0.251 | 0.171 | 0.102 | 0.041 | 0.041 | 0 | 0.025 | 0 | 0 | |
CondMat | 0.395 | 0.418 | 0.343 | 0.280 | 0.089 | 0.094 | 0.036 | 0.033 | 0.095 | 0.010 | 0.008 | 0 | |
HepTh | 0.397 | 0.279 | 0.329 | 0.273 | 0.128 | 0.054 | 0.062 | 0.047 | 0.125 | 0.013 | 0.033 | 0 | |
cm05 | 0.409 | 0.451 | 0.366 | 0.267 | 0.099 | 0.131 | 0.045 | 0.026 | 0.111 | 0.024 | 0.011 | 0 | |
cm03 | 0.424 | 0.449 | 0.369 | 0.279 | 0.124 | 0.127 | 0.042 | 0.031 | 0.037 | 0.021 | 0.000 | 0 |
References
- National Library of Medicine. PubMed Dataset. Available online: https://pubmed.ncbi.nlm.nih.gov/help/ (accessed on 20 October 2021).
- Xenarios, I.; Rice, D.W.; Salwinski, L.; Baron, M.K.; Marcotte, E.M.; Eisenberg, D. DIP: The database of interacting proteins. Nucleic Acids Res. 2000, 28, 289–291. [Google Scholar] [CrossRef] [Green Version]
- Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; Ives, Z. Dbpedia: A nucleus for a web of open data. In The Semantic Web; Springer: Berlin, Germany, 2007; pp. 722–735. [Google Scholar]
- Wang, X.; Cui, P.; Wang, J.; Pei, J.; Zhu, W.; Yang, S. Community preserving network embedding. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
- Fortunato, S.; Barthelemy, M. Resolution limit in community detection. Proc. Natl. Acad. Sci. 2007, 104, 36–41. [Google Scholar] [CrossRef] [Green Version]
- Kang, B.; Lijffijt, J.; De Bie, T. Conditional Network Embeddings. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Tang, J.; Fong, A.C.; Wang, B.; Zhang, J. A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE Trans. Knowl. Data Eng. 2012, 24, 975–987. [Google Scholar] [CrossRef]
- Haak, L.L.; Fenner, M.; Paglione, L.; Pentz, E.; Ratner, H. ORCID: A system to uniquely identify researchers. Learn. Publ. 2012, 25, 259–264. [Google Scholar] [CrossRef] [Green Version]
- Zhang, B.; Al Hasan, M. Name Disambiguation in Anonymized Graphs Using Network Embedding. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM ’17, Singapore, 6–10 November 2017; ACM: New York, NY, USA, 2017; pp. 1239–1248. [Google Scholar]
- Parravicini, A.; Patra, R.; Bartolini, D.B.; Santambrogio, M.D. Fast and Accurate Entity Linking via Graph Embedding. In Proceedings of the 2nd Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA), GRADES-NDA’19, New York, NY, USA, 30 June–5 July 2019; ACM: New York, NY, USA, 2019; pp. 10:1–10:9. [Google Scholar]
- Shen, W.; Wang, J.; Han, J. Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Trans. Knowl. Data Eng. 2014, 27, 443–460. [Google Scholar] [CrossRef]
- Zhang, Y.; Zhang, F.; Yao, P.; Tang, J. Name Disambiguation in AMiner: Clustering, Maintenance, and Human in the Loop. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 1002–1011. [Google Scholar]
- Ma, X.; Wang, R.; Zhang, Y. Author name disambiguation in heterogeneous academic networks. In International Conference on Web Information Systems and Applications; Springer: Berlin, Germany, 2019; pp. 126–137. [Google Scholar]
- Lerchenmueller, M.J.; Sorenson, O. Author disambiguation in PubMed: Evidence on the precision and recall of author-ity among NIH-funded scientists. PLoS ONE 2016, 11, e0158731. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Ma, Y.; Wu, Y.; Lu, C. A Graph-Based Author Name Disambiguation Method and Analysis via Information Theory. Entropy 2020, 22, 416. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Saha, T.K.; Zhang, B.; Al Hasan, M. Name disambiguation from link data in a collaboration graph using temporal and topological features. Soc. Netw. Anal. Min. 2015, 5, 11. [Google Scholar] [CrossRef] [Green Version]
- Hermansson, L.; Kerola, T.; Johansson, F.; Jethava, V.; Dubhashi, D. Entity disambiguation in anonymized graphs using graph kernels. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, San Francisco, CA, USA, 27 October–1 November 2013; pp. 1037–1046. [Google Scholar]
- Xu, J.; Shen, S.; Li, D.; Fu, Y. A network-embedding based method for author disambiguation. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Turin, Italy, 22–26 October 2018; pp. 1735–1738. [Google Scholar]
- Chen, T.; Sun, Y. Task-guided and path-augmented heterogeneous network embedding for author identification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, Cambridge, UK, 6–10 February 2017; pp. 295–304. [Google Scholar]
- Cavallari, S.; Zheng, V.W.; Cai, H.; Chang, K.C.C.; Cambria, E. Learning Community Embedding with Community Detection and Node Embedding on Graphs. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM ’17, Singapore, 6–10 November 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 377–386. [Google Scholar]
- Weichselbraun, A.; Kuntschik, P.; Brasoveanu, A.M. Name variants for improving entity discovery and linking. In Proceedings of the 2nd Conference on Language, Data and Knowledge (LDK 2019), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, Leipzig, Germany, 20–23 May 2019. [Google Scholar]
- Alhelbawy, A.; Gaizauskas, R. Graph ranking for collective named entity disambiguation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Association for Computational Linguistics: Baltimore, MD, USA, 2014; Volume 2, pp. 75–80. [Google Scholar]
- Guo, Y.; Che, W.; Liu, T.; Li, S. A graph-based method for entity linking. In Proceedings of the 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand, 7–13 November 2011; pp. 1010–1018. [Google Scholar]
- Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; Mei, Q. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 18–22 May 2015; pp. 1067–1077. [Google Scholar]
- Nesterov, Y. Semidefinite relaxation and nonconvex quadratic optimization. Optim. Methods Softw. 1998, 9, 141–160. [Google Scholar] [CrossRef]
- Luo, Z.Q.; Ma, W.K.; So, A.M.C.; Ye, Y.; Zhang, S. Semidefinite relaxation of quadratic optimization problems. IEEE Signal Process. Mag. 2010, 27, 20–34. [Google Scholar] [CrossRef]
- Leone, F.; Nelson, L.; Nottingham, R. The folded normal distribution. Technometrics 1961, 3, 543–550. [Google Scholar] [CrossRef]
- Van Leeuwen, M.; De Bie, T.; Spyropoulou, E.; Mesnage, C. Subjective interestingness of subgraph patterns. Mach. Learn. 2016, 105, 41–75. [Google Scholar] [CrossRef] [Green Version]
- Leskovec, J.; Krevl, A. SNAP Datasets: Stanford Large Network Dataset Collection. 2014. Available online: http://snap.stanford.edu/data (accessed on 20 October 2021).
- Goethals, B.; Le Page, W.; Mampaey, M. Mining interesting sets and rules in relational databases. In Proceedings of the 2010 ACM Symposium on Applied Computing, Sierre, Switzerland, 22–26 March 2010; pp. 997–1001. [Google Scholar]
- Breitkreutz, B.J.; Stark, C.; Reguly, T.; Boucher, L.; Breitkreutz, A.; Livstone, M.; Oughtred, R.; Lackner, D.H.; Bähler, J.; Wood, V.; et al. The BioGRID interaction database: 2008 update. Nucleic Acids Res. 2007, 36, D637–D640. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Knuth, D.E. The Stanford GraphBase: A Platform for Combinatorial Computing; AcM Press: New York, NY, USA, 1993. [Google Scholar]
- Newman, M.E. The structure of scientific collaboration networks. Proc. Natl. Acad. Sci. USA 2001, 98, 404–409. [Google Scholar] [CrossRef] [PubMed]
- Järvelin, K.; Kekäläinen, J. Cumulated Gain-Based Evaluation of IR Techniques. ACM Trans. Inf. Syst. 2002, 20, 422–446. [Google Scholar] [CrossRef]
Datasets | Description |
---|---|
FB-SC | Facebook Social Circles network [29] Consists of anonymized friends list from Facebook. |
FB-PP | Page-Page graph of verified Facebook pages [29]. Nodes represent official Facebook pages while the links are mutual likes between pages. |
Anonymized network generated using email data from a large European research institution modeling the incoming and outgoing email exchange between its members [29]. | |
STD | A database network of the Computer Science department of the University of Antwerp that represent the connections between students, professors and courses [30]. |
PPI | A subnetwork of the BioGRID Interaction Database [31], that uses PPI network for Homo Sapiens. |
lesmis | A network depicting the coappearance of characters in the novel Les Miserables [32]. |
netscience | A coauthorship network of scientists working on network theory and experiments [29]. |
polbooks | Network of books about US politics, with edges between books representing frequent copurchasing of books by the same buyers. http://www-personal.umich.edu/~mejn/netdata/, accessed on 20 October 2021 |
CondMat | Collaboration network of Arxiv Condensed Matter Physics [33] |
GrQc | Collaboration network of Arxiv General Relativity [33] |
HepTh | Collaboration network of Arxiv Theoretical High Energy Physics [33] |
CM03 | Collaboration network of Arxiv Condensed Matter till 2003 [33] |
CM05 | Collaboration network of Arxiv Condensed Matter till 2005 [33] |
PubMed | Collaboration network extracted from the PubMed database (analyzed in Appendix A), containing 2122 nodes with ground truth of 31 ambiguous nodes (6 of which maps to more than 2 entities) and 1 duplicate node [1] |
fb-sc | fb-pp | lesmis | polbooks | STD | |||
# Nodes | 4039 | 22,470 | 986 | 77 | 105 | 395 | |
# Edges | 88,234 | 170,823 | 16,064 | 254 | 441 | 3,423 | |
Avg degree | 43.7 | 15.2 | 32.6 | 6.6 | 8.4 | 17.3 | |
Density | 1.1 × 102 | 6.8 × 104 | 3.3 × 102 | 8.7 × 102 | 8.1 × 102 | 4.4 × 102 | |
ppi | netscience | GrQc | CondMat | HepTh | CM05 | CM03 | |
# Nodes | 3852 | 379 | 4158 | 21,363 | 8638 | 36,458 | 27,519 |
# Edges | 37,841 | 914 | 13,422 | 91,286 | 24,806 | 171,735 | 116,181 |
Avg degree | 19.6 | 4.8 | 6.5 | 8.5 | 5.7 | 9.4 | 8.4 |
Density | 5.1 × 103 | 1.3 × 102 | 1.6 × 103 | 4.0 × 104 | 6.6 × 104 | 2.6 × 104 | 3.1 × 104 |
Ambiguity Rate | 10% | 1% | 0.1% | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | FONDUE-NDA | NC | CC | Degree | FONDUE-NDA | NC | CC | Degree | FONDUE-NDA | NC | CC | Degree | |
Randomly Uniform Contraction | fb-sc | 0.954 | 0.962 | 0.768 | 0.776 | 0.767 | 0.875 | 0.569 | 0.423 | 0.679 | 0.535 | 0.272 | 0.199 |
fb-pp | 0.899 | 0.825 | 0.821 | 0.804 | 0.649 | 0.532 | 0.528 | 0.511 | 0.374 | 0.268 | 0.268 | 0.253 | |
0.783 | 0.661 | 0.619 | 0.704 | 0.529 | 0.305 | 0.264 | 0.310 | − | − | − | − | ||
student | 0.778 | 0.664 | 0.568 | 0.652 | 0.396 | 0.328 | 0.235 | 0.257 | − | − | − | − | |
lesmis | 0.906 | 0.570 | 0.499 | 0.622 | − | − | − | − | − | − | − | − | |
polbooks | 0.972 | 0.604 | 0.534 | 0.698 | 1.000 | 0.310 | 0.267 | 0.318 | − | − | − | − | |
ppi | 0.759 | 0.670 | 0.724 | 0.741 | 0.420 | 0.353 | 0.381 | 0.387 | 0.194 | 0.138 | 0.147 | 0.151 | |
netscience | 0.886 | 0.784 | 0.731 | 0.721 | 0.508 | 0.378 | 0.323 | 0.288 | − | − | − | − | |
GrQc | 0.857 | 0.805 | 0.796 | 0.768 | 0.603 | 0.447 | 0.437 | 0.415 | 0.249 | 0.195 | 0.184 | 0.168 | |
CondMat | 0.864 | 0.855 | 0.843 | 0.816 | 0.601 | 0.553 | 0.543 | 0.520 | 0.367 | 0.278 | 0.269 | 0.255 | |
HepTh | 0.860 | 0.798 | 0.823 | 0.796 | 0.582 | 0.466 | 0.494 | 0.470 | 0.325 | 0.201 | 0.224 | 0.208 | |
cm05 | 0.884 | 0.873 | 0.859 | 0.827 | 0.627 | 0.590 | 0.582 | 0.545 | 0.471 | 0.312 | 0.307 | 0.288 | |
cm03 | 0.888 | 0.869 | 0.852 | 0.823 | 0.635 | 0.577 | 0.562 | 0.534 | 0.335 | 0.297 | 0.281 | 0.272 | |
no common neighbors | fb-sc | 0.953 | 0.989 | 0.768 | 0.764 | 0.730 | 0.933 | 0.591 | 0.418 | 0.399 | 0.665 | 0.321 | 0.172 |
fb-pp | 0.895 | 0.826 | 0.820 | 0.801 | 0.650 | 0.532 | 0.529 | 0.510 | 0.389 | 0.266 | 0.267 | 0.253 | |
0.676 | 0.696 | 0.625 | 0.604 | 0.303 | 0.319 | 0.288 | 0.256 | − | − | − | − | ||
student | 0.659 | 0.726 | 0.531 | 0.587 | 0.368 | 0.447 | 0.201 | 0.229 | − | − | − | − | |
lesmis | 0.755 | 0.591 | 0.498 | 0.486 | − | − | − | − | − | − | − | − | |
polbooks | 0.981 | 0.620 | 0.544 | 0.696 | 1.000 | 0.268 | 0.420 | 0.363 | − | − | − | − | |
ppi | 0.725 | 0.673 | 0.721 | 0.700 | 0.398 | 0.352 | 0.381 | 0.373 | 0.166 | 0.139 | 0.147 | 0.144 | |
netscience | 0.877 | 0.797 | 0.714 | 0.705 | 0.622 | 0.372 | 0.304 | 0.290 | − | − | − | − | |
GrQc | 0.861 | 0.806 | 0.794 | 0.766 | 0.580 | 0.445 | 0.435 | 0.416 | 0.280 | 0.197 | 0.183 | 0.173 | |
CondMat | 0.863 | 0.855 | 0.843 | 0.815 | 0.585 | 0.554 | 0.542 | 0.516 | 0.317 | 0.274 | 0.273 | 0.257 | |
HepTh | 0.856 | 0.798 | 0.824 | 0.796 | 0.581 | 0.467 | 0.494 | 0.480 | 0.285 | 0.204 | 0.212 | 0.213 | |
cm05 | 0.883 | 0.874 | 0.858 | 0.825 | 0.633 | 0.591 | 0.582 | 0.543 | 0.414 | 0.310 | 0.312 | 0.289 | |
cm03 | 0.884 | 0.869 | 0.853 | 0.822 | 0.651 | 0.577 | 0.561 | 0.533 | 0.439 | 0.295 | 0.279 | 0.271 |
FONDUE-NDA | NC [16] | |
---|---|---|
AUC | 0.971 | 0.944 |
TP | 17 | 11 |
FP | 14 | 22 |
F1-score | 0.55 | 0.35 |
NDCG | 0.877 | 0.681 |
NDCG* | 0.743 | 0.38 |
NC Rank | Ambiguous | FONDUE-NDA Rank | Ambiguous |
---|---|---|---|
Jie_Li | 0 | Tao_Wang * | 1 |
Lei_Liu | 0 | Wei_Zhang | 1 |
Jiawei_Wang | 0 | Jing_Li | 1 |
Jun_Liu | 0 | Bin_Zhang * | 1 |
Ni_Wang | 0 | Yan_Li | 1 |
Xin_Liu | 0 | Rui_Li | 0 |
Yao_Chen | 0 | Ying_Liu | 1 |
Huanhuan_Liu | 0 | Feng_Li | 1 |
Jun_Yan | 0 | Yang_Yang * | 1 |
Lei_Chen | 0 | Ying_Sun | 1 |
Ambiguity Rate | 10% | 1% | 0.1% | ||||
---|---|---|---|---|---|---|---|
Method | FONDUE-NDA | MCL | FONDUE-NDA | MCL | FONDUE-NDA | MCL | |
Uniformly Random Contraction | fb-sc | 0.582 | 0.781 | 0.781 | 0.829 | 0.947 | 0.936 |
fb-pp | 0.405 | 0.583 | 0.504 | 0.583 | 0.610 | 0.600 | |
0.122 | 0.248 | 0.214 | 0.276 | − | − | ||
student | 0.215 | 0.124 | 0.131 | 0.183 | − | − | |
lesmis | 0.597 | 0.304 | − | − | − | − | |
polbooks | 0.391 | 0.338 | − | − | − | − | |
ppi | 0.033 | 0.120 | 0.055 | 0.115 | 0.338 | 0.321 | |
netscience | 0.610 | 0.777 | 0.642 | 0.809 | − | − | |
GrQc | 0.498 | 0.665 | 0.661 | 0.682 | 0.762 | 0.784 | |
CondMat | 0.471 | 0.709 | 0.582 | 0.713 | 0.574 | 0.718 | |
HepTh | 0.447 | 0.556 | 0.521 | 0.552 | 0.465 | 0.542 | |
cm05 | 0.471 | 0.722 | 0.674 | 0.732 | 0.488 | 0.741 | |
cm03 | 0.479 | 0.724 | 0.575 | 0.730 | 0.513 | 0.690 | |
No Common Neighbors | fb-sc | 0.609 | 0.952 | 0.887 | 0.985 | 0.862 | 0.998 |
fb-pp | 0.409 | 0.586 | 0.509 | 0.564 | 0.546 | 0.637 | |
0.136 | 0.315 | 0.278 | 0.413 | − | − | ||
student | 0.197 | 0.139 | 0.079 | 0.078 | − | − | |
lesmis | 0.405 | 0.289 | − | − | − | − | |
polbooks | 0.499 | 0.577 | 0.638 | 0.655 | − | − | |
ppi | 0.036 | 0.139 | 0.067 | 0.151 | 0.331 | 0.399 | |
netscience | 0.725 | 0.819 | 0.722 | 0.897 | − | − | |
GrQc | 0.534 | 0.678 | 0.665 | 0.683 | 0.895 | 0.872 | |
CondMat | 0.470 | 0.708 | 0.548 | 0.712 | 0.475 | 0.686 | |
HepTh | 0.469 | 0.571 | 0.529 | 0.561 | 0.404 | 0.455 | |
cm05 | 0.466 | 0.726 | 0.549 | 0.777 | 0.450 | 0.723 | |
cm03 | 0.482 | 0.722 | 0.570 | 0.727 | 0.567 | 0.697 |
10% | 1% | 0.1% | |||||||
---|---|---|---|---|---|---|---|---|---|
FONDUE-NDA | NC | CC | FONDUE-NDA | NC | CC | FONDUE-NDA | NC | CC | |
fb-sc | 10.99 | 94.72 | 52.92 | 5.16 | 90.43 | 51.62 | 7.66 | 90.99 | 51.63 |
fb-pp | 15.03 | 161.60 | 46.83 | 18.02 | 185.64 | 46.48 | 29.65 | 190.25 | 46.06 |
0.81 | 10.18 | 4.77 | 0.84 | 10.26 | 4.62 | − | − | − | |
student | 0.35 | 2.88 | 0.69 | 0.37 | 4.45 | 0.59 | − | − | − |
lesmis | 0.14 | 0.21 | 0.03 | − | − | − | − | − | − |
polbooks | 0.04 | 0.34 | 0.05 | 0.05 | 0.36 | 0.05 | − | − | − |
ppi | 4.09 | 49.23 | 15.64 | 4.17 | 50.89 | 15.03 | 3.53 | 50.73 | 14.95 |
netscience | 0.12 | 0.89 | 0.08 | 0.13 | 0.92 | 0.09 | 0.13 | 0.91 | 0.09 |
GrQc | 1.53 | 13.07 | 2.06 | 1.66 | 13.36 | 2.07 | 1.69 | 13.29 | 2.04 |
CondMat | 11.41 | 82.63 | 13.15 | 12.36 | 86.42 | 13.11 | 13.76 | 88.26 | 12.75 |
HepTh | 3.47 | 28.04 | 2.85 | 3.65 | 27.22 | 2.81 | 4.00 | 27.68 | 2.77 |
cm05 | 23.04 | 143.23 | 30.02 | 24.69 | 144.98 | 25.12 | 26.01 | 137.92 | 25.51 |
cm03 | 16.10 | 116.26 | 17.82 | 20.33 | 137.57 | 17.12 | 18.96 | 124.25 | 17.32 |
Edge Distribution | Minimum Degree | Edge Overlap | Lemis | Polbooks | Netscience | |||
---|---|---|---|---|---|---|---|---|
FONDUE-NDD | ED | FONDUE-NDD | ED | FONDUE-NDD | ED | |||
Balanced | None | 0% | 18.775 | 18.2 | 30.025 | 17.85 | 6.975 | 4.2 |
20% | 15.55 | 8.75 | 22.475 | 10.65 | 3.9 | 2.825 | ||
30% | 14.125 | 9.35 | 20.4 | 8.5 | 3.225 | 1.775 | ||
Graph Average | 0% | 10 | 15.806 | 17.676 | 20.941 | 5.325 | 5.7 | |
20% | 10.167 | 11.083 | 11.471 | 12.941 | 2.775 | 3.025 | ||
30% | 8.611 | 9.333 | 10.794 | 11.176 | 2.725 | 2.625 | ||
2x Graph Average | 0% | 3.857 | 24.857 | 5.727 | 23.364 | 3.471 | 5.029 | |
20% | 5 | 17.429 | 3.818 | 15.909 | 1.735 | 3.412 | ||
30% | 2.857 | 13.429 | 2.545 | 14.091 | 1.735 | 3.206 | ||
Unbalanced | None | 0% | 25.9 | 22.75 | 36.425 | 26.325 | 6.9 | 2.85 |
20% | 16.75 | 10.75 | 25.875 | 10 | 3.75 | 2.55 | ||
30% | 16.3 | 10.525 | 22.875 | 12.075 | 3.65 | 2.775 | ||
Graph Average | 0% | 13.5 | 17.306 | 27 | 23.176 | 5.125 | 5.075 | |
20% | 12.417 | 13.278 | 13.029 | 11.706 | 3.1 | 3.15 | ||
30% | 12.944 | 12.167 | 14.265 | 12.029 | 3.025 | 2.675 | ||
2x Graph Average | 0% | 18.143 | 40.143 | 11.545 | 22.545 | 5.735 | 8.147 | |
20% | 9.429 | 18.429 | 8.364 | 16.182 | 2.118 | 3.5 | ||
30% | 5.714 | 19.286 | 7 | 12.182 | 1.735 | 2.794 |
Dataset | Execution Time (Seconds) |
---|---|
lesmis | 609.34 |
polbooks | 712.29 |
netscience | 1198.63 |
pubmed | 3474.82 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Mel, A.; Kang, B.; Lijffijt, J.; De Bie, T. FONDUE: A Framework for Node Disambiguation and Deduplication Using Network Embeddings. Appl. Sci. 2021, 11, 9884. https://doi.org/10.3390/app11219884
Mel A, Kang B, Lijffijt J, De Bie T. FONDUE: A Framework for Node Disambiguation and Deduplication Using Network Embeddings. Applied Sciences. 2021; 11(21):9884. https://doi.org/10.3390/app11219884
Chicago/Turabian StyleMel, Ahmad, Bo Kang, Jefrey Lijffijt, and Tijl De Bie. 2021. "FONDUE: A Framework for Node Disambiguation and Deduplication Using Network Embeddings" Applied Sciences 11, no. 21: 9884. https://doi.org/10.3390/app11219884
APA StyleMel, A., Kang, B., Lijffijt, J., & De Bie, T. (2021). FONDUE: A Framework for Node Disambiguation and Deduplication Using Network Embeddings. Applied Sciences, 11(21), 9884. https://doi.org/10.3390/app11219884