Article

FONDUE: A Framework for Node Disambiguation and Deduplication Using Network Embeddings †

AIDA, IDLab-ELIS, Ghent University, 9052 Ghent, Belgium
* Author to whom correspondence should be addressed.
† This paper is an extended version of our paper published in IEEE DSAA 2020, the 7th IEEE International Conference on Data Science and Advanced Analytics.
Appl. Sci. 2021, 11(21), 9884; https://doi.org/10.3390/app11219884
Submission received: 2 August 2021 / Revised: 13 October 2021 / Accepted: 18 October 2021 / Published: 22 October 2021
(This article belongs to the Special Issue Social Network Analysis)

Featured Application

FONDUE can be used to preprocess graph-structured data; in particular, it facilitates detecting nodes in the graph that represent the same real-life entity, and detecting and optimally splitting nodes that represent multiple distinct real-life entities. FONDUE does this in an entirely unsupervised fashion, relying exclusively on the topology of the network.

Abstract

Data often have a relational nature that is most easily expressed in a network form, with its main components consisting of nodes that represent real objects and links that signify the relations between these objects. Modeling networks is useful for many purposes, but the efficacy of downstream tasks is often hampered by data quality issues related to their construction. In many constructed networks, ambiguity may arise when a node corresponds to multiple concepts. Similarly, a single entity can be mistakenly represented by several different nodes. In this paper, we formalize both the node disambiguation (NDA) and node deduplication (NDD) tasks to resolve these data quality issues. We then introduce FONDUE, a framework for utilizing network embedding methods for data-driven disambiguation and deduplication of nodes. Given an undirected and unweighted network, FONDUE-NDA identifies nodes that appear to correspond to multiple entities for subsequent splitting and suggests how to split them (node disambiguation), whereas FONDUE-NDD identifies nodes that appear to correspond to the same entity for merging (node deduplication), using only the network topology. From controlled experiments on benchmark networks, we find that FONDUE-NDA is substantially and consistently more accurate with lower computational cost in identifying ambiguous nodes, and that FONDUE-NDD is a competitive alternative for node deduplication, when compared to state-of-the-art alternatives.

1. Introduction

Increasingly, collected data naturally comes in the form of a network of interrelated entities. Examples include social networks describing social relations between people (e.g., Facebook), citation networks describing the citation relations between papers (e.g., PubMed [1]), biological networks, such as those describing interactions between proteins (e.g., DIP [2]), and knowledge graphs describing relations between concepts or objects (e.g., DBPedia [3]). Thus, new machine learning, data mining, and information retrieval methods are increasingly targeting data in their native network representation.
An important problem across all the fields of data science, broadly speaking, is data quality. For problems on networks, especially those that are successful in exploiting fine- as well as coarse-grained structure of networks, ensuring good data quality is perhaps even more important than in standard tabular data. For example, an incorrect edge can have a dramatic effect on the implicit representation of other nodes, by dramatically changing distances on the network. Similarly, mistakenly representing distinct real-life entities by the same node in the network may dramatically alter its structural properties, by increasing the degree of the node and by merging the possibly quite distinct neighborhoods of these entities into one. Conversely, representing the same real-life entity by multiple nodes can also negatively affect the topology of the graph, possibly even splitting apart communities.
Although identifying missing edges and, conversely, identifying incorrect edges, can be tackled adequately using link prediction methods, prior work has neglected the other task: identifying and appropriately splitting nodes that are ambiguous—i.e., nodes that correspond to more than one real-life entity. We will refer to this task as node disambiguation (NDA). A converse and equally important problem is the problem of identifying multiple nodes corresponding to the same real-life entity, a problem we will refer to as node deduplication (NDD).
This paper proposes a unified and principled framework to both NDA and NDD problems, called framework for node disambiguation and deduplication using network embeddings (FONDUE). FONDUE is inspired by the empirical observation that real (natural) networks tend to be easier to embed than artificially generated (unnatural) networks, and rests on the associated hypothesis that the existence of ambiguous or duplicate nodes makes a network less natural.
Although most of the existing methods tackling NDA and NDD make use of additional information (e.g., node attributes, descriptions, or labels) for identifying and processing these problematic nodes, FONDUE adopts a more widely applicable approach that relies solely on topological information. Although exploiting additional information may of course increase the accuracy on those tasks, we argue that a method that does not require such information offers unique advantages, e.g., when data availability is scarce, or when building an extensive dataset on top of the graph data is not feasible for practical reasons. Additionally, this approach fits the privacy-by-design framework, as it eliminates the need to incorporate more sensitive data. Finally, we argue that, even in cases where such additional information is available, it is both of scientific and of practical interest to explore how much can be accomplished without using it, relying solely on the network topology. Indeed, although this is beyond the scope of the current paper, it is clear that methods that rely solely on network topology could be combined with methods that exploit additional node-level information, plausibly leading to improved performance over either type of approach individually.

1.1. The Node Disambiguation Problem

We address the problem of NDA in the most basic setting: given a network, unweighted, unlabeled, and undirected, the task considered is to identify nodes that correspond to multiple distinct real-life entities. We formulate this as an inverse problem, where we use the given ambiguous network (which contains ambiguous nodes) in order to retrieve the unambiguous network (in which all nodes are unambiguous). Clearly, this inverse problem is ill-posed, making it impossible to solve without additional information (which we do not want to assume) or an inductive bias.
The key insight in this paper is that such an inductive bias can be provided by the network embedding (NE) literature. This literature has produced embedding-based models that are capable of accurately modeling the connectivity of real-life networks down to the node-level, while being unable to accurately model random networks [4,5]. Inspired by this research, we propose to use as an inductive bias the fact that the unambiguous network must be easy to model using a NE. Thus, we introduce FONDUE-NDA, a method that identifies nodes as ambiguous if, after splitting, they maximally improve the quality of the resulting NE.
Example 1.
Figure 1a illustrates the idea of FONDUE for NDA applied to a single node. In this example, node $i$ with embedding $\mathbf{x}_i$ corresponds to two real-life entities that belong to two separate communities, visualized by either full or dashed lines to highlight the distinction. Because node $i$ is connected to two different communities, most NE methods would locate its embedding $\mathbf{x}_i$ between the embeddings of the nodes from both communities. Figure 1b shows a split of node $i$ into nodes $i'$ and $i''$, each with connections to only one of the two communities. The resulting network is easy to embed by most NE methods, with embeddings $\mathbf{x}_{i'}$ and $\mathbf{x}_{i''}$ close to their own respective communities. In contrast, Figure 1c shows a split where the two resulting nodes are harder to embed. Most NE methods would embed them between both communities, but substantial tension would remain, resulting in a worse value of the NE objective function.

1.2. The Node Deduplication Problem

The same inductive bias can also be used for the NDD problem. The NDD problem is, given an unweighted, unlabeled, and undirected network, to identify distinct nodes that correspond to the same real-life entity. To this end, FONDUE-NDD determines how well merging two given nodes into one would improve the embedding quality of NE models. The inductive bias considers one merge as better than another if it results in a better value of the NE objective function.
The diagram in Figure 2 shows the suggested pipeline for tackling both problems.

1.3. Contributions

In this paper, we make a number of related contributions:
  • We propose FONDUE, a framework exploiting the empirical observation that naturally occurring networks can be embedded well using state-of-the-art NE methods, to tackle two distinct tasks: node deduplication (FONDUE-NDD) and node disambiguation (FONDUE-NDA). The former, by identifying nodes as more likely to be duplicated if contracting them enhances the quality of an optimal NE. The latter, by identifying nodes as more likely to be ambiguous if splitting them enhances the quality of an optimal NE;
  • In addition to this conceptual contribution, substantial challenges had to be overcome to implement this idea in a scalable manner. Specifically for the NDA problem, through a first-order analysis we derive a fast approximation of the expected NE quality improvement after splitting a node;
  • We implemented this idea for CNE [6], a recent state-of-the-art NE method, although we demonstrate that the approach can be applied for a broad class of other NE methods as well;
  • We tackle the NDA problem with extensive experiments over a wide range of networks, demonstrating the superiority of FONDUE over the state-of-the-art for the identification of ambiguous nodes, at a comparable computational cost;
  • We also empirically observe that, somewhat surprisingly, despite the increase in accuracy for identifying ambiguous nodes, no such improvement was observed for the ambiguous node splitting accuracy. Thus, for NDA, we recommend using FONDUE for the identification of ambiguous nodes, while using an existing state-of-the-art approach for optimally splitting them;
  • Experiments on four datasets for NDD demonstrate the viability of FONDUE-NDD for the NDD problem based on only the topological features of a network.

2. Related Work

  The problem of NDA differs from named-entity disambiguation (NED; also known as named-entity linking), a natural language processing (NLP) task whose purpose is to identify which real-life entity from a given list a named entity in a text refers to. For example, in the ArnetMiner dataset [7], ‘Bin Zhu’ corresponds to more than 10 authors. The Open Researcher and Contributor ID (ORCID) [8] was introduced to solve the author name ambiguity problem, and most NED methods rely on ORCID for labeling datasets.
NED in this context aims to match the author names to unique (unambiguous) author identifiers [7,9,10,11].
The authors of [7] exploit hidden Markov random fields in a unified probabilistic framework to model node and edge features. Zhang et al. [12], on the other hand, designed a comprehensive framework for name disambiguation using a complex feature-engineering approach: they construct paper networks and use the information shared between two papers to build a supervised model that assigns the weights of the edges of the paper network, so that two connected paper nodes are more likely to be authored by the same person.
Recent approaches increasingly rely on more complex data. Ma et al. [13] used representation learning on heterogeneous bibliographic networks, employing relational and paper-related textual features to obtain embeddings for multiple types of nodes, while using meta-path-based proximity measures to evaluate the neighborhood and structural similarities of node embeddings in the heterogeneous graphs.
The work of Zhang et al. [9], which focuses on preserving privacy by using solely the link information in a graph, employs network embedding as an intermediate step to perform NED, but relies on other networks (person–document and document–document) in addition to the person–person network to perform the task.
Although NDA could be used to assist in NED tasks, NED typically strongly relies on the text, e.g., by characterizing the context in which the named entity occurs (e.g., paper topic) [14]. Similarly, Ma et al. [15] propose a name disambiguation model based on representation learning that employs attributes and network connections: the attributes of each paper are first encoded using a variational graph auto-encoder, a similarity metric is computed from the relationships between these attributes, and graph embedding is then used to leverage the author relationships, again relying heavily on NLP.
In NDA, in contrast, no natural language is considered, and the goal is to rely on just the network’s connectivity in order to identify which nodes may correspond to multiple distinct entities. Moreover, NDA does not assume the availability of a list of known unambiguous entity identifiers, such that an important part of the challenge is to identify which nodes are ambiguous in the first place. This offers a privacy-friendly advantage and extends the applicability to datasets where access to additional information is restricted or impossible.
The research by Saha et al. [16], and Hermansson et al. [17] is most closely related to ours. These papers also only use topological information of the network for NDA. Yet, Ref. [16] also require timestamps for the edges, while [17] require a training set of nodes labeled as ambiguous and non-ambiguous. Moreover, even though the method proposed by [16] is reportedly orders of magnitude faster than the one proposed by [17], it remains computationally substantially more demanding than FONDUE (e.g., [16] evaluate their method on networks with just 150 entities). Other recent work using NE for NED [9,18,19,20] is only related indirectly as they rely on additional information besides the topology of the network.
The literature on NDD is scarce, as the problem is not well defined. Conceptually, it is similar to the named entity linking (NEL) problem [11,21], which aims to link instances of named entities in a text, such as newspaper articles, to the corresponding entities, often in knowledge bases (KB). Consequently, NEL relies heavily on textual data to identify erroneous entities rather than on entity connections, which are the core of our method. KB approaches for NEL are dominant in the field [22,23], as they make use of knowledge base datasets, relying heavily on labeled and additional graph data to tackle the named entity linking task. This also poses a challenge when it comes to benchmarking our method for NDD. We identified no studies in the current literature that tackle NDD from a purely topological perspective, i.e., without reliance on additional attributes and features.

3. Methods

Section 3.1 formally defines the NDA and NDD problems. Section 3.2 introduces the FONDUE framework in a maximally generic manner, independent of the specific NE method it is applied to, or the task (NDA or NDD) it is used for. A scalable approximation of FONDUE-NDA is described throughout Section 3.3, and applied to CNE as a specific NE method. Section 3.4 details the FONDUE-NDD method used for NDD.
Throughout this paper, a bold uppercase letter denotes a matrix (e.g., $\mathbf{A}$), a bold lowercase letter denotes a column vector (e.g., $\mathbf{x}_i$), $(\cdot)^\top$ denotes the matrix transpose (e.g., $\mathbf{A}^\top$), and $\|\cdot\|$ denotes the Frobenius norm of a matrix (e.g., $\|\mathbf{A}\|$).

3.1. Problem Definition

We denote an undirected, unweighted, unlabeled graph as $G = (V, E)$, with $V = \{1, 2, \ldots, n\}$ the set of $n$ nodes (or vertices), and $E \subseteq \binom{V}{2}$ the set of edges (or links) between these nodes. We also define the adjacency matrix of a graph $G$, denoted $\mathbf{A} \in \{0,1\}^{n \times n}$, with $A_{ij} = 1$ if $\{i,j\} \in E$. We denote $\mathbf{a}_i \in \{0,1\}^n$ as the adjacency vector for node $i$, i.e., the $i$th column of the adjacency matrix $\mathbf{A}$, and $\Gamma(i) = \{j \mid \{i,j\} \in E\}$ the set of neighbors of $i$.
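As a concrete illustration of this notation, a minimal sketch (Python/NumPy; 0-indexed nodes, function and variable names are ours):

```python
import numpy as np

def build_graph(n, edges):
    """Build the adjacency matrix A and the neighbor sets Gamma(i) of an
    undirected, unweighted graph with nodes 0..n-1 (0-indexed here,
    unlike the 1-indexed notation in the text)."""
    A = np.zeros((n, n), dtype=int)
    gamma = {i: set() for i in range(n)}
    for i, j in edges:
        A[i, j] = A[j, i] = 1          # undirected: symmetric adjacency
        gamma[i].add(j)
        gamma[j].add(i)
    return A, gamma

A, gamma = build_graph(4, [(0, 1), (1, 2), (2, 3)])
a_1 = A[:, 1]                          # adjacency vector of node 1
```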

3.1.1. Formalizing the Node Disambiguation Problem

To formally define the NDA problem as an inverse problem, we first need to define the forward problem which maps an unambiguous graph onto an ambiguous one. This formalizes the ‘corruption’ process that creates ambiguity in the graph. In practice, this happens most often because identifiers of the entities represented by the nodes are not unique. For example, in a co-authorship network, the identifiers could be non-unique author names. To this end, we define a node contraction:
Definition 1.
(Node Contraction). A node contraction $c$ for a graph $G = (V, E)$ with $V = \{1, 2, \ldots, n\}$ is a surjective function $c: V \to \hat{V}$ for some set $\hat{V} = \{1, 2, \ldots, \hat{n}\}$ with $\hat{n} \le n$. For convenience, we will define $c^{-1}: \hat{V} \to 2^V$ as $c^{-1}(i) = \{k \in V \mid c(k) = i\}$ for any $i \in \hat{V}$. Moreover, we will refer to the cardinality $|c^{-1}(i)|$ as the multiplicity of the node $i \in \hat{V}$.
A node contraction defines an equivalence relation $\equiv_c$ over the set of nodes: $i \equiv_c j$ if $c(i) = c(j)$, and the set $\hat{V}$ is the quotient set $V / \equiv_c$. Continuing our example of a co-authorship network, a node contraction maps an author onto the node representing their name. Two authors $i$ and $j$ would be equivalent if their names $c(i)$ and $c(j)$ are equal, and the multiplicity of a node is the number of distinct authors with the corresponding name.
We can naturally define the concept of an ambiguous graph in terms of the contraction operation, as follows.
Definition 2 (Ambiguous graph).
Given a graph $G = (V, E)$ and a node contraction $c$ for that graph, the graph $\hat{G} = (\hat{V}, \hat{E})$ defined by $\hat{E} = \{\{c(k), c(l)\} \mid \{k,l\} \in E\}$ is referred to as an ambiguous graph of $G$. Overloading notation, we may write $\hat{G} = c(G)$. To contrast $G$ with $\hat{G}$, we may refer to $G$ as the unambiguous graph.
Continuing the example of the co-authorship network, the contraction operation can be thought of as the operation that replaces author identities with their names, which may map distinct authors onto a same shared name. Note that the symbols for the ambiguous graph and its set of nodes and edges are denoted here using hats, to indicate that in the NDA problem we are interested in situations where the ambiguous graph is the empirically observed graph.
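As an illustration of Definitions 1 and 2, the following sketch (plain Python, with hypothetical author names) applies a contraction $c$ to an unambiguous edge list and produces the edge set of the ambiguous graph $c(G)$:

```python
def contract(edges, c):
    """Apply a node contraction c (here a dict from V to V_hat) to an
    edge list E, returning the edge set E_hat of the ambiguous graph c(G)."""
    return {frozenset((c[k], c[l])) for k, l in edges}

# Two distinct authors (nodes 1 and 3) share the name 'J. Smith'.
c = {0: 'A. Jones', 1: 'J. Smith', 2: 'B. Liu', 3: 'J. Smith', 4: 'C. Diaz'}
E = [(0, 1), (1, 2), (3, 4)]
E_hat = contract(E, c)
# The two separate neighborhoods of the two 'J. Smith' authors are now
# lumped into the single ambiguous node 'J. Smith'.
```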
We can now formally define the NDA problem as inverting this contraction operation:
Definition 3 (The Node Disambiguation Problem).
Given an ambiguous graph $\hat{G} = (\hat{V}, \hat{E})$, NDA aims to retrieve the unambiguous graph $G = (V, E)$ and associated node contraction $c$, i.e., a contraction $c$ for which $c(G) = \hat{G}$.
To be more precise, it suffices to identify G up to an isomorphism, as the actual identifiers of the nodes are irrelevant.

3.1.2. Formalizing the Node Deduplication Problem

The NDD problem can be formalized as the converse of the NDA problem, also relying on the concept of node contractions. First, a duplicate graph can be defined as follows:
Definition 4 (Duplicate graph).
Given a graph $G = (V, E)$, a graph $\hat{G} = (\hat{V}, \hat{E})$ where $\{k,l\} \in \hat{E} \Rightarrow \{c(k), c(l)\} \in E$ for an appropriate contraction $c$, and where for each $\{i,j\} \in E$ there exists an edge $\{k,l\} \in \hat{E}$ for which $c(k) = i$ and $c(l) = j$, is referred to as a duplicate graph of $G$. More concisely, using the overloaded notation from Definition 2, a duplicate graph $\hat{G}$ is a graph for which $c(\hat{G}) = G$. To contrast $G$ with $\hat{G}$, we may refer to $G$ as the deduplicated graph.
Continuing the example of the co-authorship network, one node in the duplicate graph could correspond to two versions of the name of the same author, such that they are assigned two different nodes in the duplicate graph. A contraction operation that maps duplicate names to their common identity would merge such nodes corresponding to the same author. Hats on top of the symbols of the duplicate graph indicate that in the NDD problem we are interested in the situation where the duplicate graph is the empirically observed one.
The NDD problem can, thus, be formally defined as follows:
Definition 5 (The Node Deduplication Problem).
Given a duplicate graph $\hat{G} = (\hat{V}, \hat{E})$, NDD aims to retrieve the deduplicated graph $G = (V, E)$ and the node contraction $c$ associated with $\hat{G}$, i.e., for which $G = c(\hat{G})$.

3.1.3. Real Graphs Suffer from Both Issues

Of course, many real graphs require both deduplication and disambiguation. This is particularly true for the running example of the co-authorship network. Yet, while building on the common FONDUE framework, we define and study both problems separately, and propose an algorithm for each in Section 3.3 (for NDA) and Section 3.4 (for NDD). For networks suffering from both problems, both algorithms can be applied concurrently or sequentially without difficulties, thus solving both problems simultaneously.

3.2. FONDUE as a Generic Approach

To address both the NDA and NDD problems, FONDUE uses an inductive bias that the non-corrupted (unambiguous and deduplicated) network must be easy to model using NE. This allows us to approach both problems in the context of NE. Here we first formalize the inductive bias of FONDUE (Section 3.2). This will later allow us to present both the FONDUE-NDA (Section 3.3) and FONDUE-NDD (Section 3.4) algorithms, each tackling one of the data corruption tasks (NDA and NDD, respectively).

The FONDUE Inductive Bias

Clearly, both the NDA and NDD problems are inverse problems, with NDA an ill-posed one. Thus, further assumptions, an inductive bias, or priors are inevitable in order to solve them. The key hypothesis in FONDUE is that the unambiguous and deduplicated graph $G$, being a ‘natural’ graph, can be embedded well using state-of-the-art NE methods. This hypothesis is inspired by the empirical observation that NE methods embed ‘natural’ graphs well.
NE methods find a mapping $f: V \to \mathbb{R}^d$ from nodes to $d$-dimensional real vectors. An embedding is denoted as $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)^\top \in \mathbb{R}^{n \times d}$, where $\mathbf{x}_i \triangleq f(i)$ for $i \in V$ is the embedding of each node. Most well-known NE methods aim to find an optimal embedding $\mathbf{X}_G^*$ for a given graph $G$ that minimizes a continuous differentiable cost function $O(G, \mathbf{X})$.
Thus, given an ambiguous graph G ^ , FONDUE-NDA will search for the graph G , such that c ( G ) = G ^ for an appropriate contraction c, while optimizing the NE cost function on G :
Definition 6 (NE-based NDA problem).
Given an ambiguous graph G ^ , NE-based NDA aims to retrieve the unambiguous graph G and the associated contraction c:
$$\operatorname*{argmin}_{G} \; O\big(G, \mathbf{X}_G^*\big) \quad \text{s.t.} \quad c(G) = \hat{G} \ \text{for some contraction } c.$$
Ideally, this optimization problem could be solved by simultaneously finding optimal splits for all nodes (i.e., an inverse of the contraction $c$) that yield the smallest embedding cost after re-embedding. However, this strategy requires (a) searching an exponential space consisting of all combinations of splits (with arbitrary cardinality) of all nodes, and (b) recomputing the embedding of the resulting network in order to evaluate each combination of splits. Thus, this ideal solution is computationally intractable and more scalable solutions are needed (see Section 3.3).
Similarly, for NDD, given a duplicate graph G ^ , FONDUE-NDD will search for a graph G , such that c ( G ^ ) = G for an appropriate contraction c, again while optimizing the NE cost function on G :
Definition 7 (NE-based NDD problem).
Given a duplicate graph G ^ , NE-based NDD aims to retrieve the deduplicated graph G and the associated contraction c of G ^ :
$$\operatorname*{argmin}_{G} \; O\big(G, \mathbf{X}_G^*\big) \quad \text{s.t.} \quad c(\hat{G}) = G \ \text{for some contraction } c.$$
Generally speaking, to solve this optimization problem, we would want to find the optimal merging of all the nodes that would reduce the cost of the embedding after re-embedding. Yet, a thorough optimization of this problem is beyond the scope of this paper, and as an approximation we rely on a ranking-based approach in which we rank networks with randomly merged nodes according to the value of the objective function after re-embedding. This may be suboptimal, but it highlights the viability of the concept for NDD, as shown in the experimental results.
Although the principle underlying both methods is thus very similar, we will see below that the corresponding methods differ considerably. Common to both is the need for a basic understanding of NE methods.

3.3. FONDUE-NDA

From the above section, it is clear that the NDA problem can be decomposed into two subproblems:
1.
Estimating the multiplicities of all nodes $i \in \hat{V}$, i.e., the number of unambiguous nodes from $G$ represented by each node of $\hat{G}$. This essentially amounts to estimating the contraction $c$. Note that the number of nodes $n$ in $V$ is then equal to the sum of these multiplicities, and arbitrarily assigning these $n$ nodes to the sets $c^{-1}(i)$ defines $c^{-1}$ and, thus, $c$;
2.
Given $c$, estimating the edge set $E$. To ensure that $c(G) = \hat{G}$, for each $\{i,j\} \in \hat{E}$ there must exist at least one edge $\{k,l\} \in E$ with $k \in c^{-1}(i)$ and $l \in c^{-1}(j)$. However, this leaves the problem underdetermined (making it ill-posed), as there may also exist multiple such edges.
As an inductive bias for the second step, we will additionally assume that the graph $G$ is sparse. Thus, FONDUE-NDA estimates $G$ as the graph with the smallest set $E$ for which $c(G) = \hat{G}$. Practically, this means that an edge $\{i,j\} \in \hat{E}$ results in exactly one edge $\{k,l\} \in E$ with $k \in c^{-1}(i)$ and $l \in c^{-1}(j)$, and that equivalent nodes $k \equiv_c l$ with $k, l \in V$ are never connected by an edge, i.e., $\{k,l\} \notin E$. This bias is motivated by the sparsity of most ‘natural’ graphs, and our experiments indicate it is justified.
We approach the NE-based NDA problem (Definition 6) in a greedy and iterative manner. In each iteration, FONDUE-NDA identifies the node with a split that results in the smallest value of the cost function among all nodes. To further reduce the computational complexity, FONDUE-NDA only splits one node into two nodes at a time (e.g., Figure 1b), i.e., it splits node $i$ into two nodes $i'$ and $i''$ with corresponding adjacency vectors $\mathbf{a}_{i'}, \mathbf{a}_{i''} \in \{0,1\}^n$, $\mathbf{a}_{i'} + \mathbf{a}_{i''} = \mathbf{a}_i$. We refer to such a split as a binary split. Note that repeated binary splits can of course be used to achieve the same result as a single split into several nodes, so this assumption does not imply a loss of generality or applicability. Once the best binary split of the best node is identified, FONDUE-NDA splits that node and starts the next iteration. The evaluation of each split requires recomputing the embedding and comparing the resulting optimal NE cost function values with each other.
Unfortunately, this naive strategy is computationally intractable: computing a single NE is already computationally demanding for most (if not all) NE methods. Thus, having to compute a re-embedding for all possible splits, even binary ones (there are $O(n 2^d)$ of them, with $n$ the number of nodes and $d$ the maximal degree), is entirely infeasible for practical networks.

3.3.1. A First-Order Approximation for Computational Tractability

Thus, instead of recomputing the embedding, FONDUE-NDA performs a first-order analysis, investigating the effect of an infinitesimal split of a node $i$ around its embedding $\mathbf{x}_i$ on the cost $O(\hat{G}^{s_i}, \hat{\mathbf{X}}^{s_i})$ obtained after performing the split, with $\hat{G}^{s_i}$ and $\hat{\mathbf{X}}^{s_i}$ referring to the ambiguous graph and its embedding, respectively, after splitting node $i$.
Drawing intuition from Figure 1: when two distinct authors share the same name in a collaboration network, their respective communities (ego-networks) are lumped into one big cluster. Yet, from a topological point of view, that ambiguous node (author name) is connected to two communities that are generally different, meaning they share very few, if any, links. This stems from the observation that it is highly unlikely that two authors with the exact same name would belong to the same community, i.e., collaborate together. Splitting this ambiguous node into two different ones (distinguishing the two authors) would ideally separate these two communities again. We thus consider that each community, which is supposed to be embedded separately, pulls the ambiguous node towards its own embedding region, and that once the node is split, the embeddings of the resolved nodes will improve. Our main goal is therefore to quantify the improvement in the embedding cost function obtained by separating the two nodes $i'$ and $i''$ by a unit distance in a certain direction. We propose to split the assignment of the edges of $i$ between $i'$ and $i''$, such that all links of $i$ are distributed to either $i'$ or $i''$ in such a way that the improvement of the embedding objective is maximal, which can be evaluated by computing the gradient with respect to the separation vector $\boldsymbol{\delta}_i$.
Specifically, FONDUE-NDA seeks the split of node $i$ that results in embeddings $\mathbf{x}_{i'}$ and $\mathbf{x}_{i''}$ with infinitesimal difference $\boldsymbol{\delta}_i$ (where $\boldsymbol{\delta}_i = \mathbf{x}_{i'} - \mathbf{x}_{i''}$, $\mathbf{x}_{i'} = \mathbf{x}_i + \frac{\boldsymbol{\delta}_i}{2}$, $\mathbf{x}_{i''} = \mathbf{x}_i - \frac{\boldsymbol{\delta}_i}{2}$, and $\boldsymbol{\delta}_i \to \mathbf{0}$; e.g., Figure 1b), such that $\|\nabla_{\boldsymbol{\delta}_i} O(\hat{G}^{s_i}, \hat{\mathbf{X}}^{s_i})\|$ is large, with $\nabla_{\boldsymbol{\delta}_i} O(\hat{G}^{s_i}, \hat{\mathbf{X}}^{s_i})$ being the gradient of $O(\hat{G}^{s_i}, \hat{\mathbf{X}}^{s_i})$ with respect to $\boldsymbol{\delta}_i$. This can be computed analytically. Indeed, applying the chain rule, we find:
$$\nabla_{\boldsymbol{\delta}_i} O(\hat{G}^{s_i}, \hat{\mathbf{X}}^{s_i}) = \frac{1}{2}\nabla_{\mathbf{x}_{i'}} O(\hat{G}^{s_i}, \hat{\mathbf{X}}^{s_i}) - \frac{1}{2}\nabla_{\mathbf{x}_{i''}} O(\hat{G}^{s_i}, \hat{\mathbf{X}}^{s_i}).$$
Many recent NE methods like LINE [24] and CNE [6], aim to embed ‘similar’ nodes in the graph closer to each other, and ‘dissimilar’ nodes further away from each other (for a particular similarity notion depending on the NE method). For such methods, Equation (3) can be further simplified. Indeed, as such NE methods focus on modeling a property of pairs of nodes (their similarity), their objective functions can be typically decomposed as a summation of node-pair interaction losses over all node-pairs. For example, this can be seen in Section 3.3.3 of the current paper for CNE [6], and in Equations (3) and (6) of [24] for LINE. Each of these node-pair interaction losses quantifies the extent to which the proximity between nodes’ embeddings reflects their ‘similarity’ in the network. For methods where this decomposition is possible, we can thus write the objective function as follows:
$$O(G, \mathbf{X}) = \sum_{\{i,j\} \in V \times V} O_p(A_{ij}, \mathbf{x}_i, \mathbf{x}_j) = \sum_{\{i,j\} \in E} O_p(A_{ij}=1, \mathbf{x}_i, \mathbf{x}_j) + \sum_{\{k,l\} \notin E} O_p(A_{kl}=0, \mathbf{x}_k, \mathbf{x}_l),$$
where $O_p(A_{ij}, \mathbf{x}_i, \mathbf{x}_j)$ denotes the node-pair interaction loss for nodes $i$ and $j$, $O_p(A_{ij}=1, \mathbf{x}_i, \mathbf{x}_j)$ the part of the objective function that corresponds to nodes $i$ and $j$ with an edge between them ($A_{ij}=1$), and $O_p(A_{kl}=0, \mathbf{x}_k, \mathbf{x}_l)$ the part of the objective function where nodes $k$ and $l$ are disconnected.
Given that $\Gamma(i) = \Gamma(i') \cup \Gamma(i'')$ and $\Gamma(i') \cap \Gamma(i'') = \emptyset$, we can apply the same decomposition to $\nabla_{\mathbf{x}_{i'}} O(\hat{G}^{s_i}, \hat{\mathbf{X}}^{s_i})$,
$$\begin{aligned} \nabla_{\mathbf{x}_{i'}} O(\hat{G}^{s_i}, \hat{\mathbf{X}}^{s_i}) &= \nabla_{\mathbf{x}_{i'}} \sum_{j \in \Gamma(i')} O_p(A_{i'j}=1, \mathbf{x}_{i'}, \mathbf{x}_j) + \nabla_{\mathbf{x}_{i'}} \sum_{l \notin \Gamma(i')} O_p(A_{i'l}=0, \mathbf{x}_{i'}, \mathbf{x}_l) \\ &= \nabla_{\mathbf{x}_{i'}} \sum_{j \in \Gamma(i')} O_p(A_{i'j}=1, \mathbf{x}_{i'}, \mathbf{x}_j) + \nabla_{\mathbf{x}_{i'}} \sum_{l \in \Gamma(i'')} O_p(A_{i'l}=0, \mathbf{x}_{i'}, \mathbf{x}_l) + \nabla_{\mathbf{x}_{i'}} \sum_{l \notin \Gamma(i)} O_p(A_{i'l}=0, \mathbf{x}_{i'}, \mathbf{x}_l), \end{aligned}$$
and similarly for $\nabla_{\mathbf{x}_{i''}} O(\hat{G}^{s_i}, \hat{\mathbf{X}}^{s_i})$,
$$\nabla_{\mathbf{x}_{i''}} O(\hat{G}^{s_i}, \hat{\mathbf{X}}^{s_i}) = \nabla_{\mathbf{x}_{i''}} \sum_{j \in \Gamma(i'')} O_p(A_{i''j}=1, \mathbf{x}_{i''}, \mathbf{x}_j) + \nabla_{\mathbf{x}_{i''}} \sum_{l \in \Gamma(i')} O_p(A_{i''l}=0, \mathbf{x}_{i''}, \mathbf{x}_l) + \nabla_{\mathbf{x}_{i''}} \sum_{l \notin \Gamma(i)} O_p(A_{i''l}=0, \mathbf{x}_{i''}, \mathbf{x}_l).$$
Additionally, as both nodes $i'$ and $i''$ share the same set of non-neighbors of node $i$, we can write the following:
$$\nabla_{\mathbf{x}_{i'}} \sum_{l \notin \Gamma(i)} O_p(A_{i'l}=0, \mathbf{x}_{i'}, \mathbf{x}_l) = \nabla_{\mathbf{x}_{i''}} \sum_{l \notin \Gamma(i)} O_p(A_{i''l}=0, \mathbf{x}_{i''}, \mathbf{x}_l).$$
Furthermore, incorporating the previous two decompositions, we can rewrite Equation (3) as follows:
$$\begin{aligned} \nabla_{\boldsymbol{\delta}_i} O(\hat{G}^{s_i}, \hat{\mathbf{X}}^{s_i}) &= \frac{1}{2}\Big[ \sum_{j \in \Gamma(i')} \nabla_{\mathbf{x}_{i'}} O_p(A_{i'j}=1, \mathbf{x}_{i'}, \mathbf{x}_j) + \sum_{l \in \Gamma(i'')} \nabla_{\mathbf{x}_{i'}} O_p(A_{i'l}=0, \mathbf{x}_{i'}, \mathbf{x}_l) \\ &\qquad - \sum_{j \in \Gamma(i'')} \nabla_{\mathbf{x}_{i''}} O_p(A_{i''j}=1, \mathbf{x}_{i''}, \mathbf{x}_j) - \sum_{l \in \Gamma(i')} \nabla_{\mathbf{x}_{i''}} O_p(A_{i''l}=0, \mathbf{x}_{i''}, \mathbf{x}_l) \Big] \\ &= \frac{1}{2} \sum_{j \in \Gamma(i')} \Big( \nabla_{\mathbf{x}_i} O_p(A_{ij}=1, \mathbf{x}_i, \mathbf{x}_j) - \nabla_{\mathbf{x}_i} O_p(A_{ij}=0, \mathbf{x}_i, \mathbf{x}_j) \Big) \\ &\quad - \frac{1}{2} \sum_{l \in \Gamma(i'')} \Big( \nabla_{\mathbf{x}_i} O_p(A_{il}=1, \mathbf{x}_i, \mathbf{x}_l) - \nabla_{\mathbf{x}_i} O_p(A_{il}=0, \mathbf{x}_i, \mathbf{x}_l) \Big). \end{aligned}$$
Given that $\Gamma(i') \cup \Gamma(i'') = \Gamma(i)$ and $\Gamma(i') \cap \Gamma(i'') = \emptyset$, the above equation can be simplified to:
$$\nabla_{\boldsymbol{\delta}_i} O(\hat{G}^{s_i}, \hat{\mathbf{X}}^{s_i}) = \frac{1}{2} \sum_{j \in \Gamma(i)} (-1)^m \Big( \nabla_{\mathbf{x}_i} O_p(A_{ij}=1, \mathbf{x}_i, \mathbf{x}_j) - \nabla_{\mathbf{x}_i} O_p(A_{ij}=0, \mathbf{x}_i, \mathbf{x}_j) \Big),$$
with $m = 0$ if $j \in \Gamma(i')$ and $m = 1$ if $j \in \Gamma(i'')$.
Let $\mathbf{F}_i^1 \in \mathbb{R}^{d \times |\Gamma(i)|}$ be the matrix whose columns are the gradient vectors $\nabla_{\mathbf{x}_i} O_p(A_{ij}=1, \mathbf{x}_i, \mathbf{x}_j)$ (one column for each $j \in \Gamma(i)$), and let $\mathbf{F}_i^0 \in \mathbb{R}^{d \times |\Gamma(i)|}$ be the matrix whose columns are the gradient vectors $\nabla_{\mathbf{x}_i} O_p(A_{ij}=0, \mathbf{x}_i, \mathbf{x}_j)$ (also one column for each $j \in \Gamma(i)$). Moreover, let $\mathbf{b}_i \in \{-1, 1\}^{|\Gamma(i)|}$ be a vector with one dimension for each neighbor $j \in \Gamma(i)$ of $i$, with value equal to $1$ if that neighbor is a neighbor of $i'$ and equal to $-1$ if it is a neighbor of $i''$ after splitting $i$. Then the gradient in Equation (4) can be written more concisely and transparently as follows:
$$\nabla_{\boldsymbol{\delta}_i} O(\hat{G}^{s_i}, \hat{\mathbf{X}}^{s_i}) = \frac{1}{2}\big(\mathbf{F}_i^1 - \mathbf{F}_i^0\big)\,\mathbf{b}_i.$$
The aim of FONDUE-NDA is to identify node splits for which the embedding quality improvement is maximal. As argued above, we propose to approximately quantify this by means of a first-order approximation, namely by maximizing the squared two-norm of this gradient, $\|\nabla_{\boldsymbol{\delta}_i} O(\hat{G}^{s_i}, \hat{\mathbf{X}}^{s_i})\|^2$. Denoting $\mathbf{M}_i = (\mathbf{F}_i^1 - \mathbf{F}_i^0)^\top(\mathbf{F}_i^1 - \mathbf{F}_i^0)$ (and recognizing that $\mathbf{b}_i^\top \mathbf{b}_i = |\Gamma(i)|$ is independent of $\mathbf{b}_i$), FONDUE-NDA can thus be formalized in the following compact form:
$$\operatorname*{argmax}_{i,\; \mathbf{b}_i \in \{-1,1\}^{|\Gamma(i)|}} \; \frac{\mathbf{b}_i^\top \mathbf{M}_i \mathbf{b}_i}{\mathbf{b}_i^\top \mathbf{b}_i}.$$
Note that $\mathbf{M}_i \succeq 0$ for all nodes and all splits, such that this is an instance of the Boolean quadratic maximization problem [25,26]. This problem is NP-hard, and thus requires further approximations to ensure tractability in practice.

3.3.2. Additional Heuristics for Enhanced Scalability

In order to efficiently search for the best split of a given node, we developed two approximation heuristics.
First, we randomly split the neighborhood $\Gamma(i)$ into two parts and evaluate the objective (Equation (5)). This randomization is repeated a fixed number of times, and the split that gives the best objective value is kept.
Second, we find the eigenvector $\mathbf{v}$ that corresponds to the largest absolute eigenvalue of the matrix $\mathbf{M}_i$, sort the elements of $\mathbf{v}$, assign the top-$k$ corresponding nodes to $\Gamma(i')$ and the rest to $\Gamma(i'')$, evaluate the objective value for $k = 1, \ldots, |\Gamma(i)|$, and pick the best split.
Finally, we combine these two heuristics and use the split that gives the best objective value (Equation (5)) as the final split of node $i$; a sketch is given below.
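A minimal sketch of these two heuristics (Python/NumPy; the matrix `M` is assumed to be precomputed as in Equation (5), and the number of random restarts is an illustrative choice rather than a value from the paper):

```python
import numpy as np

def objective(M, b):
    """Objective of Equation (5): b^T M b / b^T b."""
    return b @ M @ b / (b @ b)

def best_split(M, n_random=100, seed=0):
    """Approximately maximize the Boolean quadratic objective over b in {-1, 1}^d."""
    rng = np.random.default_rng(seed)
    d = M.shape[0]
    candidates = []
    # Heuristic 1: random binary splits of the neighborhood Gamma(i).
    for _ in range(n_random):
        candidates.append(rng.choice([-1, 1], size=d))
    # Heuristic 2: sort by the eigenvector of the largest absolute eigenvalue
    # of M and sweep the cut point k.
    eigvals, eigvecs = np.linalg.eigh(M)
    v = eigvecs[:, np.argmax(np.abs(eigvals))]
    order = np.argsort(-v)
    for k in range(1, d):
        b = -np.ones(d)
        b[order[:k]] = 1               # top-k neighbors go to i', the rest to i''
        candidates.append(b)
    # Keep the candidate split with the best objective value.
    return max(candidates, key=lambda b: objective(M, b))
```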

3.3.3. FONDUE-NDA Using CNE

We now apply FONDUE-NDA to conditional network embedding (CNE). CNE proposes a probability distribution for network embedding and finds a locally optimal embedding by maximum likelihood estimation. CNE has the objective function:
$$O(G, \mathbf{X}) = \log P(\mathbf{A} \mid \mathbf{X}) = \sum_{i,j:\, A_{ij}=1} \log P_{ij}(A_{ij}=1 \mid \mathbf{X}) + \sum_{i,j:\, A_{ij}=0} \log P_{ij}(A_{ij}=0 \mid \mathbf{X}).$$
Here, the link probabilities P i j conditioned on the embedding are defined as follows:
$$P_{ij}(A_{ij}=1 \mid \mathbf{X}) = \frac{P_{A,ij}\, \mathcal{N}_{+,\sigma_1}(\|\mathbf{x}_i - \mathbf{x}_j\|)}{P_{A,ij}\, \mathcal{N}_{+,\sigma_1}(\|\mathbf{x}_i - \mathbf{x}_j\|) + (1 - P_{A,ij})\, \mathcal{N}_{+,\sigma_2}(\|\mathbf{x}_i - \mathbf{x}_j\|)},$$
where $\mathcal{N}_{+,\sigma}$ denotes a half-normal distribution [27] with spread parameter $\sigma$, $\sigma_2 > \sigma_1 = 1$, and where $P_{A,ij}$ is a prior probability for a link to exist between nodes $i$ and $j$, as inferred from the degrees of the nodes (or based on other information about the structure of the network [28]). First, we derive the gradient:
$$\nabla_{\mathbf{x}_i} O(G, \mathbf{X}) = \gamma \sum_{j \neq i} (\mathbf{x}_i - \mathbf{x}_j)\big(P(A_{ij}=1 \mid \mathbf{X}) - A_{ij}\big),$$
where $\gamma = \frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2}$. This allows us to further compute the gradient
$$\nabla_{\boldsymbol{\delta}_i} O(\hat{G}^{s_i}, \hat{\mathbf{X}}^{s_i}) = \frac{\gamma}{2} \big[\,\cdots \; (\mathbf{x}_i - \mathbf{x}_j) \; \cdots\,\big]_{j \in \Gamma(i)} \, \mathbf{b}_i,$$
where the matrix collects the column vectors $(\mathbf{x}_i - \mathbf{x}_j)$ for $j \in \Gamma(i)$.
Thus, the Boolean quadratic maximization problem has form:
$$\operatorname*{argmax}_{i,\; \mathbf{b}_i \in \{-1,1\}^{|\Gamma(i)|}} \; \frac{\mathbf{b}_i^\top \big[(\mathbf{x}_i - \mathbf{x}_k)^\top(\mathbf{x}_i - \mathbf{x}_l)\big]_{k,l \in \Gamma(i)} \, \mathbf{b}_i}{\mathbf{b}_i^\top \mathbf{b}_i}.$$
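Under this instantiation, the matrix $\mathbf{M}_i$ of Equation (5) is, up to a constant factor $\gamma^2$ (which does not affect the argmax), the Gram matrix of the difference vectors $\mathbf{x}_i - \mathbf{x}_k$ for $k \in \Gamma(i)$. A short sketch (NumPy; `X` is the embedding matrix and `neighbors` holds the indices in $\Gamma(i)$; names are ours):

```python
import numpy as np

def cne_split_matrix(X, i, neighbors):
    """M_i[k, l] = (x_i - x_k)^T (x_i - x_l) for k, l in Gamma(i)."""
    D = X[i] - X[neighbors]            # one row per neighbor: x_i - x_k
    return D @ D.T

# The result can be passed directly to the split heuristics sketched in
# Section 3.3.2, e.g., best_split(cne_split_matrix(X, i, neighbors)).
```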

3.4. FONDUE-NDD

Using the inductive bias for the NDD problem, the goal is to minimize the embedding cost after merging the duplicate nodes in the graph (Equation (2)). This is motivated by the fact that natural networks tend to be modeled better by NE methods than corrupted (duplicate) networks, so their embedding cost should be lower. Thus, merging (or contracting) duplicate nodes (nodes that refer to the same entity) in a duplicate graph $\hat{G}$ results in a contracted graph $\hat{G}^c$ that is less corrupt (more closely resembling a ‘natural’ graph) and thus has a lower embedding cost.
Compared to NDA, NDD is more straightforward, as it does not deal with the problem of reassigning the edges of a node after splitting, but simply with determining the duplicate nodes in a duplicate graph. FONDUE-NDD, applied to $\hat{G}$, aims to find duplicate node-pairs in the graph and combine each of them into one node by reassigning the union of their edges, resulting in a contracted graph $\hat{G}^c$.
Using NE methods, FONDUE-NDD aims to iteratively identify the node-pair $\{i,j\} \in \hat{V}_{\mathrm{cand}}$, where $\hat{V}_{\mathrm{cand}}$ is the set of all possible candidate node-pairs, that, if merged into one node $i_m$, results in the smallest cost function value among all node-pairs. Thus, the NE-based NDD problem (Definition 7) can be rewritten as:
$$\operatorname*{argmin}_{\{i,j\} \in \hat{V}_{\mathrm{cand}}} \; O\big(\hat{G}^{c_{ij}}, \hat{\mathbf{X}}^{c_{ij}}\big),$$
where $\hat{G}^{c_{ij}}$ is the graph obtained from $\hat{G}$ by merging the node-pair $\{i,j\}$, and $\hat{\mathbf{X}}^{c_{ij}}$ its corresponding embedding.
Trying this for all possible node-pairs in the graph is intractable. It is not obvious what information could be used to approximate Equation (8), so we approach the problem simply by randomly selecting node-pairs, merging them, observing the values of the cost function, and then ranking the results. The lower the cost score, the more likely the merged nodes are duplicates.
Lacking a scalable bottom-up procedure to identify the best node pairs, our focus in the experiments will be on evaluating whether the introduced merging criterion is indeed useful for identifying node pairs that appear to be duplicates. A sketch of this procedure is given below.
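A minimal sketch of this ranking-based procedure (Python; `embedding_cost` stands in for the NE objective $O$ evaluated after re-embedding, e.g., CNE's, and is assumed to be supplied; the function names and the number of candidate pairs are illustrative):

```python
import random

def merge_pair(edges, i, j):
    """Contract node j into node i by reassigning the union of their edges."""
    return {frozenset((i if u == j else u, i if v == j else v))
            for u, v in edges if {u, v} != {i, j}}

def rank_candidate_merges(nodes, edges, embedding_cost, n_candidates=200, seed=0):
    """Randomly merge node-pairs, re-evaluate the embedding cost on the
    contracted graph, and rank: the lower the cost, the more likely the
    merged pair consists of duplicate nodes."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_candidates):
        i, j = rng.sample(sorted(nodes), 2)
        scores.append((embedding_cost(merge_pair(edges, i, j)), (i, j)))
    return sorted(scores)
```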

FONDUE-NDD Using CNE

Similarly to the previous section, we proceed by applying CNE as the network embedding method; the objective function of FONDUE-NDD is thus that of CNE, evaluated on the tentatively deduplicated graph after attempting a merge:
$$\begin{aligned} O\big(\hat{G}^{c_{ij}}, \hat{\mathbf{X}}^{c_{ij}}\big) &= -\log P\big(\hat{\mathbf{A}}^{c_{ij}} \mid \hat{\mathbf{X}}^{c_{ij}}\big) \\ &= \sum_{k,l:\, \hat{A}^{c_{ij}}_{kl}=1} \log\!\left(1 + \frac{\sigma_1}{\sigma_2}\,\frac{1-P_{kl}}{P_{kl}}\,\exp\!\left(\left(\frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2}\right)\frac{d_{kl}^2}{2}\right)\right) \\ &\quad + \sum_{k,l:\, \hat{A}^{c_{ij}}_{kl}=0} \log\!\left(1 + \frac{\sigma_2}{\sigma_1}\,\frac{P_{kl}}{1-P_{kl}}\,\exp\!\left(\left(\frac{1}{\sigma_2^2} - \frac{1}{\sigma_1^2}\right)\frac{d_{kl}^2}{2}\right)\right), \end{aligned}$$
where the link probabilities $P_{kl}$ conditioned on the embedding are defined as follows:
$$P_{kl}\big(\hat{A}^{c_{ij}}_{kl}=1 \mid \hat{\mathbf{X}}^{c_{ij}}\big) = \frac{P_{\hat{A}^{c_{ij}},kl}\, \mathcal{N}_{+,\sigma_1}(\|\mathbf{x}_k - \mathbf{x}_l\|)}{P_{\hat{A}^{c_{ij}},kl}\, \mathcal{N}_{+,\sigma_1}(\|\mathbf{x}_k - \mathbf{x}_l\|) + \big(1 - P_{\hat{A}^{c_{ij}},kl}\big)\, \mathcal{N}_{+,\sigma_2}(\|\mathbf{x}_k - \mathbf{x}_l\|)}.$$
Similarly to Section 3.3.3, $\mathcal{N}_{+,\sigma}$ denotes a half-normal distribution with spread parameter $\sigma$, $\sigma_2 > \sigma_1 = 1$, $d_{kl} = \|\mathbf{x}_k - \mathbf{x}_l\|$, and $P_{\hat{A}^{c_{ij}},kl}$ is a prior probability for a link to exist between nodes $k$ and $l$, as inferred from the network properties.

4. Experiments

In this section, we investigate quantitatively and qualitatively the performance of FONDUE on both semi-synthetic and real-world datasets, compared to state-of-the-art methods tackling the same problems. In Section 4.1, we introduce and discuss the different datasets used in our experiments, in Section 4.2 we discuss the performance of FONDUE-NDA, and in Section 4.3 that of FONDUE-NDD. Finally, in Section 4.4, we summarize and discuss the results. All code used in this section is publicly available from the GitHub repository https://github.com/aida-ugent/fondue, accessed on 20 October 2021.

4.1. Datasets

One main challenge for the evaluation of disambiguation tasks is the scarcity of ambiguous (contracted) graph datasets with reliable ground truth. Furthermore, other studies that focus on ambiguous node identification often do not publish their heavily processed datasets (e.g., the DBLP datasets of [16]), which makes it harder to benchmark different methods. Thus, to simulate data corruption in real-world datasets, we opted to create a contracted graph from a given source graph, and then use the latter as ground truth to assess the accuracy of FONDUE compared to other baselines. To do so, we used a simple approach for node contraction, for both NDA (Section 4.2.1) and NDD (Section 4.3.1). Below, in Table 1, we list the details of the different datasets used in our experiments after post-processing.
Additionally, we also use real-world networks containing ambiguous and duplicate nodes, mainly part of the PubMed collaboration network, analyzed in Appendix A. The PubMed data are released in independent issues, so to build a connected network from the PubMed data, we select issues that contain ambiguous and duplicate nodes. We then select the largest connected component of that network. One main limitation of this dataset is that not every author has an associated Orcid ID, which affects the false positive and false negative labels in the network (author names that might be ambiguous would be ignored). This is further highlighted in the subsequent sections.

4.2. Node Disambiguation

In this section, we investigate the following questions: (Q1) Quantitatively, how does our method perform in identifying ambiguous nodes compared to the state-of-the-art and other heuristics? (Section 4.2.2); (Q2) Qualitatively, how reliable is the quality of the detected ambiguous nodes compared to other methods when applied to real-world datasets? (Section 4.2.3); (Q3) Quantitatively, how does our method perform in terms of splitting the ambiguous nodes? (Section 4.2.4); (Q4) How does the behavior of the method change when the degree of contraction of a network varies? (Section 4.2.5); (Q5) Does the proposed method scale? (Section 4.2.6); (Q6) Quantitatively, how does our method perform in terms of node deduplication? (Section 4.3.1).

4.2.1. Data Processing

Before conducting the experiments, we needed to process the data to generate semi-synthetic networks. This was done by contracting each of the thirteen datasets mentioned in Table 2. More specifically, for each network $G = (V, E)$, a graph contraction was performed to create a contracted (ambiguous) graph $\hat{G} = (\hat{V}, \hat{E})$ by randomly merging a fraction $r$ of the total number of nodes, creating a ground truth against which to test our proposed method. This is done by first specifying the fraction of the nodes in the graph to be contracted ($r \in \{0.001, 0.01, 0.1\}$), and then sampling two disjoint sets of vertices, $\hat{V}_i \subset \hat{V}$ and $\hat{V}_j \subset \hat{V}$, such that $|\hat{V}_i| = |\hat{V}_j| = r \cdot |\hat{V}|$ and $\hat{V}_i \cap \hat{V}_j = \emptyset$. Then, every vertex $v_j \in \hat{V}_j$ is merged with the corresponding vertex $v_i \in \hat{V}_i$ by reassigning the links connected to $v_j$ to $v_i$ and removing $v_j$ from the network. The node-pairs $(v_i, v_j)$ later serve as ground truth. We have also tested the case where the candidate contracted vertices have no common neighbors (instead of uniform selection at random). This mimics some types of social networks in which, when two authors share the same name, their ego-networks typically do not intersect. Further analysis of the PubMed dataset (Table 1) revealed that none of the ambiguous nodes shared edges with the neighbors of another ambiguous node. A sketch of this contraction procedure is given below.
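For concreteness, a sketch of this corruption procedure under a uniformly random contraction (Python; the function and variable names are ours, not from the released code):

```python
import random

def contract_graph(nodes, edges, r, seed=0):
    """Randomly merge a fraction r of the nodes pairwise; return the
    contracted edge set and the merged pairs as ground truth."""
    rng = random.Random(seed)
    k = int(r * len(nodes))
    sample = rng.sample(sorted(nodes), 2 * k)          # two disjoint vertex sets
    ground_truth = list(zip(sample[:k], sample[k:]))   # (v_i, v_j) pairs
    relabel = {v_j: v_i for v_i, v_j in ground_truth}  # v_j is absorbed into v_i
    contracted = {frozenset((relabel.get(u, u), relabel.get(v, v)))
                  for u, v in edges}
    return {e for e in contracted if len(e) == 2}, ground_truth  # drop self-loops
```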
We have tested the performance of FONDUE-NDA, as well as that of the competing methods listed in the following section, on fourteen different datasets listed in Table 1, with their properties shown in Table 2.

4.2.2. Quantitative Evaluation of Node Identification

In this section, we focus on answering Q1, namely, given a contracted graph, FONDUE-NDA aims to identify the list of contracted (ambiguous) nodes present in it. We first discuss the datasets used in the experiments in the following section.
Baselines. As mentioned earlier in Section 1, most entity disambiguation methods in the literature focus on the task of re-assigning the edges of an already predefined set of ambiguous nodes, and the process of identifying these nodes in a given non-attributed network, is usually overlooked. Thus, there exists very few approaches that tackle the latter case. In this section, we compare FONDUE-NDA with three different competing approaches that focus on the identification task, one existing method, and two heuristics.
Normalized-Cut (NC) The work of [16] comes close to ours, as their method also aims to identify ambiguous nodes in a given graph, by utilizing Markov clustering to cluster the ego network of a vertex u with the vertex itself removed. NC favors groupings that yield few cross-edges between different clusters of u’s neighbors. The result is a score reflecting the quality of the clustering, using the normalized cut (NC):
$$NC = \sum_{i=1}^{k} \frac{W(C_i, \bar{C_i})}{W(C_i, C_i) + W(C_i, \bar{C_i})},$$
with $W(C_i, C_i)$ the sum over all edges within cluster $C_i$, $W(C_i, \bar{C_i})$ the sum over all edges between cluster $C_i$ and the rest of the network $\bar{C_i}$, and $k$ the number of clusters in the graph. Although [17] also worked on identifying ambiguous nodes based on topological features, their method (which is not publicly available) performed worse in all cases when compared to [16], so we only chose the latter as a competing baseline.
Connected-Component Score (CC) We also include another baseline, connected-component score (CC), relying on the same approach used in [16], with a slight modification. Instead of computing the normalized cut score based on the clusters of the ego graph of a node, we account for the number of connected components of a node’s ego graph, with the node itself removed.
Degree Finally, we use node degree as a baseline. As contracted nodes usually tend to have a higher degree, by inheriting the edges of the merged nodes, degree is a simple predictor of node ambiguity.
Evaluation Metric. FONDUE-NDA ranks nodes according to their computed ambiguity score (how likely a node is to be ambiguous), and the same goes for NC and CC. At first glance, the evaluation can be approached from a binary classification perspective, by considering the top X ranked nodes as ambiguous (where X is the actual number of true positives); we can then use the usual metrics for binary classification, such as F1-score, precision, recall, and AUC. However, this requires knowing beforehand the number of true positives, i.e., the number of actual ambiguous nodes (or setting a clear cutoff value), which is only possible for labeled datasets and controlled experiments. In real-world settings, if FONDUE-NDA is to be used to detect ambiguous nodes in unlabeled networks, such a cutoff is not available, and it is more useful for relevant (ambiguous) nodes to be ranked higher than non-relevant nodes. Thus, it is necessary to extend the traditional binary classification evaluation methods, which are based on binary relevance judgments, to more flexible graded relevance judgments, such as cumulative gain, which is a form of graded precision (it is identical to precision when the rating scale is binary). However, as our datasets are highly imbalanced by nature, mainly because ambiguous nodes are by definition a small part of the network, a better variant of the cumulative gain metric is needed. Hence, we employ the normalized discounted cumulative gain to evaluate our method, alongside the traditional binary classification metrics listed above. Below, we detail each metric.
Precision The number of correctly identified positive results divided by the number of all predicted positive results:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall The number of correctly identified positive results divided by the number of all positive samples
$$\text{Recall} = \frac{TP}{TP + FN}$$
F1-score The harmonic mean of precision and recall; the F1 score reaches its best value at 1 and its worst at 0:
$$F1 = 2\,\frac{\text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$
Note that, because we classify exactly the top X nodes as positive (with X the number of truly ambiguous nodes), the number of false positives equals the number of false negatives, so the values of recall, precision, and F1-score coincide in this binary classification setting.
Area Under the ROC curve (AUC) A ROC curve is a 2D depiction of classifier performance, which can be reduced to a single scalar by calculating the area under the curve (AUC). Essentially, the AUC computes the probability that our measure ranks a randomly chosen ambiguous node (positive example) higher than a randomly chosen non-ambiguous node (negative example). Ideally, this probability is 1, which means our method has successfully identified ambiguous nodes 100% of the time; the baseline value is 0.5, where ambiguous and non-ambiguous nodes are indistinguishable. This accuracy measure has been used in other works in this field, including [16], which makes it easier to compare with their work.
Discounted Gain (DCG) The main limitation of the previous metrics, as discussed earlier, is their inability to account for graded scores rather than only binary classification. To account for this, we utilize cumulative-gain-based metrics. Given a ranked result list, cumulative gain (CG) is the sum of the graded relevance values of all results:
$$CG = \sum_{i=1}^{n} \text{relevance}_i$$
On the other hand, DCG [34] takes position significance into account and adds a penalty if a highly relevant document appears lower in a result list, as the graded relevance value is reduced logarithmically with the position of the result. Practically, it is the sum of the true scores, ranked in the order induced by the predicted scores, after applying a logarithmic discount; the higher the DCG, the better the ranking.
$$DCG = \sum_{i=1}^{n} \frac{\text{relevance}_i}{\log_2(i+1)}$$
Normalized Discounted Gain (NDCG) NDCG is commonly used in the information retrieval field to measure the effectiveness of search algorithms, where highly relevant documents are more useful when appearing earlier in the results, and more useful than marginally relevant documents, which are in turn better than non-relevant documents. It improves upon DCG by accounting for the variation of the relevance and providing proper upper and lower bounds that can be averaged across all relevance scores. It is computed by summing the true scores ranked in the order induced by the predicted scores, after applying a logarithmic discount, and then dividing by the best possible score, the ideal DCG (IDCG, obtained for a perfect ranking), to obtain a score between 0 and 1:
$$NDCG = \frac{DCG}{IDCG}$$
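For reference, these metrics can be computed as follows (Python with scikit-learn; `y_true` holds the binary ambiguity labels and `scores` the ambiguity scores produced by a method; a sketch assuming both classes are present):

```python
import numpy as np
from sklearn.metrics import ndcg_score, roc_auc_score

def ranking_metrics(y_true, scores):
    """AUC, NDCG, and the precision of the top-X cutoff, where X is the
    number of truly ambiguous nodes (at this cutoff precision = recall = F1)."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    auc = roc_auc_score(y_true, scores)
    ndcg = ndcg_score([y_true], [scores])       # ndcg_score expects 2D inputs
    x = int(y_true.sum())                       # known number of positives
    top = np.argsort(-scores)[:x]
    precision = y_true[top].sum() / x
    return auc, ndcg, precision
```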
Evaluation pipeline. We first perform network contraction on the original graph, fixing the ratio of ambiguous nodes to r. We then embed the network using CNE and compute the disambiguation measure of FONDUE-NDA (Equation (7)), as well as the baseline measures, for each node. The scores yielded by the measures are then compared to the ground truth (i.e., binary labels indicating whether a node is a contracted node). This is done for three different values of $r \in \{0.001, 0.01, 0.1\}$. We repeat the process 10 times using different random seeds to generate the contracted networks and average the scores. For the embedding configuration, we set the parameters of CNE to $\sigma_1 = 1$, $\sigma_2 = 2$, with the dimensionality limited to $d = 8$.
Results. The results are illustrated in Figure 3 and shown in detail in Table 3, focusing mainly on NDCG as it better assesses the ranking performance of each method. FONDUE-NDA outperforms the state-of-the-art method, as well as the non-trivial baselines, in terms of NDCG on most datasets. It is also more robust to variation in the size of the network and in the fraction of ambiguous nodes in the graph. NC appears to struggle to identify ambiguous nodes in smaller networks (Table 2). Additionally, as we tested multiple network settings, with uniformly random contraction (randomly selecting a node-pair and merging it) or conditional contraction (selecting node-pairs that do not share common neighbors, to realistically mimic collaboration networks), we did not observe any significant changes in the results.

4.2.3. Qualitative Evaluation of Nodes Identification

Now that we have demonstrated that FONDUE-NDA outperforms the current state-of-the-art methods for identifying ambiguous nodes, in this section we investigate Q2: the quality of the results produced by FONDUE-NDA and how they compare to those of competing methods. Additionally, we investigate whether nodes with a higher ambiguity number (nodes that map to a larger number of entities) are more easily identified as ambiguous than nodes with a lower ambiguity number (that map to a smaller number of entities). To do so, semi-synthetic datasets do not constitute valid sources, and we need to rely on real-world datasets to assess the quality of the ground truth of the ambiguous nodes. We collected data from the National Center for Biotechnology Information, which provides the PubMed datasets comprising citations for biomedical literature from MEDLINE, life science journals, and online books, possibly including links to full-text content from PubMed Central and publisher websites. Snippets of these data are released periodically and can be used to build author–author collaboration networks. Using the Orcid ID, a persistent digital identifier usually provided in the metadata of the datasets, to distinguish ambiguous nodes, we can build a reliable ground truth for our qualitative experiment and investigate the top-ranked nodes of FONDUE-NDA and of the other baselines.
Datasets. The PubMed datasets are publicly available and updated regularly. A few data snippets are selected to extract author information and build an author–author collaboration network based on the metadata included in the datasets. The largest connected component of the network is then selected. To assess the ambiguity of the network, we also extract the Orcid ID, if available, for each author in the network. If more than one Orcid ID is associated with an author name, the author name is labeled as ambiguous. Note that not all authors have their Orcid ID listed in the datasets, which might degrade the quality of the labels: some author names might be ambiguous, but because their Orcid IDs are not recorded they are not accounted for in the labels, thus affecting the final results.
Metrics. Similarly to Section 4.2.2, we first take the binary classification approach, where we consider the top 31 ranked nodes to be classified as ambiguous (as the network contains 31 ambiguous nodes). The same goes for NC. We use the same metrics as in Section 4.2.2. Additionally, to assess the quality of the ranked nodes, we report the number of true positives (TP; nodes that are correctly identified as ambiguous) and false positives (FP; nodes that are incorrectly identified as ambiguous). We also consider NDCG for graded relevance.
Results. After building the PubMed collaboration network, which contains 2122 nodes with 31 ambiguous nodes (6 of which are names that refer to more than 2 authors), it is embedded using CNE, and we compute the score measure of FONDUE-NDA (Equation (7)), as well as the baseline measures, for each node. As explained in Section 4.2.2, FONDUE-NDA ranks the nodes by ambiguity score. For comparison, we use the work of [16] as a baseline. Table 4 shows the performance of FONDUE-NDA against NC in terms of AUC, TP, FP, and F1-score. FONDUE-NDA clearly outperforms NC for binary classification of ambiguous nodes. This result is also highlighted when further inspecting the results in Table 5, which lists the top 10 nodes classified by each method, with FONDUE-NDA correctly classifying 90% of the author names. Note that the results are quite intuitive and conform with our findings in the earlier analysis of the dataset in Appendix A: authors with Asian names are more likely to share common names, due to shorter name lengths and simplified transcription (from Mandarin to English, for example).
We also investigated the ranks of nodes that map to more than 2 entities (highlighted with an asterisk in both tables). Again, FONDUE-NDA outperforms NC and ranks 3 out of the 6 highly ambiguous names in the top 10.
This confirms the results obtained in Section 4.2.2: FONDUE-NDA outperforms NC, the state-of-the-art method, for identifying ambiguous nodes.

4.2.4. Quantitative Evaluation of Node Splitting

Following the identification of the ambiguous nodes, in this section we focus on the task of node splitting and answer Q3: how well does FONDUE-NDA perform when partitioning the edge set of an ambiguous node between two separate nodes? Simply put, given an ambiguous node $v_i$, node splitting refers to the process of replacing this node with two distinct nodes $v_i'$ and $v_i''$ and re-assigning the edges of $v_i$ such that $\Gamma(v_i') \cup \Gamma(v_i'') = \Gamma(v_i)$.
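The following minimal sketch, using networkx, illustrates the splitting operation itself. The choice of the edge partition is what the splitting method (FONDUE-NDA or a baseline) must decide; here it is simply passed in explicitly.

```python
import networkx as nx

def split_node(G, v, neighbors_of_first_copy):
    """Replace node v by two copies (v, 1) and (v, 2); neighbors listed in
    neighbors_of_first_copy are re-attached to (v, 1), all others to (v, 2)."""
    H = G.copy()
    v1, v2 = (v, 1), (v, 2)
    for u in list(H.neighbors(v)):
        H.add_edge(v1 if u in neighbors_of_first_copy else v2, u)
    H.remove_node(v)
    return H
```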
Baselines. For the node splitting task, the three baselines discussed in Section 4.2.2 are not directly applicable. However, we adopt the Markov clustering (MCL) approach that underlies the normalized cut measure as a splitting baseline: a split is given by the MCL clustering of the ego network of an ambiguous node, with the node itself removed.
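A sketch of this baseline is given below; it assumes the third-party markov_clustering package and a recent networkx version, and is illustrative rather than our exact implementation.

```python
import networkx as nx
import markov_clustering as mc  # third-party MCL implementation (assumed available)

def mcl_split(G, v):
    """Cluster the ego network of v (with v removed) and return the clusters,
    which serve as the proposed partition of v's neighborhood."""
    ego = nx.ego_graph(G, v)
    ego.remove_node(v)
    nodes = list(ego.nodes())
    matrix = nx.to_scipy_sparse_array(ego, nodelist=nodes)
    result = mc.run_mcl(matrix)          # MCL with default inflation/expansion
    clusters = mc.get_clusters(result)   # tuples of indices into `nodes`
    return [[nodes[i] for i in cluster] for cluster in clusters]
```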
Evaluation Metric. Given a list of ambiguous nodes, we evaluate the splits produced by FONDUE-NDA and MCL against the ground truth (the node splits according to the original network). This is quantified by computing the adjusted Rand index (ARI) between FONDUE-NDA and the ground truth, as well as between MCL and the ground truth. The ARI is a similarity measure between two clusterings; it ranges between −1 and 1, and the higher the score, the better the alignment between the two compared clusterings.
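For example, ARI can be computed with scikit-learn by expressing both splits as cluster labels over the edges (equivalently, the neighbors) of the ambiguous node:

```python
from sklearn.metrics import adjusted_rand_score

# ground-truth assignment of each edge of the ambiguous node to one of the two entities
true_split = [0, 0, 0, 1, 1, 1]
# assignment produced by the splitting method under evaluation
pred_split = [0, 0, 1, 1, 1, 1]
print(adjusted_rand_score(true_split, pred_split))  # 1.0 means perfect agreement
```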
Pipeline. First, we compute the ground truth. Then, for each ambiguous node, we evaluate the quality (measured by ARI) of the split produced by FONDUE-NDA and by MCL against the original partition. We repeat the experiments for three different contraction ratios r ∈ {0.001, 0.01, 0.1} for each dataset. For each ratio, the experiment is repeated 3 times with different seeds.
Results. Despite outperforming NC in ambiguous node identification, FONDUE-NDA underperforms MCL on nearly every dataset for node splitting (Table 6). This shows that FONDUE-NDA is not adequately optimized for this task. Several factors contribute to the poor performance, chiefly the quality of the embeddings, which degrades the objective function used for edge assignment. Nonetheless, given the modular design of our method, and to obtain a complete framework covering both node identification and node splitting, we recommend using FONDUE-NDA for the former task and MCL for the latter.

4.2.5. Parameter Sensitivity

In this section, we study the robustness of FONDUE-NDA under different network settings, in particular how the percentage of ambiguous nodes in a graph affects node identification. In the previous experiment (Section 4.2.2), we fixed the ratio of ambiguous nodes to values in {0.001, 0.01, 0.10}; here, we follow the same pipeline (generate, embed, evaluate, for 10 different random seeds) for these different ratios of ambiguous nodes. As listed in Table 3, FONDUE-NDA outperforms NC and the other baselines across nearly all networks for the different contraction ratios. We also accounted for different ways of performing the node contraction, as specified in Section 4.2.1. As previously mentioned, the node contraction process is assumed to be a form of corruption of the data as it occurs in real life (i.e., outside an evaluation setup). To simulate this corruption process (so as to generate semi-synthetic test data with a known ground truth), merged nodes were selected uniformly at random. This does not guarantee that the resulting network becomes unnatural in a way that suits the embedding objective function; however, this only strengthens our empirical results: FONDUE-NDA works even when its assumptions are not guaranteed to be satisfied. We also studied a different way of selecting nodes for contraction when generating the semi-synthetic data: selecting only node pairs that have common neighbors. FONDUE-NDA consistently outperformed the baselines under this contraction approach as well, in most datasets, as shown in Table 3 and Figure 3.
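For reference, the sketch below shows the two contraction strategies used to generate the semi-synthetic data; sampling details may differ slightly from our exact pipeline.

```python
import random
import networkx as nx

def contract_random_pair(G, require_common_neighbor=False, seed=None):
    """Merge one randomly chosen node pair; optionally restrict candidates to
    pairs sharing at least one common neighbor. Returns the contracted graph
    and the merged pair, which serves as the ground truth for disambiguation."""
    rng = random.Random(seed)
    nodes = list(G.nodes())
    while True:
        u, v = rng.sample(nodes, 2)
        if require_common_neighbor and not (set(G[u]) & set(G[v])):
            continue
        # merge v into u; a self-loop created by the merge is dropped
        return nx.contracted_nodes(G, u, v, self_loops=False), (u, v)
```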

4.2.6. Execution Time Analysis

The runtime of FONDUE-NDA is linear, O(n), for each node, and the per-node computation is trivially parallelizable. In Figure 4, we show the execution speed of FONDUE-NDA and of the baselines for node identification and splitting. FONDUE-NDA is faster than CC, and faster than NC by nearly one order of magnitude, in most datasets. Note that FONDUE-NDA approximates Equation (5) by aggregating two different approximation heuristics (randomized sampling and eigenvector thresholding, as listed in Section 3.3.2); the best results (in terms of NDCG and ARI) were obtained by the latter approximation, so the runtime results reflect only the execution time of that heuristic. Detailed numbers are listed in Table 7. All experiments were conducted on an Intel i7-7700K CPU at 4.20 GHz with 32 GB of RAM, running the Ubuntu 18.04 distribution of Linux.
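Since the per-node scores are mutually independent, the computation parallelizes trivially, as in the following sketch, where ambiguity_score is a hypothetical stand-in for the FONDUE-NDA score of Equation (7):

```python
from multiprocessing import Pool

def ambiguity_score(node):
    # placeholder: the actual implementation evaluates Equation (7) for `node`
    # given the fixed embedding of the observed network
    return 0.0

def rank_by_ambiguity(nodes, processes=8):
    with Pool(processes) as pool:
        scores = pool.map(ambiguity_score, nodes)
    # higher score = more likely to be ambiguous
    return sorted(zip(nodes, scores), key=lambda pair: pair[1], reverse=True)
```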

4.3. Node Deduplication

4.3.1. Quantitative Evaluation of Node Deduplication

In this section, we describe the experimental setup used to tackle the NDD problem. Referring to the data listed in Table 1, we performed our experiments on three semi-synthetic datasets. We additionally used the PubMed dataset, whose metadata contain Orcid IDs, to build a duplicate network with a ground truth derived from author names sharing a common Orcid ID.
Evaluation Pipeline. As indicated in Section 4.1, we post-process the graph data in order to conduct controlled experiments for comparison with the baselines. For the semi-synthetic datasets (Table 2), to simulate the data corruption process underlying the NDD problem, we perform node splitting on the different datasets. For a given graph G = (V, E) with n = |V|, we compute the embedding cost function of G. We then randomly choose one node i and split it into two nodes i' and i'', i.e., we add a new node to the graph, which results in a corrupted (duplicate) graph Ĝ = (V̂, Ê) with n_d = |V̂| = n + 1.
We then randomly choose one node pair from V̂ × V̂ and perform node contraction by merging these two nodes into a single node carrying all the edges of both, which results in a contracted graph Ĝ_c = (V̂_c, Ê_c) with |V̂_c| = n. We then compute the embedding cost function of Ĝ_c. We repeat this process 99 times, each time choosing a different random node pair from V̂ and computing the resulting embedding cost. Lastly, we compute the embedding cost function after merging the true duplicate node pair {i', i''}. We compare these values with the cost function of the original graph G, and report the resulting ranking in Table 8. Although the data corruption process seems simple, several parameters can affect it; we therefore introduce a few parameters to better model node splitting (a sketch of this corruption step follows the list below):
  • Edge distribution: We employ two different ways of reassigning the edges from one node to the two new nodes: either randomly, with at least one edge per node, or with an equal number of edges for each;
  • Minimum degree: We only split nodes with a degree larger than a specified minimum;
  • Overlap: We specify whether there is an overlap in the edge reassignment between the two nodes, i.e., the percentage of edges shared by both.
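The sketch below illustrates this corruption step under the three parameters above; parameter names are illustrative and not necessarily those of our implementation.

```python
import random
import networkx as nx

def split_for_ndd(G, min_degree=2, balanced=True, overlap=0.0, seed=None):
    """Corrupt G by splitting one node of degree >= min_degree into two copies,
    distributing its edges either in balance or at random, with a fraction
    `overlap` of the edges attached to both copies."""
    rng = random.Random(seed)
    v = rng.choice([n for n, d in G.degree() if d >= min_degree])
    neighbors = list(G.neighbors(v))
    rng.shuffle(neighbors)
    cut = len(neighbors) // 2 if balanced else rng.randint(1, len(neighbors) - 1)
    shared = set(neighbors[: int(overlap * len(neighbors))])  # edges kept on both copies
    H = G.copy()
    H.remove_node(v)
    v1, v2 = (v, 1), (v, 2)
    H.add_edges_from((v1, u) for u in set(neighbors[:cut]) | shared)
    H.add_edges_from((v2, u) for u in set(neighbors[cut:]) | shared)
    return H, (v1, v2)  # (v1, v2) is the ground-truth duplicate pair
```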
Datasets. As indicated earlier, we used the three semi-synthetic datasets (Table 2) lesmis, polbooks, and netscience, mainly for their relatively small size. Additionally, we tested FONDUE-NDD on part of the PubMed dataset (described in Section 4.2.3), a network of 2122 nodes including one duplicate node.
Baselines. As all competing methods [22,23] require additional labels and attributes for NDD, we opted for a simple baseline to assess the performance of FONDUE-NDD, namely the L2 norm of the distance between the CNE embeddings of the candidate duplicate node pair {i', i''} (referred to as ED):
$ED_{\{i',i''\}} = \| x_{i'} - x_{i''} \|_2$
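In code, this baseline amounts to a single vector-norm computation per candidate pair, as in the following sketch (the embedding matrix X is assumed to have one row per node):

```python
import numpy as np

def embedding_distance(X, i, j):
    """L2 distance between the embeddings of nodes i and j (rows of X)."""
    return float(np.linalg.norm(X[i] - X[j]))

X = np.random.rand(6, 2)              # toy embedding with 6 nodes in 2 dimensions
print(embedding_distance(X, 0, 1))    # smaller distance = stronger duplicate candidate
```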
Metric. As described in the previous section, we compute the value of the objective function after re-embedding each of the merged node pairs and rank the pairs by this value. We use the rank of the true duplicate pair as the metric to assess whether our approach can predict which node pair is a duplicate. The best possible value is 1, meaning that FONDUE-NDD identifies the duplicate node pair, as its re-embedding cost is the lowest among all candidate pairs.
Results. Table 8 reports the average rank of the objective cost function over 100 different trials. We ran a two-sided Fisher test to check whether the differences between the averages of the two methods are statistically significant (p < 0.05); averages are highlighted in bold when this is the case. The results show that for high-degree nodes (above the average degree), FONDUE-NDD outperforms ED, but its performance degrades for low-degree nodes. Additionally, the more connected a corrupted node is, the larger the improvement in the objective function of the recovered network compared to that of the corrupted network. This shows that some of the parameters identified in the previous section play a large role in the identification of duplicate nodes using FONDUE-NDD. Overall, the intuition behind FONDUE-NDD is borne out by the experimental results. For the PubMed dataset, we find that the average rank is 4 out of 100, while ED ranks 6th. This is consistent with the results on the semi-synthetic data, as the degree of the duplicate node was above the graph average.
Execution time. As we do not count the embedding of the initial duplicate network as part of the execution time of FONDUE-NDD, the baseline ED has an execution time of 0, since it is derived directly from the embedding of the duplicate graph. FONDUE-NDD additionally performs repeated uniformly random node contractions followed by re-embedding, as specified in the pipeline above, so its execution time varies with the size of the network and the number of embeddings computed. Results are shown in Table 9.

4.4. Discussions

Despite its state-of-the-art performance in identifying ambiguous nodes (Section 4.2.2), FONDUE-NDA’s node splitting functionality falls short of MCL’s (Section 4.2.4). Nonetheless, we argue that FONDUE-NDA’s main feature is the identification of ambiguous nodes, which is one of the highlighted contributions of this paper: its results are consistent across datasets and contraction ratios, making it a versatile tool for network ambiguity detection in the challenging setting where, besides the network topology itself, no additional information (such as node attributes, descriptions, or labels) is available or may be used.
For node deduplication, FONDUE-NDD performed well in settings where the duplicate nodes have a higher than average degree, which is arguably typical for the NDD problem, as duplicate nodes tend to have higher degrees.
The main limitation of FONDUE is its reliance on the scalability of the underlying embedding method. With CNE as the current backend NE method, scalability is limited to medium-sized networks of fewer than 100,000 nodes.
Implementing additional NE methods as backends for FONDUE-NDA and FONDUE-NDD is one future direction for exploring and improving the state-of-the-art of NDA and NDD.

5. Conclusions

In this paper, we formalized both the node deduplication problem and the node disambiguation problem as inverse problems. We presented FONDUE as a novel method that exploits the empirical fact that naturally occurring networks can be embedded well using state-of-the-art network embedding methods, such that the embedding quality of the network after node disambiguation or node deduplication can be used as an inductive bias.
For node deduplication, we showed that FONDUE-NDD, using only the topological properties of a graph, can help identify duplicate nodes, with experiments on four different datasets demonstrating the viability of the method. Although it is not an end-to-end solution, it can facilitate filtering out the best candidate duplicate nodes.
For tackling node disambiguation, FONDUE-NDA decomposes this task into two sub-tasks: identifying ambiguous nodes, and determining how to optimally split them. Using an extensive experimental pipeline, we empirically demonstrated that FONDUE-NDA outperforms the state-of-the-art when it comes to the accuracy of identifying ambiguous nodes, by a substantial margin and uniformly across a wide range of benchmark datasets of varying size, proportion of ambiguous nodes, and domain, while keeping the computational cost lower than that of the best baseline method, by nearly one order of magnitude.
On the other hand, the boost in ambiguous node identification accuracy was not observed for the node splitting task, where FONDUE-NDA underperformed compared to the competing baseline, Markov clustering. Thus, we suggested a combination of FONDUE for node identification, and Markov clustering on the ego-networks of ambiguous nodes for node splitting, as the most accurate approach to address the full node disambiguation problem.

Author Contributions

Conceptualization, B.K. and T.D.B.; methodology, A.M., B.K. and T.D.B.; software, A.M. and B.K.; validation, A.M., B.K., J.L. and T.D.B.; formal analysis, A.M. and B.K.; investigation, A.M. and B.K.; resources, J.L. and T.D.B.; data curation, A.M. and B.K.; writing—original draft preparation, A.M., B.K., J.L. and T.D.B.; writing—review and editing, A.M., B.K., J.L. and T.D.B.; visualization, A.M. and B.K.; supervision, B.K., J.L. and T.D.B.; project administration, B.K.; funding acquisition, J.L. and T.D.B. All authors have read and agreed to the published version of the manuscript.

Funding

The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) (ERC Grant Agreement no. 615517), and under the European Union’s Horizon 2020 research and innovation programme (ERC Grant Agreement No. 963924), from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme, and from the FWO (project no. G091017N, G0F9816N, 3G042220).

Data Availability Statement

All data used in this study are publicly available from other sources; see Section 3.1 and Table 1 for the sources.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analysis, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

NDA	Node Disambiguation
NDD	Node Deduplication
NE	Network Embeddings
CNE	Conditional Network Embeddings

Appendix A. Real-World Example: PubMed Dataset

To illustrate the aforementioned problems in the real world, we processed the publicly available PubMed dataset, courtesy of the U.S. National Library of Medicine, which comprises more than 31 million citations for biomedical literature from various sources, with metadata ranging from author names to their Orcid IDs. An Orcid ID is a unique author identifier used to distinguish researchers; it is useful for uncovering ambiguous author names (an author name associated with multiple Orcid IDs) and author name duplicates (distinct names that correspond to the same Orcid ID). From a quick analysis of the roughly 17 million author records present in the dataset, there are 154,247 duplicate author names (different names that refer to the same author) and 62,506 ambiguous author names (different authors sharing the same name). The visualization in Figure A1 shows the frequency histogram of ambiguous author names with multiple Orcid IDs, which ranges from 2 to more than 200 per name. The most common cause of this problem is that distinct authors share the same name, which is particularly common for authors of Asian background, as the romanization of their names is often ambiguous. The most ambiguous names in the dataset are ’Wei Zhang’ and ’Wei Wang’, each associated with more than 200 different authors.
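The sketch below illustrates this analysis, assuming the parsed author metadata is available as a pandas DataFrame with hypothetical columns "name" and "orcid" (the values shown are toy data, not actual PubMed records):

```python
import pandas as pd

# toy rows standing in for one (author name, Orcid ID) occurrence each
authors = pd.DataFrame({
    "name":  ["Wei Zhang", "Wei Zhang", "Robert J. Henry", "R. Henry"],
    "orcid": ["0000-0001", "0000-0002", "0000-0003", "0000-0003"],
})

# ambiguous names: one name associated with several distinct Orcid IDs
orcids_per_name = authors.groupby("name")["orcid"].nunique()
ambiguous_names = orcids_per_name[orcids_per_name > 1]

# duplicate entries: one Orcid ID appearing under several name variants
names_per_orcid = authors.groupby("orcid")["name"].nunique()
duplicated_authors = names_per_orcid[names_per_orcid > 1]

print(len(ambiguous_names), "ambiguous names;", len(duplicated_authors), "duplicated authors")
```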
Figure A1. The frequency of ambiguous author names with multiple Orcid IDs, i.e., of names that are shared by distinct authors. ’Wei Zhang’ and ’Wei Wang’ are the most common names, each shared by more than 200 distinct authors.
As for duplicate author names, shown in Figure A2, the main cause is multiple name variations for the same author; for example, the Orcid ID 0000000240600292 belongs to four variations of the name ‘Robert Henry’ (including Robert J. Henry, Robert James Henry, and R. Henry) that all refer to the same author in the dataset. The second cause is erroneously parsed metadata provided by the PubMed baseline dataset, where coauthors share the same Orcid ID as one of the other authors.
Figure A2. The frequency of different author names that share the same Orcid ID, i.e., of Orcid IDs that appear under more than one author name entry, due to author name variation.
Both of these problems result from the processing and parsing steps, and are very common for this type of data.

Appendix B. Tabulated Results for Ambiguous Node Identification

As discussed in Section 4.2.2, we argued that the best metric for evaluating the task is NDCG. Nonetheless, as many approaches use classical binary classification metrics, we present detailed results of the experiments below.
Table A1. Performance evaluation in terms of AUC score, DCG, and F1-score on multiple datasets for FONDUE-NDA compared with the other baselines. Note that for some datasets with a small number of nodes, we did not perform any contraction at ratio 0.001, as the number of contracted nodes would be very small; the corresponding values are replaced by “−”.
Ambiguity Rate10%1%0.1%
MethodFONDUE-NDANCCCDegreeFONDUE-NDANCCCDegreeFONDUE-NDANCCCDegree
AUCfb-sc0.9510.7930.1630.7370.9680.8360.5170.7130.9890.9360.7470.690
fb-pp0.8710.7200.7440.7240.8950.6990.7620.7260.8880.6730.7680.673
email0.7680.4910.2800.7300.8620.5870.3800.736
student0.7850.5550.5490.7240.8780.3160.7080.839
lesmis0.8330.5560.3660.757
polbooks0.9520.5900.3080.8411.0000.8130.3270.877
ppi0.7460.5590.6410.7260.7920.5550.6470.7390.9430.5220.6300.803
netscience0.9330.8770.8070.8050.9520.8210.6690.842
GrQc0.8720.8070.7960.7400.8900.8120.7850.7190.9440.7940.6750.696
CondMat0.8610.8370.7970.7570.8670.8390.8100.7440.8730.8400.7930.709
HepTh0.8530.7540.7800.7430.8610.7460.7750.7330.9020.7410.7990.764
cm050.8670.8520.8110.7540.8890.8670.8150.7580.9100.8700.8240.770
cm030.8750.8510.8150.7580.8860.8430.8220.7420.9090.8180.8020.751
DCGfb-sc56.84357.32445.76746.2268.5049.7096.3134.6951.7381.3710.6970.509
fb-pp216.187198.421197.420193.39524.62920.17720.02819.3582.7982.0052.0081.894
email16.15013.64412.77914.5332.2491.2971.1241.317
student8.4817.2436.1967.1050.8450.6980.5010.548
lesmis3.2972.0731.8162.262
polbooks4.4152.7462.4273.1701.0000.3100.2670.318----
ppi43.60138.47941.60942.6114.5043.7784.0864.1520.4140.2940.3130.323
netscience9.3248.2557.6967.5941.0830.8060.6880.613
GrQc52.23649.07248.51346.86.7975.0414.9284.6760.6390.4990.4710.431
CondMat199.255197.187194.376188.15721.93620.18619.84118.9822.6632.0171.9551.856
HepTh93.87087.12589.90986.92810.9358.7549.2998.8291.2860.7960.8860.821
cm05320.094316.111310.849299.21634.50732.48532.00929.9894.8713.2303.1712.973
cm03253.117247.685242.945234.60628.16425.55824.90323.6522.8612.5372.4062.323
F1-Scorefb-sc0.7440.8560.7590.2510.5500.7480.5280.0650.2500.3000.0130.013
fb-pp0.4760.3270.3150.2470.1880.0920.0290.0280.0450.0740.0070.002
email0.3880.2270.1900.2770.1110.0170.0060.022
student0.4100.2260.1390.2660.3330.1070.0130.04
lesmis0.5710.2970.2230.331
polbooks0.7000.2880.2600.2841.0000.240.040.12
ppi0.2650.0750.2260.2450.0790.0070.0130.0180000
netscience0.5950.5170.4410.3680.3330.0930.0530.013
GrQc0.3980.3520.3220.2510.1710.1020.0410.04100.02500
CondMat0.3950.4180.3430.2800.0890.0940.0360.0330.0950.0100.0080
HepTh0.3970.2790.3290.2730.1280.0540.0620.0470.1250.0130.0330
cm050.4090.4510.3660.2670.0990.1310.0450.0260.1110.0240.0110
cm030.4240.4490.3690.2790.1240.1270.0420.0310.0370.0210.0000

References

  1. National Library of Medicine. PubMed Dataset. Available online: https://pubmed.ncbi.nlm.nih.gov/help/ (accessed on 20 October 2021).
  2. Xenarios, I.; Rice, D.W.; Salwinski, L.; Baron, M.K.; Marcotte, E.M.; Eisenberg, D. DIP: The database of interacting proteins. Nucleic Acids Res. 2000, 28, 289–291. [Google Scholar] [CrossRef] [Green Version]
  3. Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; Ives, Z. Dbpedia: A nucleus for a web of open data. In The Semantic Web; Springer: Berlin, Germany, 2007; pp. 722–735. [Google Scholar]
  4. Wang, X.; Cui, P.; Wang, J.; Pei, J.; Zhu, W.; Yang, S. Community preserving network embedding. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
  5. Fortunato, S.; Barthelemy, M. Resolution limit in community detection. Proc. Natl. Acad. Sci. 2007, 104, 36–41. [Google Scholar] [CrossRef] [Green Version]
  6. Kang, B.; Lijffijt, J.; De Bie, T. Conditional Network Embeddings. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  7. Tang, J.; Fong, A.C.; Wang, B.; Zhang, J. A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE Trans. Knowl. Data Eng. 2012, 24, 975–987. [Google Scholar] [CrossRef]
  8. Haak, L.L.; Fenner, M.; Paglione, L.; Pentz, E.; Ratner, H. ORCID: A system to uniquely identify researchers. Learn. Publ. 2012, 25, 259–264. [Google Scholar] [CrossRef] [Green Version]
  9. Zhang, B.; Al Hasan, M. Name Disambiguation in Anonymized Graphs Using Network Embedding. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM ’17, Singapore, 6–10 November 2017; ACM: New York, NY, USA, 2017; pp. 1239–1248. [Google Scholar]
  10. Parravicini, A.; Patra, R.; Bartolini, D.B.; Santambrogio, M.D. Fast and Accurate Entity Linking via Graph Embedding. In Proceedings of the 2nd Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA), GRADES-NDA’19, New York, NY, USA, 30 June–5 July 2019; ACM: New York, NY, USA, 2019; pp. 10:1–10:9. [Google Scholar]
  11. Shen, W.; Wang, J.; Han, J. Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Trans. Knowl. Data Eng. 2014, 27, 443–460. [Google Scholar] [CrossRef]
  12. Zhang, Y.; Zhang, F.; Yao, P.; Tang, J. Name Disambiguation in AMiner: Clustering, Maintenance, and Human in the Loop. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 1002–1011. [Google Scholar]
  13. Ma, X.; Wang, R.; Zhang, Y. Author name disambiguation in heterogeneous academic networks. In International Conference on Web Information Systems and Applications; Springer: Berlin, Germany, 2019; pp. 126–137. [Google Scholar]
  14. Lerchenmueller, M.J.; Sorenson, O. Author disambiguation in PubMed: Evidence on the precision and recall of author-ity among NIH-funded scientists. PLoS ONE 2016, 11, e0158731. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  15. Ma, Y.; Wu, Y.; Lu, C. A Graph-Based Author Name Disambiguation Method and Analysis via Information Theory. Entropy 2020, 22, 416. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  16. Saha, T.K.; Zhang, B.; Al Hasan, M. Name disambiguation from link data in a collaboration graph using temporal and topological features. Soc. Netw. Anal. Min. 2015, 5, 11. [Google Scholar] [CrossRef] [Green Version]
  17. Hermansson, L.; Kerola, T.; Johansson, F.; Jethava, V.; Dubhashi, D. Entity disambiguation in anonymized graphs using graph kernels. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, San Francisco, CA, USA, 27 October–1 November 2013; pp. 1037–1046. [Google Scholar]
  18. Xu, J.; Shen, S.; Li, D.; Fu, Y. A network-embedding based method for author disambiguation. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Turin, Italy, 22–26 October 2018; pp. 1735–1738. [Google Scholar]
  19. Chen, T.; Sun, Y. Task-guided and path-augmented heterogeneous network embedding for author identification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, Cambridge, UK, 6–10 February 2017; pp. 295–304. [Google Scholar]
  20. Cavallari, S.; Zheng, V.W.; Cai, H.; Chang, K.C.C.; Cambria, E. Learning Community Embedding with Community Detection and Node Embedding on Graphs. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM ’17, Singapore, 6–10 November 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 377–386. [Google Scholar]
  21. Weichselbraun, A.; Kuntschik, P.; Brasoveanu, A.M. Name variants for improving entity discovery and linking. In Proceedings of the 2nd Conference on Language, Data and Knowledge (LDK 2019), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, Leipzig, Germany, 20–23 May 2019. [Google Scholar]
  22. Alhelbawy, A.; Gaizauskas, R. Graph ranking for collective named entity disambiguation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Association for Computational Linguistics: Baltimore, MD, USA, 2014; Volume 2, pp. 75–80. [Google Scholar]
  23. Guo, Y.; Che, W.; Liu, T.; Li, S. A graph-based method for entity linking. In Proceedings of the 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand, 7–13 November 2011; pp. 1010–1018. [Google Scholar]
  24. Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; Mei, Q. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 18–22 May 2015; pp. 1067–1077. [Google Scholar]
  25. Nesterov, Y. Semidefinite relaxation and nonconvex quadratic optimization. Optim. Methods Softw. 1998, 9, 141–160. [Google Scholar] [CrossRef]
  26. Luo, Z.Q.; Ma, W.K.; So, A.M.C.; Ye, Y.; Zhang, S. Semidefinite relaxation of quadratic optimization problems. IEEE Signal Process. Mag. 2010, 27, 20–34. [Google Scholar] [CrossRef]
  27. Leone, F.; Nelson, L.; Nottingham, R. The folded normal distribution. Technometrics 1961, 3, 543–550. [Google Scholar] [CrossRef]
  28. Van Leeuwen, M.; De Bie, T.; Spyropoulou, E.; Mesnage, C. Subjective interestingness of subgraph patterns. Mach. Learn. 2016, 105, 41–75. [Google Scholar] [CrossRef] [Green Version]
  29. Leskovec, J.; Krevl, A. SNAP Datasets: Stanford Large Network Dataset Collection. 2014. Available online: http://snap.stanford.edu/data (accessed on 20 October 2021).
  30. Goethals, B.; Le Page, W.; Mampaey, M. Mining interesting sets and rules in relational databases. In Proceedings of the 2010 ACM Symposium on Applied Computing, Sierre, Switzerland, 22–26 March 2010; pp. 997–1001. [Google Scholar]
  31. Breitkreutz, B.J.; Stark, C.; Reguly, T.; Boucher, L.; Breitkreutz, A.; Livstone, M.; Oughtred, R.; Lackner, D.H.; Bähler, J.; Wood, V.; et al. The BioGRID interaction database: 2008 update. Nucleic Acids Res. 2007, 36, D637–D640. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  32. Knuth, D.E. The Stanford GraphBase: A Platform for Combinatorial Computing; AcM Press: New York, NY, USA, 1993. [Google Scholar]
  33. Newman, M.E. The structure of scientific collaboration networks. Proc. Natl. Acad. Sci. USA 2001, 98, 404–409. [Google Scholar] [CrossRef] [PubMed]
  34. Järvelin, K.; Kekäläinen, J. Cumulated Gain-Based Evaluation of IR Techniques. ACM Trans. Inf. Syst. 2002, 20, 422–446. [Google Scholar] [CrossRef]
Figure 1. (a) A node that corresponds to two real-life entities belonging to two communities; links connecting the node to the different communities are drawn as full or dashed lines. (b) An ideal split that aligns well with the communities. (c) A less optimal split.
Figure 2. FONDUE pipeline for both NDA and NDD. Data corruption can lead to two types of problems: node ambiguation (e.g., multiple authors sharing the same name, represented by a single node in the network), in the left part of the diagram, and node duplication (e.g., one author whose name variations are represented by more than one node in the network). We define two tasks to resolve these problems separately using FONDUE.
Figure 3. Bar plots of the different metrics (higher is better) for each dataset listed in Table 3, for each of the 4 measures FONDUE-NDA, Degree, NC, and CC, for networks with 10% uniformly contracted nodes. More detailed results are listed in Appendix B.
Figure 4. Bar plots of the runtime performance in seconds (lower is better) for each dataset listed in Table 7, for each of the 3 measures FONDUE-NDA, NC, and CC, for different percentages of contracted nodes: 0.1%, 1%, and 10%, respectively.
Table 1. The different datasets used in our experiments (Section 4.1).
Dataset	Description
FB-SC	Facebook Social Circles network [29]. Consists of anonymized friends lists from Facebook.
FB-PP	Page-Page graph of verified Facebook pages [29]. Nodes represent official Facebook pages, while the links are mutual likes between pages.
email	Anonymized network generated using email data from a large European research institution, modeling the incoming and outgoing email exchange between its members [29].
STD	A database network of the Computer Science department of the University of Antwerp that represents the connections between students, professors, and courses [30].
PPI	A subnetwork of the BioGRID Interaction Database [31], using the PPI network for Homo Sapiens.
lesmis	A network depicting the coappearance of characters in the novel Les Miserables [32].
netscience	A coauthorship network of scientists working on network theory and experiments [29].
polbooks	Network of books about US politics, with edges between books representing frequent copurchasing of books by the same buyers. http://www-personal.umich.edu/~mejn/netdata/, accessed on 20 October 2021.
CondMat	Collaboration network of Arxiv Condensed Matter Physics [33].
GrQc	Collaboration network of Arxiv General Relativity [33].
HepTh	Collaboration network of Arxiv Theoretical High Energy Physics [33].
CM03	Collaboration network of Arxiv Condensed Matter till 2003 [33].
CM05	Collaboration network of Arxiv Condensed Matter till 2005 [33].
PubMed	Collaboration network extracted from the PubMed database (analyzed in Appendix A), containing 2122 nodes with a ground truth of 31 ambiguous nodes (6 of which map to more than 2 entities) and 1 duplicate node [1].
Table 2. Various properties of each semi-synthetic network used in our experiments.
	fb-sc	fb-pp	email	lesmis	polbooks	STD
# Nodes	4039	22,470	986	77	105	395
# Edges	88,234	170,823	16,064	254	441	3,423
Avg degree	43.7	15.2	32.6	6.6	8.4	17.3
Density	1.1 × 10^−2	6.8 × 10^−4	3.3 × 10^−2	8.7 × 10^−2	8.1 × 10^−2	4.4 × 10^−2

	ppi	netscience	GrQc	CondMat	HepTh	CM05	CM03
# Nodes	3852	379	4158	21,363	8638	36,458	27,519
# Edges	37,841	914	13,422	91,286	24,806	171,735	116,181
Avg degree	19.6	4.8	6.5	8.5	5.7	9.4	8.4
Density	5.1 × 10^−3	1.3 × 10^−2	1.6 × 10^−3	4.0 × 10^−4	6.6 × 10^−4	2.6 × 10^−4	3.1 × 10^−4
Table 3. Performance evaluation (NDCG) on multiple datasets for our method compared with the other baselines, for two different contraction methods. Note that for some datasets with a small number of nodes, we did not perform any contraction at ratio 0.001, as the number of contracted nodes would be very small; the corresponding values are replaced by “−”.
Ambiguity Rate10%1%0.1%
MethodFONDUE-NDANCCCDegreeFONDUE-NDANCCCDegreeFONDUE-NDANCCCDegree
Randomly Uniform Contractionfb-sc0.9540.9620.7680.7760.7670.8750.5690.4230.6790.5350.2720.199
fb-pp0.8990.8250.8210.8040.6490.5320.5280.5110.3740.2680.2680.253
email0.7830.6610.6190.7040.5290.3050.2640.310
student0.7780.6640.5680.6520.3960.3280.2350.257
lesmis0.9060.5700.4990.622
polbooks0.9720.6040.5340.6981.0000.3100.2670.318
ppi0.7590.6700.7240.7410.4200.3530.3810.3870.1940.1380.1470.151
netscience0.8860.7840.7310.7210.5080.3780.3230.288
GrQc0.8570.8050.7960.7680.6030.4470.4370.4150.2490.1950.1840.168
CondMat0.8640.8550.8430.8160.6010.5530.5430.5200.3670.2780.2690.255
HepTh0.8600.7980.8230.7960.5820.4660.4940.4700.3250.2010.2240.208
cm050.8840.8730.8590.8270.6270.5900.5820.5450.4710.3120.3070.288
cm030.8880.8690.8520.8230.6350.5770.5620.5340.3350.2970.2810.272
no common neighborsfb-sc0.9530.9890.7680.7640.7300.9330.5910.4180.3990.6650.3210.172
fb-pp0.8950.8260.8200.8010.6500.5320.5290.5100.3890.2660.2670.253
email0.6760.6960.6250.6040.3030.3190.2880.256
student0.6590.7260.5310.5870.3680.4470.2010.229
lesmis0.7550.5910.4980.486
polbooks0.9810.6200.5440.6961.0000.2680.4200.363
ppi0.7250.6730.7210.7000.3980.3520.3810.3730.1660.1390.1470.144
netscience0.8770.7970.7140.7050.6220.3720.3040.290
GrQc0.8610.8060.7940.7660.5800.4450.4350.4160.2800.1970.1830.173
CondMat0.8630.8550.8430.8150.5850.5540.5420.5160.3170.2740.2730.257
HepTh0.8560.7980.8240.7960.5810.4670.4940.4800.2850.2040.2120.213
cm050.8830.8740.8580.8250.6330.5910.5820.5430.4140.3100.3120.289
cm030.8840.8690.8530.8220.6510.5770.5610.5330.4390.2950.2790.271
Table 4. Performance of FONDUE-NDA compared to NC for the PubMed dataset containing 31 ambiguous author-names of which 6 are associated with more than 2 Orcid IDs (highly ambiguous). NDCG* reflects the ranking score of those highly ambiguous nodes.
	FONDUE-NDA	NC [16]
AUC	0.971	0.944
TP	17	11
FP	14	22
F1-score	0.55	0.35
NDCG	0.877	0.681
NDCG*	0.743	0.38
Table 5. The top 10 ranked nodes of FONDUE-NDA and NC, against the ground truth (1 if the node is ambiguous and 0 otherwise). FONDUE-NDA correctly classifies 9 out of the 10 nodes as ambiguous. Names with * are highly ambiguous, referring to more than 2 Orcid IDs.
NC Rank	Ambiguous	FONDUE-NDA Rank	Ambiguous
Jie_Li	0	Tao_Wang *	1
Lei_Liu	0	Wei_Zhang	1
Jiawei_Wang	0	Jing_Li	1
Jun_Liu	0	Bin_Zhang *	1
Ni_Wang	0	Yan_Li	1
Xin_Liu	0	Rui_Li	0
Yao_Chen	0	Ying_Liu	1
Huanhuan_Liu	0	Feng_Li	1
Jun_Yan	0	Yang_Yang *	1
Lei_Chen	0	Ying_Sun	1
Table 6. Adjusted Rand index score for FONDUE-NDA and MCL.
Ambiguity Rate10%1%0.1%
MethodFONDUE-NDAMCLFONDUE-NDAMCLFONDUE-NDAMCL
Uniformly Random Contractionfb-sc0.5820.7810.7810.8290.9470.936
fb-pp0.4050.5830.5040.5830.6100.600
email0.1220.2480.2140.276
student0.2150.1240.1310.183
lesmis0.5970.304
polbooks0.3910.338
ppi0.0330.1200.0550.1150.3380.321
netscience0.6100.7770.6420.809
GrQc0.4980.6650.6610.6820.7620.784
CondMat0.4710.7090.5820.7130.5740.718
HepTh0.4470.5560.5210.5520.4650.542
cm050.4710.7220.6740.7320.4880.741
cm030.4790.7240.5750.7300.5130.690
No Common Neighborsfb-sc0.6090.9520.8870.9850.8620.998
fb-pp0.4090.5860.5090.5640.5460.637
email0.1360.3150.2780.413
student0.1970.1390.0790.078
lesmis0.4050.289
polbooks0.4990.5770.6380.655
ppi0.0360.1390.0670.1510.3310.399
netscience0.7250.8190.7220.897
GrQc0.5340.6780.6650.6830.8950.872
CondMat0.4700.7080.5480.7120.4750.686
HepTh0.4690.5710.5290.5610.4040.455
cm050.4660.7260.5490.7770.4500.723
cm030.4820.7220.5700.7270.5670.697
Table 7. Execution time (in seconds) comparison table for the different datasets averaged over 10 different experiments. The embedding time of the networks was not taken into account for the computation of the runtime results.
	10%			1%			0.1%
	FONDUE-NDA	NC	CC	FONDUE-NDA	NC	CC	FONDUE-NDA	NC	CC
fb-sc	10.99	94.72	52.92	5.16	90.43	51.62	7.66	90.99	51.63
fb-pp	15.03	161.60	46.83	18.02	185.64	46.48	29.65	190.25	46.06
email	0.81	10.18	4.77	0.84	10.26	4.62	−	−	−
student	0.35	2.88	0.69	0.37	4.45	0.59	−	−	−
lesmis	0.14	0.21	0.03	−	−	−	−	−	−
polbooks	0.04	0.34	0.05	0.05	0.36	0.05	−	−	−
ppi	4.09	49.23	15.64	4.17	50.89	15.03	3.53	50.73	14.95
netscience	0.12	0.89	0.08	0.13	0.92	0.09	0.13	0.91	0.09
GrQc	1.53	13.07	2.06	1.66	13.36	2.07	1.69	13.29	2.04
CondMat	11.41	82.63	13.15	12.36	86.42	13.11	13.76	88.26	12.75
HepTh	3.47	28.04	2.85	3.65	27.22	2.81	4.00	27.68	2.77
cm05	23.04	143.23	30.02	24.69	144.98	25.12	26.01	137.92	25.51
cm03	16.10	116.26	17.82	20.33	137.57	17.12	18.96	124.25	17.32
Table 8. Results of the controlled experiments for each dataset: the average ranking of the objective cost function over 100 different trials (the lower, the better). Bold numbers indicate that the difference in averages is significant (p ≤ 0.05).
Edge DistributionMinimum DegreeEdge OverlapLemisPolbooksNetscience
FONDUE-NDDEDFONDUE-NDDEDFONDUE-NDDED
BalancedNone0%18.77518.230.02517.856.9754.2
20%15.558.7522.47510.653.92.825
30%14.1259.3520.48.53.2251.775
Graph Average0%1015.80617.67620.9415.3255.7
20%10.16711.08311.47112.9412.7753.025
30%8.6119.33310.79411.1762.7252.625
2x Graph Average0%3.85724.8575.72723.3643.4715.029
20%517.4293.81815.9091.7353.412
30%2.85713.4292.54514.0911.7353.206
UnbalancedNone0%25.922.7536.42526.3256.92.85
20%16.7510.7525.875103.752.55
30%16.310.52522.87512.0753.652.775
Graph Average0%13.517.3062723.1765.1255.075
20%12.41713.27813.02911.7063.13.15
30%12.94412.16714.26512.0293.0252.675
2x Graph Average0%18.14340.14311.54522.5455.7358.147
20%9.42918.4298.36416.1822.1183.5
30%5.71419.286712.1821.7352.794
Table 9. Runtime for FONDUE-NDD in seconds, for 100 iterations (contracting each time a different random node pair and computing its embeddings).
Dataset	Execution Time (Seconds)
lesmis	609.34
polbooks	712.29
netscience	1198.63
pubmed	3474.82