1. Introduction
Data are critical and essential to deal with in every field. Due to a huge amount of internet and database applications, there is a need to deal with huge data. Many real-world datasets can be represented by a graph which consists of set of vertices and set of edgesepresenting nodes, and interaction between entities is represented as edges [
1]. A simple graph is a graph where only one edge is allowed between its vertices. Most of the real-world datasets are represented by multi-graphs in which more than one edge type between vertices are allowed in such graphs. Although dealing with multi-graph data is very important, most of the research in graphs deal with a simple graph. The simple graph represents many interaction systems such as social networks, Resource Description Framework (RDF) data, chemical compounds, protein–protein interaction (PPI) networks, physical interactions, and complex networks. Moreover, the multi-graph represents the more powerful presentation of real interaction systems because its presentation has many relationships between objects of the system, so the multi-graph has a lot of information to explore valuable patterns.
One of the most important mining problem patterns in dealing with graphs is the subgraph-matching problem [
2]. This problem can be represented as given G, which is a graph (simple graph or multi-graph) (for more details see Definitions 1 and 2), and given q, which is a query graph (simple graph or multi-graph). The subgraph-matching problem is to find all matches of q in G. This problem has two directions: exact or inexact (approximate) graph matching. Exact graph matching is known as graph isomorphism and requires the function between two graphs’ vertices to preserve the adjacency. Moreover, it needs to compute the optimal mapping. However, inexact graph matching is based on computing a sub-optimal mapping as an assignment problem.
Generally, the subgraph isomorphism problem is a Non-deterministic problem (NP-complete problem). However, some specific graphs can also have a small complexity [
3], for instance, the particular case in which the big graph is a forest and the small one to be matched is a polynomial-complexity tree. Some existing scenarios deal with such a problem as the feature-based indexing and the enumeration approach. In the feature-based indexing approach, which is defined by mapping vertices of a graph to vector space (feature space), which will be used to minimize the candidate set for each query vertex, this vector space is used in similarity, classification, and clustering tasks. Feature-based indexing is followed by filtering and verification. Some graph patterns are chosen as indexing features throughout the filtering in order to reduce candidate graphs. In the verification, they are checking for the subgraph isomorphism using the selected candidate.
The enumeration method relies on backtracking techniques. This approach aims to find embedding by the growth of partial solutions and avoids using indexing. There exist other approaches; one of them is based on defining the equivalence classes at query or database level or both and defining them by exploiting vertex relationships.
Cordella et al. [
4] presented the first version of their subgraph isomorphism algorithm, where they examined its performance for the isomorphism of small and medium-size graphs. The algorithm, using a set of feasibility rules, allows us to significantly prune the search space and the computational cost of the matching process.
Most research in subgraph isomorphism problems focuses on obtaining small candidate sets, producing effective matching orders, and improving searching methods. Filtering methods are used to obtain small candidate sets. Defining new data structures aims to acquire small candidate sets as well as produce effective matching orders, while using symmetry breaking has a main role in improving the searching methods.
In [
5] they presented an algorithm for graph/subgraph isomorphism suited for dealing with large graphs. The first version algorithm is improved by analyzing the features in detail with special reference time and memory requirements. The technique offers a wide range of applications because the graph topology is not constrained. A state-space representation (SSR) of the matching process and five feasible rules for decreasing the search tree are described. The selected representation enables one to simultaneously compare the syntactic and semantic properties of the node–pair pairs that need to match. The primary improvement made to an early implementation of the VF algorithm, described in [
4], is that the data structures utilised during the exploration of the search area are set up in a way that significantly reduces memory requirements. As a result, the method may match graphs with a lot of nodes and branches.
Moreover, some early subgraph-matching algorithms such as VF and other (see [
4,
6,
7,
8,
9]) find candidate sest by using local filters that consider the neighborhood of vertices. On the other hand,
and others such as [
10,
11,
12,
13] build auxiliary data structures on a query and data graph to obtain small candidate sets and produce effective matching orders by estimating as precise a search cost as possible.
In [
14], the search process is performed via a backtracking algorithm that, at each step, reduces the bit-vector domain. Before starting the search, it completes two preliminary steps. The first one, Prematch, fills domains by using vertex invariant to select them based on labels and topology. The second step avoids this by locally guaranteeing that two pattern vertices cannot match the same target vertex. After the initial stages, the pattern’s vertices are arranged as indicated by a static search method. The pattern vertex with the most branches between it and the partial solution is the one that will be matched next. The sequence begins with each pattern vertex that has a single, compatible target vertex. It chooses the vertex with the biggest sum of its neighbours’ degrees when there are two vertices that are equally qualified to be the following vertex in the ordering.
The constraint fulfilment issue with subgraph isomorphism can be explained. Finding an assignment of values (target vertices) to all variables such that all criteria are met given a collection of variables (pattern vertices) and a set of constraints constitutes a constraint fulfilment problem for the subgraph isomorphism problem.
Christine Solnon in [
15] showed that the constraint fulfilment issue with subgraph isomorphism can be explained. Finding an assignment of values (target vertices) to all variables such that all criteria are met given a collection of variables (pattern vertices) and a set of constraints constitutes a constraint fulfilment problem for the subgraph isomorphism problem.
Messmer and Bunke in [
16] proposed a new method for graph and subgraph isomorphism detection based on a decision tree representation. The decision tree is generated offline from a priori-known model graphs. At run time, the decision tree is used to detect all graph and subgraph isomorphisms from an input graph to any of the model graphs in time that is only polynomial in the graphs’ size and independent of the number of model graphs.
In [
17] the authors introduced a symmetry-breaking node equivalence for pruning the search space in backtracking algorithm for subgraph-matching problems, and also proved that backtracking algorithm for the monomorphism search problem (i.e., a general framework for subgraph matching) which is that its complexity equals the number of one-to-one function between query and data graph vertices.
The subgraph-matching problem in multi-graphs is relatively new. Ingalalli et al. [
18] presented SuMGra, a feature-based indexing method that supports subgraph matching in a multi-graph. SuMGra mapped nodes to a six-features vector such that one of them is a structural feature and the remaining features are information of multi edges. SuMGra uses a high-dimension R-tree index for indexing feature vectors and Ordered Trie with Inverted List for multi indexing edges of nodes, which use file techniques.
Subgraph matching has many applications in different fields such as network analysis [
19], RDF query processing [
20], cheminformatics [
21], bioinformatics [
22,
23], and malware detection [
24]. Moreover, the problem is used as a black box or subroutine in finding frequent subgraphs [
1,
25], so the problem is a partner with the frequent subgraph mining problem in applications of finding frequent subgraphs such as analysis and understanding of complex networks, finding motifs and graphlets [
26], and biology [
27]. Moreover finding frequent subgraph is useful in classification [
28].
This paper proposes a node embedding-based solution for multi-graph matching. The proposed model is composed of two stages. The first step’s primary concept is that the density-like neighborhood structure is found in the data graph for the densest extracted vertex in the query graph to obtain the densest subgraph, then the k-nearest neighborhood query is found. For each layer graph, the second step, mapping the vertex to the feature vector (Vertex Embedding), improves the model proposed. The principal component analysis (PCA) approach is used to minimize the node embedding size to be effective with the KD-tree indexing. To eliminate the redundancy in the generated pattern matching the query graph, symmetry breaking conditions will also be used. The filtering method is implemented to minimize the number of candidate data nodes of the initiate query vertex. Finally, the effect of the concatenation of the structural features (orbits features) with the meta-features (summary of general, statistical, information-theoretic, etc.) for signatures of nodes on the model performance is tested. The proposed method guarantees that when a query graph has a match in the data graph, the candidate set has at least one vertex arrive at a solution. Using symmetry breaking helps to generate all distinct embeddings of the query graph.
The rest of this paper is organized as follows: Theoretical background and steps of the proposed model are presented in
Section 2 and
Section 3, respectively. Experimental scenarios and discussions are introduced in
Section 4. Finally, conclusions and future work are presented in
Section 5.
2. Theoretical Background
Firstly, it is assumed that all graphs are unweighted (i.e., edges have no weight) and loop-free (which means no edges from any vertex to itself).
Definition 1. Unlabelled, Undirected Simple Graph. An unlabelled undirected simple graph is a pair (V, E) such that V is the set of vertices and E is the set of undirected edges, edges.
A simple graph is a very simple concept in graph theory, but it is useful as a model of some problems.
Definition 2. Unlabelled, Undirected Multigraph Graph. An unlabelled, undirected multi-graph G is a tuple of four parts (V, E, , ) where V is the set of vertices, E is the set of undirected edges, , is the set of labels and , such that is the power set of .
The multi-graph is the more effective way to present real interaction systems because it has many relationships between system objects, giving it a lot of information to look at valuable patterns.
Furthermore, the multi-graph is a simple concept in graph theory and it is generalized of a simple graph. However, it is very powerful, interesting, and useful as the model of a lot of problems. A data multi-graph G is shown in
Figure 1.
One of the most important features is the graph core number for each vertex. To calculate the graph core number for each vertex the multigraph will be transformed into the simplified graph.
Definition 3. is the set of vertices of (multi)graph G, is the set of edges of (multi)graph G and . Max degree is denoted by d.
Definition 4. Neighboring Graph. Given an unlabelled undirected simple graph H, the set of a neighbor of a vertex v: = { | }.
Definition 5. Simple Subgraph. A simple subgraph h of H is a graph such that and .
Definition 6. K-core. Given an unlabelled, undirected simple graph H and an integer k, the K-core of H is the maximal sub-graph h of H such that the minimum degree of h is at least k. That is, every vertex in h is connected to at least k other vertices in h (i.e., , ).
Definition 7. Core Number. Given the unlabelled, undirected simple graph H, the core number of vertex v in H, denoted core(v), is the largest k such that the K-core of H contains v.
Definition 8. Simplified Graph. Given an unlabelled, undirected multigraph Graph G = (V, E, , ). Then, the simplified graph such that and , means the simplified graph is the multigraph without multi-edges as in Figure 2. The simplified graph is the multi-graph after replacing all multi-edges with one edge, indicating the existence of relations between these vertices regardless of the exact number of edges and their relationships. The simplified graph is used to obtain the core number.
The core number is used to build features of vertices. By definition of core number, observe that core number is the dense measure of a vertex. If the core number is high, then the vertex will be with high density. In the following sections, we will know the importance and how to compute it by an efficient algorithm. For example, the core number of
of a simplified graph
, a multi-graph Q1 is shown in
Table 1 which is extracted from
Figure 3.
Definition 9. Triangle number of the vertex. Given a simple unlabeled graph, the triangle number of v (i.e., tr(v))is the number of triangles that v is in.
core(v) ≤ deg(v).
Definition 10. Multigraph homomorphism. Given a multigraph and a multigraph , the subgraph isomorphism from Q and G is a function such that: , then .
The multigraph homomorphism is computed implicitly in the proposed model
Definition 11. Multigraph isomorphsim. Given a multigraph and a multigraph , the subgraph isomorphism from Q and G is a bijection function (i.e., one-to-one and onto function) such that: iff .
Definition 12. Multigraph automorphism. Given a multigraph , the subgraph isomorphism from Q and G is a bijection function (i.e., one-to-one and onto function) such that: iff .
Given u,v . Define an equivalence relation over V(G), ∃ an automorphism such that , then u and v in a equivalent class. The equivalence classes are called orbits.
Definition 13. Subgraph isomorphism for multigraphs. Given a query multigraph and a data multigraph , the subgraph isomorphism from Q and G is an injective function (i.e., one-to-one function) such that:
Definition 14. Vertex Signature. The vertex signature of vertex v is a multiset contains labels of multi edges that are in the incident on v. Mathematically, .
Remark 3. By definition of vertex signature, observe that deg(u) equals the number of sets in(i.e., Cardinality of the vertex signature).
For instance, in Q1,
= {{2},{5},{2,4}}, all vertex signatures of vertices of the multigraph of
Figure 3 are depicted in
Table 1.
Definition 15. Candidate set. Given a graph Q, then the candidate set of vertex u C(u) is defined as C(u) =⋐}, such that ⋐ is the subset operation on a multiset. Define ⋐ operation as a function F from to as an injective function (i.e., one-to-one), F(x) = y, and ∀ v ∈ C(u) is similar to u.
The candidate set is an essential concept in many algorithms and techniques. It is like a starting point for algorithms to obtain solutions. The difficulty is in obtaining a candidate set and computing it efficiently. The next sections will give more details of importance, use, and obtaining candidate sets in the algorithm.
3. Subquery Matching in Multigraph
In this section, the general framework of the proposed algorithm is presented. The proposed algorithm consists of two phases: indexing for store feature vectors and multiedges (
Section 3.4). Subgraph search space to enumerate all functions(i.e., embeddings) (
Section 3.5). In the two phases, K-core, core number, and vertex orbits counting are the basic tools (Sub
Section 3.3).
3.1. Problem Definition (Sub-Multigraph Query Matching)
Given a connected query multi-graph Q such that the set of vertices of Q
, data multi-graph G such that
, the sub-multigraph query matching problem is to enumerate all the embeddings of Q in G, so that each embedding is isomorphic to the query Q. Mathematically, sub-multigraph query matching is to enumerate all possible functions
from Q to G. Given G is shown in
Figure 1 and Q2 is shown in
Figure 4, Algorithm 1 has to enumerate all possible embeddings (i.e., enumerate all possible functions
) of Q2 to G. All embeddings are as follows:
Match1 is a set of ordered pairs such that the first coordinate is the query vertex
and the second coordinate is the data vertex
, which is the image of
. Q2 in
Figure 4 has one embedding in G. If the query multi-graph has at least two embeddings in G then all embeddings are isomorphic to each other.
Algorithm 1: The proposed algorithm framework |
- 1:
Building off-line index KD-tree - 2:
= GetInitiate() - 3:
U = OrderQueryVertices() - 4:
= SelectCandidate() - 5:
All Matching = ∅ - 6:
for all v ∈ do - 7:
= - 8:
= v - 9:
Matched List = [, ] - 10:
All Matching: = RoutineSubgraphBacktracking (All Matching, Matched List, Q, G) - 11:
end for
|
3.2. Overview of the Proposed Algorithm
Generally, in the proposed Algorithm 1 the following steps are performed. Firstly, build index KD-tree off-line (Line 1). Find initiate query vertex (
) through GetInitate function by ordering query vertices using the proposed order (
Section 3.5). In (Line 3), order all query vertices by effective order, representing U to use it in a subgraph search. Use index KD-tree to find candidate set
that matches for
(Line 4). Matching starts with sub-match or partial matching by match
with v in
and calling RoutineSubgraphBacktracking function (
Section 3.4 and
Section 3.5) to match remain query vertices (Lines 6–10).
3.3. K-Core, Core Number and Vertex Orbits Counting
A graph’s k-core has interesting properties: applications such as community search, locating influential, keyword extraction from text, link spam detection, real-time story identification, dense subgraphs, and clustering. The k-core of a graph is the maximal subgraph with a minimum degree of at least k (k-core is well-defined) [
29]. It is readily demonstrated that this subgraph is unique by contradiction (i.e., maximal propriety). The maximum k such that G has a k-core, which is the maximum core number of G. There are a lot of algorithms to compute and obtain a k-core graph and core number of vertices [
29]. The core number of vertices, can be found in polynomial time (
), but can actually be obtained in
m) time such that
n =
and
m =
, by modifications on bin sorting in implementation (where
refers to the time complexity). However, the algorithm in [
30] is embedded in the Algorithm 2 which has an
complexity.
Peeling technique can be performed with a list (or array) linear heap data structure or CoreD-Local (or CoreD-Local-opt) algorithm with h-index in
[
31]. The peeling algorithm’s idea attractively removes (i.e., peels) the minimum-degree node of a graph. The step of obtaining min from the set of nodes which are not visited nodes by the linear heap data structure can be achieved in constant time
. Algorithm 2 is the fastest one applied to the suggested datasets. So, the Algorithm 2 is used to acquire the core number feature for each vertex in the data and query graph. The relation between core number and degree of a vertex is not necessarily positive correlation (or correlated) (for example, the star graph is a simple graph of order n such that each vertex has degree equals 1 except that one of them has degree equal to
n− 1. Then, the core number of each vertex is 1, …). However, this property is satisfied in some datasets.
Vertex orbits counting in simple graph, given orbit
and vertex
v,
= number of occurrences of v in orbit
as subgraph.
= number of occurrences of v in orbit
as induced subgraph such that
[
32].
Algorithm 2: The Core Number Algorithm |
- 1:
Compute the degrees of vertices, deg. - 2:
order the set of vertices V(G) in increasing order of their degrees. - 3:
for all v in V(G) in the order do - 4:
core[v] = deg[v]. - 5:
end for - 6:
for all u in N(v): Neighbour of vertex v do - 7:
if then - 8:
deg[u] = deg[u] − 1. - 9:
reorder V(G). - 10:
end if - 11:
end for - 12:
Return the list of core numbers
|
3.4. Indexing
This subsection shows how to use indices (i.e., KD-tree and set-trie) and features usefully in the proposed algorithm. It also details how to compute them effectively. The indexing is split into two phases, offline and on-line indexing, as follows:
offline indexing: In this phase, we compute the useful and efficient features and build indices on data graph G (i.e., indexing feature vectors of data vertices and multi edges).
On-line indexing: In this phase, using indices to find candidate sets for matching query Q in graph G.
Firstly, our criteria are choosing the features such that they are not constant in each vertex of the data graph for at most cases of a graph, distinct features, and no two different features are equivalent. Secondly, splitting the features into two kinds of density features and similarity features. They assume that density features have more priority than similarity features to reach into the dense subgraph before obtaining similar data vertices. By the word “dense”, we mean roughly that the subgraph contains a many edges and it is well connected.
Surely, there exist features that will be the density and similarity feature.
3.4.1. Off-Line KD-Tree Index
Mapping each vertex of the data graph into D-vector (D = 8) space and each coordinate corresponds to the features for reducing the sub-problem of graph mining to the geometric problem. Mathematically,
: V(G)∪ {
}
.
Table 2 shows features and its descriptions.
The density features
,
, and
are used to reach the dense subgraph. The similarity features
such that
are used to obtain the similar vertices given query vertex in the query graph Q. In other words,
,
and
are for density and the remaining features are for similarity. Constructing the KD-tree index by organizing the information supplied by some efficient features are in
Table 2, the feature vectors of vertices of data graph are shown in
Table 3.
To compute the core number for each vertex in the data graph, we have to construct the simplified graph
and then compute the core number for each vertex in
. The first feature,
, is a core number used to obtain the dense vertices (i.e., have a lot of relations) to reach dense subgraph to search in it. The core number feature is computed by the Algorithm 2 in
. The triangle number feature
was used to obtain more dense vertices and symmetry area, computing it for all vertices by the algorithm in [
32]. The remaining features can be computed directly (by using graph representation adjacent list). We keep on inserting data graph feature 8-vectors into the KD-tree. Then the offline index is built.
3.4.2. On-Line KD-Tree Index Query
Given G, which is a multi-graph (Definition 2), and given Q, which is a query graph, finding candidate sets in the proposed algorithm is very critical and important for matching a query graph Q in graph G. In (Line 4) in Algorithm 1.
is the candidate set for matching the initiate query vertex
, which is found by querying the KD-tree index (
Section 3.4.1).
KD-tree index solves many problems such as the nearest neighborhood, k-nearest neighborhood, range query problems, and many geometric problems.
Finding candidate sets is a problem approaching the k-nearest neighborhood problem as the solution for computing candidates. K is used as a parameter in the method. Its value changes according to a dataset and query graph (i.e., clique or non-clique graph). This parameter gives flexibility for finding candidates. The challenge is finding dense area using feature-based in sparse (i.e., |(Q)| = ) query graphs, vice versa when query graph is dense (i.e., |(Q)| = ).
In on-line indexing, we query with query vertex (i.e., initiate query vertex ) of query graph Q to obtain candidate set .
The solution of k-nearest neighborhood equals the first version of the candidate set of given query vertex. Then we can compute the candidate set by the effective way by this reduction. Before finding
it is necessary to find
(
Section 3.5).
3.4.3. On-Line and Off-Line Set-Trie Index
This subsection aims for finding candidate sets for rest query vertices to match it in the graph G. In
Section 3.4.1 and
Section 3.4.2 responsible for finding candidate set for
, using KD-tree index.
Definition 16. SuperSetQuery (v,{x,y}). It is a function that has two parameters, namely a data vertex v and a multi-edge between x and y {x,y}. The data vertex v is used to construct a set-trie index for its neighborhood subgraph structure. The multi-edge between x and y {x,y} from of query graph as a query. This function returns all data vertices which are in N(v) such that the multi-edges between them and v (i.e., ) are a superset of query multi-edge {x,y}.
The set-trie index helps to find all possible candidates for the rest of query vertices. During recursion of Algorithm 3 it refines this set of all possible candidates to another set for reducing search space in Algorithm 4.
A set-trie is a tree data structure similar to an ordinary trie data structure. It builds the set-trie, such as building the ordinary trie data structure [
33]. In the proposed algorithm, building a set-trie of multiedges for a data vertex (i.e., vertex signature
), for instance
, is
= {{1},{2},{3},{2,3,4}}, its neighbourhood subgraph of
is shown in
Figure 5a and its set-trie is shown in
Figure 5b. For finding all possible candidates, assume building a set-trie index for some data vertex such as
and querying in the index by a multi-edge such as the multiedge between u2 and u3 (i.e., {u2,u3}). We then have the output set of candidate data vertices for
. This set satisfies SuperSetQuery and its output is {v6,v7}. Moreover, set operations operated using a succinct data structure which is a compressed bitmap, and for datasets that have at most 32 kinds of relations, bitmasks are used.
Algorithm 3: Routine Subgraph Backtracking |
- 1:
Get - 2:
MatchCandidateVertices = FindJoinable (, , ) - 3:
for all∈ Match candidate vertices do - 4:
.add() - 5:
.add() - 6:
Matched List = [, ]/*such that didnot matched, without this condition the algorithm compute homomorphism (see Definition 9). */ - 7:
Routine Subgraph Backtracking (All Matching, Matched List, Q, G) - 8:
if All Matching.size == Q.order then - 9:
All Matching.union(Matched List) - 10:
end if - 11:
.pop() - 12:
.pop() - 13:
end for - 14:
return All Matching
|
Algorithm 4: RoutineFindJoinable |
- 1:
= adj() - 2:
: = Corresponding Matched vertices of - 3:
PossibleCandidateVertices = SuperSubSetQuery([i], {[i], }) - 4:
for all v ∈ PossibleCandidateVertices do - 5:
if then - 6:
MatchCandiateVertices.add(v) - 7:
end if - 8:
end for - 9:
return MatchCandiateVertices
|
3.5. Subgraph Query
This section presents the method of finding embeddings(i.e., functions from query and data graph) of query multi-graph, preserving connection and structure of Q and ordering.
3.5.1. Finding Elegant Initiate Query Vertex
Definition 17. Elegant vertex. An elegant vertex is in a dense area (or subgraph).
Finding an elegant initiate query vertex is important for finding the candidate set in Line 4 of Algorithm 1. We order V(Q) in decreasing order of three score functions.
- (1)
Core number, computing core number over efficiently by Algorithm 2, = core(u), .
- (2)
Degree of (i.e., number of incident edges) in , = deg(u), .
- (3)
Triangle number, = tr(u), .
Assume that the score priorities in ascending order are
,
,
(i.e., ordering according by
and if two query nodes have the same
then ordering according by
). The initiate query vertex
will be the top of the order. In
as showed in
Figure 4, the
Table 4 shows the core number and degree of query vertices of
. After applying the previous method, finding the total order of
is
, then the elegant vertex of
is
which is
.
Lemma 1. The initiate query vertex is elegant.
Proof. The proof is obtained in the previous method of obtaining . □
3.5.2. Query Vertex Ordering
The order which is in line 3 in Algorithm 1 is very critical because it is used in query searching to obtain (Line 1 in Algorithm 3) which is the vertex that should be matched.
Define the total order over such that it satisfies the condition: Assume and such that for every 3-tuple (u,v,x) if u v x, {u,v} ∉ E(Q) and {u,x} ∈ E(Q) there exists y such that y u and {u,y} ∈ E(Q). When expanding a vertex, sort its not-visited adjacent vertices in decreasing order by three score function , , .
After obtaining
, apply the total order
using it. In Q2 as shown in
Figure 4, after applying the total order
obtaining
=
. Score functions depend on the structure of the subgraph (or graph) to reach dense and similar subgraphs.
3.5.3. Candidates for Initial Query Vertex
In
Section 3.4 we showed on-line indexing and some of the details for finding. This subsection shows the remaining details of finding
. First, selecting candidate set
given initial query vertex
and then start matching and perform the Subgraph Search. By the definition of the candidate set, candidate set has to satisfy the following two conditions:
To satisfy the first condition and save the time of subset operation on multiset computation, using the necessary condition “candidate set has all of the data vertices except feature that has features greater than features measures ”. Mathematically, .
To satisfy the second condition (i.e., similarity condition), we define the similarity set of
as follows:
. Now we compute
using the k-nearest neighborhood query in KD-tree. This condition is a refinement for the candidate set. Now we find the candidate set by merging two queries (i.e., range and k-nearest neighborhood), find the 7-upper similar query using the modified k-nearest neighborhood query, and use k as a parameter, as mentioned in
Section 3.4.2. After computing and filtering
such that it satisfies the two conditions, we sort it in decreasing order of the feature vectors, with decreasing priorities from
to
, to get maximized priority candidate vertices.
Theorem 1. SelectCandidate () function outputs at least one valid candidate vertex.
Proof. The proof is obtained in the previous method in the
Section 3.5.3. □
3.5.4. Subgraph Searching
In subgraph search Algorithm 3 there is the function responsible for finding or finishing the matching the query graph Q in the graph G, after matching with the candidate vertex. Backtracking is used as a searching technique for this function. In this function, after the partial matching (Line 9 in Algorithm 1), obtain for matching it (Line 1) by the total order . In (Line 2) we finding candidate data vertices for . If v, which is in Line 6 of Algorithm 1, is a good matching for , the backtracking grows the matching Q in G, else (i.e., v is not suitable matching for ), the backtracking reverts back and tries the next candidate vertex (Lines 3–13).
In the FindJoinable Algorithm 4 there is the function responsible for preserving the structure and connections of query graph Q. First, we find matched neighbor of and the corresponding matched vertices in a data graph (Lines 1–2). Second, a candidate for query vertices is found.
Now we Find possible candidate vertices for (Line 3) by querying SuperSetQuery in set-trie index, building an index over multi-edges of neighbors of the [i] vertex, query multi-edge between [i] and and use intersections operations over vertices that are output of SuperSetQuery. Now and after, we obtain the vertices to satisfy the structural connectivity of the query graph as in Lines 4–8 for reducing the number of candidates by checking vertex signature of v (Line 4). Vertex signature of v is a superset (using ⋐ operation) vertex signature of .
To solve the problem that is the superset between
and
, we approach the problem as a maximum matching problem on a bipartite graph in
using [
34] algorithm, such that E will be number of edges (i.e., number of subset or superset relations between the two partitions of bipartite graph) and V will be number of vertices (i.e., number of elements of
and
which equals
). To approximate the subset in multiset ⋐ usingthe embedding in
Section 3.6 we just check that
in O(1). RoutineFindJoinable guarantees that preserving the structure and connections of the query graph is proved by induction on order of vertices
[
18].
3.6. Improvement Algorithm by Vertex Features Using Structural and Meta Features and Symmetry Breaking
In this subsection, we apply the embedding vertex (i.e., mapping vertex to feature vector) to structural features for each layer graph (i.e., is simple graph induces one relationship) and meta-features for vertex signature (i.e., information about multi-edges such as the number of multi edges, statistics information, information-theoretic measures, etc.). The embedding will be useful for computing orbits (i.e., partitioning nodes such that each partition has nodes with the same structure) in the graph (i.e., query or data graph), and orbits will be used in symmetry breaking.
3.6.1. Vertex Features Using Orbits and Meta Features
We embed vertices of the query and data graphs to structural and meta-features. Structural features are number of occurrences (induced or non-induced) of the vertex for all 73 orbits in [
32] for each layer of the graph. These are concatenated; the algorithm uses induced occurrence.
Table 5 shows the structural features of data nodes up to pattern order 3 of induced graph or layer of the graph of a working relationship.
Meta features are summaries (i.e., min, max, mean, …) of general, statistical, information-theoretic, itemset, complexity, clustering features...using [
35] for signatures of nodes. So, node embedding will be concatenation structural and meta features. Furthermore, the embedding preserve the order in
Section 3.5.3 (i.e., v ∈
then for all dim. d
), for example
in
Figure 6 can be a candidate of
or
in
Figure 7 and
,
,
and
in
Figure 6 can be candidates of
. However, node embedding size will be large and it is not efficient for KD-tree indexing for acquiring a candidate for
. So, using dimension reduction techniques such as PCA, embed node feature to 2D feature vector is expedient.
3.6.2. Symmetry Breaking Condition
In Subgraph Backtracking Algorithm 4 will generate all pattern matches with the query graph but among generated patterns there exist repeated patterns or redundancy in patterns. So, symmetry breaking will solve the redundancy problem.
The graph’s automorphism is a graph isomorphism between the graph and itself. Thus, graph automorphism generates all corresponding one-to-one mappings between the graph and itself, orbits of the graph which are a partition of vertices, and each partition has nodes which are permuted such that the graph still has the same structure. So, symmetry breaking breaks symmetry in each orbit. Then Algorithm 4 will generate all different matchings.
Computing orbits of the query graph, using the embedding in
Section 3.6.1 preserves node orbits or structure and clustering embedding using any clustering algorithm such as k-means. Each cluster is an orbit with similar nodes. Moreover, after acquiring a candidate of
and before adding mapping in line 4 and 5 in Algorithm 4, we then check the condition (Symmetry Breaking Filter):
[if and and are in the same orbit then
Lemma 2. Given a multi-graph Q, Aut(Q) let be the set of all automorphisms of Q. The Symmetry Breaking filter generates all symmetry breaking conditions.
Proof. We need to prove that if and and they are in the same orbit, then or vice-versa—that total order according to specific indexing of vertices. Symmetry Breaking filter generates that if and and they are in the same orbit and according of query vertex order then or vice-versa, then or vice-versa for all i,j . □
Theorem 2. The Algorithm 1 with Symmetry Breaking filter generates all distinct embeddings of query graph Q in data graph G.
Proof. Let O be an embedding of Q in G and assume that for all i, j, a, b , i ≠ j and a ≠ b, and are vertex indices (i.e., ids) in O that can be mapped to the same vertex . Since Φ is a one-to-one function, then there exists . Moreover there exists two automorphsim and are matched. Two possibilities exist: firts, (i) that the algorithm breaks. That means or vice versa, in this case, using Symmetry Breaking filter will break one of or and then the algorithm does not generate this mapping. (ii) the algorithm does not break. Then the algorithm generates mappings that contain and means that the automorphism maps to or vice-versa without symmetry break condition, from previous lemma. Then, by contradiction from case (ii), O will be mapped only once. □
In
Figure 4 Q2, if
and
, then orbits will be
. In
Figure 7 and
have the same embedding.
Moreover, after obtaining a candidate of the initial query vertex in Algorithm 1, it is refined using subset operation on multiset ⋐ in def 2.16.
4. Experimental Results and Discussions
In this section, the performance of the proposed algorithm on real and synthetic (random) multi-graphs is evaluated.
The evaluation is performed by comparing the proposed method with querying multi-graphs via an efficient indexing algorithm (SuMGra) [
18]. The proposed model consists of two phases. The first phase’s main idea is finding the densest query vertex, applying the filtering process to minimize the number of candidate data nodes of the initiate query vertex, finding the density similar neighborhood structure in the data graph, and finding the k-nearest neighborhood query to extract the densest subgraph.
The main idea of the second phase is mapping the vertex-to-feature vector (Vertex Embedding) for each layer graph, using a dimension reduction, principal component analysis (PCA) method to reduce the node embedding size to be efficient with the KD-tree indexing, using symmetry breaking condition to remove the redundancy in the generated pattern that matches with query graph, applying the improved filtering process to minimize the number of candidate data nodes of the initiate query vertex, and testing the effect of the concatenation of the structural features (orbits features) with the meta-features (summary of general, statistical, information-theoretic, etc.) for signatures of nodes on the model performance.
For ease of reading, we use the word “Proposed” to indicate the first phase of the proposed model, while we use “W/O” to indicate the second phase of the proposed without meta-features and “improvement” to indicate the second phase of the proposed method with the meta-features.
All the experiments were run on a PC, with Intel Inside CORE i3 processors 2.00 GHz, and 4 GB RAM, running on Windows10 OS. All algorithms in the paper have been implemented using python.
4.1. Description of Datasets and Query Subgraphs
In this research, experiments are executed over five datasets (three real multi-graph and two random multi-graph datasets). The three real multi-graph datasets are available at [
36]. Dataset descriptions and some statistics are shown in
Table 6. The genetic multi-graph HUMAN-HIV1 takes into account several genetic connections for biological organisms. These interactions include physical association, direct interaction, colocalization, association, and suppressive genetic interaction, which is determined by inequality. PLASMODIUM is a genetic multi-graph that concerns Plasmodium Falciparum and has relations such as Direct interaction, physical association, and association. EU-AIR is a TRANSPORTATION multi-graph, which is composed of thirty-seven different layers, each one corresponding to a different airline operating in Europe.
The two random multi-graph datasets are generated to test the efficiency of all algorithms. We generate query subgraphs with three types of random multi-graphs ordered from three to six nodes, and the following kinds of graphs: (i) a tree that is sparse (i.e., n − 1 edges) connected acyclic graphs; classes of graph such as paths and stars are also generated. (ii) Erdos-Renyi random graph with probability 0.5 to add edge which is dense (i.e., O()) graph. (iii) clique, which is a complete graph (i.e., every pair of a node is connected). All nodes of all types of graphs of signature sizes from 1 to 4 are randomly generated. For each multi-graph dataset, we generated 1000 samples for each kind. We report the average time for the first 1000 embedding for each query graph, and the queries which have no embedding are not counted in our experiment.
Synthetic data which are a random multi-graph are generated as a data graph that has 100 to 200 nodes and is a dense graph. We generated two kinds of synthetic data. The first has five relationships and the second has 10 relationships.
4.2. Performance of the Proposed Algorithms
In this section, the results are presented using different metrics, such as the average time for all orders of query and 517 kinds of graphs, the average size of for all kinds of queries (i.e., trees, cliques, and random 518 graphs), and the embedding of a node feature in the plan (i.e., node embedding).
In the
Figure 8 shows the avg size of
for Human HIV4, plasmodium, and EUAi datasets.
Figure 9 shows the avg size of
for the two random multigraph random 1 and random 2 datasets.
The average size of
, in the second phase of the proposed algorithm gives improvement of the first phase with/without meta-features methods. Filtering is better than the proposed and SUMGRA. In
Figure 8 SUMGRA the difference is remarkable, and in the first phase of the proposed algorithm the difference is remarkable as shown in in
Figure 8.
Generally, the second phase of the proposed with/without meta-features methods works well about time and has meaningful embedding and robust filtering.
Figure 10,
Figure 11 and
Figure 12 show that in both random cliques and Erdos random graphs these queries are dense, so the best methods according to time are the second phase of the proposed with/without meta-features. However, in random trees, the first stage of the proposed method is the best according to time and that is due to this query is a sparse graph not dense so there is no need to find the symmetries. Finding the symmetries as proposed in with/without meta-feature in a sparse graph (i.e., not dense) will spend more time without result improvement.
Theoretically, if the query has a lot of symmetry, then the improvement with/without meta-features algorithms work well. Observably, improvement with/without meta-features methods work well in cliques and dense Erdos random graphs. The improvement method with meta-features works well on tree patterns except in
Figure 11c, such that the proposed is the best in it. Generally, the first phase of the proposed algorithm is better than SUMGRA, and the second phase, which is the improvement with/without meta-features and methods, is better than the first phase of the proposed method.
Figure 13 and
Figure 14 show the query time for the two random multigraph datasets random 1 and random 2 respectively.
It agrees with the theoretical study in both random cliques and Erdos random graph. These queries are dense, so the best methods are the second phase of the proposed with/without meta-features. However, despite the convergence of all results in the random trees, the second phase of the proposed meta-feature is the best one. Although the random trees query is sparse, the original data are generated as a dense graph.
Figure 15,
Figure 16 and
Figure 17 show the node embedding Query for Human HIV4, plasmodium, and EUAi datasets respectively. In
Figure 18 shows the node embedding Query for the two random multi-graph datasets.
In the
Figure 15,
Figure 16,
Figure 17 and
Figure 18 Principal Component Analysis (PCA) was used so, the x-axis and y-axis are represented the first and second principal component respectively.
In figures of node embedding such as
Figure 16b or
Figure 17b reflecting the order in
Section 3.6, the order preserves structural and multi-edges information. On the contrary, the first phase of the proposed algorithm and SUMGRA algorithm preserve some properties in a simple version of multi-graphs such as degree, k-core number, and triangle number of nodes and not sufficient proprieties of multi edges. Furthermore, node embedding of random multi-graphs
Figure 18 has no pattern- being dense in the center means nodes have the same dense properties.