1. Introduction
Algorithms for calculating the (weighted) path distance between vertices in a graph appeared in the middle of the 20th century, which were motivated by the growing interest at the time in the applications of the mathematical analysis of graphs. The Bellman–Ford algorithm is the main reference of these early studies [
1]. Dijkstra’s algorithm for solving the same problem appeared at about the same time [
2], and it differs from the other, being more efficient depending on the particular problem. More specifically, Dijkstra’s algorithm is best used when you have non-negative weights and need an efficient solution for static scenarios. Bellman–Ford, on the other hand, is more versatile for situations where edge weights can be negative and where it is important to detect negative cycles, making it suitable for more complex or dynamic environments (see [
3], p. 604). Thus, Dijkstra’s algorithm can be used for finding the shortest driving route between two cities or for finding the shortest path for data packets in a network. However the Bellman–Ford algorithm can be used for analyzing currency exchange rates to detect arbitrage opportunities or for analyzing supply chains with possible gains and losses during transport.
After these original works, the growing interest in the subject (due to the numerous applications that graph theory has found in many fields) has given rise to a great deal of research on graph analysis, which often includes the study of these structures when considered as metric spaces. The idea of also considering a graph as a metric space goes back to the beginnings of the theory of graphs. Metric notions began to appear explicitly in mathematical works in the second half of the last century. The main metric that was considered (and, in a sense, the only one until the latter part of the century) was the so-called path distance [
4,
5,
6]: for undirected and connected graphs, this metric evaluated between two vertices (nodes) is defined as the length of the shortest path between them (see, for example, [
7] §.2.2.2). One of the first advances in the metric analysis of graphs was the introduction of weights in the definition of the path distance, assigning weights to the individual paths connecting two consecutive nodes and calculating the infimum of the sum of these weights. Some recent papers on the subject using weighted distances that have inspired this paper are [
8,
9].
As in the case of other notions of fundamental graph theory, the relevant theoretical ideas appeared together with research topics from other scientific fields, such as sociology [
10,
11,
12]. The definition of different metrics and algorithms to compute them increased greatly in the last decade of the last century, often proposed by problems from other disciplines such as chemistry or crystallography (see [
6,
13] and the references therein). In this sense, there is a particular case that deserves attention, that is, the resistance distance [
14,
15]. It is a concept derived from electrical circuit theory and applied to graph theory. It measures the “effective resistance’’ between two nodes (vertices) in a graph, treating the graph as an electrical network where each edge is a resistor. The resistance distance between two nodes
i and
j in a graph can be calculated using the Laplacian matrix
L of the graph. Specifically, if
denotes the Moore–Penrose pseudoinverse of
L, the resistance distance
is given by
where
is the
element of
This metric is a measure of the ease with which electric current flows between two points, with a higher resistance distance indicating a more “difficult’’ or “costly’’ connection in the graph’s structure. This concept has applications in various fields such as data analysis, network reliability, and clustering tasks. Related also to some ideas in theoretical chemistry [
13] and social network analysis [
12], this definition turned out to be a useful tool in the study of molecular configurations in chemistry, although it has also been used in network analysis and other fields [
16,
17,
18]. In general, metrics could play a relevant role when studying properties such as robustness (see, for example, the survey [
19]; see also [
20]). The interested reader can find more information on the applications of metric graphs in the books [
7,
21,
22].
In the same direction, in this paper, we provide an algorithm to compute a new distance that also appeared in an applied context in connection with the automatic analysis of fraud in economic networks (see [
23]). In this paper, we show that this metric allows one to consider vertices that are far away when the distance is measured using the path distance, becoming close with respect to this metric. It is especially useful in the economic analysis of fraud in business networks, as often the strategy used to hide such fraud is to consider that the analysis tools to be used against them are based on the path distance: closer vertices in terms of fewer intermediate nodes means that they are more related vertices. However, financial fraudsters always try to introduce many intermediate companies (intermediate nodes) to avoid detection.
Technically, our metric is defined as a weighted metric, but this is achieved by dividing the sums of weights appearing in the infimum that provides the value of the metric using a new term that depends on the number of steps involved in the summation. We will show that this change in the weighting process forces us to radically change the algorithm to calculate it, since in this new case, longer paths could give shorter distances. However, in the present work, we consider the case of acyclic directed graphs in order to avoid restrictions that make it impossible to define a weighted path metric, since continuous passage through a cycle could always give a null value to the metric (which would mean that it is not a metric). There are other ways to avoid this (see, for example, Proposition 4.1 in [
23]), but in our case, we decided to compute the distance between vertices by restricting the set of possible paths in the infimum that gives it. The way to do this is to avoid cycles and to consider directed graphs. As a result, what we compute is not a metric on the whole graph but only a distance between two vertices chosen in it.
Let us give now some basic definitions about graphs and metric spaces. Let us start by explaining some concepts related to the general definition of what a metric is, which will be adapted later according to the graph theoretic framework. Let be the non-negative real numbers. An (extended) quasi-metric on a set is a function such that for all , the axioms
if and only if and
hold. The resulting quasi-metric structure
is called a quasi-metric space. For the specific framework of this paper, a useful summary of the notions of distance in graphs, with sufficient explanation and many examples, is given in Chapter 15 in [
24].
In this paper, we will deal with the so called path-length-weighted distance, which was introduced in [
23] (§.4).
It should be noted that the version defined there was given for non-directed graphs, and so the definition we use is slightly different. However, we will also define an extended quasi-metric in this case by giving the value when there is no path for going from a to and considering only allowed paths between vertices.
Since we are interested in how to compute the distance and not in theoretical questions about its metric space structure, we will focus our attention on the computational algorithm. All the notions on graph theory that are needed can be found in books on this subject, such as, for example, [
7]. We will introduce some of them in the next section.
2. Main Definitions and Context
Take a weighted directed acyclic graph where V is the set of vertices (nodes), and E is the class of edges among them. Let us define what we call a proximity function. This is an extended function that describes by means of a non-negative real number a relation among node a and node b. It is supposed that represents some sort of distance among the nodes, but—and this is crucial—it is not assumed to be a distance (that is, none of the axioms of the metric are assumed). In the case where a and b are not connected in the graph, we will write or ∞ depending on how we write the algorithm, that is, how we decide to codify the links in the graph. So, we will consider structures when the graph is provided with a proximity function
Consider a non-increasing sequence
of positive real numbers that will be assumed to be the weights associated to the lengths of the paths considered in the definition of a distance starting from the proximity function
. Although all the graphs are finite, we will often define
W as an infinite sequence using only its first elements. Given two vertices
, we define the set of all possible paths from
a to
b as
The length of a path
is denoted by
, and its total distance is denoted by
where
is the
sum of path weights of the path
Given two points
, we define an extended quasi-metric in
V as the function on
by
with the convention that
. As in the original definition in [
23] (§.4), we will call it the path-length-weighted quasi-metric. The reader can find there the proof that it defines a metric. Given two paths
and
, we define the concatenation of both as
Write for the vertices of the graph. As in the classical math distance case, we are interested in this paper in computing the distance from any of the points in V to
3. Previous to the Algorithm
Consider the weighted directed graph without cycles and fix a vertex to which we want to measure the distance from any other vertex of the graph. Write and suppose that we want to compute the distance from any other node of the graph to .
In both the Bellman–Ford and Dijkstra algorithms, the shortest distances from the source node to all other nodes in a graph are computed by selecting the closest unvisited node to the current node. In this process, all edges leaving it to unvisited nodes are examined, updating the distance from these nodes if the distance through the selected node is less than the currently known distance. Thus, the updated distance is given by the path from this node to the current node and the path of least distance between the current node and the source.
We introduce the following formal operation ⊞ to describe the updated distance calculation in a simpler way. Fix a vertex
and let us write
for the vertex from which we want to compute the distance to
, and
is another vertex that belongs to one of the paths from
to
Let us define
where
and
define the path with the shortest distance between the current node and the source.
This weighted sum substitutes the normal addition used in the Bellman–Ford algorithm, and it allows us to change the weights when the length of the path increases. Unlike the Bellman–Ford algorithm, this algorithm has the additional complexity in that the path with the shortest distance between two nodes does not have to be the one that passes through the nodes with the shortest distances, since the number of steps must be taken into account. As illustrated in Example 1, replacing the normal sum of the Bellman–Ford algorithm with the weighted sum is not sufficient, as it is crucial to take the path length into account in each distance calculation to ensure that no path possibility is discarded.
Example 1. Consider the graph in Figure 1. We will calculate the path-length-weighted distance from to . For this purpose, we consider the weights associated with the lengths of the paths as . This means that the distance between nodes is the average of the path weights. The first step in calculating the distance has to take into account, of course, the only way forward in the graph: going from to For the following steps, we clearly have that the path with the smallest distance between the nodes and iswith a length and a distance Since is only directly connected to , when we apply a Bellman–Ford-type algorithm, the computation of is determined by the path , which results inIn spite of this, the path with the shortest distance between the nodes and iswith a length and an associate distance Consequently, as shown in Example 1, the shortest distance from an unvisited node to the current node for the metric presented in this article does not have to be calculated from the path with the shortest distance between the current node and the source. Given three vertices , this implies that the path with the shortest distance from a to c passing through b is not necessarily the union of the paths with the shortest distance from a to b and from b to c. In order to not consider all possible paths in the calculation of the distances, multi-objective optimization can be used, which allows us to restrict attention to the set of possible paths with the smallest distance and to make tradeoffs within this set.
Assume that, at a particular step of the algorithm, two or more paths
are found between the vertices
. Define on
an order given by
if and only if
It is straightforward to see that it is an order relation. Note also that it implies that
. The following result shows that it is sufficient to explore the minimal paths of
(in the sense defined by ⪯), i.e., those residing in the following Pareto front,
This front is illustrated in
Figure 2.
Proposition 1. Consider a graph G and the corresponding elements defined above. Fix , and let such that . Then, for any and , we have that the paths and in satisfy , and, in particular, .
This means that, once you have found a better path (in terms of you can discard the worst one (and stop the exploration), as it will not be used to find any distance between the last vertices.
Proof. Consider
,
, and
in the statement. Then, we have that
Recall that
. Then,
and so
. Moreover,
Thus, we obtain
. □
Remark 1. It is easy to see that Proposition 1 also works if the composition of the paths is done in the opposite way. That is, the same conclusion holds if and belong to with and .
4. The Algorithm
Taking into account the ideas of the previous section, in this part of the paper, we explain the algorithm to compute the path-length-weighted distance in a graph. To make the new algorithm more efficient, the sum of the weights—
—will be used in the calculations instead of the distance—
—which will be calculated in the last step. The counting parameter
m will be used to indicate the step number in the algorithm, that is, the length of the path. Let
denote the set of all nodes, where
represents the source node. In this context, we define
, with
, as the set of all computed tuples as
, where (to simplify notation)
denotes the sum of all weights of a path
and
represents its length. We introduce the following formal operation ⊕ to describe the process of adding a new node in terms of the sum of all weights. Once a couple of adjacent vertices
are fixed, we write
This definition is closely related to (but of course not the same as) that of ⊞ given in Equation (
1), but, as we noted above, it is written in terms of
s rather than
Finally, we set the (ordered) set
The new algorithm that we propose follows the next steps. The parameter gives the size of the paths that are considered at each step connecting any vertex with the source node :
- (1)
Initialization: For , initialize the algorithm by doing for and . We have that for all i, since we are considering steps of length equal to 0 at the beginning. Now, consider the “set of vertices connected to by a path of length 0”, that is, Also, put, according to the definition,
- (2)
Search for all paths: Put
Consider the set of adjacent vertices to
that is, all the vertices that are connected to
. For
consider the quantities
given by Equation (
4) and add them to the set
that has been initialized as only containing
that is,
- (3)
Path filtering: The main idea of the algorithm is to update the sets
as
m grows when repeating step (2). For a given
m following 1 (make
for the immediate step), we have that
is the set of tuples
of all paths
of lengths less than or equal to
We need to reduce the size of these sets by eliminating some non-essential elements. According to Proposition 1, Equation (
3) will be used to discard the tuples belonging to those paths that give a greater distance than others. We do that by replacing
which is defined for any
as
Note that in the case that a given has any path of size m go to , this step will remove the tuple from the set
- (4)
Repeat: Repeat steps (2) and (3) until m is greater than the maximum length of all paths in the graph connecting
- (5)
Distance computation: Finally, once the final versions of the sets
have been calculated, we proceed to the calculation of the distances. Then, for each
, the distance from the node
to the source node
is given by
Note that this way of calculating the distance, although it can be laborious, guarantees that it works for any non-increasing sequence of weights
The reader can find below Algorithm 1, which is a visual representation illustrating the steps of the proposed algorithm. This diagram serves as a helpful visual aid to complement the step-by-step explanation provided above.
Algorithm 1: Path length weight (, W, n, ) |
|
5. Examples of Graphs
In this section, we will present some examples to show the properties of the explained algorithm. We will explore some different cases in order to demonstrate that the conditions of the requirements are necessary to obtain a suitable algorithm. Also, our idea is to present different classes of graphs, which can be associated using topological properties, to show that our technique works in different situations.
We start with an example of a tree-type graph. After that, we will show that our algorithm also works for a centered-star-type graph. Finally, we will also present a case of a graph with no singular topological structure to prove the generality of our procedure.
To visualize the graphs, the Kamada–Kawai layout algorithm [
25] has been used, which is a method for drawing two-dimensional graphs where the nodes are placed so that the distances in the drawing are proportional to the shortest distances between them in the original graph. In this way, it is possible to observe how the drawing of the graph changes depending on the distance used. The numbers shown on the arrows are the corresponding values of
except where other notation is explained. Let us remark that the “distances” that are written in the figures are the ones that can be computed: recall again that we are working with directed graphs with no cycles, so it may happen that we can compute the “distance” from
to
but not the “distance” from
to
In this case, we write in the corresponding edge the one that can be computed:
- (1)
Tree-type graph: In this type of graph, all nodes are connected to each other by exactly one single path, and there are no cycles (closed paths). For the calculation of the distance, we use the weights associated with the inverse of the lengths of the paths—this is
. The comparison among
Figure 3 and
Figure 4 shows the particular nature of the metric we have defined, in which two points
and
that are more distanced in the tree (with more intermediate nodes) are, however, nearer than one of the vertices that crosses the path that connects
with
The reader can see this, for example, in the comparison of the position and distances between
and
(the path-length-weighted distance is
), and between
and
(the path-length-weighted distance is
). Moreover, the distances between
and
and between
and
are equal, although it is necessary to pass through
to reach
from
This effect is especially remarkable in the case of trees, where the usual path distance between a vertex and the root increases with the number of branches separating them. We have seen that for our distance, however, this is not necessarily true.
- (2)
Centered-star-type graph: In these graphs, there is a central node—
in
Figure 5—which is directly connected to all other nodes in the graph. Compared to
Figure 6, where the weights
were used,
Figure 7 shows how the graph changes when applying the distance calculation with the weights
In this case, the distance decreases as the path length increases. This is again a relevant difference with the usual weighted path distance case, and it could be used to model different network behaviors where proximity in terms of number of intermediate nodes does not adequately represent the desired relationships between nodes.
- (3)
Graph with no singular topological structure: Let us now consider a graph that does not have an easily definable shape or pattern, such as a tree or a star. The example that we chose is shown in
Figure 8. It is a non-connected graph. This graph can be separated into two connected subgraphs—that is, its two connected components—to calculate the distances of the nodes. In spite of this, as can be seen in
Figure 9, our algorithm also worked without the need to separate these graphs into their connected components. The weights
were again considered.
6. Special Cases: Constraints on the Weights and on the Proximity Matrix to Improve the Efficiency of the Algorithm
To reduce the complexity of the algorithm, algebraic conditions can be imposed on both the proximity matrix and the weights associated with the length of the paths. The trivial case is when these weights are all equal to a constant
In this case, the distance coincides with the weighted path metric (see [
24], p. 258) multiplied by the constant
Therefore, the calculation of the distance can be performed using the Bellman–Ford or Dijkstra algorithms.
In this section, we will consider the weights associated with the inverse of a power of the lengths of the paths
, with the power
k being greater than or equal to 1. The importance of these weights lies in the fact that it is the natural generalization of the case where the power
k is equal to 1, i.e., the case where the distance between nodes coincides with the average of the path weights. This class of weights, which have been already used in the previous sections as general examples, is also relevant for applications (see [
23]).
The results provided in this section can be implemented directly in the algorithm to make it faster. We explain the requirements that have to be met, as well as the procedures for using them.
Proposition 2. Consider a graph and the weights associated with the lengths of the paths , with . Fix , and let such that , and . Then, for any and such that , we have that the paths and in satisfy .
Proof. Consider
,
, and
in the statement so that
,
, and
. Then,
Then, by hypothesis,
This implies that
. Therefore,
so
. □
Proposition 3. Consider a graph and the weights associated with the lengths of the paths , with . Fix , and let such that , and . Then, for any and such that , we have that the paths and in satisfy .
The proof of this result is analogous to Proposition 2, so it will be omitted.
Remark 2. It is trivial to see that Propositions 2 and 3 also work if the paths and in are used, being in this case and .
These propositions propose to define in
and for each
the order relation
given by the conditions
(suggested by Proposition 2), and the order relation
given by
(suggested by Proposition 3).
As can be seen, these order relations are more restrictive than the order relation ⪯ used in the previous sections (although they depend on the path R), since they are given as a function of the distance d, while ⪯ is a function of the sum of the weights s.
To reduce the set of paths suitable for calculating the distance with our algorithm using these order relations, we need them to not depend on the path R. This can be done by imposing some additional properties to the graph. In particular, this works if the graph meets one of the following conditions, which are different depending on the order relation we will use:
- 1
Requirements for the order
: Let
be a weighted directed graph with no cycles. The condition we impose in this case is the following: for all
in
V, with
being non-empty and
b connected to
c—that is,
—the inequality
holds for all
. Note that this requirement is independent of
R, so we write
for the associate order relation. As it is not always possible to verify this condition given a graph, we will check that for all
, it is satisfied that
It is obvious that this condition implies (
5). Using this, we can simplify the algorithm. Clearly, in this case, it is enough to explore the paths of the following Pareto front:
- 2
Requirements for the order
: As above, let
be a weighted directed graph with no cycles. The requirement is, in this case, the following: for all
in
V, with
being non-empty and
b connected to
c—that is,
—the inequality
has to be satisfied for all
. A sufficient condition to be checked is that for all
, it is satisfied that
Again, this implies (
8). In this case, it is enough to consider the following paths:
To show how the path filters explained above work for each of the algorithms, we will calculate the distance from node
to
in the graph illustrated in
Figure 10 (which satisfies the requirements given by (
6)) using the weights associated with the path lengths
. Since
is only directly connected by
, the computation of
is determined by the paths
, with
:
- (1)
, with , , and ;
- (2)
, with , , and ;
- (3)
, with , , and ;
- (4)
, with , , and ;
- (5)
, with , , and .
In this case, if we use the general path filtering given by (
3), we would obtain the possible paths
since
.
Instead, if we use the path filtering for this particular case given by (
7), we have that
,
, and
obtaining directly
Therefore, for the calculation of , we only have to compute the distance associated with the path that is,