Geospatial conflation involves various processes and stages. Roughly speaking, the conflation problem can be decomposed into two main stages. In stage one, the similarity or spatial offset between geographic features is measured, which expresses the likelihood that an individual feature pair from two datasets should match. This is the similarity measurement stage. It is important in that a strong similarity measure produces high scores for a feature pair that represents the same object and therefore increases the likelihood of the pair being matched. The second stage is match selection, in which, given all the similarity/distance metrics from stage one, one selects a set of feature pairs to conflate. Optionally, there is a third stage in which one merges the geometries or attribute sets of the feature pairs matched in stage two.
Among these two stages, the similarity measurement problem has been extensively studied in the literature. A classic similarity metric is the Hausdorff distance. Given two GIS features $A$ and $B$, it measures the maximum deviation of any point of $A$ to feature $B$ as a point set, and vice versa, the maximum deviation of any point in $B$ to $A$ as a set. Mathematically, the Hausdorff distance is defined as:

$$H(A, B) = \max\left\{ \sup_{a \in A} \inf_{b \in B} d(a, b),\; \sup_{b \in B} \inf_{a \in A} d(a, b) \right\}$$

This definition is faithful: the greater the difference between $A$ and $B$, the greater their distance metric. When the distance metric is zero (or close to zero), the two features must be (nearly) the same. In addition to the Hausdorff distance, numerous other similarity metrics have been employed in the literature, involving spatial offset and separation [9], angular distance and orientation [10,12,16], shape [16,17], and topological measurements such as the degree of nodes. While these metrics differ in details, they embody the same principle as the Hausdorff distance: GIS features that are close in location or other aspects should have a low distance/dissimilarity value. The reader is referred to the reviews in [1,2,16] for detailed coverage of the (dis)similarity metrics. That being said, the remainder of this section focuses on the second stage, match selection.
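As a concrete illustration, the directed and full Hausdorff distances can be sketched in a few lines of Python operating on vertex lists. This is a simplification (real implementations measure against entire geometries, not only their vertices), and the function names are ours:

```python
# A minimal sketch of the (directed) Hausdorff distance on point sets.
# Features are represented as lists of (x, y) vertices; a production
# implementation would densify lines or use point-to-segment distances.
import math

def directed_hausdorff(a_pts, b_pts):
    """Max over points of a of the distance to the nearest point of b."""
    return max(min(math.dist(p, q) for q in b_pts) for p in a_pts)

def hausdorff(a_pts, b_pts):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a_pts, b_pts),
               directed_hausdorff(b_pts, a_pts))

# A short segment coinciding with part of a longer one has zero *directed*
# distance to it, but a nonzero full Hausdorff distance.
short = [(0.0, 0.0), (1.0, 0.0)]
long_ = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
```

This asymmetry is exactly why later partial-matching models in this section prefer the directed variant: it can detect that one feature is a piece of another.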
2.1. Heuristic vs. Optimized Conflation Methods
The simplest match selection method is based on a greedy strategy of matching the closest or most similar features. Depending on the order of data processing, there are a number of variant strategies. For example, one can sequentially match each feature in one dataset to its closest candidate feature in the other dataset. This strategy is the nearest neighbor join method and can be directly carried out using most existing GIS packages. Such spatial join strategies, as pointed out in [18], could lead to logically inconsistent matches, because the closest-feature relationship may not be reflexive. That is, the closest feature in $B$ to a feature $a \in A$ may itself have a closest feature in $A$ that is different from $a$.
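The non-reflexivity of the closest-feature relation can be demonstrated on a toy instance, with features reduced to representative points (a simplification of ours):

```python
# Illustrates that "closest feature" is not reflexive: the nearest
# neighbor of a feature a in B may have a different nearest neighbor in A.
def nearest(p, candidates):
    return min(candidates, key=lambda q: (q[0] - p[0])**2 + (q[1] - p[1])**2)

A = [(0.0, 0.0), (3.0, 0.0)]
B = [(2.0, 0.0)]

b = nearest(A[0], B)   # the closest feature in B to A[0] is (2, 0)
back = nearest(b, A)   # but its closest feature in A is (3, 0), not A[0]
```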
There are different ways to address conflicts in match results. For example, one could select the pair of closest features to match and then exclude them in future steps. This is called the k-Closest Pair Query (KCPQ) [19] and is widely used in database research. Beeri et al. [18] proposed a two-sided nearest neighbor join method, in which they choose among possibly conflicting pairs of candidate matches of a local area using a probabilistic score between 0 and 1. It was followed by subsequent studies such as Tong et al. [20]. Another method for coping with map conflicts is to eliminate them in advance. This is exemplified by a general technique called rubber-sheeting, which dates back to the early days of conflation research [3,4]. The general idea is to reduce the spatial displacement between two maps by iteratively identifying and merging control points (called “anchors”) on the two maps, and then applying a continuous transformation (e.g., affine) to the local regions between these anchors. The anchor points are typically points that are easily identifiable by the human expert (road junctions, landmarks, etc.). After the spatial displacement is reduced, a simpler method, such as spatial join or polygon overlay, is used for match selection.
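A greedy closest-pair selection in the spirit of KCPQ can be sketched as follows; this is our own minimal illustration (repeatedly picking the globally closest unmatched pair), not the indexed algorithm of [19]:

```python
# Greedy closest-pair matching: repeatedly select the globally closest
# pair of unmatched features and remove both from further consideration.
import math

def greedy_closest_pairs(A, B, cutoff=float("inf")):
    pairs = sorted((math.dist(a, b), a, b) for a in A for b in B)
    matched, used_a, used_b = [], set(), set()
    for d, a, b in pairs:
        if d > cutoff:
            break  # remaining pairs are even farther apart
        if a not in used_a and b not in used_b:
            matched.append((a, b))
            used_a.add(a)
            used_b.add(b)
    return matched

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.4, 0.0), (5.0, 0.0)]
matched = greedy_closest_pairs(A, B, cutoff=1.0)  # (1,0)-(5,0) is rejected
```

With the cutoff, distant leftovers stay unmatched, which mirrors the cutoff distance used by the optimized models later in this section.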
Generally, the aforementioned methods are greedy and sequential in nature [9,10]. They make match selections one by one and cannot undo an erroneous match once it is made. In comparison, a different type of match selection method, the optimization-based conflation method, does not suffer from these issues. As will be discussed shortly, such methods match all features “simultaneously” by treating match selection as an optimization problem of minimizing the total discrepancy between two maps.
The heuristic methods are fundamentally different from the optimization-based methods in that their match solutions are not unique in nature. For example, the match produced by the nearest neighbor join (i.e., greedy) method may well depend on the order in which the data are processed. It even depends on which dataset is used as the source dataset and which is used as the target (see [18]). In contrast, any optimization-based conflation model has a unique solution with respect to the pre-specified match condition: the optimal solution. Therefore, one cannot generally express a heuristic conflation method as an optimized conflation model.
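The dependence on match direction can be seen in two lines; `nn_join` is our own toy helper operating on one-dimensional coordinates:

```python
# A sequential nearest-neighbor join depends on which dataset is the
# source: joining A -> B and B -> A yields different match relations.
def nn_join(source, target):
    return {s: min(target, key=lambda t: abs(t - s)) for s in source}

A = [0.0, 2.0]
B = [1.1]

forward = nn_join(A, B)   # both features of A grab the single feature in B
backward = nn_join(B, A)  # B's only feature picks exactly one partner in A
```

The forward join produces a 2:1 relation while the backward join produces a single 1:1 pair, so the two directions disagree on how many matches exist.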
2.2. Existing Optimized Conflation Models
This subsection presents an analytical review of the main optimized conflation models in the literature, starting with the map assignment problem in the 1980s. This includes the formulation of the optimization problem for each conflation model, as well as a discussion of its functional features and structural characteristics. The models presented include (1) the original assignment problem and a set of common notation for the assignment problem and all subsequent models, (2) the network-flow based fixed-charge matching problems, which offer more structural flexibility, (3) a unified bidirectional matching model aimed at reconciling m:1 (and 1:n) matches in the two opposite directions of match, (4) a topological conflation model aimed at preserving edge-to-edge connectivity during the match, and (5) a node-arc topological model aimed at preserving the node-arc incidence relation during the matching.
2.2.1. Common Notation
Throughout this paper, mixed-integer linear programming (MILP) is used as a common language to express new and existing optimized conflation models. Generally speaking, MILP is a high-level algebraic modeling language in which the decision problem (i.e., the choice of match relation in this context) is expressed in terms of sets and decision variables. The sets contain constants/parameters describing the objects involved in the optimization model and the attributes of these objects, while the decision variables describe the actions/outcome of the model (e.g., which objects are matched). The MILP model itself is in essence a set of linear equations and inequalities describing the requirements that the final solution (e.g., the match relation) must satisfy.
For consistency, a set of common definitions for both the constants and variables are presented below, which will apply to all conflation models in this article. Variables and parameters in the original articles of conflation models may be renamed, if necessary, to conform to the common notation defined here.
$A$ and $B$ are the two geospatial datasets to be matched and conflated. $\vec{d}_{ij}$ is a directed distance or dissimilarity metric (such as the directed Hausdorff distance) that supports membership relation decisions. The directed distance is zero if the feature $i \in A$ coincides with a part of the feature $j \in B$. $\vec{d}_{ji}$ is the same distance/dissimilarity metric measured in the opposite direction (from $j$ to $i$). $d_{ij} = \max(\vec{d}_{ij}, \vec{d}_{ji})$ is the total distance, defined as the greater of the two directional distances. $\delta$ is a cutoff distance, beyond which two features are considered too distant/dissimilar to be a potential match. $M$ is a sufficiently large number that ensures that all coefficients of the form $M - d_{ij}$ in a given model are positive.
With the above distance definitions, one can define the sets of potential matches:
$FS = \{(i, j) : i \in A, j \in B, \vec{d}_{ij} < \delta\}$ is the set of potential forward partial matches from $A$ to $B$, where the directed distance is less than the cutoff distance $\delta$.
$BS = \{(i, j) : i \in A, j \in B, \vec{d}_{ji} < \delta\}$ is the set of potential backward partial matches from features in $B$ to features in $A$.
$S = \{(i, j) : i \in A, j \in B, d_{ij} < \delta\}$ is the set of potential full matches between features of $A$ and $B$.
The decision variables are:
$x_{ij} = 1$ if feature $i$ is considered the same object as $j$, and 0 otherwise.
$y_{ij} = 1$ if feature $i$ corresponds to part of feature $j$, and 0 otherwise.
$z_{ji} = 1$ if feature $j$ corresponds to part of $i$, and 0 otherwise.
2.2.2. The Assignment Problem (AP)
The map assignment problem [13] is the first optimized conflation model in the literature. As mentioned in Section 1, it is inspired by the classic job assignment problem in operations research for assigning a set of $n$ workers to the same number of jobs. The goal is to find a minimum cost match plan (in terms of time) while respecting the 1:1 relation between the two sets. In the map assignment problem, the distance/dissimilarity between two features is used as the assignment cost (instead of time). Using the aforementioned common notation, the assignment problem can be expressed as:

$$\min \sum_{i \in A} \sum_{j \in B} d_{ij} \, x_{ij} \quad (1)$$

s.t.

$$\sum_{j \in B} x_{ij} = 1 \quad \forall i \in A \quad (2)$$
$$\sum_{i \in A} x_{ij} = 1 \quad \forall j \in B \quad (3)$$
$$x_{ij} \in \{0, 1\}$$
The objective function (1) is aimed at minimizing the total matching distance. It is expressed as the sum of distances for the matches that have actually been selected (those for which $x_{ij} = 1$). Constraints (2) and (3) maintain that the match relation is one-to-one. That is, each feature in $A$ is matched to exactly one feature in $B$ by (2) and, vice versa, each feature in $B$ is matched to exactly one in $A$ by (3). As the constraints specify the number of features that can be/must be assigned to each feature in $A$ (2) and each feature in $B$ (3), they are called the “cardinality” constraints. In the assignment problem, the cardinality constraints are the only constraints.
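On a toy instance, the difference between the optimal assignment and a greedy row-by-row matching is easy to see. The brute-force sketch below (our own illustration, not the algorithm used in practice) enumerates all one-to-one assignments:

```python
# Brute-force solution of the assignment problem on a toy distance matrix:
# enumerate all one-to-one assignments and keep the minimum-cost plan.
from itertools import permutations

def solve_assignment(dist):
    """dist[i][j] = distance between feature i of A and feature j of B."""
    n = len(dist)
    best_cost, best_plan = float("inf"), None
    for perm in permutations(range(n)):
        cost = sum(dist[i][perm[i]] for i in range(n))
        if cost < best_cost:
            best_cost, best_plan = cost, perm
    return best_cost, best_plan

# Greedy matching row by row would pair feature 0 with column 0 (cost 1)
# and force feature 1 into column 1 (cost 10, total 11); the optimum swaps
# them for a total cost of 3.
dist = [[1, 2],
        [1, 10]]
cost, plan = solve_assignment(dist)
```

Real instances are solved with the Hungarian algorithm or network-flow solvers rather than enumeration, which is factorial in the problem size.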
Li and Goodchild [8] were the first to implement and test the map assignment problem. They found that one of the two cardinality constraints may not be feasible because the two datasets $A$ and $B$ are typically not equal in size. Assuming that the size of $A$ is smaller than that of $B$, they changed constraints (3) to the following inequality form:

$$\sum_{i \in A} x_{ij} \leq 1 \quad \forall j \in B \quad (4)$$
Li and Goodchild [9] extended the basic assignment model by considering partial matches. To this end, they used the directed Hausdorff distance instead of the full Hausdorff distance. Being a half distance, the directed Hausdorff distance is zero when one feature coincides with part of a target feature. Additionally, they defined and used a cutoff distance $\delta$, and defined the similarity of a feature $i$ to a feature $j$ as follows:

$$s_{ij} = \delta - \vec{h}(i, j)$$

where $\vec{h}(i, j)$ denotes the directed Hausdorff distance. They then defined a model as follows (with a slight change in notation):

$$\max \sum_{(i,j) \in FS} s_{ij} \, y_{ij} \quad (5)$$

s.t.

$$\sum_{j \in B} y_{ij} \leq 1 \quad \forall i \in A \quad (6)$$
$$\sum_{i \in A} l_i \, y_{ij} \leq (1 + \varepsilon) \, l_j \quad \forall j \in B \quad (7)$$
$$\sum_{i \in A} y_{ij} + q_j \geq 1 \quad \forall j \in B \quad (8)$$

where $l_i$, $l_j$ are the lengths of features (roads in [9]), and $q_j$ is a parameter defined to be 1 if $j$ has no nearby feature (within the cutoff distance $\delta$), and zero otherwise.
Constraints (6) are similar to the cardinality constraints (4) of the assignment problem. Constraints (7) maintain that the total length of the lines assigned to a target line should not exceed the target’s length; $\varepsilon$ is a parameter used to allow for errors in the lengths of the involved features. Constraints (8) maintain that if there are nearby features to $j$ (i.e., $q_j = 0$), then one of these nearby features $i$ must be assigned to $j$. Li et al. [9] defined an analogous model for the reverse direction of matching, from $B$ to $A$. They called these two models sub-model 1 and sub-model 2. They then applied the two sub-models one by one and resolved any inconsistencies between the two sub-models in a post-processing procedure. It should be noted that the constraint sets (7) and (8) could lead to infeasibility. On some occasions, constraints (8) could force one feature $i$ to be assigned to two different target features $j$ and $j'$ (thereby violating (6)).
The assignment problem formulation was followed in subsequent research. For example, Tong et al. [10] applied the original assignment model to their road network data and reported a low match rate of 56.5%. Although simple in structure, the assignment problem formulation has its limitations. It has a strong requirement that all features in (one of) the datasets must be assigned (constraint (2)). Li et al.’s work [9], while enhancing the assignment problem in several directions, introduced its own issues. Such issues are further discussed in [15].
2.2.3. The Fixed-Charge Matching (fc-Matching) Problems
The fixed-charge matching (fc-matching) problems represent an improvement over the assignment problem, with less stringent requirements and a more flexible structure. This is achieved by replacing the underlying optimization model of the assignment problem with a more powerful model called the minimum cost network-flow circulation problem (the network flow problem for short).
To address the limitations of the assignment problem formulation, Lei et al. [
11,
15] proposed two new optimized conflation models based on the network flow problem. The network flow problem is a more powerful and flexible model than the assignment problem as it can express a range of optimization problems including the shortest path, the assignment problem itself, and the fixed-charge matching problems, among many others. One advantage of the network-flow based conflation models is that they can be solved using fast specialized algorithms such as the push-relabel algorithm. However, the problem format of the network flow is different from the commonly used MILP models and may require a separate optimization package to solve.
The fixed-charge matching problems, first defined in [11], include two models called the fc-matching and the fc-bimatching models. They differ in that the fc-matching model assumes a one-to-one correspondence between the two matched GIS datasets, whereas the fc-bimatching model allows many-to-one (and one-to-many) correspondence. The fixed-charge matching models are special instances of the network flow problem. The network flow problem, by definition, optimizes the amount of flow along the edges of a specially designed network. Each edge in the network is directed and has an associated cost for carrying a unit amount of flow, an upper bound, and a lower bound for the flow. The only requirement is flow preservation at each node. That is, the amount of flow entering a network node must be equal to the amount of flow leaving that node. The decision variables are the amounts of flow along the network edges, and the system objective is to minimize the total flow cost. In the context of matching, the amount of flow on an assignment edge represents the possible matching of an object $i$ in one dataset ($A$) to an object $j$ in a second dataset ($B$). It is 1 if $i$ is assigned to $j$ and 0 otherwise.
Figure 1 presents a diagram for the one-to-one fc-matching problem. The labels on the network edges represent their attributes: the flow cost, the lower bound, and the upper bound. Most edges have a lower bound of 0 and an upper bound of 1 (representing an assignment or non-assignment). On these occasions, the lower and upper bounds are omitted and only the flow cost is shown on the label. In this network, all flows come out of the source node $s$ and eventually enter the sink node $t$, except for the rebalancing flow from $t$ back to $s$. Each node of dataset $A$ is linked to the source, and each node of $B$ is linked to the sink, with zero-cost edges. The upper bound of 1 on these edges ensures that each $i \in A$ (and $j \in B$) can be assigned to at most one node in the other dataset. The links between the nodes of $A$ and the nodes of $B$ represent the actual assignments, and their edge costs represent the distance/dissimilarity metrics. All these costs are positive, making the model seek a minimum cost match plan. However, the cost of the rebalancing link $(t, s)$ is set to a negative value ($-M$) to ensure that some non-zero flow will be generated in the network.
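The net effect of the fc-matching construction is to select a (possibly partial) one-to-one matching that maximizes the total benefit $M - d_{ij}$, with $M$ set to the cutoff plus one so that only pairs within the cutoff are profitable. The tiny enumeration below is our own illustrative sketch of that objective, not the network-flow algorithm used in practice:

```python
# Enumeration sketch of one-to-one fc-matching: choose a partial matching
# maximizing the total benefit M - d_ij, where M = cutoff + 1 makes every
# admissible pair's benefit positive and rules out pairs beyond the cutoff.
def fc_matching(dist, cutoff):
    M = cutoff + 1
    n_a, n_b = len(dist), len(dist[0])
    best = (0.0, [])

    def extend(i, used_b, benefit, plan):
        nonlocal best
        if benefit > best[0]:
            best = (benefit, plan[:])
        if i == n_a:
            return
        extend(i + 1, used_b, benefit, plan)  # leave feature i unmatched
        for j in range(n_b):
            if j not in used_b and dist[i][j] < cutoff:
                plan.append((i, j))
                extend(i + 1, used_b | {j}, benefit + M - dist[i][j], plan)
                plan.pop()

    extend(0, frozenset(), 0.0, [])
    return best

# Pairs with distance 9.0 exceed the cutoff and can never be selected.
dist = [[0.5, 9.0],
        [9.0, 0.2]]
benefit, plan = fc_matching(dist, cutoff=1.0)
```

Unlike the assignment problem, nothing forces every feature to be matched: a feature with no admissible partner is simply left out, which is exactly what the zero lower bounds in the flow network allow.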
Figure 2 presents a schematic view of the fc-bimatching model, structured as a min-cost network flow problem. Structurally, the network roughly consists of two halves. The left sub-network represents the assignment of objects from dataset $A$ to objects in dataset $B$. The link from a node in $A$ to a node in $B$ represents a possible assignment from object $i$ in dataset $A$ to object $j$ in $B$. The right sub-network represents possible assignments in the opposite direction, from $B$ to $A$. As in the fc-matching diagram, the labels on the network edges represent their flow cost and lower/upper bounds.
There are several special source and sink nodes, which inject flows into and absorb flows from the regular nodes in $A$ and $B$. Links between these special nodes have a lower bound of 0 and an upper bound of infinity (or a very large number). Most of them have zero flow costs. The only exceptions are the flow rebalancing link and the excess flow link. The rebalancing link has a negative cost ($-M$) and provides an incentive for the flow network to generate non-zero flows (assignments). In this study, $M$ is set to be the cutoff distance plus one. The excess flow link has a positive cost for penalizing excessive multi-assignments to one object. This is because multi-assignments represent partial matches, which are harder to characterize in terms of their match relations. For example, in the one-to-one fc-matching problem, any feature can be matched to only one target feature. This means that the match is exclusive: once matched, a feature cannot be involved in any other match. This property helps in reducing ambiguous and erroneous matches. In comparison, no such constraints can be imposed in the many-to-one fc-bimatching problem. By the definition of many-to-one matching, many features may be matched to one target. If a line is split into three parts, nothing stops the model from matching one or two of the parts to the correct target feature while mismatching the other parts. Therefore, partial matches are more error-prone. This is why the penalty factor is imposed to discourage multi-assignments.
Two earlier versions of the fixed-charge matching problems, called the p-matching problems, were defined in [15]. They are similar in structure to the fc-matching models, except that they pre-define the number of matches to be $p$. This is enforced by setting both the upper and lower bounds of the flow rebalancing link to $p$. The p-matching models require a search for an appropriate value of $p$ based on the level of matching errors at each $p$. The fc-matching models improve on the p-matching models by allowing the number of matches to be determined automatically by the optimization.
Wu et al. [12] developed a modified version of the network flow based road conflation model, which is similar in structure to the fc-matching problem. However, they defined the capacity of an assignment edge $(i, j)$ based on the lengths $l_i$ and $l_j$ of the two roads being matched. Additionally, they used a hypothetical null object in each dataset to represent non-matches. They also used a pre-processing step to preclude unlikely edges from the flow networks based on a number of characteristics of the pair of candidate roads, including their angle, length capacity (similar to (7)), etc. Their experimental results reportedly showed that these modifications improved the accuracy of matching compared with fc-matching.
2.2.4. The Unified Bidirectional Matching (u-Bimatching) Problem
The main purpose of the unified bidirectional matching problem [21] is to eliminate potentially conflicting matches between the two opposite match directions. As discussed earlier, these inconsistencies between the two match directions may occur in m:1 matching models.

In the fc-bimatching model, the partial assignments (i.e., m:1 matches) are penalized in favor of full, one-to-one assignments. When possible, the model attempts to make full assignments first. However, the fc-bimatching model does not preclude inconsistent partial assignments. To address this issue, Lei et al. [21] attempted to eliminate erroneous partial matches by imposing new constraints in optimized conflation models. They achieved this by including full and partial matches in a new optimization model with a set of “link” constraints to preclude inconsistent assignments. Using the common notation defined at the beginning of this section, their model, called the unified bidirectional matching (u-bimatching) model, can be described in MILP as follows:

$$\max \; w \sum_{(i,j) \in S} (M - d_{ij}) \, x_{ij} + \sum_{(i,j) \in FS} (M - \vec{d}_{ij}) \, y_{ij} + \sum_{(i,j) \in BS} (M - \vec{d}_{ji}) \, z_{ji} \quad (9)$$

s.t.

$$\sum_{j \in B} x_{ij} \leq 1 \quad \forall i \in A \quad (10)$$
$$\sum_{i \in A} x_{ij} \leq 1 \quad \forall j \in B \quad (11)$$
$$\sum_{j \in B} y_{ij} \leq 1 \quad \forall i \in A \quad (12)$$
$$\sum_{i \in A} z_{ji} \leq 1 \quad \forall j \in B \quad (13)$$
$$x_{ij} + y_{ij} \leq 1 \quad \forall (i,j) \in S \cap FS \quad (14)$$
$$x_{ij} + z_{ji} \leq 1 \quad \forall (i,j) \in S \cap BS \quad (15)$$
$$M y_{ij} + \sum_{j' \in B} x_{ij'} + \sum_{i' \in A} x_{i'j} + \sum_{i' \in A} z_{ji'} \leq M \quad \forall (i,j) \in FS \quad (16)$$
$$M z_{ji} + \sum_{i' \in A} x_{i'j} + \sum_{j' \in B} x_{ij'} + \sum_{j' \in B} y_{ij'} \leq M \quad \forall (i,j) \in BS \quad (17)$$
$$x_{ij}, y_{ij}, z_{ji} \in \{0, 1\}$$
In the above, $M$ is a sufficiently large number used for formulating certain constraints such as (16) and (17) (the so-called “big M” method in operations research [14]). $w$ is a relative weight value for emphasizing full assignments $x_{ij}$ (as compared to partial assignments, $y_{ij}$ and $z_{ji}$).
In the u-bimatching formulation, the objective function maximizes the benefits associated with making full, forward partial, and backward partial assignments, respectively. Constraints (10) and (11) are cardinality constraints stating that each feature $i \in A$ can be assigned to at most one target feature in $B$, and vice versa each $j \in B$ can be assigned to at most one $i$. They establish a one-to-one correspondence with the decision variable $x_{ij}$. Constraints (12) and (13) are the cardinality constraints for the forward partial assignments $y_{ij}$ and the backward partial assignments $z_{ji}$, respectively. Each establishes unilaterally that a source feature can “belong” to at most one target feature. Constraints (14) and (15) maintain that a source feature cannot be assigned to a target both as a full assignment and as a partial assignment.
Constraint (16) is a link constraint that ensures the compatibility of the forward/backward partial assignments as well as the full assignments. The left-hand side (LHS) has four terms. Clearly, if any of the last three terms is positive, then it forces $y_{ij}$ to be zero. That is, $i$ cannot be assigned as a part of $j$. If all of them are zero, then $y_{ij}$ is allowed to be one. The three terms represent three occasions in which the assignment should not happen: (a) $i$ is assigned in a full match, (b) $j$ is assigned in a full match, and (c) $j$ is assigned to some feature $i'$. Obviously, the first two cases are incompatible with $y_{ij} = 1$. The last case (c) is also incompatible because it forms a chain of transitive assignments $i \to j \to i'$, implying that $i$ is part of a different feature $i'$, which is absurd. Constraint (17) is similar to constraint (16), except that the roles of the two datasets are swapped.
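The logic of the link constraints can be mirrored in a small validity checker for a candidate solution. The representation (sets of pairs) and names below are our own sketch, not the formulation of [21]:

```python
# Checks the u-bimatching link logic on a candidate solution.
# x: set of full matches (i, j); y: forward partial matches (i, j),
# meaning "i is part of j"; z: backward partial matches (j, i),
# meaning "j is part of i".
def link_violations(x, y, z):
    bad = []
    for (i, j) in y:
        if any(a == i for (a, _) in x):      # i already fully matched
            bad.append(("y", i, j, "i in full match"))
        elif any(b == j for (_, b) in x):    # j already fully matched
            bad.append(("y", i, j, "j in full match"))
        elif any(a == j for (a, _) in z):    # j itself is part of some i'
            bad.append(("y", i, j, "transitive chain"))
    return bad

ok = link_violations({("a1", "b1")}, {("a2", "b2")}, set())
bad = link_violations(set(), {("a2", "b2")}, {("b2", "a3")})
```

The second call exhibits exactly the transitive chain a2 part-of b2 part-of a3 that constraint (16) forbids; a symmetric check for the z variables would mirror constraint (17).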
2.2.5. The Edge Connectivity Based Matching (ec-Matching) Problem
The edge connectivity based matching (ec-matching) problems (introduced in [22]) are an extension of the previous conflation models in that they can ensure that the edge-to-edge connectivity relation is preserved between the two matched datasets (as it should be). The conflation models reviewed so far match the features of the two datasets at the individual feature level. Certain consistency conditions have been enforced, including, for example, cardinality constraints (one feature may not be assigned to two different targets), consistency between opposite assignments (transitive assignment is not allowed), and the penalty on multi-assignments. However, no attention has been paid to the consistency between inter-feature relationships, such as adjacency and incidence relations. Such relationships are often used by human experts to match difficult cases. For example, when there are large spatial offsets, one may need to check nearby neighbors to determine whether a pair of features is a true match. If the neighbors all match, then it is likely that the pair should match as well. Such “tracing” reflects the fact that human experts use topological relations during conflation. By analogy, conflation models should also respect the topological relations so that they are the same on either side of the match. For linear networks, there are at least two types of topological relations: edge connectivity and edge-node incidence relations. Correspondingly, two types of optimized conflation models have been developed in the literature.
The first type of topological relation is edge connectivity. If two roads are connected in reality, they should be so in both GIS datasets. Consequently, it would be an error to match a pair of connected roads in $A$ to a pair of disconnected roads in $B$. Based on this notion of preserving connectivity, Lei et al. [22] proposed the edge connectivity based matching (ec-matching) problem. In addition to the common notation established earlier, the following are needed:

$c_{ii'} = 1$ if $i, i' \in A$ are connected, and 0 otherwise.

$c'_{jj'} = 1$ if $j, j' \in B$ are connected, and 0 otherwise.

$\Phi(i, j) = \{(i', j') \in S : c_{ii'} = 1, c'_{jj'} = 0\}$ is a (forward) incompatibility set containing all the candidate pairs for which $i, i'$ are connected in $A$ but $j, j'$ are not connected in $B$. For the aforementioned reasons of topological consistency, one can see that if the assignment $x_{ij}$ is made, then no assignment in $\Phi(i, j)$ can happen. Similarly, $\Psi(i, j) = \{(i', j') \in S : c_{ii'} = 0, c'_{jj'} = 1\}$ is a backward incompatibility set containing the candidate pairs that cannot be assigned if the assignment $x_{ij}$ is made.
With the additional notation, the ec-matching model [22] is formulated in MILP as:

$$\max \sum_{(i,j) \in S} (M - d_{ij}) \, x_{ij} \quad (18)$$

s.t.

$$\sum_{j \in B} x_{ij} \leq 1 \quad \forall i \in A \quad (19)$$
$$\sum_{i \in A} x_{ij} \leq 1 \quad \forall j \in B \quad (20)$$
$$M x_{ij} + \sum_{(i',j') \in \Phi(i,j)} x_{i'j'} \leq M \quad \forall (i,j) \in S \quad (21)$$
$$M x_{ij} + \sum_{(i',j') \in \Psi(i,j)} x_{i'j'} \leq M \quad \forall (i,j) \in S \quad (22)$$
$$x_{ij} \in \{0, 1\}$$
Constraints (19) and (20) are cardinality constraints. Constraint (21) embodies the aforementioned connectivity preservation idea. For a pair of candidate features $(i, j)$, if any assignment in the incompatibility set $\Phi(i, j)$ happened, then $x_{ij}$ is forced to be zero. By enumerating all possible incompatible sets, the model makes it impossible to match a connected pair in $A$ to a disconnected pair in $B$. Symmetrically, constraint (22) forbids any connected pair in $B$ from being matched to a disconnected pair in $A$.
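The connectivity-preservation condition behind these constraints can be verified on a candidate match plan with a few lines of Python (our own sketch; connectivity is represented as sets of unordered road pairs):

```python
# Checks edge-connectivity consistency of a match plan: two matched pairs
# violate the condition when the roads are connected in one dataset but
# their counterparts are disconnected in the other.
def connectivity_violations(matches, connected_a, connected_b):
    """matches: dict road in A -> road in B; connected_*: sets of frozensets."""
    bad = []
    items = list(matches.items())
    for k, (i1, j1) in enumerate(items):
        for i2, j2 in items[k + 1:]:
            in_a = frozenset((i1, i2)) in connected_a
            in_b = frozenset((j1, j2)) in connected_b
            if in_a != in_b:
                bad.append(((i1, j1), (i2, j2)))
    return bad

connected_a = {frozenset(("r1", "r2"))}
connected_b = {frozenset(("s1", "s2"))}
good = connectivity_violations({"r1": "s1", "r2": "s2"}, connected_a, connected_b)
bad = connectivity_violations({"r1": "s1", "r2": "s3"}, connected_a, connected_b)
```

The second plan matches the connected pair (r1, r2) to the disconnected pair (s1, s3), which is precisely what constraint (21) rules out inside the optimization.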
2.2.6. The m:1 (1:n) Element Connectivity Bi-Matching Problem (ec-Bimatching)
Lei et al. [22] also proposed a many-to-one version of the connectivity based conflation model, called the ec-bimatching problem. The model is similar in structure to the ec-matching model except that it deals with m:1 (and 1:n) matches. Therefore, directed distances are used instead of full distances. Accordingly, the two incompatibility sets are modified to the following forms:

$$\Phi(i, j) = \{(i', j') \in FS : c_{ii'} = 1, c'_{jj'} = 0\}$$
$$\Psi(i, j) = \{(i', j') \in BS : c_{ii'} = 0, c'_{jj'} = 1\}$$

With the modified incompatibility sets and partial assignment variables, the ec-bimatching problem [22] can be formulated in MILP as:

$$\max \sum_{(i,j) \in FS} (M - \vec{d}_{ij}) \, y_{ij} + \sum_{(i,j) \in BS} (M - \vec{d}_{ji}) \, z_{ji} \quad (23)$$

s.t.

$$\sum_{j \in B} y_{ij} \leq 1 \quad \forall i \in A \quad (24)$$
$$\sum_{i \in A} z_{ji} \leq 1 \quad \forall j \in B \quad (25)$$
$$M y_{ij} + \sum_{(i',j') \in \Phi(i,j)} y_{i'j'} \leq M \quad \forall (i,j) \in FS \quad (26)$$
$$M z_{ji} + \sum_{(i',j') \in \Psi(i,j)} z_{j'i'} \leq M \quad \forall (i,j) \in BS \quad (27)$$
$$y_{ij}, z_{ji} \in \{0, 1\}$$
The constraints above are similar to those of the ec-matching model, except for the use of partial assignment variables. The original formulation of [22] also included link constraints of a form similar to (16) and (17), which are omitted here for simplicity and ease of comparison.
2.2.7. The Edge-Node Matching (en-Matching) Problem
The edge-node matching (en-matching) problem [23] introduces a second type of topological condition, namely the node-arc incidence relationship, into optimized conflation. As an example of such conditions, if a road $i$ is matched to a road $j$, then the end nodes of $i$ and $j$, as road junctions, must match. Otherwise, the match relation is topologically incorrect and should not be made. To preserve the node-arc relation during matching, the nodes and the edges of the two linear networks must be matched simultaneously and consistently. To this end, Lei et al. [23] proposed the edge-node matching (en-matching) model, which requires additional definitions as follows:
Let $N_A$ be the vertex set of $A$, and $N_B$ be the vertex set of $B$.

$NS$ is the set of candidate node pairs $(p, q)$, with $p \in N_A$ and $q \in N_B$, whose distances are less than the cutoff distance.

Given an edge $i$ in $A$ (or $B$), let $f(i)$ denote its from-node and $t(i)$ denote its to-node.

$d^N_{pq}$ is a distance metric between the nodes in $N_A$ and $N_B$.

$s^N_{pq}$ is the benefit or similarity metric for matching nodes $p$ and $q$ (e.g., $M - d^N_{pq}$).
$w$ is a weight value for making nodal matches (vs. making edge matches). Lei et al. [23] chose its value based on the assumption that each junction is associated with four roads on average. Lei et al. [23] also used a parameter to prioritize the matching of higher-degree nodes. This is to avoid isolated “islands” of matched cliques (see [23] for details).

A new decision variable for matching two nodes is needed:

$n_{pq} = 1$ if node $p$ is assigned to node $q$, and 0 otherwise.
Now the en-matching problem can be formulated in MILP as:

$$\max \sum_{(i,j) \in S} (M - d_{ij}) \, x_{ij} + w \sum_{(p,q) \in NS} s^N_{pq} \, n_{pq} \quad (28)$$

s.t.

$$\sum_{q \in N_B} n_{pq} \leq 1 \quad \forall p \in N_A \quad (29)$$
$$\sum_{p \in N_A} n_{pq} \leq 1 \quad \forall q \in N_B \quad (30)$$
$$\sum_{j \in B} x_{ij} \leq 1 \quad \forall i \in A \quad (31)$$
$$\sum_{i \in A} x_{ij} \leq 1 \quad \forall j \in B \quad (32)$$
$$x_{ij} \leq n_{f(i)f(j)} + n_{f(i)t(j)} \quad \forall (i,j) \in S \quad (33)$$
$$x_{ij} \leq n_{t(i)f(j)} + n_{t(i)t(j)} \quad \forall (i,j) \in S \quad (34)$$
$$x_{ij} \leq n_{f(i)f(j)} + n_{t(i)f(j)} \quad \forall (i,j) \in S \quad (35)$$
$$x_{ij} \leq n_{f(i)t(j)} + n_{t(i)t(j)} \quad \forall (i,j) \in S \quad (36)$$
$$x_{ij}, n_{pq} \in \{0, 1\}$$
In the above, the objective function (28) maximizes the total weighted similarity between matched edges and between matched nodes. Constraints (29) through (32) are the cardinality constraints for the nodal and edge assignments, as before. Constraint (33) is one of the topological constraints. It states that if the edge assignment $x_{ij}$ is made, then the from-node of $i$ should be matched to either the from-node of $j$ or the to-node of $j$. Constraints (34), (35), and (36) express similar conditions. Collectively, constraints (33) through (36) maintain that if $i$ is matched to $j$, then their from- and to-nodes must match.
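The incidence condition enforced by these constraints can be checked on a candidate solution with a small helper; the data layout (end-node dictionaries) and names are our own sketch:

```python
# Checks the node-arc incidence condition: if edge i is matched to edge j,
# every end node of i must be matched to an end node of j, and every end
# node of j must be covered by such a nodal match.
def incidence_ok(edge_match, node_match, ends_a, ends_b):
    """ends_*: dict edge -> (from_node, to_node); *_match: dicts."""
    for i, j in edge_match.items():
        ni, nj = set(ends_a[i]), set(ends_b[j])
        # each matched image of an end node of i must be an end node of j
        if any(node_match.get(p) not in nj for p in ni):
            return False
        # and the end nodes of j must all be covered by those images
        if nj != {node_match.get(p) for p in ni}:
            return False
    return True

ends_a = {"i1": ("u1", "u2")}
ends_b = {"j1": ("v1", "v2")}
good = incidence_ok({"i1": "j1"}, {"u1": "v1", "u2": "v2"}, ends_a, ends_b)
bad = incidence_ok({"i1": "j1"}, {"u1": "v1", "u2": "v9"}, ends_a, ends_b)
```

In the second case, the edge match i1 to j1 is rejected because one of its junctions is matched to a node that is not an end node of j1, exactly the situation that constraints (33) through (36) exclude.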
In the existing literature, most optimized conflation methods are developed for linear features such as roads [
9,
10,
11,
20,
24,
25,
26,
27]. This is probably due to the importance of roads as a spatial reference and their use in early studies by the US Census. By comparison, optimized conflation for point and polygon features has received less research attention. For point features, Li and Goodchild’s work [8] used a hypothetical point dataset (along with a road dataset) to compare the optimization-based assignment problem and two greedy matching algorithms for conflation. However, it was not followed up by others in the literature. This is probably due to the fact that point features are relatively simple in structure. Conflation for polygon features is predominated by heuristic methods. A common example is the polygon overlay method (see e.g., [
28,
29]) in which two polygon features from different sources are considered the same if their geometric intersection is large enough (compared to the two original features). To the best of our knowledge, no optimized conflation studies have been published for polygon data so far.
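The polygon overlay criterion can be sketched for axis-aligned rectangles; this is our own simplification, since a real implementation would compute true polygon intersections (e.g., with a computational geometry library):

```python
# A heuristic polygon-overlay match test on axis-aligned rectangles
# (xmin, ymin, xmax, ymax): two polygons are declared the same object if
# their intersection covers a large enough share of both originals.
def area(r):
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])

def overlay_match(r1, r2, threshold=0.5):
    inter = (max(r1[0], r2[0]), max(r1[1], r2[1]),
             min(r1[2], r2[2]), min(r1[3], r2[3]))
    if inter[0] >= inter[2] or inter[1] >= inter[3]:
        return False  # empty intersection
    shared = area(inter)
    return shared / area(r1) >= threshold and shared / area(r2) >= threshold

same = overlay_match((0.0, 0.0, 2.0, 2.0), (0.1, 0.0, 2.1, 2.0))
diff = overlay_match((0.0, 0.0, 1.0, 1.0), (5.0, 5.0, 6.0, 6.0))
```

Requiring the shared area to be large relative to both polygons (not just one) avoids declaring a small sliver inside a large parcel a match.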
2.2.8. M:N Matching Methods
So far, this article has considered optimized conflation models in the stricter sense of [9,11]. That is, only models that can find the optimal match plan (i.e., the minimum distance match) are discussed. In the literature, some scholars (see e.g., [26]) also consider a broader class of conflation algorithms as optimization based, in the weaker sense that they reduce the distance or some other metric (without necessarily finding the optimal solution). In this broader category, some of the conflation algorithms can handle the more complex m:n match case.
For example, the “heuristic probabilistic relaxation” method [
24,
30] can take into account spatial context and therefore, can handle the m:n match cases. In [
30], a confidence matrix is computed between pairs of road intersections using relative distances. Then, repeatedly, joint compatibility of two neighboring features is computed and used to update the confidence of matches until the confidence level is sufficiently high. Yang et al. [
24] extended the work of [
30] to match road features.
Fu et al. [
31] adopted a multivariable logistic regression approach for conflation, in which the Hausdorff distance, string distance, and the direction difference of two roads were used to classify candidate road pairs into two classes: match and non-match. Guo et al. [
26] extended logistic regression based conflation by using the so-called “strokes” (or sequences of road segments) as the candidates for matching. Consequently, their method can also handle the m:n matching case.
It should be noted that the heuristic probabilistic relaxation and logistic regression methods are heuristic in nature and, unlike the mathematical programming based models reviewed in the previous subsections, cannot guarantee finding the minimum-discrepancy match plan. Nonetheless, they are briefly discussed here due to their similarity to (strictly) optimized conflation models.