Article

A Registration Method for Historical Maps Based on Self-Supervised Feature Matching

1  School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
2  Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Northeastern University, Shenyang 110819, China
*  Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(3), 1472; https://doi.org/10.3390/app15031472
Submission received: 2 January 2025 / Revised: 27 January 2025 / Accepted: 29 January 2025 / Published: 31 January 2025
(This article belongs to the Special Issue Advanced Pattern Recognition & Computer Vision)

Abstract

Comparing historical map images of the same region from different periods is an effective method for studying urban history and planning. Image registration techniques in the field of computer vision can be applied to this task. However, historical map registration faces unique challenges, including insufficient training data, variations in image sizes, and a lack of usable texture features. To address these challenges, we constructed a dedicated dataset of over 100 scanned historical maps, including both raw and preprocessed segmented images. We then developed an enhanced SuperGlue-based registration framework, optimized for the specific obstacles posed by historical maps, such as low texture and large image size. Additionally, we proposed a self-supervised fine-tuning feature extraction algorithm and a Transformer-based architecture utilizing graph attention mechanisms to refine feature descriptors and enhance feature matching performance. Experimental results indicate that our solution achieves superior performance compared to existing models, with RMSE reduced by up to 20%, ROCC improved by up to 10%, and processing time shortened by at least 15%.

1. Introduction

Since ancient times, maps have served as records of geographic information, illustrating the spatiotemporal distribution, interconnections, and developmental changes of natural and cultural phenomena. In fields such as historical studies, urban planning, and tourism development, comparing historical map images of the same region from different periods has become an important research paradigm [1,2,3]. This is known as image registration from the perspective of computer vision, which spatially aligns images acquired at different times, by different sensors, or from varying viewpoints, enabling accurate positioning, comparison, and fusion among them. An image registration task typically begins with the extraction of a set of representative feature points from a pair of images. These feature points are then matched based on their feature vectors, resulting in a corresponding set of relevant feature points between the two input images. Finally, the transformation model is derived from these matched feature points, describing how one image aligns with the other. Although image registration algorithms have found wide-ranging applications in various computer vision disciplines, including 3D reconstruction [4], change detection [5], and medical imaging [6], there has yet to be any specialized research dedicated to the registration of historical map images.
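As a concrete illustration of this three-step pipeline, the following minimal sketch uses OpenCV's SIFT detector, brute-force matching with Lowe's ratio test, and RANSAC-based homography estimation; the file names and threshold values are illustrative, and SIFT merely stands in for whichever feature extractor is used.

import cv2
import numpy as np

ref = cv2.imread("reference_map.png", cv2.IMREAD_GRAYSCALE)   # hypothetical file names
sen = cv2.imread("sensed_map.png", cv2.IMREAD_GRAYSCALE)

# Step 1: extract feature points and descriptors
sift = cv2.SIFT_create()
kp_ref, des_ref = sift.detectAndCompute(ref, None)
kp_sen, des_sen = sift.detectAndCompute(sen, None)

# Step 2: match descriptors and keep matches that pass Lowe's ratio test
matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des_sen, des_ref, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# Step 3: estimate the transformation model (here a homography) and align the sensed map
sen_pts = np.float32([kp_sen[m.queryIdx].pt for m in good])
ref_pts = np.float32([kp_ref[m.trainIdx].pt for m in good])
H, inliers = cv2.findHomography(sen_pts, ref_pts, cv2.RANSAC, 5.0)
aligned = cv2.warpPerspective(sen, H, (ref.shape[1], ref.shape[0]))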
Traditional and learning-based methods are the main categories of image registration schemes. Traditional registration approaches are generally divided into intensity-based approaches [7,8] and feature-based approaches [9,10]. Although intensity-based methods can produce satisfactory results, they often require substantial human intervention and struggle with images of low contrast. In contrast, feature-based methods, which establish accurate correspondences between images by computing feature points and their corresponding descriptors, are easily affected by noise. Recently, research on learning-based methods has gained popularity and has advanced rapidly [11,12], providing significant improvements over traditional methods in many aspects. However, their precision tends to drop markedly when dealing with high-resolution images and significant scale differences. In the registration of Synthetic Aperture Radar (SAR) images [13], which resemble historical maps, numerous learning-based methods have also been introduced. Nevertheless, these methods are highly dependent on large volumes of manually annotated training data, making them unsuitable when historical map image data are scarce or difficult to annotate.
To address these challenges, we propose a historical map registration framework that focuses specifically on the feature-matching component, which significantly improves overall registration accuracy.
(1) We construct a dedicated dataset specifically for historical map registration research. In collaboration with relevant museums and archaeological departments, we collected historical maps spanning various dynasties from cities in China such as Beijing, Tianjin, and Shenyang. These scanned map images were organized into city-based sub-datasets for registration studies. Inspired by the processing of cellular images in medical imaging [14], we designed an improved U-Net-based segmentation model [15] to extract the main contours of the maps from the original images, thereby effectively removing creases, stains, and other artifacts.
(2) We propose a new computer vision task focused on historical map registration and outline a corresponding research framework. Drawing inspiration from SAR image registration tasks, which share similarities with map registration, we integrated these approaches with the unique characteristics of historical map data to propose a novel registration architecture. First, we preprocess both the reference image and the moving image to remove noise in the original data. Next, we employ our self-supervised feature extraction method with non-maximum suppression (NMS) [16] to obtain feature points and descriptors. Then, we apply a graph attention mechanism to enrich these feature descriptors with contextual information. Subsequently, we establish feature point correspondences between the two images using the iterative Sinkhorn algorithm. Finally, we filter and refine these matched point pairs to estimate the transformation model, thereby completing the historical map registration task.
(3) We propose a self-supervised feature extraction method. Historical maps are limited in quantity and difficult to annotate manually. However, we observed that the processed map data essentially consist of collections of lines. Leveraging this characteristic, we generated fundamental geometric shapes (such as cubes, intersecting lines, letters, etc.) and applied geometric transformations to produce derivative images. These original and transformed shapes served as our original shape dataset. We then employed a convolutional neural network-based encoder–decoder architecture to train a feature extraction model, establishing a baseline. Subsequently, we applied this baseline model to historical map images for feature extraction and fine-tuned the results, ultimately obtaining a refined feature extraction model.
(4) We propose a feature update module based on a graph attention mechanism. Map images typically exhibit pronounced local correlations while spanning a wide range of scales, making the learning of positional information crucial for improving matching accuracy. Therefore, we transform feature points into nodes in a graph: similarities among feature points within the same image are regarded as intra-graph edges, and similarities among feature points across different images are treated as inter-graph edges. We then construct a graph attention architecture to propagate positional information across the feature maps. This approach enables the learning of local positional information around feature points and the similarity relationships across different scales, thereby enhancing the accuracy of subsequent feature matching.
The remainder of this paper is organized as follows. In Section 2, we provide a clear overview of the literature on related topics. Section 3 then provides a comprehensive explanation of each element in our proposed framework. Subsequently, Section 4 presents empirical findings and evaluates the impact of various components. Lastly, in Section 5, we conclude our work briefly and anticipate future improvements.

2. Related Work

Image registration is the task of establishing geometric correspondence between two images, which is achieved through feature extraction and matching. Based on this correspondence, transformation models can be built to characterize the geometric relationship between the images. In the field of image registration, approaches can be broadly categorized into two key paradigms: traditional methods and learning-based techniques.

2.1. Traditional Methods

Traditional image registration approaches can be broadly divided into two categories: intensity-based methods and feature-based methods.
Intensity-based methods formulate the registration task as an optimization problem, where transformation parameters are iteratively refined using similarity metrics such as mutual information [10], entropy, and cross-correlation [9]. Through iterative optimization, the final deformation parameters are derived to align the images. However, intensity-based methods often struggle with repetitive textures and low-contrast regions, where structural information is insufficient.
Feature-based methods estimate deformation parameters by establishing correspondences between matched feature point pairs. Descriptors encode local geometric and photometric information around feature points, ensuring robustness to geometric distortions and noise. The Scale-Invariant Feature Transform (SIFT) [17], proposed in 2004, laid the foundation for feature-based registration by introducing scale-invariant keypoints and gradient-based descriptors. Subsequent advancements, such as RootSIFT [18] (improving descriptor discriminability) and ASIFT [19] (enhancing affine invariance), further refined feature matching robustness. The Speeded-Up Robust Features (SURF) [20] algorithm accelerated computation through integral images and Haar wavelets, while KAZE [21] leveraged nonlinear diffusion filtering for stable keypoint detection in noisy images. Recent work includes POS-GIFT [8], which leverages phase congruency maps to handle intensity variations, and RIFT2 [7], a rotation-invariant descriptor tailored for multimodal registration.
Traditional methods face limitations when applied to historical map images, which frequently exhibit noise artifacts, stylistic variations across eras, and degraded structural continuity. Our method addresses these challenges through a dedicated preprocessing module for noise suppression and a graph attention network to model contextual relationships among feature points.

2.2. Learning-Based Methods

Recently, deep learning techniques have emerged as the dominant approach for many traditional computer vision tasks, including semantic segmentation, image registration, classification, and object detection. In the context of image registration, these methods achieve higher accuracy in establishing correspondences, enabling the estimation of geometrically consistent transformation models. Current learning-based registration methods can be categorized into descriptor-based and descriptor-free paradigms.
Descriptor-based methods learn feature representations end-to-end, replacing handcrafted descriptors with neural networks. Pioneering works like LIFT and MagicPoint demonstrated the feasibility of learning discriminative descriptors. R2D2 [22] advances this paradigm by jointly predicting reliable keypoints, robust descriptors, and confidence scores to prioritize trustworthy regions. D2Net [23] unifies keypoint detection and description within a single network, demonstrating robustness to significant intensity variations and viewpoint shifts. Despite their success, descriptor-based methods often fail to generalize to images with large domain gaps (e.g., stylistic disparities across historical maps) and extreme scale variations.
Recent detector-free approaches, such as RoMa [11], introduce a coarse-to-fine dense matching framework leveraging hierarchical feature representations. DeDoDe [24] decouples feature detection and description through a modular pipeline, enabling independent optimization of each component. Despite their advancements, state-of-the-art methods like RoMa and DeDoDe require large-scale pretraining on annotated datasets, limiting their applicability to historical maps—scarce, annotation-deficient, and stylistically heterogeneous. To address this, we propose a self-supervised framework specifically designed for historical map registration, eliminating dependency on manual annotations.

3. Materials and Methods

The proposed historical map registration framework is illustrated in Figure 1. The original map images are input to a map segmentation module for noise suppression. Then, we train a basic feature extraction model with dual-branch decoders—one dedicated to detecting feature points and the other to generating discriminative descriptors in our generated geometric shape dataset. The preprocessed images are then passed through this model to be fine-tuned. Within the graph attention module, a custom graph neural network aggregates contextual information across feature points through multi-hop message passing, significantly improving descriptor consistency. The Sinkhorn algorithm is first applied to compute dense correspondences, followed by RANSAC-based outlier rejection to estimate the optimal transformation model, yielding the final registered image pair.

3.1. Historical Map Dataset

The dataset comprises high-resolution scanned historical maps from several Chinese cities, including Beijing, Tianjin, and Shenyang. We show some examples of our dataset in Figure 2. These maps, predominantly exceeding 1500 × 1500 pixels in resolution, span multiple dynasties and exhibit significant variations in cartographic techniques and artistic styles. Owing to their historical prominence as political and cultural hubs, Beijing and Tianjin possess richer archival collections, with 20 and 22 meticulously preserved maps, respectively. For comparative analysis, more than four maps per city were collected for the remaining regions.
During the curation phase, preliminary visual inspections were conducted to assess map integrity and image fidelity, ensuring the quality of our dataset. Through collaboration with museums and archaeological institutions, the physical maps underwent professional-grade digitization using non-invasive scanning protocols. Given the fragile material composition of centuries-old maps, multi-phase scanning and precision stitching algorithms were employed to achieve higher geometric accuracy.
For artifact suppression, a custom map segmentation pipeline was developed, combining morphological operations and deep learning-based inpainting. Prior to model training, the maps were tiled into 512 × 512 pixel patches and annotated by cartographic experts to delineate structural elements (roads, landmarks, text). The raw scans and annotations were organized into a hierarchical geodatabase with spatiotemporal metadata (e.g., dynastic period, geographic coordinates), enabling reproducible computational analysis.

3.2. Map Preprocessing

During the model training phase, we addressed the heterogeneity of historical map images, which exhibit significant variations in resolution, size, and cartographic style. Given the prohibitive cost of complete manual annotation (particularly for high-resolution maps), we adopted a transfer learning strategy with self-supervised fine-tuning. First, maps were tiled into 256 × 256 pixel patches for coarse annotation of dominant features (primary roads, building footprints, and major waterways). Drawing inspiration from vascular structure segmentation [14]—a task that shares topological similarities with road network extraction—we initialized our model using a pretrained U-Net architecture. This baseline model was then fine-tuned on coarsely labeled map patches. By prioritizing the alignment of macro-level structures, minor discrepancies in local features (e.g., branch roads) were rendered less impactful, thereby simplifying downstream feature matching. To enhance cross-style generalization, we integrated Instance–Batch Normalization (IBN) [25] into the network, which enabled robust performance across diverse map aesthetics.
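A minimal sketch of the IBN idea referred to above is given below, assuming the commonly used variant that applies Instance Normalization to half of the channels and Batch Normalization to the rest; the split ratio and the placement inside an encoder block are illustrative rather than the exact configuration of our network.

import torch
import torch.nn as nn

class IBN(nn.Module):
    """Instance-Batch Normalization: IN on the first half of the channels, BN on the rest."""
    def __init__(self, channels, instance_ratio=0.5):
        super().__init__()
        self.half = int(channels * instance_ratio)
        self.IN = nn.InstanceNorm2d(self.half, affine=True)
        self.BN = nn.BatchNorm2d(channels - self.half)

    def forward(self, x):
        a, b = torch.split(x, [self.half, x.size(1) - self.half], dim=1)
        return torch.cat([self.IN(a), self.BN(b)], dim=1)

# Example: an encoder block whose BatchNorm is replaced by IBN
block = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), IBN(64), nn.ReLU(inplace=True))
out = block(torch.randn(2, 3, 256, 256))   # -> (2, 64, 256, 256)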
To handle the high-resolution historical maps (typically > 1500 × 1500 pixels), we implemented a tiling-and-stitching pipeline. The original maps were divided into 512 × 512-pixel overlapping patches (with a stride of 256) to preserve contextual continuity. Each patch was processed by the segmentation network, and the local predictions were aggregated through weighted averaging in overlapping regions, resulting in a full-resolution segmentation mask identical to the input dimensions.
To mitigate boundary artifacts in stitching results, we designed an overlapping tiling strategy with adaptive fusion weighting. Adjacent patches were generated with 20% overlapping regions, ensuring sufficient coverage of edge regions. During reconstruction, pixel-wise confidence scores (derived from the network’s softmax outputs) were used to blend predictions in overlapping zones, with higher weights assigned to high-confidence regions near patch centers. This approach effectively suppressed edge discontinuities while maintaining global consistency. The result of our map image segmentation method is shown in Figure 3.
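A minimal sketch of this tiling-and-stitching inference is given below; a centre-weighted Hanning window stands in for the confidence-derived blending weights, and the name segment denotes the trained segmentation network (both are illustrative assumptions).

import numpy as np

def stitch_segmentation(image, segment, tile=512, stride=256):
    """Tile a large map, run the segmentation network per tile, and blend the overlaps.
    Assumes the image is at least tile x tile pixels and that segment(patch) returns a
    (tile, tile) foreground-probability array."""
    h, w = image.shape[:2]
    prob = np.zeros((h, w), dtype=np.float32)     # accumulated weighted predictions
    weight = np.zeros((h, w), dtype=np.float32)   # accumulated blending weights

    ramp = np.hanning(tile)                       # high weight at the tile centre
    win = np.outer(ramp, ramp).astype(np.float32) + 1e-6

    ys = list(range(0, h - tile, stride)) + [h - tile]
    xs = list(range(0, w - tile, stride)) + [w - tile]
    for y in ys:
        for x in xs:
            pred = segment(image[y:y + tile, x:x + tile])
            prob[y:y + tile, x:x + tile] += pred * win
            weight[y:y + tile, x:x + tile] += win
    return prob / np.maximum(weight, 1e-6)        # full-resolution probability map

# Example with a stand-in "network" that simply thresholds intensity:
# mask = stitch_segmentation(img, lambda p: (p > p.mean()).astype(np.float32))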

3.3. Self-Supervised Map Feature Extraction

Traditional supervised learning methods [22,23] rely on manually annotated datasets. However, human annotations typically provide approximate localization of feature points rather than subpixel-level precision. Moreover, expert-guided annotation of cartographic features requires substantial resources, including domain expertise and dedicated infrastructure. Finally, the scarcity of annotated historical maps hinders the training of supervised models on limited datasets, which constrains their ability to learn robust feature representations.
Inspired by Jaderberg’s self-supervised geometric reasoning framework [26], we pretrain our model on a synthetic dataset of geometric primitives (e.g., lines, polygons, and curves). Through self-supervised pretraining [27], the model autonomously learns structural patterns from automatically generated labels, thereby eliminating manual annotation and enabling scalability for large-scale historical map analysis.
The decoder generates a 65-channel probability map through a 1 × 1 convolution, where each spatial position corresponds to an 8 × 8 grid in the input image. Each channel encodes activation probabilities for discrete positions within the 8 × 8 grid, with the 65th channel representing the absence of features. The use of non-overlapping 8 × 8 grids ensures exact spatial correspondence between decoder outputs and input regions. A channel-wise softmax operation computes pixel-level probabilities. During training, the spatial position with the highest probability within each 8 × 8 grid is selected as the feature point. During inference, a confidence threshold rejects low-probability candidates, and the 65th channel is excluded to suppress non-feature regions. The structure of our map feature extraction is shown in Figure 4.
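The inference-time decoding just described can be sketched as follows; the tensor names and the confidence threshold are illustrative.

import torch
import torch.nn.functional as F

def decode_keypoints(logits, conf_thresh=0.015):
    """logits: (B, 65, H/8, W/8) raw output of the detector decoder."""
    b, _, hc, wc = logits.shape
    prob = F.softmax(logits, dim=1)[:, :-1]                        # drop the 65th "no feature" channel
    prob = prob.permute(0, 2, 3, 1).reshape(b, hc, wc, 8, 8)       # one 8x8 cell per spatial position
    prob = prob.permute(0, 1, 3, 2, 4).reshape(b, hc * 8, wc * 8)  # full-resolution heatmap
    keypoints = [torch.nonzero(p > conf_thresh) for p in prob]     # (row, col) indices per image
    scores = [p[p > conf_thresh] for p in prob]
    return prob, keypoints, scores

heatmap, kpts, scores = decode_keypoints(torch.randn(1, 65, 60, 80))  # e.g. a 480 x 640 input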
To mitigate upsampling artifacts, we replace transposed convolution with Pixel Shuffle [28], which preserves structural integrity through sub-pixel convolution. We further augment the dataset using geometric transformations (such as rotation and scaling) and photometric distortions, then fine-tune the model to jointly predict pixel-accurate feature locations and rotation-invariant feature descriptors.
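A minimal sketch of this sub-pixel upsampling replacement is shown below; the channel counts are illustrative.

import torch
import torch.nn as nn

upsample = nn.Sequential(
    nn.Conv2d(128, 64 * 4, kernel_size=3, padding=1),  # predict 4x channels for a 2x upscale
    nn.PixelShuffle(2),                                 # (B, 256, H, W) -> (B, 64, 2H, 2W)
)
print(upsample(torch.randn(1, 128, 32, 32)).shape)      # torch.Size([1, 64, 64, 64])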
In order to achieve self-supervised training of the model, we pretrained the model on synthetic geometric primitives (e.g., intersecting lines, rectangles, squares) with programmatically generated feature points (e.g., intersections, endpoints). This pretraining phase enhances the model’s sensitivity to structural features such as intersection points, endpoints, and corners. In this study, semantic segmentation is applied to map data to extract road networks containing dense structural features. Direct application of the model for feature extraction would yield excessively dense correspondence sets between the images to be registered. Overly dense correspondence sets adversely affect downstream processing—they increase computational costs, demand higher hardware resources, and complicate graph-based matching architectures.
Therefore, we propose a customized Non-Maximum Suppression (NMS) [29] algorithm to appropriately adjust the quantity and distribution of output feature points, mitigating computational complexity in downstream tasks. The customized NMS algorithm, integrated into the feature extraction pipeline, is formalized in Algorithm 1. Non-Maximum Suppression is a computational technique that iteratively selects local maxima while suppressing neighboring non-maximal responses within a defined neighborhood. The search neighborhood is parameterized by its spatial extent (e.g., kernel size), with both scale and geometry configurable to meet empirical requirements.
In Algorithm 1, the local search radius r defines the size of the neighborhood; by setting the value of r, the local search range can be restricted. In lines 3 and 4, the algorithm applies a max-pooling operation to the confidence matrix scores to obtain the local maxima within each neighborhood. The results are then compared one-to-one with the original confidence matrix, and points with the highest confidence within their respective neighborhoods are marked as True; otherwise, they are marked as False, filtering out the first generation of feature points.
Next, in lines 6, 7, and 8, the confidence of non-maximum feature points within the neighborhood is set to zero, while the local maxima are retained. Lines 9 and 10 perform a similar function to lines 3 and 4, except that the search range excludes the neighborhood centered on the previous generation feature points. Line 11 merges feature points across multiple generations as the final search result.
Through this process, only one candidate feature point is selected within each local neighborhood, while the confidence of other feature points is forced to zero. This reduces the density of extracted features and ensures that the extracted feature set is more evenly distributed.
Algorithm 1 Algorithm of Non-Maximum Suppression
Input: local search radius r = 16, confidence matrix of candidate feature points scores
Output: scores after NMS algorithm processing
1:  zeros = torch.zeros_like(scores)
2:  kernel = 2 * r + 1
3:  pool_score = torch.nn.functional.max_pool2d(scores, kernel, stride=1, padding=r)
4:  first_mask = (scores == pool_score)
5:  for each i ∈ [1, n] do
6:      pool_first_mask = torch.nn.functional.max_pool2d(first_mask.float(), kernel, stride=1, padding=r)
7:      small_mask = pool_first_mask > 0
8:      small_scores = torch.where(small_mask, zeros, scores)
9:      pool_small_scores = torch.nn.functional.max_pool2d(small_scores, kernel, stride=1, padding=r)
10:     second_mask = (small_scores == pool_small_scores)
11:     first_mask = first_mask | (second_mask & (~small_mask))
12: end for
13: return torch.where(first_mask, scores, zeros)
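For reference, the following is a runnable PyTorch sketch of Algorithm 1; the number of refinement passes n is illustrative.

import torch
import torch.nn.functional as F

def map_nms(scores, r=16, n=2):
    """scores: (B, 1, H, W) confidence map of candidate feature points."""
    def max_pool(x):
        return F.max_pool2d(x, kernel_size=2 * r + 1, stride=1, padding=r)

    zeros = torch.zeros_like(scores)
    first_mask = scores == max_pool(scores)                    # local maxima (lines 3-4)
    for _ in range(n):
        small_mask = max_pool(first_mask.float()) > 0          # neighbourhoods of kept points
        small_scores = torch.where(small_mask, zeros, scores)  # zero them out (lines 6-8)
        second_mask = small_scores == max_pool(small_scores)   # maxima outside them (lines 9-10)
        first_mask = first_mask | (second_mask & ~small_mask)  # merge the generations (line 11)
    return torch.where(first_mask, scores, zeros)

suppressed = map_nms(torch.rand(1, 1, 512, 512), r=16)         # at most one point per neighbourhood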

3.4. Map Feature Matching Based on Graph Neural Networks

A short introduction to the Transformer, an encoder architecture that relies on the attention mechanism, is given in [30]. The attention mechanism is a prevalent strategy in deep learning, particularly for tasks in natural language processing (NLP) and computer vision. When handling sequences, it concentrates on elements that are most pertinent to the current objective, thereby extracting essential information. The attention layer receives three input vectors: query Q, key K, and value V. The query vector Q accesses information from the key vector K by computing the dot product between Q and each corresponding key vector, as illustrated by the following equation:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(QK^{T}\right)V \quad (1)$$
Intuitively, attention acquires relevant information by calculating the similarity between each key K and the query Q. If the similarity is high, more relevant information will be extracted from the corresponding value V. This process is also referred to as message passing in graph neural networks.
Historical map images often lack texture but exhibit strong local correlations. Similar building contours may appear within the same map, while maps from different periods may share spatially analogous configurations. To leverage these similarities, we construct a complete graph where nodes represent feature points from two encoded candidate sets. There are two types of undirected edges: intra-image edges, which connect feature points within a single image, and inter-image edges, which connect feature points across the two input images. These edges define dual propagation mechanisms: intra-image edges implement self-attention to model contextual relationships, whereas inter-image edges facilitate cross-attention for correspondence reasoning.
Let $can\_f_i^{A(l)}$ denote the feature of the i-th candidate pixel in the reference image A at the l-th layer of the graph neural network. This feature aggregates information from the other points $\{ j : (i, j) \in E \}$, where $E \in \{ E_{Intra}, E_{Inter} \}$. Therefore, the update rule for $can\_f_i^{A(l)}$ is given by Equation (2):
$$can\_f_i^{A(l+1)} = can\_f_i^{A(l)} + \mathrm{MLP}\left(\left[\, can\_f_i^{A(l)} \,\|\, mess_{E \rightarrow i} \,\right]\right) \quad (2)$$
In this process, $can\_f_i^{A(l)}$ and $mess_{E \rightarrow i}$ are the two vectors being concatenated. E denotes the type of edge through which information flows at the current layer. For layers $1 \le l \le L/2$, $E = E_{Intra}$, and feature information is aggregated along the intra-graph undirected edges towards the current feature. For layers $L/2 + 1 \le l \le L$, $E = E_{Inter}$, and feature information is aggregated along the inter-graph undirected edges towards the current feature. $mess_{E \rightarrow i}$ represents the aggregated information generated by the attention mechanism:
$$\alpha_{ij} = \mathrm{Softmax}_{j}\left(\mathrm{Sim}\left(query_i,\, key_j\right)\right) \quad (3)$$
$$mess_{E \rightarrow i} = \sum_{j:(i,j) \in E} \alpha_{ij}\, value_j \quad (4)$$
Let Q denote the query image and S denote the source information image. Due to the two types of information aggregation, namely intra-image and inter-image aggregation, we define the following. The query vector $query_i$ is a linear mapping of the i-th feature point in the query image Q, while the key vector $key_j$ and value vector $value_j$ are linear mappings of the j-th feature point in the source information image S. The computation of these vectors is given by the following equations:
$$query_i = W_1 \cdot can\_f_i^{Q(l)} + b_1 \quad (5)$$
$$\begin{bmatrix} key_j \\ value_j \end{bmatrix} = \begin{bmatrix} W_2 \\ W_3 \end{bmatrix} \cdot can\_f_j^{S(l)} + \begin{bmatrix} b_2 \\ b_3 \end{bmatrix} \quad (6)$$
The final feature vectors of reference image A are obtained after multiple iterations:
$$final\_f_i^{A} = W \cdot can\_f_i^{A(L)} + b, \quad \forall i \in A \quad (7)$$
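A minimal sketch of one such attentional message-passing layer, following Equations (2)-(6), is given below; the feature dimension and the MLP width are illustrative. Intra-image layers call it with the same image as query and source, while inter-image layers use the other image as the source.

import torch
import torch.nn as nn

class AttentionalPropagation(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # Eq. (5): query_i = W1 * can_f_i + b1
        self.k_proj = nn.Linear(dim, dim)   # Eq. (6): key_j   = W2 * can_f_j + b2
        self.v_proj = nn.Linear(dim, dim)   # Eq. (6): value_j = W3 * can_f_j + b3
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 2 * dim), nn.ReLU(),
                                 nn.Linear(2 * dim, dim))

    def forward(self, x_query, x_source):
        q, k, v = self.q_proj(x_query), self.k_proj(x_source), self.v_proj(x_source)
        alpha = torch.softmax(q @ k.transpose(-2, -1), dim=-1)          # Eq. (3)
        mess = alpha @ v                                                # Eq. (4)
        return x_query + self.mlp(torch.cat([x_query, mess], dim=-1))  # Eq. (2)

layer = AttentionalPropagation(dim=256)
feats_A = torch.randn(1, 500, 256)   # candidate features of image A
feats_B = torch.randn(1, 480, 256)   # candidate features of image B
feats_A = layer(feats_A, feats_A)    # intra-image layer (self-attention)
feats_A = layer(feats_A, feats_B)    # inter-image layer (cross-attention)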

3.5. Optimal Transport

Based on the feature vectors obtained in the previous section, we can compute the feature matching score matrix S for images A and B:
$$S_{i,j} = \left\langle final\_f_i^{A},\, final\_f_j^{B} \right\rangle, \quad \forall (i, j) \in A \times B \quad (8)$$
where $\langle \cdot , \cdot \rangle$ denotes the dot product operation. Each element in the matrix represents the similarity between candidate features in images A and B. At this point, we can transform the image feature matching problem into the task of solving the assignment matrix X, such that the value of $\sum_{i,j} S_{i,j} X_{i,j}$ is minimized. This is a classic optimal transport problem.
Because many points are involved in the matching phase (typically over 1000), the time cost must be controlled during the solution process. Therefore, we use an entropy-regularized algorithm to find an approximate solution [31]. We introduce an entropy regularization term:
$$d_S^{\lambda}(r, c) = \min_{X \in U(r, c)} \sum_{i,j} S_{ij} X_{ij} - \varepsilon H(X) \quad (9)$$
where ε is the regularization coefficient and H(X) is the entropy regularization term, given by the following expression:
$$H(X) = -\sum_{i,j} X_{ij}\left(\log X_{ij} - 1\right) \quad (10)$$
We use the Sinkhorn algorithm to iteratively solve for an approximate solution to this problem, ultimately obtaining the assignment matrix X.
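A minimal sketch of these Sinkhorn iterations is given below; the cost matrix, the uniform marginals, the regularization coefficient, and the iteration count are illustrative.

import torch

def sinkhorn(S, eps=0.1, iters=100):
    """S: (M, N) matching cost matrix; returns an approximate assignment matrix X."""
    K = torch.exp(-S / eps)                      # transport kernel
    u = torch.ones(S.shape[0]) / S.shape[0]      # row marginals r (uniform)
    v = torch.ones(S.shape[1]) / S.shape[1]      # column marginals c (uniform)
    a = torch.ones_like(u)
    for _ in range(iters):
        a = u / (K @ (v / (K.t() @ a)))          # alternate row and column scaling
    b = v / (K.t() @ a)
    return torch.diag(a) @ K @ torch.diag(b)     # X = diag(a) K diag(b)

X = sinkhorn(torch.rand(100, 120))
print(X.sum(dim=1)[:3])                          # each row approximately sums to 1/100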
We apply RANSAC to eliminate mismatches, leveraging the planarity assumption that corresponding points between the two images lie on a shared plane explainable by a homography matrix. RANSAC follows a hypothesis-and-verification paradigm, iteratively refining model parameters. The algorithm initializes by assuming the data conform to a parametric model, randomly selecting minimal subsets to estimate provisional models. The consensus set is then evaluated by counting inliers—data points within a predefined error tolerance of the model. Inliers are defined as points satisfying the geometric constraint within a reprojection error threshold, whereas outliers violate this criterion. Models with larger consensus sets (i.e., higher inlier ratios) are probabilistically favored, as they maximize the likelihood of correct model estimation. Through iterative hypothesis generation and validation, RANSAC converges to the optimal model maximizing inlier count, while discarding residual outliers.
In other words, all correctly matched point pairs between the two images to be registered can fit such a homography model. Therefore, RANSAC can be used to optimize the initial set of matched feature point pairs. The steps for using RANSAC to refine the initial feature match set in this study are as follows: (1) In the preliminary set of matched feature point pairs, define an initial homography transformation matrix as follows:
$$H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \quad (11)$$
Then, randomly sample at least four pairs of feature points and compute the homography transformation matrix H.
$$s \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \quad (12)$$
Here, $(x, y)$ denotes the coordinates of a feature point in the sensed image, whereas $(x', y')$ signifies the coordinates of its matching feature point in the reference image. Additionally, s represents the scaling parameter.
Based on the homography transformation matrix, compute the reprojected coordinates of the features between the two input images, then compare the distance between the reprojected point and its matching point in the other map. If the distance falls below a predetermined threshold, the pair is deemed a correct match, and the tally of accurate matches is incremented; otherwise, it is deemed a mismatch. After reprojection, the distance is calculated as shown in Equation (13):
$$\sqrt{\left(x' - \frac{h_{11}x + h_{12}y + h_{13}}{h_{31}x + h_{32}y + h_{33}}\right)^{2} + \left(y' - \frac{h_{21}x + h_{22}y + h_{23}}{h_{31}x + h_{32}y + h_{33}}\right)^{2}} \quad (13)$$
Repeat the previous two steps, comparing the number of correct matches recorded after multiple iterations. The homography transformation matrix that yields the largest number of correct matches is regarded as the optimal transformation model. Then, using this optimal transformation model, any incorrect matches are filtered out, and all correct matches are output, thereby completing the optimization of the initial feature matching set.
Before the algorithm begins, a maximum number of iterations, denoted as m a x _ i t e r N u m , must be specified. During the iterative process, the actual number of iterations, i t e r N u m , is continuously updated until it reaches the maximum iteration count. The update process is shown in Equation (14).
$$iterNum = \frac{\log(1 - p)}{\log(1 - w^{m})} \quad (14)$$
where p is the confidence level, representing the probability that all randomly sampled data points from the dataset are inliers, typically set to 0.995 . The term w denotes the ratio of inliers to the total dataset, and m is the minimum number of data samples required in the first sampling step.
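The following sketch computes the iteration count of Equation (14) and feeds it to OpenCV's RANSAC-based homography estimation; the synthetic correspondences, the assumed inlier ratio, and the reprojection threshold are illustrative.

import numpy as np
import cv2

def ransac_iterations(p=0.995, w=0.5, m=4):
    """p: confidence; w: estimated inlier ratio; m: minimal sample size (4 for a homography)."""
    return int(np.ceil(np.log(1 - p) / np.log(1 - w ** m)))

print(ransac_iterations(w=0.5))   # about 83 iterations when half of the matches are inliers

# Synthetic matched pairs generated from a known homography, standing in for real matches
sensed_pts = (np.random.rand(50, 2) * 1000).astype(np.float32)
H_true = np.array([[1.0, 0.02, 15.0], [-0.01, 1.0, -8.0], [0.0, 0.0, 1.0]])
proj = np.hstack([sensed_pts, np.ones((50, 1))]) @ H_true.T
reference_pts = (proj[:, :2] / proj[:, 2:]).astype(np.float32)

H, inlier_mask = cv2.findHomography(sensed_pts, reference_pts, cv2.RANSAC, 3.0,
                                    maxIters=ransac_iterations(w=0.5))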

4. Results

The proposed method was implemented in PyTorch and trained on our self-constructed map dataset. We set the learning rate to $10^{-3}$, adopting an input size of 1440 × 1920 and a batch size of four. The basic geometric shape images were 256 × 256 pixels, with a total of 63,000 images used to train the baseline model, and more than 100 historical map images were used to fine-tune it.
The model was trained for 50 epochs, and each epoch took approximately 14 min under the following hardware configuration: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20 GHz; 2 × RTX 3090 GPU with 24 GB VRAM; 256 GB DDR4 RAM.
Table 1 summarizes the training parameters.

4.1. Metrics

Because there has been no relevant research on historical map image registration thus far, we adopt evaluation metrics from SAR image registration, which is similar to historical map registration, to assess the registration framework proposed in our study:
  • $N_{all}$. This is the total number of keypoint matching pairs obtained in the feature matching module [32]. A higher $N_{all}$ means that there are more correspondences between keypoints in different images.
  • NOCC (Number of Correct Correspondences [32]). This metric measures the number of correct correspondences between keypoints. A higher number of correct matches leads to a more accurate estimation of the geometric transformation model.
  • ROCC (Ratio of Correct Correspondences [33]). This metric evaluates the proportion of accurate matches by assessing the presence of outlier matches among key points during the registration process. A higher ROCC indicates that a greater proportion of the matched key points are correct, suggesting that the registration is more resilient to incorrect matches and outliers. The ROCC is calculated as follows:
    $$ROCC = \frac{NOCC}{N_{all}}$$
  • RMSE (Root Mean Squared Error [33]). RMSE is a metric used to gauge the reliability and precision of the registration procedure. It quantifies how accurately two images are aligned by inspecting both the forward and inverse geometric transformations. A lower RMSE score indicates greater registration accuracy, reflecting fewer alignment errors (a minimal computation sketch follows this list).
    $$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left\| P_i - g\left(f\left(P_i\right)\right)\right\|^{2}}$$
    Here, $P_i$ denotes a key point in the sensed image, and N is the total number of key points in that image. The functions f and g correspond to the forward and inverse transformation models, respectively.
  • RT (Runtime [32]). This denotes the running time of the algorithm. Registration efficiency is also a key focus of our study. By comparing the runtime differences of various methods across different modules, we can clearly illustrate the time complexity of each approach.
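A minimal sketch of the RMSE computation, assuming the forward and backward models f and g are 3 × 3 homographies, is given below; the point set and the matrices are illustrative.

import numpy as np

def apply_h(H, pts):
    """Apply a 3x3 homography to an (N, 2) array of points."""
    p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return p[:, :2] / p[:, 2:]

def registration_rmse(pts, H_f, H_g):
    round_trip = apply_h(H_g, apply_h(H_f, pts))   # g(f(P_i))
    return np.sqrt(np.mean(np.sum((pts - round_trip) ** 2, axis=1)))

pts = np.random.rand(200, 2) * 1500
H_f = np.array([[1.0, 0.0, 5.0], [0.0, 1.0, -3.0], [0.0, 0.0, 1.0]])
print(registration_rmse(pts, H_f, np.linalg.inv(H_f)))   # ~0 for perfectly inverse models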

4.2. MapSegment

To validate the necessity of image segmentation preprocessing, we conducted a controlled experiment comparing registration performance on raw historical maps versus segmented maps processed by the proposed MapSegment pipeline.
We applied MapSegment to the images in Figure 5, with results demonstrated in Figure 6. The original images include historical maps with various drawing styles. Some images exhibit noticeable creases (a, b, c), evident signs of paper aging (d, g, h), and the presence of irrelevant text or color interference (e, f). These noise elements negatively impact the registration process. To address this, we applied image segmentation methods for preprocessing. As shown in Figure 6, preprocessing preserves key cartographic structures (e.g., roads, buildings) while suppressing extraneous elements (e.g., chromatic variations, textual annotations). The segmentation results make the main road information in the images significantly clearer, with noise notably reduced. From an observational perspective, the primary features expressed in the images become easier to distinguish.
The preprocessed maps form the core of our Historical Map Registration Benchmark (HMRB), ensuring domain-consistent evaluation for subsequent experiments. Visual inspection confirms enhanced discriminability of key features, enabling robust correspondence estimation despite stylistic variations.
To rigorously evaluate the efficacy of MapSegment, we established SuperPoint and SuperGlue as baseline models. Feature matching was performed on both raw historical maps and segmented maps, with comparative results shown in Table 2 and visualized in Figure 7.
Compared with the original images, the number of effective matching pairs ($N_{all}$) obtained on the segmented images improves by 15% on average. The original maps exhibited sparser correspondences (mostly <15% inlier ratio) with clustered spatial distributions, particularly when stylistic or structural discrepancies existed between pairs. In contrast, the segmented maps achieved far more correspondences with a spatially uniform distribution.

4.3. MapExtraction

To validate the superior performance of the self-supervised learning-based feature extraction algorithm designed for map images in this study, this subsection presents comparative experiments using the SuperPoint feature extraction algorithm as a reference. Two segmented maps of Tianjin are selected as demonstration cases, as shown in Figure 8.
In Figure 8, (a) and (d) show the segmented maps, and the pink dots in the remaining panels represent the extracted feature points. (b) and (e) are the results of SuperPoint, while (c) and (f) display the output of our algorithm for the two segmented maps of Tianjin.
The figure shows that SuperPoint extracts more points than our method. Notably, it frequently extracts feature points with closely spaced coordinates on the same road. In contrast, the proposed feature extraction algorithm employs the NMS (Non-Maximum Suppression) algorithm to control the coordinate distances between adjacent feature points, yielding a more evenly distributed set of feature points. This reduction in feature point redundancy is beneficial for improving the efficiency of subsequent feature matching.

4.4. MapMatcher

Then, we apply the MapMatcher proposed in this study to historical map data and compare its performance with several popular models from recent years, including RIFT2 [7], D2Net [23], SuperGlue [12], DeDoDe [24], R2D2 [22], and RoMa [11]. We show the results in Table 3, using images a–h in Figure 6.
Runtime (RT) analysis demonstrates that MapMatcher and R2D2 achieve substantially lower latency compared to baseline methods. Baselines including RIFT2, D2Net, SuperGlue, and DeDoDe exhibit inference times ranging from 1 to over 4 s. D2Net reaches peak latencies of 4–5 s (e.g., 4.651 s on task a–c), while SuperGlue operates within 1–2 s. R2D2 maintains sub-2-second latency, with minima reaching 0.561 s (task a–c), demonstrating computational efficiency. MapMatcher achieves near-real-time performance (such as 1.063 s on a–e and 1.062 s on a–g), securing its position as a top-tier efficient method.
According to the number of extracted keypoints (Ref and Sen), RIFT2, SuperGlue and RoMa tend to extract a very large number of keypoints (Ref and Sen often exceed 4000–6000), which can be computationally heavy and might not always be necessary. D2Net and DeDoDe extract fewer keypoints (around a few hundred to a little over a thousand), resulting in less data for matching. R2D2 balances between extremes but still produces fewer keypoints (around 1000–2000). MapMatcher strikes a moderate balance: approximately 3500 reference and around 2400 sensed keypoints, which is sufficient to ensure robust matching without excessive overhead.
$N_{all}$ denotes the number of matched keypoint pairs, which directly reflects the potential quality of the image alignment. RIFT2 generates dense correspondence sets (e.g., 2059 pairs in task a–b), attributed to its exhaustive keypoint sampling strategy. D2Net and DeDoDe produce sparser matches (often <200 pairs), indicating conservative feature selection heuristics. SuperGlue achieves a high match count at the cost of increased runtime (RT). R2D2 prioritizes speed over match density, with most tasks yielding <300 pairs. RoMa’s dense matching paradigm incurs high computational overhead, yet suffers from lower precision (more keypoints but fewer valid matches). MapMatcher consistently achieves superior match density, exemplified by 2418 pairs (a–b), 2400 pairs (a–g), and 1829 pairs (a–h).
In summary, MapMatcher demonstrates a superior trade-off across all evaluation metrics. MapMatcher achieves near-R2D2-level runtime efficiency while yielding more matches, rivaling or surpassing robust matching methods such as RIFT2 and SuperGlue. Although we extracted fewer feature points than RoMa, the ones we extracted have a higher probability of successful matching. This combination of speed and match quality indicates that MapMatcher is both time-efficient and effective at producing rich, reliable correspondences, making it a strong choice for real-time or near real-time image matching and registration tasks.
We benchmarked MapExtraction (feature extraction) and MapMatcher (feature matching) against state-of-the-art methods to assess their performance on historical map datasets. Quantitative results are summarized in Table 4. Let $N_{all}$ denote the total number of matched pairs, and ROCC represent the ratio of correct correspondences. MapExtraction enhances matching precision by prioritizing high-saliency keypoints during feature extraction. Despite reducing the total keypoints, MapExtraction boosts both efficiency and precision by filtering out low-confidence candidates.
In the feature matching stage, compared to Nearest Neighbor (NN) matching and the SuperGlue method, the proposed MapMatcher demonstrates a higher matching accuracy. Under optimal experimental conditions, MapMatcher’s accuracy improved by approximately 30% compared to SuperGlue, further validating its outstanding performance in complex historical map data. This indicates that MapMatcher not only handles heterogeneous features but also effectively addresses registration challenges such as low texture, providing a superior solution for the automated processing of historical maps.

4.5. MapRegistration

We applied the RANSAC algorithm to the match pairs obtained earlier to estimate the transformation model between the images. We selected SIFT2, SuperGlue, DeDoDe, and RoMa, which showed better performance in Section 4.4, for comparison with our method. The results are shown in Table 5.
Relative to SIFT2, MapRegistration demonstrates a 10%+ speedup, 15–200% higher mutual information (MI), and several-fold improvements in ROCC and RMSE. Compared to SuperGlue, MapRegistration achieves >50% faster runtime, mostly 20–35% higher MI, and 10% lower RMSE. Although DeDoDe occasionally outperforms in ROCC for specific cases, MapRegistration dominates across the majority of test scenarios. RoMa’s dense matching strategy yields an excessive number of correspondences, often exceeding practical utility. However, this dense correspondence set introduces noise, increasing computational overhead and degrading ROCC compared to other methods.
In summary, MapRegistration emerges as the most balanced and effective solution, delivering state-of-the-art registration accuracy while maintaining competitive runtime efficiency. It outperforms competing methods across nearly all metrics, solidifying its position as the preferred choice for historical map registration tasks.

5. Conclusions

To address the challenges of historical map registration, we collaborated with museums and research institutions across different regions to collect and organize a dataset of historical maps from various periods and cities, drawn with different techniques. We designed a U-Net-based image segmentation model and manually annotated a large number of historical maps, which became the training and test sets of this model; the model removes noise such as creases and stains from the images. Next, building on the SuperPoint model, we initially trained a baseline feature extraction model on the generated basic shape images to extract feature points and descriptors. We then fine-tuned the model with the annotated historical map images to improve its performance on map data. To enrich the information contained in the feature descriptors, we proposed a graph attention network called MapFormer, allowing information to flow between feature points within and across images. We converted the matching problem into a classical optimal transport problem and solved it with a customized iterative Sinkhorn algorithm. In the final stage, we applied the RANSAC algorithm to the matched pairs to obtain the transformation model.
We evaluated our registration framework extensively, demonstrating that our method surpasses several state-of-the-art models across multiple performance metrics. Quantitative average improvements include up to 20% lower RMSE, 10% higher ROCC, and more than 15% faster runtime.
Our method achieves superior performance across multiple metrics, demonstrating the effectiveness of the proposed historical map registration framework. It has potential applications in archaeological research, historical building reconstruction, and urban change tracking analysis.
We believe that there are two potential development directions for subsequent registration architectures. Currently, our map registration dataset consists primarily of historical maps based on paper, which are relatively homogeneous. Future researchers can improve the dataset by incorporating urban maps from different sources (such as remote sensing images and modern maps), thus improving the generalization capability of the registration model across various types of map data. On the other hand, we can further improve our model from multiple directions. One is to employ more sophisticated baseline models for training in order to achieve higher accuracy, which would be suitable for scenarios requiring stringent registration results. Alternatively, existing models could be pruned and optimized (e.g., by using early stopping or filtering) to further improve registration efficiency.

Author Contributions

Z.Q.: Conceptualization, methodology, software, validation, investigation, data curation, and writing—original draft. Y.F.: Conceptualization, methodology, and formal analysis. G.W.: Supervision, resources, writing—review and editing, project administration, and funding acquisition. Q.D.: Data curation, writing—review and editing. T.H.: Writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (Grant No. 2019YFB1405302) and the National Natural Science Foundation of China (Grant No. 61872072).

Data Availability Statement

The historical map dataset of this article will be made available by the authors upon request.

Acknowledgments

The authors wish to thank TimeMarking.com for providing the map information used in this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Orabi, R. Aleppo Pixelated: An Urban Reading through Digitized Historical Maps and High-Resolution Orthomosaics Case Study of al-Aqaba and al-Jallūm Quarters. Digital 2024, 4, 152–168. [Google Scholar] [CrossRef]
  2. Xia, X.; Zhang, T.; Heitzler, M.; Hurni, L. Vectorizing historical maps with topological consistency: A hybrid approach using transformers and contour-based instance segmentation. Int. J. Appl. Earth Obs. Geoinf. 2024, 129, 103837. [Google Scholar] [CrossRef]
  3. Smith, E.S.; Fleet, C.; King, S.; Mackaness, W.; Walker, H.; Scott, C.E. Estimating the density of urban trees in 1890s Leeds and Edinburgh using object detection on historical maps. Comput. Environ. Urban Syst. 2025, 115, 102219. [Google Scholar] [CrossRef]
  4. Ju, F.; Li, Y.; Zhao, J.; Dong, M. 2D/3D fast fine registration in minimally invasive pelvic surgery. Biomed. Signal Process. Control 2025, 100, 107145. [Google Scholar] [CrossRef]
  5. Hui, N.; Jiang, Z.; Cai, Z.; Ying, S. Vision-HD: Road change detection and registration using images and high-definition maps. Int. J. Geogr. Inf. Sci. 2024, 38, 454–477. [Google Scholar] [CrossRef]
  6. Darzi, F.; Bocklitz, T. A Review of Medical Image Registration for Different Modalities. Bioengineering 2024, 11, 786. [Google Scholar] [CrossRef]
  7. Xie, Z.; Zhang, W.; Wang, L.; Zhou, J.; Li, Z. Optical and SAR Image Registration Based on the Phase Congruency Framework. Appl. Sci. 2023, 13, 5887. [Google Scholar] [CrossRef]
  8. Hou, Z.; Liu, Y.; Zhang, L. POS-GIFT: A geometric and intensity-invariant feature transformation for multimodal images. Inf. Fusion 2024, 102, 102027. [Google Scholar] [CrossRef]
  9. Pallotta, L.; Giunta, G.; Clemente, C. Subpixel SAR Image Registration Through Parabolic Interpolation of the 2-D Cross Correlation. IEEE Trans. Geosci. Remote Sens. 2020, 58, 4132–4144. [Google Scholar] [CrossRef]
  10. Sengupta, D.; Gupta, P.; Biswas, A. A survey on mutual information based medical image registration algorithms. Neurocomputing 2022, 486, 174–188. [Google Scholar] [CrossRef]
  11. Edstedt, J.; Sun, Q.; Bökman, G.; Wadenbäck, M.; Felsberg, M. RoMa: Robust dense feature matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19790–19800. [Google Scholar]
  12. Sarlin, P.E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning Feature Matching with Graph Neural Networks. arXiv 2020, arXiv:1911.11763. [Google Scholar] [CrossRef]
  13. Liaghat, A.; Helfroush, M.S.; Norouzi, J.; Danyali, H. Airborne SAR to Optical Image Registration Based on SAR Georeferencing and Deep Learning Approach. IEEE Sensors J. 2023, 23, 26446–26458. [Google Scholar] [CrossRef]
  14. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597. [Google Scholar] [CrossRef]
  15. Song, Y.; Pan, Q.K.; Gao, L.; Zhang, B. Improved non-maximum suppression for object detection using harmony search algorithm. Appl. Soft Comput. 2019, 81, 105478. [Google Scholar] [CrossRef]
  16. Boroujeni, S.P.H.; Razi, A. IC-GAN: An Improved Conditional Generative Adversarial Network for RGB-to-IR image translation with applications to forest fire monitoring. Expert Syst. Appl. 2024, 238, 121962. [Google Scholar] [CrossRef]
  17. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  18. Arandjelovic, R.; Zisserman, A. Three things everyone should know to improve object retrieval. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2911–2918. [Google Scholar] [CrossRef]
  19. Morel, J.M.; Yu, G. ASIFT: A New Framework for Fully Affine Invariant Image Comparison. SIAM J. Imaging Sci. 2009, 2, 438–469. [Google Scholar] [CrossRef]
  20. Bay, H.; Ess, A.; Tuytelaars, T.; Van Gool, L. Speeded-Up Robust Features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359. [Google Scholar] [CrossRef]
  21. Alcantarilla, P.F.; Bartoli, A.; Davison, A.J. KAZE Features. In Proceedings of the Computer Vision—ECCV 2012, Florence, Italy, 7–13 October 2012; Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 214–227. [Google Scholar] [CrossRef]
  22. Revaud, J.; De Souza, C.; Humenberger, M.; Weinzaepfel, P. R2D2: Reliable and Repeatable Detector and Descriptor. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
  23. Dusmanu, M.; Rocco, I.; Pajdla, T.; Pollefeys, M.; Sivic, J.; Torii, A.; Sattler, T. D2-Net: A Trainable CNN for Joint Detection and Description of Local Features. arXiv 2019, arXiv:1905.03561. [Google Scholar] [CrossRef]
  24. Edstedt, J.; Bökman, G.; Wadenbäck, M.; Felsberg, M. DeDoDe: Detect, Don’t Describe—Describe, Don’t Detect for Local Feature Matching. In Proceedings of the 2024 International Conference on 3D Vision (3DV), Davos, Switzerland, 18–21 March 2024; pp. 148–157. [Google Scholar]
  25. Pan, X.; Luo, P.; Shi, J.; Tang, X. Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net. arXiv 2020, arXiv:1807.09441. [Google Scholar] [CrossRef]
  26. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial Transformer Networks. arXiv 2016, arXiv:1506.02025. [Google Scholar] [CrossRef]
  27. Balestriero, R.; Ibrahim, M.; Sobal, V.; Morcos, A.; Shekhar, S.; Goldstein, T.; Bordes, F.; Bardes, A.; Mialon, G.; Tian, Y.; et al. A Cookbook of Self-Supervised Learning. arXiv 2023, arXiv:2304.12210. [Google Scholar] [CrossRef]
  28. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. arXiv 2016, arXiv:1609.05158. [Google Scholar] [CrossRef]
  29. Hosang, J.; Benenson, R.; Schiele, B. Learning non-maximum suppression. arXiv 2017, arXiv:1705.02950. [Google Scholar] [CrossRef]
  30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar] [CrossRef]
  31. Cuturi, M. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–8 December 2013; Curran Associates, Inc.: Red Hook, NY, USA, 2013; Volume 26. [Google Scholar]
  32. Norouzi, J.; Helfroush, M.S.; Liaghat, A.; Danyali, H. A Deep-Based Approach for Multi-Descriptor Feature Extraction: Applications on SAR Image Registration. Expert Syst. Appl. 2024, 254, 124291. [Google Scholar] [CrossRef]
  33. Ma, W.; Zhang, J.; Wu, Y.; Jiao, L.; Zhu, H.; Zhao, W. A Novel Two-Step Registration Method for Remote Sensing Images Based on Deep and Local Features. IEEE Trans. Geosci. Remote Sens. 2019, 57, 4834–4843. [Google Scholar] [CrossRef]
Figure 1. Workflow of the main steps of the method.
Figure 2. Example historical map images.
Figure 3. Map image segmentation result.
Figure 4. Structure of map feature extraction.
Figure 5. Original maps of Beijing City in different historical periods.
Figure 6. Segment maps of Beijing City in different historical periods.
Figure 7. Feature matching results on original and segment map.
Figure 8. Feature extraction visualization of SuperPoint and MapExtraction.
Table 1. Training parameters.

Parameter | Value
Input Image Size | 1440 × 1920
Learning Rate | $10^{-3}$
Epoch | 50
Batch Size | 4
Table 2. Effectiveness of MapSegment.

Process | Metrics | a–b | a–c | a–d | a–e | a–f | a–g | a–h
- | RT | 1.104 s | 1.271 s | 1.086 s | 1.080 s | 1.106 s | 1.091 s | 1.089 s
  | Ref | 3622 | 3622 | 3622 | 3622 | 3622 | 3622 | 3622
  | Sen | 3186 | 4001 | 2819 | 2623 | 3626 | 2921 | 2806
  | N_all | 1980 | 66 | 30 | 185 | 53 | 1974 | 888
MapSegment | RT | 1.079 s | 1.094 s | 1.094 s | 1.063 s | 1.092 s | 1.062 s | 1.067 s
  | Ref | 3559 | 3559 | 3559 | 3559 | 3559 | 3559 | 3559
  | Sen | 2425 | 4144 | 3876 | 2959 | 4034 | 2893 | 3136
  | N_all | 2418 | 501 | 303 | 630 | 252 | 2400 | 1829
Table 3. Comparative experiment results of different feature matching models.

Models | Metrics | a–b | a–c | a–d | a–e | a–f | a–g | a–h
RIFT2 | RT | 2.067 s | 2.730 s | 2.321 s | 2.032 s | 3.145 s | 2.659 s | 2.675 s
  | Ref | 6099 | 6099 | 6099 | 6099 | 6099 | 6099 | 6099
  | Sen | 6373 | 6659 | 4999 | 4792 | 5855 | 4765 | 6127
  | N_all | 2059 | 986 | 1090 | 1451 | 2237 | 1530 | 1439
D2Net | RT | 4.320 s | 4.651 s | 4.376 s | 5.011 s | 5.234 s | 4.012 s | 4.368 s
  | Ref | 4316 | 4316 | 4316 | 4316 | 4316 | 4316 | 4316
  | Sen | 3975 | 4399 | 4361 | 3667 | 3872 | 4785 | 4800
  | N_all | 293 | 144 | 161 | 314 | 330 | 138 | 499
SuperGlue | RT | 1.786 s | 1.329 s | 1.345 s | 1.377 s | 2.153 s | 2.356 s | 1.038 s
  | Ref | 6223 | 6223 | 6223 | 6223 | 6223 | 6223 | 6223
  | Sen | 5620 | 5727 | 6971 | 6615 | 5047 | 5713 | 5759
  | N_all | 2411 | 796 | 1331 | 762 | 523 | 375 | 357
DeDoDe | RT | 2.435 s | 3.025 s | 3.152 s | 2.125 s | 2.158 s | 2.568 s | 2.325 s
  | Ref | 1620 | 1620 | 1620 | 1620 | 1620 | 1620 | 1620
  | Sen | 1271 | 2320 | 1197 | 1629 | 1950 | 2739 | 2847
  | N_all | 57 | 208 | 76 | 285 | 79 | 231 | 274
R2D2 | RT | 0.894 s | 0.561 s | 1.357 s | 1.639 s | 0.864 s | 1.173 s | 1.269 s
  | Ref | 1518 | 1518 | 1518 | 1518 | 1518 | 1518 | 1518
  | Sen | 1445 | 2658 | 1871 | 1916 | 2779 | 1141 | 2159
  | N_all | 91 | 169 | 50 | 114 | 157 | 138 | 145
RoMa | RT | 1.939 s | 1.704 s | 1.548 s | 2.492 s | 1.875 s | 1.640 s | 1.801 s
  | Ref | 5796 | 5796 | 5796 | 5796 | 5796 | 5796 | 5796
  | Sen | 6847 | 5369 | 5083 | 5341 | 6422 | 5967 | 5718
  | N_all | 1817 | 1394 | 1023 | 1115 | 1317 | 531 | 641
MapMatcher | RT | 1.079 s | 1.094 s | 1.094 s | 1.063 s | 1.092 s | 1.062 s | 1.067 s
  | Ref | 3559 | 3559 | 3559 | 3559 | 3559 | 3559 | 3559
  | Sen | 2425 | 4144 | 3876 | 2959 | 4034 | 2893 | 3136
  | N_all | 2418 | 501 | 303 | 630 | 252 | 2400 | 1829
Table 4. Results of different feature extraction and match algorithms.

Extractor | Matcher | N_all | ROCC
SIFT2 | NN + ratio | 1532 | 0.103
SIFT2 | MapMatcher | 1231 | 0.132
SuperPoint | NN + mutual | 2431 | 0.079
SuperPoint | SuperGlue | 3101 | 0.197
SuperPoint | MapMatcher | 2873 | 0.213
MapExtraction | SuperGlue | 2731 | 0.208
MapExtraction | MapMatcher | 2313 | 0.310
Table 5. Results of registration.

Models | Metrics | a–b | a–c | a–d | a–e | a–f | a–g | a–h
SIFT2 | RT | 2.067 s | 2.108 s | 2.183 s | 2.019 s | 2.181 s | 2.053 s | 2.047 s
  | MI | 0.0366 | 0.0433 | 0.0364 | 0.0345 | 0.0391 | 0.0359 | 0.0337
  | ROCC | 0.029 | 0.160 | 0.147 | 0.114 | 0.096 | 0.036 | 0.175
  | RMSE | 1.369 | 0.120 | 0.951 | 2.756 | 1.376 | 1.765 | 0.753
SuperGlue | RT | 5.765 s | 4.122 s | 3.199 s | 3.150 s | 3.453 s | 4.017 s | 3.240 s
  | MI | 0.0764 | 0.0401 | 0.0339 | 0.0472 | 0.0467 | 0.0798 | 0.0614
  | ROCC | 0.079 | 0.325 | 0.230 | 0.112 | 0.384 | 0.197 | 0.155
  | RMSE | 0.030 | 0.105 | 2.151 | 3.154 | 2.025 | 0.372 | 0.236
DeDoDe | RT | 3.157 s | 3.544 s | 3.828 s | 2.691 s | 2.898 s | 3.606 s | 3.070 s
  | MI | 0.0270 | 0.0206 | 0.0273 | 0.0253 | 0.0210 | 0.0414 | 0.0372
  | ROCC | 0.286 | 0.426 | 0.295 | 0.294 | 0.387 | 0.207 | 0.396
  | RMSE | 1.126 | 1.185 | 0.850 | 0.718 | 1.110 | 1.134 | 1.837
RoMa | RT | 4.411 s | 4.399 s | 2.715 s | 4.276 s | 3.639 s | 3.507 s | 4.135 s
  | MI | 0.0286 | 0.0256 | 0.0261 | 0.0276 | 0.0270 | 0.0490 | 0.0350
  | ROCC | 0.089 | 0.035 | 0.078 | 0.079 | 0.060 | 0.030 | 0.088
  | RMSE | 0.772 | 0.174 | 1.644 | 1.911 | 0.441 | 0.137 | 0.831
ours | RT | 2.025 s | 1.892 s | 1.442 s | 1.640 s | 1.325 s | 1.530 s | 1.363 s
  | MI | 0.1030 | 0.0503 | 0.0571 | 0.0498 | 0.0618 | 0.0912 | 0.0720
  | ROCC | 0.487 | 0.244 | 0.222 | 0.372 | 0.515 | 0.278 | 0.392
  | RMSE | 0.043 | 0.024 | 0.032 | 0.156 | 0.080 | 0.200 | 0.089
