3.2. Definition of a Graph
The graph neural network (GNN) is a type of deep learning model for processing graph-structured data. It gradually updates node representations by transmitting and aggregating messages along the nodes and edges, thereby capturing the structure and relationships within the graph data. A graph is typically defined as $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_N\}$ represents a set of $N$ nodes and $E \subseteq V \times V$ represents a set of edges connecting the nodes. Each node $v_i$ has a feature representation denoted as $x_i$, and each edge $(v_i, v_j)$ has a feature representation denoted as $e_{ij}$. The graph in the network model of this paper is a set of graphs at different scales connected from top to bottom, which can be described as
$$\mathcal{G} = \big\{\, G^{(l)} = \big(V^{(l)}, E^{(l)}\big) \,\big\}, \quad l = 2, 3, 4. \tag{1}$$
The variable $G^{(l)}$ represents the graph structure established by the nodes carrying the $l$-th-scale features, where $l = 2, 3, 4$ (experimental verification showed that camera relocation in this paper performs best when the number of cascade layers is 3). Here, $V^{(l)}$ represents a set of nodes, and $E^{(l)}$ represents a set of edges connecting the nodes.
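The multi-scale graph collection above can be sketched as a simple data structure. The class name, node counts, and feature dimensions below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

# Minimal sketch of the cascaded multi-scale graph: each scale holds its own
# node-feature matrix and edge list. Dimensions here are illustrative.
class ScaleGraph:
    def __init__(self, node_feats, edges):
        self.node_feats = node_feats      # (N, d) array, one row per node
        self.edges = edges                # list of (i, j) node-index pairs

# Three cascade levels (l = 2, 3, 4 in the paper), coarse to fine.
rng = np.random.default_rng(0)
cascade = [
    ScaleGraph(rng.standard_normal((4, 8 * 2**l)), [(0, 1), (1, 2), (2, 3)])
    for l in range(3)
]
print([g.node_feats.shape for g in cascade])  # → [(4, 8), (4, 16), (4, 32)]
```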
3.3. Construction of the Graph
In the context of camera relocation, images can be regarded as nodes in a graph, and the relationships between viewpoints can be represented as edges in the graph. By leveraging these relationships, graph neural networks can learn and infer the position and orientation information of the camera.
Node Features: The node features are obtained by extracting multi-scale features from the query image and its $K$ most similar retrieved images using a feature pyramid. This yields a sequence of node features denoted as $X^{(l)} = \{x_i^{(l)}\}_{i=1}^{K+1}$.
Here, $x_i^{(l)} \in \mathbb{R}^{d_l}$ represents the node feature of scale $l$ with dimension $d_l$.
These features are passed as input to the graph neural network.
Edge Features: Image nodes that are visually similar and spatially adjacent, or that share specific geometric relationships, are connected. This construction not only captures the visual consistency between images but also encodes their spatial adjacency or geometric relationships. The edge features of the graph are computed from the features of the two terminal nodes that each edge connects, and the computed edge features are then mapped to a lower-dimensional space through a learnable linear mapping function $f_e$. This can be represented as $e_{ij} = f_e(x_i, x_j)$.
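A hedged sketch of this edge-feature construction follows: the two endpoint node features are concatenated and passed through a linear map. The matrix `W_e` is an illustrative stand-in for the paper's learnable linear mapping function, and all dimensions are assumptions:

```python
import numpy as np

# Sketch: build low-dimensional edge features from pairs of node features.
rng = np.random.default_rng(1)
d, d_e = 16, 8
nodes = rng.standard_normal((5, d))          # K+1 node features (K = 4 here)
edges = [(0, 1), (0, 2), (1, 3), (2, 4)]     # visually/spatially linked pairs
W_e = rng.standard_normal((2 * d, d_e)) / np.sqrt(2 * d)  # learnable map

edge_feats = np.stack(
    [np.concatenate([nodes[i], nodes[j]]) @ W_e for i, j in edges]
)
print(edge_feats.shape)  # one low-dimensional feature per edge → (4, 8)
```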
3.4. Message Transmission
The key idea of graph neural networks is to enable each node to comprehensively consider information from its neighboring nodes through message passing and aggregation and update its representation. In the context of camera relocation tasks, this implies leveraging the topological relationships between images and spatial geometric constraints to enhance relocation accuracy and robustness.
Upon constructing the graph $G$, message passing is performed to facilitate the exchange of information between different nodes. Message passing updates the representation of a node by aggregating information from its neighboring nodes. This aggregation smooths out feature differences between nodes, thereby reducing the impact of noise and outliers on individual node representations. Even if a node is disturbed by noise, the normal information of its neighboring nodes still plays a corrective role during aggregation, making the final node representation more robust.
Initially, edge features are updated by combining the information of the current edge with that of its two endpoint nodes, expressed as
$$e_{ij}' = \phi_e\big(\big[e_{ij} \,\|\, x_i \,\|\, x_j\big]\big). \tag{2}$$
Here, $\phi_e$ represents a two-layer MLP with ReLU activation, and $[\cdot \,\|\, \cdot]$ denotes concatenation.
This operation enables the edge features to capture the correlations between images and spatial constraints, further optimizing the performance of camera relocation.
After updating the edge features, messages are generated for each node:
$$m_{ij} = \phi_m\big(\big[e_{ij}' \,\|\, x_j\big]\big). \tag{3}$$
Equation (3) represents the transmission of information from node $v_j$ to node $v_i$, where $\phi_m$ is a two-layer MLP with ReLU activation.
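The two MLP steps above can be illustrated as follows. All weight matrices, hidden widths, and dimensions are illustrative stand-ins, not the paper's trained parameters:

```python
import numpy as np

def mlp2(x, W1, b1, W2, b2):
    """Two-layer MLP with ReLU, as used for the edge and message updates."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

rng = np.random.default_rng(2)
d, d_e = 8, 8
x_i, x_j = rng.standard_normal(d), rng.standard_normal(d)
e_ij = rng.standard_normal(d_e)

# Edge update: combine the current edge with both endpoint node features.
p = d_e + 2 * d
We1, be1 = rng.standard_normal((p, 16)) * 0.1, np.zeros(16)
We2, be2 = rng.standard_normal((16, d_e)) * 0.1, np.zeros(d_e)
e_new = mlp2(np.concatenate([e_ij, x_i, x_j]), We1, be1, We2, be2)

# Message from node j to node i along the updated edge.
q = d_e + d
Wm1, bm1 = rng.standard_normal((q, 16)) * 0.1, np.zeros(16)
Wm2, bm2 = rng.standard_normal((16, d)) * 0.1, np.zeros(d)
m_ij = mlp2(np.concatenate([e_new, x_j]), Wm1, bm1, Wm2, bm2)
print(e_new.shape, m_ij.shape)  # → (8,) (8,)
```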
In camera relocation, it is necessary to analyze multiple images in the scene to estimate the position and orientation of the camera. Due to the potentially significant differences between each image, it is important to consider the significance of information between different nodes in the message-passing process to better aggregate information.
Furthermore, when computing messages between nodes, it is necessary to appropriately adjust the content of the messages passed during the message-passing process, as some nodes may contain irrelevant information such as moving objects or projected shadows. This adjustment is essential for better capturing information about significant geometric objects and thereby improving the accuracy of camera relocation.
In this process, attention mechanisms are introduced to adjust the weights of messages passing between nodes, thereby adaptively adjusting the importance of information between nodes. In this way, spatial relationships and similarities between images are better utilized to improve the accuracy and robustness of the relocation results.
The message $m_{ij}$ is represented as a sequence:
$$m_{ij} = \big(m_{ij}^1, m_{ij}^2, \ldots, m_{ij}^S\big).$$
Here, $S$ represents the length of the message sequence. Each element $m_{ij}^s$ is represented as a one-dimensional feature vector $z_s$ through a learned low-dimensional embedding $W_{\downarrow}$:
$$z_s = W_{\downarrow}\, m_{ij}^s.$$
Subsequently, the attention weight $\alpha_{st}$ of each element $z_s$ concerning the element $z_t$ is calculated:
$$\alpha_{st} = \operatorname{softmax}_t\big(z_s^{\top} z_t\big).$$
Here, $z_s$ represents the learned low-dimensional embedding vector, and $\alpha$ indicates the attention matrix. Finally, the attention weights are multiplied by the compressed feature vectors, and another learned mapping $W_{\uparrow}$ is used to restore them to their original dimension. This process allows the model to focus on relevant parts of the input and generate the final output:
$$\tilde{m}_{ij}^s = W_{\uparrow} \sum_{t=1}^{S} \alpha_{st}\, z_t.$$
Through this attention mechanism, the model can more accurately encode message sequences and adaptively focus on information relevant to camera relocation, thereby enhancing model performance.
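A minimal sketch of this message-sequence attention follows, assuming a down-projection, pairwise dot-product attention among the compressed elements, and an up-projection back to the original dimension; the matrix names and scaling are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Sketch: embed each message element into a low-dimensional space, compute
# pairwise attention weights, reweight, and restore the original dimension.
rng = np.random.default_rng(3)
S, d, d_low = 6, 8, 4                       # sequence length, dims (assumed)
msg = rng.standard_normal((S, d))           # message as a sequence of elements
W_down = rng.standard_normal((d, d_low)) * 0.3   # learned compression
W_up = rng.standard_normal((d_low, d)) * 0.3     # learned restoration

z = msg @ W_down                            # compressed feature vectors
attn = softmax(z @ z.T / np.sqrt(d_low))    # weight of element s w.r.t. t
out = (attn @ z) @ W_up                     # restored to original dimension
print(out.shape)  # → (6, 8)
```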
Upon receiving messages from all neighboring nodes, the current node's aggregate message is calculated by taking the average of these messages, so as to comprehensively utilize the information from surrounding nodes and obtain a more complete and accurate representation:
$$m_i = \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} m_{ij},$$
where $\mathcal{N}(i)$ denotes the set of neighbors of node $v_i$.
Finally, to better exploit the node's own features, the current node's features are concatenated with the aggregated message, combining the node's information with the summarized information of its surroundings to form a richer representation. This can be expressed as
$$x_i' = \phi_u\big(\big[x_i \,\|\, m_i\big]\big).$$
Here, $\phi_u$ represents a two-layer MLP with ReLU activation.
Messages are transmitted $R$ times throughout the entire graph, with the weights shared between iterations.
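One round of the aggregation-and-update step described above can be sketched as follows; the per-edge messages, edge list, and MLP weights are illustrative stand-ins:

```python
import numpy as np

# Sketch of one message-passing round: average incoming messages, then update
# each node from [own features || aggregated message] via a two-layer MLP.
rng = np.random.default_rng(4)
d = 8
nodes = rng.standard_normal((4, d))
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]   # (src, dst) pairs
msgs = {(s, t): rng.standard_normal(d) for s, t in edges}  # m_{src->dst}

W1, b1 = rng.standard_normal((2 * d, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.standard_normal((16, d)) * 0.1, np.zeros(d)

updated = nodes.copy()
for i in range(len(nodes)):
    incoming = [m for (src, dst), m in msgs.items() if dst == i]
    agg = np.mean(incoming, axis=0)                  # average of messages
    h = np.concatenate([nodes[i], agg])              # node || aggregate
    updated[i] = np.maximum(h @ W1 + b1, 0.0) @ W2 + b2
print(updated.shape)  # → (4, 8)
```

In the paper this round is repeated $R$ times with shared weights; the loop above would simply be wrapped in an outer iteration that feeds `updated` back in as `nodes`.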
By combining message passing and attention mechanisms, cascaded graph neural networks can fully utilize the advantages of both. The messaging mechanism provides rich contextual information, while the attention mechanism can effectively select key information and suppress irrelevant information. This combination makes cascaded graph neural networks more robust and accurate in handling complex scenes and noisy data.
3.5. Network Cascading
To enhance the accuracy and precision of image matching in camera relocation tasks, and to more accurately estimate the position and orientation of the camera, this study introduces a network cascade technique. Multiple neural network models are connected in a specific sequence to form a cascade structure, within which each sub-network can exchange information and interact with the others.
The approach described in this paper involves merging the node information from the previous graph into a message node. It then utilizes an attention mechanism to propagate the fused image features and spatial information such as camera poses to other nodes in the next cascaded graph. Additionally, these graphs are connected in a top–down manner, effectively capturing both local and global image features to enhance the accuracy of camera relocation in scenes with large-scale or severe viewpoint changes.
In most cases, multimodal data composed of multiple images and spatial information often exhibit high dimensionality and complex interrelationships. To better model this data, it is possible to embed the images and spatial information into a low-dimensional vector space and perform information propagation and aggregation in this space.
In the cascaded graph neural network, each cascaded graph $G^{(l)}$ contains multiple images and their spatial information. To accomplish the camera relocation task, it is crucial to transmit information between the cascaded graphs and to perform multi-level, multi-scale modeling and feature extraction. All node information from the previous cascaded graph can be fused into a message node to obtain a feature vector representing that graph. As shown in Figure 2 below, this enables the transmission of information from the previous cascaded graph $G^{(l-1)}$ to the next cascaded graph $G^{(l)}$, thereby facilitating multi-level, multi-scale modeling and feature extraction and providing richer contextual information for the camera relocation task.
Specifically, given a cascade graph with $n$ nodes, where each node's embedding vector is denoted as $h_i$, the message node $g$ can be defined as follows:
$$g = \operatorname{MeanPool}\big(\{h_i\}_{i=1}^{n}\big) = \frac{1}{n} \sum_{i=1}^{n} h_i.$$
Here, $\operatorname{MeanPool}$ represents the mean pooling operation, which averages the information of all nodes in the previous cascaded graph to obtain the message node $g$.
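The message node is simply mean pooling over the previous cascade graph's node embeddings, which can be illustrated in one line (dimensions are assumed):

```python
import numpy as np

# Fuse all node embeddings of the previous cascade graph into one message node.
rng = np.random.default_rng(5)
prev_nodes = rng.standard_normal((6, 8))   # n = 6 node embeddings, dim 8
message_node = prev_nodes.mean(axis=0)     # one compact summary vector
print(message_node.shape)  # → (8,)
```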
In this way, the feature vector representing the previous cascade graph is used to replace the feature vectors of all nodes in the previous cascade graph, simplifying the complex graph structure into a more compact representation. This simplification streamlines the structure of the cascade graph neural network and reduces the parameter count.
The cascaded graph neural network can efficiently handle large-scale data in camera relocation tasks and improve the model’s training and inference speed by reducing the parameter count and simplifying the network structure. Additionally, due to the smaller dimension of the feature vectors, the network’s memory footprint is correspondingly reduced, making the cascaded graph neural network more applicable to resource-constrained environments. Therefore, this cascaded graph neural network is better suited to meet the demands of camera relocation tasks, enabling more efficient and reliable camera pose estimation.
In cascaded graph neural networks, attention mechanisms allow the network to propagate and integrate fused image features and spatial information between different cascaded graphs. By weighing the importance of different nodes or features, it enables the network to selectively focus on information related to the current task. Especially when dealing with image data with complex spatial structures, the attention mechanism is particularly prominent, which can better capture the correlation between different regions and improve model performance. The attention mechanism can help networks better understand and utilize the correlation and importance between data, with certain generalized features, and can exhibit similar effects in different tasks and datasets.
Specifically, as shown in Figure 3 below, in the current cascade graph with $m$ nodes, each node's embedding vector is denoted as $x_i$, and the embedding vector of the message node from the previous cascade graph is denoted as $g$. First, a query vector $q$ is initialized for calculating the attention weights. The query vector $q$ is then dot-multiplied with the embedding vectors of all nodes in the previous cascade graph and passed through a softmax function to obtain the attention weights $\alpha_i$:
$$\alpha_i = \frac{\exp\big(q^{\top} h_i\big)}{\sum_{k=1}^{n} \exp\big(q^{\top} h_k\big)}.$$
Here, $q$ represents the query vector in the current cascade graph, $h_i$ represents the embedding vector of the $i$-th node in the previous cascade graph, and $n$ represents the total number of nodes in the previous cascade graph.
Finally, the embedding vector of each node in the current cascade graph is multiplied element-wise with its corresponding attention weight. The resulting product, concatenated with the message node, is passed through a fully connected layer to obtain the updated node feature vector $\tilde{x}_i$:
$$\tilde{x}_i = \sigma\big(W \big[\alpha_i \odot x_i \,\|\, g\big] + b\big).$$
Here, $W$ represents the learnable weight matrix, $\alpha_i \odot x_i$ denotes the element-wise product of the node embedding vector $x_i$ and the attention weight $\alpha_i$, which is concatenated with the message node $g$ in the feature space, $\sigma$ represents the activation function, and $b$ represents the bias term.
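The cross-cascade attention update can be sketched as follows, assuming one scalar weight per current-graph node computed from the query vector; the activation choice (`tanh`) and all parameter values are illustrative assumptions:

```python
import numpy as np

# Sketch: score nodes with a query vector, softmax into attention weights,
# then update each node from [weighted embedding || message node] via a
# fully connected layer with an activation.
rng = np.random.default_rng(6)
m, d = 5, 8
x = rng.standard_normal((m, d))            # current-graph node embeddings
g = rng.standard_normal(d)                 # message node from previous graph
q = rng.standard_normal(d)                 # learned query vector (stand-in)

scores = x @ q
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                       # softmax attention weights

W = rng.standard_normal((2 * d, d)) * 0.1  # fully connected layer
b = np.zeros(d)
out = np.stack([
    np.tanh(np.concatenate([alpha[i] * x[i], g]) @ W + b) for i in range(m)
])
print(out.shape)  # → (5, 8)
```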
3.6. Pose Estimation
Cascade graph neural networks progressively extract and integrate information from graph data to enhance the understanding and representation capabilities of graph structure and node features in camera relocation tasks, thereby more accurately predicting the absolute position and orientation of the camera.
The approach in this paper uses a linear projection to map the updated edge features of the last cascade graph to the 6-degrees-of-freedom (6-DoF) absolute camera translation and rotation. Specifically, the edge feature $e_{ij}'$ is mapped to the absolute translation $t \in \mathbb{R}^3$, and the logarithm of the unit quaternion $q = (w, \mathbf{v})$ is obtained with a linear projection and an added offset, which can be represented as
$$\log q = \frac{\mathbf{v}}{\|\mathbf{v}\|} \cos^{-1}(w).$$
The symbols $w$ and $\mathbf{v}$, respectively, denote the real part and the imaginary part of a unit quaternion.
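The unit-quaternion logarithm and its inverse can be sketched as follows. This is a standard construction for $q = (w, \mathbf{v})$ with $|q| = 1$; the function names are illustrative:

```python
import numpy as np

def quat_log(q):
    """Logarithm of a unit quaternion q = (w, v): (v / |v|) * arccos(w)."""
    w, v = q[0], q[1:]
    n = np.linalg.norm(v)
    return np.zeros(3) if n < 1e-12 else (v / n) * np.arccos(np.clip(w, -1, 1))

def quat_exp(u):
    """Inverse map: recover the unit quaternion from its 3-vector logarithm."""
    t = np.linalg.norm(u)
    return np.array([1.0, 0, 0, 0]) if t < 1e-12 else np.concatenate(
        [[np.cos(t)], np.sin(t) * u / t]
    )

# Round trip: rotation of 0.6 rad about the axis (0, 0.6, 0.8).
q = np.array([np.cos(0.3), *(np.sin(0.3) * np.array([0.0, 0.6, 0.8]))])
print(np.allclose(quat_exp(quat_log(q)), q))  # → True
```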
To achieve absolute pose estimation for the camera, the approach in this study minimizes the pose error during model parameter learning. Specifically, an objective function is defined to minimize the translation error and rotation error for estimating the absolute pose of the camera:
$$L = \frac{1}{|E|} \sum_{(i,j) \in E} d\big(\hat{P}_{ij}, P_{ij}\big).$$
Here, $|E|$ represents the number of edges, and $d(\cdot, \cdot)$ represents the distance function between the predicted absolute camera pose $\hat{P}_{ij} = (\hat{R}, \hat{t})$ and the true absolute camera pose $P_{ij} = (R, t)$, which can be expressed as
$$d\big(\hat{P}, P\big) = \big\|\hat{R} - R\big\|_F + \big\|\hat{t} - t\big\|_2.$$
Here, $\|\cdot\|_F$ represents the Frobenius norm of a matrix, and $\|\cdot\|_2$ represents the 2-norm of a vector.
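The distance term can be sketched directly. The equal weighting of the rotation and translation terms is an assumption from the reconstructed formula above:

```python
import numpy as np

# Pose distance: Frobenius norm on rotation matrices plus 2-norm on
# translations (equal weighting assumed).
def pose_distance(R_pred, t_pred, R_gt, t_gt):
    return (np.linalg.norm(R_pred - R_gt, ord='fro')
            + np.linalg.norm(t_pred - t_gt))

R = np.eye(3)
t = np.array([1.0, 2.0, 3.0])
print(pose_distance(R, t, R, t))  # identical poses → 0.0
```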
The framework flowchart of the pose retrieval model is shown in
Figure 4.