3.2. Definition of a Graph
The graph neural network (GNN) is a type of deep learning model for processing graph-structured data. It gradually updates node representations by transmitting and aggregating messages along the nodes and edges, thereby capturing the structure and relationships within the graph data. A graph is typically defined as $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_N\}$ represents a set of $N$ nodes and $E \subseteq V \times V$ represents a set of edges connecting the nodes. Each node $v_i$ has a feature representation denoted as $x_i$, and each edge $(v_i, v_j)$ has a feature representation denoted as $e_{ij}$. The graph in the network model of this paper is a set of graphs at different scales connected from top to bottom, which can be described as
$$\mathcal{G} = \big\{\, G^{(l)} = \big(V^{(l)}, E^{(l)}\big) \,\big\}, \quad l = 2, 3, 4. \tag{1}$$
The variable $G^{(l)}$ represents the graph structure established by the nodes carrying the $l$-th-scale features, where $l = 2, 3, 4$ (experimental verification showed that camera relocation in this paper performs best when the number of cascade layers is 3). Here, $V^{(l)}$ represents a set of nodes, and $E^{(l)}$ represents a set of edges connecting the nodes.
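The multi-scale graph collection above can be sketched as a simple data structure. The class name, node counts, and feature dimensions below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

# Minimal sketch of the cascaded multi-scale graph: each scale holds its own
# node-feature matrix and edge list. Dimensions here are illustrative.
class ScaleGraph:
    def __init__(self, node_feats, edges):
        self.node_feats = node_feats      # (N, d) array, one row per node
        self.edges = edges                # list of (i, j) node-index pairs

# Three cascade levels (l = 2, 3, 4 in the paper), coarse to fine.
rng = np.random.default_rng(0)
cascade = [
    ScaleGraph(rng.standard_normal((4, 8 * 2**l)), [(0, 1), (1, 2), (2, 3)])
    for l in range(3)
]
print([g.node_feats.shape for g in cascade])  # → [(4, 8), (4, 16), (4, 32)]
```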
3.3. Construction of the Graph
In the context of camera relocation, images can be regarded as nodes in a graph, and the relationships between viewpoints can be represented as edges in the graph. By leveraging these relationships, graph neural networks can learn and infer the position and orientation information of the camera.
Node Features: The node features are obtained by extracting multi-scale features from the query image and its $K$ most similar retrieved images using a feature pyramid. This yields a sequence of node features denoted as $X^{(l)} = \{x_i^{(l)}\}_{i=1}^{K+1}$.
Here, $x_i^{(l)} \in \mathbb{R}^{d_l}$ represents the node feature of scale $l$ with dimension $d_l$.
These features are passed as input to the graph neural network.
Edge Features: Image nodes that are visually similar and spatially adjacent, or that share specific geometric relationships, are connected. This construction not only captures the visual consistency between images but also encodes their spatial adjacency or geometric relationships. The edge features of the graph are computed from the features of the two terminal nodes that each edge connects, and the computed edge features are then mapped to a lower-dimensional space through a learnable linear mapping function $f_e$. This can be represented as $e_{ij} = f_e(x_i, x_j)$.
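A hedged sketch of this edge-feature construction follows: the two endpoint node features are concatenated and passed through a linear map. The matrix `W_e` is an illustrative stand-in for the paper's learnable linear mapping function, and all dimensions are assumptions:

```python
import numpy as np

# Sketch: build low-dimensional edge features from pairs of node features.
rng = np.random.default_rng(1)
d, d_e = 16, 8
nodes = rng.standard_normal((5, d))          # K+1 node features (K = 4 here)
edges = [(0, 1), (0, 2), (1, 3), (2, 4)]     # visually/spatially linked pairs
W_e = rng.standard_normal((2 * d, d_e)) / np.sqrt(2 * d)  # learnable map

edge_feats = np.stack(
    [np.concatenate([nodes[i], nodes[j]]) @ W_e for i, j in edges]
)
print(edge_feats.shape)  # one low-dimensional feature per edge → (4, 8)
```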
3.4. Message Transmission
The key idea of graph neural networks is to enable each node to comprehensively consider information from its neighboring nodes through message passing and aggregation and update its representation. In the context of camera relocation tasks, this implies leveraging the topological relationships between images and spatial geometric constraints to enhance relocation accuracy and robustness.
Upon constructing the graph $G$, message passing is performed to facilitate the exchange of information between different nodes. Message passing updates the representation of a node by aggregating information from its neighboring nodes. This aggregation smooths out feature differences between nodes, thereby reducing the impact of noise and outliers on individual node representations. Even if a node is disturbed by noise, the normal information of its neighboring nodes still plays a corrective role during aggregation, making the final node representation more robust.
Initially, edge features are updated by combining the information of the current edge with that of its two endpoint nodes, expressed as
$$e_{ij}' = \phi_e\big(\big[e_{ij} \,\|\, x_i \,\|\, x_j\big]\big). \tag{2}$$
Here, $\phi_e$ represents a two-layer MLP with ReLU activation, and $[\cdot \,\|\, \cdot]$ denotes concatenation.
This operation enables the edge features to capture the correlations between images and spatial constraints, further optimizing the performance of camera relocation.
After updating the edge features, messages are generated for each node:
$$m_{ij} = \phi_m\big(\big[e_{ij}' \,\|\, x_j\big]\big). \tag{3}$$
Equation (3) represents the transmission of information from node $v_j$ to node $v_i$, where $\phi_m$ is a two-layer MLP with ReLU activation.
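The two MLP steps above can be illustrated as follows. All weight matrices, hidden widths, and dimensions are illustrative stand-ins, not the paper's trained parameters:

```python
import numpy as np

def mlp2(x, W1, b1, W2, b2):
    """Two-layer MLP with ReLU, as used for the edge and message updates."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

rng = np.random.default_rng(2)
d, d_e = 8, 8
x_i, x_j = rng.standard_normal(d), rng.standard_normal(d)
e_ij = rng.standard_normal(d_e)

# Edge update: combine the current edge with both endpoint node features.
p = d_e + 2 * d
We1, be1 = rng.standard_normal((p, 16)) * 0.1, np.zeros(16)
We2, be2 = rng.standard_normal((16, d_e)) * 0.1, np.zeros(d_e)
e_new = mlp2(np.concatenate([e_ij, x_i, x_j]), We1, be1, We2, be2)

# Message from node j to node i along the updated edge.
q = d_e + d
Wm1, bm1 = rng.standard_normal((q, 16)) * 0.1, np.zeros(16)
Wm2, bm2 = rng.standard_normal((16, d)) * 0.1, np.zeros(d)
m_ij = mlp2(np.concatenate([e_new, x_j]), Wm1, bm1, Wm2, bm2)
print(e_new.shape, m_ij.shape)  # → (8,) (8,)
```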
In camera relocation, it is necessary to analyze multiple images in the scene to estimate the position and orientation of the camera. Due to the potentially significant differences between each image, it is important to consider the significance of information between different nodes in the message-passing process to better aggregate information.
Furthermore, when computing messages between nodes, it is necessary to appropriately adjust the content of the messages passed during the message-passing process, as some nodes may contain irrelevant information such as moving objects or projected shadows. This adjustment is essential for better capturing information about significant geometric objects and thereby improving the accuracy of camera relocation.
In this process, attention mechanisms are introduced to adjust the weights of messages passing between nodes, thereby adaptively adjusting the importance of information between nodes. In this way, spatial relationships and similarities between images are better utilized to improve the accuracy and robustness of the relocation results.
The message $m_{ij}$ is represented as a sequence:
$$m_{ij} = \big(m_{ij}^1, m_{ij}^2, \ldots, m_{ij}^S\big).$$
Here, $S$ represents the length of the message sequence. Each element $m_{ij}^s$ is represented as a one-dimensional feature vector $z_s$ through a learned low-dimensional embedding $W_{\downarrow}$:
$$z_s = W_{\downarrow}\, m_{ij}^s.$$
Subsequently, the attention weight $\alpha_{st}$ of each element $z_s$ concerning the element $z_t$ is calculated:
$$\alpha_{st} = \operatorname{softmax}_t\big(z_s^{\top} z_t\big).$$
Here, $z_s$ represents the learned low-dimensional embedding vector, and $\alpha$ indicates the attention matrix. Finally, the attention weights are multiplied by the compressed feature vectors, and another learned mapping $W_{\uparrow}$ is used to restore them to their original dimension. This process allows the model to focus on relevant parts of the input and generate the final output:
$$\tilde{m}_{ij}^s = W_{\uparrow} \sum_{t=1}^{S} \alpha_{st}\, z_t.$$
Through this attention mechanism, the model can more accurately encode message sequences and adaptively focus on information relevant to camera relocation, thereby enhancing model performance.
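A minimal sketch of this message-sequence attention follows, assuming a down-projection, pairwise dot-product attention among the compressed elements, and an up-projection back to the original dimension; the matrix names and scaling are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Sketch: embed each message element into a low-dimensional space, compute
# pairwise attention weights, reweight, and restore the original dimension.
rng = np.random.default_rng(3)
S, d, d_low = 6, 8, 4                       # sequence length, dims (assumed)
msg = rng.standard_normal((S, d))           # message as a sequence of elements
W_down = rng.standard_normal((d, d_low)) * 0.3   # learned compression
W_up = rng.standard_normal((d_low, d)) * 0.3     # learned restoration

z = msg @ W_down                            # compressed feature vectors
attn = softmax(z @ z.T / np.sqrt(d_low))    # weight of element s w.r.t. t
out = (attn @ z) @ W_up                     # restored to original dimension
print(out.shape)  # → (6, 8)
```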
Upon receiving messages from all neighboring nodes, the current node's aggregate message is calculated by taking the average of these messages, so as to comprehensively utilize the information from surrounding nodes and obtain a more complete and accurate representation:
$$m_i = \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} m_{ij},$$
where $\mathcal{N}(i)$ denotes the set of neighbors of node $v_i$.
Finally, to better exploit the node's own features, the current node's features are concatenated with the aggregated message, combining the node's information with the summarized information of its surroundings to form a richer representation. This can be expressed as
$$x_i' = \phi_u\big(\big[x_i \,\|\, m_i\big]\big).$$
Here, $\phi_u$ represents a two-layer MLP with ReLU activation.
Messages are transmitted $R$ times throughout the entire graph, with the weights shared between iterations.
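One round of the aggregation-and-update step described above can be sketched as follows; the per-edge messages, edge list, and MLP weights are illustrative stand-ins:

```python
import numpy as np

# Sketch of one message-passing round: average incoming messages, then update
# each node from [own features || aggregated message] via a two-layer MLP.
rng = np.random.default_rng(4)
d = 8
nodes = rng.standard_normal((4, d))
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]   # (src, dst) pairs
msgs = {(s, t): rng.standard_normal(d) for s, t in edges}  # m_{src->dst}

W1, b1 = rng.standard_normal((2 * d, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.standard_normal((16, d)) * 0.1, np.zeros(d)

updated = nodes.copy()
for i in range(len(nodes)):
    incoming = [m for (src, dst), m in msgs.items() if dst == i]
    agg = np.mean(incoming, axis=0)                  # average of messages
    h = np.concatenate([nodes[i], agg])              # node || aggregate
    updated[i] = np.maximum(h @ W1 + b1, 0.0) @ W2 + b2
print(updated.shape)  # → (4, 8)
```

In the paper this round is repeated $R$ times with shared weights; the loop above would simply be wrapped in an outer iteration that feeds `updated` back in as `nodes`.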
By combining message passing and attention mechanisms, cascaded graph neural networks can fully utilize the advantages of both. The messaging mechanism provides rich contextual information, while the attention mechanism can effectively select key information and suppress irrelevant information. This combination makes cascaded graph neural networks more robust and accurate in handling complex scenes and noisy data.
3.5. Network Cascading
To enhance the accuracy and precision of image matching in camera relocation tasks, and to more accurately estimate the position and orientation of the camera, this study introduces a network cascade technique. Multiple neural network models are connected in a specific sequence to form a cascade structure, within which each sub-network can exchange information and interact with the others.
The approach described in this paper involves merging the node information from the previous graph into a message node. It then utilizes an attention mechanism to propagate the fused image features and spatial information such as camera poses to other nodes in the next cascaded graph. Additionally, these graphs are connected in a top–down manner, effectively capturing both local and global image features to enhance the accuracy of camera relocation in scenes with large-scale or severe viewpoint changes.
In most cases, multimodal data composed of multiple images and spatial information often exhibit high dimensionality and complex interrelationships. To better model this data, it is possible to embed the images and spatial information into a low-dimensional vector space and perform information propagation and aggregation in this space.
In the cascaded graph neural network, each cascaded graph $G^{(l)}$ contains multiple images and their spatial information. To accomplish the camera relocation task, it is crucial to transmit information between the cascaded graphs and to perform multi-level, multi-scale modeling and feature extraction. All node information from the previous cascaded graph can be fused into a message node to obtain a feature vector representing that graph. As shown in Figure 2 below, this enables the transmission of information from the previous cascaded graph $G^{(l-1)}$ to the next cascaded graph $G^{(l)}$, thereby facilitating multi-level, multi-scale modeling and feature extraction and providing richer contextual information for the camera relocation task.
Specifically, given a cascade graph with $n$ nodes, where each node's embedding vector is denoted as $h_i$, the message node $g$ can be defined as follows:
$$g = \operatorname{MeanPool}\big(\{h_i\}_{i=1}^{n}\big) = \frac{1}{n} \sum_{i=1}^{n} h_i.$$
Here, $\operatorname{MeanPool}$ represents the mean pooling operation, which averages the information of all nodes in the previous cascaded graph to obtain the message node $g$.
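The message node is simply mean pooling over the previous cascade graph's node embeddings, which can be illustrated in one line (dimensions are assumed):

```python
import numpy as np

# Fuse all node embeddings of the previous cascade graph into one message node.
rng = np.random.default_rng(5)
prev_nodes = rng.standard_normal((6, 8))   # n = 6 node embeddings, dim 8
message_node = prev_nodes.mean(axis=0)     # one compact summary vector
print(message_node.shape)  # → (8,)
```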
In this way, the feature vector representing the previous cascade graph is used to replace the feature vectors of all nodes in the previous cascade graph, simplifying the complex graph structure into a more compact representation. This simplification streamlines the structure of the cascade graph neural network and reduces the parameter count.
The cascaded graph neural network can efficiently handle large-scale data in camera relocation tasks and improve the model’s training and inference speed by reducing the parameter count and simplifying the network structure. Additionally, due to the smaller dimension of the feature vectors, the network’s memory footprint is correspondingly reduced, making the cascaded graph neural network more applicable to resource-constrained environments. Therefore, this cascaded graph neural network is better suited to meet the demands of camera relocation tasks, enabling more efficient and reliable camera pose estimation.
In cascaded graph neural networks, attention mechanisms allow the network to propagate and integrate fused image features and spatial information between different cascaded graphs. By weighing the importance of different nodes or features, it enables the network to selectively focus on information related to the current task. Especially when dealing with image data with complex spatial structures, the attention mechanism is particularly prominent, which can better capture the correlation between different regions and improve model performance. The attention mechanism can help networks better understand and utilize the correlation and importance between data, with certain generalized features, and can exhibit similar effects in different tasks and datasets.
Specifically, as shown in Figure 3 below, in the current cascade graph with $m$ nodes, each node's embedding vector is denoted as $x_i$, and the embedding vector of the message node from the previous cascade graph is denoted as $g$. First, a query vector $q$ is initialized for calculating the attention weights. The query vector $q$ is then dot-multiplied with the embedding vectors of all nodes in the previous cascade graph and passed through a softmax function to obtain the attention weights $\alpha_i$:
$$\alpha_i = \frac{\exp\big(q^{\top} h_i\big)}{\sum_{k=1}^{n} \exp\big(q^{\top} h_k\big)}.$$
Here, $q$ represents the query vector in the current cascade graph, $h_i$ represents the embedding vector of the $i$-th node in the previous cascade graph, and $n$ represents the total number of nodes in the previous cascade graph.
Finally, the embedding vector of each node in the current cascade graph is multiplied element-wise with its corresponding attention weight. The resulting product, concatenated with the message node, is passed through a fully connected layer to obtain the updated node feature vector $\tilde{x}_i$:
$$\tilde{x}_i = \sigma\big(W \big[\alpha_i \odot x_i \,\|\, g\big] + b\big).$$
Here, $W$ represents the learnable weight matrix, $\alpha_i \odot x_i$ denotes the element-wise product of the node embedding vector $x_i$ and the attention weight $\alpha_i$, which is concatenated with the message node $g$ in the feature space, $\sigma$ represents the activation function, and $b$ represents the bias term.
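The cross-cascade attention update can be sketched as follows, assuming one scalar weight per current-graph node computed from the query vector; the activation choice (`tanh`) and all parameter values are illustrative assumptions:

```python
import numpy as np

# Sketch: score nodes with a query vector, softmax into attention weights,
# then update each node from [weighted embedding || message node] via a
# fully connected layer with an activation.
rng = np.random.default_rng(6)
m, d = 5, 8
x = rng.standard_normal((m, d))            # current-graph node embeddings
g = rng.standard_normal(d)                 # message node from previous graph
q = rng.standard_normal(d)                 # learned query vector (stand-in)

scores = x @ q
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                       # softmax attention weights

W = rng.standard_normal((2 * d, d)) * 0.1  # fully connected layer
b = np.zeros(d)
out = np.stack([
    np.tanh(np.concatenate([alpha[i] * x[i], g]) @ W + b) for i in range(m)
])
print(out.shape)  # → (5, 8)
```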
3.6. Pose Estimation
Cascade graph neural networks progressively extract and integrate information from graph data to enhance the understanding and representation capabilities of graph structure and node features in camera relocation tasks, thereby more accurately predicting the absolute position and orientation of the camera.
The approach in this paper uses a linear projection to map the updated edge features of the last cascade graph to the 6-degrees-of-freedom (6-DoF) absolute camera translation and rotation. Specifically, the edge feature $e_{ij}'$ is mapped to the absolute translation $t \in \mathbb{R}^3$, and the logarithm of the unit quaternion $q = (w, \mathbf{v})$ is obtained with a linear projection and an added offset, which can be represented as
$$\log q = \frac{\mathbf{v}}{\|\mathbf{v}\|} \cos^{-1}(w).$$
The symbols $w$ and $\mathbf{v}$, respectively, denote the real part and the imaginary part of a unit quaternion.
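The unit-quaternion logarithm and its inverse can be sketched as follows. This is a standard construction for $q = (w, \mathbf{v})$ with $|q| = 1$; the function names are illustrative:

```python
import numpy as np

def quat_log(q):
    """Logarithm of a unit quaternion q = (w, v): (v / |v|) * arccos(w)."""
    w, v = q[0], q[1:]
    n = np.linalg.norm(v)
    return np.zeros(3) if n < 1e-12 else (v / n) * np.arccos(np.clip(w, -1, 1))

def quat_exp(u):
    """Inverse map: recover the unit quaternion from its 3-vector logarithm."""
    t = np.linalg.norm(u)
    return np.array([1.0, 0, 0, 0]) if t < 1e-12 else np.concatenate(
        [[np.cos(t)], np.sin(t) * u / t]
    )

# Round trip: rotation of 0.6 rad about the axis (0, 0.6, 0.8).
q = np.array([np.cos(0.3), *(np.sin(0.3) * np.array([0.0, 0.6, 0.8]))])
print(np.allclose(quat_exp(quat_log(q)), q))  # → True
```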
To achieve absolute pose estimation for the camera, the approach in this study minimizes the pose error during model parameter learning. Specifically, an objective function is defined to minimize the translation error and rotation error for estimating the absolute pose of the camera:
$$L = \frac{1}{|E|} \sum_{(i,j) \in E} d\big(\hat{P}_{ij}, P_{ij}\big).$$
Here, $|E|$ represents the number of edges, and $d(\cdot, \cdot)$ represents the distance function between the predicted absolute camera pose $\hat{P}_{ij} = (\hat{R}, \hat{t})$ and the true absolute camera pose $P_{ij} = (R, t)$, which can be expressed as
$$d\big(\hat{P}, P\big) = \big\|\hat{R} - R\big\|_F + \big\|\hat{t} - t\big\|_2.$$
Here, $\|\cdot\|_F$ represents the Frobenius norm of a matrix, and $\|\cdot\|_2$ represents the 2-norm of a vector.
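The distance term can be sketched directly. The equal weighting of the rotation and translation terms is an assumption from the reconstructed formula above:

```python
import numpy as np

# Pose distance: Frobenius norm on rotation matrices plus 2-norm on
# translations (equal weighting assumed).
def pose_distance(R_pred, t_pred, R_gt, t_gt):
    return (np.linalg.norm(R_pred - R_gt, ord='fro')
            + np.linalg.norm(t_pred - t_gt))

R = np.eye(3)
t = np.array([1.0, 2.0, 3.0])
print(pose_distance(R, t, R, t))  # identical poses → 0.0
```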
The framework flowchart of the pose retrieval model is shown in
Figure 4.