Article

A Scene Graph Similarity-Based Remote Sensing Image Retrieval Algorithm

1 School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
2 Service Center of Natural Resource Affairs of Liaoning Province, Shenyang 110001, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8535; https://doi.org/10.3390/app14188535
Submission received: 15 August 2024 / Revised: 12 September 2024 / Accepted: 18 September 2024 / Published: 22 September 2024
(This article belongs to the Special Issue Deep Learning for Graph Management and Analytics)

Abstract

With the rapid development of remote sensing image data, the efficient retrieval of target images of interest has become an important issue in various applications, including computer vision and remote sensing. This research addressed the low-accuracy problem in traditional content-based image retrieval algorithms, which largely rely on comparing entire image features without capturing sufficient semantic information. We proposed a scene graph similarity-based remote sensing image retrieval algorithm. Firstly, a one-shot object detection algorithm based on Siamese networks was designed for remote sensing images and tailored to objects of unknown classes in the query image. Secondly, a scene graph construction algorithm was developed based on the objects and their attributes and spatial relationships. Several construction strategies were designed for different relationships, including full, random, nearest-neighbor, star, and ring connections. Thirdly, a graph feature extraction network based on edge features was established to make full use of edge information during scene graph feature extraction. Fourthly, a neural tensor network-based similarity calculation algorithm was designed for graph feature vectors to obtain image retrieval results. Fifthly, a dataset named remote sensing images with scene graphs (RSSG) was built for testing, which contained 929 remote sensing images with their corresponding scene graphs generated by the developed construction strategies. Finally, in performance comparison experiments with the remote sensing image retrieval algorithms AMFMN, MiLaN, and AHCL, Precision@1 improved by 10%, 7.2%, and 5.2%; Precision@5 improved by 3%, 5%, and 1.7%; and Precision@10 improved by 1.7%, 3%, and 0.6%. Recall@1 improved by 2.5%, 4.3%, and 1.3%; Recall@5 improved by 3.7%, 6.2%, and 2.1%; and Recall@10 improved by 4.4%, 7.7%, and 1.6%.

1. Introduction

With the continuous improvement in spatial, temporal, and spectral resolution, remote sensing images play increasingly significant roles in various application fields such as national economic development and environmental resource management [1]. For instance, in marine management, corresponding remote sensing images often need to be retrieved from a vast remote sensing image database based on a query image to determine the exact location, change patterns, and complete details of maritime activities [2]. In the past, this task was performed manually, which required a significant amount of manpower and resources, resulting in extremely low efficiency and presenting considerable challenges for marine management and resource protection. In recent years, with the continuous development of deep learning techniques and their significant success in image-processing applications, it has become possible to use these techniques to address challenges in remote sensing image retrieval.
Remote sensing image retrieval aims to find images within a remote sensing image database that are similar to query images. Essentially, it is an image similarity computation problem, which is a specific application of general image retrieval tasks combined with the characteristics of remote sensing images [3]. Image retrieval primarily includes two approaches: content-based image retrieval (CBIR) [4] and semantic-based image retrieval (SBIR) [5]. CBIR algorithms calculate similarity from the deep features of the entire image and lack the ability to capture semantic information within the image, resulting in a low retrieval accuracy. SBIR, in contrast, considers not only the content features of the image but also higher-level semantic information, such as the attributes and spatial relationships of the target objects. The semantic-based approach to remote sensing image retrieval primarily extracts semantic information from remote sensing images and then uses it to calculate the similarity of remote sensing images to accurately obtain the query results. However, existing semantic-based remote sensing image retrieval methods primarily perform image retrieval through the attributes of the objects within images, neglecting the relationships among objects in remote sensing images. This omission prevents the model from fully utilizing the semantic information within images.
The CBIR and SBIR retrieval processes are illustrated by the following example. In Figure 1, (a) is the query image Iq, containing three target detection results (storage tanks) arranged in a straight line; I1, I2, I3 ∈ Gallery (a remote sensing image database), and Iq has image features similar to those of I1, I2, and I3. The red boxes denote target detection results, and the green lines denote spatial relationships. According to the CBIR method of feature matching, the retrieval results might include I1, I2, and I3. However, based on the existing SBIR methods, which only consider the category and number of objects without considering spatial relationships, I2 and I3 might be considered as retrieval results because both of them contain three oil tanks. In reality, although I2 has three oil tank objects, the spatial arrangement of the tanks is triangular, which does not match that of Iq. I3 has three linearly arranged oil tank objects, making its semantic features more consistent with those of Iq.
Generally, the CBIR approach has a low retrieval accuracy due to a lack of semantic information, and typically the SBIR approach ignores the relationships among object targets [6]. Therefore, this research employed scene graphs, which can better represent both the attributes and relationships of targets to capture more semantic information to further improve retrieval accuracy.
Compared with ordinary images, remote sensing images have more significant characteristics. Firstly, they have a top-down view, where the relationships among objects are mainly reflected in spatial positions on a plane. Secondly, they have large fields of view, containing several objects with complex spatial relationships. Finally, they have complex backgrounds, including numerous unknown objects. Therefore, this research adopted a semantic-based approach to develop a scene-graph-based remote sensing image retrieval algorithm. The algorithm simulated the human behavior of comparing key targets to retrieve images, simplifying remote sensing image retrieval to match the multiple categories and spatial relationships of the targets in image scenes, effectively using the semantic information in the images. In this work, the semantic information of targets was first obtained using a one-shot object detection algorithm [7] for the remote sensing images. Then, scene graphs [8] were constructed based on this semantic information. Finally, the similarity of scene graphs was calculated and ranked to obtain the retrieval results. The contributions of this paper include the following:
  • A Siamese-network-based one-shot object detection algorithm was designed. A Siamese network model was proposed considering the issue of potentially unknown categories of objects in query images. The training objective of the model was to ensure that the distance in the feature representation space between any two input samples reflected their image feature similarity. In addition, to deal with the problem of complex backgrounds in a remote sensing image, a feature extraction network based on asymmetric convolution was developed in each feature extraction branch of the Siamese network to enhance feature extraction capabilities. To address the problem of missing small targets in remote sensing images, a feature pyramid structure was designed to improve detection capabilities for small targets. Finally, a candidate region generation network based on attention mechanisms was proposed to solve the problem of low positional regression accuracy in one-shot object detection tasks.
  • A remote sensing image retrieval method was developed using scene graph similarity. By constructing scene graphs and performing image matching and retrieval based on scene graph similarity, this method more fully utilized the high-level semantic information in images, including the categories and spatial relationships of objects, to improve retrieval accuracy and efficiency, which fundamentally aligned with human behavior in image retrieval.
  • Several scene graph construction strategies for target spatial relationships have been developed to meet various retrieval needs, including fully connected, randomly connected, nearest neighbor, star-shaped, and circular connections. These strategies can flexibly adapt to various data features and retrieval requirements, providing a variety of options for scene-graph-based remote sensing image retrieval.
  • The RSSG dataset, which contains a corresponding scene graph for each image, was created by using various scene graph construction strategies. To the best of our knowledge, RSSG is the first remote sensing image dataset with scene graph annotations in the academic community. The experimental results of the performance evaluation using RSSG showed that selecting appropriate construction strategies according to different scene characteristics obviously enhanced retrieval performance.
The remainder of the paper is organized as follows. Section 1 introduces the background, significance, content, main work, and organizational structure of the research. Section 2 discusses the progress of related works. Section 3 describes the developed remote sensing image retrieval algorithm based on scene graph similarity, consisting of three sub-algorithms: a one-shot object detection algorithm, a scene graph construction algorithm using detection results, and a graph neural network-based similarity calculation algorithm. Section 4 presents the experimental results, demonstrating that the retrieval accuracy of the algorithm developed in this research is superior to similar algorithms. Finally, Section 5 provides a summary and outlook.

2. Related Work

Remote sensing image retrieval has long been a focus of remote sensing research. In early research works, many methods primarily used annotated labels to find similar images. However, since the labels of images could not fully express their contents, this approach often gave imprecise retrieval results. In contrast, content-based and semantic-based image retrieval methods have significantly improved retrieval performance.
A content-based remote sensing image retrieval (CBRSIR) approach generally starts with extracting features from remote sensing images before calculating the retrieval results based on visual feature similarity. In the feature extraction stage, the following two main methods are available: manual feature extraction and deep-learning-based feature extraction. Regarding manual feature extraction CBIR methods, Bretschneider et al. [9] used the spectral features of remote sensing images, Lowe [10] employed a scale-invariant feature transform (SIFT) to describe the boundaries or contour information of geographic objects, and Scott et al. [11] utilized shape features to extract the binary bitmap features of remote sensing images for image retrieval. Manual feature extraction is a time-consuming and highly specialized method, and more critically, it may overlook less intuitive or hidden image features. In recent years, deep-learning-based feature extraction techniques, represented by convolutional neural networks (CNNs), have presented an excellent performance in image processing. The intermediate outputs of CNN models encompass both the low-level details and high-level semantic information of images, providing a comprehensive visual content extraction. Generally, deep-learning-based feature extraction models consist of convolutional layers for local feature extraction, pooling layers for feature fusion, and fully connected layers reflecting global image features. Tong et al. [12] investigated the performance of the features extracted by common CNNs and various aggregation methods for remote sensing image retrieval. Cao et al. [13] integrated local and global features into one single model to retrieve candidate sets using global descriptors and further reorder them using local feature matching. Ge et al. [14] developed a remote sensing image retrieval method using a multilayer feature fusion mechanism. They first extracted middle-layer and high-layer features from remote sensing images with various input sizes using VGG16 and GoogLeNet, respectively. Then, they separately fused the middle-layer and high-layer features through feature transformation, feature addition, and feature aggregation, finally fusing the middle-layer and high-layer features based on adaptive similarity weights. Cao et al. [15] introduced a triplet network architecture for CNN feature extraction networks based on metric learning principles and triplet loss functions to effectively enhance the performance of remote sensing image retrieval. Ye et al. [16] addressed the problem of poor retrieval performance due to the inability of single image features to fully reflect global image information by developing a remote sensing image retrieval method that adaptively fused multiple features using regression CNNs. Specifically, they used ResNet and VGG16 to extract image features, calculated correlation coefficient matrices, predicted weights for each feature using regression CNNs, and obtained query classes. Finally, they performed a result ranking based on the distance between the queried image and query classes. Reference [3] also constructed a query image class composed of multiple images similar to the query image and strengthened query image features with the common features of images in the query image class; the experimental results demonstrated that this approach could improve retrieval accuracy.
In addition to CNNs, graph convolutional networks (GCNs) [17] are increasingly applied in image retrieval tasks since they effectively model local structures using CNNs and explore the mutual dependencies between graph nodes. Chaudhuri et al. [18] proposed a Siamese graph convolution network (SGCN) framework for the calculation of the similarity between two remote sensing images. The aim of the SGCN was to learn a latent representation space based on a region adjacency graph (RAG) for remote sensing images, in which similar remote sensing images were close and dissimilar images were far apart. Banerjee et al. [19] addressed issues such as the excessive aggregation of local area information in remote sensing images at a single node, which might lead to missing high-frequency information. They developed an edge attention mechanism to highlight the interactions among regions in remote sensing images to enhance the understanding of global image information and improve retrieval effectiveness.
A semantic-based remote sensing image retrieval (SBRSIR) approach primarily generates high-level semantic information such as object or scene descriptions for remote sensing images and then calculates the similarity based on this semantic information. However, the challenge of this approach lies in generating comprehensive semantic descriptions that accurately represent the details of remote sensing images. The authors in [20,21] focused on generating abstract titles for remote sensing images. However, due to the wide perspective of remote sensing images and the rich information of geographical features contained in these images, abstract titles might not adequately describe the detailed image information, making them unsuitable for precise image-to-image retrieval tasks in remote sensing. Generally, the detailed semantic information in images contains two main aspects: the objects present in the image and the relationships among these objects [22]. The academic community has employed techniques such as image classification, object detection, or visual relationship detection to solve semantic discovery problems [23]. Many solutions have been developed to model semantic information. Roopak [24] used knowledge graphs to structurally store the information about objects and their relationships in images. Other research works [25,26,27,28] have further explored the application of ontology technology to improve the efficiency and accuracy of retrieval. In recent years, there has been a growing interest among researchers in employing scene graphs to represent image semantics, since scene graphs have been found to be more helpful for image retrieval. For instance, Schroeder et al. [29] generated scene graphs based on the COCO-Stuff [30] dataset and trained a scene graph embedding network to obtain semantic-based structured image retrieval. Wang et al. [31] used visual scene graphs (VSGs) and textual scene graphs (TSGs) simultaneously to construct and extract the visual relationships between any two objects in images. This multimodal approach helped evaluate image similarity from both the perspectives of content and semantics. Yoon et al. [32] applied scene graphs to model high-level semantic information in images and used GCNs to calculate scene graph similarity as a measure of original image similarity. The abovementioned studies primarily focused on general images. As previously discussed, remote sensing images have more distinct characteristics than general images, which necessitate different approaches for constructing scene graphs. Firstly, compared to general images, remote sensing imaging covers a wider field of view, resulting in a multitude of geographical features. Secondly, remote sensing images are captured from a top-down perspective, emphasizing planar spatial relationships among objects, including their relative or absolute positions with respect to a reference point. In contrast, scene graphs for general images may also involve dynamic relationships in addition to spatial relationships. To date, the academic community has not yet developed comprehensive scene graph construction strategies specifically tailored for remote sensing images, especially methods to fully describe geographical relationships.

3. Remote Sensing Image Retrieval Algorithms Based on Scene Graph Similarity

3.1. Problem Description

In marine management scenarios, public reporting is an important approach for marine management departments in monitoring illegal activities such as land reclamation and sea filling [33]. The similarity between publicly reported images and those in remote sensing image databases could be determined using remote sensing image retrieval algorithms, allowing for the rapid retrieval of similar images in databases, thereby improving marine management efficiency [34]. Publicly reported images served as the query image Iq, and the remote sensing image database Gallery was considered as the set of images to be searched, as illustrated in Figure 2. The developed image retrieval algorithm determined the similarity of the publicly reported images with all images in the remote sensing database, and the results were obtained based on the similarity order [35].
Typically, traditional content-based image retrieval algorithms extract features from two input images and then calculate their similarity based on feature vectors. However, direct feature extraction from remote sensing images cannot acquire the semantic information in the images, which affects image retrieval accuracy. Therefore, this research proposed a remote sensing image retrieval algorithm based on scene graph similarity. The challenges of this approach include the following:
(1)
The basic task for extracting semantic information from images is object detection. Traditional object detection tasks possess predefined sets of target object categories to classify and locate objects within images. However, target object categories in publicly reported images, i.e., query image Iq, are often unpredictable, making it a typical one-shot object detection task.
(2)
A scene graph is a structured representation of the scene in an image to clearly express the objects, attributes, and relationships among the content in the scene. Remote sensing images are significantly different from ordinary images. Firstly, owing to their overhead perspective, the relationships among objects are mainly represented as spatial relationships on a plane. Secondly, the field of view is large; thus, the images include several targets and the complex spatial relationships among them. This poses challenges for constructing scene graphs for remote sensing images. To the best of our knowledge, there is currently a lack of methods for constructing scene graphs specifically for remote sensing images in the academic community.
To address challenge (1), this research presented a one-shot object detection algorithm for remote sensing images based on Siamese networks. For challenge (2), scene graph construction strategies were proposed that transform the category and positional information of the detected targets into nodes and their spatial relationships into edges. Building on this, a graph neural network-based scene graph similarity computation module was introduced, employing a neural tensor network to perform the similarity calculations. Finally, a scene graph similarity-based remote sensing image retrieval algorithm was formed, as illustrated in Figure 2:
In Figure 2, Iq denotes the query image, and Gallery denotes the collection of n remote sensing images to be retrieved, with Is ∈ Gallery. Firstly, a one-shot object detection algorithm was used to extract the target information from both the query image Iq and the image Is in Gallery. Then, using a scene graph construction algorithm, the target information from Iq and Is was transformed into node (L, x, y, w, h) and edge (d, θ) information within the scene graphs Gq and Gs. Finally, a scene graph similarity calculation algorithm was employed to calculate the similarity between the original two input remote sensing images. This process included a scene graph feature extraction module and a feature similarity calculation module. The scene graph feature extraction module utilized an edge-enhanced graph convolutional network (EGCN) to extract features, generating node features X and edge features E. The feature similarity calculation module integrated the node and edge features into graph features H and used a neural tensor network to compute feature similarity.

3.2. One-Shot Object Detection Algorithm for Remote Sensing Images Based on a Siamese Network

Based on the characteristics of remote sensing image data and marine management requirements, one-shot object detection should be implemented in the context of remote sensing image retrieval. This entails detecting objects in retrieval images belonging to the same category as the objects in the query image. In a one-shot object detection task, a new class has only one sample, and detection is performed using only base-class data without fine-tuning for new class objects. To address this problem, this research proposed a one-shot object detection algorithm based on a Siamese network [36] and asymmetric convolution networks, called SiamACDet. The overall process is presented in Figure 3.
The input consisted of a retrieval image and a query image. Firstly, both images were passed through a shared-weight feature extraction network, ACSENet, to obtain feature maps. Then, the images and their feature maps were input into an attention-based region proposal network (ARPN) [37] to generate candidate regions (proposals). Finally, the proposals, along with the feature maps of the images, were fed into a dual-branch detector to obtain the final category and position information, as shown in the red box in Figure 3.
(1)
ACSE residual module based on asymmetric convolution
Due to the complex background characteristics of remote sensing images, traditional residual networks present weaker feature extraction capabilities for these images. To enhance the ability of the network to extract features related to targets in remote sensing images, this research replaced standard convolutions in the residual module with asymmetric convolution (AC) structures [38]. Furthermore, considering that standard convolutional layers perform only convolution operations on individual channels without addressing inter-channel dependencies and neglect correlated information among channels, this research introduced a channel attention mechanism known as squeeze and excitation (SE) [39] into the AC residual module. The SE module dynamically adjusted the importance of each channel to more accurately capture crucial features within the input feature maps.
The asymmetric convolution-based AC module, as depicted in Figure 4b, modified the original residual module by replacing the 3 × 3 convolution kernel traditionally combined with BN and ReLU layers with a structure that integrated 3 × 3, 1 × 3, and 3 × 1 convolution kernels, each combined with BN and ReLU layers, and summed their outputs. This asymmetric convolution structure enhanced the representational capacity of the standard square convolution kernel by introducing two additional asymmetric convolution kernels, one horizontal and one vertical, to its skeletal structure. Hence, the developed asymmetric convolution structure improved the effectiveness of feature extraction.
The ACSE residual module with the added attention mechanism, as illustrated in Figure 5, included a channel-wise multiplication operation denoted as V. Building upon the AC module, the process started with global pooling, followed by two fully connected layers with sigmoid activation functions to obtain the weight values. These weights were then multiplied channel-wise with the original image.
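For concreteness, a minimal PyTorch sketch of such an ACSE residual block is given below; the branch composition follows the description above, while the channel sizes, reduction ratio, and the placement of the ReLU after the branch sum are illustrative assumptions rather than the exact SiamACDet configuration.

```python
import torch
import torch.nn as nn

class ACSEBlock(nn.Module):
    """Simplified sketch of an ACSE residual module: asymmetric 3x3/1x3/3x1
    convolution branches (each with BN) are summed, followed by
    squeeze-and-excitation channel attention and a residual addition."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # Asymmetric convolution branches (ACNet-style skeleton enhancement).
        self.conv3x3 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                     nn.BatchNorm2d(channels))
        self.conv1x3 = nn.Sequential(nn.Conv2d(channels, channels, (1, 3), padding=(0, 1), bias=False),
                                     nn.BatchNorm2d(channels))
        self.conv3x1 = nn.Sequential(nn.Conv2d(channels, channels, (3, 1), padding=(1, 0), bias=False),
                                     nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)
        # Squeeze-and-excitation: global pooling -> two FC layers -> sigmoid weights.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Sum the three asymmetric branches to strengthen the square-kernel skeleton.
        y = self.relu(self.conv3x3(x) + self.conv1x3(x) + self.conv3x1(x))
        # Channel-wise reweighting (the multiplication operation V in Figure 5).
        y = y * self.se(y)
        # Residual connection.
        return self.relu(x + y)


# Example: a 64-channel feature map keeps its shape through the block.
# feats = ACSEBlock(64)(torch.randn(1, 64, 56, 56))  # -> (1, 64, 56, 56)
```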
(2)
ACSENet feature extraction network based on asymmetric convolution
ResNet101 is a typical architecture within the deep residual network ResNet, comprising 101 convolutional and pooling layers. To solve the problem of detecting small objects in remote sensing images, in this section, ResNet101 was modified by replacing the standard residual modules with ACSE modules from the second to fifth layers and adopting a feature pyramid network (FPN) [40] structure.
The FPN-structure-based feature extraction network ACSENet is presented in Figure 6. Initially, 1 × 1 convolution operations were applied to the feature maps obtained from the second to fifth layers. Then, higher-level feature maps underwent upsampling to facilitate fusion with adjacent feature maps. Finally, the fused feature maps were convolved with a 3 × 3 kernel to yield the output feature map. This process integrated feature maps from different scales to effectively enhance the adaptability of the model to small objects and boost its feature extraction capability, as sketched below.
The network structure parameters are shown in Table 1.
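The following sketch, assuming a PyTorch implementation with typical ResNet stage channel counts, illustrates the top-down fusion of Figure 6: 1 × 1 lateral convolutions, upsampling and addition of adjacent levels, and a 3 × 3 smoothing convolution per level. It is a simplified stand-in for ACSENet rather than its exact structure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Minimal FPN-style fusion: lateral 1x1 convs, top-down addition,
    and a 3x3 smoothing conv for each pyramid level."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):
        # feats: feature maps from stages 2-5, lowest resolution last.
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down pathway: upsample the coarser map and add it to the finer one.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        # 3x3 convolution on each fused map yields the output pyramid.
        return [s(l) for s, l in zip(self.smooth, laterals)]


# Example with typical ResNet stage resolutions for a 224x224 input:
# c2, c3, c4, c5 = (torch.randn(1, c, s, s) for c, s in
#                   zip((256, 512, 1024, 2048), (56, 28, 14, 7)))
# pyramid = SimpleFPN()([c2, c3, c4, c5])  # four maps, each with 256 channels
```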
(3)
ARPN candidate region generation network based on the attention mechanism
In object detection algorithms, the region proposal network (RPN) generates candidate regions (proposals) through predefined anchor points to perform localization tasks. A standard RPN network primarily classifies anchors into foreground and background categories without distinguishing specific anchor types, generating several anchors with categories different from those of the query image. This could adversely affect the precision of the localization task. To address this challenge, this research proposed an attention-based region proposal network (ARPN) that incorporated an attention mechanism. This network refined the features extracted from the retrieval feature map to “purify” the generated proposals such that their categories aligned as closely as possible with those of the query image.
The overall structure of the attention-mechanism-based ARPN is illustrated in Figure 7. Firstly, the feature map of the query image underwent global average pooling and depthwise separable convolution operations to obtain the support feature S. The retrieval image feature map underwent depthwise separable convolution operations to obtain the query feature Q. Then, a cross-correlation [37] operation was performed between the two to generate an attention feature map. Finally, the obtained attention feature map was input into the RPN network to generate proposals.
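A minimal sketch of this attention step is given below; it assumes the cross-correlation is implemented as a depthwise correlation of the globally pooled query feature over the retrieval feature map (as in Siamese trackers) and omits the depthwise separable convolutions for brevity, so it should be read as an interpretation of Figure 7 rather than the exact ARPN implementation.

```python
import torch
import torch.nn.functional as F

def arpn_attention(query_feat, retrieval_feat):
    """Sketch of the ARPN attention map.

    query_feat:     (B, C, Hq, Wq) feature map of the query image
    retrieval_feat: (B, C, Hr, Wr) feature map of the image being retrieved
    returns:        (B, C, Hr, Wr) attention feature map fed to the RPN
    """
    b, c, _, _ = query_feat.shape
    # Global average pooling turns the query feature map into a 1x1 "support" kernel S.
    support = F.adaptive_avg_pool2d(query_feat, 1)          # (B, C, 1, 1)
    # Depthwise cross-correlation: each channel of S filters the same channel of Q.
    kernel = support.reshape(b * c, 1, 1, 1)
    q = retrieval_feat.reshape(1, b * c, *retrieval_feat.shape[-2:])
    attention = F.conv2d(q, kernel, groups=b * c)
    return attention.reshape(b, c, *retrieval_feat.shape[-2:])


# Example:
# attn = arpn_attention(torch.randn(2, 256, 7, 7), torch.randn(2, 256, 32, 32))
# attn.shape -> torch.Size([2, 256, 32, 32])
```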
(4)
Double-head detector
In one-shot object detection algorithms, detectors process and transform the candidate regions (proposals) outputted by the RPN, calculate the similarity between the feature maps of the retrieval and query images, and thereby obtain the category prediction results. Traditional Siamese networks typically use Euclidean distances to calculate the similarity between two features. However, the Euclidean distance only measures the geometric distance between two features, considering the differences along each feature dimension while neglecting the correlations among feature vectors. To better handle the correlations among feature vectors, this research designed a dual-branch detector structure based on CNNs and fully connected networks to accomplish the classification task. Figure 8 presents the overall structure of the double-head detector, where the input is two feature maps of size 7 × 7 × 512 obtained after ROI pooling. Firstly, the two feature maps were concatenated to generate a 7 × 7 × 1024 feature map. Then, the concatenated map was separately passed through the FC-head and the Conv-head to generate the corresponding similarities, which were averaged to obtain the final similarity.
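The following sketch outlines the dual-branch scoring head under the stated 7 × 7 × 512 inputs; the internal widths of the FC-head and Conv-head are illustrative assumptions, not the exact SiamACDet configuration.

```python
import torch
import torch.nn as nn

class DoubleHeadSimilarity(nn.Module):
    """Sketch of the dual-branch detector head (Figure 8): ROI-pooled query and
    retrieval features are concatenated and scored by an FC head and a Conv head,
    whose similarity outputs are averaged."""

    def __init__(self, channels=512, roi_size=7):
        super().__init__()
        in_ch = channels * 2  # 7x7x1024 after concatenation
        self.fc_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_ch * roi_size * roi_size, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1),
        )
        self.conv_head = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(256, 1),
        )

    def forward(self, query_roi, retrieval_roi):
        # query_roi, retrieval_roi: (B, 512, 7, 7) ROI-pooled feature maps.
        x = torch.cat([query_roi, retrieval_roi], dim=1)   # (B, 1024, 7, 7)
        sim_fc = self.fc_head(x)                           # FC-branch similarity
        sim_conv = self.conv_head(x)                       # Conv-branch similarity
        return (sim_fc + sim_conv) / 2                     # averaged final score


# Example:
# score = DoubleHeadSimilarity()(torch.randn(4, 512, 7, 7), torch.randn(4, 512, 7, 7))
```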

3.3. Scene Graph Construction

A scene graph is a graphical data structure applied to encode and organize information in a scene, comprising object instances, object properties, and the relationships among objects. It possesses strong representation capabilities and is widely applied in the field of computer vision. In remote sensing imagery, targets are predominantly static objects, and the relationships among them are mainly spatial positioning. Relative positioning is crucial for the determination of semantic associations, and the application of directional relationships enables a more precise description of the spatial arrangements and interactions among objects in geographical scenes. This method captures the subtle differences among objects in a spatial scene, thereby enhancing the understanding of the semantic information in images.
Considering different retrieval scenarios with distinct data characteristics, this research employed different connection methods to construct scene graphs, adapting to various retrieval requirements. These connection methods include fully connected, randomly connected, nearest neighbor connected, star connected, and ring connected methods, which are described below.
(1)
Fully connected: Each node is directly connected to every other node in the graph, capturing the relationships among all nodes but introducing redundancy with the increase in node number.
(2)
Randomly connected: Each node is probabilistically connected to other nodes in the graph, decreasing the connection number while maintaining information transfer among nodes.
(3)
Nearest neighbor connected: Starting from a selected node, other nodes are incrementally added to the graph based on their proximity to the already selected nodes. This method effectively captures the local structure and similarity in the data, improving retrieval and analysis efficiency.
(4)
Star connected: One central node is directly connected to all other nodes, while connections among other nodes are absent. This configuration is appropriate for scenes with a clear center or theme, emphasizing information related to the central node for easier retrieval and understanding.
(5)
Ring connected: Nodes are connected in a circular manner where each node is connected to its adjacent nodes. This method protects the spatial information and layout relationships among targets, capturing the periodic relationships among the data.
The process of constructing the scene graph is shown in Figure 9. The remote sensing image on the left can be used to build different scene graphs based on the various connection methods, visually demonstrating the characteristics of each method. This figure includes the object detection results of the remote sensing image, as well as the node attribute table and adjacency matrix constructed based on the object detection results. The node attribute table lists the category and positional attributes of each object, while the adjacency matrix shows the connection relationships between different objects. Based on these results, scene graphs were constructed using five different connection methods: fully connected, randomly connected, nearest neighbor connected, star connected, and ring connected.
In Figure 9, each node represents an object, including the object’s category and positional attributes, denoted as (L, x, y, w, h), where L represents the category, and (x, y, w, h) represents the position. Each edge represents the spatial relationship between two objects, expressed as (d, θ), where d is the distance between the starting node and the target node, and θ is the angle from the starting node to the target node in a polar coordinate system constructed with the top-left corner of the remote sensing image as the origin.
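As an example, the fully connected strategy can be realized with NetworkX as sketched below; the use of the (x, y) coordinates directly as node positions and the computation of θ with atan2 in image coordinates (origin at the top-left corner) are assumed interpretations of the definitions above, and the other four strategies differ only in which node pairs receive edges.

```python
import math
import itertools
import networkx as nx

def build_scene_graph(detections):
    """Sketch of scene graph construction (fully connected strategy) from object
    detection results. Each detection is (L, x, y, w, h); each edge carries (d, theta)."""
    g = nx.DiGraph()
    # One node per detected object, carrying its category and position attributes.
    for i, (label, x, y, w, h) in enumerate(detections):
        g.add_node(i, L=label, x=x, y=y, w=w, h=h)
    # Fully connected strategy: an edge between every ordered pair of nodes.
    for i, j in itertools.permutations(g.nodes, 2):
        xi, yi = g.nodes[i]["x"], g.nodes[i]["y"]
        xj, yj = g.nodes[j]["x"], g.nodes[j]["y"]
        d = math.hypot(xj - xi, yj - yi)        # Euclidean distance between objects
        theta = math.atan2(yj - yi, xj - xi)    # direction from node i to node j
        g.add_edge(i, j, d=d, theta=theta)
    return g


# Example: three storage tanks arranged in a line (as in Figure 1).
tanks = [("storage_tank", 10, 20, 8, 8), ("storage_tank", 30, 20, 8, 8), ("storage_tank", 50, 20, 8, 8)]
graph = build_scene_graph(tanks)
print(graph.number_of_nodes(), graph.number_of_edges())  # 3 nodes, 6 directed edges
```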

3.4. Similarity Calculation of Remote Sensing Images Fusing Scene Graphs

3.4.1. Scene Graph Feature Extraction Module

A graph feature extraction module typically involves node embedding, which is often implemented using graph convolutional networks (GCNs) [41]. Traditional GCNs only consider the node features and graph structure, ignoring the edge information. To address this limitation, we introduced a GCN with edge features (EGCN) [42]. The structure of the graph convolutional module composed of the EGCN is illustrated in Figure 10. In the EGCN, the input to the first layer included node features X0 of size N × NF, with N being the number of nodes in the graph and NF being the number of features per node. In addition, edge features E0 of size N × N × NP were included, where NP denoted the number of features per edge, and Xi and Ei represented the outputs of the i-th layer of the EGCN. Unlike traditional GCNs, where the same adjacency matrix is used as the input in each layer, the EGCN replaced the adjacency matrix with an edge feature matrix. This allowed the neural network to learn using both node and edge features. In the EGCN, the edge features from the output of the previous layer were applied as the input to the next layer, enabling dynamic adjustments at each layer. This contrasts with traditional GCN modules, where the same adjacency matrix is input to each GCN layer without incorporating edge-specific information.
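A simplified sketch of one such edge-feature-aware convolution layer is shown below; it treats each edge-feature channel as a separate weighted adjacency and passes transformed edge features to the next layer, which captures the spirit of the EGCN but is not a line-by-line reproduction of [42].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EGCNLayer(nn.Module):
    """Simplified edge-feature graph convolution layer.
    Node features X: (N, NF); edge features E: (N, N, NP)."""

    def __init__(self, in_features, out_features, edge_features):
        super().__init__()
        # One linear transform per edge-feature channel.
        self.weights = nn.ModuleList(nn.Linear(in_features, out_features, bias=False)
                                     for _ in range(edge_features))
        self.edge_update = nn.Linear(edge_features, edge_features)

    def forward(self, x, e):
        outputs = []
        for p, w in enumerate(self.weights):
            # Row-normalize channel p of the edge features so it acts like an adjacency.
            adj = F.softmax(e[..., p], dim=-1)        # (N, N)
            outputs.append(adj @ w(x))                # aggregate neighbor features
        x_out = F.relu(torch.stack(outputs, dim=0).sum(dim=0))
        # Edge features are also transformed and passed to the next layer.
        e_out = F.relu(self.edge_update(e))
        return x_out, e_out


# Example: 5 nodes with 5 node features (L, x, y, w, h) and 2 edge features (d, theta).
# x1, e1 = EGCNLayer(5, 64, 2)(torch.randn(5, 5), torch.randn(5, 5, 2))
```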

3.4.2. Feature Similarity Calculation Module

The feature similarity calculation module primarily computes the similarity between the feature vectors Hq and Hs extracted by the scene graph feature extraction module for graphs Gq and Gs, respectively. This paper utilized a neural tensor network (NTN) [43] for feature similarity calculations.
Figure 11 shows the flowchart of the feature similarity module. Initially, the node feature vectors Xq and Xs and edge feature vectors Eq and Es extracted from graphs Gq and Gs by the scene graph feature extraction module were fused to obtain graph feature vectors Hq and Hs. Then, Hq and Hs were passed into the NTN network to compute their feature similarity, giving the output of the similarity computation.
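The following sketch shows one standard formulation of NTN-based similarity between two graph feature vectors, s(Hq, Hs) = f(Hq^T W[1..K] Hs + V [Hq; Hs] + b) followed by a scoring layer; the number of tensor slices and the final scoring layer are illustrative choices rather than the exact configuration used here.

```python
import torch
import torch.nn as nn

class NTNSimilarity(nn.Module):
    """Sketch of neural tensor network similarity between graph feature vectors."""

    def __init__(self, dim, k=16):
        super().__init__()
        self.w = nn.Parameter(torch.randn(k, dim, dim) * 0.01)  # bilinear tensor slices
        self.v = nn.Linear(2 * dim, k, bias=True)                # linear interaction term
        self.score = nn.Linear(k, 1)                             # reduce K channels to a score

    def forward(self, h_q, h_s):
        # h_q, h_s: (B, dim) graph feature vectors.
        bilinear = torch.einsum("bd,kde,be->bk", h_q, self.w, h_s)  # Hq^T W_k Hs per slice
        linear = self.v(torch.cat([h_q, h_s], dim=-1))              # V [Hq; Hs] + b
        interaction = torch.tanh(bilinear + linear)
        return torch.sigmoid(self.score(interaction)).squeeze(-1)   # similarity in (0, 1)


# Example:
# sim = NTNSimilarity(dim=64)(torch.randn(2, 64), torch.randn(2, 64))  # -> shape (2,)
```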

4. Experimental Evaluation

4.1. Datasets and Experimental Environment

(1)
Object detection in optical remote sensing images (DIOR) dataset [44]
The object detection in optical remote sensing images (DIOR) dataset is a large-scale dataset designed for object detection in remote sensing images. It includes 20 categories such as airplanes, airports, vehicles, bridges, and train stations, comprising a total of 23,463 images.
(2)
Remote sensing images with scene graph (RSSG) ocean remote sensing dataset
The remote sensing images with scene graph (RSSG) ocean remote sensing dataset is a custom dataset created by us. It covers relevant images from a coastal area between 2020 and 2021, containing five categories (sea fields, ships, aquaculture rafts, oil tanks, and houses) with a total of 929 images. A corresponding scene graph was created for each image. Annotations were performed using LabelImg to outline the regions of interest in the sample images, followed by the building of scene graphs based on these regions using NetworkX. An example of the dataset is presented in Figure 12.
(3)
Experimental environment
The hardware and software used in the experiments are listed in Table 2:

4.2. Evaluation of the One-Shot Object Detection Algorithm for Remote Sensing Images

4.2.1. Evaluation Indicators

Typically, there are multiple object classes in object detection, so the mean average precision (mAP) is introduced as an evaluation metric. The average precision (AP) is calculated from precision (Precision) and recall (Recall), where Precision is the ratio of correctly predicted positive samples to all samples predicted as positive by the model and Recall is the ratio of correctly predicted positive samples to all actual positive samples. Recall and Precision were plotted on the horizontal and vertical axes, respectively, to generate a PR curve, with the area under the curve representing the AP value. The equation for the calculation of AP is shown in Equation (1):
$AP = \int_{0}^{1} P(R) \, dR$
The mean average precision (mAP), which is the average of the AP values across all classes, is calculated as presented in Equation (2), where N is the total number of classes and AP_i is the AP value for the i-th class:
$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$
Precision is the proportion of samples that are predicted as positive by the model and are actually positive. Recall is the proportion of samples that are predicted as positive by the model out of all actual positive samples. The equations for the calculation of Precision and Recall are shown in Equation (3):
$Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}$
Among them, TP represents the true positive samples, meaning the positive samples correctly predicted by the model, and FP represents the false positive samples, meaning the negative samples incorrectly predicted as positive by the model. FN refers to the false negative samples, where positive samples are incorrectly predicted as negative.
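For reference, a minimal numerical sketch of Equations (1) and (2) is given below; it approximates the area under the PR curve by trapezoidal integration, whereas practical detector evaluations often use interpolated variants.

```python
import numpy as np

def average_precision(precision, recall):
    """Equation (1): AP as the area under the precision-recall curve,
    approximated here by trapezoidal integration over recall."""
    order = np.argsort(recall)
    return float(np.trapz(np.asarray(precision)[order], np.asarray(recall)[order]))

def mean_average_precision(ap_per_class):
    """Equation (2): mAP is the mean of the per-class AP values."""
    return float(np.mean(ap_per_class))

# Example with a toy PR curve:
# ap = average_precision([1.0, 0.8, 0.6], [0.1, 0.5, 1.0])
# print(mean_average_precision([ap, 0.7]))
```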

4.2.2. Performance Comparison Result

To evaluate the performance of the developed model, this research conducted comparative experiments on the DIOR and RSSG datasets, comparing it with other one-shot object detection algorithms: CoAE [45], AIT [46], and BHRL [47]. A momentum stochastic gradient descent (Momentum SGD) optimizer [48] with an initial learning rate of 1.00 × 10−3, a momentum parameter of 0.9, a weight decay coefficient of 1.00 × 10−4, and a batch size of 2 was employed during training. In the DIOR dataset, categories such as train stations and chimneys were set as invisible classes, while those such as oil tanks and ships were set as visible classes, totaling 12 categories. In the RSSG dataset, the harbor and oil tank categories were set as visible classes, and the ship, sea field, and factory categories were set as invisible classes. The experiments involved training on the visible classes and testing on the visible and invisible classes separately. The experimental results obtained for the DIOR and RSSG datasets are summarized in Table 3 and Table 4, respectively.
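The reported optimizer settings correspond to the following configuration, assuming a PyTorch implementation (the model object here is a placeholder for the SiamACDet network).

```python
import torch

# Stand-in module; in practice this would be the SiamACDet network.
model = torch.nn.Linear(10, 2)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-3,            # initial learning rate 1.00e-3
    momentum=0.9,       # momentum parameter
    weight_decay=1e-4,  # weight decay coefficient
)
```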
Based on Table 3 and Table 4, the proposed SiamACDet algorithm performed well in one-shot object detection tasks for remote sensing images. On the DIOR dataset, compared to the previous methods CoAE, AIT, and BHRL, the proposed algorithm improved the mAP values for visible classes by 3.9%, 2.9%, and 1.8%, and for invisible classes, by 2.5%, 1.8%, and 1.1%, respectively. On the RSSG dataset, compared to CoAE, AIT, and BHRL, the proposed algorithm showed improvements in the mAP values for visible classes by 3.6%, 1.8%, and 1.1%, and for invisible classes, by 1.9%, 1.3%, and 0.2%, respectively. Compared to previous methods, this study introduced asymmetric convolutions and attention mechanisms into the ResNet-based feature extraction network, proposing the ACSENet feature extraction network. The asymmetric convolution structure enhances the representational capability of standard square convolution kernels, allowing for the better extraction of deep features in remote sensing images. This addresses the problem of the weakened feature extraction ability and reduced detection accuracy caused by the complex backgrounds of remote sensing images. The FPN structure is incorporated into the ACSENet feature extraction network to improve the detection accuracy of small objects in remote sensing images. An ARPN candidate region generation network based on an attention mechanism is designed to replace the traditional RPN network. This network fuses the features of the query image with those of the image being retrieved, reducing the interference from anchors of different classes from the query image and improving the accuracy of the position regression task, without a significant decrease in detection speed compared to the RPN network. Additionally, this study designed a dual-branch detector to replace the traditional similarity calculation function in the Siamese network for the classification task, which better captures the relationship between query image features and support image features. Both the FC-head and Conv-head improve the detection accuracy of the SiamACDet algorithm, and combining the two allows for a better learning of the relationships between the complex feature vectors, enhancing the classification ability of the algorithm.

4.3. Evaluation of the Remote Sensing Image Retrieval Algorithm

4.3.1. Evaluation Indicators

Precision@K and Recall@K were used as evaluation metrics to evaluate the performance of the retrieval algorithms:
1. Precision@K: Precision@K measured the proportion of relevant instances among the top K retrieved items. It was calculated as the ratio of relevant retrieved instances to the total number of retrieved instances at rank K. In this context, Precision@K at K = 1, 5, and 10 evaluated the precision of the retrieval algorithm when the top 1, 5, and 10 retrieved items were considered, respectively. The equation for the calculation of Precision@K is shown in Equation (4):
$Precision@K = \frac{TP@K}{K}$
Among them, TP@K refers to the number of positive samples in the top K ranked retrieval results.
2. Recall@K: Recall@K measured the proportion of relevant instances which were retrieved among all relevant instances. It was calculated as the ratio of relevant instances retrieved to the total number of relevant instances in the dataset at rank K. Specifically, Recall@K at K = 1, 5, and 10 assessed how well the retrieval algorithm captured relevant instances within the top 1, 5, and 10 retrieved items, respectively. The equation for the calculation of Recall@K is shown in Equation (5):
$Recall@K = \frac{TP@K}{R}$
Among them, TP@K refers to the number of positive samples within the top K ranked retrieval results, while R represents the total number of positive samples.
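Equations (4) and (5) can be computed directly from a ranked result list, as in the following sketch (the image identifiers are illustrative).

```python
def precision_at_k(retrieved, relevant, k):
    """Equation (4): fraction of the top-K retrieved items that are relevant."""
    top_k = retrieved[:k]
    return sum(item in relevant for item in top_k) / k

def recall_at_k(retrieved, relevant, k):
    """Equation (5): fraction of all relevant items found in the top-K results."""
    top_k = retrieved[:k]
    return sum(item in relevant for item in top_k) / len(relevant)

# Example: a ranked result list versus the set of ground-truth relevant images.
retrieved = ["img3", "img7", "img1", "img9", "img4"]
relevant = {"img3", "img1", "img8"}
print(precision_at_k(retrieved, relevant, 5))  # 2/5 = 0.4
print(recall_at_k(retrieved, relevant, 5))     # 2/3 ≈ 0.67
```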

4.3.2. Performance Test

To validate the performance of the developed remote sensing image retrieval algorithm based on scene graph similarity, the scene graph similarity-based remote sensing image retrieval algorithm (SGSRSIIR) proposed in this research was compared with the remote sensing image retrieval algorithms asymmetric multimodal feature matching network (AMFMN) [49], metric-learning-based deep hashing network (MiLaN) [50], and asymmetric hash code learning (AHCL) [51]. The experimental results are summarized in Table 5 and Table 6.
In Table 5 and Table 6, it is observed that the remote sensing image retrieval algorithm developed in this research improved Precision@K and Recall@K metrics compared to previous methods on the RSSG dataset. In the comparison to previous methods, AMFMN completes retrieval by extracting the salient features of images, while MiLaN and AHCL achieve retrieval by transforming deep image features into hash codes. Unlike previous methods, this study proposes a remote sensing image retrieval algorithm that does not use global image features but rather constructs a scene graph based on the semantic information of objects contained in the image. On the basis of the traditional graph convolutional network, an edge-feature-based EGCN graph feature extraction network is designed. The category information of objects in the image and their positional relationships are transformed into features in the scene graph, while the Euclidean distance and corresponding angles between two objects are used as edge features in the scene graph. This allows for a more effective utilization of edge features, improving the network’s feature extraction capabilities and enhancing retrieval performance. A neural tensor network (NTN) is employed to replace Euclidean distance and cosine similarity for calculating the similarity between feature vectors. Compared with traditional similarity calculation methods, the NTN network is better suited to learning the nonlinear relationships between two graph feature vectors, resulting in a better retrieval performance.

5. Conclusions

To address the shortcomings of traditional remote sensing image retrieval algorithms, such as insufficient semantic information acquisition and low retrieval accuracy, this research proposed a one-shot object detection algorithm for remote sensing images based on Siamese networks and, building on it, developed a scene-graph-based remote sensing image retrieval algorithm. We created the RSSG dataset, comprising 929 remote sensing images, and employed a variety of scene graph construction strategies to create a corresponding scene graph for each image. Experimental evaluations performed on the DIOR dataset and our custom RSSG dataset demonstrated significant improvements in precision and recall compared to other algorithms, verifying the effectiveness of the developed approach.
Our method developed scene graphs using the node attributes and positional information extracted from object detection outputs, which were then used to represent the similarity among the original remote sensing images based on scene graph similarity. Fundamentally, this approach constituted a semantic-based image retrieval method. However, this process might overlook certain image details of individual targets, such as their color and shape. Therefore, our future work aims to integrate both the semantic information of individual target objects and the global positional information of remote sensing images, while reintroducing image features specific to individual targets. This integration will facilitate similarity calculations based on fused semantic and image features.

Author Contributions

Conceptualization, Y.R. and Z.Z.; methodology, J.J. and Y.J.; software, Y.Y.; validation, Y.R., D.L. and K.C.; formal analysis, Z.Z., K.C., G.Y. and Y.R.; investigation, Y.J.; resources, Y.Y.; data curation, Y.J. and D.L.; writing—original draft preparation, Y.R.; writing—review and editing, Z.Z. and J.J.; visualization, D.L.; supervision, Y.Y.; project administration, G.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant numbers 62137001 and 62272093).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Avtar, R.; Komolafe, A.A.; Kouser, A.; Singh, D.; Yunus, A.P.; Dou, J.; Kumar, P.; Gupta, R.D.; Johnson, B.A.; Minh, H.V.T.; et al. Assessing sustainable development prospects through remote sensing: A review. Remote Sens. Appl. Soc. Environ. 2020, 20, 100402. [Google Scholar] [CrossRef]
  2. Li, Y.; Ma, J.; Zhang, Y. Image retrieval from remote sensing big data: A survey. Inf. Fusion 2021, 67, 94–115. [Google Scholar] [CrossRef]
  3. Ye, F.; Zhao, X.; Luo, W.; Li, D.; Min, W. Query-Adaptive Remote Sensing Image Retrieval Based on Image Rank Similarity and Image-to-Query Class Similarity. IEEE Access 2020, 8, 116824–116839. [Google Scholar] [CrossRef]
  4. Wan, J.; Wang, D.; Hoi, S.C.; Wu, P.; Zhu, J.; Zhang, Y.; Li, J. Deep Learning for Content-Based Image Retrieval: A Comprehensive Study. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA; 2014. [Google Scholar]
  5. Alzu’bi, A.; Amira, A.; Ramzan, N. Semantic content-based image retrieval: A comprehensive study. J. Vis. Commun. Image Represent. 2015, 32, 20–54. [Google Scholar] [CrossRef]
  6. Liu, Y.; Zhang, D.; Lu, G.; Ma, W.Y. A survey of content-based image retrieval with high-level semantics. Pattern recognition 2007, 40, 262–282. [Google Scholar] [CrossRef]
  7. Antonelli, S.; Avola, D.; Cinque, L.; Crisostomi, D.; Foresti, G.L.; Galasso, F.; Marini, M.R.; Mecca, A.; Pannone, D. Few-Shot Object Detection: A Survey. ACM Comput. Surv. 2022, 54, 37. [Google Scholar] [CrossRef]
  8. Li, H.; Zhu, G.; Zhang, L.; Jiang, Y.; Dang, Y.; Hou, H.; Shen, P.; Zhao, X.; Shah, S.A.A.; Bennamoun, M. Scene Graph Generation: A comprehensive survey. Neurocomputing 2024, 566, 127052. [Google Scholar] [CrossRef]
  9. Bretschneider, T.; Cavet, R.; Kao, O. Retrieval of remotely sensed imagery using spectral information content. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, Toronto, ON, Canada, 7 November 2002; pp. 2253–2255. [Google Scholar]
  10. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  11. Scott, G.J.; Klaric, M.N.; Davis, C.H.; Shyu, C.R. Entropy-Balanced Bitmap Tree for Shape-Based Object Retrieval from Large-Scale Satellite Imagery Databases. IEEE Trans. Geosci. Remote Sens. 2011, 49, 1603–1616. [Google Scholar] [CrossRef]
  12. Tong, X.Y.; Xia, G.S.; Hu, F.; Zhong, Y.F.; Datcu, M.H.; Zhang, L.P. Exploiting Deep Features for Remote Sensing Image Retrieval: A Systematic Investigation. IEEE Trans. Big Data 2020, 6, 507–521. [Google Scholar] [CrossRef]
  13. Cao, B.; Araujo, A.; Sim, J. Unifying deep local and global features for image search. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 726–743. [Google Scholar]
  14. Ge, Y.; Yang, Z.H.; Huang, Z.H.; Ye, F.M. A multi-level feature fusion method based on pooling and similarity for HRRS image retrieval. Remote Sens. Lett. 2021, 12, 1090–1099. [Google Scholar] [CrossRef]
  15. Cao, R.; Zhang, Q.; Zhu, J.S.; Li, Q.; Li, Q.Q.; Liu, B.Z.; Qiu, G.P. Enhancing remote sensing image retrieval using a triplet deep metric learning network. Int. J. Remote Sens. 2020, 41, 740–751. [Google Scholar] [CrossRef]
  16. Famao, Y.E.; Chen, S.X.; Meng, X.L. Remote sensing image retrieval method based on regression CNN feature fusion. Sci. Surv. Mapp. 2023, 48, 168–176. [Google Scholar]
  17. Xu, B.; Cen, K.; Huang, J.; Shen, H.; Cheng, X. A Survey on Graph Convolutional Neural Network. Chin. J. Comput. 2020, 43, 755–780. [Google Scholar]
  18. Chaudhuri, U.; Banerjee, B.; Bhattacharya, A. Siamese graph convolutional network for content based remote sensing image retrieval. Comput. Vis. Image Underst. 2019, 184, 22–30. [Google Scholar] [CrossRef]
  19. Chaudhuri, U.; Banerjee, B.; Bhattacharya, A.; Datcu, M. Attention-driven graph convolution network for remote sensing image retrieval. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  20. Lu, X.X.; Wang, B.Q.; Zheng, X.T.; Li, X.L. Exploring Models and Data for Remote Sensing Image Caption Generation. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2183–2195. [Google Scholar] [CrossRef]
  21. Li, X.L.; Zhang, X.T.; Huang, W.; Wang, Q. Truncation Cross Entropy Loss for Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens. 2021, 59, 5246–5257. [Google Scholar] [CrossRef]
  22. Le, T.M.; Dinh, N.T.; Van, T.T. Developing a model semantic-based image retrieval by combining KD-Tree structure with ontology. Expert Syst. 2023, 18, e13396. [Google Scholar] [CrossRef]
  23. Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Kolesnikov, A.; et al. The open images dataset v4. Int. J. Comput. Vis. 2020, 128, 1956–1981. [Google Scholar] [CrossRef]
  24. Roopak, N.; Deepak, G. OntoKnowNHS: Ontology Driven Knowledge Centric Novel Hybridised Semantic Scheme for Image Recommendation Using Knowledge Graph. In Proceedings of the Knowledge Graphs and Semantic Web: Third Iberoamerican Conference and Second Indo-American Conference, KGSWC 2021, Kingsville, TX, USA, 22–24 November 2021; p. 138. [Google Scholar]
  25. Asim, M.N.; Wasim, M.; Khan, M.U.G.; Mahmood, N.; Mahmood, W. The Use of Ontology in Retrieval: A Study on Textual, Multilingual, and Multimedia Retrieval. IEEE Access 2019, 7, 21662–21686. [Google Scholar] [CrossRef]
  26. Nhi, N.T.U.; Le, T.M.; Van, T.T. A Model of Semantic-Based Image Retrieval Using C-Tree and Neighbor Graph. Int. J. Semant. Web Inf. Syst. 2022, 18, 23. [Google Scholar] [CrossRef]
  27. Dinh, N.T.; Van, T.T.; Le, T.M. Semantic relationship-based image retrieval using KD-tree structure. In Proceedings of the Asian Conference on Intelligent Information and Database Systems, Ho Chi Minh City, Vietnam, 28–30 November 2022; pp. 455–468. [Google Scholar]
  28. Dinh, N.T.; Le, T.M.; Van, T.T. An improvement method of KD-Tree using k-means and k-NN for semantic-based image retrieval system. In Proceedings of the World Conference on Information Systems and Technologies, Budva, Montenegro, 12–14 April 2022; pp. 177–187. [Google Scholar]
  29. Schroeder, B.; Tripathi, S. Structured query-based image retrieval using scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 178–179. [Google Scholar]
  30. Caesar, H.; Uijlings, J.; Ferrari, V. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18 June 2018; pp. 1209–1218. [Google Scholar]
  31. Wang, S.; Wang, R.; Yao, Z.; Shan, S.; Chen, X. Cross-modal scene graph matching for relationship-aware image-text retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1508–1517. [Google Scholar]
  32. Yoon, S.; Kang, W.Y.; Jeon, S.; Lee, S.; Han, C.; Park, J.; Kim, E.-S. Image-to-image retrieval by learning similarity between scene graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, online event, 19–21 May 2021; pp. 10718–10726. [Google Scholar]
  33. O’Connor, R.J.; Spalding, A.K.; Bowers, A.W.; Ardoin, N.M. Power and participation: A systematic review of marine protected area engagement through participatory science Methods. Mar. Policy 2024, 163, 106133. [Google Scholar] [CrossRef]
  34. Liu, C.; Ma, J.; Tang, X.; Liu, F.; Zhang, X.; Jiao, L. Deep hash learning for remote sensing image retrieval. IEEE Trans. Geosci. Remote Sens. 2020, 59, 3420–3443. [Google Scholar] [CrossRef]
  35. Dubey, S.R. A Decade Survey of Content Based Image Retrieval Using Deep Learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 2687–2704. [Google Scholar] [CrossRef]
  36. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 850–865. [Google Scholar]
  37. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  38. Ding, X.; Guo, Y.; Ding, G.; Han, J. Acnet: Strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1911–1920. [Google Scholar]
  39. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18 June 2018; pp. 7132–7141. [Google Scholar]
  40. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  41. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  42. Gong, L.; Cheng, Q. Exploiting edge features for graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9211–9219. [Google Scholar]
  43. Socher, R.; Chen, D.; Manning, C.D.; Ng, A. Reasoning with neural tensor networks for knowledge base completion. Adv. Neural Inf. Process. Syst. 2013, 26. [Google Scholar]
  44. Li, K.; Wan, G.; Cheng, G.; Meng, L.Q.; Han, J.W. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS-J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  45. Hsieh, T.-I.; Lo, Y.-C.; Chen, H.-T.; Liu, T.-L. One-shot object detection with co-attention and co-excitation. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  46. Chen, D.-J.; Hsieh, H.-Y.; Liu, T.-L. Adaptive image transformer for one-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12247–12256. [Google Scholar]
  47. Yang, H.; Cai, S.; Sheng, H.; Deng, B.; Huang, J.; Hua, X.-S.; Tang, Y.; Zhang, Y. Balanced and hierarchical relation learning for one-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7591–7600. [Google Scholar]
  48. Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Stat. 1951, 22, 400–407. [Google Scholar] [CrossRef]
  49. Yuan, Z.Q.; Zhang, W.K.; Fu, K.; Li, X.; Deng, C.B.; Wang, H.Q.; Sun, X. Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval. IEEE Trans. Geosci. Remote Sens. 2022, 60, 19. [Google Scholar] [CrossRef]
  50. Roy, S.; Sangineto, E.; Demir, B.; Sebe, N. Metric-Learning-Based Deep Hashing Network for Content-Based Retrieval of Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2021, 18, 226–230. [Google Scholar] [CrossRef]
  51. Song, W.W.; Gao, Z.; Dian, R.W.; Ghamisi, P.; Zhang, Y.J.; Benediktsson, J.A. Asymmetric Hash Code Learning for Remote Sensing Image Retrieval. IEEE Trans. Geosci. Remote Sens. 2022, 60, 14. [Google Scholar] [CrossRef]
Figure 1. Examples of remote sensing image retrieval based on semantic features.
Figure 2. The developed remote sensing image retrieval algorithm.
Figure 3. The schematic diagram of the SiamACDet algorithm.
Figure 4. Flowcharts of the residual and AC modules.
Figure 5. The flowchart of the ACSE module.
Figure 6. The flowchart of the FPN-structure-based ACSENet.
Figure 7. The schematic diagram of the ARPN.
Figure 8. The flowchart of the double-head detector.
Figure 9. Example of the scene graph construction method.
Figure 10. The EGCN module.
Figure 11. Flowchart of the feature similarity calculation module.
Figure 12. Sample of the RSSG dataset.
Table 1. Network structure parameters.

Name | ResNet101 | ACSENet
Conv1 | 7 × 7, 64, S2; 3 × 3 max pool, S2 | 7 × 7, 64, S2; 3 × 3 max pool, S2
Conv2_x | [1 × 1 conv, 64; 3 × 3 conv, 64; 1 × 1 conv, 256] × 3 | [1 × 1 conv, 64; 3 × 3 conv, 64; 1 × 3 conv, 64; 3 × 1 conv, 64; 1 × 1 conv, 256] × 3
Conv3_x | [1 × 1 conv, 128; 3 × 3 conv, 128; 1 × 1 conv, 512] × 4 | [1 × 1 conv, 128; 3 × 3 conv, 128; 1 × 3 conv, 128; 3 × 1 conv, 128; 1 × 1 conv, 512] × 4
Conv4_x | [1 × 1 conv, 256; 3 × 3 conv, 256; 1 × 1 conv, 1024] × 23 | [1 × 1 conv, 256; 3 × 3 conv, 256; 1 × 3 conv, 256; 3 × 1 conv, 256; 1 × 1 conv, 1024] × 23
Conv5_x | [1 × 1 conv, 512; 3 × 3 conv, 512; 1 × 1 conv, 2048] × 3 | [1 × 1 conv, 512; 3 × 3 conv, 512; 1 × 3 conv, 512; 3 × 1 conv, 512; 1 × 1 conv, 2048] × 3
Table 2. Experimental equipment.

Experimental Equipment | Parameters
Operating system | Ubuntu 20.04.2 LTS
CPU | Intel® Core™ i7-13700 @ 2.10 GHz × 16
Memory | 32 GB
GPU | NVIDIA GeForce RTX 3090
Programming language and framework | Python 3.8, PyTorch
Development tools | PyCharm
Table 3. Results of the comparison experiment on the DIOR dataset (%).

Class Group | Class | CoAE [45] | AIT [46] | BHRL [47] | SiamACDet
Visible classes | oil tank | 33.6 | 35.2 | 38.4 | 41.7
 | boat | 7.9 | 8.5 | 8.3 | 9.1
 | plane | 43.8 | 44.2 | 45.6 | 48
 | house | 36.3 | 36.2 | 39.5 | 39.3
 | dam | 8.2 | 9.9 | 10.3 | 12.5
 | highway service areas | 10.6 | 13.2 | 11.5 | 16.8
 | gymnasium | 30.9 | 28.7 | 30.6 | 31
 | football | 51.2 | 54.6 | 55.2 | 58.1
 | overpass | 13.9 | 14.1 | 19.3 | 14.3
 | wind generator | 9.6 | 13.2 | 11 | 18.1
 | bridge | 3.8 | 4.5 | 6.5 | 4.6
 | basketball court | 37.2 | 36.9 | 35.4 | 39.7
 | mAP | 23.9 | 24.9 | 26 | 27.8
Invisible classes | golf course | 2.4 | 4.2 | 5.5 | 5.5
 | vehicle | 5.8 | 5.9 | 6.2 | 8
 | railway station | 14.8 | 15.1 | 15.9 | 17.4
 | chimney | 16.5 | 17.2 | 17.6 | 18.7
 | mAP | 9.9 | 10.6 | 11.3 | 12.4
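A small reading aid for the mAP rows in Tables 3 and 4: they are consistent with the plain arithmetic mean of the per-class values in each group. For example, SiamACDet's invisible-class mAP in Table 3 is (5.5 + 8.0 + 17.4 + 18.7) / 4 = 12.4%, which matches the reported value.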
Table 4. Results of the comparison experiment on the RSSG dataset (%).

Algorithm | Visible: Farming Floating Rafts | Visible: Oil Tank | Visible: mAP | Invisible: Sea Fields | Invisible: Boat | Invisible: House | Invisible: mAP
CoAE [45] | 8.1 | 30.6 | 19.4 | 2.6 | 8.5 | 36.3 | 15.8
AIT [46] | 9.6 | 32.7 | 21.2 | 2.9 | 9.6 | 36.2 | 16.2
BHRL [47] | 10.1 | 33.6 | 21.9 | 3.1 | 10 | 39.4 | 17.5
SiamACDet | 10.5 | 35.4 | 23 | 3.5 | 10.2 | 39.4 | 17.7
Table 5. Results of the comparison experiment on Precision (%).

Algorithm | Precision@1 | Precision@5 | Precision@10
AMFMN | 27.6 | 10.2 | 10.3
MiLaN | 20.4 | 8.2 | 9
AHCL | 32.4 | 11.5 | 11.4
SGSRSIIR | 37.6 | 13.2 | 12
Table 6. Results of the comparison experiment on Recall (%).

Algorithm | Recall@1 | Recall@5 | Recall@10
AMFMN | 6.9 | 12.8 | 25.7
MiLaN | 5.1 | 10.3 | 22.4
AHCL | 8.1 | 14.4 | 28.5
SGSRSIIR | 9.4 | 16.5 | 30.1
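For readers reproducing Tables 5 and 6, the sketch below shows one standard way to compute per-query Precision@K and Recall@K; averaging these per-query values over all queries yields table-level percentages. The exact relevance criterion used on the RSSG dataset is not restated in this section, so the definitions (hits in the top K divided by K, and by the number of relevant images, respectively) and the function name precision_recall_at_k are illustrative assumptions rather than the authors' evaluation code.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Per-query retrieval metrics (illustrative definitions):
    ranked_ids   -- database image ids sorted by descending similarity
    relevant_ids -- set of ground-truth relevant ids for the query
    Precision@K = hits in top K / K;  Recall@K = hits in top K / |relevant|."""
    top_k = ranked_ids[:k]
    hits = sum(1 for image_id in top_k if image_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Toy query: 3 of the top-5 retrieved images are relevant, out of 10 relevant overall.
ranked = [4, 9, 1, 7, 2, 8, 3, 0, 5, 6]
relevant = {9, 7, 2, 11, 12, 13, 14, 15, 16, 17}
print(precision_recall_at_k(ranked, relevant, 5))  # (0.6, 0.3)
```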
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Ren, Y.; Zhao, Z.; Jiang, J.; Jiao, Y.; Yang, Y.; Liu, D.; Chen, K.; Yu, G. A Scene Graph Similarity-Based Remote Sensing Image Retrieval Algorithm. Appl. Sci. 2024, 14, 8535. https://doi.org/10.3390/app14188535
