1. Introduction
With the rapid development of Earth observation technology, the volume of remote sensing data has increased exponentially [1,2,3]. Remote sensing images have broad applications such as early warning for natural disasters, emergency response, and urban construction planning [4,5]. Quickly retrieving remote sensing images of interest from a massive image pool is a challenging task, and it has stimulated extensive research interest in the scientific community [6,7].
Early text-based remote sensing image retrieval (TBRSIR) performed retrieval based on pre-annotated textual information such as acquisition time, resolution, and sensor type. Given the tedious manual annotation that text-based retrieval requires and the ambiguity of such annotations, researchers have turned to content-based remote sensing image retrieval (CBRSIR). CBRSIR typically consists of two main parts: feature extraction and similarity measurement. Owing to rapid advances in artificial intelligence and computer hardware, deep features obtained through convolutional neural networks have become a substitute for traditional hand-crafted features. It has been shown that convolutional neural networks, whether pre-trained or fine-tuned, can greatly improve the retrieval performance of remote sensing images [8,9,10,11,12].
At present, most CBRSIR methods are based on a single label. For simple scenes, such as forests and chaparral, a single label is enough to obtain excellent retrieval results, but for complex scenes, such as sparse residential areas, a single label cannot distinguish the multiple categories within an image. Remote sensing images are usually composed of multiple categories; in this case, a single label, which represents only the most significant semantic content of an image, may ignore the complex and abundant information that image contains. As shown in Figure 1, where we select two images from the single-label dataset UCMD and the multi-label dataset MLRSD as examples, it is difficult to accurately describe complex remote sensing scenes using a single label. Therefore, in recent years, in order to overcome the limitations of single-label remote sensing image retrieval, efforts have been made towards multi-label retrieval of remote sensing images [13,14,15,16,17,18], and it has been shown that multi-label retrieval dramatically outperforms single-label retrieval.
However, in current DCNN-based remote sensing image retrieval architectures, the feature dimension of remote sensing images is usually very high [12]. The larger the dataset, the greater the memory cost and the longer the time required to search images. For example, 10 million images, each described by a 512-dimensional floating-point feature vector, require about 20 GB of memory, which is very unfavorable for massive remote sensing image retrieval. In addition, the time-consuming distance calculation between feature vectors makes it difficult to meet the real-time retrieval requirements of practical applications. Thus, achieving real-time retrieval of remote sensing images in the era of massive remote sensing data remains challenging.
Recently, hashing methods have become increasingly attractive for Approximate Nearest Neighbor (ANN) search problems [19]. By mapping high-dimensional image feature vectors to compact binary hash codes in Hamming space and computing Hamming distances with simple bit operations such as XOR, hashing methods greatly reduce both computation time and memory consumption. For example, the same 10 million images represented by 128-bit hash codes need only 160 MB of storage. Owing to their simple structure, low space cost, and flexibility to scale, hashing methods have been widely used for fast large-scale image retrieval.
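As a quick sanity check on these figures, a few lines of Python reproduce both storage estimates (a minimal sketch, assuming float32 features, as is typical):

```python
# Back-of-the-envelope storage comparison for 10 million images:
# 512-dim float32 feature vectors vs. 128-bit binary hash codes.
n_images = 10_000_000
feature_bytes = n_images * 512 * 4        # 4 bytes per float32 value
hash_bytes = n_images * 128 // 8          # 128 bits = 16 bytes per image

print(f"float features: {feature_bytes / 1e9:.1f} GB")  # ~20.5 GB
print(f"128-bit hashes: {hash_bytes / 1e6:.0f} MB")     # 160 MB
```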
To bridge the semantic gap that results from encoding high-dimensional data into binary bits, researchers have proposed strategies to preserve semantic similarity. For example, in the natural image field, Zhang et al. [20] used an optimized loss function to preserve semantic similarity. However, in the remote sensing field, almost all existing deep hashing networks are based on single-label learning, which cannot adapt to complex remote sensing scenes. In more detail, remote sensing images are usually composed of multiple land-cover categories, so a single label describing the most significant semantic content may ignore the complex and abundant information contained in the image. To tackle this dilemma, we introduce multi-label supervision into the deep hashing framework. Furthermore, we propose a pair-wise label similarity loss to fully exploit the multi-label semantic information, comprising a hard similarity loss represented by cross-entropy and a soft similarity loss measured by mean square error. Specifically, the hard similarity loss accounts for completely similar or dissimilar sample pairs, while the soft similarity loss considers partially similar sample pairs; together, they encourage the deep hashing model to preserve the semantic consistency between the input paired image samples. The purpose of our research is to improve the efficiency and reduce the storage of multi-label image retrieval without losing accuracy. The main contributions of this paper can be summarized as follows:
(1) We propose a semantic-preserving deep hashing model for multi-label remote sensing image retrieval. To the best of our knowledge, this is the first attempt to introduce hashing methods into multi-label remote sensing image retrieval.
(2) We propose a paired label similarity loss to make full use of multi-label semantic information, comprising a hard similarity loss represented by cross-entropy and a soft similarity loss measured by mean square error. Specifically, the hard similarity loss handles completely similar or dissimilar sample pairs, while the soft similarity loss handles partially similar sample pairs. Together, they encourage the deep hashing model to maintain semantic consistency between input paired image samples.
We conduct comparative experiments with other baseline models, including the IDHN model [21], the improved ISDH model [22], the DMSH model [23], the DHN model [24], and the DPSH model [25], to assess the effectiveness of our proposed model.
The remainder of this paper is organized as follows: Section 2 summarizes and analyzes related work on multi-label retrieval of natural images, multi-label retrieval of remote sensing images, and image retrieval based on deep hashing. Section 3 describes the system architecture of our model in detail, with emphasis on the design of the loss function based on the deep hashing network. Section 4 presents comparative experiments that demonstrate the superiority of our model. Section 5 presents discussions and conclusions, followed by directions toward which further efforts can be made.
3. A Semantic-Preserving Deep Hashing Model
In this section, we first introduce the basic idea of deep hashing-based multi-label image retrieval in Section 3.1; we then describe the system architecture and retrieval process of our multi-label image retrieval model in detail in Section 3.2, and focus on the design of the loss function in Section 3.3.
3.1. The Basic Idea of Deep Hashing-Based Multi-Label Image Retrieval
The basic idea of deep hashing-based multi-label image retrieval is to learn a feature mapping function using the similarity information of multi-label remote sensing images, mapping each input image to a hash code of length q. The obtained hash codes are then used to calculate Hamming distances that measure the similarity between remote sensing images. The image similarity supervision information required for deep hashing can be calculated as the cosine similarity between multi-label vectors. The cosine similarity between remote sensing image $x_i$ and remote sensing image $x_j$ is defined as follows:

$$s_{ij} = \frac{\langle l_i, l_j \rangle}{\|l_i\|_2 \, \|l_j\|_2} \quad (1)$$

In Equation (1), $l_i$ and $l_j$ refer to the label vectors of the multi-label remote sensing images $x_i$ and $x_j$, respectively; $\langle l_i, l_j \rangle$ refers to the inner product of the vectors $l_i$ and $l_j$, and its value is equal to $l_i^T l_j$; $\|\cdot\|_2$ represents the L2 norm. According to Equation (1), paired multi-label remote sensing images exhibit three kinds of similarity: completely similar ($s_{ij} = 1$), partially similar ($0 < s_{ij} < 1$), and dissimilar ($s_{ij} = 0$). For an ANN search task, the hash codes learned by deep hashing are required to preserve the similarity of the paired images; the similarity value of a pair of hash codes $b_i$ and $b_j$ is inversely proportional to their Hamming distance. That is to say, for a given pair of hash codes $b_i$ and $b_j$: if $s_{ij} = 1$, the Hamming distance between $b_i$ and $b_j$ should be as small as possible, preferably close to 0; if $s_{ij} = 0$, the Hamming distance should be as large as possible, preferably approaching $q$; and if $0 < s_{ij} < 1$, the Hamming distance should fall between the minimum distance and the maximum distance.
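To make Equation (1) concrete, the following minimal sketch computes the three kinds of pairwise similarity; the five-class label vectors are hypothetical examples, not taken from UCMD or MLRSD:

```python
import numpy as np

def label_similarity(li: np.ndarray, lj: np.ndarray) -> float:
    """Cosine similarity between two multi-label indicator vectors (Equation (1))."""
    return float(li @ lj / (np.linalg.norm(li) * np.linalg.norm(lj)))

# Hypothetical 5-class label vectors (e.g., [building, road, tree, water, bare soil])
a = np.array([1, 1, 0, 0, 0])
b = np.array([1, 1, 0, 0, 0])
c = np.array([1, 0, 1, 1, 0])
d = np.array([0, 0, 0, 1, 1])

print(label_similarity(a, b))  # 1.0   -> completely similar
print(label_similarity(a, c))  # ~0.41 -> partially similar
print(label_similarity(a, d))  # 0.0   -> dissimilar
```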
3.2. System Architecture of Deep Hashing Model
The overall system architecture of our model is shown in Figure 2. It is mainly composed of two parts, namely the deep feature extraction module and the hash learning module. While the deep feature extraction module is responsible for generating a high-level, abstract image representation through its multi-level architecture, the hash learning module is responsible for encoding each image as a binary sequence in the binary (Hamming) space. To preserve the similarity information of multi-label remote sensing image pairs and control hashing quality, we improve the pair-wise similarity loss and use a quantization loss to limit the output value range of the hash network.
The training process of our model is as follows: paired images are fed into the model; the high-dimensional features of the multi-label remote sensing images are extracted through multiple convolutional layers and two fully connected layers; and the output is fed to a hash layer, which is connected to both the fully connected layer FC1 and the fully connected layer FC2, to generate a hash code of length q. The image similarity is then used as supervision information to train the model in an end-to-end manner. In the retrieval process, each multi-label remote sensing image is encoded into a binary code, the distances between the binary code of the query image and those of the other images are calculated, and the ranked search results are returned.
We choose AlexNet as the feature extraction backbone network. We also use the VGG16 network to extract image features to demonstrate the scalability of our model. To make full use of the deep features extracted by the first fully connected layer, we additionally feed the output of the first fully connected layer into the hash layer.
Table 1 gives the main parameters of our deep hashing multi-label retrieval model, in which Conv_i represents the i-th convolutional layer, Maxpool_i represents the i-th pooling layer, and Fc_i represents the i-th fully connected layer. In our model, a hash layer with a q-bit output replaces the classification layer of the AlexNet network. In addition, in order to reduce the possible loss of semantic information and make full use of the deep features extracted by the first fully connected layer, the first fully connected layer Fc_6 is skip-connected to the hash layer Fc_8.
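The following is a minimal PyTorch sketch of this design under stated assumptions: the backbone is torchvision's pre-trained AlexNet, and the layer names (`fc6`, `fc7`, `hash_layer`) and the way the Fc_6 features are merged into the hash layer (concatenation before projection) are our illustrative choices, not necessarily the exact configuration in Table 1:

```python
import torch
import torch.nn as nn
from torchvision import models

class DeepHashNet(nn.Module):
    """AlexNet backbone with a q-bit hash layer; Fc_6 is skip-connected to the hash layer."""
    def __init__(self, q: int = 64):
        super().__init__()
        alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
        self.features = alexnet.features       # convolutional layers
        self.avgpool = alexnet.avgpool
        self.fc6 = alexnet.classifier[:3]       # dropout, 9216 -> 4096, ReLU
        self.fc7 = alexnet.classifier[3:6]      # dropout, 4096 -> 4096, ReLU
        # The hash layer replaces the classification layer; it sees both Fc_6 and Fc_7.
        self.hash_layer = nn.Linear(4096 + 4096, q)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.avgpool(self.features(x)).flatten(1)
        f6 = self.fc6(x)
        f7 = self.fc7(f6)
        h = self.hash_layer(torch.cat([f6, f7], dim=1))
        return nn.functional.softsign(h)        # outputs in (-1, 1); see Section 3.3

h = DeepHashNet(q=64)(torch.randn(2, 3, 224, 224))  # h.shape == (2, 64)
```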
To obtain the binary hash code, we use the sign function to quantize the unified features and transform the output of the deep hashing network. The formula for generating a binary hash code is as follows:

$$b_i^{(k)} = \begin{cases} 1, & h_i^{(k)} > 0 \\ 0, & \text{otherwise} \end{cases} \quad (2)$$

In Formula (2), when the k-th element of the network output vector $h_i$ is greater than 0, the k-th element of the binary code $b_i$ is 1; otherwise, it is 0. In this way, the deep hashing network can directly encode remote sensing images into the corresponding binary codes.
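The retrieval path can then be sketched in a few lines of NumPy (a minimal illustration; the function names and packed-bit layout are our own): Formula (2) binarizes the network outputs, and ranking uses the XOR-based Hamming distance described in Section 3.1:

```python
import numpy as np

def binarize(h: np.ndarray) -> np.ndarray:
    """Formula (2): threshold network outputs at 0 to get {0,1} hash codes."""
    return (h > 0).astype(np.uint8)

def hamming_rank(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    """Rank database items by Hamming distance, computed with XOR + popcount."""
    packed_q = np.packbits(query_code)              # q bits -> q/8 bytes
    packed_db = np.packbits(db_codes, axis=1)
    xor = np.bitwise_xor(packed_db, packed_q)       # differing bits per item
    dists = np.unpackbits(xor, axis=1).sum(axis=1)  # popcount per item
    return np.argsort(dists)                        # ascending distance = ranked results

rng = np.random.default_rng(0)
db = binarize(rng.standard_normal((1000, 64)))      # 1000 images, 64-bit codes
query = binarize(rng.standard_normal(64))
print(hamming_rank(query, db)[:10])                 # indices of the 10 nearest images
```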
3.3. Design of Loss Function
The design of the loss function directly affects the quality of the hash code. In our model, the loss function is composed of a paired similarity loss and a quantization loss.
(1) Paired similarity loss
Based on the three kinds of similarity, the similarity between image pairs can be divided into hard similarity ($s_{ij} = 1$ or $s_{ij} = 0$) and soft similarity ($0 < s_{ij} < 1$). That is, when image $x_i$ and image $x_j$ are completely similar or dissimilar ($s_{ij} = 1$ or $s_{ij} = 0$), the similarity of the image pair ($x_i$, $x_j$) belongs to hard similarity, and the cross-entropy loss is suitable for the loss calculation. Similarly, when image $x_i$ and image $x_j$ are partially similar ($0 < s_{ij} < 1$), the similarity of the image pair belongs to soft similarity, and the mean square error is used to represent the loss. In recent years, many deep hashing methods have defined similarity in a hard-assignment manner: if two images share at least one class label, the pairwise similarity is "1", and if they share no class label, it is "0". However, this definition of similarity does not reflect the similarity ranking of paired images containing multiple labels. To express the hard similarity and the soft similarity uniformly, we introduce the parameter $\alpha$ into the design of the loss function, where $\alpha = 1$ selects the cross-entropy loss and $\alpha = 0$ selects the mean square error loss. Inspired by [39], in order to preserve the semantic similarity of the paired multi-label remote sensing images in the Hamming space, we construct an inner product expression $\Theta_{ij} = b_i^T b_j$ based on the hash codes and apply it to the pairwise similarity term of the loss function of the deep hashing model. Since the value range of $\Theta_{ij}$ is $2q$ times the value range of $s_{ij}$, we introduce a hyperparameter $\gamma$ to balance the two ranges. In addition, we use the similarity $s_{ij}$ of the paired images to weight the paired similarity loss through a weight $w_{ij}$, where $\eta$ is the adjustment parameter. Therefore, our paired similarity loss is calculated as follows:

$$L_p = \sum_{i,j} w_{ij} \left[ \alpha \left( \log \left( 1 + e^{\gamma \Theta_{ij}} \right) - \gamma s_{ij} \Theta_{ij} \right) + \left( 1 - \alpha \right) \left( \sigma \left( \gamma \Theta_{ij} \right) - s_{ij} \right)^2 \right] \quad (3)$$

In Formula (3), $\gamma$ is a hyperparameter that controls the range of the inner product value fed into the sigmoid function $\sigma(\cdot)$ and prevents the gradient-vanishing effect caused by the derivative of the sigmoid function becoming too small when its input is too large. We take $\gamma = \frac{1}{2q}$, that is, $\gamma \Theta_{ij} = \frac{b_i^T b_j}{2q}$.
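To illustrate Formula (3) as reconstructed above, here is a minimal PyTorch sketch under stated assumptions, not the exact reference implementation: the relaxed network outputs $h$ stand in for the binary codes (as formalized in the quantization part below), and the weight $w_{ij} = 1 + \eta s_{ij}$ is our own placeholder for the paper's similarity weighting:

```python
import torch

def pairwise_similarity_loss(hi, hj, s, q, eta=0.5):
    """Formula (3): cross-entropy for hard pairs, MSE for soft pairs.

    hi, hj: (N, q) relaxed hash outputs in (-1, 1); s: (N,) label cosine similarity.
    """
    gamma = 1.0 / (2 * q)                      # balances the 2q range gap
    theta = (hi * hj).sum(dim=1)               # inner product, in (-q, q)
    alpha = ((s == 0) | (s == 1)).float()      # 1 for hard pairs, 0 for soft pairs
    w = 1.0 + eta * s                          # illustrative similarity weighting
    ce = torch.log1p(torch.exp(gamma * theta)) - gamma * s * theta
    mse = (torch.sigmoid(gamma * theta) - s) ** 2
    return (w * (alpha * ce + (1 - alpha) * mse)).mean()
```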
(2) Quantization loss
For the binary representation of hash codes, most existing hashing algorithms use an activation function such as sigmoid or tanh to restrict the output value range of the hash layer, together with a quantization loss function that constrains the network output values to be near the discrete values $\pm 1$. However, the sigmoid/tanh activation functions are saturating: the closer the activation output value is to $\pm 1$, the smaller the gradient passed to the bottom layers during model training, which makes gradient vanishing more likely. We instead use a more relaxed activation function, the softsign function $f(x) = \frac{x}{1 + |x|}$, to limit the output of the hash layer to the range (−1, 1). We replace the binary code $b_i$ with the network output value $h_i$ and employ a quantization loss to constrain each element of $h_i$ to lie near the discrete values $\pm 1$. Therefore, $\Theta_{ij}$ is redefined as $\Theta_{ij} = h_i^T h_j$, and the paired quantization loss $L_q$ can be defined as:

$$L_q = \left\| \, |h_i| - \mathbf{1} \, \right\|_1 + \left\| \, |h_j| - \mathbf{1} \, \right\|_1 \quad (4)$$

Here, $\|\cdot\|_1$ is the L1 norm, $|\cdot|$ is the element-wise absolute value operation, and $\mathbf{1}$ is a vector of ones with the same dimension as $h_i$. The total loss function of the deep hashing model can be defined as:

$$L = L_p + \lambda L_q \quad (5)$$

where $\lambda$ is a weighting parameter that balances the paired similarity loss and the quantization loss.
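Continuing the sketch, Formula (4) and the total objective of Formula (5) can be written as follows, reusing the `pairwise_similarity_loss` function from the previous block; the default value of `lam` is an arbitrary illustrative choice, not a value from the paper:

```python
import torch

def quantization_loss(hi: torch.Tensor, hj: torch.Tensor) -> torch.Tensor:
    """Formula (4): pull every element of the relaxed codes toward -1 or +1."""
    ones = torch.ones_like(hi)
    return ((hi.abs() - ones).abs().sum(dim=1) +
            (hj.abs() - ones).abs().sum(dim=1)).mean()

def total_loss(hi, hj, s, q, lam=0.1):
    """Formula (5): L = L_p + lambda * L_q (lam is an assumed value)."""
    return pairwise_similarity_loss(hi, hj, s, q) + lam * quantization_loss(hi, hj)
```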
In a nutshell, our model extracts image features through a convolutional neural network and generates binary codes through a hash layer; finally, a newly designed loss function better preserves, during hash learning, the multi-label semantic information contained in complex remote sensing scenes, so as to improve retrieval performance.
5. Conclusions and Prospects
In this paper, we propose a semantic-preserving deep hashing model for multi-label remote sensing image retrieval. Deep learning and hash learning are integrated to improve efficiency and reduce storage in complex remote sensing image retrieval without losing accuracy, which is of critical importance in the era of big data.
In our model, we first use convolutional neural networks to extract image features, and then skip-connect the first fully connected layer to the hash layer to fully mine the multi-label semantic information contained in remote sensing images. The experimental results demonstrate the effectiveness and superiority of our model. Our attempt to introduce hash learning into multi-label retrieval of remote sensing images is conducive to real-time applications such as environmental monitoring, emergency response, and many other fields.
Our future work will be directed to applying graph neural networks in our hashing model to mine the semantic relationships among the multiple labels of remote sensing images, so as to further improve the performance of multi-label remote sensing image retrieval.