1. Introduction
With the growth of multimodal data on the web, cross-modal retrieval has become an essential technique for visual and language understanding. Because of its high retrieval speed and low storage cost, hashing has become a practical real-time processing method for complex multi-modal scenes, giving machines cognitive and understanding capabilities. Hashing maps high-dimensional data to a low-dimensional space while preserving the similarity of the original high-dimensional data in that space. Hashing algorithms include single-modal hashing algorithms [1,2,3,4] and cross-modal hashing algorithms [5,6]. Compared with single-modal hashing, cross-modal retrieval focuses on mutual retrieval tasks between different modalities.
Cross-modal retrieval tasks include real-valued cross-modal retrieval and cross-modal hashing retrieval. Real-valued cross-modal retrieval methods are either classification-based or embedding-based; typical representative works include the stacked cross attention network (SCAN) [7] and the saliency-guided attention network (SAN) [8]. For the image and text modalities, image-text retrieval mainly addresses the matching problem between images and texts, exploring the correspondence between image regions and sentence words. In order to further capture modal semantic information, reduce the amount of computation and accelerate retrieval, deep hashing has become the mainstream approach for image-text retrieval, measuring modal semantic similarity in the Hamming space [9,10,11].
Figure 1 shows the overall architecture of cross-modal retrieval.
Although cross-modal hashing can capture modal semantics quickly and efficiently, modal semantic alignment and semantic reconstruction still face great challenges. In unsupervised cross-modal hashing, Su et al. [12] proposed a joint-semantics reconstructing framework to mine modal semantic information. Considering semantic alignment, Yang et al. [13] proposed a deep semantic-alignment hashing algorithm to explore the inner connection of modal semantics. In terms of the network model, Hu et al. [14] proposed unsupervised knowledge distillation to create something from nothing. Bai et al. [15] proposed an adversarial loss to associate the semantics of different modalities. Gu et al. [16] proposed an adversary-guided attention module to enhance feature learning. Shen et al. [17] proposed auto-encoding twin-bottleneck hashing, whose main goal is to generate similarity graphs based on a graph convolutional network (GCN). These studies have improved the performance of cross-modal hash retrieval to a certain extent. Researchers have also found that second-order statistics can speed up network convergence compared with stochastic gradient descent (SGD) [18], but such studies are still at an early stage and have not yet been explored in cross-modal hashing. In addition, modal semantic reconstruction remains a difficulty for cross-modal hashing.
Encouraged by unsupervised cross-modal hashing algorithms [12,13,14,15,16,17], global covariance pooling [18,19,20] and the strip pooling network module [21], we propose a novel deep semantic-preserving reconstruction hashing (DSPRH). The algorithm combines spatial and channel semantic information, and mines the associated modal semantics based on adaptive auto-encoding and a joint semantic reconstruction loss. The main contributions are summarized as follows: (1) We introduce a new spatial pooling network module, based on tensor canonical polyadic (rank-1) decomposition theory, that generates rank-1 tensors to capture high-order context semantics; it assists the backbone network in capturing important contextual modal semantic information. (2) From an optimization perspective, we use global covariance pooling to capture channel semantic information and accelerate network convergence. In the feature reconstruction layer, we use two-bottleneck auto-encoding to achieve visual-text modal interaction. (3) In metric learning, we design a new semantic-preserving reconstruction loss (combining semantic alignment and a cosine triplet loss) to optimize the model parameters, which preserves the correlation between the image and text modalities. (4) The DSPRH algorithm is tested on two datasets (MIRFlickr-25K and NUS-WIDE). The experimental results show that DSPRH achieves better performance on image2text and text2image retrieval tasks.
The remainder of this paper is organized as follows: Section 2 reviews related work; Section 3 describes the proposed deep semantic-preserving reconstruction hashing method; Section 4 presents the experimental results and analysis; Section 5 summarizes this work and outlines future research plans.
2. Related Work
Image-text retrieval provides the technical support for building a new generation of cross-media search and monitoring systems, and it is a core technology of cross-media analysis and reasoning, intelligent cognition and understanding. Cross-modal retrieval has become ubiquitous in search engines and entertainment tools that facilitate daily production and life; applications such as Taobao, Jingdong and Kuaishou already support mutual search between images and texts. Among massive multi-modal heterogeneous data, quickly and accurately finding data of different modalities has become a research hotspot and a difficulty. The goal of cross-modal hash retrieval is to use less storage space while retrieving modal information quickly. Image-text retrieval is an important component of current question answering systems and video/image description, and it is also an important field of machine understanding. In this section, we introduce deep cross-modal hashing and the key modules used in this work.
2.1. Cross-Modal Hashing
With the rapid increase of network data, deep cross-modal retrieval has become a research hotspot in recent years, and many researchers have worked on loss function design and network model construction. Specifically, Li et al. proposed self-supervised adversarial hashing [5] to discover high-level semantic information. Wu et al. proposed unsupervised deep cross-modal hashing [6] to integrate deep learning and matrix factorization. Zhang et al. proposed unsupervised generative adversarial cross-modal hashing [9] to perform unsupervised representation learning. Liu et al. proposed matrix tri-factorization hashing [10] to explore effective objective functions. Wang et al. proposed semantic-rebased cross-modal hashing (SRCH) [11] to achieve unsupervised learning. From the perspective of semantic reconstruction and alignment, joint semantic reconstruction and semantic alignment have received extensive attention in retrieval [12,13,14,15,16]. Hu et al. proposed collective reconstructive embeddings [22] to deal with the heterogeneity challenge. Chen et al. proposed a scalable discrete matrix factorization hashing framework [23]. Some studies [24,25] investigate cross-modal retrieval network models, whose main goal is to obtain robust modal semantic descriptions, while others [26,27,28,29,30] improve the loss function to maintain semantic relevance. Nie et al. [31] proposed deep multiscale fusion hashing to use modal information effectively. Lin et al. [32] proposed mask cross-modal hashing to capture inter-modal semantic information. Zhang et al. [33] proposed a multi-pathway unsupervised hashing method based on a generative adversarial network (GAN) to exploit multi-modal information. In addition, Zhang et al. [34] proposed semi-supervised GAN hashing to mine image-modal and text-modal information. Although cross-modal retrieval still faces great challenges, these studies have greatly advanced cross-modal hashing technology.
2.2. Second-Order Covariance Pooling
In recent years, researchers have found that second-order optimization offers better network optimization than SGD. Li et al. [20] improved the performance of large-scale visual recognition based on matrix power normalization. iSQRT-COV [19] was proposed to accelerate network convergence. To deal with three-dimensional tensors, Gao et al. proposed the GSoP-Net block [35], a network module that is easy to embed. Wang et al. [18] discussed the benefits of global covariance pooling (GCP) to the network from the perspectives of loss optimization and gradient prediction. These studies show that second-order statistical optimization can enhance the feature expression ability of convolutional neural networks; specifically, with a fixed backbone network, global covariance pooling has better feature description capability than global average pooling. In addition, Wang et al. [36] gave a better feature representation method based on GCP. Research has found that second-order pooling (such as GCP) can accelerate network convergence and improve overall network performance compared with first-order pooling (such as global average pooling (GAP)).
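To make the GCP idea concrete, the following is a minimal Python/PyTorch sketch of covariance pooling with an iSQRT-COV-style Newton-Schulz iteration for the matrix square root; the function name, the number of iterations and the use of the upper-triangular part as the pooled descriptor are our illustrative choices, not the exact configuration of the cited implementations [18,19].

```python
import torch

def covariance_pool(x, iters=5, eps=1e-5):
    """Global covariance pooling with an iSQRT-COV-style Newton-Schulz
    iteration for the matrix square root. x: feature map (B, C, H, W)."""
    B, C, H, W = x.shape
    n = H * W
    feat = x.reshape(B, C, n)
    feat = feat - feat.mean(dim=2, keepdim=True)
    cov = feat.bmm(feat.transpose(1, 2)) / n           # (B, C, C) covariance
    eye = torch.eye(C, device=x.device).unsqueeze(0).repeat(B, 1, 1)
    trace = cov.diagonal(dim1=1, dim2=2).sum(dim=1).view(B, 1, 1) + eps
    A = cov / trace                                    # pre-normalize by the trace
    Y, Z = A, eye.clone()
    for _ in range(iters):                             # coupled Newton-Schulz iteration
        T = 0.5 * (3.0 * eye - Z.bmm(Y))
        Y, Z = Y.bmm(T), T.bmm(Z)
    sqrt_cov = Y * trace.sqrt()                        # compensate the trace normalization
    idx = torch.triu_indices(C, C)                     # keep the upper triangle
    return sqrt_cov[:, idx[0], idx[1]]                 # (B, C*(C+1)/2)
```

In Section 3.2, a descriptor of this kind is followed by a fully connected layer to produce the image features.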
In summary, loss function design and model optimization are currently the main techniques for cross-modal retrieval. Encouraged by DJSRH [12] and DSAH [13], our main goal is to explore a new metric learning method for cross-modal retrieval tasks. Specifically, the core idea of DJSRH [12], published at ICCV 2019, is to reconstruct a joint semantic loss that associates the two modal feature spaces (image and text). Compared with DJSRH [12], the DSAH [13] algorithm, proposed at ICMR 2020, adds a semantic alignment loss on top of the joint semantic reconstruction loss; DSAH [13] comprehensively utilizes feature-space and Hamming-space semantic information, so its retrieval performance is better than that of DJSRH. Encouraged by the DJSRH [12] and DSAH [13] algorithms, we propose the DSPRH algorithm. We use the idea of feature fusion to reconstruct the image feature space, and propose a new cross-modal retrieval loss function that jointly considers modal semantic alignment loss, pairwise loss and cosine triplet loss.
3. Our Method
In current research, researchers are focusing on accelerating network convergence and reconstructing cross-modal hash information. The goal of these works is to enhance feature expression for visual-language understanding tasks. Different from current cross-modal hashing research, our main contributions are as follows: (1) We introduce a new spatial pooling module to assist the backbone network AlexNet in capturing important contextual semantic information. (2) In terms of network training and feature extraction, we use GCP to accelerate network convergence and to capture channel semantic information; in addition, we use two-bottleneck auto-encoding to achieve visual-text modal interaction. (3) In metric learning, we propose a new semantic-preserving reconstruction loss, which preserves important modal semantic information.
3.1. The Overall Architecture of the Algorithm
Figure 2 shows the architecture of DSPRH. Specifically, Figure 2 contains three main steps. DSPRH uses ImgNet to extract the image feature space and TxtNet to extract the text feature space, where ImgNet uses GCP-based optimization to accelerate network convergence. In addition, ImgNet extracts important spatial modal features with SPNet and important channel modal features with GCP. To further reconstruct the modal semantic information, we use a two-bottleneck module to reconstruct the text feature space and the image feature space, which is a self-encoding process. Finally, we propose a new semantic-preserving reconstruction loss.
This work differs from existing unsupervised cross-modal hashing work. Specifically, we introduce a new reconstructed feature space and use an alignment loss to reconstruct modal information. In addition, we use a pairwise loss and a cosine triplet loss to preserve the semantic information of the image and text modalities. The experiments show the effectiveness of the semantic-preserving reconstruction loss.
In Figure 2, a paired sample enters TxtNet to obtain the text feature space and enters ImgNet to obtain the image feature space. Visual-language feature extraction thus consists of the image feature space and the text feature space, and it is detailed in Section 3.2.
In Section 3.2, we mainly introduce visual-language feature extraction. We propose a feature fusion strategy that combines three types of features. The fully connected layer does not need to pass through the iSQRT-COV layer to obtain an image feature description, whereas the output of the convolutional layer is mapped through the iSQRT-COV layer to obtain high-level image semantics. iSQRT-COV (the iSQRT-COV meta-layer) was proposed [19] to accelerate network convergence; it works with convolutional networks to obtain a high-level semantic image description, and its output form is consistent with that of the fully connected layer. The three types of feature outputs are weighted and fused in what we name the fusion layer; in this way, we reconstruct the image feature space. The image-text feature extraction process is shown in Figure 3. In Figure 3, the SPNet block [21] is a low-rank convolution-kernel expression; it performs a strip pooling operation to reconstruct the semantic feature map and was proposed at CVPR 2020.
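As a reference point, the following is a minimal sketch of a strip-pooling block in the spirit of [21]; the channel counts, kernel sizes and sigmoid gating are illustrative assumptions rather than the exact SPNet block configuration used in DSPRH.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StripPooling(nn.Module):
    """Minimal strip-pooling sketch: rank-1 horizontal/vertical pooled
    strips are expanded back to the full map and gate the input."""
    def __init__(self, channels):
        super().__init__()
        # 1-D convolutions applied along each strip
        self.conv_v = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0), bias=False)
        self.conv_h = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1), bias=False)
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        b, c, h, w = x.shape
        col = F.adaptive_avg_pool2d(x, (h, 1))        # column strip: (B, C, H, 1)
        row = F.adaptive_avg_pool2d(x, (1, w))        # row strip:    (B, C, 1, W)
        col = self.conv_v(col).expand(-1, -1, h, w)   # broadcast back to H x W
        row = self.conv_h(row).expand(-1, -1, h, w)
        gate = torch.sigmoid(self.fuse(F.relu(col + row)))
        return x * gate                               # re-weighted feature map
```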
DSPRH mainly includes three technologies: (1) ImgNet's backbone network is AlexNet, and we use strip pooling and GCP optimization to extract robust image descriptions. (2) After ImgNet and TxtNet extract visual and language features, respectively, the two-bottleneck coding model is used to reconstruct text features and image features. (3) We propose a new loss function, the semantic-preserving reconstruction loss, which includes an alignment loss, a pairwise loss and a cosine triplet loss. Our experiments show that this modal semantic reconstruction can significantly improve the performance of unsupervised cross-modal hashing.
3.2. Visual-Language Feature Space
Visual-language feature extraction is one of the key elements of DSPRH in Figure 2, and the visual-text feature extraction steps are detailed in Figure 3. For a fair comparison, we use AlexNet as the backbone network for visual feature extraction. Visual feature extraction consists of three parts. Based on the pre-trained AlexNet model, we extract the fc7 feature f3. We feed the conv5 feature into the iSQRT-COV meta-layer [19] to capture the semantic information between channels, and the output of the subsequent fully connected layer is f2. In order to quickly capture spatial modal information, we input the conv5 features into the SPNet block [21] and then cascade the iSQRT-COV meta-layer [18,19,35,36] and a fully connected layer fc to extract the image feature f1. The weights of f1, f2 and f3 in the image feature fusion layer are w1, w2 and w3, respectively. The text feature space is constructed by TxtNet. To keep text feature extraction fair, we keep the TxtNet parameter settings consistent with most existing works [11,12,13,15,16].
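Putting the pieces together, the following sketch shows how the three features and the weighted fusion layer of Figure 3 could be assembled in PyTorch; it reuses the `covariance_pool` and `StripPooling` sketches given earlier, and the layer dimensions, default weights, and module names are illustrative assumptions rather than the exact ImgNet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class FusionImgNet(nn.Module):
    """Sketch of the image branch in Figure 3: f3 from AlexNet fc7,
    f2 from conv5 -> covariance pooling -> fc, f1 from conv5 -> strip
    pooling -> covariance pooling -> fc, fused with weights w1, w2, w3."""
    def __init__(self, out_dim=4096, w=(0.25, 0.25, 0.5)):
        super().__init__()
        alexnet = models.alexnet(pretrained=True)
        self.conv = alexnet.features                        # conv1-conv5
        self.fc7 = nn.Sequential(*list(alexnet.classifier.children())[:-1])
        self.strip = StripPooling(256)                       # conv5 has 256 channels
        cov_dim = 256 * 257 // 2                             # upper-triangular GCP descriptor
        self.fc_f1 = nn.Linear(cov_dim, out_dim)
        self.fc_f2 = nn.Linear(cov_dim, out_dim)
        self.w1, self.w2, self.w3 = w

    def forward(self, x):
        conv5 = self.conv(x)                                 # (B, 256, 6, 6) for 224x224 input
        f3 = self.fc7(torch.flatten(F.adaptive_avg_pool2d(conv5, (6, 6)), 1))
        f2 = self.fc_f2(covariance_pool(conv5))
        f1 = self.fc_f1(covariance_pool(self.strip(conv5)))
        # weighted fusion layer: the reconstructed image feature space
        return self.w1 * f1 + self.w2 * f2 + self.w3 * f3
```

The default weights here follow the MIRFlickr-25K setting reported in Section 4.3.2; on NUS-WIDE the weights differ.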
3.3. Feature Reconstruction Layer
In Figure 2, the dimension of the image feature space and the dimension of the text feature space are inconsistent, and the dimension of the hash code is much smaller than that of either feature space. In order to align image-text semantics, inspired by twin-bottleneck hashing [17], we introduce an image feature reconstruction layer after the image feature space and a text feature reconstruction layer after the text feature space. The feature reconstruction layer in Figure 2 therefore includes an image branch and a text branch: the image branch consists of an image encoder, image coding and an image decoder, and the text branch consists of a text encoder, text coding and a text decoder.

In Figure 2, through the feature reconstruction layer, we align image and text feature semantics as well as image and text hash-code semantics; this motivates the alignment loss. Considering that cross-modal hashing outputs both an image hash code and a text hash code, we propose a pairwise loss to optimize the DSPRH model parameters. Similarly, inspired by the triplet loss, we propose a cosine triplet loss to further optimize the network parameters. Based on the above considerations, we propose the semantic-preserving reconstruction loss, a new metric learning method that considers both semantic alignment and modal semantic association.
The definitions of the basic symbols are shown in Table 1. Assuming we have $M$ semantic pairs, we define them as $\mathcal{O} = \{(x_i, t_i)\}_{i=1}^{M}$, where the feature of image $x_i$ is $F_i^I$ and the feature of text $t_i$ is $F_i^T$; the subscript $i$ indexes the $i$-th image-text pair.
We define the encoder, coding and decoder functions as Enc, Cod and Dec, respectively. In the image feature reconstruction layer, the image feature space $F^I$ passes through Enc, Cod and Dec in turn, where $H_I$ denotes the output of the image coding layer, $B_I$ the image hash code and $\hat{F}^T$ the reconstructed text feature space; $\theta_E^I$ and $\theta_D^I$ are the encoder-layer and decoder-layer parameters, respectively. In the same way, we obtain the text feature encoding $H_T$, the text hash code $B_T$ and the reconstructed image feature space $\hat{F}^I$, where $\theta_E^T$ and $\theta_D^T$ are the text encoder-layer and text decoder-layer parameters, respectively.
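The following is a minimal sketch of one feature reconstruction branch (encoder, coding, decoder); the layer sizes, the tanh relaxation and the sign binarization are common choices we assume here rather than the exact DSPRH configuration.

```python
import torch
import torch.nn as nn

class BottleneckCoder(nn.Module):
    """One branch of the two-bottleneck feature reconstruction layer.
    `code_dim` is the hash code length; `out_dim` is the dimension of the
    *other* modality's feature space, since each decoder reconstructs the
    opposite modality."""
    def __init__(self, in_dim, code_dim, out_dim, hidden=1024):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.cod = nn.Linear(hidden, code_dim)
        self.dec = nn.Sequential(nn.Linear(code_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, feat):
        z = self.enc(feat)              # encoder output
        h = torch.tanh(self.cod(z))     # relaxed (continuous) hash code
        recon = self.dec(h)             # reconstructed cross-modal feature
        b = torch.sign(h)               # binary code used at retrieval time
        return h, b, recon

# Example sizes for MIRFlickr-25K (4096-d image features, 1386-d BOW text
# features, 64-bit codes): the image branch reconstructs text features
# and the text branch reconstructs image features.
img_coder = BottleneckCoder(in_dim=4096, code_dim=64, out_dim=1386)
txt_coder = BottleneckCoder(in_dim=1386, code_dim=64, out_dim=4096)
```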
3.4. Semantic-Preserving Reconstruction Loss
This loss includes an alignment (reconstruction) loss, a pairwise loss and a cosine triplet loss.
3.4.1. The Entire Alignment Loss
The reconstruction loss $\mathcal{L}_{rec}$ measures how well the decoders recover the original feature spaces from the codes. Based on the tendency of the inter-modal similarity matrix toward a diagonal structure, a ranking loss is further proposed to effectively associate the modal semantic information: $\mathcal{L}_{intra}$ represents the semantic alignment loss within a modality, $\mathcal{L}_{inter}$ represents the semantic alignment loss between the modalities, $S_I$ and $S_T$ are the similarity matrices based on the image features $F^I$ and the text features $F^T$ (where $I$ denotes the image modality and $T$ the text modality), and $\mathcal{L}_{rank}$ is the sort (ranking) alignment loss.

In order to further correlate the modal semantics, the fused feature similarity matrix $S$ combines the two modal similarities, where $\lambda$ is a hyperparameter weighting the combination, $S_I$ is the spatial similarity of the image semantic features and $S_T$ is the spatial similarity of the text semantic features.

The semantic alignment loss between the feature space and the hash code space, $\mathcal{L}_{align}$, is introduced to reduce the semantic gap, where $S_{B_I}$ and $S_{B_T}$ are the similarity matrices based on the image hash codes $B_I$ and the text hash codes $B_T$, and $u$ is a hyperparameter that balances the importance of the different modalities.

The entire alignment loss $\mathcal{L}_{A}$ of DSPRH combines the reconstruction, ranking and semantic alignment terms above.
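The following sketch illustrates one plausible form of the alignment/reconstruction term, following the DJSRH/DSAH idea of fusing the two feature-space similarity matrices and pulling the hash-code similarities toward the fused matrix; the specific terms, the mean-squared-error form and the function names are assumptions, with only $\lambda = 0.9$ taken from the paper.

```python
import torch
import torch.nn.functional as F

def cos_sim(x):
    """Pairwise cosine similarity matrix of a batch of vectors."""
    x = F.normalize(x, dim=1)
    return x @ x.t()                                   # (B, B)

def alignment_loss(f_img, f_txt, h_img, h_txt, recon_txt, recon_img, lam=0.9):
    """Sketch of the alignment/reconstruction term: fuse the two
    feature-space similarity matrices and align the hash-code
    similarities with the fused matrix, plus feature reconstruction."""
    s_fused = lam * cos_sim(f_img) + (1.0 - lam) * cos_sim(f_txt)
    s_hi, s_ht = cos_sim(h_img), cos_sim(h_txt)
    intra = F.mse_loss(s_hi, s_fused) + F.mse_loss(s_ht, s_fused)
    inter = F.mse_loss(F.normalize(h_img, dim=1) @ F.normalize(h_txt, dim=1).t(),
                       s_fused)
    recon = F.mse_loss(recon_txt, f_txt) + F.mse_loss(recon_img, f_img)
    return intra + inter + recon
```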
3.4.2. Cosine Triplet Loss
The cosine triplet loss nests the cosine distance inside a triplet loss. For the image-to-text direction, the anchor is an image hash code $B_i^I$, $B_i^{T+}$ is a text hash code related to $B_i^I$, and $B_i^{T-}$ is a text hash code not related to $B_i^I$. The image-to-text cosine triplet loss $\mathcal{L}_{C}^{I \to T}$ requires the cosine distance between $B_i^I$ and $B_i^{T+}$ to be smaller than the cosine distance between $B_i^I$ and $B_i^{T-}$, where $d_{cos}(\cdot,\cdot)$ denotes the cosine distance metric. Symmetrically, $\mathcal{L}_{C}^{T \to I}$ is the cosine triplet loss for the text-to-image direction, in which the anchor is a text hash code $B_i^T$, $B_i^{I+}$ is an image hash code related to $B_i^T$ and $B_i^{I-}$ is an image hash code not related to $B_i^T$. The entire cosine triplet loss $\mathcal{L}_{C}$ of DSPRH combines the two directional terms.
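A minimal sketch of the cosine triplet term is shown below; the margin value and the hinge (max with zero) form are assumptions, since only the nesting of cosine distance inside a triplet loss is stated in the paper.

```python
import torch.nn.functional as F

def cosine_triplet_loss(anchor, positive, negative, margin=0.5):
    """Push the cosine distance to the related cross-modal code below the
    distance to an unrelated code by at least `margin` (illustrative)."""
    d_pos = 1.0 - F.cosine_similarity(anchor, positive, dim=1)
    d_neg = 1.0 - F.cosine_similarity(anchor, negative, dim=1)
    return F.relu(d_pos - d_neg + margin).mean()

# Image -> text direction: the anchor is an image code and the positive/
# negative are text codes; the text -> image direction swaps the roles,
# and the two directional terms are combined into the full cosine
# triplet loss.
```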
3.4.3. Pairwise Loss
We define the pairwise loss of DSPRH, whose goal is to preserve modal semantic information. For a semantic pair, the image hash code is $B_i^I$, the text hash code is $B_j^T$ and the semantic similarity matrix is $S^{*}$, where $S^{*}_{ij}$ is the similarity between the $i$-th image and the $j$-th text. Based on the maximum likelihood criterion, we define the cross-modal pairwise loss $\mathcal{L}_{P}$ of DSPRH from the likelihood of the pairwise similarities given the image and text codes.
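The following sketch shows a likelihood-based pairwise term of the kind implied by the maximum likelihood criterion; the exact definition of the inner-product term and the construction of the similarity matrix S in the unsupervised setting are assumptions on our part.

```python
import torch.nn.functional as F

def pairwise_loss(h_img, h_txt, S):
    """Negative log-likelihood of pairwise similarities: S is the (assumed)
    semantic similarity matrix and theta is half the inner product of the
    relaxed image and text codes."""
    theta = 0.5 * h_img @ h_txt.t()                    # (B, B)
    # softplus(theta) = log(1 + exp(theta)), a numerically stable form
    return (F.softplus(theta) - S * theta).mean()
```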
3.4.4. Semantic-Preserving Reconstruction Loss (Total Loss of DSPRH)
Based on the alignment loss, pairwise loss and cosine triplet loss, the total loss of DSPRH is defined as the weighted combination of the three terms, where $\alpha$, $\beta$ and $\gamma$ are three hyperparameters that reflect the importance of the three losses to network optimization; $\mathcal{L}_{A}$ is the alignment and reconstruction loss of DSPRH, $\mathcal{L}_{C}$ is the cosine triplet loss of DSPRH and $\mathcal{L}_{P}$ is the cross-modal pairwise loss of DSPRH.
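A small sketch of the weighted combination is given below; the assignment of α, β and γ to the alignment, cosine triplet and pairwise terms follows the listing order above and is an assumption, while the default values match Section 4.3.3.

```python
def dsprh_loss(l_align, l_triplet, l_pair, alpha=1.0, beta=0.03, gamma=1.0):
    # Assumed mapping: alpha weights the alignment term, beta the cosine
    # triplet term and gamma the pairwise term.
    return alpha * l_align + beta * l_triplet + gamma * l_pair
```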
4. Experiments
In this section, we test the performance of DSPRH on two public datasets (MIRFlickr-25K and NUS-WIDE) and compare it with several state-of-the-art methods.
4.1. Datasets, Evaluation Indicators and Baselines
Datasets introduction: We evaluate the DSPRH algorithm on two datasets, MIRFlickr-25K and NUS-WIDE. Specifically, MIRFlickr-25K contains 25,000 image-text semantic pairs, each annotated with labels from 24 categories. In our experiments we used 20,015 semantic pairs, following the settings of current cross-modal hash retrieval algorithms [12,13,14,15,16]; among them, 2000 semantic pairs were used for queries.

The other dataset is NUS-WIDE, which contains 269,498 semantic pairs defined over 81 categories. During the experiments, 186,577 semantic pairs belonging to the 10 most frequent annotations were used, again following the standard settings of most cross-modal hashing algorithms [12,13,14,15,16]; among them, 2000 semantic pairs were used for queries.
Experimental details: For a fair comparison, our backbone network is AlexNet, which is the common setting in current cross-modal hashing [6,12,13,29]. The AlexNet used by DSPRH is a model pre-trained on ImageNet. We additionally use ResNet34 as a backbone network mainly to illustrate the scalability of our algorithm; this is only a supplementary experiment for DSPRH. On MIRFlickr-25K, we extract 4096-dimensional image features and 1386-dimensional bag-of-words (BOW) features for each text. On NUS-WIDE, we extract 4096-dimensional features for each image and 500-dimensional BOW features for each text. λ is a hyperparameter in Equation (12), and λ = 0.9 follows prior work [12]. In Equations (15) and (16), the hyperparameter ν is equal to 0.001. In Equation (19), the weight setting of the three loss functions (α, β and γ) is described in the ablation analysis section.
Evaluation indicators: The mean average precision (MAP) is a general index for measuring the performance of cross-modal retrieval algorithms; it reflects the mean of the average retrieval precision over all query samples. In this article, cross-modal retrieval involves six evaluation indicators: Image2text MAP@50, Image2text MAP, Text2image MAP@50, Text2image MAP, MAP@50 and MAP.
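For reference, the following sketch computes MAP and MAP@K from binary codes; the ±1 code convention, the relevance definition (sharing at least one label) and the handling of queries with no relevant results are assumptions about the evaluation protocol.

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, relevance, top_k=None):
    """Rank the database by Hamming distance for each query and average
    the precision at every relevant position. `relevance[i, j] = 1` when
    query i and database item j share at least one label (assumed)."""
    aps = []
    for i in range(query_codes.shape[0]):
        # Hamming distance for +/-1 codes
        dist = 0.5 * (db_codes.shape[1] - db_codes @ query_codes[i])
        order = np.argsort(dist)
        rel = relevance[i, order]
        if top_k is not None:
            rel = rel[:top_k]                 # MAP@K keeps only the top K results
        hits = np.cumsum(rel)
        positions = np.arange(1, len(rel) + 1)
        if hits[-1] == 0:
            aps.append(0.0)
            continue
        aps.append(np.sum((hits / positions) * rel) / hits[-1])
    return float(np.mean(aps))

# MAP uses the full ranking; MAP@50 passes top_k=50.
```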
Baselines: This paper selects 8 algorithms for MAP comparison and 17 algorithms for MAP@50 comparison to evaluate performance. To test the effectiveness of DSPRH, we re-ran DJSRH [12] and DSAH [13], because the idea of DSPRH originated from these two methods. Specifically, the baselines are as follows: UKD [14] (CVPR 2020), SRCH [11] (IJCAI 2020), MGAH [33] (TMM 2020), UGACH [9] (AAAI 2018), DJSRH [12] (ICCV 2019), DSAH [13] (ICMR 2020), DBRC [28] (TMM 2019), CRE [22] (TIP 2019), DADH [15] (ICMR 2020), AGAH [16] (ICMR 2019), UDCMH [6] (IJCAI 2018), SCH-GAN [34] (TCYB 2020), DSPOH [29] (TNNLS 2019), SCRATCH [23] (TCSVT 2019), MTFH [10] (TPAMI 2019), EGDH [37] (IJCAI 2019), DMFH [31] (TCSVT 2020), BATCH [26] (TKDE 2020), ATFH-N [25] (TETCI 2020), MDCH [32] (TMM 2020), CPAH [27] (TIP 2020), MLCAH [30] (TMM 2020) and SSAH [5] (CVPR 2018).
4.2. DSPRH and Baselines Comparison
4.2.1. Performance Comparison on MIRFlickr-25K
Most current algorithms focus on MAP@50 and MAP when measuring the performance of cross-modal hashing. MAP is computed over all returned results for each query, while MAP@50 computes the average precision over only the top 50 retrieved items.
Table 2 shows the performance comparison on MIRFlickr-25K. In Table 2, the two algorithms shown in blue [12,13] inspired the alignment loss of DSPRH, so we re-ran these two algorithms [12,13] and report the corresponding MAP and MAP@50 results. In addition, after analyzing recent cross-modal hashing work, we found that mining modal semantic information through semantic spatial information reconstruction and loss function design can greatly improve performance.

Compared with some current frontier works (based on graph convolution, model pruning and GANs), the proposed DSPRH performs better. Compared with DSAH [13] and DJSRH [12], DSPRH builds on the tensor generation module (TRM), which generates many rank-1 tensors to capture contextual feature fragments and then recovers high-order context features from them. Therefore, DSPRH obtains context-fused modal semantic information with low computational complexity. In addition, the DSPRH algorithm optimizes the network parameters with three loss functions, confirming its effectiveness.

Specifically, with AlexNet as the backbone network, for image2text retrieval DSPRH improves MAP@50 over DSAH [13] by 4.1%, 4.9%, 3.4% and 3.2% at 16, 32, 64 and 128 bits, respectively. For text2image retrieval, DSPRH improves MAP@50 by 3.3%, 2.5%, 3.1% and 3.5% at 16, 32, 64 and 128 bits, respectively, compared to DSAH [13]. Compared to the baselines, DSPRH shows better performance in both image2text MAP and text2image MAP. Using ResNet34 as the backbone network verifies the scalability of DSPRH and further improves its performance significantly.
4.2.2. Performance Comparison on NUS-WIDE
Table 3 shows the performance comparison on NUS-WIDE. Similar to Table 2, we compare MAP and MAP@50 performance, respectively. Encouraged by DJSRH [12] and DSAH [13], we proposed DSPRH. The blue font corresponds to the main comparison algorithms, and the red part is the proposed algorithm (with AlexNet or ResNet34 as the backbone network). AlexNet as the backbone network mainly evaluates the effectiveness of the algorithm under a fair comparison, while ResNet34 further improves performance and illustrates the network scalability of our algorithm.

Specifically, DSAH [13] and DJSRH [12] are the two most relevant baselines in this article. Because DSAH [13] performs better than DJSRH [12], we mainly analyze the comparison between DSPRH and DSAH [13]. With AlexNet as the backbone network, for image2text retrieval DSPRH improves MAP@50 over DSAH [13] by 2.5%, 3.6%, 4.2% and 3.5% at 16, 32, 64 and 128 bits, respectively. For text2image retrieval, DSPRH improves MAP@50 by 1.4%, −0.1%, 2.8% and 1.4% at 16, 32, 64 and 128 bits, respectively, compared to DSAH [13].

Measuring cross-modal hashing retrieval performance with MAP, we also select 8 baselines for evaluation. With AlexNet as the backbone network, for image2text retrieval DSPRH improves MAP over DSAH [13] by 6.9%, 8.4%, 8.2% and 9.5% at 16, 32, 64 and 128 bits, respectively. For text2image retrieval, DSPRH improves MAP by 8.5%, 8.9%, 10.5% and 6.7% at 16, 32, 64 and 128 bits, respectively, compared to DSAH [13]. DSPRH can be further extended: when ResNet34 is chosen as the backbone network, its performance improves further. In short, DSPRH achieves better performance than the baselines on both public datasets, and the experimental results show that our algorithm has better retrieval performance and can fully mine modal context semantic information.
4.3. Ablation Analysis
4.3.1. Loss Function Ablation Analysis
In this section, we analyze the influencing factors of DSPRH, including the influence of the loss function, the influence of modal semantic information fusion and hyperparameter sensitivity. The loss function of DSPRH is the semantic-preserving reconstruction loss (full loss), which includes the alignment loss, cosine triplet loss and pairwise loss, as shown in Figure 2 and Equation (19). Therefore, we perform an ablation analysis on the loss function. Table 4 reports the loss function ablation analysis at hash code lengths of 128, 64, 32 and 16 bits; the corresponding figures are given in Appendix A.
Loss ablation analysis on MIRFlickr-25K: Figure A1 shows the loss function ablation analysis at different code lengths. In Figure A1, the proposed semantic-preserving reconstruction loss (full loss) performs best at every length. At 128 and 64 bits, the cosine triplet loss contributes the most to the full loss compared with the alignment loss and the pairwise loss. At 32 and 16 bits, the pairwise loss contributes the most compared with the alignment loss and the cosine triplet loss.

In short, DSPRH obtains the best performance on MIRFlickr-25K with the full loss (alignment loss, cosine triplet loss and pairwise loss). At 64 and 128 bits the cosine triplet loss benefits the full loss the most, while at 32 and 16 bits the pairwise loss benefits it the most.
Loss ablation analysis on NUS-WIDE: Figure A2 shows the loss function ablation analysis at different code lengths. In Figure A2, the proposed semantic-preserving reconstruction loss (full loss) again performs best, and the pairwise loss contributes the most at every code length. The order of advantage of the loss terms is: pairwise loss > alignment loss > cosine triplet loss.
Loss ablation analysis summary: Table 4, Figure A1 and Figure A2 show the loss function ablation analysis of DSPRH on the two datasets. The experimental results show that the pairwise loss achieves good results on both datasets. This loss has mainly been used in single-modal image retrieval and has received less attention in cross-modal retrieval. However, in cross-modal retrieval, the semantic interaction between the image modality and the text modality is analogous to the information interaction among single-modal images, so this loss can also be applied to cross-modal retrieval tasks.

Specifically, on MIRFlickr-25K the cosine triplet loss outperforms the pairwise loss at higher code lengths (64 and 128 bits), and the pairwise loss outperforms the alignment loss in Figure A1, giving the order: full loss > cosine triplet loss > alignment loss > pairwise loss. Conversely, at lower code lengths (32 and 16 bits) the pairwise loss outperforms the alignment loss, and the alignment loss outperforms the cosine triplet loss in Figure A1, giving the order: full loss > pairwise loss > alignment loss > cosine triplet loss. On NUS-WIDE, which contains a much larger amount of data, the pairwise loss contributes the most to the proposed semantic-preserving reconstruction loss at every code length, giving the order: full loss > pairwise loss > alignment loss > cosine triplet loss. Therefore, the proposed loss function performs well and achieves better retrieval performance than current cross-modal retrieval algorithms.
4.3.2. Feature Space Fusion Analysis
In Figure 3, visual feature extraction includes three types of features, f1, f2 and f3, whose fusion weights are w1, w2 and w3, respectively. f3 is the fully connected (fc7) feature of AlexNet; f2 is the feature obtained by cascading AlexNet's conv5, GCP and a fully connected layer; f1 is the feature obtained by cascading the strip pooling block, GCP and a fully connected layer. In Figure 3, w1 + w2 + w3 = 1; to simplify the parameters, we set w1 equal to w2, and our experiments verified that retrieval performance is best when w1 equals w2. The relevant ablation analysis is shown next. Overall, w1, w2 and w3 are 0.25, 0.25 and 0.5 on MIRFlickr-25K, and 0.125, 0.125 and 0.75 on NUS-WIDE, respectively.
During feature fusion, w1 and w2 are equal and both determined by w3, so we only need to show the parameter sensitivity analysis of w3. In Figure 4, the feature space based on feature fusion gives better results. When w3 equals 1, w1 and w2 are both 0, which means that no feature fusion is performed; in this case both the MAP and MAP@50 evaluations of DSPRH are unstable. When w3 equals 0.75 and w1 and w2 are both 0.125, the DSPRH evaluation is relatively stable. Specifically, in Figure 4c,d, DSPRH obtains better performance as w3 increases, but when w3 reaches its maximum value the performance of DSPRH becomes unstable. Therefore, DSPRH retrieval performance is best when (w3, w2, w1) is (0.5, 0.25, 0.25) on MIRFlickr-25K and (0.75, 0.125, 0.125) on NUS-WIDE.
Figure 3 shows the feature space reconstruction method, and Figure 4 shows the performance analysis of the proposed feature fusion strategy. We found that when w3 equals 1, the retrieval performance does not reach its optimum in Figure 4, so the weighted-fusion reconstruction of the image feature space obtains better retrieval performance. Therefore, the image feature space of the DSPRH algorithm is implemented based on the proposed feature space reconstruction method in Figure 2 and Figure 3. This also fully confirms that the proposed multi-feature fusion extraction module in Figure 3 is effective.
4.3.3. Hyper-Parameter Sensitivity Analysis
There are four hyperparameters (λ in Equation (12); α, β and γ in Equation (19)) to be determined in DSPRH. The value of λ is 0.9, following DSAH [13] and DJSRH [12]. Next, we analyze the impact of the other three parameters. DSPRH uses six indicators to discuss the validity of the parameters; Figure 5 contains the six retrieval evaluation indicators (Image2text MAP@50, Image2text MAP, Text2image MAP@50, Text2image MAP, MAP@50 and MAP).
In Figure 5a, DSPRH performs best on MIRFlickr-25K when α = 1; in Figure 5b, when β = 0.03; and in Figure 5c, when γ = 1. In Figure 5d-f, DSPRH likewise performs best on NUS-WIDE when α = 1, β = 0.03 and γ = 1. Therefore, the parameters α, β and γ of DSPRH are set to 1, 0.03 and 1, respectively, in this paper.
In short, the three parameters (α, β and γ) of DSPRH take the same values on both datasets (MIRFlickr-25K and NUS-WIDE); specifically, in Equation (19), α, β and γ are 1, 0.03 and 1, respectively. Therefore, the proposed DSPRH has good parameter adaptability in cross-modal retrieval tasks.
4.3.4. Other Ablation Analysis
We show the precision-recall curves and the precision versus top-K curves at 128 bits. DJSRH [12] and DSAH [13] are the two baselines most relevant to DSPRH, so we perform additional ablation analyses (precision-recall, precision@top-K) against them. Figure 6a shows the image2text precision-recall curve and Figure 6b the text2image precision-recall curve; Figure 6a,b show that DSPRH achieves better precision-recall performance than DJSRH [12] and DSAH [13] in both directions. Figure 6c shows the image2text precision@top-K curve and Figure 6d the text2image precision@top-K curve; Figure 6c,d show that DSPRH also achieves better precision@top-K performance than DJSRH [12] and DSAH [13].
DJSRH [12] was proposed at ICCV 2019; its main goal is to reconstruct the loss function, and it builds a joint semantic loss from the image-modal and text-modal features. DJSRH [12] achieves good performance, but it does not consider a semantic alignment loss. Compared with DJSRH, DSAH [13], proposed at ICMR 2020, considers a semantic alignment loss, so the image-modal and text-modal semantic information is aligned and associated. Encouraged by the DSAH algorithm, we propose the DSPRH algorithm: we use feature fusion to reconstruct the image feature space and propose a new cross-modal retrieval loss function that jointly considers modal semantic alignment loss, pairwise loss and cosine triplet loss.

In this way, the semantic alignment loss reduces the semantic differences between the two modalities, while the pairwise loss and cosine triplet loss shorten the distance between similar samples across the two modalities and maintain the differences between dissimilar samples.
Figure 6 further shows that the proposed DSPRH has better retrieval performance, which also demonstrates that the proposed technology is scalable and suitable for cross-modal retrieval tasks. Overall, the experiments show that the proposed algorithm achieves better retrieval performance and confirm its reliability and effectiveness.
5. Conclusions
In this work, we propose a novel unsupervised cross-modal hash retrieval method, named deep semantic-preserving reconstruction hashing (DSPRH). This work uses tensor canonical polyadic (rank-1) decomposition theory, two-bottleneck auto-encoding and a semantic-preserving reconstruction loss to complete the cross-modal retrieval task. Specifically, in the original feature space we generate rank-1 tensors to capture high-order context modal information, which assists the backbone network in capturing important modal semantics. Based on second-order statistical theory, we use global covariance pooling to capture channel semantic information and accelerate network convergence. To achieve modal and semantic interaction, we propose a two-bottleneck self-encoding feature reconstruction layer that enables visual-text modal information interaction. In the metric learning part, we design a new semantic-preserving reconstruction loss to achieve modal alignment and information reconstruction, which preserves important modal semantic information. The experiments were performed on two datasets (MIRFlickr-25K and NUS-WIDE), and the paper gives a detailed experimental evaluation, parameter settings, baseline comparisons and ablation analysis. The experimental results show that DSPRH achieves better retrieval performance.
The proposed method has good adaptability for cross-modal retrieval tasks, and we do not need to re-tune the loss function parameters for different cross-modal retrieval tasks. In this work, we found that the construction of the loss function is key to cross-modal retrieval, and the metric learning strategy is the main factor affecting retrieval performance. When reconstructing the feature space, the proposed algorithm adopts a multi-feature fusion strategy whose weights still need to be fine-tuned manually. In future research, we will continue to explore key technologies for cross-modal retrieval, such as image-text retrieval, image-text matching and video-text retrieval.
Author Contributions
Conceptualization, S.C.; methodology, A.D.; software, S.C.; validation, A.D.; formal analysis, S.C.; writing—original draft preparation, S.C.; writing—review and editing, L.W.; visualization, S.C.; All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported in part by the Xinjiang Uygur Autonomous Region Natural Science Foundation Project, under Grant 2020D01C058, the National Science Foundation of China, under Grant 61771416 and U1903213, the Creative Research Groups of Higher Education of Xinjiang Uygur Autonomous Region, under Grant XJEDU2017T002, and the Xinjiang Uygur Autonomous region graduate research and innovation project, under Grant XJ2019G039.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A
Figure A1.
Loss function ablation analysis with different bits on MIRFlickr-25K.
Figure A2.
Loss function ablation analysis with different bits on NUS-WIDE.
References
- Rong-Cheng, T.; Xian-Ling, M.; Wei, W. MLS3RDUH: Deep Unsupervised Hashing via Manifold based Local Semantic Similarity Structure Reconstructing. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI), Yokohama, Japan, 11–17 July 2020; pp. 3466–3472. [Google Scholar]
- Yanzhao, X.; Yu, L.; Yangtao, W.L.G.; Peng, W.; Ke, Z. Label-Attended Hashing for Multi-Label Image Retrieval. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI), Yokohama, Japan, 11–17 July 2020; pp. 955–962. [Google Scholar]
- Lixin, F.; KamWoh, N.; Ce, J.; Tianyu, Z.; Chee, S.C. Deep Polarized Network for Supervised Learning of Accurate Binary Hashing Codes. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI), Yokohama, Japan, 11–17 July 2020; pp. 825–831. [Google Scholar]
- Zheng, C.; Zhu, L.; Lu, X.; Li, J.; Cheng, Z.; Zhang, H. Fast Discrete Collaborative Multi-modal Hashing for Large-scale Multimedia Retrieval. IEEE Trans. Knowl. Data Eng. 2019, 32, 2171–2184. [Google Scholar] [CrossRef]
- Chao, L.; Cheng, D.; Ning, L.; Wei, L.; Xinbo, G.; Dacheng, T. Self-Supervised Adversarial Hashing Networks for Cross Modal Retrieval. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4242–4251. [Google Scholar]
- Gengshen, W.; Zijia, L.; Jungong, H.; Li, L.; Guiguang, D.; Baochang, Z.; Jialie, S. Unsupervised Deep Hashing via Binary Latent Factor Models for Large-scale Cross-modal Retrieval. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden, 13–19 July 2018; pp. 2854–2860. [Google Scholar]
- Lee, K.H.; Chen, X.; Hua, G.; Hu, H.; He, X. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 201–216. [Google Scholar]
- Zhong, J.; Haoran, W.; Jungong, H.; Yanwei, P. Saliency-Guided Attention Network for Image-Sentence Matching. In Proceedings of the 2019 IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 5753–5762. [Google Scholar]
- Jian, Z.; Yuxin, P.; Mingkuan, Y. Unsupervised Generative Adversarial Cross-Modal Hashing. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, 2–7 February 2018; pp. 539–546. [Google Scholar]
- Liu, X.; Hu, Z.; Ling, H.; Cheung, Y. MTFH: A Matrix Tri-Factorization Hashing Framework for Efficient Cross-Modal Retrieval. Available online: https://sci-hub.do/10.1109/TPAMI.2019.2940446 (accessed on 5 November 2020).
- Weiwei, W.; Yuming, S.; Haofeng, Z.; Yazhou, Y.; Li, L. SRCH: Set and Rebase: Determining the Semantic Graph Connectivity for Unsupervised Cross-Modal Hashing. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI), Yokohama, Japan, 11–17 July 2020; pp. 853–859. [Google Scholar]
- Shupeng, S.; Zhisheng, Z.; Chao, Z. Deep Joint-Semantics Reconstructing Hashing for Large-Scale Unsupervised Cross-Modal Retrieval. In Proceedings of the 2019 IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 3027–3035. [Google Scholar]
- Dejie, Y.; Dayan, W.; Wanqian, Z.; Haisu, Z.; Bo, L.; Weiping, W. Deep Semantic-Alignment Hashing for Unsupervised Cross-Modal Retrieval. In Proceedings of the International Conference on Multimedia Retrieval (ICMR), Dublin, Ireland, 8–11 June 2020; pp. 44–52. [Google Scholar]
- Hengtong, H.; Lingxi, X.; Richang, H.; Qi, T. Creating Something from Nothing: Unsupervised Knowledge Distillation for Cross-Modal Hashing. In Proceedings of the 33rd IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–18 November 2020; pp. 3120–3129. [Google Scholar]
- Cong, B.; Chao, Z.; Qing, M.; Jinglin, Z.; Shengyong, C. Deep Adversarial Discrete Hashing for Cross-Modal Retrieval. In Proceedings of the International Conference on Multimedia Retrieval (ICMR), Dublin, Ireland, 8–11 June 2020; pp. 525–531. [Google Scholar]
- Wen, G.; Xiaoyan, G.; Jingzi, G.; Bo, L.; Zhi, X.; Weiping, W. Adversary Guided Asymmetric Hashing for Cross-Modal Retrieval. In Proceedings of the International Conference on Multimedia Retrieval (ICMR), Ottawa, ON, Canada, 10–13 June 2019; pp. 159–167. [Google Scholar]
- Yuming, S.; Jie, Q.; Jiaxin, C.; Mengyang, Y.; Li, L.; Fan, Z.; Fumin, S.; Ling, S. Auto-Encoding Twin-Bottleneck Hashing. In Proceedings of the 33rd IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–18 November 2020; pp. 2815–2824. [Google Scholar]
- Qilong, W.; Li, Z.; Banggu, W.; Dongwei, R.; Peihua, L.; Wangmeng, Z.; Qinghua, H. What Deep CNNs Benefit from Global Covariance Pooling: An Optimization Perspective. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Peihua, L.; Jiangtao, X.; Qilong, W.; Zilin, G. Towards Faster Training of Global Covariance Pooling Networks by Iterative Matrix Square Root Normalization. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 947–955. [Google Scholar]
- Peihua, L.; Jiangtao, X.; Qilong, W.; Wangmeng, Z. Is Second-order Information Helpful for Large-scale Visual Recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2070–2078. [Google Scholar]
- Qibin, H.; Li, Z.; Ming-Ming, C.; Jiashi, F. Strip Pooling: Rethinking Spatial Pooling for Scene Parsing. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4002–4011. [Google Scholar]
- Mengqiu, H.; Yang, Y.; Fumin, S.; Ning, X.; Richang, H.; Heng, T.S. Collective Reconstructive Embeddings for Cross-Modal Hashing. IEEE Trans. Image Process. 2019, 28, 2770–2784. [Google Scholar]
- Zhen-Duo, C.; Chuan-Xiang, L.; Xin, L.; Liqiang, N.; Wei, Z.; Xin-Shun, X. Scratch: A Scalable Discrete Matrix Factorization Hashing Framework for Cross-Modal Retrieval. IEEE Trans. Circuits Syst. Video Techn. 2020, 30, 2262–2275. [Google Scholar]
- Zhong, J.; Yuxin, S.; Yunlong, Y.; Yanwei, P.; Jungong, H. Attribute-Guided Network for Cross-Modal Zero-Shot Hashing. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 321–330. [Google Scholar]
- Liu, X.; Cheung, Y.; Hu, Z.; He, Y.; Zhong, B. Adversarial Tri-Fusion Hashing Network for Imbalanced Cross-Modal Retrieval. IEEE Trans. Emerg. Top. Comput. Intell. 2020. [Google Scholar] [CrossRef]
- Wang, Y.; Luo, X.; Nie, L.; Song, J.; Zhang, W.; Xu, X. Batch: A Scalable Asymmetric Discrete Cross-Modal Hashing. IEEE Trans. Knowl. Data Eng. 2020. [Google Scholar] [CrossRef]
- De, X.; Cheng, D.; Chao, L.; Xianglong, L.; Dacheng, T. Multi-Task Consistency-Preserving Adversarial Hashing for Cross-Modal Retrieval. IEEE Trans. Image Process. 2020, 29, 3626–3637. [Google Scholar]
- Di, H.; Feiping, N.; Xuelong, L. Deep Binary Reconstruction for Cross-Modal Hashing. IEEE Trans. Multimed. 2019, 21, 973–985. [Google Scholar]
- Lu, J.; Kai, L.; Zechao, L.; Fu, X.; Guo-Jun, Q.; Jinhui, T. Deep Semantic-Preserving Ordinal Hashing for Cross-Modal Similarity Search. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 1429–1440. [Google Scholar]
- Ma, X.; Zhang, T.; Xu, C. Multi-Level Correlation Adversarial Hashing for Cross-Modal Retrieval. IEEE Trans. Multimed. 2020. [Google Scholar] [CrossRef]
- Nie, X.; Wang, B.; Li, J.; Hao, F.; Jian, M.; Yin, Y. Deep Multiscale Fusion Hashing for Cross-Modal Retrieval. IEEE Trans. Circuits Syst. Video Technol. 2020. [Google Scholar] [CrossRef]
- Lin, Q.; Wenming, C.; He, Z.; He, Z. Mask Cross-modal Hashing Networks. IEEE Trans. Multimed. 2020. [Google Scholar] [CrossRef]
- Jian, Z.; Yuxin, P. Multi-Pathway Generative Adversarial Hashing for Unsupervised Cross-Modal Retrieval. IEEE Trans. Multimed. 2020, 22, 174–187. [Google Scholar]
- Jian, Z.; Yuxin, P.; Mingkuan, Y. SCH-GAN: Semi-Supervised Cross-Modal Hashing by Generative Adversarial Network. IEEE Trans. Cybern. 2020, 50, 489–502. [Google Scholar]
- Zilin, G.; Jiangtao, X.; Qilong, W.; Peihua, L. Global Second-order Pooling Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 3024–3033. [Google Scholar]
- Qilong, W.; Jiangtao, X.; Wangmeng, Z.; Lei, Z.; Peihua, L. Deep CNNs Meet Global Covariance Pooling: Better Representation and Generalization. IEEE Trans. Pattern Anal. Mach. Intell. 2020. [Google Scholar] [CrossRef] [Green Version]
- Yufeng, S.; Xinge, Y.; Feng, Z.; Shuo, W.; Qinmu, P. Equally Guided Discriminative Hashing for Cross-modal Retrieval. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), Macao, China, 10–16 August 2019; pp. 4767–4773. [Google Scholar]
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).