Object-Level Visual-Text Correlation Graph Hashing for Unsupervised Cross-Modal Retrieval
Abstract
1. Introduction
- We propose a novel object-level visual-text correlation graph hashing (OVCGH) approach that mines the fine-grained object-level similarity in cross-modal data while suppressing noise interference. Experiments on two public datasets show that our method achieves the best retrieval performance.
- We design an object-level correlation graph building module that captures object-level similarity within each modality. It constructs graph structures for the image modality and the text modality at the object level in an unsupervised manner, and the constructed graphs preserve the global information of the original semantic structure.
- We design a novel cross-modal dependency building module that builds object-level similarity between modalities while avoiding noise interference. It models the dependency between image object regions and text tags, ensuring that cross-modal pairs containing similar objects receive higher similarity scores (see the sketch after this list).
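The two modules above are described only at a high level here. The following minimal sketch illustrates the underlying idea, not the authors' implementation: an intra-modal k-nearest-neighbor graph is built over object-level features (e.g., Faster R-CNN region features for images or fastText embeddings for text tags, which the reference list suggests as feature extractors), and a region–tag affinity matrix models the cross-modal dependency. All function names and the choice of cosine similarity are illustrative assumptions.

```python
import numpy as np

def knn_graph(features: np.ndarray, k: int = 5) -> np.ndarray:
    """Intra-modal object-level graph: connect each object to its k most
    similar objects under cosine similarity. Illustrative sketch only; the
    paper's exact graph construction may differ."""
    n = features.shape[0]
    k = min(k, n - 1)
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T                # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)         # exclude self-loops
    adj = np.zeros((n, n))
    for i in range(n):
        adj[i, np.argsort(sim[i])[-k:]] = 1.0   # k most similar neighbors
    return np.maximum(adj, adj.T)          # symmetrize the adjacency

def cross_modal_affinity(region_feats: np.ndarray,
                         tag_feats: np.ndarray) -> np.ndarray:
    """Cross-modal dependency: cosine affinity between each image object
    region and each text tag, so pairs that share objects (a 'dog' region
    and the tag 'dog') score high while irrelevant regions contribute
    little. A hypothetical modeling choice, not the paper's exact module."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = tag_feats / np.linalg.norm(tag_feats, axis=1, keepdims=True)
    return r @ t.T                         # shape: (n_regions, n_tags)
```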
2. Related Work
3. The Proposed Method
3.1. Problem Definition
3.2. Graph-Level Representation Learning
3.3. Graph-Guided Hashing Learning
4. Experiments
4.1. Datasets and Compared Methods
4.2. Evaluation Methods
4.3. Implementation Details
4.4. Experiment Results
4.5. Ablation Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Deng, C.; Chen, Z.; Liu, X.; Gao, X.; Tao, D. Triplet-based deep hashing network for cross-modal retrieval. IEEE Trans. Image Process. 2018, 27, 3893–3903.
- Kiros, R.; Salakhutdinov, R.; Zemel, R.S. Unifying visual-semantic embeddings with multimodal neural language models. arXiv 2014, arXiv:1411.2539.
- Wu, Y.; Wang, S.; Huang, Q. Online asymmetric similarity learning for cross-modal retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4269–4278.
- Zhang, T.; Wang, J. Collaborative quantization for cross-modal similarity search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2036–2045.
- Liu, W.; Wang, J.; Ji, R.; Jiang, Y.G.; Chang, S.F. Supervised hashing with kernels. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2074–2081.
- Wang, J.; Liu, W.; Sun, A.X.; Jiang, Y.G. Learning hash codes with listwise supervision. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3032–3039.
- Zhan, Y.W.; Luo, X.; Wang, Y.; Xu, X.S. Supervised hierarchical deep hashing for cross-modal retrieval. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 20–24 October 2020; pp. 3386–3394.
- Zhang, J.; Peng, Y.; Yuan, M. Unsupervised generative adversarial cross-modal hashing. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
- Li, C.; Deng, C.; Wang, L.; Xie, D.; Liu, X. Coupled CycleGAN: Unsupervised hashing network for cross-modal retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 176–183.
- Su, S.; Zhong, Z.; Zhang, C. Deep joint-semantics reconstructing hashing for large-scale unsupervised cross-modal retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 3027–3035.
- Velickovic, P.; Fedus, W.; Hamilton, W.L.; Liò, P.; Bengio, Y.; Hjelm, R.D. Deep graph infomax. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
- Liu, H.; Ji, R.; Wu, Y.; Huang, F.; Zhang, B. Cross-modality binary code learning via fusion similarity hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7380–7388.
- Zhang, X.; Lai, H.; Feng, J. Attention-aware deep adversarial hashing for cross-modal retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 591–606.
- Bronstein, M.M.; Bronstein, A.M.; Michel, F.; Paragios, N. Data fusion through cross-modality metric learning using similarity-sensitive hashing. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 13–18 June 2010; pp. 3594–3601.
- Zhang, D.; Li, W.J. Large-scale supervised multimodal hashing with semantic correlation maximization. In Proceedings of the AAAI Conference on Artificial Intelligence, Quebec City, QC, Canada, 27–31 July 2014; Volume 28.
- Lin, Z.; Ding, G.; Hu, M.; Wang, J. Semantics-preserving hashing for cross-view retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3864–3872.
- Jiang, Q.Y.; Li, W.J. Deep cross-modal hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3232–3240.
- Li, C.; Deng, C.; Li, N.; Liu, W.; Gao, X.; Tao, D. Self-supervised adversarial hashing networks for cross-modal retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4242–4251.
- Xu, R.; Li, C.; Yan, J.; Deng, C.; Liu, X. Graph convolutional network hashing for cross-modal retrieval. In Proceedings of the International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 982–988.
- Thompson, B. Canonical correlation analysis. In Encyclopedia of Statistics in Behavioral Science; Wiley: Hoboken, NJ, USA, 2005.
- Song, J.; Yang, Y.; Yang, Y.; Huang, Z.; Shen, H.T. Inter-media hashing for large-scale retrieval from heterogeneous data sources. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 22–27 June 2013; pp. 785–796.
- Kumar, S.; Udupa, R. Learning hash functions for cross-view similarity search. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Spain, 16–22 July 2011.
- Weiss, Y.; Torralba, A.; Fergus, R. Spectral hashing. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, December 2008; pp. 1753–1760.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv 2015, arXiv:1506.01497.
- Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of tricks for efficient text classification. arXiv 2016, arXiv:1607.01759.
- Ribeiro, L.F.; Zhang, Y.; Gardent, C.; Gurevych, I. Modeling global and local node contexts for text generation from knowledge graphs. Trans. Assoc. Comput. Linguist. 2020, 8, 589–604.
- Huiskes, M.J.; Lew, M.S. The MIR Flickr retrieval evaluation. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, Vancouver, BC, Canada, 30–31 October 2008; pp. 39–43.
- Chua, T.S.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; Zheng, Y. NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, Santorini Island, Greece, 8–10 July 2009; pp. 1–9.
- Ding, G.; Guo, Y.; Zhou, J. Collective matrix factorization hashing for multimodal data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2075–2082.
- Rastegari, M.; Choi, J.; Fakhraei, S.; Hal, D.; Davis, L. Predictable dual-view hashing. In Proceedings of the International Conference on Machine Learning, PMLR, Atlanta, GA, USA, 16–21 June 2013; pp. 1328–1336.
- Ding, G.; Guo, Y.; Zhou, J.; Gao, Y. Large-scale cross-modality search via collective matrix factorization hashing. IEEE Trans. Image Process. 2016, 25, 5427–5440.
- Long, M.; Cao, Y.; Wang, J.; Yu, P.S. Composite correlation quantization for efficient multimodal retrieval. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, Pisa, Italy, 17–21 July 2016; pp. 579–588.
- Zhang, J.; Peng, Y. Multi-pathway generative adversarial hashing for unsupervised cross-modal retrieval. IEEE Trans. Multimed. 2019, 22, 174–187.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
Retrieval results on MIRFlickr at hash code lengths of 16, 32, 64, and 128 bits (I→T: image-to-text retrieval; T→I: text-to-image retrieval).

| Method | I→T 16 | I→T 32 | I→T 64 | I→T 128 | T→I 16 | T→I 32 | T→I 64 | T→I 128 |
|---|---|---|---|---|---|---|---|---|
| CVH [22] | 0.602 | 0.587 | 0.578 | 0.572 | 0.607 | 0.591 | 0.581 | 0.574 |
| PDH [30] | 0.623 | 0.624 | 0.621 | 0.626 | 0.627 | 0.628 | 0.628 | 0.629 |
| CMFH [31] | 0.659 | 0.660 | 0.663 | 0.653 | 0.611 | 0.606 | 0.575 | 0.563 |
| CCQ [32] | 0.637 | 0.639 | 0.639 | 0.638 | 0.628 | 0.628 | 0.622 | 0.618 |
| CMSSH [14] | 0.611 | 0.602 | 0.599 | 0.591 | 0.612 | 0.604 | 0.592 | 0.585 |
| SCM [15] | 0.636 | 0.640 | 0.641 | 0.643 | 0.661 | 0.664 | 0.668 | 0.670 |
| DJSRH [10] | 0.659 | 0.661 | 0.675 | 0.684 | 0.655 | 0.671 | 0.673 | 0.685 |
| MGAH [33] | 0.685 | 0.693 | 0.704 | 0.702 | 0.673 | 0.676 | 0.686 | 0.690 |
| VCGH | 0.701 | 0.711 | 0.714 | 0.723 | 0.705 | 0.713 | 0.725 | 0.733 |
Retrieval results on NUS-WIDE at hash code lengths of 16, 32, 64, and 128 bits (I→T: image-to-text retrieval; T→I: text-to-image retrieval).

| Method | I→T 16 | I→T 32 | I→T 64 | I→T 128 | T→I 16 | T→I 32 | T→I 64 | T→I 128 |
|---|---|---|---|---|---|---|---|---|
| CVH [22] | 0.458 | 0.432 | 0.410 | 0.392 | 0.474 | 0.445 | 0.419 | 0.398 |
| PDH [30] | 0.475 | 0.484 | 0.480 | 0.490 | 0.489 | 0.512 | 0.507 | 0.517 |
| CMFH [31] | 0.517 | 0.550 | 0.547 | 0.520 | 0.439 | 0.416 | 0.377 | 0.349 |
| CCQ [32] | 0.504 | 0.505 | 0.506 | 0.505 | 0.499 | 0.496 | 0.492 | 0.488 |
| CMSSH [14] | 0.512 | 0.470 | 0.479 | 0.466 | 0.519 | 0.498 | 0.456 | 0.488 |
| SCM [15] | 0.517 | 0.514 | 0.518 | 0.518 | 0.518 | 0.510 | 0.517 | 0.518 |
| DJSRH [10] | 0.503 | 0.517 | 0.528 | 0.554 | 0.526 | 0.541 | 0.539 | 0.570 |
| MGAH [33] | 0.613 | 0.623 | 0.628 | 0.631 | 0.603 | 0.614 | 0.640 | 0.641 |
| VCGH | 0.615 | 0.628 | 0.631 | 0.639 | 0.617 | 0.628 | 0.635 | 0.648 |
Ablation study on MIRFlickr and NUS-WIDE at code lengths of 16, 32, 64, and 128 bits (MF: MIRFlickr; NW: NUS-WIDE).

| Task | Method | MF 16 | MF 32 | MF 64 | MF 128 | NW 16 | NW 32 | NW 64 | NW 128 |
|---|---|---|---|---|---|---|---|---|---|
| Image→Text | Baseline | 0.656 | 0.682 | 0.689 | 0.698 | 0.507 | 0.559 | 0.569 | 0.581 |
| Image→Text | Baseline-global | 0.681 | 0.688 | 0.692 | 0.703 | 0.589 | 0.593 | 0.611 | 0.615 |
| Image→Text | Baseline-object | 0.686 | 0.691 | 0.699 | 0.710 | 0.594 | 0.611 | 0.619 | 0.623 |
| Image→Text | VCGH | 0.701 | 0.711 | 0.714 | 0.723 | 0.615 | 0.628 | 0.631 | 0.639 |
| Text→Image | Baseline | 0.664 | 0.690 | 0.691 | 0.700 | 0.498 | 0.583 | 0.576 | 0.619 |
| Text→Image | Baseline-global | 0.680 | 0.683 | 0.687 | 0.694 | 0.581 | 0.588 | 0.610 | 0.614 |
| Text→Image | Baseline-object | 0.691 | 0.696 | 0.710 | 0.718 | 0.597 | 0.616 | 0.621 | 0.628 |
| Text→Image | VCGH | 0.705 | 0.713 | 0.725 | 0.733 | 0.617 | 0.628 | 0.635 | 0.648 |
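The tables above report retrieval scores at four code lengths. Assuming these are mean average precision (MAP) over Hamming ranking, the standard evaluation protocol for cross-modal hashing on MIRFlickr and NUS-WIDE, a minimal evaluation sketch looks like the following. This is an illustrative implementation, not the authors' evaluation code; the relevance criterion (sharing at least one label) is the usual convention on these multi-label datasets.

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, query_labels, db_labels):
    """MAP over Hamming ranking for binary codes in {-1, +1}.

    query_codes: (Q, b); db_codes: (N, b); labels are multi-hot arrays of
    shape (Q, L) and (N, L). A database item is relevant to a query if the
    two share at least one label.
    """
    aps = []
    for q_code, q_label in zip(query_codes, query_labels):
        # For +/-1 codes, Hamming distance = (b - <x, y>) / 2.
        dist = 0.5 * (db_codes.shape[1] - db_codes @ q_code)
        order = np.argsort(dist)                     # closest codes first
        relevant = (db_labels[order] @ q_label) > 0  # share >= 1 label
        if not relevant.any():
            continue                                 # skip queries w/o matches
        hits = np.cumsum(relevant)
        precision = hits / np.arange(1, len(hits) + 1)
        aps.append(precision[relevant].mean())       # average precision
    return float(np.mean(aps))
```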