An Efficient Cross-Modal Privacy-Preserving Image–Text Retrieval Scheme
Abstract
1. Introduction
- General adapter structure: We propose a general adapter structure that introduces a set of trainable pseudo-prompt vectors, together with an encoder that maps these vectors to a specific space, at the text input end of a frozen multimodal pre-trained model. The vectors generate appropriate prompts for different tasks. The general adapter needs to train only about 5.23% of the parameters to improve performance on image–text retrieval and zero-shot image classification tasks on general domain datasets.
- Specialized adapter structure: We propose a specialized adapter structure that inserts learnable rescaling vectors at specific positions in the multi-head attention and feed-forward layers of the transformer networks on both the text and visual ends of a frozen multimodal pre-trained model, so that intermediate computation results can be scaled at a fine granularity. With this structure, only about 0.017% of the parameters need to be trained to improve performance on image–text retrieval and zero-shot image classification tasks on specialized domain datasets.
- Diverse retrieval interfaces: We provide users with both general domain and specialized domain retrieval interfaces. For users’ general retrieval needs, only the trained general adapter needs to be installed on the base model. For data from different specialized domains, it is only necessary to train and install lightweight specialized adapter modules corresponding to those domains. This approach offers users diverse and flexible cross-modal retrieval services.
- Efficient cross-modal retrieval structure: We design an efficient cross-modal retrieval scheme. The hierarchical navigable small world (HNSW) algorithm is used instead of traditional indexing structures. This improves the efficiency of approximate nearest neighbor (ANN) searches for users’ query trapdoors among a large number of high-dimensional encrypted feature indexing vectors.
- Comprehensive privacy protection mechanism: We develop a comprehensive privacy protection mechanism with two levels of security modes that are adaptable to different user tasks. The level I security mode is based on our proposed index distributed storage technology, and the level II security mode is based on asymmetric scalar-product-preserving encryption (ASPE) technology suitable for inner product calculation.
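The scalar-product-preserving property that the level II security mode relies on can be illustrated with a minimal sketch. It assumes the basic ASPE form of Wong et al., in which an index vector p is encrypted as M^T p and a query q as M^{-1} q under a secret invertible matrix M; the helper names (`random_invertible_matrix`, `invert`, `mat_vec`) and the toy vectors are hypothetical, and the additional steps used in the full scheme are not reproduced here.

```python
import random
from fractions import Fraction


def invert(m):
    """Gauss-Jordan inversion over exact rationals; returns None if m is singular."""
    n = len(m)
    aug = [row[:] + [Fraction(int(i == j)) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def random_invertible_matrix(n, rng):
    """Draw random integer matrices until one is invertible; return it with its inverse."""
    while True:
        m = [[Fraction(rng.randint(-5, 5)) for _ in range(n)] for _ in range(n)]
        inv = invert(m)
        if inv is not None:
            return m, inv


def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


rng = random.Random(0)
dim = 4
M, M_inv = random_invertible_matrix(dim, rng)  # secret key (assumed held by the trusted side)

p = [Fraction(x) for x in (3, -1, 2, 5)]   # toy index (feature) vector
q = [Fraction(x) for x in (1, 4, 0, -2)]   # toy query vector

enc_index = mat_vec(transpose(M), p)  # encrypted index: M^T p
trapdoor = mat_vec(M_inv, q)          # query trapdoor:  M^{-1} q

# (M^T p) . (M^{-1} q) = p^T M M^{-1} q = p . q
assert dot(enc_index, trapdoor) == dot(p, q)
print(dot(p, q), dot(enc_index, trapdoor))
```

Exact rational arithmetic is used only so that the equality holds without rounding error; a practical implementation would work with floating-point embeddings and compare inner products up to a tolerance.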
2. Related Work
2.1. Privacy-Preserving Image Retrieval Schemes
- Images are encrypted and then uploaded to the cloud for feature extraction [1,2,3,4,5], and the extracted features are used to build an unsupervised or supervised retrieval model. This approach lets the feature extraction task be outsourced to the cloud. Although encryption preserves the privacy of images on the cloud, accurately and comprehensively extracting dense semantic features from encrypted images is challenging [1,2,3]. Most existing solutions can only extract coarse-grained features and use simple retrieval models [18,19], such as k-means and bag-of-words models, which struggle to learn the complex nonlinear features in images. In addition, any multi-level feature extraction method must be compatible with the encryption algorithm used. Existing schemes therefore generally extract weak features that differ from, but are somewhat related to, the original images, and they must design encryption algorithms and multi-level weak feature extraction algorithms that are compatible with each other [5]. These approaches ensure a certain degree of privacy and offload feature extraction to the cloud, but replacing the complex fine-grained semantic features of the original images with manually designed weak features significantly reduces retrieval accuracy.
- Homomorphic encryption is used to modify the retrieval model on the cloud to achieve secure inference [6,7,8,9]. Although these schemes can accomplish secure image retrieval tasks, they have inherent limitations. Firstly, homomorphic encryption algorithms only support integer-type data, while the data and model parameters in machine learning are typically floating-point numbers. Secondly, fully homomorphic encryption (FHE) does not support nonlinear operations and can only approximate results using approximate functions [20]. However, deep learning models that accurately capture data features involve numerous nonlinear operations, leading to a loss of computational accuracy in the original models.
- Features are extracted from original images before encryption [10,11,12]. Then, the encrypted images and their encrypted features are sent to the cloud. For retrieval, the user needs to construct a query trapdoor and send it to the cloud; then, the cloud calculates the similarity between the query trapdoor and the encrypted indexes and returns the most relevant results. This type of scheme transfers the feature extraction work to the user, increasing the user’s workload. However, it allows the use of larger models to extract complex dense semantic features from data, enabling cross-modal retrieval and ensuring maximum retrieval accuracy.
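The accuracy loss attributed above to homomorphic-encryption-based schemes can be made concrete with a small sketch. It assumes, purely for illustration, that a sigmoid activation is replaced by its degree-3 Taylor polynomial around 0, which uses only additions and multiplications; the function names and the choice of activation are hypothetical and are not taken from the cited schemes.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_poly3(x):
    """Degree-3 Taylor polynomial of the sigmoid around 0 (additions and multiplications only)."""
    return 0.5 + x / 4.0 - x ** 3 / 48.0


def max_abs_error(lo, hi, steps=1000):
    """Largest gap between the sigmoid and its polynomial stand-in on an evenly spaced grid."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(abs(sigmoid(x) - sigmoid_poly3(x)) for x in xs)


err_narrow = max_abs_error(-1.0, 1.0)
err_wide = max_abs_error(-4.0, 4.0)
print(f"max error on [-1, 1]: {err_narrow:.4f}")
print(f"max error on [-4, 4]: {err_wide:.4f}")
```

The approximation is tight near zero and degrades quickly as inputs move away from it, which is one way the precision of a deep model can suffer when its nonlinear operations are replaced.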
2.2. Privacy-Preserving Cross-Modal Image–Text Retrieval Schemes
3. Preliminary Knowledge
3.1. Contrastive Language-Image Pre-Training (CLIP)
3.2. Hierarchical Navigable Small World Graphs (HNSW)
3.3. Singular Value Decomposition (SVD)
3.4. Asymmetric Scalar-Product-Preserving Encryption (ASPE)
4. System Framework
4.1. System Composition
- Data Owner (DO): Fully trusted, can access all plaintext data in the system, and has unique key management permissions.
- Cloud Server (CS): “Honest but curious”, meaning it will honestly execute user query requests but will also try to peek into users’ privacy. It hosts the retrieval model used for index construction and searches.
- Private Server (PS): Fully trusted and attached to DO. It can share all keys with DO and hosts the feature extraction model used for extracting features from original images and text data.
- Data User (DU): Needs prior authorization from DO to query, sends text or image queries, and receives the related query results.
4.2. System Operation
5. PITR: Privacy-Preserving Image–Text Retrieval Scheme
5.1. Feature Extraction Model
5.1.1. General Domain Feature Extraction Model
5.1.2. Specialized Domain Feature Extraction Model
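As a toy illustration of the rescaling idea behind the specialized adapter described in the Introduction, the sketch below multiplies the output of a frozen linear layer element-wise by a learnable vector. The single layer, its dimensions, and the names `frozen_linear` and `rescaled_linear` are assumptions made for the example; the actual positions of the vectors inside the attention and feed-forward layers, and the reported 0.017% figure, are not reproduced.

```python
import random


def frozen_linear(weights, x):
    """Frozen layer: plain matrix-vector product; the weights are never updated."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def rescaled_linear(weights, scale, x):
    """Same frozen layer, with a learnable vector scaling each output element."""
    return [s * h for s, h in zip(scale, frozen_linear(weights, x))]


rng = random.Random(0)
d_in, d_out = 8, 4
W = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weights
x = [rng.uniform(-1, 1) for _ in range(d_in)]

scale = [1.0] * d_out  # trainable; initialised to ones so the model starts unchanged
assert rescaled_linear(W, scale, x) == frozen_linear(W, x)

scale[2] = 0.5  # stand-in for a training update to one entry
y = rescaled_linear(W, scale, x)

trainable = len(scale)
total = d_in * d_out + trainable
print(f"trainable fraction in this toy layer: {trainable / total:.2%}")
```

Because each vector has only as many entries as the layer has outputs, the number of trainable parameters grows linearly with layer width while the frozen weights grow quadratically, which is why the trainable fraction shrinks for larger models.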
5.1.3. Dimensionality Reduction of the Extracted Embedding Vectors
5.2. Encryption Model
5.2.1. Symbol Definition and Description
5.2.2. The Operation of the Encryption Model in the Level I Security Mode
5.2.3. The Operation of the Encryption Model in the Level II Security Mode
5.3. Retrieval Model
5.3.1. Index Building and Search Process
Algorithm 1 Index structure construction
Input: the encrypted feature embedding set.
Output: the hierarchical graph-based index structure.
Preparatory:
- Nearest-node search: from a given starting node, find the nearest node to s at layer l.
- Neighbor selection: select the m nearest nodes to s from the set W at level l.
- Connection: add bidirectional connections between M and s.
- Layer assignment: assign layers to s using a random function.
- W: the current nearest neighbor element set.
- m: the number of connections each node needs to establish with other nodes.
- Candidate set size: the size of the dynamic candidate set.
Algorithm 2 Search in index structure
Input: the index structure and the query node q; the number of nearest neighbor nodes to be returned, k.
Output: the k nodes nearest to q.
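The search in Algorithm 2 can be sketched on a single graph layer. This is a simplified illustration in the spirit of the layer search of Malkov and Yashunin [25], not the scheme's implementation: it works on plaintext toy vectors rather than encrypted embeddings and trapdoors, uses squared Euclidean distance, omits the layer hierarchy, and replaces incremental index construction with a brute-force neighbor graph. The names `search_layer`, `knn_graph`, and `ef` are hypothetical.

```python
import heapq
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_layer(vectors, graph, query, entry, ef):
    """Best-first search on one layer; returns up to ef (distance, node) pairs, nearest first."""
    d0 = sq_dist(vectors[entry], query)
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    results = [(-d0, entry)]     # max-heap via negation: farthest kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0]:
            break  # closest candidate is farther than the worst kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = sq_dist(vectors[nb], query)
            if len(results) < ef or d_nb < -results[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(results, (-d_nb, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the farthest kept result
    return sorted((-neg, n) for neg, n in results)


def knn_graph(vectors, m):
    """Brute-force stand-in for index construction: link each node to its m nearest nodes."""
    graph = {i: set() for i in range(len(vectors))}
    for i, v in enumerate(vectors):
        order = sorted((j for j in range(len(vectors)) if j != i),
                       key=lambda j: sq_dist(v, vectors[j]))
        for j in order[:m]:
            graph[i].add(j)  # bidirectional connection
            graph[j].add(i)
    return graph


rng = random.Random(0)
vectors = [[rng.random() for _ in range(8)] for _ in range(200)]
graph = knn_graph(vectors, m=8)
query = [rng.random() for _ in range(8)]

k = 5
found = [n for _, n in search_layer(vectors, graph, query, entry=0, ef=50)[:k]]
exact = sorted(range(len(vectors)), key=lambda j: sq_dist(vectors[j], query))[:k]
print("graph search:", found)
print("brute force :", exact)
```

The graph search visits only part of the data set, so its result is approximate; a larger candidate set size trades speed for a higher chance of matching the brute-force answer.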
5.3.2. Similarity Calculation
5.4. Model Security Analysis
6. Experiment and Performance Analysis
6.1. Zero-Shot Image Classification Task
6.1.1. Image Classification in the General Data Domain
6.1.2. Image Classification in the Specialized Data Domain
6.2. Image–Text Retrieval in General Domain Data
6.3. Image–Text Retrieval in Specialized Domain Data
6.4. Retrieval Speed Experiment
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Ferreira, B.; Rodrigues, J.; Leitao, J.; Domingos, H. Practical Privacy-Preserving Content-Based Retrieval in Cloud Image Repositories. IEEE Trans. Cloud Comput. 2019, 7, 784–798.
2. Liu, D.; Shen, J.; Xia, Z.; Sun, X. A Content-Based Image Retrieval Scheme Using an Encrypted Difference Histogram in Cloud Computing. Information 2017, 8, 96.
3. Xia, Z.; Jiang, L.; Ma, X.; Yang, W.; Ji, P.; Xiong, N. A Privacy-Preserving Outsourcing Scheme for Image Local Binary Pattern in Secure Industrial Internet of Things. IEEE Trans. Ind. Inform. 2019, 16, 629–638.
4. Ma, W.; Zhou, T.; Qin, J.; Xiang, X.; Tan, Y.; Cai, Z. A privacy-preserving content-based image retrieval method based on deep learning in cloud computing. Expert Syst. Appl. 2022, 203, 117508.
5. Feng, Q.; Li, P.; Lu, Z.; Li, C.; Wang, Z.; Liu, Z.; Duan, C.; Huang, F.; Weng, J.; Yu, P.S. Evit: Privacy-preserving image retrieval via encrypted vision transformer in cloud computing. arXiv 2024, arXiv:2208.14657.
6. Zhang, L.; Jung, T.; Feng, P.; Liu, K.; Liu, Y. PIC: Enable Large-Scale Privacy Preserving Content-Based Image Search on Cloud. In Proceedings of the International Conference on Parallel Processing, Beijing, China, 1–4 September 2015.
7. Bellafqira, R.; Coatrieux, G.; Bouslimi, D.; Quellec, G. Content-based image retrieval in homomorphic encryption domain. In Proceedings of the 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Milan, Italy, 25–29 August 2015; pp. 2944–2947.
8. Gilad-Bachrach, R.; Dowlin, N.; Laine, K.; Lauter, K.; Naehrig, M.; Wernsing, J. Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy. In Proceedings of the International Conference on Machine Learning, New York City, NY, USA, 19–24 June 2016; pp. 201–210.
9. Juvekar, C.; Vaikuntanathan, V.; Chandrakasan, A. Gazelle: A Low Latency Framework for Secure Neural Network Inference. arXiv 2018, arXiv:1801.05507.
10. Cheng, B.; Zhuo, L.; Bai, Y.; Peng, Y.; Zhang, J. Secure index construction for privacy-preserving large-scale image retrieval. In Proceedings of the 2014 IEEE Fourth International Conference on Big Data and Cloud Computing, Sydney, NSW, Australia, 3–5 December 2014; pp. 116–120.
11. Li, Y.; Ma, J.; Miao, Y.; Wang, Y.; Liu, X.; Choo, K.K.R. Similarity search for encrypted images in secure cloud computing. IEEE Trans. Cloud Comput. 2020, 10, 1142–1155.
12. Huang, J.; Luo, Y.; Xu, M.; Fu, S.; Huang, K. Accelerating privacy-preserving image retrieval with multi-index hashing. In Proceedings of the 2022 IEEE/ACM 7th Symposium on Edge Computing (SEC), Seattle, WA, USA, 5–8 December 2022; pp. 492–497.
13. Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742.
14. Li, J.; Huang, Y.; Wei, Y.; Lv, S.; Liu, Z.; Dong, C.; Lou, W. Searchable symmetric encryption with forward search privacy. IEEE Trans. Dependable Secur. Comput. 2019, 18, 460–474.
15. Wang, B.; Song, W.; Lou, W.; Hou, Y.T. Inverted index based multi-keyword public-key searchable encryption with strong privacy guarantee. In Proceedings of the 2015 IEEE Conference on Computer Communications (INFOCOM), Kowloon, Hong Kong, 26 April–1 May 2015; pp. 2092–2100.
16. Wang, B.; Hou, Y.; Li, M.; Wang, H.; Li, H. Maple: Scalable multi-dimensional range search over encrypted cloud data with tree-based index. In Proceedings of the 9th ACM Symposium on Information, Computer and Communications Security, Kyoto, Japan, 4–6 June 2014; pp. 111–122.
17. Andola, N.; Prakash, S.; Yadav, V.K.; Raghav; Venkatesan, S.; Verma, S. A secure searchable encryption scheme for cloud using hash-based indexing. J. Comput. Syst. Sci. 2022, 126, 119–137.
18. Liang, H.; Zhang, X.; Cheng, H. Huffman-code based retrieval for encrypted JPEG images. J. Vis. Commun. Image Represent. 2019, 61, 149–156.
19. Xia, Z.; Jiang, L.; Liu, D.; Lu, L.; Jeon, B. BOEW: A content-based image retrieval scheme using bag-of-encrypted-words in cloud computing. IEEE Trans. Serv. Comput. 2019, 15, 202–214.
20. Marcolla, C.; Sucasas, V.; Manzano, M.; Bassoli, R.; Fitzek, F.H.; Aaraj, N. Survey on fully homomorphic encryption, theory, and applications. Proc. IEEE 2022, 110, 1572–1609.
21. Hu, S.; Zhang, L.Y.; Wang, Q.; Qin, Z.; Wang, C. Towards private and scalable cross-media retrieval. IEEE Trans. Dependable Secur. Comput. 2019, 18, 1354–1368.
22. Zhu, L.; Song, J.; Yang, Z.; Huang, W.; Zhang, C.; Yu, W. DAP²CMH: Deep adversarial privacy-preserving cross-modal hashing. Neural Process. Lett. 2022, 54, 2549–2569.
23. Wang, Z.; Qin, J.; Xiang, X.; Tan, Y.; Peng, J. A privacy-preserving cross-media retrieval on encrypted data in cloud computing. J. Inf. Secur. Appl. 2023, 73, 103440.
24. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763.
25. Malkov, Y.A.; Yashunin, D.A. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 824–836.
26. Liu, Y.; Fu, Z. Secure search service based on word2vec in the public cloud. Int. J. Comput. Sci. Eng. 2019, 18, 305–313.
27. Fu, Z.; Wang, Y.; Sun, X.; Zhang, X. Semantic and secure search over encrypted outsourcing cloud based on BERT. Front. Comput. Sci. 2022, 16, 162802.
28. Stewart, G.W. On the early history of the singular value decomposition. SIAM Rev. 1993, 35, 551–566.
29. Shishkin, S.L.; Shalaginov, A.; Bopardikar, S.D. Fast approximate truncated SVD. Numer. Linear Algebra Appl. 2019, 26, e2246.
30. Wong, W.K.; Cheung, D.W.l.; Kao, B.; Mamoulis, N. Secure kNN computation on encrypted databases. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, Providence, RI, USA, 29 June–2 July 2009; pp. 139–152.
31. Lei, X.; Tu, G.H.; Liu, A.X.; Xie, T. Fast and secure knn query processing in cloud computing. In Proceedings of the 2020 IEEE Conference on Communications and Network Security (CNS), Avignon, France, 29 June–1 July 2020; pp. 1–9.
32. Cao, N.; Wang, C.; Li, M.; Ren, K.; Lou, W. Privacy-preserving multi-keyword ranked search over encrypted cloud data. IEEE Trans. Parallel Distrib. Syst. 2013, 25, 222–233.
33. Yang, A.; Pan, J.; Lin, J.; Men, R.; Zhang, Y.; Zhou, J.; Zhou, C. Chinese clip: Contrastive vision-language pretraining in chinese. arXiv 2022, arXiv:2211.01335.
34. Liu, X.; Zheng, Y.; Du, Z.; Ding, M.; Qian, Y.; Yang, Z.; Tang, J. GPT understands, too. arXiv 2023, arXiv:2103.10385.
35. Lester, B.; Al-Rfou, R.; Constant, N. The power of scale for parameter-efficient prompt tuning. arXiv 2021, arXiv:2104.08691.
36. Yu, Y.; Si, X.; Hu, C.; Zhang, J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019, 31, 1235–1270.
37. Desai, M.; Shah, M. An anatomization on breast cancer detection and diagnosis employing multi-layer perceptron neural network (MLP) and Convolutional neural network (CNN). Clin. eHealth 2021, 4, 1–11.
38. Liu, H.; Tam, D.; Muqeeth, M.; Mohta, J.; Huang, T.; Bansal, M.; Raffel, C.A. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Adv. Neural Inf. Process. Syst. 2022, 35, 1950–1965.
39. Abdi, H.; Williams, L.J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 433–459.
40. Anowar, F.; Sadaoui, S.; Selim, B. Conceptual and empirical comparison of dimensionality reduction algorithms (pca, kpca, lda, mds, svd, lle, isomap, le, ica, t-sne). Comput. Sci. Rev. 2021, 40, 100378.
41. Zhou, X.; He, J.; Huang, G.; Zhang, Y. SVD-based incremental approaches for recommender systems. J. Comput. Syst. Sci. 2015, 81, 717–733.
42. Zhang, K.; Xu, S.; Li, P.; Zhang, D.; Wang, W.; Zou, B. CRE: An Efficient Ciphertext Retrieval Scheme Based on Encoder. In Proceedings of the International Conference on Neural Information Processing; Springer: Berlin/Heidelberg, Germany, 2023; pp. 117–130.
| Symbol | Description | Symbol | Description |
|---|---|---|---|
| | Document set | | Sentence embedding set |
| | Sentence set | | Image embedding set |
| | Image set | | Image–text embedding set |
| | Encrypted document set | | Encrypted image–text embedding set |
| | Incomplete image–text embedding matrix | | Encrypted incomplete image–text embedding matrix |
| | Embedding of query | | Query trapdoor |
Scheme | Caltech-101 | CIFAR-100 | CIFAR-10 | Pets | Food | Flowers | DTD | EuroSAT |
---|---|---|---|---|---|---|---|---|
BriVL | - | 35.9 | 72.3 | - | - | - | - | - |
Wukong | - | 77.1 | 95.4 | - | - | - | - | - |
CN-CLIP | 90.6 | 79.7 | 96.0 | 83.5 | 74.6 | 68.4 | 51.2 | 52.0 |
PITR | 92.1 | 81.4 | 96.5 | 87.0 | 83.3 | 81.4 | 79.6 | 70.0 |
Scheme | R@1 | R@5 | R@10 |
---|---|---|---|
Wukong | 76.1 | 94.8 | 97.5 |
R2D2 | 77.6 | 96.7 | 98.9 |
CN-CLIP | 81.6 | 97.5 | 98.8 |
PITR | 82.8 | 97.3 | 99.2 |
Scheme | R@1 | R@5 | R@10 |
---|---|---|---|
Wukong | 53.4 | 80.2 | 90.1 |
R2D2 | 56.4 | 85.0 | 93.1 |
CN-CLIP | 69.2 | 89.9 | 96.1 |
PITR | 70.4 | 91.1 | 96.9 |
Dataset | R@1 | R@5 | R@10 |
---|---|---|---|
Pets | 87.0 | 96.3 | 99.7 |
DTD | 79.6 | 86.2 | 95.8 |
EuroSAT | 70.0 | 88.4 | 100 |
Flowers | 81.7 | 89.9 | 96.2 |
Dataset | R@1 | R@5 | R@10 |
---|---|---|---|
Pets | 88.1 | 98.5 | 100 |
DTD | 83.0 | 91.9 | 98.4 |
EuroSAT | 71.8 | 85.2 | 93.7 |
Flowers | 84.4 | 96.3 | 99.8 |
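For reference, the R@K columns in the retrieval tables follow the usual Recall@K definition, which can be sketched as below. The sketch assumes exactly one correct candidate per query and uses made-up similarity scores; the function name `recall_at_k` is hypothetical, and evaluation protocols with several correct items per query (for example, multiple captions per image) typically count a hit if any of them appears in the top K.

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose correct item is among the k highest-scoring candidates."""
    hits = 0
    for scores, target in zip(similarity, ground_truth):
        top_k = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        hits += target in top_k
    return hits / len(similarity)


# Rows are queries, columns are candidates; the values are made-up similarity scores.
similarity = [
    [0.9, 0.1, 0.3, 0.2],   # correct item 0 is ranked 1st
    [0.4, 0.5, 0.8, 0.1],   # correct item 1 is ranked 2nd
    [0.7, 0.6, 0.2, 0.5],   # correct item 2 is ranked 4th
]
ground_truth = [0, 1, 2]

for k in (1, 2, 4):
    print(f"R@{k} = {100 * recall_at_k(similarity, ground_truth, k):.1f}")
```

Recall@K is non-decreasing in K, which matches the pattern of R@1, R@5, and R@10 within each table row.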
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, K.; Xu, S.; Song, Y.; Xu, Y.; Li, P.; Yang, X.; Zou, B.; Wang, W. An Efficient Cross-Modal Privacy-Preserving Image–Text Retrieval Scheme. Symmetry 2024, 16, 1084. https://doi.org/10.3390/sym16081084