Disambiguity and Alignment: An Effective Multi-Modal Alignment Method for Cross-Modal Recipe Retrieval
Abstract
1. Introduction
- Q1: How can we measure the similarity between ambiguous food images guided by their corresponding recipes?
- Q2: How can we further improve the fine-grained semantic alignment between ingredients and instructions within each recipe to support food image similarity measurement?
- We propose a novel framework, MMACMR, which addresses the problem of ambiguous food images in cross-modal recipe retrieval;
- We introduce a novel deep learning strategy, MDA, which promotes the alignment of the two modalities without adding new parameters;
- We enhance the representation of recipes by attending to important ingredients within instructions at the sentence level;
- We conduct extensive experiments on the challenging Recipe1M dataset; the results demonstrate that the proposed technique outperforms several state-of-the-art methods.
2. Method
2.1. Notations and Problem Formulations
2.1.1. Notations
2.1.2. Problem Formulations
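Only the heading survives here, but the notation table later in this article (a dataset of image-recipe pairs, two modality encoders, and their parameters) pins down the usual joint-embedding formulation. A minimal sketch in that spirit; all symbols below are our placeholders, since the original symbol column did not survive:

```latex
% Hedged sketch: cross-modal recipe retrieval as joint embedding learning.
% \mathcal{D}, v_i, r_i, \phi_v, \phi_r, \theta_v, \theta_r are placeholders,
% not necessarily the paper's own notation.
\mathcal{D} = \{(v_i, r_i)\}_{i=1}^{N}, \qquad
\phi_v(\cdot\,;\theta_v),\; \phi_r(\cdot\,;\theta_r) \colon
  \text{encoders into a shared space } \mathbb{R}^d,
\qquad
\min_{\theta_v,\,\theta_r} \; \sum_{i=1}^{N}
  \mathcal{L}\bigl(\phi_v(v_i;\theta_v),\, \phi_r(r_i;\theta_r)\bigr).
```

Retrieval then ranks all candidates of one modality by cosine similarity to the query embedding from the other.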
2.2. Framework Overview
2.2.1. Image Encoder
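The body of this subsection is not reproduced here. As a minimal sketch of a ViT-based image encoder with a projection head (the table in Section 3.2 suggests ViT variants are the relevant comparison points), assuming PyTorch/torchvision; the vit_b_16 backbone, the 1024-dimensional shared space, and the class name are illustrative assumptions rather than the paper's stated design:

```python
# Hedged sketch of a ViT-based image encoder with a projection head.
# vit_b_16 and emb_dim=1024 are assumptions, not the paper's stated choices.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

class ImageEncoder(nn.Module):
    def __init__(self, emb_dim: int = 1024):
        super().__init__()
        self.backbone = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
        self.backbone.heads = nn.Identity()  # keep the 768-d [CLS] feature
        self.proj = nn.Linear(768, emb_dim)  # map into the shared space

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)  # (B, 768)
        emb = self.proj(feats)         # (B, emb_dim)
        return nn.functional.normalize(emb, dim=-1)  # unit-norm embeddings
```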
2.2.2. Improved Recipe Encoder
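Per the contribution list, this encoder focuses on important ingredients within instructions at the sentence level. A hedged sketch of one way to realize that with cross-attention from instruction sentences to ingredient entries; layer sizes, mean pooling, and all names are assumptions, and the title/ingredient branches of the full recipe encoder are omitted:

```python
# Hedged sketch: instruction sentences attend to ingredient embeddings so
# that sentence representations emphasize the ingredients they mention.
import torch
import torch.nn as nn

class IngredientGuidedInstructionEncoder(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, inst_sents, ingr_embs):
        # inst_sents: (B, S, dim) sentence embeddings of the instructions
        # ingr_embs:  (B, I, dim) embeddings of the ingredient entries
        attended, _ = self.cross_attn(query=inst_sents,
                                      key=ingr_embs, value=ingr_embs)
        fused = self.norm(inst_sents + attended)  # residual fusion
        return fused.mean(dim=1)                  # (B, dim) instruction vector
```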
2.3. Multi-Modal Disambiguity and Alignment
2.3.1. Inter-Modal Alignment: N-Pairs Triplet Loss
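A minimal PyTorch sketch of a bidirectional N-pairs triplet loss, in which every non-matching item in the mini-batch serves as a negative for both retrieval directions; the margin value and function name are assumptions:

```python
# Hedged sketch of a bidirectional N-pairs triplet loss over a mini-batch.
import torch
import torch.nn.functional as F

def n_pairs_triplet_loss(img, rec, margin: float = 0.3):
    # img, rec: (B, d) L2-normalized embeddings; row i of each is a true pair
    sim = img @ rec.t()            # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)  # similarity of each matched pair
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # image-to-recipe: off-diagonal recipes in each row act as negatives
    i2r = F.relu(margin + sim - pos).masked_fill(mask, 0.0)
    # recipe-to-image: transpose so recipes become the anchors
    r2i = F.relu(margin + sim.t() - pos).masked_fill(mask, 0.0)
    return (i2r.mean() + r2i.mean()) / 2
```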
2.3.2. Intra-Modal Alignment with Disambiguity: RGI Loss
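The outline names the loss but not its form. Under our reading of "Recipe Guide Image", the recipe-side similarity structure supervises the image-side one, so that visually ambiguous dishes are pulled apart exactly where their recipes differ. A sketch under that assumption; the KL-divergence form and the temperature are ours, not the paper's definition:

```python
# Hedged sketch of a plausible RGI loss: recipe-recipe similarities, treated
# as a detached soft target, guide the image-image similarity distribution.
import torch
import torch.nn.functional as F

def rgi_loss(img, rec, tau: float = 0.1):
    # img, rec: (B, d) L2-normalized embeddings
    img_sim = img @ img.t() / tau  # intra-modal image similarities
    rec_sim = rec @ rec.t() / tau  # intra-modal recipe similarities
    target = F.softmax(rec_sim, dim=1).detach()  # recipes guide, no gradient
    log_pred = F.log_softmax(img_sim, dim=1)
    return F.kl_div(log_pred, target, reduction="batchmean")
```

Because the target is detached, gradients flow only through the image branch, which matches the intuition that recipes disambiguate images rather than the reverse.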
2.3.3. Total Loss
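No formula survives in this outline; a plausible combination of the two objectives above, with $\lambda$ an assumed trade-off weight:

```latex
% Assumed form of the total objective; \lambda is a hypothetical weight.
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{tri}} + \lambda\,\mathcal{L}_{\text{RGI}}
```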
2.4. Optimization
Algorithm 1 Optimization procedure of MMACMR
Input: a cross-modal recipe dataset and the number of epochs T. Output: the parameters of the two modality encoders.
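A hedged sketch of the loop that Algorithm 1 describes, reusing the loss sketches from Sections 2.3.1 and 2.3.2; the optimizer, learning rate, and weight lam are illustrative assumptions:

```python
# Hedged sketch of Algorithm 1: iterate T epochs over the paired data,
# encode both modalities, combine the two losses, update both encoders.
import torch

def train(image_enc, recipe_enc, loader, T: int, lam: float = 1.0):
    params = list(image_enc.parameters()) + list(recipe_enc.parameters())
    opt = torch.optim.Adam(params, lr=1e-4)  # assumed optimizer settings
    for epoch in range(T):
        for images, recipes in loader:      # mini-batches of matched pairs
            img = image_enc(images)         # (B, d) image embeddings
            rec = recipe_enc(recipes)       # (B, d) recipe embeddings
            loss = n_pairs_triplet_loss(img, rec) + lam * rgi_loss(img, rec)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return image_enc, recipe_enc  # the learned encoder parameters
```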
3. Experiments and Discussion
3.1. Experiment Settings
3.1.1. Dataset
3.1.2. Baselines
- CCA [9] stands for Canonical Correlation Analysis, a classical statistical method used to learn a joint embedding space;
- JE [9] was the first to conduct the cross-modal recipe retrieval task on the Recipe1M dataset. It uses a joint encoder and a classifier to learn the information from food images and recipes;
- AdaMin [10] combines the retrieval loss with a classification loss to improve model robustness and proposes a novel strategy for mining significant triplets;
- R2GAN [35] promotes the modality alignment by employing a GAN mechanism equipped with two discriminators and one generator;
- MCEN [14] bridges the semantic gap between the two modalities using stochastic latent variable models;
- ACME [34] learns cross-modal embeddings with adversarial networks to align cooking recipes and food images;
- SN [16] employs three attention mechanisms on three components of recipes to capture the relationship between sentences;
- SCAN [13] introduces semantic consistency loss to regularize the representations of images and recipes;
- HF-ICMA [20] exploits the global and local similarity between the two modalities by considering inter- and intra-modal fusion;
- SEJE [22] constructs a two-phase feature framework that separates data pre-processing from model training to extract additional semantic information;
- M-SIA [17] argues that multiple aspects in recipes are related to multiple regions in food images and leverages multi-head attention to bridge them;
- X-MRS [39] augments recipe representations by utilizing multilingual translation;
- LCWF-GI [31] employs latent weight factors to fuse the three components of recipes by considering their complex interaction;
- H-T [29] captures latent semantic information in recipes by applying a self-supervised loss that pulls components of the same recipe closer together;
- LMF-CSF [30] introduces a low-rank fusion strategy to combine the components in recipes and generate superior representations.
3.1.3. Evaluation Criteria
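MedR and R@K are the standard Recipe1M criteria from [9]: rank each query's true counterpart inside a sampled pool of 1 K or 10 K pairs, then report the median rank of the match (MedR, lower is better) and the fraction of queries whose match lands in the top K (R@K, higher is better). A simplified sketch over a single pool, omitting the protocol's repeated sampling:

```python
# Sketch: MedR and R@K for image-to-recipe retrieval from paired embeddings.
import torch

def medr_and_recall(img, rec, ks=(1, 5, 10)):
    # img, rec: (N, d) L2-normalized embeddings; row i of each is a true pair
    sim = img @ rec.t()                              # (N, N) similarities
    order = sim.argsort(dim=1, descending=True)      # ranked recipe indices
    idx = torch.arange(sim.size(0), device=sim.device).unsqueeze(1)
    ranks = (order == idx).nonzero()[:, 1] + 1       # 1-based rank of match
    medr = ranks.float().median().item()
    recalls = {f"R@{k}": (ranks <= k).float().mean().item() for k in ks}
    return medr, recalls
```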
3.1.4. Implementation Details
3.1.5. Experimental Environment
3.2. Comparison with State-of-the-Art Methods
3.3. Scalability Analysis
3.4. Ablation Studies
3.5. Qualitative Results
3.5.1. Qualitative Results on Image-to-Recipe Retrieval
3.5.2. Qualitative Results on Recipe-to-Image Retrieval
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Full Form
---|---
MMACMR | Multi-Modal Alignment Method for Cross-Modal Recipe Retrieval
MDA | Multi-Modal Disambiguity and Alignment
RGI | Recipe Guide Image
AI | Artificial Intelligence
LSTM | Long Short-Term Memory
GANs | Generative Adversarial Networks
ViT | Vision Transformer
CLIP | Contrastive Language–Image Pre-training
KNN | K-Nearest Neighbors
SOTA | State Of The Art
MedR | Median Rank
SSD | Solid-State Disk
RAM | Random-Access Memory
HDD | Hard Disk Drive
CCA | Canonical Correlation Analysis
References
- Guo, Z.; Jayan, H. Fast Nondestructive Detection Technology and Equipment for Food Quality and Safety. Foods 2023, 12, 3744.
- Guo, Z.; Wu, X.; Jayan, H.; Yin, L.; Xue, S.; El-Seedi, H.R.; Zou, X. Recent developments and applications of surface enhanced Raman scattering spectroscopy in safety detection of fruits and vegetables. Food Chem. 2023, 434, 137469.
- Thames, Q.; Karpur, A.; Norris, W.; Xia, F.; Panait, L.; Weyand, T.; Sim, J. Nutrition5k: Towards automatic nutritional understanding of generic food. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8903–8911.
- Min, W.; Wang, Z.; Liu, Y.; Luo, M.; Kang, L.; Wei, X.; Wei, X.; Jiang, S. Large scale visual food recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9932–9949.
- Min, W.; Wang, Z.; Yang, J.; Liu, C.; Jiang, S. Vision-based fruit recognition via multi-scale attention CNN. Comput. Electron. Agric. 2023, 210, 107911.
- Min, W.; Liu, C.; Xu, L.; Jiang, S. Applications of knowledge graphs for food science and industry. Patterns 2022, 3, 100484.
- Wang, W.; Min, W.; Li, T.; Dong, X.; Li, H.; Jiang, S. A review on vision-based analysis for automatic dietary assessment. Trends Food Sci. Technol. 2022, 122, 223–237.
- Liu, Y.; Min, W.; Jiang, S.; Rui, Y. Convolution-Enhanced Bi-Branch Adaptive Transformer with Cross-Task Interaction for Food Category and Ingredient Recognition. IEEE Trans. Image Process. 2024, 33, 2572–2586.
- Salvador, A.; Hynes, N.; Aytar, Y.; Marin, J.; Ofli, F.; Weber, I.; Torralba, A. Learning cross-modal embeddings for cooking recipes and food images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3020–3028.
- Carvalho, M.; Cadène, R.; Picard, D.; Soulier, L.; Thome, N.; Cord, M. Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 35–44.
- Min, W.; Zhou, P.; Xu, L.; Liu, T.; Li, T.; Huang, M.; Jin, Y.; Yi, Y.; Wen, M.; Jiang, S.; et al. From Plate to Production: Artificial Intelligence in Modern Consumer-Driven Food Systems. arXiv 2023, arXiv:2311.02400.
- Guo, Z.; Zhang, Y.; Wang, J.; Liu, Y.; Jayan, H.; El-Seedi, H.R.; Alzamora, S.M.; Gómez, P.L.; Zou, X. Detection model transfer of apple soluble solids content based on NIR spectroscopy and deep learning. Comput. Electron. Agric. 2023, 212, 108127.
- Wang, H.; Sahoo, D.; Liu, C.; Shu, K.; Achananuparp, P.; Lim, E.P.; Hoi, S.C. Cross-modal food retrieval: Learning a joint embedding of food images and recipes with semantic consistency and attention mechanism. IEEE Trans. Multimed. 2021, 24, 2515–2525.
- Fu, H.; Wu, R.; Liu, C.; Sun, J. MCEN: Bridging cross-modal gap between cooking recipes and dish images with latent variable model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14570–14580.
- Chen, Y.; Zhou, D.; Li, L.; Han, J.M. Multimodal encoders for food-oriented cross-modal retrieval. In Proceedings of the Web and Big Data: 5th International Joint Conference, APWeb-WAIM 2021, Guangzhou, China, 23–25 August 2021; Proceedings, Part II 5. Springer International Publishing: Cham, Switzerland, 2021; pp. 253–266.
- Zan, Z.; Li, L.; Liu, J.; Zhou, D. Sentence-based and noise-robust cross-modal retrieval on cooking recipes and food images. In Proceedings of the 2020 International Conference on Multimedia Retrieval, Dublin, Ireland, 8–11 June 2020; pp. 117–125.
- Li, L.; Li, M.; Zan, Z.; Xie, Q.; Liu, J. Multi-subspace implicit alignment for cross-modal retrieval on cooking recipes and food images. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Virtual Event, 1–5 November 2021; pp. 3211–3215.
- Shukor, M.; Couairon, G.; Grechka, A.; Cord, M. Transformer decoders with multimodal regularization for cross-modal food retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4567–4578.
- Li, L.; Hu, C.; Zhang, H.; Maradapu Vera Venkata sai, A. Cross-modal Image-Recipe Retrieval via Multimodal Fusion. In Proceedings of the 5th ACM International Conference on Multimedia in Asia, Taiwan, China, 6–8 December 2023; pp. 1–7.
- Li, J.; Sun, J.; Xu, X.; Yu, W.; Shen, F. Cross-modal image-recipe retrieval via intra- and inter-modality hybrid fusion. In Proceedings of the 2021 International Conference on Multimedia Retrieval, Taipei, Taiwan, 21–24 August 2021; pp. 173–182.
- Chen, J.J.; Ngo, C.W.; Feng, F.L.; Chua, T.S. Deep understanding of cooking procedure for cross-modal recipe retrieval. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 1020–1028.
- Xie, Z.; Liu, L.; Wu, Y.; Zhong, L.; Li, L. Learning text-image joint embedding for efficient cross-modal retrieval with deep feature engineering. ACM Trans. Inf. Syst. (TOIS) 2021, 40, 1–27.
- Xie, Z.; Liu, L.; Li, L.; Zhong, L. Efficient Deep Feature Calibration for Cross-Modal Joint Embedding Learning. In Proceedings of the 2021 International Conference on Multimodal Interaction, Montréal, QC, Canada, 18–22 October 2021; pp. 43–51.
- Xie, Z.; Liu, L.; Li, L.; Zhong, L. Learning joint embedding with modality alignments for cross-modal retrieval of recipes and food images. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Virtual Event, 1–5 November 2021; pp. 2221–2230.
- Xie, Z.; Liu, L.; Wu, Y.; Li, L.; Zhong, L. Learning TFIDF enhanced joint embedding for recipe-image cross-modal retrieval service. IEEE Trans. Serv. Comput. 2021, 15, 3304–3316.
- Cao, D.; Chu, J.; Zhu, N.; Nie, L. Cross-modal recipe retrieval via parallel- and cross-attention networks learning. Knowl.-Based Syst. 2020, 193, 105428.
- Li, J.; Xu, X.; Yu, W.; Shen, F.; Cao, Z.; Zuo, K.; Shen, H.T. Hybrid fusion with intra- and cross-modality attention for image-recipe retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, 11–15 July 2021; pp. 244–254.
- Xie, Z.; Li, L.; Zhong, L.; Liu, J.; Liu, L. Cross-Modal Retrieval between Event-Dense Text and Image. In Proceedings of the 2022 International Conference on Multimedia Retrieval, Newark, NJ, USA, 27–30 June 2022; pp. 229–238.
- Salvador, A.; Gundogdu, E.; Bazzani, L.; Donoser, M. Revamping cross-modal recipe retrieval with hierarchical transformers and self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15475–15484.
- Zhao, W.; Zhou, D.; Cao, B.; Zhang, K.; Chen, J. Efficient low-rank multi-component fusion with component-specific factors in image-recipe retrieval. Multimed. Tools Appl. 2024, 83, 3601–3619.
- Zhao, W.; Zhou, D.; Cao, B.; Liang, W.; Sukhija, N. Exploring latent weight factors and global information for food-oriented cross-modal retrieval. Connect. Sci. 2023, 35, 2233714.
- Wahed, M.; Zhou, X.; Yu, T.; Lourentzou, I. Fine-Grained Alignment for Cross-Modal Recipe Retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 5584–5593.
- Wang, H.; Lin, G.; Hoi, S.C.; Miao, C. Learning structural representations for recipe generation and food retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3363–3377.
- Wang, H.; Sahoo, D.; Liu, C.; Lim, E.P.; Hoi, S.C. Learning cross-modal embeddings with adversarial networks for cooking recipes and food images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11572–11581.
- Zhu, B.; Ngo, C.W.; Chen, J.; Hao, Y. R2GAN: Cross-modal recipe retrieval with generative adversarial network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11477–11486.
- Sugiyama, Y.; Yanai, K. Cross-modal recipe embeddings by disentangling recipe contents and dish styles. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021; pp. 2501–2509.
- Wang, H.; Lin, G.; Hoi, S.; Miao, C. Paired cross-modal data augmentation for fine-grained image-to-text retrieval. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 5517–5526.
- Yang, J.; Chen, J.; Yanai, K. Transformer-Based Cross-Modal Recipe Embeddings with Large Batch Training. In Proceedings of the International Conference on Multimedia Modeling, Bergen, Norway, 9–12 January 2023; Springer: Cham, Switzerland, 2023; pp. 471–482.
- Guerrero, R.; Pham, H.X.; Pavlovic, V. Cross-modal retrieval and synthesis (X-MRS): Closing the modality gap in shared subspace learning. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021; pp. 3192–3201.
- Zhu, B.; Ngo, C.W.; Chen, J.; Chan, W.K. Cross-lingual adaptation for recipe retrieval with mixup. In Proceedings of the 2022 International Conference on Multimedia Retrieval, Newark, NJ, USA, 27–30 June 2022; pp. 258–267.
- Papadopoulos, D.P.; Mora, E.; Chepurko, N.; Huang, K.W.; Ofli, F.; Torralba, A. Learning program representations for food images and cooking recipes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16559–16569.
- Huang, X.; Liu, J.; Zhang, Z.; Xie, Y. Improving Cross-Modal Recipe Retrieval with Component-Aware Prompted CLIP Embedding. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 529–537.
- Sun, J.; Li, J. PBLF: Prompt Based Learning Framework for Cross-Modal Recipe Retrieval. In Proceedings of the International Symposium on Artificial Intelligence and Robotics, Shanghai, China, 21–23 October 2022; Springer: Singapore, 2022; pp. 388–402.
- Shukor, M.; Thome, N.; Cord, M. Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval. arXiv 2022, arXiv:2212.04267.
- Voutharoja, B.P.; Wang, P.; Wang, L.; Guan, V. MALM: Mask Augmentation based Local Matching for Food-Recipe Retrieval. arXiv 2023, arXiv:2305.11327.
- Zhang, C.; Song, J.; Zhu, X.; Zhu, L.; Zhang, S. HCMSL: Hybrid cross-modal similarity learning for cross-modal retrieval. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2021, 17, 1–22.
- Zhu, L.; Zhang, C.; Song, J.; Liu, L.; Zhang, S.; Li, Y. Multi-graph based hierarchical semantic fusion for cross-modal representation. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6.
- Yi, Z.; Zhu, X.; Wu, R.; Zou, Z.; Liu, Y.; Zhu, L. Multi-Label Weighted Contrastive Cross-Modal Hashing. Appl. Sci. 2023, 14, 93.
- Zou, Z.; Zhu, X.; Zhu, Q.; Liu, Y.; Zhu, L. CREAMY: Cross-Modal Recipe Retrieval by Avoiding Matching Imperfectly. IEEE Access 2024, 12, 33283–33295.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
- Thomas, C.; Kovashka, A. Preserving semantic neighborhoods for robust cross-modal retrieval. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVIII 16. Springer International Publishing: Cham, Switzerland, 2020; pp. 317–335.
- Guo, G.; Wang, H.; Bell, D.; Bi, Y.; Greer, K. KNN model-based approach in classification. In Proceedings of the On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE: OTM Confederated International Conferences, CoopIS, DOA, and ODBASE 2003, Catania, Sicily, Italy, 3–7 November 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 986–996.
- Wang, J.; Zhou, F.; Wen, S.; Liu, X.; Lin, Y. Deep metric learning with angular loss. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2593–2601.
Notation | Definition
---|---
— | A cross-modal recipe dataset
— | The food image of the i-th pair
— | The recipe of the i-th pair
— | The title of the recipe
— | The ingredients of the recipe
— | The instructions of the recipe
— | The embedding of the title in a recipe
— | The embedding of the ingredients in a recipe
— | The embedding of the instructions in a recipe
— | The recipe embedding
— | The food image embedding
— | The recipe encoder
— | The image encoder
— | The parameters of the recipe encoder
— | The parameters of the image encoder
— | The N-pairs triplet loss function
— | The RGI loss function
Retrieval results on Recipe1M; the left block is image-to-recipe (I→R), the right block recipe-to-image (R→I).

Setting | Method | MedR (I→R) | R@1 (I→R) | R@5 (I→R) | R@10 (I→R) | MedR (R→I) | R@1 (R→I) | R@5 (R→I) | R@10 (R→I)
---|---|---|---|---|---|---|---|---|---
1 K | CCA [9] | 15.7 | 14.0 | 32.0 | 43.0 | 24.8 | 9.0 | 24.0 | 35.0
1 K | JE [9] | 5.2 | 24.0 | 51.0 | 65.0 | 5.1 | 25.0 | 52.0 | 65.0
1 K | AdaMin [10] | 2.0 | 39.8 | 69.0 | 77.4 | 2.0 | 40.2 | 68.1 | 78.7
1 K | R2GAN [35] | 2.0 | 39.1 | 71.0 | 81.7 | 2.0 | 40.6 | 72.6 | 83.3
1 K | MCEN [14] | 2.0 | 48.2 | 75.8 | 83.6 | 1.9 | 48.4 | 76.1 | 83.7
1 K | ACME [34] | 1.0 | 51.8 | 80.2 | 87.5 | 1.0 | 52.8 | 80.2 | 87.6
1 K | SN [16] | 1.0 | 52.7 | 81.7 | 88.9 | 1.0 | 54.1 | 81.8 | 88.9
1 K | SCAN [13] | 1.0 | 54.0 | 81.7 | 88.8 | 1.0 | 54.9 | 81.9 | 89.0
1 K | HF-ICMA [20] | 1.0 | 55.1 | 86.7 | 92.4 | 1.0 | 56.8 | 87.5 | 93.0
1 K | SEJE [22] | 1.0 | 58.1 | 85.8 | 92.2 | 1.0 | 58.5 | 86.2 | 92.3
1 K | M-SIA [17] | 1.0 | 59.3 | 86.3 | 92.6 | 1.0 | 59.8 | 86.7 | 92.8
1 K | X-MRS [39] | 1.0 | 64.0 | 88.3 | 92.6 | 1.0 | 63.9 | 87.6 | 92.6
1 K | H-T [29] | 1.0 | 60.0 | 87.6 | 92.9 | 1.0 | 60.3 | 87.6 | 93.2
1 K | LCWF-GI [31] | 1.0 | 59.4 | 86.8 | 92.5 | 1.0 | 60.1 | 86.7 | 92.7
1 K | H-T(ViT) [29] | 1.0 | 64.2 | 89.1 | 93.4 | 1.0 | 64.5 | 89.3 | 93.8
1 K | LMF-CSF [30] | 1.0 | 65.8 | 89.7 | 94.3 | 1.0 | 65.5 | 89.4 | 94.3
1 K | Ours | 1.0 | 69.1 | 90.8 | 94.9 | 1.0 | 69.2 | 90.6 | 95.0
10 K | JE [9] | 41.9 | - | - | - | 39.2 | - | - | -
10 K | AdaMin [10] | 13.2 | 14.9 | 35.3 | 45.2 | 12.2 | 14.8 | 34.6 | 46.1
10 K | R2GAN [35] | 13.9 | 13.5 | 33.5 | 44.9 | 12.6 | 14.2 | 35.0 | 46.8
10 K | MCEN [14] | 7.2 | 20.3 | 43.3 | 54.4 | 6.6 | 21.4 | 44.3 | 55.2
10 K | ACME [34] | 6.7 | 22.9 | 46.8 | 57.9 | 6.0 | 24.4 | 47.9 | 59.0
10 K | SN [16] | 7.0 | 22.1 | 45.9 | 56.9 | 7.0 | 23.4 | 47.3 | 57.9
10 K | SCAN [13] | 5.9 | 23.7 | 49.3 | 60.6 | 5.1 | 25.3 | 50.6 | 61.6
10 K | HF-ICMA [20] | 5.0 | 24.0 | 51.6 | 65.4 | 4.2 | 25.6 | 54.8 | 67.3
10 K | SEJE [22] | 4.2 | 26.9 | 54.0 | 65.6 | 4.0 | 27.2 | 54.4 | 66.1
10 K | M-SIA [17] | 4.0 | 29.2 | 55.0 | 66.2 | 4.0 | 30.3 | 55.6 | 66.5
10 K | X-MRS [39] | 3.0 | 32.9 | 60.6 | 71.2 | 3.0 | 33.0 | 60.4 | 70.7
10 K | H-T [29] | 4.0 | 27.9 | 56.4 | 68.1 | 4.0 | 28.3 | 56.5 | 68.1
10 K | LCWF-GI [31] | 4.0 | 27.9 | 56.0 | 67.8 | 4.0 | 28.6 | 55.8 | 67.5
10 K | H-T(ViT) [29] | 3.0 | 33.5 | 62.1 | 72.8 | 3.0 | 33.7 | 62.2 | 72.7
10 K | LMF-CSF [30] | 3.0 | 34.6 | 62.7 | 73.2 | 3.0 | 34.3 | 62.5 | 72.8
10 K | Ours | 2.1 | 38.1 | 65.8 | 75.9 | 2.2 | 38.3 | 65.6 | 75.6
Ablation results; √ marks the components enabled in each configuration.

Setting | Base | I | R | MedR (I→R) | R@1 (I→R) | R@5 (I→R) | R@10 (I→R) | MedR (R→I) | R@1 (R→I) | R@5 (R→I) | R@10 (R→I)
---|---|---|---|---|---|---|---|---|---|---|---
1 K | √ | | | 1.0 | 58.3 | 86.2 | 91.8 | 1.0 | 59.6 | 86.1 | 92.2
1 K | √ | √ | | 1.0 | 67.2 | 90.0 | 94.5 | 1.0 | 67.6 | 90.0 | 94.5
1 K | √ | | √ | 1.0 | 68.6 | 90.5 | 94.7 | 1.0 | 68.2 | 90.3 | 94.7
1 K | √ | √ | √ | 1.0 | 69.1 | 90.8 | 94.9 | 1.0 | 69.2 | 90.6 | 95.0
10 K | √ | | | 4.1 | 26.8 | 54.7 | 66.5 | 4.0 | 37.5 | 55.1 | 66.8
10 K | √ | √ | | 3.0 | 35.9 | 64.5 | 74.7 | 3.0 | 36.6 | 64.7 | 74.9
10 K | √ | | √ | 2.2 | 37.7 | 65.8 | 75.9 | 2.0 | 38.0 | 65.6 | 75.4
10 K | √ | √ | √ | 2.1 | 38.1 | 65.8 | 75.9 | 2.2 | 38.3 | 65.6 | 75.6