Memory-Based Learning and Fusion Attention for Few-Shot Food Image Generation Method
Abstract
1. Introduction
- (1) To address the semantic inconsistency issue, we propose an enhanced CLIP module that embeds an ingredient encoder and an image encoder. The former preserves crucial semantic information by transforming sparse ingredient embeddings into compact embeddings, while the latter captures multi-scale feature information to enhance image representations.
- (2) To address the insufficient visual realism issue, a Memory module is proposed in combination with a pre-trained diffusion model. The Memory module stores the ingredient-image pairs encoded by the trained CLIP module as a reference dataset to guide the food image generation process.
- (3) An attention fusion module is proposed to strengthen the interaction between ingredient and image features through three attention blocks: the Cross-modal Attention block (CmA), the Memory Complementary Attention block (MCA), and the Combinational Attention block (CoA). This module efficiently refines the feature representations of the generated images.
2. Related Works
2.1. Image–Text Matching
2.2. Image Generation from Text
3. Method
- Step 1. Train the CLIP module using the food images and their corresponding ingredients.
- Step 2. Use the trained CLIP module to create ingredient-image embedding pairs, and store these pairs in the Memory module.
- Step 3. Generate auxiliary images and ingredient embeddings using the diffusion module and the ingredient encoder of the CLIP module, respectively.
- Step 4. Query the Memory module with the ingredient embeddings to retrieve the most similar stored ingredient-image pair (see the retrieval sketch after this list).
- Step 5. For image generation: (1) input the retrieved query image and the auxiliary image into the encoder of the image generation module to obtain encoded image features; (2) combine the ingredient embeddings, the retrieved ingredient embeddings, and the image features using the attention fusion module to derive the fused feature, denoted CoA; (3) feed the fused feature CoA into the decoder of the image generation module to produce the final food image.
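To make Steps 2–4 concrete, the following is a minimal PyTorch-style sketch of how the Memory module could store ingredient–image embedding pairs and retrieve the most similar pair by cosine similarity. The class and method names (`MemoryBank`, `store`, `query`) and the choice of cosine similarity are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

class MemoryBank:
    """Hypothetical memory module: stores CLIP ingredient/image embedding pairs
    and returns the stored pair whose ingredient embedding is most similar
    to a query (cosine similarity)."""

    def __init__(self):
        self.ingredient_keys = []   # list of (D,) ingredient embeddings
        self.image_values = []      # list of matching image embeddings

    def store(self, ingredient_emb: torch.Tensor, image_emb: torch.Tensor) -> None:
        # Step 2: save one ingredient-image embedding pair.
        self.ingredient_keys.append(F.normalize(ingredient_emb, dim=-1))
        self.image_values.append(image_emb)

    def query(self, ingredient_emb: torch.Tensor):
        # Step 4: cosine similarity between the query and every stored ingredient key.
        keys = torch.stack(self.ingredient_keys)       # (N, D)
        q = F.normalize(ingredient_emb, dim=-1)        # (D,)
        sims = keys @ q                                # (N,)
        idx = int(torch.argmax(sims))
        # Return the most similar stored ingredient-image pair.
        return self.ingredient_keys[idx], self.image_values[idx]

# Usage: populate the bank from the trained CLIP module (Step 2),
# then query it with the embedding of a new ingredient list (Step 4).
bank = MemoryBank()
bank.store(torch.randn(512), torch.randn(512))
key, value = bank.query(torch.randn(512))
```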
3.1. Enhanced CLIP Module
Algorithm 1. CLIP module.

| Line | Operation |
|---|---|
| 1 | Input: ingredients, images, batch size, and other parameters. |
| 2 | Output: ingredient feature vectors. |
| 3 | for each epoch do |
| 4 | for each batch do |
| 5 | Encode the ingredients with the ingredient encoder. |
| 6 | Encode the images with the image encoder. |
| 7 | Generate a diagonal unit matrix of batch size as the labels. |
| 8 | Compute the fused similarity matrix from the ingredient and image embeddings. |
| 9 | Optimize the objective function by Formula (2). |
| 10 | Save the ingredient feature vectors and image feature vectors in pairs. |
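A hedged sketch of one training step of Algorithm 1 is given below. The encoders are treated as black boxes, and since Formula (2) is not reproduced here, a standard symmetric CLIP-style contrastive loss with a fixed temperature is assumed in its place; the diagonal unit matrix of line 7 is expressed through integer class labels, which is equivalent under cross-entropy.

```python
import torch
import torch.nn.functional as F

def clip_training_step(ingredient_encoder, image_encoder, ingredients, images, optimizer):
    """One batch of CLIP-style contrastive training (sketch of Algorithm 1, lines 5-10).
    The symmetric cross-entropy objective stands in for Formula (2) and is an assumption."""
    # Lines 5-6: encode both modalities and L2-normalize the embeddings.
    t = F.normalize(ingredient_encoder(ingredients), dim=-1)   # (B, D)
    v = F.normalize(image_encoder(images), dim=-1)             # (B, D)

    # Line 8: similarity ("fuse") matrix between all ingredient/image pairs in the batch.
    logits = t @ v.t() / 0.07                                  # (B, B); 0.07 = assumed temperature

    # Line 7: identity-matrix labels, i.e. the i-th ingredient matches the i-th image.
    labels = torch.arange(logits.size(0), device=logits.device)

    # Line 9: optimize the contrastive objective in both directions.
    loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Line 10: return the embeddings so they can be stored in pairs.
    return loss.item(), t.detach(), v.detach()
```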
3.2. Memory Module
3.3. Image Generation Module
- (1) The CmA block is responsible for extracting interaction features between food ingredients and food images. Specifically, it integrates the ingredient embeddings and image embeddings to establish four distinct Cross-modal Attention blocks, as shown in Figure 4. The four CmA blocks are formulated in Formulas (3)–(6).
- (2) The intention of constructing the MCA is to adaptively learn the difference between the image features retrieved from the Memory module and the input image features. We assume that this difference reflects the similarity between the input ingredients and those stored in the Memory module, and it is used as a weight in the MCA computation given in Formula (8).
Algorithm 2. Food image generation module.

| Line | Operation |
|---|---|
| 1 | Input: the query image, the auxiliary image, the ingredient embeddings, the retrieved memory embeddings, and the labels. |
| 2 | Output: the generated food image. |
| 3 | for each training iteration do |
| 4 | Fuse the query image and the auxiliary image. |
| 5 | Input the fused image into the encoder of the U-Net and obtain the encoder features. |
| 6 | Input the encoder features, the ingredient embeddings, and the memory embeddings into the attention block. |
| 7 | Calculate the Cross-Modal Attention blocks by Formulas (3)–(6). |
| 8 | Calculate the Memory Complementary Attention by Formula (8). |
| 9 | Calculate the Combinational Attention by Formula (9). |
| 10 | Input the attention feature into the decoder of the U-Net. |
| 11 | Output the image generated by the decoder. |
| 12 | Calculate the loss between the generated image and the ground-truth image. |
| 13 | Backpropagate and adjust the weight parameters of the food image generation model. |
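One iteration of Algorithm 2 can be sketched roughly as follows. The module names (`unet_encoder`, `attention_fusion`, `unet_decoder`), the channel-wise concatenation used to fuse the two images, and the MSE reconstruction loss are assumptions, since the corresponding formulas are not reproduced in the text; `attention_fusion` stands for the CmA/MCA/CoA blocks sketched after item (3) below.

```python
import torch
import torch.nn.functional as F

def generation_training_step(unet_encoder, attention_fusion, unet_decoder,
                             query_image, aux_image, ingredient_emb,
                             mem_ingredient_emb, mem_image_emb,
                             target_image, optimizer):
    """One iteration of the food image generation module (sketch of Algorithm 2).
    attention_fusion is assumed to implement the CmA/MCA/CoA blocks of Formulas (3)-(9)."""
    # Line 4: fuse the retrieved query image and the diffusion-generated auxiliary image
    # (channel concatenation is an assumed fusion scheme).
    fused_image = torch.cat([query_image, aux_image], dim=1)

    # Line 5: U-Net encoder features.
    enc_feat = unet_encoder(fused_image)

    # Lines 6-9: attention fusion of encoder features with ingredient/memory embeddings.
    coa_feat = attention_fusion(enc_feat, ingredient_emb, mem_ingredient_emb, mem_image_emb)

    # Lines 10-11: decode the fused attention feature into an image.
    generated = unet_decoder(coa_feat)

    # Line 12: reconstruction loss against the ground-truth image (MSE is an assumption).
    loss = F.mse_loss(generated, target_image)

    # Line 13: backpropagation and weight update.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return generated.detach(), loss.item()
```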
- (3) The CoA block is designed to adaptively assign weight parameters to both the CmA and MCA outputs. To measure the contribution of each attention, the fused attention feature is formulated as a linear combination of these attention outputs, as shown in Equation (9).
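Because Formulas (3)–(9) are not reproduced above, the sketch below only illustrates the general shape of the three blocks: a scaled dot-product cross-modal attention (one of the four CmA variants), a difference-based weighting of the retrieved memory features (MCA), and a learnable linear combination of the two (CoA). The projection layers, the choice of queries/keys/values, the exp(-distance) weighting, and the scalar combination weights are all assumptions, not the paper's exact formulations.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Illustrative fusion of Cross-modal Attention (CmA), Memory Complementary
    Attention (MCA), and Combinational Attention (CoA)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # CoA: learnable scalar weights for combining the CmA and MCA outputs.
        self.alpha = nn.Parameter(torch.tensor(0.5))
        self.beta = nn.Parameter(torch.tensor(0.5))

    def cma(self, img_feat, ingr_emb):
        # One cross-modal attention block: image features attend to ingredient embeddings.
        q = self.q_proj(img_feat)                                 # (B, N, D)
        k = self.k_proj(ingr_emb)                                 # (B, M, D)
        v = self.v_proj(ingr_emb)                                 # (B, M, D)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        return attn @ v                                           # (B, N, D)

    def mca(self, img_feat, mem_img_feat):
        # Turn the difference between the current and retrieved memory image features
        # into a weight (smaller difference -> larger weight); exp(-distance) is an
        # assumed stand-in for Formula (8).
        diff = (img_feat - mem_img_feat).flatten(1).norm(dim=-1)  # (B,)
        w = torch.exp(-diff).view(-1, 1, 1)
        return w * mem_img_feat                                   # (B, N, D)

    def forward(self, img_feat, ingr_emb, mem_ingr_emb, mem_img_feat):
        # Only one of the paper's four CmA blocks is shown; the input and retrieved
        # ingredient embeddings are simply concatenated as keys/values here.
        cma_out = self.cma(img_feat, torch.cat([ingr_emb, mem_ingr_emb], dim=1))
        mca_out = self.mca(img_feat, mem_img_feat)
        # CoA: adaptive linear combination of the two attention outputs (cf. Equation (9)).
        return self.alpha * cma_out + self.beta * mca_out
```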
4. Experiments
4.1. Dataset
4.2. Experimental Settings
4.3. Evaluation Metric
5. Experimental Results and Analysis
5.1. Comparison with the State-of-the-Art
5.2. Ablation Study
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Min, W.; Jiang, S.; Liu, L.; Rui, Y.; Jain, R. A Survey on Food Computing. ACM Comput. Surv. 2019, 52, 1–36.
- Wang, H.; Sahoo, D.; Liu, C.; Shu, K.; Achananuparp, P.; Lim, E.; Hoi, S. Cross-modal food retrieval: Learning a joint embedding of food images and recipes with semantic consistency and attention mechanism. IEEE Trans. Multimed. 2021, 24, 2515–2525.
- Salvador, A.; Hynes, N.; Aytar, Y.; Marin, J.; Torralba, A. Learning cross-modal embeddings for cooking recipes and food images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3020–3028.
- Sugiyama, Y.; Yanai, K. Cross-modal recipe embeddings by disentangling recipe contents and dish styles. In Proceedings of the 29th ACM International Conference on Multimedia, New York, NY, USA, 20–24 October 2021; pp. 2501–2509.
- Deng, Z.; He, X.; Peng, Y. LFR-GAN: Local Feature Refinement based Generative Adversarial Network for Text-to-Image Generation. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1.1–1.18.
- Nishimura, T.; Hashimoto, A.; Ushiku, Y.; Kameko, H.; Mori, S. Structure-aware procedural text generation from an image sequence. IEEE Access 2020, 9, 2125–2141.
- Wang, S.; Gao, H.; Zhu, Y.; Zhang, W.; Chen, Y. A food dish image generation framework based on progressive growing GANs. In Proceedings of the 15th EAI International Conference, London, UK, 19–22 August 2019; Springer: Cham, Switzerland, 2019; pp. 323–333.
- Salvador, A.; Drozdzal, M.; Giró-i-Nieto, X.; Romero, A. Inverse cooking: Recipe generation from food images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10453–10462.
- Honbu, Y.; Yanai, K. SetMealAsYouLike: Sketch-based Set Meal Image Synthesis with Plate Annotations. In Proceedings of the 7th International Workshop on Multimedia Assisted Dietary Management, New York, NY, USA, 24 October 2022; pp. 49–53.
- Han, F.; Guerrero, R.; Pavlovic, V. CookGAN: Meal image synthesis from ingredients. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass, CO, USA, 1–5 March 2020; pp. 1439–1447.
- Papadopoulos, D.; Tamaazousti, Y.; Ofli, F.; Weber, I.; Torralba, A. How to make a pizza: Learning a compositional layer-based GAN model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7994–8003.
- Liu, Z.; Niu, K.; He, Z. ML-CookGAN: Multi-Label Generative Adversarial Network for Food Image Generation. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–21.
- Pan, S.; Dai, L.; Hou, X.; Li, H.; Sheng, B. ChefGAN: Food image generation from recipes. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12 October 2020; pp. 4244–4252.
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6 December 2020; pp. 6840–6851.
- Vo, N.; Jiang, L.; Sun, C.; Murphy, K.; Li, L.; Fei-Fei, L.; Hays, J. Composing text and image for image retrieval-an empirical odyssey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 6439–6448.
- Zhang, F.; Xu, M.; Mao, Q.; Xu, C. Joint attribute manipulation and modality alignment learning for composing text and image to image retrieval. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12 October 2020; pp. 3367–3376.
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 4904–4916.
- Li, L.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.; et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 10965–10975.
- Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv 2022, arXiv:2204.06125.
- Cepa, B.; Brito, C.; Sousa, A. Generative Adversarial Networks in Healthcare: A Case Study on MRI Image Generation. In Proceedings of the 2023 IEEE 7th Portuguese Meeting on Bioengineering (ENBENG), Porto, Portugal, 22–23 June 2023; pp. 48–51.
- Chen, E.; Holalkere, S.; Yan, R.; Zhang, K.; Davis, A. Ray Conditioning: Trading Photo-consistency for Photo-realism in Multi-view Image Generation. In Proceedings of the International Conference on Computer Vision, Paris, France, 2–6 October 2023; p. 6622.
- Lai, Z.; Tang, C.; Lv, J. Multi-view image generation by cycle CVAE-GAN networks. In Neural Information Processing: 26th International Conference, Sydney, NSW, Australia, 12–15 December 2019; Springer International Publishing: Cham, Switzerland, 2019; pp. 43–54.
- Yüksel, N.; Börklü, H. Nature-Inspired Design Idea Generation with Generative Adversarial Networks. Int. J. 3D Print. Technol. 2023, 7, 47–54.
- Gregor, K.; Danihelka, I.; Graves, A.; Rezende, D.; Wierstra, D. DRAW: A recurrent neural network for image generation. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1462–1471.
- Dong, Z.; Wei, P.; Lin, L. DreamArtist: Towards controllable one-shot text-to-image generation via contrastive prompt-tuning. arXiv 2022, arXiv:2211.11337.
- Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A.; Chechik, G.; Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
- Li, T.; Li, Z.; Luo, A.; Rockwell, H.; Farimani, A.; Lee, T. Prototype memory and attention mechanisms for few shot image generation. In Proceedings of the Tenth International Conference on Learning Representations, Virtual, 25–29 April 2022.
- Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for training GANs. Adv. Neural Inf. Process. Syst. 2016, 29, 2234–2242.
- Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 25–34.
- Zhang, R.; Isola, P.; Efros, A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595.
- Jo, Y.; Yang, S.; Kim, S.J. Investigating loss functions for extreme super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 424–425.
- Zhang, H.; Koh, J.Y.; Baldridge, J.; Lee, H.; Yang, Y. Cross-modal contrastive learning for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 833–842.
| Network | FID | LPIPS | IS |
|---|---|---|---|
| ChefGAN [13] | 11.537 | 0.837 | 1.955 |
| CookGAN [10] | 10.854 | 0.904 | 2.274 |
| MoCA [27] | 9.650 | 0.747 | 5.480 |
| DreamArtist [25] | 8.677 | 0.735 | 2.882 |
| PGTI [26] | 7.092 | 0.746 | 4.530 |
| XMC-GAN [32] | 6.448 | 0.842 | 6.373 |
| MLA-Diff | 4.285 | 0.733 | 6.366 |
| Architecture | FID |
|---|---|
| MLA-Diff without enhanced CLIP module | 6.572 |
| MLA-Diff without Memory module | 6.483 |
| MLA-Diff without attention fusion module | 6.463 |
| MLA-Diff without pre-trained diffusion model | 6.344 |
| MLA-Diff | 4.285 |
| Group | Number of CmA Blocks | CmA | MCA | CoA | FID | LPIPS | IS |
|---|---|---|---|---|---|---|---|
| 1 | () | √ | | | 18.707 | 0.843 | 3.260 |
| | | √ | √ | | 9.650 | 0.747 | 5.480 |
| | | √ | √ | √ | 9.259 | 0.735 | 5.210 |
| 2 | () | √ | | | 18.369 | 0.846 | 3.224 |
| | | √ | √ | | 9.542 | 0.747 | 5.430 |
| | | √ | √ | √ | 9.321 | 0.735 | 5.179 |
| 3 | () | √ | | | 19.879 | 0.836 | 3.866 |
| | | √ | √ | | 9.648 | 0.746 | 5.492 |
| | | √ | √ | √ | 9.538 | 0.733 | 5.209 |
| 4 | () | √ | | | 18.076 | 0.848 | 3.380 |
| | | √ | √ | | 9.752 | 0.746 | 5.413 |
| | | √ | √ | √ | 9.289 | 0.735 | 5.028 |
| 5 | (), () | √ | | | 9.400 | 0.757 | 5.503 |
| | | √ | √ | | 7.978 | 0.749 | 5.179 |
| | | √ | √ | √ | 7.151 | 0.736 | 5.223 |
| 6 | (), () | √ | | | 9.436 | 0.785 | 5.795 |
| | | √ | √ | | 7.645 | 0.738 | 5.130 |
| | | √ | √ | √ | 6.709 | 0.734 | 5.084 |
| 7 | (), (), (), () | √ | | | 9.078 | 0.769 | 5.574 |
| | | √ | √ | | 7.556 | 0.740 | 5.235 |
| | | √ | √ | √ | 6.344 | 0.743 | 5.496 |
| Group | CmA (Q, K, V) | MCA | CoA | Diffusion | FID | LPIPS | IS |
|---|---|---|---|---|---|---|---|
| 1 | () | √ | √ | | 9.259 | 0.735 | 5.210 |
| | | √ | √ | √ | 6.240 | 0.732 | 5.882 |
| 2 | () | √ | √ | | 9.321 | 0.735 | 5.179 |
| | | √ | √ | √ | 6.336 | 0.731 | 5.858 |
| 3 | () | √ | √ | | 9.538 | 0.733 | 5.209 |
| | | √ | √ | √ | 6.478 | 0.729 | 5.890 |
| 4 | () | √ | √ | | 9.289 | 0.735 | 5.028 |
| | | √ | √ | √ | 6.480 | 0.729 | 5.879 |
| 5 | () | √ | √ | | 7.151 | 0.736 | 5.223 |
| | | √ | √ | √ | 4.662 | 0.742 | 5.881 |
| 6 | () | √ | √ | | 6.709 | 0.734 | 5.084 |
| | | √ | √ | √ | 4.665 | 0.741 | 5.946 |
| 7 | (), (), () | √ | √ | | 6.344 | 0.743 | 5.496 |
| | | √ | √ | √ | 4.285 | 0.733 | 6.366 |