An Explainable CNN and Vision Transformer-Based Approach for Real-Time Food Recognition
Abstract
1. Introduction
- Developing a hybrid CNN- and transformer-based model for efficient food recognition.
- Enhancing model accuracy while addressing the computational resources required by Vision Transformers through advanced training techniques such as mixed precision and distributed training.
- Exploring explainable AI techniques (Grad-CAM and LIME) to make the machine learning models’ decision-making process easier to understand and interpret.
- Integrating the developed model into a mobile application for real-time food monitoring, tracking, and dietary assessment tasks.
2. Related Works
2.1. CNN-Based Approaches
2.2. ViT-Based Approaches
2.3. Hybrid Approaches
3. Methodology
3.1. Data Preparation
Dataset Selection and Preprocessing
3.2. Model Architecture
3.2.1. Vision Transformer (ViT)
Patch Embedding
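For reference, patch embedding splits the input image into fixed-size patches (16 × 16 or 32 × 32 here), linearly projects each patch to the embedding dimension, and prepends a learnable class token before adding positional embeddings. The PyTorch-style sketch below is a generic illustration of this step, not the authors’ code; names such as `PatchEmbedding` and the default sizes are illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and project each patch to an embedding vector."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and
        # applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                     # x: (B, 3, H, W)
        x = self.proj(x)                      # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)      # (B, N, D): sequence of patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)        # prepend the learnable [CLS] token
        return x + self.pos_embed             # add positional embeddings
```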
Transformer Encoder
- Multi-Head Self-Attention (MHSA)
- Residual connections: These stabilize training and preserve gradient flow. For a patch embedding X, the output is obtained using Equation (4).
- Feed-forward network (FFN): A two-layer Multi-Layer Perceptron (MLP) that applies a non-linear activation to transform the output of the attention mechanism. It is preceded by a layer normalization that ensures training stability. The FFN is computed as shown in Equation (5); a standard formulation of Equations (4) and (5) is sketched after this list.
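The bodies of Equations (4) and (5) are not reproduced in this outline. For orientation, the standard pre-norm form these two steps usually take in ViT-style encoders is shown below; the notation is generic and may differ from the paper’s.

```latex
% Standard pre-norm ViT encoder block (generic notation)
\begin{aligned}
X'  &= X + \mathrm{MHSA}\bigl(\mathrm{LN}(X)\bigr)
      && \text{residual connection around attention (cf. Eq. (4))}\\
X'' &= X' + \mathrm{FFN}\bigl(\mathrm{LN}(X')\bigr),
      \qquad \mathrm{FFN}(z) = W_2\,\sigma(W_1 z + b_1) + b_2
      && \text{two-layer MLP with residual (cf. Eq. (5))}
\end{aligned}
```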
Classification Head
3.2.2. Transfer Learning
Pre-Trained Models Used
- ViT-B_16: A ViT model with a 16 × 16 patch size, pre-trained on ImageNet; its fine-grained feature extraction provides a robust initialization for our task [59].
- ViT-B_32: A ViT model with a 32 × 32 patch size, offering a trade-off between computational efficiency and fine-grained feature representation.
- R50-ViT-B_16: A hybrid architecture that combines a ResNet-50 (R50) convolutional backbone with the ViT-B_16 transformer, integrating CNN feature extraction with ViT attention to enhance image classification performance. A sketch of how these backbones can be instantiated follows this list.
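All three backbones are widely available as ImageNet pre-trained checkpoints. The snippet below shows one possible way to instantiate them with the `timm` library; the paper does not state which library or checkpoints were used, so the library choice, the model names, and the class count are assumptions.

```python
import timm

# Placeholder: set to the number of food classes after merging and normalizing the datasets.
NUM_CLASSES = 2000

# Hypothetical instantiation via timm; checkpoint names may vary by timm version,
# and the hybrid R50+ViT-B/16 checkpoint in timm is distributed at 384x384 resolution.
backbones = {
    "ViT-B_16":     timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=NUM_CLASSES),
    "ViT-B_32":     timm.create_model("vit_base_patch32_224", pretrained=True, num_classes=NUM_CLASSES),
    "R50-ViT-B_16": timm.create_model("vit_base_r50_s16_384", pretrained=True, num_classes=NUM_CLASSES),
}
```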
Model Initialization with Pre-Trained Weights
Fine-Tuning Process
3.3. Training Procedure
3.3.1. Distributed Training
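DDP here refers to PyTorch’s DistributedDataParallel (see Abbreviations), in which each GPU runs its own process and gradients are synchronized across processes. The sketch below shows the usual setup with a DistributedSampler so each rank trains on a distinct data shard; the launcher, backend, and parameter values are generic assumptions rather than the authors’ exact configuration.

```python
# Generic PyTorch DDP setup; assumes launch via `torchrun`, which sets LOCAL_RANK.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def setup_ddp(model, train_dataset, batch_size=64):
    dist.init_process_group(backend="nccl")        # NCCL backend for multi-GPU training
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(model.cuda(local_rank), device_ids=[local_rank])  # gradients are all-reduced across ranks
    sampler = DistributedSampler(train_dataset)    # each rank sees a disjoint shard per epoch
    loader = DataLoader(train_dataset, batch_size=batch_size,
                        sampler=sampler, num_workers=4, pin_memory=True)
    return model, loader, sampler
```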
3.3.2. Mixed Precision Training
3.3.3. Training Loop
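A typical FP16 mixed-precision training loop wraps the forward pass in torch.cuda.amp.autocast and uses a GradScaler so that small FP16 gradients do not underflow. The sketch below combines this with an SGD optimizer (listed in the Abbreviations); it is a generic pattern with illustrative hyperparameters, not the paper’s exact loop.

```python
import torch
import torch.nn as nn
from torch.cuda.amp import GradScaler, autocast

criterion = nn.CrossEntropyLoss()

def train_one_epoch(model, loader, optimizer, scaler, device="cuda"):
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad(set_to_none=True)
        with autocast():                      # forward pass in mixed FP16/FP32 precision
            loss = criterion(model(images), labels)
        scaler.scale(loss).backward()         # backpropagate the scaled loss
        scaler.step(optimizer)                # unscale gradients, then take the optimizer step
        scaler.update()                       # adapt the loss scale for the next iteration

# Usage (illustrative hyperparameters):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# scaler = GradScaler()
# for epoch in range(num_epochs):
#     train_one_epoch(model, train_loader, optimizer, scaler)
```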
4. Results and Discussion
4.1. Experimental Setup
Environment
4.2. Evaluation Metrics
- Accuracy: This metric measures the model’s overall effectiveness as the proportion of all predictions that are correct.
- Recall: This measures the model’s ability to correctly identify all relevant instances of a specific food class.
- Precision: This gives the proportion of images assigned to a specific class that truly belong to that class.
- F1 Score: The harmonic mean of precision and recall, summarizing both in a single score.
- Top-k Accuracy: This measures how often the correct class appears among the k classes with the highest predicted scores. The standard formulations of these metrics are given below.
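The formulas referenced above are not reproduced in this outline; their standard definitions, written in terms of true/false positives (TP, FP) and negatives (TN, FN) for a given class and averaged over classes, are:

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN}\\[4pt]
\text{Recall}    &= \frac{TP}{TP + FN}\\[4pt]
\text{Precision} &= \frac{TP}{TP + FP}\\[4pt]
F_1              &= 2\cdot\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision} + \text{Recall}}\\[4pt]
\text{Top-}k\ \text{accuracy} &= \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\left[\, y_i \in \hat{Y}_i^{(k)} \right]
\end{aligned}
```

where \(\hat{Y}_i^{(k)}\) denotes the k classes with the highest predicted scores for sample i.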
4.3. Model Training, Testing, and Evaluation
4.4. Explainability
4.5. Grad-CAM (Gradient-Weighted Class Activation Mapping)
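Grad-CAM computes the gradient of the target class score with respect to the activation maps of a chosen layer, averages those gradients into per-channel weights, and applies a ReLU to the weighted sum of the maps. The hook-based sketch below is a generic illustration for a CNN-style (spatial) feature map, such as the ResNet-50 stage of the hybrid model; it is not the authors’ implementation, and `target_layer` is an assumed handle to that layer.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Compute a Grad-CAM heatmap for `image` (1, C, H, W) w.r.t. `class_idx`."""
    activations, gradients = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.update(g=go[0]))

    model.eval()
    scores = model(image)                                  # (1, num_classes)
    if class_idx is None:
        class_idx = scores.argmax(dim=1).item()            # explain the predicted class
    model.zero_grad()
    scores[0, class_idx].backward()                        # gradients of the class score

    h1.remove(); h2.remove()
    weights = gradients["g"].mean(dim=(2, 3), keepdim=True)       # GAP of gradients -> channel weights
    cam = F.relu((weights * activations["a"]).sum(dim=1))         # weighted sum of activation maps
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[2:],   # upsample to input resolution
                        mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # normalize to [0, 1]
    return cam.squeeze().detach()
```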
4.6. LIME (Local Interpretable Model-Agnostic Explanations)
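LIME explains an individual prediction by perturbing superpixels of the input image and fitting an interpretable surrogate model to the classifier’s responses. The widely used `lime` package provides this for images; the sketch below shows the usual calling pattern, where `model` is the trained classifier and `food_image` an RGB image, both assumed available, and the preprocessing and parameter values are illustrative rather than the authors’ settings.

```python
import numpy as np
import torch
from lime import lime_image

def predict_fn(images: np.ndarray) -> np.ndarray:
    """Wrap the trained model: take a batch of HxWx3 images, return class probabilities."""
    # Assumes uint8 RGB input; resizing and dataset normalization are omitted for brevity.
    batch = torch.tensor(images, dtype=torch.float32).permute(0, 3, 1, 2) / 255.0
    with torch.no_grad():
        probs = torch.softmax(model(batch), dim=1)
    return probs.cpu().numpy()

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    np.array(food_image),        # a single RGB image as an H x W x 3 array
    predict_fn,
    top_labels=5,                # explain the 5 highest-scoring classes
    hide_color=0,
    num_samples=1000,            # perturbed samples used to fit the local surrogate
)
img, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5, hide_rest=False
)
```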
5. Model Integration
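The Abbreviations list includes ONNX, which suggests the trained network is exported to an interchange format before being embedded in the mobile application. A minimal export sketch is shown below; the file name, input resolution, and opset version are assumptions, and `model` is the trained network assumed available.

```python
import torch

# Export the trained model to ONNX so it can be served by a mobile/edge runtime
# (e.g., ONNX Runtime). Input resolution and opset version are illustrative.
model.eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "food_recognizer.onnx",
    input_names=["image"],
    output_names=["logits"],
    dynamic_axes={"image": {0: "batch"}, "logits": {0: "batch"}},
    opset_version=17,
)
```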
6. Discussion and Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Meaning |
---|---|
CNN | Convolutional Neural Network |
ViT | Vision Transformer |
Grad-CAM | Gradient-weighted Class Activation Mapping |
LIME | Local Interpretable Model-agnostic Explanations |
FP16 | 16-bit Floating Point |
SGD | Stochastic Gradient Descent |
DDP | Distributed Data-Parallel |
SE | Squeeze-and-Excitation |
ONNX | Open Neural Network Exchange |
MFLog | Meal Frequency Log |
References
- Rokhva, S.; Teimourpour, B.; Soltani, A.H. Computer Vision in the Food Industry: Accurate, Real-time, and Automatic Food Recognition with Pretrained MobileNetV2. arXiv 2024, arXiv:2405.11621.
- Armand, T.P.T.; Nfor, K.A.; Kim, J.-I.; Kim, H.-C. Applications of Artificial Intelligence, Machine Learning, and Deep Learning in Nutrition: A Systematic Review. Nutrients 2024, 16, 1073.
- Krutik, R.; Thacker, C.; Adhvaryu, R. Advancements in Food Recognition: A Comprehensive Review of Deep Learning-Based Automated Food Item Identification. In Proceedings of the 2024 2nd International Conference on Electrical Engineering and Automatic Control (ICEEAC), Setif, Algeria, 12–14 May 2024; pp. 1–6.
- Mansouri, M.; Chaouni, S.B.; Andaloussi, S.J.; Ouchetto, O. Deep learning for food image recognition and nutrition analysis towards chronic diseases monitoring: A systematic review. SN Comput. Sci. 2023, 4, 513.
- Kiourt, C.; Pavlidis, G.; Markantonatou, S. Deep learning approaches in food recognition. In Machine Learning Paradigms: Advances in Deep Learning-Based Technological Applications; Learning and Analytics in Intelligent Systems; Springer: Cham, Switzerland, 2020; Volume 18, pp. 83–108.
- Abiyev, R.; Adepoju, J. Automatic Food Recognition Using Deep Convolutional Neural Networks with Self-attention Mechanism. Hum.-Centric Intell. Syst. 2024, 4, 171–186.
- Fakhrou, A.; Kunhoth, J.; Al Maadeed, S. Smartphone-based food recognition system using multiple deep CNN models. Multimed. Tools Appl. 2021, 80, 33011–33032.
- Metwalli, A.-S.; Shen, W.; Wu, C.Q. Food image recognition based on densely connected convolutional neural networks. In Proceedings of the 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Fukuoka, Japan, 19–21 February 2020; pp. 027–032.
- Sefer, M.; Arslan, B.; Batur, O.Z.; Sönmez, E.B. A comparative study of deep learning methods on food classification problem. In Proceedings of the 2020 Innovations in Intelligent Systems and Applications Conference (ASYU), Istanbul, Turkey, 15–17 October 2020; pp. 1–4.
- Lin, X.; Lin, M.; Wei, L.; Chang, S.-F. Context-gated convolution. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVIII 16; Springer International Publishing: Glasgow, UK, 2020; pp. 701–718.
- Zahra, R.; Shekarizeh, S.; Sabokrou, M. Global-Local Processing in Convolutional Neural Networks. arXiv 2023, arXiv:2306.08336.
- Keller, M.; Tai, C.-e.A.; Chen, Y.; Xi, P.; Wong, A. NutritionVerse-Direct: Exploring Deep Neural Networks for Multitask Nutrition Prediction from Food Images. arXiv 2024, arXiv:2405.07814.
- Nijhawan, R.; Sinha, G.; Batra, A.; Kumar, M.; Sharma, H. VTnet+ Handcrafted based approach for food cuisines classification. Multimed. Tools Appl. 2024, 83, 10695–10715.
- Zhang, S.; Callaghan, V.; Che, Y. Image-based methods for dietary assessment: A survey. J. Food Meas. Charact. 2023, 18, 727–743.
- Aguilar, E.; Nagarajan, B.; Khatun, R.; Bolaños, M.; Radeva, P. Uncertainty modeling and deep learning applied to food image analysis. In International Joint Conference on Biomedical Engineering Systems and Technologies; Springer International Publishing: Cham, Germany, 2020; pp. 3–16.
- Liu, G.; Yang, J.; Chen, J.; Zhu, B.; Jiang, Y.-G. From Canteen Food to Daily Meals: Generalizing Food Recognition to More Practical Scenarios. IEEE Trans. Multimed. 2024, 1–10.
- Magomere, J.; Ishida, S.; Afonja, T.; Salama, A.; Kochin, D.; Yuehgoh, F.; Hamzaoui, I.; Sefala, R.; Alaagib, A.; Semenova, E.; et al. You are what you eat? Feeding foundation models a regionally diverse food dataset of World Wide Dishes. arXiv 2024, arXiv:2406.09496.
- Dalakleidi, K.V.; Papadelli, M.; Kapolos, I.; Papadimitriou, K. Applying image-based food-recognition systems on dietary assessment: A systematic review. Adv. Nutr. 2022, 13, 2590–2619.
- Singh, P.K.; Susan, S. Transfer Learning using Very Deep Pre-Trained Models for Food Image Classification. In Proceedings of the 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), Delhi, India, 6–8 July 2023; pp. 1–6.
- Touijer, L.; Pastore, V.P.; Odone, F. Food Image Classification: The Benefit of In-Domain Transfer Learning. In International Conference on Image Analysis and Processing; Springer Nature: Cham, Switzerland, 2023; pp. 259–269.
- Matarat, K. Enhancing Thai Food Classification: A CNN-Based Approach with Transfer Learning. Math. Model. Eng. Probl. 2024, 11, 1633–1640.
- Jiang, M. Food Image Classification with Convolutional Neural Networks. In Deep Learning, Fall; CS230; Stanford University: Stanford, CA, USA, 2019.
- Al-Rubaye, D.; Serkan, A. Deep Transfer Learning and Data Augmentation for Food Image Classification. In Proceedings of the 2022 Iraqi International Conference on Communication and Information Technologies (IICCIT), Basrah, Iraq, 7–8 September 2022; pp. 125–130.
- Kaiming, H.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Vaswani, A. Attention is all you need. In Advances in Neural Information Processing Systems; 2017.
- Alshomrani, S.; Lina, A.; Banan, A.; Sarah, A.-S. Food Detection by Fine-Tuning Pre-trained Convolutional Neural Network Using Noisy Labels. Int. J. Comput. Sci. Netw. Secur. 2021, 21, 182–190.
- Micikevicius, P.; Sharan, N.; Jonah, A.; Gregory, D.; Erich, E.; David, G.; Boris, G.; Michael, H.; Oleksii, K.; Ganesh, V.; et al. Mixed precision training. arXiv 2017, arXiv:1710.03740.
- Dash, S.; Isaac, L.; Yin, J.; Wang, X.; Egele, R.; Ellis, A.; Maiterth, M.; Cong, G.; Wang, F.; Balaprakash, P. Optimizing distributed training on frontier for large language models. In Proceedings of the ISC High Performance 2024 Research Paper Proceedings (39th International Conference), Hamburg, Germany, 12–16 May 2024; Prometeus GmbH: Lerici, Italy, 2024; pp. 1–11.
- Lee, W.; Rahul, S.; Alex, A. Training with Mixed-Precision Floating-Point Assignments. arXiv 2023, arXiv:2301.13464.
- Marion, D.; Fan, M.; Andreas, M.K. Impact of Mixed Precision Techniques on Training and Inference Efficiency of Deep Neural Networks. IEEE Access 2023, 11, 57627–57634.
- Aditya, R.; Vink, D.; Venieris, S.; Bouganis, C.-S. Multi-precision policy enforced training (MuPPET): A precision-switching strategy for quantised fixed-point training of CNNs. In Proceedings of the International Conference on Machine Learning, PMLR, Vienna, Austria, 12–18 July 2020; pp. 7943–7952.
- Joel, L.-P. Layered gradient accumulation and modular pipeline parallelism: Fast and efficient training of large language models. arXiv 2021, arXiv:2106.02679.
- Zhang, Y.; Han, Y.; Cao, S.; Dai, G.; Miao, Y.; Cao, T.; Yang, F.; Xu, N. Adam Accumulation to Reduce Memory Footprints of both Activations and Gradients for Large-scale DNN Training. arXiv 2023, arXiv:2305.19982.
- Ma, Y.; Florin, R.; Wu, K.; Alexander, S. Adaptive elastic training for sparse deep learning on heterogeneous multi-gpu servers. arXiv 2021, arXiv:2110.07029.
- Theodore Armand, T.P.; Kim, H.-C.; Kim, J.-I. Digital Anti-Aging Healthcare: An Overview of the Applications of Digital Technologies in Diet Management. J. Pers. Med. 2024, 14, 254.
- Armand, T.P.T.; Deji-Oloruntoba, O.; Bhattacharjee, S.; Nfor, K.A.; Kim, H.-C. Optimizing longevity: Integrating Smart Nutrition and Digital Technologies for Personalized Anti-aging Healthcare. In Proceedings of the 2024 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Osaka, Japan, 19–22 February 2024; pp. 243–248.
- Tsolakidis, D.; Gymnopoulos, L.P.; Dimitropoulos, K. Artificial Intelligence and Machine Learning Technologies for Personalized Nutrition: A Review. Informatics 2024, 11, 62.
- Shao, Z.; He, J.; Yu, Y.-Y.; Lin, L.; Cowan, A.E.; Eicher-Miller, H.A.; Zhu, F. Towards the creation of a nutrition and food group based image database. arXiv 2022, arXiv:2206.02086.
- Rodríguez-De-Vera, J.M.; Estepa, I.G.; Marc, B.; Bhalaji, N.; Petia, R. LOFI: LOng-tailed FIne-Grained Network for Food Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 3750–3760.
- Boyd, L.; Nnamoko, N.; Lopes, R. Fine-Grained Food Image Recognition: A Study on Optimising Convolutional Neural Networks for Improved Performance. J. Imaging 2024, 10, 126.
- Khan, A.; Rauf, Z.; Sohail, A.; Khan, A.R.; Asif, H.; Asif, A.; Farooq, U. A survey of the vision transformers and their CNN-transformer based variants. Artif. Intell. Rev. 2023, 56, 2917–2970.
- Raghu, M.; Thomas, U.; Simon, K.; Zhang, C.; Alexey, D. Do vision transformers see like convolutional neural networks? Adv. Neural Inf. Process. Syst. 2021, 34, 12116–12128.
- Rahmat, R.A.; Suhaili, B.K. Malaysian food recognition using alexnet CNN and transfer learning. In Proceedings of the 2021 IEEE 11th IEEE Symposium on Computer Applications & Industrial Electronics (ISCAIE), Penang, Malaysia, 3–4 April 2021; pp. 59–64.
- Zharfan, Z.; Lee, C.P.; Lim, K.M. Food recognition with resnet-50. In Proceedings of the 2020 IEEE 2nd International Conference on Artificial Intelligence in Engineering and Technology (IICAIET), Kota Kinabalu, Malaysia, 26–27 September 2020; pp. 1–5.
- Chang, L.; Cao, Y.; Luo, Y.; Chen, G.; Vokkarane, V.; Ma, Y. Deepfood: Deep learning-based food image recognition for computer-aided dietary assessment. In Inclusive Smart Cities and Digital Health: 14th International Conference on Smart Homes and Health Telematics, ICOST 2016, Wuhan, China, 25–27 May 2016; Proceedings 14; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 37–48.
- Li, X.; Li, Y.; Zou, X.; Ren, T. A high-precision food image classification method with a small number of parameters. In Proceedings of the 2023 4th International Conference on Computer, Big Data and Artificial Intelligence (ICCBD+ AI), Guiyang, China, 15–17 December 2023; pp. 33–36.
- Feng, S.; Lu, Z.; Li, Y.; Han, C.; Gu, X.; Wei, S. Foodnet: Multi-scale and label dependency learning-based multi-task network for food and ingredient recognition. Neural Comput. Appl. 2024, 36, 4485–4501.
- Min, W.; Wang, Z.; Liu, Y.; Luo, M.; Kang, L.; Wei, X.; Wei, X.; Jiang, S. Large scale visual food recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9932–9949.
- Bianco, S.; Marco, B.; Gaetano, C.; Paolo, N.; Flavio, P. Food Recognition with Visual Transformers. In Proceedings of the 2023 IEEE 13th International Conference on Consumer Electronics-Berlin (ICCE-Berlin), Berlin, Germany, 3–5 September 2023; pp. 82–87.
- Min, W.; Liu, L.; Wang, Z.; Luo, Z.; Wei, X.; Wei, X.; Jiang, S. Isia food-500: A dataset for large-scale food recognition via stacked global-local attention network. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 393–401.
- Peng, A.; He, J.; Zhu, F. Self-supervised visual representation learning on food images. arXiv 2023, arXiv:2303.09046.
- Liang, H.; Wen, G.; Hu, Y.; Luo, M.; Yang, P.; Xu, Y. MVANet: Multi-task guided multi-view attention network for Chinese food recognition. IEEE Trans. Multimed. 2020, 23, 3551–3561.
- Chen, C.-S.; Chen, G.-Y.; Zhou, D.; Jiang, D.; Chen, D.-S. Res-vmamba: Fine-grained food category visual classification using selective state space models with deep residual learning. arXiv 2024, arXiv:2402.15761.
- Liu, C.; Liang, Y.; Xue, Y.; Qian, X.; Fu, J. Food and ingredient joint learning for fine-grained recognition. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 2480–2493.
- Lukas, B.; Guillaumin, M.; Gool, L.V. Food-101–mining discriminative components with random forests. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings Part VI 13; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; pp. 446–461.
- Chen, J.; Ngo, C.-W. Deep-based ingredient recognition for cooking recipe retrieval. In Proceedings of the 24th ACM International Conference on Multimedia, New York, NY, USA, 15–19 October 2016; pp. 32–41.
- Fan, B.; Li, W.; Dong, L.; Li, J.; Nie, Z. Automatic Chinese Food recognition based on a stacking fusion model. In Proceedings of the 2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Sydney, Australia, 24–27 July 2023; pp. 1–4.
- Kawano, Y.; Keiji, Y. Automatic expansion of a food image dataset leveraging existing categories with domain adaptation. In Computer Vision-ECCV 2014 Workshops: Zurich, Switzerland, 6-7, 12 September 2014; Proceedings, Part III 13; Springer International Publishing: Berlin/Heidelberg, Germany, 2015; pp. 3–17.
- Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Armand, T.P.T.; Bhattacharjee, S.; Choi, H.-K.; Kim, H.-C. Transformers Effectiveness in Medical Image Segmentation: A Comparative Analysis of UNet-Based Architectures. In Proceedings of the 2024 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Osaka, Japan, 19–22 February 2024; pp. 238–242.
- Rudresh, D.; Dave, D.; Naik, H.; Singhal, S.; Omer, R.; Patel, P.; Qian, B.; Wen, Z.; Shah, T.; Morgan, G.; et al. Explainable AI (XAI): Core ideas, techniques, and solutions. ACM Comput. Surv. 2023, 55, 1–33.
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis. 2020, 128, 336–359.
- Junior, K.J.; Carole, K.S.; Armand, T.P.T.; Kim, H.-C.; Initiative, T.A.D.N. Alzheimer’s Multiclassification Using Explainable AI Techniques. Appl. Sci. 2024, 14, 8287.
- Ribeiro, M.T.; Singh, S.; Guestrin, C. Why should i trust you? Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144.
Dataset Name | No. of Classes | Total Samples | Regions | Size (GB) | Ref. |
---|---|---|---|---|---|
Food2K | 2000 | 1,036,564 | Miscellaneous | 64.2 | [48] |
Food101 | 101 | 101,000 | Western | 9.55 | [55] |
VireoFood172 | 172 | 110,241 | Asian | 1.52 | [56] |
CNFOOD-241 | 241 | 191,811 | Chinese | 9.94 | [57] |
UEC-FOOD 256 | 256 | 31,395 | Japanese | 3.97 | [58] |
Total | 2770 | 1,471,011 | | 89.18 | |
SN | Normalized Category | Original Food Names | Datasets | Type of Operation |
---|---|---|---|---|
1 | beef curry | (“Beef curry” vs. “beef curry”) | (‘VireoFood172’, ‘FOOD 256’) | Case Sensitivity |
2 | saozi noodles | (‘Saozi noodles’, ‘saozi noodles’) | (‘FOOD 2K’, ‘CNFOOD-241’) | Punctuation |
3 | rice | (‘Rice’, ‘Rice’, ‘rice’) | (‘VireoFood172’, ‘CNFOOD-241’, ‘FOOD 256’) | Spacing Issues |
4 | bibimbap | (‘bibimbap’) | (‘FOOD 256’, ‘FOOD 101’) | Uniform |
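The table above illustrates the label clashes that appear when the five datasets are merged (case differences, punctuation, stray spacing). A small helper along the following lines reproduces that behaviour; it is a sketch, since the authors’ exact normalization rules are not listed.

```python
import re

def normalize_label(name: str) -> str:
    """Map raw food-class names from different datasets onto one canonical form."""
    name = name.strip().lower()               # case sensitivity: "Beef curry" -> "beef curry"
    name = re.sub(r"[^\w\s]", " ", name)      # replace punctuation with spaces
    name = re.sub(r"\s+", " ", name).strip()  # collapse duplicated / stray whitespace
    return name

# Example: variants from different datasets collapse onto a single class key.
assert normalize_label("Beef curry") == normalize_label("beef curry") == "beef curry"
assert normalize_label("Rice ") == normalize_label("rice") == "rice"
```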
Model | Food2K Acc | Food2K Top-5 Acc | Food101 Acc | Food101 Top-5 Acc | VireoFood172 Acc | VireoFood172 Top-5 Acc | UEC-Food256 Acc | UEC-Food256 Top-5 Acc | CNFOOD-241 Acc | CNFOOD-241 Top-5 Acc | Combined Acc | Combined Top-5 Acc |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ViT-B_32 | 68.3 | 91.1 | 84.3 | 96.2 | 88.0 | 97.2 | 81.0 | 95.2 | 69.0 | 91.3 | - | - |
ViT-B_16 | 77.2 | 95.3 | 86.6 | 97.5 | 90.0 | 98.7 | 77.1 | 94.3 | 75.2 | 94.0 | - | - |
R50-ViT-B_16 | 84.1 | 96.2 | 91.3 | 99.0 | 92.3 | 98.5 | 85.0 | 98.0 | 83.4 | 95.2 | 91.17 | 98.35 |
Model | Food2K F1 | Food2K Recall | Food2K Prec. | Food101 F1 | Food101 Recall | Food101 Prec. | VireoFood172 F1 | VireoFood172 Recall | VireoFood172 Prec. | UEC-Food256 F1 | UEC-Food256 Recall | UEC-Food256 Prec. | CNFOOD-241 F1 | CNFOOD-241 Recall | CNFOOD-241 Prec. | Combined F1 | Combined Recall | Combined Prec. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ViT-B_32 | 63.4 | 64.1 | 63.3 | 86.0 | 86.0 | 86.0 | 89.1 | 89.1 | 89.0 | 94.9 | 94.0 | 94.9 | 76.0 | 76.0 | 76.7 | - | - | - |
ViT-B_16 | 71.5 | 71.1 | 71.0 | 89.1 | 89.2 | 89.2 | 88.2 | 88.2 | 88.2 | 92.5 | 92.6 | 92.7 | 77.0 | 75.2 | 75.8 | - | - | - |
R50-ViT-B_16 | 84.2 | 84.1 | 84.0 | 91.1 | 91.2 | 91.3 | 93.7 | 93.8 | 93.8 | 95.3 | 95.5 | 95.3 | 83.4 | 83.5 | 83.4 | 88.35 | 91.17 | 91.22 |
Dataset | Ref. | Technique | Accuracy (%) | Ours (%) |
---|---|---|---|---|
CNFOOD-241 | [53] | Res-VMamba | 82.15 | 83.4 |
UEC-Food256 | [44] | ResNet-50 | 49.09 | 85.0 |
UEC-Food256 | [45] | Inception | 80.7 | 85.0 |
VireoFood172 | [47] | FoodNet | 89.73 | 93.7 |
VireoFood172 | [48] | PRENet | 90.80 | 93.7 |
VireoFood172 | [52] | MVANet | 91.08 | 93.7 |
VireoFood172 | [54] | Attention Fusion Network (AFN) | 89.54 | 93.7 |
Food101 | [44] | ResNet-50 | 39.75 | 91.3 |
Food101 | [45] | Inception | 77.4 | 91.3 |
Food101 | [46] | CNN | 77.3 | 91.3 |
Food101 | [48] | PRENet | 91.13 | 91.3 |
Food101 | [49] | ViT | 92.59 | 91.3 |
Food101 | [51] | Self-supervised | 51.0 | 91.3 |
Food2K | [46] | CNN | 79.7 | 84.1 |
Food2K | [48] | PRENet | 83.75 | 84.1 |
Food2K | [49] | ViT | 80.76 | 84.1 |