GourmetNet: Food Segmentation Using Multi-Scale Waterfall Features with Spatial and Channel Attention
Abstract
1. Introduction
- We propose GourmetNet, a single-pass, end-to-end trainable, multi-scale framework with channel and spatial attention modules for feature refinement;
- The integration of channel and spatial attention modules with waterfall spatial pyramids improves performance: the refined features, combined with the multi-scale waterfall approach, allow a larger field of view (FOV) without requiring a separate decoder or post-processing.
- GourmetNet achieves state-of-the-art performance on the UNIMIB2016 and UEC FoodPix food segmentation datasets. The GourmetNet code is available on GitHub (https://github.com/uditsharma29/GourmetNet (accessed on 8 November 2021)).
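The attention modules are defined in Section 3; as a rough illustration of the general idea only (not the authors' implementation), channel attention re-weights feature channels using a gate computed from globally pooled channel statistics, while spatial attention re-weights spatial locations using a gate computed from channel-wise pooled maps. A minimal NumPy sketch, with the gating functions deliberately simplified:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w):
    """Re-weight the channels of a (C, H, W) feature map.

    Simplified for illustration: the gate is a sigmoid of a learned
    linear map `w` (shape (C, C), hypothetical) applied to the
    globally average-pooled channel descriptor.
    """
    desc = feat.mean(axis=(1, 2))        # (C,) global average pool
    gate = sigmoid(w @ desc)             # (C,) per-channel gate in (0, 1)
    return feat * gate[:, None, None]    # broadcast gate over H, W

def spatial_attention(feat):
    """Re-weight the locations of a (C, H, W) feature map.

    Simplified for illustration: the gate is a sigmoid of the sum of
    the channel-wise average and max maps.
    """
    avg_map = feat.mean(axis=0)          # (H, W) average over channels
    max_map = feat.max(axis=0)           # (H, W) max over channels
    gate = sigmoid(avg_map + max_map)    # (H, W) per-location gate
    return feat * gate[None, :, :]       # broadcast gate over C
```

Both operations preserve the feature map shape, so they can be inserted between backbone and waterfall stages without changing the rest of the pipeline.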
2. Related Work
2.1. Waterfall Multi-Scale Features
2.2. Attention Mechanisms
2.3. Food Segmentation
3. Proposed Method
3.1. Backbone
3.2. Attention Modules
3.2.1. Channel Attention
3.2.2. Spatial Attention
3.3. Multi-Scale Waterfall Features
4. Experimental Methods
4.1. Datasets
4.2. Parameter Setting
4.3. Evaluation Metrics
5. Results
5.1. Ablation Studies
5.2. Comparison to State-of-the-Art
5.3. Food Classes Performance Analysis
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
ASPP | Atrous Spatial Pyramid Pooling |
CE | Cross-Entropy |
COCO | Common Objects in Context |
DANet | Dual Attention Network |
CRF | Conditional Random Fields |
ENet | Efficient Network |
FCN | Fully Convolutional Networks |
IOU | Intersection over Union |
JSEG | J measure based Segmentation |
mIOU | Mean Intersection over Union |
NLP | Natural Language Processing |
PSPnet | Pyramid Scene Parsing Network |
RNN | Recurrent Neural Network |
SAP | Spatial Average Pooling |
SGD | Stochastic Gradient Descent |
SMP | Spatial Max Pooling |
SPP | Spatial Pyramid Pooling |
WASP | Waterfall Atrous Spatial Pooling |
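Among the abbreviations above, mIOU is the evaluation metric reported in the results tables: the intersection-over-union averaged across classes. A minimal sketch of the standard computation (the label maps and class count below are illustrative):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union for integer label maps.

    pred, target: same-shape integer arrays of per-pixel class indices.
    Classes absent from both prediction and ground truth are skipped.
    """
    ious = []
    for c in range(num_classes):
        p, t = (pred == c), (target == c)
        union = np.logical_or(p, t).sum()
        if union == 0:                    # class absent everywhere: skip
            continue
        inter = np.logical_and(p, t).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

# Tiny 2x2 example: classes 0 and 1 each score IOU 0.5, class 2 scores 1.0
pred = np.array([[0, 0], [1, 2]])
target = np.array([[0, 1], [1, 2]])
print(mean_iou(pred, target, 3))  # 0.666...
```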
References
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
- Noh, H.; Hong, S.; Han, B. Learning Deconvolution Network for Semantic Segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1520–1528.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI); Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241.
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 2017, 39, 2481–2495.
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 2018, 40, 834–848.
- Artacho, B.; Savakis, A. Waterfall Atrous Spatial Pooling Architecture for Efficient Semantic Segmentation. Sensors 2019, 19, 5361.
- Lin, T.; Maire, M.; Belongie, S.J.; Bourdev, L.D.; Girshick, R.B.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312.
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
- Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A. Foreground-Aware Relation Network for Geospatial Object Segmentation in High Spatial Resolution Remote Sensing Imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020.
- Milioto, A.; Lottes, P.; Stachniss, C. Real-Time Semantic Segmentation of Crop and Weed for Precision Agriculture Robots Leveraging Background Knowledge in CNNs. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 2229–2235.
- Aslan, S.; Ciocca, G.; Schettini, R. Semantic segmentation of food images for automatic dietary monitoring. In Proceedings of the 2018 26th Signal Processing and Communications Applications Conference (SIU), Izmir, Turkey, 2–5 May 2018; pp. 1–4.
- Aslan, S.; Ciocca, G.; Schettini, R. Semantic Food Segmentation for Automatic Dietary Monitoring. In Proceedings of the 2018 IEEE 8th International Conference on Consumer Electronics (ICCE-Berlin), Berlin, Germany, 2–5 September 2018; pp. 1–6.
- Kong, F.; Tan, J. DietCam: Automatic dietary assessment with mobile camera phones. Pervasive Mob. Comput. 2012, 8, 147–163.
- Kawano, Y.; Yanai, K. FoodCam: A Real-Time Food Recognition System on a Smartphone. Multimed. Tools Appl. 2015, 74, 5263–5287.
- Liu, C.; Cao, Y.; Luo, Y.; Chen, G.; Vokkarane, V.; Ma, Y. DeepFood: Deep Learning-Based Food Image Recognition for Computer-Aided Dietary Assessment. In Proceedings of the 14th International Conference on Inclusive Smart Cities and Digital Health, ICOST 2016, Wuhan, China, 25–27 May 2016; Springer: Berlin/Heidelberg, Germany, 2016; Volume 9677, pp. 37–48.
- Puri, M.; Zhu, Z.; Yu, Q.; Divakaran, A.; Sawhney, H. Recognition and volume estimation of food intake using a mobile device. In Proceedings of the 2009 Workshop on Applications of Computer Vision (WACV), Snowbird, UT, USA, 7–8 December 2009; pp. 1–8.
- Dehais, J.; Anthimopoulos, M.; Shevchik, S.; Mougiakakou, S. Two-View 3D Reconstruction for Food Volume Estimation. IEEE Trans. Multimed. 2017, 19, 1090–1099.
- Tanno, R.; Ege, T.; Yanai, K. AR DeepCalorieCam V2: Food calorie estimation with CNN and AR-based actual size estimation. In Proceedings of the 24th ACM Symposium on Virtual Reality Software and Technology, Tokyo, Japan, 28 November–1 December 2018; pp. 1–2.
- Myers, A.; Johnston, N.; Rathod, V.; Korattikara, A.; Gorban, A.; Silberman, N.; Guadarrama, S.; Papandreou, G.; Huang, J.; Murphy, K. Im2Calories: Towards an Automated Mobile Vision Food Diary. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1233–1241.
- Min, W.; Liu, L.; Luo, Z.; Jiang, S. Ingredient-Guided Cascaded Multi-Attention Network for Food Recognition. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1331–1339.
- Li, J.; Guerrero, R.; Pavlovic, V. Deep Cooking: Predicting Relative Food Ingredient Amounts from Images. In Proceedings of the 5th International Workshop on Multimedia Assisted Dietary Management, Nice, France, 21 October 2019.
- Salvador, A.; Drozdzal, M.; Giro-i Nieto, X.; Romero, A. Inverse Cooking: Recipe Generation From Food Images. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10445–10454.
- Marín, J.; Biswas, A.; Ofli, F.; Hynes, N.; Salvador, A.; Aytar, Y.; Weber, I.; Torralba, A. Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 187–203.
- He, Y.; Khanna, N.; Boushey, C.; Delp, E. Image segmentation for image-based dietary assessment: A comparative study. In Proceedings of the International Symposium on Signals, Circuits and Systems ISSCS2013, Iasi, Romania, 11–12 July 2013; pp. 1–4.
- Aslan, S.; Ciocca, G.; Schettini, R. On Comparing Color Spaces for Food Segmentation. In Proceedings of the International Conference on Image Analysis and Processing, Catania, Italy, 11–15 September 2017; pp. 435–443.
- Ciocca, G.; Napoletano, P.; Schettini, R. Food Recognition: A New Dataset, Experiments, and Results. IEEE J. Biomed. Health Inform. 2017, 21, 588–598.
- Ege, T.; Shimoda, W.; Yanai, K. A New Large-Scale Food Image Segmentation Dataset and Its Application to Food Calorie Estimation Based on Grains of Rice. In Proceedings of the International Workshop on Multimedia Assisted Dietary Management (MADiMa), Nice, France, 21–25 October 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 82–87.
- Chen, L.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587.
- Artacho, B.; Savakis, A. OmniPose: A Multi-Scale Framework for Multi-Person Pose Estimation. arXiv 2021, arXiv:2103.10180.
- Peng, C.; Ma, J. Semantic segmentation using stride spatial pyramid pooling and dual attention decoder. Pattern Recognit. 2020, 107, 107498.
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
- Kendall, A.; Badrinarayanan, V.; Cipolla, R. Bayesian SegNet: Model uncertainty in deep convolutional encoder–decoder architectures for scene understanding. arXiv 2015, arXiv:1511.02680.
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. arXiv 2016, arXiv:1612.01105.
- Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. arXiv 2016, arXiv:1606.02147.
- Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
- Chen, L.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. arXiv 2018, arXiv:1802.02611.
- Artacho, B.; Savakis, A. UniPose: Unified Human Pose Estimation in Single Images and Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020.
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the 3rd International Conference on Learning Representations, ICLR, San Diego, CA, USA, 7–9 May 2015.
- Luong, T.; Pham, H.; Manning, C.D. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1412–1421.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
- Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; Volume 37, pp. 2048–2057.
- Chen, L.C.; Yang, Y.; Wang, J.; Xu, W.; Yuille, A.L. Attention to Scale: Scale-Aware Semantic Image Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3640–3649.
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 3141–3149.
- Huang, Q.; Xia, C.; Wu, C.; Li, S.; Wang, Y.; Song, Y.; Kuo, C.J. Semantic Segmentation with Reverse Attention. In Proceedings of the British Machine Vision Conference (BMVC), London, UK, 4–7 September 2017; pp. 18.1–18.13.
- Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. BiSeNet: Bilateral Segmentation Network for Real-Time Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 334–349.
- Shi, J.; Malik, J. Normalized cuts and image segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, 17–19 June 1997; pp. 731–737.
- Wang, Y.G.; Yang, J.; Chang, Y.C. Color–texture image segmentation by integrating directional operators into JSEG method. Pattern Recognit. Lett. 2006, 27, 1983–1990.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525.
- Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807.
- Pfisterer, K.J.; Amelard, R.; Chung, A.; Syrnyk, B.; MacLean, A.; Wong, A. Fully-Automatic Semantic Segmentation for Food Intake Tracking in Long-Term Care Homes. arXiv 2019, arXiv:1910.11250.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Tangseng, P.; Wu, Z.; Yamaguchi, K. Looking at Outfit to Parse Clothing. arXiv 2017, arXiv:1703.01386.
- Ciocca, G.; Napoletano, P.; Schettini, R. IAT-Image Annotation Tool: Manual. arXiv 2015, arXiv:1502.05212.
- Douglas, D.H.; Peucker, T.K. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartogr. Int. J. Geogr. Inf. Geovis. 1973, 10, 112–122.
- Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747.
- Sutskever, I.; Martens, J.; Dahl, G.; Hinton, G. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; Volume 28, pp. 1139–1147.
Ablation study on the UNIMIB2016 dataset (✓ marks the modules included in each configuration):

Dual Attention | Channel Attention | Spatial Attention | ASPP | WASP | WASPv2 | GFLOPs | #Params | mIOU
---|---|---|---|---|---|---|---|---
87.20 | 47.95 M | 68.25% | ||||||
✓ | 51.56 | 45.58 M | 69.44% | |||||
✓ | ✓ | 54.60 | 59.41 M | 69.73% | ||||
✓ | ✓ | 46.98 | 47.49 M | 69.25% | ||||
✓ | ✓ | 48.81 | 47.00 M | 70.29% | ||||
✓ | 47.02 | 46.9 M | 69.17% | |||||
✓ | ✓ | 53.62 | 48.7 M | 70.28% | ||||
✓ | ✓ | 72.00 | 46.9 M | 70.58% | ||||
✓ | ✓ | ✓ | 78.60 | 48.8 M | 71.79% | |||
✓ | ✓ | ✓ | ✓ | 78.60 | 49 M | 69.79% |
Ablation study on the UEC FoodPix dataset (✓ marks the modules included in each configuration):

Dual Attention | Channel Attention | Spatial Attention | ASPP | WASP | WASPv2 | GFLOPs | #Params | mIOU
---|---|---|---|---|---|---|---|---
51.33 | 47.95 M | 62.33% | ||||||
✓ | 30.21 | 45.58 M | 62.48% | |||||
✓ | ✓ | 31.89 | 59.41 M | 62.49% | ||||
✓ | ✓ | 27.47 | 47.49 M | 61.95% | ||||
✓ | ✓ | 28.91 | 47 M | 63.14% | ||||
✓ | 27.5 | 46.9 M | 63.54% | |||||
✓ | ✓ | 31.4 | 48.7 M | 64.30% | ||||
✓ | ✓ | 42.3 | 46.9 M | 64.29% | ||||
✓ | ✓ | ✓ | 46.2 | 48.8 M | 65.13% | |||
✓ | ✓ | ✓ | ✓ | 31.9 | 49 M | 63.92% |
Comparison with the state of the art on the UNIMIB2016 dataset:

Method | mIOU |
---|---|
DeepLab [12] | 43.30% |
SegNet [11] | 44.00% |
WASPnet [6] | 67.50% |
DeepLabv3+ [37] | 68.87% |
GourmetNet (Ours) | 71.79% |
Best and worst performing food classes by per-class mIOU:

Food Name | mIOU | Food Name | mIOU |
---|---|---|---|
Croquette | 92.16% | Fried Fish | 16.29% |
Pancake | 91.67% | Tempura | 17.46% |
Udon Noodle | 88.67% | Vegetable Tempura | 18.23% |
Goya Chanpuru | 88.61% | Salmon Meuniere | 30.28% |
Mixed Rice | 87.54% | Chip Butty | 31.03% |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Sharma, U.; Artacho, B.; Savakis, A. GourmetNet: Food Segmentation Using Multi-Scale Waterfall Features with Spatial and Channel Attention. Sensors 2021, 21, 7504. https://doi.org/10.3390/s21227504