The Chef’s Choice: System for Allergen and Style Classification in Recipes
Abstract
1. Introduction
1.1. Motivation
1.2. Goal
1.3. Overview
2. Related Work
2.1. Cuisine Classification
2.2. Allergen Classification
2.3. Findings
2.3.1. Misclassification for Similar Cuisines
2.3.2. One-Vs-Rest and Feature Amount
2.3.3. Noise in Public Recipes
2.3.4. Class Imbalance
2.3.5. Stratification of Multi Labeled Data
2.4. Related Solutions
2.5. Proposed System and Existing Solutions Distinction
3. System Concept and Methodology
4. Data Acquisition
4.1. Kaggle Dataset
4.2. Openfoodfacts Dataset
4.3. Dataset Preprocessing
- Removal of non-alphanumeric characters
- Stopword removal
- POS tagging
- Lemmatization of ingredients
- Filtering
- One-Hot-Encoding
- Up- and down-sampling
- Feature transformation
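A minimal sketch of three of the preprocessing steps listed above (removal of non-alphanumeric characters, stopword removal, and one-hot encoding), using only the Python standard library. The stopword list, the function names, and the sample recipes are hypothetical and chosen for illustration; POS tagging and lemmatization are omitted because they normally rely on an NLP library.

```python
import re

# Hypothetical stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"of", "and", "the", "fresh", "chopped"}


def clean_ingredient(text):
    """Lowercase, replace non-alphanumeric characters, and drop stopwords."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)


def one_hot_encode(recipes):
    """Map each recipe (a list of ingredient strings) to a 0/1 vector."""
    vocabulary = sorted({ing for recipe in recipes for ing in recipe})
    index = {ing: i for i, ing in enumerate(vocabulary)}
    vectors = []
    for recipe in recipes:
        row = [0] * len(vocabulary)
        for ing in recipe:
            row[index[ing]] = 1
        vectors.append(row)
    return vocabulary, vectors


# Illustrative input, not taken from the datasets.
raw_recipes = [
    ["Olive Oil", "garlic!", "chopped tomato (fresh)"],
    ["Garlic", "sesame-oil", "the ginger"],
]
cleaned = [[clean_ingredient(i) for i in recipe] for recipe in raw_recipes]
vocabulary, vectors = one_hot_encode(cleaned)
```

Each column of `vectors` corresponds to one entry of `vocabulary`, so a recipe is represented by which ingredients it contains rather than by their order.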
Up- and Down-Sampling
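Random resampling is one simple way to realize up- and down-sampling. The sketch below brings every class to the same size by duplicating random items of small classes and drawing a random subset of large ones. It assumes single-label data; the function name `resample_to_target`, the target size, and the sample data are hypothetical, and the resampling strategy actually used in the system may differ.

```python
import random
from collections import defaultdict


def resample_to_target(samples, labels, target, seed=0):
    """Randomly up- or down-sample every class to exactly `target` items."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    out_samples, out_labels = [], []
    for label in sorted(by_class):
        items = by_class[label]
        if len(items) >= target:
            # Down-sample: draw `target` items without replacement.
            chosen = rng.sample(items, target)
        else:
            # Up-sample: keep all items and add random duplicates.
            extra = [rng.choice(items) for _ in range(target - len(items))]
            chosen = items + extra
        out_samples.extend(chosen)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Illustrative imbalanced data: four recipes of one class, two of another.
samples = ["r1", "r2", "r3", "r4", "r5", "r6"]
labels = ["italian"] * 4 + ["thai"] * 2
balanced_samples, balanced_labels = resample_to_target(samples, labels, target=3)
```

Resampling should be applied to the training portion only; duplicating items before the split would leak copies of training recipes into the test set.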
4.4. Cuisine Classification
- Train test split;
- Feature transformation;
- Hyperparameter tuning;
- Classifier evaluation.
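For the feature transformation step, TF-IDF weighting is one common option, and the reference list includes a TF-IDF paper (Ramos). Whether the system uses exactly this weighting is an assumption. The sketch below implements the plain, unsmoothed form tf × log(N / df) with the standard library; the function name and the sample recipes are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document

    weighted = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weighted.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


# Illustrative recipes; each ingredient is treated as one term.
recipes = [
    ["garlic", "olive oil", "tomato"],
    ["garlic", "ginger", "sesame oil"],
    ["garlic", "tomato", "basil"],
]
weights = tf_idf(recipes)
```

An ingredient that occurs in every recipe (here `garlic`) receives weight 0, while rarer ingredients are weighted more strongly. Library implementations often add smoothing terms, so their values will not match this sketch exactly.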
4.4.1. Hyperparameter Tuning
4.4.2. Machine Learning Classifier
4.5. Allergen Classification
- Up- and down-sampling;
- Train test split;
- Feature transformation;
- Hyperparameter tuning;
- Classifier evaluation.
4.5.1. Hyperparameter Tuning
- f1_samples
- roc_auc
- roc_auc_ovr_weighted
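The three names above match scikit-learn scoring identifiers. As an illustration of the first one, the sketch below computes a sample-averaged F1 score for multi-label predictions by hand: one F1 value per recipe over its label set, then the mean over recipes. The label columns and data are hypothetical, and scoring an empty row as 0 is an assumption of this sketch.

```python
def f1_samples(y_true, y_pred):
    """Mean of per-recipe F1 scores over multi-label indicator rows."""
    scores = []
    for true_row, pred_row in zip(y_true, y_pred):
        tp = sum(1 for t, p in zip(true_row, pred_row) if t == 1 and p == 1)
        n_true = sum(true_row)
        n_pred = sum(pred_row)
        if n_true + n_pred == 0:
            scores.append(0.0)  # assumption: rows without any label score 0
        else:
            # F1 = 2 * TP / (|true labels| + |predicted labels|)
            scores.append(2 * tp / (n_true + n_pred))
    return sum(scores) / len(scores)


# Hypothetical allergen columns: milk, egg, nut.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
score = f1_samples(y_true, y_pred)  # (2/3 + 1 + 1/2) / 3
```

Because the score is averaged per recipe, a recipe with many allergens counts as much as a recipe with a single one.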
4.5.2. Machine Learning Classifier
5. Evaluation
5.1. Evaluation of Classifiers
- Micro;
- Macro;
- Weighted.
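The difference between the three averaging modes can be seen on a small example. The sketch below computes micro-, macro-, and weighted-averaged F1 for a single-label problem with the standard library only; the function name and the sample labels are hypothetical.

```python
from collections import Counter


def f1_by_averaging(y_true, y_pred):
    """Return micro-, macro-, and weighted-averaged F1 for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)
    return {
        # Micro: pool all counts, then compute one F1.
        "micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        # Macro: unweighted mean of the per-class F1 values.
        "macro": sum(per_class.values()) / len(labels),
        # Weighted: per-class F1 weighted by the number of true instances.
        "weighted": sum(per_class[c] * support[c] for c in labels) / len(y_true),
    }


y_true = ["italian"] * 4 + ["thai"] * 2
y_pred = ["italian", "italian", "italian", "italian", "thai", "italian"]
result = f1_by_averaging(y_true, y_pred)
```

With imbalanced classes the macro average is pulled down by the weaker minority class (here `thai`), while the micro and weighted averages are dominated by the majority class.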
5.2. Cuisine Classification
5.3. Allergen Classification
- ROC AUC [37];
- F1 score;
- Accuracy score.
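As an illustration of the first metric, ROC AUC for a single label can be computed as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise formulation; the function name, labels, and scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs both positive and negative examples")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for one allergen label.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.3, 0.4, 0.5, 0.1]
auc = roc_auc(y_true, scores)  # 5 of 6 pairs are ranked correctly
```

Unlike accuracy and F1, this value does not depend on a decision threshold, which makes it useful for comparing classifiers on imbalanced labels.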
wheat flour, bicarbonate ammonium, bay, dioxide preservative, garlic ginger, yolk sugar, honey mustard, salt organic, riboflavin, cajun, toffee, spread, kernel soybean, vinegar lactic, manufacture, lt, cottonseed oil, masa, advice, mollusc, nut, barley flour, tree, present, tree nut, hazelnut, cashew, walnut, pecan, almond
5.4. Evaluation of Proposed System
5.4.1. Results
5.4.2. User Study Shortcomings
6. Conclusions
6.1. Summary
6.2. Challenges and Problems
- Lack of data;
- Data quality;
- Class imbalance.
6.3. Known Limitations and Discussion
6.4. Future Extensions and Development
6.4.1. Performance
6.4.2. Language
6.4.3. Datasets
6.4.4. Machine Learning Classifier
6.4.5. Regional Cooking Terms
6.4.6. Feedback Loop
6.4.7. Integrations
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Council of European Union. Regulation (EU) No 1169/2011 of the European Parliament and of the Council of 25 October 2011 on the Provision of Food Information to Consumers, Amending Regulations (EC) No 1924/2006 and (EC) No 1925/2006 of the European Parliament and of the Council, and Repealing Commission Directive 87/250/EEC, Council Directive 90/496/EEC, Commission Directive 1999/10/EC, Directive 2000/13/EC of the European Parliament and of the Council, Commission Directives 2002/67/EC and 2008/5/EC and Commission Regulation (EC) No 608/2004. 2011. Available online: https://www.legislation.gov.uk/eur/2011/1169/contents (accessed on 2 December 2020).
- Bruijnzeel-Koomen, C.; Ortolani, C.; Aas, K.; Bindslev-Jensen, C.; Björksten, B.; Wüthrich, B. Adverse reactions to food: Position paper of the European Academy of Allergy and Clinical Immunology. Allergy 1995, 50, 623–635.
- Tang, M.L.K.; Mullins, R.J. Food allergy: Is prevalence increasing? Intern. Med. J. 2017, 47, 256–261.
- Joachims, T. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning, Chemnitz, Germany, 21–23 April 1998; Springer: Berlin/Heidelberg, Germany, 1998; pp. 137–142.
- Rennie, J.D.M.; Rifkin, R. Improving Multiclass Text Classification with the Support Vector Machine. 2001. Available online: https://www.researchgate.net/publication/2522390_Improving_Multiclass_Text_Classification_with_the_Support_Vector_Machine (accessed on 27 February 2022).
- Rish, I. An Empirical Study of the Naive Bayes Classifier. Available online: https://www.cc.gatech.edu/home/isbell/classes/reading/papers/Rish.pdf (accessed on 27 February 2022).
- Kleinbaum, D.G.; Dietz, K.; Gail, M.; Klein, M.; Klein, M. Logistic Regression; Springer: Cham, Switzerland, 2002.
- Kalajdziski, S.; Radevski, G.; Ivanoska, I.; Trivodaliev, K.; Stojkoska, B.R. Cuisine classification using recipe’s ingredients. In Proceedings of the 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, 21–25 May 2018; pp. 1074–1079.
- Yujian, L.; Bo, L. A normalized Levenshtein distance metric. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 1091–1095.
- Swamynathan, M. Mastering Machine Learning with Python in Six Steps: A Practical Implementation Guide to Predictive Data Analytics Using Python, 2nd ed.; Apress: New York, NY, USA, 2019.
- Ramos, J. Using tf-idf to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning, Washington, DC, USA, 21–24 August 2003; Volume 242, pp. 29–48.
- Teng, C.Y.; Lin, Y.R.; Adamic, L.A. Recipe recommendation using ingredient networks. In Proceedings of the WebSci ’12 4th Annual ACM Web Science Conference Association for Computing Machinery, New York, NY, USA, 22–24 June 2012; pp. 298–307.
- Li, B.; Wang, M. Cuisine Classification from Ingredients. Available online: http://cs229.stanford.edu/proj2015/313_report.pdf (accessed on 27 February 2022).
- Su, H.; Lin, T.W.; Li, C.T.; Shan, M.K.; Chang, J. Automatic recipe cuisine classification by ingredients. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication; Association for Computing Machinery, New York, NY, USA, 13–17 September 2014; pp. 565–570.
- Lane, P.C.R.; Clarke, D.; Hender, P. On developing robust models for favourability analysis: Model choice, feature sets and imbalanced data. Decis. Support Syst. 2012, 53, 712–718.
- Britto, L.; Pacífico, L.; Oliveira, E.; Ludermir, T. A cooking recipe multi-label classification approach for food restriction identification. In Proceedings of the Anais do XVII Encontro Nacional de Inteligência Artificial e Computacional, SBC, Porto Alegre, Brazil, 20–23 October 2020; pp. 246–257.
- Bishop, C.M. Pattern Recognition and Machine Learning; Information Science and Statistics; Springer: New York, NY, USA, 2006.
- Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection; Ijcai: Montreal, QC, Canada, 1995; Volume 14, pp. 1137–1145.
- Sechidis, K.; Tsoumakas, G.; Vlahavas, I. On the stratification of multi-label data. In Machine Learning and Knowledge Discovery in Databases; Gunopulos, D., Hofmann, T., Malerba, D., Vazirgiannis, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 145–158.
- Alemany-Bordera, J.; Heras Barberá, S.M.; Palanca Cámara, J.; Julian Inglada, V.J. Bargaining agents based system for automatic classification of potential allergens in recipes. ADCAIJ Adv. Distrib. Comput. Artif. Intell. J. 2016, 5, 43–51.
- U.S. Department of Agriculture. FoodData Central. 2020. Available online: https://fdc.nal.usda.gov/ (accessed on 2 December 2020).
- Ueda, M.; Takahata, M.; Nakajima, S. User’s food preference extraction for personalized cooking recipe recommendation. In Proceedings of the Second International Conference on Semantic Personalized Information Management: Retrieval and Recommendation, Bonn, Germany, 24 October 2011; Volume 781, pp. 98–105.
- Freyne, J.; Berkovsky, S. Intelligent food planning: Personalized recipe recommendation. In Proceedings of the 15th International Conference on Intelligent User Interfaces; Association for Computing Machinery, New York, NY, USA, 7–10 February 2010; pp. 321–324.
- Open Food Facts Community. Open Food Facts—Food Products Database. 2020. Available online: https://world.openfoodfacts.org/data (accessed on 27 February 2022).
- Tsoumakas, G.; Katakis, I. Multi-label classification: An overview. Int. J. Data Warehous. Min. 2007, 3, 1–13.
- Manna, S. Imbalanced Multilabel Scene Classification using Keras. The Owl, 29 July 2020.
- Feurer, M.; Hutter, F. Hyperparameter optimization. In Automated Machine Learning; Springer: Cham, Switzerland, 2019; pp. 3–33.
- Jaccard, P. The distribution of the flora in the alpine zone. 1. New Phytol. 1912, 11, 37–50.
- Soucy, P.; Mineau, G.W. A simple KNN algorithm for text categorization. In Proceedings of the 2001 IEEE International Conference on Data Mining, San Jose, CA, USA, 29 November–2 December 2001; pp. 647–648.
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
- Albon, C. Machine Learning with Python Cookbook: Practical Solutions from Preprocessing to Deep Learning, 1st ed.; O’Reilly Media: Newton, MA, USA, 2018.
- Sorower, M.S. A Literature Survey on Algorithms for Multi-Label Learning. 2010. Available online: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.364.5612&rep=rep1&type=pdf (accessed on 27 February 2022).
- Hinton, G.E. Connectionist learning procedures. In Machine Learning; Elsevier: Amsterdam, The Netherlands, 1990; pp. 555–610.
- Pillai, I.; Fumera, G.; Roli, F. Designing multi-label classifiers that maximize F measures: State of the art. Pattern Recognit. 2017, 61, 394–404.
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
- Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960, 20, 37–46.
- Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
- Nam, J.; Kim, J.; Loza Mencía, E.; Gurevych, I.; Fürnkranz, J. Large-scale multi-label text classification—Revisiting neural networks. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Nancy, France, 14–18 September 2014; Calders, T., Esposito, F., Hüllermeier, E., Meo, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2014; pp. 437–452.
Id | Cuisine | Ingredients |
---|---|---|
0 | Spanish | mussel, black pepper, garlic, saffron thread, olive oil, stew tomato, arborio rice… |
1 | Mexican | tomato, red onion, paprika, salt, corn tortilla, cilantro, cremini, broth, pepper… |
2 | French | chicken broth, truffle, pimento, green pepper, olive, turkey, egg yolk, … |
3 | Chinese | ginger, sesame oil, pea, cooked rice, bell pepper, peanut oil, egg, garlic, … |
Classifier | Parameters |
---|---|
KNN | n_neighbors: 75 |
Logistic Regression | C: 0.5; max_iter: 1000; multi_class: auto; solver: lbfgs |
Random Forest | class_weight: balanced; max_depth: 75; max_features: auto; n_estimators: 100 |
Decision Tree | max_depth: 120; max_features: auto; min_samples_leaf: 1 |
SVC | C: 10; gamma: 0.001; kernel: rbf |
LinearSVC | C: 0.2; dual: false; max_iter: 1100; penalty: l1 |
Classifier | Parameters |
---|---|
Logistic Regression | estimator__C: 20; estimator__class_weight: balanced; estimator__max_iter: 2500; estimator__solver: saga |
Random Forest | estimator__class_weight: balanced; estimator__max_depth: 400; estimator__n_estimators: 2000 |
Decision Tree | estimator__max_depth: 2500; estimator__min_samples_leaf: 5 |
MLP | activation: relu; early_stopping: True; hidden_layer_sizes: (130,); learning_rate: constant; max_iter: 300 |
Participant No. | Correct | Missing | Incorrect | Avg. Seconds/Recipe | Ratio (Correct / (Correct + Missing)) |
---|---|---|---|---|---|
1-informed | 46 | 4 | 3 | 34.88 | 92.00% |
2-informed | 42 | 8 | 9 | 47.50 | 84.00% |
3-informed | 38 | 12 | 0 | 62.81 | 76.00% |
4-informed | 42 | 8 | 3 | 73.75 | 84.00% |
5-informed | 46 | 4 | 3 | 35.31 | 92.00% |
6 | 25 | 25 | 6 | 38.63 | 50.00% |
7 | 40 | 10 | 2 | 41.25 | 80.00% |
8 | 38 | 12 | 4 | 58.69 | 76.00% |
9 | 34 | 16 | 3 | 50.88 | 68.00% |
10 | 36 | 14 | 4 | 36.56 | 72.00% |
Logistic Regression | 36 | 14 | 2 | 4.5 | 72.00% |
MLP | 28 | 22 | 6 | 4.5 | 56.00% |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Roither, A.; Kurz, M.; Sonnleitner, E. The Chef’s Choice: System for Allergen and Style Classification in Recipes. Appl. Sci. 2022, 12, 2590. https://doi.org/10.3390/app12052590