An Explainable Prediction for Dietary-Related Diseases via Language Models
Abstract
1. Introduction
2. Method
2.1. Study Population
2.2. Methodology
2.2.1. Data Preprocessing
2.2.2. Sentiment Analysis
2.2.3. Dietary Pattern Extraction
2.2.4. Target Diseases’ Definitions
- A triglyceride (TG) level at or exceeding 200 mg/dL, or a total cholesterol level surpassing 240 mg/dL, indicating elevated lipid concentrations that pose significant health risks.
- An HDL-cholesterol (high-density lipoprotein cholesterol) level below 40 mg/dL in males or 50 mg/dL in females, reflecting an insufficient level of this protective lipoprotein against cardiovascular disease.
- An LDL-cholesterol (low-density lipoprotein cholesterol) level at or above 160 mg/dL, highlighting an increased risk of atherosclerotic cardiovascular events. When TG levels were below 400 mg/dL, the LDL-cholesterol value was estimated using the Friedewald formula [27].
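The criteria above can be sketched in code. The following is a minimal illustration, not the authors' implementation. It assumes that meeting any one criterion is enough to label a participant as dyslipidemic, that all values are in mg/dL, and that a directly measured LDL value is used when TG is 400 mg/dL or higher. The function and parameter names are hypothetical. The Friedewald estimate is LDL-C = TC − HDL-C − TG/5.

```python
from typing import Optional


def friedewald_ldl(total_chol: float, hdl: float, tg: float) -> Optional[float]:
    """Estimate LDL-C (mg/dL) via the Friedewald formula.

    Returns None when TG >= 400 mg/dL, where the text does not apply the formula.
    """
    if tg >= 400:
        return None
    return total_chol - hdl - tg / 5.0


def has_dyslipidemia(
    sex: str,
    total_chol: float,
    hdl: float,
    tg: float,
    measured_ldl: Optional[float] = None,
) -> bool:
    """Apply the listed thresholds; 'any criterion met' is an assumption."""
    hdl_cutoff = 40.0 if sex == "male" else 50.0

    # Prefer the Friedewald estimate when TG < 400 mg/dL; otherwise fall back
    # to a directly measured value if one is supplied (assumed behaviour).
    ldl = friedewald_ldl(total_chol, hdl, tg)
    if ldl is None:
        ldl = measured_ldl

    return (
        tg >= 200                      # TG at or exceeding 200 mg/dL
        or total_chol > 240            # total cholesterol surpassing 240 mg/dL
        or hdl < hdl_cutoff            # low HDL-C, sex-specific cutoff
        or (ldl is not None and ldl >= 160)  # LDL-C at or above 160 mg/dL
    )


if __name__ == "__main__":
    print(friedewald_ldl(200, 50, 150))
    print(has_dyslipidemia("female", 180, 45, 120))
```

For example, a total cholesterol of 200, HDL-C of 50, and TG of 150 mg/dL give an estimated LDL-C of 200 − 50 − 30 = 120 mg/dL.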
2.2.5. Machine Learning-Based Classification
3. Results
3.1. Dietary Pattern Extraction Results
3.2. Predicting Diseases and Analyzing Disease Prediction Results
3.2.1. Obesity Prediction Results
3.2.2. Dyslipidemia Prediction Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Shetty, D.; Rit, K.; Shaikh, S.; Patil, N. Diabetes disease prediction using data mining. In Proceedings of the 2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), Coimbatore, India, 17–18 March 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–5.
2. Mir, A.; Dhage, S.N. Diabetes disease prediction using machine learning on big data of healthcare. In Proceedings of the 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), Pune, India, 16–18 August 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–6.
3. Sisodia, D.; Sisodia, D.S. Prediction of diabetes using classification algorithms. Procedia Comput. Sci. 2018, 132, 1578–1585.
4. Fitriyani, N.L.; Syafrudin, M.; Alfian, G.; Rhee, J. Development of disease prediction model based on ensemble learning approach for diabetes and hypertension. IEEE Access 2019, 7, 144777–144789.
5. Mishra, S.; Chaudhury, P.; Mishra, B.K.; Tripathy, H.K. An implementation of feature ranking using machine learning techniques for diabetes disease prediction. In Proceedings of the Second International Conference on Information and Communication Technology for Competitive Strategies, Udaipur, India, 4–5 March 2016; pp. 1–3.
6. Minich, D.M.; Bland, J.S. Dietary management of the metabolic syndrome beyond macronutrients. Nutr. Rev. 2008, 66, 429–444.
7. Kim, D.Y.; Ahn, A.; Lee, H.; Choi, J.; Lim, H. Dietary patterns independent of fast food are associated with obesity among Korean adults: Korea National Health and Nutrition Examination Survey 2010–2014. Nutrients 2019, 11, 2740.
8. Ahluwalia, N.; Andreeva, V.A.; Kesse-Guyot, E.; Hercberg, S. Dietary patterns, inflammation and the metabolic syndrome. Diabetes Metab. 2013, 39, 99–110.
9. Choi, I.; Kim, J.; Kim, W.C. Dietary Pattern Extraction Using Natural Language Processing Techniques. Front. Nutr. 2022, 281, 765794.
10. Kim, J.H.; Kim, W.C.; Kim, J. A practical solution to improve the nutritional balance of Korean dine-out menus using linear programming. Public Health Nutr. 2019, 22, 957–966.
11. Han, S. Hanspell. Available online: https://github.com/ssut/py-hanspell (accessed on 14 June 2022).
12. Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of tricks for efficient text classification. arXiv 2016, arXiv:1607.01759.
13. Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146.
14. Lim, S.; Kim, M.; Lee, J. KorQuAD1.0: Korean QA dataset for machine reading comprehension. arXiv 2018, arXiv:1909.07005.
15. Kim, H. Soynlp. Available online: https://github.com/lovit/soynlp (accessed on 14 June 2022).
16. Ravi, K.; Ravi, V. A Survey on Opinion Mining and Sentiment Analysis: Tasks, Approaches, and Applications. Knowl.-Based Syst. 2015, 89, 14–46.
17. Kumar, B.S.; Ravi, V. A Survey of the Applications of Text Mining in Financial Domain. Knowl.-Based Syst. 2016, 114, 128–147.
18. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Comput. 1989, 1, 541–551.
19. Graves, A.; Schmidhuber, J. Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures. Neural Netw. 2005, 18, 602–610.
20. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
21. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
22. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
23. Röder, M.; Both, A.; Hinneburg, A. Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, Shanghai, China, 2–6 February 2015; pp. 399–408.
24. Zhao, W.; Chen, J.J.; Perkins, R.; Liu, Z.; Ge, W.; Ding, Y.; Zou, W. A heuristic approach to determine an appropriate number of topics in topic modeling. BMC Bioinform. 2015, 16, S8.
25. World Health Organization (WHO), Regional Office for the Western Pacific. The Asia-Pacific Perspective: Redefining Obesity and Its Treatment; Health Communications Australia: Sydney, Australia, 2000; Available online: https://apps.who.int/iris/handle/10665/206936 (accessed on 20 August 2023).
26. Rhee, E.J.; Kim, H.C.; Kim, J.H.; Lee, E.Y.; Kim, B.J.; Kim, E.M.; Song, Y.; Lim, J.H.; Kim, H.J.; Choi, S.; et al. Guidelines for the management of dyslipidemia in Korea. J. Lipid Atheroscler. 2019, 8, 78–131.
27. Friedewald, W.T.; Levy, R.I.; Fredrickson, D.S. Estimation of the concentration of low-density lipoprotein cholesterol in plasma, without use of the preparative ultracentrifuge. Clin. Chem. 1972, 18, 499–502.
28. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
29. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 30 (NIPS 2017); NeurIPS Proceedings: New Orleans, LA, USA, 2017; Volume 30.
30. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018); NeurIPS Proceedings: New Orleans, LA, USA, 2018; Volume 31.
31. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2623–2631.
32. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30.
33. Lundberg, S.M.; Lee, S.I. Consistent feature attribution for tree ensembles. arXiv 2017, arXiv:1706.06060.
34. Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods 2020, 17, 261–272.
35. Kumar, R.D.; Julie, E.G.; Robinson, Y.H.; Vimal, S.; Seo, S. Recognition of food type and calorie estimation using neural network. J. Supercomput. 2021, 77, 8172–8193.
36. Wirfält, E.; Drake, I.; Wallström, P. What do review papers conclude about food and dietary patterns? Food Nutr. Res. 2013, 57, 20523.
37. Kang, Y.; Kim, J. Gender difference on the association between dietary patterns and metabolic syndrome in Korean population. Eur. J. Nutr. 2016, 55, 2321–2330.
38. Lee, J.; Kim, J. Association between dietary pattern and incidence of cholesterolemia in Korean adults: The Korean Genome and Epidemiology Study. Nutrients 2018, 10, 53.
39. Ho, D.E.; Mbonu, O.; McDonough, A.; Pottash, R. Menu labeling, calories, and nutrient density: Evidence from chain restaurants. PLoS ONE 2020, 15, e0232656.
40. Mehdipour, F.; Javadi, B.; Mahanti, A.; Ramirez-Prado, G.; Principles, E.C. Fog computing realization for big data analytics. Fog Edge Comput. Princ. Paradig. 2019, 1, 259–290.
41. López-Gil, J.F.; Wu, S.M.; Lee, T.L.I.; Shih, C.W.; Tausi, S.; Sosene, V.; Maani, P.P.; Tupulaga, M.; Hsu, Y.-T.; Chang, C.-R.; et al. Higher imported food patterns are associated with obesity and severe obesity in Tuvalu: A latent class analysis. Curr. Dev. Nutr. 2024, 8, 102080.
42. Choi, I.; Kim, W.C. Detecting and Analyzing Politically-Themed Stocks Using Text Mining Techniques and Transfer Entropy—Focus on the Republic of Korea’s Case. Entropy 2021, 23, 734.
43. Sanchez-Villegas, A.; Martinez, J.A.; De Irala, J.; Martínez-González, M.A. Determinants of the adherence to an “a priori” defined Mediterranean dietary pattern. Eur. J. Nutr. 2002, 41, 249–257.
44. Feinstein, L.; Sabates, R.; Sorhaindo, A.; Rogers, I.; Herrick, D.; Northstone, K.; Emmett, P. Dietary patterns related to attainment in school: The importance of early eating patterns. J. Epidemiol. Community Health 2008, 62, 734–739.
45. Tucker, K.L. Dietary patterns, approaches, and multicultural perspective. Appl. Physiol. Nutr. Metab. 2010, 35, 211–218.
46. Kim, H.; Lee, K.; Rebholz, C.M.; Kim, J. Plant-based diets and incident metabolic syndrome: Results from a South Korean prospective cohort study. PLoS Med. 2020, 17, e1003371.
47. Côté, M.; Lamarche, B. Artificial intelligence in nutrition research: Perspectives on current and future applications. Appl. Physiol. Nutr. Metab. 2022, 47, 1–8.
48. Kim, H.; Lim, D.H.; Kim, Y. Classification and prediction on the effects of nutritional intake on overweight/obesity, dyslipidemia, hypertension and type 2 diabetes mellitus using deep learning model: 4–7th Korea national health and nutrition examination survey. Int. J. Environ. Res. Public Health 2021, 18, 5597.
49. Molenaar, A.; Jenkins, E.L.; Brennan, L.; Lukose, D.; McCaffrey, T.A. The use of sentiment and emotion analysis and data science to assess the language of nutrition-, food- and cooking-related content on social media: A systematic scoping review. Nutr. Res. Rev. 2023, 1–36.
50. Cohen, D.A.; Babey, S.H. Contextual influences on eating behaviours: Heuristic processing and dietary choices. Obes. Rev. 2012, 13, 766–779.
Column | Description | Category |
---|---|---|
REGION | 17 cities | |
TOWN_T | Townships | |
APT_T | Apartment | |
SEX | - | |
AGE | 1–80 * | |
INCM | Income quantiles (individual) | |
HO_INCM | Income quantiles (household) | |
INCM5 | Income quintiles (individual) | |
HO_INCM5 | Income quintiles (household) | |
EDU | Education level | |
OCCP | Occupational reclassification and unemployment/non-economic activities (except conscripted soldiers) | |
Feature | Category | Group | n (%) |
---|---|---|---|
Sex | Demographic | Male | 7144 (42.5%) |
 | | Female | 9665 (57.5%) |
Age | Demographic | 19–39 | 5296 (32.7%) |
 | | 40–59 | 5814 (35.9%) |
 | | 60+ | 5077 (31.4%) |
House type | Demographic | General | 7831 (46.6%) |
 | | Apartment | 8978 (53.4%) |
Highest level of education | Demographic | Graduated elementary school | 3288 (19.6%) |
 | | Graduated high school | 6342 (37.8%) |
 | | Over associate degree/bachelor’s degree | 5639 (33.5%) |
Obesity | Disease | Obesity | 5535 (32.9%) |
 | | Normal | 11,311 (67.1%) |
Dyslipidemia | Disease | Dyslipidemia | 1350 (45.7%) |
 | | Normal | 1605 (54.3%) |
Total | | | 16,809 |
Rank | Pattern 1 | Pattern 2 | Pattern 3 |
---|---|---|---|
1 | Kimchi | Mix of Red Pepper Paste and Soybean Paste | Americano |
2 | Instant Coffee | Pork Belly | Fried Chicken |
3 | Milk | Lettuce | Mayonnaise |
4 | White Rice | Red Pepper | Fish Cake Soup |
5 | Multigrain Rice | Cold Noodle | Ramen |
6 | Soybean Paste Soup | Onion | Salty Snack (Cookie) |
7 | Kimchi Stew | Soju | Chicken Breast |
8 | Apple | Grilled Mushrooms | Soda |
9 | Roast Seaweed | Orange Juice | Sausage |
10 | Stir-Fried Anchovy | Duck Meat | Beer |
Proportion of participants | 67.1% | 17.4% | 15.5% |
Proportion of tokens | 62.3% | 15.0% | 22.7% |
Machine Learning Model | Performance Measure | Benchmark Model | NLP-Based Indices Included Model | T-Statistic | p-Value |
---|---|---|---|---|---|
XGBoost | Balanced accuracy | 0.5276 | 0.5879 | 15.2015 | 0.0000 *** |
 | F1 score | 0.4958 | 0.5813 | 13.6947 | 0.0000 *** |
LightGBM | Balanced accuracy | 0.5194 | 0.5855 | 15.1472 | 0.0000 *** |
 | F1 score | 0.4752 | 0.5754 | 18.7120 | 0.0000 *** |
CatBoost | Balanced accuracy | 0.5276 | 0.5879 | 15.2015 | 0.0000 *** |
 | F1 score | 0.4958 | 0.5813 | 13.6947 | 0.0000 *** |
Machine Learning Model | Performance Measure | Benchmark Model | NLP-Based Indices Included Model | T-Statistic | p-Value |
---|---|---|---|---|---|
XGBoost | Balanced accuracy | 0.5497 | 0.5956 | 3.8721 | 0.0019 *** |
 | F1 score | 0.5461 | 0.5937 | 4.0133 | 0.0015 *** |
LightGBM | Balanced accuracy | 0.5730 | 0.5873 | 1.6280 | 0.0690 * |
 | F1 score | 0.5676 | 0.5858 | 2.2078 | 0.0273 ** |
CatBoost | Balanced accuracy | 0.5801 | 0.6186 | 3.5572 | 0.0031 *** |
 | F1 score | 0.5741 | 0.6166 | 3.8686 | 0.0019 *** |
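For readers less familiar with the two performance measures reported in the tables above, the sketch below shows how balanced accuracy and the F1 score can be computed from binary confusion-matrix counts. It is an illustration only. The function names and the example counts are hypothetical, and it assumes the F1 score refers to the positive (disease) class; the source does not state whether an averaged variant was used.

```python
def balanced_accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, written in its count form."""
    return 2.0 * tp / (2.0 * tp + fp + fn)


if __name__ == "__main__":
    # Hypothetical counts, not taken from the study.
    tp, fp, tn, fn = 30, 20, 80, 20
    print(round(balanced_accuracy(tp, fp, tn, fn), 4))
    print(round(f1_score(tp, fp, fn), 4))
```

Balanced accuracy weights both classes equally, which matters when the classes are imbalanced, as with the obesity labels (32.9% versus 67.1%).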
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Choi, I.; Kim, J.; Kim, W.C. An Explainable Prediction for Dietary-Related Diseases via Language Models. Nutrients 2024, 16, 686. https://doi.org/10.3390/nu16050686