Evaluating the Performance of Topic Modeling Techniques with Human Validation to Support Qualitative Analysis
Abstract
1. Introduction
1.1. Topic Modeling as a Qualitative Research Approach
1.2. Machine Learning Models for Topic Modeling
1.3. Topic Modeling in Computer Science and Engineering Education
2. Methods
2.1. Context and Participants
- Factor: Description of intended modifications or decisions toward the design.
- Argue prediction: Justification for the modification or decision on the design, with informed anticipated outcomes.
- Observation: Informed evidence (i.e., accounts of tangible elements) observed in the CAD/CAE software for the factor under analysis.
- Justification: Students' reasoned support for their proposed solutions.
2.2. Data Collection and Processing
2.3. Differences between Supervised and Unsupervised Approaches
2.3.1. Unsupervised Approach
2.3.2. Supervised Approaches
- Exact matches (EMs): The top three topics are the same and in the same order.
- Unordered matches (UMs): Top three topics are the same but in any order.
- Highest topic matches (HTMs): Whether the main topic appears in the top three predicted topics.
- Proportion of highest topic matches (PHTMs): Frequency of the main topic appearing in the predictions.
- Two of three matches (TTMs): At least two of the top three predicted topics match the actual labels.
- Main topic accuracy (MTA): The first topic prediction matches the actual topic.
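The six match metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each document has a ranked list of its top three actual topics and its top three predicted topics, and it reads "main topic" as the first actual topic. The function name `top3_match_metrics` and the toy topic labels are hypothetical.

```python
from typing import Dict, Sequence


def top3_match_metrics(
    actual: Sequence[Sequence[str]],
    predicted: Sequence[Sequence[str]],
) -> Dict[str, float]:
    """Compute the top-three match metrics over paired ranked topic lists."""
    n = len(actual)
    em = um = htm = ttm = mta = 0
    for a, p in zip(actual, predicted):
        a3, p3 = list(a[:3]), list(p[:3])
        em += a3 == p3                            # same topics, same order
        um += set(a3) == set(p3)                  # same topics, any order
        htm += a3[0] in p3                        # main topic among top three predictions
        ttm += len(set(a3) & set(p3)) >= 2        # at least two topics in common
        mta += a3[0] == p3[0]                     # first prediction equals main topic
    return {
        "exact_matches": em / n,
        "unordered_matches": um / n,
        "highest_topic_matches": htm,             # raw count
        "proportion_highest_topic_matches": htm / n,
        "two_of_three_matches": ttm / n,
        "main_topic_accuracy": mta / n,
    }


# Toy data (hypothetical labels): four documents with the same actual ranking.
actual = [["energy", "solar", "cost"]] * 4
predicted = [
    ["energy", "solar", "cost"],         # exact match
    ["solar", "energy", "cost"],         # unordered match only
    ["energy", "solar", "insulation"],   # two of three, main topic correct
    ["design", "environment", "space"],  # no overlap
]
metrics = top3_match_metrics(actual, predicted)
```

Keeping the highest-topic-match count separate from its proportion mirrors how the two are listed as distinct metrics above.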
2.4. Qualitative Analysis: Supervised and Unsupervised Approaches in Topic Modeling
3. Trustworthiness, Validity, and Reliability
4. Results
4.1. Differences between Supervised and Unsupervised Approaches
4.1.1. Supervised Models
4.1.2. Unsupervised Model: Latent Dirichlet Allocation (LDA)
- Energy efficiency and conservation: This theme focuses on energy efficiency in homes, including aspects such as solar panels, windows, heat, and costs. The frequent mention of terms like “decrease”, “increase”, “annual”, “electrical”, and “consumption” suggests a strong focus on improving energy efficiency and managing energy consumption in households.
- Solar energy positioning and costs: This theme centers around the positioning of elements in houses (e.g., “south”, “side”, “window”, “roof”) to maximize sunlight exposure, particularly in different seasons (“winter”, “summer”). It also touches on the costs associated with solar energy.
- Solar energy generation: This theme emphasizes solar energy generation, focusing on aspects like sunlight, panels, roofs, and trees. It also considers seasonal changes (“summer”, “winter”) and the importance of angle and radiation for effective solar energy generation.
- Economical considerations: This theme revolves around the financial aspects of installing solar panels, such as cost, budget, and expenses. It also considers structural elements like walls and foundations.
- Solar panel placement and solar heat gain: This theme deals with the placement of solar panels on different sides of a house (e.g., “east”, “west”, “south”) to maximize sunlight exposure throughout the day. It also mentions heating and direct sunlight, which are crucial for effective solar energy usage.
- Insulation and thermal consideration in seasons: This theme focuses on managing energy for heating and cooling throughout different seasons (“winter”, “summer”). It mentions elements like insulation, windows, and temperature control to reduce energy consumption and maintain comfort in homes.
4.2. Qualitative Results: Supervised and Unsupervised Approaches in Topic Modeling
- Argue category:
  - LDA achieves a higher percentage in identifying three out of three topics (71.0%) than XGBoost (65.0%), indicating more comprehensive topic coverage.
  - XGBoost performs notably better in identifying two out of three topics (32.0% vs. 27.0% for LDA) and slightly better in identifying one out of three topics (3.0% vs. 2.0% for LDA).
- Observation category:
  - Both models perform similarly across all three metrics; their rates for identifying three out of three topics are identical (77.5%).
  - XGBoost excels in identifying two out of three topics (67.7% vs. 46.5% for LDA) but slightly underperforms in identifying one out of three topics (8.1% vs. 5.1% for LDA).
  - XGBoost correctly identifies none of the topics (zero out of three) in this category, whereas LDA identifies 2.0%.
- Justification category:
  - LDA demonstrates superior performance in identifying three out of three topics (59.3%) compared to XGBoost (23.2%).
  - XGBoost excels in identifying two out of three topics (55.2% vs. 37.5% for LDA) but slightly underperforms in identifying one out of three topics (8.1% vs. 5.1% for LDA).
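The "three out of three", "two out of three", and "one out of three" percentages can be derived from per-response validation judgments. The sketch below is one plausible way to tabulate them, not the authors' procedure. It assumes that, for each student response, a human validator marks each of the model's three assigned topics as correct or incorrect. The function name `coverage_distribution` and the sample judgments are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def coverage_distribution(judgments: Sequence[Sequence[bool]]) -> Dict[int, float]:
    """Percentage of responses with 3, 2, 1, or 0 topics judged correct."""
    # Count, per response, how many of the three assigned topics were validated.
    counts = Counter(sum(bool(j) for j in row) for row in judgments)
    total = len(judgments)
    return {k: 100.0 * counts[k] / total for k in (3, 2, 1, 0)}


# Hypothetical judgments for ten responses in one category.
sample = (
    [[True, True, True]] * 7      # all three topics validated
    + [[True, True, False]] * 2   # two of three validated
    + [[False, True, False]] * 1  # one of three validated
)
distribution = coverage_distribution(sample)
```

Reporting the zero-of-three bucket alongside the other three makes it easy to check that the percentages for a category sum to 100.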
4.2.1. LDA Argumentation Framework
4.2.2. XGBoost Argumentation Framework
5. Discussion
5.1. Comparison of Supervised and Unsupervised Approaches in Terms of Coverage and Computational Efficiency
5.2. Comparison of Supervised and Unsupervised Approaches in Terms of Interpretability
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Algorithm | Hyperparameter | Bounds |
---|---|---|
KNN | Number of neighbors | (3, 11) |
KNN | Weights | [uniform, distance] |
KNN | Algorithm | [auto, balltree, kdtree, brute] |
KNN | Leaf Size | (10, 50) |
KNN | Distance | [Manhattan distance, Euclidean distance] |
SVM | Kernel | [linear, poly, rbf, sigmoid] |
SVM | C | (0.01, 10) |
SVM | Gamma | [scale, auto] |
SVM | Degree | (2, 5) |
Random forest | Number of estimators | (100, 600) |
Random forest | Maximum depth | (3, 30) |
Random forest | Minimum samples split | (3, 20) |
Random forest | Minimum samples leaf | (3, 20) |
Random forest | Maximum features | (0.5, 1.0) |
Random forest | Maximum samples | (0.7, 1.0) |
LightGBM | Number of estimators | (50, 500) |
LightGBM | Maximum depth | (3, 50) |
LightGBM | Number of leaves | (10, 100) |
LightGBM | Learning rate | (0.001, 10) |
LightGBM | Minimum child weight | (1.0, 20.0) |
XGBoost | Number of estimators | (50, 500) |
XGBoost | Maximum depth | (3, 50) |
XGBoost | Learning rate | (0.0001, 10.0) |
XGBoost | Regularization lambda | (0.0, 10.5) |
XGBoost | Gamma | (0.0, 100.0) |
XGBoost | Minimum child weight | (0.0, 200.0) |
XGBoost | Subsample | (0.5, 1.0) |
XGBoost | Colsamplebytree | (0.5, 1.0) |
LSTM | Embedding dimension | (100, 300) |
LSTM | LSTM units | (64, 128, 30) |
LSTM | Learning rate | (0.0001, 0.1) |
LSTM | Batch size | (16, 32, 64) |
LSTM | Epochs | (10, 300) |
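As an illustration of how such bounds can drive a hyperparameter search, the sketch below encodes the XGBoost rows of the table and draws one candidate configuration uniformly at random. The search procedure itself is not stated in this section, so random sampling is used here only as a stand-in. The mapping from the table's labels to XGBoost-style parameter names, the choice of which parameters are integers, and the helper `sample_config` are all assumptions.

```python
import random
from typing import Dict, Tuple, Union

Number = Union[int, float]

# Bounds copied from the XGBoost rows of the table; the key names are assumed.
XGBOOST_BOUNDS: Dict[str, Tuple[Number, Number]] = {
    "n_estimators": (50, 500),
    "max_depth": (3, 50),
    "learning_rate": (0.0001, 10.0),
    "reg_lambda": (0.0, 10.5),
    "gamma": (0.0, 100.0),
    "min_child_weight": (0.0, 200.0),
    "subsample": (0.5, 1.0),
    "colsample_bytree": (0.5, 1.0),
}

# Parameters treated as integers (assumption).
INTEGER_PARAMS = {"n_estimators", "max_depth"}


def sample_config(
    bounds: Dict[str, Tuple[Number, Number]], rng: random.Random
) -> Dict[str, Number]:
    """Draw one configuration uniformly at random within the given bounds."""
    config: Dict[str, Number] = {}
    for name, (low, high) in bounds.items():
        if name in INTEGER_PARAMS:
            config[name] = rng.randint(int(low), int(high))  # inclusive on both ends
        else:
            config[name] = rng.uniform(low, high)
    return config


rng = random.Random(0)  # fixed seed so the draw is reproducible
candidate = sample_config(XGBOOST_BOUNDS, rng)
```

Each sampled candidate would then be passed to the model constructor and scored on a validation split; a guided optimizer could replace `sample_config` without changing the bounds dictionary.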
Source | Top 1 | Top 2 | Top 3 |
---|---|---|---|
Test set | Design and space efficiency | Energy efficiency and conservation | Environmental considerations |
Model prediction | Design and space efficiency | Energy efficiency and conservation | Environmental considerations |
Model | Best Parameters | Mean squared error | Exact matches | Unordered matches | Highest topic matches (of 59 predictions) | Proportion of highest topic matches | Two of three matches | Main topic accuracy |
---|---|---|---|---|---|---|---|---|
Random forest | Number of estimators = 76, Maximum depth = 20, Maximum features = ‘sqrt’, Minimum samples leaf = 4 | 0.0037 | 0.4576 | 0.7288 | 58 | 0.9831 | 1.0000 | 0.7119 |
LGBM | Number of trees = 210, Maximum depth = 3, Number of leaves = 31, Learning rate = 0.01, Estimator subsample = 0.7, Colsamplebytree = 0.9 | 0.0070 | 0.2373 | 0.4407 | 55 | 0.9322 | 0.9661 | 0.5763 |
XGBoost | Number of estimators = 80, Maximum depth = 3, Learning rate = 0.1, Estimator subsample = 0.7, Colsamplebytree = 0.8 | 0.0004 | 0.5085 | 0.8475 | 59 | 1.0000 | 1.0000 | 0.8475 |
RNN: LSTM | LSTM units = 256, Learning rate = 0.001, Epochs = 10, Embedding dim = 300, Batch size = 64 | 0.0069 | 0.1864 | 0.3559 | 58 | 0.9831 | 0.9831 | 0.5593 |
SVM | C = 1.0098, degree = 4, kernel = ‘sigmoid’ | 0.0099 | 0.1695 | 0.3051 | 51 | 0.8644 | 0.9661 | 0.5254 |
KNN | Number of neighbors = 17, Weights = uniform, Algorithm = ‘auto’, Leaf Size = 5 | 0.0147 | 0.0508 | 0.1864 | 39 | 0.6610 | 0.8136 | 0.2881 |
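The proportion of highest topic matches in the table is the highest-topic-match count divided by the 59 test predictions. The short check below recomputes those proportions from the reported counts. It is a worked arithmetic example only, and the variable names are illustrative.

```python
TEST_SET_SIZE = 59  # number of predictions reported in the table

# Highest-topic-match counts as reported for each model.
highest_topic_matches = {
    "Random forest": 58,
    "LGBM": 55,
    "XGBoost": 59,
    "RNN: LSTM": 58,
    "SVM": 51,
    "KNN": 39,
}

# Proportion = count / test set size, rounded to four decimals as in the table.
proportions = {
    model: round(count / TEST_SET_SIZE, 4)
    for model, count in highest_topic_matches.items()
}

for model, value in proportions.items():
    print(f"{model}: {value:.4f}")
```

The recomputed values agree with the reported proportions (for example, 58/59 rounds to 0.9831 and 39/59 to 0.6610).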
Label | Topic Weight | Top 20 Words | Representative Quotes |
---|---|---|---|
Energy efficiency and conservation | 0.12571 | energy, solar, panel, home, decrease, house, net, window, increase, heat, cost, tree, amount, roof, increasing, annual, size, electrical, efficiency, consumption | “The addition of more trees around the home is beneficial and increases energy efficiency because they provide shade to the home when you need it. Because of this, this lowers A/C costs over the year.” |
Solar energy positioning and costs | 0.26446 | energy, south, house, side, roof, sun, window, winter, sunlight, solar, cost, facing, time, panel, area, reduce, block, summer, large, heating | “Hip roof is better because it has more area facing toward the sun compared to regular roof. Reducing the total area of the house can reduce the total cost of the house.” |
Solar energy generation | 0.21765 | solar, house, energy, sun, panel, sunlight, window, roof, tree, south, leaf, summer, light, radiation, angle, generate, heat, directly, facing, winter | “More solar panels will include more solar cells so more sunlight can hit the surface to create energy.” |
Economical considerations | 0.07336 | cost, house, solar, panel, make, budget, high, wall, side, sunlight, foundation, time, order, money, idea, energy, adding, space, expensive, made | “Lowering the walls will bring down the price because there is less material and thus make living in the house cheaper as there is less cost to cover.” |
Solar panel placement and solar heat gain | 0.15969 | solar, energy, house, sun, east, west, south, panel, sunlight, window, side, day, heating, heat, tree, receive, northern, direct, hemisphere, radiation | “Having solar panels on those sides of the house will produce more energy because when solar panels face the sun, they gain more energy from solar radiation and is converted into energy.” |
Insulation and thermal consideration in seasons | 0.18272 | house, energy, heat, winter, air, window, amount, summer, reduce, temperature, cool, heating, sunlight, insulation, wall, adding, net, side, cold, tree | “This should keep all hot and cold air in and shield from the opposite outside. That should reduce the amount of AC and heat the house uses. Shade from the windows should help with cooling in the summer. The amount of energy to heat and cool the house should be reduced because of the reduction of escape points. The insulation should work the same way as for the windows and walls.” |
Model | Argue: 3 of 3 | Argue: 2 of 3 | Argue: 1 of 3 | Observation: 3 of 3 | Observation: 2 of 3 | Observation: 1 of 3 | Justification: 3 of 3 | Justification: 2 of 3 | Justification: 1 of 3 |
---|---|---|---|---|---|---|---|---|---|
LDA | 71.0% | 27.0% | 2.0% | 77.5% | 19.3% | 3.1% | 59.3% | 37.50% | 0.31% |
XGBoost | 65.6% | 32.3% | 2.02% | 77.5% | 20.4% | 2.04% | 40.6% | 55.2% | 4.1% |
Topics Order | Quote |
---|---|
Solar energy positioning and costs | “Solar panels on the roof generate more energy because they are closer to the sun and thus have a higher chance to get more coverage.” |
Solar energy generation | “Solar panels facing the south will generate more electrical energy because they receive more energy from the sun.” |
Insulation and thermal considerations in seasons | “Trees outside the windows will make the summer times colder in the house due to lack of sunlight and make the house warmer in the winter due to the excess of sunlight coming into the house.” |
Topics Order | Quote |
---|---|
Energy efficiency and conservation, insulation and thermal regulation, economic considerations | “Adding a source of power to a house that lacks any will drastically lower the annual energy cost, because having more energy naturally provided will lead to less energy consumed. Raising the roof will help lower energy cost because heat rises, so raising the roof will help contain the heat and lower energy consumption. Increasing window size will decrease annual energy cost because more sunlight will get in, which will provide more heat in the winter months. Increasing the insulation R-value will decrease annual energy cost because the house will be better at maintaining its internal temperature, which will lead to less heat and A/C being used.” |
Insulation and thermal regulation, economic considerations, design and space efficiency | “Adding the solar panels drastically lowered the annual energy cost. Raising the roof did the opposite of what I expected and upped the energy consumption. Increasing the window size did decrease the annual energy cost, but only slightly. Increasing the insulation R value drastically lowered annual energy cost.” |
Economic considerations, insulation and thermal regulation, energy efficiency and conservation | “Solar panels generated a lot of energy, especially compared to no solar panels, because solar panels use energy from sunlight to make electricity that can be used to lower the energy cost. Raising the roof upped the energy consumption because it caused more space to heat, which will lead the heater to work harder to heat the same amount of living space. Increasing the window size did decrease the annual energy cost, but only slightly because the winter energy consumption dropped, but energy consumption increased in the summer which only led to marginal improvements. Increasing the insulation R value drastically lowered annual energy cost because the house better maintained its internal temperature, meaning that less heat and A/C were used. A/C and heat were the two sources taking up energy, and lowering them minimally over one day built up over the year.” |
Model | Argue: 1-2 | Argue: 2-3 | Argue: 1-3 | Observation: 1-2 | Observation: 2-3 | Observation: 1-3 | Justification: 1-2 | Justification: 2-3 | Justification: 1-3 |
---|---|---|---|---|---|---|---|---|---|
LDA | 15.03% | 8.45% | 23.48% | 24.28% | 10.79% | 35.07% | 14.08% | 8.71% | 22.80% |
XGBoost | 5.33% | 6.20% | 11.53% | 5.75% | 5.28% | 11.04% | 5.14% | 4.75% | 9.89% |
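The column labels 1-2, 2-3, and 1-3 are not defined in this section. One reading, offered here only as an assumption, is that they give the gap between the first and second, second and third, and first and third ranked topics. Under that reading the 1-3 value should equal the sum of the other two, and the check below shows the reported figures are consistent with this up to rounding. The dictionary layout and names are illustrative.

```python
# Reported (1-2, 2-3, 1-3) percentages per model and category.
reported = {
    ("LDA", "Argue"): (15.03, 8.45, 23.48),
    ("LDA", "Observation"): (24.28, 10.79, 35.07),
    ("LDA", "Justification"): (14.08, 8.71, 22.80),
    ("XGBoost", "Argue"): (5.33, 6.20, 11.53),
    ("XGBoost", "Observation"): (5.75, 5.28, 11.04),
    ("XGBoost", "Justification"): (5.14, 4.75, 9.89),
}

# Absolute difference between (1-2) + (2-3) and the reported 1-3 value,
# rounded to two decimals to match the table's precision.
discrepancies = {
    key: round(abs(gap_12 + gap_23 - gap_13), 2)
    for key, (gap_12, gap_23, gap_13) in reported.items()
}

largest = max(discrepancies.values())
print(f"Largest discrepancy: {largest:.2f} percentage points")
```

A discrepancy of at most 0.01 percentage points is what two-decimal rounding of the underlying values would produce.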
Model | Topic 1 | Topic 2 | Topic 3 | Quote |
---|---|---|---|---|
Claim | ||||
LDA | Solar energy positioning and costs | Solar energy generation | Energy efficiency and conservation | “Solar panels on the south side of the roof generate more electricity because they are exposed to the sun more throughout the day and convert the sunlight to electrical energy. Making the windows on the south side of the house larger will allow more sunlight in during the winter and let the sunlight warm the house more instead of using power on the heater.” |
XGBoost | Energy efficiency and conservation | Design and space efficiency | Environmental considerations | |
Observation | ||||
LDA | Solar energy positioning and costs | Energy efficiency and conservation | Solar energy generation | “Solar panels on the south side caused the annual energy cost of the house to decrease. Larger windows on the south side of the house caused the energy cost of the house to decrease.” |
XGBoost | Energy efficiency and conservation | Economic considerations | Insulation and thermal regulation | |
Justification | ||||
LDA | Solar energy generation | Economic considerations | Energy efficiency and conservation | “The solar panels generated more energy because when they are exposed to the sun more, they received more sunlight and converted more energy to electrical energy. The larger windows saved energy because the house was able to be warmed more from the sun in the winter, which saved energy from being spent on the heater.” |
XGBoost | Economic considerations | Energy efficiency and conservation | Insulation and thermal regulation |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Romero, J.D.; Feijoo-Garcia, M.A.; Nanda, G.; Newell, B.; Magana, A.J. Evaluating the Performance of Topic Modeling Techniques with Human Validation to Support Qualitative Analysis. Big Data Cogn. Comput. 2024, 8, 132. https://doi.org/10.3390/bdcc8100132