Application of Natural Language Processing and Genetic Algorithm to Fine-Tune Hyperparameters of Classifiers for Economic Activities Analysis
Abstract
1. Introduction
2. Materials and Methods
2.1. Dataset Description
2.2. Natural Language Processing
2.3. Optimizing Classifier Parameters Using a Genetic Algorithm
3. Results
3.1. Subgroups
- Random error in code selection by the author.
- Intentional selection of the wrong code.
- Selection of the wrong code due to a lack of information about which code should be assigned.
3.2. Expert Evaluation
- NACE code 42.1: groups containing phrases that indicate road maintenance (matching the rules “maintenance” and “road AND maintenance”) were identified. These activities do not belong under this code.
- NACE code 41.1: groups related to construction supervision (including the words “control” and “supervision”) were identified. These contract groups were marked as not corresponding to their NACE code.
- NACE code 41.2: groups of activities related to procurement rather than construction were identified. The groups based on the key rule “procurement” were designated as miscellaneous.
- NACE code 42.2: activities were divided into two main groups: repair and construction. Subgroups based on the rules “repair” or “repair” AND “capital” were identified, with the remaining activities containing construction objects. A few activities containing the words “technological” AND “connection” were also identified. This group should belong to NACE 43.2.
- NACE code 43.1: subgroups with the word “landscaping” were assigned to NACE 42.9. Additionally, subgroups based on the rules “wood”, “wood AND territory”, “emergency AND wood”, and “supply” were excluded and classified as miscellaneous.
- NACE code 43.2: a separate group of activities based on the rule “alarm” was identified. This group is not relevant to the analyzed code and was classified as miscellaneous.
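The keyword rules in the list above (a single word such as “procurement”, or a conjunction such as “road AND maintenance”) can be read as simple predicates over the words of a contract description. The following is a minimal sketch under that assumption. The function names, the sample descriptions, and the lowercase regular-expression tokenization are hypothetical and are not taken from the study's code.

```python
import re

def tokenize(text):
    """Lowercase a description and return its set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def matches_rule(description, rule):
    """Return True if every keyword in the rule occurs in the description.

    A rule is a tuple of keywords joined by AND, e.g. ("road", "maintenance").
    """
    tokens = tokenize(description)
    return all(keyword in tokens for keyword in rule)

def flag_for_review(descriptions, rules):
    """Collect descriptions that satisfy at least one rule."""
    return [d for d in descriptions if any(matches_rule(d, r) for r in rules)]

# Hypothetical rules modelled on the NACE 42.1 example in the list above.
RULES_42_1 = [("maintenance",), ("road", "maintenance")]

# Illustrative descriptions, not study data.
sample = [
    "Road maintenance in the district",
    "Construction of a new road section",
]
print(flag_for_review(sample, RULES_42_1))
```

Only the first sample description is flagged, because the second contains “road” but not “maintenance”. A real pipeline would likely match on stemmed or lemmatized tokens rather than raw words.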
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
Division | Title |
---|---|
A.1 | Crop and animal production, hunting, and related service activities |
A.2 | Forestry and logging |
A.3 | Fishing and aquaculture |
B.5 | Mining of coal and lignite |
B.6 | Extraction of crude petroleum and natural gas |
B.7 | Mining of metal ores |
B.8 | Other mining and quarrying |
B.9 | Mining support service activities |
C.10 | Manufacture of food products |
C.11 | Manufacture of beverages |
C.12 | Manufacture of tobacco products |
C.13 | Manufacture of textiles |
C.14 | Manufacture of wearing apparel |
C.15 | Manufacture of leather and related products |
C.16 | Manufacture of wood and of products of wood and cork, except furniture; manufacture of articles of straw and plaiting materials |
C.17 | Manufacture of paper and paper products |
C.18 | Printing and reproduction of recorded media |
C.19 | Manufacture of coke and refined petroleum products |
C.20 | Manufacture of chemicals and chemical products |
C.21 | Manufacture of basic pharmaceutical products and pharmaceutical preparations |
C.22 | Manufacture of rubber and plastic products |
C.23 | Manufacture of other non-metallic mineral products |
C.24 | Manufacture of basic metals |
C.25 | Manufacture of fabricated metal products, except machinery and equipment |
C.26 | Manufacture of computer, electronic, and optical products |
C.27 | Manufacture of electrical equipment |
C.28 | Manufacture of machinery and equipment n.e.c. |
C.29 | Manufacture of motor vehicles, trailers, and semi-trailers |
C.30 | Manufacture of other transport equipment |
C.31 | Manufacture of furniture |
C.32 | Other manufacturing |
C.33 | Repair and installation of machinery and equipment |
D.35 | Electricity, gas, steam, and air conditioning supply |
E.36 | Water collection, treatment, and supply |
E.37 | Sewerage |
E.38 | Waste collection, treatment, and disposal activities; materials recovery |
E.39 | Remediation activities and other waste management services |
F.41 | Construction of buildings |
F.42 | Civil engineering |
F.43 | Specialised construction activities |
G.45 | Wholesale and retail trade and repair of motor vehicles and motorcycles |
G.46 | Wholesale trade, except of motor vehicles and motorcycles |
G.47 | Retail trade, except of motor vehicles and motorcycles |
H.49 | Land transport and transport via pipelines |
H.50 | Water transport |
H.51 | Air transport |
H.52 | Warehousing and support activities for transportation |
H.53 | Postal and courier activities |
I.55 | Accommodation |
I.56 | Food and beverage service activities |
J.58 | Publishing activities |
J.59 | Motion picture, video and television programme production, sound recording, and music publishing activities |
J.60 | Programming and broadcasting activities |
J.61 | Telecommunications |
J.62 | Computer programming, consultancy, and related activities |
J.63 | Information service activities |
K.64 | Financial service activities, except insurance, and pension funding |
K.65 | Insurance, reinsurance, and pension funding, except compulsory social security |
K.66 | Activities auxiliary to financial services and insurance activities |
L.68 | Real estate activities |
M.69 | Legal and accounting activities |
M.70 | Activities of head offices; management consultancy activities |
M.71 | Architectural and engineering activities; technical testing and analysis |
M.72 | Scientific research and development |
M.73 | Advertising and market research |
M.74 | Other professional, scientific, and technical activities |
M.75 | Veterinary activities |
N.77 | Rental and leasing activities |
N.78 | Employment activities |
N.79 | Travel agency, tour operator and other reservation service and related activities |
N.80 | Security and investigation activities |
N.81 | Services to buildings and landscape activities |
N.82 | Office administrative, office support, and other business support activities |
O.84 | Public administration and defence; compulsory social security |
P.85 | Education |
Q.86 | Human health activities |
Q.87 | Residential care activities |
Q.88 | Social work activities without accommodation |
R.90 | Creative, arts, and entertainment activities |
R.91 | Libraries, archives, museums, and other cultural activities |
R.92 | Gambling and betting activities |
R.93 | Sports activities and amusement and recreation activities |
S.94 | Activities of membership organisations |
S.95 | Repair of computers and personal and household goods |
S.96 | Other personal service activities |
T.97 | Activities of households as employers of domestic personnel |
T.98 | Undifferentiated goods- and services-producing activities of private households for own use |
U.99 | Activities of extraterritorial organisations and bodies |
Parameter | Description | CountVectorizer | TfidfVectorizer |
---|---|---|---|
max_features | Max. number of most frequent words kept in the vocabulary | 1500 | 1500 |
min_df | Min. number of documents a word must occur in to be included in analysis | 5 | 5 |
max_df | Max. share of documents a word may occur in; words above it are excluded from analysis | 0.7 | 0.7 |
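To make the three settings in the table concrete, the sketch below applies the same filters by hand to a tokenized corpus. It is a pure-Python illustration, not the scikit-learn implementation. The function `select_vocabulary` and the toy corpus are hypothetical. Treating `min_df` as an absolute document count and `max_df` as a proportion of documents follows the usual scikit-learn convention for integer and float values.

```python
from collections import Counter

def select_vocabulary(docs, max_features=1500, min_df=5, max_df=0.7):
    """Pick the vocabulary that survives the three filters.

    docs: list of token lists.
    min_df: minimum number of documents a word must appear in (absolute count).
    max_df: maximum share of documents a word may appear in (proportion).
    max_features: keep only this many words, ranked by total frequency.
    """
    n_docs = len(docs)
    doc_freq = Counter()   # number of documents containing each word
    term_freq = Counter()  # total occurrences of each word in the corpus
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    kept = [
        word for word, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    ]
    # Rank by corpus frequency (ties broken alphabetically), then truncate.
    kept.sort(key=lambda word: (-term_freq[word], word))
    return sorted(kept[:max_features])

# Toy corpus (illustrative only): ten short tokenized descriptions.
docs = (
    [["repair", "of", "road"]] * 5
    + [["construction", "of", "building"]] * 4
    + [["supply", "of", "wood"]]
)
print(select_vocabulary(docs))  # → ['repair', 'road']
```

In this toy corpus “of” occurs in all ten documents and is dropped by `max_df`, while “construction”, “building”, “supply” and “wood” occur in fewer than five documents and are dropped by `min_df`.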
Classification Method | Accuracy | F1-Score |
---|---|---|
Naive Bayes | 66% | 0.71 |
Decision Tree | 62.7% | 0.70 |
Random Forest | 71% | 0.70 |
Multilayer Perceptron | 71% | 0.73 |
Economic Activity Description | Initial NACE | Description of Initial NACE Code | Predicted NACE | Description of Predicted NACE Code | The Type of Discrepancy |
---|---|---|---|---|---|
Execution of facade repair works | 41.2 | Building and construction works | 43.3 | Finishing works in buildings and structures | Intentional selection of the wrong code. |
Fire depot in the village | 42.1 | Road construction | 26.3 | Communication equipment | Lack of description |
Execution of works on the manufacture and installation of fire doors | 43.2 | Electrical and other types of installation works | 43.3 | Finishing works in buildings and structures | Error in NACE code selection |
Execution of works on the installation of an inclined lift for low-mobility groups of the population | 43.2 | Electrical and other types of installation works | 43.9 | Other specialized construction works | Error in NACE code selection |
Emergency maintenance in 2019 | 43.2 | Electrical and other types of installation works | 33.1 | Repair services for metal products, machinery, and equipment | Lack of description |
Rent of special equipment | 43.9 | Other specialized construction works | 68.2 | Rental services for own or leased real estate. | Intentional selection of the wrong code. |
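Rows like those in the table above arise whenever the classifier's predicted code differs from the code originally assigned. A minimal sketch of that comparison follows. The record layout and the helpers `find_discrepancies` and `same_division` are hypothetical. The first two sample rows are copied from the table and the third is invented to show an agreeing record.

```python
from typing import NamedTuple

class Record(NamedTuple):
    description: str
    initial_nace: str
    predicted_nace: str

def find_discrepancies(records):
    """Return the records whose predicted code differs from the initial one."""
    return [r for r in records if r.initial_nace != r.predicted_nace]

def same_division(record):
    """True if both codes share the two-digit NACE division (e.g. '43')."""
    initial_div = record.initial_nace.split(".")[0]
    predicted_div = record.predicted_nace.split(".")[0]
    return initial_div == predicted_div

records = [
    Record("Execution of facade repair works", "41.2", "43.3"),
    Record("Rent of special equipment", "43.9", "68.2"),
    # Hypothetical agreeing record, added only to show that it is skipped.
    Record("Construction of a residential building", "41.2", "41.2"),
]

for r in find_discrepancies(records):
    scope = "within division" if same_division(r) else "across divisions"
    print(f"{r.description}: {r.initial_nace} -> {r.predicted_nace} ({scope})")
```

The comparison only surfaces candidates. The type of discrepancy in the last column of the table is an expert judgment and is not derived by this code.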
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Malashin, I.; Masich, I.; Tynchenko, V.; Nelyub, V.; Borodulin, A.; Gantimurov, A. Application of Natural Language Processing and Genetic Algorithm to Fine-Tune Hyperparameters of Classifiers for Economic Activities Analysis. Big Data Cogn. Comput. 2024, 8, 68. https://doi.org/10.3390/bdcc8060068