Benchmarking with a Language Model Initial Selection for Text Classification Tasks
Abstract
1. Introduction
- A new approach to model benchmarking, called Language Model-Dataset Fit (LMDFit) benchmarking, is devised. The approach substantially reduces the carbon emissions associated with model testing by performing an initial selection of candidate models for a target dataset prior to the model performance assessment. The efficiency of the LMDFit approach was verified in extensive experiments against the conventional benchmarking process. Applying the proposed approach reduced emissions by 10 to 75% (37% on average), depending on the classification task at hand.
- The two-vector composed of the mean and skewness of the semantic similarity score distribution is shown to be a reliable group-predictor of language model classification performance on a given dataset. Unlike existing approaches to forecasting AI model performance, which rank candidate models for further experiments, the two-vector of text similarity statistics is used to categorize all candidate models as either “more fit” or “less fit”. All more-fit models are then analyzed in benchmarking experiments, which may yield less dramatic emission cuts than keeping only a few top-ranked models. On the other hand, this conservative approach is more robust and secure, as it minimizes the risk of inadvertently cutting off relevant models due to noise or bias in the data or statistics used for the initial categorization.
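To make the two-vector concrete, the following minimal sketch (our illustration, not the authors' code; `similarity_fit_vector` is a hypothetical helper name) computes the mean and skewness of pairwise cosine similarities over a set of sentence embeddings:

```python
import numpy as np
from scipy.stats import skew

def similarity_fit_vector(embeddings: np.ndarray) -> tuple[float, float]:
    """Return the (mean, skewness) two-vector of pairwise cosine
    similarities between row-wise sentence embeddings."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                              # cosine similarity matrix
    upper = sims[np.triu_indices(len(embeddings), k=1)]   # each unordered pair once
    return float(upper.mean()), float(skew(upper))

# Toy usage with random vectors standing in for model embeddings
rng = np.random.default_rng(0)
print(similarity_fit_vector(rng.normal(size=(100, 768))))
```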
2. Related Work
2.1. AI Costs and Environmental Impacts
2.2. Language Model Benchmarking
2.3. Language Model Performance Prediction
3. Resources Used
3.1. Pre-Trained Language Models
3.2. Datasets
- (a) Environmental claims (https://huggingface.co/datasets/climatebert/environmental_claims, accessed on 5 March 2024). The set contains 2647 real-world environmental claims, mostly made in the financial domain by listed companies [41]. The data were annotated by 16 domain experts and have been used in studies such as [42,43]. The set supports the task of environmental claim detection, a sentence-level binary classification task. It includes both true claims (approximately 25%) and false claims (approximately 75%), with an average of 27.61 word tokens per sentence.
- (b) AGNews (https://huggingface.co/datasets/ag_news, accessed on 5 March 2024) is a large collection of news articles supporting the general task of text classification [44]. The articles are categorized into four classes: “World”, “Sport”, “Business”, and “Science/Technology”. The set comprises 120,000 training samples and 7600 testing samples with equal representation for each class. The average word token count across all articles is 43.93. This dataset has been widely used for benchmarking purposes in NLP (e.g., see [45,46]).
- (c) Financial phrase-bank (https://huggingface.co/datasets/financial_phrasebank, accessed on 5 March 2024). The dataset was created to support sentiment analysis in the financial domain [47]. It contains phrases selected from financial news articles and company press releases, labeled by 16 human annotators as “positive”, “negative”, or “neutral”. The data have been used by several research groups (e.g., [48,49]). For the purposes of the presented study, a sample consisting of phrases with over 50% inter-annotator agreement was compiled from the data. The sample includes 4840 financial statements classified as “negative” (59.41% of the total), “positive” (28.13%), or “neutral” (the rest). On average, one phrase in the sample has 23.15 tokens.
- (d) Rheology (https://huggingface.co/datasets/bluesky333/chemical_language_understanding_benchmark, accessed on 5 March 2024) is a subset of the Chemical Language Understanding Benchmark collection [50]. The dataset supports a sentence-level classification task. It consists of 2017 single-labeled sentences from research papers in the chemistry domain, with an average sentence length of 39.67 tokens. The sentences are organized into five classes exemplifying different polymer structures and properties.
- (e) Plant–chemical relationship corpus (http://gcancer.org/plant_chemical_corpus/, accessed on 5 March 2024). The data comprise 939 documents describing plant–chemical relationships [51]. The relationships were annotated by experts with labels indicating either a “positive” or “negative” containment of chemicals in the plants. The set has been used in NLP research to support the named entity recognition task [52]. For the purposes of this study, abstract sentences and plant and chemical element names from the set were concatenated to form the input for the language models using the following template (see the snippet after this list): “Relation of {plant} and {chemical} on {sentence}”. The average length of the concatenated text is 41.92 tokens.
- (f) arXiv (https://www.kaggle.com/datasets/Cornell-University/arxiv, accessed on 5 March 2024). The arXiv collection is maintained by Cornell University and includes over 1.7 million scholarly articles publicly available on arXiv.org. The dataset has been used in multiple text classification studies (e.g., see [53,54]). By design, it supports multi-label classification, as each archived article can belong to more than one field of study. The stored articles are supplemented with extensive metadata, such as versions, titles, authors, categories, and abstracts. For the purposes of this study, a random sample of 51,774 article abstracts (https://www.kaggle.com/code/chhatrabikramshah123/researchpaperrecommendation/input?select=arxiv_data_210930-054931.csv, accessed on 5 March 2024) was utilized. The nine largest category names plus “Other” were used as labels for the texts. The average token count is 194.50 per abstract in the sample.
- (g) The European Court of Human Rights (ECtHR) cases (https://huggingface.co/datasets/ecthr_cases, accessed on 5 March 2024). This is a commonly used benchmarking dataset for NLP in the legal domain (see [55] for a related study). The data include facts from 11,000 ECtHR cases. The facts are multi-labeled with the articles of the European Convention on Human Rights (ECHR) they violate. There are 33 labels, and the average case size is 1918.76 word tokens.
- (h) Ohsumed (http://disi.unitn.it/moschitti/corpora.htm, accessed on 5 March 2024) consists of abstracts of MEDLINE records on cardiovascular diseases registered in 1991. There are 34,389 abstracts in total, and the average abstract size is 115.75 tokens. The abstracts are multi-labeled with 23 specific cardiovascular disease categories. The data support the multi-label, multi-class text classification task, and the dataset has been widely used in NLP research (e.g., see [56,57]).
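As a concrete illustration, the snippet below shows how the Hub-hosted corpora above can be loaded and how the plant–chemical inputs from item (e) are assembled. `load_dataset` is the standard Hugging Face `datasets` API; `build_input` and its field names are our illustrative stand-ins, since the plant–chemical corpus is distributed in its own annotated-document format:

```python
from datasets import load_dataset  # Hugging Face `datasets` library

# Load one of the Hub-hosted benchmark corpora, e.g., AGNews
ag_news = load_dataset("ag_news", split="test")
print(ag_news[0]["text"][:80], ag_news[0]["label"])

# Assemble a plant-chemical input with the template quoted in item (e)
def build_input(plant: str, chemical: str, sentence: str) -> str:
    return f"Relation of {plant} and {chemical} on {sentence}"

print(build_input(
    plant="Camellia sinensis",
    chemical="caffeine",
    sentence="The leaves were analyzed for alkaloid content.",
))
```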
3.3. Efficiency Measures
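The experiment tables below report computational time in seconds and CO2-equivalent emissions in grams per task. As one way to obtain such measurements (a hedged sketch; this excerpt does not name the tracking tool actually used in the study), the open-source CodeCarbon tracker can wrap any workload:

```python
import time
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="lmdfit-benchmark", log_level="error")
start = time.perf_counter()
tracker.start()
# ... fine-tuning / inference workload goes here ...
emissions_kg = tracker.stop()          # CodeCarbon reports kg CO2-eq
elapsed = time.perf_counter() - start
print(f"Time: {elapsed:.1f} s, Emissions: {emissions_kg * 1000:.1f} g CO2-eq")
```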
4. LMDFit and Model Initial Selection
4.1. Overview
- Screening. This step entails recruiting pre-trained candidate models. The candidate set M is assembled manually based on model specifications (including the language and training corpora), utilization precedents, and other relevant information. The screening process gathers potential candidates for benchmarking.
- Selecting. The second step assesses the fitness of each candidate model for the target task. All recruited models are classified as either less-fit or more-fit based on how they perform on a proxy evaluative task. This proxy task assesses the “basic skills” of the candidate models: the abilities to differentiate and to generalize texts.
- Hiring. A (subjective) decision on which models should be benchmarked. The fitness assessment results are used to reduce the number of models to be tested.
- Onboarding. In this study, this refers to the standard benchmarking process. Onboarding thus also includes all “must-have” procedures before the models can be deployed in comparison experiments, such as fine-tuning and inferencing.
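The following Python skeleton (our sketch with stub bodies, under assumed names; not the authors' implementation) makes the control flow of the four stages explicit:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    fit_vector: tuple[float, float] = (0.0, 0.0)  # (mean, skewness)
    hired: bool = False

def assess_fitness(model_name: str, sample: list[str]) -> tuple[float, float]:
    """Stub: compute the (mean, skewness) two-vector of semantic
    similarity scores for `model_name` on a sample of the target data."""
    return (0.8, -0.5)  # placeholder values

def split_by_fitness(candidates: list[Candidate]) -> None:
    """Stub: partition candidates into more-fit/less-fit groups,
    e.g., by 2-means clustering over the fit vectors (Section 4.2)."""
    for c in candidates:
        c.hired = c.fit_vector[1] > -1.0  # placeholder rule

def benchmark(model_name: str) -> dict[str, float]:
    """Stub: conventional fine-tuning and evaluation."""
    return {"f1_micro": 0.0, "f1_macro": 0.0}

def lmdfit(model_names: list[str], sample: list[str]) -> dict[str, dict]:
    candidates = [Candidate(n) for n in model_names]       # Screening
    for c in candidates:
        c.fit_vector = assess_fitness(c.name, sample)      # Selecting
    split_by_fitness(candidates)                           # Hiring
    return {c.name: benchmark(c.name)                      # Onboarding
            for c in candidates if c.hired}
```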
4.2. Candidate Model Fitness Assessment
4.2.1. Assumptions
4.2.2. Sampling and Implementation Details
Algorithm 1: Model initial selection
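The algorithm box itself is not reproduced in this excerpt. The sketch below approximates the initial-selection procedure from the surrounding description: each candidate model embeds a text sample, the (mean, skewness) two-vector of pairwise cosine similarities is computed, and candidates are split into two groups by 2-means clustering (Lloyd's algorithm). Mean pooling of the last hidden states and the rule labeling the cluster whose skewness centroid lies closer to zero as more-fit are our assumptions:

```python
import numpy as np
import torch
from scipy.stats import skew
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

def fit_vector(model_name: str, texts: list[str]) -> np.ndarray:
    """(mean, skewness) of pairwise cosine similarities of mean-pooled
    sentence embeddings produced by `model_name`."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    with torch.no_grad():
        enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = model(**enc).last_hidden_state            # (B, T, H)
        mask = enc["attention_mask"].unsqueeze(-1)         # (B, T, 1)
        emb = (hidden * mask).sum(1) / mask.sum(1)         # mean pooling
    emb = torch.nn.functional.normalize(emb, dim=1).numpy()
    sims = (emb @ emb.T)[np.triu_indices(len(texts), k=1)]
    return np.array([sims.mean(), skew(sims)])

def initial_selection(model_names: list[str], texts: list[str]) -> dict[str, str]:
    """Categorize candidates as more-fit/less-fit via 2-means clustering
    over their (mean, skewness) fit vectors."""
    X = np.stack([fit_vector(m, texts) for m in model_names])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    centroids = [X[labels == k].mean(axis=0) for k in (0, 1)]
    # Assumed labeling rule: skewness centroid nearer zero => more fit
    more_fit = max((0, 1), key=lambda k: centroids[k][1])
    return {m: ("more-fit" if lab == more_fit else "less-fit")
            for m, lab in zip(model_names, labels)}
```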
5. Experiments
5.1. Environmental Claims Collection
5.2. AGNews
5.3. Financial Phrase-Bank
5.4. Rheology Dataset
5.5. Plant–Chemical Relationship Corpus
5.6. arXiv Documents
5.7. ECtHR Cases
5.8. Ohsumed Collection
6. Discussion
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Strubell, E.; Ganesh, A.; McCallum, A. Energy and policy considerations for deep learning in NLP. arXiv 2019, arXiv:1906.02243. [Google Scholar]
- Verdecchia, R.; Sallou, J.; Cruz, L. A systematic review of Green AI. WIREs Data Min. Knowl. Discov. 2023, 13, e1507. [Google Scholar] [CrossRef]
- Schwartz, R.; Dodge, J.; Smith, N.A.; Etzioni, O. Green AI. Commun. ACM 2020, 63, 54–63. [Google Scholar] [CrossRef]
- Li, B.; Qi, P.; Liu, B.; Di, S.; Liu, J.; Pei, J.; Yi, J.; Zhou, B. Trustworthy AI: From Principles to Practices. ACM Comput. Surv. 2023, 55, 1–46. [Google Scholar] [CrossRef]
- Blagec, K.; Dorffner, G.; Moradi, M.; Samwald, M. A critical analysis of metrics used for measuring progress in artificial intelligence. arXiv 2021, arXiv:2008.02577. [Google Scholar]
- Klotz, A.C.; da Motta Veiga, S.P.; Buckley, M.R.; Gavin, M.B. The role of trustworthiness in recruitment and selection: A review and guide for future research. J. Organ. Behav. 2013, 34, S104–S119. [Google Scholar] [CrossRef]
- Ahuja, K.; Dandapat, S.; Sitaram, S.; Choudhury, M. Beyond Static Models and Test Sets: Benchmarking the Potential of Pre-trained Models Across Tasks and Languages. arXiv 2022, arXiv:2205.06356. [Google Scholar]
- Ahuja, K.; Kumar, S.; Dandapat, S.; Choudhury, M. Multi task learning for zero shot performance prediction of multilingual models. arXiv 2022, arXiv:2205.06130. [Google Scholar]
- Xia, M.; Anastasopoulos, A.; Xu, R.; Yang, Y.; Neubig, G. Predicting performance for natural language processing tasks. arXiv 2020, arXiv:2005.00870. [Google Scholar]
- Kadiķis, E.; Vaibhav, S.; Klinger, R. Embarrassingly simple performance prediction for abductive natural language inference. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 10–15 July 2022. [Google Scholar] [CrossRef]
- Li, Q.; Peng, H.; Li, J.; Xia, C.; Yang, R.; Sun, L.; Yu, P.S.; He, L. A Survey on Text Classification: From Traditional to Deep Learning. ACM Trans. Intell. Syst. Technol. 2022, 13, 1–41. [Google Scholar] [CrossRef]
- Altınel, B.; Ganiz, M.C. Semantic text classification: A survey of past and recent advances. Inf. Process. Manag. 2018, 54, 1129–1153. [Google Scholar] [CrossRef]
- Garrido-Merchan, E.C.; Gozalo-Brizuela, R.; Gonzalez-Carvajal, S. Comparing BERT against traditional machine learning models in text classification. J. Comput. Cogn. Eng. 2023, 2, 352–356. [Google Scholar] [CrossRef]
- Ferro, M.; Silva, G.D.; de Paula, F.B.; Vieira, V.; Schulze, B. Towards a sustainable artificial intelligence: A case study of energy efficiency in decision tree algorithms. Concurr. Comput. Pract. Exp. 2023, 35, e6815. [Google Scholar] [CrossRef]
- Gutiérrez, M.; Moraga, M.Á.; García, F. Analysing the energy impact of different optimisations for machine learning models. In Proceedings of the 2022 International Conference on ICT for Sustainability (ICT4S), Plovdiv, Bulgaria, 13–17 June 2022; IEEE: New York, NY, USA, 2022; pp. 46–52. [Google Scholar] [CrossRef]
- Gurumurthy, A.; Kodali, R. Benchmarking the Benchmarking Models. Benchmarking Int. J. 2008, 15, 257–291. [Google Scholar] [CrossRef]
- Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv 2018, arXiv:1804.07461. [Google Scholar]
- Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. Superglue: A stickier benchmark for general-purpose language understanding systems. Adv. Neural Inf. Process. Syst. 2019, 32, 3266–3280. [Google Scholar]
- Liang, Y.; Duan, N.; Gong, Y.; Wu, N.; Guo, F.; Qi, W.; Gong, M.; Shou, L.; Jiang, D.; Cao, G.; et al. XGLUE: A new benchmark dataset for cross-lingual pre-training, understanding and generation. arXiv 2020, arXiv:2004.01401. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Lundgard, A. Measuring justice in machine learning. arXiv 2020, arXiv:2009.10050. [Google Scholar]
- Caton, S.; Haas, C. Fairness in machine learning: A survey. ACM Comput. Surv. 2020, 56, 1–38. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
- Casola, S.; Lauriola, I.; Lavelli, A. Pre-trained transformers: An empirical comparison. Mach. Learn. Appl. 2022, 9, 100334. [Google Scholar] [CrossRef]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. Xlnet: Generalized autoregressive pretraining for language understanding. Adv. Neural Inf. Process. Syst. 2019, 32, 5753–5763. [Google Scholar]
- Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. Albert: A lite bert for self-supervised learning of language representations. arXiv 2019, arXiv:1909.11942. [Google Scholar]
- Ye, Z.; Liu, P.; Fu, J.; Neubig, G. Towards more fine-grained and reliable NLP performance prediction. arXiv 2021, arXiv:2102.05486. [Google Scholar]
- Boughorbel, S.; Jarray, F.; El-Anbari, M. Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS ONE 2017, 12, e0177678. [Google Scholar] [CrossRef] [PubMed]
- Jiang, Y.; Neyshabur, B.; Mobahi, H.; Krishnan, D.; Bengio, S. Fantastic Generalization Measures and Where to Find Them. arXiv 2019, arXiv:1912.02178. [Google Scholar]
- Dziugaite, G.K.; Drouin, A.; Neal, B.; Rajkumar, N.; Caballero, E.; Wang, L.; Mitliagkas, I.; Roy, D.M. In search of robust measures of generalization. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA, 6–12 December 2020. [Google Scholar]
- Martin, C.H.; Peng, T.; Mahoney, M.W. Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data. Nat. Commun. 2021, 12, 4122. [Google Scholar] [CrossRef]
- Yang, Y.; Theisen, R.; Hodgkinson, L.; Gonzalez, J.E.; Ramchandran, K.; Martin, C.H.; Mahoney, M.W. Test Accuracy vs. Generalization Gap: Model Selection in NLP without Accessing Training or Testing Data. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2023; ACM: New York, NY, USA, 2023; pp. 3011–3021. [Google Scholar] [CrossRef]
- Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A pretrained language model for scientific text. arXiv 2019, arXiv:1903.10676. [Google Scholar]
- Chalkidis, I.; Fergadiotis, M.; Malakasiotis, P.; Aletras, N.; Androutsopoulos, I. LEGAL-BERT: The muppets straight out of law school. arXiv 2020, arXiv:2010.02559. [Google Scholar]
- Han, X.; Zhang, Z.; Ding, N.; Gu, Y.; Liu, X.; Huo, Y.; Qiu, J.; Yao, Y.; Zhang, A.; Zhang, L.; et al. Pre-trained models: Past, present and future. AI Open 2021, 2, 225–250. [Google Scholar] [CrossRef]
- Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef] [PubMed]
- Hazourli, A. FinancialBERT—A Pretrained Language Model for Financial Text Mining. 2022. Available online: https://huggingface.co/ahmedrachid/FinancialBERT (accessed on 23 October 2024). [CrossRef]
- ValizadehAslani, T.; Shi, Y.; Ren, P.; Wang, J.; Zhang, Y.; Hu, M.; Zhao, L.; Liang, H. PharmBERT: A domain-specific BERT model for drug labels. Briefings Bioinform. 2023, 24, bbad226. [Google Scholar] [CrossRef] [PubMed]
- Stammbach, D.; Webersinke, N.; Bingler, J.A.; Kraus, M.; Leippold, M. A Dataset for Detecting Real-World Environmental Claims. arXiv 2022, arXiv:2209.00507. [Google Scholar] [CrossRef]
- Webersinke, N.; Kraus, M.; Bingler, J.A.; Leippold, M. Climatebert: A pretrained language model for climate-related text. arXiv 2021, arXiv:2110.12010. [Google Scholar] [CrossRef]
- Schimanski, T.; Bingler, J.; Hyslop, C.; Kraus, M.; Leippold, M. Climatebert-netzero: Detecting and assessing net zero and reduction targets. arXiv 2023, arXiv:2310.08096. [Google Scholar]
- Zhang, X.; Zhao, J.; LeCun, Y. Character-level convolutional networks for text classification. Adv. Neural Inf. Process. Syst. 2015, 28, 649–657. [Google Scholar]
- Li, Z.; Xu, J.; Zeng, J.; Li, L.; Zheng, X.; Zhang, Q.; Chang, K.W.; Hsieh, C.J. Searching for an effective defender: Benchmarking defense against adversarial word substitution. arXiv 2021, arXiv:2108.12777. [Google Scholar]
- Xiong, Y.; Feng, Y.; Wu, H.; Kamigaito, H.; Okumura, M. Fusing label embedding into bert: An efficient improvement for text classification. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; pp. 1743–1750. [Google Scholar]
- Malo, P.; Sinha, A.; Korhonen, P.; Wallenius, J.; Takala, P. Good debt or bad debt: Detecting semantic orientations in economic texts. J. Assoc. Inf. Sci. Technol. 2014, 65, 782–796. [Google Scholar] [CrossRef]
- Soong, G.H.; Tan, C.C. Sentiment Analysis on 10-K Financial Reports using Machine Learning Approaches. In Proceedings of the 2021 IEEE 11th International Conference on System Engineering and Technology (ICSET), Shah Alam, Malaysia, 6 November 2021; IEEE: New York, NY, USA, 2021; pp. 124–129. [Google Scholar]
- Leippold, M. Sentiment spin: Attacking financial sentiment with GPT-3. Financ. Res. Lett. 2023, 55, 103957. [Google Scholar] [CrossRef]
- Kim, Y.; Ko, H.; Lee, J.; Heo, H.Y.; Yang, J.; Lee, S.; Lee, K.H. Chemical Language Understanding Benchmark. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track); Sitaram, S., Beigman Klebanov, B., Williams, J.D., Eds.; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 404–411. [Google Scholar] [CrossRef]
- Cho, H.; Kim, B.; Choi, W.; Lee, D.; Lee, H. Plant phenotype relationship corpus for biomedical relationships between plants and phenotypes. Sci. Data 2022, 9, 235. [Google Scholar] [CrossRef]
- Choi, W.; Kim, B.; Cho, H.; Lee, D.; Lee, H. A corpus for plant-chemical relationships in the biomedical domain. BMC Bioinform. 2016, 17, 1–15. [Google Scholar] [CrossRef]
- Scharpf, P.; Schubotz, M.; Youssef, A.; Hamborg, F.; Meuschke, N.; Gipp, B. Classification and clustering of arxiv documents, sections, and abstracts, comparing encodings of natural and mathematical language. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, Online, 1–5 August 2020; pp. 137–146. [Google Scholar]
- Patadia, D.; Kejriwal, S.; Mehta, P.; Joshi, A.R. Zero-shot approach for news and scholarly article classification. In Proceedings of the 2021 International Conference on Advances in Computing, Communication, and Control (ICAC3), Mumbai, India, 3–4 December 2021; IEEE: New York, NY, USA, 2021; pp. 1–5. [Google Scholar]
- Chalkidis, I.; Jana, A.; Hartung, D.; Bommarito, M.; Androutsopoulos, I.; Katz, D.M.; Aletras, N. LexGLUE: A benchmark dataset for legal language understanding in English. arXiv 2021, arXiv:2110.00976. [Google Scholar] [CrossRef]
- Peng, Y.; Wu, W.; Ren, J.; Yu, X. Novel GCN Model Using Dense Connection and Attention Mechanism for Text Classification. Neural Process. Lett. 2024, 56, 1–17. [Google Scholar] [CrossRef]
- Burkhardt, S.; Kramer, S. Online multi-label dependency topic models for text classification. Mach. Learn. 2018, 107, 859–886. [Google Scholar] [CrossRef]
- Schilter, O.; Schwaller, P.; Laino, T. Balancing computational chemistry’s potential with its environmental impact. Green Chem. 2024, 26, 8669–8679. [Google Scholar] [CrossRef]
- Martínez, F.S.; Parada, R.; Casas-Roma, J. CO2 impact on convolutional network model training for autonomous driving through behavioral cloning. Adv. Eng. Inform. 2023, 56, 101968. [Google Scholar] [CrossRef]
- Becker, G.S. Human capital and the economy. Proc. Am. Philos. Soc. 1992, 136, 85–92. [Google Scholar]
- Morley, M.J. Person-organization fit. J. Manag. Psychol. 2007, 22, 109–117. [Google Scholar] [CrossRef]
- Edwards, J.R. Person-Job Fit: A Conceptual Integration, Literature Review, and Methodological Critique; John Wiley & Sons: Hoboken, NJ, USA, 1991. [Google Scholar]
- Nafukho, F.M.; Hairston, N.; Brooks, K. Human capital theory: Implications for human resource development. Hum. Resour. Dev. Int. 2004, 7, 545–551. [Google Scholar] [CrossRef]
- Harris, Z. Distributional Structure; Taylor & Francis Group: Abingdon, UK, 1954. [Google Scholar]
- Firth, J.R. A synopsis of linguistic theory, 1930–1955. In Studies in Linguistic Analysis, Special Volume of the Philological Society; Blackwell: Oxford, UK, 1957. [Google Scholar]
- Turney, P.D.; Pantel, P. From frequency to meaning: Vector space models of semantics. J. Artif. Intell. Res. 2010, 37, 141–188. [Google Scholar] [CrossRef]
- Hill, F.; Reichart, R.; Korhonen, A. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Comput. Linguist. 2015, 41, 665–695. [Google Scholar] [CrossRef]
- Gao, T.; Yao, X.; Chen, D. Simcse: Simple contrastive learning of sentence embeddings. arXiv 2021, arXiv:2104.08821. [Google Scholar]
- Aggarwal, C.C.; Hinneburg, A.; Keim, D.A. On the surprising behavior of distance metrics in high dimensional space. In Proceedings of the Database Theory—ICDT 2001: 8th International Conference, London, UK, 4–6 January 2001; Springer: Berlin/Heidelberg, Germany, 2001; pp. 420–434. [Google Scholar]
- Huang, A. Similarity measures for text document clustering. In Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, 14–18 April 2008; Volume 4, pp. 9–56. [Google Scholar]
- Spruill, M. Asymptotic distribution of coordinates on high dimensional spheres. Electron. Commun. Probab. 2007, 12, 234–247. [Google Scholar] [CrossRef]
- Paukkeri, M.S.; Kivimäki, I.; Tirunagari, S.; Oja, E.; Honkela, T. Effect of dimensionality reduction on different distance measures in document clustering. In Proceedings of the Neural Information Processing: 18th International Conference, ICONIP 2011, Shanghai, China, 13–17 November 2011; Proceedings, Part III; Springer: Berlin/Heidelberg, Germany, 2011; pp. 167–176. [Google Scholar]
- Wijewickrema, M.; Petras, V.; Dias, N. Selecting a text similarity measure for a content-based recommender system: A comparison in two corpora. Electron. Libr. 2019, 37, 506–527. [Google Scholar] [CrossRef]
- Parsons, V.L. Stratified Sampling. In Wiley StatsRef: Statistics Reference Online; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2017; pp. 1–11. [Google Scholar] [CrossRef]
- Doane, D.P.; Seward, L.E. Measuring skewness: A forgotten statistic? J. Stat. Educ. 2011, 19, 2. [Google Scholar] [CrossRef]
- Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [Google Scholar] [CrossRef]
- Yu, H.; Gao, C.; Li, X.; Zhang, L. Ancient Chinese Poetry Collation Based on BERT. Procedia Comput. Sci. 2024, 242, 1171–1178. [Google Scholar] [CrossRef]
- Raiaan, M.A.K.; Mukta, M.S.H.; Fatema, K.; Fahad, N.M.; Sakib, S.; Mim, M.M.J.; Ahmad, J.; Ali, M.E.; Azam, S. A review on large Language Models: Architectures, applications, taxonomies, open issues and challenges. IEEE Access 2024, 12, 26839–26874. [Google Scholar] [CrossRef]
- Gasparetto, A.; Marcuzzo, M.; Zangari, A.; Albarelli, A. A survey on text classification algorithms: From text to predictions. Information 2022, 13, 83. [Google Scholar] [CrossRef]
No. | Name | Huggingface Address * | Pre-Training Approach | Training Data
---|---|---|---|---
1 | BERT [20] | bert-base-uncased | from scratch | Wikipedia and Books Corpus
2 | SciBERT [35] | allenai/scibert_scivocab_uncased | from scratch | Scientific papers
3 | LegalBERT [36] | nlpaueb/legal-bert-base-uncased | from scratch | Legal text (court cases, contracts, legislation, etc.)
4 | FinancialBERT [39] | ahmedrachid/FinancialBERT | from scratch | TRC2-financial corpus, Bloomberg news articles, corporate reports, and earnings call transcripts
5 | PharmBERT [40] | Lianglab/PharmBERT-uncased | fine-tuned from BERT (1) | DailyMed drug labels
6 | Agriculture BERT | recobo/agriculture-bert-uncased | fine-tuned from SciBERT (2) | National Agricultural Library (NAL) documents and agricultural literature
7 | Chemical BERT | recobo/chemical-bert-uncased | fine-tuned from SciBERT (2) | Chemical industry domain documents and Wikipedia chemistry documents
Fitness | Model | LMDFit Mean | LMDFit Skewness | F1 Micro * | F1 Macro *
---|---|---|---|---|---
more-fit | BERT | 0.712 | −0.335 | 0.892 | 0.858
more-fit | SciBERT | 0.740 | −0.445 | 0.898 | 0.871
more-fit | PharmBERT | 0.748 | −0.512 | 0.887 | 0.855
more-fit | FinancialBERT | 0.774 | −0.323 | 0.890 | 0.857
more-fit | Chemical BERT | 0.832 | −0.659 | 0.853 | 0.810
more-fit | Agriculture BERT | 0.841 | −0.422 | 0.911 | 0.887
less-fit | LegalBERT | 0.880 | −3.082 | 0.855 | 0.819
Task | LMDFit Time (s) | LMDFit Emissions (g) | Conventional Time (s) | Conventional Emissions (g)
---|---|---|---|---
Fitness assessment | 51.2 | 1.1 | - | -
Clustering and Hiring | 2.1 | <0.1 | - | -
Benchmarking | 1419.7 | 47.8 | 1646.3 | 55.5
Total | 1473.0 | 48.8 | 1646.3 | 55.5
Fitness | Model | LMDFit Mean | LMDFit Skewness | F1 Micro * | F1 Macro *
---|---|---|---|---|---
more-fit | BERT | 0.681 | −0.141 | 0.948 | 0.948
more-fit | PharmBERT | 0.779 | −0.397 | 0.945 | 0.945
less-fit | SciBERT | 0.777 | −0.589 | 0.943 | 0.943
less-fit | FinancialBERT | 0.800 | −0.635 | 0.928 | 0.928
less-fit | Agriculture BERT | 0.853 | −0.622 | 0.943 | 0.943
less-fit | Chemical BERT | 0.887 | −1.017 | 0.932 | 0.932
less-fit | LegalBERT | 0.887 | −0.879 | 0.942 | 0.942
Task | LMDFit Time (s) | LMDFit Emissions (g) | Conventional Time (s) | Conventional Emissions (g)
---|---|---|---|---
Fitness assessment | 55.3 | 1.2 | - | -
Clustering and Hiring | 2.0 | <0.1 | - | -
Benchmarking | 27,599.1 | 1851.8 | 97,620.6 | 6569.1
Total | 27,656.4 | 1853.0 | 97,620.6 | 6569.1
Fitness | Model | LMDFit Mean | LMDFit Skewness | F1 Micro * | F1 Macro *
---|---|---|---|---|---
more-fit | BERT | 0.713 | −0.164 | 0.840 | 0.825
more-fit | FinancialBERT | 0.749 | −0.129 | 0.841 | 0.824
more-fit | PharmBERT | 0.752 | −0.240 | 0.835 | 0.810
more-fit | Agriculture BERT | 0.817 | −0.162 | 0.844 | 0.830
less-fit | SciBERT | 0.762 | −0.825 | 0.840 | 0.823
less-fit | Chemical BERT | 0.815 | −0.490 | 0.729 | 0.592
less-fit | LegalBERT | 0.881 | −0.809 | 0.731 | 0.576
Task | LMDFit Time (s) | LMDFit Emissions (g) | Conventional Time (s) | Conventional Emissions (g)
---|---|---|---|---
Fitness assessment | 52.3 | 1.1 | - | -
Clustering and Hiring | 2.1 | <0.1 | - | -
Benchmarking | 1431.2 | 63.4 | 2487.5 | 109.9
Total | 1485.6 | 64.5 | 2487.5 | 109.9
Fitness | Model | LMDFit Mean | LMDFit Skewness | F1 Micro * | F1 Macro *
---|---|---|---|---|---
more-fit | PharmBERT | 0.798 | −0.608 | 0.577 | 0.387
more-fit | Chemical BERT | 0.831 | −0.515 | 0.563 | 0.397
more-fit | Agriculture BERT | 0.838 | −0.520 | 0.639 | 0.573
less-fit | BERT | 0.779 | −0.668 | 0.539 | 0.358
less-fit | SciBERT | 0.783 | −0.768 | 0.589 | 0.491
less-fit | FinancialBERT | 0.818 | −0.766 | 0.543 | 0.357
less-fit | LegalBERT | 0.902 | −0.692 | 0.472 | 0.291
Task | LMDFit Time (s) | LMDFit Emissions (g) | Conventional Time (s) | Conventional Emissions (g)
---|---|---|---|---
Fitness assessment | 55.0 | 1.2 | - | -
Clustering and Hiring | 1.9 | <0.1 | - | -
Benchmarking | 753.6 | 28.8 | 1784.6 | 68.0
Total | 810.5 | 30.0 | 1784.6 | 68.0
Fitness | Model | LMDFit Mean | LMDFit Skewness | F1 Micro * | F1 Macro *
---|---|---|---|---|---
more-fit | SciBERT | 0.737 | −0.425 | 0.816 | 0.812
more-fit | BERT | 0.786 | −0.508 | 0.784 | 0.781
more-fit | FinancialBERT | 0.824 | −0.342 | 0.756 | 0.754
more-fit | Agriculture BERT | 0.827 | −0.555 | 0.840 | 0.837
more-fit | Chemical BERT | 0.854 | −0.230 | 0.769 | 0.766
more-fit | LegalBERT | 0.893 | −0.544 | 0.707 | 0.697
less-fit | PharmBERT | 0.784 | −1.711 | 0.803 | 0.801
Task | LMDFit Time (s) | LMDFit Emissions (g) | Conventional Time (s) | Conventional Emissions (g)
---|---|---|---|---
Fitness assessment | 54.7 | 1.2 | - | -
Clustering and Hiring | 2.0 | <0.1 | - | -
Benchmarking | 1376.7 | 46.2 | 1614.9 | 54.3
Total | 1433.4 | 47.4 | 1614.9 | 54.3
Fitness | Model | LMDFit Mean | LMDFit Skewness | F1 Micro * | F1 Macro *
---|---|---|---|---|---
more-fit | SciBERT | 0.804 | −0.494 | 0.807 | 0.605
more-fit | PharmBERT | 0.847 | −0.717 | 0.794 | 0.562
more-fit | FinancialBERT | 0.866 | −0.586 | 0.790 | 0.553
more-fit | Agriculture BERT | 0.891 | −0.465 | 0.808 | 0.600
more-fit | Chemical BERT | 0.911 | −0.652 | 0.779 | 0.522
more-fit | LegalBERT | 0.932 | −0.561 | 0.786 | 0.536
less-fit | BERT | 0.835 | −0.946 | 0.793 | 0.553
Task | LMDFit Time (s) | LMDFit Emissions (g) | Conventional Time (s) | Conventional Emissions (g)
---|---|---|---|---
Fitness assessment | 59.9 | 1.3 | - | -
Clustering and Hiring | 1.9 | <0.1 | - | -
Benchmarking | 55,153.1 | 3612.0 | 64,373.6 | 4214.6
Total | 55,214.9 | 3613.3 | 64,373.6 | 4214.6
Fitness | Model | LMDFit Mean | LMDFit Skewness | F1 Micro * | F1 Macro *
---|---|---|---|---|---
more-fit | BERT | 0.816 | −0.496 | 0.702 | 0.190
more-fit | SciBERT | 0.854 | −1.088 | 0.701 | 0.212
more-fit | PharmBERT | 0.861 | −0.787 | 0.702 | 0.200
more-fit | FinancialBERT | 0.872 | −1.126 | 0.691 | 0.202
more-fit | LegalBERT | 0.885 | −0.728 | 0.717 | 0.205
more-fit | Agriculture BERT | 0.908 | −1.274 | 0.700 | 0.219
less-fit | Chemical BERT | 0.929 | −4.083 | 0.662 | 0.163
Task | LMDFit Time (s) | LMDFit Emissions (g) | Conventional Time (s) | Conventional Emissions (g)
---|---|---|---|---
Fitness assessment | 147.1 | 3.5 | - | -
Clustering and Hiring | 2.2 | <0.1 | - | -
Benchmarking | 18,906.3 | 924.5 | 22,104.4 | 1080.0
Total | 19,055.6 | 928.0 | 22,104.4 | 1080.0
Fitness | Model | LMDFit Mean | LMDFit Skewness | F1 Micro * | F1 Macro *
---|---|---|---|---|---
more-fit | SciBERT | 0.699 | −0.109 | 0.783 | 0.746
more-fit | Agriculture BERT | 0.802 | −0.124 | 0.776 | 0.739
less-fit | PharmBERT | 0.764 | −0.609 | 0.764 | 0.725
less-fit | BERT | 0.792 | −0.822 | 0.753 | 0.700
less-fit | FinancialBERT | 0.824 | −0.708 | 0.733 | 0.677
less-fit | LegalBERT | 0.881 | −0.834 | 0.732 | 0.663
less-fit | Chemical BERT | 0.882 | −0.654 | 0.715 | 0.652
Task | LMDFit Time (s) | LMDFit Emissions (g) | Conventional Time (s) | Conventional Emissions (g)
---|---|---|---|---
Fitness assessment | 61.1 | 1.4 | - | -
Clustering and Hiring | 2.1 | <0.1 | - | -
Benchmarking | 9136.3 | 593.3 | 37,039.9 | 2431.6
Total | 9199.5 | 594.7 | 37,039.9 | 2431.6
Dataset | Number of Models Not Selected for Onboarding | Computational Time Decrease (%) | Emission Reduction (%)
---|---|---|---
Environmental claims | 1 | 10.5 | 11.8
AGNews | 5 | 71.7 | 71.8
Financial phrase-bank | 3 | 40.3 | 41.2
Rheology | 4 | 54.6 | 55.9
Plant–chemical relationship | 1 | 11.2 | 12.6
arXiv | 1 | 14.2 | 14.3
ECtHR | 1 | 13.8 | 14.1
Ohsumed | 5 | 75.2 | 75.5
Average | 2.6 | 36.4 | 37.1
Correlation coefficients between the cosine similarity statistics (mean, skewness) and the models' F1 scores, per dataset:

Dataset | F1 Micro: Spearman, Mean | F1 Micro: Spearman, Skewness | F1 Micro: Pearson, Mean | F1 Micro: Pearson, Skewness | F1 Macro: Spearman, Mean | F1 Macro: Spearman, Skewness | F1 Macro: Pearson, Mean | F1 Macro: Pearson, Skewness
---|---|---|---|---|---|---|---|---
Environmental claims | 0.32 | 0.64 | 0.50 | 0.64 | 0.32 | 0.64 | 0.46 | 0.58
AGNews | 0.61 | 0.86 | 0.47 | 0.63 | 0.67 | 0.88 | 0.47 | 0.63
Financial phrase-bank | 0.32 | 0.71 | 0.75 | 0.55 | 0.43 | 0.68 | 0.77 | 0.56
Rheology | 0.07 | 0.21 | 0.48 | 0.43 | 0.11 | 0.36 | 0.30 | 0.35
Plant–chemical relationship | 0.57 | 0.39 | 0.69 | 0.23 | 0.57 | 0.39 | 0.69 | 0.23
arXiv | 0.57 | 0.43 | 0.58 | 0.39 | 0.71 | 0.39 | 0.60 | 0.45
ECtHR | 0.89 | 0.86 | 0.53 | 0.92 | 0.04 | 0.07 | 0.24 | 0.75
Ohsumed | 0.89 | 0.75 | 0.88 | 0.76 | 0.89 | 0.75 | 0.89 | 0.79
Average | 0.53 | 0.61 | 0.61 | 0.57 | 0.47 | 0.52 | 0.55 | 0.54
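Correlations of this kind can be recomputed from any of the per-model results tables above; a minimal SciPy sketch (our illustration, taking the mean-similarity and F1 micro columns of the environmental-claims table as inputs; whether the table reports signed or absolute values is not visible in this excerpt):

```python
from scipy.stats import pearsonr, spearmanr

# Mean cosine similarity and F1 micro per model (environmental claims table)
mean_sim = [0.712, 0.740, 0.748, 0.774, 0.832, 0.841, 0.880]
f1_micro = [0.892, 0.898, 0.887, 0.890, 0.853, 0.911, 0.855]

rho, _ = spearmanr(mean_sim, f1_micro)
r, _ = pearsonr(mean_sim, f1_micro)
print(f"Spearman rho = {rho:.2f}, Pearson r = {r:.2f}")
```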