sBERT: Parameter-Efficient Transformer-Based Deep Learning Model for Scientific Literature Classification
Abstract
1. Introduction
1.1. Research Questions Addressed in This Paper
- How do different text classification models compare in terms of performance for scientific literature classification (SLC)?
  - This paper evaluates the performance of various text classification models, including CNN, RNN, and transformer-based models, across multiple datasets, such as WOS, ArXiv, Nature, Springer, and Wiley.
  - Detailed performance metrics are provided, which highlight the strengths and weaknesses of each model.
  - The results indicate that sBERT (small BERT) outperforms the other models in accuracy and robustness across diverse datasets.
- What are the limitations of existing CNN and RNN models in handling text classification tasks?
  - The paper reviews the limitations of existing CNN and RNN models, setting the stage for introducing transformer-based models that address these limitations through mechanisms such as self-attention.
- How can parameter efficiency be achieved while maintaining high performance?
  - The proposed sBERT model utilizes a multi-headed attention mechanism and hybrid embeddings to capture global context efficiently.
  - The paper details the architecture of sBERT, emphasizing its parameter efficiency compared to other transformer-based models.
  - sBERT requires only 15.7 MB of memory and achieves rapid inference times, demonstrating a significant reduction in computational resources without compromising performance.
1.2. Paper Outline
2. Related Work
2.1. Convolutional Neural Networks for Sentence Classification
2.2. Advances in CNN Architectures
2.3. Word Embeddings and Contextual Information
2.4. Hierarchical and Attention-Based Models
2.5. Hybrid Models and Domain-Specific Applications
2.6. Transformer-Based Models
3. Proposed Model
3.1. Overview
3.2. Detailed Description
3.2.1. Embedding Block
3.2.2. Query, Key, and Value Projections
3.2.3. Self-Attention
Overview
Self-Attention Mechanism
3.2.4. Residual Connections
3.2.5. Layer Normalization
3.2.6. FCNN (Fully Connected Neural Network)
3.2.7. Concatenating Outputs from Multiple Attention Heads
3.2.8. Residual Connections
3.2.9. Combining the Representations Obtained from All Attention Heads
3.2.10. Classification
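Sections 3.2.1 to 3.2.10 describe the flow of sBERT: an embedding block, per-head query/key/value projections, self-attention, residual connections, layer normalization, an FCNN, and the concatenation and combination of the outputs of multiple attention heads before classification. The following is a minimal PyTorch-style sketch of that flow using the single-layer, 12-head, hidden-size-100 configuration reported in Section 5.1; the vocabulary size, maximum sequence length, per-head dimension, FCNN width, positional-embedding scheme, pooling strategy, and all names are illustrative assumptions rather than the authors' exact implementation.

```python
# Illustrative sketch only (not the authors' code): one multi-head
# self-attention block with residual connections, layer normalization,
# an FCNN, and a classification head. Hidden size 100 and 12 heads follow
# the table in Section 5.1; vocab_size, max_len, head_dim, ffn_dim, and
# num_classes are assumed values.
import math
import torch
import torch.nn as nn


class AttentionHead(nn.Module):
    """Query/key/value projections followed by scaled dot-product self-attention."""

    def __init__(self, hidden, head_dim):
        super().__init__()
        self.q = nn.Linear(hidden, head_dim)
        self.k = nn.Linear(hidden, head_dim)
        self.v = nn.Linear(hidden, head_dim)

    def forward(self, x):
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        return torch.softmax(scores, dim=-1) @ v


class SBERTSketch(nn.Module):
    def __init__(self, vocab_size=100_000, max_len=256, hidden=100,
                 heads=12, head_dim=50, ffn_dim=200, num_classes=7):
        super().__init__()
        # Embedding block: token plus learned positional embeddings (assumed)
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_len, hidden)
        # Multiple attention heads whose outputs are concatenated and combined
        self.heads = nn.ModuleList([AttentionHead(hidden, head_dim) for _ in range(heads)])
        self.combine = nn.Linear(heads * head_dim, hidden)
        self.norm1 = nn.LayerNorm(hidden)
        # FCNN block
        self.ffn = nn.Sequential(nn.Linear(hidden, ffn_dim), nn.ReLU(),
                                 nn.Linear(ffn_dim, hidden))
        self.norm2 = nn.LayerNorm(hidden)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, token_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(positions)
        attn = self.combine(torch.cat([h(x) for h in self.heads], dim=-1))
        x = self.norm1(x + attn)                # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))         # residual connection + layer norm
        return self.classifier(x.mean(dim=1))   # pool over tokens, then classify


logits = SBERTSketch()(torch.randint(0, 100_000, (2, 128)))  # shape (2, 7)
```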
4. Datasets and Experiments
4.1. Datasets
4.1.1. WOS
4.1.2. ArXiv
4.1.3. Nature
4.1.4. Springer
4.1.5. Wiley
4.1.6. COR233962
4.2. Experimental Setup
4.2.1. Data Acquisition
4.2.2. Data Cleaning and Preprocessing
4.2.3. Data Splitting for Training, Validation, and Testing
4.2.4. Hyperparameter Selection
4.2.5. Training Details
4.2.6. Performance Metric
5. Results and Discussion
5.1. Parameter Space Comparison
5.2. Carbon Emissions Comparison
Carbon Emissions Calculation
- Power Consumption (Watts): The total power used by the hardware resources (CPU, GPU, and RAM) during model training, calculated using Equation (29).
- Energy Consumption (kWh): The total amount of power consumed over a period of time, given by Equation (30).
- Carbon Emission (kg CO2): Obtained by multiplying the energy consumption by the carbon intensity of the electricity grid, as given by Equation (31); a reconstruction of these equations is sketched after this list.
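Equations (29)-(31) are not reproduced in this outline; the block below is a reconstruction of the standard formulation the bullets describe, stated as an assumption rather than a verbatim copy of the paper's equations (t is the training duration in seconds and CI the carbon intensity of the electricity grid).

```latex
% Assumed reconstruction of the quantities referenced as Equations (29)-(31).
\begin{align*}
P_{\mathrm{total}}\,[\mathrm{W}]  &= P_{\mathrm{CPU}} + P_{\mathrm{GPU}} + P_{\mathrm{RAM}} \\
E\,[\mathrm{kWh}]                 &= \frac{P_{\mathrm{total}} \times t}{1000 \times 3600} \\
C\,[\mathrm{kg\,CO_2}]            &= E \times \mathrm{CI}\,[\mathrm{kg\,CO_2/kWh}]
\end{align*}
```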
5.3. Performance Comparison
5.3.1. WOS Datasets
5.3.2. Discussion
5.3.3. Other Datasets
5.3.4. Discussion
5.4. Hypothesis Testing
5.4.1. Accuracy Measurements of Compared Models
5.4.2. Paired t-Test
- t-statistic:
- p-value:
5.4.3. Interpretation of the Results
- t-statistic: The t-statistic measures the difference between the two groups relative to the variability observed within the groups. Its large negative value shows that the accuracy of sBERT is significantly different from that of RCNN.
- p-value: The p-value is much lower than the significance threshold of 0.05, suggesting that the difference in classification accuracy between sBERT and RCNN is statistically significant. This provides strong evidence for rejecting the null hypothesis that there is no difference in performance between the two models; a code sketch that reproduces the test from the per-split accuracies follows this list.
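As a reproducibility aid, the paired t-test above can be recomputed directly from the ten per-split accuracies listed in the table of Section 5.4.1; the SciPy sketch below does exactly that (variable names are illustrative, and a paired test is used because both models are evaluated on the same splits).

```python
# Paired t-test on the ten per-split accuracies reported in Section 5.4.1.
from scipy.stats import ttest_rel

rcnn = [0.8434, 0.6172, 0.8978, 0.7528, 0.6688,
        0.7298, 0.8016, 0.7402, 0.6492, 0.7720]
sbert = [0.9477, 0.9390, 0.9571, 0.9386, 0.9460,
         0.9467, 0.9383, 0.9529, 0.9456, 0.9550]

t_stat, p_value = ttest_rel(rcnn, sbert)  # differences taken as rcnn - sbert
print(f"t-statistic = {t_stat:.4f}, p-value = {p_value:.2e}")
# Because rcnn - sbert is negative on every split, the t-statistic is large
# and negative; a p-value below 0.05 rejects the null hypothesis of equal
# mean accuracy.
```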
6. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Ware, M.; Mabe, M. The STM Report: An Overview of Scientific and Scholarly Journal Publishing; Technical Report 4; International Association of Scientific, Technical and Medical Publishers: Oxford, UK, 2015. [Google Scholar]
- Jinha, A.E. Article 50 Million: An Estimate of the Number of Scholarly Articles in Existence. Learn. Publ. 2010, 23, 258–263. [Google Scholar] [CrossRef]
- National Center for Education Statistics. Doctor’s Degrees Conferred by Postsecondary Institutions, by Field of Study: Selected Years, 1970-71 through 2018-19. Digest of Education Statistics, 2019. Available online: https://nces.ed.gov/programs/digest/d21/tables/dt21_324.10.asp (accessed on 15 June 2024).
- Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the EMNLP, Doha, Qatar, 25–29 October 2014. [Google Scholar]
- Georgakopoulos, S.V.; Tasoulis, S.K.; Vrahatis, A.G.; Plagianakos, V.P. Convolutional Neural Networks for Toxic Comment Classification. In Proceedings of the 10th Hellenic Conference on Artificial Intelligence, Patras, Greece, 9–12 July 2018; pp. 1–6. [Google Scholar]
- Hughes, M.; Li, I.; Kotoulas, S.; Suzumura, T. Medical Text Classification Using Convolutional Neural Networks. Stud. Health Technol. Inform. 2017, 235, 246–250. [Google Scholar]
- Kowsari, K.; Brown, D.E.; Heidarysafa, M.; Meimandi, K.J.; Gerber, M.S.; Barnes, L.E. HDLTex: Hierarchical Deep Learning for Text Classification. In Proceedings of the 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico, 18–21 December 2018; Volume 2018, pp. 364–371. [Google Scholar]
- Aripin; Agastya, W.; Huda, S. Multichannel Convolutional Neural Network Model to Improve Compound Emotional Text Classification Performance. IAENG Int. J. Comput. Sci. 2023, 50, 866. [Google Scholar]
- Zhou, P.; Qi, Z.; Zheng, S.; Xu, J.; Bao, H.; Xu, B. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. In Proceedings of the COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 11–16 December 2016; pp. 3485–3495. [Google Scholar]
- McCann, B.; Bradbury, J.; Xiong, C.; Socher, R. Learned in Translation: Contextualized Word Vectors. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Sydney, Australia, 2019; Volume 32. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019. [Google Scholar] [CrossRef]
- Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019. [Google Scholar] [CrossRef]
- Zhang, Y.; Roller, S.; Wallace, B.C. MGNC-CNN: A Simple Approach to Exploiting Multiple Word Embeddings for Sentence Classification. arXiv 2016, arXiv:1603.00968. [Google Scholar]
- Wu, H.L.X.; Cai, Y.; Xu, J.; Li, Q. Combining Machine Learning and Lexical Features for Readability Assessment of User Generated Content. In Proceedings of the COLING, Dublin, Ireland, 23–29 August 2014. [Google Scholar]
- Zhang, X.; Zhao, J.; LeCun, Y. Character-level Convolutional Networks for Text Classification. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28, pp. 649–657. [Google Scholar]
- Conneau, A.; Schwenk, H.; Cun, Y.L.; Barrault, L. Very deep convolutional networks for text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017—Proceedings of Conference, Valencia, Spain, 3–7 April 2017; Volume 1, pp. 1107–1116. [Google Scholar] [CrossRef]
- Johnson, R.; Zhang, T. Convolutional Neural Networks for Text Categorization: Shallow Word-level vs. Deep Character-level. arXiv 2016, arXiv:1609.00718. [Google Scholar]
- Wang, H.; He, J.; Zhang, X.; Liu, S. A short text classification method based on N-gram and CNN. Chin. J. Electron. 2020, 29, 248–254. [Google Scholar] [CrossRef]
- Soni, S.; Chouhan, S.S.; Rathore, S.S. TextConvoNet: A convolutional neural network based architecture for text classification. Appl. Intell. 2023, 53, 14249–14268. [Google Scholar] [CrossRef] [PubMed]
- Mandelbaum, A.; Shalev, A. Word Embeddings and Their Use In Sentence Classification Tasks. arXiv 2016, arXiv:1610.08229. [Google Scholar]
- Senarath, Y.; Thayasivam, U. DataSEARCH at IEST 2018: Multiple Word Embedding based Models for Implicit Emotion Classification of Tweets with Deep Learning. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Brussels, Belgium, October 2018; Balahur, A., Mohammad, S.M., Hoste, V., Klinger, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 211–216. Available online: https://aclanthology.org/W18-6230 (accessed on 11 July 2024).
- Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical Attention Networks for Document Classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 1480–1489. [Google Scholar] [CrossRef]
- Zhou, X.; Wan, X.; Xiao, J. Attention-based LSTM network for cross-lingual sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 247–256. [Google Scholar] [CrossRef]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Hassan, A.; Mahmood, A. Convolutional recurrent deep learning model for sentence classification. IEEE Access 2018, 6, 13949–13957. [Google Scholar] [CrossRef]
- Gonçalves, S.; Cortez, P.; Moro, S. A Deep Learning Approach for Sentence Classification of Scientific Abstracts. In Discovery Science; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2018; Volume 11198, pp. 21–35. [Google Scholar] [CrossRef]
- Jin, D.; Szolovits, P. Hierarchical Neural Networks for Sequential Sentence Classification in Medical Scientific Abstracts. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2561–2572. [Google Scholar] [CrossRef]
- Yang, Z.; Emmert-Streib, F. Threshold-learned CNN for multi-label text classification of electronic health records. IEEE Access 2023, 11, 17574–17583. [Google Scholar] [CrossRef]
- Jiang, Z.H.; Yu, W.; Zhou, D.; Chen, Y.; Feng, J.; Yan, S. Convbert: Improving bert with span-based dynamic convolution. Adv. Neural Inf. Process. Syst. 2020, 33, 12837–12848. [Google Scholar]
- Clark, K.; Luong, M.T.; Le, Q.V.; Manning, C.D. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- He, P.; Liu, X.; Gao, J.; Chen, W. Deberta: Decoding-enhanced bert with disentangled attention. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- He, P.; Gao, J.; Chen, W. DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Lai, S.; Xu, L.; Liu, K.; Zhao, J. Recurrent convolutional neural networks for text classification. In Proceedings of the National Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; Volume 3, pp. 2267–2273. [Google Scholar]
- Sun, Z.; Yu, H.; Song, X.; Liu, R.; Yang, Y.; Zhou, D. MobileBERT: A Compact Task-Agnostic BERT for Resource-Limited Devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; pp. 2158–2170. [Google Scholar] [CrossRef]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar] [CrossRef]
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016. [Google Scholar] [CrossRef]
- arXiv.org e-Print Archive. Available online: https://arxiv.org/ (accessed on 1 June 2024).
- Nature. Available online: https://www.nature.com/nature (accessed on 1 June 2024).
- Springer—International Publisher. Publisher: Springer. Available online: https://www.springer.com/us (accessed on 1 June 2024).
- Wiley Online Library: Scientific Research Articles, Journals, Books, and Reference Works. Publisher: Wiley. Available online: https://onlinelibrary.wiley.com/ (accessed on 1 June 2024).
- Cornell arXiv Dataset. arXiv Dataset and Metadata of 1.7M+ Scholarly Papers across STEM. Available online: https://www.kaggle.com/datasets/Cornell-University/arxiv (accessed on 28 September 2023).
Model | Main Features | Limitations |
---|---|---|
Text-CNN [4] | - Applies convolutional layers to text to detect features - Applies pretrained word vectors - Used for sentiment analysis and question answering | - Increased model size and complexity - Not optimal for simpler tasks like text classification |
Char-CNN [19] | - Uses character-level representation to make the model language-independent | - Does not utilize the meaning associated with words - Training is slow because the model needs to be deep to perform well |
HDLTex-CNN [7] | - Uses multiple CNN layers with varied filter sizes and max-pooling layers on text represented as GloVe-embedded words | - It is difficult to determine the optimal filter sizes for the CNN layers - Order of n-grams is ignored |
Multi-group Norm Constraint CNN [17] | - Employs more than one embedding to improve classification performance | - Computationally expensive due to multiple embeddings |
VDCNN [20] | - Input represented as character sequences is subjected to a network of 30 layers to generate a feature vector | - High computational cost - Slow training |
Two-Dimensional Multiscale CNN [21] | - Detects features within sentences and combines features from different sentences to represent input text - Uses 2D multiscale convolutional operations over document matrices | - Increased complexity and computational requirements |
HDLTex-RNN [7] | - Uses LSTM to reduce the effect of vanishing gradients - Can capture long-range features within the text | - Slow training and inference due to the sequential nature of LSTMs |
RCNN [37] | - Uses an RNN to learn word representations, followed by a max-pooling layer to generate a feature vector to represent text documents | - Recurrent structures are inherently sequential, making them slow to train |
BLSTM-2DCNN [9] | - Combines bidirectional LSTM with two-dimensional max-pooling | - Complex architecture leading to higher computational resources and longer training times |
DistilBERT [16] | - Distilled version of BERT - 40% smaller while retaining 97% of BERT’s capabilities - Efficient for resource-constrained environments | - Slightly reduced performance compared to full BERT |
MobileBERT [38] | - Smaller in size, task-independent BERT model - Optimized for resource-limited environments - Incorporates bottleneck structures and parameter reduction | - Performance trade-off for reduced size and efficiency |
ALBERT [15] | - Lite BERT model - Reduces model complexity by parameter sharing - Employs factorized embedding parameterization | - May have lower performance on certain complex tasks |
RoBERTa [14] | - Optimized version of BERT - Trained with more data, longer sequences, and dynamic masking | - Requires significant computational resources for training |
ConvBERT [33] | - Integrates span-based dynamic convolution - Enhances local dependency capture in text | - Increased model complexity |
ELECTRA [34] | - Pre-trains text encoders as discriminators - Focuses on distinguishing real input tokens from corrupted ones | - Training as discriminators might introduce complexity |
XLNet [13] | - Generalized auto-regressive pretraining model - Captures bidirectional context - Addresses limitations of BERT’s masked language modeling | - Computationally intensive pretraining |
GPT-2 [39] | - Unsupervised multitask architecture - Can perform a variety of NLP tasks without task-specific fine-tuning | - Prone to generating unsafe or biased content |
T5 [40] | - Integrated text-to-text transformer model - Frames all tasks as text-to-text problems | - Requires extensive pretraining and task-specific fine-tuning |
DeBERTa [35] | - Enhances BERT with disentangled attention - Improved decoding mechanisms | - Complex architecture |
DeBERTaV3 [36] | - Integrates ELECTRA-style pretraining - Uses gradient-disentangled embedding sharing | - Computationally intensive |
Model | Parameter Size | Layers | Attention Heads | Hidden Size |
---|---|---|---|---|
DistilBERT [16] | 66 M | 6 | 12 | 768 |
MobileBERT [38] | 25 M | 24 | 4 | 512 |
ALBERT [15] | 12 M | 12 | 64 | 4096 |
BERT [11] | 110 M | 12 | 12 | 768 |
RoBERTa [14] | 125 M | 12 | 12 | 768 |
ConvBERT [33] | 110 M | 12 | 12 | 768 |
ELECTRA [34] | 110 M | 12 | 12 | 768 |
XLNet [13] | 110 M | 12 | 12 | 768 |
GPT-2 [39] | 117 M | 12 | 12 | 768 |
T5 [40] | 220 M | 12 | 12 | 768 |
DeBERTa [35] | 110 M | 12 | 12 | 768 |
DeBERTa-v3 [36] | 304 M | 24 | 16 | 1024 |
sBERT (proposed model) | 11.9 M | 1 | 12 | 100 |
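A rough sanity check of sBERT's entry in the table: with a hidden size of 100, the token-embedding matrix (vocabulary size × 100) dominates the parameter count, and the vocabularies reported in the dataset statistics table range from roughly 12 k to 173 k entries, so a 100 k-entry embedding table alone contributes about 10 M of the roughly 11.9 M parameters. The sketch below is a back-of-the-envelope estimate under assumed per-head, FCNN, positional, and vocabulary sizes, not a count of the authors' exact implementation.

```python
# Back-of-the-envelope parameter estimate for a 1-layer, 12-head,
# hidden-size-100 model; head_dim, ffn_dim, max_len, and vocab_size are
# assumptions. The token embedding matrix dominates the total.
def estimate_params(vocab_size=100_000, max_len=256, hidden=100,
                    heads=12, head_dim=50, ffn_dim=200, num_classes=7):
    emb = vocab_size * hidden + max_len * hidden        # token + positional embeddings
    qkv = heads * 3 * (hidden * head_dim + head_dim)    # per-head Q/K/V (weights + biases)
    combine = heads * head_dim * hidden + hidden        # recombination of head outputs
    ffn = hidden * ffn_dim + ffn_dim + ffn_dim * hidden + hidden
    norms = 2 * 2 * hidden                              # two LayerNorms (scale + bias)
    clf = hidden * num_classes + num_classes
    return emb + qkv + combine + ffn + norms + clf


print(f"{estimate_params() / 1e6:.1f} M parameters")  # ~10.3 M under these assumptions
```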
Dataset | Domains | Samples | Train | Test | WPS (Avg. Words/Sample) | CPS (Avg. Chars/Sample) | Voc (Vocabulary Size) |
---|---|---|---|---|---|---|---|
WOS-5736 | 3 | 5736 | 4588 | 1148 | 209.03 | 1386.13 | 42,306 |
WOS-11967 | 7 | 11,967 | 8017 | 3950 | 201.43 | 1340.19 | 57,875 |
WOS-46985 | 7 | 46,985 | 31,479 | 15,506 | 205.27 | 1375.76 | 125,968 |
ArXiv | 7 | 40,060 | 32,048 | 8012 | 148.21 | 978.67 | 112,452 |
Nature | 8 | 49,782 | 24,891 | 24,891 | 175.20 | 1206.56 | 84,228 |
Springer | 24 | 116,230 | 92,984 | 23,246 | 167.92 | 1128.41 | 22,254 |
Wiley | 74 | 179,953 | 143,962 | 35,991 | 170.26 | 1150.62 | 113,534 |
COR233962 | 6 | 233,962 | 187,169 | 46,793 | 148.26 | 971.3 | 172,954 |
WIL-30628 | 6 | 30,628 | 24,502 | 6126 | 183.52 | 1230.28 | 43,714 |
SPR-50317 | 10 | 50,317 | 40,253 | 10,064 | 165.89 | 1103.97 | 12,531 |
Property | Value |
---|---|
Operating System | Linux-6.1.85+-x86_64-with-glibc2.35 |
Python Version | 3.10.12 |
CPU Count | 2 |
CPU Model | Intel® Xeon® CPU @ 2.30 GHz |
GPU Count | 1 |
GPU Model | 1 x Tesla T4 |
Longitude | −79.9746 |
Latitude | 32.8608 |
RAM | 12.7 GB |
WOS-5736 | WOS-11967 | WOS-46985 | COR-233962 | WIL-30628 | SPR-50317 | |||||
---|---|---|---|---|---|---|---|---|---|---|
Time (s) | 157.2688 | 286.5306 | 713.4463 | 361.6975 | 721.5509 | 500.1951 | 4233.135 | 2299.827 | 460.3575 | 287.246 |
Model | Duration (s) | Energy Consumption (kWh) | Carbon Emission (kg CO2) |
---|---|---|---|
ALBERT [15] | 844.3437219 | 0.018017887 | 0.008806693 |
BERT [11] | 828.3376598 | 0.019312801 | 0.009439614 |
ELECTRA [34] | 5381.362884 | 0.15103994 | 0.043122468 |
DistilBERT [16] | 8414.990516 | 0.247839145 | 0.034387638 |
RoBERTa [14] | 1647.26625 | 0.038005353 | 0.005273236 |
sBERT | 84.86049414 | 0.001797255 | 0.000878453 |
Model | WOS-5736 | WOS-11967 | WOS-46985 |
---|---|---|---|
TextCNN [4] | 49.46 | 16.55 | 31.13 |
MGN-CNN [17] | 98.41 | 92.94 | 89.33 |
CharCNN [19] | 88.48 | 25.45 | 76.59 |
RCNN [37] | 97.07 | 92.94 | 88.2 |
VDCNN [20] | 82.9 | 67.64 | 75.76 |
HDLTEX-CNN [7] | 98.47 | 93.52 | 88.67 |
HDLTEX-RNN [7] | 97.82 | 93.98 | 90.45 |
DistilBERT [16] | 96.08 | 92.4 | - |
ALBERT [15] | 95.38 | 91.65 | - |
RoBERTa [14] | 95.56 | 93.82 | 90.32 |
ELECTRA [34] | 93.55 | 73.1 | 81.05 |
ConvBERT [33] | 92.94 | 65.75 | - |
sBERT | 99.21 | 95.85 | 92.36 |
Metric | CS | ECE | Psych | MAE | Civil | Med | Biochem |
---|---|---|---|---|---|---|---|
Precision | 0.903837 | 0.916994 | 0.87562 | 0.776646 | 0.86635 | 0.890251 | 0.890887 |
Recall | 0.995567 | 0.993842 | 0.996837 | 0.997282 | 0.995622 | 0.998372 | 0.995129 |
F1 score | 0.947487 | 0.953873 | 0.932305 | 0.873243 | 0.926498 | 0.941217 | 0.940127 |
Metric | CS | ECE | Psych | MAE | Civil | Med | Biochem |
---|---|---|---|---|---|---|---|
Precision | 0.919893 | 0.932343 | 0.89906 | 0.9287 | 0.96131 | 0.961392 | 0.923961 |
Recall | 0.994915 | 0.992477 | 0.996373 | 0.988381 | 0.981849 | 0.994745 | 0.993397 |
F1 score | 0.955935 | 0.961471 | 0.945219 | 0.957612 | 0.971471 | 0.977784 | 0.957421 |
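The per-class precision, recall, and F1 values in the two tables above can be obtained from model predictions with scikit-learn; the minimal sketch below shows the call on hypothetical label arrays (the seven class names follow the WOS domain labels used in the tables).

```python
# Per-class precision/recall/F1 with scikit-learn; y_true and y_pred are
# hypothetical placeholders, not the paper's predictions.
from sklearn.metrics import precision_recall_fscore_support

classes = ["CS", "ECE", "Psych", "MAE", "Civil", "Med", "Biochem"]
y_true = [0, 1, 2, 3, 4, 5, 6, 0, 1, 2]   # ground-truth class indices
y_pred = [0, 1, 2, 3, 4, 5, 6, 0, 2, 2]   # predicted class indices

prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=list(range(len(classes))), zero_division=0)
for name, p, r, f in zip(classes, prec, rec, f1):
    print(f"{name:8s} precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```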
Model | Nature | Springer | ArXiv | Wiley | COR233962 | SPR-50317 | WIL-30628 |
---|---|---|---|---|---|---|---|
TextCNN [4] | 33.79 | 27.7 | 26.26 | 8.15 | 88.74 | 90.06 | 75.69 |
MGN-CNN [17] | 57.23 | 100 | 85.35 | 52.04 | 95.93 | 98.5 | 93.2 |
CharCNN [19] | 51.23 | 61.6 | 80.53 | 7.82 | 53.22 | 94.09 | 91.62 |
RCNN [37] | 52.21 | 85 | 83.32 | 62.1 | 93.81 | 99.0 | 94.63 |
VDCNN [20] | 53.44 | 63 | 66.11 | 49.5 | 54.91 | 30.3 | 20.66 |
HDLTEX-CNN [7] | 60.12 | 99.98 | 85.84 | 50.81 | 91.6 | 100 | 94.79 |
HDLTEX-RNN [7] | 61.83 | 100 | 84.26 | 43.39 | 92.15 | 80.85 | 85.11 |
sBERT | 63.2 | 100 | 85.94 | 70.54 | 93.17 | 100 | 94.6 |
Model | Split 1 | Split 2 | Split 3 | Split 4 | Split 5 | Split 6 | Split 7 | Split 8 | Split 9 | Split 10 |
---|---|---|---|---|---|---|---|---|---|---|
RCNN | 0.8434 | 0.6172 | 0.8978 | 0.7528 | 0.6688 | 0.7298 | 0.8016 | 0.7402 | 0.6492 | 0.7720 |
sBERT | 0.9477 | 0.9390 | 0.9571 | 0.9386 | 0.9460 | 0.9467 | 0.9383 | 0.9529 | 0.9456 | 0.9550 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ahanger, M.M.; Wani, M.A.; Palade, V. sBERT: Parameter-Efficient Transformer-Based Deep Learning Model for Scientific Literature Classification. Knowledge 2024, 4, 397–421. https://doi.org/10.3390/knowledge4030022