Identification and Visualization of Key Topics in Scientific Publications with Transformer-Based Language Models and Document Clustering Methods
Abstract
1. Introduction
2. Background and Relevant Literature
2.1. Document Clustering Algorithms
2.2. Pretrained Language Models and Applications
2.3. Keyword Extraction
- Rule-based linguistic approaches that use linguistic knowledge and rules;
- Statistical approaches (such as TF-IDF; see the sketch after this list) that use term frequency and co-occurrence statistics;
- Domain knowledge approaches that utilize an ontology in a domain to identify keywords;
- Machine learning approaches that use algorithms to automatically detect keywords.
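For the statistical family, a minimal TF-IDF scoring sketch is given below. It uses scikit-learn purely for illustration; the toy corpus, the library choice, and the top-five cutoff are assumptions rather than details taken from the works surveyed here.

```python
# Minimal TF-IDF keyword scoring sketch (illustrative only; scikit-learn is
# an assumed stand-in for whatever statistical tooling a given system uses).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Urban heat islands raise land surface temperature in dense cities.",
    "Machine learning approaches support urban planning and land use management.",
]

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(docs)           # shape: (n_docs, n_terms)
terms = vectorizer.get_feature_names_out()

# Rank the terms of the first document by TF-IDF weight and keep the top 5.
row = tfidf[0].toarray().ravel()
top5 = sorted(zip(terms, row), key=lambda t: t[1], reverse=True)[:5]
print(top5)
```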
3. Methodology
- The iterative clustering phase represents abstracts as GPT-3 similarity embeddings and uses the HDBSCAN algorithm along with silhouette scores [49] to divide the abstracts into clusters.
- The keyword extraction phase identifies candidate words from each abstract and uses the MMR ranking algorithm to select five keywords from those candidates to represent the abstract within its cluster.
- The keyword grouping phase represents the extracted keywords as GPT-3 similarity embeddings and applies the HDBSCAN algorithm and silhouette scores again to form keyword groups that represent the topics of an abstract cluster.
3.1. Iterative Clustering
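A minimal sketch of a single clustering pass is shown below, assuming the abstracts have already been converted to GPT-3 similarity embeddings (random vectors stand in for them here). It combines the UMAP and HDBSCAN libraries listed in Appendix A with scikit-learn's silhouette_samples as an assumed silhouette implementation; all parameter values are illustrative, not the settings used in this study.

```python
# One clustering pass: UMAP dimensionality reduction, HDBSCAN clustering,
# then a per-cluster average silhouette score. Random vectors stand in for
# the GPT-3 similarity embeddings, purely to keep the sketch self-contained.
import numpy as np
import hdbscan
import umap
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(593, 4096))   # 593 abstracts; 4096 dims chosen as a plausible size

reduced = umap.UMAP(n_components=5, n_neighbors=15,
                    metric="cosine", random_state=0).fit_transform(embeddings)

labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(reduced)

clustered = labels != -1                    # HDBSCAN marks noise points as -1
if len(np.unique(labels[clustered])) > 1:   # silhouettes need at least two clusters
    sil = silhouette_samples(reduced[clustered], labels[clustered])
    for c in np.unique(labels[clustered]):
        members = labels[clustered] == c
        print(f"cluster {c}: size={members.sum()}, silhouette={sil[members].mean():.2f}")
```

Clusters with low silhouette scores would then be fed back into the next clustering iteration, as described in the Phase 1 bullet above.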
3.2. Keyword Extraction
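The MMR (Maximal Marginal Relevance) ranking named in the Phase 2 description can be sketched generically as a greedy trade-off between relevance to the document and redundancy with already-selected keywords. The toy vectors, candidate list, and 0.5 trade-off weight below are illustrative assumptions, not values taken from this study.

```python
# Generic MMR (Maximal Marginal Relevance) selection sketch: greedily pick
# candidates that are similar to the document but dissimilar to keywords
# already selected. Toy vectors and the 0.5 trade-off weight are illustrative.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr(doc_vec, cand_vecs, cand_words, top_n=5, lam=0.5):
    selected, remaining = [], list(range(len(cand_words)))
    while remaining and len(selected) < top_n:
        best, best_score = None, -np.inf
        for i in remaining:
            relevance = cosine(cand_vecs[i], doc_vec)
            redundancy = max((cosine(cand_vecs[i], cand_vecs[j]) for j in selected),
                             default=0.0)
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return [cand_words[i] for i in selected]

rng = np.random.default_rng(0)
doc = rng.normal(size=16)
cands = rng.normal(size=(6, 16))
words = ["urban heat island", "land surface temperature", "urban vegetation",
         "surface temperature", "city scale", "urban planning"]
print(mmr(doc, cands, words))
```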
3.3. Keyword Grouping
4. Experiment Results and Visualization
4.1. Abstract Clustering Results and Visualization
4.2. Keyword Grouping Results and Visualization
5. Discussion
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Software Implementation
Software Libraries | Purposes | Phases |
---|---|---|
OpenAI API service 1 and pretrained GPT-3 model 2 (text-similarity-curie-001) | Abstract text and keyword vectorization | Phase 1 and Phase 2 |
HDBSCAN clustering library 3 (v0.8.27) and UMAP dimension reduction library 4 (v0.5.1) | Abstract clustering and keyword grouping | Phase 1 and Phase 3 |
Stanford CoreNLP 5 (v4.4.0) | POS tagging | Phase 2 |
Plotly.js 6 (v2.11.1) | Scatter dot chart | Visualization |
CSS Bootstrap 7 and Pagination.js 8 | List view | Visualization |
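For the vectorization step in the first row of the table, a request to the OpenAI embeddings endpoint might look as follows when using the legacy (v0.x) openai Python client. Only the model name comes from the table above; the client version, call pattern, and API-key handling are assumptions.

```python
# Sketch of requesting similarity embeddings for a batch of abstracts with the
# legacy (v0.x) openai client. Client version and key handling are assumptions;
# only the model name is taken from the table above.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

abstracts = [
    "Urban heat islands raise land surface temperature at the city scale.",
    "Transformer-based language models support document clustering.",
]

response = openai.Embedding.create(
    model="text-similarity-curie-001",
    input=abstracts,
)
embeddings = [item["embedding"] for item in response["data"]]
print(len(embeddings), len(embeddings[0]))   # e.g., 2 vectors of a few thousand floats
```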
References
- Nugroho, R.; Paris, C.; Nepal, S.; Yang, J.; Zhao, W. A survey of recent methods on deriving topics from Twitter: Algorithm to evaluation. Knowl. Inf. Syst. 2020, 62, 2485–2519. [Google Scholar] [CrossRef]
- Blei, D.M. Probabilistic topic models. Commun. ACM 2012, 55, 77–84. [Google Scholar] [CrossRef] [Green Version]
- Cobo, M.J.; López-Herrera, A.G.; Herrera-Viedma, E.; Herrera, F. SciMAT: A new science mapping analysis software tool. J. Am. Soc. Inf. Sci. Technol. 2012, 63, 1609–1630. [Google Scholar] [CrossRef]
- Aria, M.; Cuccurullo, C. bibliometrix: An R-tool for comprehensive science mapping analysis. J. Informetr. 2017, 11, 959–975. [Google Scholar] [CrossRef]
- Witten, I.H.; Paynter, G.W.; Frank, E.; Gutwin, C.; Nevill-Manning, C.G. Kea: Practical automated keyphrase extraction. In Design and Usability of Digital Libraries: Case Studies in the Asia Pacific; IGI Global: Hershey, PA, USA, 2005; pp. 129–152. [Google Scholar]
- Campos, R.; Mangaravite, V.; Pasquali, A.; Jorge, A.; Nunes, C.; Jatowt, A. YAKE! Keyword extraction from single documents using multiple local features. Inf. Sci. 2020, 509, 257–289. [Google Scholar] [CrossRef]
- Mihalcea, R.; Tarau, P. Textrank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 25–26 July 2004; pp. 404–411. [Google Scholar]
- Rose, S.; Engel, D.; Cramer, N.; Cowley, W. Automatic keyword extraction from individual documents. Text Min. Appl. Theory 2010, 1, 1–20. [Google Scholar]
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
- Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on International Conference on Machine Learning, Beijing, China, 21–26 June 2014; Volume 32, pp. II-1188–II–1196. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar] [CrossRef]
- Neelakantan, A.; Xu, T.; Puri, R.; Radford, A.; Han, J.M.; Tworek, J.; Yuan, Q.; Tezak, N.; Kim, J.W.; Hallacy, C.; et al. Text and Code Embeddings by Contrastive Pre-Training. arXiv 2022, arXiv:2201.10005. [Google Scholar]
- Radu, R.-G.; Rădulescu, I.-M.; Truică, C.-O.; Apostol, E.-S.; Mocanu, M. Clustering Documents using the Document to Vector Model for Dimensionality Reduction. In Proceedings of the 2020 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR), Cluj-Napoca, Romania, 21–23 May 2020; pp. 1–6. [Google Scholar] [CrossRef]
- Vahidnia, S.; Abbasi, A.; Abbass, H.A. Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering. J. Data Inf. Sci. 2021, 6, 99–122. [Google Scholar] [CrossRef]
- Xie, J.; Girshick, R.; Farhadi, A. Unsupervised deep embedding for clustering analysis. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, New York, NY, USA, 24–26 June 2016; Volume 48, pp. 478–487. [Google Scholar]
- Glänzel, W.; Thijs, B. Using ‘core documents’ for detecting and labelling new emerging topics. Scientometrics 2012, 91, 399–416. [Google Scholar] [CrossRef]
- Pham, D.T.; Dimov, S.S.; Nguyen, C.D. Selection of K in K-means clustering. Proc. Inst. Mech. Eng. Part C J. Mech. Eng. Sci. 2005, 219, 103–119. [Google Scholar] [CrossRef]
- McInnes, L.; Healy, J.; Astels, S. hdbscan: Hierarchical density based clustering. J. Open Source Softw. 2017, 2, 205. [Google Scholar] [CrossRef]
- Campello, R.J.; Moulavi, D.; Zimek, A.; Sander, J. Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans. Knowl. Discov. Data 2015, 10, 1–51. [Google Scholar] [CrossRef]
- McInnes, L.; Healy, J. Accelerated hierarchical density based clustering. In Proceedings of the 2017 IEEE International Conference on Data Mining Workshops (ICDMW), New Orleans, LA, USA, 18–21 November 2017; pp. 33–42. [Google Scholar]
- Liu, Z.; Lin, Y.; Sun, M. Word Representation. In Representation Learning for Natural Language Processing; Liu, Z., Lin, Y., Sun, M., Eds.; Springer: Singapore, 2020; pp. 13–41. [Google Scholar] [CrossRef]
- Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, 2–4 August 1996; pp. 226–231. [Google Scholar]
- Zhu, Y.; Kiros, R.; Zemel, R.; Salakhutdinov, R.; Urtasun, R.; Torralba, A.; Fidler, S. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 19–27. [Google Scholar]
- Elsafoury, F.; Katsigiannis, S.; Pervez, Z.; Ramzan, N. When the Timeline Meets the Pipeline: A Survey on Automated Cyberbullying Detection. IEEE Access 2021, 9, 103541–103563. [Google Scholar] [CrossRef]
- Desai, A.; Nagwanshi, P. Grouping News Events Using Semantic Representations of Hierarchical Elements of Articles and Named Entities. In Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence, New York, NY, USA, 24–26 December 2020. [Google Scholar] [CrossRef]
- Ambalavanan, A.K.; Devarakonda, M.V. Using the contextual language model BERT for multi-criteria classification of scientific articles. J. Biomed. Inform. 2020, 112, 103578. [Google Scholar] [CrossRef]
- Fan, C.; Wu, F.; Mostafavi, A. A Hybrid Machine Learning Pipeline for Automated Mapping of Events and Locations From Social Media in Disasters. IEEE Access 2020, 8, 10478–10490. [Google Scholar] [CrossRef]
- Yang, L.; Zhang, M.; Li, C.; Bendersky, M.; Najork, M. Beyond 512 Tokens: Siamese Multi-Depth Transformer-Based Hierarchical Encoder for Long-Form Document Matching. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, New York, NY, USA, 19–23 October 2020; pp. 1725–1734. [Google Scholar] [CrossRef]
- Cheng, Q.; Zhu, Y.; Song, J.; Zeng, H.; Wang, S.; Sun, K.; Zhang, J. Bert-Based Latent Semantic Analysis (Bert-LSA): A Case Study on Geospatial Data Technology and Application Trend Analysis. Appl. Sci. 2021, 11, 11897. [Google Scholar] [CrossRef]
- Altuncu, M.T.; Yaliraki, S.N.; Barahona, M. Graph-Based Topic Extraction from Vector Embeddings of Text Documents: Application to a Corpus of News Articles. In Complex Networks & Their Applications IX; Springer: Cham, Switzerland, 2021; pp. 154–166. [Google Scholar]
- Nasim, Z.; Haider, S. Evaluation of clustering techniques on Urdu News headlines: A case of short length text. J. Exp. Theor. Artif. Intell. 2022, 1–22. [Google Scholar] [CrossRef]
- Frey, B.J.; Dueck, D. Clustering by Passing Messages Between Data Points. Science 2007, 315, 972–976. [Google Scholar] [CrossRef] [Green Version]
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. 2019. Available online: https://arxiv.org/abs/1908.10084 (accessed on 1 January 2022).
- Li, B.; Zhou, H.; He, J.; Wang, M.; Yang, Y.; Li, L. On the Sentence Embeddings from Pre-trained Language Models. arXiv 2020, arXiv:2011.05864. [Google Scholar]
- Ito, H.; Chakraborty, B. Social Media Mining with Dynamic Clustering: A Case Study by COVID-19 Tweets. In Proceedings of the 2020 11th International Conference on Awareness Science and Technology (iCAST), Qingdao, China, 7–9 December 2020; pp. 1–6. [Google Scholar] [CrossRef]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Bavarian, M.; Jun, H.; Tezak, N.; Schulman, J.; McLeavey, C.; Tworek, J.; Chen, M. Efficient Training of Language Models to Fill in the Middle. arXiv 2022, arXiv:2207.14255. [Google Scholar]
- Siddiqi, S.; Sharan, A. Keyword and keyphrase extraction techniques: A literature review. Int. J. Comput. Appl. 2015, 109, 18–23. [Google Scholar] [CrossRef]
- Nasar, Z.; Jaffry, S.W.; Malik, M.K. Textual keyword extraction and summarization: State-of-the-art. Inf. Process. Manag. 2019, 56, 102088. [Google Scholar] [CrossRef]
- Roberge, G.; Kashnitsky, Y.; James, C. Elsevier 2022 Sustainable Development Goals (SDG) Mapping; Elsevier: Amsterdam, The Netherlands, 2022; Volume 1. [Google Scholar] [CrossRef]
- Brin, S.; Page, L. The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. ISDN Syst. 1998, 30, 107–117. [Google Scholar] [CrossRef]
- Bennani-Smires, K.; Musat, C.; Hossmann, A.; Baeriswyl, M.; Jaggi, M. Simple Unsupervised Keyphrase Extraction using Sentence Embeddings. In Proceedings of the 22nd Conference on Computational Natural Language Learning, Brussels, Belgium, 31 October–1 November 2018; pp. 221–229. [Google Scholar] [CrossRef] [Green Version]
- Carbonell, J.; Goldstein, J. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 24–28 August 1998; pp. 335–336. [Google Scholar] [CrossRef] [Green Version]
- Kaminskas, M.; Bridge, D. Diversity, Serendipity, Novelty, and Coverage: A Survey and Empirical Analysis of Beyond-Accuracy Objectives in Recommender Systems. ACM Trans. Interact. Intell. Syst. 2016, 7, 1–42. [Google Scholar] [CrossRef]
- Dyer, M.; Dyer, R.; Weng, M.-H.; Wu, S.; Grey, T.; Gleeson, R.; Ferrari, T.G. Framework for soft and hard city infrastructures. Proc. Inst. Civ. Eng.-Urban Des. Plan. 2019, 172, 219–227. [Google Scholar] [CrossRef]
- Dyer, M.; Weng, M.-H.; Wu, S.; Ferrari, T.G.; Dyer, R. Urban narrative: Computational linguistic interpretation of large format public participation for urban infrastructure. Urban Plan. 2020, 5, 20–32. [Google Scholar] [CrossRef]
- Dyer, M.; Wu, S.; Weng, M.-H. Convergence of Public Participation, Participatory Design and NLP to Co-Develop Circular Economy. Circ. Econ. Sustain. 2021, 1, 917–934. [Google Scholar] [CrossRef]
- Weng, M.-H.; Wu, S.; Dyer, M. AI Augmented Approach to Identify Shared Ideas from Large Format Public Consultation. Sustainability 2021, 13, 9310. [Google Scholar] [CrossRef]
- Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef] [Green Version]
- McInnes, L.; Healy, J.; Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv 2018, arXiv:1802.03426. [Google Scholar]
- Malzer, C.; Baum, M. A hybrid approach to hierarchical density-based cluster selection. In Proceedings of the 2020 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), Karlsruhe, Germany, 14–16 September 2020; pp. 223–228. [Google Scholar]
- Arbelaitz, O.; Gurrutxaga, I.; Muguerza, J.; Pérez, J.M.; Perona, I. An extensive comparative study of cluster validity indices. Pattern Recognit. 2013, 46, 243–256. [Google Scholar] [CrossRef]
- Rendón, E.; Abundez, I.; Arizmendi, A.; Quiroz, E.M. Internal versus External cluster validation indexes. Int. J. Comput. Commun. 2011, 5, 27–34. [Google Scholar]
- Manning, C.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S.; McClosky, D. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, Maryland, 23–24 June 2014; pp. 55–60. [Google Scholar] [CrossRef] [Green Version]
- Duncan, J.M.A.; Boruff, B.; Saunders, A.; Sun, Q.; Hurley, J.; Amati, M. Turning down the heat: An enhanced understanding of the relationship between urban vegetation and surface temperature at the city scale. Sci. Total Environ. 2019, 656, 118–128. [Google Scholar] [CrossRef]
- Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022, arXiv:2203.05794. [Google Scholar]
- Grootendorst, M. KeyBERT: Minimal keyword extraction with BERT. Zenodo 2020. [Google Scholar] [CrossRef]
- Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef] [Green Version]
- Xu, R.; Wunsch, D., II. Survey of clustering algorithms. IEEE Trans. Neural Netw. 2005, 16, 645–678. [Google Scholar] [CrossRef]
- Menardi, G. Density-based Silhouette diagnostics for clustering methods. Stat Comput 2011, 21, 295–308. [Google Scholar] [CrossRef]
- Adomavicius, G.; Kwon, Y. Maximizing aggregate recommendation diversity: A graph-theoretic approach. In Proceedings of the 1st International Workshop on Novelty and Diversity in Recommender Systems (DiveRS 2011), Chicago, IL, USA, 23 October 2011; pp. 3–10. [Google Scholar]
- Zhang, M.; Hurley, N. Avoiding monotony: Improving the diversity of recommendation lists. In Proceedings of the 2008 ACM Conference on Recommender Systems, New York, NY, USA, 23 October 2008; pp. 123–130. [Google Scholar] [CrossRef]
- Zhou, T.; Kuscsik, Z.; Liu, J.-G.; Medo, M.; Wakeling, J.R.; Zhang, Y.-C. Solving the apparent diversity-accuracy dilemma of recommender systems. Proc. Natl. Acad. Sci. USA 2010, 107, 4511–4515. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Kapoor, K.; Kumar, V.; Terveen, L.; Konstan, J.A.; Schrater, P. ‘I like to Explore Sometimes’: Adapting to Dynamic User Novelty Preferences. In Proceedings of the 9th ACM Conference on Recommender Systems, New York, NY, USA, 16–20 September 2015; pp. 19–26. [Google Scholar] [CrossRef]
- Wong, K.-F.; Wu, M.; Li, W. Extractive summarization using supervised and semi-supervised learning. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), Manchester, UK, 18–22 August 2008; pp. 985–992. [Google Scholar]
- Firoozeh, N.; Nazarenko, A.; Alizon, F.; Daille, B. Keyword extraction: Issues and methods. Nat. Lang. Eng. 2020, 26, 259–291. [Google Scholar] [CrossRef]
- Newman, D.; Noh, Y.; Talley, E.; Karimi, S.; Baldwin, T. Evaluating topic models for digital libraries. In Proceedings of the 10th Annual Joint Conference on Digital Libraries, Gold Coast, QLD, Australia, 21–25 June 2010; pp. 215–224. [Google Scholar] [CrossRef]
- Mimno, D.; Wallach, H.; Talley, E.; Leenders, M.; McCallum, A. Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK, 28 July 2011; pp. 262–272. [Google Scholar]
- Röder, M.; Both, A.; Hinneburg, A. Exploring the Space of Topic Coherence Measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, Shanghai, China, 2–6 February 2015; pp. 399–408. [Google Scholar] [CrossRef]
- Stevens, K.; Kegelmeyer, P.; Andrzejewski, D.; Buttler, D. Exploring topic coherence over many models and many topics. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jeju Island, Korea, 12–14 July 2012; pp. 952–961. [Google Scholar]
- Aletras, N.; Stevenson, M. Evaluating topic coherence using distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013)–Long Papers, Potsdam, Germany, 20–22 March 2013; pp. 13–22. [Google Scholar]
Pattern | Description | Examples |
---|---|---|
NNs | two or more nouns | land parcel, planning practice, machine learning approach, land surface temperature
JJ + NNs | one adjective plus one or more nouns | important variable, main contribution, urban heat island, urban heat island research
JJ + [and] + JJ + NNs | one adjective, an optional conjunction "and", another adjective, plus one or more nouns | relevant spatial scale, physical and socioeconomic characteristic, major urban environmental problem
[JJ] + NN + and + NNs | an optional adjective, a noun, the conjunction "and", plus one or more nouns | air and noise pollution, urban greenery and planning practice, urban planning and land use management
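One way to match such POS patterns is to encode the tag sequence as a string and apply regular expressions, as sketched below. The hard-coded tags stand in for Stanford CoreNLP output (Appendix A), and the regex covers only the first two rows of the table; the remaining rows would need longer expressions of the same form.

```python
# Match simple noun-phrase candidates over a POS-tagged token sequence.
# The tags below are hard-coded for illustration; in the pipeline they would
# come from the Stanford CoreNLP tagger listed in Appendix A.
import re

tagged = [("Urban", "JJ"), ("heat", "NN"), ("island", "NN"), ("research", "NN"),
          ("uses", "VBZ"), ("land", "NN"), ("surface", "NN"), ("temperature", "NN"),
          ("as", "IN"), ("an", "DT"), ("important", "JJ"), ("variable", "NN"),
          ("for", "IN"), ("planning", "NN"), ("practice", "NN"), (".", ".")]

# Encode the tag sequence as one character per token so POS patterns become
# ordinary regular expressions: J = adjective (JJ*), N = noun (NN*), O = other.
code = "".join("J" if tag.startswith("JJ") else "N" if tag.startswith("NN") else "O"
               for _, tag in tagged)

# "JN+" covers the JJ + NNs row of the table, "N{2,}" covers the NNs row.
candidates = [" ".join(word for word, _ in tagged[m.start():m.end()])
              for m in re.finditer(r"JN+|N{2,}", code)]
print(candidates)   # ['Urban heat island research', 'land surface temperature',
                    #  'important variable', 'planning practice']
```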
Clusters labelled (#1)–(#24) are the final abstract clusters; unlabelled counts are intermediate clusters whose abstracts were re-clustered in the following iteration.

| Total Abstracts | 1st Iteration: Abstracts | 1st Iteration: Silhouette Score | 2nd Iteration: Abstracts | 2nd Iteration: Silhouette Score | 3rd Iteration: Abstracts | 3rd Iteration: Silhouette Score | 4th Iteration: Abstracts | 4th Iteration: Silhouette Score |
|---|---|---|---|---|---|---|---|---|
| 593 | (#1) 16 | 0.99 |  |  |  |  |  |  |
|  | (#2) 21 | 0.88 |  |  |  |  |  |  |
|  | (#3) 38 | 0.74 |  |  |  |  |  |  |
|  | (#4) 31 | 0.69 |  |  |  |  |  |  |
|  | (#5) 31 | 0.51 |  |  |  |  |  |  |
|  | 62 | 0.77 | (#6) 23 | 0.85 |  |  |  |  |
|  |  |  | (#7) 25 | 0.67 |  |  |  |  |
|  |  |  | (#8) 14 | −0.14 |  |  |  |  |
|  | 107 | 0.71 | (#9) 33 | 0.87 |  |  |  |  |
|  |  |  | (#10) 42 | 0.84 |  |  |  |  |
|  |  |  | (#11) 32 | −0.41 |  |  |  |  |
|  | 287 | −0.76 | (#12) 13 | 0.95 |  |  |  |  |
|  |  |  | (#13) 12 | 0.82 |  |  |  |  |
|  |  |  | (#14) 23 | 0.66 |  |  |  |  |
|  |  |  | (#15) 17 | 0.54 |  |  |  |  |
|  |  |  | (#16) 35 | 0.16 |  |  |  |  |
|  |  |  | 187 | −0.50 |  |  |  |  |
|  |  |  |  |  | (#17) 10 | 0.74 |  |  |
|  |  |  |  |  | (#18) 15 | 0.71 |  |  |
|  |  |  |  |  | (#19) 25 | 0.61 |  |  |
|  |  |  |  |  | (#20) 12 | 0.60 |  |  |
|  |  |  |  |  | (#21) 19 | 0.36 |  |  |
|  |  |  |  |  | 106 | −0.48 | (#22) 26 | 0.33 |
|  |  |  |  |  |  |  | (#23) 34 | 0.31 |
|  |  |  |  |  |  |  | (#24) 46 | 0.02 |
| Abstract Cluster Sequence Number | Number of Abstracts | Group 1 Keywords | Group 1 Coverage | Group 2 Keywords | Group 2 Coverage | Group 3 Keywords | Group 3 Coverage | Group 4 Keywords | Group 4 Coverage | Group 5 Keywords | Group 5 Coverage | Average Coverage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 16 | 11 | 69% | 9 | 56% | 9 | 56% |  |  |  |  | 60% |
| 2 | 21 | 16 | 67% | 10 | 43% | 15 | 62% | 21 | 86% |  |  | 65% |
| 3 | 38 | 35 | 84% | 14 | 34% | 24 | 66% | 14 | 37% | 10 | 26% | 49% |
| 4 | 31 | 18 | 55% | 131 | 100% |  |  |  |  |  |  | 78% |
| 5 | 31 | 20 | 65% | 26 | 74% |  |  |  |  |  |  | 70% |
| 6 | 23 | 17 | 65% | 12 | 57% | 10 | 39% | 16 | 65% |  |  | 57% |
| 7 | 25 | 26 | 76% | 31 | 88% | 13 | 48% |  |  |  |  | 71% |
| 8 | 14 | 17 | 79% | 23 | 93% |  |  |  |  |  |  | 86% |
| 9 | 33 | 32 | 73% | 17 | 52% | 10 | 24% | 10 | 27% | 35 | 82% | 52% |
| 10 | 42 | 26 | 57% | 15 | 48% | 45 | 83% | 15 | 31% |  |  | 55% |
| 11 | 32 | 25 | 72% | 28 | 78% | 32 | 78% | 15 | 47% |  |  | 69% |
| 12 | 13 | 12 | 92% | 13 | 85% | 11 | 77% |  |  |  |  | 85% |
| 13 | 12 | 14 | 75% | 16 | 75% |  |  |  |  |  |  | 75% |
| 14 | 23 | 11 | 39% | 13 | 61% | 15 | 57% | 12 | 48% |  |  | 51% |
| 15 | 17 | 27 | 94% | 17 | 82% |  |  |  |  |  |  | 88% |
| 16 | 35 | 19 | 54% | 22 | 60% | 30 | 63% |  |  |  |  | 59% |
| 17 | 10 | 10 | 90% | 11 | 80% |  |  |  |  |  |  | 85% |
| 18 | 15 | 14 | 73% | 10 | 60% | 22 | 93% |  |  |  |  | 75% |
| 19 | 25 | 13 | 44% | 11 | 36% | 15 | 40% | 23 | 60% |  |  | 45% |
| 20 | 12 | 15 | 100% | 16 | 92% |  |  |  |  |  |  | 96% |
| 21 | 19 | 18 | 84% | 18 | 74% |  |  |  |  |  |  | 79% |
| 22 | 26 | 22 | 77% | 23 | 77% | 10 | 31% |  |  |  |  | 62% |
| 23 | 34 | 33 | 71% | 49 | 82% | 10 | 29% |  |  |  |  | 61% |
| 24 | 46 | 30 | 57% | 64 | 89% | 12 | 26% | 15 | 28% |  |  | 50% |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Citation: Weng, M.-H.; Wu, S.; Dyer, M. Identification and Visualization of Key Topics in Scientific Publications with Transformer-Based Language Models and Document Clustering Methods. Appl. Sci. 2022, 12, 11220. https://doi.org/10.3390/app122111220