A Graph-Based Keyword Extraction Method for Academic Literature Knowledge Graph Construction
Abstract
1. Introduction
2. Related Work
2.1. Automatic Keyword Extraction
2.2. Automatic Keyword Extraction Based on TextRank
3. TP-CoGlo-TextRank Keyword Extraction Algorithm
3.1. Word Frequency
3.2. Word Position Importance
3.3. Word Co-Occurrence Frequency
3.4. GloVe Word Embedding
3.5. TP-CoGlo-TextRank Algorithm
Algorithm 1 TP-CoGlo-TextRank Algorithm
Input: document, GloVe.model, sliding window size sw, damping factor d, iteration threshold θ, max_iterations, the number of words to be extracted k
Output: keywords extracted from document
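The body of Algorithm 1 is not reproduced in this excerpt. As a rough illustration of its core step, the following Python sketch runs a weighted PageRank-style iteration over a word graph. The `weight` matrix is assumed to already fuse word frequency, position importance, co-occurrence frequency, and GloVe similarity as described in Sections 3.1–3.4; all names are illustrative, not the authors' implementation.

```python
import numpy as np

def tp_coglo_textrank_scores(words, weight, d=0.85, theta=1e-6,
                             max_iterations=100, k=5):
    """Sketch of a weighted TextRank iteration (not the authors' exact code).

    weight[i][j] is assumed to already combine word frequency, position
    importance, co-occurrence frequency, and GloVe similarity
    (Sections 3.1-3.4); its exact formula is not reproduced here.
    """
    n = len(words)
    W = np.asarray(weight, dtype=float)
    col_sums = W.sum(axis=0)
    col_sums[col_sums == 0] = 1.0              # guard isolated nodes
    P = W / col_sums                           # column-stochastic transition matrix
    score = np.full(n, 1.0 / n)                # uniform initialization
    for _ in range(max_iterations):
        new_score = (1 - d) / n + d * (P @ score)
        if np.abs(new_score - score).sum() < theta:   # convergence test
            score = new_score
            break
        score = new_score
    top = np.argsort(-score)[:k]               # k highest-scoring words
    return [words[i] for i in top]
```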
4. Experiment and Discussion
4.1. Experimental Dataset
4.2. Measurement for Evaluation
4.3. Results and Analysis
4.3.1. Validity Verification of Word Frequency, Position, Co-Occurrence Frequency, and Similarity
- (1) Validity verification of word frequency
- (2) Validity verification of word position
- (3) Validity verification of word co-occurrence frequency
- (4) Validity verification of the similarity between words
4.3.2. Determination of Model Parameters
- (1) Determination of the parameter β of word position importance
- (2) Determination of the parameter γ in the transition probability and the parameter λ in the overall weighting (a hypothetical grid-search sketch follows this list)
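The excerpt does not show how γ and λ are searched, so the sketch below assumes a plain exhaustive grid search over a validation set; `evaluate_af1` is a hypothetical function that runs the extractor with one (γ, λ) pair and returns the resulting AF1.

```python
import itertools

def tune_gamma_lambda(evaluate_af1, step=0.1):
    """Hypothetical grid search for gamma (transition-probability weight)
    and lambda (overall weighting); evaluate_af1(gamma, lam) is assumed
    to score one parameter pair on validation documents."""
    grid = [round(i * step, 2) for i in range(int(round(1 / step)) + 1)]
    best_pair = max(itertools.product(grid, grid),
                    key=lambda pair: evaluate_af1(*pair))
    return best_pair  # the (gamma, lambda) pair with the highest AF1
```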
4.3.3. Comparative Experiment
- (1) M1: the TF-IDF method. The TF score is calculated by Equation (2) and the smoothed IDF by Equation (22), where N is the size of the document set and df(w) is the document frequency of word w. (An illustrative sketch follows this list.)
- (2) M2: the LDA topic-model-based method. In addition to the 10,000 documents in the experimental dataset, another 38,365 documents are selected from the DBLP-Citation-network V14 dataset, so a total of 48,365 documents are used to train the LDA model. The trained model is then used to compute the influence of the words in each test document, and the words with the highest influence are extracted as keywords.
- (3) M3: the TextRank algorithm [6].
- (4) M4: the modified TextRank algorithm proposed in [32], which uses word frequency, position, and word co-occurrence to compute the transition probability matrix. The optimal weights of the three components of the transition probability are 0, 0.9, and 0.1, respectively.
- (5) M5: the modified TextRank algorithm proposed in [37], which computes the transition probability matrix from word similarities and co-occurrence relationships, each weighted 0.5, and initializes each word's score with the sum of its similarities to all of its co-occurring words. The same word embedding model and word similarity calculation as in M6 are used here.
- (6) M6: the TP-CoGlo-TextRank algorithm proposed in this paper. The glove.42B.300d [31] word embedding model, trained on 42 billion tokens from Common Crawl (http://commoncrawl.org, accessed on 1 March 2024), is used to obtain the word vectors.
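Since Equations (2) and (22) are not reproduced in this excerpt, the sketch below of the M1 baseline assumes the common smoothed-IDF variant log(N/(1 + df(w))) + 1, which may differ from the paper's exact formula.

```python
import math
from collections import Counter

def tf_idf_keywords(doc_tokens, df, N, k=7):
    """Sketch of the M1 baseline: term frequency in the document times a
    smoothed IDF. The smoothing form is an assumption, since Equation (22)
    is not reproduced in this excerpt.

    doc_tokens: list of tokens in one document
    df: dict mapping word -> document frequency over the document set
    N: size of the document set
    """
    tf = Counter(doc_tokens)
    total = len(doc_tokens)
    scores = {w: (c / total) * (math.log(N / (1 + df.get(w, 0))) + 1)
              for w, c in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```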
- (1) The M2 algorithm has the worst keyword extraction performance. We believe this is mainly because M2 depends heavily on how well the LDA model is trained. Although the 48,365 documents used to train the LDA model are all computer-related, their topic distribution is relatively scattered. As shown in Figure 10, the optimal number of topics is 15 and the highest topic coherence score is 0.5250, indicating that the semantic coherence of the words within each topic is weak, so the keywords extracted by M2 cannot express the document topics well. (A sketch of coherence-based topic selection follows this list.)
- (2) As the number of extracted keywords k increases, the AP of all six algorithms gradually decreases, because the number of correctly extracted keywords grows more slowly than k. AR gradually increases, mainly because the number of keywords provided by each document is fixed, while the number of correctly extracted keywords still grows, albeit slowly, as k increases. With the increase in k, the AF1 of every algorithm except M2 first increases, peaks, and then gradually decreases; M2's AF1 peaks at k = 16 before declining. M2 peaks late because the topic coherence of the trained LDA model is low, so when only a few words are extracted they cannot express the document topics well; as k increases, more high-quality words are extracted and M2 gradually reaches its peak.
- (3) The graph-based algorithms M3, M4, M5, and M6 extract keywords better than the statistical-feature-based algorithm M1: their average AP values are higher by 1.40%, 10.42%, 1.65%, and 12.14%; their average AR values by 1.73%, 6.60%, 1.74%, and 8.09%; and their average AF1 values by 1.57%, 8.95%, 1.74%, and 10.61%, respectively. This shows that graph-based keyword extraction algorithms measure word importance better by exploiting the relationships between words.
- (4) Compared with the five baseline algorithms M1–M5, the M6 algorithm achieves significant improvements in AP, AR, and AF1: its average AP is higher by 12.14%, 23.34%, 10.74%, 1.72%, and 10.49%; its average AR by 8.09%, 24.31%, 6.36%, 1.48%, and 6.34%; and its average AF1 by 10.61%, 23.65%, 9.04%, 1.66%, and 8.87%, respectively. This shows that M6 performs well in keyword extraction even when adjacent keywords in the original text are not combined into phrases.
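The coherence-based choice of the number of topics mentioned in point (1) can be reproduced roughly with gensim, as sketched below; the candidate range and the c_v coherence measure are assumptions, since Figure 10 is not shown in this excerpt.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

def best_topic_count(texts, candidates=range(5, 31, 5)):
    """Pick the LDA topic count with the highest c_v coherence.

    texts: list of tokenized documents. The candidate range and the
    c_v measure are assumptions, not taken from the paper.
    """
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    scores = {}
    for k in candidates:
        lda = LdaModel(corpus, num_topics=k, id2word=dictionary,
                       random_state=0)
        cm = CoherenceModel(model=lda, texts=texts,
                            dictionary=dictionary, coherence='c_v')
        scores[k] = cm.get_coherence()   # higher = more coherent topics
    return max(scores, key=scores.get)
```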
5. Academic Literature Knowledge Graph Construction
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Singhal, A. Introducing the Knowledge Graph: Things, Not Strings. 2012. Available online: https://www.blog.google/products/search/introducing-knowledge-graph-things-not/ (accessed on 14 February 2024).
- Wang, Z.; Zhang, J.; Feng, J.; Chen, Z. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, Québec City, QC, Canada, 27–31 July 2014; pp. 1112–1119.
- Ji, S.; Pan, S.; Cambria, E.; Marttinen, P.; Yu, P.S. A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 494–514.
- Liu, X.; Yu, Y.; Guo, C.; Sun, Y. Meta-path-based ranking with pseudo relevance feedback on heterogeneous graph for citation recommendation. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, Shanghai, China, 3–7 November 2014; pp. 121–130.
- Jiang, Z.; Yin, Y.; Gao, L.; Lu, Y.; Liu, X. Cross-language citation recommendation via hierarchical representation learning on heterogeneous graph. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 635–644.
- Mihalcea, R.; Tarau, P. TextRank: Bringing order into texts. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 25–26 July 2004; pp. 404–411.
- Nomoto, T. Keyword extraction: A modern perspective. SN Comput. Sci. 2022, 4, 92.
- Hasan, K.S.; Ng, V. Automatic keyphrase extraction: A survey of the state of the art. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, USA, 22–27 June 2014; pp. 1262–1273.
- Wang, Z.; Wang, D.; Li, Q. Keyword extraction from scientific research projects based on SRP-TF-IDF. Chin. J. Electron. 2021, 30, 652–657.
- Rathi, R.N.; Mustafi, A. Designing an efficient unigram keyword detector for documents using relative entropy. Multimed. Tools Appl. 2022, 81, 37747–37761.
- Campos, R.; Mangaravite, V.; Pasquali, A.; Jorge, A.M.; Nunes, C.; Jatowt, A. A text feature based automatic keyword extraction method for single documents. In Proceedings of the 40th European Conference on Information Retrieval, Grenoble, France, 26–29 March 2018; pp. 684–691.
- Campos, R.; Mangaravite, V.; Pasquali, A.; Jorge, A.; Nunes, C.; Jatowt, A. YAKE! Keyword extraction from single documents using multiple local features. Inf. Sci. 2020, 509, 257–289.
- Lu, X.; Zhou, X.; Wang, W.; Lio, P.; Hui, P. Domain-oriented topic discovery based on features extraction and topic clustering. IEEE Access 2020, 8, 93648–93662.
- Goz, F.; Mutlu, A. MGRank: A keyword extraction system based on multigraph GoW model and novel edge weighting procedure. Knowl.-Based Syst. 2022, 251, 109292.
- Jain, M.; Bhalla, G.; Jain, A.; Sharma, S. Automatic keyword extraction for localized tweets using fuzzy graph connectivity measures. Multimed. Tools Appl. 2022, 81, 42932–42956.
- Yan, Y.; Tan, Q.; Xie, Q.; Zeng, P.; Li, P. A graph-based approach of automatic keyphrase extraction. Procedia Comput. Sci. 2017, 107, 248–255.
- Abimbola, R.O.; Awoyelu, I.O.; Hunsu, F.O.; Akinyemi, B.O.; Aderounmu, G.A. A noun-centric keyphrase extraction model: Graph-based approach. J. Adv. Inf. Technol. 2022, 13, 578–589.
- Liu, Z.; Huang, W.; Zheng, Y.; Sun, M. Automatic keyphrase extraction via topic decomposition. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, USA, 9–11 October 2010; pp. 366–376.
- Teneva, N.; Cheng, W. Salience Rank: Efficient keyphrase extraction with topic modeling. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 530–535.
- Sarracén, G.L.D.I.P.; Rosso, P. Offensive keyword extraction based on the attention mechanism of BERT and the eigenvector centrality using a graph representation. Pers. Ubiquitous Comput. 2021, 27, 45–57.
- Duan, X.; Ying, S.; Chen, H.; Yuan, W.; Yin, X. OILog: An online incremental log keyword extraction approach based on MDP-LSTM neural network. Inf. Syst. 2021, 95, 101618.
- Zhang, Y.; Tuo, M.; Yin, Q.; Qi, L.; Wang, X.; Liu, T. Keywords extraction with deep neural network model. Neurocomputing 2020, 383, 113–121.
- Page, L.; Brin, S.; Motwani, R.; Winograd, T. The PageRank Citation Ranking: Bringing Order to the Web; Technical Report; Stanford InfoLab: Stanford, CA, USA, 1998.
- Liu, L.; Yu, H.; Fei, N.; Chen, C. Key-word extracting algorithm from single text based on TextRank. Appl. Res. Comput. 2018, 35, 705–710.
- Gu, Y.; Xia, T. Study on keyword extraction with LDA and TextRank combination. Data Anal. Knowl. Discov. 2014, Z1, 41–47.
- Xia, T. Extracting keywords with modified TextRank model. Data Anal. Knowl. Discov. 2017, 1, 28–34.
- Chen, Z.; Li, Y.; Xu, F.; Feng, G.; Shi, D.; Cui, X. Key information extraction of forestry text based on TextRank and clusters filtering. Trans. Chin. Soc. Agric. Mach. 2020, 51, 207–214+172.
- Xiong, A.; Liu, D.; Tian, H.; Liu, Z.; Yu, P.; Kadoch, M. News keyword extraction algorithm based on semantic clustering and word graph model. Tsinghua Sci. Technol. 2021, 26, 886–893.
- Guo, W.; Wang, Z.; Han, F. Multifeature fusion keyword extraction algorithm based on TextRank. IEEE Access 2022, 10, 71805–71813.
- Qiu, D.; Zheng, Q. Improving TextRank algorithm for automatic keyword extraction with tolerance rough set. Int. J. Fuzzy Syst. 2022, 24, 1332–1342.
- Matsuo, Y.; Ishizuka, M. Keyword extraction from a single document using word co-occurrence statistical information. Int. J. Artif. Intell. Tools 2004, 13, 157–169.
- Xia, T. Study on keyword extraction using word position weighted TextRank. Data Anal. Knowl. Discov. 2013, 29, 30–34.
- Pennington, J.; Socher, R.; Manning, C. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
- Zhang, M.; Li, X.; Yue, S.; Yang, L. An empirical study of TextRank for keyword extraction. IEEE Access 2020, 8, 178849–178858.
- Tang, J.; Zhang, J.; Yao, L.; Li, J.; Zhang, L.; Su, Z. ArnetMiner: Extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, USA, 24–27 August 2008; pp. 990–998.
- Manning, C.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S.; McClosky, D. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MD, USA, 22–27 June 2014; pp. 55–60.
- Ning, J.; Liu, J. Using Word2vec with TextRank to extract keywords. Data Anal. Knowl. Discov. 2016, 32, 20–27.
| Keywords per Document | Number of Documents |
|---|---|
| 3 | 2153 |
| 4 | 3650 |
| 5 | 4197 |
| Words per Keyword | Number of Keywords |
|---|---|
| 1 | 9303 |
| 2 | 22,489 |
| 3 | 7725 |
| 4 | 1882 |
| 5 | 467 |
| 6 | 132 |
| 7 | 33 |
| 8 | 9 |
| 9 | 4 |
| Min. Total Distinct Words in All Keywords of a Document | Max. Total Distinct Words in All Keywords of a Document | Min. Average Distinct Words per Keyword in a Document | Max. Average Distinct Words per Keyword in a Document |
|---|---|---|---|
| 3 | 22 | 1 | 5.5 |
|  | Keywords Provided by the Document | Not Keywords Provided by the Document |
|---|---|---|
| Keywords Extracted by the Algorithm | TP | FP |
| Not Keywords Extracted by the Algorithm | FN | TN |
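Given this confusion matrix, per-document precision, recall, and F1 follow directly, and AP, AR, and AF1 are their averages over all test documents; a minimal sketch (function names are illustrative):

```python
def precision_recall_f1(extracted, gold):
    """Per-document P/R/F1 from the confusion matrix above:
    TP = extracted keywords that the document also provides."""
    ext, ref = set(extracted), set(gold)
    tp = len(ext & ref)
    p = tp / len(ext) if ext else 0.0        # precision = TP / (TP + FP)
    r = tp / len(ref) if ref else 0.0        # recall    = TP / (TP + FN)
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```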
| Algorithms | Top 7 (Uncombined) AP% | Top 7 (Uncombined) AR% | Top 7 (Uncombined) AF1% | Top 7 (Combined) AP% | Top 7 (Combined) AR% | Top 7 (Combined) AF1% |
|---|---|---|---|---|---|---|
| M1 | 37.21 | 53.32 | 43.83 | 43.16 | 54.03 | 47.99 |
| M2 | 27.32 | 36.69 | 31.32 | 29.52 | 37.90 | 33.19 |
| M3 | 38.44 | 54.93 | 45.23 | 45.50 | 55.65 | 50.06 |
| M4 | 47.23 | 60.01 | 52.86 | 57.29 | 61.45 | 59.30 |
| M5 | 38.66 | 54.86 | 45.35 | 45.60 | 55.67 | 50.14 |
| M6 | 48.95 | 61.42 | 54.48 | 60.54 | 62.79 | 61.64 |