Hierarchical Keyword Generation Method for Low-Resource Social Media Text
Abstract
1. Introduction
2. Related Work
3. Method
3.1. Problem Formulation
3.2. Architecture
3.3. Data Preprocessing
3.4. Hierarchical Encoder
3.5. Selection Block
3.6. Decoder
3.7. Training
Algorithm 1. Self-supervised training based on the text segment recovery task
Input: unlabeled sample set in the social media text domain; hierarchical keyword generation model
for … do
    for … do
        …
        …
        …
        for … in … do
            random probability …
            if … is True do
                if … then …
            else do
                if …, then …
        end
        …
        …
        loss backward and optimizer step
    end
end
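
To make the idea of a text segment recovery task more concrete, the following minimal sketch builds one training example by hiding a contiguous segment of a tokenized post and keeping the hidden segment as the recovery target. It is an illustration under stated assumptions, not the sampling procedure of Algorithm 1: the `[MASK]` placeholder, the `mask_ratio` parameter, and the function name `make_recovery_example` are all hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def make_recovery_example(tokens, mask_ratio=0.3, rng=None):
    """Hide one contiguous segment of `tokens`.

    Returns (corrupted_tokens, target_segment). The segment length is an
    assumed fixed fraction of the input; other sampling rules are possible.
    """
    if rng is None:
        rng = random.Random()
    if not tokens:
        return [], []
    # At least one token is always hidden.
    seg_len = max(1, int(len(tokens) * mask_ratio))
    start = rng.randrange(0, len(tokens) - seg_len + 1)
    target = tokens[start:start + seg_len]
    # The whole segment is replaced by a single placeholder.
    corrupted = tokens[:start] + [MASK] + tokens[start + seg_len:]
    return corrupted, target
```

A model trained on such pairs would take `corrupted` as input and be optimized to generate `target`, which requires no keyword labels.
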
4. Experiments
4.1. Datasets
4.2. Evaluation
- TF-IDF [33]: A commonly used unsupervised keyword extraction method that scores the importance of words or phrases in a text, relative to a corpus, based on their frequency.
- TextRank [34]: A graph-based ranking algorithm that builds a graph from the co-occurrence relationships between words in the text and ranks the words by their importance in that graph.
- Transformer [14]: A Seq2Seq model based on a multi-head self-attention mechanism that performs well on several text generation tasks.
- BART [35]: A large-scale pre-trained language model based on the Transformer, trained as a denoising sequence-to-sequence model, which improves performance on downstream text generation tasks.
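
As an illustration of how the TF-IDF baseline ranks candidate keywords, the sketch below scores each word of one document by term frequency times inverse document frequency over a small tokenized corpus. It is a minimal version under assumptions: the function name `tfidf_keywords`, the unsmoothed `log(N / df)` weighting, and the alphabetical tie-breaking are illustrative choices, not the configuration used in the experiments.

```python
import math
from collections import Counter


def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the words of docs[doc_index] by TF-IDF over the corpus `docs`.

    `docs` is a list of token lists (for Chinese text, the output of a
    word segmenter). Returns the `top_k` highest-scoring words.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    tf = Counter(docs[doc_index])
    total = len(docs[doc_index])
    scores = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

Because the scores depend only on word counts, such a method can only return words that occur in the post itself, which is one reason extraction baselines struggle on short social media text.
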
4.3. Implementation Details
4.4. Results and Analysis
4.4.1. Comparative Experiments
4.4.2. Ablation Experiments
4.4.3. Model Hyperparameter Experiments
4.4.4. Case Study on the Real Social Media Platform
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
[1] Hammouda, K.M.; Matute, D.N.; Kamel, M.S. Corephrase: Keyphrase extraction for document clustering. In Proceedings of the Machine Learning and Data Mining in Pattern Recognition: 4th International Conference (MLDM), Leipzig, Germany, 9–11 July 2005; pp. 265–274.
[2] Zhang, C.; Yang, Q.; Zhang, J.; Gou, L.; Fan, H. Topic Mining and Future Trend Exploration in Digital Economy Research. Information 2023, 14, 432.
[3] Wu, X.; Bolivar, A. Keyword extraction for contextual advertisement. In Proceedings of the 17th International Conference on World Wide Web, Beijing, China, 21–25 April 2008; pp. 1195–1196.
[4] Dave, K.S.; Varma, V. Pattern based keyword extraction for contextual advertising. In Proceedings of the 19th ACM international conference on Information and knowledge management, Toronto, Canada, 26–30 October 2010; pp. 1885–1888.
[5] Jones, S.; Staveley, M.S. Phrasier: A system for interactive document retrieval using keyphrases. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, CA, USA, 15–19 August 1999; pp. 160–167.
[6] Boudin, F.; Gallina, Y.; Aizawa, A. Keyphrase generation for scientific document retrieval. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), Online, 5–10 July 2020; pp. 1118–1126.
[7] Zhang, Y.; Zincir-Heywood, N.; Milios, E. World wide web site summarization. Web Intell. Agent Syst. Int. J. 2004, 2, 39–53.
[8] Banbhrani, S.K.; Xu, B.; Liu, H.; Lin, H. SC-Political ResNet: Hashtag Recommendation from Tweets Using Hybrid Optimization-Based Deep Residual Network. Information 2021, 12, 389.
[9] Berend, G. Opinion Expression Mining by Exploiting Keyphrase Extraction. In Proceedings of the 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand, 8–13 November 2011; pp. 1162–1170.
[10] Wang, H.; Wang, Y. EREC: Enhanced Language Representations with Event Chains. Information 2022, 13, 582.
[11] Meng, R.; Zhao, S.; Han, S.; He, D.; Brusilovsky, P.; Chi, Y. Deep Keyphrase Generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 582–592.
[12] Gallina, Y.; Boudin, F.; Daille, B. KPTimes: A Large-Scale Dataset for Keyphrase Generation on News Documents. In Proceedings of the 12th International Conference on Natural Language Generation (INLG), Tokyo, Japan, 29 October–1 November 2019; pp. 130–135.
[13] Li, Y.; Zhang, Y.; Zhao, Z. CSL: A Large-scale Chinese Scientific Literature Dataset. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 3917–3923.
[14] Vaswani, A.; Shazeer, N.; Parmar, N. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS′17), Red Hook, NY, USA, 4 December 2017; pp. 6000–6010.
[15] Cho, K.; van Merrienboer, B.; Gülçehre, Ç. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014; pp. 1724–1734.
[16] Chen, J.; Zhang, X.; Wu, Y. Keyphrase Generation with Correlation Constraints. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 13 November 2018; pp. 4057–4066.
[17] Zhang, Y.; Xiao, W. Keyphrase Generation Based on Deep Seq2Seq Model. IEEE Access 2018, 6, 46047–46057.
[18] Chen, W.; Gao, Y.; Zhang, J. Title-Guided Encoding for Keyphrase Generation. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 6268–6275.
[19] Wang, Y.; Li, J.; Chan, H.P. Topic-Aware Neural Keyphrase Generation for Social Media Language. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 17 July 2019; pp. 2516–2526.
[20] Kim, J.; Jeong, M.; Choi, S. Structure-augmented keyphrase generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Virtual, 7–11 November 2021; pp. 2657–2667.
[21] Yang, P.; Ge, Y.; Yao, Y. GCN-based document representation for keyphrase generation enhanced by maximizing mutual information. Knowl. Based Syst. 2022, 243, 108488.
[22] Ye, H.; Wang, L. Semi-Supervised Learning for Neural Keyphrase Generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 13 November 2018; pp. 4142–4153.
[23] Wang, Y.; Liu, Q.; Qin, C. Exploiting Topic-Based Adversarial Neural Network for Cross-Domain Keyphrase Extraction. In Proceedings of the 2018 IEEE International Conference on Data Mining, Sentosa, Singapore, 17–20 November 2018; pp. 597–606.
[24] Guo, L.; Sun, H.; Qi, Q. Keyword Extraction Algorithm Based on Pre-training and Multi-task Training. In Proceedings of the Sixth International Congress on Information and Communication Technology, Singapore, 10–11 November 2022; pp. 723–734.
[25] Sun, S.; Liu, Z.; Xiong, C. Capturing Global Informativeness in Open Domain Keyphrase Extraction. In Proceedings of the Natural Language Processing and Chinese Computing: 10th CCF International Conference (NLPCC), Qingdao, China, 13–17 October 2021; pp. 275–287.
[26] Bhat, G.; Saluja, A.; Dye, M.; Florjanczyk, J. Hierarchical Encoders for Modeling and Interpreting Screenplays. In Proceedings of the Third Workshop on Narrative Understanding, Online, 22 March 2021; pp. 1–12.
[27] Wang, Z.; Wang, P.; Huang, L.; Sun, X.; Wang, H. Incorporating Hierarchy into Text Encoder: A Contrastive Learning Approach for Hierarchical Text Classification. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022.
[28] Sakhrani, H.; Parekh, S.; Ratadiya, P. Transformer-based Hierarchical Encoder for Document Classification. In Proceedings of the 2021 International Conference on Data Mining Workshops (ICDMW), IEEE, Auckland, New Zealand, 7–10 December 2021; pp. 852–858.
[29] Wu, D.; Ahmad, W.U.; Dev, S. Representation Learning for Resource-Constrained Keyphrase Generation. In Proceedings of the Findings of the Association for Computational Linguistics (EMNLP), Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 700–716.
[30] NLP Chinese Corpus: Large Scale Chinese Corpus for NLP. Available online: https://zenodo.org/records/3402023 (accessed on 7 September 2019).
[31] Fxsjy, Jieba. Available online: https://github.com/fxsjy/jieba (accessed on 20 January 2020).
[32] Goto456, Stopwords. Available online: https://github.com/goto456/stopwords (accessed on 12 May 2023).
[33] Salton, G.; Yang, C.S.; Yu, C.T. A Theory of Term Importance in Automatic Text Analysis. J. Am. Soc. Inf. Sci. 1975, 26, 33–44.
[34] Mihalcea, R.; Tarau, P. TextRank: Bringing Order into Text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 25–26 July 2004; pp. 404–411.
[35] Lewis, M.; Liu, Y.; Goyal, N. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 19 June 2020; pp. 7871–7880.
[36] Yuewang-Cuhk, TAKG. Available online: https://github.com/yuewang-cuhk/TAKG (accessed on 5 August 2019).
[37] Fnlp, Bart-Base-Chinese. Available online: https://huggingface.co/fnlp/bart-base-chinese (accessed on 30 December 2022).
Dataset | | News2016zh | |
---|---|---|---|
Size | | 147,456 | 38,505 |
Text length | max | 65 | 101 |
Text length | min | 5 | 5 |
Text length | avg | 21.39 | 20.70 |
Keyword length | max | 29 | 36 |
Keyword length | min | 1 | 1 |
Keyword length | avg | 8.45 | 2.76 |
Number of keywords | max | 12 | 9 |
Number of keywords | min | 1 | 1 |
Number of keywords | avg | 1.45 | 1.06 |
Model | Rouge-1 | Rouge-2 | Rouge-L | F1@1 |
---|---|---|---|---|
TF-IDF | 2.36 | 0.64 | 2.12 | 2.36 |
TextRank | 1.72 | 0.27 | 1.14 | 1.72 |
TAKG | - | - | - | 21.57 |
Transformer | 27.72 | 15.73 | 27.63 | 21.79 |
BART | 32.34 | 20.70 | 30.09 | 26.36 |
HKG | 32.78 | 21.82 | 31.12 | 27.20 |
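
For readers unfamiliar with the F1@1 column, the following sketch shows one common way to compute F1@k for a single post from a ranked list of predicted keywords and a set of reference keywords. It is a minimal version under assumptions: exact string matching, no text normalization, and the function name `f1_at_k` are illustrative choices and may differ from the evaluation script used for the reported numbers.

```python
def f1_at_k(predicted, gold, k=1):
    """F1 between the top-k predicted keywords and the gold keywords.

    `predicted` is a ranked list of strings, `gold` a collection of strings.
    Matching is by exact string equality (an assumption of this sketch).
    """
    # Keep the top-k predictions, dropping duplicates but preserving order.
    top = list(dict.fromkeys(predicted[:k]))
    gold_set = set(gold)
    if not top or not gold_set:
        return 0.0
    hits = sum(1 for keyword in top if keyword in gold_set)
    if hits == 0:
        return 0.0
    precision = hits / len(top)
    recall = hits / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

For example, a single correct prediction against two reference keywords gives precision 1 and recall 0.5, so F1@1 is 2/3. A corpus-level score would then average this value over all test posts.
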
Model | Rouge-1 (0) | Rouge-1 (1 K) | Rouge-1 (10 K) | Rouge-2 (0) | Rouge-2 (1 K) | Rouge-2 (10 K) | Rouge-L (0) | Rouge-L (1 K) | Rouge-L (10 K) | F1@1 (0) | F1@1 (1 K) | F1@1 (10 K) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BART | 30.9 | 30.9 | 32.3 | 16.2 | 16.4 | 20.7 | 27.4 | 27.7 | 30.1 | 19.0 | 19.2 | 26.4 |
HKG | 31.1 | 31.2 | 32.8 | 17.9 | 18.0 | 21.8 | 29.3 | 29.5 | 31.1 | 22.7 | 23.1 | 27.2 |
Pre-Training | Self-Supervised Fine-Tuning | Supervised Fine-Tuning | Rouge-1 | Rouge-2 | Rouge-L | F1@1 |
---|---|---|---|---|---|---|
√ | × | × | 30.77 | 13.29 | 19.87 | 15.38 |
× | × | √ | 14.60 | 8.34 | 14.24 | 12.87 |
√ | × | √ | 31.24 | 20.27 | 29.91 | 25.21 |
√ | √ | √ | 32.78 | 21.82 | 31.12 | 27.20 |
Weibo Post: | 马航客机失联接近两星期,中国未有停止搜索行动。中国国防部新闻发言人表示,军方将根据马方提供的最新信息及前一阶段搜寻情况继续保持足够的搜寻兵力,配合卫星和雷达扩大搜寻范围,加大搜救力度。 (The Malaysia Airlines passenger plane has been out of contact for nearly two weeks, and China has not stopped its search operations. A spokesperson for China's Ministry of National Defense stated that, based on the latest information provided by Malaysia and the results of the previous search stage, the military will continue to maintain sufficient search forces, work with satellites and radar to expand the search range, and step up search and rescue efforts.) |
BART: | #马航客机失联# (#Malaysia Airlines Passenger Plane Lost Contact#) |
HKG: | #马航飞机失联# (#Malaysia Airlines Airplane Lost Contact#) |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Guan, X.; Long, S. Hierarchical Keyword Generation Method for Low-Resource Social Media Text. Information 2023, 14, 615. https://doi.org/10.3390/info14110615