1. Introduction
In the digital age, the rapid proliferation of data makes it difficult to identify valuable information within large volumes of irrelevant content. This problem is particularly acute in specialised fields such as traffic accident cause analysis, where relevant keywords must be accurately extracted to summarise extensive textual datasets. Keyword extraction is a fundamental technique in natural language processing (NLP) and plays a critical role in addressing these challenges by helping users identify core content promptly [1]. Keyword extraction methods can broadly be classified into two categories: supervised and unsupervised [2,3].
Supervised keyword extraction methods typically treat the task as binary classification, training models on manually labelled corpora [4,5]. Complex machine learning algorithms are usually employed to learn from rich feature sets and optimise keyword prediction performance. However, these approaches require substantial annotated datasets, which are often difficult to acquire [6]. Furthermore, models trained on specific datasets may not generalise well to other document types or keyword extraction tasks, resulting in potential overfitting. In contrast, unsupervised methods require no labelled data; instead, they rank candidate words by specific metrics, making them more adaptable to different contexts. In response to the limitations of traditional unsupervised approaches, Mihalcea and Tarau [7] proposed the TextRank algorithm in 2004, which has since gained widespread attention. TextRank is a graph-based algorithm that constructs a co-occurrence network by treating words as nodes and their co-occurrence relations as edges. However, the algorithm initially assumes that all words are equally important, which does not fully capture the semantic and syntactic significance of individual words within a document. To overcome these constraints, this paper proposes an improved method, the IWF-TextRank algorithm, which integrates a variety of lexical features (e.g., semantic vectors, word frequency, lexical properties, and word length) into the traditional TextRank framework. The novelty of this method lies in the dynamic allocation of word weights through a backpropagation (BP) neural network and sequence relationship analysis. The method captures more nuanced contextual information, thereby improving the relevance and accuracy of keyword extraction, especially on domain-specific datasets such as traffic accident reports.
The main contribution of this article is a new keyword extraction algorithm, IWF-TextRank, which extends the traditional TextRank into a multi-feature weighted framework. It integrates features such as semantic vectors, word frequency, lexicality, and word length, and combines them with a BP neural network to achieve dynamic weight distribution, significantly improving keyword extraction on domain-specific data (such as traffic accident reports). The specific features are as follows:
Semantic vectors: word vectors are used to account for the semantic relationships between words;
Word frequency and lexicality: the importance of a word is adjusted according to its frequency of occurrence and its grammatical role in the sentence;
Word length: word length is taken into account, as longer words usually carry more information;
Backpropagation neural network: a BP neural network dynamically adjusts word weights to ensure that more important, contextually relevant words are emphasised.
In addition, by adding sequence relationship analysis to the BP neural network, IWF-TextRank can more accurately capture contextual associations in the text, ultimately achieving higher precision and recall. This improved approach aims to provide more accurate domain-specific keyword extraction, addressing the shortcomings of traditional algorithms such as TF-IDF and basic TextRank, which fail to account for semantic relationships and contextual importance. Experimental results show that the IWF-TextRank algorithm outperforms these traditional methods in keyword extraction accuracy, providing an effective solution for keyword extraction tasks in fields such as traffic analysis.
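As a concrete illustration of the multi-feature weighting idea, the four feature scores can be combined linearly into an initial node weight. This is only a sketch: the feature values and the default coefficients below are hypothetical placeholders, whereas in the proposed method the coefficients (α, β, γ, δ) are learned by the BP neural network.

```python
# Illustrative sketch: combining four normalised lexical feature scores
# into one initial node weight. Coefficient values are placeholders, not
# the paper's trained parameters.

def node_weight(freq, pos_score, length_score, semantic_score,
                alpha=0.25, beta=0.25, gamma=0.25, delta=0.25):
    """Weighted linear combination of normalised feature scores."""
    return (alpha * freq + beta * pos_score
            + gamma * length_score + delta * semantic_score)

# Example: a frequent noun of length 2 with moderate semantic centrality.
w = node_weight(freq=0.8, pos_score=1.0, length_score=0.9, semantic_score=0.5)
print(round(w, 3))  # 0.8
```

In the full algorithm, such a weight replaces the uniform initial score that standard TextRank assigns to every node.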
Keyword extraction methods fall mainly into three categories: statistical-feature-based methods, topic-model-based methods, and word-graph-based methods. Among the statistical methods, TF-IDF [8] is a widely used technique in which keywords are ranked by term frequency and inverse document frequency. However, TF-IDF has been criticised for its over-reliance on word frequency, especially in professional domains, which often leads to poor extraction results [9]. To address this issue, researchers have proposed various improvements, such as integrating positional and word-span weights [10] or combining TF-IDF with weight-balancing algorithms [11]. Topic-model-based approaches, such as Latent Dirichlet Allocation (LDA) [12], have also been used for keyword extraction. LDA performs well in capturing semantic associations between words, making it a powerful tool for extracting topic-relevant keywords from large text corpora [13,14]. However, LDA is computationally expensive and may perform poorly on short texts or single-topic documents [15,16]. Finally, one of the most widely used word-graph-based methods is TextRank [7]. This algorithm adapts the PageRank algorithm to a co-occurrence network, iteratively calculating importance scores for the nodes and extracting the highest-scoring words as keywords. In recent years, researchers have proposed a number of extensions to TextRank, such as SW-TextRank and DK-TextRank, which incorporate semantic features and optimise word weights, but these methods often fail to balance multiple linguistic features fully [17,18].
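For reference, the core of the standard TextRank formulation can be sketched in a few lines: words become nodes, window co-occurrences become undirected edges, every node starts with the same score, and PageRank-style updates (damping factor d = 0.85) are iterated. The toy sentence and equal initial weights reflect the baseline algorithm only, not the improvements proposed in this paper; tokenisation and stop-word filtering are assumed to have been done already.

```python
# Minimal TextRank sketch over a word co-occurrence graph.
from collections import defaultdict

def textrank(tokens, window=2, d=0.85, iters=50):
    # Build an undirected co-occurrence graph within a sliding window.
    neighbours = defaultdict(set)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[i] != tokens[j]:
                neighbours[tokens[i]].add(tokens[j])
                neighbours[tokens[j]].add(tokens[i])
    # Every node starts with the same score (the baseline assumption),
    # then PageRank-style updates are iterated.
    scores = {w: 1.0 for w in neighbours}
    for _ in range(iters):
        scores = {
            w: (1 - d) + d * sum(scores[u] / len(neighbours[u])
                                 for u in neighbours[w])
            for w in neighbours
        }
    return sorted(scores, key=scores.get, reverse=True)

tokens = "wet road caused vehicle skid on wet road surface".split()
ranked = textrank(tokens)
print(ranked[:3])
```

Well-connected words such as "road" rise to the top; the improved algorithm described in this paper replaces the equal initial weights with multi-feature weights.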
In recent years, researchers have proposed a variety of improved algorithms to increase the accuracy and adaptability of keyword extraction. Among the existing methods, NE-Rank and SemanticRank are typical representatives based on semantics and word-weight optimisation. NE-Rank mainly enhances the importance score of a word by its frequency, thereby improving extraction accuracy; however, because it ignores the semantic relationships between words, it performs poorly on complex texts or domain-specific data [19]. SemanticRank introduces word vectors to improve extraction accuracy through semantic similarity [20], but its performance on domain-specific data is limited because its semantic modelling is oriented towards general contexts [21]. In contrast, the IWF-TextRank proposed in this paper jointly considers multiple features (word frequency, lexicality, word length, and semantics) and uses a BP neural network to optimise their weights, making it better suited to keyword extraction in specific fields.
Table 1 compares the main features and performance of NE-Rank, SemanticRank, and IWF-TextRank to more clearly show the innovations and advantages of IWF-TextRank.
The remainder of this paper is structured as follows:
Section 2 describes the IWF-TextRank modelling framework and the methodology used in this study.
Section 3 discusses the experimental results and analyses them in comparison with existing methods. Finally,
Section 4 summarises the contributions of this paper and suggests potential directions for future research.
3. Results
To verify the effectiveness of the IWF-TextRank algorithm, comparative analyses were carried out of the extraction results under different feature conditions, different parameter settings, and different algorithms.
3.1. Evaluation Indicators
Keyword extraction is generally evaluated using precision (P), recall (R), and the F-value [31]. Precision and recall trade off against each other: raising one tends to lower the other, so the F-value is used to reconcile them by weighting the two. The F-value combines the P-value and the R-value, and a higher F-value indicates a more effective method. The metrics are computed as

P = \frac{1}{N}\sum_{i=1}^{N}\frac{|W_i \cap M_i|}{|W_i|}, \qquad
R = \frac{1}{N}\sum_{i=1}^{N}\frac{|W_i \cap M_i|}{|M_i|}, \qquad
F = \frac{2PR}{P+R},

where W_i denotes the set of keywords extracted from the ith document; M_i denotes the set of manually labelled keywords for the ith document; and N denotes the number of documents in the test document set.
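The evaluation described above can be sketched directly: for each document, the extracted keyword set is intersected with the manually labelled set, and the per-document scores are averaged over the test set. The example sets below are made-up illustrations.

```python
# Sketch of the precision/recall/F evaluation over a test set of N documents.

def evaluate(extracted, labelled):
    """extracted, labelled: lists of keyword sets, one pair per document."""
    N = len(extracted)
    P = sum(len(e & m) / len(e) for e, m in zip(extracted, labelled)) / N
    R = sum(len(e & m) / len(m) for e, m in zip(extracted, labelled)) / N
    F = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F

# Two toy documents with hypothetical extracted and labelled keyword sets.
extracted = [{"speeding", "rain", "skid"}, {"fatigue", "night"}]
labelled  = [{"speeding", "rain", "visibility"}, {"fatigue", "night", "truck"}]
P, R, F = evaluate(extracted, labelled)
print(round(P, 3), round(R, 3), round(F, 3))
```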
3.2. Parameter Setting
In this experiment, accident-causation data from the 2020 traffic accident records of a certain city were used as the dataset; 400 documents were obtained after processing. The data were provided by the relevant departments and have high authenticity and accuracy. The full dataset covers 4419 traffic accident records and contains rich accident information with a high value density: 4364 records are in tabular form, and the other 55 are in text form. The dataset includes the specific time, location, weather conditions, cause of accident, type of accident, information on the accident participants, vehicle information, and other fields. The experimental process is divided into the following two parts:
Parameter training: 50% of the data is selected as the training set, which is used to obtain the optimal combination of feature parameters.
Keyword extraction comparison: the remaining 50% is used to test and compare the keyword extraction results of the different algorithms. To facilitate comparison, the word-segmented documents are manually annotated, with 10 words per document serving as the labelled keywords.
The initial keyword set is obtained by preprocessing the data, after which statistical algorithms are used to count the word frequency, lexicality, and word length of each word. The statistical results are shown in
Table 4 and
Table 5.
Based on the lexical statistics, verbs and nouns are selected as secondary features. Judging by the number of words in each lexical class, verbs are more important than nouns; these are recorded as B1 and B2, respectively, so B1 > B2. The word-length statistics show that words of length 2 and 4 account for the largest proportions, giving a word-length ranking of 2, 4, 3, >4, 1, recorded as C1, C2, C3, C4, and C5, respectively.
From these statistics, the lexicality and word-length rankings are determined, and the sequence (order) relationship analysis method is used to obtain the weight of each indicator. The final indicator weights are shown in
Table 6.
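The sequence relationship analysis above converts an importance ranking into normalised weights. A generic sketch of that standard formulation is given below: given the ratios r_k between each pair of consecutively ranked indicators, the lowest-ranked weight is solved first and the rest follow by back-substitution. The ratio values used here are illustrative, not the paper's.

```python
# Sketch of sequence (order) relationship analysis: ratios[k] is the
# importance of the rank-k indicator relative to the rank-(k+1) indicator.

def g1_weights(ratios):
    n = len(ratios) + 1
    # Lowest-ranked weight: w_n = 1 / (1 + sum_k prod_{i=k..n} r_i).
    total, prod = 0.0, 1.0
    for r in reversed(ratios):
        prod *= r
        total += prod
    weights = [0.0] * n
    weights[-1] = 1.0 / (1.0 + total)
    # Back-substitute upwards: w_{k} = r_{k} * w_{k+1}.
    for k in range(n - 2, -1, -1):
        weights[k] = ratios[k] * weights[k + 1]
    return weights

# e.g. three indicators with illustrative ratios 1.3 and 1.2 between ranks.
w = g1_weights([1.3, 1.2])
print([round(x, 3) for x in w], round(sum(w), 3))
```

The resulting weights sum to 1 and decrease with rank, matching the ordered importance of the indicators.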
After calculating the word feature weights, a BP neural network is used to learn the weight-assignment parameters: the word features serve as the network input, and a keyword indicator serves as the output (1 for a keyword, 0 otherwise). The main training parameters of the BP neural network are as follows: n_samples = 2000, noise = 0.4, random_state = None, max_epochs = 1000, learning rate = 0.035. The final values obtained for α, β, γ, and δ are 0.33, 0.35, 0.21, and 0.11, respectively.
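To illustrate the kind of network involved, the sketch below trains a tiny backpropagation network (numpy only) that maps four feature inputs (frequency, lexicality, length, semantics) to a keyword/non-keyword decision. The architecture, the synthetic data, and the labelling rule are simplified assumptions for illustration, not the paper's actual setup or its trained parameters.

```python
# Illustrative BP (backpropagation) network: 4 inputs -> 8 hidden -> 1 output.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data: 4 normalised features per word; the label is a hypothetical
# linear rule standing in for manual keyword annotation.
X = rng.random((200, 4))
y = (X @ np.array([0.33, 0.35, 0.21, 0.11]) > 0.5).astype(float).reshape(-1, 1)

W1 = rng.normal(scale=0.5, size=(4, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)
lr = 1.0

for _ in range(5000):                   # plain batch gradient descent
    h = sigmoid(X @ W1 + b1)            # forward pass, hidden layer
    out = sigmoid(h @ W2 + b2)          # forward pass, output layer
    d_out = out - y                     # sigmoid + cross-entropy gradient
    d_h = (d_out @ W2.T) * h * (1 - h)  # backpropagate to hidden layer
    W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(0)
    W1 -= lr * X.T @ d_h / len(X);   b1 -= lr * d_h.mean(0)

acc = float(((out > 0.5) == y).mean())  # training accuracy after fitting
print(round(acc, 2))
```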
3.3. Comparison of Results
3.3.1. Comparison of Results for Different Features
(1) Comparison of results for single features
The keyword extraction experiments under single-feature conditions not only allow the multi-feature and single-feature results to be compared, but also provide a preliminary indication of which feature contributes most to extraction accuracy. The feature parameter settings are shown in
Table 7, in which groups 1–4 give the parameter settings when the single feature is word frequency, lexicality, word length, and semantics, respectively, and group 9 gives the settings for the feature combination selected in this paper. The extraction results are shown in
Table 8 and
Figure 3, where (a), (b), and (c) represent the results for the
P-value, R-value, and F-value under different conditions, respectively.
As the figure shows, when the number of extracted keywords is 3, 5, 7, or 10, the extraction results under the multi-feature combination proposed in this paper are significantly better than those under any single feature. Comparing the individual features, lexicality gives the best results, followed by word frequency and then word length; their P-values, R-values, and F-values are significantly higher than those obtained with semantics as the single feature. This indicates that word frequency and lexicality have the greatest impact on extraction quality, word length has a relatively smaller impact, and the extraction effect is worst when the algorithm is improved with semantics alone; the small contribution of the semantic feature also corroborates the parameter-training results.
(2) Comparison of results for multiple features
The single-feature experiments show that lexicality has the greatest impact on the extraction results, followed by word frequency and then word length, while the semantic feature has the smallest impact. Therefore, the three most influential features (lexicality, word frequency, and word length) were selected for double-feature and triple-feature combination experiments, and the results were compared with those under the feature conditions of this paper. In the double-feature experiments, the feature weights were obtained using the sequence relationship analysis method: the importance ordering was lexicality > word frequency > word length, with importance ratios of 1:1.3 for lexicality/word frequency, 1:1.5 for lexicality/word length, and 1:1.2 for word frequency/word length; the feature weights were then calculated and input into the BP neural network to obtain the feature parameters. In the triple-feature experiment, the lexicality/word frequency/word length ratio was 1.3:1.2, and the weights were calculated in the same way as in the double-feature experiments. The specific parameter settings are shown in
Table 9, and the experimental results are shown in
Table 10 and
Figure 4, where the red curves represent the extraction results under the conditions of the feature parameters in this paper.
As the figure shows, the results under the feature combination of this paper are significantly better than those under the double- and triple-feature combinations, and the triple combination of word frequency, word length, and lexicality outperforms every pairwise combination. Comparing the double-feature combinations, the combination of word frequency and lexicality performs best.
These comparisons show that the P-value, R-value, and F-value under the feature conditions selected in this paper exceed those of the other feature combinations, indicating that the selected features are better suited to this text dataset.
3.3.2. Comparison of Results for Different Parameters
In order to illustrate the effect of keyword extraction under the parameter conditions of this paper, five groups of experiments were set up: group 1 treats the four features as equally important; group 5 uses the parameter combination obtained from training in this paper; and the remaining groups are fine-tuned versions of the trained parameters. The specific parameter settings are shown in
Table 11. The results for the
P-value, R-value, and F-value under different parameter conditions are shown in
Table 12 and
Figure 5, and the red curves in the figure represent the extraction results under the conditions of feature parameters in this paper.
As the figure shows, after fine-tuning each parameter around the trained values, the extraction effect is best under the parameter conditions of this paper, indicating that α = 0.33, β = 0.35, γ = 0.21, δ = 0.11 is the optimal parameter combination; with these values, the improved algorithm achieves its best extraction results.
3.3.3. Comparison of Results of Different Extraction Methods
In order to further verify the extraction effect of the IWF-TextRank algorithm, it was compared with the traditional TextRank algorithm and the TF-IDF algorithm; the extraction results are shown in
Table 13 and
Figure 6.
As the figure shows, when the number of extracted keywords is 3, 5, 7, or 10, the frequency-based TF-IDF algorithm performs relatively poorly compared with the other two methods; the traditional TextRank algorithm improves on TF-IDF to some extent, and the IWF-TextRank algorithm performs significantly better than both. Specifically, when the number of keywords is 10, IWF-TextRank improves precision by 10.06% and recall by 7.16% over TextRank. These improvements were validated by a statistical significance test (p < 0.05), indicating that they are statistically robust rather than due to random variation.
The integration of semantic features with the BP neural network plays a pivotal role in enhancing the algorithm’s performance. By capturing deeper contextual relationships between terms, the BP neural network optimises the weight distribution of features, leading to the more accurate identification of domain-specific keywords. Furthermore, the improvement in the F-value demonstrates that the IWF-TextRank algorithm effectively balances the inherent trade-off between precision and recall, resulting in a more comprehensive and reliable keyword extraction process. This balance is particularly advantageous for traffic accident data analysis, where the precise extraction of relevant keywords is critical for accurate summarisation and further analytical tasks.
4. Discussion
The proposed IWF-TextRank algorithm demonstrates significant improvements in keyword extraction for traffic accident texts, primarily due to three key enhancements: multi-feature weighting, BP neural network optimisation, and semantic enhancement through the CBOW model. Each of these factors is summarised below with supporting quantitative evidence.
- 1.
Multi-Feature Weighting
Unlike the traditional TextRank, which initially treats all words as equally important, IWF-TextRank incorporates additional features such as word frequency, lexicality, and word length to weight each candidate keyword more comprehensively. The experimental results indicate that including these features improves both the precision and recall of keyword extraction by approximately 7%, in particular by accurately capturing the relevance of nouns and verbs in accident descriptions. The word-length feature additionally helps the model extract longer keywords, further improving the coverage of key information.
- 2.
Optimisation through BP Neural Network
The BP neural network automatically balances the weights for word frequency, part of speech, and word length, eliminating subjective manual adjustments. The cross-validation results show that the BP optimisation enhances extraction accuracy with statistical significance (p < 0.05), indicating the network’s effective contribution to model stability and precision in identifying critical accident-related keywords.
- 3.
Semantic Enhancement via the CBOW Model
Integrating the CBOW model from Word2Vec allows the algorithm to capture deeper semantic relationships between words.
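One common way such word vectors feed a graph algorithm is as a semantic edge weight: the cosine similarity between two word vectors scores how related the words are. The sketch below uses tiny made-up 4-dimensional vectors as stand-ins for real CBOW embeddings, which would typically have hundreds of dimensions.

```python
# Cosine similarity between (hypothetical) word vectors as a semantic score.
import numpy as np

vectors = {                      # made-up stand-ins for CBOW embeddings
    "rain":  np.array([0.9, 0.1, 0.3, 0.0]),
    "wet":   np.array([0.8, 0.2, 0.4, 0.1]),
    "truck": np.array([0.0, 0.9, 0.1, 0.8]),
}

def semantic_similarity(w1, w2):
    a, b = vectors[w1], vectors[w2]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically related words score higher than unrelated ones.
print(semantic_similarity("rain", "wet") > semantic_similarity("rain", "truck"))
```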
Despite the advantages of IWF-TextRank in keyword extraction, certain limitations remain. First, the dataset is limited to traffic accident records from a single city in 2020, potentially limiting the model’s generalisability across regions and timeframes. Second, while BP neural network optimisation performs well on small datasets, it may require additional tuning for larger datasets. Finally, although the CBOW model improves semantic understanding, it may face challenges with longer or multi-topic texts.
5. Conclusions
This study presents a new keyword extraction algorithm, IWF-TextRank, which combines multi-feature weighting, BP neural network optimisation, and the CBOW model from Word2Vec to achieve enhanced performance in domain-specific datasets like traffic accident reports. The main contributions of this work are as follows:
A multi-feature weighting mechanism incorporating word frequency, lexicality, and word length, enabling more accurate identification of core keywords;
BP neural network optimisation for automated weight distribution, reducing subjectivity and improving stability and precision in keyword extraction;
Integration of the CBOW model, enhancing the model’s semantic understanding and enabling deeper contextual associations for more accurate keyword identification.
These contributions result in significant improvements in keyword extraction accuracy and robustness within traffic accident analysis, providing a useful framework for similar applications.
Future work could involve expanding the dataset to other regions and years to validate the robustness of IWF-TextRank, combining it with more advanced deep learning models such as Transformers for complex text analysis, and exploring its potential in other specialised fields (e.g., medical or legal documents). Further research could also investigate automated parameter adjustments to improve adaptability across diverse datasets.