1. Introduction
The provision of software and system security concerns both organizations and software users, as successful security breaches lead to severe incidents. Incidents of this nature firstly endanger both user and system data and secondly grant improper access to unauthorized areas. As a result, evaluating software security constitutes a practical method for assessing existing threats, along with potential countermeasures, and is widely studied from different perspectives [1]. One of the main perspectives is related to security vulnerabilities, described as weaknesses that are exploitable by potential cybersecurity attacks and threaten one or multiple components of a system [2]. Even though the National Vulnerability Database (NVD) (https://nvd.nist.gov/ (accessed on 11 July 2023)), which cooperates with Common Vulnerabilities and Exposures (CVE) (https://cve.mitre.org/ (accessed on 11 July 2023)), summarizes information related to disclosed vulnerabilities, the prediction and assessment of actual exploits are of great importance since many disclosed vulnerabilities are never exploited [3,4].
Consequently, prior studies focus on exploring the exploitability of security vulnerabilities by analyzing information from official databases, such as the NVD and ExploitDB (https://www.exploit-db.com/ (accessed on 11 July 2023)) [3,4,5,6,7,8,9], and attack signatures retrieved from intrusion detection systems [4,5,8,9]. In addition, technical descriptions that present an Exploit Proof of Concept (PoC), i.e., a demonstration of a concept that can lead to vulnerability exploitation, combined with clues provided by social media [9], online forums, platforms [4,5,8], and advisories [9], have contributed to prior research as well. Overall, these studies mostly leverage textual information and vulnerability characteristics to train models that classify exploitable and non-exploitable observations.
The process of studying and predicting the exploitability of security vulnerabilities is directly associated with the general field of vulnerability-patch prioritization [5]. The main goal of vulnerability prioritization is to assist organizations in maintaining system security and avoiding severe incidents by prioritizing the remediation of the riskiest security threats, i.e., providing patches to fix the vulnerabilities that are most likely to lead to actual exploits [3]. While the Common Vulnerability Scoring System (CVSS) (https://nvd.nist.gov/vuln-metrics/cvss (accessed on 11 July 2023)), which measures access and impact metrics, is a standard formula for assessing the severity of security vulnerabilities [2,5], other approaches and measures are commonly utilized as well. In particular, exploit indicators are suggested as a more appropriate choice for establishing vulnerability prioritization models, as they constitute a more direct approach to identifying the vulnerabilities that are responsible for attacks against critical infrastructures [1].
The numerous text mining techniques applied in vulnerability prediction range from document representations that rely on word tokens to more advanced approaches based on artificial neural networks [10,11]. While these techniques generally produce accurate results when combined with machine learning algorithms, e.g., Random Forest (RF) and Support Vector Machines (SVMs), they do not take into consideration the potential association between the concepts included in vulnerability descriptions and the likelihood of exploitation.
To address this research gap, in the current study, we focus on extracting topics from the textual content of vulnerability descriptions, via word clustering, to investigate which weaknesses are strongly associated with recent exploits and are most likely to be exploited in the future. The main contribution of this study to vulnerability prioritization and the existing research is the topic analysis of the weaknesses and products that are related to frequent recent exploits. Further, the proposed framework, which can be easily reproduced and trained for different periods, constitutes a key point in the effectiveness of the approach as well. A characteristic that discriminates the proposed framework from existing methodologies is the exclusive use of the topic distributions assigned to each new observation, which provide unambiguous justifications regarding the exploitability of new vulnerabilities. Moreover, while the majority of the existing approaches focus on multiple characteristics that are evaluated after a significant amount of time, our framework enables the early proactive identification of severe threats by requiring only a representative description per vulnerability.
To provide a complete framework that identifies the main concepts and assesses the likelihood of exploitation of a new vulnerability, we also make use of the topic memberships of the most recent records to train classification models. Our goal is to predict whether a newly disclosed vulnerability will be exploited or not. In this regard, we collect entries from the NVD data feeds to take advantage of the vulnerability descriptions and the available external references. These properties help us extract topics and determine the existence of an Exploit indicator for each record, respectively. We note that this indicator constitutes a binary class (in our study) that characterizes each vulnerability as exploitable or not.
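As an illustration of how such a binary indicator can be derived, the following Python sketch scans an NVD JSON 1.1 data feed and flags a record as exploitable when any of its external references carries the "Exploit" tag. The file name follows the public NVD feed naming convention, and the snippet is a simplified reading of this labeling step rather than its exact implementation.

```python
import json

def load_exploit_labels(feed_path: str) -> dict:
    """Map each CVE ID in an NVD JSON 1.1 feed to a binary Exploit indicator."""
    with open(feed_path, encoding="utf-8") as f:
        feed = json.load(f)

    labels = {}
    for item in feed["CVE_Items"]:
        cve_id = item["cve"]["CVE_data_meta"]["ID"]
        refs = item["cve"]["references"]["reference_data"]
        # A record is treated as exploitable if any reference is tagged "Exploit".
        labels[cve_id] = int(any("Exploit" in ref.get("tags", []) for ref in refs))
    return labels

# Example usage (assumes the 2022 feed has been downloaded and unzipped):
# labels = load_exploit_labels("nvdcve-1.1-2022.json")
```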
In the proposed framework, we initially make use of the unsupervised Global Vectors (GloVe) algorithm [12] for efficient word representations, which is a widely employed approach in similar studies with promising performance in document classification tasks [13,14]. In the next step, we apply the Uniform Manifold Approximation and Projection (UMAP) [15] dimensionality reduction technique to construct dense areas and project the extracted word representations into a two-dimensional space [16]. In the related literature, UMAP is proposed as an algorithm that enables clustering algorithms to identify coherent clusters of word and document vectors [16,17,18]. In this projected vector space, we apply the standard Fuzzy K-Means (FKM) algorithm, which is a soft clustering approach [19], to assign cluster memberships to the identified keywords and, later, to each vulnerability description. The findings from other related studies indicate that FKM is an effective approach for document classification tasks, providing higher topic coherence and accuracy than topic modeling algorithms in some cases as well [20,21]. Finally, we employ a machine learning approach [22] to train classification models by leveraging the posterior document properties produced by the proposed topic extraction approach. To evaluate the performance of the proposed framework, we also train several models based on two topic modeling algorithms, namely Latent Dirichlet Allocation (LDA) [23] and Correlated Topic Models (CTM) [24]. The results derived from our experiments show that the proposed approach extracts interpretable topics while obtaining higher predictive capabilities than the topic modeling techniques.
The following sections of this study are organized as follows: In Section 2, we present an overview of the most related studies by discussing practical implications, outcomes, and the novel methodologies that were employed. Section 3 presents the methodology of this study, which is oriented toward providing practical solutions and insights concerning our two Research Questions. The results of this study are demonstrated in Section 4 with respect to our two Research Questions and the practical implications of the proposed framework. Section 5 includes the discussion of our findings, while Section 6 presents the potential limitations of this study. Finally, Section 7 concludes the paper, and Section 8 presents several recommendations for future work. All major symbols used in this study are described in Appendix A.
3. Methodology
In this section, we present in detail the proposed approach along with its components, which serve two main tasks: Topic Extraction (Section 3.2) and Classification Models (Section 3.3). In the first phase, we propose a methodology aiming at the extraction of the main themes derived from vulnerability descriptions, contributing, in turn, to the identification of the more or less exploitable topics. The second phase includes all the necessary procedures that we follow in order to train classification models based on the topics extracted in the previous phase. The main parts of these two phases are summarized in Figure 1.
Briefly, we first collected the publicly available CVE data feeds from the NVD and applied the necessary procedures to clean and form the datasets of this study (Data Collection and Preprocessing). Next, we deployed Natural Language Processing (NLP) techniques to transform the retrieved descriptions into suitable data structures for the later steps of our framework (Text Preprocessing). We then utilized the processed descriptions and propose an approach for topic extraction based on GloVe, UMAP, and FKM (Keyword Clustering), while also investigating its effectiveness by training models with two baseline topic modeling approaches to compare with (Topic Modeling). Finally, we assigned posterior cluster memberships to the documents to clarify the more or less exploitable topics through the coefficients extracted from a Generalized Linear Model (GLM) [58].
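To make the last step concrete, the sketch below fits a binomial GLM (logistic regression) on topic membership features using statsmodels. The feature matrix and labels are randomly generated placeholders, and the snippet illustrates the kind of coefficient-based reasoning applied here rather than the exact model specification of this study.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical inputs: one row per vulnerability with 24 topic memberships
# (each row sums to one) and a placeholder binary Exploit indicator.
rng = np.random.default_rng(0)
topics = pd.DataFrame(rng.dirichlet(np.ones(24), size=500),
                      columns=[f"topic_{i}" for i in range(24)])
y = rng.integers(0, 2, size=500)

# Drop one topic as the baseline: memberships sum to one, so keeping all
# 24 columns would be perfectly collinear with the intercept.
X = sm.add_constant(topics.iloc[:, :-1])
glm = sm.GLM(y, X, family=sm.families.Binomial()).fit()

# exp(coefficient) is the multiplicative change in the odds of exploitation
# per unit increase in a topic's membership (relative to the baseline topic).
odds_ratios = np.exp(glm.params).sort_values(ascending=False)
print(odds_ratios.head())
```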
Moreover, to fit classification models based on the cluster and topic memberships extracted in the previous task, we initially applied a data oversampling method to establish a balanced training dataset (Data Oversampling). Next, we selected two machine learning algorithms to fit classification models (Model Selection and Tuning). Finally, we calculated multiple performance metrics to evaluate the fitness of the trained models (Performance Evaluation) and the effectiveness of the proposed approach in predicting exploits.
The general motivation behind our study is to address the challenge of distinguishing exploitable types of vulnerabilities and products, and to provide information on emerging threats. In addition, we further aim to propose a complete framework that can be used for other similar tasks as well, i.e., datasets that contain textual information, which is associated with scores or classes, e.g., user reviews. Overall, the proposed methodology and the corresponding findings of this study aim to answer the following Research Questions (RQs):
RQ1: Which topics of security vulnerabilities are frequently associated with recent exploits? This RQ is dedicated to providing answers on the potential relationships between the textual descriptions, expressed by topics, and the exploitability of vulnerabilities as indicated by the recent Exploit indicators (2022 records). Hence, we developed a framework that assigns a specific mixture of topics and an exploitability indicator (ranging from 0 to 1) to security vulnerabilities, as estimated by a trained GLM. The respective findings reveal specific characteristics that are frequently associated with exploits, while the framework can be used as a basis in vulnerability prioritization by explaining the exploitability of future vulnerabilities.
RQ2: Can textual topics predict/explain the exploitability of security vulnerabilities? Although RQ1 helps us assign exploitability indicators to security vulnerabilities, each threat is usually characterized as exploitable or non-exploitable with a binary class rather than a probability. In this RQ, we aim at training classification models to predict whether a vulnerability is linked with an Exploit indicator or not. The motivation behind this goal is to determine whether the textual descriptions and the proposed approach can establish effective classification models for future use. Also, through the answers provided for this RQ, we aim at strengthening our findings concerning the mutual characteristics of the vulnerabilities that are exploited.
3.4. Summary
To provide a comprehensive summary of our methodology, in this section, we describe the whole framework in inclusive and concise steps. Our goal is to guide future research in reproducing the experiments of this study or in following distinct parts of the proposed framework that might be useful for similar tasks.
The first step is to collect all the necessary information concerning the descriptions and the exploitability of vulnerabilities through appropriate data sources, some of which are discussed in Section 2.1. In our case, we explored the NVD to find the appropriate information concerning the descriptions and the exploitability of security vulnerabilities. Also, some preprocessing procedures are necessary to transform the initial dataset into the appropriate data structures, as indicated by the employed algorithms. These steps are summarized as follows (a preprocessing sketch is given after this list):
Retrieve information related to vulnerability descriptions and exploitability indicators;
Apply cleaning and preprocessing procedures to the initial datasets;
Define datasets for topic extraction (2015–2021 data feeds in our case) and classification models (2022 data feeds in our case);
Employ NLP techniques to establish the corresponding corpora;
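As a minimal illustration of the text preprocessing step, the sketch below lowercases each description, strips non-alphabetic characters, and removes a small stopword list before building the vocabulary. The stopword set and tokenization rules are simplified assumptions, not the exact configuration of this study.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "in", "on", "of", "to", "and", "or",
             "is", "are", "via", "by", "for", "with", "that"}  # illustrative subset

def preprocess(description: str) -> list[str]:
    """Tokenize a vulnerability description into cleaned keywords."""
    tokens = re.findall(r"[a-z]+", description.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

docs = ["SQL injection vulnerability in the admin panel allows remote attackers ..."]
corpus = [preprocess(d) for d in docs]
vocabulary = Counter(tok for doc in corpus for tok in doc)
```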
The following phase includes all the necessary procedures that should be followed to project word vectors into a low-dimensional vector space. This structure is used to assign cluster memberships to keywords, and later to documents, through the FKM algorithm. To do so, the following procedures shall be followed (see the sketch after this list):
Use the corpus to train word embeddings using the GloVe algorithm;
Employ UMAP to project these word embeddings into a low-dimensional space;
Pipeline the outcomes of UMAP into the FKM algorithm to extract the cluster memberships of the keywords;
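The following sketch illustrates these three steps with umap-learn and scikit-fuzzy, whose fuzzy c-means routine stands in for the FKM step. It assumes pretrained GloVe vectors are loaded from a text file in the standard GloVe format (training GloVe from scratch is omitted); the file name and parameter values are placeholders rather than the tuned settings of this study.

```python
import numpy as np
import umap                      # umap-learn
import skfuzzy as fuzz           # scikit-fuzzy

# 1. Load pretrained GloVe word vectors (each line: word, then components).
def load_glove(path: str) -> tuple[list[str], np.ndarray]:
    words, vecs = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vecs.append(np.asarray(parts[1:], dtype=np.float32))
    return words, np.vstack(vecs)

words, vectors = load_glove("glove_vulnerability_corpus.txt")  # hypothetical file

# 2. Project the embeddings into two dimensions with UMAP.
projected = umap.UMAP(n_components=2, random_state=42).fit_transform(vectors)

# 3. Cluster the 2D keyword vectors with fuzzy c-means; scikit-fuzzy expects
#    the data as (features, samples), and `u` holds the soft memberships.
centers, u, *_ = fuzz.cluster.cmeans(
    projected.T, c=24, m=2.0, error=1e-5, maxiter=1000, seed=42)
keyword_memberships = u.T  # shape: (n_keywords, n_clusters)
```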
In turn, the same corpus is used to train the topic modeling algorithms and to assign cluster memberships to the documents using both the keyword memberships and the term frequencies, as denoted in Equation (3). Also, appropriate measures are used to evaluate each technique. Hence, the distinct steps are the following (a sketch of the document-membership computation follows this list):
Use the corpus to train the topic models;
Calculate the document memberships using the keyword memberships and term frequencies;
Evaluate the topic coherence of the topic and cluster models;
Train models for different numbers of topics, as indicated by the highest coherence of each algorithm (in our case, 24 for the proposed framework, 21 for LDA, and 10 for CTM);
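One plausible reading of the document-membership step, consistent with the role of Equation (3), is to compute each document's cluster membership as the term-frequency-weighted average of the memberships of its keywords. The sketch below implements that reading and reuses names from the previous snippets, so it is an interpretation rather than the paper's exact formula.

```python
import numpy as np

word_index = {w: i for i, w in enumerate(words)}

def document_membership(tokens: list[str]) -> np.ndarray:
    """Average the cluster memberships of a document's keywords; repeated
    tokens contribute repeatedly, which amounts to term-frequency weighting."""
    rows = [keyword_memberships[word_index[t]] for t in tokens if t in word_index]
    if not rows:
        return np.full(keyword_memberships.shape[1], np.nan)
    membership = np.mean(rows, axis=0)
    return membership / membership.sum()   # renormalize to a distribution

doc_topics = np.vstack([document_membership(doc) for doc in corpus])
```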
By selecting the optimal model, as trained using the proposed framework, the next step is to assign a comprehensive title that describes the concepts of each cluster. Also, these properties are used to explore the potential effects of the documents' cluster memberships on the exploitability of vulnerabilities. These processes are summarized as follows:
Provide a topic title for each cluster using the top keywords and some representative descriptions;
Estimate coefficients that assess the potential effects of each cluster on the exploitability indicators by employing a GLM;
Identify exploitable weaknesses and products to assist vulnerability prioritization based on the highest coefficients of the GLM;
It should be noted that the proposed framework can be adapted to other similar tasks that involve textual information and relevant scores, e.g., user reviews. We should also clarify that, in our study, the topic extraction models were trained using the first dataset (2015–2021 data feeds), while the GLM was fitted using the second dataset (2022 data feeds). Moreover, the second dataset was also used to train the machine learning models. All the necessary steps that should be applied to train and evaluate the machine learning models are as follows (a sketch of the oversampling and tuning steps follows this list):
Split the dataset into training (70%) and testing (30%) subsets;
Balance the training dataset by employing an oversampling algorithm—in our case, the Adaptive Synthetic (ADASYN) oversampling algorithm was employed;
Select machine learning algorithms (in our case, two were selected);
Apply a strategy that combines 10-fold cross-validation and grid search, using the training dataset, to tune the parameters of each algorithm for every set of inputs;
Select the best parameter combinations based on the average accuracy of the respective models in the 10-fold cross-validation process;
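These steps map directly onto standard scikit-learn and imbalanced-learn utilities, as sketched below. The Random Forest parameter grid is a hypothetical example, not the actual search space of this study, and the input names reuse earlier snippets (labels is assumed to be the binary Exploit vector aligned with doc_topics).

```python
from imblearn.over_sampling import ADASYN
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Split the topic features and placeholder Exploit labels 70/30.
X_train, X_test, y_train, y_test = train_test_split(
    doc_topics, labels, test_size=0.30, random_state=42, stratify=labels)

# Balance the training set only, so the test set stays untouched.
X_bal, y_bal = ADASYN(random_state=42).fit_resample(X_train, y_train)

# Tune one of the selected algorithms with 10-fold CV and grid search.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    cv=10, scoring="accuracy")
grid.fit(X_bal, y_bal)
best_model = grid.best_estimator_
```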
The previous steps are applied to select the “best” machine learning models for each topic extraction algorithm (3 in our case) and machine learning algorithm (2 in our case), as well as for the three different topic parameters (3 numbers of topics in our case). These models are finally evaluated under five performance metrics that help us provide insights concerning the inclusiveness of vulnerability descriptions and the effectiveness of the proposed framework; the metrics can be computed as sketched below.
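For completeness, a minimal evaluation sketch with scikit-learn follows. We assume, purely for illustration, a typical set of five metrics (accuracy, precision, recall, F1-score, and AUC), which may differ from the exact metrics reported in the paper.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]

metrics = {
    "accuracy":  accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall":    recall_score(y_test, y_pred),
    "f1":        f1_score(y_test, y_pred),
    "auc":       roc_auc_score(y_test, y_prob),
}
print(metrics)
```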
5. Discussion
To recap, the proposed framework extracts clusters from a large corpus that can effectively capture keyword relations and unveil coherent topics, making this approach a productive option for future research attempts. Also, in contrast to most approaches of this nature, it offers visualization capabilities and exports outcomes significantly faster. During the experiments of this study, we noticed that training all the examined models with the proposed framework finished significantly faster than with the two topic modeling algorithms. However, we cannot determine which algorithm has the lowest time complexity/cost, as the running time of each algorithm strongly depends on the initial model parameters and stopping criteria, which bias the process by affecting the number of iterations.
Regarding the space and time costs of the proposed method, firstly, the cost of the GloVe algorithm scales approximately as O(|C|^0.8) [12], as it depends on the keyword co-occurrence statistics of the corpus C. Moreover, the respective cost of UMAP is empirically approximately O(d · N^1.14) [15], where d denotes the number of dimensions of the output vectors; in our case, it was set to two. Finally, the cost of the FKM algorithm is O(N · K^2 · d) per iteration [71], where K is the number of clusters.
Fitting a topic modeling algorithm is an iterative process in which the linkage strength between every single keyword included in the documents and every predefined topic is re-evaluated continuously. Thus, the complexity of the two topic modeling algorithms can be considered as O(K · W) per iteration [23], where W denotes the total number of word tokens in the corpus. By inspecting the costs of these two different approaches, we can conclude that FKM requires fewer memory resources and probably fewer computations per iteration than the two topic modeling algorithms.
Overall, and most importantly, we should mention that most of the topics extracted by the proposed framework are matched with basic identifiers and descriptions as well as with some software products and vendors. Also, by considering the findings discussed in the previous section, we reach a point where the topic properties discovered by the framework can both effectively characterize new entries and predict their exploitability, addressing both RQ1 and RQ2. Moreover, by proposing a new approach that performs well for both topic extraction and classification purposes, we strengthen the accuracy and significance of the findings that address the posed tasks of this study.
The outcomes of this study provide both practical and informative knowledge, as we both reveal topic details of security vulnerabilities and present a new structured framework. Briefly, this framework combines, in a serialized way, several methodologies that are utilized in both general tasks and text mining approaches. Undoubtedly, both UMAP and FKM are algorithms that provide sufficient results in various tasks, while GloVe constitutes a validated approach for projecting keywords into vector spaces. These three algorithms offer analogous qualities to the proposed framework in the topic extraction stage. In particular, GloVe produces a multidimensional vector space that reflects the relations between the collected keywords, while UMAP filters this vector space with the ultimate goal of gathering and spreading the keywords appropriately. Finally, FKM identifies the suitable clusters included in the resulting 2D vector space, helping us discover and interpret intelligible topics as well as assign topic distributions over the documents of the corpus, i.e., the descriptions of security vulnerabilities.
By analyzing the performance of the three examined approaches, in terms of topic coherence and classification power, we cannot determine with high certainty whether one of these approaches overshadows the rest. However, we can clarify that one of the two topic modeling algorithms provided lower predictive power than the other approaches and that the proposed framework provided the highest performance in relatively many cases. In more detail, we evaluated nine models overall, three for each algorithm, where the three models trained with the proposed framework were evaluated as the most coherent ones. Compared with the highest evaluations obtained using the two topic modeling algorithms, the proposed framework improves the best solution by 20% for 10 topics, 45% for 21 topics, and 55% for 24 topics. Also, the proposed framework provides the best evaluations eight times out of fifteen compared with the other classification models trained under the same number of topics, while three out of the five metrics are maximized for the model trained for 24 topics using the proposed framework.
Regarding the exploitability of security vulnerabilities, we reveal that vulnerabilities related to specific types of weaknesses or products are more likely to be exploited. To be more precise, the information extracted from the recent vulnerability records indicates that the most exploitable types of weaknesses are SQL injection along with buffer and stack overflow. At the same time, we discovered that security vulnerabilities associated with WordPress plugins and the TensorFlow machine learning platform tend to be more exploitable than those associated with other products. Hence, we propose that the aforementioned types of vulnerabilities and products should be considered the main priorities of users and experts in terms of avoiding potential security breaches as well as maintaining the security of applications and systems. The final GLM shows that only three out of the twenty-four topics are not assigned statistically significant coefficient estimates, which means that the extracted topics offer sufficient information on the exploitability of security vulnerabilities. The coefficients of the GLM show that SQL injection constitutes the most severe topic, as the odds of exploitation have an estimated relative increment of more than 400 for a unit increment in the topic membership. At the same time, buffer/stack overflows, WordPress plugins, and the TensorFlow platform are linked with estimated relative increments to the odds of approximately 172, 32, and 57, respectively.
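To make the odds interpretation concrete, under a binomial GLM with a logit link, a coefficient β_k for topic k implies that a unit increase in that topic's membership multiplies the odds of exploitation by e^(β_k), i.e., odds(x_k + 1)/odds(x_k) = e^(β_k). For instance, a relative increment of roughly 400 to the odds corresponds to a coefficient of about ln(400) ≈ 5.99; this value is used here purely as an illustrative back-calculation, not as a figure reported by the model.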
Overall, these estimations reveal that the majority of the extracted topics are associated with a small proportion of exploitable vulnerabilities, as only five of them produce positive and significant coefficients. In addition, our experiments show that only twelve topics are associated with significantly low estimates. Therefore, we suggest that the vulnerabilities associated with the remaining topics should still be treated as potentially severe threats, since the available technologies and capabilities evolve and change over time.
6. Threats to Validity
Although we believe that the results of this study are promising, additional experiments could further boost the performance and validate the effectiveness of our framework. The issues we encountered are linked to both the topic extraction and classification tasks; they generally concern model tuning, the exploration of additional algorithms, performance evaluation, results interpretation, and data collection. As a result, the main drawbacks and limitations concern both the internal and external validity of this study.
Regarding the internal validity, we first introduced some methodologies for data collection, cleaning, and preprocessing that contain some manually treated steps, especially the ones related to the annotation of the main class. Although the related literature proposes several approaches and data sources to characterize each vulnerability as exploitable or not, many of the respective studies propose the NVD as one of the primary sources for retrieving information about security vulnerabilities. In addition, the evaluation of the clustering and topic models was completed using a single measurement, meaning that other characteristics and aspects of these models were not examined. To address this issue, we made sure that the extracted topics were interpretable and validated against the descriptions and properties of official vulnerability records; further, we employed visualization techniques that provided additional information. We also made sure that the employed evaluation metric, i.e., topic coherence, is widely applied and supported by the related literature. Finally, the tuning procedures of the classification models are a potential threat in many tasks similar to the ones included in this study, as there is not yet an overall optimal approach to deal with this challenge. For this reason, we employed a validation strategy that explores the performance of classification models under different combinations of parameters to select the most productive ones from a large pool of models.
The external validity of this study concerns both the generalization of our findings and the effectiveness of the proposed framework on different data structures. First of all, to generalize our findings on the most exploitable threats and products, it is necessary to retrieve and analyze collective knowledge from multiple sources, which are possibly not accessible for security reasons, as a single source may not cover every relevant event, clue, or proof. Nonetheless, by evaluating the available information from the NVD, we succeed in including valuable information from multiple data sources, since the external references of the NVD are linked to many valuable security-oriented websites, including security advisories and ExploitDB. Moreover, apart from security vulnerabilities, a different aspect of cybersecurity concerns the malware products that are utilized to perform a series of actions in an affected system. However, security vulnerabilities are usually associated with this aspect, as malware products aim to exploit specific weaknesses in order to gain access to a system. Therefore, we believe that a valuable perspective of computer security is addressed in this study, as we reveal the current primary threats related to security vulnerabilities. Finally, we believe that the effectiveness of the proposed framework should be investigated on different datasets before accepting this option as a productive alternative to topic models. To mitigate threats of this nature, in this study, we decided to employ machine learning techniques that were previously proposed as practical options in various text mining tasks.