A Heterogeneous Machine Learning Ensemble Framework for Malicious Webpage Detection

Shin, Sam-Shin; Ji, Seung-Goo; Hong, Sung-Sam

doi:10.3390/app122312070

by Sam-Shin Shin ¹, Seung-Goo Ji ¹ and Sung-Sam Hong ²,*

¹ Internet Incident Response Technology Team, Korea Internet & Security Agency, Naju 58324, Republic of Korea
² Department of Multimedia Contents, Jangan University, Hwaseong 18331, Republic of Korea
* Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(23), 12070; https://doi.org/10.3390/app122312070

Submission received: 9 October 2022 / Revised: 20 November 2022 / Accepted: 22 November 2022 / Published: 25 November 2022

(This article belongs to the Special Issue AI for Cybersecurity)

Abstract:

The growing dependence on digital systems has heightened the risks posed by cybersecurity threats. This paper proposes a new method for detecting malicious webpages among several adversary activities. As shown in previous studies, malicious URL detection performance is significantly affected by the features of the learning dataset. The overall performance of different machine learning models varies depending on the data features, and relying on a single model is not desirable in every environment. To address these limitations, we propose an ensemble approach using different machine learning models. Our proposed method outperforms the existing single model by 6%, allowing for the detection of an additional 141 malicious URLs. In this study, repetitive tasks are automated, improving the performance of different machine learning models. In addition, the proposed framework builds an advanced feature set based on URL and web content and includes the most optimized detection model structure. The proposed technology can contribute to research on automated technology for the detection of malicious websites, such as phishing websites and malicious code distribution websites.

Keywords:

security; malicious URL detection; machine learning; ensemble learning; artificial intelligence

1. Introduction

Today’s advanced IT technologies call for additional attention to the expansion of interactions between user applications and critical infrastructure in fields such as communication, finance, defense, and education. In many critical infrastructure sectors, virtualization of information assets has broadened the possible entry points for cyber criminals and attackers. Cyber crimes involve a variety of tactics, including online fraud, system cracking, phishing attacks, DNS poisoning, malicious software attacks, data theft, spam, scam, and blackmail. The perpetrator often uses a phishing website to disguise itself as a legitimate URL and gain access to the targeted system [1].

In recent years, the detection of malicious web links has become a vital security issue. Google, a globally used web search engine, discovers 10,000 new malicious websites daily [2]. Malicious web-based attacks include phishing, drive-by-download, spam, click-jacking, plug-in and script-enabled, and malvertising attacks [3]. Phishing attacks generally occur in the form of fraudulent messages that appear to have come from a legitimate source [4], whereas drive-by-download attacks exploit security flaws in plug-ins that extend the browser’s functionality [5].

Singhal et al. [6] classified URLs into malicious and benign categories based on URL information and extracted 14 features by defining lexical-, host-, and content-based features. Bhoj et al. [7] proposed a feature set that can create a balanced dataset to improve malicious web detection performance. Chaiban et al. [8] proposed a method to secure the reliability of a dataset by collecting various features for malicious web detection and establishing a pipeline for data access, collection, and processing. In that study, model performance was improved through dimensionality reduction, applying principal component analysis (PCA) and Chi-square testing to the generated features.

One general approach to deter such activities is a URL blacklist, which is a collection of websites that have engaged in malicious or suspicious behaviors. Because of human feedback, this safeguarding technique is highly accurate. However, it is still unable to cover all categories in a constantly changing online environment [9]. To address the shortcomings of the URL blacklist approach, cybersecurity experts have suggested machine learning for malicious URL detection, which treats detection as a classification problem. The classification model is based on discriminative rules or features: it distinguishes malicious from benign URLs by extracting features and allowing the machine to learn them. In this process, discriminative rules or feature selection play a crucial role, helping to identify effective features that can characterize malicious websites. Most existing studies simply develop features based on the URL or web content and detect malicious pages with a machine learning model. These studies are mainly related to phishing site detection [10]. Invernizzi et al. [11] proposed an effective system to detect infections, called Nazca, which detects web requests used for malicious code binary downloads and is designed to work with large-scale networks such as those of internet service providers (ISPs).

In this paper, we present an extensive survey of malicious website features and an ensemble machine learning model for malicious web detection. The motivation of this research study includes the following:

To propose an advanced machine learning model that can detect and predict the distribution of malicious codes based on URLs and web content without network data analysis.
To predict the risk of distribution by detecting malicious code distribution websites, so that cyber threats can be anticipated in advance rather than merely detecting malicious webpages.

Because their patterns change over time, the information from malicious websites is complex and can be utilized by combining different features. However, the performance of machine learning models differs according to the feature selection techniques. We propose an improved feature set to improve the web detection performance using the collected dataset and the feature analysis used in previous studies. Additionally, a module that automatically generates this feature set is included in the framework. In numerous studies, machine learning models, such as support vector machines (SVM), decision trees, and random forests, have already proved their merit. However, machine learning models may provide different results based on generalization and other feature combinations. Thus, we propose an ensemble machine learning method that offers the best performance. In particular, the contributions of this research study include the following:

Defining an advanced feature set based on URL and web content and including the most optimized detection model structure.
Providing an improved malicious web detection framework with high accuracy through ensemble techniques based on six different machine learning sub-models.
Providing the performance comparison results of various machine learning models for malicious web detection.
Providing an automated technology for the detection of malicious webpages, such as phishing and malicious code distribution via the web, in the cybersecurity field.
Reducing cyber security damage by predicting the location of malicious code distribution in advance.

The remainder of this paper is organized as follows: In Section 1, we briefly discuss issues related to malicious webpages. Section 2 presents a survey of related works. In Section 3, we describe the methodology for designing ensemble machine learning techniques, including functions and processes. In Section 4, we present the findings of ensemble machine learning testing, including the datasets, as well as a comparison of single and ensemble models. In Section 5, we provide our conclusions and discuss future scope of the research.

2. Related Work

2.1. Heuristic-Based Malicious Website Detection

Heuristic-based techniques use an algorithm to generate matches from a database after scanning suspicious webpages. In this approach, blacklisting is the most widely employed practice. The database contains profiles of known malicious websites, such as URLs, IP addresses, and domains. If a newly added URL matches a known malicious URL listed on the blacklist, it is deemed malicious [12]. The heuristic method is implemented using a webpage execution dynamics analysis, identifying any signature of malicious activities, such as abnormal process generation and repetitive redirection. However, this tedious mechanism requires everyday access to and analysis of each website. Because a large number of new URLs are generated each day, it is impracticable to maintain a valid blacklist. Nevertheless, its simplicity and efficiency are sufficient to overcome its inherent limitations, and the heuristic method is widely used in antivirus systems.
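As a rough illustration of the blacklist matching described above, the following minimal Python sketch checks a URL's host against a small in-memory list. The set names, the function `is_blacklisted`, and the listed hosts are hypothetical placeholders (reserved `.example` names and a documentation IP range), not entries from any real blacklist.

```python
from urllib.parse import urlparse

# Hypothetical blacklist entries, for illustration only.
BLACKLISTED_HOSTS = {"malware.example", "phish.example"}
BLACKLISTED_IPS = {"203.0.113.7"}


def is_blacklisted(url: str) -> bool:
    """Return True if the URL's host matches a known blacklist entry."""
    # urlparse lowercases the hostname; fall back to "" if it is missing.
    host = urlparse(url).hostname or ""
    return host in BLACKLISTED_HOSTS or host in BLACKLISTED_IPS


print(is_blacklisted("http://Malware.example/download"))
print(is_blacklisted("https://new-unknown-site.example/"))
```

The second call shows the limitation noted in the text: a newly created malicious URL that has not yet been listed passes the check.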

Phishing websites often include branding that appears legitimate and may even use the same logo as the actual company. It is very difficult to determine which page belongs to each website, but this can be partially supplemented by heuristic approaches. For this purpose, a tentative blacklist is generated in XML format by analyzing all webpages under the same hostname. When a particular business name is typed on Google, it shows a valid URL at the top of the search result page; if this URL is blocked, it is deemed as phishing, and the address is automatically updated and included in a blacklist [13].

Heuristic-based approaches have the advantages of simplicity, high efficiency, and general performance; however, they suffer from nondeterministic polynomial (NP)-hard problems [14], computational complexity owing to repeated iterations, and local optimization problems [15].

2.2. Machine Learning-Based Malicious Website Detection

A machine learning-based approach to detect malicious websites requires feature extraction to accumulate learning data. These features include lexical-, host-, and content-based features, which can be configured in different patterns based on their attributes. The lexical-based features refer to the information obtained from the URL name itself, and they can be extracted from features such as URL string, URL length, element (hostname and top-level domain, etc.) length, and the number of special characters and symbols, or obtained from the extraction of IP addresses, keywords, and tokens [16]. Host-based features can be obtained from the URL hostname features, providing information such as malicious host features, geographical location, identification, and management style [17]. Finally, content-based features have a larger amount of information compared to other types of features, including HTML, JavaScript, visual imagery, and Active X, which can be extracted by crawling the entire webpage [18].
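To make the notion of lexical-based features concrete, the sketch below derives a handful of them from a URL string using only the Python standard library. The function name `lexical_features` and the particular feature names are illustrative assumptions; they are not the 26 features defined later in this paper.

```python
import re
from urllib.parse import urlparse

IPV4_PATTERN = re.compile(r"(\d{1,3}\.){3}\d{1,3}")


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features from a URL string."""
    host = urlparse(url).hostname or ""
    host_is_ipv4 = IPV4_PATTERN.fullmatch(host) is not None
    # Top-level domain: last dot-separated label, unless the host is an IP.
    tld = "" if host_is_ipv4 or "." not in host else host.rsplit(".", 1)[-1]
    return {
        "url_length": len(url),
        "hostname_length": len(host),
        "tld_length": len(tld),
        "num_dots": url.count("."),
        "num_slashes": url.count("/"),
        "num_special_chars": len(re.findall(r"[^A-Za-z0-9]", url)),
        "has_at_symbol": "@" in url,
        "host_is_ipv4": host_is_ipv4,
    }


features = lexical_features("http://192.168.0.1/login.php?user=a@b")
print(features)
```

Host-based and content-based features would require additional lookups (for example, DNS or WHOIS queries) or crawling of the page, which this sketch deliberately leaves out.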

Researchers have developed various methods to classify a particular URL as malicious or benign using feature extraction attributes as a dataset for machine learning. Machine learning algorithm-based classifiers include k-nearest neighbors (KNN), SVM, random forest, naive Bayes, and artificial neural networks. Zhuang et al. [10] proposed a phishing website detection method that uses ensemble classifiers. They built an ensemble classification algorithm based on tag features from the HTML attributes to combine the predicted results from different phishing detection classifiers. They also employed a hierarchical clustering technique for automatic phishing categorization. Chatterjee et al. [19] presented an approach based on reinforcement learning for phishing URL detection. They classified URLs as malicious or benign based on URL information, the URL’s IP address, and additional URL access requests, and their novel model is capable of compensation depending on the learning agents’ behavior and status. Singhal et al. [6] extracted 14 features by defining lexical-based, host-based, and content-based features, and compared the performances of random forest, gradient boosting, and neural networks. They compared the performance of each feature based on the similarity between known and new malicious web data. Vara et al. [20] proposed a malicious web detection model using an SVM classifier. The features they considered include the IP address, ‘@’ symbol, ‘.’ (dot) symbol, domain separation using ‘–’ (underscore or hyphen) symbol, URL redirection, HTTPS token, email subject line, short URL service, hostname length, sensitive words, the number of slashes, Unicode, SSL certificate validity, anchor, iframe, and website ranking. They emphasized that selecting an effective feature is the key to improving the performance of machine learning models.

Machine learning researchers agree that the success of a model or algorithm depends on the manner in which the features are selected and extracted from the web. Classifying whether a particular webpage is malicious or benign depends on the performance of the machine learning model. It is therefore imperative to study the features and machine learning models that are most effective in accurately predicting malicious URLs. Accordingly, in this study, a method for extracting a feature set optimized for a model is proposed.

2.3. Malicious Code Distribution Pattern Detection

To infect personal computers with malicious code, attackers create a network connecting a landing site and a malicious code distribution site. This network is known as a malware distribution network (MDN). JavaScript and iframe tags are used to enable automatic connections without user action, and their links are obfuscated, making analysis difficult [21]. Vulnerable personal computers can easily be infected with malicious code. To lure potential victims, popular websites with a high number of user connections may be injected with malicious code that automatically creates a connection between the user and the distribution site. Wang et al. [22] presented a novel approach to identifying landing pages in MDNs that lead to drive-by downloads. To analyze contents that are similar across malicious websites, they extracted commonly used information from known MDNs, queried a search engine with that information to identify suspicious websites, and finally determined whether each site belonged to a malicious distribution network. They performed string clustering for similar malicious strings based on the string similarity. Invernizzi et al. [11] proposed an effective system called Nazca to detect infections. Nazca detects web requests used for malicious code binary downloads and is designed to work with large-scale networks such as those of ISPs. It performs network traffic analysis with live packet records and extracts a record for each HTTP connection, focusing on malicious code, distributed malicious code hosting, ad hoc malicious domains, and the use of malicious code droppers. Choi et al. [21] proposed an automated link generation and tracing method called the AutoLink-Tracer. This approach consists of two elements: a link trace and link analysis. The former involves an actual browser and forward proxy, whereas the latter involves the root node (the website first visited by a user), the hopping nodes that are automatically linked to it, and the final node, which has no further link. All links can be expressed in a graphical form. They determined whether a particular network was responsible for malicious code distribution through the collection of links and automated link analysis.

Existing studies have analyzed the distribution or propagation of malicious code based on node links. It is important to identify the main malicious nodes. Therefore, in this study, by combining the prediction results of the reputation data and the malicious web detection model, it is possible to evaluate whether the web is related to malicious code and to determine the main malicious node by determining the distribution site. A more precise judgment will be possible if combined with node link analysis in the future.

3. Malicious Code Distribution Automation Prediction System (MCDWDS)

In this paper, we propose an ensemble machine learning-based malicious URL detection method for predicting malicious code distribution. There are several ways to achieve better results from machine learning models, such as searching for more data, employing multiple algorithms, and hyperparameter tuning. In the machine learning process, more generalized expressions promise a better performance. Therefore, it is desirable to combine multiple models. The machine learning-based malicious code distribution automation prediction system uses an ‘ensemble’ approach that combines different types of learning models, including SVM, decision tree, XGBoost, random forest, convolutional neural networks, and logistic regression.

Figure 1 shows the overall system architecture for executing a malicious code distribution automation prediction algorithm. The system consists of six modules, and each suspicious URL undergoes steps ② through ⑨. The entry of a URL into the machine learning model provides a prediction that determines whether the URL is malicious or benign. In the following section, we describe the algorithm and evaluate the performance of each automated prediction model.

3.1. Applying Machine Learning Algorithm for MCDWDS

A decision tree is an algorithm used to classify labeled data. It analyzes associations, patterns, and rules in a large dataset and designs a model for classification and prediction [23]. Because it has a flowchart-like tree structure, a decision tree is the easiest classification and prediction algorithm to interpret, particularly with respect to the input data and target variables.

Random forest is a popular ensemble machine learning method that is mainly used in classification and regression analysis, and it produces a classification or average prediction from multiple decision trees configured during the learning process [24]. Its wide application includes detection, classification, and regression. In this algorithm, each tree has slightly different features, owing to its inherent randomness. This makes tree predictions decorrelated and consequently improves the level of generalization.

Logistic regression is a very popular model used to find the conditional probability, and it is also one of the most frequently used learning methods for malicious URL detection. Similar to ordinary regression analysis, the primary goal of logistic regression is to express relations between dependent variables and independent variables in specific functions so that they can be used as prediction models [25].

XGBoost is a machine learning algorithm based on decision trees using a gradient boosting framework. This ensemble tree method uses the gradient descent architecture of gradient boosting machines to train weak learners (generally classification and regression trees (CART)) [26].

An SVM creates a decision boundary that separates data points belonging to different classes. The boundary that segregates the multidimensional feature space into classes is called the hyperplane, and the data points closest to it are called the support vectors. Selecting a hyperplane that maximizes the margin between the hyperplane and the learning dataset improves classification accuracy [27].

A convolutional neural network (CNN) is a multilayer artificial neural network that adds multiple convolutional layers to a traditional neural network. We employed the CNN-LSTM architecture, combining convolutional neural network layers for input data feature extraction and a long short-term memory (LSTM) for sequence prediction.

3.2. Extraction and Pre-Processing of Malicious Code Distribution Web Features

In this section, we define malicious web features and describe the process used to extract them, as shown in Figure 2. We employed an improved feature set consisting of 26 features to enhance web detection performance based on the collected dataset. In this framework, when the data are input, the feature set is extracted by the feature extraction module. The process comprises three modules: data loading, URL-based extraction, and content-based extraction. The collected URLs are registered in the database, and the feature data for machine learning (URL length, domain data, and HTML contents) are extracted in raw data form and then forwarded to the training data pre-processing module. The feature data consist of ‘URL-based feature data’ extracted from the URL itself and ‘content-based feature data’ extracted from the HTML source code upon URL requests. The 26 features that we defined and extracted are listed in Table 1.

Because the extracted raw feature data are not directly usable for machine learning, they require a vectorization process that converts them into numbers. Depending on the machine learning algorithm used, this process involves two modules: URL and content. The former performs word tokenization of the lexical features of URLs, converting domain character data into vectors appropriate for the CNN deep learning algorithm, while the latter converts the 26 sets of feature data extracted through HTML parsing into vectors appropriate for machine learning.
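The URL-side vectorization can be pictured with the minimal sketch below, which maps each character of a URL to an integer index and pads the result to a fixed length so that it could be fed to a sequence model. Character-level encoding, the vocabulary `VOCAB`, the length `MAX_LEN`, and the function `encode_url` are assumptions made for illustration; the paper does not specify the tokenizer it uses.

```python
import string

# Assumed vocabulary: printable ASCII characters, indexed from 1.
# Index 0 is reserved for padding and for characters outside the vocabulary.
VOCAB = {ch: i + 1 for i, ch in enumerate(string.printable)}
MAX_LEN = 20  # assumed fixed input length


def encode_url(url: str, max_len: int = MAX_LEN) -> list:
    """Convert a URL into a fixed-length list of integer character indices."""
    ids = [VOCAB.get(ch, 0) for ch in url[:max_len]]  # truncate long URLs
    return ids + [0] * (max_len - len(ids))            # right-pad short URLs


vector = encode_url("http://a.kr")
print(vector)
```

A word-level tokenizer, as mentioned in the text, would follow the same pattern with tokens in place of single characters.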

3.3. Machine Learning Model Selection in MCDWDS

Figure 3 describes how to select better-performing models and how to replace the inferior models with their superior counterparts. The model selection submodule evaluates the classification performance of each model by machine learning parameters and, if appropriate, replaces the existing model with new models.

The classifier’s performance standards include the accuracy, recall, precision, and F1-score, each of which is directly associated with the model’s performance. The model replacement cycle may vary depending on the data loading status; however, in general, the module combines the existing data and the newly loaded data, and then performs a classification on a weekly basis as per a pre-defined schedule. The machine learning model selection and replacement process was performed as follows:

Input a machine learning model generated from the updated learning data;
Conduct a performance validation of each model and create a ‘Model Information,’ which covers both validation results and performance data;
Load a ‘Model Information’ of the previously used models;
Replace the existing models with better models by comparing the previous and updated models.
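The selection and replacement steps above can be sketched as follows. The `ModelInfo` class, its fields, and the rule of replacing a model only when the F1-score strictly improves are assumptions for illustration; the paper lists accuracy, recall, precision, and F1-score as performance standards but does not state which one decides the replacement.

```python
from dataclasses import dataclass


@dataclass
class ModelInfo:
    """Hypothetical 'Model Information' record built from validation counts."""
    name: str
    tp: int  # malicious predicted as malicious
    fp: int  # benign predicted as malicious
    tn: int  # benign predicted as benign
    fn: int  # malicious predicted as benign

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total if total else 0.0

    @property
    def precision(self) -> float:
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    @property
    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    @property
    def f1(self) -> float:
        denom = self.precision + self.recall
        return 2 * self.precision * self.recall / denom if denom else 0.0


def select_model(previous: ModelInfo, updated: ModelInfo) -> ModelInfo:
    """Keep the updated model only if its F1-score is strictly higher (assumed rule)."""
    return updated if updated.f1 > previous.f1 else previous


previous = ModelInfo("rf_previous", tp=80, fp=20, tn=80, fn=20)
updated = ModelInfo("rf_updated", tp=90, fp=10, tn=85, fn=15)
print(select_model(previous, updated).name)
```

Keeping the previous model on ties avoids needless churn when a retrained model is no better than the one already deployed.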

3.4. Sub-Model Prediction Process

In this process, a feature set is adaptively applied to each sub-model, and the internal models individually derive the prediction values. Figure 4 illustrates this process. The pre-processed data serve as the predictive input, and the trained model is queried to predict whether the result is normal (green in Figure 4) or malicious (red in Figure 4). Because the stored model includes the ‘model architecture’, that is, the algorithm name, algorithm object, and training data, the system loads the model and predicts whether the URL to be detected is normal or malicious. The logic was designed so that the appropriate feature set (content or URL) is supplied adaptively according to the algorithm type.

3.5. Ensemble Machine Learning Prediction Results Analysis

Figure 5 shows the ensemble machine learning prediction analysis process:

①: Input the prediction results for each model.
②: Dynamically calculate weights (model reliability) using the performance values of each model.
③: Calculate the ensemble result of soft voting (using the weighted average) with the weights and performance values of each model.
④: Compare the ensemble calculation results and deliver the final decision on whether the URL is benign (green in Figure 5) or malicious (red in Figure 5).

In the left column, the process presents the prediction results for each model. The default weight for each sub-model was based on the single prediction performance (predictive accuracy) of the corresponding model. To improve the ensemble performance (accuracy and generalization), a weighted soft voting algorithm is proposed so that submodel weights are dynamically applied and reflected in the voting results. Based on the performance value against the average model performance, the reliability of each model was calculated dynamically. A formula to calculate the dynamic reliability of each model is expressed as follows:

\mathrm{AveragePerformance}\ (AP) = \frac{1}{n} \sum \mathrm{Accuracy}_{i} \qquad (1)

\mathrm{MeanDeviation}\ (MD)_{n} = \sum_{k = 1}^{n} \mathrm{ModelPerformance}_{k} - AP \qquad (2)

\mathrm{Weight}_{n} = MD - \mathrm{LowestMeanDeviation} \qquad (3)

Soft voting calculates the ensemble results using the weighted value and performance value of each model. The ensemble result calculation is expressed as follows:

\bar{X} = \frac{\sum_{i = 1}^{n} \mathrm{Weight}_{i} \, \mathrm{Accuracy}_{i}}{\sum_{i = 1}^{n} \mathrm{Weight}_{i}} \qquad (4)

The ensemble results guide us to determine which website is potentially malicious, thereby allowing us to conduct risk assessment on a specific website.
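One possible reading of Equations (1) to (4) is sketched below in Python. Two points are assumptions rather than statements from the paper: Equation (2) is interpreted per model (each model's performance minus the average), and the weighted average of Equation (4) is applied to hypothetical per-model malicious probabilities. The accuracy and probability values are made up for illustration.

```python
def dynamic_weights(accuracies):
    """Model reliability weights following one reading of Eqs. (1)-(3)."""
    ap = sum(accuracies) / len(accuracies)       # Eq. (1): average performance
    deviations = [a - ap for a in accuracies]    # Eq. (2): deviation per model
    lowest = min(deviations)
    return [d - lowest for d in deviations]      # Eq. (3): shift so minimum is 0


def weighted_average(values, weights):
    """Weighted average in the form of Eq. (4)."""
    total = sum(weights)
    if total == 0:
        # All models perform identically: fall back to a plain average.
        return sum(values) / len(values)
    return sum(w * v for w, v in zip(weights, values)) / total


accuracies = [0.95, 0.93, 0.60]       # hypothetical sub-model accuracies
malicious_probs = [0.9, 0.8, 0.2]     # hypothetical sub-model outputs for one URL

weights = dynamic_weights(accuracies)
score = weighted_average(malicious_probs, weights)
print(weights, score)
```

Under this reading, the weakest sub-model always receives a weight of zero, so it has no influence on the weighted vote.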

4. Experiment and Result

4.1. Dataset

For this study, we collected a set of malicious and benign URLs using the Cyber Threats Analysis System (C-TAS) of the Korea Internet & Security Agency (KISA) and open-source intelligence (OSINT). Malicious and benign URLs accounted for 80% and 20% of the learning data, respectively. To estimate the accuracy of the models, 20% of the learning data were excluded from the input data for machine learning, and reserved to estimate the accuracy of the model whose learning has been completed. In total, 114,996 URLs were processed in the machine learning models and malicious and benign URLs were segregated in a ratio of 8:2. The remaining 23,000 URLs were used for validation. Table 2 shows the composition of the dataset.

4.2. Classification Results by Machine Learning Models

We performed a classification of the dataset with six machine learning models: SVM, decision tree, random forest, XGBoost, logistic regression, and convolutional neural networks. The learning data accounted for 80% of the total dataset, and the remaining dataset was used for the validation. For learning and validation purposes, malicious and benign webpages were prepared in the same ratio. The data features consist of 26 vectors: 11 lexical-based features, 8 malicious features, 5 HTML/JS-based features, and 2 domain-based features. The resulting classification performance of each machine learning model is summarized in Table 3.

Table 3 shows the precision of each machine learning model. In descending order, the light gradient boosting machine (0.9546) ranks first, followed by random forest (0.9511), XGBoost (0.9499), logistic regression (0.9337), decision tree (0.9162), 1D-CNN (0.804), and SVM (0.6096).

The largest difference between precision and accuracy was for logistic regression, which recorded a value of 0.03. This was mainly attributable to the parameter properties due to alternating model selections. The SVM, however, recorded a very low level of performance (approximately 60%). This poor performance was due to sparse features, which made it difficult for the kernel SVM to find a separating hyperplane; the structure of the malicious web detection dataset is not compatible with the SVM kernel. An SVM can show good performance regardless of the data dimension when the scale or format of the feature values is the same and the number of data samples is small (10,000 or less) [28]. The dataset used in this study, by contrast, consists of heterogeneous features and approximately 100,000 samples, which is considered to explain the low SVM result.

In addition, we performed experiments with the high-performance extreme learning machines (HP-ELM) model [29] and the light gradient boosting machine (LGBM) [30] for comparison with more recent models. There are studies that apply this model in the field of malware detection [31]. In the experiment, HP-ELM recorded an F1-score of 0.6946. Because the type and scale of the features in the dataset are diverse, the dataset is not well suited to HP-ELM, which is thought to explain its low performance. The LGBM is a tree-based ensemble model. Its F1-score was as high as 0.9543, which suggests that our dataset has a structure suitable for tree-based models. The RF is thought to exhibit relatively high performance because the influence of outliers or heterogeneous feature values on classification performance is offset through subtree construction and pruning.

4.3. Prediction Accuracy of the Ensemble Technique

We estimated the prediction accuracy by applying ensemble techniques to the classification results from the selected machine learning models. Three methods were used: hard voting, soft voting, and weighted soft voting; their performances are summarized in Table 4. In hard voting, the class that receives the majority of the model votes wins. In soft voting, the probabilities given by each model for each class are summed, and the class with the greatest sum of probabilities wins. In weighted soft voting, predictions are weighted by each model’s importance and then summed.
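The difference between the three voting schemes can be seen in the following minimal sketch for the binary (malicious versus benign) case. Each sub-model is assumed to output a probability that the URL is malicious; the function names, the 0.5 threshold, and the example probabilities and weights are hypothetical.

```python
def hard_vote(probs, threshold=0.5):
    """Malicious if a strict majority of models individually predict malicious."""
    votes = sum(1 for p in probs if p >= threshold)
    return votes > len(probs) / 2


def soft_vote(probs, threshold=0.5):
    """Malicious if the average malicious probability reaches the threshold."""
    return sum(probs) / len(probs) >= threshold


def weighted_soft_vote(probs, weights, threshold=0.5):
    """Malicious if the weight-averaged malicious probability reaches the threshold."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, probs)) / total >= threshold


probs = [0.9, 0.45, 0.4]     # hypothetical malicious probabilities from three models
weights = [0.1, 0.5, 0.4]    # hypothetical model importances

print(hard_vote(probs), soft_vote(probs), weighted_soft_vote(probs, weights))
```

In this example, one confident but lightly weighted model pulls the plain soft vote toward malicious, while hard voting and weighted soft voting both return benign, which illustrates how the three rules can disagree on the same inputs.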

When we applied the ensemble techniques, the highest precision was observed in weighted soft voting. Because both precision and recall were high, the F1-score was also high. In terms of accuracy, we obtained the best results for hard voting. However, the differences in F1-score and accuracy between weighted soft voting and hard voting were only 0.0115 and 0.0139, respectively.

The poorest result of accuracy was found in soft voting, with a difference of 0.1403 compared with weighted soft voting. In terms of precision, the difference was 0.14. Poor performance in soft voting was mainly due to a substantial gap between the prediction classification assessments in different machine learning models. As shown in Table 3, the prediction classification of an SVM was 60%, which led to a lower probability of soft voting.

Figure 6 shows the ROC curve of the weighted soft voting model. The AUC value was 0.924, indicating that the TP and FP ratios were not biased and confirming high detection accuracy. The red dotted line in Figure 6 represents the worst case.

An experiment was also performed with AdaBoost, a different ensemble technique, for comparison with the proposed ensemble model. Table 5 presents the results of the AdaBoost experiment. In conclusion, the hard voting method achieved the highest accuracy (0.9548), whereas the weighted soft voting method scored highest on the remaining performance indicators, with a precision of 0.9587, recall of 0.9587, and F1-score of 0.9587. The proposed method dynamically adjusts weights by reflecting the characteristics of each model and thus exhibits a high accuracy performance compared to other methods. However, as mentioned above, it has the disadvantage that its performance depends on that of the sub-models. Overall, the proposed soft voting ensemble technique based on dynamic weighting is suitable for malicious web detection.

4.4. Validation of Malicious Website Classifications

The classification results from ensemble machine learning were validated using actual malicious and benign webpage classifications. For validation purposes, we compared the actual false positive rate with malicious website data, referring to the reputation scores provided by malicious website search engines such as Google Safe Browsing and VirusTotal. Table 6 shows the URL prediction results per reputation score ranging from 3 to 15.

Of the 11,665 URLs that were actually benign, the ensemble misclassified 633 as malicious (false positives) and correctly classified 11,032 as benign (true positives in Table 6). In other words, the false positive rate was as low as 5.43%, meaning that 94.57% of the predictions were correct. Of the 11,337 URLs that were actually malicious, 493 were misclassified as benign, accounting for 4.35%. The ratio of correct predictions (true negatives in Table 6) was therefore 95.65%, which was higher than that for the benign URLs by 1.08 percentage points. For malicious URLs with a reputation score of 8 or more, every prediction was correct. Among the URLs labeled benign, the ensemble flagged 141 with a reputation score of 8 or more as malicious; although Table 6 counts them as false positives, their high reputation scores indicate that they are additional malicious websites. This additional detection of malicious websites can increase prediction performance by 22.27% (141 of the 633 flagged URLs).
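The percentages in this subsection follow directly from the counts in Table 6. The short sketch below recomputes them; the variable names are ours, and the counts are transcribed from the table.

```python
# Benign-labeled URLs that the ensemble flagged as malicious, per reputation
# score (column "False Positive (Malicious)" of Table 6).
benign_flagged_by_score = {
    3: 213, 4: 52, 5: 41, 6: 13, 7: 173, 8: 25, 9: 46,
    10: 22, 11: 26, 12: 1, 13: 7, 14: 14, 15: 0,
}
benign_kept = 11032        # benign URLs predicted benign (Table 6 total)
malicious_caught = 10844   # malicious URLs predicted malicious (Table 6 total)
malicious_missed = 493     # malicious URLs predicted benign (Table 6 total)

def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)

benign_flagged = sum(benign_flagged_by_score.values())   # 633
benign_total = benign_flagged + benign_kept               # 11,665
malicious_total = malicious_caught + malicious_missed     # 11,337

fp_rate = pct(benign_flagged, benign_total)               # 5.43
miss_rate = pct(malicious_missed, malicious_total)        # 4.35
gap = round((100 - miss_rate) - (100 - fp_rate), 2)       # 1.08

# Benign-labeled URLs flagged as malicious with a reputation score >= 8.
high_score_flagged = sum(
    count for score, count in benign_flagged_by_score.items() if score >= 8
)                                                         # 141
extra_share = pct(high_score_flagged, benign_flagged)     # 22.27

print(fp_rate, miss_rate, gap, high_score_flagged, extra_share)
```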

5. Conclusions

In this study, we proposed a new ensemble machine learning method for malicious webpage detection. Various machine learning techniques have been studied for detecting malicious URLs, and higher performance inevitably requires more data processing. Recent studies have shown that SVMs, decision trees, random forests, and other popular machine learning models can deliver impressive performance. However, malicious webpages are constantly evolving, and their detection requires repeated model evaluation, improved generalization performance, and different combinations of features. To this end, we conducted an extensive analysis of ensemble machine learning techniques to automate these repetitive tasks.

We found that while single-model detection performance averaged 86%, the ensemble framework with weighted soft voting improved detection performance by 9 percentage points. Validation of the classification results showed that the ensemble approach detected an additional 141 malicious webpages, thus improving the detection performance by 22.27%. We intend to add an analysis function to the system that focuses on the management of malicious webpage feature datasets and their importance, in support of intuitive technologies and cybersecurity standards for addressing ever-changing malicious URLs.

In future work, we intend to study web attack detection technology that can detect not only malicious websites and malicious code distribution sites but also waypoints and domains containing command-and-control (C&C) servers. In addition, as a sub-model, we plan to study a deep learning model that determines whether a website is malicious from a binary image of the website.

Author Contributions

Conceptualization, S.-S.S.; methodology, S.-G.J.; software, S.-S.S.; validation, S.-S.S. and S.-S.H.; formal analysis, S.-G.J.; resources, S.-S.H.; data curation, S.-S.H.; writing—original draft preparation, S.-S.S. and S.-S.H.; writing—review and editing, S.-G.J. and S.-S.H.; supervision, S.-G.J.; project administration, S.-G.J.; funding acquisition, S.-G.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was financially supported by the Institute of Civil-Military Technology Cooperation (ICMTC) grant funded by the Korea government (MOTIE and DAPA) under grant No. UM21306RD3.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. Malicious Code Distribution Automation Prediction System Architecture.

Figure 2. Malicious Web Feature Extraction Process.

Figure 3. Machine Learning Model Selection and Replace Process.

Figure 4. Sub-model Prediction.

Figure 5. Ensemble Machine Learning Prediction Analysis Process.

Figure 6. ROC Curve (Weighted Soft Voting).

Table 1. Machine Learning Features for Malicious Web Detection.

Type	No.	Feature	Description
Lexical-based feature	1	IP	• IP address is included in the hostname.
	2	URL Length	• URL length exceeds a given number of characters.
	3	Short URL	• A long link is reduced to a Short URL.
	4	HTML Length	• HTML text length.
	5	@	• A ‘@’ symbol is included in the URL.
	6	//	• A URL redirection occurs due to ‘//’.
	7	(_), (-)	• A domain name includes symbols such as (_) or (-) that are not recommended by the naming rules.
	8	HTTPS	• Whether the HTTPS security protocol is used.
	9	Unnecessary Ports	• An ordinary web server uses Port 80 (HTTP) and Port 443 (HTTPS) only.
	10	HTTPS script in URL	• HTTPS script is included in the sub-domain/domain name.
	11	HTTPS Validity	• HTTPS certificate validity period is over within a year.
Malicious feature	12	Request URL	• Videos, images, and CSS files are loaded from the external URLs.
	13	Window Pop Tag	• Whether a window pop-up command is included.
	14	Anchor Tag Ratio	• Check a ratio of anchor tags <a href>. Whether a website is linked to another domain.
	15	HTML Tag Configuration	• Check a ratio of <Meta>, <Script>, and <Link> tags in HTML source code.
	16	Server Form Handler	• A webpage sending data to a server is an external URL or an ‘about:blank’ page.
	17	Email Tag	• Whether a “mailto:” tag is functioning.
	18	WHOIS Lookup	• WHOIS domain is not included in the URL.
	19	<script> Tag	• Check a ratio of the ‘src’ attribute allowed to link to scripts from an external domain.
HTML/JS-based feature	20	Number of Forwarding	• Number of redirects to different URLs.
	21	onMouseOver Script	• Whether an onMouseOver script is included in JavaScript.
	22	Disabled Mouse Right Click	• Whether an “event.button == 2” script is used.
	23	Pop-up Window	• Whether a pop-up window has a ‘text field’ for data input.
	24	iFrame	• Whether iframe/frameBorder tag and attribute are used.
Domain-based feature	25	Domain Registration Period	• Check the expiration date of a domain using the WHOIS DB; a period longer than 6 months is normal, whereas a period shorter than 6 months is suspicious.
	26	SSL Certificate Registration Period	• Check the expiration date of an SSL certificate.

Table 2. Dataset Composition.

Type	Dataset	Learning (80%)	Validation (20%)
Malicious URL	57,504	-	-
Benign URL	57,492	-	-
Total	114,996	91,996	23,000

Table 3. Classification Performance by Machine Learning Models.

Model	Precision	Recall	F1-Score	G-Mean	Accuracy
Support Vector Machine	0.6096	0.6095	0.6093	0.60	0.61
Decision Tree	0.9162	0.9162	0.9161	0.91	0.91
Random Forest	0.9511	0.951	0.951	0.95	0.95
XGBoost	0.9499	0.9498	0.9498	0.94	0.94
Logistic Regression	0.9337	0.9329	0.933	0.93	0.90
1D-Convolutional Neural Networks	0.804	0.802	0.8018	0.80	0.80
LGBM	0.9541	0.9546	0.9543	0.95	0.95
HP-ELM	0.6535	0.6774	0.6653	0.65	0.65
Average	0.8465	0.8491	0.8475	0.84	0.83

Table 4. Performance of the Voting Ensemble Techniques based on Dynamic Model Weight.

Ensemble Method	Precision	Recall	F1-Score	G-Mean	Accuracy
Hard Voting	0.955	0.9549	0.9573	0.9549	0.9548
Soft Voting	0.8286	0.8284	0.8284	0.8285	0.8284
Weighted Soft Voting	0.9686	0.9689	0.9688	0.9687	0.9687

Table 5. Performance of the AdaBoost Ensemble Techniques.

Ensemble Method	Precision	Recall	F1-Score	G-Mean	Accuracy
AdaBoost	0.9526	0.9504	0.9516	0.9516	0.9516

Table 6. Reputation Score Mapping based on Ensemble Prediction Results.

Reputation Score	Benign URL Results: False Positive (Malicious)	Benign URL Results: True Positive (Benign)	Malicious URL Results: True Negative (Malicious)	Malicious URL Results: False Positive (Benign)
3	213	9736	981	321
4	52	0	124	0
5	41	0	255	0
6	13	0	200	0
7	173	1296	4059	172
8	25	0	282	0
9	46	0	1359	0
10	22	0	1052	0
11	26	0	1526	0
12	1	0	16	0
13	7	0	209	0
14	14	0	630	0
15	0	0	151	0
Total	633	11,032	10,844	493

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
