1. Introduction
Defacement attacks are a type of attack that modifies a website's content and, as a result, changes the website's appearance [1,2].
Figure 1 shows a web page of the Tuy-Hoa (Vietnam) airport's website, which was defaced in March 2017, and Figure 2 shows the home page of an Australian government website that was defaced by Indonesian hackers with the message "Stop Spying on Indonesia". According to some reports, the number of defacement attacks reported worldwide escalated from 2010 to 2011 and from 2012 to 2013 [2,3]. Although the number of defacement attacks has declined in recent years [2,3], thousands of websites and web applications are still defaced every day all over the world [2,3,4]. Some of the most widely reported defacement attacks of recent years include the following [5]:
In 2011, the home page of Harvard University's website was replaced with a photo of the Syrian President, Bashar Al-Assad.
In 2012, about 500 Chinese websites were defaced by an anonymous hacker group.
In 2013, the entire website of MIT (USA) was defaced after the death of the well-known hacker Aaron Swartz.
In 2014, about 100 Singaporean websites were defaced. Most of these websites were operated by the opposition Reform Party.
In 2015, an ISIS propaganda website on the dark web was defaced and its contents were replaced with online medicine advertisements for selling Prozac and Viagra.
Although many causes of defacement attacks have been identified, the main cause is severe security vulnerabilities in websites, web applications, or hosting servers, which attackers exploit to launch defacement attacks [1,2,4]. Common security vulnerabilities in websites and web applications include SQL injection (SQLi), cross-site scripting (XSS), cross-site request forgery (CSRF), local or remote file inclusion, inappropriate account management, and outdated software [1,2,4].
Defacement attacks on websites, web portals, or web applications can have critical consequences for their owners. Such attacks can interrupt the website's normal operations, damage the owner's reputation, and cause losses of valuable data; in turn, these may lead to huge financial losses. A defacement attack on a website immediately interrupts its normal operations because the organization's staff and customers cannot access the features or services provided by the website. Furthermore, if appropriate countermeasures are not applied in a timely manner, more attacks on the website may follow because the details of the website's security vulnerabilities have been exposed. The damage to the website owner's reputation and, in the long term, potential data losses are also serious. A detailed discussion of these consequences is beyond the scope of this paper; interested readers are referred to [5].
Because defacement attacks on websites and web applications are widespread and have serious impacts, many countermeasures have been proposed and deployed in practice. Current countermeasures against defacement attacks include (1) scanning for and fixing security vulnerabilities in websites, web portals, or web applications and (2) installing defacement monitoring tools, such as VNCS web monitoring [6], Nagios web application monitoring software [7], Site24x7 website defacement monitoring [8], and WebOrion defacement monitor [9].
This paper proposes a hybrid website defacement detection model that is based on the combination of machine learning-based detection and signature-based detection. We extend the machine learning-based detection method proposed in our previous work [10] and use it in the proposed hybrid defacement detection model. The advantages of the machine learning-based detection are that (1) the detection profile can be inferred automatically from the training data and (2) it achieves high overall detection accuracy and a low false positive rate. The signature-based detection is used to boost the processing speed of the proposed model for common forms of defacement attacks.
The remainder of the paper is structured as follows: Section 2 discusses some related works, Section 3 presents the machine learning-based detection model, Section 4 presents the hybrid website defacement detection model, and Section 5 provides the paper's conclusion.
2. Related Works
Many website defacement monitoring and detection methods and tools have been proposed and implemented in practice. These solutions can be divided into two categories: the signature-based detection approach and the anomaly-based detection approach [11,12]. The signature-based detection approach first creates a set of known attack signatures from defaced web pages. Attack signatures are usually encoded in the form of rules or string patterns. The approach then looks for attack signatures in the monitored web pages; if a match is found, a defacement attack is detected. The signature-based approach is fast and efficient for detecting known attacks. However, it is not able to detect new or unknown forms of attacks.
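As an illustration of the signature-matching idea, the following is a minimal Python sketch that assumes signatures are simple regular-expression string patterns; the patterns shown are hypothetical examples rather than the signature set of any particular tool.

```python
import re

# Hypothetical string-pattern signatures of common defacement messages;
# a real signature set would be derived from a corpus of known defaced pages.
SIGNATURES = [
    re.compile(r"hacked\s+by", re.IGNORECASE),
    re.compile(r"defaced\s+by", re.IGNORECASE),
    re.compile(r"owned\s+by", re.IGNORECASE),
]

def matches_signature(page_html: str) -> bool:
    """Return True if any known attack signature appears in the monitored page."""
    return any(sig.search(page_html) for sig in SIGNATURES)
```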
On the other hand, the anomaly-based detection approach first constructs a "profile" from information about the monitored pages of a website in its normal working condition. The monitored pages are then observed, their information is extracted, and this information is compared with the profile to look for differences. If a notable difference is found, a defacement attack is detected and an alarm is raised. The major advantage of this approach is its potential to detect new or unknown attacks. However, it is very hard to decide the detection threshold between the monitored web page and the profile because the content of dynamic web pages changes regularly.
Anomaly-based techniques for the defacement monitoring and detection of websites and web applications include those based on traditional comparison methods as well as advanced methods. While traditional comparison methods include checksum comparison, diff comparison, and DOM tree analysis, advanced methods are based on complicated or learning techniques, such as statistics, data mining, machine learning, genetic programming, and the analysis of page screenshots [11,12]. The following parts of this section describe these methods. In addition, some website defacement monitoring tools widely used in practice are also discussed.
2.1. Defacement Detection of Websites Based on Traditional Comparisons
Defacement detection of websites and web applications using checksum comparison is one of the simplest methods for finding changes in web pages. First, the checksum of the web page's content is computed using a hashing algorithm, such as MD5 or SHA1, and stored in the detection profile. Then, when the web page is monitored, a new checksum of the web page's content is calculated and compared with the corresponding checksum saved in the detection profile. If the two checksum values differ, an alarm is raised. This technique works well for static web pages. However, it is not applicable to dynamic pages, for instance the pages of e-commerce websites, because their content changes frequently [11,12].
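The checksum-comparison technique can be sketched in a few lines of Python. This is a minimal illustration assuming pages are fetched with the standard library and the detection profile is held in a simple in-memory dictionary.

```python
import hashlib
import urllib.request

profile = {}  # maps a monitored URL to its baseline checksum

def page_checksum(url: str) -> str:
    """Fetch the page and return the SHA-1 digest of its raw content."""
    with urllib.request.urlopen(url) as response:
        return hashlib.sha1(response.read()).hexdigest()

def record_baseline(url: str) -> None:
    """Store the checksum of the page in its normal working state."""
    profile[url] = page_checksum(url)

def checksum_changed(url: str) -> bool:
    """Return True (raise an alarm) if the current checksum differs from the baseline."""
    return page_checksum(url) != profile[url]
```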
The diff comparison method uses the DIFF tool, which is commonly available in Linux and UNIX environments. DIFF compares the current content of the web page with its content stored in the profile to find changes. The most difficult part is deciding on an anomaly threshold as input to the monitoring process for each web page. In short, the diff comparison technique is relatively effective for most dynamic pages if the anomaly threshold is chosen correctly [11,12].
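A sketch of the diff-comparison idea using Python's difflib in place of the command-line DIFF tool is shown below; the anomaly threshold value is illustrative and would need to be tuned for each monitored page.

```python
import difflib

ANOMALY_THRESHOLD = 0.30  # illustrative value; must be tuned per page

def change_ratio(baseline_html: str, current_html: str) -> float:
    """Fraction of the page that differs between the baseline and the current content."""
    matcher = difflib.SequenceMatcher(
        None, baseline_html.splitlines(), current_html.splitlines()
    )
    return 1.0 - matcher.ratio()

def is_anomalous(baseline_html: str, current_html: str) -> bool:
    """Raise an alarm when the change ratio exceeds the anomaly threshold."""
    return change_ratio(baseline_html, current_html) > ANOMALY_THRESHOLD
```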
DOM (Document Object Model) is an API that defines the logical structure of web pages, or HTML documents. DOM can be used to scan and analyze the structure of a web page. DOM tree analysis detects changes in the web page's structure rather than in its content. First, the page structure is extracted from the page content in the normal working condition and stored in the profile. Then, the structure of the monitored page is extracted and compared with the stored structure in the profile to look for differences. If a notable difference is found between the page structures, an alarm is raised. Generally, this technique works well for web pages with stable structures [11,12]. However, it cannot detect unauthorized modifications to the content of the web pages.
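The structure-comparison idea can be illustrated with the following Python sketch. For simplicity, it records only the flat sequence of opening tags rather than a full DOM tree, which is enough to flag structural changes while ignoring text-only edits.

```python
from html.parser import HTMLParser

class TagSequenceParser(HTMLParser):
    """Collect the sequence of opening tags, ignoring the text content."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def page_structure(html: str) -> list:
    """Extract the tag sequence of a page as a simple stand-in for its structure."""
    parser = TagSequenceParser()
    parser.feed(html)
    return parser.tags

def structure_changed(baseline_html: str, current_html: str) -> bool:
    """Alarm only when the tag structure differs, even if the text content is unchanged."""
    return page_structure(baseline_html) != page_structure(current_html)
```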
2.2. Defacement Detection of Websites Based on Advanced Methods
This section presents a survey of some defacement detection proposals based on advanced methods, including those of Kim et al. [13], Medvet et al. [14], Bartoli et al. [15], and Borgolte et al. [16].
Kim et al. [13] proposed a statistical method for web page defacement monitoring and detection. The method consists of a training stage and a detection stage. In the training stage, the HTML content of each normal web page is first divided into features using the 2-gram method, and the frequency of each 2-gram (feature) is counted. On the basis of a statistical survey, they conclude that the 300 2-grams with the highest frequencies are sufficient to represent a web page for defacement detection. The detection profile contains all normal web pages of the training dataset, each of which is converted to a vector of 300 2-grams and their frequencies. In the detection stage, as shown in Figure 3, the monitored web page is first retrieved, and its HTML content is processed and converted to a vector using the same technique applied to the training pages. Next, the monitored page's vector is compared with the corresponding page vector stored in the detection profile using the cosine distance to compute their similarity. If the computed similarity is less than the abnormal threshold, an attack alarm is raised. The abnormal threshold is generated initially and then dynamically updated for each web page periodically. The strong point of this method is that it can create and adjust dynamic detection thresholds, thereby reducing the false alarm rate. However, its major shortcomings are that (1) the periodically adjusted thresholds are not appropriate for monitored web pages whose content changes frequently, so the method still generates many false alarms, and (2) it demands substantial computing resources for the dynamic threshold adjustment of each monitored page.
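The core of the 2-gram comparison can be sketched as follows. This illustration assumes character-level 2-grams and a fixed similarity threshold, whereas Kim et al. adjust the threshold dynamically for each page.

```python
from collections import Counter
from math import sqrt

def top_2grams(html: str, k: int = 300) -> Counter:
    """Count character 2-grams of the HTML content and keep the k most frequent ones."""
    grams = Counter(html[i:i + 2] for i in range(len(html) - 1))
    return Counter(dict(grams.most_common(k)))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two 2-gram frequency vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_defaced(profile_vector: Counter, current_html: str, threshold: float = 0.8) -> bool:
    """Alarm when the similarity to the stored profile drops below the threshold."""
    return cosine_similarity(profile_vector, top_2grams(current_html)) < threshold
```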
Medvet et al. [14] and Bartoli et al. [15] proposed building the website defacement detection profile using genetic programming techniques. First, to collect web page data, they use 43 sensors to monitor and extract information about the monitored pages. The next step is the vectorization process, in which the gathered information for each page is transformed into a vector of 1466 elements. The proposed method includes two stages: training and detection. In the training stage, web pages of normally working websites are retrieved and vectorized to construct the detection profile based on genetic programming techniques. In the detection stage, the information of the monitored web page is retrieved, vectorized, and compared with the detection profile to look for differences. If any significant difference is found, an attack alarm is raised. The main drawbacks of this approach are that (1) it demands extensive computing resources to build the detection profile due to the large web page vectors and (2) it uses expensive genetic programming techniques.
Borgolte et al. built Meerkat [16], a system for the defacement detection of websites and web applications based on the image analysis and recognition of web page screenshots using computer vision techniques. Figure 4 shows Meerkat's architecture, which is based on a deep neural network. The input to Meerkat is a list of web addresses (URLs) of monitored pages. For each URL, the system first loads the web page and then takes a screenshot of the page. The page screenshots (images), rather than the original web pages, are used as inputs for defacement analysis and detection. Similar to other learning-based systems, Meerkat has a training stage and a detection stage. In the training stage, it gathers screenshots of monitored web pages in their normal working conditions. The training screenshots are processed to extract high-level features using advanced machine learning methods, such as a stacked autoencoder and a deep neural network. The feature sets of the monitored pages are then stored in the detection profile. In the detection stage, the same processing procedure used in the training stage is applied to each monitored web page to create its current feature set. The page's current feature set is compared with its feature set stored in the detection profile to find differences. If any notable difference is found, an attack alarm is fired. Meerkat was tested on a dataset of 2.5 million normal web pages and 10 million defaced web pages. The test results show that the system achieves high detection accuracies, from 97.42% to 98.81%, and low false positive rates, from 0.54% to 1.52%. The advantages of this system are that the detection profile can be constructed from the training data and that it was evaluated on a large dataset. Nevertheless, its main disadvantage is that it demands extensive computing resources for highly complicated image processing and recognition techniques. Moreover, Meerkat's processing may also be slow because a web page must be fully loaded and rendered in order to take a high-quality screenshot.
2.3. Defacement Monitoring and Detection Tools
This section introduces some popular tools for website defacement monitoring and detection, including VNCS web monitoring [6], Nagios web application monitoring software [7], Site24x7 website defacement monitoring [8], and WebOrion defacement monitor [9].
2.3.1. VNCS Web Monitoring
VNCS web monitoring [6] is a security solution from Vietnam Cybersecurity (VNCS) that monitors websites, web portals, and web applications based on the real-time collection of web logs and the Splunk platform [17]. The solution's monitoring agents are installed on target systems to collect and transfer web logs to a central server for processing. Splunk is used for storing, indexing, searching, analyzing, and managing the web logs. The main features of VNCS web monitoring include unified web log management, automatic analysis of web logs to detect website issues and attacks (including web page defacements, SQLi attacks, and XSS attacks), and real-time site status alerts.
The solution's disadvantages are that (1) its monitoring agents need to be installed on monitored systems to collect and transfer web logs, (2) its set-up and operation costs are high because it is a commercial solution, and (3) it only uses checksums and direct comparison of web page contents, which may generate a high level of false alarms for websites with dynamic content, such as e-shops and forums.
2.3.2. Nagios Web Application Monitoring Software
Nagios web application monitoring software [7] is a commercial solution for monitoring websites, web portals, and other web applications. A range of monitoring tools is provided for different customer requirements, such as Nagios XI, a recently published version of the solution. Typical features of the solution include URL monitoring, HTTP status monitoring, website availability monitoring, website content monitoring, and website transaction monitoring.
The shortcomings of the Nagios solution are that (1) its set-up and operation costs are high because it is a commercial tool and (2) it only uses checksums and direct comparison of web page contents, which may generate a high level of false alarms for websites with dynamic content, such as e-stores and forums.
2.3.3. Site24x7 Website Defacement Monitoring
Site24x7 website defacement monitoring [8] is a service for monitoring and detecting website defacement attacks. This service provides the following features:
Early identification of website security issues, including unauthorized insertion or modification of a web page's HTML elements, such as text, scripts, images, links, iframes, and anchors;
Scanning of the entire website to find attack links and other issues related to web page quality;
Identification of changes in the href or src attributes of HTML tags that point to unused domains;
Early identification of security policy violations;
Minimization of any effort to take control of the monitored website.
The advantages of this service are its simple installation and low initial setup cost. However, the service is only suitable for static web pages, not for dynamic web pages such as e-commerce websites or forums.
2.3.4. WebOrion Defacement Monitor
WebOrion defacement monitor [9] is a solution for website defacement detection that can be provided as a service or installed as software on the client's site. Its major features include:
The content analysis engine is responsible for analyzing and comparing the elements of the web pages with thresholds to detect unauthorized modifications.
The advanced integrity validation engine is responsible for validating the integrity of the page elements using hash calculations. The final decision on the page status is made using an advanced decision algorithm.
The image analysis engine converts the web page into an image, which is analyzed and compared with thresholds to detect changes.
The monitoring is agentless, which means that there is no requirement to install monitoring agents on the target systems.
The intelligent baseline determination means that baseline or threshold values for comparison are determined in a smart way using page analysis.
Active warning and reporting are provided: the solution automatically sends email and SMS alerts to predefined email addresses and phone numbers when unauthorized changes are detected.
The advantage of this system is that it can monitor and detect defacements comprehensively without installing monitoring agents on the target systems. However, the option of installing the software on the client's site incurs high initial setup costs.
2.4. Comments on Current Techniques and Tools
From this survey of website defacement monitoring and detection methods and tools, the following remarks can be made:
Defacement detection techniques based on checksum comparison, diff tools, or DOM tree analysis can only be used effectively for static websites. Furthermore, calculating a suitable detection threshold for each monitored page is difficult and computationally expensive.
Defacement detection methods based on machine learning and data mining have potential because the detection profile or the threshold can be “learned” from the training data.
Kim et al. [13] propose an algorithm to dynamically generate and adjust the detection threshold for each monitored page in order to reduce false positive alarms. However, this method only works well for web pages with fairly stable content; for highly dynamic websites, such as e-shops or forums, it is not effective.
The common shortcoming of Medvet et al. [14], Bartoli et al. [15], and Borgolte et al. [16] is their extensive computing requirements, because they use either large feature sets [14,15] or highly complicated algorithms [14,15,16]. This may restrict their implementation and deployment in practice.
Commercial website monitoring tools, such as VNCS web monitoring [6], Nagios web application monitoring software [7], Site24x7 website defacement monitoring [8], and WebOrion defacement monitor [9], have two common drawbacks: (1) they are expensive because they are commercial solutions and (2) they only use checksums and direct comparison of web page contents, which may generate a high volume of false alarms on dynamic websites.
In this paper, we extend our previous work [10] by proposing a hybrid defacement detection model based on machine learning and attack signatures. We first carry out extensive experiments on a larger dataset of English and Vietnamese web pages using the machine learning-based defacement detection to verify its detection performance. Then, we combine the machine learning-based detection and the signature-based detection to build the hybrid defacement detection model. The combination of the two detection techniques aims to improve the detection rate and boost the processing speed for common forms of defacement attacks. Our proposed defacement detection model does not require extensive computational resources because we only use low-cost supervised learning algorithms, such as Naïve Bayes or Random Forest (RF), for the classification of web page HTML code. Furthermore, the detection classifier and the attack signatures are built from the training data offline. This in turn makes the proposed detection model more efficient because it is not necessary to generate and update a dynamic detection threshold for each monitored page.
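As a minimal sketch of how such a low-cost classifier could be trained, the following example assumes scikit-learn, character 2-gram counts as features, and a tiny hypothetical training set; the actual feature extraction and dataset used in this work may differ.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical training data: raw HTML of normal (label 0) and defaced (label 1) pages.
train_html = [
    "<html><body>Welcome to our online store</body></html>",
    "<html><body>HaCkEd by some_crew - your security is low</body></html>",
]
train_labels = [0, 1]

# Character 2-gram counts as features, classified with Random Forest.
classifier = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 2)),
    RandomForestClassifier(n_estimators=100),
)
classifier.fit(train_html, train_labels)

def is_defaced(page_html: str) -> bool:
    """A page predicted as class 1 triggers a defacement alarm."""
    return classifier.predict([page_html])[0] == 1
```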
5. Conclusions
This paper proposed a hybrid website defacement detection model that is based on machine learning techniques and attack signatures. The machine learning-based component is able to detect defaced web pages with a high level of accuracy, and the detection profile can be learned from a dataset of both normal pages and defaced pages. The signature-based component helps boost the processing speed for common forms of defacement attacks. Experimental results showed that our defacement detection model works well on both static and dynamic web pages, with an overall detection accuracy of more than 99.26% and a false positive rate of less than 0.62%. The model is also able to monitor web pages in languages other than the language of the training data.
Although the proposed model works well on both static and dynamic web pages in different languages, it does have some limitations. One limitation is that we use the MD5 hashing algorithm to detect changes in the external files embedded in monitored pages; because this method is sensitive to any change in these files, the model may generate many change alerts. The other issue is that change alerts are processed manually, which may cause delays in the processing flow.
For future work, we will carry out more experiments to validate the language independence of the proposed detection model. In addition, more attack signatures will be added to the initial set. Moreover, our next task is to implement a real-time monitoring and detection system for website defacements based on the proposed hybrid defacement detection model.