Defacement detection techniques can be divided into three categories: anomaly-based detection, signature-based detection, and machine-learning techniques. Traditional anomaly-based techniques include checksum comparison, DIFF comparison, and document object model (DOM) tree analysis; the simplest and fastest of these is checksum comparison. Checksum comparison uses a hash algorithm, such as MD5 or SHA1, and starts by building a profile: the hash value of the web page is calculated under normal conditions and stored in a database for later comparison. During monitoring, the current hash value of the web page is compared with the original value stored in the database. If the values match, the web page has not been modified and its integrity is confirmed. If the current hash value does not match the original value, the web page has been attacked and modified. This technique works well for static web pages.
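As an illustration of this workflow, the following minimal sketch (with a hypothetical URL, an in-memory dictionary standing in for the hash database, and SHA-256 standing in for MD5/SHA1) profiles a page and later re-checks its integrity:

```python
import hashlib
import urllib.request

def page_hash(url: str) -> str:
    """Download the page and return the hex digest of its content."""
    with urllib.request.urlopen(url) as response:
        return hashlib.sha256(response.read()).hexdigest()

# Profiling step: hash the page in its known-good state and store it.
baseline_db = {"https://example.com/": page_hash("https://example.com/")}

# Monitoring step: recompute the hash and compare it with the stored value.
def integrity_ok(url: str) -> bool:
    return page_hash(url) == baseline_db[url]

if __name__ == "__main__":
    status = "no modification detected" if integrity_ok("https://example.com/") else "possible defacement"
    print(status)
```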
DIFF comparison is a tool used to find the differences between the contents of two versions of a web page. For a dynamic web page, an appropriate change threshold must be determined for the result to be effective: with a well-chosen threshold the technique works well for dynamic pages, whereas without one it is suitable only for static pages. Document object model (DOM) tree analysis looks for changes in the web page structure; if any change in the structure is found, an alarm is raised. The detection is based on the HTML code rather than the content, and it works well for static web pages. In the signature-based technique, known attack patterns (rules) are stored and used to monitor the web page; if the page matches a stored rule, an attack alarm is raised. This technique is fast and efficient, but only for well-known types of attacks, and it cannot detect new kinds of attacks.
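The DIFF-with-threshold idea can be sketched as follows; the 15% threshold is an arbitrary example value, not one taken from the reviewed works:

```python
import difflib

CHANGE_THRESHOLD = 0.15  # tolerated fraction of change for a dynamic page (example value)

def change_ratio(baseline_html: str, current_html: str) -> float:
    """Return the fraction of the page content that differs from the baseline."""
    matcher = difflib.SequenceMatcher(None, baseline_html, current_html)
    return 1.0 - matcher.ratio()

def is_defaced(baseline_html: str, current_html: str) -> bool:
    """Flag the page only when the change exceeds the tolerated threshold."""
    return change_ratio(baseline_html, current_html) > CHANGE_THRESHOLD

baseline = "<html><body><h1>Welcome</h1><p>News of the day</p></body></html>"
current = "<html><body><h1>HACKED</h1><p>Defaced by ...</p></body></html>"
print(is_defaced(baseline, current))  # True: the change is far above the threshold
```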
Advanced anomaly-based techniques, such as machine learning, have played a vital role in classifying web pages as either defaced or normal. Different machine-learning methods have been proposed to classify web pages, including random forest (RF), decision tree, deep-learning techniques such as the convolutional neural network (CNN), gradient-boosted decision trees (GBDT), and the support vector machine (SVM). The key requirements are that the detection accuracy must be high and that false alarms must be reduced to less than 1%, while taking into account the high computing resources needed for large data sets unless a method is found to reduce them. Other defensive precautions can be taken to detect and monitor defacement attacks using available tools, such as the Nagios web application monitoring tool [13], the Site24x7 website defacement monitoring tool [14], and the WebOrion defacement monitor tool [15].
3.3. Machine-Learning-Based Techniques
Table 1 summarizes the reviewed works that are based on classification and machine-learning techniques. The findings from these studies are explained in detail below.
Table 1.
Defacement detection based on machine-learning algorithms.
Work | Algorithm | Data Set Size | Accuracy | FP |
---|---|---|---|---|
[20] | Detection mechanism applying the 2-gram method to the web page text content, with a threshold adjustment method to detect defacement. | 185 websites. | - | - |
[21] | Classification model based on three machine-learning algorithms: support vector machine (SVM), random forest (RF), and gradient-boosted decision trees (GBDT). | 1512 defaced and 3108 normal websites. | Over 95% | FP less than 1% |
[22] | Detection and classification model using n-grams and building a classifier based on the profile. | 100 normal and 300 defaced web pages. | Over 93% | FP less than 1% |
[23] | Defacement detection model based on combining machine-learning techniques and attack signatures. | 1200 normal English web pages (N1), 217 normal Vietnamese web pages (N2), and 1200 defaced web pages (D1). | 99.26% | FP about 0.27% |
[24] | Multi-layer defacement detection model based on four detection layers using machine-learning algorithms. | 800 normal English HTML files (N1), 200 normal Vietnamese HTML files (N2), 850 normal CSS files (E1), 850 normal JavaScript files (E2), and 2100 defaced web pages collected from Defacer.ID (D1). | Over 89.80% | FP less than 1.04% |
[25] | Convolutional neural network (CNN)-based detection model for website defacement. | 40,000 normal web pages and 60,000 defaced web pages. | 98.86% | 0.97% |
[26] | Combination model for website defacement detection using text and image features and deep-learning techniques. | 114,268 files of normal working web pages and 78,200 files of defaced web pages. | 97.49% | 1.49% |
The study by Wu et al. [21] introduced a novel classification model based on three machine-learning algorithms, namely the support vector machine (SVM), random forest (RF), and gradient-boosted decision trees (GBDT), to classify and distinguish whether a website has been defaced. The method builds a classifier by extracting web page and Trojan features, which are then used in the classification step. The classification model applies the three machine-learning algorithms (SVM, RF, and GBDT) to decide whether a website has been defaced, using cross-validation to divide a data set of 4620 websites from different sources. They evaluated the performance and accuracy based on four measurements: true positive (TP, a correctly classified defaced website), true negative (TN, a correctly classified legitimate website), false positive (FP, a legitimate website wrongly classified as defaced), and false negative (FN, a defaced website wrongly classified as legitimate). Wu et al. found notable differences between the three machine-learning algorithms in model accuracy and performance: the support vector machine (SVM) shows the best accuracy with a false positive (FP) rate of less than 1%, while random forest (RF) and gradient-boosted decision trees (GBDT) are less accurate. A strength of their study on website defacement detection based on the support vector machine is that they included the website Trojan features alongside the ordinal features. SVM was a suitable choice for handling complex functions; however, SVM performance deteriorates with a large data set because the training time increases, which may explain why they did not use a larger defaced data set. Another shortcoming is that website structure features and screenshot detection could have been considered to improve the accuracy.
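A minimal sketch of this kind of three-algorithm comparison, using scikit-learn with synthetic data in place of the web page and Trojan features extracted by Wu et al., might look as follows (the sample count, feature count, and fold count are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the extracted web page and Trojan features.
X, y = make_classification(n_samples=4620, n_features=20, random_state=42)

models = {
    "SVM": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
    "GBDT": GradientBoostingClassifier(random_state=42),
}

# Cross-validation, as in the reviewed study, to compare the three classifiers.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```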
Dau Hoang [22] proposed a website defacement detection method based on machine-learning techniques. He adopted the idea from Woonyon Kim et al. [20], who used the n-gram method and occurrence frequency to detect defacement in dynamic web pages and reduce false alarms, together with a threshold adjustment method to reduce false positive alarms (as the detection targeted dynamic web pages). They proposed two threshold mechanisms and compared their variants in the results. The results showed a good reduction in false alarms, but the approach had a major drawback: it was ineffective for highly dynamic web pages. Hoang [22] experimented with 2-gram and 3-gram data files and obtained the best detection accuracy by applying machine-learning techniques to normal and attacked web pages, treating both dynamic and static web pages as text documents (HTML files). The method has two phases, a training phase and a detection phase, applied to data sets collected from Zone-H.org. The training phase trains on data sets consisting of normal and attacked web pages; pre-processing extracts the features used for training. The machine-learning algorithms used are naive Bayes and the J48 decision tree, run via the Weka machine-learning tool, and the trained data set is used to build the classifier. In the detection phase, the method takes the web page URL, downloads its HTML source code, pre-processes the HTML page to extract its features, and passes them to the classifier generated in the first phase, which decides whether the web page has been attacked. The experimental results show that the proposed method achieved a high detection accuracy of over 93% and a low false positive rate of less than 1% for both the 2-gram and 3-gram data files. Using both 2-gram and 3-gram experiments is impressive and provides excellent results; the automatic learning of the detection profile and the ability to handle dynamic web pages are also major advantages of this research.
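To illustrate the n-gram profile idea, the sketch below uses scikit-learn's character n-gram vectorizer and multinomial naive Bayes on a tiny invented corpus; the original work instead used Weka's naive Bayes and J48 classifiers:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented corpus of HTML documents; 1 = defaced, 0 = normal.
pages = [
    "<html><body><h1>Company news</h1><p>Quarterly report</p></body></html>",
    "<html><body><h1>Welcome</h1><p>Our products and services</p></body></html>",
    "<html><body><h1>HACKED BY XYZ</h1><p>owned. greetz to the crew</p></body></html>",
    "<html><body><h1>Defaced</h1><p>security is an illusion</p></body></html>",
]
labels = [0, 0, 1, 1]

# Character 2-grams and 3-grams of the page text stand in for the n-gram profile.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 3)),
    MultinomialNB(),
)
model.fit(pages, labels)

print(model.predict(["<html><body><h1>hacked by someone</h1></body></html>"]))
```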
Another study by Hoang et al. [23] proposed a hybrid defacement detection approach based on machine-learning techniques and attack signatures, extending Hoang's [22] previous study. The machine-learning methods used were the supervised algorithms from Hoang's previous work [22], naive Bayes and random forest (RF), applied here to the classification of web page HTML code on a large group of data sets containing English and Vietnamese web pages. Previous studies by Hoang [22], Kim et al. [20], and Wu et al. [21] had not dealt with Vietnamese web pages, which makes this inclusion impressive. The contribution of Hoang et al. [23] is that the model can detect defacement with improved performance and low false alarm rates on both English and Vietnamese web pages. It also works better on static and dynamic web pages, so the detection threshold does not need dynamic updates. The signature-based part is fast and efficient for the list of stored known attack patterns, which can be updated whenever a new defacement attack is detected; they defined 50 attack patterns for their experiment. The method was the same as in the previous work by Hoang [22], consisting of two phases, training and detection, but a larger data set and two languages (English and Vietnamese web pages) were used; the large data set slows the method down and creates a high level of computation. During the training stage, after the normal web page HTML files are extracted, an MD5 hash is calculated for each file and stored in a hash database. The attack signatures are produced after a web page is classified as attacked: the page is processed manually to extract the common attack patterns, which are stored in the attack pattern database. In the detection stage, the monitored web page is matched against the stored hash files to detect whether there has been a change; if there is a change, an attack is reported, and if there is no change, no attack is reported. In their study of defacement detection based on machine-learning techniques and attack signatures, Hoang et al. [23] concluded that the proposed method can be applied to static and dynamic web pages in two languages and shows a high accuracy of more than 99.26% and a false positive rate of lower than 0.62%. The disadvantage of this research is that the MD5 hash algorithm is sensitive to any change, so a high number of alerts is raised for the monitored web page files; the manual processing of attack signatures is another issue that delays the process. The study would have been more interesting if it had included more than 50 attack signatures, as attacker scenarios change frequently.
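The hash-plus-signature flow can be roughly sketched as follows; the signature list, hash store, and page contents are illustrative, and the machine-learning step that handles changed pages without a signature match is omitted:

```python
import hashlib

# Illustrative attack signatures (the study used 50 manually extracted patterns).
ATTACK_SIGNATURES = ["hacked by", "defaced by", "owned by", "security is an illusion"]

# Hash database built in the training stage from known-good HTML files.
hash_db = {}

def register(url: str, html: str) -> None:
    hash_db[url] = hashlib.md5(html.encode("utf-8")).hexdigest()

def check(url: str, html: str) -> str:
    """Integrity check first, then signature matching on changed pages."""
    if hash_db.get(url) == hashlib.md5(html.encode("utf-8")).hexdigest():
        return "no change"
    lowered = html.lower()
    if any(sig in lowered for sig in ATTACK_SIGNATURES):
        return "defaced (signature match)"
    # A changed page with no signature match would be passed to the ML classifier.
    return "changed - pass to machine-learning classifier"

register("https://example.com/", "<html><body>Welcome</body></html>")
print(check("https://example.com/", "<html><body>HACKED BY xyz</body></html>"))
```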
The work by Hoang et al. [24] proposed a multi-layer model for website defacement detection. They devised an efficient and simple method that detects defacement in a web page, using an integrity check as the last step of their methodology, and achieved high accuracy in their experiment. The method builds on their previous works [22,23]. They proposed a four-layer model for the monitored web pages, starting with valuable data sets that contained 800 normal English HTML web pages (marked as N1), 200 normal Vietnamese HTML web pages (manually collected and marked as N2), 850 normal web page CSS files (marked as E1), 850 normal web page JavaScript files (marked as E2), and 2100 defaced web pages taken from Defacer.ID (marked as D1). The proposed model is divided into training and detection phases. In the training stage, they used the data sets of normal HTML, CSS, and JavaScript files and the defaced HTML files, and built an image hash database using the MD5 hash algorithm to be used later in the detection stage. By training the data sets with the random forest (RF) machine-learning algorithm, they built the integrated model, which is the classifier. The detection stage monitors modifications to the web page and consists of four layers: pre-processing of the HTML code, pre-processing of the CSS files, pre-processing of the JavaScript files, and an image integrity check. Each image file is checked for integrity: the original image hashes were calculated in the training stage and saved in the image hash database, and the monitored images are compared against it as follows. First, the system checks whether the image file exists; if not, the page is marked as defaced. If the file is found, its hash value is calculated; if the value matches the one in the database, no change is reported, and if it does not match, the page is marked as defaced because changes have been made. In this way, the model detects defacement on the web page; see Figure 7 for the detection phase. A possible speculation about what would happen if the hash database were hacked is that any subsequent file matching would have integrity issues: the hacker could upload a file of his own into the database so that it would be found at the file-matching step, and he could also add a hash calculation of his own file so that it would be accepted at the hash-matching step. No changes would then be detected for the embedded images of the monitored web page. Moreover, any change in the hash database would affect the results by making a file match or mismatch, so a file might incorrectly pass as 'No changes detected' or 'Changes detected'. The possibility of the embedded image files being hacked or modified must therefore be considered in the detection system. The experimental results on the data sets show a high detection accuracy (over 98.80%) and a reduced false positive rate of less than 1.04%. The approach by Hoang et al. [24] is very helpful because they used the integrity check, which is an efficient method to detect changes.
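The image-integrity layer described above can be sketched as follows; the file paths and the stored digest are placeholders, and MD5 is used as in the reviewed work:

```python
import hashlib
from pathlib import Path

def md5_of_file(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()

# Built in the training stage from the original embedded images (placeholder digest).
image_hash_db = {"img/logo.png": "d41d8cd98f00b204e9800998ecf8427e"}

def check_image(relative_path: str, web_root: Path) -> str:
    """Layer-4 style check: a missing file or a hash mismatch means defacement."""
    file_path = web_root / relative_path
    if not file_path.exists():
        return "defaced (image file not found)"
    if md5_of_file(file_path) != image_hash_db.get(relative_path):
        return "defaced (image hash mismatch)"
    return "no changes detected"

print(check_image("img/logo.png", Path("/var/www/html")))  # example invocation
```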
Another study by Dau Hoang et al. [25] built on previous works [22,23,24] and proposed a convolutional neural network (CNN)-based model for detecting website defacement. This defacement detection model is an alternative to the traditional machine-learning algorithms, such as random forest and decision tree, which were used in the earlier works [22,23,24]. Instead of the traditional supervised machine-learning algorithms, the model uses the CNN algorithm, which is a deep-learning method. Their contribution is detecting defacement on a collected data set that contains both normal and defaced web pages while reducing false alarms, resulting in a detection accuracy of 98.86%. The output of the proposed method decides whether a web page is defaced. The methodology follows [22,23,24]: the training features are extracted from each web page's content, including the HTML code, using the n-gram method (2-gram and 3-gram). The detection model has two stages, a training stage and a detection stage, applied to web page data sets collected from Zone-H.org that contain both normal and attacked web pages. The model is trained using the CNN algorithm to produce the output. The performance measurements show that the CNN algorithm is better than the traditional algorithms, and false alarms are reduced to 0.97% in this approach. The only disadvantage of this approach is the high level of computing resources required by the functions used in the CNN model.
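A minimal sketch of a 1D CNN text classifier of this general kind, written with Keras on random stand-in data (the vocabulary size, sequence length, and layer sizes are arbitrary choices, not those of the cited paper):

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 5000   # number of distinct n-gram tokens (arbitrary example)
SEQ_LEN = 200       # n-gram tokens kept per page (arbitrary example)

# Random stand-in data: rows are n-gram token id sequences, labels 1 = defaced.
x_train = np.random.randint(0, VOCAB_SIZE, size=(1000, SEQ_LEN))
y_train = np.random.randint(0, 2, size=(1000,))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),
    tf.keras.layers.Conv1D(128, 5, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # defaced vs. normal
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, batch_size=32, verbose=0)
print(model.predict(x_train[:1]))  # probability that the page is defaced
```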
Nguyen et al. [26] proposed detecting website defacement attacks using web page text and image features. They proposed a combination model for website defacement detection that uses text and image features together with deep-learning techniques, and they contributed by detecting defacement with a high accuracy of 97.49% while reducing false alarms to 1.49%. The method is a combination model consisting of two stages, a training stage and a detection stage. The training stage trains the data sets, which are either normal or defaced web pages: a total of 57,134 HTML files and 57,134 screenshot image files were collected from normal working web pages, and 39,100 HTML files and 39,100 screenshot image files were collected from defaced web pages. Both text and image features are extracted for training by extracting the pure text and capturing the screen. The next step is training, pre-processing the text with the BiLSTM algorithm and the images with the EfficientNet algorithm. The detection stage, shown in Figure 8, starts by monitoring the web page's HTML code and then extracting features by pre-processing them for two classifiers. The No. 1 text classifier and the No. 2 image classifier are combined using the late fusion method to obtain the result, which is either a normal or a defaced web page. The model, based on deep-learning techniques and the BiLSTM and EfficientNet algorithms, achieved high detection accuracy and reduced false alarms better than the previous works [22,23] discussed earlier. In Nguyen et al.'s study of detecting website defacement attacks using web page text and image features, combining the text and image features leads to excellent detection results [22,23]. It has one shortcoming: it requires more computational resources, as highlighted above. What can be seen from the accuracy results is that the defacement detection model based on combining machine-learning techniques and attack signatures [23] is the best among the reviewed works, thanks to its accuracy rate of 99.26% and false positive (FP) rate of about 0.27%. Hoang et al. succeeded because their results depend on strong techniques.
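The late-fusion step itself can be illustrated with a simple sketch that combines the probabilities produced by the two classifiers; the weights and probability values below are invented, and the actual classifiers in the reviewed work are BiLSTM and EfficientNet models:

```python
def late_fusion(p_text: float, p_image: float, w_text: float = 0.5, w_image: float = 0.5) -> str:
    """Combine the text and image classifier probabilities into one decision."""
    combined = w_text * p_text + w_image * p_image
    return "defaced" if combined >= 0.5 else "normal"

# Example: the text classifier is unsure, but the screenshot classifier is confident.
print(late_fusion(p_text=0.45, p_image=0.90))  # -> "defaced"
```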
Figure 7 and Figure 8 illustrate the detection stages of the algorithms the authors used to describe and implement their proposed methods. In general, they combined these algorithms with others and used different ways of training the data into the models to classify it. The web page HTML code file, CSS code file, JavaScript code file, and even the embedded image files must all be considered in a machine-learning-based defacement detection method.
3.4. Based on Other Tools
Throughout the years, a wide variety of techniques have been proposed for detecting and preventing website defacement. Machine-learning detection techniques perform well, but they can be resource-consuming and slow because they need to be fed with data sets and often apply more than one technique at once to give accurate results. Therefore, several tools have been developed to detect and prevent website defacement, and these tools run fast compared with machine-learning techniques.
Figure 9 shows how the tools work. Table 2 and Table 3 present a review of the studies that have used tools to detect or prevent website defacement attacks.
Table 2.
Defacement detection based on other tools.
Work | Algorithm | Advantages | Shortcomings |
---|---|---|---|
[27] | The web defacement and intrusion monitoring tool (WDIMT) was divided into three layers: a presentation layer for the Linux terminal, a business layer, and a data access layer. | It allows users to examine the status of each web page, which is indicated by a color. | The tool is executable only in a Linux environment, which is considered a limitation. |
[28] | The method combined monitoring of the local area network (LAN) with remote monitoring. It is based on hash functions (MD5) and Boyer–Moore string matching and implements the C4.5 algorithm. | Easy to use. | The authors hope for future improvements so that the application becomes suitable for more types of websites. |
[29] | Using a random monitoring technique, they created two random monitoring algorithms: (1) a common monitoring algorithm (URMA) that chooses one slot for each MR according to the uniform distribution; the selected slot is monitored for defacement, while non-selected slots are not monitored, and (2) an attack-damage randomized monitoring algorithm (ADRMA). Using the Python 3 programming language, they implemented three intelligent web defacement attack models (AM1, the most aggressive; AM2, moderately aggressive; and AM3, the least aggressive). | Very fast. | Future studies can expand the scope of the research by investigating a wider range of detection and monitoring systems used in different networks. |
[30] | The method was the dork search engine technique, used to crawl infected web pages by ascertaining the underground ads in search engines. They presented a four-stage design for the system. | The experimental results show that the model, thanks to its high quality, can provide the data set with a high detection rate; from our point of view, it is distinguished. The low false alarm rate is also a positive. | The ads were assumed to be limited to unknown or less popular sites; this is questionable because breaches were numerous and occurred on most websites, even well-known ones. |
Table 3.
Defacement detection based on other tools.
Work | Algorithm | Advantages | Shortcomings |
---|---|---|---|
[31] | The method addressed malicious JavaScript hidden in web source code and introduced a scanning system that collectively examines URLs. | Effective response to tampering attacks that insert URLs. | A future research plan is to examine many web source codes to measure the accuracy and speed of the proposed scanning algorithm. |
[32] | The proposed software system utilizes the operating system's kernel feature for file system monitoring to detect changes (add, delete, modify). The entire web hash is stored within an NV-RAM memory location. | The preliminary study's findings demonstrate that the system is proficient at identifying alterations and can classify the operations being carried out. | Full implementation of the system is left for future work. |
[33] | The model includes an offline integrity check engine, a control variable system, a forced backup directory, and a notification system. A. The offline integrity check engine's job is to examine the website's data integrity. B. Integration of a control variable, created to act as a control when a website is updated. C. Basic defense mechanism. | Prevention and self-protection mechanism. | The implementation of the model and its validation through various tests and experiments will be the focus of future efforts. |
[34] | They collected "similar" defaced pages to identify campaigns. The first step was to extract the raw content from defaced web pages. Second, they detected campaigns as groups of "similar" pages by clustering. Finally, they gave each cluster a label and displayed the campaigns on various dimensions. | Their approach helped simplify the work of analysts by automatically identifying defacement campaigns. | Their approach was not effective on all types of web pages. |
[35] | They used three tools for penetration testing and vulnerability assessment of web applications by scanning web application pages and extracting vulnerabilities (the tools were Acunetix, Burp Suite, and OWASP Zed Attack Proxy (ZAP)). | The tools they used were effective at finding vulnerabilities. | They were limited to discovering specific vulnerabilities. |
Mfundo et al. [27] discussed one of the most common attacks on websites, namely defacement. These threats try to change the web page's content or make the website unavailable. Mfundo et al. proposed a web defacement and intrusion monitoring tool (WDIMT) that detects defacement, provides rapid defacement alerts, and allows rapid re-uploading of a web page's original content. The WDIMT's system architecture is divided into three layers: the presentation layer for the Linux terminal, which shows the user's information and executes commands; the business layer, where most database and presentation layer interactions take place; and the data access layer, which holds the database containing user data and a hash of each web page. The WDIMT website allows users to examine the status of each web page, which is indicated by a color, and the WDIMT uses a Linux terminal to execute commands. The WDIMT website's visual representation of any defacement allows users to identify defaced pages and request that the original content of those pages be uploaded again. The tool is executable only in a Linux environment, which is considered a limitation.
Figure 10 shows a flowchart of the WDIMT process.
In [28], Tran Dac Tot et al. discussed the importance of promptly detecting changes in the interface and content of a site to reduce the damage caused by an attack, given that this type of attack spreads through viruses and malware or through the exploitation of web loopholes and weaknesses; therefore, they suggested a method combining monitoring in the local area network (LAN) with remote monitoring. This method combines several factors, such as server and database monitoring, and is built on hash functions (MD5 is used for hashing because of its high speed) and string-matching techniques. They used Boyer–Moore to find content changes based on the differences between two HTML documents of the same web page at two different times. At the same time, the method implements the C4.5 algorithm, an improved algorithm that allows processing data sets with numeric attributes, to increase the accuracy of the security alerts. The proposed implementation initially collects data, obtains the information stored on the system and the information received when querying the internet environment, and then compares the two. If they are the same, there is no change and the site is safe; if they are different, the DNS parameters have been modified and the system raises an alert by sending the warning results to the three site administrators through two channels (e-mail and SMS). They stated that this method is easy to use and warns the administrator of any change in the content, and they hope in future studies to improve it so that the application becomes suitable for more website types.
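As a rough sketch of the hash-then-string-matching idea (not the authors' implementation), the snippet below compares MD5 digests of two snapshots and, when they differ, uses a simple Boyer–Moore–Horspool search to locate a suspicious marker string in the new content:

```python
import hashlib

def horspool_find(text: str, pattern: str) -> int:
    """Boyer-Moore-Horspool search; returns the index of pattern in text or -1."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1
    shift = {pattern[i]: m - 1 - i for i in range(m - 1)}
    i = m - 1
    while i < n:
        k = 0
        while k < m and pattern[m - 1 - k] == text[i - k]:
            k += 1
        if k == m:
            return i - m + 1
        i += shift.get(text[i], m)
    return -1

old_html = "<html><body>Faculty news</body></html>"
new_html = "<html><body>hacked by someone</body></html>"

if hashlib.md5(old_html.encode()).hexdigest() != hashlib.md5(new_html.encode()).hexdigest():
    pos = horspool_find(new_html, "hacked by")
    print("content changed; marker found at index", pos)
```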
The study [29] by Youngho mentioned that cyber-attacks have become dangerous and varied and that, no matter how many methods are used to prevent and detect them, smart attackers exploit weaknesses and harm the reputation of a site; the authors found vulnerabilities that smart attackers can exploit through deep exploration of current client-side defenses. In this study, they introduced new intelligent web defacement attacks that exploit existing vulnerabilities in client-side detection methods. They then proposed a random monitoring technique as a potentially effective defense against such attacks and created two random monitoring algorithms: (1) a common monitoring algorithm (URMA) that chooses one slot for each MR according to the uniform distribution, monitors the selected slot for defacement, and leaves non-selected slots unmonitored, and (2) an attack-damage randomized monitoring algorithm (ADRMA). Additionally, they presented the results of large-scale experiments to validate their idea and demonstrate the detection performance of their randomized monitoring algorithms. Their results show that the random monitoring detection algorithms can quickly detect and shut down many intelligent web defacement attacks and thus do not allow significant attack damage, in terms of the number of obfuscated slots, compared with an existing static FPMA approach. The attack scenario is as follows. First, an attacker (A) uses vulnerabilities in the web server (WS) to maliciously alter one or more web pages that the server has saved. For example, A inserts a malicious link to a malware file stored on a malicious server (M) that cooperates with A. System administrators and regular users cannot readily detect that this malicious link is embedded inside the misconfigured web page because of the way it was injected into the file. When a web user (U) accesses the WS, the injected malicious link automatically connects U to the external malicious server M, from which the malware is subsequently downloaded. To implement their idea, they used the Python 3 programming language and implemented three intelligent web defacement attack models (AM1, the most aggressive; AM2, moderately aggressive; and AM3, the least aggressive). The results of their experiments show that the proposed randomized monitoring algorithms can quickly identify and stop a variety of intelligent web defacement attacks, although their detection performance varies slightly depending on their design features. Finally, future studies can broaden the scope of this research by investigating a larger range of detection and monitoring systems employed in various networks. They also want to investigate whether their random detection methods might help them more effectively defend against adversaries who deliberately try to circumvent their defenses.
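A toy interpretation of uniform random slot monitoring, as described above but not taken from the authors' code, is sketched below: in each monitoring round one slot is chosen uniformly at random and only that slot is checked:

```python
import random

NUM_SLOTS = 10            # monitored page regions ("slots"); example value
defaced_slots = {3, 7}    # slots an attacker has modified (for the simulation)

def monitoring_round(rng: random.Random) -> bool:
    """Pick one slot uniformly at random and check only that slot."""
    slot = rng.randrange(NUM_SLOTS)
    return slot in defaced_slots  # True means the defacement was caught this round

rng = random.Random(0)
rounds_until_detection = 1
while not monitoring_round(rng):
    rounds_until_detection += 1
print(f"defacement detected after {rounds_until_detection} monitoring round(s)")
```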
The study by Yang et al. [30] discusses infection promotion, an attack that exploits a website vulnerability to inject illegal content into web pages in order to promote malicious, invalid, and fake products. They proposed a new and effective approach to automatically collect infected web pages. One of the goals of the adversaries behind infected sites is to promote their hidden content, mainly through search engines. Previous studies of similar infection detection ideas found that an adversary cannot place illegal and unlicensed ads on a legitimate and licensed website, but, on the other hand, it is easy for adversaries to penetrate less secure sites via SQL injection, XSS, or other known methods. The proposed method is the dork search engine technique, used to crawl infected web pages by ascertaining the underground ads in search engines. Starting with 58 initial black keywords, they collected 22,939 infected pages spanning 2563 domains and automatically aggregated 8374 new black keywords along the way. They presented a four-stage design for the system. In the first stage, an extensive list of black keywords used in fake ads was created. Second, the list was sorted based on the probability of an adversary using these words. Third, these keywords were given priority, and pages were then crawled using these suspicious keywords by taking advantage of the powerful APIs of the search engine. Finally, they identified a set of features to build a machine-learning model that distinguishes infected web pages from normal pages; the resulting data set was extensive. The experimental results show that the model, thanks to its high quality, can provide the data set with a high detection rate, and from our point of view it is distinguished; the low false alarm rate is also a positive, as it does not inconvenience administrators without benefit. On the other hand, the ads were assumed to be limited to unknown or less popular sites, which is questionable because breaches were numerous and occurred on most websites, even well-known ones.
Park et al. [31] were interested in web defacement attacks, which make a series of unauthorized changes to a web home page through unauthorized means, whether to achieve political purposes or simply to show that the attacker could hack the site and alter the home screen image; there are many types of such attacks. One attacker method involves entering a malicious URL that automatically and secretly distributes malicious code to multiple private computers. Rather than injecting the URL in plain form, where it would be visible in the web source code, attackers enter it secretly so that they can continue attacking for a long time without being easily detected by officials; the most widely used method for this is the JavaScript obfuscation technique. JavaScript obfuscation was originally developed to protect the server-side JavaScript code that generates URLs, but it also allows attackers to cover up their malicious behavior: the code is converted, for example, to hexadecimal and inserted into the page, hiding it from firewall and IPS JavaScript analysis techniques. When a user accesses a web page that contains such code, the browser processes it, decodes it into the malicious URL, and activates it; after that, the malicious code is secretly downloaded and installed without the administrators knowing, and even when it is detected, determining whether the JavaScript is malicious is difficult. URL analysis is not something a general user performs, and existing approaches have disadvantages: their detection accuracy is unclear and inaccurate, and there are restrictions on the accessed web pages. In this paper, they address malicious JavaScript hidden in web source code and introduce a scanning system that collectively examines URLs. The system runs periodically, finds URLs hidden in JavaScript code, and runs a similarity comparison algorithm against a malicious URL repository to determine whether each URL is malicious, suspicious, or normal. Infected computers become botnets controlled by attackers: important information stored on the computer is continuously leaked to the attacker, or the machines are mobilized for a DDoS attack. That is why an effective response to tampering attacks that insert URLs is so important. In the future, the researchers aim to examine many web source codes to measure the accuracy and speed of the proposed scanning algorithm.
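A simplified sketch of such a scan is shown below; the regular expression, the decoding step, the similarity thresholds, and the malicious URL repository are all illustrative rather than the authors' system:

```python
import difflib
import re

MALICIOUS_URLS = ["http://malware-host.example/payload.js"]  # illustrative repository

HEX_STRING = re.compile(r"(?:\\x[0-9a-fA-F]{2})+")

def decode_hex_literals(js_code: str) -> list[str]:
    """Decode \\xNN escape sequences that may hide URLs in JavaScript source."""
    decoded = []
    for match in HEX_STRING.findall(js_code):
        decoded.append(bytes.fromhex(match.replace("\\x", "")).decode("latin-1"))
    return decoded

def classify(url: str) -> str:
    """Compare a decoded URL with the repository using a similarity ratio."""
    best = max(difflib.SequenceMatcher(None, url, bad).ratio() for bad in MALICIOUS_URLS)
    if best > 0.9:
        return "malicious"
    return "suspicious" if best > 0.6 else "normal"

js = r'var u = "\x68\x74\x74\x70\x3a\x2f\x2f\x6d\x61\x6c\x77\x61\x72\x65\x2d\x68\x6f\x73\x74\x2e\x65\x78\x61\x6d\x70\x6c\x65\x2f\x70\x61\x79\x6c\x6f\x61\x64\x2e\x6a\x73";'
for candidate in decode_hex_literals(js):
    print(candidate, "->", classify(candidate))
```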
Viswanathan et al. [32] proposed a system combined with a secure hardware element known as the trusted platform module (TPM) for authenticating modifications to dynamic and active website content. The paper proposes a software system that utilizes an operating system's kernel feature for file system monitoring to detect changes (add, delete, or modify). The TPM implementation stores cryptographic keys and other sensitive data in its memory, and the platform assigns these primary keys for security purposes. The TPM reaches clients in a disabled and inactive state to prevent it from being misused. In this approach, the researchers mentioned that the entire web hash is stored within the TPM's NV-RAM memory location; hence, only the original web administrator, who is also the owner of the platform, can save to that memory location. The preliminary study's findings demonstrate that their system is proficient at identifying alterations and can classify the operations being carried out.
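A minimal stand-in for the file-monitoring idea is sketched below, polling with hashes instead of using the kernel notification feature and TPM storage of the reviewed system; the web root path is illustrative:

```python
import hashlib
import time
from pathlib import Path

WEB_ROOT = Path("/var/www/html")  # illustrative path

def snapshot(root: Path) -> dict[str, str]:
    """Map every file under the web root to the hash of its content."""
    return {str(p): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in root.rglob("*") if p.is_file()}

def diff(old: dict[str, str], new: dict[str, str]) -> None:
    """Report added, deleted, and modified files between two snapshots."""
    for path in new.keys() - old.keys():
        print("added:", path)
    for path in old.keys() - new.keys():
        print("deleted:", path)
    for path in old.keys() & new.keys():
        if old[path] != new[path]:
            print("modified:", path)

if __name__ == "__main__":
    baseline = snapshot(WEB_ROOT)
    while True:              # the real system reacts to kernel events instead of polling
        time.sleep(10)
        current = snapshot(WEB_ROOT)
        diff(baseline, current)
        baseline = current
```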
Mao et al. [33] mentioned that website defacement occurs in five steps, including recognition, intrusion, use, capture, and intrusion. The researchers proposed an approach focused on detection, as well as on prevention and self-protection mechanisms, to reduce the impact of website defacement. They stated that their proposed model is proactive and responsive. The model includes an offline integrity check engine, a control variable system, a forced backup directory, and a notification system, and the technique proposes a prevention and self-protection system based on a highly available architecture. A. Offline integrity check engine: its job is to examine the website's data integrity; it is a client/server application in which a sensor (a Python script) installed on the monitored web server serves as the client component. B. Integration of a control variable: this variable was created to act as a control whenever one of the tracked websites changes legitimately (an update); inspired by the serial setting of DNS systems, it is incremented by the website administrators at each valid alteration. The client script on the web server automatically updates the hashes of the website files when the variable is changed, and the previous reference hashes file on the offline server is replaced with the newly created one. C. Basic defense mechanism: once an attack is identified, the system can regenerate itself thanks to this fundamental protection mechanism. Finally, the implementation of the model and its validation through various tests and experiments will be the focus of future efforts.
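The control-variable logic can be sketched as follows (a simplified interpretation with invented names and in-memory storage): a hash mismatch counts as defacement only when the administrator has not incremented the control variable; otherwise, the reference hashes are refreshed:

```python
import hashlib

class IntegrityChecker:
    """Simplified offline integrity check with a DNS-serial-style control variable."""

    def __init__(self, files: dict[str, str], serial: int) -> None:
        self.reference = {name: hashlib.sha256(data.encode()).hexdigest()
                          for name, data in files.items()}
        self.serial = serial  # incremented by the administrator at each valid update

    def check(self, files: dict[str, str], serial: int) -> str:
        hashes = {name: hashlib.sha256(data.encode()).hexdigest()
                  for name, data in files.items()}
        if serial > self.serial:          # legitimate update: refresh the references
            self.reference, self.serial = hashes, serial
            return "valid update, reference hashes refreshed"
        if hashes != self.reference:      # change without a serial bump: defacement
            return "defacement detected, restore from backup"
        return "no change"

checker = IntegrityChecker({"index.html": "<h1>Welcome</h1>"}, serial=1)
print(checker.check({"index.html": "<h1>HACKED</h1>"}, serial=1))      # defacement detected
print(checker.check({"index.html": "<h1>New design</h1>"}, serial=2))  # valid update
```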
In 2018, Maggi et al. [34] published a paper in which they described the visible traces a defacer leaves in a website defacement. They collected "similar" defaced pages to identify campaigns and used data clustering as the core of their analysis system. The first step in their approach, for any type of website (dynamic or static), is to extract the raw content from the defaced web pages. Second, they detected campaigns as groups of "similar" pages by clustering; the clustering features are visual, structural, geographical, domain, social, and title features. Finally, they gave each cluster a label and displayed the campaigns along various dimensions. The clustering required 35 h to perform in total, 2 of which were spent on labeling and visualization. They concluded from their automatic approach that many similar defaced websites are produced by campaigns organized by attackers for political and commercial purposes, as well as other reasons.
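A toy sketch of the clustering idea is shown below, using TF-IDF of the raw page text and DBSCAN; this is far simpler than the visual, structural, geographical, and social features used by Maggi et al., and the sample pages are invented:

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented raw contents of defaced pages; similar wording suggests the same campaign.
pages = [
    "hacked by team alpha - free our leader - greetz to crew",
    "hacked by team alpha ~ free our leader ~ greetz crew",
    "owned by solo defacer, admin patch your cms",
    "hacked by team alpha! free our leader. greetz to the crew",
]

features = TfidfVectorizer().fit_transform(pages)
labels = DBSCAN(eps=0.8, min_samples=2, metric="cosine").fit_predict(features)
print(labels)  # pages sharing a label form one candidate campaign; -1 means noise
```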
In 2017, Nagpure et al. [35] pointed out some methods of penetration testing and vulnerability assessment for web applications using several tools. First, they defined web vulnerabilities and their types (SQL injection, cross-site scripting, session hijacking, privilege escalation, browser replay attack, insufficient session expiration, session fixation, directory traversal, authentication bypass, cross-site request forgery, click-jacking, and browser cache weakness). Second, they mentioned the testing methods, either manual or automatic; when they used automated testing, they had to use software to scan every web application page. They mentioned three programs used to find vulnerabilities or test web penetration. Acunetix is effective in penetration testing and vulnerability assessment, but it costs money and does not provide a passive scan. Another tool is Burp Suite, which is effective in vulnerability assessment but not in penetration testing, and it is available both for purchase and for free. The last tool is OWASP Zed Attack Proxy (ZAP), which is effective for vulnerability assessment, can perform passive or active scans, and is available both free and for purchase. They then compared the tools and their results; each tool detected certain vulnerabilities. Finally, we benefited from this paper through the tools it mentions for vulnerability testing and penetration testing.
Table 4 explains the machine-learning-based defacement detection methods and the basic algorithms used.