1. Introduction
The impact of smartphones on our daily lives is growing rapidly, and their popularity and usage rates have risen significantly because of the wide range of functions they provide. As of 2022, Statista [1] estimates that there are 6.56 billion mobile phone users worldwide. Indeed, UN statisticians [2] predicted that by October 2022, approximately 82% of a world population of 8 billion would own a smartphone. Android is the most widely used smartphone operating system, with a 71.85% market share as of July 2022, according to Statista [3]. This results from Android’s open-source nature and other characteristics.
Given its open-source nature, it is inevitable that an operating system with such widespread use would attract malware [4], and mobile malware has emerged as a result. The term “malware” describes software installed on a device for purposes such as annoying users, stealing personal user data, consuming system resources without permission, or harming the device [5]. After being installed on the target device, these programs operate in the background, either visibly or invisibly to the user, to fulfill their intended functions [6]. During the first two quarters of 2021, more than 2.2 million new instances of mobile malware were reported, according to Statista [7]. Given these facts, Android malware clearly represents a significant threat to mobile systems.
It is widely known that smartphones hold personal data about their users, including banking details, confidential information, and health data. The security and integrity of these sensitive data are therefore compromised if a smartphone is exposed to malware. Once planted on a smartphone, malware can obtain the user’s bank account details if the user relies on online banking apps, as well as health records and other private information. It can also send emails and SMS messages from the device without the user’s consent to spread itself to even more users. Given these capabilities, malware developers can be expected to target smartphones more and more frequently. The fact that the frequency of mobile malware rose by approximately 1800% in 2016 is among the most concrete indicators of this problem [8].
These issues have led to a wide range of models being proposed for malware detection on Android mobile devices. In the past, methods based on signatures have emerged. These methods search the application code to detect known patterns of malicious code. Although signature-based approaches are currently effective in terms of their performance and in detecting malware that has already been identified, they cannot be used to detect new types of malware, and as a result, more modern techniques and models have replaced them. These models can be examined in three categories:
Static malware detection methods.
Dynamic malware detection methods.
Hybrid malware detection methods.
All three of the aforementioned methods are frequently applied using machine-learning-based techniques, which are vital because of their ability to detect previously unknown malware [9]. Because static detection techniques evaluate the code without running the suspicious application, they are unable to detect malicious programs that obfuscate or encrypt their code [10]. Approaches based on dynamic detection, on the other hand, run the application in a secure environment to search for any malicious behavior [11]. However, dynamic detection increases the performance cost and extends the execution time, which makes real-time analysis challenging [12]. Both strategies therefore have advantages and disadvantages relative to each other. Although static approaches offer a thorough analysis of the code, they struggle to detect changes in the application code that take place while the application is running, whereas the time and performance costs of dynamic approaches are quite high. Hybrid methods have been developed to overcome the disadvantages of both.
The FL-BDE model was developed in this study as a strong and effective malware detection tool to reduce the negative effects of malware on Android operating systems. The proposed model interprets and combines the scores of six ML-based methods, namely neural network (NN), support vector machine (SVM), decision forest (DF), logistic regression (LR), Bayes point machine (BPM) and boosted decision tree (BDT), in two separate groups of three, according to whether the incoming data are voted as benign or malicious. The contributions of this study to the literature are summarized as follows:
To develop a strong and dynamic model, the results of ML-based approaches with high classification performance are integrated over the FIS, which has an effective inference capacity.
Depending on whether the data obtained through the voting and routing process were voted as benign or malicious, our model combines the results of only three ML-based algorithms. While this allows us to increase the performance of our model by taking advantage of the scoring power of the six methods, it also simplifies the design of the FIS model by reducing the number of inputs.
The model proposed in this study outperforms comparable studies in the literature, which indicates that it has a strong and dynamic structure.
The rest of the study is organized as follows. A brief overview of recent research on the subject is provided in Section 2. The dataset, the proposed model and its features, the ML-based methods and the performance metrics used to assess the experiments are described in Section 3. The outcomes of the experiments are discussed along with the literature in the “Results and Discussion” section. The last section concludes the study and offers suggestions for further research on the topic.
2. Literature Review
Malware is defined as software that was created with malicious intent and serves this function. The widespread use of mobile devices and the increasing number of users have caused malware developers to focus their attention on this area [
11]. Malware can have a variety of purposes. Some of these malicious purposes can be categorized as disrupting the normal operation of the Android operating system it is working on, obtaining the user’s personal and sensitive information illicitly and without the user’s consent, seizing the user’s device, obtaining information for ransom, or displaying unwanted advertising content to the user. To stop malware developers from achieving these goals, a lot of research has been carried out in the field of malware detection. The three main categories of studies conducted in this context are static analysis, dynamic analysis, and hybrid analysis.
Static analysis techniques examine an application’s source code without running it [13] to detect malicious behavior in the suspicious software. Predominantly, these methods use either permission-based [14] or signature-based [15] techniques. Signature-based approaches detect malware by comparing the application code against a database of previously known malicious code fragments. Permission-based approaches, in contrast, compare the permissions requested by the application with the permissions frequently requested by malicious applications. Although static analysis methods are faster than dynamic analysis methods [16], they provide little information about an application’s behavior while it is running because they cannot detect code that the application loads dynamically during execution [17].
On the other hand, dynamic analysis techniques involve running the program in a secure, isolated environment and analyzing its behavior to determine if it is malicious or not [
18]. Dynamic analysis methods detect malware by examining features such as system calls [
19] and network traffic [
20] that occur while the application is running.
Martín et al. [
21] created a model utilizing a combination of static analysis and dynamic analysis, employing both a Random Forest classifier and a Bagging classifier. With this fusion model, researchers have achieved an accuracy of 89.7% and a precision of 89.7%. Mat et al. [
22] developed a Bayesian classifier and used system-based permission features to detect malicious mobile applications. The proposed model achieved a precision of 91.1%, an F-measure of 91% and an accuracy of 94%. In the Machine Learning and Natural Language Processing-based model created to detect malware, Nguyen et al. [
23] examined the behavioral characteristics of Android smartphone users and the anomalies in these behaviors. They obtained 99.8% accuracy, 90% precision, 97.6% recall and a 93.6% F-measure by incorporating SVM into their proposed model.
In the study of Lu and Wang [
24], the network traffic generated by the application and the network traffic matrix produced by the CNN deep learning model were both evaluated using the model, which the researchers called F2DC. The proposed model achieved an F-measure of 96.08%.
Amer and El-Sappagh [25] created a hybrid model that analyzes API and system calls using dynamic analysis techniques and LSTM. In addition, they used ensemble machine learning to examine the Android permissions and find malware by combining Random Forest (RF), MLP, AdaBoost, SVM and Decision Tree (DT) classifiers. The proposed model reached an accuracy of 99.3% and an F-measure of 99%. Yang et al. [26] analyzed characteristics of Android software, such as permissions and activities, and detected malware using a contrastive learning model. This model was built on bidirectional LSTM (Bi-LSTM), and the feature extraction was carried out using a Text-CNN-based model. The researchers reported that the proposed model achieved an accuracy of 97.53%, a precision of 96.66%, a recall of 98.41% and an F1-Score of 97.53% in evaluations on the AMGP and Drebin datasets.
Jerbi et al. [27] transformed the process of generating rules for malware detection into a bi-level optimization problem. Their proposed Bi-Level Malware Detection (BMD) model achieved an F-measure of 97.79% and an accuracy of 98.18%.
With the method that they called DEEPSEL, Azad et al. [
28] evaluated whether the application was malicious or not by evaluating several features of the dataset. The name of the model refers to the deep feature selection process they utilized in the process. In this study, machine learning-based algorithms and deep learning (DL)-based models were combined to perform the classification, and Particle Swarm Optimization was utilized for the feature selection. As a result of the evaluation performed on the CICAndMal2017 [
29] dataset, the proposed model obtained 83.6% accuracy, 82.4% precision, 82.5% recall and an 82.5% F-measure.
D’Angelo et al. [
17] detected malware by training two artificial neural networks, which they built using two-dimensional API images. These images serve as a signature of a program’s activities over time. Subsequently, the features with the biggest impact on the findings were selected from the given features using an autoencoder and conventional artificial neural networks. The accuracy, precision and F-measure metrics for this model, which consists of two encoders and a SoftMax artificial neural network, were 94%, 98% and 97%, respectively.
Taha et al. [30] developed a novel fuzzy integral-based multi-classifier ensemble model for Android malware detection. They combined the results of the XGBoost, RF, DT, AdaBoost and LightGBM classifiers using the Choquet fuzzy integral. Experiments on a dataset of 9476 benign and 5560 malicious applications showed that their approach achieved an accuracy of 95.08%, higher than that of any of the classifiers used individually. Mohamad Arif et al. [18] applied a risk-based fuzzy analytical hierarchy process in their Multi-Criteria Based Decision System for mobile malware detection. This static-analysis methodology evaluated permission-based features with the aim of increasing the user’s awareness of high-risk permissions through a risk analysis. The accuracy for evaluations on the Drebin and AndroZoo datasets was 90.54%.
Mazaed Alotaibi and Fawad [31] built a Multifaceted Deep Generative Adversarial Networks (MDGAN) model for effective malware detection. In the first stage of their three-stage model, they converted the APK files into a binary image and an API sequence. They then passed the image to GoogleNet and the API sequence to an LSTM network to extract the distinctive and stable features, respectively. Finally, they fed the combined features to a Generative Adversarial Network (GAN) to determine whether the application was malicious. They used the AndroZoo and Drebin databases as datasets. With an accuracy of 96.2% and an F-score of 94.7% in the experimental studies, they reported that the proposed model outperforms recent studies conducted in this context.
In a different study, Atacak et al. [
32] used permission-based analysis and created a hybrid model by fusing CNN with FL. In the study, they used two different datasets and 500 benign and malicious applications. They analyzed the APK file of the applications and acquired a manifest.xml file. In the next stage, they obtained the permission information of the applications from this file. They performed a feature extraction and a feature reduction with two convolution layers and two pooling layers using the permission information. In the last layer, they estimated that the features equated to five neurons with ANFIS architecture. With their model, they reached an accuracy of 92% in the first dataset and an accuracy of 94.66% in the second dataset.
In this study, unlike prior ensemble learning (EL) methods in the literature, the FL-BDE model combines the results of ML-based methods using a fuzzy logic (FL)-based inference system (FIS), which is based on the intuitive view and inference philosophy of a human solving a problem.
4. Results and Discussion
In this section, the experimental results of the proposed FL-BDE model for detecting Android malware are presented together with those of nine ML- and EL-based methods used to verify the model’s performance. The model’s success in malware detection is demonstrated by comparing the experimental results with studies in the literature that use similar models and techniques. The FL-BDE model’s experimental setup was built in the Microsoft Azure ML environment using the components described in the Materials and Methods section (Section 3). In the first phase of the setup, the data were read from the csv file containing 2000 applications with a total of 1134 features, which had been uploaded to the Azure ML platform through the “My Datasets” component. The data were then passed to the “Filter Based Feature Selection” component and reduced to a 2000 × 50 dataset. The “Feature Selection Method” and “Number of Desired Features” parameters in this component were set to “Fisher score” and “50”, respectively. To obtain the training and testing data for the ML-based models, the reduced dataset was then subjected to the data splitting procedure using the Azure ML environment’s “Split Data” component. In line with the two-stage splitting strategy employed in the test phase, the “fraction of rows in the first output dataset” parameter, which sets the split ratio within the component, was set to 0.60 and 0.70 in turn. Three components, “Two Class Algorithm Name,” “Train Model” and “Score Model”, were used with the training and test data to produce malicious rating scores in the 0–1 range for the models developed using the SVM, LR, BPM, BDT, DF and NN approaches in the multi-classification process. The parameters adjusted for these classification algorithms and their values are shown in Table 2.
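To make the feature selection step more concrete, the following is a minimal sketch of Fisher-score ranking for a two-class dataset. It uses the common textbook definition of the score (class-size-weighted spread of the class means divided by the class-size-weighted within-class variance); whether the Azure ML “Filter Based Feature Selection” component computes exactly this quantity is an assumption, and the function names are hypothetical.

```python
from statistics import fmean


def fisher_scores(rows, labels):
    """Return one Fisher score per feature column (textbook definition, assumed)."""
    n_features = len(rows[0])
    classes = sorted(set(labels))
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        overall_mean = fmean(column)
        between = 0.0  # class-size-weighted spread of the class means
        within = 0.0   # class-size-weighted variance inside each class
        for c in classes:
            values = [v for v, y in zip(column, labels) if y == c]
            mean_c = fmean(values)
            var_c = fmean([(v - mean_c) ** 2 for v in values])
            between += len(values) * (mean_c - overall_mean) ** 2
            within += len(values) * var_c
        if within > 0:
            scores.append(between / within)
        else:
            # No variation inside any class: perfectly separating or constant.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def top_k_features(rows, labels, k):
    """Indices of the k highest-scoring features (k = 50 in the study)."""
    scores = fisher_scores(rows, labels)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Toy data: feature 0 separates the classes, features 1 and 2 do not.
toy_rows = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
toy_labels = [1, 1, 0, 0]
selected = top_k_features(toy_rows, toy_labels, 1)
```

In the study's setting, `rows` would be the 2000 × 1134 binary matrix and `k` would be 50.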
The stages of the proposed model, including the voting and routing process and the combining scores process, as well as the execution of the single classifiers that are not part of Azure ML, were implemented in a program written in the “Execute R script” component of this environment.
Along with the proposed FL-BDE model in the experimental studies, the ML-based single classifiers such as SVM, LR, BPM, NN and RF and the EL-based single classifiers such as BDT, DF, AdaBoost and Bagging were also used to evaluate its performance in Android malware detection. The performances of the model and classifiers were verified on the training and testing data obtained by applying the random split and k-fold cross-validation conditions to the dataset. Splits of 0.60 and 0.70 were used as the random split condition. Data vectors of 1200 × 50 for the training process and 800 × 50 for the testing process were obtained in the random split condition of 0.60. For the split ratio of 0.70, the sizes of these vectors for the training and testing processes were 1400 × 50 and 600 × 50, respectively. In the cross-validation approach, five-fold cross-validation was used. Firstly, the dataset was divided into five equal parts, and five data groups were obtained, each consisting of 400 × 50 vectors. Then, after performing five iterative test processes, where there was one group of testing data and four groups of training data, the performance results of the classification models were obtained for each iteration step. After that, by averaging the performance results obtained in these iteration steps, the overall performance of the model and methods used for the Android malware detection was determined.
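The split sizes quoted above can be checked with a short sketch. The helper names and the fixed random seed are illustrative assumptions; the Azure ML “Split Data” component may shuffle or stratify differently.

```python
import random


def random_split(n_rows, fraction, seed=0):
    """Shuffle row indices and cut them at the given fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = round(n_rows * fraction)
    return indices[:cut], indices[cut:]


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train, test) index lists; assumes n_rows is divisible by k."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    fold_size = n_rows // k
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


train_60, test_60 = random_split(2000, 0.60)   # 1200 training rows, 800 testing rows
train_70, test_70 = random_split(2000, 0.70)   # 1400 training rows, 600 testing rows
folds = list(k_fold_indices(2000, 5))          # five folds of 400 testing rows each
```

Averaging a metric over the five `(train, test)` pairs gives the cross-validated value described in the text.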
In Table 3, the classification errors and confusion matrix parameter values for the FL-BDE, ML- and EL-based models are given for the split ratio of 0.60. The proposed FL-BDE model produced the best performance in terms of the misclassification rate (MCR), which represents the classification error over both the positive and negative classes, with a value of 1.88%. This is very low compared to the results produced by the ML- and EL-based models. Its MCR was 1.87 percentage points lower than that of the RF-based model (3.75%), which came closest to our model. Given that the MCR is the complement of the classification accuracy (accuracy = 1 − MCR), the proposed approach performs very well in this regard. The FL-BDE model also performed well in terms of the false positive rate (FPR), the proportion of negative instances predicted to be positive, and the false negative rate (FNR), the proportion of positive instances predicted to be negative. The results in Table 3 show that the proposed model’s FPR of 2.75% and FNR of 1% were considerably lower than those of the other models, with the exception of the NN-based model’s FNR. Although it matched the proposed model’s FNR, the NN-based model performed the worst of all of the models, since it had the highest MCR (10.63%) and FPR (20.25%).
Figure 4 shows the performance results of the FL-BDE, ML- and EL-based models at the split ratio of 0.60. According to the values shown in Figure 4, the FL-BDE model offered the best performance among all of the models in terms of the accuracy, recall, specificity, precision and F-measure metrics, with an accuracy of 0.9813, a recall of 0.9900, a specificity of 0.9725, a precision of 0.9730 and an F-measure of 0.9814. The RF-based model came closest to the proposed model in terms of the accuracy and F-measure metrics, while the LR-based model came closest in terms of specificity and precision, with values of 0.9700 and 0.9690, respectively. The NN-based model exhibited the lowest performance in all metrics except recall, with an accuracy of 0.8938, a specificity of 0.7975, a precision of 0.8302 and an F-measure of 0.9031. However, it shared the best recall, 0.9900, with the FL-BDE model.
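The error rates and performance metrics reported for the 0.60 split are all functions of the same four confusion matrix counts. The sketch below shows the standard definitions. The counts used in the example (396 true positives, 4 false negatives, 11 false positives, 389 true negatives) are not given in the text; they are inferred from the reported FL-BDE rates under the assumption that the 800 test instances contain 400 malicious and 400 benign applications.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard error rates and performance metrics from confusion matrix counts."""
    total = tp + fn + fp + tn
    recall = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    precision = tp / (tp + fp)
    return {
        "mcr": (fp + fn) / total,    # misclassification rate = 1 - accuracy
        "fpr": fp / (fp + tn),       # false positive rate = 1 - specificity
        "fnr": fn / (fn + tp),       # false negative rate = 1 - recall
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f_measure": 2 * precision * recall / (precision + recall),
    }


# Counts inferred (assumption) for FL-BDE at the 0.60 split, 400 instances per class.
fl_bde = confusion_metrics(tp=396, fn=4, fp=11, tn=389)
```

Under that assumption the sketch gives an MCR of 0.01875, an FPR of 0.0275 and an FNR of 0.01, with an accuracy of 0.98125, a precision of about 0.9730 and an F-measure of about 0.9814, in line with the rounded values reported above.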
Figure 5 illustrates the ROC curves, which show the relationship between the false positive rate and the true positive rate, for the FL-BDE, ML- and EL-based models at the split ratio of 0.60. The area under the ROC curve (AUC) is a critical measure of a model’s success in separating the classes: the closer the AUC is to 1, the better the model distinguishes between them. When the AUC values derived from the ROC curves in Figure 5 were compared with one another, all of the models except the SVM-based model displayed a very good performance. With an AUC of 0.997, the proposed model demonstrated an outstanding performance in terms of class discrimination. The BDT- and DF-based models, with AUC values of 0.992 and 0.990, respectively, came closest to our model.
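As a brief illustration of what the AUC measures, it can be computed directly as the probability that a randomly chosen malicious instance receives a higher score than a randomly chosen benign one, with ties counted as one half. The sketch below uses this pairwise formulation, which is equivalent to the area under the ROC curve; it is not the routine used by Azure ML, and the toy scores are invented for illustration.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy malicious-rating scores in the 0-1 range (invented for illustration).
perfect = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
imperfect = roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

In the second toy case one of the four positive-negative pairs is ranked the wrong way round, so the AUC drops from 1.0 to 0.75.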
Table 4 compares the confusion matrix parameters and error rates of the FL-BDE, ML- and EL-based models at the split ratio of 0.70. As in the 0.60 split condition, the data in Table 4 clearly show that the proposed model achieved the lowest MCR, FPR and FNR error rates. With an MCR of 0.67% and an FPR of 1.33%, its error rates were 3 and 1 percentage points lower, respectively, than those of the BPM-based model, which produced the closest result. Furthermore, it correctly classified all of the positive instances, resulting in an FNR of 0%. The NN-based model had the worst MCR and FPR, with values of 9.83% and 18.67%, respectively. However, with an FNR of 1%, it was second only to the proposed model in this error rate. The Bagging-based model had the worst FNR, at 6.67%.
The performance results of the FL-BDE, ML- and EL-based models are shown in Figure 6 for the split ratio of 0.70. Compared with the results at the split ratio of 0.60, all of the models except the LR-, DF-, Bagging- and RF-based models improved in most of the metrics as the number of training instances grew. The proposed model again had the best performance, improving by 1.2% in accuracy, 1% in recall, 1.42% in specificity, 1.38% in precision and 1.2% in the F-measure relative to its results at the split ratio of 0.60. At the split ratio of 0.70, the FL-BDE model clearly outperformed the ML- and EL-based models, with an accuracy of 0.9933, a recall of 1.00, a specificity of 0.9867, a precision of 0.9868 and an F-measure of 0.9934. Its margins over the RF-, BPM- and BDT-based models, which came closest to it in most of the confusion matrix metrics, were 3% or more for accuracy, 3.33% or more for recall, 1% or more for specificity, 1.08% or more for precision and 2.99% or more for the F-measure. The NN-based model had the worst performance in all metrics except recall, where its value of 0.99 came closest to that of the proposed model.
Figure 7 shows the ROC curves and their AUC values for the FL-BDE, ML- and EL-based models at the split ratio of 0.70. Compared with the AUC values at the split ratio of 0.60, the BDT model showed no change, the DF- and NN-based models decreased by 0.1%, and the Bagging model decreased by 0.3%. The remaining models improved by 0.2% or more in this metric. All of the models performed well at the split ratio of 0.70, since they all obtained an AUC of 0.98 or more. Among them, the proposed model showed an excellent performance, with a very high AUC of 0.999. The AdaBoost- and RF-based models came closest, with AUC values of 0.993.
Figure 8 shows the error rates of the FL-BDE, ML- and EL-based models obtained using five-fold cross-validation. The FL-BDE model showed the best performance among all of the models in terms of both the MCR and the FPR, with the lowest error rates of 1% and 0.1%, respectively. These rates were 2.1 and 3 percentage points lower than those of the BDT model, which gave the closest result. Although the NN model achieved the lowest FNR, with a value of 1.5%, its very high FPR of 19% resulted in a high MCR of 10.40%, which made it the worst-performing model overall. The LR model had the worst FNR, with a value of 6.20%. When the MCR results were compared with those obtained at the split ratios of 0.60 and 0.70, the error rates of the ensemble-based BDT, AdaBoost and Bagging models under five-fold cross-validation were lower than those of the same models at both split ratios. Like the proposed model, most of the other models achieved an MCR under five-fold cross-validation that was below their value at the 0.60 split ratio but above their value at the 0.70 split ratio. The FNR of the FL-BDE model under cross-validation was higher than its values in both split conditions. However, in all of the conditions, it remained much lower than the FNR values of the other models, except for the NN model.
Figure 9 illustrates the performance results of the FL-BDE, ML- and EL-based models obtained using five-fold cross-validation. As at the split ratios of 0.60 and 0.70, the FL-BDE model performed considerably better than the ML- and EL-based models in terms of the accuracy, recall, specificity, precision and F-measure metrics. The proposed model achieved an accuracy of 0.990, a recall of 0.981, a specificity of 0.999, a precision of 0.999 and an F-measure of 0.990. Except for the recall metric, it outperformed the BDT model, which produced the closest result, by 2.1% in accuracy and 3% in the specificity, precision and F-measure metrics. After the BDT model, the RF model came closest to our model, with an accuracy of 0.962, a recall of 0.969, a specificity of 0.954, a precision of 0.955 and an F-measure of 0.962. As at the split ratios of 0.60 and 0.70, the NN model gave the worst performance in all metrics except recall. A comparison of Figure 4 and Figure 9 shows that the FL-BDE model performed better under five-fold cross-validation than at the split ratio of 0.60 in all metrics except recall. A comparison of Figure 6 and Figure 9 shows a similar situation relative to the split ratio of 0.70 for the specificity, precision and F-measure metrics.
Figure 10 shows the ROC curves and AUC values obtained for the FL-BDE, ML- and EL-based models under five-fold cross-validation. The proposed model performed better than the other models, with an AUC of 0.997. As with the other metrics, the ensemble learning-based BDT model had the best performance after our model, with an AUC of 0.993. Under five-fold cross-validation, the SVM model had the lowest AUC among all of the models. In fact, the SVM showed the worst performance in terms of this metric in both the random split conditions and the five-fold cross-validation conditions.
The experimental results showed that the proposed FL-BDE model produced the best performance in terms of most performance metrics in the random split ratio conditions and the five-fold cross-validation conditions. Here, it was decided which classifiers used in the voting and routing process would be included in the positive and negative classifier groups using the experimental results of the classifier errors. When the performances of ML-based models in both the split conditions and the cross-validation were evaluated according to the average FPR and FNR error rates, it was found that the LR, BPM, and BDT-based models provided the best performances in terms of the classification of negative instances, while the NN, DF and SVM-based models had the best performances in classifying the positive instances. This information is critical in structuring the proposed model.
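The grouping described above can be illustrated with a simplified sketch of the voting and routing idea. Several details here are assumptions rather than the study's design: the majority vote at a 0.5 threshold, the rule that a malicious vote routes to the positive group (NN, DF, SVM) and a benign vote to the negative group (LR, BPM, BDT), and in particular the plain average in `combine_scores`, which only stands in for the fuzzy inference system (FIS) that the FL-BDE model actually uses to combine the three routed scores.

```python
POSITIVE_GROUP = ("NN", "DF", "SVM")    # best at classifying positive instances
NEGATIVE_GROUP = ("LR", "BPM", "BDT")   # best at classifying negative instances


def vote_and_route(scores, threshold=0.5):
    """Majority-vote on six 0-1 malicious ratings, then pick one group of three."""
    malicious_votes = sum(1 for s in scores.values() if s >= threshold)
    voted_malicious = malicious_votes > len(scores) / 2  # ties go to benign (assumption)
    group = POSITIVE_GROUP if voted_malicious else NEGATIVE_GROUP
    return voted_malicious, {name: scores[name] for name in group}


def combine_scores(routed):
    """Placeholder for the FIS: a plain mean of the three routed scores."""
    return sum(routed.values()) / len(routed)


# Invented example ratings for a single application.
example = {"NN": 0.9, "DF": 0.8, "SVM": 0.7, "LR": 0.6, "BPM": 0.4, "BDT": 0.3}
voted_malicious, routed = vote_and_route(example)
final_score = combine_scores(routed)
```

Whatever the combiner, the routing step means it only ever sees three inputs rather than six, which is the simplification of the FIS design that the study highlights.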
In the literature, many studies have been conducted to detect malware through different classification methods by using static and dynamic analysis methods. Among them, artificial intelligence-based methods such as ML, DL and FL, as well as hybrid and EL-based approaches, have performed better in malware detection than the others have. A summary of the proposed model is presented in
Table 5, along with information about the relevant studies in the literature and their performances in terms of the accuracy, recall, precision, F-measure and AUC metrics.
When the studies in Table 5 were compared with each other in terms of the performance metric results, it was found that the studies based on EL, contrastive learning and DL were highly successful at malware detection. In comparison to these models, the FL-BDE model, which was proposed as a FIS-based dynamic ensemble approach in our work, demonstrated a more competitive performance. The bi-level malware detection (BMD) model proposed by Jerbi et al. [27] and the Bi-LSTM model using contrastive learning suggested by Yang et al. [26] are the approaches whose performance results are closest to those of the FL-BDE model. The data in the table make it evident that the proposed model outperformed both of these models in terms of all of the performance metrics. It also performed much better than the other models shown in Table 5.
5. Conclusions
Due to the explosive growth in the use of mobile devices, the market share of the Android operating system, which powers these devices, has increased dramatically. Consequently, malware has made them its target. Although the research and the applications developed so far for Android malware detection have contributed partially to solving the problem, they have been insufficient to eliminate it completely because of some of the characteristics and behavioral features of malware.
In this study, a dynamic model that combines the outputs of ML-based methods through FIS was proposed for Android malware detection. Two thousand application instances in the form of APK files were employed in the study, one thousand of which were malicious applications downloaded from the Drebin database, and one thousand of which were benign applications downloaded from the Google Play Store. The APK files were first analyzed by reverse engineering, and the Manifest.xml file was obtained. Then, the permissions, intents and activities contained in this file were determined. After that, by querying each APK file, a 2000 × 1134 data matrix consisting of 1’s and 0’s was saved to a csv file to be used as the dataset for malware detection.
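A row of such a binary dataset can be sketched as follows. The three-entry vocabulary is purely illustrative (the study's vocabulary has 1134 entries, which are not listed here), and the helper name is hypothetical.

```python
def encode_app(app_attributes, vocabulary):
    """Return a 0/1 vector marking which vocabulary entries the app declares."""
    present = set(app_attributes)
    return [1 if name in present else 0 for name in vocabulary]


# Illustrative vocabulary of manifest attributes (assumption, not the study's list).
vocabulary = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.intent.action.BOOT_COMPLETED",
]

# Attributes extracted from one hypothetical app's manifest.
app = ["android.permission.SEND_SMS", "android.permission.INTERNET"]
row = encode_app(app, vocabulary)
```

Stacking one such row per APK, in a fixed vocabulary order, yields the 0/1 matrix that is written to the csv file.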
The experimental results of the proposed model were obtained by applying this dataset to the model, whose feature extraction, feature selection, data splitting, multi-classification, voting and routing, and combining scores processes were built using the necessary components in the Azure ML environment. The performance of this model was also evaluated against the ML-based models, including the SVM, LR, BPM, BDT, DF and NN methods built in this environment. The classification error rates (MCR, FNR and FPR), the confusion matrix metrics (accuracy, recall, specificity, precision and F-score) and the AUC metric obtained from the ROC curves were used to assess the models. The experiments performed under the random split and five-fold cross-validation conditions showed that the proposed FL-BDE model outperformed the ML-based models in terms of both the classification error rates and the confusion matrix metrics. In terms of the AUC values obtained from the ROC curves, the proposed model outperformed the ML-based models by 0.5% or more in the 0.60 split condition and by 0.7% or more in the 0.70 split condition. When the proposed model was compared with similar ensemble-based studies in the current literature, it was observed that it performed better, although by a smaller margin. In comparison with other studies in the literature that produced performance results similar to those of our model, it performed much better in terms of all of the metrics.
In the future, real-time malware detection applications can be realized by creating a hybrid model that obtains the feature vectors from APK files with DL-based approaches and then performs the malicious application detection process using the FL-BDE approach proposed here.