GA-StackingMD: Android Malware Detection Method Based on Genetic Algorithm Optimized Stacking

Xie, Nannan; Qin, Zhaowei; Di, Xiaoqiang

doi:10.3390/app13042629


by Nannan Xie ^1,2, Zhaowei Qin ^1,2,* and Xiaoqiang Di ^1,2,3

¹ School of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022, China

² Jilin Province Key Laboratory of Network and Information Security, Changchun 130022, China

³ Information Center, Changchun University of Science and Technology, Changchun 130022, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(4), 2629; https://doi.org/10.3390/app13042629

Submission received: 10 January 2023 / Revised: 7 February 2023 / Accepted: 15 February 2023 / Published: 17 February 2023

(This article belongs to the Special Issue Information Security and Privacy)


Abstract

With the rapid development of network and mobile communication, intelligent terminals such as smartphones and tablet computers have changed people’s daily life and work. However, malware such as viruses, Trojans, and ransomware has introduced threats to personal privacy and social security. Android malware is highly varied and evolves rapidly. Android malware detection faces two problems: high feature dimensionality and the unsatisfactory detection accuracy of single classification algorithms. In this work, an Android malware detection framework, GA-StackingMD, is presented. It employs Stacking to combine five different base classifiers and applies a Genetic Algorithm to optimize the hyperparameters of the framework. Experiments show that Stacking effectively improves malware detection accuracy compared with single classifiers. GA-StackingMD achieves accuracies of 98.43% and 98.66% on the CIC-AndMal2017 and CICMalDroid2020 data sets, respectively, which shows the effectiveness and feasibility of the proposed method.

Keywords: Android malware detection; integrated machine learning; stacking; genetic algorithm

1. Introduction

The rapid development of the Internet has made intelligent mobile terminals such as smartphones and tablet computers play an increasingly important role in people’s life and work. As of October 2022, the Android operating system accounted for 70.96% of the global smartphone market share [1]. At the same time, malware has had a huge impact on personal privacy and social security. Kaspersky [2] found a total of 3,464,756 malicious installation packages, 97,661 new mobile banking Trojans, and 17,372 new mobile ransomware Trojans in 2021. The security of mobile intelligent terminals is facing severe challenges.

Malware mainly includes software that infringes upon legitimate rights and runs on users’ computers or mobile terminals without permission, such as computer viruses, Trojans, worms, blackmail software, spyware, and so on. Rapidly growing malware is not only leading to a threat to users’ privacy but also playing an important role in most computer intrusions. In present software developments and applications, open source code is regarded as a common practice. However, reusing other code allows bad actors to access a wide range of developer communities to obtain different kinds of malware [3]. The development of code obfuscation and anti-tracking technology makes traditional malware such as viruses and worms evolve into more threatening variants by means of polymorphism and deformation, and they even escape security defense scanning, which reflects the limitations of existing detection methods [4].

The research on Android malware detection mainly focuses on two aspects. The first is malware analysis, which extracts different types of features that are then used to detect malware. The second is how to select or build an appropriate classification model that performs well and stably across different forms of software [5,6]; the stability, robustness, and effectiveness of classification models have therefore attracted the attention of researchers. Existing malware detection methods usually extract features from the data set, train learning models, and then separate malicious samples from normal ones. In practice, the detection accuracy of a single feature type or a single detection algorithm is unsatisfactory, while combinations of multi-dimensional features or multiple classifiers incur high computational cost.

Considering the above problems, we try to propose solutions. The contributions of this paper are as follows.

An Android malware detection framework based on Stacking is proposed. The framework mainly includes three parts: data set construction, feature dimension reduction, and the optimization method GA-StackingMD. Through Stacking, five base classifiers are integrated to combine their advantages and improve malware detection performance;
A two-step feature dimension reduction method is realized. The extracted multiple-category features have high dimensions, which may lead to redundancy, excessive computational consumption, and even over-fitting. The proposed method uses InfoGain for an initial feature selection and then applies the Chi-square Test to reduce redundancy, which yields a key feature subset with better distinguishing ability;
A Stacking hyperparameter optimization method, GA-StackingMD, is presented to improve the performance of the classifier combination. Experiments show that it achieves better accuracy than the original combination of base classifiers. The proposed method is applicable not only to the optimization of Stacking hyperparameters but also to the optimization of other integrated classifiers, so it can be used with different base classifier algorithms.

2. Related Work

2.1. Android Malware Detection

The openness of the Android system makes it easier for attackers to carry out malicious activities by its vulnerability. The general malware detection methods usually employ classification algorithms to deal with the extracted features. Machine learning algorithms such as Support Vector Machine, Decision Tree, K-Nearest Neighbor, and Naive Bayes are of mature development and have been widely used in malware detection.

Malware feature extraction includes a static method and a dynamic method. The static method decompiles the executable APK file and extracts features from the file or the program instructions. Static extraction is easy to implement, but it is difficult to deal with the camouflage caused by code obfuscation and other technologies.

The dynamic method runs applications in a monitoring environment and detects the changes in the environment. Thus it is necessary to define “normal” states of the system, in order to judge the malicious application according to differences or changes. The dynamic method monitors application behavior in the simulated environment and could obtain nearly real operation information. However, executing and monitoring applications in a dynamic environment will consume more system resources.

Machine learning algorithms are the most widely used in malware detection. It is difficult for a single classifier to achieve both high detection precision and high recall, while multiple models must address how to combine the results of different classifiers. Lindorfer et al. [7] established the malware detection model MARVIN using a linear classifier and SVM. Zhu et al. [8] proposed SEDMDroid, which combines SVM and MLP to detect Android malware. The integrated learning framework MFDroid [9] employed seven feature selection algorithms to process permissions, API calls, and opcodes; the results of each algorithm were then integrated to construct a new feature set for training the classifiers.

With the emergence and development of Deep Learning, it has also been applied to Android malware detection because of its powerful data processing ability. Cen et al. [10] used a Deep Belief Network to deal with multiple-category features. Saxe and Berlin [11] developed a Deep Neural Network malware classifier and achieved 95% detection accuracy with a false alarm rate under 0.1%. These studies are beneficial attempts to improve the effectiveness of Android malware detection and have promoted the development of relevant fields.

2.2. Features of Android Malware Detection

The features of Android malware are extracted by static or dynamic methods, and the different types of extracted features reflect an application from different aspects. These features are therefore used to distinguish malware from normal applications. The feature categories in this work include Permissions, API calls, Dalvik opcodes, Intent, and Hardware. For example, Singh et al. [12] detected Android malware by applying four machine learning classifiers, namely, Naive Bayes, Support Vector Machine, Decision Tree, and Random Forest. After analysis and comparison, it was found that Random Forest had the highest accuracy. Mohapatra et al. [13] proposed a malware detection model that combines C-means, Random Forest, and Local Outlier Factor; they also summarized prior work that applies different machine learning methods to malware detection and measured the performance of different machine learning algorithms in their framework.

Android provides a permission mechanism for developers to protect user privacy and system security. If an application needs to access or execute operations on specific resources, the developers have to request the corresponding permissions. In addition to 206 official permissions, developers also can customize permissions according to their demands and share resources with other applications. The AndroidManifest.xml file preserves permission-related information, such as security permissions that control access to specific components. Wang et al. [14] applied multilevel permissions to detect Android malware and achieved satisfactory performance.

API calls are pre-defined functions in programs that meet functional requirements and operational conventions. They are abstractions of application behaviors and object actions, and provide data sharing for various components. Peiravian and Zhu [15] pointed out that the frequency and sequence of dangerous API calls in malware are obviously different from those in normal applications. Some sensitive APIs are called more frequently in malware than in benign applications. API calls are also widely used in malware detection [16]. Considering the high dimension of API calls, we choose “Restricted API” and “Suspicious API” as classification features. “Restricted API” refers to APIs protected by permissions, and “Suspicious API” refers to APIs marked as suspicious by experts, such as encryption functions and functions that send HTTP requests.

Android translates Java source code into opcodes when executed on DVM (Dalvik Virtual Machine) at runtime, which is called Dalvik opcodes. After segmentation by N-gram, they are usually used in malware detection as the lowest-level features. Jose et al. [17] applied Dalvik opcodes as classification features obtained by reverse engineering, and Zhang et al. [18] used Dalvik opcodes graph to detect Android malware. Sewak et al. [19] used opcode as the classification feature, and used four different feature subsets to analyze and compare the RF and the deep neural network. It was found that the precision of the classical RF classifier was better than that of the DNN regardless of the feature input.

The Intent is a messaging object which requests operations from other components. The basic usages include starting activities and services and delivering broadcasts. Feizollah et al. [20] pointed out that Intent features contain rich information which reflects the difference between malware and normal applications. They also demonstrated that Intent should be combined with other features to achieve better detection results.

When an Android application runs on a terminal, it depends on the device’s hardware, such as Bluetooth, network, and position coordination. If certain hardware is missing, the application which needs this hardware will not work properly. Therefore, the hardware also indicates some characteristics of the application to a certain extent and can be used as a detection feature.

In the study of malware detection features, how to extract low-dimensional features and how to develop new feature processing methods to get better accuracy have always been the focus. A single feature usually describes one aspect of the applications, while the combination of different features could better cover the characteristics of malware. Since both static and dynamic features have their own shortcomings, many researchers combined them and used integrated features to detect malware [21].

2.3. Feature Selection and Genetic Algorithm

Machine learning uses computers to simulate human learning activities, study how to learn existing knowledge, acquire new knowledge, and improve learning performance constantly. Logic-based classifiers such as Decision Tree, rule learning machine RIPPER, and Perceptron construct classification rules by induction and deduction. The statistics-based classifiers decide the sample categories by constructing probability models, which include Bayesian, KNN, and SVM. Integrated classifiers refer to the combination techniques which integrated different methods, such as Boosting, Bagging, and Stacking. Machine learning is widely used in image recognition, natural language processing, text processing, and security practices.

When using Machine Learning to process data, the high feature dimension usually leads to more computing consumption and even over-fitting. So feature dimension reduction is usually used in preprocessing. Feature selection is one of the most widely employed methods and it includes filter, wrapper, and embedded model according to the correlation between feature selection and classifiers. The filter is completely independent of classifiers, so it is used widely and flexibly. Typical feature selection algorithms include Fisher, Correlation Coefficient, InfoGain, Chi-square Test, and Mutual Information [22].

GA (Genetic Algorithm) [23] originated from Darwinian evolution theory; it simulates the natural laws of evolution and searches for high-quality solutions by iterations. GA has good global search ability, and its search spaces and directions can be adjusted adaptively by crossover and mutation. The algorithm begins with a population and selects the best individuals of the population to produce the next generation. The best individuals are determined according to the fitness function defined in advance. Through the crossover and mutation, new individuals in accordance with the fitness function are produced. The population selected by the survival of the fittest is more adaptable to the environment than the original population, and then the high-quality solutions to the problem are determined.

GA is widely used in image processing [24], control systems [25], and network information security [26]. Qi et al. [27] proposed an improved chaotic GA to solve the problems of slow convergence and local optimization of BP neural networks. Elhefnawy et al. [28] presented a hybrid nested genetic fuzzy algorithm (HNGFA) to distinguish most intrusion categories from normal traffic. Yildiz and Dogru [29] constructed a permission feature subset by GA, and it achieved good detection results with SVM in Android malware detection.

2.4. Stacking

In the past few decades, Machine Learning and its related fields have developed rapidly. In order to solve the problems of single models, researchers try to integrate multiple classifiers. Three integrated learning algorithms, Bagging, Boosting, and Stacking are developed. The integration model combines multiple individual classifiers and has better generalization performance than single models.

Compared with Bagging and Boosting, Stacking has advantages in the flexible integration of multiple base classifiers. Stacking uses a meta-classifier mechanism, which combines various base classifiers through autonomous learning to find the best combination from two or more base classifiers [30]. For a specific data set, the different classifiers learn in different data spaces and focus on different aspects. Therefore, Stacking aims to combine the advantages of different models to achieve better results.

The heterogeneous base classifiers of Stacking can achieve relatively better results, and the classification accuracy is improved compared with the homogeneous integration. Two-layer Stacking trains the first layer base classifiers with original data, and the prediction results of them are combined with the original label to construct a new data set to train the second layer meta classifier. The structure of Stacking is shown in Figure 1.

Stacking has been applied to Chinese emotion classification, sales prediction, and other fields. In malware detection, Zheng et al. [31] designed the Stacking-based CMalHunt to deal with the new malware CryptocMal. Jiang et al. [32] presented an adaptive Stacking integration model, SSEM, which encodes the base classifiers and hyperparameters in binary and achieved good performance in computer network information security.

3. Android Malware Detection Method

3.1. The GA-StackingMD Framework

In consideration of the high feature dimension and low detection efficiency of single classifiers, an Android malware detection method GA-StackingMD based on Stacking is presented, and GA is used to optimize the hyperparameters of the base models. The GA-StackingMD Framework is shown in Figure 2.

The proposed malware detection framework consists of three parts.

Data set construction. The Android application is decompiled and static features are extracted, including Permission, API, Dalvik opcode, Intent, and Hardware. The extracted features are then digitized to construct the original high-dimensional feature set;
Feature dimension reduction. A two-step feature selection method is proposed to reduce the original feature dimension. InfoGain is used for the initial selection, and then the Chi-square Test is applied for further reduction to remove redundant and irrelevant features. The final key feature subset contains the selected features, which have low dimensionality and good discriminative ability;
GA-StackingMD. An optimization algorithm based on GA is presented to select the hyperparameters of the base classifiers of Stacking. After the better combination of base classifiers is selected, GA is used to adaptively adjust the hyperparameters within a given range and thereby improve the malware detection performance.

3.2. Feature Processing

3.2.1. Feature Extraction

The two data sets used in this work were released by the Canadian Institute for Cybersecurity. CIC-AndMal2017 [33] includes 426 malicious applications and 1700 benign ones, and CICMalDroid2020 [34] includes 13,204 malicious samples and 4039 benign ones. The malware includes adware, SMS malware, ransomware, and so on. The two data sets differ greatly in the number of samples, which also tests the robustness of the proposed method.

Androguard is used to extract Permissions, Dalvik opcode, API, Intent, and Hardware features. The extracted features are firstly converted to numerical values. The features from all samples are sorted by category and arranged in dictionary order in the same category. If the feature appears in a sample, then it is marked “1”, otherwise marked “0”. A sample is represented by a vector composed of “1” and “0” so that all the samples compose a feature matrix.
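The encoding described above can be illustrated with a minimal Python sketch. It assumes that each sample has already been reduced to a set of (category, feature name) pairs; the function name `build_feature_matrix` and the sample data are hypothetical, and the Androguard extraction step itself is not shown.

```python
def build_feature_matrix(samples):
    """Turn per-sample feature sets into a 0/1 matrix.

    `samples` maps a sample name to a set of (category, feature) pairs.
    Sorting the pairs orders features by category first and then in
    dictionary order within each category.
    """
    vocabulary = sorted({pair for feats in samples.values() for pair in feats})
    matrix = [
        [1 if pair in feats else 0 for pair in vocabulary]
        for feats in samples.values()
    ]
    return vocabulary, matrix


# Hypothetical toy input: two applications with a few extracted features.
samples = {
    "app_a": {("permission", "SEND_SMS"), ("api", "getDeviceId")},
    "app_b": {("permission", "INTERNET"), ("permission", "SEND_SMS")},
}
vocabulary, matrix = build_feature_matrix(samples)
```

Each row of `matrix` is the binary vector of one sample, and each column corresponds to one entry of `vocabulary`.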

3.2.2. Two-Step Feature Selection

Single features are difficult to comprehensively describe the differences between malware and normal applications, but the combination of multiple features has a high dimension which leads to a significant increase in computation and feature redundancy. In order to improve the performance of the classifiers, reducing the feature dimension effectively and constructing a good feature subset are important prerequisites.

InfoGain is used to implement the first selection step. The InfoGain value is defined by information entropy and conditional entropy, and it indicates how much the information complexity or uncertainty is reduced under a specific condition. In feature selection, it indicates how much information a feature brings to the whole classification system. The higher the InfoGain value, the more information the feature brings, and the more important the feature is. With all the features sorted by this criterion, we select the top-ranked features to construct the candidate feature set.

In the second step, the Chi-square Test is applied to the candidate feature set to further remove features with low correlation. The Chi-square Test describes the correlation between two variables by calculating the χ² statistic. Taking features as variables, the Chi-square value between each feature and the class labels is calculated, and the features are sorted by this value. The features with high correlation are retained, which yields the key feature subset used for detection.
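The two selection steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: features are binary, InfoGain is computed as label entropy minus conditional entropy, the Chi-square value is the standard contingency-table statistic between a feature and the class label, and the cut-offs `k_first` and `k_final` are hypothetical parameters that the paper does not specify in this form.

```python
import math


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(
        (labels.count(c) / n) * math.log2(labels.count(c) / n)
        for c in set(labels)
    )


def info_gain(column, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        gain -= len(subset) / n * entropy(subset)
    return gain


def chi_square(column, labels):
    """Chi-square statistic of the feature/label contingency table."""
    n = len(labels)
    column = list(column)
    stat = 0.0
    for value in set(column):
        for cls in set(labels):
            observed = sum(1 for x, y in zip(column, labels) if x == value and y == cls)
            expected = column.count(value) * labels.count(cls) / n
            stat += (observed - expected) ** 2 / expected
    return stat


def two_step_select(matrix, labels, k_first, k_final):
    """Keep the k_first features by InfoGain, then the k_final of those by Chi-square."""
    columns = list(zip(*matrix))
    candidates = sorted(
        range(len(columns)), key=lambda j: info_gain(columns[j], labels), reverse=True
    )[:k_first]
    kept = sorted(
        candidates, key=lambda j: chi_square(columns[j], labels), reverse=True
    )[:k_final]
    return sorted(kept)


# Toy data: feature 0 matches the label, feature 1 is constant, feature 2 is unrelated.
labels = [1, 1, 0, 0]
matrix = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 0, 0]]
selected = two_step_select(matrix, labels, k_first=2, k_final=1)
```

On the toy data only the column that agrees with the label survives both steps, which is the behavior the two-step selection aims for.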

3.3. GA-StackingMD

GA-StackingMD improves the hyperparameters of the classifier combination in three steps.

3.3.1. Training the First Layer Classifiers

In this paper, the first layer classifiers are selected by two criteria. The first is that the algorithms should belong to different categories, so that they cover different aspects of classification. For example, some algorithms are based on probability and some are based on distance, so they may learn from different aspects of the data set. The second is that the algorithms should be relatively simple to implement and less time-consuming, so that the consumption of system resources is kept as low as possible when a variety of algorithms are combined. The algorithms are preliminarily categorized as follows: Decision Tree-based RF, probability-based linear classifier SVM, distance-based KNN, and the ensemble learning-based LGBM and CatBoost. This paper chooses these five classifiers as the first layer classifiers and trains them with 5-fold cross-validation. The output of each classifier and the sample labels are combined to form a new data set for the second layer. Figure 3 shows the first layer process.

The samples are divided into training and testing sets at a ratio of 7:3. In 5-fold cross-validation, each base classifier is trained on four folds and predicts on the remaining fold (about 20% of the training set) in each iteration, and the predictions from the five iterations are combined to form new feature vectors. In each iteration, the base classifier also predicts on the testing set. Therefore, in the newly constructed data set, each column corresponds to a base classifier, the number of rows equals the number of training and testing samples, and the testing-set entries are the average of the predictions over the iterations.
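The construction of the second-layer data set can be sketched as below. This is a minimal illustration, not the authors' implementation: the two toy learners (`one_nn`, `nearest_centroid`) merely stand in for the five base classifiers, the fold assignment is a simple interleaving, and averaging the testing-set predictions over the folds is an assumption based on the usual Stacking convention.

```python
def one_nn(X_train, y_train):
    """Toy distance-based learner: 1-nearest-neighbour predictor."""
    def predict(X):
        labels = []
        for x in X:
            dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in X_train]
            labels.append(y_train[dists.index(min(dists))])
        return labels
    return predict


def nearest_centroid(X_train, y_train):
    """Toy learner that assigns the label of the closest class centroid."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(X):
        return [
            min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
            for x in X
        ]
    return predict


def stack_features(learners, X, y, X_test, k=5):
    """Build second-layer features: out-of-fold predictions for the training
    rows and fold-averaged predictions for the testing rows."""
    n = len(X)
    folds = [list(range(i, n, k)) for i in range(k)]
    train_meta = [[None] * len(learners) for _ in range(n)]
    test_meta = [[0.0] * len(learners) for _ in X_test]
    for j, learner in enumerate(learners):
        for held_out in folds:
            held = set(held_out)
            fit_idx = [i for i in range(n) if i not in held]
            predict = learner([X[i] for i in fit_idx], [y[i] for i in fit_idx])
            for i, p in zip(held_out, predict([X[i] for i in held_out])):
                train_meta[i][j] = p
            for r, p in enumerate(predict(X_test)):
                test_meta[r][j] += p / k
    return train_meta, test_meta


# Hypothetical, well-separated toy data (label 0 near the origin, label 1 near (1, 1)).
X = [[0, 0], [0, 0.1], [0.1, 0], [0.1, 0.1], [0, 0.2],
     [1, 1], [1, 0.9], [0.9, 1], [0.9, 0.9], [1, 0.8]]
y = [0] * 5 + [1] * 5
X_test = [[0, 0], [1, 1]]
train_meta, test_meta = stack_features([one_nn, nearest_centroid], X, y, X_test)
```

Each column of `train_meta` and `test_meta` holds the output of one base learner, so these matrices, together with the original labels, would form the input of the second layer classifier.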

3.3.2. Training the Second Layer Classifier

To prevent over-fitting, the Logistic Regression model is used as the second layer classifier to process the data set constructed in the previous step. Different combinations of base classifiers are compared according to classification accuracy. For example, with m base classifiers, the number of possible non-empty combinations is (2^m − 1). The combination with the highest accuracy is selected as the optimized combination of the integrated model for this data set.
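The enumeration of the (2^m − 1) candidate combinations can be sketched with the Python standard library. The helper names and the `accuracy_of` callback are hypothetical; in practice the callback would train the Stacking model for a given combination and return its accuracy.

```python
from itertools import combinations

BASE_CLASSIFIERS = ["RF", "SVM", "KNN", "LGBM", "CatBoost"]


def candidate_combinations(names):
    """All non-empty subsets of the base classifiers, smallest subsets first."""
    return [
        combo
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
    ]


def select_best(names, accuracy_of):
    """Return the combination with the highest accuracy.

    `accuracy_of` is a placeholder for a function that evaluates the
    Stacking model built from one combination.
    """
    return max(candidate_combinations(names), key=accuracy_of)


combos = candidate_combinations(BASE_CLASSIFIERS)
print(len(combos))  # 31, i.e. 2**5 - 1
```

Because `max` returns the first maximal element, ties in accuracy are resolved in favor of the combination enumerated first in this sketch, which is the smaller one; a different tie-breaking rule could be substituted.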

3.3.3. Optimizing Hyperparameters by GA

After determining the best combination of base classifiers, the combination of multiple algorithms can give full play to the advantages by selecting the appropriate hyperparameter. GA is used to optimize the hyperparameter combinations of the classifiers. The optimization process is shown in Figure 4.

First, the initial population is randomly generated. Several chromosomes form the first generation of feasible solutions to start the iteration. Taking accuracy as the fitness function, the next-generation population is produced by crossover, mutation, and selection, and the best individual in the population is selected generation by generation. If the iteration meets the termination condition, the chromosome with the highest fitness generated in this iteration is output as the optimized solution. Otherwise, the iteration continues until the stop condition is met or the maximum number of iterations is reached.

Assume that the selected optimized combination contains n base classifiers, that parameter j of classifier i is p_ij with the value range [a_ij, c_ij], and that the number of parameters of classifier i is s_i. The parameter vector of classifier i can then be expressed as

p_{i} = [p_{i 1}, p_{i 2}, \dots, p_{i s_{i}}]

and the parameter matrix is

P = [p_{11}, p_{12}, \dots, p_{1 s_{1}}, p_{21}, p_{22}, \dots, p_{2 s_{2}}, \dots, p_{n 1}, p_{n 2}, \dots, p_{n s_{n}}]

The optimization method takes each p_i in the matrix and iterates within the set range of its parameters. The parameter setting is considered optimal when the accuracy is maximal.

3.4. GA-StackingMD Description

Input: Data set represented by feature matrix.

Output: Optimized parameter combination, and fitness function value.

Step 1: Integrating all possible base model combinations by Stacking, setting accuracy as the fitness function f(x), and selecting the combinations with the highest accuracy.

Step 2: Setting the hyperparameters of each base model and their value ranges, the population size S, the maximum number of iterations M, and the fitness threshold t.

Step 3: Encoding the hyperparameters in binary and initializing the population.

Step 4: Carrying out natural selection, crossover, and gene mutation, and keeping the population size no more than S.

Step 5: Generating new populations iteratively and selecting the parameters of the optimized combination and the corresponding fitness value.

Step 6: If two adjacent iterations of the algorithm meet the stop condition

| f_{i} (x) - f_{i - 1} (x) | \leq t

the algorithm stops. Otherwise, continue to Step 7.

Step 7: If the algorithm reaches the maximum iteration M, the algorithm stops, otherwise, it goes back to Step 4.

The stop condition in Step 6,

| f_{i} (x) - f_{i - 1} (x) | \leq t

is the stop condition of the genetic algorithm and is independent of the classifiers. Here, f_{i} (x) represents the accuracy at iteration i. When the difference in accuracy between iterations i and i − 1 is no greater than the threshold t, that is, when the accuracy has stabilized, the stop condition of the genetic algorithm is reached. If the accuracy does not stabilize, the algorithm iterates until the maximum number of iterations. To sum up, GA-StackingMD has two stop conditions: the accuracy becomes stable, or the maximum number of iterations M is reached.
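Steps 2 to 7 can be sketched as a small genetic search in plain Python. This is an illustrative sketch under several assumptions that the paper does not fix: each hyperparameter is encoded with `n_bits` binary genes and decoded linearly into its range, the fitter half of the population serves as parents, single-point crossover and bit-flip mutation are used, and the best individual is carried over unchanged. The function names, the mutation rate, and the toy fitness function are hypothetical; in GA-StackingMD the fitness would be the accuracy of the Stacking model.

```python
import random


def decode(bits, ranges, n_bits):
    """Map each n_bits-long binary chunk linearly into its [low, high] range."""
    values = []
    for k, (low, high) in enumerate(ranges):
        chunk = bits[k * n_bits:(k + 1) * n_bits]
        as_int = int("".join(map(str, chunk)), 2)
        values.append(low + (high - low) * as_int / (2 ** n_bits - 1))
    return values


def genetic_search(fitness, ranges, S=20, M=50, t=1e-4, n_bits=8, p_mut=0.02, seed=0):
    """Binary-encoded GA with population size S, at most M iterations,
    and the stop rule |f_i - f_{i-1}| <= t on the best fitness."""
    rng = random.Random(seed)
    length = n_bits * len(ranges)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(S)]

    def score(bits):
        return fitness(decode(bits, ranges, n_bits))

    previous_best = None
    for generation in range(1, M + 1):
        ranked = sorted(population, key=score, reverse=True)
        best, best_score = ranked[0], score(ranked[0])
        # Step 6: stop when the best fitness of two adjacent iterations is stable.
        if previous_best is not None and abs(best_score - previous_best) <= t:
            break
        previous_best = best_score
        # Step 4: selection, crossover, and mutation, keeping the size at S.
        parents = ranked[:S // 2]
        children = [best[:]]  # keep the best individual unchanged
        while len(children) < S:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)
            child = a[:cut] + b[cut:]
            children.append([1 - g if rng.random() < p_mut else g for g in child])
        population = children
    return decode(best, ranges, n_bits), best_score, generation


# Hypothetical fitness with its maximum at (3, -1), standing in for accuracy.
def toy_fitness(p):
    return -(p[0] - 3) ** 2 - (p[1] + 1) ** 2


params, best_score, generations = genetic_search(toy_fitness, [(0, 10), (-5, 5)])
```

Taken literally, the stop rule can end the search as soon as the best fitness repeats once, so an implementation might require stability over several consecutive generations; that variation is a design choice outside what the text specifies.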

4. Experiments

4.1. Experimental Environment and Evaluation Index

The experiments are performed on the 64-bit Windows operating system, with 16G memory and Python 3.8.3. The data sets are CIC-AndMal2017 and CICMalDroid2020.

Define four parameters: TP (True Positive), FP (False positive), FN (False Negative), and TN (True Negative), to calculate the following evaluation indexes.

(1) Precision. It indicates the proportion of correctly identified positive samples among all samples identified as positive.

Precision = TP/(TP + FP), (1)

(2) Accuracy. It indicates the proportion of correctly identified positive and negative samples among all samples.

Accuracy = (TP + TN)/(TP + TN + FP + FN), (2)

(3) Recall. It indicates the proportion of correctly identified positive samples among all actual positive samples.

Recall = TP/(TP + FN), (3)

(4) F1-score. It is the harmonic mean of precision and recall, which comprehensively evaluates the classification performance.

F1-score = 2 × Precision × Recall/(Precision + Recall), (4)
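As a worked example of Equations (1) to (4), the following Python sketch computes the four indexes from a confusion matrix. The function name and the counts are hypothetical and are not taken from the experiments.

```python
def evaluation_indexes(tp, fp, fn, tn):
    """Compute precision, accuracy, recall, and F1-score from the four counts."""
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "accuracy": accuracy, "recall": recall, "f1": f1}


# Hypothetical counts: 90 true positives, 10 false positives,
# 30 false negatives, and 70 true negatives.
indexes = evaluation_indexes(tp=90, fp=10, fn=30, tn=70)
```

With these counts, precision is 0.9, recall is 0.75, accuracy is 0.8, and the F1-score is about 0.818, which lies between precision and recall as expected for a harmonic mean.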

4.2. Experiment Steps

4.2.1. Key Feature Subset Construction

We extract Permission, API, Dalvik, Intent, and Hardware from CIC-AndMal2017 and CICMalDroid2020, and respectively get 7666 and 14,600 raw features. After the two-step feature selection, we select seven feature subsets to compare their detection performance. The feature dimension ranges from 800 to 1400 with an interval of 100. The results are shown in Figure 5.

The evaluation indexes on the two data sets show a similar trend. Among the seven groups, accuracy first increases and then decreases as the number of features grows. The comprehensive indexes are best when the feature dimension is 1000. These 1000 selected features are therefore retained to construct the key feature subsets used in subsequent experiments.

4.2.2. Compare Stacking with Single Classifiers

Five independent algorithms SVM, KNN, LGBM, CatBoost, and RF are applied to malware detection on the two data sets, and then they are composed as the base classifiers of Stacking. Table 1 and Table 2 show the results of the two data sets, respectively.

From the perspective of single algorithms, LGBM has advantages over others. But after combining the five methods, Stacking achieves the best detection performance with 96.55% and 98.56% accuracy. The results show that compared with single algorithms, the integrated Stacking effectively improves malware detection performance.

4.2.3. The Optimized Combination of Base Classifier Selection

The first step of GA-StackingMD is to select the best combination of the base classifiers. The five algorithms have 31 combinations. Fixed parameters are set to compare the performance of the different combinations. Taking CIC-AndMal2017 as an example, the results are shown in Table 3.

The 31 groups are sorted according to accuracy. The combination “KNN+LGBM” achieves the best accuracy of 94.67% and has the best performance in F1-score, precision, and recall. It should be noted that some of the combinations have the same accuracy: in this step, eight combinations share the highest accuracy. Since all of these combinations include the two algorithms KNN and LGBM, we choose the KNN and LGBM combination as the best one. In the same way, the selected combination for CICMalDroid2020 is “RF+KNN+LGBM”, with 94.30% accuracy and a 96.18% F1-score.

4.2.4. Hyperparameters’ Optimization by GA

The hyperparameters of the base classifiers are different, and some hyperparameters affect the performance of the algorithm, while others do not. Thus GA is used to optimize the specific hyperparameters which are influential. Taking CICMalDroid2020 as an example, we set the value range of these parameters as in Table 4, and the parameters selected after GA are shown in Table 5.

After parameter selection, the detection results on the two data sets are shown in Figure 6.

First, the parameter selection ranges are set. Then, in the iterative process, the classifiers run with the given parameters, and the accuracies of the various parameter combinations are compared to select the best combination within the given range. The results in Figure 6 show that the optimized hyperparameters effectively improve malware detection. On CIC-AndMal2017, all four evaluation indexes improve significantly. On CICMalDroid2020, although recall decreases by 0.03 percentage points, accuracy, F1-score, and precision improve to varying degrees. Accuracy increases by 1.88 percentage points on CIC-AndMal2017 and by 0.1 percentage points on CICMalDroid2020.

4.2.5. Comparison of GA-StackingMD and Other Classifiers

The proposed GA-StackingMD is compared with other algorithms that are widely used in state-of-the-art classification methods, including XGB [35] (Extreme Gradient Boosting), NB [36] (Naive Bayes), CART [37] (Classification And Regression Tree), MLP [38] (Multi-layer Perceptron), and ERT [39] (Extremely Randomized Trees). The results of the two data sets are shown in Figure 7.

As shown in Figure 7, GA-StackingMD performs best on both data sets. In addition, XGB achieves a similar detection result.

We analyze the presented method from five aspects, including selections of key feature subsets and best combinations of base classifiers, comparison of initial Stacking with single algorithms and the proposed algorithms with other widely used algorithms, and optimization of the hyperparameters by GA. The experiment results show that the proposed GA-StackingMD could achieve better detection results.

4.3. Comparison with Literature

To compare with the existing literature, this paper summarizes several studies that have used the same data sets, including the features used, the classification methods, and the experimental results.

Table 6 summarizes the proposed model together with relevant studies in the literature and their reported performance. It is noteworthy that Reference [40] also uses a Stacking ensemble model. Three machine learning models, ET (ExtraTree), XGB, and RF, are analyzed and compared to select the most effective ensemble model. The highest value was obtained with the ensemble model in the voting structure, with an accuracy of 90.4%. That study differs from this paper in two main respects. First, its selection of the base classifiers and the meta-classifier is relatively arbitrary, and it does not propose how to choose a better model combination or how to set the hyperparameters. Second, its number of data set samples is relatively small.

In [41], the extracted features are first processed by a CNN, and an adaptive network-based fuzzy inference system (ANFIS) model is then used to classify the resulting features. The proposed model achieved 94.67% accuracy on the CICMalDroid2020 data set. Reference [42] proposed F2DC, a dynamic method for Android malware classification based on network traffic. The byte sequence of application data transmitted over the TCP/IP network is called the raw payload. This method characterizes Android malware by its raw payload and uses a CNN to learn a latent representation of the raw payload for effective classification. Reference [43] extracted system calls through dynamic analysis as features for Android malware detection, compared five different machine learning algorithms, and found that KNN performed best, reaching 85% accuracy. Reference [44] first analyzed the textual and visual features of the samples and then used a CNN to mine deep features. After this complex multi-stage feature engineering, the balanced features were fed into the Voting-Based Ensemble Learning model, which reached an accuracy of 97.76% on CICMalDroid2020. The method was carefully compared with five other machine learning methods. Ksibi et al. [45] explored detecting Android malware from images rather than from extracted features. The sample files are first converted into color images, and a deep CNN is built on the resulting images to extract higher-level semantics associated with malware. A customized CNN and the deep convolutional neural network VGG-16 were compared for Android malware detection, with an accuracy of 97.81%. Compared with these models, GA-StackingMD, the GA-based ensemble approach proposed in our work, demonstrates more competitive performance.

5. Conclusions

The open architecture and wide usage of Android make it vulnerable to malware attacks. In recent years, with the improvement of code obfuscation, packing, and other techniques, malware has become easier to produce and the number of variants is increasing. Research on Android malware detection should take practical applications into account. The problem to be considered is how to reduce the computational cost while improving the detection accuracy.

In order to reduce the high dimension caused by multiple features fusion, a two-step feature selection method based on InfoGain and Chi-square Test is proposed, which reduces the original feature dimensions of the two data sets to 13% and 10%. A Stacking model with five base classifiers is implemented which significantly improves the detection accuracy compared with single classifiers. Furthermore, GA is used to optimize the hyperparameters of the Stacking, and finally achieves the accuracies of 98.43% and 98.66% on the two data sets.

In future work, we will further optimize the selection of base classifiers so that each can play to its strengths, and further refine the selection ranges of the hyperparameters, in order to improve the effectiveness and rigor of the proposed algorithm and to promote its application in virtual environments. We are also trying to implement the Stacking algorithm on a distributed platform to significantly reduce its running cost, so that it can play a better role in real-time applications.

Author Contributions

Conceptualization, N.X. and Z.Q.; methodology, N.X. and Z.Q.; software, X.D.; validation, N.X. and Z.Q.; formal analysis, N.X., Z.Q. and X.D.; investigation, Z.Q.; resources, N.X.; data curation, N.X. and Z.Q.; writing—original draft preparation, N.X. and Z.Q.; writing—review and editing, Z.Q.; visualization, X.D.; supervision, X.D.; project administration, N.X.; funding acquisition, X.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Science and Technology Research Project of the Education Department of Jilin Province (Grant No. JJKH20231539KJ) and the Opening Project of Guangdong Province Key Laboratory of Information Security Technology (Grant No. 2020B1212060078).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data used in this paper can be obtained by contacting the authors of this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Mobile Operating System Market Share Worldwide. Available online: https://gs.statcounter.com/os-market-share/mobile/worldwide (accessed on 30 October 2022).
2. Mobile Malware Evolution. 2021. Available online: https://securelist.com/mobile-malware-evolution-2021/105876/ (accessed on 15 November 2022).
3. Tsfaty, C.; Fire, M. Malicious Source Code Detection Using Transformer. arXiv 2022, arXiv:2209.07957.
4. Gao, Y.; Lu, Z.; Luo, Y. Survey on malware anti-analysis. In Proceedings of the Fifth International Conference on Intelligent Control and Information Processing, Dalian, China, 18–20 August 2014; pp. 270–275.
5. Singh, J.; Singh, J. A survey on machine learning-based malware detection in executable files. J. Syst. Archit. 2021, 112, 101861.
6. Qiang, W.; Yang, L.; Jin, H. Efficient and Robust Malware Detection Based on Control Flow Traces Using Deep Neural Networks. Comput. Secur. 2022, 122, 102871.
7. Lindorfer, M.; Neugschwandtner, M.; Platzer, C. Marvin: Efficient and comprehensive mobile app classification through static and dynamic analysis. In Proceedings of the 2015 IEEE 39th Annual Computer Software and Applications Conference, Taichung, Taiwan, 1–5 July 2015; pp. 422–433.
8. Zhu, H.; Li, Y.; Li, R.; Li, J.; Song, H. SEDMDroid: An enhanced stacking ensemble of deep learning framework for Android malware detection. IEEE Trans. Netw. Sci. Eng. 2020, 8, 984–994.
9. Wang, X.; Zhang, L.; Zhao, K.; Ding, X.; Yu, M. MFDroid: A Stacking Ensemble Learning Framework for Android Malware Detection. Sensors 2022, 22, 2597.
10. Cen, L.; Gates, C.S.; Si, L.; Li, N. A probabilistic discriminative model for android malware detection with decompiled source code. IEEE Trans. Dependable Secur. Comput. 2014, 12, 400–412.
11. Saxe, J.; Berlin, K. Deep neural network based malware detection using two dimensional binary program features. In Proceedings of the 2015 10th International Conference on Malicious and Unwanted Software (MALWARE), Fajardo, PR, USA, 20–22 October 2015; pp. 11–20.
12. Singh, D.; Karpa, S.; Chawla, I. “Emerging Trends in Computational Intelligence to Solve Real-World Problems” Android Malware Detection Using Machine Learning. In Proceedings of the International Conference on Innovative Computing and Communications: Proceedings of ICICC 2021, Delhi, India, 20–21 February 2021; pp. 329–341.
13. Vashishtha, L.K.; Chatterjee, K.; Sahu, S.K.; Mohapatra, D.P. A Random Forest-Based Ensemble Technique for Malware Detection. In Proceedings of the Information Systems and Management Science: Conference Proceedings of 4th International Conference on Information Systems and Management Science (ISMS) 2021, Msida, Malta, 14–15 December 2021; pp. 454–463.
14. Wang, Z.; Li, K.; Hu, Y.; Fukuda, A.; Kong, W. Multilevel permission extraction in android applications for malware detection. In Proceedings of the 2019 International Conference on Computer, Information and Telecommunication Systems (CITS), Beijing, China, 28–31 August 2019; pp. 1–5.
15. Peiravian, N.; Zhu, X. Machine learning for android malware detection using permission and api calls. In Proceedings of the 2013 IEEE 25th International Conference on Tools with Artificial Intelligence, Herndon, VA, USA, 4–6 November 2013; pp. 300–305.
16. Han, W.; Xue, J.; Wang, Y.; Huang, L.; Kong, Z.; Mao, L. MalDAE: Detecting and explaining malware based on correlation and fusion of static and dynamic characteristics. Comput. Secur. 2019, 83, 208–233.
17. de la Puerta, J.G.; Sanz, B. Using dalvik opcodes for malware detection on android. Log. J. IGPL 2017, 25, 938–948.
18. Zhang, J.; Qin, Z.; Zhang, K.; Yin, H.; Zou, J. Dalvik opcode graph based android malware variants detection using global topology features. IEEE Access 2018, 6, 51964–51974.
19. Sewak, M.; Sahay, S.K.; Rathore, H. Comparison of deep learning and the classical machine learning algorithm for the malware detection. In Proceedings of the 2018 19th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), Busan, Republic of Korea, 27–29 June 2018; pp. 293–296.
20. Feizollah, A.; Anuar, N.B.; Salleh, R.; Suarez-Tangil, G.; Furnell, S. Androdialysis: Analysis of android intent effectiveness in malware detection. Comput. Secur. 2017, 65, 121–134.
21. Santos, I.; Devesa, J.; Brezo, F.; Nieves, J.; Bringas, P.G. Opem: A static-dynamic approach for machine-learning-based malware detection. In Proceedings of the International Joint Conference CISIS’12-ICEUTE’ 12-SOCO’ 12 Special Sessions, Ostrava, Czech Republic, 5–7 September 2012; pp. 271–280.
22. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
23. Maulik, U.; Bandyopadhyay, S. Genetic algorithm-based clustering technique. Pattern Recognit. 2000, 33, 1455–1465.
24. Mala, C.; Sridevi, M. Multilevel threshold selection for image segmentation using soft computing techniques. Soft Comput. 2016, 20, 1793–1810.
25. Cpalka, K.; Łapa, K.; Przybył, A. A new approach to design of control systems using genetic programming. Inf. Technol. Control 2015, 44, 433–442.
26. Qiang, X.J. Computer application under the management of network information security technology using genetic algorithm. Soft Comput. 2022, 26, 7871–7876.
27. Changxing, Q.; Yiming, B.; Yong, L. Improved BP neural network algorithm model based on chaos genetic algorithm. In Proceedings of the 2017 3rd IEEE International Conference on Control Science and Systems Engineering (ICCSSE), Beijing, China, 17–19 August 2017; pp. 679–682.
28. Elhefnawy, R.; Abounaser, H.; Badr, A. A hybrid nested genetic-fuzzy algorithm framework for intrusion detection and attacks. IEEE Access 2020, 8, 98218–98233.
29. Yildiz, O.; Doğru, I.A. Permission-based android malware detection system using feature selection with genetic algorithm. Int. J. Softw. Eng. Knowl. Eng. 2019, 29, 245–262.
30. Sesmero, M.P.; Ledezma, A.I.; Sanchis, A. Generating ensembles of heterogeneous classifiers using stacked generalization. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2015, 5, 21–34.
31. Zheng, R.; Wang, Q.; Lin, Z.; Jiang, Z.; Fu, J.; Peng, G. Cryptocurrency malware detection in real-world environment: Based on multi-results stacking learning. Appl. Soft Comput. 2022, 124, 109044.
32. Jiang, W.; Chen, Z.; Xiang, Y.; Shao, D.; Ma, L.; Zhang, J. SSEM: A novel self-adaptive stacking ensemble model for classification. IEEE Access 2019, 7, 120337–120349.
33. Lashkari, A.H.; Kadir, A.; Taheri, L.; Ghorbani, A.A. Toward Developing a Systematic Approach to Generate Benchmark Android Malware Datasets and Classification. In Proceedings of the 2018 International Carnahan Conference on Security Technology (ICCST), Montreal, QC, Canada, 22–25 October 2018; pp. 1–7.
34. Mahdavifar, S.; Kadir, A.F.A.; Fatemi, R.; Alhadidi, D.; Ghorbani, A.A. Dynamic Android Malware Category Classification using Semi-Supervised Deep Learning. In Proceedings of the 2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), Calgary, AB, Canada, 17–22 August 2020; pp. 515–522.
35. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
36. Webb, G.I.; Keogh, E.; Miikkulainen, R. Naïve Bayes. Encycl. Mach. Learn. 2010, 15, 713–714.
37. Rutkowski, L.; Jaworski, M.; Pietruczuk, L.; Duda, P. The CART decision tree for mining data streams. Inf. Sci. 2014, 266, 1–15.
38. Taud, H.; Mas, J. Multilayer perceptron (MLP). In Geomatic Approaches for Modeling Land Change Scenarios; Springer: Cham, Switzerland, 2018; pp. 451–455.
39. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42.
40. Arslan, R.S. Identify Type of Android Malware with Machine Learning Based Ensemble Model. In Proceedings of the 2021 5th International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), Ankara, Turkey, 21–23 October 2021; pp. 628–632.
41. Atacak, İ.; Kılıç, K.; Doğru, İ.A. Android malware detection using hybrid ANFIS architecture with low computational cost convolutional layers. PeerJ Comput. Sci. 2022, 8, e1092.
42. Lu, T.; Wang, J. F2DC: Android malware classification based on raw traffic and neural networks. Comput. Netw. 2022, 217, 109320.
43. Shakya, S.; Dave, M. Analysis, Detection, and Classification of Android Malware using System Calls. arXiv 2022, arXiv:2208.06130.
44. Ullah, F.; Alsirhani, A.; Alshahrani, M.M.; Alomari, A.; Naeem, H.; Shah, S.A. Explainable malware detection system using transformers-based transfer learning and multi-model visual representation. Sensors 2022, 22, 6766.
45. Ksibi, A.; Zakariah, M.; Almuqren, L.A.; Alluhaidan, A.S. Deep Convolution Neural Networks and Image Processing for Malware Detection. Preprint (Version 1). 27 January 2023. Available online: https://www.researchsquare.com/article/rs-2508967/v1 (accessed on 4 February 2023).
46. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.

Figure 1. Two-layer Stacking Structure.

Figure 2. The GA-StackingMD Framework.

Figure 3. Training of the first layer base classifiers.

Figure 4. Flow chart of GA optimization.

Figure 5. (a) CIC-AndMal2017: Comparison of seven feature sets; (b) CICMalDroid2020: Comparison of seven feature sets.

Figure 6. (a) CIC-AndMal2017: Comparison of GA-StackingMD and Initial Stacking Model; (b) CICMalDroid2020: Comparison of GA-StackingMD and Initial Stacking Model.

Figure 7. (a) CIC-AndMal2017: Comparison of GA-StackingMD and other classifiers; (b) CICMalDroid2020: Comparison of GA-StackingMD and other classifiers.

Table 1. CIC-AndMal2017 detection results.

Classifiers	Accuracy	F1-Score	Precision	Recall
SVM	94.83%	87.55%	91.34%	84.06%
KNN	91.69%	78.71%	88.29%	71.01%
LGBM	96.08%	90.57%	94.49%	86.96%
CatBoost	95.30%	88.28%	95.76%	81.88%
RandomForest	94.04%	84.92%	93.86%	77.54%
Stacking	96.55%	91.60%	96.77%	86.96%

Table 2. CICMalDroid2020 detection results.

Classifiers	Accuracy	F1-Score	Precision	Recall
SVM	98.24%	98.82%	98.78%	98.86%
KNN	97.85%	98.57%	98.44%	98.69%
LGBM	98.34%	98.89%	98.75%	99.03%
CatBoost	98.11%	98.74%	98.50%	98.98%
RandomForest	97.75%	98.49%	98.41%	98.58%
Stacking	98.56%	99.03%	98.98%	99.09%

Table 3. Comparison of different combinations on CIC-AndMal2017.

Models	Accuracy	F1-Score	Precision	Recall
[‘RF’, ‘CATTREE’]	89.34%	68.22%	79.35%	59.84%
[‘RF’]	89.50%	68.84%	79.57%	60.66%
[‘CATTREE’]	89.81%	73.03%	73.95%	72.13%
...	...	...	...	...
[‘RF’, ‘CATTREE’, ‘LGBM’]	93.57%	80.38%	96.55%	68.85%
[‘RF’, ‘LGBM’]	93.57%	80.38%	96.55%	68.85%
[‘CATTREE’, ‘LGBM’]	93.57%	80.38%	96.55%	68.85%
[‘SVM’, ‘LGBM’]	93.57%	80.38%	96.55%	68.85%
[‘KNN’, ‘LGBM’]	94.67%	85.59%	88.60%	82.79%

Table 4. The ranges of several hyperparameters.

Classifiers	Hyperparameter Ranges
KNN	knn_n_neighbors = [2, 4, 6, 8, 10, 12, 14, 16]
LGBM	lgbm_max_depths = [4, 5, 6, 7, 8, 9, 10, −1]
LGBM	lgbm_n_estimators = [60, 80, 100, 120, 140, 160, 200, 240]
RF	rf_max_depths = [4, 5, 6, 7, 8, 9, 10, 11]
RF	rf_n_estimators = [50, 60, 70, 80, 90, 100, 110, 120]

Table 5. The selected hyperparameters by GA.

Data Set	Classifiers	Best Hyperparameters
CIC-AndMal2017	KNN	knn_n_neighbors = 2
	LGBM	lgbm_max_depths = −1
	LGBM	lgbm_n_estimators = 80
CICMalDroid2020	KNN	knn_n_neighbors = 4
	LGBM	lgbm_max_depths = 6
	LGBM	lgbm_n_estimators = 160
	RF	rf_max_depths = 11
	RF	rf_n_estimators = 80

Table 6. GA-StackingMD comparison with the relevant studies in the literature.

Reference	Year	Data Sets	Feature Type	Classifier	Experimental Result
[40]	2021	CIC-AndMal2017	Permission	Ensemble model (ET, XGB, and RF)	Accuracy: 90.40%; Precision: 90.40%; Recall: 90.40%; F1-score: 90.40%
[41]	2022	Drebin, CICMalDroid2020	Permission	CNN and ANFIS (CNN, adaptive network-based fuzzy inference system)	Accuracy: 92.00%, 94.67%; Precision: 92.15%, 94.78%; Recall: 92.00%, 94.67%; F1-score: 92.01%, 94.66%
[42]	2022	Drebin, CICMalDroid2020	Network Traffic	F2DC (a novel traffic encoding scheme called F2D and CNN)	Precision: 96.30%, 82.06%; Recall: 96.03%, 81.60%; F1-score: 96.08%, 81.34%
[43]	2022	CIC-AndMal2017	System Calls	DT, KNN, LR, SVM, MLP	Precision: 85.00%; Recall: 85.00%; F1-score: 85.00%
[44]	2022	CICMalDroid2020, CIC-InvesAndMal2019	Textual Features Analysis and Visual Features Analysis (Network-based byte streams)	Voting-Based Ensemble Learning (Gaussian Naive Bayes, SVM, DT, LR, RF)	Accuracy: 97.76%, 98.44%
[45]	2023	CIC-AndMal2017, CICMalDroid2020	The RGB graphics	CNN, DCNN (VGG-16) [46]	Accuracy: 97.81%; Precision: 97.98%; Recall: 97.63%; F1-score: 97.78%
This paper	2023	CIC-AndMal2017, CICMalDroid2020	Permission, API, Dalvik opcode, Intent, and Hardware	GA-StackingMD (SVM, KNN, LGBM, CatBoost, and RF)	Accuracy: 98.43%, 98.66%; Precision: 98.28%, 99.15%; Recall: 93.44%, 99.06%; F1-score: 95.80%, 99.10%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
