3.1. Overview
The proposed method includes the input, preprocessing, and training and classification phases, as illustrated in
Figure 1. The input phase extracts ASM and bytes files from a database with a disassembler.
The preprocessing phase includes global image generation and local feature visualization. A global image is generated by using the binary information that a binary extractor obtains from the bytes file as pixels. Local features are extracted from the ASM files by a local feature extractor and are input into an obfuscation checker, which determines whether the malware is obfuscated. If the malware is obfuscated, the local features are passed to a GAN executor. If the malware is not obfuscated, the local features are visualized by the local feature visualizer to generate a local image.
The training and classification phase, which uses a generative adversarial network (GAN) and a CNN, includes (1) GAN training and execution, (2) global and local image merging, and (3) CNN training and classification. The stages are as follows:
(1) Global and local images are input into a GAN trainer. A global image of the obfuscated malware is input into a GAN executor, which outputs a local image of the obfuscated malware.
(2) The generated global and local images are merged by an image merger.
(3) The merged image is input into a CNN trainer to train the CNN. The CNN executor receives the merged image and the trained CNN and classifies the malware into families.
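The global image generation step described above, in which binary information from the bytes file is used as pixels, can be illustrated with a minimal sketch. The function name `bytes_to_global_image`, the fixed row `width`, and the zero padding of the last row are assumptions made for illustration only; they are not taken from the proposed method.

```python
def bytes_to_global_image(data: bytes, width: int = 16) -> list[list[int]]:
    """Arrange raw byte values (0..255) row by row into a 2-D grayscale matrix.

    `width` and the zero padding are illustrative assumptions.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = list(data)  # each byte becomes one pixel intensity
    remainder = len(pixels) % width
    if remainder:
        # Pad the last row with zeros so every row has the same length.
        pixels.extend([0] * (width - remainder))
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


if __name__ == "__main__":
    for row in bytes_to_global_image(bytes([1, 2, 3, 4, 5]), width=4):
        print(row)
```

Because each byte already lies in the 8-bit range, no rescaling is needed in this sketch; only the layout of the bytes into rows has to be chosen.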
3.2. Input Phase
The database consists of malware samples, each identified by a malware family number and a malware sample number. Each malware sample is input into a binary extractor that outputs its ASM file and its bytes file.
3.3. Preprocessing Phase
The local image set consists of local images. Each local image is generated by the processes detailed in Figure 2 from a local feature extracted from the ASM file. The text section is the section of the PE file that contains the program code. A feature extractor receives an ASM file and extracts a local feature from its text section based on a predefined list. The local feature is composed of opcodes and API function names. A feature selector receives the local feature and outputs the selected local feature based on the term frequency-inverse document frequency (TFIDF) algorithm [15]. For each family, the top local features are derived in ascending order of TFIDF. The selected local feature is obtained after removing duplicate local features and local features that belong to all families. The fastText model uses distributed representation to map input words with similar meanings to similar vector values [22]. A fastText trainer receives the local feature for training and outputs the trained fastText model. The fastText executor receives the local feature and the trained fastText model and outputs the embedded local feature.
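The feature selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it treats each family as one document, assumes that "top" means the highest TFIDF scores, and uses a hypothetical function name `select_local_features` and parameter `top_k`.

```python
import math
from collections import Counter


def select_local_features(family_features: dict[str, list[str]], top_k: int) -> list[str]:
    """Pick the top_k features per family by TFIDF, then drop duplicates and
    features that occur in every family.

    family_features maps a family name to its opcode / API-name tokens.
    """
    n_families = len(family_features)
    counts = {fam: Counter(tokens) for fam, tokens in family_features.items()}

    # Document frequency: the number of families in which each feature occurs.
    df = Counter()
    for family_counts in counts.values():
        df.update(family_counts.keys())

    selected, seen = [], set()
    for family_counts in counts.values():
        total = sum(family_counts.values())
        scores = {
            feat: (n / total) * math.log(n_families / df[feat])
            for feat, n in family_counts.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda feat: (-scores[feat], feat))
        for feat in ranked[:top_k]:
            if df[feat] == n_families or feat in seen:
                continue  # shared by all families, or already selected
            seen.add(feat)
            selected.append(feat)
    return selected


if __name__ == "__main__":
    families = {
        "A": ["mov", "mov", "push", "CreateFileA"],
        "B": ["mov", "xor", "xor", "push"],
        "C": ["mov", "call"],
    }
    print(select_local_features(families, top_k=2))
```

A feature that occurs in every family has an inverse document frequency of zero, so it carries no information for separating families; the explicit removal step makes this filtering visible rather than leaving it to the ranking.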
Algorithm 1 is the local feature visualization algorithm. The local feature visualization function receives the non-obfuscated malware and the embedded local feature and outputs a local image. The local image minimum and maximum pixel values delimit the pixel range of a grayscale image, and a two-dimensional (2-D) matrix is used to generate the local image. Each element of the embedded local feature is a real number between the minimum and maximum element values. The elements are extracted, and their range, from the minimum element value to the maximum element value, is normalized to the grayscale pixel range.
Because the normalized local feature consists of values within the grayscale pixel range, the local image is generated by using these values as pixels. The size of the normalized local feature is the same as the size of the local feature embedded through the fastText model. The row size of the 2-D matrix is the size of the local feature extracted from the non-obfuscated malware, and the column size of the 2-D matrix is the size of the embedded local feature.
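Algorithm 1. Local Feature Visualization Algorithm
FUNCTION LocalFeatureVisualization (non-obfuscated malware, embedded local feature)
OUTPUT local image
BEGIN
    Set the local image minimum pixel value
    Set the local image maximum pixel value
    Initialize the 2-D matrix for the local image
    FOR zero to the size of the embedded local feature
        Extract the element of the embedded local feature
        Normalize the element to the pixel range
    END FOR
    FOR zero to the row size of the 2-D matrix
        Select the normalized local feature for the row
        FOR zero to the column size of the 2-D matrix
            Take the normalized element
            Store the element in the 2-D matrix
        END FOR
    END FOR
END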
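The normalization and matrix construction described for Algorithm 1 can be sketched as below. The sketch assumes an 8-bit grayscale range of 0 to 255, takes the minimum and maximum over all elements of the embedded local feature, and uses linear min-max scaling; the function name and the handling of a constant input are illustrative assumptions rather than details given in the algorithm.

```python
def local_feature_visualization(embedded, pixel_min=0, pixel_max=255):
    """Map embedded local features to a 2-D grayscale matrix.

    embedded: list of rows, one per local feature; each row is the
    embedding vector (floats) for that feature.
    pixel_min / pixel_max: assumed 8-bit grayscale bounds.
    """
    values = [v for row in embedded for v in row]
    if not values:
        return []
    lo, hi = min(values), max(values)
    span = hi - lo

    image = []
    for row in embedded:            # one image row per local feature
        image_row = []
        for v in row:               # one image column per embedding dimension
            if span == 0:
                scaled = pixel_min  # constant input: map everything to the minimum
            else:
                scaled = pixel_min + (v - lo) / span * (pixel_max - pixel_min)
            image_row.append(int(round(scaled)))
        image.append(image_row)
    return image


if __name__ == "__main__":
    for row in local_feature_visualization([[0.0, 1.0], [2.0, 5.0]]):
        print(row)
```

The resulting matrix has one row per local feature and one column per embedding dimension, matching the row and column sizes stated for the 2-D matrix.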