Bytecode Similarity Detection of Smart Contract across Optimization Options and Compiler Versions Based on Triplet Network

Zhu, Di; Yue, Feng; Pang, Jianmin; Zhou, Xin; Han, Wenjie; Liu, Fudong

doi:10.3390/electronics11040597


by Di Zhu, Feng Yue *, Jianmin Pang, Xin Zhou, Wenjie Han and Fudong Liu

State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450001, China

* Author to whom correspondence should be addressed.

Electronics 2022, 11(4), 597; https://doi.org/10.3390/electronics11040597

Submission received: 19 January 2022 / Revised: 7 February 2022 / Accepted: 13 February 2022 / Published: 15 February 2022

(This article belongs to the Special Issue Recent Advanced Technologies and Applications of Smart Computing and Cyber Security)


Abstract

In recent years, the number of smart contracts running on the blockchain has increased rapidly, accompanied by many security problems, such as vulnerability propagation caused by code reuse and malicious transactions caused by the deployment of malicious contracts. Most smart contracts publish only their bytecode, not their source code. Research on the bytecode similarity of smart contracts supports smart contract upgrades, vulnerability search and malicious contract analysis. The difficulty is that different compiler versions and optimization options produce diverse bytecode from the same source code. This paper presents a solution comprising a series of methods to measure the similarity of smart contract bytecode. Starting from the opcodes of a smart contract, we propose a method of pre-training on basic block sequences, which yields embedded basic block vectors. Positive samples are obtained by basic block marking, and the negative sampling method is improved. We then feed the positive samples, the negative samples and the basic blocks themselves into a triplet network composed of transformers. Our solution achieves an accuracy of 97.8% in evaluation, so that basic block sequences compiled with and without optimization can be transformed into each other. In addition, the instructions are normalized, and the instruction order differences between compiler versions are normalized. Experiments show that our solution effectively reduces the bytecode differences caused by optimization options and compiler versions, and improves accuracy by 1.4% compared with existing work. We provide a data set covering 64 currently used Solidity compilers, including one million basic block pairs extracted from them.

Keywords: smart contract; bytecode similarity; basic block; triplet network

1. Introduction

Intelligent computing has become a part of daily life, and advanced computing methods and technologies have grown complex [1,2,3]. The explosive growth of intelligent computing data has brought security problems. As a distributed technology that connects data blocks in an orderly manner, blockchain can help overcome this challenge. Its decentralization reduces complexity, and it makes all data in the system open and transparent, which improves the security of intelligent computing. Blockchain has broad application prospects in healthcare [4], the Internet of Things [5], finance [6] and other fields. The combination of intelligent computing and blockchain will provide strong support for the development of intelligent computing.

A smart contract is a decentralized program running on the blockchain. According to [7], the number of smart contracts on the blockchain has increased explosively in the past five years. At the same time, the program complexity of smart contracts and the amount of code per unit contract are also increasing year by year. The code reuse of smart contracts brings more and more serious security problems [8].

In 2016, hackers exploited a “recursive call vulnerability” to illegally obtain more than 3.6 million ETH (US$60 million) raised by the “DAO” start-up team [9]. In 2018, hackers deployed a large number of malicious contracts, occupied the gas limit of whole blocks, and finally took away the “fomo3d” bonus of US$3 million (10,469 ETH) [10]. In 2021, hackers took advantage of loopholes in the blockchain website polynetwork [11] and Ethereum to transfer tokens, and stole $272 million worth of Ethereum.

At present, the programming languages for developing smart contracts include Solidity [12], Lisp Like Language (LLL) and Serpent [13]. According to statistics [7], Solidity is the newest and most popular among them. Solidity is a high-level programming language designed to implement smart contracts. After its birth, Solidity was quickly adopted by blockchains other than Ethereum (for example, expand [14], wanchain [15], tomochain [16], smartmesh [17], cpchain [18] and thundercore [19]). Solidity code is compiled to bytecode that runs on the Ethereum Virtual Machine (EVM) and uses EVM instructions. The EVM is a stack-based rather than register-based architecture. Once a call to a contract is received (via message call or transaction), the EVM first searches for and loads the contract bytecode from the local database, and then parses it into the corresponding opcodes [20].

According to statistics [21], among the top 1.5 million smart contracts deployed on the blockchain, only 32,499 (about 2%) disclosed their source code on Etherscan; the rest published only bytecode. Moreover, most smart contracts do not undergo a comprehensive security audit before deployment, which means that the vast majority of smart contracts running on the blockchain carry potential security threats [22]. Research on smart contract bytecode is therefore of great significance. Bytecode similarity analysis compares two or more smart contract bytecodes to determine their similarities and differences.

At present, the main difficulties in bytecode similarity analysis of smart contracts are as follows [23]:

Bytecode diversity caused by optimization options. Ethereum uses “gas” (i.e., a form of fee) to charge for the deployment and execution of smart contracts. To balance a contract's running time and resources, the compiler provides optimization options that optimize instructions and basic blocks according to the gas charged for deployment and execution. Therefore, the same source code produces different bytecode under the influence of optimization options.
Bytecode diversity caused by the compiler version. Since Solidity was born, its compiler has been updated very quickly: there have been nearly 100 versions, with changes across several major versions. The same source code produces different bytecode with different compiler versions.

It is worth mentioning that each bytecode has corresponding metadata, which records whether the bytecode was compiled with optimization and which compiler version was used.

2. Related Work

Although smart contracts are relatively new, there has been a lot of work on their security audit.

At present, there are three main approaches to security audits: traditional static analysis, dynamic analysis and machine learning.

Traditional static analysis performs security audits without executing the program. For example, given the bytecode and the global blockchain state, Oyente [24] executes the contract symbolically and recursively to obtain traces with path constraints, and uses this information to determine whether there is an error. Other similar works use other analysis techniques [25,26,27,28] to supplement symbolic execution, or focus on a single class of error to improve accuracy. However, due to the incompleteness of these systems and the limitations of symbolic execution (such as path explosion), no tool can guarantee complete accuracy [29]. Other work uses formal verification, defining the expected behavior of the system in mathematical language to improve correctness. For example, Securify [30] extracts semantic information from the dependencies of the contract, and then determines whether a property holds according to predefined compliance and violation patterns. Bai et al. [31] designed a framework to determine whether a contract template satisfies correctness and the necessary properties. These static methods cannot obtain the runtime context of the contract, and their false positive rate is high.

Dynamic analysis needs to execute the code, but it has a lower false positive rate than traditional static analysis. ContractFuzzer [32] and ReGuard [33] are typical fuzzing tools for detecting vulnerabilities. They execute smart contracts on a large number of randomly generated inputs, and then detect vulnerabilities from the execution logs. Because the inputs are random, reaching some deeply hidden error locations may take too much time, even when the checks guarding them can be bypassed. Sereum [34] equips the EVM with vulnerability detection code that dynamically checks correctness; once the execution violates the predefined rules, the transaction is suspended in real time. EVM* [35] is a similar detection tool, but because of distributed execution, this approach may introduce too much overhead.

With the explosive growth in the number of smart contracts, it is difficult to audit them with traditional static and dynamic methods. In addition, the amount of code and the complexity of each contract are increasing. Static analysis is based on logical reasoning over opcodes. It depends on manually constructed vulnerability models and relies heavily on prior knowledge, so it suffers from low accuracy and a high false positive rate. Dynamic simulation and pre-test methods consume a lot of time and resources, and they are gradually failing to keep up with the growth in the number of smart contracts and in the amount of code per contract.

Machine learning can help security audit tools extract experience and knowledge from the massive data related to smart contracts, and then classify and predict new samples with the trained model, improving the accuracy and effectiveness of smart contract audits. Liu et al. [36] transformed smart contract source code into an abstract syntax tree (AST) and applied it to an N-gram model [37] for security auditing. Yang et al. [38] generated three syntax-standard sequences from the AST of the smart contract source code and fed them into KenLM [39], improving accuracy and recall. Zhuang et al. [40] used the function calls of a smart contract to generate a CFG and fed it into a GCN to determine whether the contract contains vulnerabilities. Ashizawa et al. [41] embedded vectors at the granularity of a whole smart contract, combined them with opcodes and bytecode, conducted similarity research at the source level and extended it to vulnerability detection.

KEVM [42] performs fully executable formal verification of smart contract bytecode. Liu et al. [43] used symbolic transaction sketches to detect the similarity of smart contract bytecode. Wesley Joon Wie et al. [44] labeled smart contract bytecode as vulnerable or not, and fed the labeled bytecode into an LSTM. Liu et al. [23] applied birthmarks to the decompiled bytecode for similarity detection. Huang [45] decompiles the bytecode, slices the instructions, and matches bytecode similarity through embedding in a graph network.

Compared with the source code level, security audit at the bytecode level is more urgent and more difficult. This paper combines machine learning and static analysis to study the bytecode similarity of smart contracts. Under the same experimental evaluation, it improves on existing work.

In short, the main contributions of this paper are as follows:

Combined with the metadata information mentioned in the introduction, neural machine translation (NMT) is applied to the bytecode similarity measurement of smart contracts. The model trained by the triplet network uniformly converts bytecode compiled with optimization into bytecode compiled without optimization, or vice versa. This overcomes the bytecode diversity of the same source code caused by optimization options;
After instruction normalization and normalization of compiler version differences, combined with feature extraction by traversing the control flow graph (CFG), our method overcomes the bytecode diversity of the same source code caused by compiler versions;
We improve the negative sample selection method to address the current difficulty of negative sampling;
We provide a smart contract data set, including smart contract source code of different versions and more than 1 million basic block pairs.

The code and dataset of this article are open source at https://github.com/Zdddzz/smartcontract (accessed on 10 January 2022).

3. Methodology or Design and Implementation

Our main workflow is shown in Figure 1. First, we compile the crawled smart contracts to generate opcodes and extract the logical opcodes, and then normalize the instructions to form basic block sequences. Next, the corresponding positive and negative samples are obtained by basic block marking and the pre-trained model. After embedding, they are fed into a triplet network composed of transformers to measure the similarity of basic blocks in smart contracts. After analyzing the model, high-precision results are obtained. Finally, the decompiled bytecode is transformed into basic block sequences through the model, the instruction sequence differences between compiler versions are normalized, and experiments across optimization options and across compiler versions are carried out.

3.1. Dataset Formation

3.1.1. Opcode Formation and Logical Opcode Extraction

To ensure the cross-version effect, we crawled more than four thousand smart contracts on Ethereum, covering compiler versions such as 0.4–0.7. We compiled the source code of these smart contracts to generate nearly 20,000 files containing opcodes. Because not all of the code in a smart contract performs logical operations, we need to extract the logical part of each opcode file.

Smart contracts are mainly composed of three parts [46]:

Contract deployment code. When the EVM creates a contract, it first creates a contract account and then runs the deployment code. After running, it stores the logical code together with the auxdata on the blockchain, and then associates the storage address with the contract account;
Logical code. This is the part that performs the actual logical operations when the smart contract runs in the EVM;
Auxdata. This is the cryptographic fingerprint used for verification before the smart contract is deployed.

For the sake of security, smart contracts keep the deployment code as simple as possible, and the deployment code of all smart contracts is almost the same except for the deployment address. The auxdata differs for every smart contract. The part that needs a security audit is therefore the logical code. To improve the training effect and avoid meaningless training, we extract only the opcode fragments of the logical code.

3.1.2. Instruction Normalization and Basic Block Sequence Formation

The rapid development of natural language processing (NLP) in recent years has also inspired work on program fragments, and many studies use NLP to analyze programs. NLP performs relatively well on natural language. However, program fragments differ so much from natural language that they are not suitable for being fed directly into a neural network [47]. We need to process the code fragments of the opcode file so that the neural network can extract semantic features. Therefore, we normalize the opcode instructions from the opcode files, and form basic block sequences from the opcode instruction strings.

As shown in the opcode file section of Figure 2, DUPx and SWAPx instructions operate on the stack. Even in the same semantic environment, the value of x changes. Retaining x contributes little to the extraction of semantic features, and it also causes the out-of-vocabulary (OOV) problem, which increases the difficulty of training. Therefore, we normalize these instructions as DUP and SWAP.

A hexadecimal number on a line of the opcode fragment in Figure 2 represents a push instruction, which pushes that number onto the top of the EVM stack. Hexadecimal numbers also appear with logical and mathematical operations such as AND, SUB and DIV. These numbers represent function signatures, transaction information, memory addresses and so on. We normalize hexadecimal numbers as var.

The comparison instructions LT, GT, EQ and ISZERO are followed by a basic block label, which means that after the condition is evaluated, control jumps or branches to the correspondingly marked basic block. JUMP and JUMPI instructions are also marked with the basic block to jump to. We normalize every basic block label to BB (basic block).

CREATE and CREATE2 instructions play different roles in the opcode, which means that different creation instructions are used in different semantic environments, so they are not normalized. For the same reason, LOG1, LOG2, LOG3 and LOG4 are not normalized.

The next step is to form sequences of basic blocks. A basic block is a maximal sequence of statements that are executed in order. A basic block has exactly one entry and one exit: the entry is the first statement and the exit is the last. As a program, a smart contract also consists of basic blocks. In the opcode file of a smart contract, each basic block is marked. We normalize the opcode instructions within the range of each mark, and splice them to obtain the corresponding basic block sequence.
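The normalization rules of this section can be sketched as a small token rewriter. This is a minimal illustration, not the authors' implementation: the token patterns (hexadecimal literals written as `0x…`, basic block labels written as `tag_N`) and the function names are assumptions about the opcode file format, which the text does not specify.

```python
import re
from typing import List

# Assumed token shapes; the real opcode-file syntax may differ.
_HEX = re.compile(r"^0x[0-9a-fA-F]+$")              # pushed hexadecimal literal
_DUP = re.compile(r"^DUP\d+$")                      # DUP1 .. DUP16
_SWAP = re.compile(r"^SWAP\d+$")                    # SWAP1 .. SWAP16
_LABEL = re.compile(r"^tag_?\d+$", re.IGNORECASE)   # hypothetical block label


def normalize_token(token: str) -> str:
    """Apply the normalization rules to a single opcode-file token."""
    if _DUP.match(token):
        return "DUP"
    if _SWAP.match(token):
        return "SWAP"
    if _HEX.match(token):
        return "var"
    if _LABEL.match(token):
        return "BB"
    # CREATE/CREATE2 and LOG1..LOG4 are deliberately left unchanged.
    return token


def normalize_block(tokens: List[str]) -> List[str]:
    """Normalize every token of one basic block."""
    return [normalize_token(t) for t in tokens]


example = ["DUP2", "0xffffffff", "AND", "SWAP1", "tag_12", "JUMPI", "LOG2", "CREATE2"]
normalized = normalize_block(example)
```

Collapsing DUPx, SWAPx, literals and labels keeps the vocabulary small, which is the stated motivation for avoiding out-of-vocabulary tokens.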

3.1.3. Positive Sample Acquisition

Our first goal is to convert between optimized and unoptimized basic block sequences, so as to eliminate the bytecode diversity caused by optimization options. Therefore, in the triplet network, the optimized basic block sequence compiled from the same source code should be used as the positive sample. The optimized basic block conforms to the principle that the positive sample is similar to, but not identical with, the anchor in the triplet network [48].

We choose the basic block sequence extracted from the opcode file of the same smart contract compiled with optimization as the positive sample. Because the same smart contract ensures that the semantics of corresponding basic blocks are similar, the optimized basic blocks have a high similarity to the unoptimized basic blocks. This means that two basic block sequences, one extracted from the optimized opcode file and one from the unoptimized opcode file of the same smart contract, can form a basic block pair. To ensure that the paired basic blocks have corresponding semantics under the same source code, the Slither [49] IR is used to mark the basic blocks. However, besides optimizing the interior of basic blocks, the optimizer also eliminates redundant basic blocks. We therefore intersect the basic block marks of the two opcode files to obtain the basic block pairs. With this method, we obtain nearly 1 million pairs of basic block sequences.
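The intersection step can be sketched as follows. The sketch assumes that each opcode file has already been reduced to a mapping from a basic block mark to its normalized instruction sequence; the mark format (`"f:1"`) and the function name are hypothetical.

```python
from typing import Dict, List, Tuple

Block = List[str]


def pair_basic_blocks(
    unoptimized: Dict[str, Block], optimized: Dict[str, Block]
) -> List[Tuple[Block, Block]]:
    """Return (anchor, positive) pairs for marks present in both compilations.

    Blocks that the optimizer removed have no mark in `optimized`,
    so they drop out of the intersection.
    """
    shared_marks = sorted(unoptimized.keys() & optimized.keys())
    return [(unoptimized[m], optimized[m]) for m in shared_marks]


unopt = {"f:1": ["DUP", "var"], "f:2": ["SWAP"], "f:3": ["ADD"]}
opt = {"f:1": ["var"], "f:3": ["ADD"]}
pairs = pair_basic_blocks(unopt, opt)
```

Here the block marked `f:2` exists only in the unoptimized file, so it produces no pair.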

3.2. Neural Network Pre-Training

3.2.1. Neural Machine Translation

For a long time, machine translation has been a core challenge in natural language processing. In recent years, machine translation based on end-to-end neural networks has made great progress compared with previous methods. Recurrent models have been successfully used in sequence modeling, and have long been used in NMT tasks. In 2014, Cho et al. [50] first proposed using two Gated Recurrent Units (GRUs) [51] as the encoder and decoder, respectively, and applied this model to machine translation. Comparison with traditional statistical machine translation systems showed the superiority of this method in translation quality. In 2015, Bahdanau et al. [52] improved the translation quality of GRU-based recurrent neural networks, especially for long sentences. However, two problems remain [53]:

Sequential calculation limits the computational efficiency of the recurrent model. The calculation at time step t depends on the result at t − 1, which greatly limits the parallelism of the model;
Information is lost in the process of sequential calculation. The semantic relationship between contexts continues to weaken as the state is passed along, because the calculation proceeds linearly, step by step, which takes time.

In particular, some basic block sequences are very long, which leads to a large loss of semantic information.

3.2.2. Transformer with Self-Attention Mechanism

To solve this problem, in 2017 Vaswani et al. [54] proposed Transformer, a new NMT model based on the attention mechanism. Self-attention is a special case of the attention mechanism that models the dependencies between tokens at different positions in a single sequence. Because self-attention is highly parallel, Transformer significantly improves not only translation quality but also computing efficiency. For our huge data set, a transformer with efficient parallel computing and a self-attention mechanism is very suitable.

Transformer is similar to the traditional seq2seq [55] model in that it has an encoder-decoder structure. The encoder represents the sequence as an embedding: each token is represented by a vector, and its context is vectorized. The decoder predicts the probability distribution of the translated sentence from the context information obtained by the encoder, and selects the token with the highest probability as the output. Transformer first uses word2vec or another word embedding method [56] to convert the input tokens into feature vectors, then marks the position information by positional encoding, and uses the output of the multi-layer encoder as the input of the decoder. Each encoder layer is composed of a self-attention layer and a feedforward layer. Each decoder layer is composed of a self-attention layer, an encoder-decoder attention layer and a feedforward layer. Finally, the result is output through softmax. In our work, we embed the tokens of the basic block sequence, and calculate the similarity probability of basic block sequences through the encoder and decoder.

The self-attention mechanism allows the transformer to attend to different positions of the input basic block sequence during training when computing the representation of the sequence. Transformer also uses a multi-layer self-attention mechanism instead of a single self-attention operation to improve the model. This multi-layer self-attention mechanism has the following advantages:

It makes no assumptions about the temporal or spatial relationship between data, which makes it ideal for processing a set of basic block pairs;
Calculation can be carried out in parallel, so large data sets are processed efficiently;
It achieves better results for longer basic block sequences.

The Transformer structure and basic block attention are shown in Figure 3.
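To make the self-attention idea concrete, the following is a minimal sketch of scaled dot-product self-attention over the token vectors of one basic block. It is a simplification under stated assumptions: the queries, keys and values are the input vectors themselves, whereas a real Transformer layer applies learned projection matrices and several attention heads. The function names are illustrative only.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix) -> Matrix:
    """Scaled dot-product self-attention with Q = K = V = x (no learned weights)."""
    d = len(x[0])
    scores = [
        [sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in x]
        for qi in x
    ]
    weights = [softmax(row) for row in scores]
    # Each output vector is a weighted average of all input vectors.
    return [
        [sum(w * v[c] for w, v in zip(wrow, x)) for c in range(d)]
        for wrow in weights
    ]


tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
attended = self_attention(tokens)
```

Every output position is computed from all input positions at once, which is why the computation parallelizes and why distant instructions in a long basic block can still influence each other directly.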

3.2.3. Basic Block Embedding and Position Coding

In natural language processing, a sentence is converted into a vector of fixed dimension for the next step of the calculation. When using machine learning to study the similarity of the basic blocks of smart contracts, an important step is to map the basic block sequence into a fixed-dimensional vector by word embedding [57]; that is, the basic block is converted into a vector and expressed in a vector space. Figure 4 shows the embedded representation of a basic block.

A text sequence is a kind of time-series data. The position relationship of each token represents its sequence relationship in the sequence. The sequence relationship of each token often affects the semantics of the entire text sequence. The basic block sequence is also the same. The sequence of opcode instructions will also affect the semantics of the entire basic block sequence. The transformer we use introduces positional coding in the encoder. The positional coding adds the position information of the instruction in the basic block sequence to each opcode instruction vector, in order to distinguish opcode instructions at different positions.
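As an illustration of positional coding, the sketch below uses the sinusoidal scheme from the original Transformer paper [54]. Whether the authors use this exact scheme or a learned variant is not stated, so treat the choice as an assumption; the function names are hypothetical.

```python
import math
from typing import List


def positional_encoding(seq_len: int, d_model: int) -> List[List[float]]:
    """Sinusoidal position table: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(embeddings: List[List[float]]) -> List[List[float]]:
    """Add the position vector to each opcode instruction embedding."""
    table = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(vec, pos)] for vec, pos in zip(embeddings, table)]


# Two identical instruction embeddings become distinguishable by position.
encoded = add_positions([[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]])
```

After the addition, the same opcode at two different positions in the basic block no longer has the same vector, which is the purpose described above.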

3.2.4. Negative Sample Acquisition and Hard Sample Insertion

With only positive samples (basic block pairs with equivalent semantics), the training goal is to make the vector space distance between the anchor (the basic block compiled without optimization) and the positive sample (the basic block compiled with optimization) as small as possible. Because the anchor and the positive sample are trained together in the neural network, the embeddings of both are dynamic. This makes the neural network tend to map any input to the same vector. In other words, even basic blocks with different semantics end up close to each other. Therefore, we introduce negative samples (basic blocks with unequal semantics) as a necessary supplement for model training. As shown in Figure 5, with negative samples we can train the embedding network under the supervision of both positive and negative samples, so that semantically equivalent basic block pairs are close in the vector space and semantically unequal basic block pairs are far apart.

However, it is clearly impossible to manually find negative samples that differ strongly from each basic block sequence in such a large basic block data set. Two main methods have been used in other areas:

Random sampling of the data set. However, random sampling does not guarantee that the sampled basic block differs from the anchor basic block;
An approach [58] that marks the ground truth in advance and then generates a series of proposals. If the intersection over union (IoU) of a proposal and the ground truth is below a certain threshold, the proposal is a negative sample. This ensures the difference between the negative sample and the smart contract itself, but it requires labeling the ground truth, which places higher demands on the data set and consumes considerable resources. In addition, the number of positive samples may be much smaller than the number of negative samples, so the effect of training is limited.

We are inspired by Zhang et al. [59]. Given two basic blocks bb₁ and bb₂, we first obtain their embeddings E₁ and E₂ by using a pre-trained unoptimized basic block sequence encoder and aggregating the output, and then compute their Euclidean distance:

D_{b}(bb_{1}, bb_{2}) = \sqrt{\sum_{i=1}^{d} (e_{1i} - e_{2i})^{2}} \qquad (1)

where d is the embedding dimension, E₁ and E₂ are the two basic block vectors, and e_{1i} ∈ E₁ and e_{2i} ∈ E₂ are their i-th components. We judge the similarity of two basic blocks by the Euclidean distance of their embeddings: the smaller the distance, the more similar the two basic blocks; the larger the distance, the less similar they are.

When generating negative samples, we randomly select 100 basic blocks and use the distance calculated by Equation (1) on the pre-trained model to find the farthest basic block, which serves as a simple negative sample.

However, if only such simple negative samples are used, the model makes correct judgments too easily during training and performs unsatisfactorily in actual tests. We thus need hard negative samples [60] to improve the accuracy of model training. Hard negative samples are negative samples that are semantically close to the positive samples; in other words, they are difficult for the model to judge correctly. In many computer vision tasks, inserting hard negative samples into random negative samples is effective [61,62,63]. Adding an appropriate amount of hard negative samples to the negative samples can improve the accuracy of model training. The sample vector space representation and training trend are shown in Figure 6.

To generate a hard negative sample, we randomly select 100 basic blocks in the data set and take the one that is closest according to Equation (1).

Although hard negative samples have been proven beneficial to network training [64,65], using only hard negative samples makes training very unstable, so we combine the above negative sampling and hard negative sampling at a ratio of 3:1.
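The two sampling strategies can be sketched together. This is an illustration under assumptions, not the released code: `embeddings` is taken to map a basic block identifier to its pre-trained embedding, the 3:1 ratio is read as three simple negatives for every hard negative, and the function names are hypothetical. In practice the anchor's own positive sample would also have to be kept out of the candidate pool.

```python
import math
import random
from typing import Dict, List, Sequence


def euclidean(e1: Sequence[float], e2: Sequence[float]) -> float:
    """Equation (1): Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(e1, e2)))


def sample_negative(
    anchor_id: str,
    embeddings: Dict[str, Sequence[float]],
    rng: random.Random,
    hard: bool = False,
    pool_size: int = 100,
) -> str:
    """Draw a random pool; return the farthest block (simple) or the closest (hard)."""
    candidates = [k for k in embeddings if k != anchor_id]
    pool = rng.sample(candidates, min(pool_size, len(candidates)))
    choose = min if hard else max
    return choose(pool, key=lambda k: euclidean(embeddings[anchor_id], embeddings[k]))


def negatives_for(
    anchor_ids: List[str],
    embeddings: Dict[str, Sequence[float]],
    rng: random.Random,
) -> Dict[str, str]:
    """One negative per anchor; every fourth is hard (assumed 3 simple : 1 hard)."""
    return {
        a: sample_negative(a, embeddings, rng, hard=(i % 4 == 3))
        for i, a in enumerate(anchor_ids)
    }


emb = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 5.0], "d": [6.0, 8.0]}
chosen = negatives_for(["a", "b", "c", "d"], emb, random.Random(0))
```

With only four blocks the random pool contains every candidate, so the result is deterministic; with a real data set the pool of 100 makes both choices approximate rather than globally farthest or closest.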

We choose triplet loss [66] as the loss function during training. It is suited to training on samples with small differences. The goals are:

Make samples with high similarity as close as possible in the embedding vector space during training;
Keep samples with large differences as far apart as possible in the vector space during training.

The training data includes anchors, positive samples and negative samples. Training aims to make the distance in vector space between the anchor and the positive sample smaller than the distance between the anchor and the negative sample. In this article, the anchor is the basic block extracted from the unoptimized opcode file, and the positive sample is the basic block extracted from the optimized opcode file compiled from the same source code as the anchor. For negative samples, we use the above sampling method.

These two goals are not enough, however. If only they are followed, the distance between samples of the same category in the embedding vector space becomes smaller, but the distance between samples of different categories may also become smaller. Therefore, a hyperparameter named margin is added to keep a sufficient vector distance between a positive sample and a negative sample. Our experiments show that the effect is best when the margin is 120.
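The role of the margin is easiest to see in the standard triplet loss, max(D(a, p) − D(a, n) + margin, 0). The sketch below assumes this standard form with the Euclidean distance of Equation (1); the text does not spell out the exact formula used, so this is an illustration rather than the authors' loss.

```python
import math
from typing import Sequence


def euclidean(e1: Sequence[float], e2: Sequence[float]) -> float:
    """Equation (1): Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(e1, e2)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 120.0,
) -> float:
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


anchor, positive = [0.0, 0.0], [3.0, 4.0]                 # D(a, p) = 5
loss_far = triplet_loss(anchor, positive, [0.0, 200.0])   # D(a, n) = 200
loss_near = triplet_loss(anchor, positive, [0.0, 100.0])  # D(a, n) = 100
```

A negative that is already more than 120 units farther away than the positive contributes no loss, while a closer negative is still pushed away, which prevents all embeddings from collapsing toward one another.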

3.3. Cross Compiler Version Normalization and Similarity Calculation

Through the steps in Section 3.1 and Section 3.2, a model that works across optimization options is generated. Unlike the GCC compiler, which has four optimization options, the Solidity compiler has only one optimization option, and whether the current bytecode is optimized can be read from the metadata of the bytecode. Therefore, we can decompile the bytecode and convert the sequence between optimization options through the model. Then, we normalize compiler version differences according to the instruction sequence. Finally, we calculate the similarity of basic block sequences.

3.3.1. Cross Compiler Version Normalization

When the Solidity compiler is upgraded, some version differences produce different bytecode for the same source code. After the model converts across optimization options, the basic block sequence is normalized across compiler versions. The instructions have already been normalized in Section 3.1.2. To strengthen bytecode similarity detection across compiler versions, we also normalize the changes introduced by compiler versions. For example, all versions after Solidity v0.4.22 have a new behavior: if there is a SWAP instruction before a comparison operator (LT, GT, etc.), the comparison operator is replaced with the opposite operator and the SWAP instruction is removed. Therefore, we convert the combination of SWAP and LT into GT. Similarly, SWAP and SLT are converted into SGT. Figure 7 shows the difference in bytecode generated by different compiler versions from the same source code.
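The SWAP rule can be sketched as a peephole rewrite over a normalized basic block. The text states only the LT to GT and SLT to SGT conversions; the reverse mappings (GT to LT, SGT to SLT) are included here by symmetry and are an assumption, as are the function and table names.

```python
from typing import List

# LT->GT and SLT->SGT come from the text; the reverse entries are assumed by symmetry.
_OPPOSITE = {"LT": "GT", "SLT": "SGT", "GT": "LT", "SGT": "SLT"}
# After instruction normalization every SWAPx is "SWAP"; raw "SWAP1" is accepted too.
_SWAPS = {"SWAP", "SWAP1"}


def normalize_swapped_comparisons(block: List[str]) -> List[str]:
    """Rewrite `SWAP, <cmp>` into the opposite comparison and drop the SWAP."""
    out: List[str] = []
    i = 0
    while i < len(block):
        nxt = block[i + 1] if i + 1 < len(block) else None
        if block[i] in _SWAPS and nxt in _OPPOSITE:
            out.append(_OPPOSITE[nxt])
            i += 2  # consume both the SWAP and the comparison
        else:
            out.append(block[i])
            i += 1
    return out


old_style = ["var", "DUP", "SWAP", "LT", "BB", "JUMPI"]
new_style = normalize_swapped_comparisons(old_style)
```

Applying the rewrite to bytecode from older compilers brings it to the form newer compilers emit, so the two versions of the same source yield the same instruction sequence at this point.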

3.3.2. Similarity Calculation

After the conversion and cross compiler version normalization of the basic block with optimized and unoptimized options, we use Equation (1) to calculate the Euclidean distance of the basic block.

In this article, a vector represents each basic block sequence, so the difference between basic blocks can be measured by the distance between their vectors. The Euclidean distance ranges from 0 to positive infinity, but the likelihood of similarity between two basic blocks should lie between 0 and 1, so we map the result into that range using Equation (2):

\mathrm{Sim}(B_{1}, B_{2}) = \exp\left(-\frac{D(E_{1}, E_{2})}{d}\right)

(2)

where B₁ and B₂ denote the two basic blocks, d denotes the embedding dimension and E₁, E₂ are embeddings for B₁ and B₂, respectively.
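As a small worked sketch of Equation (2), the mapping from Euclidean distance to a similarity in (0, 1] can be written as follows. The function names are illustrative, and the embedding dimension d is simply taken to be the length of the vectors.

```python
import math

def euclidean(e1, e2):
    """Euclidean distance D(E1, E2) between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(e1, e2)))

def block_similarity(e1, e2):
    """Sim(B1, B2) = exp(-D(E1, E2) / d), with d the embedding dimension."""
    d = len(e1)
    return math.exp(-euclidean(e1, e2) / d)

same = block_similarity([0.0, 0.0], [0.0, 0.0])   # distance 0 maps to 1.0
apart = block_similarity([0.0, 0.0], [2.0, 0.0])  # distance 2, d = 2
```

Identical embeddings give a similarity of exactly 1, and the value decays towards 0 as the distance grows.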

3.4. Similarity Measure Extend to Bytecode

3.4.1. Key Instruction Combination Matching

Before decompilation and conversion, we build a CFG with evm-cfg-builder [67]. At the same time, a queue is maintained. While traversing the CFG (that is, while simulating the execution of the smart contract bytecode), the instruction sequence of operations on different data is recorded as a form of data dependence. Against this queue, the key instruction combinations summarized by Liu [23] are matched to form a five-dimensional vector, which is added in the later calculation. The purpose is to increase the weight of key instruction combinations in the similarity calculation.

3.4.2. Basic Block Inter Features

The actual basic blocks do not work alone—that is, they will call or be called by other basic blocks. Of course, when calling a method, the basic block also jumps to a basic block of the method. The interaction with other basic blocks (including themselves) in the same bytecode is an important semantic feature—that is, the feature between basic blocks.

Ideally, the entire call graph should be considered. For example, SMIT [68] uses call graph matching to detect similarities between malware samples. However, matching whole call graphs is computationally expensive and time-consuming.

Combined with [69], we only extract the in and out degrees of nodes (i.e., basic blocks) in the call graph as their basic block features.

More specifically, for the basic block bb₁, we embed the features between its basic blocks into a two-dimensional vector, and the formula is as Equation (3)

s(bb_{1}) = (\mathrm{in}(bb_{1}), \mathrm{out}(bb_{1}))

(3)

where in (bb₁) and out (bb₁) are the in-degree and out-degree of the corresponding basic block in the call graph, respectively. The (Euclidean) distance formula of the features between the two basic blocks is as Equation (4):

D(bb_{1}, bb_{2}) = \| s(bb_{1}) - s(bb_{2}) \|

(4)
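Equations (3) and (4) can be illustrated with a toy graph. The sketch below assumes the call graph is given as a list of (source, destination) edges between basic block identifiers; this representation and the function names are illustrative choices, not the authors' data structures.

```python
import math

def inter_block_features(edges, block):
    """s(bb) = (in(bb), out(bb)) computed from (source, destination) edges."""
    in_deg = sum(1 for _, dst in edges if dst == block)
    out_deg = sum(1 for src, _ in edges if src == block)
    return (in_deg, out_deg)

def feature_distance(edges, bb1, bb2):
    """D(bb1, bb2) = || s(bb1) - s(bb2) || (Euclidean norm)."""
    s1 = inter_block_features(edges, bb1)
    s2 = inter_block_features(edges, bb2)
    return math.hypot(s1[0] - s2[0], s1[1] - s2[1])

# Toy graph: a -> b, a -> c, b -> c, and a self-loop c -> c.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "c")]
dist_ac = feature_distance(edges, "a", "c")
```

A self-loop counts once towards the in-degree and once towards the out-degree, which matches the statement that a block's interactions include those with itself.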

3.4.3. Similarity Calculation of Bytecode

This paper measures similarity by calculating distances in the embedded vector space. There are many ways to calculate a vector distance. The key instruction combinations of Section 3.4.1 and the inter-block features of Section 3.4.2 are used in the basic block similarity calculation and the smart contract similarity calculation in this section.

Similar basic blocks in smart contract bytecode may have different calling procedures, and the program semantics affected by the combination of key instructions between basic blocks will also differ. Therefore, calculating only the Euclidean distance of the basic blocks ignores many program semantic features that need to be considered, resulting in a high false positive rate. Conversely, Eclone [23] calculates only the birthmark without considering the basic block vector distance, which is not comprehensive either, so we consider all of these features. On top of the basic block vector distance, we add the key instruction mapping table X and the two-dimensional inter-block feature vector.

In combination with the above, we calculate the basic block distance as Equation (5)

\| bb_{1} bb_{2} \| = \frac{\mathrm{Sim}_{1}(bb_{1}, bb_{2}) - w J(X_{1}, X_{2})}{\mathrm{Sim}_{1}(bb_{1}, bb_{2}) + w J(X_{1}, X_{2}) + (1 - \xi^{D(bb_{1}, bb_{2})})}

(5)

where ξ is a predefined hyperparameter in the range (0,1) that suppresses the influence of D; we set it to 0.75. Sim₁ (bb₁, bb₂) is the Euclidean distance between the basic blocks, X₁ is the key instruction matching mapping table of bb₁ and X₂ is the key instruction matching mapping table of bb₂. J is the generalized Jaccard distance between the key instruction mapping tables X₁ and X₂, and w adjusts the weight of key instruction matching on semantics. The shorter the distance, the higher the similarity, and vice versa. The generalized Jaccard distance is expressed as Equation (6):

J(A, B) = 1 - \frac{A \cdot B}{\|A\|^{2} + \|B\|^{2} - A \cdot B}

(6)
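Equations (5) and (6) can be sketched together as follows. The product of A and B in Equation (6) is read here as the inner product of the two vectors, which is an interpretation; the default weight w = 1.0 is a placeholder because the text does not give its value, and the handling of two all-zero tables is likewise an assumption.

```python
def generalized_jaccard_distance(a, b):
    """J(A, B) = 1 - (A.B) / (||A||^2 + ||B||^2 - A.B), Equation (6)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denom == 0:
        # Both tables are all zeros; treated as identical (assumption).
        return 0.0
    return 1.0 - dot / denom

def basic_block_distance(sim1, x1, x2, d, w=1.0, xi=0.75):
    """Equation (5), written term by term.

    sim1: Euclidean distance between the block embeddings (Sim_1).
    x1, x2: key instruction mapping tables (five-dimensional in the text).
    d: inter-block feature distance D from Equation (4).
    w: weight of key instruction matching (placeholder value).
    xi: 0.75 as reported in the text.
    The caller must ensure the denominator is non-zero.
    """
    j = generalized_jaccard_distance(x1, x2)
    numerator = sim1 - w * j
    denominator = sim1 + w * j + (1.0 - xi ** d)
    return numerator / denominator

j_same = generalized_jaccard_distance([1, 0, 2, 0, 1], [1, 0, 2, 0, 1])
```

Because 0 < ξ < 1, the term 1 − ξ^D grows from 0 towards 1 as the in-degree and out-degree profiles of the two blocks diverge, so it contributes only a bounded amount to the denominator.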

The similarity calculation methods used in this paper are summarized in Table 1.

We then use the CFG to extend the method of Eclone and measure the bytecode similarity between two whole contracts, contract 1 and contract 2. If G1 and G2 are the CFGs decompiled from the two bytecodes in a pair of smart contracts, then for each basic block in G1 we find its best match in G2 (i.e., the block with the maximum probability measure). Searching for the best matching basic block amounts to identifying the pair of basic blocks with the minimum vector distance.

In this paper, the similarity probability of each basic block pair is calculated using a sigmoid function [70] with a midpoint of 0.5, as in Equation (7). The purpose is to separate similar and dissimilar basic block pairs more clearly, so that the probabilities computed for the two kinds of pairs differ more:

P(bb_{1}; bb_{2}) = \frac{1}{1 + e^{-k \cdot (\| bb_{1} bb_{2} \| - 0.5)}}

(7)

In this paper, every basic block is matched probabilistically, and the absolute similarity is obtained from the matched probabilities and the theoretical probabilities. The theoretical probability is the probability that a randomly chosen basic block in contract 2 matches the basic block under test. The final absolute similarity is calculated as Equation (8) [71]:

\mathrm{Sim}(SC_{1}, SC_{2}) = \sum_{bb_{i} \in SC_{1}} \log \frac{\max_{bb_{j} \in SC_{2}} P(bb_{i}; bb_{j})}{P(bb_{i}; H_{0})}

(8)

where SC₁ is contract 1, SC₂ is contract 2 and H₀ denotes the hypothesis of random matching, so that the denominator is the probability of randomly finding a basic block in SC₂ that matches the given basic block of SC₁.

The corresponding relative similarity is the final similarity that we report, and it is calculated as Equation (9):

\mathrm{Sim}_{final}(SC_{1}, SC_{2}) = \frac{\mathrm{Sim}(SC_{1}, SC_{2})}{\mathrm{Sim}(SC_{2}, SC_{2})}

(9)
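The chain from Equation (7) to Equation (9) can be sketched as below. Several points are assumptions made only for illustration: the steepness k = 10 is a placeholder, the pairwise values of Equation (5) are supplied as a nested list, and the baseline probability under H₀ is passed in by the caller, because the text does not specify how it is estimated.

```python
import math

def match_probability(score, k=10.0):
    """Equation (7): logistic function with midpoint 0.5 (k is a placeholder)."""
    return 1.0 / (1.0 + math.exp(-k * (score - 0.5)))

def absolute_similarity(scores, h0, k=10.0):
    """Equation (8).

    scores[i][j]: Equation (5) value for block i of SC1 vs block j of SC2.
    h0[i]: baseline probability P(bb_i; H0), supplied by the caller.
    """
    total = 0.0
    for row, p_h0 in zip(scores, h0):
        best = max(match_probability(s, k) for s in row)  # best match in SC2
        total += math.log(best / p_h0)
    return total

def relative_similarity(scores_12, h0_12, scores_22, h0_22, k=10.0):
    """Equation (9): Sim(SC1, SC2) normalized by Sim(SC2, SC2)."""
    return (absolute_similarity(scores_12, h0_12, k)
            / absolute_similarity(scores_22, h0_22, k))

toy = absolute_similarity([[0.5, 0.9]], [0.5])
```

The log ratio in Equation (8) is positive only when the best match is more likely than a random match, so blocks with no convincing counterpart add little or nothing to the total.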

4. Experiment

4.1. Training Details

We use Adam [72] as the optimizer;
The hyperparameters are learning rate: 3 × 10⁻⁵, batch size: 48, epoch: 10, dropout: 0.1;
We set the maximum sequence length to 512 and pad basic block sequences that are shorter than this length;
Our training was conducted on a workstation with Intel Xeon processors, four NVIDIA 2080Ti graphics cards, 192 GB RAM, and 2 TB SSD;
The experiments were carried out five times, and the average value was taken as the final result.
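The settings listed above can be collected in a small configuration, together with a padding helper. The dictionary keys, the padding id 0, and the choice to truncate sequences longer than 512 tokens are assumptions for illustration; the text only states that shorter sequences are padded.

```python
MAX_LEN = 512  # maximum basic block sequence length reported in the text

TRAINING_CONFIG = {
    "optimizer": "adam",
    "learning_rate": 3e-5,
    "batch_size": 48,
    "epochs": 10,
    "dropout": 0.1,
}

def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=0):
    """Right-pad with pad_id (or truncate) so the result has max_len items.

    Truncation of over-long sequences is an assumption, not stated in the text.
    """
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))

padded = pad_or_truncate([7, 8, 9])
```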

4.2. Evaluation Criteria Settings

To evaluate the effectiveness of the proposed algorithm, we conducted two types of experiment:

1. Experimental evaluation of the triplet network model in Section 4.3.1, Section 4.3.2 and Section 4.3.3;

We formed more than 1 million basic block pairs from different compiler versions, split them 8:2 into a training set and an evaluation set, and divided the data into the following four datasets:

ASM_small: includes nearly 50,000 basic block pairs;
ASM_base: includes nearly 300,000 basic block pairs;
ASM_large: includes nearly 1 million basic block pairs;
ASM_base_unnormalized: includes nearly 500,000 basic block sequence pairs from different Solidity versions, without instruction normalization.

Following previous work on natural language understanding [73,74], for each basic block under test we mix its corresponding similar basic block (the correct basic block) with 99 randomly selected basic blocks, compute the similarity to each candidate according to Equation (2), and sort the candidates by similarity from high to low. We report accuracy at M (A@M): the proportion of the evaluation data whose true response ranks among the top M responses. The higher the A@M score, the better the result. For example, A@3 counts the cases where the true response is in the top three positions of the similarity ranking, and A@10 counts the cases where it is in the top ten.
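The A@M metric can be sketched in a few lines. The function below assumes that, for each query, the similarity to the correct block and the similarities to the randomly chosen blocks are already computed; counting ties against the correct block is a conservative choice made for this illustration.

```python
def accuracy_at_m(true_scores, distractor_scores, m):
    """Fraction of queries whose correct block ranks in the top m.

    true_scores[i]: similarity of query i to its correct block.
    distractor_scores[i]: similarities of query i to its random blocks
    (99 per query in the setup described in the text).
    """
    hits = 0
    for true_s, others in zip(true_scores, distractor_scores):
        # Rank 1 means no distractor scores at least as high as the truth.
        rank = 1 + sum(1 for s in others if s >= true_s)
        if rank <= m:
            hits += 1
    return hits / len(true_scores)

# Two toy queries with three distractors each.
true_scores = [0.9, 0.5]
distractors = [[0.1, 0.2, 0.3], [0.6, 0.7, 0.1]]
a_at_1 = accuracy_at_m(true_scores, distractors, 1)
```

In this toy case the first query ranks its correct block first and the second ranks it third, so A@1 is 0.5 and A@3 is 1.0.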

2. Experiments on the effectiveness of the bytecode similarity measurement in this paper across optimization options and compiler versions in Section 4.3.4, Section 4.3.5 and Section 4.3.6.

To explore how model conversion of basic block sequences helps overcome the bytecode diversity caused by optimization options when detecting smart contract bytecode similarity in the Ethereum environment, we collected nearly 1300 smart contract source codes from Etherscan (mainnet) that can be compiled by multiple compiler versions, and compiled each of them both with the optimization option (--optimize) and without it. Because bytecodes compiled from the same source code must be the most similar, we paired bytecodes (one with and one without the optimization option) to form test cases. If the two bytecodes in a test case are compiled from the same source code, the pair of unoptimized and optimized bytecodes is labeled 1; otherwise, it is labeled −1. When converting basic block sequences with the model, we uniformly convert the optimized block sequence into the unoptimized block sequence. In the actual Ethereum environment, the conversion can be decided from the optimizer option in the metadata corresponding to each bytecode.

In the evaluation, we considered four statistical types, namely true positive (TP), true negative (TN), false positive (FP) and false negative (FN). For example, if the label l of a test case is 1 (that is, the two bytecodes in the test case are compiled from the same source code) and our method also judges the two bytecodes to be similar (1), we count this test case as a TP. If instead the similarity result is −1, it is recorded as an FN. We calculated the accuracy of our method under different detection thresholds. Specifically, the accuracy is calculated as Equation (10):

\mathrm{Accuracy} = \frac{TP + TN}{n}

(10)

In this part of the experiment, we need to set the threshold θ to judge the similarity. When Sim_final ≥ θ, we judge it as similar, and when Sim_final < θ it is judged as dissimilar.
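The threshold rule and Equation (10) can be sketched as follows, taking n to be the number of test cases (an assumption, since n is not defined explicitly). The helper that scans candidate thresholds is a hypothetical convenience for finding the maximum accuracy threshold.

```python
def accuracy(similarities, labels, theta):
    """Equation (10): (TP + TN) / n with labels in {1, -1}."""
    correct = 0
    for sim, label in zip(similarities, labels):
        predicted = 1 if sim >= theta else -1  # similar iff Sim_final >= theta
        if predicted == label:
            correct += 1  # counts both true positives and true negatives
    return correct / len(labels)

def best_threshold(similarities, labels, candidates):
    """Candidate threshold with the highest accuracy (first one wins ties)."""
    return max(candidates, key=lambda t: accuracy(similarities, labels, t))

sims = [0.9, 0.8, 0.3, 0.6]
labels = [1, 1, -1, -1]
theta_star = best_threshold(sims, labels, [0.5, 0.7, 0.95])
```

In this toy example a threshold that is too low admits a false positive, while one that is too high turns every true pair into a false negative.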

4.3. Empirical Results

4.3.1. Dataset Size and Normalization

The performance of the model on the different datasets is shown in Table 2.

Table 2 shows that the accuracy of the model is higher when the dataset is large and covers the compiler versions comprehensively. When we train on the ASM_large dataset, the accuracy reaches 97.8%.

In the normalization experiment, the accuracy on the unnormalized dataset and on the normalized dataset differs by 5.1%. The main reasons are as follows:

Constants have little effect on program semantics. In the opcode of a smart contract, unnormalized constants may include memory addresses, function signatures, transaction information, etc., which may cause out-of-vocabulary (OOV) problems;
In the EVM, instructions can in some cases be replaced by other instructions of the same category without changing the semantics. This feature may confuse neural networks, especially NMT models.

Without losing semantic information, normalization reduces the number of vocabulary items that carry no semantic information, and thus yields better accuracy.

4.3.2. Different Negative Sample Sampling Methods

We compare different negative sampling methods on ASM_large. The results of the experiment are shown in Figure 8 and Table 3.

These results show that differential negative sampling achieves better results than random negative sampling, because it ensures that the distance between the anchor and the negative sample in the vector space is large, which guarantees a semantic difference between anchors and negative samples.

The accuracy of the random and differential sampling methods differs considerably. Besides the semantic guarantee described above, this is because differential negative sampling relies on pre-training, which also demonstrates the effect of pre-training. Adding hard negative samples improves the discriminative ability of the model, so mixing hard negative samples with differential negative samples achieves the best experimental results.

4.3.3. Hyperparameter Margin

We experiment with the hyperparameter margin under the same dataset and negative sampling method, and the results are shown in Table 4.

A proper margin guides the neural network to train in the right direction. A margin that is too small oversimplifies the training task, resulting in poor performance in the actual evaluation (A@M), while a margin that is too large makes the training task very difficult and the model prone to extreme situations, leading to overfitting. The experimental results in Table 4 show that 120 is the best margin.

4.3.4. Bytecode Similarity Measurement Effect across Optimization Options

To isolate the effect of converting across optimization options, all steps are kept the same except for the transformation of optimization options by the model.

As can be seen from Figure 9, the model we designed can effectively convert basic blocks across optimization options. At the maximum accuracy threshold, the bytecode accuracy is 4.7 percentage points higher than without the cross-optimization conversion. When the threshold is too high, the accuracy decreases. This is because, under different optimization options and compiler versions, the opcodes obtained by decompiling smart contract bytecode contain synonymous variants of the same basic blocks. If the threshold is too high, synonymous opcode combinations are judged not to be synonymous, which raises the number of FNs; that is, many bytecodes of the same source code are judged to be different. Note also that the maximum accuracy threshold increases after basic block sequence conversion, which further reflects the effectiveness of basic block conversion across optimization options.

4.3.5. Bytecode Similarity Measurement across Compiler Versions

To verify the effect of instruction normalization and compiler version difference normalization, we compiled the above smart contract source codes into different bytecodes using the 0.4.11 and 0.4.25 compilers, marked as 1 and −1 respectively, and evaluated them with Equation (10). The experimental results are shown in Figure 10.

Experiments show that instruction normalization and compiler version difference normalization are effective for bytecode similarity detection of smart contracts.

4.3.6. Comparative Experiment

Following the experimental methods above, we compare the approaches at the threshold that gives each its highest accuracy.

Under the same experimental setup, we conducted a comparative experiment with Eclone. Table 5 shows that our method achieves higher accuracy, and the increase in the threshold at the highest accuracy also reflects that our method is effective for conversion across optimization options and normalization across compiler versions.

5. Conclusions

This paper presents a bytecode similarity measurement method for smart contracts across optimization options and compiler versions. It is a systematic solution that includes data processing for generating basic block sequences, neural network model design and training, model conversion of basic block sequences across optimization options and normalization across compiler versions. The innovation of this scheme is to apply a neural translation network to the similarity detection of smart contract bytecode: we use a triplet network to train on the basic blocks of smart contracts, normalize instructions and cross compiler version differences, and then extract features by traversing the control flow graph, in order to overcome the diversity of smart contract bytecode that optimization options and compiler versions produce from the same source code. At the same time, the negative sampling method is improved. The evaluation results show that the model improves the measurement of bytecode similarity and raises the accuracy by 1.4% compared with the existing work. In addition, we release a dataset at the basic block level of smart contracts, which provides a comparable benchmark and supplement for related research.

Author Contributions

Conceptualization, D.Z.; Data curation, F.L.; Formal analysis, F.Y. and J.P.; Investigation, W.H.; Software, D.Z.; Supervision, J.P. and X.Z.; Validation, F.Y., X.Z. and W.H.; Writing—original draft, D.Z.; Writing—review & editing, J.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of China under Grant Nos. 61802433 and 61802435.

Data Availability Statement

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

Tong, Z.; Ye, F.; Yan, M.; Liu, H.; Basodi, S. A Survey on Algorithms for Intelligent Computing and Smart City Applications. Big Data Min. Anal. 2021, 4, 155–172. [Google Scholar] [CrossRef]
Elhoseny, M.; Salama, A.; Abdelaziz, A.; Riad, A. el-din Intelligent Systems Based on Loud Computing for Healthcare Services: A Survey. Int. J. Comput. Intell. Stud. 2017, 6, 157. [Google Scholar] [CrossRef]
Huang, H.; Yan, C.; Liu, B.; Chen, L. A Survey of Memory Deduplication Approaches for Intelligent Urban Computing. In Machine Vision and Applications; Springer: Berlin/Heidelberg, Germany, 2017. [Google Scholar]
Xu, J.; Xue, K.; Li, S.; Tian, H.; Hong, J.; Hong, P.; Yu, N. Healthchain: A Blockchain-Based Privacy Preserving Scheme for Large-Scale Health Data. IEEE Internet Things J. 2019, 6, 8770–8781. [Google Scholar] [CrossRef]
Alshaikhli, M.; Elfouly, T.; Elharrouss, O.; Mohamed, A.; Ottakath, N. Evolution of Internet of Things from Blockchain to IOTA: A Survey. IEEE Access 2021, 10, 844–866. Available online: https://ieeexplore.ieee.org/document/9662390 (accessed on 2 February 2022). [CrossRef]
Du, M.; Chen, Q.; Xiao, J.; Yang, H.; Ma, X. Supply Chain Finance Innovation Using Blockchain. IEEE Trans. Eng. Manag. 2020, 67, 1045–1058. [Google Scholar] [CrossRef]
Etherscan. Available online: https://etherscan.io/ (accessed on 2 February 2022).
Kushwaha, S.S.; Joshi, S.; Singh, D.; Kaur, M.; Lee, H.-N. Systematic Review of Security Vulnerabilities in Ethereum Blockchain Smart Contract. IEEE Access 2022, 10, 6605–6621. [Google Scholar] [CrossRef]
Ghaleb, B.; Al-Dubai, A.; Ekonomou, E.; Qasem, M.; Romdhani, I.; Mackenzie, L. Addressing the DAO Insider Attack in RPL’s Internet of Things Networks. IEEE Commun. Lett. 2019, 23, 68–71. [Google Scholar] [CrossRef] [Green Version]
Min, T.; Wang, H.; Guo, Y.; Cai, W. Blockchain Games: A Survey. In Proceedings of the 2019 IEEE Conference on Games (CoG), London, UK, 20–23 August 2019; pp. 1–8. [Google Scholar]
PolyNetwork. Available online: https://www.poly.network/#/ (accessed on 2 February 2022).
Solidity—Solidity 0.8.12 Documentation. Available online: https://docs.soliditylang.org/en/develop/ (accessed on 4 February 2022).
Serpent. Available online: https://eth.wiki/archive/serpent (accessed on 2 February 2022).
Expanse Tech. Available online: https://expanse.tech/ (accessed on 2 February 2022).
Wanchain—Build the Future of Finance. Available online: https://wanchain.org/ (accessed on 2 February 2022).
TomoChain—The Most Efficient Blockchain for the Token Economy. Available online: https://tomochain.com/ (accessed on 2 February 2022).
SmartMesh—SmartMesh Opens up a World Parallel to the Internet. Available online: https://smartmesh.io (accessed on 2 February 2022).
CPCHAIN—Cyber Physical Chain. Available online: https://www.cpchain.io/ (accessed on 2 February 2022).
ThunderCore—Decentralized Future. Today. Available online: https://www.thundercore.com/ (accessed on 2 February 2022).
Opcodes for the EVM. Available online: https://ethereum.org (accessed on 2 February 2022).
Fröwis, M.; Böhme, R. In Code We Trust? In Proceedings of the Data Privacy Management, Cryptocurrencies and Blockchain Technology, Oslo, Norway, 14–15 September 2017; Garcia-Alfaro, J., Navarro-Arribas, G., Hartenstein, H., Herrera-Joancomartí, J., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 357–372. [Google Scholar]
Wang, Z.; Jin, H.; Dai, W.; Choo, K.-K.R.; Zou, D. Ethereum Smart Contract Security Research: Survey and Future Research Opportunities. Front. Comput. Sci. 2020, 15, 152802. [Google Scholar] [CrossRef]
Liu, H.; Yang, Z.; Jiang, Y.; Zhao, W.; Sun, J. Enabling Clone Detection for Ethereum Via Smart Contract Birthmarks. In Proceedings of the 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), Montreal, QC, Canada, 25–26 May 2019; pp. 105–115. [Google Scholar]
Badruddoja, S.; Dantu, R.; He, Y.; Upadhyay, K.; Thompson, M. Making Smart Contracts Smarter. In Proceedings of the 2021 IEEE International Conference on Blockchain and Cryptocurrency (ICBC), Sydney, Australia, 3–6 May 2021; pp. 1–3. [Google Scholar]
Nikolić, I.; Kolluri, A.; Sergey, I.; Saxena, P.; Hobor, A. Finding the Greedy, Prodigal, and Suicidal Contracts at Scale. In Proceedings of the 34th Annual Computer Security Applications Conference, San Juan, PR, USA, 3–7 December 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 653–663. [Google Scholar]
Krupp, J.; Rossow, C. Teether: Gnawing at Ethereum to Automatically Exploit Smart Contracts. In Proceedings of the 27th USENIX Conference on Security Symposium, Baltimore, MD, USA, 15–17 August 2018; USENIX Association: Berkeley, CA, USA, 2018; pp. 1317–1333. [Google Scholar]
Mossberg, M.; Manzano, F.; Hennenfent, E.; Groce, A.; Grieco, G.; Feist, J.; Brunson, T.; Dinaburg, A. Manticore: A User-Friendly Symbolic Execution Framework for Binaries and Smart Contracts. In Proceedings of the 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), San Diego, CA, USA, 11–15 November 2019; pp. 1186–1189. [Google Scholar]
Kalra, S.; Goel, S.; Dhawan, M.; Sharma, S. ZEUS: Analyzing Safety of Smart Contracts. In Proceedings of the Network and Distributed Systems Security (NDSS) Symposium 2018, San Diego, CA, USA, 18–21 February 2018. [Google Scholar]
Torres, C.F.; Schütte, J.; State, R. Osiris: Hunting for Integer Bugs in Ethereum Smart Contracts. In Proceedings of the 34th Annual Computer Security Applications Conference, San Juan, PR, USA, 3–7 December 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 664–676. [Google Scholar]
Tsankov, P. Security Analysis of Smart Contracts in Datalog. In Leveraging Applications of Formal Methods, Verification and Validation. Industrial Practice; Margaria, T., Steffen, B., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 316–322. [Google Scholar]
Bai, X.; Cheng, Z.; Duan, Z.; Hu, K. Formal Modeling and Verification of Smart Contracts. In Proceedings of the 2018 7th International Conference on Software and Computer Applications, Kuantan, Malaysia, 8–10 February 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 322–326. [Google Scholar]
Jiang, B.; Liu, Y.; Chan, W.K. ContractFuzzer: Fuzzing Smart Contracts for Vulnerability Detection. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, Montpellier, France, 3–7 September 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 259–269, ISBN 978-1-4503-5937-5. [Google Scholar]
Liu, C.; Liu, H.; Cao, Z.; Chen, Z.; Chen, B.; Roscoe, B. ReGuard: Finding Reentrancy Bugs in Smart Contracts. In Proceedings of the 2018 IEEE/ACM 40th International Conference on Software Engineering: Companion (ICSE-Companion), Gothenburg, Sweden, 27 May–3 June 2018; pp. 65–68. [Google Scholar]
Rodler, M.; Li, W.; Karame, G.; Davi, L. Sereum: Protecting Existing Smart Contracts Against Re-Entrancy Attacks. arXiv 2018, arXiv:1812.05934. [Google Scholar]
Ma, F.; Fu, Y.; Ren, M.; Wang, M.; Jiang, Y.; Zhang, K.; Li, H.; Shi, X. EVM*: From Offline Detection to Online Reinforcement for Ethereum Virtual Machine. In Proceedings of the 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), Hanzhou, China, 24–27 February 2019; pp. 554–558. [Google Scholar]
Liu, H.; Liu, C.; Zhao, W.; Jiang, Y.; Sun, J. S-Gram: Towards Semantic-Aware Security Auditing for Ethereum Smart Contracts. In Proceedings of the 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE), Montpellier, France, 3–7 September 2018; pp. 814–819. [Google Scholar]
Wang, S.; Chollak, D.; Movshovitz-Attias, D.; Tan, L. Bugram: Bug Detection with n-Gram Language Models. In Proceedings of the 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), Singapore, 3–7 September 2016; pp. 708–719. [Google Scholar]
Yang, Z.; Keung, J.; Zhang, M.; Xiao, Y.; Huang, Y.; Hui, T. Smart Contracts Vulnerability Auditing with Multi-Semantics. In Proceedings of the 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC), Madrid, Spain, 13–17 July 2020; IEEE: Madrid, Spain, 2020; pp. 892–901. [Google Scholar]
Heafield, K. KenLM: Faster and Smaller Language Model Queries; Association for Computational Linguistics: Stroudsburg, PA, USA, 2011. [Google Scholar]
Zhuang, Y.; Liu, Z.; Qian, P.; Liu, Q.; Wang, X.; He, Q. Smart Contract Vulnerability Detection Using Graph Neural Network. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, Yokohama, Japan, 11–17 July 2020; International Joint Conferences on Artificial Intelligence Organization: Yokohama, Japan, 2020; pp. 3283–3290. [Google Scholar]
Ashizawa, N.; Yanai, N.; Cruz, J.P.; Okamura, S. Eth2Vec: Learning Contract-Wide Code Representations for Vulnerability Detection on Ethereum Smart Contracts. arXiv 2021, arXiv:2101.02377. [Google Scholar]
Hildenbrandt, E.; Saxena, M.; Zhu, X.; Rodrigues, N.; Daian, P.; Guth, D.; Roşu, G. KEVM: A Complete Semantics of the Ethereum Virtual Machine. In Proceedings of the 2018 IEEE 31st Computer Security Foundations Symposium (CSF), Oxford, UK, 9–12 July 2018. [Google Scholar]
Liu, H.; Yang, Z.; Liu, C.; Jiang, Y.; Zhao, W.; Sun, J. EClone: Detect Semantic Clones in Ethereum via Symbolic Transaction Sketch. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, 23–28 August 2021; ACM: Lake Buena Vista, FL, USA, 2018; pp. 900–903. [Google Scholar]
Tann, W.J.-W.; Han, X.J.; Gupta, S.S.; Ong, Y.-S. Towards Safer Smart Contracts: A Sequence Learning Approach to Detecting Security Threats. arXiv 2019, arXiv:1811.06632. [Google Scholar]
Huang, J.; Han, S.; You, W.; Shi, W.; Liang, B.; Wu, J.; Wu, Y. Hunting Vulnerable Smart Contracts via Graph Embedding Based Bytecode Matching. IEEE Trans. Inf. Forensics Secur. 2021, 16, 2144–2456. Available online: https://ieeexplore.ieee.org/document/9316905 (accessed on 17 January 2022). [CrossRef]
Mining Bytecode Features of Smart Contracts to Detect Ponzi Scheme on Blockchain. Available online: https://www.techscience.com/CMES/v127n3/42601 (accessed on 2 February 2022).
Zuo, F.; Li, X.; Young, P.; Luo, L.; Zeng, Q.; Zhang, Z. Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs. In Proceedings of the 2019 Network and Distributed System Security Symposium, San Diego, CA, USA, 24–27 February 2019. [Google Scholar] [CrossRef]
Yang, P.; Liu, W.; Yang, J. Positive Unlabeled Learning via Wrapper-Based Adaptive Sampling. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), Melbourne, Australia, 19–25 August 2017; pp. 3273–3279. [Google Scholar]
Slither, the Solidity Source Analyzer. Available online: https://github.com/crytic/slither (accessed on 2 February 2022).
Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
Sukhbaatar, S.; Grave, E.; Bojanowski, P.; Joulin, A. Adaptive Attention Span in Transformers. arXiv 2019, arXiv:1905.07799. [Google Scholar]
Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2016, arXiv:1409.0473. [Google Scholar]
Davis, A.S.; Arel, I. Faster Gated Recurrent Units via Conditional Computation. In Proceedings of the 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), Anaheim, CA, USA, 18–20 December 2016; pp. 920–924. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010.
Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems—Volume 2, Montreal, QC, Canada, 8–13 December 2014.
Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781.
Rong, X. Word2vec Parameter Learning Explained. arXiv 2014, arXiv:1411.2738.
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
Zhang, X.; Sun, W.; Pang, J.; Liu, F.; Ma, Z. Similarity Metric Method for Binary Basic Blocks of Cross-Instruction Set Architecture. In Proceedings of the Workshop on Binary Analysis Research (BAR) 2020, San Diego, CA, USA, 23 February 2020.
Xuan, H.; Stylianou, A.; Liu, X.; Pless, R. Hard Negative Examples Are Hard, but Useful. arXiv 2021, arXiv:2007.12749.
Hoffer, E.; Ailon, N. Deep Metric Learning Using Triplet Network. In Proceedings of the Similarity-Based Pattern Recognition, Copenhagen, Denmark, 12–14 October 2015; Feragen, A., Pelillo, M., Loog, M., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 84–92.
Hermans, A.; Beyer, L.; Leibe, B. In Defense of the Triplet Loss for Person Re-Identification. arXiv 2017, arXiv:1703.07737.
Shrivastava, A.; Gupta, A.; Girshick, R. Training Region-Based Object Detectors with Online Hard Example Mining. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 761–769.
Robinson, J.; Chuang, C.-Y.; Sra, S.; Jegelka, S. Contrastive Learning with Hard Negative Samples. arXiv 2020, arXiv:2010.04592.
Kim, T.; Hong, K.; Byun, H. The Feature Generator of Hard Negative Samples for Fine-Grained Image Recognition. Neurocomputing 2021, 439, 374–382.
Chen, W.; Chen, X.; Zhang, J.; Huang, K. Beyond Triplet Loss: A Deep Quadruplet Network for Person Re-Identification. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
Evm-Cfg-Builder. Available online: https://github.com/crytic/evm_cfg_builder (accessed on 4 February 2022).
Hu, X.; Chiueh, T.; Shin, K.G. Large-Scale Malware Indexing Using Function-Call Graphs. In Proceedings of the 16th ACM Conference on Computer and Communications Security, Chicago, IL, USA, 9–13 November 2009; Association for Computing Machinery: New York, NY, USA, 2009; pp. 611–620.
Liu, B.; Huo, W.; Zhang, C.; Li, W.; Li, F.; Piao, A.; Zou, W. ADiff: Cross-Version Binary Code Similarity Detection with DNN. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, Montpellier, France, 3–7 September 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 667–678, ISBN 978-1-4503-5937-5.
Mount, J. The Equivalence of Logistic Regression and Maximum Entropy Models. 2011. Available online: http://www.mfkp.org/INRMM/article/12013393 (accessed on 10 January 2022).
David, Y.; Partush, N.; Yahav, E. Statistical Similarity of Binaries. SIGPLAN Not. 2016, 51, 266–280.
Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.
Henderson, M.; Al-Rfou, R.; Strope, B.; Sung, Y.; Lukacs, L.; Guo, R.; Kumar, S.; Miklos, B.; Kurzweil, R. Efficient Natural Language Response Suggestion for Smart Reply. arXiv 2017, arXiv:1705.00652.
Yang, Y.; Yuan, S.; Cer, D.; Kong, S.; Constant, N.; Pilar, P.; Ge, H.; Sung, Y.-H.; Strope, B.; Kurzweil, R. Learning Semantic Textual Similarity from Conversations. In Proceedings of the Third Workshop on Representation Learning for NLP, Melbourne, Australia, 20 July 2018; Association for Computational Linguistics: Melbourne, Australia, 2018; pp. 164–174.

Figure 1. Overview of the main work. The top shows how the triplet network model is formed; the bottom shows the bytecode similarity analysis process across optimization options and compiler versions.

Figure 2. A basic block pair compiled from 0x0a2eaa1101bfec3844d9f79dd4e5b2f2d5b1fd4d after similarity marking, with the optimized basic block on the left and the non-optimized basic block on the right. White areas indicate no difference; green, blue and red areas indicate different degrees of difference in the same basic block under the optimization option.

Figure 3. Transformer structure and basic block attention representation.

Figure 4. Normalized basic block sequence embedding, where d is the dimension of the position embedding, t is the index of the t-th token, and c is a constant determined by i, with value $1/10000^{2i/d}$.
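The position term in the Figure 4 caption matches the sinusoidal encoding of the standard transformer (Vaswani et al.), which can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `positional_encoding` is hypothetical, and an even dimension d is assumed so that sine and cosine components pair up.

```python
import math


def positional_encoding(t: int, d: int) -> list[float]:
    """Sinusoidal position vector for the t-th token with embedding dimension d.

    Assumption: d is even, so each frequency index i contributes one sine
    component (index 2i) and one cosine component (index 2i + 1).
    """
    if d % 2 != 0:
        raise ValueError("this sketch assumes an even dimension d")
    vec = []
    for i in range(d // 2):
        c = 1.0 / (10000 ** (2 * i / d))  # constant determined by i
        vec.append(math.sin(t * c))       # component 2i
        vec.append(math.cos(t * c))       # component 2i + 1
    return vec


# Position vector of the first token in a 4-dimensional embedding.
print(positional_encoding(0, 4))
```

In a transformer this vector is typically added element-wise to the token embedding, so that identical opcodes at different positions in a basic block receive different inputs.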

Figure 5. Triplet network representation.

Figure 6. The left panel is a schematic diagram of each sample in vector space; the right panel shows that, after training, positive samples approach the anchor and negative samples move away from it.
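Triplet training of this kind is commonly driven by a margin-based loss that pulls the positive sample toward the anchor and pushes the negative sample away. The sketch below uses the usual formulation max(0, d(a, p) − d(a, n) + margin) with Euclidean distance. It is an assumption that this matches the exact loss used in the paper, and the names `euclidean` and `triplet_loss` are illustrative.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin):
    """Standard triplet margin loss (assumed form, not taken from the paper).

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is; otherwise it grows linearly with the gap.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor, positive, negative = [0.0, 0.0], [3.0, 4.0], [6.0, 8.0]
print(triplet_loss(anchor, positive, negative, margin=2.0))    # margin already satisfied
print(triplet_loss(anchor, positive, negative, margin=120.0))  # margin not yet satisfied
```

With a small margin the triplet above contributes no loss, while a large margin keeps it active, which is why the margin acts as a tunable hyperparameter.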

Figure 7. An example of bytecode diversity brought by compiler versions.

Figure 8. Loss under different sampling results. The smaller the loss, the better the effect of negative sampling.

Figure 9. Experimental results across optimization options. (a) is the result without cross-optimization-option conversion, and (b) is the result with it.

Figure 10. Experimental results across compiler versions. (a) is the result without instruction normalization and compiler version difference normalization, and (b) is the result with both applied.

Table 1. Similarity calculation methods used in this paper and the objects they are applied to.

Method	Calculation Object	Equation
Euclidean	Basic block vector distance, Basic block inter-features	Equation (1)
Generalized Jaccard	Key instruction combination mapping	Equation (6)
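For readers unfamiliar with the second measure in Table 1, the generalized Jaccard similarity of two non-negative vectors is usually defined as the sum of element-wise minima divided by the sum of element-wise maxima. The sketch below follows that textbook definition. Whether it coincides term by term with Equation (6) of the paper is an assumption, and the count vectors in the usage example are made up for illustration.

```python
def generalized_jaccard(x, y):
    """Generalized Jaccard similarity of two non-negative, equal-length vectors.

    Textbook definition: sum(min(x_i, y_i)) / sum(max(x_i, y_i)).
    Convention assumed here: two all-zero vectors are treated as identical (1.0).
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if any(v < 0 for v in x) or any(v < 0 for v in y):
        raise ValueError("vectors must be non-negative")
    denominator = sum(max(a, b) for a, b in zip(x, y))
    if denominator == 0:
        return 1.0
    return sum(min(a, b) for a, b in zip(x, y)) / denominator


# Hypothetical occurrence counts of three key instruction combinations
# in two pieces of bytecode.
counts_a = [1, 2, 0]
counts_b = [2, 1, 1]
print(generalized_jaccard(counts_a, counts_b))
```

The measure is 1 for identical count vectors and 0 when the two vectors share no non-zero component, which makes it a natural fit for comparing multisets of instruction combinations.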

Table 2. Experimental results of data set size and instruction normalization.

Dataset	A@3(%)	A@10(%)
ASM_small	81.4	91.2
ASM_base	83.7	94.5
ASM_large	88.2	97.8
ASM_base_unnormalized	79.9	89.4

Table 3. Experimental results of different negative sampling methods.

Negative Sampling	A@3(%)	A@10(%)
Random	83.5	91.4
Differentiation	87.1	95.2
Mix	88.2	97.8

Table 4. Experimental results for different margin values.

Margin	A@3(%)	A@10(%)
80	82.9	94.0
100	84.3	95.9
120	88.2	97.8
140	88.4	97.2
160	83.5	94.4

Table 5. Comparative experimental results with Eclone.

Method	Threshold	Accuracy (%)
Eclone [23]	0.84	93.1
Our method	0.89	94.5

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhu, D.; Yue, F.; Pang, J.; Zhou, X.; Han, W.; Liu, F. Bytecode Similarity Detection of Smart Contract across Optimization Options and Compiler Versions Based on Triplet Network. Electronics 2022, 11, 597. https://doi.org/10.3390/electronics11040597

