Article

MultiTagging: A Vulnerable Smart Contract Labeling and Evaluation Framework

by Shikah J. Alsunaidi 1, Hamoud Aljamaan 1,2,* and Mohammad Hammoudeh 1
1 Information and Computer Science Department, King Fahd University of Petroleum and Minerals, Dhahran 31261, Saudi Arabia
2 Interdisciplinary Research Center for Finance and Digital Economy, King Fahd University of Petroleum and Minerals, Dhahran 31261, Saudi Arabia
* Author to whom correspondence should be addressed.
Electronics 2024, 13(23), 4616; https://doi.org/10.3390/electronics13234616
Submission received: 28 September 2024 / Revised: 16 November 2024 / Accepted: 19 November 2024 / Published: 22 November 2024

Abstract:
Identifying vulnerabilities in Smart Contracts (SCs) is crucial, as they can lead to significant financial losses if exploited. Although various SC vulnerability identification methods exist, selecting the most effective approach remains challenging. This article examines these challenges and introduces solutions to enhance SC vulnerability identification. It introduces MultiTagging, a modular SC multi-labeling framework designed to overcome limitations in existing SC vulnerability identification approaches. MultiTagging automates SC vulnerability tagging by parsing analysis reports and mapping tool-specific tags to standardized labels, including SC Weakness Classification (SWC) codes and Decentralized Application Security Project (DASP) ranks. Its mapping strategy and the proposed vulnerability taxonomy resolve tool-level labeling inconsistencies, where different tools use distinct labels for identical vulnerabilities. The framework integrates an evaluation module to assess SC vulnerability identification methods. MultiTagging enables both tool-based and vote-based SC vulnerability labeling. To improve labeling accuracy, the article proposes Power-based voting, a method that systematically defines voter roles and voting thresholds for each vulnerability. MultiTagging is used to evaluate labeling across six tools: MAIAN, Mythril, Semgrep, Slither, Solhint, and VeriSmart. The results reveal high coverage for Mythril, Slither, and Solhint, which identified eight, seven, and six DASP classes, respectively. Tool performance varied, underscoring the impracticality of relying on a single tool to identify all vulnerability classes. A comparative evaluation of Power-based voting and two threshold-based methods—AtLeastOne and Majority voting—shows that while voting methods can increase vulnerability identification coverage, they may also reduce detection performance. Power-based voting proved more effective than pure threshold-based methods across all vulnerability classes.

1. Introduction

A Smart Contract (SC) is a digital agreement in the form of a computer program that is signed and stored on a blockchain network. SCs operate on the blockchain and auto-execute when predefined conditions are satisfied [1]. SCs were initially introduced in Blockchain 2.0 [2] as a core component of the Ethereum platform [3]. Later, many other platforms—such as Hyperledger Fabric, Rootstock, Stellar, and Corda—enabled SC development [4]. Each blockchain platform supports specific programming languages for SC development. Ethereum SCs, for example, are developed with either Solidity—a JavaScript-like, object-oriented programming language—or Vyper—a Python-like programming language. Hyperledger Fabric contracts are built with Java or Golang, while Rootstock supports Solidity [4].
SCs have a wide range of applications. For instance, in decentralized finance (DeFi) two parties can lend, borrow, and trade digital assets without the need for third-party control. SCs also enable the creation of Decentralized Autonomous Organizations (DAOs). Other application areas include crowdfunding, supply chain management, intellectual property rights, real estate transactions, identity verification, insurance, voting, and the gaming industry [4,5].
SCs are an attractive target for attackers due to their value [6]. SCs typically hold and manage digital assets, such as cryptocurrencies or tokens. Maintaining SC security is challenging for several reasons: (1) SCs are deployed on blockchain platforms, which have their own security considerations [7]; (2) SCs are immutable once deployed—meaning they cannot be easily modified or patched if vulnerabilities are discovered [8]; and (3) inherent flaws in the programming languages used to develop SCs expose a range of vulnerabilities that attackers can exploit.
The SC domain is still in its infancy; several security issues have been recorded in the literature, resulting in significant financial losses. According to SlowMist [9], in February 2024 a vulnerability in a lending contract led to approximately USD 1.35 million in losses. Another similar attack occurred in March 2024, resulting in losses of around USD 11.6 million. The literature shows significant efforts to enhance SC security. Several analysis tools—e.g., Slither, Mythril, and MAIAN—have been developed to aid in inspecting SC codes and identifying existing vulnerabilities [10]. Analysis tools vary in several aspects, including the number of detectable vulnerabilities, analysis time, and detection accuracy. New vulnerability patterns often emerge, requiring the ongoing maintenance of analysis tools to improve performance. Machine Learning (ML) is leveraged in SC vulnerability detection solutions to overcome several limitations in traditional analysis tools [11]. However, acquiring a dataset annotated with SC vulnerabilities to train supervised ML models or evaluate the detection performance of ML models remains challenging.
Blockchain data, such as Ethereum’s, is often publicly accessible, allowing it to be gathered and tagged manually or with analysis tools. However, creating a reasonably sized dataset manually is time-consuming and prone to human error. Data tagging using analysis tools has various issues, including the following: (1) Each tool generates a report in a specific format, often requiring manual parsing to extract tags—a process that is time-consuming and limits dataset size; (2) There is a lack of standardization in naming vulnerabilities detected by different analysis tools, which is compounded by the absence of a public registry to assist in mapping tool tags to widely recognized SC vulnerability labels, such as SWC codes [12] and DASP ranks [13]; (3) A single analysis tool is typically insufficient to discover all vulnerabilities within a contract, necessitating the use of multiple tools to increase vulnerability coverage. Conflicts are likely to arise between the tools’ decisions on vulnerability presence, which are usually resolved using threshold-based voting techniques such as AtLeastOne and Majority voting. However, these traditional techniques lack mechanisms to differentiate false positives, potentially increasing misclassifications when analysis tools vote on false flags. This emphasizes the need for novel approaches that improve labeling accuracy; (4) Emerging vulnerability patterns require regular evaluation of tool efficiency, especially for older tools.
In addition to the aforementioned issues, several published evaluation studies [14,15,16,17,18,19,20] neglect to disclose the examined tool versions or any adjusted parameters, making generalization of their findings challenging. This article introduces the MultiTagging framework, developed to address the shortcomings in current tagging methods. It also presents an evaluation study that compares the performance of six SC analysis tools and various vote-based labeling methods. The key contributions of this article can be summarized as follows.
  • The article proposes a parser mechanism for efficiently extracting vulnerability tags from SC analysis tool reports. By automating the parsing process, it addresses the time-consuming and error-prone issue of manual parsing, ensuring consistent, standardized, and accurate tag extraction across diverse tools.
  • To address the issue of inconsistent vulnerability labeling, this article introduces a mapper approach that automates the assignment of analysis tool tags to standard labels, namely SWC codes and DASP ranks. It also develops a new SC vulnerability taxonomy that maps SWC codes to the DASP Top 10 ranks. Furthermore, the article establishes a public registry that maps tool-specific vulnerability labels to standard labels, addressing labeling heterogeneity among analysis tools. This registry facilitates consistent comparisons across tools and fosters a cohesive research environment where findings are comparable, reproducible, and verifiable.
  • To improve vote-based labeling accuracy, this article proposes Power-based voting. This novel method systematically establishes the role of each analysis tool and sets optimal voting thresholds for each vulnerability type. Additionally, it develops a decision strategy for selecting the optimal voting technique, based on two key factors: the degree of overlap and voters’ performance. These factors are designed to enhance labeling accuracy and reduce the likelihood of false positives. This approach increases the reliability of vulnerability detection across multiple analysis tools.
  • To advance the field of SC vulnerability detection, this article presents an evaluation study and draws notable conclusions on the performance of six analysis tools—MAIAN, Mythril, Semgrep, Slither, Solhint, and VeriSmart. Additionally, it investigates the effectiveness of the Power-based voting method compared to two traditional voting methods—AtLeastOne and Majority voting. By evaluating recent versions of these SC analysis tools, fully disclosing the experimental setup, and providing a replication package, this study establishes a benchmark for tool performance, offering a valuable resource for researchers and practitioners. This methodology enables meaningful comparisons and iterative improvements to SC vulnerability detection methods.
The rest of this article is organized as follows. Section 2 reviews SC vulnerabilities and discusses current labeling approaches. Section 3 presents a new SC vulnerability taxonomy. Section 4 introduces the MultiTagging framework, highlighting its features and the proposed Power-based voting mechanism. Section 5 provides a detailed description of the evaluation study conducted, and Section 6 discusses the study findings. Section 7 addresses threats to the validity of our study. Finally, Section 8 concludes the article and outlines directions for future work.

2. Literature Review

Identifying vulnerabilities in SCs is essential, due to the significant financial risks posed by potential exploitation. Manual inspection of SCs is error-prone, prompting the development of numerous tools that analyze contract source codes to detect vulnerability patterns. However, reaching consensus is challenging on two fronts: (1) What vulnerabilities can each tool effectively identify? and (2) Which tool demonstrates superior performance in detecting specific vulnerabilities? A major obstacle in addressing these questions is the inconsistent terminology used across tools; they often apply distinct labels to similar vulnerabilities or refer to patterns rather than the vulnerability names, complicating direct comparisons of their detection capabilities and overall effectiveness. The rest of this section provides an overview of common SC vulnerabilities and widely recognized taxonomies (Section 2.1), followed by a discussion of findings from evaluation studies on SC analysis tools and their labeling approaches (Section 2.2). Finally, it identifies key research gaps in the SC vulnerability identification field (Section 2.3).

2.1. Definition of SC Vulnerabilities

Several types of SC vulnerability have been reported and explored by researchers and cybersecurity experts. These vulnerabilities vary in behavior and severity. In the field of SC vulnerability identification and labeling, established taxonomies, such as the Decentralized Application Security Project (DASP) Top 10 [13] and the SC Weakness Classification (SWC) registry [12,21], play a critical role. Although released in 2018, these taxonomies continue to provide a foundational and standardized framework for identifying and categorizing SC vulnerabilities [15,18,20,22,23]. This framework enhances consistency and enables comparability across study findings.
The DASP Top 10 [13], established by the NCC Group [24], classifies the ten most common Ethereum SC vulnerabilities, as outlined in Table 1. In contrast, the SWC registry [12,21] offers a more comprehensive taxonomy, describing 37 SC vulnerabilities. This registry, designed with reference to the Common Weakness Enumeration (CWE) list [25], assigns each vulnerability a unique ID. As illustrated in Figure 1, these IDs are linked to the corresponding CWE classes. This mapping demonstrates how the SWC registry aligns SC-specific vulnerabilities with broader software weakness categories, thereby enhancing cross-platform relevance and compatibility. All vulnerabilities cataloged by the SWC registry have been incorporated into the Enterprise Ethereum Alliance (EEA) EthTrust Security Levels Specification. This specification is regularly updated to reflect evolving security standards [26].
The DASP Top 10 classification [13] offers a high-level overview of major Ethereum SC vulnerabilities, while the SWC registry [12] provides a detailed classification, leveraging the CWE framework [25] to enhance cross-platform compatibility. However, the vulnerabilities identified in the DASP Top 10 are not exclusive to Ethereum; they are also applicable to SCs on other blockchain platforms. Many common vulnerabilities, such as Reentrancy, Access control, Integer overflows/underflows, and Front-running, originate from fundamental principles of blockchain technology and the way SCs interact within decentralized environments [27]. This demonstrates that the DASP Top 10 provides a broadly applicable framework for identifying SC vulnerabilities across various platforms.
It is critical to standardize the terminology for SC vulnerabilities. In the literature, DASP classes and SWC codes are used interchangeably. SWC codes can represent subcategories of DASP classes. Researchers have proposed several taxonomies to map SWC codes to DASP classes [22,28,29]. For instance, Dia et al. [28] mapped only seven SWC codes to seven DASP classes. Rameder et al. [22] mapped fourteen SWC codes to eight DASP classes. Di Angelo and Salzer [29] mapped most SWC codes to nine DASP classes. Table 2 reveals that some SWC codes are mapped differently across studies. The disparity appears in three DASP classes: 2, 5, and 10. Some vulnerabilities remain unmapped by current studies; there are two such vulnerabilities in [29], 30 in [28], and 22 in [22].

2.2. Identification of SC Vulnerabilities

The literature identifies numerous methods for detecting SC vulnerabilities, but two are commonly used to label SC samples for dataset construction: manual inspection and the use of analysis tools. In manual inspections, cybersecurity experts examine the SC source codes and label the vulnerabilities they find. However, datasets based on manual inspections, such as CodeSmells [30], are often limited in size and diversity. Conversely, several analysis tools have been developed primarily to assist in auditing and debugging contracts [19,31]. These tools vary in their methodologies, detection accuracies, and the vulnerabilities they can identify. There are eight distinct techniques employed by these tools [10], namely, static analysis, formal analysis, symbolic execution, fuzzing, code synthesis, execution tracing, transaction interception, and machine learning. Analysis tools are divided into two main categories: static and dynamic tools [19]. Dynamic tools differ from static tools in that they execute contracts to identify vulnerabilities.
Verifying the effectiveness of analysis tools in detecting SC vulnerabilities is critical, as it influences the validity of dataset labeling. Researchers have derived significant insights on the performance of several tools. Parizi et al. [14] found that SmartCheck outperformed Mythril, Oyente, and Securify in detecting ten vulnerable SC samples. These samples encompassed a diverse set of vulnerabilities, including Integer Overflow, Missing Constructor, Reentrancy, Unchecked External Call, and other real-world security issues. The Mythril tool was noted for its accuracy, characterized by a lower false positive rate. In contrast, Durieux et al. [15] discovered that Mythril outperformed SmartCheck, achieving the highest detection rate of 27% on a dataset of 69 SCs. Furthermore, the combination of Mythril and Slither led to a 10% increase in detection rate, resulting in 37% vulnerability detection. Only 42% of vulnerabilities in their annotated dataset were identified by integrating all evaluated tools. None of the tools were able to detect the Bad Randomness and Short Addresses vulnerabilities. All tested tools struggled to identify vulnerabilities in three categories: Access Control, DoS, and Front Running. After examining the execution time of tools on the remaining portion of the dataset, i.e., 44,589 SCs, researchers discovered that Slither is the fastest tool, while Manticore is the slowest. Leid et al. [32] observed that Mythril and Manticore provide comparable coverage, although Manticore had a longer execution time.
All tools tested by Ghaleb and Pattabiraman [33] failed to detect numerous vulnerability instances and yielded multiple false positives. However, Slither successfully detected all Reentrancy and tx.origin instances. For all vulnerability types, Slither had the lowest false negative rate, followed by Securify. Similarly, Dia et al. [28] showed that Slither, followed by Securify2, achieved the highest true positive rate, while Mythril had the best true negative rate. However, none of the tested tools provided effective security assurance at any level. Ji et al. [18] demonstrated that tool performance varies depending on the type of vulnerability. VeriSmart outperformed all other tools in detecting the Arithmetic class, with 100% recall. Slither and Oyente were the top performers in detecting the Reentrancy class, with recall scores of 100% and 93.5%, respectively, although Slither exhibited a higher false positive rate compared to Oyente. Slither and Solhint were able to detect all samples of the Timestamp Dependence class. SmartCheck and Slither were the best performers in detecting the Unchecked Low Level Call class, with recall scores of 100% and 86.5%, respectively. None of the tested tools were effective at detecting three classes: Access Control, DoS, and Front Running.
Kushwaha et al. [19] discovered that Reentrancy is the most commonly checked vulnerability by the majority of the analyzed tools. Slither, Solhint, and SmartCheck are the fastest SC analysis tools, while Manticore is the slowest. Slither, Mythril, and Oyente outperformed the other tested tools in terms of vulnerability detection. Di Angelo et al. [20] revealed valuable insights into the robustness of 12 bytecode-based analysis tools over time. By testing the tools on a dataset covering six years of Ethereum activity, they observed that the number of reported vulnerabilities decreased over time, as did the tools’ performance. Lack of tool maintenance is identified as one of the causes of this performance degradation. Mythril, Oyente, and Vandal performed consistently, with no errors, while HoneyBadger, MAIAN, and Osiris exhibited an increase in error rate after 7.5 million blocks.
According to the literature, most of the tools examined perform poorly in detecting various vulnerabilities. Combining these tools, however, can improve both vulnerability coverage and detection rate [28]. Several combination approaches can be used; e.g., Yashavant et al. [34] employed a majority voting mechanism to establish dataset labels. Ren et al. [17] recommended combining heterogeneous tools, e.g., static and dynamic tools, to reduce false positives. Zhang et al. [16] found that static analysis tools typically provide extensive coverage. Furthermore, combining Mythril, Slither, and Remix improves the coverage rate.
Technologies have been harnessed to enable controlled testing environments through the development of automated frameworks such as SolidiFI [33], ScrawlD [34], SmartBugs [35,36], and USCV [18]. SolidiFI works by injecting bugs into SC samples and then utilizes them to examine the tool’s detection performance. ScrawlD aids in parsing the analysis reports of five tools to obtain vulnerability tags. It enables mapping the extracted tags to one of eight SWC codes. SmartBugs and USCV are two executable frameworks that provide a uniform environment for running various SC analysis tools. SmartBugs 2.0.8 [37] was designed to run and extract tags from 20 tools. USCV [38], on the other hand, provides a method to evaluate the performance of eight tools using several metrics. Both SmartBugs and USCV employ a pipelining strategy, which limits the ability to use one of their services directly.
Table 3 summarizes the key features and settings of each evaluation study discussed in this section. It provides details on the number of evaluated tools, the benchmark data size, and the evaluation measurements used by each study. It indicates whether or not the study automated parsing and mapping processes. Additionally, it reveals studies that lacked key information necessary for replication, including execution environment, tool versions, and parameter values.

2.3. Research Gaps

Many researchers have sought to standardize a taxonomy of SC vulnerabilities and assess the effectiveness of analysis tools in detecting various vulnerability types. However, this study aims to address several critical limitations, which are summarized as follows:
  • Lack of a standardized and comprehensive taxonomy of SC vulnerabilities. Table 3 shows that studies label SC vulnerabilities using three approaches: (1) employing established taxonomies, such as DASP ranks or SWC codes; (2) using their own taxonomy; and (3) not applying any taxonomy, though labels are often similar to SWC codes. The absence of uniform definitions leads to inconsistencies, with multiple labels referring to the same vulnerabilities [39]. Table 2 demonstrates efforts to establish a uniform and simple taxonomy by mapping SWC codes to DASP categories; however, certain conflicts remain to be addressed. Furthermore, the majority of analysis tool developers use different tag names for their vulnerability detectors. Tool evaluators may therefore assign the same detectors to distinct vulnerability classes. Because of these variations, comparing evaluation study results and drawing general conclusions is challenging. This article addresses this gap by resolving conflicts in existing mappings and proposing a comprehensive taxonomy that aligns SWC codes with corresponding DASP ranks. Additionally, it develops a public registry that maps detectors from various SC analysis tools to the relevant tags using the proposed taxonomy. Designed to be generic and updatable, the registry enables the inclusion of additional analysis tools, thereby broadening its applicability across diverse tools and contexts.
  • Lack of automated vulnerability parsing and mapping approaches. Analysis tools report results in various formats that require parsing and interpretation to yield understandable and comparable conclusions. Despite notable efforts to address this gap, further contributions are needed. For instance, the ScrawlD parser [34] is limited to five analysis tools, and the mapper supports only eight SWC codes. Additionally, this project has not been updated since July 2022. The SmartBugs parser [15,35,36] cannot directly extract tags from reports generated outside its framework. Furthermore, it does not provide an automated method for mapping tool tags to a common SC vulnerability taxonomy. The SolidiFI [33] and USCV [18] frameworks also lack comprehensive mappers that consider all detectors of the analysis tools. Additionally, their repositories appear to be deprecated, with the most recent updates in May 2022 and July 2021, respectively. This study contributes to bridging this gap by developing MultiTagging, an open-source framework that introduces a parser mechanism and mapping strategy for automating the generation of common vulnerability tags from contract analysis reports. Its modular architecture enables adaptation to various SC analysis tools, enhancing the efficiency and consistency of vulnerability identification.
  • Lack of key information needed for replication. As demonstrated in Table 3, many studies failed to provide essential details about the experiment—such as the execution environment, tool versions, and parameter values—making replication challenging. Additionally, variations in experimental settings can significantly influence tool performance, potentially leading to misleading conclusions [17]. This study addresses these issues by investigating the effectiveness of vote-based methods versus individual SC analysis tools in detecting SC vulnerabilities. By evaluating recent versions of multiple SC analysis tools, fully disclosing the experimental setup, and providing a replication package, this study establishes a current benchmark for tool performance and serves as a valuable resource for both researchers and practitioners. Both the comprehensive disclosure and the replicable setup enhance reproducibility, facilitate future comparisons and improvements, foster consistent assessment processes, and promote the ongoing development of effective vulnerability detection techniques.

3. SC Vulnerability Taxonomy

This section introduces our proposed mapping of SWC codes to DASP classes, as illustrated in Figure 2. This mapping is based on a comprehensive review of the DASP Top 10 vulnerability classes [13], the SWC registry [12], and the EEA EthTrust Security Levels Specification [26], as well as an examination and comparison of SWC code mappings in both [22,29]. Our proposed taxonomy is comparable to that of [29], with minor modifications, discussed as follows.
  • SWC100 and SWC108 are mapped to DASP classes 2 and 10 in [22,29], respectively. Specifically, DASP class 2 [40]—termed Access Control class—includes all vulnerabilities that grant an attacker access to a contract’s private values. The insecure visibility setting is an example of such vulnerabilities. The titles of SWC100 [41] and SWC108 [42] are “Function Default Visibility” and “State Variable Default Visibility”, respectively. These vulnerabilities arise from failing to explicitly declare the visibility type (access modifier) of functions or variables. In Solidity, the default access modifier is “public”. Consequently, SWC100 and SWC108 should be classified under DASP class 2, not class 10.
  • SWC106 is mapped to DASP classes 2 and 5 in [22,29], respectively. The title of SWC106 [43] is “Unprotected SELFDESTRUCT Instruction”. This vulnerability arises due to insufficient access control rules that allow attackers to trigger the self-destruct function of the contract. Since access control breaches can lead to a DoS attack, correctly assigning this SWC code to a DASP class is challenging. Considering the core cause of the vulnerability—improper access control rules—DASP class 2 is more appropriate for SWC106 than class 5.
  • SWC121 and SWC122 codes are unmapped in studies [22,29]. The titles of SWC121 [44] and SWC122 [45] are “Missing Protection against Signature Replay Attacks” and “Lack of Proper Signature Verification”, respectively. SWC121 arises from the absence of a reliable mechanism for verifying cryptographic signatures, allowing a malicious user to gain unauthorized access by launching a signature replay attack with a hash of another processed message. SWC122 results from a lack of an effective method for verifying data authenticity. In blockchain systems, messages are authenticated using digital signatures; however, since SCs cannot sign messages, alternative signature verification procedures are necessary. Implementing an improper verification method can lead to the acceptance of invalid authentication data, compromising the system’s integrity. A malicious user could exploit this vulnerability to gain unauthorized access. Consequently, it is more appropriate to classify SWC121 and SWC122 under DASP class 2.
  • SWC132 is mapped to DASP class 10 in [29]. The title of SWC132 [46] is “Unexpected Ether balance”. This vulnerability arises from strict equality checks on a contract’s Ether balance. A malicious user can manipulate the balance of the target contract by forcibly sending Ether using the “selfdestruct” function. This action can cause the check to fail and potentially lock the contract, resulting in a DoS attack. Given these characteristics, DASP class 5, which deals with vulnerabilities leading to DoS attacks, is more appropriate for SWC132.
To facilitate the comparison of analysis tools, we assigned the relevant DASP class and SWC code to each detector based on descriptions provided by their developers (the complete tools mapping registry is available online [47]).
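To make these adjustments concrete, the minimal sketch below (in Python, the language the framework is written in) encodes them as a lookup table layered over a base SWC-to-DASP mapping such as the one in [29]; the dictionary and function names are illustrative and are not part of the published registry [47].

# Illustrative sketch of the taxonomy adjustments discussed above.
# DASP 2 = Access Control, DASP 5 = Denial of Service.
TAXONOMY_ADJUSTMENTS = {
    "SWC-100": 2,  # Function Default Visibility
    "SWC-108": 2,  # State Variable Default Visibility
    "SWC-106": 2,  # Unprotected SELFDESTRUCT Instruction
    "SWC-121": 2,  # Missing Protection against Signature Replay Attacks
    "SWC-122": 2,  # Lack of Proper Signature Verification
    "SWC-132": 5,  # Unexpected Ether balance
}

def dasp_class(swc_code: str, base_mapping: dict[str, int]) -> int | None:
    # Apply the proposed adjustments on top of a base mapping (e.g., the one in [29]).
    return TAXONOMY_ADJUSTMENTS.get(swc_code, base_mapping.get(swc_code))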

4. MultiTagging Framework

This section introduces our open-source MultiTagging framework [48]. Figure 3 provides an overview of the four main modules of the framework, which are described in the following subsections. The entire framework was developed using Python 3.12.2.

4.1. Analysis Tool Reports Tagger

Since the analysis reports produced by each tool vary in format and vulnerability labels, this module assists in interpreting the reports to create uniform labels. This module can be used to create an SC vulnerability dataset with standardized labels, such as SWC codes and DASP ranks. Having standardized labels aids in generalizing some conclusions, e.g., the number/type of vulnerabilities a tool can detect when compared to others. The Tagger module consists of the following two components:
  • Parser: This component takes two inputs—the analysis reports generated by each tool and the analysis time taken to produce each report. Each tool uses a specific keyword to indicate the vulnerability location in its report (e.g., Slither uses “check”). The parser scans each report for this indicator keyword and extracts the relevant vulnerability tags. It then passes the extracted tags and corresponding analysis time to the Mapper for further processing.
  • Mapper: To address the lack of uniformity in naming SC vulnerabilities, we developed a public Vulnerability Map Registry, which is available online [47], for six selected tools. This registry, however, can be expanded to support additional tools. The Mapper uses this registry to map the tool-specific tags to standardized SWC/DASP labels. It outputs a dataframe consisting of the following columns: vulnerability tags extracted by the tool, SWC codes and titles, DASP ranks and titles, and the analysis time.
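The following minimal sketch illustrates the Tagger flow described above, assuming JSON-formatted reports and a small registry excerpt; the registry entries, report structure, and column names shown here are illustrative assumptions rather than the framework’s exact implementation.

import json
import pandas as pd

# Hypothetical excerpt of the Vulnerability Map Registry: tool tag -> (SWC code, DASP rank).
REGISTRY = {
    "slither": {
        "reentrancy-eth": ("SWC-107", 1),  # illustrative entry: DASP 1 = Reentrancy
        "timestamp": ("SWC-116", 8),       # illustrative entry: DASP 8 = Time Manipulation
    },
}

def parse_report(report_path: str, indicator: str = "check") -> list[str]:
    # Parser: scan a JSON report for the tool's indicator keyword and collect raw tags.
    with open(report_path) as fh:
        report = json.load(fh)
    tags: list[str] = []

    def walk(node):
        if isinstance(node, dict):
            if indicator in node:
                tags.append(node[indicator])
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(report)
    return tags

def map_tags(tool: str, tags: list[str], analysis_time: float) -> pd.DataFrame:
    # Mapper: translate tool-specific tags into standardized SWC codes and DASP ranks.
    rows = [{"tool_tag": tag,
             "swc": REGISTRY.get(tool, {}).get(tag, (None, None))[0],
             "dasp": REGISTRY.get(tool, {}).get(tag, (None, None))[1],
             "analysis_time": analysis_time}
            for tag in tags]
    return pd.DataFrame(rows)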

4.2. Analysis Tool Evaluator

Analysis tools should be regularly evaluated to assess their effectiveness in identifying emerging vulnerability patterns and their compatibility with updated compilers and execution environments, such as the Solidity compiler and EVM opcodes. The Evaluator module assists in achieving this goal using the following components:
  • Preparer: This component reads and filters both actual (base data) and predicted labeled data from analysis tools based on user requirements. It applies three filters: (1) Tools, which includes only the tools specified by the user; (2) Base, which considers scores derived from base data identified by the user, allowing for cases where multiple datasets are tested; and (3) Fairness, which addresses instances where some tools fail to process certain samples, allowing the user to include only those samples analyzed by all tools. The Preparer then identifies the vulnerabilities that each tool can detect and, if Fairness is not applied, removes any samples that the tool failed to analyze. This process ensures an accurate evaluation of each tool.
  • Counter: This component compares each tool’s predicted labels to the actual labels, using the Preparer’s output to generate a confusion matrix for each tool. The confusion matrix provides essential metrics—true positives, true negatives, false positives, and false negatives—which are passed to the next component to obtain a detailed assessment of each tool’s performance.
  • Performance Measure: This component uses the tool’s confusion matrix to compute a range of performance metrics, such as precision and recall. These metrics provide valuable insights into the tool’s performance for each label (SC vulnerability).
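A minimal sketch of the Counter and Performance Measure steps is given below, assuming the actual and predicted labels for one vulnerability class are aligned 0/1 pandas Series; the function names are illustrative.

import pandas as pd

def confusion_counts(actual: pd.Series, predicted: pd.Series) -> dict[str, int]:
    # Counter: build the confusion-matrix entries for one vulnerability class.
    return {"TP": int(((actual == 1) & (predicted == 1)).sum()),
            "FP": int(((actual == 0) & (predicted == 1)).sum()),
            "TN": int(((actual == 0) & (predicted == 0)).sum()),
            "FN": int(((actual == 1) & (predicted == 0)).sum())}

def precision_recall(c: dict[str, int]) -> tuple[float, float]:
    # Performance Measure: derive precision and recall from the counts.
    precision = c["TP"] / (c["TP"] + c["FP"]) if c["TP"] + c["FP"] else 0.0
    recall = c["TP"] / (c["TP"] + c["FN"]) if c["TP"] + c["FN"] else 0.0
    return precision, recall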

4.3. Labels Elector

To date, none of the existing analysis tools can identify all SC vulnerabilities, and the detection accuracy of these tools for certain vulnerabilities remains poor. SCs can be labeled using the combined votes of multiple tools; however, the accuracy of these voting-based labels requires validation, as they may yield unsatisfactory outcomes. This module assists in producing vote-based labels, with the following two components:
  • Preparer: This component reads and filters the labeled data produced by each tool according to user requirements, applying the same filters as the Evaluator module—Tools, Base, and Fairness. It then passes a dataframe containing aggregated votes (i.e., labels produced by each tool) for each vulnerability to the next component.
  • Voter: This component applies voting methods to the aggregated vote data produced by the Preparer. It outputs a dataset labeled based on tool votes. The Labels Elector module offers two voting mechanisms, threshold-based and power-based, which are described in the following subsections.

4.3.1. Threshold-Based Voting

In this voting mechanism, a contract’s vulnerability is acknowledged once the required minimum number of votes is attained. The MultiTagging framework offers two popular threshold-based voting methods:
  • AtLeastOne. The contract’s vulnerability exists if at least one tool predicts it.
  • Majority. The contract’s vulnerability exists if at least half of the tools can identify it.
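As a minimal sketch, the two rules can be expressed as follows, where votes is assumed to be a list of 0/1 flags, one per analysis tool, for a single contract and vulnerability class.

def at_least_one(votes: list[int]) -> int:
    # Positive if at least one tool flags the vulnerability.
    return int(any(votes))

def majority(votes: list[int]) -> int:
    # Positive if at least half of the tools flag the vulnerability.
    return int(sum(votes) >= len(votes) / 2)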

4.3.2. Power-Based Voting

The article introduces this newly developed voting mechanism, which implements a systematic procedure to define the roles of analysis tools and establish an appropriate voting threshold for each vulnerability. It considers three influential factors: tool capability, sensitivity, and similarity. The voting threshold—the minimum number of votes required to acknowledge the existence of a contract’s vulnerability—varies depending on the analysis tools used. Algorithm 1 summarizes the Power-based voting mechanism. The algorithm takes as input the performance scores, $P$, of the analysis tools, consisting of $n$ pairs $(R, P)$, where $R$ represents the Recall score and $P$ is the Precision score. It also takes the tool similarity scores, $S$, an $n \times n$ matrix that provides the overlap degrees between each tool and its peers. The algorithm considers five significant thresholds: Low recall score, $LR_{threshold}$; High recall score, $HR_{threshold}$; Low precision score, $LP_{threshold}$; Minimum performance difference, $D_{threshold}$; and Minimum similarity score, $S_{threshold}$. These thresholds help assess the three factors influencing the performance of the voting method.
Recall-based thresholds set the minimum percentage of true positives required for a tool to be classified as high- or low-performing. This approach helps to exclude tools that lack the capability to detect specific vulnerabilities, eliminating fake votes. The Precision-based threshold helps identify low-accuracy tools—those with a high false positive rate. The similarity threshold sets the minimum degree of overlap required for two tools to be considered similar. Detecting similarity between tools plays a crucial role in calibrating the voting method, reducing the likelihood of agreement on false positives. The minimum performance difference threshold aids in determining whether or not to remove a low-performing tool from the voting process. A tool is removed only if there is another tool that outperforms it by at least the threshold value. Excluding such low-performing tools improves the accuracy and reliability of the voting results, as it minimizes the influence of tools that may contribute to false positives.
Algorithm 1: Power-based Voting
[Algorithm 1 is provided as a figure in the original article.]
As shown in Figure 4, the algorithm returns the voting method $VM$ for the $m$ vulnerability classes after completing two primary steps:
  • Tool Roles Identification. This step examines and determines the role of each tool in the vulnerability detection process, which helps improve voting results. Each analysis tool may play one of three roles:
    • None: The tool is excluded since it cannot identify the vulnerability. The tool is excluded in two cases: (i) its recall and precision scores are both zero; (ii) the tool’s results are a subset of a better-performing tool’s findings. The $D_{threshold}$ was established at 50% to guarantee that low-performing tools are not eliminated unless better-performing tools are available.
    • Inverter: This contributes to adjusting the findings of another tool. An extremely low recall score—below the $LR_{threshold}$—implies that the majority of the tool’s flags are false positives. If the similarity rate between such a tool and a higher-performing one is large, the overlap is probably due to false positives. In this case, the poorly performing tool can assist in correcting (inverting) the false positives of the other tool, thus increasing its precision. To ensure that the tool’s true positives are close to zero, the $LR_{threshold}$ and $LP_{threshold}$ are set at 10% and 20%, respectively. The $S_{threshold}$ is set at 60%.
    • Voter: This tool is engaged in the voting process.
  • Voting Methods Identification. This step determines the appropriate voting method for each vulnerability class. For each class, voters are categorized as high- and low-performance tools based on their recall scores, with an $HR_{threshold}$ of 95% to ensure that the majority of positive samples are identified. The appropriate voting method is then determined as follows: (a) Majority: Used when all voters are high-performance tools, intended to reduce false positives while maintaining a high true positive rate; (b) AtLeastOne: Applied when all voters are low-performance tools, aimed at enhancing the true positive rate; and (c) Weight-based: Used when there is a combination of high- and low-performance tools. Majority voting is applied to high-performance tools, while AtLeastOne voting is applied to the others. The outputs of these two methods are then combined using OR logic to determine the final voting result.
After identifying the roles of each tool and selecting the appropriate voting mechanism for each vulnerability class, the findings of the tool paired with an inverter are adjusted: whenever both tools—the primary tool and its inverter—flag a sample as positive, the primary tool’s flag is inverted to negative. The voting procedure is then applied to each vulnerability class.
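The simplified sketch below mirrors the role-assignment and method-selection logic described above, using the thresholds quoted in the text (10%, 20%, 95%, 50%, and 60%); the data structures, helper names, and the approximation of the subset test via the overlap degree are illustrative assumptions, not a transcription of the published Algorithm 1.

LR_T, LP_T, HR_T, D_T, S_T = 0.10, 0.20, 0.95, 0.50, 0.60

def assign_role(tool: str,
                perf: dict[str, tuple[float, float]],
                sim: dict[tuple[str, str], float]) -> str:
    # perf maps tool -> (recall, precision); sim maps (tool, peer) -> overlap degree.
    recall, precision = perf[tool]
    if recall == 0.0 and precision == 0.0:
        return "none"        # case (i): cannot detect this class at all
    higher = [p for p in perf if p != tool and perf[p][0] > recall]
    if recall < LR_T and precision < LP_T and any(
            sim.get((tool, p), 0.0) >= S_T for p in higher):
        return "inverter"    # flags are mostly false positives shared with a stronger tool
    stronger = [p for p in higher if perf[p][0] - recall >= D_T]
    if any(sim.get((tool, p), 0.0) >= S_T for p in stronger):
        return "none"        # case (ii): subsumed by a much better-performing tool
    return "voter"

def choose_method(voters: list[str], perf: dict[str, tuple[float, float]]) -> str:
    recalls = [perf[t][0] for t in voters]
    if all(r >= HR_T for r in recalls):
        return "majority"        # all voters are high-performance
    if all(r < HR_T for r in recalls):
        return "at_least_one"    # all voters are low-performance
    return "weight_based"        # Majority over strong voters OR AtLeastOne over the rest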

4.4. Evaluation Scores Plotter

This module facilitates the interpretation and comparison of evaluation scores for analysis tools by generating graphical charts. The Plotter module includes the following two components:
  • Preparer: This component reads and filters the evaluation data based on user requirements, applying the same filters as the Evaluator module—Tools, Base, and Fairness. The prepared data is then passed to the Plotter.
  • Plotter: This component takes two main inputs: (1) plotting style and (2) performance scores. The plotting style specifies whether to display the scores of a single tool or a group of tools, and whether or not to present the tool’s performance across various datasets. By processing these inputs, the Plotter generates graphical charts representing the evaluation metrics of the tools.
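A minimal sketch of the plotting step is shown below, assuming per-tool scores arrive in a pandas DataFrame with illustrative column names "tool", "metric", and "score".

import matplotlib.pyplot as plt
import pandas as pd

def plot_scores(scores: pd.DataFrame, metric: str = "recall") -> None:
    # Bar chart of one performance metric across the selected tools.
    subset = scores[scores["metric"] == metric]
    plt.bar(subset["tool"], subset["score"])
    plt.ylabel(metric.capitalize())
    plt.title(f"Analysis tool {metric} scores")
    plt.show()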

5. Research Method and Design

This section introduces the evaluation study that was conducted to assess the MultiTagging framework’s functionality. The study aims to investigate the performance of six SC analysis tools and three voting techniques in identifying SC vulnerabilities.

5.1. Goal

The goal of the evaluation study is built from the Goal-Question-Metric (GQM) template as follows: evaluate six SC analysis tools, namely MAIAN, Mythril, Semgrep, Slither, Solhint, and VeriSmart, as well as three voting methods, namely AtLeastOne, Majority, and Power-based voting, for the purpose of SC vulnerabilities multi-tagging with respect to their detection performance measures from the perspective of both researchers and security analysts within the context of Ethereum SCs data.

5.2. Research Questions

The study aims to investigate the following three research questions:
  • RQ 1: What is the best SC analysis tool for identifying each SC vulnerability in terms of precision and recall scores?
  • RQ 2: To what extent are the investigated analysis tools comparable in terms of SC vulnerability detection?
  • RQ 3: To what extent will voting methods offer an increase in multi-tagging coverage when used to identify various SC vulnerabilities?

5.3. Benchmark

We drew on the findings of Di Angelo et al.’s study [29] in selecting datasets for our investigation. Di Angelo et al. collected and consolidated 13 publicly available Ethereum SC vulnerability datasets, standardizing label names across datasets and addressing overlaps and conflicts. The datasets are classified into two groups: (1) Wild sets, containing deployed SCs on either the main or the test chains; (2) Crafted sets, containing SCs designed to exemplify common vulnerabilities or those with source code intentionally injected with bugs. We used only seven of these datasets, as presented in Table 4, excluding the others due to conflicts and errors identified in Di Angelo et al.’s investigation [29]. We further refined the selected datasets as follows: (1) removed all samples lacking a declared Solidity compiler version or using a version prior to 0.4.0, as analysis tools cannot compile these samples; (2) examined and adjusted the mapping of SWC codes to DASP classes (the modifications made are available online [47], see the “CGT_Update” sheet) to align with the classification given in Figure 2; (3) converted the dataset into a multi-label format; (4) removed duplicates from the final dataset. Figure 5 shows the final size and pattern frequency for each dataset (the final set of the selected benchmarks is available in the framework repository [48]).
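The refinement steps can be sketched in pandas as follows, assuming one input row per (contract, SWC tag) pair with illustrative column names and a clean "solc_version" column; deduplication is applied before pivoting for convenience.

import pandas as pd

def refine(df: pd.DataFrame, swc_to_dasp: dict[str, int]) -> pd.DataFrame:
    # (1) Drop samples without a declared compiler version or older than 0.4.0.
    df = df.dropna(subset=["solc_version"])
    df = df[df["solc_version"].map(lambda v: tuple(map(int, v.split("."))) >= (0, 4, 0))]
    # (2) Re-map SWC codes to DASP classes using the adjusted taxonomy (Figure 2).
    df = df.assign(dasp=df["swc"].map(swc_to_dasp))
    # (4) Remove duplicate entries.
    df = df.drop_duplicates()
    # (3) Convert to a multi-label matrix: one row per contract, one 0/1 column per DASP class.
    return pd.crosstab(df["address"], df["dasp"]).clip(upper=1)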

5.4. Analysis Tools

The literature provides a range of analysis tools for identifying SC vulnerabilities [10,19,39]. However, some tools have constraints that make them unsuitable in certain scenarios. For instance, the Securify tool [52] does not support Solidity versions prior to 0.5.8. ConFuzzius [53], IR-Fuzz [54], and MuFuzz only support Solidity version 0.4.26. The SmartCheck tool [55] does not support Solidity versions after 0.6.0. The Manticore tool [56] does not support all EVM opcodes and may fail for Solidity versions other than 0.4.x. The Osiris tool [57] can detect only integer-related bugs.
We defined five criteria to select suitable analysis tools for our investigation: (1) Accessibility: The tool should be open-source and support a command-line interface (CLI); (2) Compatibility: The tool should support a wide range of Solidity versions; (3) Simplicity: The tool should perform analysis using only the SC source code as input; (4) Coverage: The tool should be capable of detecting multiple vulnerabilities; (5) Documentation: Adequate documentation should be available to facilitate tool operation. Based on these criteria, we selected six tools from two categories: static analysis and dynamic analysis. The static analysis tools are Slither, Solhint, and Semgrep, while the dynamic analysis tools are MAIAN, Mythril, and VeriSmart.
  • MAIAN [6] is an open-source, Python-based dynamic analysis tool, developed collaboratively by researchers from the National University of Singapore and University College London and launched in 2018. MAIAN takes SC bytecode as input and analyzes it with a custom-built EVM, running multiple symbolic execution traces until it discovers one that meets a predefined set of properties. MAIAN uses the Z3 solver [58] to produce concrete values for symbolic inputs. If an SC is flagged as positive—meaning a trace is found—MAIAN performs a validation step to reduce the false positive rate. It deploys the SC on a private Ethereum blockchain network to validate the detected properties. MAIAN considers three kinds of vulnerable SCs that violate either safety or liveness properties: (1) Suicidal contracts; (2) Prodigal contracts; (3) Greedy contracts.
  • Mythril [23] is an open-source, Python-based dynamic analysis tool developed by the ConsenSys team and launched in 2017. Mythril uses the Z3 solver [58], a symbolic virtual machine (SVM) called LASER [59], and a control-flow graph to detect a variety of SC vulnerabilities. It accepts bytecode as input and employs concolic execution for in-depth analysis.
  • Semgrep [60] is an open-source, lightweight static analysis tool written in Python 3 and launched in 2020. It supports a variety of programming languages, including Solidity, which was added in December 2021. It was developed by Semgrep, a cybersecurity company founded in 2017. Semgrep scans SC codes to detect vulnerability patterns and style violations using predefined or user-defined custom rules. These rules are written in YAML. Each rule contains metadata, conditions, and actions that instruct the analyzer to perform specific actions when certain conditions are met.
  • Slither [61] is an open-source static analysis framework written in Python 3, developed by the Trail of Bits team and launched in 2018. Slither accepts the Solidity Abstract Syntax Tree (AST) as input, which is generated by the Solidity compiler from SC code. It first extracts information from the AST, including the SC’s inheritance graph, control-flow graph, and list of expressions. Next, it converts the SC code into an internal representation language called SlithIR, which uses the Static Single Assignment (SSA) [62] form to facilitate code analysis computations. Slither can be utilized to identify SC vulnerabilities or to optimize and understand SC code. The latest version of Slither [63] includes more than 90 detectors.
  • Solhint [18,19] is an open-source static analysis tool developed in JavaScript and launched in 2017. It uses predefined patterns and rules to detect code security vulnerabilities. It employs an ANTLR4-based Solidity parser. Solhint also provides recommendations on style and best coding practices. It is customizable, allowing users to modify existing rules or add new ones.
  • VeriSmart [64] is an OCaml-based, open-source, dynamic analysis tool introduced in 2020 by the Software Analysis Lab at Korea University. Like MAIAN and Mythril, VeriSmart uses the Z3 solver [58] but performs domain-specific preprocessing and optimization before employing it. It automatically generates contract assertion statements, using a Counterexample-Guided Inductive Synthesis (CEGIS) verification method that iteratively searches for the hidden invariants necessary to validate safety properties. VeriSmart consists of two main components: a generator and a validator. The generator produces candidate invariants, which the validator then uses to prove or disprove assertion safety. The validator flags unproven assertions, prompting the generator to produce new invariants. This process repeats until the contract is verified as safe or the time budget is exhausted.
Table 5 provides the source and version of each analysis tool implemented in our study. Each tool has a set of parameters whose adjustment can influence its results. With the exception of Mythril, all tools were used with their default settings. Mythril’s default execution timeout of 60 min per contract is significantly longer than that of comparable tools, so we reduced it to 5 min to align with VeriSmart.

5.5. Evaluation Measures

After the SC analysis is completed, the tool generates a report detailing the state of the examined SC and any detected vulnerabilities. The Tagger module of MultiTagging is then used to parse the reports and map the extracted tags to standardized vulnerability tags, i.e., SWC codes and DASP ranks. Next, the Evaluator module calculates four prediction parameters by comparing the tool’s predicted tags to the actual tags in the benchmark datasets: (1) True Positive (TP), representing the number of samples correctly classified as containing vulnerability X; (2) False Positive (FP), representing the number of samples incorrectly classified as containing vulnerability X; (3) True Negative (TN), representing the number of samples correctly classified as safe; (4) False Negative (FN), representing the number of samples incorrectly classified as safe. These parameters are then used to calculate additional performance metrics.
In this study, we apply the commonly used metrics, as indicated in Table 3, to assess the performance of SC analysis tools, specifically precision and recall. We also use the coverage, average analysis time, and failure rate measures to estimate the tool’s efficiency. Finally, we investigate the relation between analysis tools using the overlap degree measure [20]. The evaluation metrics used are as follows:
  • Average Analysis Time ($AAT$): This measures the average analysis time of the analysis tool. It can be computed using Formula (1).
    $AAT = \frac{\sum_{i=1}^{n} AT_i}{n}$,    (1)
    where $AT_i$ represents the time the tool takes to analyze an SC, and $n$ denotes the total number of samples.
  • Failure Rate ( F R ): This measures the failure rate of the analysis tool. The lower the failure rate, the more robust the tool. It can be computed using Formula (2).
    F R = F S n · 100 ,
    where F S represents the number of samples that the analysis tool failed to process, while n denotes the total number of samples.
  • Coverage: This shows the proportion of unique vulnerabilities correctly reported by the analysis tool when applied to the benchmark. It is computed using Formula (3).
    $Coverage = \frac{DV}{TV} \cdot 100$,    (3)
    where $DV$ represents the number of vulnerability types detectable by the tool, and $TV$ denotes the total number of vulnerability types in the benchmark.
  • Precision: This is the proportion of samples classified as positive that are truly positive. It is computed using Formula (4).
    $Precision = \frac{TP}{FP + TP}$    (4)
  • Recall: This evaluates the analysis tool’s capacity to identify positive samples and is computed using Formula (5).
    $Recall = \frac{TP}{FN + TP}$    (5)
  • Overlap degree: This computes the agreement degree among analysis tools in terms of judgments. It was proposed by Di Angelo et al. [20]. For tool $t$, let $D(t)$ be the set of DASP classes that $t$ can identify, and let $F(t, v)$ be the set of positive samples flagged by tool $t$ as containing the vulnerability $v$. The overlap between $t_1$ and $t_2$ can be computed by Formula (6). The numerator is the total number of samples flagged by both tools for all vulnerabilities common to the two tools, whereas the denominator is the number of samples flagged by $t_1$. This metric is asymmetric, meaning that $Overlap(t_1, t_2)$ is not necessarily equal to $Overlap(t_2, t_1)$.
    $Overlap(t_1, t_2) = \frac{\sum_{v \in D(t_1) \cap D(t_2)} |F(t_1, v) \cap F(t_2, v)|}{\sum_{v \in D(t_1) \cap D(t_2)} |F(t_1, v)|}$    (6)
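For reference, Formulas (1)–(6) translate directly into the following Python helpers; the flag sets for the overlap degree are assumed to be sets of sample identifiers per (tool, vulnerability class).

def average_analysis_time(times: list[float]) -> float:
    return sum(times) / len(times)                      # Formula (1)

def failure_rate(failed: int, n: int) -> float:
    return failed / n * 100                             # Formula (2)

def coverage(detected_types: int, total_types: int) -> float:
    return detected_types / total_types * 100           # Formula (3)

def precision(tp: int, fp: int) -> float:
    return tp / (fp + tp) if fp + tp else 0.0           # Formula (4)

def recall(tp: int, fn: int) -> float:
    return tp / (fn + tp) if fn + tp else 0.0           # Formula (5)

def overlap(flags_t1: dict[str, set], flags_t2: dict[str, set]) -> float:
    # Formula (6): asymmetric overlap of t1's flags with t2's over their shared classes.
    common = set(flags_t1) & set(flags_t2)
    both = sum(len(flags_t1[v] & flags_t2[v]) for v in common)
    t1_total = sum(len(flags_t1[v]) for v in common)
    return both / t1_total if t1_total else 0.0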

5.6. Execution Environment

The majority of our experimental investigation was conducted using our open-source MultiTagging framework, introduced in Section 4. Two different setups were used for the study experiments: SC analysis tools were run on Ubuntu 22.04 with an Intel i9 processor and 32 GB of RAM, while the MultiTagging framework was executed on macOS with an Apple M2 Max and 32 GB of RAM.

6. Results and Discussion

This section summarizes our observations on the effectiveness of SC labeling approaches. It discusses the performance of the examined analysis tools in identifying eight SC vulnerability classes from the DASP Top 10. Next, it presents our findings on using voting techniques to identify SC vulnerabilities. Finally, it highlights the value of the MultiTagging framework in addressing challenges related to SC vulnerability labeling.

6.1. Individual-Based Labeling

This section discusses our investigation’s findings on six SC analysis tools, MAIAN, Mythril, Semgrep, Slither, Solhint, and VeriSmart, from three perspectives: (1) efficiency; (2) detection performance; (3) similarity.

6.1.1. Analysis Tools Efficiency

The detection mechanism of each tool influences its analysis time. Figure 6 reveals that static analysis tools—Solhint, Slither, and Semgrep—have the lowest average analysis times compared to dynamic analysis tools—MAIAN, Mythril, and VeriSmart. This outcome was expected, as dynamic tools execute the SC with various inputs to identify vulnerabilities. VeriSmart’s high average analysis time results from its approach, which continues seeking new invariants until the contract is proven safe or the time budget is exhausted. For MAIAN, added time comes from a verification phase, where flagged SCs are deployed on a private blockchain to confirm vulnerabilities. The in-depth analysis of Mythril—with a default maximum recursion depth of 22—contributes to its longer analysis time. Analysis time is critical, especially when a large number of samples need to be analyzed. However, it is influenced by several factors, including SC complexity, lines of code, and the capacity of the execution environment.
Figure 6 demonstrates that Solhint and Semgrep have the lowest failure rates, while VeriSmart has the highest failure rate, followed by Mythril. Compilation issues and execution timeouts were the most common causes of these failures. VeriSmart and Mythril do not support Solidity versions prior to 0.4.13 and 0.4.11, respectively. The execution timeout for Mythril was set to match that of VeriSmart, which has a default execution timeout of five minutes (300 s) [65]. Failures occur when the time budget is exhausted before the contract analysis is completed. Table 6 presents the number of samples successfully analyzed by each tool. To ensure a fair evaluation, we included only the common samples that all tested tools were able to analyze, resulting in 645 samples used in all subsequent evaluations.
Examining the vulnerability classes detected by each tool, we found that Mythril, Slither, and Solhint are the most capable tools. Mythril identified positive samples from eight classes, followed by Slither and Solhint, which detected positive samples from seven and six classes, respectively. In contrast, MAIAN, Semgrep, and VeriSmart have the lowest coverage, each addressing only two classes. Figure 6 highlights our findings on the coverage of analysis tools. It shows that all tools can identify the Access Control class. Four tools can detect the Reentrancy, Arithmetic, DoS, and Bad Randomness classes, while three can identify Unchecked Return Values and Time Manipulation classes. Only Mythril can detect the Front Running class.

6.1.2. Analysis Tools Performance

This section examines the performance of analysis tools in detecting SC vulnerabilities (i.e., positive samples). Figure 7 shows two performance metrics for each tool: recall and precision. Table 7 provides a comprehensive summary of the performance of all analysis tools investigated in this study. The results are analyzed and interpreted in the following.
Only four tools—Mythril, Semgrep, Slither, and Solhint—have detectors capable of identifying Reentrancy samples. Slither was more adept at recognizing Reentrancy patterns, detecting all positive samples in the benchmark with a recall of 1.00. Mythril came in second, whereas Semgrep performed poorly. In terms of precision scores, all tools were inaccurate, as they categorized many negative samples as positive.
All evaluated analysis tools can detect Access Control positive samples with varying degrees of accuracy. Solhint, Mythril, Slither, and VeriSmart were the most capable, with recall scores of 0.83, 0.82, 0.79, and 0.75, respectively. However, their precision scores reveal that all four tools produced a high proportion of false positives. Semgrep achieved a precision score of 1.00, indicating accuracy in minimizing false positives; however, it struggled to detect positive samples. MAIAN detected 12 of 76 positive samples, outperforming Semgrep in recall but falling short in precision.
Although four tools—Mythril, Semgrep, Slither, and VeriSmart—can identify Arithmetic issues, only Mythril and VeriSmart detected a substantial number of positive samples from the benchmark. VeriSmart outperformed Mythril in identifying positive samples, with a recall rate of 0.98; however, it also exhibited a high false positive rate. Slither performed as expected—it has only one detector, “Division before multiplication”, within the Arithmetic class—and therefore cannot identify additional patterns such as “Integer Overflow and Underflow”.
Slither and Solhint performed comparably in detecting the Unchecked Return Values class, achieving recall scores of 0.77 and 0.75, respectively. However, both produced a large number of false positives, with precision scores of 0.45 and 0.41, respectively. Mythril identified about 58% of positive samples from this class. Similar findings were observed in detecting the Time Manipulation class. Slither and Solhint outperformed Mythril, detecting almost all positive samples with recall scores of 0.97, while Mythril identified 80%. However, all tools exhibited a high false positive rate.
The DoS and Bad Randomness classes are the most challenging for analysis tools to identify. Slither was the most effective in detecting positive DoS samples, with a recall of 0.63. Solhint detected a quarter of the positive samples, with a recall of 0.24, while Mythril detected a third and MAIAN identified only one positive sample. All tools performed poorly in terms of precision scores. Three tools—Mythril, Slither, and Solhint—were able to detect positive samples from the Bad Randomness class, each achieving a recall score of 0.25. The Bad Randomness class comprises only four samples in the benchmark, a significant imbalance likely affecting performance. Only Mythril was able to identify positive Front Running samples, with a recall of 0.74.
In summary, the performance of analysis tools varies with the vulnerability type, making it impossible to rely on a single tool to detect all vulnerabilities with high accuracy. Reducing false alarms is challenging for several reasons, including the possibility of overlap between some vulnerability types that require manual investigation to resolve. Furthermore, the tools cannot fully interpret contract logic, leading to some false positives. In the SC domain, however, it is crucial to eliminate false negatives to the greatest extent feasible, with a focus on achieving high recall scores. False negatives in an SC can have serious consequences, as exploited vulnerabilities often lead to the loss of Ether (i.e., money) [33].

6.1.3. Similarity

We calculated the degree of overlap between the analysis tools, as shown in Figure 8, to examine differences in tool judgments. Each row represents the baseline tool, t1, compared to the tool in the column, t2. Dark cells indicate a high overlap degree, and vice versa. The greater the overlap between two tools, the more similar they are in sample classification; nevertheless, similarity does not guarantee classification accuracy. A tool with very high precision but very low recall flags only a small number of samples as positive. Since the overlap metric considers only flagged samples (i.e., positive instances), the overlap degree is expected to be low when either tool flags few samples, which explains the lack of overlap between MAIAN and Semgrep. Figure 8 also shows zero overlap between Semgrep and VeriSmart in both directions. The only class they can both detect is Access Control, and their performance scores in Table 7 confirm that their flags do not coincide, indicating that they may complement one another.
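The overlap measure itself can be expressed compactly. The sketch below assumes each tool's verdicts have already been reduced to a set of flagged sample IDs; the flag sets shown are hypothetical and only illustrate the asymmetry of the metric.

```python
def overlap_degree(flags_t1: set, flags_t2: set) -> float:
    """Share of samples flagged by the baseline tool t1 that t2 also flagged."""
    if not flags_t1:
        return 0.0
    return len(flags_t1 & flags_t2) / len(flags_t1)

flags = {  # hypothetical flag sets per tool
    "Slither": {"sc1", "sc2", "sc3", "sc4"},
    "Solhint": {"sc2", "sc3", "sc5"},
}

# The measure is asymmetric: each direction uses a different baseline.
print(f"Solhint -> Slither: {overlap_degree(flags['Solhint'], flags['Slither']):.2%}")  # 66.67%
print(f"Slither -> Solhint: {overlap_degree(flags['Slither'], flags['Solhint']):.2%}")  # 50.00%
```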
Another observation from Figure 8 is the high degree of overlap between Slither and Solhint. The overlap degree from Solhint (row) to Slither (column) reveals that 78.05% of the samples flagged by Solhint were also flagged by Slither, with the remainder labeled differently. In contrast, the overlap degree from Slither (row) to Solhint (column) shows that only 69.86% of the samples flagged by Slither were also flagged by Solhint. This asymmetry reflects Slither’s larger number of flags and its higher false positive rate. Approximately 90.91% of the samples flagged by Semgrep were also flagged by Slither, suggesting that Semgrep’s flags are nearly a subset of Slither’s. Likewise, the samples flagged by MAIAN represent a subset of those flagged by VeriSmart.
To further investigate tool similarities, we computed the degree of overlap between tools for each vulnerability, as shown in Figure 9. Four tools—Mythril, Semgrep, Slither, and Solhint—flagged some Reentrancy samples. Slither exhibited high overlap degrees with Semgrep and Solhint, at 98.80% and 77.78%, respectively. Conversely, the overlap degrees from Semgrep and Solhint to Slither were much lower, at 1.59% and 55.91%, respectively. Semgrep has a lower ratio of true positives than Slither, with recall scores of 0.02 and 1.00, respectively, implying that most of their common flags are false positives. Therefore, excluding common Reentrancy flags between Semgrep and Slither may help reduce Slither’s false positives.
It is evident from Table 7 that Mythril, Solhint, and Slither are the most effective tools for detecting positive samples of the Access Control class. However, the degrees of overlap among these tools vary, reflecting differences in the samples they flag. Combining tools can help detect more positive samples, but it may also increase false positives. The zero overlap degree between the pairs (MAIAN, Semgrep) and (Semgrep, VeriSmart) indicates that their positive flags do not match, making them complementary to each other. Flags identified by Semgrep are a subset of those flagged by Mythril and Slither, while those identified by MAIAN are a subset of VeriSmart’s. Mythril and VeriSmart are more sensitive than Slither to Arithmetic positive samples, with recalls of 0.82, 0.98, and 0.09, respectively. VeriSmart has a substantial overlap with Mythril and Slither, sharing 80.26% and 93.31% of their flags, respectively. In contrast, Slither identified 29.77% of VeriSmart’s flags, while Mythril identified 39.32%. In this case, using all tools could reduce detection accuracy due to an increase in false positives.
Slither and Solhint exhibit significant overlap in identifying positive samples for the Unchecked Return Values and Time Manipulation classes, with overlap degrees exceeding 79%. Samples flagged by Mythril are approximately a subset of those flagged by Slither and Solhint. Slither and Solhint appear to perform similarly in identifying both classes; however, the high overlap degrees may suggest a high false positive rate for these tools, which can be verified by comparing their overlap degrees with their false positive rates. The tools showed low overlap degrees for the DoS class. Bad Randomness flags identified by Semgrep represent a subset of those flagged by Solhint, while Slither flags approximately 80% of the samples labeled as Bad Randomness by both Semgrep and Solhint. Since Mythril is the only tool capable of identifying the Front Running class, there is no overlap between tools for this class.

6.2. Vote-Based Labeling

Implementing multiple analysis tools can improve the coverage rate of SC vulnerability detection. However, analysis tools vary in sensitivity. Voting mechanisms are commonly used to resolve differences in tool judgments regarding the presence of a vulnerability [16,17,28,34]. This section examines the effectiveness of three voting methods: AtLeastOne, Majority, and Power-based voting.
Figure 7 demonstrates that the AtLeastOne voting method outperforms the Majority voting method in coverage rate, achieving a high recall across all vulnerability classes. The AtLeastOne method requires agreement from at least one tool to confirm the presence of a vulnerability. By adopting a zero-tolerance approach, this method increases the vulnerability detection rate and thereby enhances the SC’s security level—though it also results in a high false positive rate. In contrast, the detection rate of the majority voting method is strongly influenced by the individual performance of the voters (i.e., analysis tools). To indicate a vulnerability, the Majority voting method requires agreement from at least half of the tools, helping to limit false positives. However, achieving accurate results remains challenging without high-precision tools.
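The two threshold-based methods can be captured in a few lines. The sketch below is a minimal illustration assuming each voter's verdict for a single sample has already been collected as a Boolean; the tool names and verdicts are illustrative only.

```python
from math import ceil

def at_least_one(votes: dict) -> bool:
    """Flag the sample if any voter flags it (zero-tolerance)."""
    return any(votes.values())

def majority(votes: dict) -> bool:
    """Flag the sample if at least half of the voters flag it."""
    return sum(votes.values()) >= ceil(len(votes) / 2)

votes = {"Mythril": True, "Slither": False, "Solhint": False}
print(at_least_one(votes))  # True
print(majority(votes))      # False
```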
Examining the tools’ overlap degrees assists in determining the optimal voting method. The AtLeastOne voting method is ideal for tools with varying or low recall and minimal overlap—such as in the Access Control, DoS, and Bad Randomness classes—yielding high recall scores, as it captures the union of positive samples flagged by the voting tools. In contrast, the Majority method results in lower recall scores. For instance, Mythril, Slither, and Solhint detected positive samples in the Bad Randomness class; however, Figure 9 indicates minimal overlap among them. As illustrated in Figure 7, applying the Majority voting method significantly reduced the coverage rate. The Majority voting method is optimal when tools exhibit high overlap and comparably high performance, yielding better precision scores than the AtLeastOne voting method. For example, in the Time Manipulation class (Figure 7), it helped reduce false positives, whereas the AtLeastOne voting strategy led to an increase in false positives.
In SC vulnerability detection, recognizing positive samples is often prioritized over labeling accuracy, meaning recall is more critical than precision. The AtLeastOne voting method is particularly useful for the Reentrancy, Access Control, Arithmetic, Unchecked Return Values, DoS, and Bad Randomness classes, as it yields higher recall than the Majority voting method. For the Time Manipulation class, Majority voting proves more beneficial than AtLeastOne, providing comparable recall with improved precision. Since only one tool participates in voting for the Front Running class, all voting methods perform identically. The Power-based voting method, however, is applicable across all vulnerability classes.
The two phases of the Power-based voting method enhanced voting outcomes. The first phase was designed to prevent voters with low recall from participating in the voting process, which could otherwise compromise the precision. The second phase helped determine the most appropriate voting strategy for each vulnerability class. Table 8 summarizes the outcome of implementing these two phases. It indicates that Semgrep is ineffective for identifying the Reentrancy and Bad Randomness classes, while MAIAN is inadequate for detecting the DoS class. However, these excluded voters can improve the performance of other tools by acting as inverters, correcting the false positive flags of certain tools. Table 8 also shows that the AtLeastOne voting method is appropriate for identifying all vulnerability classes, except for the Time Manipulation class.
Table 7 demonstrates that, compared to the pure AtLeastOne voting method, incorporating the inverter reduced false positives of the Power-based voting method for three vulnerability classes: Reentrancy, DoS, and Bad Randomness. The improvement percentage is estimated based on the number of false positives shared between the two tools. However, the overall accuracy of the voting method remains influenced by the accuracy of the remaining voters.
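Putting the two phases together, the sketch below reconstructs how a Power-based vote could be evaluated for one sample, using an excerpt of the roles in Table 8. The inverter semantics shown here (a flag raised by a target tool is discarded when the inverter flags the same sample) are our reading of the description above, and the data structures are illustrative assumptions rather than the framework's actual implementation.

```python
from math import ceil

ROLES = {  # per-class voter roles and strategies (excerpt of Table 8)
    "Reentrancy": {
        "voters": ["Mythril", "Slither", "Solhint"],
        "inverters": {"Semgrep": ["Slither"]},
        "strategy": "AtLeastOne",
    },
    "Time Manipulation": {
        "voters": ["Mythril", "Slither", "Solhint"],
        "inverters": {},
        "strategy": "Majority",
    },
}

def power_based_vote(vuln_class: str, flags: dict) -> bool:
    """flags: tool name -> True if the tool flagged this sample for vuln_class."""
    cfg = ROLES[vuln_class]
    corrected = dict(flags)
    # Phase 1: an inverter cancels the flags of its target tools on shared samples.
    for inverter, targets in cfg["inverters"].items():
        if flags.get(inverter, False):
            for target in targets:
                corrected[target] = False
    votes = [corrected.get(tool, False) for tool in cfg["voters"]]
    # Phase 2: apply the vote strategy selected for this class.
    if cfg["strategy"] == "AtLeastOne":
        return any(votes)
    return sum(votes) >= ceil(len(votes) / 2)

# Slither's Reentrancy flag is discarded because Semgrep flagged the same sample.
sample_flags = {"Mythril": False, "Slither": True, "Solhint": False, "Semgrep": True}
print(power_based_vote("Reentrancy", sample_flags))  # False
```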

6.3. MultiTagging Effectiveness

The MultiTagging framework addressed critical challenges in SC vulnerability labeling, achieving notable improvements in accuracy, consistency, and comparability across multiple analysis tools. First, the framework’s parser mechanism automated the extraction of vulnerability tags from SC analysis tool reports, effectively resolving the time-consuming and error-prone task of manual parsing. This automation ensured consistent, standardized, and accurate tag extraction, accommodating the diverse output formats of various tools and streamlining the analysis workflow.
To overcome inconsistent labeling across analysis tools, MultiTagging employed a mapper approach that aligned tool-specific vulnerability tags with standard labels, such as SWC codes and DASP ranks. Supported by a public Vulnerability Map Registry, this approach unified tool-specific labels, enabling reliable cross-tool comparisons. Additionally, the Power-based voting method within MultiTagging provided a systematic approach to vote-based labeling, dynamically assigning roles to each tool based on overlap degree and performance. This strategy enhanced labeling accuracy by minimizing false positives and ensured that high-performing tools contributed effectively to the labeling process.
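As an illustration of this parse-and-map flow, the sketch below translates one tool finding into SWC and DASP labels. The registry excerpt, the simplified report structure, and the function map_tags are assumptions introduced for this example; the complete Vulnerability Map Registry is published with the framework [47].

```python
import json

# Excerpt-style registry: (tool, tool-specific tag) -> (SWC code, DASP rank)
REGISTRY = {
    ("Slither", "reentrancy-eth"): ("SWC-107", 1),
    ("Mythril", "Integer Arithmetic Bugs"): ("SWC-101", 3),
}

def map_tags(tool: str, report_json: str) -> list:
    """Parse a (simplified) tool report and translate its tags to (SWC, DASP) labels."""
    findings = json.loads(report_json)
    labels = []
    for finding in findings:
        key = (tool, finding["check"])
        if key in REGISTRY:
            labels.append(REGISTRY[key])
    return labels

report = '[{"check": "reentrancy-eth", "contract": "Wallet"}]'
print(map_tags("Slither", report))  # [('SWC-107', 1)]
```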

7. Threats to Validity

The potential threat to external validity relates to the dataset’s size and the diversity of patterns within each vulnerability class. To mitigate this threat, we combined seven different datasets to construct a large and diverse dataset. Another threat is the potential for inconsistency in labeling similar vulnerabilities across datasets, arising from the lack of a standardized taxonomy that maps SC vulnerabilities to common labels (e.g., SWC codes or DASP classes). To counter this threat, we utilized the revised datasets provided in [29], in which labels were standardized and conflicts were resolved. Additionally, we developed a new, comprehensive taxonomy addressing gaps identified in prior work. We then verified the revised datasets to address any inconsistencies. However, external validity remains threatened by the continual evolution of vulnerability patterns—an area that necessitates ongoing assessment.
Internal validity may be compromised if a tool fails to generate analysis reports due to an execution delay or unsupported Solidity version—this could lead to contracts being arbitrarily labeled as safe or vulnerable, indirectly influencing performance scores. To mitigate this threat, we ran all tools across the entire dataset. For accurate assessment and comparison, we examined each tool’s output and selected only common samples that all tools were able to analyze.
Each analysis tool has a set of parameters, adjustments to which could influence tool results, thus posing a threat to construct validity. In this study, we retained default settings for all tools except Mythril, where we reduced the execution timeout to match the other tools. Examining the impact of parameter settings would require extensive time and computational resources; thus, it is left as potential future work. Another construct validity concern relates to tool versions, as tools are often updated to improve performance. To address this threat and prevent bias, we implemented the latest versions of each tool and documented them in this study. To enhance reproducibility, we have made the MultiTagging framework publicly available, along with a replication package for the evaluation study presented in this article.

8. Conclusions and Future Work

In this article, we addressed key challenges in the SC vulnerability identification domain by introducing MultiTagging—a modular SC multi-labeling framework. We highlighted its core features, including the Power-based voting method, which optimizes the voting process by factoring in voter performance and inter-tool relationships. To tackle the issue of inconsistent vulnerability labeling, we proposed a new SC vulnerability taxonomy that maps SWC codes to the DASP Top 10. Our empirical evaluation of six analysis tools demonstrated the effectiveness of the framework. The proposed tagging mechanism enabled accurate, automated parsing and mapping of tool tags to SWC and DASP labels. Evaluation metrics revealed that Slither excels in detecting Reentrancy and DoS vulnerabilities, achieving recalls of 1.00 and 0.63, respectively, while VeriSmart is most effective for the Arithmetic class, with a recall of 0.98. Slither and Solhint performed best on Unchecked Return Values and Time Manipulation, each with recalls exceeding 0.75. For Bad Randomness, Mythril, Slither, and Solhint showed comparable performance, with a recall of 0.25. Mythril was the only tool identifying Front Running, with a recall of 0.74. High-recall tools frequently detected vulnerabilities flagged by low-recall tools, resulting in notable overlap—e.g., Slither overlapped with Semgrep and Solhint by 77.78% and 98.80% in Reentrancy, while VeriSmart and Slither overlapped by 93.31% in Arithmetic. Notably, Power-based voting proved more effective than pure threshold-based voting across all vulnerability classes.
The MultiTagging framework advances empirical research focused on assessing the effectiveness of analysis tools in identifying SC vulnerabilities and supports precise, efficient sample labeling for dataset creation. Looking forward, the framework could be extended to accommodate additional analysis tools. Its flexible design enables modules to be adapted for any SC analysis tool with minimal adjustments—such as updating the public Vulnerability Map Registry to incorporate new tool mappings and defining indicator keywords for the Parser to extract tool-specific tags. Additionally, optimizing configuration parameters for dynamic analysis tools, such as execution timeout, may further enhance labeling accuracy. Future research could investigate the impact of such parameters on SC vulnerability labeling accuracy. Although SC vulnerabilities can be labeled using the combined votes of multiple tools, the accuracy of these voting-based labels requires further validation, as they may produce suboptimal results. Thus, future studies could explore the effects of various tool combinations on labeling accuracy.

Author Contributions

Conceptualization, S.J.A. and H.A.; methodology, S.J.A. and H.A.; software, S.J.A.; validation, S.J.A. and H.A.; formal analysis, S.J.A.; investigation, S.J.A. and H.A.; resources, S.J.A.; data curation, S.J.A.; writing—original draft preparation, S.J.A.; writing—review and editing, S.J.A. and H.A.; visualization, S.J.A.; supervision, H.A. and M.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

We used publicly available datasets, which are discussed in the Benchmark Section 5.3. We provide a complete replication package for our work at https://github.com/orgs/MultiTagging/repositories (accessed on 18 November 2024).

Acknowledgments

The authors would like to acknowledge the support of the King Fahd University of Petroleum and Minerals (KFUPM), Saudi Arabia, in the development of this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bocek, T.; Stiller, B. Smart contracts–blockchains in the wings. In Digital Marketplaces Unleashed; Springer: Berlin/Heidelberg, Germany, 2017; pp. 169–184. [Google Scholar]
  2. Choi, T.M.; Siqin, T. Blockchain in logistics and production from Blockchain 1.0 to Blockchain 5.0: An intra-inter-organizational framework. Transp. Res. Part E Logist. Transp. Rev. 2022, 160, 102653. [Google Scholar] [CrossRef]
  3. Buterin, V. A next-generation smart contract and decentralized application platform. White Pap. 2014, 3, 1–36. [Google Scholar]
  4. Zheng, Z.; Xie, S.; Dai, H.N.; Chen, W.; Chen, X.; Weng, J.; Imran, M. An overview on smart contracts: Challenges, advances and platforms. Future Gener. Comput. Syst. 2020, 105, 475–491. [Google Scholar] [CrossRef]
  5. Wang, S.; Yuan, Y.; Wang, X.; Li, J.; Qin, R.; Wang, F.Y. An overview of smart contract: Architecture, applications, and future trends. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; pp. 108–113. [Google Scholar]
  6. Nikolić, I.; Kolluri, A.; Sergey, I.; Saxena, P.; Hobor, A. Finding the greedy, prodigal, and suicidal contracts at scale. In Proceedings of the 34th Annual Computer Security Applications Conference, San Juan, PR, USA, 3–7 December 2018; pp. 653–663. [Google Scholar]
  7. Ibba, G.; Pierro, G.A.; Di Francesco, M. Evaluating machine-learning techniques for detecting smart ponzi schemes. In Proceedings of the 2021 IEEE/ACM 4th International Workshop on Emerging Trends in Software Engineering for Blockchain (WETSEB), Madrid, Spain, 31–31 May 2021; pp. 34–40. [Google Scholar]
  8. Bartoletti, M.; Carta, S.; Cimoli, T.; Saia, R. Dissecting Ponzi schemes on Ethereum: Identification, analysis, and impact. Future Gener. Comput. Syst. 2020, 102, 259–277. [Google Scholar] [CrossRef]
  9. Slowmist. 2024. Available online: https://hacked.slowmist.io/?c=ETH (accessed on 18 November 2024).
  10. Ivanov, N.; Li, C.; Yan, Q.; Sun, Z.; Cao, Z.; Luo, X. Security threat mitigation for smart contracts: A comprehensive survey. ACM Comput. Surv. 2023, 55, 1–37. [Google Scholar] [CrossRef]
  11. Jiang, F.; Chao, K.; Xiao, J.; Liu, Q.; Gu, K.; Wu, J.; Cao, Y. Enhancing smart-contract security through machine learning: A survey of approaches and techniques. Electronics 2023, 12, 2046. [Google Scholar] [CrossRef]
  12. Smart Contract Weakness Classification (SWC). 2020. Available online: https://swcregistry.io/ (accessed on 18 November 2024).
  13. Decentralized Application Security Project (DASP) Top 10. 2018. Available online: https://dasp.co/ (accessed on 18 November 2024).
  14. Parizi, R.M.; Dehghantanha, A.; Choo, K.K.R.; Singh, A. Empirical vulnerability analysis of automated smart contracts security testing on blockchains. In Proceedings of the 28th Annual International Conference on Computer Science and Software Engineering, Markham, ON, Canada, 29–31 October 2018; pp. 103–113. [Google Scholar]
  15. Durieux, T.; Ferreira, J.F.; Abreu, R.; Cruz, P. Empirical review of automated analysis tools on 47,587 ethereum smart contracts. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, Seoul, Republic of Korea, 27 June–19 July 2020; pp. 530–541. [Google Scholar]
  16. Zhang, P.; Xiao, F.; Luo, X. A framework and dataset for bugs in ethereum smart contracts. In Proceedings of the 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME), Adelaide, Australia, 28 September–2 October 2020; pp. 139–150. [Google Scholar]
  17. Ren, M.; Yin, Z.; Ma, F.; Xu, Z.; Jiang, Y.; Sun, C.; Li, H.; Cai, Y. Empirical evaluation of smart contract testing: What is the best choice? In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual, Denmark, 11–17 July 2021; pp. 566–579. [Google Scholar]
  18. Ji, S.; Kim, D.; Im, H. Evaluating countermeasures for verifying the integrity of Ethereum smart contract applications. IEEE Access 2021, 9, 90029–90042. [Google Scholar] [CrossRef]
  19. Kushwaha, S.S.; Joshi, S.; Singh, D.; Kaur, M.; Lee, H.N. Ethereum smart contract analysis tools: A systematic review. IEEE Access 2022, 10, 57037–57062. [Google Scholar] [CrossRef]
  20. Di Angelo, M.; Durieux, T.; Ferreira, J.F.; Salzer, G. Evolution of automated weakness detection in Ethereum bytecode: A comprehensive study. Empir. Softw. Eng. 2024, 29, 41. [Google Scholar] [CrossRef]
  21. SWC-Registry. Available online: https://github.com/SmartContractSecurity/SWC-registry (accessed on 18 November 2024).
  22. Rameder, H.; Di Angelo, M.; Salzer, G. Review of automated vulnerability analysis of smart contracts on Ethereum. Front. Blockchain 2022, 5, 814977. [Google Scholar] [CrossRef]
  23. Mueller, B. Smashing ethereum smart contracts for fun and real profit. HITB SECCONF Amst. 2018, 9, 4–17. [Google Scholar]
  24. NCC Group. Available online: https://www.nccgroup.com/us/ (accessed on 18 November 2024).
  25. Common Weakness Enumeration (CWE). 2024. Available online: https://cwe.mitre.org/index.html (accessed on 18 November 2024).
  26. EEA EthTrust Security Levels Specification Version 2. 2023. Available online: https://entethalliance.org/specs/ethtrust-sl/v2/ (accessed on 18 November 2024).
  27. Wang, S.; Ouyang, L.; Yuan, Y.; Ni, X.; Han, X.; Wang, F.Y. Blockchain-enabled smart contracts: Architecture, applications, and future trends. IEEE Trans. Syst. Man, Cybern. Syst. 2019, 49, 2266–2277. [Google Scholar] [CrossRef]
  28. Dia, B.; Ivaki, N.; Laranjeiro, N. An empirical evaluation of the effectiveness of smart contract verification tools. In Proceedings of the 2021 IEEE 26th Pacific Rim International Symposium on Dependable Computing (PRDC), Perth, Australia, 1–4 December 2021; pp. 17–26. [Google Scholar]
  29. di Angelo, M.; Salzer, G. Consolidation of Ground Truth Sets for Weakness Detection in Smart Contracts. In Proceedings of the Financial Cryptography and Data Security. FC 2023 International Workshops, Brač, Croatia, 5 May 2023; Essex, A., Matsuo, S., Kulyk, O., Gudgeon, L., Klages-Mundt, A., Perez, D., Werner, S., Bracciali, A., Goodell, G., Eds.; Springer: Cham, Switzerland, 2024; pp. 439–455. [Google Scholar]
  30. Chen, J.; Xia, X.; Lo, D.; Grundy, J.; Luo, X.; Chen, T. Defining smart contract defects on ethereum. IEEE Trans. Softw. Eng. 2020, 48, 327–345. [Google Scholar] [CrossRef]
  31. Di Angelo, M.; Salzer, G. A survey of tools for analyzing ethereum smart contracts. In Proceedings of the 2019 IEEE International Conference on Decentralized Applications and Infrastructures (DAPPCON), Newark, CA, USA, 4–9 April 2019; pp. 69–78. [Google Scholar]
  32. Leid, A.; van der Merwe, B.; Visser, W. Testing ethereum smart contracts: A comparison of symbolic analysis and fuzz testing tools. In Proceedings of the Conference of the South African Institute of Computer Scientists and Information Technologists 2020, Cape Town, South Africa, 14–16 September 2020; pp. 35–43. [Google Scholar]
  33. Ghaleb, A.; Pattabiraman, K. How effective are smart contract analysis tools? evaluating smart contract static analysis tools using bug injection. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, USA, 18–22 July 2020; pp. 415–427. [Google Scholar]
  34. Yashavant, C.S.; Kumar, S.; Karkare, A. Scrawld: A dataset of real world ethereum smart contracts labelled with vulnerabilities. arXiv 2022, arXiv:2202.11409. [Google Scholar]
  35. Ferreira, J.F.; Cruz, P.; Durieux, T.; Abreu, R. Smartbugs: A framework to analyze solidity smart contracts. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, Virtual Event, Australia, 21–25 December 2020; pp. 1349–1352. [Google Scholar]
  36. Di Angelo, M.; Durieux, T.; Ferreira, J.F.; Salzer, G. Smartbugs 2.0: An execution framework for weakness detection in ethereum smart contracts. In Proceedings of the 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, 11–15 September 2023; pp. 2102–2105. [Google Scholar]
  37. Smartbugs. Available online: https://github.com/smartbugs/smartbugs (accessed on 18 November 2024).
  38. USCV: A Unified Smart Contract Validator. Available online: https://github.com/93suhwan/uscv (accessed on 18 November 2024).
  39. Zhou, H.; Milani Fard, A.; Makanju, A. The state of ethereum smart contracts security: Vulnerabilities, countermeasures, and tool support. J. Cybersecur. Priv. 2022, 2, 358–378. [Google Scholar] [CrossRef]
  40. DASP 2. Available online: https://dasp.co//#item-2 (accessed on 18 November 2024).
  41. SWC100. Available online: https://swcregistry.io/docs/SWC-100/ (accessed on 18 November 2024).
  42. SWC108. Available online: https://swcregistry.io/docs/SWC-108/ (accessed on 18 November 2024).
  43. SWC106. Available online: https://swcregistry.io/docs/SWC-106/ (accessed on 18 November 2024).
  44. SWC121. Available online: https://swcregistry.io/docs/SWC-121/ (accessed on 18 November 2024).
  45. SWC122. Available online: https://swcregistry.io/docs/SWC-122/ (accessed on 18 November 2024).
  46. SWC132. Available online: https://swcregistry.io/docs/SWC-132/ (accessed on 18 November 2024).
  47. Mapping Registry. 2024. Available online: https://github.com/MultiTagging/MultiTagging/blob/main/Mapping/VulnerablityMap.xlsx (accessed on 18 November 2024).
  48. MultiTagging Framework. 2024. Available online: https://github.com/MultiTagging/MultiTagging (accessed on 18 November 2024).
  49. Doublade. 2019. Available online: https://doublade.readthedocs.io/en/latest/index.html (accessed on 18 November 2024).
  50. Schneidewind, C.; Grishchenko, I.; Scherer, M.; Maffei, M. eThor: Practical and provably sound static analysis of ethereum smart contracts. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, USA, 9–13 November 2020; pp. 621–640. [Google Scholar]
  51. NotSoSmartC. 2023. Available online: https://github.com/crytic/not-so-smart-contracts/ (accessed on 18 November 2024).
  52. Tsankov, P.; Dan, A.; Drachsler-Cohen, D.; Gervais, A.; Buenzli, F.; Vechev, M. Securify: Practical security analysis of smart contracts. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, Toronto, ON, Canada, 15–19 October 2018; pp. 67–82. [Google Scholar]
  53. Torres, C.F.; Iannillo, A.K.; Gervais, A.; State, R. Confuzzius: A data dependency-aware hybrid fuzzer for smart contracts. In Proceedings of the 2021 IEEE European Symposium on Security and Privacy (EuroS&P), Vienna, Austria, 6–10 September 2021; pp. 103–119. [Google Scholar]
  54. Liu, Z.; Qian, P.; Yang, J.; Liu, L.; Xu, X.; He, Q.; Zhang, X. Rethinking smart contract fuzzing: Fuzzing with invocation ordering and important branch revisiting. IEEE Trans. Inf. Forensics Secur. 2023, 18, 1237–1251. [Google Scholar] [CrossRef]
  55. Tikhomirov, S.; Voskresenskaya, E.; Ivanitskiy, I.; Takhaviev, R.; Marchenko, E.; Alexandrov, Y. Smartcheck: Static analysis of ethereum smart contracts. In Proceedings of the 1st International Workshop on Emerging Trends in Software Engineering for Blockchain, Gothenburg, Sweden, 27 May 2018; pp. 9–16. [Google Scholar]
  56. Mossberg, M.; Manzano, F.; Hennenfent, E.; Groce, A.; Grieco, G.; Feist, J.; Brunson, T.; Dinaburg, A. Manticore: A user-friendly symbolic execution framework for binaries and smart contracts. In Proceedings of the 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), San Diego, CA, USA, 11–15 November 2019; pp. 1186–1189. [Google Scholar]
  57. Torres, C.F.; Schütte, J.; State, R. Osiris: Hunting for integer bugs in ethereum smart contracts. In Proceedings of the 34th Annual Computer Security Applications Conference, San Juan, PR, USA, 3–7 December 2018; pp. 664–676. [Google Scholar]
  58. De Moura, L.; Bjørner, N. Z3: An efficient SMT solver. In Proceedings of the International Conference on Tools and Algorithms for the Construction and Analysis of Systems, Budapest, Hungary, 29 March–6 April 2008; Springer: Berlin/Heidelberg, Germany, 2008; pp. 337–340. [Google Scholar]
  59. LASER-Ethereum. Available online: https://github.com/muellerberndt/laser-ethereum (accessed on 18 November 2024).
  60. Semgrep. Available online: https://semgrep.dev/ (accessed on 18 November 2024).
  61. Feist, J.; Grieco, G.; Groce, A. Slither: A static analysis framework for smart contracts. In Proceedings of the 2019 IEEE/ACM 2nd International Workshop on Emerging Trends in Software Engineering for Blockchain (WETSEB), Montreal, QC, Canada, 27 May 2019; pp. 8–15. [Google Scholar]
  62. Rosen, B.K.; Wegman, M.N.; Zadeck, F.K. Global value numbers and redundant computations. In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, San Diego, CA, USA, 10–13 January 1988; pp. 12–27. [Google Scholar]
  63. Slither, the Smart Contract Static Analyzer. Available online: https://github.com/crytic/slither (accessed on 18 November 2024).
  64. So, S.; Lee, M.; Park, J.; Lee, H.; Oh, H. Verismart: A highly precise safety verifier for ethereum smart contracts. In Proceedings of the 2020 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 18–21 May 2020; pp. 1678–1694. [Google Scholar]
  65. VeriSmart. Available online: https://github.com/kupl/VeriSmart-public (accessed on 18 November 2024).
Figure 1. Mapping SWC codes to CWE.
Figure 2. Mapping SWC codes to DASP Top 10.
Figure 3. Overview of MultiTagging framework.
Figure 4. Flowchart of the Power-based voting algorithm.
Figure 5. Overview of used benchmarks.
Figure 6. Analysis tools efficiency metrics.
Figure 7. Performance overview of analysis tools and voting methods using a portion of the benchmark.
Figure 8. Overlap of analysis tool findings.
Figure 9. Overlap of analysis tool findings per vulnerability.
Table 1. DASP Top 10 Vulnerabilities.
DASP Rank | Class | Description
1 | Reentrancy | This occurs when external contract calls initiate new calls to the calling contract before the first execution is completed.
2 | Access Control | This occurs when an attacker gains illegal access rights.
3 | Arithmetic Issues | This occurs when an attacker exploits the absence of mechanisms to verify that the arithmetic operation result is within the data type scope, allowing state variable tampering.
4 | Unchecked Low-Level Calls | This occurs because there is no mechanism to propagate exceptions in low-level external calls, causing the code to continue running despite the failure.
5 | Denial of Service (DoS) | DoS can occur in various ways, e.g., by intentionally raising the gas required to execute a function or by abusing access control rules.
6 | Bad Randomness | This occurs due to the use of predictable randomness.
7 | Front Running | Blockchain transactions are executed in a certain sequence, often based on the transaction fees. Because transactions are publicly available, malicious users can exploit this gap by raising the gas fees for their transactions to be processed first.
8 | Time Manipulation | This occurs when a random number is generated using an initial seed that miners can control. A malicious miner can exploit such variables to their benefit.
9 | Short Address Attack | Ethereum Virtual Machine (EVM) inserts zeros at the end of transactions that are less than 32 bytes long. However, the issue may arise if the address, rather than the data, is shorter, resulting in an incorrect address being accepted.
10 | Unknown Unknowns | Unknown vulnerabilities.
Table 2. Mapping SWC codes to DASP categories.
DASP Rank | SWC Codes: Dia et al. [28] | SWC Codes: Rameder et al. [22] | SWC Codes: Di Angelo and Salzer [29]
1 | 107 | 107 | 107
2 | - | 100, 108, 112, 115 | 105, 106, 112, 115, 117, 118, 124
3 | 101 | 101 | 101
4 | 104 | 104 | 104
5 | 106 | 106, 113, 126, 128 | 113, 126, 128, 134
6 | 120 | 120 | 120
7 | 114 | 114 | 114
8 | 116 | 116 | 116
9 | - | - | -
10 | - | - | 100, 102, 103, 108–111, 119, 123, 125, 127, 129, 130–133, 135, 136
Unmapped Codes | 100, 102, 103, 105, 108–113, 115, 117–119, 121–136 | 102, 103, 105, 109–111, 117–119, 121–125, 127, 129–136 | 121, 122
Table 3. Characteristics of evaluation studies on SC analysis tools.
Ref.YearNo. of ToolsBenchmark SizeLabels Taxonomy and No. Evaluation MetricsAutomation  Execution Environment Declared    Tools’ Info Reported
ParsingMappingVersionSettings
Parizi et al. [14]2018410 SCsNone: 11ROC, accuracy
Durieux et al. [15]2020947,587 SCsDASP: 10Accuracy, ET✓ *
Leid et al. [32]2020320 tokensNot clearCoverage, ET
Ghaleb and Pattabiraman [33]2020650 SCsNone: 8FN, FP✓ *
Zhang et al. [16]20209176 SCsNone: 49Coverage, precision, recall
Dias et al. [28]20213222 SCsOwn: 141TP, FP, TN, FN, recall, F1-score, markedness, informedness
Ren et al. [17]2021946,186 SCsNone: 8 **Coverage, precision, recall
Ji et al. [18]20218273 SCsDASP: 6TP, FP, TN, FN, precision, recall, accuracy, F1-score, AUC
Kushwaha et al. [19]20221330 SCsNone: 13ET
Di Angelo et al. [20]202412248,328 SCsSWC: 15Error rate, overlap
Note: * Only the execution timeout is mentioned, ** The study used only one label, Reentrancy, as it is detectable by all tools, ET: Execution Time, TN: True Negative, TP: True Positive, FN: False Negative, FP: False Positive, ✓: Provided, ✗: Not Provided.
Table 4. Benchmark datasets description.
Benchmark | Year | Type | No. of Entries | No. of Positive Entries | No. of SWC Classes | No. of DASP Classes | Final No. of Samples
Doublade [49] | 2019 | Wild | 319 | 152 | 5 | 4 | 225
eThor [50] | 2020 | Wild | 720 | 196 | 1 | 1 | 223
JiuZhou [16] | 2020 | Crafted | 168 | 68 | 33 | 10 | 164
SBcurated [35] | 2020 | Crafted | 143 | 145 | 16 | 10 | 101
SolidFI [33] | 2020 | Crafted | 350 | 350 | 7 | 6 | 343
SWCregistry [12] | 2020 | Crafted | 117 | 76 | 33 | 6 | 91
NotSoSmartC. [51] | 2023 | Crafted | 31 | 24 | 12 | 6 | 26
Study Dataset (Total) | | | 1848 | 1011 | 33 | 10 | 1173
Table 5. Implemented analysis tools.
Analysis Tool | Implemented Version | Available On
MAIAN | #4bab09a | https://github.com/smartbugs/MAIAN, Accessed on: 17 December 2023
Mythril | v0.24.8 | https://github.com/ConsenSys/mythril-classic, Accessed on: 24 July 2024
Semgrep | #c3a9f40 | https://github.com/Decurity/semgrep-smart-contracts, Accessed on: 28 January 2024
Slither | v0.10.0 | https://github.com/crytic/slither, Accessed on: 17 December 2023
Solhint | v4.1.1 | https://github.com/protofire/solhint, Accessed on: 12 January 2024
VeriSmart | #36d191e | https://github.com/kupl/VeriSmart-public, Accessed on: 25 January 2024
Table 6. Number of samples analyzed per tool.
Actual No. of Samples | MAIAN | Mythril | Semgrep | Slither | Solhint | VeriSmart | No. of Common Samples
1173 | 1084 | 852 | 1153 | 1063 | 1159 | 663 | 645
Table 7. Performance of examined analysis tools and voting methods.
Label | No. of Positive Samples | Analysis Tool | TP | TN | FP | FN | Recall | Precision
Reentrancy | 54 | Mythril | 46 | 452 | 139 | 8 | 0.85 | 0.25
| | Semgrep | 1 | 583 | 8 | 53 | 0.02 | 0.11
| | Slither | 54 | 373 | 218 | 0 | 1.00 | 0.20
| | Solhint | 35 | 457 | 134 | 19 | 0.65 | 0.21
| | AtLeastOne voting | 54 | 313 | 278 | 0 | 1.00 | 0.16
| | Majority voting | 53 | 416 | 175 | 1 | 0.98 | 0.23
| | Power-based voting | 54 | 319 | 272 | 0 | 1.00 | 0.17
Access Control | 76 | MAIAN | 12 | 553 | 16 | 64 | 0.16 | 0.43
| | Mythril | 62 | 360 | 209 | 14 | 0.82 | 0.23
| | Semgrep | 3 | 569 | 0 | 73 | 0.04 | 1.00
| | Slither | 60 | 257 | 312 | 16 | 0.79 | 0.16
| | Solhint | 63 | 217 | 352 | 13 | 0.83 | 0.15
| | VeriSmart | 57 | 262 | 307 | 19 | 0.75 | 0.16
| | AtLeastOne voting | 74 | 100 | 469 | 2 | 0.97 | 0.14
| | Majority voting | 63 | 324 | 245 | 13 | 0.83 | 0.20
| | Power-based voting | 74 | 100 | 469 | 2 | 0.97 | 0.14
Arithmetic | 45 | Mythril | 37 | 460 | 140 | 8 | 0.82 | 0.21
| | Semgrep | 0 | 600 | 0 | 45 | 0.00 | NaN
| | Slither | 4 | 547 | 53 | 41 | 0.09 | 0.07
| | VeriSmart | 44 | 256 | 344 | 1 | 0.98 | 0.11
| | AtLeastOne voting | 44 | 216 | 384 | 1 | 0.98 | 0.10
| | Majority voting | 38 | 466 | 134 | 7 | 0.84 | 0.22
| | Power-based voting | 44 | 216 | 384 | 1 | 0.98 | 0.10
Unchecked Return Values | 116 | Mythril | 67 | 519 | 10 | 49 | 0.58 | 0.87
| | Slither | 89 | 422 | 107 | 27 | 0.77 | 0.45
| | Solhint | 87 | 406 | 123 | 29 | 0.75 | 0.41
| | AtLeastOne voting | 89 | 388 | 141 | 27 | 0.77 | 0.39
| | Majority voting | 87 | 438 | 91 | 29 | 0.75 | 0.49
| | Power-based voting | 89 | 388 | 141 | 27 | 0.77 | 0.39
DoS | 38 | MAIAN | 1 | 586 | 21 | 37 | 0.03 | 0.05
| | Mythril | 12 | 544 | 63 | 26 | 0.32 | 0.16
| | Slither | 24 | 507 | 100 | 14 | 0.63 | 0.19
| | Solhint | 9 | 549 | 58 | 29 | 0.24 | 0.13
| | AtLeastOne voting | 26 | 423 | 184 | 12 | 0.68 | 0.12
| | Majority voting | 15 | 554 | 53 | 23 | 0.39 | 0.22
| | Power-based voting | 26 | 444 | 163 | 12 | 0.68 | 0.14
Bad Randomness | 4 | Mythril | 1 | 618 | 23 | 3 | 0.25 | 0.04
| | Semgrep | 0 | 637 | 4 | 4 | 0.00 | 0.00
| | Slither | 1 | 586 | 55 | 3 | 0.25 | 0.02
| | Solhint | 1 | 631 | 10 | 3 | 0.25 | 0.09
| | AtLeastOne voting | 3 | 569 | 72 | 1 | 0.75 | 0.04
| | Majority voting | 0 | 626 | 15 | 4 | 0.00 | 0.00
| | Power-based voting | 3 | 573 | 68 | 1 | 0.75 | 0.04
Front Running | 34 | Mythril | 25 | 453 | 158 | 9 | 0.74 | 0.14
| | AtLeastOne voting | 25 | 453 | 158 | 9 | 0.74 | 0.14
| | Majority voting | 25 | 453 | 158 | 9 | 0.74 | 0.14
| | Power-based voting | 25 | 453 | 158 | 9 | 0.74 | 0.14
Time Manipulation | 35 | Mythril | 28 | 539 | 71 | 7 | 0.80 | 0.28
| | Slither | 34 | 493 | 117 | 1 | 0.97 | 0.23
| | Solhint | 34 | 459 | 151 | 1 | 0.97 | 0.18
| | AtLeastOne voting | 34 | 459 | 151 | 1 | 0.97 | 0.18
| | Majority voting | 34 | 492 | 118 | 1 | 0.97 | 0.22
| | Power-based voting | 34 | 492 | 118 | 1 | 0.97 | 0.22
Table 8. Power-based vote method: Voter roles and vote strategy for each vulnerability class.
Vulnerability Class | Voters | Inverter: [Tools] | Vote Strategy
Reentrancy | [Mythril, Slither, Solhint] | Semgrep: [Slither] | AtLeastOne
Access Control | [MAIAN, Mythril, Semgrep, Slither, Solhint, VeriSmart] | | AtLeastOne
Arithmetic | [Mythril, Slither, VeriSmart] | | AtLeastOne
Unchecked Return Values | [Mythril, Slither, Solhint] | | AtLeastOne
DoS | [Mythril, Slither, Solhint] | MAIAN: [Slither] | AtLeastOne
Bad Randomness | [Mythril, Slither, Solhint] | Semgrep: [Slither, Solhint] | AtLeastOne
Front Running | [Mythril] | | AtLeastOne
Time Manipulation | [Mythril, Slither, Solhint] | | Majority
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
