1. Introduction
The Internet of Things (IoT) has experienced rapid development in recent years, transforming industries into Internet-based paradigms by enabling the connection of millions of devices [
1]. Owing to advancements in 5G and the upcoming 6G wireless communication technologies, data transmission has become significantly faster and more reliable, laying a solid foundation for IoT and edge intelligence applications [
2]. By 2030, it is projected that approximately 500 billion IoT devices will be in use globally [
4]. IoT data generation is expected to grow significantly, with estimates suggesting a rise to 73.1 ZB by 2025, more than a fourfold increase from the 17.3 ZB produced in 2019 [
4]. IoT devices are increasingly used in smart homes [
5], healthcare [
6,
7], manufacturing [
8,
9], and transportation [
10,
11], where they collect vast amounts of data. These data are essential for deriving insights and powering edge intelligent applications through advanced analytics and machine learning [
12], driving the convergence of IoT, edge devices, and cloud into a seamless computing continuum [
13], which serves as a cornerstone of modern digital ecosystems.
With the expansion of IoT and edge intelligence computing, ensuring the security and privacy of the massive amount of data generated and processed by these devices has become a significant challenge [
14]. Trusted Execution Environment (TEE) provides a secure, isolated space within a processor where sensitive data and code can be processed, safeguarding them from unauthorized access or tampering [
15]. TEE is crucial for protecting the growing volume of sensitive data generated by billions of interconnected devices in IoT. These devices often operate in untrusted environments and are vulnerable to security breaches, making TEE essential for ensuring data integrity and privacy [
16]. Various TEE technologies enhance data security, including AMD Secure Encrypted Virtualization (SEV) [
17] for virtual machine encryption, Arm TrustZone [
18] for embedded devices, and Penglai Enclave [
19] on RISC-V platforms. Intel SGX [
20], in contrast, focuses on application-level security through enclaves, making it ideal for protecting sensitive data in cloud and edge intelligence applications [
21] and is widely used in IoT [
22,
23].
However, as edge intelligence becomes an essential component of IoT architectures, the usage of TEE in these scenarios faces several challenges. The resource constraints of edge devices, such as limited processing power and memory, coupled with the high-performance demands of real-time applications, make it difficult for TEE technologies to operate efficiently [
24]. Additionally, ensuring data integrity [
25,
26], managing I/O operations securely [
27], and handling the performance overhead from frequent enclave transitions [
28] present significant hurdles, limiting the widespread deployment of TEE-based solutions in edge intelligence.
In this paper, we analyze the performance and security challenges of using SGX-based TEE in IoT and edge intelligence scenarios. Through a series of experiments, we evaluate the impact of TEE on system performance, focusing on key aspects such as I/O operations and data handling in resource-constrained environments. We concentrate on the performance of different SGX implementations in real-world scenarios, conducting comprehensive and detailed evaluations of SGX-based TEE through carefully designed test cases. Our results reveal significant performance degradation and security limitations when using TEE technologies in edge scenarios. This research provides critical technical evaluations and security analyses that offer valuable insights for advancing the development and application of TEE in IoT and edge intelligence scenarios. The contributions of this paper can be summarized as follows:
A comprehensive survey of current TEE technologies in IoT scenarios is conducted, identifying key challenges such as I/O security and performance overhead.
A detailed testing framework is designed to evaluate the performance of SGX-based TEE in edge environments, covering implementations based on SGX Software Development Kit (SDK) and Library Operating System (LibOS), as well as their usage in a virtual environment.
Based on the survey and experimental results, performance bottlenecks in SGX are identified, offering insights and optimization directions for the application of TEE technologies in edge intelligence scenarios.
The rest of the paper is organized as follows. In
Section 2, we provide a brief introduction to the fundamental concepts of SGX-based TEE, focusing on two key features: isolated execution and attestation.
Section 3 discusses the application of TEE in IoT and the challenges it faces in edge intelligence scenarios, such as I/O security and performance degradation.
Section 4 presents a detailed performance evaluation of SGX in edge scenarios and proposes performance improvement directions based on the test results.
Section 5 covers a review of related research and
Section 6 concludes the paper.
3. Analysis of TEE Usage in Edge Intelligence and IoT
IoT refers to a scenario where physical and digital devices are interconnected through specific protocols and communication methods, forming an extensive network [
1]. Information exchange is a crucial component of IoT, involving the protection of private data and identity authentication between devices. These concerns align directly with the core capabilities of TEE. In cloud computing, data are transmitted to the cloud for centralized processing, yet data owners do not fully trust the cloud infrastructure. As a result, TEE technologies such as ARM TrustZone [
18], Confidential Compute Architecture (CCA) [
37], Penglai Enclave [
19] and SGX are crucial for ensuring data protection on remote servers. Next, we will analyze the applications of these different TEE technologies in the IoT and conduct a detailed investigation and analysis of SGX.
3.1. Arm TrustZone and CCA
Arm TrustZone was introduced in 2004 as a set of hardware security extensions to Arm Cortex-A application processors. Arm TrustZone divides a system’s hardware and software resources into two distinct worlds: the Secure World and the Normal World. The Secure World handles sensitive operations, such as cryptographic key management, authentication, and secure storage, while the Normal World runs the standard operating system and applications. The transition between the two worlds is managed through Monitor Mode. TrustZone extends memory security with optional components like the TrustZone Address Space Controller (TZASC) and TrustZone Memory Adapter (TZMA). TZASC manages secure and non-secure memory regions for DRAM, while TZMA handles off-chip ROM or SRAM, enabling Secure World access to non-secure areas but not vice versa.
CCA, designed to enhance virtual machine protection, introduces the Realm Management Extension (RME) and adds two new worlds: the Realm World and the Root World. The Realm World provides secure, isolated environments for confidential VMs, completely separating them from other domains, including the host OS, hypervisor, TrustZone, and other realms. CCA uses a Granule Protection Table, an extension to the page table that tracks which world owns each memory granule to maintain separation. The Granule Protection Table is managed by the Monitor in the Root World, which ensures that the hypervisor or OS cannot alter it directly. This Monitor also controls the dynamic allocation of memory across worlds by updating the Granule Protection Table. Additionally, CCA features attestation mechanisms that validate the platform and ensure the integrity of the realms. The low power consumption of the ARM architecture makes it well suited for IoT scenarios. Owing to its prevalence in mobile and low-end devices, ARM TrustZone and CCA are widely deployed in IoT and edge intelligence scenarios [
38,
39,
40].
3.2. Penglai Enclave
Penglai Enclave is a software–hardware co-designed TEE technology aimed at enhancing the security and scalability of security-critical applications in cloud environments, specifically on RISC-V platforms. Penglai introduces two key hardware primitives: the Guarded Page Table and the Mountable Merkle Tree (MMT). The Guarded Page Table enables fine-grained, page-level memory isolation, ensuring that unauthorized software cannot access secure memory. MMT provides scalable memory encryption and integrity protection, supporting large secure memory regions. The secure monitor operates at the highest privilege level (e.g., machine mode in RISC-V), managing enclave creation, enforcing memory isolation and maintaining security guarantees.
Additionally, Penglai employs the Shadow Enclave and Shadow Fork mechanisms, which allow for the fast creation of multiple secure instances, significantly reducing the latency of memory initialization. The system can dynamically scale to thousands of concurrent secure instances, supporting up to 512 GB of encrypted memory with only approximately 5% overhead for memory-intensive applications. This framework effectively addresses the limitations of traditional TEE systems in terms of memory protection and scalability, making it particularly suitable for large-scale, dynamic and serverless cloud computing scenarios.
3.3. Analysis of SGX in Edge Intelligence and IoT
SGX is widely used in IoT scenarios owing to its security guarantees and remote attestation features.
Table 1 summarizes some applications of SGX in IoT. The application scenarios can be broadly categorized into four types, with the first focusing on the use of SGX for IoT architecture. The work in [
41] describes the entire process from data acquisition at the edge devices, to encrypted transmission, and finally, decryption and processing at the gateway, with SGX responsible for key management tasks. Based on device virtual cloning, SGX-Cloud [
42] was constructed to allow users to choose which data to place into the cloud. The methods outlined in [
43] utilized SGX’s privacy-preserving mechanisms to achieve secure multi-party data sharing. Second, the remote attestation mechanism of SGX can be leveraged to achieve security-enhanced identity authentication [
44]. Third, blockchain is a key component in IoT scenarios, and its integration with SGX can achieve better privacy protection. The work in [
45] sets up oracle nodes running in an SGX environment, enabling blockchain to securely obtain external data. Zhang et al. [
46] leveraged SGX in industrial IoT to track data flow and prevent data abuse. According to the study in [
4], SGX edge servers were set up to implement attribute-based fine-grained access control. Furthermore, SGX can also be applied in fields such as federated learning [
47], smart grids [
48] and health data protection [
49].
However, SGX usage in edge intelligence still has limitations. In these scenarios, where device resources are limited and data movement is complex, SGX, though capable of providing hardware-level data security, still faces significant challenges. Security and performance analyses are crucial in understanding the effectiveness and practicality of SGX in these scenarios. On the one hand, data security is a significant concern on edge devices, as data exchanged with peripherals or the cloud are vulnerable to leakage. Additionally, SGX’s inherent weakness in defending against side-channel attacks, combined with the closer proximity of attackers to edge devices, increases the likelihood of such attacks succeeding in edge scenarios. On the other hand, deploying SGX on edge devices often leads to performance degradation due to the added computational overhead and frequent context switching, particularly in resource-constrained environments. Next, we conduct an in-depth analysis of the three key challenges SGX faces in edge intelligence scenarios: I/O security issues, vulnerability to side-channel attacks, and performance degradation.
3.3.1. I/O Security Issues
Firstly, SGX does not include specific protection measures for I/O operations. While memory regions are protected by SGX, I/O operations still interact with external systems through unprotected channels. These channels, such as networks and storage devices, may become targets for attackers, potentially leading to the risk of I/O data being intercepted or altered on edge devices. To establish a trusted path between the enclave and I/O devices, various solutions have been proposed, as shown in
Table 2.
There are generally three approaches to construct a trusted I/O path: hypervisor-based, software-based and external hardware (HW)-based solutions. For hypervisor-based methods such as SGXIO [
50], although it leverages a virtual machine to supervise the OS, it significantly increases the TCB. Additionally, because SGX does not trust the hypervisor, this approach involves a complex process of trusted domain binding and attestation. Software-based solutions like Aurora [
51] protect I/O using System Management Mode (SMM), but they occupy valuable SMRAM resources and do not distinguish between secure and non-secure devices. Rocket-io [
52] leverages the Storage Performance Development Kit (SPDK) and Data Plane Development Kit (DPDK) to accelerate communication between the enclave and peripherals. However, it is limited to the disk and network interface card, offering poor generality. The approaches based on external HW encounter challenges such as significant overhead when establishing trusted I/O paths. Ref. [
53] designs a piece of external hardware, a USB dongle, and implements a Device Provisioning Enclave (DPE) for key exchange and information transmission; however, during the initialization of the trusted path, a trusted OS must first be booted to bind the DPE and the device, resulting in significant performance overhead. Furthermore, a USB proxy device (UPD) is designed in [
54] for transmitting USB packets. However, the I/O path setup relies on remote attestation, which incurs significant overhead. Some I/O protection solutions for specific scenarios, such as Bluetooth [
55] and browsers [
56] have been proposed, but these also suffer from limited generality and inadequate security protection. This demonstrates that SGX still lacks a universal solution to its I/O security and I/O performance degradation issues, which is particularly concerning in edge intelligence scenarios, where attackers can more easily access devices and data sources, leading to increased security risks.
3.3.2. Vulnerability to Side-Channel Attacks
SGX provides hardware-level protection, preventing privileged software, such as the OS and hypervisor, from accessing sensitive data. However, the shared use of system resources between the enclave and untrusted applications significantly expands the attack surface for side-channel vulnerabilities. Currently, side-channel attacks exploiting page tables, branch prediction and cache present significant threats to the security of data within enclaves.
In the case of page tables, the enclave depends on the untrusted OS for management, while using additional data structures for validation. The OS can mark all pages as inaccessible, causing any access to trigger a page fault. By monitoring which pages the enclave accesses and analyzing the sequence of these events over time, the OS can potentially infer certain enclave states and compromise protected data. PigeonHole [
57] demonstrates that page fault side-channel attacks can exploit unprotected SGX, leaking an average of 27% and up to 100% of secret bits. Additionally, Ref. [
58] focuses on attacks targeting page table flags, such as modifying the “present” bit to track enclave page access, using updates to the “accessed” bit to detect pages accessed by the victim enclave and monitoring the “dirty” bit to infer the victim enclave’s memory write operations. Controlled-channel attacks [
59] also pose significant challenges to the security of data within the enclave.
Branch prediction is a critical mechanism in modern processors that predicts the outcome of branch instructions before execution. By utilizing branch prediction, processors can preemptively fetch and decode instructions based on the predicted branch direction, enhancing performance. However, during context switching, the Branch Prediction Unit (BPU) is not flushed, potentially retaining sensitive information from isolated environments. This residual data can be exploited by attackers through software-based side-channel attacks, posing significant security risks. By analyzing the branch history information of the victim enclave, attacks such as Branch Shadowing [
60] and BranchScope [
61] can effectively extract sensitive data or cryptographic keys from within an enclave.
Cache-based side-channel attacks exploit the differing access times between the cache and main memory, stealing sensitive data by measuring the victim’s execution time or data access patterns. For example, the Prime + Count cache side-channel attack [
62] can establish covert channels across worlds within ARM TrustZone. SGX, in particular, is susceptible to cache-timing attacks [
25]. Experiments have shown that the Prime+Probe cache side-channel attack can extract the AES key from an SGX enclave in less than 10 s. To address the issue of side-channel attacks, several solutions have been proposed. For page table-based side-channel attacks, hardware isolation can provide enclaves with independent page tables [
63]. Additionally, cache partitioning [
64] prevents attackers and victims from sharing cache lines, while techniques like timing variation elimination [
65] can help mitigate cache-based side-channel attacks. Although these methods can effectively defend against specific types of side-channel attacks, there is currently no universal solution to protect SGX and other TEE technologies from side-channel vulnerabilities. Furthermore, many SGX projects, such as Gramine and Occlum, leave side-channel attacks outside their threat models, making SGX more susceptible to such attacks in practical deployments.
3.3.3. Performance Degradation Issues
In resource-constrained edge scenarios, SGX’s performance degradation has drawn significant attention. The degradation primarily arises from two sources: the overhead of data encryption and decryption, and the cost of entering and exiting the enclave during system calls or EPC page faults [
66]. Encryption and decryption operations are performed by hardware, leaving limited room for optimization. In contrast, various approaches have been proposed to optimize the performance overhead caused by context switching. Hotcalls [
28], based on a spin-lock synchronization mechanism, significantly improves performance compared to Ecalls and Ocalls. The work in [
67] achieves exitless system calls by delegating system calls to threads running outside the enclave, and performance overhead due to page faults is mitigated by implementing paging within the enclave. Another more general solution involves leveraging a LibOS, such as Gramine [
35] and Occlum [
36]. The LibOS approach can streamline interactions between the enclave and the application, potentially reducing the overhead associated with traditional system calls and improving overall performance. Next, we will provide a detailed discussion of the performance overheads of different SGX implementations in edge intelligence scenarios through a qualitative analysis and a quantitative assessment.
4. Performance Evaluation in Edge Scenarios
In this section, we evaluate the performance of four different SGX implementations in edge scenarios: the SGX SDK-based approach, two LibOS-based approaches (Gramine and Occlum), and the SGX implementation in a virtual environment. Testing covers five main aspects, CPU-intensive instructions, I/O-intensive instructions, network programming, common system calls and performance under realistic workloads, spanning most edge intelligence scenarios. Through qualitative analysis and quantitative evaluation, we analyze the performance degradation of SGX in resource-constrained edge scenarios.
4.1. Experimental Setup
The hardware specifications, system parameters, and software versions used for the tests are shown in
Table 3. The virtual environment utilizes the Docker environment provided by Occlum, version 0.26.4. For both Gramine and Occlum, default values are used for the relevant parameters, and the tests are conducted on SGX version 1. We disable CPU frequency boost features and fix the CPU frequency at 4.7 GHz to reduce performance data fluctuation. In the network tests, we disconnect external network connections and use local addresses to conduct the tests, minimizing external interference. To avoid the impact of other processes on the experimental results, we disable unnecessary background processes and services.
4.2. CPU-Intensive Workload
The Native-mode Benchmark (NBench) is a common test suite used to assess CPU performance, primarily focusing on the computational speed, floating-point operations and memory system efficiency of a computer. The NBench results are measured by the number of test iterations completed within 1 s. The more iterations are completed, the higher the efficiency. Taking the native Linux test results as the baseline, the performance of SGX implementations are illustrated in
Figure 2.
Overall, the implementation based on the SGX SDK has the worst performance, while Gramine performs the best. The performance of Occlum is similar to that of Gramine, and there is almost no difference between the virtual and native environments. The implementation based on the SGX SDK incurs significant context-switching overhead due to frequent enclave transitions, which leads to significant performance overhead. In the FOURIER test, it achieves only about of the native performance. For the two LibOS-based approaches, Gramine performs well across all tests, with a minimum performance of about of native performance. Occlum excels in numerical sorting, bitwise operations and LU decomposition, but in other tests, its performance is approximately to of the native performance. This is because Occlum’s design is more oriented towards multi-process environments, allowing multiple processes to coexist within a single enclave, whereas Gramine excels in single-threaded processing capabilities.
Separately, in the numerical sorting task, all four SGX implementations perform well, with performance being close to that of native Linux. This is due to the use of heap sort in the sorting task, and the small number of values to be sorted in each group, which avoids the issue of insufficient enclave capacity. As a purely computational task, the performance of all SGX implementations is similar. In the bitwise operation test, LibOS-based approaches achieve performance close to native, while the SGX SDK shows a significant performance decline. In the floating-point computation tests, Occlum shows a performance decline relative to Gramine, indicating that Occlum is weaker in floating-point calculations. In purely CPU-intensive tasks, both Gramine and Occlum exhibit minimal performance degradation, while the SGX SDK experiences a notable performance decline. In edge intelligence scenarios, when using the SGX SDK, it is essential to make efficient use of the limited memory space within the enclave and to minimize the number of enclave transitions to reduce performance overhead.
4.3. I/O-Intensive Workload
In typical workflows, I/O is a common and crucial task. This is particularly relevant in edge intelligence applications, where devices are closer to the data source, making it important to focus on how to efficiently acquire and distribute data through I/O operations. Disk I/O is used for testing, with IOzone selected as the I/O-intensive workload. IOzone generates a series of file operations and measures their performance. IOzone is widely used and has been ported to various systems and platforms to assess file system performance. Since SGX does not support system calls natively, the SGX SDK-based solution requires using Ocalls to exit the enclave for I/O operations. This approach does not provide security guarantees, so we did not conduct I/O tests on the SGX SDK. Instead, we focused on testing LibOS-based solutions, as these solutions have their own file systems, allowing disk I/O to be performed within the enclave. We tested both Gramine and Occlum, which, despite both being based on LibOS, have significant differences in file systems. In Gramine, users need to customize a manifest file to specify trusted files, creating a virtual file system. Since Gramine only allows one process to exist within an enclave, processes cannot share encryption keys, which prevents them from sharing an encrypted file system. Gramine can only encrypt files marked as trusted, and it does not encrypt file metadata. In contrast, Occlum uses a fully encrypted file system divided into two layers: a read/write layer and a read-only layer. The read-only layer is encrypted from the “image” folder in the host’s Occlum instance, while the read/write layer handles files generated during process execution.
To ensure the reliability of the experimental results, each experiment was conducted five times. Additionally, we ensured that only the current process performed file read/write and disk interaction operations. In the line charts, the short horizontal lines indicate the maximum and minimum values from the five tests, while the averages of the five tests are connected to form the line chart. This paper mainly records the performance in four scenarios: sequential read/write and random read/write. Additionally, we focused on the system’s read and write performance under different buffer sizes.
4.3.1. Sequential Read/Write Performance
Sequential read and write operations are both performed on files stored on disk, meaning relevant file information is not preloaded in memory. In the case of sequential write operations, new files are written, requiring not only data storage but also the overhead of tracking the data’s location on the storage medium, known as metadata. The results are shown in
Figure 3. Occlum’s sequential read and write performance is significantly weaker than Gramine’s, which is a result of the performance overhead introduced by the fully encrypted file system. When the buffer size was small, Gramine’s sequential read and write speed could match or even surpass that of native Linux. However, once the size exceeded 1 MB, there was a noticeable decline in performance as the block size increased. The reason for this is that Gramine reached its preset maximum read/write buffer size, and further increases resulted in multiple read/write operations, which reduced efficiency. When the buffer size for sequential read reached 16 MB, Gramine’s efficiency dropped to only about
of that of the native Linux, as shown in
Figure 3a.
4.3.2. Random Read/Write Performance
Random read and write operations were performed on cached files, which resulted in higher performance compared to sequential read and write operations. The results are shown in
Figure 4. Occlum’s read and write performance remained at a low level, while Gramine experienced a noticeable performance drop when the read/write buffer size exceeded 1 MB. Unlike sequential read and write operations, Gramine’s performance with smaller buffer sizes did not surpass that of the native Linux. It merely approached native performance. This indicates that with caching, Gramine cannot achieve the same read and write performance as native Linux. The results in the virtual environment show almost no difference compared to those in the native environment.
Overall, SGX does not offer an I/O solution that balances performance and security in edge intelligence scenarios. The SGX SDK requires frequent enclave transitions for I/O operations, which introduces additional performance overhead without providing sufficient security. Occlum offers a secure file system, but its performance is too poor to be practically useful in real tasks. While Gramine achieves native performance with smaller buffer sizes in tests, it lacks comprehensive security guarantees and experiences significant performance degradation with larger buffer sizes.
4.4. Performance in Network Programming
In the context of IoT, the performance of network programming directly influences the speed of data exchange, thereby impacting application performance. This study simulated a scenario where a confidential server based on SGX was deployed remotely, and by adjusting the client load, server throughput and latency were observed to assess the impact of SGX on network programming performance. Lighttpd [
68], a lightweight HTTP server, was enabled on the server side, while ApacheBench [
69] was used on the client side to adjust the load. The simulation test was conducted by continuously downloading files from the server. The Lighttpd server version used was 1.4.40, and the ApacheBench version was 2.3. To simplify the experiment, both the server and the client ran on the same test machine, with the download address set to the loopback address 127.0.0.1 and port 8004. The test focused on two main aspects: first, testing multi-threaded performance by examining changes in system throughput as concurrency increases; and second, testing single-threaded performance by analyzing how latency changes as the size of the downloaded file increases. Since the SGX SDK does not support system calls and cannot directly access ports, the experiment was conducted using SGX implementations based on LibOS, namely Gramine and Occlum.
4.4.1. Impact of Concurrency on Throughput
Since Occlum requires the use of the spawn method for process creation and does not support the fork system call, which Lighttpd uses, the Lighttpd code would need to be modified to replace fork calls with spawn calls. This modification is left for future work, and in this paper, testing is only performed on native Linux and Gramine. In the Gramine tests, the maximum number of threads for Lighttpd was set to 25, and the file downloaded by the client was a randomly generated file of 10 KB.
Figure 5 shows that as concurrency increases, both the native Linux and Gramine exhibit a trend where throughput initially increases and then stabilizes. When concurrency is low, Gramine’s performance is significantly lower. For instance, when the concurrency is 1, Gramine achieves a throughput of only 7.6, while the native Linux exceeds 4000. The main reason for this is that with such a low load, the proportion of time spent handling download tasks decreases significantly, making the overhead of creating enclaves in Gramine particularly costly. However, as the load increases, both native Linux and Gramine show a rapid increase in throughput as concurrency rises from 1 to 50. When concurrency increases from 50 to 300, Gramine’s throughput continues to rise, while the native Linux throughput stabilizes, and both eventually reach their maximum throughput. The test results indicate that Gramine’s maximum throughput is about 75∼80% of that of native Linux.
4.4.2. CPU Resource Usage
The CPU usage of the Gramine and Occlum frameworks is evaluated during network programming to address resource constraints in edge scenarios, where computational resources are limited. The evaluation focuses on the Lighttpd web server running within enclaves, with its CPU consumption monitored from an external terminal. To ensure the accuracy of the experimental results, the system cache is cleared prior to each test. The terminal used for testing is set to measure CPU usage every second. Results are shown in
Figure 6. For Occlum, since all processes are initiated by a parent process, only the CPU usage of its child processes is measured. In contrast, for Gramine, the Lighttpd service is managed by the Platform Abstraction Layer (PAL) loader, making it the primary target for monitoring. The results show that both frameworks exhibit similar trends, with higher CPU usage during the initial runtime phase. However, as the server stabilizes, Occlum achieves slightly lower steady-state CPU overhead compared to Gramine. Overall, Occlum has a lower CPU overhead, but the initialization of the Lighttpd environment causes a significant number of context switches, resulting in the TEE’s CPU overhead being much higher than that of a native Linux environment.
4.4.3. Impact of File Size on Latency
When measuring the impact of file size on download latency, a single-threaded mode was used, and native Linux, Gramine and Occlum were all tested. The client concurrency was set to 1, and the file size gradually increased from 1 KB to 100 KB to observe the changes in download latency. The results are shown in
Figure 7. As the file size increases, the latency of native Linux shows no significant change. For Occlum, latency increases in both virtual and native environments. Gramine is the most sensitive to file size, with latency rising sharply as files grow. Comparing the servers in single-threaded mode, Occlum handles file downloads better, particularly for larger files: its latency is roughly 2 to 3 times that of native Linux, whereas Gramine, strongly affected by file size, can suffer latency increases of more than 10 times.
According to the results of the network programming performance tests, although Occlum performs better in file downloads in single-threaded mode, it still exhibits 2 to 3 times higher latency compared to native Linux, and is similarly affected by latency fluctuations due to file size. Additionally, Occlum’s compatibility issues with the spawn-based process creation mode can lead to usability problems. Gramine, on the other hand, has high download latency and is severely impacted by the size of the load.
4.5. System Call Overhead
A core challenge faced by SGX is how to manage its relationship with the OS. In
Section 2, two approaches to interacting with the OS are introduced. The SGX SDK fully distrusts the OS, which means that system calls cannot be made within the enclave. While this approach enhances security, it presents challenges for performance and code portability. On the other hand, by accepting a moderate increase in the TCB, LibOS-based solutions can perform most system calls within the enclave. Through security checks, these solutions can also defend against attacks from the OS, such as Iago attacks [
70]. We primarily tested the system call overhead of process creation and inter-process communication in Gramine and Occlum to observe SGX performance in multi-process tasks.
4.5.1. Process Creation Latency
Gramine and Occlum differ significantly in their process management models: Gramine allows only one process per enclave, while Occlum permits multiple processes within a single enclave. This leads to different process creation methods: Occlum abandons the fork system call and instead uses spawn to create new processes, whereas Gramine uses the traditional fork, with the child and parent processes sharing the enclave properties defined by the manifest file. During testing, child processes are created with varying amounts of memory allocated via malloc. Native Linux and Gramine create the child with the fork system call, while Occlum uses spawn. The time required to create a new process with different memory sizes reflects the process creation latency.
In
Figure 8, the process creation latency in Occlum is significantly lower than in Gramine. When the memory size of the new process is small (less than 100 MB), the performance can match that of native Linux. In Gramine, after a fork call, a new enclave is created, and complex interactions occur between the parent and child processes, such as identity authentication and key exchange, resulting in higher latency. Occlum, however, allows multiple processes to exist within a single enclave, where these processes share resources and are transparently managed by the operating system, greatly reducing process creation time. However, when the memory size of the new process exceeds the capacity of the EPC, it is constrained by the secure memory limit, requiring some of the memory to be moved to regular memory. This triggers repeated encryption and decryption operations, leading to additional performance overhead and a sharp increase in latency.
4.5.2. Inter-Process Communication Performance
Considering the different process execution models, inter-process communication is, in addition to process creation, another common and important scenario. We first created a child process using the methods described above: Gramine and native Linux used fork, while Occlum used spawn. Communication between the parent and child processes was then established through the pipe system call: the parent writes data into the pipe, and the child reads it. Throughput under different buffer sizes reflects the efficiency of inter-process communication. In
Figure 9, it can be observed that Occlum is highly efficient in inter-process communication, with almost no difference from native Linux; as the buffer size increases, the throughput of the pipe gradually rises. Gramine, by contrast, shows very low inter-process communication efficiency. These tests of process-related system calls show that although Gramine supports a multi-process mode, its performance declines significantly, whereas Occlum achieves native-level performance in both process creation and inter-process communication.
4.6. Performance Under Realistic Workloads
In this section, we evaluate the performance and applicability of SGX in the context of IoT workloads. IoT systems often involve extensive data caching and fast-access requirements, which impose stringent demands on both performance and data security. SGX offers a hardware-based TEE that ensures the confidentiality and integrity of sensitive data while maintaining system flexibility and scalability. To simulate a representative IoT workload, Memcached is selected as the target application. Memcached, a lightweight and widely adopted in-memory caching system, is commonly used for efficient key-value storage and retrieval in distributed systems requiring high-speed data access, and its workload characteristics closely match the caching and access patterns frequently observed in IoT scenarios. By employing Memcached as the benchmark, this study replicates typical IoT workloads to systematically analyze the impact of SGX on both performance and security under realistic operating conditions, providing insight into the feasibility and practicality of deploying SGX in real-world IoT environments.
The Memcached environment is configured for both Gramine and bare-metal Linux, with a focus on evaluating the relationship between throughput and latency. Memcached runs inside the SGX enclave, and load testing is performed using the memtier_benchmark [
71] from a separate terminal. The default configuration uses four threads, and the load is increased by incrementally raising the number of concurrent clients per thread. The resulting performance metrics are presented in
Figure 10. It is observed that the maximum throughput of Gramine reaches only a fraction of that achieved on bare-metal Linux, and its relative performance is even lower under light workloads. This indicates that in the Memcached scenario, the use of SGX still introduces significant performance overhead: frequent encryption and decryption operations, together with transitions in and out of the enclave, severely limit the performance of edge devices.
4.7. Performance Improvement Strategies
Based on the performance tests, it is evident that SGX experiences significant performance degradation in edge intelligence scenarios. Although different SGX implementations excel in specific areas, such as Gramine in CPU-intensive tasks and Occlum in process management, the overall performance of SGX in real-world tasks like I/O and network programming remains poor. To improve SGX’s performance in edge scenarios, three main approaches can be considered: optimizing enclave entry and exit, bypassing the kernel for I/O operations and adopting confidential virtual machine (CVM)-based TEE.
4.7.1. Optimizing Enclave Entry and Exit
Enclave switching poses a significant bottleneck in SGX-based systems, with the overhead of a single Ecall or Ocall exceeding 14,000 cycles [
28]. In comparison, a typical system call incurs only around 150 cycles. Since nearly all system calls within an enclave require switching to the untrusted application to complete, frequent transitions result in substantial performance overhead. To address this issue, approaches like HotCalls [
28] and Eleos [
67] have been proposed. HotCalls reduces the latency of enclave transitions by leveraging a spin-lock mechanism and shared unencrypted memory, cutting the overhead of Ecalls and Ocalls to as low as 620 cycles. Eleos employs an "exitless" architecture that batches system calls, caches results and uses shared memory buffers to minimize enclave exits, achieving a 2–3x reduction in overhead for system call-intensive workloads. Both approaches effectively mitigate the severe performance degradation caused by frequent enclave transitions, making SGX more practical for high-performance, real-world scenarios. However, our experimental evaluation suggests that the LibOS-based approach offers a more balanced trade-off among performance, security and usability. By handling most system calls directly within the enclave, a LibOS significantly reduces the frequency of enclave switches. At the cost of a moderate increase in TCB size, it also eliminates the need to partition existing code, allowing applications to be deployed into enclaves efficiently and with minimal performance overhead. We therefore conclude that the LibOS-based approach is particularly well suited to IoT and edge intelligence scenarios, where both performance optimization and ease of deployment are critical.
4.7.2. Bypassing the Kernel for I/O Operations
SGX performs well in CPU-intensive tasks, but it shows poor performance in I/O-intensive tasks and real-world applications, necessitating specialized optimization for I/O operations. SGX inherently lacks support for trusted I/O and does not allow efficient methods such as DMA to access enclave memory. Traditional solutions involve multiple copying steps: from disk or network cards to the kernel space, then from the kernel space to untrusted application memory, and finally, into the enclave, resulting in significant performance degradation. To address these challenges and provide secure and reliable peripheral channels, Aurora [
51] proposes a trusted I/O path based on the System Management Mode (SMM). Aurora uses System Management RAM (SMRAM) to isolate and protect I/O operations, creating a secure communication channel between the enclave and peripheral devices. Key features of Aurora include support for HID keyboards, USB storage, hardware clocks and serial printers, as well as a batch processing mechanism to optimize performance by reducing context switching. On the other hand, an efficient architecture called SMK [
72] focuses on extending SGX capabilities to address specific issues such as secure networking and trusted timing in distributed environments. SMK provides a trusted network by running protocol stacks on trusted hardware, ensuring data integrity and authenticity. Additionally, SMK introduces a trusted clock system to ensure that applications relying on precise timestamps (e.g., blockchain and secure communications) remain secure and reliable.
Although these methods can achieve secure I/O, they do not provide significant performance improvements. We believe that since TEE does not inherently trust privileged software, bypassing the kernel is a more suitable solution for TEE systems. This approach aligns with the TEE's core security model, minimizing reliance on untrusted components and reducing the attack surface while ensuring both the integrity and confidentiality of data during I/O operations. Rocket-IO is a direct I/O stack for TEEs that bypasses the untrusted kernel using user-space libraries like DPDK and SPDK, enabling efficient and secure hardware interaction. It eliminates multiple data copies and integrates encryption, significantly enhancing both I/O performance and security compared to existing frameworks. At the same time, we believe that new hardware and protocols can be leveraged to authenticate trusted peripherals, enabling direct data interaction. For instance, with RDMA network cards, device authentication can be conducted through the SPDM protocol, extending the trust domain from the CPU to external devices; direct data exchange can then be realized through methods such as DMA. Additionally, integrating advanced hardware like persistent memory can enhance both efficiency and security by enabling high-speed, non-volatile storage and direct data access. Combined with TEEs like Intel SGX, this ensures low-latency and secure data interaction across devices, while maintaining robust protection for sensitive operations [
73].
4.7.3. Adopting CVM-Based TEE
Although the methods mentioned above can effectively improve the performance of SGX, its fundamental limitation lies in its design as a user-space TEE. While distrusting the OS, SGX still relies on OS-provided services, such as page table management, leading to inherent performance and security challenges. In contrast, VM-based TEEs, such as HyperEnclave [
74], leverage a trusted hypervisor to build a flexible, general-purpose TEE architecture, offering multiple enclave types to suit diverse application needs. CVMs, in particular, are evolving rapidly, enabling fully functional virtual machines where unmodified applications can run seamlessly. Despite introducing a slightly larger TCB, CVMs significantly enhance performance for I/O-intensive tasks [
75]. For example, Intel TDX [
76] addresses SGX’s I/O limitations by supporting trusted I/O, while solutions like Bifrost [
77] optimize CVM-I/O performance through techniques such as zero-copy encryption and packet reassembly, achieving up to 21.50% performance gains over traditional VMs. Similarly, FOLIO [
78], a DPDK-based software solution, improves network I/O performance for CVMs without relying on trusted I/O hardware, achieving performance within 6% of ideal TIO configurations while maintaining security and compatibility with DPDK applications. Given these advancements, CVM-based approaches are likely to represent the most practical and effective model for implementing TEEs in the future, provided there are no extreme constraints on TCB size.
5. Related Works
In this section, we primarily summarize the works related to the analysis and evaluation of TEE technologies.
For SGX-based TEE technologies, Hasan et al. [
32] implemented LMbench using both the SGX SDK and Gramine, focusing on four scenarios: Forkless, SGX, NoSGX and Gramine. They compared the performance of the porting method and the shim-based method by analyzing system read/write performance and the overhead of certain system calls. They also tested Ecall and Ocall overhead, showing that the shim-based method is more optimization-friendly. After several iterations, LibOS-based technologies like Gramine and Occlum outperformed the port-based method in some scenarios. For SGXGauge, the authors of [
66] examined SGX performance under different memory usage levels, using EPC size as a baseline and testing in low, medium and high memory usage conditions. They compared native Linux, SGX SDK-based and LibOS-based implementations, revealing a significant performance drop in the SGX SDK when memory exceeded the EPC, while LibOS-based implementations maintained stable performance. Weisse et al. [
28] detailed Ecall and Ocall performance overheads in warm- and cold-cache environments. They also tested data exchange speeds between non-secure applications and enclaves across four buffer transfer modes: zero copy, copying in, copying out, and copying both in and out. The results showed poor performance in the application–enclave interface, prompting the proposal of HotCalls to redesign the interface and improve performance.
For other TEE technologies, analysis and evaluation are essential components. For Arm TrustZone, a study [
79] comparing it with other TEE technologies concluded that TrustZone is more hardware-efficient and avoids the risks of a highly privileged “black box” controlling the system. For ARM CCA, the authors of [
37] conducted detailed performance tests under various workloads, including hypercalls, I/O instructions and application benchmarks like Apache and MySQL, identifying key performance bottlenecks. On the RISC-V platform, the authors of [
19] evaluated the performance of Penglai Enclave using the SPECCPU benchmark and addressed security challenges such as controlled-channel and cache-based side-channel attacks, proposing mitigation strategies to enhance security. Additionally, some studies focus on TEE technologies that adopt VM-based protection, such as AMD SEV and Intel TDX. For instance, recent work [
75] conducted a comparative evaluation of TDX, SEV, Gramine-SGX and Occlum-SGX, analyzing computational overhead and resource usage under various operational scenarios with legacy applications. This study uniquely evaluates TDX, providing valuable insights into the performance of CVM-based TEEs under realistic conditions. In contrast, the abovementioned works do not consider the integration of TEE technologies into IoT environments. In resource-constrained edge intelligence scenarios, the performance and applicability of TEEs face significant challenges, where factors such as limited computational power and energy efficiency can heavily impact their effectiveness.
6. Conclusions
This paper presented an in-depth analysis and evaluation of SGX-based TEE technologies in IoT and edge intelligence scenarios. Through comprehensive performance testing of various SGX implementations, including those based on the SGX SDK and LibOS, we identified key challenges, such as performance degradation and I/O security issues, that arise when applying TEE in resource-constrained, latency-sensitive edge environments. Our experimental results highlight significant performance bottlenecks, particularly in enclave transitions and secure I/O operations. These findings offer critical insights into the limitations of current SGX solutions and provide valuable benchmarks for improving the integration of TEE in IoT scenarios. Additionally, we proposed corresponding performance optimization strategies, offering practical approaches for deploying TEE in IoT and edge intelligence scenarios.
However, this study has some limitations. First, it focuses on SGX-based TEE implementations and does not evaluate other kinds of TEE technologies under real workloads. Second, some security concerns relevant to TEE usage in edge environments, such as man-in-the-middle and denial-of-service attacks, are not covered in this work. Future work will focus on implementing the proposed performance optimizations and I/O security enhancements, while broadening our experimental scope to other TEE technologies. This will provide a more comprehensive understanding of TEE utilization and identify potential improvements across the IoT–Edge Cloud Continuum.