1. Introduction
Kubernetes has transformed the management of containerized enterprise applications, providing scalable orchestration for workloads in cloud-native and high-performance computing settings. Initially developed by Google, Kubernetes has emerged as the industry standard owing to its flexibility, scalability, and robust architecture. This has resulted in its extensive application in domains such as high-performance computing and artificial intelligence, where efficiently managing large-scale clusters and resources is essential.
The choice of container network interface (CNI) plugin significantly influences how efficiently Kubernetes orchestrates workloads from a networking perspective. Container network interfaces manage network communication among pods and nodes, directly influencing bandwidth, latency, and overall network efficacy. Prominent CNIs such as Antrea, Flannel, Cilium, and Calico have distinct advantages and disadvantages, which makes carefully selecting an appropriate CNI for a given workload essential, particularly in performance-critical settings such as HPC.
In high-performance computing, distributed workloads involve extensive data processing, and simulations demand high bandwidth and minimal latency. Any network inefficiency may result in prolonged processing durations and higher expenses. Diverse workloads necessitate distinct network configurations: some emphasize low latency, while others aim to maximize bandwidth. Kubernetes’ capability to use custom CNIs enables tailored configurations; however, choosing the optimal CNI necessitates a comprehensive assessment.
Besides choosing a suitable CNI, system optimization is essential for enhancing network performance. Modifying parameters such as packet size, MTU (Maximum Transmission Unit), and kernel configuration can substantially influence CNI performance. For example, some CNIs perform better when optimized for larger packet sizes, while others perform best with smaller ones.
This article assesses the TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) performance and latency efficiency of four of the most common Kubernetes CNI plugins (Antrea, Calico, Cilium, and Flannel), analyzing their behavior under various system tuning profiles. While doing these evaluations, we developed two tuned profiles that significantly impact bandwidth efficiency. Utilizing automation tools such as Ansible guarantees reproducible testing, yielding critical insights for enhancing Kubernetes networking in practical HPC contexts, which can also affect the data center design.
This paper makes four scientific contributions:
Analysis of the impact of system tuning on CNI plugin performance, using six built-in and two custom tuned profiles;
From the design perspective, influence on Kubernetes data center network infrastructure design based on workload type: knowing that CNI plugins add 40–90 ms of latency and deliver approximately 40% less bandwidth than the physical bandwidth the network adapters provide leads to changes in data center design and procurement processes;
Analysis of the influence of system tuning, packet size, and MTU size on CNI performance, which relates to the design process, as networks are usually tuned for different workload types via these settings;
Examination of how system tuning and CNI choice affect both latency-optimized and throughput-optimized workloads: for a real-time data-sharing HPC workload, a latency-optimized network is the correct configuration choice, whereas for an HPC workload that moves massive amounts of non-shared data, such as large datasets or distributed databases, a bandwidth-optimized network is a major benefit.
This paper is organized as follows: the next sections provide relevant information about related work and all background technologies covered in this paper—Kubernetes, CNIs, and associated technologies. Then, we will describe the problem, our experimental setup, and study methodology, followed by results and the related discussion. In the last sections of the paper, we will go over some future research directions and conclusions we reached while preparing this paper.
2. Related Work
Kubernetes’ CNI plugins are integral to the functioning of containerized environments, significantly affecting network performance, scalability, and resource management. Multiple studies have focused on evaluating the performance of different CNI plugins, addressing their strengths and weaknesses under varying workloads.
One study provided a comprehensive overview of Kubernetes CNI plugins, analyzing their performance and identifying critical trade-offs among them [
1]. Further performance studies focused on container networking interface plugins, assessing scalability and efficiency in Kubernetes networks across various environments [
2]. Another performance study specifically investigated Kubernetes network solutions, examining the impact of different network configurations on overall system performance, with findings that emphasized the importance of selecting the right network solutions based on workload requirements [
3]. A study from 2020 assessed the functionality, performance, and scalability of CNI plugins, demonstrating how different plugins performed under varied workloads and highlighting critical considerations for network optimization. The findings showed that specific plugins provided better throughput and scalability for larger networks, while others were more suitable for environments requiring lower latency [
4]. A comprehensive study examined the design considerations of CNI plugins, focusing on both functionality and performance. This research confirmed that various plugins exhibit significant differences in throughput, latency, and overall efficiency depending on the use case, with some plugins optimized for higher scalability while others are better suited for low-latency environments [
5]. Additionally, a detailed performance evaluation highlighted differences in CNI plugins for edge-based and containerized publish/subscribe applications, pointing to how specific plugins performed better based on the network configurations [
6]. A comparison of Kubernetes network solutions over a 10 Gbit/s network further emphasized the impact of plugin selection on overall performance, with key findings showing that different plugins optimized performance under specific use cases [
7]. Another study assessed containerized time-sensitive environments, focusing on deterministic overlay networks and how they interact with Kubernetes networking [
8]. Performance analysis of CNI-based container networks from 2018 examined the underlying factors influencing service levels, offering insights into the potential optimizations of network interfaces to enhance Kubernetes networking [
9].
In HPC networking and hardware, researchers have increasingly focused on exploring the performance implications of various networking configurations and hardware designs for HPC data centers. One study examined the evolving landscape of HPC data centers, discussing the implementation of Kubernetes and machine learning-based workload placement for optimizing performance and dynamic environment evaluation [
10]. Another study conducted a simulation analysis to assess hardware parameters for future GPU-based HPC platforms, demonstrating that optimized hardware configurations can significantly enhance computational efficiency [
11]. Additionally, research has been conducted on using WebAssembly in HPC environments, with findings showing potential improvements in the speed and portability of applications [
12]. The design of heterogeneous networks for HPC systems has also been explored, focusing on developing a unified reconfigurable router core to improve the performance of complex HPC networks [
13]. Furthermore, another investigation into improving network locality in data center and HPC networks using higher-radix switches with co-packaged optics from 2022 demonstrated potential gains in network performance and efficiency [
14]. One study specifically targeted the characterization of resource management in Kubernetes-based environments, providing valuable insights into how Kubernetes can enhance or limit application performance due to overhead [
15].
Hardware and software acceleration within Kubernetes has also emerged as a crucial area of research, with several studies examining how different strategies can be employed to enhance the performance of Kubernetes networks. For instance, a study explored the seamless integration of hardware-accelerated networking within Kubernetes, revealing substantial gains in network performance when specialized hardware was leveraged [
16]. Similarly, research investigating the offloading of networking tasks to Data Processing Units (DPUs) demonstrated that this approach can significantly enhance container networking efficiency, especially in high-demand environments [
17]. Additional research delved into improving performance in cloud-native environments by integrating Multus CNI with DPDK (Data Plane Development Kit,
https://www.dpdk.org/ (accessed on 6 October 2024)) applications, showing promising results in reducing network latency and improving throughput [
18]. The exploration of SR-IOV (Single Root I/O Virtualization) technology in Kubernetes-based 5G core networks further demonstrated the importance of hardware acceleration in meeting the demands of modern, high-performance networks [
19]. Finally, a study from 2020 focused on the design and integration of dedicated networks with container interfaces, presenting a novel approach to improving cloud computing platforms using Kubernetes [
20].
3. Technology Overview
This section describes all the CNIs and technologies related to our paper: Antrea, Flannel, Cilium, Calico, Kubernetes, and tuned, a system daemon that can be used for performance tuning via simple configuration. These four CNIs were selected as the most widely used CNIs in regular Kubernetes environments. There are others, such as Weave Net, which is deprecated; Kube-Router, which is not as widely used; and Multus, whose most common use case is network high availability for pods, i.e., specialized networking rather than regular workloads.
3.1. Antrea
Antrea is a CNI plugin developed explicitly for Kubernetes to improve cluster network performance and security. It utilizes Open vSwitch (OVS) to enforce network policies and achieve high performance by efficiently processing packets. Antrea supports sophisticated functionalities such as network policy, traffic engineering, and observability. These characteristics are crucial for effectively managing intricate network needs in cloud-native systems. To assess their performance and resource utilization, researchers evaluated Antrea and other CNIs like Flannel, Calico, and Cilium [
2]. The study emphasized that Kube-Router had the best throughput. However, Antrea’s architecture, which seamlessly integrates with Kubernetes networking models, makes it a reliable option for environments that need scalable and secure network solutions. In addition, Qi et al. [
4] observed that Antrea’s utilization of OVS offers adaptability and effectiveness in overseeing network policies and traffic streams, which is essential for contemporary microservice architectures. The plugin’s emphasis on utilizing hardware offloading capabilities boosts its performance, making it well-suited for applications that require fast data throughput and minimal delay. Other researchers also highlighted the significance of implementing security measures with minimal impact on performance in Kubernetes systems. Antrea’s features are well-suited to meet these criteria [
21]. The advantages of adaptable data plane implementations, which play a crucial role in Antrea’s architecture, have also been researched [
22].
3.2. Flannel
Flannel is a straightforward and configurable Kubernetes CNI plugin developed to establish a flat network structure within Kubernetes clusters. The primary method employed is VXLAN (Virtual eXtensible Local Area Network) encapsulation to oversee overlay networks, guaranteeing smooth communication between pods on various nodes. Novianti and Basuki [
2] conducted a performance evaluation study on Flannel, which showed significant throughput, although it was not the highest compared to other CNIs such as Kube-Router. Flannel’s design is characterized by its simplicity, which has advantages and disadvantages. On the positive side, it allows for effortless deployment and requires minimal configuration. On the other hand, it lacks additional functionality, such as network policies and traffic segmentation, which other CNIs offer. Nevertheless, Flannel remains a favored option for small to medium-sized clusters that prioritize simplicity and ease of setup and operation. Qi, Kulkarni, and Ramakrishnan [
4] emphasized that Flannel’s performance can be enhanced with practical tuning, especially in network-intensive workloads. Zeng et al. [
23] discovered that while not superior, Flannel’s performance offers a harmonious blend of simplicity and functionality, rendering it appropriate for numerous typical Kubernetes deployments.
3.3. Cilium
Cilium is a robust CNI plugin for Kubernetes that uses eBPF (extended Berkeley Packet Filter) to deliver efficient and secure networking. This system is precisely engineered to manage intricate network policies and offer comprehensive insight into network traffic, essential for security and performance monitoring. A performance analysis that incorporated Cilium was conducted in 2021; its results indicate that while Cilium did not achieve the highest throughput, its range of features and strong security capabilities make it the best option for situations with strict security needs. In their study, Qi, Kulkarni, and Ramakrishnan [
4] highlighted the smooth integration of Cilium with Kubernetes, along with its advanced features such as network policy enforcement and observability. Cilium utilizes eBPF to manage network traffic efficiently with little additional overhead, rendering it well-suited for high-performance applications. Moreover, Cilium offers advanced security capabilities such as transparent encryption and fine-grained network controls, which benefit organizations seeking to protect their microservice systems. Budigiri et al. [
21] corroborated equivalent results, suggesting that eBPF-based solutions such as Cilium can offer robust security measures without sacrificing performance.
3.4. Calico
Calico is a CNI plugin recognized for its robust network policy enforcement and ability to scale effectively. It offers Layer 3 networking and network policies for Kubernetes, providing both encapsulated and unencapsulated networking options. In their performance benchmark, researchers [
2] examined Calico and emphasized its well-balanced performance in terms of throughput and resource utilization; other work examined Calico’s versatility in accommodating different networking contexts and its efficacy in enforcing security policies [
4]. Calico’s architectural design enables it to expand horizontally, effectively managing a substantial volume of network policies without a notable performance decline. Calico’s versatility stems from its integration with Kubernetes network policies and its ability to function in both cloud and on-premises environments. In addition, Calico’s incorporation of BGP (Border Gateway Protocol) allows for the utilization of advanced routing features, which can be essential in specific deployment situations. Zeng et al. [
23] discovered that Calico performs better than other assessed CNIs, making it an excellent choice for high-performance network situations. Other researchers also highlighted the potential of CNIs such as Calico to improve performance in challenging situations by utilizing hardware offloading techniques [
17].
These four notable Kubernetes networking solutions are specifically designed to cater to distinct networking requirements in containerized systems. Although they share a common goal of managing pod networking, their architectures, operations, and system sensitivities differ. Let us briefly discuss their architectural similarities, differences, and other factors that might impact their performance.
3.5. Used CNI Plugins Summary, Similarities, and Differences
Each of these networking systems has its advantages and disadvantages. Flannel is well-suited for basic, resource-limited settings, but its encapsulation may cause delays. Calico provides enhanced functionality and superior performance in extensive implementations, while it is more susceptible to network infrastructure limitations. Cilium utilizes eBPF to deliver efficient and fast networking. However, it may necessitate more up-to-date Linux systems and increased CPU (Central Processing Unit) power for intricate functionalities. Antrea, based on Open vSwitch, offers versatile and sophisticated networking features but may consume significant resources in situations with excessive network traffic. Their similarities and differences are summarized in
Table 1:
The next topic we need to cover is tuned—a system tuning daemon with configurable and built-in profiles that change the system’s performance and, as a result, the network subsystem’s performance.
3.6. Tuned, Its Default, and Custom Profiles
Tuned is a dynamic, adaptive system tuning tool that can enhance the performance of Linux systems by applying different tuning profiles according to the current system load and activity. It is especially beneficial in settings where the emphasis is on maximizing performance and optimizing resources.
Tuned functions by implementing a series of predetermined tuning profiles that modify system configurations to enhance performance for workloads or hardware setups. These profiles can be used for many purposes, such as optimizing overall performance, conserving power, reducing latency, or catering to specific applications like databases or virtual machines. Tuned constantly monitors the system’s activity and can switch profiles to retain the best possible performance.
All built-in tuned profiles are stored as configuration files in the /usr/lib/tuned directory. We created two custom tuned profiles, kernel_optimizations and nic_optimizations, to assess the influence of kernel and NIC optimizations on CNI performance across different MTU and packet sizes and Kubernetes CNIs. The first profile sets various kernel parameters to improve network performance, based on our experimentation [
24]. The second profile follows the same principle but sets various network adapter parameters to improve network performance [
25]. The “Experimental Setup and Study Methodology” section will explain the parameters used by our two custom profiles.
The tuned daemon is a crucial element of this tuning, as it operates in the background and applies the appropriate performance tuning parameters. These parameters can encompass modifications to kernel settings, CPU governor settings, disk I/O scheduler setups, and network stack optimizations. As an illustration, the latency-performance profile configures the CPU governor to performance mode to minimize latency, while the powersave profile uses the powersave governor to reduce power usage. We used this capability in our built-in and custom profile performance measurements. There are multiple ways of performing tuning via tuned, such as merging settings from pre-existing profiles or creating new ones. This flexibility is especially beneficial in contexts with distinct performance requirements that are only partially addressed by the preset profiles, making tuned a convenient way to deliver additional performance across an arbitrary number of systems, for example, the nodes of an HPC cluster.
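To make the mechanism concrete, the listing below is a minimal sketch (not a verbatim copy of the profiles used in this paper) of how a custom tuned profile is created and activated. Built-in profiles live in /usr/lib/tuned, while custom profiles are typically placed under /etc/tuned; the single sysctl value shown is only an example, and the full parameter set of our two custom profiles is described in the "Experimental Setup and Study Methodology" section.

# Minimal sketch: create and activate a custom tuned profile.
# The sysctl value is an example; see Section 5 for the full parameter set
# used by our kernel_optimizations and nic_optimizations profiles.
sudo mkdir -p /etc/tuned/kernel_optimizations
sudo tee /etc/tuned/kernel_optimizations/tuned.conf >/dev/null <<'EOF'
[main]
summary=Custom kernel-level network tuning for CNI evaluation

[sysctl]
net.core.rmem_max=16777216
net.core.wmem_max=16777216
EOF
sudo tuned-adm profile kernel_optimizations   # apply the custom profile
tuned-adm active                              # confirm which profile is active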
3.7. Kubernetes
Kubernetes, a resilient and adaptable platform, has transformed container application administration and orchestration. It is designed to simplify deploying, scaling, and managing applications by abstracting infrastructure complexities and enabling effective resource management. Its exceptional stability, scalability, and fault tolerance have earned it widespread recognition, making it the primary choice for managing containerized workloads in cloud-native settings. Brewer (2015) states that Kubernetes enables the development of applications as collections of interconnected yet autonomous services, making it easier to evolve and scale cloud-native apps [
26]. In addition, the data center deployments of Kubernetes have gained significant popularity because of their robust capabilities, including self-healing and scaling. Haja et al. (2019) developed a specialized Kubernetes scheduler to handle delay limitations and ensure reliable performance at the edge [
27].
Kubernetes has demonstrated its indispensability as a tool for optimizing infrastructure. Haragi et al. (2021) performed a performance investigation that compared the deployment of AWS EC2 with local Kubernetes deployments using MiniKube. Their analysis highlighted the efficacy of Kubernetes in optimizing cloud resources, minimizing over-provisioning, and improving performance [
28]. In a study conducted by Telenyk et al. (2021), a comparison was made between Kubernetes and its lightweight alternatives, Micro Kubernetes and K3S. The researchers found that while the original Kubernetes performs better in most aspects, the lightweight variants demonstrate superior resource efficiency in limited contexts [
29].
Another piece of research from 2018 highlights the scalability and flexibility of Kubernetes, emphasizing the transition from virtual machines to container-based services that Kubernetes and Istio manage. This facilitates the development of applications with increased simplicity, allowing for the flexible adjustment and control of services [
30]. In addition, Kim et al. (2021) investigated the resource management capabilities of Kubernetes, emphasizing the significance of performance isolation in guaranteeing service quality. Their research indicates that the decrease in container performance is frequently caused by competition for CPU resources rather than limitations in network bandwidth. This highlights the importance of implementing better resource management strategies [
31].
It is worth noting that Kubernetes is effectively used to automate software production environments. In a recent study conducted in 2021, the utilization of Kubernetes clusters through Kops and Excel on AWS was evaluated. The study unequivocally demonstrated that the platform efficiently meets the demands of a production environment [
32]. Furthermore, researchers delved into database scaling in Kubernetes, proposing an approach for automatically scaling PostgreSQL databases to overcome synchronization and scalability issues, thereby ensuring a high level of availability [
33].
4. Problem Statement
In the past five years, it has become common practice to ship workloads as containers using Docker, Podman, and Kubernetes. This paper does not cover containerization itself, but it addresses a closely related part of the problem: how to make those applications perform well from the networking standpoint. This is why selecting the correct CNI for any given workload is crucial. From the network performance standpoint, there are multiple types of applications: some are latency-sensitive, others are bandwidth-sensitive, and some are sensitive to both.
Without a thorough and proper evaluation, it is impossible to determine the most suitable CNI for a given type of workload.
The fundamental issue with CNIs is that they are not particularly efficient and do not scale well, especially on non-offloaded interfaces (as opposed to FPGA-, DPDK-, or similarly offloaded ones). Even if we opt for offload techniques, making offloaded interfaces work can be challenging. Furthermore, the performance of different CNIs varies vastly; one CNI frequently has 2x–3x the performance of another. These facts alone make it worthwhile to develop an automated methodology for evaluating CNIs, especially for larger HPC environments. This is why we created a new methodology for automating and orchestrating the performance evaluation of CNIs [
34].
Our previous research pointed us in this direction, which then turned into research on the overhead of container-based infrastructure managed by Kubernetes for HPC use cases. This paper meets these challenges head-on and tries to help understand both the absolute performance characteristics of CNIs and their performance relative to physical networking, measured via a repeatable methodology. This methodology relies on multiple automated procedures:
automated CNI deployments via Helm charts (see the sketch after this list);
automated node configuration so that all nodes have the same runtime configuration;
automated performance evaluation and storing of performance data;
automated cleanup process so that another CNI can be deployed on the same set of HPC nodes for evaluation.
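As an illustration of the first and last of these procedures, the deploy/cleanup cycle for a single CNI can be driven as follows. The Helm repository URL, chart name, and namespace shown are the upstream Cilium defaults and are assumptions for illustration only; the exact charts and values wired into our Ansible plays may differ.

# Hedged sketch of one deploy/evaluate/cleanup cycle (upstream Cilium chart
# shown as an example; the Ansible plays in our repository may use different
# charts or values).
helm repo add cilium https://helm.cilium.io/
helm repo update
helm install cilium cilium/cilium --namespace kube-system --version 1.15.6
# ... run the bandwidth and latency evaluations ...
helm uninstall cilium --namespace kube-system   # clean up before the next CNI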
These automated procedures help ensure repeatability, so performance evaluations can be performed before creating a production-ready Kubernetes cluster for HPC. As a result, we obtain a large amount of performance data that can then be used to measure the maximum bandwidth and minimum latency per HPC node and how these values change with packet and MTU size. This will also be important for future research directions, as discussed in the “Future Works” section.
Due to CNI overheads, the design of HPC data centers will need to evolve if Kubernetes is to be used for managing HPC workloads. Specifically, these overheads must be factored into the design to achieve and maintain a specific performance envelope with QoS. While this is a broader topic than what we cover in this paper, it is a fundamental issue that cannot be ignored. This paper can be a precursor to a more in-depth discussion. It is essential to acknowledge this challenge and the need for further research.
5. Experimental Setup and Study Methodology
Our setup consisted of multiple HPE (Houston, TX, USA) ProLiant DL380 Gen10 x86 servers, with network interfaces ranging from 1 Gbit/s to 10 Gbit/s in different configurations. Specifically, for 10 Gigabit tests, we used a set of Intel (Santa Clara, CA, USA) X550-T2 RJ45-based adapters and Mellanox (Sunnyvale, CA, USA) ConnectX-3 adapters for fiber-based evaluations. On the software side, we used Ubuntu 24.04, Ansible 2.17.1, Kubernetes 1.30.2, iperf 3.16, and netperf 2.7.0. On the CNI side, we used Antrea 2.0.0, Cilium 1.15.6, Flannel 0.25.4, and Calico 3.20.0. We also used a combination of copper and optical connections, various tuned optimization profiles, and simulated packet sizes to check for the influence of a given tuned profile on real-life performance. From a performance standpoint, we measured bandwidth and latency, which are essential for HPC workloads, as HPC workloads are known to be sensitive to either one or both simultaneously. We also measured the performance of a physical system so that we could see the actual amount of overhead on these performance metrics. This is where we realized we would face many challenges with manual testing, so we researched other avenues to make testing more efficient and reproducible. Manual measurement would be less efficient and require a substantial time commitment. Establishing the testing infrastructure, adjusting the CNIs, executing the tests, and documenting the outcomes manually can be highly time-consuming and labor-intensive, particularly in extensive or intricate systems. Not only does this impede the review process, but it also heightens the probability of errors and inconsistencies.
Our automated methodology [
34] guarantees that every test is conducted under identical conditions on each occasion, resulting in a high degree of consistency and reproducibility that is challenging to attain manually. This results in increased reliability and comparability of outcomes, which is crucial for making well-informed judgments regarding the most suitable CNI for given workloads. Moreover, automation significantly decreases the time and exertion needed to conduct these tests. Tools like Ansible can optimize the deployment, configuration, and testing procedures, facilitating quicker iterations and more frequent testing cycles. This enhanced efficiency enables a more flexible evaluation procedure, making including new CNIs and updates easier.
The general methodology workflow is as follows (a minimal command-level sketch follows the list):
Set up Kubernetes—On our GitHub page [
35], a directory called “plays” contains a sequence of yaml files that can be used to deploy Kubernetes via Ansible in a completely orchestrated way; they need to be executed in index number order (from 1 to 9);
Work out which performance evaluations are to be performed (1 Gigabit, 10 Gigabit, 10 Gigabit-LAG, 10 Gigabit fiber-to-fiber, local network test) and open the corresponding directory;
Configure the necessary network interface name (the performance evaluation source network interface) and IP address information (the performance evaluation source IP address) in the file test.yaml;
In the same directory, use the file called vars.yaml to select which MTU values, packet sizes, and protocols (TCP/UDP) will be evaluated;
Configure the necessary hostname in the same file (destination hostname);
Go to the file run_all_tests.sh and modify the required variables (the list of tuned profiles to be used, the list of network configurations to be used, and whether Kubernetes CNI-based evaluations or physical, operating system-based evaluations are being run);
In the same shell script, select the destination server IP address;
Go to the file ansible.cfg and select the Ansible inventory file location, as well as the remote username; if needed, exchange the SSH keys with remote hosts;
Start the shell script.
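Putting the steps above together, a full invocation could look roughly like the sketch below. Only the plays directory, the index ordering, and run_all_tests.sh come from the repository description above; the inventory path and play filenames are placeholders.

# Hypothetical end-to-end invocation of the workflow described above
# (the inventory path is a placeholder; play filenames follow the 1-9 index
# ordering mentioned in the first step).
for play in plays/[1-9]*.yaml; do
  ansible-playbook -i inventory "$play"   # deploy Kubernetes step by step
done
# edit test.yaml, vars.yaml, run_all_tests.sh, and ansible.cfg as described above
./run_all_tests.sh                        # run all tuned-profile/network combinations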
Furthermore, manual testing can be demanding on resources, frequently necessitating specialized expertise and substantial human resources to conduct efficiently. On the other hand, an automated and coordinated testing framework allows for a more inclusive evaluation process, making it easier for a broader range of people to participate, including smaller enterprises and individual engineers. Offering a concise and user-friendly method for assessing CNI enhances transparency and fosters optimal network configuration and optimization techniques.
In our pursuit of improving computing platforms, we often neglect to step back and check whether existing solutions would allow us to optimize our current environments. This paper evaluates such optimization options and how they affect the network performance of typical computing platforms.
Before we dig into the performance evaluations and their results, let us establish basic terminology. For this paper, we introduce three basic terms: test configuration, test optimization, and test run. A test optimization refers to a set of optimization changes made to both the server and client side of testing, carried out using a tuned profile. A test configuration refers to the networking setup being evaluated. A test run, therefore, combines a test optimization, a test configuration, and test settings such as MTU, packet size, and test duration. Firstly, let us define our test optimizations.
The first test optimization is not an optimization; we must have a baseline starting point. The first optimization level (tuned profile) is called “default_settings”, and as the name would suggest, we do not make any optimization changes during this test. This will allow us to establish a baseline performance to be used as a reference for other optimizations.
The second optimization is called “kernel_optimizations”. At this optimization level, we introduce various kernel-level optimizations to the system. The Linux kernel can be a significant bottleneck for network performance if not configured properly, starting with the send and receive buffers. The default value for the maximum receive and send buffer sizes for all protocols, net.core.rmem_max and net.core.wmem_max, is 212,992 bytes (approximately 208 KB), which can significantly limit performance, especially in high-latency or high-bandwidth environments. We increased these values to 16,777,216 (16 MB). The second kernel-level optimization is TCP window scaling, which we explicitly enable so that the TCP receive window can grow beyond 64 KB via a scaling factor. Further, we disable TCP Selective Acknowledgement (TCP SACK). Disabling TCP SACK can improve performance by slightly reducing CPU overhead, but it should be used carefully, as it can significantly degrade performance on lossy networks. The last kernel-level optimization we introduce is increasing the SYN queue size from the default of 128 to 4096 entries. This setting might not affect our testing environment but can significantly improve performance in environments where a single server manages many connections.
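For reference, a minimal sketch of the kernel-level settings described above, expressed as ad hoc sysctl commands; in the actual kernel_optimizations profile, they are applied through tuned, and the SYN queue parameter name is our assumption of the key being referred to.

# Kernel-level settings from the kernel_optimizations profile, sketched as
# sysctl commands (the SYN queue key name is assumed).
sysctl -w net.core.rmem_max=16777216        # maximum receive buffer for all protocols
sysctl -w net.core.wmem_max=16777216        # maximum send buffer for all protocols
sysctl -w net.ipv4.tcp_window_scaling=1     # allow the TCP receive window to scale
sysctl -w net.ipv4.tcp_sack=0               # disable TCP selective acknowledgements
sysctl -w net.ipv4.tcp_max_syn_backlog=4096 # enlarge the SYN queue from its default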
The third optimization is called “nic_optimizations”. At this optimization level, we introduce several network interface-level optimizations. The first important setting is enabling Generic Receive Offload (GRO). GRO coalesces multiple incoming packets into a larger one before passing them to the networking stack, improving performance by reducing the number of packets the CPU needs to process. The second important network interface-level optimization is enabling TCP Segmentation Offload (TSO). By default, the CPU manages the segmentation of network packets; by enabling this option, we offload that work to the network interface card and reduce the work required from the CPU. This setting could significantly improve performance in the parts of our tests that force segmentation. The last network interface-level change we make is to increase the receive and transmit ring buffers from the default value of 256 to 4096 entries. We do not expect this setting to significantly impact the test results, as it will probably only affect performance during bursts of network traffic.
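Similarly, the NIC-level settings described above map to the following ethtool commands; eth0 is a placeholder interface name, and the nic_optimizations profile applies equivalent settings to the interfaces under test.

# NIC-level settings from the nic_optimizations profile, sketched as ethtool
# commands ("eth0" is a placeholder for the interface under test).
ethtool -K eth0 gro on              # Generic Receive Offload
ethtool -K eth0 tso on              # TCP Segmentation Offload
ethtool -G eth0 rx 4096 tx 4096     # enlarge receive/transmit ring buffers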
The kernel_optimizations and nic_optimizations profiles are custom optimization levels developed for this paper. The rest are built-in tuned profiles available on Linux distributions. Specifically, accelerator-performance, hpc-compute, latency-performance, network-latency, and network-throughput are built-in tuned profiles used in their corresponding evaluations, as denoted in the figures in this paper’s “Results” section. We deliberately chose these profiles as they represent the most used tuned optimizations when running bandwidth- or latency-sensitive applications—precisely the two metrics measured by our evaluations.
Let us now look at different test configurations used during our testing. We evaluated five network setups: 1-gigabit, 10-gigabit, 10-gigabit-lag, fiber-to-fiber, and local. 1-gigabit, as the name would suggest, is a test conducted over a 1 Gbit link between the server and the client. The same applies to 10-gigabit. 10-gigabit-lag is a test conducted over the 10 Gbit links aggregated into a link aggregation group. Fiber-to-fiber is a test conducted over a fiber-optic connection between the server and client. Finally, local is a local test where the server and the client are on the same physical machine.
Alongside these settings and configuration options, each test run combines two different MTU values, 1500 and 9000 bytes, and five different packet sizes: 64, 512, 1472, 9000, and 15,000 bytes. Finally, all tests were conducted using the TCP and UDP protocols separately. We approached these performance evaluations from the standpoint of multiple MTU and packet sizes because various workloads prefer various MTU/packet sizes. For example, large, serialized network data streams such as video usually benefit from larger MTU/packet sizes, because the serialization cost on modern switches is no longer significant and the header overhead becomes a smaller fraction of the overall packet size.
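For concreteness, one such MTU/packet-size/protocol combination can be measured with the tools listed at the beginning of this section roughly as follows. The interface name, server address, and test duration are placeholders, and the exact flags used by the automated harness may differ.

# One test-run combination, sketched with iperf3 and netperf
# (eth0, SERVER_IP, and the 30 s duration are placeholders).
ip link set dev eth0 mtu 9000                       # MTU under test
iperf3 -c "$SERVER_IP" -t 30 -l 9000                # TCP bandwidth, 9000-byte messages
iperf3 -c "$SERVER_IP" -t 30 -u -b 0 -l 9000        # UDP bandwidth, unthrottled
netperf -H "$SERVER_IP" -t TCP_RR -- -r 9000,9000   # TCP request/response latency
netperf -H "$SERVER_IP" -t UDP_RR -- -r 9000,9000   # UDP request/response latency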
We also observed another metric: per-node CPU overhead per CNI. Our findings confirm what was known from previous work [
1,
9,
36]: Calico and Flannel use the least CPU resources (in our observations, less than 3%), followed closely by Antrea at approximately 5% in the latest version used for this paper. Cilium has the highest compute overhead, reaching 7%. Antrea is being developed quickly and supports many platforms (x86, ARM, Linux, and Windows), which might be helpful in specific scenarios.
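The CPU overhead figures above can be reproduced, for example, with the Kubernetes metrics pipeline. The commands below are a hedged sketch that assumes metrics-server is installed; the label selector is the upstream Cilium agent label, used here only as an example.

# Observing per-node CNI CPU overhead (assumes metrics-server; the label
# selector shown is the upstream Cilium agent label, as an example).
kubectl top nodes                                   # whole-node CPU utilization
kubectl top pods -n kube-system -l k8s-app=cilium   # CPU used by the CNI agent pods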
As this paper’s primary intended use case relates to HPC-based environments, we selected bandwidth and latency as performance metrics to be analyzed across multiple Kubernetes CNIs, tuned performance profiles, MTU sizes, and different Ethernet standards/interfaces. Let us now examine the results to reach some conclusions.
6. Results
The results analysis is split into four parts: TCP bandwidth, TCP latency, UDP bandwidth, and UDP latency efficiency comparison. This will give us a good set of baselines to understand the results from multiple angles and the level of performance drop compared to an environment that does not use Kubernetes or CNIs. The results presented in the paper will be related to 10 Gigabit network connections. Figures given in
Section 6.1,
Section 6.2,
Section 6.3 and
Section 6.4 are all presented as visual information uniformly—every figure is compared to the non-CNI physical network configuration. The performance of the non-CNI physical network scenario is represented by the value 0 on the Y-axis of every figure. Every bandwidth value in the negative Y-axis direction represents lower bandwidth by the Y number of gigabits per second. Similarly, every latency value in the positive Y-axis direction represents latency added to the measured physical latency.
The color scheme is the same for all the performance charts in the following sections, both for TCP and UDP performance efficiency charts (
Figure 1):
Let us now check the bandwidth and latency efficiency testing results for these four CNIs. We will compare TCP bandwidth and latency efficiency, followed by UDP.
6.1. TCP Bandwidth Efficiency Comparison with the Performance of Physical Network without CNIs
The next set of results concerns the bandwidth comparison between CNI and non-CNI scenarios. We designed the graph layout to make the performance drop between these two scenarios easier to read. We will not discuss smaller MTUs in detail, as performance in those scenarios is, at worst, 10% lower than the physical network for all CNIs, which is perfectly acceptable.
The first outlier is the Calico CNI in the default performance profile. Its performance is approximately 2 Gbit/s better than all other CNIs for larger packet sizes, as we can see in
Figure 2:
The second outlier is our custom kernel_optimizations profile. Calico’s performance improves markedly as the MTU and packet size grow. At packet sizes of 9000 and 15,000 bytes, Calico is, on average, more than 2 Gbit/s faster than the other CNIs and comes within 2–2.5 Gbit/s of the physical network performance in this tuned profile, as we can see in
Figure 3:
The last outlier is Antrea in the nic_optimizations profile. While all other CNIs see a substantial performance drop compared to previous scenarios at packet sizes of 9000 and 15,000 bytes, Antrea’s performance improves significantly in this profile, becoming approximately 2 Gbit/s faster than all other CNIs for those packet sizes, as can be seen in
Figure 4:
We will use the accelerator-performance profile as a baseline for the remaining evaluations, because the results of the hpc-compute, latency-performance, network-latency, and network-throughput profiles are within 1–2% of it.
Appendix A provides performance evaluations of these four CNIs (
Table A1). In the accelerator performance profile, all CNIs have 4–5 Gbit/s lower performance than the physical network for packet sizes of 9000 or more, as can be seen in
Figure 5:
The remaining four profile evaluations have the following characteristics:
The hpc-compute profile tunes the underlying node for network throughput [
37], which marginally increases TCP bandwidth when compared to the accelerator-performance profile, as we can see in
Table A1.
The latency-performance profile tunes the system for low latency, turning off power savings and locking the CPU into low C-states [
37], but evaluation results remain the same as in the hpc-compute level, as can be seen in
Table A1;
The network-latency profile, based on the latency-performance profile, disables transparent huge pages and NUMA (Non-Uniform Memory Access) balancing and tunes other network-related parameters [
37]. This marginally increases TCP bandwidth for 1472 and 9000 packet sizes but remains the same for all other packet sizes, as can be seen in
Table A1.
The network throughput profile increases kernel buffers for better performance [
37] but offers no improvement compared to previous tuned profiles, as seen in
Table A1.
Overall, the TCP bandwidth efficiency results are excellent for sub-9000 packet sizes and shockingly bad for larger ones. This further underlines the point of our methodology and our paper: selecting the correct set of network settings to suit our workloads (especially HPC) and pairing those settings with the correct CNI is not feasible without proper verification. Overall, Antrea (for most use cases) and Calico (for some use cases) seem to be the winners regarding TCP bandwidth efficiency. Calico appears to benefit from our kernel optimizations, which significantly increase its performance and efficiency in that specific tuned profile.
Let us move to the next part of our efficiency evaluation, which concerns comparing TCP latency between CNIs and regular physical network scenarios.
6.2. TCP Latency Efficiency Comparison with the Performance of Physical Network without CNIs
Unlike the bandwidth evaluations, the latency evaluations show more fluctuation from profile to profile. The only constant is Flannel, which has lower latency than any other CNI in every performance profile. The most positive outlier is the latency-performance profile, in which latency improves by 40–50% across the board compared to all other tuned profiles. Flannel still has lower latency than all other CNIs, but it remains quite a bit worse than the physical network, as can be seen in
Figure 6:
In the default tuned profile, we can see that CNIs have approximately 65 to 92 ms added latency across the most used packet sizes compared to physical network performance evaluation. This is the worst-case scenario, as can be seen in
Figure 7:
The network throughput profile yields the best middle-ground latency performance of the tested CNIs. Calico’s latency increases in the 1500 MTU scenario. Flannel still seems to be the best solution, as can be seen in
Figure 8:
The remaining five profile evaluations have the following characteristics:
The accelerator-performance tuned profile increases the CNI latency by approximately 20%, except for Flannel. Flannel shines in these evaluations with latency that is much lower than all other CNIs, as can be seen in
Table A2;
The hpc-compute profile is a step in the wrong direction as latencies increase by approximately 20% across the board, as can be seen in
Table A2.
In the kernel_optimizations profile, latency increases, especially for Calico, which sees a latency increase of more than 10%, as can be seen in
Table A2.
The network-latency profile gets close to the middle-ground levels of the network-throughput profile. Flannel still has by far the lowest latency, as can be seen in
Table A2.
In the nic_optimizations profile, latency increases almost to worst-case levels; the only fair comparison here is with the default tuned profile, as seen in
Table A2.
The latency efficiency of Kubernetes CNIs, compared to physical network latency, is even more varied than bandwidth. This is partially to be expected due to the involved nature of the TCP protocol, which loses much efficiency because of its 3-way handshakes and other built-in mechanisms. Regarding TCP latency, Flannel is the clear winner in all the tested TCP scenarios, with approximately 20% lower latency than all other CNIs.
With that in mind, we also extensively evaluated UDP bandwidth and latency. The only marked difference in our methodology for UDP is that we reduced the number of runs from five to two, as extensive prior testing showed that this makes a negligible difference to the overall evaluation scores.
6.3. UDP Bandwidth Efficiency Comparison with the Performance of Physical Network without CNIs
The following results concern the UDP bandwidth efficiency comparison between CNIs and non-CNI scenarios. Let us start with the best UDP performance for 9000+ packet size, achieved in our custom kernel optimization profile, as can be seen in
Figure 9:
The most negative outlier is our custom nic_optimizations tuned profile. In this profile, UDP performance drops significantly, to roughly 6 Gbit/s below the physical network evaluation with the same tuned profile, as shown in
Figure 10:
The remaining six profile evaluations have the following characteristics:
The hpc-compute and accelerator-performance profiles are the closest to the best performance level achieved by the kernel_optimizations profile, as shown in
Table A3;
The default and latency-performance profiles, the middle-ground UDP performance scenarios, are quite a bit worse and within a couple of percentage points of each other, as shown in
Table A3;
The network-latency and network-throughput profiles are also quite a bit worse, again within a couple of percentage points of each other, as shown in
Table A3.
Considering all these results, there is only one conclusion: when we use standard-size packets (1472 and 9000 bytes), bandwidth drops significantly from one packet size to the next. For the most common packet size (1472 bytes), the drop is around 20% compared to the physical network evaluations, while at 9000 bytes, the drop is almost universally around 40–50%. There are exceptions: Calico in the kernel_optimizations and default performance profiles yields the best results for UDP traffic with 9000 and 15,000-byte packet sizes. We can only imagine what this would mean when using HPC for streaming-type data (for example, video analysis). Regarding UDP bandwidth, with a couple of exceptions, Calico and Flannel are the best options, while Cilium is the weakest option, with Antrea performing only slightly better.
6.4. UDP Latency Efficiency Comparison with the Performance of Physical Network without CNIs
The following results concern the UDP latency efficiency comparison between CNIs and non-CNI scenarios. The best performance result is achieved by using the built-in latency performance profile, offering significantly lower latency than all other scenarios except for the hpc-compute performance profile, as shown in
Figure 11:
The hpc-compute profile gets close to the latency performance profile in terms of performance but has two outliers—Flannel and Calico at 9000 MTU. Calico’s latency drops to the lowest levels overall in this performance profile, while Flannel almost reaches 400 ms, as shown in
Figure 12:
In the network-latency profile, latency increases to a middle-ground level in these evaluations, with Antrea and Cilium approximately level and Calico’s latency increasing significantly at 9000 MTU. Flannel is still much more latency-efficient, as can be seen in
Figure 13:
The remaining five profile evaluations have the following characteristics:
The accelerator-performance and network-throughput profiles are approximately 50% worse than our best scenario (the latency-performance profile), with similar levels of performance between them, as shown in
Table A4;
The default, nic_optimizations, and kernel_optimizations profiles all have very similar performance levels that are much worse than the best-case scenario. The only outlier is Antrea in our custom kernel_optimizations profile, where latency drops by 10% or more, as shown in
Table A4.
Our UDP latency efficiency evaluation suggests that CNIs suffer from a significant latency increase across the board as we increase the MTU size, almost regardless of which tuned profile, packet size, or CNI plugin we select. Flannel seems to be the clear performance leader for UDP latency, often by a large margin of over 20%. Nevertheless, the problem remains: CNIs add significant overhead to UDP traffic.
7. Discussion
CNI performance across multiple tuned profiles, packet sizes, and MTU sizes is expected to vary significantly. When looking at their architectures and performance sensitivities (as explained in
Table 1 in this paper), we expect that:
Flannel will have some performance penalty due to encapsulation;
Calico will offer high bandwidth and low latency due to direct routing;
Cilium will have high bandwidth for standard traffic, but the L7 DPI penalty is going to be significant if used;
Antrea will perform on par with Calico and Cilium, but latency may be higher in larger or policy-rich environments.
There are significant differences between these expectations and our performance evaluations. For example, Flannel has the lowest latency, especially when tuned profiles are applied. This was entirely unexpected, as encapsulation and decapsulation introduce latency. Calico, for the most part, behaved as expected in terms of bandwidth but not in terms of latency. With no DPI on Cilium (which was unnecessary in our intended use case), Cilium behaved as expected. Antrea, the newest CNI plugin, developed by VMware, was on par with the other CNIs in most use cases, sometimes slightly faster with lower latency, sometimes slightly slower.
Overall, the bandwidth evaluations look exceptionally good at packet sizes up to 1472 bytes and shockingly bad at larger packet sizes. For latency, all results are sub-optimal. We can also call these results counter-intuitive, as it is usually the reverse: networks typically perform better when dealing with larger packet and MTU sizes. From the efficiency perspective, compared to a physical network scenario without CNIs, we still fall short in both bandwidth and latency. Moreover, we would expect UDP results to be markedly better than TCP across the board, which was not necessarily the case. We can only attribute this to overall CNI inefficiency, which means there is much room for improvement, both in how CNIs handle Kubernetes networking and in how that work gets scheduled on the OS side and on the network interfaces. We would be curious to see what these results would look like if we used FPGAs or other offloaded network adapters as the underlying physical interfaces.
If we investigate the correlation between our custom kernel_optimizations profile and CNI performance, most CNIs show the same TCP bandwidth results, apart from Calico. These kernel parameters significantly improve Calico’s TCP and UDP bandwidth at the cost of slightly higher latency. The improvement is approximately 50% in efficiency, translating to approximately 2.5 Gbit/s more throughput in TCP and UDP scenarios at 5–6 ms more latency. This also matches real-life observations: when there is a large amount of traffic, tuning for bandwidth usually means higher latency, and vice versa. In the TCP and UDP latency evaluations, the latency-performance tuned profile brings a significant 40–50% drop in latency, which is a massive improvement. Flannel seems to be the best CNI regarding latency in all scenarios.
Looking at the actual results, 40 ms of added latency in the best case and 90 ms in the worst case is a huge performance problem for any distributed application scenario, especially HPC/AI applications. The performance drop would be significant if we ran a set of HPC workloads on thousands of containers across hundreds of HPC/AI nodes with these latencies. Bandwidth-wise, CNIs have the same problem. Our performance evaluations mostly could not reach 50% link saturation in TCP or UDP scenarios with larger packet sizes, which is very surprising, especially for UDP. We did not try to cut corners with network cards, either—we are running Intel and Mellanox enterprise-grade network cards across these evaluations. Furthermore, we had no issues reaching close to 100% saturation when we performed baselining on the physical network level or with smaller packet sizes, so these results can only be attributed to CNI performance. This also shows that there is much research to be performed in this area, as improvements need to be made everywhere—CNI efficiency, data center design, etc. We will discuss potential research avenues in the next section.
Currently, scale-out scenarios will likely use Calico or Antrea, as they seem the most obvious choices. Because of its all-around performance and optimization capabilities in multiple scenarios, Calico is used by the most prominent brands on the market, including Alibaba Cloud, AWS, Azure, Digital Ocean, Git, Google, Hetzner, and many others [
38]. Antrea is architecturally the most recent CNI plugin, with a solid architectural underpinning for future development. Its architecture is based on software-defined networking (SDN) principles, which are currently prevalent in large-scale enterprise and cloud environments for scalability reasons. It also enables easy scalability across multiple sites.
One more crucial topic of discussion is the influence of MTU and packet sizes on overall performance. When analyzing our results, we concluded that regardless of which MTU size (1500 or 9000) we use, with packet sizes larger than 1472 bytes (9000 or 15,000), performance starts dropping rapidly. This indicates that all these plugins require further development to reach acceptable performance levels for larger MTUs and packet sizes.
8. Future Works
We see much potential for future research in terms of scalability across even more recent Ethernet standards (25 Gbit/s, 50 Gbit/s, 100 Gbit/s, 200 Gbit/s, etc.), as it is difficult to predict how bandwidth increases will influence pure CNI performance. Even with these newer networking standards, it remains to be seen how much an increase in raw physical bandwidth will help; brute-forcing network efficiency problems is not the way to go, and much research is needed on improving the architecture and efficiency of these plugins.
Future research must be performed on offload technologies for various network functions, especially considering the increase in bandwidth and the fact that CNIs have much overhead. FPGAs and network adapters using Intel’s DPDK are good candidates for this research. Furthermore, network adapters using offloading techniques for TCP (like TOE, TCP Offload Engine) and UDP (like UOE, UDP Offload Engine) might help alleviate a considerable portion of the measured overheads presented by this paper. Misconfiguration of any of these technologies will significantly impact CNIs’ performance in real-life scenarios.
As mentioned in
Section 4, one of the most fundamental challenges created by Kubernetes-based HPC environments is that CNIs carry a lot of overhead, which needs to be incorporated into the initial design. How this challenge is solved is a substantial area of future research. From what we learned in this paper, large-scale Kubernetes-based HPC environments will likely have to use, for example, DPDK-enabled adapters with offload engines or FPGAs, which fundamentally changes how we plan and design HPC environments. This design change also comes with a significant increase in cost.
CNI scaling across hundreds or thousands of nodes could also be a good direction for future research. The problem is that this research would have to be heavily funded, as the investment required for such an evaluation would be measured in millions of euros or dollars, which is most commonly out of reach. What could be performed instead is to take research like that presented in this paper, scale it out to a modest number of nodes (ten or twenty), and then build models that represent CNI scale-out behavior. This could be performed by feeding such scale-out data to an ML engine trained to predict performance levels for an arbitrary number of nodes.
That research direction can be further extended by correlating the performance data of Kubernetes CNIs with physical performance in terms of overhead. This can be used as a data source to train an ML engine to learn about any environment’s performance characteristics. ML/reinforcement learning could then be used to predict the performance characteristics of an environment with multiple workload types (virtual machines and containers), as several Kubernetes CNIs can be used as virtual machine network backends for KVM virtualization. This would also produce a relevant set of data that describes the CNI overhead in real-time, especially when backed by a database and, for example, Grafana, to have real-time insight into these metrics and their correlation. These processes can be used to predict and gain insight into the current infrastructure behavior and future environment design as we plan scale-out processes, as described in this paper’s “Introduction” section.
9. Conclusions
This article evaluates the performance and latency efficiency of Kubernetes container networking interfaces under different tuning profiles and tailored network configurations. The most prevalent container network interfaces, such as Antrea, Flannel, Cilium, and Calico, were assessed, emphasizing their advantages and disadvantages in managing bandwidth and latency-sensitive workloads. We employed an automated approach using Ansible to organize and conduct tests to evaluate various CNIs within Kubernetes clusters. The results demonstrate that choosing the suitable CNI according to specific workload demands is crucial for enhancing Kubernetes workload performance.
The evaluation results indicated significant TCP and UDP bandwidth variations and latency efficiency among the assessed CNIs. Antrea regularly demonstrated superior performance in TCP bandwidth for bigger packet sizes, but Flannel excelled in TCP latency, albeit still trailing behind physical network configurations. UDP performance patterns exhibited similarity, with Antrea surpassing other CNIs in bandwidth efficiency. Latency results showed considerable variation among profiles, and the kernel-tuned parameters employed with Calico markedly enhanced its performance in both TCP and UDP bandwidth assessments.
These findings highlight the necessity of choosing and calibrating CNIs according to unique workload demands, which is critical in high-performance computing environments. The automated methodology used in these evaluations enables practical CNI assessments, providing a consistent, efficient, and less error-prone way of measuring network performance. The intricacy and time commitment associated with manual performance reviews underscore the significance of automation in this scenario. Automated procedures enable researchers and practitioners to systematically deploy, evaluate, and compare various CNIs, resulting in more dependable and actionable insights into Kubernetes network performance.
Furthermore, this evaluation establishes a solid platform for future endeavors in Kubernetes-based HPC settings. The considerable overhead introduced by CNIs in Kubernetes-managed HPC workloads highlights the necessity for optimized data center design that accommodates these overheads while maintaining performance integrity. We provide significant data for predicting the performance of CNIs across various workload types, particularly in the context of scale-out operations for future infrastructure expansions.