1. Introduction
Since the late 1960s, following the ARPANET’s birth and the development of network technologies over five decades, the Internet has achieved great success. According to the report [
1], Internet users increased by 3.5 percent from October 2022, reaching 5.07 billion as we enter the year’s final quarter. One hundred and seventy-one million new users over the past 12 months have taken global Internet penetration to 63.5%. Global mobile users have reached 5.48 billion, with smartphones accounting for almost four in five of the mobile handsets in use today, with 68.6% of all the people on Earth now using some form of mobile phone. The scale of global Internet users is unprecedented, the existing network traffic is characterized by encryption, and the network link has entered the era of high speed. The line rate in modern high-speed networks has reached hundreds of Gbps or multiple Tbps. Encrypted traffic accounts for 95% of the total traffic in the backbone network environment with Tbps bandwidth [
2]. The different types of networks have mushroomed in our life based on preeminent network infrastructure, such as IoT, Internet of Vehicles, 5G, cloud computing, blockchain, satellite networks, etc. However, the network forms are diversified and heterogeneous. The nodes at different levels in the end-edge-cloud architecture on the Internet are heterogeneous regarding function, performance, and data coverage. Thus, the security of cyberspace depends more on the traffic measurement of high-speed networks and encrypted traffic analysis and classification.
At this stage, the analysis, classification, and measurement of encrypted traffic rely more on the collection and analysis of massive network traffic data, and establishing more extensive and comprehensive encrypted traffic sample datasets is one of the long-term and primary goals of such work. The collection and analysis of massive network traffic data cannot happen without the support of high-speed network traffic measurement technology. Generally, the network traffic on the high-speed link is viewed as a set of flows. For traffic measurement, complete packet processing must occur in a few nanoseconds. For example, in a 100 Gbp Ethernet link, to process packets with an average size of 64 bytes and continuous arrival, the average processing time is about 4.768 ns. Thus, the measurement of network traffic has more technical challenges, such as using extremely high computing resources and storage resources, because of the continuous improvement of network link rate and the sharp increase of network data flows.
The flow is composed of the packets carrying the same FlowIDs such as source IP, destination IP, or others. We can define the size of flow as the number of packets in each flow. The network traffic conforms to the heavy-tailed distribution model in the real world: 80% of the network traffic consists of 20% of the flow, and the remaining 80% of the flow only constitutes 20% of the network traffic. In other words, compared with mouse flows (e.g., the number of packets less than 200), the elephant flows (e.g., the number of packets more than 10,000) can signify the real characteristics of the network. To provide high-quality operation services (such as QoS management and QoE improvement [
3]), detect anomalies in the network, control network congestion, or detect DDoS attacks, the different statistics and measurements of flows should be collected and analyzed based on high-speed network traffic measurement technology by the Internet service providers (ISPs) or the department of cyberspace security supervisions.
Thus, finding top-
k elephant flows and aggregating information about flow distributions can help us to achieve the above requirements. The significance of finding top-
k flows is shown in
Figure 1. In addition, the data flow has the characteristics of real time, continuity, and boundedness, which requires that the algorithm for processing the data flow can only perform calculations on the network flow once and can only use limited computing resources and memory resources. It also needs to ensure accuracy and timely awareness of large-scale data flows. Thus, finding top-
k flows is challenging, especially with limited resources.
High-speed network traffic measurement technology can mainly solve this problem, a challenging research field. For traffic measurement in the high-speed network environment, there are generally three solutions [
4]: (1) based on high-performance dedicated hardware (such as TCAM, ASIC), but high-performance hardware equipment is costly; (2) based on sampling technology; (3) based on data flow technology and method. The advantages and disadvantages of different solutions are shown in
Figure 2.
The top-
k elephant flow identification method based on sampling extracts some representative packets and then uses the probability theory to calculate the characteristics of the overall network traffic. The most typical method is the Sampled NetFlow [
5] proposed and used by Cisco. In high-speed network links, the processing speed of hardware cannot match the enormous data flow, so the
sampling method is used to relieve the pressure of elephant flow identification caused by high-speed network links. In the router, the flow memory will create a new flow record based on the five-tuples of each arriving packet. Whenever a new flow is drawn into the router, a new flow record will be created in the flow memory, and a quintuple will be extracted as the unique identification of the flow. Different from NetFlow, the sample and hold [
6] method first queries whether the currently arrived packet exists in the flow memory and, if so, updates the flow record; otherwise, the data packet is sampled with probability p, and a piece of new flow information is created in the flow memory after being sampled. However, the accuracy of these two elephant flow recognition methods is coarse granularity, depending on the flow memory size. They are unsuitable for elephant flow recognition in high-speed networks.
Based on data flow technology, the top-
k elephant flow identification method has two basic strategies:
count-all and
admit-all-count-some [
7]. The similarity between the two strategies is that they handle all traffic in the network rather than sampling. Therefore, the error of these two methods is smaller than the sampling technique; however, the two strategies are different in the process of handling traffic. The
admit-all-count-some strategy handles all traffic but does not store all traffic. The main idea is to recognize all new flows and, at the same time, remove the smallest flow from the cache. Examples are frequent [
8], efficient counting [
9], lossy counting [
10], space-saving [
11], and CSS [
12]. They can effectively eliminate the flow by introducing some constraint rules and controlling the error of top-
k flow measurement within a certain range. The
count-all strategy is based on sketches such as count–min sketch [
13] and count sketch [
14] to measure the size of all flows. It stores and counts all traffic. It is necessary to use sketches to summarize and count data flows to reduce the demand for memory space and resources on the premise of storing all traffic.
Furthermore, the admit-all-count-some strategy assumes that every new incoming flow is an elephant flow and expels the smallest one, in summary, to make room for the new one. Nevertheless, most flows are mouse flows. Such an assumption causes a significant error, especially under tight memory. The count-all strategy needs a large sketch to count all flows. These solutions could not be more memory-efficient. Additionally, it needs more memory to scan the entire counter and sort elements to answer the top-k query. On the other hand, because the flow sizes are unknown a priori, almost existing methods set a large length (such as direct use of the integers in C++ or a 20-bit counter).
However, if the hardware development resources and costs are limited, for example, they use only 1 Mbit to complete the top-k measurement task of millions of flows. Using a large counter such as existing methods will not only lead to the waste of counter space caused by counting most flows whose sizes are under 200 (the mouse flows) but also cause fewer counters, affecting measurement accuracy. Therefore, existing top-k finding solutions all share the same limitation: most counters will only record a mouse flow and waste significant on-chip memory, especially with resource constraints.
On the whole, we aim to conquer six challenges of finding top-k in a resource-constrained environment: (1) the scheme should be completed within limited on-chip resources as far as possible; (2) the algorithm that needs to process data flows can only perform calculations on network flows once; (3) the calculation logic is simple and the storage space is small; (4) to achieve measurement of millions or more flows; (5) to restore valid top-k elephant flows information; (6) measuring top-k elephant flow on the fixed and short-length counter, with satisfactory accuracy.
In this paper, we adopt a novel strategy called count-with-uth-level-sampling. Moreover, we propose a novel core component named Multi-Sampled Lightweight Counting Finder (MSLCFinder) to achieve a measurement scheme for finding top-k flows based on hardware resource-constrained environments. The MSLCFinder has three primary modules: a lightweight counting module, a multi-sampling module, and a flow label recording module. The lightweight counting module can work under extreme storage requirements and requests a few bits for counting. The multi-sampling module can sample flows by using multi-sampling probabilities generated by itself and the lightweight counting module. They complement and constrain each other to maximize the optimization of the MSLCFinder in the measurement task. The flow label recording module can promptly update and record top-k elephant flows information. Finally, with the MSLCFinder we propose, we can effectively find top-k elephant flows with less memory overhead and counters under resource constraints.
Our main contributions can be summarized as follows:
We propose a novel method for finding top-k elephant flows with little memory, reducing the waste of memory space of mouse flows. It is different from the two strategies in the previous study. It compresses the count by sampling and can accurately obtain the top-k elephant flows measurement results.
We design a uth-level sampling algorithm. It can generate multiple different sampling probabilities by initial sampling threshold , size of MSLCFinder counter, and the sampling level u. It is used to ensure that the counters of MSLCFinder can measure elephant flows more optimally before the end of the measurement task and reduce the space occupied by mouse flows.
We verify the effectiveness of the method in this paper on CAIDA public datasets and the effectiveness of the method in real network traffic.
We propose a flow recording algorithm for MSLCFinder to make up for the defect that existing top-k finding solutions generally cannot restore stream ID and other information.
The rest of the paper is structured as follows.
Section 2 describes the background of this research and reviews the related work.
Section 3 focuses on the measurement method we proposed, including the design of MSLCFinder and the algorithm of its core components. In
Section 4, the MSLCFinder is evaluated through experiments in many aspects. Finally, we summarize our research and point out future work.
3. Top-k Measurement Scheme Based on MSLCFinder
In this section, we make a systematic expound measurement algorithm based on MSLCFinder, a novel and efficient scheme for finding top-k flows in a limited resource environment. The measurement scheme overview is, understandably, described first. Then, we focus on three core module elements: lightweight counting module, uth-level multi-sampling module, and flow label recording module. They are the three core components of MSLCFinder, and they interact and restrict each other to complete the measurement task together.
3.1. Problem Definition
Finding the top-k elephant flows task is finding the k flows with the largest size with some measures of the measurement task. Typically, the network traffic is a set of N flows , where each flow is composed of a set of packets with the sameFlowID such as network quintuple. With this measure, the number of packets, i.e., of each flow, can be defined as the flow size. We let the be the real size of the flow and the is the estimated size of . Thus, finding top-k elephant flows can be described as outputtinga set of . The set Top-k with k elements can track the top-k flows with their estimated sizes.
3.2. Measurement Scheme Overview
The overall design of the measurement scheme is shown in
Figure 4. The scheme parses each message arriving by the traffic to be measured packet by packet, calculates the hash value corresponding to the data flow ID through the hash function, and is used to query, judge, and update the information of the MSLCFinder and flow sampling. The sampling probability is calculated by the given threshold and updated by the size of the counter in MSLCFinder.
At the beginning of the measurement, the hash value is calculated for the parsed FlowID, and the sampling process will use the initial sampling probability. If the FlowID is sampled, the counter at the corresponding location of the FlowID in MSLCFinder will be updated. Otherwise, the counter is not updated to reduce the probability of counting small flows in the MSLCFinder structure and reduce the pressure on on-chip resource storage.
Here, the size of the counter in the MSLCFinder is a fixed n bit. It uses a uth-level multi-sampling module to sample the flow. When the value of the MSLCFinder counter exceeds u thresholds, the sampling probability is recalculated, respectively, and a new sampling probability resamples the arriving flow. If the sampling is successful, the counter is incremented by 1; otherwise, it will be forwarded directly. When the count exceeds the last-level threshold, the flow corresponding to the location flow ID in the MSLCFinder is the top-k flow. The flow information is stored in the information storage table through the flow label recording algorithm. At the end of the measurement cycle, the entire information storage table is reported to the control end to obtain the specific information of the top-k flow.
3.3. Multi-Sampling Lightweight Counting Finder
3.3.1. Rationale
Our goal is to find top-k elephant flow with resource-constrained environments. It requires us to use less memory to complete the measurement task of more than one million flows. We aim to use a small hash table to find it, and it does not require highly accurate estimates. We are committed to using smaller length counters to be competent for the challenging task; for example, the counter length is less than 8 bits, or even smaller. Thus, before the counter increases the count, it is necessary to perform flow sampling by some regulars for the arriving traffic to determine whether the current packet can increase the counter’s value at the corresponding location in the hash table. In order to solve minimizing the error of flow sizes, we use multiple hash tables with different hash functions. The MSLCFinder has three primary modules: a lightweight counting module, a multi-sampling module, and a flow label recording module. We designed the MSLCFinder lightweight counting module based on the basic principle of sketches and smaller counter lengths. The design of the multi-sampling module will use the counts in counters and the initial sampling threshold. Based on the min-heap, we designed the flow label recording module.
3.3.2. The Lightweight Counting Module
As shown in
Figure 5, the lightweight counting module has
d hash functions, and each function space comprises
w counters. The counter of MSLCFinder is only
n bits long. The hash functions can calculate the hash value of the incoming packet as the FlowID for MSLCFinder counting. It can also help the multi-sampling module to achieve flow sampling.
Multi-Sampling: Multi-sampling is a core component module in MSLCFinder, which is responsible for sampling the arrived packets, thus telling MSLCFinder whether to add the current packet to the counter at the corresponding location. Unlike the count–min sketch, the count–min sketch uses a more extended counter to record the information of all streams and count each stream. Because our method needs to complete the top-k measurement task under resource constraints, sampling and judging the traffic before updating the counter can significantly reduce the counting pressure of the counter.
Insertion: At the beginning of the measurement task, all counters are zero. For each coming packet belonging to flow , MSLCFinder calculates the d hash functions. After mapping to the counter node, it should be sampled by sampling probability and the hash value. The hash value of will be the FlowID of flow . If is sampled, the counter will addone. Otherwise, the counter will not be processed.
Query: To query the counts of a flow , MSLCFinder reports the minimum value of d counter nodes that mapped.
Analysis: The minimum error tolerated for the measurement task is . The flow true size is , and its estimated size is . We want to set an error range (), in which N is the total number of packets counted by the module. In the optimal case, , (where e is the base of the natural logarithm, i.e., , a constant chosen to optimize the space for fixed accuracy requirements). When is smaller (i.e., w is larger), the error of element estimation is smaller; when is smaller (i.e., d is smaller), the probability of error in element estimation is smaller.
For the insertion procedure, because computing each hash function takes (constant) time, the total time to perform an update is , independent of w. Since d is typically small in practice (often less than 10), updates can be processed at high speed. The space complexity is . For the query procedure, the time complexity is , and the space complexity is .
3.3.3. The Multi-Sampling Module
Multi-sampling is a core component module in MSLCFinder, which is responsible for sampling the arrived packets, thus telling MSLCFinder whether to add the current packet to the counter at the corresponding location. Unlike the count–min sketch, the count–min sketch uses a more extended counter to record the information of all streams and count each stream. Because our method needs to complete the top-k measurement task under resource constraints, sampling and judging the traffic before updating the counter can significantly reduce the counting pressure of the counter.
The multi-sampling module has two functions: (1) to conduct flow-based sampling for consecutive arriving packets to guide MSLCFinder on whether to count this packet; (2) to calculate the flow sampling probability corresponding to the FlowID based on the current number of packets in the MSLCFinder.
The multi-sampling module can calculate the sampling probability p according to the current value in the counter and guide the MSLCFinder on whether to process or directly forward the currently arrived packets. The sampling probability p needs to obtain the count in counter and the initial threshold . Let a be the counts of the single counter. We choose uth-level sampling. The calculation method of sampling probability p is as follows:
Initial sampling probability:
When
a exceedsthe
, calculate 1st-level sampling probability:
When
a exceedsthe
, calculate
-level sampling probability:
The uth-level () sampling method can generate u probabilities for sampling. When the measurement task is over, the computational function with count a of counter can obtain estimates.
When
:
When
:
When
(
):
We can use the
a to calculate the
i by using the following method:
For example, when we use the 3rd-level sampling method to count the array named B, which has 2 million elements, let the threshold be 10,000, and the counter has 8-bit length. According to the above formula, . When the subscript of the element in array B is sampled, the value of counter is plus 1. When the array traversal is over and the , the actual estimated value is 1,995,582 according to the 3rd-level sampling method, which proves that our sampling plan is highly deterministic.
When the measurement task starts, the probability is directly used to sample each flow. If it is selected, it will be mapped to the corresponding position of the ID in the lightweight counting module, and its count will be increased by 1; if the packet is not sampled, it will be forwarded directly, and MSLCFinder will not process it. Here, we take the flow as an example. When the packet’s counts in exceed the threshold value of the second level, we start to calculate the sampling probability of the second level and continue to process the arrived packets according to , and so on until the threshold value of the u-level is exceeded. When, and only when, the count of reaches , the packets corresponding to the FlowID will not be processed anymore, and the FlowID will be regarded as a top-k elephant flow.
3.3.4. The Flow Label Recording Module
Figure 6 shows that the flow label recording module is designed to record and find top-
k elephant flows. The MSLCFinder achieves this goal based on a min-heap data structure.
Algorithm 1 shows the flow label recording algorithm. For each packet
belonging to flow
, which is counted by MSLCFinder, the module first inserts it into itself. Suppose that MSLCFinder reports the size of
as
. If
is already in the module, it will update the estimated flow size with max
, where
is the record of the size of
in the module. Alternatively, determine whether the module’s number of elements is fewer than
k of the top-
k. If the number of elements is fewer than
k of the top-
k, insert it into the module. Otherwise, judge whether the new count of the flow is greater than the current minimum in the module. If so, eliminate the minimum in the module, and insert the ID information of the new flow and its estimated count. To query top-
k flows, we simply report the
IDs of the
k flows in the module. If the measurement task wants to obtain the estimated flow size of these
k flows, we let the module record and output this information; otherwise, it is not necessary to know. The FlowID of top-
k flows is the top target of MSLCFinder.
Algorithm 1: Flow label recording algorithm. |
|
Analysis: (1) The time complexity: when a new element comes, it needs time to query or update the min-heap in the flow label recording module of MSLCFinder. Therefore, the total time complexity is . (2) The space complexity: the flow label recording module requires space.
3.4. Description of Top-k Measurement Algorithm Based on MSLCFinder
Figure 7 shows the overall algorithm process, divided into two stages: measurement and FlowID recording. Low-proportion sampling reduces the probability of mouse flows being counted by MSLCFinder. It reduces operations, saves the MSLCFinder space, and reduces the probability of hash collisions. The overall algorithm can only access the memory in four cases: no access, one read zero write, one read one write, and two read one write, which meets the requirement of this problem that one read one write is the average.
Initialization: Let MSLCFinder be , where m is the total number of counters in the MSLCFinder structure, and n is the number of bits in each counter. In addition, set the count in each counter in MSLCFinder as C, and the initial value of C is 0. Let the initial threshold of the top-k flow measurement task be . Choose a uth-level sampling method for flow sampling.
The minimum error that can be tolerated for the measurement task is
. Calculate the number
d of hash functions in MSLCFinder according to Formula (
11).
Then, calculate the initial sampling probability
with Formulas (
1) and (
2).
To more simply describe the process of this algorithm, we choose the 3rd-level sampling methods. That is, u = 3. For each incoming packet belonging to flow , these are the following two steps for each insertion:
Step 1: The will be inserted in the MSLCFinder, shown in lines 3–24 in Algorithm 2. Hash the identity of the stream corresponding to each arriving packet , and the FlowID of is the hash results . If the count C is already equal to , do not insert anything into the counter and forward the packet. Otherwise, determine the size of C and the three thresholds, and use the sampling probability corresponding to the threshold interval to perform the flow sampling. If the has been sampled, the counter will be plus one, and the flag is true, which means that the count of is updated.
Step 2: We use the MSLCFinder flow label recording module to record the FlowID of
, which is shown in lines 25–27 in Algorithm 2. We use the minimal
to record into the module. The record in the module is timely updated by the function
.
Algorithm 2: Top-k measurement algorithm based on MSLCFinder. |
|
Query top-k flows: After the measurement task, the MSLCFinder will report the information of k flows recorded in the flow label recording module, such as network quintuple.
Analysis: Our method’s time complexity and space complexity are as follows: (1) The time complexity is . When a new element comes, it needs time to query in the lightweight counting module of MSLCFinder. Then, it needs time to query or update the min-heap in the flow label recording module of MSLCFinder. Therefore, the total time complexity is . (2) The space complexity: the lightweight counting module needs a two-dimensional array, and the flow label recording module requires space, but k is usually tiny, and the space can be negligible. Thus, the total space complexity is .
4. Evaluation
In this section, we display the experimental settings, including platform, datasets, and development environment. In addition, the experimental procedure and results are demonstrated after them. The depth analysis of experimental results is also displayed.
4.1. Experimental Settings and Datasets
Platform: We use a laptop with 4-core CPUs (8 threads, Intel Core i7-4940MX @3.10 GHz) and 32 GB total system memory to complete the development and implementation of the algorithmin this paper. We also use it to complete our experiments.
Dataset: We use three datasets to perform the experiment. All information of datasets is shown in
Table 2.
(1) CAIDA-2018 [
44]: The first dataset is a CAIDA Anonymized Internet Trace from 2018, made of anonymized IP packets. It has 27,121,713 packets belonging to 1,000,551 flows.
(2) CERNET-30 and CERNET-60 [
45]: The second and third datasets comprise IP packets captured from the CERNET Backbone Node of East China and North China at different dates. For ethical reasons, we anonymized the IP information to a certain extent. The collection time of CERNET-30 is 30 minutes in the peak period, and the data size is about 35 GB, including 17,596,416 packets, about 11,941,109 flows in total. The collection time of CERNET-60 is one hour in the off-peak period, and the data size is about 46 GB, including 32,674,187 packets, with a total of about 16,082,037 flows.
As shown in
Table 3, we roughly compared the flow size distribution of the three datasets, selected the samples with the first 100 flow lengths, and analyzed the maximum, minimum, variance, and standard deviation of the flow size distribution. Our work aims to illustrate the influence of flow size distribution on our proposed method.
Implementation: The implementation of MSLCFinder is performed in C++. For the hash function, we use the improved IPSX [
15] algorithm to support different measurement task requirements. To achieve at least 95% precision, we can calculate
by Formula (
11). The total space cost is fixed as
bit
Mbit
KB, and the counter size is fixed as
bit. There are two selected measurement tasks. One is to output the source IP and destination IP of top-
k flows, and the other is to output the quintuple information of top-
k flows.
The total number of counters is determined by the total memory space (1 Mbit), the k-value of the top-k task, and the size of the measure required for the final maintenance of the top-k flow information table. To improve the versatility in different measurement tasks, we can also use half of the total memory to support the measurement function of MSLCFinder, and the other half of the memory to record and maintain the top-k flow information required in measurement tasks. However, if the available memory is abundant, it is unnecessary to provide half of the memory resources for recording and maintaining the top-k flow information.
4.2. Evaluation Metrics
Precision: Precision is defined as . means that the flows are real top-k flows.
Average relative error (ARE): ARE is defined as , where is estimated set of top-k flows, is the estimated size of flow , and is the real size of flow . ARE evaluates the error rate of the estimated flow size reported by the algorithm.
4.3. Experiments on Precision
The application scenario of our method is resource-constrained, so we use fixed-size memory and fixed-length counter in each experiment. In order to ensure the fairness of the comparative experiment, we choose to change only one specific factor rather than other factors in the experiment. We conduct the experiments for varying k and fixed or varying and fixed k on the CAIDA-2018, CERNET-30, and CERNET-60 datasets. We carry out two groups of experiments, using different measures as the measurement standard. One uses the five-tuple information as the flow ID, and the other uses the source/destination IP address (SrcIP/DstIP) pair as the flow ID.
In each group of experiments, we aim to carry out the measurement tasks of top-50, top-100, top-200, top-500, and top-1000 for three datasets, respectively, to verify the effect of this experimental scheme. The threshold is 74, 84, 94, 104, 114, 124, 174, and 200, respectively.
4.3.1. Result of Precision vs. Quintuple of Flows
In this subsection, we use the quintuple information to identify the flow
. We summarize the experimental results and show them in
Figure 8.
CAIDA-2018 with quintuple of flows: The result of the CAIDA-2018 dataset is shown in
Figure 8a,b. In the experiment, with the gradual increase of the initial sampling threshold
, the precision rate will increase and then decrease because the number of packets contained in the stream in the dataset is uneven. Secondly, with the increase of
, the sampling probability will gradually decrease, and the randomness will increase, which will bring some errors to the measurement. In the experimental results, when
∼104, the precision rate of
∼97% can be obtained, which shows that our method can achieve top-
k measurement tasks with certain requirements under resource constraints and can ensure high precision.
CERNET-30 with quintuple of flows: The result of the CERNET-30 dataset is shown in
Figure 8c,d. In the experiment, with the gradual increase of the initial sampling threshold
, the precision rate is kept at about 95% on average. Secondly, with the increase of
, the sampling probability will gradually decrease, and the randomness will increase, which will bring some errors to the measurement. In the experimental results, the measurement precision of top-50/100/200 is as expected, and when
∼114, the precision rate can be
∼98.5%, which shows that our method can achieve the top-
k measurement task with certain requirements under the condition of limited resources, and can ensure high precision. However, due to the limitation of the experimental dataset, the performance of the top-500 task was average, and the overall precision rate was 91.8%, which was related to the actual situation that the dataset was in line with the measurement threshold
.
CERNET-60 with quintuple of flows: The result of the CERNET-60 dataset is shown in
Figure 8e,f. In the experiment, with the gradual increase of the initial sampling threshold
, the precision rate is kept at about 95% on average. Secondly, with the increase of
, the sampling probability will gradually decrease, and the randomness will increase, which will bring some errors to the measurement. In the experimental results, the measurement precision of top-50/100/200/500 is excellent, and the precision rate of
∼98% can be obtained when
∼104, which shows that our method can achieve the top-
k measurement task with certain requirements under the condition of limited resources, and can ensure high precision.
4.3.2. The Precision vs. SrcIP/DstIP
In this subsection, we use the SrcIP/DstIP information to identify the flow
. We summarize the experimental results and show them in
Figure 9.
CAIDA-2018 with SrcIP/DstIP: The result of the CAIDA-2018 dataset is shown in
Figure 9a,b. The experimental effect is equivalent to that of the previous five-tuple FlowID, and the precision of top-100 and top-200 is still about 95%. In the experiment of top-500, its precision is better than the former. When
∼104, the precision rate of
∼97% can be obtained, which shows that our method can achieve top-
k measurement tasks with certain requirements under resource constraints and can ensure high precision.
CERNET-30 with SrcIP/DstIP: The result of the CERNET-30 dataset is shown in
Figure 9c,d. Due to different measures, the effect of this group of experiments is better than that of the previous group of experiments, especially in the top-500 experiment; its precision is 93%. When
∼124, the precision rate can be
∼98%, especially in the top-50/100/200 task. However, due to the limitation of the experimental dataset, the performance of the top-500 task was average, and the overall precision rate was 91.8%, which was related to the actual situation that the dataset was in line with the measurement threshold
.
CERNET-60 with SrcIP/DstIP: The result of the CERNET-60 dataset is shown in
Figure 9e,f. The measurement precision of top-50/100/200/500 is excellent, and the precision rate of
∼99% can be obtained when
∼174, which shows that our method can achieve the top-
k measurement task with certain requirements under the condition of limited resources, and can ensure high precision.
4.3.3. The ARE Results
In order to make the results in the figure more intuitive, we performed an exponential operation (
) on the ordinate axis. All the results are shown in
Figure 10. The average relative error is an important index to evaluate the simulation effect of the model. Because the method in this paper is based on the sampling strategy, which is different from the
count-all strategy, there must be errors in sampling. Through this indicator, we can evaluate the reliability of our method.
Result of ARE vs. Quintuple of Flows: As shown in
Figure 10a–c, for the CAIDA-2018 dataset in the measurement of quintuple of flow, we find that the ARE of MSLCFinder is less than 1. It shows that the difference between the predicted results obtained by our method and the precision results is slight. Although the task of top-500 made it challenging for this method to approach the limit under the condition of limited hardware resources, its ARE can still be less than 1. By the way, the performance of the CERNET-30 and CERNET-60 datasets is similar. There is a difference in the case of top-500. Due to the limited computing resources, the ARE is higher than 1. Nevertheless, the indicator of ARE is not very high. It still shows that the measurement results obtained by our method in this group of experiments are reliable.
Result of ARE vs. SrcIP/DstIP: As shown in
Figure 10d–f, for all datasets, the ARE is acceptable. Since this measure is composed of IP address pairs, from the actual proportion, the proper top-
k size is much larger than the flow size in the case of the quintuple, and the distribution is no longer extremely skewed. Therefore, although ARE is less than 1, it is not as good as the quintuple experiment. The reason for this problem is that the counter size is fixed. If the counter is complete, but the measurement task is not finished, although the flow is recorded as top-
k, there will be a large gap between the estimated value and the actual value. In future work, based on this research, we can balance the compression of counting, i.e., to improve the precision of the estimation of the number of elephant flows.
4.4. Contribution of Key Technique
In order to verify the contribution of the key technique, which is the
uth-level multi-sampling module, we additionally develop a comparison version of MSLCFinder. In this version, we remove the multi-sampling module and use only 8-bit counters to perform top-
k measurement tasks on three datasets; specifically in
Figure 11, we can intuitively see the gap and the importance of the
uth-level multi-sampling module. Even in the case of the top-100 and top-200 with excellent results in the previous experiment, the finder that loses the sampling function will misjudge the mouse flows by a large margin. As the counter size of MSLCFinder is only
n bits, it will be unable to cope with the measurement task of millions of flows without the multi-sampling module. If a bit counter with a long length is used, it will violate the original intention of the method proposed in this paper: to implement a top-
k measurement task with a resource-limited environment. It will be similar to existing methods.
4.5. Comparison with Existing Research Methods
In this subsection, we deploy the five usual methods in the existing research for the three datasets used in our previous evaluation to complete the comparative evaluations. We verify the feasibility and effectiveness of the method proposed in this paper in the previous evaluation. Notwithstanding, we prefer to verify the advantages of our proposed method based on MSLCFinder in the top-k measurement task.
4.5.1. Implementation
We choose five related methods, including HeavyKeeper [
7], lossy counting [
10], SpaceSaving [
11], CSS [
12], and count–min sketch [
13]. We use C++ to complete the implementation of these methods. We determine the number of buckets
w SpaceSaving, lossy counting, and CSS by the hardware resource (such as memory size) and the
k of top-
k. For HeavyKeeper and count–min sketch, the number of hash functions is 3, the heap size is
k, and the width of each hash space is also determined by the hardware resource (such as memory size). We also choose to use the improved IPSX [
15] algorithm for the hash function. The measure we choose is the quintuple information to identify the flow
.
In addition, we directly refer to the design and related parameters mentioned in [
7] for HeavyKeeper implementation. The fingerprint field and the counter field of HeavyKeeper are both 16 bits long. For count–min sketch in our implementation, the counter is 32 bits long, equivalent to directly using the integer type in C++.
We set the memory size to be Mbit, Mbit, and 1 Mbit, and the varying k we set to 50, 100, 200, 500, and 1000. As with the previous evaluation methods, based on the idea of controlled experiments, we select an equal size of memory and use five related methods and MSLCFinder for experiments to complete the top-k measurement tasks with different k values. Additionally, in all the experimental results, the MSLCFinder experimental results are based on the multiple selections of threshold and the optimal results obtained after multiple experiments under the premise of keeping the memory size and k unchanged.
4.5.2. Precision vs. Three Datasets Using Mbit
Figure 12 shows the precision results of the top-
k measurement experiments for three datasets using six methods when the memory size is 1/4 Mbit.
Due to the memory limitation of 1/4 Mbit, HeavyKeeper is not competent for this experiment’s top-1000 measurement task of three datasets. This is mainly due to the lack of counters that can participate in the measurement.
For the CERNET-30 dataset, our method performs well in five measurement tasks, especially in evaluating top-50 to top-200, which can achieve a precision of about 90%. Although our method is slightly less accurate than HeavyKeeper in the top-200 and top-500 experiments, the overall effect is comparable to HeavyKeeper. On the other hand, compared with the other four methods, our method has a very intuitive advantage in 1/4 Mbit memory space. For the other two datasets, our method has better advantages.
For the top-1000 measurement task, the information maintenance of 1000 stream IDs requires at least 104,000 bits, equivalent to 39.7% of the total memory in this group of experiments. Therefore, resources supporting various methods to complete the top-1000 measurement task are incredibly scarce. Nevertheless, our method can still achieve a precision of about 20%; especially in the CAIDA-2018 dataset, our method reached 30.1%. Moreover, they are better than the other four methods.
4.5.3. Precision vs. Three Datasets Using Mbit
Figure 13 shows the precision results of the top-
k measurement experiments for three datasets using six methods when the memory size is 1/2 Mbit.
Compared with the group of experiments in
Section 4.5.2, the memory of this group is doubled. Under the memory size of
Mbit, the six methods have corresponding experimental results.
In the top-50/top-100/top-200 experiments, our method well matched experimental results of HeavyKeeper for the CERNET-30 dataset. Although our method is slightly inferior to HeavyKeeper in the top-200 experiment, the gap is tiny. It is superior to the other four methods and has better results than HeavyKeeper in CERNET-60 and CAIDA-2018 datasets. Moreover, our method achieved more than 95% precision in the three datasets. For the top-500 experiment, our method still has promising results on three datasets. It shows that our method can be further competent for the top-k measurement task in the case of a lack of resources.
Although the memory is doubled, the measurement task of the top-1000 is still challenging. HeavyKeeper can output the measurement results in this group of experiments, but our method still has good advantages and is better than the other four methods. In addition, compared with the results in
Section 4.5.2,
Mbit of memory is added to our method, which makes our method nearly four times more accurate in the measurement task of top-1000. Other methods involved in the comparison did not have such a good increase.
4.5.4. Precision vs. Three Datasets Using 1 Mbit
Figure 14 shows the precision results of the top-
k measurement experiments for three datasets using six methods when the memory size is 1 Mbit.
Compared with the first two groups of experiments, the 1Mbit memory is relatively abundant for the six methods involved in the experiment, especially in the top-50/top-100/top-200 measurement tasks. Although the six methods achieved similar precision, our method is still similar to HeavyKeeper in the three datasets and better than the other four methods.
For top-500 and top-1000, our method also has better results than the other four methods. Compared with HeavyKeeper, our method still outputs excellent measurement results.
4.6. Analysis and Discussion
4.6.1. Why Is Finding Top-1000 Less Effective?
(1) Due to the strict limitation of 1 Mbit space, when the top-1000 measurement task is executed in this scheme, the size of the information storage table used to maintain the flow information will also increase with the increase of k. As mentioned earlier, when we use 1/4 Mbit memory, the information of the top-1000 will consume at least 40% of the memory. It will reduce the number of finder counters used for measurement tasks and affect the measurement results. Similarly, the five methods involved in the comparative experiment will also make the number of counters involved in the measurement scarce. In addition, for MSLCFinder, it is not ruled out that the number of elephant flows that meet the measurement threshold in the dataset involved in the measurement is not 1000.
(2) The flow size distribution in the high-speed network is always highly skewed [
46]. We summarized the flow size distribution based on a quintuple of three datasets in
Table 3. Although they are not standard heavy-tailed distributions, they are still highly skewed, especially CERNET-30. Compared with the CERNET-30 dataset, the flow size distribution of the CAIDA-2018 public dataset is relatively uniform, with small differences, and there are relatively many streams with large flow sizes. In the experimental results, the top-1000 precision obtained from the CAIDA-2018 dataset is better than the other two. Although the precision is somewhat low, this does not affect the effectiveness of our method. The effect will be significantly improved if we increase the total memory size.
4.6.2. The Necessity of Using Different Measures to Complete the Experiment
We use different FlowID to show MSLCFinder’s universality in network traffic header fields, i.e., MSLCFinder can be deployed on different header fields or other measures. Furthermore, the experimental results show that the performances of MSLCFinder on datasets with different header fields are similar.
In the same dataset, the distribution calculated by using a quintuple as the flow ID will be different to that when using an IP address pair. It is easy to understand that the communication between the same IP can be multiple ports and protocols.
When using an IP address pair as the flow ID, it is equivalent to using only the characteristics of the IP layer and generalizing the specific information contained in the transport layer. On the other hand, if the mouse flows between different ports of many different protocols are generated by the same pair of IP addresses, the pair of IP addresses is likely to become a member of the top-k flow when measured by the IP address pair.
Therefore, it is necessary to use different measures to verify the effectiveness of our method in the same dataset. We designed the experiment to show MSLCFinder’s universality in terms of network traffic header fields.
4.6.3. The Energy Cost and Complexity of MSLCFinder
The method based on MSLCFinder works in a resource-constrained environment, including the CPU, memory, power, and storage. Our method benefits from the advantages of the data flow algorithm; MSLCFinder requires small memory. Due to the real-time, continuous, and unbounded characteristics of network flows in high-speed links, the algorithms for processing data flows can only perform one calculation on the network flows and can only use limited computing and memory resources.
Therefore, when designing the network data flow method, it is necessary to meet the following requirements: (1) the space used by the algorithm must be small enough; (2) processing and updating must be simple and rapid; (3) the query must be accurate. In other words, our method is based on the development and design of the network data flow method, which fundamentally limits energy consumption and time and space complexity. In addition, the total time complexity of our method is , and the total space complexity is .
4.6.4. The Threshold
In the experiment, the selection of threshold has a guiding influence on the experimental effect. Because the size of the flow is not known prior to measurement, the selected threshold will have different effects on different flows; however, the selection of threshold is also one of the critical parameters of the method proposed in this paper.
Unlike the static selection of security parameters of NetFlow [
5], our goal is not to set sampling probability in the worst case to ensure that the network equipment can operate continuously under an adverse traffic environment. Our proposed scheme focuses on whether or not to use limited memory resources and limited counter units to complete the measurement of top-
k elephant flow. We aim to not pursue a minor error between the estimated value and the actual value. In addition, our method has a fixed initial threshold
to determine the sampling probability, but our sampling probability is recalculated with several counting thresholds in the counter.
In the open world, the flow distribution of unit measurement time period is different; only the mouse flow smaller than the decision condition likely exists in the entire measurement cycle, resulting in a decline in the effect of the top-k elephant flows. Therefore, in future work, we will improve the sampling threshold’s adaptive generation and increase our scheme’s applicability in the complex and changeable real network environment.
5. Conclusions
Finding the top-k elephant flows is critical for network traffic measurement, especially with resource constraints. As the network traffic overgrows, finding top-k flows with limited resources is challenging. In order to overcome the mutual restriction of hardware resources, massive data, and measurement precision, we adopt a novel strategy called count-with-uth-level-sampling. Moreover, we propose a novel core component named Multi-Sampled Lightweight Counting Finder (MSLCFinder) to achieve a measurement scheme for finding top-k flows based on hardware resource-constrained environments. The MSLCFinder has three primary modules: a lightweight counting module, a multi-sampling module, and a flow label recording module. Three modules complement and constrain each other to maximize the optimization of the MSLCFinder in the measurement task. The essential technique of MSLCFinder is that it uses a uth-level multi-sampling module to relieve the storage pressure of the MSLCFinder, and also reduces the possibility of the mouse flows being recorded by the flow label recording module. Our evaluation confirms that MSLCFinder can achieve more than 97% precision. With MSLCFinder that we proposed, we can effectively find top-k elephant flows with less memory overhead and counters under resource constraints.
Although the MSLCFinder can achieve good measurement results with small hardware resources, experiments show that this overhead is still high, especially in the flow label recording module. Therefore, in future research, we have the following issues that need to be further studied and optimized: (1) We first need to improve the efficiency of MSLCFinder, including updating the technology and methods at this stage, reconstructing core record data structure instead of min-heap to achieve a faster, more efficient, and more concise recording algorithm. (2) Our count-with-uth-level-sampling strategy can sample each flow for many continuous arriving flows in the network before the finder counts. Compared with the count-all strategy, it achieves a certain degree of filtering effect on mouse flows, but the accuracy of the strategy in filtering mouse flows needs further research and discussion. Therefore, we need to introduce effective filtering strategies in future work to improve the utilization of memory resources and the accuracy of identifying top-k elephant flows. (3) The initial sampling threshold has a decisive influence on the global measurement results. Determining how to generate sampling threshold based on flow adaptation or self-learning is the focus of our future research. (4) This paper only focuses on identifying elephant flow in network measurement, ignoring the identification of mouse flow. In the actual network environment, the traffic information of many attacks (such as DDoS attacks) in the network is composed of mouse flows. Based on the distribution characteristics of network traffic, it can be seen that many mouse flows in high-speed network links cannot be accurately monitored in real time. Therefore, in future work, we will continue to study how to identify mouse flow accurately. (5) In future research, we plan to design a machine learning or integrated learning method to adaptively calculate the initial sampling threshold for per-flow according to the traffic distribution in the current network environment to be tested to replace the current single global threshold scheme. (6) We will continue developing parallel hardware versions of MSLCFinder to achieve more efficient measurement results.