4.1. Evaluation of BF Parameters
As already stated in this section, referring to [
9], we further evaluate a BF structure for the MCS domain with a different input dataset. In contrast to our previous work, where we have used just a tiny dataset containing 200,605 sensor readings (in total), hereafter, we use a dataset that is 23 times larger to thoroughly evaluate the applicability of BF structure in the MCS scenario containing realistic data traces. In
Section 3.2, we have already shown that an optimal BF size and the number of hash functions can be calculated from the number of inserted elements
n and the probability of false positives
p. Therefore, in this part of the evaluation, we observe the behavior of a BF structure when encoding MCS data concerning those two parameters. Our goal is to identify distinct data readings which need to be inserted in a BF during its validity period (i.e., a period prior to filter reset), where a
distinct reading is defined as a unique combination of parameter name and a
small MGRS area identifier in the observed time slot. In other words, two data readings are redundant if they are collected in the same time slot (e.g., 5 min period), inside the same MGRS area of 100 × 100 m
2 (i.e.,
small MGRS area) and associated with the same parameter (e.g.,
), and only one of them should be inserted in a BF structure during that period. A BF is reset upon the end of each time slot (e.g., 5 min period), meaning that a new empty BF structure, which will be used in the next time slot, is initialized. As we can assume the expected number of data readings that need to be inserted in the filter (i.e.,
n), and we can set the desired probability of false positives
p for an MCS service, in all experiments, a filter structure is initialized with an optimal number of bits
m and hash functions
k calculated from Equations (
4) and (
5). Note that in our previous work described in [
9], we have used just one filter for the evaluation purpose, meaning that all data readings have been inserted in the same BF structure, while in the remainder of this section, we use multiple BFs, one for each active
km MGRS area following the proposed algorithm as described in
Section 3.3. In addition, to give a reader true insight into the error rate caused by using the BF structure for encoding MCS data, it is not realistic to insert all of the 4.7 million data readings that have been placed in a single day (as previously described) in just one filter structure.
First, we investigate the impact of parameter
n on the number of distinct data readings that were lost because the BF membership test failed and concluded that a reading (i.e., the combination of observed parameter and
small MGRS area identifier) is already inserted in the structure (although it was not) when a predefined probability of false positives
p is in the range from 0.1% to 10%. In the experiment, we use just a subset of data collected in the afternoon during a 3 h period between 4 p.m. and 7 p.m., as shown in
Figure 4, and reset each BF structure every 10 min (i.e., a BF validity period is 10 min). Each yellow bar in the graph shows the total number of data readings in the observed time slot (e.g., there are 30,227 readings in the time slot from 4:30 p.m. to 4:40 p.m.), while the orange segment in the bar represents the total number of distinct data readings that must be inserted in BFs during this period. In general, overall users have gathered between 30 and 42 thousand data readings in each time slot, but a relatively small proportion of them are indeed valuable and should not be lost during the BF encoding process (i.e., between 4.4 and 7.2 thousand data readings are unique in each time slot, while other data readings are redundant and can be discarded).
Figure 5 demonstrates how the parameter
n affects the number of distinct (useful) data readings lost in each observed time slot. In particular, the number of lost elements is represented by blue, green, dark blue, and violet segments in the bar when the probability of false positives
p was 0.1%, 1%, 5%, and 10%, respectively. As expected, when we compare the number of lost readings for different values of
p, we can see that the lowest error is attained in every observed time slot when the probability of false positives
p is 0.001 (i.e., 0.1%). Furthermore, when we compare
Figure 5a,b, we can see that a larger value of the expected number of different elements
n will considerably minimize data loss, even when
p is 0.1, as a Bloom filter will be initialized with a greater number of bits and hash functions. For example, when
p is 0.1, and
n is raised from 50 to 100, the number of lost data readings drops from 79 (out of 6976) to 22 in a period between 6:20 p.m. and 6:30 a.m., while for
it drops from 2 to 0 for the same time slot. In addition, as shown in
Figure 5b, there are no lost elements in the entire observed period when
and
; however, for
, the total number of lost elements is 18, as shown in
Figure 5a. In general, we can conclude that if we initialize BFs with a higher
n and lower
p, we will obtain greater precision (i.e., higher accuracy) because the vector will employ more bits and hash functions for mapping elements (i.e., data readings).
Then, we investigate how the duration of the filter reset period impacts the number of data readings lost because the BF membership test failed. More precisely, we analyze the total number of distinct elements (i.e., readings) that were lost throughout 24 h, depicted as an error rate (in per mille, ‰) in
Figure 6, with regards to the BF validity period (i.e., each BF is periodically initialized at the beginning of every time slot throughout a day). It is critical to minimize this error as each of these lost elements is valuable and does not have a replacement in the observed time slot. As expected, we can see that the error rate is generally lower with a shorter window as the BF structure is more frequently reset. When we analyze the worst-case scenario, i.e., we expect to insert only 50 elements in the BF and the probability of false positives is 0.1 (i.e., 10%), around 1.1‰ of distinct data readings are lost throughout the day when the window duration is 1 min (see
Figure 6a), while the error rate is round 7.6‰ when we reset BF every 15 min (see
Figure 6b), and finally it goes up to 17‰ when BF reset period is every 60 min (see
Figure 6d). In other words, although we allow having 10% of false positives, the actual error rate is significantly below this border because it goes from just 0.1% for a 1 min BF validity period up to 1.7% in the worst-case scenario when the BF validity period is 60 min. On the other hand, when we also expect to insert only 50 elements in the BF, but the allowed probability of false positives is a hundred times lower (i.e., 1‰), the actual error rate is less than 0.5‰ even when the window duration period is 60 min. Note that we expect to encode only 50 elements in each BF, but the actual number of inserted data readings is significantly higher in each time slot, and we are still below this very low allowed error rate, even when filters are reset every 60 min. As the BF validity period depends on the type of MCS service, our goal is to find parameters
n and
p that will keep the number of lost data readings to a minimum, regardless of the filter validity period duration. For example, noise levels need to be measured more densely than air pollutants, meaning that noise pollution monitoring will require a more frequent BF reset. It is self-evident that when
n increases and the probability of false positives
p decreases, the error rate will also decrease because the BF vector size is larger, and more hash functions are used for the mapping of elements (e.g., for
and
vector size is 239 bits and number of hash functions is only 3, while for
and
vector size is 2875 bits and number of hash functions is 10). As mobile phones also need to monitor Bloom filter structure in a distributed MCS architecture, we want to keep the vector size as small as possible while maintaining the number of hash functions to avoid losing too much data. Our analysis of a real dataset shows that this requirement can be met when the expected number of inserted elements
n is 100 and the probability of false positives
p is 0.001 (resulting in vector size of 1437 bits and ten hash functions), as the percentage of lost data readings will be around zero regardless of the BF validity period, as shown in
Figure 6c,d, where the error rate is 0.006‰ and 0.008‰, respectively. Therefore, we use those values as BF input parameters in the remainder of our evaluation described in
Section 4.2. Note that we did not use complex evaluation metrics such as area under the ROC curve (AUC) to evaluate BF because it does not apply to our approach as the BF data structure by definition does not contain false negatives, which is part of the receiver operating characteristic (ROC) curve.
Finally, for each BF associated with an active
km MGRS area, we analyze how many bits in the BF structure are set to 1, as setting too many bits to 1 can lead to an increased number of false positives, which can result in the lost data readings in our MCS service, i.e., measurements can be discarded because they were marked as redundant.
Figure 7 presents the number of bits that have value of 1 when
n equals 50 and parameter
(
Figure 7a) and
(
Figure 7b). We measured the BF occupancy (number of bits set to 1) for different time windows during 24 h to obtain the distribution of the number of bits that are set to 1. It is important to note that the total size of BF is 239 bits when the parameter
, and
m is 718 bits for the scenario when the parameter
. We can see that the structure occupancy for each observed BF validity period is not close to the maximal number of bits. As expected, the results show that when the BF validity period is just 1 min, the structure occupancy rate is the lowest compared to longer BF validity periods because the least number of elements is encoded in the filter, and there is no significant difference in the BF structure occupancy for longer BF validity periods. In addition, it is interesting to observe that the cumulative distribution of occupancy per BF validity period is not primarily affected by the probability of false positives
p, i.e.,
Figure 7b does not change its shape significantly compared to
Figure 7a. That means that the occupancy ratio remains stable despite the increase in the BF size (only absolute numbers correlated to the number of hash functions are affected). Even though the presented graphs show that the maximal number of bits set to 1 is not going beyond 40 (
Figure 7a) and 132 (
Figure 7b), one should note that there are some points located outside the whiskers of the box plot (i.e., above the ninth decile which are not depicted on the graph). These points indicate that in some periods during a day, some individual filters have more bits set to 1 due to the increased number of data readings encoded in the BF, causing that actual BF error to be greater than the allowed probability of false positives. For example, when the filter validity period is 15 min,
and
, there is a
km MGRS area with 100 distinct data readings in the observed period of which 85 are successfully encoded in the BF, and the filter has 174 out of 239 bits set to 1 resulting in an error higher than allowed. However, one should also note that in this specific case, 1.7 times more data readings were encoded in the BF than expected (i.e., BF expected to receive maximum
elements). We have confirmed our initial findings from [
9] that the BF overall size and memory footprint allow it to be employed on tiny devices, indicating that the structure is suitable for mobile devices to determine whether or not reading should be transmitted to an ME host.
4.2. Evaluation of BF Algorithm
One of the main characteristics of our decentralized MCS ecosystem is that it suppresses redundant and possibly irrelevant data readings by using the BF structure to deactivate some of the workers. However, this is achieved at the expense of filter control messages. Therefore, in this section, we first overview the messages exchanged between workers (i.e., their mobile devices) and ME hosts in our solution and then evaluate the proposed algorithm on a real-world dataset.
As already stated in
Section 3.2, an ME host is deployed in an MGRS area of 10 square kilometers and maintains BFs for all
km MGRS areas under its responsibility. It is aware of all user requests (tasks) and workers in the area. Whenever a worker moves to a neighboring
km MGRS area, it will send an
announce message to the corresponding ME host and receive an
assign message as a reply if at least one task is assigned to a worker, and it will start a data acquisition process and
submit results to the ME host until its BF membership test reports a positive match. In case the corresponding ME host determines that the submitted result is redundant (i.e., it has already received all required data readings in the meantime), it will send a BF
update message to the worker who submitted this result. Additionally, a BF maintained by the ME host is periodically
reset and delivered to the potential workers in the area. Therefore, the total number of exchanged messages equals the sum of all messages explained above (see Equation (
7)):
where
is the total number of messages that are exchanged within our MCS ecosystem,
is the number of
announce messages,
is the number of
assign messages,
is the number of
submit messages,
is the number of BF
update messages, and
is the number of BF
reset messages in the system.
Next, we observe the possible energy savings achievable with our approach compared to the baseline approach when all results produced by workers are transmitted to the ME hosts. Given that the energy consumption is in a linear relationship with the number of generated messages, the savings can be calculated as the percent decrease in the number of transmitted messages generated by our solution as compared to all produced results (see Equation (
8)):
where
is the energy savings,
is the number of all produced results that are transmitted in a traditional baseline MCS system, and
is the total number of messages that are exchanged within our MCS ecosystem.
We analyze the behavior of our solution concerning parameters
,
, and
, where
indicates a required number of results (of the same type) per
small MGRS area of 100 m
2 under the authority of an ME host,
indicates the number of results that a worker needs to submit per each
small MGRS area of 100 m
2 before updating its BF, and
indicates duration of the observation time slot in which a BF structure on the ME host is valid prior to reset (i.e., a day is divided into sequential independent time windows with a fixed duration, and upon the end of each time slot, a BF structure is reset). The initial parameter configuration is shown in
Table 1. Note that the number of all produced results, as well as the number of
announce messages, are obtained from the dataset, while the default values chosen for
,
, and
are hypothetical. The actual values for these three parameters in a real-world case would depend on the geographical configuration of a city, its population distribution, user mobility, and implemented application.
First, we analyze the influence of parameters
and
on the amount of sent data when the parameter
has a fixed value of 1 (i.e., a worker updates its BF structure after submitting the first result in the
small MGRS area). In other words,
Figure 8 depicts the cumulative number of results that were sent by all workers in our distributed system when an MCS service deployed on the ME host requires up to 15 results per each
small MGRS area, and each worker is instructed to submit only one result per
small area. As expected, the results show that with a longer
duration (i.e., the BF validity period), the number of sent results will be lower because the time between two consecutive filter resets is longer, and workers discard more results. In particular, when the filter reset period is 1 min, the percentage of
useful data sent by all workers is between 18% and 19%, while with the reset period of 120 min, this percentage drops to values between 11% and 15.5%, depending on the required number of workers per
small MGRS cell. We can see that when
has the value of 1 (i.e., ME host requires only one result per every 100 m
2), the amount of generated data is the lowest, because hosts update their BF immediately after they receive the first result for the
small MGRS area. In contrast, for bigger
, hosts expect to receive more results and update their filters less often, which leads to a higher amount of data sent by workers. When the filter validity period is 1 min, the influence of the parameter
on the percentage of sent data is not significant because there are not many workers located in the same area in such a short time interval, and the amount of redundant data is similar regardless of the required amount of data per
small MGRS area. On the other hand, when filter reset occurs every 120 min, it is very likely to find more workers in the same area who can submit data during the observed period, which results in greater differences in the percentage of sent data for different values of the parameter
. For example, when
is 1, the required amount of data on the ME host is collected very soon. Suppose other workers come to the same area in the remaining period. In that case, they will be instructed not to produce measurements because they receive an updated version of a BF (i.e., the host has already inserted received results in the filter), while when
is 15, more workers who send an announcement later in the time slot will also receive an initial (empty) version of the BF, and the ME host will receive more data readings. Overall, we can conclude that more than 80% of data readings produced by all workers in the system are never sent to edge hosts even in the worst-case scenario when the filter validity period is only 1 min, and MCS service requires up to 15 results per each
small MGRS area. That is a significant reduction in the number of transmitted data compared to a baseline approach where all produced data readings are sent to the cloud server.
Next, we observe trends in energy savings (
8) with respect to parameters
,
, and
. We analyze the number of all transmitted messages in our distributed MCS architecture compared to the baseline approach in which all produced data readings would be sent to the edge hosts. We modify two parameters for each analysis, while the third parameter is fixed to one of the default values given in
Table 1. First, in
Figure 9, we show how parameters
and
influence the savings when workers are instructed to transmit only one data reading per 100 m
2. In general, the results indicate that the advantage of our approach drops when increasing the value of parameter
because the required number of workers per
small MGRS area increases, consequently leading to a greater number of submitted results per cell, as already shown in
Figure 8. If more workers are required per
small area by ME host, worker self-deactivation by updating their BF has less influence because there is a need to satisfy application requirements (i.e., ME hosts need to collect
results before updating their filters). In addition, the results show that savings are generally lower when the BF validity period is shorter due to the increased number of transmitted data readings between workers and ME hosts. Although one might expect that energy savings will follow trends in the number of results submitted by workers (i.e., when the number of useful data increases, the total savings drops), we can see a slight deviation when
is one due to the overhead introduced by filter update and reset messages. Whenever a worker submits a redundant result to the corresponding ME host, it receives a
global filter for the
km MGRS area where it is currently residing (i.e., an updated BF which contains all results received by the ME host in the area), and when ME hosts require just one result per
small area, this can occur quite often. That is especially visible for short filter validity periods (e.g., 1 min) when the number of reset messages also increases. Overall, we can see that the total savings are above 53% even in the worst-case scenario when the filter validity period is just 1 min, while it can go up to 62% if filter reset occurs every 120 min.
Figure 10 shows how BF validity period and MCS application requirements influence the total energy savings when the required number of results, which a worker has to produce per
small MGRS area before updating its BF (i.e.,
), is fixed to one of the default values from
Table 1. The results confirm that savings drop when ME hosts require more data readings per
small MGRS area and the time between consecutive filter resets is shorter. In particular, the results indicate that a filter validity period of just 1 min significantly reduces the total saving compared to longer filter validity periods (e.g., 5 min) due to the overhead caused by many reset messages. Furthermore, when we compare the total savings in
Figure 10a,d, we can see that savings significantly drop with higher values of
. If workers are instructed to transmit more results to the edge host, then filtering mechanisms have less influence to reduce the total savings. However, even in the worst-case scenario shown in
Figure 10d, when we need high redundancy per
small MGRS area (i.e., up to 5 results for every 100 m
2), the energy savings is around 35% when the time between consecutive filter resets is just 1 min, and can grow up to 55% when filter validity period is 120 min.
Next, we analyze how the number of results a worker needs to submit per
small MGRS area and the length of BF validity period influence the total savings while keeping the fixed value of
, as shown in
Figure 11. The results indicate that when increasing the time between consecutive filter resets (i.e.,
), the influence of the parameter
drops. In other words, when the filter validity period is just 1 min, the difference in savings for
and
is around 18% (i.e., the total savings is 18% lower when workers are instructed to submit five results per
small area), while this difference drops to only 7.5% when filter reset is every 120 min when the parameter
has a fixed value of 5. Similarly, when
is 10, the difference in savings for
and
is around 18.5% when
is 1 min, and drops to 9.3% when filter validity period is 120 min. In addition, we can see that the total savings slowly increases with a decrease in the value of parameter
because an MCS application deployed on ME hosts requires fewer results. In particular, when workers submit only one result per
small MGRS area and the filter validity period is 120 min, the total savings goes up to 61.2% when
is 5 (see
Figure 11a), while it is less than 1% lower when ME host needs up to 10 results per
small area (see
Figure 11b). That is a significant result because savings remain pretty high even for many data readings required by an ME host.
Our findings are summarized in
Table 2 and
Table 3, which show the absolute difference in total savings for different input parameters. The first table shows the influence of parameter
on total savings, where column
presents the absolute difference in savings (expressed as percentage) when we compare the savings result for
and
for different values of parameters
and
(e.g., the result in the first row of column
shows that the total savings is 6.73% higher for
when filter reset is every 1 min and ME host requires 5 results per
small area), while the second table shows the influence of parameter
on total savings, where column
presents the absolute difference in savings when
and
for different values of parameters
and
(e.g., the result in the fifth row of column
shows that the total savings is 0.95% higher for
when filter reset is every 120 min and worker submits only 1 result per
small area). The results indicate that the parameter
has significant influence on the total savings when compared to the parameter
. We can conclude that
is used to control the number of workers who can produce data in each
small MGRS area during an observed time slot (i.e., BF validity period), while
is used to control the amount of data that each worker is going to produce per area. In other words, the significance of the parameter
is manifested only in cases when many redundant workers are located in the same 100 m
2 area during the same period.
Finally, we analyze the influence of parameter
on the cumulative amount of redundant data received by all ME hosts for different filter reset periods and application requirements in
Figure 12. As already stated, a data reading is redundant if an ME host already contains the same combination of the observed parameter and
small MGRS area identifier in its BF structure. Note that the percentage of redundant results is expressed concerning the total data received by ME hosts. As expected, the highest percentage of redundant data is received when hosts require only five data readings per
small cell before updating the global filter for the area. The lowest amount of redundant data is received when hosts require 15 results for all observed scenarios, regardless of the time window in which BF is reset and parameter
, because for smaller values of
, hosts will update their filters very soon, and subsequently received results will be redundant. When we compare savings for different BF validity periods, we see that the amount of redundant results received by ME hosts grows when the time between BF resets increases. If the filter is valid only for a short period, the probability that more workers will be within the same
small MGRS area during the same time slot is low. In other words, likely, hosts will not collect the expected number of results per each
small area because the filter will quickly reset. At the same time, it is quite possible that within, for example, 120 min, more workers will pass through the same
small area, and because of the relatively high values of the parameter
, a filter they receive may still not be updated with recently received results, which may later cause a more significant amount of redundant data in that area.
Similarly, when we compare the percentage of redundant results received by ME hosts for different values of the parameter
, we can see that the amount of redundant results grows with
because workers are instructed to produce more readings per
small area and need to wait longer before updating their filters. Therefore, a worker may send a redundant result because its version of the filter shows that this result is valuable and should be submitted, while in the meantime, the corresponding ME host has already received enough results for that
small area from other workers. Overall, the results show that even in the worst-case scenario, when workers are instructed to submit five results per each
small area, as shown in
Figure 12d, the percentage of redundant results received by ME hosts never exceeds 6%, which proves that the proposed algorithm works exceptionally well for different input scenarios.
To conclude, we have shown that our hierarchical edge computing architecture, enhanced with the algorithm for autonomous decision-making of a mobile worker, shows great potential for energy savings in distributed MCS environments while retaining the desired sensing quality due to decreased messaging between workers and ME hosts. The main advantage of the proposed hierarchical architecture is that there is no central point of failure because the data processing is distributed across multiple ME hosts, and all workers can autonomously decide when to transmit data. At the same time, the actual savings depend on an actual MCS application and its setup. Our analysis showed that the edge architecture is suitable for massive-scale MCS services. The amount of transmitted data between mobile workers and edge hosts can be significantly reduced using the proposed BF algorithm because workers independently decide when to send data instead of periodically transmitting all data readings. Consequently, it is possible to achieve energy savings of up to 62%.