The significant increase in air traffic levels observed over the past two decades has generated a greater focus on the optimization of Air Traffic Management (ATM) systems. The primary objective is to effectively manage the continuous increase in air traffic while ensuring safety, economic viability, operational efficiency, and environmental sustainability. The Automatic Dependent Surveillance-Broadcast (ADS-B) system is an integral component of modern air traffic control (ATC) systems, as it offers a cost-effective alternative to traditional secondary radar systems [15]. Using ground stations, the ADS-B system not only reduces expenses but also improves the precision and timeliness of real-time positioning data. However, this system produces a substantial amount of data which, when integrated with additional flight-related data such as flight plans or weather reports, encounters scalability challenges [16]. The system must receive and store substantial volumes of ADS-B messages to facilitate various forms of analytical processing. Typically, these analyses require additional data to augment trajectory information, such as flight plans, weather data, and baggage ticketing information. Consequently, vast collections of diverse data are consolidated, which ATM systems must manage effectively to generate appropriate decisions and predictions. Therefore, the implementation of ATM systems can be regarded as a specific instance of leveraging big data to conduct flight-related analytics [16].
In the following sections, we present an LA as an end-to-end big-data system for real-world air/ground surveillance applications, as well as a testing methodology to assess its accuracy and performance in various real-world data-ingestion scenarios.
2.1. Problem Definition
Contemporary surveillance networks can provide trajectories for various types of vessels and aircraft over a global or, at the very least, expansive geographical range [18]. The two most commonly used systems for air and maritime surveillance are ADS-B and the Automatic Identification System (AIS); both are cooperative in nature. In addition to these systems, sensor networks deployed on ground installations or mounted on airborne and space-based platforms provide object trajectories autonomously, without requiring any form of cooperation. Illustrative examples include sensor network installations for coastal surveillance or ATC purposes [19]. Surveillance systems offer trajectories that span medium and long periods of time. The difficulty lies in understanding the surrounding circumstances and determining the intentions of the objects being monitored.
Emerging technologies encompass activity-based intelligence and the identification of patterns of life. Advanced analysis of trajectories extracted by surveillance systems offers a potential approach to realizing these technologies [18]. Clustering algorithms are used to partition trajectories into distinct segments of interest, and unsupervised machine learning is used to decipher behavior patterns and understand their inner dynamics [20]. The trajectories are consolidated and organized into distinct routes, each of which is assigned a representative, and probabilities are calculated to determine the frequency of usage of these routes. This enables the application of predictive analytics and the detection of aberrant behavior [15], as sketched below. Ultimately, these novel data-analytic methodologies must be integrated into pre-existing near real-time surveillance systems. Successful implementation necessitates precise system architectures alongside an entirely novel software and hardware framework [21].
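As a concrete illustration, the following sketch shows how extracted trajectories could be grouped into routes and how route-usage probabilities could be derived. It is a minimal example that assumes trajectories have already been resampled to fixed-length coordinate sequences; the feature construction, the DBSCAN parameters, and the variable names are illustrative assumptions, not part of the surveillance systems described above.

```python
# Minimal sketch: cluster resampled trajectories into routes and
# estimate route-usage probabilities (illustrative parameters only).
import numpy as np
from sklearn.cluster import DBSCAN

# Each trajectory: a fixed number of (lat, lon) samples flattened into one vector.
trajectories = np.random.rand(200, 50 * 2)  # placeholder for real ADS-B tracks

clustering = DBSCAN(eps=0.5, min_samples=5).fit(trajectories)
labels = clustering.labels_                      # -1 marks outliers / aberrant tracks

routes = {}
for label in set(labels) - {-1}:
    members = trajectories[labels == label]
    routes[label] = {
        "representative": members.mean(axis=0),  # simple centroid as route representative
        "usage_probability": len(members) / len(trajectories),
    }

anomalies = trajectories[labels == -1]           # candidates for aberrant-behavior analysis
```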
In recent years, numerous studies have proposed monitoring and architectural improvement approaches for LA. They have mainly focused on text and sensor-based data, such as those found in social media, the Internet of Things, and smart-system applications [22,23]. Performance comparisons of LA with other big-data architectures, such as Kappa, have also been proposed in several studies [14,24,25,26,27]. When dealing with LA, however, several practical issues arise: (1) identifying the system’s behavior under real-world data-intake situations is critical; (2) for safety- and time-sensitive applications such as air/ground surveillance in aviation, it is important to evaluate the system’s accuracy and performance; and (3) for various data-ingestion rates, especially data collected from diverse geographical locations and subject to time-delayed ingestion, it is critical to identify the underlying dynamics and visualize the system’s behavior analytically. To address these issues, this paper offers a method for evaluating the performance of LA-based big-data applications using controlled and uncontrolled real-world data-ingestion experiments with ADS-B data.
Although LA’s efficiency has increased in recent years, most of the gains have come from enhanced versions of the analytics algorithms implemented in LA’s layers, as well as from the use of performance models. Yet, controlled experiments may be used to improve the measurement and understanding of the system’s inner dynamics. To that end, this research examines the capacity of eventually consistent LA-based big-data applications in real-world data-ingestion operations for air/ground surveillance and monitoring. Although we refer to our previous work [28], the present study has a novel focus: its goal is to provide a comprehensive visualization of the inner dynamics and performance measures of an LA-based application in the aviation field using ADS-B data, based on the technique given in [28].
In this work, we specifically consider challenging real-world data-ingestion cases and abnormalities and then investigate the performance of LA under these circumstances. By experimenting with LA under these conditions, we evaluated and demonstrated its performance and capabilities. Thus, this study contributes to the big-data research area by providing an investigation of an LA not only under normal/nominal conditions but also under abnormal conditions. Existing studies have focused primarily on evaluations of LA hardware performance and scalability, leaving a significant gap in the literature regarding the empirical evaluation of data-processing accuracy within the various layers of LA. This study aims to fill this gap by introducing a novel methodology specifically designed to assess the accuracy of data processing in LA. To better illustrate this gap, a comparison of recent work related to LA is given in Table A1.
The primary research question guiding this study is: ‘How accurately does the Lambda Architecture process data under various real-world data-ingestion scenarios, particularly in extremely challenging data-intake cases, and how can its accuracy be measured while it is in operation?’. To address this question, the study has the following objectives:
1. Develop a novel methodology to evaluate the data-processing accuracy of LA;
2. Empirically validate the ‘eventual consistency’ of LA under diverse data-ingestion scenarios, including abnormal and extremely problematic data-ingestion cases;
3. Apply the developed methodology in a practical case study involving air/ground surveillance, where data accuracy is paramount.
It is also important to emphasize that the performance comparison of other big-data architectures [14], the proposal of novel performance models, and the application of LA to different real-world domains such as IoT, smart systems, and social networks are beyond the scope of this paper.
2.2. Lambda Architecture-Based Big-Data System
To analyze incoming data and answer queries on stored historical and newly obtained data, an LA comprises the speed, batch, and serving layers. When the serving layer receives a query request, the response is created by querying both real-time and batch views simultaneously and combining the data from these layers. At the serving layer, both real-time and batch databases are searched, and the results are merged into a single resultant data set to provide a near-real-time data set in response to the query. A scalable distributed data-transmission system (data bus) allows data to be transferred continuously to the batch and speed layers at the same time. On the speed layer, data processing and analytics activities are carried out in real time, whereas on the batch layer, they are carried out offline.
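The following sketch illustrates this query-time merge of batch and real-time views. It is a minimal, hypothetical example: the two dictionaries stand in for a batch-view table and a real-time-view table in the serving layer, and the key and field names are illustrative assumptions rather than part of any specific implementation.

```python
# Minimal sketch of a serving-layer query: combine a (stale but complete)
# batch view with a (fresh but partial) real-time view.
def query_serving_layer(key, batch_view, realtime_view):
    batch_counts = batch_view.get(key, {})        # precomputed by the batch layer
    realtime_counts = realtime_view.get(key, {})  # covers data not yet batch-processed

    merged = dict(batch_counts)
    for field, value in realtime_counts.items():
        merged[field] = merged.get(field, 0) + value  # additive merge of the two views
    return merged

# Example: message counts per aircraft, split across the two views.
batch_view = {"ICAO24:4BAA85": {"msg_count": 10_500}}
realtime_view = {"ICAO24:4BAA85": {"msg_count": 312}}
print(query_serving_layer("ICAO24:4BAA85", batch_view, realtime_view))
# {'msg_count': 10812}
```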
Figure 1 is a conceptual representation of the LA. Incoming data from the data-ingestion bus are transmitted to both the speed and batch layers, which subsequently produce multiple views utilizing the new and old data, and the results are stored on the LA’s serving layer. To construct an LA, several existing big-data technologies may be employed at all three layers. According to LA’s polyglot persistence paradigm, each available big-data technology framework may be employed for its specific data-processing capacity to deal with that sort of data and assist analytical activities.
The blocks of immutable main data sets are managed, operated, and stored by the batch layer. Newly arriving data are simply appended to the batch layer’s already-stored historical data; update and delete actions are not permitted in the batch layer. Continuous data processing and analytics activities are run as needed to generate batch views from these data. A fresh batch-view calculation operation is re-executed sequentially and combined to produce new batch views, either in coordination with the speed layer or upon the arrival of a specified quantity of new data. This process is ongoing and never-ending. The batch views are built from the batch layer’s immutable data sets. Depending on the quantity of both incoming and stored historical data, full batch data processing and analytical computations take a long time. As a result, batch-layer actions and computations to generate recent batch views are performed infrequently. The status of batch-layer data processing and importing must be tracked to ensure that batch-view creation is finished before the speed layer becomes overburdened. The serving layer uses the real-time and batch views created by the speed and batch layers to respond to incoming queries. Consequently, the serving layer requires the capability to store large amounts of data, such as NoSQL databases with various features. Owing to the many types of data-ingestion patterns, this layer must handle both bulk and real-time data ingestion. The serving layer is vulnerable in cases where streaming data are delayed or absent; under these conditions, inconsistencies may emerge in data analyses and query responses, which are eventually resolved exclusively by the batch layer.
To fulfill the low-latency analytics and responsive query requirements, LA’s speed layer compensates for the batch views’ staleness by serving the most recently collected data, which the batch layer has not yet processed. Within its restricted capacity, the speed layer works in real time on streaming data and saves its output in the serving layer as real-time views.
Because of the nature of real-time operating needs, the speed layer demands high read/write throughput on the serving layer. Only recent data are stored in real-time views, until the batch layer completes its operation cycle once or twice. After the batch layer completes its data-processing and analytics computations, the data saved as real-time views during batch processing are discarded and removed from the serving layer. Depending on the data processing at the batch layer, some real-time views must be flushed or cleaned from the real-time layer when batch-view generation is completed; this step is critical in minimizing the stress on the real-time database at the serving layer. Monitoring and acting on the resources of the speed layer depend on resource consumption and capacity requirements, the exact coordination of the layers, and particular performance indicators for all layers. The batch view will be stale for at least the processing time between the start and finish of the batch processes, or longer if inappropriate circumstances or defective coordination with the speed layer exist. This necessitates careful and precise coordination between the speed and batch layers. Bulk data import into the serving layer is needed as soon as the coordinated data-processing activity between the speed and batch layers is completed; data ingestion on the serving layer is finished when the latest batch views are ready.
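As a simple illustration of this coordination, the sketch below expires real-time views once a batch run that covers them has completed. The in-memory structures, timestamps, and function names are hypothetical stand-ins for the real-time database and the batch-coordination signal; they are assumptions made for illustration, not the implementation used in this study.

```python
# Minimal sketch: flush real-time views that are now covered by a finished batch run.
# `realtime_views` maps a view key to (event_time, value); `batch_completed_up_to`
# is the timestamp up to which the latest batch views are complete.
def flush_covered_realtime_views(realtime_views, batch_completed_up_to):
    flushed = []
    for key, (event_time, _value) in list(realtime_views.items()):
        if event_time <= batch_completed_up_to:
            del realtime_views[key]   # the batch view now answers queries for this data
            flushed.append(key)
    return flushed

realtime_views = {
    ("ICAO24:4BAA85", "2023-05-01T10:00"): ("2023-05-01T10:00", 120),
    ("ICAO24:4BAA85", "2023-05-01T13:00"): ("2023-05-01T13:00", 95),
}
# Batch views are complete up to noon, so the 10:00 entry can be dropped.
print(flush_covered_realtime_views(realtime_views, "2023-05-01T12:00"))
```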
2.3. Common Technologies for Lambda Architecture Layers
For the real-time, batch, and serving layers, respectively, Apache Spark/Spark Streaming, Hadoop YARN [29], and Cassandra [30] have been utilized in LA designs. Task-based serving, batch, and speed layers were chosen to construct real-time and batch views, as well as to query the aggregated data. The Apache Hadoop and Storm frameworks are sophisticated enough to build and deploy a wide range of LA applications [31].
The LA data-ingestion component (data bus) is designed to receive a large amount of real-time data. Apache Kafka [32] is one of the most commonly used and popular frameworks for data-bus technologies: it is a mature, scalable, fault-tolerant system that allows high-throughput data transfer and is well suited to this purpose. Apache Samza [33], Apache Storm [34], and Apache Spark (Streaming) [35] are good alternatives for the speed layer. For batch-layer activities, Apache Hadoop [29] and Apache Spark are very popular and suitable.
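For illustration, a minimal Kafka producer that publishes decoded ADS-B position reports onto the data bus might look as follows. This sketch uses the kafka-python client; the broker address, topic name, and message fields are assumptions made for the example, not the configuration of the system evaluated here. Both the speed and batch layers would then consume the same topic, each in its own consumer group.

```python
# Minimal sketch: publish ADS-B position reports to a Kafka topic (the LA data bus).
# Assumes the kafka-python package and a broker at localhost:9092.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)

adsb_message = {                 # illustrative decoded ADS-B fields
    "icao24": "4baa85",
    "lat": 41.275,
    "lon": 28.752,
    "alt_ft": 37000,
    "ts": "2023-05-01T12:00:00Z",
}

# Keying by aircraft identifier keeps one aircraft's messages ordered per partition.
producer.send("adsb-positions", key=adsb_message["icao24"].encode(), value=adsb_message)
producer.flush()
```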
NoSQL data stores offer a viable alternative to traditional relational database management systems (RDBMSs) [36]. However, organizations may find it difficult to quickly determine the most suitable option to adopt. To select the most suitable NoSQL database, it is essential to identify the specific requirements of the application that cannot be fulfilled by an RDBMS; if an RDBMS is capable of efficiently handling the data, the use of a NoSQL storage system may be unnecessary. In [37], the researchers provided a comprehensive analysis of various NoSQL solutions, including Voldemort, Redis, Riak, MongoDB, CouchDB, Cassandra, and HBase, among others, highlighting their key characteristics and features. The study conducted in [38] focused on examining the elasticity of non-relational solutions; specifically, the authors compared the performance of HBase, Cassandra, and Riak in terms of read and update operations. The researchers concluded that HBase exhibits a high degree of elasticity and efficient read operations, whereas Cassandra demonstrates notable proficiency in rapid insertions (writes). Riak, however, exhibited suboptimal scalability and limited performance gains across various types of operations.
As a speed-layer database, Apache Cassandra [30], Redis [39], Apache HBase [40], or MongoDB [41] could be utilized; these databases can handle real-time data ingestion as well as random read and write operations. As a batch-layer database, MongoDB [41], CouchbaseDB [42], and SploutSQL [31] can be utilized; these databases can import large amounts of data to generate and serve batch views.
Apache Flink [43], a framework and distributed processing engine for stateful computations over unbounded and bounded data streams, is also a viable choice for data-processing tasks [9]. By checkpointing regularly and asynchronously, stateful Flink applications provide fast local state access and ensure exactly-once state consistency in the event of failures. Apache Spark, on the other hand, is an open-source data-processing platform designed for speed, simplicity, and sophisticated analytics. It provides a comprehensive, all-in-one structure for handling big-data processing needs with a variety of data types and data sources. Discretized streams (D-Streams) are the programming paradigm supported by Apache Spark Streaming [44]. Stateful exactly-once semantics are supported by the Spark Streaming programming model. For event-driven, asynchronous, scalable, and fault-tolerant applications, such as real-time data aggregation and response, Spark Streaming makes it simple to handle real-time data from various event streams.
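As a brief illustration of the D-Stream model, the following PySpark Streaming sketch counts ADS-B messages per aircraft in fixed micro-batches. The socket source, batch interval, and field layout are illustrative assumptions chosen to keep the example self-contained; a production speed layer would typically read from the Kafka data bus instead.

```python
# Minimal PySpark Streaming (D-Stream) sketch: per-aircraft message counts
# over 10-second micro-batches. Assumes lines of "icao24,lat,lon,alt" on a socket.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="adsb-speed-layer-sketch")
ssc = StreamingContext(sc, batchDuration=10)           # 10-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)        # stand-in for the Kafka data bus
counts = (
    lines.map(lambda line: (line.split(",")[0], 1))    # key by aircraft identifier
         .reduceByKey(lambda a, b: a + b)              # messages per aircraft per batch
)
counts.pprint()    # in an LA, the real-time view would be written to the serving layer

ssc.start()
ssc.awaitTermination()
```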
In general, NoSQL databases are preferred over relational databases for LA [36]; their major advantages for the serving layer are scalability and advanced data-ingestion capabilities. Druid [45] is a column-oriented, distributed, real-time analytical data store. Druid’s distribution and query paradigm resembles concepts found in current-generation search infrastructures. Druid real-time nodes provide the capabilities of ingesting, querying, and building real-time views from incoming data streams [46]; data streams that have been indexed using real-time nodes are immediately queryable. Druid also includes built-in support for producing batch views with Hadoop and MapReduce tasks for batch data ingestion.
2.4. Same Coding for Different Layers Approach
The LA is suitable for data-processing enterprise models that necessitate ad hoc user queries, immutable data storage, prompt response times, and the ability to handle updates in the form of new data streams. Additionally, it ensures that stored records are not erased and permits the addition of updates and new data to the database [17]. The advantages associated with data systems constructed using the LA extend beyond mere scalability. As the capacity of the system to process substantial volumes of data increases, the potential to extract greater value from it also expands. Expansion of both the quantity and diversity of stored data results in increased potential for data mining, analytics generation, and the development of novel applications [47]. An additional advantage associated with the utilization of LA lies in the enhanced robustness of the applications; this is exemplified by the ability to perform computations on the entire dataset, facilitating tasks such as data migrations or rectifying errors.
It is also possible to prevent the simultaneous activation of multiple versions of a schema [17]. In the event of a schema modification, it becomes feasible to migrate all data to conform to the updated schema. Similarly, if an erroneous algorithm is unintentionally deployed to production and corrupts data, the situation can be rectified by recalculating the affected values. This enhances the resilience of big-data applications. Ultimately, the predictability of performance is enhanced: although the LA demonstrates a generic and adaptable nature, its constituent components are specialized, and there is significantly less “magic” taking place in the background than with, for example, an SQL query planner, which results in a higher degree of performance predictability [17]. However, this architectural framework is evidently not universally applicable to all big-data applications [48]. One of the challenges associated with the LA is the management of two intricate distributed systems, namely the batch and speed layers, which are both responsible for generating identical outcomes. In the end, even if the need to code the application twice can be bypassed, the operational demands associated with managing and troubleshooting two systems will inevitably be considerable [49]. Furthermore, the batch layer produces intermediate results that are stored in the file system, leading to increased latency as the length of the job pipelines grows [50]. Despite numerous efforts aimed at mitigating access latency, the coarse-grained data access inherent in a MapReduce and distributed-file-system framework is primarily suited to batch-oriented processing, thus constraining its applicability in low-latency back-end systems [51].
2.4.1. Description of the Approach
The primary trade-off in Lambda Architecture revolves around the balance between latency and complexity. Batch processing is known for its ability to deliver precise and consistent outcomes, but it comes with the trade-off of longer execution times and increased storage and maintenance requirements. Stream processing offers rapid, real-time outcomes, although it may be less precise and consistent due to the presence of incomplete or out-of-order data. The Lambda Architecture uses both layers and combines them in a serving layer to balance these trade-offs. However, incorporating these additional components also introduces increased complexity and duplication within the data pipeline, because developers now have to handle different frameworks, formats, and schemas for each layer. A way to simplify the Lambda Architecture pattern is to utilize a unified framework that can manage both batch and stream processing using the same code base, data format, and schema. With this approach, developers can simplify the data pipeline and eliminate redundancy, since they only have to develop, test, and maintain a single set of logic and data structures. A unified framework makes it possible to transition between batch and stream processing without requiring distinct implementations; this not only saves development time but also guarantees consistent data processing across the various layers of the architecture. Moreover, a unified framework facilitates debugging and troubleshooting, since developers can easily track the flow of data and transformations within a single codebase.
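The sketch below illustrates this idea with PySpark: a single transformation function is applied unchanged to a static (batch) DataFrame and to a Structured Streaming DataFrame. The file paths, schema, and aggregation are assumptions made for the example; the point is only that the same business logic serves both layers.

```python
# Minimal sketch: one transformation reused for the batch and speed layers (PySpark).
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("unified-la-sketch").getOrCreate()

def positions_per_aircraft(df: DataFrame) -> DataFrame:
    """Shared business logic: count position reports per aircraft."""
    return df.groupBy("icao24").agg(F.count("*").alias("msg_count"))

# Batch layer: historical ADS-B records from storage (illustrative path/format).
batch_df = spark.read.parquet("/data/adsb/history")
batch_view = positions_per_aircraft(batch_df)
batch_view.write.mode("overwrite").parquet("/views/batch/positions_per_aircraft")

# Speed layer: the same logic on a live stream (illustrative Kafka-backed source).
stream_df = (spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("subscribe", "adsb-positions").load()
             .selectExpr("CAST(value AS STRING) AS icao24"))  # simplified parsing
realtime_view = positions_per_aircraft(stream_df)
query = (realtime_view.writeStream.outputMode("complete")
         .format("memory").queryName("realtime_positions").start())
```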
From another point of view, it may be difficult to develop applications using big-data frameworks, and debugging an algorithm on a big-data framework or platform may be much more difficult still [52]. This model has exhibited numerous challenges in terms of implementation and maintenance. Proficiency in two distinct systems is a prerequisite for developers, which entails a substantial learning curve. Developing a cohesive solution is feasible but is accompanied by numerous challenges, such as merge conflicts, debugging complications, and operational intricacies. The incoming data must be fed into both the batch and speed layers of the LA, and preserving the sequential arrangement of events within the input data is of paramount significance to attain comprehensive outcomes. Replicating data streams and distributing them to two different recipients can present challenges and increase operational burdens. The LA does not consistently meet expectations, prompting numerous industries to employ either a pure batch-processing system or a pure stream-processing system to fulfill their specific requirements. It is widely recognized that improper implementation of LA can result in challenges such as code complexity, debugging difficulties, and maintenance issues. In the past, application development and testing for the batch and speed layers of an LA had to be done at least twice. In addition to the configuration and coordination needs of each layer, the most significant downside of LA is that it is often impractical for developers to implement the same algorithm repeatedly using different frameworks [53]. Separate software with different frameworks must be developed, debugged, tested, and deployed on large hardware clusters for big-data applications, which necessitates more effort and time. Using the same big-data processing technology for the different layers is a viable method to address this limitation [9].
A unified framework also simplifies deployment by eliminating the need to manage multiple codebases and configurations. With a single codebase, a developer can package and deploy an application more easily, reducing the chances of configuration errors and deployment issues, and existing logic and data structures can be reused across different processing scenarios. This reduces development effort and improves the maintainability and scalability of the application. For example, a unified framework such as Apache Flink can be used in an LA to process both real-time streaming data and historical batch data, allowing data-processing tasks such as filtering, aggregating, and joining to be applied seamlessly across both types of data sources. Developers write the code once and apply it to both batch and stream processing tasks, which simplifies development, ensures consistent results, and removes the need to maintain and update separate implementations for each processing method. A unified framework likewise provides a single platform for managing batch and stream workflows, eliminating separate deployment strategies, reducing the risk of inconsistencies between the two methods, and offering centralized control over scaling and resource allocation, so that resources are used efficiently and costs are minimized.
By contrast, managing different frameworks, formats, and schemas for each layer of an LA is challenging and time-consuming: it requires expertise in multiple technologies, increases the complexity of the system, and makes compatibility and data consistency across layers a major concern, since changes in one layer may require corresponding modifications in the others, leading to potential delays and inconsistencies in processing. Synchronizing data across different frameworks and schemas is also resource-intensive, so organizations must carefully consider the trade-offs of maintaining such a complex architecture. Furthermore, a unified framework enables easier scalability and adaptability, as it eliminates the need to reconfigure and restructure multiple layers whenever the system changes; this flexibility is crucial when businesses need to respond quickly to market demands and incorporate new data sources or technologies. For real-time analytics in an LA, having one set of logic and data structures ensures consistency in data processing across both the batch and speed layers: developers can make changes or updates to the single codebase without duplicating effort in multiple frameworks, any such change is reflected across all layers of the architecture, and the risk of inconsistencies or errors in data analysis is reduced. This simplifies development, streamlines maintenance and troubleshooting, and ultimately improves the reliability and accuracy of the analytics system.
2.4.2. Available Technologies
Several big-data technologies are available for implementing an LA. One option is Apache Spark [23], a versatile, high-speed framework that supports both batch and stream processing through a unified set of APIs and libraries. Another option is Apache Beam [54], a unified model that abstracts away the distinctions between batch and stream processing and is compatible with multiple engines, including Spark [23], Flink [43], and Dataflow. These frameworks offer a high level of abstraction, enabling a developer to concentrate on the logic of the data-processing tasks instead of becoming caught up in the complexities of the underlying processing engines. This simplifies development and facilitates seamless switching between different processing engines without rewriting the code. Moreover, these frameworks provide integrated fault-tolerance and scalability features, guaranteeing that the data pipeline can manage substantial amounts of data and recover efficiently from failures.
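As an illustration of this engine-agnostic style, the following Apache Beam (Python SDK) sketch expresses a per-aircraft count once; the same pipeline can then be executed on different runners (e.g., Spark, Flink, or Dataflow) by changing only the runner option. The input path, element format, and output location are assumptions made for the example.

```python
# Minimal Apache Beam sketch: one pipeline definition, runnable on several engines.
# Assumes input lines of the form "icao24,lat,lon,alt".
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(runner="DirectRunner")   # swap for SparkRunner/FlinkRunner/DataflowRunner

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadMessages" >> beam.io.ReadFromText("adsb_messages.csv")
        | "KeyByAircraft" >> beam.Map(lambda line: (line.split(",")[0], 1))
        | "CountPerAircraft" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "Write" >> beam.io.WriteToText("counts_per_aircraft")
    )
```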
Spark’s extensive application programming interfaces (APIs) for both batch and streaming data processing make it a highly suitable solution in the LA context [23]. The fundamental component of the Spark Streaming API is a sequence of RDDs; therefore, there is a significant opportunity for code reuse, and the maintenance and debugging of business logic are simplified. Moreover, in the context of the batch layer, Spark emerges as a strong choice due to the remarkable performance resulting from its ability to process data in memory [23]. Using Spark and Spark Streaming together allows big-data applications to use the same code for batch and online processing; in terms of the LA, Spark Streaming and Spark may be used to create applications for the speed layer and batch layer, respectively, so that a single framework supports both layers. However, one issue remains: the serving layer must be linked with both layers for data processing and must provide data ingestion for both layers.
The importance of understanding how Spark and Spark Streaming can be used to build the batch- and speed-layer applications of a Lambda Architecture cannot be overstated. Spark is an open-source distributed computing system that provides a high-level API for distributed data processing and offers in-memory computing, which significantly speeds up processing compared to traditional disk-based systems. Spark Streaming is an extension of Spark that enables real-time processing of streaming data: by processing data in mini-batches, it achieves low latency and high throughput, and its fault-tolerant, scalable design allows it to handle large data streams without data loss. Spark also integrates easily with other big-data tools and frameworks, such as Hadoop, and Spark Streaming integrates seamlessly with other Spark components, such as Spark SQL and MLlib, enabling complex data transformations and advanced analytics on streaming data. By leveraging Spark and Spark Streaming together, big-data applications can efficiently process large volumes of data in near real time, which is valuable for building real-time analytics and machine-learning applications in industries such as finance, healthcare, and e-commerce, where timely data analysis is crucial for staying competitive.
2.4.3. Evaluation of the Approach
It is important to acknowledge the current state of the literature. To the best of our knowledge, there is no existing metric or software-engineering methodology that can empirically measure the effectiveness of our proposed same-coding-for-different-layers (SC-FDL) approach. This presents a significant gap in the literature and underscores the novelty of our approach.
To measure the effectiveness of using the same coding technologies in different layers of a software application, we propose a metric called the “code reusability factor” [55]. This metric quantifies the percentage of code that can be reused across different layers or components of the application. A higher code reusability factor indicates that the coding technologies used are effective in enabling easy switching between different processing engines and in minimizing code-rewriting efforts; it also suggests that the frameworks used offer built-in fault-tolerance and scalability features, contributing to an efficient data pipeline capable of handling large volumes of data and recovering from failures effectively. For example, in a big-data analytics application, a higher code reusability factor allows developers to switch easily between processing engines, such as Apache Spark or Apache Flink, based on specific use cases or performance requirements, so that the application can process and analyze large volumes of data without significant code modifications. A complementary metric is the ease of switching between processing engines [56]: if the code can switch between engines without significant modifications, the coding technologies are efficient in enabling interoperability. Overall, measuring code reusability and ease of switching helps determine the effectiveness of using the same coding technologies in the different layers of an LA.
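A minimal way to operationalize the code reusability factor is sketched below: the fraction of logical lines of code that are shared between the batch-layer and speed-layer implementations. The line-counting heuristic and the module names are illustrative assumptions; in practice, the factor could equally be computed over functions or classes.

```python
# Minimal sketch: code reusability factor as shared lines / total lines
# across the batch-layer and speed-layer code bases (illustrative heuristic).
def logical_lines(path):
    with open(path, encoding="utf-8") as f:
        return {ln.strip() for ln in f
                if ln.strip() and not ln.strip().startswith("#")}

def code_reusability_factor(batch_file, speed_file):
    batch, speed = logical_lines(batch_file), logical_lines(speed_file)
    shared = batch & speed
    total = batch | speed
    return len(shared) / len(total) if total else 0.0

# Example usage with hypothetical module names:
# factor = code_reusability_factor("batch_layer_job.py", "speed_layer_job.py")
# print(f"code reusability factor: {factor:.2%}")
```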
In addition to code reusability and ease of switching, another important metric for evaluating the effectiveness of using the same coding technologies in different layers of an LA is the scalability of the development environment [9]. Scalability refers to the ability of the coding technologies to handle increasing amounts of data and user interactions without sacrificing performance. Assessing how well the coding technologies cope with larger datasets and higher user loads makes it possible to determine whether they can support the growing demands of LA applications; without scalable coding technologies, LA applications may face performance issues, struggle to process and analyze the vast amounts of data generated, and compromise the potential insights and benefits that an LA can provide. Assessing and prioritizing scalability is therefore essential as the data and user-interaction demands on LA applications continue to grow. Hypothesis-testing methodology used in conjunction with these metrics can help establish the effectiveness of the SC-FDL approach [57].
In this study, we have only demonstrated the practical application of SC-FDL in the context of LA, although we have proposed several metrics to assess the effectiveness of this approach. We view this not as a limitation but as a promising direction for future research.
2.5. Explanation of Coordinated Working of Different Layers in Lambda Architecture
This section examines the integration and synchronization of the batch, serving, and streaming layers.
The batch layer is characterized by its high-latency nature, meaning that processing a large amount of data would lead to a noticeable delay, as demonstrated by the monitoring statistics. It is imperative that the batch operation selectively eliminate certain statistics, specifically data that are contingent on the frequency of batch execution and the size of the data set. The streaming layer can incorporate the aforementioned ‘incomplete/lost statistics’. This operation can be modeled by Equation (1) [23]:

E_sel = f_d(e_1, …, e_N) = { e_i | 1 ≤ i ≤ N, t(e_i) ≤ T − Δt }    (1)

Equation (1) expresses the manner in which the computation should exclude monitoring events. In the equation, the function f_d represents the process of discarding events, N denotes the total number of raw events before applying the event selection (filter), and T represents the duration of the batch execution. Furthermore, Δt refers to the time interval within which events are discarded from the batch, and the expression T − Δt signifies the subtraction of the time interval Δt from the batch execution time T. The calculation determines the duration required for the selection of events and the emission of all events that satisfy the given condition. Given a batch job with a specified time interval of n hours (n is an application-dependent configuration parameter), it is recommended that the batch discard all events that occur after time T − Δt. The presence of partly computed results can thereby be avoided through the utilization of the streaming layer.
Equation (2) delineates how the chosen events E_sel undergo a mapping procedure f_m, resulting in the creation of key/value pairs [23]:

B = f_r(f_m(E_sel)) → F_spool    (2)

The key represents a distinct identifier for the statistical data, while the value corresponds to the metric values associated with the respective key. Subsequently, the data undergo the reduction process, denoted f_r, to consolidate the values based on the key across all the distributed nodes. These unified values B are then stored in a designated storage folder F_spool. The outcome of the batch process, which is a new file, is written to a designated folder on the Hadoop Distributed File System (HDFS). It is worth noting that this storage layer can be substituted with alternatives.
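A compact PySpark sketch of the operations described by Equations (1) and (2) is shown below: events newer than T − Δt are discarded, the remainder are mapped to key/value pairs and reduced by key, and the result is written to a spool folder on HDFS. The record layout, cutoff values, and paths are illustrative assumptions.

```python
# Minimal PySpark sketch of Equations (1)-(2): filter events up to T - Δt,
# map to (key, value) pairs, reduce by key, and spool the batch view to HDFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-view-sketch").getOrCreate()
sc = spark.sparkContext

T = 1_700_000_000          # illustrative batch execution time (epoch seconds)
delta_t = 3600             # illustrative Δt: leave the last hour to the streaming layer

# Each event: (timestamp, statistic_key, value) — illustrative layout.
events = sc.textFile("hdfs:///data/monitoring/events.csv") \
           .map(lambda ln: ln.split(",")) \
           .map(lambda f: (int(f[0]), f[1], float(f[2])))

selected = events.filter(lambda e: e[0] <= T - delta_t)        # Equation (1)
batch_view = (selected.map(lambda e: (e[1], e[2]))              # f_m: key/value pairs
                      .reduceByKey(lambda a, b: a + b))         # f_r: consolidate by key

batch_view.saveAsTextFile("hdfs:///views/batch/spool")          # Equation (2) spool folder
```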
In the unified batch and streaming layers, formerly calculated statistics (if precomputed statistics are indeed available) must be retrieved from the serving layer. This operation can be described by Equation (3):

S_db = f_load(D_db) = { s ∈ D_db | t(s) > t_now − t_from }    (3)

In Equation (3), S_db represents the precomputed target statistics that have been loaded from the serving layer [23]. The variable t_now denotes the current timestamp, while t_from represents the timestamp from which the statistics will be loaded. The function f_load is the loading procedure used to retrieve data from the serving layer. If the timestamps of the input data D_db, comprising all statistics transmitted from the serving layer to the database load procedure, exceed the difference between the current timestamp t_now and the timestamp t_from, then those statistics S_db are chosen and returned.
Equation (4) represents the value of M_db, which refers to the statistics that have been mapped and stored from the serving layer S_db into memory [23]:

M_db = f_p(f_m(S_db))    (4)

This mapping operation is carried out by the function f_m, which yields key/value pairs. These pairs are then stored in memory and/or on disk by the persistence operation f_p for future use, such as combining with the results of the other layers.
In the streaming layer, the operation is described by Equation (5):

S_st = f_r(f_m(f_f(e_1, …, e_M)))    (5)

In Equation (5), S_st represents the statistical data that have been mapped, aggregated, and computed from a series of monitoring events in a streaming context, and M denotes the total count of monitoring events in the stream [23]. The function f_f filters and transforms the events before they undergo the mapping process performed by the function f_m, which generates key/value pairs. Ultimately, the data undergo the reduction process, denoted f_r, to consolidate the values based on the key.
In the batch layer, the batch-data reading operation is described by Equation (6):

M_b = f_m(f_bload(B))    (6)

In Equation (6), the statistics obtained from storage and mapped are denoted M_b, while B represents the precomputed statistics derived from Equations (1) and (2). The function f_bload is responsible for loading the batch and is utilized to load exclusively the precomputed batch statistics that are considered “new”. As soon as the loading process is completed successfully, the file is marked as “old”. Subsequently, the loaded data undergo the mapping process denoted f_m [23]. The mapping process does not necessitate any statistical reduction, as this has already been accomplished by the batch process.
The next step is synchronization and update. The merging, combining, and synchronization of the computed statistics from all three layers is defined in Equation (7):

U = M_db ⋈ M_b ⋈ S_st    (7)

In Equation (7), the statistics obtained from the serving layer are denoted M_db, the data retrieved from the batch computations are denoted M_b, and the data calculated from the streaming data are denoted S_st. These data sets are combined (joined) and returned as a new data set U [23].
The development of the statistics state in memory is given in Equation (8):

State = f_u(State, U)    (8)

In Equation (8), the variable State represents the storage of both the new and old statistics in memory, which is used for incremental calculation, while U refers to the combination of the statistics M_db, M_b, and S_st [23]. The function f_u serves as the state-update function responsible for updating the statistics and retaining them in memory. If the statistics originate from the serving layer, M_db, and are not already stored in the state, they should be inserted into the state memory. If the data originate from the batch layer, M_b, the state memory should be replaced with the batch statistics. If the statistics originate neither from the serving layer M_db nor from the batch layer, they are derived from the streaming layer, which represents a relatively recent source of statistics; in this case, they should be aggregated with the existing statistics stored in the state memory. If the statistics already exist, they should be updated accordingly; alternatively, if the statistics are entirely new and not present in the state memory, they should be inserted into the memory as fresh statistics, as sketched below.
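The following sketch expresses this state-update rule (Equation (8)) as a plain Python function. The record structure (each statistic tagged with its originating layer) and the aggregation by summation are illustrative assumptions used to make the rule concrete.

```python
# Minimal sketch of the state-update function f_u in Equation (8).
# Each incoming statistic is a tuple (key, value, source), where source is
# "serving", "batch", or "stream" (illustrative representation).
def update_state(state, combined_statistics):
    for key, value, source in combined_statistics:
        if source == "serving":
            state.setdefault(key, value)        # insert only if not already in state
        elif source == "batch":
            state[key] = value                  # batch result replaces the state entry
        else:                                   # streaming: most recent statistics
            state[key] = state.get(key, 0) + value   # aggregate with existing state
    return state

state = {}
updates = [("flights_total", 100, "serving"),
           ("flights_total", 120, "batch"),
           ("flights_total", 5, "stream")]
print(update_state(state, updates))   # {'flights_total': 125}
```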
As a final step, to update the serving layer, only the new and modified statistics are inserted into or updated in the serving layer. This operation can be described by Equation (9):

R = (S_st ∪ M_b) ⋈ M_db    (9)

Equation (9) describes the process of combining the S_st and M_b statistics, followed by a left-join operation, denoted ⋈, with M_db. This operation aims to insert or update the newly obtained and updated statistical data from the batch only if the corresponding batch output exists in the designated spooling location [23]. The resulting outcome R is the integration of both the streamed and batch statistics into the serving layer. The statistics related to M_db need not be rewritten, as they are already present in the serving layer. To ensure the integration of each partition of statistics, it is essential to establish a connection with the serving layer for every partition; this connection enables the implementation of bulk update or insert operations. Specifically, existing records in the serving layer are updated if they already exist, while new records are inserted if they do not. Ultimately, it is imperative to establish a designated interval-based checkpoint system to facilitate recovery in the event of a potential failure [23].
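To make this final step concrete, the sketch below performs a partition-wise bulk upsert of the combined streaming and batch statistics into a Cassandra-based serving layer, relying on the fact that a Cassandra INSERT acts as an upsert. The keyspace, table, and column names are illustrative assumptions, as is the use of foreachPartition so that one connection is opened per partition.

```python
# Minimal sketch of Equation (9): upsert the combined stream/batch statistics
# into the serving layer, one connection per partition (Cassandra INSERT = upsert).
from cassandra.cluster import Cluster

def upsert_partition(rows):
    cluster = Cluster(["127.0.0.1"])                     # illustrative contact point
    session = cluster.connect("serving")                 # illustrative keyspace
    stmt = session.prepare(
        "INSERT INTO statistics (stat_key, stat_value) VALUES (?, ?)")
    for key, value in rows:
        session.execute(stmt, (key, value))              # update if exists, insert if not
    cluster.shutdown()

# combined_rdd holds the (key, value) pairs produced by Equation (9);
# interval-based checkpointing supports recovery after a failure.
# sc.setCheckpointDir("hdfs:///checkpoints/serving-update")
# combined_rdd.foreachPartition(upsert_partition)
```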