Managing and Optimizing Big Data Workloads for On-Demand User Centric Reports
Abstract
1. Introduction
2. Background
- Data storage: Storing and maintaining large amounts of data can be expensive, since it involves hardware, software, maintenance, and data replication for mitigating possible faults. Compression algorithms such as zstd offer high compression ratios that can help save storage space [7] (a short compression sketch follows this list).
- Data processing: Big data requires specialized tools and techniques, such as distributed processing, to handle such volumes in a timely manner.
- Data integration: Datasets are usually created by merging different events and records. Because these come from various sources, they are difficult to analyze together, and they generally contain many inconsistencies that must be normalized along the pipeline. This requires data cleaning, transformation, and normalization techniques that can guarantee a consistent dataset format.
- Query processing: Datasets are queried using a variety of approaches with different types of filters and parameters, some of which are more computationally heavy than others.
- The “3V’s” (volume, velocity, variety), coined by Laney in his article “3-D Data Management: Controlling Data Volume, Velocity, and Variety” [8], are crucial in big data workflows. The performance of each big data application is determined by the volume of data the job needs to handle; the velocity, i.e., the speed at which data travels from point A to point B and from one server/node to another; and the variety, precision, type, and structure of the data [9]. Over time, two more Vs (value and veracity) were added, which strengthened the big data sector by providing more effective ways of characterizing big data. The fourth V, veracity, refers to the accuracy, quality, and trust level of the data: missing or incomplete pieces may fail to provide valuable insights. The fifth V refers to value, that is, the impact that big data can have on organizations. The newer “5V’s” model should be taken into account by any academic or industrial organization (for more details about the model, see [10]). Even more “V’s” can be added to the model for better big data solutions; for instance, the authors of [11] provide some valuable insights into the “10Vs” of big data.
- Data security and privacy are other important aspects of big data [9], especially in recent years, as both consumers and companies are increasingly concerned about the privacy and security of their data. In addition to the obvious need to secure and obfuscate data, new regulations require companies to store sensitive user information as close to the users’ location as possible. For example, European companies are required by law to store information on European servers rather than US servers. This presents a challenge for companies, as they must change their storage methods, which can be difficult and expensive due to the high networking costs of transferring large amounts of data between physical locations, as opposed to between clusters in the same location (a small anonymization sketch follows this list).
- Infrastructure faults are another sensitive topic [9]. Hardware systems are reliable only for a certain amount of time; they must be upgraded when they no longer perform at the desired level, or replaced when they fail. Besides being unable to serve their workloads during such maintenance periods, companies cannot afford to permanently lose data in the case of a failure. Because of this, companies usually rely on two or three additional replication clusters that can take over traffic if the main cluster fails. Although this strategy helps with load distribution, it increases both maintenance and running costs, since all the data must be copied into those extra clusters.
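To make the storage point concrete, the following is a minimal Python sketch of compressing a raw event file with zstd, assuming the `zstandard` package (`pip install zstandard`); the file names and compression level are illustrative, not taken from the paper.

```python
# Minimal sketch: compressing a dataset file with zstd.
import zstandard as zstd

with open("events.json", "rb") as src:   # hypothetical input file
    raw = src.read()

cctx = zstd.ZstdCompressor(level=19)     # higher level = better ratio, slower
compressed = cctx.compress(raw)

with open("events.json.zst", "wb") as dst:
    dst.write(compressed)

print(f"compression ratio: {len(raw) / len(compressed):.1f}x")
```

A high level such as 19 trades compression speed for ratio, which usually suits archival datasets that are written once and read rarely.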
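Likewise, for the security and privacy point, below is a minimal sketch of one common obfuscation technique, salted one-way hashing of a user identifier; the field names and salt handling are hypothetical, not from the paper.

```python
# Minimal sketch: anonymizing a user identifier before storage.
import hashlib

def anonymize_user_id(user_id: str, salt: str = "per-deployment-secret") -> str:
    """One-way hash so records remain joinable without exposing the raw ID."""
    return hashlib.sha256((salt + user_id).encode()).hexdigest()

record = {"user_id": "alice@example.com", "pm25": 12.4}  # hypothetical record
record["user_id"] = anonymize_user_id(record["user_id"])
print(record)
```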
3. Use Cases and State of the Art
- Static reporting—fully automated, usually developed by an analyst. These reports have a well-defined structure and content, usually containing general or high-level information, since they are distributed to a large audience. Because of this, users interested in an in-depth analysis of specific aspects may have to compose their own reports manually, drawing from different sources over a long period of time to gain useful insight for their use cases. This causes fragmentation and requires hours of sifting through data. One of the most common types of static report is the daily report.
- Ad hoc reporting (dynamic)—produced once and run by the user, with the goal of providing insights about a specific case; it is more visual, oriented to that specific user, and dedicated to a smaller audience. On-demand reporting is a mandatory tool for many industries, satisfying the need for self-service business intelligence [14]. This type of reporting offers high customizability and ease of use, reduces IT workloads because of its self-service nature, and saves time and money, since technical staff are no longer required to write custom queries to generate reports.
4. Problem Formulation and Research Questions
- RQ1: When should MapReduce be considered instead of Druid?
- A1: It depends on the cost, volume, and velocity of the data, but overall, Druid is better for small amounts of data, usually under the tens-of-terabytes scale, where it can compute query results in seconds without significant cost. MapReduce starts to shine where Druid starts to struggle. As will be shown later, executing MapReduce operations on small datasets is not recommended, since the overhead of the distributed system (coordinating the nodes, starting the workers, and so on) is too high, taking more time than the actual job.
- RQ2: Storage vs. processing power: which is better for achieving performance in MapReduce?
- A2: If the focus is on faster execution regardless of cost, the only way to reduce time is to reduce processing. Processing can be reduced by storing the data in a format that favors easier loading and filtering, even if that means data duplication and increased storage costs. When both time and cost matter, the frequency of the report should also be considered: depending on the dataset, starting from roughly 10–20 daily reports, it can be cheaper to accept data duplication than to keep spending on brute-forcing the dataset to produce tens of reports (a short break-even sketch follows this question list).
- RQ3: What are the benefits of pre-processing/partitioning the dataset? Is it always worth it?
- A3: Re-distributing the dataset according to favorable criteria can improve overall performance, since the reporting job no longer has to load the entire dataset. On a raw dataset, creating a report for a specific user implies loading the data of all users and throwing away the records that do not match the user ID. If the dataset is pre-processed and stored in user-based folders, there is no need to load the entire dataset, saving a lot of processing power and time. Re-organizing the dataset is worthwhile only beyond a certain report-request frequency, once the processing power wasted on brute-force filtering exceeds the processing power spent a single time on re-partitioning the dataset in a favorable way (a partitioning sketch follows this question list).
- RQ4: How good is the proposed solution compared with other baselines?
- A4: The performance is strongly tied to the characteristics of the dataset. In the following examples, compared to a brute-force approach, a performance increase of four digits can be seen, both in terms of cost and time spent. This is possible because, compared to the baseline solution, creating reports on a dataset structured for partition pruning takes significantly less processing power and time.
- RQ5: From the business point of view, what is the impact of going with this solution?
- A5: Going for a pre-processed dataset obviously means more moving parts in the end product, but it is well worth it under the above-mentioned conditions, since it ultimately results in faster jobs and lower costs. On top of the much faster reporting execution time, it also frees a lot of processing power that can be used elsewhere, where it is needed more, ultimately creating more revenue. Even if the design takes more time to build, the return on investment will come quickly, especially for big datasets with tens of terabytes of data.
- RQ6: What are the benefits and limitations of the solution proposed in the current study for handling on-demand user-centric reports?
- A6: Applying the proposed solution significantly reduces the cost and processing power needed to generate dynamic reports. It is mostly applicable to user-centric reports (or similar cases), and it pays off only after the number of requested reports passes a certain daily frequency threshold: the processing power used to partition the data must be less than the total processing power that would otherwise be used to produce the reports with a brute-force approach.
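To illustrate the break-even reasoning in A2 and A3, a minimal Python sketch follows. The per-report vCore figures are taken from the 6-month rows of the quantitative tables below; the one-time partitioning cost is an assumed placeholder, not a number from the paper.

```python
# Break-even sketch: after how many reports does a one-time
# re-partitioning of the dataset pay for itself?
def breakeven_reports(partition_cost: float,
                      bruteforce_cost: float,
                      pruned_cost: float) -> float:
    """Reports needed before re-partitioning becomes the cheaper option."""
    return partition_cost / (bruteforce_cost - pruned_cost)

one_time_partition = 50.0   # assumed one-off re-partitioning cost (vCores)
brute_force_report = 11.33  # 6-month brute-force report (vCores, from tables)
pruned_report = 0.21        # 6-month report with partition pruning (vCores)

n = breakeven_reports(one_time_partition, brute_force_report, pruned_report)
print(f"re-partitioning pays off after ~{n:.1f} reports")  # ~4.5
```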
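To make A3 concrete, here is a minimal PySpark sketch of user-based re-partitioning with partition pruning; the storage paths, the `user_id` column name, and the choice of Parquet are illustrative assumptions, not the paper's exact setup.

```python
# Minimal PySpark sketch: re-partition once, then prune on every report.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("user-partitioning").getOrCreate()

raw = spark.read.parquet("s3://bucket/raw-events/")  # hypothetical location

# One-time job: re-write the dataset partitioned by user ID, so that
# later report jobs read only the matching sub-folder.
(raw.write
    .partitionBy("user_id")
    .parquet("s3://bucket/events-by-user/"))

# Report job: filtering on the partition column triggers partition
# pruning, so only the user_id=42 folder is loaded, not the full dataset.
report_df = (spark.read.parquet("s3://bucket/events-by-user/")
             .where("user_id = 42"))
report_df.show()
```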
5. Materials and Methods
5.1. Prerequisites
5.2. Model Description
6. Quantitative Results
7. Discussion and Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Sumits, A. The History and Future of Internet Traffic. Available online: https://blogs.cisco.com/sp/the-history-and-future-of-internet-traffic (accessed on 17 December 2022).
2. O’Dea, S. Monthly Internet Traffic in the U.S. 2018–2023. Available online: https://www.statista.com/statistics/216335/data-usage-per-month-in-the-us-by-age/ (accessed on 17 December 2022).
3. Heffernan, V. Is Moore’s Law Really Dead? Available online: https://www.wired.com/story/moores-law-really-dead/ (accessed on 27 December 2022).
4. Agrawal, D.; Bernstein, P.; Bertino, E.; Davidson, S.; Dayal, U.; Franklin, M.; Gehrke, J.; Haas, L.; Halevy, A.; Han, J.; et al. Challenges and Opportunities with Big Data 2011-1; Purdue University Libraries: West Lafayette, IN, USA, 2011.
5. Takahashi, D. Intel: Moore’s Law Isn’t Slowing Down. Available online: https://venturebeat.com/business/intel-moores-law-isnt-slowing-down/ (accessed on 27 December 2022).
6. Eeckhout, L. Is Moore’s Law Slowing Down? What’s Next? IEEE Micro 2017, 37, 4–5.
7. Collet, Y.; Kucherawy, M. Zstandard Compression and the Application/zstd Media Type; RFC 8478; Internet Engineering Task Force (IETF), 2018. Available online: https://www.rfc-editor.org/rfc/rfc8478 (accessed on 5 March 2023).
8. Laney, D. 3D Data Management: Controlling Data Volume, Velocity and Variety. META Group Res. Note 2001, 6, 1.
9. Tole, A.A. Big data challenges. Database Syst. J. 2013, 4, 31–40.
10. Ishwarappa; Anuradha, J. A Brief Introduction on Big Data 5Vs Characteristics and Hadoop Technology. Procedia Comput. Sci. 2015, 48, 319–324.
11. Awan, M.J.; Bilal, M.H.; Yasin, A.; Nobanee, H.; Khan, N.S.; Zain, A.M. Detection of COVID-19 in Chest X-ray Images: A Big Data Enabled Deep Learning Approach. Int. J. Environ. Res. Public Health 2021, 18, 10147.
12. Nasser, T.; Tariq, R. Big data challenges. J. Comput. Eng. Inf. Technol. 2015, 4, 9307.
13. Haafza, L.A.; Awan, M.J.; Abid, A.; Yasin, A.; Nobanee, H.; Farooq, M.S. Big data COVID-19 systematic literature review: Pandemic crisis. Electronics 2021, 10, 3125.
14. Self-Service BI. Available online: https://learn.microsoft.com/ (accessed on 1 April 2023).
15. Druid Use-Cases. Available online: https://druid.apache.org/use-cases (accessed on 2 April 2023).
16. AWS Druid Costs. Available online: https://aws.amazon.com/marketplace/pp/prodview-4n6wdupx4okgw (accessed on 29 December 2022).
17. Roginski, M. When Should I Use Apache Druid? Try This Checklist. Available online: https://www.rilldata.com/blog/when-should-i-use-apache-druid (accessed on 29 December 2022).
18. Polato, I.; Ré, R.; Goldman, A.; Kon, F. A comprehensive view of Hadoop research—A systematic literature review. J. Netw. Comput. Appl. 2014, 46, 1–25.
19. Wang, K.; Khan, M.M.H. Performance prediction for Apache Spark platform. In Proceedings of the 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems, New York, NY, USA, 24–26 August 2015; pp. 166–173.
20. Thusoo, A.; Sarma, J.S.; Jain, N.; Shao, Z.; Chakka, P.; Zhang, N.; Antony, S.; Liu, H.; Murthy, R. Hive—A petabyte scale data warehouse using Hadoop. In Proceedings of the 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), Long Beach, CA, USA, 1–6 March 2010; pp. 996–1005.
21. The Apache Software Foundation. Apache Oozie Workflow Scheduler for Hadoop. Available online: https://oozie.apache.org/ (accessed on 20 January 2023).
22. Cauteruccio, F.; Giudice, P.L.; Musarella, L.; Terracina, G.; Ursino, D.; Virgili, L. A lightweight approach to extract interschema properties from structured, semi-structured and unstructured sources in a big data scenario. Int. J. Inf. Technol. Decis. Mak. 2020, 19, 849–889.
23. Arnholt, A.T. Passion Driven Statistics. Available online: https://alanarnholt.github.io/PDS-Bookdown2/skewed-right-distributions.html (accessed on 2 January 2023).
24. Statz, D. Handling Data Skew in Apache Spark. Available online: https://itnext.io/handling-data-skew-in-apache-spark-9f56343e58e8 (accessed on 2 January 2023).
25. Reursora, K. Generating Random Numbers with Uniform Distribution in Python. Available online: https://linuxhint.com/generating-random-numbers-with-uniform-distribution-in-python/ (accessed on 2 January 2023).
26. Cauteruccio, F.; Terracina, G.; Ursino, D. Generalizing identity-based string comparison metrics: Framework and techniques. Knowl.-Based Syst. 2020, 187, 104820.
27. Open Air Quality. Available online: https://openaq.org/ (accessed on 16 February 2023).
28. OpenAQ Amazon S3 Bucket. Available online: https://registry.opendata.aws/openaq/ (accessed on 16 February 2023).
Challenges and Problems | Context and Details | Solution |
---|---|---|
Data storage | storing and maintaining large amounts of data can be expensive | sanitize the final datasets, focus on the veracity and value, use compression algorithms |
Data processing | big data requires powerful, specialized tools and techniques | use distributed systems for handling the datasets, store them in ways that improve performance |
Data integration | datasets are usually composed from events coming from different sources | data cleaning, transformation and normalization techniques |
Query processing | datasets can be queried in multiple ways, some of them being more computational intensive than others | design the systems in a way that can handle various requests, techniques such as caching or partition pruning might be used to ease queries |
Data security and privacy | users and companies are more and more interested in protecting their data, governments are also imposing rigorous standards | obfuscate data, drop sensitive columns, anonymize the datasets |
Infrastructure faults | hardware has a limited lifetime and the datasets should be able to survive different hazards | replication servers in different physical locations |
Scalability | the solution should be able to scale, to be extensible and ready for new challenges | use generic solutions, do not rely on hardware tuning |
Interval | Run Time | vCores | Memory GB Hours | Storage GB Hours | Input Size GB | Input Records | Remaining Records after Filtering | Total Time across All Tasks |
---|---|---|---|---|---|---|---|---|
1 day | 2 m 3 s | 0.24 | 0.98 | 1.23 | 1.46 | 3,361,183 | 741 | 3 m |
1 month | 6 m 24 s | 1.64 | 6.58 | 8.23 | 47 | 110,385,904 | 30,963 | 1 h 6 m |
3 months | 12 m 45 s | 6.13 | 24.52 | 30.65 | 179 | 416,300,428 | 115,636 | 6 h |
6 months | 22 m 27 s | 11.33 | 45.32 | 53.65 | 366 | 852,056,650 | 237,740 | 9 h 30 m |
Interval | Run Time | vCores | Memory GB Hours | Storage GB Hours |
---|---|---|---|---|
1 day | 3 m 38 s | 0.42 | 1.71 | 2.14 |
6 months | 1 h 49 m | 57.52 | 230.10 | 287.63 |
Interval | Run Time | vCores | Memory GB Hours | Storage GB Hours | Input Size | Input Records | Remaining Records after Filtering | Total Time across All Tasks |
---|---|---|---|---|---|---|---|---|
1 day | 1 m 2 s | 0.16 | 0.67 | 0.84 | 87 KB | 6104 | 741 | 3 s |
1 month | 1 m 4 s | 0.17 | 0.71 | 0.88 | 3.3 MB | 242,752 | 30,963 | 32 s |
3 months | 1 m 8 s | 0.18 | 0.75 | 0.93 | 13.5 MB | 969,770 | 115,636 | 36 s |
6 months | 1 m 10 s | 0.21 | 0.84 | 1.05 | 27 MB | 1,939,540 | 237,740 | 55 s |
Interval | Run Time | vCores | Memory GB Hours | Storage GB Hours | Input Size | Input Records | Remaining Records | Total Time across All Tasks |
---|---|---|---|---|---|---|---|---|
6 months | 22 m 27 s | 11.33 | 45.32 | 53.65 | 366 GB | 852,056,650 | 237,740 | 9 h 30 m |
6 months | 1 m 10 s | 0.21 | 0.84 | 1.05 | 27 MB | 1,939,540 | 237,740 | 55 s |