Choosing a Data Storage Format in the Apache Hadoop System Based on Experimental Evaluation Using Apache Spark
Abstract
1. Introduction
2. Background
3. Method and Experiment
- Experimental evaluation of the studied data storage formats.
- Analysis of Spark data processing functions using different storage formats.
3.1. Experimental Evaluation
3.2. Analysis of the Spark Algorithm
- Stages count;
- Task count on each stage;
- Shuffle spill (memory/disk) on each stage;
- Median value statistics.

- Searching for unique objects;
- Data filtering;
- Sorting.
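The three analyzed functions can be illustrated with a minimal sketch in plain Python over records matching the schema in Appendix A. This is not the benchmark code itself (which ran on Apache Spark in Java); the records and values are made up, and each operation is only analogous to its Spark counterpart, as noted in the comments.

```python
# Illustrative sketch (not the actual benchmark code): the three analyzed
# operations applied to records shaped like the schema in Appendix A.
# The record values below are invented for the example.
records = [
    {"name": "Ann", "country": "US", "balance": 150},
    {"name": "Bob", "country": "DE", "balance": 90},
    {"name": "Ann", "country": "US", "balance": 150},  # duplicate record
]

# Searching for unique objects (analogous to Spark's Dataset.distinct()):
# hash each record as a sorted tuple of its fields, then rebuild dicts.
unique = [dict(t) for t in {tuple(sorted(r.items())) for r in records}]

# Data filtering (analogous to Dataset.filter()).
filtered = [r for r in records if r["balance"] > 100]

# Sorting (analogous to Dataset.orderBy()).
sorted_records = sorted(records, key=lambda r: r["balance"])
```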
4. Results
- Paired comparisons of each alternative;
- Comparison of the criteria themselves;
- Solution of the optimization task.
In the pairwise comparison matrix, the entry a_ij describes the degree of preference for alternative i over alternative j, with the reciprocal entries satisfying a_ji = 1/a_ij.
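The step from a pairwise comparison matrix to a priority vector can be sketched as follows. This is a minimal illustration using the geometric-mean (row-mean) method, one standard way to derive weights in the analytic hierarchy process; the 3x3 matrix is a made-up example, not data from the paper, and the authors' own solution uses tropical optimization.

```python
import math

# Sketch of deriving priorities from a pairwise comparison matrix:
# a_ij gives the degree of preference of alternative i over j, and the
# matrix is reciprocal (a_ji = 1 / a_ij). Example matrix, invented here.
A = [
    [1.0,  2.0, 4.0],
    [0.5,  1.0, 2.0],
    [0.25, 0.5, 1.0],
]

def priority_vector(matrix):
    # Geometric mean of each row, normalized so the weights sum to 1.
    gms = [math.prod(row) ** (1.0 / len(row)) for row in matrix]
    total = sum(gms)
    return [g / total for g in gms]

weights = priority_vector(A)
```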
- Platform independence is not the most important characteristic, because the study aims to find the optimal file format for the Apache Hadoop system.
- The ability to record complex structures plays an important role, since it provides greater opportunities for data processing and analysis.
- The ability to modify data is not critical, since most big data storage platforms follow the “write once—read many” principle.
- The possibility of compression plays an indirect role, since it affects the volume of data.
- The presence of metadata is an indicator that does not require separate analysis, because its effect appears in the speed of reading and grouping data.
- The data volume plays an important role in the processing and storage of big data, but it is not critical, since storage hardware has become much cheaper in recent years.
- Reading all lines is an important indicator, since it most fully reflects the speed of data processing with a particular storage format.
- The filter and the search for unique values are equally important characteristics; however, both rely on reading all lines, the importance of which is defined in the previous paragraph.
- Applying a function, grouping, and finding the minimum value are the next most important indicators, since they are of interest more from the point of view of analytics than of engineering.
- Sorting is the least important of the criteria presented, as it is most often used for data visualization.
- Equal importance = 1;
- More (less) important = 2 (1/2);
- Much more important = 4 (1/4);
- Critical = 6 (1/6);
- If necessary, intermediate values may be used.
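The scale above only needs to be specified for the upper triangle of the comparison matrix; the lower triangle follows from reciprocity. A small sketch (the helper function and the sample judgments are assumptions for illustration, not code from the paper):

```python
# Mapping the verbal scale above to numeric judgments; the lower triangle
# is completed with a_ji = 1 / a_ij and the diagonal is fixed at 1.
SCALE = {
    "equal": 1.0,
    "more important": 2.0,
    "much more important": 4.0,
    "critical": 6.0,
}

def comparison_matrix(judgments, n):
    """Build an n x n reciprocal matrix from upper-triangle judgments.

    `judgments` maps a pair (i, j) with i < j to a verbal label.
    This helper is a hypothetical illustration of the procedure.
    """
    m = [[1.0] * n for _ in range(n)]
    for (i, j), label in judgments.items():
        m[i][j] = SCALE[label]
        m[j][i] = 1.0 / SCALE[label]
    return m

# Example: criterion 0 is "more important" than 1 and "critical" over 2.
M = comparison_matrix({(0, 1): "more important", (0, 2): "critical"}, 3)
```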
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Experimental Resources
Element | Characteristics |
---|---|
CPU | Intel Core i7-8565U 1.8 GHz 4 cores |
RAM | 16 GB |
Operating system | Windows 10 x64 |
Platform | Java Virtual Machine |
Programming language used | Java v. 1.8 |
The framework used | Apache Spark v. 2.4 |
Field Name | Data Type |
---|---|
name | string |
surname | string |
age | 32 bit integer |
country | string |
balance | 64 bit integer |
card number | string |
currency | string |
account open date | calendar |
Appendix B. Statistics of the Operations Performed
Criteria | Avro | CSV | JSON | ORC | Parquet |
---|---|---|---|---|---|
Task count | 9 | 10 | 20 | 9 | 9 |
Shuffle spill (memory/disk) | 0.0 B/ 703.0 MB | 0.0 B/ 611.5 MB | - | 0.0 B/ 702.9 MB | 0.0 B/ 703.2 MB |
Median Values Statistics by Task | |||||
Scheduler Delay | 20 ms | 11 ms | 6 ms | 15 ms | 9 ms |
Task Deserialization Time | 29 ms | 21 ms | 3 ms | 30 ms | 17 ms |
Garbage Collection Time | 2 s | 2 s | 0.3 s | 2 s | 3 s |
Peak Execution Memory | 232.0 MB | 248.4 MB | 144.0 MB | 232.0 MB | 232.0 MB |
Shuffle spill (memory/disk) | 0.0 B/ 85.5 MB | 0.0 B/ 76.2 MB | - | 0.0 B/ 85.5 MB | 0.0 B/ 85.5 MB |
Criteria | Avro | CSV | JSON | ORC | Parquet |
---|---|---|---|---|---|
Task count | 200 | 200 | 200 | 200 | 200 |
Shuffle write | 11.5 KB/ 200 | 11.5 KB/ 200 | 11.5 KB/ 200 | 11.5 KB/ 200 | 11.5 KB/200 |
Median Values Statistics by Task | |||||
Scheduler Delay | 1 ms | 2 ms | 1 ms | 1 ms | 2 ms |
Task Deserialization Time | 1 ms | 2 ms | 2 ms | 1 ms | 1 ms |
Garbage Collection Time | 0 ms | 0 ms | 0 ms | 0 ms | 0 ms |
Peak Execution Memory | 18.0 MB | 18.0 MB | 18.0 MB | 18.0 MB | 18.0 MB |
Shuffle Read Size/Records | 4.0 MB/ 50,018 | 4.5 MB/50,013 | 4.1 MB/ 50,018 | 4.0 MB/ 50,018 | 4.0 MB/ 50,018 |
Shuffle Write Size/Records | 59.0 B/1 | 59.0 B/1 | 59.0 B/1 | 59.0 B/1 | 59.0 B/1 |
Criteria | Avro | CSV | JSON | ORC | Parquet |
---|---|---|---|---|---|
Task count | 1 | 1 | 1 | 1 | 1 |
Median Values Statistics by Task | |||||
Scheduler Delay | 2 ms | 0 ms | 0 ms | 0 ms | 1 ms |
Task Deserialization Time | 1 ms | 2 ms | 1 ms | 2 ms | 0 ms |
Garbage Collection Time | 0 ms | 0 ms | 0 ms | 0 ms | 0 ms |
Peak Execution Memory | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
Criteria | Avro | CSV | JSON | ORC | Parquet |
---|---|---|---|---|---|
Task count | 9 | 10 | 20 | 9 | 9 |
Median Values Statistics by Task | |||||
Scheduler Delay | 52 ms | 48 ms | 17 ms | 0.1 s | 28 ms |
Task Deserialization Time | 26 ms | 20 ms | 4 ms | 32 ms | 10 ms |
Garbage Collection Time | 97 ms | 93 ms | 17 ms | 45 ms | 43 ms |
Peak Execution Memory | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
Criteria | Avro | CSV | JSON | ORC | Parquet |
---|---|---|---|---|---|
Task count | 9 | 10 | 20 | 9 | 9 |
Median Values Statistics by Task | |||||
Scheduler Delay | 4 ms | 3 ms | 2 ms | 3 ms | 2 ms |
Task Deserialization Time | 17 ms | 1 ms | 7 ms | 11 ms | 18 ms |
Garbage Collection Time | 0.2 s | 0.2 s | 86 ms | 0.2 s | 0.1 s |
Peak Execution Memory | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
Shuffle Write Size/Records | 5.5 MB/1,166,764 | 4.8 MB/1,000,000 | 2.8 MB/582,729 | 5.5 MB/1,166,764 | 5.6 MB/1,166,764 |
Criteria | Avro | CSV | JSON | ORC | Parquet |
---|---|---|---|---|---|
Task count | 200 | 200 | 200 | 200 | 200 |
Shuffle write | 11.5 KB/ 200 | 11.5 KB/ 200 | 11.5 KB/ 200 | 11.5 KB/ 200 | 11.5 KB/200 |
Median Values Statistics by Task | |||||
Scheduler Delay | 2 ms | 1 ms | 2 ms | 1 ms | 1 ms |
Task Deserialization Time | 2 ms | 2 ms | 2 ms | 1 ms | 2 ms |
Garbage Collection Time | 0 ms | 0 ms | 0 ms | 0 ms | 0 ms |
Peak Execution Memory | 10.0 MB | 10.0 MB | 10.0 MB | 10.0 MB | 10.0 MB |
Shuffle Read Size/Records | 242.7 KB/50,856 | 243.9 KB/50,955 | 243.3 KB/50,943 | 242.5 KB/50,934 | 243.8 KB/50,908 |
Shuffle Write Size/Records | 59.0 B/1 | 59.0 B/1 | 59.0 B/1 | 59.0 B/1 | 59.0 B/1 |
Criteria | Avro | CSV | JSON | ORC | Parquet |
---|---|---|---|---|---|
Task count | 1 | 1 | 1 | 1 | 1 |
Median Values Statistics by Task | |||||
Scheduler Delay | 1 ms | 2 ms | 1 ms | 0 ms | 1 ms |
Task Deserialization Time | 1 ms | 1 ms | 1 ms | 1 ms | 1 ms |
Garbage Collection Time | 0 ms | 0 ms | 0 ms | 0 ms | 0 ms |
Peak Execution Memory | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
Criteria | Avro | CSV | JSON | ORC | Parquet |
---|---|---|---|---|---|
Task count | 200 | 200 | 200 | 200 | 200 |
Shuffle write | 11.5 KB/ 200 | 11.5 KB/ 200 | 11.5 KB/ 200 | 11.5 KB/ 200 | 11.5 KB/200 |
Median Values Statistics by Task | |||||
Scheduler Delay | 2 ms | 1 ms | 2 ms | 1 ms | 1 ms |
Task Deserialization Time | 2 ms | 2 ms | 2 ms | 1 ms | 2 ms |
Garbage Collection Time | 0 ms | 0 ms | 0 ms | 0 ms | 0 ms |
Peak Execution Memory | 10.0 MB | 10.0 MB | 10.0 MB | 10.0 MB | 10.0 MB |
Shuffle Read Size/Records | 242.7 KB/50,856 | 243.9 KB/50,955 | 243.3 KB/50,943 | 242.5 KB/50,934 | 243.8 KB/50,908 |
Shuffle Write Size/Records | 59.0 B/1 | 59.0 B/1 | 59.0 B/1 | 59.0 B/1 | 59.0 B/1 |
Criteria | Avro | CSV | JSON | ORC | Parquet |
---|---|---|---|---|---|
Task count | 1 | 1 | 1 | 1 | 1 |
Median Values Statistics by Task | |||||
Scheduler Delay | 1 ms | 2 ms | 1 ms | 0 ms | 1 ms |
Task Deserialization Time | 1 ms | 1 ms | 1 ms | 1 ms | 1 ms |
Garbage Collection Time | 0 ms | 0 ms | 0 ms | 0 ms | 0 ms |
Peak Execution Memory | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
Appendix C. Matrices of Alternatives Comparisons
Criteria | Platform Independence | Recording Complex Structure | Volume | Reading All Lines | Filter | Unique Values | Grouping | Sorting |
---|---|---|---|---|---|---|---|---|
Platform independence | 1 | 1/2 | 1/2 | 1/5 | 1/5 | 1/4 | 1/4 | 1/2 |
Recording complex structure | 2 | 1 | 1 | 1/4 | 1/4 | 1/2 | 1/2 | 1 |
Volume | 2 | 1 | 1 | 1/4 | 1/4 | 1/2 | 1/2 | 1 |
Reading all lines | 5 | 4 | 4 | 1 | 1 | 1/2 | 1/2 | 2 |
Filter | 5 | 4 | 4 | 1 | 1 | 1/2 | 1/2 | 2 |
Unique values | 4 | 2 | 2 | 2 | 2 | 1 | 1 | 4 |
Grouping | 4 | 2 | 2 | 2 | 2 | 1 | 1 | 4 |
Sorting | 2 | 1 | 1 | 1/2 | 1/2 | 1/4 | 1/4 | 1 |
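A quick sanity check on the criteria matrix above is that it is reciprocal, i.e., a_ji = 1/a_ij for every pair. The matrix values below are transcribed directly from the table; the checking function itself is a small illustration, not code from the paper.

```python
# The 8x8 criteria comparison matrix from the table above, rows/columns
# in the same order: platform independence, recording complex structure,
# volume, reading all lines, filter, unique values, grouping, sorting.
C = [
    [1,   1/2, 1/2, 1/5, 1/5, 1/4, 1/4, 1/2],
    [2,   1,   1,   1/4, 1/4, 1/2, 1/2, 1],
    [2,   1,   1,   1/4, 1/4, 1/2, 1/2, 1],
    [5,   4,   4,   1,   1,   1/2, 1/2, 2],
    [5,   4,   4,   1,   1,   1/2, 1/2, 2],
    [4,   2,   2,   2,   2,   1,   1,   4],
    [4,   2,   2,   2,   2,   1,   1,   4],
    [2,   1,   1,   1/2, 1/2, 1/4, 1/4, 1],
]

def is_reciprocal(m, tol=1e-9):
    # A valid pairwise comparison matrix must satisfy a_ji = 1 / a_ij.
    n = len(m)
    return all(abs(m[j][i] - 1.0 / m[i][j]) < tol
               for i in range(n) for j in range(n))
```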
Avro | CSV | JSON | ORC | Parquet | |
---|---|---|---|---|---|
avro | 1 | 1 | 1 | 3 | 3 |
csv | 1 | 1 | 1 | 3 | 3 |
json | 1 | 1 | 1 | 3 | 3 |
orc | 1/3 | 1/3 | 1/3 | 1 | 1 |
parquet | 1/3 | 1/3 | 1/3 | 1 | 1 |
Avro | CSV | JSON | ORC | Parquet | |
---|---|---|---|---|---|
avro | 1 | 3 | 1 | 1 | 1 |
csv | 1/3 | 1 | 1/3 | 1/3 | 1/3 |
json | 1 | 3 | 1 | 1 | 1 |
orc | 1 | 3 | 1 | 1 | 1 |
parquet | 1 | 3 | 1 | 1 | 1 |
Avro | CSV | JSON | ORC | Parquet | |
---|---|---|---|---|---|
avro | 1 | 2 | 4 | 1/2 | 1/2 |
csv | 1/2 | 1 | 3 | 1/3 | 1/3 |
json | 1/4 | 1/3 | 1 | 1/4 | 1/4 |
orc | 2 | 3 | 4 | 1 | 1 |
parquet | 2 | 3 | 4 | 1 | 1 |
Avro | CSV | JSON | ORC | Parquet | |
---|---|---|---|---|---|
avro | 1 | 1 | 3/2 | 3/4 | 1/2 |
csv | 1 | 1 | 3/2 | 3/4 | 1/2 |
json | 2/3 | 2/3 | 1 | 1/2 | 1/3 |
orc | 4/3 | 4/3 | 2 | 1 | 2/3 |
parquet | 2 | 2 | 3 | 3/2 | 1 |
Avro | CSV | JSON | ORC | Parquet | |
---|---|---|---|---|---|
avro | 1 | 1 | 2 | 1/4 | 1/4 |
csv | 1 | 1 | 1/2 | 1/4 | 1/4 |
json | 1/2 | 1/2 | 1 | 1/4 | 1/4 |
orc | 4 | 4 | 4 | 1 | 1 |
parquet | 4 | 4 | 4 | 1 | 1 |
Avro | CSV | JSON | ORC | Parquet | |
---|---|---|---|---|---|
avro | 1 | 8/7 | 8/7 | 1 | 5/7 |
csv | 7/8 | 1 | 1 | 7/8 | 5/8 |
json | 7/8 | 1 | 1 | 7/8 | 5/8 |
orc | 1 | 8/7 | 8/7 | 1 | 5/7 |
parquet | 7/5 | 8/5 | 8/5 | 7/5 | 1 |
Avro | CSV | JSON | ORC | Parquet | |
---|---|---|---|---|---|
avro | 1 | 1/2 | 2 | 1/2 | 1/3 |
csv | 2 | 1 | 3 | 4/5 | 3/5 |
json | 1/2 | 1/3 | 1 | 1/4 | 1/5 |
orc | 2 | 5/4 | 4 | 1 | 3/4 |
parquet | 3 | 5/3 | 5 | 4/3 | 1 |
Avro | CSV | JSON | ORC | Parquet | |
---|---|---|---|---|---|
avro | 1 | 1 | 3/2 | 2/3 | 1/2 |
csv | 1 | 1 | 3/2 | 2/3 | 1/2 |
json | 2/3 | 2/3 | 1 | 1/2 | 1/3 |
orc | 3/2 | 3/2 | 2 | 1 | 2/3 |
parquet | 2 | 2 | 3 | 3/2 | 1 |
References
- Chong, D.; Shi, H. Big data analytics: A literature review. J. Manag. Anal. 2015, 2, 175–201.
- Moro Visconti, R.; Morea, D. Big Data for the Sustainability of Healthcare Project Financing. Sustainability 2019, 11, 3748.
- Ardito, L.; Scuotto, V.; Del Giudice, M.; Messeni, A. A bibliometric analysis of research on Big Data analytics for business and management. Manag. Decis. 2018, 57, 1993–2009.
- Cappa, F.; Oriani, R.; Peruffo, E.; McCarthy, I.P. Big Data for Creating and Capturing Value in the Digitalized Environment: Unpacking the Effects of Volume, Variety and Veracity on Firm Performance. J. Prod. Innov. Manag. 2020.
- Yang, C.; Huang, Q.; Li, Z.; Liu, K.; Hu, F. Big Data and cloud computing: Innovation opportunities and challenges. Int. J. Digit. Earth 2017, 10, 13–53.
- Mavridis, I.; Karatza, H. Performance evaluation of cloud-based log file analysis with Apache Hadoop and Apache Spark. J. Syst. Softw. 2017, 125, 133–151.
- Lee, S.; Jo, J.Y.; Kim, Y. Survey of Data Locality in Apache Hadoop. In Proceedings of the 2019 IEEE International Conference on Big Data, Cloud Computing, Data Science & Engineering (BCD), Honolulu, HI, USA, 29–31 May 2019; pp. 46–53.
- Garg, K.; Kaur, D. Sentiment Analysis on Twitter Data using Apache Hadoop and Performance Evaluation on Hadoop MapReduce and Apache Spark. In Proceedings of the International Conference on Artificial Intelligence (ICAI), Las Vegas, NV, USA, 29 July–1 August 2019; pp. 233–238.
- Hive. 2020 Apache Hive Specification. Available online: https://cwiki.apache.org/confluence/display/HIVE (accessed on 11 January 2021).
- Impala. 2020 Apache Impala Specification. Available online: https://impala.apache.org/impala-docs.html (accessed on 11 January 2021).
- Nazari, E.; Shahriari, M.H.; Tabesh, H. BigData Analysis in Healthcare: Apache Hadoop, Apache Spark and Apache Flink. Front. Health Inform. 2019, 8, 14.
- Salloum, S.; Dautov, R.; Chen, X.; Peng, P.X.; Huang, J.Z. Big data analytics on Apache Spark. Int. J. Data Sci. Anal. 2016, 1, 145–164.
- Krivulin, N. A new algebraic solution to multidimensional minimax location problems with Chebyshev distance. WSEAS Trans. Math. 2012, 11, 605–614.
- Gusev, A.; Ilin, D.; Nikulchev, E. The Dataset of the Experimental Evaluation of Software Components for Application Design Selection Directed by the Artificial Bee Colony Algorithm. Data 2020, 5, 59.
- Ramírez, A.; Parejo, J.A.; Romero, J.R.; Segura, S.; Ruiz-Cortés, A. Evolutionary composition of QoS-aware web services: A many-objective perspective. Expert Syst. Appl. 2017, 72, 357–370.
- Gholamshahi, S.; Hasheminejad, S.M.H. Software component identification and selection: A research review. Softw. Pract. Exp. 2019, 49, 40–69.
- Gusev, A.; Ilin, D.; Kolyasnikov, P.; Nikulchev, E. Effective Selection of Software Components Based on Experimental Evaluations of Quality of Operation. Eng. Lett. 2020, 28, 420–427.
- Kudzh, S.A.; Tsvetkov, V.Y.; Rogov, I.E. Life cycle support software components. Russ. Technol. J. 2020, 8, 19–33.
- Munir, R.F.; Abelló, A.; Romero, O.; Thiele, M.; Lehner, W. A cost-based storage format selector for materialized results in big data frameworks. Distrib. Parallel Databases 2020, 38, 335–364.
- Nicholls, B.; Adangwa, M.; Estes, R.; Iradukunda, H.N.; Zhang, Q.; Zhu, T. Benchmarking Resource Usage of Underlying Datatypes of Apache Spark. arXiv 2020, arXiv:2012.04192. Available online: https://arxiv.org/abs/2012.04192 (accessed on 11 January 2021).
- Wang, X.; Xie, Z. The Case for Alternative Web Archival Formats to Expedite the Data-to-Insight Cycle. arXiv 2020, arXiv:2003.14046.
- He, D.; Wu, D.; Huang, R.; Marchionini, G.; Hansen, P.; Cunningham, S.J. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries 2020 in Wuhan virtually. ACM Sigweb Newsl. 2020, 1, 1–7.
- Ahmed, S.; Ali, M.U.; Ferzund, J.; Sarwar, M.A.; Rehman, A.; Mehmood, A. Modern Data Formats for Big Bioinformatics Data Analytics. Int. J. Adv. Comput. Sci. Appl. 2017, 8.
- Plase, D.; Niedrite, L.; Taranovs, R. A Comparison of HDFS Compact Data Formats: Avro Versus Parquet. Moksl. Liet. Ateitis 2017, 9, 267–276.
- Khan, S.; Liu, X.; Ali, S.A.; Alam, M. Storage Solutions for Big Data Systems: A Qualitative Study and Comparison. arXiv 2019, arXiv:1904.11498. Available online: https://arxiv.org/abs/1904.11498 (accessed on 11 January 2021).
- Moniruzzaman, A.B.M.; Hossain, S.A. NoSQL Database: New Era of Databases for Big Data Analytics-Classification, Characteristics and Comparison. Int. J. Database Theory Appl. 2013, 6, 1–14.
- Apache. Avro Specification 2012. Available online: http://avro.apache.org/docs/current/spec.html (accessed on 11 January 2021).
- ORC. ORC Specification 2020. Available online: https://orc.apache.org/specification/ORCv1/ (accessed on 11 January 2021).
- Sakr, S.; Liu, A.; Fayoumi, A.G. The family of MapReduce and large-scale data processing systems. ACM Comput. Surv. 2013, 46, 1–44.
- Apache. Parquet Official Documentation 2018. Available online: https://parquet.apache.org/documentation/latest/ (accessed on 11 January 2021).
- Chellappan, S.; Ganesan, D. Introduction to Apache Spark and Spark Core. In Practical Apache Spark; Apress: Berkeley, CA, USA, 2018; pp. 79–113.
- Zaharia, M.; Chowdhury, M.; Franklin, M.J.; Shenker, S.; Stoica, I. Spark: Cluster computing with working sets. HotCloud 2010, 10, 95.
- Krivulin, N.; Sergeev, S. Tropical optimization techniques in multi-criteria decision making with Analytical Hierarchy Process. In Proceedings of the 2017 European Modelling Symposium (EMS), Manchester, UK, 20–21 November 2017; pp. 38–43.
- Krivulin, N. Methods of tropical optimization in rating alternatives based on pairwise comparisons. In Operations Research Proceedings 2016; Springer: Cham, Switzerland, 2018; pp. 85–91.
Avro | CSV | JSON | ORC | Parquet | |
---|---|---|---|---|---|
Platform independence | + | + | + | - | - |
The ability to change the file | - | + | + | - | - |
Record complex structures | + | - | + | + | + |
Compliance with ACID | - | - | - | + | - |
Format type | row-oriented | text, string | text, object | column-oriented | column-oriented |
File compression | + | - | - | + | + |
The presence of metadata | - | - | - | + | + |
Share and Cite
Belov, V.; Tatarintsev, A.; Nikulchev, E. Choosing a Data Storage Format in the Apache Hadoop System Based on Experimental Evaluation Using Apache Spark. Symmetry 2021, 13, 195. https://doi.org/10.3390/sym13020195