DAG-Based Formal Modeling of Spark Applications with MSVL
Abstract
:1. Introduction
- (1)
- For RDD operations, we write corresponding MSVL functions to realize its processing of data. In addition, for those RDD operations with wide dependencies, we also write specialized Shuffle functions to complete the exchange of data between the two stages.
- (2)
- We propose an algorithm for converting Spark programs to MSVL programs based on the DAG of Spark programs. The algorithm extracts formal relations within stages and formal relations between stages, respectively, thus completing the transformation of a Spark program into an MSVL program.
- (3)
- Based on the algorithm mentioned above, we develop an automatic translation tool S2M to convert Spark programs to MSVL programs and give a case study to show the conversion process.
2. Related Work
3. Preliminaries
3.1. MSVL
3.2. Spark Framework
- (1)
- Application: a Spark application written by the user contains one or more jobs.
- (2)
- Job: a set of stages executed as a result of an action operation. During the execution of a Spark application, each action operation triggers the creation and submission of a job.
- (3)
- Stage: a set of tasks that perform the same computation in parallel based on partitions of input data.
- (4)
- Task: unit of execution in a stage. A task is a single data processing on a data partition.
4. Translation from Spark to MSVL
4.1. Data Storage Structure
struct ReadData { char *data[10] and int value[10] and int order }; struct KeyValue { char *data and int value and int order }; |
struct CollectData { char *data and int value and int idle }; struct PrepData { char *data and int value[100] and int idle }; |
4.2. Translation from Spark Operations to MSVL Functions
4.2.1. Translation of textFile Operation
MSVL Function 1: ReadWords(char *) |
1 frame() and 2 ( 3 FILE* fopen(, “r”) and skip; 4 int 0, 0 and skip; 5 while (feof() = 0){ 6 0; 0; 7 await( = 1 OR ⋯ OR = 1); 8 while ( N AND = 0){ 9 if( = 1) then { 1}else { + 1} 10 }; 11 fgets() and skip; 12 + 1 13 }; 14 1 15 ) |
4.2.2. Translation of Shuffle
MSVL Function 2: shuffle(int , char ) |
1 frame() and 2 ( 3 int 0, 0 and skip; 4 while(1 OR⋯OR 1 OR 5 await( OR ⋯ OR = ); 6 0; 0; 7 while ( AND = 0){ 8 if( = ) then { 1} else { + 1} 9 }; 10 Collect(i); 11 1; 12 }; 13 Pretreat(); 14 0; 15 0 16 ) |
4.2.3. Translation of reduceByKey Operation
MSVL Function 3: ReduceByKey(int n) |
1 frame() and 2 ( 3 int 0, 0 and skip; 4 while ( = 0){ 5 Bakery(); 6 await( = 1 OR ⋯ OR = 1 OR =1); 7 0; 0; 8 while ( AND = 0){ 9 if( = 1 AND = 0) then { 1}else { + 1} 10 }; 11 1; 12 reduceImplicit(); 13 + 1; 14 0 15 }; 16 ) |
MSVL Function 4: reduceImplicit(int , int n) |
1 frame() and 2 ( 3 int , 0 and skip; 4 ; 5 + 1; 6 0; 7 if(=1)then{ 8 ; 9 while( 0){ 10 + ; 11 + 1 12 }; 13 0 14 }else{ 15 1 16 } 17 ) |
4.3. DAG-Based Formalization
4.3.1. Data Extraction
4.3.2. Formalization of Operations in Stages
Algorithm 1: Execution paths of RDD operations in a stage |
Algorithm 2: FormInStage( |
4.3.3. Formalization between Stages
Algorithm 3: FormStage() |
4.3.4. Function Replacement and Supplementation
Algorithm 4: Formalization of Spark programs |
Algorithm 5: Supplement to the MSVL |
5. Case Study: TopN
5.1. Application Programs
5.2. Formalization
6. Technical Discussion
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Symbols | Definition |
c | Constant |
x | Variable |
e | Arithmetic expressions |
The previous state of variable e | |
The next state of variable e | |
The call of the state function g | |
The call to an external function | |
b | Boolean expression |
Positive immediate assignment | |
Next-state assignment | |
Conjunction | |
One unit of time over an interval | |
; | Sequential execution |
Parallel execution | |
Execution upon satisfaction of boolean expression b | |
N | Constant |
M | Constant |
L | Constant |
Constant | |
Global variable | |
Acronyms | Full Form |
MSVL | Modeling, Simulation and Verification Language |
PPTL | Propositional Projection Temporal Logic |
RDD | Resilient Distributed Datasets |
DAG | Directed Acyclic Graph |
PTL | Projection Temporal Logic |
CTL | Computing Tree Logic |
LTL | Linear-time Temporal Logic |
References
- Ketu, S.; Mishra, P.K.; Agarwal, S. Performance analysis of distributed computing frameworks for big data analytics: Hadoop vs. spark. Comput. Sist. 2020, 24, 669–686. [Google Scholar] [CrossRef]
- Zhang, J.; Lin, M. A comprehensive bibliometric analysis of Apache Hadoop from 2008 to 2020. Int. J. Intell. Comput. Cybern. 2023, 16, 99–120. [Google Scholar] [CrossRef]
- Zaharia, M.; Xin, R.S.; Wendell, P.; Das, T.; Armbrust, M.; Dave, A.; Meng, X.; Rosen, J.; Venkataraman, S.; Franklin, M.J.; et al. Apache spark: A unified engine for big data processing. Commun. ACM 2016, 59, 56–65. [Google Scholar] [CrossRef]
- Chambers, B.; Zaharia, M. Spark: The Definitive Guide: Big Data Processing MADE Simple; O’Reilly Media: Sebastopol, CA, USA, 2018. [Google Scholar]
- Kalia, K.; Gupta, N. Analysis of hadoop MapReduce scheduling in heterogeneous environment. Ain Shams Eng. J. 2021, 12, 1101–1110. [Google Scholar] [CrossRef]
- Hedayati, S.; Maleki, N.; Olsson, T.; Ahlgren, F.; Seyednezhad, M.; Berahmand, K. MapReduce scheduling algorithms in Hadoop: A systematic study. J. Cloud Comput. 2023, 12, 143. [Google Scholar] [CrossRef]
- Ghazi, M.R.; Gangodkar, D. Hadoop, MapReduce and HDFS: A developers perspective. Procedia Comput. Sci. 2015, 48, 45–50. [Google Scholar] [CrossRef]
- Liu, J.; Li, J.; Li, W.; Wu, J. Rethinking big data: A review on the data quality and usage issues. Isprs J. Photogramm. Remote. Sens. 2016, 115, 134–142. [Google Scholar] [CrossRef]
- Wang, M.; Tian, C.; Duan, Z. Full regular temporal property verification as dynamic program execution. In Proceedings of the 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C), Buenos Aires, Argentina, 20–28 May 2017; pp. 226–228. [Google Scholar]
- Wang, M.; Tian, C.; Zhang, N.; Duan, Z. Verifying full regular temporal properties of programs via dynamic program execution. IEEE Trans. Reliab. 2018, 68, 1101–1116. [Google Scholar] [CrossRef]
- Duan, Z. An Extended Interval Temporal Logic and a Framing Technique for Temporal Logic Programming. Ph.D. Thesis, Newcastle University, Newcastle, UK, 1996. [Google Scholar]
- Duan, Z. Temporal Logic and Temporal Logic Programming; Science Press: Sydney, Australia, 2005. [Google Scholar]
- Yang, K.; Duan, Z.; Tian, C.; Zhang, N. A compiler for MSVL and its applications. Theor. Comput. Sci. 2018, 749, 2–16. [Google Scholar] [CrossRef]
- Zhang, N.; Duan, Z.; Tian, C. A mechanism of function calls in MSVL. Theor. Comput. Sci. 2016, 654, 11–25. [Google Scholar] [CrossRef]
- Duan, Z.; Tian, C.; Zhang, L. A decision procedure for propositional projection temporal logic with infinite models. Acta Inform. 2008, 45, 43–78. [Google Scholar] [CrossRef]
- Duan, Z.; Tian, C. A practical decision procedure for propositional projection temporal logic with infinite models. Theor. Comput. Sci. 2014, 554, 169–190. [Google Scholar] [CrossRef]
- Yang, X.; Wang, X. A Practical Method based on MSVL for Verification of Social Network. In Proceedings of the 2021 8th International Conference on Dependable Systems and Their Applications (DSA), Yinchuan, China, 5–6 August 2021; pp. 383–389. [Google Scholar]
- Zhao, L.; Wu, L.; Gao, Y.; Wang, X.; Yu, B. Formal Modeling and Verification of Convolutional Neural Networks based on MSVL. In Proceedings of the 2022 9th International Conference on Dependable Systems and Their Applications (DSA), Wulumuqi, China, 4–5 August 2022; pp. 280–289. [Google Scholar]
- Wang, M.; Li, S. Formalizing Spark Applications with MSVL. In Proceedings of the International Workshop on Structured Object-Oriented Formal Language and Method, Singapore, 1 March 2021; Springer: Cham, Switzerland, 2020; pp. 193–204. [Google Scholar]
- Luo, N.; Yu, Z.; Bei, Z.; Xu, C.; Jiang, C.; Lin, L. Performance modeling for spark using svm. In Proceedings of the 2016 7th International Conference on Cloud Computing and Big Data (CCBD), Macau, China, 16–18 November 2016; pp. 127–131. [Google Scholar]
- Grossman, S.; Cohen, S.; Itzhaky, S.; Rinetzky, N.; Sagiv, M. Verifying equivalence of spark programs. In Proceedings of the Computer Aided Verification: 29th International Conference, CAV 2017, Heidelberg, Germany, 24–28 July 2017; Part II 30. Springer: Berlin/Heidelberg, Germany; 2017; pp. 282–300. [Google Scholar]
- Beckert, B.; Bingmann, T.; Kiefer, M.; Sanders, P.; Ulbrich, M.; Weigl, A. Relational equivalence proofs between imperative and MapReduce algorithms. In Proceedings of the Verified Software. Theories, Tools, and Experiments: 10th International Conference, VSTTE 2018, Oxford, UK, 18–19 July 2018; Revised Selected Papers 10. Springer: Berlin/Heidelberg, Germany, 2018; pp. 248–266. [Google Scholar]
- Yin, J.; Zhu, H.; Fei, Y.; Fang, Y. Modeling and Verifying Spark on YARN Using Process Algebra. In Proceedings of the 2019 IEEE 19th International Symposium on High Assurance Systems Engineering (HASE), Hangzhou, China, 3–5 January 2019; pp. 208–215. [Google Scholar]
- Baresi, L.; Bersani, M.M.; Marconi, F.; Quattrocchi, G.; Rossi, M. Using formal verification to evaluate the execution time of Spark applications. Form. Asp. Comput. 2020, 32, 33–70. [Google Scholar] [CrossRef]
- de Souza Neto, J.B.; Martins Moreira, A.; Vargas-Solar, G.; Musicante, M.A. TRANSMUT-Spark: Transformation mutation for Apache Spark. Softw. Testing, Verif. Reliab. 2022, 32, e1809. [Google Scholar] [CrossRef]
- Dietsch, D.; Heizmann, M.; Langenfeld, V.; Podelski, A. Fairness modulo theory: A new approach to LTL software model checking. In Proceedings of the Computer Aided Verification: 27th International Conference, CAV 2015, San Francisco, CA, USA, 18–24 July 2015; Proceedings Part I 27. Springer: Berlin/Heidelberg, Germany, 2015; pp. 49–66. [Google Scholar]
- Brockschmidt, M.; Cook, B.; Ishtiaq, S.; Khlaaf, H.; Piterman, N. T2: Temporal property verification. In Proceedings of the Tools and Algorithms for the Construction and Analysis of Systems: 22nd International Conference, TACAS 2016, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2016, Eindhoven, The Netherlands, 2–8 April 2016; Proceedings 22. Springer: Berlin/Heidelberg, Germany, 2016; pp. 387–393. [Google Scholar]
- Cook, B.; Khlaaf, H.; Piterman, N. On automation of CTL* verification for infinite-state systems. In Proceedings of the International Conference on Computer Aided Verification, San Francisco, CA, USA, 18–24 July 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 13–29. [Google Scholar]
- Cook, B.; Koskinen, E. Making prophecies with decision predicates. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Austin, TX, USA, 26–28 January 2011; pp. 399–410. [Google Scholar]
- Duan, Z.; Yang, X.; Koutny, M. Framed temporal logic programming. Sci. Comput. Program. 2008, 70, 31–61. [Google Scholar] [CrossRef]
- Wang, X.; Tian, C.; Duan, Z.; Zhao, L. MSVL: A typed language for temporal logic programming. Front. Comput. Sci. 2017, 11, 762–785. [Google Scholar] [CrossRef]
Name | Syntax | Semantics |
---|---|---|
1. Termination | empty | |
2. Assignment | len(1) | |
3. Positive immediate assignment | ||
4. State frame | lbf(x) | af(x) |
5. Interval frame | frame(x) | (more lbf(x)) |
6. Next | next | |
7. Always | always | |
8. Conditional | if b then else | |
9. Local variable | exist | |
10. Sequential | ; | |
11. Conjunction | and | |
12. While | while | |
13. Selection | or | |
14. Parallel | truetrue | |
15. Projection | prj | |
16. Await | await(b) | frame |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Fan, K.; Wang, M. DAG-Based Formal Modeling of Spark Applications with MSVL. Information 2023, 14, 658. https://doi.org/10.3390/info14120658
Fan K, Wang M. DAG-Based Formal Modeling of Spark Applications with MSVL. Information. 2023; 14(12):658. https://doi.org/10.3390/info14120658
Chicago/Turabian StyleFan, Kaixuan, and Meng Wang. 2023. "DAG-Based Formal Modeling of Spark Applications with MSVL" Information 14, no. 12: 658. https://doi.org/10.3390/info14120658
APA StyleFan, K., & Wang, M. (2023). DAG-Based Formal Modeling of Spark Applications with MSVL. Information, 14(12), 658. https://doi.org/10.3390/info14120658