A Novel Data Management Scheme in Cloud for Micromachines
Abstract
1. Introduction
- In MapReduce, minimizing non-local execution of map tasks is essential for reducing both job latency and makespan. When HDD contention in a physical machine (PM) is high, interference from co-located VMs prevents map tasks from receiving their required data blocks on time, and non-local execution becomes unavoidable. To overcome this problem, we predict the IO performance of every HDD in the cloud data center (CDC) before loading data blocks from the cyber-physical system (CPS) environment. Because IO contention directly affects disk IO performance, we use linear regression to predict each disk's IO performance and place data blocks accordingly (see the sketch following this list). This reduces non-local executions for map tasks, which in turn reduces the job latency and makespan of a batch of MapReduce jobs.
- Furthermore, the performance of VMs hosting Hadoop MapReduce is also degraded by co-located VMs that interfere through resource sharing. Because map tasks from different jobs have different resource requirements, each map task must be assigned to a suitable VM. We therefore observe the varying performance of Hadoop VMs and rank them when scheduling map tasks, so as to minimize job latency.
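To make the regression step concrete, the following is a minimal sketch, not the authors' implementation, of predicting a disk's near-term IO performance from a recent history of IO operations using simple (one-variable) linear regression. The sampling scheme, the IOPS feature, and all function names are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): predict a disk's near-term IO
# performance from its recent IO-operation history with simple linear
# regression. The sampling scheme and all names are illustrative.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b with one feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    return a, mean_y - a * mean_x

def predict_disk_iops(io_history):
    """io_history: [(time step, observed IOPS), ...] over a recent window.
    Returns the IOPS expected at the next time step."""
    ts = [t for t, _ in io_history]
    iops = [v for _, v in io_history]
    a, b = fit_linear(ts, iops)
    return a * (ts[-1] + 1) + b

# Example: a disk whose throughput degrades as co-located IO contention grows.
history = [(0, 420), (1, 400), (2, 385), (3, 360), (4, 340)]
print(predict_disk_iops(history))  # ~321 IOPS expected at the next step
```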
2. Hadoop MapReduce Background
3. Related Works
- Because disk IO load directly affects overall HDD performance, we employ a simple linear regression algorithm to predict HDD performance from the number of IO operations performed over time.
- Because VM performance varies due to interference from co-located VMs, map tasks are scheduled based on each VM's performance in terms of CPU and disk IO.
4. Proposed Methods
4.1. Predicting Disk IO Performance to Place Data Blocks Using Regression
4.1.1. Problem Definition
4.1.2. Problem Formulation
Algorithm 1: Prediction-based block placement at the data loading stage
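A minimal sketch of the idea behind Algorithm 1, assuming a simple greedy policy: at data-loading time, each block's replicas are steered to the disks with the highest predicted IO performance, subject to an HDFS-style constraint that replicas land on distinct physical machines. The greedy policy, replication factor, and all names are our assumptions, not the paper's exact algorithm.

```python
# Sketch of prediction-based block placement at data-loading time:
# each incoming block (and its replicas) is steered to the disks whose
# predicted IO performance is currently highest, keeping replicas on
# distinct physical machines. Greedy policy and names are assumptions.

def place_block(block_id, disks, replication=3):
    """disks: list of dicts like {"pm": "pm1", "disk": "sda",
    "predicted_iops": 350.0}. Returns chosen (pm, disk) pairs."""
    ranked = sorted(disks, key=lambda d: d["predicted_iops"], reverse=True)
    chosen, used_pms = [], set()
    for d in ranked:
        if d["pm"] in used_pms:
            continue  # HDFS-style constraint: replicas on distinct PMs
        chosen.append((d["pm"], d["disk"]))
        used_pms.add(d["pm"])
        if len(chosen) == replication:
            break
    return chosen

disks = [
    {"pm": "pm1", "disk": "sda", "predicted_iops": 321.0},
    {"pm": "pm1", "disk": "sdb", "predicted_iops": 480.0},
    {"pm": "pm2", "disk": "sda", "predicted_iops": 455.0},
    {"pm": "pm3", "disk": "sda", "predicted_iops": 210.0},
]
print(place_block("blk_001", disks))
# [('pm1', 'sdb'), ('pm2', 'sda'), ('pm3', 'sda')]
```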
4.2. Scheduling Map Tasks Based on the Heterogeneous Performance of VMs
Algorithm 2: Heterogeneous-performance-aware map task scheduling
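A sketch of the ranking idea behind Algorithm 2, under the assumption that each Hadoop VM is scored from its observed CPU and disk IO performance and that map tasks are dispatched to the highest-ranked VM with a free slot. The equal weighting and all names are illustrative, not the paper's exact scheme.

```python
# Sketch of heterogeneous-performance-aware map task scheduling:
# VMs are ranked by a combined CPU/disk-IO score and map tasks are
# dispatched to the best-ranked VM with spare capacity. The equal
# weighting and all names are illustrative assumptions.

def rank_vms(vms, w_cpu=0.5, w_io=0.5):
    """vms: {vm_id: {"cpu_perf": float, "io_perf": float, "slots": int}},
    with both metrics normalized to [0, 1] (higher is better)."""
    def score(v):
        return w_cpu * v["cpu_perf"] + w_io * v["io_perf"]
    return sorted(vms, key=lambda vm_id: score(vms[vm_id]), reverse=True)

def schedule_map_tasks(tasks, vms):
    """Assign each map task to the best-ranked VM that still has a slot."""
    assignment = {}
    for task in tasks:
        for vm_id in rank_vms(vms):
            if vms[vm_id]["slots"] > 0:
                vms[vm_id]["slots"] -= 1
                assignment[task] = vm_id
                break
    return assignment

vms = {
    "vm1": {"cpu_perf": 0.9, "io_perf": 0.95, "slots": 2},
    "vm2": {"cpu_perf": 0.7, "io_perf": 0.55, "slots": 2},
    "vm3": {"cpu_perf": 0.8, "io_perf": 0.77, "slots": 1},
}
print(schedule_map_tasks(["m1", "m2", "m3", "m4"], vms))
# {'m1': 'vm1', 'm2': 'vm1', 'm3': 'vm3', 'm4': 'vm2'}
```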
5. Results and Analysis
5.1. Experimental Setup
5.2. Prediction Based Block Placement
5.3. Scheduling Map Tasks Based on Heterogeneous Performance
5.4. Rank Calculation for Launching Map Task
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
| MapReduce Job | No. of Map Tasks | No. of Reduce Tasks | vCPU (Map) | vCPU (Reduce) | Memory (Map) | Memory (Reduce) | Map Task Latency | Reduce Task Latency |
|---|---|---|---|---|---|---|---|---|
| wordcount | 1000 | 20 | 1 | 1 | 2 | 1 | 21 | 39 |
| wordmean | 500 | 15 | 1 | 2 | 1 | 1 | 18 | 33 |
| wordmedian | 2000 | 15 | 1 | 2 | 1.5 | 2 | 15 | 30 |
| kmean | 1500 | 10 | 2 | 2 | 1.5 | 2.5 | 21 | 60 |
*Rank calculation for launching a map task (Section 5.4); column labels inferred, as the originals were not recoverable from the extracted text.*

| VM | CPU performance | Disk IO performance | Disk IO load | Score | Rank |
|---|---|---|---|---|---|
| 1 | 0.8 | 0.68 | 0.5 | 0.34 | 7 |
| 2 | 0.9 | 0.91 | 0.2 | 0.73 | 2 |
| 3 | 0.7 | 0.55 | 0.6 | 0.22 | 9 |
| 4 | 0.85 | 0.82 | 0.4 | 0.49 | 4 |
| 5 | 0.75 | 0.73 | 0.45 | 0.40 | 6 |
| 6 | 0.65 | 0.64 | 0.65 | 0.22 | 10 |
| 7 | 0.95 | 0.86 | 0.25 | 0.65 | 3 |
| 8 | 0.8 | 0.77 | 0.4 | 0.46 | 5 |
| 9 | 0.7 | 0.59 | 0.55 | 0.27 | 8 |
| 10 | 0.9 | 0.95 | 0.1 | 0.86 | 1 |
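The Score column above is consistent with Score = disk IO performance × (1 − disk IO load), with VMs ranked by descending score. The sketch below reproduces the table's ranking under that formula, which is our reading of the extracted values, not a formula stated in this excerpt.

```python
# Reproduce the Rank column above under the inferred formula
# score = io_perf * (1 - io_load), ranking VMs by descending score.
vms = [  # (vm, cpu_perf, io_perf, io_load) -- labels inferred
    (1, 0.80, 0.68, 0.50), (2, 0.90, 0.91, 0.20),
    (3, 0.70, 0.55, 0.60), (4, 0.85, 0.82, 0.40),
    (5, 0.75, 0.73, 0.45), (6, 0.65, 0.64, 0.65),
    (7, 0.95, 0.86, 0.25), (8, 0.80, 0.77, 0.40),
    (9, 0.70, 0.59, 0.55), (10, 0.90, 0.95, 0.10),
]
ranked = sorted(((vm, io * (1 - load)) for vm, _cpu, io, load in vms),
                key=lambda x: x[1], reverse=True)
for rank, (vm, score) in enumerate(ranked, start=1):
    print(f"VM {vm:2d}  score={score:.2f}  rank={rank}")
# Matches the table: VM 10 ranks 1st (0.86), VM 2 ranks 2nd (0.73).
# VMs 3 and 6 both round to 0.22, so their relative order (ranks 9/10)
# depends on the tie-break used.
```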
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).