Online Task Scheduling of Big Data Applications in the Cloud Environment
Abstract
1. Introduction
- Formalize the OTS-DMDR problem considering both the heterogeneity and the processing capacity of the servers, together with the locality, movement, and replication of datasets.
- Propose algorithms that estimate the different costs and measure the tasks’ adequacy to the servers, in order to seek a better task-to-server allocation.
- Conduct extensive simulation experiments to evaluate the efficiency of our algorithm, OTS-DMDR.
2. Related Work
2.1. Commonly Used Metrics
2.2. Single-Objective Scheduling Techniques
2.3. Multi-Objective Scheduling Techniques
2.4. Our Motivation
3. System Model and Problem Formulation
3.1. System Model
- a set of machines M = (m1, m2, …, mp), where mj designates the jth machine.
- an integer value that represents the storage capacity of machine mj (MB).
- a float value that represents the read speed of machine mj (MB/second).
- a float value that represents the write speed of machine mj (MB/second).
- an integer value that represents the available memory capacity of machine mj (MB).
- the number of cores of machine mj.
- an integer value that represents the CPU performance of each core of machine mj (Million Instructions per Second—MIPS).
- an integer value that represents the bandwidth of the connection between machines mj and ml (MB/second).
- an integer value that represents the processing power of machine mj (in Million Instructions per Second—MIPS). It is the overall CPU amount of mj, calculated by multiplying the number of cores of mj by the CPU performance of each core (see the formula block after this list).
- the list of tasks in progress in mj.
- a set of tasks, where ti is the ith task;
- an integer value that designates the length of the ith task (in Million Instructions—MI);
- an integer value that represents the memory capacity required by task ti (in MB);
- an integer value that represents the quantity of MIPS required by task ti;
- an integer value that represents the total size of all the datasets required by task ti;
- the index of the machine to which task ti is finally assigned;
- a decimal value that represents the arrival time of ti;
- the CPU utilization ratio, used to determine whether a machine has a sufficient amount of resources to support a task or not.
- a set of datasets, where dk is the kth dataset;
- an integer value that designates the volume of the kth dataset (in MB);
- F: the datasets-to-tasks assignment matrix. We introduce matrix F because a task may require one or multiple datasets for its execution and many tasks may use the same dataset. Matrix F is generated following Equation (5) (see the formula block below).
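For readability, the two relations above can be written out explicitly. The symbol names used here (CPUj, nbCoresj, corePerfj, Fk,i) are introduced for illustration and are consistent with how these quantities are used in Algorithms 1–4, but they are not necessarily the paper's original notation.

```latex
% Overall processing power of machine m_j (illustrative symbol names).
\mathrm{CPU}_j \;=\; \mathrm{nbCores}_j \times \mathrm{corePerf}_j \qquad \text{(MIPS)}

% Datasets-to-tasks assignment matrix (Equation (5)).
F_{k,i} \;=\;
\begin{cases}
1, & \text{if task } t_i \text{ requires dataset } d_k,\\
0, & \text{otherwise.}
\end{cases}
```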
3.2. Problem Formulation
3.3. Objective Function
- Scheduling Time (ST): the time between the arrival of the task in the queue and its scheduling.
- Delay Time (DT): the time that a task can wait for the availability of a given machine.
- Waiting Time (WT): the sum of the scheduling time (ST) and the delay time (DT).
- Data Migration Time (DMT): the time a task needs to locally gather all its remote required datasets.
- Data Access Time (DAT): the time it takes for a task to read all its local required datasets.
- Execution Time (ET): the time to execute the task.
- Total Execution Time (TET): the sum of data migration time (DMT), data access time (DAT), and execution time (ET).
- Response Time (RT): the sum of waiting time (WT) and total execution time (TET).
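Putting these definitions together, the response time used as the objective decomposes as follows:

```latex
\begin{aligned}
WT  &= ST + DT,\\
TET &= DMT + DAT + ET,\\
RT  &= WT + TET \;=\; ST + DT + DMT + DAT + ET.
\end{aligned}
```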
4. Proposed Approach
- Estimate the response time matrix for the incoming tasks in the queue for all machines;
- Generate a preference list for task-to-machine assignment;
- Perform task selection and assignment;
- Update system state (the availability of the machines and the tasks in the waiting queue Q).
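These four steps form the scheduler's main loop. The following is a minimal sketch of that loop, assuming one helper callable per step; the names compute_rt_matrix, generate_preference_list, select_tasks, and apply_assignments are illustrative, not the paper's code.

```python
import time

def online_scheduling_loop(queue, machines, compute_rt_matrix,
                           generate_preference_list, select_tasks,
                           apply_assignments, poll_interval=1.0):
    """Skeleton of the OTS-DMDR loop; the four step functions are supplied as callables."""
    while True:
        if not queue:                                    # no pending tasks: wait for arrivals
            time.sleep(poll_interval)
            continue
        rt = compute_rt_matrix(queue, machines)          # step 1: response time matrix
        pl = generate_preference_list(rt)                # step 2: sorted (task, machine, RT) triplets
        assignments = select_tasks(pl, queue, machines)  # step 3: task selection and assignment
        apply_assignments(assignments, queue, machines)  # step 4: update machines and the queue
```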
4.1. Response Time Matrix
- Waiting time WT, which includes both the delay time and the scheduling time (line 20):
  - Scheduling time ST is how much time ti waits in the queue before being scheduled on mj.
  - Delay time DT is how much time ti can wait for mj to become available. It is measured using the function ComputeDelayTime (line 13). DT is computed only if mj will suit ti later (see also step 2 in Figure 4). More details are given in Section 4.1.4.
- Time to migrate the required remote data, DMT, is computed when there is no data locality for a given required dataset (line 17 and step 3). The function ComputeDataMigrationTime is detailed in Section 4.1.2. Otherwise, if all the required datasets are locally available, DMT = 0.
- Time to access the required data locally, DAT (step 4 in Figure 4), is computed by calling the function ComputeDataAccessTime in line 18. This step measures the time needed to consume the datasets that are already available locally and those that have just been gathered via the migration process. More information is given in Section 4.1.3.
- Time to execute ti in mj, defined by ET, as shown at line 19 in Algorithm 1 and step 5 in Figure 4.
Algorithm 1 Compute Total Response Time Matrix for Each Task in the Queue Q
Input:
1: Q = (t1, t2, …, tn): Queue of arrived tasks
2: M = (m1, m2, …, mp): Set of machines
Output:
3: RT[i][j]: Response time of ti if placed in mj, where 1 ≤ i ≤ n and 1 ≤ j ≤ p
4: if Q is empty then
5:   Wait for tasks to arrive to Q (see Figure 3)
6: else
7:   for (each task ti in Q) do
8:     for (each machine mj in M) do // assume ti will be placed in mj
9:       fit ← MachineFitTask(i, j)
10:      if fit = −1 then
11:        RT[i][j] ← ∞ // mj can’t host ti, move to the next machine
12:      else if (fit = 0) then
13:        DT[i][j] ← ComputeDelayTime(i, j)
14:      else if (fit = 1) then
15:        DT[i][j] ← 0
16:      end if
17:      DMT[i][j] ← ComputeDataMigrationTime(i, j)
18:      DAT[i][j] ← ComputeDataAccessTime(i, j)
19:      ET[i][j] ← length of ti / processing power of mj
20:      WT[i][j] ← ST[i][j] + DT[i][j]
21:      RT[i][j] ← WT[i][j] + DMT[i][j] + DAT[i][j] + ET[i][j]
22:    end for
23:  end for
24:  PL ← GeneratePreferenceList(RT)
25:  SelectTasks(PL)
26: end if
4.1.1. Fitness
- fit = −1, if the utilization ratio is more than 1 (lines 5 and 6).
- fit = 1, if and only if the utilization ratio does not exceed 1, the remaining storage capacity of mj can accommodate the total amount of data required by ti (lines 9 to 15), the load of mj lies between the lower and upper thresholds (lines 16 to 25), and the amount of remaining RAM in mj is greater than the RAM required by ti.
- fit = 0, if one or more of the conditions above are not verified, i.e., mj does not have enough CPU or/and not enough RAM to host ti, or/and the storage capacity of mj cannot store the remote required datasets of ti, or/and mj is overloaded or underloaded.
- fit = −1: the machine cannot host the task due to the lack of CPU, and no action can be taken. We check this case first, so that we know from the beginning whether we can continue the process of calculating the response time. In this case, the scheduler moves to the next machine.
- fit = 0: the machine cannot host the task due to insufficient RAM or/and storage, or/and because it is overloaded or underloaded. The peculiarity here is that the task can be delayed and wait for these conditions to be satisfied so that the fitness on mj is achieved. In this case, we apply the delay scheduling technique: task ti can wait for a delay so that the resources of mj become available again to host ti. The measurement of the delay time is explained in Section 4.1.4.
- fit = 1: the machine can host the task without constraint violations and with no delay time.
- Dependency between tasks and datasets: this factor defines how many of the duplicated datasets in mj are required by the uncompleted tasks in the queue. In other words, we compute how many tasks in Q use each replicated dataset in mj.
- Number of existing replicas of the dataset: this factor defines how many replicas of each dataset are currently available in the whole system. Therefore, we check whether each machine stores dk as a replica copy; the counter of dk is increased by one each time a replica of dk is identified.
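These two factors suggest a natural ranking of eviction candidates. The sketch below is one plausible reading of SelectDatasetsToDelete, not the paper's exact policy: replicated datasets needed by few queued tasks and available as many copies elsewhere are deleted first. All names and data structures are illustrative.

```python
def rank_replicas_for_deletion(replicas_on_machine, required_by, replica_count):
    """replicas_on_machine: ids of datasets stored as replicas on the candidate machine.
    required_by[d]: number of uncompleted queued tasks that still need dataset d.
    replica_count[d]: number of copies of d currently stored in the whole system."""
    # Prefer deleting datasets with a low task dependency and many system-wide replicas.
    return sorted(replicas_on_machine,
                  key=lambda d: (required_by.get(d, 0), -replica_count.get(d, 1)))

# Example: d2 is needed by no queued task and has 3 copies, so it is the first candidate.
print(rank_replicas_for_deletion(["d1", "d2"], {"d1": 4, "d2": 0}, {"d1": 1, "d2": 3}))
# -> ['d2', 'd1']
```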
Algorithm 2 Compute Fitness of mj to host ti
Input:
1: i: index of the task for which fitness is checked
2: j: index of the machine for which we check the placement fitness
Output: fit: fitness status (1, 0, −1)
3: function MachineFitTask(i, j)
   // CPU Utilization Ratio measurement
4:   ratio ← MIPS required by ti / available MIPS of mj
5:   if ratio > 1 then
6:     fit ← −1
7:     exit
8:   else
     // Storage capacity verification
9:     remoteDataSize ← 0
10:    for (each dataset dk required by ti and not stored in mj) do
11:      remoteDataSize ← remoteDataSize + size of dk // total remote data size required by ti
12:    end for
13:    if remoteDataSize > remaining storage capacity of mj then
14:      SelectDatasetsToDelete(j)
15:    end if
     // Load measurement
16:    load ← 0
17:    for (each task in progress in mj) do // search tasks in progress in mj
18:      load ← load + load used by this task
19:    end for
20:    if load > upper load threshold then // mj is overloaded
21:      mj.overloaded ← 1
22:    end if
23:    if load < lower load threshold then // mj is underloaded
24:      mj.underloaded ← 1
25:    end if
     // Fitness Measurement
26:    if (remaining RAM of mj ≥ RAM required by ti && mj.overloaded = 0 && mj.underloaded = 0) then
27:      fit ← 1
28:    else
29:      fit ← 0
30:    end if
31:  end if
     return fit
32: end function
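As a compact, executable counterpart to Algorithm 2, the following Python sketch returns the same three fitness statuses. The dictionary fields ('mips', 'ram', 'data', 'storage', 'load') and the threshold values T_LOW and T_UP are assumptions made for illustration, and the storage side effect (SelectDatasetsToDelete) is omitted.

```python
T_LOW, T_UP = 0.2, 0.8   # assumed lower/upper load thresholds

def machine_fit_task(task, machine):
    """Return 1 (fits now), 0 (may fit after a delay), or -1 (cannot fit: not enough CPU)."""
    # CPU utilization ratio: if the machine lacks CPU, nothing can be done (fit = -1).
    if machine["mips"] <= 0 or task["mips"] / machine["mips"] > 1:
        return -1
    # The remaining conditions can become true later, so a violation yields 0, not -1.
    enough_storage = task["data"] <= machine["storage"]   # room for the migrated datasets
    enough_ram = task["ram"] <= machine["ram"]
    balanced = T_LOW <= machine["load"] <= T_UP           # neither overloaded nor underloaded
    return 1 if (enough_storage and enough_ram and balanced) else 0

# Enough CPU and RAM, but the machine is overloaded -> the task may be delayed (fit = 0).
print(machine_fit_task({"mips": 500, "ram": 4, "data": 10},
                       {"mips": 2000, "ram": 16, "storage": 100, "load": 0.9}))
```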
4.1.2. Migration Time
Algorithm 3 Compute required remote Datasets Migration Time of ti placed in mj
Input:
1: i: index of the task for which we estimate the time needed to migrate its required data
2: j: index of the machine where ti is assumed to be scheduled and to which data will be migrated
Output: DMT[i][j]: Data Migration Time of the remote required datasets of ti from their distant locations to the local node where ti is scheduled
3: function ComputeDataMigrationTime(i, j)
4:   DMT[i][j] ← 0
5:   for (each dataset dk) do
6:     if F[k][i] = 1 then // dk is required by ti
7:       …
8:       if (dk is not stored in mj) then // dk is a remote dataset
9:         …
10:        for (each machine ml) do
11:          …
12:          if dk is stored in ml then // dk is stored in ml
13:            time[l] ← size of dk / bandwidth between ml and mj // time to migrate dk required by ti from distant ml to local mj
14:          end if
15:        end for
16:      end if
         // Sort migration times of all machines of each dk in ascending order
17:      … // the first element = l, i.e., dk is migrated from ml with time time[l]
18:      … // ml is the source machine with the least migration time to move dk to mj
19:      DMT[i][j] ← DMT[i][j] + time[l] // data migration time
20:    end if
21:  end for
     return DMT[i][j]
22: end function
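The core of Algorithm 3 can be summarized as follows: for every required dataset that is not already on the target machine, pick the replica whose transfer to the target is fastest and accumulate size divided by bandwidth. The data structures below are illustrative, not the paper's implementation.

```python
def data_migration_time(required, local, remote_copies, bandwidth_to_target):
    """required: sizes (MB) of the datasets the task needs, keyed by dataset id.
    local: set of dataset ids already stored on the target machine.
    remote_copies[d]: machines holding a copy of dataset d.
    bandwidth_to_target[m]: bandwidth (MB/s) from machine m to the target machine."""
    dmt = 0.0
    for d, size in required.items():
        if d in local:
            continue  # data locality: nothing to migrate for this dataset
        # Pick the source replica with the smallest migration time (size / bandwidth).
        dmt += min(size / bandwidth_to_target[m] for m in remote_copies[d])
    return dmt

# Only d2 (200 MB) is remote; the best source is m3 at 100 MB/s, hence 2.0 s in total.
print(data_migration_time({"d1": 50, "d2": 200}, {"d1"},
                          {"d2": ["m2", "m3"]}, {"m2": 40, "m3": 100}))
```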
4.1.3. Data Access Time
Algorithm 4 Compute Data Access Time of ti placed in mj
Input:
1: i: index of the task for which we estimate the time to locally access its required data
2: j: index of the machine where ti is assumed to be scheduled
Output: DAT[i][j]: Data Access Time of all the required datasets of ti in the local node
3: function ComputeDataAccessTime(i, j)
4:   DAT[i][j] ← 0
5:   for (each dataset dk) do
6:     if F[k][i] = 1 then // dk is required by ti
7:       DAT[i][j] ← DAT[i][j] + size of dk / read speed of mj
8:     end if
9:   end for
     return DAT[i][j]
10: end function
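Similarly, a minimal sketch of Algorithm 4, assuming the access time of a local dataset is its size divided by the machine's read speed; names are illustrative.

```python
def data_access_time(required_sizes_mb, read_speed_mb_per_s):
    """Sum of (dataset size / read speed) over all datasets required by the task."""
    return sum(size / read_speed_mb_per_s for size in required_sizes_mb)

print(data_access_time([50, 200], 100.0))  # 2.5 s to read 250 MB at 100 MB/s
```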
4.1.4. Delay Time
Algorithm 5 Compute Delay Time
Input:
1: i: index of the task for which fitness is checked
2: j: index of the machine for which we check the placement fitness
Output: DT[i][j]: Delay time so that mj is available to host ti
3: function ComputeDelayTime(i, j)
4:   // sort, in ascending order of estimated finish time, the tasks in progress in mj,
     // or, by arrival time, the tasks assigned to mj
5:   …
6:   …
7:   …
8:   …
9:   for (each task in the sorted list) do
10:    while mj.overloaded = 1 or mj.underloaded = 1 do
11:      …
12:      …
13:      …
14:      …
15:      if enough RAM of mj && enough CPU of mj then
16:        …
17:        …
18:      end if
19:    end while
20:  end for
     return DT[i][j]
21: end function
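The idea behind Algorithm 5 can be sketched as follows: walk through the tasks currently running on mj in increasing order of their estimated finish time, virtually release their resources, and return the waiting time after which ti fits. This simplified version ignores the load-threshold bookkeeping of Algorithm 5 and uses illustrative field names.

```python
def delay_time(task, machine, running, now):
    """task/machine: dicts with required/available 'ram' and 'mips'.
    running: tasks in progress on the machine, each with 'finish', 'ram', 'mips'."""
    free_ram, free_mips = machine["ram"], machine["mips"]
    if free_ram >= task["ram"] and free_mips >= task["mips"]:
        return 0.0                               # the machine can host the task right away
    for r in sorted(running, key=lambda r: r["finish"]):
        free_ram += r["ram"]                     # resources released when r finishes
        free_mips += r["mips"]
        if free_ram >= task["ram"] and free_mips >= task["mips"]:
            return max(0.0, r["finish"] - now)   # wait until this running task completes
    return float("inf")                          # the machine can never host the task

# One running task frees enough RAM and CPU at t = 12, so the delay from t = 10 is 2.0.
print(delay_time({"ram": 8, "mips": 1000}, {"ram": 4, "mips": 500},
                 [{"finish": 12.0, "ram": 6, "mips": 800}], now=10.0))
```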
4.2. Task to Machine Preference List
4.3. Tasks Selection
- The available machines M are updated by removing mj;
- The characteristics of mj are updated, i.e., the RAM occupied by ti is subtracted from the total RAM of mj (line 12), the storage capacity of mj is modified by deducting the volume of the migrated data required by ti (line 13), and the load used by ti is added to the total load of mj (line 14);
- The queue Q of incoming tasks is updated by removing the assigned task ti;
- The preference list is updated (step 8, lines 18 to 22) by removing all the triplets that involve the task ti or the machine mj.
Algorithm 6 Tasks Selection
Input:
1: M: Available machines in the system
2: Q: Arrived tasks in the queue
3: PL: Preference list issued by sorting RT
Output:
4: Q: Updated Q
5: A: Vector of the final assignment of the selected tasks
6: function SelectTasks(M, Q, PL)
7:   while (PL is not empty) do
8:     for (the first triplet (ti, mj, RT[i][j]) of PL) do
9:       … // RT[i][j] is the lowest response time
10:      A[i] ← j // the best placement of ti is mj
11:      remove mj from M // update M
12:      RAM of mj ← RAM of mj − RAM required by ti
13:      storage of mj ← storage of mj − volume of the migrated data required by ti
14:      load of mj ← load of mj + load used by ti
15:      remove the assigned task ti from Q
16:      … // update M
17:      remove the first triplet from PL // update PL by removing the 1st element
18:      for (each remaining triplet of PL) do
19:        if (the triplet contains ti or mj) then
20:          remove it from PL // Update PL by removing elements with ti or mj in the triplet
21:        end if
22:      end for
23:    end for
24:  end while
     return A, Q
25: end function
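A compact sketch of the preference-list construction (Section 4.2) and the greedy selection of Algorithm 6, assuming the response-time matrix is given as a dictionary keyed by (task, machine) pairs; names are illustrative and the per-machine resource updates of lines 12–14 are omitted.

```python
import math

def select_tasks(rt):
    """rt[(task, machine)] -> estimated response time; returns {task: machine}."""
    # Preference list: all feasible (task, machine, RT) triplets sorted by response time.
    preference_list = sorted(((t, m, v) for (t, m), v in rt.items() if math.isfinite(v)),
                             key=lambda triplet: triplet[2])
    assignment, used_tasks, used_machines = {}, set(), set()
    for t, m, _ in preference_list:
        if t in used_tasks or m in used_machines:
            continue                 # task already placed or machine already taken this round
        assignment[t] = m
        used_tasks.add(t)
        used_machines.add(m)
    return assignment

rt = {("t1", "m1"): 4.0, ("t1", "m2"): 6.0, ("t2", "m1"): 5.0, ("t2", "m2"): 7.0}
print(select_tasks(rt))   # {'t1': 'm1', 't2': 'm2'}
```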
5. Simulation Setup and Result Analysis
5.1. Simulation Setup
5.2. Performance Metrics
5.2.1. Response Time (RT)
- Scheduling Time (ST);
- Delay Time (DT);
- Waiting Time (WT);
- Data Migration Time (DMT);
- Data Access Time (DAT);
- Execution Time (ET);
- Total Execution Time (TET);
5.2.2. Throughput
5.2.3. Degree of Imbalance (DI)
5.3. Result Analysis
5.3.1. Experiment 1: Task Variation
5.3.2. Experiment 2: Machine Variation
5.3.3. Experiment 3: Datasets Variation
5.3.4. Experiment 4: Tasks Arrive in 100 Time-Slots
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Barika, M.; Garg, S.; Zomaya, A.Y.; Wang, L.; Moorsel, A.V.; Ranjan, R. Orchestrating Big Data Analysis Workflows in the Cloud: Research Challenges, Survey, and Future Directions. ACM Comput. Surv. 2019, 52, 1–41.
- Rjoub, G.; Bentahar, J.; Wahab, O.A. BigTrustScheduling: Trust-aware big data task scheduling approach in cloud computing environments. Future Gener. Comput. Syst. 2020, 110, 1079–1097.
- Cao, K.; Liu, Y.; Meng, G.; Sun, Q. An Overview on Edge Computing Research. IEEE Access 2020, 8, 85714–85728.
- Petrolo, R.; Loscrì, V.; Mitton, N. Towards a smart city based on cloud of things, a survey on the smart city vision and paradigms. Trans. Emerg. Telecommun. Technol. 2017, 28, e2931.
- Fedushko, S.; Ustyianovych, T.; Syerov, Y.; Peracek, T. User-Engagement Score and SLIs/SLOs/SLAs Measurements Correlation of E-Business Projects Through Big Data Analysis. Appl. Sci. 2020, 10, 9112.
- Zhang, C.; Li, M.; Wu, D. Federated Multidomain Learning With Graph Ensemble Autoencoder GMM for Emotion Recognition. IEEE Trans. Intell. Transp. Syst. 2022, 1–11.
- Luo, X.; Zhang, C.; Bai, L. A fixed clustering protocol based on random relay strategy for EHWSN. Digit. Commun. Netw. 2023, 9, 90–100.
- Chen, H.; Wen, J.; Pedrycz, W.; Wu, G. Big Data Processing Workflows Oriented Real-Time Scheduling Algorithm using Task-Duplication in Geo-Distributed Clouds. IEEE Trans. Big Data 2020, 6, 131–144.
- Arunarani, A.; Manjula, D.; Sugumaran, V. Task scheduling techniques in cloud computing: A literature survey. Future Gener. Comput. Syst. 2019, 91, 407–415.
- Amini Motlagh, A.; Movaghar, A.; Rahmani, A.M. Task scheduling mechanisms in cloud computing: A systematic review. Int. J. Commun. Syst. 2020, 33, e4302.
- Kumar, M.; Sharma, S.; Goel, A.; Singh, S. A comprehensive survey for scheduling techniques in cloud computing. J. Netw. Comput. Appl. 2019, 143, 1–33.
- Liu, J.; Pacitti, E.; Valduriez, P. A Survey of Scheduling Frameworks in Big Data Systems. Int. J. Cloud Comput. 2018, 7, 103–128.
- Gautam, J.V.; Prajapati, H.B.; Dabhi, V.K.; Chaudhary, S. A survey on job scheduling algorithms in Big data processing. In Proceedings of the 2015 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT), Coimbatore, India, 5–7 March 2015; pp. 1–11.
- Mishra, S.K.; Puthal, D.; Sahoo, B.; Jena, S.K.; Obaidat, M.S. An adaptive task allocation technique for green cloud computing. J. Supercomput. 2017, 74, 370–385.
- Stavrinides, G.L.; Karatza, H.D. Scheduling Data-Intensive Workloads in Large-Scale Distributed Systems: Trends and Challenges. In Modeling and Simulation in HPC and Cloud Systems; Kołodziej, J., Pop, F., Dobre, C., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 19–43.
- Yang, C.; Huang, Q.; Li, Z.; Liu, K.; Hu, F. Big Data and cloud computing: Innovation opportunities and challenges. Int. J. Digit. Earth 2017, 10, 13–53.
- Hashem, I.A.T.; Yaqoob, I.; Anuar, N.B.; Mokhtar, S.; Gani, A.; Ullah Khan, S. The rise of “big data” on cloud computing: Review and open research issues. Inf. Syst. 2015, 47, 98–115.
- Mazumdar, S.; Seybold, D.; Kritikos, K.; Verginadis, Y. A survey on data storage and placement methodologies for Cloud-Big Data ecosystem. J. Big Data 2019, 6, 1–37.
- Natesan, G.; Chokkalingam, A. Task scheduling in heterogeneous cloud environment using mean grey wolf optimization algorithm. ICT Express 2019, 5, 110–114.
- Jafarnejad Ghomi, E.; Masoud Rahmani, A.; Nasih Qader, N. Load-balancing algorithms in cloud computing: A survey. J. Netw. Comput. Appl. 2017, 88, 50–71.
- Alami Milani, B.; Jafari Navimipour, N. A comprehensive review of the data replication techniques in the cloud environments: Major trends and future directions. J. Netw. Comput. Appl. 2016, 64, 229–238.
- Ahmad, N.; Che Fauzi, A.A.; Sidek, R.; Zin, N.; Beg, A. Lowest Data Replication Storage of Binary Vote Assignment Data Grid. Commun. Comput. Inf. Sci. 2010, 88, 466–473.
- Mohammadi, B.; Navimipour, N.J. Data replication mechanisms in the peer-to-peer networks. Int. J. Commun. Syst. 2019, 32, e3996.
- Campêlo, R.A.; Casanova, M.A.; Guedes, D.O.; Laender, A.H.F. A Brief Survey on Replica Consistency in Cloud Environments. J. Internet Serv. Appl. 2020, 11, 1.
- Long, S.Q.; Zhao, Y.L.; Chen, W. MORM: A Multi-objective Optimized Replication Management strategy for cloud storage cluster. J. Syst. Archit. 2014, 60, 234–244.
- Mokadem, R.; Hameurlain, A. A data replication strategy with tenant performance and provider economic profit guarantees in Cloud data centers. J. Syst. Softw. 2020, 159, 110447.
- Wang, D.; Chen, J.; Zhao, W. A Task Scheduling Algorithm for Hadoop Platform. J. Comput. 2013, 8, 929–936.
- Li, X.; Wang, L.; Lian, Z.; Qin, X. Migration-based Online CPSCN Big Data Analysis in Data Centers. IEEE Access 2018, 6, 19270–19277.
- Dubey, K.; Kumar, M.; Sharma, S. Modified HEFT Algorithm for Task Scheduling in Cloud Environment. Procedia Comput. Sci. 2018, 125, 725–732.
- Mondal, R.; Nandi, E.; Sarddar, D. Load Balancing Scheduling with Shortest Load First. Int. J. Grid Distrib. Comput. 2015, 8, 171–178.
- Lakra, A.V.; Yadav, D.K. Multi-Objective Tasks Scheduling Algorithm for Cloud Computing Throughput Optimization. Procedia Comput. Sci. 2015, 48, 107–113.
- Wang, H.; Wang, F.; Liu, J.; Wang, D.; Groen, J. Enabling customer-provided resources for cloud computing: Potentials, challenges, and implementation. IEEE Trans. Parallel Distrib. Syst. 2015, 26, 1874–1886.
- Gill, S.S.; Chana, I.; Singh, M.; Buyya, R. CHOPPER: An intelligent QoS-aware autonomic resource management approach for cloud computing. Clust. Comput. 2018, 21, 1203–1241.
- Thomas, A.; Krishnalal, G.; Raj, P.V. Credit Based Scheduling Algorithm in Cloud Computing Environment. Procedia Comput. Sci. 2015, 46, 913–920.
- Sajid, M.; Raza, Z. Turnaround Time Minimization-Based Static Scheduling Model Using Task Duplication for Fine-Grained Parallel Applications onto Hybrid Cloud Environment. IETE J. Res. 2015, 62, 402–414.
- Hadji, M.; Zeghlache, D. Minimum Cost Maximum Flow Algorithm for Dynamic Resource Allocation in Clouds. In Proceedings of the 2012 IEEE Fifth International Conference on Cloud Computing, Honolulu, HI, USA, 24–29 June 2012; pp. 876–882.
- Elzeki, O.; Reshad, M.; Abu Elsoud, M. Improved Max-Min Algorithm in Cloud Computing. Int. J. Comput. Appl. 2012, 50, 22–27.
- Fernández Cerero, D.; Fernández-Montes, A.; Jakóbik, A.; Kołodziej, J.; Toro, M. SCORE: Simulator for cloud optimization of resources and energy consumption. Simul. Model. Pract. Theory 2018, 82, 160–173.
- Ma, T.; Chu, Y.; Zhao, L.; Otgonbayar, A. Resource Allocation and Scheduling in Cloud Computing: Policy and Algorithm. IETE Tech. Rev. 2014, 31, 4–16.
- Carrasco, R.; Iyengar, G.; Stein, C. Resource Cost Aware Scheduling. Eur. J. Oper. Res. 2018, 269, 621–632.
- Coninck, E.; Verbelen, T.; Vankeirsbilck, B.; Bohez, S.; Simoens, P.; Dhoedt, B. Dynamic Auto-scaling and Scheduling of Deadline Constrained Service Workloads on IaaS Clouds. J. Syst. Softw. 2016, 118, 101–114.
- Yi, P.; Ding, H.; Ramamurthy, B. Budget-Minimized Resource Allocation and Task Scheduling in Distributed Grid/Clouds. In Proceedings of the 2013 22nd International Conference on Computer Communication and Networks (ICCCN), Nassau, Bahamas, 30 July–2 August 2013; pp. 1–8.
- Reddy, G. A Deadline and Budget Constrained Cost and Time Optimization Algorithm for Cloud Computing. Commun. Comput. Inf. Sci. 2011, 193, 455–462.
- Xin, Y.; Xie, Z.Q.; Yang, J. A load balance oriented cost efficient scheduling method for parallel tasks. J. Netw. Comput. Appl. 2017, 81, 37–46.
- Yang, S.J.; Chen, Y.R. Design adaptive task allocation scheduler to improve MapReduce performance in heterogeneous Clouds. J. Netw. Comput. Appl. 2015, 57, 61–70.
- Smara, M.; Aliouat, M.; Pathan, A.S.; Aliouat, Z. Acceptance Test for Fault Detection in Component-based Cloud Computing and Systems. Future Gener. Comput. Syst. 2016, 70, 74–93.
- Fan, G.; Chen, L.; Yu, H.; Liu, D. Modeling and Analyzing Dynamic Fault-Tolerant Strategy for Deadline Constrained Task Scheduling in Cloud Computing. IEEE Trans. Syst. Man Cybern. Syst. 2017, 50, 1260–1274.
- Zhou, Z.; Abawajy, J.; Chowdhury, M.; Hu, Z.; Li, K.; Cheng, H.; Alelaiwi, A.; Li, F. Minimizing SLA violation and power consumption in Cloud data centers using adaptive energy-aware algorithms. Future Gener. Comput. Syst. 2017, 86, 836–850.
- Pradhan, R.; Satapathy, S. Energy-Aware Cloud Task Scheduling algorithm in heterogeneous multi-cloud environment. Intell. Decis. Technol. 2022, 16, 279–284.
- Chen, H.; Liu, G.; Yin, S.; Liu, X.; Qiu, D. ERECT: Energy-Efficient Reactive Scheduling for Real-Time Tasks in Heterogeneous Virtualized Clouds. J. Comput. Sci. 2017, 28, 416–425.
- Duan, H.; Chen, C.; Min, G.; Wu, Y. Energy-aware scheduling of virtual machines in heterogeneous cloud computing systems. Future Gener. Comput. Syst. 2017, 74, 142–150.
- Shaikh, M.B.; Waghmare Shinde, K.; Borde, S. Challenges of Big Data Processing and Scheduling of Processes Using Various Hadoop Schedulers: A Survey. Int. J. Multifaceted Multiling. Stud. 2019, III, 1–6.
- Mohapatra, S.; Mohanty, S.; Rekha, K. Analysis of Different Variants in Round Robin Algorithms for Load Balancing in Cloud Computing. Int. J. Comput. Appl. 2013, 69, 17–21.
- Li, R.; Hu, H.; Li, H.; Wu, Y.; Yang, J. MapReduce Parallel Programming Model: A State-of-the-Art Survey. Int. J. Parallel Program. 2016, 44, 832–866.
- Shyam, G.K.; Manvi, S.S. Resource allocation in cloud computing using agents. In Proceedings of the 2015 IEEE International Advance Computing Conference (IACC), Banglore, India, 12–13 June 2015; pp. 458–463.
- Zhao, Q.; Xiong, C.; Yu, C.; Zhang, C.; Zhao, X. A new energy-aware task scheduling method for data-intensive applications in the cloud. J. Netw. Comput. Appl. 2016, 59, 14–27.
- Dubey, K.; Kumar, M.; Chandra, M.A. A priority based job scheduling algorithm using IBA and EASY algorithm for cloud metaschedular. In Proceedings of the 2015 International Conference on Advances in Computer Engineering and Applications, Ghaziabad, India, 19–20 March 2015; pp. 66–70.
- Nasr, A.A.; El-Bahnasawy, N.A.; Attiya, G.; El-Sayed, A. A new online scheduling approach for enhancing QOS in cloud. Future Comput. Inform. J. 2018, 3, 424–435.
- Reddy, G.; Kumar, S. MACO-MOTS: Modified Ant Colony Optimization for Multi Objective Task Scheduling in Cloud Environment. Int. J. Intell. Syst. Appl. 2019, 11, 73–79.
- Biswas, D.; Samsuddoha, M.; Asif, M.R.A.; Ahmed, M.M. Optimized Round Robin Scheduling Algorithm Using Dynamic Time Quantum Approach in Cloud Computing Environment. Int. J. Intell. Syst. Appl. 2023, 15, 22–34.
- Soltani, N.; Barekatain, B.; Soleimani Neysiani, B. MTC: Minimizing Time and Cost of Cloud Task Scheduling based on Customers and Providers Needs using Genetic Algorithm. Int. J. Intell. Syst. Appl. 2021, 13, 38–51.
- Mohseni, Z.; Kiani, V.; Rahmani, A. A Task Scheduling Model for Multi-CPU and Multi-Hard Disk Drive in Soft Real-time Systems. Int. J. Inf. Technol. Comput. Sci. 2019, 11, 1–13.
- Zaharia, M.; Borthakur, D.; Sen Sarma, J.; Elmeleegy, K.; Shenker, S.; Stoica, I. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling; EuroSys’10; Association for Computing Machinery: New York, NY, USA, 2010; pp. 265–278.
- He, C.; Lu, Y.; Swanson, D. Matchmaking: A New MapReduce Scheduling Technique. In Proceedings of the 2011 IEEE Third International Conference on Cloud Computing Technology and Science, Athens, Greece, 29 November–1 December 2011; pp. 40–47.
- Kosar, T.; Balman, M. A new paradigm: Data-aware scheduling in grid computing. Future Gener. Comput. Syst. 2009, 25, 406–413.
- Vobugari, S.; Somayajulu, D.V.L.N.; Subaraya, B.M. Dynamic Replication Algorithm for Data Replication to Improve System Availability: A Performance Engineering Approach. IETE J. Res. 2015, 61, 132–141.
- Bouhouch, L.; Zbakh, M.; Tadonki, C. A Big Data Placement Strategy in Geographically Distributed Datacenters. In Proceedings of the 2020 5th International Conference on Cloud Computing and Artificial Intelligence: Technologies and Applications (CloudTech), Marrakesh, Morocco, 24–26 November 2020; pp. 1–9.
- Bouhouch, L.; Zbakh, M.; Tadonki, C. Dynamic data replication and placement strategy in geographically distributed data centers. Concurr. Comput. Pract. Exp. 2022, early view.
- Mohamed, A.; Najafabadi, M.K.; Wah, Y.B.; Zaman, E.A.K.; Maskat, R. The state of the art and taxonomy of big data analytics: View from new big data framework. Artif. Intell. Rev. 2020, 53, 989–1037.
- Samadi, Y.; Zbakh, M.; Tadonki, C. DT-MG: Many-to-one matching game for tasks scheduling towards resources optimization in cloud computing. Int. J. Comput. Appl. 2021, 43, 233–245.
- Calheiros, R.N.; Ranjan, R.; Beloglazov, A.; De Rose, C.A.F.; Buyya, R. CloudSim: A toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms. Softw. Pract. Exp. 2011, 41, 23–50.
- Calheiros, R.; Ranjan, R.; De Rose, C.; Buyya, R. CloudSim: A Novel Framework for Modeling and Simulation of Cloud Computing Infrastructures and Services. arXiv 2009, arXiv:0903.2525.
- Bouhouch, L.; Zbakh, M.; Tadonki, C. Data Migration: Cloudsim Extension. In Proceedings of the ICBDR 2019: 2019 the 3rd International Conference on Big Data Research, Cergy-Pontoise, France, 20–22 November 2019; pp. 177–181.
- Niznik, C.A. Min-max vs. max-min flow control algorithms for optimal computer network capacity assignment. J. Comput. Appl. Math. 1984, 11, 209–224.
Method | Technique | Advantages | Limitations | Parameters |
---|---|---|---|---|
First Come First Served [52] | The last arrived tasks must wait until the end of the execution of earlier ones. | Simple to implement and efficient. | Increases the waiting time of tasks; task sizes are not considered. Load imbalance and decreased data locality. | - |
Shortest Job First [30] | Chooses the shortest task to be executed first. | Reduces execution time in comparison with FCFS and RR. | Uneven load distribution on the servers. | Execution time. Response time. |
Round Robin [53] | Circularly distributes tasks. | An equal amount of CPU time is given to every task. | A higher average waiting time. | - |
Shyam and Manvi [55] | VM Migration. | Maximizes resource usage while minimizing time and budget. | Needs more agents to search for the best resource. | Execution time. Makespan time. Response time. Resource utilization. |
Wang et al. [32] | Dynamic resource provisioning. | Considers resources such as CPU, memory and storage. | Data locality and replication not addressed. | Execution cost. Availability. |
Zhao et al. [56] | Energy-efficient technique where datasets and tasks are treated as a binary tree using a data correlation clustering algorithm. | Minimizes the energy usage of cloud data centers. | Online scheduling is not considered. | Execution cost. Resource utilization. Energy consumption. |
Dubey et al. [57] | Scheduling algorithm based on IBA (Improved Backfill Algorithm). | Minimizes the execution time and increases resource usage. | More tasks imply lower performance. Considers task priority. | Execution time. Makespan. Resource utilization. |
Elseoud et al. [58] | Online Potential Finish Time heuristic algorithm. | Improves execution time and cost in cloud computing; executes tasks with the least amount of delay. | Data locality and replication not addressed. | Execution time/cost. Makespan. Response time. Resource utilization. |
Delay algorithm [63] | Assigns tasks based on input data location; a task is delayed until its required data is available. | Ease of scheduling. Achieves data locality. | Load imbalance, which can cause higher delays in task execution and lower throughput. | Execution time. |
Matchmaking algorithm [64] | Before assigning a new task to nodes, every node has a fair chance to utilize its local tasks. | High node utilization. High cluster utilization. | None in particular. | Resource utilization. Availability. |
Li et al. [28] | Online job scheduling with data migration, based on a trade-off between the data transfer cost and the waiting time cost. | Handles data migration. | Schedules tasks sequentially, which increases the waiting time. System characteristics not considered. Data replication not discussed. | Throughput. |
Reddy et al. [59] | Modified Ant Colony optimization. | Considers VMs RAM, bandwidth, storage, processing speed and makespan in the fitness function. | Data-intensive online tasks not addressed. | Resource utilization. Makespan. Load balance. |
Biswas et al. [60] | Dynamic round robin. | Dynamically determines Time Quantum. | Starvation not handled. | Turnaround. Waiting time. |
Soltani et al. [61] | Genetic meta-heuristic. | Multi-purposed weighted genetic algorithm to enhance performances. | Data-intensive online tasks not addressed. | Response time. Waiting time. Makespan. |
Mohseni et al. [62] | Hard Disk Drive and CPU Scheduling (HCS) algorithm. | Schedules multiple tasks among multi-core systems. | Memory, bandwidth and latency of multi-core systems are not considered. | Execution time. Energy consumption. |
Characteristic | Value |
---|---|
Number of machines | [5–100] |
Processing power (MIPS) | [1000–5000] |
RAM (GB) | [64–2048] |
Storage capacity (TB) | [1–25] |
Number of tasks | [30–2000] |
Size of tasks (MI) | [1000–4000] |
Number of datasets | 300 |
Size of datasets (GB) | [1–100] |
Number of required datasets | [1–10] |