Fast and Accurate Solution of Integral Formulations of Large MQS Problems Based on Hybrid OpenMP–MPI Parallelization
Abstract
1. Introduction
2. Numerical Formulation of the Magneto-Quasi-Static Problem
- (a) assembly of the known-terms vector, V0;
- (b) assembly of the resistance and inductance matrices, R and L;
- (c) assembly of the flux density matrix, Q;
- (d) inversion of the impedance matrix defined as in (6), via factorization and back substitution (a minimal sketch is given below).
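Step (d) amounts to the solution of a dense complex linear system. The following serial C sketch is only illustrative: it assumes the common frequency-domain form Z = R + jωL and uses LAPACK's zgesv, whereas a production run on distributed matrices would rely on a library such as the ScaLAPACK package listed in the references. The routine and array names are hypothetical.

```c
#include <complex.h>
#include <stdlib.h>
#include <lapacke.h>

/* Illustrative serial version of step (d): build Z = R + j*omega*L and solve
 * Z x = b by LU factorization and back substitution (zgesv). The matrix names
 * and the frequency-domain form of Z are assumptions, not taken from the paper. */
void solve_impedance_system(int n, const double *R, const double *L, double omega,
                            double complex *b /* in: known terms, out: solution */)
{
    double complex *Z = malloc((size_t)n * n * sizeof *Z);
    lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);

    for (size_t k = 0; k < (size_t)n * n; ++k)
        Z[k] = R[k] + I * omega * L[k];

    /* LU factorization + back substitution; the solution overwrites b */
    LAPACKE_zgesv(LAPACK_COL_MAJOR, n, 1, Z, n, ipiv, b, n);

    free(ipiv);
    free(Z);
}
```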
3. Parallel Computing Based on a Hybrid OpenMP–MPI Approach
3.1. Parallelization Strategy Based on MPI Approach: Description and Limits
Algorithm 1 Evaluation of V0, pure MPI approach
// Initialization
Split the source points among the MPI processes
// Parallel pure MPI computation
for each MPI process do
  for each mesh element iel do
    for each source mesh element iel0 do
      Compute the V0 contribution between iel and iel0
    end for
  end for
end for
// All-reduce of the local contributions into the global V0
mpi_allreduce(V0_loc, V0_global)
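A minimal C sketch of Algorithm 1 follows; the element-level kernel v0_contribution() and the array names are illustrative assumptions, not the authors' code.

```c
#include <mpi.h>
#include <stdlib.h>

/* Pure MPI assembly of V0 (Algorithm 1): the source elements are split among
 * the MPI processes and the partial vectors are summed with MPI_Allreduce. */
void assemble_v0_mpi(int n_elem, int n_src, double *v0_global,
                     double (*v0_contribution)(int iel, int iel0))
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *v0_loc = calloc((size_t)n_elem, sizeof *v0_loc);

    for (int iel = 0; iel < n_elem; ++iel)
        for (int iel0 = rank; iel0 < n_src; iel0 += size)   /* this process's sources */
            v0_loc[iel] += v0_contribution(iel, iel0);

    /* all-reduce of the local contributions into the global V0 */
    MPI_Allreduce(v0_loc, v0_global, n_elem, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    free(v0_loc);
}
```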
- MDMESH, used to store the geometrical mesh information; its size is of the order of the number of mesh elements;
- MDL, used to temporarily store the entries Lij produced during the main loop of element–element interactions; this memory has the same size as MLOC, i.e., the memory chunk required for matrix storage at each node;
- MDEE, used to carry out the main loop of element–element interactions; this memory scales with the number of degrees of freedom per element.
Algorithm 2 Assembly of the L matrix, pure MPI approach
Equally distribute the element–element interactions among the MPI processes
// Initialization
for each MPI process do
  Allocate dummy memory MdMesh
  Allocate dummy memory MdEE
  Allocate dummy memory MdL
end for
Broadcast the geometrical information
// Parallel pure MPI computation
for each MPI process do
  for each element iel1 do
    for each element iel2 do
      Compute the local iel1-iel2 interactions
      Accumulate the local interactions in MdL
    end for
  end for
end for
// Final communication step
for each MPI process do
  Allocate local memory Mloc
end for
for each MPI process do
  Send and receive the local matrices MdL
  Accumulate the contributions in Mloc
end for
deallocate(MdMesh, MdEE, MdL)
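The final communication step can be expressed in several ways. The sketch below assumes, purely for illustration, that each process holds partial contributions to every chunk of L, laid out as contiguous blocks of equal length; a reduction rooted at each chunk owner then accumulates them into the local memory Mloc. Names and the block layout are assumptions.

```c
#include <mpi.h>

/* Illustrative final communication step: MdL is viewed as 'size' blocks of
 * equal length, block p being owned by process p. Each owner sums the partial
 * contributions of all processes into its local chunk Mloc via MPI_Reduce. */
void gather_L_chunks(const double *MdL, double *Mloc, int block_len, MPI_Comm comm)
{
    int size;
    MPI_Comm_size(comm, &size);

    for (int owner = 0; owner < size; ++owner)
        MPI_Reduce(MdL + (size_t)owner * block_len,  /* this process's contribution */
                   Mloc,                             /* significant on 'owner' only  */
                   block_len, MPI_DOUBLE, MPI_SUM, owner, comm);
}
```

When all blocks have the same length, a single MPI_Reduce_scatter_block collective would achieve the same result in one call.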
Algorithm 3 Evaluation of the Q matrix, pure MPI approach
// Initialization
Equally distribute the field points among the MPI processes
Broadcast the geometrical information
// Parallel pure MPI computation
for each MPI process do
  for each mesh element iel do
    for each field point ifp in the set belonging to the current process do
      Compute the magnetic field or vector potential
      Accumulate the values
    end for
  end for
end for
// Final all-reduce
mpi_allreduce(MagField, VectorPotential)
The following quantities are used in the memory analysis:
- number of MPI processes;
- NN, number of cluster nodes;
- total memory required to store the global matrix (for example, L);
- total memory available at any node;
- dummy memory per MPI process;
- dummy memory per thread;
- memory available at each node (the dummy memory of all the MPI processes hosted by the node must fit in it);
- actual available memory at each node.
3.2. Parallelization Strategy Based on Hybrid OpenMP–MPI Approach
- (1) as in the pure MPI paradigm, the overall computation is partitioned into MPI processes, limiting the number of processes at node level (in the ideal case, this number would be 1);
- (2) the computational burden of each MPI process at node level is divided (again) among several threads, in accordance with the characteristics of the OpenMP paradigm (a minimal initialization sketch is given below).
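The way such a hybrid run is typically set up can be sketched as follows: MPI is initialized with threading support and each process spawns its own OpenMP team (ideally, one process per node). MPI_THREAD_FUNNELED is assumed here, which is sufficient when only the master thread issues MPI calls; the support level actually used by the authors is not stated.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Hybrid initialization sketch: one MPI process per node (set at launch time),
 * with the node-level work shared by an OpenMP thread team. MPI_THREAD_FUNNELED
 * is an assumed support level (only the master thread calls MPI). */
int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        #pragma omp master
        printf("MPI process %d runs %d OpenMP threads\n", rank, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```

Such a program would be launched, for instance, with one MPI process per node and OMP_NUM_THREADS set to the number of cores available on the node.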
This strategy offers two main advantages:
- (i) resource saving: for a fixed speed-up S, the required number of nodes NN is lower;
- (ii) speed-up advantage: for fixed resources (NN), the speed-up S is higher.
The main issues to be handled at thread level are:
- (i) Need for local thread memory: the main loop is distributed among all available threads, each requesting its own local memory to work properly. This memory holds mesh element information (e.g., the curls of the elements, the shape functions, geometrical data, the element local output, and so on). However, it is usually very small (a few GB) and, in cases of practical interest, much smaller than the global output memory required at the node.
- (ii) Global matrix memory update: once a thread has completed its job, its local output must be accommodated into the global memory of the node. This is a non-trivial operation and a bottleneck of the method, because the update must be carried out by each thread with sequential access, so as to guarantee consistency. To this end, the CRITICAL OpenMP directive is used [6] (a sketch of this pattern follows the list).
- (iii) Global input memory access: this access is critical, being shared among the threads, and may cause cache misses.
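A minimal C sketch of the thread-level pattern described in points (i) and (ii): each thread fills a small private buffer with one pair of element–element interactions and merges it into the shared node memory inside a critical section. The two kernels are passed as function pointers because they depend on the specific formulation; their names and the NDOF value are illustrative assumptions.

```c
#include <omp.h>

enum { NDOF = 12 };   /* degrees of freedom per element: illustrative value */

/* Thread-level pattern of points (i)-(ii): a private buffer MdEE per thread
 * and a critical-section update of the shared per-node memory MdL. */
void assemble_chunk_hybrid(int n1, int n2, double *MdL,
                           void (*element_pair)(int iel1, int iel2, double *MdEE),
                           void (*scatter)(int iel1, int iel2, const double *MdEE, double *MdL))
{
    #pragma omp parallel
    {
        double MdEE[NDOF * NDOF];                 /* point (i): private working memory */

        #pragma omp for
        for (int iel1 = 0; iel1 < n1; ++iel1)
            for (int iel2 = 0; iel2 < n2; ++iel2) {
                element_pair(iel1, iel2, MdEE);   /* local iel1-iel2 interactions */

                /* point (ii): only one thread at a time updates the shared MdL */
                #pragma omp critical
                scatter(iel1, iel2, MdEE, MdL);
            }
    }
}
```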
Algorithm 4 Evaluation of V0, hybrid OpenMP–MPI approach
// Initialization
Split the source points among the MPI processes
// Hybrid MPI–OpenMP computation
for each MPI process do
  for each mesh element iel do
    #pragma omp parallel for reduction(+:V0_loc)
    for each source mesh element iel0 do
      Compute V0_loc
    end for
  end for
end for
// Synchronization point
mpi_barrier
// All-reduce of the local contributions into the global V0
mpi_allreduce(V0_loc, V0_global)
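A hedged C sketch of Algorithm 4: the loop over the source elements assigned to each MPI process is shared among the OpenMP threads through a reduction, and the global V0 is then built with an all-reduce. As before, the kernel v0_contribution() is an assumed, illustrative routine.

```c
#include <mpi.h>
#include <stdlib.h>

/* Hybrid assembly of V0 (Algorithm 4): MPI splits the source elements,
 * OpenMP shares each process's inner loop with a reduction. */
void assemble_v0_hybrid(int n_elem, int n_src, double *v0_global,
                        double (*v0_contribution)(int iel, int iel0))
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *v0_loc = calloc((size_t)n_elem, sizeof *v0_loc);

    for (int iel = 0; iel < n_elem; ++iel) {
        double acc = 0.0;

        /* the inner loop over this process's sources is split among the threads */
        #pragma omp parallel for reduction(+:acc)
        for (int iel0 = rank; iel0 < n_src; iel0 += size)
            acc += v0_contribution(iel, iel0);

        v0_loc[iel] = acc;
    }

    MPI_Barrier(MPI_COMM_WORLD);   /* synchronization point */
    MPI_Allreduce(v0_loc, v0_global, n_elem, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    free(v0_loc);
}
```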
Algorithm 5 Assembly of the L matrix, hybrid OpenMP–MPI approach
Equally distribute the element–element interactions among the MPI processes
// Initialization
for each MPI process do
  Allocate dummy memory MdMesh
  Allocate dummy memory MdL
  Allocate dummy memory MdEE
end for
Broadcast the geometrical information
// Hybrid MPI–OpenMP computation
for each MPI process do
  Declare MdEE as private for each thread
  #pragma omp parallel for
  for each element iel1 do
    for each element iel2 do
      Compute the local iel1-iel2 interactions in the private dummy memory MdEE
      #pragma omp critical
      Accumulate the local interactions in the shared memory MdL
    end for
  end for
end for
// Final communication step
for each MPI process do
  Allocate local memory Mloc
end for
for each MPI process do
  Send and receive the local matrices MdL
  Accumulate the contributions in Mloc
end for
deallocate(MdMesh, MdEE, MdL)
Algorithm 6 Evaluation of the Q matrix, hybrid OpenMP–MPI approach
// Initialization
Equally distribute the field points among the MPI processes
Broadcast the geometrical information
// Hybrid MPI–OpenMP computation
for each MPI process do
  for each mesh element iel do
    #pragma omp parallel for
    for each field point ifp do
      Compute the magnetic field or vector potential
      Accumulate the values
    end for
  end for
end for
// Final all-reduce
mpi_allreduce(MagField, VectorPotential)
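A hedged C sketch of Algorithm 6: since each field point is updated by a single thread, no critical section is required, unlike the assembly of L. The kernel field_contribution() is an illustrative assumption.

```c
#include <mpi.h>
#include <stdlib.h>

/* Hybrid evaluation of the field quantities (Algorithm 6): the field points of
 * each MPI process are shared among the OpenMP threads; every field point is
 * owned by one thread, so no critical section is needed. */
void evaluate_field_hybrid(int n_elem, int n_fp, double *field_global,
                           double (*field_contribution)(int iel, int ifp))
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *field_loc = calloc((size_t)n_fp, sizeof *field_loc);

    for (int iel = 0; iel < n_elem; ++iel) {
        /* the field points of this process are shared among the threads */
        #pragma omp parallel for
        for (int ifp = rank; ifp < n_fp; ifp += size)
            field_loc[ifp] += field_contribution(iel, ifp);
    }

    /* final all-reduce over the MPI processes */
    MPI_Allreduce(field_loc, field_global, n_fp, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    free(field_loc);
}
```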
4. Case Studies and Discussion
4.1. A Benchmark Case Study
4.2. Case Study 1: A Plasma Ring
4.3. Case Study 2: Fusion Reactor Eddy Currents Analysis
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Rubinacci, G.; Tamburrino, A.; Ventre, S.; Villone, F. A Fast Algorithm for Solving 3D Eddy Current Problems with Integral Formulations. IEEE Trans. Magn. 2001, 37, 3099–3103.
- Rubinacci, G.; Tamburrino, A.; Ventre, S.; Villone, F. A fast 3-D multipole method for eddy-current computation. IEEE Trans. Magn. 2004, 40, 1290–1293.
- Hackbusch, W. A sparse matrix arithmetic based on H-matrices. Part I: Introduction to H-matrices. Computing 1999, 62, 89–108.
- Ma, T.; Bosilca, G.; Bouteiller, A.; Dongarra, J.J. Kernel-assisted and topology-aware MPI collective communications on multicore/many-core platforms. J. Parallel Distrib. Comput. 2013, 73, 1000–1010.
- Klinkenberg, J.; Samfass, P.; Bader, M.; Terboven, C.; Müller, M.S. CHAMELEON: Reactive Load Balancing for Hybrid MPI+OpenMP Task-Parallel Applications. J. Parallel Distrib. Comput. 2020, 138, 55–64.
- Legrand, A.; Renard, H.; Robert, Y.; Vivien, F. Mapping and load-balancing iterative computations. IEEE Trans. Parallel Distrib. Syst. 2004, 15, 546–558.
- The MPI Forum. MPI: A Message Passing Interface. In Proceedings of the Supercomputing ’93: 1993 ACM/IEEE Conference on Supercomputing, Portland, OR, USA, 15–19 November 1993; pp. 878–883.
- Dagum, L.; Menon, R. OpenMP: An industry-standard API for shared memory programming. Comput. Sci. Eng. 1998, 1, 46–55.
- The OpenMP® API Specification for Parallel Programming. Available online: https://openmp.org/wp/about-openmp (accessed on 20 September 2021).
- Saczek, M.; Wawrzak, K.; Tyliszczak, A.; Boguslawski, A. Hybrid MPI/OpenMP acceleration approach for high-order schemes for CFD. J. Phys. Conf. Ser. 2018, 1101, 012031.
- Ahn, J.M.; Kim, H.; Cho, J.G.; Kang, T.; Kim, Y.-S.; Kim, J. Parallelization of a 3-Dimensional Hydrodynamics Model Using a Hybrid Method with MPI and OpenMP. Processes 2021, 9, 1548.
- Procacci, P. Hybrid MPI/OpenMP Implementation of the ORAC Molecular Dynamics Program for Generalized Ensemble and Fast Switching Alchemical Simulations. J. Chem. Inf. Model. 2016, 56, 1117–1121.
- Sataric, B.; Slavnić, V.; Belic, A.; Balaz, A.; Muruganandam, P.; Adhikari, S. Hybrid OpenMP/MPI programs for solving the time-dependent Gross–Pitaevskii equation in a fully anisotropic trap. Comput. Phys. Commun. 2016, 200, 411.
- Jiao, Y.Y.; Zhao, Q.; Wang, L.; Huang, G.-H.; Tan, F. A hybrid MPI/OpenMP parallel computing model for spherical discontinuous deformation analysis. Comput. Geotech. 2019, 106, 217–227.
- Migallón, H.; Piñol, P.; López-Granado, O.; Galiano, V.; Malumbres, M.P. Frame-Based and Subpicture-Based Parallelization Approaches of the HEVC Video Encoder. Appl. Sci. 2018, 8, 854.
- Xu, Y.; Zhang, T. A hybrid OpenMP/MPI parallel computing model design on the SMP cluster. In Proceedings of the 6th International Conference on Power Electronics Systems and Applications, Hong Kong, China, 15–17 December 2015.
- Shen, Y.; Cao, C. Parallel method of parabolic equation for electromagnetic environment simulation. In Proceedings of the IEEE Information Technology, Networking, Electronic and Automation Control Conference, Chongqing, China, 20–22 May 2016; pp. 515–519.
- Guo, H.; Hu, J.; Nie, Z.P. An MPI-OpenMP Hybrid Parallel H-LU Direct Solver for Electromagnetic Integral Equations. Intern. J. Antennas Propag. 2015, 2015, 615743.
- Wuatelet, P.; Lavallee, P.-F. Hybrid MPI/OpenMP Programming, PATC/PRACE Course Material, IDRIS/MdlS; The Partnership for Advanced Computing in Europe: Barcelona, Spain, 2015.
- Barney, B. Introduction to Parallel Computing; Livermore National Laboratory: Livermore, CA, USA, 2015.
- Albanese, R.; Rubinacci, G. Finite Element Methods for the Solution of 3D Eddy Current Problems. Adv. Imaging Electron. Phys. 1998, 102, 1–86.
- Wu, X.; Taylor, V. Performance Modeling of Hybrid MPI/OpenMP Scientific Applications on Large-scale Multicore Supercomputers. J. Comput. Syst. Sci. 2013, 79, 1256–1268.
- EFDA (European Fusion Development Agreement). The ITER Project. Available online: www.iter.org (accessed on 15 September 2021).
- Scalable Linear Algebra PACKage. Available online: www.scalapack.org (accessed on 12 September 2021).
- Albanese, R.; Rubinacci, G. Integral formulation for 3D eddy-current computation using edge elements. IEE Proc. A 1988, 135, 457–462.
- Rubinacci, G.; Fresa, R.; Ventre, S. An Eddy Current Integral Formulation on Parallel Computer Systems. Intern. J. Numer. Methods Eng. 2005, 62, 1127–1147.
- Marathe, J.; Nagarajan, A.; Mueller, F. Detailed cache coherence characterization for OpenMP benchmarks. In Proceedings of the 18th Annual International Conference on Supercomputing, Saint-Malo, France, 26 June–1 July 2004.
- MARCONI, the Tier-0 System. Available online: www.hpc.cineca.it/hardware/marconi (accessed on 15 September 2021).
| | SUNCUDA Cluster | MARCONI Cluster |
|---|---|---|
| Number of nodes | 2 | 3216 |
| Number of processors per node | 2 | 2 |
| Processor type | Intel Xeon | Intel Xeon |
| Number of cores per processor | 8 | 48 |
| RAM per node | 128 GB | 192 GB |
| MPI Tasks | OpenMP Threads | Speed-Up (Assembly of V0) | Speed-Up (Assembly of L) |
|---|---|---|---|
| 1 (reference) | 1 (reference) | 1 | 1 |
| 1 | 4 | 4.0 | 3.4 |
| 1 | 8 | 7.4 | 6.7 |
| 4 | 1 | 3.7 | 3.8 |
| 4 | 4 | 14 | 15 |
| 4 | 8 | 26 | 27 |
| OpenMP Threads per Node | Speed-Up (10 Nodes, 2 MPI per Node) | Speed-Up (20 Nodes, 2 MPI per Node) | Speed-Up (40 Nodes, 1 MPI per Node) |
|---|---|---|---|
| 1 (reference) | 1 (reference) | 1.99 | 1.99 |
| 5 | 4.96 | 9.85 | 9.88 |
| 10 | 9.86 | 19.7 | 19.7 |
| 15 | 14.8 | 29.5 | 23.2 |
| OpenMP Threads per Node | Nodes (1 MPI per Node) | Speed-Up |
|---|---|---|
| 22 (reference) | 1 (reference) | 1.0 |
| 22 | 25 | 23.9 |
| 22 | 36 | 33.1 |
| 22 | 64 | 53.5 |
| OpenMP Threads per Node | Nodes (2 MPI per Node) | Speed-Up |
|---|---|---|
| 22 (reference) | 2 (reference) | 1.0 |
| 11 | 8 | 2.4 |
| 22 | 8 | 4.3 |
| 11 | 18 | 4.9 |
| 22 | 18 | 7.9 |