Article

Multi-GPU Acceleration for Finite Element Analysis in Structural Mechanics

by David Herrero-Pérez 1,*,† and Humberto Martínez-Barberá 2,†
1 Escuela Técnica Superior de Ingeniería Industrial, Universidad Politécnica de Cartagena, Campus Muralla del Mar, 30202 Cartagena, Spain
2 Facultad de Informática, Universidad de Murcia, 30100 Murcia, Spain
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2025, 15(3), 1095; https://doi.org/10.3390/app15031095
Submission received: 26 November 2024 / Revised: 18 January 2025 / Accepted: 20 January 2025 / Published: 22 January 2025
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))

Featured Application

Multi-GPU systems allow us to extend to many-core computing the multi-core techniques that have already demonstrated their effectiveness, addressing large-scale problems. We introduce the capability of solving large systems of equations by exploiting the computing capabilities of GPU clusters and multi-GPU systems using a hybrid GPU-aware MPI approach. This strategy increases the speedup and alleviates the device memory limitations of GPU computing.

Abstract

This work evaluates the computing performance of finite element analysis in structural mechanics using modern multi-GPU systems. Using multiple GPUs for scientific computing, we can avoid the memory limitations that usually arise when a single GPU device is used for many-core computing. We use a GPU-aware MPI approach implementing a smoothed aggregation multigrid, suitable for GPU computing, to precondition a distributed iterative conjugate gradient solver. We evaluate the performance and scalability for different models, problem sizes, and computing resources. We take an efficient multi-core implementation as the reference to assess the computing performance of the numerical results. The numerical results show the advantages and limitations of using distributed many-core architectures to address structural mechanics problems.

1. Introduction

Nowadays, parallel processing has become the dominant paradigm in scientific computing to address the ever-increasing complexity and fidelity of finite element models [1]. Moore’s law [2] is still valid today, but semiconductor companies use these increasing computational resources to increase the number of cores (multi-core and many-core architectures) rather than to make faster cores [3]. The limitations to obtaining faster cores despite the increasing number of transistors in integrated circuits are mainly due to three technical walls: the power wall, the Instruction-Level Parallelism (ILP) wall, and the memory wall. The power wall refers to the difficulty of scaling performance due to the limits of affordable power delivery and dissipation, which cap the operating frequency at a certain threshold. The ILP wall relates to the difficulty of finding enough parallelism in a single instruction stream to keep a single-core processor busy [4]. The memory wall refers to the problems that arise when processor speed improves faster than memory speed [5]. All these facts allow us to affirm that serial computing has reached its zenith in performance [6].
Multi-core architectures exploit the task or function parallelism, where we divide the problem into smaller ones, and the code is then distributed across multiple processors and executed in parallel. On the other hand, the many-core architectures exploit data parallelism using Single Instruction Multiple Data (SIMD), where the same code executes on different pieces of distributed data across parallel computing nodes. Function parallelism permits flexible parallel computing, whereas data parallelism provides high speedups. For this reason, we often combine both techniques to exploit the computing capabilities of the architectures properly.
Many-core computing here refers to non-graphical computing using graphics cards, which are becoming an integral part of modern high-performance systems. The computing nodes of most modern supercomputers include potent CPUs for task parallelism and powerful GPUs for massively parallel computing [7]. Workstation vendors also adopt a similar configuration by including co-processors that permit General-Purpose computing on Graphics Processing Units (GPGPU) [8,9]. However, exploiting these massively parallel architectures requires algorithms with abundant fine-grained or data-level parallelism. The memory limitations of GPU computing using only the device memory of one graphics card restrict the problems that can be addressed with many-core computing. By introducing function parallelism in GPU systems [10,11], whether using shared (multi-GPU systems) or distributed (GPU clusters) memory approaches, we can remove the device memory limitations, usually at the cost of reducing the GPU computing performance due to data transfers between memory systems.
Multi-GPU computing is becoming a hot topic to speed up libraries, tools, and applications limited by the memory specifications of GPU devices. We can mention the mesh-free peridynamic simulation using a multi-GPU parallel scheme [12] in the context of fracture mechanics applications, the simulation of the dynamics of gigamolecules [13] using GPU clusters, and the 3D turbulent flow simulations [14] using a multi-GPU implementation for a gas kinetic scheme. In the context of generic linear algebra applications, we can mention the acceleration of Sparse Matrix-Vector (SpMV) multiplication operations for iterative-solving algorithms used in large-scale problems [15] and the dynamic work balancing in multi-GPU systems through task migration [16], which is evaluated using Block-Sparse General Matrix Multiplication (BSpGEMM), Block General Matrix Multiplication (GEMM), and Block Cholesky Factorization.
In the context of solid mechanics problems, we can accelerate the resolution of the linear system arising from the finite element method [17] using GPU computing [18]. We should usually avoid direct solvers due to their memory requirements for large systems of equations. In such cases, Krylov subspace iteration methods save memory at the cost of increasing the computational burden, which can be alleviated using distributed computing [19,20]. A key point to achieving high-performance GPU computing using Krylov methods [21,22,23] is the efficient implementation of the SpMV multiplication using a sparse-matrix representation [24,25,26,27]. We can obtain significant computing performance improvements by exploiting the fine-grained parallelism of the many-core architecture [28] and considering limited transfer rates over data channels, concurrency, and coalesced memory access. Another issue is the assembly process, which is highly parallelizable and can achieve significant speedups using GPU computing [29,30], but it can require a large amount of memory. For this reason, assembly-free methods have been revisited recently for GPU computing [31] in finite element analysis [32,33] and topology optimization [34] problems by exploiting data locality properly. However, this strategy is inappropriate for finite element models using unstructured meshes, where we achieve poor or null speedup [35] using GPU computing.
There is consensus that multigrid methods are the most efficient and popular techniques for solving large linear systems of equations. They exploit the smoothing property and the coarse grid principle. The former reduces the high-frequency error components, whereas the latter approximates the low-frequency error components on coarser grids, which we then prolongate to the finer grids. We usually classify these solving methods as geometric multigrid (GMG) and algebraic multigrid (AMG) approaches. The former uses the geometry and physics of the problem, whereas the latter only requires the coefficient matrix of the linear system of equations resulting from the discretization of Partial Differential Equations (PDEs). Regular grids facilitate the efficient calculation of the transfer operators between grids in GMG methods, which can be unaffordable in complex geometries. On the other hand, AMG methods are more flexible since they only use the information provided by the matrix of coefficients. For this reason, their use as a “black-box” function in finite element codes is prevalent.
We can also classify AMG methods into two groups: classical AMG and aggregation AMG. The former obtains the hierarchical multigrid by partitioning the nodes into coarse and fine nodes and defines the interpolation operator (or transpose of the restriction operator) by the weighted sum of such nodes [36]. In contrast, aggregation AMG methods [37] aggregate a few fine grid nodes to form a coarse grid node and define the interpolation as a piecewise constant operator, which results in an interpolation operator with only one nonzero per row, reducing the memory requirements meaningfully in comparison with classical AMG [38] and improving the efficiency of the interpolation operator. These features have a significant beneficial effect on the performance of GPU computing.
Multigrid methods are commonly used as preconditioners of Krylov subspace methods [39,40] in the structural mechanics context. The underlying idea is that the interpolation operator of the multigrid method will hardly be optimal, which makes it less efficient for some specific error components. When this occurs, the convergence is slow even though almost all the error components are reduced quickly. Krylov subspace methods usually eliminate these remaining error components efficiently [41]. Commonly, this is a more efficient solution than improving the construction of the interpolation operator. In this context, GPU computing using geometric multigrid (GMG) methods as preconditioners has shown clear benefits, both in structural analysis [42] and topology optimization [43,44,45]. Computing using many-core architectures has also shown its benefits in aggregation AMG methods [38,46,47] due to the low memory requirements compared to classical AMG methods.
We present our experience accelerating the finite element method in structural mechanics using a modern multi-GPU system. We exploit the computational capabilities using a distributed conjugate gradient solver that uses the memory available to each processor, including the host memory and the device memory of each graphics card. This approach permits us to increase the problem size that can be addressed with GPU computing, which is otherwise limited by device memory constraints. We precondition the Krylov subspace iteration method using a smoothed aggregation-based AMG method that increases the GPU performance since it requires less device memory than other multigrid approaches. We distribute the computational burden between the computational resources using domain decomposition techniques. We use modern high-level libraries to exploit the computational resources of many-core and multi-core architectures, in particular, ParMETIS [48] for parallel partitioning, AMGCL [49] for the distributed Krylov iterative solver and multigrid preconditioning, and VEXCL [50,51] for the GPGPU implementation of the aggregation AMG approach, which makes extensive use of the Boost.Compute library [52] using OpenCL programming. We also present numerical experiments to evaluate the weak scalability and limitations of modern multi-GPU systems addressing structural mechanics analysis. We also provide a quantitative comparison of the numerical results between the many-core computing approach and an efficient multi-core implementation, showing the benefits in performance of using the multi-GPU system. Finally, we discuss and conclude with the pros and cons of using multi-GPU systems in the context of structural mechanics problems.
We organize the remainder of the paper as follows. Section 2 introduces the distributed architecture and the communication mechanisms required for the computational resources to work together. Section 3 reviews the basis and theoretical background of aggregation algebraic multigrid (AMG) methods. Section 4 is devoted to the parallel implementation: the partitioning using domain decomposition and the solving approach using high-level libraries to exploit the computational resources of many-core and multi-core architectures. Section 5 shows the numerical experiments, evaluating the scalability and limitations of the techniques adopted to exploit multi-GPU systems in structural mechanics problems. Finally, Section 6 presents the conclusions of the evaluation of the numerical results.

2. Multi-GPU Architecture and Communications

Using multiple graphics cards incorporates another level of parallelism into the GPU system, namely, task-level parallelism. This additional level permits us to distribute the workload of applications between the multiple GPUs and then exploit the data parallelism for which we design these massively parallel architectures. Such distributed memory systems provide an efficient way to increase flexibility in the implementation and scale up the performance of heavy computing applications. Figure 1 shows the typical configuration of a GPU cluster, where we equip the nodes with several GPUs, namely, multi-GPU systems. We install the graphics cards in motherboards with multiple PCIe slots and/or PCIe expansion boards. We can observe that GPU clusters and multi-GPU systems are distributed memory systems, where the computational units use their local memory. This flexible computing architecture with different memory spaces requires an efficient mechanism for sharing data between the computational nodes of the High-Performance Computing (HPC) system.
Figure 1 shows that the communication mechanism can use the PCIe data bus and the network, including Ethernet or InfiniBand. These technologies have different memory bandwidths operating at distinct transfer rates. Communications also introduce latency when data transfer requires additional operations, such as operating system (OS) management and the processing of communication protocols. Remote Direct Memory Access (RDMA) technology provides data exchange between devices without involving the OS and CPU. Such devices include GPUs, network interface cards (NICs), and storage adapters. This technology permits boosting the network and host performance with lower latency, lower CPU load, and higher bandwidth. We eliminate communication overheads and bottlenecks since we bypass the host CPU and memory, which results in minimal impact on host resources. We can also use RDMA over networks because RDMA is natively supported by InfiniBand interconnect technology and can be used over Converged Ethernet (RoCE) in Ethernet networks. Both technologies share a user API but have different physical and link layers.
Communication frameworks facilitate using these efficient communication mechanisms. One popular option is the Unified Communication X (UCX) framework, which is cross-platform and performance-oriented for low overheads in the communication path. It supports the devices of most GPU vendors, using GPUDirect RDMA technology and ROCm for Nvidia and AMD products, respectively. The most popular implementations of the Message Passing Interface (MPI) standard adopt this framework, which we can use to build applications that scale in distributed memory systems. The latest versions of the most popular MPI implementations support CUDA-aware MPI computing, such as Open MPI from the v1.7.0 release and MPICH from the v4.0 one. This compatibility facilitates the development of scalable software on multi-core and many-core computing architectures.
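To make the CUDA-awareness concrete, the following minimal sketch passes a device buffer directly to the MPI calls without a host staging copy. The two-rank layout and message size are arbitrary choices for illustration, and the code assumes an MPI library built with CUDA support (e.g., Open MPI with UCX).

```cpp
// Minimal two-rank exchange of a device buffer, assuming a CUDA-aware MPI build.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                      // arbitrary message size (1 M doubles)
    double* d_buf = nullptr;
    cudaMalloc(&d_buf, n * sizeof(double));     // device memory, no host staging buffer
    cudaMemset(d_buf, 0, n * sizeof(double));

    if (rank == 0) {
        // The device pointer is handed directly to MPI; the library moves the data
        // through GPUDirect RDMA or internal pipelining, depending on the hardware.
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```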
We adopt these functionalities in the iterative solver preconditioned with a smoothed aggregation-based AMG method using a subdomain strategy for the implementation. In particular, we divide the coefficient matrix of the linear system of equations and the vectors needed by the iterative solver and the preconditioner. We optimize the operations between the partitioned vectors and the coefficient matrix, stored in a sparse-matrix representation, by splitting them into local and remote operations. The former uses the Boost.Compute library [52], which is a GPU/parallel-computing library for C++ based on OpenCL that can operate on sparse-matrix representations. The latter uses the communication framework mentioned above, which optimizes the communications depending on the network topology. The MPI communications use point-to-point communication patterns to avoid bandwidth problems.

3. Aggregation Algebraic Multigrid (AMG)

Consider a linear system of equations,
$\mathbf{K} \mathbf{u} = \mathbf{f},$   (1)
where $\mathbf{K} \in \mathbb{R}^{n \times n}$ is the coefficient matrix of the linear elasticity system, $\mathbf{f} \in \mathbb{R}^{n \times 1}$ is the right-hand side vector, $\mathbf{u} \in \mathbb{R}^{n \times 1}$ is the solution vector, and $n$ is the number of unknowns. Multigrid methods use a two-grid scheme to address the problem. Let $l$ be the grid level and $n_l$ the number of unknowns at level $l$. Considering the initialization $\mathbf{K}_0 = \mathbf{K}$ and $n_0 = n$, we define the coarse grid coefficient matrices $\mathbf{K}_{l+1}$ recursively as follows:
$\mathbf{K}_{l+1} = \mathbf{R}_l \mathbf{K}_l \mathbf{P}_l,$   (2)
where $\mathbf{P}_l \in \mathbb{R}^{n_l \times n_{l+1}}$ is the interpolation or prolongation operator, $\mathbf{R}_l \in \mathbb{R}^{n_{l+1} \times n_l}$ (typically obtained as $\mathbf{P}_l^T$) is the restriction operator, and $\mathbf{K}_l \in \mathbb{R}^{n_l \times n_l}$ and $\mathbf{K}_{l+1} \in \mathbb{R}^{n_{l+1} \times n_{l+1}}$ are the fine and coarse grid coefficient matrices, respectively. We calculate these transfer operators recursively until a grid level $L$, where the number of unknowns $n_L$ is small enough to solve the coarsest problem within a reasonable time, usually using a direct solver.
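For completeness, the interplay between the two ingredients can be summarized by the standard two-grid error propagation relation, which is textbook multigrid theory rather than a result of this work; here $\mathbf{S}_l$ denotes the relaxation smoother, $\mu$ the number of pre-smoothing steps, and $\mathbf{e}$ the error before the cycle:

$$
\mathbf{e}^{\mathrm{new}} = \left( \mathbf{I} - \mathbf{P}_l\, \mathbf{K}_{l+1}^{-1}\, \mathbf{R}_l\, \mathbf{K}_l \right) \mathbf{S}_l^{\mu}\, \mathbf{e}.
$$

The correction factor removes the error components representable on the coarse grid, while the smoother damps the high-frequency components that the coarse grid cannot capture.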
Algorithm 1 outlines the AMG setup stage for aggregation-based AMG methods [53], where we calculate the grid hierarchy and the transfer operators using the algebraic properties of the coefficient matrix $\mathbf{K}$. This stage requires the coefficient matrix $\mathbf{K} \in \mathbb{R}^{n \times n}$ and the near null-space vectors $\mathbf{B} \in \mathbb{R}^{n \times m}$, with $m$ the number of vectors representing low-energy eigenmodes of the problem. The setup phase requires the following operations: strength, aggregate, tentative, prolongate, and Galerkin projection. The strength operator constructs a graph $C_l$ of strong connections at the corresponding level $l$ [36,54]. We build this operator using a coarsening ratio by evaluating the relationship between the diagonal and off-diagonal coefficients of the unknowns. Using a block matrix by vertex (grouping into blocks all unknowns of a grid point) instead of considering each unknown as a vertex of $C_l$ often shows computational advantages.
Algorithm 1: AMG setup
We then use the edges of the graph $C_l$ for the aggregation process. The problem consists of selecting a set of independent grid points as root nodes, which we group with their neighbors to form the aggregates. This grouping can be performed in parallel using the Parallel Maximal Independent Set (PMIS) algorithm [55,56]. Then, the tentative prolongation matrix $\tilde{\mathbf{P}}_l$ is built as a simple grid transfer operator by a piecewise constant interpolation, where each row corresponds to a grid point of the fine grid and each column corresponds to an aggregate of $Agg_l$. The resulting operators should satisfy $\mathbf{B}_l = \tilde{\mathbf{P}}_l \mathbf{B}_{l+1}$ and $\tilde{\mathbf{P}}_l^T \tilde{\mathbf{P}}_l = \mathbf{I}$, which implies that the near null-space candidates lie in the range of $\tilde{\mathbf{P}}_l$ and that the columns of $\tilde{\mathbf{P}}_l$ are orthonormal [56]. We can use a smoother $\tilde{\mathbf{S}}_l$ to obtain a more robust prolongation operator $\mathbf{P}_l$ by smoothing the tentative prolongation matrix $\tilde{\mathbf{P}}_l$. The weighted Jacobi smoother is a common choice for this purpose. Finally, we perform the sparse Galerkin product as two matrix–matrix products.
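The piecewise constant structure of $\tilde{\mathbf{P}}_l$ is easy to see in code. The sketch below builds the tentative prolongation in CSR format from an aggregate map for the scalar case; the struct and function names are illustrative, and a complete smoothed aggregation setup would additionally fit the near null-space vectors $\mathbf{B}$ (orthonormalizing the columns per aggregate) and smooth the result with weighted Jacobi.

```cpp
// Illustrative helper: builds the tentative prolongation in CSR format from an aggregate
// map (agg[i] = aggregate owning fine node i, or -1 if the node is left unaggregated).
// Piecewise constant interpolation yields exactly one nonzero per row, which keeps the
// operator memory footprint low.
#include <cstddef>
#include <vector>

struct CSR {
    std::ptrdiff_t rows = 0, cols = 0;
    std::vector<std::ptrdiff_t> ptr, col;
    std::vector<double> val;
};

CSR tentative_prolongation(const std::vector<std::ptrdiff_t>& agg, std::ptrdiff_t n_coarse) {
    CSR P;
    P.rows = static_cast<std::ptrdiff_t>(agg.size());
    P.cols = n_coarse;
    P.ptr.resize(P.rows + 1, 0);

    // One entry per aggregated fine node.
    for (std::ptrdiff_t i = 0; i < P.rows; ++i)
        P.ptr[i + 1] = P.ptr[i] + (agg[i] >= 0 ? 1 : 0);

    P.col.resize(P.ptr.back());
    P.val.resize(P.ptr.back(), 1.0);   // unscaled; a real setup orthonormalizes the columns
    for (std::ptrdiff_t i = 0; i < P.rows; ++i)
        if (agg[i] >= 0) P.col[P.ptr[i]] = agg[i];

    return P;
}
```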
We use the AMG method as a preconditioner of a conjugate gradient iterative solver. Such an iterative solver is appropriate for linear systems of equations whose coefficient matrix is symmetric and positive-definite, as in the case of elasticity problems. Algorithm 2 shows the pseudo-code of the V-cycle of AMG for preconditioning the iterative solver. The algorithm aims to approximate the solution of (1) given the residual of the previous estimation of the iterative solver. The procedure consists of $\mu_1$ smoothing operations applied to the approximate solution $\mathbf{s}_l$ at level $l$ and the computation of the residual $\mathbf{r}_l$ for the relaxed approximate solution $\mathbf{s}_l$. This residual is then restricted to the coarse grid, where the problem is solved directly if we reach the last level $L$. Once we obtain the approximate solution on the coarsest grid, it is prolonged back to the finer grids, applying $\mu_2$ smoothing operations to the approximate solution at each level.
Algorithm 3 details the conjugate gradient algorithm using the V-cycle of AMG as a preconditioner. This iterative solver only requires the maximum number of iterations $max_{iter}$, the tolerances $\langle tol, tol_{abs} \rangle$ (relative and absolute), the coefficient matrix $\mathbf{K}$, the right-hand side $\mathbf{f}$, and the initial guess for initializing the iterative procedure. The solver provides an approximate solution $\mathbf{u}$ of (1) with its residual after $it$ iterations.
Algorithm 2: AMG preconditioner (V-cycle).
Algorithm 3: Conjugate gradient algorithm
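A minimal C++ sketch of the preconditioned conjugate gradient loop of Algorithm 3 is given below, with the AMG V-cycle of Algorithm 2 represented as a generic callback; the helper names and the relative-residual stopping test are illustrative choices rather than the authors' exact implementation.

```cpp
// Preconditioned conjugate gradient for K u = f; precond(r, z) should compute an
// approximation z ~ K^{-1} r, e.g., one AMG V-cycle. Returns the number of iterations.
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

struct CSR {
    std::size_t n = 0;
    std::vector<std::size_t> ptr, col;
    std::vector<double> val;
};

static void spmv(const CSR& K, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < K.n; ++i) {
        double s = 0.0;
        for (std::size_t j = K.ptr[i]; j < K.ptr[i + 1]; ++j) s += K.val[j] * x[K.col[j]];
        y[i] = s;
    }
}

static double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

int pcg(const CSR& K, const std::vector<double>& f, std::vector<double>& u,
        const std::function<void(const std::vector<double>&, std::vector<double>&)>& precond,
        int max_iter = 1000, double tol = 1e-8) {
    const std::size_t n = K.n;
    std::vector<double> r(n), z(n), p(n), q(n);

    spmv(K, u, q);                                       // r = f - K u0
    for (std::size_t i = 0; i < n; ++i) r[i] = f[i] - q[i];

    precond(r, z);                                       // z = M^{-1} r (one V-cycle)
    p = z;
    double rho = dot(r, z);
    const double norm_f = std::sqrt(dot(f, f));

    for (int it = 0; it < max_iter; ++it) {
        if (std::sqrt(dot(r, r)) <= tol * norm_f) return it;   // relative residual test
        spmv(K, p, q);
        const double alpha = rho / dot(p, q);
        for (std::size_t i = 0; i < n; ++i) { u[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        precond(r, z);
        const double rho_new = dot(r, z);
        const double beta = rho_new / rho;
        rho = rho_new;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
    }
    return max_iter;
}
```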
We have several choices for the smoothing or relaxation operator mentioned in Algorithm 2. We can mention Gauss–Seidel, damped Jacobi, sparse approximate inverse (SPAI), and incomplete LU factorization multigrid smoothers, to name but a few. In the numerical experiments of this work, we adopt the simple and parallel smoother SPAI-0 [57] and the incomplete LU factorization with zero fill-ins (ILU0). When using GPU computing, a prescribed number of Jacobi iterations is used to approximately solve the triangular systems within the ILU0 smoothing iterations instead of performing forward and backward sweeps, which have very low parallelism.
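The idea of trading the sequential triangular sweeps for a fixed number of Jacobi iterations can be sketched as follows; the CSR struct, the default of two sweeps, and the serial loops are illustrative assumptions, and on the GPU the inner row loop maps naturally to one thread per row.

```cpp
// Approximate solution of a sparse triangular system T x = b by a few Jacobi iterations,
// using the splitting T = D + N: x_{k+1} = D^{-1} (b - N x_k). Replaces the sequential
// forward/backward sweep of ILU0 smoothing with fully parallel sweeps.
#include <algorithm>
#include <cstddef>
#include <vector>

struct CSR {
    std::size_t n = 0;
    std::vector<std::size_t> ptr, col;
    std::vector<double> val;
};

void jacobi_triangular_solve(const CSR& T, const std::vector<double>& b,
                             std::vector<double>& x, int sweeps = 2) {
    const std::size_t n = T.n;
    std::vector<double> x_old(n, 0.0), diag(n, 1.0);   // 1.0 covers a unit-diagonal factor

    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = T.ptr[i]; j < T.ptr[i + 1]; ++j)
            if (T.col[j] == i) diag[i] = T.val[j];

    std::fill(x.begin(), x.end(), 0.0);
    for (int k = 0; k < sweeps; ++k) {
        x_old = x;
        for (std::size_t i = 0; i < n; ++i) {          // independent rows: one GPU thread each
            double s = b[i];
            for (std::size_t j = T.ptr[i]; j < T.ptr[i + 1]; ++j)
                if (T.col[j] != i) s -= T.val[j] * x_old[T.col[j]];
            x[i] = s / diag[i];
        }
    }
}
```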

4. Parallel Approach

The partition of complex models into smaller and more manageable pieces is a common approach to exploit distributed computational resources and/or to address large-scale problems. The rapid development of multi-core and many-core architectures has increased the interest in these techniques, the so-called domain decomposition methods (DDMs). They permit us to divide the computational burden and the memory requirements across the resources of the computing system. The underlying idea consists of partitioning the domain into a set of subdomains, to which we assign individual resources that address the global problem in parallel by communicating with the resources in charge of the other subdomains. This strategy increases the overall performance by performing the operations in parallel and distributing the memory requirements. The load balancing of the calculation required by the subdomains is key to increasing computing performance. DDMs are iterative in nature and usually require communication between the computational resources. Minimizing such communications is very important for increasing the computing performance.
This work makes use of a Global Subdomain Implementation (GSI) [58], in which we assume that the coefficient matrix $\mathbf{K}_l$ for each level $l$ of (2) is distributed across the processes $p \in \{1, \ldots, n_p\}$ (with $n_p$ the number of processes) by contiguous blocks of rows as follows:
$\mathbf{K}_l = \begin{bmatrix} \mathbf{K}_l^{0} \\ \vdots \\ \mathbf{K}_l^{n_p-1} \end{bmatrix},$
where each block submatrix $\mathbf{K}_l^{p-1} \in \mathbb{R}^{n^{p-1} \times n}$ is computed by a single process $p$, with $n^{p-1}$ the number of rows of the block submatrix and $n$ the number of columns of $\mathbf{K}_l$. The block submatrices use the global row indices $\{n^{0}, \ldots, n^{p-1}\}$ to facilitate operating between subdomains. We store these submatrices as local $\mathbf{K}_{loc}^{p-1}$ and remote $\mathbf{K}_{rem}^{p-1}$ parts. The former is a square matrix including the “local” $n^{p-1}$ unknowns, whereas the latter is a matrix containing coefficients with global column indices stored in other processes. This strategy allows us to differentiate between “local” and “distributed” computations. Local computation is performed with the data stored in the own process $p$, whereas distributed computation requires some communication mechanism.
The communications make use of the standardized and portable MPI mechanism. When we initialize the communications, each process $p$ calculates the global columns required from both its own process and other processes. Data exchange consists of receiving ghost values from the processes sharing unknowns (global column indices) and then sending the data to the processes that require them to form the gather vectors. Since each process $p$ knows its receiving and sending processes, data exchange is performed directly between such processes. This strategy reduces the computational complexity and the storage requirements because the number of neighbors and the amount of data exchanged are independent of the total number of processes.
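The combination of the local/remote split and the point-to-point ghost exchange can be sketched as follows; the Exchange structure (which a real code fills once during initialization from the global column indices, with the ghost ordering matching the column numbering of the remote part) and the function names are illustrative.

```cpp
// Sketch of the distributed product y = K x with the block-row layout: the local part
// multiplies owned unknowns while ghost values are gathered from neighbouring processes
// with point-to-point messages, overlapping communication with the local product.
#include <mpi.h>
#include <cstddef>
#include <vector>

struct CSR {
    std::size_t rows = 0;
    std::vector<std::size_t> ptr, col;
    std::vector<double> val;
};

struct Exchange {
    std::vector<int> neighbours;                      // ranks sharing unknowns with this process
    std::vector<std::vector<std::size_t>> send_idx;   // local entries requested by each neighbour
    std::vector<std::size_t> recv_count;              // ghost entries received from each neighbour
};

static void spmv(const CSR& A, const std::vector<double>& x, std::vector<double>& y, bool add) {
    for (std::size_t i = 0; i < A.rows; ++i) {
        double s = add ? y[i] : 0.0;
        for (std::size_t j = A.ptr[i]; j < A.ptr[i + 1]; ++j) s += A.val[j] * x[A.col[j]];
        y[i] = s;
    }
}

void dist_spmv(const CSR& K_loc, const CSR& K_rem, const Exchange& ex,
               const std::vector<double>& x_loc, std::vector<double>& x_ghost,
               std::vector<double>& y, MPI_Comm comm) {
    std::vector<MPI_Request> req;
    std::vector<std::vector<double>> send_buf(ex.neighbours.size());

    std::size_t offset = 0;
    for (std::size_t k = 0; k < ex.neighbours.size(); ++k) {
        req.emplace_back();
        MPI_Irecv(x_ghost.data() + offset, static_cast<int>(ex.recv_count[k]), MPI_DOUBLE,
                  ex.neighbours[k], 0, comm, &req.back());
        offset += ex.recv_count[k];

        send_buf[k].resize(ex.send_idx[k].size());
        for (std::size_t i = 0; i < ex.send_idx[k].size(); ++i)
            send_buf[k][i] = x_loc[ex.send_idx[k][i]];
        req.emplace_back();
        MPI_Isend(send_buf[k].data(), static_cast<int>(send_buf[k].size()), MPI_DOUBLE,
                  ex.neighbours[k], 0, comm, &req.back());
    }

    spmv(K_loc, x_loc, y, /*add=*/false);              // local work overlaps the exchange
    MPI_Waitall(static_cast<int>(req.size()), req.data(), MPI_STATUSES_IGNORE);
    spmv(K_rem, x_ghost, y, /*add=*/true);             // remote contribution once ghosts arrive
}
```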
The GSI requires an efficient mesh partitioning into several non-overlapping subdomains [59]. Ideally, the number of interface elements of these subdomains should be minimized, thus reducing data exchange between processes. The efficient implementation of mesh partitioning is key for addressing large-scale problems and/or problems with many subdomains because the partitioning method is memory intensive. The process consists of the generation of a dual graph (each finite element becomes a graph vertex) from the geometric mesh of the finite element model; we then use a multilevel k-way partitioning method [60] to define the subdomains considering optimization criteria, such as minimizing the connectivity of the resulting subdomain graph and enforcing contiguous partitions.
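A hedged sketch of the partitioning call follows; the argument list corresponds to the ParMETIS_V3_PartMeshKway routine as documented in the ParMETIS 4.0 manual, but the types (idx_t, real_t) and the exact argument order should be verified against the installed headers, and ncommonnodes depends on the element type (e.g., 3 for tetrahedra, 4 for hexahedra).

```cpp
// Parallel mesh partitioning with ParMETIS (sketch): each rank holds a slice of the
// element-to-node connectivity and obtains the target subdomain of its local elements.
#include <mpi.h>
#include <parmetis.h>
#include <cstddef>
#include <vector>

std::vector<idx_t> partition_mesh(std::vector<idx_t>& elmdist,   // element distribution over ranks
                                  std::vector<idx_t>& eptr,      // local element connectivity (CSR)
                                  std::vector<idx_t>& eind,
                                  idx_t nparts, idx_t ncommonnodes,
                                  MPI_Comm comm) {
    idx_t wgtflag = 0;                 // no element weights
    idx_t numflag = 0;                 // 0-based numbering
    idx_t ncon    = 1;                 // one balance constraint
    idx_t options[3] = {0, 0, 0};      // default options
    idx_t edgecut = 0;                 // output: edges cut by the partitioning

    std::vector<real_t> tpwgts(static_cast<std::size_t>(ncon * nparts),
                               static_cast<real_t>(1.0) / static_cast<real_t>(nparts));
    std::vector<real_t> ubvec(static_cast<std::size_t>(ncon), static_cast<real_t>(1.05));
    std::vector<idx_t> part(eptr.size() - 1);   // output: subdomain of each local element

    ParMETIS_V3_PartMeshKway(elmdist.data(), eptr.data(), eind.data(),
                             /*elmwgt=*/nullptr, &wgtflag, &numflag, &ncon,
                             &ncommonnodes, &nparts, tpwgts.data(), ubvec.data(),
                             options, &edgecut, part.data(), &comm);
    return part;
}
```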
All these functionalities are available in modern high-level open-source libraries. We can perform mesh partitioning efficiently on multi-core architectures with the ParMETIS library, which performs the parallel generation of the dual graph from the finite element mesh and the multilevel k-way partitioning using the MPI standard. The use of parallel computing techniques permits us to save a large amount of memory in this process, mainly because generating the whole graph for the calculation of each subdomain is especially memory intensive, which can make the problem unaffordable. The AMGCL library provides the parallel implementation of the iterative solver and the AMG preconditioning using both distributed and shared memory. The VEXCL library permits the use of OpenCL and CUDA backends with the functionalities provided by the AMGCL library. We integrate all these high-level open-source libraries to solve the linear elasticity problems presented below.
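The sketch below shows how the pieces fit together for a single GPU, following the documented AMGCL/VexCL usage pattern (smoothed aggregation coarsening, SPAI-0 relaxation, CG solver); the toy 1D Laplacian stands in for the assembled stiffness matrix, and the distributed multi-GPU runs reported here build on the corresponding amgcl::mpi components, which are not shown.

```cpp
// Single-GPU solve with AMGCL on top of the VexCL (OpenCL) backend.
#include <cstddef>
#include <iostream>
#include <tuple>
#include <vector>

#include <vexcl/vexcl.hpp>
#include <amgcl/backend/vexcl.hpp>
#include <amgcl/adapter/crs_tuple.hpp>
#include <amgcl/make_solver.hpp>
#include <amgcl/amg.hpp>
#include <amgcl/coarsening/smoothed_aggregation.hpp>
#include <amgcl/relaxation/spai0.hpp>
#include <amgcl/solver/cg.hpp>

int main() {
    // Toy 1D Laplacian in CSR format stands in for the assembled stiffness matrix.
    const std::ptrdiff_t n = 10000;
    std::vector<std::ptrdiff_t> ptr = {0}, col;
    std::vector<double> val, rhs(n, 1.0);
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        if (i > 0)     { col.push_back(i - 1); val.push_back(-1.0); }
        col.push_back(i); val.push_back(2.0);
        if (i < n - 1) { col.push_back(i + 1); val.push_back(-1.0); }
        ptr.push_back(static_cast<std::ptrdiff_t>(col.size()));
    }

    vex::Context ctx(vex::Filter::GPU && vex::Filter::DoublePrecision);

    typedef amgcl::backend::vexcl<double> Backend;
    typedef amgcl::make_solver<
        amgcl::amg<Backend, amgcl::coarsening::smoothed_aggregation, amgcl::relaxation::spai0>,
        amgcl::solver::cg<Backend>
    > Solver;

    Backend::params bprm; bprm.q = ctx;   // hand the OpenCL command queues to the backend
    Solver::params prm;
    prm.solver.tol = 1e-8;                // relative tolerance used in the experiments

    Solver solve(std::tie(n, ptr, col, val), prm, bprm);

    vex::vector<double> f(ctx, rhs);      // transfer the right-hand side to the device
    vex::vector<double> x(ctx, n);
    x = 0.0;

    int iters = 0; double error = 0.0;
    std::tie(iters, error) = solve(f, x);
    std::cout << iters << " iterations, relative residual " << error << std::endl;
    return 0;
}
```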

5. Numerical Experiments

We evaluate the benefits and limitations of using a distributed conjugate gradient iterative solver preconditioned with aggregation AMG on multi-GPU systems. We test the scalability of these techniques using many-core architectures, analyzing the pros and cons compared to efficient multi-core systems using the same approach. The numerical experiments consist of the resolution of finite element models with different discretization sizes until the hardware limitations of the many-core architectures no longer permit us to address the problem. These experiments cover examples addressing two- and three-dimensional problems using structured and unstructured meshes with different finite elements.
We run the numerical experiments using a desktop computer with an Intel Xeon W-2145 CPU at 3.70 GHz with 128 GB of RAM and four Nvidia Titan V graphics cards. Table 1 summarizes the features and specifications of such devices. The Nvidia Titan V uses the GV100 processor based on the Volta architecture, which includes 21.1 × 10^9 transistors, 5120 CUDA cores operating from 1200 MHz (base) to 1455 MHz (boost), 12 GB of HBM2 memory operating at 850 MHz, and a memory interface of 3072 bits, which provide a total memory bandwidth of around 652.8 GB/s. This graphics card includes specific hardware to perform double-precision floating-point operations in scientific computing. The theoretical computational performance is 12.29 TFLOPS and 6.14 TFLOPS for single- and double-precision floating-point operations, respectively. The thermal design power (TDP) of the Nvidia Titan V is around 250 W, which permits us to install the four graphics devices using a high-power PSU. Figure 2 shows the multi-GPU system considering cooling issues: the graphics cards are physically separated to facilitate heat dissipation.
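These peak figures can be verified with simple arithmetic, assuming one fused multiply-add (two floating-point operations) per CUDA core and cycle at the base clock, a 1:2 double-to-single precision throughput ratio on GV100, and a double data rate memory interface:

$$
5120 \times 2 \times 1.2\,\mathrm{GHz} \approx 12.29\ \mathrm{TFLOPS}\ (\mathrm{FP32}), \qquad 12.29 / 2 \approx 6.14\ \mathrm{TFLOPS}\ (\mathrm{FP64}),
$$
$$
0.85\,\mathrm{GHz} \times 2 \times 3072\,\mathrm{bits} / (8\ \mathrm{bits/byte}) = 652.8\ \mathrm{GB/s}.
$$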
We run all the experiments using multi-core and many-core architectures with different computing resources to evaluate the scalability and acceleration. The multi-core numerical experiments use one, two, four, and eight cores. The Intel Xeon W-2145 microprocessor can use two threads per core using multithreading, allowing the operating system to run 16 threads concurrently. Multithreading aims to increase the utilization of core resources by exploiting function parallelism and ILP as complementary techniques. In our experience, multithreading with the massive use of Arithmetic Logic Units (ALUs) serializes the execution of threads, limiting the theoretical performance of multi-core systems. For this reason, the experiments do not use more than the maximum number of physical cores. The many-core numerical experiments use one, two, three, or four Nvidia Titan V graphics cards. The source code of the numerical experiments is compiled using NVIDIA CUDA Toolkit 12.6 and runs on a 64-bit Ubuntu 20.04.6 LTS OS with NVIDIA Driver Version 550.107.02.
The battery of experiments consists of three finite element models with different meshes and finite elements. We parameterize these models with the number of divisions div of the geometry. By modifying this parameter div, we obtain systems of equations of different sizes, which we solve using the distributed conjugate gradient solver preconditioned with aggregation AMG using multi-core and many-core computing. We specify the geometric parameters and material properties of the experiments in Table 2. The stopping criterion of the distributed conjugate gradient is 1 × 10^{-8} for all the experiments. We report the wall-clock time for all the numerical experiments. We also present the device memory usage and the ratio between the setup and solving stages for the experiments using many-core architectures. We calculate the speedup for all the experiments of the multi-core system using from one to eight CPUs. These numerical results provide the information to highlight the benefits and limitations of using many-core architectures in structural mechanics problems. We present the details of the finite element models, the numerical results of the experiments, and a discussion of such numerical results.

5.1. Simply Supported Beam with a Hole

The 2D simply supported beam with a hole experiment consists of a beam with fixed support in the bottom left-hand corner and roller support in the bottom right-hand corner. Table 2 specifies the length L and height H of the geometry of the beam and the material properties corresponding to aluminum. Figure 3a shows the geometric configuration of the finite element problem, the boundary conditions, and the load applied in the middle of the top edge. Figure 3b shows the parameterization of the unstructured mesh. We use the parameter div to obtain meshes with different numbers of finite elements. Increasing this parameter div provides the finite element models, which we analyze under plane strain assumptions. Figure 3b also shows an example of the partitioning into four subdomains using the ParMetis library. We can observe that the partitioning algorithm provides subdomains with a similar number of elements, with a higher element concentration around the hole, while enforcing the contiguity of the partitions. We use Constant Strain Triangle (CST) finite elements in the parameterized unstructured mesh.

5.2. Single-Arch Dam

The 3D single-arch dam experiment consists of an ideal arch dam with the same face radius at all its elevations. The dam is wider at the bottom to withstand the hydrostatic pressure load. These structures obtain stability by combining the arch and gravity actions. Figure 4a shows the geometric configuration of the dam model, and Table 2 specifies the geometry parameters of the model and the material properties corresponding to high-strength concrete. We show the boundary conditions of the model and the hydrostatic pressure load applied at the face containing the body of water in Figure 4b. The boundary conditions model firm and reliable supports at the abutments. The finite element model also uses symmetry boundary conditions by restricting the displacement perpendicular to the symmetry plane xy. Figure 4c shows the parameterization of the structured mesh composed of linear tetrahedral (solid) elements. We can obtain meshes with different numbers of finite elements by modifying the parameter div. Figure 4c also shows the partitioning into four balanced subdomains enforcing contiguous partitions using the ParMetis library.

5.3. L-Shaped Cantilever Beam

The last experiment consists of a non-uniform L-shaped beam anchored at one end. The beam has a rectangular section with a greater height at the constrained end. Figure 5a shows the geometric configuration of the cantilever model, and Table 2 specifies its geometry and the material properties corresponding to ASTM-A36 steel. Figure 5b details the boundary conditions and the uniform load acting at the non-constrained end of the cantilever. Figure 5c shows the parameterization of the structured mesh composed of eight-node hexahedral linear brick elements. We use the parameter div to generate finite elements with a similar aspect ratio, with the adjustments shown in Figure 5c. Increasing this parameter div provides finite element models with an increasing number of unknowns. Figure 5c also shows the partitioning into four balanced subdomains enforcing contiguous partitions using the ParMetis library.

5.4. Numerical Results

The numerical experiments evaluate the GPU performance by solving sets of finite element models that we obtain by modifying the number of unknowns of the problem with the div parameter. We increase the div parameter until we reach the hardware limitations of the multi-GPU system using the four graphics cards due to insufficient device memory. For all the experiments, we show the wall-clock time using multi-core and many-core architectures, the device memory used for the corresponding problem size, and the speedup from the use of both one (serial execution) and eight (maximum number of cores) CPUs of the multi-core system.
We solve the 2D simply supported beam experiment using as a preconditioner the AMG method with aggregation coarsening and the parallel smoother SPAI-0 for relaxing the solver iterations. Figure 6a shows the wall-clock time for solving the finite element problem with different numbers of unknowns. We solve this model with more than 73 M Degrees of Freedom (DoF) using the serial implementation, which takes about 52 h. We can accelerate the resolution of this problem using four cores of the multi-core system, which takes less than 15 h. This result shows good scalability by obtaining a speedup of 3.5 over the serial implementation. We cannot solve the problem with more than 73 M DoF using eight cores due to the memory required by the parallel partitioning approach, which limits the problem size to almost 57 M DoF for eight partitions with the 128 GB of RAM available in the desktop computer. We have to remark that the number of CST finite elements has a similar order of magnitude to the number of unknowns for this type of finite element model. Thus, the dual graphs from the geometric mesh (generated for each subdomain) also have high memory requirements.
Figure 6b shows the amount of device memory used to solve the 2D simply supported beam experiment using GPU computing with different numbers of unknowns. By incrementing the number of graphics devices, we can increase the size of the problem that we can solve and also accelerate the resolution of the system of Equation (1). We can address finite element models of up to 18 M DoF using one GPU, whereas we can solve finite element models of up to 73 M DoF using four graphics devices. Concerning the performance, Figure 6c,d show the speedup using multi-core and many-core computing over solving the problem using one and eight CPUs, respectively. They also show the number of levels used by the AMG preconditioner. We can observe that the acceleration obtained using GPU computing is of a different order of magnitude compared with the speedup obtained using multi-core computing. The speedup using GPU computing increases with the problem size owing to the higher bandwidth of such massively parallel architectures. The speedup curve using one GPU is more regular than the curves using multiple GPUs. We attribute such oscillations of the speedup curves to communication delays between distributed processes and to the number of levels of the preconditioner, which increases both the computation and the communications of the distributed operations.
We use an optimized implementation of the AMG preconditioner with aggregation coarsening using block matrices (grouping into blocks all unknowns of a grid point) to solve the 3D arch-dam experiment. This instance uses incomplete LU factorization with zero fill-ins (ILU0) for the smoothing iterations. Figure 7a shows the wall-clock time for solving the single-arch dam experiment with different numbers of unknowns. We address problems of about 32 M DoF using the serial implementation, taking about one hour to solve. This efficient implementation using block matrices also shows good scalability by obtaining a speedup of 3.2 with multi-core computing using four CPUs over the serial implementation. Using tetrahedral finite elements in the structured mesh generates large dual graphs in the parallel partitioning approach, limiting the problem size to almost 22 M DoF for eight partitions with the 128 GB of RAM available on the workstation. The number of finite elements of the problem is approximately double the number of unknowns.
Figure 7b shows the amount of device memory used to solve the 3D arch-dam experiment using GPU computing with different numbers of unknowns. We can observe that by incrementing the number of GPU devices, we accelerate the solving and increase the size of the problems we can address. The speedup using multi-core and many-core computing over solving the problem using one and eight CPUs is shown in Figure 7c,d, respectively. We can observe significant speedups using GPU computing over the multi-core implementation, mainly due to the higher bandwidth of many-core architectures. Such accelerations are below the speedups obtained in the 2D experiment because the problem size is smaller and the CPU implementation using block values is much more efficient.
We use an AMG preconditioner with smoothed aggregation coarsening using block matrices to solve the 3D L-shaped cantilever experiment. This experiment also uses ILU0 to smooth the solving iterations. Figure 8a shows the wall-clock time for solving the experiment with different numbers of unknowns. We address problems of up to 24 M DoF using the serial implementation, which takes less than one hour to solve. Since hexahedral finite elements produce fewer connections in the dual graph for partitioning, we can generate the subdomains for the eight partitions in these problems. The multi-core implementation shows good scalability by obtaining a speedup of 5.7 using eight cores over the serial implementation. Figure 8c,d show the speedup over the serial and multi-core implementations using one and eight CPUs, respectively. In this case, the speedup using the multi-core system with eight CPUs is of the same order of magnitude as the acceleration obtained using one GPU. We can observe in Figure 8b that using several GPUs allows us to accelerate the solving and increase the problem size we can address. Nevertheless, the speedup is lower than the one obtained in the experiment using tetrahedral finite elements. We attribute the performance decrease to the larger grain size of the GPU implementation compared with using tetrahedral finite elements, since the rows of the assembled coefficient matrix of (1) have more nonzero entries.
A key point for an efficient implementation of aggregation AMG is the parallel implementation of the setup stage. This stage provides the transfer operators that reduce the error components through interpolation and restriction operations to approximate the solution. The solving stage consists of the distributed conjugate gradient with the preconditioner applying a V-cycle per iteration. The computational cost of this stage comes from the matrix-vector operations using the sparse-matrix representation and the communications between the subdomains. Figure 9 shows the wall-clock time detailing the setup and solving stages for the experiments using GPU computing. We group these results by problem size. We can observe that the solving stage obtains a higher speedup using multiple GPUs for the same problem size. This speedup is particularly relevant for large-scale problems or approaches requiring several iterations to converge.

6. Conclusions

We have presented our experience using modern high-level libraries to exploit the computational resources of multi-core and many-core architectures to accelerate finite element analysis in structural mechanics. The numerical results show that using a distributed conjugate gradient solver preconditioned with aggregation AMG is especially rewarding using GPU computing. These massively parallel computer architectures provide a high bandwidth that we can use to accelerate the resolution of structural mechanics systems of equations using the proper techniques. The numerical experiments evaluate the weak scalability using distributed multi-core and many-core systems to address structural mechanics problems. Such problems include models with structured and unstructured meshes using 2D and 3D finite elements, which we solve with different relaxation operators. We also present a quantitative comparison of both approaches with different computational resources and techniques. The results show the limitations of multi-GPU systems due to the device memory requirements and their advantages in computational capabilities.
The comparison between multi-core and many-core systems shows higher speedups using GPU computing with respect to the CPU implementation counterpart. This conclusion is especially true when using multiple graphics devices, which provide a memory bandwidth and latency that a multi-core system can hardly match. Another advantage of using many-core architectures is that fewer subdomains are needed. A high number of subdomains complicates the partitioning of large-scale models and increases the communications with the global subdomain implementation used in this work. We also have to remark that we perform the numerical experiments on a relatively low-cost workstation, which can address problems of several million unknowns in a reasonable time. We adopt, implement, and evaluate techniques that can distribute the computing burden in GPU clusters, which reduces the GPU computing limitations exposed in this work.
As future developments, we can mention the extension of the experiments to other GPU architectures to evaluate the applicability and adaptability of the aggregation AMG method across different hardware platforms. A performance comparison between GMG, classical AMG, and aggregation AMG methods using distributed GPU implementations is also an interesting future work. We also plan to evaluate the GPU implementation using mixed-precision operations because GPU hardware specifications show an increasing teraflop rating as the precision of the operations decreases. Finally, we can also propose the integration of the aggregation AMG preconditioner with cloud computing environments using synchronous and asynchronous iterative parallel methods for solving structural mechanics problems.

Author Contributions

Conceptualization, D.H.-P. and H.M.-B.; software, D.H.-P. and H.M.-B.; investigation, D.H.-P. and H.M.-B.; supervision, H.M.-B.; project administration, H.M.-B.; writing—review and editing, D.H.-P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by AEI/FEDER and UE grant number DPI2016-77538-R.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Venkataraman, S.; Haftka, R. Structural optimization complexity: What has Moore’s law done for us? Struct. Multidiscip. Optim. 2004, 28, 375–387. [Google Scholar] [CrossRef]
  2. Schaller, R. Moore’s Law: Past, present and future. IEEE Spectr. 1997, 34, 53–59. [Google Scholar] [CrossRef]
  3. Asanovic, K.; Bodik, R.; Demmel, J.; Keaveny, T.; Keutzer, K.; Kubiatowicz, J.; Morgan, N.; Patterson, D.; Sen, K.; Wawrzynek, J.; et al. A View of the Parallel Computing Landscape. Commun. ACM 2009, 52, 56–67. [Google Scholar] [CrossRef]
  4. Wall, D. Limits of instruction-level parallelism. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, USA, 8–11 April 1991; pp. 176–188. [Google Scholar] [CrossRef]
  5. Wulf, W.; McKee, S. Hitting the memory wall: Implications of the obvious. ACM SIGARCH Comput. Archit. News 1995, 23, 20–24. [Google Scholar] [CrossRef]
  6. Asanovic, K.; Bodik, R.; Catanzaro, B.C.; Gebis, J.J.; Husbands, P.; Keutzer, K.; Patterson, D.A.; Plishker, W.L.; Shalf, J.; Williams, S.W.; et al. The Landscape of Parallel Computing Research: A View from Berkeley; Technical report; UC Berkeley: Berkeley, CA, USA, 2006. [Google Scholar]
  7. Kahle, J.; Moreno, J.; Dreps, D. Summit and Sierra: Designing AI/HPC Supercomputers. In Proceedings of the IEEE International Solid-State Circuits Conference, San Francisco, CA, USA, 17–21 February 2019; pp. 42–43. [Google Scholar] [CrossRef]
  8. Nickolls, J.; Dally, W. The GPU Computing Era. IEEE Micro 2010, 30, 56–69. [Google Scholar] [CrossRef]
  9. Brodtkorb, A.; Hagen, T.; Sætra, M. Graphics processing unit (GPU) programming strategies and trends in GPU computing. J. Parallel Distrib. Comput. 2013, 73, 4–13. [Google Scholar] [CrossRef]
  10. Noaje, G.; Krajecki, M.; Jaillet, C. MultiGPU computing using MPI or OpenMP. In Proceedings of the International Conference on Intelligent Computer Communication and Processing, Cluj-Napoca, Romania, 26–28 August 2010; pp. 347–354. [Google Scholar] [CrossRef]
  11. Cheik-Ahamed, A.K.; Magoulès, F. GPU accelerated substructuring methods for sparse linear systems. In Proceedings of the IEEE Int. Conf. on High Performance Computing and Communications (HPCC), Paris, France, 12–14 December 2016; pp. 614–625. [Google Scholar] [CrossRef]
  12. Wang, X.; Li, S.; Dong, W.; An, B.; Huang, H.; He, Q.; Wang, P.; Lv, G. Multi-GPU parallel acceleration scheme for meshfree peridynamic simulations. Theor. Appl. Fract. Mec. 2024, 131, 104401. [Google Scholar] [CrossRef]
  13. Barreales, G.N.; Novalbos, M.; Otaduy, M.A.; Sanchez, A. MDScale: Scalable multi-GPU bonded and short-range molecular dynamics. J. Parallel Distrib. Comput. 2021, 157, 243–255. [Google Scholar] [CrossRef]
  14. Karzhaubayev, K.; Wang, L.P.; Zhakebayev, D. DUGKS-GPU: An efficient parallel GPU code for 3D turbulent flow simulations using Discrete Unified Gas Kinetic Scheme. Comput. Phys. Commun. 2024, 301, 109216. [Google Scholar] [CrossRef]
  15. Gao, J.; Ji, W.; Wang, Y. Optimization of Large-Scale Sparse Matrix-Vector Multiplication on Multi-GPU Systems. ACM Trans. Arch. Code Optim. 2024, 21, 69. [Google Scholar] [CrossRef]
  16. John, J.; Milthorpe, J.; Herault, T.; Bosilca, G. Multi-GPU work sharing in a task-based dataflow programming model. Future Gener. Comput. Syst. 2024, 156, 313–324. [Google Scholar] [CrossRef]
  17. Georgescu, S.; Chow, P.; Okuda, H. GPU Acceleration for FEM-Based Structural Analysis. Arch. Comput. Method Eng. 2013, 20, 111–121. [Google Scholar] [CrossRef]
  18. Kirk, D.B.; Hwu, W.m.W. Programming Massively Parallel Processors: A Hands-on Approach, 2nd ed.; Morgan Kaufmann: Waltham, MA, USA, 2013. [Google Scholar] [CrossRef]
  19. Gullerud, A.; Dodds, R. MPI-based implementation of a PCG solver using an EBE architecture and preconditioner for implicit, 3-D finite element analysis. Comput. Struct. 2001, 79, 553–575. [Google Scholar] [CrossRef]
  20. Mackie, R. Object-oriented programming of distributed iterative equation solvers. Comput. Struct. 2008, 86, 511–519. [Google Scholar] [CrossRef]
  21. Dehnavi, M.; Fernández, D.; Giannacopoulos, D. Enhancing the Performance of Conjugate Gradient Solvers on Graphic Processing Units. IEEE Trans. Magn. 2011, 47, 1162–1165. [Google Scholar] [CrossRef]
  22. Helfenstein, R.; Koko, J. Parallel preconditioned conjugate gradient algorithm on GPU. J. Comput. Appl. Math. 2012, 236, 3584–3590. [Google Scholar] [CrossRef]
  23. Li, R.; Saad, Y. GPU-accelerated preconditioned iterative linear solvers. J. Supercomput. 2013, 63, 443–466. [Google Scholar] [CrossRef]
  24. Göddeke, D.; Strzodka, R.; Mohd-Yusof, J.; McCormick, P.; Buijssen, S.; Grajewski, M.; Turek, S. Exploring weak scalability for FEM calculations on a GPU-enhanced cluster. Parallel Comput. 2007, 33, 685–699. [Google Scholar] [CrossRef]
  25. Bell, N.; Garland, M. Efficient Sparse Matrix-Vector Multiplication on CUDA; NVIDIA Technical Report NVR-2008-004; NVIDIA: Santa Clara, CA, USA, 2008. [Google Scholar]
  26. Cheik-Ahamed, A.K.; Magoulès, F. Parallel Sub-Structuring Methods for solving Sparse Linear Systems on a cluster of GPU. In Proceedings of the IEEE International Conference on High Performance Computing and Communications (HPCC), Paris, France, 20–22 August 2014; pp. 121–128. [Google Scholar] [CrossRef]
  27. Cheik-Ahamed, A.K.; Magoulès, F. Efficient implementation of Jacobi iterative method for large sparse linear systems on graphic processing units. J. Supercomput. 2017, 73, 3411–3432. [Google Scholar] [CrossRef]
  28. Martínez-Frutos, J.; Martínez-Castejón, P.J.; Herrero-Peréz, D. Fine-grained GPU implementation of assembly-free iterative solver for finite element problems. Comput. Struct. 2015, 157, 9–18. [Google Scholar] [CrossRef]
  29. Cecka, C.; Lew, A.; Darve, E. Assembly of finite element methods on graphics processors. Int. J. Numer. Methods Eng. 2011, 85, 640–669. [Google Scholar] [CrossRef]
  30. Markall, G.; Slemmer, A.; Ham, D.; Kelly, P.; Cantwell, C.; Sherwin, S. Finite element assembly strategies on multi-core and many-core architectures. Int. J. Numer. Methods Fluids 2013, 71, 80–97. [Google Scholar] [CrossRef]
  31. Kiss, I.; Gyimóthy, S.; Badics, Z.; Pávó, J. Parallel Realization of the Element-by-Element FEM Technique by CUDA. IEEE Trans. Magn. 2012, 48, 507–510. [Google Scholar] [CrossRef]
  32. Cai, Y.; Li, G.; Wang, H. A Parallel Node-based Solution Scheme for Implicit Finite Element Method Using GPU. Procedia Eng. 2013, 61, 318–324. [Google Scholar] [CrossRef]
  33. Martínez-Frutos, J.; Herrero-Peréz, D. Efficient Matrix-Free GPU implementation of Fixed Grid Finite Element Analysis. Finite Elem. Anal. Des. 2015, 104, 61–71. [Google Scholar] [CrossRef]
  34. Suresh, K. Efficient generation of large-scale pareto-optimal topologies. Struct. Multidiscip. Optim. 2013, 47, 49–61. [Google Scholar] [CrossRef]
  35. Zegard, T.; Paulino, G. Toward GPU accelerated topology optimization on unstructured meshes. Struct. Multidiscip. Optim. 2013, 48, 473–485. [Google Scholar] [CrossRef]
  36. Ruge, J.; Stüben, K. Algebraic Multigrid. In Multigrid Methods; Ewing, R., Ed.; Society for Industrial and Applied Mathematics (SIAM), 3600 University City Science Center: Philadelphia, PA, USA, 1987; Chapter 4; pp. 73–130. [Google Scholar] [CrossRef]
  37. Bulgakov, V. Multi-level iterative technique and aggregation concept with semi-analytical preconditioning for solving boundary-value problem. Comm. Numer. Methods Engrng. 1993, 9, 649–657. [Google Scholar] [CrossRef]
  38. Gandham, R.; Esler, K.; Zhang, Y. A GPU accelerated aggregation algebraic multigrid method. Comput. Math. Appl. 2014, 68, 1151–1160. [Google Scholar] [CrossRef]
  39. Kettler, R. Analysis and Comparison of Relaxation Schemes in Robust Multigrid and Preconditioned Conjugate Gradient Methods. In Multigrid Methods; Lecture Notes in Mathematics; Hackbusch, W., Trottenberg, U., Eds.; Springer: Berlin/Heidelberg, Germany, 1982; Volume 960, pp. 502–534. [Google Scholar] [CrossRef]
  40. Tatebe, O. The multigrid preconditioned conjugate gradient method. In Proceedings of the Sixth Copper Mountain Conference on Multigrid Methods, Copper Mountain, CO, USA, 4–9 April 1993; pp. 1–14. [Google Scholar]
  41. Stüben, K. A review of algebraic multigrid. J. Comput. Appl. Math. 2001, 128, 281–309. [Google Scholar] [CrossRef]
  42. Dick, C.; Georgii, J.; Westermann, R. A Real-Time Multigrid Finite Hexahedra Method for Elasticity Simulation using CUDA. Simul. Model. Pract. Th. 2011, 19, 801–816. [Google Scholar] [CrossRef]
  43. Martínez-Frutos, J.; Herrero-Pérez, D. GPU acceleration for evolutionary topology optimization of continuum structures using isosurfaces. Comput. Struct. 2017, 182, 119–136. [Google Scholar] [CrossRef]
  44. Martínez-Frutos, J.; Martínez-Castejón, P.J.; Herrero-Pérez, D. Efficient topology optimization using GPU computing with multilevel granularity. Adv. Eng. Softw. 2017, 106, 47–62. [Google Scholar] [CrossRef]
  45. Wu, J.; Dick, C.; Westermann, R. A System for High-Resolution Topology Optimization. IEEE Trans. Visual Comput. Graphics 2016, 22, 1195–1208. [Google Scholar] [CrossRef]
  46. Fu, Z.; Lewis, T.; Kirby, R.; Whitaker, R. Architecting the finite element method pipeline for the GPU. J. Comput. Appl. Math. 2014, 257, 195–211. [Google Scholar] [CrossRef]
  47. Liu, H.; Yang, B.; Chen, Z. Accelerating algebraic multigrid solvers on NVIDIA GPUs. Comput. Math. Appl. 2015, 70, 1162–1181. [Google Scholar] [CrossRef]
  48. Karypis, G.; Schloegel, K. ParMeTis: Parallel Graph Partitioning and Sparse Matrix Ordering Library, Version 4.0; Technical report; University of Minnesota: Minneapolis, MN, USA, 2013. [Google Scholar]
  49. Demidov, D. AMGCL: An Efficient, Flexible, and Extensible Algebraic Multigrid Implementation. Lobachevskii J. Math. 2019, 40, 535–546. [Google Scholar] [CrossRef]
  50. Demidov, D.; Ahnert, K.; Rupp, K.; Gottschling, P. Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries. SIAM J. Sci. Comput. 2013, 35, C453–C472. [Google Scholar] [CrossRef]
  51. Karsten, A.; Demidov, D.; Mulansky, M. Solving Ordinary Differential Equations on GPUs. In Numerical Computations with GPUs; Kindratenko, V., Ed.; Springer International Publishing: Cham, Switzerland, 2014; pp. 125–157. [Google Scholar] [CrossRef]
  52. Szuppe, J. Boost.Compute: A parallel computing library for C++ based on OpenCL. In Proceedings of the 4th International Workshop on OpenCL (IWOCL 16), Vienna, Austria, 19–21 April 2016; Volume 15, pp. 1–39. [Google Scholar] [CrossRef]
  53. Vaněk, P. Acceleration of Convergence of a Two Level Algorithm by Smooth Transfer Operators. Appl. Math. 1992, 37, 265–274. [Google Scholar] [CrossRef]
  54. Vaněk, P.; Mandel, J.; Brezina, M. Algebraic multigrid by smoothed aggregation for second and fourth order elliptic problems. Computing 1996, 56, 179–196. [Google Scholar] [CrossRef]
  55. Tuminaro, R.; Tong, C. Parallel Smoothed Aggregation Multigrid: Aggregation Strategies on Massively Parallel Machines. In Proceedings of the International ACM/IEEE Conference on Supercomputing, Dallas, TX, USA, 13–19 November 2010; pp. 1–20. [Google Scholar] [CrossRef]
  56. Bell, N.; Dalton, S.; Olson, L. Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods. SIAM J. Sci. Comput. 2012, 34, C123–C152. [Google Scholar] [CrossRef]
  57. Grote, M.; Huckle, T. Parallel Preconditioning with Sparse Approximate Inverses. SIAM J. Sci. Comput. 1996, 18, 838–853. [Google Scholar] [CrossRef]
  58. Bitzarakis, S.; Papadrakakis, M.; Kotsopulos, A. Parallel solution techniques in computational structural mechanics. Comput. Methods Appl. Mech. Eng. 1997, 148, 75–104. [Google Scholar] [CrossRef]
  59. Magoulès, F.; Iványi, P.; Topping, B.H.V. Non-overlapping Schwarz methods with optimized transmission conditions for the Helmholtz equation. Comput. Methods Appl. Mech. Eng. 2004, 193, 4797–4818. [Google Scholar] [CrossRef]
  60. Karypis, G.; Kumar, V. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Dist. Com. 1998, 48, 96–129. [Google Scholar] [CrossRef]
Figure 1. Multi-GPU node architecture.
Figure 2. Multi-GPU system with four Nvidia TITAN V.
Figure 3. Simply supported beam experiment with a hole: (a) geometric configuration and boundary conditions, and (b) mesh parameterization and partitioning into four subdomains.
Figure 4. Single-arch dam experiment: (a) geometric configuration, (b) boundary conditions, and (c) mesh parameterization and partitioning into four subdomains.
Figure 5. L-shaped cantilever experiment: (a) geometric configuration, (b) boundary conditions, and (c) mesh parameterization and partitioning into four subdomains.
Figure 6. Simply supported beam experiment: (a) wall-clock time, (b) device memory, and speedup from (c) one and (d) eight MPI processes.
Figure 7. Single-arch dam experiment: (a) wall-clock time, (b) device memory, and speedup from (c) one and (d) eight MPI processes.
Figure 8. L-shaped cantilever experiment: (a) wall-clock time, (b) device memory, and speedup from (c) one and (d) eight MPI processes.
Figure 9. Wall-clock time of setup and solving stages using GPU computing.
Table 1. GPU specifications of the device used in the benchmarks.
GPU Model | CUDA Cores | Processor Clock (MHz) | Memory Bandwidth (GB/s) | Memory (MB)
Titan V   | 5120       | 1455                  | 652.8                   | 12,288
Table 2. Geometric parameters and materials for the experiments.
Simply supported beam. Geometry (m): L = 6, H = 1. Material: ν = 0.3, E = 69.0 GPa.
Single-arch dam. Geometry (m): L = 50, H = 10, t1 = 4, t2 = 1, d = 5. Material: ν = 0.2, E = 30.0 GPa.
L-shaped cantilever beam. Geometry (m): L1 = 0.2, L2 = 0.5, H1 = 0.15, H2 = 0.05, R = 0.3, r = 0.1. Material: ν = 0.26, E = 200 GPa.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
