Article

Enabling Parallel Performance and Portability of Solid Mechanics Simulations Across CPU and GPU Architectures

by Nathaniel Morgan 1,*, Caleb Yenusah 2, Adrian Diaz 3, Daniel Dunning 1, Jacob Moore 3,†, Erin Heilman 3, Evan Lieberman 3, Steven Walton 2, Sarah Brown 1, Daniel Holladay 4, Russell Marki 3,‡, Robert Robey 3,§ and Marko Knezevic 5

1 Engineering Technology & Design Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
2 Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
3 Computational Physics Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
4 Computer, Computational & Statistical Sciences Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
5 Department of Mechanical Engineering, University of New Hampshire, Durham, NH 03824, USA
* Author to whom correspondence should be addressed.
† Now a research professor at Mississippi State University.
‡ Graduate student at the University of New Hampshire.
§ Now working at AMD Corporation.
Information 2024, 15(11), 716; https://doi.org/10.3390/info15110716
Submission received: 18 September 2024 / Revised: 21 October 2024 / Accepted: 2 November 2024 / Published: 7 November 2024
(This article belongs to the Special Issue Advances in High Performance Computing and Scalable Software)

Abstract

Efficiently simulating solid mechanics is vital across various engineering applications. As constitutive models grow more complex and simulations scale up in size, harnessing the capabilities of modern computer architectures has become essential for achieving timely results. This paper presents advancements in running parallel simulations of solid mechanics on multi-core CPUs and GPUs using a single-code implementation. This portability is made possible by the C++ matrix and array (MATAR) library, which interfaces with the C++ Kokkos library, enabling the selection of fine-grained parallelism backends (e.g., CUDA, HIP, OpenMP, pthreads, etc.) at compile time. MATAR simplifies the transition from Fortran to C++ and Kokkos, making it easier to modernize legacy solid mechanics codes. We applied this approach to modernize a suite of constitutive models and to demonstrate substantial performance improvements across different computer architectures. This paper includes comparative performance studies using multi-core CPUs along with AMD and NVIDIA GPUs. Results are presented using a hypoelastic–plastic model, a crystal plasticity model, and the viscoplastic self-consistent generalized material model (VPSC-GMM). The results underscore the potential of using the MATAR library and modern computer architectures to accelerate solid mechanics simulations.

1. Introduction

The computer hardware industry is embracing architectural innovations, such as large multi-core CPUs, referred to as homogeneous architectures, and integrating GPUs with multi-core CPUs, known as heterogeneous architectures. These innovations are crucial for sustaining performance growth in computing as they allow for greater parallel processing capabilities. In particular, GPUs can handle massive computational workloads. In the case of exascale supercomputers like El Capitan, Frontier, and Aurora, GPUs are integral to achieving unprecedented performance levels. By increasing the number of cores on a CPU, or by combining the strengths of a multi-core CPU and GPUs, these modern architectures provide a path forward in the era of slowed transistor scaling.
To fully utilize modern computer architectures, fine-grained parallelism is essential, but this poses significant challenges for software developers. Heterogeneous architectures often rely on vendor-specific programming languages, such as CUDA, HIP, or SYCL, which differ from those used for fine-grained parallelism in homogeneous architectures (CPU only), like OpenMP or pthreads. This variation in parallelization languages and approaches complicates the development and maintenance of portable software. To address these challenges, performance portability libraries like Kokkos [1] have been developed, enabling a single-code base to run efficiently across both homogeneous and heterogeneous architectures. The C++ Kokkos library achieves this portability by offering common C++ interfaces to various hardware-specific or vendor-specific backends, including CUDA, HIP, SYCL, OpenMP, and pthreads.
Kokkos takes a different approach to portability compared to compiler-based approaches like OpenACC. In OpenACC, developers add directives that instruct the compiler to perform tasks such as parallelizing loops. Although OpenACC is intended to be portable across multi-core CPUs and GPUs, and major compilers (including the NVIDIA HPC SDK compiler and newer versions of the GCC compiler) offer support for it, the extent and efficiency of that support currently varies. This inconsistency in OpenACC support can restrict compiler options and lead to portability challenges. In contrast, Kokkos does not depend on compiler support for directives such as those of OpenACC. Written in C++, Kokkos provides a common interface to leverage hardware-specific languages for performance portability. A thorough review of performance portability tools and their relative performance was conducted by Deakin et al. [2].
To aid the creation of performance-portable software, the open-source C++ matrix and array (MATAR) library [3] was created to serve as a higher-level application programming interface (API) to Kokkos (MATAR is available at https://github.com/lanl/MATAR, accessed on 21 October 2024). MATAR reduces the complexity of using Kokkos, providing an accessible and user-friendly syntax for both novice and expert programmers in Fortran, Python, and C/C++ [4]. Furthermore, the MATAR library offers a straightforward Fortran-like syntax and contains unique data types that allow developers to readily transition existing Fortran codes to C++ with Kokkos for portability across multi-core CPUs and GPUs. The MATAR library is a header-only library, facilitating easy integration with software, and it is compatible with all major C++ compilers.
The MATAR library is important to the computational mechanics community as many codes in the field, including many high-fidelity crystal plasticity codes, are written in Fortran. These codes are widely used by the engineering community, but their runtimes can be prohibitive for large problems. This work leverages unique features in the MATAR library to enable fine-grained parallel simulations on disparate computer architectures with hypoelastic–plastic strength models, a single-crystal plasticity constitutive model, and the viscoplastic self-consistent generalized material model (VPSC-GMM). This allows the engineering community to perform high-fidelity simulations of large-scale applications that account for the microstructure of a given material. The particular version of the crystal plasticity code (VPSC-GMM) was developed in a previous work [5], and details about the model formulation can be found there. This work focuses on modernizing a Fortran code implementation of VPSC-GMM for performance portability. Multiple application case studies are presented in this paper to illustrate the advantages and versatility of using the C++ MATAR library to readily modernize existing Fortran codes to be performant and portable across multi-core CPUs and GPUs.
The open-source C++ Fierro mechanics code [6], which was a 2024 R&D 100 award winner, is built on the MATAR library. Fierro contains Lagrangian methods to solve the governing equations for statics or quasi-statics [7,8], compressible solid dynamics [9,10,11,12,13,14,15], and micromechanical models [16,17]. The Lagrangian methods for compressible solid dynamics are well suited for high-fidelity simulations of deforming materials. In this paper, we present a suite of test problems using the Fierro mechanics code with user-defined constitutive models (see Figure 1).
This paper’s originality lies in demonstrating and documenting the parallel performance improvements of solid mechanics simulations using MATAR across multi-core CPUs, as well as NVIDIA and AMD GPUs. It includes runtime studies of a hypoelastic–plastic model and a crystal plasticity model on various GPUs and, to the best of our knowledge, presents the first runtime analyses of the VPSC-GMM code on GPUs. By comparing runtimes across multi-core CPUs and different GPUs with a single-code implementation, this paper addresses gaps in the existing literature.
The layout of this paper is as follows. An overview of solid mechanics is presented in Section 2. An overview of the MATAR library data types along with examples is given in Section 3. The Fierro mechanics code is discussed in Section 4. A suite of test cases and runtime results are presented in Section 5. Concluding remarks and a summary of our findings are given in Section 6.

2. Solid Mechanics and Strategies for Acceleration

To solve the governing momentum equation in solid mechanics, a constitutive law describing the material behavior under the action of applied deformation is needed. The sought solution for stress–strain measures is commonly obtained numerically using methods such as the finite element (FE) method [18], and the accuracy of the simulation depends on the accuracy of the selected constitutive law. Constitutive laws based on crystal plasticity theory are highly desirable for improving the accuracy of numerical simulations of material behavior under complex loading in metal forming and various service conditions. Crystal plasticity models are more accurate than continuum-scale phenomenological models because they account for crystallography and capture the evolution of microstructure and texture during deformation. These models are multi-scale in nature because they link the local grain-level mechanical response to the response of a polycrystalline aggregate. To this end, various homogenization schemes have been developed, including self-consistent (SC) [19,20,21,22], crystal plasticity finite element (CPFE) [23,24,25], Taylor-type [26,27], and Green’s function-based elasto-visco-plastic fast Fourier transform (EVPFFT) models [28,29,30]. While the spatially resolved CPFE and EVPFFT models are used for more detailed simulations accounting for grain-to-grain interactions, the SC and Taylor-type models are computationally efficient and have proven effective in predicting the flow response and evolution of texture in polycrystals. The SC and Taylor-type models have been coupled to serve as constitutive laws in FE simulations [5,31,32,33], as have EVPFFT models [34]. In these simulations, the spatial variations in deformation across the FE mesh relax the intrinsic homogenization assumptions, improving the accuracy while preserving the efficiency. Nevertheless, these models are yet to be adopted by the metal forming community because of the prohibitive computational effort and time involved in such simulations. Crystal plasticity simulations at the component level are demanding because of the need to consider many physical details at multiple length and temporal scales. These computational challenges have been emphasized in [35,36], and, clearly, significant speedups are required to render simulations involving crystal plasticity constitutive laws practical.
Strategies are being explored to accelerate crystal plasticity calculations. Database approaches that store the main attributes of crystal plasticity solutions in the form of spectral coefficients were described in [37,38,39,40,41,42]. The spectral methods have also been used for the efficient representation of texture [43,44] and material properties [45]. A process plane concept, based on proper orthogonal decompositions in Rodrigues–Frank space, was presented in [46]. Other attempts to improve the efficiency of crystal plasticity (CP) simulations involve adaptive sampling algorithms and building a database that constantly updates itself [35,47]. It has been shown that solving crystal plasticity nonlinear equations using the Jacobian-Free Newton–Krylov (JFNK) technique in place of Newton–Raphson’s method can yield some computational benefits [48]. Efforts invested in developing efficient numerical schemes have yielded a noticeable acceleration of the relevant simulations. Additional benefits can come from high-performance computational platforms and their effective use. GPUs can perform parallel floating-point computations much faster than traditional CPUs and have revolutionized high-performance computing [52,53]. To this end, implementations of crystal plasticity models have been developed to take advantage of GPUs [49,50,51], but past work was not portable across multiple types of GPUs. This motivates creating a new and portable approach based on the MATAR library.

3. Modernizing Fortran Solid Mechanics Codes

A central component of this work is applying a novel approach (based on MATAR) to readily convert Fortran solid mechanics models to C++ and have the code run in parallel across CPUs and GPUs. This section presents an overview of the MATAR data types along with examples illustrating how to use MATAR. The examples in this section show performant, portable code implementations alongside their Fortran counterparts, illustrating that the MATAR library offers a simple and intuitive coding syntax, which is key to gaining adoption by Fortran programmers who may not be familiar with C++ and/or fine-grained parallelism languages, including CUDA, HIP, SYCL, OpenMP, and pthreads.

3.1. Naming Conventions in MATAR

MATAR arrays use indices that start at 0, while MATAR matrices use indices that start at 1, where the latter index convention aligns with the Fortran language convention. For multi-dimensional data layouts, MATAR offers data types that can be optimally accessed using row major (the C language convention) or column major (the Fortran language convention). An array or matrix data type with the letter C, as in CArray or CMatrix, denotes a memory layout following the C/C++ language convention, while a data type with the letter F, as in FArray or FMatrix, denotes a memory layout following the Fortran language convention.
The array and matrix data types that allocate memory on the device will include Kokkos at the end of the name (as in CArrayKokkos or FMatrixKokkos); otherwise, the array or matrix data are allocated on the CPU (termed host). For the case of needing memory allocated on both the CPU and GPU, dual-memory data types are offered in MATAR. DCArrayKokkos and DFMatrixKokkos are examples of dual-memory data types. In this work, the FMatrixKokkos data type is principally used to migrate Fortran codes to C++ with Kokkos. MATAR also supports a large range of data types not discussed here, including sparse and ragged arrays implemented with Kokkos for portability across architectures.
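As a brief illustration of these conventions, the sketch below declares each class of data type; the sizes are arbitrary, and the templated constructor calls are assumed to follow the MATAR documentation rather than being taken from the paper's own code.

```cpp
// A short sketch of the naming conventions above (sizes are arbitrary;
// constructor signatures are assumed to follow the MATAR documentation).
CArray<double>        a_host(10, 10);  // host memory, 0-based indices, row-major (C convention)
FMatrix<double>       m_host(10, 10);  // host memory, 1-based indices, column-major (Fortran convention)
CArrayKokkos<double>  a_dev(10, 10);   // device memory, 0-based indices, row-major
FMatrixKokkos<double> m_dev(10, 10);   // device memory, 1-based indices, column-major
DCArrayKokkos<double> a_dual(10, 10);  // dual type: memory allocated on both the host and the device

a_host(0, 0) = 1.0;  // arrays index from 0
m_host(1, 1) = 1.0;  // matrices index from 1
```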
MATAR supports diverse types of parallel loops. The FOR_ALL loop syntax is a parallel and portable version of a C/C++ for loop. The DO_ALL loop syntax is a parallel and portable version of a Fortran DO loop. The DO_ALL loop is particularly useful when converting Fortran code to C++. There are also parallel reduction loops to find a maximum or minimum, or to sum the values in an array or matrix; examples are DO_REDUCE_SUM and DO_REDUCE_MAX. Tightly nested parallelism (also called hierarchical parallelism) is also supported by MATAR. Nested parallelism covers the case where a parallel loop depends on an outer parallel loop or loops. The MATAR syntax for nested parallelism uses the words FIRST, SECOND, and THIRD in the parallel loop names, which, for example, target the GPU team, thread, and vector levels of parallelism. The nested parallel loops, like the other parallel loops in MATAR, work across multi-core CPUs and GPUs. Applications for using tightly nested parallelism include matrix–matrix multiplies, matrix–vector multiplies, and accessing ragged data storage.
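As a sketch of the reduction syntax mentioned above, the loop below sums the entries of a 1D matrix. The argument order assumed here (loop index, inclusive bounds, a per-thread accumulator, the loop body, and the result variable) is our assumption and should be checked against the MATAR documentation.

```cpp
// A hedged sketch of a parallel sum reduction with MATAR; "mat" and "N" are
// placeholder names, and the macro argument order is assumed, not verified.
double total = 0.0;
DO_REDUCE_SUM(i, 1, N,
              local_sum, {
    local_sum += mat(i);
}, total);
```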

3.2. MATAR Implementation Examples

The first example shows how to convert a Fortran code for matrix addition to C++ with MATAR, which then runs in parallel across computer architectures. The code for the first example is shown in Listing 1. The second example is a matrix–vector multiply, the code for which is shown in Listing 2. As illustrated in these examples, the syntax offered by MATAR has similarities to Fortran, reducing the burden of modernizing solid mechanics codes written in Fortran. In addition, the burden can be further reduced by leveraging software tools designed to automatically convert Fortran to C/C++, including F2C (converts F77 to C) [54], FABLE [55], and, more recently, AI-based options [56]. These conversion tools can be used in partnership with the MATAR library to modernize a Fortran code for performance portability.
Listing 1. The MATAR programming syntax is compared to Fortran programming syntax for matrix allocation and addition. The C++ coding syntax with MATAR has similarities to the Fortran language. In (a), three 2D matrices are allocated on the device—a multi-core CPU or the GPU—depending on the Kokkos backend used. The contents inside the DO_ALL loop, as shown in (b), are executed in parallel on the device. The Fortran coding, which is serial, is shown in (c,d).
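Because the published listing is rendered as an image, a minimal sketch of the MATAR side of Listing 1 is reproduced below. The variable names, matrix sizes, the mtr namespace, and the inclusive Fortran-style bounds of DO_ALL are our assumptions rather than the published code.

```cpp
// A minimal sketch of the MATAR portion of Listing 1 (parts (a) and (b)).
// Names, sizes, the namespace, and the DO_ALL bounds convention are assumed;
// see the MATAR repository for the exact syntax.
#include "matar.h"

using namespace mtr;

int main() {
    Kokkos::initialize();
    {
        const int N = 1000;

        // (a) three 2D matrices allocated on the device (multi-core CPU or GPU,
        //     depending on the Kokkos backend selected at compile time)
        FMatrixKokkos<double> A(N, N);
        FMatrixKokkos<double> B(N, N);
        FMatrixKokkos<double> C(N, N);

        // (initialization of A and B omitted for brevity)

        // (b) the matrix addition runs in parallel on the device; DO_ALL mirrors
        //     a Fortran DO loop with 1-based, inclusive bounds
        DO_ALL(j, 1, N,
               i, 1, N, {
            C(i, j) = A(i, j) + B(i, j);
        });
        Kokkos::fence();  // wait for the device kernel to finish
    }
    Kokkos::finalize();
    return 0;
}
```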
Listing 2. The MATAR programming syntax is compared to Fortran programming syntax for a matrix–vector multiply. In (a), two vectors and a matrix are allocated on the device (e.g., a GPU). The contents inside the DO_ALL loop, as shown in (b), are executed in parallel on the device. Nested parallelism was used in this example. An alternative parallel implementation would place a serial loop inside a parallel DO_ALL loop, where the serial loop performs the accumulation over the row entries. The Fortran coding, which is serial, is shown in (c,d).
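The alternative implementation mentioned in the caption can be sketched as follows; the same scaffolding (header, namespace, and Kokkos initialization) as in the previous sketch is assumed, and the names and bounds are again ours rather than the published listing.

```cpp
// A sketch of the alternative implementation described in the caption of
// Listing 2: a parallel DO_ALL loop over the rows with a serial inner loop
// performing the accumulation. Scaffolding and names follow the previous sketch.

// (a) a matrix and two vectors allocated on the device
FMatrixKokkos<double> A(N, N);
FMatrixKokkos<double> x(N);
FMatrixKokkos<double> y(N);

// (initialization of A and x omitted for brevity)

// (b) y = A*x: one parallel iteration per row; the serial inner loop
//     accumulates the product over the row entries
DO_ALL(i, 1, N, {
    double sum = 0.0;
    for (int j = 1; j <= N; j++) {
        sum += A(i, j) * x(j);
    }
    y(i) = sum;
});
Kokkos::fence();
```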

4. Fierro Mechanics Code

The open-source C++ Fierro mechanics code uses the MATAR library for productivity, performance, and portability across computer architectures. The Fierro mechanics code contains 2D-RZ axisymmetric and 3D-Cartesian, Lagrangian lumped-mass FE hydrodynamic methods [9,57] for gas and solid dynamics that run in parallel on the device, i.e., the GPU in the case of using the HIP or CUDA backends in Kokkos, or the CPU in the case of using the OpenMP or pthreads backends. For optimal parallel performance, nearly every loop associated with these explicit Lagrangian FE hydrodynamic methods runs in parallel on the device; the sole exceptions are a few setup routines that run when the simulation starts and the routines that write output files (e.g., a graphics file).
The Lagrangian FE hydrodynamic methods in the Fierro code support a gamma-law gas equation of state (EOS) and have an interface to support user-defined EOSs, as well as user-defined hypoelastic–plastic or hyperelastic–plastic strength models, including crystal plasticity models. The user-defined material model interface in the Fierro code follows (but still differs somewhat from) the one used by the commercial Abaqus™ FE code. The simulation results and runtime studies presented in this paper couple user-defined high-fidelity material models to the Fierro mechanics code.
The fine-grained parallelism approach used with the user-defined models in this work is to parallelize the loop over all elements of the mesh that calls a constitutive model (e.g., the VPSC-GMM function discussed later), as shown in Listing 3. The user-defined material model function for each element will be executed in parallel on the device. As shown in Listing 3, MATAR offers a simple parallel loop syntax to use the Kokkos library for fine-grained parallelism.
Listing 3. In this work, the user-defined material model (e.g., VPSC-GMM implementation) in Fierro is called inside a parallel loop over all the elements in the mesh. The parallel for loop syntax with the MATAR library is FOR_ALL. The coding shown here will run in parallel (via the Kokkos library) on a multi-core CPU using OpenMP or pthreads, and it will run in parallel on a GPU using CUDA for NVIDIA hardware or HIP for AMD hardware. Additional Kokkos backends are available for more fine-grained parallelism than mentioned here.
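Since Listing 3 is likewise rendered as an image, the pattern it describes can be sketched as follows; the function name and its arguments are placeholders and do not reflect Fierro's actual user-defined material model interface.

```cpp
// A sketch of the pattern in Listing 3: the user-defined material model is
// called inside a FOR_ALL parallel loop over every element of the mesh. The
// function name and arguments are placeholders, not Fierro's actual interface.
FOR_ALL(elem_gid, 0, num_elems, {
    // each element's constitutive update (e.g., the VPSC-GMM function)
    // executes as one parallel work item on the device
    user_material_model(elem_stress, elem_state_vars, elem_gid, dt);
});
Kokkos::fence();
```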

5. Test Cases

For the presented test cases, a single-code implementation was executed on both multi-core CPUs and GPUs. Runtime studies were conducted using various constitutive models applied to solid mechanics problems, showcasing the MATAR library’s utility and quantifying performance gains on modern computer architectures. Each test was intentionally run on diverse CPUs and GPUs to highlight runtime variations across hardware (see Table 1). For the simulations with NVIDIA GPUs, we used the GCC 9 compiler with CUDA 10. For the simulations on an AMD GPU, we used the Clang 13 compiler and ROCm 6. For the OpenMP simulations, we used the GCC 9 compiler.
Comparative plots are provided in this section showing the speedup relative to a serial calculation and relative to a run using many CPU cores. The latter comparison is particularly helpful for demonstrating the benefits of using a GPU versus a multi-core CPU. Additionally, to facilitate comparisons, the speedup plots consistently use the same color for each hardware type across all test cases.

5.1. Isotropic Hypoelastic–Plastic Model

The first runtime test was a high-speed metal rod impact test case. In this test, a cylindrical specimen with a radius of 3.8 cm and a length of 38.0 cm traveled at a uniform velocity and impacted a rigid wall. The material in this test was aluminum, the rod velocity was 150 m/s, and the final time was 120 μs. Simulations were performed in 2D axisymmetric cylindrical coordinates and 3D Cartesian coordinates. The simulations used a Mie–Grüneisen EOS with a hypoelastic–plastic strength model for the deviatoric stress tensor.
The simulations were performed on an HPC machine with multi-core CPUs and three different GPUs. The hardware was a Haswell multi-core CPU, a Tesla V100 GPU, an NVIDIA A100 GPU, and a Quadro RTX GPU. These studies used the various backends inside Kokkos for on-node parallelism (e.g., OpenMP and CUDA). The simulation wall clock times were measured using increasing mesh sizes of 5 × 52, 10 × 104, 20 × 208, and 40 × 416, where the first value is the number of elements in the cross-section and the second value is the number of elements in the vertical direction. These resolutions equate to 3D meshes with 1802 elements, 12,075 elements, 87,989 elements, and 670,953 elements.
This test shows that GPU architectures can significantly accelerate simulations with 2D cylindrical coordinate meshes and 3D Cartesian coordinate meshes, as seen in Figure 2 and Figure 3. As the mesh resolution increased, the GPUs performed better than the parallel CPU runs, especially the NVIDIA A100 GPU. These runtime results show that GPUs perform well on simulations with a high computational workload (e.g., a larger mesh). When looking at the overall speedup, depending on the mesh resolution, the simulations ran up to 180 times faster than serial and 13 times faster than running in parallel on 20 cores for a 3D mesh with 40 elements in the cross-section by 416 elements in the vertical direction. For a 2D axisymmetric mesh, the overall speedup was 35 times faster than serial and 2.5 times faster than running in parallel on 20 cores for a 40 × 416 mesh. The 3D and 2D speedup results using GPUs are shown in Figure 4 and Figure 5. The next subsection will present the runtime results on multi-core CPUs and GPUs with a crystal plasticity model coupled to the Fierro mechanics code.

5.2. Crystal Plasticity Model

This runtime test was performed on a metal rod impact test using a more complex, higher-fidelity constitutive model instead of the isotropic model discussed above. An elasto-viscoplastic, single-crystal plasticity model [16,28] was adapted and coupled to the Lagrangian FE hydrodynamic method in the Fierro mechanics code. These types of models simulate the behavior of individual crystals of metal and their interaction with each other during deformation. The model was calibrated for copper, and all of the simulations used a single-crystal orientation that was aligned with the Cartesian axes. The rod had an initial velocity of 150 m/s, and the simulation was run to a final time of 10 μs. The rod had a 3 cm radius and a 12 cm length. These simulations used a 3D mesh with 20 elements along each edge of the rod by 68 elements in the vertical direction, creating a mesh with 26,112 elements. The simulations were performed on an HPC machine with multi-core CPUs and two different GPUs. The hardware was an IBM Power9 20-core CPU, a Tesla V100 GPU, and an NVIDIA A100 GPU. These studies used the OpenMP and CUDA Kokkos backends. As a reminder, the code implementation was exactly the same between the tests; the only thing that changed was the hardware targeted by the compilation, which was set by a single compile flag.
In addition to comparing the simulation wall clock times between different hardware, this test also compared two different element orderings to show the effect they have on load balancing. The result of the single-crystal simulation is shown in Figure 6; these results show that the Taylor anvil impact test has a large variance in processing work along the height of the rod, which can create load balancing issues if the mesh is partitioned to the cores along that axis. Thus, to test the mesh numbering effects, two approaches for ordering the elements (i.e., assigning element indices) were considered: ordering within the 2D cross-section (XY plane) first and ordering along the height (Z axis) first.
This test shows that GPU architectures can significantly accelerate crystal plasticity simulations on 3D meshes, as demonstrated in Figure 7. For the overall speedup, the simulations on the NVIDIA A100 GPU ran up to 200 times faster than serial and 14 times faster than when using 16 CPU cores. The Z axis element ordering sped up the multi-core CPU runs by over a factor of 2 compared to the XY plane element ordering: the 8 CPU core runtimes went from 3.25 to 7.25 times faster than serial, and the 16 CPU core runtimes went from 6.5 to 14.25 times faster than serial, whereas the same change in element ordering only slightly slowed down the GPU runs.
Multi-core CPUs have far fewer compute cores and threads compared to a GPU, potentially giving rise to runtime variations when the computational work varies across the mesh. For this test case, the element ordering influenced the runtimes on multi-core CPUs because the majority of the computational work was near the base of the rod where the elements deformed from impacting the wall. Away from the impact region, minimal computational work was performed. The crystal plasticity material model used an implicit Newton–Raphson solver that may require many iterations to reach convergence for a rapidly deforming element. However, away from the impact region, only a few iterations (or none at all) were required to reach convergence as the elements were not deforming much. If a parallel loop walks over the cross-section of the rod first (the XY plane) and then moves vertically (the Z axis), then most of the computational work is executed on a limited number of threads with a multi-core CPU. By numbering the element indices of the mesh so that they vary vertically first and then horizontally, the computational work is more uniformly spread across all the threads running on a multi-core CPU, thus giving better runtimes and scaling results.
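To make the ordering concrete, the sketch below shows two hypothetical index maps for a structured nx × ny × nz rod mesh (the names and layout are ours, not Fierro's). With a static partition of the element loop across CPU threads, the XY-plane-first map hands each thread a thin horizontal slab of the rod, whereas the Z-axis-first map gives every thread elements spanning the full height, spreading the expensive near-impact elements across all threads.

```cpp
// Hypothetical index maps for a structured nx x ny x nz rod mesh (illustration
// only; not taken from the Fierro source). A static partition of the element
// loop gives each CPU thread a contiguous block of indices, so the ordering
// decides whether the work-heavy elements near the impact face land on a few
// threads or are shared by all of them.
#include <cstddef>

// XY-plane-first: consecutive indices sweep a full cross-section before moving
// up, so a contiguous block of indices is a thin horizontal slab of the rod.
inline size_t elem_id_xy_first(size_t i, size_t j, size_t k,
                               size_t nx, size_t ny) {
    return i + nx * (j + ny * k);
}

// Z-axis-first: consecutive indices sweep a vertical column first, so a
// contiguous block of indices samples elements along the whole height of the
// rod, mixing heavily and lightly loaded elements on every thread.
inline size_t elem_id_z_first(size_t i, size_t j, size_t k,
                              size_t ny, size_t nz) {
    return k + nz * (j + ny * i);
}
```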

5.3. VPSC-GMM

MATAR was used to convert an approximately 20,000-line Fortran code for a scale-bridging solid mechanics model called the viscoplastic self-consistent generalized material model (VPSC-GMM) to C++. MATAR allowed minimal changes to the original Fortran code, with the bulk of the work being refactoring module variables. As a result, the performance gains from parallelism were achieved with minimal development time. There was also no additional effort needed to port the code to multiple architectures because the portability was handled by MATAR. Several studies were performed to quantify the performance portability of the VPSC-GMM code.

5.3.1. Stand-Alone VPSC

One common use of the VPSC model is to calculate the homogeneous response of a polycrystalline material to a prescribed strain. These calculations are relevant to researchers who seek to build homogeneous constitutive models for polycrystalline materials by running a suite of prescribed strains for various microstructures. As such, the first test case ran many instances of the VPSC model for an applied strain to a unit cell in parallel on multi-core CPUs and on GPUs.
The material used in the homogenization test case was tantalum, with the single-crystal plasticity model using the {110} and {112} slip systems. The simulations incorporated strain hardening and temperature softening [58], enhancing their accuracy compared to previous work [5]. This test case examined the runtime performance of simulations with differing numbers of VPSC instances and weighted grain orientations to represent a rolled texture. Initially, simulations with 30 grains were used to assess the runtime performance of the stand-alone VPSC model based on the number of VPSC instances. The second component of this test case was used to explore the runtime performance for 30, 100, 200, and 500 grains with a fixed number of VPSC instances.
As a starting place, we sought to understand the scaling of the C++ VPSC model as a function of the number of instances with 30 grains per instance. The scaling is shown in Figure 8 for a Power9 CPU that has 20 cores and an NVIDIA V100 GPU. In that plot, the V100 GPU performed well compared to the Power9 CPU only when there were many instances of the VPSC model (i.e., when there was a significant computational workload). The V100 GPU started delivering a speedup of the simulation at around 1000 instances of the VPSC model. For the case of a few VPSC instances, improved speedups might be possible on the GPU by adding fine-grained parallelism within the VPSC model. For homogenization calculations, the need to run more than 1000 instances of VPSC is quite reasonable; thus, GPUs have value for this application. For the case of running a few instances of VPSC, the portability offered by MATAR allows users to run parallel simulations on multi-core CPUs, which offer the best runtime performance for that case. As such, this study highlights the merits of a portable code, allowing users to choose the best computer architecture for their application.
Next, we compared the parallel runtimes on five computer architectures (covering multi-core CPUs and GPUs) using 16,384 cells with a varying number of grains per cell. The computer architectures used were as follows: an Intel Haswell CPU with 20 cores, an IBM Power9 CPU with 20 cores, an AMD MI50 Vega GPU (a prior-generation GPU), an NVIDIA V100 GPU, and the current-generation NVIDIA A100 GPU. The results are shown in Figure 9. The A100 GPU was the top performer by a large margin, giving accelerations of 7× to 11× over the 20-core Intel Haswell CPU. Varying the number of grains increased the memory and the amount of work performed in each VPSC instance, and that translated to variations in speedups. The AMD MI50 GPU was slightly slower than the 20-core Intel Haswell CPU in all simulations. Future studies can explore other AMD GPUs.

5.3.2. VPSC-GMM Code Coupled to the Fierro Code

The C++ VPSC-GMM code was coupled to the Fierro mechanics code via the user-defined model interface, and it was then used to simulate a 3D Taylor anvil impact test of polycrystalline tantalum to demonstrate the utility of the developed simulation framework. The tested material underwent a wide range of strain and strain rate deformation. The measured deformation after the test was used to compare with the simulation predictions to validate the constitutive models for a range of strains and strain rates [5]. Consistent with a test in a prior paper [5], the impact test was simulated using a rod traveling at a velocity of 175 m/s. Unique to this work, we implemented strain hardening and temperature softening [58], which were previously assumed to cancel each other out. The material of the rod was tantalum with an initial texture consisting of 30 weighted crystal orientations representing a rolled texture. These were embedded at each integration point (i.e., element center). The single-crystal plasticity model of the tantalum used the {110} and {112} slip systems and 30 grains, sharing similarities with the stand-alone VPSC simulations detailed in the prior subsection. The cylinder was 38.1 mm long and had a diameter of 7.62 mm. Due to the orthotropic symmetry of the specimen, a quarter of the cylinder was simulated. The rod was discretized with 5148 hexahedral elements. The simulated results differed from [5] because we added strain hardening, which allowed a “foot” to develop, as shown in Figure 10. Nevertheless, these results were of secondary importance to demonstrating that MATAR can readily be adopted to modernize a code written in Fortran.
The speedup comparisons of the coupled VPSC-GMM-Fierro code are shown in Figure 11. From Figure 11a, it can be observed that using the linear extrapolation scheme for elements with little accumulated strain is beneficial for the serial code because less time is spent in the more computationally expensive VPSC solution. However, this causes severe thread divergence on the GPUs, resulting in a slowdown of the code on both the V100 and A100 GPUs by approximately 3×. When the linear extrapolation scheme is turned off, and all elements perform a full VPSC solve, a considerable speedup on the GPUs is observed, as shown in Figure 11b, where a speedup of 3.2× is observed on the A100 compared to a Broadwell CPU with 32 cores. This speedup on the GPUs occurs because the thread divergence created by the linear extrapolation scheme is avoided and all the threads on the GPU perform the same work, which is beneficial for the GPU architecture.
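The divergence can be pictured with the hypothetical branch below (the names and the strain threshold are ours, not the VPSC-GMM source): GPU threads in the same warp that take different branches are serialized, so a warp runs at the speed of its slowest member, whereas forcing every element through the full VPSC solve keeps the work uniform across threads.

```cpp
// Hypothetical sketch of the branch that creates thread divergence (names and
// threshold are illustrative, not the VPSC-GMM source). On a GPU, threads in a
// warp that take different branches are serialized, so mixing a cheap
// extrapolation path with an expensive VPSC solve hurts GPU throughput even
// though it helps a serial CPU run.
FOR_ALL(elem_gid, 0, num_elems, {
    if (accumulated_strain(elem_gid) < strain_increment_threshold) {
        // cheap path: linearly extrapolate the stress from the last converged state
        linear_extrapolate_stress(elem_gid);
    }
    else {
        // expensive path: full self-consistent VPSC solve for this element
        vpsc_solve(elem_gid);
    }
});
```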
In a prior paper [5], the Fortran version of the VPSC-GMM code was coupled to a C++ FE code. The combination yielded reasonable strong scaling on multi-core CPUs with OpenMP by looping over the elements in a unique pattern to ensure the computational work was more evenly spread over the cores. That striding approach works reasonably well on multi-core CPUs but is specialized to the mesh and the Taylor anvil test case. The striding approach had a negligible improvement to the runtimes on a GPU. A merit of the performance portability demonstrated here is the flexibility to choose the optimal computer architecture for a particular model.

6. Conclusions

In this work, we conducted a series of tests to demonstrate the benefits of using the MATAR library for performance portability of solid mechanics codes across diverse computer architectures. The test cases were metal rods impacting a rigid wall, a Taylor anvil impact test, and polycrystalline homogenization calculations. For the impact and Taylor anvil tests, the simulations used Lagrangian FE hydrodynamic methods implemented in the open-source C++ Fierro mechanics code, which was built using MATAR. Those simulations used three different constitutive models. The first model employed an analytic EOS to calculate the pressure and a hypoelastic–plastic model to determine the deviatoric stress. The second model was an elasto-viscoplastic, single-crystal plasticity model, which was quickly rewritten from Fortran to C++ using MATAR’s data types. The final material model was the multi-scale VPSC-GMM, which was also converted from Fortran to C++, leveraging the MATAR library. The polycrystalline homogenization calculations ran many instances of the VPSC-GMM, calculating a bulk-scale stress as a function of strain.
The 3D simulations using an EOS with a hypoelastic–plastic model saw over an 80× acceleration on a V100 GPU and close to a 180× acceleration on an A100 GPU when compared to a serial Haswell CPU calculation. The A100 GPU calculation was approximately 13× faster than using 20 cores on a Haswell CPU. The 2D axisymmetric simulations saw more modest gains: a V100 GPU and an A100 GPU delivered close to 31× and 34× accelerations, respectively, over a serial Haswell CPU calculation. The Taylor anvil impact simulations with an elasto-viscoplastic single-crystal plasticity model saw impressive accelerations on GPU architectures. An A100 GPU delivered over a 200× acceleration compared to a serial calculation on a Power9 CPU.
Lastly, the new implementation of the VPSC-GMM code was used to predict the bulk response of polycrystalline materials based on microstructure physics. We tested the VPSC-GMM code in several ways. The first way ran many instances of the VPSC-GMM in a stand-alone manner, which is commonly done to create data for building a bulk-scale constitutive model. For this application, we demonstrated that modern GPUs, such as the A100 GPU, can deliver 7× to 11× performance gains (depending on the number of grains and instances of the VPSC-GMM) over using 20 cores on a Haswell CPU. We also coupled the VPSC-GMM code to the Fierro mechanics code to run Taylor anvil simulations accounting for the material microstructure. We showed that a V100 GPU and an A100 GPU deliver 2.5× and 3.2× accelerations over 32 cores on a Broadwell CPU when using the VPSC-GMM in every element of the mesh without the linear extrapolation scheme. The VPSC-GMM was used in this way to prevent thread divergence. In contrast, when the linear extrapolation scheme was combined with the VPSC-GMM, neither the CPUs nor, especially, the GPUs performed well due to thread divergence. As such, this test case demonstrates that fine-grained parallelism can deliver favorable accelerations, but it has limitations. Future work can explore alternative implementations of the VPSC-GMM code that mitigate thread divergence.
The overarching conclusion from this work is that the MATAR library can benefit a wide range of solid mechanics applications by enabling fine-grained parallelism and portability across CPU and GPU architectures. The library can be used to write new C++ codes or to modernize existing Fortran codes for performance and portability. This latter capability is highly relevant to the solid mechanics community as many legacy codes in the field are written in Fortran.

Author Contributions

Conceptualization, N.M., C.Y., A.D., D.D., J.M., R.R. and M.K.; Methodology, N.M., C.Y., A.D., D.D., J.M., E.H., E.L., S.W., S.B., D.H., R.M., R.R. and M.K.; Software, N.M., C.Y., A.D., D.D., J.M., E.H., E.L., S.W., S.B., R.M. and R.R.; Validation, N.M., C.Y., A.D., E.L., S.B. and R.M.; Investigation, N.M., C.Y., E.H., E.L., R.M. and D.H.; Writing – original draft, N.M., E.H., E.L., D.H. and R.M.; Writing – review & editing, C.Y., A.D., D.D., J.M., S.W., S.B. and M.K.; Supervision, N.M., R.R. and M.K.; Project administration, N.M.; Funding acquisition, N.M. All authors have read and agreed to the published version of the manuscript.

Funding

We gratefully acknowledge the funding from the Laboratory Directed Research and Development (LDRD) program at Los Alamos National Laboratory (LANL). The Advanced Simulation and Computing (ASC) program also supported the development of the Fierro mechanics code and the MATAR library. This research used resources provided by the Darwin testbed at LANL, which is funded by the Computational Systems and Software Environments subprogram of LANL’s ASC program. LANL is operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA under contract number 89233218CNA000001.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The Los Alamos unlimited release number is LA-UR-22-20105.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Edwards, H.C.; Trott, C.; Sunderland, D. Kokkos. J. Parallel Distrib. Comput. 2014, 74, 3202–3216.
2. Deakin, T.; McIntosh-Smith, S.; Price, J.; Poenaru, A.; Atkinson, P.; Popa, C.; Salmon, J. Performance portability across diverse computer architectures. In Proceedings of the 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), Denver, CO, USA, 22 November 2019; pp. 1–13.
3. Dunning, D.J.; Morgan, N.R.; Moore, J.L.; Nelluvelil, E.; Tafolla, T.V.; Robey, R.W. MATAR: A Performance Portability and Productivity Implementation of Data-Oriented Design with Kokkos. J. Parallel Distrib. Comput. 2021, 157, 86–104.
4. Morgan, N.; Yenusah, C.; Diaz, A.; Dunning, D.; Moore, J.; Roth, C.; Lieberman, E.; Walton, S.; Brown, S.; Holladay, D.; et al. On A Simplified Approach to Achieve Parallel Performance and Portability Across CPU and GPU Architectures. Information 2024, accepted.
5. Zecevic, M.; Lebensohn, R.; Rogers, M.; Moore, J.; Chiravalle, V.; Lieberman, E.; Dunning, D.; Shipman, G.; Knezevic, M.; Morgan, N. Viscoplastic self-consistent formulation as generalized material model for solid mechanics applications. Appl. Eng. Sci. 2021, 6, 100040.
6. Morgan, N.; Moore, J.; Brown, S.; Chiravalle, V.; Diaz, A.; Dunning, D.; Lieberman, E.; Walton, S.; Welsh, K.; Yenusah, C.; et al. Fierro. 2021. Available online: https://github.com/LANL/Fierro (accessed on 21 October 2024).
7. Diaz, A.; Morgan, N.; Bernardin, J. A parallel multi-constraint topology optimization solver. In Proceedings of the ASME 2022 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference IDETC/CIE2022, St. Louis, MO, USA, 14–17 August 2022.
8. Diaz, A.; Morgan, N.; Bernardin, J. Parallel 3D topology optimization with multiple constraints and objectives. Optim. Eng. 2023, 25, 1531–1557.
9. Chiravalle, V.; Morgan, N. A 3D finite element ALE method using an approximate Riemann solution. Int. J. Numer. Methods Fluids 2016, 83, 642–663.
10. Burton, D.; Carney, T.; Morgan, N.; Sambasivan, S.; Shashkov, M. A Cell Centered Lagrangian Godunov-like method for solid dynamics. Comput. Fluids 2013, 83, 33–47.
11. Liu, X.; Morgan, N.; Burton, D. A high-order Lagrangian discontinuous Galerkin hydrodynamic method for quadratic cells using a subcell mesh stabilization scheme. J. Comput. Phys. 2019, 386, 110–157.
12. Liu, X.; Morgan, N.R.; Lieberman, E.J.; Burton, D.E. A fourth-order Lagrangian discontinuous Galerkin method using a hierarchical orthogonal basis on curvilinear grids. J. Comput. Appl. Math. 2022, 404, 113890.
13. Lieberman, E.; Liu, X.; Morgan, N.; Luscher, D.J.; Burton, D. A higher-order Lagrangian discontinuous Galerkin hydrodynamic method for solid dynamics. Comput. Methods Appl. Mech. Eng. 2019, 353, 467–490.
14. Abgrall, R.; Lipnikov, K.; Morgan, N.; Tokareva, S. Multidimensional staggered grid residual distribution scheme for Lagrangian hydrodynamics. SIAM J. Sci. Comput. 2020, 42, A343–A370.
15. Moore, J.; Morgan, N.; Horstemeyer, M. ELEMENTS: A high-order finite element library in C++. SoftwareX 2019, 10, 100257.
16. Yenusah, C.O.; Morgan, N.R.; Lebensohn, R.A.; Zecevic, M.; Knezevic, M. A parallel and performance portable implementation of a full-field crystal plasticity model. Comput. Phys. Commun. 2024, 300, 109190.
17. Zecevic, M.; Lebensohn, R.A.; Capolungo, L. New large-strain FFT-based formulation and its application to model strain localization in nano-metallic laminates and other strongly anisotropic crystalline materials. Mech. Mater. 2022, 166, 104208.
18. Bathe, K.J. Finite Element Procedures; Prentice Hall: Englewood Cliffs, NJ, USA, 1996.
19. Lebensohn, R.; Tomé, C. A self-consistent anisotropic approach for the simulation of plastic deformation and texture development of polycrystals: Application to zirconium alloys. Acta Metall. Mater. 1993, 41, 2611–2624.
20. Zecevic, M.; Knezevic, M. An implicit formulation of the elasto-plastic self-consistent polycrystal plasticity model and its implementation in implicit finite elements. Mech. Mater. 2019, 136, 103065.
21. Zecevic, M.; Pantleon, W.; Lebensohn, R.; McCabe, R.; Knezevic, M. Predicting intragranular misorientation distributions in polycrystalline metals using the viscoplastic self-consistent formulation. Acta Mater. 2017, 140, 398–410.
22. Lhadi, S.; raj purohit Purushottam raj purohit, R.; Richeton, T.; Gey, N.; Berbenni, S.; Perroud, O.; Germain, L. Elasto-viscoplastic tensile behavior of as-forged Ti-1023 alloy: Experiments and micromechanical modeling. Mater. Sci. Eng. A 2020, 787, 139491.
23. Kalidindi, S.; Bronkhorst, C.; Anand, L. Crystallographic texture evolution in bulk deformation processing of FCC metals. J. Mech. Phys. Solids 1992, 40, 537–569.
24. Ardeljan, M.; Beyerlein, I.; Knezevic, M. A dislocation density based crystal plasticity finite element model: Application to a two-phase polycrystalline HCP/BCC composites. J. Mech. Phys. Solids 2014, 66, 16–31.
25. Knezevic, M.; Drach, B.; Ardeljan, M.; Beyerlein, I. Three dimensional predictions of grain scale plasticity and grain boundaries using crystal plasticity finite element models. Comput. Methods Appl. Mech. Eng. 2014, 277, 239–259.
26. Taylor, G. Plastic strain in metals. J. Inst. Met. 1938, 62, 307–324.
27. Fromm, B.; Adams, B.; Ahmadi, S.; Knezevic, M. Grain size and orientation distributions: Application to yielding of α-titanium. Acta Mater. 2009, 57, 2339–2348.
28. Lebensohn, R.; Kanjarla, A.; Eisenlohr, P. An elasto-viscoplastic formulation based on fast Fourier transforms for the prediction of micromechanical fields in polycrystalline materials. Int. J. Plast. 2012, 32–33, 59–69.
29. Eghtesad, A.; Knezevic, M. High-performance full-field crystal plasticity with dislocation-based hardening and slip system back-stress laws: Application to modeling deformation of dual-phase steels. J. Mech. Phys. Solids 2020, 134, 103750.
30. Lieberman, E.; Lebensohn, R.; Menasche, D.; Bronkhorst, C.; Rollett, A. Microstructural effects on damage evolution in shocked copper polycrystals. Acta Mater. 2016, 116, 270–280.
31. Segurado, J.; Lebensohn, R.; Llorca, J.; Tomé, C. Multiscale modeling of plasticity based on embedding the viscoplastic self-consistent formulation in implicit finite elements. Int. J. Plast. 2012, 28, 124–140.
32. Barrett, T.; Knezevic, M. Deep drawing simulations using the finite element method embedding a multi-level crystal plasticity constitutive law: Experimental verification and sensitivity analysis. Comput. Methods Appl. Mech. Eng. 2019, 354, 245–270.
33. Zecevic, M.; Beyerlein, I.; Knezevic, M. Coupling elasto-plastic self-consistent crystal plasticity and implicit finite elements: Applications to compression, cyclic tension-compression, and bending to large strains. Int. J. Plast. 2017, 93, 187–211.
34. Gierden, C.; Kochmann, J.; Waimann, J.; Kinner-Becker, T.; Sôlter, J.; Svendsen, B.; Reese, S. Efficient two-scale FE-FFT-based mechanical process simulation of elasto-viscoplastic polycrystals at finite strains. Comput. Methods Appl. Mech. Eng. 2021, 374, 113566.
35. Barton, N.; Bernier, J.; Knap, J.; Sunwoo, A.; Cerreta, E.; Turner, T. A call to arms for task parallelism in multi-scale materials modeling. Int. J. Numer. Methods Eng. 2011, 86, 744–764.
36. Panchal, J.; Kalidindi, S.; McDowell, D. Key computational modeling issues in Integrated Computational Materials Engineering. Comput.-Aided Des. 2013, 45, 4–25.
37. Li, D.; Garmestani, H.; Schoenfeld, S. Evolution of crystal orientation distribution coefficients during plastic deformation. Scr. Mater. 2003, 49, 867–872.
38. Shaffer, J.; Knezevic, M.; Kalidindi, S. Building texture evolution networks for deformation processing of polycrystalline fcc metals using spectral approaches: Applications to process design for targeted performance. Int. J. Plast. 2010, 26, 1183–1194.
39. Knezevic, M.; Kalidindi, S.; Mishra, R. Delineation of first-order closures for plastic properties requiring explicit consideration of strain hardening and crystallographic texture evolution. Int. J. Plast. 2008, 24, 327–342.
40. Kalidindi, S.; Duvvuru, H.; Knezevic, M. Spectral calibration of crystal plasticity models. Acta Mater. 2006, 54, 1795–1804.
41. Knezevic, M.; Al-Harbi, H.; Kalidindi, S. Crystal plasticity simulations using discrete Fourier transforms. Acta Mater. 2009, 57, 1777–1784.
42. Al-Harbi, H.; Knezevic, M.; Kalidindi, S. Spectral approaches for the fast computation of yield surfaces and first-order plastic property closures for polycrystalline materials with cubic-triclinic textures. Comput. Mater. Contin. 2010, 15, 153–172.
43. Kalidindi, S.; Knezevic, M.; Niezgoda, S.; Shaffer, J. Representation of the orientation distribution function and computation of first-order elastic properties closures using discrete Fourier transforms. Acta Mater. 2009, 57, 3916–3923.
44. Eghtesad, A.; Barrett, T.; Knezevic, M. Compact reconstruction of orientation distributions using generalized spherical harmonics to advance large-scale crystal plasticity modeling: Verification using cubic, hexagonal, and orthorhombic polycrystals. Acta Mater. 2018, 155, 418–432.
45. Fast, T.; Knezevic, M.; Kalidindi, S. Application of microstructure sensitive design to structural components produced from hexagonal polycrystalline metals. Comput. Mater. Sci. 2022, 43, 374–383.
46. Sundararaghavan, V.; Zabaras, N. Linear analysis of texture-property relationships using process-based representations of Rodrigues space. Acta Mater. 2007, 55, 1573–1587.
47. Barton, N.; Knap, J.; Arsenlis, A.; Becker, R.; Hornung, R.; Jefferson, D. Embedded polycrystal plasticity and adaptive sampling. Int. J. Plast. 2008, 24, 242–266.
48. Chockalingam, K.; Tonks, M.; Hales, J.; Gaston, D.; Millett, P.; Zhang, L. Crystal plasticity with Jacobian-Free Newton–Krylov. Comput. Mech. 2013, 51, 617–627.
49. Knezevic, M.; Savage, D. A high-performance computational framework for fast crystal plasticity simulations. Comput. Mater. Sci. 2014, 83, 101–106.
50. Savage, D.; Knezevic, M. Computer implementations of iterative and non-iterative crystal plasticity solvers on high performance graphics hardware. Comput. Mech. 2015, 56, 677–690.
51. Eghtesad, A.; Germaschewski, K.; Lebensohn, R.; Knezevic, M. A multi-GPU implementation of a full-field crystal plasticity solver for efficient modeling of high-resolution microstructures. Comput. Phys. Commun. 2020, 254, 107231.
52. Nickolls, J.; Dally, W. The GPU computing era. IEEE Micro 2010, 30, 56–69.
53. Eghtesad, A.; Germaschewski, K.; Beyerlein, I.; Hunter, A.; Knezevic, M. Graphics processing unit accelerated phase field dislocation dynamics: Application to bi-metallic interfaces. Adv. Eng. Softw. 2018, 115, 248–267.
54. Feldman, S.I. A Fortran to C converter. ACM SIGPLAN Fortran Forum 1990, 9, 21–22.
55. Grosse-Kunstleve, R.; Terwilliger, T.; Sauter, N.; Adams, P. Automatic Fortran to C++ conversion with FABLE. Source Code Biol. Med. 2012, 7, 5.
56. Online Fortran to C Converter. Available online: https://www.codeconvert.ai/fortran-to-c-converter (accessed on 7 October 2024).
57. Morgan, N.R.; Archer, B.J. On the origins of Lagrangian hydrodynamic methods. Nucl. Technol. 2021, 207, S147–S175.
58. Feng, Z.; Zecevic, M.; Knezevic, M.; Lebensohn, R.A. Predicting extreme anisotropy and shape variations in impact testing of tantalum single crystals. Int. J. Solids Struct. 2022, 241, 111466.
Figure 1. In this work, the MATAR library is used to modernize multiple Fortran material model implementations that are then coupled to the C++ Fierro mechanics code, which is also based on the MATAR library.
Figure 2. The runtime scaling results are presented for the 2D axisymmetric, metal rod impact test conducted on both multi-core Haswell CPUs and GPU architectures. The data are displayed as wall clock time in seconds against increasing mesh resolution. Even on 2D meshes, significant accelerations of the runtime, relative to the serial, are possible on GPUs for larger mesh sizes.
Figure 3. The runtime scaling results are presented for the 3D metal rod impact test conducted on both multi-core Haswell CPUs and GPU architectures. The data are displayed as the wall clock time in seconds against increasing the mesh resolution in 3D. The mesh resolution is the number of elements in the cross section of the rod by the number of elements in the vertical direction. Significant accelerations of the runtime, relative to the serial, are possible on GPUs for larger mesh sizes.
Figure 4. Speedup comparisons for the 2D axisymmetric, metal rod impact test on a 40 × 416 2D cylindrical coordinate mesh using an equation of state with an isotropic hypoelastic–plastic model. Plot (a) presents the speedup compared to a serial run, and Plot (b) presents the speedup compared to a parallel 20 core run on the Haswell CPU. On a 2D mesh, GPUs give a significant boost to runtime performance over a serial and a multi-core CPU.
Figure 5. Speedup comparisons for the 3D metal rod impact test on a mesh with 40 elements in the cross-section by 416 elements in the vertical direction using an equation of state with an isotropic hypoelastic–plastic model. Plot (a) presents the speedup compared to a serial run, and Plot (b) presents the speedup compared to a parallel 20 core run on the Haswell CPU. GPUs give a significant boost to runtime performance over a serial and a multi-core CPU.
Figure 6. Von Mises-equivalent stress results in each element of the mesh for the 3D metal rod impact test using an elasto-viscoplastic single-crystal plasticity model.
Figure 7. Speedup comparisons to a serial run for the 3D metal rod impact test using an elasto-viscoplastic single-crystal plasticity model. The Power9 CPU was used for the serial, 8-core, and 16-core calculations.
Figure 8. The runtimes on a V100 GPU are shorter than 20 cores on a Power9 CPU only when there are many instances of the VPSC model.
Figure 9. The scale-bridging VPSC-GMM model was made performant and portable across CPU and GPU architectures using the MATAR library. The speedup results are for 30, 100, 200, and 500 grains on five different computer architectures.
Figure 10. A Taylor anvil impact test with a polycrystalline tantalum was simulated using the Fierro mechanics code with the VPSC-GMM. The rod deformation is a function of the texture of the material. The rod is colored by the von Mises stress [MPa].
Figure 11. Speedup comparisons are shown for the VPSC-GMM coupled to the Fierro mechanics code. (a) Using the linear extrapolation scheme in combination with the VPSC model generates thread divergence, which can greatly hinder fine-grained parallelism. (b) Not using the linear extrapolation scheme in the VPSC-GMM yields a favorable speedup on GPUs because it eliminates thread divergence.
Table 1. Simulations were run across diverse hardware using a single-code implementation. Details on the CPUs (top table) and GPUs (bottom table) are shown.
CPU company:               Intel        Intel        IBM
Name:                      Haswell      Broadwell    Power9
Memory:                    132 GB       132 GB       256 GB
Number of cores per CPU:   20           32           20
Clock speed:               2.6 GHz      2.1 GHz      3.45 GHz

GPU company:               NVIDIA       NVIDIA       NVIDIA       AMD
Name:                      Tesla V100   A100         Quadro RTX   Vega MI50
Memory:                    32 GB        40 GB        48 GB        16 GB
Number of multiprocessors: 84           108          72           60
CUDA cores (NVIDIA) or
shading units (AMD):       5376         6912         4608         3840
