Enabling Large-Scale Simulations of Quantum Transport with Manycore Computing
Abstract
1. Introduction
2. Methods
2.1. Processes of RGF Computation
2.2. Strategies for Performance Enhancement
2.2.1. Data-Restructuring for SIMD and SIMT Operations
2.2.2. Blocked (Tiled) Matrix Multiplication
2.2.3. Thread Scheduling for Execution of Nested Loops
- The cubic semiconductor nanostructure has 2 unitcells along the X([1 0 0])-direction. Since a single [1 0 0] unitcell has 4 atomic layers, the structure has 8 atomic layers in total, and the system matrix consists of 8 × 8 sub-matrices.
- The nanostructure has 8 × 16 unitcells on the YZ([0 1 0] and [0 0 1])-plane. Since we use a 10-band tight-binding model and a single [1 0 0] unitcell has 2 atoms per atomic layer, the size of a sub-matrix becomes 2560 × 2560 (= 16 × 8 × 2 × 10).
- The block size is 32 × 32, which is determined by considering the L1 cache size of a KNL core.
- The number of threads used in a single MPI process is 32, since each KNL node we use has 64 physical cores and 2 MPI processes per node are employed for RGF computation.
- We focus on the 6th iteration of the i-loop (i = 1), where the j-loop runs 6 iterations (j = 7→6→5→4→3→2).
- Decompose the j-loop into groups of 2ⁿ iterations; the group size determines the number of threads participating in the parallelization of each decomposed loop.
- Adjust the number of threads participating in matrix multiplication (the subroutine cMat_mul) in each decomposed loop accordingly. The product of the two thread counts must always equal the number of threads that belong to a single MPI process (32 in this case).
- Decompose the number of iterations of the j-loop (6) into 4 and 2 (6 = 2² + 2¹), as shown on the right side of Figure 5b.
- The 4 iterations of j-loop (j = 7→6→5→4) are executed simultaneously with 4 threads, and cMat_mul is processed with 8 threads (32/4 = 8). In this case, all the 32 (4 × 8) threads execute matrix multiplication simultaneously, and the issue of load imbalance does not exist since 80 rows are processed in parallel with 8 threads.
- Once the above 4 iterations are completed, we reschedule threads such that the remaining 2 iterations (j = 3→2) are executed with 2 threads, and cMat_mul is processed with 16 threads (32/2 = 16). This case is also free of load imbalance since each of the 16 threads processes exactly 5 rows. A minimal code sketch of this scheduling is given after this list.
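The scheduling above can be expressed with nested OpenMP parallel regions. The following is a minimal sketch, not the authors' implementation: the 2560 × 2560 sub-matrix size, the 32 × 32 block size, the 32 threads per MPI process, and the 6 = 4 + 2 decomposition come from the example in this subsection, while the helper names (run_group, process_j_loop), the row-major complex-double layout, and the signature of cMat_mul are assumptions made here for illustration.

```cpp
// Minimal sketch (not the authors' implementation) of the scheduling described
// above: the 6 j-loop iterations at i = 1 are split into groups of 4 and 2,
// and the 32 OpenMP threads of one MPI process are divided between the
// decomposed j-loop and the blocked multiplication so that the 80 block-rows
// (2560 / 32) of every sub-matrix are distributed evenly among the threads.
#include <omp.h>
#include <complex>

using cplx = std::complex<double>;

constexpr int N  = 2560;  // sub-matrix size from the example (16 x 8 x 2 x 10)
constexpr int BS = 32;    // 32 x 32 blocks, sized with the KNL L1 cache in mind
constexpr int NT = 32;    // OpenMP threads per MPI process

// Blocked (tiled) stand-in for the cMat_mul subroutine: 'team' threads share
// the N / BS = 80 block-rows of C (row-major layout is assumed here).
void cMat_mul(const cplx* A, const cplx* B, cplx* C, int team)
{
    #pragma omp parallel for num_threads(team) schedule(static)
    for (int ib = 0; ib < N; ib += BS)          // 80 block-rows of C
        for (int kb = 0; kb < N; kb += BS)
            for (int jb = 0; jb < N; jb += BS)
                for (int i = ib; i < ib + BS; ++i)
                    for (int k = kb; k < kb + BS; ++k)
                        for (int j = jb; j < jb + BS; ++j)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}

// Run 'n_iter' consecutive j-iterations with 'n_iter' outer threads, so that
// every cMat_mul call receives NT / n_iter inner threads (nested parallelism).
void run_group(int n_iter, cplx** A, cplx** B, cplx** C)
{
    const int inner = NT / n_iter;
    #pragma omp parallel for num_threads(n_iter)
    for (int k = 0; k < n_iter; ++k)
        cMat_mul(A[k], B[k], C[k], inner);
}

// The i = 1 step of the example: 6 j-iterations decomposed as 6 = 2^2 + 2^1.
// A, B and C hold one 2560 x 2560 operand/result per j-iteration.
void process_j_loop(cplx** A, cplx** B, cplx** C)
{
    omp_set_max_active_levels(2);        // allow nested parallel regions
    run_group(4, &A[0], &B[0], &C[0]);   // j = 7..4:  4 outer x 8 inner threads
    run_group(2, &A[4], &B[4], &C[4]);   // j = 3..2:  2 outer x 16 inner threads
}
```

With 4 outer threads, each inner team of 8 handles 10 of the 80 block-rows; with 2 outer threads, each inner team of 16 handles 5, reproducing the balanced distributions stated above.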
2.2.4. Offload Computing with GPU Accelerators
- Transfer the data to be computed on GPU devices from the host memory to the device memory. Here, the host CPU cores control the GPU devices through streams, so each host process sends its block matrices to a physical GPU device using its own GPU stream.
- Conduct the computation on the GPU devices. Computation can be performed simultaneously on both the host and the GPU devices, but GPU-only computation becomes preferable in terms of speed as the hardware performance of GPU devices improves.
- Transfer the results of the computation from the device memory back to the host memory. (A minimal CUDA sketch of these three steps is given after this list.)
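The sketch below illustrates these three steps with the CUDA runtime API. It is an illustration under assumptions rather than the authors' code: the one-stream-per-MPI-process usage and the transfer/compute/transfer sequence follow the list above, and the 2560 × 2560 block size and 1024-thread blocks follow the configuration described elsewhere in the paper, but the kernel is a naive element-wise multiplication (not the blocked scheme) and all names are hypothetical.

```cuda
// Minimal sketch (not the authors' implementation) of the offload pattern
// described above: each MPI process owns one CUDA stream, copies its block
// matrices to the device, runs the multiplication there, and copies the
// result back.  Kernel and helper names are hypothetical.
#include <cuda_runtime.h>
#include <cuComplex.h>

#define N 2560   // sub-matrix size from the example in Section 2.2.3

// Naive device kernel: one thread per element of C (illustration only).
__global__ void block_mul_kernel(const cuDoubleComplex* A,
                                 const cuDoubleComplex* B,
                                 cuDoubleComplex* C)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N || j >= N) return;
    cuDoubleComplex acc = make_cuDoubleComplex(0.0, 0.0);
    for (int k = 0; k < N; ++k)
        acc = cuCadd(acc, cuCmul(A[i * N + k], B[k * N + j]));
    C[i * N + j] = acc;
}

// Offload one block multiplication on the stream owned by this MPI process.
void offload_block_mul(const cuDoubleComplex* hA, const cuDoubleComplex* hB,
                       cuDoubleComplex* hC, cudaStream_t stream)
{
    size_t bytes = (size_t)N * N * sizeof(cuDoubleComplex);
    cuDoubleComplex *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);  cudaMalloc(&dB, bytes);  cudaMalloc(&dC, bytes);

    // (1) host -> device transfer on this process's own stream
    cudaMemcpyAsync(dA, hA, bytes, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(dB, hB, bytes, cudaMemcpyHostToDevice, stream);

    // (2) computation on the GPU device
    dim3 block(32, 32);                        // 1024 threads per thread-block
    dim3 grid((N + 31) / 32, (N + 31) / 32);
    block_mul_kernel<<<grid, block, 0, stream>>>(dA, dB, dC);

    // (3) device -> host transfer of the result
    cudaMemcpyAsync(hC, dC, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaFree(dA);  cudaFree(dB);  cudaFree(dC);
}
```

In practice each MPI process would select its GPU with cudaSetDevice, create its stream once with cudaStreamCreate, and allocate the host buffers with cudaMallocHost, since cudaMemcpyAsync only overlaps with computation when the host memory is pinned.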
3. Results and Discussion
3.1. Utilization Efficiency of GPU Devices for Offload Computing
3.2. Performance of End-To-End Simulations
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Number of Iterations of j-loop | Efficiency of Thread Utilization |
---|---|
1 (j = 7) | 80/96 ≈ 83.33% |
2 (j = 7, 6) | 2×(80/(16×5))/2 = 100.00% |
3 (j = 7, 6, 5) | (2×(80/(16×5)) + 80/96)/3 ≈ 94.44% |
4 (j = 7, 6, 5, 4) | 4×(80/(8×10))/4 = 100.00% |
5 (j = 7, 6, 5, 4, 3) | (4×(80/(8×10)) + 80/96)/5 ≈ 96.67% |
6 (j = 7, 6, 5, 4, 3, 2) | (4×(80/(8×10)) + 2×(80/(16×5)))/6 = 100.00% |
7 (j = 7, 6, 5, 4, 3, 2, 1) | (4×(80/(8×10)) + 2×(80/(16×5)) + 80/96)/7 ≈ 97.62% |
Average efficiency | ≈ 96.01% |
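A compact way to read the entries of this table (the formula and the symbol η are introduced here only as a summary and do not appear in the original text): an iteration of the j-loop whose cMat_mul runs with t threads distributes the 80 block-rows so that the slowest thread handles ⌈80/t⌉ of them, and the overall efficiency is the average over all iterations.

```latex
% Per-iteration efficiency of a cMat_mul executed with t threads over 80 block-rows
% (eta is a symbol introduced here for illustration):
\eta(t) = \frac{80}{t \,\lceil 80/t \rceil},
\qquad \eta(8) = \frac{80}{8 \cdot 10} = 1,
\qquad \eta(16) = \frac{80}{16 \cdot 5} = 1,
\qquad \eta(32) = \frac{80}{32 \cdot 3} = \frac{80}{96} \approx 83.33\%.

% Example: 7 iterations decomposed as 4 + 2 + 1 (the leftover iteration uses all 32 threads)
\bar{\eta} = \frac{4\,\eta(8) + 2\,\eta(16) + \eta(32)}{7}
           = \frac{4 + 2 + 80/96}{7} \approx 97.62\%.
```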
Host CPU | Intel Xeon Phi 7210, 1.3 GHz, 64 cores |
Host memory | DDR4 96 GB, MCDRAM 16 GB |
GPU device | NVIDIA Quadro GV100 × 2 |
GPU memory | HBM2 32 GB |
Compiler | Intel Parallel Studio 2018, NVIDIA CUDA toolkit 9.0 |
Configuration of parallel execution | 2 MPI processes per host CPU, 32 threads and 1 GPU per MPI process, 2 thread-blocks per SM, 1024 threads per thread-block |
Target problem | A silicon nanowire consisting of 100 × 8 × 16 [100] unitcells (102,400 atoms) / 1 energy point |
Optimization Technique | Case 1 | Case 2 | Case 3 | Case 4 | Case 5 |
---|---|---|---|---|---|
MPI/OpenMP parallelization & MCDRAM utilization | YES | YES | YES | YES | YES |
Data-restructuring | NO | YES | YES | YES | YES |
Blocked matrix multiplication | NO | NO | YES | YES | YES |
Thread-scheduling | NO | NO | NO | YES | YES |
Offload computing (RGF step 3) | NO | NO | NO | NO | YES |
Host CPU | Intel Xeon Phi 7250, 1.4 GHz, 68 cores |
Host memory | DDR4 96 GB (MCDRAM is not used) |
Compiler | Intel Parallel Studio 2018 (no offload computing) |
Configuration of parallel execution | 1 MPI process per host CPU, 68 threads per MPI process |
Target problem | A silicon nanowire consisting of 100 × 8 × 16 [100] unitcells (102,400 atoms) / 2048 energy points |