CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming
Gregory Ruetsch, Massimiliano Fatica
CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran, the familiar language of scientific computing and supercomputer performance benchmarking. The authors presume no prior parallel computing experience, and cover the basics along with best practices for efficient GPU computing using CUDA Fortran.
To help you add CUDA Fortran to existing Fortran codes, the book explains how to understand the target GPU architecture, identify computationally intensive parts of the code, and modify the code to manage the data and parallelism and optimize performance. All of this is done in Fortran, without having to rewrite in another language. Each concept is illustrated with actual examples so you can immediately evaluate the performance of your code in comparison.
• Leverage the power of GPU computing with PGI's CUDA Fortran compiler
• Gain insights from members of the CUDA Fortran language development team
• Includes multi-GPU programming in CUDA Fortran, covering both peer-to-peer and message passing interface (MPI) approaches
• Includes full source code for all the examples and several case studies
• Download source code and slides from the book's companion website
In this chapter we first discuss how to time kernel execution using CPU timers, CUDA events, the Command Line Profiler, and the nvprof profiling tool. We then discuss how timing information can be used to determine the limiting factor of kernel execution. Finally, we discuss how to calculate performance metrics, especially those related to bandwidth, and how such metrics should be interpreted.

2.1 Measuring kernel execution time

There are several ways to measure kernel execution time. We can use traditional CPU timers, but in doing so we must be careful to ensure correct synchronization between host and device for such measurements to be accurate. The CUDA event API routines, which are called from host code, can be used to calculate kernel execution time using the device clock. Finally, the Command Line Profiler and the nvprof profiling tool can be used to obtain this timing information.

2.1.1 Host-device synchronization and CPU timers
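To make the two approaches concrete, the following is a minimal sketch (the kernel, array size, and launch configuration are illustrative, not the book's listings) that times the same kernel first with a host-side timer bracketed by cudaDeviceSynchronize() calls, and then with the CUDA event API:

module kernels_m
  use cudafor
  implicit none
contains
  ! Simple kernel to time: adds b to each element of a
  attributes(global) subroutine inc(a, b)
    real :: a(:)
    real, value :: b
    integer :: i
    i = (blockIdx%x-1)*blockDim%x + threadIdx%x
    if (i <= size(a)) a(i) = a(i) + b
  end subroutine inc
end module kernels_m

program timeKernel
  use cudafor
  use kernels_m
  implicit none
  integer, parameter :: n = 1024*1024
  real :: a(n)
  real, device :: a_d(n)
  type(cudaEvent) :: startEvent, stopEvent
  integer :: istat, c1, c2, crate
  real :: tEvent

  a = 1.0
  a_d = a

  ! Host (CPU) timer: kernel launches are asynchronous with respect to
  ! the host, so the device is synchronized both before starting and
  ! before stopping the timer
  istat = cudaDeviceSynchronize()
  call system_clock(c1, crate)
  call inc<<<n/256, 256>>>(a_d, 1.0)
  istat = cudaDeviceSynchronize()
  call system_clock(c2)
  print *, 'CPU timer (ms):   ', 1000.0*real(c2-c1)/real(crate)

  ! CUDA events: recorded in the same stream as the kernel, so the
  ! elapsed time is measured with the device clock
  istat = cudaEventCreate(startEvent)
  istat = cudaEventCreate(stopEvent)
  istat = cudaEventRecord(startEvent, 0)
  call inc<<<n/256, 256>>>(a_d, 1.0)
  istat = cudaEventRecord(stopEvent, 0)
  istat = cudaEventSynchronize(stopEvent)
  istat = cudaEventElapsedTime(tEvent, startEvent, stopEvent)
  print *, 'Event timer (ms): ', tEvent
  istat = cudaEventDestroy(startEvent)
  istat = cudaEventDestroy(stopEvent)
end program timeKernel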
The use of the read-only constant and texture memories is also covered, as is the topic of launching kernels with enough parallelism, whether in the form of instruction-level or thread-level parallelism. The final section covers instruction optimization.

Keywords: Pinned memory; Asynchronous data transfers; Stream; Hyper-Q; Shared memory; Bank conflicts; Constant memory; Textures; Registers; Data coalescing; Warp; Execution configuration; Thread-level parallelism; Instruction-level parallelism
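To illustrate the distinction between the two forms of parallelism, here is a minimal sketch (kernel names, the ILP factor, and the launch parameters are illustrative and not taken from the book) contrasting a kernel that relies purely on thread-level parallelism with one that adds instruction-level parallelism by having each thread process several independent elements:

module parallelism_m
  use cudafor
  implicit none
contains
  ! Thread-level parallelism: one array element per thread
  attributes(global) subroutine incTLP(a)
    real :: a(:)
    integer :: i
    i = (blockIdx%x-1)*blockDim%x + threadIdx%x
    if (i <= size(a)) a(i) = a(i) + 1.0
  end subroutine incTLP

  ! Instruction-level parallelism: each thread processes ILP elements,
  ! separated by the total number of threads in the grid, so the
  ! updates are independent and can overlap in the pipeline
  attributes(global) subroutine incILP(a)
    integer, parameter :: ILP = 4
    real :: a(:)
    integer :: i, j, stride
    i = (blockIdx%x-1)*blockDim%x + threadIdx%x
    stride = blockDim%x*gridDim%x
    do j = 0, ILP-1
       if (i + j*stride <= size(a)) a(i + j*stride) = a(i + j*stride) + 1.0
    end do
  end subroutine incILP
end module parallelism_m

A launch such as incTLP<<<n/256,256>>>(a_d) uses one thread per element, whereas incILP<<<n/(256*4),256>>>(a_d) covers the same array with a quarter as many threads, each performing four independent updates.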
The number of elements refers to the number of elements of the source array. Assignment statements can be used in CUDA Fortran to transfer array sections between device and host, but such operations are generally broken up into multiple separate transfers. A more efficient way of performing such transfers is to use the routine cudaMemcpy2D(), which can perform the same array-section transfer as the assignment statement in a single call, as sketched below.
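A minimal sketch of both forms follows (the array names, dimensions, and section bounds are illustrative, not the book's listing). The assignment copies a sub-block of a host array to a device array, and the cudaMemcpy2D() call performs the same transfer by specifying the leading dimensions (pitches) of the destination and source along with the extents of the section, all counted in elements:

program sectionTransfer
  use cudafor
  implicit none
  integer, parameter :: n = 1024, nr = 256, nc = 256
  real :: a(n,n)                 ! host array
  real, device :: a_d(nr,nc)     ! device array holding the section
  integer :: istat

  call random_number(a)

  ! Array-section assignment: convenient, but may be broken up
  ! into one transfer per column of the section
  a_d = a(1:nr, 1:nc)

  ! Single transfer of the same section with cudaMemcpy2D():
  ! arguments are destination, destination pitch, source, source
  ! pitch, width, and height, all in elements
  istat = cudaMemcpy2D(a_d, nr, a(1,1), n, nr, nc)
end program sectionTransfer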
Timing in this case is done using a wall-clock timer based on the C function gettimeofday(), which is accessed in the Fortran code through a timing module (a sketch is given below). Whenever this timing routine is called, we explicitly check that there are no pending or executing operations on the device. Note that most of this multi-GPU code is overhead associated with declaring and initializing arrays and enabling peer-to-peer communication; the actual data transfers and kernel launches make up only a small portion of the code.
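As a rough sketch of such a timing module, the following reaches gettimeofday() from Fortran via iso_c_binding (assuming a Linux/LP64 environment where both timeval fields are C long); this is one possible form, not necessarily the listing used in the book:

module timing
  use iso_c_binding
  implicit none
  ! Interoperable analogue of C's struct timeval (LP64 assumption)
  type, bind(C) :: timeval
     integer(c_long) :: tv_sec
     integer(c_long) :: tv_usec
  end type timeval
  interface
     function gettimeofday(tv, tz) bind(C, name='gettimeofday')
       import :: timeval, c_int, c_ptr
       integer(c_int) :: gettimeofday
       type(timeval) :: tv
       type(c_ptr), value :: tz
     end function gettimeofday
  end interface
contains
  ! Wall-clock time in seconds
  real(8) function wallclock()
    type(timeval) :: tv
    integer(c_int) :: istat
    istat = gettimeofday(tv, c_null_ptr)
    wallclock = real(tv%tv_sec, 8) + 1.0d-6*real(tv%tv_usec, 8)
  end function wallclock
end module timing

Before each call to wallclock() in the bandwidth measurement, a call such as istat = cudaDeviceSynchronize() ensures that all previously issued transfers and kernels have completed, so the wall-clock interval reflects only the work being timed.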