Optimized Three-Dimensional Stencil Computation on Fermi and Kepler GPUs
Siemens Corporate Technology, SC Siemens SRL
Abstract—Stencil based algorithms are used intensively in scientific computations. Graphics Processing Unit (GPU) based implementations of stencil computations speed up the execution significantly compared to conventional CPU-only systems. In this paper we focus on double precision stencil computations, which are required for meeting the high accuracy requirements inherent for scientific computations. Starting from two baseline implementations (using two dimensional and three dimensional thread block structures respectively), we employ different optimization techniques which lead to seven kernel versions. Both Fermi and Kepler GPUs are used, to evaluate the impact of the different optimization techniques for the two architectures. Furthermore, we compare GPU cards designed for desktop PCs and notebook PCs. The results indicate that the ratio of execution time is roughly equal to the inverse of the ratio of power consumption.

Keywords—stencil, GPU, double precision, Kepler, Fermi, optimization

I. INTRODUCTION

Graphics Processing Units (GPUs) are dedicated processors, designed originally as graphic accelerators. Since then, they have evolved into massively parallel processors used for general purpose computations [1].

The GPU is viewed as a compute device which is able to run a very high number of threads in parallel inside a kernel (a function, written in C language, which is executed on the GPU and launched by the CPU). The threads of a kernel are organized at three levels: blocks of threads are organized in a three dimensional (3D) grid at the top level, threads are organized in 3D blocks at the middle level, and, at the lowest level, threads are grouped into warps (groups of 32 threads formed by linearizing the 3D block structure along the x, y and z axes respectively) [2].

The GPU contains several streaming multiprocessors, each of them containing several cores. The GPU (usually also called device) contains a certain amount of global memory, to/from which the CPU or host thread can write/read, and which is accessible by all multiprocessors. Furthermore, each multiprocessor also contains shared memory and registers, which are split between the thread blocks and the threads which run on the multiprocessor, respectively. With the introduction of the third and fourth generation general purpose GPUs (GPGPU), the Fermi and the Kepler generations respectively [3], [4], the double precision performance has increased, and a true cache hierarchy (L1/L2) and more shared memory are available. Furthermore, the global memory bandwidth plays an important role, since the performance of stencil computations is typically limited by it (see section II).

In a stencil computation, each grid point is updated as a function of its neighboring locations. This pattern is found in several application domains, like image processing, computational fluid dynamics, weather prediction, and quantum physics. Previous studies have shown that, if regular Cartesian grids are used, GPU based implementations are able to significantly speed up the execution compared to regular CPU based implementations [5], [6].
Research activities on stencil based computations have been reported long before the introduction of general purpose GPUs. These activities focused on the information transfer between nodes [7] and the relationship between partition shape, stencil structure and architecture [8]. Different optimization techniques have been reported more recently for GPU based stencil computations. The most often encountered optimization techniques used in the past are blocking at registers and at shared memory [9], [10]. Pre-Fermi GPUs did not have any cache memories, making the shared memory blocking technique vital for reducing memory access counts. Temporal blocking is another extensively used technique, with mixed performance improvements on GPUs [11], [12], [13]. Non-GPU architectures have also been used for stencil based computations [14].
The goal of the current work is to evaluate the performance of 3D stencil based algorithms on a series of recent GPUs. Previous research activities have focused on single precision computations. With the introduction of the Fermi and the Kepler architecture, the performance of double precision computations on NVIDIA GPU cards has increased
substantially. To meet the high accuracy requirements, inherent for
scientific computations [15], [16], in the current
work we focus on double precision computations. Starting from two baseline implementations, we employ different optimization techniques which lead to seven different kernel versions. Both Fermi and Kepler GPUs are used, to evaluate the impact of different optimization techniques for the two architectures.
The paper is organized as follows. Section II first gives a brief introduction of the 3D stencil used herein. Next, the baseline implementations are introduced, followed by the different optimized approaches. Section III displays the results obtained with the different kernel versions for different Fermi and Kepler GPUs. Finally, section IV draws the conclusions.

Fig. 1. 7-point stencil used for the numerical solution of the unsteady heat diffusion equation.
II. METHODS
For studying 3D stencil based algorithms implemented on graphics processing units, we consider the 3D unsteady heat conduction problem, which is modeled as a second order partial differential equation describing the distribution of heat over time in a given 3D space:

∂T/∂t = α (∂²T/∂x² + ∂²T/∂y² + ∂²T/∂z²),   (1)

where α is the thermal diffusivity constant and T represents the temperature at any point in space (x, y, z) and time t.

For the numerical solution of (1) we apply a finite difference method on a 3D grid of points. A uniform mesh of points is used and the forward difference in time and central difference in space (FTCS) method is applied, leading to a 3D 7-point stencil:

(T_{i,j,k}^{n+1} − T_{i,j,k}^{n}) / Δt = α [ (T_{i+1,j,k}^{n} − 2 T_{i,j,k}^{n} + T_{i−1,j,k}^{n}) / Δx² + (T_{i,j+1,k}^{n} − 2 T_{i,j,k}^{n} + T_{i,j−1,k}^{n}) / Δy² + (T_{i,j,k+1}^{n} − 2 T_{i,j,k}^{n} + T_{i,j,k−1}^{n}) / Δz² ],   (2)

which can be rewritten as:

T_{i,j,k}^{n+1} = T_{i,j,k}^{n} + d (T_{i+1,j,k}^{n} + T_{i−1,j,k}^{n} + T_{i,j+1,k}^{n} + T_{i,j−1,k}^{n} + T_{i,j,k+1}^{n} + T_{i,j,k−1}^{n} − 6 T_{i,j,k}^{n}),   (3)

where d = αΔt/Δx².

In the above equations n represents the discrete time step number, (i,j,k) represents the spatial index, Δt is the time step and Δx is the mesh spacing, which is equal in all directions. T_{i,j,k}^{n} represents the temperature value at point (i,j,k), at time step n. The numerical solution is stable if the CFL condition holds: d = αΔt/Δx² < 1/6.

As can be observed in (3), the value at a grid point at time step n+1 is computed from the values at the previous time step, from the same grid point and from its six neighboring points, leading to a 7-point stencil computation (fig. 1).

This solution scheme is fully explicit: the computation of the new value at any grid point is fully independent from the computations at the other grid points.

A. Baseline GPU-based implementations

In the following we introduce two baseline GPU-based implementations of the unsteady heat diffusion problem. For the first baseline implementation (called in the following 3DBase) each grid point is handled by a separate thread. Two buffers are allocated, one for the values at the previous time step and one for the values at the new time step. To eliminate the memory copy requirement from one buffer to the other, the buffers are swapped at the end of each time step.

Since for the latest GPUs the execution configuration allows not only for 3D blocks of threads, but also for a 3D grid of thread blocks, the threads and the thread blocks are both organized into 3D structures. Thus, each thread of the grid corresponds to a grid point in the 3D computational domain. To compute the new value at a grid point, each thread performs seven global memory read operations at each time step. Since global memory operations are very slow, this represents a severe limitation of the kernel performance.
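To make the memory access pattern concrete, a minimal sketch of a 3DBase-style kernel is given below. This is our illustration, not the authors' code: the kernel name, the signature and the boundary handling are assumptions; only the update rule (3) and the one-thread-per-point mapping are taken from the text.

// Sketch of a 3DBase-style kernel: one thread per grid point,
// seven global memory reads and one write per update (illustrative).
__global__ void stencil3DBase(const double *tOld, double *tNew,
                              int xDim, int yDim, int zDim, double d)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;

    // Boundary points hold the Dirichlet values and are not updated.
    if (i < 1 || i > xDim - 2 || j < 1 || j > yDim - 2 ||
        k < 1 || k > zDim - 2)
        return;

    int idx = (k * yDim + j) * xDim + i;   // linearized 3D index
    tNew[idx] = tOld[idx] + d * (tOld[idx + 1] + tOld[idx - 1] +
                                 tOld[idx + xDim] + tOld[idx - xDim] +
                                 tOld[idx + xDim * yDim] +
                                 tOld[idx - xDim * yDim] -
                                 6.0 * tOld[idx]);
}

On the host side, the tOld and tNew pointers are simply exchanged between consecutive kernel launches, which implements the buffer swap described above.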
In the CUDA architecture, each thread block is divided into groups of 32 threads called warps, each of which is executed in a SIMD fashion (all threads of the same warp execute the same instruction at a time). If the threads inside a warp follow different execution paths, the execution of the branches is serialized. Thus, warp divergence is another aspect which affects kernel performance (handling of warp divergence is required to distinguish between boundary and non-boundary nodes, so as to perform the computations only for the latter ones).

On the other hand, stencil codes can be characterized by their FLOP per byte ratio. The baseline implementation performs 13 double-precision floating point operations per update [17]. This leads to 13 · xDim · yDim · zDim operations performed at each iteration (xDim, yDim and zDim represent the grid dimensions). If we assume that at each time step, once the old values are loaded, they remain in the cache memory (which is unlikely for grid dimensions which exceed the cache size), the amount of data loaded and stored per time step is equal to xDim · yDim · zDim · sizeof(double) · 2. Hence the FLOP per DRAM byte ratio is:

(13 · xDim · yDim · zDim) / (xDim · yDim · zDim · sizeof(double) · 2) = 0.8125.   (4)

Current GPUs, however, have a significantly higher ratio. According to this model the performance of the stencil on the GPU is therefore limited by its memory bandwidth.
For the second baseline implementation (called in the following 2DBase), the computational grid is divided into x-y planes and a loop is used to traverse the grid in the z direction (fig. 2). Unlike the 3DBase implementation, for which a thread updates a single point, herein the same thread operates on several grid points. These points are placed equidistant from each other in the linearized grid: the distance from one grid point to the next is given by the size of a 2D slice of the 3D domain (xDim · yDim).
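The corresponding loop structure is sketched below (again our illustration, reusing the layout and names of the previous sketch): a 2D block of threads covers one x-y plane and each thread steps through the z direction with a stride of one slice.

// Sketch of a 2DBase-style kernel: 2D thread layout, z handled by a loop.
__global__ void stencil2DBase(const double *tOld, double *tNew,
                              int xDim, int yDim, int zDim, double d)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || i > xDim - 2 || j < 1 || j > yDim - 2)
        return;

    int slice = xDim * yDim;          // distance between consecutive z slices
    int idx = j * xDim + i + slice;   // start at k = 1
    for (int k = 1; k <= zDim - 2; ++k, idx += slice) {
        tNew[idx] = tOld[idx] + d * (tOld[idx + 1] + tOld[idx - 1] +
                                     tOld[idx + xDim] + tOld[idx - xDim] +
                                     tOld[idx + slice] + tOld[idx - slice] -
                                     6.0 * tOld[idx]);
    }
}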
B. Optimized implementations

Next, we describe a series of optimization techniques for minimizing warp divergence and global memory accesses. Besides global memory, the GPU architecture provides fast on-chip memory, registers and shared memory, which are distributed between the threads and the thread blocks, respectively.

1) Three-dimensional baseline implementation with Shared Memory Usage and Data Overlap
The starting point for the new kernel is the 3DBase implementation. Since shared memory is allocated at thread block level, threads can cooperate when populating data blocks allocated in the shared memory. If data can then be reused by different threads, global memory accesses are reduced and overall kernel performance is improved.

Shared memory arrays of size blockXDim · blockYDim · blockZDim are allocated (blockXDim, blockYDim and blockZDim represent the dimensions of the thread blocks). Each thread within a block loads the value of the grid point it handles from global memory to shared memory. To avoid undefined behavior and incorrect results when sharing data read by different threads, a synchronization barrier is introduced. All values required for the implementation of (3) are then read from the shared memory.

With this technique, threads lying at the border of a thread block do not have access to all their neighbors and can not compute the corresponding new values. Hence, the execution configuration is designed so as to ensure block overlapping in each direction (fig. 3). In the following, this kernel is called 3DShMOverL; a sketch of its core is given below.
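In this sketch (our illustration; the block size constants and the assumption that the overlapped blocks exactly tile the domain are ours), border threads only load data while interior threads compute:

// Core of a 3DShMOverL-style kernel. The shared array has the same size
// as the thread block; blocks advance by blockDim - 2 points per axis.
#define BX 8
#define BY 8
#define BZ 8
__global__ void stencil3DShMOverL(const double *tOld, double *tNew,
                                  int xDim, int yDim, double d)
{
    __shared__ double s[BZ][BY][BX];
    int i = blockIdx.x * (BX - 2) + threadIdx.x;   // overlapping block mapping
    int j = blockIdx.y * (BY - 2) + threadIdx.y;
    int k = blockIdx.z * (BZ - 2) + threadIdx.z;
    int idx = (k * yDim + j) * xDim + i;
    int tx = threadIdx.x, ty = threadIdx.y, tz = threadIdx.z;

    s[tz][ty][tx] = tOld[idx];   // cooperative load, one element per thread
    __syncthreads();             // all values must be visible before reading

    // Only threads whose six neighbors lie inside the shared array compute;
    // domain-boundary guards are omitted for brevity.
    if (tx > 0 && tx < BX - 1 && ty > 0 && ty < BY - 1 &&
        tz > 0 && tz < BZ - 1) {
        tNew[idx] = s[tz][ty][tx] + d * (s[tz][ty][tx + 1] + s[tz][ty][tx - 1] +
                                         s[tz][ty + 1][tx] + s[tz][ty - 1][tx] +
                                         s[tz + 1][ty][tx] + s[tz - 1][ty][tx] -
                                         6.0 * s[tz][ty][tx]);
    }
}

Since each block produces only (BX − 2) · (BY − 2) · (BZ − 2) results out of BX · BY · BZ threads, a considerable fraction of the threads perform load operations only, which is the effect analyzed in section III.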
Fig. 2. 2DBase kernel: the computational grid is divided into x-y planes and a loop is then used to traverse the grid in the z-direction.
Fig. 3. 3DShMOverL kernel: the shared memory arrays have the same size as the thread blocks. Thread blocks overlap to enable the computation at all grid points.
Fig. 4. 3DShMNoOverL kernel: the shared memory arrays are padded with additional data slices loaded by the threads located at the border of the thread block.

2) Three-dimensional baseline implementation with Shared Memory Usage and No Data Overlap

Starting again from the 3DBase implementation, a different shared memory based strategy is developed. The shared memory arrays are padded with an additional slice on each side of the 3D block, leading to a total size of (blockXDim + 2) · (blockYDim + 2) · (blockZDim + 2), as shown in fig. 4.

First, each thread populates the value of the grid point it handles in shared memory. Next, the threads located on the boundary of the block load the remaining data slices from global memory to the shared memory (note that the corner points of the blocks are not required for the 7-point stencil), as sketched below. To load points located outside of the block, conditional operations are introduced which cause branch divergence. Thus, each thread of a thread block has access to all its neighbors and is able to update the corresponding grid point, and no overlapping between thread blocks is required. In the following, this kernel is called 3DShMNoOverL.
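The halo loading could look as follows (an illustrative fragment which reuses idx, xDim, yDim and the block size constants of the previous sketches; domain-boundary guards are again omitted):

// Halo loading in a 3DShMNoOverL-style kernel. The shared array is padded
// by one slice on each side; block-border threads issue the conditional
// (and hence divergent) halo loads. Declared at kernel scope:
__shared__ double s[BZ + 2][BY + 2][BX + 2];
int tx = threadIdx.x + 1, ty = threadIdx.y + 1, tz = threadIdx.z + 1;

s[tz][ty][tx] = tOld[idx];                                       // own point
if (threadIdx.x == 0)      s[tz][ty][0]      = tOld[idx - 1];    // west halo
if (threadIdx.x == BX - 1) s[tz][ty][BX + 1] = tOld[idx + 1];    // east halo
if (threadIdx.y == 0)      s[tz][0][tx]      = tOld[idx - xDim]; // south halo
if (threadIdx.y == BY - 1) s[tz][BY + 1][tx] = tOld[idx + xDim]; // north halo
if (threadIdx.z == 0)      s[0][ty][tx]      = tOld[idx - xDim * yDim];
if (threadIdx.z == BZ - 1) s[BZ + 1][ty][tx] = tOld[idx + xDim * yDim];
__syncthreads();
// Every thread can now update its own grid point from shared memory only.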
3) Two-dimensional distribution of threads with additional register usage

The 2DBase implementation can be optimized by storing data in registers. Therein, the value of the current grid point for adjacent 2D slices is read from the global memory by the same thread. The same holds true for grid points which lie on the front or back sides of the 2D slices. Because the slices are iterated along the z direction, the value at grid point (i, j, k+1) becomes the value at (i, j, k) at the next iteration. Similarly, the value at (i, j, k) becomes the value at (i, j, k−1). Instead of rereading these values, registers are used for caching them, and two global memory accesses are saved at each iteration along the z axis (in the following this kernel is called 2DReg). The register rotation is sketched below.
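The following fragment (our illustration) replaces the z loop of the 2DBase sketch above:

// z-direction register caching in a 2DReg-style kernel: three registers
// hold the column values at k-1, k and k+1; rotating them saves two
// global memory reads per loop iteration.
double below = tOld[idx - slice];        // value at (i, j, k-1)
double cur   = tOld[idx];                // value at (i, j, k)
for (int k = 1; k <= zDim - 2; ++k, idx += slice) {
    double above = tOld[idx + slice];    // the only new column value read
    tNew[idx] = cur + d * (tOld[idx + 1] + tOld[idx - 1] +
                           tOld[idx + xDim] + tOld[idx - xDim] +
                           above + below - 6.0 * cur);
    below = cur;     // (i, j, k) becomes (i, j, k-1) for the next iteration
    cur   = above;   // (i, j, k+1) becomes (i, j, k)
}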
4) Two-dimensional distribution of threads with Shared Memory Usage

As for the kernels with 3D thread blocks, shared memory can also be used to reduce global memory accesses for the kernels with 2D thread blocks. To allow each thread of the thread block to compute the new value of the corresponding grid point, additional slices are populated at each border of the 2D shared memory array. Hence, the size of the shared memory array chosen for this kernel version is (blockXDim + 2) · (blockYDim + 2). Each thread first reads the value of the grid point it handles and stores it in the shared memory. Next, threads located on the boundary of the block load the remaining values, as sketched below (in the following this kernel is called 2DShM).
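A sketch of the body of the z loop for such a kernel is given below (illustrative; the tile is declared once at kernel scope and reused for every slice, and the z neighbors are still read from global memory):

// 2D shared memory tile in a 2DShM-style kernel (fragment of the z loop).
__shared__ double s[BY + 2][BX + 2];       // declared at kernel scope
int tx = threadIdx.x + 1, ty = threadIdx.y + 1;

s[ty][tx] = tOld[idx];                                         // own point
if (threadIdx.x == 0)      s[ty][0]      = tOld[idx - 1];      // west column
if (threadIdx.x == BX - 1) s[ty][BX + 1] = tOld[idx + 1];      // east column
if (threadIdx.y == 0)      s[0][tx]      = tOld[idx - xDim];   // south row
if (threadIdx.y == BY - 1) s[BY + 1][tx] = tOld[idx + xDim];   // north row
__syncthreads();

tNew[idx] = s[ty][tx] + d * (s[ty][tx + 1] + s[ty][tx - 1] +
                             s[ty + 1][tx] + s[ty - 1][tx] +
                             tOld[idx + slice] + tOld[idx - slice] -
                             6.0 * s[ty][tx]);
__syncthreads();   // the tile is overwritten in the next z iteration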
5) Two-dimensional distribution of threads with Additional Register and Shared Memory Usage

For the implementation version described in section II.B.4, the loading of the central section of the shared memory does not introduce any divergent branches, since it is not conditioned. The loading of the slices with y index equal to 0 or blockYDim + 1 introduces a maximum of two divergent branches, one for each half-warp, depending on the compute capability of the GPU. On the other side, the slices with x index equal to 0 or blockXDim + 1 lead to divergent branches in which only one thread of the entire half-warp is active.

To reduce branch divergence, the shared memory array is used only for the central section and for the slices with y index equal to 0 or blockYDim + 1, while the other values are read from the global memory and stored into registers. Only the threads lying at the left or right border perform separate global memory reads (fig. 5), while the other values are safely read from the shared memory. Hence, the size of the shared memory array used for this configuration is blockXDim · (blockYDim + 2). Each thread first reads the value of the grid point it handles and stores it in the shared memory. Next, threads located on the upper and lower boundary of the block load the remaining values. Besides the two registers that store the values of the nodes located next to the left and right boundaries, another two registers are used for the optimization described in section II.B.3. In the following, this kernel is called 2DShMReg; its update step is sketched below.
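Under the same assumptions as the previous fragments (and reusing the cur/above/below registers of the 2DReg sketch), the update step could look as follows:

// 2DShMReg-style update: the shared array holds only the central section
// plus the north/south halo rows; east/west neighbors are fetched into
// registers by the border threads, and z reuses the 2DReg rotation.
__shared__ double s[BY + 2][BX];           // no padding in the x direction
int tx = threadIdx.x, ty = threadIdx.y + 1;

s[ty][tx] = cur;                                              // own point
if (threadIdx.y == 0)      s[0][tx]      = tOld[idx - xDim];  // south halo row
if (threadIdx.y == BY - 1) s[BY + 1][tx] = tOld[idx + xDim];  // north halo row

// Only block-border threads touch global memory for the x neighbors.
double west = (threadIdx.x == 0)      ? tOld[idx - 1] : 0.0;
double east = (threadIdx.x == BX - 1) ? tOld[idx + 1] : 0.0;
__syncthreads();

if (threadIdx.x > 0)      west = s[ty][tx - 1];   // interior: shared memory
if (threadIdx.x < BX - 1) east = s[ty][tx + 1];

tNew[idx] = cur + d * (east + west + s[ty + 1][tx] + s[ty - 1][tx] +
                       above + below - 6.0 * cur);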
Fig. 5. 2DShMReg kernel: northern and southern slices are read from the shared memory, eastern and western values from the global memory.
III. RESULTS

To evaluate the performance of the different strategies for running 3D stencil based algorithms on GPUs, we used three different NVIDIA GPU cards: GeForce GTX 480, GeForce GTX 660M and GeForce GTX 680 (the first one is based on the Fermi architecture, while the other two are based on the Kepler architecture), and the CUDA toolkit version 5.5. The unsteady heat conduction problem was solved on a rectangular domain with Dirichlet boundary conditions, whereas the boundary values were set to 100 °C for one side of the rectangle and 0 °C for the other sides. The thermal diffusivity constant was set to 1.9 · 10^-5 m²/s and the computations were performed until convergence was reached. The numerical solution was obtained on a grid of 128 x 128 x 128 nodes and is displayed in fig. 6. The numerical solution was identical for all three GPU cards and for all implementation versions down to the 15th decimal, i.e. close to the precision of the double-type representation in computer data structures.

Fig. 6. Computation result for the unsteady heat conduction problem on a rectangular domain with Dirichlet boundary conditions.
TABLE I. EXECUTION TIME [MS] FOR A SINGLE TIME STEP, OBTAINED FOR THE SEVEN DIFFERENT IMPLEMENTATION VERSIONS ON THREE DIFFERENT GPU CARDS.

Method          GTX480   GTX660M   GTX680
3DBase          1.7      3.45      0.62
3DShMOverL      3.5      6.17      1.13
3DShMNoOverL    1.8      3.78      0.73
2DBase          1.2      3.09      0.63
2DReg           0.9      2.47      0.58
2DShM           1.2      2.87      0.59
2DShMReg        1.09     2.32      0.48
Table 1 displays the execution times for one time step for the three above mentioned GPU cards, obtained for the seven different kernel versions introduced in the previous section. The GTX660M card leads to the largest execution times, although it has been released considerably later than the GTX480 card. This can be explained however by the fact that the GTX660M is designed for notebook PCs and has a much lower power budget (whereas the GTX480 and the GTX680 have a power consumption of 250W and 195W respectively, the GTX660M only requires 50W). The GTX680 is the best performing card: for each of the seven implementation versions it leads to the smallest execution times. The ratio of the execution times for the GTX660M and GTX680 cards varies between 4.26 and 5.56 for the different kernel versions. This roughly reflects the inverse of the power consumption ratio, which is equal to 3.9.

Interestingly, whereas for the GTX660M and the GTX680 cards the 2DShMReg kernel performs best, for the GTX480 card the 2DReg kernel leads to the smallest execution time. For the GTX480, since the L1 cache is intensively used for caching global memory read operations, the 2DReg kernel outperforms the 2DShMReg kernel. On the other hand, for the GTX660M and the GTX680, since the L1 cache functionality is limited to register spilling, shared memory usage becomes more important; shared memory based optimizations were thus particularly beneficial for the Kepler based cards, as illustrated by the better performance of the 2DShMReg kernel.
In the following we focus on the differences between the kernel versions for the GTX680 card, which was determined as the best performing one considered herein. Table 2 displays, besides the execution time, other important details of the various kernel versions.
The two baseline implementations (2DBase and 3DBase) lead to almost identical execution times. Referring first to the kernels based on a 3D thread block structure, the 3DShMOverL kernel performs worse than the 3DBase kernel: the execution time increased by 82%, although the number of global accesses was reduced by 66.13%. This can be explained by the fact that a considerable amount of threads perform only load operations.

Compared to the 3DShMOverL kernel, the execution time decreased by 35.39% and the total number of read operations was reduced by 25.66% for the 3DShMNoOverL kernel. Compared to the 3DBase kernel, this implementation is compute limited instead of bandwidth limited. The main reason for the change of the limitation type lies in the number of divergent branches, which increased considerably and which in the end leads to a higher execution time than for the 3DBase kernel.

Next, we refer to the kernels based on a 2D thread block structure. The 2DReg kernel leads to a significant reduction of memory operations (28.34%) and, as a result, of the execution time (7.93%), compared to the 2DBase kernel. The 2DShM kernel further reduces the number of global memory load operations, but the execution time increases slightly, which is caused by the non-optimized register usage. Finally, the 2DShMReg kernel combines the register and shared memory based optimizations and leads to the smallest execution time.

TABLE II. KERNEL PERFORMANCE AND DETAILS FOR THE GTX680 CARD.

Method          Execution   Reg. per   Divergent   Shared memory       64 bit global   64 bit global
                time [ms]   thread     branches    per block [bytes]   load instr.     store instr.
3DBase          0.62        25         12016       –                   14002632        2000376
3DShMOverL      1.13        19         20811       4096                4741632         2000376
3DShMNoOverL    0.73        21         12694       8000                3524851         2000376
2DBase          0.63        25         94          –                   14002632        2000376
2DReg           0.58        25         94          –                   10033632        2000376
2DShM           0.59        25         94          800                 6953688         2000376
2DShMReg        0.48        25         94          640                 2984688         2000376
IV. CONCLUSIONS

In this paper, we have presented performance studies for 3D stencil based algorithms on recent NVIDIA GPUs. To our knowledge this is the first study to evaluate different implementation and optimization strategies for double precision computations. The increased accuracy obtained for double precision is required in scientific computations, which represent the main area of application for 3D stencil based algorithms.

For the analysis we have used Fermi and Kepler architecture based cards, which represent the last two released GPU architectures from NVIDIA. Besides the shift in L1 cache usage from Fermi to Kepler, other important minor and major changes in the hardware configuration have been performed [4].

Hence, starting from two different baseline implementations (based on 3D and 2D thread block structures), we have applied different optimization strategies which have led to different performance changes for the Fermi and Kepler cards. Overall, the GTX680 GPU card (Kepler architecture) performed best with a kernel based on a 2D thread block structure using both shared memory and registers (2DShMReg), whereas for the GTX480 card (Fermi architecture) the 2D kernel which does not use shared memory (2DReg) performed best.

Finally, for the Kepler architecture we have evaluated the performance for a GPU designed for desktop PCs (GTX680) and for a GPU designed for notebook PCs (GTX660M). The results have indicated that the ratio of execution time is roughly equal to the inverse of the ratio of power consumption.
ACKNOWLEDGMENT
This work is supported by the program Partnerships in Priority Domains (PN II), financed by ANCS, CNDI – UEFISCDI.

This paper is supported by the Sectoral Operational Programme Human Resources Development (SOP HRD), financed by the Romanian Government.

The research leading to these results has received funding from the European Union's Seventh Framework Programme.

We hereby acknowledge the structural funds project PRO-DD (POS-CCE, O.2.2.1., ID 123, SMIS 2637, ctr. No 11/2009) for providing the infrastructure used in this work.
REFERENCES

[1] D. Kirk, and W. M. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2010.
[2] NVIDIA Corporation, "CUDA, Compute unified device architecture programming guide".
[4] NVIDIA Corporation, "NVIDIA Kepler GK110 Architecture Whitepaper", 2012.
[6] T. Shimokawabe, T. Aoki, T. Takaki, T. Endo, A. Yamanaka, N. Maruyama, A. Nukada, and S. Matsuoka, "Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer", Intern. Conf. for High Performance Computing, Networking, Storage and Analysis, pp. 13-18, 2011.
[7] G. C. Fox, "Concurrent processing for scientific calculations", IEEE Computer Society International Conference, pp. 70-73, 1984.
[8] D. Reed, L. Adams, and M. Patrick, "Stencils and problem partitionings: Their influence on the performance of multiple processor systems", IEEE Trans. Comput., vol. 7, pp. 845-858, 1987.
[9] P. Micikevicius, "3D finite difference computation on GPUs using CUDA", Workshop on General Purpose Processing on Graphics Processing Units, pp. 79-84, 2009.
[10] L. M. Itu, C. Suciu, F. Moldoveanu, and A. Postelnicu, "GPU optimized computation of stencil based algorithms", RoEduNet Intern. Conf.
[11] T. Grosser, A. Cohen, P. H. J. Kelly, J. Ramanujam, P. Sadayappan, and S. Verdoolaege, "Split tiling for GPUs: automatic parallelization using trapezoidal tiles", Workshop on General Purpose Processing on Graphics Processing Units, 2013.
[12] J. Holewinski, L. N. Pouchet, and P. Sadayappan, "High-performance code generation for stencil computations on GPU architectures", ACM Intern. Conf. on Supercomputing, pp. 311-320, 2012.
[14] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick, "Stencil computation optimization and autotuning on state-of-the-art multicore architectures", ACM/IEEE Conf. on Supercomputing, pp. 1-12, Nov. 2008.
[15] C. Niţă, L. M. Itu, and C. Suciu, "GPU Accelerated Blood Flow Computation".
[16] P. Zaspel, and M. Griebel, "Solving incompressible two-phase flows on multi-GPU clusters", Computers & Fluids, vol. 80, pp. 356-364, 2012.
[17] N. Maruyama, and T. Aoki, "Optimizing stencil computations for NVIDIA Kepler GPUs", Intern. Workshop on High-Performance Stencil Computations, pp. 1-7, Jan. 2014.
