COMP Distributed
Introduction
References
NVIDIA GPU Educators Program https://developer.nvidia.com/educators
NVIDIA's Academic Programs
https://developer.nvidia.com/academia
The contents of these short-course slides are mainly copied from the following book and its accompanying teaching materials:
David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, 2nd edition, Morgan Kaufmann, 2013
Review CUDA Execution Model
Heterogeneous application: host (CPU) + device (GPU) C program
Serial parts in host C code
Parallel parts in device kernel code
SIMD and multithreading (Single Instruction, Multiple Threads, or SIMT)
Serial Code (host)
Parallel Kernel (device) KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
Parallel Kernel (device) KernelB<<< nBlk, nTid >>>(args);
Review Typical Structure of a CUDA Program
Kernel function
__global__ void kernelOne(args){}
Allocate memory space on the device: cudaMalloc(&d_GlblVarPtr, bytes)
Transfer data from host to device: cudaMemcpy(d_GlblVarPtr, h_GlblVarPtr, bytes, cudaMemcpyHostToDevice)
Kernel call: kernelOne<<<nBlk, nTid>>>(args)
Transfer results from device to host: cudaMemcpy(h_GlblVarPtr, d_GlblVarPtr, bytes, cudaMemcpyDeviceToHost), if needed
Optional: compare against a golden (host-computed) solution, as in the sketch below
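A minimal sketch of this structure as a complete program, using a vector-add kernel; the names (vecAdd, h_A, d_A, etc.) are illustrative, not from the slides:

// vector_add.cu: minimal sketch of the typical CUDA program structure
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) C[i] = A[i] + B[i];                   // boundary check
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // host allocations and initialization
    float *h_A = (float*)malloc(bytes), *h_B = (float*)malloc(bytes), *h_C = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    // allocate memory space on the device
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes); cudaMalloc(&d_B, bytes); cudaMalloc(&d_C, bytes);

    // transfer data from host to device
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // kernel call with nBlk blocks of nTid threads
    int nTid = 256, nBlk = (n + nTid - 1) / nTid;
    vecAdd<<<nBlk, nTid>>>(d_A, d_B, d_C, n);

    // transfer results from device to host
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    // optional: compare against golden (host-computed) solution
    printf("C[0] = %f (expected 3.0)\n", h_C[0]);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}

Compile with nvcc, e.g. nvcc vector_add.cu -o vector_add.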
Review Thread Grid, Blocks and Warps
A CUDA kernel is executed by a grid (array) of threads
All threads in a grid run the same kernel code (Single Program Multiple Data)
The thread array is divided into multiple blocks, which are distributed to different SMs (8-12 blocks/SM)
Multiple dimensional blocks/grid and multiple dimensional threads/block
Threads in different blocks do not interact
Thread indexes to compute memory addresses and make control decisions (i = blockIdx.x * blockDim.x + threadIdx.x)
Each block is executed as 32-thread warps; warps are the scheduling units in an SM
Threads in a warp execute in SIMD
Grid size is application dependent
Block size is machine dependent
Control divergence: threads in a warp take different control flow paths which are serialized
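A sketch of index computation with multi-dimensional blocks and grids; the kernel name scale2D and the 16x16 block shape are assumptions for illustration:

// Sketch: 2D grid/block indexing over a rows x cols matrix (illustrative names)
__global__ void scale2D(float *mat, int rows, int cols, float alpha) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y index
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x index
    if (row < rows && col < cols)                      // guard against out-of-range threads
        mat[row * cols + col] *= alpha;                // row-major: neighboring threads touch neighboring columns
}

void launchScale2D(float *d_mat, int rows, int cols) {
    dim3 block(16, 16);                                // 256 threads per block = 8 warps of 32
    dim3 grid((cols + block.x - 1) / block.x,
              (rows + block.y - 1) / block.y);         // grid size follows the data size
    scale2D<<<grid, block>>>(d_mat, rows, cols, 2.0f);
}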
Review CUDA Memories
Registers (per-thread)
Shared memory (per-block)
Global memory (all threads)
Memory coalescing: locality across threads for one instruction
Shared memory is roughly 100x faster than global memory: load data into shared memory once, then reuse it many times
[Figure: CUDA device memory model. Each thread, e.g. Thread (0, 0) and Thread (1, 0), has its own registers; each block, e.g. Block (0, 0) and Block (1, 0), has its own shared memory; all threads access global memory and constant memory.]
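A minimal shared-memory sketch: a 1D stencil that loads a tile into shared memory once and reuses each element many times. TILE, RADIUS, and the kernel name are assumptions, and the block size must equal TILE:

// Sketch: 1D stencil using per-block shared memory (launch with blockDim.x == TILE)
#define RADIUS 3
#define TILE   256

__global__ void stencil1D(const float *in, float *out, int n) {
    __shared__ float tile[TILE + 2 * RADIUS];          // per-block shared memory
    int g = blockIdx.x * blockDim.x + threadIdx.x;     // global index
    int l = threadIdx.x + RADIUS;                      // local index inside the tile

    tile[l] = (g < n) ? in[g] : 0.0f;                  // coalesced load into shared memory
    if (threadIdx.x < RADIUS) {                        // load halo elements
        tile[l - RADIUS] = (g >= RADIUS) ? in[g - RADIUS] : 0.0f;
        tile[l + TILE]   = (g + TILE < n) ? in[g + TILE] : 0.0f;
    }
    __syncthreads();                                   // wait until the whole tile is filled

    if (g < n) {
        float sum = 0.0f;
        for (int off = -RADIUS; off <= RADIUS; ++off)  // each loaded element is reused 2*RADIUS+1 times
            sum += tile[l + off];
        out[g] = sum;
    }
}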
Review Synchronization
Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
Barrier Synchronization __syncthreads()
Avoid data race with atomic operations
e.g., int atomicAdd(int* address, int val);
Privatization: perform atomic operations on a per-block private copy in shared memory, then combine the partial results across blocks with atomics on global memory, as sketched below
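A minimal privatization sketch, assuming a 256-bin histogram of byte values (names and sizes are illustrative):

// Sketch: privatized histogram with per-block shared-memory bins
#define NUM_BINS 256

__global__ void histogram(const unsigned char *data, int n, unsigned int *globalBins) {
    __shared__ unsigned int localBins[NUM_BINS];       // per-block private copy

    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        localBins[b] = 0;                              // initialize private bins
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        atomicAdd(&localBins[data[i]], 1u);            // cheap atomics on shared memory
    __syncthreads();

    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&globalBins[b], localBins[b]);       // one global atomic per bin per block
}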
Not Covered
Asynchronous memory copy between host and device
Multiple Streams
Multiple devices
Distributed-memory cluster with multiple GPUs
Parallel Algorithm Design and Analysis
The most difficult task
Unfortunately, there are no simple recipes
Requires a sort of integrative thought that is commonly referred to as creativity
Needs experience
In general, we need to consider
Fine grained parallelism
Memory coalescing
Effective use of shared memory
Control divergence
Synchronization overhead
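To illustrate control divergence, a sketch contrasting a branch that splits threads within a warp with one that branches at whole-warp granularity (kernel names are illustrative; assumes blockDim.x is a multiple of 32):

// Divergent: even and odd lanes of the SAME warp take different paths, which are serialized
__global__ void divergent(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        x[i] = x[i] * 2.0f;
    else
        x[i] = x[i] + 1.0f;
}

// Warp-aligned: all 32 threads of a warp branch the same way, so no divergence
__global__ void warpAligned(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0)
        x[i] = x[i] * 2.0f;
    else
        x[i] = x[i] + 1.0f;
}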
Developer Tools Debuggers
Nsight, CUDA-GDB, CUDA-MEMCHECK (NVIDIA provided)
https://developer.nvidia.com/debugging-solutions
Developer Tools Profilers
NVVP, nvprof (NVIDIA provided)
VampirTrace (third party)
https://developer.nvidia.com/performance-analysis-tools
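Typical invocations, assuming an executable built from a file named vector_add.cu (the name is illustrative):

nvcc -G -o vector_add vector_add.cu   # -G generates device debug info for cuda-gdb
cuda-memcheck ./vector_add            # reports out-of-bounds and misaligned memory accesses
nvprof ./vector_add                   # per-kernel timings and memory-transfer summary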
Ways to Accelerate Applications
Three approaches for accelerating existing applications:
GPU Accelerated Libraries: easy to use, most performance
Compiler Directives: easy to use, portable code
Programming Languages: most performance, most flexibility
GPU Accelerated Libraries
Linear Algebra (FFT, BLAS, SPARSE, Matrix): NVIDIA cuFFT, cuBLAS, cuSPARSE
Numerical & Math (RAND, Statistics): NVIDIA Math Lib, NVIDIA cuRAND
Data Struct. & AI (Sort, Scan, Zero Sum): GPU AI Board Games, GPU AI Path Finding
Visual Processing (Image & Video): NVIDIA NPP, NVIDIA Video Encode
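As an illustration of drop-in library use, a minimal cuBLAS sketch (the wrapper name saxpy_on_gpu and the host buffers are hypothetical; error checks are omitted):

// Sketch: y = alpha*x + y on the GPU via cuBLAS SAXPY
#include <cublas_v2.h>
#include <cuda_runtime.h>

void saxpy_on_gpu(int n, float alpha, const float *h_x, float *h_y) {
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);                           // the library manages its own kernels
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // no user-written kernel needed
    cublasDestroy(handle);

    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x); cudaFree(d_y);
}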
Compiler Directives: Easy, Portable Acceleration
Ease of use: Compiler takes care of details of parallelism management and data movement
Portable: The code is generic, not specific to any type of hardware and can be deployed into multiple languages
Uncertain: Performance of code can vary across compiler versions
Compiler directives for C, C++, and FORTRAN (e.g., OpenACC):
#pragma acc parallel loop copyin(input1[0:inputLength], input2[0:inputLength]), copyout(output[0:inputLength])
for (i = 0; i < inputLength; ++i) { output[i] = input1[i] + input2[i]; }
Programming Languages: Most Performance and Flexible Acceleration
Performance: The programmer has the best control of parallelism and data movement
Flexible: The computation does not need to fit into a limited set of library patterns or directive types
Verbose: The programmer often needs to express more details