
CUDA: Introduction
Christian Trefftz / Greg Wolffe, Grand Valley State University
Supercomputing 2008 Education Program
(modifications by Jernej Barbic, 2008-2019)

Terms
What is GPGPU?
- General-Purpose computing on a Graphics Processing Unit
- Using graphics hardware for non-graphics computations
What is CUDA?
- Compute Unified Device Architecture
- Parallel computing platform and API by Nvidia
- Introduced in 2007; still actively updated
- Software architecture for managing data-parallel programming

Motivation
[three figure slides omitted]

CPU vs. GPU
CPU
- Fast caches
- Branching adaptability
- High performance
GPU
- Multiple ALUs
- Fast onboard memory
- High throughput on parallel tasks
  - Executes a program on each fragment/vertex
CPUs are great for task parallelism.
GPUs are great for data parallelism.

CPU vs. GPU Hardware
[figure omitted: CPU vs. GPU die layouts]
- The GPU devotes more transistors to data processing

Traditional Graphics Pipeline
Vertex processing → Rasterizer → Fragment processing → Renderer (textures)

Pixel / Thread Processing
[figure omitted]

GPU Architecture
[figure omitted]

Processing Element
Processing element = thread processor

GPU Memory Architecture
Uncached:
- Registers
- Shared memory
- Local memory
- Global memory
Cached:
- Constant memory
- Texture memory
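These memory spaces map directly onto CUDA source qualifiers. A minimal sketch (the names memory_demo, coeffs_c, table_d, and in_d are illustrative, and the kernel assumes a block of 32 threads):

__constant__ float coeffs_c[16];    /* constant memory: cached, read-only in kernels */
__device__ float table_d[1024];     /* global memory */

__global__ void memory_demo(float *in_d)
{
    int i = threadIdx.x;            /* scalar locals live in registers */
    __shared__ float tile_s[32];    /* shared memory: fast, on-chip, per-block */
    tile_s[i] = in_d[i] * coeffs_c[0];
    __syncthreads();                /* make shared-memory writes visible to the block */
    table_d[i] = tile_s[i];
}

Local memory is not declared explicitly; the compiler spills a thread's data there when it does not fit in registers.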

Data-parallel Programming
- Think of the GPU as a massively-threaded co-processor
- Write "kernel" functions that execute on the device, processing multiple data elements in parallel
- Keep it busy! → massive threading
- Keep your data close! → local memory

Hardware Requirements
- CUDA-capable video card
- Power supply
- Cooling
- PCI-Express

A Gentle Introduction to CUDA Programming

Credits
The code used in this presentation is based on code available in:
- the Tutorial on CUDA in Dr. Dobb's Journal
- Andrew Bellenir's code for matrix multiplication
- Igor Majdandzic's code for Voronoi diagrams
- NVIDIA's CUDA programming guide

Software Requirements/Tools
- CUDA device driver
- CUDA Toolkit (compiler, CUBLAS, CUFFT)
- CUDA Software Development Kit
  - Emulator
  - Profiling: occupancy calculator, visual profiler

To compute, we need to:
- Allocate memory for the computation on the GPU (incl. variables)
- Provide input data
- Specify the computation to be performed
- Read the results from the GPU (output)
(A condensed code sketch of this sequence follows the diagrams below.)

Initially:
[diagram: "array" lives in the host's (CPU) memory; the GPU card's memory is empty]

Allocate memory in the GPU card:
[diagram: "array" in the host's memory; "array_d" allocated in the GPU card's memory]

Copy content from the host's memory to the GPU card's memory:
[diagram: contents of "array" copied into "array_d"]

Execute code on the GPU:
[diagram: the GPU's multiprocessors (MPs) operate on "array_d"]

Copy results back to the host's memory:
[diagram: contents of "array_d" copied back into "array"]
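A condensed host-side sketch of the steps pictured above, using the names from the fillArray example shown later (dimgrid and dimblock are set up there):

int *array_d;
size_t nbytes = sizeof(int) * SIZEOFARRAY;
cudaMalloc((void **)&array_d, nbytes);                      /* allocate in the GPU card's memory */
cudaMemcpy(array_d, array, nbytes, cudaMemcpyHostToDevice); /* host memory -> GPU memory */
cu_fillArray<<<dimgrid, dimblock>>>(array_d);               /* execute the kernel on the GPU MPs */
cudaMemcpy(array, array_d, nbytes, cudaMemcpyDeviceToHost); /* GPU memory -> host memory */
cudaFree(array_d);                                          /* release the GPU memory */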

The Kernel
- The code to be executed by the stream processors on the GPU
- Executed simultaneously in several (perhaps all) stream processors on the GPU
- How is every instance of the kernel going to know which piece of data it is working on?
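The next slides build up the answer; as a preview, each instance combines its built-in block and thread indices into a unique global index (my_kernel and data_d are illustrative names):

__global__ void my_kernel(int *data_d)
{
    /* unique global index for this kernel instance */
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    data_d[x] = x;    /* this instance works on element x */
}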

Grid and Block Size
- Grid size: the number of blocks
  - Can be a 1- or 2-dimensional array of blocks
- Each block is divided into threads
  - Can be a 1-, 2-, or 3-dimensional array of threads
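In code, both sizes are expressed with the dim3 type and passed between <<< and >>> at kernel launch. A sketch with illustrative dimensions (my_kernel and data_d as in the preview above):

dim3 dimgrid(16, 16);                       /* 2-dimensional grid: 16 x 16 = 256 blocks */
dim3 dimblock(64);                          /* 1-dimensional block: 64 threads */
my_kernel<<<dimgrid, dimblock>>>(data_d);   /* 256 blocks x 64 threads = 16384 instances */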

Let's look at a very simple example
- The code has been divided into two files:
  - simple.c
  - simple.cu
- simple.c is ordinary code in C
- It allocates an array of integers, initializes it to values corresponding to the indices in the array, and prints the array.
- It calls a function that modifies the array
- The array is printed again.

simple.c
#include <stdio.h>
#define SIZEOFARRAY 64
extern void fillArray(int *a, int size);
/* The main program */
int main(int argc, char *argv[])
{
  /* Declare the array that will be modified by the GPU */
  int a[SIZEOFARRAY];
  int i;
  /* Initialize the array to 0s */
  for (i = 0; i < SIZEOFARRAY; i++) {
    a[i] = 0;
  }
  /* Print the initial array */
  printf("Initial state of the array:\n");
  for (i = 0; i < SIZEOFARRAY; i++) {
    printf("%d ", a[i]);
  }
  printf("\n");
  /* Call the function that will in turn call the function
     in the GPU that will fill the array */
  fillArray(a, SIZEOFARRAY);
  /* Now print the array after calling fillArray */
  printf("Final state of the array:\n");
  for (i = 0; i < SIZEOFARRAY; i++) {
    printf("%d ", a[i]);
  }
  printf("\n");
  return 0;
}

simple.cu
- simple.cu contains two functions
- fillArray(): a function that will be executed on the host and which takes care of:
  - Allocating variables in the global GPU memory
  - Copying the array from the host to the GPU memory
  - Setting the grid and block sizes
  - Invoking the kernel that is executed on the GPU
  - Copying the values back to the host memory
  - Freeing the GPU memory

fillArray (part 1)
#define BLOCK_SIZE 32
extern "C" void fillArray(int *array, int arraySize)
{
  int *array_d;
  cudaError_t result;
  /* cudaMalloc allocates space in GPU memory */
  result = cudaMalloc((void **)&array_d, sizeof(int) * arraySize);
  /* copy the CPU array into the GPU array_d */
  result = cudaMemcpy(array_d, array, sizeof(int) * arraySize,
                      cudaMemcpyHostToDevice);

fillArray (part 2)
  /* Indicate block size */
  dim3 dimblock(BLOCK_SIZE);
  /* Indicate grid size */
  dim3 dimgrid(arraySize / BLOCK_SIZE);
  /* Call the kernel */
  cu_fillArray<<<dimgrid, dimblock>>>(array_d);
  /* Copy the results from GPU back to CPU memory */
  result = cudaMemcpy(array, array_d, sizeof(int) * arraySize,
                      cudaMemcpyDeviceToHost);
  /* Release the GPU memory */
  cudaFree(array_d);
}
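fillArray stores each CUDA return code in result but never inspects it. In real code every call should be checked; a sketch of the usual pattern, not part of the original example (cudaGetErrorString is the standard runtime call for readable messages):

result = cudaMalloc((void **)&array_d, sizeof(int) * arraySize);
if (result != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(result));
    exit(1);
}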

simple.cu (cont.)
- The other function in simple.cu is cu_fillArray():
  - This is the GPU kernel
  - Identified by the keyword: __global__
  - Built-in variables:
    - blockIdx.x: block index within the grid
    - threadIdx.x: thread index within the block

cu_fillArray
__global__ void cu_fillArray(int *array_d)
{
  /* compute this thread's global index */
  int x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
  array_d[x] = x;
}

__global__ void cu_addIntegers(int *array_d1, int *array_d2)
{
  /* each thread adds one pair of elements */
  int x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
  array_d1[x] += array_d2[x];
}
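Note that both kernels index array_d[x] unconditionally, so they are only safe when the array length is an exact multiple of BLOCK_SIZE (here 64 = 2 x 32). The SAXPY kernel later in the deck shows the usual bounds guard for sizes that do not divide evenly.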

To compile:
nvcc simple.c simple.cu -o simple
- The compiler generates the code for both the host and the GPU
- Demo on cuda.littlefe.net
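If the demo behaves as the code suggests, a run should print the array before and after the GPU fills each element with its own index (output abridged):

$ ./simple
Initial state of the array:
0 0 0 0 ...
Final state of the array:
0 1 2 3 ...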

In the GPU:
[diagram: array elements assigned to thread processors; Threads 0-3 of Block 0 handle the first four elements, Threads 0-3 of Block 1 the next four]

Another Example: SAXPY
- SAXPY (Scalar Alpha X Plus Y): y = alpha * x + y
- A common operation in linear algebra
- CUDA: one loop iteration maps to one thread

Traditional Sequential Code
void saxpy_serial(int n, float alpha,
                  float *x,
                  float *y)
{
  /* one loop iteration per element */
  for (int i = 0; i < n; i++)
    y[i] = alpha * x[i] + y[i];
}

CUDA Code
__global__ void saxpy_parallel(int n, float alpha,
                               float *x,
                               float *y)
{
  /* one thread per element */
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = alpha * x[i] + y[i];
}
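The slides omit the host-side launch of saxpy_parallel. A sketch under common conventions (256 threads per block is an assumed choice; x_d and y_d are assumed device pointers already filled via cudaMemcpy). The rounded-up block count is exactly why the kernel needs the if (i < n) guard:

int nblocks = (n + 255) / 256;   /* round up so all n elements are covered */
saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, x_d, y_d);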
