The purpose of this assignment is to familiarize yourself with CUDA programming.
Get the source code:
$ wget https://nycu-sslab.github.io/PP-f21/HW5/HW5.zip
$ unzip HW5.zip -d HW5
$ cd HW5
1. Problem Statement: Parallelizing Fractal Generation with CUDA
Following part 2 of HW2, we are going to parallelize fractal generation by using CUDA.
Build and run the code in the HW5 directory of the code base. (Type `make` to build, and `./mandelbrot` to run it; `./mandelbrot --help` displays the usage information.)
The following paragraphs are quoted from part 2 of HW2.
This program produces the image file `mandelbrot-test.ppm`, which is a visualization of a famous set of complex numbers called the Mandelbrot set. [Most platforms have a `.ppm` viewer. For example, to view the resulting images, use the `tiv` command (already installed) to display them on the terminal.]

As you can see in the images below, the result is a familiar and beautiful fractal. Each pixel in the image corresponds to a value in the complex plane, and the brightness of each pixel is proportional to the computational cost of determining whether that value is contained in the Mandelbrot set. To get image 2, use the command option `--view 2`. You can learn more about the definition of the Mandelbrot set.
Your job is to parallelize the computation of the images using CUDA. Starter code that spawns CUDA threads is provided in the function `hostFE()`, located in `kernel.cu`. This function is the host front-end that allocates memory and launches a GPU kernel. Currently, `hostFE()` does no computation and returns immediately. You should add code to `hostFE()` and finish `mandelKernel()` to accomplish this task.
The kernel will, of course, be implemented based on `mandel()` in `mandelbrotSerial.cpp`, which is shown below. You may want to customize it for your kernel implementation.
```
int mandel(float c_re, float c_im, int maxIteration) {
    float z_re = c_re, z_im = c_im;
    int i;
    for (i = 0; i < maxIteration; ++i) {
        if (z_re * z_re + z_im * z_im > 4.f)
            break;
        float new_re = z_re * z_re - z_im * z_im;
        float new_im = 2.f * z_re * z_im;
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }
    return i;
}
```
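For orientation only, here is a minimal sketch (not the reference solution) of how `mandel()` might be turned into a `__device__` function and called from a per-pixel kernel in the Method 1 style (dense, `cudaMalloc`'d buffer). The parameter names `d_img`, `lowerX`/`lowerY`, and `stepX`/`stepY` are assumptions for illustration, not the starter code's actual signature:

```
// Sketch only: each thread computes the iteration count for one pixel.
__device__ int mandelDev(float c_re, float c_im, int maxIteration) {
    float z_re = c_re, z_im = c_im;
    int i;
    for (i = 0; i < maxIteration; ++i) {
        if (z_re * z_re + z_im * z_im > 4.f)
            break;
        float new_re = z_re * z_re - z_im * z_im;
        float new_im = 2.f * z_re * z_im;
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }
    return i;
}

__global__ void mandelKernel(int *d_img, float lowerX, float lowerY,
                             float stepX, float stepY,
                             int width, int height, int maxIteration) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)   // guard threads outside the image
        return;
    float c_re = lowerX + x * stepX; // map the pixel to the complex plane
    float c_im = lowerY + y * stepY;
    d_img[y * width + x] = mandelDev(c_re, c_im, maxIteration);
}
```

A typical launch from `hostFE()` would use a 2-D grid, e.g. `dim3 block(16, 16); dim3 grid((width + 15) / 16, (height + 15) / 16);`, followed by a `cudaMemcpy` of the result back into the host buffer.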
2. Requirements
- You will modify only `kernel.cu` and use it as the template.
- You need to implement three approaches to solve the questions:
  - Method 1: Each CUDA thread processes one pixel. Use `malloc` to allocate the host memory, and use `cudaMalloc` to allocate GPU memory. Name the file `kernel1.cu`. (Note that you are not allowed to use the image input as the host memory directly.)
  - Method 2: Each CUDA thread processes one pixel. Use `cudaHostAlloc` to allocate the host memory, and use `cudaMallocPitch` to allocate GPU memory. Name the file `kernel2.cu`. (A sketch of this allocation style follows the list.)
  - Method 3: Each CUDA thread processes a group of pixels. Use `cudaHostAlloc` to allocate the host memory, and use `cudaMallocPitch` to allocate GPU memory. You can try different group sizes. Name the file `kernel3.cu`.
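Methods 2 and 3 differ from Method 1 mainly in how memory is allocated and copied back. The following is a minimal sketch of that allocation style, assuming one `int` per pixel; the variable names are illustrative, not taken from the starter code:

```
// Sketch only: pinned host memory plus pitched device memory.
void allocateComputeFetch(int width, int height) {
    size_t rowBytes = width * sizeof(int);

    // cudaHostAlloc returns page-locked (pinned) host memory,
    // which transfers faster than malloc'd pageable memory.
    int *h_img;
    cudaHostAlloc((void **)&h_img, rowBytes * height, cudaHostAllocDefault);

    // cudaMallocPitch pads each row so rows start at aligned addresses;
    // pitch is the padded byte width of one row.
    int *d_img;
    size_t pitch;
    cudaMallocPitch((void **)&d_img, &pitch, rowBytes, height);

    // ... launch the kernel here, passing pitch so row y can be
    //     addressed as (int *)((char *)d_img + y * pitch) ...

    // Copy the pitched 2-D region back into the dense host buffer.
    cudaMemcpy2D(h_img, rowBytes, d_img, pitch, rowBytes, height,
                 cudaMemcpyDeviceToHost);

    cudaFree(d_img);
    cudaFreeHost(h_img);
}
```

For Method 3, each thread would loop over its group of pixels (for example, a small row segment) instead of computing a single one.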
- Q1: What are the pros and cons of the three methods? Make an assumption about their relative performance.
- Q2: How do the three methods perform? Plot a chart to show the differences among them:
  - for VIEW 1 and VIEW 2, and
  - for different `maxIteration` values (1000, 10000, and 100000).

  You may want to measure the running time via the `nvprof` command to get a comprehensive view of performance (a CUDA-event timing sketch is also given after these questions).
- Q3: Explain the performance differences thoroughly based on your experimental results. Do the results match your assumption? Why or why not?
- Q4: Can we do even better? Think of a better approach and explain it. Implement your method in `kernel4.cu`.
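Besides `nvprof`, you can bracket the kernel launch with CUDA events to time it from inside `hostFE()`. A minimal sketch; `grid`, `block`, and the kernel arguments come from your own implementation:

```
// Sketch only: measure kernel time with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
mandelKernel<<<grid, block>>>(/* your kernel arguments */);
cudaEventRecord(stop);
cudaEventSynchronize(stop);              // wait for the kernel to finish

float ms = 0.f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```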
Answer the questions (marked with Q1-Q4) in a REPORT using HackMD. Notice that in this assignment a higher standard will be applied when grading the quality of your report.