1. Objective
Implement an exclusive scan program on a 2D array.
The scan operator will be addition (plus), so in other words you are computing a Summed-Area Table.
You should try to implement the work-efficient kernel (Blelloch) shown in the lectures, but it is fine if you implement the naive scan (Hillis and Steele). You will get 80% for correct outputs and 20% for good optimization.
One thing to be aware of is that most implementations only handle arrays the same size as a block; however, your kernel should be able to handle input arrays of arbitrary dimensions. More instructions about this are below.
This lab will be submitted as one zipped file through eclass. Details for submission are at the end of the lab.
2. Instructions
First, you need to implement a 1D scan kernel; from this 1D case you will be able to extend to the 2D case.
2.1. Handling 1D list with arbitrary length:
First, make sure you carefully read the following NVIDIA article, especially Examples 39-1 and 39-2:
http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
These examples show you how to implement a good 1D prefix sum kernel that handles lists whose length is exactly equal to the block size (or block size * 2, depending on the implementation), which is a good place to start. With a small extension, the kernel from the article can be made to work on arrays with length less than the block size as well as greater than the block size.
To make the kernel work on arrays with length less than the block size, you just need to add two things to your kernel (see the sketch after this list):
- When loading the input into shared memory, check if the index is out of bounds (greater than the length of the array minus 1). If it is out of bounds, simply load a 0 into shared memory. This is like padding the 1D list with zeros at the end.
- When you write the output, also check if the index is out of bounds. If it is, do not write anything.
After this step, you have a kernel that can handle lists with length less than or equal to the block size (or block size * 2).
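As a concrete illustration, here is a minimal sketch of the guarded load and store, assuming the kernel signature introduced later in this section, a shared-memory array temp, and one element per thread (your indexing will differ if each thread handles two elements):

__global__ void scan(float *g_odata, float *g_idata, float *g_aux, int len)
{
    __shared__ float temp[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Guarded load: out-of-bounds threads load 0, padding the list with zeros
    temp[threadIdx.x] = (i < len) ? g_idata[i] : 0.0f;
    __syncthreads();

    // ... perform the scan on temp here ...

    // Guarded store: out-of-bounds threads write nothing
    if (i < len)
        g_odata[i] = temp[threadIdx.x];
}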
To make the kernel work on arrays with length greater than the block size, let's walk through an example. Assume blocks of size 4 and the following input array of length 12. (The example below shows an inclusive scan; we are doing an exclusive scan, but the logic is the very same except that the output is shifted by 1 to the right.)
X0 | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 |
For an inclusive scan, we want the following:
X0 | (X0..X1) | (X0..X2) | (X0..X3) | (X0..X4) | (X0..X5) | (X0..X6) | (X0..X7) | (X0..X8) | (X0..X9) | (X0..X10) | (X0..X11) |
Let's say we fix our block size to a constant and divide the bigger list into smaller lists of equal size, each handled by one block. Each block will do a scan on its smaller list (of length block size, which is 4 in this example, or block size * 2 in the two-elements-per-thread variant).
After we scan each smaller list, we will have:
Block 0:
X0 | (X0..X1) | (X0..X2) | (X0..X3) |
Block 1:
X4 | (X4..X5) | (X4..X6) | (X4..X7) |
Block 2:
X8 | (X8..X9) | (X8..X10) | (X8..X11) |
Now, if we add 0 to all the output of block 0, (X0..X3) to all the output of block 1, and (X0..X3) + (X4..X7) to all the output of block 2, we will have:
Block 1:
(X0..X4) | (X0..X5) | (X0..X6) | (X0..X7) |
Block 2:
(X0..X8) | (X0..X9) | (X0..X10) | (X0..X11) |
This is exactly what we wanted.
So the trick is to have an additional array that contains, for each block, the sum of all elements that come before that block. How do we get this? From the above example, the block sums are (X0..X3), (X4..X7) and (X8..X11), and the per-block offsets we want are 0, (X0..X3) and (X0..X3) + (X4..X7). This is exactly what a scan over the list of block sums gives us.
This brings us to the idea in Section 39.2.4, Arrays of Arbitrary Size, of http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html :
Before zeroing the last element of block i (the block of code labeled B in Listing 39-2), we store the value (the total sum of block i) to an auxiliary array SUMS. We then scan SUMS in the same manner, writing the result to an array INCR. We then add INCR[i] to all elements of block i using a simple uniform add kernel invoked on N/B thread blocks of B/2 threads each. (Note that the above is quoted from the article; its exact block and thread counts do not apply directly to our example.)
Coming back to the example: we allocate an array whose length equals the number of blocks (in this case 3). Each block then stores its total sum in the aux array. If the entire array fits into a single block, we don't need the aux array and can pass NULL as the argument. To use the aux array, we add the following lines to the kernel code:
if (aux && threadIdx.x == 0)
    aux[blockIdx.x] = temp[BLOCK_SIZE - 1]; // where temp is the shared memory array
Note that BLOCK_SIZE - 1 is supposed to point to the last element in the block, which holds the block's total sum, but this might be different for your implementation. For example, in the second block aux[blockIdx.x] will be (X4..X7). With the example above, the aux array will be:
(X0..X3) | (X4..X7) | (X8..X11) |
Now we can perform a scan on the aux array. In this case the aux array fits into one block, so we do not need to pass anything for its own aux argument. Scanning the aux array (inclusively, in this example) gives:
(X0..X3) | (X0..X7) | (X0..X11) |
Now we can perform a uniform add on the output array from the first scan (see the sketch after the blocks below):
Block 0:
X0 | (X0..X1) | (X0..X2) | (X0..X3) |
Block 1 + aux[0]:
(X0..X4) | (X0..X5) | (X0..X6) | (X0..X7) |
Block 2 + aux[1]:
(X0..X8) | (X0..X9) | (X0..X10) | (X0..X11) |
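A minimal sketch of such a uniform add kernel follows; the name uniform_add is ours. It assumes the aux array was scanned exclusively, so block i simply adds the element at index i. With the inclusive scan shown above, block i would instead add the element at index i - 1, and block 0 would add nothing.

__global__ void uniform_add(float *g_odata, float *g_incr, int len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Add the scanned sum of all preceding blocks to every element of this block
    if (i < len)
        g_odata[i] += g_incr[blockIdx.x];
}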
We need to change the signature of our scan kernel from the one in the CUDA scan article above:
__global__ void scan(float *g_odata, float *g_idata, int n)
The arguments are as follows:
- g_odata: output array, the same size as the input array
- g_idata: input array
- n: the number of elements, which in the article's single-block kernel equals the block size
To:
__global__ void scan(float *g_odata, float *g_idata, float *g_aux, int len)
Where:
- g_odata: output array, the same size as the input array
- g_idata: input array
- g_aux: an array the size of the number of blocks (we will see why later)
- len: the size of the input array
We removed n and instead predefine the block size:
#define BLOCK_SIZE 512
The one issue we must still solve is the case where the length of the aux array is itself larger than the block size (i.e. when the input length is greater than block size * block size, or (block size * 2) * (block size * 2) in the two-elements-per-thread variant). In that case, we need yet another aux array to add the sums from the previous blocks. A simple solution is to write a recursive function on the CPU as a wrapper around the actual scan kernel:
void recursive_scan(float *g_odata, float *g_idata, int len)
The pseudocode will look something like this:
1. FUNCTION recursive_scan(output, input, len)
2.     Calculate the number of blocks needed
3.     IF only one block is needed, perform scan by calling the kernel and exit function
4.     ELSE
5.         Allocate memory for aux array (contains the sum of elements of each block)
6.         Allocate memory for scanned aux array (contains the prefix sum of aux)
7.         Perform scan, passing in the arguments from the function and the aux array
8.         Call recursive_scan(scanned_aux, aux, num_of_blocks)
9.         Perform uniform addition on output with the scanned aux array
10.        Free aux array and scanned aux array
11.    ENDIF
In the pseudocode above, lines 7 and 9 (shown in red in the original handout) are CUDA kernel calls; line 7 launches the actual 1D scan kernel. A hedged sketch of this wrapper follows.
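This sketch assumes the scan kernel signature above and the uniform_add kernel sketched earlier, with one element per thread (adjust the block math if your kernel scans BLOCK_SIZE * 2 elements per block); error checking and synchronization are omitted for brevity:

void recursive_scan(float *g_odata, float *g_idata, int len)
{
    int num_blocks = (len + BLOCK_SIZE - 1) / BLOCK_SIZE;
    if (num_blocks == 1) {
        // The whole list fits in one block; no aux array is needed
        scan<<<1, BLOCK_SIZE>>>(g_odata, g_idata, NULL, len);
        return;
    }
    float *aux, *scanned_aux;
    cudaMalloc(&aux, num_blocks * sizeof(float));         // per-block sums
    cudaMalloc(&scanned_aux, num_blocks * sizeof(float)); // their prefix sums
    scan<<<num_blocks, BLOCK_SIZE>>>(g_odata, g_idata, aux, len);
    recursive_scan(scanned_aux, aux, num_blocks); // recurse on the block sums
    uniform_add<<<num_blocks, BLOCK_SIZE>>>(g_odata, scanned_aux, len);
    cudaFree(aux);
    cudaFree(scanned_aux);
}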
The exclusive scan is like the inclusive scan, except everything is shifted to the right by one. Using the example array above, the per-block scans give the following:
Block 0:
0 | X0 | (X0..X1) | (X0..X2) |
Block 1:
0 | X4 | (X4..X5) | (X4..X6) |
Block 2:
0 | X8 | (X8..X9) | (X8..X10) |
Now we need to add (X0..X2) + X3, i.e. (X0..X3), to Block 1 and (X0..X2) + X3 + (X4..X6) + X7, i.e. (X0..X7), to Block 2. Notice that these block totals are no longer present in the exclusive scan output. There is a simple trick for obtaining them if you implement the work-efficient kernel (Blelloch): after the Up-Sweep (Reduce) phase, the last element of each block holds exactly the value we need.
Block 0:
X0 | (X0..X1) | X2 | (X0..X3) |
Block 1:
X4 | (X4..X5) | X6 | (X4..X7) |
Block 2:
X8 | (X8..X9) | X10 | (X8..X11) |
So between the two phases, we save the last element to the aux array before zeroing it:

// Up-Sweep
...
if (threadIdx.x == 0) {
    if (aux)
        aux[blockIdx.x] = temp[BLOCK_SIZE - 1]; // save the last element
    temp[BLOCK_SIZE - 1] = 0; // clear the last element
}
...
// Down-Sweep
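Putting these pieces together, a minimal work-efficient (Blelloch) exclusive scan kernel might look like the sketch below. This is a simplified variant with one element per thread; the GPU Gems version processes two elements per thread and avoids shared-memory bank conflicts, both of which matter for the optimization marks.

__global__ void scan(float *g_odata, float *g_idata, float *g_aux, int len)
{
    __shared__ float temp[BLOCK_SIZE];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Guarded load, padding out-of-bounds elements with zeros
    temp[tid] = (i < len) ? g_idata[i] : 0.0f;

    // Up-Sweep (Reduce) phase: build partial sums in place
    int offset = 1;
    for (int d = BLOCK_SIZE >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset <<= 1;
    }

    // Save the block total to aux (if given), then clear the last element
    if (tid == 0) {
        if (g_aux)
            g_aux[blockIdx.x] = temp[BLOCK_SIZE - 1];
        temp[BLOCK_SIZE - 1] = 0.0f;
    }

    // Down-Sweep phase: distribute the partial sums back down the tree
    for (int d = 1; d < BLOCK_SIZE; d <<= 1) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            float t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();

    // Guarded store of the exclusive scan result
    if (i < len)
        g_odata[i] = temp[tid];
}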
2.2. Extend 1D to 2D scan:
Extending the 1D scan to a 2D scan is relatively easy. First, do a 1D scan on each row of the matrix. Then do another scan on the transpose of what you get from the previous step, and transpose back to get the final result. (See Section 39.3.2, Summed-Area Tables, of the CUDA scan page.)
For this lab, you need a transpose kernel. A good place to start is this post: https://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/ But any other correct implementation is also fine. A sketch is given below.
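For reference, a minimal tiled transpose in the spirit of that post might look like the following; the simplification to a full TILE_DIM x TILE_DIM thread block is ours (the post uses fewer rows per block and a loop for better performance):

#define TILE_DIM 32

// Transposes a (height x width) row-major matrix into a (width x height) one.
// Launch with grid((width + TILE_DIM - 1) / TILE_DIM, (height + TILE_DIM - 1) / TILE_DIM)
// and block(TILE_DIM, TILE_DIM).
__global__ void transpose(float *odata, const float *idata, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1]; // +1 padding avoids bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = idata[y * width + x];
    __syncthreads();

    // Write the tile back out with the block indices swapped
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)
        odata[y * height + x] = tile[threadIdx.x][threadIdx.y];
}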
The pseudocode for all the steps we mentioned:
1. for each row j in deviceInput array:
2.     recursive_scan(deviceTmpOutput[j,:], deviceInput[j,:], numInputColumns)
3. deviceOutput = transpose(deviceTmpOutput)
4. for each row j in deviceOutput array:
5.     recursive_scan(deviceTmpOutput[j,:], deviceOutput[j,:], numInputRows)
6. deviceOutput = transpose(deviceTmpOutput)
Note that after the first transpose each row holds numInputRows elements, which is why the second scan uses numInputRows rather than numInputColumns (the two are equal only for square inputs).
Note that in the above pseudocode we use another array called deviceTmpOutput to store the temporary output (there is a variable with the same name in the skeleton code), and the assignment deviceOutput = transpose(deviceTmpOutput) is not what the code will literally look like: the transpose should be a kernel that takes the input and output arrays in its argument list and returns nothing. The notation X[j,:] denotes row j of the 2D array X (the notation in your code will be different). A sketch follows.
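In real code, the row slicing is just pointer arithmetic into the flat device arrays. A hedged sketch of the host logic, where the array and size names follow the skeleton but gridT1, gridT2 and blockT are placeholder launch dimensions you must compute as noted in the transpose sketch above:

// Pass 1: exclusive scan along each row of the input
for (int j = 0; j < numInputRows; j++)
    recursive_scan(deviceTmpOutput + j * numInputColumns,
                   deviceInput + j * numInputColumns, numInputColumns);

// Transpose: the result is numInputColumns x numInputRows
transpose<<<gridT1, blockT>>>(deviceOutput, deviceTmpOutput, numInputColumns, numInputRows);
cudaDeviceSynchronize();

// Pass 2: scan along each row of the transposed array (rows now hold numInputRows elements)
for (int j = 0; j < numInputColumns; j++)
    recursive_scan(deviceTmpOutput + j * numInputRows,
                   deviceOutput + j * numInputRows, numInputRows);

// Transpose back to the original orientation
transpose<<<gridT2, blockT>>>(deviceOutput, deviceTmpOutput, numInputRows, numInputColumns);
cudaDeviceSynchronize();

Launching one scan per row is simple but slow; for the optimization marks you may want a kernel that scans many rows in a single launch.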
Important: after launching a kernel in the pseudocode, make sure you call:
wbCheck(cudaDeviceSynchronize());
Or
cudaDeviceSynchronize();
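wbCheck is a macro provided by the lab skeleton. If you are experimenting outside the skeleton, a hypothetical equivalent (named checkCuda here to avoid clashing with the skeleton's wbCheck) could be:

#include <cstdio>
#include <cstdlib>

// Hypothetical error-check macro: runs a CUDA call and aborts with a message on failure
#define checkCuda(stmt) do {                                          \
    cudaError_t err = (stmt);                                         \
    if (err != cudaSuccess) {                                         \
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err)); \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

Usage: checkCuda(cudaDeviceSynchronize());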
3. Local Setup Instructions
Steps:
- Download Lab8.zip.
- Unzip the file.
- Open the Visual Studio solution in Visual Studio 2013.
- Build the project. Note the project has two configurations:
  - Debug
  - Submission
  Make sure you have the Submission configuration selected when you finally submit.
- Run the program from within Visual Studio. Make sure the Debug configuration is selected.
4. Testing
To run all tests located in Dataset/Test, first build the project with the Submission configuration selected. Make sure you see the Submission folder and that it contains the executables.
To run the tests, double-click Testing_Script.bat. This will take a couple of seconds to run, and the terminal will close when finished. The output is saved in Marks.js; to view the calculated grade, open Grade.html in a browser. If you make changes and rerun the tests, make sure you reload Grade.html. You can double-check with the timestamp at the top of the page.
You can test your performance using NSight. First build with the Test configuration selected. Then select NSIGHT -> Start Performance Analysis.
In Application Settings, make sure Application: is set to the executable you wish to run, and Arguments: should be:
-e output.raw -i input.raw -o myOutput.raw -t vector
The Working Directory: should be the path to the test. For example:
PathToLab8\Dataset\Test\11
Remember to replace PathToLab8 with the actual path to lab 8.
To make it easier for you to test the 1D scan kernel (the first and most important step of this lab), you will be given another Visual Studio solution containing skeleton code in which you can write and test your exclusive scan kernel. There will be two projects in the solution, one for inclusive and one for exclusive scan, but you only need to write the exclusive scan. A test script is provided for this solution too. Moving the scan kernel from this 1D solution to the 2D solution should be a matter of copy and paste.
Also, a reference file will be provided for you to compare the speed of your kernel.