(a) Suppose you have four threads. Why would you not want to use the following domain decomposition to parallelize the FD problem?

[Figure 1. Example domain decomposition for Problem 1.a.]

(b) (5 pts) Suppose T_s is the time it takes to synchronize and T_p is the time it takes to process a single FD node (i.e., the time to process a single u_{i,j}). If threads execute in a perfectly parallel fashion, then the time to process the fully spun-up region of the domain (assume the number of blocks in one dimension equals the number of threads) is given by

    T_full = N_w (T_s + n_b T_p),

where N_w is the number of waves in the fully spun-up region and n_b is the number of elements per block. Notice that if we increase the number of blocks, n_b becomes smaller while N_w becomes larger. When does synchronization contribute more time than computation to T_full? Smaller blocks give better parallelization by minimizing the spin-up and spin-down time. How does the cost of synchronization affect our choice of block size?

Associated Files: main.cpp, wavefront.h
Name your file: wavefront.cpp
Expected compile command: g++ -o hw2 -std=c++11 -fopenmp main.cpp wavefront.cpp
Running the program:
    export OMP_NUM_THREADS=<number of threads>
    ./hw2 420
WARNING: do not modify main.cpp or wavefront.h. For testing you can write your own main file if you like and compile your program using the same command as above with your main file in place of main.cpp.

(a) (24 pts) Implement the wavefront parallelization in the function wavefront420, where the number of blocks in the y-dimension, Ny, equals num_threads. Also implement the helper function process_block, which you must use in wavefront420 to process each block. Note this is NOT the wrap-around algorithm discussed in class but the easier, "nice-case" domain decomposition. Make sure your implementation handles cases where the number of nodes may not be evenly divisible by the number of threads/number of blocks you choose. The number of finite difference nodes, nx and ny, are given as constants in wavefront.h. Index data using the function cartesian2flat: u_{i,j} = data[cartesian2flat(i, j, ny)]. Use the C math library, cmath, for sine. (One possible approach is sketched after this problem.)

(b) (7 pts) The code in main.cpp times your implementation. Report the (strong-scaling) speed-up from using 1 to 8 threads with Nx = Ny = num_threads. Plot the results.

(c) (7 pts) Using 4 threads, time your code for Nx = nx, nx/4, nx/16, nx/64, num_threads and plot the results. Use a log scale for the Nx axis.

Note: You can use MATLAB, Excel, Julia, Python, etc. for plotting. Your plots will be graded for presentation; make sure to label the axes and give the plot a title. To get your code to compile, add an empty definition for the function wavefront520 to your .cpp file.
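For reference, here is a minimal, hedged sketch of the "nice-case" wavefront parallelization asked for in part (a) above. The real constants, function signatures, and finite-difference update rule live in wavefront.h and main.cpp, which are not reproduced here, so the values of nx and ny, the parameter lists of process_block and wavefront420, and the sine-of-upwind-neighbours update below are all assumptions made only for illustration. The idea shown is: thread t owns block-row t, block (bx, t) sits on diagonal wave bx + t, and a barrier separates consecutive waves.

    #include <algorithm>
    #include <cmath>
    #include <omp.h>

    // Hypothetical sketch: the real constants and signatures are in wavefront.h.
    static const int nx = 4000, ny = 4000;   // assumed values; wavefront.h fixes the real ones

    // Assumed row-major flattening, matching u_{i,j} = data[cartesian2flat(i, j, ny)].
    inline int cartesian2flat(int i, int j, int ny_) { return i * ny_ + j; }

    // Assumed helper: process one block spanning [i_lo, i_hi) x [j_lo, j_hi).
    // The sine-of-neighbours update is a placeholder for the assignment's recurrence.
    void process_block(double *data, int i_lo, int i_hi, int j_lo, int j_hi) {
      for (int i = i_lo; i < i_hi; ++i)
        for (int j = j_lo; j < j_hi; ++j)
          if (i > 0 && j > 0)
            data[cartesian2flat(i, j, ny)] =
                std::sin(data[cartesian2flat(i - 1, j, ny)] +
                         data[cartesian2flat(i, j - 1, ny)]);
    }

    // "Nice case": the number of block-rows equals the number of threads,
    // so thread t owns block-row t and processes at most one block per wave.
    void wavefront420(double *data, int Nx) {
    #pragma omp parallel
      {
        const int t  = omp_get_thread_num();
        const int NT = omp_get_num_threads();
        // Ceiling-divided extents so nx and ny need not be divisible by Nx or NT.
        const int bw = (nx + Nx - 1) / Nx;               // block width  (i direction)
        const int bh = (ny + NT - 1) / NT;               // block height (j direction)
        const int j_lo = t * bh, j_hi = std::min(ny, j_lo + bh);

        // Block (bx, t) lies on diagonal wave bx + t; one barrier ends each wave.
        for (int wave = 0; wave < Nx + NT - 1; ++wave) {
          const int bx = wave - t;
          if (bx >= 0 && bx < Nx) {
            const int i_lo = bx * bw, i_hi = std::min(nx, i_lo + bw);
            process_block(data, i_lo, i_hi, j_lo, j_hi);
          }
    #pragma omp barrier                                  // wave complete before the next starts
        }
      }
    }

The ceiling-divided block extents clipped with std::min are what let the sketch cope with node counts that are not evenly divisible by the number of blocks or threads: a trailing block simply comes out smaller, and an out-of-range block degenerates to an empty loop while its thread still reaches the barrier.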
Associated Files: main.cpp, wavefront.h
Name your file: wavefront.cpp
Expected compile command: g++ -o hw2 -std=c++11 -fopenmp main.cpp wavefront.cpp
Running the program:
    export OMP_NUM_THREADS=<number of threads>
    ./hw2 520
WARNING: do not modify main.cpp or wavefront.h. For testing you can write your own main file if you like and compile your program using the same command as above with your main file in place of main.cpp.

(a) (32 pts) Implement the wavefront parallelization in the function wavefront520 with wrap-around for when the number of blocks in each dimension does not equal num_threads. Also implement the helper function process_block, which you must use in wavefront520 to process each block. Make sure your implementation handles cases where the number of nodes may not be evenly divisible by the number of threads/number of blocks you choose. The number of finite difference nodes, nx and ny, are given as constants in wavefront.h. Index data using the function cartesian2flat: u_{i,j} = data[cartesian2flat(i, j, ny)]. Use the C math library, cmath, for sine. (One possible approach is sketched after this problem.)

(b) (5 pts) In terms of nx, ny, Nx, Ny, and NT = num_threads, what fraction of the parallel program is spent in the spin-up and spin-down phases versus the fully-parallelized region? Assume every block takes the same amount of time to process and synchronize.

(c) (15 pts) The code in main.cpp times your implementation. Compute the (strong-scaling) speed-up using 1 to 8 threads with Nx = Ny = num_threads. Repeat this with Nx = Ny = 2*num_threads and with Nx = Ny = 3*num_threads. Plot the three data sets on a single plot.

Note: You can use MATLAB, Excel, Julia, Python, etc. for plotting. Your plots will be graded for presentation; make sure to label the axes and give the plot a title. To get your code to compile, add an empty definition for the function wavefront420 to your .cpp file.
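For the wrap-around case in part (a) above, one hedged sketch is given below as a continuation of the earlier sketch: it reuses the same assumed nx, ny, cartesian2flat, and process_block, and the parameter list of wavefront520 is likewise a guess. Block-rows are assigned to threads cyclically, so thread t owns rows t, t + NT, t + 2NT, and so on. The wrap-around algorithm discussed in class may use point-to-point synchronization between neighbouring threads; for simplicity this sketch keeps one barrier per diagonal wave, which still satisfies the left and top block dependencies.

    // Continuation of the previous sketch (append to the same file to compile):
    // same assumed nx, ny, cartesian2flat, and process_block.
    // Wrap-around ownership: thread t owns block-rows t, t + NT, t + 2*NT, ...
    void wavefront520(double *data, int Nx, int Ny) {
    #pragma omp parallel
      {
        const int t  = omp_get_thread_num();
        const int NT = omp_get_num_threads();
        const int bw = (nx + Nx - 1) / Nx;               // block width  (i direction)
        const int bh = (ny + Ny - 1) / Ny;               // block height (j direction)

        // Block (bx, by) lies on diagonal wave bx + by; a thread may own several
        // blocks on the same wave, and those blocks are mutually independent.
        for (int wave = 0; wave < Nx + Ny - 1; ++wave) {
          for (int by = t; by < Ny; by += NT) {          // cyclic (wrap-around) block-rows
            const int bx = wave - by;
            if (bx < 0 || bx >= Nx) continue;
            const int i_lo = bx * bw, i_hi = std::min(nx, i_lo + bw);
            const int j_lo = by * bh, j_hi = std::min(ny, j_lo + bh);
            process_block(data, i_lo, i_hi, j_lo, j_hi);
          }
    #pragma omp barrier                                  // left/top dependencies now satisfied
        }
      }
    }

The per-wave barrier is correct because every block a wave depends on belongs to an earlier wave, but it is also the coarsest choice; replacing it with neighbour-to-neighbour signalling is one way to reduce the synchronization cost examined in Problem 1(b).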