[SOLVED] Csci 2400, performance lab

This assignment deals with optimizing memory-intensive code. Image processing offers many examples of functions that can benefit from optimization. In this lab, we will consider two image processing operations: rotate, which rotates an image counterclockwise by 90°, and smooth, which “smooths” or “blurs” an image.

For this lab, we will consider an image to be represented as a two-dimensional matrix $M$, where $M_{i,j}$ denotes the value of the $(i,j)$th pixel of $M$. Pixel values are triples of red, green, and blue (RGB) values. We will only consider square images. Let $N$ denote the number of rows (or columns) of an image. Rows and columns are numbered, in C style, from 0 to $N-1$.

Given this representation, the rotate operation can be implemented quite simply as the combination of the following two matrix operations:

Transpose: for each $(i,j)$ pair, $M_{i,j}$ and $M_{j,i}$ are interchanged.
Exchange rows: row $i$ is exchanged with row $N-1-i$.

This combination is illustrated in Figure 1.

[Figure 1: Rotation of an image by 90° counterclockwise]

The smooth operation is implemented by replacing every pixel value with the average of all the pixels around it (in a maximum of 3 × 3 window centered at that pixel). Consider Figure 2. Writing M1 for the source image and M2 for the smoothed image, the values of pixels M2[1][1] and M2[N-1][N-1] are given below:

$$\texttt{M2[1][1]} = \frac{\sum_{i=0}^{2} \sum_{j=0}^{2} \texttt{M1[i][j]}}{9}$$

$$\texttt{M2[N-1][N-1]} = \frac{\sum_{i=N-2}^{N-1} \sum_{j=N-2}^{N-1} \texttt{M1[i][j]}}{4}$$

[Figure 2: Smoothing an image]

Start by copying perflab-handout.tar to the directory in which you plan to do your work. Then give the command: tar xvf perflab-handout.tar. This will cause a number of files to be unpacked into the directory. The only file you will be modifying and handing in is kernels.c. The driver.c program is a driver program that allows you to evaluate the performance of your solutions. Use the command make driver to generate the driver code and run it with the command ./driver.

Looking at the file kernels.c you’ll notice a C structure student into which you should insert the requested identifying information. Do this right away so you don’t forget.

The core data structure deals with image representation.
A pixel is a struct as shown below:

    typedef struct {
        unsigned short red;   /* R value */
        unsigned short green; /* G value */
        unsigned short blue;  /* B value */
    } pixel;

As can be seen, RGB values have 16-bit representations (“16-bit color”). An image I is stored as a one-dimensional array of pixels, where the (i,j)th pixel is I[RIDX(i,j,n)]. Here n is the dimension of the image matrix, and RIDX is a macro defined as follows:

    #define RIDX(i,j,n) ((i)*(n)+(j))

See the file defs.h for this code.

You should think of I[RIDX(i,j,n)] as equivalent to I[i][j] for most purposes. The reason RIDX is used at all is that it allows run-time changes of the array size, which is needed for the testing/grading code.

The following C function computes the result of rotating the source image src by 90° and stores the result in destination image dst. dim is the dimension of the image.

    void naive_rotate(int dim, pixel *src, pixel *dst)
    {
        int i, j;

        for (i = 0; i < dim; i++) {
            for (j = 0; j < dim; j++) {
                dst[RIDX(dim-1-j, i, dim)].red = src[RIDX(i, j, dim)].red;
                dst[RIDX(dim-1-j, i, dim)].green = src[RIDX(i, j, dim)].green;
                dst[RIDX(dim-1-j, i, dim)].blue = src[RIDX(i, j, dim)].blue;
            }
        }
    }

The above code scans the rows of the source image matrix, copying to the columns of the destination image matrix. Your task is to rewrite this code to make it run as fast as possible using techniques like code motion, loop unrolling and blocking.

See the file kernels.c for this code.

The smoothing function takes as input a source image src and returns the smoothed result in the destination image dst.
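Before turning to smooth in detail, here is one possible sketch of the blocking technique suggested for rotate above. It is an illustration under assumptions, not the reference solution: the names blocked_rotate and rotate_versions_agree and the tile size BLOCK are hypothetical, the pixel type and RIDX macro are repeated so the sketch compiles on its own, and whether a given tile size actually helps has to be measured with the driver on the target machine.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned short red;
    unsigned short green;
    unsigned short blue;
} pixel;

#define RIDX(i,j,n) ((i)*(n)+(j))
#define BLOCK 16   /* hypothetical tile size; must divide dim */

/* Baseline from the handout: row-wise reads, column-wise writes. */
void naive_rotate(int dim, pixel *src, pixel *dst)
{
    int i, j;
    for (i = 0; i < dim; i++) {
        for (j = 0; j < dim; j++) {
            dst[RIDX(dim-1-j, i, dim)].red = src[RIDX(i, j, dim)].red;
            dst[RIDX(dim-1-j, i, dim)].green = src[RIDX(i, j, dim)].green;
            dst[RIDX(dim-1-j, i, dim)].blue = src[RIDX(i, j, dim)].blue;
        }
    }
}

/* Blocked variant: handle one BLOCK x BLOCK tile at a time so the
 * source rows and destination rows touched by a tile stay in cache.
 * Each pixel is copied as a whole struct instead of field by field. */
void blocked_rotate(int dim, pixel *src, pixel *dst)
{
    int i, j, ii, jj;
    assert(dim % BLOCK == 0);
    for (ii = 0; ii < dim; ii += BLOCK) {
        for (jj = 0; jj < dim; jj += BLOCK) {
            for (i = ii; i < ii + BLOCK; i++) {
                for (j = jj; j < jj + BLOCK; j++) {
                    dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
                }
            }
        }
    }
}

/* Returns 1 if both versions produce identical output for a
 * dim x dim test image, 0 otherwise (or on allocation failure). */
int rotate_versions_agree(int dim)
{
    size_t n = (size_t)dim * (size_t)dim;
    pixel *src = malloc(n * sizeof *src);
    pixel *a = malloc(n * sizeof *a);
    pixel *b = malloc(n * sizeof *b);
    int ok = 0;

    if (src && a && b) {
        for (size_t k = 0; k < n; k++) {
            src[k].red = (unsigned short)(k & 0xFFFF);
            src[k].green = (unsigned short)((k * 7) & 0xFFFF);
            src[k].blue = (unsigned short)((k * 13) & 0xFFFF);
        }
        naive_rotate(dim, src, a);
        blocked_rotate(dim, src, b);
        ok = (memcmp(a, b, n * sizeof *a) == 0);
    }
    free(src);
    free(a);
    free(b);
    return ok;
}
```

Checking a rewritten kernel against the naive version, as rotate_versions_agree does, is a cheap habit while experimenting; the supplied driver performs its own correctness tests as well.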
Here is part of an implementation of smooth:

    void naive_smooth(int dim, pixel *src, pixel *dst)
    {
        int i, j, ii, jj;
        pixel_sum ps;

        for (j = 0; j < dim; j++) {
            for (i = 0; i < dim; i++) {
                initialize_pixel_sum(&ps);
                for (ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++) {
                    for (jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++) {
                        accumulate_sum(&ps, src[RIDX(ii,jj,dim)]);
                    }
                }
                dst[RIDX(i,j,dim)].red = ps.red/ps.num;
                dst[RIDX(i,j,dim)].green = ps.green/ps.num;
                dst[RIDX(i,j,dim)].blue = ps.blue/ps.num;
            }
        }
    }

The functions max, min, initialize_pixel_sum, and accumulate_sum are all functions you can and probably will want to modify.

This code (and the helper functions) are all in the file kernels.c.

Our main performance measure is CPE or Cycles per Element. If a function takes $C$ cycles to run for an image of size $N \times N$, the CPE value is $C/N^2$. Table 1 summarizes the performance of the naive implementations shown above and compares it against an optimized implementation. Performance is shown for 5 different values of N. All measurements were made on a perf server.

[Table 1: CPEs and Ratios for Optimized vs. Naive Implementations]

The ratios (speedups) of the optimized implementation over the naive one will constitute a score of your implementation. To summarize the overall effect over different values of N, we will compute the geometric mean of the results for these 5 values. That is, if the measured speedups for N = {64, 128, 256, 512, 1024} are $R_{64}$, $R_{128}$, $R_{256}$, $R_{512}$, and $R_{1024}$, then we compute the overall performance as

$$R = \sqrt[5]{R_{64} \times R_{128} \times R_{256} \times R_{512} \times R_{1024}}$$

Assumptions

To make life easier, you can assume that N is a multiple of 32. Your code must run correctly for all such values of N.

We have provided support code to help you test the correctness of your implementations and measure their performance. This section describes how to use this infrastructure.
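As a side note on the scoring formula above, the CPE and geometric-mean calculations can be sketched in a few lines of C. This is only an illustration of the arithmetic; the driver computes these values itself, and the function names cpe, fifth_root, and overall_speedup are hypothetical. The fifth root is computed with Newton's method purely to keep the sketch free of any math-library dependency.

```c
#include <stddef.h>

/* Cycles per element for an n x n image processed in `cycles` cycles. */
double cpe(double cycles, int n)
{
    return cycles / ((double)n * (double)n);
}

/* Fifth root of p > 0 using Newton's method on f(x) = x^5 - p.
 * Starting at or above the root gives monotone convergence. */
double fifth_root(double p)
{
    double x = (p > 1.0) ? p : 1.0;
    for (int k = 0; k < 200; k++) {
        double x4 = x * x * x * x;
        x = (4.0 * x + p / x4) / 5.0;
    }
    return x;
}

/* Geometric mean of exactly five speedup ratios, one per image size. */
double overall_speedup(const double r[5])
{
    double product = 1.0;
    for (size_t k = 0; k < 5; k++) {
        product *= r[k];
    }
    return fifth_root(product);
}
```

For instance, speedups of 1, 2, 4, 8, and 16 have product 1024, so their geometric mean is 4, well below their arithmetic mean of 6.2. A single large speedup at one image size therefore counts for less than a consistent improvement across all five. Where the C math library is available, pow(product, 1.0 / 5.0) is the more usual way to take the root.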
The exact details of each part of the assignment are described in the following section.

Note: The only source file you will be modifying is kernels.c.

You will find yourself writing many versions of the rotate and smooth routines. To help you compare the performance of all the different versions you’ve written, we provide a way of “registering” functions.

For example, the file kernels.c that we have provided you contains the following function:

    void register_rotate_functions() {
        add_rotate_function(&rotate, rotate_descr);
    }

This function contains one or more calls to add_rotate_function. In the above example, add_rotate_function registers the function rotate along with a string rotate_descr which is an ASCII description of what the function does. See the file kernels.c to see how to create the string descriptions. This string can be at most 256 characters long.

A similar function for your smooth kernels is provided in the file kernels.c.

The source code you will write will be linked with object code that we supply into a driver binary. To create this binary, you will need to execute the command

    make driver

You will need to re-make driver each time you change the code in kernels.c. To test your implementations, you can then run the command:

    unix> ./driver

The driver can be run in four different modes: default, autograder, file, and dump.

If run without any arguments, driver will run all of your versions (default mode). Other modes and options can be specified by command-line arguments to driver, as listed below:

-g : Run only rotate() and smooth() functions (autograder mode).
-f <funcfile> : Execute only those versions specified in <funcfile> (file mode).
-d <dumpfile> : Dump the names of all versions to a dump file called <dumpfile>, one line to a version (dump mode).
-q : Quit after dumping version names to a dump file. To be used in tandem with -d.
For example, to quit immediately after printing the dump file, type ./driver -qd dumpfile.
-h : Print the command line usage.

Important: Before you start, you should fill in the struct in kernels.c with your information (name and email address).

In this part, you will optimize rotate to achieve as low a CPE as possible. You should compile driver and then run it with the appropriate arguments to test your implementations.

For example, running driver with the supplied naive version (for rotate) might generate the output shown below:

    unix> ./driver
    Teamname: bovik
    Member 1: Harry Q. Bovik
    Email 1: [email protected]

    Rotate: Version = naive_rotate: Naive baseline implementation:

In this part, you will optimize smooth to achieve as low a CPE as possible.

For example, running driver with the supplied naive version (for smooth) might generate the output shown below:

    unix> ./driver
    Smooth: Version = naive_smooth: Naive baseline implementation:
    Dim             32      64      128     256     512     Mean
    Your CPEs       226.0   238.1   240.3   252.4   364.1
    Baseline CPEs   224.8   237.9   240.7   250.8   364.1
    Speedup         1.0     1.0     1.0     1.0     1.0     1.0

Some advice. Focus on optimizing the inner-most loop (the code that gets repeatedly executed in a loop) using the optimization tricks covered in class. The smooth function is more compute-intensive and less memory-sensitive than the rotate function, so the optimizations are of somewhat different flavors. Consider looking at the assembly code generated for rotate and smooth, and/or running a profiler.

You may write any code in kernels.c you want, as long as it satisfies the following: You can only modify code in kernels.c. You are allowed to define macros, additional global variables, and other procedures in this file.

Your solutions for rotate and smooth will each count for 50% of your grade, or up to 20 code-execution points. In your interview, you will be asked to explain what you changed about your code, and why.
You might be asked how some small additional change would affect performance. The score for each will be based on the following:

More specifically, with $S_r$ your averaged speedup for rotate and $S_s$ your averaged speedup for smooth, the following equations are used to calculate your code-execution points (assuming you made at least some degree of improvement).

Other Note: Extra credit will require doing even better than $S_r = 1.8$ and $S_s = 6.7$, and may require esoteric or extreme modifications. For now, getting the full 10 extra credit points will require doing 10% better (i.e., $S_r \geq 2.0$ and $S_s \geq 7.4$).

Performance can vary widely from computer to computer, even if all of the machines are using the same virtual machine image. Your final grade will be determined using one of the “perf” machines listed below, the same one used to produce the baseline and optimized values listed above. We have a number of servers set up, which all perform almost the same:

To get your files on to the server:

    scp -r ./perflab-handout <identikey-name>@perf-XX.cs.colorado.edu:~

You may be prompted about a security key: type ‘yes’. You will be asked for your password: enter it.

To ssh in to a server:

    ssh <identikey-name>@perf-XX.cs.colorado.edu

You may be prompted about a security key: type ‘yes’. You will be asked for your password: enter it.

Once your files are on the server and you are ssh’d in, you can use make to compile your code and the driver to test it as normal. Be sure to always recompile your code for the server (use the command make clean before make if necessary).

Note: Do most of your work on your local machine without connecting to the “perf” servers. Once you have confirmed a performance improvement on your local machine, connect to one of the servers to verify and test your results.

Note: If multiple students are connected to the same server at the same time running tests, it will affect performance. Thus, please log out of the servers when done.
You can use the who command once ssh’d in to the server to see if anyone else is also using it. If so, you can check one of the other four machines.

Generally, we recommend you do most of your work just testing on your machine. As long as you don’t modify naive_smooth, the speedup between that and your fastest version will be a decent approximation of your final score.

At least once before your final submission, we recommend you ssh in to one of the perf machines and run make clean, then make, then ./driver -g. The resulting reported score should be very close to the score computed at the start of your grading interview.

When you have completed the lab, you will upload one file, kernels.c, to Moodle.

Good luck!
