1 Introduction
This assignment deals with optimizing memory intensive code. Image processing offers many examples of functions that can benefit from optimization. In this lab, youll be improving the overall performance of an image processing application by a factor of about 25 an extra credit phase will challenge you do increase the speed by a factor of 45.
The application youll be modifying reads in an image (a picture) and a filter. An image is represented as a three-dimensional array, described in cs1300bmp.h. Each pixel is represented as a combination of (red, green, blue) values when coded in the BMP picture format, the individual (R,G,B) values can taken on the values 0255, but the cs1300bmp.h code is designed to handle larger pixel values. The code in cs1300bmp.cpp provides routines for reading and writing images in the BMP format. You are free to modify the format or the layout of the data structures in that code.
The filter is an nn array of numbers. Well go through the logistics of how a filter works in recitation and in class and briefly summarize it here. Basically, you an cause a number of visual affects by applying a filter to an image. The filter is implemented as a convolution, which means that elements of the filter matrix are multiplied by the image matrix to compute a new value for the image. Pictorially, this is represented as:
Figure 1: Filter Operation
Computationally, this is structured as five nested for loops (three to go over the colors, row and columns and two more to apply the filter). In the solution provided to you, filters are represented using the Filter class, implemented in Filter.h and Filter.cpp.
The majority of the work performed by the filter application is in the routine shown in Figure 4.
long long cycStart, cycStop; rdtscll(cycStart);
output -> width = input -> width; output -> height = input -> height;
for(int col = 1; col < (input -> width) 1; col = col + 1) {
for(int row = 1; row < (input -> height) 1 ; row = row + 1) {
for(int plane = 0; plane < 3; plane++) {
output -> color[plane][row][col] = 0; for (int j = 0; j < filter -> getSize(); j++) {
for (int i = 0; i < filter -> getSize(); i++) {
output -> color[plane][row][col] = output -> color[plane][row][col]
+ input -> color[plane][row + i 1][col + j 1] * filter -> get(i, j);
} } output -> color[plane][row][col]
= output -> color[plane][row][col] / filter -> getDivisor(); if ( output -> color[plane][row][col] < 0 ) { output -> color[plane][row][col] = 0; } if ( output -> color[plane][row][col] > 255 ) { output -> color[plane][row][col] = 255; } output -> color[plane][row][col] = output -> color[plane][row][col];
}
} } rdtscll(cycStop); double diff = cycStop cycStart; fprintf(stderr, Took %f cycles to process, or %f cycles per pixel
, diff, diff / (output -> width * output -> height));
Figure 2: Core of filter code
This routine is instrumented using the rdscll function. This inline function records the starting and finish times in terms of CPU cycles. We use this to determine the cycles per element needed to apply the filter to a given image. A sample of the output for the provided setup code looks like the following when run on a machine.
Took 170624720.000000 cycles to process, or 3229.816007 cycles per pixel
The cycle counter measures times using the CPU clock of your computer. It is fairly accurate, but many things (such as other running programs) influence the reported time. Thus, we will use the median of a number of runs to determine the time for a particular implementation of the program for a number of different filters.
Your job is going to be to improve the performance of this application using techniques detailed in Chapter 5 of the text. Youll be using two images (boats.bmp and blocks-small.bmp) The first image is fairly small (useful for quickly testing ideas) and the second image is larger (and used for grading / evaluation). You can run your program using a command line similar to this:
$ filter hline.filter boats.bmp
This invocation will leave the output image in filter-hline-boats.bmp and will report the time taken, as in the examples above. You can repeat the image name (or use different images) on the same command line to run the filter multiple times & return the average. For example:
$ ./filter hline.filter boats.bmp boats.bmp
Took 139255936.000000 cycles to process, or 2636.025138 cycles per pixel
Took 137527184.000000 cycles to process, or 2603.300977 cycles per pixel
Average cycles per sample is 2619.663057
2 Performance and Correctness
Well be using a script called Judge to score your program. The Judge program takes three optional argument. The -n argument specifies the number of times each filter should be executed, the -i specifies the image file and -p specifies the program name. The default options are shown explicitly in the following example.
% make g++ -m32 -g -O0 -fno-omit-frame-pointer -o filter FilterMain.cpp Filter.cpp cs1300b
../Judge -p ./filter -i blocks-small.bmp gauss: 4003.824003..4038.171866..3872.371058..3990.658428..3931.838805..4319.173193 avg: 3807.727624..3822.502626..3962.588553..4237.542068..4322.871861..4056.862357.. hline: 4171.320541..4234.822551..3851.320030..4385.223120..4366.434546..4422.357822 emboss: 3694.797139..3924.230283..3927.254491..4309.121033..3981.110480..3990.70275 Scores are 3694 3807 3822 3851 3872 3924 3927 3931 3962 3981 3990 3990 4003 4038 40 median CPE for is 4003 Resulting score for is 0 echo Done
Done
This would result in an average of 4003 cycles per second. Scoring is based on the function log2(CPE)
14.2 + 160 for the blocks-small.bmp image running on the https://coding.csel.io machines. We use those machines to provide an even playing field for evaluation.
The provided Makefile compiles your program; you may need to modify it to change compiler options or the like. You can also compile your program by hand, but you should remember what you did. We assume you are compiling your program on the https://coding.csel.io machines. You are free to use different compiler options if you think that will help and know how. You can read more about the GCC options at http://gcc.gnu.org/onlinedocs/gcc-4.4.2/gcc/i386-and-x86_ 002d64-Options.html. You can also use the Clang compiler (clang++) if you think that will help.
The Makefile includes a target make test that compares your output to the reference solution. Run make test frequently, because its easy to accidentally break the code when trying to optimize. Your code must be correct as well as quick.
The Makefile also provides a make judge rule that runs the test images and filters. Lastly, it provides a make clean rule to delete any temporary files or images.
Your measured time needs to include any active processing you do to the image after it is read in and before it is written out. Youre free to go whole hog on any optimization that might work, as long as it works with all the test images and filters included in the assignment. Your modified code does not need to handle any filters or images not included in the evaluation suite. This means you can go ahead and e.g change the matrix layout in the BMP image library, replace the Filter library, etc. However, youd be well advised to make certain those changes are important and effective before you sink a lot of time.
3 Extra Credit
The most extreme solution is to use SIMD extensions such as MMX, SSE, SSE2 or SSE3. The cpus on the perf machines support SSE2 extensions. This lets you operate on up to 16 items in parallel at once time; equally important, it provides more registers for you to work with, which is huge on the x86. See http:// jeffmatherphotography.com/dispatches/2012/02/an-experiment-with-sse/ for more information.
The perf machines also each have multiple CPUs or cores and you could use the OpenMP language extensions (see http://bisqwit.iki.fi/story/howto/openmp/ to try to use both cores. This is an easy thing to do, but hard to get big speedup.
Alternative, you can take advantage of the sk1llz of other l33t h4x0rs and use a libary like the Open Computer Vision library, OpenCVhttp://docs.opencv.org/modules/imgproc/doc/filtering. html?highlight=filter2d#filter2d. This is complicated because you have to get your image into their data structure format.
One year, one student (who later interned for Google) got a CPE of about 20ish.
Some students have used these methods in the past, but youre well advised to go for the easy low-hanging fruit before tackling the most aggressive optimization.
If you come up with a solution that uses OpenMP, youll get up to an additional 5 points extra credit. If you come up with a solution using SSE, youll get up to 10 points extra credit (assuming your changes result in faster code and you know why) and likewise switching to OpenCV is good for 10 points of extra credit. In each case, youll have to know what you did and why it works like always.
4 Logistics
You may work in a group of up to two people in solving the problems for this assignment. The only handin will be electronic. Any clarifications and revisions to the assignment will be posted on the course Web page.
Different computers will run this code at different speeds (even measured in cycles per second). In order to have a level playing field, everyone must use the machines perf-01 through perf-04 to time their projects. When you use those machines to compare your performance, make certain no one else is using it (the uptime command will indicate that well go over that in class).
You should do your development on your virtual machine. Since its usually true that making something run faster on one machine makes it run faster on another, you might be able to do your development on one machine and then use the perf machines simply to validate your measurements or improvements.
Download the perflab-setup.zip file right away and unzip it. Make certain you can compile and run the distributed version and achieve times comparable to those above.
Reviews
There are no reviews yet.