You (plus an optional teammate) are tasked with making the fastest matrix multiplication program possible for all machines. That means you cannot target one specific machine, but you are free to research the usual architecture specifications for personal and server machines. You may assume that everything is an Intel (x86_64) architecture to make life easier.

Background Reading: Chapter 4.12

The matrices are column major. The naive implementation is given in dgemm-naive.c, and you can run bench-naive to see the output.

    void dgemm( int m, int n, float *A, float *C )
    {
        for( int i = 0; i < m; i++ )
            for( int k = 0; k < n; k++ )
                for( int j = 0; j < m; j++ )
                    C[i+j*m] += A[i+k*m] * A[j+k*m];
    }

C is where the result is stored, and all the calculations are done from just one matrix, A. You are required to do all the calculations; no optimization is allowed on this front, to make benchmarking easier.

The zip contains the following files:

    Makefile: to make and benchmark
    benchmark.c: do not modify. It checks results and produces performance numbers
    dgemm-naive.c: naive implementation as shown above
    dgemm-optimize.c: your optimizations

Choose at most 3 of the following common optimizations (1 per function, DO NOT combine). The project is worth 100 points; any extra points will go into your final exam as extra credit (if submitted on time; having multiple projects due or exams is not an excuse!).
    [20 points] Reordering of instructions (compiler peephole optimization)
    [20 points] Register blocking (reusing the same registers for multiple calculations)
    Cache optimizations (each sub-bullet counts as one optimization):
        o [40 points] Blocking/tiling (trying to keep the data in the cache for large matrices)
        o [40 points] Pre-fetching (copying small matrix blocks into contiguous chunks of memory)
        o [20 points] Pre-computing the transpose for spatial locality
    Loop optimizations (each sub-bullet counts as one optimization):
        o [40 points] SSE instructions
        o [20 points] Reordering
        o [40 points] Unrolling (at least 3 iterations)
    [40 points] Padding matrices (odd sizes can hurt pipeline performance)

You should not use any libraries for parallel computing, such as OpenMP. You may assume multiple cores. Anything else you can find or think of is fine to increase performance; just remember to calculate all the results (copying from one part of the resulting matrix to another is not allowed).

Your solution will be run across different machines and the results aggregated. Note that you should not optimize just for your computer or for one particular matrix size. Use your knowledge of computer architecture with all the modern features that try to accelerate execution. Caches will play a big role, but it is not safe to assume a particular architecture; in general, optimizing memory accesses will lead to big gains. Matrix size and corner cases will also matter, as the same optimization will not work across the board. Have fun with this project.