[Solved] Cache Lab: Understanding Cache Memories


1 Logistics

This is an individual project. You must run this lab on a 64-bit x86-64 machine.

SITE-SPECIFIC: Insert any other logistical items here, such as how to ask for help.

2 Overview

This lab will help you understand the impact that cache memories can have on the performance of your C programs.

The lab consists of two parts. In the first part you will write a small C program (about 200-300 lines) that simulates the behavior of a cache memory. In the second part, you will optimize a small matrix transpose function, with the goal of minimizing the number of cache misses.

3 Downloading the assignment

SITE-SPECIFIC: Insert a paragraph here that explains how the instructor will hand out the cachelab-handout.tar file to the students.

Start by copying cachelab-handout.tar to a protected Linux directory in which you plan to do your work. Then give the command

linux> tar xvf cachelab-handout.tar

This will create a directory called cachelab-handout that contains a number of files. You will be modifying two files: csim.c and trans.c. To compile these files, type:

linux> make clean
linux> make

WARNING: Do not let the Windows WinZip program open up your .tar file (many Web browsers are set to do this automatically). Instead, save the file to your Linux directory and use the Linux tar program to extract the files. In general, for this class you should NEVER use any platform other than Linux to modify your files. Doing so can cause loss of data (and important work!).

4 Description

The lab has two parts. In Part A you will implement a cache simulator. In Part B you will write a matrix transpose function that is optimized for cache performance.

4.1 Reference Trace Files

The traces subdirectory of the handout directory contains a collection of reference trace files that we will use to evaluate the correctness of the cache simulator you write in Part A. The trace files are generated by a Linux program called valgrind.
For example, typing

linux> valgrind --log-fd=1 --tool=lackey -v --trace-mem=yes ls -l

on the command line runs the executable program “ls -l”, captures a trace of each of its memory accesses in the order they occur, and prints them on stdout.

Valgrind memory traces have the following form:

I 0400d7d4,8
 M 0421c7f0,4
 L 04f6b868,8
 S 7ff0005c8,8

Each line denotes one or two memory accesses. The format of each line is

[space]operation address,size

The operation field denotes the type of memory access: “I” denotes an instruction load, “L” a data load, “S” a data store, and “M” a data modify (i.e., a data load followed by a data store). There is never a space before each “I”. There is always a space before each “M”, “L”, and “S”. The address field specifies a 64-bit hexadecimal memory address. The size field specifies the number of bytes accessed by the operation.

4.2 Part A: Writing a Cache Simulator

In Part A you will write a cache simulator in csim.c that takes a valgrind memory trace as input, simulates the hit/miss behavior of a cache memory on this trace, and outputs the total number of hits, misses, and evictions.
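
The trace format described above lends itself to a small sscanf-based parser. What follows is a minimal sketch under stated assumptions, not the reference simulator's code: the struct trace_op type and the parse_trace_line function are hypothetical names introduced here for illustration. The leading space in the " %c" conversion skips the optional blank in front of the operation letter, so lines with and without that blank are handled the same way.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one parsed trace line (not part of the handout). */
struct trace_op {
    char op;                 /* 'I', 'L', 'S', or 'M' */
    unsigned long long addr; /* hexadecimal address from the trace */
    int size;                /* number of bytes accessed */
};

/*
 * Parse one line of the form "[space]operation address,size".
 * Returns 1 if the line matches the format and names a known operation,
 * 0 otherwise. On failure the contents of *out are unspecified.
 */
int parse_trace_line(const char *line, struct trace_op *out)
{
    assert(line != NULL && out != NULL);

    /* " %c" skips leading whitespace, then reads the operation letter. */
    if (sscanf(line, " %c %llx,%d", &out->op, &out->addr, &out->size) != 3)
        return 0;

    return out->op == 'I' || out->op == 'L' ||
           out->op == 'S' || out->op == 'M';
}
```

A caller would typically read the trace with fgets, pass each line to parse_trace_line, and dispatch on the op field of the result.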

We have provided you with the binary executable of a reference cache simulator, called csim-ref, that simulates the behavior of a cache with arbitrary size and associativity on a valgrind trace file. It uses the LRU (least-recently used) replacement policy when choosing which cache line to evict.

The reference simulator takes the following command-line arguments:

Usage: ./csim-ref [-hv] -s <s> -E <E> -b <b> -t <tracefile>

• -h: Optional help flag that prints usage info
• -v: Optional verbose flag that displays trace info
• -s <s>: Number of set index bits (S = 2^s is the number of sets)
• -E <E>: Associativity (number of lines per set)
• -b <b>: Number of block bits (B = 2^b is the block size)
• -t <tracefile>: Name of the valgrind trace to replay

The command-line arguments are based on the notation (s, E, and b) from page 597 of the CS:APP2e textbook. For example:

linux> ./csim-ref -s 4 -E 1 -b 4 -t traces/yi.trace
hits:4 misses:5 evictions:3

The same example in verbose mode:

linux> ./csim-ref -v -s 4 -E 1 -b 4 -t traces/yi.trace
L 10,1 miss
M 20,1 miss hit
L 22,1 hit
S 18,1 hit
L 110,1 miss eviction
L 210,1 miss eviction
M 12,1 miss eviction hit
hits:4 misses:5 evictions:3

Your job for Part A is to fill in the csim.c file so that it takes the same command line arguments and produces the identical output as the reference simulator. Notice that this file is almost completely empty. You’ll need to write it from scratch.

Programming Rules for Part A

• Include your name and loginID in the header comment for csim.c.

• Your csim.c file must compile without warnings in order to receive credit.
• Your simulator must work correctly for arbitrary s, E, and b. This means that you will need to allocate storage for your simulator’s data structures using the malloc function. Type “man malloc” for information about this function.
• For this lab, we are interested only in data cache performance, so your simulator should ignore all instruction cache accesses (lines starting with “I”). Recall that valgrind always puts “I” in the first column (with no preceding space), and “M”, “L”, and “S” in the second column (with a preceding space). This may help you parse the trace.
• To receive credit for Part A, you must call the function printSummary, with the total number of hits, misses, and evictions, at the end of your main function:
  printSummary(hit_count, miss_count, eviction_count);
• For this lab, you should assume that memory accesses are aligned properly, such that a single memory access never crosses block boundaries. By making this assumption, you can ignore the request sizes in the valgrind traces.

4.3 Part B: Optimizing Matrix Transpose

In Part B you will write a transpose function in trans.c that causes as few cache misses as possible.

Let A denote a matrix, and A_ij denote the component on the ith row and jth column. The transpose of A, denoted A^T, is a matrix such that A_ij = (A^T)_ji.

To help you get started, we have given you an example transpose function in trans.c that computes the transpose of N × M matrix A and stores the results in M × N matrix B:

char trans_desc[] = "Simple row-wise scan transpose";
void trans(int M, int N, int A[N][M], int B[M][N])

The example transpose function is correct, but it is inefficient because the access pattern results in relatively many cache misses.

Your job in Part B is to write a similar function, called transpose_submit, that minimizes the number of cache misses across different sized matrices:

char transpose_submit_desc[] = "Transpose submission";
void transpose_submit(int M, int N, int A[N][M], int B[M][N]);

Do not change the description string (“Transpose submission”) for your transpose_submit function. The autograder searches for this string to determine which transpose function to evaluate for credit.
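
For orientation, a baseline with the signature shown above can be sketched as a plain doubly nested loop. This is an illustrative guess at what a simple row-wise scan looks like, not necessarily the exact code shipped in trans.c, and the name trans_rowwise is hypothetical. The array parameters are C99 variable-length array types, so the code needs a C compiler rather than a C++ one.

```c
#include <assert.h>

/*
 * Baseline sketch: transpose the N x M matrix A into the M x N matrix B
 * by scanning A one row at a time. A is only read; B is only written.
 */
void trans_rowwise(int M, int N, int A[N][M], int B[M][N])
{
    int i, j, tmp;

    assert(M > 0 && N > 0);

    for (i = 0; i < N; i++) {
        for (j = 0; j < M; j++) {
            tmp = A[i][j];   /* consecutive reads along row i of A */
            B[j][i] = tmp;   /* each write lands in a different row of B */
        }
    }
}
```

The reads walk through memory sequentially, while each write to B jumps ahead by a full row of B. That strided write pattern is the likely source of most of the misses in a scan of this kind.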

Programming Rules for Part B

• Include your name and loginID in the header comment for trans.c.
• Your code in trans.c must compile without warnings to receive credit.
• You are allowed to define at most 12 local variables of type int per transpose function. [1]
• You are not allowed to side-step the previous rule by using any variables of type long or by using any bit tricks to store more than one value to a single variable.
• Your transpose function may not use recursion.
• If you choose to use helper functions, you may not have more than 12 local variables on the stack at a time between your helper functions and your top level transpose function. For example, if your transpose declares 8 variables, and then you call a function which uses 4 variables, which calls another function which uses 2, you will have 14 variables on the stack, and you will be in violation of the rule.
• Your transpose function may not modify array A. You may, however, do whatever you want with the contents of array B.
• You are NOT allowed to define any arrays in your code or to use any variant of malloc.

[1] The reason for this restriction is that our testing code is not able to count references to the stack. We want you to limit your references to the stack and focus on the access patterns of the source and destination arrays.

5 Evaluation

This section describes how your work will be evaluated. The full score for this lab is 60 points:

• Part A: 27 Points
• Part B: 26 Points
• Style: 7 Points

5.1 Evaluation for Part A

For Part A, we will run your cache simulator using different cache parameters and traces. There are eight test cases, each worth 3 points, except for the last case, which is worth 6 points:

linux> ./csim -s 1 -E 1 -b 1 -t traces/yi2.trace
linux> ./csim -s 4 -E 2 -b 4 -t traces/yi.trace
linux> ./csim -s 2 -E 1 -b 4 -t traces/dave.trace
linux> ./csim -s 2 -E 1 -b 3 -t traces/trans.trace
linux> ./csim -s 2 -E 2 -b 3 -t traces/trans.trace
linux> ./csim -s 2 -E 4 -b 3 -t traces/trans.trace
linux> ./csim -s 5 -E 1 -b 5 -t traces/trans.trace
linux> ./csim -s 5 -E 1 -b 5 -t traces/long.trace

You can use the reference simulator csim-ref to obtain the correct answer for each of these test cases. During debugging, use the -v option for a detailed record of each hit and miss.

For each test case, outputting the correct number of cache hits, misses and evictions will give you full credit for that test case. Each of your reported number of hits, misses and evictions is worth 1/3 of the credit for that test case. That is, if a particular test case is worth 3 points, and your simulator outputs the correct number of hits and misses, but reports the wrong number of evictions, then you will earn 2 points.

5.2 Evaluation for Part B

For Part B, we will evaluate the correctness and performance of your transpose_submit function on three different-sized output matrices:

• 32 × 32 (M = 32, N = 32)
• 64 × 64 (M = 64, N = 64)
• 61 × 67 (M = 61, N = 67)

5.2.1 Performance (26 pts)

For each matrix size, the performance of your transpose_submit function is evaluated by using valgrind to extract the address trace for your function, and then using the reference simulator to replay this trace on a cache with parameters (s = 5, E = 1, b = 5).

Your performance score for each matrix size scales linearly with the number of misses, m, up to some threshold:

• 32 × 32: 8 points if m < 300, 0 points if m > 600
• 64 × 64: 8 points if m < 1,300, 0 points if m > 2,000
• 61 × 67: 10 points if m < 2,000, 0 points if m > 3,000

Your code must be correct to receive any performance points for a particular size. Your code only needs to be correct for these three cases and you can optimize it specifically for these three cases. In particular, it is perfectly OK for your function to explicitly check for the input sizes and implement separate code optimized for each case.

5.3 Evaluation for Style

There are 7 points for coding style. These will be assigned manually by the course staff. Style guidelines can be found on the course website.

The course staff will inspect your code in Part B for illegal arrays and excessive local variables.

6 Working on the Lab

6.1 Working on Part A

We have provided you with an autograding program, called test-csim, that tests the correctness of your cache simulator on the reference traces. Be sure to compile your simulator before running the test:

linux> make
linux> ./test-csim
                   Your simulator          Reference simulator
Points (s,E,b)    Hits  Misses  Evicts    Hits  Misses  Evicts
     3 (1,1,1)       9       8       6       9       8       6  traces/yi2.trace
     3 (4,2,4)       4       5       2       4       5       2  traces/yi.trace
     3 (2,1,4)       2       3       1       2       3       1  traces/dave.trace
     3 (2,1,3)     167      71      67     167      71      67  traces/trans.trace
     3 (2,2,3)     201      37      29     201      37      29  traces/trans.trace
     3 (2,4,3)     212      26      10     212      26      10  traces/trans.trace
     3 (5,1,5)     231       7       0     231       7       0  traces/trans.trace
     6 (5,1,5)  265189   21775   21743  265189   21775   21743  traces/long.trace
    27

For each test, it shows the number of points you earned, the cache parameters, the input trace file, and a comparison of the results from your simulator and the reference simulator.

Here are some hints and suggestions for working on Part A:

• Do your initial debugging on the small traces, such as traces/dave.trace.
• The reference simulator takes an optional -v argument that enables verbose output, displaying the hits, misses, and evictions that occur as a result of each memory access. You are not required to implement this feature in your csim.c code, but we strongly recommend that you do so. It will help you debug by allowing you to directly compare the behavior of your simulator with the reference simulator on the reference trace files.
• We recommend that you use the getopt function to parse your command line arguments.
  You’ll need the following header files:
  #include <getopt.h>
  #include <stdlib.h>
  #include <unistd.h>
  See “man 3 getopt” for details.
• Each data load (L) or store (S) operation can cause at most one cache miss. The data modify operation (M) is treated as a load followed by a store to the same address. Thus, an M operation can result in two cache hits, or a miss and a hit plus a possible eviction.
• If you would like to use C0-style contracts from 15-122, you can include contracts.h, which we have provided in the handout directory for your convenience.

6.2 Working on Part B

We have provided you with an autograding program, called test-trans.c, that tests the correctness and performance of each of the transpose functions that you have registered with the autograder.

You can register up to 100 versions of the transpose function in your trans.c file. Each transpose version has the following form:

/* Header comment */
char trans_simple_desc[] = "A simple transpose";
void trans_simple(int M, int N, int A[N][M], int B[M][N])
{
    /* your transpose code here */
}

Register a particular transpose function with the autograder by making a call of the form:

registerTransFunction(trans_simple, trans_simple_desc);

in the registerFunctions routine in trans.c. At runtime, the autograder will evaluate each registered transpose function and print the results. Of course, one of the registered functions must be the transpose_submit function that you are submitting for credit:

registerTransFunction(transpose_submit, transpose_submit_desc);

See the default trans.c function for an example of how this works.

The autograder takes the matrix size as input. It uses valgrind to generate a trace of each registered transpose function.
It then evaluates each trace by running the reference simulator on a cache with parameters (s = 5, E = 1, b = 5).

For example, to test your registered transpose functions on a 32 × 32 matrix, rebuild test-trans, and then run it with the appropriate values for M and N:

linux> make
linux> ./test-trans -M 32 -N 32
Step 1: Evaluating registered transpose funcs for correctness:
func 0 (Transpose submission): correctness: 1
func 1 (Simple row-wise scan transpose): correctness: 1
func 2 (column-wise scan transpose): correctness: 1
func 3 (using a zig-zag access pattern): correctness: 1
Step 2: Generating memory traces for registered transpose funcs.
Step 3: Evaluating performance of registered transpose funcs (s=5, E=1, b=5)
func 0 (Transpose submission): hits:1766, misses:287, evictions:255
func 1 (Simple row-wise scan transpose): hits:870, misses:1183, evictions:1151
func 2 (column-wise scan transpose): hits:870, misses:1183, evictions:1151
func 3 (using a zig-zag access pattern): hits:1076, misses:977, evictions:945
Summary for official submission (func 0): correctness=1 misses=287

In this example, we have registered four different transpose functions in trans.c. The test-trans program tests each of the registered functions, displays the results for each, and extracts the results for the official submission.

Here are some hints and suggestions for working on Part B.

• The test-trans program saves the trace for function i in file trace.fi. [2] These trace files are invaluable debugging tools that can help you understand exactly where the hits and misses for each transpose function are coming from. To debug a particular function, simply run its trace through the reference simulator with the verbose option:
  linux> ./csim-ref -v -s 5 -E 1 -b 5 -t trace.f0
  S 68312c,1 miss
  L 683140,8 miss
  L 683124,4 hit
  L 683120,4 hit
  L 603124,4 miss eviction
  S 6431a0,4 miss
  …
• Since your transpose function is being evaluated on a direct-mapped cache, conflict misses are a potential problem. Think about the potential for conflict misses in your code, especially along the diagonal. Try to think of access patterns that will decrease the number of these conflict misses.
• Blocking is a useful technique for reducing cache misses. See
  http://csapp.cs.cmu.edu/public/waside/waside-blocking.pdf
  for more information.

[2] Because valgrind introduces many stack accesses that have nothing to do with your code, we have filtered out all stack accesses from the trace. This is why we have banned local arrays and placed limits on the number of local variables.
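
The blocking hint above can be illustrated with a generic tiled loop nest. This is a sketch of the technique only, not a tuned submission: the name trans_blocked and the tile size BLOCK = 8 are assumptions made here, chosen because 8 four-byte ints fill one 32-byte cache block when b = 5 and sizeof(int) is 4. It does nothing special about the diagonal conflict misses mentioned in the hints, and no miss counts are claimed for it.

```c
#include <assert.h>

/* Assumed tile edge in ints: 8 ints = 32 bytes if sizeof(int) == 4. */
#define BLOCK 8

/*
 * Blocked (tiled) transpose sketch: copy the N x M matrix A into the
 * M x N matrix B one BLOCK x BLOCK tile at a time, so that the rows of A
 * and B touched by a tile are reused before moving on to the next tile.
 * Uses four local int variables and no local arrays.
 */
void trans_blocked(int M, int N, int A[N][M], int B[M][N])
{
    int ii, jj, i, j;

    assert(M > 0 && N > 0);

    for (ii = 0; ii < N; ii += BLOCK) {
        for (jj = 0; jj < M; jj += BLOCK) {
            /* Transpose one tile, clipping at the matrix edges. */
            for (i = ii; i < ii + BLOCK && i < N; i++) {
                for (j = jj; j < jj + BLOCK && j < M; j++) {
                    B[j][i] = A[i][j];
                }
            }
        }
    }
}
```

A function like this could be registered next to transpose_submit with registerTransFunction so that test-trans reports its misses alongside the other versions.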

6.3 Putting it all Together

We have provided you with a driver program, called ./driver.py, that performs a complete evaluation of your simulator and transpose code. This is the same program your instructor uses to evaluate your handins. The driver uses test-csim to evaluate your simulator, and it uses test-trans to evaluate your submitted transpose function on the three matrix sizes. Then it prints a summary of your results and the points you have earned.

To run the driver, type:

linux> ./driver.py

7 Handing in Your Work

Each time you type make in the cachelab-handout directory, the Makefile creates a tarball, called userid-handin.tar, that contains your current csim.c and trans.c files.

SITE-SPECIFIC: Insert text here that tells each student how to hand in their userid-handin.tar file at your school.

IMPORTANT: Do not create the handin tarball on a Windows or Mac machine, and do not hand in files in any other archive format, such as .zip, .gzip, or .tgz files.
