
ECE 463/563 Project #1: Cache Design, Memory Hierarchy Design (Version 1.0)


You must implement your project using the C, C++, or Java languages, for two reasons. First, these languages are preferred for computer architecture performance modeling. Second, our Gradescope autograder only supports compilation of these languages.

In this project, you will implement a flexible cache and memory hierarchy simulator and use it to compare the performance, area, and energy of different memory hierarchy configurations, using a subset of the SPEC 2006 benchmark suite, the SPEC 2017 benchmark suite, and/or microbenchmarks.

Design a generic cache module that can be used at any level in a memory hierarchy. For example, this cache module can be "instantiated" as an L1 cache, an L2 cache, an L3 cache, and so on. Since it can be used at any level of the memory hierarchy, it will be referred to generically as CACHE throughout this specification.

CACHE should be configurable in terms of supporting any cache size, associativity, and block size, specified at the beginning of simulation:

o SIZE: Total bytes of data storage.
o ASSOC: The associativity of the cache. (A cache in which ASSOC = # blocks in the cache = SIZE/BLOCKSIZE is a fully-associative cache.)
o BLOCKSIZE: The number of bytes in a block.

There are a few constraints on the above parameters: 1) BLOCKSIZE is a power of two and 2) the number of sets is a power of two. Note that ASSOC (and, therefore, SIZE) need not be a power of two. As you know, the number of sets is determined by the following equation:

#sets = SIZE / (ASSOC × BLOCKSIZE)

Replacement policy: CACHE should use the LRU (least-recently-used) replacement policy.

Write policy: CACHE should use the WBWA (write-back + write-allocate) write policy.

o Write-allocate: A write that misses in CACHE will cause a block to be allocated in CACHE. Therefore, both write misses and read misses cause blocks to be allocated in CACHE.
o Write-back: A write updates the corresponding block in CACHE, making the block dirty. It does not update the next level in the memory hierarchy (next level of cache or memory). If a dirty block is evicted from CACHE, a writeback (i.e., a write of the entire block) will be sent to the next level in the memory hierarchy.

Your simulator must be capable of modeling one or more instances of CACHE to form an overall memory hierarchy, as shown in Figure 1. CACHE receives a read or write request from whatever is above it in the memory hierarchy (either the CPU or another cache). The only situation where CACHE must interact with the next level below it (either another CACHE or main memory) is when the read or write request misses in CACHE. When the read or write request misses in CACHE, CACHE must "allocate" the requested block so that the read or write can be performed. Thus, let us think in terms of allocating a requested block X in CACHE. The allocation of requested block X is actually a two-step process, and the two steps must be performed in the following order: first, make space for block X (if the victim block selected for replacement is dirty, issue a write of the victim block to the next level); second, bring in block X (issue a read of block X to the next level). To summarize, when allocating a block, CACHE issues a write request (only if there is a victim block and it is dirty) followed by a read request, both to the next level of the memory hierarchy. Note that each of these two requests could themselves miss in the next level of the memory hierarchy (if the next level is another CACHE), causing a cascade of requests in subsequent levels. Fortunately, you only need to correctly implement the two steps for an allocation locally within CACHE. If an allocation is correctly implemented locally (steps 1 and 2, above), the memory hierarchy as a whole will automatically handle cascaded requests globally.
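For illustration only (not part of the specification), here is a minimal C++ sketch of the parameter math above: computing #sets and splitting a 32-bit address into tag, set index, and block offset, relying on BLOCKSIZE and #sets being powers of two. The names (CacheConfig, log2u) are illustrative, not required.

#include <cstdint>
#include <cstdio>

// Illustrative helper: log2 of a power of two (number of index/offset bits).
static unsigned log2u(uint32_t x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; ++n; }
    return n;
}

struct CacheConfig {
    uint32_t size, assoc, blocksize;
    uint32_t numSets()    const { return size / (assoc * blocksize); }
    uint32_t indexBits()  const { return log2u(numSets()); }
    uint32_t offsetBits() const { return log2u(blocksize); }
    // Set index: middle bits of the address, above the block offset.
    uint32_t setIndex(uint32_t addr) const {
        return (addr >> offsetBits()) & (numSets() - 1);
    }
    // Tag: remaining upper bits of the address.
    uint32_t tag(uint32_t addr) const {
        return addr >> (offsetBits() + indexBits());
    }
};

int main() {
    CacheConfig l1{8192, 4, 32};   // 8KB, 4-way, 32B blocks: 64 sets
    uint32_t addr = 0xffe04540;
    printf("#sets=%u set=%u tag=%x\n",
           l1.numSets(), l1.setIndex(addr), l1.tag(addr));
    return 0;
}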
After servicing a read or write request, whether the corresponding block was in the cache already (hit) or had just been allocated (miss), remember to update other state. This state includes the LRU counters affiliated with the set as well as the valid and dirty bits affiliated with the requested block.

Figure 1. Your simulator must be capable of modeling one or more instances of CACHE to form an overall memory hierarchy.

Students enrolled in ECE 563 must additionally augment CACHE with a prefetch unit. The prefetch unit implements Stream Buffers. In this project, consider the prefetch unit to be an extension implemented within CACHE. This preserves the clean abstraction of one or more instances of CACHE interacting in an overall memory hierarchy (see Figure 1), where each CACHE may have a prefetch unit within it.

Your generic implementation of CACHE should support a configurable prefetch unit as follows. The prefetch unit has N Stream Buffers. Each Stream Buffer contains M memory blocks. Both N and M should be configurable. Setting N=0 disables the prefetch unit.

A Stream Buffer is a simple queue that is capable of holding M consecutive memory blocks. A Stream Buffer has a single valid bit that indicates the validity of the buffer as a whole. If its valid bit is 0, the Stream Buffer is empty and doesn't contain a prefetch stream. If its valid bit is 1, the Stream Buffer is full and contains a prefetch stream (M consecutive memory blocks).

When CACHE receives a read or write request for block X, both CACHE and its Stream Buffer are checked for a hit. Note that all Stream Buffer entries, not just the first entry (as in the original Stream Buffer paper), are searched for block X. There are four possible scenarios, depending on whether the request hits or misses in CACHE and hits or misses in the Stream Buffer:

o Scenario #1: Block X misses in CACHE and misses in the Stream Buffer. Handle the CACHE miss as usual (allocate block X from the next level). Then create a new prefetch stream: prefetch blocks X+1 through X+M into the Stream Buffer.
o Scenario #2: Block X misses in CACHE but hits in the Stream Buffer (the Stream Buffer contains the requested block X, in this scenario), so the Stream Buffer supplies block X to CACHE instead of the next level of the memory hierarchy. Next, manage the Stream Buffer as illustrated in Figure 2.
o Scenario #3: Block X hits in CACHE and misses in the Stream Buffer. The Stream Buffer is unaffected.
o Scenario #4: Block X hits in CACHE and also hits in the Stream Buffer. Manage the Stream Buffer as in Scenario #2 (Figure 2), to keep its prefetch stream in sync with demand references.

Notice in the "before" picture of Figure 2, the fourth entry of the Stream Buffer hit (it contained the requested block X). As shown in the "after" picture, all blocks before and including block X (X-3, X-2, X-1, X) are removed from the Stream Buffer, the blocks after block X (X+1, X+2) are "shifted up", and the newly freed entries are refilled by prefetching the next consecutive blocks (issue prefetches of blocks X+3, X+4, X+5, X+6). A non-shifting circular buffer implementation, based on a head pointer that points to the least block address in the prefetch stream, is more efficient in real hardware and in software simulators, and is illustrated in Figure 3.

Figure 2. Managing the Stream Buffer when there is a hit in the Stream Buffer (scenarios #2 and #4).

Figure 3. A non-shifting circular buffer implementation is more efficient in hardware and in software simulators.

The operation of a single Stream Buffer, described above, extends to multiple Stream Buffers. The main difference is that all Stream Buffers are checked for a hit. For Scenario #1 (request misses in CACHE and misses in all Stream Buffers), one of the Stream Buffers must be chosen for the new prefetch stream: select the least-recently-used Stream Buffer, i.e., apply the LRU policy to the Stream Buffers as a whole.
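As a concrete reading of Figure 3 (for illustration only), here is a minimal C++ sketch of one Stream Buffer as a non-shifting circular queue: on a hit, the head advances past block X and the freed slots are refilled with the next consecutive block addresses. The names (StreamBuffer, newStream, onHit) are illustrative.

#include <cstdint>
#include <cstddef>
#include <vector>

struct StreamBuffer {
    bool valid = false;
    std::vector<uint32_t> blocks;   // M block addresses, used circularly
    std::size_t head = 0;           // slot holding the least block address
    uint32_t nextPrefetch = 0;      // next block address to prefetch

    explicit StreamBuffer(std::size_t m) : blocks(m) {}

    // Search all M entries for block address x; return slot index or -1.
    int find(uint32_t x) const {
        if (!valid) return -1;
        for (std::size_t i = 0; i < blocks.size(); ++i)
            if (blocks[i] == x) return (int)i;
        return -1;
    }

    // Scenario #1: create a new stream X+1 .. X+M.
    void newStream(uint32_t x) {
        valid = true;
        head = 0;
        nextPrefetch = x + 1;
        for (std::size_t i = 0; i < blocks.size(); ++i)
            blocks[i] = nextPrefetch++;   // "issue" prefetch of this block
    }

    // Scenarios #2/#4: hit at 'slot'; drop entries up to and including X,
    // and continue the stream into the freed slots (Figure 2 semantics,
    // implemented without shifting).
    void onHit(std::size_t slot) {
        std::size_t m = blocks.size();
        std::size_t freed = (slot + m - head) % m + 1;  // entries <= X
        for (std::size_t i = 0; i < freed; ++i) {
            blocks[head] = nextPrefetch++;  // "issue" prefetch of this block
            head = (head + 1) % m;
        }
    }
};

For example, with M=6, a buffer holding X-3 .. X+2 (head at X-3, nextPrefetch = X+3) that hits on X ends up holding X+1 .. X+6 with head at X+1, matching the "after" picture of Figure 2.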
When a new stream is prefetched into a particular Stream Buffer (Scenario #1), or a particular Stream Buffer supplies a requested block to CACHE (Scenario #2), or we are keeping a Stream Buffer in sync (Scenario #4), that Stream Buffer becomes the most-recently-used buffer.

Policy for multiple Stream Buffer hits: It is possible for two or more Stream Buffers to have some blocks in common (redundancy). For example, suppose all Stream Buffers are initially invalid and CACHE is empty; the CPU requests block X, which creates the prefetch stream X+1 to X+6 in a first Stream Buffer (assume M=6); and then the CPU requests block X-2, which creates the prefetch stream X-1 to X+4 in a second Stream Buffer; thus, after these initial two misses, the Stream Buffers have X+1 to X+4 in common. Other scenarios create redundancy as well, such as one continuing prefetch stream reaching the start of another prefetch stream. Redundancy means that a given request may hit in multiple Stream Buffers. Managing multiple Stream Buffers as in Figure 2, for the same hit, results in redundant prefetches because the multiple Stream Buffers will all try to continue their overlapping streams. A simple solution is to only consider the hit to the most-recently-used Stream Buffer among those that hit, and ignore the other hits. From a simulator standpoint, this could mean (for example) searching Stream Buffers for a hit in recency order and stopping at the first hit; a sketch of this appears after this section. Only that Stream Buffer is managed as shown in Figure 2, i.e., only that Stream Buffer continues its prefetch stream.

A Stream Buffer never contains dirty blocks; that is, it never contains a block whose content differs from the same block in the next level of the memory hierarchy. The benefit of this design is that replacing the contents of the Stream Buffer will never require writebacks from the Stream Buffer.

In this section, we discuss a Stream Buffer complication that we will handle conceptually. The problem and solution are only discussed out of academic interest; the solution does not require any explicit support in the simulator. Consider that a dirty copy of block Y may exist in CACHE while a clean copy of block Y exists in a Stream Buffer. One simple way to get into this situation (assume M=6): the CPU writes block Y, allocating Y in CACHE and making it dirty; a later miss on a nearby block (e.g., block Y-1) then creates a prefetch stream in a Stream Buffer that includes a clean copy of block Y. Now, suppose CACHE evicts its dirty copy of block Y (e.g., it is replaced by a missed block Z) before referencing it again (fyi: referencing it as a hit might wipe it from the Stream Buffer to keep the latter's prefetch stream in sync with demand references, as per Scenario #4). Stale block Y still exists in the Stream Buffer, which could lead to incorrect operation in the future, namely, when the CPU requests block Y again and hits on the stale copy in the Stream Buffer. We will assume a solution that does NOT require any code in your simulator: when a dirty block Y is evicted from CACHE (i.e., when there is a writeback), any Stream Buffers that contain block Y update their copy of block Y. In this way, a Stream Buffer's copy of block Y will remain clean and up to date with respect to the next level, since the writeback is performed not only in the next level but also in the Stream Buffer. In addition, let us also assume that this operation does NOT update recency among Stream Buffers. Therefore, the only effect is updating data, and your simulator does not model data.
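For illustration only, the recency-ordered search described above might look like the following C++ sketch, continuing the hypothetical StreamBuffer type from the earlier sketch: buffers are kept in an MRU-to-LRU list, the search stops at the first (most-recently-used) hit, and the winning buffer moves to the MRU position.

#include <cstdint>
#include <cstddef>
#include <iterator>
#include <list>
// Assumes the StreamBuffer sketch shown earlier.

// Illustrative prefetch unit: N Stream Buffers kept in recency order,
// front = most-recently-used (MRU), back = least-recently-used (LRU).
struct PrefetchUnit {
    std::list<StreamBuffer> buffers;

    PrefetchUnit(std::size_t n, std::size_t m) : buffers(n, StreamBuffer(m)) {}

    // Returns true if block x hit in some Stream Buffer. Searches in
    // recency order and stops at the first hit, so redundant copies in
    // less-recently-used buffers are ignored.
    bool access(uint32_t x, bool cacheMiss) {
        for (auto it = buffers.begin(); it != buffers.end(); ++it) {
            int slot = it->find(x);
            if (slot >= 0) {
                it->onHit((std::size_t)slot);                  // continue this stream
                buffers.splice(buffers.begin(), buffers, it);  // make it MRU
                return true;
            }
        }
        if (cacheMiss && !buffers.empty()) {
            // Scenario #1: missed everywhere; replace the LRU buffer
            // with a new stream and make it MRU.
            buffers.back().newStream(x);
            buffers.splice(buffers.begin(), buffers, std::prev(buffers.end()));
        }
        return false;
    }
};

Setting N=0 (an empty list) naturally disables the prefetch unit: the loop finds no hit and no new stream is created.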
While Figure 1 illustrates an arbitrary memory hierarchy, you will only study the memory hierarchy configurations shown in Figure 4 (ECE 463) and Figure 5 (ECE 563). Also, these are the only configurations that Gradescope will test. For this project, all CACHEs in the memory hierarchy will have the same BLOCKSIZE.

Figure 4. ECE 463: Two configurations to be studied.

Figure 5. ECE 563: Four configurations to be studied.

The simulator reads a trace file in the following format:

r|w <hex address>
r|w <hex address>
…

"r" (read) indicates a load and "w" (write) indicates a store from the processor. Example:

r ffe04540
r ffe04544
w 0eff2340
r ffe04548
…

Traces are posted on the Moodle website. NOTE: All addresses are 32 bits. When expressed in hexadecimal format (hex), an address is 8 hex digits, as shown in the example trace above. In the actual trace files, you may notice some addresses are comprised of fewer than 8 hex digits: this is because leading 0's are not explicitly shown. For example, an address "ffff" is really "0000ffff", because all addresses are 32 bits, i.e., 8 nibbles.

The simulator executable built by your Makefile must be named "sim" (the Makefile is discussed in Section 8). Your simulator must accept exactly 8 command-line arguments in the following order:

sim <BLOCKSIZE> <L1_SIZE> <L1_ASSOC> <L2_SIZE> <L2_ASSOC> <PREF_N> <PREF_M> <trace_file>

Example: 8KB 4-way set-associative L1 cache with 32B block size, 256KB 8-way set-associative L2 cache with 32B block size, L2 prefetch unit with 3 Stream Buffers of 10 blocks each, gcc trace:

sim 32 8192 4 262144 8 3 10 gcc_trace.txt

Your simulator should output a set of lettered measurements; see Section 8 regarding the formatting of these outputs and validating your simulator (the letter labels below refer to the itemized measurements shown in the validation runs). In particular, the total memory traffic measurement (with L2) should match i+k+m+o+p: all L2 read misses + L2 write misses + writebacks from L2 + L2 prefetches; (without L2) it should match b+d+f+g: L1 read misses + L1 write misses + writebacks from L1 + L1 prefetches.

† For this project, as shown in Figure 5 for ECE 563 students, prefetching is only tested and explored in the last-level cache of the memory hierarchy. This means that measurements j and k, above, should always be 0 because the L1 will not issue prefetch requests to the L2. Nonetheless, a well-done implementation of a generic CACHE will distinguish incoming demand read requests from incoming prefetch read requests, even though in this project the distinction will not be exercised.

Note for ECE 463 students: Just assume and print 0 for any prefetch-specific measurement. These are: g, j, k, p.

Sample simulation outputs are provided on the Moodle site. These are called "validation runs". Refer to the validation runs to see how to format the outputs of your simulator. You must submit, validate, and self-grade[2] your project using Gradescope.
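For illustration only, here is one possible way (not required by this specification) to read the eight arguments and the trace format in C++; error handling is trimmed and variable names are illustrative.

#include <cstdio>
#include <cstdint>
#include <cstdlib>

int main(int argc, char* argv[]) {
    if (argc != 9) {   // program name + exactly 8 arguments
        fprintf(stderr, "usage: sim <BLOCKSIZE> <L1_SIZE> <L1_ASSOC> "
                        "<L2_SIZE> <L2_ASSOC> <PREF_N> <PREF_M> <trace_file>\n");
        return 1;
    }
    uint32_t blocksize = (uint32_t)strtoul(argv[1], nullptr, 10);
    uint32_t l1_size   = (uint32_t)strtoul(argv[2], nullptr, 10);
    uint32_t l1_assoc  = (uint32_t)strtoul(argv[3], nullptr, 10);
    uint32_t l2_size   = (uint32_t)strtoul(argv[4], nullptr, 10);
    uint32_t l2_assoc  = (uint32_t)strtoul(argv[5], nullptr, 10);
    uint32_t pref_n    = (uint32_t)strtoul(argv[6], nullptr, 10);
    uint32_t pref_m    = (uint32_t)strtoul(argv[7], nullptr, 10);
    // ... construct L1 (and L2, if l2_size != 0) from these parameters ...

    FILE* fp = fopen(argv[8], "r");
    if (!fp) { perror(argv[8]); return 1; }

    char rw;
    uint32_t addr;
    // Each trace line is "r <hex address>" or "w <hex address>".
    // %x correctly handles addresses with fewer than 8 hex digits
    // (implicit leading zeros).
    while (fscanf(fp, " %c %x", &rw, &addr) == 2) {
        if (rw == 'r') { /* issue read of addr to L1 */ }
        else           { /* issue write of addr to L1 */ }
    }
    fclose(fp);
    // ... print the required measurements ...
    return 0;
}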
Here is how Gradescope (1) receives your project (zip file), (2) compiles your simulator (Makefile), and (3) runs and checks your simulator (arguments, print-to-console requirement, and "diff -iw"). Because your output is compared to the validation runs using "diff -iw", it does not need to match the exact number of spaces or tabs as long as there is some whitespace where the validation runs have whitespace. Note, however, that extra or missing blank lines are NOT ok: "diff -iw" does not ignore extra or missing blank lines.

See the report template in Moodle for experiments, graphs, and analysis. Use the report template as the basis for the report that you submit (insert graphs, fill in answers to questions, etc.). Below, you will find information about calculating AAT, area, and energy for a given memory hierarchy configuration.

Calculating AAT, Area, and Energy

Table 1 gives names and descriptions of parameters and how to obtain them.

Table 1. Parameters, descriptions, and how you obtain these parameters. (* We will not be using energy in any of the experiments for this semester's Project 1; thus, the energy parameters are grayed-out in Table 1.)

For a memory hierarchy without an L2 cache:

Total access time = (L1 reads + L1 writes) · HT_L1 + (L1 read misses + L1 write misses) · Miss_Penalty

AAT = HT_L1 + [(L1 read misses + L1 write misses) / (L1 reads + L1 writes)] · Miss_Penalty
    = HT_L1 + MR_L1 · Miss_Penalty

For a memory hierarchy with an L2 cache:

Total access time = (L1 reads + L1 writes) · HT_L1 + (L1 read misses + L1 write misses) · HT_L2 + (L2 read misses not originating from L1 prefetches) · Miss_Penalty

AAT = HT_L1 + MR_L1 · HT_L2 + [(L2 read misses not originating from L1 prefetches) / (L1 reads + L1 writes)] · Miss_Penalty

Since the L2 reads that do not originate from L1 prefetches are exactly the L1 read misses plus L1 write misses, the last term can be rewritten using the L2 miss rate, MR_L2 = (L2 read misses not originating from L1 prefetches) / (L2 reads not originating from L1 prefetches):

AAT = HT_L1 + MR_L1 · (HT_L2 + MR_L2 · Miss_Penalty)

The total area of the caches:

Area = A_L1 + A_L2

If a particular cache does not exist in the memory hierarchy configuration, then its area is 0. Note that it is difficult to estimate the area of the prefetch unit using CACTI due to the specialized structure of the Stream Buffers.

Dynamic energy estimates: Each read or write request to a cache consumes that cache's access energy. Each read or write request that misses in the cache causes a "line fill" (allocation) into the cache, which also consumes that cache's access energy.[3] Each writeback of an evicted dirty block involves reading that block from the cache, which also consumes that cache's access energy.
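As a sanity check of the AAT algebra (for illustration only), here is a minimal C++ sketch; all timing parameters and counts are made-up placeholders, not CACTI numbers or validation-run statistics.

#include <cstdio>

// Illustrative AAT calculation following the formulas above.
int main() {
    // Hypothetical timing parameters (ns).
    double HT_L1 = 0.7, HT_L2 = 2.0, Miss_Penalty = 20.0;

    // Hypothetical measured counts.
    double l1_reads = 60000, l1_writes = 40000;
    double l1_read_misses = 4000, l1_write_misses = 2000;
    double l2_demand_reads = l1_read_misses + l1_write_misses;
    double l2_demand_read_misses = 1500;  // not originating from L1 prefetches

    double MR_L1 = (l1_read_misses + l1_write_misses) / (l1_reads + l1_writes);
    double MR_L2 = l2_demand_read_misses / l2_demand_reads;

    double aat_l1_only = HT_L1 + MR_L1 * Miss_Penalty;                  // without L2
    double aat_l1_l2   = HT_L1 + MR_L1 * (HT_L2 + MR_L2 * Miss_Penalty); // with L2

    printf("MR_L1=%.4f MR_L2=%.4f\n", MR_L1, MR_L2);
    printf("AAT (L1 only)=%.4f ns, AAT (L1+L2)=%.4f ns\n",
           aat_l1_only, aat_l1_l2);
    return 0;
}

With these placeholder numbers, MR_L1 = 0.06 and MR_L2 = 0.25, giving AAT = 1.9 ns without L2 and 1.12 ns with L2.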
For a memory hierarchy without an L2 cache:

Total dynamic energy = (L1 reads + L1 writes + L1 read misses + L1 write misses + L1 writebacks) * E_L1 + (L1 read misses + L1 write misses + L1 writebacks + L1 prefetches) * E_MEM

For a memory hierarchy with an L2 cache:

Total dynamic energy = (L1 reads + L1 writes + L1 read misses + L1 write misses + L1 writebacks) * E_L1 + (all L2 reads + L2 writes + all L2 read misses + L2 write misses + L2 writebacks) * E_L2 + (all L2 read misses + L2 write misses + L2 writebacks + L2 prefetches) * E_MEM

Average dynamic energy per access = (total dynamic energy) / (L1 reads + L1 writes)

Table 2 shows the breakdown of points for the project:

Table 2. Breakdown of points.
[30 points] Substantial programming effort.
[50 points] A working simulator, as determined by matching validation runs.
[20 points] Experiments and report. You can only get credit for experiments with L1 if your simulator passes BOTH validation runs #1 and #2. You can only get credit for L1+L2 experiments if your simulator passes BOTH validation runs #3 and #4. You can only get credit for L1+pref experiments if your simulator passes BOTH validation runs #5 and #6.

Analysis:
463 max points with just L1 working and corresponding graphs+discussion (#1,2,4): 56 (sim.) + 12 (exp.) = 68
563 max points with just L1 working and corresponding graphs+discussion (#1,2,4): 50 (sim.) + 10 (exp.) = 60
563 max points with everything but pref. working, and corr. graphs+discussion (#1-5): 70 (sim.) + 17 (exp.) = 87
563 max points with everything but L2 working, and corr. graphs+discussion (#1,2,4,T1): 56 (sim.) + 13 (exp.) = 69

Grading of "Substantial programming effort": If your simulator passes at least one validation run, the TAs will automatically credit 30 points for "Substantial programming effort". If your simulator does not match any validation runs, you can receive between 0 and 30 points for "Substantial programming effort" depending on how many implementation aspects are covered. The TAs will use a rubric for assessing partial credit for implementation effort of a basic cache class. To get credit for a given rubric item, the code attempt must be valid and substantial. Note that the 0-30 points for implementation effort covers only a basic cache class. On the positive side, this incentivizes students to at least attempt to get an L1 cache fully working and get credit for experiments with L1 only. On the other hand, as a counterexample, suppose a 563 student strives to get their prefetcher working but fails to match any validation runs with prefetchers; there is no partial credit for implementation effort of the prefetcher itself. The incentive for the student to code the prefetcher and get it working is the additional point value of passing the prefetcher validation runs and the point value of the associated experiments.

Other notes about grading, and various deductions (out of 100 points): -1 point for each day (24-hour period) late, according to the Gradescope timestamp. The late penalty is pro-rated on an hourly basis: -1/24 point for each hour late. We will use the "ceiling" function of the lateness time to get to the next higher hour, e.g., ceiling(10 min. late) = 1 hour late, ceiling(1 hr. 10 min. late) = 2 hours late, and so forth. For this first project, Gradescope will accept late submissions no more than two weeks after the deadline.
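A small worked sketch of the energy accounting above (C++, for illustration only; the per-access energies and counts are made-up placeholders, not CACTI outputs or validation-run statistics):

#include <cstdio>

// Illustrative dynamic-energy accounting for the L1+L2 hierarchy,
// following the formulas above.
int main() {
    // Hypothetical per-access energies (nJ).
    double E_L1 = 0.05, E_L2 = 0.2, E_MEM = 2.0;

    // Hypothetical measured counts.
    double l1_reads = 60000, l1_writes = 40000;
    double l1_read_misses = 4000, l1_write_misses = 2000, l1_writebacks = 1200;
    double l2_reads = 6000, l2_writes = 1200;   // "all" = demand + prefetch
    double l2_read_misses = 1800, l2_write_misses = 300;
    double l2_writebacks = 500, l2_prefetches = 900;

    double total =
        (l1_reads + l1_writes + l1_read_misses + l1_write_misses
         + l1_writebacks) * E_L1 +
        (l2_reads + l2_writes + l2_read_misses + l2_write_misses
         + l2_writebacks) * E_L2 +
        (l2_read_misses + l2_write_misses + l2_writebacks
         + l2_prefetches) * E_MEM;

    printf("total = %.1f nJ, avg per access = %.5f nJ\n",
           total, total / (l1_reads + l1_writes));
    return 0;
}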
The goal of this policy is to encourage forward progress for other work in the class. See Section 1.1 for penalties and sanctions for academic integrity violations.

It is good practice to frequently make backups of all your project files, including source code, your report, etc. You can back up files to another hard drive (your NFS B: drive in your NCSU account, home PC, laptop... keep consistent copies in multiple places) or removable media (flash drive, etc.).

Correctness of your simulator is of paramount importance. That said, making your simulator efficient is also important because you will be running many experiments: many memory hierarchy configurations and multiple traces. Therefore, you will benefit from implementing a simulator that is reasonably fast. One simple thing you can do to make your simulator run faster is to compile it with a high optimization level. The example Makefile posted on the Moodle site includes the -O3 optimization flag. Note that, when you are debugging your simulator in a debugger (such as gdb), it is recommended that you compile without -O3 and with -g. Optimization includes register allocation; often, register-allocated variables are not displayed properly in debuggers, which is why you want to disable optimization when using a debugger. The -g flag tells the compiler to include symbols (variable names, etc.) in the compiled binary. The debugger needs this information to recognize variable names, function names, line numbers in the source code, etc. When you are done debugging, recompile with -O3 and without -g to get the most efficient simulator again. As mentioned in Section 8, another reason for being wary of excessive run times is Gradescope's autograder timeout.

[1] For accurate performance accounting using the Average Access Time (AAT) expression, you will need to convey to the next level in the memory hierarchy that these read requests are prefetches. This will enable the next level in the memory hierarchy to distinguish between 1) its read misses that originated from normal read requests and 2) its read misses that originated from prefetch read requests. Note that this is only needed for accurate performance accounting.

[2] The mystery runs component of your grade will not be published until we release it. The report will be manually graded by the TAs.

[3] Note: There is a noticeable underestimate of energy when a request misses in the cache but hits in its stream buffer. We don't count this as a miss (it doesn't get counted as an L1 read miss or L1 write miss) and we don't explicitly count this scenario. Yet this scenario also involves a "line fill" (allocation) into the cache (transfer of the block from the stream buffer to the cache). This could be fixed by explicitly counting this scenario, but we shall ignore it in this project.
