- Given the following MIPS code snippet (note that instruction #6 could be anything):

```asm
loop:
1.  addi $t0, $t0, 4
2.  lw   $v0, 0($t0)
3.  sw   $v0, 20($t0)
4.  lw   $s0, 60($t0)
5.  bne  $s0, $0, loop
6.  ## The following instruction could be anything!
```
- Detect hazards and insert no-ops to ensure correct operation. Assume no delayed branch, no forwarding units, registers that cannot be read and written in the same cycle, and no interlocked pipeline stages. Your answer should take the form of pair(s) of numbers num@location, indicating that num no-ops should be placed at location. E.g., if you wanted to place 6 no-ops between lines 2 and 3 (i.e., location=2.5) and 8 no-ops between lines 5 and 6 (i.e., location=5.5), you would write: 6@2.5, 8@5.5.
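The bookkeeping behind this question can be sketched mechanically. The following is a hedged Python sketch, not part of the original problem: it assumes a classic five-stage pipeline (IF, ID, EX, MEM, WB) in which, with no forwarding, no interlocks, and no same-cycle register write/read, a consumer's ID stage must come strictly after the producer's WB stage. The function name `noops_needed` and the tuple encoding are inventions of this sketch.

```python
def noops_needed(instrs):
    """instrs: ordered list of (writes, reads) register-name tuples.
    Returns [(location, count), ...] where location i.5 means
    'insert count no-ops between instructions i and i+1' (1-based)."""
    ready = {}   # register -> first cycle in which an ID stage may read it
    fetch = 0    # fetch cycle of the current instruction
    out = []
    for i, (writes, reads) in enumerate(instrs, start=1):
        earliest = fetch
        for r in reads:
            if r in ready:
                earliest = max(earliest, ready[r] - 1)  # ID happens at fetch + 1
        if earliest > fetch:
            out.append((i - 0.5, earliest - fetch))     # no-ops delay the fetch
        fetch = earliest
        for r in writes:
            ready[r] = fetch + 5   # strictly after WB, which is at fetch + 4
        fetch += 1
    return out

loop = [
    (["$t0"], ["$t0"]),          # 1: addi $t0, $t0, 4
    (["$v0"], ["$t0"]),          # 2: lw   $v0, 0($t0)
    ([],      ["$v0", "$t0"]),   # 3: sw   $v0, 20($t0)
    (["$s0"], ["$t0"]),          # 4: lw   $s0, 60($t0)
    ([],      ["$s0"]),          # 5: bne  $s0, $0, loop
]
print(noops_needed(loop))
```

Under these assumptions each adjacent read-after-write dependence needs three no-ops; working the pairs by hand against the printed list is a good cross-check.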
- Now, reorder/rewrite the program to maximize performance. Assume delayed branch and forwarding units, but no interlocked pipeline stages. For unknown reasons, the first instruction after the loop label must be the addi. Feel free to insert no-ops where needed. You should be able to do it using 6 instructions per loop (10 pts) or only 5 instructions (hard, 15 pts).
```asm
______________________________   ## Extra instruction before loop (if any)
______________________________   ## Extra instruction before loop (if any)
loop:
    addi $t0, $t0, 4
    ______________________________
    ______________________________
    ______________________________
    ______________________________
    ______________________________
    ## The following instruction could be anything!
```
- Name three reasons why we use caches in a pipelined processor.
- Circle whether each change will make something decrease (-), increase (+) or remain the same (=). The effects you will consider are the number of instructions in the program, the amount of cycles for each instruction (in a pipelined processor) and the amount of time for each cycle (seconds/cycle).
| | Instructions/Program | Cycles/Instruction | Seconds/Cycle |
| --- | --- | --- | --- |
| Reducing the number of registers in the ISA | - / = / + | - / = / + | - / = / + |
| Adding a branch delay slot | - / = / + | - / = / + | - / = / + |
| Merging the Execute and Mem stages (loads and stores use an additional adder to calculate base+offset) | - / = / + | - / = / + | - / = / + |
| Changing the implementation from a microcoded CISC machine to a RISC pipeline | - / = / + | - / = / + | - / = / + |
- Consider the following loop.

```asm
loop: lw  r1, 0(r1)
      and r1, r1, r2
      lw  r1, 0(r1)
      lw  r1, 0(r1)
      beq r1, r0, loop
```
Assume that perfect branch prediction is used (no stalls due to control hazards), that there are no delay slots, and that the pipeline has full forwarding support. Also assume that many iterations of this loop are executed before the loop exits.
- Show a pipeline execution diagram for the third iteration of this loop, from the cycle in which we fetch the first instruction of that iteration up to (but not including) the cycle in which we can fetch the first instruction of the next iteration. Show all instructions that are in the pipeline during these cycles (not just those from the third iteration).
- How often (as a percentage of all cycles) do we have a cycle in which all five pipeline stages are doing useful work?
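Before drawing the diagram, the per-iteration stall count can be sanity-checked with a short script. This is a hedged sketch under the stated assumptions: full forwarding, so the only stall is the one-cycle load-use hazard when a load is immediately followed by an instruction that reads the loaded register (taken here to apply to the beq as well). The name `iteration_cycles` and the tuple encoding are inventions of this sketch.

```python
def iteration_cycles(loop):
    """loop: list of (opcode, writes, reads). Counts one-cycle load-use
    stalls over one steady-state iteration, wrapping the dependence of
    the first instruction around to the last one of the prior iteration."""
    n = len(loop)
    stalls = 0
    for i, (_, _, reads) in enumerate(loop):
        prev_op, prev_writes, _ = loop[(i - 1) % n]
        # load-use: previous instruction is a load whose destination we read
        if prev_op == "lw" and any(r in prev_writes for r in reads):
            stalls += 1
    return stalls, n + stalls   # (stalls, cycles per steady-state iteration)

loop = [
    ("lw",  ["r1"], ["r1"]),        # lw  r1, 0(r1)
    ("and", ["r1"], ["r1", "r2"]),  # and r1, r1, r2
    ("lw",  ["r1"], ["r1"]),        # lw  r1, 0(r1)
    ("lw",  ["r1"], ["r1"]),        # lw  r1, 0(r1)
    ("beq", [],     ["r1", "r0"]),  # beq r1, r0, loop
]
print(iteration_cycles(loop))
```

Under these assumptions the steady-state iteration length printed here should agree with the cycle count read off your pipeline diagram.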
- Caches are important to providing a high-performance memory hierarchy to processors. Below is a list of 32-bit memory address references, given as word addresses (addresses refer to words, not bytes): 3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253
- For each of these references, identify the binary address (in words, not bytes), the tag, and the index given a direct-mapped cache with 16 one-word blocks. Also list whether each reference is a hit or a miss, assuming the cache is initially empty and the references are accessed in the order listed. The first address has been done as an example.
| Word Address | Binary Address | Tag | Index | Hit (H) / Miss (M) |
| --- | --- | --- | --- | --- |
| 3 | 0000 0011 | 0 | 3 | M |
| 180 | | | | |
| 43 | | | | |
| 2 | | | | |
| 191 | | | | |
| 88 | | | | |
| 190 | | | | |
| 14 | | | | |
| 181 | | | | |
| 44 | | | | |
| 186 | | | | |
| 253 | | | | |
- For each of these references, identify the binary address (in words), the tag, and the index given a direct-mapped cache with two-word blocks and a total size of 8 blocks. Also list if each reference is a hit or a miss, assuming the cache is initially empty.
| Word Address | Binary Address | Tag | Index | Hit (H) / Miss (M) |
| --- | --- | --- | --- | --- |
| 3 | 0000 0011 | 0 | 1 | M |
| 180 | | | | |
| 43 | | | | |
| 2 | | | | |
| 191 | | | | |
| 88 | | | | |
| 190 | | | | |
| 14 | | | | |
| 181 | | | | |
| 44 | | | | |
| 186 | | | | |
| 253 | | | | |
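Both tables can be checked mechanically once the tag/index arithmetic is written down. A hedged Python sketch of a direct-mapped, word-addressed cache (the function name `simulate` and the result encoding are inventions of this sketch):

```python
def simulate(addresses, num_blocks, words_per_block):
    """Direct-mapped cache over word addresses.
    Returns one (address, tag, index, 'H'/'M') tuple per reference."""
    cache = {}   # index -> tag of the block currently resident there
    results = []
    for a in addresses:
        block = a // words_per_block      # block number (offset bits removed)
        index = block % num_blocks        # which cache entry
        tag = block // num_blocks         # remaining high-order bits
        hit = cache.get(index) == tag
        if not hit:
            cache[index] = tag            # fill (or replace) on a miss
        results.append((a, tag, index, "H" if hit else "M"))
    return results

refs = [3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253]
part_a = simulate(refs, num_blocks=16, words_per_block=1)
part_b = simulate(refs, num_blocks=8, words_per_block=2)
print(sum(r[3] == "M" for r in part_a), "misses with 16 one-word blocks")
print(sum(r[3] == "M" for r in part_b), "misses with 8 two-word blocks")
```

Note that the first tuple of `part_b` reproduces the worked example row (tag 0, index 1, miss), which is a quick way to validate the arithmetic before trusting the rest of the table.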
- You are asked to optimize a cache design for the above references. There are three direct-mapped cache designs possible, all with a total of 8 words of data: C1 has 1-word blocks, C2 has 2-word blocks, and C3 has 4-word blocks. If the miss stall time is 25 cycles, and C1 has an access time of 2 cycles, C2 takes 3 cycles, and C3 takes 5 cycles, which is the best cache design? (Use a table, similar to part a and b, to help you evaluate the hit/miss rate of C1, C2, and C3.)
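The comparison reduces to computing each design's miss count and an average access time of the form hit time + miss rate × 25 cycles. A hedged sketch under those assumptions (the names `miss_count` and `designs` are inventions here, and the simple average-time model is an assumption, not the only possible cost model):

```python
def miss_count(addrs, num_blocks, words_per_block):
    """Misses of a direct-mapped cache over word addresses."""
    cache, misses = {}, 0
    for a in addrs:
        block = a // words_per_block
        index = block % num_blocks
        tag = block // num_blocks
        if cache.get(index) != tag:
            misses += 1
            cache[index] = tag
    return misses

refs = [3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253]
designs = {                 # name: (words per block, access time in cycles)
    "C1": (1, 2),           # all three designs hold 8 words of data total
    "C2": (2, 3),
    "C3": (4, 5),
}
for name, (bw, access) in designs.items():
    misses = miss_count(refs, num_blocks=8 // bw, words_per_block=bw)
    amat = access + 25 * misses / len(refs)   # average cycles per reference
    print(name, "misses:", misses, "avg time:", round(amat, 2))
```

The design with the lowest average time wins; note how the larger blocks trade fewer compulsory misses against more conflict misses and a slower hit time.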
- Below are listed the parameters for a direct-mapped cache design:
Cache Data Size: 32 KiB
Cache Block Size: 2 words
Cache Access Time: 1 cycle
Generate a series of read requests that has a lower miss rate on a 2 KiB 2-way set-associative cache than on the cache listed above. Then identify one possible change that would give the cache listed above an equal or lower miss rate than the 2 KiB cache, and discuss the advantages and disadvantages of that change.
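One way to construct such a request stream is to alternate between two addresses exactly 32 KiB apart: in the 32 KiB direct-mapped cache they map to the same line and evict each other on every access, while the 2-way cache keeps both blocks in the two ways of one set. The following is a hedged sketch demonstrating this, assuming 8-byte (2-word) blocks in both caches and LRU replacement in the 2-way cache; the function names are inventions of this sketch:

```python
def dm_misses(addrs, size_bytes, block_bytes):
    """Miss count of a direct-mapped cache over byte addresses."""
    nblocks = size_bytes // block_bytes
    lines, misses = {}, 0
    for a in addrs:
        block = a // block_bytes
        index = block % nblocks
        if lines.get(index) != block:
            misses += 1
            lines[index] = block
    return misses

def twoway_misses(addrs, size_bytes, block_bytes):
    """Miss count of a 2-way set-associative cache with LRU replacement."""
    nsets = size_bytes // block_bytes // 2
    sets, misses = {}, 0
    for a in addrs:
        block = a // block_bytes
        ways = sets.setdefault(block % nsets, [])
        if block in ways:
            ways.remove(block)        # hit: refresh LRU order
        else:
            misses += 1
            if len(ways) == 2:
                ways.pop(0)           # evict least recently used way
        ways.append(block)
    return misses

pattern = [0, 32 * 1024] * 10   # alternate two blocks 32 KiB apart
print(dm_misses(pattern, 32 * 1024, 8))     # thrashes: misses every time
print(twoway_misses(pattern, 2 * 1024, 8))  # only the two cold misses
```

A change that restores the larger cache's advantage is giving it associativity as well; the trade-off is extra comparators and way-select logic, and potentially a longer hit time.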
- For a direct-mapped cache design with a 32-bit address, the following bits of the address are used to access the cache:
| Tag | Index | Offset |
| --- | --- | --- |
| 31-10 | 9-5 | 4-0 |
- What is the cache block size (in words)?
- How many entries does the cache have?
- What is the ratio between total bits required for such a cache implementation over the data storage bits?
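The field widths in the table pin down the geometry, so the three sub-questions are a short computation. A hedged worked sketch, assuming 4-byte words and one valid bit per entry (a common textbook convention, not stated in the problem):

```python
offset_bits, index_bits, tag_bits = 5, 5, 22   # from bits 4-0, 9-5, 31-10

block_bytes = 2 ** offset_bits        # bytes per block
block_words = block_bytes // 4        # assuming 4-byte words
entries = 2 ** index_bits             # number of cache entries

data_bits_per_entry = block_bytes * 8                       # data storage
total_bits_per_entry = 1 + tag_bits + data_bits_per_entry   # valid + tag + data
ratio = total_bits_per_entry / data_bits_per_entry

print(block_words, "words/block;", entries, "entries; ratio", round(ratio, 4))
```

The ratio question is asking how much overhead the tag and valid bits add on top of the raw data array.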
Starting from power on, the following byte-addressed cache references are recorded:
0, 4, 16, 132, 232, 160, 1024, 30, 140, 3100, 180, 2180
- How many blocks are replaced?
- What is the hit ratio?
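These two sub-questions can be checked by replaying the reference stream through a cache with the geometry implied above (32-byte blocks, 32 entries). A hedged sketch, counting a "replacement" as a miss that evicts an already-valid block; the variable names are inventions of this sketch:

```python
refs = [0, 4, 16, 132, 232, 160, 1024, 30, 140, 3100, 180, 2180]

block_bytes, entries = 32, 32     # geometry from the address breakdown above
cache = {}                        # index -> resident block number
hits = replaced = 0
for a in refs:
    block = a // block_bytes
    index = block % entries
    if cache.get(index) == block:
        hits += 1
    else:
        if index in cache:        # a valid block is being evicted
            replaced += 1
        cache[index] = block
print(replaced, "replacements; hit ratio =", hits, "/", len(refs))
```

Compulsory misses into empty entries do not count as replacements here, which matters because the cache starts cold at power-on.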