Cache performance can be improved in two ways: by reducing the miss rate or by reducing the miss penalty. Lab 4 (Cache Simulator) focused on reducing the miss rate by lowering the probability that two different memory blocks contend for the same cache location. In this lab, you will improve cache performance by reducing the miss penalty, and measure the result in terms of program execution cycles and memory stall cycles. You should simulate both CPU and cache behavior with C/C++ simulators under the given timing assumptions. This lab will help you understand how cache performance affects overall program execution time.
2. Requirement
- Please attach your names and student IDs as a comment at the top of each file.
- Please modify your direct_mapped_cache_lru.cpp from Lab 4 to complete this lab, and provide a Makefile that compiles your source code into an executable named simulate_caches.
- You are asked to simulate the single-cycle CPU and cache behavior of a matrix multiplication function, matmul(A, B), where A and B are matrices of dimensions m × n and n × p, respectively. The assembly code of matmul(A, B) is given in matmul.txt. Using matmul.txt and the following delay assumptions, you can calculate and analyze the performance.
For example, consider the memory organization shown in Fig 1(c). If an access misses in both L1 and L2, a 32-word block is transferred from memory to L2, a 4-word block is transferred from L2 to L1, and the required data is sent to the CPU, for a total of 1 + 32 × (1 + 100 + 1 + 10) + 4 × (1 + 10 + 1 + 1) + 1 + 1 = 3639 memory stall cycles.
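The arithmetic in the example above can be checked with a few lines of C++. This is only a sketch: the constant and function names are hypothetical, and the grouping of terms (a per-word cost of 1 + 100 + 1 + 10 for memory to L2, a per-word cost of 1 + 10 + 1 + 1 for L2 to L1, and three single-cycle terms) is read directly off the expression above. Consult the delay assumptions for what each individual term stands for.

```cpp
#include <cassert>

// Hypothetical names; the numeric values are copied from the worked example
// for the Fig 1(c) organization, not derived independently.
constexpr int kL2BlockWords   = 32;                // words moved from memory to L2
constexpr int kL1BlockWords   = 4;                 // words moved from L2 to L1
constexpr int kMemToL2PerWord = 1 + 100 + 1 + 10;  // per-word cost, memory -> L2
constexpr int kL2ToL1PerWord  = 1 + 10 + 1 + 1;    // per-word cost, L2 -> L1

// Stall cycles for one access that misses in both L1 and L2.
constexpr int stall_cycles_l1_l2_miss() {
    return 1                                  // leading single-cycle term
         + kL2BlockWords * kMemToL2PerWord    // fill the L2 block from memory
         + kL1BlockWords * kL2ToL1PerWord     // fill the L1 block from L2
         + 1 + 1;                             // trailing single-cycle terms
}

// Sanity check against the figure quoted in the handout.
void check_worked_example() {
    assert(stall_cycles_l1_l2_miss() == 3639);
}
```

Keeping the block sizes and per-word costs as named constants makes it easier to derive the penalties for the other two organizations in Fig 1 by changing only the values.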
3. Input/output
The first line contains three hexadecimal numbers, ADDR0, ADDR1, and ADDR2, which give the base addresses of input matrices A and B and output matrix C, followed by three decimal integers m, n, and p (each of the form 2^x, 2 ≤ x ≤ 10), which give the dimensions of A (m × n) and B (n × p). The following m + n lines contain the elements of matrices A and B. Your output should contain the result matrix C (m × p), the simulated program execution cycles, and the total memory stall cycles, computed according to the miss penalty calculation on pp. 33-34 of the Chapter 5 slides.
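The input layout described above could be read along the following lines. This is a minimal sketch under stated assumptions: the struct and function names are hypothetical, the matrix elements are assumed to be whitespace-separated decimal integers, and the reference multiplication is plain C++ meant only for checking the result matrix your simulator produces, not a replacement for simulating matmul.txt.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <vector>

using Matrix = std::vector<std::vector<long long>>;

// Hypothetical container for one test case; field names are illustrative.
struct MatmulInput {
    std::uint32_t addr_a = 0, addr_b = 0, addr_c = 0;  // ADDR0, ADDR1, ADDR2
    int m = 0, n = 0, p = 0;
    Matrix a;  // m x n
    Matrix b;  // n x p
};

// Reads the header line and the m + n rows that follow it.
// Returns false if the stream ends early or a token is malformed.
bool read_input(std::istream& in, MatmulInput& out) {
    in >> std::hex >> out.addr_a >> out.addr_b >> out.addr_c;
    in >> std::dec >> out.m >> out.n >> out.p;
    if (!in || out.m <= 0 || out.n <= 0 || out.p <= 0) return false;
    out.a.assign(out.m, std::vector<long long>(out.n));
    out.b.assign(out.n, std::vector<long long>(out.p));
    for (auto& row : out.a) for (auto& v : row) in >> v;
    for (auto& row : out.b) for (auto& v : row) in >> v;
    return static_cast<bool>(in);
}

// Straightforward reference product C = A * B for cross-checking.
Matrix reference_matmul(const MatmulInput& in) {
    assert(in.a.size() == static_cast<std::size_t>(in.m));
    assert(in.b.size() == static_cast<std::size_t>(in.n));
    Matrix c(in.m, std::vector<long long>(in.p, 0));
    for (int i = 0; i < in.m; ++i)
        for (int j = 0; j < in.p; ++j)
            for (int k = 0; k < in.n; ++k)
                c[i][j] += in.a[i][k] * in.b[k][j];
    return c;
}
```

In a simulator, `read_input` would typically be called once on the opened input file, and the matrix it returns from `reference_matmul` compared element by element with the values the simulated `sw` instructions write starting at ADDR2.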
4. Report
Briefly describe how you calculate the memory stall cycles for each miss condition. Based on the results of your CPU and cache simulation, compare and discuss the differences among the three memory organizations described in Fig 1.
5. Bonus: Performance improvement by software
Processor performance can be further improved by software, i.e., by the compiler. To get the bonus, rewrite the assembly code of matmul(A, B) as bonus_matmul.txt and calculate the new memory stall cycles with your C/C++ simulator. Explain in the report why your matmul(A, B) is faster than the original. Note that your assembly code must be functionally identical to the original, i.e., it must compute the correct matrix multiplication result C (m × p), which will be verified with the TAs' simple MIPS assembler (which supports only addi, addu, subu, mul, slt, beq, bne, j, lw, and sw; see bonus_readme.txt for details) and single-cycle CPU. In addition, every instruction has its own delay penalty, so the total delay must be recalculated and written into the report!
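As one illustration of a software-side change, the sketch below contrasts two loop orders for the same product. It is an assumption, not the intended solution: the loop order used in matmul.txt is not shown here, the function names are hypothetical, and whether the interchange actually reduces stall cycles depends on the block sizes and memory organization you simulate. With row-major storage, the i-k-j order walks along a row of B and C in the inner loop, so consecutive accesses tend to fall in the same cache block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<long long>>;

// Baseline i-j-k order: the inner loop walks down a column of B.
Matrix matmul_ijk(const Matrix& a, const Matrix& b) {
    assert(!a.empty() && !b.empty() && a[0].size() == b.size());
    const std::size_t m = a.size(), n = b.size(), p = b[0].size();
    Matrix c(m, std::vector<long long>(p, 0));
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < p; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Interchanged i-k-j order: the inner loop walks along a row of B and C.
Matrix matmul_ikj(const Matrix& a, const Matrix& b) {
    assert(!a.empty() && !b.empty() && a[0].size() == b.size());
    const std::size_t m = a.size(), n = b.size(), p = b[0].size();
    Matrix c(m, std::vector<long long>(p, 0));
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const long long aik = a[i][k];  // reused across the whole inner loop
            for (std::size_t j = 0; j < p; ++j)
                c[i][j] += aik * b[k][j];
        }
    return c;
}
```

Both functions compute the same result, which is the property the bonus requires. Any such change still has to be expressed in the restricted instruction set listed above, and its instruction count changes too, so both the execution cycles and the stall cycles need to be re-measured.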