EEC 170
UNIVERSITY OF CALIFORNIA, DAVIS
Department of Electrical and Computer Engineering
Introduction to Computer Architecture
Midterm 1, Fall 2019

Name:

Instructions:
1. This exam is open-note, open-book. Calculators are allowed.
2. No devices that can communicate (laptops, phones, or PDAs) allowed, with the exception of a device that can read your e-book textbook, on which you must turn off your wifi and/or cellular connections. Other than “find”, no typing. Turn off all phones, pagers, and alarms.
3. You may not share any materials with anyone else, including books, notes, lab reports, datasheets, calculators, and most importantly, answers!
4. This exam is designed to be a 60 minute exam, but you have 110 minutes to complete it. (Roughly) one point is budgeted per minute. Use your time wisely!
Excerpts from the UC Davis Code of Academic Conduct:
1. Each student should act with personal honesty at all times.
2. Each student should act with fairness to others in the class. That means, for example, that when taking an examination, students should not seek an unfair advantage over classmates through cheating or other dishonest behavior.
3. Students should take group as well as individual responsibility for honorable behavior.
I understand the honor code and agree to be bound by it.
Signature:
Page:    2    3    4    5    6    9   Total
Points:  8   12   12    8   10   15      65
Score:
Page 1 of 12
(c) John Owens 2019
Do not reproduce in any way Do not upload anywhere
All instructions in this exam, unless otherwise noted, are RISC-V instructions. Recall that if the instruction writes to a register, that register appears first in the instruction. Unless otherwise indicated, you may not use any RISC-V pseudoinstructions on this exam, only actual RISC-V instructions. Feel free to detach the two RISC-V reference pages and the adder reference page at the end of the exam, but if you do, please take them with you and recycle them (don’t try to reattach them or turn them in).
1. Many machine learning applications do not require the precision or range of single- precision (32b) floating point. Thus Google has introduced a new 16-bit floating-point format, bfloat16, which is structured in a similar way to the IEEE 754 formats we discussed in class. bfloat16 has one sign bit, an 8-bit exponent, and a 7-bit significand (the bits after the 1.).
(a) (5 points) What is the smallest normal positive number that can be represented as a bfloat16? (“Normal” here means “ignore subnormal numbers”.) Express your answer as a normalized number ±a.b×2^c, where you specify the sign, a, b, and c. You should clearly show your work if you want partial credit.
(b) NVIDIA hardware also supports a 16-bit floating-point format, half. It has 1 sign bit, 5 exponent bits, and 10 significand bits (the bits after the 1.).
Answer each of the following questions with bfloat16 or half.
i. (1 point) The format with the largest representable value is (bfloat16 or half):
ii. (1 point) The format with a representable value closest to zero is (bfloat16 or half):
iii. (1 point) The format with the smallest range between two adjacent values when the bias-adjusted exponent is zero (meaning the normalized number is 1.x × 2^0) (where, for these two adjacent values, the floating-point representation is identical except for the least significant bit of the significand) is (bfloat16 or half):
2. Short code sequences.
(a) (6 points) MIPS has the instruction nor. RISC-V lacks this instruction. Implement nor x2, x3, x4 in as few RISC-V instructions as possible. You may not make any changes to any register except x2.
(b) (6 points) RISC-V also lacks the instruction sgtz, which sets its output register to 1 if the input register is larger than zero and to 0 otherwise. Implement sgtz x2, x3 in as few RISC-V instructions as possible. You may not make any changes to any register except x2.
(c) (12 points) RISC-V has a floating-point instruction called fsgnj.d rd, rs1, rs2, “floating-point sign injection”. It returns a number whose magnitude is the same as its rs1 argument but whose sign is the same as its rs2 argument. Note it works on double-precision (64b) data.
Implement fsgnj.d x2, x3, x4 in as few integer RISC-V instructions as possible. Your inputs are 64b floating-point numbers stored in 64b integer registers and your output is also a 64b floating-point number in a 64b integer register. Do not use floating-point registers or instructions to solve this problem. You may use x10, x11, etc. as temporary registers if needed. For this part only, you may use the pseudoinstruction li (load immediate), which will count as one instruction.
To aid grading (to help give you partial credit), describing your approach in a sentence or two will be helpful.
3. (8 points) Let’s consider adding a reciprocal operation to RISC-V. Your engineers have determined that the primary use for a reciprocal operation is to normalize 3-component vectors:
xnew = x / sqrt(x^2 + y^2 + z^2),
with ynew and znew defined analogously. Note we could take the reciprocal of the square root once and then perform multiply instructions to achieve the same results. To compute xnew, ynew, and znew, we could replace divide operations with reciprocal and multiply instructions.
If we assume that add instructions take 1 cycle, multiply operations take 4 cycles, and both square-root and divide operations take 20 cycles, what is the maximum number of cycles for the reciprocal operation so that 3-component vector normalization is at least as fast using reciprocal and multiply as it was using division?
4. You have simplified the design of your new RISC-Y processor to the point where you only have two classes of instructions, A-type (which takes 1 cycle to finish) and B-type (which takes 3). You compile your company’s most important benchmark on your internal compiler and find it has a CPI of 2.8.
The Team Rocket compiler company visits your company and boasts they can achieve twice the MIPS (millions of instructions per second) on your benchmark on your RISC-Y processor. You test their compiler: it’s true, they can. However, the runtime of their compiled code is the same as yours.
(a) (7 points) What is the instruction mix (what percentage of A-type and what percentage of B-type) of the compiled code from the Team Rocket compiler?
(b) (3 points) If Team Rocket’s code gets twice as many MIPS, why does it run at the same speed as your internal compiler’s? Be precise as to why and quantify your answer.
5. In this problem we design an integer ALU that is inspired by the POWER instruction set. For the purposes of this problem, assume that POWER supports 32-bit integers, that it has an R-type instruction format that takes two input registers and one output register, and an I-type instruction format that takes one input register, one output register, and a 16-bit immediate.
Our base hardware is a 32-bit adder that has five inputs. Four of the inputs (AH, AL, BH, and BL) specify the A and B inputs to the adder, each divided into two 16-bit chunks (the 16b high and low halves of the 32b input words to the adder). The fifth input (Cin) is a one-bit carry in.
Your job is to select what goes into these five inputs. For the first four inputs, you must set the control signals on a SUPERMUX, which is a 16-bit wide mux that can select from 14 16-bit inputs:
• A[31:16]
• A[15:0]
• B[31:16]
• B[15:0]
• ~A[31:16] (invert the bits of A[31:16])
• ~A[15:0]
• ~B[31:16]
• ~B[15:0]
• imm[15:0] (the immediate from I-type instructions)
• A[31] x 16 (bit A[31] copied to all 16 bits)
• B[31] x 16 (bit B[31] copied to all 16 bits)
• imm[15] x 16 (bit imm[15] copied to all 16 bits)
• 0 x 16 (16 zeroes)
• 1 x 16 (16 ones)
[Figure: four SUPERMUXes drive the AH, AL, BH, and BL inputs of the 32-bit adder; Cin is the carry in, and the adder produces Sum (Out[31:0]) and Cout.]
For the last control input (Cin), you can set it to 0 or 1.
As an example, if I asked you to configure this ALU to compute a standard 32-bit add, you would set the following:
• AH = A[31:16]
• AL = A[15:0]
• BH = B[31:16]
• BL = B[15:0]
• Cin = 0
There are no questions on this page.
For each of the following instructions, fill in how you would set the five control signals.
(a) (3 points) SUB: a 32-bit subtract, A-B.
(b) (3 points) LUI: load the 16-bit immediate into the upper 16 bits of the output, setting the other output bits to zero.
(c) (3 points) ADDIS: “add immediate shifted”; add input A to the immediate left-shifted by 16 bits.
(d) (3 points) DBL: “multiply by 2”; return A×2
(e) (3 points) In English, what does the instruction with the following control settings do?
• AH = 1 x 16
• AL = 1 x 16
• BH = B[15:0]
• BL = B[31:16]
• Cin = 1
[Detachable reference pages: “RISC-V Reference” card by James Zhu — core instruction formats (R/I/S/B/U/J) and instruction tables for the RV32I base integer instructions and the RV32M (multiply), RV32A (atomic), and RV32F/D (floating-point) standard extensions.]

1. Consider the following C function:

int ABS(int z) {
  if (z >= 0) return z;
  return -z;
}
A straightforward implementation of the above code requires control-flow instructions (branches and/or jumps). In this problem we consider non-RISC-V instructions that might allow us to implement this without control-flow instructions. In this problem ignore the fact that a twos-complement representation is unbalanced (-MININT cannot be represented).
(a) (5 points) Using a minimal sequence of RISC-V instructions, implement ABSDIFF x1, x2, x3 without control-flow instructions. ABSDIFF x1, x2, x3 means x1 = ABS(x2 – x3). You may use any RISC-V registers, but do not overwrite x2 or x3. In this part, you must use one or more conditional move instructions (which, as we discussed in class, are part of many instruction sets but not RISC-V), and specifically any or all of three different conditional move instructions:
• CMOVGTZ x1, x2, x3 # if (x3 > 0) x1 = x2
• CMOVLTZ x1, x2, x3 # if (x3 < 0) x1 = x2
• CMOVEQZ x1, x2, x3 # if (x3 == 0) x1 = x2

(b) (5 points) Using a minimal sequence of RISC-V instructions, implement ABSDIFF x1, x2, x3 without control-flow instructions. You may use any RISC-V registers, but do not overwrite x2 or x3. In this part, you must use either or both of the following saturating-subtract instructions:

• SUBSAT x1, x2, x3 (subtraction of x2-x3, treating both operands as signed numbers and producing a signed result, but rather than overflowing, saturating to the largest/smallest signed value)
• SUBUSAT x1, x2, x3 (subtraction of x2-x3, treating both operands as unsigned numbers and producing an unsigned result, but rather than overflowing, saturating to the largest/smallest unsigned value)

Recall that the actual computation of subtraction of signed and unsigned numbers generates exactly the same bit result; it’s only the saturation operation that differs between these two operations. This problem is fairly tricky.

2. In this problem we will look at the single-cycle datapath of the “Bitner” processor, focusing only on the compute part of the Bitner pipeline. In this datapath, the ALU can perform one arithmetic operation every clock cycle and the data memory can perform one memory operation every clock cycle. Note the output of the ALU is an input to the data memory and the output of the data memory is an input to the ALU. Assume Bitner’s instruction format contains an opcode, three registers, and one immediate. At the end of this exam is a larger, detachable copy of this datapath.
[Figure: the Bitner datapath — register file (read ports RF1 and RF2, write port RFIn), four muxes selecting the ALU’s A and B inputs and the data memory’s Addr and Data In inputs, with control signals RFW, RFR1, RFR2, IMM, MEMD, MEMA, ALU2, ALU1, and ALUOp.]

Control points:
• RFR1: register number of the first register to be read from the register file
• RFR2: register number of the second register to be read from the register file
• IMM: the immediate specified by the instruction
• ALU1: controls the source of data into the ALU’s input 1 (choices: RF1, IMM, MEM)
• ALU2: controls the source of data into the ALU’s input 2 (choices: RF2, IMM, MEM)
• MEMA: controls the source of data into the memory’s address input (choices: RF1, IMM, ALU)
• MEMD: controls the source of data into the memory’s data input (choices: RF2, IMM, ALU)
• ALUOP: controls the operation performed by the ALU (choices: ADD)
• RFW: register number of the register to be written to the register file
• RFData: controls the source of data to be written into register RFW in the register file (choices: ALU or MEM)

In this problem we ask you to “set the control points” for this datapath. As an example, the instruction ADD x1, x2, x3 would have the following settings: RFR1 = 2; RFR2 = 3; IMM = X; ALU1 = RF1; ALU2 = RF2; MEMA = X; MEMD = X; ALUOP = ADD; RFW = 1; RFData = ALU, where X is a don’t-care.

(a) (5 points) Which of these signals should come directly from bits in the 32-bit instruction (a bit or bits in the instruction are routed as control signals directly into the datapath without any intermediate logic) and which should be generated by additional logic derived from the 32-bit instruction?
Answer either “bits” or “logic”.
• RFR1
• RFR2
• IMM
• ALU1
• ALU2
• MEMA
• MEMD
• ALUOP
• RFW
• RFData

(b) (4 points) What is the primary reason why this particular datapath would be a challenge to pipeline with good performance?

(c) Consider the following C program. This program fetches the value in array from memory, adds 42 to it, stores the new value back to memory, and increments array to point to the next element in the array.

int * array; // assume 64-bit integers
while (1) { *array++ += 42; }

Assume that array is stored in x2.

i. (4 points) Write the body of this loop (corresponding to the single line of C code between the braces) as a minimal sequence of instructions in RISC-V assembly. You can use x3 as a temporary. Your solution should have 4 RISC-V instructions. Do not write any instructions that relate to the loop itself (you should have no branch/jump instructions).

ii. (10 points) Write the body of this loop (corresponding to the single line of C code between the braces) as a minimal sequence of instructions on Bitner. Express each instruction as a setting of control points; by this we mean to specify how you would set each of the 10 “control points” listed above. If a control point is a don’t-care, mark it as X. You can use x3 as a temporary. Do not write any instructions that relate to the loop itself (you should have no branch/jump instructions). Hint: Because this datapath is potentially more capable than the RISC-V datapath we discussed in class, it may be possible to accomplish more in a single instruction than you could in a RISC-V datapath.

(d) (5 points) On the following diagram, note 12 circles on the inputs to the 4 muxes. Put an X through each circle whose input is not required to run RISC-V instructions.
For inputs that are required to run RISC-V instructions, do nothing. For instance, if the top input of the ALU does not require a RF1 input to run RISC-V instructions, put an X in the topmost circle.

[Figure: the Bitner datapath redrawn with circles on the 12 mux inputs (RF1/IMM/MEM into ALU input 1; RF2/IMM/MEM into ALU input 2; RF1/IMM/ALU into the memory address input; RF2/IMM/ALU into the memory data input).]

(e) Ben Bitdiddle proposes new instructions for Bitner. Ben hopes each of these instructions can be implemented as single Bitner instructions (not sequences of Bitner instructions). For each of Ben’s proposals:
• If it is possible to implement Ben’s proposal as a single instruction with the current datapath, set the control points that implement the instruction on this datapath; OR
• If it is not possible to implement Ben’s proposal as a single instruction with the current datapath, describe the changes needed to the datapath (as English sentences).

i. (4 points) LDADDI xdest1, xdest2, xsrc, imm: Perform two operations in parallel: (1) register xdest1 is set to xsrc + imm; (2) register xdest2 is set to the memory contents of the address stored in xsrc.

ii. (4 points) MEMCPY xdest, xsrc: Copies the memory word stored at the address specified in xsrc into the memory location specified by the address in xdest.

iii. (4 points) LD2 xdest, imm(xbase1+xbase2): Same as LD but instead of the load address as xbase+imm, it now supports xbase1+xbase2+imm.

3. Alyssa P. Hacker was looking at the class datapath (below) and identified some possible inefficiencies.
Specifically, she was looking at the following control signals:
• RegWrite: 1 only if data is written into the register file
• ALUSrc: the second input of the ALU is from the register file (0) or from the immediate (1)
• PCSrc: the next PC comes from PC+4 (0) or from a computed branch/jump target (1)
• MemWrite: 1 only if the write data input to the memory will be written to the address input
• MemRead: 1 only if the data memory is outputting the contents of the memory address specified by its address input
• MemtoReg: the data to write into the register file comes from the ALU (0) or from the data memory (1)

She proposed the following design simplifications. For each proposed design change, indicate whether her proposal could work properly OR, if it will not, give an instruction that will not work correctly and why. For the purposes of this question, only consider the following instructions (the ones we covered in Lecture 2): ADD, ADDI, AND, ANDI, BEQ, BGE, BLT, BNE, JAL, JALR, LD, LUI, OR, ORI, SD, SLLI, SRLI, SUB, XOR, XORI.

(a) (4 points) Combine MemWrite and MemRead into one signal MemOp, where MemOp is set to 1 only on store instructions, and MemWrite <- MemOp and MemRead <- not(MemOp). Assume that the memory system can read from any arbitrary memory address without any side effects (no exceptions/page faults/etc.)
and that there are no performance consequences if we do this.

(b) (4 points) Remove RegWrite; instead set RegWrite if PCSrc is 0.

(c) (4 points) Remove ALUSrc; instead set ALUSrc if either MemWrite or MemRead is 1.

[Detachable reference pages: “RISC-V Reference” card by James Zhu — core instruction formats (R/I/S/B/U/J) and instruction tables for the RV32I base integer instructions and the RV32M, RV32A, and RV32F/D standard extensions.]
1. Consider the following C function:

int ABS(int z) {
  if (z >= 0) return z;
  return -z;
}
A straightforward implementation of the above code requires control-flow instructions (branches and/or jumps). In this problem we consider non-RISC-V instructions that might allow us to implement this without control-flow instructions. In this problem ignore the fact that a twos-complement representation is unbalanced (-MININT cannot be represented).
(a) (5 points) Using a minimal sequence of RISC-V instructions, implement ABSDIFF x1, x2, x3 without control-flow instructions. ABSDIFF x1, x2, x3 means x1 = ABS(x2 – x3). You may use any RISC-V registers, but do not overwrite x2 or x3. In this part, you must use one or more conditional move instructions (which, as we discussed in class, are part of many instruction sets but not RISC-V), and specifically any or all of three different conditional move instructions:
• CMOVGTZ x1, x2, x3 # if (x3 > 0) x1 = x2
• CMOVLTZ x1, x2, x3 # if (x3 < 0) x1 = x2
• CMOVEQZ x1, x2, x3 # if (x3 == 0) x1 = x2

Solution:

SUB     x4, x2, x3   # the difference
CMOVGTZ x1, x4, x4   # x1 now has the correct answer if the diff was positive
SUB     x4, x0, x4   # flip the sign of x4
CMOVGTZ x1, x4, x4   # x1 now also has the correct answer if the diff was negative
CMOVEQZ x1, x4, x4   # better cover the 0 case too

(b) (5 points) Using a minimal sequence of RISC-V instructions, implement ABSDIFF x1, x2, x3 without control-flow instructions. You may use any RISC-V registers, but do not overwrite x2 or x3. In this part, you must use either or both of the following saturating-subtract instructions:

• SUBSAT x1, x2, x3 (subtraction of x2-x3, treating both operands as signed numbers and producing a signed result, but rather than overflowing, saturating to the largest/smallest signed value)
• SUBUSAT x1, x2, x3 (subtraction of x2-x3, treating both operands as unsigned numbers and producing an unsigned result, but rather than overflowing, saturating to the largest/smallest unsigned value)

Recall that the actual computation of subtraction of signed and unsigned numbers generates exactly the same bit result; it’s only the saturation operation that differs between these two operations. This problem is fairly tricky.

Solution: The key realization here is that unsigned saturated subtract of a - b is equivalent to max(a - b, 0).

SUBUSAT x1, x2, x3   # x1 = max(x2-x3, 0)
SUBUSAT x4, x3, x2   # x4 = max(x3-x2, 0); either x1 or x4 is now 0
OR      x1, x1, x4   # choose whichever of x1 or x4 is not 0

2. In this problem we will look at the single-cycle datapath of the “Bitner” processor, focusing only on the compute part of the Bitner pipeline. In this datapath, the ALU can perform one arithmetic operation every clock cycle and the data memory can perform one memory operation every clock cycle. Note the output of the ALU is an input to the data memory and the output of the data memory is an input to the ALU. Assume Bitner’s instruction format contains an opcode, three registers, and one immediate. At the end of this exam is a larger, detachable copy of this datapath.

[Figure: the Bitner datapath — register file, muxes feeding the ALU’s A and B inputs, and the data memory’s Addr and Data In inputs.]

Control points:
• RFR1: register number of the first register to be read from the register file
• RFR2: register number of the second register to be read from the register file
• IMM: the immediate specified by the instruction
• ALU1: controls the source of data into the ALU’s input 1 (choices: RF1, IMM, MEM)
• ALU2: controls the source of data into the ALU’s input 2 (choices: RF2, IMM, MEM)
• MEMA: controls the source of data into the memory’s address input (choices: RF1, IMM, ALU)
• MEMD: controls the source of data into the memory’s data input (choices: RF2, IMM, ALU)
• ALUOP: controls the operation performed by the ALU (choices: ADD)
• RFW: register number of the register to be written to the register file
• RFData: controls the source of data to be written into register RFW in the register file (choices: ALU or MEM)

In this problem we ask you to “set the control points” for this datapath. As an example, the instruction ADD x1, x2, x3 would have the following settings: RFR1 = 2; RFR2 = 3; IMM = X; ALU1 = RF1; ALU2 = RF2; MEMA = X; MEMD = X; ALUOP = ADD; RFW = 1; RFData = ALU, where X is a don’t-care.

(a) (5 points) Which of these signals should come directly from bits in the 32-bit instruction (a bit or bits in the instruction are routed as control signals directly into the datapath without any intermediate logic) and which should be generated by additional logic derived from the 32-bit instruction?
Answer either "bits" or "logic".
• bits: RFR1
• bits: RFR2
• bits: IMM
• logic: ALU1
• logic: ALU2
• logic: MEMA
• logic: MEMD
• logic: ALUOP
• bits: RFW
• logic: RFData

(b) (4 points) What is the primary reason why this particular datapath would be a challenge to pipeline with good performance?

Solution: Because the data memory output is an input to the ALU and the ALU output is an input to the memory, we'd have two choices:
• We could put them both into the same pipeline stage, in which case that stage would be very long and yield poor performance.
• We could put them into separate pipeline stages, but then we would have to choose which came first, and we would have a structural hazard every time they were used in the opposite order.

(c) Consider the following C program. This program fetches the value in array from memory, adds 42 to it, stores the new value back to memory, and increments array to point to the next element in the array.

int * array; // assume 64-bit integers
while (1) {
  *array++ += 42;
}

Assume that array is stored in x2.

i. (4 points) Write the body of this loop (corresponding to the single line of C code between the braces) as a minimal sequence of instructions in RISC-V assembly. You can use x3 as a temporary. Your solution should have 4 RISC-V instructions. Do not write any instructions that relate to the loop itself (you should have no branch/jump instructions).

Solution:

ld   x3, 0(x2)
addi x3, x3, 42
sd   x3, 0(x2)
addi x2, x2, 8

ii. (10 points) Write the body of this loop (corresponding to the single line of C code between the braces) as a minimal sequence of instructions on Bitner. Express each instruction as a setting of control points; by this we mean to specify how you would set each of the 10 "control points" listed above. If a control point is a don't-care, mark it as X.
You can use x3 as a temporary. Do not write any instructions that relate to the loop itself (you should have no branch/jump instructions). Hint: Because this datapath is potentially more capable than the RISC-V datapath we discussed in class, it may be possible to accomplish more in a single instruction than you could in a RISC-V datapath.

Solution: We can collapse the four RISC-V instructions above into two Bitner instructions:
• Combine the load and the first add-immediate (+42) into one instruction; the output of the load is the input to an add, and the output of the add goes back into the register file.
  RFR1 = 2; RFR2 = X; IMM = 42; RFW = 3; RFData = ALU; ALU1 = MEM; ALU2 = IMM (42); MEMA = RF1; MEMD = X; ALUOP = ADD
• Combine the store and the second add-immediate (+8) into one instruction; the store and the add run in parallel. Only the add-immediate writes back.
  RFR1 = 2; RFR2 = 3; IMM = 8; RFW = 2; RFData = ALU; ALU1 = RF1; ALU2 = IMM (8); MEMA = RF1; MEMD = RF2; ALUOP = ADD

(d) (5 points) On the following diagram, note 12 circles on the inputs to the 4 muxes. Put an X through each circle whose input is not required to run RISC-V instructions. For inputs that are required to run RISC-V instructions, do nothing. For instance, if the top input of the ALU does not require a RF1 input to run RISC-V instructions, put an X in the topmost circle.

[Datapath diagram repeated, with circles on the 12 mux inputs.]

Solution: To run only RISC-V instructions, we do not need the following connections:
• ALU A: IMM, MEM
• ALU B: MEM
• Memory address: RF1, IMM
• Memory data: IMM, ALU

It's arguable that ALU A and ALU B could switch their answers, although that may lead to complications with where I-type instructions get their RF input.
[Solution diagram: the datapath with X marks on ALU input A's IMM and MEM choices, ALU input B's MEM choice, the memory address mux's RF1 and IMM choices, and the memory data mux's IMM and ALU choices.]

(e) Ben Bitdiddle proposes new instructions for Bitner. Ben hopes each of these instructions can be implemented as single Bitner instructions (not sequences of Bitner instructions). For each of Ben's proposals:
• If it is possible to implement Ben's proposal as a single instruction with the current datapath, set the control points that implement the instruction on this datapath; OR
• If it is not possible to implement Ben's proposal as a single instruction with the current datapath, describe the changes needed to the datapath (as English sentences).

i. (4 points) LDADDI xdest1, xdest2, xsrc, imm: Perform two operations in parallel: (1) register xdest1 is set to xsrc + imm; (2) register xdest2 is set to the memory contents of the address stored in xsrc.

Solution: The current datapath does not support this; it requires writing back to two different registers. We would have to add a second write port to the register file and make sure the outputs of the ALU and the memory could both be routed into the two write ports.

ii. (4 points) MEMCPY xdest, xsrc: Copies the memory word stored at the address specified in xsrc into the memory location specified by the address in xdest.

Solution: The current datapath does not support this; it requires two memory operations and thus two data memories. There are multiple places where a second memory could be placed, including:
• In parallel with the current data memory (rather than having one ALU and one memory in parallel, we'd have one ALU and two memories in parallel).
The output of the first memory (load) must be able to be routed into the data input of the second memory (store).
• In series after the current data memory (the output of the first memory can be routed into the data input of the second memory).

iii. (4 points) LD2 xdest, imm(xbase1+xbase2): Same as LD but instead of the load address as xbase+imm, it now supports xbase1+xbase2+imm.

Solution: The current datapath does not support this; it does not allow two additions. One way to address this is to add an adder to the address input of the data memory that allows a second addition. One input to that adder should be the output of the ALU; the second input could be either read register or the immediate (the datapath allows any of those three to work).

3. Alyssa P. Hacker was looking at the class datapath (below) and identified some possible inefficiencies. Specifically, she was looking at the following control signals:
• RegWrite: 1 only if data is written into the register file
• ALUSrc: the second input of the ALU is from the register file (0) or from the immediate (1)
• PCSrc: the next PC comes from PC+4 (0) or from a computed branch/jump target (1)
• MemWrite: 1 only if the write data input to the memory will be written to the address input
• MemRead: 1 only if the data memory is outputting the contents of the memory address specified by its address input
• MemtoReg: the data to write into the register file comes from the ALU (0) or from the data memory (1)

She proposed the following design simplifications. For each proposed design change, indicate whether her proposal could work properly OR, if it will not, give an instruction that will not work correctly and explain why.
For the purposes of this question, only consider the following instructions (the ones we covered in Lecture 2): ADD, ADDI, AND, ANDI, BEQ, BGE, BLT, BNE, JAL, JALR, LD, LUI, OR, ORI, SD, SLLI, SRLI, SUB, XOR, XORI.

(a) (4 points) Combine MemWrite and MemRead into one signal MemOp, where MemOp is set to 1 only on store instructions, and MemWrite <- MemOp and MemRead <- not(MemOp). Assume that the memory system can read from any arbitrary memory address without any side effects (no exceptions/page faults/etc.) and that there are no performance consequences if we do this.

Solution: This should work properly. The reasons to have two signals are to allow both of them to be true at a time (and we have no instructions that do both a read and a write) or both of them to be false (which is probably true on non-memory instructions, but the MemtoReg mux handles this case). Note, however, that we don't generally want to dispatch memory reads that aren't from a load instruction: we may access addresses that we're not allowed to access (e.g., segmentation faults), we'd probably pollute the cache, and we'd burden the memory system with requests that we don't need, so there would be performance consequences.

(b) (4 points) Remove RegWrite; instead set RegWrite if PCSrc is 0.

Solution: This doesn't work. Alyssa's logic here is that non-branch-or-jump instructions always write back to the register file. This is not true for store instructions (SD).

(c) (4 points) Remove ALUSrc; instead set ALUSrc if either MemWrite or MemRead is 1.

Solution: This doesn't work. Alyssa's logic here is that we use the immediate input to the ALU only on load and store instructions, but we also use it on ALU instructions that take an immediate (ADDI, ANDI, ORI, etc.).
UNIVERSITY OF CALIFORNIA, DAVIS
Department of Electrical and Computer Engineering
EEC 170 Introduction to Computer Architecture
Fall 2019
Final Examination

Name:

Instructions:
1. This exam is open-note, open-book. Calculators are allowed.
2. No devices that can communicate (laptops, phones, or PDAs) allowed, with the exception of a device that can read your e-book textbook, on which you must turn off your wifi and/or cellular connections. Other than "find", no typing. Turn off all phones, pagers, and alarms.
3. You may not share any materials with anyone else, including books, notes, lab reports, datasheets, calculators, and most importantly, answers!
4. This exam is designed to be a 60 minute exam, but you have 120 minutes to complete it. (Roughly) one point is budgeted per minute. Use your time wisely!

Excerpts from the UC Davis Code of Academic Conduct:
1. Each student should act with personal honesty at all times.
2. Each student should act with fairness to others in the class. That means, for example, that when taking an examination, students should not seek an unfair advantage over classmates through cheating or other dishonest behavior.
3. Students should take group as well as individual responsibility for honorable behavior.

I understand the honor code and agree to be bound by it.
Signature:

Points by page — page 2: 4; page 3: 10; page 4: 10; page 5: 9; page 6: 7; page 7: 3; page 8: 12; page 9: 6; total: 61.

All instructions in this exam, unless otherwise noted, are RISC-V instructions. Recall that if the instruction writes to a register, that register appears first in the instruction. Unless otherwise indicated, you may not use any RISC-V pseudoinstructions on this exam, only actual RISC-V instructions.
Feel free to detach the two RISC-V reference pages and the memory system diagram at the end of the exam, but if you do, please take them with you and recycle them (don't try to reattach them or turn them in).

1. We discussed dense matrix-dense matrix multiplication in class. To recap, consider A, B, and C as n×n matrices; we wish to compute C = C + A×B. This can be computed by

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j) // this is normal (scalar) + and *

We begin by computing the arithmetic intensity of this code in operations per memory word, assuming no caches, as a function of n. (Recall that the arithmetic intensity is the number of arithmetic operations per word of memory bandwidth.) Assume n is large; you can approximate any answers given this assumption. Assume each addition or multiplication counts as one operation. Assume all of A, B, and C are stored in main memory (DRAM) and each read from/write to main memory counts as one memory word.

The inner line of code runs n³ times. It has 2 arithmetic operations, 3 loads, and 1 write. Thus the arithmetic intensity is 2n³/((3+1)n³) = 1/2.

(a) (4 points) Consider two different implementations of the same task with the same runtime. Baby Yoda's has high arithmetic intensity and is limited by arithmetic capability. The Mandalorian's has low arithmetic intensity and is limited by memory bandwidth. Even though both have the same runtime, we prefer Baby Yoda's implementation with higher arithmetic intensity because of how we expect future machines to behave. In terms of technology trends, why do we prefer implementations with more arithmetic intensity?

(b) (4 points) Assume we have a very large cache so all memory loads only read a reference the first time they see that reference and (because they are captured in the cache) all memory writes occur only at the end of the program.
Reads from/writes to cache are free (do not count in terms of arithmetic intensity). What is the arithmetic intensity now?

(c) (6 points) Now consider a more realistic cache that is not large enough to hold the entire input and output. This cache can hold two matrix rows or columns plus one output element. Assume there are no cache conflicts; any two rows/columns plus one output element can fit in the cache. The Mandalorian proposes using the cache in the following way:

for i = 1 to n
  // read row i of A into cache
  for j = 1 to n
    // read row j of B into cache
    // read C(i,j) into cache
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j) // this is normal (scalar) + and *
    // after k loop is complete, write C(i,j) back to memory

What is the arithmetic intensity now? It is OK to ignore lower-order terms in this analysis.

(d) (10 points) Finally, Baby Yoda proposes to partition the matrix in a different way. Instead of reading 1-dimensional rows or columns from A and B, Baby Yoda arranges those matrices as b × b subblocks. (For instance, a 128 × 128 matrix may be structured as a 4×4 grid of 32 × 32 subblocks.) The cache can hold any two input blocks plus one output block at any time.

Now our matrix multiply looks like this (note our operations are now on blocks at a time instead of rows or columns at a time). The arithmetic, while factored differently, is exactly the same as the previous case.

// note i, j, and k are now block indexes, not element indexes
for i = 1 to n/b
  for j = 1 to n/b
    // read block C(i,j) into cache
    for k = 1 to n/b
      // read block A(i,k) into cache
      // read block B(k,j) into cache
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
      // Here + indicates a matrix sum of blocks (block + block)
      // Here * indicates a matrix multiply of blocks (block * block)
    // after k loop is complete, write block C(i,j) back to memory

What is the arithmetic intensity now?
It is again OK to ignore lower-order terms in this analysis.

2. In the 2-page RISC-V Reference Card at the end of the exam, at the bottom right of the first page, you will see the "16-bit (RVC) Instruction Formats". RVC stands for "RISC-V Compressed".

(a) (5 points) Note CR-type instructions have a funct4 entry as part of their opcodes instead of a funct3 (as in all the other instructions). Assume that the opcode combinations (inst[1:0] == 01 OR inst[1:0] == 10) AND inst[15:13] == 100 are used to encode CR-type instructions. Given the instruction formats listed in the reference page, how many different instructions can RVC support? Note not all possible RVC instructions are specified on the reference card.

(b) In practice, these 16-bit RVC instructions are translated to 32-bit full ("RVI") RISC-V instructions. This process is summarized in the table on the reference card called "Optional Compressed (16-bit) Instruction Extension: RVC", but we will highlight the important parts here directly. In this part, you may need to consult the table "16-bit (RVC) instruction formats".

i. (2 points) The RVC instruction C.ADD rd, rs1 translates to the RVI instruction ADD rd, rd, rs1. Both instructions run a full 64b add. However, the C.ADD operation has a significant restriction compared to the full ADD instruction. What is that restriction? Be specific.

ii. (2 points) Some RVC instructions, like C.BEQZ, have a register argument with a prime (apostrophe) after the register name (C.BEQZ rs1', imm → BEQ rs1', x0, imm). What is the restriction indicated by the prime?
Because of this multiplication, what can a RVI memory-access instruc- tion do that its partner RVC instruction cannot? (The answer is not “allows byte addressing”; both RVC and RVI memory-access instructions are byte-addressed.)iv. (2 points) The instruction C.ADDI16SP is primarily intended to manipulate the . (Fill in the previous blank with one word.)(c) (3 points) The new BabyYoda compiler inputs a RVI (32-bit) RISC-V instruction sequence and transforms some RVI instructions into their RVC (16-bit) equivalents. When running the resulting code, compared to the RVI implementation, would you expect the instruction cache hit rate would increase, decrease, or stay the same? Justify your answer.Page 6 of 12(c) John Owens 2019Do not reproduce in any way Do not upload anywhere3. While sipping his soup, Baby Yoda uses the Force to derive the following characteristics of his laptop’s memory system, which is byte-addressed and contains a virtual memory subsystem and L1 and L2 caches. A copy of this diagram is attached at the end of the exam, which you can detach. In this diagram, means a signal of bits wide.
Figure B.17 The overall picture of a hypothetical memory hierarchy going from virtual address to L2 cache access. The page size is 16 KB. The TLB is two-way set associative with 256 entries. The L1 cache is a direct-mapped 16 KB, and the L2 cache is a four-way set associative with a total of 4 MB. Both use 64-byte blocks. The virtual address is 64 bits and the physical address is 40 bits.

(a) (3 points) What is the page size?
Copyright © 2011, Elsevier Inc. All rights Reserved.
(b) (3 points) The TLB has 256 entries in it. What is the associativity of the TLB (e.g., 1-way set associative = direct-mapped, 2-way set associative, 4-way, fully associative, etc.)?
(c) (3 points) How large is the L1 cache (data only)?
(d) (3 points) Assuming the L2 cache stores 2^22 B = 4 MiB of data, what is its associativity?
(e) (3 points) What is the motivation for making replacement algorithms for hardware caches less complicated than replacement algorithms for pages?
4. Assume that you have two 64-bit double-precision floating-point numbers stored in (integer) registers x2 and x3. You run the instruction XOR x1, x2, x3. In this question, "normal number" means "not zero, not infinity, not denormal, not NaN".
(a) (3 points) If you know that x2 is a positive normal number and x3 is a negative normal number, what bits in x1 do you know for certain? (Your answer should be of the form "Bit 42 will be a 1 and bit 24 will be a 0".) Bit 0 is the least significant bit (LSB) and bit 63 is the most significant bit (MSB).
(b) (3 points) If you know that x2 is a negative denormal number and x3 is negative infinity, what bits in x1 do you know for certain? (Your answer should be of the form “Bit 42 will be a 1 and bit 24 will be a 0”.)
[RISC-V Reference Card (riscv.org), two pages, attached for reference: the RV32I/64I/128I base integer instructions; the RV privileged instructions; the optional compressed (16-bit) extension RVC, listing each compressed instruction (C.LW, C.ADD, C.BEQZ, C.ADDI16SP, etc.), its format (CR, CI, CSS, CIW, CL, CS, CB, or CJ), and its RVI equivalent; the 32-bit (R/I/S/SB/U/UJ) and 16-bit (RVC) instruction format diagrams; the optional RVM multiply-divide, RVA atomic, and RVF/RVD/RVQ floating-point extensions; and the RISC-V calling convention for registers x0–x31 and f0–f31. The card's tables are not reproduced here.]
Diagram from Problem 3: Figure B.17, repeated for reference. The overall picture of a hypothetical memory hierarchy going from virtual address to L2 cache access. The page size is 16 KB. The TLB is two-way set associative with 256 entries. The L1 cache is a direct-mapped 16 KB, and the L2 cache is a four-way set associative with a total of 4 MB. Both use 64-byte blocks. The virtual address is 64 bits and the physical address is 40 bits. Copyright © 2011, Elsevier Inc. All rights Reserved.

UNIVERSITY OF CALIFORNIA, DAVIS
Department of Electrical and Computer Engineering
EEC 170 Introduction to Computer Architecture
Fall 2019
Final Examination Solutions

All instructions in this exam, unless otherwise noted, are RISC-V instructions. Recall that if the instruction writes to a register, that register appears first in the instruction. Unless otherwise indicated, you may not use any RISC-V pseudoinstructions on this exam, only actual RISC-V instructions.

1. We discussed dense matrix-dense matrix multiplication in class. To recap, consider A, B, and C as n×n matrices; we wish to compute C = C + A×B. This can be computed by

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j) // this is normal (scalar) + and *

We begin by computing the arithmetic intensity of this code in operations per memory word, assuming no caches, as a function of n. (Recall that the arithmetic intensity is the number of arithmetic operations per word of memory bandwidth.) Assume n is large; you can approximate any answers given this assumption.
Assume each addition or multiplication counts as one operation. Assume all of A, B, and C are stored in main memory (DRAM) and each read from/write to main memory counts as one memory word.

Solution: The inner line of code runs n³ times. It has 2 arithmetic operations, 3 loads, and 1 write. Thus the arithmetic intensity is 2n³/(3+1)n³ = 1/2.

(a) (4 points) Consider two different implementations of the same task with the same runtime. Baby Yoda's has high arithmetic intensity and is limited by arithmetic capability. The Mandalorian's has low arithmetic intensity and is limited by memory bandwidth. Even though both have the same runtime, we prefer Baby Yoda's implementation with higher arithmetic intensity because of how we expect future machines to behave. In terms of technology trends, why do we prefer implementations with more arithmetic intensity?

Solution: Historically, arithmetic capability increases more quickly than memory bandwidth (this is the "memory wall"). A future machine will hopefully continue to increase both arithmetic capability and memory bandwidth, but arithmetic capability will likely increase more, so Baby Yoda's implementation will run faster on future machines than the Mandalorian's.

(b) (4 points) Assume we have a very large cache, so all memory loads only read a reference the first time they see that reference and (because they are captured in the cache) all memory writes occur only at the end of the program. Reads from/writes to cache are free (do not count in terms of arithmetic intensity). What is the arithmetic intensity now?

Solution: We still do 2n³ operations. But now we read each element of A, B, and C once and write each element of C once. That's a total of 4n² memory operations. So the arithmetic intensity is 2n³/4n² = n/2.

(c) (6 points) Now consider a more realistic cache that is not large enough to hold the entire input and output. This cache can hold two matrix rows or columns plus one output element. Assume there are no cache conflicts; any two rows/columns plus one output element can fit in the cache. The Mandalorian proposes using the cache in the following way:

    for i = 1 to n
      // read row i of A into cache
      for j = 1 to n
        // read column j of B into cache
        // read C(i,j) into cache
        for k = 1 to n
          C(i,j) = C(i,j) + A(i,k) * B(k,j)  // this is normal (scalar) + and *
        // after k loop is complete, write C(i,j) back to memory

What is the arithmetic intensity now? It is OK to ignore lower-order terms in this analysis.

Solution: Again we have 2n³ operations. The primary memory cost is reading each column of B n times (n² column reads × n elements per column = n³ memory accesses). Thus the arithmetic intensity is 2n³/n³ = 2.

More precisely, we do the following memory operations:
• read each column of B n times (n³ memory loads)
• read each row of A once (n² memory loads)
• read and write each element of C once (2n² memory operations)
for a total of n³ + 3n² memory operations. The arithmetic intensity is thus 2n³/(n³ + 3n²), roughly 2 for large n.

(d) (10 points) Finally, Baby Yoda proposes to partition the matrix in a different way. Instead of reading 1-dimensional rows or columns from A and B, Baby Yoda arranges those matrices as b × b subblocks. (For instance, a 128 × 128 matrix may be structured as a 4 × 4 grid of 32 × 32 subblocks.) The cache can hold any two input blocks plus one output block at any time. Now our matrix multiply looks like this (note our operations are now on blocks at a time instead of rows or columns at a time).
The arithmetic, while factored differently, is exactly the same as the previous case.

    // note i, j, and k are now block indexes, not element indexes
    for i = 1 to n/b
      for j = 1 to n/b
        // read block C(i,j) into cache
        for k = 1 to n/b
          // read block A(i,k) into cache
          // read block B(k,j) into cache
          C(i,j) = C(i,j) + A(i,k) * B(k,j)
          // Here + indicates a matrix sum of blocks (block + block)
          // Here * indicates a matrix multiply of blocks (block * block)
        // after k loop is complete, write block C(i,j) back to memory

What is the arithmetic intensity now? It is again OK to ignore lower-order terms in this analysis.

Solution: Again we have 2n³ operations. For both A and B, we read (n/b)³ blocks of size b². That's 2n³/b reads. We also read and write each block of C once (n² reads, n² writes). (For large n, you can ignore this term.)

The arithmetic intensity is thus 2n³/(2n³/b + 2n²), roughly b for large n. This is substantially better than the previous cached version (2). It suggests larger b gives better arithmetic intensity. This is true, but it also means the cache must be large enough to hold three b × b blocks.

2. In the 2-page RISC-V Reference Card at the end of the exam, at the bottom right of the first page, you will see the "16-bit (RVC) Instruction Formats". RVC stands for "RISC-V Compressed".

(a) (5 points) Note CR-type instructions have a funct4 entry as part of their opcodes instead of a funct3 (as in all the other instructions). Assume that the opcode combinations (inst[1:0] == 01 OR inst[1:0] == 10) AND inst[15:13] == 100 are used to encode CR-type instructions. Given the instruction formats listed in the reference page, how many different instructions can RVC support? Note not all possible RVC instructions are specified on the reference card.
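This count can also be checked by brute-force enumeration. Below is an illustrative sketch (not part of the original solutions): encodings are modeled abstractly as tuples of the relevant bit fields rather than as real RVC bit patterns, and all names are ours.

```python
# Count distinct RVC instructions under the stated assumptions:
# each instruction is identified by (inst[1:0], inst[15:13]), and the two
# CR-type slots additionally decode inst[12] as a fourth funct bit.
cr_slots = {(0b01, 0b100), (0b10, 0b100)}

instructions = set()
for op in range(4):                  # inst[1:0]: 2 opcode bits
    for funct3 in range(8):          # inst[15:13]: 3 funct bits
        if (op, funct3) in cr_slots:
            for bit12 in range(2):   # CR types: funct4 = funct3 plus inst[12]
                instructions.add((op, funct3, bit12))
        else:
            instructions.add((op, funct3, None))

print(len(instructions))  # 34: 30 ordinary slots + 2 CR slots * 2 each
```

The enumeration mirrors the counting argument: 2⁵ = 32 slots, of which the 2 CR slots each split in two on inst[12].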
This implies 5 opcode bits total, so 25 = 32 instructions can be supported in this way.However, CR-type instructions get an extra opcode bit instr[12]. The CR instructions take 2 of the 32 instruction slots available to a 5-bit opcode. Since CR-type instructions have an additional opcode bit, they can fit 4 instructions into those 2 instruction slots.Thus RVC, given these assumptions, supports 34 instructions. (b) In practice, these 16-bit RVC instructions are translated to 32-bit full (“RVI”) RISC-V instructions. This process is summarized in the table on the reference card called “Optional Compressed (16-bit) Instruction Extension: RVC”, but we will highlight the important parts here directly. In this part, you may need to consult the table “16-bit (RVC) instruction formats”.i. (2 points) The RVC instruction C.ADD rd, rs1 translates to the RVI instruc- tion ADD rd, rd, rs1. Both instructions run a full 64b add. However, the C.ADD operation has a significant restriction compared to the full ADD instruc- tion. What is that restriction? Be specific.ii. (2 points) Some RVC instructions, like C.BEQZ, have a register argument with a prime (apostrophe) after the register name (C.BEQZ rs1’, imm → BEQ rs1’, x0, imm). What is the restriction indicated by the prime?iii. (2 points) All RVC memory-access instructions multiply their immediates by a Page 4 of 7 Solution: C.ADD only takes 2 register arguments, so the destination register must also be a source register. ADD takes 3 register arguments, so the two source registers can be distinct from the destination register.Solution: Registers indicated with a prime are only specified with 3 bits in the instruction encoding, instead of RISC-V’s 5 bits. Thus these instruc- tions can only access a subset of registers (this is not important, but it’s x8–x15). (c) John Owens 2019Do not reproduce in any way Do not upload anywherepower-of-twoconstant(e.g.,C.LW rd’, rs1’, imm→LW rd’, rs1’, imm*4). 
Because of this multiplication, what can a RVI memory-access instruction do that its partner RVC instruction cannot? (The answer is not "allows byte addressing"; both RVC and RVI memory-access instructions are byte-addressed.)

Solution: RVI instructions support misaligned memory operations, where an n-byte memory access is not aligned to an n-byte boundary. RVC instructions do not allow misaligned memory addresses.

iv. (2 points) The instruction C.ADDI16SP is primarily intended to manipulate the stack pointer. (Fill in the previous blank with one word.)

(c) (3 points) The new BabyYoda compiler inputs a RVI (32-bit) RISC-V instruction sequence and transforms some RVI instructions into their RVC (16-bit) equivalents. When running the resulting code, compared to the RVI implementation, would you expect the instruction cache hit rate to increase, decrease, or stay the same? Justify your answer.

Solution: The I-cache hit rate should increase. The instruction stream is identical in terms of operations, but the instruction stream with RVC instructions takes fewer bytes, takes up less space in the cache, can thus store more instructions than the RVI-only stream, and hence will have a higher hit rate.

3. While sipping his soup, Baby Yoda uses the Force to derive the following characteristics of his laptop's memory system, which is byte-addressed and contains a virtual memory subsystem and L1 and L2 caches. A copy of this diagram is attached at the end of the exam, which you can detach. In this diagram, <n> means a signal n bits wide.

Figure B.17: The overall picture of a hypothetical memory hierarchy going from virtual address to L2 cache access. The page size is 16 KB. The TLB is two-way set associative with 256 entries. The L1 cache is a direct-mapped 16 KB, and the L2 cache is a four-way set associative with a total of 4 MB. Both use 64-byte blocks. The virtual address is 64 bits and the physical address is 40 bits. Copyright © 2011, Elsevier Inc. All rights Reserved.

(a) (3 points) What is the page size?

Solution: The page offset is 14 bits, so the page size is 2¹⁴ = 16 kB.
(b) (3 points) The TLB has 256 entries in it. What is the associativity of the TLB (e.g., 1-way set associative = direct-mapped, 2-way set associative, 4-way, fully associative, etc.)?

Solution: The TLB index has 7 bits, so there are 2⁷ = 128 sets. That's 2 entries per set; the TLB is 2-way set associative.

(c) (3 points) How large is the L1 cache (data only)?

Solution: The cache index is 8 bits, so there are 2⁸ = 256 entries in the cache. Each entry is 512 bits (64 bytes), also shown by a 6-bit block offset. 256 entries × 64 bytes is 16 kB.

(d) (3 points) Assuming the L2 cache stores 2²² B = 4 MiB of data, what is its associativity?

Solution: The data per entry is 512 bits = 64 B. Thus there are 4 MiB / 64 B = 64 k = 2¹⁶ entries. The cache index has only 14 bits, so that implies 4 entries per set, making the cache 4-way set associative.
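As a cross-check, the field-width arithmetic from parts (a)–(d) can be replayed in a few lines. The bit widths below are the ones quoted in the solutions (14-bit page offset, 7-bit TLB index, 8-bit L1 index, 6-bit block offset, 14-bit L2 index); the variable names are ours, and everything else is derived.

```python
# Field widths as quoted in the solutions above; derived quantities follow.
page_offset_bits = 14
block_offset_bits = 6

page_size = 1 << page_offset_bits                    # (a): 16 KiB pages

tlb_entries, tlb_index_bits = 256, 7
tlb_ways = tlb_entries // (1 << tlb_index_bits)      # (b): 2-way set associative

l1_index_bits = 8
l1_size = (1 << l1_index_bits) << block_offset_bits  # (c): 16 KiB, direct-mapped

l2_bytes, l2_index_bits = 1 << 22, 14                # 4 MiB of data
l2_entries = l2_bytes >> block_offset_bits           # 2**16 64-byte blocks
l2_ways = l2_entries // (1 << l2_index_bits)         # (d): 4-way set associative

print(page_size, tlb_ways, l1_size, l2_ways)         # 16384 2 16384 4
```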
Page 6 of 7
(c) John Owens 2019
Do not reproduce in any way Do not upload anywhere

(e) (3 points) What is the motivation for making replacement algorithms for hardware caches less complicated than replacement algorithms for pages?
Solution: Either of a couple of general reasons is a good answer here:
• Cache replacement algorithms have to run quickly (and hence can’t be as complicated) because processing time for the algorithm (thus hit/miss times) must be low.
• Page replacement algorithms must be as accurate as possible (reducing the miss rate) because the miss penalty when pages are not in memory and have to be paged from disk is very long.
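To put rough numbers on that trade-off, here is an illustrative back-of-envelope sketch. The latencies are assumed, order-of-magnitude values (not from the exam), chosen only to show why the two replacement decisions face very different time budgets.

```python
# Assumed, order-of-magnitude latencies (illustrative only).
cache_hit_ns = 1            # an L1 hit; replacement logic must fit this budget
cache_miss_ns = 100         # servicing a cache miss from DRAM
page_fault_ns = 5_000_000   # ~5 ms to bring a page in from disk

# A cache replacement decision competes with a ~1 ns hit path, so it must be
# trivial; a page replacement decision is amortized against a ~5 ms fault, so
# spending even microseconds of software time to improve accuracy pays off.
print(cache_miss_ns // cache_hit_ns)    # 100
print(page_fault_ns // cache_miss_ns)   # 50000
```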
4. Assume that you have two 64-bit double-precision floating-point numbers stored in (integer) registers x2 and x3. You run the instruction XOR x1, x2, x3. In this question, "normal number" means "not zero, not infinity, not denormal, not NaN".
(a) (3 points) If you know that x2 is a positive normal number and x3 is a negative normal number, what bits in x1 do you know for certain? (Your answer should be of the form "Bit 42 will be a 1 and bit 24 will be a 0".) Bit 0 is the least significant bit (LSB) and bit 63 is the most significant bit (MSB).

Solution: Bit 63 will be 1, because you know the two input sign bits will be different.

(b) (3 points) If you know that x2 is a negative denormal number and x3 is negative infinity, what bits in x1 do you know for certain? (Your answer should be of the form "Bit 42 will be a 1 and bit 24 will be a 0".)

Solution: Bit 63 will be 0, because both input sign bits are the same. Bits 62–52 inclusive will all be 1, because x2's exponent is all zeroes and x3's exponent is all ones.
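The sign- and exponent-bit claims can be checked on concrete values. This is an illustrative Python sketch (the helper name and the sample numbers are ours); struct is used to reinterpret each double's 64-bit IEEE-754 pattern as an integer, mimicking the integer XOR in the question.

```python
import struct

def bits(x: float) -> int:
    # view a double's IEEE-754 encoding as a 64-bit unsigned integer
    return struct.unpack('<Q', struct.pack('<d', x))[0]

# (a) positive normal XOR negative normal: sign bits differ, so bit 63 is 1
x1 = bits(1.5) ^ bits(-2.75)
print((x1 >> 63) & 1)        # 1

# (b) negative denormal XOR negative infinity
x1 = bits(-5e-324) ^ bits(float('-inf'))   # -5e-324: smallest-magnitude negative denormal
print((x1 >> 63) & 1)        # 0: both sign bits are 1
print((x1 >> 52) & 0x7FF)    # 2047 = 0b11111111111: exponents are all-0 and all-1
```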
