Lecture 5:
Arithmetic 1/3
John Owens
Introduction to Computer Architecture UC Davis EEC 170, Winter 2021
Arithmetic for Computers
Operations on integers
Addition and subtraction
Multiplication and division
Dealing with overflow
Floating-point real numbers
Representation and operations
2 UC Davis EEC 170, Winter 2021 / John Owens
3.1 Introduction
Arithmetic
Where weve been:
Performance (seconds, cycles, instructions) Abstractions:
Instruction Set Architecture
Assembly Language and Machine Language
Whats up ahead:
Number Representation
Implementing an ALU
Implementing the Architecture
a
32
b
32
operation
ALU
32
result
3
UC Davis EEC 170, Winter 2021 / John Owens
32 bit signed numbers
0000 0000 0000 0000 0000 0000 0000 0000two = 0ten 0000 0000 0000 0000 0000 0000 0000 0001two = + 1ten 0000 0000 0000 0000 0000 0000 0000 0010two = + 2ten
0111 1111 1111 1111 1111 1111 1111 1110two = + 2,147,483,646 ten
0111 1111 1111 1111 1111 1111 1111 1111two = + 2,147,483,647 ten
1000 0000 0000 0000 0000 0000 0000 0000two = 2,147,483,648ten 1000 0000 0000 0000 0000 0000 0000 0001two = 2,147,483,647ten 1000 0000 0000 0000 0000 0000 0000 0010two = 2,147,483,646ten
1111 1111 1111 1111 1111 1111 1111 1101two = 3ten
1111 1111 1111 1111 1111 1111 1111 1110two = 2 ten
1111 1111 1111 1111 1111 1111 1111 1111two = 1 ten
maxint
minint
4
UC Davis EEC 170, Winter 2021 / John Owens
What happens if you compute minint?
5 UC Davis EEC 170, Winter 2021 / John Owens
Twos Complement Operations
Negating a twos complement number: invert all bits and add 1
remember: negate and invert are quite different!
Converting n bit numbers into numbers with more than n bits:
RISC V n bit immediate gets converted to 64 bits for arithmetic
copy the most significant bit (the sign bit) into the other bits
0010-> 0000 0010
1010-> 1111 1010
sign extension (lbu vs. lb)
mem [x1] = 0xff
lb x2, 0(x1) -> x2 = ffff ffff
lbu x2, 0(x1) -> x2 = 0000 00ff
6 UC Davis EEC 170, Winter 2021 / John Owens
Addition & Subtraction
Twos complement operations easy
subtraction using addition of negative numbers
0111 (7) + 1010 (-6)
0111 (7)
0110 (6)
7
UC Davis EEC 170, Winter 2021 / John Owens
Addition & Subtraction
Overflow (result too large for finite computer word):
adding two n-bit numbers does not yield an n-bit number
1111
+0001
10000
How about -1 + -1?
8 UC Davis EEC 170, Winter 2021 / John Owens
Detecting Overflow
No overflow when adding a positive and a negative number No overflow when signs are the same for subtraction Overflow occurs when the value affects the sign:
overflow when adding two positives yields a negative
or, adding two negatives gives a positive
or, subtract a negative from a positive and get a negative
or, subtract a positive from a negative and get a positive
Consider the operations A + B, and A B
Can overflow occur if B is 0 ?
Can overflow occur if A is 0 ?
HW problem on figuring out exact criteria
9 UC Davis EEC 170, Winter 2021 / John Owens
and Subtraction
Integer Addition Example: 7 + 6
Overflow if result out of range
Adding +ve and ve operands, no overflow
Adding two +ve operands
Overflow if result sign is 1 Adding two ve operands
Overflow if result sign is 0
10
UC Davis EEC 170, Winter 2021 / John Owens
Integer Subtraction
Add negation of second operand
Example: 7 6 = 7 + (6)
+7: 0000 0000 0000 0111 6: 1111 1111 1111 1010 +1:0000 0000 0000 0001
Overflow if result out of range
Subtracting two +ve or two ve operands, no overflow
Subtracting +ve from ve operand
Overflow if result sign is 0
Subtracting ve from +ve operand Overflow if result sign is 1
11
UC Davis EEC 170, Winter 2021 / John Owens
Effects of Overflow
An exception (interrupt) occurs
Control jumps to predefined address for exception
Interrupted address is saved for possible resumption
Details based on software system / language
example: flight control vs. homework assignment
C/Java do not detect overflow
Fortran (evidently) can detect overflow
Dont always want to detect overflow
RISC-V advocates branches on overflow instead
MIPS instructions (but not RISC-V): addu,addiu,subu addiu sign-extends
sltu, sltiu for unsigned comparisons
12 UC Davis EEC 170, Winter 2021 / John Owens
Arithmetic for Multimedia
Saturating operations
On overflow, result is largest representable value
c.f. 2s-complement modulo arithmetic E.g., clipping in audio, saturation in video
Graphics and media processing operates on vectors of 8-bit and 16-bit data
Use 64-bit adder, with partitioned carry chain
(this probably doesnt mean anything to you yet, but wait until the end of lecture)
Operate on 88-bit, 416-bit, or 232-bit vectors SIMD (single-instruction, multiple-data)
13 UC Davis EEC 170, Winter 2021 / John Owens
Review: Boolean Algebra & Gates
Problem: Consider a logic function with three inputs: A, B, and C.
Output D is true if at least one input is true
Output E is true if exactly two inputs are true
Output F is true only if all three inputs are true
14 UC Davis EEC 170, Winter 2021 / John Owens
Review: Boolean Algebra & Gates
Show the Boolean equations for these three functions.
15 UC Davis EEC 170, Winter 2021 / John Owens
Review: Boolean Algebra & Gates
Show an implementation consisting of inverters, AND, and OR gates.
16 UC Davis EEC 170, Winter 2021 / John Owens
You should know
Boolean algebra
Logic gates (and, or, not, xor, nor, multiplexors [muxes], decoders, etc.)
Converting between equations, truth tables, and gate representations
Critical path
Clocking, and registers / memory
Finite state machines
COD5e Appendix A summarizes this material
17 UC Davis EEC 170, Winter 2021 / John Owens
Break
18 UC Davis EEC 170, Winter 2021 / John Owens
Administrivia
We have a TA! I already put him to work. Trivikram Reddy
By popular demand, I will have an office hour Fri 18 October, when the train arrives (9:30)11 at the CoHo
19 UC Davis EEC 170, Winter 2021 / John Owens
Bit Slices
Concentrate on one bit of the adder:
0111 + 0110
Could we build the same hardware for every bit?
This is a good idea. Why?
Each bits hardware is called a bit slice
20 UC Davis EEC 170, Winter 2021 / John Owens
Truth Table for Adder Bit Slice 3 inputs (A, B, Cin); 2 outputs (Sum, Cout)
A
B
Cin
Sum
Cout
0
0
0
0
0
0
0
1
1
0
0
1
0
1
0
0
1
1
0
1
1
0
0
1
0
1
0
1
0
1
1
1
0
0
1
1
1
1
1
1
21 UC Davis EEC 170, Winter 2021 / John Owens
Adder Equations Sum = (AB)Cin
Carry = AB + ACin + BCin Abstract as Full Adder:
A B
Cin
Full Cout Adder Sum
22 UC Davis EEC 170, Winter 2021 / John Owens
Cascading Adders
Cascade Full Adders to make multibit adder:
A+B=S
A B
Cin
A B
Cin
A B
Cin
A B
Cin
Cout Sum
Cout Sum
Cout Sum
Cout Sum
A3
Full Adder
B3
A2
S3
Full Adder
B2
S2
A1
Full Adder
B1
S1
A0
Full Adder
B0
S0
23
UC Davis EEC 170, Winter 2021 / John Owens
Arithmetic for Multimedia
Saturating operations
On overflow, result is largest representable value
c.f. 2s-complement modulo arithmetic E.g., clipping in audio, saturation in video
Graphics and media processing operates on vectors of 8-bit and 16-bit data
Use 64-bit adder, with partitioned carry chain
(this probably doesnt mean anything to you yet, but wait until the end of lecture)
Operate on 88-bit, 416-bit, or 232-bit vectors SIMD (single-instruction, multiple-data)
24 UC Davis EEC 170, Winter 2021 / John Owens
Truth Table for Subtractor Bit Slice 3 inputs (A, B, Bin); 2 outputs (Diff, Bout)
A
B
Bin
Diff
Bout
0
0
0
0
0
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
1
0
0
1
0
1
0
1
0
0
1
1
0
0
0
1
1
1
1
1
25 UC Davis EEC 170, Winter 2021 / John Owens
Cascading Subtractors
Cascade Full Subtrs to make multibit subtr:
A-B=S
A3
A B
Bin
A B
Bin
A B
Bin
A B
Full Subtr
Full Subtr
Full Subtr
Bout Diff
Bout Diff
Bout Diff
Bout Diff
B3
B2
A1
B1
Bin
26
UC Davis EEC 170, Winter 2021 / John Owens
Full Subtr
S3
A2
S2
S1
A0
B0
S0
How can we combine + and -?
A
A B
Cin
A B
Bin
Full Adder
Full Subtr
Cout Sum
Bout Diff
+
+
B
This is commonits what well do (for example) for logic functions (and, or, etc.)
27 UC Davis EEC 170, Winter 2021 / John Owens
Cin+1
S
How can we combine + and -?
S = A- B
S = A + (~B) + 1
A B
Cin
A B
Cin
A B
Cin
A B
Cin
Full Adder
Full Adder
Full Adder
Full Adder
Cout Sum
Cout Sum
Cout Sum
Cout Sum
A3
B3
A2
S3
B2
S2
A1
B1
S1
A0
B0
S0
28
UC Davis EEC 170, Winter 2021 / John Owens
Lest you think this is only theoretical
29 UC Davis EEC 170, Winter 2021 / John Owens
How can we combine + and -?
A1
B1
+
A B
Cin
A B
Cin
Cout Full
Sum S1
Adder
A0
+
Full Adder
Cout
Sum S0
B0
S = A + ~B + 1 First, negate B
30
UC Davis EEC 170, Winter 2021 / John Owens
How can we combine + and -?
A1
B1
+
A B
Cin
A B
Cin
Full Adder
Cout
Sum
S1
A0
+
+
Full Adder
Cout
Sum
B0
S0
S = A + ~B + 1 then add 1
0
1
31
UC Davis EEC 170, Winter 2021 / John Owens
Control for +/-
One bit controls three muxes. This is a control point.
A B
Cin
A B
Cin
A1
B1
+
Full Adder
Cout
Sum
S1
A0
Cout
Sum
How do we set this control point for add? subtract?
+
Full Adder
B0
S0
0
+
1
32
UC Davis EEC 170, Winter 2021 / John Owens
imm[11:0] rs1 funct3 rd opcode I-type
RISC-V instruc
rs2
imm[31:12] rd opcode
im
m
[2
|19
rd opcode
0|1
codings
:12]
rs1 funct3 imm[4:0] opcode
S-type B-type U-type J-type
LUI AUIPC JAL JALR BEQ BNE BLT BGE BLTU BGEU LB
LH LW LBU LHU SB
SH SW ADDI SLTI SLTIU XORI ORI ANDI SLLI SRLI SRAI ADD SUB SLL SLT SLTU XOR SRL
SRA
imm[11:5]
imm[12|10:5] tiorns2 enrs1 funct3 imm[4:1|11] opcode
0:1|11
RV32I Base Instruction Set
imm[31:12]
rd
0110111
imm[31:12]
rd
0010111
imm[20|10:1|11|19:12]
rd
1101111
imm[11:0]
rs1
000
rd
1100111
imm[12|10:5]
rs2
rs1
000
imm[4:1|11]
1100011
imm[12|10:5]
rs2
rs1
001
imm[4:1|11]
1100011
imm[12|10:5]
rs2
rs1
100
imm[4:1|11]
1100011
imm[12|10:5]
rs2
rs1
101
imm[4:1|11]
1100011
imm[12|10:5]
rs2
rs1
110
imm[4:1|11]
1100011
imm[12|10:5]
rs2
rs1
111
imm[4:1|11]
1100011
imm[11:0]
rs1
000
rd
0000011
imm[11:0]
rs1
001
rd
0000011
imm[11:0]
rs1
010
rd
0000011
imm[11:0]
rs1
100
rd
0000011
imm[11:0]
rs1
101
rd
0000011
imm[11:5]
rs2
rs1
000
imm[4:0]
0100011
imm[11:5]
rs2
rs1
001
imm[4:0]
0100011
imm[11:5]
rs2
rs1
010
imm[4:0]
0100011
imm[11:0]
rs1
000
rd
0010011
imm[11:0]
rs1
010
rd
0010011
imm[11:0]
rs1
011
rd
0010011
imm[11:0]
rs1
100
rd
0010011
imm[11:0]
rs1
110
rd
0010011
imm[11:0]
rs1
111
rd
0010011
0000000
shamt
rs1
001
rd
0010011
0000000
shamt
rs1
101
rd
0010011
0100000
shamt
rs1
101
rd
0010011
0000000
rs2
rs1
000
rd
0110011
0100000
rs2
rs1
000
rd
0110011
0000000
rs2
rs1
001
rd
0110011
0000000
rs2
rs1
010
rd
0110011
0000000
rs2
rs1
011
rd
0110011
0000000
0000000
0100000
rs2 rs1 rs2 rs1 rs2 rs1
33
100 rd
101 rd
101 rd
0110011
0110011
0110011
UC Davis EEC 170, Winter 2021 / John Owens
0000000 rs2 rs1 110 rd 0110011 OR
|| ||
imm[12 10:5] rs2 rs1 101 imm[4:1 11] 1100011 BGE
imm[12|10:5] rs2 rs1 110 imm[4:1|11] 1100011 BLTU imm[12|10:5] rs2 rs1 111 imm[4:1|11] 1100011 BGEU
imm[11:0] rs1 000 rd 0000011 LB
RISC-V instruction encodings (funct7)
imm[11:0] rs1
0000011 LH 0000011 LW 0000011 LBU 0000011 LHU 0100011 SB 0100011 SH 0100011 SW 0010011 ADDI 0010011 SLTI 0010011 SLTIU 0010011 XORI 0010011 ORI 0010011 ANDI
imm[11:0] rs1 ADD: i0m0m[0110:0] 00 rs1 imm[11:0] rs1
001 rd 010 rd
100 rd
101 rd
000 imm[4:0]
001 imm[4:0]
010 imm[4:0] 000 rd
010 rd
011 rd
100 rd
110 rd
111 rd
001 rd 0010011 SLLI
SUB: 0100000
imm[11:5] rs2 rs1
imm[11:5] rs2 rs1 imm[11:5] rs2 rs1 imm[11:0] rs1 imm[11:0] rs1 imm[11:0] rs1 imm[11:0] rs1 imm[11:0] rs1 imm[11:0] rs1 0000000 shamt rs1
0000000
shamt
rs1
101
rd
0010011
0100000
shamt
rs1
101
rd
0010011
0000000
rs2
rs1
000
rd
0110011
0100000
rs2
rs1
000
rd
0110011
0000000
rs2
rs1
001
rd
0110011
0000000
rs2
rs1
010
rd
0110011
0000000
rs2
rs1
011
rd
0110011
0000000 rs2 rs1 0000000 rs2 rs1 0100000 rs2 rs1 0000000 rs2 rs1 0000000 rs2 rs1
fm pred succ rs1
100 rd
101 rd
101 rd
110 rd
111 rd
000 rd
SRLI SRAI ADD SUB SLL SLT SLTU
0110011 XOR
0110011 SRL
0110011 SRA
0110011 OR
0110011 AND
34
UC Davis EEC 170, Winter 2021 / John Owens
0001111 FENCE
000000000000 00000 000 00000 1110011 ECALL
imm[11:0] rs1 001 rd 0000011 LH
imm[11:0] rs1 010 rd 0000011 LW imm[11:0] rs1 100 rd 0000011 LBU
RISC-V instruction encodings (funct3)
imm[11:0] rs1 101 rd 0000011 LHU
imm[11:5] rs2 rs1 imm[11:5] rs2 rs1
000 imm[4:0]
001 imm[4:0]
010 imm[4:0]
000 ADDI
010 SLTI
011 SLTIU
100 XORI 110 ORI
111
001 SLLI 101 SRLI 101 SRAI 000 ADD
000 SUB
001 SLL
010 SLT
011 SLTU
ADDI: 000
imm[11:5] rs2 rs1
0100011 SB 0100011 SH 0100011 SW
SLTI: 010
imm[11:0] rs1
rd
0010011
rd
0010011
rd
0010011
rd
0010011
rd
0010011
rd
0010011
rd
0010011
rd
0010011
rd
0010011
rd
0110011
rd
0110011
rd
0110011
rd
0110011
rd
0110011
SLTIU: 011
imm[11:0] rs1
imm[11:0] rs1
SLT: 010
imm[11:0] rs1
imm[11:0] rs1
SLimTmU[:11:0]011 rs1 0000000 shamt rs1 0000000 shamt rs1 0100000 shamt rs1 0000000 rs2 rs1 0100000 rs2 rs1 0000000 rs2 rs1 0000000 rs2 rs1 0000000 rs2 rs1 0000000 rs2 rs1 0000000 rs2 rs1 0100000 rs2 rs1 0000000 rs2 rs1 0000000 rs2 rs1
fm pred succ rs1
ANDI
35
0001111 FENCE
100 rd
101 rd
101 rd
110 rd
111 rd
000 rd
0110011 XOR
0110011 SRL
0110011 SRA
0110011 OR
0110011 AND
UC Davis EEC 170, Winter 2021 / John Owens
000000000000 00000 000 00000 1110011 ECALL
Bit Slices
Concentrate on one bit of the adder:
0111 + 0110
Needs:
2 inputs (A and B)
Carry from previous slice (Cin)
Output (Sum)
Carry to next slice (Cout)
36 UC Davis EEC 170, Winter 2021 / John Owens
MIPS Opcode Map
essor Users Manual / Joe Heinrich]
37 UC Davis EEC 170, Winter 2021 / John Owens
End of lecture / Quiz 1
38 UC Davis EEC 170, Winter 2021 / John Owens
But What About Performance? Critical path of one bitslice is CP
Critical path of n-bit rippled-carry adder is n*CP
Design Trick:
Throw hardware at it
A0 B0
A1 B1
A2 B2
A3 B3
CarryIn0
1-bit ALU
CarryOut0
Result0
Result1
Result2
Result3
CarryIn1
1-bit ALU
CarryOut1
CarryIn2
1-bit ALU
CarryOut2 CarryIn3
CarryOut3
39
UC Davis EEC 170, Winter 2021 / John Owens
1-bit ALU
Truth Table for Adder Bit Slice 3 inputs (A, B, Cin); 2 outputs (Sum, Cout)
A
B
Cin
Sum
Cout
0
0
0
0
0
0
0
1
1
0
0
1
0
1
0=Cin
0
1
1
0
1=Cin
1
0
0
1
0=Cin
1
0
1
0
1=Cin
1
1
0
0
1
1
1
1
1
1
40 UC Davis EEC 170, Winter 2021 / John Owens
A0 B0
A1 B1
A2 B2
A3 B3
Carry Look Ahead (Design trick: peek)
C0 = Cin
S
S
S
S
C4 = . . .
G
P
G = A and B P = A xor B WHY are these interesting?
41
UC Davis EEC 170, Winter 2021 / John Owens
A B Cout
0 0 0 kill
0 1 Cin 1 0 Cin
1 1 1
propagate propagate generate
G P
G P
C1 = G0 + C0 P0
G P
C2 = G1 + G0 P1 + C0 P0 P1
C3 = G2 + G1 P2 + G0 P1 P2 + C0 P0 P1 P2
G P
CLA vs. Ripple
C0 = Cin
S
G = A and B P = A xor B
CarryIn0
A0 B0
A1 B1
A0
1-bit ALU
CarryOut0
Result0
G P
B0 CarryIn1
A1
B1 CarryIn2
1-bit
ALU Result1
C1 = G0 + C0 P0
C2 = G1 + G0 P1 + C0 P0 P1
C3 = G2 + G1 P2 +
G0 P1 P2 +
C0 P0 P1 P2
C4 = . . .
S
G P
A2 A2 B2
CarryOut1
1-bit ALU
CarryOut2
1-bit
ALU
CarryOut3
Result2
Result3
S
G P
B2
A3 B3
CarryIn3
A3 B3
S
G
P
G P
42
UC Davis EEC 170, Winter 2021 / John Owens
Cascaded Carry Look-ahead (16-bit)
C
L C0
A
4-bit Adder
4-bit Adder
4-bit Adder
G0 P0
C1 = G0 + C0 P0
Abstraction!
C2 = G1 + G0 P1 + C0 P0 P1
C3 = G2 + G1 P2 + G0 P1 P2 + C0 P0 P1 P2
G P
C4 = . . .
43
UC Davis EEC 170, Winter 2021 / John Owens
Design Trick: Guess (or Precompute)
CP(2n) = 2*CP(n)
n-bit adder n-bit adder
(2n) = CP(n) + CP(mux)
n-bit adder
1n-bit adder
0
n-bit adder
Cout
Carry-select adder
44
UC Davis EEC 170, Winter 2021 / John Owens
P
Carry Skip Adder: reduce worst case delay
Just speed up the slowest case for each block
4-bit Ripple Adder 4-bit Ripple Adder
B A4 B A0
Exercise: optimal design uses variable block sizes (why?)
P3 S P3 S
P2
P1
P0
P2
P1
P0
45 UC Davis EEC 170, Winter 2021 / John Owens
Adder Lessons
Reuse hardware if possible
+/- reuse is compelling argument for 2s complement
For higher performance:
Look for critical path, optimize for it Reorganize equations [propagate/generate / carry lookahead]
Precompute [carry save]
Reduce worst-case delay [carry skip]
46 UC Davis EEC 170, Winter 2021 / John Owens
Finished way early (1:20 in, 30 minutes
47 UC Davis EEC 170, Winter 2021 / John Owens
End of lecture
48 UC Davis EEC 170, Winter 2021 / John Owens
Lecture 6:
Arithmetic 2/3
John Owens
Introduction to Computer Architecture UC Davis EEC 170, Winter 2021
Multiply (unsigned)
Paper and pencil example (unsigned):
Multiplicand 1000 Multiplier x1001 1000
0000 0000
1000 Product 01001000
m bits x n bits = m+n bit product
Binary makes it easy:
0 => place 0 ( 0 x multiplicand )
1 => place a copy ( 1 x multiplicand )
4 versions of multiply hardware & algorithm:
successive refinement
50 UC Davis EEC 170, Winter 2021 / John Owens
m bits x n bits = m+n bit product
51 UC Davis EEC 170, Winter 2021 / John Owens
Unsigned Combinational Multiplier
0000
A3 A2 A1 A0
A3 A2
A3 A2
A3 A2
P7 P6 P5 P4 P3 P2 P1 P0
B0 B1
B2 B3
A1
A0
A1
A0
A1
A0
52 UC Davis EEC 170, Winter 2021 / John Owens
How does it work?
00
A3 A3 A2
0
0000
A3 A2 A1 A0 A2
A1
A0
A1
A0
A3 A2
P7 P6 P5 P4 P3 P2 P1 P0
B0 B1
B2 B3
A1
A0
At each stage shift A left (multiply it by 2)
Use next bit of B to determine whether to add in shifted
multiplicand
Accumulate 2n bit partial product at each stage
53 UC Davis EEC 170, Winter 2021 / John Owens
Unsigned Combinational Multiplier
0000
A3 A2 A1 A0
A3 A2
A3 A2
A3 A2
Stage i accumulates A * 2i if Bi == 1
P7 P6 P5 P4 P3 P2 P1 P0
B0 B1
B2 B3
A1
A0
A1
A0
A1
A0
Q: How much hardware for 32 bit multiplier? Critical path?
54 UC Davis EEC 170, Winter 2021 / John Owens
Unsigned shift-add multiplier (version 1)
64-bit Multiplicand reg, 64-bit ALU, 64-bit Product reg, 32-bit multiplier reg
Multiplicand
64 bits
Shift Left
Multiplier
32 bits
Control
Shift Right
64-bit ALU
Product
64 bits
Write
Multiplier = datapath + control
55
UC Davis EEC 170, Winter 2021 / John Owens
Multiply Algorithm V1
Product
0000 0000 0011
1:0000 0010 0011 2:0000 0010 0011 3:0000 0010 0001 1:0000 0110 0001 2:0000 0110 0001 3:0000 0110 0000
0000 0110 0000
Start
1. Test Multiplier0
Multiplier Multiplicand 0000 0010
Multiplier0 = 1
Multiplier0 = 0
0000 0010
0000 0100
0000 0100 0000 0100 0000 1000 0000 1000
0000 1000
1a. Add multiplicand to product &
place the result in Product register
2. Shift the Multiplicand register left 1 bit. 3. Shift the Multiplier register right 1 bit.
32nd repetition?
No: < 32 repetitions Yes: 32 repetitions Done56UC Davis EEC 170, Winter 2021 / John Owens Observations on Multiply Version 1 1 clock per cycle => 100 clocks per multiply Ratio of multiply to add 5:1 to 100:1
1/2 bits in multiplicand always 0 => 64-bit adder is wasted
0s inserted in right of multiplicand as shifted
=> least significant bits of product never changed once formed
Instead of shifting multiplicand to left, shift product to right?
57 UC Davis EEC 170, Winter 2021 / John Owens
Multiply Hardware Version 2
32-bit Multiplicand reg, 32-bit ALU, 64-bit Product reg, 32-bit Multiplier reg
Multiplicand
32 bits
32-bit ALU
Product
64 bits
Multiplier
32 bits
Control
Shift Right
Shift Right
Write
58
UC Davis EEC 170, Winter 2021 / John Owens
How to think of this?
Remember original combinational multiplier:
0000
A3 A2 A1 A0
A3 A2
A3 A2
A3 A2
P7 P6 P5 P4 P3 P2 P1 P0
B0 B1
B2 B3
A1
A0
A1
A0
A1
A0
59 UC Davis EEC 170, Winter 2021 / John Owens
Simply warp to let product move right
0000
A3 A2 A1 A0
A3 A2 A1 A0
A3 A2 A1 A0
B0
B1
B2
B3
P7 P6 P5 P4 P3 P2 P1 P0
MultiplicaAnd staAys stAill anAd product moves right 3210
60 UC Davis EEC 170, Winter 2021 / John Owens
Multiply Algorithm V2
Start
Multiplier0 = 1
1. Test Multiplier0
Multiplier0 = 0
0000 0000
1: 0010 0000
2: 0001 0000
0011 0010
0011 0010
1a. Add multiplicand to the left half of product & 0011 0010
3: 00010000 00p0l1acet0h0e1re0sultinthelefthalfofProductregister
1: 0011 0000 0001 0010
2: 0001 1000 0001 0010 Product Multiplier Multiplicand
3: 0001 1000 1: 0001 1000 2: 0000 1100 3: 0000 1100 1: 0000 1100 2: 0000 0110 3: 0000 0110
0000 0110
0000 0010 0000 0010 0000 0010 0000 0010 0000 0010 0000 0010 0000 0010
0000 0010
2. Shift the Product register right 1 bit.
3. Shift the Multiplier register right 1 bit.
32nd repetition?
No: < 32 repetitions Yes: 32 repetitions Done61UC Davis EEC 170, Winter 2021 / John Owens Still more wasted space!Start Multiplier0 = 11. Test Multiplier0Multiplier0 = 00000 0000 1: 0010 0000 2: 0001 0000 3: 0001 0000 1: 0011 00000011 00101a. Add multiplicand to the left half of product & 0011 0010place the result in the left half of Product register 0011 0010 2: 0001 1000 3: 0001 1000 1: 0001 1000 2: 0000 1100 3: 0000 1100 1: 0000 1100 2: 0000 0110 3: 0000 01100000 01100001 00100001 00100001 0010 Product Multiplier Multiplicand0000 0010 0000 0010 0000 0010 0000 0010 0000 0010 0000 0010 0000 00100000 00102. Shift the Product register right 1 bit.3. Shift the Multiplier register right 1 bit.32nd repetition?No: < 32 repetitions Yes: 32 repetitions Done62UC Davis EEC 170, Winter 2021 / John Owens Observations on Multiply Version 2 Product register wastes space that exactly matches size of multiplier=> combine Multiplier register and Product register
63 UC Davis EEC 170, Winter 2021 / John Owens
Multiply Hardware Version 3
32-bit Multiplicand reg, 32 -bit ALU, 64-bit Product reg, (0-bit Multiplier reg)
Multiplicand 32 bits
32-bit ALU
Product (Multiplier) 64 bits
Shift Right
Write
Control
64
UC Davis EEC 170, Winter 2021 / John Owens
Multiply Algorithm V3
Product0 = 1
Start
1.
Product0
Product0 = 0
Test
1a. Add multiplicand to the left half of product & place the result in the left half of Product register
0000 0011 0010
1: 0010 0011 0010
2: 0001 0001 0010
1: 0011 0000 0010 Product Multiplicand
2: 0001 1000 0010
1: 0001 1000 0010
2: 0000 1100 0010
1: 0000 1100 0010
2: 0000 0110 0010
0000 0110 0010
2. Shift the Product register right 1 bit.
32nd repetition?
No: < 32 repetitions Yes: 32 repetitions Done65UC Davis EEC 170, Winter 2021 / John Owens Observations on Multiply Version 3 2 steps per bit because Multiplier & Product combined What about signed multiplication?- easiest solution is to make both positive & remember whether to complement product when done (leave out the sign bit, run for 31 steps)- apply definition of 2s complement- need to sign-extend partial products and subtract at the end- Booths Algorithm is elegant way to multiply signed numbers using same hardware as before and save cycles- can handle multiple bits at a time66 UC Davis EEC 170, Winter 2021 / John Owens Faster MultiplierUses multiple adders- Cost/performance tradeoff Can be pipelined- Several multiplications performed in parallel67UC Davis EEC 170, Winter 2021 / John Owens Motivation for Booths Algorithm Example 2 x 6 = 0010 x 0110:0010 x 0110+ + + +0000 shift (0 in multiplier) 0010 add (1 in multiplier) 0010 add (1 in multiplier)0000 shift (0 in multiplier) 000110068 UC Davis EEC 170, Winter 2021 / John Owens Motivation for Booths Algorithm ALU with add or subtract can get same result in more than one way: 6 = 4 + 2 = -2 + 80110 = -00010 + 01000 = 11110 + 01000 For example:0010 (2 [multiplier]) x 0110 (6 [multiplicand])0000 – 00100000 + 001000001100shift (0 in multiplier)sub (first 1 in multpl.)shift (mid string of 1s)add (prior step had last 1)69 UC Davis EEC 170, Winter 2021 / John Owens Booths Algorithm CuernredntoBfitruBnit tomthide dRilgehtoEfxprluanatbioenginning oExfarmupnle Op 1 0 sub1 1 0 1 0 0Begins run of 1s 0001111000 011110Middle of run of 1s 0001111000 none End of run of 1s 0001111000 add Middle of run of 0s 0001111000 none Originally for speed (when shift was faster than add) Replace a string of 1s in multiplier with an initial subtract when we firstsee a one and then later add for the bit after the last one Handles twos complement!70 UC Davis EEC 170, Winter 2021 / John Owens Booths Example (2 x 7)Operation0. initial value1a. P = P – m 1b.2.3.4a. 4b.Multiplicand 0010Product0000 0111 0 1110 0111 0 1111 0011 1 1111 1001 1 1111 1100 1 0001 1100 1 0000 1110 0next?10 -> sub
shift P (sign ext) 11 -> nop, shift 11 -> nop, shift 01 -> add
shift done
Blue + red = multiplier
(red is 2 bits to compare); Green = arithmetic ops; Black = product
71 UC Davis EEC 170, Winter 2021 / John Owens
0010
0010
0010
0010
0010
0010
+ 1110->
+ 0010->
Booths Example (2 x -3)
Operation Multiplicand Product
next?
10 -> sub shift P (sign ext) 01 -> add
shift P
10 -> sub shift
11 -> nop
shift done
0. initial value 0010
0000 1101 0 1110 1101 0
1111 0110 1
1a. P = P m 1b.
2a. 2b.
3a.
3b. 4a.
1110 + 1110 0010
+ 0010
0001 011 0000 1011 0
0 1
0010
+ 1110 1110 1011 0
1111 0101 1 1111 0101 1
0010 0010
4b.
Blue + red = multiplier (red is 2 bits to compare); Green = arithmetic ops; Black = product
0010
1111 1010 1
72 UC Davis EEC 170, Winter 2021 / John Owens
Radix-4 Modified Booths Algorithm
Current
Bits Right
00 0
01 0
10 0
11 0
00 1
01 1
10 1
Explanation Example Recode
Bit to the
Middle of zeros Single one Begins run of 1s Begins run of 1s
Ends run of 1s Ends run of 1s Isolated 0 Middle of run
00 00 00 00 00 0 00 00 00 01 00 1 00 01 11 10 00 -2
00 01 11 11 00 -1 00 00 11 11 00 1
00 01 11 11 00 2 00 11 10 11 00 -1 00 11 11 11 00 0
11 1
Same insight as one-bit Booths, simply adjust for alignment of 2 bits. Allows multiplication 2 bits at a time.
73 UC Davis EEC 170, Winter 2021 / John Owens
RISC-V Multiplication Support
Four multiply instructions: mul: multiply
Gives the lower 64 bits of the product mulh: multiply high
Gives the upper 64 bits of the product, assuming the operands are signed
mulhu: multiply high unsigned
Gives the upper 64 bits of the product, assuming the operands are unsigned
mulhsu: multiply high signed/unsigned
Gives the upper 64 bits of the product, assuming one operand is signed and the other unsigned
Use mulh result to check for 64-bit overflow
74 UC Davis EEC 170, Winter 2021 / John Owens
RISC-V Support for multiply
If both the high and low bits of the same product are required, then the recommended code sequence is: MULH[[S]U] rdh, rs1, rs2; MUL rdl, rs1, rs2 (source register specifiers must be in same order andrdhcannot be the same asrs1or
rs2). Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies.
75 UC Davis EEC 170, Winter 2021 / John Owens
Multiplication Summary
Iterative algorithm Design techniques:
Analyze hardwarewhats not in use?
Spend more hardware to get higher performance
Booths Algorithmmore general (2s complement)
Booths Algorithmrecoding is powerful technique to think about problem in a different way
Booths Algorithmmore bits at once gives higher performance
76 UC Davis EEC 170, Winter 2021 / John Owens
Break
77 UC Davis EEC 170, Winter 2021 / John Owens
Administrivia
TA Trivikram coming at 11:30 to present the project Youre going to write 4 RISC-V procedures.
HW 3 released last Friday, due on Friday
Solutions will be outside my office, Kemper 3175
Midterm is week from today
Open book, open note
Bring a calculator
78 UC Davis EEC 170, Winter 2021 / John Owens
Is NVIDIA Doubling Down On RISC-V?
Betteridges law of headlines is an adage that states:
Any headline that ends in a question mark can be answered by the word no.
79 UC Davis EEC 170, Winter 2021 / John Owens
Is NVIDIA Doubling Down On RISC-V?
Six RISC-V positions have been advertised by NVIDIA, based in Shanghai and pertaining to architecture, design, and verification.
Due its light weight and extensibility, RISC-V is gaining mainstream adoption across many new sectors, including datacenter accelerators, mobile & wireless, automotive, and IoT.
In a 2017 RISC-V workshop in Shanghai, NVIDIA explained that shortcomings such as low performance and lack of caches and thread protection meant Falcons architecture could not meet growing complexity demands.
NVIDIA listed the technical criteria for its next-gen architecture: more than twice the performance of Falcon, less than twice the area cost of Falcon, support for caches, tightly coupled memories, 64-bit addresses, and suitability for modern operating systems. They concluded only RISC-V meets all criteria. The new RISC-V micro-controllers will outperform Falcon micro- controllers by three times, Toms Hardware has reported.
https://medium.com/syncedreview/is-nvidia-doubling-down-on-r8is0c-v-1ce714a919eb UC Davis EEC 170, Winter 2021 / John Owens
RISC-V Support for divide
4 instructions:
-{div, divu, rem, remu} rd, rs1, rs2
If both the quotient and remainder are required from the same division, the recommended code sequence is: DIV[U] rdq, rs1, rs2; REM[U] rdr,
rs1, rs2 (rdq cannot be the same as rs1 or rs2). Microarchitectures can then fuse these into a single divide operation instead of performing two separate divides.
Overflow and division-by-zero dont produce errors
Just return defined results
Faster for the common case of no error
div: rs1 / rs2, treat as signed
divu: rs1 / rs2, treat as unsigned
rem: rs1 mod rs2, treat as signed
remu: rs1 mod rs2, treat as unsigned
81 UC Davis EEC 170, Winter 2021 / John Owens
MIPS Support for multiply/divide
Rather than target the general-purpose registers:
hi
mul placed its output into two specialhiandloregisters div placed its divide output intoloand its rem output into
general-purpose register)
MIPS providedmfloandmfhiinstructions (destination:
82 UC Davis EEC 170, Winter 2021 / John Owens
Divide: Paper & Pencil
1001 Quotient Divisor 1000 1001010 Dividend
100100
101
1010
1000
10 Remainder
(or Modulo result)
See how big a number can be subtracted, creating quotient bit on each step
Binary => 1 * divisor or 0 * divisor Dividend = Quotient x Divisor + Remainder 3 versions of divide, successive refinement
83 UC Davis EEC 170, Winter 2021 / John Owens
Divide Hardware Version 1
64-bit Divisor reg, 64-bit ALU, 64-bit Remainder reg, 32-bit Quotient reg
Divisor
64 bits
Shift Right
64-bit ALU
Remainder
64 bits
Write
Quotient
32 bits
Control
Shift Left
84
UC Davis EEC 170, Winter 2021 / John Owens
Divide Algorithm V1
Takes n+1 steps for n-bit Quotient & Rem.
Start: Place Dividend in Remainder
1. Subtract the Divisor register from the Remainder register, and place the result in the Remainder register.
Remainder Quotient
0000 0111
Divisor Test Remainder 0 Remainder
0010 0000
Remainder < 0 00002b. Restore the original value by adding the Divisor register to the Remainder register, & place the sum in the Remainder register. Also shift the Quotient register to the left, setting the new least significant bit to 0.restoring division 2a. Shift the Quotient register to the left setting the new rightmost bit to 1.3. Shift the Divisor register right1 bit.n+1 repetition?No: < n+1 repetitionsYes: n+1 repetitions (n = 4 here) UC Davis EEC 170, Winter 2021 / John Owens Done 85Divide Algorithm I example (7 / 2)Remainder QuotientDivisor 0000 0111 1:1110 0111 000002:0000 0111 0000 3:0000 0111 0000 1:1111 0111 00002:0000 0111 000000000010 0000 0010 00000010 00000001 00000001 00000001 00000000 10000000 10000000 10000000 01000000 01000000 01000000 00100000 00100000 00100000 0001Answer: Quotient = 3 Remainder = 100 0 000 03:0000 0111 000 1:11111111 0002:0000 0111 0000 0003:0000 0111 00000 1:0000 0011 00000 0001 2:0000 0011 03:0000 0011 00001 0001 2:0000 0001 000113:0000 0001 000111:0000 0001 086UC Davis EEC 170, Winter 2021 / John Owens Divide: Paper & Pencil01010 QuotientDivisor 0001 00001010 Dividend 0000100010000 00010001000 Remainder (or Modulo result)- No way to get a 1 in leading digit!- (this is an overflow, i.e quotient would have n+1 bits) -can save 1 iteration switch order to shift first and then subtract,87 UC Davis EEC 170, Winter 2021 / John Owens Observations on Divide Version 1 1/2 bits in divisor always 0=> 1/2 of 64-bit adder is wasted
=> 1/2 of divisor is wasted
Instead of shifting divisor to right, shift remainder to left?
88 UC Davis EEC 170, Winter 2021 / John Owens
Divide Algorithm I example: wasted space
Remainder Quotient 0000 0111 00000
1:1110 011100000
2:0000 011100000
3:0000 011100000
1:1111 011100000
2:0000 011100000
3:0000 011100000
1:1111 111100000
2:0000 011100000
3:0000 011100000
1:0000 001100000
2:0000 001100001
3:0000 0011 0
Divisor 0010 0000
0010 0000
0010 0000
0001 0000
0001 0000
0001 0000
0000 1000
0000 1000
0000 1000
0000 0100
0000 0100
0000 0100
0000 0010
0000 0010
0000 0010
0001
1:0000 0001 0
0001
2:0000 0001 0
0011
3:0000 0001 00011 0000 0010
89 UC Davis EEC 170, Winter 2021 / John Owens
Divide Hardware Version 2
32-bit Divisor reg, 32-bit ALU, 64-bit Remainder reg, 32-bit Quotient reg
Divisor
32 bits
32-bit ALU
Remainder
64 bits
Quotient
32 bits
Control
Shift Left
Shift Left
Write
90
UC Davis EEC 170, Winter 2021 / John Owens
Divide Algorithm V2
Start: Place Dividend in Remainder
1. Shift the Remainder register left 1 bit.
Remainder Quotient Divisor
0000 0111 0000 0010
2. Subtract the Divisor register from the
left half of the Remainder register, & place the result in the left half of the Remainder register.
Remainder 0
Test Remainder
Remainder < 0 3a. Shift the Quotient register to the left setting the new rightmostbit to 1.3b. Restore the original value by adding the Divisor register to the left half of the Remainder register, &place the sum in the left half of the Remainder register. Also shift the Quotient register to the left, setting the new least significant bit to 0. nthrepetition?No: < n repetitions Yes: n repetitions (n = 4 here)UC Davis EEC 170, Winter 2021 / John Owens Done 91Observations on Divide Version 2 Eliminate Quotient register by combining with Remainder as shifted left- Start by shifting the Remainder left as before.- Thereafter loop contains only two steps because the shifting of the Remainder register shifts both the remainder in the left half and the quotient in the right half- The consequence of combining the two registers together and the new order of the operations in the loop is that the remainder will shifted left one time too many.- Thus the final correction step must shift back only the remainder in the left half of the register92 UC Davis EEC 170, Winter 2021 / John Owens Divide Hardware Version 3 32-bit Divisor reg, 32 -bit ALU, 64-bit Remainder reg, (0-bit Quotient reg)Divisor32 bits32-bit ALUHI LOShift LeftWriteRemainder(Quotient)Control64 bits93UC Davis EEC 170, Winter 2021 / John Owens Start: Place Dividend in Remainder Divide Algorithm V31. Shift the Remainder register left 1 bit.2. Subtract the Divisor register from the Remainder Divisorleft half of the Remainder register, & place the 0000 0111 00103a. Shift the Remainder register to the left setting the new rightmostresult in the left half of the Remainder register.Remainder 0 Test Remainder < 0 Remainder3b. Restore the original value by adding the Divisor register to the left half of the Remainder register, &place the sum in the left half of the Remainder register. Also shift the Remainder register to the left, setting the new least significant bit to 0. bit to 1.nth repetition?No: < n repetitionsYes: n repetitions (n = 4 here) Done. Shift left half of Remainder right 1 bit. UC Davis EEC 170, Winter 2021 / John Owens94Final Multiply / Divide HardwareMultiplicand 32 bits32-bit ALUProduct (Multiplier) 64 bitsDivisor32 bits32-bit ALUHI LOShift RightWriteControlRemainder(Quotient)Shift LeftWriteControl64 bits95UC Davis EEC 170, Winter 2021 / John Owens Observations on Divide Version 3 Same Hardware as Multiply: just need ALU to add or subtract, and 64-bit register to shift left or shift right Hi and Lo registers in MIPS combine to act as 64-bit register for multiply and divide Signed Divides: Simplest is to remember signs, make positive, and complement quotient and remainder if necessary- Note: Dividend and Remainder must have same sign- Note: Quotient negated if Divisor sign & Dividend sign disagree e.g., 7 2 = 3, remainder = 1- What about? 7 2 = 4, remainder = +1- See http://mathforum.org/library/drmath/view/52343.html Possible for quotient to be too large: if divide 64-bit integer by 1, quotient is 64 bits (called saturation)96 UC Davis EEC 170, Winter 2021 / John Owens SRT DivisionD. Sweeney of IBM, J.E. Robertson of the University of Illinois, and T.D. Tocher of Imperial College, London CurrentRemainderP-D (Partial Divisor) Plot9|9 43211111 8|8 42211110 7|7 32111100 6|6 32111000 5 | 5 2 1 1 1 0 00 0 4 | 4 2 1 1 0 0 00 0 3|3 11000000 2|2 10000000 1|1 00000000 0 +——————0 1 2 3 4 5 6 78 9 97 UC Davis EEC 170, Winter 2021 / John Owens SRT Division Intel Pentium divide implementation: SRT division with 2 bits/ iteration (radix 4) Allows negative entries 1066 entries in lookup table98UC Davis EEC 170, Winter 2021 / John Owens[http://members.cox.net/srice1/ pentbug/introduction.html] Faster Division Cant use parallel hardware as in multiplier- Subtraction is conditional on sign of remainder Faster dividers (e.g. SRT division) generate multiple quotient bits per step- Still require multiple steps99UC Davis EEC 170, Winter 2021 / John Owens Division Lessons In practice, slower than multiplication- Also less frequent- But, in the simple case, can use same hardware! Generates quotient and remainder together Floating-point division faster than integer division (why?) Similar hardware lessons as multiplier:- Look for unused hardware- Can process multiple bits at once at cost of extra hardware100 UC Davis EEC 170, Winter 2021 / John Owens End of lecture (1:30 in, to allow project101 UC Davis EEC 170, Winter 2021 / John Owens Lecture 7:Arithmetic 3/3John OwensIntroduction to Computer Architecture UC Davis EEC 170, Winter 2021RISC-V logical instructions Instruction Meaning Pseudocode XORI rd,rs1,imm SRAI rd,rs1,immExclusive Or Immediaterd ux(rs1) ux(imm) ORI rd,rs1,imm Or Immediaterd ux(rs1) ux(imm) ANDI rd,rs1,imm SLLI rd,rs1,immAnd Immediaterd ux(rs1) ux(imm)Shift Left Logical Immediaterd ux(rs1) ux(imm) SRLI rd,rs1,imm Shift Right Logical Immediaterd ux(rs1) ux(imm)SLL rd,rs1,rs2SRL rd,rs1,rs2Shift Right Arithmetic Immediaterd sx(rs1) ux(imm)Shift Left Logicalrd ux(rs1) rs2 XOR rd,rs1,rs2 Exclusive Orrd ux(rs1) ux(rs2)Shift Right Logicalrd ux(rs1) rs2 SRA rd,rs1,rs2 OR rd,rs1,rs2Shift Right Arithmeticrd sx(rs1) rs2Orrd ux(rs1) ux(rs2) AND rd,rs1,rs2 Andrd ux(rs1) ux(rs2)103 UC Davis EEC 170, Winter 2021 / John Owens Shift Operations Bit manipulation:- S EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM & 0 11111111 00000000000000000000000 ———————————— 0 EEEEEEEE 00000000000000000000000- Right shift 23 bits to get 000000000000000000000000 EEEEEEEE- Do arithmetic manipulation 000000000000000000000000 ENEWENEW- Left shift 23 bits to get0 ENEWENEW 00000000000000000000000104 UC Davis EEC 170, Winter 2021 / John Owens Shift Operations Arithmetic operation:- Example: 00011 << 2 [3 left shift 2]- 00011 << 2 = 01100 = 12 = 2 * 4- Each bit shifted left == multiply by two- Example: 01010 >> 1 [10 right shift 1]
01010 >> 1 = 00101 = 5 = 10/2
Each bit shifted right == divide by two
Why?
Compilers do thisstrength reduction
105 UC Davis EEC 170, Winter 2021 / John Owens
Shift Operations
With left shift, what do we shift in?
00011 << 2 = 01100 (arithmetic)- 0000XXXX << 4 = XXXX0000 (logical)- We shifted in zeroesHow about right shift?- XXXX0000 >> 4 = 0000XXXX (logical) Shifted in zero
00110 (= 6) >> 1 = 00011 (3) (arithmetic) Shifted in zero
11110 (= -2) >> 1 = 11111 (-1) (arithmetic) Shifted in one
106 UC Davis EEC 170, Winter 2021 / John Owens
Shift Operations
How about right shift?
XXXX0000 >> 4 = 0000XXXX: Logical shift
Shifted in zero
00110 (= 6) >> 1 = 00011 (3)
11110 (= -2) >> 1 = 11111 (-1): Arithmetic shift
Shifted in sign bit
RISC-V supports both logical and arithmetic:
slli, srai, srli: Shift amount taken from within instruction (imm)
6 bits 6 bits 5 bits 3 bits 5 bits
7 bits 5 bits 5 bits 3 bits 5 bits
sll, sra, srl: shift amount taken from register (variable)
How far can we shift with slli/srai/slli? With sll/sra/srl?
7 bits
7 bits
funct6
shamt
rs1
funct3
rd
opcode
funct7
rs2
rs1
funct3
rd
opcode
107
UC Davis EEC 170, Winter 2021 / John Owens
Combinational Shifter from MUXes
Basic Building Block AB
sel 1 0 D
8-bit right shifter
A7 A6 A5 A4 A3 A2 A1 A0 1010101010101010
1010101010101010
1010101010101010 R7 R6 R5 R4 R3 R2 R1 R0
S2 S1 S0
What comes in the MSBs?
How many levels for 64-bit shifter? What if we use 4-1 Muxes ?
108 UC Davis EEC 170, Winter 2021 / John Owens
General Shift Right Scheme using 16 bit example S 0 If we added right-to-left connections, we could supportROTATE
(0,1)
S1 (0,2)
S2 (0,4)
(not in RISC-V but found in other ISAs)
S3 (0,8)
109 UC Davis EEC 170, Winter 2021 / John Owens
Funnel Shifter Shift A by i bits
Problem: Set Y, X, sa Logical:
Arithmetic: Rotate:
Left shifts:
Extract 64 bits of 128 (sa selects which bits)
Y
Y
64
X
64
64 R
Shift Right
X
| sa |
R
110
UC Davis EEC 170, Winter 2021 / John Owens
Barrel Shifter
Technology-dependent solutions: transistor per switch
in6
in5
in4
out3
out2
out1
out0
SR3
SR2 SR1 SR0
in3
in2 in1 in0
111
UC Davis EEC 170, Winter 2021 / John Owens
Shifter Summary
Shifts common in logical ops, also in arithmetic RISC-V (oops) has:
2 flavors of shift: logical and arithmetic
2 directions of shift: right and left
2 sources for shift amount: immediate, variable
Lots of cool shift algorithms, but
Barrel shifter prevalent in todays hardware
112 UC Davis EEC 170, Winter 2021 / John Owens
ating Point
Floating Point
Representation for non-integral numbers
Including very small and very large numbers
Like scientific notation
2.34 1056
+0.002 104
+987.02 109 In binary
1.xxxxxxx2 2yyyy
Typesfloatanddoublein C
normalized
not normalized
113
UC Davis EEC 170, Winter 2021 / John Owens
Floating Point Standard
Defined by IEEE Std 754-1985
Developed in response to divergence of representations
Portability issues for scientific code Now almost universally adopted
Two representations
Single precision (32-bit)
Double precision (64-bit)
114 UC Davis EEC 170, Winter 2021 / John Owens
115 UC Davis EEC 170, Winter 2021 / John Owens
Floating-point Formats
Single-precision (32 bits)
Double precision (64 bits)
116 UC Davis EEC 170, Winter 2021 / John Owens
IEEE Floating-Point Format
x = (1)S (1+Fraction)2(ExponentBias)
S: sign bit (0 non-negative, 1 negative)
Normalize significand: 1.0 |significand| < 2.0- Always has a leading pre-binary-point 1 bit, so no need torepresent it explicitly (hidden bit)- Significand is Fraction with the 1. restoredFraction: single: 23 bits double: 52 bits Exponent: excess representation: actual exponent + Bias- Ensures exponent is unsigned- Single: Bias = 127; Double: Bias = 1023117 UC Davis EEC 170, Winter 2021 / John OwensExponent: single: 8 bits double: 11 bits Single-Precision Range Exponents 00000000 and 11111111 reserved Smallest value- Exponent: 00000001 actual exponent = 1 127 = 126- Fraction: 000…00 significand = 1.0- 1.0 2126 1.2 1038 Largest value- exponent: 11111110 actual exponent = 254 127 = +127- Fraction: 111…11 significand 2.0- 2.0 2+127 3.4 10+38118UC Davis EEC 170, Winter 2021 / John Owens Double-Precision Range Exponents 0000…00 and 1111…11 reserved Smallest value- Exponent: 00000000001 actual exponent = 1 1023 = 1022- Fraction: 000…00 significand = 1.0- 1.0 21022 2.2 10308 Largest value- Exponent: 11111111110 actual exponent = 2046 1023 = +1023- Fraction: 111…11 significand 2.0- 2.0 2+1023 1.8 10+308119UC Davis EEC 170, Winter 2021 / John Owens Floating-Point Precision Relative precision- all fraction bits are significant- Single: approx 223- Equivalent to 23 log102 23 0.3 6 decimal digits ofprecision- Double: approx 252- Equivalent to 52 log102 52 0.3 16 decimal digits of precision120UC Davis EEC 170, Winter 2021 / John Owens Floating-Point Example Represent 0.75- -0.75 = (1)1 1.12 21 – S =1Fraction = 1000…002 – Exponent = 1 + Bias- Single: 1 + 127 = 126 = 011111102- Double: 1 + 1023 = 1022 = 011111111102 Single: 1011111101000…00 Double: 1011111111101000…00121UC Davis EEC 170, Winter 2021 / John Owens Floating-Point Example What number is represented by the single-precision float 11000000101000…00- S =1–Fraction = 01000…002Fxponent = 100000012 = 129 x = (1)1 (1 + .012) 2(129 127)= (1) 1.25 22 = 5.0122UC Davis EEC 170, Winter 2021 / John Owens Denormal Numbers Exponent = 000…0 hidden bit is 0x=(1)S (0+Fraction)2Bias Smaller than normal numbers- allow for gradual underflow, with diminishing precision Denormal with fraction = 000…0x=(1)S (0+0)2Bias =0.0Two representations of 0.0!123UC Davis EEC 170, Winter 2021 / John Owens Infinities and NaNs Exponent = 111…1, Fraction = 000…0- Infinity- Can be used in subsequent calculations, avoiding need for overflow check Exponent = 111…1, Fraction = 000…0- Not-a-Number (NaN)- Indicates illegal or undefined result- e.g., 0.0 / 0.0- Can be used in subsequent calculations124UC Davis EEC 170, Winter 2021 / John Owens Floating-Point Addition Consider a 4-digit decimal example – 9.999 101 + 1.610 101 1. Align decimal points- Shift number with smaller exponent- 9.999 101 + 0.016 101 2. Add significands- 9.999 101 + 0.016 101 = 10.015 101 3. Normalize result & check for over/underflow – 1.0015 102 4. Round and renormalize if necessary – 1.002 102125UC Davis EEC 170, Winter 2021 / John Owens Floating-Point Addition Now consider a 4-digit binary example- 1.0002 21 + 1.1102 22 (0.5 + 0.4375) 1. Align binary points- Shift number with smaller exponent- 1.0002 21 + 0.1112 21 2. Add significands- 1.0002 21 + 0.1112 21 = 0.0012 21 3. Normalize result & check for over/underflow – 1.0002 24, with no over/underflow 4. Round and renormalize if necessary – 1.0002 24 (no change) = 0.0625126UC Davis EEC 170, Winter 2021 / John Owens FP Adder Hardware Much more complex than integer adder Doing it in one clock cycle would take too long- Much longer than integer operations- Slower clock would penalize all instructions FP adder usually takes several cycles- Can be pipelined127 UC Davis EEC 170, Winter 2021 / John Owens FP Adder Hardware Step 1Step 2Step 3 Step 4128UC Davis EEC 170, Winter 2021 / John Owens Floating-Point Multiplication Consider a 4-digit decimal example- 1.110 1010 9.200 105 1. Add exponents- For biased exponents, subtract bias from sum- New exponent = 10 + 5 = 5 2. Multiply significands- 1.110 9.200 = 10.212 10.212 105 3. Normalize result & check for over/underflow – 1.0212 106 4. Round and renormalize if necessary – 1.021 106 5. Determine sign of result from signs of operands – +1.021 106129UC Davis EEC 170, Winter 2021 / John Owens Floating-Point Multiplication Now consider a 4-digit binary example- 1.0002 21 1.1102 22 (0.5 0.4375) 1. Add exponents- Unbiased: 1 + 2 = 3- Biased: (1 + 127) + (2 + 127) = 3 + 254 127 = 3 + 127 2. Multiply significands- 1.0002 1.1102 = 1.1102 1.1102 23 3. Normalize result & check for over/underflow – 1.1102 23 (no change) with no over/underflow 4. Round and renormalize if necessary – 1.1102 23 (no change)5. Determine sign: +ve ve ve – 1.1102 23 = 0.21875130UC Davis EEC 170, Winter 2021 / John Owens FP Arithmetic Hardware FP multiplier is of similar complexity to FP adder- But uses a multiplier for significands instead of an adder FP arithmetic hardware usually does- Addition, subtraction, multiplication, division, reciprocal,square-rootFP integer conversion Operations usually takes several cycles – Can be pipelined-131 UC Davis EEC 170, Winter 2021 / John Owens FP Instructions in RISC-V Separate FP registers: f0, …, f31- double-precision- single-precision values stored in the lower 32 bits FP instructions operate only on FP registers- Programs generally dont do integer ops on FP data, or viceversa- More registers with minimal code-size impact FP load and store instructions- -flw, fld fsw, fsd132 UC Davis EEC 170, Winter 2021 / John Owens FP Instructions in RISC-V Single-precision arithmetic-fadd.s, fsub.s, fmul.s, fdiv.s, fsqrt.s- Double-precision arithmetice.g.,fadds.s f2, f4, f6-fadd.d, fsub.d, fmul.d, fdiv.d, fsqrt.d- Single- and double-precision comparison-feq.s, flt.s, fle.s -feq.d, flt.d, fle.d- Result is 0 or 1 in integer destination register- Use beq, bne to branch on comparison result Branch on FP condition code true or false -B.conde.g.,fadd.d f2, f4, f6133UC Davis EEC 170, Winter 2021 / John Owens FP Example: F to C C code: float f2c (float fahr) { return ((5.0/9.0)*(fahr – 32.0));}- fahr in f10, result in f10, literals in global memory space Compiled RISC-V code: f2c: flwf0,const5(x3)// f0 = 5.0f flwf1,const9(x3)// f1 = 9.0f fdiv.s f0, f0, f1// f0 = 5.0f / 9.0f flwf1,const32(x3) // f1 = 32.0f fsub.s f10,f10,f1// f10 = fahr – 32.0 fmul.s f10,f0,f10// f10 = (5.0f/9.0f) * (fahr-32.0f) jalr x0,0(x1)// return134 UC Davis EEC 170, Winter 2021 / John Owens FP Example: Array Multiplication C = C +A B- All 32 32 matrices, 64-bit double-precision elements C code: void mm (double c[][],double a[][], double b[][]) {size_t i, j, k;for (i = 0; i < 32; i = i + 1)for (j = 0; j < 32; j = j + 1)for (k = 0; k < 32; k = k + 1)}c[i][j] = c[i][j]+ a[i][k] * b[k][j];-i,j,kin x5, x6, x7Addresses ofc,a,bin x10, x11, x12, and135 UC Davis EEC 170, Winter 2021 / John Owens FP Example: Array MultiplicationRISC-V code:mm:…lix28,32 lix5,0L1:lix6,0L2:lix7,0 // x28 = 32 (row size/loop end)// i = 0; initialize 1st for loop// j = 0; initialize 2nd for loop// k = 0; initialize 3rd for loop// x30 = i * 2**5 (size of row of c)sllix30,x5,5 add x30,x30,x6 // x30 = i * size(row) + j sllix30,x30,3// x30 = byte offset of [i][j]add x30,x10,x30// x30 = byte address of c[i][j] fld f0,0(x30)// f0 = c[i][j]L3:sllix29,x7,5 // x29 = k * 2**5 (size of row of b)add x29,x29,x6 // x29 = k * size(row) + jsllix29,x29,3// x29 = byte offset of [k][j] add x29,x12,x29// x29 = byte address of b[k][j] fld f1,0(x29)// f1 = b[k][j]136UC Davis EEC 170, Winter 2021 / John Owens FP Example: Array Multiplication…slli x29,x5,5 // x29 = i * 2**5 (size of row of a)add slliadd x29,x29,x7 // x29 = i * size(row) + k x29,x29,3// x29 = byte offset of [i][k]x29,x11,x29// x29 = byte address of a[i][k]fldf2,0(x29)// f2 = a[i][k] fmul.d f1, f2, f1 // f1 = a[i][k] * b[k][j]fadd.d f0, f0, f1 // f0 = c[i][j] + a[i][k] * b[k][j]addi x7,x7,1// k = k + 1bltu x7,x28,L3// if (k < 32) go to L3fsdf0,0(x30)// c[i][j] = f0bltu bltuaddi x6,x6,1 x6,x28,L2addi x5,x5,1 x5,x28,L1// j = j + 1// if (j < 32) go to L2// i = i + 1// if (i < 32) go to L1137UC Davis EEC 170, Winter 2021 / John Owens Accurate Arithmetic IEEE Std 754 specifies additional rounding control- Extra bits of precision (guard, round, sticky)- Choice of rounding modes- Allows programmer to fine-tune numerical behavior of a computation Not all FP units implement all options- Most programming languages and FP libraries just usedefaults Trade-off between hardware complexity, performance, and market requirements138 UC Davis EEC 170, Winter 2021 / John OwensSubword Parallellism Graphics and audio applications can take advantage of performing simultaneous operations on short vectors- Example: 128-bit adder:- Sixteen 8-bit adds- Eight 16-bit adds- Four 32-bit adds Also called data-level parallelism, vector parallelism, or Single Instruction, Multiple Data (SIMD)139 UC Davis EEC 170, Winter 2021 / John Owens3.6 Parallelism and Computer Arithmetic: Subword Parallelismx86 FP Architecture Originally based on 8087 FP coprocessor- 8 80-bit extended-precision registers- Used as a push-down stack- Registers indexed from TOS: ST(0), ST(1), … FP values are 32-bit or 64 in memory- Converted on load/store of memory operand- Integer operands can also be converted on load/store Very difficult to generate and optimize code- Result: poor FP performance140 UC Davis EEC 170, Winter 2021 / John Owens3.7 Real Stuff: Streaming SIMD Extensions and AVX in x86 x86 FP Instructions Optional variationsData transfer Arithmetic- I: integer operand FIADDP mem/ST(i) – P: pop operand from stackFILD mem/ST(i)FISTP mem/ST(i) FIMULP mem/ST(i)FISUBRP mem/ST(i) – R: reverse opeFraIDnIdVRoPrdmerm/ST(i) FICOMPFIUCOMPFSTSW AX/memFLDPIFLD1 FSQRT- But not all combinations allowedFRNDINTFLDZFABSCompareTranscendentalFPATANF2XMI FCOSFPTANFPREMFPSINFYL2X 141UC Davis EEC 170, Winter 2021 / John Owens Streaming SIMD Extension 2 (SSE2) Adds 4 128-bit registers- Extended to 8 registers in AMD64/EM64T Can be used for multiple FP operands- 2 64-bit double precision- 4 32-bit double precision- Instructions operate on them simultaneously- Single-Instruction Multiple-Data142UC Davis EEC 170, Winter 2021 / John OwensMatrix Multiply Unoptimized code:1. void dgemm (int n, double* A, double* B, double* C)2. {3.for (int i = 0; i < n; ++i)4.for (int j = 0; j < n; ++j)5. {6. double cij = C[i+j*n]; /* cij = C[i][j] */7. for (int k = 0; k < n; k++)8. cij += A[i+k*n] * B[k+j*n]; /* cij += A[i][k]*B[k][j] */9. C[i+j*n] = cij; /* C[i][j] = cij */10. } 11. }143UC Davis EEC 170, Winter 2021 / John Owens3.8 Going Faster: Subword Parallelism and Matrix Multiply Matrix Multiply x86 assembly code:1. vmovsd (%r10),%xmm0# Load 1 element of C into %xmm02. mov %rsi,%rcx# register %rcx = %rsi3. xor %eax,%eax# register %eax = 04. vmovsd (%rcx),%xmm1# Load 1 element of B into %xmm15. add %r9,%rcx # register %rcx = %rcx + %r96. vmulsd (%r8,%rax,8),%xmm1,%xmm1 # Multiply %xmm1, element of A7. add $0x1,%rax# register %rax = %rax + 18. cmp %eax,%edi# compare %eax to %edi9. vaddsd %xmm1,%xmm0,%xmm0 # Add %xmm1, %xmm010. jg 30
11. add $0x1,%r11d# register %r11 = %r11 + 1
12. vmovsd %xmm0,(%r10) # Store %xmm0 into C element
144
UC Davis EEC 170, Winter 2021 / John Owens
Matrix Multiply Optimized C code:
1. #include
2. void dgemm (int n, double* A, double* B, double* C)
3. {
4.
5.
6.
7.
8.
9.
10.
11.
12.}
13. }
for ( int i = 0; i < n; i+=4 ) for ( int j = 0; j < n; j++ ) {__m256d c0 = _mm256_load_pd(C+i+j*n); /* c0 = C[i][j] */for( int k = 0; k < n; k++ ) c0 = _mm256_add_pd(c0, /* c0 += A[i][k]*B[k][j] */_mm256_mul_pd(_mm256_load_pd(A+i+k*n),_mm256_broadcast_sd(B+k+j*n)));_mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */145UC Davis EEC 170, Winter 2021 / John Owens Matrix Multiply Optimized x86 assembly code:1. vmovapd (%r11),%ymm02. mov %rbx,%rcx3. xor %eax,%eax4. vbroadcastsd (%rax,%r8,1),%ymm1 # Make 4 copies of B element 5. add $0x8,%rax # register %rax = %rax +810. jne 50
11. add $0x1,%esi
12. vmovapd %ymm0,(%r11)
# jump if not %r10 != %rax
# register % esi = % esi + 1
# Store %ymm0 into 4 C elements
# Load 4 elements of C into %ymm0
# register %rcx = %rbx
# register %eax = 0
6. vmulpd (%rcx),%ymm1,%ymm1 # Parallel mul %ymm1,4 A elements
7. add %r9,%rcx# register %rcx = %rcx + %r9
8. cmp %r10,%rax # compare %r10 to %rax
9. vaddpd %ymm1,%ymm0,%ymm0# Parallel add %ymm1, %ymm0
146
UC Davis EEC 170, Winter 2021 / John Owens
es and Pitfalls
Right Shift and Division
Left shift by i places multiplies an integer by 2i
Right shift divides by 2i?
Only for unsigned integers
For signed integers
Arithmetic right shift: replicate the sign bit
e.g., 5 / 4
111110112 >> 2 = 111111102 = 2
Rounds toward
c.f. 111110112 >>> 2 = 001111102 = +62
147
UC Davis EEC 170, Winter 2021 / John Owens
Associativity
Parallel programs may interleave operations in unexpected orders
Assumptions of associativity may fail
(x+y)+z
x+(y+z)
x
-1.50E+38
0.00E+00
-1.50E+38
y
1.50E+38
1.50E+38
z
1.0
1.0
1.00E+00
0.00E+00
Need to validate parallel programs under varying degrees of parallelism
148 UC Davis EEC 170, Winter 2021 / John Owens
Who Cares About FP Accuracy?
Important for scientific code
But for everyday consumer use?
The Intel Pentium FDIV bug
The market expects accuracy
See Colwell, The Pentium Chronicles
My bank balance is out by 0.0002!
149 UC Davis EEC 170, Winter 2021 / John Owens
Concluding Remarks
Bits have no inherent meaning
Interpretation depends on the instructions applied
Computer representations of numbers
Finite range and precision
Need to account for this in programs
ISAs support arithmetic
Signed and unsigned integers
Floating-point approximation to reals
Bounded range and precision
Operations can overflow and underflow
150 UC Davis EEC 170, Winter 2021 / John Owens
3.10 Concluding Remarks
Break
151 UC Davis EEC 170, Winter 2021 / John Owens
Administrivia
My office hour is in the Coffee House, not the Silo. The Silo is closed for construction.
Maybe oops: https://en.wikipedia.org/wiki/ Kahan_summation_algorithm
Tell me about errors in slides / in homework
Anyone try RARS?
Midterm philosophy
No typing on the midterm
Bring a calculator
152 UC Davis EEC 170, Winter 2021 / John Owens
Problem: Design a fast ALU for the RISC-V ISA
Requirements?
Must support the Arithmetic / Logic operations
Tradeoffs of cost and speed based on frequency of occurrence, hardware budget
153 UC Davis EEC 170, Winter 2021 / John Owens
RISC-V ALU requirements
Add, Sub, AddI, AddI
=> 2s complement adder/sub
And, Or, AndI, OrI, Xor, Xori
=> Logical AND, logical OR, XOR
SLTI, SLTIU (set less than)
=> 2s complement adder with inverter, check sign bit of
result
See ALU from COD5E, appendix A.5
154 UC Davis EEC 170, Winter 2021 / John Owens
MIPS arithmetic instruction format I-format:
immediate
rs1
funct3
rd
opcode
12 bits
R-format: SLTI
SLTIU ANDI ORI XORI
5 bits
3 bits
5 bits
7 bits
opcode
Type op funct
ADD 00 0110011 | 000 | 0000000
Type op 0010011
SUB
5 bits
funct3
rd
7 bits
5 bits
AND OR XOR SLT SLTU
3 bits
5 bits
7 bits
ADDI 0010011 | 000
funct7 rs2 rs1
0010011
0010011
0010011
0010011
00 00 00 00 00
0110011 | 0110011 | 0110011 | 0110011 | 0110011 |
111 | 0000000 110 | 0000000 100 | 0000000 010 | 0000000 011 | 0000000
| 010 | 011 | 111 | 110 | 100
00
0110011 | 000 | 0100000
155
UC Davis EEC 170, Winter 2021 / John Owens
Design Trick: divide & conquer
Trick: Break the problem into simpler problems, solve them and glue together the solution
Example: assume the immediates have been taken care of before ADDI 0010011 | 000 ADD 00 0110011 | 000 | 0000000
SUB 00 0110011 | 000 | 0100000
Type op Type op funct
the ALU
ANDI ORI XORI SLTI SLTIU
0010011 | 111 AND 00 0110011 | 111 | 0000000
7 operations (could be 3 bits, but really 4)
0010011 | 110 0010011 | 100 0010011 | 010 0010011 | 011
OR 00 XOR 00 SLT 00 SLTU 00
0110011 | 110 | 0000000 0110011 | 100 | 0000000 0110011 | 010 | 0000000 0110011 | 011 | 0000000
156 UC Davis EEC 170, Winter 2021 / John Owens
Lets Build a ALU
Functional Specification:
inputs: 2 x 32-bit operands A, B, 4-bit OPeration
outputs: 32-bit result S, 1-bit carry, 1 bit overflow
operations: add, sub, and, or, xor, slt, sltu
32 32 AB
c ALU OP ovf
S
32
4
157
UC Davis EEC 170, Winter 2021 / John Owens
We already know how to do add/sub
A1
B1
+
A B
Cin
A B
Cin
Cout Full
Sum S1
Adder
A0
A + ~B + 1
+
+
Cout Full
Sum S0
S=
then add 1
B
0
Adder
0
1
158
UC Davis EEC 170, Winter 2021 / John Owens
Control for +/-
One bit controls three muxes. This is a control point.
A1
B1
+
A B
Cin
A B
Cin
Cout Full
Sum S1
Adder
A0
+
Cout Full
Sum S0
How do we set this control point for add? subtract?
B0
Adder
0
+
1
159
UC Davis EEC 170, Winter 2021 / John Owens
160 UC Davis EEC 170, Winter 2021 / John Owens
AND and OR
Consider ALU that supports two functions, AND and OR
How do we do this?
A
and
or
B
161 UC Davis EEC 170, Winter 2021 / John Owens
AND and OR
A
Combinational logic:
Control bit OP is 0 for AND, 1 for OR
A
B
OP
OUT
0
0
0
0
0
0
1
0
0
1
0
0
0
1
1
1
1
0
0
0
1
0
1
1
1
1
0
1
1
1
1
1
B
Hard with lots OP of functions! But lets do it anyway.
and
or
A
OP
B
162 UC Davis EEC 170, Winter 2021 / John Owens
Function Inputs Outputs K-Map
127
M0 M1 M2 M3 A B Cin S Cout
add 0000000 00
7-to-2 Combinational Logic
Start turning the crank . . . 0
163 UC Davis EEC 170, Winter 2021 / John Owens
AND and OR
using a mux
Instead, generate several functions and use control bits to select
and
OP
0 1
Not easy to decide the best way to build something
Dont want too many inputs to a single gate
Dont want to have to go through too many gates
For our purposes, ease of comprehension is important
A
B
or
164 UC Davis EEC 170, Winter 2021 / John Owens
Supporting More Functions
With the mux approach, its easy to add other functions
Like add
To add more, just enlarge mux
Control signals in mux, not datapath
OP
0 1
and
165
UC Davis EEC 170, Winter 2021 / John Owens
A
B
or
Supporting More Functions
With the mux approach, its easy to add other functions Like add
OP
To add more, just enlarge mux
Control signals in mux, not datapath
0
1
2
A
B
A
B Cin
FA
Cout Sum
166
UC Davis EEC 170, Winter 2021 / John Owens
Supporting More Functions
With the mux approach, its easy to add other functions
Like add
To add more, just enlarge mux
(or put more muxes elsewhere)
Control signals in muxes, not datapath
OP
OP
A
B
A +B
Cin
+
FA
Cout Sum
0
1
2
0
1
167
UC Davis EEC 170, Winter 2021 / John Owens
Tailoring the ALU to RISC-V
Need to support the set-on-less-than instruction (slt)
slt produces a 1 if rs < rt and 0 otherwise- use subtraction: (a-b) < 0 implies a < b- So now weve got a-b as our result. How does this translate to slt operation? What do we have to test and where does it go?- We test- To produce the proper result, it goes in168 UC Davis EEC 170, Winter 2021 / John Owens Tailoring the ALU to RISC-V Need to support the set-on-less-than instruction (slt)- slt produces a 1 if rs < rt and 0 otherwise- use subtraction: (a-b) < 0 implies a < b- So now weve got a-b as our result. How does this translate to slt operation? What do we have to test and where does it go?- We testthe highest (sign) bit (S[31])- To produce the proper result, it goes inthe lowest bit (S[0])169 UC Davis EEC 170, Winter 2021 / John Owens Why do you think the MIPS-V designers170 UC Davis EEC 170, Winter 2021 / John Owens Tailoring the ALU to the MIPS Need to support test for equality (beq $t5, $t6, LABEL)- use subtraction: (a-b) = 0 implies a = b – How do we test if the product is zero?171 UC Davis EEC 170, Winter 2021 / John Owens Original Diagram: bit-slice ALUA32 B32a31 b31a0m ALU0b0 mcin4 M ALU31cocincos0s31 32 S172 UC Davis EEC 170, Winter 2021 / John Owens Revised Diagram LSB and MSB need to do a little extraA 32 B 32a31 b31ALU0 cos31 cina0 b0ALU0co s0 cin4produce carry-in (for add/ subtract), etc. ?MC/L toOverflow(but not in RISC-V)32 S173UC Davis EEC 170, Winter 2021 / John Owensslt sign bit Behavioral Representation: Verilogmodule ALU(A, B, m, S, c, ovf);input [0:31] A, B;input [0:3] m;output [0:31] S;output c, ovf;reg [0:31] S;reg c, ovf;always @(A, B, m) begin case (m)0: S = A + B;1: S = A – B;2: S = …… endendmodule32 32 ABc ALU m ovfS324174UC Davis EEC 170, Winter 2021 / John Owens Conclusion We can build an ALU to support the RISC-V instruction set- key idea: use multiplexor to select the output we want- we can efficiently perform subtraction using twos complement- we can replicate a 1-bit ALU to produce a 32-bit ALU Important points about hardware- all of the gates are always working- the speed of a gate is affected by the number of inputs to the gate- the speed of a circuit is affected by the number of gates in series(on the critical path or the deepest level of logic)175 UC Davis EEC 170, Winter 2021 / John Owens MIPS Opcode Map[from MIPS R4000 Microprocessor Users Manual / Joe Heinrich]176 UC Davis EEC 170, Winter 2021 / John Owens Encodings for ADD, SUBOPOPBig picture of what were doing here: understand tie btwn ISA and hardwareA BA +B- Cin+ -Cout Sum0120 1OP+ -0FA1177UC Davis EEC 170, Winter 2021 / John Owens MIPS Opcode Map [from MIPS R4000 Microprocessor Users Manual / Joe Heinrich]178 UC Davis EEC 170, Winter 2021 / John Owens MIPS (really) Encodings for AOPOP AB A +B- CinFACout Sum012OP0 1d bit do? Green? Blue? – +0+ -1 179 UC Davis EEC 170, Winter 2021 / John Owens ADD 00 100000 (040)ADDU 00 100001 (041) D, SUB, SLTSUB 00 SUBU 00*? 00*? 00SLT 00 SLTU 00100010 (042) 100011 (043) 101000 (050) 101001 (051) 101010 (052) 101011 (053)eD MIPS Opcode Map [from MIPS R4000 Microprocessor Users Manual / Joe Heinrich]180 UC Davis EEC 170, Winter 2021 / John Owens MIPS arithmetic instruction format Type op ADDI 001000 (010) I-Type:op Rs31 25 20 15 0ADDIU 001001 (011) RtImmed 16SLTI 001010 (012) SLTIU 001011 (013) ANDI 001100 (014) ORI 001101 (015) XORI 001110 (016) LUI 001111 (017)Perhaps it is surprising that addiu and sltiu alsoWhat does the red bit do?sign-extend their immediates, but they do. The ustands for unsigned, but in reality addiu is often used simply as an add instruction that cannot overflow, and hence we often want to add negative numbers. It’s much harder to come up with an excuse for why sltiu sign extends its immediate field. COD2E p. 230181 UC Davis EEC 170, Winter 2021 / John Owens MIPS Opcode Map [from MIPS R4000 Microprocessor Users Manual / Joe Heinrich]182 UC Davis EEC 170, Winter 2021 / John Owens MIPS arithmetic instruction formatType op functSLL 00R-type:*00 SRL 00 SRA 00 SLLV 00 * 00 SRLV 00 SRAV 0031 25 20 15 10 5 0000000 (000) 000001 (001) Rt Rd shamt functop Rs000010 (002)What does the red bit do?000011 (003) 000100 (004) 000101 (005) 000110 (006) 000111 (007)Green? Blue?* instructions?183 UC Davis EEC 170, Winter 2021 / John Owens MIPS Opcode Map[from MIPS R4000 Microprocessor Users Manual / Joe Heinrich]184 UC Davis EEC 170, Winter 2021 / John Owens MIPS arithmetic instruction format R-type: I-Type:Type op31 25 20 155 0 opADDI ADDIURsRt Rd functADD 000000 100000 (040) SLTI SLTIU ANDI ORI XORI LUI001010 (012) 001011 (013) 001100 (014) 001101 (015) 001110 (016) 001111 (017)op001000 (010) 001001 (011)RsRtType op funct ADDU 000000 100001 (041)Immed 16SUB 000000 100010 (042) SUBU 000000 100011 (043) AND 000000 100100 (044) OR 000000 100101 (045) XOR 000000 100110 (046) NOR 000000 100111 (047)185UC Davis EEC 170, Winter 2021 / John Owens
Reviews
There are no reviews yet.