CSE 371 Computer Organization and Design
CIS 501: Comp Arch|Dr.| Arithmetic
Computer Organization and Design
Copyright By Assignmentchef assignmentchef
Unit 3: Arithmetic
Based on slides by Profs. & & C.J.IS 501: Comp Arch|Dr.| Arithmetic
This Unit: Arithmetic
A little review
Binary + 2s complement
Ripple-carry addition (RCA)
Fast integer addition
Carry-select (CSeA)
Carry Lookahead Adder (CLA)
Integer multiplication and division
Floating point arithmetic
System software
CIS 501: Comp Arch|Dr.| Arithmetic
CIS 501: Comp Arch|Dr.| Arithmetic
The Importance of Fast Arithmetic
Addition of two numbers is most common operation
Programs use addition frequently
Loads and stores use addition for address calculation
Branches use addition to test conditions and calculate targets
All insns use addition to calculate default next PC
Fast addition is critical to high performance
CIS 501: Comp Arch|Dr.| Arithmetic
Review: Binary Integers
Computers represent integers in binary (base2)
3 = 11, 4 = 100, 5 = 101, 30 = 11110
Natural since only two values are represented
Addition, etc. take place as usual (carry the 1, etc.)
17 =10001
+5 =101
22 =10110
Some old machines use decimal (base10) with only 0/1
30 = 011 000
Often called BCD (binary-coded decimal)
Unnatural for digital logic, implementation complicated & slow
More precise for currency operations
CIS 501: Comp Arch|Dr.| Arithmetic
Fixed Width
On pencil and paper, integers have infinite width
In hardware, integers have fixed width
N bits: 16, 32 or 64
LSB is 20, MSB is 2N-1
Range: 0 to 2N1
Numbers >2N represented using multiple fixed-width integers
In software (e.g., Java BigInteger class)
CIS 501: Comp Arch|Dr.| Arithmetic
What About Negative Integers?
Sign/magnitude
Unsigned plus one bit for sign
10 = 000001010, -10 = 100001010
Matches our intuition from by hand decimal arithmetic
representations of both +0 and 0
Addition is difficult
symmetric range: (2N-11) to 2N-11
Option II: twos complement (2C)
Leading 0s mean positive number, leading 1s negative
10 = 00001010, -10 = 11110110
One representation for 0 (all zeroes)
Easy addition
asymmetric range: (2N-1) to 2N-11
CIS 501: Comp Arch|Dr.| Arithmetic
The Tao of 2C
How did 2C come about?
Lets design a representation that makes addition easy
Think of subtracting 10 from 0 by hand with 8-bit numbers
Have to borrow 1s from some imaginary leading 1
0 = 100000000
-10 =00000010
-10 = 011111110
Now, add the conventional way
-10 =11111110
+10 =00000010
0 = 100000000
CIS 501: Comp Arch|Dr.| Arithmetic
Still More On 2C
What is the interpretation of 2C?
Same as binary, except MSB represents 2N1, not 2N1
10 = 11110110 = 27+26+25+24+22+21
Extends to any width
10 = 110110 = 25+24+22+21
Why? 2N = 2*2N1
25+24+22+21 = (26+2*25)25+24+22+21 = 26+25+24+22+21
Equivalent to computing modulo 2N
Trick to negating a number quickly: B = B + 1
(1) = (0001)+1= 1110+1 = 1111 = 1
(1) = (1111)+1= 0000+1 = 0001 = 1
(0) = (0000)+1= 1111+1 = 0000 = 0
Think about why this works
CIS 501: Comp Arch|Dr.| Arithmetic
CIS 501: Comp Arch|Dr.| Arithmetic
1st Grade: Decimal Addition
Repeat N times
Add least significant digits and any overflow from previous add
Carry overflow to next addition
Overflow: any digit other than least significant of sum
Shift two addends and sum one digit to the right
Sum of two N-digit numbers can yield an N+1 digit number
Why N+1, why not 2N?
CIS 501: Comp Arch|Dr.| Arithmetic
Binary Addition: Works the Same Way
1 111111
43 = 00101011
+29 = 00011101
72 = 01001000
Repeat N times
Add least significant bits and any overflow from previous add
Carry the overflow to next addition
Shift two addends and sum one bit to the right
Sum of two N-bit numbers can yield an N+1 bit number
More steps (smaller base)
Each one is simpler (adding just 1 and 0)
So simple we can do it in hardware
CIS 501: Comp Arch|Dr.| Arithmetic
The Half Adder
How to add two binary integers in hardware?
Start with adding two bits
When all else fails look at truth table
A B = C0 S
0 0 =0 0
0 1 =0 1
1 0 =0 1
1 1 =1 0
CO (carry out) = AB
This is called a half adder
CIS 501: Comp Arch|Dr.| Arithmetic
The Other Half
We could chain half adders together, but to do that
Need to incorporate a carry out from previous adder
C A B = C0 S
0 0 0 =0 0
0 0 1 =0 1
0 1 0 =0 1
0 1 1 =1 0
1 0 0 =0 1
1 0 1 =1 0
1 1 0 =1 0
1 1 1 =1 1
S = CAB | CAB | CAB | CAB = C ^ A ^ B
CO = CAB | CAB | CAB | CAB = CA | CB | AB
This is called a full adder
CIS 501: Comp Arch|Dr.| Arithmetic
Ripple-Carry Adder
N-bit ripple-carry adder
N 1-bit full adders chained together
CO0 = CI1, CO1 = CI2, etc.
CON1 is carry-out of entire adder
CON1 = 1 overflow
Example: 16-bit ripple carry adder
How fast is this?
How fast is an N-bit ripple-carry adder?
CIS 501: Comp Arch|Dr.| Arithmetic
Quantifying Adder Delay
Combinational logic dominated by gate delays
How many gates between input and output?
Array storage dominated by wire delays
Longest delay or critical path is what matters
Can implement any combinational function in 2 logic levels
1 level of AND + 1 level of OR (PLA)
NOTs are free: push to input (DeMorgans) or read from latch
Example: delay(FullAdder) = 2
d(CarryOut) = delay(AB | AC | BC)
d(Sum) = d(A ^ B ^ C) = d(ABC | ABC | ABC | ABC) = 2
Note ^ means Xor (just like in C & Java)
Caveat: 2 assumes gates have few (<8 ?) inputsGate delays is an abstraction: not all gates are equally expensive, etc.CIS 501: Comp Arch|Dr.| ArithmeticRipple-Carry Adder DelayLongest path is to CO15 (or S15)delay(CO15) = 2 + MAX(d(A15),d(B15),d(CI15))d(A15) = d(B15) = 0, d(CI15) = d(CO14)d(CO15) = 2 + d(CO14) = 2 + 2 + d(CO13) d(CO15) = 32d(CON1) = 2NLinear in number of bitsNumber of gates is also linearCIS 501: Comp Arch|Dr.| ArithmeticCIS 501: Comp Arch|Dr.| Arithmetic4th Grade: Decimal Division9// quotient 3 |29// divisor | dividend2// remainderShift divisor left (multiply by 10) until MSB lines up with dividendsRepeat until remaining dividend (remainder) < divisorFind largest single digit q such that (q*divisor) < dividendSet LSB of quotient to qSubtract (q*divisor) from dividendShift quotient left by one digit(multiply by 10)Shift divisor right by one digit (divide by 10)Note that shifting the divisor to the right is equivalent to shifting the dividend to the left. We will use this trick later.CIS 501: Comp Arch|Dr.| ArithmeticBinary Division3 |29 = 0011 |011101-24 = – 0110005 = 000101- 3 = – 0000112 = 000010CIS 501: Comp Arch|Dr.| ArithmeticBinary Division HardwareSame as decimal division, except (again)More individual steps (base is smaller)Each step is simplerFind largest bit q such that (q*divisor) < dividendq = 0 or 1Subtract (q*divisor) from dividendq = 0 or 1 no actual multiplication, subtract divisor or notComplication: largest q such that (q*divisor) < dividendHow do you know if (1*divisor) < dividend?Human can eyeball this Computer does not have eyeballsSubtract and see if result is negativeCIS 501: Comp Arch|Dr.| ArithmeticSoftware Divide AlgorithmCan implement this algorithm in softwareInputs: dividend and divisorOutputs: quotient = dividend / divisorrem = dividend % divisorremainder = 0;quotient = 0;for (int i = 0; i < 32; i++) {remainder = (remainder << 1) | (dividend >> 31);
if (remainder >= divisor) {
quotient = (quotient << 1) | 1;remainder = remainder – divisor;quotient = (quotient << 1) | 0;dividend = dividend << 1;Same idea except that we are shifting the dividend and keeping the divisor constantNote ORing the quotient with 0 is redundant. All this means is that we are shifting in a zero into the LSB as opposed to shifting in a 1 as in the previous statement.Small point right shifting dividend may result in sign extension if you arent careful which isnt what you want. That why ANDing with 1 may be useful.CIS 501: Comp Arch|Dr.| ArithmeticDivide ExampleInput: Divisor = 00011 , Dividend = 11101StepRemainderQuotientRemainderDividend 00000000000 000001110110000100000 000011101020001100001 000001010030000100010 000010100040001000100 000101000050010101001 0001000000Result: Quotient: 1001, Remainder: 10CIS 501: Comp Arch|Dr.| ArithmeticDivider CircuitShift in 0 or 1 Shift in 0 or 1 Shift in 0N cycles for n-bit divideFast AdditionCIS 501: Comp Arch|Dr.| ArithmeticCIS 501: Comp Arch|Dr.| ArithmeticBad idea: a PLA-based Adder?If any function can be expressed as two-level logicwhy not use a PLA for an entire 8-bit adder?Approx. 216 AND gates, each with 16 inputsThen, 8 OR gates, each with 216 inputsNumber of gates exponential in bit width! Not that fast, eitherAn AND gate with 65K inputs != 2-input AND gateMany-input gates made from tree of, say, 4-input gates216-input gates would have at least 8 logic levelsSo, at least 10 levels of logic for a 16-bit PLAEven so, delay is still logarithmic in number of bits There are better (faster, smaller) waysCIS 501: Comp Arch|Dr.| ArithmeticTheme: Hardware != SoftwareHardware can do things that software fundamentally cantAnd vice versa (of course)In hardware, its easier to trade resources for latencyOne example of this: speculationSlow computation waiting for some slow input?Input one of two things?Compute with both (slow), choose right one later (fast)Does this make sense in software? Not on a uni-processorDifference? hardware is parallel, software is sequentialCIS 501: Comp Arch|Dr.| ArithmeticCarry-Select AdderCarry-select adderDo A15-8+B15-8 twice, once assuming C8 (CO7) = 0, once = 1Choose the correct one when CO7 finally becomes availableEffectively cuts carry chain in half (break critical path)But adds muxOn board: bad CSA design with 15/1 splitCIS 501: Comp Arch|Dr.| ArithmeticMulti-Segment Carry-Select AdderMultiple segmentsExample: 5, 5, 6 bit = 16 bitHardware costStill mostly linear (~2x)Compute each segment with 0 and 1 carry-inSerial mux chain5-bit adder (10) +Two muxes (4) = 14CIS 501: Comp Arch|Dr.| ArithmeticCarry-Select Adder DelayWhat is carry-select adder delay (two segment)?d(CO15) = MAX(d(CO15-8), d(CO7-0)) + 2d(CO15) = MAX(2*8, 2*8) + 2 = 18In general: 2*(N/2) + 2 = N+2(vs 2N for RCA)What if we cut adder into 4 equal pieces? Would it be 2*(N/4) + 2 = 10? Not quited(CO15) = MAX(d(CO15-12),d(CO11-0)) + 2d(CO15) = MAX(2*4, MAX(d(CO11-8),d(CO7-0)) + 2) + 2d(CO15) = MAX(2*4,MAX(2*4,MAX(d(CO7-4),d(CO3-0)) + 2) + 2) + 2d(CO15) = MAX(2*4,MAX(2*4,MAX(2*4,2*4) + 2) + 2) + 2d(CO15) = 2*4 + 3*2 = 14N-bit adder in M equal pieces: 2*(N/M) + (M1)*216-bit adder in 8 parts: 2*(16/8) + 7*2 = 1816b Carry-Select Adder DelayCIS 501: Comp Arch|Dr.| ArithmeticOptimal delay is at M=4: 14 gate delaysTODO: can do better with Bharaths asymmetric 2-2-3-4-5 design! Just 12 gate delays, overlaps RCA latency with muxes better.gate delays12345678910111213141516321814.666666666666671414.415.3333333333333316.5714285714285691819.55555555555556121.222.9090909090909124.66666666666667126.4615384615384628.2857142857142630.13333333333331932MGate delayCIS 501: Comp Arch|Dr.| ArithmeticAnother Option: Carry LookaheadIs carry-select adder as fast as we can go? Another approach to using additional resourcesInstead of redundantly computing sums assuming different carriesUse redundancy to compute carries more quicklyThis approach is called carry lookahead (CLA)CIS 501: Comp Arch|Dr.| ArithmeticA & B Generate & PropagateLets look at the single-bit carry-out functionCO = AB|AC|BC Factor out terms that use only A and B (available immediately)CO =(AB)|(A|B)C(AB): generates carry-out regardless of incoming C rename to G(A|B) : propagates incoming C rename to PWhy dont we worry about sum bit?We have a carry out if we either 1) generate it OR 2) we propagate AND there is a carry-inCIS 501: Comp Arch|Dr.| ArithmeticInfinite Hardware CLACan expand C1N in terms of Gs, Ps, and C0Example: C16C16 = G15 | P15C15C16 = G15 | P15(G14|P14C14)C16 = G15 | P15G14 | | P15P14P2P1G0 | P15P14P2P1P0C0Similar expansions for C15, C14, etc.How much does this cost?CN needs: N ANDs + 1 ORs, largest have N+1 inputsCNC1 needs: N*(N+1)/2 ANDs + N ORs, max N+1 inputsN=16: 152 total gates, max input 17N=64: 2144 total gates, max input 65And how fast is it really?Not that fast, unfortunately, 17-input gates are really slowCIS 501: Comp Arch|Dr.| ArithmeticCompromise: Multi-Level CLARipple carryFew small gates: 2N gates, max 3 inputsLaid in series: 2N (linear) latencyInfinite hardware CLAMany big gates: N*(N+3)/2 additional gates, max N+1 inputsLaid in parallel: constant 5 latencyIs there a compromise?Reasonable number of small gates?Sublinear (doesnt have to be constant) latency? Yes, multi-level CLA: exploits hierarchy to achieve this5 gate delays = 3 GD for carry computation + 2 GD for final addition (full adder)CIS 501: Comp Arch|Dr.| ArithmeticCarry Lookahead Adder (CLA)Calculate propagate and generate based on A, BNot based on carry inCombine with tree structureTake awaysTree gives logarithmic delayReasonable areaCi is the carry into bit iCIS 501: Comp Arch|Dr.| ArithmeticIndividual G & P Windowed G & PWindowed G/P: useful abstraction for multi-level CLAIndividual carry equationsC1 = G0 | P0C0,C2 = G1 | P1C1Infinite hardware CLA equationsC1 = G0 | P0C0C2 = G1 | P1G0 | P1P0C0Group terms into windowsC2 = (G1 | P1G0) | (P1P0)C0C2 = G1-0 | P1-0C0G1-0, P1-0 are window G & Pa single bit summarizing information from all the bits in the windowG1-0: carry-out generated by 1-0 bitsP1-0: carry-out propagated by 1-0 bitsCIS 501: Comp Arch|Dr.| ArithmeticTwo-Level CLA for 4-bit AdderIndividual carry equationsC1 = G0 | P0C0,C2 = G1 | P1C1,C3 = G2 | P2C2, C4 = G3 | P3C3Infinite hardware CLA equationsC1 = G0 | P0C0C2 = G1 | P1G0 | P1P0C0C3 = G2 | P2G1 | P2P1G0 | P2P1P0C0C4 = G3 | P3G2 | P3P2G1 | P3P2P1G0 | P3P2P1P0C0Hierarchical CLA equationsFirst level: expand C2 using C1, C4 using C3C2 = G1 | P1(G0 | P0C0) = (G1 | P1G0) | (P1P0)C0 = G1-0 | P1-0C0C4 = G3 | P3(G2 | P2C2) = (G3 | P3G2) | (P3P2)C2 = G3-2 | P3-2C2Second level: expand C4 using expanded C2C4 = G3-2 | P3-2(G1-0 | P1-0C0) = (G3-2 | P3-2G1-0) | (P3-2P1-0)C0C4 = G3-0 | P3-0C0NB: Gx-y and Px-y are each single bit wiresCIS 501: Comp Arch|Dr.| ArithmeticTwo-Level CLA for 4-bit AdderTop first-level CLA blockInput: C0, G0, P0, G1, P1Output: C1, G1-0, P1-0Bottom first-level CLA blockInput: C2, G2, P2, G3, P3Output: C3, G3-2, P3-2Second-level CLA blockInput: C0, G1-0, P1-0, G3-2, P3-2Output: C2, G3-0, P3-0These 3 blocks are the sameInput: C0, G3-0, P3-0Output: C4CIS 501: Comp Arch|Dr.| Arithmetic4-bit CLA Timing: d0Which signals are ready at 0 gate delays (d0)?CIS 501: Comp Arch|Dr.| Arithmetic4-bit CLA Timing: d1Signals ready from befored0: C0, Ai, Bi New signals ready at d1P0 = A0 | B0, G0 = A0B0P1 = A1 | B1, G1 = A1B1P2 = A2 | B2, G2 = A2B2P3 = A3 | B3, G3 = A3B3CIS 501: Comp Arch|Dr.| Arithmetic4-bit CLA Timing: d3Signals ready from befored0: C0, Ai, Bi d1: Pi, GiNew signals ready at d3P1-0 = P1P0G1-0 = G1 | P1G0C1 = G0 | P0C0P3-2 = P3P2G3-2 = G3 | P3G2C3 is not readyCIS 501: Comp Arch|Dr.| Arithmetic4-bit CLA Timing: d5Signals ready from befored0: C0, Ai, Bi d1: Pi, Gid3: P1-0, G1-0 , C1 , P3-2 , G3-2New signals ready at d5P3-0 = P3-2P1-0G3-0 = G3-2 | P3-2G1-0C2 = G1-0 | P1-0C0CIS 501: Comp Arch|Dr.| Arithmetic4-bit CLA Timing: d7Signals ready from befored0: C0, Ai, Bi d1: Pi, Gid3: P1-0, G1-0 , C1 , P3-2 , G3-2d5: P3-0, G3-0, C2New signals ready at d7C3 = G2 | P2C2C4 = G3-0 | P3-0C0Si ready d1 after Ci Q: why is sum ready in just one gate delay?Q: what happens in C4 block?CIS 501: Comp Arch|Dr.| ArithmeticTwo-Level CLA for 4-bit AdderCLA blocks: 5 gates, max 3 inputs (infinite CLA with N=2)3 of theseC4 block: 2 gates, max 2 inputsTotal: 17 gates/3inputsInfinite: 14/52 for outer CLA, 4 for interiorG/P go up, C go downTotal: 8 (7 for CLA, 1 for S)Infinite: 42L is bigger and slower CLA hierarchy doesnt pay off until we have larger addersCIS 501: Comp Arch|Dr.| ArithmeticTwo-Level CLA for 16-bit Adder4 G/P inputs per levelFirst level: 14 gates * 4 blocksSecond level: 14 gates * 1 blockTotal: 70 gateslargest gate: 5 inputsInfinite: 152 gates, 17 inputsTotal: 8 (1 + 2 + 2 + 2 + 1)Infinite: 4 (1 + 2 + 1)CLA for a 64-bit adder?CIS 501: Comp Arch|Dr.| Arithmetic16-bit CLA Timing: d0What is ready after 0 gate delays?CIS 501: Comp Arch|Dr.| Arithmetic16-bit CLA Timing: d1What is ready after 1 gate delay? Individual G/PCIS 501: Comp Arch|Dr.| Arithmetic16-bit CLA Timing: d3What is ready after 3 gate delays?1st level group G/PInterior carries of 1st groupC1, C2, C3 CS: assignmentchef QQ: 1823890830 Email: [email protected]
Reviews
There are no reviews yet.