ECE437: Introduction to Digital Computer Design
Chapter 4b (Pipelining)
1st of 4 pillars of modern computer systems in 437
Applies to ALL hardware and software Spring 2021
Outline
Pipelining
What? Basic concepts
Overlapping execution Latency vs. throughput
Why? Performance implications Speedup
CPI, cycletime
How? Implementation challenges
ECE437, S21 Vijaykumar and Thottethodi (2)
Motivation
We want to improve performance
Single-cycle CPUs CPI = 1 but long cycle time
Can we shrink cycle time without worsening CPI?
Simply 4x faster clock would mean CPI = 4 no improvement in performance!
Breaking up each instr into many cycles is one (only?) way to get faster clock assuming single- cycle CPU implementation is as fast as it can be
Break every instr into 5 cycles for fast cycle time and then overlap 5 instrs to make the CPI = 1!
Remember CPI is EFFECTIVE CPI We overlap through pipelining
ECE437, S21 Vijaykumar and Thottethodi (3)
RECALL: Iron law
Time/program = instrs/program x cycles/instr x sec/cycle
NEVER forget this!
sec/cycle (a.k.a. cycle time, clock time)
mostly determined by technology and CPU orgn. cycles/instr (a.k.a. CPI)
mostly determined by ISA and CPU organization EFFECTIVE cycles/instr and NOT actual latency overlap among instructions makes this smaller
Each instr 5 cycles but 5 instrs overlap CPI = 1
AVERAGE over instrs (instrs have different CPI)
instr/program (a.k.a. instruction count)
instrs executed, NOT static code
mostly determined by program, compiler, ISA
ECE437, S21 Vijaykumar and Thottethodi (4)
1
T a s k
6PM 7 8 9 10 11 12 1 2AM 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
T aA
s kB
30 30 30 30 30 30 30
Time
Pipelining
CENTRAL computer science/engineering concept in BOTH hardware and software
There is almost no hardware or software that does not use pipelining for performance!
Pipelining
Ann, Brian, Cathy, Dave
each have one load of clothes to wash, dry, and fold
Washertakes30minutes
Dryertakes30minutes
Foldertakes30minutes
Stashertakes30minutes to put clothes into drawers
ECE437, S21 Vijaykumar and Thottethodi (6)
ECE437, S21 Vijaykumar and Thottethodi
(5)
A Time B
Sequential Laundry
Pipelined Laundry
6PM 7 8 9 10 11 12 1 2AM
OC OC rD rD
de Sequential laundry takes 8 hours for 4 loads r If they learned pipelining, how long would
laundrytake?
ECE437, S21 Vijaykumar and Thottethodi (7)
de r
Pipelined laundry takes 3.5 hours for 4 loads!
Not esoteric computer architecture concept Natural,Intuitive
ECE437, S21 Vijaykumar and Thottethodi (8)
LaundryExample
ABCD
2
Pipelining lessons
6PM 7 8 9
Time entire workload
T a 30 30 30 30 30 30 30
s A
k
Multiple tasks operating simultaneously using different resources Potential speedup =
Pipeliningdoesnthelp latency of single task, it helps throughput of
B Number pipe steps/
O
stages
C
Pipeline rate limited by
r D d
e r
ECE437, S21 Vijaykumar and Thottethodi
slowest pipeline stage Unbalanced lengths of
pipe stages reduces
speedup
Timetofillpipeline
and time to drain it
reduces speedup
Stallfordependencies
(9)
The Steps/Stages of Load
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5
Load Ifetch Reg/Dec Exec Mem Wr
Ifetch: Instruction Fetch
Fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode
Exec: compute/calculate memory address
Mem: Read the data from the Data Memory
Wr: Write the data back to the register file
REMEMBER THESE 5 STEPS FOREVER!
ECE437, S21 Vijaykumar and Thottethodi (10)
IF
Time
Exe
Mem WB
Single-cycle execution
Dec
IF
Dec
Exe
Mem WB
IF
Dec
Exe
Mem WB
Program Flow
When an instruction is in one stage what are the other stages doing?
ECE437, S21 Vijaykumar and Thottethodi (11)
IF
Dec
Pipelined Execution Representation
Time
Program Flow
What is the CPI here?
Ideal speedup =?
ECE437, S21 Vijaykumar and Thottethodi (12)
IFetch
Dcd
IFetch
Exec
Dcd
IFetch
Mem
Exec
Dcd
IFetch
WB
Mem
Exec
Dcd
IFetch
WB
Mem
Exec
Dcd
IFetch
WB
Mem
Exec
Dcd
WB
Mem
Exec
WB
Mem
WB
E
3
Ideal speedup
All instructions are executed in 1 cycle in a single-cycle datapath (i.e. CPI = 1)
Single-cycle cycletime = t ns (say)
Instr. Count = n
For P pipeline stages, pipelined cycletime = t/P
Old time (single-cycle) = n x 1 x t
Newtime(pipelined)=nx1xt/P+2x(P-1)x t/P (fill + drain)
Speedup = P/(1 + 2(P-1)/n)
P is some constant, n is large Speedup = P
(= the number of pipeline stages)
ECE437, S21 Vijaykumar and Thottethodi (13)
One-minute quiz
Compared to single-cycle, which (one or more) of these three factors does pipelining improve: instruction count, CPI, clock speed?
Previous quiz
Iron Law: Time/prog = Instr. Count x CPI x cycle time
ECE437, S21 Vijaykumar and Thottethodi (14)
Non-uniform pipeline stages
Ideal speedup is number of stages in the pipeline. Do we achieve this?
ECE437, S21 Vijaykumar and Thottethodi (15)
Non-uniform stages
Maximum Speedup Number of stages Speedup Time for unpipelined operation
Time for longest stage
UNLESS you go to asynchronous or self- timed pipelines
Idea has been around for decades
Not caught on because very hard to do
ECE437, S21 Vijaykumar and Thottethodi (16)
4
Exercise
A single cycle processor implementation can be pipelined in two ways
Pipeline A uses a 5-stage pipeline
the 5 stages account for 15%, 10%, 15%, 20%, 40%
of the delays respectively
Pipeline B uses a 3-stage pipeline
the stages are balanced
If instructions are all independent, which pipeline implementation is the better option
ECE437, S21 Vijaykumar and Thottethodi (17)
Single Cycle vs. Pipeline
Cycle 1
Clk
Single Cycle Implementation:
Load
Cycle 2
Store Waste
Cycle1 Cycle2 Cycle3 Cycle4 Cycle5 Cycle6 Cycle7 Cycle8 Cycle9Cycle10 Clk
Pipeline Implementation: Load Ifetch Reg Exec Store Ifetch Reg
R-type Ifetch
Mem Wr
Exec Mem Wr
Reg Exec Mem Wr
ECE437, S21 Vijaykumar and Thottethodi
(18)
What hardware is added to Single-cycle to get pipelining?
ECE437, S21 Vijaykumar and Thottethodi (19)
Pipelined Datapath
Pipeline datapath with latches
ECE437, S21 Vijaykumar and Thottethodi (20)
5
Hardware requirements
Similar to single cycle pipeline Latches** to separate stages
Real machines use latches and multiphase clocks Latches == flipflops in this course
Control
Good news
Minor changes from single-cycle version
ECE437, S21 Vijaykumar and Thottethodi (21)
More on Pipeline latches
phi1 phi2
Remember construction of flip flop Two latches, clock and inverted clock.
2-phase non-overlapping clocks
1 pipe stage uses two (level-sensitive) latches
Edge-triggered
ECE437, S21 Vijaykumar and Thottethodi
phi1 phi2 phi1
(22)
Ideal speedup with latches?
Suppose we execute N instructions Single Cycle Machine
45ns/cycle x1CPIxNinst=45Nns Ideal pipelined machine
5 stages but assume slightly slower clock at 10ns and not 9ns (due to latch overhead)
10 ns/cycle x (1 CPI x N inst + 8 cycle fill+drain) = 10N + 80 ns = ~10N for large N
Speedup = 4.5
ECE437, S21 Vijaykumar and Thottethodi (23)
Resources other than latches are already there!
Reg Dm
Reg Im
nIInst0Im Reg s
Time (clock cycles)
Dm
Reg Reg
Im Reg Dm Reg
Im Dm Reg
t Inst 1 r.
Im
O Inst2 r
d Inst 3 e
r
Inst 4
ECE437, S21 Vijaykumar and Thottethodi
Reg
(24)
Dm Reg
6
ALU
ALU ALU
ALU ALU
One important thing about latency vs. throughput
Pipelining does a strange thing
To improve program latency, pipelining improves instruction throughput without improving instruction latency
May actually worsen instruction latency E.g., Pipeline Latch overheads
ECE437, S21 Vijaykumar and Thottethodi (25)
Pipelined Processor Design
Designing a pipelined processor
Associate resources with states
Resources not necessarily indivisible or full-cycle
Register reads and writes can happen in the same cycle
Writes in first half of cycle and reads in the second half we will see later
Assert appropriate controls in each stage
Make sure all necessary information is carried through the pipeline along with the instruction
ECE437, S21 Vijaykumar and Thottethodi (26)
Shading convention
Shade first half => write shade second half => read
ALU full-cycle
Register file can handle read/write in the same cycle
Memory can handle only 1 read or 1 write per cycle
ECE437, S21 Vijaykumar and Thottethodi (27)
Pipelining the Load Instruction
Clock 1st lw
Ifetch 2nd lw
Reg/Dec Ifetch
3rd lw
Exec Reg/Dec Ifetch
Mem
Exec Reg/Dec
Wr Mem Exec
Cycle 1 Cycle 2
Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7
Wr
Mem Wr
Thefiveindependentblocksinthepipelinedatapath are:
Instruction Memory for the Ifetch stage
RegisterFilesReadports(busAandbusB)fortheReg/Dec
stage
ALUfortheExecstage
DataMemoryfortheMemstage
Register Files Write port (bus W) for the Wr stage
ECE437, S21 Vijaykumar and Thottethodi (28)
7
Pipelined Datapath
Pipeline datapath with latches
ECE437, S21 Vijaykumar and Thottethodi (29)
lw
Instruction fetch
0 M u x 1
Add 4
MEM/WB
1 M u x 0
Instruction Fetch
IF/ID ID/EX
EX/MEM
Add
Add result
Zero
ALU ALU result
Shift left 2
0
M u x
PC
Address
Instru mem
ction ory
Read register 1
Read register 2
Read data 1
Registers Read Write data 2
Write data
register
Read
data
ta ory
1
Address
Da
mem
Write data
ECE437, S21 Vijaykumar and Thottethodi (30)
16
Sign extend
32
Address
Instru mem
0 M u x 1
Instruction Decode/Reg Read
lw
Instruction decode
ction ory
IF/ID
ID/EX
EX/MEM
MEM/WB
1 M u x 0
Add 4
Read register 1
Read register 2
Registers Read Write data 2
Write data
Shift left 2
0 M u x 1
Add
Add result
Zero
ALU ALU result
PC
register
ECE437, S21 Vijaykumar and Thottethodi (31)
Read data1
Address
Da me
Write data
Read
data
ta mory
16
Sign extend
32
0 M u x 1
Execute (Address calculation)
lw
Execution
MEM/WB
1 M u x 0
Add 4
PC
Address
Instru mem
IF/ID ID/EX
EX/MEM
Read register 1
Read register 2
Registers Read Write data 2
Write data
A
dd
Add result
Zero
ALU ALU result
Shift left 2
M u x
ction ory
register
ECE437, S21 Vijaykumar and Thottethodi (32)
Read data 1
0
Address
Da
mem
Write data
Read
data ta
ory
1
16
Sign extend
32
8
Instruction
Instruction
Instruction
Memory Access
lw
Memory
0 M u x 1
Add 4
MEM/WB
1 M u x 0
IF/ID ID/EX
EX/MEM
Add result
Zero ALU ALU result
M u x
A
dd
Shift left 2
PC
Address
Instru mem
Read register 1
Read register 2
Registers Read Write data 2
Write data
Read data1
ction ory
0
1
Address Da
mem
Write data
Read data
ta ory
register
ECE437, S21 Vijaykumar and Thottethodi (33)
16
Sign extend
32
IF/ID ID/EX
EX/MEM
Writeback
0 M ux 1
Add 4
lw
W r i t e b a c k
Shift left 2
M u x
A
dd
Add result
Zero ALU ALU result
PC
MEM/WB
1 M u x 0
Address
Instru mem
Read register 1
Read register 2
Registers Read Write data 2
Write data
Read data1
ction ory
0
Address Da
mem
Write data
Read ta data
ory
register
1
ECE437, S21 Vijaykumar and Thottethodi
(34)
16
Sign extend
32
Pipeline latches
Length of Pipeline latches Book says (Fig 4.35)
IF/ID : IR (32), PC+4 (32) : 64 bits
ID/EX: IR (32), PC+4 (32) + RegA + RegB : 128 bits EX/MEM: ALUout(32) + Equal(1) + PC+4+SX(imm)
(32) : 97
MEM/WB: ALUout (32) + MemData(32) : 64
Inaccurate, will be refined:
(see next slide)
Other control bits (IR not going through)
ECE437, S21 Vijaykumar and Thottethodi (35)
Corrected Datapath for lw
Carry destination register through pipeline latches to WB stage
ECE437, S21 Vijaykumar and Thottethodi (36)
9
Instruction
Instruction
The Four Stages of R-type
Cycle1 Cycle2 Cycle3 Cycle4
R-type Ifetch Reg/Dec Exec Wr
Ifetch:InstructionFetch
Fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode Exec:
ALU operates on the two register operands
Update PC
Wr:WritetheALUoutputbacktotheregisterfile
ECE437, S21 Vijaykumar and Thottethodi (37)
Pipelining R-type and Loads
Cycle 1 Cycle 2
Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9
Clock
R-type Ifetch
R-type
Reg/Dec Ifetch
Load
Exec Reg/Dec Ifetch
R-type
Wr
Exec Reg/Dec Ifetch
R-type
Wr
Exec Reg/Dec Ifetch
Oops! We have a problem!
Mem Wr
Exec Wr
Reg/Dec Exec Wr
Wehavepipelineconflictorstructuralhazard:
Twoinstructionstrytowritetotheregisterfileatthesame
time!
Onlyonewriteport
ECE437, S21 Vijaykumar and Thottethodi (38)
Key observation
Eachhardwareblockcanonlybeusedonceper instruction
Eachhardwareblockmustbeusedatthesamestage for all instructions:
Load uses Register Files Write Port during its 5th stage 12345
Load Ifetch Reg/Dec Exec Mem Wr
R-type uses Register Files Write Port during its 4th stage 1234
R-type Ifetch Reg/Dec Exec Wr
2 ways to solve this pipeline hazard ECE437, S21 Vijaykumar and Thottethodi (39)
Soln.1: Insert Bubble
Cycle 1 Cycle 2
Ifetch Load
Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9
Clock
Wr
Exec Mem R-type Ifetch Reg/Dec Exec
Reg/Dec Ifetch
Exec Reg/Dec
Wr
Wr
R-type Ifetch Reg/Dec Pipeline Exec Wr R-type Ifetch Bubble Reg/Dec Exec
Wr Ifetch Reg/Dec Exec
Insertabubbleintothepipelinetoprevent2writes at the same cycle
Thecontrollogiccanbecomplex.
Lose instruction fetch opportunity.
NoinstructionisstartedinCycle6CPIincreases!
ECE437, S21 Vijaykumar and Thottethodi (40)
10
Soln.2: Delay R-types Write
Delay R-types register write by one cycle:
Now R-type instructions also use Reg Files write port at Stage 5
Mem stage is NOOP for R-type: nothing done 12345
R-type Ifetch Reg/Dec Exec
Mem
Cycle 6
Wr
Mem
Exec
Reg/Dec Exec
(41)
Cycle 1 Cycle 2
Cycle 3 Cycle 4 Cycle 5
Wr
Cycle 7 Cycle 8
Cycle 9
Wr
Clock
R-type
Ifetch R-type
Reg/Dec Ifetch
Load
Exec Reg/Dec Ifetch
R-type
Mem Wr
Exec Mem Reg/Dec Exec
Ifetch Reg/Dec R-type Ifetch
Mem
Wr
Mem
Wr
ECE437, S21 Vijaykumar and Thottethodi
Does soln. 2 lose performance?
DoesitincreaseR-typeslatency? Does it worsen R-types CPI?
Whatisthebottomlineperformanceconcernin pipelining?
ECE437, S21 Vijaykumar and Thottethodi (42)
Modified RTL & Datapath
SA + B;
SA or ZX;
SA + SX;
If Cond
PC PC+SX;
MS R[rd] M;
MS R[rt] M;
IRMem[PC]; PC < PC+4; AR[rs]; B< R[rt]MMem[S] Mem[S]B R[rt] M;ASM BD ECE437, S’21 Vijaykumar and Thottethodi(43) Four Stages of StoreCycle1 Cycle2 Cycle3 Cycle4Store Ifetch Reg/DecExec Mem Wr Ifetch:InstructionFetch Fetch the instruction from the Instruction Memory Reg/Dec: Registers Fetch and Instruction Decode Exec:Calculatethememoryaddress Mem:WritethedataintotheDataMemoryECE437, S’21 Vijaykumar and Thottethodi (44)11Mem AccessData MemNext PC PCInst. Mem IRReg FileExecReg. FileEqual 0 M u x 1Exec Stage of StoreswExecutionIF/ID ID/EXEX/MEMMEM/WB1 M u x 0PCAdd 4Reg B and SX(Imm) are both neededAddressInstru meRead register 1Read register 2Registers Read Write data 2Write dataregisterECE437, S’21 Vijaykumar and Thottethodi (45)Read data1AddAdd resultZeroALU ALU resultShift left 20M u xction mory 1Address DameWrite dataRead datata mory 16Sign extend320 M u x 1Add 4swMemoryMem stage of StoreIF/ID ID/EXEX/MEMPCMEM/WB1 M u x 0AddressInstru memRead data 1Shift left 2Add resultZero ALU ALU resultM u xAdd ction oryRead register 1Read register 2Registers Read Write data 2Write data 0 1Address DamemWrite dataReaddatata oryregister 16Sign extend32ECE437, S’21 Vijaykumar and Thottethodi (46)FourStages of BeqCycle1 Cycle2 Cycle3 Cycle4Beq Ifetch Reg/Dec Exec Mem Wr Ifetch: Instruction Fetch Fetch the instruction from the Instruction Memory Reg/Dec: Registers Fetch and Instruction Decode Exec: compares the two register operand, Compute branch target Mem select correct branch target address latch into PC There is more to this ……laterECE437, S’21 Vijaykumar and Thottethodi (47)10 lw 14 addI 20 sub 24 beq 30 ori 34 add100 andr1, r2(35) r2, r2, 3r3, r4, r5r6, r7, 100 r8, r9, 17 r10, r11, r12r13, r14, 15these addresses are OCTAL(48)ECE437, S’21 Vijaykumar and ThottethodiVisualizing the pipeline12InstructionInstructionStart: Fetch 10nnIRnnWB Ctrl Mem Ctrlrs rt im ABS D M10 lw14 addI 20 sub 24 beq 30 ori 34 add100 andr1, r2(35) r2, r2, 3r3, r4, r5r6, r7, 100 r8, r9, 17 r10, r11, r12r13, r14, 15 ECE437, S’21 Vijaykumar and Thottethodi(49) Fetch 14, Decode 10 nnn WB CtrlIR2 rt im AB MS D 10 lw14 addI 20 sub 24 beq 30 ori 34 add100 andr1, r2(35)r2, r2, 3r3, r4, r5r6, r7, 100 r8, r9, 17 r10, r11, r12r13, r14, 15 ECE437, S’21 Vijaykumar and Thottethodi(50)Mem Ctrl Fetch 20, Decode 14, Exec 10 nnWB Ctrl Mem Ctrl IR2 rtr2BM S D 10 lw14 addI20 sub 24 beq 30 ori 34 add100 andr1, r2(35) r2, r2, 3r3, r4, r5r6, r7, 100 r8, r9, 17 r10, r11, r12r13, r14, 15 ECE437, S’21 Vijaykumar and Thottethodi(51) Fetch 24, Decode 20, Exec 14, Mem 10 nWB Ctrlem CtrlM IR45r2BM 10 lwD14 addI 20 sub 24 beq 30 ori 34 add100 andr1, r2(35) r2, r2, 3r3, r4, r5r6, r7, 100 r8, r9, 17 r10, r11, r12r13, r14, 15 ECE437, S’21 Vijaykumar and Thottethodi(52)13PC20PC10PC24r2+35lw r1PC14Next PCExecNext PCExecNext PCExecNext PCExecMem AccessMem AccessData MemData MemMem AccessMem AccessData MemData MemReg FileDecodeReg FileDecodeReg. FileReg. FileReg FileDecodeReg FileDecodeReg. FileReg. File35lw r13addI r2, r2, 3Inst. MemInst. MemaddI r2, r2, 3Inst. MemInst. Memsub r3, r4, r5lw r1, r2(35)Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10 — deal with branch laterIR67r4r5Mem CtrlWB Ctrl10 lwD14 addI 20 sub 24 beq 30 ori 34 add100 andr1, r2(35) r2, r2, 3r3, r4, r5r6, r7, 100 r8, r9, 17 r10, r11, r12r13, r14, 15 ECE437, S’21 Vijaykumar and Thottethodi(53) One-minute quiz What is the speedup over single-cycle? 5-stage pipeline with per-stage delays in terms of the single-cycle clock of 10%, 15%, 20%, 20%, and the remaining On top of these delays, the pipeline latch overhead is 5% of single-cycle clock FYI: Real latch overhead is 1-2% Previous quiz Compared to single-cycle, which (one or more) of these three factors does pipelining improve: instruction count, CPI, clock speed? Clock speedECE437, S’21 Vijaykumar and Thottethodi (54) 0 M u x 1Add 4PCRecap: DatapathIF/ID ID/EXEX/MEMMEM/WB1 M u x 0 Add resultZero ALU ALU resultM u xAdd Shift left 2 Read register 1Read register 2Registers Read Write data 2Write data AddressInstru me ction mory0 1 Address DameWrite dataReaddata tamoryregisterRead data 1ECE437, S’21 Vijaykumar and Thottethodi (55)16Sign extend32Outline Pipeline Datapath Under simplifying assumptions All independent instructions Dependencies later Datapath not yet capable of handling dependencies Walk-through Control Most complicated (read as irregular/unstructured) Not that bad for pipelined processors Similar to single cycle controlECE437, S’21 Vijaykumar and Thottethodi (56)14InstructionReg FileDecodeNext PCExecPC 30Memr2+3addI r2Access DataMem M[r2+35]lw r1Reg. FileInst. Membeq r6, r7 100sub r3Pipelined Datapath with Controls0 M u x 1Add 4 IF/IDID/EXEX/MEMB AddAdd resultZeZreoroALU ALU resultranchPCSrcMemWriteMEM/WBMemtoReg1 M u x 0 RegWriteShift left 2RegDstPCAddressInstruction memoryRead register 1Read register 2Read data 1Registers Read Write data 2Write data AddressReadregister Write datadata memoryData Instruction [150] 16truction [20 16]truction [15 11]Sign extend32ALUSrc0 M u x 160 M u x 1ALU controlInsInsMemReadECE437, S’21 Vijaykumar and Thottethodi (57)ALUOpPipeline Control vs. Single cycle control Similar (generation) What about PCWrite? IR-write? Write enable for the pipeline latches? Pipelined processor implementation Stages common to all instructions Instruction fetch stage (IF) Decode and Register read (ID) Instruction-specific stages Execute (Exec) Memory access (Mem) Write back (WB)ECE437, S’21 Vijaykumar and Thottethodi (58) Generating controls Simplify the problem Reduce pipeline control to single-cycle control (almost) Generate controls once Consume (i.e., use and discard) signals asyou proceed along the pipeline stages Identify Stage of consumption for all control signalsECE437, S’21 Vijaykumar and Thottethodi (59) Focus on ControlExecALUSrc ALUOp RegDstMemRd MemWr BranchMemtoReg RegWrReg/DecALUSrc ALUOp RegDst MemRd MemWr Branch MemtoReg RegWrMemMemRd MemWr BranchMemtoReg RegWrWrMemtoReg RegWr Main Control The Main Control generates the control signals during Reg/Dec Control signals for Exec (ExtOp, ALUSrc, …) are used 1 cycle later Control signals for Mem (MemWr Branch) are used 2 cycles later Control signals for Wr (MemtoReg MemWr) are used 3 cycles laterECE437, S’21 Vijaykumar and Thottethodi (60)15InstructionMem/Wr RegisterEx/Mem RegisterID/Ex RegisterIF/ID RegisterMeaning of controls RegWr : 1-> write, 0-> no write
MemToReg : 1-> MDR, 0-> ALUOut
RegDst : 1->rd, 0->rt
ALUOp<1:0> : 00->Add,01->Sub,10->funct ALUSrc : 0->RegB, 1->SX(Imm)
Branch: 0->non-branch inst, 1-> branch
MemRead: 1->memread, 0-> no memread
MemWrite: 1->memwrite, 0->no memwrite ALU control abstracted away (as before)
Inputs: ALUop (2 bits), 6 funct bits from IR
ECE437, S21 Vijaykumar and Thottethodi (61)
Extended pipeline latches
Control
ID/EX EX/MEM MEM/WB
Simple extensions to carry control state
ECE437, S21 Vijaykumar and Thottethodi (62)
WB
Instruction
M
EX
WB
M
WB
IF/ID
Pipeline Walkthrough with controls
Use walkthrough worksheets Use code segment shown
Fill in controls
Interesting stages
Controls generated in Decode stage
Controls consumed in subsequent stages
ECE437, S21 Vijaykumar and Thottethodi (63)
lw $10, 20($1) sub $11, $2, $3 and $12, $4, $5 or $13, $6, $7 add $14, $8, $9
IF: lw $10, 20($1)
ID: before<1>
EX: before<2>
MEM: before<3>
WB: before<4>
1 M u x 0
Pipeline Walkthrough with controls
4
Add
IF/ID
ID/EX
WB M
EX/MEM
MEM/WB
0 WB 0
0 M u x 1
00 000 0000
00
Control
WB
000 00
00
0 00
EX
00
M
0
M u x 1
Ad
Add d result
ALUSrc
Shift left2
Branch
Address
Instru mem
Read
register 1 Read Read data 1 register 2
Registers Read Write data 2
Write data
ction ory
register
Zero
ALU ALU result
Read data
Write data
Address memory
Data
truction
[15 0] Sign
Ins
truction [2016]
Ins
extend
0
ALU control
ALUOp
MemRead
Ins
M truction u
[1511] x 1
RegDst
PC
lw
$10, 20($1)
sub $11, $2, $3 and $12, $4, $5 or $13, $6, $7 add $14, $8, $9
Clock 1
ECE437, S21 Vijaykumar and Thottethodi
(64)
16
WB <1:0> Exec <3:0>
Mem <2:0>
Instruction
MemtoReg
MemWrite
RegWrite
Pipeline Walkthrough with controls
IF: sub $11, $2, $3
ID: lw $10, 20($1)
EX: before<1>
MEM: before<2>
WB: before<3>
1 M u x 0
0
M u
IF/ID
x
1
l
Control
000 00
00
00 0
ID/EX
WB M
EX/MEM
WB
MEM/WB
0 WB 0
11 010 0001
$1 $X
00
EX M 00
1 X
w
Branch
Address
Instru mem
Read register 1
Read register 2
Read data 1
ction ory
Registers Read Write data 2
Write data
register
Read data
Write data
Address memory
Data
Instruction
[150] Sign 20
Shift left 2
0
M u x 1
0 M u
Add Add result
ALUSrc
Zero
ALU ALU result
ALU control
20
MemRead
Ins 10
extend [2016] 10
truction
ALUOp
X
Instruction
[1511] X x
1
RegDst
PC
4
Add
lw
$10, 20($1)
sub $11, $2, $3
and $12, $4, $5 or $13, $6, $7 add $14, $8, $9
Clock 2
ECE437, S21 Vijaykumar and Thottethodi
(65)
IF: and $12, $4, $5
ID: sub $11, $2, $3
EX: lw $10, . . .
MEM: before<1>
WB: before<2>
1 M u x 0
Pipeline Walkthrough with controls
Add
4
0
M u
IF/ID
EX/MEM
WB
MEM/WB
0 WB 0
x
1
sub
Control
10 000 1100
ID/EX
WB M
11
010 00
00 00 0 10
EX
M
Shift left 2
$2 $1
Ad
Add d result
ALUSrc
Branch
Address
Instru mem
2 3
Read
register 1
Read data 1 register 2
Registers Read Write data 2
Write data
Read
ction ory
0
register
$3
M u x 1
Zero
ALU ALU result
Read data
Write data
Address memory
Data
Ins
X
truction
[150] Sign X
20
10
M u
ALU control
ALUOp
Ins
X
truction [20 16]
X
extend
0
[1511] 11 x 1
MemRead
Ins
11
truction
RegDst
PC
lw
$10, 20($1)
sub $11, $2, $3
and $12, $4, $5
or $13, $6, $7 add $14, $8, $9
Clock 3
ECE437, S21 Vijaykumar and Thottethodi
(66)
IF: or $13, $6, $7
ID: and $12, $2, $3
EX: sub $11, . . .
MEM: lw $10, . . .
WB: before<1>
1 M u x 0
Pipeline Walkthrough with controls
Add
IF/ID
ID/EX EX/MEM
11
WB
0
1
0
MEM/WB
0 WB 0
0 M u x 1
and
Control
10 000 1 100
WB M EX
10
000
1 10 0
d r e As ud l dt
ALUSrc
M
Ad
Shift left 2
$4 $2
Branch
4 5
Address
Instru mem
Read register 1
Read register 2
Read data 1
ction ory
Registers Read Write data 2
Write data
$5 $3
0
Address
Write data
Read
data Data
memory
register
M u x 1
Zero
ALU ALU result
X
Instruction
[150] Sign X
MemRead
Ins X
truction [20 16]
X
extend
0
M truction u [1511] 12 11 x 1
ALU control
ALUOp
Ins 12
10
RegDst
PC
4
lw
sub $11, $2, $3 and $12, $4, $5 or $13, $6, $7 add $14, $8, $9
Clock 4
$10, 20($1)
ECE437, S21 Vijaykumar and Thottethodi
(67)
4
Pipeline Walkthrough with controls
IF: add $14, $8, $9
ID: or $13, $6, $7
EX: and $12, . . .
MEM: sub $11, . . .
WB: lw $10, . . .
1 M u x 0
Add
IF/ID
EX/MEM
WB
MEM/WB
1 WB 1
0 M u x 1
or
Control
10 000 1100
ID/EX
WB M
10
000 10
10 10 0 00
EX
M
Ad
Add d result
ALUSrc
Shift left 2
$6 $4
Branch
Address
Instruction memory
6 7 10
Read register 1
Read register 2
Regist Write
register
Write data
Read data 1
ers Read data 2
$7 $5
Read data
Write data
Address memory
M u x 1
Z ero
ALU ALU result
Ins
X
Ins
X
truction
[150] Sign X
0
0
Data
ALU control
ALUOp
MemRead
truction [20 16]
X
extend
Ins
13
M truction u
[1511] 13 12 x 1
11
10
RegDst
PC
lw
sub $11, $2, $3 and $12, $4, $5 or $13, $6, $7 add $14, $8, $9
Clock 5
$10, 20($1)
ECE437, S21 Vijaykumar and Thottethodi
(68)
17
Instruction
Instruction
MemtoReg
Instruction
Instruction
MemtoReg
MemtoReg
MemWrite
MemWrite
MemWrite
RegWrite
RegWrite
RegWrite
RegWrite
MemtoReg
MemWrite
Pipeline Walkthrough with controls
Pipeline Walkthrough with controls
IF: after<1>
4
PC
$10, 20($1)
ID: add $14, $8, $9
EX: or $13, . . .
MEM: and $12, . . .
WB: sub $11, .
1
M u x 0
Add
0
M u
IF/ID
ID/EX EX/MEM
10
WB
0 0 0
MEM/WB
1 WB 0
x 1
add
Control
10 000 1100
WB M EX
10
000
1 10 0
Shift left 2
$8 $6
Ad
d reAsudldt
ALUSrc
M
Branch
Read register 1
Read register 2
Read data 1
Address
Instruction memory
8 9 11
Registers Read Write data 2
Write data
$9 $7
0
Read data
Write data
Address memory
register
M u x 1
Zero
ALU ALU result
Data
X
Instruction
[150] Sign X
extend
truction
[2016] X
0
M truction u [1511] 14 13 x 1
ALU control
ALUOp
MemRead
Ins X
Ins 14
12
11
RegDst
lw
sub $11, $2, $3 and $12, $4, $5 or $13, $6, $7 add $14, $8, $9
Clock 6
ECE437, S21 Vijaykumar and Thottethodi
(69)
IF: after<2>
4
PC
lw
sub $11, $2, $3 and $12, $4, $5 or $13, $6, $7 add $14, $8, $9
Clock 7
ID: after<1>
EX: add $14, . . .
MEM: or $13, . . .
WB: and $12,
1
M u x 0
Add
0
M u
IF/ID
ID/EX EX/MEM
MEM/WB
1 WB 0
x 1
Control
00 000 0000
WB M
10
000
1 10 0
10
WB
0 0 0
EX
M
Shift left 2
$8
Ad
Add d result
ALUSrc
Branch
Address
Instru mem
Read register 1
Read register 2
Regist Write
register
Write data
ction ory
Read data 1
ers Read data 2
$9
12
0
Read data
memory
Write data
Address
Data
M u x 1
Zero
ALU ALU result
truction
[15 0] Sign
Ins
truction [2016]
extend
0
M
truction u [1511] 14 x
ALU control
ALUOp
MemRead
Ins
Ins
1
RegDst
13
12
$10, 20($1)
ECE437, S21 Vijaykumar and Thottethodi
(70)
Pipeline Walkthrough with controls
IF: after<3>
4
PC
lw $10, 20($1) sub $11, $2, $3 and $12, $4, $5 or $13, $6, $7 add $14, $8, $9
Clock 8
ID: after<2>
EX: after<1>
MEM: add $14, . . .
WB: or $13, . . .
1
M u x 0
Pipeline Walkthrough with controls
Add
0
M u
IF/ID
EX/MEM
10
WB
0 0 0
MEM/WB
1 WB 0
x 1
Control
00 000 0000
ID/EX
WB M EX
00
000
0 00 0
M
Add reAsudldt
ALUSrc
Shift left 2
Branch
Address
Instru mem
Read register 1
Read register 2
Regis
Write register
Write data
Read data 1
ters Read data 2
ction ory
13
0
M u x 1
Zero
ALU ALU result
Read data
Write data
Ins
0
M truction u [1511] x 1
Address memory
Data
Instruction
[15 0] Sign
ALU control
ALUOp
MemRead
truction [2016]
extend
Ins
R
egDst
14
13
ECE437, S21 Vijaykumar and Thottethodi
(71)
IF: after<4>
4
PC
lw $10, 20($1) sub $11, $2, $3 and $12, $4, $5 or $13, $6, $7 add $14, $8, $9
Clock 9
ID: after<3>
EX: after<2>
MEM: after<1>
WB: add $14, .
1
M u x 0
Add
0
M u
IF/ID
ID/EX EX/MEM
MEM/WB
1 WB 0
x 1
Control
00 000 0000
WB M EX
00
000
0 00 0
00
WB
0 0 0
M
Add Add result
ALUSrc
Shift left 2
Branch
Address
Instru mem
Read register 1
Read register 2
Regis
Write register
Write data
Read data 1
ters Read data 2
ction ory
14
0
Read data
Write data
Address memory
M u x 1
Zero
ALU ALU result
Data
Instruction
[15 0] Sign
ALU control
ALUOp
truction [2016]
Ins
extend
0
M truction u [1511] x 1
MemRead
Ins
Re
gDst
14
ECE437, S21 Vijaykumar and Thottethodi
(72)
18
Instruction
MemtoReg
Instruction
Instruction
MemtoReg
MemtoReg
MemWrite
MemWrite
MemWrite
RegWrite
RegWrite
RegWrite
Instruction
MemtoReg
MemWrite
RegWrite
Implementing Pipeline control
How do we design the control logic block?
Similar to single-cycle implementation Derive logic expressions
E.g. MemtoReg = lw
ALUSrc = lw OR sw
RegWrite = R-type OR lw
Implement Combinational logic PLA implementation
ROM implementation
ECE437, S21 Vijaykumar and Thottethodi (73)
Pipeline registers
Preliminary estimates
IF/ID : IR (32), PC+4 (32) : 64 bits
ID/EX: IR (32), PC+4 (32) + RegA + RegB : 128 bits
EX/MEM: ALUout(32) + zero(1) + PC+4+SX(imm)
(32) : 97
MEM/WB: ALUout (32) + MemData(32) : 64
Corrections:
ALUout and MemData
Destination register (5 bits)
Other control bits (IR not going through)
ECE437, S21 Vijaykumar and Thottethodi (74)
Implementing Pipeline Control
Exercise
Compute required bit-width of pipeline latches
Before the next lecture
Keep memory controller (arbiter) SEPARATE from pipeline (as you did for single-cycle)
In IF and MEM, send request to arbiter
ECE437, S21 Vijaykumar and Thottethodi (75)
Summary
Only slightly more complicated than single cycle
NOT really, seemingly so because
We ignored complications of DEPENDENCIES
Need deeper understanding of dependencies
need to modify datapath as well
ECE437, S21 Vijaykumar and Thottethodi (76)
19
Definitely:
Hard part of pipelining
MUST maintain illusion of sequential execution Execution is actually overlapped.
Pipeline Hazards
structural hazards: attempt to use the same resource two different
ways at the same time
data hazards: attempt to use item before it is ready
instruction depends on result of prior instruction still in the pipeline
control flow hazards: attempt to make a decision before condition is
evaulated
branch instructions
ECE437, S21 Vijaykumar and Thottethodi (77)
Hazards
Structural hazards
Two instructions need the same hardware at the same time
Data Hazards
Data not ready
Control flow Hazards
Which instruction to fetch? Not known.
ECE437, S21 Vijaykumar and Thottethodi (78)
Hazards
Can always resolve hazards by waiting
pipeline control must detect the hazard
take action (or delay action) to resolve hazards
Delays
Pipeline stalls/bubbles Increase CPI
Reduce speedup
ECE437, S21 Vijaykumar and Thottethodi (79)
Single Memory: Structural Hazard
nI Load Mem s
t Instr 1 r.
O Instr 2 r
de I n s t r 3
Time (clock cycles)
Reg Mem Reg
Mem Reg
Mem Reg
M e m
Mem Reg Mem
Reg M e m
r
Instr 4
R e g Mem
Reg
R e g
Mem Reg
Detection is easy in this case! (right half highlight means read, left half write) ECE437, S21 Vijaykumar and Thottethodi (80)
20
ALU
ALU ALU
ALU ALU
Structural Hazards
Single memory (suppose)
1.3 memory accesses per instruction
How?
1 per instruction for instruction fetch Fraction for data load/store
Depends on instruction mix 20% load + 10% store
15% load + 15% store
CPI is at least 1.3 (otherwise memory is used more than 100%)
Solution?
ECE437, S21 Vijaykumar and Thottethodi (81)
Data Hazards
or r8, r1 ,r9 xor r10, r1 ,r11
ECE437, S21 Vijaykumar and Thottethodi (82)
add r1 ,r2,r3 sub r4, r1 ,r3 and r6, r1 ,r7
Hazards on r1
Dependencies backwards in time
Time (clock cycles)
IF ID/RF EX MEM WB add r1,r2,r3 Im Reg Dm Reg
I n
st sub r4,r1,r3 Im Reg
Dm Reg Reg Dm
r. and r6,r1,r7 O
dr or r8,r1,r9
e
r xor r10,r1,r11
ECE437, S21 Vijaykumar and Thottethodi
Im
Reg
Dm Reg
Dm Reg
Im
Reg
Im Reg
(83)
Data Hazard Solution1: Stall
Can always stall until hazard goes away Delay sub and later instrs till add is in WB Increase CPI lose performance
But performance loss bad only if common case
Amdahls law
So what about data hazards?
Think about how code looks
ECE437, S21 Vijaykumar and Thottethodi (84)
21
ALU
ALU ALU
ALU ALU
Data Hazard Solution1: Stall
So what about data hazards? Think about how code looks
Keydifferencebetweencircuitdesignerand computer architect
circuitpeopledonotthinkaboutcode
Architectsthinkaboutinteractionofcodewith
hardware common case interactions VERY COMMON code:
x=a+b;
= use x; /* IMMEDIATELY use x */
ECE437, S21 Vijaykumar and Thottethodi (85)
Data Hazard Solution2: forwarding (a.k.a. bypassing)
Time (clock cycles)
IF ID/RF EX MEM WB
I n
add r1,r2,r3 Im
Reg Dm Reg
Im Reg Dm Reg
Im Reg Dm Im Reg
Im Reg
st sub r4,r1,r3
r. and r6,r1,r7
O
dr or r8,r1,r9
e
r xor r10,r1,r11
Reg
Dm Reg
Dm Reg
What is the EARLIEST point where r1 is available SOMEWHERE in the pipe
ECE437, S21 Vijaykumar and Thottethodi (86)
Datapath w/Forwarding Unit
ID/EX
EX/MEM MEM/WB
M u x
Registers
Forwar
M u
dA
Forwarding unit
ALU
Data memory
M u x
Rs For
Rt
x
wardB
Rt M
Rd u x
EX/MEM.RegisterRd
MEM/WB.RegisterRd
b. With forwarding
Forwarded values from pipeline latches
ECE437, S21 Vijaykumar and Thottethodi (87)
Time (in clock cycles)
Value of CC 1 register $2: 10
Program execution order
(in instructions)
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
CC 2 10
Reg
CC 3 10
Reg
CC 4 10
CC 5 10/20
Reg
Reg
CC 6 20
Reg
Reg
CC 7 20
Reg
DM
CC 8 20
Reg
CC 9 20
Dependence : Backward in time
IM
DM
IM
DM
IM Reg
DM
IM
IM
DM Reg
ECE437, S21 Vijaykumar and Thottethodi
(88)
22
ALU
ALU ALU
ALU ALU
Other types of Data Hazards
Challenge: maintain illusion of sequential
execution
Types of data hazards
RAW (what we have seen), WAR, WAW
RAW (read after write) Data Hazard (what we have seen so far)
IF
DCD
EX
Mem
WB
IF
DCD
EX
Mem
WB
WAW Data Hazard (write after write)
Mem
WAR Data Hazard (write after read)
IF
DCD EX
Mem
WB
IF
DCD OF
Ex
ECE437, S21 Vijaykumar and Thottethodi (90)
IF
DCD
OF
Ex
WB
True dependence : Forward in time
Time (in clock cycles)
CC 1 Value of register $2 : 10 ValueofEX/MEM: X ValueofMEM/WB: X
Program execution order (in instructions)
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
CC 2 10 X X
CC 3 10 X X
Reg
CC 4 10
20 X
CC 5 CC 6 10/20 20 X X 20 X
Reg
Reg
IM Reg DM Reg
CC 7 20 X X
Reg
CC 8 20 X X
CC 9 20 X X
DM
IM Reg
IM
DM
IM Reg
sw $15, 100($2)
ECE437, S21 Vijaykumar and Thottethodi
IM
Reg DM Reg
(89)
DM
Data Hazards
RAW Data Hazard (what we have seen) WAW Data Hazard
Mem
WAR Data Hazard
IF
DCD
EX
Mem
WB
IF
DCD
EX
Mem
WB
IF
DCD EX
Mem
WB
IF
DCD OF
Ex
IF
DCD
Avoidsomebydesign(asdoneinMIPSpipeline)
eliminateWARbyalwaysfetchingoperandsearly(DCD)in
pipe
eliminateWAWbydoingallWBsinlaststage
Detectandresolveremainingones(RAW) stallorforward(ifpossible)
ECE437, S21 Vijaykumar and Thottethodi (91)
OF
Ex
WB
Handling RAW Hazards
Pre-requisite for handling RAW hazard Detection!
Need to know:
Pending writes
available results that havent been written back to
registers
Operand Reads
Later instructions that potentially use these values Some instructions may not write to
register file (store, branch)
ECE437, S21 Vijaykumar and Thottethodi (92)
23
Detecting RAW hazards
Suppose instruction i is in EX and a predecessor instruction j is in a later stage
A RAW hazard may exist on register if Rregs( i ) Wregs( j )
Compare pending writes (for insts in later stages) with operand
A WAW hazard may exist on register if Wregs( i ) Wregs( j )
A WAR hazard may exist on register if Wregs( i ) Rregs( j )
MIPS: RAW hazards only
ECE437, S21 Vijaykumar and Thottethodi (93)
regs of current instruction.
One-minute quiz
Fill in the blanks:
Pipelining improves instruction _________ while possibly worsening instruction ________
Previous quiz Speedup over single-cycle
5-stage pipeline with per-stage delays in terms of the single- cycle clock of 10%, 15%, 20%, 20%, and the remainder
Ontopofthesedelays,thepipelinelatchoverheadis5%of single-cycle clock
Real latch overhead is 1-2% 2.5x
ECE437, S21 Vijaykumar and Thottethodi (94)
Registers
Record of pending writes
ID/EX EX/MEM MEM/WB
M u x
M u
xM u x
ForwardA ALU
Rs ForwardB Rt
RtM
Rd u
x
Current operand registers Pending writes
EX/MEM.RegisterRd Forwarding MEM/WB.RegisterRd
unit
b.Withforwarding
hazard <=((rs == rwex) ((rs == rwwb) ((rt == rwex) ((rt == rwwb)& regWex) OR ((rs == rwmem) & regWme) OR & regWwb) OR& regWex) OR ((rt == rwmem) & regWme) OR & regWwb)ECE437, S’21 Vijaykumar and Thottethodi (95)Data memory Logic equations for Hazard Detection Restatement of equations Text book version WB stage is not really a hazard written in first half of cycle, read in 2nd half EX/MEM.RegisterRd == ID/EX.RegisterRs EX/MEM.RegisterRd == ID/EX.RegisterRt MEM/WB.RegisterRd == ID/EX.RegisterRs MEM/WB.RegisterRd == ID/EX.RegisterRtECE437, S’21 Vijaykumar and Thottethodi (96)24Lookahead: Forwarding datapath We know how to detect RAW hazards Now, Modify Datapath to enable forwarding Desired control behaviorECE437, S’21 Vijaykumar and Thottethodi (97) Base Pipelined Datapath Simplifiedrepresentationofpipelineddatapath ToavoidclutterID/EXEX/MEM MEM/WB RegistersALU Data memoryM u x a. No forwardingECE437, S’21 Vijaykumar and Thottethodi (98)Datapath w/Forwarding UnitID/EXEX/MEM MEM/WB M u x Registers ForwarM udAForwarding unit ALUData memory M u x Rs For RtxwardBRt MRd u xEX/MEM.RegisterRd MEM/WB.RegisterRdb. With forwarding ForwardA/ForwardB: 01->Mem, 10->EX
ECE437, S21 Vijaykumar and Thottethodi (99)
Data Hazards and Forwarding: Walkthrough
Code snippet
identify hazards
identify forwarding paths
ECE437, S21 Vijaykumar and Thottethodi
sub $2, $1, $3 and $4, $2, $5 or $4, $4, $2 add $9, $4, $2
(100)
25
Walkthrough
or $4, $4, $2
and $4, $2, $5
IF/ID
sub $2, $1, $3 ID/EX
before<1>
EX/MEM
before<2>
M
10
WB M EX
10
$1
Control
MEM/WB
WB
M u x
Instruction memory
sub $2, $1, $3 and $4, $2, $5 or $4, $4, $2 add $9, $4, $2
Clock 3
ALU
PC
u Mx
u x
M u x
Skip the boring stuff, jump to cycle 3
ECE437, S21 Vijaykumar and Thottethodi (101)
Forwarding unit
WB M
2 5
$2
$5
Regis
ters
$3
Data memory
2
4
1 3
2
5
add $9, $4, $2
or $4, $4, $2
and $4, $2, $5 ID/EX
sub $2, . . .
EX/MEM
10
before<1>
10
WB M EX
10
Control
WB M
MEM/WB
IF/ID
WB
4 $4$2
6
Regis
ters
Instruction memory
sub $2, $1, $3 and $4, $2, $5 or $4, $4, $2 add $9, $4, $2
Clock 4
ALU
PC
M u x
$2
2 6
4
M u x
M u x
Data memory
$5
2
5
M 4u x
2
Forward ALUOut to Operand 1
ECE437, S21 Vijaykumar and Thottethodi (102)
Forwarding unit
after<1>
PC Instruction memory
sub $2, $1, $3 and $4, $2, $5 or $4, $4, $2 add $9, $4, $2
Clock 5
add $9, $4, $2
or $4, $4, $2 ID/EX
and $4, . . .
EX/MEM
10
sub $2, . . .
10
WB M EX
10
Control
WB M
MEM/WB
WB
IF/ID
1
4 $4$4
M 2u x
2
Regis
ters
ALU
M
$2
$2
Data memory
u Mx
u x
M u x
4 2
9
4 2
4
4
2
Forward ALUout to Op1, Mem to Op2
ECE437, S21 Vijaykumar and Thottethodi (103)
Forwarding unit
after<2> after<1>
add $9, $4, $2
or $4, . . .
EX/MEM
10
and $4, . . .
WB Control M EX
MEM/WB
ID/EX
10
IF/ID
PC
sub $2, $1, $3 and $4, $2, $5 or $4, $4, $2 add $9, $4, $2
Clock 6
M u x
Two candidates match, forward the latest
ECE437, S21 Vijaykumar and Thottethodi (104)
WB M
1
WB
$4
M u x
M u x
M u x
ALU
Forwarding unit
Instruction memory
4
Regis
ters
$2
Data memory
4
2 9
4
4
26
Instruction
Instruction
Instruction
Instruction
ID/EX
Final Datapath
EX/MEM MEM/WB
M u x
M u x
ALUSrc
ALU
Forwarding unit
Registers
M u x
M u x
M u x
For Imm op, ALUsrc mux (Fig 4.57)
ECE437, S21 Vijaykumar and Thottethodi (105)
Data memory
Control for Forwarding
EXhazard
If (EX/MEM.RegWrite AND // not store or branch EX/MEM.RegsterRd != 0 AND // Result is used EX/MEM.RegisterRd = ID/EX.RegisterRs) ForwardA = 10
If (EX/MEM.RegWrite AND EX/MEM.RegsterRd != 0 AND EX/MEM.RegisterRd = ID/EX.RegisterRt) ForwardB = 10
ECE437, S21 Vijaykumar and Thottethodi (106)
Control for Forwarding
MEMhazard
If (MEM/WB.RegWrite AND MEM/WB.RegsterRd != 0 AND MEM/WB.RegisterRd = ID/EX.RegisterRs) ForwardA = 01
If (MEM/WB.RegWrite AND MEM/WB.RegsterRd != 0 AND MEM/WB.RegisterRd = ID/EX.RegisterRt) ForwardB = 01
Doesthisfullymeetourrequirements?
ECE437, S21 Vijaykumar and Thottethodi (107)
Forwarding Summary
ID/EX
EX/MEM MEM/WB
M u x
MM uu xx
M u x
Registers
ALUSrc
ALU
Data memory
M u x
Forwarding unit
Designed forwarding unit to solve RAW hazards for R-type instructions
ECE437, S21 Vijaykumar and Thottethodi (108)
27
Pay attention
Forwardingis
FROM EX/MEM or MEM/WB pipeline latches TOINSIDEEX(NOTtoID/EXlatch)
Othercombos
egFROMunlatchedALUoutputTOINSIDEEX
FROM unlatched ALU output TO ID/EX pipeline latch
or FROM EX/M or M/W pipeline latches to ID/EX latch etc
willnotworkorcausestallsorrequirelargerpipelinelatches
ECE437, S21 Vijaykumar and Thottethodi (109)
Forwarding doesnt solve all RAW hazards
Time (clock cycles)
lw r1,0(r2) sub r4,r1,r3
IF ID/RF EX MEM WB
Im Reg Dm Reg
Im Reg
Dm Reg
What is the EARLIEST point where load data is available SOMEWHERE in the pipe
KEY difference between R-type and ld
Value available: for R-type after EX, for ld after
MEM
Cant solve with forwarding:
Must delay/stall instruction dependent on loads
ECE437, S21 Vijaykumar and Thottethodi (110)
Time (in clock cycles)
Load instruction
Program execution order
(in instructions)
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
CC 1
CC 2
CC 3
CC 4
DM
CC 5
Reg
CC 6
Reg
CC 7
Reg
CC 8
CC 9
IM Reg
IM Reg
D
IM Reg
IM
Reg DM Reg
DM
IM
Reg DM Reg
True dependence backward in time
ECE437, S21 Vijaykumar and Thottethodi
(111)
M
Ld-use hazard Solution
Catch-all solution for hazards
Stall if instruction immediately next to a load is dependent on the load
always works, but hurts performance
Use as last resort
Cant help: load value is unavailable, unlike
R-type instructions Challenge:
Modify pipeline to stall for such hazards
Detect and stall LATER instructions
ECE437, S21 Vijaykumar and Thottethodi (112)
28
ALU ALU
Terminology
Minor change in terminology
If forwarding can solve it, it is not a hazard!
Hazard refers only to true backward dependencies in time
ECE437, S21 Vijaykumar and Thottethodi (113)
Ld-use Detection
Conditions
IMMEDIATELY preceding instruction must read memory
MemRead must be asserted
Destination of preceding instruction
lw $2, 20($1) and $4, $2, $5 or $4, $4, $2 add $9, $4, $2
(rt) must be one of operands of current instruction
Logic equations restate above conditions formally
If( ID/EX.MemRead AND
( (ID/EX.RegRt = IF/ID.RegRs) OR
(ID/EX.RegRt = IF/ID.RegRt) ) ) STALL
ECE437, S21 Vijaykumar and Thottethodi (114)
Stalling the pipeline
Current instruction cannot proceed
Later instructions must be stalled too else they will run into each other
Solution for RAW hazards
inject NOP into ID/EX pipeline latch
Prevent writes to PC and IF/ID register Earlier instructions proceed as usual
GENERAL solution for all stalling
Inject nop instead of stalled instr in next stage freeze later instrs behind in previous stages
continue earlier instrs ahead in later stages
ECE437, S21 Vijaykumar and Thottethodi (115)
Stalling the pipeline
GENERAL solution for all stalling WHILE stalled
inject nop in place of stalled instr in next stage
As many nops as stalls
freeze later instrs behind (in earlier
stages)
As long as stalled
continue earlier instrs ahead (in later stages)
ECE437, S21 Vijaykumar and Thottethodi (116)
29
Instruction memory
Registers
0
Datapath
Hazard detection unit
Control
IF/ID.RegisterRs IF/ID.RegisterRt IF/ID.RegisterRt IF/ID.RegisterRd
ID/EX.RegisterRt
EX/MEM
ID/EX.MemRead
ID/EX
WB
M
u M x
WB
M WB
u Mx
Data memory
MEM/WB
IF/ID
EX
M u x
PC
ALU
M
Rt Rd
Rs Rt
u x
M u x
EX
/MEM.RegisterRd
ECE437, S21 Vijaykumar and Thottethodi
Forwarding unit
(117)
MEM/WB.RegisterRd
and $4, $2, $5
lw $2, 20($1)
before<1>
before<2>
EX/MEM
WB M
before<3>
MEM/WB
WB
Walk-through (1 of 6)
Hazard detection unit
ID/EX.MemRead
1
ID/EX
WB
M
u M x
X
11
Control
0
IF/ID
EX
1 $1
M u x
Instru mem
ction ory
X
Regis
ters
Data memory
PC
ALU
$X M u Mx
lw
$2, 20($1)
1 X 2
u x
Mu x
and $4, $2, $5 or $4, $4, $2 add $9, $4, $2
Clock 2
ID/EX.RegisterRt
Forwarding unit
(118)
Skiptocycle2ldisinDecode
ECE437, S21 Vijaykumar and Thottethodi
or $4, $4, $2
PC Instruction memory
lw $2,20($1) and $4, $2, $5 or $4, $4, $2 add $9, $4, $2
Clock 3
lw $2, 20($1) ID/EX
2
5
2 5
Walk-through (2 of 6)
and $4, $2, $5
Hazard detection unit
IF/ID
All0s=>NOP(MemWr,RegWr,othersdeasserted)
ECE437, S21 Vijaykumar and Thottethodi (119)
before<1>
EX/MEM
WB M
before<2>
ID/EX.MemRead
Control
M u x
00
$2
WB
M EX
11
$1
MEM/WB
WB
0
Regis
ters
M u x
Data memory
ALU
M
$5 $X
21 5 X
u Mx
u x
2M 4u x
ID/EX.RegisterRt
Forwarding unit
or $4, $4, $2
PC
lw $2,20($1)
and $4, $2, $5
bubble ID/EX
lw $2, . . .
EX/MEM
2 5
5
Walk-through (3 of 6)
Hazard detection unit
ID/EX.MemRead
before<1>
WB
M
u Mx
u x
10
WB
00
Control
u x
M
EX
$2 $2
$5 $5
22 5 5
WB M
M 11
MEM/WB
IF/ID
0
2
M u x
Instruction memory
Registers
Da me
ta mory
ALU
nop 44Mu 2
and $4, $2, $5
or $4, $4, $2 add $9, $4, $2
Clock 4
x
ECE437, S21 Vijaykumar and Thottethodi
ID/EX.RegisterRt
Forwarding unit
(120)
30
Instruction
Instruction
Instruction
PCWrite
PCWrite
PCWrite
IF/IDWrite
Instruction
IF/IDWrite
IF/IDWrite
PCWrite
IF/IDWrite
add $9, $4, $2
PC
or $4, $4, $2
and $4, $2, $5
WB
bubble
EM
lw $2, . . .
MEM/WB
Walk-through (4 of 6)
4
Hazard detection unit
ID/EX.MemRead
ID/E
10
2
Control
M0 u M
x
0
11
IF/ID
EX M 4 $4$2
WB
M u x
ALU
M
$2 $5
42
u Mx
u x
lw $2,20($1)
nop M 2
and$4,$2,$5 or $4, $4, $2 add $9, $4, $2
Clock 5
2 5
4 4 ux
ID/EX.RegisterRt
Forwarding unit
LoadvalueforwardedfromMEM/WBregister
ECE437, S21 Vijaykumar and Thottethodi (121)
X
10
EX/M
WB
2
Instruction memory
2
Registers
Data memory
after<1>
PC
lw $2,20($1)
add $9, $4, $2
or $4, $4, $2
and $4, . . .
EX/MEM
bubble
0
4 2
2
Walk-through (5 of 6)
Hazard detection unit
ID/EX.MemRead
ID/
WB
EX
10
Control
M 10 u M
x
10
WB
IF/ID
EX
M
WB
0
MEM/WB
4 $4$4
$2 $2
44
2 2
M u x
Instruction memory
Regis
ters
ALU
M
Data memory
u Mx
u x
nop M 4
and $4, $2, $5 or $4, $4, $2 add $9, $4, $2 Clock 6
94u x
ID/EX.RegisterRt
Forwarding unit
$4valueforwardedfromEX/MEMregister
ECE437, S21 Vijaykumar and Thottethodi (122)
after<2>
after<1>
Hazard detection unit
add $9, $4, $2
or $4, . . .
MEM
and $4, . . .
Walk-through (6 of 6)
ID/EX.MemRead
10
ID/EX
WB
10
M 10 u M
x
0
EX
PC M u x
Control
WB M
MEM/WB
1
IF/ID
WB
$4
M u x
EX/
ALU
Instruction memory
4
Registers
$2
Data memory
lw $2, 20($1) nop
and $4, $2, $5 or $4, $4, $2 add $9, $4, $2
Clock 7
TWOvaluesfor$4,pickmostrecenttoforward
ECE437, S21 Vijaykumar and Thottethodi (123)
4
4
9
M u x
2
M u x
4
ID/EX.RegisterRt
Forwarding unit
RAW Hazard with Loads: Summary
True backward dependencies in time
Need to stall
Dependent instruction & later instructions stalled Earlier instructions can proceed
Stall achieved by
Detecting hazard (remember logic equation)
Inserting NOP (all EX/MEM/WB controls set to 0)
Preventing IF/ID latch and PC from being
overwritten
ECE437, S21 Vijaykumar and Thottethodi (124)
31
Instruction
Instruction
PCWrite
PCWrite
IF/IDWrite
IF/IDWrite
Instruction
PCWrite
IF/IDWrite
Recap : Pipeline Register Widths
IF: add $14, $8, $9
ID: or $13, $6, $7
EX: and $12, . . .
MEM: sub $11, . . .
WB: lw $10, . . .
1 M u x 0
Add
IF/ID = 64 ID/EX = 147 EX/MEM = 107 MEM/WB = 71
Clock 5
4
0
M u
IF/ID
EX/MEM
WB
MEM/WB
1 WB 1
x
1
or
Control
10 000 1100
ID/EX
WB M
10
000 10
10 10 0 00
EX
M
Ad
Add d result
ALUSrc
Shift left 2
$6 $4
Branch
Address
Instruction memory
6 7 10
Read register 1
Read register 2
Regist Write
register
Write data
Read data 1
ers Read data 2
$7 $5
Read data
Write data
Address memory
M u x 1
Zero
ALU ALU result
Ins
X
Ins
X
truction
[150] Sign X
0
0
Data
ALU control
ALUOp
MemRead
truction [20 16]
X
extend
Ins
13
M truction u
[1511] 13 12 x 1
11
10
RegDst
PC
ECE437, S21 Vijaykumar and Thottethodi
(125)
Solution3 for RAW Hazard with Loads
We can use the compiler to insert nops between ld and use to prevent hazards
Then hardware need not detect this hazard But we can do better
Ld $16, 0[$24]
Nop (inserted by compiler or h/w) Add, $17, $16, $18
Sub $23, $8, $9
St $17, 0[$25]
What can you do?
ECE437, S21 Vijaykumar and Thottethodi (126)
Solution3 for RAW Hazard with Loads
Ld $16, 0[$24]
Nop (inserted by compiler or h/w) Add, $17, $16, $18
Sub $23, $8, $9
St $17, 0[$25]
What can you do?
Move sub which is independent of ld and add
ECE437, S21 Vijaykumar and Thottethodi (127)
Recall Hazards
Structural hazards
Two instructions need the same hardware Half/half for REG and ch5 for MEM
Data Hazards
Data not ready
Forward/bypass (stall for loads)
Control flow Hazards next
Which instruction to fetch? Not known. Delayed branch, Predict not taken
ECE437, S21 Vijaykumar and Thottethodi (128)
32
Instruction
MemtoReg
MemWrite
RegWrite
When are conditional branches resolved?
0 M u x 1
Add 4
IF/ID
ID/EX
EX/MEM
B
Add
Add result
ZeZreoro
ALU ALU result
ranch
PCSrc
MemWrite
MEM/WB
MemtoReg
1 M u x 0
RegWrite
Shift left 2
RegDst
PC
Address
Instruction memory
Read register 1
Read register 2
Read data 1
Registers Read Write data 2
Write data
Address
Read
register
Write data
data memory
Data
Instruction [150] 16
truction [20 16]
truction [15 11]
Sign extend
32
ALUSrc
0 M u x 1
6
0 M u x 1
ALU control
Ins
Ins
MemRead
ECE437, S21 Vijaykumar and Thottethodi
(129)
ALUOp
Program execution order
(in instructions)
Time (in clock cycles)
Branch Hazards
CC 1
CC 2
CC 3
CC 4
CC 5
Reg
DM
CC 6
Reg
CC 7
CC 8
CC 9
IM Reg DM
IM Reg
IM Reg
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
Branch resolved in the MEM stage If taken,
PC<- PC + 4 + SX(Imm*4) 40+4+7*4=72DM Reg IM Reg DM Reg IM Reg(130)DM Reg ECE437, S’21 Vijaykumar and Thottethodi Control/Branch Hazards Branch resolved in the MEM stage Which is 3 cycles later but next instruction has to fetched in the next cycle Solution 1: Stall but penalty is 3 bubbles Oh but how common is this? Amdahls law worry about it only if common Every 6th instruction is a branch CPI would go from ~1.2 to 1.2+1/6*3 = 1.7 (why 1.2?) Large performance loss There are 3 other solutions ECE437, S’21 Vijaykumar and Thottethodi (131) Hazards in real microprocessors Some of you may be thinking 1-3 bubbles for loads and branches aint that bad performance impact is not much But real pipelines are 10-15 stages deep, so without doing anything each load and branch would incur 10+ bubbles! Well then why are real pipelines so deep? To get fast clock via many schemes (taught in 437)ECE437, S’21 Vijaykumar and Thottethodi (132)33InstructionControl flow Hazard: Solution2 WecanmoveupbranchdecisiontoDecodebyadding h/w to compare registers as they are being read I nst Add r.O Beqdr And erTime (clock cycles) MemReg MemMem RegMem RegMem RegReg Impact:from3bubblesto1bubbleECE437, S’21 Vijaykumar and Thottethodi (133)Mem Reg One-minute quiz True or false Instructions generate the relevant control signals at every pipeline stage Previous quiz Pipelining improves instruction throughput while possibly worsening instruction latencyECE437, S’21 Vijaykumar and Thottethodi (134) IF.FlushDatapath for branch hazardsard detectionunitHazID/EXWBEX/MEMM u x ControlMuM WB 0xEX MIF/IDMEM/WBWB 4PC Registers =M u xM u xM u xInstruction memoryShi leftft 2 Data memory M u xSign extendALUForwarding unitECE437, S’21 Vijaykumar and Thottethodi (135) Solution 2 move branch decision earlier in pipeline Need additional comparator (r1=r2?) andadder (PC+4+SX(IMM)*4) Recall branch needs both outcome & target flush IF/ID latch to kill potentially incorrect fetch immediately after branch Will this work well?ECE437, S’21 Vijaykumar and Thottethodi (136)34ALU ALUALUBr Solution 2 move branch decision earlier in pipeline Whoa! Value needed in earlier stage what if r1/r2 write is pending? Add r1, r2, r3 Beq r1, 0, target // is this common? Forwarding and/or stalling Forwarding from EX/MEM and MEM/WB intodecode (in addition to forwarding into EX) 1 bubble if no dependence If dependent like above, 2 bubbles with forwarding Is this dependent case common? Too much! (and better solution exists)ECE437, S’21 Vijaykumar and Thottethodi (137) Br Solution 2 Dont use sol.2 in the lab its ugly Ignore the textbook If you insist on using solution2 in the lab (please dont) then you need to latch the branch target at the end of decode (it is not latched in previous slide) else br target will be lost if you stall fetch for some (other) reason (e.g., due to memory structural hazard) WHILE a br in decode proceeds without stallECE437, S’21 Vijaykumar and Thottethodi (138) Can we do anything about br stalls? RAW hazards a problem for both branches and R-types Value produced later but needed earlier R-type: Written in WB and needed in ID Br: Outcome produced in MEM and needed in ID But solutions are fundamentally different There is a key difference between branches and R-types what?ECE437, S’21 Vijaykumar and Thottethodi (139) Can we do anything about br stalls? Solution 3 Predict branch is always not taken MUCH more sophisticated prediction possible Why not predict taken? Soution 4: Delay slots Compilers problem Walkthrough example for solution 3 Predict not takenECE437, S’21 Vijaykumar and Thottethodi (140)35Control flow Hazard: Solution3 Predict:guessonedirectionthenbackupifwrong Predictnottaken I nst Add r.O Beqdr And erMemReg MemTime (clock cycles)Mem RegReg Mem Reg Mem RegMem Reg Impact: 0 bubbles if correct, 1 bubble as before if wrong assuming prediction+sol. #2 (correct say 50%) Moredynamicscheme(correct90+%)ECE437, S’21 Vijaykumar and Thottethodi (141) and $12, $2, $5IF.FlushWalkthrough (1 of 2)beq $1, $3, 7Hazard detection unitsub $10, $4, $8before<1>
EX/MEM
before<2>
72
M u
ID
W
/EX
B
48
x
M
Control uM WB
28x ME 0
IF/ID 48
EX M
M/WB
WB
44
72
Shift left 2
7
Sign extend
$1
=
$4
M u x
M u x
Registers
Data memory
ALU
$3 M
72
4
PC
Instruction memory
44
10
Clock 3
u $8 x
(142)
Flush IF/ID latch
ECE437, S21 Vijaykumar and Thottethodi
Forwarding unit
lw $4, 50($7)
IF.Flush
Walkthrough (2 of 2)
bubble (nop)
Hazard detection unit
beq $1, $3, 7
sub $10, . . .
EX/MEM
before<1>
M
u 76 x
ID
/EX
WB
M
Control uM WB
x
0
=
EX M
IF/ID
76
MEM/WB
WB
72
Shift left 2
Sign extend
$1
M u x
M u x
Registers
Data memory
ALU
$3 x
(143)
76
4
PC
Instruction memory
72
M u
10
Clock 4
ECE437, S21 Vijaykumar and Thottethodi
Forwarding unit
Visualizing sol. #3 prediction+sol. #2
10 lw 14 addI 20 sub 24 beq 30 ori 34 add
100 and
r1, r2(35) r2, r2, 3
r3, r4, r5
r6, r7, 100 r8, r9, 17 r10, r11, r12
r13, r14, 15
these addresses are octal
(144)
ECE437, S21 Vijaykumar and Thottethodi
36
ALU
ALU ALU
Fetch 24, Decode 20, Exec 14, Mem 10
n
WB Ctrl
em Ctrl
M
IR
45
r2
B
M
10 lw
D
14 addI 20 sub 24 beq 30 ori 34 add
100 and
r1, r2(35) r2, r2, 3
r3, r4, r5
r6, r7, 100 r8, r9, 17 r10, r11, r12
r13, r14, 15
ECE437, S21 Vijaykumar and Thottethodi
(145)
Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
IR
D
Fetchori-predictnottaken,resolvebeq ECE437, S21 Vijaykumar and Thottethodi (146)
Mem Ctrl
67
r4
r5
WB Ctrl
14 addI 20 sub 24 beq 30 ori 34 add
100 and
r1, r2(35) r2, r2, 3
r3, r4, r5
r6, r7, 100 r8, r9, 17 r10, r11, r12
r13, r14, 15
10 lw
Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14
IR
D
Squashmispredictedoriturnintonop
ECE437, S21 Vijaykumar and Thottethodi (147)
Mem Ctrl
00
r6
r7
WB Ctrl
14 addI 20 sub 24 beq 30 ori 34 add
100 and
r1, r2(35) r2, r2, 3
r3, r4, r5
r6, r7, 100 r8, r9, 17 r10, r11, r12
r13, r14, 15
10 lw
Fetch 104, Dcd 100, Ex 30, Mem 24, WB 20
WB Ctrl
Mem Ctrl
14 15
IR
r9 x
r1=M[r2+35]
10 lw 14 addI 20 sub 24 beq 30 ori 34 add
100 and
r2 = r2+3
r1, r2(35) r2, r2, 3
r3, r4, r5
r6, r7, 100 r8, r9, 17 r10, r11, r12
r13, r14, 15
D
Nopdoriflowsthroughpipe
ECE437, S21 Vijaykumar and Thottethodi
(148)
37
PC
100
100
beq
3
addI r2, r2, 3
PC
104
17
NOP
sub r3
Next PC
Reg File
Decode
Next PC PC 24
Reg File
Decode
Next PC
Next PC
Mem Access
Mem Access
Data Mem
r2+3
addI r2
Data Mem
Mem Access
Mem Access
Data Mem
r4-r5
sub r3
Data Mem
M[r2+35]
lw r1
Exec
Exec
r4-r5
sub r3
r2+35
lw r1
Reg. File
Reg. File
r1=M[r2+35]
Reg File
Decode
Reg File
Decode
Exec
Exec
xxx
beq
r2+3
addI r2
Reg. File
Reg. File
Inst. Mem
Inst. Mem
NOP
sub r3, r4, r5
Inst. Mem
Inst. Mem
And r13, r14, r15
beq r6, r7 100
PC
30
Fetch 110, Dcd 104, Ex 100, Mem 30, WB 24 n
WB Ctrl
Mem Ctrl
IR
14 15
r15
r1=M[r2+35]
r14
r2 = r2+3 r3 = r4-r5
10 lw 14 addI 20 sub 24 beq 30 ori 34 add
100 and
r1, r2(35) r2, r2, 3
r3, r4, r5
r6, r7, 100 r8, r9, 17 r10, r11, r12
r13, r14, 15
D
ECE437, S21 Vijaykumar and Thottethodi
(149)
Fetch 114, Dcd 110, Ex 104, Mem 100, WB 30 n
Ctrl
D
WB Ctrl
Mem
IR
r1=M[r2+35]
r2 = r2+3 r3 = r4-r5
10 lw 14 addI 20 sub 24 beq 30 ori 34 add
100 and
r1, r2(35) r2, r2, 3
r3, r4, r5
r6, r7, 100 r8, r9, 17 r10, r11, r12
r13, r14, 15
NOPdoesnotdoanywrites
ECE437, S21 Vijaykumar and Thottethodi
(150)
Br Solution 3 prediction+sol. #1
If we assume prediction+sol. #1 then Sol.#1:Brscause3-cyclestall
With prediction (solution 3 )
If correct 0 bubbles and 3 bubbles otherwise (as before)
ECE437, S21 Vijaykumar and Thottethodi (151)
How will sol.#1 datapath change for 3-
bubble mispredictions?
PCSrc
MemWrite
0 M u x 1
IF/ID
ID/EX
EX/MEM
Branch
MEM/WB
Add 4
Shift left 2
Add
c
0
Add result
ZeZreoro
PC
Address
Instruction memory
Read register 1
Read register 2
Read data 1
RegWrite
Registers Read Write data 2
Write data
register
ALUSr
ALU ALU result
MemtoReg
1 M u x 0
M u x 1
Address
Read
Write data
data memory
Data
Instruction [150] 16
Instruction [20 16]
Sign extend
32
6
ALU control
ALUOp
0
M Instruction u [15 11] x 1
ECE437, S21 Vijaykumar and Thottethodi (152)
MemRead
RegDst
38
Instruction
Next PC PC 114
Reg File
Decode
Next PC PC 110
Reg File
Decode
Mem Access
Data Mem
xxx
beq
Mem Access
Data Mem
r9 | 17
NOP
Exec
r9 | 17
NOP
Reg. File
Exec
R14.r15
And r13
Reg. File
xx
And r13
xx
Inst. Mem
Inst. Mem
Dynamic Branch Prediction take solution3 further
Better (i.e., more accurate) than static prediction of not-taken
Branches are biased predictable
Some are mostly taken and others mostly not-
taken
Why? Think of code (architects always think of
s/w)
Give egs of mostly taken and mostly not-taken
ECE437, S21 Vijaykumar and Thottethodi (153)
Dynamic Branch Prediction take solution3 further
~90% of program execution time is spent in ~10% of code (inner loops)
Think of a program loop of N iterations Taken N-1 times
Not taken last time
Then how does h/w learn the bias of each branch?
ECE437, S21 Vijaykumar and Thottethodi (154)
Dynamic Branch Prediction
Taken
Predict taken Taken Not taken
Predict not taken Not taken
1-bit saturating counter
Taken
Predict taken Th
Taken
Predict not taken Ns
Not taken Taken Not taken
Taken
Predict taken Ts
Not taken
Predict not taken Nh
Not taken
2-bit saturating counter
Store each branch instructions history ***
If a branch is taken in recent history, assume the branch is biased to taken and predict taken
One bit saturating counter Two bit counters
ECE437, S21 Vijaykumar and Thottethodi (155)
While (always)
fori=0to3..
blabla
Outcome: 1-bit (init N):
1-bit vs. 2-bit
top: mov ri, 0 loop:blabla
add ri, ri, 1 blt ri, 3 loop j top
TTTNTTTNTTTNTTTN..
NTTTNTTTNTTTNTTT..
2 mistakes per loop: one each at entry, exit
Ignore init mistake
2-bit (init Nh): NhNsThThTsThThThTsThThThTsThThTh..
1 mistake per loop: at exit Ignore init mistake
ECE437, S21 Vijaykumar and Thottethodi (156)
39
One-minute quiz
Remove the load-use stall: Add $16, $16, 4
Ld $24, 0[$16]
Sub $25,$24,$10
Or $25, $25, $11 Previous quiz
Instructions generate the relevant control signals at every pipeline stage
False
Previous outcome vs average bias,
complex control flow in loop?
ECE437, S21 Vijaykumar and Thottethodi (157)
Branch Prediction Table
PC range
Branch prediction table 1- or 2-bit entries
Storeeachbranchshistory*** Notreally
Keepasmalltableindexedbyprogramcounter
PCislarge(32bitnumber)
Many-to-fewmappingfromPCtotableentries
Eachentry1-or2-bitstateofstatemachine
E.g. 2048-entry branch prediction table
Mapping:use11bitsofPCWITHOUTlast2bits(why?)
Problem:Multiplebranchesmaymaptosameentryin table Aliasing
ECE437, S21 Vijaykumar and Thottethodi (158)
Branch Prediction: Branch Target Buffer
PC range
Branch target buffer (BTB)
Unlikestaticpredictionofalwaysnot-taken,dynamic prediction will predict both taken and not-taken
Fortakenprediction,wehaveaproblem(what?)
Thereissomethingaboutbranchtargetsthathelpsus solve the problem (what?)
KeepasmalltableofbrtargetsindexedbyPC(BTB) Nextslide
Many-to-fewmappingfromPCtoBTB
ECE437, S21 Vijaykumar and Thottethodi (159)
Fetch with branch prediction
10 lw 14 addI 20 sub 24 beq 30 ori 34 add
100 and
r1, r2(35) r2, r2, 3
r3, r4, r5
r6, r7, 100 r8, r9, 17 r10, r11, r12
r13, r14, 15
Mem
PC
IF/ID
Fetch
Br Pred
BTB
ECE437, S21 Vijaykumar and Thottethodi (160)
WHILEfetching(potentially)abranch,usebranchPC to look up prediction AND target of branch so NEXT cycle you can fetch taken instruction if predict taken else fetch not-taken instruction back-to-back fetch for both taken or not-taken
beq F | D | X | M | W
and F|D |X nobubble
40
Updating the predictor and BTB
At fetch, you only look up the predictor/BTB to make a prediction and get the br target
Carry prediction/br target down the pipe
When br is resolved (in Mem)
You know the true outcome and target Compare these to predictions
If correct, do nothing
If wrong, update predictor and BTB
Predictor: state machine state change
BTB: write the target for br PC into BTB
All this in Mem WHILE squashing mispredicted instrs Next cycle fetch from correct PC & lookup pred/BTB
Correct~80%(0bubbles)incorrect20%(3bubbles)
ECE437, S21 Vijaykumar and Thottethodi (161)
Prediction accuracy
Always not-taken ~30% whoa! Why so bad?
always taken ~60%
1-bit ~65%,
2-bit ~80%
correlating predictors (~90+%)
Come to grad school, 565
ECE437, S21 Vijaykumar and Thottethodi (162)
Recap
Branch instructions
Control flow hazard
Static branch prediction
Predict not taken
Squash instruction if prediction incorrect
Dynamic Branch prediction
1-bit and 2-bit state machines to track history of branches
Finite table
Potential for aliasing
Multiple branches map to the same predictor
ECE437, S21 Vijaykumar and Thottethodi (163)
Control flow Hazard: Solution4
Redefinebranchbehaviortotakeeffectafternext instruction) delayed branch
I n
st Add r.
O Beq
Mem Reg Mem
Time (clock cycles)
Mem Reg Reg Mem
Mem Reg
Mem Reg
Reg
Mem Reg
Mem Reg
dr Delay-slot e
r Load
Impact:0bubblesperbranchifcanfindinstruction to put in delay slot ( say 60% of time)
Aspipelinesgetdeeper,lessusefulwhy?
ECE437, S21 Vijaykumar and Thottethodi (164)
41
ALU
ALU ALU
ALU
10 lw 14 addI 20 sub 24 ori 30 beq 34 NOP 38 add
100 and
r1, r2(35) r2, r2, 3 r3, r4, r5 r8, r9, 17 r6, r7, 100
r10, r11, r12 r13, r14, 15
10 lw 14 addI 20 sub 24 beq 30 ori 34 add
100 and
r1, r2(35) r2, r2, 3
r3, r4, r5
r6, r7, 100 r8, r9, 17 r10, r11, r12
r13, r14, 15
Easy way*** to hide branch hazard delay
Delayed branch
Instruction after branch always executes
independent instruction from before the branch
Find independent instructions from Taken (target) OR from Not Taken (fall-through) code section
a. From before
Becomes
b. From target
Becomes
(166)
c. From fall through
Becomes
Find an
add $s1, $s2, $s3 if $s1 = 0 then
Delay slot sub $t4, $t5, $t6
*** For Architects ECE437, S21 Vijaykumar and Thottethodi
add $s1, $s2, $s3 if $s2 = 0 then
Delay slot
if $s2 = 0 then add $s1, $s2, $s3
sub $t4, $t5, $t6
add $s1, $s2, $s3 if $s1 = 0 then
Delay slot
add $s1, $s2, $s3 if $s1 = 0 then
sub $t4, $t5, $t6
add $s1, $s2, $s3 if $s1 = 0 then
sub $t4, $t5, $t6
Delayed Branch: Solution 4
Delayedbranch:instafterbranchALWAYSexecuted Brancheffectisdelayedby1cyclehencethename
Invisibletoprogrammer
Compiler and/or assembler transforms code
ECE437, S21 Vijaykumar and Thottethodi (165)
Ideal from before branch
R1 <- M[R2] R2==0? Delay slotR2==0?R1 <- M[R2] Independentinstructionstofilldelayslot Codetransformationmustpreserveoriginalsemantics Correctness Butmaynotfindindependentinstructionfrombeforebranch then try from taken or not-taken pathECE437, S’21 Vijaykumar and Thottethodi (167) From Taken/Not-taken pathsR2 <- M[R2] R2==0?N
Y is more likely (common case)
R2 <- M[R2] R2==0?N R3<-R5+R6 Y
Y
Delay slot
R3<- R5+R6R4<- R3 R3 DEAD in N path and inwingding and beyond DEAD = overwritten before being read (KEY correctness requirement) Delay slot Profitable mostly, not profitable occasionally, but NEVER unsafeECE437, S’21 Vijaykumar and Thottethodi (168)(incorrect)
R4<-R3
42
Two cases if we move from Y path
R2 <- M[R2] R2==0?NYR2 <- M[R2] R2==0?N R3<-R5+R6 Y R3 <- R8-R9 OK (R3 dead in N path)
R3<-R5+R6 R23 <- R3
NOT OK (R3 alive in N path)
(169)
R4<- R3ECE437, S’21 Vijaykumar and ThottethodiR4<-R3R48 <- R3 Recall Hazards Structural hazards Two instructions need the same hardware Half/half for REG and two copies for MEM Data Hazards Data not ready Forward/bypass (stall for loads) Control flow Hazards Which instruction to fetch? Not known. Move up br. decision, Predict not taken, delayedbranchECE437, S’21 Vijaykumar and Thottethodi (170) IF.FlushRecap: Datapath for branch hazardsard detectionunitHazID/EXWBEX/MEMM u x ControlMuM WB 0xEX MIF/IDMEM/WBWB 4PC Registers =M u xM u xM u xInstruction memoryShi leftft 2 Data memory M u xSign extendALUForwarding unitECE437, S’21 Vijaykumar and Thottethodi (172) Common confusion in CPI calculation Ideal CPI = 1, 1/4th loads, 1/6th branches, br prediction accuracy 90% and branches resolved at the END of MEM and assume 50% loads are followed immediately by a dependent instr and by independent instr for rest CPI=1+1/4*1/2*1+1/6*0.1*3 The first 1 captures no stalls for ALL instructions, so later terms are ONLY for stallsECE437, S’21 Vijaykumar and Thottethodi (173)43Common confusion in CPI calculation CPI=1+1/4*1/2*1+1/6*0.1*3 DO NOT do 1(14+1/6)+14*1/2*1+14*1/2 *2+1/6 *0.9*1 + 1/6*0.1*4 correct but laborious 1 + 14*1+1/6*3 Incorrect because double-counts non-stall casesECE437, S’21 Vijaykumar and Thottethodi (174) What next? Exceptions Multiple instructions in flight PC has changed Advanced topics VERY briefly Superscalar, dynamically scheduled processors, etc Real machines Pentium 4 pipeline, Niagara PipelineECE437, S’21 Vijaykumar and Thottethodi (175) ExceptionsException:normal control flow:sequential, jumps, branches, calls, returns Exception=unprogrammedsurprisefunctioncall OStakesactiontohandletheexception must record the address of the offending instruction returns control to user user programSystem Exception Handlerreturn from exception mustsave&restoreuserstate LetsOSgetin/outwithoutbreakinguserprogramas if user program had a virtual CPU to itselfECE437, S’21 Vijaykumar and Thottethodi (176) Interrupt, Exception, Trap? Interrupts caused by external events (eg I/O) asynchronous to program execution may be handled between instructions suspend user program, handle interrupt, resume user program Traps caused by internal events exceptional conditions (overflow) errors (parity) faults (ch. 5) KEY mechanism in all modern computers synchronous to program execution condition must be remedied by the handler for program to proceed instruction may be retried or simulated and program continued, orprogram may be aborted MIPS convention: External : Interrupts Internal : ExceptionECE437, S’21 Vijaykumar and Thottethodi (177)44Importance of exceptions No modern OS would work without exceptions and interrupts Exceptions are uncommon case Butcommoncaseisnoteverything,uncommoncase must be correct Especially if uncommon is needed by OSECE437, S’21 Vijaykumar and Thottethodi (178) Exception Semantics MIPS architecture defines the instruction as having no effect if the instruction causes an exception. When we get to virtual memory (ch 5b) we will see that certain classes of exceptions must prevent the instruction from changing the machine state. This aspect of handling exceptions becomes complex and potentially limits performance => why it is hard
Precise interrupts vs Imprecise interrupts
ECE437, S21 Vijaykumar and Thottethodi (179)
Exceptions
Pipeline Semantics
NO instruction AFTER the excepting instruction may execute
EVERY instruction BEFORE the excepting instruction must complete execution
Sounds similar to what?
Exception handler software saves and
restores registers and other state Different from 362 microcontroller
ECE437, S21 Vijaykumar and Thottethodi (180)
MIPS Exceptions
All exceptions jump to same handler code Cause register
We consider
Illegal instructions
Arithmetic overflows
Hardware behavior
Save PC of offending instruction (How? PC+4 has already been written to PC)
Use special register EPC(why not use $31 like jal?)
Set cause register appropriately (0=ILL; 1=OVF)
Jump to handler at fixed address
ECE437, S21 Vijaykumar and Thottethodi (181)
45
Datapath modifications
Pipeline complications
Which stage is exception detected?
Overflow?
In EX, squash (convert to nop) EX & earlier stages
Illegal Instruction?
In ID, squash (convert to nop) ID & earlier stages Similar to RAW hazard
What about external interrupts?
Overflow in instruction i, illegal instruction in
instruction i+1
Simultaneous exceptions Hardware sorting
ECE437, S21 Vijaykumar and Thottethodi (182)
Walk-through: Code snippet
Main Code
40 sub $11, $2, $4 44 and $12, $2, $5 48 or $13, $2, $1 4C add $1, $2, $1 50 slt $15, $6, $7
ECE437, S21 Vijaykumar and Thottethodi
Exception Code
[Exception handler PC] sw $25, 1000($0)
(183)
lw $16, 50($7)
4000
Clock 5
Walkthrough (1 of 2)
IF.Flu
sh
slt $15, $6, $7
Shift left 2
Sign extend
add $1, $2, $1
WB x 0
or $13, . . .
EX/MEM
and $12, . . .
ID.Flush
EX.Flush
ard detection
unit
trol
Haz
ID/EX
40000040
M u x
M0 0 10 u
4
IF/ID
58
Con
M0010 M0
uM uWB
0x x /WB
0
$6
=
0
Cause 1 EX M B
50 Except PC
M $2 u
x
54
PC
0040
Instruction memory
54
12
Registers
Data memory
MEM
W
ALU
$7 M u M $1 x
u x
M 13 12 15$1 ux
Forwarding un it
All three instructions converted to nop
ECE437, S21 Vijaykumar and Thottethodi (184)
IF.Flush
40000040
4
40000040
4000
x
Walkthrough (2 of 2)
sw $25, 1000($0)
4000
Clock 6
bubble (nop)
ard tion t
bubble
WB x 0
bubble
EX/MEM
or $13, . . .
Haz
Co
ID.Flush
EX.Flush
M u
detec uni
ID/EX
M 00 0 00 u
ntrol M0000 M 00 uM uWB
0044 IF/ID
0x x /WB 001
MEM
W
Shift left 2
Sign extend
=
EXCause M B Except
PC
M u x
PC 0044
Instruction memory
13
Registers
Data memory
ALU
M
u Mx
u x
M u x
13
FetchnextinstructionfromhandlerPC(MIPS)
ECE437, S21 Vijaykumar and Thottethodi (185)
Forwarding un it
46
WB
MEM/WB
WB
M
PC
Instruction memory
Address
Read data
Data memory
Address Read
data data
Write
MemRead
Understanding Performance
Iron law: Insts/prog * CPI * cycletime With pipelining:
CPI ~ 1 (with ideal memory, good branch prediction and few data hazards)
Cycletime : determined by critical path of one stage
Can we do better? CPI<1? How about CPI of 0.25?ECE437, S’21 Vijaykumar and Thottethodi (187)BranchPipelined ProcessorEX.FlushIF.Flush400000404ID.Flush Hazard detection unitM uID/EX M xu WB xIF/IDEX/MEM0 MMControl uM u 0xx0ALUSrcEXCauseExcept PCM u x Shift left 2Read register 1data 1Read register 2Read=data uM xux ALUcontrolALUOpForwarding unit(186) Registers WriteALU register WriteRead data 232M xM u xM u 16 Sign extendInstruction [25 21] Instruction [20 16] Instruction [20 16] Instruction [15 11]RegDst Phew! ECE437, S’21 Vijaykumar and Thottethodi Superscalar Processor What does it mean? Scalar processors (operate on scalar quantities) Vector (operate on vectors) Superscalar: multiple scalar operations in one cycle More than one instruction per cycleECE437, S’21 Vijaykumar and Thottethodi (188)M 40000040 u xPCECE437, S’21 Vijaykumar and ThottethodiM u xM u x(189)ALUALUSuperscalar Datapath4 Instruction memoryReplicate datapath elements Static Multiple issue datapathSignextend SignRegisters M u xWrite dataData memoryAddress extend47InstructionMemtoRegRegWriteMemWriteDynamic Scheduling No need to suffer hazards if other useful work can be achieved Load hazard results in pipeline stall But later instructions are ready Oh! But we cannot execute instructions out of order Not true! Modern CPUs doit all the time!!lw $t0, 20($s2) addu $t1, $t0, $t2 sub $s4, $s4, $t3 slti $t5, $s4, $t3 Execute sub,slti, & later instrs while lw has not completed!ECE437, S’21 Vijaykumar and Thottethodi (190) Dynamic Scheduling Instruction fetch and decode unit Reservation stationReser vation stationReservation stationReser vation stationFunctional unitsInteger…Integer …Floating Load/ point StoreIn-order issueOut-of-order executeIn-order commit Commit unit Instructions can execute when operands are ready Instructions commit when all preceding instructions haveECE437, S’21 Vijaykumar and Thottethodi (191)committed Real machines Lets examine Pentium 4, Core 2, i7 Microarchitecture has evolved Technology has improved Also ARM A3, Sun Niagara, Apple A14 FirestormECE437, S’21 Vijaykumar and Thottethodi (192) Pentium 4 on 0.18 micron 42 million transistors 3GHz Several parts are clocked at half the speed In-order front-end, out-of-order execution, in order retireECE437, S’21 Vijaykumar and Thottethodi(193) 48Pentium 4 pipeline One specific pipeline (misprediction)ECE437, S’21 Vijaykumar and Thottethodi (194)Core 2 45nm Multiple decodes 14 stage pipeline (went as high as ~31 in Pentium 4 line) Many other considerations Pipelining for yield Source: WikipediaECE437, S’21 Vijaykumar and Thottethodi(195) i7 6 uops/cycle 14 stage pipeline 17 cycles br penalty 48 loads and 32 stores in-flightECE437, S’21 Vijaykumar and Thottethodi(196) ARM A3 3-cycle fetch Addr Generation Unit uses BTB 5-cycle decode and 6-cycle EXECE437, S’21 Vijaykumar and Thottethodi (197) 49 Nottoo dissimilar from 437 4threads Eightsuch processors March/April 2005 Issue of IEEE MICROECE437, S’21 Vijaykumar and Thottethodion a chipSun Niagara(198) X86 contender BeatsIntel i7 1185G7 (28 W) and close to AMD Ryzen 9 5950X (49 W) Foriphone! <5W!ECE437, S’21 Vijaykumar and Thottethodi(199)Apple A14 Firestorm Summary Exceptions Know how to handle the easy cases What to squash, what not to Know how complicated exceptions can be Read Chapter 4 NOW Maximize impact Study while lecture material is warm 2-3 hours now vs. 8-12 hours laterECE437, S’21 Vijaykumar and Thottethodi (200)50
Reviews
There are no reviews yet.