Computer Architecture
Course code: 0521292B 06. Pipelining & Hazards
Jianhua Li
College of Computer and Information Hefei University of Technology
Slides are adapted from the computer architecture courses of Wisconsin, Princeton, MIT, Berkeley, etc.
The uses of the slides of this course are for educational purposes only and should be
used only in conjunction with the textbook. Derivatives of the slides must
acknowledge the copyright notices of this and the originals.
More systematically:
Pipelining the execution of multiple instructions
Idea:
Divide the instruction processing cycle into distinct stages of processing
Ensure there are enough hardware resources to process one instruction in each stage
Process a different instruction in each stage simultaneously
Instructions consecutive in program order are processed in consecutive stages
Benefit: Increases instruction processing throughput
Multi-cycle: 4 cycles per instruction. Each instruction passes through F, D, E, W before the next one starts.
Pipelined: 4 cycles per 4 instructions! The F, D, E, W stages of consecutive instructions overlap in time.
[Figure: execution timelines of four instructions over time, multi-cycle vs. pipelined]
Goal: Increase throughput with little increase in cost (hardware cost, in the case of instruction processing)
Repetition of identical operations
The same operation is repeated on a large number of different inputs
Repetition of independent operations
No dependencies between repeated operations
Uniformly partitionable suboperations
Processing can be evenly divided into uniform-latency suboperations
Pipelining a block of combinational logic (F, D, E, M, W) with total delay T psec:
Unpipelined: BW ≈ 1/T
2 stages of T/2 ps each (F,D,E | M,W): BW ≈ 2/T
3 stages of T/3 ps each: BW ≈ 3/T
Throughput
Nonpipelined version with delay T ps: BW = 1/(T + S), where S = latch delay
k-stage pipelined version (T/k ps per stage): BW_k-stage = 1/(T/k + S)
Latch delay reduces throughput: BW_max = 1/(1 gate delay + S)
Cost
Nonpipelined version with combinational cost G gates: Cost = G + L, where L = latch cost
k-stage pipelined version (G/k gates per stage): Cost_k-stage = G + L*k
Latches increase hardware cost
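To see how the throughput and cost formulas trade off against the number of stages k, here is a small sketch in Python; the values of T, S, G, and L below are illustrative assumptions, not numbers taken from the slides.

```python
# Throughput and cost of a k-stage pipeline (illustrative numbers only).
T = 800.0    # total combinational delay in ps (assumed)
S = 50.0     # latch (pipeline register) delay in ps (assumed)
G = 10000    # combinational logic cost in gates (assumed)
L = 200      # latch cost per stage in gates (assumed)

for k in (1, 2, 4, 8, 16):
    cycle = T / k + S            # cycle time of the k-stage version
    bw = 1.0 / cycle             # BW_k-stage = 1 / (T/k + S), instructions per ps
    cost = G + L * k             # Cost_k-stage = G + L*k
    print(f"k={k:2d}  cycle={cycle:6.1f} ps  BW={bw * 1000:5.2f} per ns  cost={cost} gates")
```

As k grows, the latch delay S and the latch cost L*k increasingly dominate, which is why throughput saturates at BW_max = 1/(1 gate delay + S) while cost keeps rising.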
Pipeline:
1. Instruction fetch (IF)
2. Instruction decode and register operand fetch (ID/RF)
3. Execute / evaluate memory address (EX/AG)
4. Memory operand fetch (MEM)
5. Store/writeback result (WB)
[Figure: single-cycle MIPS datapath with control. Components: PC, instruction memory, register file, ALU, data memory, sign extend. Control signals: RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite; PCSrc1 = Jump, PCSrc2 = Br Taken.]
Dividing the datapath into stages (register-file write ignored for now):
IF: Instruction fetch, 200 ps
ID: Instruction decode / register file read, 100 ps
EX: Execute / address calculation, 200 ps
MEM: Memory access, 200 ps
WB: Write back, 100 ps
[Figure: single-cycle datapath annotated with these stage boundaries and latencies]
Is this the correct partitioning? Why not 4 or 6 stages?
Why not different boundaries?
Instruction Pipeline Throughput
Program execution order (in instructions): lw $1, 100($0); lw $2, 200($0); lw $3, 300($0)
Nonpipelined: each load takes 8 ns (Instruction fetch, Reg, ALU, Data access, Reg), and the next load's fetch starts only after 8 ns.
Pipelined: with a 2 ns cycle, a new load starts every 2 ns; the Instruction fetch, Reg, ALU, Data access, and Reg steps of consecutive loads overlap.
[Figure: timing diagrams (time axis in ns) for nonpipelined vs. pipelined execution of the three loads]
5-stage speedup is 4, not 5 as predicted by the ideal model. Why?
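The arithmetic behind the 4x figure, using the stage latencies from the datapath above (200, 100, 200, 200, 100 ps): the clock must accommodate the slowest stage, so the cycle is 200 ps rather than 800/5 = 160 ps. A quick sketch:

```python
# Why the 5-stage speedup is 4, not 5: stages are unbalanced, so the clock
# period is set by the slowest stage, not by (total latency) / 5.
stage_ps = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

total = sum(stage_ps.values())   # 800 ps: latency of the unpipelined design
cycle = max(stage_ps.values())   # 200 ps: pipelined cycle time (slowest stage)

ideal_speedup = len(stage_ps)    # 5, if all stages were perfectly balanced
real_speedup = total / cycle     # 800 / 200 = 4.0

print(f"cycle = {cycle} ps, speedup = {real_speedup} (ideal = {ideal_speedup})")
```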
Pipeline Registers
IF: Instruction fetch | ID: Instruction decode / register file read | EX: Execute / address calculation | MEM: Memory access | WB: Write back
No resource is used by more than 1 stage!
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB inserted between the stages (T/k ps per stage); each register carries the state an instruction needs downstream, e.g., IR_D, PC_D+4, A_E, B_E, Imm_E, PC_E+4, B_M, Aout_M, nPC_M, Aout_W, MDR_W]
All instruction classes must follow the same path and timing through the pipeline stages.
[Figure: lw $10, 20($1) proceeding through the pipeline stages (Instruction fetch, Instruction decode, Execution, Memory, Write back), one per cycle]
Any performance impact?
[Figures: lw $10, 20($1) followed by sub $11, $2, $3 flowing through the pipeline over clock cycles 1 to 6: while lw is in Execution, sub is in Instruction decode; while lw is in Memory, sub is in Execution; while lw is in Write back, sub is in Memory]
Operation View
[Figure: instructions Inst0 to Inst4 in flight; each cycle from t0 to t5 a new instruction enters IF while the older ones advance through ID, EX, MEM, WB. Steady state (full pipeline): all five stages busy every cycle]
Resource View
Rows = pipeline stage, columns = cycle; each entry is the instruction occupying that stage:
       t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10
IF     I0  I1  I2  I3  I4  I5  I6  I7  I8  I9  I10
ID         I0  I1  I2  I3  I4  I5  I6  I7  I8  I9
EX             I0  I1  I2  I3  I4  I5  I6  I7  I8
MEM                I0  I1  I2  I3  I4  I5  I6  I7
WB                     I0  I1  I2  I3  I4  I5  I6
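The resource view above can be regenerated with a tiny sketch; the instruction labels I0, I1, ... and the stall-free (ideal) pipeline are the same assumptions as in the table.

```python
# Print which instruction occupies each stage in each cycle of an ideal,
# stall-free 5-stage pipeline (the same picture as the resource view above).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
N_INSTRUCTIONS = 11          # I0 .. I10
N_CYCLES = 11                # t0 .. t10

for t in range(N_CYCLES):
    row = []
    for depth, stage in enumerate(STAGES):
        i = t - depth        # index of the instruction sitting in this stage at cycle t
        row.append(f"{stage}:I{i}" if 0 <= i < N_INSTRUCTIONS else f"{stage}:--")
    print(f"t{t:<2d} " + "  ".join(row))
```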
[Figure: pipelined datapath with control signals added (PCSrc, Branch, MemWrite, MemRead, MemtoReg, RegWrite, RegDst, ALUSrc, ALUOp) around the IF/ID, ID/EX, EX/MEM, MEM/WB registers]
Identical set of control points as the single-cycle datapath!!
For a given instruction:
same control signals as in the single-cycle design, but
the control signals are required in different cycles, depending on the stage
Option 1: decode once using the same logic as single-cycle and buffer the signals until consumed (see the sketch after the options below)
[Figure: control signals grouped into EX, M, and WB sets, generated by the Control unit in ID and carried forward through the pipeline registers]
Option 2: carry the relevant instruction word/fields down the pipeline and decode locally within each stage, or in a previous stage
Which one is better?
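A minimal sketch of Option 1, assuming the EX / M / WB signal grouping shown in the figure and toy control values for lw and sub (the encodings are illustrative, not taken from the slides): the instruction is decoded once in ID, and the signal groups that have not yet been consumed ride along in the ID/EX, EX/MEM, and MEM/WB registers.

```python
# Option 1 sketch: generate all control signals in ID, then buffer them in the
# pipeline registers until the stage that consumes them.
def decode(opcode):
    """Toy main control: EX / MEM / WB signal groups for two opcodes."""
    if opcode == "lw":
        return {"EX":  {"ALUSrc": 1, "ALUOp": "add", "RegDst": 0},
                "MEM": {"MemRead": 1, "MemWrite": 0, "Branch": 0},
                "WB":  {"RegWrite": 1, "MemtoReg": 1}}
    if opcode == "sub":
        return {"EX":  {"ALUSrc": 0, "ALUOp": "sub", "RegDst": 1},
                "MEM": {"MemRead": 0, "MemWrite": 0, "Branch": 0},
                "WB":  {"RegWrite": 1, "MemtoReg": 0}}
    raise ValueError(opcode)

id_ex = ex_mem = mem_wb = None                 # pipeline registers (None = bubble)

program = ["lw", "sub", "lw"]
for cycle, op in enumerate(program + [None] * 3):          # extra cycles to drain
    mem_wb = {"WB": ex_mem["WB"]} if ex_mem else None      # EX/MEM -> MEM/WB
    ex_mem = ({"MEM": id_ex["MEM"], "WB": id_ex["WB"]}     # ID/EX  -> EX/MEM
              if id_ex else None)
    id_ex = decode(op) if op else None                     # decode once, in ID
    print(f"cycle {cycle}: ID/EX holds {id_ex and set(id_ex)}, "
          f"EX/MEM holds {ex_mem and set(ex_mem)}, MEM/WB holds {mem_wb and set(mem_wb)}")
```

Option 2 would instead carry (part of) the instruction word in each pipeline register and re-derive the signals locally, trading the wider pipeline registers of Option 1 against replicated decode logic.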
[Figure: pipelined datapath with control: the Control unit decodes in ID; the EX, M, and WB control fields travel with the instruction through the ID/EX, EX/MEM, and MEM/WB registers]
Remember:
Goal: Increase throughput with little increase in cost (hardware cost, in the case of instruction processing)
Repetition of identical operations
The same operation is repeated on a large number of different inputs (e.g., all laundry loads go through the same steps)
Repetition of independent operations
No dependencies between repeated operations
Uniformly partitionable suboperations
Processing can be evenly divided into uniform-latency suboperations (that do not share resources)
Instruction Pipeline:
Identical operations ... NOT!
Different instructions do not all need the same stages
Forcing different instructions to go through the same pipe stages → external fragmentation (some pipe stages idle for some instructions)
Uniform suboperations ... NOT!
Different pipeline stages do not have the same latency
Need to force each stage to be controlled by the same clock → internal fragmentation (some pipe stages are too fast but all take the same clock cycle time)
Independent operations ... NOT!
Instructions are not independent of each other
Need to detect and resolve inter-instruction dependencies to ensure the pipeline provides correct results → pipeline stalls (pipeline is not always moving)
Balancing work in pipeline stages
How many stages and what is done in each stage
Keeping the pipeline correct, moving, and full in the presence of events that disrupt pipeline flow
Handling dependences
Data
Control
Handling resource contention
Handling long-latency (multi-cycle) operations
Handling exceptions, interrupts
Advanced: Improving pipeline throughput
Minimizing stalls
Stall: A condition in which the pipeline stops moving. Causes:
Resource contention
Dependences (between instructions)
Data
Control
Long-latency (multi-cycle) operations
Dependence
Also called dependency or, less desirably, hazard
Dependences dictate ordering requirements between instructions
Two types
Data dependence
Control dependence
Resource contention is sometimes called resource dependence
However, this is not fundamental to program semantics.
Resource Contention
Happens when instructions in two pipeline stages need the same resource
Solution 1: Eliminate the cause of contention
Duplicate the resource or increase its throughput
E.g., use separate instruction and data memories (caches)
E.g., use multiple ports for memory structures
Solution 2: Detect the resource contention and stall one of the contending stages
Which stage do you stall?
Example: What if you had a single read and write port for the register file?
Data Dependence
Types of data dependences
Flow dependence (true dependence, read after write)
Output dependence (write after write)
Anti dependence (write after read)
Which ones cause stalls in a pipelined machine?
For all of them, we need to ensure the semantics of the program are correct
Flow dependences always need to be obeyed because they constitute true dependence on a value
Anti and output dependences exist due to the limited number of registers in the CPU
They are dependences on a name, not a value
We will later see what we can do about them
Flow dependence (Read-after-Write, RAW):
r3 ← r1 op r2
r5 ← r3 op r4

Anti dependence (Write-after-Read, WAR):
r3 ← r1 op r2
r1 ← r4 op r5

Output dependence (Write-after-Write, WAW):
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7
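A small sketch that classifies the dependences from an older instruction to a younger one, using the r1..r7 placeholders from the examples above; the (destination, sources) tuple encoding is just for illustration.

```python
# Classify data dependences from an older instruction A to a younger one B.
# Each instruction is (destination_register, set_of_source_registers).
def dependences(a, b):
    a_dst, a_srcs = a
    b_dst, b_srcs = b
    found = []
    if a_dst in b_srcs:
        found.append("flow (RAW)")     # B reads what A writes
    if b_dst in a_srcs:
        found.append("anti (WAR)")     # B writes what A reads
    if a_dst == b_dst:
        found.append("output (WAW)")   # both write the same register
    return found

i1 = ("r3", {"r1", "r2"})   # r3 <- r1 op r2
i2 = ("r5", {"r3", "r4"})   # r5 <- r3 op r4
i3 = ("r1", {"r4", "r5"})   # r1 <- r4 op r5
i4 = ("r3", {"r6", "r7"})   # r3 <- r6 op r7

print(dependences(i1, i2))  # ['flow (RAW)']
print(dependences(i1, i3))  # ['anti (WAR)']
print(dependences(i1, i4))  # ['output (WAW)']
```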
[Figure: lw $10, 20($1) followed by sub $11, $2, $3 flowing through the pipelined datapath again, clock cycles 1 to 6]
What if the SUB were dependent on LW?
Anti and output dependences are easier to handle
write to the destination in one stage and in program order
Flow dependences are more interesting
Four typical ways of handling flow dependences:
Detect and wait until the value is available in the register file
Detect and forward/bypass data to the dependent instruction
Detect and eliminate the dependence at the software level
No need for the hardware to detect dependence
Predict the needed value(s), execute speculatively, and verify
Pipeline Interlocking
Detection of dependence between instructions in a pipelined processor to guarantee correct execution
Software-based interlocking
Hardware-based interlocking
MIPS acronym? (Microprocessor without Interlocked Pipeline Stages)
(1) Scoreboarding
Each register in register file has a Valid bit associated with it
An instruction that is writing to the register resets the Valid bit (set to 0)
An instruction in Decode stage checks if all its source and destination registers are Valid (value is 1)
Yes: No need to stall (no dependence to wait on)
No: Stall the instruction
Advantage:
Simple, 1 bit per register
Disadvantage:
Need to stall for all types of dependences, not only flow dependence.
Can we avoid stalling on anti and output dependences?
What changes would you make to the scoreboard to enable this?
Think about it!
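A minimal sketch of the valid-bit scoreboard described above, assuming 32 registers and a hypothetical lw $10, 20($1) followed by a dependent add $12, $10, $11: the decode-stage check passes only when all source and destination registers are Valid, and issuing a writer clears its destination's bit.

```python
# Valid-bit scoreboard sketch: one Valid bit per register.
NUM_REGS = 32
valid = [True] * NUM_REGS              # all registers start out Valid

def can_issue(srcs, dst):
    """Decode-stage check: proceed only if all sources AND the destination are Valid."""
    return all(valid[r] for r in srcs) and valid[dst]

def issue(dst):
    valid[dst] = False                 # a writer of dst is now in flight

def writeback(dst):
    valid[dst] = True                  # value is back in the register file

# lw $10, 20($1) followed by a dependent add $12, $10, $11 (hypothetical example)
if can_issue(srcs=[1], dst=10):
    issue(dst=10)                      # lw issues; register 10 marked not-Valid
print(can_issue(srcs=[10, 11], dst=12))   # False: the add must stall
writeback(dst=10)                      # lw reaches writeback
print(can_issue(srcs=[10, 11], dst=12))   # True: the add can now proceed
```

Because the check also tests the destination's Valid bit, an instruction stalls on output (WAW) dependences as well, not only when a source value it needs is still in flight, which is the disadvantage noted above.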
(2) Combinational dependence check logic
Special logic that checks whether any instruction in a later pipeline stage is going to write to any source register of the instruction being decoded
Yes: stall the instruction/pipeline
No: no need to stall (no flow dependence)
Advantage:
No need to stall on anti and output dependences
Disadvantage:
Logic is more complex than a scoreboard
Logic becomes more complex as we make the pipeline deeper and wider (think superscalar execution)
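A sketch of the combinational check, assuming the decode stage can see the RegWrite signal and destination register of the instructions currently in EX, MEM, and WB; the register numbers in the example are hypothetical.

```python
# Combinational dependence-check sketch: compare the source registers of the
# instruction in ID against the destinations of instructions in later stages.
def must_stall(id_srcs, ex, mem, wb):
    """Each of ex/mem/wb is (reg_write, dest_reg) or None for a bubble."""
    for stage in (ex, mem, wb):
        if stage is None:
            continue
        reg_write, dest = stage
        if reg_write and dest != 0 and dest in id_srcs:
            return True               # a later-stage instruction will write a source
    return False

# add $12, $10, $11 in ID while lw $10, 20($1) is in EX -> flow dependence, stall
print(must_stall(id_srcs={10, 11}, ex=(True, 10), mem=None, wb=None))  # True
print(must_stall(id_srcs={5, 6},  ex=(True, 10), mem=None, wb=None))   # False
```

Unlike the scoreboard, this check compares only against registers that will actually be written, so it stalls only on flow dependences.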
What do you do afterwards?
Observation: Dependence between two instructions is detected before the communicated data value becomes available
Option 1: Stall the dependent instruction right away
Option 2: Stall the dependent instruction only when necessary → data forwarding/bypassing
Option 3: ...
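A rough cycle-count sketch contrasting Option 1 and Option 2, under common textbook assumptions that these slides have not yet introduced: the register file is written in the first half of WB and read in the second half of ID, ALU results can be forwarded from the EX/MEM and MEM/WB registers, and a load's value exists only after MEM.

```python
# Rough bubble-count sketch for a producer/consumer pair in a 5-stage pipeline.
def bubbles(producer_is_load, distance, forwarding):
    """distance = how many instructions separate producer and consumer (1 = adjacent)."""
    if forwarding:
        # ALU results forward from EX/MEM and MEM/WB; load data exists only after MEM.
        needed_gap = 2 if producer_is_load else 1
    else:
        # Must wait until the value is back in the register file (producer's WB cycle).
        needed_gap = 3
    return max(0, needed_gap - distance)

print(bubbles(producer_is_load=False, distance=1, forwarding=False))  # 2 stall cycles
print(bubbles(producer_is_load=False, distance=1, forwarding=True))   # 0 stall cycles
print(bubbles(producer_is_load=True,  distance=1, forwarding=True))   # 1 stall (load-use)
```

Even with forwarding, a load followed immediately by its consumer still costs one bubble (the load-use stall), which is why Option 2 stalls "only when necessary" rather than never.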
Next Topic: Dependence Handling
Control Dependence
Question: What should the fetch PC be in the next cycle?
Answer: The address of the next instruction
All instructions are control dependent on previous ones. Why?
If the fetched instruction is a non-control-flow instruction:
Next fetch PC is the address of the next-sequential instruction
Easy to determine if we know the size of the fetched instruction
If the fetched instruction is a control-flow instruction:
How do we determine the next fetch PC?
In fact, how do we know whether or not the fetched instruction is a control-flow instruction?
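A minimal sketch of the next-fetch-PC decision, assuming fixed 4-byte instructions: for a non-control-flow instruction PC + 4 is always correct, while for a control-flow instruction the taken/not-taken outcome and the target passed in below are exactly what the fetch stage does not yet know.

```python
# Next fetch PC: trivial for sequential instructions, the whole problem for branches.
def next_fetch_pc(pc, is_control_flow, taken=None, target=None):
    if not is_control_flow:
        return pc + 4      # fixed 4-byte instructions assumed
    # For a control-flow instruction, the outcome and target are resolved only
    # later in the pipeline; needing them at fetch time is the control dependence.
    return target if taken else pc + 4

print(hex(next_fetch_pc(0x1000, is_control_flow=False)))                            # 0x1004
print(hex(next_fetch_pc(0x1000, is_control_flow=True, taken=True, target=0x2000)))  # 0x2000
```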