Lesson 06 Thread-Level Parallelism: Introduction
Introduction
Introduction
Pipelining became universal technique in 1985 Overlaps execution of instructions
Beyond pipelining, Instruction Level Parallelism (ILP) Executes instructions in parallel
There are two main approaches:
Hardware-based dynamic approaches:
Software-based static approaches:
Used in server and Not as successful outside of desktop processors scientific applications
Instruction-Level Parallelism
When exploiting instruction-level parallelism, goal is to maximize CPI
Pipeline CPI =
Ideal pipeline CPI +
Structural stalls + Data hazard stalls +
Control stalls
Parallelism with basic block is limited Typical size of basic block =
3-6 instructions
Must optimize across branches
Data Dependence
Major challenge for Instruction Level Parallelism: Data dependency
Instruction j is data dependent on instruction i if
Instruction i produces a result that may be used
by instruction j
Instruction j is data dependent on instruction
k and instruction k is data dependent on instruction i Dependent instructions cannot be executed simultaneously.
Data Dependence
Dependencies are a property of programs:
Pipeline organization determines if dependence is detected and if it causes a stall
Data dependence conveys:
Possibility of a hazard
Order in which results must be calculated
Upper bound on exploitable instruction level parallelism
Dependencies that flow through memory locations are difficult to detect.
Name Dependence
Two instructions use the same name but no flow of information.
Not a true data dependence, but is a problem when reordering instructions
Anti-dependence: instruction j writes a register or memory location that instruction i reads
Initial ordering (i before j) must be preserved
Output dependence: instruction i and instruction j write the same register or memory location
Ordering must be preserved
To resolve, use register renaming techniques.
Data Hazards:
Read after write (RAW) Write after write (WAW) Write after read (WAR)
Other Factors
Control Dependence:
Ordering of instruction i with respect to a branch instruction
Instruction control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch
An instruction not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch
Control Dependence
ADD R1,R2,R3 BEQ R4,R0,L SUB R1,R1,R6 .
L: .
OR R7,R1,R8
ADD R1,R2,R3 BEQ R12,R0, skip SUB R4,R5,R6 ADD R5,R4,R9
. skip: .
OR R7,R8,R9
Example 1:
OR instruction dependent on ADD and SUB.
Example 2:
Assume R4 isnt used after skip
Possible to move SUB before the BEQ
END OF SECTION
Lesson 06 Instruction-Level Parallelism: Loop Unrolling
Loop Unrolling
Compiler Techniques for Exposing ILP
Loop Unrolling: loop transformation technique to optimize program execution through (statical) pipeline scheduling of dependent instructions
Example:
for (i=999; i>=0; i=i-1)
x[i] = x[i] + s;
Instruction producing result
Instruction using result
Latency in clock cycles
FP ALU op
Another FP ALU op
3
FP ALU op
Store Double
2
Load Double
FP ALU op
1
Load Double
Store Double
0
FP Loop: Where Are the Hazards?
To simplify, assume 8 is lowest address, R1 = base address of x and F2 = s
Loop: L.D F0,0(R1) ADD.D F4,F0,F2
S.D 0(R1),F4 DADDUI R1,R1,-8 BNEZ R1,Loop
;F0=vector element ;add scalar from F2
;store result
;decrement pointer 8B (DW)
;branch R1!=zero
FP Loop Showing Stalls
1 Loop: 2
3
4
5 6 7 8 9
L.D F0,0(R1) stall
ADD.D F4,F0,F2 stall
stall
S.D 0(R1),F4 DADDUI R1,R1,-8 stall
BNEZ R1,Loop
;F0=vector element ;add scalar in F2
;store result
;decrement pointer 8B (DW) ;assumes cant forward to branch ;branch R1!=zero
9 clock cycles: Rewrite code to minimize stalls?
Revised FP Loop Minimizing Stalls
1 Loop: 2
3
4
5 6 7
L.D F0,0(R1) DADDUI R1,R1,-8 ADD.D F4,F0,F2 stall
stall
S.D 8(R1),F4 BNEZ R1,Loop
;altered offset when moving DADDUI
Swap DADDUI and S.D by changing address of S.D
7 clock cycles, but only 3 for execution
(L.D, ADD.D, S.D), 4 for loop overhead; How to make faster?
Loop unrolling:
Unroll by a factor of 4 (assume # elements is divisible by 4)
Eliminate unnecessary instructions (DADDUI and BNEZ)
Optimize (reschedule the code)
1 Loop: 3
6
7
9
12
13
15
18
19
21
24
25
26
L.D F0,0(R1) ADD.D F4,F0,F2 S.D 0(R1),F4 L.D F6,-8(R1) ADD.D F8,F6,F2 S.D -8(R1),F8 L.D F10,-16(R1) ADD.D F12,F10,F2 S.D -16(R1),F12 L.D F14,-24(R1) ADD.D F16,F14,F2 S.D -24(R1),F16 DADDUI R1,R1,#-32 BNEZ R1,Loop
Loop Unrolling
Please Note
Number of live registers vs. original loop.
1Loop: L.D
3 ADD.D
6 S.D
7 L.D
9 ADD.D
12 S.D
13 L.D F10,-16(R1)
15 ADD.D F12,F10,F2
18 S.D -16(R1),F12
19 L.D F14,-24(R1)
21 ADD.D F16,F14,F2
24 S.D -24(R1),F16
25 DADDUI R1,R1,#-32
26 BNEZ R1,Loop
1 cycle stall
2 cycles stall
;drop DSUBUI & BNEZ ;drop DSUBUI & BNEZ ;drop DSUBUI & BNEZ
;alter to 4*8
F0,0(R1) F4,F0,F2 0(R1),F4 F6,-8(R1) F8,F6,F2 -8(R1),F8
27 clock cycles, or 6.75 per iteration
(Assumes R1 is multiple of 4)
Rewrite loop to minimize stalls?
Loop Unrolled 4 Times
Optimize Unrolled Loop by Minimizing Stalls
1 Loop: L.D
2 L.D
3 L.D
4 L.D
5 ADD.D 6 ADD.D 7 ADD.D 8 ADD.D 9 S.D
10 S.D
11 S.D
12 DSUBUI 13 S.D
F0,0(R1)
F6,-8(R1)
F10,-16(R1) F14,-24(R1)
F4,F0,F2
F8,F6,F2
F12,F10,F2 F16,F14,F2
0(R1),F4
-8(R1),F8
-16(R1),F12 R1,R1,#32
8(R1),F16 ; 8-32 = -24
14 clock cycles, or 3.5 per iteration.
14
BNEZ R1,Loop
Loop Unrolling Factor
Do not usually know upper bound of loop
Suppose it is n, and we would like to unroll
the loop to make k copies of the body
Instead of a single unrolled loop, we
generate a pair of consecutive loops:
1st executes (n mod k) times and has a body that is the original loop
2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
For large values of n, most of the execution time will be spent in the unrolled loop
5 Loop Unrolling Decisions
Requires understanding how one instruction depends on another and how the instructions can be changed or reordered given the dependences:
3
Eliminate the extra test and branch instructions and adjust the loop termination and iteration code
1
Determine loop unrolling useful by finding that loop iterations were independent (except for maintenance code)
2
Use different registers to avoid unnecessary constraints forced by using same registers for different computations
4
Determine that loads and stores in unrolled loop can be interchanged by observing that loads and stores from different iterations are independent.
Transformation requires analyzing memory addresses and
making sure that they do not refer to the same address
5
Rescheduling the instructions in the unrolled loop, preserving any dependences needed to yield the same result as the original code
END OF SECTION
Lesson 06 Instruction-Level Parallelism: Dynamic Branch Prediction
Basic Dynamic Branch Prediction
Static Branch Prediction
Rescheduling code around delayed branch
To reorder code around branches, need to predict branch statically when compile Simplest scheme is to predict a branch as taken
Average misprediction = untaken branch frequency = 34% (SPEC benchmarks)
25%
22%
20% 15% 10%
5% 0%
18%
15%
More accurate scheme predicts branches using profile information collected from earlier runs, and modify prediction based on last run
12%
11%
12%
10%
9%
4%
6%
Integer
Floating Point
compress eqntott
espresso gcc
li doduc ear
hydro2d mdljdp su2cor
Misprediction Rate
Dynamic Branch Prediction
Why does prediction work?
Underlying algorithm has regularities
Data that is being operated on has regularities
Instruction sequence has redundancies that are artifacts of way that humans/compilers think about problems
Can we implement a dynamic branch prediction which will be executed at runtime?
There is a small number of important branches in programs which have dynamic behavior
Dynamic Branch Prediction
Performance = (accuracy, cost of misprediction)
Branch History Table (BHT): Lower bits of PC
address index table of 1-bit values
Says whether or not branch taken last time
No address check
Problem: in a loop, 1-bit BHT will cause two
mispredictions (avg is 9 iterations before exit):
End of loop case, when it exits instead of looping as before
First time through loop on next time through code, when it predicts exit instead of looping
Solution: 2-bit scheme where change prediction only if get misprediction twice
Branch Prediction
Basic 2-bit predictor: For each branch:
Predict taken or not taken
If the prediction is wrong two consecutive times,
change prediction
Correlating predictor:
Multiple 2-bit predictors for each branch
One for each possible combination of outcomes of preceding n branches
(m,n) predictor: behavior from last m branches to choose from 2m n-bit predictors
Tournament predictor:
Combine correlating predictor with local predictor
2-Bit Branch Predictor
Idea: Change prediction only if get misprediction twice T
Predict Taken
Predict Not Taken
NT Predict Taken T T NT
NT
T
Predict Not Taken
Green: go, taken
Red: stop, not taken
Adds hysteresis to decision making process
NT
2-Bit Branch Predictor Accuracy
20% 18% 16% 14% 12% 10%
8% 6% 4%
2% 0%
18%
4096 Entries BHT
5%
5%
12%
10%
9%
9% 9%
0% 1%
Integer
Floating Point
Mispredict because either:
Wrong guess for that branch
Got branch history of wrong branch when index the BHT
eqntott espresso
gcc li
spice doduc spice
fpppp matrix300 nasa7
Misprediction Rate
Correlating Branch Prediction
Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table.
In general, a (m,n) predictor means recording last m branches to select between 2m history tables, each with n-bit counters.
Thus, old 2-bit BHT is a (0,2) predictor
Global Branch History: m-bit shift register keeping T/NT status of last m branches.
Each entry in table has 2m n-bit predictors.
Correlating Branch Prediction
(2,2) predictor:
Behavior of recent branches selects between four predictions of next branch, updating just that prediction
Branch address
4
2-bits per branch predictor
Prediction
2-bit global branch history
20% 18% 16% 14% 12% 10%
8%
6%
4%
2%
0%
4096 Entries 2-bit BHT
Unlimited Entries 2-bit BHT
1024 Entries (2,2) BHT
9% 9% 5%
18%
Branch Prediction Performance
12% 11%
10% 6% 6% 5% 6%
5%
4%
1%
1% 0% 1% 1%
Spec89 Benchmarks
nasa7 matrix300
tomcatv doducd spice fpppp
gcc expresso
eqntott li
Frequency of Mispredictions
END OF SECTION
Lesson 06 Instruction-Level Parallelism: Dynamic Branch Prediction
Advanced Dynamic Branch Prediction
The motivation for correlating branch predictors came from the standard 2-bit predictor only using local information.
The tournament predictors use multiple predictors (global and local predictors) and choosing between them.
Tournament Predictors
Tournament predictor using, say, 4K 2-bit counters indexed by local branch address. Chooses between:
Global predictor:
4K entries index by history of last 12 branches (212 = 4K)
Each entry is a standard 2-bit predictor Local predictor:
Local history table: 1024 10-bit entries recording last 10 branches, index by branch address
The pattern of the last 10 occurrences of that particular branch used to index table of 1K entries with 3-bit saturating counters
Advantage of tournament predictor is ability to select the right predictor for a particular branch (e.g., 40% for SPEC integer and less than 15% for SPEC FP).
Tournament Predictors
0/0, 1/0,1/1
0/0, 1/0,1/1
Using multilevel brand prediction can achieve better accuracy and effectively use large number of prediction bits
Use n-bit saturating counter to choose between predictors
Usual choice between global and local predictors
P1/P2 ={0,1}/{0,1}
0: prediction incorrect 1: prediction correct
Use predictor 1
Use predictor 2
1/0
0/1
1/0
0/1
Use predictor 1
0/1 1/0
Use predictor 2
0/0,1/1
0/0,1/1
10-bit shift register
10 10
Most recent branch result (not taken/taken)
Branch Prediction
Branch history
Branch address
Branch history
Branch address
Global predictors
Selector
Local predictors
Exclusive OR
10
1024 2-bit predictors
Prediction
Prediction
gshare Predictor
Tournament Predictor
mux
8%
7% 6%
5% 4%
3% 2% 1%
Local 2-bit predictors
Branch Prediction Performance
Corrrelating predictors Tournaments predictors
0%0 32 64 96 128160192224256288320352384416448480512 Total predictor size
Conditional branch misprediction rate
Tagged Hybrid Predictors
Problem: reduce the size of the tables required for each branch and history
Combines multiple predictors to track if a prediction is associated with the current branch or not
Employs a series of global predictors indexed with different length histories
Solution: tagged hybrid branch predictor
Use hash tables, whose hash value is based
on branch address and branch history
Longer histories may lead to increased chance of hash collision, use multiple tables with increasingly shorter histories
pc
pc
hash
h[0:L(1)]
hash
pc h[0:L(2)]
hash hash
P(2)
pc h[0:L(3)]
hash hash
P(3)
pc h[0:L(4)]
hash hash
P(4)
Tagged Hybrid Predictors
P(0) P(1)
pred
tag
pred
tag
pred
tag
pred
tag
=? =? =? =?
Prediction
Base predictor
8 7 6 5 4 3 2 1 0
SPECfp
SPECint
MultiMedia
Server
Tagged Hybrid Predictors Performance
TAGE gshare
0.59
1.301
3.299
6.607
6.778
6.52
4.45
2.939
Misses per one thousand instructions
Branch Target Buffers (BTB)
Branch hazards are caused by delays needed for calculating the branch condition and the branch target address
Calculation of the branch target address is costly and stalls the instruction fetch
Branch Target Buffer (BTB) stores PCs of recent branch instructions and their calculated target addresses.
The same way as in caches, when a match is found the corresponding Predicted PC
is returned
If the branch was predicted taken, instruction fetch continues at the returned predicted PC
Dynamic Branch Prediction
Branch Target Buffer:
Dynamic branch prediction address of branch index to get prediction AND branch address (if taken)
Branch PC
Predicted PC
PC of Instruction FETCH
Please Note
Must check for branch match now, since cant use wrong branch address
No: branch not predicted, proceed normally (Next PC = PC+4)
=?
Yes: instruction is branch and use predicted PC as next PC
Extra prediction state bits
Branch-Target Buffer
PC of Instruction to fetch
Look up
Branch PC (Tag)
Predicted PC (Data)
Number of entries in Branch-target buffer
=
Yes: then instruction is taken branch and predicted PC should be used as the next PC
No: instruction is not predicted to be a taken branch; proceed normally
Branch-Target Buffer
Send PC to memory and branch-target buffer
IF
Entry
No found in Yes branch-target
buffer?
ID
No
Normal instruction execution
Is
instruction Yes
a taken branch?
Send out predicted PC
No
Taken Yes Branch?
EX
Enter branch instruction address and next PC into branch-buffer
Mispredicted branch, kill fetched instruction: restart fetch at other target: delete entry from target buffer
Branch correctly predicted; continue execution with no stalls
Branch Folding
Optimization:
For unconditional branches store the target instructions themselves in the buffer
Larger branch-target buffer! Branch folding
Add target instruction into buffer to deal with longer decoding time required by larger buffer
Return Address Predictor
Most unconditional branches come from functional returns The same procedure can be called from multiple sites
Causes the buffer to potentially forget about the return address from previous calls
Create return address buffer organized as a stack
Return Address Predictor
70% 60%
50%
40% 30% 20%
10%
0%0 1 2 4 8 16
Go m88ksim cc1 Compress Xlisp ljpeg
Perl Vortex
Return address buffer entries
Misprediction frequency
Integrated Instruction Fetch Unit
Design monolithic unit that performs: Branch prediction
Instruction prefetch
Fetch ahead
Instruction memory access and buffering
Deals with crossing cache lines
Dynamic Branch Prediction (Summary)
Prediction becoming important part of execution
Branch History Table: 2 bits for loop accuracy
Correlation: Recently executed branches correlated with next branch
Either different branches or different executions of same branches
Tournament predictors take insight to next level, by using multiple predictors
Usually one based on global information and one based on local information, and combining them with a selector
Branch Target Buffer: include branch address & prediction
Branch target folding adds addresses for
unconditional branches
Return address buffer to cater for functional returns
END OF SECTION
Lesson 06 Instruction-Level Parallelism: Dynamic Scheduling
Dynamic Scheduling Introduction
Dynamic Scheduling
Rearrange order of instructions to reduce stalls while maintaining data flow.
Advantages Disadvantages
Substantial increase in hardware complexity.
Compiler doesnt need to have knowledge of microarchitecture.
Complicates exceptions handling.
Handles cases where dependencies are unknown at compile time.
Dynamic scheduling implies: Out-of-order execution Out-of-order completion
Example 1
FDIV.D F0,F2,F4 FADD.D F10,F0,F8 FSUB.D F12,F8,F14
Dynamic Scheduling
FSUB.D is not dependent, issue before FADD.D.
Dynamic Scheduling
FADD.D is not dependent, but the anti-dependence in F0 makes it impossible to issue earlier without register renaming.
Example 2
FDIV.D F0,F2,F4 FMUL.D F6,F0,F8 FADD.D F0,F10,F14
Example 3
FDIV.D F0,F2,F4 FADD.D F6,F0,F8 SD.D F6,0(X1) FSUB.D F8,F10,F14 FMUL.D F6,F10,F8
Data-dependence F0
Anti-dependence F8 Output dependence F6
Register Renaming
Name dependence with F0,F6,F8.
Register Renaming
Example 3
FDIV.D F0,F2,F4 FADD.D S,F0,F8 SD.D S,0(X1) FSUB.D T,F10,F14 FMUL.D F6,F10,T
Now only data-dependences (RAW hazards) remain, which can be strictly ordered.
Register Renaming
Register renaming is provided by reservation stations (RS)
Tomasulos Algorithm
Tracks when operands are available Introduces register renaming
in hardware
Minimizes WAW and WAR hazards
Contains:
The instruction
Buffered operand values (when available)
Reservation station number of instruction providing the operand values
END OF SECTION
Lesson 06 Instruction-Level Parallelism: Dynamic Scheduling
Tomasulos Algorithm
Tomasulos Algorithm
Simple pipeline had 1 stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
Tomasulos Algorithm: Split the ID pipe stage of simple 5-stage pipeline into 2 stages:
IssueDecode instructions, check for structural hazards
Read operandsWait until no data hazards, then read operands
Tomasulos Algorithm
Control & buffers distributed with Function Units (FU)
FU buffers called Reservation Stations (RS)
RS fetches and buffers an operand as soon as it becomes available (not necessarily involving register file)
As instructions are issued, the register specifiers are replaced by values or pointers to RS
Form of register renaming
Avoids WAR and WAW hazards
Pending instructions designate the RS to which they will
send their output
May be more reservation stations than registers, so can do optimizations compilers cant
Tomasulos Algorithm
Result values broadcast from RS to all FUs on a result bus, called the common data bus (CDB)
Results do not go through registers Form of bypassing
Only the last output updates the register file
Load and Stores treated as FUs with buffers
Contain data and addresses, act like reservation stations
Integer instructions can go past branches, allowing FP ops beyond basic block in
FP queue
Tomasulos Algorithm
From instruction unit
Instruction queue
Load/Store operations
FP registers
Store buffers
Floating-point operations
Address unit
Load buffers
Address Data
Memory unit
FP adders
Operation bus
321
stations 1
Operand buses
Reservation 2
Common data bus (CDB)
FP multipliers
Issue:
Get next instruction
from FIFO queue
If available RS, issue the instruction to the RS with operand values if available
If operand values not available, stall the instruction
Tomasulos Algorithm
Three Steps:
Execute:
When operand becomes available, store it in any reservation stations waiting for it
When all operands are ready, issue the instruction
Loads and stores maintained in program order through effective address
No instruction allowed to initiate execution until all branches that precede it in program order have completed
Write result:
Write result on CDB into reservation stations and store buffers
Stores must wait until address and value are received
Tomasulos Algorithm Execution
1. Issue: get instruction from FP Op Queue
If reservation station free (no structural hazard),
control issues instruction and sends operands (renames registers).
2. Execute: operate on operands
When both operands ready then execute; if not ready, watch Common Data Bus for result.
3. Write result: finish execution
Write on Common Data Bus to all awaiting units; mark reservation station available.
Normal data bus: data + destination (go to bus)
Common data bus: data + source (come from bus)
64 bits of data + 4 bits of Functional Unit source address
Write if matches expected Functional Unit (produces result) Does the broadcast
Tomasulos Algorithm Example 1
Code Sequence
LD.D F6,32(R2) LD.D F2,44(R3) FMUL F0,F2,F4 FSUB F8,F2,F6 FDIV F0,F0,F6 FADD F6,F8,F2
Instruction status
Instruction Issue
Execute Write result
LD.D F6,32(R2)
LD.D F2,44(R3)
FMUL F0,F2,F4
FSUB F8,F2,F6
FDIV F0,F0,F6
FADD F6,F8,F2
Tomasulos Algorithm Snapshot
Reservation stations
Name Busy
Op Vj Vk Qj Qk A
Load1 No
Load2 Yes
Load 44+[R3]
Add1 Yes
FSUB M[32+ [R2]] Load2
Add2 Yes
FADD Add1 Load2
Add3 No
Mult1 Yes
FMUL [F4] Load2
Mult2 Yes
FDIV M[32+[R2]] Mult1
Field f0 f2 f4 f6 f8 f10 f12 f30
FU Mult1 Load2 Add2 Add1 Mult2
Tomasulos Algorithm Example 2
Loop Iterations
Loop: LD F0,0(R1) FMUL.D F4,F0,F2 SD F4,0(R1)
SUBI R1,R1,#8
BNE R1,R2,Loop // branches if R1 != R2
Instruction From iteration Issue Execute Write result
LD.D F0,0(R1) 1
SD.D F4,0(R1) 1
Tomasulos Algorithm Snapshot
Load1 Yes
Load2 Yes
Add1 No
Add2 No
Add3 No
Mult1 Yes
Mult2 Yes
Store1 Yes
Store2 Yes
Instruction status
FMUL F4, F0,F2 1
LD.D F0,0 (R1) 2
FMUL F4,F0,F2 2
SD.D F4,0(R1) 2
Name Busy
Reservation stations
Op Vj Vk Qj Qk A
Load [R1] + 0
Load [R1] 8
FMUL [F2] Load1
FMUL [F2] Load2
Store [R1] Mult1
Store [R1] 8 Mult2
Field f0 f2 f4 f6 f8 f10 f12 f30
FU Load2 Mult2
From Mem
FP Op Queue
FP Registers
Tomasulos Algorithm Organization
Load1
Load2 Load3 Load4 Load5 Load6
FP Add/Sub: 3 clocks FP Mult: 10 clocks FP Div: 40 clocks
Load Buffers
FP adders
Add1 Add2 Add3
Mult1 Mult2
Reservation Stations
Store Buffers
To Mem
FP multipliers
Common Data Bus (CDB)
Reservation Station Components
Op: Operation to perform in the unit (e.g., + or )
Vj, Vk: Value of Source operands
Store buffers has V field, result to
be stored
Qj, Qk: Reservation stations producing source registers (value to be written)
Store buffers only have Qi for RS producing result
Busy: Indicates reservation station or FU is busy
Register result statusIndicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register.
Tomasulos Algorithm Example
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2
LD.D F2
45+ R3
FMUL F0
F2 F4
FSUB F8
F6 F2
FDIVD F10
F0 F6
FADD F6
F8 F2
Instruction status:
Instruction stream
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
Reservation Stations:
FU Countdown
3 Load/Buffers
3 FP Adder R.S. 2 FP Mult R.S.
F0 F2 F4 F6 F8 F10 F12 F30
FU
Clock
0
Register result status:
Clock cycle counter
Tomasulos Algorithm Example (Cycle 1)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1
LD.D F2
45+ R3
FMUL F0
F2 F4
FSUB F8
F6 F2
FDIVD F10
F0 F6
FADD F6
F8 F2
Instruction status:
Busy Addr
Load1 Yes 34+[R2]
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 F30
FU Load1
Clock
1
Register result status:
Tomasulos Algorithm Example (Cycle 2)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1
LD.D F2
45+ R3 2
FMUL F0
F2 F4
FSUB F8
F6 F2
FDIVD F10
F0 F6
FADD F6
F8 F2
Instruction status:
Busy Addr
Load1 Yes 34+[R2]
Load2 Yes 45+[R3]
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 F30
FU Load2 Load1
Clock
2
Register result status:
Tomasulos Algorithm Example (Cycle 3)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3
LD.D F2
45+ R3 2
FMUL F0
F2 F4 3
FSUB F8
F6 F2
FDIVD F10
F0 F6
FADD F6
F8 F2
Instruction status:
Busy Addr
Load1 Yes 34+[R2]
Load2 Yes 45+[R3]
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F4] Load2
Mult2 No
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 F30
FU Mult1 Load2 Load1
Clock
3
Register result status:
Tomasulos Algorithm Example (Cycle 4)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4
FDIVD F10
F0 F6
FADD F6
F8 F2
Instruction status:
Busy Addr
Load1 No
Load2 Yes 45+[R3]
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 Yes FSUB M[A1] Load2
Add2 No
Add3 No
Mult1 Yes FMUL [F4] Load2
Mult2 No
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 F30
FU Mult1 Load2 M[A1] Add1
Clock
4
Register result status:
Tomasulos Algorithm Example (Cycle 5)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4
FDIVD F10
F0 F6 5
FADD F6
F8 F2
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
3
Add1 Yes FSUB M[A1] M[A2] Load2
Add2 No
Add3 No
10
Mult1 Yes FMUL M[A2] [F4] Load2
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 F30
FU Mult1 M[A2] M[A1] Add1 Mult2
Clock
5
Register result status:
Tomasulos Algorithm Example (Cycle 6)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
2
Add1 Yes FSUB M[A1] M[A2] Load2
Add2 Yes FADD M[A2] Add1
Add3 No
9
Mult1 Yes FMUL M[A2] [F4] Load2
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 F30
FU Mult1 M[A2] Add2 Add1 Mult2
Clock
6
Register result status:
Tomasulos Algorithm Example (Cycle 7)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
1
Add1 Yes FSUB M[A1] M[A2] Load2
Add2 Yes FADD M[A2] Add1
Add3 No
8
Mult1 Yes FMUL M[A2] [F4] Load2
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 F30
FU Mult1 M[A2] Add2 Add1 Mult2
Clock
7
Register result status:
Tomasulos Algorithm Example (Cycle 8)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
3
Add2 Yes FADD (M-M) M[A2]
Add3 No
7
Mult1 Yes FMUL M[A2] [F4] Load2
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 F30
FU Mult1 M[A2] Add2 (M-M) Mult2
Clock
8
Register result status:
Tomasulos Algorithm Example (Cycle 9)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
2
Add2 Yes FADD (M-M) M[A2]
Add3 No
6
Mult1 Yes FMUL M[A2] [F4]
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 F30
FU Mult1 M[A2] Add2 (M-M) Mult2
Clock
9
Register result status:
Tomasulos Algorithm Example (Cycle 10)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
1
Add2 Yes FADD (M-M) M[A2]
Add3 No
5
Mult1 Yes FMUL M[A2] [F4]
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 F30
FU Mult1 M[A2] Add2 (M-M) Mult2
Clock
10
Register result status:
Tomasulos Algorithm Example (Cycle 11)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
4
Mult1 Yes FMUL M[A2] [F4]
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 F30
FU Mult1 M[A2] (M-M+M) (M-M) Mult2
Clock
11
Register result status:
Tomasulos Algorithm Example (Cycle 12)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
3
Mult1 Yes FMUL M[A2] [F4]
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 F30
FU Mult1 M[A2] (M-M+M) (M-M) Mult2
Clock
12
Register result status:
Tomasulos Algorithm Example (Cycle 13)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
2
Mult1 Yes FMUL M[A2] [F4]
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 F30
FU Mult1 M[A2] (M-M+M) (M-M) Mult2
Clock
13
Register result status:
Tomasulos Algorithm Example (Cycle 14)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3 14
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
1
Mult1 Yes FMUL M[A2] [F4]
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 F30
FU Mult1 M[A2] (M-M+M) (M-M) Mult2
Clock
14
Register result status:
Tomasulos Algorithm Example (Cycle 15)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3 14 15
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
40
Mult2 Yes FDIV M*F4 M[A1]
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 F30
FU M*F4 M[A2] (M-M+M) (M-M) Mult2
Clock
15
Register result status:
Faster Than Light Computation (Skip 39 Cycles)
Tomasulos Algorithm Example (Cycle 54)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3 14 15
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
2
Mult2 Yes FDIV M*F4 M[A1]
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 F30
FU M*F4 M[A2] (M-M+M) (M-M) Mult2
Clock
54
Register result status:
Tomasulos Algorithm Example (Cycle 55)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3 15 16
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5 55
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
1
Mult2 Yes FDIV M*F4 M[A1]
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 F30
FU M*F4 M[A2] (M-M+M) (M-M) Mult2
Clock
55
Register result status:
Tomasulos Algorithm Example (Cycle 56)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3 15 16
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5 55 56
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 Yes
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 F30
FU M*F4 M[A2] (M-M+M) (M-M) Result
Clock
56
Register result status:
END OF SECTION
Lesson 06 Instruction-Level Parallelism: Dynamic Scheduling
Tomasulos Algorithm (Dynamic Loop Unrolling)
Tomasulo Algorithm Loop Example
Loop: LD.D F0, 0(R1) FMUL F4, F0, F2
SD.D F4, 0(R1) SUBI R1, R1, #8 BNEZ R1, Loop
This time assume Multiply takes 5 clocks
Assume 1st load takes 8 clocks
(L1 cache miss), 2nd load takes 1 clock (hit)
To be clear, will show clocks for SUBI, BNEZ
Reality: integer instructions ahead of FP Instructions
Show 2 iterations
Iteration count
Instruction status:
Loop Example
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1
1 FMULF4 F0 F2
1 SD.DF4 0 R1
2 LD.DF0 0 R1
2 FMULF4 F0 F2
2 SD.DF4 0 R1
Load2 No
Busy Addr
FU
Load1 No
Load3 No
Store1 No
Store2 No
Store3 No
Added Store Buffers
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
Instruction Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU
Clock R1 0 80
Register result status:
Value of Register used for address, iteration control
Loop Example (Cycle 1)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
Load2 No
Busy Addr
FU
Load1 Yes 80
Load3 No
Store1 No
Store2 No
Store3 No
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU Load1
Clock R1 1 80
Register result status:
Loop Example (Cycle 2)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
Load2 No
Busy Addr
FU
Load1 Yes 80
Load3 No
Store1 No
Store2 No
Store3 No
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU Load1 Mult1
Clock R1 2 80
Register result status:
Loop Example (Cycle 3)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
Busy Addr
Load1 Yes 80
FU
Load2 No
Load3 No
Store1 Yes 80
Mult1
Store2 No
Store3 No
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU Load1 Mult1
Clock R1 3 80
Register result status:
Loop Example (Cycle 4)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
Load2 No
Busy Addr
FU
Load1 Yes 80
Load3 No
Store1 Yes 80
Store2 No
Store3 No
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU Load1 Mult1
Clock R1 4 80
Register result status:
Loop Example (Cycle 5)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
Load2 No
Busy Addr
FU
Load1 Yes 80
Load3 No
Store1 Yes 80
Store2 No
Store3 No
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU Load1 Mult1
Clock R1 5 72
Register result status:
Loop Example (Cycle 6)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6
Busy Addr
Load2 Yes 72
FU
Load1 Yes 80
Load3 No
Store1 Yes 80
Store2 No
Store3 No
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU Load2 Mult1
Clock R1 6 72
Register result status:
Loop Example (Cycle 7)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6
2 FMULF4 F0 F2 7
Busy Addr
Load2 Yes 72
FU
Load1 Yes 80
Load3 No
Store1 Yes 80
Store2 No
Store3 No
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 Yes FMUL [F2] Load2
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU Load2 Mult2
Clock R1 7 72
Register result status:
Loop Example (Cycle 8)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Busy Addr
Load1 Yes 80
FU
Load2 Yes 72
Load3 No
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 Yes FMUL [F2] Load2
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU Load2 Mult2
Clock R1 8 72
Register result status:
Loop Example (Cycle 9)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Busy Addr
Load1 Yes 80
FU
Load2 Yes 72
Load3 No
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 Yes FMUL [F2] Load2
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU Load2 Mult2
Clock R1 9 72
Register result status:
Loop Example (Cycle 10)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 Yes 72
Load3 No
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL M[80] [F2]
Mult2 Yes FMUL [F2] Load2
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU Load2 Mult2
Clock R1 10 64
Register result status:
Loop Example (Cycle 11)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL M[80] [F2]
Mult2 Yes FMUL M[72] [F2]
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU Load3 Mult2
Clock R1 11 64
Register result status:
Loop Example (Cycle 12)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL M[80] [F2]
Mult2 Yes FMUL M[72] [F2]
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU Load3 Mult2
Clock R1 12 64
Register result status:
Loop Example (Cycle 13)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL M[80] [F2]
Mult2 Yes FMUL M[72] [F2]
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU Load3 Mult2
Clock R1 13 64
Register result status:
Loop Example (Cycle 14)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL M[80] [F2]
Mult2 Yes FMUL M[72] [F2]
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU Load3 Mult2
Clock R1 14 64
Register result status:
Loop Example (Cycle 15)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14 15
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7 15
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 No
M[80]*F2
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
1 Mult2 Yes FMUL M[72] [F2]
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU Load3 Mult2
Clock R1 15 64
Register result status:
Loop Example (Cycle 16)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14 15
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7 15 16
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 No
M[80]*F2
M[72]*F2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
5 Mult1 Yes FMUL [F2] Load3
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU Load3 Mult2
Clock R1 16 64
Register result status:
Loop Example (Cycle 17)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14 15
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7 15 16
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 Yes 64
M[80]*F2
M[72]*F2
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
4 Mult1 Yes FMUL [F2] Load3
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU Load3 Mult2
Clock R1 17 64
Register result status:
Loop Example (Cycle 18)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14 15
1 SD.DF4 0 R1 3 18
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7 15 16
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 Yes 64
M[80]*F2
M[72]*F2
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
3 Mult1 Yes FMUL [F2] Load3
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU Load3 Mult2
Clock R1 18 64
Register result status:
Loop Example (Cycle 19)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14 15
1 SD.DF4 0 R1 3 18 19
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7 15 16
2 SD.DF4 0 R1 8 19
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 No
Store2 Yes 72
Store3 Yes 64
M[72]*F2
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
2 Mult1 Yes FMUL [F2] Load3
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU Load3 Mult2
Clock R1 19 56
Register result status:
Loop Example (Cycle 20)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14 15
1 SD.DF4 0 R1 3 18 19
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7 15 16
2 SD.DF4 0 R1 8 19 20
Load2 No
Busy Addr
FU
Load1 Yes 56
Load3 Yes 64
Store1 No
Store2 No
Store3 Yes 64
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
1 Mult1 Yes FMUL [F2] Load3
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 F30
FU Load3 Mult2
Clock R1 20 56
Register result status:
Why Can Tomasulo Overlap Iterations of Loops?
Register renaming
Multiple iterations use different physical destinations for registers (dynamic loop unrolling).
Reservation stations
Permit instruction issue to advance past integer control flow operations.
Also buffer old values of registers totally avoiding the WAR stalls.
Other perspective: Tomasulo building data flow dependency graph on the fly
Major Advantages of Tomasulos Algorithm
The distribution of the hazard detection logic:
Distributed reservation stations and the CDB
If multiple instructions waiting on single result, and each instruction has other operand, then instructions can be released simultaneously by broadcast on CDB
If a centralized register file is used, the units would have to read their results from the registers when register buses are available
The elimination of stalls for WAW and WAR hazards
What about Precise Interrupts?
State of machine looks as if no instruction beyond faulting instructions has issued.
Tomasulo had:
In-order issue, out-of-order execution, and out-of-order completion
Need to fix the out-of-order completion aspect so that we can find precise breakpoint in instruction stream.
END OF SECTION
Lesson 06 Instruction-Level Parallelism: Dynamic Scheduling
Tomasulos Algorithm with Speculation
Relationship between Precise Interrupts and Speculation
Speculation: guess and check
Important for branch prediction:
Need to take our best shot at predicting branch direction
If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly:
This is exactly same as precise exceptions!
Technique for both precise interrupts/exceptions
and speculation:
In-order completion or commit
Hardware-Based Speculation
Execute instructions along predicted execution paths but only commit the results of prediction when correct
Instruction commit: allowing an instruction to update the register file when instruction is no longer speculative
Need an additional piece of hardware to prevent any irreversible action until an instruction commits (i.e., updating state or taking an execution)
Reorder Buffer
Reorder Buffer (ROB)
Register values and memory references are not written until an instruction commits.
On misprediction:
Speculated entries in ROB are cleared
Exceptions:
Not recognized until it is ready to commit
Reorder Buffer Operation
ROB holds instructions in FIFO order, exactly as issued
Modify reservation stations:
Operand source is now ROB instead of functional unit
When instructions complete, results placed into ROB
Supplies operands to other instruction between execution complete & commitmore registers like RS
Tag results with ROB buffer number instead of reservation station
Instructions commit values at head of ROB placed in registers (or memory locations)
As a result, easy to undo speculated instructions on mispredicted branches or on exceptions
Commit path
FP
Op Queue
Reorder Buffer
FP Regs
Res Stations
Res Stations
FP Adder
FP Adder
Reorder Buffer Entry Fields
Each entry in the ROB contains four fields:
1 Instruction type
A branch (has no destination result), a store (has a memory address destination), or a register operation (FP operation or load, which
has register destinations)
2 Destination
Register number (for loads and FP operations) or memory address (for stores) where the instruction result should be written
3 Value
Value of instruction result until the instruction commits
4 Ready
Indicates that instruction has completed execution, and the value is ready
Reorder Buffer
Entry Busy Instruction State Destination Value
1 No LD.D F6,32(R2) Commit F6 Mem[32+R2]
2 No LD.D F2, 44(R3) Commit F2 Mem[44+R3]
3 Yes FMUL F0,F2,F4 Write result
F0 #2 x F4
4 Yes FSUB F8,F2,F6 Write result F8 #2 #1
5 Yes FDIV F0,F0,F6 Execute F0
6 Yes FADD F6,F8,F2 Write result F6 #4 + #2
Reservation stations
Name Busy
OP Vj Vk
Qj Qk Dest A
Load1 No
Load2 No
Add1 No
Add2 No
Add3 No
Mult1 No
FMUL Mem[44+R3] F4
#3
Mult2 Yes
FDIV Mem[32+R2]
#3 #5
FP register status
Field F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 F10
Reorder # 3 6 4 5
Busy Yes No No No No No Yes Yes No Yes
Adding Speculation to Tomasulo
In non-speculative Tomasulos Algorithm, once an instruction writes its result, any subsequently issued instructions will find result in the register file
With speculation, the register file is not updated until the instruction commits
(we know definitively that the instruction should execute)
Thus, the ROB supplies operands in interval between completion of instruction execution and instruction commit
ROB is a source of operands for instructions, just as reservation stations (RS) provide operands in Tomasulos Algorithm
Revised Tomasulos Algorithm
1 Issue get instruction from FP Op Queue (sometimes called dispatch)
If reservation station and ROB slot free, issue instruction and send operands and ROB no. for destination
2 Execution operate on operands
When both operands ready then execute; if not ready, watch CDB for result; when both operand values in reservation station, execute; checks RAW
3 Write result finish execution
Write result and ROB tag on CDB to all awaiting FUs and ROB; Mark reservation station available
4 Commit update register with reorder result (sometimes called graduation)
When instruction reaches head of ROB and result present, update register with result (or store to memory) and remove instruction from ROB. When a mispredicted branch reaches head of ROB, discard all entries (flush ROB)
Address unit
123
Store address
Instruction queue
Load/Store operations
Floating-point operations
Load buffers
123
Reservation stations
From instruction unit
Operation bus
Reorder buffer
Reg # Data
FP registers
Operand buses
12 Common data bus (CDB)
Memory unit
FP adders
FP multipliers
Architecture of Speculative Tomasulo
Floating point addition: 2 Clock Cycles Floating point division: 20 Clock Cycles Load operation: 3 Clock Cycles
Cache miss penalty: 10 Clock Cycles
Store data
Load data
Speculative Tomasulo Example (Cycle 1)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
1
10+R2
FP adders
FP multipliers
Speculative Tomasulo Example (Cycle 2)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
1
10+R2
FP adders
FP multipliers
Speculative Tomasulo Example (Cycle 3)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
1
10+R2
FP adders
FP multipliers
Speculative Tomasulo Example (Cycle 4)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
BNE F2,<…>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
1
10+R2
5
0+R3
FP adders
FP multipliers
Speculative Tomasulo Example (Cycle 5)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
F4
LD.D F4,0(R3)
N
BNE F2,<…>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
1
10+R2
5
0+R3
FP adders
FP multipliers
Speculative Tomasulo Example (Cycle 6)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
F0
FADD F0,F4,F6
N
F4
LD.D F4,0(R3)
N
BNE F2,<…>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
6
FADD
ROB5, R(F6)
1
10+R2
5
0+R3
FP adders
FP multipliers
Speculative Tomasulo Example (Cycle 7)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
ROB5
SD.D 0(R3),F4
N
F0
FADD F0,F4,F6
N
F4
LD.D F4,0(R3)
N
BNE F2,<…>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
6
FADD
ROB5, R(F6)
1
10+R2
5
0+R3
FP adders
FP multipliers
Speculative Tomasulo Example (Cycle 8)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
M[10]
SD.D 0(R3),F4
Y
F0
FADD F0,F4,F6
N
F4
M[10]
LD.D F4,0(R3)
Y
BNE F2,<…>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
3
FDIV
ROB2,R(F6)
6
FADD
R(F4),ROB1
M[10], R(F6)
1
10+R2
FP adders
FP multipliers
Speculative Tomasulo Example (Cycle 9)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
M[10]
SD.D 0(R3),F4
Y
F0
FADD F0,F4,F6
Ex
F4
M[10]
LD.D F4,0(R3)
Y
BNE F2,<…>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
6
FADD
M[10], R(F6)
1
10+R2
FP adders
FP multipliers
Speculative Tomasulo Example (Cycle 10)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
M[10]
SD.D 0(R3),F4
Y
F0
FADD F0,F4,F6
Y
F4
M[10]
LD.D F4,0(R3)
Y
BNE F2,<…>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
1
10+R2
FP adders
FP multipliers
Speculative Tomasulo Example (Cycle 11)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
M[10]
SD.D 0(R3),F4
Y
F0
FADD F0,F4,F6
Y
F4
M[10]
LD.D F4,0(R3)
Y
BNE F2,<…>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
M[20]
LD.D F0,10(R2)
Y
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
What about memory hazards?
3
FDIV
ROB2,R(F6)
FP adders
FP multipliers
Avoiding Memory Hazards
WAW and WAR hazards through memory are eliminated with speculation because actual updating of memory occurs in order, when a store is at head of the ROB, and hence, no earlier loads or stores can still be pending
RAW dependence through memory are maintained by two restrictions:
1
Not allowing a load to initiate the second step of its execution if any active ROB entry occupied by a store has a Destination field that matches the value of the address field of the load.
2
Maintaining the program order for the computation of an effective address of a load with respect to all earlier stores.
These restrictions ensure that any load that accesses a memory location written by an earlier store cannot perform the memory access until the store has written the data
Speculation for Greater ILP
Greater ILP: Overcome control dependence by hardware speculating on outcome of branches and executing program as if guesses were correct.
Speculation
fetch, issue, and execute instructions as if branch predictions were always correct
Dynamic scheduling
only fetches and issues instructions
Essentially a data flow execution model: Operations execute as soon as their operands are available.
Speculation for Greater ILP
3 components of HW-based speculation
1 2
3
Dynamic branch prediction to choose which instructions to execute
Speculation to allow execution of instructions before control dependences are resolved
Ability to undo effects of incorrectly speculated sequence
Dynamic scheduling to deal with scheduling of different combinations of basic blocks
Exceptions and Interrupts
IBM 360/91 invented imprecise interrupts
Computer stopped at this PC; its likely close to this address Not so popular with programmers
Also, what about Virtual Memory? (Not in IBM 360)
Technique for both precise interrupts/exceptions and speculation: in-order completion and in-order commit
If we speculate and are wrong, we need to back up and restart execution to point at which we predicted incorrectly
This is exactly what is needed to implement a precise exception
Exceptions are handled by not recognizing the exception until instruction that caused it is ready to commit in ROB
If a speculated instruction raises an exception, the exception is recorded in the ROB This is why reorder buffers are in all new processors
END OF SECTION
Lesson 06 Instruction-Level Parallelism: Multiple Instruction Issue
Multiple Instructions Issue
Multiple Issue and ILP
To achieve CPI < 1, need to complete multiple instructions per clock.Solutions:Multi-way issue superscalar processors VLIW (very long instruction word) processors Multi-word vector processorsAnticipated success of multiple instructions lead to Instructions Per Clock cycle (IPC) vs. CPIMain motivation of ILP is to reduce the amount of stall in the pipelined execution of instructions In particular for multi-cyle operationsMain goal of ILP is to get performance close to ideal CPI = 1Getting CPI < 1:Issuing Multiple Instructions/Cycle Superscalar Varying no. instructions/cycle (1 to 8) scheduled statically by compiler or dynamically in hardware (Tomasulo) IBM PowerPC, Sun UltraSparc, Pentium 4 Very Long Instruction Words VLIW Fixed number of instructions (4-16) scheduled by the compiler; put ops into wide templates Intel Architecture-64 (IA-64) 64-bit address Renamed: Explicitly Parallel Instruction Computer (EPIC) Vector Processing Explicit coding of independent loops as operations on large vectors of numbers Multimedia instructions being added tomany processors Scalar Pipeline ExecutionSuccessive Instructions0 1 2 3 4 5 6 7 8 9 10 11 12 13Time in Base CyclesFetch Decode Execute Write backSuperscalar of Degree 3Fetch Decode Execute Write back0123456789Time in Base Cycles VLIW Execution with Degree 3Fetch Decode Execute Write backExecute3 operations0123456789Time in Base Cycles Vector Execution with Degree 3FetchDecodeRead vectorExecuteWrite backWrite vector 0123456789Time in Base Cycles Multiple Issue Common Issue name structure Hazard Distinguishingdetection Scheduling characteristic Example Superscalar Dynamic (static) Hardware Static In-order execution Mostly in the embedded space: MIPS and ARM, including theCortex-A53 Superscalar Dynamic (dynamic) Hardware Dynamic Some out-of-order None at the present execution, but nospeculation Superscalar Dynamic (speculative) Hardware Dynamic with Out-of-order execution Intel Core i3, i5: AMD Phenom: speculation with speculation IBM Power 7 VLIW/LIW Static Primary Static All hazards determined Most examples are in signal software and indicated by compiler processing, such as the TI C6x(often implicitly) EPIC Primarily static Primary Mostly static All hazards determined Itanium software and indicated explicitlyby the compilerEND OF SECTION Lesson 06 Instruction-Level Parallelism: Multiple Instruction Issue Very Long Instruction Word (VLIW)VLIW ProcessorsExample VLIW bundle: One integer instruction (or branch) Two independent floating-point operations Two independent memory referencesPackage multiple operations into one instruction. Must be enough parallelism in code to fill the available slots. VLIW Pipelinei tIFIDEXEXMEMWB EXEX IFIDEXMEMWB EX EX EX IFIDEX…MEMWBInstruction FormatVLIW Organization FP Add FP Mult Int ALU Branch Load/StoreInstruction Issue UnitFP Mult FP AddInt ALUBranchLoad/ StoreRegister FileVLIW: Very Long Instruction Word Each instruction has explicit coding for multiple operations In the IA-64 architecture, grouping called a packet In the Transmeta processor, grouping called a molecule (with atoms as ops) Tradeoff instruction space for simple decoding The long instruction word has room for many operations By definition, all the operations the compiler puts in the long instruction word are independent executein parallel E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch 16 to 24 bits per field716 or 112 bits to 724 or 168 bits wide Need compiling technique that schedules across several branches Recall: Unrolled Loop that Minimizes Stalls for Scalar 1 Loop: LD.D 2 LD.D 3 LD.D 4 LD.D 5 FADD 6 FADD 7 FADD 8 FADD 9 SD.D 10 SD.D 11 SD.D 12 SUBUI 13 BNEZ 14 SD.DF0,0(R1) F6,-8(R1) F10,-16(R1) F14,-24(R1) F4,F0,F2 F8,F6,F2 F12,F10,F2 F16,F14,F2 F4,0(R1) F8,-8(R1) F12,-16(R1), R1,R1,#32 R1,Loop F16,8(R1); 8-32 = -24 LD.D to FADD: 1 Cycle FADD to SD.D: 2 Cycles 14 clock cycles, or 3.5 per iteration. Loop Unrolling in VLIWMemory Memory reference 1 reference 2 FP FP Integeroperation 1 operation 2 operation/branchClock LD.D F0,0(R1) LD.D F6,-8(R1) 1 LD.D F10,-16(R1) LD.D F14,-24(R1) 2LD.D F18,-32(R1) LD.D F22,-40(R1) FADD F4,F0,F2 FADD F8,F6,F2 3LD.D F26,-48(R1) FADD F12,F10,F2 FADD F16,F14,F2 4FADD F20,F18,F2 FADD F24,F22,F25 SD.D F4, 0(R1) SD.D F8,-8(R1) FADD F28,F26,F26 SD.D F12,-16(R1) SD.D F16,-24(R1) SUBUI R1,R1,#567 SD.D F20,24(R1) SD.D F24,16(R1)8SD.D F28,8(R1) BNEZ R1,Loop9 Unrolled 7 times to avoid delays 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X) Average: 2.5 ops per clock, 50% efficiencyPlease NoteNeed more registers in VLIW (15 vs. 6 in single scalar)Problems with 1st Generation VLIWVLIW have not been successful for commercial processors due to several reasons: Increase in code size Generating enough operations in a straight-line code fragment requires ambitious unrolling of loops Whenever VLIW instructions are not full, unused functional units translate to wasted bits in instruction encoding Operated in lock-step; no hazard detection HW A stall in any functional unit pipeline causes entire processor to stall, since all functional units must be kept synchronized Compiler might predict function units, but caches hard to predict Binary code compatibility Pure VLIW different numbers of functional units and unit latencies require different versions of the codeEND OF SECTION Lesson 06 Instruction-Level Parallelism: Multiple Instruction Issue Dynamic Scheduling, Multiple Issue, and SpeculationDynamic Scheduling, Multiple Issue, and Speculation Modern architectures: Dynamic scheduling + multiple issue+ speculation Two approaches: Assign reservation stations and update pipelinecontrol table in half clock cycles Only supports 2 instructions/clock Design logic to handle any possible dependencies between the instructions Issue logic (dispatch unit) is the bottleneck in dynamically scheduled superscalars Superscalar OrganizationInstruction Fetch Unit Instruction QueueDispatch Unit Floating Point Execution UnitLoad UnitStore UnitInteger Execution UnitBranch Unit Write Buffer2-way Superscalar Pipeline IFIDEXMEMWB IFIDEXMEMWB IFIDEXMEMWB IFIDEXMEMWB IFIDEXMEMWB IFIDEXMEMWB IFIDEXMEMWB IFIDEXMEMWB IFIDEXMEMWB IFIDEXMEMWB i tMultiple Issue and Dependencies Examine all the dependencies among the instructions issued simultaneously If dependencies exist in the bundle of instruction, encode them in reservation stations Also need multiple completion/commit To simplify reservation station allocation: Limit the number of instructions of a given class that can be issued in a bundle, i.e., one FP, one integer, one load, one store 2-way Superscalar TomasuloFrom instruction unit Load/store operationsLoad buffersInstruction Reg # queueDataOperand buses Address unit Store address32 1Reservation 21 stations21Store dataLoad dataAddress Reorder buffer Integer and FP registersFloating-point operations Memory unitOperation busInterger unitFP addersFP multipliers Common data bus (CDB) Multiple Issue Example Loop: LD R2,0(R1) ADDI R2,R2,#1 SD R2,0(R1)ADDI R1,R1,#8 BNE R2,R3,Loop//R2=array element //increment R2 //store result //increment pointer //branch if not last Multiple Issue Example (No Speculation)Iteration Issues at clock Execute number Instruction cycle number at clock cyclenumberMemory access Write CDB at clock at clock cycle cycle numbernumberComment 1 LD R2,0(R1) 1 2 3 4First issue1 ADDI 2,R2,#1 1 5 6Wait for LD1SD R2,0(R1) 2 3 7 Wait for ADDI1ADDI R1,R1,#8 2 3 4Execute directly1BNE R2,R3,Loop 3 7Wait for ADDI2LD R2,0(R1) 4 8 9 10Wait for BNE2ADDI R2,R2,#1 4 11 12Wait for LD2SD R2,0(R1) 5 9 13Wait for ADDI2ADDI R1,R1,#8 5 8 9Wait for BNE2BNE R2,R3,Loop 6 13Wait for ADDI3LD R2,0(R1) 7 14 15 16Wait for BNE3ADDI R2,R2,#1 7 17 18Wait for LD3SD R2,0(R1) 8 15 19Wait for ADDI 3 ADDIR1,R1,#8 8 14 15Wait for BNE 3 BNER2,R3,Loop 9 19 Wait for ADDIMultiple Issue Example (with Speculation)Iteration Issues at clock Execute number Instruction number at clocknumberRead access at clock numberWrite CDB at clock numberCommitsat clock Comment number1 LD R2,0(R1) 1 2 3 4 5 First issue1 ADDI R2,R2,#1 1 5 6 7 Wait for LD1 SD R2,0(R1) 2 3 7 Wait for ADDI 1 ADDI R1,R1,#8 2 3 4 8 Commit in order1 BNE R2,R3,Loop 3 7 8 Wait for ADDI2 LD R2,0(R1) 4 5 6 7 9 No execute delay2 ADDI R2,R2,#1 4 8 9 10 Wait for LD2 SD R2,0(R1) 5 6 10 Wait for ADDI2 ADDI R1,R1,#8 5 6 7 11 Commit to order2 BNE R2,R3,Loop 6 10 11 Wait for ADDI 3 LD R2,0(R1) 7 8 9 10 12 Earliest possible3 ADDI R2,R2,#1 7 11 12 13 Wait for LD3 SD R2,0(R1) 8 9 13 Wait for ADDI3 ADDI R1,R1,#8 8 9 10 14 Executes earlier3 BNE R2,R3,Loop 9 13 14 Wait for ADDIRegister RenamingRegister renaming vs. reorder buffers: Instead of virtual registers from reservation stations and reorder buffer, create a single register pool Contains visible registers and virtual registers Use hardware-based map to rename registers during issuing process WAW and WAR hazards are avoided Speculation recovery occurs bycopying during commit Still need a ROB-like queue to update table in order Simplifies commit: Record that mapping between architectural register and physical register is no longer speculative Free up physical register used to hold older value In other words: SWAP physical registers on commit Physical register de-allocation is more difficult Simple approach: deallocate virtual register when next instruction writes to its mapped architecturally-visibly register Integrated Issue and RenamingCombining instruction issue with register renaming: Issue logic pre-reserves enough physical registers for the bundle Issue logic finds dependencies within bundle, maps registers as necessary Issue logic finds dependencies between current bundle and already in execution bundles, maps registers as necessary Instr. # InstructionPhysical register assigned Instruction with physical or destination register numbersRename map change1 ADD R1,R2,R3 p32 ADD p32,p2,p3R1p322 SUB R1,R1,R2 p33 SUB p33,p32,p2R1p333 ADD R2,R1,R2 p34 ADD p34,p33,p2R2p344 SUB R1,R3,R2 p35 SUB p35,p3,p34R1p355 ADD R1,R1,R2 p36 ADD p36,p35,p34R1p366 SUB R1,R3,R1 p37 SUB p37,p3,p36R1p37 The difficulty in this process is that all renaming and replacement of architecture registers by physical ones happens in one clock cycle and not sequentially.It is up to issue logic to find all dependencies and rewrite the instructions in parallel.How Much Speculation? How much to speculate Mis-speculation degrades performance andpower relative to no speculation May cause additional misses (cache, TLB) Prevent speculative code from causing higher costing misses (e.g. L2) Speculating through multiple branches Complicates speculation recovery Speculation and energy efficiency Note: speculation is only energy efficient when it significantly improves performance 45% 40% 35% 30% 25% 20% 15% 10%5% 0%How Much Speculation? integerfloating-point 164.gzip 175.vpr176.gcc186.crafty181.mcf 168.wupwise 172.mgrid 177.mesa171.swim173.appluMisspeculation Energy Efficiency Value prediction Uses: Loads that load from a constant pool Instruction that produces a value from asmall set of values Not incorporated into modern processors Similar idea address aliasing prediction is used on some processors to determine if two stores or a load and a store reference the same address to allow for reorderingEND OF SECTION Lesson 06 Instruction-Level Parallelism: Case Studies Fallacies and Pitfalls11 10 9 8 7 6 5 4 3 2 1 0Fallacies and Pitfalls It is easy to predict the performance/energy efficiency of two different versions of the same ISA if we hold the technology constantSpeedupEnergy efficiency_201_compress _202_jessFop Luindex_209_db _213_javac_212_mpegaudio _228_jack400.perlbench 401.bzip2 403.gcc445.gobmk 456.hmmer 458.sjeng 462.libquantum 464.h264ref 470.omnetpp 473.astar483.xalancbmk416.gamess433.milc 434.zeusmp435.gromacs 436.Cactus ADM 437.leslie3d 444.namd 447.dealll 450.soplex453.povray454.calculix 459.gams FDTDantlr Bloat429.mcf465.tonto 470.ibm482.sphinx3I7 920 and Atom 230 performance and energy ratio Fallacies and Pitfalls Processors with lower CPIs / faster clock rates will also be faster Processor Implementation Clock rate Power SPECCInt2006 SPECCFP2006 technology base baselineIntel Pentium 4 670 90 nm 3.8 GHz 115 W 11.5 12.2Intel Itanium 2 90 nm 1.66 GHz 104 W approx. 14.5 17.3 70 W one coreIntel i7 920 45 nm 3.3 GHz 130 W total approx. 35.5 80 W one core 38.4 Pentium 4 had higher clock, lower CPI Itanium had same CPI, lower clockFallacies and Pitfalls Sometimes bigger and more aggressive is better Pentium 4 and Itanium were advanced designs, but could not achieve their peak instruction throughput because of relatively small caches as compared to i7 And sometimes smarter is better than bigger and more aggressive TAGE branch predictor outperforms gshare with less stored predictionsespressogcc10 10 108915 8 10 13 15 Fallacies and Pitfalls Believing that there are large amounts of ILP available, if only we had the right techniques9 11li fppppdoduc tomcatv12 1112 22 3552 47 91417 161215 14 22 3456 45 00 10 20 30 40 50 60 Degree of parallelism (%)Window sizeInfinite 256 128 6432BenchmarksEND OF SECTION
Reviews
There are no reviews yet.