Module Outline:
Module 3: High Performance Techniques Tomasulo Algorithm
Improved Dynamic Instruction Scheduling: Tomasulo Algorithm
Recall: Compiler Techniques for Parallelism
Loop unrolling => multiple iterations of loop in software:
amortizes loop overhead over several iterations
gives more opportunity for scheduling around stalls
Software Pipelining => take one instruction from each of several iterations of the loop
software overlapping of loop iterations
Very Long Instruction Word machines (VLIW) => multiple operations coded in single, long instruction
requires sophisticated compiler to decide which operations can be done in parallel
trace scheduling => find common path and schedule code as if branches didnt exist
(add fixup code)
ALL OF THE ABOVE require additional registers
Recall: How are WAR and WAW hazards handled in scoreboard?
WAR hazards handled by stalling in WriteBack stage WAW hazards handled by stalling in Issue stage
Are either of these real hazards???
consider the following WAR hazard:
ADD $1, $2, $3 SUB $3, $5, $4 ADD $2, $3, $5
why not do renaming like this:
ADD $1, $2, $3 SUB $7, $5, $4 ADD $2, $7, $5
now, WAR hazard has disappeared!!
A Famous Dynamic Scheduling Algorithm: Tomasulo Algorithm
For IBM 360/91 about 3 years after CDC6600 (1966)
Goal: high performance without special compilers
Difference between IBM 360 and CDC6600 ISA
IBM has only 2 register specifiers per instruction vs. 3 in CDC6600 IBM has 4 FP registers vs. 8 in CDC6600
IBM has memory-register ops
Why study this algorithm?
BECAUSE: it leads to DEC Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, etc.
Tomasulo Algorithm vs. Scoreboard
Control and buffers distributed with Function Units (FU) vs. centralized in scoreboard:
FU buffers called reservation stations; have pending operands
Registers in instructions replaced by values or pointers to reservation stations (RS);
called Register Renaming
avoids WAR, WAW hazards
more reservation stations than registers, so can do optimizations that compilers cant do
Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs
Load and Store treated as FUs with RSs as well
Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue
Tomasulo Organization
Reservation Station Components
Op: operation to perform in the unit (e.g., add or sub)
Vj, Vk, Value of source operands
store buffers has V field, result to be stored
Qj, Qk: reservation stations producing source registers (value to be written)
note: no ready flags as in scoreboard; Qj, Qk = 0 ==> ready store buffers only have Qi for RS producing result
Busy: indicates reservation station or FU is busy
Register result statusindicates which functional unit will write each register, if one exists; blank when no pending instructions that will write that register
1. Issue:
Three Stages of Tomasulo Algorithm
Get instruction from FP Op Queue
if reservation station free (no structural hazard), control issues instructions and sends operands (renames registers);
2. Execution:
Operate on operands (EX)
when both operands ready then execute; if not ready, watch Common Data Bus for result
3. Write Result:
Finish execution (WB)
write on Common Data Bus to all awaiting units; mark reservation station available
Remarks:
Normal data bus: data + destination (go to bus)
Common Data Bus: data + source (come from bus)
64 bits of data + 4 bits of Functional Unit source address
write if matches expected Functional Unit (produces result); does the broadcast
Tomasulo Example
Tomasulo Example Cycle 1
Tomasulo Example Cycle 2
Tomasulo Example Cycle 3
Note: registers names are removed (renamed) in Reservation Stations; MULT issued vs. scoredboard
Load1 completing; what is waiting for Load1??
Tomasulo Example Cycle 4
Tomasulo Example Cycle 5
Issue ADDD here vs. scoreboard?
Tomasulo Example Cycle 6
Tomasulo Example Cycle 7
Tomasulo Example Cycle 8
Tomasulo Example Cycle 9
Tomasulo Example Cycle 10
Tomasulo Example Cycle 11
All quick instructions complete in this cycle!
Tomasulo Example Cycle 12
Tomasulo Example Cycle 13
Tomasulo Example Cycle 14
Tomasulo Example Cycle 15
Tomasulo Example Cycle 16
Lets skip some cycles: Tomasulo Example Cycle 55
Tomasulo Example Cycle 56
Tomasulo Example Cycle 57
Compare to Scoreboard Cycle 62
Tomasulo
Scoreboard
Tomasulo vs. Scoreboard (IBM 360/91 vs. CDC6600)
pipelined functional units
6 load, 3 store, 3 add, 2 mult/division window size: ~14 instructions
no issue on structural hazard
WAR: renaming avoids
WAW: renaming avoids
broadcast results from FU
control: reservation stations
multiple functional units
1 load/store, 1 add, 2 mult, 1 division ~5 instructions
same
WAR: stall completion
stall issue
write/read registers
central scoreboard
Complexity
delays of 360/91, MIPS 10000, IBM 620?
Many associative stores (CDB) at high speed Performance limited by Common Data Bus
Tomasulo Drawbacks
multiple CDBs ==> more FU logic for parallel associative stores
Reviews
There are no reviews yet.