Module Outline:
Module 3: High Performance Techniques Tomasulo Algorithm
 Improved Dynamic Instruction Scheduling: Tomasulo Algorithm 
Recall: Compiler Techniques for Parallelism
 Loop unrolling => multiple iterations of loop in software:
 amortizes loop overhead over several iterations
 gives more opportunity for scheduling around stalls
 Software Pipelining => take one instruction from each of several iterations of the loop
 software overlapping of loop iterations
 Very Long Instruction Word machines (VLIW) => multiple operations coded in single, long instruction
 requires sophisticated compiler to decide which operations can be done in parallel
 trace scheduling => find common path and schedule code as if branches didnt exist
(add fixup code)
 ALL OF THE ABOVE require additional registers 
Recall: How are WAR and WAW hazards handled in scoreboard?
 WAR hazards handled by stalling in WriteBack stage  WAW hazards handled by stalling in Issue stage
 Are either of these real hazards???
 consider the following WAR hazard:
ADD $1, $2, $3 SUB $3, $5, $4 ADD $2, $3, $5
 why not do renaming like this:
ADD $1, $2, $3 SUB $7, $5, $4 ADD $2, $7, $5
 now, WAR hazard has disappeared!! 
A Famous Dynamic Scheduling Algorithm: Tomasulo Algorithm
 For IBM 360/91 about 3 years after CDC6600 (1966)
 Goal: high performance without special compilers
 Difference between IBM 360 and CDC6600 ISA
 IBM has only 2 register specifiers per instruction vs. 3 in CDC6600  IBM has 4 FP registers vs. 8 in CDC6600
 IBM has memory-register ops
 Why study this algorithm?
BECAUSE: it leads to DEC Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, etc. 
Tomasulo Algorithm vs. Scoreboard
 Control and buffers distributed with Function Units (FU) vs. centralized in scoreboard:
 FU buffers called reservation stations; have pending operands
 Registers in instructions replaced by values or pointers to reservation stations (RS);
called Register Renaming
 avoids WAR, WAW hazards
 more reservation stations than registers, so can do optimizations that compilers cant do
 Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs
 Load and Store treated as FUs with RSs as well
 Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue 
Tomasulo Organization
Reservation Station Components
 Op: operation to perform in the unit (e.g., add or sub)
 Vj, Vk, Value of source operands
 store buffers has V field, result to be stored
 Qj, Qk: reservation stations producing source registers (value to be written)
 note: no ready flags as in scoreboard; Qj, Qk = 0 ==> ready  store buffers only have Qi for RS producing result
 Busy: indicates reservation station or FU is busy
 Register result statusindicates which functional unit will write each register, if one exists; blank when no pending instructions that will write that register 
1. Issue:
Three Stages of Tomasulo Algorithm
 Get instruction from FP Op Queue
 if reservation station free (no structural hazard), control issues instructions and sends operands (renames registers);
2. Execution:
 Operate on operands (EX)
 when both operands ready then execute; if not ready, watch Common Data Bus for result
3. Write Result:
 Finish execution (WB)
 write on Common Data Bus to all awaiting units; mark reservation station available
Remarks:
 Normal data bus: data + destination (go to bus)
 Common Data Bus: data + source (come from bus)
 64 bits of data + 4 bits of Functional Unit source address
 write if matches expected Functional Unit (produces result); does the broadcast 
Tomasulo Example
Tomasulo Example Cycle 1
Tomasulo Example Cycle 2
Tomasulo Example Cycle 3
 Note: registers names are removed (renamed) in Reservation Stations; MULT issued vs. scoredboard
 Load1 completing; what is waiting for Load1?? 
Tomasulo Example Cycle 4
Tomasulo Example Cycle 5
 Issue ADDD here vs. scoreboard?
Tomasulo Example Cycle 6 
Tomasulo Example Cycle 7
Tomasulo Example Cycle 8
Tomasulo Example Cycle 9
Tomasulo Example Cycle 10
Tomasulo Example Cycle 11
 All quick instructions complete in this cycle! 
Tomasulo Example Cycle 12
Tomasulo Example Cycle 13
Tomasulo Example Cycle 14
Tomasulo Example Cycle 15
Tomasulo Example Cycle 16
Lets skip some cycles: Tomasulo Example Cycle 55
Tomasulo Example Cycle 56
Tomasulo Example Cycle 57
Compare to Scoreboard Cycle 62
Tomasulo
Scoreboard
Tomasulo vs. Scoreboard (IBM 360/91 vs. CDC6600)
pipelined functional units
6 load, 3 store, 3 add, 2 mult/division window size: ~14 instructions
no issue on structural hazard
WAR: renaming avoids
WAW: renaming avoids
broadcast results from FU
control: reservation stations
multiple functional units
1 load/store, 1 add, 2 mult, 1 division ~5 instructions
same
WAR: stall completion
stall issue
write/read registers
central scoreboard 
 Complexity
 delays of 360/91, MIPS 10000, IBM 620?
 Many associative stores (CDB) at high speed  Performance limited by Common Data Bus
Tomasulo Drawbacks
 multiple CDBs ==> more FU logic for parallel associative stores 

![[SOLVED]  algorithm MIPS parallel compiler software Module Outline:](https://assignmentchef.com/wp-content/uploads/2022/08/downloadzip.jpg)

![[SOLVED] Naughty Receiver - Reliable Data Transfer](https://assignmentchef.com/wp-content/uploads/2022/08/downloadzip-1200x1200.jpg)
 
 
 
Reviews
There are no reviews yet.