Static Scheduling & VLIW 15-740
Prof. Nathan Beckmann
(Original slides by Onur Mutlu, edited by Seth Goldstein)
Carnegie Mellon University
Reprise of dynamic scheduling
n
2
DO WE REALLY NEED ALL THIS COMPLEX HARDWARE?
HOW FAR CAN WE GET WITHOUT IT?
3
Key Questions
Q1. How do we find independent instructions to fetch/execute?
Q2. How do we enable more compiler optimizations?
e.g., common subexpression elimination, constant propagation, dead code elimination, redundancy elimination,
Q3. How do we increase the instruction fetch rate?
i.e., have the ability to fetch more instructions per cycle
4
Key Questions
Q1. How do we find independent instructions to fetch/execute?
Q2. How do we enable more compiler optimizations?
e.g., common subexpression elimination, constant propagation, dead code elimination, redundancy elimination,
Q3. How do we increase the instruction fetch rate?
i.e., have the ability to fetch more instructions per cycle
A: Enabling the compiler to optimize across a larger number of instructions that will be executed straight line (without branches getting in the way) eases all of the above
5
Very long instruction word VLIW
n Compiler does the scheduling statically
n Simple hardware with multiple function units
q Reduced hardware complexity
q Little or no scheduling done in hardware, e.g., in-order q Hopefully, faster clock and less power
n Compiler required to group and schedule instructions (compare to OoO superscalar)
q Predicated instructions to help with scheduling (trace, etc.) q More registers (for software pipelining, etc.)
6
VLIW example 2-cycle loads
n RISC code MUL R1, R3, 3
LDR4, 0(R1)
ADD R2, R2, R4
SUB R3, R3, 1
BNEZR3, -4
n VLIW code MUL R1, R3, 3
LDR4, 0(R1)
NOP
ADD R2, R2, R4
SUB R3, R3, 1
NOP
NOP
BNEZ R3, -4
7
Very long instruction word VLIW
n Compiler does the scheduling statically
n Simple hardware with multiple function units
n Compiler required to group and schedule instructions (compare to OoO superscalar)
n Example machines:
q Multiflow, Cydra 5 (8-16 ops per VLIW)
q IA-64 Itanium (3 ops per bundle, EPIC VLIW variant) q TMS32xxxx (5+ ops per VLIW, DSPs)
q Crusoe (4 ops per VLIW, JIT x86 e VLIW)
q Tilera TILEPro (3 ops per VLIW, 100+ cores)
8
Comparison between SS VLIW
From Mark Smotherman, Understanding EPIC Architectures and Implementations
Comparison: CISC, RISC, VLIW
VLIW is a natural extension of RISC ideas to superscalar
VLIW relies on compiler for ILP
n
11
VLIW: Finding Independent Operations
n Within a basic block, there is limited instruction-level parallelism
n To find multiple instructions to be executed in parallel, the compiler needs to consider multiple basic blocks
n Problem: Moving an instruction above a branch is unsafe because instruction is not guaranteed to be executed
n Idea: Enlarge blocks at compile time by finding the frequently-executed paths
q Trace scheduling
q Superblock scheduling q Hyperblock scheduling q Software Pipelining
Its all about the compiler and how to schedule the instructions to maximize parallelism
12
List Scheduling: For 1 basic block
n Idea:Assignprioritytoeachinstruction
n Initializereadylistthatholdsallreadyinstructions
n Choose one ready instruction I from ready list with the highest priority
n InsertIintoschedule
q Ensuring resources are available (structural hazards)
n Addthoseinstructionswhoseprecedenceconstraintsare now satisfied into the ready list
13
Data Precedence Graph
i1 i2 i5 i6 i7 i10 i11 i12
222222 i3 2 i8 2 i13
2
22
i4
i9
i14
4
4 i15
2 i16
14
Instruction Prioritization Heuristics
n Number of descendants in precedence graph
n Maximum latency from root node of precedence graph n Length of operation latency
n Ranking of paths based on importance
n Some combination of above
15
VLIW List Scheduling
n Assign priorities critical path in this example
n Compute data ready list (DRL) all operations whose predecessors
have been scheduled.
n Select from DRL in priority order while checking resource constraints n Add newly ready operations to DRL and repeat for next instruction
5
1
23334
23456
223
789
112
1011 12
1
4-wide VLIW
Data Ready List
1
{1}
6
3
4
5
{2,3,4,5,6}
9
2
7
8
{2,7,8,9}
12
10
11
{10,11,12}
13
{13}
13
16
VLIW List Scheduling
5
1 2233435364
4-wide VLIW
2
2
3
2
1
13
789
11
1011 12
17
VLIW List Scheduling
5
1 2233435364
4-wide VLIW
1
6
3
4
5
9
2
7
8
12
10
11
13
2
2
3
2
1
13
789
11
1011 12
18
VLIW List Scheduling+Structural Hazards
5
1 2233435364
4-wide VLIW
2
2
3
2
1011 12
1
13
789 11
19
VLIW List Scheduling+Structural Hazards
5
1 2233435364
4-wide VLIW
1
2
3
6
5
9
4
7
8
12
10
11
13
2
2
3
2
1011 12
1
13
789 11
20
Extending the scheduling domain
n Basic block is too small to get any real parallelism q Recall: 88% of OOO benefit from speculation e
larger scheduling window
n How to extend the basic block?
q Why do we have basic blocks in the first place?
q Loops
n Loop unrolling
n Software pipelining
q Non-loops
n Will almost always involve some speculation n Thus profiling may be very important
[MTZ13]
21
Safety and Legality in Code Motion
n Two characteristics of speculative code motion:
q Safety: whether or not spurious exceptions may occur q Legality: whether or not result will be always correct
n Four possible types of code motion:
r4 = r1
r1 =
r1 = r2 & r3
r1 = r2 & r3
(a) safe and legal
(b) illegal
r1 =
r1 = load A
r4 = r1
r1 = load A
(c) unsafe
(d) unsafe and illegal
22
Code Movement Constraints
n Downward
q When moving an operation from a BB to one of its dest BBs,
n all the other dest basic blocks should still be able to use the result of the operation
n the other source BBs of the dest BB should not be disturbed n Upward
q When moving an operation from a BB to its source BBs n register values required by the other dest BBs must not be
destroyed
n the movement must not cause new exceptions
23
Trace Scheduling
n Trace: A frequently executed path in the control-flow graph (has multiple side entrances and multiple side exits)
n Idea: Find independent operations within a trace to pack into VLIW instructions.
q Traces determined via profiling
q Compiler adds fix-up code for correctness (if a side entrance or side exit of a trace is exercised at runtime, corresponding fix-up code is executed)
24
Trace Scheduling Idea
25
Trace Scheduling (II)
n There may be conditional branches from the middle of the trace (side exits) and transitions from other traces into the middle of the trace (side entrances).
n These control-flow transitions are ignored during trace scheduling.
n After scheduling, fix-up/bookkeeping code is inserted to ensure the correct execution of off-trace code.
n Fisher, Trace scheduling: A technique for global microcode compaction, IEEE TC 1981.
26
Trace Scheduling (III)
Instr 1
Instr 2
Instr 3
Instr 4
Instr 5
Instr 2
Instr 3
Instr 4
Instr 1
Instr 5
What bookkeeping is required when Instr 1
is moved below the side entrance in the trace?
27
Trace Scheduling (IV)
Instr 3
Instr 4
Instr 1
Instr 2
Instr 3
Instr 4
Instr 5
Instr 2
Instr 3
Instr 4
Instr 1
Instr 5
28
Trace Scheduling (V)
Instr 1
Instr 2
Instr 3
Instr 4
Instr 5
Instr 1
Instr 5
Instr 2
Instr 3
Instr 4
What bookkeeping is required when Instr 5 moves above the side entrance in the trace?
29
Trace Scheduling (VI)
Instr 5
Instr 1
Instr 2
Instr 3
Instr 4
Instr 5
Instr 1
Instr 5
Instr 2
Instr 3
Instr 4
30
Trace Scheduling Fixup Code Issues
n Sometimes need to copy instructions more than once to ensure correctness on all paths (see C below)
Original trace
A
B X C
D A B C Y Scheduled B
trace E DYA
C E D
EC
BX Correctness C
B X
DY
31
Trace Scheduling Overview
n Trace Selection
q select seed block (the highest frequency basic block) q extend trace (along the highest frequency edges)
forward (successor of the last block of the trace)
backward (predecessor of the first block of the trace) q dont cross loop back edge
q bound max_trace_length heuristically
n Trace Scheduling
q build data precedence graph for a whole trace
q perform list scheduling and allocate registers
q add compensation code to maintain semantic correctness
n Speculative Code Motion (upward)
q move an instruction above a branch if safe
32
Trace Scheduling Example (I)
B1
9 stalls
B3
1 stall
fdiv f1, f2, f3 fadd f4, f1, f5 beq r1, $0
fdiv f1, f2, f3 fadd f4, f1, f5
beq r1, $0
990
ld r2, 0(r3)
990
800
800
10
r2 and f2 not live out
B3
f2 not live out
B6
B2
ld r2, 4(r3)
10
ld r2, 0(r3)
add r2, r2, 4 beq r2, $0
fsub f2, f2, f6 st.d f2, 0(r8)
add r3, r3, 4 add r8, r8, 4
add r2, r2, 4 beq r2, $0
B4
B5
200
fsub f2, f3, f7
200
B7
B6
1 stall
fsub f2, f2, f6 st.d f2, 0(r8)
add r3, r3, 4 add r8, r8, 4
33
Trace Scheduling Example (I)
9 stalls
fdiv f1, f2, f3 fadd f4, f1, f5 beq r1, $0
ld r2, 0(r3)
add r2, r2, 4 beq r2, $0
fsub f2, f2, f6 st.d f2, 0(r8)
B6
fdiv f1, f2, f3 beq r1, $0
ld r2, 0(r3) fsub f2, f2, f6
add r2, r2, 4 beq r2, $0
st.d f2, 0(r8)
add r3, r3, 4 add r8, r8, 4 fadd f4, f1, f5
1 stall
r2 and f2 not live out
B3 0 stall 0 stall
f2 not live out
1 stall
1 stall
add r3, r3, 4 add r8, r8, 4
B3 B6
34
Trace Scheduling Example (II)
fdiv f1, f2, f3 beq r1, $0
ld r2, 0(r3) fsub f2, f2, f6
add r2, r2, 4 beq r2, $0
st.d f2, 0(r8)
add r3, r3, 4
add r8,r8,4
fadd f4, f1, f5 fadd f4, f1, f5
B6
0 stall 0 stall
1 stall
fdiv f1, f2, f3 beq r1, $0
ld r2, 0(r3) fsub f2, f2, f6
add r2, r2, 4 beq r2, $0
st.d f2, 0(r8)
fadd f4, f1, f5
Split comp. code
fadd f4, f1, f5
add r3, r3, 4
B3 add r8,r8,4 B3
B6
35
Trace Scheduling Example (III)
fdiv f1, f2, f3 beq r1, $0
ld r2, 0(r3) fsub f2, f2, f6 add r2, r2, 4 beq r2, $0
st.d f2, 0(r8)
fadd f4, f1, f5
Split comp. code
fadd f4, f1, f5
add r3, r3, 4 add r8, r8, 4 fadd f4, f1, f5
B3
B6
add r3, r3, 4 add r8, r8, 4
Join comp. code
36
Trace Scheduling Example (IV)
fdiv f1, f2, f3 beq r1, $0
ld r2, 0(r3) fsub f2, f2, f6 add r2, r2, 4 beq r2, $0
st.d f2, 0(r8)
add r3, r3, 4 add r8, r8, 4 fadd f4, f1, f5
fadd f4, f1, f5
Split comp. code
fadd f4, f1, f5
B3
add r2, r2, 4 beq r2, $0
fsub f2, f2, f6 st.d f2, 0(r8)
add r3, r3, 4 add r8, r8, 4
B6
Copied split instructions
add r3, r3, 4 add r8, r8, 4
Join comp. code
37
Trace Scheduling Example (V)
ld r2, 0(r3) fsub f2, f2, f6 add r2, r2, 4
beq r2, $0
fdiv f1, f2, f3 beq r1, $0
fadd f4, f1, f5
ld r2, 4(r3)
add r2, r2, 4 beq r2, $0
B3
st.d f2, 0(r8) add r3, r3, 4 add r8, r8, 4 fadd f4, f1, f5
fadd f4, f1, f5
B6
fsub f2, f2, f6 st.d f2, 0(r8)
add r3, r3, 4 add r8, r8, 4
fsub f2, f3, f7
add r3, r3, 4 add r8, r8, 4
38
Trace Scheduling Tradeoffs
n Advantages
+ Enables the finding of more independent instructions a fewer
NOPs in a VLIW instruction
n Disadvantages
Profile dependent
What if dynamic path deviates from trace a lots of NOPs in the VLIW instructions
Code bloat and additional fix-up code executed Due to side entrances and side exits
Infrequent paths interfere with the frequent path Effectiveness depends on the bias of branches
Unbiased branches a smaller traces a less opportunity for finding independent instructions
39
Superblock Scheduling
n Trace: multiple entry, multiple exit block
n Superblock: single-entry, multiple exit block
q A trace with side entrances are eliminated
q Infrequent paths do not interfere with the frequent path
+ More optimization/scheduling opportunity than traces
+ Eliminates difficult bookkeeping due to side entrances
Hwu+, The Superblock: An Effective Technique for VLIW and superscalar compilation, J of SC 1991.
40
Superblock example
opA: mul r1,r2,3
99
opC: mul r3,r2,3
opA: mul r1,r2,3
99 opB: add r2,r2,1
11
opB: add r2,r2,1
opC: mul r3,r2,3
opC: mul r3,r2,3
1
Original Code
Code After Superblock Formation
opA: mul r1,r2,3
99
opC: mov r3,r1
1
opB: add r2,r2,1
opC: mul r3,r2,3
Code After Common Subexpression Elimination
41
Superblock Scheduling Shortcomings
Still profile-dependent
No single frequently executed path if there is an unbiased
branch
Reduces the size of superblocks
Code bloat and additional fix-up code executed Due to side exits
42
Hyperblock Scheduling
n Idea: Use predication support to eliminate unbiased branches and increase the size of superblocks
n Hyperblock: A single-entry, multiple-exit block with internal control flow eliminated using predication (if-conversion)
n Advantages
+ Reduces the effect of unbiased branches on scheduled block size
n Disadvantages
Requires predicated execution support
All disadvantages of predicated execution
43
Hyperblock Formation (I)
n Hyperblock formation 1. Block selection
2. Tail duplication
3. If-conversion
10
90 80
80 20
20
n Block selection
q Select subset of BBs for inclusion in HB
q Difficult problem 10
q Weighted cost/benefit function n Height overhead
n Resource overhead
n Dependency overhead
n Branch elimination benefit n Weighted by frequency
90 10
10
n Mahlke et al., Effective Compiler Support for Predicated Execution Using the Hyperblock, MICRO 1992.
BB1
BB2
BB3
BB4
BB5
BB6
44
Hyperblock Formation (II)
Tail duplication same as with Superblock formation 10
BB1
80
BB2
80
10
90 10
BB6
10
20
BB1
80
BB2
80
20
BB3
20
90
10
BB3
20
BB4
BB4
10
BB5
BB5
90
BB6
81 9
10 19
BB6
45
Hyperblock Formation (III)
10
BB1
80
BB2
80
If-convert (predicate) intra-hyperblock branches 10
BB1 p1,p2 = CMPP
BB2 if p1
BB3 if p2
BB4
BB6
BB4
20
BB3
20
90
10
10
BB5
BB6
81 9
9
1
10
81
9
1
BB6
BB5
BB6
46
WHAT ABOUT MEMORY?
47
Non-Faulting Loads and Exception Propagation
ld.s r1=[a]
inst 1 inst 2 . br
inst 1 inst 2 .
br
unsafe code
motion
.
ld r1=[a]
use=r1
n ld.s fetches speculatively from memory
i.e. any exception due to ld.s is suppressed
n If ld.s r1 did not cause an exception then chk.s r1 is a NOP, else a branch is taken (to execute some compensation code)
.
chk.s r1
use=r1
ld r1=[a]
48
Non-Faulting Loads and Exception Propagation in IA-64
ld.s r1=[a]
inst 1 inst 2 use=r1 .
br
inst 1 inst 2 .
br br
unsafe code
motion
.
ld r1=[a] use=r1
n Load data can be speculatively consumed prior to check
n speculation status is propagated with speculated data
n Any instruction that uses a speculative result also becomes speculative itself (i.e. suppressed exceptions)
n chk.s checks the entire dataflow sequence for exceptions
.
chk.s use
ld r1=[a] use=r1
49
Aggressive ST-LD Reordering in IA-64
inst 1
inst 2
.
st [?]
st[?] .
ld r1=[x]
use=r1
ld.a r1=[x]
inst 1
inst 2
.
st [?]
.
ld.c r1=[x] use=r1
potential aliasing
n ld.a starts the monitoring of any store to the same address as the advanced load
n If no aliasing has occurred since ld.a, ld.c is a NOP n If aliasing has occurred, ld.c re-loads from memory
50
Aggressive ST-LD Reordering in IA-64
inst 1
inst 2
.
st [?]
st[?] .
ld r1=[x] use=r1
ld.a r1=[x]
inst 1 inst 2 use=r1 .
st [?] . chk.a X .
potential aliasing
ld r1=[a] use=r1
51
Can We Do Better?
n Hyperblock still
q Profile dependent
q Requires fix-up code
q And, requires predication support
n Single-entry, single-exit enlarged blocks
q Block-structured ISA
n Optimizes multiple paths (can use predication to enlarge blocks) n No need for fix-up code (duplication instead of fixup)
52
Block Structured ISA
n Blocks (> instructions) are atomic (all-or-none) operations q Either all of the block is committed or none of it
n Compiler enlarges blocks by combining basic blocks with their control flow successors
q Branches within the enlarged block converted to fault operations a if the fault operation evaluates to true, the block is discarded and the target of fault is fetched
Melvin and Patt, Enhancing Instruction Scheduling with a Block-Structured ISA, IJPP 1995.
53
Block Structured ISA (II)
n Advantages:
+ Larger atomic blocks a larger units can be fetched from I-cache
+ Aggressive compiler optimizations (e.g. reordering) can be enabled within atomic blocks (no side entries or exits)
+ Can explicitly represent dependencies among operations within an enlarged block
n Disadvantages:
Fault operations can lead to work to be wasted (atomicity)
Code bloat (multiple copies of the same basic block exists in the binary and possibly in I-cache)
Need to predict which enlarged block comes next n Optimizations
q Within an enlarged block, the compiler can perform optimizations that cannot normally/easily be performed across basic blocks
54
Block Structured ISA (III)
n Hao et al., Increasing the instruction fetch rate via block- structured instruction set architectures, MICRO 1996.
55
Superblock vs. BS-ISA
n Superblock
q Single-entry, multiple exit code block
q Not atomic
q Compiler inserts fix-up code on superblock side exit
n BS-ISA blocks
q Single-entry, single exit
q Atomic
q Need to roll back to the beginning of the block on fault
56
Superblock vs. BS-ISA
n Superblock
+ No ISA support needed
Optimizes for only 1 frequently executed path
Not good if dynamic path deviates from profiled path a missed opportunity to optimize another path
n Block Structured ISA
+ Enables optimization of multiple paths and their dynamic selection.
+ Dynamic prediction to choose the next enlarged block. Can dynamically adapt to changes in frequently executed paths at run- time
+ Atomicity can enable more aggressive code optimization Code bloat becomes severe as more blocks are combined
Requires next enlarged block prediction, ISA+HW support More wasted work on fault due to atomicity requirement
57
Summary and Questions
n Trace, superblock, hyperblock, block-structured ISA
n How many entries, how many exits does each of them have?
q What are the corresponding benefits and downsides?
n What are the common benefits?
q Enable and enlarge the scope of code optimizations q Reduce fetch breaks; increase fetch rate
n What are the common downsides?
q Code bloat (code size increase)
q Wasted work if control flow deviates from enlarged blocks path
58
VLIW Summary
n Heavy reliance on compiler (push RISC to the extreme) q Compiler algorithms (e.g., software pipelining) have lasting
impact outside of VLIW
n Is there enough statically knowable parallelism?
q E.g., memory aliasing and branch bias n What about wasted FUs? Code bloat?
q Code size is already a big problem with x86 apps!
n Architecture joke: VLIW is the architecture of the future,
and always will be.
n Yet many DSPs are VLIW. Why?
59
WHAT ABOUT LOOPS?
60
The problem with loops
n Consider the following code:
for (int i = 0; i < N; i++) {b[i] = b[i] * b[i];}61 The problem with loopsn Consider the following code:for (int i = 0; i < N; i++) {b[i] = b[i] * b[i];}n RISC assembly (LD & MUL 2-cycles) LOOP:Useful workLoop overheadLD R1, 0(R3)MUL R2, R1, R1ST R2, 0(R3) ADD R3, R3, 4BLT R3, R4, LOOPe 7 cycles per iteration 62 The problem with loopsn Consider the following code:for (int i = 0; i < N; i++) {b[i] = b[i] * b[i];}n VLIW assembly (1 ALU, 1 LD/ST; 2-cycles) LOOP:e5 cycles per iterationNOP NOPLD R1, 0(R3)NOP NOP NOPUseful workMUL R2, R1, R1 ADD R,3 R,3 4BLT R3, R4, LOOPST R2, -4(R3) Loop overhead 63 Amortize overheads by unrolling loops n Key idea is to schedule the following code instead:for (int i = 0; i < N; i+=4) {b[i+0] = b[i+0] * b[i+0];b[i+1] = b[i+1] * b[i+1];b[i+2] = b[i+2] * b[i+2];b[i+3] = b[i+3] * b[i+3]; }Loop unrollinge Larger scheduling block e Better schedule 64 Amortize overheads by unrolling loops n VLIW assemblyLOOP: NOP NOPe 2 cycles per iterationLD R1, 0(R9)LD R3, 4(R9)LD R5, 8(R9)LD R7, 12(R9)ST R2, 0(R9)ST R4, 4(R9)ST R6, 8(R9)ST R8, -4(R9)MUL R2, R1, R1MUL R4, R3, R3MUL R6, R5, R5MUL R8, R7, R7ADD R9, R9, 16BLT R9, R10 LOOPLoop overheadUseful work 65 Correctness of loop unrollingn Is this transformation legal?for (int i = 0; i < N; i++) {b[i] = b[i] * b[i];}for (int i = 0; i < N; i+=4) {b[i+0] = b[i+0] * b[i+0];b[i+1] = b[i+1] * b[i+1];b[i+2] = b[i+2] * b[i+2];b[i+3] = b[i+3] * b[i+3];} 66 Correctness of loop unrollingn Instead, schedule the following code: int i;for (i = 0; i+3 < N; i+=4) {b[i+0] = b[i+0] * b[i+0];b[i+1] = b[i+1] * b[i+1];b[i+2] = b[i+2] * b[i+2];b[i+3] = b[i+3] * b[i+3];}for (; i < N; i++) {b[i] = b[i] * b[i];}67 Loop unrolling summaryn Advantagesq Reduces loop overheadq Improves code schedule within loop n (eg, hiding MUL latency in example)n Disadvantagesq Increases code sizeq Less effective on loops with internal branches n Can use predication … more on this later68 Loop Unrollingn Idea: Replicate loop body multiple times within an iteration + Reduces loop maintenance overheadq Induction variable increment or loop condition test + Enlarges basic block (and analysis scope)q Enables code optimization and scheduling opportunities– What if iteration count not a multiple of unroll factor? (need extra code to detectthis)– Increases code size 69 Can we do better?n VLIW assembly (ALU + LD/ST) LOOP: NOPNOPLD R1, 0(R9)LD R3, 4(R9)LD R5, 8(R9)LD R7, 12(R9)ST R2, 0(R9)ST R4, 4(R9)ST R6, 8(R9)ST R8, -4(R9)MUL R2, R1, R1MUL R4, R3, R3MUL R6, R5, R5MUL R8, R7, R7ADD R9, R9, 16BLT R9, R10 LOOPe 2 cycles per iterationLoop overheadUseful workNot in this case LD/ST unit is at 100% utilization 70 Can we do better?n VLIW assembly (3-wide) LOOP: NOPNOPNOP NOP NOPLD R1, 0(R9)LD R3, 4(R9)LD R5, 8(R9)LD R7, 12(R9) MUL R2, R1, R1MUL R4, R3, R3MUL R6, R5, R5MUL R8, R7, R7 ST R2, 0(R9)ST R4, 4(R9)ST R6, 8(R9)ST R8, 12(R9)NOP ADD R9, R9, 16BLT R9, R10, LOOPNOPLoop overheade 7/4 cycles per iterationUseful work 71 Can we do better?n VLIW assembly (3-wide) LOOP: NOPNOPNOP NOP NOPLD R1, 0(R31)LD R3, 4(R31)LD R5, 8(R31)LD R7, 12(R31)LD R9, 12(R31)LD R11, 12(R31)LD R13, 12(R31)LD R15, 12(R31) MUL R2, R1, R1MUL R4, R3, R3MUL R6, R3, R3MUL R8, R3, R3MUL R10, R3, R3MUL R12, R3, R3MUL R14, R5, R5MUL R16, R7, R7 ST R2, 0(R31)ST R4, 4(R31)ST R6, 8(R31)ST R8, 12(R31)ST R10, 16(R31)ST R12, 20(R31)ST R14, 24(R31)ST R16, -4(R31)NOP ADD R31, R31, 32BLT R31, R10, LOOPNOPe 11/8 cycles per iterationLoop overheadUseful work…but code & register bloat 72 Can we do better than unrolling? n VLIW assembly (3-wide) LOOP: NOPNOPRamp-up & ramp-down overhead NOPLoop overheadUseful workNOP NOP NOPLD R1, 0(R31)LD R3, 4(R31)LD R5, 8(R31)LD R7, 12(R31)LD R9, 12(R31)LD R11, 12(R31)LD R13, 12(R31)LD R15, 12(R31) MUL R2, R1, R1MUL R4, R3, R3MUL R6, R3, R3MUL R8, R3, R3MUL R10, R3, R3MUL R12, R3, R3MUL R14, R5, R5MUL R16, R7, R7 ST R2, 0(R31)ST R4, 4(R31)ST R6, 8(R31)ST R8, 12(R31)ST R10, 16(R31)ST R12, 20(R31)ST R14, 24(R31)ST R16, -4(R31)NOP ADD R31, R31, 32BLT R31, R10, LOOP73 Can we do better than unrolling? n VLIW assembly (3-wide) LOOP: NOPNOPRamp-up & ramp-down overheadNOP NOP NOP LD R1, 0(R31)LD R3, 4(R31)LD R5, 8(R31)LD R7, 12(R31)LD R9, 12(R31)LD R11, 12(R31)LD R13, 12(R31)LD R15, 12(R31) MUL R2, R1, R1MUL R4, R3, R3MUL R6, R3, R3MUL R8, R3, R3MUL R10, R3, R3MUL R12, R3, R3MUL R14, R5, R5MUL R16, R7, R7 ST R2, 0(R31)ST R4, 4(R31)ST R6, 8(R31)ST R8, 12(R31)ST R10, 16(R31)ST R12, 20(R31)ST R14, 24(R31)ST R16, -4(R31)NOP ADD R31, R31, 32BLT R31, R10, LOOPPerfect efficiencyNOPLoop overheadUseful workGoal: Maintain peak efficiency, w/out ramp-up/down 74 Software Pipeliningn Idea: Move instructions across iterations of the loop q Very large improvements in running time are possible n E.g., 5-wide VLIWLOOP:{LD R1, 0(R9) NOPNOP NOPLD R1, 4(R9) MUL R2, R1, R1ADD R9, R9, 4NOPADD R9, R9, 4BGE R9, R10, ENDCurrent iteration One iteration ago Two iterations agoNOP SUB R10, R10, 4LD R1, 0(R9) MUL R2, R1, R1ST R2, -8(R9)ADD R9, R9, 4BLT R9, R10, LOOP}NOP ST R2, -4(R9) NOP ST R2, 0(R9)END:NOPNOPNOPNOPNOPMUL R2, R1, R1NOPNOP 15-745 Seth Copen Goldstein 2000-5 75 Software Pipeliningn Idea: Move instructions across iterations of the loop q Very large improvements in running time are possible n E.g., 5-wide VLIWLD R1, 0(R9) NOPNOP NOPLD R1, 4(R9) MUL R2, R1, R1NOP ADD R9, R9, 4 NOP ADD R9, R9, 4 BGE R9, R10, ENDe 1 cycle per iteration}NOP ST R2, -4(R9) NOP ST R2, 0(R9) LOOP:Perfect efficiencyEND:{ SUB R10, R10, 4LD R1, 0(R9) NOPNOPNOPNOPNOPMUL R2, R1, R1NOPNOPMUL R2, R1, R1ST R2, -8(R9)ADD R9, R9, 4BLT R9, R10, LOOP 15-745 Seth Copen Goldstein 2000-5 76 Goal of SPn Increase distance between dependent operations by moving destination operation to a later iterationA:a ld[d] B:b a*aC: st [d], b D:d d+4Assume all have latency of 2ABCD15-745 Seth Copen Goldstein 2000-5 77 Can we decrease the latency?n Lets unrollA: ald[d]B: ba*aC: st [d], b D: dd+4 A1: a ld [d] B1: b a * a C1: st [d], b D1: d d+4 ABCDA1B1C1D115-745 Seth Copen Goldstein 2000-5 78 Rename variables A: a ld[d]B: b a*aC: st [d], b D: d1d+4A1: a1 ld [d1]B1: b1 a1*a1 C1: st [d1], b1 D1:d d1+4ABCDA1B1C1D115-745 Seth Copen Goldstein 2000-5 79 ScheduleA: a ld[d]B: b a*aC: st [d], b D: d1d+4A1: a1 ld [d1]B1: b1 a1*a1 C1: st [d1], b1 D1:d d1+4ADBA1CB1D1C1 ABCD1DA1B1C1 15-745 Seth Copen Goldstein 2000-5 80 Unroll Some More A: aB: bC:D: d1 A1: a1 B1: b1 C1:D1: d2 A2: a2 B2: b2 C2:D2: d ld[d]a*ast [d], b d+4ld [d1] a1*a1st [d1], b1 d1+4ld [d2] a2*a2st [d2], b2 d2 + 4AD BCA1 D1 B1 A2 C1C2 D2B2 ABCD2DA1B1C1D1A2B2C215-745 Seth Copen Goldstein 2000-5 81 Unroll Some More A: a B: bC:D: d1 A1: a1 B1: b1 C1:D1: d2 A2: a2 B2: b2 C2:D2: dld [d]a*ast [d], b d+4ld [d1] a1*a1st [d1], b1 d1+4ld [d2] a2*a2st [d2], b2 d2+4AD BCA1 D1B1 A2 D2C1B2 A3 C2 B3 C3 ABCD3DA1B1C1D1A2B2C2D2A3B3C3 D3 15-745 Seth Copen Goldstein 2000-5 82ABCD4DA1B1C1D1A2B2C2D2A3B3C3D3A4B4C4 One More Time AD BCA1 D1B1 A2 D2 C1B2 A3 C2 B3 C3D3A4 B4 C4 15-745 Seth Copen Goldstein 2000-5 83ABCD4DA1B1C1D1A2B2C2D2A3B3C3D3A4B4C4 Can Rearrange AD BCA1 D1B1 A2 D2 C1B2 A3 C2 B3 C3D3A4 B4 C4 15-745 Seth Copen Goldstein 2000-5 84 Rearrange AD BC A1 D1B1 A2 D2 C1B2 A3 C2 B3 C3ABCD3DA1B1C1D1A2B2C2D2A3B3C3 D3 15-745 Seth Copen Goldstein 2000-5 85 Rearrange AD BC A1 D1B1 A2 D2 C1B2 A3D3 C2 B3C3ABCD3DA1B1C1D1A2B2C2D2A3B3C3 15-745 Seth Copen Goldstein 2000-5 86 SP LoopA: a B: b D: d1 A1: a1 D1: d2C:B1: b1 A2: a2 D2: dB2: b2 C1:D3: d2 C2:ld [d] a*a d+4 ld [d1] d1+4st [d], b a1*a1 ld [d2] d2+4a2*a2st [d1], b1 d1+4st [d2], b2PrologBody EpilogABCCCD3DA1B1B1B1C1D1A2A2A2B2C2D2D2D2 15-745 Seth Copen Goldstein 2000-5 87 Goal of Software Pipeliningn Increase distance between dependent operations by moving destination operation to a later iterationA BCdependencies in initial loopA after SPBCi+1 i+2 iteration i 15-745 Seth Copen Goldstein 2000-5 88 Goal of Software Pipeliningn Discover ILP across iterations!A0 A0 A1 B0A2 B1 C0 A3 B2 C1B3 C2 C3A1 B0 Ai Bi-1 Ci-2Bi Ci-1 Ci 15-745 Seth Copen Goldstein 2000-5 89 ExampleAssume operating on a infinite wide machine A0A1 B0Prolog loop bodyepilog Ai Bi-1 Ci-2 Bi Ci-1 Ci15-745 Seth Copen Goldstein 2000-5 90 Dealing with exit conditionsfor (i=0; i
A0
B0
if (i+1 == N) goto last i=1
A1
if (i+2 == N) goto epilog i=2
{
}
loop: Ai Ai
Bi Bi-1 Ci Ci-2
i++
if (i < N) goto loop epilog:BiCi-1 last:ci done: 15-745 Seth Copen Goldstein 2000-5 91 Loop Unrolling vs. Software PipeliningFor SuperScalar or VLIWn Loop Unrolling reduces loop overhead n Software Pipelining reduces fill/drain n Best is if you combine themSoftware PipeliningLoop Unrolling Time15-745 Seth Copen Goldstein 2000-5 92 TODO: More complicated SP example93 Software Pipelining Approachesn The first serious approach to software pipelining was presented by Aiken & Nicolau.q Aikens 1988 Ph.D. thesis.q Impractical as it ignores resource hazards (focusing onlyon data-dependence constraints).n Iterative Modulo Scheduling Rau MICRO94q Addresses resource constraintsq Iterate over different loop lengths until one schedules successfullyq Compute loop lower/upper bounds from required & available resources94SYSTOLIC ARRAYS 95 Why Systolic Architectures?n Idea: Data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memoryn Similar to an assembly lineq Different people work on the same carq Many cars are assembled simultaneously q Can be two-dimensionaln Special purpose accelerators/architectures needq Simple, regular designs (keep # unique parts small andregular)q High concurrency a high performanceq Balanced computation and I/O (memory access)96 Systolic Architecturesn H. T. Kung, Why Systolic Architectures?, IEEE Computer 1982.Memory: heart PEs: cellsMemory pulses data through cells 97 Systolic Architecturesn Basic principle: Replace a single PE with a regular array of PEs and carefully orchestrate flow of data between the PEs a achieve high throughput w/o increasing memory bandwidth requirementsn Differences from pipelining:q Array structure can be non-linearand multi-dimensionalq PE connections can be multidirectional (and different speed)q PEs can have local memory and execute kernels (rather than a piece of the instruction) 98 Systolic Computation Examplen 99 Systolic Computation Example: Convolution 100 Systolic Computation: Convolutiony1 w3w2x2x1 y1w1 w3w2w1x3x2x1 x3y1w3y1w2x2y2w1y2 w3w2w1x4x3x2 y2y3 w3w2x4x3w1 101 Systolic Computation: Convolution y1 w3w2w1 x2x1102 Systolic Computation: Convolution y1 w3w2w1 x3x2x1103 Systolic Computation: Convolution y1y2 w3w2w1 x3x2104 Systolic Computation: Convolution y1y2 w3w2w1 x4x3x2105 Systolic Computation: Convolution y1y2y3 w3w2w1 x4x3106 Systolic Computation Example: Convolutionn Worthwhile to implement adder and multiplier separately to allow overlapping of add/multiply executions 107 TODO: Example relating SP to systolic architecture for some computation (maybe the convolution)108 More Programmabilityn Each PE in a systolic arrayq Can store multiple weightsq Weights can be selected on the flyq Eases implementation of, e.g., adaptive filteringn Taken furtherq Each PE can have its own data and instruction memoryq Data memory a to store partial/temporary results, constantsq Leads to stream processing, pipeline parallelism n Moregenerally,stagedexecution109 Pipeline Parallelism 110 File Compression Example 111 Why pipeline parallelism in software?n Pipeline parallelism vs data parallelismq Why split pipeline stages across PEs?q No cycle-time benefit like we got in hardwaren Data movement patterns differq Pipeline parallelism: move input data between PEs q Data parallelism: move task code/data between PEsn Tight feedback loops within single stage q E.g., compression or encryptionn Appropriate design depends on application112 Systolic Array Summaryn Advantagesq Makes multiple uses of each data item a reduce data fetches q High concurrencyq Regular design (both data and control flow)n Disadvantagesq Not good at exploiting irregular parallelismq Relatively special purpose a need software, programmer support to be a general purpose model113 The WARP Computern HT Kung, CMU, 1984-1988n Linear array of 10 cells, each cell a 10 Mflop programmableprocessorn Attached to a general purpose host machinen High-level language and optimizing compiler to program the systolic arrayn Used extensively to accelerate vision and robotics tasks n Annaratone et al., Warp Architecture andImplementation, ISCA 1986.n Annaratone et al., The Warp Computer: Architecture, Implementation, and Performance, IEEE TC 1987.114 The WARP Computer 115 The WARP Computer 116 Resource Constraintsn Minimally indivisible sequences, i and j, can execute together if combined resources in a step do not exceed available resources.n R(i) is a resource configuration vector R(i) is the number of units of resource in r(i) is a resource usage vector s.t. 0 r(i) R(i)n Each node in G has an associated r(i)15-745 Seth Copen Goldstein 2000-5 117SCALAR REPLACEMENT 118 Scalar Replacement: Example DO I = 1, NDO J = 1, MA(I) = A(I) + B(J)ENDDOENDDODO I = 1, NT = A(I)DO J = 1, MT = T + B(J)ENDDOA(I) = T ENDDOn AllloadsandstorestoAintheinner loop have been savedn A(I)canbeleftinaregister throughout the inner loopn Superscalar+cachewillgetmostof n HighchanceofTbeingallocateda this, but not allocate A(I) to register register by compilerScalar Replacementn Convert array reference to scalar referencen Approach is to use dependences to achieve these memory hierarchy transformations Dependence and Memory Hierarchyn True or Flow – save loads and cache miss n Anti – save cache miss?n Output – save storesn Input – save loadsA(I) = … + B(I) … =A(I)+k A(I) = …… = B(I) Dependence and Memory Hierarchyn Loop Carried dependences – Consistent dependences most useful for memory management purposesn Consistent dependences – dependences with constant dependence distance Using Dependencesn In the reduction example DO I = 1, N DO J = 1, M A(I) = A(I) + B(J)ENDDO ENDDODO I = 1, NT = A(I)DO J = 1, MT = T + B(J)ENDDOA(I) = T ENDDOn True dependence – replace the references to A in the inner loop by scalar Tn Output dependence – store can be moved outside the inner loopn Antidependence – load can be moved before the inner loop Scalar Replacementn Example: Scalar Replacement in case of loop independent dependence DO I = 1, NA(I) = B(I) + CX(I) = A(I)*QENDDODO I = 1, N t = B(I) + CA(I) = t X(I) = t*QENDDOn One less load for each iteration for reference to A Scalar Replacementn Example: Scalar Replacement in case of loop carried dependence spanning single iterationDO I = 1, NA(I) = B(I-1)B(I) = A(I) + C(I)ENDDOtB = B(0)DO I = 1, N tA = tB A(I) = tA tB = tA + C(I) B(I) = tBENDDOn One less load for each iter for ref to B which had a loop carried true dependence of 1 itern Also one less load per iter for reference to A Scalar Replacementn Example: Scalar Replacement in case of loop carried dependence spanning multiple iterationsDO I = 1, NA(I) = B(I-1) + B(I+1)ENDDOnt1 = B(0)t2 = B(1)DO I = 1, N t3 = B(I+1) A(I) = t1 + t3 t1 = t2 t2 = t3ENDDOOne less load for each iter for ref to B which had a loop carried input dependence of 2 itersInvariants maintained weret1=B(I-1); t2=B(I);t3=B(I+1) nAiken/Nicolau Scheduling Step 1Perform scalar replacement to eliminate memory references where possible. for i:=1 to N doa := j A V[i-1]b := a A f c := e A j d := f A c e := b A d f := U[i] g: V[i] := b h: W[i] := dj := X[i]for i:=1 to N do a:=jAb b:=aAf c:=eAj d:=fAc e:=bAdf := U[i] g: V[i] := b h: W[i] := dj := X[i] 15-745 Seth Copen Goldstein 2000-5 127 Aiken/Nicolau Scheduling Step 2Unroll the loop and compute the data-dependence graph (DDG). DDG for rolled loop:for i:=1 to N do a := j A b b := a A f c := e A j d := f A c e := b A df := U[i] g: V[i] := b h: W[i] := dj := X[i]acjbd gfh e 15-745 Seth Copen Goldstein 2000-5 128 Aiken/Nicolau Scheduling Step 2, contd DDG for unrolled loop:for i:=1 to N do a := j A b b := a A f c:=eAj d:=fAc e:=bAdf:=U[i]g: V[i] := bh: W[i] := dj := X[i]a1 c1b1 d1g a j1 e h1 121b f1 c 22 g2 a3b 3j2 d 2f2e2 c3h115-745 Seth Copen Goldstein 2000-5 129 Aiken/Nicolau Scheduling Step 3Build a tableau of iteration number vs cycle time.ac 1 1b d1 1j iteration 123456 1 acfjfjfjfjfjfj 2bd3egha 1 h 4 cb g1a2 e115dga f 6ehb b1c 7 cga 2j28db2d 9 ehga g2a3 2 10 cb fh11 dga 2e112 ehb b3 2 13 cg c 14 d3 15 eh15-745 Seth Copen Goldstein 2000-5 130 cycle Aiken/Nicolau Scheduling Step 3Build a tableau of iteration number vs cycle time.basically, youre emulatingiteration1 1 123456ac a superscalar with infinite1 acfjfjfjfjfjfj 2 bd3 egh a4 cb5 dga6 eh b 7cga 8db9 ehga resources, infinite registerb d1renaming, always predicting 1jthe loop-back branch: h g a 1 e 1121thus, just puref1 b2jc2 data dependency 2dg2a3 2 10 cb fh11 dga 2e112 ehbb3 2 13 cg c 14 d3 15 eh15-745 Seth Copen Goldstein 2000-5 131cycle Aiken/Nicolau Scheduling Step 4Find repeating patterns of instructions. iteration 123456 1 acfjfjfjfjfjfj 2 bd 34 egh acb5 dga6 ehb7 cg a 8db9101112131415eh g a cbdga ehb cgd eh 15-745 Seth Copen Goldstein 2000-5 132cycle Aiken/Nicolau Scheduling Step 4Find repeating patterns of instructions. iteration 123456 1 acfjfjfjfjfjfj 2 bd34 egh acb5 dga6 ehb7 cg a 8db 9101112131415eh g a cbdga ehb cgd eh 15-745 Seth Copen Goldstein 2000-5 133cycle Aiken/Nicolau Scheduling Step 4Find repeating patterns of instructions. iteration 123456 1 acfjfjfjfjfjfj 2 bd34 egh acb5 dga6 ehb7 cg a 8dbGo back and relate slopes to DDG9101112131415eh g a cbdga ehb cgd eh 15-745 Seth Copen Goldstein 2000-5 134cycle Aiken/Nicolau Scheduling Step 5Coalesce the slopes. iteration iteration 123456 1234561 acfjfjfjfjfjfj 2 bd3 egh a4 cb1 acfj 2 bd 3 egh 4fjacb fjdg aeh bfj5 dga6 ehb7 cg a 8db8db9101112131415eh g a cbdga ehb cgd ehcg a9 eh g fj5 6 710 ca 11 db 12 eh g13 c 14 d 15 eh 15-745 Seth Copen Goldstein 2000-5 135cyclecycle Aiken/Nicolau Scheduling Step 6Find the loop body and reroll the loop.iteration 1234561 acfj 2bd 3 egh 4fjacb fj dga ehbfj 567 8db9101112131415ehg fj cadb eh g c dehcg a 15-745 Seth Copen Goldstein 2000-5 136cycle Aiken/Nicolau Scheduling Step 6Find the loop body and reroll the loop.Prologue/entry codeLoop body Epilogue/exit code iteration 123456 1 acfj2bd fj3 egh a4 cb fj5 dga6 ehbfj 78 cg a 91011 db 12 eh g13 c 14 d 15 ehdb ehg fjca 15-745 Seth Copen Goldstein 2000-5 137cycle Aiken/Nicolau Scheduling Step 7Generate code.(Assume VLIW-like machine for this example. The instructions on each line should be issued in parallel.) L:bi+1 := ai A fiW[i] := diai+2 := ji+1 A bi+1 i := i+1bN :=aN A fN-1 W[N-1] := dN-1w[N]:=dNa1:=j0 A b0 b1:=a1 A f0 e1:=b1 A d1 c2:=e1 A j1 d2:=f1 A c2 e2:=b2 A d2 c3:=e2 A j2di:=fi-1Aci ei := bi A di ci+1:=eiAjic1:=e0 A j0 d1:=f0 A c1 V[1] := b1 b2:=a2 A f1 V[2] := b2 W[2] := d2 V[3] := b3f1:=U[1]f2:=U[2] W[1] := d1f3 := U[3] a3:=j2 A b3:=a3 A a4:=j3 Aj1 := X[1]j2 := X[2] a2:=j1 A b1j3 := X[3] b2f2 f4 := U[4] b3 i:=3j4 := X[4]dN-1 :=fN-2 A cN-1 eN-1 := bN-1 A dN-1cN := eN-1 A jN-1 dN := fN-1 + cNeN :=bN A dNV[i+1] := bi+1 v[N] := bNfi+2 := U[I+2] ji+2 := X[i+2] if i
n For an edge, uv, we must have s(v)-s(u) 3 d(u,v)
15-745 Seth Copen Goldstein 2000-5 149
Precedence Constraints
n Cyclic: constraint becomes a tuple:
q p is the minimum iteration delay
(or the loop carried dependence distance) q d is the delay
n For an edge, uv, we must have s(v)-s(u) 3 d(u,v)-s*p(u,v)
np30
n If data dependence is
q within an iteration, p=0
q loop-carried across p iter boundaries, p>0
15-745 Seth Copen Goldstein 2000-5 150
Iterative Approach
n Finding minimum S that satisfies the constraints is NP- Complete.
n Heuristic:
q Find lower and upper bounds for S
q foreach s from lower to upper bound? n Schedule graph.
n If succeed, done
n Otherwise try again (with next higher s)
n Thus: Iterative Modulo Scheduling Rau MICRO94
15-745 Seth Copen Goldstein 2000-5 151
Iterative Approach
n Heuristic:
q Find lower and upper bounds for S
q foreach s from lower to upper bound
n Schedule graph.
n If succeed, done
n Otherwise try again (with next higher s)
n So the key difference:
q AN88 does not assume S when scheduling
q IMS must assume an S for each scheduling attempt to understand resource conflicts
15-745 Seth Copen Goldstein 2000-5 152
Lower Bounds
n Resource Constraints: SR
maximum over all resources of # of uses divided by #
available
n Precedence Constraints: SE
max delay over all cycles in dataflow graph
In practice, one is easy, other is hard.
Tims secret approach: just use SR as lower bound, then do binary search for best S
15-745 Seth Copen Goldstein 2000-5 153
Acyclic Example
a <0,2>
<0,3> c
b <0,1>
Lower Bound: SR=2 Upper Bound: 5
15-745 Seth Copen Goldstein 2000-5 154
Lower Bound on s
Assume 1 ALU and 1 MU
Assume latency Op or load is 1 cycle
a
c
for i:=1 to N do a := j A b b := a A f c := e A j d:=fAc e := b A d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
<1,1> <0,1>
<1,1> <0,1>
<1,1> <0,1>
j
b
h
f
d
g
<0,1>
<0,1>
<1,1>
Resources => 5 cycles Dependencies => 3 cycles
e
15-745 Seth Copen Goldstein 2000-5 155
Scheduling data structures
To schedule for initiation interval s:
n Create a resource table with s rows and R columns
n Create a vector, s, of length N for n instructions in the loop
q s[n] = the time at which n is scheduled, or NONE
n Prioritize instructions by some heuristic q critical path (or cycle)
q resource critical
15-745 Seth Copen Goldstein 2000-5 156
Scheduling algorithm
n Pick an instruction, n
n Calculate earliest time due to dependence constraints For all x=pred(n),
earliest = max(earliest, s(x)+d(x,n)-s.p(x,n)) n try and schedule n from earliest to (earliest+s-1)
s.t. resource constraints are obeyed.
q possible twist: deschedule a conflicting node to make way for n, maybe randomly, like sim anneal
n If we fail, then this schedule is faulty (i.e. give up on this s)
15-745 Seth Copen Goldstein 2000-5 157
Scheduling algorithm cont.
n We now schedule n at earliest, I.e., s(n) = earliest n Fix up schedule
q Successors, x, of n must be scheduled s.t.
s(x) >= s(n)+d(n,x)-s.p(n,x), otherwise they are removed
(descheduled) and put back on worklist.
n repeat this some number of times until either
q succeed, then register allocate q fail, then increase s
15-745 Seth Copen Goldstein 2000-5 158
Simplest Example
for () {
a = b+c
b = a*a
c = a*194 }
<1,1>
<1,1> <0,1> <0,1>
b
Resources:
a
1
1
c
What is IIres? What is IIrec?
15-745 Seth Copen Goldstein 2000-5 159
Simplest Example
for () { 0 a = b+c
b = a*a
c = a*194 1 }
a
Modulo Resource Table:
0 1
b
c
Try II = 2
1
15-745 Seth Copen Goldstein 2000-5 160
Simplest Example
for () { 0 a = b+c
b = a*a
c = a*194 1 }
a
Try II = 2
Modulo Resource Table:
0 1
1
b
c
1
15-745 Seth Copen Goldstein 2000-5 161
Simplest Example
for () { 0 a = b+c
b = a*a
c = a*194 1 }
Modulo Resource Table:
0 1
1
1
a
2
Try II = 2
b
c
1
15-745 Seth Copen Goldstein 2000-5 162
Simplest Example
for () { 0 a = b+c
b = a*a
c = a*194 1 }
a
b
Try II = 2
Modulo Resource Table:
0 1
earliest a: sigma(c) + delay(c) 2 = 2+1-2 = 1
1
c
2
1
15-745 Seth Copen Goldstein 2000-5 163
Simplest Example
for () {
a = b+c
b = a*a
c = a*194
0 1 2
}
b
a
Try II = 2
Modulo Resource Table:
0 1
earliest b? scheduled b? what next?
1
c
1
15-745 Seth Copen Goldstein 2000-5 164
Simplest Example
for () {
a = b+c
b = a*a
c = a*194
0 1 2 3
}
Try II = 2
Modulo Resource Table:
0 1
Lesson: lower bound may not be achievable
1
b
a
c
1
15-745 Seth Copen Goldstein 2000-5 165
Example
for i:=1 to N do a := j A b b := a A f c := e A j d := f A c e := b A d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
Priorities: ?
<1,1> <0,1>
<1,1> <0,1>
<1,1> <0,1>
<0,1>
<0,1>
<1,1>
g
a
h
j
f
c
b
d
e
15-745 Seth Copen Goldstein 2000-5 166
Example
for i:=1 to N do a := j A b b := a A f c := e A j d := f A c e := b A d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
Priorities: c,d,e,a,b,f,j,g,h
g
a
<1,1> <0,1>
<1,1> <0,1>
<1,1> <0,1>
<0,1>
<0,1>
<1,1>
h
f
c
j
b
d
e
15-745 Seth Copen Goldstein 2000-5 167
for i:=1 to N do a := j A b b := a A f c := e A j d := f A c e := b A d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
s=5
instr
s
a
b
c
d
e
f
g
h
j
ALU
MU
Priorities: c,d,e,a,b,f,j,g,h
a
b
j
h
f
c
g
e
d
15-745 Seth Copen Goldstein 2000-5 168
for i:=1 to N do a := j A b b := a A f c := e A j d := f A c e := b A d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
Priorities: a,b,f,j,g,h
s=5
instr
s
a
b
c
0
d
1
e
2
f
g
h
j
ALU
MU
c
d
e
j
a
c
f
b
d
g
h
e
15-745 Seth Copen Goldstein 2000-5 169
for i:=1 to N do a := j A b b := a A f c := e A j d := f A c e := b A d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
Priorities: b,f,j,g,h
s=5
instr
s
a
3
b
c
0
d
1
e
2
f
g
h
j
ALU
MU
c
d
e
a
j
a
c
f
b
d
g
h
e
15-745 Seth Copen Goldstein 2000-5 170
for i:=1 to N do a := j A b b := a A f c := e A j d := f A c e := b A d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
Priorities: b,f,j,g,h
s=5
instr
s
a
3
b
4
c
0
d
1
e
2
f
g
h
j
ALU
MU
c
d
e
a
b
j
a
c
f
b
d
g
h
e
15-745 Seth Copen Goldstein 2000-5 171
for i:=1 to N do a := j A b b := a A f c := e A j d := f A c e := b A d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
Priorities: e,f,j,g,h
s=5
ALU
MU
c
d
a
b
j
a
c
f
b
d
h
g
instr
a
b
c
d
e
f
g
h
j
1
s
3
4
0
e
b causes b->e edge violation
15-745 Seth Copen Goldstein 2000-5 172
for i:=1 to N do a := j A b b := a A f c := e A j d := f A c e := b A d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
Priorities: e,f,j,g,h
s=5
ALU
MU
c
d
e
a
b
j
a
c
f
b
d
h
g
instr
a
b
c
d
e
f
g
h
j
1
s
3
4
0
7
e
e causes e->c edge violation
15-745 Seth Copen Goldstein 2000-5 173
for i:=1 to N do a := j A b b := a A f c := e A j d := f A c e := b A d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
Priorities: f,j,g,h
s=5
instr
s
a
3
b
4
c
5
d
6
e
7
f
0
g
h
j
ALU
MU
c
f
d
e
a
b
j
a
c
f
b
d
g
h
e
15-745 Seth Copen Goldstein 2000-5 174
for i:=1 to N do a := j A b b := a A f c := e A j d := f A c e := b A d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
Priorities:j,g,h
s=5
instr
s
a
3
b
4
c
5
d
6
e
7
f
0
g
h
j
1
ALU
MU
c
f
d
j
e
a
b
j
a
c
f
b
d
g
h
e
15-745 Seth Copen Goldstein 2000-5 175
for i:=1 to N do a := j A b b := a A f c := e A j d := f A c e := b A d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
Priorities:g,h
s=5
instr
s
a
3
b
4
c
5
d
6
e
7
f
0
g
7
h
8
j
1
ALU
MU
c
f
d
j
e
g
a
h
b
j
a
c
f
b
d
g
h
e
15-745 Seth Copen Goldstein 2000-5 176
Creating the Loop
n Create the body from the schedule.
n Determine which iteration an instruction
falls into
q Mark its sources and dest as belonging to that iteration.
q Add Moves to update registers n Prolog fills in gaps at beginning
q For each move we will have an instruction in prolog, and we fill in dependent instructions
n Epilog fills in gaps at end
instr
s
a
3
b
4
c
5
d
6
e
7
f
0
g
7
h
8
j
1
15-745 Seth Copen Goldstein 2000-5 177
f0 = U[0]; j0 = X[0];
FOR i = 0 to N f1 := U[i+1] j1 := X[i+1] nop
a := j0 ? b b := a ? f0 c := e ? j0 d := f0 ? c e := b ? d
h: W[i] := d f0 = f1
j0 = j1
g: V[i] := b
15-745 Seth Copen Goldstein 2000-5 178
Conditionals
n What about internal control structure, I.e., conditionals n Three approaches
q Schedule both sides and use conditional moves
q Schedule each side, then make the body of the conditional a
macro op with appropriate resource vector q Trace schedule the loop
15-745 Seth Copen Goldstein 2000-5 179
What to take away
n Architecture includes compiler!
n Dependence analysis is very important
(including alias analysis)
n Software pipelining crucial for statically scheduled, but also very useful for dynamically scheduled
15-745 Seth Copen Goldstein 2000-5 180
Multiflow: An early VLIW architecture (1987)
EPIC Intel IA-64 Architecture
n Gets rid of lock-step execution of instructions within a VLIW instruction
n Idea: More ISA support for static scheduling and parallelization q Specify dependencies within and between VLIW instructions
(explicitly parallel)
+ No lock-step execution
+ Static reordering of stores and loads + dynamic checking
Hardware needs to perform dependency checking (albeit aided by software)
Other disadvantages of VLIW still exist
n Huck et al., Introducing the IA-64 Architecture, IEEE Micro, Sep/Oct 2000.
183
IA-64 Instructions
n IA-64 Bundle (~EPIC Instruction)
q Total of 128 bits
q Contains three IA-64 instructions
q Template bits in each bundle specify dependencies within a bundle
n IA-64 Instruction
q Fixed-length 41 bits long
q Contains three 7-bit register specifiers
q Contains a 6-bit field for specifying one of the 64 one-bit predicate registers
184
IA-64 Instruction Bundles and Groups
n Groups of instructions can be executed safely in parallel
q Marked by stop bits
n Bundles are for packaging
q Groups can span multiple bundles
n Alleviates recompilation need somewhat
185
Template Bits
n Specify two things
q Stop information: Boundary of independent instructions
q Functional unit information: Where should each instruction be routed
186
Reviews
There are no reviews yet.