[SOLVED] CS计算机代考程序代写 cache algorithm data structure assembly AI x86 concurrency compiler prolog Static Scheduling & VLIW 15-740

30 $

File Name: CS计算机代考程序代写_cache_algorithm_data_structure_assembly_AI_x86_concurrency_compiler_prolog_Static_Scheduling_&_VLIW_15-740.zip
File Size: 1281.12 KB

SKU: 4228178153 Category: Tags: , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,

Or Upload Your Assignment Here:


Static Scheduling & VLIW 15-740
Prof. Nathan Beckmann
(Original slides by Onur Mutlu, edited by Seth Goldstein)
Carnegie Mellon University

Reprise of dynamic scheduling
n
2

DO WE REALLY NEED ALL THIS COMPLEX HARDWARE?
HOW FAR CAN WE GET WITHOUT IT?
3

Key Questions
Q1. How do we find independent instructions to fetch/execute?
Q2. How do we enable more compiler optimizations?
e.g., common subexpression elimination, constant propagation, dead code elimination, redundancy elimination, …
Q3. How do we increase the instruction fetch rate?
i.e., have the ability to fetch more instructions per cycle
4

Key Questions
Q1. How do we find independent instructions to fetch/execute?
Q2. How do we enable more compiler optimizations?
e.g., common subexpression elimination, constant propagation, dead code elimination, redundancy elimination, …
Q3. How do we increase the instruction fetch rate?
i.e., have the ability to fetch more instructions per cycle
A: Enabling the compiler to optimize across a larger number of instructions that will be executed straight line (without branches getting in the way) eases all of the above
5

Very long instruction word – VLIW
n Compiler does the scheduling statically
n Simple hardware with multiple function units
q Reduced hardware complexity
q Little or no scheduling done in hardware, e.g., in-order q Hopefully, faster clock and less power
n Compiler required to group and schedule instructions (compare to OoO superscalar)
q Predicated instructions to help with scheduling (trace, etc.) q More registers (for software pipelining, etc.)
6

VLIW example – 2-cycle loads
n RISC code MUL R1, R3, 3
LDR4, 0(R1)
ADD R2, R2, R4
SUB R3, R3, 1
BNEZR3, -4
n VLIW code MUL R1, R3, 3
LDR4, 0(R1)
NOP
ADD R2, R2, R4
SUB R3, R3, 1
NOP
NOP
BNEZ R3, -4
7

Very long instruction word – VLIW
n Compiler does the scheduling statically
n Simple hardware with multiple function units
n Compiler required to group and schedule instructions (compare to OoO superscalar)
n Example machines:
q Multiflow, Cydra 5 (8-16 ops per VLIW)
q IA-64 Itanium (3 ops per bundle, “EPIC” VLIW variant) q TMS32xxxx (5+ ops per VLIW, DSPs)
q Crusoe (4 ops per VLIW, JIT x86 è VLIW)
q Tilera TILEPro (3 ops per VLIW, 100+ cores)
8

Comparison between SS « VLIW
From Mark Smotherman, “Understanding EPIC Architectures and Implementations”

Comparison: CISC, RISC, VLIW
VLIW is a natural extension of RISC ideas to superscalar

VLIW relies on compiler for ILP
n
11

VLIW: Finding Independent Operations
n Within a basic block, there is limited instruction-level parallelism
n To find multiple instructions to be executed in parallel, the compiler needs to consider multiple basic blocks
n Problem: Moving an instruction above a branch is unsafe because instruction is not guaranteed to be executed
n Idea: Enlarge blocks at compile time by finding the frequently-executed paths
q Trace scheduling
q Superblock scheduling q Hyperblock scheduling q Software Pipelining
It’s all about the compiler and how to schedule the instructions to maximize parallelism
12

List Scheduling: For 1 basic block
n Idea:Assignprioritytoeachinstruction
n Initializereadylistthatholdsallreadyinstructions
n Choose one ready instruction I from ready list with the highest priority
n InsertIintoschedule
q Ensuring resources are available (structural hazards)
n Addthoseinstructionswhoseprecedenceconstraintsare now satisfied into the ready list
13

Data Precedence Graph
i1 i2 i5 i6 i7 i10 i11 i12
222222 i3 2 i8 2 i13
2
22
i4
i9
i14
4
4 i15
2 i16
14

Instruction Prioritization Heuristics
n Number of descendants in precedence graph
n Maximum latency from root node of precedence graph n Length of operation latency
n Ranking of paths based on importance
n Some combination of above
15

VLIW List Scheduling
n Assign priorities – critical path in this example
n Compute data ready list (DRL) – all operations whose predecessors
have been scheduled.
n Select from DRL in priority order while checking resource constraints n Add newly ready operations to DRL and repeat for next instruction
5
1
23334
23456
223
789
112
1011 12
1
4-wide VLIW
Data Ready List
1
{1}
6
3
4
5
{2,3,4,5,6}
9
2
7
8
{2,7,8,9}
12
10
11
{10,11,12}
13
{13}
13
16

VLIW List Scheduling
5
1 2233435364
4-wide VLIW
2
2
3
2
1
13
789
11
1011 12
17

VLIW List Scheduling
5
1 2233435364
4-wide VLIW
1
6
3
4
5
9
2
7
8
12
10
11
13
2
2
3
2
1
13
789
11
1011 12
18

VLIW List Scheduling+Structural Hazards
5
1 2233435364
4-wide VLIW
2
2
3
2
1011 12
1
13
789 11
19

VLIW List Scheduling+Structural Hazards
5
1 2233435364
4-wide VLIW
1
2
3
6
5
9
4
7
8
12
10
11
13
2
2
3
2
1011 12
1
13
789 11
20

Extending the scheduling domain
n Basic block is too small to get any real parallelism q Recall: 88% of OOO benefit from speculation è
larger scheduling window
n How to extend the basic block?
q Why do we have basic blocks in the first place?
q Loops
n Loop unrolling
n Software pipelining
q Non-loops
n Will almost always involve some speculation n Thus profiling may be very important
[MTZ’13]
21

Safety and Legality in Code Motion
n Two characteristics of speculative code motion:
q Safety: whether or not spurious exceptions may occur q Legality: whether or not result will be always correct
n Four possible types of code motion:
r4 = r1 …
r1 = …
r1 = r2 & r3
r1 = r2 & r3
(a) safe and legal
(b) illegal
r1 = …
r1 = load A
r4 = r1 …
r1 = load A
(c) unsafe
(d) unsafe and illegal
22

Code Movement Constraints
n Downward
q When moving an operation from a BB to one of its dest BB’s,
n all the other dest basic blocks should still be able to use the result of the operation
n the other source BB’s of the dest BB should not be disturbed n Upward
q When moving an operation from a BB to its source BB’s n register values required by the other dest BB’s must not be
destroyed
n the movement must not cause new exceptions
23

Trace Scheduling
n Trace: A frequently executed path in the control-flow graph (has multiple side entrances and multiple side exits)
n Idea: Find independent operations within a trace to pack into VLIW instructions.
q Traces determined via profiling
q Compiler adds fix-up code for correctness (if a side entrance or side exit of a trace is exercised at runtime, corresponding fix-up code is executed)
24

Trace Scheduling Idea
25

Trace Scheduling (II)
n There may be conditional branches from the middle of the trace (side exits) and transitions from other traces into the middle of the trace (side entrances).
n These control-flow transitions are ignored during trace scheduling.
n After scheduling, fix-up/bookkeeping code is inserted to ensure the correct execution of off-trace code.
n Fisher, “Trace scheduling: A technique for global microcode compaction,” IEEE TC 1981.
26

Trace Scheduling (III)
Instr 1
Instr 2
Instr 3
Instr 4
Instr 5
Instr 2
Instr 3
Instr 4
Instr 1
Instr 5
What bookkeeping is required when Instr 1
is moved below the side entrance in the trace?
27

Trace Scheduling (IV)
Instr 3
Instr 4
Instr 1
Instr 2
Instr 3
Instr 4
Instr 5
Instr 2
Instr 3
Instr 4
Instr 1
Instr 5
28

Trace Scheduling (V)
Instr 1
Instr 2
Instr 3
Instr 4
Instr 5
Instr 1
Instr 5
Instr 2
Instr 3
Instr 4
What bookkeeping is required when Instr 5 moves above the side entrance in the trace?
29

Trace Scheduling (VI)
Instr 5
Instr 1
Instr 2
Instr 3
Instr 4
Instr 5
Instr 1
Instr 5
Instr 2
Instr 3
Instr 4
30

Trace Scheduling Fixup Code Issues
n Sometimes need to copy instructions more than once to ensure correctness on all paths (see C below)
Original trace
A
B X C
D A’ B’ C’ Y Scheduled B
trace E DYA
C’’’ E’’ D’’
EC
BX Correctness C
B’’ X
DY
31

Trace Scheduling Overview
n Trace Selection
q select seed block (the highest frequency basic block) q extend trace (along the highest frequency edges)
forward (successor of the last block of the trace)
backward (predecessor of the first block of the trace) q don’t cross loop back edge
q bound max_trace_length heuristically
n Trace Scheduling
q build data precedence graph for a whole trace
q perform list scheduling and allocate registers
q add compensation code to maintain semantic correctness
n Speculative Code Motion (upward)
q move an instruction above a branch if safe
32

Trace Scheduling Example (I)
B1
9 stalls
B3
1 stall
fdiv f1, f2, f3 fadd f4, f1, f5 beq r1, $0
fdiv f1, f2, f3 fadd f4, f1, f5
beq r1, $0
990
ld r2, 0(r3)
990
800
800
10
r2 and f2 not live out
B3
f2 not live out
B6
B2
ld r2, 4(r3)
10
ld r2, 0(r3)
add r2, r2, 4 beq r2, $0
fsub f2, f2, f6 st.d f2, 0(r8)
add r3, r3, 4 add r8, r8, 4
add r2, r2, 4 beq r2, $0
B4
B5
200
fsub f2, f3, f7
200
B7
B6
1 stall
fsub f2, f2, f6 st.d f2, 0(r8)
add r3, r3, 4 add r8, r8, 4
33

Trace Scheduling Example (I)
9 stalls
fdiv f1, f2, f3 fadd f4, f1, f5 beq r1, $0
ld r2, 0(r3)
add r2, r2, 4 beq r2, $0
fsub f2, f2, f6 st.d f2, 0(r8)
B6
fdiv f1, f2, f3 beq r1, $0
ld r2, 0(r3) fsub f2, f2, f6
add r2, r2, 4 beq r2, $0
st.d f2, 0(r8)
add r3, r3, 4 add r8, r8, 4 fadd f4, f1, f5
1 stall
r2 and f2 not live out
B3 0 stall 0 stall
f2 not live out
1 stall
1 stall
add r3, r3, 4 add r8, r8, 4
B3 B6
34

Trace Scheduling Example (II)
fdiv f1, f2, f3 beq r1, $0
ld r2, 0(r3) fsub f2, f2, f6
add r2, r2, 4 beq r2, $0
st.d f2, 0(r8)
add r3, r3, 4
add r8,r8,4
fadd f4, f1, f5 fadd f4, f1, f5
B6
0 stall 0 stall
1 stall
fdiv f1, f2, f3 beq r1, $0
ld r2, 0(r3) fsub f2, f2, f6
add r2, r2, 4 beq r2, $0
st.d f2, 0(r8)
fadd f4, f1, f5
Split comp. code
fadd f4, f1, f5
add r3, r3, 4
B3 add r8,r8,4 B3
B6
35

Trace Scheduling Example (III)
fdiv f1, f2, f3 beq r1, $0
ld r2, 0(r3) fsub f2, f2, f6 add r2, r2, 4 beq r2, $0
st.d f2, 0(r8)
fadd f4, f1, f5
Split comp. code
fadd f4, f1, f5
add r3, r3, 4 add r8, r8, 4 fadd f4, f1, f5
B3
B6
add r3, r3, 4 add r8, r8, 4
Join comp. code
36

Trace Scheduling Example (IV)
fdiv f1, f2, f3 beq r1, $0
ld r2, 0(r3) fsub f2, f2, f6 add r2, r2, 4 beq r2, $0
st.d f2, 0(r8)
add r3, r3, 4 add r8, r8, 4 fadd f4, f1, f5
fadd f4, f1, f5
Split comp. code
fadd f4, f1, f5
B3
add r2, r2, 4 beq r2, $0
fsub f2, f2, f6 st.d f2, 0(r8)
add r3, r3, 4 add r8, r8, 4
B6
Copied split instructions
add r3, r3, 4 add r8, r8, 4
Join comp. code
37

Trace Scheduling Example (V)
ld r2, 0(r3) fsub f2, f2, f6 add r2, r2, 4
beq r2, $0
fdiv f1, f2, f3 beq r1, $0
fadd f4, f1, f5
ld r2, 4(r3)
add r2, r2, 4 beq r2, $0
B3
st.d f2, 0(r8) add r3, r3, 4 add r8, r8, 4 fadd f4, f1, f5
fadd f4, f1, f5
B6
fsub f2, f2, f6 st.d f2, 0(r8)
add r3, r3, 4 add r8, r8, 4
fsub f2, f3, f7
add r3, r3, 4 add r8, r8, 4
38

Trace Scheduling Tradeoffs
n Advantages
+ Enables the finding of more independent instructions à fewer
NOPs in a VLIW instruction
n Disadvantages
— Profile dependent
— What if dynamic path deviates from trace à lots of NOPs in the VLIW instructions
— Code bloat and additional fix-up code executed — Due to side entrances and side exits
— Infrequent paths interfere with the frequent path — Effectiveness depends on the bias of branches
— Unbiased branches à smaller traces à less opportunity for finding independent instructions
39

Superblock Scheduling
n Trace: multiple entry, multiple exit block
n Superblock: single-entry, multiple exit block
q A trace with side entrances are eliminated
q Infrequent paths do not interfere with the frequent path
+ More optimization/scheduling opportunity than traces
+ Eliminates “difficult” bookkeeping due to side entrances
Hwu+, “The Superblock: An Effective Technique for VLIW and superscalar compilation,” J of SC 1991.
40

Superblock example
opA: mul r1,r2,3
99
opC: mul r3,r2,3
opA: mul r1,r2,3
99 opB: add r2,r2,1
11
opB: add r2,r2,1
opC’: mul r3,r2,3
opC: mul r3,r2,3
1
Original Code
Code After Superblock Formation
opA: mul r1,r2,3
99
opC: mov r3,r1
1
opB: add r2,r2,1
opC’: mul r3,r2,3
Code After Common Subexpression Elimination
41

Superblock Scheduling Shortcomings
— Still profile-dependent
— No single frequently executed path if there is an unbiased
branch
— Reduces the size of superblocks
— Code bloat and additional fix-up code executed — Due to side exits
42

Hyperblock Scheduling
n Idea: Use predication support to eliminate unbiased branches and increase the size of superblocks
n Hyperblock: A single-entry, multiple-exit block with internal control flow eliminated using predication (if-conversion)
n Advantages
+ Reduces the effect of unbiased branches on scheduled block size
n Disadvantages
— Requires predicated execution support
— All disadvantages of predicated execution
43

Hyperblock Formation (I)
n Hyperblock formation 1. Block selection
2. Tail duplication
3. If-conversion
10
90 80
80 20
20
n Block selection
q Select subset of BBs for inclusion in HB
q Difficult problem 10
q Weighted cost/benefit function n Height overhead
n Resource overhead
n Dependency overhead
n Branch elimination benefit n Weighted by frequency
90 10
10
n Mahlke et al., “Effective Compiler Support for Predicated Execution Using the Hyperblock,” MICRO 1992.
BB1
BB2
BB3
BB4
BB5
BB6
44

Hyperblock Formation (II)
Tail duplication same as with Superblock formation 10
BB1
80
BB2
80
10
90 10
BB6
10
20
BB1
80
BB2
80
20
BB3
20
90
10
BB3
20
BB4
BB4
10
BB5
BB5
90
BB6
81 9
10 19
BB6’
45

Hyperblock Formation (III)
10
BB1
80
BB2
80
If-convert (predicate) intra-hyperblock branches 10
BB1 p1,p2 = CMPP
BB2 if p1
BB3 if p2
BB4
BB6
BB4
20
BB3
20
90
10
10
BB5
BB6
81 9
9
1
10
81
9
1
BB6’
BB5
BB6’
46

WHAT ABOUT MEMORY?
47

Non-Faulting Loads and Exception Propagation
ld.s r1=[a]
inst 1 inst 2 …. br
inst 1 inst 2 ….
br
unsafe code
motion
….
ld r1=[a]
use=r1
n ld.s fetches speculatively from memory
i.e. any exception due to ld.s is suppressed
n If ld.s r1 did not cause an exception then chk.s r1 is a NOP, else a branch is taken (to execute some compensation code)
….
chk.s r1
use=r1
ld r1=[a]
48

Non-Faulting Loads and Exception Propagation in IA-64
ld.s r1=[a]
inst 1 inst 2 use=r1 ….
br
inst 1 inst 2 ….
br br
unsafe code
motion
….
ld r1=[a] use=r1
n Load data can be speculatively consumed prior to check
n “speculation” status is propagated with speculated data
n Any instruction that uses a speculative result also becomes speculative itself (i.e. suppressed exceptions)
n chk.s checks the entire dataflow sequence for exceptions
….
chk.s use
ld r1=[a] use=r1
49

Aggressive ST-LD Reordering in IA-64
inst 1
inst 2
….
st [?]
st[?] ….
ld r1=[x]
use=r1
ld.a r1=[x]
inst 1
inst 2
….
st [?]
….
ld.c r1=[x] use=r1
potential aliasing
n ld.a starts the monitoring of any store to the same address as the advanced load
n If no aliasing has occurred since ld.a, ld.c is a NOP n If aliasing has occurred, ld.c re-loads from memory
50

Aggressive ST-LD Reordering in IA-64
inst 1
inst 2
….
st [?]
st[?] ….
ld r1=[x] use=r1
ld.a r1=[x]
inst 1 inst 2 use=r1 ….
st [?] …. chk.a X ….
potential aliasing
ld r1=[a] use=r1
51

Can We Do Better?
n Hyperblock still
q Profile dependent
q Requires fix-up code
q And, requires predication support
n Single-entry, single-exit enlarged blocks
q Block-structured ISA
n Optimizes multiple paths (can use predication to enlarge blocks) n No need for fix-up code (duplication instead of fixup)
52

Block Structured ISA
n Blocks (> instructions) are atomic (all-or-none) operations q Either all of the block is committed or none of it
n Compiler enlarges blocks by combining basic blocks with their control flow successors
q Branches within the enlarged block converted to “fault” operations à if the fault operation evaluates to true, the block is discarded and the target of fault is fetched
Melvin and Patt, “Enhancing Instruction Scheduling with a Block-Structured ISA,” IJPP 1995.
53

Block Structured ISA (II)
n Advantages:
+ Larger atomic blocks à larger units can be fetched from I-cache
+ Aggressive compiler optimizations (e.g. reordering) can be enabled within atomic blocks (no side entries or exits)
+ Can explicitly represent dependencies among operations within an enlarged block
n Disadvantages:
— “Fault operations” can lead to work to be wasted (atomicity)
— Code bloat (multiple copies of the same basic block exists in the binary and possibly in I-cache)
— Need to predict which enlarged block comes next n Optimizations
q Within an enlarged block, the compiler can perform optimizations that cannot normally/easily be performed across basic blocks
54

Block Structured ISA (III)
n Hao et al., “Increasing the instruction fetch rate via block- structured instruction set architectures,” MICRO 1996.
55

Superblock vs. BS-ISA
n Superblock
q Single-entry, multiple exit code block
q Not atomic
q Compiler inserts fix-up code on superblock side exit
n BS-ISA blocks
q Single-entry, single exit
q Atomic
q Need to roll back to the beginning of the block on fault
56

Superblock vs. BS-ISA
n Superblock
+ No ISA support needed
— Optimizes for only 1 frequently executed path
— Not good if dynamic path deviates from profiled path à missed opportunity to optimize another path
n Block Structured ISA
+ Enables optimization of multiple paths and their dynamic selection.
+ Dynamic prediction to choose the next enlarged block. Can dynamically adapt to changes in frequently executed paths at run- time
+ Atomicity can enable more aggressive code optimization — Code bloat becomes severe as more blocks are combined
— Requires “next enlarged block” prediction, ISA+HW support — More wasted work on “fault” due to atomicity requirement
57

Summary and Questions
n Trace, superblock, hyperblock, block-structured ISA
n How many entries, how many exits does each of them have?
q What are the corresponding benefits and downsides?
n What are the common benefits?
q Enable and enlarge the scope of code optimizations q Reduce fetch breaks; increase fetch rate
n What are the common downsides?
q Code bloat (code size increase)
q Wasted work if control flow deviates from enlarged block’s path
58

VLIW Summary
n Heavy reliance on compiler (push RISC to the extreme) q Compiler algorithms (e.g., software pipelining) have lasting
impact outside of VLIW
n Is there enough statically knowable parallelism?
q E.g., memory aliasing and branch bias n What about wasted FUs? Code bloat?
q Code size is already a big problem with x86 apps!
n Architecture joke: “VLIW is the architecture of the future,
and always will be.”
n Yet many DSPs are VLIW. Why?
59

WHAT ABOUT LOOPS?
60

The problem with loops
n Consider the following code:
for (int i = 0; i < N; i++) {b[i] = b[i] * b[i];}61 The problem with loopsn Consider the following code:for (int i = 0; i < N; i++) {b[i] = b[i] * b[i];}n RISC assembly (LD & MUL 2-cycles) LOOP:Useful workLoop overheadLD R1, 0(R3)MUL R2, R1, R1ST R2, 0(R3) ADD R3, R3, 4BLT R3, R4, LOOPè 7 cycles per iteration 62 The problem with loopsn Consider the following code:for (int i = 0; i < N; i++) {b[i] = b[i] * b[i];}n VLIW assembly (1 ALU, 1 LD/ST; 2-cycles) LOOP:è5 cycles per iterationNOP NOPLD R1, 0(R3)NOP NOP NOPUseful workMUL R2, R1, R1 ADD R,3 R,3 4BLT R3, R4, LOOPST R2, -4(R3) Loop overhead 63 Amortize overheads by unrolling loops n Key idea is to schedule the following code instead:for (int i = 0; i < N; i+=4) {b[i+0] = b[i+0] * b[i+0];b[i+1] = b[i+1] * b[i+1];b[i+2] = b[i+2] * b[i+2];b[i+3] = b[i+3] * b[i+3]; }Loop unrollingè Larger scheduling block è Better schedule 64 Amortize overheads by unrolling loops n VLIW assemblyLOOP: NOP NOPè 2 cycles per iterationLD R1, 0(R9)LD R3, 4(R9)LD R5, 8(R9)LD R7, 12(R9)ST R2, 0(R9)ST R4, 4(R9)ST R6, 8(R9)ST R8, -4(R9)MUL R2, R1, R1MUL R4, R3, R3MUL R6, R5, R5MUL R8, R7, R7ADD R9, R9, 16BLT R9, R10 LOOPLoop overheadUseful work 65 Correctness of loop unrollingn Is this transformation legal?for (int i = 0; i < N; i++) {b[i] = b[i] * b[i];}for (int i = 0; i < N; i+=4) {b[i+0] = b[i+0] * b[i+0];b[i+1] = b[i+1] * b[i+1];b[i+2] = b[i+2] * b[i+2];b[i+3] = b[i+3] * b[i+3];} 66 Correctness of loop unrollingn Instead, schedule the following code: int i;for (i = 0; i+3 < N; i+=4) {b[i+0] = b[i+0] * b[i+0];b[i+1] = b[i+1] * b[i+1];b[i+2] = b[i+2] * b[i+2];b[i+3] = b[i+3] * b[i+3];}for (; i < N; i++) {b[i] = b[i] * b[i];}67 Loop unrolling summaryn Advantagesq Reduces loop overheadq Improves code schedule within loop n (eg, hiding MUL latency in example)n Disadvantagesq Increases code sizeq Less effective on loops with internal branches n Can use predication … more on this later68 Loop Unrollingn Idea: Replicate loop body multiple times within an iteration + Reduces loop maintenance overheadq Induction variable increment or loop condition test + Enlarges basic block (and analysis scope)q Enables code optimization and scheduling opportunities– What if iteration count not a multiple of unroll factor? (need extra code to detectthis)– Increases code size 69 Can we do better?n VLIW assembly (ALU + LD/ST) LOOP: NOPNOPLD R1, 0(R9)LD R3, 4(R9)LD R5, 8(R9)LD R7, 12(R9)ST R2, 0(R9)ST R4, 4(R9)ST R6, 8(R9)ST R8, -4(R9)MUL R2, R1, R1MUL R4, R3, R3MUL R6, R5, R5MUL R8, R7, R7ADD R9, R9, 16BLT R9, R10 LOOPè 2 cycles per iterationLoop overheadUseful workNot in this case – LD/ST unit is at 100% utilization 70 Can we do better?n VLIW assembly (3-wide) LOOP: NOPNOPNOP NOP NOPLD R1, 0(R9)LD R3, 4(R9)LD R5, 8(R9)LD R7, 12(R9) MUL R2, R1, R1MUL R4, R3, R3MUL R6, R5, R5MUL R8, R7, R7 ST R2, 0(R9)ST R4, 4(R9)ST R6, 8(R9)ST R8, 12(R9)NOP ADD R9, R9, 16BLT R9, R10, LOOPNOPLoop overheadè 7/4 cycles per iterationUseful work 71 Can we do better?n VLIW assembly (3-wide) LOOP: NOPNOPNOP NOP NOPLD R1, 0(R31)LD R3, 4(R31)LD R5, 8(R31)LD R7, 12(R31)LD R9, 12(R31)LD R11, 12(R31)LD R13, 12(R31)LD R15, 12(R31) MUL R2, R1, R1MUL R4, R3, R3MUL R6, R3, R3MUL R8, R3, R3MUL R10, R3, R3MUL R12, R3, R3MUL R14, R5, R5MUL R16, R7, R7 ST R2, 0(R31)ST R4, 4(R31)ST R6, 8(R31)ST R8, 12(R31)ST R10, 16(R31)ST R12, 20(R31)ST R14, 24(R31)ST R16, -4(R31)NOP ADD R31, R31, 32BLT R31, R10, LOOPNOPè 11/8 cycles per iterationLoop overheadUseful work…but code & register bloat 72 Can we do better than unrolling? n VLIW assembly (3-wide) LOOP: NOPNOPRamp-up & ramp-down overhead NOPLoop overheadUseful workNOP NOP NOPLD R1, 0(R31)LD R3, 4(R31)LD R5, 8(R31)LD R7, 12(R31)LD R9, 12(R31)LD R11, 12(R31)LD R13, 12(R31)LD R15, 12(R31) MUL R2, R1, R1MUL R4, R3, R3MUL R6, R3, R3MUL R8, R3, R3MUL R10, R3, R3MUL R12, R3, R3MUL R14, R5, R5MUL R16, R7, R7 ST R2, 0(R31)ST R4, 4(R31)ST R6, 8(R31)ST R8, 12(R31)ST R10, 16(R31)ST R12, 20(R31)ST R14, 24(R31)ST R16, -4(R31)NOP ADD R31, R31, 32BLT R31, R10, LOOP73 Can we do better than unrolling? n VLIW assembly (3-wide) LOOP: NOPNOPRamp-up & ramp-down overheadNOP NOP NOP LD R1, 0(R31)LD R3, 4(R31)LD R5, 8(R31)LD R7, 12(R31)LD R9, 12(R31)LD R11, 12(R31)LD R13, 12(R31)LD R15, 12(R31) MUL R2, R1, R1MUL R4, R3, R3MUL R6, R3, R3MUL R8, R3, R3MUL R10, R3, R3MUL R12, R3, R3MUL R14, R5, R5MUL R16, R7, R7 ST R2, 0(R31)ST R4, 4(R31)ST R6, 8(R31)ST R8, 12(R31)ST R10, 16(R31)ST R12, 20(R31)ST R14, 24(R31)ST R16, -4(R31)NOP ADD R31, R31, 32BLT R31, R10, LOOPPerfect efficiencyNOPLoop overheadUseful workGoal: Maintain peak efficiency, w/out ramp-up/down 74 Software Pipeliningn Idea: Move instructions across iterations of the loop q Very large improvements in running time are possible n E.g., 5-wide VLIWLOOP:{LD R1, 0(R9) NOPNOP NOPLD R1, 4(R9) MUL R2, R1, R1ADD R9, R9, 4NOPADD R9, R9, 4BGE R9, R10, ENDCurrent iteration One iteration ago Two iterations agoNOP SUB R10, R10, 4LD R1, 0(R9) MUL R2, R1, R1ST R2, -8(R9)ADD R9, R9, 4BLT R9, R10, LOOP}NOP ST R2, -4(R9) NOP ST R2, 0(R9)END:NOPNOPNOPNOPNOPMUL R2, R1, R1NOPNOP 15-745 © Seth Copen Goldstein 2000-5 75 Software Pipeliningn Idea: Move instructions across iterations of the loop q Very large improvements in running time are possible n E.g., 5-wide VLIWLD R1, 0(R9) NOPNOP NOPLD R1, 4(R9) MUL R2, R1, R1NOP ADD R9, R9, 4 NOP ADD R9, R9, 4 BGE R9, R10, ENDè 1 cycle per iteration}NOP ST R2, -4(R9) NOP ST R2, 0(R9) LOOP:Perfect efficiencyEND:{ SUB R10, R10, 4LD R1, 0(R9) NOPNOPNOPNOPNOPMUL R2, R1, R1NOPNOPMUL R2, R1, R1ST R2, -8(R9)ADD R9, R9, 4BLT R9, R10, LOOP 15-745 © Seth Copen Goldstein 2000-5 76 Goal of SPn Increase distance between dependent operations by moving destination operation to a later iterationA:a¬ ld[d] B:b¬ a*aC: st [d], b D:d¬ d+4Assume all have latency of 2ABCD15-745 © Seth Copen Goldstein 2000-5 77 Can we decrease the latency?n Lets unrollA: a¬ld[d]B: b¬a*aC: st [d], b D: d¬d+4 A1: a ¬ ld [d] B1: b ¬ a * a C1: st [d], b D1: d¬ d+4 ABCDA1B1C1D115-745 © Seth Copen Goldstein 2000-5 78 Rename variables A: a¬ ld[d]B: b¬ a*aC: st [d], b D: d1¬d+4A1: a1 ¬ ld [d1]B1: b1¬ a1*a1 C1: st [d1], b1 D1:d¬ d1+4ABCDA1B1C1D115-745 © Seth Copen Goldstein 2000-5 79 ScheduleA: a¬ ld[d]B: b¬ a*aC: st [d], b D: d1¬d+4A1: a1 ¬ ld [d1]B1: b1¬ a1*a1 C1: st [d1], b1 D1:d¬ d1+4ADBA1CB1D1C1 ABCD1DA1B1C1 15-745 © Seth Copen Goldstein 2000-5 80 Unroll Some More A: a¬B: b¬C:D: d1¬ A1: a1 ¬ B1: b1¬ C1:D1: d2¬ A2: a2 ¬ B2: b2¬ C2:D2: d ¬ld[d]a*ast [d], b d+4ld [d1] a1*a1st [d1], b1 d1+4ld [d2] a2*a2st [d2], b2 d2 + 4AD BCA1 D1 B1 A2 C1C2 D2B2 ABCD2DA1B1C1D1A2B2C215-745 © Seth Copen Goldstein 2000-5 81 Unroll Some More A: a ¬B: b¬C:D: d1¬ A1: a1 ¬ B1: b1¬ C1:D1: d2¬ A2: a2 ¬ B2: b2¬ C2:D2: d¬ld [d]a*ast [d], b d+4ld [d1] a1*a1st [d1], b1 d1+4ld [d2] a2*a2st [d2], b2 d2+4AD BCA1 D1B1 A2 D2C1B2 A3 C2 B3 C3 ABCD3DA1B1C1D1A2B2C2D2A3B3C3 D3 15-745 © Seth Copen Goldstein 2000-5 82ABCD4DA1B1C1D1A2B2C2D2A3B3C3D3A4B4C4 One More Time AD BCA1 D1B1 A2 D2 C1B2 A3 C2 B3 C3D3A4 B4 C4 15-745 © Seth Copen Goldstein 2000-5 83ABCD4DA1B1C1D1A2B2C2D2A3B3C3D3A4B4C4 Can Rearrange AD BCA1 D1B1 A2 D2 C1B2 A3 C2 B3 C3D3A4 B4 C4 15-745 © Seth Copen Goldstein 2000-5 84 Rearrange AD BC A1 D1B1 A2 D2 C1B2 A3 C2 B3 C3ABCD3DA1B1C1D1A2B2C2D2A3B3C3 D3 15-745 © Seth Copen Goldstein 2000-5 85 Rearrange AD BC A1 D1B1 A2 D2 C1B2 A3D3 C2 B3C3ABCD3DA1B1C1D1A2B2C2D2A3B3C3 15-745 © Seth Copen Goldstein 2000-5 86 SP LoopA: a ¬ B: b¬ D: d1¬ A1: a1 ¬ D1: d2¬C:B1: b1¬ A2: a2 ¬ D2: d¬B2: b2¬ C1:D3: d2¬ C2:ld [d] a*a d+4 ld [d1] d1+4st [d], b a1*a1 ld [d2] d2+4a2*a2st [d1], b1 d1+4st [d2], b2PrologBody EpilogABCCCD3DA1B1B1B1C1D1A2A2A2B2C2D2D2D2 15-745 © Seth Copen Goldstein 2000-5 87 Goal of Software Pipeliningn Increase distance between dependent operations by moving destination operation to a later iterationA BCdependencies in initial loopA after SPBCi+1 i+2 iteration i 15-745 © Seth Copen Goldstein 2000-5 88 Goal of Software Pipeliningn Discover ILP across iterations!A0 A0 A1 B0A2 B1 C0 A3 B2 C1B3 C2 C3A1 B0 Ai Bi-1 Ci-2Bi Ci-1 Ci 15-745 © Seth Copen Goldstein 2000-5 89 ExampleAssume operating on a infinite wide machine A0A1 B0Prolog loop bodyepilog Ai Bi-1 Ci-2 Bi Ci-1 Ci15-745 © Seth Copen Goldstein 2000-5 90 Dealing with exit conditionsfor (i=0; i = N) goto done
A0
B0
if (i+1 == N) goto last i=1
A1
if (i+2 == N) goto epilog i=2
{
}
loop: Ai Ai
Bi Bi-1 Ci Ci-2
i++
if (i < N) goto loop epilog:BiCi-1 last:ci done: 15-745 © Seth Copen Goldstein 2000-5 91 Loop Unrolling vs. Software PipeliningFor SuperScalar or VLIWn Loop Unrolling reduces loop overhead n Software Pipelining reduces fill/drain n Best is if you combine themSoftware PipeliningLoop Unrolling Time15-745 © Seth Copen Goldstein 2000-5 92 TODO: More complicated SP example93 Software Pipelining Approachesn The first serious approach to software pipelining was presented by Aiken & Nicolau.q Aiken’s 1988 Ph.D. thesis.q Impractical as it ignores resource hazards (focusing onlyon data-dependence constraints).n “Iterative Modulo Scheduling” Rau MICRO’94q Addresses resource constraintsq Iterate over different loop lengths until one schedules successfullyq Compute loop lower/upper bounds from required & available resources94SYSTOLIC ARRAYS 95 Why Systolic Architectures?n Idea: Data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memoryn Similar to an assembly lineq Different people work on the same carq Many cars are assembled simultaneously q Can be two-dimensionaln Special purpose accelerators/architectures needq Simple, regular designs (keep # unique parts small andregular)q High concurrency à high performanceq Balanced computation and I/O (memory access)96 Systolic Architecturesn H. T. Kung, “Why Systolic Architectures?,” IEEE Computer 1982.Memory: heart PEs: cellsMemory pulses data through cells 97 Systolic Architecturesn Basic principle: Replace a single PE with a regular array of PEs and carefully orchestrate flow of data between the PEs à achieve high throughput w/o increasing memory bandwidth requirementsn Differences from pipelining:q Array structure can be non-linearand multi-dimensionalq PE connections can be multidirectional (and different speed)q PEs can have local memory and execute kernels (rather than a piece of the instruction) 98 Systolic Computation Examplen 99 Systolic Computation Example: Convolution 100 Systolic Computation: Convolutiony1 w3w2x2x1 y1w1 w3w2w1x3x2x1 x3y1w3y1w2x2y2w1y2 w3w2w1x4x3x2 y2y3 w3w2x4x3w1 101 Systolic Computation: Convolution y1 w3w2w1 x2x1102 Systolic Computation: Convolution y1 w3w2w1 x3x2x1103 Systolic Computation: Convolution y1y2 w3w2w1 x3x2104 Systolic Computation: Convolution y1y2 w3w2w1 x4x3x2105 Systolic Computation: Convolution y1y2y3 w3w2w1 x4x3106 Systolic Computation Example: Convolutionn Worthwhile to implement adder and multiplier separately to allow overlapping of add/multiply executions 107 TODO: Example relating SP to systolic architecture for some computation (maybe the convolution)108 More Programmabilityn Each PE in a systolic arrayq Can store multiple “weights”q Weights can be selected on the flyq Eases implementation of, e.g., adaptive filteringn Taken furtherq Each PE can have its own data and instruction memoryq Data memory à to store partial/temporary results, constantsq Leads to stream processing, pipeline parallelism n Moregenerally,stagedexecution109 Pipeline Parallelism 110 File Compression Example 111 Why pipeline parallelism in software?n Pipeline parallelism vs data parallelismq Why split pipeline stages across PEs?q No cycle-time benefit like we got in hardwaren Data movement patterns differq Pipeline parallelism: move input data between PEs q Data parallelism: move task code/data between PEsn Tight feedback loops within single stage q E.g., compression or encryptionn Appropriate design depends on application112 Systolic Array Summaryn Advantagesq Makes multiple uses of each data item à reduce data fetches q High concurrencyq Regular design (both data and control flow)n Disadvantagesq Not good at exploiting irregular parallelismq Relatively special purpose à need software, programmer support to be a general purpose model113 The WARP Computern HT Kung, CMU, 1984-1988n Linear array of 10 cells, each cell a 10 Mflop programmableprocessorn Attached to a general purpose host machinen High-level language and optimizing compiler to program the systolic arrayn Used extensively to accelerate vision and robotics tasks n Annaratone et al., “Warp Architecture andImplementation,” ISCA 1986.n Annaratone et al., “The Warp Computer: Architecture, Implementation, and Performance,” IEEE TC 1987.114 The WARP Computer 115 The WARP Computer 116 Resource Constraintsn Minimally indivisible sequences, i and j, can execute together if combined resources in a step do not exceed available resources.n R(i) is a resource configuration vector R(i) is the number of units of resource in r(i) is a resource usage vector s.t. 0 £ r(i) £ R(i)n Each node in G has an associated r(i)15-745 © Seth Copen Goldstein 2000-5 117SCALAR REPLACEMENT 118 Scalar Replacement: Example DO I = 1, NDO J = 1, MA(I) = A(I) + B(J)ENDDOENDDODO I = 1, NT = A(I)DO J = 1, MT = T + B(J)ENDDOA(I) = T ENDDOn AllloadsandstorestoAintheinner loop have been savedn A(I)canbeleftinaregister throughout the inner loopn Superscalar+cachewillgetmostof n HighchanceofTbeingallocateda this, but not allocate A(I) to register register by compilerScalar Replacementn Convert array reference to scalar referencen Approach is to use dependences to achieve these memory hierarchy transformations Dependence and Memory Hierarchyn True or Flow – save loads and cache miss n Anti – save cache miss?n Output – save storesn Input – save loadsA(I) = … + B(I) … =A(I)+k A(I) = …… = B(I) Dependence and Memory Hierarchyn Loop Carried dependences – Consistent dependences most useful for memory management purposesn Consistent dependences – dependences with constant dependence distance Using Dependencesn In the reduction example DO I = 1, N DO J = 1, M A(I) = A(I) + B(J)ENDDO ENDDODO I = 1, NT = A(I)DO J = 1, MT = T + B(J)ENDDOA(I) = T ENDDOn True dependence – replace the references to A in the inner loop by scalar Tn Output dependence – store can be moved outside the inner loopn Antidependence – load can be moved before the inner loop Scalar Replacementn Example: Scalar Replacement in case of loop independent dependence DO I = 1, NA(I) = B(I) + CX(I) = A(I)*QENDDODO I = 1, N t = B(I) + CA(I) = t X(I) = t*QENDDOn One less load for each iteration for reference to A Scalar Replacementn Example: Scalar Replacement in case of loop carried dependence spanning single iterationDO I = 1, NA(I) = B(I-1)B(I) = A(I) + C(I)ENDDOtB = B(0)DO I = 1, N tA = tB A(I) = tA tB = tA + C(I) B(I) = tBENDDOn One less load for each iter for ref to B which had a loop carried true dependence of 1 itern Also one less load per iter for reference to A Scalar Replacementn Example: Scalar Replacement in case of loop carried dependence spanning multiple iterationsDO I = 1, NA(I) = B(I-1) + B(I+1)ENDDOnt1 = B(0)t2 = B(1)DO I = 1, N t3 = B(I+1) A(I) = t1 + t3 t1 = t2 t2 = t3ENDDOOne less load for each iter for ref to B which had a loop carried input dependence of 2 itersInvariants maintained weret1=B(I-1); t2=B(I);t3=B(I+1) nAiken/Nicolau Scheduling Step 1Perform scalar replacement to eliminate memory references where possible. for i:=1 to N doa := j Å V[i-1]b := a Å f c := e Å j d := f Å c e := b Å d f := U[i] g: V[i] := b h: W[i] := dj := X[i]for i:=1 to N do a:=jÅb b:=aÅf c:=eÅj d:=fÅc e:=bÅdf := U[i] g: V[i] := b h: W[i] := dj := X[i] 15-745 © Seth Copen Goldstein 2000-5 127 Aiken/Nicolau Scheduling Step 2Unroll the loop and compute the data-dependence graph (DDG). DDG for rolled loop:for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å df := U[i] g: V[i] := b h: W[i] := dj := X[i]acjbd gfh e 15-745 © Seth Copen Goldstein 2000-5 128 Aiken/Nicolau Scheduling Step 2, cont’d DDG for unrolled loop:for i:=1 to N do a := j Å b b := a Å f c:=eÅj d:=fÅc e:=bÅdf:=U[i]g: V[i] := bh: W[i] := dj := X[i]a1 c1b1 d1g a j1 e h1 121b f1 c 22 g2 a3b 3j2 d 2f2e2 c3h115-745 © Seth Copen Goldstein 2000-5 129 Aiken/Nicolau Scheduling Step 3Build a tableau of iteration number vs cycle time.ac 1 1b d1 1j iteration 123456 1 acfjfjfjfjfjfj 2bd3egha 1 h 4 cb g1a2 e115dga f 6ehb b1c 7 cga 2j28db2d 9 ehga g2a3 2 10 cb fh11 dga 2e112 ehb b3 2 13 cg c 14 d3 15 eh15-745 © Seth Copen Goldstein 2000-5 130 cycle Aiken/Nicolau Scheduling Step 3Build a tableau of iteration number vs cycle time.basically, you’re emulatingiteration1 1 123456ac a superscalar with infinite1 acfjfjfjfjfjfj 2 bd3 egh a4 cb5 dga6 eh b 7cga 8db9 ehga resources, infinite registerb d1renaming, always predicting 1jthe loop-back branch: h g a 1 e 1121thus, just puref1 b2jc2 data dependency 2dg2a3 2 10 cb fh11 dga 2e112 ehbb3 2 13 cg c 14 d3 15 eh15-745 © Seth Copen Goldstein 2000-5 131cycle Aiken/Nicolau Scheduling Step 4Find repeating patterns of instructions. iteration 123456 1 acfjfjfjfjfjfj 2 bd 34 egh acb5 dga6 ehb7 cg a 8db9101112131415eh g a cbdga ehb cgd eh 15-745 © Seth Copen Goldstein 2000-5 132cycle Aiken/Nicolau Scheduling Step 4Find repeating patterns of instructions. iteration 123456 1 acfjfjfjfjfjfj 2 bd34 egh acb5 dga6 ehb7 cg a 8db 9101112131415eh g a cbdga ehb cgd eh 15-745 © Seth Copen Goldstein 2000-5 133cycle Aiken/Nicolau Scheduling Step 4Find repeating patterns of instructions. iteration 123456 1 acfjfjfjfjfjfj 2 bd34 egh acb5 dga6 ehb7 cg a 8dbGo back and relate slopes to DDG9101112131415eh g a cbdga ehb cgd eh 15-745 © Seth Copen Goldstein 2000-5 134cycle Aiken/Nicolau Scheduling Step 5“Coalesce” the slopes. iteration iteration 123456 1234561 acfjfjfjfjfjfj 2 bd3 egh a4 cb1 acfj 2 bd 3 egh 4fjacb fjdg aeh bfj5 dga6 ehb7 cg a 8db8db9101112131415eh g a cbdga ehb cgd ehcg a9 eh g fj5 6 710 ca 11 db 12 eh g13 c 14 d 15 eh 15-745 © Seth Copen Goldstein 2000-5 135cyclecycle Aiken/Nicolau Scheduling Step 6Find the loop body and “reroll” the loop.iteration 1234561 acfj 2bd 3 egh 4fjacb fj dga ehbfj 567 8db9101112131415ehg fj cadb eh g c dehcg a 15-745 © Seth Copen Goldstein 2000-5 136cycle Aiken/Nicolau Scheduling Step 6Find the loop body and “reroll” the loop.Prologue/entry codeLoop body Epilogue/exit code iteration 123456 1 acfj2bd fj3 egh a4 cb fj5 dga6 ehbfj 78 cg a 91011 db 12 eh g13 c 14 d 15 ehdb ehg fjca 15-745 © Seth Copen Goldstein 2000-5 137cycle Aiken/Nicolau Scheduling Step 7Generate code.(Assume VLIW-like machine for this example. The instructions on each line should be issued in parallel.) L:bi+1 := ai Å fiW[i] := diai+2 := ji+1 Å bi+1 i := i+1bN :=aN Å fN-1 W[N-1] := dN-1w[N]:=dNa1:=j0 Å b0 b1:=a1 Å f0 e1:=b1 Å d1 c2:=e1 Å j1 d2:=f1 Å c2 e2:=b2 Å d2 c3:=e2 Å j2di:=fi-1Åci ei := bi Å di ci+1:=eiÅjic1:=e0 Å j0 d1:=f0 Å c1 V[1] := b1 b2:=a2 Å f1 V[2] := b2 W[2] := d2 V[3] := b3f1:=U[1]f2:=U[2] W[1] := d1f3 := U[3] a3:=j2 Å b3:=a3 Å a4:=j3 Åj1 := X[1]j2 := X[2] a2:=j1 Å b1j3 := X[3] b2f2 f4 := U[4] b3 i:=3j4 := X[4]dN-1 :=fN-2 Å cN-1 eN-1 := bN-1 Å dN-1cN := eN-1 Å jN-1 dN := fN-1 + cNeN :=bN Å dNV[i+1] := bi+1 v[N] := bNfi+2 := U[I+2] ji+2 := X[i+2] if i
n For an edge, u®v, we must have s(v)-s(u) 3 d(u,v)
15-745 © Seth Copen Goldstein 2000-5 149

Precedence Constraints
n Cyclic: constraint becomes a tuple: q p is the minimum iteration delay
(or the loop carried dependence distance) q d is the delay
n For an edge, u®v, we must have s(v)-s(u) 3 d(u,v)-s*p(u,v)
np30
n If data dependence is
q within an iteration, p=0
q loop-carried across p iter boundaries, p>0
15-745 © Seth Copen Goldstein 2000-5 150

Iterative Approach
n Finding minimum S that satisfies the constraints is NP- Complete.
n Heuristic:
q Find lower and upper bounds for S
q foreach s from lower to upper bound? n Schedule graph.
n If succeed, done
n Otherwise try again (with next higher s)
n Thus: “Iterative Modulo Scheduling” Rau MICRO’94
15-745 © Seth Copen Goldstein 2000-5 151

Iterative Approach
n Heuristic:
q Find lower and upper bounds for S
q foreach s from lower to upper bound
n Schedule graph.
n If succeed, done
n Otherwise try again (with next higher s)
n So the key difference:
q AN88 does not assume S when scheduling
q IMS must assume an S for each scheduling attempt to understand resource conflicts
15-745 © Seth Copen Goldstein 2000-5 152

Lower Bounds
n Resource Constraints: SR
maximum over all resources of # of uses divided by #
available…
n Precedence Constraints: SE
max delay over all cycles in dataflow graph
In practice, one is easy, other is hard.
Tim’s secret approach: just use SR as lower bound, then do binary search for best S
15-745 © Seth Copen Goldstein 2000-5 153

Acyclic Example
a <0,2>
<0,3> c
b <0,1>
Lower Bound: SR=2 Upper Bound: 5
15-745 © Seth Copen Goldstein 2000-5 154

Lower Bound on s
• Assume 1 ALU and 1 MU
• Assume latency Op or load is 1 cycle
a
c
for i:=1 to N do a := j Å b b := a Å f c := e Å j d:=fÅc e := b Å d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
<1,1> <0,1>
<1,1> <0,1>
<1,1> <0,1>
j
b
h
f
d
g
<0,1>
<0,1>
<1,1>
Resources => 5 cycles Dependencies => 3 cycles
e
15-745 © Seth Copen Goldstein 2000-5 155

Scheduling data structures
To schedule for initiation interval s:
n Create a resource table with s rows and R columns
n Create a vector, s, of length N for n instructions in the loop
q s[n] = the time at which n is scheduled, or NONE
n Prioritize instructions by some heuristic q critical path (or cycle)
q resource critical
15-745 © Seth Copen Goldstein 2000-5 156

Scheduling algorithm
n Pick an instruction, n
n Calculate earliest time due to dependence constraints For all x=pred(n),
earliest = max(earliest, s(x)+d(x,n)-s.p(x,n)) n try and schedule n from earliest to (earliest+s-1)
s.t. resource constraints are obeyed.
q possible twist: deschedule a conflicting node to make way for n, maybe randomly, like sim anneal
n If we fail, then this schedule is faulty (i.e. give up on this s)
15-745 © Seth Copen Goldstein 2000-5 157

Scheduling algorithm – cont.
n We now schedule n at earliest, I.e., s(n) = earliest n Fix up schedule
q Successors, x, of n must be scheduled s.t.
s(x) >= s(n)+d(n,x)-s.p(n,x), otherwise they are removed
(descheduled) and put back on worklist.
n repeat this some number of times until either
q succeed, then register allocate q fail, then increase s
15-745 © Seth Copen Goldstein 2000-5 158

Simplest Example
for () {
a = b+c
b = a*a
c = a*194 }
<1,1>
<1,1> <0,1> <0,1>
b
Resources:
a
1
1
c
What is IIres? What is IIrec?
15-745 © Seth Copen Goldstein 2000-5 159

Simplest Example
for () { 0 a = b+c
b = a*a
c = a*194 1 }
a
Modulo Resource Table:
0 1
b
c
Try II = 2
1
15-745 © Seth Copen Goldstein 2000-5 160

Simplest Example
for () { 0 a = b+c
b = a*a
c = a*194 1 }
a
Try II = 2
Modulo Resource Table:
0 1
1
b
c
1
15-745 © Seth Copen Goldstein 2000-5 161

Simplest Example
for () { 0 a = b+c
b = a*a
c = a*194 1 }
Modulo Resource Table:
0 1
1
1
a
2
Try II = 2
b
c
1
15-745 © Seth Copen Goldstein 2000-5 162

Simplest Example
for () { 0 a = b+c
b = a*a
c = a*194 1 }
a
b
Try II = 2
Modulo Resource Table:
0 1
earliest a: sigma(c) + delay(c) – 2 = 2+1-2 = 1
1
c
2
1
15-745 © Seth Copen Goldstein 2000-5 163

Simplest Example
for () {
a = b+c
b = a*a
c = a*194
0 1 2
}
b
a
Try II = 2
Modulo Resource Table:
0 1
earliest b? scheduled b? what next?
1
c
1
15-745 © Seth Copen Goldstein 2000-5 164

Simplest Example
for () {
a = b+c
b = a*a
c = a*194
0 1 2 3
}
Try II = 2
Modulo Resource Table:
0 1
Lesson: lower bound may not be achievable
1
b
a
c
1
15-745 © Seth Copen Goldstein 2000-5 165

Example
for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
Priorities: ?
<1,1> <0,1>
<1,1> <0,1>
<1,1> <0,1>
<0,1>
<0,1>
<1,1>
g
a
h
j
f
c
b
d
e
15-745 © Seth Copen Goldstein 2000-5 166

Example
for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
Priorities: c,d,e,a,b,f,j,g,h
g
a
<1,1> <0,1>
<1,1> <0,1>
<1,1> <0,1>
<0,1>
<0,1>
<1,1>
h
f
c
j
b
d
e
15-745 © Seth Copen Goldstein 2000-5 167

for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
s=5
instr
s
a
b
c
d
e
f
g
h
j
ALU
MU
Priorities: c,d,e,a,b,f,j,g,h
a
b
j
h
f
c
g
e
d
15-745 © Seth Copen Goldstein 2000-5 168

for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
Priorities: a,b,f,j,g,h
s=5
instr
s
a
b
c
0
d
1
e
2
f
g
h
j
ALU
MU
c
d
e
j
a
c
f
b
d
g
h
e
15-745 © Seth Copen Goldstein 2000-5 169

for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
Priorities: b,f,j,g,h
s=5
instr
s
a
3
b
c
0
d
1
e
2
f
g
h
j
ALU
MU
c
d
e
a
j
a
c
f
b
d
g
h
e
15-745 © Seth Copen Goldstein 2000-5 170

for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
Priorities: b,f,j,g,h
s=5
instr
s
a
3
b
4
c
0
d
1
e
2
f
g
h
j
ALU
MU
c
d
e
a
b
j
a
c
f
b
d
g
h
e
15-745 © Seth Copen Goldstein 2000-5 171

for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
Priorities: e,f,j,g,h
s=5
ALU
MU
c
d
a
b
j
a
c
f
b
d
h
g
instr
a
b
c
d
e
f
g
h
j
1
s
3
4
0
e
b causes b->e edge violation
15-745 © Seth Copen Goldstein 2000-5 172

for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
Priorities: e,f,j,g,h
s=5
ALU
MU
c
d
e
a
b
j
a
c
f
b
d
h
g
instr
a
b
c
d
e
f
g
h
j
1
s
3
4
0
7
e
e causes e->c edge violation
15-745 © Seth Copen Goldstein 2000-5 173

for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
Priorities: f,j,g,h
s=5
instr
s
a
3
b
4
c
5
d
6
e
7
f
0
g
h
j
ALU
MU
c
f
d
e
a
b
j
a
c
f
b
d
g
h
e
15-745 © Seth Copen Goldstein 2000-5 174

for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
Priorities:j,g,h
s=5
instr
s
a
3
b
4
c
5
d
6
e
7
f
0
g
h
j
1
ALU
MU
c
f
d
j
e
a
b
j
a
c
f
b
d
g
h
e
15-745 © Seth Copen Goldstein 2000-5 175

for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d
f := U[i]
g: V[i] := b
h: W[i] := d
j := X[i]
Priorities:g,h
s=5
instr
s
a
3
b
4
c
5
d
6
e
7
f
0
g
7
h
8
j
1
ALU
MU
c
f
d
j
e
g
a
h
b
j
a
c
f
b
d
g
h
e
15-745 © Seth Copen Goldstein 2000-5 176

Creating the Loop
n Create the body from the schedule.
n Determine which iteration an instruction
falls into
q Mark its sources and dest as belonging to that iteration.
q Add Moves to update registers n Prolog fills in gaps at beginning
q For each move we will have an instruction in prolog, and we fill in dependent instructions
n Epilog fills in gaps at end
instr
s
a
3
b
4
c
5
d
6
e
7
f
0
g
7
h
8
j
1
15-745 © Seth Copen Goldstein 2000-5 177

f0 = U[0]; j0 = X[0];
FOR i = 0 to N f1 := U[i+1] j1 := X[i+1] nop
a := j0 ? b b := a ? f0 c := e ? j0 d := f0 ? c e := b ? d
h: W[i] := d f0 = f1
j0 = j1
g: V[i] := b
15-745 © Seth Copen Goldstein 2000-5 178

Conditionals
n What about internal control structure, I.e., conditionals n Three approaches
q Schedule both sides and use conditional moves
q Schedule each side, then make the body of the conditional a
macro op with appropriate resource vector q Trace schedule the loop
15-745 © Seth Copen Goldstein 2000-5 179

What to take away
n Architecture includes compiler!
n Dependence analysis is very important
(including alias analysis)
n Software pipelining crucial for statically scheduled, but also very useful for dynamically scheduled
15-745 © Seth Copen Goldstein 2000-5 180

Multiflow: An early VLIW architecture (1987)

EPIC – Intel IA-64 Architecture
n Gets rid of lock-step execution of instructions within a VLIW instruction
n Idea: More ISA support for static scheduling and parallelization q Specify dependencies within and between VLIW instructions
(explicitly parallel)
+ No lock-step execution
+ Static reordering of stores and loads + dynamic checking
— Hardware needs to perform dependency checking (albeit aided by software)
— Other disadvantages of VLIW still exist
n Huck et al., “Introducing the IA-64 Architecture,” IEEE Micro, Sep/Oct 2000.
183

IA-64 Instructions
n IA-64 “Bundle” (~EPIC Instruction)
q Total of 128 bits
q Contains three IA-64 instructions
q Template bits in each bundle specify dependencies within a bundle

n IA-64 Instruction
q Fixed-length 41 bits long
q Contains three 7-bit register specifiers
q Contains a 6-bit field for specifying one of the 64 one-bit predicate registers
184

IA-64 Instruction Bundles and Groups
n Groups of instructions can be executed safely in parallel
q Marked by “stop bits”
n Bundles are for packaging
q Groups can span multiple bundles
n Alleviates recompilation need somewhat
185

Template Bits
n Specify two things
q Stop information: Boundary of independent instructions
q Functional unit information: Where should each instruction be routed
186

Reviews

There are no reviews yet.

Only logged in customers who have purchased this product may leave a review.

Shopping Cart
[SOLVED] CS计算机代考程序代写 cache algorithm data structure assembly AI x86 concurrency compiler prolog Static Scheduling & VLIW 15-740
30 $