[SOLVED] algorithm scala parallel compiler software Module Outline:


Module Outline:
Module 2: High Performance Techniques Scoreboard
How to Optimize the Pipeline? Extract More Parallelism!
Compiler-Directed (Static) Approaches
- VLIW
- EPIC
- Superscalar
- Software Pipelining
A Dynamic Approach
- Scoreboard

Pipelining: Can we somehow make CPI closer to 1?
Let's assume full pipelining

FP Loop: Where are the hazards?

FP Loop Showing Stalls

Revised FP Loop Minimizing Stalls

Unroll Loop Four Times (straightforward way)

Unrolled Loop That Minimizes Stalls

Getting CPI < 1: Processing Multiple Instructions/Cycle
Use parallel processing!!
Two main variations: superscalar and VLIW
Superscalar: varying number of instructions/cycle (1 to 6)
- parallelism and dependencies determined/resolved by HW
- IBM PowerPC 604, Sun UltraSPARC, DEC Alpha 21164, HP 7100
Very Long Instruction Words (VLIW): fixed number of instructions (16) determined by compiler
- pipeline is exposed; compiler must schedule delays to get right results
Explicit Parallel Instruction Computer (EPIC)/Intel
- 128 bit packets containing 3 instructions (can execute sequentially)
- can link 128 bit packets together to allow more parallelism
- compiler determines parallelism, HW checks dependencies and forwards/stalls

Parallelism: Overt vs. Covert

Problem vs. Program

Compilers for Covert Parallelism

Unrolled Loop

Compilation and ISA
Efficient compilation requires knowledge of the pipeline structure
- latency and bandwidth of each operation type
But a good ISA transcends several implementations with different pipelines
- should things like a delayed branch be in an ISA?
- should a compiler use the properties of one implementation when compiling for an ISA?
- do we need a new interface?

An Alternative: Very Long Instruction Word (VLIW) Computers

VLIW Pros and Cons
Pros:
Very simple hardware
- no dependency detection
- simple issue logic
- just ALUs and register files
Potentially exploits large amounts of ILP
Cons:
Lockstep execution (static schedule)
- very sensitive to long latency operations (cache misses)
Global register file hard to build
Lots of NO-OPs
- poor code density
- I-cache capacity and bandwidth compromised
Must recompile sources to deliver potential
Implementation visible through ISA

EPIC: Explicit Parallel Instruction Computer
128-bit instructions: three 3-address operations
- instruction layout: op1 op2 op3 tmp
- operation layout: pred op rd rs1 rs2 const
a template that encodes dependencies
128 general registers
predication
speculative load (data prediction)
Example: IA-64 of Intel/HP

Getting CPI < 1: Issuing Multiple Instructions/Cycle
Superscalar DLX: 2 instructions, 1 FP & 1 anything else
- fetch 64 bits/clock cycle; integer on left, FP on right
- can only issue 2nd instruction if 1st instruction issues
- more ports for FP registers to do FP load & FP op in a pair
1 cycle load delay expands to 3 instructions in SS
- instruction in right half can't use it, nor instructions in next slot

Loop Unrolling in Superscalar
Unrolled 5 times to avoid delays (+1 due to SS)
12 clocks, or 2.4 clocks per iteration

Another Example: Multiple Issue

Rescheduled Code

Limits of Superscalar
While integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:
- exactly 50% FP operations
- no hazards
If more instructions issue at same time, greater difficulty of decode and issue
- even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue
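The 2-scalar issue decision mentioned above (two opcodes, six register specifiers, issue 1 or 2) can be sketched in Python. This is only an illustration under stated assumptions, not the DLX issue logic: the `Instr` record, its field names, and `issue_count` are hypothetical, the first instruction is assumed to issue, and only slot placement plus RAW/WAW checks inside the pair are modeled (the load delay reaching into the next slot is not).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    op: str                  # mnemonic, for readability only
    is_fp_op: bool           # True for an FP operation (right-hand slot)
    dest: Optional[str]      # destination register, or None (e.g. a store)
    srcs: Tuple[str, ...]    # source registers


def issue_count(first: Instr, second: Instr) -> int:
    """Return 1 or 2: how many instructions of a fetched pair issue together.

    Assumption: `first` itself is free to issue; we only decide about `second`.
    """
    # Slot rule: integer/"anything else" on the left, FP operation on the right.
    if first.is_fp_op or not second.is_fp_op:
        return 1
    # RAW inside the pair: the second reads what the first writes.
    if first.dest is not None and first.dest in second.srcs:
        return 1
    # WAW inside the pair: both write the same register.
    if first.dest is not None and first.dest == second.dest:
        return 1
    return 2


load = Instr("LD", False, "F0", ("R1",))
add_dependent = Instr("ADDD", True, "F4", ("F0", "F2"))
add_independent = Instr("ADDD", True, "F8", ("F6", "F2"))
```

With these sample instructions, the load pairs with `add_independent` but not with `add_dependent`, which reads the register the load is still producing.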
VLIW: trade instruction space for simple decoding
the long instruction word has room for many operations
by definition, all the operations the compiler puts in the long instruction word can execute in parallel
e.g., 2 integer operations, 2 FP ops, 2 memory references, 1 branch
16 to 24 bits for each of these fields ==> 7 x 16 = 112 bits to 7 x 24 = 168 bits wide
need a compiling technique that schedules across several branches
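The instruction-width arithmetic above can be checked with a few lines of Python. The slot mix comes from the example in the text; the names `SLOTS` and `word_width` are made up for this sketch.

```python
# Operation slots in one long instruction word, as in the example above.
SLOTS = {"integer": 2, "fp": 2, "memory": 2, "branch": 1}


def word_width(bits_per_field: int) -> int:
    """Total instruction width when every slot uses the same field size."""
    return sum(SLOTS.values()) * bits_per_field


narrow = word_width(16)   # 7 fields of 16 bits
wide = word_width(24)     # 7 fields of 24 bits
```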

Loop Unrolling in VLIW
Need more registers in VLIW (EPIC ==> 128 integer + 128 FP)

Software Pipelining
Observation: if the iterations of a loop are independent, we can get more ILP (instruction level parallelism) by taking instructions from different iterations
Software pipelining: reorganizes a loop so that each iteration of the new loop is made from instructions chosen from different iterations of the original loop (roughly, Tomasulo's algorithm done in SW)
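A small Python sketch may help picture the reorganization. It assumes the familiar loop that adds a scalar `s` to every element of an array, split into load, add, and store steps; the function names are hypothetical and the code models only the ordering of operations, not timing.

```python
def original_loop(x, s):
    """Each iteration loads, adds, and stores one element."""
    out = list(x)
    for i in range(len(out)):
        loaded = out[i]        # LD
        summed = loaded + s    # ADDD
        out[i] = summed        # SD
    return out


def software_pipelined(x, s):
    """Same result, but each kernel iteration mixes three original iterations."""
    out = list(x)
    n = len(out)
    if n < 3:                  # too short to fill the software pipeline
        return original_loop(out, s)
    # Prologue: start iterations 0 and 1.
    loaded = out[0]
    summed = loaded + s
    loaded = out[1]
    # Kernel: store for iteration i, add for i+1, load for i+2.
    for i in range(n - 2):
        out[i] = summed
        summed = loaded + s
        loaded = out[i + 2]
    # Epilogue: drain the last two iterations.
    out[n - 2] = summed
    summed = loaded + s
    out[n - 1] = summed
    return out


data = [1.0, 2.0, 3.0, 4.0, 5.0]
```

Inside the kernel, the store, add, and load belong to three different original iterations, so none of them depends on another operation in the same kernel iteration.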

Software Pipelining Example

Software Pipelining with Loop Unrolling in VLIW
9 results in 9 cycles, or 1 clock per iteration
average: 3.3 ops per clock, 66% efficiency
Note: need fewer registers for software pipelining (only using 7 registers here, was using 15)

Multiple Issue as compared with VLIW
Multiple issue: more complex issue logic
- check dependencies
- check structural hazards
- issue variable number of instructions (0-N)
- shift unissued instructions over
Able to run existing binaries
- recompile for performance, not correctness
Datapaths identical
- but bypass requires detection
Neither VLIW nor multiple-issue can schedule around run-time variation in instruction latency
- cache misses
Dealing with run-time variation requires run-time or dynamic scheduling

The Problem with Static Scheduling (Compile-Time)
In-Order Execution:
an unexpected long latency blocks ready instructions from executing (scheduled code cannot be changed at run-time)
binaries need to be rescheduled (recompiled) for each new processor implementation
small number of named registers becomes a bottleneck

Why in HW at run time?
Can we use HW to get CPI closer to 1?
works when we can't know real dependencies at compile time
compiler simpler; avoid recompilation also!
code for one machine runs well on another
Dynamic scheduling?!
Key idea: allow instructions behind a stall to proceed:
DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F12, F8, F14
Out-of-order execution ==> out-of-order completion
Disadvantages?
complexity
precise interrupts harder!
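To see why SUBD may proceed past the stalled ADDD in the three-instruction example, a short Python check of true (RAW) dependences is enough. The helpers `parse` and `raw_dependent` are hypothetical names for this sketch, and the parser only handles the three-register format used here.

```python
def parse(text):
    """Split 'OP Fd, Fs1, Fs2' into (op, dest, [sources])."""
    op, regs = text.split(None, 1)
    dest, *srcs = [r.strip() for r in regs.split(",")]
    return op, dest, srcs


def raw_dependent(producer, consumer):
    """True if the consumer reads the register the producer writes."""
    return producer[1] in consumer[2]


div = parse("DIVD F0, F2, F4")
add = parse("ADDD F10, F0, F8")
sub = parse("SUBD F12, F8, F14")

# ADDD needs F0 from the long-running DIVD, so it must wait.
add_waits = raw_dependent(div, add)
# SUBD reads neither F0 nor F10, so nothing ahead of it holds it back.
sub_waits = raw_dependent(div, sub) or raw_dependent(add, sub)
```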

Problems?
How do we prevent WAR and WAW hazards?
How do we deal with variable latency?
forwarding for RAW hazards harder

Scoreboard: a bookkeeping technique
Out-of-order execution divides ID stage:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until no data hazards, then read operands
Scoreboards date to the CDC6600, designed in 1963
Instructions execute whenever not dependent on previous instructions and no hazards
CDC6600: in-order issue, out-of-order execution, out-of-order commit (or completion)
no forwarding!
imprecise interrupt/exception model for now

Scoreboard Architecture (CDC6600)

No register renaming!
Scoreboard Implications
Out-of-order completion ==> WAR, WAW hazards?
Solutions for WAR:
stall writeback until registers have been read
read registers only during Read Operands stage
Need to have multiple instructions in execution phase ==> multiple execution units or pipelined execution units
Scoreboard keeps track of dependencies between instructions that have already issued
Scoreboard replaces ID, EX, WB with 4 stages

Four Stages of Scoreboard Control
Issue: decode instructions & check for structural hazards (ID1)
instructions issued in program order (for hazard checking)
don't issue if structural hazard
don't issue if instruction is output dependent on any previously issued but uncompleted instruction (no WAW hazards)
Read operands: wait until no data hazards, then read operands (ID2)
all real dependencies (RAW hazards) resolved in this stage, since we wait for instructions to write back data
no forwarding of data in this model
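The two issue conditions above (no structural hazard, no WAW hazard) can be written as one small predicate. This is a sketch under assumptions: the dictionaries `busy` and `result` and the unit names are invented for illustration, standing in for the functional unit status and register result status.

```python
def can_issue(unit, dest, busy, result):
    """Issue-stage check: the needed unit is free and nobody else will write dest.

    busy   maps functional unit name -> True while an instruction occupies it
    result maps register name -> unit that will write it (absent if none)
    """
    structural_hazard = busy.get(unit, False)
    waw_hazard = result.get(dest) is not None
    return not structural_hazard and not waw_hazard


# Hypothetical state: a multiply occupying Mult1 will eventually write F0.
busy = {"Mult1": True, "Add": False}
result = {"F0": "Mult1"}
```

With this state, an add writing F8 can issue, a second multiply that needs Mult1 cannot (structural hazard), and an add writing F0 cannot (WAW hazard).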

Four Stages of Scoreboard Control (continued)
Execution: operate on operands (EX)
the functional unit begins execution upon receiving operands;
when the result is ready, it notifies the scoreboard that it has completed execution
Write result: finish execution (WB)
stall until no WAR hazards with previous instructions:
Example:
DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F8, F8, F14
CDC6600 scoreboard would stall SUBD until ADDD reads operands
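The write-result stall in this example can be sketched as a check over the functional unit status entries. The dictionary layout below is an assumption for illustration; in particular, the two add units `Add1` and `Add2` are hypothetical so that ADDD and SUBD can be in flight together, and a ready flag of `True` means "operand available but not yet read".

```python
def can_write_result(unit, units):
    """True when no other active unit still has to read the register being overwritten."""
    dest = units[unit]["Fi"]
    for name, other in units.items():
        if name == unit or not other["Busy"]:
            continue
        # WAR hazard: another instruction has this register as a source
        # that is ready but has not been read yet.
        if (other["Fj"] == dest and other["Rj"]) or (other["Fk"] == dest and other["Rk"]):
            return False
    return True


units = {
    # DIVD F0, F2, F4: operands already read, still executing.
    "Divide": {"Busy": True, "Fi": "F0", "Fj": "F2", "Fk": "F4", "Rj": False, "Rk": False},
    # ADDD F10, F0, F8: waiting for F0; F8 is ready but not yet read.
    "Add1": {"Busy": True, "Fi": "F10", "Fj": "F0", "Fk": "F8", "Rj": False, "Rk": True},
    # SUBD F8, F8, F14: finished executing, wants to write F8.
    "Add2": {"Busy": True, "Fi": "F8", "Fj": "F8", "Fk": "F14", "Rj": False, "Rk": False},
}

stalled = not can_write_result("Add2", units)   # SUBD must wait for ADDD
units["Add1"]["Rk"] = False                      # ADDD has now read its operands
released = can_write_result("Add2", units)
```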

Three Parts of the Scoreboard
Instruction status:
Which of 4 steps the instruction is in
Functional unit status:
Indicates the state of the functional unit (FU). 9 fields for each functional unit:
Field     Meaning
Busy      indicates whether the unit is busy or not
Op        operation to perform in the unit (e.g., add or sub)
Fi        destination register
Fj, Fk    source register numbers
Qj, Qk    functional units producing source registers Fj, Fk
Rj, Rk    flags indicating when Fj, Fk are ready
Register result status:
Indicates which functional unit will write each register, if one exists; blank when no pending instructions will write that register
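The nine functional-unit fields and the register result status map naturally onto a small record plus a dictionary. The sketch below is an assumption-laden illustration rather than the CDC6600 logic: `FUStatus`, `issue`, and the unit names are hypothetical, and `issue` only records bookkeeping (it assumes the issue checks have already passed).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None    # operation to perform
    fi: Optional[str] = None    # destination register
    fj: Optional[str] = None    # first source register
    fk: Optional[str] = None    # second source register
    qj: Optional[str] = None    # unit producing fj, if any
    qk: Optional[str] = None    # unit producing fk, if any
    rj: bool = False            # fj ready and not yet read
    rk: bool = False            # fk ready and not yet read


def issue(units: Dict[str, FUStatus], result: Dict[str, str],
          unit: str, op: str, dest: str,
          src1: Optional[str], src2: Optional[str]) -> None:
    """Fill in the unit's nine fields and claim the destination register."""
    # Look up producers BEFORE claiming dest, so an instruction that reads
    # and writes the same register does not wait on itself.
    qj, qk = result.get(src1), result.get(src2)
    units[unit] = FUStatus(True, op, dest, src1, src2, qj, qk,
                           qj is None, qk is None)
    result[dest] = unit


units = {"Integer": FUStatus(), "Mult1": FUStatus()}
result: Dict[str, str] = {}
issue(units, result, "Integer", "LD", "F2", None, "R3")
issue(units, result, "Mult1", "MULTD", "F0", "F2", "F4")
```

After these two calls, the multiply's `qj` names the Integer unit and its `rj` flag is false, which is the bookkeeping form of "MULTD must wait for the load to write F2".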

Scoreboard Example

Detailed Scoreboard Pipeline Control

Scoreboard Example: Cycle 1

Scoreboard Example: Cycle 2

Scoreboard Example: Cycle 3
in order issue

Scoreboard Example: Cycle 4

Scoreboard Example: Cycle 5

Scoreboard Example: Cycle 6

Scoreboard Example: Cycle 7
No: MULTD needs to wait for the Integer unit to write back F2.

Scoreboard Example: Cycle 8a (1st half of clock cycle)

Scoreboard Example: Cycle 8b (2nd half of clock cycle)

Scoreboard Example: Cycle 9

Scoreboard Example: Cycle 10

Scoreboard Example: Cycle 11

Scoreboard Example: Cycle 12
No: it needs to wait for multiply1 to write back F0.

Scoreboard Example: Cycle 13

Scoreboard Example: Cycle 14

Scoreboard Example: Cycle 15

Scoreboard Example: Cycle 16

Scoreboard Example: Cycle 17

Scoreboard Example: Cycle 18

Scoreboard Example: Cycle 19

Scoreboard Example: Cycle 20

WAR Hazard is now gone
Scoreboard Example: Cycle 21

Scoreboard Example: Cycle 22

Let's skip some cycles

Scoreboard Example: Cycle 61

Scoreboard Example: Cycle 62

CDC6600 Scoreboard
Speedup 1.7 from compiler; 2.5 by hand; but slow memory (no cache) limits benefit
Limitations of 6600 scoreboard:
- no forwarding hardware
- limited to instructions in basic block (small instruction window)
- small number of functional units (structural hazards), especially integer/load store units
- do not issue on structural hazards
- wait for WAR hazards
- prevent WAW hazards

Summary of Concepts
Compiler scheduling
HW exploiting ILP
- works when we can't possibly know dependencies at compile time
- code for one machine runs well on another
Key idea of scoreboard: allow instructions behind stall to proceed (decode => issue instructions and read operands)
- enables out-of-order execution ==> out-of-order completion
- ID stage checked both for structural and data dependencies
- original version didn't handle forwarding
- no automatic register renaming
