[SOLVED] 代写 algorithm scala parallel compiler software Module Outline:

30 $

File Name: 代写_algorithm_scala_parallel_compiler_software_Module_Outline:.zip
File Size: 612.3 KB

SKU: 9463442911 Category: Tags: , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,

Or Upload Your Assignment Here:


Module Outline:
Module 2: High Performance Techniques— Scoreboard
• How to Optimize the Pipeline? Extract More Parallelism! • Compiler-Directed (Static) Approaches
– VLIW
– EPIC
– Superscalar
– Software Pipelining
• A Dynamic Approach—Scoreboard

Pipelining: Can we somehow make CPI closer to 1?
• Let’s assume full pipelining

FP Loop: Where are the hazards?

FP Loop Showing Stalls

Revised FP Loop Minimizing Stalls

Unroll Loop Four Times (straightforward way)

Unrolled Loop That Minimizes Stalls

Getting CPI < 1: Processing Multiple Instructions/Cycle• Use parallel processing!!• Two main variations: superscalar and VLIW• Superscalar: varying number of instructions/cycle (1 to 6)- parallelism and dependencies determined/resolved by HW- IBM PowerPC 604, Sun UltraSPARC, DEC Alpha 21164, HP 7100• Very Long Instruction Words (VLIW): fixed number of instructions (16) determined by compiler- pipeline is exposed; compiler must schedule delays to get right results • Explicit Parallel Instruction Computer (EPIC)/Intel- 128 bit packets containing 3 instructions (can execute sequentially)- can link 128 bit packets together to allow more parallelism- compiler determines parallelism, HW checks dependencies and forwards/stallsParallelism: Overt vs. Covert Parallelism: Overt vs. Covert Problem vs. Program Compilers for Covert Parallelism Compilers for Covert Parallelism Compilers for Covert Parallelism Compilers for Covert Parallelism Compilers for Covert Parallelism Compilers for Covert Parallelism Unrolled Loop Compilation and ISA• Efficient compilation requires knowledge of the pipeline structure – latency and bandwidth of each operation type• But a good ISA transcends several implementations with different pipelines- should things like a delayed branch be in an ISA?- should a compiler use the properties of one implementation when compiling for an ISA? – do we need a new interface?An Alternative Very Long Instruction Word (VLIW) Computers Pros:Cons:• Very simple hardware• Lockstep execution (static schedule)- no dependency detection- simple issue logic- just ALUs and register files- very sensitive to long latency operations (cache misses)• Potentially exploits large amounts of ILP• Global register file hard to build • Lots of NO-OPsVLIW Pros and Cons- poor “code density”- I-cache capacity and bandwidthcompromised• Must recompile sources to deliver potential• Implementation visible through ISA EPIC: Explicit Parallel Instruction Computer128-bit instructions:• three 3-address operationsop1 op2 op3 tmp• a template that encodes dependencies• 128 general registers• predication• speculative load (data prediction)• Example: IA-64 of Intel/HPpred op rdrs1 rs2 constGetting CPI < 1: Issuing Multiple Instructions/CycleGetting CPI < 1: Issuing Multiple Instructions/Cycle• Superscalar DLX: 2 instructions, 1 FP & 1 anything else- fetch 64 bits/clock cycle; integer on left, FP on right- can only issue 2nd instruction if 1st instruction issues- more ports for FP registers to do FP load & FP op in a pair• 1 cycle load delay expands to 3 instructions in SS- instruction in right half can’t use it, nor instructions in next slot Loop Unrolling in Superscalar• Unrolled 5 times to avoid delays (+1 due to SS) • 12 clocks, or 2.4 clocks per iterationAnother Example: Multiple IssueAnother Example: Multiple Issue—Rescheduled Code- exactly 50% FP operations – no hazardsLimits of Superscalar• While integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:• If more instructions issue at same time, greater difficulty of decode and issue- even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue
• VLIW: tradeoff instruction space for simple decoding
– the long instruction word has room for many operations
– by definition, all the operations the compiler puts in the long instruction word can execut
in parallel
– e.g., 2 integer operations, 2 FP ops, 2 memory references, 1 branch;
16 to 24 bits for each of these fields ==> 7 x 16 or 112 bits to 7 x 24 or 168 bits wide – need compiling technique that schedules across several branches

Loop Unrolling in VLIW
• Need more registers in VLIW (EPIC ==> 128 integer + 128 FP)

Software Pipelining
• Observation: if iterations from loops are independent, then can get more ILP (instruction level parallelism) by taking instructions from different iterations
• Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (i.e., Tomasulo algorithms in SW)

Software Pipelining Example

Software Pipelining with Loop Unrolling in VLIW
• 9 results in 9 cycles, or 1 clock per iteration
• average: 3.3 ops per clock, 66% efficiency
• Note: need less registers for software pipelining (only using 7 registers here, was using 15)

Multiple Issue:
• more complex issue logic
Multiple Issue as compared with VLIW
– check dependencies
– check structural hazards
– issue variable number of instructions (0-N) – shift unissued instructions over
• Able to run existing binaries
– recompile for performance, not correctness
• Datapaths identical
– but bypass requires detection
• Neither VLIW or multiple-issue can schedule around run-time variation in instruction latency
– cache misses
• Dealing with run-time variation requires run-time or dynamic scheduling

The Problem with Static Scheduling (Compile-Time)
In-Order Execution:
• an unexpected long latency blocks ready instructions from executing (scheduled code cannot be changed at run-time)
• binaries need to be rescheduled (recompiled) for each new processor implementation
• small number of named registers becomes a bottleneck

• Why in HW at run time?
Can we use HW to get CPI closer to 1?
– works when can’t know real dependencies at compile time – compiler simpler; avoid recompilation also!
– code for one machine runs well on another
• Dynamic scheduling?!
• Key ideas: allow instructions behind stall to proceed:
DIVD F0, F2, F4 ADDD F10, F0, F8 SUBD F12, F8, F14
• Out-of-order execution ==> out-of-order completion • Disadvantages?
– complexity
– precise interrupts harder!

• How do we prevent WAR and WAW hazards? • How do we deal with variable latency?
– forwarding for RAW hazards harder
Problems?

Scoreboard: a bookkeeping technique
• Out-of-order execution divides ID stage:
– 1. Issue—decode instructions, check for structural hazards
– 2. Read operands—wait until no data hazards, then read operands
• Scoreboards date to CDC6600 designed in 1963
• Instructions execute whenever not dependent on previous instructions and no hazards
• CDC6600: in-order issue, out-of-order execution, out-of-order commit (or completion)
– no forwarding!
– imprecise interrupt/exception model for now

Scoreboard Architecture (CDC6600)

• No register renaming!
Scoreboard Implications
• Out-of-order completion ==> WAR, WAW hazards?
• Solutions for WAR:
– stall writeback until registers have been read
– read registers only during Read Operands stage
• Need to have multiple instructions in execution phase ==> multiple execution units or pipelined execution units
• Scoreboard keeps track of dependencies between instructions that have already issued
• Scoreboard replaces ID, EX, WB with 4 stages

Four Stages of Scoreboard Control
• Issue—decode instructions & check for structural hazards (ID1)
– instructions issued in program order (for hazard checking)
– don’t issue if structural hazard
– don’t issue if instruction is output dependent on any previously issued but uncompleted
instruction (no WAW hazards)
• Read operands—wait until no data hazards, then read operands (ID2)
– all real dependencies (RAW hazards) resolved in this stage, since we wait for instructions to write back data
– no forwarding of data in this model

Four Stages of Scoreboard Control
• Execution—operate on operands (EX)
– the functional unit begins execution upon receiving operands;
when the result is ready, it notifies the scoreboard that it has completed execution
• Write result—finish execution (WB)
– stall until no WAR hazards with previous instructions:
Example: DIVD F0, F2, F4 ADDD F10, F0, F8 SUBD F8, F8, F14
CDC6600 scoreboard would stall SUBD until ADDD reads operands

• Instruction status:
Which of 4 steps the instruction is in
Field
Meaning
Busy Op
Fi
Fj, Fk Qj, Qk Rj, Rk
indicates whether the unit is busy or not operation to perform in the unit (e.g., add or sub) destination register
source register numbers
functional units producing source registers Fj, Fk flags indicating when Fj, Fk are ready
Three Parts of the Scoreboard
• Functional unit status:
Indicates the state of the functional unit (FU). 9 fields for each functional unit
• Register result status:
Indicates which functional unit will write each register, if one exists; blank when no pending instructions will write that register

Scoreboard Example

Detailed Scoreboard Pipeline Control

Scoreboard Example: Cycle 1

Scoreboard Example: Cycle 2

Scoreboard Example: Cycle 3
in order issue

Scoreboard Example: Cycle 4

Scoreboard Example: Cycle 5

Scoreboard Example: Cycle 6

Scoreboard Example: Cycle 7
no. MULTD needs to wait for Integer unit to write back to F2.

Scoreboard Example: Cycle 8a (1st half of clock cycle)

Scoreboard Example: Cycle 8b (2nd half of clock cycle)

Scoreboard Example: Cycle 9

Scoreboard Example: Cycle 10

Scoreboard Example: Cycle 11

Scoreboard Example: Cycle 12
no. it needs to wait for the multiply1 to write back on F0

Scoreboard Example: Cycle 13

Scoreboard Example: Cycle 14

Scoreboard Example: Cycle 15

Scoreboard Example: Cycle 16

Scoreboard Example: Cycle 17

Scoreboard Example: Cycle 18

Scoreboard Example: Cycle 19

Scoreboard Example: Cycle 20

• WAR Hazard is now gone…
Scoreboard Example: Cycle 21

Scoreboard Example: Cycle 22

Let’s skip some cycles…Scoreboard Example: Cycle 61

Scoreboard Example: Cycle 62

• Speedup 1.7 from compiler; 2.5 by hand; UT slow memory (no cache) limits benefit
• Limitations of 6600 scoreboard:
CDC6600 Scoreboard
– no forwarding hardware
– limited to instructions in basic block (small instruction window)
– small number of functional units (structural hazards), especially integer/load store units – do not issue on structural hazards
– wait for WAR hazards
– prevent WAW hazards

• Compiler scheduling • HW exploiting ILP
Summary of Concepts
– works when we can’t possibly know dependencies at compile time – code for one machine runs well on another
• Key idea of scoreboard: allow instructions behind stall to proceed (decode => issue instructions and read operands)
– enables out-of-order execution ==> out-of-order completion – ID stage checked both for structural and data dependencies – original version didn’t handle forwarding
– no automatic register renaming

Reviews

There are no reviews yet.

Only logged in customers who have purchased this product may leave a review.

Shopping Cart
[SOLVED] 代写 algorithm scala parallel compiler software Module Outline:
30 $