
CSE 371 Computer Organization and Design

CIS 501: Comp. Arch.|Dr.|Branch Prediction

Unit 7: Branch Prediction

Based on slides by Profs. & C.J.
This Unit: Branch Prediction
Control hazards
Branch prediction

System software


Control Dependences and
Branch Prediction

What About Branches?
Branch speculation
Could just stall to wait for branch outcome (two-cycle penalty)
Fetch past branch insns before branch outcome is known
Default: assume not-taken (at fetch, can't tell it's a branch)

Big Idea: Speculative Execution
Speculation: risky transactions on chance of profit

Speculative execution
Execute before all parameters known with certainty
Correct speculation
Avoid stall, improve performance
Incorrect speculation (mis-speculation)
Must abort/flush/squash incorrect insns
Must undo incorrect changes (recover pre-speculation state)

Control speculation: speculation aimed at control hazards
Unknown parameter: are these the correct insns to execute next?

Control Speculation Mechanics
Guess branch target, start fetching at guessed position
Doing nothing is implicitly guessing target is PC+4
We were already speculating before!
Can actively guess other targets: dynamic branch prediction

Execute branch to verify (check) guess
Correct speculation? keep going
Mis-speculation? Flush mis-speculated insns
Hopefully haven't modified permanent state (regfile, DMem)
Happens naturally in in-order 5-stage pipeline

Actually have modified one piece of state: PC! Why is that ok?

When to Perform Branch Prediction?
Option #1: During Decode
Look at instruction opcode to determine branch instructions
Can calculate next PC from instruction (for PC-relative branches)
One cycle mis-fetch penalty even if branch predictor is correct

Option #2: During Fetch?
How do we do that?
                     1 2 3 4 5 6 7 8 9
bnez r3,targ         F D X M W
targ: add r4,r5,r4     F D X M W

More specifically: if you look at the schematic in the book, even if you know that the branch will be taken, you can't do anything about it until you have calculated the branch target

Branch Recovery

Branch recovery: what to do when a branch is actually taken
Insns that are in F and D are wrong
Flush them, i.e., replace them with nops
They haven't written permanent state yet (regfile, DMem)
Two cycle penalty for taken branches

Branch Speculation and Recovery
Mis-speculation recovery: what to do on wrong guess
Not too painful in a short, in-order pipeline
Branch resolves in X
Younger insns (in F, D) haven't changed permanent state
Flush insns currently in F and D (i.e., replace with nops)
Correct speculation (branch not taken):

                     1 2 3 4 5 6 7 8 9
addi r3,r1,1         F D X M W
bnez r3,targ           F D X M W
st r6,[r7+4]             F D X M W
mul r10,r8,r9              F D X M W

Mis-speculation (branch taken; the speculative st and mul are flushed):

                     1 2 3 4 5 6 7 8 9
addi r3,r1,1         F D X M W
bnez r3,targ           F D X M W
st r6,[r7+4]             F D          (flushed: speculative)
mul r10,r8,r9              F          (flushed: speculative)
targ: add r4,r4,r5           F D X M W

Branch Performance
Back of the envelope calculation
Branch: 20%, load: 20%, store: 10%, other: 50%
Say, 75% of branches are taken

CPI = 1 + 20% * 75% * 2 =
1 + 0.20 * 0.75 * 2 = 1.3
Branches cause 30% slowdown
Worse with deeper pipelines (higher mis-prediction penalty)

Can we do better than assuming branch is not taken?

Dynamic Branch Prediction
Dynamic branch prediction: hardware guesses outcome
Start fetching from guessed address
Flush on mis-prediction

Branch Prediction Performance
Parameters
Branch: 20%, load: 20%, store: 10%, other: 50%
75% of branches are taken
Dynamic branch prediction
Branches predicted with 95% accuracy
CPI = 1 + 20% * 5% * 2 = 1.02
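The back-of-the-envelope arithmetic above can be checked with a small sketch (the helper name and structure are mine, not from the slides):

```python
# Back-of-the-envelope CPI for a 5-stage pipeline with a 2-cycle
# penalty per mis-predicted branch (numbers from the slides).
def branch_cpi(branch_frac, mispredict_rate, penalty=2):
    """CPI = 1 + (branch fraction) * (mis-prediction rate) * penalty."""
    return 1 + branch_frac * mispredict_rate * penalty

# Predict not-taken: every taken branch (75% of 20%) is a mis-prediction.
cpi_not_taken = branch_cpi(0.20, 0.75)
# Dynamic predictor with 95% accuracy: only 5% of branches mis-predicted.
cpi_dynamic = branch_cpi(0.20, 0.05)
print(round(cpi_not_taken, 2), round(cpi_dynamic, 2))  # -> 1.3 1.02
```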

Dynamic Branch Prediction Components
Step #1: is it a branch?
Easy after decode
Step #2: is the branch taken or not taken?
Direction predictor (applies to conditional branches only)
Predicts taken/not-taken
Step #3: if the branch is taken, where does it go?
Easy after decode

Branch Prediction Steps
[Figure: branch prediction steps. The branch target buffer answers "is this insn a branch?" and supplies the predicted target; the direction predictor supplies taken/not-taken. Prediction source: the PC. Here we are trying to predict the behavior of the Decode insn, using the PC of the Decode insn.]

BRANCH TARGET PREDICTION

Revisiting Branch Prediction Components
Step #1: is it a branch?
After decode this is easy; during fetch we need a predictor
Step #2: is the branch taken or not taken?
Direction predictor (later)
Step #3: if the branch is taken, where does it go?
Branch target predictor (BTB)
Supplies target PC if branch is taken

Branch Target Buffer
Learn from past, predict the future
Record the past in a hardware structure

Branch target buffer (BTB):
guess the future PC based on past behavior
Last time the branch X was taken, it went to address Y
So, in the future, if address X is fetched, fetch address Y next
PC indexes a table of target addresses
Essentially: branch will go to same place it went last time

What about aliasing?
Two PCs with the same lower bits?
No problem, just a prediction!


Branch Target Buffer (continued)
At Fetch, how do we know we have a branch? We don't!
Key idea: use the BTB itself to predict which insns are branches
Implement by tagging each entry with its corresponding PC
Update the BTB on every taken branch insn, recording the target PC:
BTB[PC].tag = PC, BTB[PC].target = target of branch
All insns access the BTB at Fetch, in parallel with Imem
A tag match signifies that the insn at that PC is a branch
Otherwise, assume the insn is not a branch
Predicted PC = (BTB[PC].tag == PC) ? BTB[PC].target : PC+4


We are recording a tuple (PC -> target), but only for branches. When you fetch an entry with your hash code, you verify that the tuple is relevant, since the PC matches.

Question: why does this work? Most branches/jumps are direct.

Are there some that are not? Yes: jump-register, which shows up a lot in function returns.
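The tag-match logic above can be sketched as a small direct-mapped table (the class shape, table size, and word-aligned-PC hashing are illustrative assumptions, not from the slides):

```python
# Minimal sketch of a direct-mapped branch target buffer (BTB).
# Each entry holds (tag, target); a tag match means "this PC is a
# branch, and last time it went to target".
class BTB:
    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        self.entries = [None] * num_entries  # each entry: (tag, target)

    def _index(self, pc):
        # Index with low-order bits of the (word-aligned) PC.
        return (pc >> 2) % self.num_entries

    def predict(self, pc):
        """Predicted PC = (tag matches) ? stored target : PC + 4."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:
            return entry[1]
        return pc + 4

    def update(self, pc, target):
        """Record PC -> target on every taken branch."""
        self.entries[self._index(pc)] = (pc, target)

btb = BTB()
print(hex(btb.predict(0x1000)))  # no entry yet: falls through to 0x1004
btb.update(0x1000, 0x2000)       # taken branch at 0x1000 went to 0x2000
print(hex(btb.predict(0x1000)))  # tag match: predicts 0x2000
```

An aliasing PC that maps to the same entry simply fails the tag match and falls through to PC+4, which is safe because this is only a prediction.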

Why Does a BTB Work?
Because most control insns use direct targets
Target encoded in the insn itself -> same taken target every time

What about indirect targets?
Target held in a register -> can be different each time
Two indirect call idioms
Dynamically linked functions (DLLs): target always the same
Dynamically dispatched (virtual) functions: hard, but uncommon
Also two indirect unconditional jump idioms
Switches: hard, but uncommon
Function returns: hard and common, but...

Return Address Stack (RAS)
Return address stack (RAS)
Call instruction? RAS[TopOfStack++] = PC+4
Return instruction? Predicted-target = RAS[TopOfStack]
Q: how can you tell if an insn is a call/return before decoding it?
Accessing the RAS on every insn BTB-style doesn't work
Answer: another predictor (or put them in BTB marked as return)
Or, pre-decode bits in insn mem, written when first executed
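The push/pop behavior above can be sketched in a few lines (a minimal model; the fixed depth with wrap-around is my assumption about overflow behavior, where real designs overwrite the oldest entry):

```python
# Sketch of a return address stack (RAS), following the slide's rules:
#   call:   RAS[TopOfStack++] = PC + 4
#   return: predicted target = RAS[--TopOfStack]
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.stack = [0] * depth
        self.top = 0  # TopOfStack index

    def on_call(self, pc):
        # Push the fall-through address of the call.
        self.stack[self.top % len(self.stack)] = pc + 4
        self.top += 1

    def on_return(self):
        # Pop: the predicted target of the return.
        self.top -= 1
        return self.stack[self.top % len(self.stack)]

ras = ReturnAddressStack()
ras.on_call(0x100)           # call at 0x100: push 0x104
ras.on_call(0x200)           # nested call at 0x200: push 0x204
print(hex(ras.on_return()))  # -> 0x204
print(hex(ras.on_return()))  # -> 0x104
```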


BRANCH DIRECTION PREDICTION

Branch Direction Prediction
Learn from past, predict the future
Record the past in a hardware structure
Direction predictor (DIRP)
Map conditional-branch PC to taken/not-taken (T/N) decision
Individual conditional branches often biased or weakly biased
90%+ one way or the other considered biased
Why? Loop back edges, checking for uncommon conditions
Bimodal predictor: simplest predictor
PC indexes Branch History Table of bits (0 = N, 1 = T), no tags
Essentially: branch will go same way it went last time

What about aliasing?
Two PCs with the same lower bits?
No problem, just a prediction!


Bimodal Branch Predictor
simplest direction predictor
PC indexes table of bits (0 = N, 1 = T), no tags
Essentially: branch will go same way it went last time
Problem: inner loop branch below
for (i=0; i<100; i++)
  for (j=0; j<3; j++)
    // whatever

Two built-in mis-predictions per inner-loop execution
Branch predictor changes its mind too quickly

Time  State  Prediction  Outcome  Result?
1     N      N           T        Wrong
2     T      T           T        Correct
3     T      T           T        Correct
4     T      T           N        Wrong
5     N      N           T        Wrong
6     T      T           T        Correct
7     T      T           T        Correct
8     T      T           N        Wrong
9     N      N           T        Wrong
10    T      T           T        Correct
11    T      T           T        Correct
12    T      T           N        Wrong

Outcome: T T T N T T T N T T T N
Pred:    N T T T N T T T N T T T

Two-Bit Saturating Counters (2bc)
Two-bit saturating counters (2bc) [Smith 1981]
Replace each single-bit prediction with a two-bit counter
(0,1,2,3) = (N,n,t,T)
Adds hysteresis
Force predictor to mis-predict twice before changing its mind
One mis-predict per loop execution (rather than two)
Fixes this pathology (which is not contrived, by the way)
Can we do even better?

Time  State  Prediction  Outcome  Result?
1     N      N           T        Wrong
2     n      N           T        Wrong
3     t      T           T        Correct
4     T      T           N        Wrong
5     t      T           T        Correct
6     T      T           T        Correct
7     T      T           T        Correct
8     T      T           N        Wrong
9     t      T           T        Correct
10    T      T           T        Correct
11    T      T           T        Correct
12    T      T           N        Wrong

Correlated Predictor
Correlated (two-level) predictor [Patt 1991]
Exploits observation that branch outcomes are correlated
Maintains a separate prediction per (PC, BHR) pair
Branch history register (BHR): a shift register storing the last k branch outcomes; a global history context
Simple working example: assume the program has one branch
BHT alone: one 1-bit DIRP entry
BHT + 2-bit BHR: 2^2 = 4 1-bit DIRP entries
Why didn't we do better? BHR not long enough to capture the pattern

Time  Pattern  State(NN,NT,TN,TT)  Prediction  Outcome  Result?
1     NN       N,N,N,N             N           T        Wrong
2     NT       T,N,N,N             N           T        Wrong
3     TT       T,T,N,N             N           T        Wrong
4     TT       T,T,N,T             T           N        Wrong
5     TN       T,T,N,N             N           T        Wrong
6     NT       T,T,T,N             T           T        Correct
7     TT       T,T,T,N             N           T        Wrong
8     TT       T,T,T,T             T           N        Wrong
9     TN       T,T,T,N             T           T        Correct
10    NT       T,T,T,N             T           T        Correct
11    TT       T,T,T,N             N           T        Wrong
12    TT       T,T,T,T             T           N        Wrong
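The two-bit counter scheme can be sketched directly (a minimal model, with names of my own choosing; it replays the T,T,T,N loop pattern above and, after the two warm-up mis-predictions, mis-predicts once per loop execution):

```python
# Minimal model of a two-bit saturating counter. States 0..3 map to
# (N, n, t, T): predict taken when the counter is >= 2, and move one
# step toward the observed outcome, saturating at 0 and 3.
class TwoBitCounter:
    def __init__(self):
        self.state = 0  # start at strongly not-taken (N)

    def predict(self):
        return self.state >= 2  # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

ctr = TwoBitCounter()
results = []
for taken in [True, True, True, False] * 3:  # the T,T,T,N loop pattern
    results.append(ctr.predict() == taken)
    ctr.update(taken)
print(results.count(True), "correct of", len(results))  # -> 7 correct of 12
```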
Correlated Predictor: 3-Bit Pattern
Try 3 bits of history
No mis-predictions after the predictor learns all the relevant patterns!

Time  Pattern  State(NNN,NNT,NTN,NTT,TNN,TNT,TTN,TTT)  Prediction  Outcome  Result?
1     NNN      N,N,N,N,N,N,N,N                         N           T        Wrong
2     NNT      T,N,N,N,N,N,N,N                         N           T        Wrong
3     NTT      T,T,N,N,N,N,N,N                         N           T        Wrong
4     TTT      T,T,N,T,N,N,N,N                         N           N        Correct
5     TTN      T,T,N,T,N,N,N,N                         N           T        Wrong
6     TNT      T,T,N,T,N,N,T,N                         N           T        Wrong
7     NTT      T,T,N,T,N,T,T,N                         T           T        Correct
8     TTT      T,T,N,T,N,T,T,N                         N           N        Correct
9     TTN      T,T,N,T,N,T,T,N                         T           T        Correct
10    TNT      T,T,N,T,N,T,T,N                         T           T        Correct
11    NTT      T,T,N,T,N,T,T,N                         T           T        Correct
12    TTT      T,T,N,T,N,T,T,N                         N           N        Correct

Branches may be correlated:

for (i=0; i<1000000; i++) {    // Highly biased
  if (i % 3 == 0) {            // Locally correlated
    // whatever
  }
  if (random() % 2 == 0) {     // Unpredictable
    if (i % 3 == 0) {          // Globally correlated
      // whatever
    }
  }
}

Gshare History-Based Predictor
Exploits observation that branch outcomes are correlated
Maintains recent branch outcomes in a Branch History Register (BHR)
In addition to a BHT of counters (typically 2-bit saturating counters)
How do we incorporate history into our predictions?
Use PC xor BHR to index into the BHT. Why?

Gshare History-Based Predictor (working example)
Assume the program has one branch
BHT: 1-bit DIRP entries
3-bit BHR: last 3 branch outcomes (a shift register; global history)
Train the counter and update the BHR after each branch

Time  BHR  State  Prediction  Outcome  Result?
1     NNN  N      N           T        Wrong
2     NNT  N      N           T        Wrong
3     NTT  N      N           T        Wrong
4     TTT  N      N           N        Correct
5     TTN  N      N           T        Wrong
6     TNT  N      N           T        Wrong
7     NTT  T      T           T        Correct
8     TTT  N      N           N        Correct
9     TTN  T      T           T        Correct
10    TNT  T      T           T        Correct
11    NTT  T      T           T        Correct
12    TTT  N      N           N        Correct
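The gshare working example can be replayed with a short model (1-bit entries to match the example rather than the usual 2-bit counters; class and variable names are mine):

```python
# Sketch of a gshare-style direction predictor: a BHT indexed by
# PC xor BHR, here with 1-bit entries to match the worked example.
class GShare:
    def __init__(self, history_bits=3):
        self.history_bits = history_bits
        self.bhr = 0                           # branch history register
        self.bht = [0] * (1 << history_bits)   # 0 = N, 1 = T

    def _index(self, pc):
        return (pc ^ self.bhr) & ((1 << self.history_bits) - 1)

    def predict(self, pc):
        return self.bht[self._index(pc)] == 1  # True = predict taken

    def update(self, pc, taken):
        self.bht[self._index(pc)] = 1 if taken else 0
        # Shift the new outcome into the history register.
        self.bhr = ((self.bhr << 1) | (1 if taken else 0)) \
                   & ((1 << self.history_bits) - 1)

# One branch (pc=0) with the T,T,T,N pattern repeating.
gs = GShare()
results = []
for taken in [True, True, True, False] * 3:
    results.append(gs.predict(0) == taken)
    gs.update(0, taken)
print(results.count(True), "correct of", len(results))  # -> 7 correct of 12
```

After the predictor has seen each history pattern once (times 7 through 12 in the table), every prediction is correct.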
Correlated Predictor Design I
Design choice I: one global BHR, or one per PC (local)?
Each one captures different kinds of patterns
Global history captures relationships among different branches
Local history captures self-correlation
Local history requires another table to store the per-PC history

for (i=0; i<1000000; i++) {    // Highly biased
  if (i % 3 == 0) {            // Locally correlated
    // whatever
  }
  if (random() % 2 == 0) {     // Unpredictable
    if (i % 3 >= 1) {          // Globally correlated
      // whatever
    }
  }
}

Correlated Predictor Design II
Design choice II: how many history bits (BHR size)?
Tricky one
Given unlimited resources, longer BHRs are better, but
BHT utilization decreases
Many history patterns are never seen
Many branches are history-independent (don't care)
PC xor BHR allows multiple PCs to dynamically share BHT
BHR length < log2(BHT size)
Longer BHRs also mean the predictor takes longer to train
Typical length: 8-12

Hybrid Predictor
Hybrid (tournament) predictor [McFarling 1993]
Attacks the correlated predictor's BHT capacity problem
Idea: combine two predictors
Simple bimodal predictor for history-independent branches
Correlated predictor for branches that need history
Chooser assigns branches to one predictor or the other
Branches start in the simple BHT, and move over on a mis-prediction threshold
Correlated predictor can be made smaller, handles fewer branches
90-95% accuracy
Very similar to ensemble learning techniques from ML

REDUCING BRANCH PENALTY

Reducing Penalty: Fast Branches
Fast branch: targets the control-hazard penalty
Basically, branch insns that can resolve at D, not X
Test must be a comparison to zero or an equality test; no time for the ALU
New taken-branch penalty is 1
Additional comparison insns (e.g., cmplt, slt) for more complex tests
Must bypass into the Decode stage now, too: previous instructions may compute results that feed the equality test
The approach taken in the text is to move branch testing into the ID stage so fewer instructions are flushed on a mis-prediction
Note the equality unit shoehorned into the Decode stage between the register file outputs

                     1 2 3 4 5 6 7 8 9
bnez r3,targ         F D X M W
st r6,[r7+4]           F D -- -- --
targ: add r4,r5,r4       F D X M W

Fast Branch Performance
Assume: branch: 20%, and 75% of branches are taken
CPI = 1 + 20% * 75% * 1 = 1 + 0.20*0.75*1 = 1.15
15% slowdown (better than the 30% from before)
But wait: fast branches assume only simple comparisons
Fine for MIPS
But not fine for ISAs with "branch if $1 > $2" operations

In such cases, say 25% of branches require an extra insn:
CPI = 1 + (20% * 75% * 1) + (20% * 25% * 1) = 1.2
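The fast-branch arithmetic above, including the extra-insn term, as a quick check (variable names are mine):

```python
# Fast-branch CPI: taken-branch penalty drops to 1 cycle, but some
# branches now cost an extra comparison insn (slide's example numbers).
branch_frac, taken_frac, penalty = 0.20, 0.75, 1
extra_insn_frac = 0.25   # branches needing an extra comparison insn

cpi_fast = 1 + branch_frac * taken_frac * penalty
cpi_fast_complex = cpi_fast + branch_frac * extra_insn_frac * 1
print(round(cpi_fast, 2), round(cpi_fast_complex, 2))  # -> 1.15 1.2
```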

Example of ISA and micro-architecture interaction
Type of branch instructions
Another option: Delayed branch or branch delay slot
What about condition codes?


Delay-slot idea: we always execute the next instruction after a branch. It is the compiler's job to put something useful in there.

Putting It All Together
BTB & branch direction predictor during fetch

If branch prediction correct, no taken branch penalty
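The next-PC selection described above can be sketched as a small function (plain dicts stand in for the hardware tables; names are illustrative, not from the slides):

```python
# Fetch-stage next-PC selection: the BTB supplies "is it a branch?"
# plus a target, and the direction predictor supplies taken/not-taken.
def predict_next_pc(pc, btb, direction):
    """btb: {pc: target}; direction: {pc: True if predicted taken}."""
    if pc in btb and direction.get(pc, False):
        return btb[pc]   # predicted-taken branch: fetch the target
    return pc + 4        # not a branch, or predicted not-taken

btb = {0x1000: 0x2000}
direction = {0x1000: True}
print(hex(predict_next_pc(0x1000, btb, direction)))  # -> 0x2000
print(hex(predict_next_pc(0x1004, btb, direction)))  # -> 0x1008
```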


Branch Prediction Performance
Dynamic branch prediction
20% of instructions are branches
Simple predictor: branches predicted with 75% accuracy
CPI = 1 + (20% * 25% * 2) = 1.1
More advanced predictor: 95% accuracy
CPI = 1 + (20% * 5% * 2) = 1.02

Branch mis-predictions still a big problem though
Pipelines are long: typical mis-prediction penalty is 10+ cycles
For cores that do more per cycle, predictions more costly (later)

PREDICATION

Predication
Instead of predicting which way we're going, why not go both ways?
compute a predicate bit indicating a condition
ISA includes predicated instructions
predicated insns either execute as normal or as NOPs, depending on the predicate bit
x86 cmov performs a conditional move (including conditional loads, but not stores)
32b ARM allows almost all insns to be predicated
64b ARM has predicated reg-reg move, inc, dec, not
Nvidias CUDA ISA supports predication on most insns
predicate bits are like LC4 NZP bits
x86 FLAGS, ARM condition codes
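A minimal software illustration of the idea (a hypothetical example of mine, not from the slides): compute both sides of an if-then-else and select with a predicate bit, instead of branching.

```python
# Branching version: the outcome depends on control flow, so the
# hardware must predict the branch (and may mis-predict).
def abs_branching(x):
    if x < 0:
        return -x
    return x

# Predicated version: both "paths" are computed unconditionally and a
# predicate bit selects the result, turning a control dependence into
# a data dependence (the cmov idea).
def abs_predicated(x):
    p = 1 if x < 0 else 0          # predicate bit
    return p * (-x) + (1 - p) * x  # select via the predicate

print(abs_branching(-5), abs_predicated(-5))  # -> 5 5
```

Note the overhead: the predicated version always does the work of both paths, which is exactly the trade-off the performance discussion below weighs.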

Predication Performance
Predication overhead is additional insns
Sometimes overhead is zero
for if-then statement where condition is true
Most of the time it isn't
if-then-else statement, only one of the paths is useful
Calculation for a given branch, predicate (vs speculate) if
Average number of additional insns < overall mis-prediction penalty
For an individual branch
Mis-prediction penalty in a 5-stage pipeline = 2
Mis-prediction rate is <50%, and often <20%
Overall mis-prediction penalty is <1, and often <0.4
So when is predication ever worth it?

Predication Performance (continued)
What does predication actually accomplish?
In a scalar 5-stage pipeline...
