Computer Architecture
Course code: 0521292B. 06. Pipelining & Hazards
Jianhua Li
College of Computer and Information, Hefei University of Technology
Slides are adapted from the CA courses of Wisconsin, Princeton, MIT, Berkeley, etc.
The slides of this course are for educational purposes only and should be used only in conjunction with the textbook. Derivatives of the slides must acknowledge the copyright notices of this and the originals.

Review: The Basic Idea of Pipelining
• More systematically:
– Pipelining the execution of multiple instructions
• Idea:
– Divide the instruction processing cycle into distinct "stages" of processing
– Ensure there are enough hardware resources to process one instruction in each stage
– Process a different instruction in each stage simultaneously
• Instructions consecutive in program order are processed in consecutive stages
• Benefit: increases instruction processing throughput

Example: Processing Four Add Operations
• Multi-cycle: 4 cycles per instruction

  F D E W
          F D E W
                  F D E W
                          F D E W
  ──────────────────────────────→ Time

• Pipelined: 4 cycles per 4 instructions. Looks great!

  F D E W
    F D E W
      F D E W
        F D E W
  ──────────────→ Time

Example: Processing Four Add Operations (cont.)
• Multi-cycle: 4 cycles per instruction (same diagram as above)
• Pipelined: 4 cycles per 4 instructions. Is it always this nice in practice?

The Ideal Pipeline
• Goal: increase throughput with little increase in cost (hardware cost, in the case of instruction processing)
• Repetition of identical operations
– The same operation is repeated on a large number of different inputs
• Repetition of independent operations
– No dependencies between repeated operations
• Uniformly partitionable suboperations
– Processing can be evenly divided into uniform-latency suboperations

The Ideal Pipeline
• One stage: combinational logic (F,D,E,M,W), delay T ps → BW ≈ 1/T
• Two stages: T/2 ps (F,D,E) + T/2 ps (M,W) → BW ≈ 2/T
• Three stages: T/3 ps (F,D) + T/3 ps (E,M) + T/3 ps (M,W) → BW ≈ 3/T

A Realistic Pipeline: Throughput
• Non-pipelined version with delay T:
  BW = 1/(T + S), where S = latch delay
• k-stage pipelined version (each stage takes T/k ps):
  BW_k-stage = 1/(T/k + S)
  – Latch delay reduces throughput
  BW_max = 1/(1 gate delay + S)
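The throughput formulas above can be sketched numerically. The numbers below (T = 800 ps of combinational delay, S = 50 ps of latch delay) are hypothetical, chosen only to show that latch overhead keeps the k-stage speedup below k:

```python
def throughput(T, S, k=1):
    """Throughput of a k-stage pipeline: one result per (T/k + S) ps.

    T: total combinational delay (ps); S: latch delay (ps); k: stage count.
    Returns results per picosecond.
    """
    return 1.0 / (T / k + S)

# Hypothetical numbers: 800 ps of logic, 50 ps latch delay.
base = throughput(800, 50)        # non-pipelined: 1/(800 + 50)
five = throughput(800, 50, k=5)   # 5 stages:      1/(160 + 50)
print(five / base)                # speedup < 5 because of the latch term S
```

With these numbers the speedup is 850/210 ≈ 4.05, not 5: the S term does not shrink as k grows.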

A Realistic Pipeline: Cost
• Non-pipelined version with combinational cost G:
  Cost = G + L, where L = latch cost
• k-stage pipelined version (each stage has G/k gates):
  Cost_k-stage = G + L·k
  – Latches increase hardware cost
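A quick numeric sketch of the cost model above, with hypothetical gate counts (G = 1000 gates of logic, L = 50 gates per latch set):

```python
def pipeline_cost(G, L, k=1):
    """Hardware cost of a k-stage pipeline.

    The combinational logic G is merely split across stages, but each
    stage adds one set of latches of cost L, so Cost = G + L*k.
    """
    return G + L * k

# Hypothetical numbers: 1000 gates of logic, latch set costing 50 gates.
print(pipeline_cost(1000, 50))        # non-pipelined: 1050
print(pipeline_cost(1000, 50, k=5))   # 5 stages: 1250
```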

Pipeline: Instruction Processing Stages
1. Instruction fetch (IF)
2. Instruction decode and register operand fetch (ID/RF)
3. Execute / evaluate memory address (EX/AG)
4. Memory operand fetch (MEM)
5. Store / write back result (WB)

Recall: The Single-Cycle Microarchitecture
[Figure: the single-cycle MIPS datapath: PC, instruction memory, register file, ALU, data memory, and sign extension, with the control signals RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; PCSrc1 = Jump, PCSrc2 = Br Taken.]

Dividing the Datapath into Stages
• IF: instruction fetch (200 ps)
• ID: instruction decode / register file read (100 ps)
• EX: execute / address calculation (200 ps)
• MEM: memory access (200 ps)
• WB: write back (100 ps)
[Figure: the single-cycle datapath partitioned along these five stage boundaries; the register-file write path is ignored for now.]

Is this the correct partitioning? Why not 4 or 6 stages?
Why not different boundaries?

Instruction Pipeline Throughput
Program execution order (in instructions): lw $1, 100($0); lw $2, 200($0); lw $3, 300($0)
• Non-pipelined: each lw takes 8 ns (Instruction fetch, Reg, ALU, Data access, Reg), so a new instruction starts every 8 ns.
• Pipelined: with a 2 ns clock (set by the slowest stages), a new instruction starts every 2 ns.
[Figure: timing diagrams of the three loads, non-pipelined (8 ns between instruction fetches) vs. pipelined (2 ns between instruction fetches).]

5-stage speedup is 4, not 5 as predicted by the ideal model. Why?
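The question can be answered numerically. In this sketch the non-pipelined machine takes the sum of the stage latencies per instruction, while the pipelined machine's cycle time is set by its slowest stage; the 2/1/2/2/1 ns latencies are the lw example above:

```python
def speedup(stage_ns, n_instructions):
    """Actual speedup of a pipeline with unbalanced stage latencies.

    The pipelined cycle time equals the slowest stage; the non-pipelined
    machine takes sum(stage_ns) per instruction.
    """
    cycle = max(stage_ns)
    sequential = sum(stage_ns) * n_instructions
    # fill latency (k - 1 cycles) plus one completion per cycle thereafter
    pipelined = (len(stage_ns) + n_instructions - 1) * cycle
    return sequential / pipelined

# The slide's lw example: IF=2, ID=1, EX=2, MEM=2, WB=1 (ns)
print(speedup([2, 1, 2, 2, 1], 1000))   # ≈ 4, not 5: stages are unbalanced
```

Because the 2 ns clock is set by the slowest stage rather than by 8/5 = 1.6 ns, the limit is 8/2 = 4; perfectly balanced stages would approach a speedup of 5.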

A Key Element of Pipelining: Pipeline Registers
• Stages: IF (instruction fetch), ID (instruction decode / register file read), EX (execute / address calculation), MEM (memory access), WB (write back)
• Pipeline registers between the stages: IF/ID, ID/EX, EX/MEM, MEM/WB
• No resource is used by more than one stage!
[Figure: the five-stage datapath with pipeline registers; each register latches the values later stages need, e.g. PCF; IRD, PCD+4; AE, BE, ImmE, PCE+4; AoutM, BM, nPCM; AoutW, MDRW. Each stage takes T/k ps.]

Pipelined Operation Example
[Figure: an lw instruction advancing through Instruction fetch, Instruction decode, Execution, Memory, and Write back, one stage per clock.]
• All instruction classes must follow the same path and timing through the pipeline stages.
• Any performance impact?

Pipelined Operation Example
[Figure: lw $10, 20($1) followed by sub $11, $2, $3 flowing through the pipeline across Clock 1 through Clock 6; the sub is always one stage behind the lw (e.g., in Clock 3 the lw is in Execution while the sub is in Instruction decode, and in Clock 6 the sub is in Write back).]
• Is it always this nice in practice?

Illustrating Pipelined Operation: Operation View

        t0   t1   t2   t3   t4   t5   …
Inst0   IF   ID   EX   MEM  WB
Inst1        IF   ID   EX   MEM  WB
Inst2             IF   ID   EX   MEM  …
Inst3                  IF   ID   EX   …
Inst4                       IF   ID   …

steady state (full pipeline) from t4 onward
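The operation view above can also be generated programmatically. This is a minimal sketch for a stall-free pipeline; the stage names match the five stages in this lecture:

```python
def operation_view(n_inst, stages=("IF", "ID", "EX", "MEM", "WB")):
    """Rows = instructions, columns = cycles; instruction i enters IF at cycle i."""
    rows = []
    for i in range(n_inst):
        # i empty cells of leading skew, then one stage per cycle
        cells = ["   "] * i + list(stages)
        rows.append(f"Inst{i}: " + " ".join(f"{c:>3}" for c in cells))
    return "\n".join(rows)

print(operation_view(3))
```

Each added instruction shifts its row right by one cycle, which is exactly the diagonal pattern in the diagram.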

Illustrating Pipelined Operation: Time-Space (Resource) View

      t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10
IF    I0  I1  I2  I3  I4  I5  I6  I7  I8  I9  I10
ID        I0  I1  I2  I3  I4  I5  I6  I7  I8  I9
EX            I0  I1  I2  I3  I4  I5  I6  I7  I8
MEM               I0  I1  I2  I3  I4  I5  I6  I7
WB                    I0  I1  I2  I3  I4  I5  I6

Pipeline Control
[Figure: the pipelined datapath with control signals attached (PCSrc, Branch, RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, MemtoReg) and the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB.]
• Identical set of control points as the single-cycle datapath!!

Pipeline Control
• For a given instruction:
– same control signals as single-cycle, but
– the control signals are required in different cycles, depending on the stage
 Option 1: decode once using the same logic as single-cycle and buffer the signals until consumed
  [Figure: Control decodes in ID; the EX, M, and WB signal groups are carried through ID/EX, EX/MEM, and MEM/WB, and each stage consumes its own group.]
 Option 2: carry the relevant "instruction word/field" down the pipeline and decode locally within each stage (or in the stage just before)
• Which one is better?
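Option 1 can be sketched in software. The per-stage signal groups (EX, M, WB) follow the figure, while the decode table below is a hypothetical subset of the real control signals, not the full MIPS control table:

```python
# Sketch of Option 1: decode once in ID, then buffer the control signals
# in the pipeline registers until the stage that consumes them.
CONTROL = {  # opcode -> {stage group: {signal: value}}; illustrative subset
    "lw":  {"EX": {"ALUSrc": 1}, "M": {"MemRead": 1},
            "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sub": {"EX": {"ALUSrc": 0}, "M": {},
            "WB": {"RegWrite": 1, "MemtoReg": 0}},
}

def decode(opcode):
    # All three signal groups leave the ID stage together (into ID/EX).
    return dict(CONTROL[opcode])

def advance(id_ex):
    """Each stage strips off the group it consumes and passes the rest on."""
    ex_signals = id_ex.pop("EX")    # consumed in EX
    ex_mem = id_ex                  # M + WB travel on in EX/MEM
    mem_signals = ex_mem.pop("M")   # consumed in MEM
    mem_wb = ex_mem                 # only WB remains, in MEM/WB
    return ex_signals, mem_signals, mem_wb["WB"]

ex, mem, wb = advance(decode("lw"))
print(wb)   # {'RegWrite': 1, 'MemtoReg': 1}
```

The shrinking dictionary mirrors the figure: the bundle of buffered signals gets narrower at each pipeline register as groups are consumed.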

Pipelined Control Signals
[Figure: the full pipelined datapath with the control signal groups (WB, M, EX) buffered in the ID/EX, EX/MEM, and MEM/WB pipeline registers; the signals shown include PCSrc, Branch, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, MemtoReg, and RegWrite.]

Remember: The Ideal Pipeline
• Goal: increase throughput with little increase in cost (hardware cost, in the case of instruction processing)
• Repetition of identical operations
– The same operation is repeated on a large number of different inputs (e.g., all laundry loads go through the same steps)
• Repetition of independent operations
– No dependencies between repeated operations
• Uniformly partitionable suboperations
– Processing can be evenly divided into uniform-latency suboperations (that do not share resources)

Instruction Pipeline: Not an Ideal Pipeline
• Identical operations … NOT!
 different instructions → not all need the same stages
Forcing different instructions to go through the same pipe stages → external fragmentation (some pipe stages idle for some instructions)
• Uniform suboperations … NOT!
 different pipeline stages → not the same latency
Need to force each stage to be controlled by the same clock → internal fragmentation (some pipe stages are too fast, but all take the same clock cycle time)
• Independent operations … NOT!
 instructions are not independent of each other
Need to detect and resolve inter-instruction dependences to ensure the pipeline provides correct results → pipeline stalls (pipeline is not always moving)

Issues in Pipeline Design
• Balancing work in pipeline stages
– How many stages, and what is done in each stage
• Keeping the pipeline correct, moving, and full in the presence of events that disrupt pipeline flow
– Handling dependences
• Data
• Control
– Handling resource contention
– Handling long-latency (multi-cycle) operations
• Handling exceptions, interrupts
• Advanced: improving pipeline throughput
– Minimizing stalls

Causes of Pipeline Stalls
• Stall: a condition in which the pipeline stops moving
• Resource contention
• Dependences (between instructions)
– Data
– Control
• Long-latency (multi-cycle) operations

Dependences and Their Types
• Also called "dependency" or, less desirably, "hazard"
• Dependences dictate ordering requirements between instructions
• Two types
– Data dependence
– Control dependence
• Resource contention is sometimes called resource dependence
– However, it is not fundamental to program semantics.

Handling Resource Contention
• Happens when instructions in two pipeline stages need the same resource
• Solution 1: eliminate the cause of contention
– Duplicate the resource or increase its throughput
• E.g., use separate instruction and data memories (caches)
• E.g., use multiple ports for memory structures
• Solution 2: detect the resource contention and stall one of the contending stages
– Which stage do you stall?
– Example: what if you had a single read and write port for the register file?

Data Dependence
• Types of data dependences
– Flow dependence (true dependence; read after write)
– Output dependence (write after write)
– Anti dependence (write after read)
• Which ones cause stalls in a pipelined machine?
– For all of them, we need to ensure the semantics of the program are correct
– Flow dependences always need to be obeyed, because they constitute true dependence on a value
– Anti and output dependences exist due to the limited number of registers in the CPU
• They are dependences on a name, not a value
• We will later see what we can do about them

Data Dependence Examples
• Flow dependence (read-after-write, RAW):
  r3 ← r1 op r2
  r5 ← r3 op r4
• Anti dependence (write-after-read, WAR):
  r3 ← r1 op r2
  r1 ← r4 op r5
• Output dependence (write-after-write, WAW):
  r3 ← r1 op r2
  r5 ← r3 op r4
  r3 ← r6 op r7
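The three cases above can be checked mechanically. In this sketch an instruction is modeled as a (destination, source-set) pair, a representation chosen only for illustration:

```python
def classify(first, second):
    """Classify the dependences of `second` on `first` (program order).

    Each instruction is a (dest, srcs) pair. Returns the subset of
    {"RAW (flow)", "WAR (anti)", "WAW (output)"} that applies.
    """
    deps = set()
    d1, s1 = first
    d2, s2 = second
    if d1 in s2:          # second reads what first writes
        deps.add("RAW (flow)")
    if d2 in s1:          # second overwrites what first reads
        deps.add("WAR (anti)")
    if d1 == d2:          # both write the same register
        deps.add("WAW (output)")
    return deps

# r3 <- r1 op r2 ; r5 <- r3 op r4 : flow dependence through r3
print(classify(("r3", {"r1", "r2"}), ("r5", {"r3", "r4"})))  # {'RAW (flow)'}
```

Renaming r3 in the consumer would remove a WAR or WAW edge but never a RAW edge, which is why only flow dependences are dependences on a value.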

Pipelined Operation Example (Revisited)
[Figure: the same lw $10, 20($1) / sub $11, $2, $3 sequence flowing through the pipeline, clock by clock.]
• What if the SUB were dependent on the LW?

How to Handle Data Dependences
• Anti and output dependences are easier to handle
– Write to the destination in only one stage and in program order
• Flow dependences are more interesting
• Four typical ways of handling flow dependences
– Detect and wait until the value is available in the register file
– Detect and forward/bypass the data to the dependent instruction
– Detect and eliminate the dependence at the software level
• No need for the hardware to detect the dependence
– Predict the needed value(s), execute "speculatively", and verify

Pipeline Interlocking
• Detection of dependences between instructions in a pipelined processor, to guarantee correct execution
• Software-based interlocking
• Hardware-based interlocking
• MIPS acronym?

Dependence Detection (1)
• Scoreboarding
– Each register in the register file has a Valid bit associated with it
– An instruction that is writing to a register resets its Valid bit (sets it to 0)
– An instruction in the Decode stage checks whether all of its source and destination registers are Valid (value is 1)
• Yes: no need to stall (no dependence)
• No: stall the instruction
• Advantage:
– Simple: 1 bit per register
• Disadvantage:
– Need to stall on all types of dependences, not only flow dependences.
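The scoreboard described above can be sketched as follows. The 32-register file size follows MIPS; everything else (class shape, method names) is illustrative:

```python
# Minimal sketch of register-file scoreboarding: one Valid bit per
# register; a write in flight clears it, and Decode stalls while any
# register the instruction touches (sources OR destination) is invalid.
class Scoreboard:
    def __init__(self, n_regs=32):
        self.valid = [True] * n_regs

    def can_issue(self, srcs, dest):
        # Decode checks all source and destination registers.
        return all(self.valid[r] for r in [*srcs, dest])

    def issue(self, dest):
        self.valid[dest] = False   # write to dest is now in flight

    def writeback(self, dest):
        self.valid[dest] = True    # value has reached the register file

sb = Scoreboard()
sb.issue(3)                        # e.g. lw writing $3
print(sb.can_issue([3, 4], 5))     # False: $3 not yet written (flow dep.)
sb.writeback(3)
print(sb.can_issue([3, 4], 5))     # True
```

Note that `can_issue` also returns False for an instruction that merely writes $3 while the lw's write is in flight, which is exactly the disadvantage listed above: the single Valid bit cannot tell a flow dependence from an output or anti dependence.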

How to Avoid Stalls on Output and Anti Dependences?
• What changes would you make to the scoreboard to enable this?
– Think about it!

Dependence Detection (2)
• Combinational dependence check logic
– Special logic that checks whether any instruction in a later pipeline stage is going to write to any source register of the instruction being decoded
– Yes: stall the instruction/pipeline
– No: no need to stall (no flow dependence)
• Advantage:
– No need to stall on anti and output dependences
• Disadvantage:
– The logic is more complex than a scoreboard
– The logic becomes more complex as we make the pipeline deeper and wider (think superscalar execution)

After Hardware Detects a Dependence…
• What do you do afterwards?
• Observation: the dependence between two instructions is detected before the communicated data value becomes available
• Option 1: stall the dependent instruction right away
• Option 2: stall the dependent instruction only when necessary → data forwarding/bypassing
• Option 3: …

Next Topic: Dependence Handling

Control Dependence
• Question: what should the fetch PC be in the next cycle?
• Answer: the address of the next instruction
– All instructions are control dependent on previous ones. Why?
• If the fetched instruction is a non-control-flow instruction:
– The next fetch PC is the address of the next sequential instruction
– Easy to determine if we know the size of the fetched instruction
• If the fetched instruction is a control-flow instruction:
– How do we determine the next fetch PC?
• In fact, how do we know whether or not the fetched instruction is a control-flow instruction?
