Solutions to the
Revision Exercise4
for
High Performance Computer Architecture
Page 1 of 9
Question 1. 25 marks
Consider the following pipeline:
Integer operation
LW instruction
Thus, the executionstage latencies of the instructions are clearly stated as follows:
MULS floatingpoint multiply: 4 cycles
ADDS floatingpoint addition: 2 cycles
integer operations: 1 cycle
LW memory load: 3 cycles the first cycle is for address calculation while the
remaining two cycles are for memory access
one branch delay slot
The following program fragment computes a dotproductccai bi. Registers r2
and r3 contain addresses of arrays of floatingpoint numbers. Register r1 contains the length of the arrays number of elements. Register f8 the dotproduct result, i.e., c is initialized to zero.
ADDS instruction
dotprod:
lw f5, 0r2
lw f6, 0r3
muls f7, f5, f6
adds f8, f8, f7
addi r2, r2, 4
addi r3, r3, 4
addi r1, r1, 1
bne r1, zero, dotprod
nop
load element from 1st array
load element from 2nd array
multiply elements
accumulate results
advance array index
advance array index
decrement element count
; loop if not done
; do nothing but always
; execute branch delay
; slot
MULS instruction
Branch condition determined here
a 7 marks Clearly explain the total number of cycles per iteration to execute the above program on the concerned pipeline when forwarding is allowed for any branch instruction.
Answer: 2 markspoint of the clear explanation, 1 mark for correct answers, total8 marks
a 9 cycles for 9 instructions i.e. ideal CPI12 stall cycles before MULS3 stall cycles
before ADDS 0.8 mark for clear explanation14 cycles, 1 mark Page 2 of 9
; ; ; ; ; ; ;
Explanation:
Assuming that forwarding is done between the addi and bne instructions;
dotprod:
lw f5, 0r2
lw f6, 0r3
muls f7, f5, f6
adds f8, f8, f7
addi r2, r2, 4
addi r3, r3, 4
addi r1, r1, 1
bne r1, zero, dotprod nop
; load element from first array
; load element from second array
; underlined: producerconsumer
; accumulate results
; advance array index
; advance array index
; decrement element count
; loop if not done
; do nothing
1 cycle for each instruction
By taking CPI1 for multiEXstage pinelining Therefore 9 cycles for the total 9 instructions PLUS the possible delaysstalls as follows ::
i between lw producer of f6 on line2 and muls consumer of f6 on line3
refer to the similar diagram for analysis of mulf and addf on pp.2 of the Module2 notes
Earliest forwarding for 3cycle lw
So, need to ADD 2 cycles of delaystall 2 marks ii between muls producer of f7 on line3 and adds consumer of f7 on line
Fetch
Decode
Ex1
Ex2
Ex3
WB
adds
Muls
Delay2
Delay1
lw line2
lw line1
4
Similarly,
Earliest forward. for 4cycle muls
Fetch
Decode
Ex1
Ex2
Ex3
Ex4
WB
addi
adds
Delay3
Delay2
Delay1
muls
lw line2
So, need to ADD another 3 cycles of delaystall 2 marks iii between addi producer of r1 on line8 and bne consumer of r1 on line9
Similarly,
Immediate forward. for 1cycle addi line8
Fetch
Decode
Ex1
WB
bne
addi line8
addi line7
So, NO delaystall when forwarding is allowed for the branch instruction bne.
2 marks
Hence, its altogether: 92 stalls3 stalls14 cycles !!
b 8 marks Without unrolling the loop, rearrange the code so that the number of cycles per
Page 3 of 9
iteration is minimized. Show your rearranged program code. Besides, clearly explain the total number of cycles per iteration to execute your rearranged program on the concerned pipeline when forwarding is not allowed for any branch instruction.
Answer: 2 marks for the correctly rearranged program code, 2 marks for the correct answer of cycles per iter, 2 markspoint of the clear explanation, total8 marks
dotprod:
lw f5, 0r2 ; load element from first array lwf6,0r3 ;loadelementfromsecondarray addi r2, r2, 4 ; advance array index
addi r3, r3, 4 ; advance array index
muls f7, f5, f6 ; multiply elements
addi r1, r1, 1 ; decrement element count
bne r1, zero, dotprod ; loop if not done
adds f8, f8, f7 ; accumulate results delay branch!
2 marks
8 instructions1 stall cycle before ADDS 2 marks for clear explanation9 cycles 2 marks for correct answer no matter if we assume forwarding is done between the addi and bne
instructions.
Explanation :
2 marks
Since the original 3cycles delay for the adds instruction is only partially utilized i.e. the addi and bne instructions are inserted to use 2 cycles out of the original 3 cycles delay, even though there is NO forwarding allowed between addi and bne instruction, the potentially EXTRA 1cycle delay can be coincided or in fact overlapped with the remaining 1 unused cycle of the adds instruction. Thus, there will be effectively NO extra cycle of stall added due to this controlhazard problem between addi and bne instructions no matter forwarding is done or not.
2 marks
between muls producer of f7 on line5 and adds consumer of f7 on line8
Diagrammatically, the situation is represented as follows for clear explanation:
Similarly,
NO forward. for 1cycle addi line7
Fetch
Decode
Ex1
Ex2
Ex3
Ex4
WB
adds
bne
unused Delay1
addi
Muls
Clearly, by shifting the intrinsically 1 unused cycle of delaystall between muls and adds to the Ex2 slot between addi and bne so as to ABSORB or coincide with the possibly extra stall injected, there is apparently NO extra cycle of delay when forwarding is NOT allowed for the branch instruction bne.
2 marks
Hence, its altogether: 81 stalls9 cycles no matter forwarding for any branch instruction is allowed or not !!
Page 4 of 9
c 10 marks Unroll the loop once, and schedule it to completely avoid stalls. Show your revised program code. Besides, clearly explain the total number of cycles per iteration to execute your revised program on the concerned pipeline when forwarding is not allowed for any branch instruction.
Answer: 6 marks for the correctly revisedunrolled program code, 2 marks for the correct answer of cycles per iter, 2 marks for the clear explanation, total10 marks
dotprod:
lw f5, 0r2 ; load element from 1st array
lw f6, 0r3 ; load element from 2nd array
lw f9, 4r2 ; load another element from 1st array
lw f10, 4r3; load another element from 2nd array
muls f7, f5, f6 ; multiply elements
addi r2, r2, 8 ; advance array index
muls f11, f9, f10 ; multiply elements
addi r3, r3, 8 ; advance array index
adds f8, f8, f7 ; accumulate results
addi r1, r1, 2 ; decrement element count
bne r1, zero, dotprod ; loop if not done
addsf8,f8,f11;accumulateresultsdelay
branch!
6 marks
N.B. the potential stalls between muls on line5 and adds on line9 is fully utilized by the highlighted blue region as enclosed by the brace symbol; similarly for the second pairs of mulsadds as highlighted by the symbol.
IF we assume that forwarding is not allowed between the addi and bne, the answer should be:
12 instructions1 stall13 cycles for 2 iterations 2 marks for the clear explanation6.5 cycles per iteration 2 marks for the correct answer
Page 5 of 9
Question 2. 25 marks
a 3 marks Besides the readafterwrite RAW hazard, there are 2 frequently occurring data hazards in high performance computer systems due to data dependencies in a program. Name and clearly define these 2 types of data hazards with simple code examples for illustration.
Answer: 1.5 markspoint: 1def.1code example, total4 marks
o WAR write after read: this hazard occurs when a later instruction writes a register that needs to be read by an earlier instruction. If data flows from the later to the earlier instruction, an incorrect operation will result. Example:
sub r5, r1, r6
add r1, r2, r3r1 induces a WAR hazard
o WAW write after write: this hazard occurs when two instructions write a register. If these instructions write out of order, then the register will have an incorrect value in it. Example:
add r1, r2, r3
sub r1, r5, r6r1 induces a WAW hazard
b 8 marks Based on the Tomasulos dynamic scheduling approach for instruction execution, discuss how each of the 2 data hazards as defined in a can be resolved, and give a short code example to demonstrate the idea.
Answer: 4 markspoint, total8 marks
o WAR write after read: Tomasulo prevents this by using register renaming. Example:
sub r5, r1, r6
add r7, r2, r3r7 is used instead of 1
o WAW write after write: Tomasulo prevents this by using register renaming. Example:
add r1, r2, r3
sub r7, r5, r6r7 is used instead of 1
Page 6 of 9
c 8 marks The following diagram is aimed to show all the relevant statuses of the Tomasulo algorithm at the end of cycle 6 for the following program fragment. However, the diagram is incomplete with some missing details.
Cycle 6:
Instruction Status : Instruction j
LD F6 30 LD F2 41 MULTD F0 F2 SUBD F8 F6 DIVD F10 F0 ADDD F6 F8
Reservation Stations : Time Name Busy
1 Add1 Add2 Add3
9 Mult1 Mult2
Register result status :
Clock
6 FU
Exec. Write
k Issue Comp Result
Busy Address
134 24
3
4
5
Cycle 6:
Instruction Status : Instruction j
LD F6 30 LD F2 41 MULTD F0 F2 SUBD F8 F6 DIVD F10 F0 ADDD F6 F8
Reservation Stations : Time Name Busy
1 Add1 Add2 Add3
9 Mult1 Mult2
Exec. Write
k Issue Comp Result
Busy Address
Op
S1 Vj
S2 Vk
RS Qj
RS Qk
R1 R2 F4 F2 F6 F2
Load1 Load2 Load3
Yes SUBD MA1 MA2
No
Yes MULTD MA2 RF4
F0 F2 F4 F6 F8 F10 F12 .F30
Copy the above diagram onto your answer book and complete all the missing details. Besides, clearly explain the major difference between the Scoreboard approach and the Tomasulo algorithm in program execution after the instruction DIVD is issued.
Answer: 1 markhighlighted detail in the diagram, 4 marks for a clear explanation of the major difference, total8 marks
Op
S1 Vj
S2 Vk
RS Qj
RS Qk
R1 R2 F4 F2 F6 F2
Load1 Load2 Load3
Page 7 of 9
No No No
Mult1 MA2 Add2 Add1 Mult2
134 245 3
4
5
6
No No No
Yes SUBD
Yes ADDD
No
Yes MULTD Yes DIVD
MA1 MA2
MA2
MA2 Add1
RF4
MA1 Mult1
Register result status :
Clock
6 FU
F0 F2 F4 F6 F8 F10 F12 .F30
Mult1 MA2 Add2 Add1 Mult2
The major difference between the Scoreboard approach and the Tomasulo algorithm in program execution after the instruction DIVD is issued is clearly explained as follows.
In the Scoreboard approach, the last instruction ADDD cannot be issued due to structural hazard i.e. with 1 adder only even after the instruction DIVD was issued. And even IF issued i.e. assuming no structural hazard with 2 adders or more, ADDD is still pendingwaiting for the result of F8 to be completed by SUBD i.e. stalls due to the RAW hazard. Therefore, in the Scoreboard approach, the instruction ADDD needs to wait until SUBD in Add1 is completed before its issue. 2 marks
On the other hand, in the Tomasulos algorithm, the last instruction ADDD can be issued right after the instruction DIVD was issued since the algorithm is more autonomous with available reservation stations Add2 to be occupied by the concerned ADDD instruction in this case. Besides, the requested register result F8 by ADDD is remarkedrenamed as Add1 to denote the requested result to be produced by the reservation station Add1. 2 marks
d
6 marks Use a diagram to show all the relevant statuses of the Tomasulo algorithm at the end of cycle 7 for the above program fragment. Besides, clearly state the specific instructions that isare waiting for the result produced by the reservation station Add1 in that cycle.
Answer: 1.5 markhighlighted detail in the diagram, 1.5 marks for identifying the right instruction, total6 marks
Cycle 7:
Instruction Status : Instruction j
LD F6 30 LD F2 41 MULTD F0 F2 SUBD F8 F6 DIVD F10 F0 ADDD F6 F8
Reservation Stations : Time Name Busy
0 Add1 Add2 Add3
8 Mult1 Mult2
Register result status :
Clock
7 FU
Exec. Write
k Issue Comp Result
Busy Address
Op
S1 Vj
S2 Vk
RS Qj
RS Qk
R1 R2 F4 F2 F6 F2
Load1 Load2 Load3
134 245 3
47
5 6
Yes SUBD Yes ADDD No
Yes MULTD Yes DIVD
MA1 MA2
MA2
MA2 Add1
RF4
MA1 Mult1
F0 F2 F4 F6 F8 F10 F12 .F30
Mult1 MA2 Add2 Add1 Mult2
Page 8 of 9
No No No
Since the instruction SUBD being executed in the reservation station Add1 is completing its operation with the remaining time as 0 clock cycle, the instruction ADDD in the reservation station Add2 is waiting for its result on F8. 1.5 marks
END OF REV. EX. 4SOL
Page 9 of 9
Reviews
There are no reviews yet.