Time allowed Full score
High Performance Computer
Architecture
Revision Exercise #2 & Solution
: 2 hours
: 100 marks
o This review exercise is solely for your OWN REVISION AND THUS NO NEED TO SUBMIT the answers.
o Trytoattempteachquestionandthenreviewthesuggestedanswersafterwards.
o The whole exercise consists of 8 pages. Check it out yourself.
o Time your handling of this exercise. You are advised to leave the last 10 minute of 2 hours to check your answer.
Page 1 of 8
Question 1. (30 marks)
Carefully read each of the following statements to decide whether it is True or False. Write down T for True or F for False for each statement. (2 mark each)
1. With pipelining, the latency of each single instruction is reduced.
2. Pipeline rate is limited by the shortest pipeline stage.
3. Pipeline stages with unbalanced durations decrease the speedup.
4. The time required to fill and drain a pipeline increases the speedup.
5. Structural hazards occur when two or more pipelined instructions contend for the same resource (like the main memory) in a high- performance computer system.
6. Controls hazards commonly occur for any branch instruction when a decision is made only after the condition of the branch instruction is evaluated.
7. Data hazards occur when an instruction depends on the result of a previous instruction still in the pipeline.
8. Data hazards can be resolved mostly by stalls or forwarding.
9. Loop unrolling always requires fewer registers to execute the original loop.
10. The superscalar architecture represents a dynamic approach to extract more instruction level parallelism (ILP) so as to optimise the pipelining efficiency.
11. Recompilation of the source program is always NOT necessary for superscalar architectures.
12. A disadvantage of VLIW is that the global register file is hard to build.
13. Software pineplining reorganizes loops so that each software- pipelined iteration is in fact made from instructions of the same iteration of the original loop.
14. Scoreboard keeps track of dependencies between instructions that have already been issued.
15. Register renaming is used in the Tomasulo Algorithm.
Answer:
1. F 2. F 3. T 4. F 5. T 6. F 7. T 8. T 9. F 10. F 11. T 12. T 13. F 14. T 15. T
Page 2 of 8
Question 2. (20 marks)
(a) (6 marks) A pipeline-based computer system consists of one single memory unit that allows only 1 memory access per clock cycle. Besides, this computer system has only one write port to update its register-file.
For a specific program, it was found that each instruction requires an average of 1.4 memory accesses. A student taking ELEC6036 quickly concluded that the average number of clock cycles per instruction (CPI) required for this program to execute on the above pipeline-based computer system should be < 1.4. Clearly state whether the students conclusion is true/false. Also, you should justify your answer with clear explanation.(6 marks)Answer:The students conclusion is false. (2 mark) Explanation:Since each instruction requires an average of 1.4 memory accesses, and only 1 memory access is allowed per cycle, the CPI should be 1.4 when we assume the memory utilization as 100% under normal circumstance. (2 mark)Unless the memory utilization can be > 100%, i.e. multiple and simultaneous memory accesses are allowed per cycle, we can have CPI < 1.4. This is clearly impossible/illegal in the above pipeline-based computer system. (2 mark)(b) (5 marks) Name the hazard that will occur when the following program fragment is executed on a 5-stage pipeline. Besides, the aid of execution diagrams, clearly explain how to use stalls to solve this hazard and analyse its possible impacts.loop: sub r1, r3, r4 beq r4, loop xor r8, r6, r7 …….. add r2, r6, r9 ……..An example of execution diagram is given as follows:(where I: I-fetch; D: Decode; E: Execute; M: Memory; W: Write)Cycle12345 6…… sub ID EM W…. Answer:The hazard that will occur: control hazard. (1 mark) Solution : introduce stall to wait until the branch decision is clear; Diagram: (with I: I-fetch; D: Decode; E: Execute; M: Memory; W: Write)Page 3 of 8Cycle 1234 5 6 ……sub I DE MW beqID E xorXX I D[ X: pipeline stall] (2 mark)Impact: 2 potential cycles lost (i.e. the 2nd and 3rd stage of beq will be wasted) since it always requires 3 cycles for executing the branch instruction; therefore it will slow down the performance of the pipeline. (2 mark)(c) (9 marks) Consider the execution of the following instructions on a 5-stage pipeline. lw r4, 12(r2) //load from [[r2]+$12] to r4sw r4, 10(r5)//store from r4 to [[r5]+$10]under the different cases as:i) without any forwarding or special hardware support;ii) with normal forwarding only;iii) with special forwarding supported by specialized MUX circuitry such thatvalues from the memory unit can be directly latched into the execution stage of the sw instruction.For each of the above cases, clearly state the number of stalled cycles required. Also, justify your answer with the corresponding execution diagrams of these two instructions on the 5-stage pipeline.Answer:i) 2 stalled cycles are required as illustrated below. (1.5 mark)lw F D E M Wsw F D X X E M W(where X represents pipeline stall)(1.5 mark)ii) By using normal forwarding, 1 stalled cycle is required as follows. (1.5 mark)lw F D E M! Wsw F D X E M W (where X represents stall and! means forwarding to E stage of sw)(1.5 mark)iii) By special forwarding supported by specialized MUX (i.e. multiplexer or basically data-selector circuit), 0 stalled cycle is resulted as follows. (1.5 mark)lw F D E M# Wsw F D E# M W(where # represents latch by MUX)(1.5 mark) Page 4 of 8Question 3. (25 marks)The diagram below shows an extended DLX pipeline scheme that allows multicycle operations. The extended DLX pipeline consists of 1 (F)etch stage, 1 (D)ecode stage, multiple [i.e. ranging from 1 4] (Ex)ecution stages, 1 (M)emory access stage and finally 1 (W)riteback stage.The following table details the number of execution stages required by the different instructions such as the floating-point (FP) ALU operations, and the earliest forwarding stages to provide their latest results to the Decoder unit of the subsequent instruction. Instructions#. of exec. stages required Earliest Forwarding stage All FP ALU operations(including ADDD, SUBD, DIVD and MULD) 4 Ex4Load or Store double/integer (including LD/SD or LI/SI) 2 Ex2All Integer ALU operations(including ADDI, SUBI, DIVI and MULI) 1Ex1 Throughout the following discussion, we assume the functional units of the extended DLX pipeline are fully pipelined so that an operation of any type can be issued on every clock cycle. Basically, most of the above operations consume their operands at the beginning of EX1. However, Store double/integer are the only exceptions that consume the value being stored 1 cycle later. Besides, similar to the standard DLX integer pipeline, any branch instruction will have a delay of 1 clock cycle unless otherwise stated.(a) (15 marks) Latency is always defined as the number of intervening cycles between the producer- and consumer-instructions. Copy the following table to your answer book and then complete the table by filling in the Latency in clock cycles for each of the producer- consumer instruction-pairs when executed on the above extended DLX pipeline. Besides, clearly explain how you arrive at each of the answers.1st (producer) instruction producing the result2nd (consumer) instruction using the resultLatency in clock cycles LDSD ? LDADDD ? DIVDADDD? Answer:1st (producer) instruction producing the result2nd (consumer) instruction using the resultLatency in clock cycles LDSD 0 LDADDD1 DIVDADDD 3 [3 marks @ of the above boldfaced answers, total = 9 marks]Page 5 of 8i) ii) iii)(Depth of execution pipeline of the producer – 1) 1 since SD will consume the value 1 cycle later, therefore (2 1) 1 = 0;Pipeline latency is essentially equal to one cycle less than the depth of execution pipeline of the producer instruction, i.e. (2 1) = 1;(Depth of execution pipeline of the producer – 1) 1 = (4 1) = 3;[2 mark @ of the above boldfaced answers, total = 6 marks](b) (10 marks) Based on your completed table in (a), clearly show the total number of clock cycles (including any stalls) required for executing the unscheduled code of Loop on the extended DLX pipeline. After this, also show the total number of clock cycles required for executing the scheduled code of Loop on the extended DLX pipeline.Unscheduled code:Loop: LD F4, 0(R2) ADDD F2, F4, F6SD 0(R2), F2 SUBI R2, R2, 8 ADDI R3, R3, 1 BNEZ R2, LoopLoop: LD F4, 0(R2) ADDD F2, F4, F6SUBI R2, R2, 8 ADDI R3, R3, 1 BNEZ R2, Loop SD 8(R2), F2Scheduled code:Answer: Unscheduled code:Clock cycle issued:1 2 3 4 5 6 7 8 9 10Loop: LD F4, 0(R2) stallADDD F2, F4, F6 stallstallSD 0(R2), F2 SUBI R2, R2, 8 ADDI R3, R3, 1 BNEZ R2, Loop stall(branch delay)[stalls for LD-ADDD, ADDD-SD and BNEZ @ 2 mark, total = 6 marks] Total number of clock cycles required = 10Scheduled code:Loop:[1 mark]Clock cycle issued:LD F4, 0(R2) 1stallADDD F2, F4, F6 3 SUBI R2, R2, 8 4 ADDI R3, R3, 1 5 BNEZ R2, Loop 6 SD 8(R2), F2 7[only 1 stall for LD-ADDD: 2 marks]Total number of clock cycles required = 72[1 mark]Page 6 of 8Question 4. (25 marks)(a) (7 marks) The diagram below shows the latest scoreboard status with the instruction sequencelisted at the upper left corner of the diagram.It is obvious that the ADDD instruction already finished the EXEC stage in cycle 16. Clearly explain why the ADDD instruction cannot proceed to the WRITE RESULT.Answer:There is a WAR (write after read) hazard there. The DIVD instruction has not read the F6 register and thus, the ADDD instruction cannot write into that register.(7 marks)(b) (9 marks) Name and clearly define the 3 different types of data hazards related to data registers in many pipelined computer systems. Besides, for each specific type of data hazards, give a short example of code sequence to illustrate that particular hazard.Answer: (3 marks @ point: 1 name, 1 def. & 1 code example, total = 9 marks)o RAW(readafterwrite):thishazardoccurswhenaninstructionproducesavaluewhich is used by a later instruction. Example:add $r1, $r2, $r3sub $r5, $r1, $r6 // $r1 induces a RAW hazardo WAR(writeafterread):thishazardoccurswhenalaterinstructionwritesaregister that needs to be read by an earlier instruction. If data flows from the later to the earlier instruction, an incorrect operation will result. Example:sub $r5, $r1, $r6add $r1, $r2, $r3 // $r1 induces a WAR hazardPage 7 of 8o WAW(writeafterwrite):thishazardoccurswhentwoinstructionswritearegister.If these instructions write out of order, then the register will have an incorrect value in it.Example:(c) (9 marks) Copy the following table into your answer book, and complete its detail which summarizes the major difference between the Tomasulo and scoreboard computers.add $r1, $r2, $r3sub $r1, $r5, $r6 // $r1 induces a WAW hazardTomasuloScoreboardNo issue on structural hazard??WAR : ??WAR : ?? WAW : ??WAW : ?? Broadcast results from FU??Answer :TomasuloScoreboard No issue on structural hazardsame WAR : renaming avoidsWAR : stall completion/write-result WAW : renaming avoidsWAW : stall issue Broadcast results from FUwrite/read registers [1.5 mark @ highlighted answer, sub-total = 9 marks] END OF REVISION EX. 2 Page 8 of 8
Reviews
There are no reviews yet.