[SOLVED] algorithm Scheme scala parallel software High Performance Computer

$25

File Name: algorithm_Scheme_scala_parallel_software_High_Performance_Computer.zip
File Size: 621.72 KB

5/5 - (1 vote)

High Performance Computer
Architecture Revision Exercise #2
Time allowed : 2 hours Full score : 100 marks
o ThisreviewexerciseissolelyforyourOWNREVISIONANDTHUSNONEEDTOSUBMIT the answers.
o Trytoattempteachquestionandthenreviewthesuggestedanswersafterwards.
o The whole exercise consists of 8 pages. Check it out yourself.
o Timeyourhandlingofthisexercise.Youareadvisedtoleavethelast10minuteof2hoursto check your answer.
Page 1 of 8

Question 1. (30 marks)
Carefully read each of the following statements to decide whether it is True or False. Write down T for True or F for False for each statement. (2 mark each)
1. With pipelining, the latency of each single instruction is reduced.
2. Pipeline rate is limited by the shortest pipeline stage.
3. Pipeline stages with unbalanced durations decrease the speedup.
4. The time required to fill and drain a pipeline increases the speedup.
5. Structural hazards occur when two or more pipelined instructions contend for the same resource (like the main memory) in a high- performance computer system.
6. Controls hazards commonly occur for any branch instruction when a decision is made only after the condition of the branch instruction is evaluated.
7. Data hazards occur when an instruction depends on the result of a previous instruction still in the pipeline.
8. Data hazards can be resolved mostly by stalls or forwarding.
9. Loop unrolling always requires fewer registers to execute the original loop.
10. The superscalar architecture represents a dynamic approach to extract more instruction level parallelism (ILP) so as to optimise the pipelining efficiency.
11. Recompilation of the source program is always NOT necessary for superscalar architectures.
12. A disadvantage of VLIW is that the global register file is hard to build.
13. Software pineplining reorganizes loops so that each software- pipelined iteration is in fact made from instructions of the same iteration of the original loop.
14. Scoreboard keeps track of dependencies between instructions that have already been issued.
15. Register renaming is used in the Tomasulo Algorithm.
Answer:
Page 2 of 8

Question 2. (20 marks)
(a) (6 marks) A pipeline-based computer system consists of one single memory unit that allows only 1 memory access per clock cycle. Besides, this computer system has only one write port to update its register-file.
For a specific program, it was found that each instruction requires an average of 1.4 memory accesses. A student taking ELEC6036 quickly concluded that the average number of clock cycles per instruction (CPI) required for this program to execute on the above pipeline-based computer system should be 1.4. Clearly state whether the students conclusion is true/false. Also, you should justify your answer with clear explanation.
(6 marks)
Answer:
(b) (5 marks) Name the hazard that will occur when the following program fragment is executed on a 5-stage pipeline. Besides, the aid of execution diagrams, clearly explain how to use stalls to solve this hazard and analyse its possible impacts.
loop:
sub r1, r3, r4
beq r4, loop
xor r8, r6, r7
..
add r2, r6, r9
..
An example of execution diagram is given as follows:
(where I: I-fetch; D: Decode; E: Execute; M: Memory; W: Write)
Cycle
1
2
3
4
5
6

sub
I
D
E
M
W
.
Answer:
(c) (9 marks) Consider the execution of the following instructions on a 5-stage pipeline. lw r4, 12(r2) //load from [[r2]+$12] to r4
sw r4, 10(r5)//store from r4 to [[r5]+$10]
under the different cases as:
i)without any forwarding or special hardware support; ii)with normal forwarding only;
iii)with special forwarding supported by specialized MUX circuitry such that values from the memory unit can be directly latched into the execution stage of the sw instruction.
Page 3 of 8

For each of the above cases, clearly state the number of stalled cycles required. Also, justify your answer with the corresponding execution diagrams of these two instructions on the 5-stage pipeline.
Answer:
Page 4 of 8

Question 3. (25 marks)
The diagram below shows an extended DLX pipeline scheme that allows multicycle operations. The extended DLX pipeline consists of 1 (F)etch stage, 1 (D)ecode stage, multiple [i.e. ranging from 1 4] (Ex)ecution stages, 1 (M)emory access stage and finally 1 (W)riteback stage.
The following table details the number of execution stages required by the different instructions such as the floating-point (FP) ALU operations, and the earliest forwarding stages to provide their latest results to the Decoder unit of the subsequent instruction.
Instructions
#. of exec. stages required
Earliest Forwarding stage
All FP ALU operations(including ADDD, SUBD, DIVD and MULD)
4
Ex4
Load or Store double/integer (including LD/SD or LI/SI)
2
Ex2
All Integer ALU operations(including ADDI, SUBI, DIVI and MULI)
1
Ex1
Throughout the following discussion, we assume the functional units of the extended DLX pipeline are fully pipelined so that an operation of any type can be issued on every clock cycle. Basically, most of the above operations consume their operands at the beginning of EX1. However, Store double/integer are the only exceptions that consume the value being stored 1 cycle later. Besides, similar to the standard DLX integer pipeline, any branch instruction will have a delay of 1 clock cycle unless otherwise stated.
(a) (15 marks) Latency is always defined as the number of intervening cycles between the producer- and consumer-instructions. Copy the following table to your answer book and then complete the table by filling in the Latency in clock cycles for each of the producer- consumer instruction-pairs when executed on the above extended DLX pipeline. Besides, clearly explain how you arrive at each of the answers.
Answer:
1st (producer) instruction producing the result
2nd (consumer) instruction using the result
Latency in clock cycles
LD
SD
?
LD
ADDD
?
DIVD
ADDD
?
1st (producer) instruction producing the result
2nd (consumer) instruction using the result
Latency in clock cycles
LD
SD
LD
ADDD
DIVD
ADDD
Page 5 of 8

(b) (10 marks) Based on your completed table in (a), clearly show the total number of clock cycles (including any stalls) required for executing the unscheduled code of Loop on the extended DLX pipeline. After this, also show the total number of clock cycles required for executing the scheduled code of Loop on the extended DLX pipeline.
Unscheduled code:
Loop:
Scheduled code:
Loop:
Answer:
LD F4, 0(R2) ADDD F2, F4, F6 SD 0(R2), F2 SUBI R2, R2, 8 ADDI R3, R3, 1 BNEZ R2, Loop
LD F4, 0(R2) ADDD F2, F4, F6 SUBI R2, R2, 8 ADDI R3, R3, 1 BNEZ R2, Loop SD 8(R2), F2
Page 6 of 8

Question 4. (25 marks)
(a) (7 marks) The diagram below shows the latest scoreboard status with the instruction sequence
listed at the upper left corner of the diagram.
It is obvious that the ADDD instruction already finished the EXEC stage in cycle 16. Clearly explain why the ADDD instruction cannot proceed to the WRITE RESULT.
Answer:
(b) (9 marks) Name and clearly define the 3 different types of data hazards related to data registers in many pipelined computer systems. Besides, for each specific type of data hazards, give a short example of code sequence to illustrate that particular hazard.
Answer:
Page 7 of 8

(c) (9 marks) Copy the following table into your answer book, and complete its detail which summarizes the major difference between the Tomasulo and scoreboard computers.
Tomasulo
Scoreboard
No issue on structural hazard
??
WAR : ??
WAR : ??
WAW : ??
WAW : ??
Broadcast results from FU
??
Answer :
Tomasulo
Scoreboard
No issue on structural hazard
WAR :
WAR :
WAW :
WAW :
Broadcast results from FU
END OF REVISION EX. 2
Page 8 of 8

Reviews

There are no reviews yet.

Only logged in customers who have purchased this product may leave a review.

Shopping Cart
[SOLVED] algorithm Scheme scala parallel software High Performance Computer
$25