[SOLVED] scala parallel assembly compiler computer architecture software High Performance Computer

High Performance Computer Architecture
Revision Exercise #3
Time allowed: 1.5 hours (i.e. for your own revision and practice)
Full score: 75 marks
o This review exercise will help you revise the concepts covered in this course, especially before the examination. It is solely for your OWN REVISION, AND THUS THERE IS NO NEED TO SUBMIT the answers to me.
o Attempt each question first, and then review the suggested answers afterwards.
o The whole exercise consists of 6 pages. Check that you have all of them.
o Time yourself while working through this exercise. You are advised to leave the last 5 minutes to check your answers.
Page 1 of 6

Question 1. (23 marks) A Quick Review on Important Concepts
Carefully read each question and write down the most appropriate option in your answer book. Each of the first 5 questions carries 3 marks, while each of the last 4 questions carries 2 marks. 1 mark will be deducted for each wrong answer or for each question to which multiple options are given.
1) The potential speedup of a pipelined system can be increased by
A. increasing the number of instructions entered into the pipeline;
B. increasing the number of pipeline stages;
C. increasing the number of registers available in the pipelined system;
D. increasing the largest difference between the lengths of pipeline stages;
E. none of the above.
2) Which of the following computer architectures is not compiler-directed?
A. Explicit Parallel Instruction Computer (EPIC);
B. Very Long Instruction Word (VLIW);
C. Superscalar;
D. Software pipelining;
E. Scoreboard.
3) Even when existing binaries can be executed on a multiple-issue computer, recompilation of programs is required to
A. boost performance;
B. ensure correctness;
C. check dependencies;
D. check structural hazards;
E. forward immediate results.
4) Based on the covert parallelism,
A. the multiple threads of a program can explicitly communicate and synchronize with each other;
B. the compiler may discover parallelism and/or recognize code to remove dependencies;
C. the computer executes the instructions in sequence while preserving parallel semantics;
D. the computer exploits both instruction-level and thread-level parallelism;
E. the object code may make instruction-level parallelism explicit through encoding dependencies.
5) Which of the following is not a pitfall of in-order execution?
A. binaries need to be recompiled for each new processor implementation;
B. a relatively long period of time is required to reschedule code at run-time;
C. a small number of named registers always becomes a bottleneck;
D. an unexpectedly long latency blocks ready instructions from executing since scheduled code can never be changed;
E. none of the above.
6) In software pipelining, each software-pipelined iteration is made from instructions of
A. the same iteration of different loops in the same program;
B. the same iteration of the original loop;
C. randomly selected iterations of the original loop;
D. totally different iterations of the original loop;
E. none of the above.
7) Which of the following hazards cannot be resolved in any stage of the Scoreboard approach?
A. control hazard;
B. RAW;
C. structural hazard;
D. WAW;
E. WAR.
8) Loop unrolling, software pipelining and very long instruction word (VLIW) machines all share a common property of
A. requiring more functional units;
B. requiring more registers;
C. requiring more memory space for program execution;
D. requiring more time to reschedule the instructions;
E. requiring more complex issue logic.
9) Register renaming in Tomasulo's approach is used to avoid
A. control hazard;
B. RAW hazard;
C. structural hazard;
D. WAW hazard;
E. the heavy dependency on reservation stations.

Question 2. (26 marks)
(a) (8 marks) For a pipelined computer system using a branch prediction technique, assume that the technique makes a right guess with a probability of 72%, that the ideal cycles-per-instruction (CPI) for any pipelined execution is always 1, and that 28% of the instructions in a specific assembly program are branch instructions. Derive the resulting CPI for the branch instructions only. You have to explicitly state any assumption used, and clearly show all your calculation steps to arrive at the final answer.
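One possible way to set up the calculation in part (a) is sketched below. This is not the suggested answer: the question gives no misprediction penalty, so `penalty_cycles` is an assumed parameter, and the sketch further assumes that a correctly predicted branch costs only the ideal CPI. The function names are hypothetical.

```python
def branch_cpi(p_correct, penalty_cycles, ideal_cpi=1.0):
    """CPI of a branch instruction alone.

    Assumption: a correct prediction costs ideal_cpi cycles, and a wrong
    prediction adds penalty_cycles stall cycles on top of that.
    """
    return ideal_cpi + (1.0 - p_correct) * penalty_cycles


def overall_cpi(branch_fraction, p_correct, penalty_cycles, ideal_cpi=1.0):
    """CPI of the whole program when only branches incur extra stalls."""
    return ideal_cpi + branch_fraction * (1.0 - p_correct) * penalty_cycles


if __name__ == "__main__":
    # Figures from the question: 72% correct guesses, 28% branch instructions.
    # The penalty values tried here are assumptions, not given in the question.
    for penalty in (1, 2, 3):
        print(penalty,
              round(branch_cpi(0.72, penalty), 4),
              round(overall_cpi(0.28, 0.72, penalty), 4))
```

Whatever penalty you choose, state it explicitly as an assumption in your answer, since the resulting CPI depends directly on it.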
(b) (6 marks) A student concluded that a control hazard will occur when the following program fragment is executed on a 5-stage pipeline. Besides stalls and delayed branch, clearly explain another solution and, with the aid of an execution diagram, analyse its possible impact in solving this hazard.
loop:
sub r2, r3, r4
beq r2, loop
xor r7, r5, r6
..
muls r10, r8, r9
..
An example of an execution diagram is given as follows (where F: Fetch; D: Decode; E: Execute; M: Memory; W: Write):
Cycle | 1 | 2 | 3 | 4 | 5 | 6
sub   | F | D | E | M | W |
(c) (12 marks) A pipelined computer system has a single memory unit that allows only 1 memory access per clock cycle. In addition, this computer system has only one write port to update its register file.
All R-type instructions require only 4 stages (i.e. Ifetch-Decode-Execute-Wr) whereas a load instruction requires the typical 5 stages for pipelined execution. Consider a program in which the following sequence of instructions is executed on the above pipelined computer system:
R-type instr., Load instr., R-type instr.
Draw a diagram to illustrate the possible problem. You should also clearly name the hazard and suggest two possible solutions to resolve it. Lastly, evaluate the impact of each possible solution on pipelined execution. Specifically, you have to determine which solution is better, with a clear explanation.
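For part (c) of Question 2, the following minimal sketch may help when checking a hand-drawn diagram. It assumes one instruction is fetched per cycle starting at cycle 1 with no stalls, and it only tracks the register-file write port (the shared memory unit is not modelled). The names `STAGES`, `PADDED_STAGES` and `write_port_conflicts` are hypothetical.

```python
from collections import defaultdict

# Stage sequences as described in the question.
STAGES = {
    "R": ["F", "D", "E", "W"],          # R-type: 4 stages
    "LOAD": ["F", "D", "E", "M", "W"],  # load: 5 stages
}

# Assumed alternative: pad R-type with an idle M stage so that every
# instruction reaches W in its fifth cycle.
PADDED_STAGES = {
    "R": ["F", "D", "E", "M", "W"],
    "LOAD": ["F", "D", "E", "M", "W"],
}


def write_port_conflicts(program, stages=STAGES):
    """Return {cycle: [instruction indices]} for cycles with more than one W.

    Instruction i (0-based) is fetched in cycle i + 1.
    """
    writers = defaultdict(list)
    for i, kind in enumerate(program):
        w_cycle = (i + 1) + stages[kind].index("W")
        writers[w_cycle].append(i)
    return {cycle: idx for cycle, idx in writers.items() if len(idx) > 1}


if __name__ == "__main__":
    program = ["R", "LOAD", "R"]
    print(write_port_conflicts(program))                 # prints {6: [1, 2]}
    print(write_port_conflicts(program, PADDED_STAGES))  # prints {}
```

With the stage lengths from the question, the load and the second R-type instruction both reach W in cycle 6, which is the kind of resource conflict the question asks you to name. Padding is only one of the remedies you might evaluate; inserting a stall is another.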

Question 3. (26 marks)
(a) (8 marks) Clearly explain the strengths and shortcomings of the very long instruction word (VLIW) architecture.
(b) (8 marks) The original program code is shown as below.
LOOP: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, #8
      BNEZ R1, LOOP
After loop unrolling, the above code is executed on a VLIW machine as follows.
Memory Ref. 1   | Memory Ref. 2   | FP operation 1    | FP operation 2    | Integer op./branch | Clock
LD F0, 0(R1)    | LD F6, -8(R1)   |                   |                   |                    | 1
LD F10, -16(R1) | LD F14, -24(R1) |                   |                   |                    | 2
LD F18, -32(R1) | LD F22, -40(R1) | ADDD F4, F0, F2   | ADDD F8, F6, F2   |                    | 3
LD F26, -48(R1) |                 | ADDD F12, F10, F2 | ADDD F16, F14, F2 |                    | 4
                |                 | ADDD F20, F18, F2 | ADDD F24, F22, F2 |                    | 5
SD 0(R1), F4    | SD -8(R1), F8   | ADDD F28, F26, F2 |                   |                    | 6
SD -16(R1), F12 | SD -24(R1), F16 |                   |                   |                    | 7

The above diagram is incomplete as it shows the program execution up to cycle 7 only. Copy the above diagram onto your answer book, and complete the program execution of the unrolled program code with the remaining cycle(s) and instruction(s). Besides, identify at least two producer-consumer pairs of instructions in your completed diagram. Lastly, clearly state the total number of cycles per iteration for your completed and unrolled program code on the VLIW machine. It should be noted that you can insert more row(s) at the end of the table in your answer when extra clock cycle(s) is/are required to complete the program execution.
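As an aid for spotting producer-consumer pairs in Question 3(b), here is a small sketch that scans only the seven cycles given in the question. It does not complete the diagram. It assumes an operation can only consume a floating-point register written in an earlier cycle, and it ignores the integer register R1. The names `SCHEDULE`, `dest_and_sources` and `producer_consumer_pairs` are hypothetical.

```python
import re

# The seven clock cycles given in the question, one list of operations each.
SCHEDULE = {
    1: ["LD F0, 0(R1)", "LD F6, -8(R1)"],
    2: ["LD F10, -16(R1)", "LD F14, -24(R1)"],
    3: ["LD F18, -32(R1)", "LD F22, -40(R1)",
        "ADDD F4, F0, F2", "ADDD F8, F6, F2"],
    4: ["LD F26, -48(R1)", "ADDD F12, F10, F2", "ADDD F16, F14, F2"],
    5: ["ADDD F20, F18, F2", "ADDD F24, F22, F2"],
    6: ["SD 0(R1), F4", "SD -8(R1), F8", "ADDD F28, F26, F2"],
    7: ["SD -16(R1), F12", "SD -24(R1), F16"],
}


def dest_and_sources(op):
    """Split an operation into (destination FP register, source FP registers)."""
    mnemonic, rest = op.split(None, 1)
    fregs = re.findall(r"F\d+", rest)
    if mnemonic == "SD":
        return None, fregs  # a store only reads its FP register
    return fregs[0], fregs[1:]


def producer_consumer_pairs(schedule):
    """List ((cycle, producer), (cycle, consumer)) pairs found in the schedule."""
    produced = {}  # FP register -> (cycle, operation) that wrote it
    pairs = []
    for cycle in sorted(schedule):
        # Reads see only results from earlier cycles.
        for op in schedule[cycle]:
            _, sources = dest_and_sources(op)
            for src in sources:
                if src in produced:
                    pairs.append((produced[src], (cycle, op)))
        for op in schedule[cycle]:
            dest, _ = dest_and_sources(op)
            if dest is not None:
                produced[dest] = (cycle, op)
    return pairs


if __name__ == "__main__":
    for (p_cycle, p_op), (c_cycle, c_op) in producer_consumer_pairs(SCHEDULE):
        print(f"cycle {p_cycle}: {p_op}  ->  cycle {c_cycle}: {c_op}")
```

The issue distance between each producer and its consumer in the printed list can guide where the remaining stores, the SUBI and the BNEZ may be placed when you complete the diagram.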
(c) (10 marks) Detail the 5 tips to develop an effective cloud architecture application.
****END OF REVISION EX. 3****
