Solutions to the
Revision Exercise # 3
for
High Performance Computer Architecture

Question 1. (23 marks) A Review on Useful Concepts
Carefully read each question and write down the most appropriate option in your answer book. Each of the first 5 questions carries 3 marks, while each of the last 4 questions carries 2 marks. 1 mark will be deducted for each wrong answer or for multiple options given to a question.
1) B 2) E 3) A 4) B 5) B 6) D 7) A 8) B 9) D

Question 2. (26 marks)
(a) (8 marks) For a pipelined computer system using a branch prediction technique, assume that the technique makes a right guess with a probability of 72%, that the ideal cycles-per-instruction (CPI) for any pipelined execution is always 1, and that a specific assembly program contains 31% of its code as branch instructions. Derive the resulting CPI on the branch instructions only. You have to explicitly state any assumptions used, and clearly show all your calculation steps to arrive at the final answer.
Answer:
Assumptions: Prob. of right guess = 0.72;
Prob. of wrong guess = 1 - 0.72 = 0.28;
Ideal/Normal CPI = 1;
Given: proportion of branch instructions = 0.31;
proportion of other instructions = 0.69; (2 marks)
Impacts: (1 normal cycle + 0 lost cycle) = 1 cycle for a right guess (2 marks)
(1 normal cycle + 1 lost cycle) = 2 cycles for a wrong guess (2 marks)
Resulting CPI on the branch instructions only:
(0.72 * 1 CPI) + (0.28 * 2 CPI) = 1.28 CPI (2 marks)
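The calculation above can be reproduced with a short script. This is only a minimal sketch in Python: the function names branch_cpi and overall_cpi and the penalty_cycles parameter are illustrative assumptions, not part of the exercise. The whole-program CPI printed at the end is an extra step that the question does not ask for.

```python
def branch_cpi(p_correct, base_cpi=1.0, penalty_cycles=1.0):
    """Expected CPI of a branch instruction.

    A right guess costs base_cpi cycles; a wrong guess costs
    base_cpi + penalty_cycles (assumed: 1 lost cycle per wrong guess).
    """
    p_wrong = 1.0 - p_correct
    return p_correct * base_cpi + p_wrong * (base_cpi + penalty_cycles)


def overall_cpi(branch_fraction, cpi_branch, cpi_other=1.0):
    """Weighted CPI over branch and non-branch instructions (not asked in the exercise)."""
    return branch_fraction * cpi_branch + (1.0 - branch_fraction) * cpi_other


cpi_b = branch_cpi(0.72)
print(round(cpi_b, 4))                     # 1.28
print(round(overall_cpi(0.31, cpi_b), 4))  # 1.0868
```

Under the same assumptions, branch_cpi(0.5) gives 1.5, which corresponds to a predictor that is right only half of the time.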
(b) Besides stalls and delayed branch, another possible solution is to predict (branch prediction): (1.5 marks) guess one direction and then back up when the guess is wrong. (1.5 marks)
Diagram (with I: I-fetch; D: Decode; E: Execute; M: Memory; W: Write), for the specific case where the guess/prediction is made on branching to the loop label: (1.5 marks)
Cycle: 1 2 3 4 5 6
sub: I D E M W
beq: I D E
muls: I D E M W
Impact: 0 cycles lost when the guess is correct; 1 cycle lost when the guess is incorrect. Basically, the probability of making the right guess, and thus losing 0 cycles, per branch instruction is 0.5 (or 50%). (1.5 marks)

(c) The diagram:
(2 mark)
The problem: a structural hazard will occur since there is only one write port and two instructions try to write to the register file in the same cycle. (1 mark)
Two possible solutions:
1) Insert a bubble/stall into the pipeline to prevent two writes in the same cycle 7; (1 mark)
2) Delay the R-type instruction's write by one cycle, i.e. a NOOP in the 4th (Mem) stage, so that it aligns with the typical 5-stage load instruction in the pipeline.
Impact on the pipelined execution for Solution 1): no new instruction can be fetched into/started on the pipeline at cycle 7. (2 marks)
Impact on the pipelined execution for Solution 2): only the affected R-type instructions will be delayed/prolonged by 1 cycle with no operation performed. (2 marks)
Clearly, Solution 2) is better (1 mark) since it will have a smaller overall impact on the performance (i.e. a relatively smaller increase in the CPI) of the pipelined execution than Solution 1). (2 marks)
(1 mark)
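The write-port conflict and the effect of padding R-type instructions can be illustrated with a small simulation. This is a sketch under stated assumptions: the stage layouts (4 stages for R-type, 5 for load), the one-instruction-per-cycle issue model, and the four-instruction sequence are all hypothetical, chosen so that the clash lands on cycle 7 as in the answer above. They are not taken from the exercise's diagram.

```python
from collections import defaultdict

# Assumed stage layouts: loads use 5 stages, R-type instructions only 4.
STAGES = {
    "load": ["I", "D", "E", "M", "W"],
    "rtype": ["I", "D", "E", "W"],
}
# Solution 2: pad R-type with a no-op in the Mem slot so every write is in stage 5.
PADDED_STAGES = {**STAGES, "rtype": ["I", "D", "E", "NOOP", "W"]}


def write_cycles(program, stages):
    """Map each cycle to the instructions (by index) that write the register file.

    Instruction k (0-based) is fetched in cycle k + 1; there are no stalls.
    """
    writes = defaultdict(list)
    for k, kind in enumerate(program):
        w_cycle = (k + 1) + stages[kind].index("W")
        writes[w_cycle].append(k)
    return dict(writes)


def collisions(program, stages):
    """Cycles in which more than one instruction needs the single write port."""
    return {c: idx for c, idx in write_cycles(program, stages).items() if len(idx) > 1}


program = ["rtype", "rtype", "load", "rtype"]  # hypothetical instruction sequence
print(collisions(program, STAGES))         # {7: [2, 3]}
print(collisions(program, PADDED_STAGES))  # {}
```

With the padded layout every instruction writes exactly four cycles after its fetch, so two instructions fetched in different cycles can never reach the write port together.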

Question 3. (26 marks)
(a) (8 marks) Clearly explain the strengths and shortcomings of the very long instruction word (VLIW) architecture.
Answer:
(1 mark @ point of the clear description of strengths/shortcomings, total = 8 marks)
The strengths/pros of the very long instruction word (VLIW) architecture include:
Very simple hardware: no dependency detection; simple issue logic; just ALUs and register files
Potentially exploits large amounts of ILP
The shortcomings/cons of the very long instruction word (VLIW) architecture include:
Lockstep execution (static schedule): very sensitive to long-latency operations (cache misses)
Global register file hard to build
Lots of NO-Ops: poor code density; I-cache capacity and bandwidth compromised
Must recompile sources to deliver potential
Implementation visible through ISA
(b) (8 marks) The original program code is shown below.
LOOP: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, #8
      BNEZ R1, LOOP
After loop unrolling, the above code is executed on a VLIW machine as follows.
Memory Ref. 1   | Memory Ref. 2   | FP operation 1    | FP operation 2    | Integer op./branch | Clock
LD F0, 0(R1)    | LD F6, -8(R1)   |                   |                   |                    | 1
LD F10, -16(R1) | LD F14, -24(R1) |                   |                   |                    | 2
LD F18, -32(R1) | LD F22, -40(R1) | ADDD F4, F0, F2   | ADDD F8, F6, F2   |                    | 3
LD F26, -48(R1) |                 | ADDD F12, F10, F2 | ADDD F16, F14, F2 |                    | 4
                |                 | ADDD F20, F18, F2 | ADDD F24, F22, F2 |                    | 5
SD 0(R1), F4    | SD -8(R1), F8   | ADDD F28, F26, F2 |                   |                    | 6
SD -16(R1), F12 | SD -24(R1), F16 |                   |                   |                    | 7


The above diagram is incomplete as it shows the program execution up to cycle 7 only. Copy the above diagram onto your answer book, and complete the program execution of the unrolled program code with the remaining cycle(s) and instruction(s). Besides, identify at least two producer-consumer pairs of instructions in your completed diagram. Lastly, clearly state the total number of cycles per iteration for your completed and unrolled program code on the VLIW machine. It should be noted that you can insert more row(s) at the end of the table in your answer when extra clock cycle(s) is/are required to complete the program execution.
Answer: (1 mark @ highlighted detail in the table, 1 mark @ producer-consumer pair of instructions, 1 mark for correct cycles per iter, total = 8 marks)
Memory Ref. 1   | Memory Ref. 2   | FP operation 1    | FP operation 2    | Integer op./branch | Clock
LD F0, 0(R1)    | LD F6, -8(R1)   |                   |                   |                    | 1
LD F10, -16(R1) | LD F14, -24(R1) |                   |                   |                    | 2
LD F18, -32(R1) | LD F22, -40(R1) | ADDD F4, F0, F2   | ADDD F8, F6, F2   |                    | 3
LD F26, -48(R1) |                 | ADDD F12, F10, F2 | ADDD F16, F14, F2 |                    | 4
                |                 | ADDD F20, F18, F2 | ADDD F24, F22, F2 |                    | 5
SD 0(R1), F4    | SD -8(R1), F8   | ADDD F28, F26, F2 |                   |                    | 6
SD -16(R1), F12 | SD -24(R1), F16 |                   |                   |                    | 7
SD -32(R1), F20 | SD -40(R1), F24 |                   |                   | SUBI R1, R1, #48   | 8
SD -0(R1), F28  |                 |                   |                   | BNEZ R1, LOOP      | 9
(5 marks)
The two producer-consumer pairs of instructions include:
i) producer: LD F18, -32(R1); consumer: ADDD F20, F18, F2
ii) producer: ADDD F20, F18, F2; consumer: SD -32(R1), F20
(2 marks)
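The completed schedule can be cross-checked with a short script that lists every producer-consumer pair on the FP registers and counts cycles per result. This is a minimal sketch: the dictionary layout, the helper names (fp_dest, fp_sources, producer_consumer_pairs) and the slot-utilisation figure are illustrative assumptions. Only the instructions themselves come from the table.

```python
import re

SLOTS_PER_WORD = 5  # 2 memory refs + 2 FP ops + 1 integer/branch slot per VLIW word

# Clock cycle -> operations issued in that cycle (copied from the completed table).
schedule = {
    1: ["LD F0, 0(R1)", "LD F6, -8(R1)"],
    2: ["LD F10, -16(R1)", "LD F14, -24(R1)"],
    3: ["LD F18, -32(R1)", "LD F22, -40(R1)", "ADDD F4, F0, F2", "ADDD F8, F6, F2"],
    4: ["LD F26, -48(R1)", "ADDD F12, F10, F2", "ADDD F16, F14, F2"],
    5: ["ADDD F20, F18, F2", "ADDD F24, F22, F2"],
    6: ["SD 0(R1), F4", "SD -8(R1), F8", "ADDD F28, F26, F2"],
    7: ["SD -16(R1), F12", "SD -24(R1), F16"],
    8: ["SD -32(R1), F20", "SD -40(R1), F24", "SUBI R1, R1, #48"],
    9: ["SD -0(R1), F28", "BNEZ R1, LOOP"],
}


def fp_dest(op):
    """FP register written by an LD or ADDD, otherwise None."""
    m = re.match(r"(LD|ADDD)\s+(F\d+)", op)
    return m.group(2) if m else None


def fp_sources(op):
    """FP registers read by an ADDD (its two operands) or an SD (the stored value)."""
    regs = re.findall(r"F\d+", op)
    if op.startswith("ADDD"):
        return regs[1:]
    if op.startswith("SD"):
        return regs
    return []


def producer_consumer_pairs(sched):
    """Return (producer, consumer, cycles_between) for each FP value passed along."""
    produced, pairs = {}, []
    for cycle in sorted(sched):
        for op in sched[cycle]:  # reads see values produced in earlier cycles
            for reg in fp_sources(op):
                if reg in produced:
                    p_cycle, p_op = produced[reg]
                    pairs.append((p_op, op, cycle - p_cycle))
        for op in sched[cycle]:
            dest = fp_dest(op)
            if dest:
                produced[dest] = (cycle, op)
    return pairs


pairs = producer_consumer_pairs(schedule)
total_ops = sum(len(ops) for ops in schedule.values())
results = sum(op.startswith("SD") for ops in schedule.values() for op in ops)
cycles = max(schedule)

print(len(pairs))                                        # 14
print(round(cycles / results, 1))                        # 1.3
print(round(total_ops / (cycles * SLOTS_PER_WORD), 2))   # 0.51
```

In this schedule every LD is consumed two cycles later by its ADDD and every ADDD three cycles later by its SD. About half of the issue slots stay empty, which is the NO-Op cost listed among the VLIW shortcomings in part (a).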
The total number of cycles per iteration for your completed and unrolled program code on the VLIW machine:
Unrolled 7 times (for 7 results) in 9 clock cycles = 9/7 ≈ 1.3 clocks per iteration (1 mark)
(c) (10 marks) Detail the 5 tips to develop an effective cloud architecture application.
Answer: (2 marks @ point, total = 10 marks)
The 5 tips to develop an effective cloud architecture application are detailed as follows.

Ensure that your application is scalable by designing each component to be scalable on its own. If every component implements a service interface, responsible for its own scalability in all appropriate dimensions, the overall system will have a scalable base.
For better manageability and high-availability, make sure that your components are loosely coupled. The key is to build components without having tight dependencies between each other, so that if one component were to die (fail), sleep (not respond) or remain busy (slow to respond) for some reason, the other components in the system are built so as to continue to work as if no failure is happening.
Implement parallelization for better use of the infrastructure and for performance. Distributing the tasks on multiple machines, multithreading your requests and effective aggregation of results obtained in parallel are some of the techniques that help exploit the infrastructure.
After designing the basic functionality, use techniques and approaches that will ensure resilience. If any component fails (and failures happen all the time), the system should automatically alert, failover, and re-sync back to the last known state as if nothing had failed.
Don't forget the cost factor. The key to building a cost-effective application is using on-demand resources in your design. It's wasteful to pay for infrastructure that is sitting idle.
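The parallelization and resilience tips above can be sketched in a few lines. This is only an illustration in Python using the standard concurrent.futures module: process_chunk is a hypothetical stand-in for a request to an independent component, and a thread pool on one machine stands in for tasks distributed over multiple machines.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_chunk(chunk):
    """Hypothetical worker: stands in for a call to an independent component."""
    if not chunk:
        raise ValueError("empty chunk")  # simulated component failure
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(chunks, max_workers=4):
    """Fan the chunks out to a pool, then aggregate the partial results.

    A failed chunk is recorded and skipped so the other tasks keep working.
    """
    total, failed = 0, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_chunk, c): i for i, c in enumerate(chunks)}
        for fut in as_completed(futures):
            try:
                total += fut.result()
            except Exception:
                failed.append(futures[fut])  # remember which chunk to retry
    return total, sorted(failed)


total, failed = parallel_sum_of_squares([[1, 2], [3], [], [4, 5]])
print(total, failed)  # 55 [2]
```

Returning the indices of failed chunks, rather than aborting, lets a caller retry only those chunks, which is one simple way to keep the components loosely coupled.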
END OF REV. EX. #3 SOL.