High Performance Computer Architecture
Question 1.
Determine whether each of the following statements is (T)rue or (F)alse.
(i) Loop unrolling always requires fewer registers than executing the original loop. T F
(ii) The time required to fill and drain a pipeline reduces the speedup. T F
(iii) Recompilation of the source is usually needed for superscalar architectures. T F
(iv) Clock rate reduction may not help reduce power consumption. T F
(v) Data hazards occur when an instruction depends on the result of the next instruction that is still in the pipeline. T F
Question 2.
[A Comparison of Scoreboard & Tomasulo Computers] Assume the latency characteristics for producer-consumer instruction pairs are as given below:
Instruction producing result    Instruction using result          Latency
----------------------------    ------------------------------    -------
Floating-point ALU op.          Another floating-point ALU op.    5
Floating-point ALU op.          STORE floating-point              3
LOAD floating-point             Floating-point ALU op.            2
LOAD floating-point             STORE floating-point              1
Consider the following code sequence:
addd  f7, f3, f4    ; f3 + f4 -> f7
multd f6, f7, f10   ; f7 * f10 -> f6
multd f8, f7, f10   ; 2nd multiply instruction
when executed on the Scoreboard-based computer as compared to the same program fragment executed on the Tomasulo-based computer. Carefully answer the following questions. In all the questions, you can simply assume that both the Scoreboard and Tomasulo computers have a sufficient number of functional units, including at least one floating-point adder and two floating-point multipliers, for the execution of the above program fragment. Besides, you can assume that the number of cycles required for the EXEC stage of addd or multd is 5.
When the addd instruction has already entered its EXEC stage and is computing the value of f7, clearly state whether the two subsequent multd instructions can be issued or not for the execution of the above program fragment on EACH of the Scoreboard and Tomasulo computers. In addition, for each yes or no answer you provide for each specific computer, give a short and concise explanation to justify your answer. For example, here is a sample format for your answer:
1) for Scoreboard: No, the two instructions cannot be issued BECAUSE ...
2) for Tomasulo: Yes, the two instructions can be issued BECAUSE ...
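As an aid to reading the latency table in Question 2, the following is a minimal sketch, not part of the original paper. It assumes that "latency" means the number of intervening cycles needed between a producer and its consumer to avoid a stall, and that one instruction issues per cycle. The key names (`fp_alu`, `fp_load`, `fp_store`) and the function `stall_cycles` are hypothetical, introduced only for illustration.

```python
# Hypothetical encoding of the Question 2 latency table.
# Keys are (producer, consumer) pairs; the names are illustrative only.
LATENCY = {
    ("fp_alu", "fp_alu"): 5,
    ("fp_alu", "fp_store"): 3,
    ("fp_load", "fp_alu"): 2,
    ("fp_load", "fp_store"): 1,
}


def stall_cycles(producer, consumer, distance=1):
    """Return the stalls needed when `consumer` issues `distance`
    instructions after `producer`.

    Assumption: one instruction issues per cycle, so `distance - 1`
    independent instructions sit between the pair and each one hides
    one cycle of latency. Pairs not in the table are treated as
    having zero latency.
    """
    latency = LATENCY.get((producer, consumer), 0)
    return max(0, latency - (distance - 1))


if __name__ == "__main__":
    # Back-to-back dependent FP ALU operations expose the full latency.
    print(stall_cycles("fp_alu", "fp_alu"))
    # One independent instruction in between hides one cycle.
    print(stall_cycles("fp_alu", "fp_alu", distance=2))
```

This models only the producer-consumer latencies of the table; it does not model the issue rules of the Scoreboard or Tomasulo computers that the question asks about.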
Question 3.
Answer these compulsory questions. Carefully read each of the following 10 short questions and write down the most appropriate option.
1) Loop unrolling can help to
A. minimize the risk of control hazard by adding more registers;
B. minimize the risk of data hazard by adding more registers;
C. minimize stalls by increasing the size of a basic block for rescheduling;
D. minimize stalls by reducing the size of a basic block for rescheduling;
E. minimize the latency of control hazard by reducing the total number of NO-OPs.
2) One of the major advantages of the VLIW is
A. the global register file that is difficult to build;
B. very simple hardware;
C. the lockstep execution;
D. lots of NO-OPs;
E. none of the above.
3) The potential speedup of a pipelined system can be reduced by
A. increasing the number of pipeline stages;
B. increasing the largest difference between the lengths of pipeline stages;
C. increasing the number of instructions entered into the pipeline;
D. increasing the number of registers available in the pipelined system;
E. none of the above.
4) Software pipelining reorganizes loops so that each software-pipelined iteration is made from instructions of
A. the same iteration of the original loop;
B. randomly selected iterations of the original loop;
C. the same iteration of different loops in the same program;
D. completely different iterations of the original loop;
E. none of the above.
5) Suppose a previous instruction k is already in the instruction pipeline and instruction j is to be issued. A WAR hazard occurs when there exists a common register p in
A. Rreg(j) ∩ Rreg(k);
B. Rreg(k) ∩ Wreg(j);
C. Rreg(j) ∩ Wreg(k);
D. Wreg(k) ∩ Wreg(j);
E. none of the above;
where Rreg(j) and Rreg(k) denote the sets of registers to be read by instructions j and k respectively, and Wreg(j) and Wreg(k) denote the sets of registers to be written by instructions j and k respectively.
6) Which of the following hazards can be resolved in the issue (ID2) stage of the Scoreboard approach?
A. control hazards;
B. WAR hazards;
C. RAW hazards;
D. WAW hazards;
E. structural hazards.
7) Which of the following is NOT a pitfall of the in-order execution due to static scheduling?
A. a relatively long time to reschedule code at run-time;
B. binaries need to be recompiled for each new processor implementation;
C. a small number of named registers always becomes a bottleneck;
D. an unexpectedly long latency blocks ready instructions from executing since scheduled code can never be changed;
E. none of the above.
8) WAR hazard can be resolved by
A. stalls only;
B. forwarding or stalls;
C. fetching operands early in the decode stage;
D. performing all write-backs in order;
E. out-of-order execution.
9) Loop-carried dependence prohibits
A. the formation of a very large basic block for code rescheduling;
B. data parallelism;
C. the formation of a very small basic block for code rescheduling;
D. loop unrolling;
E. instruction-level parallelism.
10) Register renaming of Tomasulo's approach is used to avoid
A. control hazard;
B. RAW hazard;
C. structural hazard;
D. WAR hazard;
E. the heavy reliance on reservation stations.
Question 4.
(a) Name the hazard that arises when the following program fragment is executed on a 5-stage pipeline. In addition, clearly explain ALL possible solutions and their impacts with the aid of execution diagrams.
An example of an execution diagram is given below:
(where I: I-fetch; D: Decode; E: Execute; M: Memory; W: Write)
(b) The following program fragment produces 4 cycles of stalls when executed on a specific pipelined computer. Without loop unrolling, clearly show the rescheduled code that minimizes the total number of stalls, and explicitly state the resulting number of stall(s).
Question 5.
(a) The following is the status of the Tomasulo computer at the end of cycle 10. Show its status (including the Instruction Status, Reservation Stations, and Register Result Status) at the end of cycle 11. Besides, clearly explain the main reason(s) for the difference in the result obtained for the concerned instruction ADDD on the Tomasulo approach when compared to that of the scoreboard approach.
(b) With dynamic instruction scheduling, the scoreboard and Tomasulo computers perform out-of-order execution. However, using static scheduling, most computers can only perform in-order execution, which is known to be problematic. Discuss the major pitfalls of in-order execution.
(c) The Tomasulo approach replaces the normal 5-stage pipeline with 4 stages: Fetch, Issue, Execute, and Writeback. One of its strengths is that it can perform out-of-order Execute and Writeback operations. However, the Fetch and Issue stages of the Tomasulo approach always handle instructions in program order. Why?
Question 6.
(a) In the extended DLX pipeline shown below,
the branch condition of any branch instruction is determined in the Ex1 stage. With the aid of an execution diagram, clearly show the number of branch delay slot(s) for any branch instruction executed on this extended DLX pipeline. Moreover, assume a branch prediction technique is used with 66% of its guesses correct on a particular program in which 37% of all instructions are branch instructions. Show the detailed steps to arrive at the estimated number of cycles per instruction (CPI) of the whole program when executed on this extended DLX pipeline.
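The general shape of the CPI estimate asked for in part (a) can be sketched as follows. This is a minimal illustration under stated assumptions, not a model answer: it assumes an ideal base CPI of 1, no cost for correctly predicted branches, and a fixed penalty for each misprediction. The function `estimated_cpi` and its parameter names are hypothetical. The penalty must be read off the pipeline diagram, which is not reproduced here, so the value used in the usage line is a placeholder only.

```python
def estimated_cpi(branch_fraction, prediction_accuracy, penalty_cycles,
                  base_cpi=1.0):
    """Estimate CPI when only mispredicted branches add stall cycles.

    Assumptions (not stated in the question itself):
      * every non-stalled instruction costs `base_cpi` cycles;
      * a correctly predicted branch costs nothing extra;
      * each mispredicted branch costs `penalty_cycles` extra cycles.
    """
    mispredict_rate = 1.0 - prediction_accuracy
    stall_cycles_per_instruction = (
        branch_fraction * mispredict_rate * penalty_cycles
    )
    return base_cpi + stall_cycles_per_instruction


if __name__ == "__main__":
    # 37% branches and 66% prediction accuracy come from the question;
    # penalty_cycles=2 is a placeholder, not derived from the pipeline.
    print(estimated_cpi(0.37, 0.66, penalty_cycles=2))
```

Keeping the penalty as a parameter makes the dependence on the pipeline explicit: the later the branch condition is resolved, the larger the penalty and the higher the estimated CPI.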
(b) The following diagram gives the status of the scoreboard at the end of cycle 8 for the sequence of instructions listed at the upper left corner.
Show the status of the scoreboard (including the Instruction Status, Functional Unit Status, and Register Result Status) at the end of cycles 9, 20, and 21 respectively.
Question 7.
(a) The following is the original code of LoopA. Give a scheduled version of LoopA to eliminate as many stalls as possible in the original code when being executed on the above pipeline. Besides, you should show the total number of clock cycles required to execute the original code of LoopA against that of your scheduled version on the above pipeline to support your answer.
(b) The unrolled version of the original/unscheduled code for many loops, including the LoopA example in (a), is often slower than or the same as the scheduled version of the original loops in terms of the number of clock cycles required per iteration. Yet loop unrolling is frequently performed early in most compilation processes. Explain why.
(c) Detail the 4 phases of the GrepTheWeb application as an example of a cloud architecture.