Instructions:
COMP 4510 Assignment 3
Due: November 15, 2019
This assignment should be completed on the Mercury machines using MPI and OpenMP. The MPI and OpenMP Programming Environment document on the course website describes how to log into these machines and use them; please read it before you begin.
You will be expected to follow the Assignment Guidelines, which are available in the assignments folder on the course website. This document describes how to organize and hand in your code. Please review it before you begin. Note that a failure to follow these standards will result in a loss of marks.
Assignment 3 is intended to help you become familiar with writing advanced MPI programs and introduce you to writing OpenMP programs.
You may place the written part from the programming questions (Part I) in the same document file you used for Part II.
Total: 100 marks. This assignment is worth 20% of the total assignment marks.
Part I: Programming Questions (80 marks)
1. (20 marks) We discussed the boundary value problem in class. The notes and the article are on the course web page under 4PCAM03BVP. Using an SPMD style of programming (not master/slave), develop a parallel version of the problem and implement it in MPI.
Assume a block data distribution. The rod is of size 1. The rod is partitioned into n sections, each of size k. Time is partitioned into m sections, each of size h. The computations end at time t = 5. The temperature at a given point and time is calculated using the formulas provided in class (see the notes on the course web site). Your program should print out the values it computes for each time step.
Your program should work for any value of k and h. As an experiment, use your program to fill in the execution-time cells in the table below, where q is the number of processes.
k       h      Execution Time
0.1     1.0    q = 2     q = 4     q = 8
0.01    0.5    q = 4     q = 8     q = 16
0.001   0.25   q = 8     q = 16
2. (20 marks) Consider Dijkstra's shortest path algorithm, which we discussed in class. The notes and the article are on the course web page under 4PCAM05DijkstraAlgorithm. Assume a weighted adjacency matrix, A, where all edge weights are positive. Consider a block partition of columns: partition A into blocks of n/q columns, where n is the number of nodes in the graph and q is the number of processes. Distribute n/q columns to each process. Each process then executes Dijkstra's algorithm on its own local column data. Implement the algorithm in MPI. Note: understand why we use a column partition and not a row partition; this will make programming easier.
Again, your program should work for any n and q. As an experiment, check your parallel program on the graph given below as a weighted adjacency matrix. V0 is the source. Check the correctness of your program for q = 5, q = 3, and q = 2 on the graph below with n = 6.
Adjacency matrix (blank cells in the original denote no edge, written here as ∞):

        V0   V1   V2   V3   V4   V5
V0       0    2    5    ∞    ∞    ∞
V1       ∞    0    7    1    ∞    8
V2       ∞    ∞    0    4    ∞    ∞
V3       ∞    ∞    ∞    0    3    ∞
V4       ∞    ∞    2    ∞    0    3
V5       ∞    5    ∞    2    4    0
3. (20 marks) The all-pairs shortest path problem is the problem of finding the cost of the shortest path between every pair of vertices in a weighted, directed graph. In this algorithm, the weighted graph is stored in an adjacency matrix, A. The value of matrix element (i, j) in the initial adjacency matrix is the weight of the edge from vertex i to vertex j. If there is no edge between two vertices, the corresponding adjacency matrix cell is assigned infinity. In the final solution matrix, the value in cell (i, j) represents the length of the shortest path from vertex i to vertex j. The pseudocode is given below.
Floyd's Algorithm
Input: n, the number of vertices; A[0..n-1, 0..n-1], the adjacency matrix
Output: A, which contains the shortest path lengths.

for k := 0 to n-1
    for i := 0 to n-1
        for j := 0 to n-1
            A[i,j] := min(A[i,j], A[i,k] + A[k,j])
For the adjacency matrix above in Question 2, the solution of the all-pairs shortest path problem is shown below.

Solution matrix:

        V0   V1   V2   V3   V4   V5
V0       0    2    5    3    6    9
V1       ∞    0    6    1    4    7
V2       ∞   15    0    4    7   10
V3       ∞   11    5    0    3    6
V4       ∞    8    2    5    0    3
V5       ∞    5    6    2    4    0
i. (5 marks) Suppose we want to parallelize the algorithm using a parallel for pragma. We could place the pragma such that:
a) only the outermost loop is parallelized, or
b) only the second (middle) loop is parallelized, or
c) only the third (innermost) loop is parallelized.
Discuss the pros and cons of using the parallel for pragma in each case. Use the number of fork and join steps, synchronization steps, etc. as a basis for your discussion.
ii. (5 marks) Use OpenMP directives to implement a parallel version of Floyd's algorithm using #pragma omp parallel for. Implement the placement you think will produce the best performance, based on your discussion in i.
iii. (5 marks) Benchmark your program for different values of n (the number of vertices) and t (the number of threads). You may use a random number generator to produce the values in A. Try at least 3 values of n (n = 32, 64, 128) and try t = 4, 8, and 16 threads.
iv. (5 marks) For a fixed problem size of n = 128 vertices, vary the chunk size for the different scheduling policies and output the execution time of the algorithm. Fill in the table below. Explain your findings.
Chunk size     1    2    4    8    16   32   64
static
dynamic
guided
4. (20 marks) Consider the sequential program below. This program approximates the value of π using a mathematical formula known as Simpson's rule.
#include <stdio.h>
#include <stdlib.h>

#define n 2048

double f(int i) {
    double x;
    x = (double) i / (double) n;
    return 4.0 / (1.0 + x * x);
}

int main(int argc, char *argv[]) {
    double area;

    area = f(0) + f(n);
    for (int i = 1; i <= n / 2; i++) {
        area += 4.0 * f(2*i - 1) + 2.0 * f(2*i);
    }
    area /= 3.0 * n;
    printf("Approximation of pi: %13.11f\n", area);
    return EXIT_SUCCESS;
}
i. (10 marks) Implement this algorithm in OpenMP using an SPMD (#pragma omp parallel) style of programming. Set the number of threads to the maximum number of allowable threads per node. Try to avoid any redundant computations when parallelizing. Note: increase n if you do not see any significant change in the execution time, or decrease n if the execution takes a significant amount of time.
ii. (10 marks) The table below lists various combinations of scheduling policies and chunk sizes. For each cell in the table, run your program with the given settings and record the execution time. Comment on the results. As the chunk size increases, which scheduling strategy is better?
Chunk size     1    2    4    8    16   32
static
dynamic
guided
Part II: Written Questions (20 marks)
5. (6 marks) These questions are on the Nehalem microarchitecture. Go to this link for reference:
https://www.ni.com/en-ca/innovations/white-papers/10/top-eight-features-of-the-intel-core-i7-processors-for-test-mea.html#section2072810095
a. What is Intel QuickPath Interconnect (QPI)?
b. Which type of shared memory architecture did QPI support?
c. What is the purpose of Intel Turbo Boost?
d. Which is the shared cache (L1, L2, or L3) on the chip? Why is L3 called a smart cache?
e. Explain the operation of the L3 cache.
f. What is hyperthreading? Why did Intel remove the hyperthreading feature in previous generations and bring it back again in newer generations?

6. (4 marks) The distributed caches in Haswell-EP and Broadwell-based systems are kept coherent using snooping-based protocols. Explain briefly the following different snoop modes available on these systems. Cite your sources.
a. Early Snoop mode
b. Home Snoop mode
c. Home Snoop with Directory mode
d. Cluster on Die mode

7. (10 marks) This question focuses on the paper "Designing next-generation massively multithreaded architectures for irregular applications," by Tumeo, Secchi, et al., provided under Assignments/A3. Read through the paper and briefly answer the following questions:
a. Define the characteristics of an irregular application.
b. Why are multilevel cache hierarchies ineffective for irregular applications?
c. What special characteristics do multithreaded architectures possess that make them effective for irregular applications?
d. The Cray XMT's design was based on three pillars. What are they?
e. What interconnection network is used in the Cray XMT?
f. The paper compares GPU architectures to the Cray XMT and gives some drawbacks of the GPU compared to the Cray XMT. Summarize the advantages of the Cray XMT and the disadvantages of the GPU as discussed by the authors.
g. What is temporal multithreading? Which kind of temporal multithreading is used by the Cray XMT, and which kind is used by the GPU?
h. Is simultaneous multithreading different from temporal multithreading? Explain. (A paper on simultaneous multithreading is placed on the course web page.)
i. How many hardware threads are executed in the ThreadStorm processor?
j. Discuss the BFS results. What do the authors conclude?
