print_msg.c is an example program. You can refer to it for thread creation, joining, and passing arguments to threads.
To compile it, type: gcc -lrt -lpthread -o go print_msg.c
To run it, type: ./go
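If you have not used Pthreads before, the following minimal sketch (an illustration with made-up names, not the contents of the provided file) shows the three things the example is meant to demonstrate: creating threads, passing an argument to each, and joining them.

```c
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define NUM_THREADS 4

/* Each thread receives a pointer to its own id and prints a message. */
static void *print_msg(void *arg) {
    int id = *(int *)arg;
    printf("Hello from thread %d\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;                        /* each thread gets its own argument */
        if (pthread_create(&threads[i], NULL, print_msg, &ids[i]) != 0) {
            perror("pthread_create");
            exit(EXIT_FAILURE);
        }
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);    /* wait for all threads to finish */
    return 0;
}
```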
1 Parallel Matrix Multiplication
1.1 Local Output Variables
Parallelize the naive matrix multiplication which you implemented in PHW 1 using Pthreads. One simple way to parallelize the algorithm is to evenly partition and distribute the computation of the elements of the output matrix among the threads and let the threads work independently in parallel. Note that each thread (i,j) is responsible for computing output block C(i,j). For example, assuming the matrix size is n × n and the number of threads is p (arranged as a √p × √p grid), let thread #(0,0) compute the elements of rows 0 to n/√p - 1 and columns 0 to n/√p - 1 of the output matrix, thread #(0,1) compute the elements of rows 0 to n/√p - 1 and columns n/√p to 2n/√p - 1 of the output matrix, and so on.
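One simple realization of this partitioning is sketched below. It is only a sketch under stated assumptions: n = 4096, double-precision global matrices A, B, and C, √p (called q here) read from argv[1], and n divisible by q. The matrix element type, the PHW 1 initialization, and the timing code are not specified here and belong to your own program.

```c
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define N 4096                             /* 4K x 4K matrices, as required */

static double A[N][N], B[N][N], C[N][N];   /* assumed element type: double */
static int q;                              /* q = sqrt(p), from argv[1] */

typedef struct { int bi, bj; } block_t;    /* block-row and block-column of a thread */

/* Thread (bi, bj) computes its own N/q x N/q block of C.  No two threads
 * write the same output element, so no locking is needed in this version. */
static void *compute_block(void *arg) {
    block_t *b = (block_t *)arg;
    int bs = N / q;                        /* block size n / sqrt(p); assumes N % q == 0 */
    int r0 = b->bi * bs, c0 = b->bj * bs;
    for (int i = r0; i < r0 + bs; i++)
        for (int j = c0; j < c0 + bs; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    return NULL;
}

int main(int argc, char *argv[]) {
    if (argc < 2) return 1;
    q = atoi(argv[1]);                     /* sqrt(p) passed on the command line */

    /* ... initialize A and B exactly as in PHW 1 ... */

    pthread_t tid[q][q];
    block_t arg[q][q];
    /* Measure the execution time around this create/join region. */
    for (int i = 0; i < q; i++)
        for (int j = 0; j < q; j++) {
            arg[i][j].bi = i;
            arg[i][j].bj = j;
            pthread_create(&tid[i][j], NULL, compute_block, &arg[i][j]);
        }
    for (int i = 0; i < q; i++)
        for (int j = 0; j < q; j++)
            pthread_join(tid[i][j], NULL);

    printf("C[100][100] = %f\n", C[100][100]);
    return 0;
}
```

With q = 1 this degenerates to a single thread doing the full multiplication, which is consistent with using your serial PHW 1 measurement for the p = 1 data point.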
- The matrix initialization is the same as that in PHW 1. The matrix size is 4K × 4K. Print out the execution time and the value of C[100][100] in your program (see the timing sketch after Figure 1).
- Pass the square root of the total number of threads, √p, as a command line parameter [1].
- Report the execution time for p = 1 (the serial version you did in PHW 1), 4, 16, 64, and 256. Draw a diagram showing the execution time under the various values of p. An example diagram is shown in Figure 1. Briefly discuss your observations.
- In your report, discuss how you partition the computation of the elements of the output matrix among the threads.
Figure 1: Example diagram
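The -lrt flag in the compile line suggests the POSIX clock routines, so one reasonable way (an assumption, not a mandated API) to measure the execution time is clock_gettime with CLOCK_MONOTONIC around the thread create/join region:

```c
#include <time.h>

/* Returns the current wall-clock time in seconds. */
static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Usage around the parallel region:
 *   double t0 = now_sec();
 *   ... create and join the threads ...
 *   printf("execution time = %f s\n", now_sec() - t0);
 */
```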
1.2 Shared Output Variables
We now use another method, discussed in Lecture 4 slides 35-41, to parallelize the algorithm: evenly partition and distribute the computation of the elements of one input matrix among the threads and let the threads work in parallel. Note that each thread (i,j) accesses and owns input block A(i,j), and is responsible for updating all output blocks C(i,k) via multiplication with B(j,k), where k ranges over the √p block columns. For example, assuming the matrix size is n × n and the number of threads is p (arranged as a √p × √p grid): thread #(0,0) multiplies the elements of rows 0 to n/√p - 1 and columns 0 to n/√p - 1 of input matrix A by the elements of rows 0 to n/√p - 1 of input matrix B (contributing to the elements of rows 0 to n/√p - 1 of the output matrix C); thread #(0,1) multiplies the elements of rows 0 to n/√p - 1 and columns n/√p to 2n/√p - 1 of input matrix A by the elements of rows n/√p to 2n/√p - 1 of input matrix B (also contributing to the elements of rows 0 to n/√p - 1 of the output matrix C); and so on. (Hint: you will use a mutex in this problem to handle multiple threads manipulating the same row of the output matrix C.)
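A hedged sketch of the per-thread work for this version is given below. It reuses the same globals as the Section 1.1 sketch (A, B, C of size N × N and q = √p, declared extern here), assumes C is zero-initialized before the threads start, and protects the shared output with one mutex per block-row of C. The locking granularity (per row, per block, per element) is a design choice, not something the assignment fixes.

```c
#include <pthread.h>

#define N 4096
extern double A[N][N], B[N][N], C[N][N];   /* same globals as the previous sketch */
extern int q;                              /* q = sqrt(p) */

/* One lock per block-row of C; q <= 16 for p <= 256.  Initialize each
 * entry with pthread_mutex_init in main before creating the threads. */
static pthread_mutex_t row_lock[16];

typedef struct { int bi, bj; } block_t;

/* Thread (bi, bj) owns input block A(bi, bj) and adds its partial products
 * into every output block C(bi, k), k = 0 .. q-1.  All threads in block-row
 * bi update the same rows of C, hence the mutex. */
static void *multiply_owned_block(void *arg) {
    block_t *b = (block_t *)arg;
    int bs = N / q;
    int r0 = b->bi * bs;                   /* rows of A(bi,bj) and of C(bi,*) */
    int c0 = b->bj * bs;                   /* columns of A(bi,bj) = rows of B(bj,*) */

    for (int k = 0; k < q; k++) {          /* every block column of B and C */
        int k0 = k * bs;
        pthread_mutex_lock(&row_lock[b->bi]);
        for (int i = r0; i < r0 + bs; i++)
            for (int j = k0; j < k0 + bs; j++) {
                double partial = 0.0;
                for (int l = c0; l < c0 + bs; l++)
                    partial += A[i][l] * B[l][j];
                C[i][j] += partial;        /* shared with other threads in row bi */
            }
        pthread_mutex_unlock(&row_lock[b->bi]);
    }
    return NULL;
}
```

How much work is done while a lock is held, and how fine-grained the locks are, largely determine how this version compares with Section 1.1; accumulating into a private buffer and locking only for the final addition is one common refinement.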
- The matrix size is 4K × 4K. Print out the execution time and the value of C[100][100] in your program.
- Pass the square root of the total number of threads, √p, as a command line parameter [1].
- Report the execution time for p = 1 (the serial version you did in PHW 1), 4, 16, 64, and 256. Draw a diagram showing the execution time under the various values of p. An example diagram is shown in Figure 1. Briefly discuss your observations.
- In your report, discuss how you partition the computation of the elements of the output matrix among the threads.
- In your report, compare the execution time of Problem 1.2 with Problem 1.1 and explain any differences.
2 Parallel K-Means
In PHW 1, you implemented the K-Means algorithm. In this problem, you will parallelize your implementation using Pthreads. Let p denote the number of threads you create. The parallel version of the K-Means algorithm has the following steps:
1. Initialize a mean value for each cluster.
2. Partition and distribute the data elements among the threads. Each thread is responsible for a subset of the data elements.
3. Each thread assigns its data elements to the corresponding clusters. Each thread also locally keeps track of the number of data elements assigned to each cluster in the current iteration and the corresponding sum of their values.
4. Synchronize the threads.
5. Recompute the mean value of each cluster.
6. Check convergence; if the algorithm converges, replace the value of each data element with the mean value of the cluster to which it belongs, then terminate the algorithm; otherwise, go to Step 2.
In this problem, the input data is the same matrix as in PHW 1, stored in input.raw; the value of each matrix element ranges from 0 to 255, so the matrix can be displayed as an image. However, we will have 6 clusters (K = 6). The initial mean values for the clusters are 0, 65, 100, 125, 190, and 255, respectively. To simplify the implementation, you do not need to check for convergence; run 50 iterations (Steps 2-5), then terminate the algorithm and output the matrix into a file named output.raw.
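The fragment below sketches one iteration of Steps 3-5 under some illustrative assumptions: the image is treated as a flat array data of n bytes, each of the p threads owns one contiguous chunk, local counts and sums are merged into shared totals under a mutex, and a pthread barrier provides the Step 4 synchronization. The names (local_sum, reduce_lock, etc.), the create/join-per-iteration structure, and the final pass that overwrites each element with its cluster mean are not prescribed by the assignment.

```c
#include <pthread.h>

#define K 6                               /* six clusters */

extern unsigned char *data;               /* n elements, values 0..255 */
extern int n, p;                          /* number of elements, number of threads */
extern double mean[K];                    /* current cluster means */
extern long long sum[K];                  /* shared per-cluster sums (reset each iteration) */
extern long long cnt[K];                  /* shared per-cluster counts (reset each iteration) */
extern pthread_mutex_t reduce_lock;
extern pthread_barrier_t barrier;         /* initialized for p threads */

/* One iteration of the assignment and update steps for thread `tid`. */
static void *kmeans_step(void *arg) {
    int tid = *(int *)arg;
    int chunk = (n + p - 1) / p;
    int lo = tid * chunk;
    int hi = (lo + chunk < n) ? lo + chunk : n;

    long long local_sum[K] = {0}, local_cnt[K] = {0};

    /* Step 3: assign each owned element to the nearest mean, keeping
     * thread-local counts and sums so threads do not contend on shared state. */
    for (int i = lo; i < hi; i++) {
        int best = 0;
        double bestd = (data[i] - mean[0]) * (data[i] - mean[0]);
        for (int c = 1; c < K; c++) {
            double d = (data[i] - mean[c]) * (data[i] - mean[c]);
            if (d < bestd) { bestd = d; best = c; }
        }
        local_cnt[best]++;
        local_sum[best] += data[i];
    }

    /* Merge the local results into the shared totals. */
    pthread_mutex_lock(&reduce_lock);
    for (int c = 0; c < K; c++) { sum[c] += local_sum[c]; cnt[c] += local_cnt[c]; }
    pthread_mutex_unlock(&reduce_lock);

    /* Step 4: wait until every thread has contributed. */
    pthread_barrier_wait(&barrier);

    /* Step 5: one thread recomputes the means for the next iteration. */
    if (tid == 0)
        for (int c = 0; c < K; c++)
            if (cnt[c] > 0) mean[c] = (double)sum[c] / cnt[c];
    return NULL;
}
```

In this structure, the main program would reset sum and cnt, create p threads running kmeans_step, and join them, repeating this 50 times before the final relabeling and the write of output.raw; keeping the threads alive across iterations with additional barriers is an equally valid design.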
- Implement a serial version without using Pthreads (no need to submit this program).
- Implement a parallel version using Pthreads. Name the program p2.c. Pass the number of threads p as a command line parameter.
- In your report, report the execution time for the 50 iterations (excluding the read/write time) for p = 4. Compare it with the serial version with respect to execution time and discuss your observations.
- Show the image of the output matrix in your report. You can display the output image, output.raw, using the ImageJ [4] software or the given Matlab script file, show_raw.m. If you use show_raw.m, remember to specify the file name in the script.
