[SOLVED] CS COMP5426 Distributed


COMP5426 Distributed Computing
Programming Shared Memory Platforms


Shared memory programming model

- Multiple threads are created and run concurrently, each having its own local private variables
- A set of global variables is shared by all threads
- Threads communicate implicitly using shared variables
- Threads coordinate explicitly by synchronization on shared variables
- Shared memory machines are usually small (~100 cores for a large one) and suited to fine-grained parallelism
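A minimal sketch of this model in Python (illustrative only, not from the course): each thread has a private partial sum, one global variable is shared by all threads, and coordination on the shared variable is done explicitly with a lock.

```python
import threading

total = 0                 # global variable shared by all threads
lock = threading.Lock()   # explicit synchronization object

def worker(numbers):
    global total
    partial = sum(numbers)    # local private variable: no synchronization needed
    with lock:                # explicit coordination on the shared variable
        total += partial

# Communication is implicit: threads just read/write the shared `total`.
chunks = [range(0, 100), range(100, 200), range(200, 300)]
threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(total)  # sum(range(300)) == 44850
```

Without the lock, the read-modify-write of `total` could interleave and lose updates, which is exactly why coordination must be explicit.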

Implications for algorithm design

- Data locality is very important
- No explicit data movement: yes, data don't need to be moved around
- Data partitioning unnecessary? Not true! Must still consider ILP and data locality in algorithm design
- Explicit synchronization is also needed: minimize the use of synchronization to improve efficiency
- We'll use several examples for comparison

Matrix Multiplication

- For coarse-grain parallelism, partition matrix C and assign each partition to one thread
- No data dependency, thus no synchronization: embarrassingly parallel
- Note the regular data structure: partitioning is relatively easy
- Partitioning schemes: 1D cyclic, 1D block, 2D block, 2D cyclic block

Partitioning

- In 1D cyclic partitioning, rows are assigned to the threads one by one in a cyclic manner until all the rows are assigned
- In 1D block partitioning, rows are first grouped into blocks and then each block is assigned to one thread
- In 2D block partitioning, both rows and columns are first grouped into blocks and then each block (a submatrix) is assigned to one thread
- For shared memory machines, we may assign each thread a row of small 2D blocks
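The row-to-thread maps above reduce to simple index arithmetic. A sketch (the function names and the choice of block size b = ceil(n/p) are my assumptions, not the course's notation):

```python
# Hypothetical owner functions for n rows and p threads.

def owner_1d_cyclic(i, p):
    # 1D cyclic: rows are dealt out one by one, so row i goes to thread i mod p.
    return i % p

def owner_1d_block(i, n, p):
    # 1D block: rows are first grouped into p contiguous blocks of size ceil(n/p).
    b = -(-n // p)            # ceil(n / p)
    return i // b

def owner_2d_block(i, j, n, p_rows, p_cols):
    # 2D block: rows and columns are grouped into blocks; each submatrix
    # (block row, block column) belongs to one thread of a p_rows x p_cols grid.
    br = i // -(-n // p_rows)
    bc = j // -(-n // p_cols)
    return br * p_cols + bc

n, p = 8, 4
print([owner_1d_cyclic(i, p) for i in range(n)])    # [0, 1, 2, 3, 0, 1, 2, 3]
print([owner_1d_block(i, n, p) for i in range(n)])  # [0, 0, 1, 1, 2, 2, 3, 3]
```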

Which one is better?

- Locality is very important: increase opportunities for ILP and increase computational intensity (memory hierarchy)
- 1D cyclic: one row in C needs one row in A and the whole of B
- If B is larger than the cache size, computational intensity is low and cache misses are frequent

Which one is better?

- 1D block: better, but the matrix multiplication for each thread still needs the whole of B
- To improve further if B is large, partition into row blocks again for higher computational intensity (memory hierarchy)

Which one is better?

- 2D block: high computational intensity
- Only need one block row of A and one block column of B
- These fit into the cache to minimize cache misses (memory hierarchy)
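The cache-blocking idea behind the 2D scheme can be sketched in a few loops (pure Python, illustrative only; real code would use a tuned BLAS). The loops are tiled with a block size `bs` so that one block of A and one block of B are reused while still resident in cache:

```python
def blocked_matmul(A, B, bs=2):
    """Tiled C = A * B for square matrices stored as lists of lists."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ib in range(0, n, bs):
        for jb in range(0, n, bs):
            for kb in range(0, n, bs):
                # Multiply the (ib, kb) block of A by the (kb, jb) block of B,
                # accumulating into the (ib, jb) block of C.
                for i in range(ib, min(ib + bs, n)):
                    for k in range(kb, min(kb + bs, n)):
                        aik = A[i][k]
                        for j in range(jb, min(jb + bs, n)):
                            C[i][j] += aik * B[k][j]
    return C
```

The result is identical to the untiled triple loop; only the order of memory accesses changes, which is the whole point of blocking for the memory hierarchy.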

Matrix Multiplication: summary

- There is no data dependency: embarrassingly parallel
- Linear speedup is achievable
- The focus should be on sequential computation on each core to achieve high efficiency, e.g., cache effects, memory hierarchy, loop optimizations
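Because each row of C depends only on A and B, never on other rows of C, threads can compute disjoint row partitions with no synchronization at all. A sketch using a thread pool and a 1D cyclic row partition (my construction, not the course's code):

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_rows(args):
    # Each task computes the rows of C it owns; no shared writes conflict.
    A, B, rows = args
    n = len(B[0])
    return [(i, [sum(A[i][k] * B[k][j] for k in range(len(B)))
                 for j in range(n)]) for i in rows]

def parallel_matmul(A, B, p=2):
    n = len(A)
    parts = [range(t, n, p) for t in range(p)]   # 1D cyclic row partition
    C = [None] * n
    with ThreadPoolExecutor(max_workers=p) as ex:
        for part in ex.map(matmul_rows, [(A, B, r) for r in parts]):
            for i, row in part:
                C[i] = row                       # each index written exactly once
    return C
```

No lock is needed because the partitions are disjoint, which is exactly what "embarrassingly parallel" buys you.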

Parallel Algorithms: Parallel Scan
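The standard block-scan scheme, sketched here as an assumption about what these slides cover (the technique is standard, the code is mine): each thread scans its own block, the block totals are scanned, and the offsets are added back. The extra pass is the "increased workload for parallel computation" mentioned in the summary.

```python
from itertools import accumulate

def parallel_style_scan(xs, p=4):
    """Inclusive prefix sum, structured the way a p-thread scan would be."""
    n = len(xs)
    b = -(-n // p)                               # block size, ceil(n / p)
    blocks = [xs[t * b:(t + 1) * b] for t in range(p)]
    # Phase 1 (parallelizable): inclusive scan inside each block.
    local = [list(accumulate(bl)) for bl in blocks if bl]
    # Phase 2 (small, sequential): exclusive scan of the block totals.
    offsets = [0]
    for bl in local[:-1]:
        offsets.append(offsets[-1] + bl[-1])
    # Phase 3 (parallelizable): add each block's offset back in.
    out = []
    for off, bl in zip(offsets, local):
        out.extend(v + off for v in bl)
    return out

print(parallel_style_scan([1, 2, 3, 4, 5, 6, 7]))  # [1, 3, 6, 10, 15, 21, 28]
```

Phases 1 and 3 each touch all n elements, so the parallel version does roughly 2n additions where the sequential scan does n: more total work, traded for concurrency.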

Gaussian Elimination with Partial Pivoting (GEPP)

- Should use BLAS 3 whenever possible: BLAS 3 (matrix multiplication) is much faster (higher computational intensity) than BLAS 1 & 2
- How? Blocking, i.e., blocked GEPP (next slide)

for ib = 1 to n-1 step b                        ... process matrix b columns at a time
    end = ib + b - 1                            ... point to end of block of b columns
    apply BLAS2 version of GEPP to get A(ib:n, ib:end) = L * U
    let LL denote the strict lower triangular part of A(ib:end, ib:end)
    A(ib:end, end+1:n) = LL^-1 * A(ib:end, end+1:n)       ... update next b rows of U
    A(end+1:n, end+1:n) = A(end+1:n, end+1:n)
                          - A(end+1:n, ib:end) * A(ib:end, end+1:n)
        ... apply delayed updates with a single matrix multiply with inner dimension b
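A runnable pure-Python sketch of the blocked algorithm above, with two simplifications: pivoting is omitted (so it assumes a matrix, e.g. diagonally dominant, where no pivoting is needed), and the BLAS calls become plain loops. `LL` is the unit lower triangle of the diagonal block, as in the pseudocode.

```python
def blocked_lu(A, b=2):
    """Blocked LU without pivoting; returns L (strict lower) and U packed together."""
    n = len(A)
    A = [row[:] for row in A]            # work on a copy, in place
    for ib in range(0, n, b):
        end = min(ib + b, n)             # one past the block's last column
        # Unblocked ("BLAS2") LU on the panel A[ib:n, ib:end].
        for k in range(ib, end):
            for i in range(k + 1, n):
                A[i][k] /= A[k][k]
                for j in range(k + 1, end):
                    A[i][j] -= A[i][k] * A[k][j]
        # Next b rows of U: A[ib:end, end:n] = LL^-1 * A[ib:end, end:n]
        # (forward substitution with the unit lower triangle LL).
        for k in range(ib, end):
            for i in range(k + 1, end):
                for j in range(end, n):
                    A[i][j] -= A[i][k] * A[k][j]
        # Delayed update, one matrix multiply with inner dimension b:
        # A[end:, end:] -= A[end:, ib:end] * A[ib:end, end:]
        for i in range(end, n):
            for k in range(ib, end):
                for j in range(end, n):
                    A[i][j] -= A[i][k] * A[k][j]
    return A
```

Multiplying the unit-lower-triangular L by the upper-triangular U read off the result reproduces the original matrix, which is the usual correctness check.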

Parallel Algorithm for GEPP

- In GEPP the active submatrix shrinks as elimination proceeds, so load balancing needs serious consideration; after step i only the trailing (n-i) x (n-i) submatrix remains
- 1D block: assign one block of rows to each thread; as the submatrix shrinks, more and more threads become idle; the load is not balanced, which is not efficient
- 2D cyclic block: partition the matrix into small 2D blocks and assign the blocks in a cyclic manner; even in late iterations the load is still balanced and most threads are busy
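A back-of-the-envelope check of the imbalance claim (my model, not the course's): count one unit of work per active entry a thread owns, summed over all elimination steps, under 1D block versus 2D cyclic assignment.

```python
def work_1d_block(n, p):
    # 1D block: thread i // b owns row i; early blocks leave the active
    # submatrix first and their threads go idle.
    b = -(-n // p)
    work = [0] * p
    for step in range(n):
        for i in range(step, n):         # active rows at this step
            work[i // b] += n - step     # active entries in row i
    return work

def work_2d_cyclic(n, p_rows, p_cols):
    # 2D cyclic: entry (i, j) belongs to thread (i mod p_rows, j mod p_cols),
    # so every thread keeps owning part of the shrinking submatrix.
    work = [0] * (p_rows * p_cols)
    for step in range(n):
        for i in range(step, n):
            for j in range(step, n):
                work[(i % p_rows) * p_cols + (j % p_cols)] += 1
    return work

print(work_1d_block(16, 4))      # badly skewed toward the last thread
print(work_2d_cyclic(16, 2, 2))  # near-equal shares
```

The total work is the same in both cases; only its distribution across threads differs, and the 1D block spread is far wider.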

Load balancing

There are two types:

- Static: when the load is predictable, the work can be partitioned and assigned in advance, and the assignment will not change during the computation
- Dynamic: when the amount of work changes and is unpredictable, we have to dynamically balance the load during the computation

Ideally, we want all the processors or cores to be busy doing useful work all the time and to complete their work at the same time. Recall Amdahl's law: even a small amount of load imbalance can significantly affect the performance.

- Dynamic load balancing is harder in general, but relatively easier on shared memory machines than on distributed memory machines, since data in shared memory are accessible by all threads
- On a shared memory machine, for example, we can maintain a shared task pool; threads retrieve tasks from the pool for execution
- However, we must consider the cost of achieving good load balancing: minimize the additional computational overhead and minimize the synchronization overhead
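A minimal task-pool sketch (my construction): one shared pool holds the tasks, and every thread repeatedly retrieves the next task and executes it. `queue.Queue` performs the necessary locking internally, standing in for the explicit synchronization a hand-rolled pool would need.

```python
import queue
import threading

def run_task_pool(tasks, p=3):
    """Run callables from a shared pool on p threads; return their results."""
    pool = queue.Queue()
    for t in tasks:
        pool.put(t)
    results, rlock = [], threading.Lock()

    def worker():
        while True:
            try:
                task = pool.get_nowait()    # retrieve a task from the shared pool
            except queue.Empty:
                return                      # no work left: this thread finishes
            r = task()
            with rlock:                     # synchronization cost of load balancing
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Tasks of very different sizes still keep all threads busy until the pool drains.
jobs = [lambda k=k: sum(range(k)) for k in (10, 10000, 100, 100000)]
print(sorted(run_task_pool(jobs)))
```

Whoever finishes early simply pulls the next task, so the load balances itself; the price is the per-task synchronization on the pool, which is why tasks should not be too fine-grained.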

Summary

- On a shared memory machine, global data are accessible by all threads
- Task/data partitioning & assignment:
  - Static: the workload is predictable; partitioning and assignment are done in advance and do not change during the computation (today's examples)
  - Dynamic: the workload changes during the computation and is unpredictable; dynamic partitioning algorithms are harder
- Threads coordination is explicitly done via synchronization on shared variables

- Load balancing is very important. Amdahl's law: even a small amount of load imbalance may significantly affect the overall performance
- Data locality is very important. We need to seriously consider how to optimize the performance on a single processor or core: increase opportunities for ILP, and increase computational intensity (memory hierarchy)
- Three examples:
  - matrix multiplication: embarrassingly parallel
  - scan (or prefix) operation: increased workload for parallel computation
  - Gaussian Elimination with partial pivoting: harder to balance the load

- Basic linear algebra subroutines (BLAS):
  - BLAS 1: vector operations
  - BLAS 2: matrix-vector operations
  - BLAS 3: matrix multiplication
- BLAS 3 is faster because of its high computational intensity, and should be used whenever possible
- Through these examples we also introduced data partitioning strategies widely used with regular data structures: 1D cyclic, 1D block, 2D block and 2D cyclic block
- The blocking technique achieves better locality
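The rough arithmetic behind the BLAS comparison, using the usual model of intensity = floating-point operations / words moved (constants and cache effects ignored; the exact counts below are my back-of-the-envelope figures for n x n data):

```python
def intensity(flops, words):
    # Computational intensity: useful arithmetic per word of memory traffic.
    return flops / words

n = 1000
blas1 = intensity(2 * n, 3 * n)               # axpy: y = a*x + y reads x, y, writes y
blas2 = intensity(2 * n**2, n**2 + 2 * n)     # matrix-vector multiply: reads A, x, writes y
blas3 = intensity(2 * n**3, 4 * n**2)         # matrix multiply: reads A, B, C, writes C
print(round(blas1, 2), round(blas2, 2), round(blas3, 2))
```

BLAS 1 and 2 stay at O(1) intensity no matter how large n gets, while BLAS 3 reuses each matrix entry about n times, so its intensity grows with n. That growth is what blocking exploits: recasting as much work as possible into matrix multiplication.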
