[SOLVED] C algorithm math MIPS parallel compiler database graph software network GPU theory slides are adapted from CA course of wisc, princeton, mit, berkeley, etc.

$25

File Name: C_algorithm_math_MIPS_parallel_compiler_database_graph_software_network_GPU_theory_slides_are_adapted_from_CA_course_of_wisc,_princeton,_mit,_berkeley,_etc..zip
File Size: 1469.52 KB

5/5 - (1 vote)

slides are adapted from CA course of wisc, princeton, mit, berkeley, etc.
The uses of the slides of this course are for educaonal purposes only and should be
used only in conjuncon with the textbook. Derivaves of the slides must
acknowledge the copyright noces of this and the originals.
1

Which of the following airplanes has the best performance?
Airplane Passengers
Range mi
630 4150 4000 8720
Speed mph
598
610 1350 544
Boeing 737100 Boeing 747 BACSud Concorde Douglas DC850
101 470 132 146
How much faster is the Concorde vs. the 747
How much bigger is the 747 vs. DC8?
1

Which computer is fastest?
2

Which computer is fastest?
The answer is not simple:
Scientific simulationFP performance
Program developmentInteger performanceDatabase workloadMemory, IO
3

Want to buy the fastest computer for what you want to do?
Workload is very important
Correct measurement and analysis method
4

Want to buy the fastest computer for what you want to do?
Workload is very important
Correct measurement and analysis method
Want to design the fastest computer for what the customer wants to pay?
Cost is an important criterion
5

Time and performance
Iron LawTime
MIPS and MFLOPS
Which programs and how to averageAmdahls law
6

?
What is important to whom?
7

?
What is important to whom?
Computer system user
Minimize elapsed time for program:
tresptendtstart
Called response time
8

?
What is important to whom?
Computer system user
Minimize elapsed time for program:
tresptendtstart
Called response time
Computer center manager
Maximize completion ratejobssecondCalled throughput
9

Response Time vs. Throughput
Is throughput1avg. response time?
10

Response Time vs. Throughput
Is throughput1avg. response time?
Only if NO overlap
Otherwise, throughput1avg. response time
11

Response Time vs. Throughput
Is throughput1avg. response time?
Only if NO overlap
Otherwise, throughput1avg. response time
:A lunch buffet Assume 5 entrees
Each person takes 2 minutesentree
Throughput is 1 person every 2 minutes
BUT time to fill up trayis 10 minutes
Why and what would the throughput be otherwise?
5 people simultaneously filling tray overlapWithout overlap, throughput110
12

?
For computer architects
CPU timetime spent running a program
Intuitively, bigger should be faster, so:
Performance1X time, where X is response, CPU
execution, etc.
Elapsed timeCPU timeIO waitWe will concentrate on CPU time
13

?
Improve a response time or b throughput?Faster CPU
Add more CPUs
14

?
Improve a response time or b throughput?Faster CPU
Helps both a and bAdd more CPUs
15

?
Improve a response time or b throughput?Faster CPU
Helps both a and bAdd more CPUs
Helps b and perhaps a due to less queueing
16

?
MachineAisntimesfasterthanmachineBiff: perfAperfBtimeBtimeAn
Machine A is x faster than machine B iff perfAperfBtimeBtimeA1x100
17

?
MachineAisntimesfasterthanmachineBiff perfAperfBtimeBtimeAn
Machine A is x faster than machine B iff perfAperfBtimeBtimeA1x100
:timeA10s, timeB15s
15101.5A is 1.5 times faster than B15101.5A is 50 faster than B
18

?
MachineAisntimesfasterthanmachineBiff perfAperfBtimeBtimeAn
Machine A is x faster than machine B iff perfAperfBtimeBtimeA1x100
:timeA10s, timeB15s
15101.5A is 1.5 times faster than B15101.5A is 50 faster than B
How to obtain the time?
19

:
Breakingprogramintoinstructions
HW is aware of instructions, not programs
20

:
Breakingprogramintoinstructions
HW is aware of instructions, not programs
Atlowerlevel,HWbreaksinstructionsintocyclesLower level state machines change state every cycle
21

:
Breakingprogramintoinstructions
HW is aware of instructions, not programs
Atlowerlevel,HWbreaksinstructionsintocyclesLower level state machines change state every cycle
For example:
1GHz Snapdragon runs 1000M cyclessec, 1 cycle1ns2.5GHz Core i7 runs 2.5G cyclessec, 1 cycle0.25ns
22

Iron Law in Performance
Processor Performance
Time Program

Instructions Program
code size
X
Cycles Instruction
CPI
X
Time Cycle
cycle time
ArchitectureImplementationRealization Compiler Designer Processor Designer Chip Designer
23

Iron Law in Performance
InstructionsProgram
Instructions executed, not static code sizeDetermined by algorithm, compiler, ISA
24

Iron Law in Performance
InstructionsProgram
Instructions executed, not static code sizeDetermined by algorithm, compiler, ISA
25

Iron Law in Performance
InstructionsProgram
Instructions executed, not static code sizeDetermined by algorithm, compiler, ISA
CyclesInstruction
Determined by ISA and CPU organization
Overlap among instructions reduces this term
26

Iron Law in Performance
InstructionsProgram
Instructions executed, not static code sizeDetermined by algorithm, compiler, ISA
CyclesInstruction
Determined by ISA and CPU organization
Overlap among instructions reduces this term
Timecycle
Determined by technology, organization, clever circuit design
27

Iron Law
InstructionsProgram
Instructions executed, not static code sizeDetermined by algorithm, compiler, ISA
CyclesInstruction
Determined by ISA and CPU organization
Overlap among instructions reduces this term
Timecycle
Determined by technology, organization, clever circuit design
28

Minimize time which is the product, NOT isolated terms
Common error to miss terms while devising optimizations
E.g. ISA change to decrease instruction count
BUT leads to CPU organization which makes clock slower
29

Minimize time which is the product, NOT isolated terms
Common error to miss terms while devising optimizations
E.g. ISA change to decrease instruction count
BUT leads to CPU organization which makes clock
slower
Be mind that: terms are interrelated
30

MIPS:
MIPSinstruction countexecution time x 106
clock rateCPI x 106
But MIPS has serious shortcomings
31

MIPS
E.g. without FP hardware, an FP op may take 50
singlecycle instructions
With FP hardware, only one 2cycle instruction
Thus, adding FP hardware:
CPI increases why?
Instructionsprogram decreases why?
Total execution time decreases
BUT, MIPS gets worse!
505021 501
502
50 MIPS2 MIPS
32

MIPSIgnores program
Usually used to quote peak performance
Ideal conditionsguaranteed not to exceed!
33

MIPSIgnores program
Usually used to quote peak performance
Ideal conditionsguaranteed not to exceed!
When is MIPS ok?
Same compiler, same ISA
E.g. same binary running on AMD Jaguar, Intel Core i7
Why?
34

MIPSIgnores program
Usually used to quote peak performance
Ideal conditionsguaranteed not to exceed!
When is MIPS ok?
Same compiler, same ISA
E.g. same binary running on AMD Jaguar, Intel Core i7
Why?
Instrprogram is constant and can be factored out
35

MFLOPSFP ops in programexecution time x 106
Assuming FP ops independent of compiler and
ISA
Often safe for numeric codes: matrix size determinesof FP opsprogram
However, not always safe:
Missing instructions e.g. FP divideOptimizing compilers
36

Use ONLY Time
Beware when reading, especially if details are
omitted
Be careful when you buy products
Beware of Peak
Guaranteed not to exceed
37

Iron Law Example A
MachineA:clock1ns,CPI2.0,forprogramxMachineB:clock2ns,CPI1.2,forprogramxWhich machine is faster and how much?
38

Iron Law Example A
MachineA:clock1ns,CPI2.0,forprogramxMachineB:clock2ns,CPI1.2,forprogramxWhich machine is faster and how much?
39

Iron Law Example A
MachineA:clock1ns,CPI2.0,forprogramx
MachineB:clock2ns,CPI1.2,forprogramx
Which machine is faster and how much?
Ans:
TimePrograminstrprogram x cyclesinstr x seccycle TimeAN x 2.0 x 12N
TimeBN x 1.2 x 22.4N
Compare: TimeBTimeA2.4N2N1.2
40

Iron Law Example A
MachineA:clock1ns,CPI2.0,forprogramx
MachineB:clock2ns,CPI1.2,forprogramx
Whichisfasterandhowmuch? Ans:
TimePrograminstrprogram x cyclesinstr x seccycle TimeAN x 2.0 x 12N
TimeBN x 1.2 x 22.4N
Compare: TimeBTimeA2.4N2N1.2
So, Machine A is 20 faster than Machine B for program x
41

Iron Law Example B Keep clockA1ns and clockB 2ns
To get equal performance, if CPIB1.2, what should CPIA be?
42

Iron Law Example
Keep clockA1ns and clockB 2ns
To get equal performance, if CPIB1.2, what should CPIA be?
Ans: TimeBTimeA1
Nx2x1.2Nx1xCPIACPIA2.4
43

Iron Law Example CKeep CPIA2.0 and CPIB1.2
For equal performance, if clockB2ns, what should clockA be?
Ans: TimeBTimeA1
N x 2.0 x clockAN x 1.2 x 2clockA1.2ns
44

Time and performance: Machine A n times faster than Machine B
Iff TimeBTimeAn
Iron Law: PerformanceTimeprogram

Instructions Program
code size
X
Cycles Instruction
CPI
X
Time Cycle
cycle time
Other Metrics: MIPS and MFLOPSBeware of peak and omitted details
45

?
Executiontimeofwhatprogram?
Best caseyou always run the same set of programsPort them and time the whole workload
46

?
Executiontimeofwhatprogram?
Best caseyou always run the same set of programs
Port them and time the whole workload
In reality, use benchmarks
Programs chosen to measure performance
Predict performance of actual workloadSaves effort and money
Representative? Honest? Benchmarketing
47

Machine A
Machine B
Program 1
1
10
Program 2
1000
100
Total
1001
110
One answer: for total execution time, how much faster is B?
10011109.1x
48

?
Machine A
Machine B
Program 1
1
10
Program 2
1000
100
Total
1001
110
One answer: for total execution time, how much faster is B?
10011109.1x
49

How to Average?
Another:arithmeticmeansameresult
Arithmeticmeanoftimes:
AMA10012500.5
AMB110255
Speedup:500.5559.1x
n 1 timei n
i1
50

How to Average?
Another:arithmeticmeansameresult
Arithmeticmeanoftimes:
AMA10012500.5
AMB110255
Speedup:500.5559.1x
Valid only if programs run equally often, so use weighted arithmetic mean:
n 1 timei n
i1
n 1
weightitimei
n
i1
51

How to Average?
Another:arithmeticmeansameresult
Arithmeticmeanoftimes:
AMA10012500.5
AMB110255
Speedup:500.5559.1x
Valid only if programs run equally often, so use weighted arithmetic mean:
n 1 timei n
i1
n 1
weightitimei
n
i1
How to determine the weight?
52

Problem of AM
E.g., 30 mph for first 10 miles, then 90 mph for next 10 miles, what is average speed?
Average speed30902 ???
53

Problem of AM
E.g., 30 mph for first 10 miles, then 90 mph for next 10 miles, what is average speed?
Average speed30902
WRONG!
54

Problem of AM
E.g., 30 mph for first 10 miles, then 90 mph for next 10 miles, what is average speed?
Average speed30902
Average speedtotal distancetotal time
201030109045 mph
55

Harmonic Mean
Harmonic mean of rates
Use HM if forced to start and end with rates e.g.
reporting MIPS or MFLOPS
Why?
Rate has time in denominator
Mean should be proportional to inverse of sums of time not sum of inverses
See: J.E. Smith, Characterizing computer performance with a single number, CACM Volume 31 , Issue 10 October 1988, pp. 12021206.
n
n1
raten i1
56

Dealing with Ratios
Machine A
Machine B
Program 1
1
10
Program 2
1000
100
Total
1001
110
If we take ratios with respect to machine A
Machine A
Machine B
Program 1
1
10
Program 2
1
0.1
Average
1
5.05
57

Dealing with Ratios
Avg. wrt. machine A: A is 1, 5.05
If we take ratios with respect to machine B
Machine A
Machine B
Program 1
0.1
1
Program 2
10
1
Average
5.05
1
58

Dealing with Ratios
Avg. wrt. machine A: A is 1, 5.05
If we take ratios with respect to machine B
Machine A
Machine B
Program 1
0.1
1
Program 2
10
1
Average
5.05
1
Cant both be true!!!
Dont use arithmetic mean on ratios!
59

Geometric Mean
Use geometric mean for ratiosGeometric mean of ratios
Independent of reference machine
In the example, GM for machine a is 1, for
machine B is also 1
Normalized with respect to either machine
n
n ratioi
i1
60

But
GMofratiosisnotproportionaltototaltime
AMinexamplesaysmachineBis9.1timesfaster
GM says they are equal
If we took total execution time, A and B are equal only if
Program 1 is run 100 times more often than program 2
Generally, GM will mispredict for three or more machines
61

Use AM for times
Use HM if forced to use ratesUse GM if forced to use ratios
Best of all, use unnormalized numbers to compute time
62

: SPEC2000
System Performance Evaluation CooperativeFormed in 80s to combat benchmarketing
SPEC89, SPEC92, SPEC95, SPEC2000, SPEC2006Latest one is SPEC2017
12 integer and 14 floatingpoint programs
Sun Ultra5 300MHz reference machine has score
of 100
Report GM of ratios to reference machine
63

Benchmarks: SPEC CINT2000
Benchmark
Description
164.gzip
Compression
175.vpr
FPGA place and route
176.gcc
C compiler
181.mcf
Combinatorial optimization
186.crafty
Chess
197.parser
Word processing, grammatical analysis
252.eon
Visualization ray tracing
253.perlbmk
PERL script execution
254.gap
Group theory interpreter
255.vortex
Objectoriented database
256.bzip2
Compression
300.twolf
Place and route simulator
64

Benchmarks: SPEC CFP2000
Benchmark
Description
168.wupwise
PhysicsQuantum Chromodynamics
171.swim
Shallow water modeling
172.mgrid
Multigrid solver: 3D potential field
173.applu
Parabolicelliptic PDE
177.mesa
3D graphics library
178.galgel
Computational Fluid Dynamics
179.art
Image RecognitionNeural Networks
183.equake
Seismic Wave Propagation Simulation
187.facerec
Image processing: face recognition
188.ammp
Computational chemistry
189.lucas
Number theoryprimality testing
191.fma3d
Finiteelement Crash Simulation
200.sixtrack
High energy nuclear physics accelerator design
301.apsi
Meteorology: Pollutant distribution
65

Benchmark Pitfalls
Benchmark not representative
Your workload is IO bound, SPEC is useless
66

Benchmark Pitfalls
Benchmark not representative
Your workload is IO bound, SPEC is useless
Benchmark is too old
Benchmarks age poorly; benchmarketing pressure causes vendors to optimize compiler, hardware, software to match benchmarks
Need to be periodically refreshed
67

Amdahls Law
Motivationforoptimizingcommoncase
Speedupold timenew timenew rateold rate
Let an optimization speed fraction f of time by a
Newtime1f x oldtimefs x oldtime
Speedupoldtimenewtime
Speedupoldtime1f x oldtimefs x oldtime
factor of s
Math: If f is
Speedup
1 f foldtime
1f o l d t i m e sfo l d t i m e
1 1f
f s
68

Amdahls Law Example
Your boss asks you to improve performance by:Improve the ALU used 95 of time by 10
Improve memory pipeline used 5 of time by 10x
f
95
5
5
s
1.10
10

Speedup
Speedup1 1f
f s
1.094
1.047
1.052
69

Amdahls Law: Limit
Make common case fast:
10
9
8
7
6
5
4
3
2
1
0
lim 11 s 1fsf 1f
0 0.2 0.4 0.6 0.8 1 f
70
Speedup

Amdahls Law: Limit
Consideruncommoncase!
If 1f is nontrivial
Speedup is limited!
Particularly true for exploiting parallelism in the
large, where large s is not cheap
GPU with e.g. 1024 processors shader cores
Parallel portion speeds up by s 1024x
Serial portion of code 1f limits speedup
E.g. 10 serial portion: 10.110x speedup with 1000 cores
lim 11 s 1fsf 1f
71

Benchmarks: SPEC2000
Summarize performance:
AM for timeHM for rateGM for ratio
Amdahls Law:
Speedup1 1 f
f s
72

Reviews

There are no reviews yet.

Only logged in customers who have purchased this product may leave a review.

Shopping Cart
[SOLVED] C algorithm math MIPS parallel compiler database graph software network GPU theory slides are adapted from CA course of wisc, princeton, mit, berkeley, etc.
$25