ECE437: Introduction to Digital Computer Design
Chapter 1b (performance, 1.4 onwards)
Spring 2021
Performance of Computers
Which computer is fastest? Not so simple
scientific simulation FP performance
program development Integer performance
commercial work memory + I/O
ECE437, S21 Vijaykumar and Thottethodi (2)
Performance of Computers
Want to buy the fastest computer for what you want to do
workload is important
Want to design the fastest computer for what they want to pay
cost is an important criterion
ECE437, S21 Vijaykumar and Thottethodi (3)
Outline
Time and performance
Iron law
MIPS and MFLOPS
Which programs and how to average Amdahls law
ECE437, S21 Vijaykumar and Thottethodi (4)
1
Defining Performance
What is important to whom
Limited model
Start job, wait 1. Computer system user for end. Suggest
minimize elapsed time for program = alternate
time_end time_start called response time
performance metrics.
2. Computer center manager
maximize completion rate = #jobs/second called throughput
ECE437, S21 Vijaykumar and Thottethodi (5)
Response Time vs. Throughput
Is throughput = 1/av. response time?
ONLY if NO overlap
with overlap, throughput > 1/av.response time
e.g., a lunch buffet assume 5 entrees
each person takes 2 minutes at every entree
throughput is 1 person every 2 minutes
BUT time to fill up tray is 10 minutes
Why? Otherwise, what would the throughput be?
because there are 5 people (each at 1 entree) simultaneously;
if there is no such overlap throughput = 1/10
ECE437, S21 Vijaykumar and Thottethodi (6)
What is Performance for us?
For computer architects
CPU execution time = time spent running a program
Shorter time better but people like better to be bigger to match intuition
performance = 1/time (shorter time better perf.)
where time = response time, CPU run time, etc.
Elapsed time = CPU execution time + I/O wait
We will concentrate mostly on CPU execution time
ECE437, S21 Vijaykumar and Thottethodi (7)
Improve Performance
Improve (a) response time or (b) throughput?
faster CPU
both (a) and (b)
Add more CPUs
(b) but (a) may be improved due to reduced queueing
ECE437, S21 Vijaykumar and Thottethodi (8)
Give an example of this phenomenon
2
Performance Comparison
Machine A is n times faster than machine B iff
perf(A)/perf(B) = time(B)/time(A) = n
Machine A is x% faster than machine B iff perf(A)/perf(B) = time(B)/time(A) = 1 + x/100
E.g., A 10s, B 15s
15/10 = 1.5 => A is 1.5 times faster than B
15/10=1+50/100=>Ais50%fasterthanB
ECE437, S21 Vijaykumar and Thottethodi (9)
Breaking down Performance
A program is broken into instructions
H/W is aware of instructions, not programs
At lower level, H/W breaks instructions into cycles
lower level state machines change state every cycle
E.g., 4 GHz Opteron runs 4 B cycles/sec, 1 cycle = 0.25 ns
ECE437, S21 Vijaykumar and Thottethodi (10)
Iron law
Time/program = instrs/program x cycles/instr x sec/cycle
NEVER forget this!
sec/cycle (a.k.a. cycle time, clock time)
mostly determined by technology and CPU orgn. cycles/instr (a.k.a. CPI)
mostly determined by ISA and CPU organization EFFECTIVE cycles/instr and NOT actual latency overlap among instructions makes this smaller
Each instr 5 cycles but 5 instrs overlap CPI = 1
AVERAGE over instrs (instrs have different CPI)
instr/program (a.k.a. instruction count)
instrs executed, NOT static code
mostly determined by program, compiler, ISA
ECE437, S21 Vijaykumar and Thottethodi (11)
Our Goal
Minimize time which is the product, NOT isolated terms
Tempting because of the separate terms
Optimizing one term may worsen time by worsening other terms!
Common error to miss terms while devising optimizations
E.g., ISA change to decrease instr. count BUT leads to slower clock
ECE437, S21 Vijaykumar and Thottethodi (12)
3
Iron Law Example
Machine A: clock 1 ns, CPI 2.0, for a program
Machine B: clock 2 ns, CPI 1.2, for same program
Which is faster and how much?
Time/program = instrs/program x cycles/instr x sec/cycle
Time(A):Nx2.0x1=2N
Time(B):Nx1.22=2.4N
Compare: Time(B)/Time(A) = 2.4N/2N = 1.2
So, Machine A is 20% faster than Machine B for this program
ECE437, S21 Vijaykumar and Thottethodi (13)
Iron Law Example
KeepclockofAat1nsandclockofBat 2 ns
For equal performance, if CPI of B is 1.2, what is As CPI?
Time(B)/Time(A) = 1 = (N x 2 x 1.2)/(N x 1 x CPI(A))
CPI(A) = 2.4
ECE437, S21 Vijaykumar and Thottethodi (14)
Iron Law Example
LetCPIofA= 2.0andCPIofB=1.2
For equal performance, if clock of B is 2 ns, what is As clock?
ECE437, S21 Vijaykumar and Thottethodi (15)
Iron Law Example
LetCPIofA=2.0andCPIofB=1.2
For equal performance, if clock of B is 2 ns, what is As clock?
Time(B)/Time(A) = 1 = (N x 1.2 x 2)/(N x 2.0 x clock(A))
clock(A) = 1.2 ns
ECE437, S21 Vijaykumar and Thottethodi (16)
4
Iron Law notes
You will see ch4a (single cycle) in the lab
But not Ch1b (Iron law) so spend an hour to think about and internalize Ch1b
ECE437, S21 Vijaykumar and Thottethodi (17)
Other Metrics
Other than execution time
MIPS and MFLOPS Bogus!
MIPS = millions of instructions per sec
= instruction count/(execution time x 106) Not MIPS, the instruction set
= clock rate/(CPI x 106) (How?) BUT MIPS has problems
ECE437, S21 Vijaykumar and Thottethodi (18)
Problems with MIPS
E.g.,withoutFPH/W,anFPopmaytake50single- cycle instrs
withFPH/Wonlyone2-cycleinstratsameclock
ThusaddingFPH/W
CPIincreases(why?)TheFPopgoesfrom50/50to2/1 butinstrs/progdecreasesmore(why?)eachoftheFPop
reduces from 50 to 1, factor of 50
totalexecutiontimedecreases
ForMIPS
instrs/prog ignored
MIPSgetsworse!
ECE437, S21 Vijaykumar and Thottethodi (19)
Problems with MIPS
Ignores program
Usually used to quote peak performance
ideal conditions => guarantee not to exceed!! When is MIPS ok?
same compiler and same ISA
e.g., same binary running on Xeon Ivy Bridge and
Xeon Broadwell
why? instrs/prog is constant and may be ignored
ECE437, S21 Vijaykumar and Thottethodi (20)
5
Other (Bogus) Metrics
MFLOPS = FP ops in program/(execution time x 106)
Assuming FP ops independent of compiler and ISA
Assumption not true
Eg may not have divide instruction in ISA
optimizing compilers can remove some insts
Relative MIPS and normalized MFLOPS adds to confusion!
ECE437, S21 Vijaykumar and Thottethodi (21)
Rules
Use ONLY Time
Beware when reading, especially if details are omitted
Beware of Peak
Peak is the performance that the chip maker guarantees not to exceed meaningless!
ECE437, S21 Vijaykumar and Thottethodi (22)
Outline
Time and performance Iron law
MIPS and MFLOPS
Which programs
How to average Amdahls law
ECE437, S21 Vijaykumar and Thottethodi (23)
Which Programs
Execution time of what
Best case you run the same set of programs everyday
port them and time the whole workload In reality, use benchmarks
programs chosen to measure performance
predict performance of actual workload (hopefully) saves effort and money
representative? honest?
ECE437, S21 Vijaykumar and Thottethodi (24)
6
Benchmarks: SPEC2006
SPEC: System Performance Evaluation Cooperative
Latest is SPEC2006, before SPEC89, SPEC92, SPEC95, SPEC2000
12 integer and 17 floating point programs
normalize run time with a Gold Standard processor (SPEC ratio)
Geometric Mean (GM) of the normalized times (why GM?)
ECE437, S21 Vijaykumar and Thottethodi (25)
One-minute quiz
State the Iron Law Time per program =
Previous quiz
What is the ALU operation for loads and stores?
Add
ECE437, S21 Vijaykumar and Thottethodi (26)
SPEC CINT2006 http://www.spec.org/cpu2006/CINT2006/
Benchmark
Description
Xalancbmk
XML processing
astar
Path finding
mcf
Combinatorial optimization
gcc
GNU C compiler
Omnetpp
Discrete event simulation
h264ref
Video compression
libquantum
Quantum computing
gobmk
Artificial Intelligence: Go
hmmer
Gene Sequence Search
Bzip2
Compression
sjeng
Artificial Intelligence: chess
Perlbench
PERL programming language
ECE437, S21 Vijaykumar and Thottethodi (27)
SPEC CFP2006 http://www.spec.org/cpu2006/CFP2006/
Benchmark
Description
milc
Quantum chromodynamics
Bwaves
Fluid dynamics
Soplex
Simplex LP solver
Wrf
Weather modeling
17 in all, see webpage
Speech recognition, quantum chemistry, structural mechanics, ray tracing, etc.
ECE437, S21 Vijaykumar and Thottethodi (28)
7
How to average
SPEC has 29 programs how to average? Example
One answer: total execution time, then how much faster than A is B? 1001/110 = 9.1
ECE437, S21 Vijaykumar and (29) Thottethodi
Machine A
Machine B
Program 1
1s
10s
Program 2
1000s
100s
Total
1001s
110s
How to average
Another:arithmeticmean(sameresult:B9.1times
faster than A )
Arithmetic mean of times: time
programs
i1
i
AM(A)=1001/2=500.5
AM(B)=110/2=55
500.5/55=9.1
Validonlyifprogramsrunequallyoften,elseuse
weight factors
Weighted arithmetic mean:
ECE437, S21 Vijaykumar and (30) Thottethodi
i1
n
/ n
for n
n
weight time / n
ii
Other Averages E.g., 30 mph for first 10 miles
90 mph for next 10 miles. Average speed?
Average speed = (30+90)/2 =60mph? WRONG
Average speed = total distance / total time = (20 / (10/30+10/90))
= 45 mph
What if it was 10 hours at each speed?
instead of 10 miles
ECE437, S21 Vijaykumar and Thottethodi (31)
Harmonic Mean
Harmonic mean of rates =
1 n1
rate i1 i n
Use HM if forced to start and end with rates Trick to do arithmetic mean of times but
using rates and not times
ECE437, S21 Vijaykumar and (32) Thottethodi
8
Practice
Machine A runs 10M instructions at 15 MIPS and the next 10M instructions at 45 MIPS
Average MIPS = ??
Machine A runs for 10 seconds at 15 MIPS and the next 10 seconds at 45 MIPS
Average MIPS = ??
ECE437, S21 Vijaykumar and Thottethodi (33)
Dealing with Ratios
Absolute times
Now consider ratios (w.r.t. A)
Averages: A = 1, B = 5.05 ECE437, S21 Vijaykumar and Thottethodi (34)
Machine A
Machine B
Program 1
1s
10s
Program 2
1000s
100s
Machine A
Machine B
Program 1
1
10
Program 2
1
0.1
Dealing with Ratios
Absolute times
Now consider ratios (w.r.t. B)
Averages: A = 5.05, B = 1
Both cannot be true
ECE437, S21 Vijaykumar and Thottethodi (35)
Machine A
Machine B
Program 1
1s
10s
Program 2
1000s
100s
Machine A
Machine B
Program 1
0.1
1
Program 2
10
1
Geometric Mean
Dontusearithmeticmeanonratios(normalized numbers)
Usegeometricmeanforratios
geometric mean of ratios =
Use GM if forced to use ratios
Independentofreferencemachine(mathproperty)
Intheexample,GMformachineAis1,formachine
B is also 1
normalizedwithrespecttoeithermachine
ECE437, S21 Vijaykumar and (36) Thottethodi
n
n ratio
i i1
9
But..
Geometric mean of ratios is not proportional to total time
AM in example says machine B is 9.1 times faster
GM says they are equal
If we took total execution time, A and B are equal only if
program 1 is run 100 times more often than program 2
Generally, GM will mispredict for three or more machines
ECE437, S21 Vijaykumar and Thottethodi (37)
Previous Midterm Question
Machine A runs 9 times faster than machine B when running program-P,
Machine B runs 4 times faster than machine A when running program Q.
Which machine is faster (and by what factor) when averaged across both programs?
ECE437, S21 Vijaykumar and Thottethodi (38)
Summary for Averages
Use AM for times
Use HM if forced to use rates
Use GM if forced to use ratios
Better yet, use unnormalized, raw run times to compute performance
ECE437, S21 Vijaykumar and Thottethodi (39)
Amdahls Law
Whydoesthecommoncasematterthemost?
Letanoptimizationspeedffractionoftimebya
factor of s
assumingthatoldtime=T,whatisthespeedup?
fistheaffectedfractionofT (1-f) is the unaffected fraction
Speedup = timeold unaffectedold affectedold timenew unaffectednew affectednew
= (1f)TfT ( 1 f ) T sf T
ECE437, S21 Vijaykumar and Thottethodi
1
( 1 f ) sf
(40)
10
Amdahls Law Example
Your boss asks you to improve processor performance
Two options: What should you do?
improve the ALU used 95% of time, by 10%
improve the square-root unit used 5%, by a factor of 10
ECE437, S21 Vijaykumar and (41) Thottethodi
f
s
Speedup
95%
1.10
1.094
5%
10
1.047
5%
1.052
Amdahls Law: Limit
Make common case fast because:
10
8
11 6 lim
s 1ff s 1f 4 2 0
ECE437, S21 Vijaykumar and (42) Thottethodi
0 0.2 0.4 0.6 0.8 1 f
Amdahls Law
Make common case fast
Scientific heuristic, not religious commandment Use for intuition, verify with numbers
60% can be improved by a factor of 2 Speedup = 1/(0.4+0.6/2) = 1/0.7
40% can be improved by a factor of 8 Speedup = 1/(0.6+0.4/8) = 1/0.65
Second option is better
Less common case, but higher speedup compensates
ECE437, S21 Vijaykumar and Thottethodi (43)
Summary of Chapter 1b
Timeandperformance:
MachineAntimesfasterthanMachineB iff Time(B)/Time(A) = n
IronLaw:Time/prog
InstrcountxCPIxCycletime
OtherMetrics:MIPSandMFLOPS Bewareofpeakandomitteddetails
Benchmarks:SPEC2006(int+FP) Summarizeperformance:
AMfortime,HMforrate,GMforratio
Amdahls Law: Speedup = 1 common case fast
1ff s
ECE437, S21 Vijaykumar and (44) Thottethodi
11
Speedup
Single-cycle Datapath
CPI
Inst. Count
Cycle Time
Performance Implications
Minimize all three
Insts/prog fixed f(ISA,compiler)
CPI = 1 : As good as it gets (*)
Clock cycle time : high, lw critical path
ECE437, S21 Vijaykumar and Thottethodi (45)
Back to Ch4b
To improve performance beyond single- cycle datapath
Now that we understand performance
ECE437, S21 Vijaykumar and Thottethodi (46)
12
Reviews
There are no reviews yet.