PERFORMANCE PROGRAMMING
Introduction to Performance Optimisation II
Adrian Jackson

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." (Donald Knuth)
"In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal, and I believe the same viewpoint should prevail in software engineering." (Donald Knuth)

Architectural trends
Optimisation is the process of tuning a code to run faster in a particular hardware and software environment.
The hardware environment consists of many different resources
Accelerators, the network, etc.
Any of these resources could be the limiting factor for code performance
Which one depends on the application

Moore's Law
Observation that the number of transistors in an integrated circuit that can be manufactured economically doubles roughly every 2 years.
Increased transistor count allows more powerful CPUs.
Historically, smaller transistors also allowed faster clock speeds,
but power constraints now limit this type of scaling.
Only limited benefit available from smarter processors.
Most CPU designs now use the increased transistor count to increase parallelism:
More cores
Longer SIMD instructions
Etc.

CPU resources
In the early days of computing, memory accesses were essentially free, and optimisation consisted of reducing the instruction count.
This is no longer the case, and the gap is getting worse:
CPU performance has historically increased at approx. 80% per year (though recently this has been due to increasing core count rather than clock speed)
Memory speed increases at approx. 7% per year (though capacity increases at a much higher rate)
Most HPC codes/systems are memory-bandwidth limited
Unless major data structures can live in cache.

CPU bottlenecks
Note: the following slides are primarily x86-focussed.
Pipeline width
Number of operations/instructions that can be issued per cycle
5 for modern AMD and Intel processors
Reported under the uops_issued.any counter for Intel
Can address by loop unrolling, vectorisation, or algorithmic change (see the sketch after this list)
Available functional units
Also known as port or execution width
Varies by functional unit
For floating-point units this may be 1 or 2
Reported under the uops_dispatched_port counter for Intel
Can address in the same way as pipeline-width problems, or aim to replace several narrow instructions with fewer wider ones
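As a minimal sketch of loop unrolling in C (the function and array names are hypothetical): the unrolled body exposes several independent operations per iteration, giving the processor more instructions to issue each cycle.

    /* Hypothetical example: scaling an array, unrolled by 4.
       The four statements in the body are independent, so the
       out-of-order core has more work available to issue. */
    void scale(double *restrict y, const double *restrict x,
               double a, int n)
    {
        int i;
        for (i = 0; i + 3 < n; i += 4) {   /* main unrolled loop */
            y[i]     = a * x[i];
            y[i + 1] = a * x[i + 1];
            y[i + 2] = a * x[i + 2];
            y[i + 3] = a * x[i + 3];
        }
        for (; i < n; i++)                 /* remainder loop */
            y[i] = a * x[i];
    }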

CPU bottlenecks
Load limits
How many load instructions can be issued per cycle (from L1 cache)
Port width limit: 2 for current Intel and AMD
Not generally a problem for most code
Store limits
How many data items can be stored per cycle: 1 for Intel and AMD
Chain dependencies
Instruction pipeline performance depends on the slowest instruction in the chain
Pipeline latency of instructions varies
Can remedy by breaking up dependency chains or using other instructions (sketched below)
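A minimal sketch of breaking a dependency chain, using a simple array reduction as a hypothetical example: with one accumulator every addition depends on the previous one, so throughput is limited by the add latency; independent accumulators let the additions overlap in the pipeline.

    /* One accumulator: each add waits for the previous result. */
    double sum_chained(const double *x, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += x[i];
        return s;
    }

    /* Two independent accumulators: the two adds per iteration can
       execute in parallel, hiding the add latency. */
    double sum_unchained(const double *x, int n)
    {
        double s0 = 0.0, s1 = 0.0;
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            s0 += x[i];
            s1 += x[i + 1];
        }
        if (i < n)          /* remainder for odd n */
            s0 += x[i];
        return s0 + s1;
    }

Note that this changes the order of the floating-point additions, which is why compilers only make this transformation themselves under relaxed floating-point flags.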

CPU bottlenecks
Out-of-order buffers
Out-of-order scheduling of instructions requires storage (buffers) to keep track of in-flight instructions
Re-order Buffer (ROB), Load Buffer, Store Buffer
Can remedy by restructuring code so dependent statements are closer together, although this may impact overlap potential
Register file size
ROB entries need associated registers, which are stored in register files (typically one for scalar and one for vector operations)
These can fill up before the ROB is full
Can remedy by mixing vector and scalar instructions, or by reordering code as for the out-of-order buffer issues

Memory resources
Old HPC codes were often written for vector processors.
Vector processing typically requires a high memory bandwidth.
Vector systems are typically built with large numbers of memory banks, giving high peak memory bandwidth.
Vector instructions have a very regular access pattern but tend to have low data re-use. Caches were rare on vector systems.
Old vector codes were often written with stride-1 data access patterns to maximise memory bank use.
These codes often performed badly on cache-based microprocessors, as stride-1 access was obtained at the cost of temporal locality.
Modern SIMD instructions are effectively vector instructions
SIMD load/stores also need stride-1 access but also need to co-exist with caches.

Caches
Modern microprocessors rely heavily on caches:
Fast on-chip memory that stores copies of a sub-set of the data from main memory
Reduces the demands on external memory, provided data access patterns exhibit locality
Particularly beneficial if important data structures can exist entirely within cache.
Current HPC codes are written to increase data locality (and therefore cache benefit) as much as possible.
Cache sizes have been growing with Moore's Law, which has helped offset the slow growth in DRAM speed.
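A minimal sketch of why locality matters, using a hypothetical 2-D array: C stores arrays row by row, so the row-wise loop below walks memory with stride 1 and uses every element of each cache line it fetches, while the column-wise loop jumps a full row between accesses and typically wastes most of each line.

    #define ROWS 4096
    #define COLS 4096

    static double a[ROWS][COLS];

    /* Stride-1 traversal: consecutive accesses touch consecutive
       memory locations, so each cache line is fully used. */
    double sum_rowwise(void)
    {
        double s = 0.0;
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                s += a[i][j];
        return s;
    }

    /* Large-stride traversal: consecutive accesses are COLS*8 bytes
       apart, so almost every access pulls in a fresh cache line. */
    double sum_colwise(void)
    {
        double s = 0.0;
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                s += a[i][j];
        return s;
    }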

Commercial factors
Computer design is all about balancing performance and cost.
We know how to build faster computers than we can afford to build
And faster than we can afford to run.
Cheaper technologies are used where they are good enough.
HPC is a very small market compared to desktop/business/mobile systems.
HPC systems are often built out of components designed for a different purpose.
Economies of scale make mainstream technology cheaper

Manufacturing trade-off example
Processors need complex manufacturing processes, e.g. many metal layers.
Expensive to manufacture
DRAM can be manufactured on a much simpler process.
Fewer metal layers
Much less expensive
DRAM and processors are manufactured as separate components for cost reasons.
The simple DRAM process forces the interface to be simple too.
Separation of RAM and processor impacts performance

Chip stacking
One promising trend in CPU design is chip stacking.
Traditionally, DRAM and CPU were distinct components mounted on a common circuit board.
Stacks of very thin DRAM chips can be mounted directly on top of the CPU
Alternatively both are mounted on a common carrier chip
Through Silicon Vias (TSVs) (electrical connections that penetrate through the thin chips) allow 3D connectivity within the stack.
Increased connectivity has potential to allow memory performance to grow at similar rate to CPU performance.

Comparing codes
HPC codes:
Large use of floating point
Programs are quite small
Data is quite large
Data in simple arrays
Regular access patterns
Limited cache utilisation unless array fits in cache
Simple control flow
Most time spent in a small number of loops
Desktop applications:
Low use of floating point
Large, complicated programs
Data relatively small
Data in complex structures
Random access patterns
Typically good reuse of data once accessed
Complex control flow
Branches and pointer chasing

Types of optimisation
(Figure: the optimisation types below, plotted against axes of impact and difficulty)
Data structures
Code structure
Library calls
Compiler flags
Algorithm
Parallelism

Compiler flags
The easiest things to change are the compiler flags.
Most compilers will have good optimisation by default.
Some compiler optimisations are not always beneficial and need to be requested explicitly.
Some need additional guidance from the user (e.g. inter-file in-lining).
Some break language standards and need to be requested explicitly.
E.g. a/2 -> a*0.5 is contrary to the Fortran standard but is usually safe (see the sketch below).
Usually worthwhile to read the compiler manual pages before optimising.
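A minimal sketch of the kind of transformation meant here, in C (the functions are hypothetical; the flag names are GCC's, and other compilers have equivalents):

    /* Division by a power of two: a * 0.5 is bit-identical to
       a / 2.0, so this rewrite is safe and compilers will
       typically make it without special flags. */
    double halve(double a)
    {
        return a / 2.0;       /* may be compiled as a * 0.5 */
    }

    /* Division by a general constant: a * (1.0/3.0) rounds
       differently from a / 3.0, so the compiler only substitutes
       the multiply under relaxed floating-point flags such as
       GCC's -freciprocal-math (implied by -ffast-math). */
    double third(double a)
    {
        return a / 3.0;
    }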

Debugging flags
Most compilers have flags to enable debugging.
Traditionally, debugging flags disable some optimisations;
this is to make it easier to associate machine instructions with lines of source code.
Many compilers now support optimisation and debugging together.
May have alternative flags to allow some debugging with optimisation enabled.

Library calls
The easiest way to make a big impact on code performance is to re-use existing optimised code.
Libraries represent a large amount of development effort:
Somebody else has put in the effort so you don't have to.
Code libraries exist for commonly used functionality (e.g. linear algebra, FFTs, etc.).
Often possible to access highly optimised versions of these libraries.
Even if the application code does not use a standard library, it is often easy to re-write it to use the standard calls (e.g. the BLAS sketch below).
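As a minimal sketch, assuming a CBLAS implementation such as OpenBLAS or Intel MKL is available (the matrix size and fill values are arbitrary), a hand-written triple loop is replaced by a single optimised dgemm call:

    #include <stdio.h>
    #include <cblas.h>   /* assumes a CBLAS implementation is installed */

    #define N 512

    static double A[N * N], B[N * N], C[N * N];

    int main(void)
    {
        /* Arbitrary fill values for illustration. */
        for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }

        /* C = 1.0 * A * B + 0.0 * C for row-major N x N matrices.
           The library routine is typically blocked for the cache
           hierarchy and vectorised, unlike a naive triple loop. */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, A, N, B, N, 0.0, C, N);

        printf("C[0] = %f\n", C[0]);   /* expect 2.0 * N = 1024.0 */
        return 0;
    }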

Using libraries
Don't just use libraries blindly:
Read the documentation.
Learn what the library is capable of.
Libraries often have multiple ways of doing the same thing. You need to understand which is best for your code.
Many modern libraries support auto-tuning:
Can run internal experiments to automatically optimise the implementation for a given problem.
The application programmer needs to cooperate on this.
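FFTW is one well-known example of an auto-tuning library; the sketch below (with an arbitrary transform size) shows how the programmer cooperates with it. With FFTW_MEASURE the planner times several candidate implementations for this size on this machine, so the expensive plan creation should happen once, outside any timing-critical loop:

    #include <fftw3.h>   /* assumes FFTW 3 is installed */

    int main(void)
    {
        int n = 4096;    /* arbitrary transform size */
        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

        /* FFTW_MEASURE runs internal experiments to pick the
           fastest implementation; create the plan once, re-use
           it many times. */
        fftw_plan p = fftw_plan_dft_1d(n, in, out,
                                       FFTW_FORWARD, FFTW_MEASURE);

        /* ... fill 'in' (after planning, as FFTW_MEASURE may
           overwrite the arrays), then execute as needed ... */
        fftw_execute(p);

        fftw_destroy_plan(p);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }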

Algorithms
The biggest performance increases typically require a change to the underlying algorithm.
Consider changing an O(N^2) sort algorithm to an O(N log N) algorithm.
This is a lot of work, as the relevant code section usually needs a complete re-write.
A warning
The complexity of an algorithm (O(N), O(log N), O(N log N), etc.) is related to the number of operations and is not always a reliable indication of performance.
The pre-factor may make a worse algorithm perform better for the values of N of interest (see the worked example below).
The worse algorithm may have much better cache re-use or exhibit more available parallelism.
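As an illustrative worked example with made-up pre-factors: suppose an O(N^2) algorithm costs 2*N^2 operations while an O(N log N) alternative costs 100*N*log2(N) operations. The asymptotically worse algorithm is cheaper whenever 2*N^2 < 100*N*log2(N), i.e. N < 50*log2(N), which holds for all N up to roughly 440. If the problem sizes of interest lie in the hundreds, the "worse" algorithm wins, before even considering its cache behaviour.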

Many scientific algorithms' performance can be improved by mathematical approaches, e.g. pre-conditioning matrices:
Often increases runtime per iteration
But reduces iterations to convergence, etc.
Another example is strength reduction, replacing an expensive operation with a cheaper one:
In Pascal-style pseudocode, a multiplication inside the loop becomes a running addition:

Before:
    t := b * c;
    FOR i := 1 TO 10000 DO
    BEGIN
        a := t;
        d := i * 3;
        ...
    END

After:
    t := b * c;
    d := 0;
    FOR i := 1 TO 10000 DO
    BEGIN
        a := t;
        d := d + 3;
        ...
    END

In the extreme, a whole loop can be replaced by a closed form, as in this C example:

Before:
    int i, sum = 0;
    for (i = 1; i <= N; ++i) { sum += i; }
    printf("sum: %d\n", sum);

After:
    int sum = N * (1 + N) / 2;
    printf("sum: %d\n", sum);
Data structures
Changing the program's data structures can often give good performance improvements.
These are often global changes to the program and therefore expensive. Code re-writing tools can help with this.
Easier if data structures are reasonably opaque, declared once: objects, structs, F90 types, included common blocks.
As memory access is often the major performance bottleneck, the benefits can be great:
Improve cache/register utilisation
Avoid pointer chasing
May be able to avoid memory access problems by changing code structure in key areas instead.

Code structure
Most optimisations involve changes to code structure:
Loop unrolling
Loop fusion
Routine in-lining
Often overlap with optimisations attempted by the compiler.
Often better to help the compiler to do this than perform the change by hand.
Easier to implement than data changes as more localised.
Performance impact is often also smaller, unless the code fragment is a major time consumer.
Performance improvement often comes at the expense of code maintainability.
Try to keep the unoptimised version up to date as well.

Parallelism
Parallel computing is usually an optimisation technique:
Multiple processors are used to improve run-time.
Large numbers of processors can result in very significant performance improvements.
Software effort to develop parallel code can be very high.
Code runs faster in real time but may consume more CPU cycles in total.

Instruction level parallelism
Most modern CPUs include some kind of parallel instructions:
SIMD instructions, where one instruction performs multiple operations.
Pipelined or superscalar architectures, where multiple (independent) instructions are executing at the same time.
Largely the compiler finds and exploits this kind of parallelism automatically
Therefore generally easy to exploit
May still have to change the source code to:
Increase the parallelism available to the compiler
Make it easier for the compiler to recognise the parallelism

Shared memory parallelism
Multiple processors share a single memory system.
Data structures can be shared and need not be changed. Only the work needs to be distributed (typically using threads).
The compiler does most of the work but usually needs some additional directives (see the OpenMP sketch after the next slide).
Typically the most flexible form of multi-processor parallelism:
Parallelism can be introduced incrementally; only critical routines need to be parallel.
Often can add parallel directives to a serial code, automatically keeping parallel and serial versions in sync.
Makes greater demands on hardware designers:
Easy for parallel advantage to be lost as processors compete for access to memory.

Distributed memory parallelism
Processors have private memory systems and perform explicit communications:
Message passing, remote memory access calls, etc.
Usually a large software development task. The whole code has to run in parallel.
Very little if any support from the compiler. Data structures usually need changing.
Exceptions:
Regular domain decomposition: serial code data structures are often OK, they just have to be smaller. Boundary condition code needs to be changed.
Replicated data: each node keeps a private copy of the global data structures. Easy to implement but often does not scale well.
Distributed memory codes can easily run on shared memory systems, but not vice-versa.
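A minimal sketch of the directive-based shared-memory approach, using OpenMP on a hypothetical loop: one directive distributes the iterations across threads, and because the pragma is ignored when OpenMP is disabled, the same source serves as both the serial and the parallel version.

    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double x[N], y[N];
        double a = 2.0;

        for (int i = 0; i < N; i++) { x[i] = i; y[i] = 0.0; }

        /* The directive splits the iterations across threads;
           compiled without OpenMP support it is simply ignored
           and the loop runs serially. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] = a * x[i];

        printf("y[N-1] = %f\n", y[N - 1]);
        return 0;
    }

Compile with the compiler's OpenMP flag (e.g. -fopenmp for GCC) to enable the parallel version.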
Accelerators
Special purpose hardware
Very good at some types of operations
Can't run the entire code.
GPUs are a common example, but accelerators can also be built out of FPGAs.
Main CPU and accelerator handle different sections of the code:
Usually some overhead in switching
Often data has to be copied between devices
Different memory layouts might be appropriate for each device
Trend towards greater compiler support for acceleration
Programmer still needs to understand both architectures to get best performance.
Successful accelerator architectures influence new processor designs, so the boundary blurs over time.

Communication costs
Data transfer over networks costs performance:
Memory copy costs
Network transfer costs
On-node communication costs can be optimised:
Shared-memory segments for local data transfers
Off-node communication costs can be optimised:
Reduced-copy communications, RDMA (remote direct memory access)
Reduced CPU requirements
Network communication maps/patterns
Reduce message numbers
Compression
Reduce data volumes at the expense of CPU time

Knowing when to stop
Hardware metrics
Estimating program requirements
Roofline model (see the worked example below)
There are many ways to optimise a program:
Need to select the most appropriate for the situation
Doing nothing is an option
Must always weigh the benefit (faster execution) against the costs:
Programmer effort
Loss of maintainability
Loss of portability
Performance measurement, analysis and testing are key parts of the process.
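As an illustrative worked example of the roofline model (the machine numbers are hypothetical): suppose a node has a peak of 200 Gflop/s and a memory bandwidth of 100 GB/s, so the ridge point sits at an arithmetic intensity of 200/100 = 2 flops per byte. A dot product performs about 2 flops per 16 bytes loaded (intensity 0.125 flops/byte), so its roofline bound is 0.125 * 100 GB/s = 12.5 Gflop/s, far below peak. Such a kernel is memory-bound: further instruction-level tuning cannot help, and only improving data locality or reducing data movement can. Knowing this tells you when to stop optimising.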
