Computer Architecture
Course code: 0521292B 11. Optimizing Cache Performance
Jianhua Li
College of Computer and Information Hefei University of Technology
Slides are adapted from the CA courses of Wisc, Princeton, MIT, Berkeley, etc.
The uses of the slides of this course are for educational purposes only and should be
used only in conjunction with the textbook. Derivatives of the slides must
acknowledge the copyright notices of this and the originals.
How to Improve Cache Performance
Three fundamental goals:
Reducing miss rate
Reminder: reducing miss rate can reduce performance if more costly-to-refetch blocks are evicted
Reducing miss latency or miss cost
Reducing hit latency or hit cost
Basic Tricks to Improve Performance
Reducing miss rate
More associativity
Alternatives/enhancements to associativity
Victim caches, hashing, pseudo-associativity, skewed associativity
Better replacement/insertion policies
Software approaches
Reducing miss latency/cost
Multi-level caches
Critical word first
Sub-blocking/sectoring
Better replacement/insertion policies
Non-blocking caches (multiple cache misses in parallel)
Issues in multi-core caches
Ways of Reducing Conflict Misses
Instead of building highly-associative caches, many other approaches have been proposed in the literature:
Victim Caches
Hashed/randomized index functions
Pseudo-associativity
Skewed Associative Caches
Victim Cache: Reducing Conflict Misses
[Figure: a direct-mapped cache with a small victim cache between it and the next-level cache; blocks evicted from the direct-mapped cache are placed in the victim cache.]
Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," ISCA 1990.
Idea: Use a small fully-associative buffer (victim cache) to store evicted blocks
Can avoid ping-ponging of cache blocks mapped to the same set, if two cache blocks that are continuously accessed in nearby time conflict with each other
Increases miss latency if accessed serially with L2; adds complexity
(A lookup sketch follows below.)
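As an illustration only (not from the slides), a minimal C sketch of the victim-cache lookup path, assuming a direct-mapped L1 backed by a tiny fully-associative victim buffer; all names and sizes are hypothetical:

#include <stdbool.h>
#include <stdint.h>

#define VC_ENTRIES 4   /* victim caches are small, e.g., a handful of blocks */

struct vc_entry { bool valid; uint64_t block_addr; /* block data omitted */ };
static struct vc_entry victim_cache[VC_ENTRIES];

/* On an L1 miss, probe the victim cache before going to the next level.
   On a victim hit, the block would be swapped back into its L1 set. */
bool victim_cache_lookup(uint64_t block_addr)
{
    for (int i = 0; i < VC_ENTRIES; i++)
        if (victim_cache[i].valid && victim_cache[i].block_addr == block_addr)
            return true;   /* hit: swap with the block just evicted from L1 */
    return false;          /* miss: access the next-level cache */
}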
Hashing and Pseudo-Associativity
Hashing: Use better "randomizing" index functions; can reduce conflict misses by distributing the accessed memory blocks more evenly across sets
Example of conflicting accesses: a strided access pattern where the stride value equals the number of sets in the cache
More complex to implement: can lengthen the critical path
Pseudo-associativity
View each way of the set as a separate direct-mapped cache
Ways are searched in sequence: first, second, ...
On a hit in the k-th way, the line is promoted to the first way, and the lines in ways 1 to k-1 are each demoted by one way.
On a miss, the item is filled into the first way, the item in the n-th (last) way is evicted, and the items in ways 1 to n-1 are each demoted by one way.
(A promotion/demotion sketch follows below.)
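A minimal sketch of the search/promotion logic described above, assuming a 4-way set viewed as four direct-mapped ways; the data structures and function name are illustrative:

#include <stdbool.h>
#include <stdint.h>

#define NUM_WAYS 4

struct line { bool valid; uint64_t tag; /* data omitted */ };
struct set  { struct line way[NUM_WAYS]; };

/* Ways are probed in order; the hit way number determines the hit latency.
   Returns the way that hit, or -1 on a miss (after filling way 0). */
int pseudo_assoc_access(struct set *s, uint64_t tag)
{
    for (int k = 0; k < NUM_WAYS; k++) {
        if (s->way[k].valid && s->way[k].tag == tag) {
            struct line hit = s->way[k];
            for (int j = k; j > 0; j--)   /* demote ways 0..k-1 by one */
                s->way[j] = s->way[j - 1];
            s->way[0] = hit;              /* promote the hit line to way 0 */
            return k;
        }
    }
    for (int j = NUM_WAYS - 1; j > 0; j--)  /* miss: evict the last way,  */
        s->way[j] = s->way[j - 1];          /* demote the rest by one,    */
    s->way[0].valid = true;                 /* and fill into way 0        */
    s->way[0].tag = tag;
    return -1;
}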
Skewed Associative Caches
Idea: Reduce conflict misses by using different index functions for each cache way
Seznec, "A Case for Two-Way Skewed-Associative Caches," ISCA 1993.
Skewed Associative Caches I
[Figure: basic 2-way set-associative cache structure. The address is split into tag / index / byte-in-block; the same index function selects the set in Way 0 and Way 1, and the tag is compared against both ways.]
Skewed Associative Caches II
[Figure: skewed-associative cache. Each bank (way) has a different index function (f0 for Way 0, another for Way 1), so blocks that map to the same set in Way 0 are redistributed to different sets in Way 1.]
Skewed Associative Caches III
Idea: Reduce conflict misses by using different index functions for each cache way
Benefit: indices are more randomized; memory blocks are better distributed across sets due to hashing
It is less likely that two blocks have the same index, so conflict misses are reduced
Cost: additional latency of the hash function
(An index-function sketch follows below.)
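A minimal sketch of per-way index functions for a 2-way skewed-associative cache; the XOR-based hash below is an illustrative choice, not Seznec's exact function, and the bit widths are assumptions:

#include <stdint.h>

#define SET_BITS   8                         /* 256 sets per way (illustrative) */
#define SET_MASK   ((1u << SET_BITS) - 1)
#define BLOCK_BITS 6                         /* 64-byte blocks */

static inline uint32_t index_way0(uint64_t addr)
{
    return (uint32_t)((addr >> BLOCK_BITS) & SET_MASK);        /* conventional index */
}

static inline uint32_t index_way1(uint64_t addr)
{
    uint64_t idx = addr >> BLOCK_BITS;
    return (uint32_t)((idx ^ (idx >> SET_BITS)) & SET_MASK);   /* hashed index */
}

/* Two blocks that conflict in Way 0 (same index_way0) usually map to
   different sets in Way 1, so they can still coexist in the cache. */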
Example Software Approaches
Restructuring data access patterns
Restructuring data layout
Loop interchange
Data structure separation/merging
Blocking
Restructuring Data Access Patterns I
Idea: Restructure data layout or access patterns
Example: If the layout is column-major:
x(i+1, j) follows x(i, j) in memory
x(i, j+1) is far away from x(i, j)
Poor code:
for i = 1, rows
  for j = 1, columns
    sum = sum + x(i, j)
Better code:
for j = 1, columns
  for i = 1, rows
    sum = sum + x(i, j)
This is called loop interchange
Other optimizations can also increase hit rate: loop fusion, array merging, ...
What if there are multiple arrays? What if the array size is unknown at compile time?
(A C version follows below.)
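The slide example above assumes column-major (Fortran-style) layout. As an illustration only, a C counterpart: C arrays are row-major, so the cache-friendly order is the one with the column index in the inner loop. The function names and sizes are illustrative:

#include <stddef.h>

#define ROWS 1024
#define COLS 1024

double sum_column_order(double x[ROWS][COLS])   /* strided accesses: poor locality in C */
{
    double sum = 0.0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            sum += x[i][j];
    return sum;
}

double sum_row_order(double x[ROWS][COLS])      /* sequential accesses: good locality in C */
{
    double sum = 0.0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            sum += x[i][j];
    return sum;
}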
Restructuring Data Access Patterns II
Blocking for better cache utilization
Divide loops operating on arrays into computation chunks so that each chunk can hold its data in the cache
Avoids cache conflicts between different chunks of computation
Essentially: Divide the working set so that each piece fits in the cache
But, there are still self-conflicts within a block; in addition:
1. there can be conflicts among different arrays
2. array sizes may be unknown at compile/programming time
(A blocked-loop sketch follows below.)
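A minimal sketch of blocking (tiling), using matrix multiplication as a classic example; N and the tile size B are illustrative, and B would be tuned so that a few B x B tiles fit in the cache:

#include <stddef.h>

#define N 1024
#define B 64   /* tile size; N is a multiple of B here for simplicity */

void matmul_blocked(const double A[N][N], const double Bm[N][N], double C[N][N])
{
    for (size_t ii = 0; ii < N; ii += B)
        for (size_t jj = 0; jj < N; jj += B)
            for (size_t kk = 0; kk < N; kk += B)
                /* Work on one B x B chunk at a time so its data stays cached. */
                for (size_t i = ii; i < ii + B; i++)
                    for (size_t k = kk; k < kk + B; k++)
                        for (size_t j = jj; j < jj + B; j++)
                            C[i][j] += A[i][k] * Bm[k][j];
}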
Restructuring Data Layout I
struct Node {
  struct Node *next;
  int key;
  char name[256];
  char school[256];
};

while (node) {
  if (node->key == input_key) {
    /* access other fields of node */
  }
  node = node->next;
}

Pointer-based traversal (e.g., of a linked list)
Why does the code above have a poor cache hit rate?
Other fields occupy most of the cache line even though they are rarely accessed!
How to improve?
Assume a huge linked list (1M nodes) and unique keys
Restructuring Data Layout II
struct Node {
  struct Node *next;
  int key;
  struct Node_data *node_data;
};

struct Node_data {
  char name[256];
  char school[256];
};

while (node) {
  if (node->key == input_key) {
    /* access node->node_data */
  }
  node = node->next;
}
Idea: separate frequently used fields of a data structure and pack them into a separate data structure
Who should do this?
Programmer
Compiler
Profiling vs. dynamic
Hardware?
Who can determine what is frequently used?
Improving Basic Cache Performance
Reducing miss rate
More associativity
Alternatives/enhancements to associativity
Victim caches, hashing, pseudo-associativity, skewed associativity
Better replacement/insertion policies
Software approaches
Reducing miss latency/cost
Multi-level caches
Critical word first
Sub-blocking/sectoring
Better replacement/insertion policies
Non-blocking caches (multiple cache misses in parallel)
Issues in multi-core caches
Miss Latency/Cost
What is miss latency or miss cost affected by?
Where does the miss get serviced from?
Local vs. remote memory
What level of cache in the hierarchy?
Row hit versus row miss (in DRAM)
Queueing delays in the memory controller
How much does the miss stall the processor?
Is it overlapped with other latencies?
Is the data immediately needed?
Memory Level Parallelism (MLP)
[Timeline: miss A is serviced alone (an isolated miss), while misses B and C overlap in time (parallel misses).]
Memory Level Parallelism (MLP) means generating and servicing multiple memory accesses in parallel [Glew '98]
Several techniques to improve MLP (e.g., out-of-order execution)
MLP varies: some misses are isolated and some are parallel
How does this affect cache replacement?
Traditional Cache Replacement Policies
Traditional cache replacement policies try to reduce miss count
Implicit assumption: reducing miss count reduces memory-related stall time
Misses with varying cost/MLP break this assumption!
Eliminating an isolated miss helps performance more than eliminating a parallel miss
Eliminating a higher-latency miss could help performance more than eliminating a lower-latency miss
An Example
Access sequence: P4 P3 P2 P1  P1 P2 P3 P4  S1 S2 S3
Misses to blocks P1, P2, P3, P4 can be parallel; misses to blocks S1, S2, and S3 are isolated
Two replacement algorithms:
1. Minimize miss count (Belady's OPT)
2. Reduce isolated misses (MLP-Aware)
For a fully associative cache containing 4 blocks
Fewest Misses = Best Performance?
Access sequence: P4 P3 P2 P1  P1 P2 P3 P4  S1 S2 S3
Belady's OPT replacement:
Hit/Miss: H H H M  H H H H  M M M
Misses = 4, Stalls = 4 (each of the four misses is isolated, so each one stalls the processor)
MLP-Aware replacement:
Hit/Miss: H M M M  H M M M  H H H
Misses = 6, Stalls = 2, saved cycles (each group of three P misses is serviced in parallel; the isolated S accesses now hit)
Minimizing miss count (Belady's OPT) does not minimize stall time: the MLP-aware policy takes more misses but stalls less.
MLP-Aware Cache Replacement
How do we incorporate MLP into replacement decisions?
Qureshi et al., "A Case for MLP-Aware Cache Replacement," ISCA 2006.
Suggested reading for homework
Enabling Multiple Outstanding Misses
Handling Multiple Outstanding Accesses
Question: If the processor can generate multiple cache accesses, can the later accesses be handled while a previous miss is outstanding?
Goal I: Enable cache access when there is a pending miss
Goal II: Enable multiple misses in parallel
Memory-level parallelism (MLP)
Solution: Non-blocking (or lockup-free) caches
Kroft, "Lockup-Free Instruction Fetch/Prefetch Cache Organization," ISCA 1981.
Handling Multiple Outstanding Accesses
Idea: Keep track of the status/data of misses that are being handled using Miss Status Handling Registers (MSHRs)
A cache access checks MSHRs to see if a miss to the same block is already pending.
If pending, a new request is not generated
If pending and the needed data is available, the data is forwarded to the later load
Requires buffering of outstanding miss requests
Miss Status Handling Register
Also called miss buffer
Keeps track of
Outstanding cache misses
Pending load/store accesses that refer to the missing cache block
Fields of a single MSHR entry
Valid bit
Cache block address (to match incoming accesses)
Control/status bits (prefetch, issued to memory, which sub-blocks have arrived, etc.)
Data for each sub-block
For each pending load/store, keep track of:
Valid, type (load/store), data size, byte in block, destination register or store-buffer entry address
MSHR Entry Demo
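In place of the original demo figure, a hedged C-struct sketch of one MSHR entry with the fields listed on the previous slide; the field widths, sub-block count, and number of pending accesses per entry are assumptions:

#include <stdbool.h>
#include <stdint.h>

#define SUBBLOCKS_PER_BLOCK 4    /* illustrative: 4 sub-blocks per cache block */
#define SUBBLOCK_BYTES      16
#define MAX_PENDING         8    /* pending loads/stores tracked per entry */

struct pending_access {
    bool    valid;
    bool    is_store;            /* type: load or store */
    uint8_t size;                /* access size in bytes */
    uint8_t byte_in_block;       /* offset within the cache block */
    uint8_t dest;                /* destination register or store-buffer entry */
};

struct mshr_entry {
    bool     valid;
    uint64_t block_addr;                         /* matched against incoming accesses */
    bool     is_prefetch;                        /* control/status bits */
    bool     issued_to_memory;
    bool     subblock_arrived[SUBBLOCKS_PER_BLOCK];
    uint8_t  data[SUBBLOCKS_PER_BLOCK][SUBBLOCK_BYTES];
    struct pending_access pending[MAX_PENDING];  /* loads/stores waiting on this block */
};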
MSHR Operation
On a cache miss:
Search MSHRs for a pending access to the same block
Found: Allocate a load/store entry in the same MSHR entry
Not found: Allocate a new MSHR entry
No free entry: stall (structural hazard)
When a sub-block returns from the next level in the memory hierarchy:
Check which loads/stores are waiting for it
Forward the data to the load/store unit
Deallocate the load/store entry in the MSHR entry
Write the sub-block into the cache or the MSHR
If this is the last sub-block, deallocate the MSHR entry after writing the block into the cache
(A sketch of this flow follows below.)
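As an illustration only, a C sketch of the miss-side search/allocate steps listed above, reusing the struct mshr_entry type from the earlier sketch; the MSHR count and function name are hypothetical:

/* Assumes the struct mshr_entry definition from the previous sketch is in scope. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_MSHRS 8                       /* illustrative number of MSHR entries */

static struct mshr_entry mshrs[NUM_MSHRS];

/* Returns the MSHR entry tracking this block, or NULL when no entry is free
   (the access must then stall: a structural hazard). */
struct mshr_entry *handle_miss(uint64_t block_addr)
{
    /* 1. Search MSHRs for a pending miss to the same block. */
    for (int i = 0; i < NUM_MSHRS; i++)
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr)
            return &mshrs[i];             /* merge: add a load/store entry here */

    /* 2. Not found: allocate a new MSHR entry for this block. */
    for (int i = 0; i < NUM_MSHRS; i++)
        if (!mshrs[i].valid) {
            mshrs[i].valid = true;
            mshrs[i].block_addr = block_addr;
            mshrs[i].issued_to_memory = false;   /* request goes to the next level */
            return &mshrs[i];
        }

    /* 3. No free entry: signal a stall. */
    return NULL;
}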
Non-Blocking Cache Implementation
When to access the MSHRs?
In parallel with the cache?
After cache access is complete?
MSHRs need not be on the critical path of hit requests
Which one below is the common case?
Cache miss, MSHR hit
Cache hit
Multi-Core Issues in Caching
Caches in Multi-Core Systems
Cache efficiency becomes even more important in a multi-core/multi-threaded system
Memory bandwidth is at a premium
Cache space is a limited resource
How do we design the caches in a multi-core system?
Many decisions:
Shared vs. private caches
How to maximize performance of the entire system?
How to provide QoS to different threads in a shared cache?
Should cache management algorithms be aware of threads?
How should space be allocated to threads in a shared cache?
Private vs. Shared Caches
Private cache: Cache belongs to one core (a shared block can be in multiple caches)
Shared cache: Cache is shared by multiple cores
[Figure: left, private caches: CORE 0-3 each have their own L2 cache, all connected to the DRAM memory controller. Right, shared cache: CORE 0-3 share a single L2 cache in front of the DRAM memory controller.]
Resource Sharing and Its Advantages
Idea: Instead of dedicating a hardware resource to a
hardware context, allow multiple contexts to use it
Example resources: functional units, pipeline, caches, buses, memory
Why?
Resource sharing improves utilization/efficiency/throughput
When a resource is left idle by one thread, another thread can use it; no need to replicate shared data
Reduces communication latency
For example, shared data is kept in the same cache in multithreaded processors
Compatible with the shared memory model
Resource Sharing Disadvantages
Resource sharing results in contention for resources
When the resource is not idle, another thread cannot use it
If space is occupied by one thread, another thread needs to re-occupy it
Sometimes reduces each or some threads' performance
Thread performance can be worse than when it is run alone
Eliminates performance isolation: inconsistent performance across runs
Thread performance depends on co-executing threads
Uncontrolled (free-for-all) sharing degrades QoS
Causes unfairness, starvation
Need to efficiently and fairly utilize shared resources
This is crucial for modern data centers
Shared Caches Between Cores
Advantages:
High effective capacity
Dynamic partitioning of available cache space
No fragmentation due to static partitioning
Easier to maintain coherence (single copy)
Shared data and locks do not ping pong between caches
Disadvantages
Slower access
Cores incur conflict misses due to other cores' accesses
Misses due to inter-core interference
Some cores can destroy the hit rate of other cores
Guaranteeing a minimum level of service or fairness to each core is harder (how much space, how much bandwidth?)
Shared Caches: How to Share?
Free-for-all sharing
Placement/replacement policies are the same as in a single-core system (usually LRU or pseudo-LRU)
Not thread/application aware
An incoming block evicts a block regardless of which threads
the blocks belong to
Problems
Inefficient utilization of cache: LRU is not the best policy
A cache-unfriendly application can destroy the performance of a cache-friendly application
Not all applications benefit equally from the same amount of cache: free-for-all might prioritize those that do not benefit
Reduced performance, reduced fairness
Optimization: Utility Based Cache Partitioning
Goal: Maximize system throughput
Observation: Not all threads/applications benefit equally from caching; simple LRU replacement is not good for system throughput
Idea: Allocate more cache space to applications that obtain the most benefit from more space
(A simplified allocation sketch follows the references below.)
The highlevel idea can be applied to other shared resources as well.
Qureshi and Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," MICRO 2006.
Suh et al., "A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning," HPCA 2002.
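A hedged, simplified sketch of the allocation idea: greedily hand out cache ways to the application with the largest marginal gain in hits. The per-application hit curves would come from the hardware utility monitors described in the paper; here they are just an input array, and the paper's actual partitioning algorithm differs in its details:

#define NUM_APPS  2
#define NUM_WAYS  16

/* hits[a][w] = hits application a would get if given w ways (w = 0..NUM_WAYS). */
void partition_ways(const long hits[NUM_APPS][NUM_WAYS + 1],
                    int alloc[NUM_APPS])
{
    for (int a = 0; a < NUM_APPS; a++)
        alloc[a] = 0;

    for (int w = 0; w < NUM_WAYS; w++) {
        int  best_app  = 0;
        long best_gain = -1;
        for (int a = 0; a < NUM_APPS; a++) {
            long gain = hits[a][alloc[a] + 1] - hits[a][alloc[a]];  /* marginal utility */
            if (gain > best_gain) {
                best_gain = gain;
                best_app  = a;
            }
        }
        alloc[best_app]++;   /* give this way to the app that benefits most */
    }
}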
The MultiCore System: A Shared Resource View
[Figure: the multi-core system viewed as a set of cores connected to shared storage.]
Need for QoS and Shared Resource Mgmt.
Why is unpredictable performance or lack of QoS bad?
Makes the programmer's life difficult
An optimized program can get low performance, and its performance varies widely depending on co-runners
Causes discomfort to the user
An important program can starve
Examples from shared software resources
Makes system management difficult
How do we enforce a Service Level Agreement when the sharing of hardware resources is uncontrollable?
Resource Sharing vs. Partitioning
Sharing improves throughput
Better utilization of space
Partitioning provides performance isolation (predictable performance)
Dedicated space
Can we get the benefits of both?
Idea: Design shared resources such that they are
efficiently utilized, controllable and partitionable
No wasted resources
QoS mechanisms for threads
Shared Hardware Resources
Memory subsystem (in both multithreaded and multi-core systems)
Non-private caches
Interconnects
Memory controllers, buses, banks
I/O subsystem (in both multithreaded and multi-core systems)
I/O and DMA controllers
Ethernet controllers
Processor (in multithreaded systems)
Pipeline resources
L1 caches
Next Topic: Prefetching