Computer Architecture
Course code: 0521292B 10. Cache Memory
Jianhua Li
College of Computer and Information Hefei University of Technology
Slides are adapted from CA courses of Wisconsin, Princeton, MIT, Berkeley, etc.
The uses of the slides of this course are for educational purposes only and should be
used only in conjunction with the textbook. Derivatives of the slides must
acknowledge the copyright notices of this and the originals.

Cache Abstraction and Metrics
• Tag store: is the address in the cache? + bookkeeping
• Data store: stores memory blocks
[Diagram: address → tag store and data store → Hit/miss? and Data]
• Cache hit rate = (# hits) / (# hits + # misses) = (# hits) / (# accesses)
• Average memory access time (AMAT)
= ( hit-rate * hit-latency ) + ( miss-rate * miss-latency )
• Aside: Can reducing AMAT reduce performance?
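As a quick worked example of the AMAT formula above, here is a minimal Python sketch; the hit rate and latencies used are illustrative values, not numbers from the lecture.

# AMAT = (hit-rate * hit-latency) + (miss-rate * miss-latency), as defined above
def amat(hit_rate, hit_latency, miss_latency):
    miss_rate = 1.0 - hit_rate
    return hit_rate * hit_latency + miss_rate * miss_latency

# Illustrative numbers: 95% hit rate, 2-cycle hits, 100-cycle misses
print(amat(0.95, 2, 100))  # 0.95*2 + 0.05*100 = 6.9 cycles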

A Basic Hardware Cache Design
• We will start with a basic hardware cache design
• Then, we will examine a multitude of ideas to make it better

Blocks and Addressing the Cache
• Memory is logically divided into fixed-size blocks
• Each block maps to a location in the cache, determined
by the index bits in the address
– used to index into the tag and data stores
– Example 8-bit address: 2-bit tag | 3-bit index | 3-bit byte-in-block
• Cache access:
– index into the tag and data stores with index bits in address
– check valid bit in tag store
– compare tag bits in address with the stored tag in tag store
• If a block is in the cache (cache hit), the stored tag should be valid and match the tag of the block
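To make the address breakdown concrete, here is a minimal Python sketch that splits the 8-bit example address into tag, index, and byte-in-block fields; the field widths (2/3/3 bits) come from the slide, and the sample address is illustrative.

OFFSET_BITS = 3   # 8-byte blocks -> 3 byte-in-block bits
INDEX_BITS  = 3   # 8 blocks, direct-mapped -> 3 index bits

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(split_address(0b10_011_101))  # (2, 3, 5): tag=2, index=3, byte-in-block=5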

Direct-Mapped Cache
• Assume byte-addressable memory: 256 bytes, 8-byte blocks → 32 blocks
• Assume cache: 64 bytes, 8 blocks
– Direct-mapped: A block can go to only one location
• Address: 2-bit tag | 3-bit index | 3-bit byte-in-block
[Diagram: the index selects one tag-store entry (V, tag) and one data-store block; the stored tag is compared (=?) with the address tag to produce Hit?, and a MUX selects the byte in the block]
• Addresses with the same index contend for the same location
– Cause conflict misses

Direct-Mapped Caches
• Direct-mapped cache: Two blocks in memory that map to the same index in the cache cannot be present in the cache at the same time
– One index → one entry
• What's the problem?
• Can lead to 0% hit rate if more than one block accessed in an interleaved manner map to the same index
– Assume addresses A and B have the same index bits but different tag bits
– A, B, A, B, A, B, A, B, … → conflict in the cache index
– All accesses are conflict misses
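The thrashing pattern above can be reproduced with a minimal, tag-only Python model of the direct-mapped cache from the running example (8 blocks of 8 bytes); the two addresses are illustrative, chosen only so that they share an index but differ in tag.

NUM_SETS, BLOCK_SIZE = 8, 8
tags = [None] * NUM_SETS            # one tag per index (direct-mapped)

def access(addr):
    index = (addr // BLOCK_SIZE) % NUM_SETS
    tag   = addr // (BLOCK_SIZE * NUM_SETS)
    hit = (tags[index] == tag)
    if not hit:
        tags[index] = tag           # the incoming block evicts whatever was there
    return hit

A, B = 0x00, 0x40                   # same index (0), different tags
print([access(x) for x in [A, B, A, B, A, B]])  # all False: 0% hit rate, as described above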

Set Associativity
• Addresses 0 and 8 always conflict in direct mapped cache
• Instead of having one column of 8, have 2 columns of 4 blocks
• Address: 3-bit tag | 2-bit index | 3-bit byte-in-block
[Diagram: each SET has two (V, tag) entries in the tag store and two blocks in the data store; both stored tags are compared (=?) with the address tag in parallel, logic combines the results into Hit?, and MUXes select the hitting block and the byte in block]
• Key idea: Associative memory within the set
+ Accommodates conflicts better (fewer conflict misses)
-- More complex, slower access, larger tag store
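A minimal Python sketch of the 2-way lookup in the diagram above: both ways of the indexed set are probed and the per-way tag comparisons are combined into Hit?. The parameters match the running example (4 sets of 2 ways, 8-byte blocks); what happens on a miss is deliberately left to the replacement-policy slides.

NUM_SETS, NUM_WAYS, BLOCK_SIZE = 4, 2, 8
cache = [[(False, None)] * NUM_WAYS for _ in range(NUM_SETS)]   # (valid, tag) per way

def lookup(addr):
    index = (addr // BLOCK_SIZE) % NUM_SETS
    tag   = addr // (BLOCK_SIZE * NUM_SETS)
    for way, (valid, stored_tag) in enumerate(cache[index]):
        if valid and stored_tag == tag:   # one "=?" comparator per way
            return True, way              # hit in this way
    return False, None                    # miss: a replacement policy must pick a victim way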

Higher Associativity
• 4-way
[Diagram: four (V, tag) entries per set, four comparators (=?) feeding the hit logic, and a wider MUX selecting the hitting block and the byte in block]
+ Likelihood of conflict misses even lower
— More tag comparators and wider data mux; larger tags

Full Associativity
• Fully associative cache
– A block can be placed in any cache location
[Diagram: the address tag is compared (=?) against all 8 stored tags in parallel; logic produces Hit?, and MUXes select the hitting block and the byte in block]

Associativity (and Tradeoffs)
• Degree of associativity: How many blocks can map to the same set?
• Higher associativity
++ Higher hit rate
-- Slower cache access time (hit latency and data access latency)
-- More expensive hardware (more comparators)
• Diminishing returns from higher associativity
[Plot: hit rate vs. associativity, with diminishing returns]

Issues in Set-Associative Caches
• Think of each block in a set having a “priority”
– Indicating how important it is to keep the block in the cache
• Key issue: How do you determine block priorities?
• There are three key decisions that adjust priority:
– Insertion, promotion, eviction (replacement)
• Insertion: What happens to priorities on a cache fill?
– Where to insert the incoming block, whether or not to insert the block
• Promotion: What happens to priorities on a cache hit?
– Whether and how to change block priority
• Eviction/replacement: What happens to priorities on a cache miss?
– Which block to evict and how to adjust priorities

Our Focus: Replacement Policy
• Which block in the set to replace on a cache miss?
– Any invalid block first
– If all are valid, consult the replacement policy
• Random
• FIFO
• LRU: Least recently used (how to implement?)
• NRU: Not most recently used
• LFU: Least frequently used?
• Least costly to re-fetch?
– Why would memory accesses have different cost?
• Hybrid replacement policies
• Optimal replacement policy?

Implementing LRU
• Idea: Evict the least recently accessed block
• Problem: Need to keep track of access ordering of
blocks
• Question: 2-way set associative cache:
– What do you need to implement LRU perfectly?
• Question: 4-way set associative cache:
– What do you need to implement LRU perfectly?
– How many different orderings possible for the 4 blocks in the set?
– How many bits needed to encode the LRU order of a block?
– What is the logic needed to determine the LRU victim?
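One way to answer these questions is to keep an explicit recency order per set, as in the minimal Python sketch below (this is only one of several equivalent encodings). For a 2-way set a single bit per set suffices; for a 4-way set there are 4! = 24 possible orderings, so at least ceil(log2 24) = 5 bits of state per set are needed, and the victim is simply the way at the tail of the order.

from collections import deque

class LRUSet:
    def __init__(self, num_ways):
        self.order = deque(range(num_ways))   # order[0] = MRU way, order[-1] = LRU way

    def on_hit(self, way):
        self.order.remove(way)                # promotion: the hit way becomes MRU
        self.order.appendleft(way)

    def victim(self):
        return self.order[-1]                 # eviction: the least recently used way

    def on_fill(self, way):
        self.on_hit(way)                      # insertion: the newly filled block becomes MRU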

Approximations of LRU
• Most modern processors do not implement "true LRU" (also called "perfect LRU") in highly-associative caches
• Why?
– True LRU is complex
– LRU is an approximation to predict locality anyway (i.e., not the best possible cache management policy)
• Alternative examples:
– Not MRU (not most recently used)
– Hierarchical LRU: divide the 4-way set into 2-way “groups”, track the MRU group and the MRU way in each group
– Victim-Next-Victim Replacement: Only keep track of the victim and the next victim

LRU or Random?
• LRU vs. Random: Which one is better?
– Example: 4-way cache, cyclic references to A, B, C, D, E
• 0% hit rate with LRU policy
• Set thrashing: When the "program working set" in a set is larger than set associativity
– Random replacement policy is better when thrashing occurs
• In practice:
– Depends on workload
– Average hit rates of LRU and Random are similar
• Best of both worlds: Hybrid of LRU and Random
– How to choose between the two? Set sampling
• See Qureshi et al., "A Case for MLP-Aware Cache Replacement," ISCA 2006.

What Is the Optimal Replacement Policy?
• Belady's OPT
– Replace the block that is going to be referenced furthest in
the future by the program
– Belady, "A study of replacement algorithms for a virtual-storage computer," IBM Systems Journal, 1966.
– How do we implement this? Simulate?
• Is this optimal for minimizing miss rate?
• Is this optimal for minimizing execution time?
– No. Cache miss latency/cost varies from block to block!
– Two reasons: Remote vs. local caches and miss overlapping
– Qureshi et al. “A Case for MLP-Aware Cache Replacement,“ ISCA 2006.
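Because OPT needs future knowledge, it is normally evaluated offline over a recorded reference trace. Below is a minimal, single-set Python sketch of that simulation; the trace reuses the cyclic A..E example from the LRU-or-Random slide, and ties between blocks that are never referenced again are broken arbitrarily.

def opt_misses(trace, num_ways):
    resident, misses = set(), 0
    for i, block in enumerate(trace):
        if block in resident:
            continue                          # hit
        misses += 1
        if len(resident) < num_ways:
            resident.add(block)
            continue
        future = trace[i + 1:]
        # Victim: the resident block referenced furthest in the future (or never again)
        victim = max(resident,
                     key=lambda b: future.index(b) if b in future else float("inf"))
        resident.remove(victim)
        resident.add(block)
    return misses

print(opt_misses(list("ABCDEABCDE"), 4))  # 6 misses; LRU misses all 10 on this pattern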

Cache versus Page Replacement
• Physical memory (DRAM) is a cache for disk
– Usually managed by system software via the virtual memory
subsystem
• Page replacement is similar to cache replacement
• Page table is the “tag store” for physical memory data
store
• What is the difference?
– Access speed: cache vs. physical memory
– Number of blocks in a cache vs. physical memory
– "Tolerable" amount of time to find a replacement candidate
– Role of hardware versus software

What’s in A Tag Store Entry?
• Valid bit
• Tag
• Replacement policy bits
• Dirty bit?
– Write back vs. write through caches

Handling Writes (I)
• When do we write the modified data in a cache to the next level?
– Write through: At the time the write happens
– Write back: When the block is evicted
• Write-back
+ Can consolidate multiple writes to the same block before eviction
+ Potentially saves bandwidth between cache levels + saves energy
-- Need a bit in the tag store indicating the block is "dirty/modified"
• Write-through
+ Simpler
+ All levels are up to date. Consistency: Simpler cache coherence because no need to check lower-level caches
-- More bandwidth intensive; no coalescing of writes

Handling Writes (II)
• Do we allocate a cache block on a write miss?
– Allocate on write miss: Yes
– No-allocate on write miss: No
• Allocate on write miss
+ Can consolidate writes instead of writing each of them individually to next level
+ Simpler because write misses can be treated the same way as read misses
-- Requires (?) transfer of the whole cache block
• No-allocate
+ Conserves cache space if locality of writes is low (potentially better cache hit rate)
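The two write questions interact. The minimal Python sketch below models a single cache line under the write-back, allocate-on-write-miss combination, with the dirty bit deciding whether an evicted block must be written to the next level; names and structure are illustrative, not any specific processor's design.

class Line:
    def __init__(self):
        self.valid, self.dirty, self.tag = False, False, None

def handle_write(line, tag, write_back_to_next_level):
    if line.valid and line.tag == tag:
        line.dirty = True                          # write hit: mark modified, no traffic yet
        return
    if line.valid and line.dirty:
        write_back_to_next_level(line.tag)         # evicting a modified block: write it back
    # Allocate on write miss: bring the block in and then treat the write like a hit
    line.valid, line.dirty, line.tag = True, True, tag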

Sectored Caches
• Idea: Divide a block into subblocks (or sectors)
– Have separate valid and dirty bits for each sector
– When is this useful? (Think writes…)
++ No need to transfer entire cache block into cache
– (A write simply validates and updates a subblock)
++ More freedom in transferring subblocks into cache
– (How many subblocks do you transfer on a read?)
— More complex design
— May not exploit spatial locality fully for reads
[Layout: tag | (v, d, subblock) repeated for each subblock]

Instruction vs. Data Caches
• Separate or Unified?
• Unified:
+ Dynamic sharing of cache space: no overprovisioning that might happen with static partitioning (i.e., split I and D caches)
— Instructions and data can thrash each other (i.e., no guaranteed space for either)
— I and D are accessed in different places in the pipeline. Where do we place the unified cache for fast access?
• First-level caches are almost always split
– Mainly for the last reason above
• Second and higher levels are almost always unified

Multi-level Caching in a Pipelined Design
• First-level caches (instruction and data)
– Decisions very much affected by cycle time
– Small, lower associativity
– Tag store and data store accessed in parallel
• Second-level caches
– Decisions need to balance hit rate and access latency
– Usually large and highly associative; latency not as important
– Tag store and data store accessed serially
• Serial vs. parallel access of levels
– Serial: Second-level cache accessed only if the first level misses
– Second level does not see the same accesses as the first
• First level acts as a filter (filters some temporal and spatial locality)
• Management policies are therefore different
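For the serial organization, the AMAT formula is commonly unrolled level by level: every access pays the first-level probe, and only first-level misses pay the second-level probe (and so on). A minimal Python sketch with illustrative latencies:

def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_latency):
    l2_amat = l2_hit + l2_miss_rate * mem_latency     # cost seen by an L1 miss
    return l1_hit + l1_miss_rate * l2_amat            # every access pays the L1 probe

print(amat_two_level(2, 0.10, 12, 0.30, 200))  # 2 + 0.1*(12 + 0.3*200) = 9.2 cycles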

Cache Performance

Cache Parameters vs. Miss/Hit Rate
• Cache size
• Block size
• Associativity
• Replacement policy
• Insertion/Placement policy

Cache Size
• Cache size: total data (not including tag) capacity
– bigger can exploit temporal locality better
– not ALWAYS better
• Too large a cache adversely affects hit and miss latency
– smaller is faster => bigger is slower
– access time may degrade critical path
• Too small a cache
– doesn't exploit temporal locality well
– useful data replaced often
• Working set: the whole set of data the executing application references
– Within a time interval
[Plot: hit rate vs. cache size, flattening once the cache reaches the "working set" size]

Block Size
• Block size is the data that is associated with an address tag
– not necessarily the unit of transfer between hierarchies
• Sub-blocking: A block divided into multiple pieces (each with V bit)
– Can improve "write" performance
• Too small blocks
– don't exploit spatial locality well
– have larger tag overhead
• Too large blocks
– too few total # of blocks → less temporal locality exploitation
– waste of cache space and bandwidth/energy if spatial locality is not high
[Plot: hit rate vs. block size, peaking at an intermediate block size. Can someone explain this curve?]

Large Blocks: Critical-Word and Sub-blocking
• Large cache blocks can take a long time to fill into the cache
– fill cache line critical word first
– restart cache access before complete fill
• Large cache blocks can waste bus bandwidth
– divide a block into subblocks
– associate separate valid bits for each subblock
– When is this useful?
[Layout: tag | (v, d, subblock) repeated for each subblock]

Associativity
• How many blocks can map to the same index (or set)?
• Larger associativity
– lower miss rate, less variation among programs
– diminishing returns, higher hit latency
• Smaller associativity
– lower cost
– lower hit latency
• Especially important for L1 caches
• Power of 2 required?
[Plot: hit rate vs. associativity, with diminishing returns]

Classification of Cache Misses
• Compulsory miss
– first reference to an address (block) always results in a miss
– subsequent references should hit unless the cache block is displaced for the reasons below
• Capacity miss
– cache is too small to hold everything needed
– defined as the misses that would occur even in a fully-associative cache (with optimal replacement) of the same capacity
• Conflict miss
– defined as any miss that is neither a compulsory nor a capacity miss
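In a trace-driven simulator, this classification is usually made by also simulating a same-capacity fully-associative reference cache. The Python sketch below approximates that reference with LRU rather than optimal replacement (a common simplification); classify_miss is meant to be called only for accesses that miss in the actual cache being studied.

from collections import OrderedDict

class FullyAssociativeLRU:
    def __init__(self, num_blocks):
        self.num_blocks, self.lru = num_blocks, OrderedDict()

    def access(self, block):
        hit = block in self.lru
        if hit:
            self.lru.move_to_end(block)            # promote to MRU
        else:
            if len(self.lru) >= self.num_blocks:
                self.lru.popitem(last=False)       # evict the LRU block
            self.lru[block] = True
        return hit

def classify_miss(seen_before, fully_assoc_hit):
    if not seen_before:
        return "compulsory"   # first reference to this block
    if not fully_assoc_hit:
        return "capacity"     # a same-capacity fully-associative cache also misses
    return "conflict"         # only the less-associative real cache misses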

How to Reduce Each Miss Type
• Compulsory
– Caching cannot help
– Prefetching
• Conflict
– More associativity
– Other ways to get more associativity without making the cache associative
• Victim cache
• Hashing
• Software hints?
• Capacity
– Utilize cache space better: keep blocks that will be referenced
– Software management: divide working set such that each
“phase” fits in cache

Improving Cache "Performance"
• Remember
– Average memory access time (AMAT)
= ( hit-rate * hit-latency ) + ( miss-rate * miss-latency )
• Reducing miss rate
– Reminder: reducing miss rate can reduce performance if more costly-to-refetch blocks are evicted
• Reducing miss latency/cost
• Reducing hit latency/cost

Improving Basic Cache Performance
• Reducing miss rate
– More associativity
– Alternatives/enhancements to associativity
• Victim caches, hashing, pseudo-associativity, skewed associativity
– Better replacement/insertion policies
– Software approaches
• Reducing miss latency/cost
– Multi-level caches
– Critical word first
– Subblocking/sectoring
– Better replacement/insertion policies
– Non-blocking caches (multiple cache misses in parallel)
– Multiple accesses per cycle
– Software approaches

Next Topic: Optimizing Cache Performance
