
Computer Architecture
Course code: 0521292B
09. Memory Hierarchy
Jianhua Li
College of Computer and Information, Hefei University of Technology
Slides are adapted from the CA courses of Wisc, Princeton, MIT, Berkeley, etc.
The slides of this course are for educational purposes only and should be used only in conjunction with the textbook. Derivatives of the slides must acknowledge the copyright notices of this and the originals.

Memory: Programmer's View

Virtual vs. Physical Memory
Programmer sees virtual memory: can assume the memory is "infinite".
Reality: physical memory size is much smaller than what the programmer assumes.
The system (software + hardware) maps virtual memory addresses to physical memory.
The system automatically manages the physical memory space, transparently to the programmer.
Advantage: the programmer does not need to know the physical size of memory, nor manage it. A small physical memory can appear as a huge one to the programmer. Life is easier for the programmer.
Disadvantage: more complex system software and architecture.

Idealism

Instruction Supply:
Zero-cycle latency
Infinite capacity
Zero cost
Perfect control flow

Pipeline (instruction execution):
No pipeline stalls
Perfect data flow (reg/memory dependencies)
Zero-cycle interconnect (operand communication)
Enough functional units
Zero-latency compute

Data Supply:
Zero-cycle latency
Infinite capacity
Infinite bandwidth
Zero cost

Memory in a Modern System
[Chip diagram: CORE 0 through CORE 3, each with its own L2 cache (L2 CACHE 0 through L2 CACHE 3), a SHARED L3 CACHE, the DRAM MEMORY CONTROLLER, the DRAM INTERFACE, and off-chip DRAM BANKS.]

Ideal Memory
Zero access time (latency)
Infinite capacity
Zero cost
Infinite bandwidth (to support multiple accesses in parallel)

The ideal memory's requirements oppose each other:
Bigger is slower: bigger takes longer to determine the location.
Faster is more expensive: memory technology (SRAM vs. DRAM vs. Disk vs. Tape).
Higher bandwidth is more expensive: need more banks, more ports, higher frequency, or faster technology.

DRAM: Dynamic Random Access Memory
Capacitor charge state indicates the stored value: whether the capacitor is charged or discharged indicates storage of 1 or 0.
1 capacitor, 1 access transistor.
The capacitor leaks through the RC path, so the DRAM cell loses charge over time and needs to be refreshed.
[Cell diagram: a row-enable transistor connects the capacitor to the bitline.]

SRAM: Static Random Access Memory
Two cross-coupled inverters store a single bit.
The feedback path enables the stored value to persist in the cell.
4 transistors for storage, 2 transistors for access.
[Cell diagram: row-select transistors connect the cell to the bitline and its complement.]

Memory Bank Organization and Operation

Read access sequence:
1. Decode row address and drive wordlines
2. Selected bits drive bitlines (entire row is read)
3. Amplify row data
4. Decode column address and select subset of row (send to output)
5. Precharge bitlines (for next access)

Static Random Access Memory
Read sequence:
1. address decode
2. drive row select
3. selected bitcells drive bitlines (entire row is read together)
4. differential sensing and column select (data is ready)
5. precharge all bitlines (for next read or write)
Access latency is dominated by steps 2 and 3; cycling time is dominated by steps 2, 3 and 5.
Step 2 is proportional to 2^m; steps 3 and 5 are proportional to 2^n.
[Array diagram: a bitcell array of 2^n rows x 2^m columns, driven by n row-address bits and m column-address bits; a sense amp and mux reduce the 2^m differential bitline pairs to a 1-bit output. n is chosen close to m to minimize overall latency.]
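Worked check (illustrative numbers, not from the slides): for a fixed capacity of 2^(n+m) bits, the latency terms above grow roughly as 2^m + 2^n, which is minimized when n = m. A 64 Kbit array organized as 2^8 rows x 2^8 columns balances the wordline delay of step 2 against the bitline delay of steps 3 and 5, whereas a 2^4 x 2^12 organization of the same capacity would be dominated by its 4096-column wordlines.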

Dynamic Random Access Memory
Bits are stored as charge on node capacitance (non-restorative):
the bit cell loses charge when read;
the bit cell loses charge over time.
Read sequence:
1-3. same as SRAM
4. a flip-flopping sense amp amplifies and regenerates the bitline; the data bit is muxed out
5. precharge all bitlines
Destructive reads; charge loss over time.
Refresh: a DRAM controller must periodically read each row within the allowed refresh time (10s of ms) so that charge is restored.
[Array diagram: a bitcell array of 2^n rows x 2^m columns, with the n row-address bits latched by RAS and the m column-address bits latched by CAS, followed by a sense amp and mux producing a 1-bit output; n is chosen close to m to minimize overall latency.]
A DRAM die comprises multiple such arrays.

DRAM vs. SRAM
DRAM:
Slower access (capacitor)
Higher density (1T-1C cell)
Lower cost
Requires refresh (power, performance, circuitry)
Manufacturing requires putting capacitor and logic together
SRAM:
Faster access (no capacitor)
Lower density (6T cell)
Higher cost
No need for refresh
Manufacturing compatible with logic process (no capacitor)

Bigger is slower:
SRAM, 512 Bytes, sub-nanosec
SRAM, KByte-MByte, ~nanosec
DRAM, Gigabyte, ~50 nanosec
Hard Disk, Terabyte, ~10 millisec
Faster is more expensive (dollars and chip area):
SRAM, ~$10 per Megabyte
DRAM, ~$1 per Megabyte
Hard Disk, ~$1 per Gigabyte
These sample values scale with time.
Other technologies have their place as well: Flash memory, PCRAM, MRAM, RRAM (not mature yet).

Why Memory Hierarchy?
We want both fast and large, but we cannot achieve both with a single level of memory.
Idea: have multiple levels of storage, progressively bigger and slower as the levels are farther from the processor, and ensure most of the data the processor needs is kept in the faster level(s).

The Memory Hierarchy
[Figure: a fast, small level near the processor ("move what you use here") backed by a big but slow level ("backup everything here"); levels toward the processor are faster and more expensive per byte, levels away from it are bigger and cheaper per byte.]
With good locality of reference, memory appears as fast as the fast level and as large as the big level.

Memory Hierarchy
Fundamental tradeoff:
Fast memory: small
Large memory: slow
Idea: memory hierarchy
[Figure: CPU with register file (RF), backed by a Cache, backed by Main Memory (DRAM), backed by a Hard Disk; latency, cost, size, and bandwidth vary across the levels.]

Locality
One's recent past is a very good predictor of his/her near future.
Temporal Locality: If you just did something, it is very likely that you will do the same thing again soon.
Example: since you are here today, there is a good chance you will be here again and again, regularly.
Spatial Locality: If you did something, it is very likely you will do something similar/related (in space).
Example: every time I find you in this room, you are probably sitting close to the same people.

Memory Locality
A "typical" program has a lot of locality in memory references; typical programs are composed of "loops".
Temporal: a program tends to reference the same memory location many times, all within a small window of time.
Spatial: a program tends to reference a cluster of memory locations at a time. The most notable examples: instruction memory references, and array/data structure references (see the sketch below).
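As a concrete illustration (my own sketch, not from the slides), the loop below exhibits both kinds of locality: the accumulator and the loop instructions are reused every iteration (temporal), and the array elements are referenced sequentially (spatial), so each fetched cache block serves several consecutive iterations.

#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];          /* contiguous array: consecutive addresses */
    long sum = 0;
    for (int i = 0; i < N; i++)
        sum += a[i];          /* spatial: a[i], a[i+1] share cache blocks;
                                 temporal: sum and the loop code are reused */
    printf("%ld\n", sum);
    return 0;
}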

Caching Basics
Idea: store recently accessed data in automatically managed fast memory (called cache).
Anticipation: the data will be accessed again soon (temporal locality principle).
Recently accessed data will be accessed again in the near future.
This is what Maurice Wilkes had in mind:
Wilkes, "Slave Memories and Dynamic Storage Allocation," IEEE Trans. on Electronic Computers, 1965.
"The use is discussed of a fast core memory of, say 32000 words as a slave to a slower core memory of, say, one million words in such a way that in practical cases the effective access time is nearer that of the fast memory than that of the slow memory."

Caching Basics
Idea: store addresses adjacent to the recently accessed one in automatically managed fast memory.
Logically divide memory into equal-size blocks; fetch the accessed block into the cache in its entirety (see the address sketch below).
Anticipation: nearby data will be accessed soon (spatial locality principle).
Nearby data in memory will be accessed in the near future, e.g., sequential instruction access, array traversal.
This is what the IBM 360/85 implemented:
16 Kbyte cache with 64 byte blocks.
Liptay, "Structural aspects of the System/360 Model 85 II: the cache," IBM Systems Journal, 1968.
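A minimal sketch of the block abstraction (my own illustration, assuming the 64-byte block size cited above): dividing an address by the block size gives the block number that is fetched on a miss; the remainder is the byte offset within that block.

#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 64   /* bytes per block, as in the 360/85 example */

int main(void) {
    uint64_t addr = 0x12345;                /* arbitrary example address */
    uint64_t block = addr / BLOCK_SIZE;     /* which block to fetch */
    uint64_t offset = addr % BLOCK_SIZE;    /* byte within that block */
    printf("address 0x%llx -> block %llu, offset %llu\n",
           (unsigned long long)addr,
           (unsigned long long)block,
           (unsigned long long)offset);
    return 0;
}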

Caching in a Pipelined Design
The cache needs to be tightly integrated into the pipeline: ideally, access in 1 cycle so that dependent operations do not stall.
A high frequency pipeline means the cache cannot be made large. But we want a large cache AND a pipelined design.
Idea: cache hierarchy.
[Figure: CPU with register file (RF), backed by a Level 1 Cache, backed by a Level 2 Cache, backed by Main Memory (DRAM).]

Manual vs. Automatic Management
Manual: programmer manages data movement across levels.
Too painful for programmers on substantial programs; "core" vs. "drum" memory in the 50s.
Still done in some embedded processors (on-chip scratch pad SRAM in lieu of a cache).
Automatic: hardware manages data movement across levels, transparently to the programmer.
The programmer's life is easier; the average programmer doesn't need to know about it.
You don't need to know how big the cache is and how it works to write a correct program! What if you want a fast program? (See the sketch below.)
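One classic illustration of what cache awareness buys you (my own sketch, not from the slides): summing a 2D array in row-major order matches C's memory layout and enjoys spatial locality, while column-major order strides across rows and misses far more often; the exact gap depends on the cache and block sizes.

#include <stdio.h>

#define N 1024
static int m[N][N];           /* C stores this row-major */

long sum_row_major(void) {    /* walks consecutive addresses: cache-friendly */
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

long sum_col_major(void) {    /* strides by N ints per access: cache-hostile */
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}

int main(void) {
    printf("%ld %ld\n", sum_row_major(), sum_col_major());
    return 0;
}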

Modern Memory Hierarchy (Memory Abstraction)
Register File: 32 words, sub-nsec (manual/compiler register spilling)
L1 cache: 32 KB, ~nsec
L2 cache: 512 KB - 1 MB, many nsec
L3 cache: ... (automatic HW cache management)
Main memory (DRAM): GB, ~100 nsec
Swap Disk: 100 GB, ~10 msec (automatic demand paging)

Hierarchical Latency Analysis
A given memory hierarchy level i has a technology-intrinsic access time of t_i. The perceived access time T_i is longer than t_i.
Except for the outermost hierarchy level, when looking for a given address there is
a chance (hit-rate h_i) you "hit", and the access time is t_i;
a chance (miss-rate m_i) you "miss", and the access time is t_i + T_(i+1);
with h_i + m_i = 1.
Thus
T_i = h_i * t_i + m_i * (t_i + T_(i+1))
T_i = t_i + m_i * T_(i+1)
Note: h_i and m_i are defined to be the hit-rate and miss-rate of just the references that missed at L_(i-1).
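A small numeric sketch of this recursion (the latencies and miss rates below are made-up illustrative values, not measurements):

#include <stdio.h>

int main(void) {
    double t[] = {1.0, 10.0, 100.0};  /* intrinsic access times t_i (cycles) */
    double m[] = {0.10, 0.20, 0.0};   /* miss rates m_i; last level always hits */
    double T = t[2];                  /* outermost level: T_3 = t_3 */
    for (int i = 1; i >= 0; i--)
        T = t[i] + m[i] * T;          /* T_i = t_i + m_i * T_(i+1) */
    printf("T_1 = %.2f cycles\n", T); /* 1 + 0.1 * (10 + 0.2 * 100) = 4.00 */
    return 0;
}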

Hierarchy Design Considerations
Recursive latency equation: T_i = t_i + m_i * T_(i+1)
The goal: achieve desired T_1 within allowed cost.
T_i ≈ t_i is desirable.
Keep m_i low:
increasing capacity C_i lowers m_i, but beware of increasing t_i;
lower m_i by smarter management (replacement: anticipate what you don't need; prefetching: anticipate what you will need).
Keep T_(i+1) low:
faster lower hierarchies, but beware of increasing cost;
introduce intermediate hierarchies as a compromise.

Cache Basics and Operation

Cache
Generically, any structure that "memorizes" frequently used results to avoid repeating the long-latency operations required to reproduce the results from scratch, e.g., a web cache.
Most commonly, in the on-die context: an automatically-managed memory hierarchy based on SRAM.
Memorize in SRAM the most frequently accessed DRAM memory locations, to avoid repeatedly paying for the DRAM access latency.

Caching Basics
Block (line): unit of storage in the cache. Memory is logically divided into cache blocks that map to locations in the cache.
When data is referenced:
HIT: if in cache, use cached data instead of accessing memory.
MISS: if not in cache, bring the block into cache.
(A minimal lookup sketch follows this list.)
Some important cache design decisions:
Placement: where and how to place/find a block in cache?
Replacement: what data to remove to make room in cache?
Granularity of management: large, small, uniform blocks?
Write policy: what do we do about writes?
Instructions/data: do we treat them separately?
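To make HIT and MISS concrete, here is a minimal direct-mapped lookup sketch (one possible placement policy; the geometry and names below are illustrative assumptions, not from the slides):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 6                 /* 64-byte blocks (assumed)  */
#define SETS       256               /* number of sets (assumed)  */

struct line { bool valid; uint64_t tag; };
static struct line tag_store[SETS];  /* "is the address in the cache?" */

bool access_cache(uint64_t addr) {
    uint64_t block = addr >> BLOCK_BITS;  /* strip the byte offset    */
    uint64_t set   = block % SETS;        /* where the block may live */
    uint64_t tag   = block / SETS;        /* identifies which block   */
    if (tag_store[set].valid && tag_store[set].tag == tag)
        return true;                      /* HIT: use cached data     */
    tag_store[set].valid = true;          /* MISS: bring block in     */
    tag_store[set].tag = tag;
    return false;
}

int main(void) {
    printf("%d %d\n", access_cache(0x1000), access_cache(0x1000)); /* 0 1 */
    return 0;
}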

Cache Abstraction and Metrics
[Block diagram: an Address is presented to the Tag Store ("is the address in the cache?", plus bookkeeping) and the Data Store; the tag store answers Hit/miss?, the data store supplies the Data.]
Cache hit rate = (# hits) / (# hits + # misses) = (# hits) / (# accesses)
Average memory access time (AMAT) = (hit-rate × hit-latency) + (miss-rate × miss-latency)
Aside: can reducing AMAT reduce performance?
Next Topic: Cache Memory
