Computer Architecture
Course code: 0521292B 16. Interconnection Networks
Jianhua Li
College of Computer and Information Hefei University of Technology
Slides are adapted from CA courses of Wisc, Princeton, MIT, Berkeley, etc.
The uses of the slides of this course are for educational purposes only and should be
used only in conjunction with the textbook. Derivatives of the slides must
acknowledge the copyright notices of this and the originals.
Outline
Introduction
Network Topology
Flow Control
Example IN
How to connect individual devices together?
Device (definition):
  Component within a computer
  Single computer
  System of computers
Types of elements:
  end nodes (device + interface)
  links
  interconnection network
[Figure: four end nodes, each a device with SW and HW interfaces, attached by links to an interconnection network]
Internetworking: interconnection of multiple networks
Interconnection networks should be designed to transfer the maximum amount of information within the least amount of time (and cost, power constraints) so as not to bottleneck the system
Reasons to care about interconnection networks
They provide external connectivity from system to outside world
Also, connectivity within a single computer system at many levels
I/O units, boards, chips, modules and blocks inside chips
Trends:
  high demand on communication bandwidth
  increased computing power and storage capacity
  switched networks are replacing buses
Computer architects must understand interconnect problems and solutions in order to more effectively design and evaluate systems
Interconnection Network Domains
Interconnection networks can be grouped into four major networking domains, depending on the number and proximity of devices to be interconnected:
OCNs, SANs, LANs, and WANs
On-chip networks (OCNs), a.k.a., network-on-chip (NoC)
Interconnect microarchitectural components: functional units, register files, caches, compute tiles, processor and IP cores
Chip or multichip modules
Tens (in future, possibly 100s) of devices interconnected
Maximum interconnect distance on the order of centimeters
Interconnection Network Domains
System/storage area networks (SANs)
Multiprocessor and multicomputer systems
Interprocessor and processor-memory interconnections
Server and data center environments
Storage and I/O components
Hundreds to thousands of devices interconnected
IBM Blue Gene/L supercomputer (64K nodes, each with 2 processors)
Maximum interconnect distance typically on the order of tens of meters, but some with as high as a few hundred meters
InfiniBand: 120 Gbps over a distance of 300 m
Interconnection Network Domains
Local area networks (LANs)
Machine room or throughout a building or campus
Hundreds of devices interconnected (1,000s with bridging)
Maximum interconnect distance on the order of a few kilometers (some with distance spans of a few tens of kilometers)
Example (most popular): Ethernet, with 10 Gbps over 40 km
Wide area networks (WANs)
Interconnect systems distributed across the globe
Internetworking support is required
Many millions of devices interconnected
Maximum interconnect distance of many thousands of kilometers
Example: ATM
Interconnection Network Domains
[Figure: the four domains plotted by distance in meters (from about 5 x 10^-3 up to 5 x 10^6) versus number of devices interconnected (from 1 to more than 100,000): OCNs at the smallest distances and device counts, then SANs, LANs, and WANs]
Outline
Introduction
Network Topology
Flow Control
Example IN
Topology Overview
Definition: determines arrangement of channels and nodes in network
Analogous to road map
Often first step in network design
Significant impact on network cost-performance
  Determines number of hops
    Latency
    Network energy consumption
  Implementation complexity
    Node degree
    Ease of layout
Typical Topologies in Processors
[Figure: Niagara - crossbar; Cell EIB - ring; Larrabee - ring; Tilera - mesh; TRIPS - mesh]
Network Metrics
Metrics are used to evaluate the performance and cost of a network with a given topology
Also influenced by routing/flow control
At this stage, we skip the impacts
Assume ideal routing (perfect load balancing)
Assume ideal flow control (no idle cycles on any channel)
(1) Degree
Switch Degree: number of links at a node
Proxy for estimating cost
Higher degree requires more links and port counts at each router
[Figure: switch degree examples - ring nodes have degree 2; mesh nodes have degree 2, 3, or 4; torus nodes have degree 4]
(2) Hop Count
Path: ordered set of channels between source and destination
Hop Count: number of hops a message takes from source to destination
Simple, useful proxy for network latency
Every node, link incurs some propagation delay even when no contention
Minimal hop count: smallest hop count connecting two nodes
Network diameter: largest min hop count in network
Average minimum hop count: average across all src/dst pairs
Implementation may incorporate non-minimal paths: increases average hop count
(2) Hop Count
[Figure: hop counts for 9-node examples - Ring: max = 4, avg = 2.2; Mesh (3x3): max = 4, avg = 1.77; Torus (3x3): max = 2, avg = 1.33]
Under uniform random traffic, average hop count: Ring > Mesh > Torus
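The averages above can be checked mechanically. The sketch below is an illustration (not from the slides): it builds the 9-node ring, 3x3 mesh, and 3x3 torus and computes max and average minimal hop count with breadth-first search, averaging over all ordered source/destination pairs (the convention that reproduces the figures quoted above).

```python
# Minimal sketch: max / average minimal hop count via BFS (illustrative only).
from collections import deque

def hop_stats(neighbors):
    """neighbors: dict mapping each node to the list of its adjacent nodes."""
    total, worst, n = 0, 0, len(neighbors)
    for src in neighbors:
        dist = {src: 0}
        queue = deque([src])
        while queue:                       # BFS yields minimal hop counts
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        worst = max(worst, max(dist.values()))
    return worst, total / (n * n)          # average over all ordered src/dst pairs

ring9 = {i: [(i - 1) % 9, (i + 1) % 9] for i in range(9)}

def grid3(wrap):
    nbrs = {}
    for x in range(3):
        for y in range(3):
            cand = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
            if wrap:
                cand = [(a % 3, b % 3) for a, b in cand]   # torus wrap-around links
            nbrs[(x, y)] = [(a, b) for a, b in cand if 0 <= a < 3 and 0 <= b < 3]
    return nbrs

for name, g in [("ring", ring9), ("mesh", grid3(False)), ("torus", grid3(True))]:
    worst, avg = hop_stats(g)
    print(f"{name}: max = {worst}, avg = {avg:.2f}")   # 4 / 2.22, 4 / 1.78, 2 / 1.33
```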
(3) Latency
Time for a packet to traverse the network
  Start: head arrives at input port
  End: tail departs output port
Latency = head latency + serialization latency
  Serialization latency: time for a packet of length L to cross a channel with bandwidth b, i.e., L/b
  Head latency: approximate with hop count
Other design choices (routing, flow control) also impact latency
  We skip these impacts for now
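As a quick illustration of the decomposition above, the helper below is a sketch (names and example numbers are assumptions, not from the slides) that adds a per-hop head latency to the L/b serialization term.

```python
# Minimal sketch of: latency = head latency + serialization latency (L / b).
# The per-hop router delay and the example numbers are illustrative assumptions.
def packet_latency(hops, t_router_cycles, length_bits, bandwidth_bits_per_cycle):
    head_latency = hops * t_router_cycles
    serialization_latency = length_bits / bandwidth_bits_per_cycle   # L / b
    return head_latency + serialization_latency

# e.g. 3 hops, 1 cycle per hop, 512-bit packet, 128-bit-wide channel
print(packet_latency(3, 1, 512, 128))   # 3 + 4 = 7 cycles
```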
(4) Maximum Channel Load
Estimate of the maximum bandwidth the network can support
  Max bits per second (bps) that can be injected by every node before the network saturates
  Saturation: the network cannot accept any more traffic
Determine the most congested link
  For a given traffic pattern
  This link will limit overall network bandwidth
  Estimate the load on this channel
Maximum Channel Load Example
[Figure: 8-node example network (nodes A-H) with a single bottleneck channel joining the two halves]
Uniform random traffic
  Every node has equal probability of sending to every node
Identify the bottleneck channel
  Half of the traffic from every node will cross the bottleneck channel
  Channel load = 8 x 1/2 = 4
Network saturates at 1/4 of the injection bandwidth
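The same back-of-the-envelope estimate, written out as a sketch (illustrative only, assuming every node injects at the same rate):

```python
# Minimal sketch of the estimate above: 8 nodes, half of each node's traffic
# crosses the single bottleneck channel (uniform random traffic assumed).
nodes, crossing_fraction = 8, 0.5
channel_load = nodes * crossing_fraction      # 4 units of load per unit of injection
saturation_rate = 1 / channel_load            # 0.25: saturates at 1/4 injection bandwidth
print(channel_load, saturation_rate)
```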
(5) Bisection Bandwidth
Common off-chip metric
  Proxy for cost: amount of global wiring that will be necessary
  Less useful for on-chip: global on-chip wiring considered abundant
Cuts: partition all the nodes into two disjoint sets
  Bandwidth of a cut
Bisection
  A cut which divides all nodes into two nearly equal halves
  Channel bisection: minimum channel count over all bisections
  Bisection bandwidth: minimum bandwidth over all bisections
With uniform random traffic, 1/2 of the traffic crosses the bisection
Throughput Example
[Figure: 8-node ring, nodes 0-7]
Bisection = 4 channels (2 in each direction)
With uniform random traffic:
  3 sends 1/8 of its traffic to 4, 5, 6
  3 sends 1/16 of its traffic to 7 (2 possible shortest paths)
  2 sends 1/8 of its traffic to 4, 5
  Etc.
Adding up all such contributions: channel load = 1
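The per-source contributions listed above can be summed mechanically. The sketch below is an illustration (not from the slides): it walks over every source/destination pair of the 8-node ring, splits traffic equally across minimal paths, and accumulates the load on the clockwise channel from node 3 to node 4.

```python
# Minimal sketch: load on the clockwise channel 3 -> 4 of an 8-node ring under
# uniform random traffic with minimal routing (illustrative only).
N = 8
load = 0.0
for src in range(N):
    for dst in range(N):
        cw, ccw = (dst - src) % N, (src - dst) % N   # hop counts in the two directions
        if cw == 0:
            continue                                  # self-traffic loads no link
        share = 1.0 / N                               # uniform random: 1/8 per destination
        cw_share = share if cw < ccw else (share / 2 if cw == ccw else 0.0)
        # The clockwise path uses the channels leaving nodes src, src+1, ..., dst-1.
        if cw_share and (3 - src) % N < cw:
            load += cw_share
print(load)   # 1.0: at unit injection rate the bottleneck channel is exactly saturated
```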
(6) Path Diversity
Multiple shortest paths (R) between source/destination pair
Fault tolerance
Better load balancing in network
Routing algorithm should be able to exploit path diversity
[Figure: path diversity examples - one source/destination pair with a single minimal path (R_AB = 1) and a mesh pair with R_AB = 6 minimal paths]
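For a 2D mesh the count of minimal paths has a closed form; the sketch below (illustrative, not from the slides) counts the minimal paths between nodes that are dx hops apart in X and dy hops apart in Y.

```python
# Minimal sketch: number of minimal paths in a 2D mesh between nodes separated
# by dx hops in X and dy hops in Y (choose which of the dx + dy hops go in X).
from math import comb

def minimal_paths(dx, dy):
    return comb(dx + dy, dx)

print(minimal_paths(2, 2))   # 6 minimal paths, e.g. the R_AB = 6 case above
print(minimal_paths(3, 0))   # 1: aligned nodes have a single minimal path
```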
Many possible network topologies
Bus
Crossbar
Ring
Tree
Omega
Hypercube
Mesh
Torus
Butterfly
...
Bus interconnect
Good:
Simple design
Cost effective for a small number of nodes
Easy to implement coherence (via snooping)
Bad:
Contention: all nodes contend for shared bus
Limited bandwidth: all nodes communicate over the same wires
High electrical load means low frequency and high power
[Figure: shared bus connecting nodes 0-4]
Crossbar interconnect
Each node is connected to every other node
Good:
O(1) latency and high bandwidth
Bad:
Not scalable: O(N^2) switches
High cost
Difficult to arbitrate at scale
[Figure: 8-node crossbar network (N = 8); each node is drawn twice for clarity, once at left and once at top]
Crossbars used in Sun/Oracle processors
Sun SPARC T2 (8 cores, 8 L2 cache banks)
Oracle SPARC T5 (16 cores, 8 L3 cache banks)
Note that crossbar (CCX) occupies about the same chip area as a core
Ring
Good:
Simple
Cheap: O(N) cost
Bad:
  High latency: O(N)
  Bisection bandwidth remains constant as nodes are added (scalability issue)
Used in recent Intel architectures: Core i7, Xeon Phi
Also used in the IBM Cell Broadband Engine (9 cores)
[Figure: 4-node ring, nodes 0-3]
Intel's ring interconnect (Sandy Bridge microarchitecture)
[Figure: ring connecting the System Agent, Graphics, and four Core / 2 MB L3 cache slice pairs]
Four rings:
  request
  snoop
  ack
  data (32 bytes)
Six interconnect nodes: four slices of L3 cache + system agent + graphics
Each bank of L3 connected to the ring bus twice
Theoretical peak BW from cores to L3 at 3.4 GHz is approx. 435 GB/sec
  When each core is accessing its local slice
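The quoted peak figure can be reproduced from the numbers on this slide, under the assumption that each of the four L3 slices returns one 32-byte data-ring transfer per cycle to its local core:

```python
# Back-of-the-envelope check of the peak core-to-L3 bandwidth quoted above
# (assumes each of the 4 slices delivers 32 bytes per cycle at 3.4 GHz).
slices, bytes_per_cycle, freq_ghz = 4, 32, 3.4
print(slices * bytes_per_cycle * freq_ghz, "GB/s")   # 435.2 GB/s, i.e. approx. 435 GB/sec
```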
Mesh
Direct network
O(N) cost
Average latency: O(sqrt(N))
Easy to lay out on chip: fixed-length links
Path diversity: many ways for message to travel from one node to another
Used by:
Tilera processors
Prototype Intel chips
2D Mesh
Torus
In a mesh topology, a node near the edge and a node in the middle of the network are different (the topology is not symmetric)
Torus topology introduces new links to avoid this problem
Still O(N) cost, but higher cost than 2D grid
Higher path diversity and bisection BW than mesh
Higher complexity
Difficult to lay out on chip: unequal link lengths
2D Torus
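The bisection difference between the two topologies is easy to state in closed form; the sketch below (illustrative, not from the slides) counts the bidirectional links cut by a vertical bisection of a k x k network.

```python
# Minimal sketch: bidirectional links crossing a vertical bisection of a k x k network.
def mesh_bisection(k):
    return k          # one row link per row crosses the cut

def torus_bisection(k):
    return 2 * k      # the wrap-around link in each row doubles the cut

print(mesh_bisection(4), torus_bisection(4))   # 4 vs. 8
```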
Trees
Planar, hierarchical topology
Like mesh/torus, good when traffic has locality
Latency: O(lg N)
Use fat trees to alleviate root bandwidth problem
H-Tree
Fat Tree
Hypercube (in different dimensions)
Low latency: O(lg N)
Radix: O(lg N)
Number of links: O(N lg N)
A 6D hypercube was used in the 64-node Cosmic Cube computer developed at Caltech in the 1980s.
[Figure: 4-dimensional hypercube with 16 nodes labeled 0000-1111]
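A few of the properties above follow directly from the node labels in the figure: neighbors differ in exactly one address bit, and the minimal hop count is the Hamming distance. The sketch below is illustrative (not from the slides).

```python
# Minimal sketch: basic properties of a k-dimensional hypercube (illustrative).
def neighbors(node, dims):
    return [node ^ (1 << d) for d in range(dims)]   # flip one address bit per link

def hops(a, b):
    return bin(a ^ b).count("1")                    # Hamming distance of the labels

dims = 4                                            # the 16-node cube drawn above
print(neighbors(0b0000, dims))                      # [1, 2, 4, 8]
print(hops(0b0000, 0b1111))                         # 4 = diameter = lg N
print(dims * 2**dims // 2)                          # 32 links in total: (N lg N) / 2
```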
Multi-stage logarithmic
Indirect network with multiple switches between terminals
Cost: O(N lg N)
Latency: O(lg N)
Many variations: Omega, butterfly, Clos networks, etc
[Figure: Omega network - an 8 x 8 example built from three stages of 2 x 2 switches connecting inputs 000-111 to outputs 000-111]
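One attraction of the Omega network is destination-tag (self-) routing: at each stage a 2x2 switch simply uses the next bit of the destination address to pick its output. The sketch below (illustrative, not from the slides) traces the wire index of a message through an 8x8 Omega network built from perfect-shuffle stages.

```python
# Minimal sketch: destination-tag routing through an Omega network with
# 2**n_bits terminals (illustrative only).
def omega_route(src, dst, n_bits=3):
    mask = (1 << n_bits) - 1
    pos, path = src, [src]
    for stage in range(n_bits):
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & mask   # perfect shuffle: rotate label left
        dst_bit = (dst >> (n_bits - 1 - stage)) & 1         # destination bit used at this stage
        pos = (pos & ~1) | dst_bit                          # 2x2 switch sets the low bit
        path.append(pos)
    assert pos == dst                                       # the message always reaches dst
    return path

print(omega_route(0b010, 0b110))   # [2, 5, 3, 6]: wire index after each stage
```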
Review: network topologies
Topology          Direct/Indirect   Blocking/Non-blocking   Cost        Latency
Crossbar          Indirect          Non-blocking            O(N^2)      O(1)
Multi-stage log.  Indirect          Blocking                O(N lg N)   O(lg N)
Mesh              Direct            Blocking                O(N)        O(sqrt(N)) (average)
Outline
Introduction
Network Topology
Flow Control
Example IN
Flow Control Overview
Topology: determines connectivity of network
Routing: determines paths through network
Flow Control: determines allocation of resources to messages as they traverse the network
Buffers and links
Significant impact on throughput and latency of network
We don't cover this topic in detail here. If you are interested in it, please read this book: On-Chip Networks, 2009.
Flow Control
Control state records:
  allocation of channels and buffers to packets
  current state of the packet traversing the node
Channel bandwidth advances flits between nodes
Buffers hold flits waiting for channel bandwidth
[Figure: router resources - channel bandwidth and buffer capacity]
Packets (1)
Messages: composed of one or more packets
If message size is <= maximum packet size, only one packet is created
Packets: composed of one or more flits
  Flit: flow control digit
  Phit: physical digit
    Subdivides a flit into chunks equal to the link width

Packets (2)
[Figure: a message is divided into packets; a packet is divided into a head flit (header: route, sequence number, type, VC ID), body flits (payload), and a tail flit; each flit is divided into phits. Flit types: Head, Body, Tail, Head & Tail]
Off-chip: channel width limited by pins
  Requires phits
On-chip: abundant wiring means phit size == flit size

Packets (3)
[Figure: packet formats - a cache-line packet with a head flit (RC, Type, VCID, Addr) followed by body flits and a tail flit carrying bytes 0-63, and a coherence-command packet carried in a single head & tail flit (RC, Type, VCID, Addr, Cmd)]
Packet contains destination/route information
  Flits may not; all flits of a packet must take the same route

Switching
Different flow control techniques based on granularity
  Message-based: allocation made at message granularity (circuit switching)
  Packet-based: allocation made to whole packets
  Flit-based: allocation made on a flit-by-flit basis

Message-Based Flow Control
Coarsest granularity
Circuit switching
  Pre-allocates resources across multiple hops
    Source to destination
    Resources = links
    Buffers are not necessary
  Probe sent into network to reserve resources

Circuit Switching
Once the probe sets up the circuit
  Message does not need to perform any routing or allocation at each network hop
Good for transferring large amounts of data
  Can amortize circuit setup cost by sending data with very low per-hop overheads
No other message can use those resources until the transfer is complete
  Throughput can suffer due to setup and hold time for circuits
  Links are idle until setup is complete

Circuit Switching Example
[Figure: a circuit is set up from node 0 to node 5 - configuration probe, acknowledgement, then data sent over the reserved circuit]
Significant latency overhead prior to data transfer
Data transfer does not pay per-hop overhead for routing and allocation

Circuit Switching Example (2)
[Figure: the circuit from node 0 to node 5 occupies the links through nodes 1 and 2]
When there is contention
  Significant wait time
  Message from 1 to 2 must wait

Time-Space Diagram: Circuit Switching
[Figure: time-space diagram over nodes 0, 1, 2, 5, 8 (cycles 0-20) - setup and acknowledgement of a circuit from 0 to 8, then data and tail flits; a setup from 2 to 8 is blocked until the first circuit is torn down]

Packet-based Flow Control
Break messages into packets
Interleave packets on links
  Better resource utilization
Requires per-node buffering to store in-flight packets
Two types of packet-based techniques

Store and Forward
Links and buffers are allocated to the entire packet
Head flit waits at the router until the entire packet is received before being forwarded to the next hop
Not suitable for on-chip
  Requires buffering at each router to hold the entire packet
  Packet cannot traverse a link until buffering is allocated to the entire packet
  Incurs high latencies (pays serialization latency at each hop)

Store and Forward Example
[Figure: a packet travels from node 0 to node 5; total delay = 4 cycles per hop x 3 hops = 12 cycles]
High per-hop latency
  Serialization delay paid at each hop
Larger buffering required

Time-Space Diagram: Store and Forward
[Figure: time-space diagram over nodes 0, 1, 2, 5, 8 (cycles 0-24) - the full packet (head, three body flits, tail) is received at each node before being forwarded]

Packet-based: Virtual Cut Through
Links and buffers allocated to entire packets
Flits can proceed to the next hop before the tail flit has been received by the current router
  But only if the next router has enough buffer space for the entire packet
Reduces the latency significantly compared to SAF
But still requires large buffers
  Unsuitable for on-chip
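To make the latency difference concrete before the cut-through example that follows, here is a minimal sketch (not from the slides) assuming a 3-hop path, a 4-flit packet, one flit per channel per cycle, and no router pipeline delay; these assumptions reproduce the 12-cycle and 6-cycle totals used in the surrounding examples.

```python
# Minimal sketch of no-load packet latency; the hop and flit counts are
# assumptions chosen to match the 12-cycle and 6-cycle figures in the examples.
def store_and_forward_latency(hops, flits):
    return hops * flits                # the whole packet is re-serialized at every hop

def cut_through_latency(hops, flits):
    return hops + (flits - 1)          # head pipelines ahead; tail trails by flits - 1 cycles

print(store_and_forward_latency(3, 4))   # 12 cycles
print(cut_through_latency(3, 4))         # 6 cycles (also wormhole, absent contention)
```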
Virtual Cut Through Example
[Figure: the packet travels from node 0 to node 5; 4 flit-sized buffers are allocated at the next router before the head proceeds; total delay = 1 cycle per hop x 3 hops + serialization = 6 cycles]
Lower per-hop latency
Large buffering required

Time-Space Diagram: VCT
[Figure: time-space diagram over nodes 0, 1, 2, 5, 8 (cycles 0-8) - head, body, and tail flits pipeline through successive routers]

Virtual Cut Through
[Figure: the head cannot proceed because only 2 flit buffers are available at the next router]
Throughput suffers from inefficient buffer allocation

Time-Space Diagram: VCT (2)
[Figure: time-space diagram (cycles 0-11) - with insufficient buffers downstream, the entire packet stalls until buffer space for the whole packet frees up]

Flit-Level Flow Control
Helps routers meet tight area/power constraints
Flit can proceed to the next router when there is buffer space available for that flit
Improves over SAF and VCT by allocating buffers on a flit-by-flit basis

Wormhole Flow Control
Pros
  More efficient buffer utilization (good for on-chip)
  Low latency
Cons
  Poor link utilization: if the head flit becomes blocked, all links spanning the length of the packet are idle
    Cannot be re-allocated to a different packet
  Suffers from head-of-line (HOL) blocking

Wormhole Example
[Figure: with 6 flit buffers per input port, a blue packet stalls (buffer full, blocked by other packets) and a red packet blocked behind it holds a channel; that channel remains idle until the red packet proceeds]

Time-Space Diagram: Wormhole
[Figure: time-space diagram over nodes 0, 1, 2, 5, 8 (cycles 0-11) - contention stalls the flits mid-route]

Virtual Channels
First proposed for deadlock avoidance
  We will not talk about deadlock in this class; read the textbook or On-Chip Networks
Can be applied to any flow control
  First proposed with wormhole

Virtual Channel Flow Control
Virtual channels used to combat HOL blocking in wormhole
Virtual channels: multiple flit queues per input port
  Share the same physical link (channel)
  A case of virtualization
Link utilization improved
  Flits on different VCs can pass a blocked packet

Virtual Channel Flow Control (2)
[Figure: inputs A and B share the physical channels through separate virtual-channel buffers on the way to outputs A and B]

Virtual Channel Flow Control (3)
[Figure: flits of packets A and B arrive on In1 and In2, interleave cycle by cycle on the shared output channel, and separate again onto the A and B downstream links]

Virtual Channel Flow Control (4)
Packets compete for the VC on a flit-by-flit basis
In the example: on the downstream links, flits of each packet are available every other cycle
Upstream links throttle because of limited buffers
Does not mean links are idle
  May be used by packets allocated to other VCs

Virtual Channel Example
[Figure: with 6 flit buffers per input port and 3 flit buffers per VC, a blocked packet no longer idles the physical channel; flits of another packet on a different VC proceed past it]

Summary of techniques

Technique            Links     Buffers            Comments
Circuit switching    Messages  N/A (buffer-less)  Setup & ack
Store and forward    Packet    Packet             Head flit waits for tail
Virtual cut through  Packet    Packet             Head can proceed
Wormhole             Packet    Flit               HOL blocking
Virtual channel      Flit      Flit               Interleave flits of different packets

Outline
Introduction
Network Topology
Flow Control
Example IN

Blue Gene/L 3D Torus Network
360 TFLOPS (peak)
2,500 square feet
Connects 65,536 dual-processor nodes and 1,024 I/O nodes

Blue Gene/L 3D Torus Network (www.ibm.com)
[Figure: packaging hierarchy -
  Chip (node): 2 processors, 2.8/5.6 GF/s, 512 MB
  Compute card: 2 chips, 1 x 2 x 1, 5.6/11.2 GF/s, 1.0 GB
  Node card: 32 chips, 4 x 4 x 2, 16 compute + 0-2 I/O cards, 90/180 GF/s, 16 GB
  Rack: 32 node cards, 2.8/5.6 TF/s, 512 GB
  System: 64 racks, 64 x 32 x 32, 180/360 TF/s, 32 TB]
Node distribution: two nodes on a 2 x 1 x 1 compute card, 16 compute cards + 2 I/O cards on a 4 x 4 x 2 node board, 16 node boards on an 8 x 8 x 8 midplane, 2 midplanes on a 1,024-node rack; 8.6 meters maximum physical link length

Next Topic
Multiprocessors