
Concurrent VLSI Architecture Group

Efficient Microarchitecture for Network-on-Chip Routers


Daniel U. Becker PhD Oral Examination 8/21/2012

Outline
INTRODUCTION
Allocator Implementations
Buffer Management
Infrastructure
Conclusions

8/21/12

Efficient Microarchitecture for NoC Routers

Networks-on-Chip

Moore's Law alive & well
Many cores per chip
Must work together

Networks-on-Chip (NoCs) aim to provide a scalable, efficient communication fabric

[figure: chip with many cores connected by an on-chip network]


Why Does the Network Matter?

Performance
Latency, throughput, fairness/QoS

Cost
Die area, wiring resources, design complexity

Power & energy efficiency

[pie chart, energy breakdown for Radix Sort (SPLASH-2): NoC 45%, Core 31%, DRAM 14%, Caches 10%]
[Harting et al., Energy and Performance Benefits of Active Messages]

Optimizing the Network


Router Microarchitecture Overview

[figure: router microarchitecture block diagram; Part 1 and Part 2 annotations mark the portions addressed in this talk]
[Peh and Dally: A Delay Model for Router Microarchitectures]

Outline
Introduction
ALLOCATOR IMPLEMENTATIONS
Buffer Management
Infrastructure
Conclusions

[Becker and Dally: Allocator Implementations for Network-on-Chip Routers, SC09]

Allocators
Fundamental part of router control logic
Manage access to network resources
Orchestrate flow of packets through router
Affect network utilization
Potentially affect cycle time

Virtual Channel Allocation


Virtual channels (VCs) allow multiple packets to be interleaved on physical channels
Similar to lanes on a highway, they allow traffic blockages to be bypassed
Before a packet can use a network channel, it needs to claim ownership of a VC

VC allocator assigns output VCs to waiting packets
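As a concrete illustration, a hedged Python sketch of an input-first separable VC allocator in the style of the "sep in" design evaluated below; names are illustrative and this is a behavioral model, not the thesis RTL. Each waiting packet first selects one candidate output VC, then each output VC grants at most one winner.

```python
# Illustrative model of input-first separable VC allocation (not the
# thesis RTL). Stage 1: each input VC picks one of its candidate output
# VCs. Stage 2: each output VC grants at most one requesting input VC.

def separable_vc_alloc(requests):
    """requests: dict input_vc -> list of candidate output VCs.
    Returns dict input_vc -> granted output VC."""
    # Stage 1: fixed-priority selection for simplicity; real hardware
    # would use round-robin arbiters for fairness.
    stage1 = {ivc: cands[0] for ivc, cands in requests.items() if cands}
    # Stage 2: resolve conflicts so each output VC is granted once.
    grants = {}
    claimed = set()
    for ivc in sorted(stage1):
        ovc = stage1[ivc]
        if ovc not in claimed:
            claimed.add(ovc)
            grants[ivc] = ovc
    return grants
```

Because each stage arbitrates locally, the result is fast but not a maximum matching; that trade-off is what the separable vs. wavefront comparisons below quantify.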

Sparse VC Allocation

[figure, single input port shown: restricting which output VCs may be requested based on traffic class reduces allocator requests (64 → 32 → 24 in the examples) and the number of VCs considered at each port]

VC Allocator Delay

[bar chart: min. cycle time (FO4) for sep in, sep out, wf unr, wf rot, and wf rep allocators vs. the canonical design, for P=5, V=2x1x1 (5 ports, 2x1 VCs) and P=5, V=2x1x2 (5 ports, 2x2 VCs); sparse VC allocation reduces delay by 30-58%]

VC Allocator Area

[bar chart: area (sq um) for sep in, sep out, wf unr, wf rot, and wf rep allocators vs. the canonical design (up to 31800 sq um), for the same two configurations; sparse VC allocation reduces area by 50-78%]

Switch Allocation

Once a VC is allocated, the packet can be forwarded
Broken down into flits
For each flit, must request crossbar access
Switch allocator generates crossbar schedule

[figure: crossbar connecting inputs to outputs]
[Enright Jerger and Peh, On-Chip Networks]
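The crossbar schedule is a matching between inputs and outputs. As a hedged sketch (illustrative code, not the thesis RTL), a wavefront-style allocator — one of the implementation families compared in this talk — sweeps diagonals of the request matrix, granting requests whose input and output are both still free:

```python
def wavefront_alloc(req, start=0):
    """req: NxN request matrix (inputs x outputs), req[i][j] truthy if
    input i wants output j. `start` selects the priority diagonal.
    Returns a set of (input, output) grants forming a matching."""
    n = len(req)
    in_free = [True] * n
    out_free = [True] * n
    grants = set()
    # Sweep all n diagonals, beginning with the priority diagonal; a
    # hardware wavefront allocator evaluates these in a single pass.
    for d in range(n):
        for i in range(n):
            j = (i + start + d) % n
            if req[i][j] and in_free[i] and out_free[j]:
                grants.add((i, j))
                in_free[i] = False
                out_free[j] = False
    return grants
```

Rotating `start` from cycle to cycle provides fairness; the "fast loop-free" variants mentioned later address the combinational loops this structure creates in hardware.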

Speculative Switch Allocation

Reduce pipeline latency by attempting switch allocation in parallel with VC allocation
Speculate that a VC will be assigned!
But mis-speculation wastes crossbar bandwidth
Must prioritize non-speculative requests

Pessimistic Speculation

Speculation matters most when network is lightly loaded
At low network load, most requests are granted
Idea: Assume all non-spec. requests will be granted!

[block diagram: non-spec. requests feed a non-spec. allocator producing non-spec. grants; spec. requests feed a spec. allocator whose grants are masked by conflict detection against the non-spec. requests]
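A hedged sketch of the masking step (illustrative names, not the thesis RTL): conflict detection pessimistically assumes every non-speculative request wins, so a speculative grant survives only if no non-speculative request shares its input or its output:

```python
# Illustrative model of pessimistic conflict detection: instead of
# comparing against the non-speculative *grants* (which arrive late),
# mask speculative grants against the non-speculative *requests*,
# assuming all of them will be granted.

def mask_spec_grants(nonspec_req, spec_grant):
    """Both args: dicts mapping input port -> output port.
    Returns the speculative grants that survive masking."""
    busy_inputs = set(nonspec_req)
    busy_outputs = set(nonspec_req.values())
    return {i: o for i, o in spec_grant.items()
            if i not in busy_inputs and o not in busy_outputs}
```

At low load there are few non-speculative requests, so the pessimistic mask rarely kills a useful grant; near saturation it over-suppresses, which is the throughput trade-off quantified below.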

Performance with Speculation

[chart: average packet latency (cycles) vs. offered load (flits/cycle/node) for nonspec, pessimistic, and canonical designs; pessimistic speculation reduces zero-load latency by 21% and stays within 2% of the canonical design near saturation; Mesh, 2 VCs, UR traffic]

Area and Delay Impact

[chart: area (sq um) vs. cycle time (FO4) for nonspec, pessimistic, and canonical designs: +16% max. clock freq., -13% area @ 1.2 GHz, -5% area @ 1 GHz; full router, Mesh, 2 VCs, TSMC 45nm GP]

Additional Contributions
Fast loop-free wavefront allocators
Priority-based speculation
Practical combined VC and switch allocation
Details in thesis

Summary
Sparse VC allocation exploits traffic classes to reduce VC allocator complexity
Reduces delay by 30-60%, area by 50-80%
No change in functionality

Pessimistic speculation reduces overhead for speculative switch allocation
Reduces overall router area by up to 13%
Reduces critical path delay by up to 14%
In exchange for some throughput loss near saturation

Outline
Introduction
Allocator Implementations
BUFFER MANAGEMENT
Infrastructure
Conclusions

[Becker et al.: Adaptive Backpressure: Efficient Buffer Management for On-Chip Networks, to appear in ICCD12]

Buffer Cost
[pie chart, TRIPS total network power: buffers 35%, crossbar 33%, channels 31%, allocators 1%]

[Wang et al.: Power-driven Design of Router Microarchitectures in On-chip Networks]

Buffer Management
Many designs divide buffer statically among VCs
Assign each VC its fair share

But optimal buffer organization depends on load
Low load favors deep VCs
High load favors many VCs
For fixed buffer size, static schemes must pick one or the other

Improve utilization by allowing buffer space to be shared among VCs
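One way to share buffer space among VCs is a linked-list organization, where every VC draws slots from a common free list (the scheme the following results are based on). A minimal Python sketch, with illustrative names rather than the thesis RTL:

```python
# Behavioral model of a linked-list shared input buffer: all VCs
# allocate slots from one shared free list, so buffer space shifts
# between VCs with load instead of being statically partitioned.
from collections import deque

class SharedBuffer:
    def __init__(self, num_slots, num_vcs):
        self.free = deque(range(num_slots))        # shared free-slot list
        self.data = [None] * num_slots
        self.vc_queue = [deque() for _ in range(num_vcs)]  # slot order per VC

    def push(self, vc, flit):
        if not self.free:
            return False          # backpressure: no slot available
        slot = self.free.popleft()
        self.data[slot] = flit
        self.vc_queue[vc].append(slot)
        return True

    def pop(self, vc):
        slot = self.vc_queue[vc].popleft()
        self.free.append(slot)    # slot returns to the shared pool
        return self.data[slot]
```

Note that nothing here bounds how many slots a single VC may hold, which is exactly the monopolization problem examined next.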

Buffer Management Performance

[chart: buffer cost (registers) vs. saturation rate (flits/cycle/node) for static and dynamic schemes with 8-, 12-, and 16-slot buffers; dynamic management reduces buffer cost by 18-28% at matched throughput, or improves saturation rate by 8%; linked-list based scheme, harmonic mean across traffic patterns]

Buffer Monopolization
[figure: upstream and downstream routers with switch allocators and credit return path]

Congestion leads to buffer monopolization
Uncongested traffic sees reduced buffer space
Increases latency, reduces throughput
Congestion spreads across VCs!
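The backpressure mechanism underlying this monopolization is credit-based flow control. A minimal model (illustrative names, not the thesis RTL): the upstream router may send a flit only when it holds a credit, i.e. when a downstream buffer slot is known to be free, and the downstream router returns a credit each time it frees a slot.

```python
# Minimal model of credit-based flow control between two routers.

class CreditLink:
    def __init__(self, downstream_slots):
        self.credits = downstream_slots   # free downstream slots known upstream

    def can_send(self):
        return self.credits > 0

    def send_flit(self):
        assert self.can_send()
        self.credits -= 1                 # one downstream slot now occupied

    def receive_credit(self):
        self.credits += 1                 # downstream freed a slot
```

With a shared buffer, credits from all VCs draw on the same slot pool, so a congested VC that never frees its slots starves every other VC of credits.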

Adaptive Backpressure
Avoid unproductive use of buffer space
Impose quotas on outstanding credits
Share freely under benign conditions
Limit sharing to avoid performance pathologies
Vary backpressure based on demand

[block diagram; LAR: lookahead routing logic, IVC: input VC state, OVC: output VC state, SOT: shared occupancy tracker. A quota derived from OVC busy/occupancy state and the SOT throttles credits returned upstream]

Buffer Quota Heuristic


Goal: Set quota values just high enough to support observed throughput for each VC
Allow credit stalls that overlap with other stalls
Drain unproductive buffer occupancy

Difficult to measure throughput directly
Instead, infer it from credit round trip times
In absence of congestion, set quota to RTT
For each downstream stall cycle, reduce it by one
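The heuristic above can be sketched as a per-cycle quota update (an illustrative pseudo-implementation, not the thesis RTL; the floor of one credit is an assumption to keep the VC live):

```python
# Hedged sketch of the buffer quota heuristic: without congestion the
# quota tracks the credit round trip time (enough outstanding credits
# to sustain full throughput); each observed downstream stall cycle
# shrinks it by one, draining unproductive buffer occupancy.

def update_quota(credit_rtt, downstream_stalled, quota):
    if downstream_stalled:
        # downstream stall: reduce the credit quota by one, but keep
        # at least one credit so the VC can make progress (assumption)
        return max(quota - 1, 1)
    # no congestion observed: reset quota to the credit round trip time
    return credit_rtt
```

Because the quota only shrinks while the downstream VC is actually stalled, credit stalls it introduces overlap with stalls that would have happened anyway, as the timelines on the next two slides illustrate.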

Buffer Quota Motivation (1)

[timeline diagrams, Router 0 and Router 1: with credit round trip time Tcrt,0, full throughput is achieved in steady state; congestion causes a downstream stall (Tstall) and excess flits accumulate as unproductive buffer occupancy]

Buffer Quota Motivation (2)

[timeline diagrams, Router 0 and Router 1: an insufficient credit supply causes an idle cycle (Tidle) downstream, while a credit stall resolves unproductive buffer occupancy by draining the excess flit]

Network Stability

[charts, tornado traffic: throughput (flits/cycle/node) and average buffer occupancy (flits) vs. offered load for baseline and adaptive; adaptive backpressure keeps throughput stable beyond saturation and reduces average occupancy by up to 6.3x]

Traffic Isolation

[charts, measuring zero-load latency increase with background traffic: foreground zero-load latency (cycles) vs. background offered load for baseline and adaptive, with uniform random foreground traffic; adaptive backpressure reduces foreground latency by 33% under uniform random background traffic and by 38% under hotspot background traffic]

Zero-load Latency with Background

[bar chart: zero-load latency (cycles) per traffic pattern (bitcomp, bitrev, shuffle, tornado, transpose, uniform, mean) for baseline and adaptive with 50% uniform random background traffic, compared against the no-background case; adaptive reduces latency by 31%]

Throughput with Background

[chart, 50% uniform random background traffic: baseline vs. adaptive throughput compared against the no-background case; adaptive improves throughput by up to 3.3x, coming within 13% of the no-background case]

Application Performance Setup

Model traffic in a heterogeneous CMP
Each node generates two types of traffic:
PARSEC application traffic models a latency-optimized core
Streaming traffic to memory controllers models an array of throughput-optimized cores

[diagram: node with CPU, L1, L2, network interface, and an array of streaming processors (SP); memory controllers with attached memory banks]

Application Performance

[bar chart: normalized run time for baseline vs. adaptive across PARSEC workloads (bscholes, canneal, dedup, ferret, fanimate, vips, x264) and their gmean, with 12.5% injection rate for streaming traffic, compared against runs without background traffic; adaptive reduces run time by 31%]

Summary
Sharing improves buffer utilization, but can lead to pathological performance
Adaptive Backpressure minimizes unproductive use of shared buffer space
Mitigates performance degradation in presence of adversarial traffic
But maintains key benefits of buffer sharing under benign conditions

Infrastructure
Open source NoC router RTL
State-of-the-art router implementation
Highly parameterized
Topology, routing, allocators, buffers, …
Pervasive clock gating
Fully synthesizable
100 files, >22k LOC of Verilog-2001
Used in research efforts both inside and outside our research group

Conclusions
Future large-scale chip multiprocessors will require efficient on-chip networks
Router microarchitecture is one of many aspects that need to be optimized

Allocation has a direct impact on router delay and throughput
By exploiting higher-level properties, we can reduce cost and delay without degrading performance

Input buffers are attractive candidates for optimization
However, care must be taken to avoid performance pathologies
By avoiding unproductive use of buffer space, Adaptive Backpressure mitigates undesired interference effects

Acknowledgements
Bill, Christos, and Kunle
Prof. Nishi
George, Ted, Curt & the rest of the CVA gang


That's it for today.

THANK YOU!
