Outline
- INTRODUCTION
- Allocator Implementations
- Buffer Management
- Infrastructure
- Conclusions
8/21/12
Networks-on-Chip
[Diagram: chip composed of many cores]
- Moore's Law alive & well
- Many cores per chip
- Must work together
Performance
- Latency
- Throughput
- Fairness, QoS

Cost
- Die area
- Wiring resources
- Design complexity

[Pie chart residue: Caches 10%, Core 31%]
[Peh and Dally: A Delay Model for Router Microarchitectures]
Efficient Microarchitecture for NoC Routers
Outline
- Introduction
- ALLOCATOR IMPLEMENTATIONS
- Buffer Management
- Infrastructure
- Conclusions
Allocators
- Fundamental part of router control logic
- Manage access to network resources
- Orchestrate flow of packets through router
- Affect network utilization
- Potentially affect cycle time
Sparse VC Allocation
[Diagram: sparse VC allocation restricts each packet to the VCs of its own traffic class, reducing the VC allocator's request count]
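The idea above can be sketched as follows: restrict each packet's VC requests to the VCs of its own traffic class, so the allocator sees far fewer requests than in the canonical design. The class-to-VC mapping below is purely illustrative, not the configuration from the thesis:

```python
# Sketch: sparse VC allocation. A packet only requests VCs
# belonging to its own traffic class, shrinking the allocator's
# request matrix. Mapping below is hypothetical.

VCS_PER_CLASS = {"request": [0, 1], "reply": [2, 3]}  # assumed mapping
ALL_VCS = [0, 1, 2, 3]

def vc_requests(packet_class, sparse=True):
    """Return the set of output VCs a packet may request."""
    if not sparse:
        return ALL_VCS                   # canonical: any VC
    return VCS_PER_CLASS[packet_class]   # sparse: class's VCs only

# A reply packet requests 2 VCs instead of all 4:
reply_reqs = vc_requests("reply")
```

Functionality is unchanged, since a packet of a given class was never going to win a VC outside its class anyway; the allocator just stops considering those combinations.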
VC Allocator Delay
[Chart: minimum cycle time (FO4) for sep. input, sep. output, and wavefront (unr/rot/rep) allocator variants; sparse VC allocation reduces delay by 30-58% relative to the canonical design]
VC Allocator Area
[Chart: allocator area (sq. um) for sep. input, sep. output, and wavefront (unr/rot/rep) allocator variants; sparse VC allocation reduces area by 50-78% relative to the canonical design]
Switch Allocation
- Once a VC is allocated, the packet can be forwarded
- Packets are broken down into flits
- Each flit must request crossbar access
- Switch allocator generates the crossbar schedule
[Diagram: crossbar connecting router inputs to outputs]
[Enright Jerger and Peh, On-Chip Networks]
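The switch allocator's job of turning per-flit requests into a crossbar schedule can be sketched as a simple separable, input-first allocator. This is an illustrative behavioral model under assumed fixed priorities (real designs use rotating round-robin arbiters), not the RTL from the thesis:

```python
# Sketch: separable (input-first) switch allocation.
# Stage 1: each input picks one of the outputs it requests.
# Stage 2: each output grants at most one input that picked it.
# Fixed (lowest-index-first) priorities are an assumption here.

def switch_allocate(requests):
    """requests[i] = set of outputs input i's head flit wants.
    Returns a conflict-free crossbar schedule: input -> output."""
    # Stage 1: input arbitration (pick the lowest-numbered output).
    selection = {i: min(outs) for i, outs in requests.items() if outs}
    # Stage 2: output arbitration (first input to claim an output wins).
    grants, taken = {}, set()
    for i in sorted(selection):
        out = selection[i]
        if out not in taken:
            grants[i] = out
            taken.add(out)
    return grants

# Inputs 0 and 1 both want output 2; only one wins this cycle,
# the loser must retry next cycle.
schedule = switch_allocate({0: {2}, 1: {2, 3}, 2: {0}})
```

Note that separable allocation is not maximal: input 1 loses output 2 even though output 3 was free, which is exactly the kind of inefficiency that motivates the wavefront allocators discussed earlier.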
Pessimistic Speculation
- Speculation matters most when network is lightly loaded
- At low network load, most requests are granted
- Idea: assume all non-spec. requests will be granted!

[Diagram: non-spec. requests feed the non-spec. allocator; spec. requests feed the spec. allocator; conflict detection on the non-spec. requests produces a mask applied to the spec. grants]
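The masking step can be sketched as follows: since every non-speculative request is assumed to win, a speculative grant is simply killed whenever any non-speculative request targets the same output. An illustrative model only, not the thesis circuit:

```python
# Sketch: pessimistic speculation. Instead of comparing speculative
# grants against actual non-spec. grants, conservatively assume
# every non-spec. REQUEST wins and mask conflicting spec. grants.

def mask_spec_grants(nonspec_request_outputs, spec_grants):
    """nonspec_request_outputs: set of outputs with non-spec. requests.
    spec_grants: input -> output from the speculative allocator.
    Returns the speculative grants surviving conflict detection."""
    return {i: out for i, out in spec_grants.items()
            if out not in nonspec_request_outputs}  # non-spec. assumed to win

# Output 1 has a non-spec. request, so the spec. grant for it is killed:
surviving = mask_spec_grants({1}, {0: 1, 2: 3})
```

The pessimism occasionally kills a speculative grant that would have succeeded, but at low load there are few non-speculative requests, so little is lost while the grant-comparison logic leaves the critical path.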
[Chart: latency vs. injection rate (0.2-0.4 flits/cycle/node); pessimistic speculation within <2% of baseline]
[Chart: nonspec, pessimistic, and canonical variants across cycle time (FO4, 10-60)]
Additional Contributions
- Fast loop-free wavefront allocators
- Priority-based speculation
- Practical combined VC and switch allocation
- Details in thesis
Summary
- Sparse VC allocation exploits traffic classes to reduce VC allocator complexity
- Reduces delay by 30-60%, area by 50-80%
- No change in functionality
Outline
- Introduction
- Allocator Implementations
- BUFFER MANAGEMENT
- Infrastructure
- Conclusions
[Becker et al.: Adaptive Backpressure: Efficient Buffer Management for On-Chip Networks, to appear in ICCD'12]
Buffer Cost
[Pie chart: TRIPS total network power breakdown; channels 31%, buffers 35%]
Buffer Management
- Many designs divide buffer statically among VCs
- Assign each VC its fair share
- For a fixed buffer size, static schemes must pick one or the other
[Chart: throughput (flits/cycle/node, 0.075-0.115) for static vs. dynamic buffer management]
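The static/dynamic trade-off can be sketched with two admission checks over the same fixed buffer: static partitioning caps each VC at its fair share, while dynamic management lets any VC draw from the shared pool. Buffer sizes below are illustrative assumptions:

```python
# Sketch: static vs. dynamic buffer management for one router input
# with a fixed total buffer. Sizes are hypothetical.

TOTAL_SLOTS = 16
NUM_VCS = 4

def can_accept_static(occupancy, vc):
    """Static partitioning: each VC limited to its fair share."""
    return occupancy[vc] < TOTAL_SLOTS // NUM_VCS

def can_accept_dynamic(occupancy, vc):
    """Dynamic sharing: any VC may use any free slot."""
    return sum(occupancy) < TOTAL_SLOTS

# A bursty VC 0 is full under static partitioning, but under
# sharing it can borrow the slots that idle VCs are not using.
occ = [4, 0, 0, 0]
```

The dynamic check is what improves utilization, and also what opens the door to the monopolization problem on the next slide: nothing stops one congested VC from consuming the entire shared pool.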
Buffer Monopolization
[Diagram: upstream router forwarding flits to downstream router; switch allocators and credit return path]
- Congestion leads to buffer monopolization
- Uncongested traffic sees reduced buffer space
- Increases latency, reduces throughput
Adaptive Backpressure
- Avoid unproductive use of buffer space
- Impose quotas on outstanding credits
- Share freely under benign conditions
- Limit sharing to avoid performance pathologies
[Diagram: router microarchitecture with lookahead routing (LAR) logic, input and output VC state, shared occupancy tracker, and combined allocator; busy/occupancy/quota state per output VC; credits and flits exchanged on each port]
- Difficult to measure throughput directly
- Instead, infer from credit round trip times
- In absence of congestion, set quota to RTT
- For each downstream stall cycle, reduce by one
[Timing diagram: credit round trip time T_crt,0 extended by stall time T_stall; excess flits; idle time T_idle]
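The quota rule from the bullets above can be sketched directly: start from the credit round trip time and subtract one credit per observed downstream stall cycle. The one-credit floor and the RTT value are assumptions for illustration, not parameters from the thesis:

```python
# Sketch: Adaptive Backpressure credit quota, following the rule
# on the slide: quota = RTT minus one per downstream stall cycle.
# The floor of one credit is an assumed clamp for illustration.

CREDIT_RTT = 8  # hypothetical credit round trip time in cycles

def quota(stall_cycles):
    """Quota on outstanding credits for one downstream VC."""
    return max(1, CREDIT_RTT - stall_cycles)

def can_send(outstanding_credits, stall_cycles):
    """Upstream router may forward a flit only while under quota."""
    return outstanding_credits < quota(stall_cycles)

# No congestion: a full RTT worth of credits may be in flight,
# so sharing is unrestricted. Heavy stalling: the quota shrinks,
# preventing the congested VC from monopolizing the shared buffer.
```

Because the quota decays only when the downstream router actually stalls, well-behaved traffic keeps the full benefit of buffer sharing while congested flows are throttled back toward their useful throughput.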
Network Stability
[Charts: throughput (flits/cycle/node) and latency for baseline vs. adaptive under tornado traffic; 6.3x throughput difference labeled]
Traffic Isolation
[Measure zero-load latency increase with background traffic]
[Charts: foreground zero-load latency (cycles) for baseline vs. adaptive, with and without background traffic, across bitcomp, bitrev, shuffle, uniform, and mean; reductions of 13-38% labeled]
[Diagram: CMP tile with L1, I/O, network interface, and memory banks]
Application Performance
[Chart: baseline vs. adaptive across workloads bscholes, canneal, dedup, ferret, fanimate, vips, x264, and gmean; -31% labeled]
Summary
- Sharing improves buffer utilization, but can lead to pathological performance
- Adaptive Backpressure minimizes unproductive use of shared buffer space
- Mitigates performance degradation in presence of adversarial traffic
- But maintains key benefits of buffer sharing under benign conditions
Infrastructure
- Open-source NoC router RTL
- State-of-the-art router implementation
- Highly parameterized: topology, routing, allocators, buffers, ...
- Pervasive clock gating
- Fully synthesizable
- 100 files, >22k LOC of Verilog-2001
- Used in research efforts both inside and outside our research group
Conclusions
- Future large-scale chip multiprocessors will require efficient on-chip networks
- Router microarchitecture is one of many aspects that need to be optimized
- Allocation has a direct impact on router delay and throughput
- By exploiting higher-level properties, we can reduce cost and delay without degrading performance
- Input buffers are attractive candidates for optimization
- However, care must be taken to avoid performance pathologies
- By avoiding unproductive use of buffer space, Adaptive Backpressure mitigates undesired interference effects
Acknowledgements
Bill, Christos, and Kunle; Prof. Nishi; George, Ted, Curt & the rest of the CVA gang
THANK YOU!