Agenda
Network design requirements
Protocol selection: BGP vs IGP
Details of Routing Design
Motivation for BGP SDN
Design of BGP SDN Controller
The roadmap for BGP SDN
Design Requirements
Scale of the data-center network:
100K+ bare metal servers
Over 3K network switches per DC
Applications:
Map/Reduce: Social Media, Web Index and Targeted Advertising
Public and Private Cloud Computing: Elastic Compute and Storage
Real-Time Analytics: Low-latency computing leveraging distributed memory across discrete nodes
Key outcome:
East-West traffic profile drives the need for large bisection bandwidth.
Standards Based
Control Plane Scaling and Stability
Minimize resource consumption, e.g., CPU and TCAM usage predictable and low
Minimize the size of the L2 failure domain
Layer 3 equal-cost multipathing (ECMP)
Programmable: extensible and easy to automate
Traffic Engineering
Heavy use of ECMP makes TE in the DC less important than in the WAN. However, we still want to drain devices and respond to imbalances.
Routing Design
Single logical link between devices, eBGP all the way down to the ToR.
Separate BGP ASN per ToR, ToR ASNs reused between containers.
Parallel spines (Green vs Red) for horizontal scaling.
~100 spines
~200 leafs
~2K ToRs
~100K servers
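Because ToR ASNs repeat between containers, ASN assignment can be purely positional. A minimal sketch of one such scheme (the ranges and function names are hypothetical, not from this deck):

# Hypothetical positional ASN scheme: every container reuses the same
# private 16-bit ASN block for its ToRs, so ToR i looks identical in
# every container, while leaf ASNs stay unique within the DC.
TOR_ASN_BASE = 65000   # assumed start of the reusable ToR block

def tor_asn(tor_index: int) -> int:
    """ASN of the i-th ToR in *any* container (reused across containers)."""
    return TOR_ASN_BASE + tor_index

def leaf_asn(container_index: int) -> int:
    """One unique ASN per container for its leaf layer (illustrative)."""
    return 64600 + container_index

# ToR 1 carries the same ASN whether it sits in container 1 or container 7:
assert tor_asn(1) == 65001
print(tor_asn(1), leaf_asn(7))  # 65001 64607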
Interoperability issues:
Logical vs Literal
Scale - Clustering
High Availability
Latency between controller and network element
Components of a Controller
Topology discovery
Path Computation
Monitoring and Network State Discovery
REST API
[Figure: controller architecture. A REST API fronts the controller and feeds big-data analysis/correlation systems. Inside the controller sit a topology module (physical, logical, control, and path views), a device manager, a collector, a BGP RIB manager, and a PCE, plus a monitoring module (state, flow, WD). Southbound, the controller reaches each network element via BGP, OpenFlow, or a vendor agent/BDM SDK.]
Controller Network Setup
Templates to peer with the central controller (passive listening)
Policy to prefer routes injected from the controller
Policy to announce only certain routes to the controller
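ExaBGP (referenced at the end of this deck) is one way to implement the controller's speaker side: it launches a helper process and treats each line the helper writes to stdout as an API command. A minimal sketch, with made-up prefix, next-hop, and community values:

#!/usr/bin/env python3
# ExaBGP helper-process sketch: ExaBGP runs this script and reads
# announce/withdraw commands from its stdout (ExaBGP 3.x/4.x text API).
import sys
import time

def announce(prefix, next_hop, community):
    # One command per line; the community marks the route as controller-injected.
    sys.stdout.write(
        "announce route %s next-hop %s community [%s]\n" % (prefix, next_hop, community)
    )
    sys.stdout.flush()

announce("192.0.2.0/24", "10.0.0.1", "65501:100")  # values are illustrative only

while True:          # stay alive so ExaBGP keeps the session up
    time.sleep(60)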
[Figure: network setup for controller peering. Devices in AS 64901, AS 64902, and other AS 64XXX ASNs peer with the controller, shown in AS 65501.]
[Figure: controller threading model. A REST API and a network graph (bootstrap information) feed a shared state database. Listener threads receive updates from managed devices and write-and-notify the database; the decision thread wakes up and reads; speaker threads and a state-sync thread push the results back to the managed devices and the command center.]
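The write-and-notify / wakeup-and-read arrows map onto a standard condition-variable pattern. A rough sketch of that pattern only, not the actual controller code:

import threading
from collections import deque

class SharedStateDB:
    """Listener threads write and notify; the decision thread wakes and reads."""
    def __init__(self):
        self._cond = threading.Condition()
        self._updates = deque()

    def write_and_notify(self, update):
        with self._cond:
            self._updates.append(update)
            self._cond.notify_all()

    def wakeup_and_read(self):
        with self._cond:
            while not self._updates:
                self._cond.wait()
            return self._updates.popleft()

db = SharedStateDB()

def decision_thread():
    device, event = db.wakeup_and_read()   # blocks until a listener writes
    print("recomputing paths after %s from %s" % (event, device))

t = threading.Thread(target=decision_thread)
t.start()
db.write_and_notify(("R1", "route-update"))  # what a listener thread would do
t.join()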
[Figure: injection check. The controller injects a prefix for R1 tagged with a one-hop community. R1 relays the prefix to R2, but R2 does not relay it on to R3, so the controller expects to learn the prefix for R1 back from R2 only.]
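This doubles as a cheap liveness probe: the controller knows exactly which peers should echo the prefix back. A sketch of the verification logic (router names and structure are illustrative):

def check_one_hop_probe(injected_at, direct_peers, heard_from):
    """Compare who echoed a one-hop probe against who should have."""
    healthy = direct_peers & heard_from          # e.g. R2 relayed the prefix
    suspect = direct_peers - heard_from          # a direct peer stayed silent
    leaked = heard_from - direct_peers - {injected_at}
    if leaked:                                   # e.g. R3 should never relay it
        raise RuntimeError("one-hop community not honored: %s" % leaked)
    return healthy, suspect

print(check_one_hop_probe("R1", {"R2"}, {"R2"}))  # ({'R2'}, set())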
[Figure: multipath injection. The controller injects prefix X/24 into R1 with next-hops N1 and N2, and injects the next-hop prefixes N1/32 and N2/32 via R2 and R3 respectively, so R1 resolves X/24 across both next-hops.]
PUT http://<controller>/state/link/up=R1,R2&down=R3,R4
[Figure: the link between S1 and R1 is overloaded for prefix X; after the API call above marks links up/down, the controller updates the indicated devices (step 1) first.]
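Driving that endpoint is a single HTTP PUT; a sketch using the requests library (controller hostname and URL encoding are assumptions based on the example above):

import requests

CONTROLLER = "controller.example.net"   # stands in for <controller>

def declare_link_state(up_pairs, down_pairs):
    """Tell the controller which links to treat as up or down."""
    up = "&".join("up=%s,%s" % pair for pair in up_pairs)
    down = "&".join("down=%s,%s" % pair for pair in down_pairs)
    query = "&".join(part for part in (up, down) if part)
    resp = requests.put("http://%s/state/link/%s" % (CONTROLLER, query), timeout=5)
    resp.raise_for_status()

# Reproduces the example call above: mark R1-R2 up and R3-R4 down.
declare_link_state([("R1", "R2")], [("R3", "R4")])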
Traffic Engineering
[Figure: before and after. With even 50%/50% ECMP splits at R1 and R2, the link into R4 runs at 100% and is congested. After the controller installs 75%/25% ECMP weights, the hot link drops to 50% and the congestion is alleviated.]
Controller installs paths with different ECMP weights.
Requires knowing the traffic matrix (TM) and the network topology and capacities
Solves a linear programming (LP) problem
Computes ECMP weights for every prefix at every hop; optimal for a given TM (a toy version of the LP follows below)
Link state change causes reprogramming
More state pushed down to the network
[Figure: example of an uneven 66%/33% split.]
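A toy version of that LP, for a single demand over parallel paths, minimizing the worst link utilization. Capacities and demand are made-up numbers (the real problem covers every prefix at every hop); it reproduces a 66%/33%-style split:

import numpy as np
from scipy.optimize import linprog

caps = np.array([20.0, 10.0])   # path capacities, Gb/s (assumed)
demand = 15.0                   # traffic toward one prefix, Gb/s (assumed)
k = len(caps)

# Variables: x_0..x_{k-1} = ECMP split fractions, plus theta = max utilization.
c = np.zeros(k + 1)
c[-1] = 1.0                                        # minimize theta
A_ub = np.hstack([demand * np.eye(k), -caps.reshape(-1, 1)])
b_ub = np.zeros(k)                                 # demand*x_i <= theta*cap_i
A_eq = np.ones((1, k + 1))
A_eq[0, -1] = 0.0                                  # sum of x_i = 1
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, 1)] * k + [(0, None)])

print(res.x[:k])    # ~[0.667, 0.333]: the uneven split
print(res.x[-1])    # worst-case utilization, 0.5 here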
What we learned
Does not require new firmware, silicon, or APIs.
Some BGP extensions are nice to have.
BGP code tends to be mature.
Easy to roll back to default BGP routing.
Solves our current problems and allows solving more.
Questions?
Contacts:
Edet Nkposong - edetn@microsoft.com
Tim LaBerge - Tim.LaBerge@microsoft.com
Naoki Kitajima - naokikit@microsoft.com
References
http://datatracker.ietf.org/doc/draft-lapukhov-bgp-routing-large-dc/
http://code.google.com/p/exabgp/
http://datatracker.ietf.org/doc/draft-ietf-idr-link-bandwidth/
http://datatracker.ietf.org/doc/draft-lapukhov-bgp-sdn/
http://www.nanog.org/meetings/nanog55/presentations/Monday/Lapukhov.pdf
http://www.nanog.org/sites/default/files/wed.general.brainslug.lapukhov.20.pdf
http://research.microsoft.com/pubs/64604/osr2007.pdf
http://research.microsoft.com/en-us/people/chakim/slbsigcomm2013.pdf
Backup Slides
[Figure: target topology. Routers R2 through R8 connect podset switches SW 1 and SW 2 in PODSET 1, PODSET 2, ..., PODSET 72.]
Test Bed
4 spine switches
144 VRFs created on a router; each VRF = 1 podset switch
Each VRF has 8 logical interfaces (2 to each spine); this emulates the 8-way connectivity required by the podset switch
3 physical podset switches
Each podset carries 6 server-side IP subnets
Test Bed
Route table calculations: expected OSPF state
144 x 2 x 4 = 1152 links for infrastructure
144 x 6 = 864 server routes (although these will be 4-way, since we have brought everything into 4 spines instead of 8)
Some loopback addresses and routes from the real podset switches
We expect ~(144 x 2 x 4) + (144 x 6) - 144 = 1872 routes
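A quick check of that arithmetic (reading the garbled final term as minus 144, which is what makes the stated total come out):

infra = 144 * 2 * 4          # infrastructure links  -> 1152
server = 144 * 6             # server routes         -> 864
print(infra + server - 144)  # -> 1872, the expected route count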
Initial testing proved that the platform can sustain this scale (control and forwarding plane).
What happens when we shake things up?
[Chart: route count over time during the churn test, from 2:30:32 to 2:34:31; y-axis from 2,000 to 14,000 routes.]
[Chart: convergence of sample OSPF (type O) routes over the same 2:30:32 to 2:34:31 window, one series per sampled 192.0.x.x prefix.]
[Figure: test-bed detail. Podset switches SW 1 and SW 2, each carrying 6 server VLANs, connect to routers R1 through R8.]
[Figure: BGP test-bed topology. Routers R2 through R8 connect podset switches SW 1 and SW 2 in PODSET 1 (AS 65001), PODSET 2 (AS 65002), ..., PODSET 72 (AS 65072).]
BGP advantages
Very deterministic behavior
Protocol design takes care of eliminating the surge effect (i.e., a spine won't learn routes carrying its own AS)
ECMP is supported, and routes are labeled by the podset they came from (the AS number). Beautiful!
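The surge-effect point is standard eBGP loop prevention: a router rejects any route whose AS_PATH already contains its own ASN. A toy illustration, with ASNs borrowed from the figures above:

def accept_route(as_path, local_asn):
    """eBGP loop prevention: drop routes already carrying our own ASN."""
    return local_asn not in as_path

SPINE_ASN = 64901
print(accept_route([65001], SPINE_ASN))         # True: fresh podset route
print(accept_route([64901, 65001], SPINE_ASN))  # False: would loop back to us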