CASE STUDIES
FROM DMVPN AND WAN EDGE TO SERVER CONNECTIVITY AND VIRTUAL APPLIANCES
CONTENT AT A GLANCE
FOREWORD
INTRODUCTION
TABLE OF FIGURES
Figure 1-1: Network core and Internet edge
Figure 2-1: Existing MPLS VPN WAN network topology
Figure 2-2: Proposed new network topology
Figure 2-3: OSPF areas
Figure 2-4: OSPF-to-BGP route redistribution
Figure 2-5: Inter-site OSPF route advertisements
Figure 2-6: DMVPN topology
Figure 2-7: OSPF areas in the Internet VPN
Figure 2-8: OSPF external route origination
Figure 2-9: Multiple OSPF processes with two-way redistribution
Figure 2-10: BGP sessions in the WAN infrastructure
Figure 2-11: Single AS number used on all remote sites
Figure 2-12: BGP enabled on every layer-3 device between two BGP routers
Figure 2-13: BGP routing information redistributed into OSPF
Figure 2-14: Dedicated VLAN between BGP edge routers
Figure 2-15: Remote site logical network topology and routing
Figure 2-16: Central site logical network topology and BGP+OSPF routing
Figure 3-1: Planned DMVPN network
Figure 3-2: BGP routing in existing WAN backbone
Figure 4-1: Redundant data centers and their Internet connectivity
Figure 4-2: Simplified topology with non-redundant internal components
Figure 4-3: BGP sessions between Internet edge routers and the ISPs
Figure 4-4: Outside WAN backbone in the redesigned network
Figure 4-5: Point-to-point Ethernet links implemented with EoMPLS on DCI routers
Figure 4-6: Single stretched VLAN implemented with VPLS across L3 DCI
Figure 4-7: Two non-redundant stretched VLANs provide sufficient end-to-end redundancy
Figure 4-8: Virtual topology using point-to-point links
Figure 4-9: Virtual topology using stretched VLANs
Figure 4-10: Full mesh of IBGP sessions between Internet edge routers
Figure 4-11: Virtual Device Contexts: dedicated management planes and physical interfaces
Figure 4-12: Virtual Routing and Forwarding tables: shared management, shared physical interfaces
Figure 4-13: BGP core in WAN backbone
Figure 4-14: MPLS core in WAN backbone
Figure 7-9: LACP between a server and ToR switches
Figure 7-10: Optimal traffic flow with MLAG
Figure 7-11: Redundant server connectivity requires the same IP subnet on adjacent ToR switches
Figure 7-12: A single uplink is used without server-to-ToR LAG
Figure 7-13: All uplinks are used by a Linux host using balance-tlb bonding mode
Figure 7-14: All ToR switches advertise IP subnets with the same cost
Figure 7-15: IP routing with stackable switches
Figure 7-16: Layer-2 fabric between hypervisor hosts
Figure 7-17: Optimal flow of balance-tlb traffic across a layer-2 fabric
Figure 7-18: LAG between a server and adjacent ToR switches
Figure 8-19: Packet filters protecting individual servers
Figure 8-20: VM NIC firewalls
Figure 8-21: Per-application firewalls
Figure 8-22: High-performance WAN edge packet filters combined with a proxy server
Figure 9-1: Centralized network services implemented with physical appliances
Figure 9-2: Centralized network services implemented with physical appliances
Figure 9-3: Applications accessing external resources
Figure 9-4: Hybrid architecture combining physical and virtual appliances
Figure 10-1: Containers and data center backbone
Figure 10-2: Interaction with the provisioning/orchestration system
Figure 10-3: Traffic control appliances
Figure 10-4: Layer-3 traffic control devices
Figure 10-5: Bump-in-the-wire traffic control devices
Figure 10-6: Routing protocol adjacencies across traffic control appliances
Figure 11-1: Standard cloud infrastructure rack
Figure 11-2: Planned WAN connectivity
Figure 11-3: Cloud infrastructure components
Figure 11-4: Single orchestration system used to manage multiple racks
Figure 11-5: VLAN transport across IP infrastructure
Figure 12-1: Some applications use application-level load balancing solutions
Figure 12-2: Typical workload architecture with network services embedded in the application stack
Figure 12-3: Most applications use external services
Figure 12-4: Application tiers are connected through central physical appliances
Figure 12-5: Virtual appliance NIC connected to overlay virtual network
Figure 12-6: Virtual router advertises application-specific IP prefix via BGP
FOREWORD
Ivan Pepelnjak first came onto my radar in 2001, when I was tasked with migrating a large
multinational network from IGRP to EIGRP. As a CCIE I was (over)confident in my EIGRP abilities. I
had already deployed EIGRP for a smaller organization; how different could this new challenge be? A
few months into the project, I realized that designing a large-scale EIGRP network was quite
different from configuring a small one. Fortunately I stumbled across Ivan's EIGRP Network Design
Solutions book. So began a cycle which continues to this day: I take on a new project, look for a
definitive resource to understand the technologies, and discover that Ivan is the authoritative
source. MPLS, L3VPN, IS-IS... Ivan has covered it all!
Several years ago I was lucky enough to meet Ivan in person through my affiliation with Gestalt IT's
Tech Field Day program. We also shared the mic via the Packet Pushers Podcast on several
occasions. Through these opportunities I discovered Ivan to be a remarkably thoughtful collaborator.
He has a knack for asking the exact right question to direct your focus to the specific information
you need. Some of my favorite interactions with Ivan center on his answering my "could I do this?"
inquiry with a "yes, it is possible, but you don't want to do that because..." response. For a great
example of this, take a look at the "OSPF as the Internet VPN Routing Protocol" section in chapter 2 of
this book.
I have found during my career as a network technology instructor that case studies are the best
method for teaching network design. Presenting an actual network challenge and explaining the
thought process (including rejected solutions) greatly assists students in building the required skill
base to create their own scalable designs. This book uses this structure to explain diverse Enterprise
design challenges, from DMVPN to Data Centers to Internet routing. Over the next few hours of
reading you will accompany Ivan on many real-world consulting assignments. You have the option of
implementing the designs as presented (I can assure you they work "out of the box"!), or you can
use the rich collection of footnotes and references to customize the solution to your exact needs. In
either event, I am confident that you will find these case studies as useful as I have found them to
be.
Jeremy Filliben
Network Architect / Trainer
CCDE #20090003, CCIE #3851
INTRODUCTION
I started the ExpertExpress experiment a few years ago and it was unexpectedly successful; I was
amazed at how many people decided to ask me to help design or troubleshoot their network.
Most of the engagements touched at least one data center element, be it server virtualization, data
center network core, WAN edge, or connectivity between data centers and customer sites or public
Internet. I also noticed the same challenges appearing over and over, and decided to document
them in a series of ExpertExpress case studies, which eventually resulted in this book.
The book has two major parts: data center WAN edge and WAN connectivity, and internal data
center infrastructure.
In the first part, I'll walk you through common data center WAN edge challenges:
Optimizing BGP routing on data center WAN edge routers to reduce the downtime and brownouts
following link or node failures (chapter 1);
Integrating an MPLS/VPN network provided by one or more service providers with a DMVPN-over-Internet backup network (chapter 2);
Building a large-scale DMVPN network connecting one or more data centers with thousands of
remote sites (chapter 3);
Implementing redundant data center connectivity and routing between active/active data centers
and the outside world (chapter 4);
External routing combined with layer-2 data center interconnect (chapter 5).
Page xvii
This material is copyrighted and licensed for the sole use by Georgi Dobrev (georgi.d.dobrev@gmail.com [95.158.130.50]). More information at http://www.ipSpace.net/Webinars
The data center infrastructure part of the book covers these topics:
Replacing the central firewall with a scale-out architecture combining packet filters, virtual inter-subnet firewalls and VM NIC firewalls (chapter 8);
The final part of the book covers scale-out architectures, multiple data centers and disaster
recovery:
Scale-out private cloud infrastructure using standardized building blocks (chapter 11);
Simplified workload migration and disaster recovery with virtual appliances (chapter 12);
Active-active data centers and scale-out application architectures (chapter 13, coming in late
2014).
I hope you'll find the selected case studies useful. Should you have any follow-up questions, please
feel free to send me an email (or use the contact form @ ipSpace.net/Contact); I'm also available
for short online consulting engagements.
Happy reading!
Ivan Pepelnjak
September 2014
IN THIS CHAPTER:
BRIEF NETWORK DESCRIPTION
SOLUTION EXECUTIVE OVERVIEW
DETAILED SOLUTION
ENABLE BFD
ENABLE BGP NEXT HOP TRACKING
REDUCE THE BGP UPDATE TIMERS
REDUCE THE NUMBER OF BGP PREFIXES
BGP PREFIX INDEPENDENT CONVERGENCE
CONCLUSIONS
A large multi-homed content provider has experienced a number of outages and brownouts in the
Internet edge of their data center network. The brownouts were caused by high CPU load on the
Internet edge routers, leading to unstable forwarding tables and packet loss after EBGP peering
session loss.
This document describes the steps the customer could take to improve BGP convergence and
reduce the duration of Internet connectivity brownouts.
3. Updates are sent to other BGP neighbors withdrawing the lost routes.
4. BGP neighbors process the withdrawal updates, select alternate BGP best routes, and install
them in their routing and forwarding tables.
5. BGP neighbors advertise their new best routes.
6. The router processes incoming BGP updates, selects new best routes, and installs them in
routing and forwarding tables.
Neighbor loss detection can be improved with Bidirectional Forwarding Detection (BFD), fast
neighbor failover or BGP next-hop tracking. BGP update propagation can be fine-tuned with BGP
update timers. The other elements of the BGP convergence process are harder to tune; they depend
primarily on the processing power of the routers' CPUs and the underlying packet forwarding hardware.
Some router vendors offer functionality that can be used to pre-install backup paths in BGP tables
(BGP best external paths) and forwarding tables (BGP Prefix Independent Convergence). These
features can be used to redirect the traffic to the backup Internet connection even before the BGP
convergence process is complete.
Alternatively, you can significantly reduce the CPU load of the Internet edge routers, and improve the
BGP convergence time, by reducing the number of BGP prefixes accepted from the upstream ISPs.
Finally, you might need to replace your Internet edge routers with devices that have processing
power matching today's Internet routing table sizes.
DETAILED SOLUTION
The following design or configuration changes can be made to improve the BGP convergence process:
ENABLE BFD
Bidirectional Forwarding Detection (BFD) has been available in major Cisco IOS and Junos software
releases for several years. Service providers prefer BFD over BGP hold time adjustments because
high-end routers process BFD on the linecard, whereas the BGP hold timer relies on the BGP process
(running on the main CPU) sending keepalive packets over the BGP TCP session.
BFD has to be supported and configured on both ends of a BGP session; check with your ISP before
configuring BFD on your Internet-facing routers.
To configure BFD with BGP, use the following configuration commands on Cisco IOS:
! Enable BFD on the uplink interface toward the ISP
interface <uplink>
 bfd interval <timer> min_rx <timer> multiplier <n>
!
router bgp 65000
 neighbor <ip> remote-as <ISP-AS>
 ! Tear down the BGP session as soon as BFD declares the neighbor down
 neighbor <ip> fall-over bfd
Although you can configure BFD timers in the millisecond range, don't set them too low. BFD should
detect a BGP neighbor loss in a few seconds; you wouldn't want a short-term link glitch to trigger the
CPU-intensive BGP convergence process.
Cisco IOS and Junos support BFD on EBGP sessions. BFD on IBGP sessions has been available
since Junos release 8.3. Multihop BFD is available in Cisco IOS, but there's still no support for BFD
on IBGP sessions.
ENABLE BGP NEXT HOP TRACKING
BGP next-hop tracking is enabled by default on Cisco IOS; you can adjust the tracking interval with
the bgp nexthop trigger delay router configuration command.
In environments using default routing, you should limit the valid prefixes that can be used for BGP
next hop tracking with the bgp nexthop route-map router configuration command.
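A minimal Cisco IOS sketch of both knobs; the route-map and prefix-list names and the delay value are illustrative:

router bgp 65000
 ! Re-check reachability of BGP next hops at most every 5 seconds
 bgp nexthop trigger delay 5
 ! Only routes permitted by the route-map can validate BGP next hops
 bgp nexthop route-map NHT-FILTER
!
! Prevent the default route from being used for next-hop validation
route-map NHT-FILTER deny 10
 match ip address prefix-list DEFAULT
route-map NHT-FILTER permit 20
!
ip prefix-list DEFAULT permit 0.0.0.0/0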
If you want to use BGP next hop tracking in the primary/backup Internet access scenario described
in this document:
Do not change the BGP next hop on IBGP updates with the neighbor next-hop-self router
configuration command. Example: routes advertised from GW-A to GW-B must retain the original
next hop from the ISP-A router.
Advertise the IP subnets of the ISP uplinks into the IGP (example: OSPF) from GW-A and GW-B.
Use a route-map with BGP next hop tracking to prevent the default route advertised by GW-A
and GW-B from being used as a valid path toward an external BGP next hop.
When the link between GW-A and ISP-A fails, GW-A revokes the directly connected IP subnet from
its OSPF LSA, enabling GW-B to start the BGP best path selection process before it receives BGP updates
from GW-A.
BGP next-hop tracking detects link failures that result in loss of IP subnet. It cannot detect
EBGP neighbor failure unless you combine it with BFD-based static routes.
AS numbers in the AS path). Some ISPs attach BGP communities to BGP prefixes they advertise to
their customers to help the customers implement well-tuned filters.
When building an AS-path filter, consider the impact of AS path prepending on your AS-path
filter and use regular expressions that can match the same AS number multiple times.
Example: matching up to three AS numbers in the AS path might not be good enough, as
another AS might use AS-path prepending to enforce primary/backup path selection.
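For illustration, a Cisco IOS AS-path filter that accepts only prefixes originated by a hypothetical neighboring AS 64501, no matter how many times that AS prepends itself:

! Match an AS path consisting of AS 64501 repeated one or more times
ip as-path access-list 10 permit ^64501(_64501)*$
!
router bgp 65000
 neighbor <ip> filter-list 10 in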
After deploying inbound BGP update filters, your autonomous system no longer belongs to the
default-free zone (http://en.wikipedia.org/wiki/Default-free_zone); your Internet edge routers need
default routes from the upstream ISPs to reach destinations that are no longer present in their BGP tables.
BGP default routes could be advertised by upstream ISPs, requiring no further configuration on the
Internet edge routers.
If the upstream ISPs don't advertise BGP default routes, or if you can't trust the ISPs to perform
responsible default route origination, use local static default routes pointing to far-away next hops.
Root name servers are usually a suitable choice.
The default routes on the Internet edge routers should use next hops that are far away to
ensure the next hop reachability reflects the health of the upstream ISP's network. The
use of root DNS servers as next hops of static routes does not mean that the traffic will be
sent to the root DNS servers, just toward them.
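A sketch of this approach; the next hop shown is the address of a.root-servers.net (198.41.0.4) and is purely illustrative:

! Recursive static default route; it remains usable only while the
! BGP table still contains a path toward the far-away next hop
ip route 0.0.0.0 0.0.0.0 198.41.0.4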
CONCLUSIONS
BGP neighbor loss detection can be significantly improved by deploying Bidirectional Forwarding
Detection (BFD).
The backup Internet edge router can use BGP next-hop tracking to detect primary uplink loss and adjust
its forwarding tables before receiving BGP updates from the primary Internet edge router.
To reduce the CPU overload and slow convergence caused by massive changes in the BGP, routing,
and forwarding tables following a link or EBGP session failure:
Reduce the number of BGP prefixes accepted by the Internet edge routers;
IN THIS CHAPTER:
IP ROUTING OVERVIEW
DESIGN REQUIREMENTS
SOLUTION OVERVIEW
OSPF AS THE INTERNET VPN ROUTING PROTOCOL
BENEFITS AND DRAWBACKS OF OSPF IN INTERNET VPN
CONCLUSIONS
A large enterprise (the Customer) has a WAN backbone based on MPLS/VPN service offered by a
regional Service Provider (SP). The service provider has deployed Customer Premises Equipment
(CPE) routers at remote sites. Customer routers at the central site are connected directly to the SP
Provider Edge (PE) routers with 10GE uplinks as shown in Figure 2-1.
The traffic in the Customer's WAN network has been increasing steadily, prompting the customer to
increase the MPLS/VPN bandwidth or to deploy an alternate VPN solution. The Customer decided to
trial IPsec VPN over the public Internet, initially as a backup, and potentially as the primary WAN
connectivity solution.
The customer will deploy new central site routers to support the IPsec VPN service. These routers
will terminate the IPsec VPN tunnels and provide whatever other services are needed (example:
QoS, routing protocols) for the IPsec VPNs.
New low-end routers connected to the existing layer-3 switches will be deployed at the remote sites
to run the IPsec VPN (Figure 2-2 shows the proposed new network topology).
IP ROUTING OVERVIEW
The customer is using OSPF as the sole routing protocol and would prefer using OSPF in the new
IPsec VPN.
OSPF routes are exchanged between the Customer's core routers and the SP's PE routers, and between
the Customer's layer-3 switches and the SP's CPE routers at remote sites. The Customer's central site is in OSPF
area 0; all remote sites belong to OSPF area 51.
The only external connectivity remote customer sites have is through the MPLS/VPN SP
backbone; the OSPF area number used at those sites is thus irrelevant, and the SP chose to
use the same OSPF area on all sites to simplify the CPE router provisioning and
maintenance.
CPE routers deployed at the Customer's remote sites act as Customer Edge (CE) routers from the MPLS/VPN
perspective. The Service Provider uses BGP as the routing protocol between its PE and CE routers,
redistributing BGP routes into OSPF at the CPE routers for further propagation into the Customer's
remote sites.
OSPF routes received from the customer equipment (central site routers and remote site layer-3
switches) are redistributed into the BGP used by the SP's MPLS/VPN service, as shown in Figure 2-4.
Extended BGP communities don't propagate to CE routers running BGP with the PE routers, so the CPE routers don't
receive the extended communities indicating that the central site routes originated as OSPF routes. The
CPE routers thus redistribute routes received from other Customer sites as external OSPF routes
into the OSPF protocol running at remote sites.
Summary: All customer routes appear as external OSPF routes at all other customer sites (see
Figure 2-5 for details).
DESIGN REQUIREMENTS
The VPN-over-Internet solution must satisfy the following requirements:
Dynamic routing: the solution must support dynamic routing over the new VPN infrastructure
to ensure fast failover on MPLS/VPN or Internet VPN failures;
Flexible primary/backup configuration: Internet VPN will be used as a backup path until it
has been thoroughly tested. It might become the primary connectivity option in the future;
Optimal traffic flow: Traffic to/from sites reachable only over the Internet VPN (due to local
MPLS/VPN failures) should not traverse the MPLS/VPN infrastructure. Traffic between an
MPLS/VPN-only site and an Internet VPN-only site should traverse the central site;
Minimal configuration changes: Deployment of Internet VPN connectivity should not require
major configuration changes in the existing remote site equipment. Central site routers will
probably have to be reconfigured to take advantage of the new infrastructure.
Minimal disruption: The introduction of Internet VPN connectivity must not disrupt the
existing WAN network connectivity.
Minimal dependence on MPLS/VPN provider: After the Internet VPN infrastructure has been
established and integrated with the existing MPLS/VPN infrastructure (which might require
configuration changes on the SP-managed CPE routers), the changes in the traffic flow must not
require any intervention on the SP-managed CPE routers.
SOLUTION OVERVIEW
Internet VPN will be implemented with the DMVPN technology to meet the future requirements of
peer-to-peer topology. Each central site router will be a hub router in its own DMVPN subnet (one
hub router per DMVPN subnet), with the remote site routers having two DMVPN tunnels (one for
each central site hub router) as shown in Figure 2-6.
The new VPN infrastructure could use OSPF or BGP as its routing protocol. The Customer would prefer to
use OSPF, but the design requirements and the specifics of the existing MPLS/VPN WAN infrastructure
make an OSPF deployment exceedingly complex.
Using BGP as the Internet VPN routing protocol would introduce a new routing protocol into the
Customer's network. While the network designers and operations engineers would have to master a
new technology (on top of DMVPN) before production deployment of the Internet VPN, the reduced
complexity of a BGP-only WAN design more than offsets that investment.
Challenges #3 and #4 significantly limit the OSPF area design options. Remote site OSPF areas
cannot extend to the Internet VPN hub router; the hub router would automatically merge multiple
remote sites into the same OSPF area. Every remote site router must therefore be an Area Border
Router (ABR) or Autonomous System Border Router (ASBR). The only design left is an OSPF
backbone area spanning the whole Internet VPN.
Internet VPN edge routers perform two-way redistribution between intra-site OSPF process and
Internet VPN OSPF process.
It's definitely possible to get such a design implemented with safety measures that would prevent
redistribution (and traffic forwarding) loops, but it's definitely not an error-resilient design; minor
configuration changes or omissions could result in network-wide failures.
BGP sessions will be established between:
Central site WAN edge routers (MPLS/VPN CE routers and Internet VPN routers);
Remote site Internet VPN routers and central site Internet VPN routers.
IBGP or EBGP sessions: Which routers would belong to the same autonomous system (AS)?
Would the network use one AS per site or would a single AS span multiple sites?
Autonomous system numbers: There are only 1024 private AS numbers. Would the design
reuse a single AS number on multiple sites or would each site have a unique AS number?
Integration with CPE routers: Would the Internet VPN routers use the same AS number as
the CPE routers on the same site?
Integration with layer-3 switches: Would the central site and remote site layer-3 switches
participate in BGP or would they interact with the WAN edge routers through OSPF?
IBGP OR EBGP?
There are numerous differences between EBGP and IBGP and their nuances sometimes make it hard
to decide whether to use EBGP or IBGP in a specific scenario. However, you the following guidelines
usually result in simple and stable designs:
If you plan to use BGP as the sole routing protocol in (a part of) your network, use EBGP.
If you're using BGP in combination with another routing protocol that will advertise reachability
of BGP next hops, use IBGP. You can also use IBGP between routers residing in a single subnet.
It's easier to implement routing policies with EBGP. Large IBGP deployments need route
reflectors for scalability, and some BGP implementations don't apply BGP routing policies on
reflected routes.
All routers in the same AS should have the same view of the network and the same routing
policies.
EBGP should be used between routers in different administrative (or trust) domains.
Applying these guidelines to our WAN network gives the following results:
EBGP will be used across the DMVPN network. A second routing protocol running over DMVPN would
be needed to support IBGP across DMVPN, resulting in an overly complex network design.
IBGP will be used between central site WAN edge routers. The existing central site routing
protocol can be used to propagate BGP next hop information between WAN edge routers (or they
could belong to the same layer-2 subnet).
EBGP will be used between central site MPLS/VPN CE routers and the Service Provider's PE routers
(incidentally, most MPLS/VPN implementations don't support IBGP as the PE-CE routing
protocol).
EBGP or IBGP could be used between remote site Internet VPN routers and CPE routers. While
IBGP between these routers reduces the overall number of autonomous systems needed, the
MPLS/VPN service provider might insist on using EBGP.
Throughout the rest of this document we'll assume the Service Provider agreed to use IBGP
between CPE routers and Internet VPN routers on the same remote site.
The Customer has three options for numbering the remote sites:
Reuse a single AS number for all remote sites even though each site has to be an individual AS;
Use a set of private AS numbers that the MPLS/VPN provider isn't using on its CPE routers and
number the remote sites from that set;
Use 4-octet AS numbers and give every remote site a unique AS number.
Unless you're ready to deploy 4-octet AS numbers, the first option is the only viable option for
networks with more than a few hundred remote sites (because there are only 1024 private AS
numbers). The second option is feasible for smaller networks with a few hundred remote sites.
The last option is clearly the best one, but requires router software with 4-octet AS number support
(4-octet AS numbers are supported by all recent Cisco and Juniper routers).
Routers using 4-octet AS numbers (defined in RFC 4893) can interoperate with legacy
routers that don't support this BGP extension; the Service Provider's CPE routers thus don't
have to support 4-octet AS numbers (customer routers would appear to belong to AS
23456).
Default loop prevention filters built into BGP reject EBGP updates with the local AS number in the AS
path, making it impossible to pass routes between two remote sites when they use the same AS
number. If you have to reuse the same AS number on multiple remote sites, disable the BGP loop
prevention filters as shown in Figure 2-11 (using the neighbor allowas-in command on Cisco IOS).
While you could use default routing from the central site to solve this problem, the default routing
solution cannot be used when you have to implement the any-to-any traffic flow requirement.
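A minimal sketch of the spoke-side workaround, assuming all remote sites share AS 65001 and the hub is 10.0.0.3 in AS 65000 (addresses and AS numbers are illustrative):

router bgp 65001
 neighbor 10.0.0.3 remote-as 65000
 ! Accept EBGP updates even if AS 65001 already appears once in the AS path
 neighbor 10.0.0.3 allowas-in 1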
Redistribute BGP routes into the IGP (example: OSPF). Non-BGP devices in the forwarding path thus
receive BGP information through their regular IGP (see Figure 2-13).
Figure 2-12: BGP enabled on every layer-3 device between two BGP routers
Enable MPLS forwarding. Ingress network edge devices running BGP label IP datagrams with
MPLS labels assigned to BGP next hops to ensure the datagrams get delivered to the proper
egress device; intermediate nodes perform label lookup, not IP lookup, and thus don't need the
full IP forwarding information.
Create a dedicated layer-2 subnet (VLAN) between BGP edge routers and advertise a default route
to other layer-3 devices as shown in Figure 2-14. This design might result in suboptimal
routing, as other layer-3 devices forward IP datagrams to the nearest BGP router, which might
not be the optimal exit point.
Remote site layer-3 switches will continue to use OSPF as the sole routing protocol;
Core central site layer-3 switches will participate in BGP routing and will become BGP route
reflectors.
REMOTE SITES
An Internet VPN router will be added to each remote site. It will be in the same subnet as the existing
CPE router.
The remote site layer-3 switch might have to be reconfigured if it used a layer-3 physical
interface on the port to which the CPE router was connected. The layer-3 switch should use a
VLAN (SVI) interface to connect to the new router subnet.
An IBGP session will be established between the CPE router and the adjacent Internet VPN router. This is the
only modification that has to be performed on the CPE router.
The Internet VPN router will redistribute internal OSPF routes received from the layer-3 switch into BGP.
External OSPF routes will not be redistributed, preventing routing loops between BGP and OSPF.
The OSPF-to-BGP route redistribution does not impact existing routing, as the CPE router already
does it; it's configured on the Internet VPN router solely to protect the site against CPE router
failure.
The Internet VPN router will redistribute EBGP routes into OSPF (redistribution of IBGP routes is disabled
by default on most router platforms). The OSPF external route metric will be used to influence the
forwarding decision of the adjacent layer-3 switch.
The OSPF metric of redistributed BGP routes could be hard-coded into the Internet VPN router
configuration or based on BGP communities attached to EBGP routes. The BGP community-based
approach is obviously more flexible and will be used in this design.
The following routing policies will be configured on the Internet VPN routers:
EBGP routes with BGP community 65000:1 (Backup route) will get local preference 50. These
routes will be redistributed into OSPF as external type 2 routes with metric 10000.
EBGP routes with BGP community 65000:2 (Primary route) will get local preference 150. These
routes will be redistributed into OSPF as external type 1 routes with metric 1.
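A Cisco IOS sketch of these routing policies as they might look on a remote site Internet VPN router; the AS numbers, policy names, and neighbor address are illustrative:

ip community-list standard BACKUP permit 65000:1
ip community-list standard PRIMARY permit 65000:2
!
! Inbound policy on the DMVPN EBGP session: adjust local preference
route-map FROM-HUB permit 10
 match community BACKUP
 set local-preference 50
route-map FROM-HUB permit 20
 match community PRIMARY
 set local-preference 150
route-map FROM-HUB permit 30
!
! Redistribution policy: backup routes become E2 routes with metric
! 10000, primary routes E1 routes with metric 1
route-map INTO-OSPF permit 10
 match community BACKUP
 set metric 10000
 set metric-type type-2
route-map INTO-OSPF permit 20
 match community PRIMARY
 set metric 1
 set metric-type type-1
!
router bgp 65001
 neighbor 10.0.0.3 route-map FROM-HUB in
!
router ospf 1
 redistribute bgp 65001 subnets route-map INTO-OSPF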
Furthermore, the remote site Internet VPN router has to prevent potential route leakage between
the MPLS/VPN and Internet VPN WAN networks. Route leakage between the two WAN networks might
turn one or more remote sites into transit sites forwarding traffic between the two WAN networks.
The NO-EXPORT BGP community will be used on the Internet VPN router to prevent route leakage:
NO-EXPORT community will be set on updates sent over the IBGP session to the CPE router,
preventing the CPE router from advertising routes received from the Internet VPN router into the
MPLS/VPN WAN network.
NO-EXPORT community will be set on updates received over the IBGP session from the CPE
router, preventing leakage of these updates into the Internet VPN WAN network.
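A sketch of the community handling on the remote site Internet VPN router; the CPE neighbor address 10.1.1.1 is illustrative:

route-map SET-NO-EXPORT permit 10
 set community no-export additive
!
router bgp 65001
 ! IBGP session with the SP-managed CPE router
 neighbor 10.1.1.1 remote-as 65001
 neighbor 10.1.1.1 send-community
 ! Tag updates in both directions with NO-EXPORT
 neighbor 10.1.1.1 route-map SET-NO-EXPORT out
 neighbor 10.1.1.1 route-map SET-NO-EXPORT in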
CENTRAL SITE
The following steps will be used to deploy BGP on the central site:
1. BGP will be configured on existing MPLS/VPN edge routers, on the new Internet VPN edge
routers, and on the core layer-3 switches.
2. IBGP sessions will be established between the loopback interfaces of all WAN edge routers and both
core layer-3 switches. Core layer-3 switches will be BGP route reflectors.
3. EBGP sessions will be established between MPLS/VPN edge routers and adjacent PE routers.
4. BGP community propagation will be configured on all IBGP and EBGP sessions.
After this step, the central site BGP infrastructure is ready for routing protocol migration.
5. Internal OSPF routes will be redistributed into BGP on both core layer-3 switches. No other
central site router will perform route redistribution.
At this point, the PE routers start receiving central site routes through the PE-CE EBGP sessions and
prefer the EBGP routes received from the MPLS/VPN edge routers over the OSPF routes received from the same
routers.
6. A default route will be advertised from the core layer-3 switches into the OSPF routing protocol.
Access-layer switches at the core site will have two sets of external OSPF routes: specific routes
originated by the PE routers and the default route originated by the core layer-3 switches. They will still
prefer the specific routes originated by the PE routers.
7. OSPF will be disabled on PE-CE links.
At this point, the PE routers stop receiving OSPF routes from the CE routers. The only central site
routing information they have is the EBGP routes received over the PE-CE EBGP sessions.
Likewise, the core site access-layer switches stop receiving the specific remote site prefixes that were
redistributed into OSPF on the PE routers and rely exclusively on the default route advertised by the core
layer-3 switches.
Figure 2-16 summarizes central site IP routing design.
Figure 2-16: Central site logical network topology and BGP+OSPF routing
INTERNET VPN
Two sets of EBGP sessions are established across DMVPN subnets. Each central site Internet VPN
router (DMVPN hub router) has EBGP sessions with remote site Internet VPN routers in the same
DMVPN subnet (DMVPN spoke routers). BGP community propagation will be configured on all EBGP
sessions.
VPN traffic flow through the central site: configure neighbor next-hop-self on the DMVPN EBGP
sessions (see the sketch after this list). The central site Internet VPN routers start advertising their IP addresses as the EBGP next
hops for all EBGP prefixes, forcing the site-to-site traffic to flow through the central site.
Any-to-any VPN traffic flow: configure no neighbor next-hop-self on the DMVPN EBGP sessions.
Default EBGP next hop processing will ensure that the EBGP routes advertised through the
central site routers retain the optimal BGP next hop: the IP address of the remote site router if the two
remote sites connect to the same DMVPN subnet, or the IP address of the central site router in any
other case.
Internet VPN as the backup connectivity: set BGP community 65000:1 (backup route) on all
EBGP updates sent from the central site routers. Remote site Internet VPN routers will lower the
local preference of routes received over the DMVPN EBGP sessions and thus prefer the IBGP routes
received from the CPE router (which got the routes over the MPLS/VPN WAN network).
Internet VPN as the primary connectivity: set BGP community 65000:2 (primary route) on all
EBGP updates sent from the central site routers. Remote site Internet VPN routers will increase
the local preference of routes received over the DMVPN EBGP sessions and thus prefer those routes to the
IBGP routes received from the CPE router.
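A sketch of the hub-side knob that selects between the two traffic flow models; the neighbor address is illustrative:

router bgp 65000
 ! Force site-to-site traffic through the central site
 neighbor 10.0.0.1 next-hop-self
!
! For any-to-any traffic flow, remove the command instead:
! no neighbor 10.0.0.1 next-hop-self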
CONCLUSIONS
A design with a single routing protocol running in one part of the network (example: WAN network
or within a site) is usually less complex than a design that involves multiple routing protocols and
route redistribution.
When you have to combine MPLS/VPN WAN connectivity with any other WAN connectivity, you're
forced to incorporate the BGP used within the MPLS/VPN network into your network design. Even though
MPLS/VPN technology supports multiple PE-CE routing protocols, the service providers rarely
implement IGP PE-CE routing protocols with all the features you might need for successful enterprise
WAN integration. Provider-operated CE routers are even worse, as they cannot propagate
MPLS/VPN-specific information (extended BGP communities) into the enterprise IGP in which they
participate.
A WAN network based on BGP is thus the only logical choice, resulting in a single protocol (BGP) being
used in the WAN network. Incidentally, BGP provides a rich set of routing policy features, making
your WAN network more flexible than it could have been were you using OSPF or EIGRP.
IN THIS CHAPTER:
EXISTING IP ROUTING OVERVIEW
IBGP VERSUS EBGP
IBGP AND EBGP BASICS
ROUTE PROPAGATION
BGP NEXT HOP PROCESSING
A large enterprise (the Customer) has an existing international WAN backbone using BGP as the
routing protocol. They plan to replace a regional access network with a DMVPN-based solution and
want to extend the existing BGP routing protocol into the access network to be able to scale the
access network to several thousand sites.
The initial DMVPN access network should offer hub-and-spoke connectivity, with any-to-any traffic
implemented at a later stage.
Should they use Internal BGP (IBGP) or External BGP (EBGP) in the DMVPN access network?
What autonomous system (AS) numbers should they use on remote (spoke) sites if they decide
to use EBGP in the DMVPN access network?
The WAN backbone AS is using BGP route reflectors; new DMVPN hub routers will be added as route
reflector clients to existing BGP topology.
ROUTE PROPAGATION
BGP loop prevention logic enforces an AS-level split horizon rule:
Routes received from an EBGP peer are further advertised to all other EBGP and IBGP peers
(unless an inbound or outbound filter drops the route);
Routes received from an IBGP peer are advertised to EBGP peers but not to other IBGP peers.
BGP route reflectors (RR) use slightly modified IBGP route propagation rules:
Routes received from an RR client are advertised to all other IBGP and EBGP peers. RR-specific
BGP attributes are added to the routes advertised to IBGP peers to detect IBGP loops.
Routes received from other IBGP peers are advertised to RR clients and EBGP peers.
The route propagation rules influence the setup of BGP sessions in a BGP network:
IBGP networks usually use a set of route reflectors (or a hierarchy of route reflectors); IBGP
sessions are established between all BGP-speaking routers in the AS and the route reflectors.
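For reference, a minimal Cisco IOS route reflector sketch; the neighbor address and AS number are illustrative:

router bgp 65000
 neighbor 10.0.0.10 remote-as 65000
 ! Treat this IBGP neighbor as an RR client; reflected routes carry
 ! ORIGINATOR_ID and CLUSTER_LIST attributes for IBGP loop detection
 neighbor 10.0.0.10 route-reflector-client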
BGP NEXT HOP PROCESSING
The default BGP next hop processing rules are:
A BGP router advertising a BGP route without a NEXT HOP attribute (a locally originated BGP
route) sets the BGP next hop to the source IP address of the BGP session over which the BGP
route is advertised;
A BGP router advertising a BGP route to an IBGP peer does not change the value of the BGP
NEXT HOP attribute;
A BGP router advertising a BGP route to an EBGP peer sets the value of the BGP NEXT HOP
attribute to the source IP address of the EBGP session unless the existing BGP NEXT HOP value
belongs to the same IP subnet as the source IP address of the EBGP session.
You can modify the default BGP next hop processing rules with the following Cisco IOS configuration
options:
The neighbor next-hop-self router configuration command sets the BGP NEXT HOP attribute to the
source IP address of the BGP session regardless of the default BGP next hop processing rules.
A BGP route reflector cannot change the BGP attributes of reflected routes; neighbor
next-hop-self is thus not effective on routes reflected by a route reflector.
Recent Cisco IOS releases support an extension to the neighbor next-hop-self command:
the neighbor <address> next-hop-self all configuration command causes a route reflector to
change the BGP next hops on all IBGP and EBGP routes sent to the specified neighbor.
Inbound or outbound route maps can set the BGP NEXT HOP to any value with the set ip next-hop
command (outbound route maps are not applied to reflected routes). The most useful
variant is set ip next-hop peer-address used in an inbound route map.
set ip next-hop peer-address sets the BGP next hop to the IP address of the BGP neighbor when
used in an inbound route map, or to the source IP address of the BGP session when used in
an outbound route map.
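A sketch of the inbound route-map approach; the route-map name and neighbor address are illustrative:

route-map NH-FROM-PEER permit 10
 ! Set the next hop of received routes to the sending neighbor's address
 set ip next-hop peer-address
!
router bgp 65000
 neighbor 10.0.0.1 route-map NH-FROM-PEER in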
The number of spoke sites connected to a single hub site is large enough to cause scalability
issues in other routing protocols (example: OSPF);
The customer wants to run a single routing protocol across multiple access networks (MPLS/VPN
and DMVPN) to eliminate route redistribution and simplify the overall routing design.
In both cases, routing in the DMVPN network relies exclusively on BGP. BGP sessions are established
between directly connected interfaces (across the DMVPN tunnel) and there's no IGP to resolve BGP
next hops, making EBGP a better fit (at least based on standard BGP use cases).
The customer has two choices when numbering the spoke DMVPN sites:
Each spoke DMVPN site could become an independent autonomous system with a unique AS
number;
All spoke DMVPN sites use the same autonomous system number.
(See the Integrating Internet VPN with MPLS/VPN WAN case study for more details.)
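A minimal sketch of the spoke router BGP configuration; the tunnel IP addresses (hub 10.0.0.3, spokes 10.0.0.1 and 10.0.0.2) and the per-spoke AS numbers are assumptions based on the printouts that follow:

! Spoke A (tunnel IP 10.0.0.1, AS 65001)
router bgp 65001
 network 192.168.1.0
 neighbor 10.0.0.3 remote-as 65000
!
! Spoke B (tunnel IP 10.0.0.2, AS 65002)
router bgp 65002
 network 192.168.2.0
 neighbor 10.0.0.3 remote-as 65000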
Once the BGP sessions are established, the DMVPN hub and spoke routers start exchanging BGP
prefixes. Prefixes advertised by DMVPN spoke sites retain their BGP next hop when the hub router
propagates them to other DMVPN spoke sites; for IBGP prefixes advertised by other BGP routers
behind the hub router, the hub router sets the next hop to the IP address of its DMVPN interface (see
Printout 3-3 for a sample BGP table on a DMVPN spoke router).
All printouts were generated in a test network connecting a DMVPN hub router (Hub) to two
DMVPN spoke routers (RA and RB) with IP prefixes 192.168.1.0/24 and 192.168.2.0/24,
and a core router with IP prefix 192.168.10.0/24.
RA#show ip bgp
BGP table version is 6, local router ID is 192.168.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
              x best-external, a additional-path, c RIB-compressed,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop
 *>  192.168.1.0      0.0.0.0
 *>  192.168.2.0      10.0.0.2
 *>  192.168.10.0     10.0.0.3

Printout 3-3: BGP table on a DMVPN spoke router
The AS path of the BGP routes indicates the sequence of AS numbers a BGP update had to
traverse before being received by the hub site; it's thus easy to figure out which DMVPN
spoke site advertises a specific prefix or which prefixes a DMVPN spoke site advertises.
RA#show ip bgp
BGP table version is 7, local router ID is 192.168.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
              x best-external, a additional-path, c RIB-compressed,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop
 *>  192.168.1.0      0.0.0.0
 *>  192.168.2.0      10.0.0.3
 *>  192.168.10.0     10.0.0.3
     Network          Next Hop
 *>  0.0.0.0          10.0.0.3
 *>  192.168.1.0      0.0.0.0
advertised by other spoke sites (to retain the proper BGP next hop value), and the default route that
replaces all other BGP prefixes.
DMVPN spoke sites might have to use an IPsec frontdoor VRF if they rely on default routing
within the enterprise network and toward the global Internet16.
You could use an outbound route map that matches on BGP next hop value on the BGP hub router to
achieve this goal (see Printout 3-8 for details).
router bgp 65000
 bgp log-neighbor-changes
 neighbor 10.0.0.1 remote-as 65001
 ! Advertise the default route to the spoke
 neighbor 10.0.0.1 default-originate
 ! Filter all outbound updates through the hub-to-spoke route map
 neighbor 10.0.0.1 route-map hub-to-spoke out
!
! Match BGP next hops within the DMVPN tunnel subnet
ip access-list standard DMVPNsubnet
 permit 10.0.0.0 0.0.0.255
!
! Propagate only prefixes originated by DMVPN spoke sites (their next hop
! stays within the DMVPN subnet); all other prefixes are dropped
route-map hub-to-spoke permit 10
 match ip next-hop DMVPNsubnet
Printout 3-8: Phase 2 DMVPN hub router filtering non-DMVPN prefixes
16 See the DMVPN: From Basics to Scalable Networks webinar for more details: http://www.ipspace.net/DMVPN
RA>show ip bgp
BGP table version is 15, local router ID is 192.168.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
              x best-external, a additional-path, c RIB-compressed,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop         Path
 *>  0.0.0.0          10.0.0.3         65000 i
 *>  192.168.1.0      0.0.0.0          i
 *>  192.168.10.0     10.0.0.3         65000 i
Printout 3-10: BGP table on DMVPN spoke router (prefixes originated by other spokes are missing)
The default BGP loop prevention behavior might be ideal in DMVPN Phase 1 or Phase 3 networks
(see Using EBGP with Phase 1 DMVPN Networks and Reducing the Size of the Spoke Routers' BGP
Table for more details), but it is not appropriate for DMVPN Phase 2 networks.
In DMVPN Phase 2 networks we have to disable the BGP loop prevention on spoke site routers with
the neighbor allowas-in command17 (a sample spoke router configuration is in Printout 3-11).
17 The use of this command in similarly-designed MPLS/VPN networks is described in detail in the MPLS and VPN Architectures book.
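A minimal sketch of the spoke site configuration in the shared-AS design (addresses follow the test network; the allowas-in occurrence count is an assumption):

router bgp 65001
 neighbor 10.0.0.3 remote-as 65000
 ! Accept incoming routes even when the local AS number (65001) already
 ! appears once in the AS path, disabling the default BGP loop prevention
 neighbor 10.0.0.3 allowas-in 1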
RA#show ip bgp
BGP table version is 16, local router ID is 192.168.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
              x best-external, a additional-path, c RIB-compressed,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop         Path
 *>  0.0.0.0          10.0.0.3         65000 i
 *   192.168.1.0      10.0.0.3         65000 65001 i
 *>                   0.0.0.0          i
 *>  192.168.2.0      10.0.0.2         65000 65001 i
 *>  192.168.10.0     10.0.0.3         65000 i

Printout 3-12: Duplicate prefix on a DMVPN spoke router caused by a BGP update loop
Alternatively, one could adjust the AS path on updates sent by the DMVPN hub router with the
neighbor as-override router configuration command18 (see Printout 3-13), which replaces all
instances of the neighbor AS number with the local AS number. The resulting BGP table on a DMVPN spoke
router is shown in Printout 3-14.
18 The neighbor as-override command is extensively described in the MPLS and VPN Architectures book.
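A minimal sketch of the corresponding hub router configuration (addresses follow the test network):

router bgp 65000
 neighbor 10.0.0.1 remote-as 65001
 ! Replace all occurrences of the spoke AS number (65001) with the hub
 ! AS number (65000) in outbound updates
 neighbor 10.0.0.1 as-override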
     Network          Next Hop         Path
 *>  0.0.0.0          10.0.0.3         65000 i
 *   192.168.1.0      10.0.0.3         65000 65000 i
 *>                   0.0.0.0          i
 *>  192.168.2.0      10.0.0.2         65000 65000 i
 *>  192.168.10.0     10.0.0.3         65000 i

Printout 3-14: BGP table with modified AS paths on DMVPN spoke router
Establish IBGP sessions between hub and spoke sites using directly connected IP addresses
belonging to the DMVPN tunnel interface.
IBGP hub router configuration using dynamic BGP neighbors is extremely simple, as evidenced by
the sample configuration in Printout 3-15.
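A minimal sketch of what the dynamic BGP neighbor configuration could look like on the hub (the DMVPN subnet and peer group name are assumptions):

router bgp 65000
 ! Accept inbound IBGP sessions from any spoke in the DMVPN tunnel subnet
 bgp listen range 10.0.0.0/24 peer-group Spokes
 neighbor Spokes peer-group
 neighbor Spokes remote-as 65000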
Networks that don't use BGP beyond the boundaries of the DMVPN access network (the core WAN
network might use an IGP like OSPF or EIGRP);
In all other cases, the lack of BGP next hop processing across IBGP sessions (explained in the BGP Next
Hop Processing section) causes connectivity problems.
For example, in our sample network the spoke routers cannot reach destinations beyond the DMVPN
hub router: BGP refuses to use those prefixes because the DMVPN spoke router cannot reach the
BGP next hop (Printout 3-16).
RA#show ip bgp
BGP table version is 9, local router ID is 192.168.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
              x best-external, a additional-path, c RIB-compressed,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop
 *>  192.168.1.0      0.0.0.0
 *>i 192.168.2.0      10.0.0.2
 * i 192.168.10.0     10.0.2.2
Printout 3-16: DMVPN spoke routers cannot reach prefixes behind the DMVPN hub router
There are at least four approaches one can use to fix the IBGP next hop problem:
Use default routing in the DMVPN network (see section Reducing the Size of the Spoke Routers' BGP
Table for more details) unless you're using Phase 2 DMVPN.
Advertise a default route from the DMVPN hub router with the neighbor default-originate
router configuration command. DMVPN spokes will use the default route to reach the IBGP next hop.
Some versions of Cisco IOS might not use an IBGP route to resolve a BGP next hop. Check
the behavior of your target Cisco IOS version before deciding to use this approach.
Change the IBGP next hop on all spoke routers with an inbound route map using the set ip
next-hop peer-address route map configuration command. This approach increases the complexity
of the spoke site routers' configuration and is thus best avoided.
Change the IBGP next hop on the DMVPN hub router with the neighbor next-hop-self all router
configuration command (see the sketch after this list).
This feature was introduced recently and might not be available on the target DMVPN hub
router.
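A minimal sketch of the last approach on the hub router (addresses are assumptions):

router bgp 65000
 neighbor 10.0.0.1 remote-as 65000
 ! The hub acts as a route reflector for the spokes
 neighbor 10.0.0.1 route-reflector-client
 ! Rewrite the next hop on all routes sent to the spoke, including
 ! reflected IBGP routes (requires recent Cisco IOS releases)
 neighbor 10.0.0.1 next-hop-self all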
DESIGN RECOMMENDATIONS
BGP is the only routing protocol that scales to several thousand DMVPN nodes (the target size of the
DMVPN access network). The Customer's DMVPN access network should thus rely on BGP without an
underlying IGP.
Default routing fits the customer's current requirements (hub-and-spoke traffic); potential future direct
spoke-to-spoke connectivity can be implemented with default routing and Phase 3 DMVPN.
Conclusion #1: The customer will use default routing over BGP. The hub router will advertise the default
route (and no other BGP prefix) to the spokes.
Spoke routers could use static host routes to send IPsec traffic to the hub router in the initial
deployment. Implementing spoke-to-spoke connectivity with static routes is time-consuming and
error-prone, particularly in environments with dynamic spoke transport addresses. The customer
would thus like to use default routing toward the Internet.
Conclusion #2: The customer will use an IPsec frontdoor VRF with default routing toward the Internet.
The customer does not plan to connect spoke DMVPN sites to any other access network (example:
MPLS/VPN), so they're free to choose any AS numbering scheme they wish.
Any EBGP or IBGP design described in this document would meet the customer's routing requirements;
IBGP is the easiest one to implement and modify should future needs change (assuming the
DMVPN hub router supports the neighbor next-hop-self all functionality).
Conclusion #3: The customer will use IBGP in the DMVPN access network.
IN THIS CHAPTER:
SIMPLIFIED TOPOLOGY
IP ADDRESSING AND ROUTING
DESIGN REQUIREMENTS
FAILURE SCENARIOS
SOLUTION OVERVIEW
LAYER-2 WAN BACKBONE
BENEFITS AND DRAWBACKS OF PROPOSED TECHNOLOGIES
IP ROUTING ACROSS LAYER-2 WAN BACKBONE
BGP ROUTING
OUTBOUND BGP PATH SELECTION
CONCLUSIONS
A large enterprise (the Customer) has two data centers linked with a fully redundant layer-3 Data
Center Interconnect (DCI) using an unspecified transport technology. Each data center has two
redundant Internet connections (see Figure 4-1 for details).
The customer would like to make the Internet connectivity totally redundant. For example: if both
Internet connections from DC1 fail, the public IP prefix of DC1 should remain accessible through
Internet connections of DC2 and the DCI link.
SIMPLIFIED TOPOLOGY
All critical components of a redundant data center design should be redundant, but it's sometimes
easier to disregard the redundancy of the components not relevant to a particular portion of the
overall design (in our scenario: firewalls and DCI routers) to simplify the design discussions (see
Figure 4-2).
protocol (example: HSRP). At the moment there's no dynamic routing between the Internet edge
routers and any other network device.
The Customer's DCI routers connect the internal data center networks and currently don't provide transit
services.
Static default routes pointing to the local firewall inside IP address are used on the data center core
switches.
Figure 4-3: BGP sessions between Internet edge routers and the ISPs.
DESIGN REQUIREMENTS
A redundant Internet access solution must satisfy the following requirements:
Resilient inbound traffic flow: Both sites must advertise the IP prefixes assigned to DC1 and DC2
to the Internet;
No session loss: Failure of one or more Internet-facing links must not result in application
session loss;
Optimal inbound traffic flow: Traffic for IP addresses in one of the data centers should arrive
over uplinks connected to the same data center; the DCI link should be used only when absolutely
necessary;
Optimal outbound traffic flow: Outbound traffic must take the shortest path to the Internet;
as above, the DCI link should be used only when absolutely necessary;
No blackholing: A single path failure (one or both Internet links on a single site, or one or
more DCI links) should not cause traffic blackholing.
FAILURE SCENARIOS
This document describes a network that is designed to survive the following failures:
The design described in this document does not address a total data center failure; you'd need a
manual or automatic failover mechanism addressing network, compute and storage components to
achieve that goal.
SOLUTION OVERVIEW
We can meet all the design requirements by redesigning the Internet Edge layer of the corporate
network to resemble a traditional Internet Service Provider design20.
The Internet Edge layer of the new network should have:
Edge or peering routers connecting the WAN backbone to Internet peers or upstream providers;
The missing component in the current Internet Edge layer is the WAN backbone. Assuming we have
to rely on the existing WAN connectivity between DC1 and DC2, the DCI routers (D11 through D22)
have to become part of the Internet Edge layer (outside) WAN backbone as shown in Figure 4-4.
20 For more details, watch the Redundant Data Center Internet Connectivity video: http://demo.ipspace.net/get/X1%20Redundant%20Data%20Center%20Internet%20Connectivity.mp4
The outside WAN backbone can be built with any one of these technologies:
Point-to-point Ethernet links or stretched VLANs between Internet edge routers. This solution
requires layer-2 connectivity between the sites and is thus the least desirable option;
Virtual device contexts on DCI routers to split them into multiple independent devices (example:
Nexus 7000).
WAN backbone implemented in a virtual device context on Nexus 7000 would require
dedicated physical interfaces (additional inter-DC WAN links).
VRFs on the DCI routers to implement another forwarding context for the outside WAN
backbone.
Regardless of the technology used to implement the WAN backbone, all the proposed solutions fall
into two major categories:
Layer-2 solutions, where the DCI routers provide layer-2 connectivity between Internet edge
routers, either in the form of point-to-point links between Internet edge routers or site-to-site VLAN
extension.
GRE tunnels between Internet edge routers are just a special case of a layer-2 solution that
does not involve the DCI routers at all.
Layer-3 solutions, where the DCI routers participate in the WAN backbone IP forwarding.
Virtual Private LAN Service (VPLS) could be configured on DCI routers (combined with MPLS-over-IP if needed) to provide site-to-site VLAN extension between Internet edge routers;
Overlay Transport Virtualization (OTV) could be configured on DCI routers to provide site-to-site
VLAN extensions;
GRE tunnels configured on Internet edge routers provide point-to-point links without
involvement of DCI routers.
All layer-2 tunneling technologies introduce additional encapsulation overhead and thus
require increased MTU on the path between Internet edge routers (GRE tunnels) or DCI
routers (all other technologies), as one cannot rely on the proper operation of Path MTU
Discovery (PMTUD) across the public Internet.
Figure 4-5: Point-to-point Ethernet links implemented with EoMPLS on DCI routers
Consider the potential failure scenarios in the simple topology from Figure 4-5, where the fully
redundant DCI backbone implements EoMPLS point-to-point links between Internet edge routers:
Failure of DCI link #1 (or DCI routers D11 or D21) causes the E1-to-E3 virtual link to fail;
Subsequent failure of E2 or E4 results in a total failure of the WAN backbone, although there would
still be alternate paths if the point-to-point links between Internet edge routers
weren't so tightly coupled with the physical DCI components.
Site-to-site VLAN extensions are slightly better in that respect; well-designed fully redundant
stretched VLANs (Figure 4-6) can decouple DCI failures from Internet edge failures.
Figure 4-6: Single stretched VLAN implemented with VPLS across L3 DCI
You could achieve the proper decoupling with a single WAN backbone VLAN that follows these rules:
The VLAN connecting Internet edge routers MUST be connected to all physical DCI devices
(preventing a single DCI device failure from impacting the inter-site VLAN connectivity);
Redundant independent DCI devices MUST use a rapidly converging protocol (example: rapid
spanning tree) to elect the primary forwarding port connected to the WAN backbone VLAN. You
could use multi-chassis link aggregation groups when DCI devices appear as a single logical
device (example: VSS, IRF, Virtual Chassis).
Every DCI router MUST be able to use all DCI links to forward the WAN backbone VLAN traffic, or
shut down the VLAN-facing port when its DCI WAN link fails.
Fully redundant stretched VLANs are hard to implement with today's technologies (example: OTV
supports a single transport interface on both NX-OS and IOS XE); it might be simpler to provide two
non-redundant WAN backbone VLANs and connect the Internet edge routers to both of them as shown
in Figure 4-7 (a solution that cannot be applied to server subnets but works well for router-to-router
links23).
Figure 4-7: Two non-redundant stretched VLANs provide sufficient end-to-end redundancy
23 The difference between Metro Ethernet and stretched data center subnets: http://blog.ioshints.info/2012/07/the-difference-between-metro-ethernet.html
GRE tunnels established directly between Internet edge routers might be the simplest solution. They
rely on IP transport provided by the DCI infrastructure and can use whatever path is available (keep
in mind the increased MTU requirements). Their only drawback is a perceived security risk: traffic
that has not been inspected by the firewalls is traversing the internal infrastructure.
The IP routing design of the WAN backbone should thus follow the well-known best practices used
by Internet Service Provider networks:
Configure a full mesh of IBGP sessions as shown in Figure 4-10 (it does not make sense to
introduce route reflectors in a network with four BGP routers).
Figure 4-10: Full mesh of IBGP sessions between Internet edge routers
Use BGP next-hop-self on IBGP sessions to decouple the IBGP routes from external subnets.
BGP ROUTING
The design of BGP routing between Internet edge routers should follow the usual non-transit
autonomous system best practices:
Every Internet edge router should advertise its directly connected public LAN prefix (on Cisco IOS,
use a network statement, not route redistribution);
Do not configure static routes to null 0 on the Internet edge routers; they should announce the
public LAN prefix only when they can reach it.
Use BGP communities to tag the locally advertised BGP prefixes as belonging to DC1 or DC2 (on
Cisco IOS, use the network statement with the route-map option).
Use outbound AS-path filters on EBGP sessions with upstream ISPs to prevent transit route
leakage across your autonomous system.
Use AS-path prepending, Multi-Exit Discriminator (MED) or ISP-defined BGP communities for
optimal inbound traffic flow (traffic destined for IP addresses in the public LAN of DC1 should arrive
through DC1's uplinks if at all possible). Example: E3 and E4 should advertise prefixes from DC1
with multiple copies of the Customer's public AS number in the AS path (see the sketch after this list).
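A minimal sketch of these guidelines on one of the DC2 edge routers (all prefixes, AS numbers, community values and object names are assumptions):

router bgp 64500
 ! Advertise the local public LAN prefix, tagged as belonging to DC2
 network 203.0.113.0 mask 255.255.255.0 route-map tag-dc2
 neighbor 198.51.100.1 remote-as 64496
 neighbor 198.51.100.1 send-community
 ! Apply inbound traffic engineering on updates sent to the upstream ISP
 neighbor 198.51.100.1 route-map to-isp out
!
route-map tag-dc2 permit 10
 set community 64500:2
!
ip community-list standard DC1-prefixes permit 64500:1
!
! Prepend the AS path on DC1-tagged prefixes so that inbound traffic for
! DC1 addresses prefers the DC1 uplinks
route-map to-isp permit 10
 match community DC1-prefixes
 set as-path prepend 64500 64500
route-map to-isp permit 20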
BGP attributes of BGP prefixes advertised to upstream ISPs (longer AS path, MED or additional
communities) should be based on BGP communities attached to the advertised IP prefixes.
The same BGP path is received from an EBGP peer and a local IBGP peer: the EBGP path is preferred over
the IBGP path;
Different BGP paths to the same IP prefix are received from an EBGP peer and a local IBGP peer. Both
paths have the same BGP local preference; other BGP attributes (starting with AS path length)
are used to select the best path;
The same BGP path is received from a local IBGP peer and an IBGP peer from another data center.
The path received from the local IBGP peer has higher local preference and is thus always preferred;
Different BGP paths to the same IP prefix are received from a local IBGP peer and an IBGP peer from
another data center. The path received from the local IBGP peer has higher local preference; other
BGP attributes are ignored in the BGP path selection process, and the path received from the local IBGP
peer is always preferred.
In all cases, the outbound traffic uses a local uplink, assuming at least one of the upstream ISPs
advertised a BGP path to the destination IP address.
Default routing between data centers. Redundant BGP paths received from EBGP and IBGP
peers increase the memory requirements of Internet edge routers and slow down the convergence
process. You might want to reduce the BGP table sizes on Internet edge routers by replacing the full
IBGP routing exchange between data centers with default routing:
Internet edge routers in the same data center exchange all BGP prefixes received from EBGP
peers to ensure optimal outbound traffic flow based on information received from upstream ISPs;
Only the default route and locally advertised prefixes (BGP prefixes with an empty AS path) are exchanged
between IBGP peers residing in different data centers (see the sketch below).
With this routing policy, Internet edge routers always use the local uplinks for outbound traffic and
fall back to the default route received from the other data center only when there is no local path to the
destination IP address.
Two-way default routing between data centers might result in packet forwarding loops. If at
all possible, request default route origination from the upstream ISPs and propagate only
ISP-generated default routes to IBGP peers.
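A minimal sketch of the inter-DC IBGP filter on an Internet edge router (the AS number, neighbor address and object names are assumptions):

router bgp 64500
 neighbor 192.168.255.2 remote-as 64500
 ! Send only the default route and locally-originated prefixes (empty
 ! AS path) to IBGP peers in the other data center
 neighbor 192.168.255.2 route-map inter-dc out
!
ip prefix-list default-route permit 0.0.0.0/0
ip as-path access-list 10 permit ^$
!
route-map inter-dc permit 10
 match ip address prefix-list default-route
route-map inter-dc permit 20
 match as-path 10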
Figure 4-11: Virtual Device Contexts: dedicated management planes and physical interfaces
Virtual Routing and Forwarding tables seem to be a better fit; most devices suitable for data center
edge deployment support them, and it's relatively easy to associate a VRF with physical interfaces or
VLAN (sub)interfaces.
Figure 4-12: Virtual Routing and Forwarding tables: shared management, shared physical interfaces
All routers in a layer-3 backbone must have a consistent forwarding behavior. This requirement can
be met by all core routers running IGP+BGP (and participating in full Internet routing) or by core
routers running IGP+MPLS and providing label switched paths between BGP-running edge routers
(BGP-free core).
The DCI routers in the WAN backbone should thus either participate in IBGP mesh (acting as BGP
route reflectors to reduce the IBGP mesh size, see Figure 4-13) or provide MPLS transport between
Internet edge routers as shown in Figure 4-14.
Both designs are easy to implement on dedicated high-end routers or within a separate VDC on a
Nexus 7000; the VRF-based implementations are way more complex:
Many MPLS- or VRF-enabled devices do not support IBGP sessions within a VRF; only EBGP
sessions are allowed between PE and CE routers (Junos supports IBGP in a VRF in recent software
releases). In an MPLS/VPN deployment, the DCI routers would have to be in a private AS
inserted between two disjoint parts of the existing public AS. A Multi-VRF or EVN deployment
would be even worse: each DCI router would have to be in its own autonomous system.
MPLS transport within a VRF requires support for the Carrier's Carrier (CsC) architecture; at the very
minimum, the DCI routers should be able to run Label Distribution Protocol (LDP) within a VRF.
While both designs could be implemented on numerous data center edge platforms (including Cisco
7200, Cisco 7600 and Juniper MX-series routers), they rely on technologies not commonly used in
data center environments and might thus represent a significant deployment and operational
challenge.
IBGP sessions between routers in different data centers are used solely to propagate locally-originated routes. No external BGP routes are exchanged between data centers.
IBGP sessions between data centers could also be replaced with local prefix origination: all
Internet edge routers in both data centers would advertise public LAN prefixes from all data
centers (using a route-map or similar mechanism to set BGP communities), some of them
based on connected interfaces, others based on IGP information.
Outbound traffic forwarding in this design is based on default routes advertised by all Internet edge
routers. An Internet edge router should advertise a default route only when its Internet uplink (and
corresponding EBGP session) is operational to prevent suboptimal traffic flow or blackholing. Local
default routes should be preferred over default routes advertised from other data centers to ensure
optimal outbound traffic flow.
The following guidelines could be used to implement this design with OSPF on Cisco IOS (see the sketch after this list):
Internet edge routers that receive the default route from upstream ISPs through an EBGP session
should be configured with default-information originate (the default route is originated only when
another non-OSPF default route is already present in the routing table);
Internet edge routers participating in the default-free zone (full Internet routing with no default
routes) should advertise default routes when they receive at least some well-known prefixes
(example: root DNS servers) from the upstream ISP. Use the default-information originate always route-map
configuration command and use the route map to match the well-known prefixes.
Use external type-1 default routes to ensure DCI routers prefer locally-originated default routes
(even when they have unequal costs to facilitate primary/backup exit points) over default routes
advertised from edge routers in other data centers.
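A minimal sketch of the first guideline on an Internet edge router (the process ID and cost value are assumptions; edge routers in other data centers would use a higher metric):

router ospf 1
 ! Originate an external type-1 default route with a low cost whenever a
 ! non-OSPF default route (e.g., the BGP default from the upstream ISP)
 ! is present in the routing table
 default-information originate metric 100 metric-type 1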
CONCLUSIONS
A network with multiple data centers and requirements for seamless failover following any
combination of link/device failures must have an external WAN backbone (similar to Internet Service
Provider core networks) with individual data centers and other sites connected to the backbone via
firewalls or other intermediate devices.
In most cases the external WAN backbone has to share WAN links and physical devices with internal
data center interconnect links, while still maintaining strict separation of security zones and
forwarding planes.
The external WAN backbone could be implemented as either layer-2 backbone (using layer-2
tunneling mechanisms on DCI routers) or layer-3 backbone (with DCI routers participating in WAN
backbone IP forwarding). Numerous technologies could be used to implement the external WAN
backbone with the following ones being the least complex from the standpoint of a typical enterprise
data center networking engineer:
WAN backbone implemented in a separate VRF on DCI routers with default routing used for
outbound traffic forwarding.
IN THIS CHAPTER:
IP ADDRESSING AND ROUTING
REDUNDANCY REMOVED TO SIMPLIFY DESIGN DISCUSSIONS
DESIGN REQUIREMENTS
FAILURE SCENARIOS
SOLUTION OVERVIEW
DETAILED SOLUTION OSPF
FAILURE ANALYSIS
NEXT-HOP PROCESSING
BGP TABLE DURING NORMAL OPERATIONS
EBGP ROUTE ADVERTISEMENTS
FAILURE ANALYSIS
CONCLUSIONS
ACME Enterprises has two data centers linked with a layer-2 Data Center Interconnect (DCI)
implemented with Cisco's Overlay Transport Virtualization (OTV). Each data center has connections
to the Internet and to the enterprise WAN network connecting the data centers with remote offices (see
Figure 5-1 for details). The enterprise WAN network is implemented with MPLS/VPN services.
Layer-2 DCI was used to avoid IP renumbering in VM mobility and disaster recovery scenarios.
Occasional live migration between data centers is used during maintenance and hardware upgrade
operations.
ACME uses OSPF within its MPLS/VPN network and BGP with upstream Internet Service Providers
(ISPs).
DESIGN REQUIREMENTS
Layer-2 DCI is the least desirable data center interconnect solution, as it extends a single
broadcast domain (and thus a single failure domain) across multiple sites, turning them into a single
availability zone.
Furthermore, a DCI link failure might result in a split-brain scenario where both sites advertise the
same IP subnet, resulting in misrouted (and thus black-holed) traffic37.
External routing between the two data centers and both the Internet and the enterprise WAN (MPLS/VPN)
network should thus ensure that:
Every data center subnet remains reachable after a single link or device failure;
A DCI link failure does not result in a split-brain scenario with traffic for the same subnet being
sent to both data centers;
The backup data center (for a particular VLAN/subnet) advertises the subnet after the primary data
center fails.
37 The difference between Metro Ethernet and stretched data center subnets: http://blog.ioshints.info/2012/07/the-difference-between-metro-ethernet.html
FAILURE SCENARIOS
The design described in this document should provide uninterrupted external connectivity under the
following conditions:
Single device or link failure anywhere in the data center network edge;
Stateful devices (firewalls, load balancers) are not included in this design. Each stateful device
partitions the data center network in two (or more) independent components. You can apply the
mechanisms described in this document to the individual networks; migration of stateful devices
following a data center failure is out of scope.
SOLUTION OVERVIEW
External data center routing seems to be a simple primary/backup design scenario (more details in
Figure 5-4):
Primary data center advertises a subnet with low cost (when using BGP, cost might be AS-path
length or multi-exit discriminator attribute);
The backup data center advertises the same subnet with high cost; even if the DCI link fails, every
external router ignores the prefix advertised by the backup data center due to its higher cost.
The primary/backup approach based on routing protocol costs works reasonably well in the enterprise
WAN network where ACME controls the routing policies, but fails in the generic Internet environment,
where ACME cannot control the routing policies implemented by upstream ISPs, and where every ISP
might use its own (sometimes even undocumented) routing policy.
For example, an upstream ISP might strictly prefer prefixes received from its customers over
prefixes received from other autonomous systems (peers or upstream ISPs); such an ISP would set
local preference on BGP paths received from its customers, making AS path length irrelevant.
A routing policy that unconditionally prefers customer prefixes might prevent a straightforward
implementation of the primary/backup scenario based on routing protocol cost (ex: AS path length).
The only reliable mechanism to implement primary/backup path selection that does not rely on ISP
routing policies is conditional route advertisement: BGP routers in the backup data center should not
advertise prefixes from the primary data center unless the primary data center fails or all its WAN
connections fail.
To further complicate the design, BGP routers in the backup data center (for a specific subnet) shall
not advertise the prefixes currently active in the primary data center when the DCI link fails.
Data center edge routers thus have to employ mechanisms similar to those used by data center
switches with a shared control plane (ex: Cisco's VSS or HP's IRF): they have to detect a split-brain
scenario by exchanging keepalive messages across the external network. When the backup router
(for a particular subnet) cannot reach the primary router through the DCI link but still reaches it
across the external network, it must enter an isolation state (stop advertising the backup prefix).
You can implement the above requirements using neighbor advertise-map functionality available
in Cisco IOS in combination with IP SLA-generated routes (to test external reachability of the other
data center), with Embedded Event Manager (EEM) triggers, or with judicious use of parallel IBGP
sessions (described in the Detailed Solution Internet Routing With BGP section).
Intra-area routes;
Inter-area routes;
External Type-2 OSPF routes are the only type of OSPF routes where the internal cost (OSPF cost
toward the advertising router) does not affect the route selection process.
It's thus advisable to advertise data center subnets as E2 OSPF routes. The external route cost should be
set to a low value (ex: 100) on data center routers advertising the primary subnet and to a high value
(ex: 1000) on data center routers advertising the backup subnet (Figure 5-5).
FAILURE ANALYSIS
Consider the following failure scenarios (assuming DC-A is the primary data center and DC-B the
backup one):
DC-A WAN link failure: DC-B is still advertising the subnet into enterprise WAN (although with
higher cost). Traffic from DC-A flows across the DCI link, which might be suboptimal.
Performance problems might trigger evacuation of DC-A, but applications running in DC-A
remain reachable throughout the failure period.
DC-B WAN link failure: does not affect the OSPF routing (prefix advertised by DC-B was too
expensive).
DCI link failure: Does not affect the applications running in DC-A. VMs residing in DC-B (within
the backup part of the shared subnet) will be cut off from the rest of the network.
MIGRATION SCENARIOS
Use the following procedures when performing a controlled migration from DC-A to DC-B (see the sketch after this list):
DC evacuation (primary to backup). Migrate the VMs, then decrease the default-metric on the
DC-B routers (making DC-B the primary data center for the shared subnet). The reduced cost of the
prefix advertised by DC-B will cause routers in the enterprise WAN network to prefer the path
through DC-B. Shut down DC-A.
DC restoration (backup to primary). Connect DC-A to the WAN networks (the cost of prefixes
redistributed into OSPF in DC-A is still higher than the OSPF cost advertised from DC-B). Migrate
the VMs, then increase the default-metric on routers in DC-B. The prefix advertised by DC-A will take
over.
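A minimal sketch of the evacuation step on a DC-B router (the process ID and metric value are assumptions, keyed to the example costs of 100 and 1000 used above):

router ospf 1
 ! Data center subnets are redistributed into OSPF as E2 external routes
 redistribute connected subnets
 ! Lowering the default metric below the DC-A value (100) makes DC-B
 ! the primary data center for the shared subnet
 default-metric 50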
Regular IBGP sessions are established between data center edge routers (potentially in
combination with the external WAN backbone described in the Redundant Data Center Internet
Connectivity document). These IBGP sessions could be configured between loopback or
internal LAN interfaces;
Additional IBGP sessions are established between the external (ISP-assigned) IP addresses of the data
center edge routers. The endpoints of these IBGP sessions shall not be advertised in internal
routing protocols, to ensure the IBGP sessions always traverse the public Internet.
Figure 5-6: EBGP and IBGP sessions on data center edge routers
IBGP sessions established across the public Internet should be encrypted. If you cannot
configure an IPsec session between the BGP routers, use MD5 authentication to prevent
man-in-the-middle or denial-of-service attacks.
PREFIX ORIGINATION
With all BGP routers advertising the same prefixes, we have to use BGP local preference to select
the best prefix:
BGP prefixes advertised by routers in the primary data center have the default local preference (100);
BGP prefixes advertised by routers in the backup data center have a lower local preference (50). The
routers advertising backup prefixes (with a network or redistribute router configuration
command) shall also set the BGP weight to zero to make locally-originated prefixes comparable
to other IBGP prefixes (see the sketch below).
Furthermore, prefixes with the default local preference (100) shall get a higher local preference (200)
when received over the Internet-traversing IBGP session (see Figure 5-7):
Figure 5-7: BGP local preference in prefix origination and propagation
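A minimal sketch of backup prefix origination (the prefix, AS number and route map name are assumptions):

router bgp 64500
 ! Originate the backup copy of the shared subnet
 network 192.0.2.0 mask 255.255.255.0 route-map backup-prefix
!
route-map backup-prefix permit 10
 ! Lower local preference marks the backup copy of the prefix; zero weight
 ! makes the locally-originated prefix comparable to IBGP copies
 set local-preference 50
 set weight 0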
NEXT-HOP PROCESSING
IBGP sessions between ISP-assigned IP addresses shall not influence actual packet forwarding. The
BGP next hop advertised over these sessions must be identical to the BGP next hop advertised over
the DCI-traversing IBGP sessions.
Default BGP next hop processing might set the BGP next hop for locally-originated directly connected
prefixes to the local IP address of the IBGP session (the BGP next hop for routes redistributed into BGP
from other routing protocols is usually set to the next hop provided by the source routing protocol).
If the BGP origination process does not set the BGP next hop (the BGP next hop for locally originated
prefixes equals 0.0.0.0), you must set the value of the BGP next hop to one of the internal IP
addresses of the BGP router (loopback or internal LAN IP address). Use the set ip next-hop command in a
route-map attached to the network or redistribute router configuration command. You might also
change the BGP next hop with an outbound route-map applied to the Internet-traversing IBGP session
(Figure 5-8).
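Continuing the origination sketch above, the next hop could be anchored to an internal address in the same route-map (the loopback address is an assumption):

route-map backup-prefix permit 10
 set local-preference 50
 set weight 0
 ! Anchor the next hop to an internal (loopback) address so that both
 ! IBGP sessions advertise the same forwarding information
 set ip next-hop 192.168.100.1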
Prefixes received from backup data center with local preference 50.
After BGP converges, the prefixes originated in the backup data center (for a specific subnet) should
no longer be visible in the BGP tables of routers in the primary data center; routers in the backup data
center should revoke them due to their lower local preference.
BGP routers in the backup data center should have three copies of the primary subnet in their BGP
table:
Locally-originated prefix with local preference 50;
Prefix received from the primary data center over the DCI-traversing IBGP session with local preference
100;
Prefix received from the primary data center over the Internet-traversing IBGP session with local
preference 200 (the local preference is set to 200 by the receiving router).
BGP routers in the backup data center should thus prefer prefixes received over the Internet-traversing
IBGP session. As these prefixes have the same next hop as prefixes received over the DCI-traversing
IBGP session (internal LAN or loopback interface of data center edge routers), the actual packet
forwarding is not changed.
Locally-originated prefix. The router is obviously the best source of routing information,
either because it's the primary router for the subnet or because the primary data center cannot
be reached through either the DCI link or the Internet.
IBGP prefix with local preference 100. A prefix with this local preference can only be received
from the primary data center (for the prefix) over the DCI-traversing IBGP session. The lack of a better path
(with local preference 200) indicates a failure of the Internet-traversing IBGP session, probably
caused by an Internet link failure in the primary data center. The prefix should be advertised with a
prepended AS path.
IBGP prefix with local preference 200. The prefix was received from the primary data center (for
the prefix) through the Internet-traversing IBGP session, indicating a primary data center with fully
operational Internet connectivity. The prefix must not be advertised to EBGP peers as it's already
advertised by the primary data center BGP routers.
BGP communities might be used to ease the differentiation between locally-originated and
other IBGP prefixes.
FAILURE ANALYSIS
Assume DC-A is the primary data center for a given prefix.
DC-A Internet link failure: Internet-traversing IBGP session fails. BGP routers in DC-B start
advertising the prefix from DC-A (its local preference has dropped to 100 due to IBGP session
failure).
DC-A BGP router failure: BGP routers in DC-B lose all prefixes from DC-A and start
advertising locally-originated prefixes for the shared subnet.
DCI failure: Internet-traversing IBGP session is still operational. BGP routers in DC-B do not
advertise prefixes from DC-A. No traffic is attracted to DC-B.
Total DC-A failure: All IBGP sessions between DC-A and DC-B are lost. BGP routers in DC-B
advertise local prefixes, attracting user traffic toward servers started in DC-B during the disaster
recovery procedures.
End-to-end Internet connectivity failure: Internet-traversing IBGP session fails. BGP routers
in DC-B start advertising prefixes received over DCI-traversing IBGP session with prepended
AS-path. Traffic for subnet currently belonging to DC-A might be received by DC-B but will still
be delivered to the destination host as long as the DCI link is operational.
EBGP session failure in DC-A: Prefixes from DC-A will not be advertised by either DC-A
(because the EBGP session is gone) or DC-B (because the Internet-traversing IBGP session is still
operational). You might use neighbor advertise-map on the Internet-traversing IBGP session to
ensure prefixes are sent on that session only if the local BGP table contains external prefixes
(indicating an operational EBGP session), as shown in the sketch below.
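A minimal sketch of the conditional advertisement (addresses, prefixes and names are assumptions; the exist-map matches the ISP-originated default route as a proxy for an operational EBGP session):

router bgp 64500
 ! Advertise local prefixes on the Internet-traversing IBGP session only
 ! when prefixes matched by the exist-map are present in the BGP table
 neighbor 198.51.100.2 advertise-map LOCAL-PFX exist-map EBGP-ALIVE
!
ip prefix-list local-pfx permit 192.0.2.0/24
ip prefix-list default-only permit 0.0.0.0/0
!
route-map LOCAL-PFX permit 10
 match ip address prefix-list local-pfx
route-map EBGP-ALIVE permit 10
 match ip address prefix-list default-only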
If your routers support Bidirectional Forwarding Detection (BFD) over IBGP sessions, use it
to speed up the convergence process.
MIGRATION SCENARIOS
Use the following procedures when performing a controlled migration from DC-A to DC-B:
DC evacuation (primary to backup). Migrate the VMs, decrease the default local preference
on DC-A routers to 40. Even though these prefixes will be received over Internet-traversing
IBGP session by BGP routers in DC-B, their local preference will not be increased. Prefixes
originated by DC-B will thus become the best prefixes and will be advertised by both data
centers. Complete the evacuation by shutting down EBGP sessions in DC-A. Shut down DC-A.
DC restoration (backup to primary). Connect DC-A to the WAN networks, enable EBGP
sessions, change the default local preference to 100. After IBGP convergence completes, DC-B
stops advertising prefixes from DC-A.
CONCLUSIONS
Optimal external routing that avoids split-brain scenarios is relatively easy to implement in WAN
networks with consistent routing policy: advertise each subnet with low cost (or shorter AS-path or
lower value of multi-exit discriminator in BGP-based networks) from the primary data center (for
that subnet) and with higher cost from the backup data center.
In well-designed active-active data center deployments each data center acts as the
primary data center for the subset of prefixes used by applications running in that data
center.
Optimal external routing toward the Internet is harder to implement due to potentially inconsistent
routing policies used by individual ISPs. The only solution is tightly controlled conditional route
advertisement: routers in backup data center (for a specific prefix) should not advertise the prefix
as long as the primary data center retains its Internet connectivity. This requirement could be
implemented with numerous scripting mechanisms available in modern routers; this document
presented a cleaner solution that relies exclusively on standard BGP mechanisms available in most
modern BGP implementations.
IN THIS CHAPTER:
COLLECT THE REQUIREMENTS
PRIVATE CLOUD PLANNING AND DESIGN PROCESS
DESIGN DECISIONS
CONCLUSIONS
The data center networking team in a large enterprise (the Customer) has been tasked with building
the network infrastructure for a new private cloud deployment.
They approached numerous vendors trying to figure out what the new network should look like, and
got thoroughly confused by all the data center fabric offerings, from FabricPath (Cisco) and VCS
Fabric (Brocade) to Virtual Chassis Fabric (Juniper), QFabric (Juniper) and more traditional leaf-and-spine
architectures (Arista). Should they build a layer-2 fabric, a layer-3 fabric or a leaf-and-spine
fabric?
Total north-south (traffic leaving the data center) and east-west (inter-server traffic) bandwidth.
These requirements can only be gathered after the target workload has been estimated in terms of
bandwidth, number of servers and number of tenants.
Long-term average and peak statistics of existing virtualized or physical workload behavior are
usually a good initial estimate of the target workload. The Customer has collected these statistics
using VMware vCenter Operations Manager:
Category                        Collected values
Host statistics                 20 hosts, 500 VMs
Storage statistics              40 TB of storage
Average VM requirements         2 GB of RAM per VM, 80 GB of disk per VM
Bandwidth and IOPS statistics
The Customer is expecting reasonably fast workload growth and thus decided to build a
cloud infrastructure that will eventually support a workload up to 5 times larger. They have also
increased the expected average VM requirements.
Category                    Target workload
Average VM requirements     4 GB of RAM per VM, 20 IOPS per VM
Workload size               2500 VMs

Parameter     Value
CPU cores     750
RAM           10 TB
IOPS          50,000
Parameter            Value
Network bandwidth    125 Gbps
Storage bandwidth    125 Gbps
Define the services offered by the cloud. Major decision points include IaaS versus PaaS and
simple hosting versus support for complex application stacks.
Select the orchestration system (OpenStack, CloudStack, vCloud Director) that will allow the
customers to deploy these services;
Select the hypervisor supported by the selected orchestration system that has the desired
features (example: high availability);
Select the network services implementation (physical or virtual firewalls and load balancers);
DESIGN DECISIONS
The Customer's private cloud infrastructure will use vCloud Automation Center and vSphere
hypervisors.
The server team decided to use the Nutanix NX-3050 servers with the following specifications:
Parameter     Value
CPU cores     16 cores
RAM           256 GB
IOPS          6000
Connectivity  2 × 10GE uplinks
The target workload can be placed on 50 NX-3050 servers (based on the number of CPU cores).
Those servers would have 12.800 GB of RAM (enough), 1 Tbps of network bandwidth (more than
enough) and 300.000 IOPS (more than enough).
Switch vendors use marketing math: they count ingress and egress bandwidth on every
switch port. The Nutanix server farm would have 2 Tbps of total network bandwidth using
that approach.
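A quick sanity check of the server farm sizing, written as a Python sketch (per-server numbers come from the NX-3050 specifications above; the required values come from the target workload tables):

# NX-3050 per-server specifications (see the list above)
cores_per_server, ram_gb, iops, uplink_gbps = 16, 256, 6000, 2 * 10

required = {"CPU cores": 750, "RAM (GB)": 10_000, "IOPS": 50_000}

servers = 50  # chosen based on the number of CPU cores
capacity = {
    "CPU cores": servers * cores_per_server,   # 800
    "RAM (GB)": servers * ram_gb,              # 12,800
    "IOPS": servers * iops,                    # 300,000
}
for key in required:
    assert capacity[key] >= required[key], key

print(f"Network bandwidth: {servers * uplink_gbps} Gbps")  # 1000 Gbps = 1 Tbps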
The private cloud will use a combination of physical (external firewall) and virtual (per-application
firewalls and load balancers) network services. The physical firewall services will be implemented
on two devices in active/backup configuration (two 10GE ports each); virtual services will run on
a separate cluster of four hypervisor hosts (see the cloud-as-an-appliance design,
http://blog.ipspace.net/2013/07/cloud-as-appliance-design.html), for a total of 54 servers.
The number of network segments in the private cloud will be relatively low. VLANs will be used to
implement the network segments; the network infrastructure thus has to provide layer-2
connectivity between any two endpoints.
This decision effectively turns the whole private cloud infrastructure into a single failure
domain. Overlay virtual networks would be a more stable alternative (from the network
perspective), but are not considered a mature enough technology by more conservative cloud
infrastructure designers.
Modular switches from numerous vendors have a significantly higher number of 10GE ports,
allowing you to build even larger 2-switch fabrics (called splines by Arista marketing).

Every data center switching vendor can implement an ECMP layer-2 fabric with no blocked links using
multi-chassis link aggregation (Arista: MLAG, Cisco: vPC, HP: IRF, Juniper: MC-LAG).

Some vendors offer layer-2 fabric solutions that provide optimal end-to-end forwarding across larger
fabrics (Cisco FabricPath, Brocade VCS Fabric, HP TRILL), while other vendors allow you to merge multiple
switches into a single management-plane entity (HP IRF, Juniper Virtual Chassis, Dell Force10
stacking). In any case, it's not hard to implement an end-to-end layer-2 fabric with ~100 10GE ports.
MANAGEMENT NETWORK
A mission-critical data center infrastructure should have a dedicated out-of-band management
network disconnected from the user and storage data planes. Most network devices and high-end
servers have dedicated management ports that can be used to connect these devices to a separate
management infrastructure.
The management network does not have high bandwidth requirements (most devices have Fast
Ethernet or Gigabit Ethernet management ports); you can build it very effectively with a pair of GE
switches.
Do not use existing ToR switches or fabric extenders (FEX) connected to existing ToR
switches to build the management network.
The purpose of the management network is to reach infrastructure devices (including ToR
switches) even when the network infrastructure malfunctions or experiences forwarding
loops and resulting brownouts or meltdowns.
CONCLUSIONS
One cannot design an optimal network infrastructure without a comprehensive set of input
requirements. When designing a networking infrastructure for a private or public cloud these
requirements include:
Transport services offered by the networking infrastructure (VLANs, IP, lossless Ethernet, FCoE …);
Most reasonably sized private cloud deployments require a few tens of high-end physical servers and
associated storage, either distributed or in the form of storage arrays. You can implement the network
infrastructure meeting these requirements with two ToR switches having between 64 and 128
10GE ports.
REDUNDANT SERVER-TO-NETWORK
CONNECTIVITY
IN THIS CHAPTER:
DESIGN REQUIREMENTS
VLAN-BASED VIRTUAL NETWORKS
REDUNDANT SERVER CONNECTIVITY TO LAYER-2 FABRIC
OPTION 1: NON-LAG SERVER CONNECTIVITY
OPTION 2: SERVER-TO-NETWORK LAG
CONCLUSIONS
A large enterprise (the Customer) is building a private cloud infrastructure using a leaf-and-spine
fabric for internal network connectivity. The virtualization team hasn't decided yet whether to use a
commercial product (example: VMware vSphere) or an open-source alternative (KVM with
OpenStack). It's also unclear whether VLANs or overlay layer-2 segments will be used to implement
virtual networks.
Regardless of the virtualization details, the server team wants to implement redundant server-to-network connectivity: each server will be connected to two ToR switches (see Figure 7-1).

The networking team has to build the network infrastructure before having all the relevant input
data; the infrastructure should thus be as flexible as possible.
DESIGN REQUIREMENTS
The virtualization solution deployed in the private cloud may use VLANs as the virtual networking
technology; the leaf-and-spine fabric deployed by the networking team MUST therefore support layer-2
connectivity between all attached servers.

Overlay virtual networks may also be used in the private cloud, in which case a large layer-2 failure
domain is not an optimal solution; the leaf-and-spine fabric SHOULD therefore also support layer-3
connectivity with a separate subnet assigned to each ToR switch (or a redundant pair of ToR switches).
The choice of the layer-2 fabric technology depends primarily on the size of the fabric (are two core
switches enough for the planned number of server ports?) and the vendor supplying the networking
gear (most major data center vendors have proprietary layer-2 fabric architectures).
Only the network edge switches see MAC addresses of individual hosts in environments
using Provider Backbone Bridging (PBB) or TRILL/FabricPath-based fabrics.
Alternatively, the hypervisor hosts could bundle all uplinks into a link aggregation group (LAG) and
spread the traffic generated by the VMs across all the available uplinks (see Figure 7-5).
Figure 7-6: VM-to-uplink pinning with two hypervisor hosts connected to the same pair of ToR switches
Even though the two hypervisors could communicate directly, the traffic between two VMs might
have to go all the way through the spine switches (see Figure 7-7) due to VM-to-uplink pinning,
which presents a VM MAC address on a single server uplink.
Conclusion: If the majority of the expected traffic flows between virtual machines and the outside
world (North-South traffic), non-LAG server connectivity is ideal. If the majority of the traffic flows
between virtual machines (East-West traffic) then the non-LAG design is clearly suboptimal unless
the chance of VMs residing on co-located hypervisors is exceedingly small (example: large cloud
with tens or even hundreds of ToR switches).
You might have to increase the bandwidth of intra-stack links to cope with the increased amount of
east-west traffic (leaf-to-spine bandwidth in well-designed Clos fabrics is usually significantly higher
than intra-stack bandwidth), but it's far easier to remove MLAG pairing (or stacking) between ToR
switches and dedicate all non-server-facing ports to leaf-to-spine uplinks.
Conclusion: Do not use MLAG or switch stacking in environments with non-LAG server-to-network
connectivity.
You could configure a link aggregation group between a server and a pair of ToR switches as a
regular port channel (or LAG) using Link Aggregation Control Protocol (LACP) to manage the LAG
(see Figure 7-9), or as a static port channel without LACP.
A static port channel is the only viable alternative when using older hypervisors (example:
vSphere 5.0), but since this option doesn't use a handshake/link-monitoring protocol, it's
impossible to detect wiring mistakes or misbehaving physical interfaces. Static port
channels are thus inherently unreliable and should be avoided whenever possible.
Switches participating in an MLAG group (or stack) exchange the MAC addresses received from the
attached devices, and a switch receiving a packet for a destination MAC address reachable over a
LAG link always uses a local member of that LAG to reach the destination (see Figure 7-10). A
design with servers dual-connected with LAGs to pairs of ToR switches therefore results in optimal
traffic flow regardless of VM placement and any VM-to-uplink pinning done by the hypervisors.

The only drawback of the server-to-network LAG design is the increased complexity introduced by
MLAG groups.
Figure 7-11: Redundant server connectivity requires the same IP subnet on adjacent ToR switches
The Linux bonding driver can send traffic from a single IP address through multiple uplinks using
one or more MAC addresses (see Linux Bonding Driver Implementation Details).
In most setups the hypervisor associates its IP address with a single MAC address (ARP replies sent
by the hypervisor use a single MAC address), and that address cannot be visible over more than a
single server-to-switch link (or LAG).
Most switches would report MAC address flapping when receiving traffic from the same
source MAC address through multiple independent interfaces.
The traffic toward the hypervisor host (including all encapsulated virtual network traffic) would thus
use a single server-to-switch link (see Figure 7-12).
The traffic sent from a Linux hypervisor host could use multiple uplinks (with a different MAC
address on each active uplink) when the host uses balance-tlb or balance-alb bonding mode (see
Linux Bonding Driver Implementation Details) as shown in Figure 7-13.
Figure 7-13: All uplinks are used by a Linux host using balance-tlb bonding mode
With a layer-3 fabric, the spine switches receive equal-cost paths to the server subnet from both
ToR switches advertising the subnet, and since the server MAC addresses appear to be connected to
a single ToR switch, the spine switches send half of the traffic to the wrong ToR switch.
Figure 7-14: All ToR switches advertise IP subnets with the same cost
Stackable switches are even worse. While it's possible to advertise an IP subnet shared by two ToR
switches with different metrics to attract the traffic to the primary ToR switch, the same approach
doesn't work with stackable switches, which treat all members of the stack as a single virtual IP
router, as shown in Figure 7-15.
Conclusion: Do not use non-LAG server connectivity in overlay virtual networking environments.
Balance-tlb mode uses multiple source MAC addresses for the same IP address, but a single MAC
address in ARP replies. Traffic from the server is sent through all active uplinks; return traffic uses a
single (primary) uplink. This mode is obviously not optimal in scenarios with a large percentage of
east-west traffic.
Balance-alb mode replaces the MAC address in the ARP replies sent by the Linux kernel with one of
the physical interface MAC addresses, effectively assigning different MAC addresses (and thus
uplinks) to IP peers, and thus achieving rudimentary inbound load distribution.
All other bonding modes (balance-rr, balance-xor, 802.3ad) use the same MAC address on multiple
active uplinks and thus require port channel (LAG) configuration on the ToR switch to work properly.
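The bonding-mode behaviors described above can be condensed into a small lookup table; here's a Python sketch paraphrasing the text (not an exhaustive list of bonding modes or their options):

# Linux bonding modes and their MAC-address behavior (as described above)
bonding_modes = {
    "balance-tlb": {"tx": "all uplinks (multiple MACs)",
                    "rx": "single primary uplink",
                    "switch_lag_required": False},
    "balance-alb": {"tx": "all uplinks (multiple MACs)",
                    "rx": "spread across uplinks via ARP rewriting",
                    "switch_lag_required": False},
    "balance-rr":  {"switch_lag_required": True},
    "balance-xor": {"switch_lag_required": True},
    "802.3ad":     {"switch_lag_required": True},   # LACP-based LAG
}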
All edge switches participating in a layer-2 fabric would have full MAC address reachability
information and would be able to send the traffic to individual hypervisor hosts over an optimal path
(assuming the fabric links are not blocked by Spanning Tree Protocol) as illustrated in Figure 7-17.
Layer-2 transport fabrics have another interesting property: they allow you to spread the load
evenly across all ToR switches (and leaf-to-spine links) in environments using server uplinks in
primary/backup mode; all you have to do is spread the primary links evenly across all
ToR switches.

Unfortunately, a single layer-2 fabric represents a single broadcast and failure domain; using a
layer-2 fabric in combination with overlay virtual networks (which don't require layer-2 connectivity
between hypervisor hosts) is therefore suboptimal from the resilience perspective.
- ToR switches can reach server MAC addresses directly (switches in an MLAG group exchange MAC addresses learned from traffic received on port channel interfaces);
- Servers can send encapsulated traffic across all uplinks, so the flow of northbound (server-to-network) traffic is optimal;
- Both ToR switches can send the traffic to adjacent servers directly, so the flow of southbound (network-to-server) traffic is optimal.
The LAGs used between servers and switches should use LACP to prevent traffic blackholing (see
Option 2: Server-to-Network LAG for details), and the servers and the ToR switches should use
5-tuple load balancing.

VXLAN and STT encapsulations use source ports in UDP or TCP headers to
increase the packet entropy and the effectiveness of ECMP load balancing. Most other
encapsulation mechanisms use GRE transport, effectively pinning the traffic between a pair
of hypervisors to a single path across the network.
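To illustrate why source-port entropy matters for ECMP, here's a toy Python model of 5-tuple load balancing (the hash is a stand-in, not any vendor's actual algorithm):

import hashlib

def ecmp_link(src_ip, dst_ip, proto, sport, dport, num_links):
    # Toy stand-in for the 5-tuple hash computed by a switch ASIC
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % num_links

# VXLAN varies the UDP source port per inner flow, spreading flows across
# links; GRE has no port numbers, so every packet between a pair of
# hypervisors selects the same link.
for sport in (49152, 50123, 61002):
    print(ecmp_link("10.0.0.1", "10.0.0.2", "udp", sport, 4789, 4))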
CONCLUSIONS
The most versatile leaf-and-spine fabric design uses dynamic link aggregation between servers and
pairs of ToR switches. This design requires MLAG functionality on ToR switches, which does increase
the overall network complexity, but the benefits far outweigh the complexity increase: the design
works well with layer-2 fabrics (required by VLAN-based virtual networks) or layer-3 fabrics
(recommended as transport fabrics for overlay virtual networks) and usually results in optimal
traffic flow (the only exception being traffic sent toward orphan ports, which might
have to traverse the link between MLAG peers).
You might also use layer-2 fabrics without server-to-network link aggregation for VLAN-based virtual
networks where hypervisors pin VM traffic to one uplink, or for small overlay virtual networks where
you're willing to trade the resilience of a layer-3 fabric for the reduced complexity of non-MLAG server
connectivity.

Finally, you SHOULD NOT use non-MLAG server connectivity in layer-3 fabrics, or MLAG (or stackable
switches) in layer-2 environments without server-to-switch link aggregation.
IN THIS CHAPTER:
FROM PACKET FILTERS TO STATEFUL FIREWALLS
DESIGN ELEMENTS
PROTECTING THE SERVERS WITH PACKET FILTERS
PROTECTING VIRTUALIZED SERVERS WITH STATEFUL FIREWALLS
PER-APPLICATION FIREWALLS
PACKET FILTERS AT WAN EDGE
DESIGN OPTIONS
BEYOND THE TECHNOLOGY CHANGES
ACME Inc. has a data center hosting several large-scale web applications. Their existing data center
design uses a traditional enterprise approach:
- The data center is segmented into several security zones (web servers, application servers, database servers, supporting infrastructure);
- Servers belonging to different applications reside within the same security zone, increasing the risk of lateral movement in case of a web or application server breach;
- Large layer-2 segments connect all servers in the same security zone, further increasing the risk of cross-protocol attacks;
- All inter-zone traffic is controlled by a pair of central firewalls, which are becoming increasingly impossible to manage;
- The central firewalls are also becoming a chokepoint, severely limiting the growth of ACME's application infrastructure.
The networking engineers designing the next-generation data center for ACME would like to replace the
central firewalls with iptables deployed on application servers, but are reluctant to do so due to
potential security implications.

Among other goals, the new design should:
- Satisfy the business-level security requirements of ACME Inc., including potential legal, regulatory and compliance requirements;
- Not require large-scale upgrades when the application traffic reaches a certain limit (which is the case with the existing firewalls).

Effectively, they're looking for a scale-out solution that will ensure approximately linear growth,
with a minimum amount of state to reduce the complexity and processing requirements.
While designing the overall application security architecture, they could use the following tools:

Packet filters (or access control lists, ACLs) are the bluntest of traffic filtering tools: they match
(and pass or drop) individual packets based on their source and destination network addresses and
transport layer port numbers. They keep no state (making them extremely fast and implementable
in simple hardware) and thus cannot check the validity of transport layer sessions or fragmented
packets.
Some packet filters give you the option of permitting or dropping fragments based on network layer
information (source and destination addresses), others either pass or drop all fragments (and
sometimes the behavior is not even configurable).
Packet filters are easy to use in server-only environments, but become harder to maintain when
servers start establishing client sessions to other servers (example: application servers opening
MySQL sessions to database servers).
They are not the right tool in environments where clients establish ad-hoc sessions to random
destination addresses (example: servers opening random sessions to Internet-based web servers).
Packet filters with automatic reverse rules (example: XenServer vSwitch Controller) are
syntactic sugar on top of simple packet filters. Whenever you configure a filtering rule (example:
permit inbound TCP traffic to port 80), the ACL configuration software adds a reverse rule in the
other direction (permit outbound TCP traffic from port 80).
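A minimal Python sketch of the automatic-reverse-rule idea (the rule format is invented for illustration):

def reverse_rule(rule):
    # Swap endpoints and ports to match the return traffic
    return dict(rule, src=rule["dst"], dst=rule["src"],
                src_port=rule["dst_port"], dst_port=rule["src_port"])

inbound = {"action": "permit", "proto": "tcp", "src": "any",
           "dst": "10.1.1.10", "src_port": "any", "dst_port": 80}
print(reverse_rule(inbound))  # permits outbound TCP traffic from port 80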
ACLs that allow matches on established TCP sessions (typically matching TCP traffic with the ACK or
RST bit set) make it easier to match outbound TCP sessions. In a server-only environment you can
use them to match inbound TCP traffic on specific port numbers and outbound traffic of established
TCP sessions (to prevent simple attempts to establish outbound sessions from hijacked servers); in
a client-only environment you can use them to match return traffic.
Reflexive access lists (Cisco IOS terminology) are the simplest stateful tool in the filtering arsenal.
Whenever a TCP or UDP session is permitted by an ACL, the filtering device adds a 5-tuple matching
the return traffic of that session to the reverse ACL.
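The reflexive ACL mechanism can be modeled in a few lines of Python (state handling is vastly simplified; real implementations also age out entries):

reverse_acl = set()   # dynamically created return-traffic entries

def permit_session(src_ip, dst_ip, proto, sport, dport):
    # When the forward ACL permits a session, install the 5-tuple
    # matching its return traffic into the reverse ACL
    reverse_acl.add((dst_ip, src_ip, proto, dport, sport))

def return_traffic_permitted(src_ip, dst_ip, proto, sport, dport):
    return (src_ip, dst_ip, proto, sport, dport) in reverse_acl

permit_session("10.1.1.5", "192.0.2.7", "tcp", 51000, 443)
print(return_traffic_permitted("192.0.2.7", "10.1.1.5", "tcp", 443, 51000))  # True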
Reflexive ACLs generate one filtering entry per transport layer session. Not surprisingly, you won't
find them in platforms that do packet forwarding and filtering in hardware: they would quickly
overload the TCAM (or whatever forwarding/filtering hardware the device is using), cause packet
punting to the main CPU and reduce the forwarding performance by orders of magnitude.
Even though reflexive ACLs generate per-session entries (and thus block unwanted traffic that might
have been permitted by other less-specific ACLs) they still work on individual packets and thus
cannot reliably detect and drop malicious fragments or overlapping TCP segments.
Transport layer session inspection combines reflexive ACLs with fragment reassembly and
transport-layer validation. It should detect dirty tricks targeting bugs in host TCP/IP stacks like
overlapping fragments or TCP segments.
Application level gateways (ALG) add application awareness to reflexive ACLs. They're usually
used to deal with applications that exchange transport session endpoints (IP addresses and port
numbers) in the application payload (FTP and SIP are well-known examples). An ALG detects the
requests to open additional data sessions and creates additional transport-level filtering entries.

Web Application Firewalls (WAF) have to go way beyond ALGs. ALGs try to help applications get
the desired connectivity and thus don't focus on malicious obfuscations. WAFs have to stop the
obfuscators; they have to parse application-layer requests like a real server would to detect injection
attacks. Needless to say, you won't find full-blown WAF functionality in reasonably priced high-bandwidth firewalls.
DESIGN ELEMENTS
ACME designers can use numerous design elements to satisfy the security requirements, including:
Per-application firewalls;
Per-server traffic filters (packet filters or firewalls) also alleviate the need for numerous
security zones, as the protection of individual servers no longer relies on the zoning
concept. In a properly operated environment one could have all servers of a single
application stack (or even servers from multiple application stacks) in a single layer-2 or
layer-3 domain.
Scale-out packet filters require a high level of automation: they have to be deployed automatically
from a central orchestration system to ensure consistent configuration and prevent operator
mistakes.
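What automated deployment could look like in practice: a Python sketch that renders per-server iptables rules from a central, declarative rule specification (the spec format and the two sample rules are invented; real orchestration systems use templates along the same lines):

# Central rule specification (purely illustrative)
rules = [
    {"proto": "tcp", "dport": 80,   "src": "0.0.0.0/0"},    # web tier
    {"proto": "tcp", "dport": 3306, "src": "10.1.2.0/24"},  # app-to-DB
]

def render_iptables(rules):
    lines = [
        "iptables -P INPUT DROP",
        "iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT",
    ]
    for r in rules:
        lines.append("iptables -A INPUT -p %(proto)s -s %(src)s "
                     "--dport %(dport)s -j ACCEPT" % r)
    return "\n".join(lines)

print(render_iptables(rules))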
In environments with an extremely high level of trust in server operating system hardening one
could use iptables on individual servers. In most other environments it's better to deploy the packet
filters outside of the application servers: an intruder breaking into a server and gaining root access
could easily turn off the packet filter.
You could deploy packet filters protecting servers from the outside on first-hop switches (usually
Top-of-Rack or End-of-Row switches), or on hypervisors in virtualized environments.
Packet filters deployed on hypervisors are a much better alternative: hypervisors are not limited by
the size of packet filtering hardware (TCAM), allowing the security team to write very explicit
application-specific packet filtering rules permitting traffic between individual IP addresses instead of
IP subnets (see also High Speed Multi-Tenant Isolation for more details).
All major hypervisors support packet filters on VM-facing virtual switch interfaces:
- vSphere 5.5 and Windows Server 2012 R2 have built-in support for packet filters;
- Linux-based hypervisors can use iptables in the hypervisor kernel, achieving the same results as using iptables in the guest VM in a significantly more secure way;
- Cisco Nexus 1000V provides the same ACL functionality and configuration syntax in vSphere, Hyper-V and KVM environments.

Environments using high-performance bare-metal servers could redeploy these servers as
VMs in a single-VM-per-host setup, increasing deployment flexibility, easing upgrades, and
providing traffic control outside of the guest OS.
Most VM NIC firewall products offer centralized configuration, automatically providing the
automated deployment of configuration changes mentioned in the previous section.

The implementation details that affect the scalability or performance of VM NIC virtual firewalls vary
greatly between individual products:
- The distributed firewall in VMware NSX, Juniper Virtual Gateway, and Hyper-V firewalls using the filtering functionality of the Hyper-V Extensible Switch use in-kernel firewalls, which offer true scale-out performance limited only by the available CPU resources;
- vShield App or Zones uses a single firewall VM per hypervisor host and passes all guest VM traffic through the firewall VM, capping the server I/O throughput at the throughput of a single-core VM (3-4 Gbps);
- Cisco Nexus 1000V sends the first packets of every new session to the Cisco Virtual Security Gateway, which might be deployed somewhere else in the data center, increasing the session setup delay. Subsequent packets of the same session are switched in the Nexus 1000V VEM module residing in the hypervisor kernel;
You should ask the following questions when comparing individual VM NIC firewall products:
Is the session state moved with the VM to another hypervisor, or is the state recreated
from packets of already-established sessions inspected after the VM move?
PER-APPLICATION FIREWALLS
Per-application firewalls are a more traditional approach to application security: each application
stack has its own firewall context (on a physical appliance) or a virtual firewall.

Per-application firewalls (or contexts) significantly reduce the complexity of the firewall rule set:
after all, a single firewall (or firewall context) contains only the rules pertinent to a single
application. It is also easily removed when the application is retired, automatically reducing the
number of hard-to-audit stale firewall rules.
VM-based firewalls or virtual contexts on physical firewalls are functionally equivalent to traditional
firewalls and thus pose no additional technical challenges. They do require a significant change in
deployment, management and auditing processes: just as it's impossible to run thousands of
virtualized servers using mainframe tools, it's impossible to operate hundreds of small firewalls
using the processes and tools suited for centralized firewall appliances.

Virtual firewall appliances used to have significantly lower performance than their physical
counterparts. The situation changed drastically with the introduction of Xeon CPUs (and their AMD
equivalents); the performance of virtual firewalls and load balancers is almost identical to
entry-level physical products.
Firewalls were traditionally used to protect poorly written server TCP stacks. These firewalls
would remove out-of-window TCP segments, certain ICMP packets, and perform IP fragment
reassembly. Modern operating systems no longer need such protection.
Packet filters permitting only well-known TCP and UDP ports, combined with hardened operating
systems, offer protection similar to stateful firewalls; the real difference between the two is the
handling of outgoing sessions (sessions established from clients in a data center to servers outside
of the data center). These sessions are best passed through a central proxy server, which can also
provide application-level payload inspection.
Figure 8-22: High-performance WAN edge packet filters combined with a proxy server
The rules are split across four ACLs: an inbound WAN edge ACL, an outbound WAN edge ACL, an
inbound proxy-facing ACL, and an outbound proxy-facing ACL; the outbound rules force client
sessions through the proxy server.
DESIGN OPTIONS
ACME designers can combine the design elements listed in the previous section to satisfy the
security requirements of individual applications:
- WAN edge packet filters combined with per-server (or VM NIC) packet filters are good enough for environments with well-hardened servers or low security requirements;
- WAN edge packet filters combined with per-application firewalls are an ideal solution for security-critical applications in high-performance environments;
- A high-performance data center might use packet filters in front of most servers and per-application firewalls in front of critical applications (example: credit card processing);
- Environments that require stateful firewalls between the data center and external networks could use a combination of WAN edge firewall and per-server packet filters, or a combination of WAN edge firewall and per-application firewalls;
- In extreme cases one could use three (or more) layers of defense: a WAN edge firewall performing coarse traffic filtering and HTTP/HTTPS inspection, and another layer of stateful firewalls or WAFs protecting individual applications, combined with per-server protection (packet filters or firewalls).
- Distributed traffic control points (firewalls or packet filters) cannot be configured and managed with the same tools as a single device. The ACME operations team SHOULD use an orchestration tool that deploys the traffic filters automatically (most cloud orchestration platforms and virtual firewall products include tools that can automatically deploy configuration changes across a large number of traffic control points);

System administrators went through a similar process when they migrated workloads from
mainframe computers to x86-based servers.

- Per-application traffic control is much simpler and easier to understand than a centralized firewall ruleset, but it's impossible to configure and manage tens or hundreds of small point solutions manually. The firewall (or packet filter) management SHOULD use the automation, orchestration and management tools the server administrators already use to manage a large number of servers;
- Application teams SHOULD become responsible for the whole application stack, including the security products embedded in it. They might not configure the firewalls or packet filters themselves, but SHOULD own them in the same way they own all other specialized components in the application stack, like databases.
- The security team's role SHOULD change from enforcer of security to validator of security: they should validate and monitor the implementation of security mechanisms instead of focusing on configuring the traffic control points.
Simple tools like nmap probes deployed outside and within the data center are good
enough to validate the proper implementation of L3-4 traffic control solutions, including
packet filters and firewalls.
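A sketch of such a validation probe in Python, assuming nmap is installed; the target address and the expected_open port set are invented examples of what the packet filters should permit:

import subprocess

def open_tcp_ports(host, ports="1-1024"):
    # Run an nmap TCP scan in grepable output mode and collect open ports
    out = subprocess.run(["nmap", "-p", ports, "-oG", "-", host],
                         capture_output=True, text=True).stdout
    found = set()
    for line in out.splitlines():
        if "Ports:" in line:
            for entry in line.split("Ports:")[1].split(","):
                port, state = entry.strip().split("/")[:2]
                if state == "open":
                    found.add(int(port))
    return found

expected_open = {80, 443}
unexpected = open_tcp_ports("192.0.2.10") - expected_open
print("Unexpected open ports:", unexpected or "none")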
IN THIS CHAPTER:
EXISTING NETWORK SERVICES DESIGN
SECURITY REQUIREMENTS
PRIVATE CLOUD INFRASTRUCTURE
NETWORK SERVICES IMPLEMENTATION OPTIONS
NETWORK TAPPING AND IDS
LOAD BALANCING
NETWORK-LEVEL FIREWALLS
DEEP PACKET INSPECTION AND APPLICATION-LEVEL FIREWALLS
SSL AND VPN TERMINATION
The central IT department of the government of Genovia is building a new private cloud that will
consolidate workloads currently run at satellite data centers throughout various ministries.
The new private cloud should offer centralized security, quick application deployment capabilities,
and easy integration of existing application stacks that use a variety of firewalls and load
balancers from numerous vendors.
The central IT department provides Internet connectivity to the whole government; other data
centers connect to the private WAN network as shown in Figure 9-2.
SECURITY REQUIREMENTS
Applications running within the existing data centers have highly varying security requirements. Most of
them need some sort of network-layer protection; some of them require deep packet inspection or
application-level firewalls that have been implemented with products from numerous vendors (each
ministry or department used to have its own purchasing processes).
Most application stacks rely on data stored in internal databases or in the central database server
(resident in the central data center); some applications need access to third-party data reachable
over the Internet or tightly-controlled extranet connected to the private WAN network (see Figure
9-3).
Migrating the high-security applications into the security zones that have already been established
within the central data center is obviously out of the question: some of these applications are not
allowed to coexist in the same security zone with any other workload. The number of security zones
in the consolidated data center will thus drastically increase, even more so if the cloud architects
decide to make each application an independent tenant with its own set of security zones.

The cloud architecture team decided to virtualize the whole infrastructure (see the
Cloud-as-an-Appliance design, http://blog.ipspace.net/2013/07/cloud-as-appliance-design.html),
including large bare-metal servers, which will be implemented as single VMs running on dedicated
physical servers, and network services appliances, which will be implemented with open-source or
commercial products in VM format.
Network-level firewalls;
LOAD BALANCING
The "virtualize everything" approach to cloud infrastructure significantly simplifies the
implementation of load balancing services. Load balancing could be offered as a service
(implemented with a centrally-managed pool of VM-based load balancing appliances), or
implemented with per-application load balancing instances.

Individual customers (ministries or departments) migrating their workloads into the centralized
private cloud infrastructure could also choose to continue using their existing load balancing
vendors, and simply migrate their own load balancing architecture into a fully virtualized
environment (a Bring-Your-Own-Load-Balancer approach).
NETWORK-LEVEL FIREWALLS
Most hypervisor or cloud orchestration products support VM NIC-based packet filtering capabilities,
either in the form of simple access lists or distributed (semi-)stateful firewalls.

The centralized private cloud infrastructure could use these capabilities to offer baseline security to
all tenants. Individual tenants could increase the security of their applications by using firewall
appliances offered by the cloud infrastructure (example: vShield Edge) or their existing firewall
products in VM format.
Symmetric encryption is not a major performance concern as long as the application uses the AES
algorithm, the appliance vendor uses the AES-NI instruction set in its software, and the VM runs on a
server with an AES-NI-capable CPU.

The RSA algorithm performed during the SSL handshake is still computationally intensive; software
implementations might have performance that is orders of magnitude lower than the performance of
dedicated hardware used in physical appliances.
The total encrypted throughput and the number of SSL transactions per second offered by a
VM-based load balancing or firewalling product should clearly be one of the major
considerations during your product selection process if you plan to implement SSL or VPN
termination on these products.
In the end, the cloud architecture team proposed the hybrid solution displayed in Figure 9-4:
- Physical load balancers will perform load balancing for high-volume web sites, and pass the traffic to application-specific load balancers for all other web properties that need high-speed SSL termination services;
- High-volume web sites might use a caching layer, in which case the physical load balancers send the incoming requests to a set of reverse proxy servers, which further distribute requests to web servers.
10
IN THIS CHAPTER:
INTERACTION WITH THE PROVISIONING SYSTEM
COMMUNICATION PATTERNS
STATELESS OR STATEFUL TRAFFIC FILTERS?
PACKET FILTERS ON LAYER-3 SWITCHES
PACKET FILTERS ON X86-BASED APPLIANCES
INTEGRATION WITH LAYER-3 BACKBONE
TRAFFIC CONTROL APPLIANCE CONFIGURATION
CONCLUSIONS
The Customer is operating a large multi-tenant data center. Each tenant (application cluster,
database cluster, or a third-party application stack) has a dedicated container connected to a shared
layer-3 backbone. The layer-3 backbone enables connectivity between individual containers and
between containers and the outside world (see Figure 10-1).
The connectivity between the layer-3 backbone and the outside world and the security
measures implemented toward the outside world (packet filters or firewalls, IPS/IDS
systems) are outside of the scope of this document.
Individual containers could be implemented with bare-metal servers, virtualized servers or even
independent private clouds (for example, using OpenStack). Multiple logical containers can share the
same physical infrastructure; in that case, each container uses an independent routing domain
(VRF) for complete layer-3 separation.
The Customer wants to implement high-speed traffic control (traffic filtering and/or firewalling)
between individual containers and the shared high-speed backbone. The solution should be
redundant, support at least 10GE speeds, and be easy to manage and provision through a central
provisioning system.
COMMUNICATION PATTERNS
All the communication between individual containers and the outside world falls into one of these
categories:
- TCP sessions established from an outside client to a server within a container (example: web application sessions). Target servers are identified by their IP address (specified in the orchestration system database) or an IP prefix that covers a range of servers;
- TCP sessions established from one or more servers within a container to a well-known server in another container (example: database session between an application server and a database server). Source and target servers are identified by their IP addresses or IP prefixes;
- UDP sessions established from one or more servers within a container to a well-known server in another container (example: DNS and syslog). Source and target servers are identified by their IP addresses or IP prefixes.

All applications are identified by their well-known port numbers; traffic passing a container boundary
does not use dynamic TCP or UDP ports.
Servers within containers do not establish TCP sessions with third-party servers outside of the
data center, and there is no need for UDP communication between clients within the data center and
servers outside of the data center.
For every permitted communication pattern the provisioning system generates a matching pair of
ACL entries:
- Ingress ACL: permit TCP or UDP traffic from the source servers (src-server-ip) to the destination port (ip eq dst-port);
- Egress ACL: permit return traffic of established sessions (src-server-ip established).
TCAM size: typical data center top-of-rack (ToR) switches support a limited number of ACL
entries. A few thousand ACL entries is more than enough when the traffic control rules use IP
prefixes to identify groups of servers; when an automated tool builds traffic control rules based
on IP addresses of individual servers, the number of ACL entries tends to explode due to the
Cartesian product (http://en.wikipedia.org/wiki/Cartesian_product) of source and destination IP ranges.

Object groups available in some products are usually implemented as a Cartesian product to
speed up the packet lookup process.
Typical ACL table sizes: Nexus 5500 supports 1600 ingress and 2048 egress ACL entries
(http://www.cisco.com/c/en/us/products/collateral/switches/nexus-5000-series-switches/data_sheet_c78-618603.html);
Arista 7050 supports 4K ingress and 1K egress ACL entries
(http://www.aristanetworks.com/media/system/pdf/Datasheets/7050QX-32_Datasheet.pdf);
Arista 7150 supports up to 20K ACL entries
(http://www.aristanetworks.com/media/system/pdf/Datasheets/7150S_Datasheet.pdf).
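The ACL explosion is easy to quantify; a toy Python illustration of the Cartesian product effect (server counts invented):

from itertools import product

src_servers = ["10.1.1.%d" % i for i in range(1, 41)]   # 40 app servers
dst_servers = ["10.1.2.%d" % i for i in range(1, 11)]   # 10 DB servers
ports = [3306]

# Per-address rules: one ACL entry per (src, dst, port) combination ...
print(len(list(product(src_servers, dst_servers, ports))))   # 400 entries

# ... versus a single prefix-based entry covering the same pattern:
# permit tcp 10.1.1.0/24 10.1.2.0/24 eq 3306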
Multi-vendor environment: whenever the data center contains ToR switches from multiple
vendors, the provisioning system that installs traffic control rules must implement an abstraction
layer that maps traffic patterns into multiple vendor-specific configuration syntaxes.

Configuration mechanisms: most switch vendors don't offer APIs that would be readily
compatible with common server provisioning tools (example: Puppet).

Juniper offers a Junos Puppet client, but its current version cannot manage access
control lists. Arista provides Puppet installation instructions for EOS but does not offer
agent-side code that would provision ACLs.
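A sketch of the abstraction layer mentioned above, in Python: one abstract traffic pattern rendered into per-platform syntax (the IOS-like and iptables output strings are approximations for illustration, not verified vendor configurations):

def render_rule(rule, platform):
    # Map an abstract permit rule into platform-specific syntax
    if platform == "ios-like":
        return "permit %(proto)s %(src)s %(dst)s eq %(dport)s" % rule
    if platform == "linux-iptables":
        return ("iptables -A FORWARD -p %(proto)s -s %(src)s -d %(dst)s "
                "--dport %(dport)s -j ACCEPT" % rule)
    raise ValueError("no driver for " + platform)

rule = {"proto": "tcp", "src": "10.1.1.0/24", "dst": "10.1.2.0/24", "dport": 3306}
for platform in ("ios-like", "linux-iptables"):
    print(render_rule(rule, platform))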
Standard Linux traffic filters (implemented, for example, with iptables or flow entries in Open
vSwitch) provide a few gigabits of throughput due to the overhead of kernel-based packet
forwarding. Solutions that rely on the additional hardware capabilities of modern network interface
cards (NICs) and poll-based user-mode forwarding easily achieve the 10 Gbps throughput that satisfies
the Customer's requirements. These solutions include:
- PF_RING, an open-source kernel module that includes 10GE hardware packet filtering;
- Snabb Switch, an open-source Linux-based packet processing application that can be scripted with Lua to do custom packet filtering;
- Intel's Data Path Development Kit (DPDK) and DPDK-based Open vSwitch (OVS).

The solutions listed above are primarily frameworks, not ready-to-use traffic control
products. Integration and other professional services are available for most of them.
Inserting new layer-3 appliances between container layer-3 switches and backbone switches
requires readdressing (due to additional subnets being introduced between existing adjacent layer-3
devices) and routing protocol redesign. Additionally, one would need robust routing protocol support
on the x86-based appliances. It's thus much easier to insert the x86-based appliances in the
forwarding path as transparent layer-2 devices.
Transparent appliances inserted in the forwarding path would not change the existing network
addressing or routing protocol configurations. The existing layer-3 switches would continue to run
the routing protocol across the layer-2 devices (the traffic control rules would have to be adjusted to
permit routing protocol updates), simultaneously checking the end-to-end availability of the
forwarding path: a failure in the transparent traffic control device would also disrupt the routing
protocol adjacency and cause the layer-3 switches to shift traffic onto an alternate path, as shown in
Figure 10-6.
The transparent x86-based appliances used for traffic control purposes thus have to support the
following data path functionality:
- VLAN-based interfaces to support logical containers that share the same physical infrastructure;
- Transparent (bridge-like) traffic forwarding between two physical or VLAN interfaces. All non-IP traffic should be forwarded transparently to support non-IP protocols (ARP) and any deployment model (including scenarios where STP BPDUs have to be exchanged between L3 switches);
- Ingress and egress IPv4 and IPv6 packet filters on physical or VLAN-based interfaces.
Ideally, the appliance would intercept LLDP packets sent by the switches and generate LLDP hello
messages to indicate its presence in the forwarding path.
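A sketch of that transparent data path expressed as OpenFlow rules generated from Python (classic ovs-ofctl flow syntax; return-traffic and IPv6 rules are omitted for brevity, and the permitted pattern is an invented example):

# Flows for a transparent two-port appliance: forward non-IP traffic
# (ARP and similar) transparently, permit explicit IPv4 patterns, drop the rest
flows = [
    "priority=200,arp,in_port=1,actions=output:2",
    "priority=200,arp,in_port=2,actions=output:1",
    "priority=100,tcp,in_port=1,nw_dst=10.1.2.0/24,tp_dst=3306,actions=output:2",
    "priority=10,ip,actions=drop",            # default-deny for IPv4
    "priority=5,in_port=1,actions=output:2",  # everything else bridged
    "priority=5,in_port=2,actions=output:1",
]
for flow in flows:
    print("ovs-ofctl add-flow br0 '%s'" % flow)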
Both approaches satisfy the technical requirements (assuming the Customer uses DPDK-based OVS
to achieve 10 Gbps+ performance); the Customer should thus select the best one based on the
existing environment, familiarity with orchestration tools or OpenFlow controllers, and the amount of
additional development needed to integrate the selected data path solution with the desired
orchestration/management system.
As of February 2014 no OpenFlow controller (commercial or open-source) includes a readily
available application that would manage access control lists on independent transparent appliances;
the Customer would thus have to develop a custom application on top of one of the OpenFlow
controller development platforms (OpenDaylight, Floodlight, Cisco's ONE controller or HP's
OpenFlow controller). A Puppet agent managing PF_RING packet filters or OVS flows through CLI
commands is thus the easier option.
NEC's ProgrammableFlow could support the Customer's OVS deployment model, but would
require a heavily customized configuration (ProgrammableFlow is an end-to-end fabric-wide
OpenFlow solution) running on a non-mainstream platform (OVS is not one of the common
ProgrammableFlow switching elements).
CONCLUSIONS
The Customer SHOULD use packet filters (ACLs) and not stateful firewalls to isolate individual
containers from the shared L3 backbone (capitalized keywords are used as defined in RFC 2119,
http://tools.ietf.org/html/rfc2119).

Existing layer-3 switches MAY be used to implement the packet filters needed to isolate individual
containers, assuming the number of rules in the packet filters does not exceed the hardware
capabilities of the layer-3 switches (number of ingress and egress ACL entries).

The Customer SHOULD consider x86-based appliances that implement packet filters in
software or NIC hardware. The appliances SHOULD NOT use Linux-kernel-based packet forwarding
(user-mode poll-based forwarding results in significantly higher forwarding performance).

x86-based appliances SHOULD use the same configuration management tools the Customer
uses to manage other Linux servers. Alternatively, the Customer MAY consider an OpenFlow-based
solution composed of software (x86-based) OpenFlow switches and a cluster of OpenFlow
controllers.
11
IN THIS CHAPTER:
CLOUD INFRASTRUCTURE FAILURE DOMAINS
IMPACT OF SHARED MANAGEMENT OR ORCHESTRATION SYSTEMS
IMPACT OF LONG-DISTANCE VIRTUAL SUBNETS
IMPACT OF LAYER-2 CONNECTIVITY REQUIREMENTS
CONCLUSIONS
ACME Inc. is building a large fully redundant private infrastructure-as-a-service (IaaS) cloud using
standardized single-rack building blocks. Each rack will have:
- Two ToR switches providing intra-rack connectivity and access to the corporate backbone;
- Dozens of high-end servers, each server capable of running between 50 and 100 virtual machines;
- Storage elements: either a storage array, server-based storage nodes, or distributed storage (example: VMware VSAN, Nutanix, Ceph).
They plan to use several geographically dispersed data centers, with each data center having one or
more standard infrastructure racks.
Racks in smaller data centers (example: colocation) connect straight to the WAN backbone; racks in
data centers co-resident with a significant user community connect to WAN edge routers; and racks
in larger scale-out data centers connect to WAN edge routers or to an internal data center backbone.
Minimize failure domain size: a failure domain should not span more than a single infrastructure
rack, making each rack an independent availability zone.
ACME Inc. should therefore (in an ideal scenario) deploy an independent virtualization management
system (example: vCenter) and cloud orchestration system (example: vCloud Automation Center or
CloudStack) in each rack. Operational and licensing considerations might dictate a compromise
where multiple racks use a single virtualization or orchestration system.
Use a server high-availability solution that works independently of the cloud orchestration
system;
Periodically test the proper operation of the cloud orchestration system failover.
81
A cloud orchestration system instance might be implemented as a cluster of multiple hosts running
(potentially redundant) cloud orchestration system components.
82
RFC 2119, Key words for use in RFCs to Indicate Requirement Levels
http://tools.ietf.org/html/rfc2119
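As an illustration of the periodic failover testing recommended above, the following minimal sketch (Python) probes the status endpoint of a standby orchestration instance; the URL is a hypothetical placeholder, and a real test should also exercise API logins and a scripted test deployment.

# Minimal sketch: periodically probe the (hypothetical) status endpoint
# of a standby cloud orchestration instance to catch broken failover
# before a disaster does.
import time
from urllib.request import urlopen

ENDPOINT = "https://orchestrator-standby.example.com/api/status"  # hypothetical URL
INTERVAL = 300  # seconds between probes

def standby_is_healthy(url):
    try:
        # any HTTP error or timeout counts as a failed probe
        return urlopen(url, timeout=10).getcode() == 200
    except Exception:
        return False

while True:
    if not standby_is_healthy(ENDPOINT):
        print("ALERT: standby orchestration instance is not responding")
    time.sleep(INTERVAL)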
Recommendation: ACME Inc. SHOULD NOT use a single critical management, orchestration, or
service instance across multiple data centers: a data center failure or Data Center Interconnect
(DCI) link failure would render one or more dependent data centers inoperable.
Summary: Long-distance virtual subnets in ACME Inc. cloud infrastructure MUST use overlay virtual
networks.
Recommendation: ACME Inc. SHOULD NOT use storage replication products that require end-to-end layer-2 connectivity.
ACME Inc. MAY provide long-distance VLAN connectivity with Ethernet-over-MPLS (EoMPLS, VPLS,
EVPN) or a hardware-based overlay virtual networking solution85. VXLAN Tunnel Endpoint (VTEP)
functionality, available in data center switches from multiple vendors (Arista 7150, Cisco Nexus
9300), can also be used to extend a single VLAN across the IP backbone, resulting in limited coupling
across availability zones.
Recommendation: VXLAN traffic rate-limiting and broadcast storm control MUST be used when
VXLAN (or any similar technology) is used to extend a VLAN across multiple availability zones, to
limit the damage a broadcast storm in one availability zone can cause in other availability
zones.
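The following minimal sketch (Python, Cisco NX-OS-style syntax assumed) renders the storm-control commands one might apply to the VXLAN uplink interfaces; the interface names, the percentage, and the exact command syntax should be verified against the target platform.

# Minimal sketch: render storm-control configuration for VXLAN uplinks
# (NX-OS-style syntax assumed; interfaces and thresholds are hypothetical).
UPLINKS = ["Ethernet1/49", "Ethernet1/50"]  # hypothetical uplink ports
LEVEL_PCT = 1.0  # cap broadcast/multicast at 1% of interface bandwidth

def storm_control(intf):
    return [
        "interface %s" % intf,
        " storm-control broadcast level %.2f" % LEVEL_PCT,
        " storm-control multicast level %.2f" % LEVEL_PCT,
    ]

for intf in UPLINKS:
    print("\n".join(storm_control(intf)))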
also heavily used for maintenance purposes: for example, you might have to evacuate a rack of
servers before shutting it down for maintenance or upgrade.
Cold VM mobility is used in almost every high-availability and disaster recovery solution. VMware
High Availability (and similar solutions from other hypervisor vendors) restarts a VM on another
cluster host after a server failure. VMware's SRM does something similar, but usually in a different
data center. Cold VM mobility is also the only viable technology for VM migration between multiple
cloud orchestration systems (for example, when migrating a VM from a private cloud into a public
cloud).
HOT VM MOBILITY
VMware's vMotion is probably the best-known example of hot VM mobility technology. vMotion
copies the memory pages of a running VM to another hypervisor, repeating the process for pages
that were modified while the memory was being transferred. After most of the VM memory has been
successfully transferred, vMotion freezes the VM on the source hypervisor, moves its remaining
state to the target hypervisor, and restarts it there.
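The following minimal sketch (Python) simulates the iterative pre-copy algorithm just described: send all pages, then keep re-sending the pages dirtied during the previous pass until the remaining set is small enough to freeze the VM. Page counts and the dirty rate are hypothetical.

# Minimal sketch of iterative pre-copy memory migration. The VM keeps
# running (and dirtying memory) while each pass is on the wire; the
# dirty set shrinks as passes get shorter, until stop-and-copy is cheap.
import random

TOTAL_PAGES = 100000
PAGES_DIRTIED_PER_PAGE_SENT = 0.02  # VM dirties ~2 pages per 100 sent
SWITCHOVER_THRESHOLD = 500          # freeze the VM at this残 set size

to_copy = set(range(TOTAL_PAGES))
round_no = 0
while len(to_copy) > SWITCHOVER_THRESHOLD:
    round_no += 1
    sent = len(to_copy)
    # pages the running VM dirtied while this pass was being transferred
    dirtied = int(sent * PAGES_DIRTIED_PER_PAGE_SENT)
    to_copy = set(random.sample(range(TOTAL_PAGES), dirtied))
    print("pass %d: sent %d pages, %d dirtied during the pass" % (round_no, sent, dirtied))

print("freezing VM: final stop-and-copy of %d pages plus CPU/device state" % len(to_copy))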
A hot VM move must not disrupt existing network connections and must thus preserve the
following network-level state:
The VM must keep the same MAC address (otherwise we would have to rely on hypervisor-generated
gratuitous ARP to update the ARP caches of other nodes in the same subnet; a sketch of such a
gratuitous ARP follows below);
After the move, the VM must be able to reach the first-hop router and all other nodes in the same
subnet using their existing MAC addresses (a hot VM move is invisible to the VM, so the VM IP
stack doesn't know it should purge its ARP cache).
The only mechanisms we can use today to meet all these requirements are:
Hypervisor switches with layer-3 capabilities, including Hyper-V 3.0 Network Virtualization and
Juniper Contrail.
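As promised above, here is a minimal sketch (Python, Linux AF_PACKET raw socket, requires root) of the gratuitous ARP a hypervisor sends after a hot VM move: a broadcast ARP reply announcing the VM's unchanged MAC/IP pair so that neighbors refresh their ARP caches. The interface, MAC, and IP values are hypothetical.

# Minimal sketch: craft and send a gratuitous ARP on behalf of a moved VM.
import socket
import struct

def send_garp(ifname, vm_mac, vm_ip):
    # Ethernet header: broadcast destination, VM source, ARP EtherType
    bcast = b"\xff" * 6
    src = bytes.fromhex(vm_mac.replace(":", ""))
    eth = bcast + src + struct.pack("!H", 0x0806)
    # ARP body: hw type 1 (Ethernet), proto 0x0800 (IPv4), opcode 2 (reply);
    # sender and target IP are both the VM's own address (gratuitous ARP)
    ip = socket.inet_aton(vm_ip)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2) + src + ip + bcast + ip
    s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
    s.bind((ifname, 0))
    s.send(eth + arp)
    s.close()

# hypothetical values: hypervisor uplink, VM MAC, VM IP
send_garp("eth0", "52:54:00:12:34:56", "10.1.1.25")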
Recommendation: ACME Inc. should keep the hot VM mobility domain as small as possible.
COLD VM MOVE
In a cold VM move, a VM is shut down and restarted on another hypervisor. The MAC address of the
VM could change during the move, as could its IP address, provided the application running in the
VM uses DNS to advertise its availability.
Recommendation: The new cloud infrastructure built by ACME Inc. SHOULD NOT be used by
poorly written applications that are overly reliant on static IP addresses86.
VMs that rely on static IP addressing might also have a manually configured first-hop router IP
address. Networking and virtualization vendors offer solutions that reduce the impact of that bad
practice (first-hop localization, LISP) while significantly increasing overall network complexity.
Recommendation: Workloads deployed in ACME Inc. cloud infrastructure SHOULD NOT use static
IP configuration. VM IP addresses and other related parameters (first-hop router, DNS server
address) MUST be configured via DHCP or via cloud orchestration tools.
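To illustrate the DHCP-based approach, the following minimal sketch (Python) renders an ISC dhcpd scope for a tenant subnet so that VM addresses, the first-hop router, and the DNS server all come from DHCP; every subnet value shown is hypothetical.

# Minimal sketch: render an ISC dhcpd scope handing out VM IP addresses,
# first-hop router and DNS server. All values are hypothetical.
SCOPE = {
    "subnet": "10.1.1.0", "netmask": "255.255.255.0",
    "range": ("10.1.1.100", "10.1.1.200"),
    "router": "10.1.1.1",   # first-hop router advertised to the VMs
    "dns": "10.1.2.53",     # DNS server advertised to the VMs
}

def render_scope(s):
    return "\n".join([
        "subnet %s netmask %s {" % (s["subnet"], s["netmask"]),
        "  range %s %s;" % s["range"],
        "  option routers %s;" % s["router"],
        "  option domain-name-servers %s;" % s["dns"],
        "}",
    ])

print(render_scope(SCOPE))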
a workload orchestration or disaster recovery management tool (example: VMware Site Recovery
Manager, SRM)87.
This approach works well even for workloads that require static IP addressing within the
application stack: internal subnets (port groups) implemented with VLANs or VXLAN segments are
recreated within the recovery data center prior to workload deployment.
Another popular scenario that requires hot VM mobility is disaster avoidance: live workload
migration prior to a predicted disaster.
Disaster avoidance between data centers is usually impractical due to bandwidth constraints88. While
it might be used between availability zones within a single data center, that use case is best avoided
due to the additional complexity and coupling it introduces between availability zones.
Increased latency between application components and traffic trombones89,90 are additional
challenges one must consider when migrating individual components of an application stack.
It's usually simpler to move the whole application stack as a single unit.
Recommendation: ACME Inc. MUST implement workload migration or disaster recovery with
dedicated workload orchestration tools that can move whole application stacks between cloud
instances.
CONCLUSIONS
Infrastructure building blocks:
ACME Inc. will build its cloud infrastructure with standard rack-size compute/storage/network
elements;
Each infrastructure rack will be an independent data- and control-plane failure domain
(availability zone);
Each infrastructure rack must have a totally independent infrastructure and SHOULD NOT rely on
critical services, management, or orchestration systems running in other racks.
Network connectivity:
L3 connectivity (IP) will be used between racks and ACME Inc. backbone;
Virtual subnets spanning multiple racks will be implemented with overlay virtual networks
implemented within hypervisor hosts;
VLANs spanning multiple racks will be implemented with VXLAN-based transport across ACME
Inc. backbone;
Each rack will implement a high-availability virtualization cluster that operates even when the
cloud orchestration system fails;
Hot VM mobility MAY be used across racks in the same data center, assuming ACME Inc.
decides to use a single cloud orchestration system instance per data center;
Workload mobility between data centers will be implemented with a dedicated workload
orchestration or disaster recovery tool.
A single management or orchestration system instance will control a single rack or at most one
data center to reduce the size of the management-plane failure domain;
Management- and orchestration systems controlling multiple availability zones will have
automated failover/recovery procedures that will be thoroughly tested at regular intervals;
ACME Inc. SHOULD NOT use a single critical management, orchestration, or service instance
across multiple data centers.
12
IN THIS CHAPTER:
EXISTING APPLICATION WORKLOADS
INFRASTRUCTURE CHALLENGES
INCREASE WORKLOAD MOBILITY WITH VIRTUAL APPLIANCES
BUILDING A NEXT GENERATION INFRASTRUCTURE
BENEFITS OF NEXT-GENERATION TECHNOLOGIES
TECHNICAL DRAWBACKS OF NEXT-GENERATION TECHNOLOGIES
ORGANIZATIONAL DRAWBACKS
PHASED ONBOARDING
ORCHESTRATION CHALLENGES
USING IP ROUTING PROTOCOLS IN WORKLOAD MIGRATION
CONCLUSIONS
ACME Inc. is building a private cloud and a disaster recovery site that will eventually serve as a
second active data center. They want to simplify disaster recovery procedures and to be able to
seamlessly move workloads between the two sites once the second site becomes an active data
center.
ACME's cloud infrastructure design team is trying to find a solution that would allow them to
move quiescent workloads between sites with a minimal amount of manual interaction. They
considered VMware's SRM but found it lacking in the area of network services automation.
Figure 12-2: Typical workload architecture with network services embedded in the application stack
Some applications are truly self-contained, but most of them rely on external services, ranging
from DNS and Active Directory authentication to central database access (a typical central database
example is shown in Figure 12-3).
Load balancing and firewalling between application tiers is currently implemented with a central pair
of load balancers and firewalls, with all application-to-client and server-to-server traffic passing
through the physical appliances (the non-redundant setup is shown in Figure 12-4).
Figure 12-4: Application tiers are connected through central physical appliances
INFRASTRUCTURE CHALLENGES
The current data center infrastructure supporting badly written enterprise applications generated a
number of problems92 that have to be avoided in the upcoming private cloud design:
Physical appliances are a significant chokepoint and would have to be replaced with a larger
model or an alternate solution in the future private cloud;
Current physical appliances support a limited number of virtual contexts. The existing workloads
are thus deployed on shared VLANs (example: web servers of all applications reside in the same
VLAN), significantly reducing data center security: an intruder could easily move laterally
between application stacks after breaking into the weakest server93.
92
Sooner or later someone will pay for the complexity of the kludges you use
http://blog.ipspace.net/2013/09/sooner-or-later-someone-will-pay-for.html
This ease of deployment makes it perfectly feasible to create a copy of the virtual appliance for every
application stack, removing the need for complex firewall/load balancing rules and virtual contexts95.
Appliance mobility. A virtual appliance is treated like any other virtual machine by server
virtualization and/or cloud orchestration tools. It's as easy (or as hard96) to move virtual appliances
as it is to move the associated application workload between availability zones, data centers, or
even private and public clouds.
Transport network independence. Physical appliances have a limited number of network
interfaces and typically use VLANs to create the additional virtual interfaces needed to support
multi-tenant contexts.
Most physical appliances don't support any virtual networking technology other than VLANs; the
exception is F5 BIG-IP, which supports IP multicast-based VXLAN.97
Virtual appliances run on top of a hypervisor virtual switch and connect to whatever virtual
networking technology is offered by the underlying hypervisor, using one or more virtual Ethernet
adapters, as shown in Figure 12-5.
The number of virtual Ethernet interfaces supported by a virtual appliance is often dictated
by hypervisor limitations. For example, vSphere supports up to 10 virtual interfaces per
VM98; KVM has much higher limits.
Configuration management and mobility. Virtual appliances are treated like any other virtual
server. Their configuration is stored on their virtual disk, and when a disaster recovery solution
replicates virtual disk data to an alternate location, the appliance configuration automatically
becomes available for immediate use at that location: all you need to do after a primary data
center failure is restart the application workloads and associated virtual appliances at the alternate
location99.
Reduced performance. Typical virtual appliances can handle a few Gbps of L4-7 traffic and a few
thousand SSL transactions per second100;
Licensing challenges. Some vendors try to license virtual appliances using the same per-box
model they used in the physical world.
(see Replacing the Central Firewall and High-Speed Multi Tenant Isolation chapters for more
details);
VM NIC firewalls will increase packet filtering performance: the central firewalls will no
longer be a chokepoint;
Virtual appliances will reduce ACME's dependence on hardware appliances and increase
overall network services (particularly load balancing) performance with a scale-out appliance
architecture;
Overlay virtual networks will ease the deployment of the large number of virtual network
segments required to containerize the application workloads.
Most VM NIC firewalls don't offer the same level of security as their more traditional counterparts:
most of them offer stateful packet filtering capabilities similar to reflexive ACLs101.
In-kernel VM NIC firewalls rarely offer application-level gateways (ALG) or layer-7 payload
inspection (deep packet inspection, DPI).
Virtual appliances (load balancers and VM-based firewalls) rarely offer more than a few Gbps of
throughput; high-bandwidth applications might still have to use traditional physical appliances.
Overlay virtual networks need software or hardware gateways to connect to the physical
subnets. Self-contained applications that use a network services appliance to connect to the
outside world could use virtual appliances as overlay-to-physical gateways; applications that rely
on information provided by physical servers102 might experience performance problems that
would have to be solved with dedicated gateways.
ORGANIZATIONAL DRAWBACKS
The technical drawbacks identified by the ACME architects are insignificant compared to the
organizational and process changes that the new technologies require103,104:
The move from traditional firewalls to VM NIC firewalls requires a complete re-architecture of the
application's network security, including potential adjustments in security policies due to the lack
of deep packet inspection between application tiers105;
Virtual appliances increase workload agility only when they're moved with the workload. The
existing centralized appliance architecture has to be replaced with a per-application appliance
architecture106;
An increased number of virtual appliances will require a different approach to appliance
deployment, configuration, and monitoring107.
PHASED ONBOARDING
Faced with all the potential drawbacks, ACME's IT management team decided to implement a
phased onboarding of application workloads.
New applications will be developed on the private cloud infrastructure and will include the new
technologies and concepts in the application design, development, testing, and deployment phases;
Moving an existing application stack to the new private cloud will always include security and
network services reengineering:
Load balancing rules (or contexts) from existing physical appliances will be migrated to
per-application virtual appliances;
Intra-application firewall rules will be replaced by equivalent rules implemented with VM NIC
firewalls wherever possible;
ORCHESTRATION CHALLENGES
Virtual appliances and overlay virtual networks enable simplified workload mobility but do not solve
the problem by themselves. Moving a complex application workload between instances of cloud
orchestration systems (sometimes even across availability zones) requires numerous orchestration
steps before it's safe to restart the application workload (a workflow sketch follows the list):
Virtual machine definitions and virtual disks have to be imported into the target environment
(assuming the data is already present on-site due to storage replication or backup procedures);
Internal virtual networks (port groups) used by the application stack have to be recreated in the
target environment;
Outside interface(s) of virtual appliances have to be connected to the external networks in the
target environment;
Configuration of the virtual appliance outside interfaces might have to be adjusted to reflect the
different IP addressing scheme used in the target environment; IP readdressing might trigger
additional changes in DNS108;
Connectivity to the outside services (databases, AD, backup servers) used by the application stack
has to be adjusted. Well-behaved applications109 would use DNS and adapt to the changed
environment automatically110; poorly written applications might require additional configuration
changes (example: NAT) in the virtual appliance(s) connecting them to the external networks.
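The workflow sketch below (Python) strings the above steps together. Every function and method name is a hypothetical placeholder for a call into the cloud orchestration system's API; none of them exist as-is in any particular product.

# Workflow sketch of the orchestration steps listed above.
# All identifiers are hypothetical placeholders, not a real API.
def migrate_workload(app, source, target):
    # 1. import VM definitions and virtual disks (data already replicated)
    vms = target.import_vms(source.export_vm_definitions(app))

    # 2. recreate the application's internal virtual networks (port groups)
    for net in source.internal_networks(app):
        target.create_network(net)

    # 3. attach virtual appliance outside interfaces to external networks
    appliance = target.find_appliance(app)
    target.connect_outside_interface(appliance, target.external_network())

    # 4. adjust outside addressing; readdressing may also require DNS updates
    new_ip = target.allocate_external_ip()
    appliance.set_outside_ip(new_ip)
    target.update_dns(app.fqdn, new_ip)

    # 5. fix up connectivity to external services (DB, AD, backup);
    #    poorly written applications might need NAT rules in the appliance
    appliance.update_external_service_routes(source.external_services(app))

    # only now is it safe to power the workload back on
    for vm in vms:
        vm.power_on()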
Some orchestration systems (example: vCloud Director) allow users to create application containers
that contain enough information to recreate virtual machines and virtual networks in a different
cloud instance, but even those environments usually require some custom code to connect migrated
workloads to external services.
Cloud architects sometimes decide to bypass the limitations of cloud orchestration systems
(example: lack of IP readdressing capabilities) by deploying stretched layer-2 subnets,
effectively turning multiple cloud instances into a single failure domain111. See the Scale-Out
Private Cloud Infrastructure chapter for more details.
Some virtual appliances already support routing protocols (example: VMware NSX Edge
Service Router, Cisco ASAv). It's trivial to deploy routing functionality on virtual appliances
implemented as Linux services. Finally, one could always run a virtual router in a dedicated
virtual machine (example: Cisco CSR 1000v, Vyatta virtual router).
Virtual routers could establish routing protocol adjacency (preferably using BGP112) with first-hop
layer-3 switches in the physical cloud infrastructure (ToR or core switches depending on the data
center design).
One could use BGP peer templates on the physical switches, allowing them to accept BGP
sessions from a range of directly connected IP addresses (the outside IP addresses assigned to
virtual routers via DHCP), and use MD5 authentication to provide some baseline security.
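The following minimal sketch (Python, Cisco IOS-style syntax assumed) renders such a configuration using the dynamic BGP neighbors feature ("bgp listen range") with a peer group and MD5 authentication; the AS numbers, subnet, and password are hypothetical and should come from the orchestration system in practice.

# Minimal sketch: render IOS-style BGP configuration accepting dynamic
# sessions from virtual routers on the directly connected subnet.
# All values are hypothetical.
SWITCH_AS = 65000
VROUTER_AS = 65001
VROUTER_SUBNET = "10.255.1.0/24"  # DHCP pool handing addresses to virtual routers
MD5_SECRET = "s3cret"             # hypothetical; distribute via orchestration

config = "\n".join([
    "router bgp %d" % SWITCH_AS,
    " bgp listen range %s peer-group VROUTERS" % VROUTER_SUBNET,
    " neighbor VROUTERS peer-group",
    " neighbor VROUTERS remote-as %d" % VROUTER_AS,
    " neighbor VROUTERS password %s" % MD5_SECRET,
    " neighbor VROUTERS route-map VROUTER-IN in",  # accept only expected prefixes
])
print(config)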
A central BGP route server would be an even better solution. The route server could be
dynamically configured by the cloud orchestration system to perform virtual router authentication
and route filtering. Finally, one could assign the same loopback IP address to the route servers in all
data centers (or availability zones), making it easier for the edge virtual router to find its BGP
neighbor.
CONCLUSIONS
ACME Inc. can increase application workload mobility and overall network services performance
with judicious use of modern technologies like virtual appliances, overlay virtual networks, and VM
NIC firewalls, but won't realize most of the benefits of these technologies unless they also introduce
significant changes in their application development and deployment processes113.
Workload mobility between different availability zones of the same cloud orchestration system is
easy to achieve, as most cloud orchestration systems automatically create all underlying objects
(example: virtual networks) across availability zones as required.
It's also possible to solve the orchestration challenge of a disaster recovery solution by restarting
the cloud orchestration system at a backup location (which would result in automatic recreation of
all managed objects, including virtual machines and virtual networks).
Workload mobility across multiple cloud orchestration instances, or between private and public
clouds, requires extensive orchestration support, either available in the cloud orchestration system,
or implemented with an add-on orchestration tool.