BGP in the Data Center: Why you need to deploy it now!
Overlay networks in the data center are here, and they are here to stay. It's now easier than ever to programmatically provision new networks with a click of the mouse. No need to worry about VLAN IDs, integrated bridging and routing, MC-LAG, and spanning tree. Overlay networks use data plane encapsulations such as VXLAN or GRE to transport both Layer 2 and Layer 3 traffic between virtual machines and physical servers. One of the key requirements in an overlay architecture in the data center is to have a rock solid IP Fabric: simply Layer 3 connectivity between every host in the network that participates in the overlay network.
Does that sound familiar? Maybe a little bit like MPLS? You're right. An MPLS architecture requires a stable Layer 3 transport in order to provide IP services across Layer 2 and Layer 3 VPNs. Although similar, there are a few key control plane differences between an MPLS architecture and a data center overlay architecture. Let's walk through them.
MPLS has a hierarchy of control plane protocols that make up the network. It's common to see IS-IS or OSPF provide reachability between all nodes and provide a traffic engineering database. The next step is that each provider edge runs MP-BGP; it's the 18-wheeler in networking. MP-BGP carries all sorts of data, from MAC addresses to identifying which VPNs should be installed into each provider edge. Finally there are LDP and RSVP, which are responsible for label distribution and traffic engineering across the network.
As of today the control plane in a data center overlay network is fairly simple. The first option is to simply not use a control plane protocol. In this scenario we can use multicast to flood control traffic to all hosts in the network. The next option is to use either OVSDB or DMI as the control plane protocol. These options prevent unnecessary flooding throughout the network and allow for a more efficient utilization of resources.
The biggest question is when you build an IP Fabric, what control plane protocol do you use? The options are the usual suspects: OSPF and IS-IS. But what about BGP? But isn't BGP a WAN control plane protocol? Not necessarily.
When creating an IP Fabric there are a few services that we need: prefix distribution, prefix filtering, traffic engineering, traffic tagging, and multi-vendor stability. Perhaps the most surprising requirements are traffic engineering and multi-vendor stability. When creating a large IP Fabric, it's desirable to be able to shift traffic across different links and perhaps steer traffic around a particular switch that's in maintenance mode. Creating an IP Fabric is an incremental process; not many people build out the entire network to the maximum scale from day one. Depending on politics, budgets, and feature sets companies may source switches from different vendors over a long period of time. It's critical that the IP Fabric architecture not change over time and the protocols used are stable across a set of different vendors.
Let's map the requirements of an IP Fabric to the options in the control plane: OSPF, IS-IS, and BGP.
On multi-vendor stability, BGP delivers even more so than OSPF and IS-IS (think about the Internet).
What is interesting is that BGP pulls ahead as the best protocol choice in creating an IP Fabric. It excels in prefix filtering, traffic engineering, and traffic tagging. BGP is able to match on any attribute or prefix and prune prefixes both outbound and inbound between switches. Traffic engineering is accomplished through the standard BGP attributes of Local Preference and MED, AS path prepending (AS padding), and other techniques. BGP has extensive traffic tagging abilities with extended communities; each prefix can be associated with multiple communities to convey any sort of technical or business information. The best use case in the world for multi-vendor stability is the Internet; the backbone of the Internet is BGP.
BGP makes the most sense in the data center when building out an IP Fabric. Maybe it isn't so crazy after all. The benefits include prefix filtering, traffic engineering, tagging, and stability across a set of various vendors.
The biggest decision you need to make when designing an IP Fabric with BGP is whether to use eBGP or iBGP. Again, each option has its benefits and drawbacks. One of the key factors is ECMP. It's critical that each leaf support full ECMP going northbound to each spine. The best way to scale a 3-stage Clos network is to increase the number of spines in order to support additional leaves. The addition of each spine further increases the ECMP requirements of each leaf. The second factor is how many peering sessions you want to manage in the IP Fabric.
ECMP: iBGP requires BGP AddPath; eBGP requires Multi-AS Pathing
Peering: iBGP requires a Route Reflector to mitigate full-mesh; eBGP needs a BGP session only between each spine and leaf
Let's take a closer look at BGP peering. In an iBGP network, each switch is required to have a BGP session to every other switch in the network. This means that every leaf in the network must peer with every other leaf, in addition to each spine. This gets pretty wasteful very fast. The answer is to use a BGP route reflector in the spine of the network. This allows each leaf to become a route reflector client and only have to peer with each spine / BGP route reflector. The downside of a BGP route reflector is that it only reflects the best route. What if there are multiple? Tough luck, you only get one. The answer to support full ECMP with BGP route reflectors is to use another BGP feature called AddPath; this allows each client to receive multiple paths instead of only the best.
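To put some numbers on "wasteful", here is a minimal Python sketch that counts BGP sessions for an iBGP full mesh versus a design where the spines act as route reflectors. The fabric size (four spines, 16 leaves) is just an example, the function names are made up for illustration, and the route reflector count deliberately leaves out any sessions between the reflectors themselves.

```python
def full_mesh_sessions(switches: int) -> int:
    """iBGP full mesh: every switch peers with every other switch."""
    return switches * (switches - 1) // 2


def route_reflector_sessions(spines: int, leaves: int) -> int:
    """Each leaf (client) peers only with each spine (route reflector).

    Sessions between the reflectors themselves are not counted here.
    """
    return spines * leaves


spines, leaves = 4, 16  # example fabric size
print(full_mesh_sessions(spines + leaves))       # 190
print(route_reflector_sessions(spines, leaves))  # 64
```

The full mesh grows with the square of the switch count, while the route reflector design grows only with spines times leaves.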
From the point of view of a 3-stage Clos or spine and leaf network, eBGP makes the most sense. It supports traffic engineering and doesn't require you to configure and maintain a route reflector and AddPath. However this decision becomes a bit more involved in a 5-stage Clos design, but that's a subject for another blog post.
Now that we have decided to use eBGP in our 3-stage Clos, what other things do we need before we can create a final blueprint of what the network will look like? Let's walk through them one by one:
BGP autonomous system number assignments
IP address base prefix
Subnet masks to be used between point-to-point interfaces
IP address assignments
The first step is to assign a BGP ASN per switch; this is a 1:1 ratio of ASNs to switches. Now each spine is able to peer with each leaf via eBGP. The next step is to consider what IP address base prefix to use across the entire network. The answer isn't so simple and it depends on the number of switches, number of links, and the network mask used on the point-to-point links. Let's walk through the options.
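The 1:1 mapping of ASNs to switches is easy to generate with a script. The sketch below is one possible approach, not a prescribed numbering plan: it assumes the 16-bit private-use ASN range (64512 to 65534, per RFC 6996), and the switch names (spine1, leaf1, ...) and the function name are hypothetical.

```python
from __future__ import annotations

PRIVATE_ASN_START = 64512  # first 16-bit private-use ASN (RFC 6996)
PRIVATE_ASN_END = 65534    # last 16-bit private-use ASN


def assign_asns(spines: int, leaves: int) -> dict[str, int]:
    """Give every switch its own ASN (1:1), spines first, then leaves."""
    names = [f"spine{i}" for i in range(1, spines + 1)]
    names += [f"leaf{i}" for i in range(1, leaves + 1)]
    if len(names) > PRIVATE_ASN_END - PRIVATE_ASN_START + 1:
        raise ValueError("not enough 16-bit private ASNs for this many switches")
    return {name: PRIVATE_ASN_START + offset for offset, name in enumerate(names)}


asns = assign_asns(spines=4, leaves=16)
print(asns["spine1"], asns["leaf1"], asns["leaf16"])  # 64512 64516 64531
```

Because every switch has a different ASN, every spine-to-leaf session is automatically an eBGP session.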
Let's assume we have a simple 3-stage Clos network with four spines and 16 leaves; this creates a total of 20 switches. Assuming that each leaf has a link to every spine, this creates a total of 64 links: 16 leaves times four links (one for each spine) equals 64 point-to-point links. The next step is to think about what network mask to use on each point-to-point link. The most common options are 30-bit and 31-bit. A 30-bit network mask has four IP addresses per subnet, of which only two are usable. The 31-bit mask has two IP addresses per subnet. With the assumption that each point-to-point link only requires two IP addresses (one per switch), we can conclude that a 31-bit network mask is the most efficient use of IP space. Juniper switches support both the 30-bit and 31-bit network mask, but some other vendors may only support a 30-bit mask. The result is that using a 30-bit mask requires twice the IP space when compared to a 31-bit mask. Generally an IP base prefix of 192.168.0.0/16 is enough in most cases, unless you're building a very large IP Fabric.
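The arithmetic above is easy to check with Python's standard ipaddress module. This is a rough sketch under the same assumptions as the example (four spines, 16 leaves, 192.168.0.0/16 as the base prefix); carving the point-to-point subnets from the very start of the base prefix is an arbitrary choice for illustration, and the function names are made up.

```python
import ipaddress
from itertools import islice

SPINES, LEAVES = 4, 16
BASE_PREFIX = ipaddress.ip_network("192.168.0.0/16")


def link_count(spines: int, leaves: int) -> int:
    """One point-to-point link from every leaf to every spine."""
    return spines * leaves


def addresses_needed(links: int, prefixlen: int) -> int:
    """Total IP addresses consumed when every link gets one subnet."""
    return links * 2 ** (32 - prefixlen)


def carve_p2p(base, links, prefixlen=31):
    """Take the first `links` subnets of the given length from the base prefix."""
    return list(islice(base.subnets(new_prefix=prefixlen), links))


links = link_count(SPINES, LEAVES)
subnets = carve_p2p(BASE_PREFIX, links)
print(links)                                                     # 64
print(addresses_needed(links, 31), addresses_needed(links, 30))  # 128 256
print(subnets[0], subnets[-1])       # 192.168.0.0/31 192.168.0.126/31
print(subnets[0][0], subnets[0][1])  # 192.168.0.0 192.168.0.1
```

With 31-bit masks the 64 links fit into 128 addresses; the same links with 30-bit masks take 256.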
The last task is to assign a 32-bit loopback address to each switch in the network. This allows us to quickly test routing connectivity through ping, traceroute, and other tools. BGP must be configured to advertise the loopback address to all of its peers. If a switch is able to communicate with another switch using only loopback addresses, we know that BGP is configured correctly and has reachability.
At this point you have enough information to build the transport mechanism of the IP Fabric, but the last component that's missing is the Layer 3 gateway services that the hosts and other end-points will use. Simply put, each server needs a default gateway address. The good news is that we can limit the gateway services to each leaf. There's no need to span the same Layer 3 gateway address across a set of leaves. This means that Layer 2 is limited to each leaf as well, thus removing any requirements for MC-LAG, STP, or any other Layer 2 protocols to span a bridge domain across a set of switches.
The easiest way to enable Layer 3 gateway services is to create a 26-bit IRB interface per leaf. A 26-bit prefix has 64 addresses; the network and broadcast addresses are reserved, which leaves 62 usable addresses, and one of those goes to the gateway itself. That leaves a maximum of 61 hosts per switch; the assumption is that each switch has 48 ports, so we have 13 IP addresses left over per leaf. Not bad.
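Here is a small sketch of that per-leaf gateway plan using Python's ipaddress module. The block the IRB subnets are carved from (192.168.16.0/20) and the choice of the first usable address as the gateway are assumptions for illustration only. The sketch counts the gateway as one of the 62 usable addresses, so 61 remain for hosts.

```python
import ipaddress


def irb_plan(block: str, leaves: int, ports_per_leaf: int = 48):
    """Assign one /26 per leaf and report gateway, usable and spare addresses."""
    subnets = ipaddress.ip_network(block).subnets(new_prefix=26)
    plan = []
    for leaf, subnet in zip(range(1, leaves + 1), subnets):
        usable = list(subnet.hosts())          # excludes network and broadcast
        gateway = usable[0]                    # assumption: first usable address
        host_addresses = len(usable) - 1       # everything except the gateway
        spare = host_addresses - ports_per_leaf
        plan.append((f"leaf{leaf}", subnet, gateway, len(usable), spare))
    return plan


for name, subnet, gateway, usable, spare in irb_plan("192.168.16.0/20", 3):
    print(name, subnet, gateway, usable, spare)
# leaf1 192.168.16.0/26 192.168.16.1 62 13
# leaf2 192.168.16.64/26 192.168.16.65 62 13
# leaf3 192.168.16.128/26 192.168.16.129 62 13
```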
Now that each switch has a unique 26-bit IRB interface, the next step is to advertise these prefixes to the rest of the network. Just like with the loopback addresses, each IRB prefix must also be advertised throughout the entire network. This ensures that each server in the IP Fabric has full Layer 3 reachability to every other host. The BGP export policy must be configured to advertise the IRB prefix to each BGP neighbor.
A good step to ensure the stability of the IP Fabric is to configure a set of BGP import policies. The policy should only accept loopback addresses and IRB prefixes. There's really no need to accept any other prefixes as they aren't critical for the operation of the IP Fabric. This keeps the table sizes small and allows for faster convergence and updates.
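The logic of such an import policy can be sketched in a few lines of Python. This is only an illustration of the matching rules, not Junos policy syntax; the loopback and IRB blocks below are hypothetical values chosen for the example, and the function name is made up.

```python
import ipaddress

# Hypothetical address plan: adjust to whatever blocks your fabric uses.
LOOPBACK_BLOCK = ipaddress.ip_network("192.168.255.0/24")
IRB_BLOCK = ipaddress.ip_network("192.168.16.0/20")


def accept_route(prefix: str) -> bool:
    """Accept only /32 loopbacks and /26 IRB prefixes; reject everything else."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen == 32 and net.subnet_of(LOOPBACK_BLOCK):
        return True
    if net.prefixlen == 26 and net.subnet_of(IRB_BLOCK):
        return True
    return False


for candidate in ["192.168.255.1/32", "192.168.16.64/26", "192.168.0.0/31", "10.0.0.0/8"]:
    print(candidate, accept_route(candidate))
# 192.168.255.1/32 True
# 192.168.16.64/26 True
# 192.168.0.0/31 False
# 10.0.0.0/8 False
```

Note that the point-to-point link prefixes are rejected too; the fabric only needs loopbacks and IRB prefixes in its tables.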
One of the least talked about requirements of an IP Fabric is high availability and convergence. By itself BGP only notices a failed neighbor when its hold timer expires; RFC 4271 suggests a 90 second hold time and allows nothing shorter than 3 seconds, so traffic would drop during this window. To speed up convergence during a failure, a faster mechanism is required. A really good tool is Bidirectional Forwarding Detection (BFD). It's a protocol that was specifically designed to be light-weight and detect forwarding errors in the network. Depending on the hardware and software support BFD can be configured as low as 10ms or 20ms. Data center switches typically don't have hardware support for such fast intervals and a more reasonable timer is around 100ms; this still achieves sub-second convergence during a failure.
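A quick back-of-the-envelope check shows why a 100ms timer still lands under a second. BFD declares a neighbor down after a number of consecutive missed packets, so the detection time is roughly the interval times a detect multiplier. The multiplier of 3 below is an assumption (a common default), not something fixed by the text above.

```python
def detection_time_ms(interval_ms: int, multiplier: int = 3) -> int:
    """Approximate BFD failure detection time: interval x detect multiplier."""
    return interval_ms * multiplier


for interval in (10, 20, 100):
    print(f"{interval} ms interval -> {detection_time_ms(interval)} ms to detect")
# 10 ms interval -> 30 ms to detect
# 20 ms interval -> 60 ms to detect
# 100 ms interval -> 300 ms to detect
```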
The other aspect is network maintenance. How do you avoid traffic loss during a software upgrade of the switch? There are two options: traffic steering and in-service software upgrade (ISSU). The first method simply evacuates all traffic from the switch so that a software upgrade doesn't impact production traffic. The drawback is that other switches have to take the burden and responsibility of transporting the traffic until the upgrade is complete. This may or may not be possible depending on the amount of traffic in the network. The next option is a feature called ISSU which allows a switch to transport traffic at the same time it upgrades the software. If ISSU is to be used, it's important that this feature is supported across the entire IP Fabric and not limited to leaves for example.
A really great platform for building IP Fabrics is the Juniper QFX5100 series. It comes in various configurations supporting 40GE and 10GE. Let's check them out:
QFX5100-24Q: supports 32x40GE interfaces
QFX5100-48S: supports 48x10GE and 6x40GE interfaces
QFX5100-96S: supports 96x10GE and 8x40GE interfaces
As you can imagine the QFX5100-24Q makes a great spine switch. It has enough port density to build some very large IP Fabrics. The QFX5100-48S and QFX5100-96S make great leaf switches; they offer great 10GE density and enough 40GE uplinks for 2:1 or 3:1 over-subscription.
For example, using 8 QFX5100-24Q switches in the spine and 32 QFX5100-96S switches as leaves, the total number of ports in the IP Fabric is 3,072x10GE. Not bad for 40 switches, 72U of rack space, and 3W per port.
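The numbers in this example can be reproduced with a few lines of Python. The port counts come from the list above; the rack unit sizes (1U for the QFX5100-24Q, 2U for the QFX5100-96S) are an assumption on my part that happens to match the 72U figure.

```python
def oversubscription(down_ports: int, down_gbps: int, up_ports: int, up_gbps: int) -> float:
    """Ratio of server-facing bandwidth to uplink bandwidth on a leaf."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)


SPINES, LEAVES = 8, 32

total_10ge = LEAVES * 96                     # 96x10GE per QFX5100-96S
ratio_96s = oversubscription(96, 10, 8, 40)  # 960G down, 320G up
ratio_48s = oversubscription(48, 10, 6, 40)  # 480G down, 240G up
spine_ports_used = LEAVES                    # one 40GE link from every leaf
rack_units = SPINES * 1 + LEAVES * 2         # assumed 1U spines, 2U leaves

print(total_10ge)            # 3072
print(ratio_96s, ratio_48s)  # 3.0 2.0
print(spine_ports_used)      # 32
print(rack_units)            # 72
```

With eight uplinks per leaf and eight spines, each leaf lands one 40GE link on every spine, which fills exactly the 32x40GE ports of each QFX5100-24Q.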
The Juniper QFX5100 also supports both options: iBGP and eBGP. It's no problem to enable BGP route reflection and AddPath in the spine to support ECMP in an iBGP environment. Running eBGP has fewer requirements and is also no problem for the QFX5100.
In terms of high availability, the QFX5100 supports BFD and provides sub-second convergence times. Most surprisingly the QFX5100 also supports ISSU. You can upgrade the network software while it continues to pass traffic through the IP Fabric. The QFX5100 accomplishes this through virtualization of the control plane. It uses Linux KVM to create virtual machines for Junos. As the ISSU takes place there are two copies of Junos running. The master will continue to operate the control plane while the backup is being upgraded. Once the backup upgrade is complete, the routing engines will switch over and the old backup becomes the new master. Now the other routing engine is upgraded while the new master continues to operate the control plane. Once the process is done, both routing engines will be upgraded without traffic loss.
The QFX5100 takes advantage of all of the control plane features and maturity that come from the M, T, and MX series over the past 16 years. The BGP implementation in Junos is carrier-class and provides robust traffic engineering, tagging, and policy filtering features that make it a perfect choice for building a rock solid IP Fabric.
Automating the creation and maintenance of an IP Fabric is straightforward with the QFX5100. The platform supports the execution of Python scripts and has an extensive API that allows you to provision changes and read data from the switch with ease.
In summary the use of BGP in the data center meets and exceeds the requirements of an overlay network. It easily scales in large environments with 1000s of switches, offers extensive traffic tagging and engineering capabilities, and is very stable in the face of switches from different vendors. When BGP is implemented with Junos and the QFX5100, the result is an IP Fabric that's carrier-class and is a pleasure to use.
Go implement BGP with the QFX5100 in your next IP Fabric.