It's a simple question: how big is the blast radius in your data center? The answer is a bit more complicated. The short answer is: probably larger than you expected.
The network is the glue in the data center that holds everything together: servers, firewalls, and storage. Any problem in the network is immediately obvious because something isn't able to talk to something else. Depending on the type of failure, the symptoms vary.
An access switch failure isn't too bad if the server has redundant links to another access switch. Half of the forwarding capacity is lost for every server attached to the switch.
A core or aggregation switch failure is a bit more complicated because the number of components affected is larger. Each access switch with uplinks to the failed core switch will have reduced forwarding capacity and perhaps a loss of traffic until the control plane converges.
One thing that's very clear is that the size of the switch directly impacts how widespread the failure is; this is referred to as the blast radius. When something breaks, how badly is it felt?
Over the past few years many large companies have purposefully been reducing the blast radius of their data centers. The easiest and most effective method to reduce the blast radius is to use smaller switches. To maintain the scale needed to host thousands of servers, a multi-stage Clos architecture is required. Such an architecture allows the physical network to scale beyond the port density of a single switch. The most common designs in a multi-stage Clos architecture are a 3-stage and 5-stage network.
The 3-stage design has two roles: the spine and leaf. It's called a 3-stage design because the traffic has to traverse three switches in the worst-case scenario. For example, traffic ingresses on the left-most leaf switch, goes to a spine switch, then egresses on the right-most leaf switch. The 3-stage design is so popular it goes by a couple of different names:
Spine and leaf
The obvious downside to a 3-stage design is that you can only scale as large as the port capacity on the spine switch. Sooner or later you just run out of ports in the spine and can't support additional leaves. There are two solutions to this scaling problem:
Increase the size and capacity of the spine switches
Move to a 5-stage design
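The port-capacity ceiling is easy to put numbers on. The following is a minimal sketch, assuming every leaf connects exactly one uplink to every spine, so the leaf count is capped by the spine's port count. The function name and the 32-port, 16-uplink figures are illustrative assumptions rather than a specific design from this article.

```python
def max_three_stage(spine_ports: int, leaf_ports: int, leaf_uplinks: int) -> dict:
    """Upper bounds for a 3-stage Clos where each leaf has one link to every spine."""
    spines = leaf_uplinks                      # one uplink per spine (assumption)
    leaves = spine_ports                       # each spine port terminates one leaf
    server_ports = leaves * (leaf_ports - leaf_uplinks)
    return {"spines": spines, "leaves": leaves, "server_ports": server_ports}


# Hypothetical 32-port switches in both roles, half of each leaf used for uplinks.
limits = max_three_stage(spine_ports=32, leaf_ports=32, leaf_uplinks=16)
print(limits)
```

Once the spine ports are exhausted, the only ways to add leaves are a bigger spine switch or another stage, which is exactly the choice described above.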
Herein lies the crux of the blast radius problem. If you increase the size and capacity of the spine switches, the blast radius increases with them. For example, if one spine switch in our 3-stage design failed, it would result in the loss of 1 out of 4 spine switches, or 25% of the forwarding capacity of the entire network. To reduce the size of the blast radius and maintain larger scale, the 5-stage design has some significant advantages.
The basic premise of the 5-stage design is that you add another role, so there are now three: spine, leaf, and access. Using the same building blocks as in the 3-stage design, the 5-stage effectively has its own spine and leaf structure. The 3-stage design is encapsulated into a vSpine, which is simply a set of network ports that can be consumed by the new access switches.
If we make the assumption that each leaf has 32x40GE interfaces and 16 of them are used as uplinks going into the spine, that leaves each leaf switch with 16x40GE available. Therefore each vSpine has 64x40GE with an over-subscription of 1:1. Since there are four vSpines, that creates a total of 256x40GE ports available to the access switches. The next assumption is that each access switch has 48x10GE and 4x40GE interfaces. Each 40GE interface can be connected to a different vSpine, which results in a total of 64 access switches. Since each access switch has 48x10GE ports, this brings the total up to 3,072x10GE ports.
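The port arithmetic can be checked step by step. This is a small sketch using the figures from the paragraph above; the count of four leaves per vSpine is not stated directly and is inferred here from the 64x40GE total (64 / 16), and the constant names are my own.

```python
LEAF_PORTS_40GE = 32
LEAF_UPLINKS_40GE = 16
LEAVES_PER_VSPINE = 4        # inferred: 64x40GE per vSpine / 16 per leaf
VSPINES = 4
ACCESS_UPLINKS_40GE = 4      # one uplink to each vSpine
ACCESS_PORTS_10GE = 48

leaf_access_facing = LEAF_PORTS_40GE - LEAF_UPLINKS_40GE    # 40GE ports left per leaf
vspine_ports = leaf_access_facing * LEAVES_PER_VSPINE       # 40GE ports per vSpine
total_40ge = vspine_ports * VSPINES                         # 40GE ports for access uplinks
access_switches = total_40ge // ACCESS_UPLINKS_40GE
server_ports_10ge = access_switches * ACCESS_PORTS_10GE

print(leaf_access_facing, vspine_ports, total_40ge, access_switches, server_ports_10ge)
```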
But the question still remains: how large is the blast radius? In our new 5-stage design you can calculate the blast radius by simply looking at the number of switches in the spine and leaf roles. Since each vSpine has six switches and there are four vSpines, the total switch count is 24. To calculate the traffic loss across the 5-stage design we divide 100 by 24, which is about 4.2%.
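The same calculation works for any switch count. Here is a minimal sketch, assuming (as the article does) that traffic is spread evenly so every spine or leaf switch carries an equal share; `blast_radius_pct` is a hypothetical helper name. Note that 100 divided by 24 works out to roughly 4.2%.

```python
def blast_radius_pct(switch_count: int, failed: int = 1) -> float:
    """Share of forwarding capacity lost, in percent, when `failed` switches go down.

    Assumes an even traffic spread across `switch_count` equal switches.
    """
    if switch_count <= 0 or not 0 <= failed <= switch_count:
        raise ValueError("need switch_count > 0 and 0 <= failed <= switch_count")
    return 100.0 * failed / switch_count


three_stage = blast_radius_pct(4)       # one of four spines in the 3-stage example
five_stage = blast_radius_pct(6 * 4)    # six switches per vSpine, four vSpines

print(f"3-stage: {three_stage:.1f}%  5-stage: {five_stage:.2f}%")
```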
One of the tenets of networking is that as the number of intermediate switches increases, so does the latency. Many people shy away from 5-stage designs because worst-case traffic has to travel through five switches: access, leaf, spine, leaf, then access. What's really surprising is that modern switches that take advantage of switch on a chip (SoC) technology have very low latency.
For example, the QFX5100 switches have about 600 ns of latency. If we make the assumption that traffic has to traverse five switches, that's only 3 microseconds. What's even more interesting is that a single traditional chassis-based switch will have more latency than an entire 5-stage design using QFX5100 switches.
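The per-hop arithmetic is simple enough to sketch. This assumes every hop adds the same roughly 600 ns and deliberately ignores cable propagation, serialization, and queuing delay, so it is a lower bound rather than a measurement; the function name is my own.

```python
PER_HOP_NS = 600  # approximate per-switch latency quoted above


def path_latency_us(hops: int, per_hop_ns: int = PER_HOP_NS) -> float:
    """Switching latency of a path in microseconds, counting only per-hop delay."""
    return hops * per_hop_ns / 1000


three_stage_us = path_latency_us(3)   # leaf, spine, leaf
five_stage_us = path_latency_us(5)    # access, leaf, spine, leaf, access

print(three_stage_us, five_stage_us)
```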
The question is why is a chassis inherently slower than five QFX5100 switches? The answer is that there are two ways to build a chassis switch:
Create an internal multi-stage network of chipsets inside of the chassis to forward traffic between line cards
Use a cell based switching fabric
As traffic flows through a chassis switch it is subject to an internal multi-stage network of chipsets, and each chip adds its own amount of latency. The alternative is that the traffic has to be broken up into cells and sprayed across the backplane of the network, which adds latency to the traffic from line card to line card.
Don't forget that as the switch size increases, so does the blast radius, as well as the latency. At this point you might be asking yourself: why would I ever want to use a chassis then? Traditionally the answer has been that a chassis represents a single point of management. The reality is that if you're fully building out a 5-stage design using 1RU switches, you're going to have about 50,000x10GE interfaces and over 800 switches, so a single point of management is not required or wanted. The benefit is that such a large network can maintain 3 microseconds of latency and has a very small blast radius. Each vSpine would have 16 spines and 32 leaves; there would be a total of eight vSpines, bringing the total number of spine and leaf switches up to 384. This results in a traffic loss of only 0.26% during a single switch failure. Not bad.
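These full-scale figures can be sanity-checked with the same arithmetic. The sketch below takes the vSpine counts from the paragraph above and, as an assumption, carries over the per-switch port counts from the earlier example (16x40GE access-facing ports per leaf, 4x40GE uplinks and 48x10GE ports per access switch).

```python
SPINES_PER_VSPINE = 16
LEAVES_PER_VSPINE = 32
VSPINES = 8
LEAF_ACCESS_FACING_40GE = 16   # assumption carried over from the earlier example
ACCESS_UPLINKS_40GE = 4        # assumption carried over from the earlier example
ACCESS_PORTS_10GE = 48         # assumption carried over from the earlier example

fabric_switches = (SPINES_PER_VSPINE + LEAVES_PER_VSPINE) * VSPINES
loss_pct = 100.0 / fabric_switches          # one spine or leaf switch fails

access_facing_40ge = LEAVES_PER_VSPINE * LEAF_ACCESS_FACING_40GE * VSPINES
access_switches = access_facing_40ge // ACCESS_UPLINKS_40GE
server_ports_10ge = access_switches * ACCESS_PORTS_10GE

print(fabric_switches, round(loss_pct, 2), access_switches, server_ports_10ge)
```

Under these assumptions the design lands at 49,152x10GE server ports, which lines up with the "about 50,000" figure.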
However, there is one alternative when it comes to building a 5-stage network and maintaining a single point of management in the vSpine. You can leverage the new Virtual Chassis Fabric, which allows you to manage a 3-stage design as a single, logical switch. Let's take the same 5-stage design example and see how Virtual Chassis Fabric can reduce the overall management overhead.
Now each vSpine has been replaced by a Virtual Chassis Fabric, which offers a single point of management. One thing to keep in mind is that Virtual Chassis Fabric has a limitation of 20 switches total: 4 in the spine and 16 in the leaves. If you wanted to create a Virtual Chassis Fabric with 1:1 over-subscription with 40GE interfaces, each leaf would have 16x40GE uplinks and 16x40GE interfaces available for the access switches. If we assume that each spine has 32x40GE interfaces and there are four spines in a Virtual Chassis Fabric, this gives us a total of 128x40GE interfaces in the spine, which will support eight leaves with 16x40GE uplinks. Eight leaves with 16x40GE interfaces available for the access switches brings the grand total up to 128x40GE per Virtual Chassis Fabric. Since each access switch has 4x40GE interfaces, we can scale the number of Virtual Chassis Fabrics to four as well, bringing the grand total of ports available for access uplinks to 512x40GE. This results in 128 access switches and 6,144x10GE server interfaces. The summary is that Virtual Chassis Fabric gives you a single point of management per vSpine as opposed to having to manage every single switch.
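The sizing above follows from two constraints: the spine port budget and the 20-switch limit. Below is a rough sketch of that reasoning, using only the figures from the paragraph; the variable names are my own, and the 48x10GE access port count is carried over from the earlier example.

```python
VCF_MAX_SPINES = 4
VCF_MAX_LEAVES = 16
SPINE_PORTS_40GE = 32
LEAF_UPLINKS_40GE = 16          # half of a 32x40GE leaf, for 1:1 over-subscription
LEAF_ACCESS_FACING_40GE = 16
ACCESS_UPLINKS_40GE = 4
ACCESS_PORTS_10GE = 48

spine_capacity = VCF_MAX_SPINES * SPINE_PORTS_40GE
# The leaf count is bounded by the spine ports and by the fabric's leaf limit.
leaves = min(spine_capacity // LEAF_UPLINKS_40GE, VCF_MAX_LEAVES)
ports_per_fabric = leaves * LEAF_ACCESS_FACING_40GE

fabrics = ACCESS_UPLINKS_40GE   # one access uplink per Virtual Chassis Fabric
total_40ge = ports_per_fabric * fabrics
access_switches = total_40ge // ACCESS_UPLINKS_40GE
server_ports_10ge = access_switches * ACCESS_PORTS_10GE

print(leaves, ports_per_fabric, total_40ge, access_switches, server_ports_10ge)
```

With 1:1 over-subscription the spine ports run out at eight leaves, well before the 16-leaf limit is reached.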
One interesting side effect of using 1RU switches in building an IP Fabric is that it allows you to spread out the switches across the data center. When using a traditional chassis switch, you're limited to a single rack. Being able to spread the vSpine across multiple racks in the data center allows for additional physical redundancy. In the event of a rack or power failure, only a portion of the vSpine would be affected.
This illustration shows a map of a data center from the top down. Each number and letter indicates a rack position in the data center. There are 10 rows (A-J) and each row has 16 racks. Each vSpine is illustrated by a different color and is spread out across multiple racks as well as rows.
In summary, building a 3-stage or 5-stage Clos architecture allows you to reduce the blast radius in your data center. As the number of switches in the design increases, the blast radius is reduced. The end result is that the network becomes more stable and traffic loss during a failure isn't as amplified. Would you rather have a traffic loss of 12.5% or 0.26%? The crazy thing is that using the QFX5100 in a 5-stage design will offer less end-to-end latency than a single traditional chassis. Depending on which QFX5100 you use to build out your network, the average power consumption per 10GE interface is about 3.4W. The latency will be around 3 microseconds.