/var/blog
Industry and Technology Insights and reflections by Mike Bushong
Networking's next era: Age of Removal
07.31.17

From a purist’s point of view, networking has been around since the 1960s. But the real heyday of networking started with the Internet boom. And for the better part of the two decades that followed, networking went through a growth spurt that we will likely never see again. New protocols, new transport media, new features, new management tools. Through the middle of the 2000s, networking was in the Age of Addition as we incrementally built up what we have come to know as networking today.
 
But the Age of Addition is over. We are entering the next networking era: the Age of Removal. 
 
When adding on doesn’t add up
Over the past thirty years, it’s actually pretty amazing what we have collectively accomplished in networking. In many cases, we have learned on the fly as we have moved from mainframes to distributed applications to cloud. 
 
Perhaps more amazingly, in almost any enterprise network of even moderate size, we support all of that concurrently today. It’s remarkable how old some of the legacy applications are in many enterprises today. You would absolutely cringe if you knew the details of some of the most critical infrastructure on which a lot of today’s modern marvels rely (I’m looking at you, air travel).
 
The result is that we have a bunch of snowflake infrastructure environments, largely because that’s what is required to support the long tail of applications that lack the business case to completely modernize. And we have contorted networking in strange and often fantastic ways to make it all work.
 
If you have ever wondered why networking is so complex, this is a big part of it. When a practice spans decades, there simply is no way to have a grand master plan that allows us to elegantly and simply fit all the pieces together. Rather, we end up with something that, despite all good intentions, more closely resembles the Winchester Mystery House than the lovely 32-bedroom estate we thought we would end up in.
 
What happens when you start with a clean slate?
While it is impractical for most enterprises to simply throw everything away, we actually have a couple of pretty decent models for what this might look like. 
 
If you look at the major web scale and cloud companies, you get a pretty clear idea of what kinds of architectural principles might win out if you got to start fresh. So what can we glean?
 
Merchant silicon
Obviously in the datacenter space, merchant silicon makes a lot of sense for the leaf and spine. While the argument is usually around cost, I would suggest that is a happy side effect of the real rationale here. 
 
If the industry settles on a small number of merchant silicon solutions, it reduces the divergence in the underlying components of the infrastructure. A more consistent base of functionality provides a more predictable set of baseline behaviors. And this is what creates the solid platform on top of which these companies can build. 
 
Pizza boxes
The merchant silicon is stamped out into a relatively small set of form factors, for two primary reasons. First, it reduces the amount of divergence in the architecture (see the previous rationale), which means that everything from procurement to provisioning can be streamlined. When agility is your primary value proposition, this is a make-or-break principle.
 
Second, the 1RU form factor helps maintain a small blast radius. When there are issues, they will tend to be contained to a small subset of the overall network. Because the blast radius is contained and the boxes are all the same, it means that the operating procedures around failures can be straightforward and uniformly applied. If something fails, simply replace it and figure out the issue later. 
 
If everything is the same, anomalous behavior is easy to identify. The mere fact that nothing else is suffering the same fate means there is something unique. Troubleshooting becomes a quest for identifying the divergence: what is different about this case? And remediation can often be a case of making the divergent thing conformant again, which is done by replacing the device and applying a standard configuration.
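 
When everything is supposed to be the same, that hunt for divergence can even be done mechanically. Here is a minimal sketch of the idea in Python; the file names (golden.conf, configs/leaf-*.conf) are assumptions for illustration, not any particular vendor’s layout:

    # Sketch: flag devices whose config diverges from a golden standard.
    # golden.conf and configs/leaf-*.conf are hypothetical file names.
    import difflib
    from pathlib import Path

    GOLDEN = Path("golden.conf").read_text().splitlines()

    def divergence(device_conf: Path) -> list[str]:
        """Return the lines where a device departs from the golden config."""
        lines = device_conf.read_text().splitlines()
        diff = difflib.unified_diff(GOLDEN, lines, lineterm="")
        return [l for l in diff
                if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]

    for conf in sorted(Path("configs").glob("leaf-*.conf")):
        delta = divergence(conf)
        if delta:
            # Each hit is a candidate for "replace and re-apply the standard config."
            print(f"{conf.name}: {len(delta)} divergent lines")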
 
Even the OCP-based modular chassis, by the way, conforms to this. While it is physically a larger device, it is functionally just a Clos-in-a-box, with each line card acting independently.
 
Protocols and policies
The hallmark of cloud design? BGP to top-of-rack in a draft-lapukhov architecture (since published as RFC 7938). When starting from a blank slate, we see a dramatic slimming down of what actually gets configured and deployed.
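 
For the curious, a hedged sketch of what that slimming down looks like: one private ASN per top-of-rack switch, plain eBGP sessions up to the spines, and not much else. The Python below just generates the stanzas; the ASN numbering, addresses, and quasi-FRR syntax are illustrative assumptions, not a recommended configuration:

    # Sketch of the per-rack ASN scheme from draft-lapukhov / RFC 7938:
    # one private ASN per leaf, eBGP to every spine, nothing fancier.
    SPINE_ASN = 64512          # spines share an ASN (illustrative)
    LEAF_ASN_BASE = 64601      # one private ASN per leaf/rack (illustrative)

    def leaf_bgp_stanza(leaf_id: int, spine_ips: list[str]) -> str:
        asn = LEAF_ASN_BASE + leaf_id
        lines = [f"router bgp {asn}"]
        for ip in spine_ips:
            lines.append(f"  neighbor {ip} remote-as {SPINE_ASN}")
        lines.append("  redistribute connected")  # advertise the rack's subnets
        return "\n".join(lines)

    print(leaf_bgp_stanza(3, ["10.0.0.1", "10.0.1.1"]))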
 
In many ways, when given the freedom to do whatever they want, people who start unencumbered choose to do less. This should resonate with the less fortunate among us who are responsible for legacy systems. Most senior network engineers bear the scars of dealing with unique requirements. That small tweak applied to only a couple of servers becomes very expensive over the years, especially as staff changes and tribal knowledge dissipates across a larger organization. What should have been a routine upgrade can suddenly become very exciting for all the wrong reasons.
 
Code rot
There is a term, code rot, for the way code left untouched for long periods while everything around it changes becomes less stable over time. Colloquially, stagnation leads to rot, and rot leads to problems. This is why examining old code from time to time is a worthwhile task, and why refactoring is a good thing to do. This is especially true if a system has accrued technical debt for whatever reason. Paying off debt sooner is always less expensive than waiting until the note is due.
 
The same premise is true in infrastructure. When things are left static and unchecked, they can often decay. That decay might be imperceptible at first, but the more change accumulates around the rot, the more likely the issues are to manifest in larger, more obvious ways.
 
Addition through subtraction
And this all brings me to my punch line. If we were allowed to design from scratch, we would choose simplicity over complexity. On top of that, the cumulative technical debt in our devices and networks is a source of infrastructure decay. 
 
So we want to take stuff out, and we need to take stuff out. Which means that the next era in networking will be notable not for the technologies we introduced but rather for the things we retired. 
 
If you want to measure progress, you should write down a list of all the protocols, policies, firewalls, and supporting technologies that you support today. Over time, if that list is shrinking, you are doing well. If, however, the list is growing, it probably signals a need to rethink the strategy.
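 
If you want to see the trend rather than eyeball it, a tiny Python sketch is enough: snapshot the list periodically and report what was removed, what was added, and which way the count is moving. The dates and contents below are invented for illustration:

    # Sketch: track whether the list of supported technologies is shrinking.
    # The snapshot dates and contents are made up for illustration.
    inventories = {
        "2017-01": {"ospf", "bgp", "mpls", "vrrp", "hsrp", "stp", "snmpv2"},
        "2017-07": {"ospf", "bgp", "mpls", "vrrp", "stp"},
    }

    snapshots = sorted(inventories)
    for prev, curr in zip(snapshots, snapshots[1:]):
        removed = sorted(inventories[prev] - inventories[curr])
        added = sorted(inventories[curr] - inventories[prev])
        trend = "shrinking" if len(inventories[curr]) < len(inventories[prev]) else "growing"
        print(f"{prev} -> {curr}: {trend}; removed={removed}, added={added}")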
 
The bottom line
The somewhat uncomfortable truth is that networks five years from now will actually do less than they do today. Devices will support fewer things. The software that powers those devices will ultimately have fewer lines of code.
 
This kind of transformation requires a massive shift in mindset. It means that incrementalism—in all its forms—needs to yield to stronger discipline. An RFP should start with an empty spreadsheet as opposed to a pre-populated list of 4,367 features covering functionality first introduced in 1982. Adding anything new to the infrastructure should undergo heavy scrutiny to justify the long-term cost of managing the added complexity. Default behavior should be minimalist.
 
Somewhat paradoxically, the more effective we are at removing things, both individually and collectively, the more we will actually get from our infrastructure.
