Silicon and Systems

The Fabric Node – Fabric approach to overcome Chassis limitations - Part 1 - Hope and Reality

by Juniper Employee, 09-07-2017 02:51 PM (edited 09-08-2017 12:06 PM)

Introduction

In recent years, companies that operate hyperscale Data Centers – Content Service Providers – have developed and mastered solutions that enable building extremely large DC networks that connect thousands of endpoints (servers) and carry petabytes of information every second. These solutions are not formally specified but are commonly referred to as "IP Fabric", "IP CLOS", "CLOS Fabric", or "DC CLOS".

 

They are all essentially the same: the same physical topology and the same technological principle. Furthermore, they have proven that Data Center (DC) networks can be built using relatively inexpensive devices and a minimalistic set of protocols.

 

The success of “CLOS Fabric” in the DC space inspired Content Service Providers as well as other companies such as Telcos and MSOs (Cable) to investigate the possibility of using similar concepts as building blocks of their WAN networks.

 

This article discusses the motivation, applicability, limitations, and impact on network operations when the CLOS fabric concept is applied to a WAN network. In subsequent articles, other aspects such as device selection, capacity and cable planning, routing control-plane design, and finally the operations support toolset will be discussed.

 

The hope

The companies that are looking to deploy a CLOS structure as a network "node" – a FabricNode – are doing it for a reason. Ultimately, their hope is to benefit by:

 

  1. Deploying a network node of higher capacity than any system offered on the market by vendors.
  2. Gradually scaling the FabricNode just by adding more devices.
  3. Reducing Capital Expenses (CAPEX) through the use of inexpensive 'white box' devices or devices with fewer capabilities.
  4. Allowing the mix-and-match of devices from different vendors in a single FabricNode structure to:
    • Exercise competitive pressure for price reduction
    • Decouple the node upgrade cycle from the roadmap (and delays) of a single vendor
    • Use best-of-breed devices for a given role
    • Mitigate the risk of multiple concurrent failures caused by a single implementation (single source code).
  5. Reducing the size of a single failure domain from the entire (multi-Tbps) node to a single device or a single link within the FabricNode.
    This attribute also plays a role during planned maintenance, as each device of the FabricNode can be maintained separately.
  6. Designing a FabricNode with an oversubscription ratio that matches the provider's network traffic pattern, reducing total capital cost.

At the same time, it is desired that a FabricNode not dramatically change how the entire network operates. The network should still support traffic engineering (TE) if desired, effectively using the bandwidth of long-distance links and reducing the total length of WAN connections (lower cost and risk of failure).

The reality

Elastic Scaling to unlimited capacity (B/W)

In theory, a CLOS structure can scale infinitely, leveraging the recursive nature of the CLOS topology. However, keeping the structure expandable requires reserving ports on the highest-level SPINEs that are not connected initially. When a capacity upgrade is due, these ports are used to connect newly provisioned higher-layer SPINEs.

 

Let's review the example below.

 

The initial build consists of 16 LEAFs (CLOS level 1), each connected to the SPINEs (CLOS level 2) by 30 links. There are therefore 480 links between LEAFs and SPINEs; provided we are using 32-port SPINE switches, 15 SPINEs provide enough ports. If a FabricNode is built that way, there is no expansion possibility (without re-cabling and re-configuring the existing devices).

 

However, if the initial deployment uses 30 SPINEs, then after connecting all 16 LEAFs there will still be 480 unused ports on the SPINEs. These ports could be used to either:

 

  1. Connect another 16 LEAFs, extending the FabricNode structure to 32 LEAFs, which would be the new maximum, or
  2. Deploy another level of SPINEs, let's call them SuperSPINEs (CLOS level 3), and connect the initial infrastructure to this SuperSPINE layer. The number of SuperSPINEs and the number of ports available will dictate the total scalability.

Please note that in case of option 2, the SuperSPINEs may either keep some (50%) of their ports reserved for connection to even higher SPINE levels (CLOS level 4) or use all of their ports to connect the lower-level SPINEs. In the latter case, our example structure can scale up to 512 LEAFs, which would require 480 SuperSPINEs, 960 SPINEs, and 30,720 links between devices.
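To make the scaling arithmetic easier to follow, here is a minimal Python sketch that reproduces the numbers above. It assumes the pod structure implied by the example (30 uplinks per LEAF, 32-port SPINEs and SuperSPINEs, pods of 16 LEAFs plus 30 SPINEs with each SPINE split 16 ports down / 16 up); it is an illustration, not a design tool.

```python
# Back-of-the-envelope CLOS sizing for the example above.
LEAF_UPLINKS = 30
SPINE_PORTS = 32
SUPERSPINE_PORTS = 32

# 3-stage maximum: every SPINE port faces a LEAF.
leafs_3 = SPINE_PORTS                      # 32 LEAFs
spines_3 = LEAF_UPLINKS                    # 30 SPINEs
links_3 = leafs_3 * LEAF_UPLINKS           # 960 LEAF-SPINE links

# 5-stage maximum: each 32-port SuperSPINE reaches one SPINE per pod,
# so at most 32 pods of 16 LEAFs each.
pods = SUPERSPINE_PORTS                            # 32 pods
leafs_5 = pods * 16                                # 512 LEAFs
spines_5 = pods * 30                               # 960 SPINEs
spine_uplinks = spines_5 * 16                      # 15,360 SPINE-SuperSPINE links
superspines = spine_uplinks // SUPERSPINE_PORTS    # 480 SuperSPINEs
links_5 = leafs_5 * LEAF_UPLINKS + spine_uplinks   # 15,360 + 15,360 = 30,720 links

print(leafs_3, spines_3, links_3)               # 32 30 960
print(leafs_5, spines_5, superspines, links_5)  # 512 960 480 30720
```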

 

Please note that keeping the option open for unlimited scaling is expensive. Let's assume that the design goal is a 32-LEAF system. We can activate it with 30 SPINEs and 960 links. However, if this structure is supposed to remain open for further expansion, at least 60 (32-port) SuperSPINEs and another 30 SPINEs need to be added. The structure would then be 60 SuperSPINEs (expandable up to 480), 60 SPINEs, 32 LEAFs, 1,920 links, and 960 ports on the SuperSPINEs reserved for further expansion. Please see the table below for a quick comparison:

 

                 Closed       Open for scaling
LEAFs            32           32
SPINEs           30           60
SuperSPINEs      0            60
Total devices    62           152 (+145%)
Links            960          1,920 (+100%)
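The table follows directly from the port arithmetic. As a quick cross-check, here is a minimal sketch of both columns, under the same assumptions as the sketch above:

```python
# "Closed" 32-LEAF build: every SPINE port faces a LEAF, nothing is reserved.
closed_devices = 32 + 30                  # 62 devices
closed_links = 32 * 30                    # 960 LEAF-SPINE links

# "Open for scaling" build: 60 SPINEs split 16 ports down / 16 up,
# plus 60 x 32-port SuperSPINEs keeping half their ports free for growth.
leaf_spine = 32 * 30                      # 960 links
spine_super = 60 * 16                     # 960 links
open_devices = 32 + 60 + 60               # 152 devices (+145%)
open_links = leaf_spine + spine_super     # 1,920 links (+100%)
reserved_superspine_ports = 60 * 32 - spine_super   # 960 ports held for expansion
```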

 

 

The other interesting property of a FabricNode is that its capacity scaling is independent of a vendor roadmap. The capacity of an integrated router (chassis) is capped by the number of slots and the bandwidth of a line card. This limit can be raised only by replacing line cards (and fabric) with faster ones – typically of a newer generation. As a consequence, an integrated chassis can be scaled up only as fast as the vendor's roadmap and its execution allow. In contrast, a CLOS structure can be expanded by scaling out (adding more LEAFs and SPINEs) at any time, whenever there is a need.

 

In summary, a CLOS structure indeed allows for virtually unlimited scaling; however, it comes at a significant cost in additional hardware and infrastructure (fiber-optic cores).

 

Capex Savings

Device capabilities

Some Content Providers build their DC fabrics using inexpensive, simple devices, while others use the most sophisticated ones. Obviously, the total cost of a device depends on many factors – not only hardware capability, but also the software suite that comes with it, support, etc. However, it is easy to identify four primary facets that impact the cost[1] of a device:

 

  1. Delay-bandwidth buffer (DBB) capability
  2. IPv4/IPv6 FIB size
  3. Software support for advanced features such as MPLS, TE, VPNs, and ISSU
  4. Vendor support and services

Out of the above four, only DBB capability really matters for a hyperscale DC. A couple hundred thousand FIB entries and basic IP forwarding are good enough. Vendor support also isn't critical when tens of thousands of devices are procured and deployed: the CSP simply replaces a malfunctioning device, and develops and maintains its own software stack.

 

The majority of DC fabrics are also deployed with shallow DBB (10s–100 ms), although there are deployments that utilize deep-buffering switches. It is worth noting that in a typical DC the vast majority of traffic stays within the DC, so the round-trip time (RTT) is minimal and has negligible impact on the responsiveness of flow control (e.g. the TCP sliding window). In addition, Content Providers control the software stack on both ends (the servers), so they can optimize the flow-control algorithm and its parameters. Also, the cost of links – passive intra-DC fiber infrastructure – is relatively low compared to long-haul WAN optical spans. Therefore, DC fabrics can be designed with extra capacity, which allows the use of shallow buffers.
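The usual rule of thumb behind "shallow buffers are fine inside a DC, deep buffers are needed on long-haul links" is the bandwidth-delay product. A rough illustration follows; the link speed and RTT values are illustrative assumptions, not figures from the text:

```python
def bdp_bytes(link_gbps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: a classic estimate of how much buffering a
    congested port may need to keep long-lived TCP flows at line rate."""
    return (link_gbps * 1e9 / 8) * (rtt_ms / 1000)

# Intra-DC hop: 100GE with ~0.1 ms RTT -> roughly 1.25 MB of buffer
print(bdp_bytes(100, 0.1) / 1e6, "MB")
# Long-haul WAN hop: 100GE with ~100 ms RTT -> roughly 1.25 GB of buffer
print(bdp_bytes(100, 100) / 1e9, "GB")
```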

 

When a CLOS fabric structure is used to instantiate a network node, immediate questions arise about the capabilities required of the devices used to construct the FabricNode.

 

  • The LEAF devices that terminate expensive long-haul links should probably provide deep DBB.
  • The LEAF devices that connect to other ASes (peering and transit) should:
    • Provide a large FIB. (Although some control-plane solutions – e.g. EPE, Egress Peer Engineering – allow this requirement to be moved to other LEAFs in the FabricNode.)
    • Not necessarily provide deep DBB: since peering and transit are charged/billed based on traffic volume (e.g. the 95th-percentile rule) or are 'free', there is no pressure to fully utilize link capacity.
  • The LEAF devices that connect customers should:
    • Provide deep DBB, because a customer contract is usually associated with the access link bandwidth, and the customer has the right to expect that the full capacity of this link can be utilized, especially when it is congested.
    • Provide a large FIB, to optimally route traffic to other LEAFs and over the WAN topology.
    • Provide VPN capabilities, if VPN services are offered.
  • The LEAF devices that connect the DC can have shallow DBB and, depending on the DC content, possibly a small FIB (if the content is served to a limited group of consumers).
  • SPINE characteristics depend more on the actual FabricNode internal architecture and the SPINE role. Typically, SPINEs will be simple, inexpensive devices.

As you can see, in many cases the LEAFs used to construct a FabricNode have to provide decent buffering capability and large scale. Therefore, these devices are inherently not so inexpensive (in comparison to the shallow-DBB, small-FIB devices of a DC fabric).

 

Costly multi-chassis systems

Another angle from which to look at the cost of the solution is a comparison with integrated modular routers – single- or multi-chassis.

 

It is worth noting that in a CLOS structure, 50% of each LEAF device's capacity (assuming no oversubscription) is used just to connect the SPINEs. If we agree that there is a street price per Gbps for devices of given capabilities (say, large FIB and deep DBB), this makes an external port of a FabricNode cost twice as much as the same port on an integrated chassis – even before including the cost of SPINEs and internal fabric connections (a lot of QSFPs and fibers).

 

Let's use an example of a Juniper PTX10016 versus a FabricNode that uses the PTX1000 as a LEAF, and let's assume that a 100GE port on both platforms costs 10,000 Rafal's Monetary Units (RMU – a fake currency nobody honors, not even me).

In order to have a node with 240 x 100GE ports, we need one PTX10016, so it will cost RMU 2,400,000.

What about the FabricNode? The PTX1000 has 24 x 100GE ports, so it costs RMU 240,000. Twelve (12) ports need to be used for SPINE connectivity and the other twelve (12) for external interfaces. Therefore, a 240 x 100GE FabricNode system needs 20 LEAFs, which comes to RMU 4,800,000. Twice as much already, before adding any SPINEs.

What if further growth is required, to 480 x 100GE? Not a big problem – we just add line cards to the PTX10016 or LEAFs to the FabricNode (ignoring SPINEs for a moment). The PTX10016 will now cost RMU 4,800,000 and the LEAFs of the FabricNode RMU 9,600,000. Still twice as much. But what next? What if we need 500 x 100GE? With the current generation of PTX10k line cards this may not be achievable. But adding a couple more LEAFs to the FabricNode (and another set of SPINEs if necessary) will do the job. It pushes the LEAF bill past RMU 10,000,000 (plus the cost of SPINEs and interconnects), but price is irrelevant if there is no other option.
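The RMU arithmetic above can be condensed into a few lines. This is a sketch that uses only the illustrative figures from the example (RMU 10,000 per 100GE port and a 12/12 port split on the 24-port PTX1000), not real pricing:

```python
import math

PRICE_PER_100GE_RMU = 10_000   # illustrative price from the example
LEAF_PORTS = 24                # PTX1000: 12 external + 12 SPINE-facing, no oversubscription

def chassis_cost(external_ports: int) -> int:
    # Integrated chassis: you only pay for the external 100GE ports.
    return external_ports * PRICE_PER_100GE_RMU

def fabric_leaf_cost(external_ports: int) -> int:
    # FabricNode: each LEAF exposes only 12 external ports, so half of the
    # capacity paid for is consumed by SPINE-facing links.
    leafs = math.ceil(external_ports / 12)
    return leafs * LEAF_PORTS * PRICE_PER_100GE_RMU

for ports in (240, 480, 500):
    print(ports, chassis_cost(ports), fabric_leaf_cost(ports))
# 240 -> 2,400,000 vs 4,800,000
# 480 -> 4,800,000 vs 9,600,000
# 500 -> what a chassis would cost if line cards allowed it, vs 10,080,000
#        in LEAFs alone, before any SPINEs or optics
```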

With multi-chassis systems, due to their uniqueness on the market, the cost per Gbps could be slightly higher, but not double. Still, the fabric structure seems to be more expensive.

 

It is worth noting that all interfaces (line cards) of an integrated system offer the same or very similar capabilities and little oversubscription. The FabricNode allows a mix of different classes of LEAF devices, which could reduce the overall cost.

 

In summary, it is unrealistic to expect that a CLOS-based FabricNode will be significantly cheaper than an integrated router of similar capabilities and capacity. Most likely it will be more expensive.

 

Oversubscription

Based on knowledge of the traffic pattern, service providers can plan the FabricNode structure and the roles of particular LEAFs and ports in such a way that a large part of the traffic is forwarded locally between ports of the same LEAF device (or, in the case of a 5-stage structure, between ports of LEAFs connected to the same set of stage-2 SPINEs). As a consequence, the number of SPINE-facing ports on LEAFs and the number of SPINE devices (or the SuperSPINE-facing ports of SPINEs and the number of SuperSPINE devices) can be reduced.
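As a hedged illustration of how a planned oversubscription ratio translates into fewer fabric-facing ports (and hence fewer SPINEs), here is a small sketch; the 32-port LEAF and the ratios below are hypothetical, not taken from the text:

```python
import math

def split_leaf_ports(total_ports: int, oversub: float) -> tuple[int, int]:
    """Split a LEAF's ports into (external, SPINE-facing) for a target
    oversubscription ratio = external bandwidth : fabric bandwidth."""
    uplinks = math.ceil(total_ports / (1 + oversub))
    return total_ports - uplinks, uplinks

print(split_leaf_ports(32, 1.0))   # (16, 16) -> no oversubscription
print(split_leaf_ports(32, 3.0))   # (24, 8)  -> 3:1, fewer SPINE ports (and SPINEs) needed
```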

 

The drawback of this method is that once the FabricNode is built, there is no possibility of changing the initially assumed oversubscription ratio.

 

It is also worth noting that enforcing local forwarding on LEAFs requires strictly placed and disciplined connections of external systems, which can increase operational cost and prevent the use of some free LEAF ports.

In summary, a CLOS structure allows CapEx savings through flexible selection of the oversubscription ratio; however, this comes with the risk that a change in traffic pattern may lead to congestion on fabric links and traffic loss, and with the possibility of not being able to use all available external ports on the LEAFs. It is a known fact that managing oversubscription adds a lot to the operational expenses of a network.

 

Small failure domains

Each device in a FabricNode has an independent control plane and communicates using standard, open, and mature protocols. This is really the prime attribute of a well-designed FabricNode structure.

 

Please note that some FabricNode implementations may advocate a central controller that installs forwarding state into the data-plane devices (LEAFs and SPINEs). This approach should be carefully considered and evaluated against scenarios in which a failure (software defect) in the controller could make all devices of the FabricNode malfunction. In such cases the entire FabricNode should be seen as a single failure domain – just like an integrated router (where the Routing Engine/Route Processor is a central controller that installs forwarding state into the data-plane devices, the PFEs).

 

In summary, in a properly designed FabricNode the failure domain is reduced to a single device (LEAF or SPINE) or a single link between them.

 

Multivendor

Each device in a FabricNode has an independent control plane and communicates using standard, open, and mature protocols. So, in theory, every device in the FabricNode structure could come from a different vendor, or even be a 'whitebox'.

 

There are indisputable benefits for a service provider: it can select the best device for a task available on the market at a given point in time, and it is solely the service provider's decision what 'best' means for them.

Let’s see some examples.

 

Vendor A provides the best LEAF platform for a given FabricNode: 60 x 100GE ports, deep DBB, and a large FIB.

Unfortunately, this vendor does not provide a good SPINE platform – low-cost, dense 100GE, which can have shallow DBB and a small FIB. So the service provider picks a 32 x 100GE SPINE switch from Vendor B.

As a result, the FabricNode is a multi-vendor structure with LEAFs from Vendor A and SPINEs from Vendor B.

As the service provider deployed this system without oversubscription and with a 100% growth margin, the initial FabricNode consists of 16 LEAFs (using 1/2 of the SPINE ports) and 30 SPINEs (1/2 of each LEAF's ports are used for FabricNode-internal connectivity).

 

After a while the FabricNode needs expansion, but by this time Vendor D provides a denser, 72 x 100GE platform that meets the LEAF requirements. This platform is less expensive per 100GE, so it seems to be the natural choice: instead of 16, only 14 LEAFs will provide the necessary port-count expansion. But it is hard to integrate Vendor D's platform into the existing structure. Ideally, for 72-port LEAFs there should be 36 (or 18, or 9, or 3) SPINEs, but the existing structure already has 30 SPINEs. So the service provider has the following options:

  1. Keep using Vendor A's LEAFs.
  2. Use 16 of Vendor D's LEAFs, but connect only 30 ports of each to the SPINEs and use only 30 ports externally (because of the no-oversubscription requirement). In this case the benefit of the lower per-100GE price vanishes, as only 60 of the 72 ports are used (see the sketch below).
  3. Deploy another layer of 36 new SPINEs (Vendor B) and connect 14 of Vendor D's LEAFs, each by 36 links. Then deploy 48 SuperSPINEs as a 6x8 matrix and connect the SPINEs through them. This allows full use of Vendor D's LEAF capacity, but the investment in additional SPINEs and connections (QSFPs) would offset the benefit.

Please note that option 1 has an additional benefit on the operational side: there is no need for certification and testing of a new platform, no need for cross-training of operations personnel, and no need to adapt the OSS systems (FCAPS, orchestration, planning, etc.) to operate with a new platform.
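A quick back-of-the-envelope check of option 2 shows where the stranded ports come from. This is a sketch; it assumes, as in the example, one uplink per existing SPINE and no oversubscription:

```python
def usable_ports(new_leaf_ports: int, existing_spines: int) -> tuple[int, int]:
    """How many ports of a new, denser LEAF are actually usable when it must
    fit an existing SPINE layer with no oversubscription (one uplink per SPINE)."""
    uplinks = min(existing_spines, new_leaf_ports // 2)
    external = uplinks                        # external bandwidth capped by fabric bandwidth
    stranded = new_leaf_ports - uplinks - external
    return external, stranded

# Vendor D's 72-port LEAF dropped into the existing 30-SPINE structure:
print(usable_ports(72, 30))   # (30, 12) -> only 60 of the 72 ports are used
```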

 

In summary, the multivendor capability of a FabricNode structure can give service providers the ability to select the best available product and negotiate the best cost in the initial node-planning phase. However, in the operation and expansion phases, the cost of onboarding additional platforms and potential port-count mismatches offset the potential benefits.

 

Not-so-free lunch

The discussion above shows that a FabricNode structure can, within reasonable bounds, fulfill a service provider's hopes. The FabricNode can be a multivendor, flexibly scalable structure that operates as multiple independent failure domains.

 

But nothing under the sun is free. So where is the catch?

 

Control Plane impact

The use of a FabricNode structure instead of a single integrated router means that there will be many more nodes and links in the IGP domain. The difference is not just a factor of two or three, but rather an order of magnitude.

For example:

 

Let's assume a 5-node core network with a topology where each node has 4 links to each of 3 other nodes.

If each node is implemented as an integrated router, the IGP's LSDB will contain 5 IGP nodes and 60 (unidirectional) links.

Now consider each node implemented as a FabricNode of 4 LEAFs and 4 SPINEs. Then the IGP's LSDB will contain:

  • 40 IGP nodes (8 times as many)
  • 156 (unidirectional) links (2.6 times as many: 60 between LEAFs of different sites + 96 between LEAFs and SPINEs)

Finally, if we decide to deploy a 5-stage CLOS – say 8 LEAFs, 8 SPINEs, and 16 SuperSPINEs – the IGP LSDB will grow to:

  • 160 IGP nodes
  • 252 (unidirectional) IGP links
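A hedged way to see where the growth comes from is to parameterize the LSDB size by the internal wiring. The helper below assumes a full LEAF-SPINE mesh inside every site, so its internal-link count is an upper bound and will not match the figures above exactly, since those depend on the exact internal cabling chosen:

```python
def lsdb_size(sites: int, leafs: int, spines: int, external_uni: int) -> tuple[int, int]:
    """Rough IGP LSDB size (nodes, unidirectional links) for a network whose
    nodes are FabricNodes. Assumes every LEAF connects to every SPINE in its
    site, each adjacency contributing two unidirectional links."""
    nodes = sites * (leafs + spines)
    internal_uni = sites * leafs * spines * 2
    return nodes, internal_uni + external_uni

# Baseline: five integrated routers -> 5 IGP nodes, 60 unidirectional links.
# The same network built from FabricNodes of 4 LEAFs + 4 SPINEs, keeping the 60 external links:
print(lsdb_size(5, 4, 4, 60))   # (40, 220) under the full-mesh assumption
```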

Some creative control-plane solutions can mitigate the IGP LSDB explosion, but an impact on the IGP LSDB, and consequently on convergence time and memory footprint, is inevitable.

 

Please note that in the context of traffic engineering, growth of the LSDB significantly increases the computational complexity of path computation.

 

Operationalization

Having multiple devices to manage in a FabricNode, instead of the single device of an integrated router, has an obvious impact on operational procedures and practices. Beyond the well-known dependence of operational cost on the number of managed devices, a FabricNode requires a toolset – an OSS – that helps to:

 

  • Plan a FabricNode in terms of capacity, cable plan, and LEAF types/capabilities, and also plan the expansion of a FabricNode that is in service.
  • Build configurations for different devices and keep them consistent.
  • Monitor the behavior of the FabricNode, catch and highlight anomalies, predict upcoming capacity or resource exhaustion, and suggest remediation.
  • Present a dashboard of the FabricNode state.

Please note that this toolset is needed only because of the use of a FabricNode and is not needed (or is greatly simplified) for integrated routers. Therefore, it represents additional operational cost.

Conclusion

The use of a disaggregated, CLOS-topology approach to construct a network node is absolutely possible. It enables service providers to:

  • Provision a node with capacity exceeding the biggest integrated systems provided by vendors.
  • Divide the node into smaller failure domains, to manage failures better and with lower capacity degradation.
  • Scale/upgrade node capacity and performance independently of any single vendor's roadmap.
  • Leverage a multivendor strategy for cost control and the use of best-of-breed devices.

However, to benefit from these good properties, good planning, operational toolsets, and discipline are required. It is the service provider's decision whether going the FabricNode way is beneficial for them.

 

[1] We consider cost normalized per Gbps
