Tunnel Localization for higher scale



Problem description


Tunneling was invented to connect networks separated by other independent networks. One classic example is the VPN use case, where remote access users connect to a variety of network resources (corporate home gateways or an internet service provider) through public data networks. Tunneling is simply encapsulating a packet in another packet. The devices at the network entry and exit points perform the encapsulation and de-capsulation respectively, and are therefore responsible for building and maintaining the state that makes tunneling possible, while the rest of the devices in the network simply transport the traffic based on the outer headers.


Because of the complexity involved in the encapsulation/de-capsulation (tunneling) process, this functionality was originally offloaded[1] from the forwarding path and typically implemented in software on external servers. For operational and economic reasons, vendors then came up with tunneling service offerings hosted on so-called service blades present locally on the chassis. With the advent of powerful forwarding ASICs, the tunneling functionality started getting absorbed into packet forwarding, and vendors started offering it inline with the regular forwarding. Though the per-tunnel capacity and the overall forwarding capacity for tunneling have increased, the tunnel scale has remained limited by the tunneling state that needs to be maintained on the chassis. This didn’t seem to be a problem as long as tunneling was used primarily for traffic aggregation. But of late, the use cases for tunneling in data centers and in service-provider wholesale network offerings have proliferated because of the advantages tunneling offers. With ever-increasing traffic demands, vendors are building bigger chassis with more and more forwarding line-cards, but the tunnel scale hasn’t kept pace because of the large memory footprint that tunneling requires. Though the forwarding is distributed across the forwarding-engines/line-cards, the state required for encapsulation and de-capsulation is kept on all of them, which limits the chassis capacity to what a single forwarding-engine/line-card can support.


In this blog we propose a solution called tunnel-localization (aka tunnel-anchoring), with which the chassis tunnel scale can be increased linearly with the number of line-cards. We also discuss various forwarding aspects that one needs to consider when using tunnel-localization.


Tunnel Localization


Figure 1[2] shows the components involved in tunnel traffic forwarding. The controller-card is responsible for setting up the state required for tunnel traffic forwarding. The forwarding-engine, aka Packet Forwarding Engine (PFE), on the line-cards is responsible for tunneling the traffic along with the regular forwarding. Here the tunnel-state needed for forwarding is replicated on all PFEs/line-cards. So, the tunnel scale in the chassis is limited by the amount of memory available on a single line-card/PFE for holding the tunnel forwarding state.



Figure 1 Tunnel forwarding state across chassis components

With tunnel-localization, a tunnel is pinned to a PFE/line-card, so the tunnel-state needed for forwarding is available only on the PFE/line-card anchoring the tunnel. The other PFEs/line-cards on the chassis hold only the state needed to steer the traffic to the tunnel’s anchoring PFE. As depicted in Figure 2, the memory footprint on each line-card consists of the steering-state required for all tunnels on the chassis plus the tunnel-state for the tunnels localized on that line-card/PFE, which allows the chassis to scale to a higher number of tunnels.


Figure 2 Tunnel Forwarding state across chassis components with localization
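
The effect on scale can be estimated with a little arithmetic. The sketch below is only a back-of-the-envelope model, not a description of any particular platform. It assumes that every line-card has the same amount of memory and that tunnels are spread evenly across the cards. The sizes `tunnel_state` and `steering_state` are hypothetical per-tunnel costs in arbitrary units.

```python
def max_tunnels_replicated(mem_per_card: int, tunnel_state: int) -> int:
    """Every card holds the state of every tunnel, so one card is the limit."""
    return mem_per_card // tunnel_state


def max_tunnels_localized(cards: int, mem_per_card: int,
                          tunnel_state: int, steering_state: int) -> int:
    """Largest T that fits when tunnels are spread evenly over the cards.

    Per-card constraint:
        T * steering_state + (T / cards) * tunnel_state <= mem_per_card
    """
    return (mem_per_card * cards) // (steering_state * cards + tunnel_state)


# Illustrative numbers only (arbitrary memory units).
print(max_tunnels_replicated(1_000_000, 1000))         # 1000
print(max_tunnels_localized(8, 1_000_000, 1000, 0))    # 8000
print(max_tunnels_localized(8, 1_000_000, 1000, 100))  # 4444
```

With a negligible steering cost the scale grows linearly with the number of cards. Once the steering cost per tunnel becomes significant, it starts to dominate the per-card budget and the gain from adding cards flattens.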


Tunnel traffic Forwarding


Figure 3 shows the forwarding of IPv4 traffic tunneled over IPv6. The IPv4 packet entering the router goes through a destination lookup, and if the result indicates that it needs to be tunneled, the packet goes through encapsulation. The encapsulated packet then goes through another route lookup based on the outer headers, and the result leads to the egress interface from which the traffic leaves the chassis.


Figure 3 IPv4 packet tunneled into IPv6

Figure 4 shows the forwarding of an IPv4 packet into an IPv6 tunnel with localization. The PFE where the traffic entered is not the anchor for the tunnel, so the traffic gets steered to the anchor PFE and from there the tunnel encapsulation gets triggered.


Figure 4 IPv4 packet tunneled into IPv6 with localization
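
The steering step on the encapsulation path can be sketched as a small simulation. This is a simplified model under stated assumptions: the `Pfe` class, its two tables and the `encapsulate` function are hypothetical names invented for illustration, the addresses come from the IPv6 documentation range, and the second route lookup on the outer header is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Pfe:
    """Hypothetical per-PFE tables; names are illustrative only."""
    pfe_id: int
    tunnel_state: dict = field(default_factory=dict)  # tunnel -> (outer src, outer dst)
    steering: dict = field(default_factory=dict)      # tunnel -> anchor PFE id


def encapsulate(pfes, ingress_id, tunnel, inner_packet):
    """Return (PFEs visited, encapsulated packet) for traffic entering ingress_id."""
    visited = [ingress_id]
    pfe = pfes[ingress_id]
    if tunnel not in pfe.tunnel_state:
        # Not the anchor: take one fabric hop to the PFE holding the tunnel state.
        anchor_id = pfe.steering[tunnel]
        visited.append(anchor_id)
        pfe = pfes[anchor_id]
    outer_src, outer_dst = pfe.tunnel_state[tunnel]
    packet = {"outer_src": outer_src, "outer_dst": outer_dst, "inner": inner_packet}
    return visited, packet


pfes = {
    0: Pfe(0, steering={"t1": 1}),                                    # steering state only
    1: Pfe(1, tunnel_state={"t1": ("2001:db8::1", "2001:db8::ff")}),  # anchor for t1
}
path, packet = encapsulate(pfes, ingress_id=0, tunnel="t1", inner_packet=b"ipv4 payload")
print(path)  # [0, 1]
```

Traffic that happens to enter on the anchor itself is encapsulated without the extra hop.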

In the de-tunneling case, the packets entering the router are already encapsulated, so the destination lookup leads to a local route pointing to a tunnel termination lookup. If a matching entry is found, the packet is subjected to de-capsulation. The inner packet then goes through another round of lookup, this time based on the inner destination address. This is depicted in Figure 5.


Figure 5 IPv4oIPv6 tunnel termination logic

Figure 6 depicts the tunnel termination logic with localization. The PFE where the traffic entered is not the anchor for the tunnel, so the traffic gets steered to the anchor PFE and from there the tunnel de-capsulation gets triggered.


Figure 6 IPv4oIPv6 tunnel termination logic with localization


Tunnel traffic Steering


With tunnel localization, the tunnel-specific forwarding state is available only on the tunnel’s anchor. All other forwarding-engines hold a minimal amount of state, called tunnel steering state, with which traffic is steered to the tunnel’s anchor. As this steering state needs to be present on all the PFEs/line-cards, its footprint dictates the chassis-wide scale one can achieve. Steering entries are needed in both the tunnel origination direction and the termination direction. There are a couple of models by which this steering can be accomplished: per-tunnel steering and aggregate steering.


As the name indicates, per-tunnel steering installs a specific steering entry for every tunnel. It therefore consumes a larger footprint per tunnel and can limit how far the tunnel scale can be increased by adding new line-cards to the chassis. But it is a simple approach and doesn’t need any extra intelligence in how the tunnels are tied to their anchors. The result of each steering entry simply steers the traffic to the anchor if the tunnel is not local; otherwise it performs the tunnel forwarding.


The aggregate steering model best fits IP-based tunnels, where the anchor assignment logic can be based on prefix ranges and only aggregate steering entries need to be installed. For the tunnel termination side, one way to achieve aggregate steering is to assign at least one address per anchor-PFE, so that the destination lookup steers the traffic directly using the outer headers instead of delving into the per-tunnel state. For example, in the IPv4oIPv6 use case depicted in the diagrams, each anchor-PFE is assigned a separate IPv6 address and the steering of tunnel-termination traffic is performed using these IPv6 addresses[3]. The memory footprint required for steering the traffic is minimal with this aggregate steering model.
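
As an illustration of termination-side aggregate steering, the sketch below keeps one steering entry per anchor-PFE address instead of one per tunnel. The table contents and the function name `steer_termination` are assumptions made for this example, and the addresses are taken from the IPv6 documentation range.

```python
import ipaddress
from typing import Optional

# Hypothetical steering table: one local IPv6 tunnel endpoint address per anchor PFE.
TERMINATION_STEERING = {
    ipaddress.ip_address("2001:db8::1"): 0,
    ipaddress.ip_address("2001:db8::2"): 1,
    ipaddress.ip_address("2001:db8::3"): 2,
}


def steer_termination(outer_dst: str) -> Optional[int]:
    """Return the anchor PFE for an encapsulated packet, using only the outer header."""
    return TERMINATION_STEERING.get(ipaddress.ip_address(outer_dst))


print(steer_termination("2001:db8::2"))   # 1
print(steer_termination("2001:db8::99"))  # None (not a local tunnel endpoint)
```

The table size depends on the number of anchor PFEs, not on the number of tunnels terminating on them.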


So, the key for achieving higher scale is to find a mechanism using which aggregate steering can be achieved for the tunneling technology of interest[4].
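
For identifier-based tunnels, one possible way to get aggregate steering is to carve the identifier space into one contiguous block per anchor, so that the anchor can be derived from the identifier itself. The sketch below shows the idea for a label-like integer. The base value, block size and number of anchors are hypothetical and chosen only for illustration.

```python
from typing import Optional

ID_BASE = 1000         # hypothetical first identifier handed out
IDS_PER_ANCHOR = 1000  # hypothetical block size per anchor PFE
NUM_ANCHORS = 4        # hypothetical number of anchor PFEs


def id_block(anchor: int) -> range:
    """Identifiers that may be assigned to tunnels anchored on this PFE."""
    start = ID_BASE + anchor * IDS_PER_ANCHOR
    return range(start, start + IDS_PER_ANCHOR)


def anchor_for_id(tunnel_id: int) -> Optional[int]:
    """Derive the anchor from the identifier alone; no per-tunnel entry is needed."""
    anchor = (tunnel_id - ID_BASE) // IDS_PER_ANCHOR
    return anchor if 0 <= anchor < NUM_ANCHORS else None


print(anchor_for_id(1999))  # 0
print(anchor_for_id(2000))  # 1
print(anchor_for_id(999))   # None
```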



Tunnel Anchor assignment


For tunnel-localization, one of the main tasks is to pick the anchor PFE to which the tunnel needs to be pinned, and there are a couple of ways to accomplish this. One is static assignment, and the other is to have the tunnel installation entity pick the anchor-PFE dynamically.


In the static case, the user configures certain prefix ranges and assigns each of them an anchor PFE, and the control-plane pins the tunnels to that anchor-PFE during tunnel setup. This mechanism is suitable for routing-protocol-driven tunnels (for example, BGP route installation), where this prefix-based configuration gets evaluated as part of tunnel setup[5].
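
A minimal sketch of static assignment is shown below. It assumes that the configured prefix ranges are matched against an IPv4 address associated with the tunnel and that the most specific range wins; neither detail is stated by the text. The table `STATIC_ANCHORS` and the function `anchor_for` are hypothetical names.

```python
import ipaddress
from typing import Optional

# Hypothetical operator configuration: prefix range -> anchor PFE.
STATIC_ANCHORS = [
    (ipaddress.ip_network("10.0.0.0/9"), 0),
    (ipaddress.ip_network("10.128.0.0/9"), 1),
    (ipaddress.ip_network("10.128.64.0/18"), 2),  # more specific override
]


def anchor_for(address: str) -> Optional[int]:
    """Pick the anchor configured for the longest matching prefix range."""
    ip = ipaddress.ip_address(address)
    matches = [(net.prefixlen, pfe) for net, pfe in STATIC_ANCHORS if ip in net]
    if not matches:
        return None  # no configured range covers this address
    return max(matches)[1]


print(anchor_for("10.1.2.3"))     # 0
print(anchor_for("10.200.0.1"))   # 1
print(anchor_for("10.128.65.1"))  # 2
```

Because the mapping depends only on configuration, every chassis that carries the same configuration arrives at the same anchor for a given prefix.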


The other way is to let the control-plane pick the anchor-PFE dynamically by following a constraint-based algorithm that takes forwarding resources into consideration. This requires a daemon that continuously monitors the resources and, at tunnel setup time, picks the PFE based on the configured constraints.
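
One very simple constraint-based policy is "pick the PFE that is up and has the most free memory". The sketch below implements only that policy; the `PfeResources` record, the `tunnel_cost` parameter and the tie-breaking rule are assumptions, and a real algorithm would likely weigh more resources than memory.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PfeResources:
    """Hypothetical resource snapshot kept by the monitoring daemon."""
    pfe_id: int
    free_memory: int
    up: bool = True


def pick_anchor(pfes: list, tunnel_cost: int) -> Optional[int]:
    """Pick the PFE with the most free memory that can still hold the tunnel."""
    candidates = [p for p in pfes if p.up and p.free_memory >= tunnel_cost]
    if not candidates:
        return None  # no PFE satisfies the constraints
    # Most free memory first; the lowest PFE id breaks ties.
    best = max(candidates, key=lambda p: (p.free_memory, -p.pfe_id))
    best.free_memory -= tunnel_cost  # account for the newly anchored tunnel
    return best.pfe_id


pfes = [PfeResources(0, 100), PfeResources(1, 300), PfeResources(2, 300, up=False)]
print(pick_anchor(pfes, 150))  # 1
print(pick_anchor(pfes, 150))  # 1
print(pick_anchor(pfes, 150))  # None
```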



Tunnel Anchor failure handling


When the tunnels are not localized, PFE fail-over is automatically taken care of by the network routing infrastructure. But with tunnel localization, a failure can occur on the anchor-PFE and may not get repaired by the network infrastructure, so there is a need for anchor-PFE failure handling. The failure handling can be accomplished using a couple of models: global repair and local backup.


Due to government regulations, telecom providers are typically obligated to have deployments that provide chassis backup and also site backup. In these deployments, the tunnel state is replicated not only on the device that’s doing the forwarding but also on the backup chassis (there can be more than one backup). So, asking providers to have another PFE backing up each anchor locally on the chassis is not an option. Typically, providers distribute the traffic equally across all of the chassis by using routing policies, and when a device fails, the traffic gets rerouted to the backup device. As anchor PFEs are tied to prefixes, the failure of an anchor PFE can be handled by making the control-plane withdraw the addresses that are tied to that anchor. This way the traffic is rerouted to the network backup for the duration of the anchor-PFE fail-over. As dynamic assignment of the tunnel anchor-PFE may not pick the same anchor on all the devices, the only option left for operators in this model is to use static anchor assignment.
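
The control-plane side of this global repair can be sketched as a lookup of the prefixes tied to the failed anchor. The mapping and the function name are hypothetical, the prefixes come from the IPv6 documentation range, and the actual withdrawal through the routing protocol is outside the scope of the example.

```python
def prefixes_to_withdraw(anchor_prefixes: dict, failed_pfe: int) -> list:
    """Return the advertised prefixes that are tied to the failed anchor PFE."""
    return sorted(prefix for prefix, pfe in anchor_prefixes.items() if pfe == failed_pfe)


# Hypothetical mapping of advertised prefixes to their anchor PFEs.
anchor_prefixes = {
    "2001:db8:0:1::/64": 0,
    "2001:db8:0:2::/64": 1,
    "2001:db8:0:3::/64": 1,
}
print(prefixes_to_withdraw(anchor_prefixes, failed_pfe=1))
# ['2001:db8:0:2::/64', '2001:db8:0:3::/64']
```

Prefixes anchored on healthy PFEs stay advertised, so only the tunnels on the failed anchor move to the backup chassis.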


The other way to handle anchor-PFE failure is to have a local backup, which can be 1:1 or N:1. In the 1:1 case, the state is replicated at the time the tunnel is set up, so upon failure of one PFE the backup can take over as soon as the failure is detected. In the N:1 case, when a failure is detected, the backup takes over only once the control-plane has programmed the backup anchor-PFE with the forwarding state for the tunnels anchored on the PFE that went down. Depending on the economics of the deployment and the traffic-loss requirements, one of these models can be chosen.
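
The difference between the two local backup models is essentially how much state has to be programmed after the failure is detected. The sketch below models only that aspect; the `Pfe` class and both functions are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Pfe:
    pfe_id: int
    tunnels: set = field(default_factory=set)  # tunnels with forwarding state installed


def install_tunnel(anchor: Pfe, tunnel: str, dedicated_backup: Optional[Pfe] = None) -> None:
    """Install tunnel state; with 1:1 backup the state is replicated at setup time."""
    anchor.tunnels.add(tunnel)
    if dedicated_backup is not None:
        dedicated_backup.tunnels.add(tunnel)


def program_backup(failed: Pfe, backup: Pfe) -> int:
    """On failure, program whatever the backup lacks; return the number of entries added."""
    missing = failed.tunnels - backup.tunnels
    backup.tunnels |= missing
    return len(missing)


# 1:1 backup: PFE 1 mirrors PFE 0 from the moment each tunnel is set up.
anchor_a, backup_a = Pfe(0), Pfe(1)
for t in ("t1", "t2"):
    install_tunnel(anchor_a, t, dedicated_backup=backup_a)

# N:1 backup: PFE 3 is a shared spare that holds no tunnel state until a failure.
anchor_b, shared = Pfe(2), Pfe(3)
for t in ("t3", "t4"):
    install_tunnel(anchor_b, t)

print(program_backup(anchor_a, backup_a))  # 0
print(program_backup(anchor_b, shared))    # 2
```

The 1:1 model pays with a dedicated backup PFE per anchor, while the N:1 model pays with the programming delay after the failure.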


Latency and fabric bandwidth


One of the drawbacks of tunnel localization is that it introduces an extra hop in the forwarding path inside the chassis. The latency incurred is much lower than queueing-related latency, but it still needs to be taken into consideration when using localization.


In addition, this extra hop eats into the available fabric bandwidth. One way to mitigate this is to make sure the anchor PFE is either the ingress or the egress forwarding engine. This is not always possible, particularly with static anchor assignment. But with dynamic assignment, the algorithm constraints can be augmented with extra logic to pick the anchor based on the exit/entry PFE information. Typical deployments invariably use link aggregation with member ports distributed across PFEs. So if the forwarding is augmented to pick an exit link on the anchor PFE instead of the one selected by the computed hash, the additional fabric hop can be avoided.
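
The link-aggregation optimization can be sketched as a member selection that prefers links local to the anchor PFE and falls back to the normal hash otherwise. The member list, the port names and the use of CRC32 as the hash are assumptions made for illustration only.

```python
import zlib


def pick_lag_member(members: list, flow_key: bytes, anchor_pfe: int) -> tuple:
    """Pick an exit link, preferring members that sit on the anchor PFE.

    members is a list of (port_name, pfe_id) tuples.
    """
    local = [m for m in members if m[1] == anchor_pfe]
    pool = local or members  # fall back to all members if none is on the anchor
    return pool[zlib.crc32(flow_key) % len(pool)]


# Hypothetical aggregate with member links spread over two PFEs.
members = [("port-0", 0), ("port-1", 1), ("port-2", 1)]
print(pick_lag_member(members, b"flow-a", anchor_pfe=0))  # ('port-0', 0)
```

Restricting the choice to anchor-local members trades some load-balancing evenness across the aggregate for the saved fabric hop.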


Fragmentation and reassembly

One of the issues that tunnel end-points need to deal with is MTU. Typically, end-to-end MTU discovery ensures that the packet size after encapsulation doesn’t exceed the MTU, but this is not always guaranteed, so the end devices may need to fragment packets during tunneling. If the outer packet gets fragmented by the tunnel originator, the fragments need to be reassembled at the terminating end before the inner packet can be forwarded. For that reassembly to work, all the fragments MUST arrive on the same PFE. Tunnel-localization takes care of this inherently, because all forwarding for a tunnel is done from the tunnel anchor.
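
The need for a common reassembly point can be seen in a simplified reassembly routine. The sketch below is a toy model: offsets are counted in bytes rather than in 8-byte units, and duplicate or overlapping fragments are treated as incomplete. The function name and the fragment tuple layout are assumptions.

```python
from typing import Optional


def reassemble(fragments: list) -> Optional[bytes]:
    """Reassemble (offset, more_fragments, data) tuples; None if anything is missing."""
    expected = 0
    payload = bytearray()
    saw_last = False
    for offset, more_fragments, data in sorted(fragments):
        if offset != expected:
            return None  # gap (or overlap) in the fragment chain
        payload += data
        expected += len(data)
        saw_last = not more_fragments
    return bytes(payload) if saw_last else None


frags = [(0, True, b"abcd"), (4, True, b"efgh"), (8, False, b"ij")]

# All fragments land on the anchor PFE: reassembly succeeds.
print(reassemble(frags))       # b'abcdefghij'

# Fragments sprayed across two PFEs: neither PFE can complete the packet.
print(reassemble(frags[::2]))  # None
print(reassemble(frags[1:2]))  # None
```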


Other forwarding features

Some forwarding functions applied to tunneled traffic, such as policing, filtering, and flow-monitoring, can only be performed correctly if all of the traffic specific to a tunnel lands on a single PFE. Tunnel-localization meets this requirement naturally, because all traffic for a tunnel passes through its anchor.


Summary

Tunnel localization distributes the forwarding state by pinning the tunnels to forwarding engines, either by dynamic selection or by static configuration. By avoiding the duplication of state on the PFEs, it helps improve the chassis-wide scale. Tunnel forwarding state is present only on the anchor, so the rest of the PFEs need to hold state for steering the traffic to the tunnel anchor. Aggregate steering models can be used to scale up linearly as more line-cards are added to the chassis. Anchor failure can be handled with either global repair or a local backup, depending on the deployment model. If packets need reassembly, the fragments need to be reassembled at a common point prior to de-tunneling, and that common point can be the tunnel anchor. Other forwarding features like policing and flow monitoring also fit well with anchoring.


[1] This technique was adopted to avoid impact to the line-rate forwarding.

[2] The diagrams in this document depict the state and forwarding treatment of IPv4 traffic tunneled over an IPv6 network.

[3] If for some reason it is not possible to assign more IPv6 addresses, aggregate steering can still be used on the tunnel termination side by looking up the inner IPv4 source address in the aggregate steering tables.

[4] For MPLS-based tunnels, the label space can be divided into ranges, with each label assigned from the range that corresponds to where the tunnel is anchored. For GTP tunnels, the TEID assignment can follow a similar approach.

[5] Some customers like this approach as they have full control over where and how the tunnels are anchored.