Multi-path forwarding has been deployed extensively in various levels and scales of networks as a way to increase network capacity and resiliency. For multi-path, there are two areas that are of particular interests:
How to achieve balanced distribution of traffic among the multiple paths; and
How to avoid packet re-ordering while packets are travelling on the different paths.
Although the end-to-end path-weighted hash in Virtual Chassis Fabric (VCF)  takes into account the end-to-end path bandwidth ratio as the weight of hashing, the hashing itself is static since it only uses packet fields in determining flows’ link selection. Such static distribution mechanism may work poorly when the number of flows are too few, or when a small number of flows are disproportionally larger than most other flows. These larger and smaller flows are sometimes referred to as “elephant” flows and “mice” flows.
The “elephant” and “mice” problem in the data center and its potential solutions have been well summarized in . The conventional wisdom points to an “identification and steering” solution, i. e. first identifying the “elephant” flows and then applying different forwarding or steering policies on them. While the approach described in this blog is primarily aimed at achieving a balanced distribution of traffic, it also alleviates the “elephant” and “mice” problem as a side effect.
Adaptive Flowlet Splicing (AFS) is designed based on the observation that an “elephant” flow can be divided into a plural of short bursts, which are called flowlets in this blog. AFS has gone beyond traditional hashing in the following aspects:
With AFS, flowlets of a single flow can be assigned to different member links during the lifetime of a flow;
Flow reassignment, which is based on flowlet, is done in a way not to cause packet re-ordering.
To be able to adapt to uneven sizes of flows, frequent measurements of links’ loads and queue depths are used in determining link assignment;
To reduce the chance of hash-collision, an indirect hash table is used so that each hashed entry can be independently assigned a link.
Dynamic Load Balancing (DLB) based on the Adaptive Flowlet Splicing mechanism will be a feature on VCF. It will be provided as a configuration option in addition to VCF’s end-to-end path-weight based load balancing mechanism as described in .
Adaptive Flowlet Splicing
What are Flowlets?
Flowlet represents a small burst in a flow. Flowlet, as in the context of this blog, is a term appeared as early as 2004 in . The discovery of flowlets is the key enabler of better load balancing, hence the use of this terminology in this blog. One established reason  for the existence of flowlets is that TCP transmission is controlled by a window and the servers (aka senders) can finish sending the maximum allowed bytes before they can receive the ACKs. Most prominent “elephant” flows are TCP flows.
Although there have been various mechanisms, such as SACK, window shifting etc., which change the TCP windowing effect, TCP’s burstiness has always been prominent at the round-trip time (RTT) scales and even sub-RTT scales [1,7,8]. The higher speed ports on newer servers, as in modern data centers today, make TCP’s burstiness even more pronounced.
How to Avoid Packet Re-ordering?
There has been various solutions to the “elephant” and “mice” problem that
1) disregards packet re-ordering (such as round robin or randomized spraying ); or
2) introduces occasional packet re-ordering (such as sampling, flow-tracking, and other proposals listed in ).
Packet re-ordering during transmission creates a problem that needs to be solved, either by the network or by the end stations, and the solution is not going to be cost-free. Hence a network approach which ensures in-order packet delivery is preferred.
Regardless how flowlets are created (some of the reasons may not have been fully explored), as long as they exist, they provide natural delineation points for splitting elephant flows apart for achieving better load balancing while, at the same time, guaranteeing in-order delivery of packets.
It is important to notice that the requirement of in-order delivery is only from the start of a multi-path to the end of the same multi-path. As pointed out by , if a flow’s path switch happens in-between flowlets and the inactive interval between flowlets are longer than the latency skew of the multi-path, the path-switching is not causing packet re-ordering.
Key Components of Adaptive Flowlet Splicing
The Adaptive Flowlet Splicing mechanism is illustrated by the following diagram.
A key element in the adaptive flowlet splicing mechanism is the Hash Bucket Table. Each AFS-enabled fabric interface bundle is allocated a region in the Hash Bucket Table. Packets egressing on a bundle is hashed to a bucket in this region. Each bundle’s region may be sized in the hundreds or thousands of buckets. It is big enough so that collisions of “elephant” flowlets are rare.
Each hash bucket entry has a timestamp of last activity. For each packet, the time elapsed since the last packet received is compared with an inactivity timer threshold. If the hash bucket has been inactive longer than the threshold value, the hash bucket entry is eligible to be assigned a new link. If not, the packet must be assigned to the existing link recorded in the entry. Note the eligibility of an entry in the table signifies the detection of a flowlet boundary.
Another important enabler of AFS is the real-time measurement of the relative “quality” of member links. When a hash bucket entry is eligible to be assigned a new link, the member link with the best quality is assigned to the newly detected flowlets. The link “quality” is a moving average of the link’s load and its queue depth. Normally, the least utilized and least congested link among the members of fabric interfaces is selected for assigning new flowlets.
In a VCF built with QFX-5100 switches in a spine-and-leaf topology, as shown in the picture on the left, the fabric latency for unicast traffic is always 3 hops. The minimum and maximum latency are roughly 2 and 5 micro seconds, respectively, i.e. the latency skew is around 3 micro seconds. Hence an inactivity interval threshold of bigger than 3 micro seconds will be able to avoid packet re-ordering regardless which path a flowlet takes inside the VCF. Such a short interval provides plenty of opportunities for the VCF to spray flowlets to different paths within the fabric, hence it can achieve much more accurate load distribution. Note, AFS on VCF’s remote-destination fabric interfaces will split elephant flowlets onto multiple spine switches.
How Do the Elephants Avoid Collision?
Recent studies [8, 9] on data center traffic profiles have revealed that, even with very low link utilizations for most of the time, traffic loss is still observed at spikes of traffic, posing an acute challenge in data center network design. The source of the problem is the use of hashing as the sole factor for load balancing.
A good hashing outcome depends on the statistical multiplexing effect. Hence, for hash-based traffic distribution to be effective, two key assumptions need to hold true:
The number of flows in the aggregated traffic stream is large, and
The load of the flows is roughly equal or similar.
The 1st assumption is mostly true in the ISP networks. Studies  on real network traffic have shown that the number of concurrently active flows, on core or edge routers, typically ranges from a few thousands to tens of thousands. However, such assumptions are no longer true in modern data center networks. Recent studies [3, 8, and 9] have shown that, in data center networks, the number of concurrent flows is usually less than a few thousands or even less than one thousand. What makes things even worse is the fact that, among the fewer number of flows, an even smaller number of “elephant” flows, sometimes numbered in the single digit, are so large and longer-lasting so that they occupy a significant portion (e.g. greater than 90% as in ) of the total bytes transmitted within the data center. Since a flowlet is a short burst in a flow, concurrent flowlets become even fewer. The chart below intends to show the relative scales (not intending to be accurate) of flows/flowlets as analyzed the various studies to demonstrate the contrast in these environments.
As can be seen in the figure above, a traditional hash-based load distribution becomes the distribution of a “handful” of “elephant” flows to a “couple” of links, hence frequent hash collisions and packets drops are not a surprise at all. An adaptive mechanism, such as the AFS on VCF, becomes necessary in such environments.
Since AFS still uses a hash, as seen in the previous figure, can the “elephant” flowlets avoid hashing into the same hash bucket entry? To understand its effectiveness, we must recognize two significant characteristics in AFS:
1) A “handful” of elephant flows are hashed into a region sized in the hundreds or thousands of entries. Hence a hash collision in the hash bucket table is greatly reduced;
2) Since flowlets are short-lived, compared to elephant flows, elephant flowlets represent a small percentage (e.g. less 10% as in ) of the time scale. Hence, “elephant” flowlet collision is even less likely.
On the other hand, fewer concurrent flowlets indicate that even a reasonable region size (e.g. a few hundred or a thousand) in the hash bucket table should be enough to avoid hash collision for “elephant flowlets” in most cases.
Adaptive Flowlet Splicing provides an effective dynamic load balancing mechanism that performs much better than traditional hash-based load distribution mechanism, especially when dealing with a small number of “elephant” flows in the presence of “mice” flows.
Although load balancing is not the only problem of “elephant” and “mice”, it is a thorn in the side of data center network architects. Taking this thorn out allows the conventional approaches, such as oversubscription ratios, QoS, buffering planning etc., to be effectively applied to data center network designs.
As a summary, AFS is effective because:
TCP is bursty at small time scales so that an “elephant” flow can always be broken apart into small flowlets;
AFS takes the load and queue-depth information into account when assigning flowlets to links;
In the silent periods between flowlets, the same flows can be moved around without causing packet re-ordering. There are significantly more chances to rebalance, hence AFS can achieve more accurate load distribution;
Much fewer simultaneously active flowlets will require a reasonably-sized Hash Bucket table to be effective.