The use of crossbar switches helped with the HOLB issue, but it did not solve it. So the next evolution was to put small buffers at each of the crossbar switch's input ports, one for each of the crossbar switch's output ports, as illustrated in Figure 6. Each of these small buffers was called a Virtual Output Queue (VOQ), because it was a queue that resided on the ingress side of the switch fabric for each and every output port on the switch fabric. They were virtual in the sense that they typically used a common memory, and the memory was only consumed when there was traffic destined for a particular output.
Important Note: A crossbar switch port is NOT equivalent to a router's external port. There are typically many (2 to 100s of) external router ports contained within a single crossbar switch port. For instance, if a crossbar switch port's bandwidth is 1.2 Gbps, then there might be 10x100Mbps external ports mapped to that crossbar port, leaving a 20% margin. Generally, switch ports are aggregated into a larger virtual port that represents a connection to a packet forwarding engine, and the traffic is sprayed across the individual crossbar switch links for all physical external ports attached to the corresponding packet forwarding engine. The way traffic traverses the switch fabric varies from product to product and is not the topic of this blog. The blog will be using a generalized view of a crossbar switch.
With the added queues, one could say that HOLB had been virtually eliminated, at least theoretically. This of course depended upon the size of the queue and the distribution of traffic. If the traffic were perfectly distributed and there was not any speed up in the switch fabric, then one could argue that the queue size need only be 25% (percent time blocked) times the largest packet size. Obviously, the traffic is not perfectly distributed, so the statistical multiplexing/distribution argument was made, which essentially says that over a certain period of time the traffic will be statistically distributed evenly across the active ports.
There are a large number of factors that exposed HOLB issues, albeit perhaps at a finer granularity, such as:
- Number of external ports grew
- Bandwidth of external ports grew
- Traffic distribution was not equal
- Role-dependent traffic patterns (Core, Provider Edge, Customer Edge, etc.)
- Changing network paradigms, such as Netflix traffic
How much faster does your switch fabric need to be than your ports? Consider a system where two external ports are serviced by one switch fabric port, and let's assume the switch port's bandwidth is exactly the sum of the external ports'. In theory, the switch fabric has enough bandwidth to keep the two external ports busy.
But what happens when one external port's (Port X.1) packet size is 100 times that of the other port (Port X.2)? The packets for port X.2 will continuously be stuck waiting for one large packet to be transmitted across the switch fabric to port X.1. This argues that the VOQ buffer needs to be at least the size of the largest packet one could get stuck behind. Not a big deal. But this issue has tertiary effects, such as a large amount of jitter on the output, and unless the fabric scheduler were perfect, it would result in lost external transmission opportunities, i.e. lost egress bandwidth. Thus, the concept of cellification in the switch fabric was introduced. The idea is to break packets into much smaller units called cells, allowing the sender to alternate sending data to each destination, which reduces the gap in receiving data at the receiver for specific destinations. This is very similar to Asynchronous Transfer Mode (ATM) networks, which tried to become the networking standard in the late 1980s.
The size of the cell varies across vendors; it is typically in the range of 64 bytes to a couple of hundred bytes. Part of the equation is that each cell needs a header so that the receiver can properly reassemble the cells back into packets. The header typically carries some sort of source/destination fabric address, sequence numbers, start/middle/end-of-packet indicators, etc. All of this is considered overhead and decreases the fabric's efficiency. For example, let's say our cell size is 100 bytes and our header is 10 bytes; then our transmission efficiency is 100/(100+10) = 91%. So if we need our switch fabric port to deliver, say, 100Gbps of bandwidth, then our switch fabric actually needs to run at at least 110Gbps (100Gbps/91%).
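As a sanity check on the arithmetic above, here is a small Python sketch of the cell-overhead math. The 100-byte payload and 10-byte header are this example's assumed values, not any particular vendor's.

```python
# Cell payload and header sizes are the example's assumed values.
CELL_PAYLOAD = 100   # bytes of packet data carried per cell
CELL_HEADER = 10     # bytes of fabric header per cell

def fabric_efficiency(payload: int, header: int) -> float:
    """Fraction of raw fabric bandwidth that carries actual packet data."""
    return payload / (payload + header)

def required_fabric_rate(port_gbps: float, payload: int, header: int) -> float:
    """Raw fabric rate needed to deliver port_gbps of packet data."""
    return port_gbps / fabric_efficiency(payload, header)

eff = fabric_efficiency(CELL_PAYLOAD, CELL_HEADER)            # ~0.909
rate = required_fabric_rate(100.0, CELL_PAYLOAD, CELL_HEADER)  # 110 Gbps
print(f"efficiency = {eff:.1%}, fabric rate needed = {rate:.0f} Gbps")
```

Note that this is a best case: a packet one byte larger than a multiple of the cell payload still consumes a whole extra cell, so real efficiency also depends on the packet-size distribution.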
By adding cells to our switch fabric, we can virtually eliminate jitter and loss of transmit opportunities due to packet size differentials.
What happens when an egress port is oversubscribed (more data is received on the ingress than the egress can transmit)? As an example, let's refer back to Figure 6 Simple Oversubscription Example. If ingress switch ports A and B are sending at full capacity to only egress switch port X, then the ingress queues for X on both port A and port B will eventually overflow, because they are filling at twice the rate that they are draining. These are considered uncontrolled or unintelligent drops, as no egress drop policy has been applied to the packets; they are dropped on the ingress port side. Uncontrolled/unintelligent drops are very bad.
They also can lead to another form of HOLB. Consider the simple example shown in Figure 7 Simple Oversubscription Example, where each switch port has two physical ports associated with it (A.1-A.2, B.1-B.2). For this example, let's assume each port is 10Gbps and each switch fabric port is 20Gbps. Here is the traffic arrival:
- A.1 to X.1 = 5 Gbps (red arrow)
- A.2 to X.1 = 5 Gbps (red arrow)
- B.1 to X.1 = 5 Gbps (red arrow)
- B.2 to X.2 = 10 Gbps (green arrow)
So each of the ingress ports is at only 50% of line rate. However, switch port X is oversubscribed: a total of 25Gbps (5 + 5 + 5 + 10) is destined for a 20Gbps port. This means that 5Gbps is going to be dropped from the ingress queues. Again, these will be unintelligent drops! But even more troubling is: which traffic will be dropped?
For this analysis, let's assume that a Deficit Round Robin (DRR) scheduler is used to schedule between switch port A and switch port B. With the current knowledge of the system, DRR scheduling is the best we could hope for. Therefore, the DRR scheduler for egress port X should take 10G of traffic from ingress switch port A and 10G of traffic from port B, for a total of 20G (its maximum). So, port A has no issues, because all of its 10G of ingress traffic made it to the egress port X.
However, Port B has 15G of traffic that can only be serviced at 10G. Thus, the ingress queues on Port B will be continuously overflowing at a 5G rate. But whose traffic will these drops come from? What is fair? What is desired? If you ask a group of networking engineers, you will likely hear a multitude of answers.
For this example, we will assume each port in the system has very little knowledge of what is going on at any other port. We will also assume a constant packet size. This means that 1/3rd of the packets arriving at switch port B, or 5G, will be dropped. We will also assume normalized arrival and drain rates, so the 5G of dropped traffic should be spread relative to each port's actual ingress bandwidth: Port B.1's drops will be 5Gbps * (5Gbps/15Gbps) = 1.67Gbps and Port B.2's drops will be 5Gbps * (10G/15G) = 3.33Gbps. This means that Port X.2 can only transmit 6.67Gbps of its 10Gbps capability even though the system received 10Gbps for that port!!! The results are summarized in Table 2 Simple Oversubscription Results.
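The proportional drop arithmetic above can be sketched in Python. The even spreading of drops across a port's inputs is this example's assumption, not a universal behavior, and the helper name is illustrative.

```python
# Sketch of the Figure 7 drop arithmetic, assuming drops on an ingress
# switch port are spread in proportion to each external port's arrival rate.
def ingress_drops(arrivals_gbps: dict, service_gbps: float) -> dict:
    """Per-port drop rate when total arrivals exceed the fabric service rate."""
    total = sum(arrivals_gbps.values())
    excess = max(0.0, total - service_gbps)
    return {port: excess * rate / total for port, rate in arrivals_gbps.items()}

# Switch port B: 5G from B.1 and 10G from B.2, serviced at only 10G by DRR.
drops = ingress_drops({"B.1": 5.0, "B.2": 10.0}, service_gbps=10.0)
print(drops)  # B.1 loses ~1.67G, B.2 loses ~3.33G
# X.2 can therefore transmit only 10 - 3.33 = 6.67G of its 10G capacity.
```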
How do we solve this problem? The easy answer is to make the switch port's bandwidth greater than the sum of its ports' bandwidth, also referred to as fabric speedup. But how much greater? For this example, increasing the switch port bandwidth from 20G to 25G, a 25% increase, would solve the problem. One could argue that each egress switch port needs to be able to sink all of the ingress ports' bandwidth. This would obviously be too cost prohibitive on a large, high-speed system.
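The speedup arithmetic is simple enough to sketch (the helper below is hypothetical, not any product's formula):

```python
# Fabric speedup: how much faster a fabric port must run than its nominal
# rate to carry all the ingress traffic that can be aimed at it.
def speedup_needed(total_ingress_gbps: float, fabric_port_gbps: float) -> float:
    """Extra fabric bandwidth, as a fraction of the fabric port rate."""
    return total_ingress_gbps / fabric_port_gbps - 1.0

# 25G of aggregate ingress destined for a 20G fabric port -> 25% speedup.
print(f"{speedup_needed(25.0, 20.0):.0%}")
```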
Regardless, fabric speedup is an integral part of modern router design and thus a significant factor in the cost of the router. Typical fabric speedups range from 25% to more than 300% of the required per-port bandwidth. Another reason for extending the VOQ concept is to reduce the amount of fabric bandwidth required to avoid the unintelligent-drop scenarios.
As was shown in the previous section, having a single VOQ per egress switch port leaves the system vulnerable to HOLB and thus unintelligent drops. The next generation of VOQ architectures implemented priority-based VOQs, as illustrated in Figure 8. Now instead of a single queue per egress switch port, there are 3 queues: High, Medium and Low priority in this example. The ingress would then need to classify the incoming packets into one of the priority queues. With priority-based VOQs, the thought was that the drops would be "more" intelligent and that high-priority traffic would see lower latency.
So, while this did help provide special treatment for high-priority traffic (lower latency and drop resistance), it did little to solve the oversubscription problem discussed in the Oversubscription subsection, and it actually introduced some new problems, such as priority-based Denial of Service (DoS) attacks. Some of these issues were addressed using ingress policers to prevent priority traffic from dominating the switch fabric links.
Priority-based VOQs also raise the question of how they should be scheduled: Strict Priority, WRR or RR? Also, should the priority be propagated into the switch fabric to allow one ingress port with higher-priority traffic to take precedence over another switch port with lower-priority traffic?
Because of the port oversubscription illustrated in the Oversubscription section, and the fact that the number of external ports aggregated under a single switch fabric port was growing, the next candidate in the VOQ vernacular was the creation of Port Group queuing on the ingress line cards. This concept is illustrated in Figure 9. Note that we still have the same number of switch fabric links: 2x2. On the right-hand side of the diagram we have PG X.1-2 and PG Y.1-2. PG stands for Port Group, and in this diagram we have two external ports per PG and two PGs per switch port.
Now on the left-hand side, in the orange boxes labeled VOQs, we have a queue for each PG in the system. The queues would be serviced in DRR fashion among themselves, unless of course one PG had a different aggregate bandwidth, in which case a scheduling algorithm like Deficit Weighted Round Robin (DWRR) would be used.
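For the curious, here is a minimal Python sketch of Deficit (Weighted) Round Robin as it might arbitrate between PG VOQs. The queue names, quanta, and packet sizes are illustrative; a real implementation would maintain an active-queue list rather than scanning every queue each round.

```python
from collections import deque

# Minimal DWRR sketch: each VOQ gets a quantum proportional to its weight,
# and a queue may send a packet only when its accumulated deficit covers
# the packet's size. Queues hold packet sizes in bytes.
def dwrr_round(queues: dict, deficits: dict, quanta: dict, sent: list):
    """Run one DWRR round over the queues, appending (queue, size) to sent."""
    for name, q in queues.items():
        if not q:
            deficits[name] = 0       # no backlog: deficit does not accumulate
            continue
        deficits[name] += quanta[name]
        while q and q[0] <= deficits[name]:
            size = q.popleft()
            deficits[name] -= size
            sent.append((name, size))

# Two equal-bandwidth PGs -> equal quanta (plain DRR).
queues = {"PG_X.1": deque([300, 300, 300]), "PG_X.2": deque([500, 400])}
deficits = {name: 0 for name in queues}
quanta = {"PG_X.1": 500, "PG_X.2": 500}
sent = []
dwrr_round(queues, deficits, quanta, sent)
print(sent)
```

Over many rounds each backlogged queue's throughput converges to its quantum's share of the link, which is exactly the fairness property the PG scheduler needs.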
Does Port Group VOQ help the oversubscription issue identified in the Oversubscription subsection? We will use the exact same example to explore this, except external port X.1 will be X.1.0 and external port X.2 will be X.2.0.
If we assume that all the fabric scheduling decisions are made in the ingress chip, then we will have the exact same issue as before. So, we need more than just ingress Port Groups: we need to add a mechanism that provides congestion information from the egress Port Groups back to the ingress switch ports. This feedback is illustrated in Figure 9 by the solid red arrows leading from the egress Port Groups back to the individual ingress VOQs that represent each PG. This feedback can be implemented in a multitude of ways, such as out-of-band back-pressure signals or in-band back pressure (using existing fabric links). It could also use a credit-based system between the ingress and egress sub-systems. They all have their pros and cons, and this really isn't germane to the topic at hand.
Let’s just assume that the egress PG can signal each ingress switch port that the PG is congested. Back to our example from Figure 7, the traffic arrival is:
- A.1 to X.1 = 5 Gbps (red arrow)
- A.2 to X.1 = 5 Gbps (red arrow)
- B.1 to X.1 = 5 Gbps (red arrow)
- B.2 to X.2 = 10 Gbps (green arrow)
Recall that Port Group X.1 has a bandwidth of 10G and X.2 has a bandwidth of 10G. Since there is 15G of traffic destined for X.1's 10G of bandwidth, it is safe to assume that all ingress X.1 VOQs will have back pressure asserted on them 1/3 of the time.
Now the X.1 VOQ for switch Port A has a load of 10G (A.1 + A.2). Since X.1 is back-pressured a third of the time, then a third of that 10G load will be dropped. This means that Port A will only consume 6.67G of the 20G of BW that Switch Port X has available. Thus leaving 13.33 G of port X’s bandwidth available for Port B to use.
From switch port B's perspective, it is filling two VOQs: X.1 at 5Gbps and X.2 at 10Gbps, for a total of 15Gbps. But only 13.33G is available, so whose traffic will drop? Well, every X.1 VOQ in the system is being back-pressured 1/3rd of the time, so 1.67G of the 5G headed for X.1 will be dropped, allowing 3.33G from B.1 to be sent to X.1. We now have exactly 10G of switch port X bandwidth remaining (20G - 6.67G from A - 3.33G from B.1 = 10G).
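The back-pressure arithmetic above can be sketched in Python. The assumption, as in the text, is that the congested egress PG sheds its excess uniformly across all contributing VOQs; the helper name is hypothetical.

```python
# PG X.1 has 15G offered against 10G of capacity, so every ingress X.1 VOQ
# is back-pressured (and dropping) 1/3 of the time.
def bp_admit(offered_gbps: float, pg_capacity: float, pg_offered: float) -> float:
    """Traffic admitted from one VOQ when the egress PG sheds excess uniformly."""
    admit_fraction = min(1.0, pg_capacity / pg_offered)
    return offered_gbps * admit_fraction

a_to_x1 = bp_admit(10.0, 10.0, 15.0)   # Port A's X.1 VOQ: ~6.67G gets through
b_to_x1 = bp_admit(5.0, 10.0, 15.0)    # Port B's X.1 VOQ: ~3.33G gets through
remaining = 20.0 - a_to_x1 - b_to_x1   # switch port X bandwidth left for B.2
print(a_to_x1, b_to_x1, remaining)     # exactly 10G remains for B.2 -> X.2
```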
All 10G of the traffic from B.2 to X.2.0 gets sent to port X.2.0, whereas before only 6.67G got through to port X.2. Thus, we have improved the egress ports' ability to achieve line rate. A summary of the results is shown in Table 3 PG VOQ Example Results.
Did we improve anything?
- No drops for B.2 to X.2.0 – all 10G received is transmitted vs. 6.67G before.
- Ingress drops have been evenly spread across all of X.1’s input ports.
- However, all of the drops are still occurring on the ingress line card, and therefore all drops remain of the "unintelligent" type.
We could reduce or eliminate the "unintelligent" drops by increasing the switch fabric's bandwidth. In this case, if we increase the fabric port speed from 20G to 25G, a 25% over-speed, then ALL of the ingress traffic would be sent to the egress, and the egress's drop policy could make an intelligent drop decision. As we have discussed before, this becomes untenable as the number of ports and the speed of those ports increase. Most routers rely upon some amount of fabric speedup and ingress buffering (VOQ), hoping that statistical multiplexing will take care of the rest. Unfortunately, that is usually not enough, and therefore most routers have some amount of unintelligent ingress dropping.
How can we totally eliminate unintelligent ingress dropping? The answer: Absolute VOQ. What is absolute VOQ? Recall from Figure 1 Typical Egress QoS Router that the QoS drop policy is applied to the individual OQs on the egress line card. Absolute VOQ replicates each and every egress OQ on every ingress line card, or more specifically on each ingress line card's switch ports, as illustrated in Figure 10.
The queues P0-P1 on the ingress line card are the VOQs. There is virtually a queue on every ingress line card for each egress line card's OQ. The red arrows that go from the egress OQs back to the ingress VOQs represent control information. This control information could be as simple as back pressure, but for more accurate control, the fabric scheduler should be credit-based on a per-OQ basis.
With the OQs being represented by ingress VOQs, the large DBBs can now be moved from the egress line card to the ingress line card. Thus, the OQs on the egress line card can be very shallow on-chip buffers: just large enough to cover the round-trip time (RTT) of moving a packet from the ingress line card to the egress line card, to ensure that the egress OQ does not go empty while there is work to be done.
Because the DBB is now located on the ingress line cards, the drop policies will now be applied on these ingress VOQs. How can this be done? The ingress line card only knows the size of its local copy of the VOQ, so how can it perform a tail-drop or WRED drop computation with that limited knowledge?
The answer is that the ingress line card knows the rate at which its local copy of the VOQ is draining, in other words, how fast it is sending packets to the egress line card. It also knows the target rate at which this VOQ is supposed to transmit. Knowing these two pieces of information, drain rate and target rate, the ingress line card can derive what percentage of the aggregate OQ size its VOQ represents. For instance, suppose an OQ has a target rate of 2Gbps and the ingress line card measures its VOQ's drain rate at 1Gbps. It deduces that it represents 50% of the aggregate traffic destined for the OQ, and can then scale back its local VOQ size to be 50% of the total.
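Here is a small sketch of that scaling rule in Python. The byte counts and the helper name are hypothetical; the point is only the drain-rate/target-rate ratio described above.

```python
# Absolute-VOQ size scaling: an ingress line card scales its local share of
# the OQ's depth by the fraction of the OQ's traffic it observes itself sending.
def local_voq_limit(oq_total_bytes: int, target_rate_gbps: float,
                    measured_drain_gbps: float) -> int:
    """Portion of the aggregate OQ depth this ingress card may buffer."""
    share = min(1.0, measured_drain_gbps / target_rate_gbps)
    return int(oq_total_bytes * share)

# OQ target rate 2 Gbps, this card drains its VOQ at 1 Gbps -> 50% share.
print(local_voq_limit(1_000_000, 2.0, 1.0))  # 500000 bytes
```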
Next question: wouldn't replicating every egress OQ on every ingress line card be prohibitive from a memory point of view? The answer is no. The "V" in VOQ stands for virtual, meaning that it doesn't "always" exist. The ingress line card only needs to create a VOQ when it has traffic to send to the egress OQ. Since an ingress line card is bound by its external port speed, one can easily compute the amount of buffer space required for that line card's VOQ space. For instance, if the ingress line card has 10 x 10GE ports attached, the total bandwidth it can receive is 100Gbps. The system dictates a maximum DBB size in terms of time, typically 100 milliseconds. Thus, the line card only needs 100Gbps * 100ms, or ~1.2 GBytes. This same buffer would have been used for egress buffering on a traditional system, so the net difference is zero.
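The delay-bandwidth arithmetic above can be sketched as follows (the helper name is illustrative):

```python
# Delay-bandwidth buffer sizing: total buffer is the card's aggregate
# ingress rate times the system's maximum buffering delay.
def dbb_bytes(ingress_gbps: float, delay_ms: float) -> float:
    """Buffer size in bytes for a given ingress rate and buffering delay."""
    return ingress_gbps * 1e9 / 8 * (delay_ms / 1e3)

# 10 x 10GE ports buffered for 100 ms -> 1.25e9 bytes (the text's ~1.2 GB).
print(f"{dbb_bytes(100.0, 100.0) / 1e9:.2f} GB")
```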
So, what advantages does this bring over the more traditional VOQs?
- Fabric over-speed is reduced because all packet drops occur before fabric transmission!! Thus, the fabric speed can be relatively close to the actual port speeds. Saves cost and power!
- Packet writes to off chip memory occur once on the ingress VOQ. Saves on packet processing time, memory resources, cost and power!
- All drops are based on drop policy. No unintelligent drops.
- Eliminates HOLB issue
- Egress ports can achieve line rate
What are the disadvantages? More complex system design.
- Barry Burns is a Principal Engineer in the Router Business Unit Forwarding Software (PFE) group, responsible for the software architecture of the Class of Service (CoS) on the PTX platform. He joined Juniper in 2008 in the Silicon Development team in Raleigh-Durham, NC. Prior to that he was with Cisco from 1995, and before that at IBM. He has over 35 years of hardware and software development experience.
- Chang-Hong Wu is a Juniper Fellow in Silicon and Systems Engineering. With Juniper since 1998, he works with both internal teams and external suppliers to bring all innovations to Juniper's products. He also reaches out to customers to explain Juniper's architectural and technological advantages.
- Jeff Libby is a Distinguished Engineer in Juniper's Silicon Development team. Jeff joined Juniper in 1999, and has worked on the design, verification and architecture of many Juniper chips. His current focus is on Trio architecture silicon.
- David Song is a Sr Staff Engineer within Juniper's Optical Engineering team where he is responsible for the design of packet-optical solutions for routing and switching platforms. He joined Juniper in 2004 and has been designing networking software in control plane and data plane on various platforms. Prior to Juniper, he held various software development positions at Ciena and Nortel Networks.