Silicon and Systems
Trio Packet Processing – Overview

Juniper’s MX series edge and aggregation routers and switches have been based on Trio architecture chipsets for the last six years.  The Trio architecture includes a packet processing model that is fully programmable, which provides flexibility, while still delivering high performance.  The architecture has proven long-lived – the Juniper Silicon Development team is currently working on the fourth-generation Trio chipset.


At the highest level, we can characterize the Trio packet processing architecture as having three main elements.  The combination of these three elements is key to the effectiveness of the architecture.  The elements are:


  • Packet Processing Engines
  • High access rate data memory system
  • Supporting functions

This blog post provides a high-level look at Trio packet processing.


Router Overview


First, a quick overview of Juniper Trio architecture routers.  A small Juniper router may have only a single Packet Forwarding Engine (PFE).  Larger routers have multiple Packet Forwarding Engines, connected by an interconnect fabric.  The fabric provides an any-to-any connection between the Packet Forwarding Engines.  From the standpoint of a packet transiting the router, the packet arrives at a WAN interface of one Packet Forwarding Engine (the ingress Packet Forwarding Engine), is sent across the fabric to a second Packet Forwarding Engine (the egress Packet Forwarding Engine), and is then sent out a WAN interface of the egress Packet Forwarding Engine.  This is shown in the following diagram.


Packet Processing


Now we’ll look at the packet processing done in the Packet Forwarding Engines.  For packets arriving at an ingress Packet Forwarding Engine from a WAN port, or at an egress Packet Forwarding Engine from the fabric, the Packet Forwarding Engine divides each packet into a “head” and a “tail”.  The packet head is the first part of the packet, and is large enough to hold all of the packet headers needed to process the packet (the size is configurable, but is in the range of 128-256 bytes).  The tail consists of the remaining bytes not in the head (if any).  A Dispatch block sends the packet head to a Packet Processing Engine (PPE) for processing.  Packet tails are held in the Packet Forwarding Engine datapath, so that each PPE thread doesn’t need to store the largest possible packet tail.
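To make the head/tail split concrete, here is a minimal Python sketch.  The function name `split_packet` and the 192-byte head size are assumptions chosen for illustration (the text above only gives a configurable range of 128-256 bytes); this is not the actual Trio implementation.

```python
from typing import Tuple

# Hypothetical head size; the post only says it is configurable, 128-256 bytes.
HEAD_BYTES = 192


def split_packet(packet: bytes, head_bytes: int = HEAD_BYTES) -> Tuple[bytes, bytes]:
    """Split a packet into (head, tail); the tail is empty for short packets."""
    return packet[:head_bytes], packet[head_bytes:]


small = bytes(64)      # a 64-byte packet fits entirely in the head
large = bytes(1500)    # a 1500-byte packet leaves a tail behind

head_s, tail_s = split_packet(small)
head_l, tail_l = split_packet(large)
print(len(head_s), len(tail_s))  # 64 0
print(len(head_l), len(tail_l))  # 192 1308
```

Only the head would travel to a PPE in this model; the tail stays where it is until the packet is reassembled.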


Many PPEs work in parallel to provide the required processing bandwidth.  The PPEs in the ingress and egress Packet Forwarding Engines will combine to handle all functions needed to process the packet (parsing, lookup, rewrite, etc.).  During the course of packet processing, a PPE will make requests over a crossbar to access memory and other functions.  When packet processing is complete, the rewritten packet head is sent through an Unload block to the appropriate queuing system.  This is shown in the following diagram.
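The Dispatch, PPE, and Unload flow described above can be sketched in software as follows.  Everything here is a toy model built on assumptions: the 128-byte head size, the `TTL_OFFSET` field position, the "decrement one byte" rewrite, and the step that rejoins head and tail are all invented for illustration, and a thread pool merely stands in for the PPE threads.

```python
from concurrent.futures import ThreadPoolExecutor

HEAD_BYTES = 128   # assumed head size for this sketch
TTL_OFFSET = 8     # hypothetical position of a TTL-like byte in a toy format


def ppe_process(head: bytes) -> bytes:
    """Stand-in for PPE work: decrement a TTL-like byte in the head."""
    rewritten = bytearray(head)
    rewritten[TTL_OFFSET] -= 1
    return bytes(rewritten)


def forward(packets, num_ppe_threads=4):
    # Dispatch: split each packet; tails stay behind in the "datapath".
    heads = [p[:HEAD_BYTES] for p in packets]
    tails = [p[HEAD_BYTES:] for p in packets]
    # Several workers process heads in parallel; map() returns results
    # in input order, which is a property of this sketch only.
    with ThreadPoolExecutor(max_workers=num_ppe_threads) as pool:
        new_heads = list(pool.map(ppe_process, heads))
    # Unload (assumed behaviour): rejoin each rewritten head with its tail.
    return [h + t for h, t in zip(new_heads, tails)]


packets = [bytes([64]) * 300, bytes([10]) * 60]
out = forward(packets)
print(out[0][TTL_OFFSET], len(out[0]))  # 63 300
```

Note that the workers never touch the tails, which mirrors the point that a PPE thread does not need to store the largest possible packet.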



There are benefits to having a single type of engine handle all types of packet processing in a Packet Forwarding Engine.  Processing cycles are fungible between processing functions.  Different applications are gracefully handled, such as lower packet rates with richer packet processing, or higher packet rates with simpler packet processing. 


The fact that the PPE is fully programmable provides capabilities that are difficult or impossible with more fixed processing pipelines or specialized processing units.  There is no fixed limit on the number or types of headers that can be processed by a PPE.  A PPE can easily create or consume new headers for packets entering or leaving a tunnel.  As new protocols are developed, the Trio packet processing architecture can adapt by enhancing the software that runs on the PPEs.  PPEs can also create or consume packets to accomplish tasks like keep-alive functions, doing so at a much higher rate than can be supported by a control plane CPU.
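As an illustration of creating and consuming headers at a tunnel boundary, here is a small sketch.  The 4-byte header layout (tunnel id plus payload length) and the function names are hypothetical and do not correspond to any real tunnel protocol or to PPE microcode.

```python
import struct

# Hypothetical tunnel header: 16-bit tunnel id, 16-bit payload length.
TUNNEL_HDR = struct.Struct("!HH")


def encapsulate(packet: bytes, tunnel_id: int) -> bytes:
    """Create a new outer header for a packet entering a tunnel."""
    return TUNNEL_HDR.pack(tunnel_id, len(packet)) + packet


def decapsulate(packet: bytes):
    """Consume the outer header of a packet leaving a tunnel."""
    tunnel_id, length = TUNNEL_HDR.unpack_from(packet)
    start = TUNNEL_HDR.size
    return tunnel_id, packet[start:start + length]


inner = b"payload"
# Nothing limits nesting depth here: headers can be stacked as needed.
double = encapsulate(encapsulate(inner, 7), 9)
outer_id, once = decapsulate(double)
inner_id, original = decapsulate(once)
print(outer_id, inner_id, original)  # 9 7 b'payload'
```

Supporting a new header type in this model means adding another pair of functions, which is the software analogue of adapting to a new protocol by updating the PPE code.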


Having each packet processed by a PPE on both the ingress and egress Packet Forwarding Engine gives the Trio software developers the flexibility to distribute data structures and the associated processing in the most effective way.  We can largely avoid the need to store information about every port in the system on every Packet Forwarding Engine.  This is particularly important for systems with large numbers of Packet Forwarding Engines, and contributes to the Trio system’s ability to reach very high scale for logical interfaces, routes, tunnels, etc.
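One way to picture this split is a two-stage lookup, sketched below.  The table contents, the idea of an opaque next-hop id carried across the fabric, and all names are assumptions made for the example; the post does not describe how Trio software actually partitions its tables.

```python
import ipaddress

# Ingress view (hypothetical): prefix -> (egress PFE id, opaque next-hop id).
# Note there is no per-port detail here.
INGRESS_ROUTES = {
    ipaddress.ip_network("10.0.0.0/8"): (1, 100),
    ipaddress.ip_network("10.1.0.0/16"): (2, 200),
}

# Egress view (hypothetical): each PFE knows only about its own ports.
EGRESS_NEXT_HOPS = {
    1: {100: "port-1/0"},
    2: {200: "port-2/3"},
}


def ingress_lookup(dst: str):
    """Longest-prefix match done on the ingress PFE."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in INGRESS_ROUTES if addr in net]
    if not matches:
        raise LookupError(f"no route for {dst}")
    best = max(matches, key=lambda net: net.prefixlen)
    return INGRESS_ROUTES[best]


def egress_lookup(pfe_id: int, next_hop_id: int) -> str:
    """Port resolution done on the egress PFE, using only local state."""
    return EGRESS_NEXT_HOPS[pfe_id][next_hop_id]


pfe, nh = ingress_lookup("10.1.2.3")
print(pfe, nh, egress_lookup(pfe, nh))  # 2 200 port-2/3
```

In this arrangement the ingress table grows with the number of routes, while per-port detail grows only with the ports local to each egress engine.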




The processing needed for a packet in a router is (mostly) independent of the processing that is done for other packets that are transiting that router.  That is because most of the data structures that are accessed when processing a packet are not modified as a result of the processing of the packet. 


The Trio architecture takes advantage of this by providing multiple PPEs, each of which is multi-threaded.  A Trio Packet Forwarding Engine will have hundreds or thousands of PPE threads, each of which can be working on its own packet at the same time.  It is more efficient for both chip area and power to have many moderately fast processing elements than to do the same work with a few very fast processing elements.  This allows a Trio Packet Forwarding Engine to have a very high packet processing rate while fitting into a practical power envelope.
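A quick back-of-the-envelope calculation shows why many moderately fast threads can match a few very fast ones on throughput.  It uses Little's law (throughput equals the number of packets in flight divided by the time each packet spends being processed).  The thread counts and latencies below are made-up round numbers, not Trio specifications, and the sketch says nothing about area or power.

```python
def sustained_pps(threads_in_flight: int, latency_ns: int) -> float:
    """Little's law: throughput = concurrency / per-packet processing time."""
    return threads_in_flight * 1_000_000_000 / latency_ns


# Illustrative numbers only.
many_moderate = sustained_pps(1000, 10_000)  # 1000 threads, 10 us per packet
few_fast = sustained_pps(10, 100)            # 10 threads, 0.1 us per packet

print(many_moderate)  # 100000000.0
print(few_fast)       # 100000000.0
```

Both configurations sustain the same packet rate, but the second needs each element to finish a packet one hundred times faster.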


There are some data structures that are modified while processing a packet.  Counters and policers are updated for each packet that uses them.  MAC tables (which track which Ethernet MAC addresses have been learned) and flow tables will typically have high rates of lookups and moderate rates of insertion and deletion.  The Trio architecture provides support for these types of data structures in its memory system and in other supporting logic, in a way that is efficient even when many threads are accessing the same data structure.  This model of providing memory coherency on a more limited basis nicely meets the needs of packet processing applications.  The cache-line based coherency model used by conventional processors performs poorly for data structures that can be accessed by hundreds of threads.
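The post does not describe the mechanism behind this support, so the following is only one plausible shape for it: threads send small "add" requests to the memory, which applies each update itself, instead of every thread doing its own read-modify-write on a shared cache line.  The `CounterMemory` class, its methods, and the counter name are hypothetical, and the lock simply stands in for serialization at the memory.

```python
import threading


class CounterMemory:
    """Toy memory that applies counter updates itself, one at a time."""

    def __init__(self):
        self._counters = {}
        self._lock = threading.Lock()  # stands in for serialization at the memory

    def add(self, counter_id, packets, byte_count):
        # The caller sends a delta; it never reads the counter first.
        with self._lock:
            p, b = self._counters.get(counter_id, (0, 0))
            self._counters[counter_id] = (p + packets, b + byte_count)

    def read(self, counter_id):
        with self._lock:
            return self._counters.get(counter_id, (0, 0))


def worker(mem, num_packets):
    for _ in range(num_packets):
        mem.add("ifl-0", 1, 64)  # one 64-byte packet on a hypothetical interface


mem = CounterMemory()
threads = [threading.Thread(target=worker, args=(mem, 1000)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(mem.read("ifl-0"))  # (8000, 512000)
```

Because each thread only sends a delta, no thread ever holds a stale copy of the counter, so the total stays exact no matter how many threads update it.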


Back to the basics for me, nice write up Jeff!