Automation & Programmability
Junos Telemetry: Detecting Microbursts
08.08.17

Introduction

 

Managing networks is actually not that difficult when things are working as designed. Operational headaches happen when things go wrong. And even then, when they go fantastically wrong (like hard failures that are easily identifiable), troubleshooting or remediation can be relatively straightforward.

 

Rather, the biggest challenge for network operators is diagnosing transient issues. Often the only information available is an observation about some downstream consequence (“the network is slow” or “my application isn’t responding”). To diagnose such issues correctly, operators need real-time telemetry that is fine-grained enough to provide meaningful input.

 

Take microbursts as an example. A microburst is a short spike of packets received in a relatively small interval at a rate much higher than the configured guaranteed bandwidth for a given queue.

 

It’s not hard to imagine scenarios where microbursts might impact the business, such as high-frequency trading platforms. Those platforms depend on real-time market data to formulate trading strategies. Microbursts result in stale data delivery, leaving trading algorithms out of sync with the market, which can be catastrophic to the business.

 

What network operators need are fine-grained monitoring tools that can detect issues as they are happening. Snapshots of average queue depths do not help identify issues, much less enable real-time remediation. This is why we have introduced a queue monitoring sensor as part of the Junos Telemetry Interface (JTI) in Junos release 17.1.

 

What can cause Microbursts?

 

The main factors that can cause microbursts in a network are:

  • Multiple sources sending packets to a single queue
  • Significant speed mismatch between ingress and egress interfaces (for example, a 100G/40G ingress interface forwarding packets to 10G/1G egress interfaces or to a queue which is shaped at a lower rate)
  • Multicast replication done by egress Packet Forwarding Engine (PFE) to a large number of receivers on the same egress interface
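
The speed-mismatch case lends itself to a quick back-of-the-envelope calculation. The sketch below is illustrative only: it assumes a single burst arriving at a constant ingress rate into an otherwise empty queue that drains at a constant egress rate, and the function name and example numbers are hypothetical rather than taken from any Junos tooling.

```python
def burst_backlog(ingress_gbps, egress_gbps, burst_us):
    """Estimate the queue backlog left behind by one burst.

    Assumes constant rates and an initially empty queue (a simplification).
    Returns (backlog_bytes, drain_us), where drain_us is the time the egress
    needs to empty the backlog, i.e. the extra queueing delay seen by the
    last packet of the burst.
    """
    excess_gbps = max(ingress_gbps - egress_gbps, 0)
    # 1 Gbit/s sustained for 1 microsecond is 1000 bits.
    backlog_bits = excess_gbps * 1000 * burst_us
    backlog_bytes = backlog_bits / 8
    drain_us = backlog_bits / (egress_gbps * 1000)
    return backlog_bytes, drain_us


# Hypothetical example: 40G ingress feeding a 10G egress for 100 microseconds.
backlog_bytes, drain_us = burst_backlog(40, 10, 100)
print(f"backlog: {backlog_bytes:.0f} bytes, drain time: {drain_us:.0f} us")
```

Under these assumptions, a 100-microsecond burst leaves 375,000 bytes queued and takes 300 microseconds to drain, so a buffer smaller than that drops packets and a larger one adds latency.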

 

A microburst may result in dropped packets if queues are configured with small buffers. If queues are configured with adequate buffers to absorb the microburst, there won’t be any drops, but the increased queue utilization will add latency to packet delivery. Dropped packets are properly accounted for and easy to troubleshoot. Determining the source of additional latency, however, can be quite challenging for several reasons:

 

  • Typical network topologies consist of multiple routers, and it is difficult to identify the router that is introducing the latency.
  • Monitoring tools (SNMP or CLI based polling) query interface statistics every 30 or 60 seconds by default. That interval provides good average utilization but is not sufficient to detect microbursts. The polling interval needs to be less than 1ms in order to detect microbursts reliably. It is not practical to poll at that high a rate from the routing engine or line card CPU.
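
To see why a coarse polling interval hides a microburst, consider a simple simulation. This is a sketch under stated assumptions: the 10G link, the 10% background load, the 2 ms burst at four times line rate, and the helper name `utilization` are all made up for illustration.

```python
def utilization(offered_bytes, window, capacity_per_slot):
    """Average offered load per window, as a fraction of link capacity.

    offered_bytes: bytes offered to the queue in each 1 ms slot.
    window: number of consecutive slots averaged into one reading.
    """
    readings = []
    for start in range(0, len(offered_bytes), window):
        chunk = offered_bytes[start:start + window]
        readings.append(sum(chunk) / (len(chunk) * capacity_per_slot))
    return readings


CAPACITY = 1_250_000                 # bytes a 10G link carries in 1 ms
offered = [125_000] * 30_000         # 30 s of 10% background load, 1 ms slots
offered[12_345] = offered[12_346] = 5_000_000   # 2 ms burst at 4x line rate

avg_30s = utilization(offered, 30_000, CAPACITY)[0]
peak_1ms = max(utilization(offered, 1, CAPACITY))
print(f"30 s average: {avg_30s:.2%}, worst 1 ms slot: {peak_1ms:.0%}")
```

In this toy trace the 30-second average is about 10.03%, barely above the background load, while the worst 1 ms slot shows 400% of line rate.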

 

 

How to detect Microbursts?

 

For MPCs 7E/8E/9E, a new queue monitoring sensor is introduced as part of the Junos Telemetry Interface. The queue monitoring sensor periodically exports peak queue depth information to an external collector.
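
On the collector side, peak queue depth reports can be screened with a simple threshold. The sketch below assumes the telemetry has already been decoded into records; the `QueueSample` fields, the `flag_microbursts` helper, and the 80% threshold are hypothetical and do not reflect the actual sensor schema.

```python
from typing import Iterable, List, NamedTuple


class QueueSample(NamedTuple):
    """One decoded report (hypothetical fields, not the real sensor schema)."""
    interface: str
    queue: int
    timestamp_ms: int
    peak_depth_pct: float   # peak buffer occupancy since the previous report


def flag_microbursts(samples: Iterable[QueueSample],
                     threshold_pct: float = 80.0) -> List[QueueSample]:
    """Return the reports whose peak depth reached the threshold."""
    return [s for s in samples if s.peak_depth_pct >= threshold_pct]


reports = [
    QueueSample("et-5/0/0", 0, 1000, 4.0),
    QueueSample("et-5/0/0", 3, 1000, 92.5),
    QueueSample("et-5/0/0", 3, 2000, 6.0),
]
for hit in flag_microbursts(reports):
    print(f"{hit.interface} queue {hit.queue} peaked at "
          f"{hit.peak_depth_pct}% (t={hit.timestamp_ms} ms)")
```

Because each report carries a peak rather than an average, a single report above the threshold is enough to tell which router, interface, and queue absorbed the burst.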

 


The microcode engine in the Trio ASIC will monitor queue depths for all configured queues and build/export JVision telemetry packets encoded in Google Protocol Buffer (GPB) format with all necessary information. Since all required tasks are performed in-line in the Trio ASIC without adding any additional load on the line card and routing engine CPUs, the queue monitoring sensor can monitor a large number of queues (32,000 queues for MPC7 and 64,000 queues for MPC8/9) simultaneously.
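
The idea of exporting a peak rather than a snapshot can be modeled in a few lines. This is a conceptual sketch only, not how the Trio microcode is implemented; the class and method names are hypothetical.

```python
class PeakQueueMonitor:
    """Toy model: remember the highest depth seen per queue between exports."""

    def __init__(self):
        self._peaks = {}

    def observe(self, queue_id, depth_bytes):
        # Called on every depth change; keeps only the maximum so far.
        if depth_bytes > self._peaks.get(queue_id, 0):
            self._peaks[queue_id] = depth_bytes

    def export(self):
        # Hand the peaks to the exporter and start a fresh reporting interval.
        report, self._peaks = self._peaks, {}
        return report


monitor = PeakQueueMonitor()
for depth in (1_000, 9_000, 200):    # a brief spike inside one interval
    monitor.observe(3, depth)
first_report = monitor.export()
second_report = monitor.export()
print(first_report, second_report)
```

A snapshot taken at the end of the interval would report 200 bytes for queue 3, while the peak tracker reports 9,000, which is why a brief spike still shows up in a periodic export.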

 


 

 

The following configuration will enable the queue-mon sensor on interface et-5/0/0:

[Screenshot: configuration enabling the queue-mon sensor on et-5/0/0]

The GPB proto format for the queue-mon sensor:

[Screenshots: GPB proto definition for the queue-mon sensor]

 

Conclusion

 

With the addition of the queue monitoring sensor to the existing library of rich sensors in JTI, network operators will have much better visibility into queue utilization than the average-utilization figures available on most routers. This continues Juniper’s commitment to producing the single most automation-friendly network operating system in the industry.

 


09.04.17
Mohit Singh
15 years ago, I got an IEEE Communications Magazine paper on COPS-based management (Salsano et al.), and its realisation efforts led me to bear comments like: how do you know when, how, and what to control?

15 years later, the future seems a better understood history through things like this.