Automation & Programmability
Junos Telemetry: Detecting Microbursts
08.13.17

Introduction

 

Managing networks is actually not that difficult when things are working as designed. Operational headaches happen when things go wrong. And even then, when they go fantastically wrong (like hard failures that are easily identifiable), troubleshooting or remediation can be relatively straightforward.

 

Rather, the biggest challenge for network operators is diagnosing transient issues. Often the only information available is an observation about some downstream consequence (“the network is slow” or “my application isn’t responding”). To diagnose such issues correctly, operators need real-time telemetry that is fine-grained enough to provide meaningful input.

 

Take microbursts as an example. A microburst is a short spike of packets received in a relatively small interval at a rate much higher than the configured guaranteed bandwidth for a given queue.

 

It’s not hard to imagine scenarios where microbursts might impact the business, such as high-frequency trading platforms. Those platforms depend on real-time market data to formulate trading strategies. Microbursts delay that data, so trading algorithms fall out of sync with the market, which can be catastrophic for the business.

 

What network operators need are fine-grained monitoring tools that can detect issues as they are happening. Snapshots of average queue depths do not help identify issues, much less enable real-time remediation. This is why we have introduced a queue monitoring sensor as part of the Junos Telemetry Interface (JTI) in Junos release 17.1.

 

What can cause Microbursts?

 

The main factors that can cause microbursts in a network are:

  • Multiple sources sending packets to a single queue
  • Significant speed mismatch between ingress and egress interfaces (for example, a 100G/40G ingress interface forwarding packets to 10G/1G egress interfaces or to a queue which is shaped at a lower rate)
  • Multicast replication done by egress Packet Forwarding Engine (PFE) to a large number of receivers on the same egress interface
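
To get a feel for how quickly a speed mismatch builds a queue, here is a rough back-of-the-envelope sketch in Python. The function name is hypothetical, the rates and burst duration are illustrative assumptions rather than measurements from any platform, and the model assumes constant arrival and drain rates for the duration of the burst.

```python
def queue_growth_bytes(ingress_bps: float, egress_bps: float, burst_seconds: float) -> float:
    """Bytes that accumulate in an egress queue while a burst lasts.

    Simplifying assumption: the burst arrives at a constant ingress rate
    and the queue drains at a constant egress rate.
    """
    # Only the traffic in excess of the drain rate builds up in the queue.
    excess_bps = max(ingress_bps - egress_bps, 0.0)
    return excess_bps * burst_seconds / 8.0


# Illustrative numbers: a 40G ingress bursting toward a 10G egress for 100 microseconds.
backlog = queue_growth_bytes(40e9, 10e9, 100e-6)
print(f"backlog after burst: {backlog:.0f} bytes")
```

Under these assumptions a burst lasting only 100 microseconds leaves roughly 375 KB waiting in the queue, which is the kind of short-lived buildup the rest of this post is concerned with.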

 

A microburst may result in dropped packets if queues are configured with small buffers. If queues are configured with buffers large enough to absorb the microburst, there won’t be any drops, but the increased queue utilization adds latency to packet delivery. Dropped packets are properly accounted for and easy to troubleshoot. However, determining the source of additional latency can be quite challenging for several reasons:

 

  • Typical network topologies consist of multiple routers, and it is difficult to identify the router that is introducing the latency.
  • Monitoring tools (SNMP or CLI-based polling) query interface statistics every 30 or 60 seconds by default. That interval gives a good picture of average utilization but is not sufficient to detect microbursts. The polling interval needs to be less than 1 ms to detect microbursts reliably, and it is not practical to poll at that rate from the routing engine or line card CPU.
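
The averaging problem with coarse polling can be illustrated with a small Python sketch. All of the numbers here (a 10G link, 10% background load, a single 1 ms burst at line rate) are assumptions chosen for illustration, and the helper function is hypothetical.

```python
def utilization(bytes_sent: float, link_bps: float, window_seconds: float) -> float:
    """Fraction of link capacity used over a measurement window."""
    return (bytes_sent * 8.0) / (link_bps * window_seconds)


LINK_BPS = 10e9          # assumed 10G egress link
POLL_SECONDS = 30.0      # polling interval mentioned in the text
BURST_SECONDS = 1e-3     # assumed 1 ms microburst
BACKGROUND_UTIL = 0.10   # assumed steady 10% background load

# Bytes sent by the steady background load over the whole polling interval.
background_bytes = BACKGROUND_UTIL * LINK_BPS * POLL_SECONDS / 8.0
# Extra bytes sent during the burst, filling the rest of the link for 1 ms.
burst_bytes = (1.0 - BACKGROUND_UTIL) * LINK_BPS * BURST_SECONDS / 8.0

# What a 30-second poll reports versus what the link saw during the burst.
avg_over_poll = utilization(background_bytes + burst_bytes, LINK_BPS, POLL_SECONDS)
background_in_burst = BACKGROUND_UTIL * LINK_BPS * BURST_SECONDS / 8.0
during_burst = utilization(background_in_burst + burst_bytes, LINK_BPS, BURST_SECONDS)

print(f"30 s average utilization: {avg_over_poll:.4%}")
print(f"utilization during burst: {during_burst:.0%}")
```

With these assumed values the 30-second average moves from 10% to only about 10.003%, even though the link was fully utilized for the duration of the burst. The burst is effectively invisible at that polling granularity.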

 

 

How to detect Microbursts?

 

For MPCs 7E/8E/9E, a new queue monitoring sensor has been introduced as part of the Junos Telemetry Interface. The queue monitoring sensor periodically exports peak queue depth information to an external collector.

 

Screen Shot 2017-08-04 at 3.39.58 PM.png

The microcode engine in the Trio ASIC monitors queue depths for all configured queues, then builds and exports JVision telemetry packets encoded in Google Protocol Buffer (GPB) format with all necessary information. Because all of these tasks are performed inline in the Trio ASIC, without adding any load on the line card or routing engine CPUs, the queue monitoring sensor can monitor a large number of queues (32,000 queues for MPC7 and 64,000 queues for MPC8/9) simultaneously.
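
On the collector side, one possible use of the exported peak queue depths is to convert them into an estimated queuing delay and flag queues that exceed a latency budget. The Python sketch below is only an illustration: the `PeakDepthSample` record and its field names are hypothetical and do not reflect the sensor's actual GPB schema, and the drain rate and delay budget are assumed values.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class PeakDepthSample:
    """Hypothetical decoded record; the real sensor's field names may differ."""
    interface: str
    queue: int
    peak_depth_bytes: int


def queuing_delay_seconds(depth_bytes: int, drain_bps: float) -> float:
    """Time needed to drain a queue of the given depth at the given rate."""
    return depth_bytes * 8.0 / drain_bps


def flag_microbursts(samples: Iterable[PeakDepthSample],
                     drain_bps: float,
                     max_delay_seconds: float) -> List[PeakDepthSample]:
    """Return samples whose peak depth implies more delay than the budget."""
    return [s for s in samples
            if queuing_delay_seconds(s.peak_depth_bytes, drain_bps) > max_delay_seconds]


# Made-up sample values for illustration only.
samples = [
    PeakDepthSample("et-5/0/0", 0, 12_000),
    PeakDepthSample("et-5/0/0", 3, 375_000),
]

DRAIN_BPS = 10e9        # assumed: queue shaped to 10G
DELAY_BUDGET = 100e-6   # assumed: 100 microsecond latency budget

flagged = flag_microbursts(samples, DRAIN_BPS, DELAY_BUDGET)
for s in flagged:
    delay_us = queuing_delay_seconds(s.peak_depth_bytes, DRAIN_BPS) * 1e6
    print(f"{s.interface} queue {s.queue}: peak {s.peak_depth_bytes} B, "
          f"~{delay_us:.0f} us queuing delay")
```

Because the sensor reports peak rather than average depth, a check like this can catch a queue that briefly held enough data to add noticeable latency even when its average depth looks healthy.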

 

 Screen Shot 2017-08-04 at 3.41.15 PM.png

 

 

The following configuration will enable the queue-mon sensor on interface et-5/0/0:

Screen Shot 2017-08-04 at 3.42.12 PM.png

The GPB proto format for the queue-mon sensor:

Screen Shot 2017-08-04 at 3.44.53 PM.png

Screen Shot 2017-08-04 at 3.45.12 PM.png

 

Conclusion

 

With the addition of the queue monitoring sensor to the existing library of rich sensors in JTI, network operators gain much better visibility into queue utilization than the average utilization figures supported on most routers provide. This continues Juniper’s commitment to producing the single most automation-friendly network operating system in the industry.

 


08.08.17
sgopalkr

Wow! That's a good read.

 

Just a couple of thoughts on the future work:

 

Do you think this data can be collected and used by some kind of AI system to predict the traffic in a router? I did a school project that collected CPU data to predict usage and scale VMs accordingly. I think something similar could be done here to predict traffic and take preventive action.

 

Also, how difficult is it to translate these effects to the actual services that are affected? I think it's still a huge task for the network admin to interpret this as business-meaningful data.

 

Regards,

Suhas

08.13.17
Distinguished Expert

Suhas,

 

You are correct to see the connection between telemetry and predictive network operations. Juniper has outlined a vision of the "other SDN", Self Driving Networks, that behave just as you ask: taking in data from all these sources and making automated corrections to drive the network on demand.

 

White Papers on the Self Driving Network Part 1 and 2

 

https://www.juniper.net/assets/us/en/local/pdf/whitepapers/2000656-en.pdf

 

https://www.juniper.net/assets/us/en/local/pdf/whitepapers/2000657-en.pdf

 

Steve

08.21.17
perry.young@btinternet.com
Now this is my kind of topic.
What would be awesome would be for this functionality to be provided on SRX, and in particular on SRX HE devices running ExpressPath/Services-Offload. I note that this capability seems to depend on Trio, suggesting it is an MX-only feature. Is that the case, and are other platforms like SRX on the roadmap for this?
08.21.17
Juniper Employee
QFX5100 supports microburst threshold monitoring/data export.
09.04.17
Mohit Singh
Simply great! You are also making it possible to counter shrewd DDoS attacks through your 'the other SDN', and it's surely the first SDN++. Machine learning can help congestion control make sense of all these figures when choosing and tuning algorithms and their parameters.

The 'other control' is seemingly simply great.