Enabling rapid remediation in the Self-Driving Network
Apr 5, 2018
A key issue for operators is detecting a problem within the network infrastructure in a timely way and working out how to divert traffic away from the affected location. Today this is typically handled in a manual, labor-intensive way. Doing it automatically and predictively dramatically reduces the time for which traffic is impacted, and so greatly improves the reliability of the service.
To reach this key stepping stone towards the Self-Driving Network, we have combined three technologies: Streaming Telemetry, the AppFormix analytics and optimization platform, and the NorthStar WAN SDN Controller.
Streaming Telemetry is sent on a quasi-continuous basis by network nodes. Junos supports a wide variety of telemetry sensors, including sensors for node resource utilization, such as CPU and memory, and sensors for link statistics and anomalies. This telemetry information is crucial for predicting failures before a traffic-impacting fault actually occurs.
AppFormix ingests and analyzes the streaming telemetry data from all of the network nodes. It detects anomalous behaviour either on the basis of user-defined thresholds or by employing machine learning techniques to identify deviations from the norm. When AppFormix sees anomalous behaviour, it sends an alarm to the dashboard and to external HTTP endpoints. In addition, it makes a REST API call to the NorthStar Controller to request a maintenance window on the affected link or node.
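The two detection styles described above can be illustrated with a minimal Python sketch. It is not AppFormix's actual algorithm: the function names, the z-score test standing in for "machine learning", and the sample counter values are all assumptions made for illustration.

```python
from statistics import mean, stdev


def exceeds_threshold(value, threshold):
    """Static check: flag the sample if it is above a user-defined threshold."""
    return value > threshold


def deviates_from_baseline(history, value, z_limit=3.0):
    """Baseline check: flag the sample if it lies more than z_limit
    standard deviations away from the mean of recent history.

    This simple z-score is a stand-in (assumption) for whatever learned
    model a real analytics platform would use.
    """
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # any change from a perfectly flat baseline
    return abs(value - mu) / sigma > z_limit


# Hypothetical l3_incompletes counts per telemetry interval on one interface.
recent_counts = [0, 1, 0, 2, 1, 0, 1, 1]

print(exceeds_threshold(40, threshold=10))        # static threshold check
print(deviates_from_baseline(recent_counts, 40))  # deviation from the norm
print(deviates_from_baseline(recent_counts, 1))   # an ordinary sample
```

The static check needs an operator to choose a sensible threshold for each metric, whereas the baseline check adapts to whatever is normal for a given interface.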
The AppFormix dashboard below shows that some malformed packets (flagged as l3_incompletes) are arriving on interface ge-1/1/1 of a router in Chicago. This type of malfunction is known as a "gray" failure and is particularly insidious because it can be service-affecting but is often not noticed by the IGP or BFD, depending on the proportion of packets that are affected.
AppFormix dashboard showing malformed packets
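To see why a gray failure can escape BFD, consider a rough calculation. BFD declares a session down only after a run of consecutive control packets (the detect multiplier) has been missed. The sketch below assumes that each BFD packet is lost independently with the same probability as any other packet, which is a simplification, and the function name is hypothetical.

```python
def expected_intervals_until_bfd_down(loss_probability, detect_multiplier=3):
    """Expected number of BFD transmit intervals before `detect_multiplier`
    consecutive packets are lost, assuming independent per-packet loss.

    Uses the standard result for the expected waiting time until a run of
    m consecutive events, each of probability p:
        (1 - p**m) / ((1 - p) * p**m)
    """
    p = loss_probability
    m = detect_multiplier
    if not 0 < p < 1:
        raise ValueError("loss_probability must be strictly between 0 and 1")
    return (1 - p ** m) / ((1 - p) * p ** m)


# Assumed values: 1% of packets affected, detect multiplier 3, 300 ms interval.
intervals = expected_intervals_until_bfd_down(0.01, detect_multiplier=3)
seconds = intervals * 0.3
print(f"about {intervals:,.0f} intervals, or {seconds / 86400:.1f} days")
```

Under these assumptions, a fault that damages one packet in a hundred would take days on average to bring the BFD session down, even though it affects customer traffic the whole time. This is why counters such as l3_incompletes are worth monitoring directly.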
In response to the REST API call from AppFormix requesting a maintenance window, the NorthStar Controller identifies which traffic-engineered LSPs currently pass through the faulty link. It recomputes their paths such that they no longer use that link. It then sends PCEP messages to the ingress routers of the affected LSPs with details of the new paths. The ingress routers change the paths of the LSPs accordingly in a make-before-break manner. The scheme works regardless of whether RSVP or Segment-Routed LSPs are being used.
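The path recomputation step can be sketched as a shortest-path search that ignores any link under maintenance. This is a plain Dijkstra search in Python, not NorthStar's actual path computation, which also takes traffic-engineering constraints into account. The link metrics and the function name are assumptions; only the city names come from the example topology in this post.

```python
import heapq

# Hypothetical topology: (node A, node B) -> metric (values are assumptions).
LINKS = {
    ("Los Angeles", "Denver"): 10,
    ("Denver", "Chicago"): 10,
    ("Chicago", "New York"): 10,
    ("Los Angeles", "Dallas"): 15,
    ("Dallas", "Miami"): 15,
    ("Miami", "New York"): 15,
}


def shortest_path(links, src, dst, under_maintenance=frozenset()):
    """Return the lowest-metric path from src to dst as a list of nodes,
    skipping links listed in under_maintenance, or None if no path exists."""
    adjacency = {}
    for (a, b), metric in links.items():
        if frozenset((a, b)) in under_maintenance:
            continue  # a link under maintenance is not eligible
        adjacency.setdefault(a, []).append((b, metric))
        adjacency.setdefault(b, []).append((a, metric))

    queue = [(0, src, [src])]  # (cost so far, current node, path taken)
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, metric in adjacency.get(node, []):
            if neighbour not in visited:
                heapq.heappush(queue, (cost + metric, neighbour, path + [neighbour]))
    return None


before = shortest_path(LINKS, "Los Angeles", "New York")
maintenance = {frozenset(("Denver", "Chicago"))}
after = shortest_path(LINKS, "Los Angeles", "New York", maintenance)
print(" -> ".join(before))
print(" -> ".join(after))
```

With the Denver-Chicago link excluded, the search returns the alternative path through Dallas and Miami. In the real system the controller would then signal the new path to the ingress router over PCEP.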
The image below of the NorthStar GUI shows that before the fault occurred, the Segment-Routed LSP highlighted in orange followed the path Los Angeles-Denver-Chicago-New York.
Usual path of Los Angeles to New York LSP shown on NorthStar
When NorthStar receives the REST API call from AppFormix, it diverts the LSP away from the faulty link, which is the link between Denver and Chicago, as you can see below. Note the red "M" on the Denver-Chicago link, which indicates that the link is under maintenance. The LSP now follows the path Los Angeles-Dallas-Miami-New York. You can also see the maintenance event listed in the table at the bottom of the screen.
New path of Los Angeles to New York LSP shown on NorthStar
As you have seen, by bringing together big data, machine learning and software-defined networking technologies, you can now achieve rapid, fully automated remediation of network anomalies, which greatly improves the availability of your network.
To see more details, come and see it working live on our booth at the MPLS+SDN+NFV World Congress 2018 in Paris on 10-13 April.