As NFV Displaces Dedicated Network Functions, Who Is Managing Service Levels?
As NFV gains mainstream production status across carriers and service providers worldwide, important facts about the realities of a software-defined network world are coming to light.
Managing service levels for network functions, more commonly referred to as “service assurance,” is a concept created in the days when network functions were anything but virtual. These network services were tied to physical devices, and that hardware was “assured” to deliver its assigned “services” at a specified uptime.
In NFV, those now-virtualized network functions—or VNFs—must still meet desired service targets in order for the network to support its assigned applications at expected levels.
Fortunately, this is not an entirely new challenge. We’ve seen this before in the early days of server virtualization, and the concepts learned there can be instructive. Application and infrastructure monitoring software was created to use log data, statistics and alarms to help data center operators isolate infrastructure problems and spin up new resources, hopefully before things got too hairy and started impacting the user experience.
Unfortunately for network operators, that’s where the usefulness of server monitoring software ends. The latency built into monitoring tools used for server virtualization, often several minutes, is orders of magnitude too slow to be of much use in network operations, where a delay of more than a few seconds can mean the difference between a VNF meeting its assigned service levels and failing altogether.
To complicate matters further, this low tolerance for latency means that human intervention in the remediation loop can be a barrier to effective management. If a service degradation is detected, the time for a human to respond to an alarm might be too long to keep application performance from suffering. This is compounded in a 5G world, where IoT and edge cloud computing drive huge network traffic volumes and an increased sensitivity to latency. All of this means that what is needed is intent-driven automation, where the human defines the targets to be met and the general remediation strategies that the system will follow.
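To make the idea of intent-driven automation concrete, here is a minimal Python sketch. It is an illustration only, not how any particular product works: the `Intent` class, the metric names, the target values and the `scale_out` and `migrate` strategies are all hypothetical. The operator declares the targets and the remediation strategy for each one, and the system evaluates every metrics sample against them without waiting for a human.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Intent:
    """Operator-defined target for one VNF metric (hypothetical structure)."""
    metric: str                         # e.g. "latency_ms"
    max_value: float                    # service-level target for that metric
    remediation: Callable[[str], str]   # strategy to run when the target is missed


def scale_out(vnf: str) -> str:
    # Placeholder: a real system would call its orchestrator here.
    return f"scale_out:{vnf}"


def migrate(vnf: str) -> str:
    # Placeholder: a real system would move the VNF to another node.
    return f"migrate:{vnf}"


def evaluate(vnf: str, sample: Dict[str, float], intents: List[Intent]) -> List[str]:
    """Return the remediation actions triggered by one metrics sample."""
    actions = []
    for intent in intents:
        value = sample.get(intent.metric)
        if value is not None and value > intent.max_value:
            actions.append(intent.remediation(vnf))
    return actions


# The human defines the intent once; the loop applies it on every sample.
intents = [
    Intent("latency_ms", 5.0, scale_out),
    Intent("packet_loss_pct", 0.1, migrate),
]

sample = {"latency_ms": 7.2, "packet_loss_pct": 0.02}
print(evaluate("vfirewall-1", sample, intents))  # ['scale_out:vfirewall-1']
```

The point of the structure is the division of labor: people own the `intents` list, and the evaluation loop owns the reaction time.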
Also, consider how VNFs scale. Horizontal scaling means that you need visibility into how performance varies as you roll out new network services, and the ability to track latency changes as new resources are spun up. Having this insight helps operators maintain uptime and performance levels as they scale, smoothly adding new functions as application demand grows.
Let’s complicate this further. Operators might not know precisely where infrastructure supporting a specific set of VNFs is running. They might not even know what other services are running on specific nodes if the environment is shared.
Two things are needed to meet these challenges. First, we need monitoring designed to operate in a real-time world. Second, we need to automate the remediation process when VNFs begin to fail.
Day 1 Becomes Day 2
The design and deployment of a virtualized network environment is a technical challenge, to be sure. But I believe the products and support available for this process have stabilized and matured such that standing up the environment is the smaller challenge businesses will face. Keeping the network services running at prescribed levels is the “day 2” challenge, and it’s not trivial.
What’s needed is monitoring software built for this reality. Smart people are working on tools to address this, including the Juniper Networks engineers working on our AppFormix product. Popular open source tools from the server virtualization world, like Zabbix and Nagios, rely on static configuration written in advance by humans. A newer generation of commercial tools, like Datadog, is designed with flexible schemas that let applications emit bits of OS-level performance information to be monitored, offering a first step towards automated outlier detection.
(For a description of how outlier and anomaly detection works, check out Homin Lee’s excellent presentation from OSCON last year.)
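As a rough illustration of what outlier detection means in practice, the sketch below compares each instance against its peers using the median absolute deviation (MAD), one common statistical approach. This is an assumption-laden example rather than a description of any vendor's implementation: the function name, the instance names, the sample values and the tolerance of 3 are all made up for illustration.

```python
from statistics import median
from typing import Dict, List


def mad_outliers(latest: Dict[str, float], tolerance: float = 3.0) -> List[str]:
    """Return instances whose latest value sits more than `tolerance` MADs
    away from the median of all instances in the group."""
    values = list(latest.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the group reports the same value, so the MAD is zero.
        # In this sketch, anything that differs from the median is flagged.
        return [name for name, v in latest.items() if v != med]
    return [name for name, v in latest.items() if abs(v - med) / mad > tolerance]


# Hypothetical latency readings (ms) from five instances of the same VNF.
latency_ms = {"vnf-a": 2.1, "vnf-b": 2.3, "vnf-c": 2.0, "vnf-d": 9.8, "vnf-e": 2.2}
print(mad_outliers(latency_ms))  # ['vnf-d']
```

Nobody had to configure a threshold for `vnf-d` in advance. It is flagged because it behaves differently from its peers, which is the step beyond static pre-configuration.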
A tool that gives you real-time visibility into what your infrastructure and network services are delivering, however, is still pretty rare. Juniper Networks’ approach uses machine learning to get closer to this near-zero latency goal. We’re also leveraging Resource Director Technology (RDT) in the more recent Intel® Xeon® processor families to manage VNF workloads in the metal, extracting performance data from the processor instead of the management software. This shaves precious milliseconds off the latency problem.
What’s Your “Blast Radius?”
What’s important to keep in mind here is that the stakes are much higher than in the world of server virtualization. Problems at the application layer, almost by definition, impact only that application. When a network function fails or begins performing poorly—regardless of the reason—the “blast radius” of that failure can be catastrophically large, impacting a host of services and applications further up the stack.
In a 5G network, the volume of traffic and the number of connected devices can make a foundational network performance degradation catastrophic. All the more reason why you need to detect and remediate *much* faster than many tools created to monitor server virtualization can deliver.
Virtual Environments are Dynamic Environments
Both VNFs and the infrastructure they run on are constantly changing in these dynamic new network environments. Monitoring tools that are based on server virtualization technology and use static thresholds are insufficient in this reality. Developers and operators need tools that understand the context of what’s happening in the applications and the infrastructure, from both bottom-up and top-down views, and those tools need to dynamically automate management of services in real time.
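One way to picture the difference between a static threshold and a dynamic one is a baseline that follows the metric as it moves. The sketch below is a simplified, hypothetical example: the class name, the smoothing factor `alpha`, the multiplier `k` and the warm-up length are assumptions chosen for illustration. It keeps an exponentially weighted estimate of the mean and variance and flags a sample only when it strays far from what has recently been normal.

```python
class AdaptiveThreshold:
    """Flags samples that stray from an exponentially weighted moving baseline."""

    def __init__(self, alpha: float = 0.2, k: float = 3.0, warmup: int = 5):
        self.alpha = alpha    # how quickly the baseline follows new data
        self.k = k            # allowed deviation, in standard deviations
        self.warmup = warmup  # samples to observe before flagging anything
        self.mean = None
        self.var = 0.0
        self.count = 0

    def update(self, x: float) -> bool:
        """Feed one sample; return True if it looks anomalous."""
        self.count += 1
        if self.mean is None:
            self.mean = x  # first sample seeds the baseline
            return False
        deviation = x - self.mean
        std = self.var ** 0.5
        anomalous = self.count > self.warmup and abs(deviation) > self.k * std
        # Fold the sample into the baseline so the threshold keeps adapting.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return anomalous


detector = AdaptiveThreshold()
for latency_ms in [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 9.9, 10.1, 25.0]:
    if detector.update(latency_ms):
        print(f"anomaly: {latency_ms} ms")  # fires only for the 25.0 ms sample
```

A static alarm set at, say, 50 ms would never fire on the 25 ms spike above, even though it is well outside recent behavior. The trade-off in this sketch is that anomalous samples are also folded into the baseline, so a sustained shift eventually becomes the new normal.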
Service assurance in a world of virtual network functions is a fundamentally different animal than when network functions were tied to dedicated hardware. What’s needed for success in this new world—especially as 5G looms on the horizon—is real-time monitoring where remedial action is automated based on rules optimized by machine learning, with operations personnel providing an oversight role.
Sumeet Singh is Vice President of Engineering at Juniper Networks and the founder of AppFormix.