Networking's machine learning missing link: Training data
It’s hard to get through a single sales pitch or conference keynote without hearing someone extol the virtues of machine learning and the thing it will eventually enable, artificial intelligence. In the networking world, we are all agog with visions of how the technology will transform our industry.
Whether it’s self-healing systems or networks capable of detecting anomalous behavior, there is something really basic standing between us and the future we all crave: training data.
The role of training data
In this machine learning future, the algorithms that will perform all the magic have to be developed. And tuned.
Basically, you collect a bunch of data, and then people well-versed in math sort through the data to find predictive relationships. The process is iterative, and filled with failure. Finding the expected relationships might not be terribly complex, but once you get beyond the basics, finding these relationships and then developing ways to express them can require an extraordinary effort.
Once you arrive at a candidate algorithm, you have to collect even more data to test it. You can’t test with the same data you trained on, because the model will simply reproduce the answers it has already seen, which tells you nothing about how it will handle new inputs.
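To make the point about testing concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the data is synthetic, and the "model" is a deliberately naive lookup table rather than anything a real vendor would ship. It only illustrates why data held back from training is needed to get an honest score.

```python
import random


def split_holdout(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the samples and split it into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


class MemorizingModel:
    """A deliberately naive 'model' that only remembers what it was shown."""

    def fit(self, pairs):
        self.table = dict(pairs)
        return self

    def predict(self, x, default=0):
        # Unseen inputs fall back to a fixed guess.
        return self.table.get(x, default)


def accuracy(model, pairs):
    """Fraction of (input, label) pairs the model predicts correctly."""
    hits = sum(1 for x, y in pairs if model.predict(x) == y)
    return hits / len(pairs)


# Hypothetical data: the label is 1 when the input is divisible by 3.
data = [(n, int(n % 3 == 0)) for n in range(200)]
train, test = split_holdout(data)
model = MemorizingModel().fit(train)

train_accuracy = accuracy(model, train)  # scored on data it has already seen
test_accuracy = accuracy(model, test)    # scored on data it has never seen
```

Scored against its own training data, the lookup table looks perfect. Scored against the held-out data, it does no better than its fixed guess, which is the number that actually matters.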
And then, when you finally boil it all down to a core set of relationships, you have to do it all over again. Unless your environment is perpetually static, you will want to continue to collect data so that you can train and evolve your application to account for your changing environment.
Where does this training data live?
It turns out that getting data sufficient for this kind of work is actually not always that easy. A few years ago, we were trying to get data simply to test a product using real-world inputs. Even in that basic case, it was nigh impossible to get a real set of training data.
It’s probably not terribly surprising to learn that most companies don’t actually want to share their real-world analytics. Even if it were simple to stream them to some central data lake somewhere, some of this data is sensitive. Can you imagine the look on your CISO’s face as you explained that your company’s sensitive data was going to be perpetually streamed to some third-party collector in the cloud so that someone else could use it to develop models?
And given the current security climate, it’s just not reasonable to expect that companies are going to freely put information out there, especially if they don’t entirely know how it could be used to compromise their environments. This means that people will cling tightly to their information, so getting your hands on even a few sets of meaningful training data is going to be difficult.
That’s not to say that no one will share data, but they will do it to help themselves, not the collective. The data will be shared but not pooled, and this means that it will be difficult to find those predictive relationships that exist across customer boundaries. And even then, it will likely be done as part of support packages to be consumed by TAC teams to better identify issues and facilitate troubleshooting. Sure, it will happen in some point cases, but not in the general case—at least not anytime soon.
Snowflakes don’t help
And even if this training data was available, the fact that most IT environments are snowflakes—unique mixtures of infrastructure, applications, and users—means that the data isn’t always super helpful.
If you run a network that resembles no one else’s network, then even if you had access to data, not all of it would be relevant to what you are trying to do. Take the relatively attractive use case of using machine learning to detect anomalous flows on the network. The very thing that makes a flow anomalous is that it does not exist in your environment. This means that the point of comparison is your baseline set of traffic flows. And that will be different from network to network.
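A rough sketch of that idea in Python follows. The flow keys, addresses, and the `min_count` threshold are all made up for illustration, and real anomaly detection would be far more involved than a set lookup. The only point is that the same traffic gets a different verdict depending on whose baseline it is compared against.

```python
from collections import Counter


def build_baseline(flows, min_count=2):
    """Return the set of flow keys seen at least min_count times."""
    counts = Counter(flows)
    return {key for key, n in counts.items() if n >= min_count}


def find_anomalies(baseline, observed):
    """Return observed flow keys missing from the baseline, in first-seen order."""
    seen = set()
    result = []
    for key in observed:
        if key not in baseline and key not in seen:
            seen.add(key)
            result.append(key)
    return result


# Hypothetical flow keys: (source, destination, destination port).
web_flow = ("10.0.0.5", "10.0.1.9", 443)
db_flow = ("10.0.0.7", "10.0.1.20", 5432)

history_a = [web_flow] * 3 + [db_flow] * 2  # network A sees both flows routinely
history_b = [db_flow] * 4                   # network B has only ever seen one
today = [web_flow, db_flow]                 # identical traffic on both networks

anomalies_a = find_anomalies(build_baseline(history_a), today)
anomalies_b = find_anomalies(build_baseline(history_b), today)
```

On network A nothing is flagged. On network B the web flow is flagged, even though the traffic is identical, because B's baseline never contained it.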
There are two implications here. First, it means that you can’t necessarily leverage some generic set of training data. And second, it means that whatever models you derive from the training set are going to be unique to your environment.
That second point is important. It means all those mathemagicians sifting through the data to find gold have to repeat the chore for every environment, covering both the creation of the models and their continued maintenance.
The commercial implications of the training data problem are profound.
First, it means that companies not only have to solve the technical challenges around developing machine learning models and applying them, but they must also solve for the logistics of collecting data and refining those models over time. In fact, any vendor who is talking vaguely about machine learning ought to be able to answer what data they are using and how those models are evolving over time.
Second, it means that if the models are not ubiquitously applied to all customers the same way, companies have to solve for custom modeling. If a model is created one time based on relatively static inputs, this might not be an issue. But if a model has to change over time, this creates a maintenance burden that has to be addressed. How many customers require such work? And what is the frequency of change? This should allow you to determine whether the staffing is sufficient to support the claims.
The point here is not that these cannot be solved, but rather that the solutions are more than just some technical explanation for how machine learning can be used to identify some useful thing.
Are there simpler things to solve for?
I generally take issue with the broad IT industry’s refusal to start simple with things. Generally, before we master some new discipline, we are already upping the degree of difficulty by a few orders of magnitude.
In the machine learning space, should we really be aiming first at trying to detect anomalous flows and make it commercially repeatable across large numbers of accounts? Knowing that troubleshooting workflows tend to be hypercontextual (dependent on the topology, the applications, the users, and the surrounding infrastructure), should we really be going after dynamic workflow execution before we even solve provisioning?
In my opinion, there are much simpler targets. They might not be as sexy, but they would allow us to collectively figure out how to handle data, how to develop models, and how to refine the models as the data changes. The easiest targets? Persistent data is far easier to work with than trying to collect streaming telemetry and make sense of a constantly changing set of variables. The best data we have? Configuration files. They are fairly unchanging, and they are easy to collect.
So what could you do with this? Imagine making a policy recommendation engine. It could be that the edge policy for all external-facing ports is the same. So as soon as you configure eBGP on a port, you should be able to make a policy recommendation based on how all other such ports are configured. If you change policy in one place and that policy is used elsewhere, it could be that you want to change it in all places (or in none).
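As a minimal sketch of what such an engine might do, assume the configuration files have already been parsed into simple per-port records. The record layout, role labels, port names, and policy names below are all hypothetical, and taking the most common policy is just one plausible rule, not a claim about how any product works.

```python
from collections import Counter


def recommend_policy(ports, role):
    """Return the most common policy among configured ports with this role, or None."""
    policies = [p["policy"] for p in ports if p["role"] == role and p.get("policy")]
    if not policies:
        return None
    return Counter(policies).most_common(1)[0][0]


def find_drift(ports, role):
    """Return names of ports whose policy differs from the recommended one."""
    recommended = recommend_policy(ports, role)
    return [
        p["name"]
        for p in ports
        if p["role"] == role and p.get("policy") and p["policy"] != recommended
    ]


# Hypothetical records, as if extracted from device configuration files.
ports = [
    {"name": "xe-0/0/0", "role": "ebgp-edge", "policy": "EDGE-IN-V2"},
    {"name": "xe-0/0/1", "role": "ebgp-edge", "policy": "EDGE-IN-V2"},
    {"name": "xe-0/0/2", "role": "ebgp-edge", "policy": "EDGE-IN-V1"},
    {"name": "xe-0/0/3", "role": "internal", "policy": "CORE"},
]

recommendation = recommend_policy(ports, "ebgp-edge")
drifted = find_drift(ports, "ebgp-edge")
```

Here the recommendation for a newly configured eBGP port would be the policy most of its peers already use, and the one port still on the older policy shows up as a candidate for review.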
These are relatively straightforward business rules that could help combat configuration drift within a network. Difference making? Not hugely. And there are already ways to do some of this. But the point isn’t to make a tectonic change—it’s to understand the supporting practices that have to be developed to make machine learning a viable part of day-to-day operations.
The bottom line
We have a bunch of evidence in tech that no matter how compelling something is, if we fail to make it usable for the majority of people, the technology simply withers and dies. We simply have to be addressing ease of use, even as we talk about the next generation of technology. If we aim too far and leave too many people feeling like the future is unapproachable, we will have helped no one.
For an industry that has historically ignored consumption, we need to take a long hard look in the mirror. If we fail to make these things consumable, the lack of downstream adoption will limit the technology to only a corner of the market. And that means that we will not be able to justify the expense required to really make these things meet the promise that exists right now primarily in blogs and conference keynotes.