A little over 3 years ago, Air France Flight 447 from Rio de Janeiro to Paris disappeared mid-flight. In the years since, there has been a great deal of research, analysis, and speculation to determine what exactly caused the Airbus A330-200 to crash. Investigators have generally pointed to human error as the primary culprit in the disaster. The UK Telegraph details the findings in an April 28 article entitled “Air France Flight 447: 'D*mn it, we’re going to crash’”. In the article, the author writes that the pilots “made a fatal and sustained mistake” when their actions induced a stall that ultimately doomed the flight.
Most of us probably remember only bits and pieces of the investigation. My own recollection centers on the pitot tubes that failed, knocking out instruments (notably, the airspeed indicators). But when you dig into the facts a bit more, particularly the reasons behind the pilots’ actions, the real issues get quite a bit more intriguing.
The UK Telegraph article does a pretty good job of spelling this out, but let me summarize it here. In modern aircraft, pilots control flight surfaces through computers that translate pilot input into airplane action. These systems are called fly-by-wire because the inputs are quite literally sent through electrical wires. Once you move to a computer-based system, you no longer need actual yokes (think steering wheels for planes) to maneuver the plane. This is where Boeing and Airbus have different philosophies: Boeing has a replica of a physical yoke in front of the pilots, while Airbus uses a joystick at the side of each pilot’s seat. When a pilot in a Boeing plane pulls back on the yoke, both pilots get tactile feedback on what is happening. When a pilot in an Airbus plane pulls back on the joystick, only that pilot knows the stick is being pulled back.
In the Air France disaster, one of the pilots had been pulling back on the stick through much of the ordeal. In his efforts to climb, he had inadvertently caused a stall. The other pilots did not know what he had been doing until it was too late, so they were unable to correctly diagnose the problem. Less than one minute before they crashed, the co-pilot finally revealed the most important piece of information for the crew: “But I've had the stick back the whole time!”
When there are multiple inputs to the system (one pilot trying to climb while the other tries to descend), what action should the plane take?
Networking isn’t all that different. In traditional networks, there is a single admin (or team) managing the network. There is one set of inputs, so the behavior of the network is predictable (or at least traceable to the person or persons responsible). When things do not go as planned and you start getting errors, it is relatively straightforward to go back to the inputs driving the network (typically, device configuration). But in a world with multiple inputs, how do you know what is driving behavior?
SDN will actually make this a lot more difficult. As we move beyond persistent config and start using ephemeral network state to determine device and network behavior, how does the admin (or the operations team) know what inputs were being used? Essentially, if the operations team is one pilot and SDN is another pilot, how do we make sure that no one is pulling the stick back without us knowing, especially as the inputs might be changing rapidly once they are programmatic? The issue here is not that we shouldn’t use ephemeral state but rather that we need the networking equivalent of tactile feedback to ensure we know what is going on when we need to know. Put differently, we need to decide whether we are better off following a Boeing or an Airbus design philosophy for networking.
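One way to picture the networking equivalent of tactile feedback is a state store that remembers who wrote each value and speaks up when one source overrides another. The following is only a minimal sketch under stated assumptions: `StateStore`, `Change`, the key format, and the source names "operator" and "controller" are all hypothetical and do not correspond to any real controller API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Change:
    key: str      # hypothetical state key, e.g. "leaf1:eth3:vlan"
    value: str    # the value that was written
    source: str   # who supplied the input, e.g. "operator" or "controller"


@dataclass
class StateStore:
    current: Dict[str, Change] = field(default_factory=dict)
    history: List[Change] = field(default_factory=list)
    alerts: List[str] = field(default_factory=list)

    def apply(self, key: str, value: str, source: str) -> None:
        """Record a write and raise an alert if it overrides another source."""
        previous = self.current.get(key)
        change = Change(key, value, source)
        self.history.append(change)  # every input is kept, not just the winner
        if (previous is not None
                and previous.source != source
                and previous.value != value):
            self.alerts.append(
                f"dual input on {key}: {source} set {value!r}, "
                f"overriding {previous.source} ({previous.value!r})"
            )
        self.current[key] = change

    def who_set(self, key: str) -> Optional[str]:
        """Return the source of the value currently in effect, if any."""
        change = self.current.get(key)
        return change.source if change else None


# Usage: the operator and a (hypothetical) controller write the same key.
store = StateStore()
store.apply("leaf1:eth3:vlan", "100", "operator")
store.apply("leaf1:eth3:vlan", "200", "controller")
print(store.alerts)
print(store.who_set("leaf1:eth3:vlan"))
```

The sketch deliberately lets the last writer win. The point is not to settle which pilot should have authority, only to make sure a conflicting input never goes unannounced.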
Going back to Flight 447, the lack of tactile feedback was certainly partially to blame for the pilots’ failure to correctly diagnose the problem, but it doesn’t explain why a trained pilot would suddenly forget how to fly a plane just because one of the instruments went out. Flying a plane has become increasingly complex as technology gives us more information and more control. At the same time, it has become increasingly easy as modern airplanes abstract the complexity, using systems to automate and take care of a lot of what used to be manual. This effectively reduces the workload of flying a complex aircraft (remember the old jets that required three pilots?), but it also means that without the computer, it is almost impossible to fly the plane safely. What happens when you cannot trust the automated systems (for example, when they get bad airspeed data)?
Again, the parallels to networking are strong. As networking has become more complex (think about the impact of VM mobility on VLAN provisioning), the industry has moved to create systems and tools that abstract the complexity. The rise of overlays and SDN is essentially the networking equivalent of fly-by-wire. And for the most part, this is a good thing. We shouldn’t need to know about all the complexity all the time. The underlying transport should be largely transparent to the services that run on the network.
The real question is to what extent we can afford to completely abstract out the topology. The topology will matter in at least two scenarios:
You cannot guarantee that there is always enough bandwidth – If you ever need to balance traffic, optimize paths, or do any traffic engineering, you need to have some knowledge about the underlying L2/L3 transport.
Something goes wrong – If something ever breaks in the network, you need to have visibility into what is happening within the network.
In either case, a completely opaque overlay can cause problems. The value of overlays is apparent (flying a 787 or A380 cannot require us to twiddle every knob on the plane), but we need overlays to be transparent so that when we need information from the network, that information is accessible.
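To make "transparent overlay" a little more concrete, here is a rough sketch of the kind of question an operator should be able to ask: which overlay tunnels ride over a given underlay link? Everything in it is an assumption for illustration. The node and tunnel names are made up, and the underlay is modeled as a single hop-count shortest path found with breadth-first search, which ignores ECMP and real routing metrics.

```python
from collections import deque


def underlay_path(links, src, dst):
    """Return one shortest path (by hop count) over an undirected underlay."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path
        # sorted() only makes the choice among equal-length paths repeatable
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # dst is unreachable from src


def tunnels_over_link(links, tunnels, failed):
    """List overlay tunnels whose underlay path crosses the given link."""
    failed_link = frozenset(failed)
    affected = []
    for name, (src, dst) in tunnels.items():
        path = underlay_path(links, src, dst)
        if path is None:
            continue
        hops = {frozenset(pair) for pair in zip(path, path[1:])}
        if failed_link in hops:
            affected.append(name)
    return sorted(affected)


# Hypothetical two-spine fabric with two overlay tunnels.
links = [("leaf1", "spine1"), ("spine1", "leaf2"),
         ("leaf1", "spine2"), ("spine2", "leaf3")]
tunnels = {"vni100": ("leaf1", "leaf2"), "vni200": ("leaf1", "leaf3")}
print(tunnels_over_link(links, tunnels, ("spine1", "leaf2")))
```

A real network would need the actual forwarding state rather than a recomputed path, but the shape of the query is the same: the overlay stays simple to use while the mapping to the transport stays available on demand.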
Ultimately, we cannot allow the simplification of managing networks to lead to an eventual inability to run networks. Today, people run networks a lot like how they fly planes – on intuition. You look for inputs that seem different from what you would expect (an airspeed reading or an OSPF neighbor being down), and then you react accordingly. Once things start happening more automatically, it is going to be more difficult to intuit what is happening because you don’t necessarily know what the correct configuration or network state should be (it keeps on changing!). As an industry, we are going to have to make sure that we help people retain the ability to troubleshoot despite increasing levels of abstraction and automation. In short, we need to make sure that operators never forget how to fly a plane.
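If the expected state keeps changing, one modest aid is to have the automation publish what it currently intends so that it can be compared against what the network actually reports. The sketch below assumes both views are available as simple key/value snapshots. The keys and values are hypothetical, and real systems expose far richer state than this.

```python
def find_drift(intended, observed):
    """Return {key: (intended_value, observed_value)} for every mismatch.

    A value of None means the key is missing from that snapshot.
    """
    drift = {}
    for key in sorted(set(intended) | set(observed)):
        want = intended.get(key)
        have = observed.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical snapshots: what the controller intends vs. what devices report.
intended = {"ospf:10.0.0.2": "full", "vlan:100": "present"}
observed = {"ospf:10.0.0.2": "down", "vlan:100": "present",
            "vlan:200": "present"}
for key, (want, have) in find_drift(intended, observed).items():
    print(f"{key}: intended={want} observed={have}")
```

This does not replace intuition, but it gives the operator a current baseline to reason from when the "correct" state is no longer something a person configured by hand.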
At the end of it all, it is unfair to reduce all that happened in that Air France flight to tactile feedback and troubleshooting. It is equally unfair to heap all the blame on pilot error. It is worth asking the question though: in the race for simplicity, are we leaving ourselves unprepared to handle adversity when it arises? In the networking space, I think the answer is uncertain right now. It will be interesting to see how we collectively balance the desire for simplification with the need for information.