What it comes down to is this question: Why active/active?
In 10 years of working in IT security, I have come across three possible answers:
a) To boost throughput
b) To resolve issues with asymmetric routing
c) So the 2nd unit "doesn't just sit there"
c) can be knocked out quickly. Unless there is a measurable advantage to active/active, having the 2nd unit "just sit there" is just fine. HA is implemented so there is failover, and because HA w/ NBD service is often more advantageous in the long run than a single unit with 4-hr service. This isn't about some nebulous feel-good advantage of active/active. This is about clear and measurable benefits that flow down to the bottom line.
b) can be a legitimate workaround. Ultimately, though, I far prefer to resolve the asymmetric routing situation itself. Active/active is harder to troubleshoot than active/passive, and asymmetric routing doesn't make it any easier, so from a TCO perspective fixing the routing is the better choice.
a) needs to be examined very closely, along several vectors. It has to be measurable and have a positive contribution to the company's bottom line.
a1) Are we truly boosting network throughput? The Juniper design of active/active means you could, as long as the ingress and egress ports for a given flow are on the same unit. Once traffic has to traverse the fabric link, you're losing that theoretical advantage. Also, if one unit can handle all the traffic you are throwing at it, then there's no need for active/active.
a2) Is it acceptable to be running at the speed of one unit during a failure scenario? For how long? Is NBD acceptable, or do we need to go to 4-hr for both units now? (Higher cost - it may be more advantageous to buy the more performant units and stay with NBD service)
a3) Was the intent to boost UTM/IDP throughput? By how much are we boosting it? What does that mean in a failure scenario (a2 all over again)? And is that even supported in active/active? (Currently: No)
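A back-of-the-envelope way to put numbers on a1) and a2), as a minimal Python sketch. The model is a deliberate simplification and not vendor-documented behavior: it assumes a flow that crosses the fabric link consumes forwarding capacity on both units, and that the fabric link itself caps cross-unit traffic. The function names and all parameters (`unit_gbps`, `cross_fraction`, `fabric_gbps`, `peak_gbps`) are hypothetical, so plug in your own measured values.

```python
def aa_capacity(unit_gbps: float, cross_fraction: float, fabric_gbps: float) -> float:
    """Usable active/active throughput under the simplified model above.

    cross_fraction is the share of traffic (0.0 to 1.0) whose ingress and
    egress ports sit on different units and therefore crosses the fabric link.
    """
    if not 0.0 <= cross_fraction <= 1.0:
        raise ValueError("cross_fraction must be between 0 and 1")
    # Assumption: cross-fabric flows are processed on both units,
    # so they count twice against the combined capacity of 2 * unit_gbps.
    capacity = 2 * unit_gbps / (1 + cross_fraction)
    if cross_fraction > 0:
        # Assumption: the fabric link bandwidth limits cross-unit traffic.
        capacity = min(capacity, fabric_gbps / cross_fraction)
    return capacity


def assess(peak_gbps: float, unit_gbps: float,
           cross_fraction: float, fabric_gbps: float) -> dict:
    """Answer the a1) and a2) questions for one set of inputs."""
    return {
        # a1) If one unit carries the peak, active/active buys nothing.
        "one_unit_sufficient": peak_gbps <= unit_gbps,
        "aa_capacity_gbps": aa_capacity(unit_gbps, cross_fraction, fabric_gbps),
        # a2) Traffic that cannot be carried while one unit is down.
        "failure_shortfall_gbps": max(0.0, peak_gbps - unit_gbps),
    }


if __name__ == "__main__":
    # Made-up inputs: 15 Gbps peak, 10 Gbps units, 30% of traffic crossing.
    print(assess(peak_gbps=15, unit_gbps=10, cross_fraction=0.3, fabric_gbps=10))
```

Under this model, with every flow crossing the fabric (`cross_fraction=1.0`) the cluster delivers no more than a single unit would, which is the point of a1). Any non-zero `failure_shortfall_gbps` is what you have to live with until the replacement arrives, which is the a2) question.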
And then you need to carefully think about the possible drawbacks of active/active:
- Which features become unsupported, and did we need those features? (IDP and UTM, others?)
- Is the added complexity of troubleshooting worth the measurable benefit of active/active?
- What is the impact on TCO? Consider not just the possible added time spent troubleshooting, but also the skill level of your network engineers. Will you need to hire more costly resources to support this infrastructure? How about designing expansion of the infrastructure as time goes on and making sure that the benefits of active/active remain through that expansion?
I'll spare you the head-scratching and come right out with it: I have yet to see an environment where active/active is the right answer. I've seen it implemented only for ill-defined reasons such as c), and have yet to see anyone implement active/active for a clear, measurable benefit.
Of course active/active can be the right answer, as long as the lower performance in failure state is acceptable, and that failure state is less of a drag on the bottom line for its duration than just buying "the bigger box" would be. I just haven't seen the environment where that was the right answer, once people got over c). And boy do I hear c) a lot. 🙂
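The "failure state vs. the bigger box" comparison above can be sketched as a simple cost model in Python. Everything here is an assumption for illustration: the `Option` class, its fields, and the figures in the usage example are hypothetical, and a real TCO calculation would include more than these terms.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One design option with hypothetical cost inputs."""
    name: str
    hardware: float                    # up-front purchase cost
    support_per_year: float            # NBD or 4-hr service contract
    extra_ops_per_year: float = 0.0    # added troubleshooting / staffing cost
    failures_per_year: float = 0.0     # expected unit failures per year
    degraded_hours_per_failure: float = 0.0  # hours at single-unit speed
    cost_per_degraded_hour: float = 0.0      # business cost of running degraded

    def total_cost(self, years: int) -> float:
        # Expected yearly cost of running at the speed of one unit.
        degraded = (self.failures_per_year
                    * self.degraded_hours_per_failure
                    * self.cost_per_degraded_hour)
        yearly = self.support_per_year + self.extra_ops_per_year + degraded
        return self.hardware + years * yearly


def cheaper(a: Option, b: Option, years: int) -> Option:
    """Return whichever option costs less over the planning period."""
    return min((a, b), key=lambda option: option.total_cost(years))


if __name__ == "__main__":
    # Made-up figures, purely to show the shape of the comparison.
    active_active = Option("active/active, smaller units", 40000, 4000,
                           extra_ops_per_year=5000, failures_per_year=0.5,
                           degraded_hours_per_failure=24,
                           cost_per_degraded_hour=500)
    bigger_box = Option("active/passive, bigger units", 60000, 6000)
    winner = cheaper(active_active, bigger_box, years=3)
    print(winner.name, winner.total_cost(3))
```

The useful part is not the answer for any one set of numbers but being forced to write down the degraded-hours cost and the extra operations cost at all. Those are exactly the terms that get skipped when the reason is c).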