Desperately calling for some help here, because we're starting to run out of ideas on this one...
We just installed a cluster of SRX1500s as the routing (and filtering) instance in our network. This used to be done by an HP Layer 3 switch (direct routing with VLANs), but we wanted something sexier than stateless ACLs.
We wanted to validate the routing part first, before any filtering, so we racked the SRX above the HP switch, keeping the latter for L2 operations only, and tagged all of our VLANs on a single 4x1G LACP trunk between the HP and the SRX (actually 2x4: one trunk to the active SRX, the other to the passive).
The SRX is defined as the gateway for our VLANs, and we made a single zone TRUST; all we do right now is allow ALL traffic from zone TRUST to zone TRUST.
We can ping whatever gateway or equipment we need, everything goes through fine, it works. But the problem comes with throughput when traffic increases.
Our inter-VLAN traffic is around 200 MBytes/sec, sometimes peaking at 400 MBytes/sec, but it just doesn't work. Or, more accurately, it does work but runs sooooo slow (we have some video devices and collector servers to test with, in separate VLANs: half of them freeze, some of them work, then drop, then others start working, but at very few fps, etc.).
- We thought it could be the 4x1Gb LACP aggregate, so we went to 8x1Gb; still the same.
- To confirm the 4Gb aggregate wasn't the problem, we replaced the SRX cluster with another L3 switch as the routing instance, and it just worked fine. We then disabled the links one by one: it kept working until only one 1Gb link remained, where it started to lag a little, but almost unnoticeably, and nothing compared to the SRX case.
So here we are, kind of stuck. Troubleshooting is awkward, as everything works fine (ping, ssh, whatever) until there's "too much" traffic (like 150 MBytes/sec... huge ^^), and we triple-checked the SRX1500 datasheets: they're sold to handle 5 GBytes/sec of IMIX traffic, so even with a pessimistic estimate it should handle a third of that - 1.5 GBytes/sec - with no trouble at all.
But we're bottlenecking at around 100 MBytes/sec or so...
Information on the case:
- Everything is on VLAN subnetworks
- Everything works fine with L3 switches in place of the SRX1500 cluster
- 1 zone TRUST, with every logical reth0.X defined in it (a single reth (reth0) carrying logical interfaces: reth0.2, reth0.35, etc.)
- 1 policy from zone TRUST to zone TRUST allowing ALL ALL ALL
- The cluster is healthy and failover works fine (we tried unplugging cables everywhere, just great, only a few packets lost)
- Inter-VLAN traffic is around a few hundred MBytes/sec, max 400
- Note that we use the 4x1Gb LACP aggregate for both "in" and "out" traffic, but this does work with another router in place of the SRX1500 (like an L3 HP switch)...
- It works fine with the SRX1500 when traffic is really low
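For reference, the single-zone, permit-all setup described above would look roughly like this in Junos set syntax (the unit numbers come from the reth0.2/reth0.35 examples mentioned; the addresses are purely illustrative):

```
# Hypothetical sketch of the setup described (addresses are made up).
set interfaces reth0 vlan-tagging
set interfaces reth0 unit 2 vlan-id 2
set interfaces reth0 unit 2 family inet address 10.0.2.1/24
set interfaces reth0 unit 35 vlan-id 35
set interfaces reth0 unit 35 family inet address 10.0.35.1/24

set security zones security-zone TRUST interfaces reth0.2
set security zones security-zone TRUST interfaces reth0.35

set security policies from-zone TRUST to-zone TRUST policy allow-all match source-address any
set security policies from-zone TRUST to-zone TRUST policy allow-all match destination-address any
set security policies from-zone TRUST to-zone TRUST policy allow-all match application any
set security policies from-zone TRUST to-zone TRUST policy allow-all then permit
```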
Don't hesitate to ask if you need more details; I can upload a diagram if needed, but nothing very special here, I think. Just an SRX1500 cluster on top of a good switch, with a distribution network under it... probably a classic architecture... And yet we can't find the problem --'
I would check that your HP switch is using two LACP LAG interfaces against your SRXs (one to each SRX node); otherwise I think the HP will send return traffic to the passive SRX. See this KB for more detail.
To me it sounds more like there might be packet drops when you have heavy traffic.
Since you are using it in cluster mode, make sure that the interfaces of Node-0 are bundled together as one ae interface on the switch, while the interfaces of Node-1 are bundled separately.
Let's say the physical connections look like this:
Node-0 (ge-0/0/0) ---------- (gig-0) Switch
Node-0 (ge-0/0/1) ---------- (gig-1) Switch
Node-1 (ge-0/0/0) ---------- (gig-2) Switch
Node-1 (ge-0/0/1) ---------- (gig-3) Switch
Therefore, on the switch, gig-0 and gig-1 should be bundled together to form one aggregated interface (ae0), and gig-2 and gig-3 should form ae1.
Note that the LACP load balances the traffic across all the child interfaces.
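On the SRX side, the corresponding reth membership would be something like the sketch below (the node-1 slot numbering, ge-7/0/x, is an assumption based on typical SRX1500 cluster renumbering; verify on your platform):

```
# Hypothetical SRX-side config matching the cabling above.
set chassis cluster reth-count 1
set interfaces ge-0/0/0 gigether-options redundant-parent reth0
set interfaces ge-0/0/1 gigether-options redundant-parent reth0
set interfaces ge-7/0/0 gigether-options redundant-parent reth0
set interfaces ge-7/0/1 gigether-options redundant-parent reth0
set interfaces reth0 redundant-ether-options redundancy-group 1
set interfaces reth0 redundant-ether-options lacp active
```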
You can also run the following command to see if any drops are observed in the interface stats.
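The command itself isn't shown in the thread; on Junos, interface-level drop counters are typically checked with something like this (reth0 is assumed from the setup described):

```
show interfaces reth0 extensive | match drops
show interfaces ge-0/0/0 extensive
```

Running the same check on each child interface of the aggregate helps spot whether drops are concentrated on one member link.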
Thanks for your answers, and sorry for the delay. On the switch side we do indeed have 2 LACP aggregates, one to each SRX. (We did make that mistake 2 weeks ago, with 1 aggregate spread across both SRXs, and it wasn't working at all, but we're good on that point now.)
To confirm it wasn't due to the cluster, we disabled the passive SRX and ran the primary node as a standalone: exactly the same result. It does work, but it's just bottlenecking hard somewhere...
Our SRX config is in the attached .txt.
It's probably just us not understanding something about how zones or logical interfaces work, but the fact that it works until a certain amount of traffic is reached makes troubleshooting annoying, even more so considering that this network carries security sub-systems, so we can't take it down for too long while troubleshooting...
We found that it does work when we change the forwarding mode to packet-based (with a "delete security" followed by a "set security forwarding-options family mpls mode packet-based" and a reboot). It works in standalone mode AND in cluster mode, with failover fully functional.
But as soon as we go back to flow-based forwarding mode, it struggles and slows down to less than 10 MBytes/sec...
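For anyone following along, the switch to packet mode described above is roughly this sequence (note that "delete security" wipes the entire security stanza, including zones and policies, so save your config first):

```
# WARNING: "delete security" removes all zones, policies, etc.
delete security
set security forwarding-options family mpls mode packet-based
commit
```

A reboot is required afterwards for the forwarding-mode change to take effect, as noted above. In packet mode the SRX no longer does stateful session processing, which is a strong hint the bottleneck is in the flow/session path rather than in raw forwarding.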
We're capping at around 300,000 sessions at cruise speed. That seems far below the 2 million sessions the SRX can handle according to the datasheet.
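For reference, the session count and flow-level statistics can be checked directly on the SRX with standard commands like:

```
show security flow session summary
show security monitoring
```

Comparing the session count against CPU utilization per SPU may show whether the box is session-limited or compute-limited under load.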