We have an SRX650 chassis with one RE and one 24xGE gPIM installed in the FPC2 slot. It is connected to an EX4200 Virtual Chassis by two aggregated links (both ports are on the gPIM, since the onboard ports are routed ports only).
We have started migrating some customers to this SRX. Each customer has its own L3 interface (VLAN) and a security zone, and there are security policies between those zones. After configuring each customer I commit the changes. We noticed that while the device commits some (not all) of those changes, it breaks LACP. That did not seem critical at first, since connectivity to the devices was not lost.
After adding the 4th or 5th customer (we got an LacpTimeout SNMP trap after configuring each of them) we needed to see the current flows passing through the device. The actual traffic through the SRX is very small for now: 10M and about 300-500 sessions (more and larger customers are still to be added). When we enter the command "show security flow session", after a couple of pages the device loses its connection to the network for about 10-20 seconds, and we get plenty of SNMP traps indicating that all existing ports have gone up/down, link aggregation broke/re-established, OSPF went down/full, and so on. The CPU load according to the SNMP monitoring system does not exceed 10-15%.
The Junos version is 11.1R1.10. Is this a software bug or a hardware issue?
I can provide any additional information if needed.
Unfortunately, upgrading to 11.1R3.5 didn't solve the issue. The device keeps losing its connection on a simple operation like "show security flow session", with only 300 sessions and 2-4 Mbit/s of traffic passing through it. Detailed monitoring showed that normal CPU usage is 28-30%, with a peak of up to 60% after entering that command. This is only one example; I think other commands may also cause downtime.
Yes indeed, contact JTAC. We had a similar case and I believe it was fixed in 10.4R6, but I'm not entirely sure.
What typically causes problems like this is logging. Use syslog to send your traffic logs to an external device; don't log them to the SRX flash. The device can generate far more logs than can be written to the flash, which can have a negative impact on other processes.
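As a rough sketch of that advice, the set commands below stream security (traffic) logs to an external collector rather than handling them locally. The addresses 192.0.2.1 (SRX source address) and 192.0.2.10 (collector) and the stream name "traffic-log" are placeholders, not values from this thread, and whether stream mode is available on this platform and release should be checked against the documentation for the Junos version in use.

```
# Assumed values: 192.0.2.1 = SRX source address, 192.0.2.10 = external syslog collector
set security log mode stream
set security log format sd-syslog
set security log source-address 192.0.2.1
# "traffic-log" is an arbitrary stream name chosen for this example
set security log stream traffic-log host 192.0.2.10
```

If a local file under "system syslog" is currently matching the traffic log messages, that file statement would also need to be removed for the flash to stop receiving them.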
Yes, I've already opened a case with JTAC. While they are trying to figure out a solution, my colleagues and I came up with the following idea: the device's only connection to the switch is through the ae1 interface, and all VLANs are present only on that link. When the problem occurs, the LACP timeout causes the ae link to go down. Since the only link carrying the VLANs is down, all VLAN L3 interfaces go down as well, because no other active link carries those VLANs. As far as I know, other vendors' devices (e.g. Cisco) behave the same way: if there is no active link carrying a VLAN, its SVI (L3 interface) is down.
I tested that idea by adding an extra link to the switch (not in the ae, just a spare link) and adding all VLANs to the trunk on that link. I then provoked the problem by issuing the command "show security flow session". Now there are no SNMP traps about VLAN interface state changes. But the ae interface still goes down, and OSPF and STP changes still occur, because the ae interface's downtime exceeds those protocols' timeout intervals. Traffic is interrupted for 10 to 15 seconds. There are currently fewer than 300 sessions and about 4 Mbps of traffic passing through the device, but more customers with much heavier traffic are to be added, so traffic interruptions would be a big problem. We tried adding extra options to "show security flow session", such as "interface <iface>", "source-prefix <prefix>", etc. Unfortunately, the problem occurs with them as well. So these commands load the CPU, which in turn causes the device to miss some LACP keepalive packets, which in turn causes the ae interface to go down. We're still looking for a solution...
So the problem seems to be LACP; everything else is just a result of that. It shouldn't happen, of course, but I'll let JTAC figure out what's going wrong.
One other workaround I can think of is switching LACP to slow timers. Sure, it would take a bit longer to detect link failures, but those usually don't happen that often. It gives the SRX more time to respond if the CPU is temporarily busy doing other things.
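A minimal sketch of the slow-timer workaround, assuming the bundle currently runs fast LACP timers. With fast timers LACPDUs are sent every second and the partner times out after 3 seconds; with slow timers they are sent every 30 seconds and the timeout is 90 seconds, which is well above the 10-20 second stalls described above. ae1 is the bundle named in the thread for the SRX; the name of the matching bundle on the EX4200 Virtual Chassis is not given, so ae1 on that side is only a placeholder.

```
# SRX650 side (ae1 is the bundle mentioned in the thread)
set interfaces ae1 aggregated-ether-options lacp periodic slow

# EX4200 Virtual Chassis side (bundle name assumed; use the actual peer ae interface)
set interfaces ae1 aggregated-ether-options lacp periodic slow
```

After committing on both ends, "show lacp interfaces ae1" can be used to check that both partners report the slow timeout. The trade-off is the one already mentioned: a genuinely failed member link can take up to 90 seconds to be detected through LACP alone.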