SRX Services Gateway

SRX650 losses connection (weird behavior)

‎07-27-2011 02:17 AM

Hello! 

 We have an SRX650 chassis with one RE and one 24xGE GPIM installed in the FPC2 slot. It is connected to an EX4200 Virtual Chassis by two aggregated links (both ports are on the installed GPIM, since the onboard ports are routed ports only).

 We started to migrate some customers to this SRX device. Each customer has its own L3 interface (VLAN) and a security zone, and there are some security policies between those zones. After configuring each customer I commit the changes. We've noticed that when the device commits some (not all) of those changes, it breaks LACP. That didn't appear to be critical, though, since the connection to the device is not lost.

 After adding the 4th or 5th client (after configuring each of them we got a LacpTimeout SNMP trap), we needed to see the current flows passing through the device. The actual traffic through the SRX is very small for now: about 10 Mbps and 300-500 sessions (more, and bigger, customers are still to be added). When we enter the command "show security flow session", after a couple of pages of output the device loses its connection to the network for about 10-20 seconds, and we get plenty of SNMP traps indicating that all existing ports have gone down/up, link aggregation broke/re-established, OSPF went down/full, etc. The CPU load according to the SNMP monitoring system does not exceed 10-15%.

 

The Junos version is 11.1R1.10. Is this a software bug or a hardware issue?

I can provide any additional information if needed.

11 REPLIES

Re: SRX650 losses connection (weird behavior)

‎07-27-2011 02:25 AM

Hi

 

Although a hardware fault cannot be excluded, you should never use R1 in production.

Try upgrading to 11.1R3 (or, maybe, 10.4R6). If it does not help, you should also open a case at JTAC.

Best Regards,
PK

Juniper Ambassador, Juniper Networks Certified Instructor,
JNCIE-SEC #98, JNCIE-ENT #393, JNCIE-SP #2253
Twitter: @JuniperTrain
GitHub: https://github.com/pklimai
[Juniper Authorized Education & Support in Russia]

Re: SRX650 losses connection (weird behavior)

‎07-28-2011 12:04 AM

Thanks for the reply, pk.

 Unfortunately, upgrading to 11.1R3.5 didn't solve the issue. The device keeps losing its connection on a simple operation, "show security flow session", with only 300 sessions and 2-4 Mbps of traffic passing through it. Detailed monitoring showed that normal CPU usage is 28-30%, with a peak of up to 60% after entering that command. That command is only an example; I think other commands may also cause downtime.

So, I will contact JTAC then.


Re: SRX650 losses connection (weird behavior)

‎07-28-2011 02:35 AM

Yes indeed, contact JTAC. We had a similar case and I believe it was fixed in 10.4R6, but I'm not entirely sure.

 

What typically causes problems like this is logging. Use syslog to send your traffic logs to an external device; don't log them to the SRX flash. The device can generate far more logs than can be written to the flash, which can have a negative impact on other processes.


Re: SRX650 losses connection (weird behavior)

‎07-28-2011 03:24 AM

Alas, disabling logging to flash didn't help either.


Re: SRX650 losses connection (weird behavior)

‎07-31-2011 03:43 AM

Hi,

A CF driver optimisation (code fix) has been added in 11.2R2 and 10.4R6.

What OSPF timer and LACP timer values are you running?

 

--thanks

 

 


Re: SRX650 losses connection (weird behavior)

‎08-01-2011 12:23 AM

Hi, nebu

It appears that our problem is not connected to the CF issue.

OSPF and LACP timers are at their defaults: OSPF Hello 10, Dead 40, ReXmit 5. LACP is configured for fast periodic (1 sec).

 


Re: SRX650 losses connection (weird behavior)

‎08-01-2011 10:33 AM

Hi,

Would you please open a JTAC case and share the case number?

-thanks


Re: SRX650 losses connection (weird behavior)

‎08-02-2011 11:48 PM

Yes, I've already opened a case with JTAC. While they are trying to figure out a solution, my colleagues and I came up with the following idea: the device's only connection to the switch is through the ae1 interface, and all VLANs are present only on that link. When the problem occurs, the LACP timeout causes the ae link to go down. Since the only link carrying the VLANs is down, all VLAN L3 interfaces also go down, because there is no other active link with these VLANs. As far as I know, other vendors' devices (e.g. Cisco) behave the same way: if there is no active link carrying the VLAN, the SVI (L3 interface) is down.

    I checked that idea by adding an extra link (not in the ae, just a spare link) to the switch and adding all VLANs to the trunk on that link. After that I provoked the problem by issuing the command "show security flow session". Now there aren't any SNMP traps about VLAN interface state changes. But the ae interface still goes down, and OSPF and STP changes also occur because the ae interface's downtime causes those protocols' timeout intervals to be exceeded. Traffic is interrupted for 10 to 15 seconds. There are now fewer than 300 sessions and about 4 Mbps of traffic passing through the device, but more customers with much more significant traffic are still to be added, so such interruptions would be a big problem.
    We tried adding extra options to the command "show security flow session", like "interface <iface>", "source-prefix <prefix>", etc. Unfortunately, the problem occurs with them as well.
So, using the commands mentioned has an impact on the CPU, which in turn causes the device to lose some LACP keepalive packets, which in turn causes the ae interface to go down. We're still looking for a solution...
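
The VLAN observation above can be sketched as a tiny model. This is an illustration only, not how Junos implements it: the rule assumed here is that a VLAN L3 interface is up when at least one active trunk carries that VLAN. The spare link name `ge-2/0/10` and the VLAN IDs are hypothetical; only `ae1` comes from the thread.

```python
def vlan_l3_interface_up(vlan_id, trunks):
    """Assumed rule: the L3 interface is up if any active trunk carries the VLAN."""
    return any(t["up"] and vlan_id in t["vlans"] for t in trunks.values())


# Hypothetical customer VLAN IDs, for illustration only.
customer_vlans = {101, 102, 103}

# Case 1: ae1 is the only trunk, and it is down after the LACP timeout.
only_ae1 = {"ae1": {"up": False, "vlans": customer_vlans}}

# Case 2: same outage, but a spare trunk (hypothetical name) also carries the VLANs.
with_spare = {
    "ae1": {"up": False, "vlans": customer_vlans},
    "ge-2/0/10": {"up": True, "vlans": customer_vlans},
}

for name, trunks in (("ae1 only", only_ae1), ("ae1 + spare trunk", with_spare)):
    states = {v: vlan_l3_interface_up(v, trunks) for v in sorted(customer_vlans)}
    print(name, states)
```

In this model the spare trunk keeps the VLAN interfaces up, which matches the missing VLAN traps, but it does nothing for ae1 itself, which still drops.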


Re: SRX650 losses connection (weird behavior)

‎08-03-2011 01:48 AM

So the problem seems to be LACP; everything else is just a result of that. It shouldn't happen, of course, but I'll let JTAC figure out what's going wrong.

 

One other workaround I can think of is switching LACP to slow timers. Sure, it would take a bit longer to detect link failures, but those usually don't happen that often. That gives the SRX more time to respond if the CPU is temporarily busy doing other things.
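
A rough way to see the trade-off, as a simplified sketch rather than a description of the SRX implementation: it assumes the standard IEEE 802.1AX values (fast periodic 1 s, slow periodic 30 s) and that a partner is timed out after three missed intervals. The 10 and 15 second stall durations are the outage lengths reported earlier in the thread; the function names are made up for this example.

```python
# Periodic LACPDU intervals in seconds (IEEE 802.1AX values).
LACP_PERIODIC_SECONDS = {"fast": 1, "slow": 30}

# Assumption of this model: timeout after three missed periodic intervals.
TIMEOUT_MULTIPLIER = 3


def lacp_timeout(mode):
    """Seconds without a received LACPDU before the partner is timed out."""
    return LACP_PERIODIC_SECONDS[mode] * TIMEOUT_MULTIPLIER


def survives_stall(mode, stall_seconds):
    """True if a stall of this length is shorter than the LACP timeout."""
    return stall_seconds < lacp_timeout(mode)


for mode in ("fast", "slow"):
    for stall in (10, 15):
        result = "survives" if survives_stall(mode, stall) else "times out"
        print(f"{mode}: timeout {lacp_timeout(mode)}s, {stall}s stall -> {result}")
```

Under these assumptions a 10-15 second stall outlasts the fast timeout but not the slow one. The cost is that a partner that stops responding while the physical link stays up can take up to the slow timeout to be detected.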


Re: SRX650 losses connection (weird behavior)

‎08-03-2011 04:35 AM

Hi,

 I've already informed JTAC about my thoughts. As a workaround we've already set the LACP timers to slow. It really does the trick, but we are very concerned about the obvious drawback of this action.

Solution
Accepted by topic author Jadmin
‎08-26-2015 01:27 AM

Re: SRX650 losses connection (weird behavior)

‎05-12-2012 12:34 AM

Well,

It's been more than half a year since we last saw the issue. The problem disappeared right after we installed the 10.4R7.5 software.

 
