SRX Services Gateway
Contributor
packermann
Posts: 66
Registered: ‎02-12-2010

Failover because of failed fxp0?

I'm running an SRX650 cluster with Junos 10.0R3. On Friday we had a failover that I can't really explain. node1 was the primary; I was just editing some policies when my SSH connection froze and the cluster failed over. I connected to the failed node1 via the serial console to take a closer look at what had happened, but the node seemed to be OK. On node0, right after the failover, "show chassis cluster status" showed node1 with the status lost. After some minutes it was secondary. "show interfaces terse" on node1 also looked OK, but fxp0 wasn't reachable at all. Also, when I tried to commit any configuration changes on node0 or node1, I got a strange error:

 

# commit and-quit
node0:
configuration check succeeds
error: failed to copy file '//var/etc/policy.id+' to 'node1'

 

After a reboot of node1 everything was OK again: fxp0 works and committing changes is no longer a problem. I have now searched my logs up and down, and the only thing I found was in the chassisd log:

send: red alarm set, device Routing Engine 0, reason Host 0 fxp0 : Ethernet Link Down

 

Does that mean the cluster failed over because it detected a failure of fxp0? I don't monitor it explicitly.

Don't you wish there was a knob on the TV to turn up the intelligence? There's one marked 'Brightness,' but it doesn't work.
Trusted Expert
WL
Posts: 790
Registered: ‎07-26-2008

Re: Failover because of failed fxp0?


This sounds like an error someone posted about earlier, where the cluster saw both nodes as primary.

 

Can you post the output of "show chassis cluster information"? Note that you need to type this command in full. That should give us some more ideas.

 

The cluster should not be failing over due to fxp0 going down. I would also suggest that you take a look at "show log jsrpd" for the time of the issue.

****pls click the button " Accept as Solution" if my post helped to solve your problem****
Contributor
packermann

Re: Failover because of failed fxp0?

From jsrpd-log:

Jul 30 15:35:54 Interface fxp1 is down. devflags: 0x3, ifdm_flags: 0x8
Jul 30 15:35:54 Ctrl-link state change: UP->DOWN
Jul 30 15:35:54 Flowd down, setting flag. RG-0 state: primary
Jul 30 15:35:54 csmon failure, computed-weight 0, cs-mon-weight 255
Jul 30 15:35:54 LED color changed from : Green to Amber, reason Monitored objects are down
Jul 30 15:35:54 Current threshold for rg-1 is 0. Setting priority to 0. Failures: cold-sync-monitoring
Jul 30 15:35:54 Starting Flowd DOWN timer.
Jul 30 15:35:54 Starting FLOWD down timer: 14
Jul 30 15:35:54 Interface fxp0 is going down
Jul 30 15:35:55 Failed to send ctrl hearbeat packet on ctrl link(0) with ifl_idx 4, error -1
Jul 30 15:36:08 last message repeated 6 times
Jul 30 15:36:08 FLOWD down handler. RG-0, state (primary)
Jul 30 15:36:08 FLOWD state change: UP->DOWN. Changing RG-0 state PRIMARY->SECONDARY_HOLD
Jul 30 15:36:08 Successfully sent an snmp-trap due to a failover from primary to secondary-hold on RG-0 on cluster 6 node 1. Reason: Ctrl-link down
Jul 30 15:36:08 Entering secondary-hold, previous primary for RG-0 is node1
Jul 30 15:36:08 Setting backup ready to false
Jul 30 15:36:08 Waiting for other node take over RG-0 primaryship. Setting RG-0 primary node-id to 0 (invalid)
Jul 30 15:36:08 updated rg_info for RG-0 with failover-cnt 2 state: secondary-hold into ssam. Result = success, error: 0
Jul 30 15:36:08 FLOWD down handler. RG-1, state (primary)
Jul 30 15:36:08 FLOWD state change: UP->DOWN. Changing RG-1 state PRIMARY->SECONDARY_HOLD
Jul 30 15:36:08 updated rg_info for RG-1 with failover-cnt 2 state: secondary-hold into ssam. Result = success, error: 0
Jul 30 15:36:09 uspipc client pfe channel is shutdown due to idle timeout or error in receiving msg or pfe going down.Reconnecting to pfe..
Jul 30 15:36:09 usp ipc connection shutdown, suspending fabric monitoring
Jul 30 15:36:09 fabric probe dead timer stopped
Jul 30 15:36:09 tnp address from PIC entry for pfe: 0x2600001
Jul 30 15:36:09 uspipc client pfe channel established
Jul 30 15:36:09 fabric probe dead timer not started since fab-mon is suspended
Jul 30 15:36:09 Failed to send ctrl hearbeat packet on ctrl link(0) with ifl_idx 4, error -1
Jul 30 15:36:09 SECONDARY-HOLD->SECONDARY due to back to back failover timer expiry for RG-1
Jul 30 15:36:09 updated rg_info for RG-1 with failover-cnt 2 state: secondary into ssam. Result = success, error: 0
Jul 30 15:36:09 Failed to send ctrl hearbeat packet on ctrl link(0) with ifl_idx 4, error -1

 

In the "show chassis cluster information" output, there seems to be only one interesting entry, under redundancy groups 0 and 1:

Jul 30 15:35:51.481 : secondary->primary, reason: device timer expired

 

 

Trusted Expert
WL

Re: Failover because of failed fxp0?

Hi there

 

From the log, it's fxp1 that went down, not fxp0. fxp1 is the control link, so the node will definitely fail over.

 

Jul 30 15:35:54 Interface fxp1 is down. devflags: 0x3, ifdm_flags: 0x8
Jul 30 15:35:54 Ctrl-link state change: UP->DOWN
Jul 30 15:35:54 Flowd down, setting flag. RG-0 state: primary

 

I think you need to check your control link to make sure the cabling etc. is all fine.

Contributor
packermann

Re: Failover because of failed fxp0?

We had trouble with instability of the cluster before, which also showed up as jitter on the control link. We were running 10.1R1 at the time. We exchanged all the cabling, including the control link, at least twice; I think the control link even three times (using Cat5e cabling at first and now Cat6). According to JTAC, the instability was caused by a memory leak in 10.1R1 and we should downgrade to 10.0R3, which we did.

So, as cabling should really not be the problem, am I right that all our problems so far could also be related to faulty hardware?

Trusted Expert
WL

Re: Failover because of failed fxp0?

Looking at the logs, I can't tell whether it's a hardware issue, though.

 

Can you take a look at the chassisd log from the time of the issue and see whether it detected a PHY link down?

Also, what are your current settings for the heartbeat interval and threshold?

Contributor
packermann

Re: Failover because of failed fxp0?

chassisd.txt for that time is attached.

 

Regarding heartbeat:

        heartbeat-interval 2000;
        heartbeat-threshold 8;
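
For reference, here is a rough sketch of what these two values imply for control-link failure detection. It assumes the usual reading of the knobs, namely that the link is declared down once heartbeat-threshold consecutive heartbeats have been missed; the exact timers Junos uses internally are not shown in this thread, and the function name is made up for illustration.

```python
# Values taken from the posted cluster configuration.
heartbeat_interval_ms = 2000
heartbeat_threshold = 8


def detection_window_ms(interval_ms: int, threshold: int) -> int:
    """Time spanned by `threshold` consecutive missed heartbeats.

    Assumption: failure is declared after `threshold` heartbeats in a row
    are missed, so the window is simply interval * threshold.
    """
    return interval_ms * threshold


window_ms = detection_window_ms(heartbeat_interval_ms, heartbeat_threshold)
print(f"control link declared down after roughly {window_ms / 1000:.0f} s of silence")
```

With the posted settings this gives a window of about 16 seconds, compared with the much shorter window that smaller interval and threshold values would produce.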

 

I set these values because I experienced jitter of about 4000 ms on the control link with 10.1R1. I also told JTAC about this, but they said it was nothing to worry about. With 10.0R3 I also get lots of jitter messages:

Jun 21 02:36:48 noticed a jitter of 18446744069414600 milli seconds on ctrl link between 47234 to 47235

 

I asked JTAC about these weird numbers and they again told me it was nothing to worry about, that it is just cosmetic...
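
A side note on why a number like this can plausibly be cosmetic: the logged value is very close to (2^64 - 2^32) / 1000, which is the kind of figure you get when a negative difference is stored in an unsigned 64-bit field and then scaled. The sketch below only demonstrates that arithmetic. The unit (microseconds) and the size of the negative delta are assumptions chosen to match the log, not confirmed Junos internals.

```python
LOGGED_JITTER_MS = 18446744069414600  # value from the jsrpd message above


def as_unsigned_64(value: int) -> int:
    """Reinterpret a signed integer as an unsigned 64-bit quantity."""
    return value % (1 << 64)


# Hypothetical: a negative delta of 2**32 (assumed to be microseconds)
# ends up in an unsigned 64-bit variable and wraps around.
wrapped_us = as_unsigned_64(-(1 << 32))
wrapped_ms = wrapped_us // 1000

print("wrapped value:      ", wrapped_us)
print("scaled to ms:       ", wrapped_ms)
print("distance from log:  ", abs(wrapped_ms - LOGGED_JITTER_MS))
```

The scaled value differs from the logged one by less than 100, which is consistent with the log line having been printed at limited precision. That would point to a display problem in the jitter calculation, not to real jitter of that size.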

 

Trusted Expert
WL

Re: Failover because of failed fxp0?


Hmm, can you check the messages log as well?

 

I see that the board was restarted, but if you compare the timestamps this seems to have happened after the fxp1 interface went down. Another thing to check is whether you had a high traffic load at the time of the issue.

 

We did have some cosmetic issues with the jitter messages, but in this case fxp1 went down, so I think the focus should be on that as the cause of the failover.

 

Jul 30 15:36:09 CHASSISD_FRU_ONLINE_TIMEOUT: lcc_online_timeout_hdlr: attempt to bring LCC 0 online timed out
Jul 30 15:36:09 CHASSISD_FRU_OFFLINE_NOTICE: Taking LCC 0 offline: Restarting unresponsive board
Jul 30 15:36:09 lcc_offline_now - slot 0, reason: Restarting unresponsive board, error Chassis connection dropped

Note that the above is after the original fxp1 down time:
Jul 30 15:35:54 Interface fxp1 is down. devflags: 0x3, ifdm_flags: 0x8
Jul 30 15:35:54 Ctrl-link state change: UP->DOWN

Contributor
packermann

Re: Failover because of failed fxp0?

Regarding high traffic: definitely not. That was the first thing I checked. I query the cluster via SNMP and graph it with Cacti. Our uplink is also graphed through Cacti. At the time of this issue, we had a throughput of only 160 Mbit/s with only 80K sessions... this should be far from the limits of an SRX650. We are not running any IDP, AV, WF or VPN. As I'm also graphing CPU and memory, I can say there were no peaks; memory was at about 60%, CPU at 17%. Maybe the whole story of our cluster might help you; the case is 2010-0618-0658 if you have access to it.

Regarding the messages log: as I'm using a syslog server and fxp0 wasn't functioning during this failure, I have no records for this node...

Trusted Expert
WL

Re: Failover because of failed fxp0?

I see what you mean. From the logs, we can only say that the control link went down first and then the chassis connection was dropped, leading to the chassisd restart.

 

So the failover was clearly due to fxp1 going down. What triggered that, however, we have no way to tell at this time, since there are no logs because fxp0 did not seem to be functioning properly.

 

One thing you could consider is turning on local syslogging for the time being, so that what occurred is at least logged locally and you can troubleshoot this issue. The proviso is that you constantly monitor file system storage to make sure you are not filling up the file system.

 

Something like the following, which limits the number of files and the file size:

 

set system syslog file messages archive files 20 size 2m
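
To get a feel for how much storage the suggested limits could consume, here is a small back-of-the-envelope sketch. It treats the numbers as an upper bound, ignores any compression of rotated files, and assumes the active file may sit on top of the archived ones; the function name is made up for illustration.

```python
ARCHIVE_FILES = 20  # file count from the suggested command
FILE_SIZE_MB = 2    # per-file size limit from the suggested command


def worst_case_mb(files: int, size_mb: int, include_active: bool = True) -> int:
    """Upper bound on disk space used by the rotated log set, in MB.

    Assumption: the active file is counted in addition to the archived
    files unless include_active is False.
    """
    total_files = files + 1 if include_active else files
    return total_files * size_mb


print(f"worst case: about {worst_case_mb(ARCHIVE_FILES, FILE_SIZE_MB)} MB for the messages log")
```

That is roughly 40 MB, which is the figure to keep in mind when checking free space on the file system that holds the logs.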

 

Aside from that, if there aren't any logs, it's going to be difficult to tell why fxp1 went down.
