SRX Services Gateway

SRX4100 missed heartbeats in cluster

‎05-23-2018 11:26 AM

Dear Members, 

We are experiencing a strange problem with our HA configuration. The nodes are newly installed and configured with a basic HA configuration. The problem is that a node transitions to the disabled state after missing heartbeats. The nodes are connected back to back, and we have tried changing the SFPs, the cables, and even both nodes, but the problem persists. Please note that a similar pair is working fine at another location with the same software and hardware.

 

We upgraded the software to the latest release as recommended by JTAC, but the issue remains. The case is now pending with ATAC, and all the related logs have been provided.

 

Please let me know if any of you has faced a similar situation and what the solution might be. For Juniper employees, the case number is 2018-0503-0166.

 

Error

May 23 21:14:04 Successfully sent jnxJsChClusterIntfTrap trap with severity minor to inform that Control link - em0 state changed from UP to DOWN on cluster 1; reason: missed heartbeats
May 23 21:14:07 missed heartbeats on control link between 25 to 33
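
Messages like these can be cross-checked against the cluster's own counters. The following is only a sketch of standard Junos operational commands that are commonly used for this, run on both nodes; output is omitted because it varies by platform and release, and the `jsrpd` log file name is assumed to be the default one.

```
# Overall redundancy-group state and node priorities
show chassis cluster status

# Control link and fabric link state
show chassis cluster interfaces

# Heartbeat packets sent and received on the control link
show chassis cluster control-plane statistics

# Heartbeat-related entries in the cluster daemon log (default file name assumed)
show log jsrpd | match heartbeat
```

Comparing the sent and received heartbeat counters on node0 and node1 can help show whether the loss is one-directional.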

 

Configuration

## Last commit: 2018-05-24 03:33:16 PKT by tayyab
version 15.1X49-D130.6;
groups {
    node0 {
        system {
            host-name LHR_SRX_CH_FWL01;
        }
        interfaces {
            fxp0 {
                unit 0 {
                    family inet {
                        address 10.12.41.227/23;
                    }
                }
            }
        }
    }
    node1 {
        system {
            host-name LHR_SRX_CH_FWL02;
        }
        interfaces {
            fxp0 {
                unit 0 {
                    family inet {
                        address 10.12.41.228/23;
                    }
                }
            }
        }
    }
}
apply-groups "${node}";
system {
    time-zone Asia/Karachi;
    root-authentication {
        encrypted-password "$5$Ne4994/h$78cjDSVswBRh1lmOSdYwUTny7P/kZDG80bZoKJKCkb5"; ## SECRET-DATA
    }
    login {
        user tayyab {
            uid 2000;
            class super-user;
            authentication {
                encrypted-password "$5$/./JeNE3$VGQK0zZrlqibVO7puB.3TJ4u91G0j7d6a4LsQmtv.X4"; ## SECRET-DATA
            }
        }
    }
    services {
        ssh;
        telnet;
        netconf {
            ssh;
        }
        web-management {
            https {
                system-generated-certificate;
            }
        }
    }
    syslog {
        user * {
            any emergency;
        }
        file messages {
            any any;
            authorization info;
        }
        file interactive-commands {
            interactive-commands any;
        }
    }
    license {
        autoupdate {
            url https://ae1.juniper.net/junos/key_retrieval;
        }
    }
}
chassis {
    cluster {
        reth-count 2;
        redundancy-group 0 {
            node 0 priority 100;
            node 1 priority 50;
        }
        redundancy-group 1 {
            node 0 priority 100;
            node 1 priority 50;
        }
    }
}
interfaces {
    fxp0 {
        unit 0 {
            family inet;
        }
    }
}

Thanks & Regards,

Tayyab Bin Tariq


Re: SRX4100 missed heartbeats in cluster

‎02-11-2019 09:23 AM

We have SRX4200 clusters running 15.1X49-D150.2, and we saw the same thing.

We use direct-attach twinax cables, though.

 

In addition, at the exact same time all the BFD sessions on our SRX cluster failed, resulting in an outage.

 

Let me know if you have found anything on this.

 

Thanks


Re: SRX4100 missed heartbeats in cluster

‎02-14-2019 10:53 AM

Hello,

Same thing here. Do you have any resolution for your case?

 

Thx



Re: SRX4100 missed heartbeats in cluster

‎02-15-2019 12:38 AM
Hi,

Yes, it was resolved by a software release from Juniper. Make sure you have upgraded to the latest one.

Regards,
Tayyab Bin Tariq

Re: SRX4100 missed heartbeats in cluster

‎02-15-2019 06:39 AM

We are on 15.1X49-D150.2

 

We were able to figure out what the issue was. But first, a little backstory.

 

We had to deactivate the fxp management addresses because of an asymmetric routing issue and because Junos 15 doesn't support putting management interfaces in a routing instance. Monitoring devices in the trust zone were taking the management route via the fxp interface to reach the device, but the reply was coming back through the untrust zone. This created polling issues.

 

The problem turned out to be that L3 broadcast traffic from the management network was being routed back out the untrust interfaces, even with no IP address active on the fxp interfaces. A single ping to the broadcast IP of the management network produced a ~6000x amplification on the fxp interfaces and a ~3300x amplification on the untrust interfaces. The device was routing L3 broadcast packets that arrived on an fxp interface with no IP address on it. These amplifications were crushing our RE whenever this happened, and that is where the missed heartbeat messages were coming from.
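
For anyone trying to confirm the same behaviour, the effect should be visible from the CLI while a test ping to the management broadcast address is sent from a host. This is only a sketch using standard Junos operational commands; no output is shown because it was not captured here.

```
# Routing Engine CPU and memory utilisation
show chassis routing-engine

# Input/output packet counters on the management interface
show interfaces fxp0 extensive | match packets

# Live per-interface traffic counters (compare fxp0 with the untrust-facing interfaces)
monitor interface traffic
```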

 

There are four solutions to this.

 

1. Shut down the fxp interface

2. Enable an IP address in that management network on the fxp interface, or remove family inet from the interface

3. Put production traffic in a routing instance and leave the fxp interface in the default instance

4. Run Junos 17 and put the management interfaces in a routing instance
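
As a rough illustration, the four options above might look like the following in configuration mode. This is a sketch under assumptions, not a tested change: the fxp0 addresses are borrowed from the original poster's configuration purely as examples, and `PROD-VR` and `reth0.0` are hypothetical placeholder names. Only one of the options would normally be applied.

```
# Option 1: administratively disable the management interface
set interfaces fxp0 disable

# Option 2a: give fxp0 an address in the management network, per node
# (example addresses taken from the first post's configuration)
set groups node0 interfaces fxp0 unit 0 family inet address 10.12.41.227/23
set groups node1 interfaces fxp0 unit 0 family inet address 10.12.41.228/23

# Option 2b: or remove family inet from fxp0 entirely
# (if it is also defined under groups node0/node1, it must be removed there too)
delete interfaces fxp0 unit 0 family inet

# Option 3: move production interfaces into a virtual router
# (PROD-VR and reth0.0 are placeholders)
set routing-instances PROD-VR instance-type virtual-router
set routing-instances PROD-VR interface reth0.0

# Option 4: dedicated management instance (mgmt_junos), only on releases
# where this is supported on SRX
set system management-instance
```

With option 4, management routes such as a default route for fxp0 would then have to be configured under the `mgmt_junos` routing instance rather than in the default table.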


Re: SRX4100 missed heartbeats in cluster

‎02-15-2019 10:15 AM

Nice information to know.

 

A small note regarding your solution 4) - putting fxp0 into a management-routing instance is not supported on SRX before 18.3R1 (ref: https://www.juniper.net/documentation/en_US/junos/information-products/topic-collections/release-not...)

--
Best regards,

Jonas Hauge Jensen
Systems Engineer, SEC Datacom A/S (Denmark)

Re: SRX4100 missed heartbeats in cluster

‎02-15-2019 10:50 AM

You can do it in 17, but I think it's not officially supported until the 18 track.