SRX Services Gateway

node1 goes from hold to secondary to disabled

‎06-10-2020 04:20 AM

After upgrading a pair of SRX320s to 15.1X49-D210, I cannot get the cluster to reform.

 

The primary node comes up OK, but I cannot get the secondary online.

 

I've tried doing the following on the secondary:

set chassis cluster cluster-id 0 node 0 reboot

...

load factory-defaults

set chassis cluster cluster-id 1 node 1 reboot

 

But on the primary, the status goes "lost -> hold -> secondary -> disabled".

 

On the secondary, the only hint is in the chassisd log file:

 

LCC: send: fpc 0 pic 0 online ack
LCC: pic attach pic 0, flags 0x0, portcount 58, fpc 0
LCC: pic_set_online: i2c 0x689 pic 0 fpc 0 state 3 in_issu 0
LCC: pic_type=1673 pic_slot=0 fpc_slot=0 pic_i2c_id=1673
LCC: hwdb: entry for pic 1673 at slot 0 in fpc 0 inserted
LCC: FPC 0 PIC 0, attaching clean
LCC: not in vc mode
LCC: Forwarding pic attach to FWDD fpc 0, pic 0
LCC: Got a pic attach ack from fwdd fpc 0pic 0
LCC: FWDD pic attach ack recd fpc 0, pic 0
LCC: pic_copy_port_info:Got SFP Rev= , Pno=NON-JNPR, Sno=PG54Q4Q
LCC: SIGWINCH handler
LCC: Node entering disabled state
CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 0 offline: Chassis cluster disable
LCC: fpc_down slot 0 reason Chassis cluster disable cargs 0xfa6120
LCC: fpc_srxsme_disconnect slot is 0
LCC: fpc_offline_now - slot 0, reason: Chassis cluster disable, error OK transition state 1
CHASSISD_SNMP_TRAP3: ENTITY trap generated: entStateOperDisabled (entPhysicalIndex 7, entStateAdmin 3, entStateAlarm 0)
LCC: fpc_offline_now - slot 0, is_resync_ready cleared
LCC: mic_get_mic_slot: clp1: fpc_slot=0, pic_slot=0, i2c=0x689
LCC: hwdb: entry for fpc 1929 at slot 0 deleted
CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 1 offline: Chassis cluster disable
LCC: fpc_down slot 1 reason Removal cargs 0x0
LCC: fpc_offline_now - slot 1, reason: Chassis cluster disable, error OK transition state 1
CHASSISD_SNMP_TRAP3: ENTITY trap generated: entStateOperDisabled (entPhysicalIndex 8, entStateAdmin 1, entStateAlarm 0)
LCC: fpc_srxsme_is_mpim_present: slot 1, FPC not present
LCC: fpc_srxsme_init: slot 1, FPC not detected
CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 2 offline: Chassis cluster disable
LCC: fpc_down slot 2 reason Removal cargs 0x0
LCC: fpc_offline_now - slot 2, reason: Chassis cluster disable, error OK transition state 1
CHASSISD_SNMP_TRAP3: ENTITY trap generated: entStateOperDisabled (entPhysicalIndex 9, entStateAdmin 1, entStateAlarm 0)
LCC: fpc_srxsme_is_mpim_present: slot 2, FPC not present
LCC: fpc_srxsme_init: slot 2, FPC not detected
...
LCC: Unable to read FPC 6 ID EEPROM
LCC: I2C read error for slot 6
...
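
For anyone reading a similar excerpt later, it can help to separate cause from effect. The following is a minimal Python sketch, not a Juniper tool: the function name `summarize_chassisd` and the `SAMPLE_LOG` constant are hypothetical, and the sample lines are copied from the excerpt above. It only reports whether the node entered the disabled state and what reason was logged for each FPC going offline.

```python
import re

# Sample lines copied from the chassisd excerpt in this post.
SAMPLE_LOG = """\
LCC: SIGWINCH handler
LCC: Node entering disabled state
CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 0 offline: Chassis cluster disable
LCC: fpc_down slot 0 reason Chassis cluster disable cargs 0xfa6120
CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 1 offline: Chassis cluster disable
LCC: fpc_down slot 1 reason Removal cargs 0x0
"""

# Matches the offline notice and captures the FPC slot and the stated reason.
OFFLINE_RE = re.compile(
    r"CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC (\d+) offline: (.+)"
)


def summarize_chassisd(log_text):
    """Return (entered_disabled, {fpc_slot: offline_reason})."""
    entered_disabled = False
    offline_reasons = {}
    for line in log_text.splitlines():
        if "Node entering disabled state" in line:
            entered_disabled = True
        match = OFFLINE_RE.search(line)
        if match:
            offline_reasons[int(match.group(1))] = match.group(2).strip()
    return entered_disabled, offline_reasons


if __name__ == "__main__":
    disabled, reasons = summarize_chassisd(SAMPLE_LOG)
    print("node entered disabled state:", disabled)
    for slot, reason in sorted(reasons.items()):
        print(f"FPC {slot} taken offline, reason: {reason}")
```

If every offline reason reads "Chassis cluster disable", as it does here, the FPCs going offline are most likely a consequence of the node being disabled rather than a hardware fault in themselves.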

There's also an error in jam_chassisd, but that file is not on either SRX:

jam_dso_find_open.776:dir: /usr/sbin/jam
jam_dso_find_open.799:Failed to Open Dir /usr/sbin/jam
jam_get_db_attribute.1013:DB Get failed for chasd.lc.modelinfo.711-062269 with ret 3
jam_get_modelnumstr.1176:Got model num str for partno: 711-062269
jam_dso_find_open.776:dir: /usr/sbin/jam
jam_dso_find_open.799:Failed to Open Dir /usr/sbin/jam
jam_get_db_attribute.1013:DB Get failed for chasd.lc.modelinfo.711-062269 with ret 3
jam_get_modelnumstr.1176:Got model num str for partno: 711-062269
jam_get_db_attribute.1011 ERR:DB Get failed for chasd.lc.modelinfo. with error 3
jam_get_modelnumstr.1176:Got model num str for partno:
jam_dso_find_open.776:dir: /usr/sbin/jam
jam_dso_find_open.799:Failed to Open Dir /usr/sbin/jam
jam_get_db_attribute.1013:DB Get failed for chasd.lc.modelinfo.711-062269 with ret 3
jam_get_modelnumstr.1176:Got model num str for partno: 711-062269

So I'm a bit confused about what to do next. Is the unit actually faulty?


Re: node1 goes from hold to secondary to disabled

‎06-10-2020 04:32 AM

When the secondary is out of the cluster, all of the ge interfaces show up correctly as being up:

 

root> show interfaces terse | match ge-
ge-0/0/0                up    up
ge-0/0/1                up    up
ge-0/0/2                up    up
ge-0/0/3                up    up
ge-0/0/4                up    up
ge-0/0/5                up    up
ge-0/0/6                up    up
ge-0/0/7                up    down
ge-0/0/8                up    down
ge-0/0/9                up    down

So I'm not concerned about that.

 


Re: node1 goes from hold to secondary to disabled

‎06-10-2020 04:53 AM

Hello Baldwizard,

 

Greetings!

 

As per the description, I understand that the secondary node is not coming online.

 

Could you share the output of the commands below:

 

> show chassis alarms no-forwarding

> show chassis cluster status

> show chassis cluster statistics
> show chassis cluster information
> show log jsrpd

 

Also, check the KB article below to verify that the chassis cluster nodes are configured and up on J-Series and SRX:

https://kb.juniper.net/InfoCenter/index?page=content&id=KB15439&actp=METADATA

 

Best Regards,

Lingabasappa H

 

Solution
Accepted by topic author baldwizard
‎07-08-2020 02:47 AM

Re: node1 goes from hold to secondary to disabled

‎06-10-2020 05:03 AM

OK, this appears to be because the secondary, in its non-cluster state, still had interface configuration for one of the HA interfaces, ge-0/0/0. I found that deep in a log file; it wasn't otherwise visible!

 

/var/log/dcd

 

I needed to run "delete interfaces ge-0/0/0" on the secondary while it was in its non-cluster state (after which its local configuration contained only the root password) and then reboot.
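
To catch this before re-enabling clustering, the standalone configuration can be checked for leftover statements on the HA ports. The following is a minimal Python sketch under stated assumptions: it expects the text of "show configuration | display set" saved from the standalone node, the function name `find_ha_port_config` and the sample configuration are hypothetical, and the port list assumes ge-0/0/0 (named above) plus ge-0/0/1 as the HA ports, which should be adjusted for your platform.

```python
# Assumption: ge-0/0/0 is the HA interface named in this thread; ge-0/0/1 is
# added on the assumption that it is the control port on this platform.
HA_PORTS = ("ge-0/0/0", "ge-0/0/1")

# Hypothetical sample in "display set" format, for illustration only.
SAMPLE_CONFIG = """\
set system host-name example-node1
set interfaces ge-0/0/0 unit 0 family inet dhcp
set interfaces ge-0/0/2 unit 0 family inet address 192.0.2.1/24
"""


def find_ha_port_config(set_lines, ha_ports=HA_PORTS):
    """Return the 'set interfaces ...' lines that configure an HA port."""
    hits = []
    for line in set_lines.splitlines():
        parts = line.split()
        if (
            len(parts) >= 3
            and parts[0] == "set"
            and parts[1] == "interfaces"
            and parts[2] in ha_ports
        ):
            hits.append(line.strip())
    return hits


if __name__ == "__main__":
    for hit in find_ha_port_config(SAMPLE_CONFIG):
        print("leftover HA-port configuration:", hit)
```

Any line it reports is a candidate for deletion on the standalone node before the cluster command and reboot are issued again.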


Re: node1 goes from hold to secondary to disabled

‎06-10-2020 05:06 AM

Interesting that there's one alarm:

 

> show chassis alarms no-forwarding
1 alarms currently active
Alarm time Class Description
2020-06-10 21:41:02 EST Major NSD fails to restart because subcomponents fail

 


Re: node1 goes from hold to secondary to disabled

‎06-10-2020 05:07 AM

Hello Baldwizard

Greetings!

 

Kindly provide the output of the commands below:

show chassis cluster status 
show chassis fpc pic-status
show chassis alarms
show log jsrpd
show chassis cluster information no-forwarding

Meanwhile, you can go through the docs below; they will be beneficial for troubleshooting:

https://kb.juniper.net/InfoCenter/index?page=content&id=KB20641&actp=METADATA

https://kb.juniper.net/InfoCenter/index?page=content&id=KB15421&actp=METADATA 

Please mark "Accept as solution" if this answers your query. 

 

Kudos are appreciated too

deeksha

Re: node1 goes from hold to secondary to disabled

‎06-10-2020 05:12 AM

Hello Baldwizard,

 

Thanks for the reply.

 

Did deleting the ge-0/0/0 interface configuration on the secondary node in its non-cluster state, followed by a reboot, resolve the issue?

 

Please mark the solution to the queries you post as accepted if it answered your query.

This would enable others to find the right solution for the same/similar queries on the forum.

 

I hope this helps. Please mark my post as "Accept as solution" if that has answered your query.

 

Kudos are always appreciated!

 

Best Regards,

Lingabasappa H


Re: node1 goes from hold to secondary to disabled

‎06-10-2020 05:15 AM

Hello baldwizard,

Regarding this alarm:

 

> show chassis alarms no-forwarding
1 alarms currently active
Alarm time Class Description
2020-06-10 21:41:02 EST Major NSD fails to restart because subcomponents fail

 

Starting in Junos OS Releases 12.3X48-D85, 15.1X49-D180, and 19.2R1, a system alarm is triggered when the Network Security Process (NSD) is unable to restart due to the failure of one or more NSD subcomponents. The alarm logs about the NSD are saved in the messages log. The alarm is automatically cleared when NSD restarts successfully. The show chassis alarms and show system alarms commands are updated to display the following output when NSD is unable to restart: "NSD fails to restart because subcomponents fail".

 

Kindly go through the docs below:

https://www.juniper.net/documentation/en_US/junos/topics/concept/security-alarm-overview.html

 

https://www.juniper.net/documentation/en_US/junos/information-products/topic-collections/release-not...

 

I hope this helps. Please mark my post as "Accept as solution" if that has answered your query.

 

Kudos are always appreciated!

 

deeksha

Re: node1 goes from hold to secondary to disabled

‎06-10-2020 05:20 AM

Hello Baldwizard,

 

> show chassis alarms no-forwarding
1 alarms currently active
Alarm time Class Description
2020-06-10 21:41:02 EST Major NSD fails to restart because subcomponents fail

 

To clear the above alarm, please run the command below in a maintenance window:

 

> restart network-security

 

I suspect that the daemon got stuck and it needs to be restarted, but restarting the process could impact your traffic for a short period of time.

 

I hope this helps. Please mark this post "Accept as solution" if this answers your query.

 

Kudos are always appreciated!

 

Best Regards,

Lingabasappa H


Re: node1 goes from hold to secondary to disabled

‎06-10-2020 05:53 AM

Hi baldwizard,

 

Firstly, please verify active alarms on both nodes. 

 

From the alarm output you pasted, I see the active alarm concerns an NSD restart failure caused by a subcomponent failure.

 

Please note that the NSD process handles all security-related configuration and pushes it to the PFE. Since you are seeing this alarm, I suspect that the daemon has gotten stuck and needs to be restarted, but please keep in mind that restarting the process could impact your traffic for a short period of time.

 

To restart the daemon:

> restart network-security

 

If the above command does not solve the issue, please restart the device:

> request system reboot

 

Please take precautions when rebooting the node. You might not want to reboot a node that is currently primary.

 

Hope this helps 🙂

 

Please mark "Accepted Solution" if this helps you solve your query.

Kudos are always appreciated!

