I have a branch office with a cluster of SRX220H2s in which the secondary node recently started flapping. Every 5-10 minutes the secondary node is kicked out of the cluster, rejoins several minutes later, and then the cycle starts over. We've tried hard-rebooting the secondary node to see if it would rejoin and stay in the cluster, but that hasn't helped.
Additionally, I've noticed that the control-plane CPU on the primary node is consistently at 100%, with the jsrpd process consuming a large share of it. We have a number of essentially identical branch clusters elsewhere, and jsrpd is not using this much CPU on any of them. I know that jsrpd is involved in the messaging between the cluster nodes. Checking the jsrpd logs, I'm seeing something very unusual:
May 14 16:55:04 TCP-S: accepted client connection.
May 14 16:55:04 TCP-S: TCP client from 130.16.0.1/56547 connected
May 14 16:55:04 TCP-S: TCP peer closed connection
May 14 16:55:04 last message repeated 100 times (hit threshold of (100))
May 14 16:55:04 last message repeated 200 times (hit threshold of (200))
May 14 16:55:04 last message repeated 300 times (hit threshold of (300))
May 14 16:55:04 last message repeated 400 times (hit threshold of (400))
May 14 16:55:04 last message repeated 500 times (hit threshold of (500))
May 14 16:55:04 last message repeated 600 times (hit threshold of (600))
May 14 16:55:05 last message repeated 700 times (hit threshold of (700))
May 14 16:55:05 last message repeated 800 times (hit threshold of (800))
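For anyone who wants to sift a longer capture of this log, here is a minimal Python sketch that tallies connections per client address and records the highest "last message repeated" counter seen in each second. The function name `summarize_jsrpd_log` and both regular expressions are illustrative assumptions based only on the line layout in the excerpt above; they are not part of any Junos tooling, and I can't confirm whether the repeat counters are cumulative.

```python
import re
from collections import Counter

# Both patterns assume the exact line layout shown in the excerpt above.
CONNECT_RE = re.compile(r"TCP-S: TCP client from (\d+\.\d+\.\d+\.\d+)/\d+ connected")
REPEAT_RE = re.compile(r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) last message repeated (\d+) times")


def summarize_jsrpd_log(lines):
    """Return (connections per client IP, highest repeat counter per timestamp)."""
    clients = Counter()
    peak_repeats = {}
    for line in lines:
        connect = CONNECT_RE.search(line)
        if connect:
            clients[connect.group(1)] += 1
            continue
        repeat = REPEAT_RE.match(line)
        if repeat:
            stamp, count = repeat.group(1), int(repeat.group(2))
            # Keep the largest counter reported for this one-second timestamp.
            peak_repeats[stamp] = max(peak_repeats.get(stamp, 0), count)
    return clients, peak_repeats


if __name__ == "__main__":
    sample = [
        "May 14 16:55:04 TCP-S: TCP client from 130.16.0.1/56547 connected",
        "May 14 16:55:04 TCP-S: TCP peer closed connection",
        "May 14 16:55:04 last message repeated 600 times (hit threshold of (600))",
        "May 14 16:55:05 last message repeated 800 times (hit threshold of (800))",
    ]
    clients, peaks = summarize_jsrpd_log(sample)
    print(dict(clients))
    print(peaks)
```

Run against the full log file, this would show whether 130.16.0.1 is the only client reconnecting and roughly how quickly the repeat counters climb.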
Here's the output of the show system processes extensive command:
show system processes extensive
node0:
--------------------------------------------------------------------------
last pid: 47616; load averages: 1.28, 1.26, 1.42 up 431+22:43:27 16:59:15
140 processes: 19 running, 108 sleeping, 2 zombie, 11 waiting
Mem: 210M Active, 149M Inact, 1036M Wired, 145M Cache, 112M Buf, 432M Free
Swap:
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
1403 root 5 76 0 996M 58812K RUN 0 ??? 102.20% flowd_octeon_hm
1406 root 1 139 0 14096K 7032K RUN 0 727.7H 76.66% jsrpd
22 root 1 171 52 0K 16K RUN 0 7574.2 0.00% idle: cpu0
23 root 1 -20 -139 0K 16K RUN 0 118.8H 0.00% swi7: clock
5 root 1 -16 0 0K 16K rtfifo 0 42.7H 0.00% rtfifo_kern_recv
25 root 1 -40 -159 0K 16K WAIT 0 40.4H 0.00% swi2: netisr 0
1413 root 1 76 0 12452K 5768K select 0 33.9H 0.00% license-check
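To compare this node against the healthy branch clusters, a small script could pull out the processes above a CPU threshold from saved copies of this output. This is only a sketch: the function name `busy_processes` and the 50% default threshold are hypothetical, and it assumes the twelve-column layout shown in the header line above, with no spaces inside the STATE column.

```python
def busy_processes(lines, threshold=50.0):
    """Return (command, wcpu) pairs at or above threshold, busiest first.

    Assumes the column order PID USERNAME THR PRI NICE SIZE RES STATE C TIME
    WCPU COMMAND, as in the pasted output; other layouts are skipped or misread.
    """
    busy = []
    for line in lines:
        # Split into at most 12 fields so a COMMAND such as "idle: cpu0" stays whole.
        fields = line.split(None, 11)
        if len(fields) < 12 or not fields[0].isdigit() or not fields[10].endswith("%"):
            continue  # header, summary, or otherwise unparseable line
        wcpu = float(fields[10].rstrip("%"))
        if wcpu >= threshold:
            busy.append((fields[11].strip(), wcpu))
    return sorted(busy, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = [
        "PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND",
        "1403 root 5 76 0 996M 58812K RUN 0 ??? 102.20% flowd_octeon_hm",
        "1406 root 1 139 0 14096K 7032K RUN 0 727.7H 76.66% jsrpd",
        "22 root 1 171 52 0K 16K RUN 0 7574.2 0.00% idle: cpu0",
    ]
    for command, wcpu in busy_processes(sample):
        print(command, wcpu)
```

Running the same function over the output from one of the other branch clusters would make it easy to confirm that jsrpd does not appear in their lists.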
show chassis cluster interfaces:
Control link status: Up
Control interfaces:
Index Interface Status Internal-SA
0 fxp1 Up Disabled
Fabric link status: Up
Fabric interfaces:
Name Child-interface Status
(Physical/Monitored)
fab0 ge-0/0/5 Up / Up
fab0
fab1 ge-3/0/5 Up / Up
fab1
Redundant-ethernet Information:
Name Status Redundancy-group
reth0 Up 1
reth1 Up 1
reth2 Up 1
Redundant-pseudo-interface Information:
Name Status Redundancy-group
lo0 Up 0
Interface Monitoring:
Interface Weight Status Redundancy-group
ge-3/0/0 255 Down 1
ge-0/0/0 255 Up 1
{primary:node0}
Last 100 lines of the chassisd log:
show log chassisd | last 100
May 14 16:39:58 SCC: pseudo_create_devs_swfab: Skipping creation of swfab1, since fabric presence is set to true
May 14 16:39:58 SCC: lcc_detach_interfaces_not_online lcc 1
May 14 16:39:58 CHASSISD_IFDEV_DETACH_FPC: ifdev_detach_fpc(3)
May 14 16:39:58 CHASSISD_IFDEV_DETACH_FPC: ifdev_detach_fpc(4)
May 14 16:39:58 CHASSISD_IFDEV_DETACH_FPC: ifdev_detach_fpc(5)
May 14 16:40:06 SCC: pfpc ready fpc 3 i2c 1897
May 14 16:40:06 SCC: fpc 3 clean, bringing online
May 14 16:40:06 SCC: lcc_send_fpc_online_cmd_generic: lcc 1 fpc 0
May 14 16:40:06 SCC: pic_online_req for fpc 3, pic 0 lcc_slot 1 in lcc_recv_pic_online_req
May 14 16:40:06 SCC: lcc_send_pic_online_ack: On Switch-chassis: fpc 3 pic 0 pic_type 0x669 msg_len 20 tlv_len 0
May 14 16:40:06 SCC: From SCC send: fru 13361152 lcc_slot 1 online ack to LCC
May 14 16:40:06 SCC: From Switch-Chassis send: fpc 3 pic 0 online ack to LCC
May 14 16:40:08 SCC: lcc_recv_pic_attach: pic attach pic 0, flags 0x0, portcount 8, fpc 3
May 14 16:40:08 SCC: pic_set_online: i2c 0x669 pic 0 fpc 3 state 5 in_issu 0
May 14 16:40:08 SCC: pic_type=1641 pic_slot=0 fpc_slot=3 pic_i2c_id=1641
May 14 16:40:08 SCC: fpc slot 3 pic_present 0x0 => 0x1
May 14 16:40:08 SCC: FPC 3 PIC 0, attaching clean
May 14 16:40:08 SCC: Creating pic entry, baseport 0, nports 8, port 0
May 14 16:40:08 SCC: create_pic_entry: pic i2c 0x669, hw qs 8 supported qs 8, flags 0x0, pic port 0
May 14 16:40:08 SCC: Created pic for ge-3/0/0
May 14 16:40:08 SCC: Creating pic entry, baseport 0, nports 8, port 1
May 14 16:40:08 SCC: create_pic_entry: pic i2c 0x669, hw qs 8 supported qs 8, flags 0x0, pic port 1
May 14 16:40:08 SCC: Created pic for ge-3/0/1
May 14 16:40:08 SCC: Creating pic entry, baseport 0, nports 8, port 2
May 14 16:40:08 SCC: create_pic_entry: pic i2c 0x669, hw qs 8 supported qs 8, flags 0x0, pic port 2
May 14 16:40:08 SCC: Created pic for ge-3/0/2
May 14 16:40:08 SCC: Creating pic entry, baseport 0, nports 8, port 3
May 14 16:40:08 SCC: create_pic_entry: pic i2c 0x669, hw qs 8 supported qs 8, flags 0x0, pic port 3
May 14 16:40:08 SCC: Created pic for ge-3/0/3
May 14 16:40:08 SCC: Creating pic entry, baseport 0, nports 8, port 4
May 14 16:40:08 SCC: create_pic_entry: pic i2c 0x669, hw qs 8 supported qs 8, flags 0x0, pic port 4
May 14 16:40:08 SCC: Created pic for ge-3/0/4
May 14 16:40:08 SCC: Creating pic entry, baseport 0, nports 8, port 5
May 14 16:40:08 SCC: create_pic_entry: pic i2c 0x669, hw qs 8 supported qs 8, flags 0x0, pic port 5
May 14 16:40:08 SCC: Created pic for ge-3/0/5
May 14 16:40:08 SCC: Creating pic entry, baseport 0, nports 8, port 6
May 14 16:40:08 SCC: create_pic_entry: pic i2c 0x669, hw qs 8 supported qs 8, flags 0x0, pic port 6
May 14 16:40:08 SCC: Created pic for ge-3/0/6
May 14 16:40:08 SCC: Creating pic entry, baseport 0, nports 8, port 7
May 14 16:40:08 SCC: create_pic_entry: pic i2c 0x669, hw qs 8 supported qs 8, flags 0x0, pic port 7
May 14 16:40:08 SCC: Created pic for ge-3/0/7
May 14 16:40:08 CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for ge-3/0/0
May 14 16:40:08 SCC: ifdev_create entered ge-3/0/0
May 14 16:40:08 SCC: ge-3/0/0: large delay buffer cleared
May 14 16:40:08 SCC: fpc_is_q_neompc: no valid ideeprom for slot 3
May 14 16:40:08 SCC: fpc_is_q_sangria: no valid ideeprom for slot 3
May 14 16:40:08 CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for ge-3/0/1
May 14 16:40:08 SCC: ifdev_create entered ge-3/0/1
May 14 16:40:08 SCC: ge-3/0/1: large delay buffer cleared
May 14 16:40:08 SCC: fpc_is_q_neompc: no valid ideeprom for slot 3
May 14 16:40:08 SCC: fpc_is_q_sangria: no valid ideeprom for slot 3
May 14 16:40:08 CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for ge-3/0/2
May 14 16:40:08 SCC: ifdev_create entered ge-3/0/2
May 14 16:40:08 SCC: ge-3/0/2: large delay buffer cleared
May 14 16:40:08 SCC: fpc_is_q_neompc: no valid ideeprom for slot 3
May 14 16:40:08 SCC: fpc_is_q_sangria: no valid ideeprom for slot 3
May 14 16:40:08 CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for ge-3/0/3
May 14 16:40:08 SCC: ifdev_create entered ge-3/0/3
May 14 16:40:08 SCC: ge-3/0/3: large delay buffer cleared
May 14 16:40:08 SCC: fpc_is_q_neompc: no valid ideeprom for slot 3
May 14 16:40:08 SCC: fpc_is_q_sangria: no valid ideeprom for slot 3
May 14 16:40:08 CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for ge-3/0/4
May 14 16:40:08 SCC: ifdev_create entered ge-3/0/4
May 14 16:40:08 SCC: ge-3/0/4: large delay buffer cleared
May 14 16:40:08 SCC: fpc_is_q_neompc: no valid ideeprom for slot 3
May 14 16:40:08 SCC: fpc_is_q_sangria: no valid ideeprom for slot 3
May 14 16:40:08 CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for ge-3/0/5
May 14 16:40:08 SCC: ifdev_create entered ge-3/0/5
May 14 16:40:08 SCC: ge-3/0/5: large delay buffer cleared
May 14 16:40:08 SCC: fpc_is_q_neompc: no valid ideeprom for slot 3
May 14 16:40:08 SCC: fpc_is_q_sangria: no valid ideeprom for slot 3
May 14 16:40:08 CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for ge-3/0/6
May 14 16:40:08 SCC: ifdev_create entered ge-3/0/6
May 14 16:40:08 SCC: ge-3/0/6: large delay buffer cleared
May 14 16:40:08 SCC: fpc_is_q_neompc: no valid ideeprom for slot 3
May 14 16:40:08 SCC: fpc_is_q_sangria: no valid ideeprom for slot 3
May 14 16:40:08 CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for ge-3/0/7
May 14 16:40:08 SCC: ifdev_create entered ge-3/0/7
May 14 16:40:08 SCC: ge-3/0/7: large delay buffer cleared
May 14 16:40:08 SCC: fpc_is_q_neompc: no valid ideeprom for slot 3
May 14 16:40:08 SCC: fpc_is_q_sangria: no valid ideeprom for slot 3
May 14 16:40:08 SCC: PIC (fpc 3 pic 0) message operation: add. ifd count 8, flags 0x3 in mesg
May 14 16:40:08 LCC: ignoring PIC message on LCC
For the moment, I've disabled the switch ports for the second node (node1), the one that keeps flapping, just so I don't keep seeing it drop in and out, but I can re-enable them if needed.
Any thoughts are appreciated!