SRX Services Gateway

Chassis cluster crashes after show security flow session

12-09-2019 11:14 AM

We have a chassis cluster of two SRX340s. Almost everything seems to be working fine, but sometimes when running the command "show security flow session" the cluster crashes...

 

root@SRX1> show system information 
Model: srx340
Family: junos-es
Junos: 18.2R3.4
Hostname: SRX1

 

This chassis cluster is not in production yet, so there is almost no traffic. Here's an example (I stopped it with ^C, because I had already seen that the output is really slow and that the cluster is going to crash):

 

{primary:node0}
root@SRX1>
root@SRX1> show security flow session 
node0:
--------------------------------------------------------------------------

Session ID: 2, Policy name: self-traffic-policy/1, State: Active, Timeout: 1786, Valid
  In: 1.2.3.4/64591 --> 4.3.2.1/179;tcp, Conn Tag: 0x0, If: .local..0, Pkts: 19131, Bytes: 1007022, 
^C[abort]

{secondary-hold:node0}
root@SRX1> 

 

As you can see above, node0 is primary at first, but then it changes to secondary-hold.
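
For comparison, the summary form prints only per-node session counters and avoids walking every session, so it is a lighter check to run while this is being investigated:

{primary:node0}
root@SRX1> show security flow session summary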

root@SRX1> show chassis cluster status   
Monitor Failure codes:
    CS  Cold Sync monitoring        FL  Fabric Connection monitoring
    GR  GRES monitoring             HW  Hardware monitoring
    IF  Interface monitoring        IP  IP monitoring
    LB  Loopback monitoring         MB  Mbuf monitoring
    NH  Nexthop monitoring          NP  NPC monitoring              
    SP  SPU monitoring              SM  Schedule monitoring
    CF  Config Sync monitoring      RE  Relinquish monitoring
 
Cluster ID: 1
Node   Priority Status               Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 1
node0  100      secondary-hold       no      no       GR             
node1  1        primary              no      no       None           

Redundancy group: 1 , Failover count: 1
node0  0        secondary            no      no       CS             
node1  1        primary              no      no       None   
root@SRX1> show chassis cluster information detail 
node0:
--------------------------------------------------------------------------
Redundancy mode:
    Configured mode: active-active
    Operational mode: active-active
Cluster configuration:
    Heartbeat interval: 1000 ms
    Heartbeat threshold: 3
    Control link recovery: Disabled
    Fabric link down timeout: 66 sec
Node health information:
    Local node health: Not healthy
    Remote node health: Healthy

Redundancy group: 0, Threshold: 255, Monitoring failures: gres-not-ready
    Events:
        Dec  9 14:20:47.751 : hold->secondary, reason: Hold timer expired
        Dec  9 14:21:02.845 : secondary->primary, reason: Better priority (100/1)
        Dec  9 19:09:11.726 : primary->secondary-hold, reason: Control link (Flowd) down

Redundancy group: 1, Threshold: 0, Monitoring failures: cold-sync-monitoring
    Events:                             
        Dec  9 14:20:48.044 : hold->secondary, reason: Hold timer expired
        Dec  9 14:21:04.962 : secondary->primary, reason: Remote yield (0/0)
        Dec  9 19:09:11.787 : primary->secondary-hold, reason: Control link (Flowd) down
        Dec  9 19:09:12.851 : secondary-hold->secondary, reason: Ready to become secondary
Control link statistics:                
    Control link 0:                     
        Heartbeat packets sent: 17928   
        Heartbeat packets received: 17509
        Heartbeat packet errors: 0      
        Duplicate heartbeat packets received: 0
    Control recovery packet count: 0    
    Sequence number of last heartbeat packet sent: 17928
    Sequence number of last heartbeat packet received: 17916
Fabric link statistics:      
    Child link 0
        Probes sent: 69                 
        Probes received: 69             
    Child link 1                        
        Probes sent: 69                 
        Probes received: 69             
Switch fabric link statistics:          
    Probe state : DOWN                  
    Probes sent: 0                      
    Probes received: 0                  
    Probe recv errors: 0                
    Probe send errors: 0                
    Probe recv dropped: 0               
    Sequence number of last probe sent: 0
    Sequence number of last probe received: 0
                                        
Chassis cluster LED information:        
    Current LED color: Amber            
    Last LED change reason: Monitored objects are down
Control port tagging:                   
    Disabled                            
                                        
Cold Synchronization:                   
    Status:                             
        Cold synchronization completed for: N/A
        Cold synchronization failed for: N/A
        Cold synchronization not known for: N/A
        Current Monitoring Weight: 255  
                                        
    Progress:                           
        CS Prereq               0 of 1 SPUs completed
           1. if_state sync          1 SPUs completed
           2. fabric link            0 SPUs completed
           3. policy data sync       1 SPUs completed
           4. cp ready               0 SPUs completed
           5. VPN data sync          0 SPUs completed
           6. IPID data sync         0 SPUs completed
           7. All SPU ready          0 SPUs completed
           8. AppID ready            0 SPUs completed
           9. Tunnel Sess ready      0 SPUs completed
        CS RTO sync             0 of 1 SPUs completed
        CS Postreq              0 of 1 SPUs completed

    Statistics:
        Number of cold synchronization completed: 0
        Number of cold synchronization failed: 0

    Events:
        Dec  9 14:22:34.358 : Cold sync for PFE  is RTO sync in process
        Dec  9 14:22:34.803 : Cold sync for PFE  is Completed

Loopback Information:

    PIC Name        Loopback        Nexthop     Mbuf
    -------------------------------------------------
                    Success         Failure     Success    

Interface monitoring:
    Statistics:
        Monitored interface failure count: 0

    Events:
        Dec  9 14:22:37.137 : Interface ge-0/0/5 monitored by rg 1, changed state from Down to Up
        Dec  9 14:22:37.279 : Interface ge-0/0/4 monitored by rg 1, changed state from Down to Up
                                        
Fabric monitoring:                      
    Status:                             
        Fabric Monitoring: Enabled      
        Activation status: Active       
        Fabric Status reported by data plane: Up
        JSRPD internal fabric status: Up
                                        
Fabric link events:                     
        Dec  9 19:09:12.742 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
        Dec  9 19:15:37.365 : Fabric monitoring is suspended by remote node
        Dec  9 19:17:36.806 : Fabric monitoring suspension is revoked by remote node
        Dec  9 19:17:40.808 : Child link-0 of fab0 is down, pfe notification
        Dec  9 19:17:40.808 : Child link-1 of fab0 is down, pfe notification
        Dec  9 19:17:40.808 : Child link-0 of fab1 is down, pfe notification
        Dec  9 19:17:40.808 : Child link-1 of fab1 is down, pfe notification
        Dec  9 19:17:42.758 : Child link-0 of fab0 is up, pfe notification
        Dec  9 19:17:42.758 : Child link-1 of fab0 is up, pfe notification
        Dec  9 19:17:43.755 : Fabric link up, link status timer
Control link status: Up
    Server information:
        Server status : Inactive
        Server connected to None
    Client information:
        Client status : Connected
        Client connected to 130.16.0.1/62845
Control port tagging:
    Disabled

Control link events:
        Dec  9 14:21:25.527 : Control link fxp1 is up
        Dec  9 14:22:06.528 : Control link fxp1 is up
        Dec  9 14:22:07.254 : Control link fxp1 is up
        Dec  9 14:22:09.124 : Control link fxp1 is up
        Dec  9 19:09:10.524 : Control link fxp1 is down
        Dec  9 19:09:10.530 : Control link down, flowd is down
        Dec  9 19:11:37.898 : Control link fxp1 is down
        Dec  9 19:11:38.113 : Control link fxp1 is down
        Dec  9 19:15:31.728 : Control link fxp1 is up
        Dec  9 19:15:38.247 : Control link up, link status timer

Hardware monitoring:                    
    Status:                             
        Activation status: Enabled      
        Redundancy group 0 failover for hardware faults: Enabled
        Hardware redundancy group 0 errors: 0
        Hardware redundancy group 1 errors: 0
                                        
Schedule monitoring:
    Status:                             
        Activation status: Disabled     
        Schedule slip detected: None    
        Timer ignored: No               
                                        
    Statistics:                         
        Total slip detected count: 2    
        Longest slip duration: 3(s)     

    Events:
        Dec  9 14:19:13.237 : Detected schedule slip
        Dec  9 14:20:13.562 : Cleared schedule slip
        Dec  9 19:10:53.065 : Detected schedule slip
        Dec  9 19:11:55.408 : Cleared schedule slip

Configuration Synchronization:
    Status:
        Activation status: Enabled
        Last sync operation: Auto-Sync
        Last sync result: Not needed
        Last sync mgd messages:

    Events:
        Dec  9 14:21:04.959 : Auto-Sync: Not needed.

Cold Synchronization Progress:
    CS Prereq               0 of 1 SPUs completed
       1. if_state sync          1 SPUs completed
       2. fabric link            0 SPUs completed
       3. policy data sync       1 SPUs completed
       4. cp ready               0 SPUs completed
       5. VPN data sync          0 SPUs completed
       6. IPID data sync         0 SPUs completed
       7. All SPU ready          0 SPUs completed
       8. AppID ready            0 SPUs completed
       9. Tunnel Sess ready      0 SPUs completed
    CS RTO sync             0 of 1 SPUs completed
    CS Postreq              0 of 1 SPUs completed
                                        
node1:                                  
--------------------------------------------------------------------------
Redundancy mode:                        
    Configured mode: active-active      
    Operational mode: active-active     
Cluster configuration:                  
    Heartbeat interval: 1000 ms         
    Heartbeat threshold: 3              
    Control link recovery: Disabled     
    Fabric link down timeout: 66 sec    
Node health information:                
    Local node health: Healthy          
    Remote node health: Not healthy

Redundancy group: 0, Threshold: 255, Monitoring failures: none
    Events:
        Dec  9 14:17:15.788 : hold->secondary, reason: Hold timer expired
        Dec  9 19:05:27.214 : secondary->primary, reason: Only node present

Redundancy group: 1, Threshold: 255, Monitoring failures: none
    Events:
        Dec  9 14:17:17.372 : hold->secondary, reason: Hold timer expired
        Dec  9 19:05:27.200 : secondary->ineligible, reason: Fabric link down
        Dec  9 19:05:27.269 : ineligible->primary, reason: Only node present
Control link statistics:
    Control link 0:
        Heartbeat packets sent: 17917
        Heartbeat packets received: 17517
        Heartbeat packet errors: 0
        Duplicate heartbeat packets received: 0
    Control recovery packet count: 0
    Sequence number of last heartbeat packet sent: 17917
    Sequence number of last heartbeat packet received: 17929
Fabric link statistics:
    Child link 0                        
        Probes sent: 35495              
        Probes received: 34394          
    Child link 1                        
        Probes sent: 35497              
        Probes received: 34394          
Switch fabric link statistics:    
    Probe state : DOWN
    Probes sent: 0                      
    Probes received: 0                  
    Probe recv errors: 0                
    Probe send errors: 0                
    Probe recv dropped: 0               
    Sequence number of last probe sent: 0
    Sequence number of last probe received: 0

Chassis cluster LED information:
    Current LED color: Green
    Last LED change reason: No failures
Control port tagging:
    Disabled

Cold Synchronization:
    Status:
        Cold synchronization completed for: N/A
        Cold synchronization failed for: N/A
        Cold synchronization not known for: N/A
        Current Monitoring Weight: 0

    Progress:
        CS Prereq               1 of 1 SPUs completed
           1. if_state sync          1 SPUs completed
           2. fabric link            1 SPUs completed
           3. policy data sync       1 SPUs completed
           4. cp ready               1 SPUs completed
           5. VPN data sync          1 SPUs completed
           6. IPID data sync         1 SPUs completed
           7. All SPU ready          1 SPUs completed
           8. AppID ready            1 SPUs completed
           9. Tunnel Sess ready      1 SPUs completed
        CS RTO sync             1 of 1 SPUs completed
        CS Postreq              1 of 1 SPUs completed
                                        
    Statistics:                         
        Number of cold synchronization completed: 0
        Number of cold synchronization failed: 0
    Events:
        Dec  9 14:18:47.255 : Cold sync for PFE  is RTO sync in process
        Dec  9 14:18:48.641 : Cold sync for PFE  is Post-req check in process
        Dec  9 14:18:50.645 : Cold sync for PFE  is Completed

Loopback Information:

    PIC Name        Loopback        Nexthop     Mbuf
    -------------------------------------------------
                    Success         Success     Success    

Interface monitoring:
    Statistics:
        Monitored interface failure count: 2

    Events:
        Dec  9 19:07:00.391 : Interface ge-0/0/4 monitored by rg 1, changed state from Up to Down
        Dec  9 19:07:00.548 : Interface ge-0/0/5 monitored by rg 1, changed state from Up to Down
        Dec  9 19:13:56.789 : Interface ge-0/0/4 monitored by rg 1, changed state from Down to Up
        Dec  9 19:13:56.817 : Interface ge-0/0/5 monitored by rg 1, changed state from Down to Up
                                        
Fabric monitoring:                      
    Status:                             
        Fabric Monitoring: Enabled      
        Activation status: Active       
        Fabric Status reported by data plane: Down
        JSRPD internal fabric status: Down
                                        
Fabric link events:                     
        Dec  9 19:13:50.856 : Child ge-5/0/8 of fab1 is down
        Dec  9 19:13:50.866 : Child ge-5/0/9 of fab1 is down
        Dec  9 19:13:52.851 : Fabric link fab1 is up
        Dec  9 19:13:52.852 : Child ge-5/0/8 of fab1 is up
        Dec  9 19:13:52.868 : Child ge-5/0/9 of fab1 is up
        Dec  9 19:13:53.612 : Fabric link fab0 is up
        Dec  9 19:13:53.613 : Child ge-0/0/8 of fab0 is up
        Dec  9 19:13:53.630 : Child ge-0/0/9 of fab0 is up
        Dec  9 19:13:55.649 : Child link-0 of fab0 is up, pfe notification
        Dec  9 19:13:55.649 : Child link-1 of fab0 is up, pfe notification

Control link status: Up
    Server information:                 
        Server status : Connected       
        Server connected to 129.16.0.1/64127
    Client information:
        Client status : Inactive
        Client connected to None
Control port tagging:
    Disabled

Control link events:
        Dec  9 14:15:21.808 : Control link fxp1 is down
        Dec  9 14:15:45.347 : Control link fxp1 is up
        Dec  9 14:17:20.899 : Control link fxp1 is up
        Dec  9 14:17:31.245 : Control link fxp1 is up
        Dec  9 19:05:27.200 : Control link down, link status timer
        Dec  9 19:05:27.219 : Control link fxp1 is up
        Dec  9 19:06:28.399 : Control link fxp1 is up
        Dec  9 19:07:05.419 : Control link fxp1 is up
        Dec  9 19:11:51.177 : Control link up, link status timer
        Dec  9 19:12:52.871 : Control link fxp1 is up

Hardware monitoring:
    Status:
        Activation status: Enabled
        Redundancy group 0 failover for hardware faults: Enabled
        Hardware redundancy group 0 errors: 0
        Hardware redundancy group 1 errors: 0
                                        
Schedule monitoring:
    Status:                             
        Activation status: Disabled     
        Schedule slip detected: None    
        Timer ignored: No               
                                        
    Statistics:                         
        Total slip detected count: 3    
        Longest slip duration: 6(s)     

    Events:                             
        Dec  9 14:15:37.423 : Detected schedule slip
        Dec  9 14:16:37.560 : Cleared schedule slip
        Dec  9 14:19:14.050 : Detected schedule slip
        Dec  9 14:20:14.126 : Cleared schedule slip
        Dec  9 19:06:43.691 : Detected schedule slip
        Dec  9 19:07:43.878 : Cleared schedule slip

Configuration Synchronization:
    Status:
        Activation status: Enabled
        Last sync operation: Auto-Sync
        Last sync result: Succeeded

    Events:
        Dec  9 14:17:46.062 : Auto-Sync: In progress. Attempt: 1
        Dec  9 14:19:07.641 : Auto-Sync: Clearing mgd. Attempt: 1
        Dec  9 14:19:14.043 : Auto-Sync: Succeeded. Attempt: 1

Cold Synchronization Progress:
    CS Prereq               1 of 1 SPUs completed
       1. if_state sync          1 SPUs completed
       2. fabric link            1 SPUs completed
       3. policy data sync       1 SPUs completed
       4. cp ready               1 SPUs completed
       5. VPN data sync          1 SPUs completed
       6. IPID data sync         1 SPUs completed
       7. All SPU ready          1 SPUs completed
       8. AppID ready            1 SPUs completed
       9. Tunnel Sess ready      1 SPUs completed
    CS RTO sync             1 of 1 SPUs completed
    CS Postreq              1 of 1 SPUs completed

 

SRX Services Gateway

Re: Chassis cluster crashes after show security flow session

12-09-2019 02:40 PM

Crash needs to be analyzed further. I would recommend opening a ticket with JTAC for further investigation.

Please attach all the relevant data while opening the ticket.

https://kb.juniper.net/InfoCenter/index?page=content&id=KB21781&actp=METADATA
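
For the JTAC data collection, the usual per-node set looks something like this (the file names are just examples):

root@SRX1> request support information | save /var/tmp/rsi-node0.txt
root@SRX1> show system core-dumps
root@SRX1> file archive compress source /var/log/* destination /var/tmp/logs-node0.tgz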

SRX Services Gateway

Re: Chassis cluster crashes after show security flow session

12-10-2019 02:52 AM

The issue seems to be the quality of your fabric link between the two nodes.  Check the cabling and ports here.

 

Switch fabric link statistics:          
    Probe state : DOWN          

Fabric link events:                     
        Dec  9 19:09:12.742 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
        Dec  9 19:15:37.365 : Fabric monitoring is suspended by remote node
        Dec  9 19:17:36.806 : Fabric monitoring suspension is revoked by remote node
        Dec  9 19:17:40.808 : Child link-0 of fab0 is down, pfe notification
        Dec  9 19:17:40.808 : Child link-1 of fab0 is down, pfe notification
        Dec  9 19:17:40.808 : Child link-0 of fab1 is down, pfe notification
        Dec  9 19:17:40.808 : Child link-1 of fab1 is down, pfe notification
        Dec  9 19:17:42.758 : Child link-0 of fab0 is up, pfe notification
        Dec  9 19:17:42.758 : Child link-1 of fab0 is up, pfe notification
        Dec  9 19:17:43.755 : Fabric link up, link status timer
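
You can check both HA links and read the optics on the fabric ports directly, e.g. (ge-0/0/8 is one of the fab0 child interfaces from your outputs):

root@SRX1> show chassis cluster interfaces
root@SRX1> show interfaces diagnostics optics ge-0/0/8
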
Steve Puluka BSEET - Juniper Ambassador
IP Architect - DQE Communications Pittsburgh, PA (Metro Ethernet & ISP)
http://puluka.com/home
SRX Services Gateway

Re: Chassis cluster crashes after show security flow session

12-10-2019 07:07 AM

Thank you for your suggestion. I changed the cabling and the SFP transceivers (originals from Juniper) used for the fabric, then rebooted both nodes. It still doesn't look good.

 

root@SRX1# run sh chassis cluster status     
Monitor Failure codes:
    CS  Cold Sync monitoring        FL  Fabric Connection monitoring
    GR  GRES monitoring             HW  Hardware monitoring
    IF  Interface monitoring        IP  IP monitoring
    LB  Loopback monitoring         MB  Mbuf monitoring
    NH  Nexthop monitoring          NP  NPC monitoring              
    SP  SPU monitoring              SM  Schedule monitoring
    CF  Config Sync monitoring      RE  Relinquish monitoring
 
Cluster ID: 1
Node   Priority Status               Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 0
node0  100      primary              no      no       None           
node1  1        secondary            no      no       None           

Redundancy group: 1 , Failover count: 0
node0  0        primary              no      no       CS             
node1  0        secondary            no      no       CS  


root@SRX1# run sh chassis cluster information no-forwarding 
Redundancy Group Information:

    Redundancy Group 0 , Current State: primary, Weight: 255

        Time            From                 To                   Reason
        n/a             n/a                  n/a                  n/a

    Redundancy Group 1 , Current State: primary, Weight: 0

        Time            From                 To                   Reason
        n/a             n/a                  n/a                  n/a

Chassis cluster LED information:
    Current LED color: Amber
    Last LED change reason: Monitored objects are down
Control port tagging:
    Disabled

Failure Information:

    Coldsync Monitoring Failure Information:
        Statistics:
            Coldsync Total SPUs: 1
            Coldsync completed SPUs: 0  
            Coldsync not complete SPUs: 1
                                        
    Fabric-link Failure Information:    
        Fabric Interface: fab0          
          Child interface   Physical / Monitored Status     
          ge-0/0/13             Up   / Down 


root@SRX1# show interfaces fab0 
fabric-options {
    member-interfaces {
        ge-0/0/13;
    }
}

{primary:node0}[edit]
root@SRX1# show interfaces fab1    
fabric-options {
    member-interfaces {
        ge-5/0/13;
    }
}

{primary:node0}[edit]
root@SRX1# run show interfaces terse | match fab 
ge-0/0/13.0             up    up   aenet    --> fab0.0
ge-5/0/13.0             up    up   aenet    --> fab1.0
fab0                    up    up
fab0.0                  up    up   inet     30.17.0.200/24  
fab1                    up    up
fab1.0                  up    up   inet     30.18.0.200/24  
swfab0                  up    down
swfab1                  up    down


{primary:node0}[edit]
root@SRX1# run show log jsrpd | last 
Dec 10 16:04:39 State of lnk-0 of fab0 remains DISABLED
Dec 10 16:04:39 jsrpd_pfe_fabmon_update_lnk_secure_status: lnk_idx:1, link_secure_state(curr:0, new:255)
Dec 10 16:04:39 State of lnk-1 of fab0 remains DISABLED
Dec 10 16:04:39 jsrpd_pfe_fabmon_update_lnk_secure_status: lnk_idx:2, link_secure_state(curr:0, new:0)
Dec 10 16:04:39 jsrpd_pfe_fabmon_update_lnk_secure_status: lnk_idx:3, link_secure_state(curr:0, new:0)
Dec 10 16:04:39 HA Fabric Info: After fabric child status is updated
Dec 10 16:04:39   node0: fab0 is Active with 1 child (AggId: 130)
Dec 10 16:04:39   link-0: ge-0/0/13 (0/0/13) is Active : ifd_state: Up pfe_state:Down secure_state Disabled
Dec 10 16:04:39   node1: fab1 is Active with 1 child (AggId: 163)
Dec 10 16:04:39   link-0: ge-5/0/13 (5/0/13) is Active : ifd_state: Up pfe_state:Up secure_state Disabled
Dec 10 16:04:50 Cleared control-plane statistics
Dec 10 16:04:53 All events collected for debugging are cleared.
Dec 10 16:05:04 updated rg_info for RG-0 with failover-cnt 0 state: primary into ssam. Result = success, error: 0
Dec 10 16:05:04 updated rg_info for RG-1 with failover-cnt 0 state: primary into ssam. Result = success, error: 0
Dec 10 16:05:04 cleared failover-count for all RGs
Dec 10 16:05:12 ISSU state: 0


root@SRX1# run show log chassisd | last 

Dec 10 15:58:49 SCC: pic_update_ifdev_tlvs: ge-5/0/15 Not SyncE capable
 
Dec 10 15:58:49 SCC: pic_update_ifdev_tlvs: ge-5/0/15 IFM type is Ether, SyncE TLV is added
 
Dec 10 15:58:49 SCC: PIC (fpc 5 pic 0) message operation: add. ifd count 16, flags 0x3 in mesg
Dec 10 15:58:49 LCC: ignoring PIC message on LCC
Dec 10 15:59:53 LCC: send: yellow alarm set, device Routing Engine 0, reason Potential slow peers are: FWDD1
Dec 10 15:59:53 SCC: send: yellow alarm clear, device LCC 0, reason LCC 0 Minor Errors
Dec 10 15:59:53 CHASSISD_IPC_UNEXPECTED_RECV: Received unexpected message from craftd: type = 4, subtype = 43
 
Dec 10 16:02:48 LCC: send: yellow alarm clear, device Routing Engine 0, reason Potential slow peers are:
Dec 10 16:02:48 SCC: send: yellow alarm clear, device LCC 0, reason LCC 0 Minor Errors
Dec 10 16:02:48 CHASSISD_IPC_UNEXPECTED_RECV: Received unexpected message from craftd: type = 4, subtype = 44



{secondary:node1}
root@SRX2> show log jsrpd | last
Dec 10 16:04:39 Fabric link current state: UP link up count:1
Dec 10 16:04:39 HA Fabric Info: After fabric status is updated
Dec 10 16:04:39   node0: fab0 is Active with 1 child (AggId: 130)
Dec 10 16:04:39   link-0: ge-0/0/13 (0/0/13) is Active : ifd_state: Up pfe_state:Down secure_state Disabled
Dec 10 16:04:39   node1: fab1 is Active with 1 child (AggId: 163)
Dec 10 16:04:39   link-0: ge-5/0/13 (5/0/13) is Active : ifd_state: Up pfe_state:Up secure_state Disabled
Dec 10 16:04:39 HA Fabric Info: Before populated from blob
Dec 10 16:04:39   node0: fab0 is Active with 1 child (AggId: 130)
Dec 10 16:04:39   link-0: ge-0/0/13 (0/0/13) is Active : ifd_state: Up pfe_state:Down secure_state Disabled
Dec 10 16:04:39   node1: fab1 is Active with 1 child (AggId: 163)
Dec 10 16:04:39   link-0: ge-5/0/13 (5/0/13) is Active : ifd_state: Up pfe_state:Up secure_state Disabled
Dec 10 16:04:39 HA Fabric Info: After populated from blob
Dec 10 16:04:39   node0: fab0 is Active with 1 child (AggId: 130)
Dec 10 16:04:39   link-0: ge-0/0/13 (0/0/13) is Active : ifd_state: Up pfe_state:Down secure_state Disabled
Dec 10 16:04:39   node1: fab1 is Active with 1 child (AggId: 163)
Dec 10 16:04:39   link-0: ge-5/0/13 (5/0/13) is Active : ifd_state: Up pfe_state:Up secure_state Disabled
Dec 10 16:04:53 All events collected for debugging are cleared.


{secondary:node1}
root@SRX2> show log chassisd | last
Dec 10 15:58:49 LCC: pic_get_egress_shaping_overhead: 0/0 eso val = 0
Dec 10 15:58:49 CHASSISD_SNMP_TRAP10: SNMP trap generated: FRU power on (jnxFruContentsIndex 8, jnxFruL1Index 6, jnxFruL2Index 1, jnxFruL3Index 0, jnxFruName node1 PIC:  @ 0/0/*, jnxFruType 11, jnxFruSlot 5, jnxFruOfflineReason 2, jnxFruLastPowerOff 0, jnxFruLastPowerOn 0)
Dec 10 15:58:49 CHASSISD_SNMP_TRAP3: ENTITY trap generated: entStateOperEnabled (entPhysicalIndex 59, entStateAdmin 4, entStateAlarm 0)
Dec 10 15:58:49 CHASSISD_SNMP_TRAP0: ENTITY trap generated: entConfigChange
Dec 10 15:58:49 LCC: send: fpc 0 pic 0 online ack
Dec 10 15:58:49 LCC: pic attach pic 0, flags 0x0, portcount 64, fpc 0
Dec 10 15:58:49 LCC: pic_set_online: i2c 0x682 pic 0 fpc 0 state 3 in_issu 0
Dec 10 15:58:49 LCC:  pic_type=1666 pic_slot=0 fpc_slot=0 pic_key=0x0 pic_i2c_id=1666
 
Dec 10 15:58:49 LCC: hwdb: entry for pic 1666 at slot 0 in fpc 0 inserted
Dec 10 15:58:49 LCC: FPC 0 PIC 0, attaching clean
Dec 10 15:58:49 LCC: not in vc mode
Dec 10 15:58:49 LCC: Forwarding pic attach to FWDD fpc 0, pic 0
Dec 10 15:58:49 LCC: Got a pic attach ack from fwdd fpc 0pic 0
Dec 10 15:58:49 LCC: FWDD pic attach ack recd fpc 0, pic 0
Dec 10 16:02:54 LCC: pic_copy_port_info:Got SFP Rev=�\����, Pno=NON-JNPR, Sno=5LCD19HY100
Dec 10 16:02:54 LCC: pic_copy_port_info:Got SFP Rev=REV 02, Pno=740-011613, Sno=AM19122U0P7
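
Interestingly, the first SFP line above shows Pno=NON-JNPR, so one of the optics is third-party. The light levels and alarm thresholds on the fabric ports can be read directly, e.g.:

root@SRX1> show interfaces diagnostics optics ge-0/0/13
root@SRX2> show interfaces diagnostics optics ge-5/0/13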

SRX Services Gateway

Re: Chassis cluster crashes after show security flow session

12-10-2019 07:21 AM
{primary:node0}[edit]
root@SRX1# run show chassis cluster information detail    
node0:
--------------------------------------------------------------------------
Redundancy mode:
    Configured mode: active-active
    Operational mode: active-active
Cluster configuration:
    Heartbeat interval: 2000 ms
    Heartbeat threshold: 8
    Control link recovery: Enabled
    Fabric link down timeout: 352 sec
Node health information:
    Local node health: Not healthy
    Remote node health: Not healthy

Redundancy group: 0, Threshold: 255, Monitoring failures: none

Redundancy group: 1, Threshold: 0, Monitoring failures: cold-sync-monitoring
Control link statistics:
    Control link 0:
        Heartbeat packets sent: 332
        Heartbeat packets received: 331
        Heartbeat packet errors: 0
        Duplicate heartbeat packets received: 0
    Control recovery packet count: 0    
    Sequence number of last heartbeat packet sent: 1758
    Sequence number of last heartbeat packet received: 722
Fabric link statistics:                 
    Child link 0                        
        Probes sent: 1323               
        Probes received: 0              
    Child link 1                        
        Probes sent: 0                  
        Probes received: 0              
Switch fabric link statistics:          
    Probe state : DOWN                  
    Probes sent: 0                      
    Probes received: 0                  
    Probe recv errors: 0                
    Probe send errors: 0                
    Probe recv dropped: 0               
    Sequence number of last probe sent: 0
    Sequence number of last probe received: 0
                                        
Chassis cluster LED information:        
    Current LED color: Amber            
    Last LED change reason: Monitored objects are down
Control port tagging:
    Disabled

Cold Synchronization:
    Status:
        Cold synchronization completed for: N/A
        Cold synchronization failed for: N/A
        Cold synchronization not known for: N/A
        Current Monitoring Weight: 255

    Progress:
        CS Prereq               0 of 1 SPUs completed
           1. if_state sync          1 SPUs completed
           2. fabric link            0 SPUs completed
           3. policy data sync       1 SPUs completed
           4. cp ready               0 SPUs completed
           5. VPN data sync          0 SPUs completed
           6. IPID data sync         0 SPUs completed
           7. All SPU ready          0 SPUs completed
           8. AppID ready            0 SPUs completed
           9. Tunnel Sess ready      0 SPUs completed
        CS RTO sync             0 of 1 SPUs completed
        CS Postreq              0 of 1 SPUs completed
                                        
    Statistics:                         
        Number of cold synchronization completed: 0
        Number of cold synchronization failed: 0
                                        
Loopback Information:                   
                                        
    PIC Name        Loopback        Nexthop     Mbuf
    -------------------------------------------------
                    Success         Success     Success    
                                        
Interface monitoring:                   
    Statistics:                         
        Monitored interface failure count: 0
                                        
Fabric monitoring:                      
    Status:                             
        Fabric Monitoring: Enabled      
        Activation status: Active       
        Fabric Status reported by data plane: Down
        JSRPD internal fabric status: Down

Control link status: Up
    Server information:
        Server status : Connected
        Server connected to 130.16.0.1/57742
    Client information:
        Client status : Inactive
        Client connected to None
Control port tagging:
    Disabled

Hardware monitoring:
    Status:
        Activation status: Enabled
        Redundancy group 0 failover for hardware faults: Enabled
        Hardware redundancy group 0 errors: 0
        Hardware redundancy group 1 errors: 0

Schedule monitoring:
    Status:
        Activation status: Disabled     
        Schedule slip detected: None    
        Timer ignored: No               
                                        
    Statistics:                         
        Total slip detected count: 0    
        Longest slip duration: 0(s)     
                                        
Configuration Synchronization:
    Status:                             
        Activation status: Enabled      
        Last sync operation: Auto-Sync  
        Last sync result: Not needed    
        Last sync mgd messages:
                                        
Cold Synchronization Progress:          
    CS Prereq               0 of 1 SPUs completed
       1. if_state sync          1 SPUs completed
       2. fabric link            0 SPUs completed
       3. policy data sync       1 SPUs completed
       4. cp ready               0 SPUs completed
       5. VPN data sync          0 SPUs completed
       6. IPID data sync         0 SPUs completed
       7. All SPU ready          0 SPUs completed
       8. AppID ready            0 SPUs completed
       9. Tunnel Sess ready      0 SPUs completed
    CS RTO sync             0 of 1 SPUs completed
    CS Postreq              0 of 1 SPUs completed

node1:
--------------------------------------------------------------------------
Redundancy mode:
    Configured mode: active-active
    Operational mode: active-active
Cluster configuration:
    Heartbeat interval: 2000 ms
    Heartbeat threshold: 8
    Control link recovery: Enabled
    Fabric link down timeout: 352 sec
Node health information:
    Local node health: Not healthy
    Remote node health: Not healthy     
                                        
Redundancy group: 0, Threshold: 255, Monitoring failures: none
                                        
Redundancy group: 1, Threshold: -200, Monitoring failures: cold-sync-monitoring
Control link statistics:                
    Control link 0:                     
        Heartbeat packets sent: 723     
        Heartbeat packets received: 710 
        Heartbeat packet errors: 0      
        Duplicate heartbeat packets received: 0
    Control recovery packet count: 16   
    Sequence number of last heartbeat packet sent: 723
    Sequence number of last heartbeat packet received: 1758

Fabric link statistics:                 
    Child link 0                        
        Probes sent: 1569               
        Probes received: 1566           
    Child link 1                        
        Probes sent: 0
        Probes received: 0
Switch fabric link statistics:
    Probe state : DOWN
    Probes sent: 0
    Probes received: 0
    Probe recv errors: 0
    Probe send errors: 0
    Probe recv dropped: 0
    Sequence number of last probe sent: 0
    Sequence number of last probe received: 0

Chassis cluster LED information:
    Current LED color: Amber
    Last LED change reason: Monitored objects are down
Control port tagging:
    Disabled

Cold Synchronization:
    Status:
        Cold synchronization completed for: N/A
        Cold synchronization failed for: N/A
        Cold synchronization not known for: N/A
        Current Monitoring Weight: 255  
                                        
    Progress:                           
        CS Prereq               0 of 1 SPUs completed
           1. if_state sync          1 SPUs completed
           2. fabric link            0 SPUs completed
           3. policy data sync       1 SPUs completed
           4. cp ready               0 SPUs completed
           5. VPN data sync          0 SPUs completed
           6. IPID data sync         0 SPUs completed
           7. All SPU ready          0 SPUs completed
           8. AppID ready            0 SPUs completed
           9. Tunnel Sess ready      0 SPUs completed
        CS RTO sync             0 of 1 SPUs completed
        CS Postreq              0 of 1 SPUs completed
    Statistics:
        Number of cold synchronization completed: 0
        Number of cold synchronization failed: 0

Loopback Information:

    PIC Name        Loopback        Nexthop     Mbuf
    -------------------------------------------------
                    Success         Success     Success    

Interface monitoring:
    Statistics:
        Monitored interface failure count: 0

Fabric monitoring:
    Status:
        Fabric Monitoring: Enabled
        Activation status: Active
        Fabric Status reported by data plane: Down
        JSRPD internal fabric status: Down

Control link status: Up
    Server information:                 
        Server status : Connected       
        Server connected to 130.16.0.1/57742
    Client information:                 
        Client status : Inactive        
        Client connected to None        
Control port tagging:                   
    Disabled                            
                                        
Hardware monitoring:                    
    Status:                             
        Activation status: Enabled      
        Redundancy group 0 failover for hardware faults: Enabled
        Hardware redundancy group 0 errors: 0
        Hardware redundancy group 1 errors: 0
                                        
Schedule monitoring:
    Status:                             
        Activation status: Disabled     
        Schedule slip detected: None    
        Timer ignored: No               
    Statistics:                         
        Total slip detected count: 0    
        Longest slip duration: 0(s)

Configuration Synchronization:
    Status:
        Activation status: Enabled
        Last sync operation: Auto-Sync
        Last sync result: Not needed
        Last sync mgd messages:

Cold Synchronization Progress:
    CS Prereq               0 of 1 SPUs completed
       1. if_state sync          1 SPUs completed
       2. fabric link            0 SPUs completed
       3. policy data sync       1 SPUs completed
       4. cp ready               0 SPUs completed
       5. VPN data sync          0 SPUs completed
       6. IPID data sync         0 SPUs completed
       7. All SPU ready          0 SPUs completed
       8. AppID ready            0 SPUs completed
       9. Tunnel Sess ready      0 SPUs completed
    CS RTO sync             0 of 1 SPUs completed
    CS Postreq              0 of 1 SPUs completed
                                        
node1:                                  
--------------------------------------------------------------------------
Redundancy mode:                        
    Configured mode: active-active      
    Operational mode: active-active     
Cluster configuration:                  
    Heartbeat interval: 2000 ms         
    Heartbeat threshold: 8              
    Control link recovery: Enabled      
    Fabric link down timeout: 352 sec   
Node health information:                
    Local node health: Not healthy      
    Remote node health: Not healthy     
                                        
Redundancy group: 0, Threshold: 255, Monitoring failures: none

Redundancy group: 1, Threshold: -200, Monitoring failures: cold-sync-monitoring
Control link statistics:                
    Control link 0:                     
        Heartbeat packets sent: 783     
        Heartbeat packets received: 771 
        Heartbeat packet errors: 0      
        Duplicate heartbeat packets received: 0
    Control recovery packet count: 16
    Sequence number of last heartbeat packet sent: 783
    Sequence number of last heartbeat packet received: 1819
Fabric link statistics:
    Child link 0
        Probes sent: 1811
        Probes received: 1808
    Child link 1
        Probes sent: 0
        Probes received: 0
Switch fabric link statistics:
    Probe state : DOWN
    Probes sent: 0
    Probes received: 0
    Probe recv errors: 0
    Probe send errors: 0
    Probe recv dropped: 0
    Sequence number of last probe sent: 0
    Sequence number of last probe received: 0

Chassis cluster LED information:
    Current LED color: Amber            
    Last LED change reason: Monitored objects are down
Control port tagging:                   
    Disabled                            
                                        
Cold Synchronization:                   
    Status:                             
        Cold synchronization completed for: N/A
        Cold synchronization failed for: N/A
        Cold synchronization not known for: N/A
        Current Monitoring Weight: 255  
    Progress:
        CS Prereq               0 of 1 SPUs completed
           1. if_state sync          1 SPUs completed
           2. fabric link            0 SPUs completed
           3. policy data sync       1 SPUs completed
           4. cp ready               0 SPUs completed
           5. VPN data sync          0 SPUs completed
           6. IPID data sync         0 SPUs completed
           7. All SPU ready          0 SPUs completed
           8. AppID ready            0 SPUs completed
           9. Tunnel Sess ready      0 SPUs completed
        CS RTO sync             0 of 1 SPUs completed
        CS Postreq              0 of 1 SPUs completed

    Statistics:
        Number of cold synchronization completed: 0
        Number of cold synchronization failed: 0

Loopback Information:

    PIC Name        Loopback        Nexthop     Mbuf
    -------------------------------------------------
                    Success         Failure     Success    

Interface monitoring:
    Statistics:
        Monitored interface failure count: 0
                                        
Fabric monitoring:                      
    Status:                             
        Fabric Monitoring: Enabled      
        Activation status: Active       
        Fabric Status reported by data plane: Up
        JSRPD internal fabric status: Up
                                        
Control link status: Up
    Server information:                 
        Server status : Inactive        
        Server connected to None        
    Client information:                 
        Client status : Connected       
        Client connected to 129.16.0.1/62845
Control port tagging:              
    Disabled
                                        
Hardware monitoring:                    
    Status:
        Activation status: Enabled
        Redundancy group 0 failover for hardware faults: Enabled
        Hardware redundancy group 0 errors: 0
        Hardware redundancy group 1 errors: 0

Schedule monitoring:
    Status:
        Activation status: Disabled
        Schedule slip detected: None
        Timer ignored: No

    Statistics:
        Total slip detected count: 0
        Longest slip duration: 0(s)

Configuration Synchronization:
    Status:
        Activation status: Enabled
        Last sync operation: Auto-Sync
        Last sync result: Succeeded

Cold Synchronization Progress:          
    CS Prereq               0 of 1 SPUs completed
       1. if_state sync          1 SPUs completed
       2. fabric link            0 SPUs completed
       3. policy data sync       1 SPUs completed
       4. cp ready               0 SPUs completed
       5. VPN data sync          0 SPUs completed
       6. IPID data sync         0 SPUs completed
       7. All SPU ready          0 SPUs completed
       8. AppID ready            0 SPUs completed
       9. Tunnel Sess ready      0 SPUs completed
    CS RTO sync             0 of 1 SPUs completed
    CS Postreq              0 of 1 SPUs completed
SRX Services Gateway

Re: Chassis cluster crashes after show security flow session

12-10-2019 05:41 PM

Yes, your config looks correct, but there is clearly a signaling problem on the fab link. The replacements at least seem to have brought the physical link up. Since the fiber and optics have all been swapped, it might be a hardware issue.
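
If a port-level fault is suspected, one more test is to move the fabric onto spare ports and see whether the problem follows; a sketch (ge-0/0/2 and ge-5/0/2 are arbitrary examples of free ports):

{primary:node0}[edit]
root@SRX1# delete interfaces fab0 fabric-options member-interfaces ge-0/0/13
root@SRX1# set interfaces fab0 fabric-options member-interfaces ge-0/0/2
root@SRX1# delete interfaces fab1 fabric-options member-interfaces ge-5/0/13
root@SRX1# set interfaces fab1 fabric-options member-interfaces ge-5/0/2
root@SRX1# commit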

 

Steve Puluka BSEET - Juniper Ambassador
IP Architect - DQE Communications Pittsburgh, PA (Metro Ethernet & ISP)
http://puluka.com/home
SRX Services Gateway

Re: Chassis cluster crashes after show security flow session

12-10-2019 06:43 PM

Hi Gabriel, 

 

Greetings,

From the description, I suspect the issue is due to the flowd process crashing.

  • Can you try using a syslog server instead of internal logging (assuming you have internal logging set up) and check whether this resolves the issue?
  • Also, use security log mode stream for security logs (see the sketch below).
  • Check the output after changing the above and share an update on the stability of the cluster.
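
A minimal sketch of stream-mode security logging (the collector 192.0.2.10, source address 10.0.0.1, and stream name SECLOG are placeholders, not your values):

set security log mode stream
set security log source-address 10.0.0.1
set security log stream SECLOG format sd-syslog
set security log stream SECLOG host 192.0.2.10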

Please mark "Accept as solution" if this answers your query. Kudos are appreciated too!


Regards,

Sharat Ainapur

SRX Services Gateway

Re: Chassis cluster crashes after show security flow session

12-10-2019 08:13 PM

Hi,

 

It does not look very good: flowd crashing, fabric monitoring down, scheduler slips.

 

Better to open a JTAC case so this is investigated thoroughly. If you want a quick fix, you can try a simultaneous reboot of both nodes to check whether the issue goes away.

 

Regards,

 

Vikas

SRX Services Gateway

Re: Chassis cluster crashes after show security flow session

12-11-2019 03:02 AM

After rebooting both nodes twice, the chassis cluster status looks better, at least temporarily.

 

I enabled logging to a remote syslog server. I see plenty of errors like those below:

 

Dec 11 10:40:13 SRX1 RT-HAL,rt_entry_add_msg_proc,3723: rt_halp_vectors->rt_create failed 
Dec 11 10:40:13 SRX1 RT-HAL,rt_msg_handler,737: route process failed 
Dec 11 10:40:13 SRX1 /kernel: RT_PFE: RT msg op 1 (PREFIX ADD) failed, err 5 (Invalid)
Dec 11 10:40:13 SRX2 RT: IPv4:0 - 213.145.64/20 (RT: Failed to allocate object for flow) 
Dec 11 10:40:13 SRX2 RT-HAL,rt_entry_add_msg_proc,3723: rt_halp_vectors->rt_create failed 
Dec 11 10:40:13 SRX1 /kernel: RT_PFE: RT msg op 3 (PREFIX CHANGE) failed, err 5 (Invalid)
Dec 11 10:40:13 SRX1 /kernel: RT_PFE: RT msg op 3 (PREFIX CHANGE) failed, err 5 (Invalid)
Dec 11 10:40:13 SRX1 /kernel: RT_PFE: RT msg op 1 (PREFIX ADD) failed, err 5 (Invalid)
Dec 11 10:40:13 SRX1 fto_new: failed to allocate fto 
Dec 11 10:40:13 SRX1 RT: IPv4:0 - 200.39.18/24 (RT: Failed to allocate object for flow) 


Dec 11 12:00:17 SRX1 /kernel: RT_PFE: RT msg op 3 (PREFIX CHANGE) failed, err 5 (Invalid)
Dec 11 12:00:18 SRX1 /kernel: RT_PFE: RT msg op 3 (PREFIX CHANGE) failed, err 5 (Invalid)
Dec 11 12:00:18 SRX1 /kernel: RT_PFE: RT msg op 3 (PREFIX CHANGE) failed, err 5 (Invalid)
Dec 11 12:00:19 SRX1 /kernel: RT_PFE: RT msg op 3 (PREFIX CHANGE) failed, err 5 (Invalid)
Dec 11 12:00:19 SRX1 /kernel: RT_PFE: RT msg op 3 (PREFIX CHANGE) failed, err 5 (Invalid)
Dec 11 12:00:20 SRX1 /kernel: RT_PFE: RT msg op 3 (PREFIX CHANGE) failed, err 5 (Invalid)
Dec 11 12:00:20 SRX1 /kernel: RT_PFE: RT msg op 3 (PREFIX CHANGE) failed, err 5 (Invalid)
Dec 11 12:00:20 SRX1 /kernel: RT_PFE: RT msg op 3 (PREFIX CHANGE) failed, err 5 (Invalid)


Dec 11 11:43:29 SRX2 jbcm_intf_is_sfp_port: Intf ge-5/0/13:176 is SFP port 
Dec 11 11:43:29 SRX2 macsec_req_fab_config: Sending fab request for SFP port 176 
Dec 11 11:43:29 SRX2 jbcm_intf_is_sfp_port: Intf ge-5/0/13:176 is SFP port 
Dec 11 11:49:56 SRX2 JBCM(0/0) link 0 SFP syslog throttling: enabling syslogs for receive power alarms and warnings. (0/0) 


Dec 11 12:04:10 SRX1 JBCM(0/0) link 0 SFP laser bias current low  alarm set 
Dec 11 12:04:10 SRX1 JBCM(0/0) link 0 SFP output power low  alarm set 
Dec 11 12:04:10 SRX1 JBCM(0/0) link 0 SFP laser bias current low  warning set 
Dec 11 12:04:10 SRX1 JBCM(0/0) link 0 SFP output power low  warning set 

 

But I'm not sure whether that could be the source of the problem with the chassis cluster.

 

 

SRX Services Gateway

Re: Chassis cluster crashes after show security flow session

12-11-2019 03:52 AM

Hello,

 


@Gabriel- wrote:

I see plenty of errors like those below:

 

Dec 11 10:40:13 SRX1 RT-HAL,rt_entry_add_msg_proc,3723: rt_halp_vectors->rt_create failed 
Dec 11 10:40:13 SRX1 RT-HAL,rt_msg_handler,737: route process failed 
Dec 11 10:40:13 SRX1 /kernel: RT_PFE: RT msg op 1 (PREFIX ADD) failed, err 5 (Invalid)
Dec 11 10:40:13 SRX2 RT: IPv4:0 - 213.145.64/20 (RT: Failed to allocate object for flow) 
Dec 11 10:40:13 SRX2 RT-HAL,rt_entry_add_msg_proc,3723: rt_halp_vectors->rt_create failed 
Dec 11 10:40:13 SRX1 /kernel: RT_PFE: RT msg op 3 (PREFIX CHANGE) failed, err 5 (Invalid)
Dec 11 10:40:13 SRX1 /kernel: RT_PFE: RT msg op 3 (PREFIX CHANGE) failed, err 5 (Invalid)
Dec 11 10:40:13 SRX1 /kernel: RT_PFE: RT msg op 1 (PREFIX ADD) failed, err 5 (Invalid)
Dec 11 10:40:13 SRX1 fto_new: failed to allocate fto 
Dec 11 10:40:13 SRX1 RT: IPv4:0 - 200.39.18/24 (RT: Failed to allocate object for flow) 


 

How many routes are you pushing to this SRX cluster, please?

SRX340 can hold circa 1M IPv4 routes with enhanced route-scale features and circa 600K without.

https://www.juniper.net/assets/us/en/local/pdf/datasheets/1000550-en.pdf

https://forums.juniper.net/t5/SRX-Services-Gateway/How-to-use-maximum-RIB-FIB-sizein-SRX340-345/m-p/...
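
You can check how many routes the RIB currently holds with:

root@SRX1> show route summary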

HTH

Thx

Alex

 

_____________________________________________________________________

Please ask Your Juniper account team about Juniper Professional Services offerings.
Juniper PS can design, test & build the network/part of the network as per Your requirements

+++++++++++++++++++++++++++++++++++++++++++++

Accept as Solution = cool !
Accept as Solution+Kudo = You are a Star !
SRX Services Gateway

Re: Chassis cluster crashes after show security flow session

12-11-2019 08:45 AM

Thanks for the suggestion. I limited the prefixes received from BGP peers to default routes only (that's fine for now).
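
For reference, a minimal sketch of such a default-only import policy (the policy and group names are examples, not the exact config I used):

set policy-options policy-statement DEFAULT-ONLY term 1 from route-filter 0.0.0.0/0 exact
set policy-options policy-statement DEFAULT-ONLY term 1 then accept
set policy-options policy-statement DEFAULT-ONLY term 2 then reject
set protocols bgp group UPSTREAM import DEFAULT-ONLY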

 

I'm going to verify everything and write back!

SRX Services Gateway

Re: Chassis cluster crashes after show security flow session

2 weeks ago

Hello, can you check whether there are any traceoptions running?

SRX Services Gateway

Re: Chassis cluster crashes after show security flow session

2 weeks ago

show configuration | display set | match traceoptions
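
If any turn up, they can be deactivated and committed; for example, for flow traceoptions (assuming that hierarchy is configured):

{primary:node0}[edit]
root@SRX1# deactivate security flow traceoptions
root@SRX1# commit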

SRX Services Gateway

Re: Chassis cluster crashes after show security flow session

2 weeks ago

Also the output of show system processes extensive,

show chassis routing-engine

show security monitoring

SRX Services Gateway

Re: Chassis cluster crashes after show security flow session

2 weeks ago

show security monitoring