SRX Services Gateway

Problems and more problems in a SRX340 cluster.... the neverending story

a week ago

Hi guys, 

This story continues from here: https://forums.juniper.net/t5/SRX-Services-Gateway/Junos-upgrade-fails-on-SRX340-cluster-from-15-1X4...

 

I was struggling to upgrade an SRX340 cluster to a newer Junos version, and finally, with the help of some gurus, I managed to upgrade both nodes to 18.3R2.7.

 

Now, after the upgrade, I'm facing new issues. I can't SSH to the device anymore on its single configured reth interface, while I can still log in on the console port with the same root password. Also, sometimes the HA status looks fine, but other times the HA LED turns amber, and the usual commands show the output below:
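In case it helps with triage, these are read-only commands I can still run from the console to look at the SSH side (standard Junos operational commands, nothing specific to my setup):

root@SPCFW-BRAVO> show configuration system services
root@SPCFW-BRAVO> show system connections | match 22
root@SPCFW-BRAVO> show log messages | last 20

The idea is to confirm sshd is actually listening on port 22 and to catch any authentication or daemon errors in the messages log.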

 

root@SPCFW-BRAVO> show chassis firmware  
node0:
--------------------------------------------------------------------------
Part                     Type       Version
FPC 0                    O/S        Version 18.3R2.7 by builder on 2019-05-03 09:17:52 UTC
FWDD                     O/S        Version 18.3R2.7 by builder on 2019-05-03 09:17:52 UTC

node1:
--------------------------------------------------------------------------
Part                     Type       Version
FPC 0                    O/S        Version 18.3R2.7 by builder on 2019-05-03 09:17:52 UTC
FWDD                     O/S        Version 18.3R2.7 by builder on 2019-05-03 09:17:52 UTC
root@SPCFW-BRAVO> show chassis cluster information 
node0:
--------------------------------------------------------------------------
Redundancy Group Information:

    Redundancy Group 0 , Current State: primary, Weight: 255

        Time            From                 To                   Reason
        Sep 11 20:57:13 hold                 secondary            Hold timer expired
        Sep 11 20:57:22 secondary            primary              Better priority (200/100)

    Redundancy Group 1 , Current State: primary, Weight: 0

        Time            From                 To                   Reason
        Sep 11 20:57:13 hold                 secondary            Hold timer expired
        Sep 11 20:57:24 secondary            primary              Remote yield (0/0)

Chassis cluster LED information:
    Current LED color: Amber
    Last LED change reason: Monitored objects are down
Control port tagging:                   
    Disabled

Failure Information:

    Coldsync Monitoring Failure Information:
        Statistics:
            Coldsync Total SPUs: 1
            Coldsync completed SPUs: 0
            Coldsync not complete SPUs: 1

    Fabric-link Failure Information:
        Fabric Interface: fab0
          Child interface   Physical / Monitored Status     
          ge-0/0/2              Up   / Down 

node1:
--------------------------------------------------------------------------
Redundancy Group Information:

    Redundancy Group 0 , Current State: secondary, Weight: 0

        Time            From                 To                   Reason
        Sep 11 20:57:21 hold                 secondary            Hold timer expired

    Redundancy Group 1 , Current State: secondary, Weight: -255

        Time            From                 To                   Reason
        Sep 11 20:57:22 hold                 secondary            Hold timer expired

Chassis cluster LED information:
    Current LED color: Amber
    Last LED change reason: Monitored objects are down
Control port tagging:
    Disabled

Failure Information:

    Coldsync Monitoring Failure Information:
        Statistics:
            Coldsync Total SPUs: 1
            Coldsync completed SPUs: 0
            Coldsync not complete SPUs: 1

    Fabric-link Failure Information:    
        Fabric Interface: fab1
          Child interface   Physical / Monitored Status     
          ge-5/0/2              Up   / Down 

{secondary:node1}
root@SPCFW-BRAVO> show chassis cluster status        
Monitor Failure codes:
    CS  Cold Sync monitoring        FL  Fabric Connection monitoring
    GR  GRES monitoring             HW  Hardware monitoring
    IF  Interface monitoring        IP  IP monitoring
    LB  Loopback monitoring         MB  Mbuf monitoring
    NH  Nexthop monitoring          NP  NPC monitoring              
    SP  SPU monitoring              SM  Schedule monitoring
    CF  Config Sync monitoring      RE  Relinquish monitoring
 
Cluster ID: 1
Node   Priority Status               Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 0
node0  200      primary              no      no       None           
node1  0        secondary            no      no       FL             

Redundancy group: 1 , Failover count: 0
node0  0        primary              yes     no       CS             
node1  0        secondary            yes     no       CS FL          
root@SPCFW-BRAVO> show chassis cluster interfaces 
Control link status: Up

Control interfaces: 
    Index   Interface   Monitored-Status   Internal-SA   Security
    0       fxp1        Up                 Disabled      Disabled  

Fabric link status: Down

Fabric interfaces: 
    Name    Child-interface    Status                    Security
                               (Physical/Monitored)
    fab0    ge-0/0/2           Up   / Down               Disabled   
    fab0   
    fab1    ge-5/0/2           Up   / Down               Disabled   
    fab1   

Redundant-ethernet Information:     
    Name         Status      Redundancy-group
    reth0        Down        Not configured   
    reth1        Up          1                
    reth2        Down        Not configured   
    reth3        Down        Not configured   
    reth4        Down        Not configured   
                                        
Redundant-pseudo-interface Information:
    Name         Status      Redundancy-group
    lo0          Up          0                

It seems that, for some reason I can't understand, fab0's child ge-0/0/2 sometimes comes up and other times goes down.
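For what it's worth, this is how I've been watching the flapping from the console (plain operational commands; the interface name matches my fabric child):

root@SPCFW-BRAVO> show chassis cluster statistics
root@SPCFW-BRAVO> show interfaces ge-0/0/2 extensive | match "error|flap"
root@SPCFW-BRAVO> show log jsrpd | last 30

The cluster statistics show whether fabric probes are actually being sent and received, and the jsrpd log (if jsrpd tracing is enabled on your box) records every monitored-status transition.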

 

What do you think? Should I reinstall the same Junos version? Go back to 15.1?

 

Any help would be much appreciated

Thanks!

6 REPLIES

Re: Problems and more problems in a SRX340 cluster.... the neverending story

a week ago

BTW, this is the full config of the cluster:

 

root@SPCFW-BRAVO> show configuration 
## Last commit: 2019-09-10 23:53:54 CEST by root
version 18.3R2.7;
groups {
    node0 {
        system {
            host-name SPCFW-ALPHA;
        }
        interfaces {
            fxp0 {
                unit 0 {
                    family inet {
                        address 10.101.44.1/24;
                    }
                }
            }
        }
    }
    node1 {
        system {
            host-name SPCFW-BRAVO;
        }
        interfaces {
            fxp0 {
                unit 0 {                
                    family inet {
                        address 10.101.44.2/24;
                    }
                }
            }
        }
    }
}
apply-groups "${node}";
system {
    root-authentication {
        encrypted-password "$5$ ## SECRET-DATA
    }
    time-zone Europe/Madrid;
    name-server {
        8.8.8.8;
        8.8.4.4;
    }
    services {
        ssh;
        netconf {
            ssh;                        
        }
        web-management {
            https {
                system-generated-certificate;
            }
        }
    }
    syslog {
        archive size 100k files 3;
        user * {
            any emergency;
        }
        file messages {
            any notice;
            authorization info;
        }
        file interactive-commands {
            interactive-commands any;
        }
    }
    max-configurations-on-flash 5;
    max-configuration-rollbacks 5;
    license {                           
        autoupdate {
            url https://ae1.juniper.net/junos/key_retrieval;
        }
    }
    ntp {
        server 69.164.198.192 prefer;
        server 216.239.35.8 prefer;
    }
    phone-home {
        server https://redirect.juniper.net;
    }
}
chassis {
    alarm {
        management-ethernet {
            link-down ignore;
        }
    }
    cluster {
        control-link-recovery;
        reth-count 5;
        redundancy-group 0 {
            node 0 priority 200;        
            node 1 priority 100;
        }
        redundancy-group 1 {
            node 0 priority 200;
            node 1 priority 100;
            preempt;
        }
    }
}
security {
    log {
        mode stream;
        report;
    }
    screen {
        ids-option untrust-screen {
            icmp {
                ping-death;
            }
            ip {
                source-route-option;
                tear-drop;
            }                           
            tcp {
                syn-flood {
                    alarm-threshold 1024;
                    attack-threshold 200;
                    source-threshold 1024;
                    destination-threshold 2048;
                    timeout 20;
                }
                land;
            }
        }
    }
    zones {
        security-zone Internal {
            host-inbound-traffic {
                system-services {
                    all;
                }
                protocols {
                    all;
                }
            }
            interfaces {                
                reth1.0;
            }
        }
        security-zone External;
        security-zone VPN;
        security-zone DMZ;
    }
}
interfaces {
    ge-0/0/3 {
        gigether-options {
            redundant-parent reth1;
        }
    }
    ge-5/0/3 {
        gigether-options {
            redundant-parent reth1;
        }
    }
    fab0 {
        fabric-options {
            member-interfaces {
                ge-0/0/2;               
            }
        }
    }
    fab1 {
        fabric-options {
            member-interfaces {
                ge-5/0/2;
            }
        }
    }
    reth1 {
        description MGMT;
        redundant-ether-options {
            redundancy-group 1;
        }
        unit 0 {
            family inet {
                address 10.101.40.254/24;
            }
        }
    }
}
protocols {                             
    l2-learning {
        global-mode switching;
    }
    rstp {
        interface all;
    }
}
access {
    address-assignment {
        pool junosDHCPPool1 {
            family inet {
                network 192.168.1.0/24;
                range junosRange {
                    low 192.168.1.2;
                    high 192.168.1.254;
                }
                dhcp-attributes {
                    router {
                        192.168.1.1;
                    }
                    propagate-settings ge-0/0/0.0;
                }
            }                           
        }
        pool junosDHCPPool2 {
            family inet {
                network 192.168.2.0/24;
                range junosRange {
                    low 192.168.2.2;
                    high 192.168.2.254;
                }
                dhcp-attributes {
                    router {
                        192.168.2.1;
                    }
                    propagate-settings ge-0/0/0.0;
                }
            }
        }
    }
}
vlans {
    vlan-trust {
        vlan-id 3;
        l3-interface irb.0;
    }                                   
}

{secondary:node1}
Solution
Accepted by topic author Trasgu
13 hours ago

Re: Problems and more problems in a SRX340 cluster.... the neverending story

Thursday

Trasgu,

 

Can you change the cable connecting ge-0/0/2 on both nodes, to rule out a bad cable?

Can you move the fabric link to an interface other than ge-0/0/2 on both nodes?

Gather a "show interfaces terse" while the issue is occurring, to confirm whether the physical interfaces are actually going down.
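If it helps, moving the fabric link would look something like this from the primary node (ge-0/0/6 and ge-5/0/6 are just examples of free ports, pick whatever is unused on your boxes; in a chassis cluster the commit is synchronized to the other node automatically):

{primary:node0}[edit]
root@SPCFW-ALPHA# delete interfaces fab0 fabric-options member-interfaces ge-0/0/2
root@SPCFW-ALPHA# set interfaces fab0 fabric-options member-interfaces ge-0/0/6
root@SPCFW-ALPHA# delete interfaces fab1 fabric-options member-interfaces ge-5/0/2
root@SPCFW-ALPHA# set interfaces fab1 fabric-options member-interfaces ge-5/0/6
root@SPCFW-ALPHA# commit

And move the cable to the new ports accordingly, of course.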

 


Re: Problems and more problems in a SRX340 cluster.... the neverending story

Thursday

Hi Andres, 

Of course I can try a different cable, but I don't think it will help: the same interface on the second node is fine, the link LED is green, and this only started after the Junos upgrade...

 

Also, there's still the SSH problem... everything smells really bad.

 

I'll run those tests this evening.

 

Thanks


Re: Problems and more problems in a SRX340 cluster.... the neverending story

13 hours ago

Trasgu,

 

Can you check the following command on both nodes: show chassis cluster statistics

 

Given that the fab link is down on only one node, can you reboot both nodes simultaneously so they sync up?
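To be clear, I don't believe there is a single command that reboots both nodes of a branch SRX cluster at once, so the simplest way is to console into each node and issue the reboot at roughly the same time:

root@SPCFW-ALPHA> request system reboot
root@SPCFW-BRAVO> request system reboot

That way both nodes come up together and the coldsync and fabric state is rebuilt from scratch.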

 


Re: Problems and more problems in a SRX340 cluster.... the neverending story

13 hours ago

Finally I got it working stably with the help of a Juniper guru; he basically deleted the whole configuration and started from scratch. Even after that, SSH still failed from SecureCRT but worked from PuTTY, so we had to change the SSH authentication options.
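For anyone hitting the same SecureCRT-vs-PuTTY symptom: I don't have the exact snippet we applied, but it was along these lines, explicitly listing the key-exchange and cipher algorithms the client offers (the algorithm names below are examples of valid Junos values, adjust to whatever your client negotiates):

root@SPCFW-BRAVO# set system services ssh key-exchange [ group-exchange-sha2 ecdh-sha2-nistp256 ]
root@SPCFW-BRAVO# set system services ssh ciphers [ aes256-ctr aes128-ctr ]
root@SPCFW-BRAVO# commit

Newer Junos releases tightened the default SSH algorithm list, which may be why an older client profile suddenly stops negotiating after an upgrade.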

 

Thanks


Re: Problems and more problems in a SRX340 cluster.... the neverending story

13 hours ago

Nice! Based on my research, a simultaneous reboot should have helped, but in any case you effectively did that while re-configuring the cluster. I'm glad your SRX cluster is back on track.