SRX


Ask questions and share experiences about the SRX Series, vSRX, and cSRX.

error: Could not connect to node1 : No route to host - after power failure

  • 1.  error: Could not connect to node1 : No route to host - after power failure

    Posted 11-07-2016 10:10

    I'm busy setting up our new SRX345 firewalls and, in all honesty, it has been a complete nightmare! I finally managed to get the two clustered over our layer 2 network with no errors (by factory resetting and applying the exact same config again, step by step). At that point both the control link and the dual fabric links were connected, and all the subnets were serviced via VLANs on the LACP reth0. Everything appeared to be working properly.

     

    The problems are now with failover and fail back.

    When I issue a shutdown on the Cisco port channel that node0 connects to, it fails over nearly immediately according to 'show log messages', but the continuous ping to the vlan 20 interface is lost for anywhere between 30 seconds and 10 minutes, normally around 6 minutes. I had preempt set, and failback, triggered by entering no shutdown on the Cisco port channel, was considerably quicker, only losing pings for about 30-60 seconds.

     

    I still have no idea why failover is taking so long, and currently no idea how to start diagnosing it, but I now have a worse problem. A colleague suggested I try a more realistic failover and simulate a power cut to node0. That took only a minute to fail over to node1, but on restoring power node0 claims it cannot connect to node1. The Cisco reports that LACP is not enabled on the reth members, and the control and fabric ports do not appear to have initialised; the Cisco is behaving as if the ports are all connected to a hub.

     

    In addition, node0 is very slow to respond to the CLI over the rollover cable and reports the following on the console:-

     

    Message from syslogd@FW01 at Nov 7 17:29:37 ...
    FW01 SCHED: Thread 4 (Module Init) ran for 1045 ms without yielding

    Message from syslogd@FW01 at Nov 7 17:29:37 ...
    FW01 Scheduler Oinker

    Message from syslogd@FW01 at Nov 7 17:29:37 ...
    FW01 Frame 00: sp = 0x510a68c8, pc = 0x182204e8

    Message from syslogd@FW01 at Nov 7 17:29:37 ...
    FW01 Frame 01: sp = 0x510a6970, pc = 0x182082e4

     

    'show interfaces terse' does not list the physical interfaces on node0.

     

    Does anyone know what's going on, or how to fix it?

     

     



  • 2.  RE: error: Could not connect to node1 : No route to host - after power failure

     
    Posted 11-07-2016 18:48

    Hello,

     

    The errors provided are not problematic; they are "Scheduler Oinker" messages, which are thrown when resources are freed on the SRX.

     

    But our main issue is that the data-plane failover is seeing traffic drops of up to 60 seconds, which is quite unacceptable.

     

    So kindly share your device's Junos version and, if possible, the config.

     

    Also, on the SRX300 series the vlan interface is replaced by the irb interface, so I hope the configuration is done accordingly.
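
    For example (purely illustrative names and addresses, not taken from your device), the layer 3 interface for a VLAN on the SRX300 series looks like this:

        vlans {
            vlan20 {
                vlan-id 20;
                /* layer 3 is provided by the irb unit, not a vlan.x interface */
                l3-interface irb.20;
            }
        }
        interfaces {
            irb {
                unit 20 {
                    family inet {
                        address 192.0.2.1/24;
                    }
                }
            }
        }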

     



  • 3.  RE: error: Could not connect to node1 : No route to host - after power failure

    Posted 11-08-2016 00:56

    Thank you for your response. The version is 15.1X49-D70.7 on both devices; config below. 'show chassis cluster status' and 'show chassis cluster statistics' showed everything as normal until the power cycle. Now the cluster appears to be completely broken.

    The SRX devices are connected through Cisco 2960-X series switches with the MTU set to 9014 (VLAN 30 for fabric and VLAN 4094 for control); there are two switches for node0 but only a single switch for node1.

     

    version 15.1X49-D60.7;
    groups {
        node0 {
            system {
                host-name FW01;
            }
            interfaces {
                fxp0 {
                    unit 0 {
                        family inet {
                            address 192.168.9.1/24;
                        }
                    }
                }
            }
        }
        node1 {
            system {
                host-name FW02;
            }
            interfaces {
                fxp0 {
                    unit 0 {
                        family inet {
                            address 192.168.9.2/24;
                        }
                    }
                }
            }
        }
    }
    apply-groups "${node}";
    system {
        domain-name wmdlps.local;
        time-zone Europe/London;
        root-authentication {
            encrypted-password "*****************************************************"; ## SECRET-DATA
        }
        name-server {
            192.168.7.1;
            192.168.7.2;
        }
        services {
            ssh;
            xnm-clear-text;
            web-management {
                http {
                    interface fxp0.0;
                }
                https {
                    system-generated-certificate;
                    interface fxp0.0;
                }
            }
        }
        syslog {
            archive size 100k files 3;
            user * {
                any emergency;
            }
            file messages {
                any critical;
                authorization info;
            }
            file interactive-commands {
                interactive-commands error;
            }
        }
        max-configurations-on-flash 5;
        max-configuration-rollbacks 49;
        license {
            autoupdate {
                url https://ae1.juniper.net/junos/key_retrieval;
            }
        }
        ntp {
            boot-server 192.168.7.1;
            server 192.168.7.1;
            server 192.168.7.2;
        }
    }
    chassis {
        cluster {
            reth-count 1;
            redundancy-group 0 {
                node 0 priority 100;
                node 1 priority 1;
            }
            redundancy-group 1 {
                node 0 priority 100;
                node 1 priority 1;
                preempt;
                interface-monitor {
                    ge-0/0/2 weight 100;
                    ge-0/0/3 weight 100;
                    ge-0/0/4 weight 100;
                    ge-5/0/2 weight 100;
                    ge-5/0/3 weight 100;
                    ge-5/0/4 weight 100;
                }
            }
        }
    }
    security {
        screen {
            ids-option untrust-screen {
                icmp {
                    ping-death;
                }
                ip {
                    source-route-option;
                    tear-drop;
                }
                tcp {
                    syn-flood {
                        alarm-threshold 1024;
                        attack-threshold 200;
                        source-threshold 1024;
                        destination-threshold 2048;
                        timeout 20;
                    }
                    land;
                }
            }
        }
        nat {
            source {
                rule-set trust-to-untrust {
                    from zone trust;
                    to zone untrust;
                    rule source-nat-rule {
                        match {
                            source-address 0.0.0.0/0;
                        }
                        then {
                            source-nat {
                                interface;
                            }
                        }
                    }
                }
            }
        }
        policies {
            from-zone trust to-zone trust {
                policy trust-to-trust {
                    match {
                        source-address any;
                        destination-address any;
                        application any;
                    }
                    then {
                        permit;
                    }
                }
            }
            from-zone trust to-zone untrust {
                policy trust-to-untrust {
                    match {
                        source-address any;
                        destination-address any;
                        application any;
                    }
                    then {
                        deny;
                        log {
                            session-init;
                        }
                    }
                }
            }
            from-zone trust to-zone dmz {
                policy trust-to-dmz {
                    match {
                        source-address any;
                        destination-address any;
                        application any;
                    }
                    then {
                        deny;
                        log {
                            session-init;
                        }
                    }
                }
            }
            global {
                policy default-deny {
                    match {
                        source-address any;
                        destination-address any;
                        application any;
                    }
                    then {
                        deny;
                        log {
                            session-init;
                            session-close;
                        }
                    }
                }
            }
        }
        zones {
            security-zone trust {
                host-inbound-traffic {
                    system-services {
                        all;
                        ping;
                        ssh;
                        http;
                        https;
                    }
                    protocols {
                        all;
                    }
                }
                interfaces {
                    irb.20;
                    irb.101;
                    irb.31;
                }
            }
            security-zone untrust {
                interfaces {
                    irb.69;
                    irb.70;
                }
            }
            security-zone dmz {
                interfaces {
                    irb.21;
                }
            }
        }
    }
    interfaces {
        ge-0/0/2 {
            gigether-options {
                redundant-parent reth0;
            }
        }
        ge-0/0/3 {
            gigether-options {
                redundant-parent reth0;
            }
        }
        ge-0/0/4 {
            gigether-options {
                redundant-parent reth0;
            }
        }
        ge-5/0/2 {
            gigether-options {
                redundant-parent reth0;
            }
        }
        ge-5/0/3 {
            gigether-options {
                redundant-parent reth0;
            }
        }
        ge-5/0/4 {
            gigether-options {
                redundant-parent reth0;
            }
        }
        fab0 {
            fabric-options {
                member-interfaces {
                    ge-0/0/0;
                    ge-0/0/7;
                }
            }
        }
        fab1 {
            fabric-options {
                member-interfaces {
                    ge-5/0/0;
                }
            }
        }
        irb {
            unit 20 {
                family inet {
                    address 192.168.7.121/21;
                }
            }
            unit 21 {
                family inet;
            }
            unit 31 {
                family inet;
            }
            unit 69 {
                family inet;
            }
            unit 70 {
                family inet;
            }
            unit 101 {
                family inet;
            }
        }
        reth0 {
            redundant-ether-options {
                redundancy-group 1;
                lacp {
                    active;
                    periodic slow;
                }
            }
            unit 0 {
                family ethernet-switching {
                    interface-mode trunk;
                    vlan {
                        members [ 20-21 31 69-70 101 ];
                    }
                }
            }
        }
    }
    snmp {
        client-list OpMan {
            192.168.7.11/32;
        }
        community dlps_pub {
            authorization read-only;
            client-list-name OpMan;
        }
    }
    protocols {
        l2-learning {
            global-mode switching;
        }
    }
    vlans {
        cp_mpls {
            vlan-id 31;
            l3-interface irb.31;
        }
        dirty_adsl {
            vlan-id 70;
            l3-interface irb.70;
        }
        dirty_zen {
            vlan-id 69;
            l3-interface irb.69;
        }
        dmz {
            vlan-id 21;
            l3-interface irb.21;
        }
        lan_192 {
            vlan-id 20;
            l3-interface irb.20;
        }
        mgmt {
            vlan-id 101;
            l3-interface irb.101;
        }
    }
    

     

    edit:-

    Since posting this I have repeatedly rebooted both nodes and eventually disconnected all the cables except the control port. Four or five reboots of both nodes later they appeared to come up correctly, and re-enabling the ports has put me back to where I was on Monday, before the power cycle.

    {primary:node0}
    root@FW01> show chassis cluster status
    Monitor Failure codes:
        CS  Cold Sync monitoring        FL  Fabric Connection monitoring
        GR  GRES monitoring             HW  Hardware monitoring
        IF  Interface monitoring        IP  IP monitoring
        LB  Loopback monitoring         MB  Mbuf monitoring
        NH  Nexthop monitoring          NP  NPC monitoring
        SP  SPU monitoring              SM  Schedule monitoring
        CF  Config Sync monitoring
    
    Cluster ID: 1
    Node   Priority Status         Preempt Manual   Monitor-failures
    
    Redundancy group: 0 , Failover count: 1
    node0  100      primary        no      no       None
    node1  1        secondary      no      no       None
    
    Redundancy group: 1 , Failover count: 3
    node0  100      primary        yes     no       None
    node1  1        secondary      yes     no       None
    
    
    
    {primary:node0}
    root@FW01> show chassis cluster statistics
    Control link statistics:
        Control link 0:
            Heartbeat packets sent: 944
            Heartbeat packets received: 948
            Heartbeat packet errors: 0
    Fabric link statistics:
        Child link 0
            Probes sent: 1887
            Probes received: 883
        Child link 1
            Probes sent: 1887
            Probes received: 1004
    Services Synchronized:
        Service name                              RTOs sent    RTOs received
        Translation context                       0            0
        Incoming NAT                              0            0
        Resource manager                          0            0
        DS-LITE create                            0            0
        Session create                            34           0
        IPv6 session create                       0            0
        Session close                             31           0
        IPv6 session close                        0            0
        Session change                            0            0
        IPv6 session change                       0            0
        ALG Support Library                       0            0
        Gate create                               0            0
        Session ageout refresh requests           0            0
        IPv6 session ageout refresh requests      0            0
        Session ageout refresh replies            0            0
        IPv6 session ageout refresh replies       0            0
        IPSec VPN                                 0            0
        Firewall user authentication              0            0
        MGCP ALG                                  0            0
        H323 ALG                                  0            0
        SIP ALG                                   0            0
        SCCP ALG                                  0            0
        PPTP ALG                                  0            0
        JSF PPTP ALG                              0            0
        RPC ALG                                   0            0
        RTSP ALG                                  0            0
        RAS ALG                                   0            0
        MAC address learning                      0            0
        GPRS GTP                                  0            0
        GPRS SCTP                                 0            0
        GPRS FRAMEWORK                            0            0
        JSF RTSP ALG                              0            0
        JSF SUNRPC MAP                            0            0
        JSF MSRPC MAP                             0            0
        DS-LITE delete                            0            0
        JSF SLB                                   0            0
        APPID                                     0            0
        JSF MGCP MAP                              0            0
        JSF H323 ALG                              0            0
        JSF RAS ALG                               0            0
        JSF SCCP MAP                              0            0
        JSF SIP MAP                               0            0
        PST_NAT_CREATE                            0            0
        PST_NAT_CLOSE                             0            0
        PST_NAT_UPDATE                            0            0
        JSF TCP STACK                             0            0
        JSF IKE ALG                               0            0
    

    I have not been able to repeat the 10-minute failover times, but I am very concerned by how erratic the results have been. Any advice greatly appreciated.

     



  • 4.  RE: error: Could not connect to node1 : No route to host - after power failure

     
    Posted 11-08-2016 23:42

    Hello,

     

    Thanks for the details. The configuration looks fine. Can you try changing the LACP setting to "periodic fast" and see if that helps?
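
    In set form that is something like the following (assuming reth0 as in your config; adjust the reth name if yours differs), committed from the primary node:

        set interfaces reth0 redundant-ether-options lacp periodic fast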



  • 5.  RE: error: Could not connect to node1 : No route to host - after power failure

    Posted 11-09-2016 01:57

    Thanks, but I'm afraid that made little difference.

     

    Failover after the change, by shutting down the port channel:-

    1st test: pings lost for just shy of 9 minutes, failback was 30 seconds.

    2nd test: pings lost for just shy of 9 minutes, failback was 30 seconds.

     

    Failover using 'request chassis cluster failover redundancy-group 1 node 1' and failback using 'request chassis cluster failover reset redundancy-group 1':-

    1st test:  pings lost for 38 seconds, failback took 1 second.

    2nd test:  pings lost for 38 seconds, failback took 30 seconds.
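
    For reference, those failover and failback commands again, plus the status check that confirms which node is currently primary for redundancy group 1:

        request chassis cluster failover redundancy-group 1 node 1
        show chassis cluster status redundancy-group 1
        request chassis cluster failover reset redundancy-group 1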

     

     



  • 6.  RE: error: Could not connect to node1 : No route to host - after power failure

    Posted 11-09-2016 08:37

    Hi,

    Thanks for reading the post this far!

     

    I'm wondering if I've got my fabric ports set up right:

        fab0 {
            fabric-options {
                member-interfaces {
                    ge-0/0/0;
                    ge-0/0/7;
                }
            }
        }
        fab1 {
            fabric-options {
                member-interfaces {
                    ge-5/0/0;
                }
            }
        }

    ge-0/0/0 is connected to Cisco Core1 switch port g0/19, which is an access port on VLAN 30.

    ge-0/0/7 is connected to Cisco Core2 switch port g0/19, which is an access port on VLAN 30.

    ge-5/0/0 is connected to Cisco Core3 switch port g0/19, which is an access port on VLAN 30.

    VLAN 30 is carried to and from Core1, Core2 and Core3 on g0/25 to the WAN switch (see previous image).

     

    When a node is powered down I get a lot of errors on the Cisco switch about MAC addresses flapping between ports g0/19 and g0/25.

     

    ge-0/0/1 and ge-5/0/1 are the Juniper control ports; they are connected to Cisco Core1 and Core3 respectively on g0/5, which are access ports in VLAN 4094. VLAN 4094 is also carried between switches via g0/25 through the WAN switch (MTU is 9014).

     

    When a node is powered off I also get errors on the Cisco switch stating that g0/5 is connected to g0/19.

     

    --

    So after a power loss I have to shut down all the Juniper-facing ports on the Cisco switch except the control port and reboot both SRX nodes, otherwise the cluster never recovers. Surely this is wrong?

     

    Once I've got the cluster converged, all the Cisco errors cease and I can re-enable the ports one at a time, but this does mean that recovery from a power failure requires 30+ minutes of hands-on downtime.
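
    (For anyone else hitting this, the control and fabric link state can be checked at each step with the standard cluster commands, e.g.

        show chassis cluster interfaces
        show chassis cluster statistics

    which show whether the control link and each fabric child link are up, and whether heartbeats and fabric probes are flowing.)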

     

    --

    Juniper support told me this issue is out of scope for them. :(

     



  • 7.  RE: error: Could not connect to node1 : No route to host - after power failure

     
    Posted 11-09-2016 21:22

    Hello,

     

    First of all, you are missing the secondary fab port on the secondary node, which should be "ge-5/0/7".

     

    Secondly, the Cisco switch ports connecting the secondary fab links "ge-0/0/7" and "ge-5/0/7" should be in a different VLAN as a best practice (with jumbo frames enabled). That's why we are getting the errors: the fab probes are getting confused between the primary and secondary fab links.

    It also should not share a VLAN with the revenue ports.

     

    Now, regarding your failback issue: it needs to be investigated why it takes almost 30 minutes to fail back. It may be outside the device itself (the switch might also be contributing), but they need to identify the issue so that we can proceed. Did support say whether this is a supported setup or not?
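
    In set form (a sketch; confirm the port numbering against your cabling) the missing fab1 member is added with:

        set interfaces fab1 fabric-options member-interfaces ge-5/0/7

    and the Cisco access ports for ge-0/0/7 and ge-5/0/7 then go in their own VLAN, separate from the VLAN carrying ge-0/0/0 and ge-5/0/0.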

     



  • 8.  RE: error: Could not connect to node1 : No route to host - after power failure

    Posted 11-10-2016 01:25

    Thank you for your response. I have not been able to find documentation that states whether dual fabric ports should be in the same or different VLANs, so this is very important information you have provided! I'll get this set up and tested as soon as I can.

     

    Before I can reconfigure, though, I cannot get node1 back into the cluster after the last test I ran yesterday afternoon. I would like to get the cluster back together so that I can make the updates. I think this informational message on the Cisco is relevant:

    Nov 10 09:11:33.531: %CDP-4-NATIVE_VLAN_MISMATCH: Native VLAN mismatch discovered on GigabitEthernet0/5 (4094), with Switch-Core03.wmdlps.local GigabitEthernet0/19 (30).
    

     

    g0/5 is connected to the SRX control port and g0/19 is connected to the fabric port. CDP is Cisco Discovery Protocol and has nothing to do with Juniper, but why can the interface for the control port see the interface for the fabric port? I get this message every time I reboot either firewall, and when the firewall does not come back into the cluster I keep getting this message on the switch.

     

    At the moment 'show chassis cluster status' on node1 shows:

    Redundancy group: 0 , Failover count: 0
    node0  200      primary        no      no       None
    node1  0        secondary      no      no       CF
    
    Redundancy group: 1 , Failover count: 0
    node0  200      primary        no      no       None
    node1  0        secondary      no      no       IF CS CF

    and on node0 shows:

     

    Redundancy group: 0 , Failover count: 1
    node0  200      primary        no      no       None
    node1  0        lost           n/a     n/a      n/a
    
    Redundancy group: 1 , Failover count: 1
    node0  200      primary        no      no       None
    node1  0        lost           n/a     n/a      n/a

     

    I'll keep trying. I've tried diagnosing this through KB troubleshooting articles with no luck, but it seems that persistently repeating the same actions does give different results (insanity defined?), so hopefully later today I will be able to try your suggestions on the fabric ports.

     

    Tech support pointed me to a few KB articles, then simply said that initial configuration was outside their scope and closed the case. Basically, I have to get this working before they will help.

     

    Thanks, Sam :)



  • 9.  RE: error: Could not connect to node1 : No route to host - after power failure

     
    Posted 11-10-2016 22:30

    Hello,

     

    The secondary node1 is failing with a config sync (CF) issue, which can be caused by the fab link problem, so once we fix the fab link I hope the clustering issue will be resolved.

    Then we can concentrate on the actual failover issue.



  • 10.  RE: error: Could not connect to node1 : No route to host - after power failure

    Posted 11-18-2016 05:17

    Thank you for your guidance in this thread.

    We had a Juniper expert on site on the 11th and he could not find fault with the config I was running, nor could he create a config that would work on these firewalls, even with them directly connected. He concluded that something was definitely wrong but was unable to determine whether it was a software or hardware issue, as the results were irregular.

     

    I am now talking with my account manager.

     



  • 11.  RE: error: Could not connect to node1 : No route to host - after power failure

    Posted 11-21-2016 16:02

    This is concerning me a bit:

    "When a node is powered down I get a lot of errors on the cisco switch about macs flapping between ports g0/19 and g0/25."

     

    If one of the nodes is powered down, then we can't possibly have a split-brain, so... an L2 loop??

     

    I had the physical interfaces disappear on me once as well, after a reboot during setup and testing. KB https://kb.juniper.net/InfoCenter/index?page=content&id=KB23033&actp=search says that happens when an fxp interface has configuration. But how could that be? I didn't have any configuration, I thought, only to discover that the factory default configuration had somehow made its way back to the active config (?!?!) ...
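
    If it helps anyone, a quick way to check for stray fxp0 configuration (a sketch, assuming the usual node0/node1 group layout as in the config earlier in this thread; the KB itself is the authority on exactly what triggers the problem):

        show configuration interfaces fxp0
        show configuration groups node0 interfaces fxp0
        show configuration groups node1 interfaces fxp0

    With that layout, fxp0 should only appear under the two node groups, so the first command would ideally come back empty.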

     

    Once the cluster was installed in production, we did experience a power-loss event on one of the nodes. The packet loss was about 2-3 seconds, and it was the same when the node came back online. Now I'm feeling lucky about it. Or maybe it's just that I have the two devices connected directly on control and fabric, and the cluster probably reacts faster to an interface going down than it does to missed probes.



  • 12.  RE: error: Could not connect to node1 : No route to host - after power failure

    Posted 11-22-2016 00:08

    Hi Nikolay,

    I think you experienced expected behaviour.

    The layer 2 loop issue I had appeared on the ports connected to the powered-up device when the other device went offline. The SRX was bridging its ports like a hub, which, quite rightly, the switch connected to those ports did not like at all.

     

    The problems I have are the same whether the firewalls are directly connected or go through our switched network, so we have determined that either A) one or both of the firewalls are broken, or B) the SRX345 model is massively flawed and does not work.
    Obviously I'm hoping for option A and a replacement firewall, but they're not being fast about it. Still waiting on our account manager.
    Obviously I'm hoping on option A and a replacement firewall but they're not being fast about it. Still waiting on our account manager.