SRX Services Gateway

SRX 240 HA cluster lost its secondary unit

‎12-10-2018 03:42 AM

Hi,

 

We have an SRX 240 HA cluster and the secondary unit seems to be lost. We can't connect to it via SSH, only on its console port.

show chassis cluster status says it's lost.

 

 

Cluster ID: 1
Node   Priority Status         Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 9
node0  200      primary        no      no       None
node1  0        lost           n/a     n/a      n/a

Redundancy group: 1 , Failover count: 1
node0  0        primary        no      no       CS
node1  0        lost           n/a     n/a      n/a

Redundancy group: 2 , Failover count: 1
node0  0        primary        no      no       CS
node1  0        lost           n/a     n/a      n/a

Redundancy group: 3 , Failover count: 1
node0  0        primary        no      no       CS
node1  0        lost           n/a     n/a      n/a

Redundancy group: 4 , Failover count: 3
node0  0        primary        no      no       CS
node1  0        lost           n/a     n/a      n/a

 

When we console onto the secondary device, we see that it can't even see its own interfaces:

 

 

show interfaces terse
Interface               Admin Link Proto    Local                 Remote
fxp0                    up    up
fxp0.0                  up    up   inet     192.168.x.y/29
fxp1                    up    up
fxp1.0                  up    up   inet     130.16.0.1/2
                                   tnp      0x2100001
fxp2                    up    up
fxp2.0                  up    up   tnp      0x2100001
gre                     up    up
ipip                    up    up
lo0                     up    up
lsi                     up    up
mtun                    up    up
pimd                    up    up
pime                    up    up
tap                     up    up

 

The primary device can see its own interfaces, but not the secondary's. The control link seems to be working, but the fabric links are not.

 

 show chassis cluster control-plane statistics
Control link statistics:
    Control link 0:
        Heartbeat packets sent: 5184071
        Heartbeat packets received: 4956136
        Heartbeat packet errors: 0
Fabric link statistics:
    Child link 0
        Probes sent: 883891
        Probes received: 0
    Child link 1
        Probes sent: 530051
        Probes received: 0

We've checked the cabling a few times and everything is okay; nobody has touched it since it was installed last year.

Software image is the same on both devices. 

 

Model: srx240h2
JUNOS Software Release [12.3X48-D45.6]

When we disabled clustering on the secondary device, it could see its interfaces again after the reboot.

We thought the problem was with the secondary unit, so we replaced it with another SRX 240. After enabling clustering and loading the config onto it, the problem still occurs.
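
For completeness, the disable/enable was done from operational mode with commands along these lines (cluster ID 1 and node 1 as in our setup, adjust to yours):

set chassis cluster disable reboot
set chassis cluster cluster-id 1 node 1 reboot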

 

Also, when we are on the secondary, we can see error messages:

 

mgmtfw01-b mgmtfw01-b CMLC: Chassis Manager terminated

Message from syslogd@mgmtfw01-b at Dec 10 12:25:09  ...
mgmtfw01-b mgmtfw01-b CMLC: Chassis Manager terminated

 

Has anyone seen this kind of behavior? 

 

Config of the cluster:

 

set groups node0 system host-name mgmtfw01-a
set groups node0 interfaces fxp0 unit 0 family inet address 192.168.
set groups node1 system host-name mgmtfw01-b
set groups node1 interfaces fxp0 unit 0 family inet address 192.168.
set apply-groups "${node}"
set chassis cluster control-link-recovery
set chassis cluster reth-count 10
set chassis cluster redundancy-group 1 node 0 priority 200
set chassis cluster redundancy-group 1 node 1 priority 100
set chassis cluster redundancy-group 1 interface-monitor ge-0/0/14 weight 128
set chassis cluster redundancy-group 1 interface-monitor ge-0/0/15 weight 128
set chassis cluster redundancy-group 1 interface-monitor ge-5/0/15 weight 128
set chassis cluster redundancy-group 1 interface-monitor ge-5/0/14 weight 128
set chassis cluster redundancy-group 2 node 0 priority 200
set chassis cluster redundancy-group 2 node 1 priority 100
set chassis cluster redundancy-group 2 interface-monitor ge-0/0/13 weight 255
set chassis cluster redundancy-group 2 interface-monitor ge-5/0/13 weight 255
set chassis cluster redundancy-group 0 node 0 priority 200
set chassis cluster redundancy-group 0 node 1 priority 100
set chassis cluster redundancy-group 3 node 0 priority 200
set chassis cluster redundancy-group 3 node 1 priority 100
set chassis cluster redundancy-group 3 interface-monitor ge-0/0/11 weight 128
set chassis cluster redundancy-group 3 interface-monitor ge-0/0/12 weight 128
set chassis cluster redundancy-group 3 interface-monitor ge-5/0/11 weight 128
set chassis cluster redundancy-group 3 interface-monitor ge-5/0/12 weight 128
set chassis cluster redundancy-group 4 node 0 priority 200
set chassis cluster redundancy-group 4 node 1 priority 100
set chassis cluster redundancy-group 4 interface-monitor ge-0/0/10 weight 255
set chassis cluster redundancy-group 4 interface-monitor ge-5/0/10 weight 255
set interfaces ge-0/0/10 gigether-options redundant-parent reth4
set interfaces ge-0/0/11 gigether-options redundant-parent reth3
set interfaces ge-0/0/12 gigether-options redundant-parent reth3
set interfaces ge-0/0/13 gigether-options redundant-parent reth2
set interfaces ge-0/0/14 gigether-options redundant-parent reth1
set interfaces ge-0/0/15 gigether-options redundant-parent reth1
set interfaces ge-5/0/10 gigether-options redundant-parent reth4
set interfaces ge-5/0/11 gigether-options redundant-parent reth3
set interfaces ge-5/0/12 gigether-options redundant-parent reth3
set interfaces ge-5/0/13 gigether-options redundant-parent reth2
set interfaces ge-5/0/14 gigether-options redundant-parent reth1
set interfaces ge-5/0/15 gigether-options redundant-parent reth1
set interfaces fab0 fabric-options member-interfaces ge-0/0/2
set interfaces fab0 fabric-options member-interfaces ge-0/0/3
set interfaces fab1 fabric-options member-interfaces ge-5/0/2
set interfaces fab1 fabric-options member-interfaces ge-5/0/3
set interfaces reth1 vlan-tagging
set interfaces reth1 gratuitous-arp-reply
set interfaces reth1 redundant-ether-options redundancy-group 1
set interfaces reth1 redundant-ether-options minimum-links 1
set interfaces reth1 redundant-ether-options lacp active
set interfaces reth1 redundant-ether-options lacp periodic slow
set interfaces reth2 gratuitous-arp-reply
set interfaces reth2 redundant-ether-options redundancy-group 2
set interfaces reth2 unit 0 description 
set interfaces reth2 unit 0 family inet address 
set interfaces reth3 gratuitous-arp-reply
set interfaces reth3 redundant-ether-options redundancy-group 3
set interfaces reth3 redundant-ether-options minimum-links 1
set interfaces reth3 redundant-ether-options lacp active
set interfaces reth3 redundant-ether-options lacp periodic slow
set interfaces reth3 unit 0 description ****
set interfaces reth3 unit 0 family inet address 
set interfaces reth4 vlan-tagging
set interfaces reth4 gratuitous-arp-reply
set interfaces reth4 redundant-ether-options redundancy-group 4

Thanks!

 

 

 


Re: SRX 240 HA cluster lost its secondary unit

‎12-10-2018 03:52 AM

Please share the output of the commands below, taken from the secondary node while it is in the problem state:

show version

show chassis alarms

show system core-dumps

show chassis routing-engine | no-more

show chassis fpc pic-status

show chassis fpc detail | no-more

show chassis cluster status

show chassis cluster interfaces | no-more

show chassis cluster information detail | no-more

 

 

Thanks,
Nellikka
JNCIE x3 (SEC #321; SP #2839; ENT #790)

Re: SRX 240 HA cluster lost its secondary unit

‎12-10-2018 04:01 AM

Thanks for the reply. Here are the requested outputs.

 

me@mgmtfw01-b> show version
node0:
--------------------------------------------------------------------------
Hostname: mgmtfw01-a
Model: srx240h2
JUNOS Software Release [12.3X48-D45.6]

node1:
--------------------------------------------------------------------------
Hostname: mgmtfw01-b
Model: srx240h2
JUNOS Software Release [12.3X48-D45.6]

{secondary:node1}
me@mgmtfw01-b> show chassis alarms
node0:
--------------------------------------------------------------------------
No alarms currently active

node1:
--------------------------------------------------------------------------
No alarms currently active

{secondary:node1}
me@mgmtfw01-b> show system core-dumps
node0:
--------------------------------------------------------------------------
/var/crash/*core*: No such file or directory
/var/tmp/*core*: No such file or directory
/var/tmp/pics/*core*: No such file or directory
/var/crash/kernel.*: No such file or directory
/tftpboot/corefiles/*core*: No such file or directory

node1:
--------------------------------------------------------------------------
/var/crash/*core*: No such file or directory
/var/tmp/*core*: No such file or directory
/var/tmp/pics/*core*: No such file or directory
/var/crash/kernel.*: No such file or directory
/tftpboot/corefiles/*core*: No such file or directory

{secondary:node1}
me@mgmtfw01-b> show chassis routing-engine | no-more
node0:
--------------------------------------------------------------------------
Routing Engine status:
    Temperature                 39 degrees C / 102 degrees F
    CPU temperature             39 degrees C / 102 degrees F
    Total memory              2048 MB Max  1126 MB used ( 55 percent)
      Control plane memory    1072 MB Max   557 MB used ( 52 percent)
      Data plane memory        976 MB Max   566 MB used ( 58 percent)
    CPU utilization:
      User                      91 percent
      Background                 0 percent
      Kernel                     9 percent
      Interrupt                  0 percent
      Idle                       1 percent
    Model                          RE-SRX240H2
    Serial ID                      ACMX4357
    Start time                     2018-10-11 11:15:51 CEST
    Uptime                         60 days, 2 hours, 42 minutes, 5 seconds
    Last reboot reason             Router rebooted after a normal shutdown.
    Load averages:                 1 minute   5 minute  15 minute
                                       1.01       1.15       1.19

node1:
--------------------------------------------------------------------------
Routing Engine status:
    Temperature                 40 degrees C / 104 degrees F
    CPU temperature             39 degrees C / 102 degrees F
    Total memory              2048 MB Max   389 MB used ( 19 percent)
      Control plane memory    1072 MB Max   386 MB used ( 36 percent)
      Data plane memory        976 MB Max     0 MB used (  0 percent)
    CPU utilization:
      User                      20 percent
      Background                 0 percent
      Kernel                    74 percent
      Interrupt                  0 percent
      Idle                       6 percent
    Model                          RE-SRX240H2
    Serial ID                      ACMZ8364
    Start time                     2018-12-10 10:21:21 CET
    Uptime                         2 hours, 13 minutes, 45 seconds
    Last reboot reason             Router rebooted after a normal shutdown.
    Load averages:                 1 minute   5 minute  15 minute
                                       1.46       1.83       1.83

{secondary:node1}
me@mgmtfw01-b> show chassis fpc pic-status
node0:
--------------------------------------------------------------------------
Slot 0   Online       FPC
  PIC 0  Online       16x GE Base PIC

node1:
--------------------------------------------------------------------------
Slot 0   Present      FPC

{secondary:node1}
me@mgmtfw01-b> show chassis fpc detail | no-more
node0:
--------------------------------------------------------------------------
Slot 0 information:
  State                               Online
  Total CPU DRAM                      ---- CPU less FPC ----
  Start time                          2018-12-05 09:46:59 CET
  Uptime                              5 days, 3 hours, 11 minutes, 23 seconds

node1:
--------------------------------------------------------------------------
Slot 0 information:
  State                               Present
  Total CPU DRAM                      ---- CPU less FPC ----

{secondary:node1}
me@mgmtfw01-b> show chassis cluster status
Monitor Failure codes:
    CS  Cold Sync monitoring        FL  Fabric Connection monitoring
    GR  GRES monitoring             HW  Hardware monitoring
    IF  Interface monitoring        IP  IP monitoring
    LB  Loopback monitoring         MB  Mbuf monitoring
    NH  Nexthop monitoring          NP  NPC monitoring
    SP  SPU monitoring              SM  Schedule monitoring
    CF  Config Sync monitoring

Cluster ID: 1
Node   Priority Status         Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 0
node0  200      primary        no      no       None
node1  0        secondary      no      no       CF

Redundancy group: 1 , Failover count: 0
node0  0        primary        no      no       CS
node1  0        secondary      no      no       IF CS CF

Redundancy group: 2 , Failover count: 0
node0  0        primary        no      no       CS
node1  0        secondary      no      no       IF CS CF

Redundancy group: 3 , Failover count: 0
node0  0        primary        no      no       CS
node1  0        secondary      no      no       IF CS CF

Redundancy group: 4 , Failover count: 0
node0  0        primary        no      no       CS
node1  0        secondary      no      no       IF CS CF

{secondary:node1}
me@mgmtfw01-b> show chassis cluster interfaces | no-more
Control link status: Up

Control interfaces:
    Index   Interface   Monitored-Status   Internal-SA
    0       fxp1        Up                 Disabled

Fabric link status: Down

Fabric interfaces:
    Name    Child-interface    Status
                               (Physical/Monitored)
    fab0
    fab0
    fab1
    fab1

Redundant-pseudo-interface Information:
    Name         Status      Redundancy-group
    lo0          Up          0

Interface Monitoring:
    Interface         Weight    Status    Redundancy-group
    ge-5/0/14         128       Down      1
    ge-5/0/15         128       Down      1
    ge-0/0/15         128       Down      1
    ge-0/0/14         128       Down      1
    ge-5/0/13         255       Down      2
    ge-0/0/13         255       Down      2
    ge-5/0/12         128       Down      3
    ge-5/0/11         128       Down      3
    ge-0/0/12         128       Down      3
    ge-0/0/11         128       Down      3
    ge-5/0/10         255       Down      4
    ge-0/0/10         255       Down      4

{secondary:node1}
me@mgmtfw01-b> show chassis cluster information detail | no-more
node0:
--------------------------------------------------------------------------
Redundancy mode:
    Configured mode: active-active
    Operational mode: active-active
Cluster configuration:
    Heartbeat interval: 1000 ms
    Heartbeat threshold: 3
    Control link recovery: Enabled
    Fabric link down timeout: 66 sec
Node health information:
    Local node health: Not healthy
    Remote node health: Not healthy

Redundancy group: 0, Threshold: 255, Monitoring failures: none
    Events:
        Oct 11 11:15:13.709 : hold->secondary, reason: Hold timer expired
        Oct 25 15:39:14.955 : secondary->primary, reason: Only node present
        Dec  5 09:39:39.985 : primary->secondary-hold, reason: Manual failover
        Dec  5 09:39:49.702 : secondary-hold->primary, reason: Only node present
        Dec  5 09:41:40.678 : primary->secondary-hold, reason: Manual failover
        Dec  5 09:42:05.439 : secondary-hold->primary, reason: Only node present
        Dec  5 09:43:12.740 : primary->secondary-hold, reason: Manual failover
        Dec  5 09:43:38.498 : secondary-hold->primary, reason: Only node present
        Dec  5 09:45:16.073 : primary->secondary-hold, reason: Manual failover
        Dec  5 09:45:42.458 : secondary-hold->primary, reason: Only node present

Redundancy group: 1, Threshold: 0, Monitoring failures: cold-sync-monitoring
    Events:
        Oct 11 11:15:13.773 : hold->secondary, reason: Hold timer expired
        Oct 25 15:39:14.907 : secondary->ineligible, reason: Fabric link down
        Oct 25 15:39:15.106 : ineligible->primary, reason: Only node present

Redundancy group: 2, Threshold: 0, Monitoring failures: cold-sync-monitoring
    Events:
        Oct 11 11:15:13.812 : hold->secondary, reason: Hold timer expired
        Oct 25 15:39:14.911 : secondary->ineligible, reason: Fabric link down
        Oct 25 15:39:15.138 : ineligible->primary, reason: Only node present

Redundancy group: 3, Threshold: 0, Monitoring failures: cold-sync-monitoring
    Events:
        Oct 11 11:15:13.849 : hold->secondary, reason: Hold timer expired
        Oct 25 15:39:14.916 : secondary->ineligible, reason: Fabric link down
        Oct 25 15:39:15.142 : ineligible->primary, reason: Only node present

Redundancy group: 4, Threshold: 0, Monitoring failures: cold-sync-monitoring
    Events:
        Oct 11 11:15:13.888 : hold->secondary, reason: Hold timer expired
        Oct 11 17:32:32.836 : secondary->primary, reason: Remote is in secondary hold
        Oct 25 15:39:14.917 : primary->ineligible, reason: Fabric link down
        Oct 25 15:39:15.169 : ineligible->primary, reason: Only node present
Control link statistics:
    Control link 0:
        Heartbeat packets sent: 5185608
        Heartbeat packets received: 4956247
        Heartbeat packet errors: 0
        Duplicate heartbeat packets received: 0
    Control recovery packet count: 0
    Sequence number of last heartbeat packet sent: 5185628
    Sequence number of last heartbeat packet received: 5231
Fabric link statistics:
    Child link 0
        Probes sent: 886973
        Probes received: 0
    Child link 1
        Probes sent: 533133
        Probes received: 0
Switch fabric link statistics:
    Probe state : DOWN
    Probes sent: 0
    Probes received: 0
    Probe recv errors: 0
    Probe send errors: 0
    Probe recv dropped: 0
    Sequence number of last probe sent: 0
    Sequence number of last probe received: 0

Chassis cluster LED information:
    Current LED color: Amber
    Last LED change reason: Monitored objects are down
Control port tagging:
    Disabled

Cold Synchronization:
    Status:
        Cold synchronization completed for: N/A
        Cold synchronization failed for: N/A
        Cold synchronization not known for: N/A
        Current Monitoring Weight: 255

    Progress:
        CS Prereq               0 of 1 SPUs completed
           1. if_state sync          1 SPUs completed
           2. fabric link            0 SPUs completed
           3. policy data sync       1 SPUs completed
           4. cp ready               0 SPUs completed
           5. VPN data sync          0 SPUs completed
           6. Dynamic addr sync      0 SPUs completed
        CS RTO sync             0 of 1 SPUs completed
        CS Postreq              0 of 1 SPUs completed

    Statistics:
        Number of cold synchronization completed: 0
        Number of cold synchronization failed: 0

    Events:
        Oct 11 11:17:27.928 : Cold sync for PFE  is RTO sync in process
        Oct 11 11:17:27.929 : Cold sync for PFE  is Post-req check in process
        Oct 11 11:17:27.936 : Cold sync for PFE  is Completed
        Dec  5 09:54:17.411 : Cold sync for PFE  is Not complete

Loopback Information:

    PIC Name        Loopback        Nexthop     Mbuf
    -------------------------------------------------
                    Success         Success     Success

Interface monitoring:
    Statistics:
        Monitored interface failure count: 303

    Events:
        Dec  6 12:50:19.901 : Interface ge-0/0/14 monitored by rg 1, changed state from Down to Up
        Dec  6 12:50:22.364 : Interface ge-0/0/15 monitored by rg 1, changed state from Down to Up
        Dec  6 12:50:36.740 : Interface ge-0/0/11 monitored by rg 3, changed state from Up to Down
        Dec  6 12:50:36.855 : Interface ge-0/0/12 monitored by rg 3, changed state from Up to Down
        Dec  6 12:50:39.944 : Interface ge-0/0/11 monitored by rg 3, changed state from Down to Up
        Dec  6 12:50:40.046 : Interface ge-0/0/12 monitored by rg 3, changed state from Down to Up
        Dec  6 12:50:44.296 : Interface ge-0/0/14 monitored by rg 1, changed state from Up to Down
        Dec  6 12:50:46.808 : Interface ge-0/0/15 monitored by rg 1, changed state from Up to Down
        Dec  6 12:50:48.643 : Interface ge-0/0/14 monitored by rg 1, changed state from Down to Up
        Dec  6 12:50:49.966 : Interface ge-0/0/15 monitored by rg 1, changed state from Down to Up

Fabric monitoring:
    Status:
        Fabric Monitoring: Enabled
        Activation status: Suspended by local node and other node
        Fabric Status reported by data plane: Down
        JSRPD internal fabric status: Down

Fabric link events:
        Dec 10 12:54:42.405 : Fabric link fab1 is down
        Dec 10 12:54:42.429 : Fabric link fab1 is down
        Dec 10 12:54:42.450 : Fabric link fab1 is deleted
        Dec 10 12:54:42.488 : Fabric link fab0 is up
        Dec 10 12:55:39.375 : Fabric link fab1 is down
        Dec 10 12:55:39.412 : Fabric link fab1 is down
        Dec 10 12:55:39.465 : Fabric link fab1 is down
        Dec 10 12:55:39.510 : Fabric link fab0 is up
        Dec 10 12:55:39.549 : Fabric link fab1 is down
        Dec 10 12:55:39.575 : Fabric link fab1 is down

Control link status: Up
    Server information:
        Server status : Connected
        Server connected to 130.16.0.1/52793
    Client information:
        Client status : Inactive
        Client connected to None
Control port tagging:
    Disabled

Control link events:
        Dec 10 12:43:01.319 : Control link up, link status timer
        Dec 10 12:43:33.079 : Control link fxp1 is up
        Dec 10 12:48:27.022 : Control link down, link status timer
        Dec 10 12:48:39.021 : Control link fxp1 is up
        Dec 10 12:49:04.788 : Control link up, link status timer
        Dec 10 12:49:36.445 : Control link fxp1 is up
        Dec 10 12:54:30.529 : Control link down, link status timer
        Dec 10 12:54:42.497 : Control link fxp1 is up
        Dec 10 12:55:08.238 : Control link up, link status timer
        Dec 10 12:55:39.522 : Control link fxp1 is up

Hardware monitoring:
    Status:
        Activation status: Enabled
        Redundancy group 0 failover for hardware faults: Enabled
        Hardware redundancy group 0 errors: 0
        Hardware redundancy group 1 errors: 0

Schedule monitoring:
    Status:
        Activation status: Disabled
        Schedule slip detected: None
        Timer ignored: No

    Statistics:
        Total slip detected count: 31
        Longest slip duration: 25(s)

    Events:
        Dec  7 10:57:17.079 : Detected schedule slip
        Dec  7 10:58:17.170 : Cleared schedule slip
        Dec  7 12:23:33.217 : Detected schedule slip
        Dec  7 12:24:33.330 : Cleared schedule slip
        Dec  8 07:07:46.401 : Detected schedule slip
        Dec  8 07:08:46.859 : Cleared schedule slip
        Dec  8 11:52:31.165 : Detected schedule slip
        Dec  8 11:53:31.260 : Cleared schedule slip
        Dec  9 03:51:24.219 : Detected schedule slip
        Dec  9 03:52:24.317 : Cleared schedule slip

Configuration Synchronization:
    Status:
        Activation status: Enabled
        Last sync operation: Auto-Sync
        Last sync result: Succeeded
        Last sync mgd messages:
            mgd: rcp: /config/juniper.conf: No such file or directory
            Non-existant dump device /dev/bo0s1b
            mgd: commit complete

    Events:
        Oct 11 11:15:35.406 : Auto-Sync: In progress. Attempt: 1
        Oct 11 11:18:22.218 : Auto-Sync: Clearing mgd. Attempt: 1
        Oct 11 11:18:31.062 : Auto-Sync: Succeeded. Attempt: 1

Cold Synchronization Progress:
    CS Prereq               0 of 1 SPUs completed
       1. if_state sync          1 SPUs completed
       2. fabric link            0 SPUs completed
       3. policy data sync       1 SPUs completed
       4. cp ready               0 SPUs completed
       5. VPN data sync          0 SPUs completed
       6. Dynamic addr sync      0 SPUs completed
    CS RTO sync             0 of 1 SPUs completed
    CS Postreq              0 of 1 SPUs completed

 Command history:
        Dec  5 09:44:15.890 : Manual failover of RG-0 to node0
        Dec  5 09:44:28.871 : Manual failover reset of RG-0
        Dec  5 09:44:33.187 : Manual failover of RG-0 to node0
        Dec  5 09:45:02.494 : Manual failover of RG-0 to node0
        Dec  5 09:45:38.155 : Manual failover reset of RG-0
        Dec  5 09:45:52.176 : Manual failover of RG-0 to node0
        Dec  5 15:28:26.029 : Manual failover reset of RG-4
        Dec  5 15:28:39.491 : Manual failover reset of RG-3

node1:
--------------------------------------------------------------------------
Redundancy mode:
    Configured mode: active-active
    Operational mode: unknown
Cluster configuration:
    Heartbeat interval: 1000 ms
    Heartbeat threshold: 3
    Control link recovery: Enabled
    Fabric link down timeout: 66 sec
Node health information:
    Local node health: Not healthy
    Remote node health: Not healthy

Redundancy group: 0, Threshold: 0, Monitoring failures: config-sync-monitoring
    Events:
        Dec 10 10:24:57.009 : hold->secondary, reason: Hold timer expired

Redundancy group: 1, Threshold: -511, Monitoring failures: interface-monitoring, cold-sync-monitoring, config-sync-monitoring
    Events:
        Dec 10 10:24:57.102 : hold->secondary, reason: Hold timer expired

Redundancy group: 2, Threshold: -510, Monitoring failures: interface-monitoring, cold-sync-monitoring, config-sync-monitoring
    Events:
        Dec 10 10:24:57.585 : hold->secondary, reason: Hold timer expired

Redundancy group: 3, Threshold: -511, Monitoring failures: interface-monitoring, cold-sync-monitoring, config-sync-monitoring
    Events:
        Dec 10 10:24:57.622 : hold->secondary, reason: Hold timer expired

Redundancy group: 4, Threshold: -510, Monitoring failures: interface-monitoring, cold-sync-monitoring, config-sync-monitoring
    Events:
        Dec 10 10:24:57.671 : hold->secondary, reason: Hold timer expired
Control link statistics:
    Control link 0:
        Heartbeat packets sent: 5211
        Heartbeat packets received: 4757
        Heartbeat packet errors: 0
        Duplicate heartbeat packets received: 0
    Control recovery packet count: 0
    Sequence number of last heartbeat packet sent: 5237
    Sequence number of last heartbeat packet received: 5185634
Fabric link statistics:
    Child link 0
        Probes sent: 0
        Probes received: 0
    Child link 1
        Probes sent: 0
        Probes received: 0
Switch fabric link statistics:
    Probe state : DOWN
    Probes sent: 0
    Probes received: 0
    Probe recv errors: 0
    Probe send errors: 0
    Probe recv dropped: 0
    Sequence number of last probe sent: 0
    Sequence number of last probe received: 0

Chassis cluster LED information:
    Current LED color: Amber
    Last LED change reason: Monitored objects are down
Control port tagging:
    Disabled

Cold Synchronization:
    Status:
        Cold synchronization completed for: N/A
        Cold synchronization failed for: N/A
        Cold synchronization not known for: N/A
        Current Monitoring Weight: 255

    Progress:
        CS Prereq               0 of 1 SPUs completed
           1. if_state sync          0 SPUs completed
           2. fabric link            0 SPUs completed
           3. policy data sync       0 SPUs completed
           4. cp ready               0 SPUs completed
           5. VPN data sync          0 SPUs completed
           6. Dynamic addr sync      0 SPUs completed
        CS RTO sync             0 of 1 SPUs completed
        CS Postreq              0 of 1 SPUs completed

    Statistics:
        Number of cold synchronization completed: 0
        Number of cold synchronization failed: 0

Loopback Information:

    PIC Name        Loopback        Nexthop     Mbuf
    -------------------------------------------------
                    Success         Success     Success

Interface monitoring:
    Statistics:
        Monitored interface failure count: 0

Fabric monitoring:
    Status:
        Fabric Monitoring: Enabled
        Activation status: Suspended by local node and other node
        Fabric Status reported by data plane: Down
        JSRPD internal fabric status: Down

Fabric link events:
        Dec 10 11:30:58.765 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
        Dec 10 11:31:06.997 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
        Dec 10 11:31:13.229 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
        Dec 10 11:37:02.814 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
        Dec 10 11:37:10.999 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
        Dec 10 11:43:05.846 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
        Dec 10 11:43:14.009 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
        Dec 10 11:49:09.882 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
        Dec 10 11:49:18.072 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
        Dec 10 12:34:13.129 : Fabric monitoring is suspended due to USPIPC CONNECTION failure

Control link status: Up
    Server information:
        Server status : Inactive
        Server connected to None
    Client information:
        Client status : Connected
        Client connected to 129.16.0.1/62845
Control port tagging:
    Disabled

Control link events:
        Dec 10 11:37:43.889 : Control link up, link status timer
        Dec 10 11:43:05.929 : Control link fxp1 is down
        Dec 10 11:43:05.929 : Control link down, flowd is down
        Dec 10 11:43:14.746 : Control link fxp1 is up
        Dec 10 11:43:47.390 : Control link up, link status timer
        Dec 10 11:49:09.964 : Control link fxp1 is down
        Dec 10 11:49:09.965 : Control link down, flowd is down
        Dec 10 11:49:19.221 : Control link fxp1 is up
        Dec 10 11:49:50.982 : Control link up, link status timer
        Dec 10 12:34:13.119 : Control link fxp1 is up

Hardware monitoring:
    Status:
        Activation status: Enabled
        Redundancy group 0 failover for hardware faults: Enabled
        Hardware redundancy group 0 errors: 0
        Hardware redundancy group 1 errors: 0

Schedule monitoring:
    Status:
        Activation status: Disabled
        Schedule slip detected: None
        Timer ignored: No

    Statistics:
        Total slip detected count: 16
        Longest slip duration: 2578(s)

    Events:
        Dec 10 11:31:11.706 : Detected schedule slip
        Dec 10 11:32:13.095 : Cleared schedule slip
        Dec 10 11:37:15.779 : Detected schedule slip
        Dec 10 11:38:17.021 : Cleared schedule slip
        Dec 10 11:43:17.973 : Detected schedule slip
        Dec 10 11:44:19.238 : Cleared schedule slip
        Dec 10 11:49:22.655 : Detected schedule slip
        Dec 10 11:50:24.368 : Cleared schedule slip
        Dec 10 12:34:13.090 : Detected schedule slip
        Dec 10 12:35:13.349 : Cleared schedule slip

Configuration Synchronization:
    Status:
        Activation status: Enabled
        Last sync operation: Auto-Sync
        Last sync result: In progress
        Last sync mgd messages:
            mgd: rcp: /config/juniper.conf: No such file or directory

    Events:
        Dec 10 10:25:23.643 : Auto-Sync: In progress. Attempt: 1
        Dec 10 12:34:13.078 : Auto-Sync: Retry needed. Attempt: 1
        Dec 10 12:34:18.930 : Auto-Sync: In progress. Attempt: 2

Cold Synchronization Progress:
    CS Prereq               0 of 1 SPUs completed
       1. if_state sync          0 SPUs completed
       2. fabric link            0 SPUs completed
       3. policy data sync       0 SPUs completed
       4. cp ready               0 SPUs completed
       5. VPN data sync          0 SPUs completed
       6. Dynamic addr sync      0 SPUs completed
    CS RTO sync             0 of 1 SPUs completed
    CS Postreq              0 of 1 SPUs completed

{secondary:node1}

Re: SRX 240 HA cluster lost its secondary unit

‎12-10-2018 04:11 AM

The RE CPU utilization is more than 95% on both nodes, and the FPC is not online on node1. You may have to check for the cause of the high RE CPU.

Please share the output:

show security flow status

show system processes extensive | no-more

Thanks,
Nellikka
JNCIE x3 (SEC #321; SP #2839; ENT #790)

Re: SRX 240 HA cluster lost its secondary unit

‎12-10-2018 07:14 AM

Hi,

 

The outputs as requested:

 

me@mgmtfw01-b> show security flow status
node0:
--------------------------------------------------------------------------
  Flow forwarding mode:
    Inet forwarding mode: flow based
    Inet6 forwarding mode: drop
    MPLS forwarding mode: drop
    ISO forwarding mode: drop
    Enhanced route scaling mode: Disabled
  Flow trace status
    Flow tracing status: off
  Flow session distribution
    Distribution mode: RR-based
  Flow ipsec performance acceleration: off
  Flow packet ordering
    Ordering mode: Hardware

node1:
--------------------------------------------------------------------------
  Flow forwarding mode:
    Inet forwarding mode: none (reboot needed to change to flow based)
    Inet6 forwarding mode: drop
    MPLS forwarding mode: none (reboot needed to change to drop)
    ISO forwarding mode: drop
    Enhanced route scaling mode: Disabled
  Flow trace status
    Flow tracing status: off
  Flow session distribution
    Distribution mode: RR-based
  Flow ipsec performance acceleration: off
  Flow packet ordering
    Ordering mode: Hardware

{secondary:node1}
me@mgmtfw01-b> show system processes extensive | no-more
node0:
--------------------------------------------------------------------------
last pid: 41674;  load averages:  1.12,  1.19,  1.16  up 60+05:57:17    16:12:38
149 processes: 18 running, 118 sleeping, 1 zombie, 12 waiting

Mem: 228M Active, 135M Inact, 1089M Wired, 255M Cache, 112M Buf, 266M Free
Swap:


  PID USERNAME     THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
19056 root           7  76    0  1026M 86948K select 0 398.5H 290.09% flowd_octeon_hm
19903 root           1 139    0 15080K  5644K RUN    0  96.6H 74.41% eventd
   22 root           1 171   52     0K    16K RUN    0 1059.6  0.00% idle: cpu0
   23 root           1 -20 -139     0K    16K WAIT   0 752:25  0.00% swi7: clock
   25 root           1 -40 -159     0K    16K WAIT   0 552:54  0.00% swi2: netisr 0
 1715 root           1  76    0 15556K  6692K select 0 430:29  0.00% rtlogd
 1720 root           1  76    0 14412K  6252K select 0 337:01  0.00% license-check
    5 root           1 -16    0     0K    16K rtfifo 0 323:41  0.00% rtfifo_kern_recv
 1710 root           1  76    0 17828K  5576K select 0 108:14  0.00% shm-rtsdbd
 1711 root           1  76    0 16108K  7900K select 0  74:01  0.00% jsrpd
   26 root           1 -16    0     0K    16K -      0  49:23  0.00% yarrow
 1696 root           1  76    0  3348K  1428K select 0  41:55  0.00% bslockd
   52 root           1 -16    0     0K    16K psleep 0  41:33  0.00% vmkmemdaemon
19046 root           1  76    0 25896K 16992K select 0  34:19  0.00% snmpd
 1716 root           1  76    0 19720K  8692K select 0  33:12  0.00% utmd
 1719 root           3  76    0 16608K  5484K select 0  23:03  0.00% wland
19042 root           1  76    0 33300K 14248K select 0  22:59  0.00% mib2d
22605 root           1   4    0     0K    16K proxy_ 0  22:55  0.00% peerproxy02100001
    2 root           1  -8    0     0K    16K -      0  15:40  0.00% g_event
   19 root           1 171   52     0K    16K RUN    3  15:38  0.00% idle: cpu3
   20 root           1 171   52     0K    16K RUN    2  15:29  0.00% idle: cpu2
   42 root           1  20    0     0K    16K syncer 0  15:05  0.00% syncer
 1807 root           1  76    0     0K    16K select 0  13:26  0.00% peerproxy01100001
19022 root           1  76    0 22580K 10208K select 0  13:07  0.00% l2ald
   43 root           1  20    0     0K    16K vnlrum 0  13:03  0.00% vnlru_mem
    3 root           1  -8    0     0K    16K -      0  12:39  0.00% g_up
    4 root           1  -8    0     0K    16K -      0  12:35  0.00% g_down
19051 root           1  76    0 21264K  6344K select 0  11:03  0.00% bdbrepd
 1718 root           1  76    0  7700K  5592K select 0  10:43  0.00% ntpd
19054 root           1  76    0   129M 20648K select 0  10:03  0.00% chassisd
19047 root           1  76    0 30068K 10676K select 0   7:35  0.00% pfed
 1808 root           1   4    0     0K    16K proxy_ 0   7:07  0.00% peerproxy02100001
 1695 root           1  76    0  2320K   916K select 0   7:05  0.00% watchdog
19024 root           1  76    0 24588K 10840K select 0   6:08  0.00% cosd
 1714 root           1  76    0 43256K  9944K select 0   5:53  0.00% idpd
   21 root           1 171   52     0K    16K RUN    1   5:13  0.00% idle: cpu1
 1751 root           1  76    0 15280K  6956K select 0   5:03  0.00% bfdd
 1701 root           1  76    0 14940K  5384K select 0   4:32  0.00% craftd
19108 root           8   8    0 82796K  7616K nanslp 0   4:26  0.00% ipfd
19038 root           1   8    0 29848K  5440K nanslp 0   4:12  0.00% wmic
19045 root           1  76    0 14348K  5824K select 0   4:07  0.00% alarmd
19040 root           1   4    0 11500K  5248K kqread 0   3:32  0.00% mcsnoopd
19021 root           1   4    0 57076K 27744K kqread 0   2:50  0.00% rpd
19043 root           1  76    0 31388K 12424K select 0   2:46  0.00% kmd
18987 root           1  76    0 17880K  8568K select 0   2:23  0.00% ppmd
19023 root           1  76    0 15768K  7544K select 0   2:23  0.00% rmopd
19044 root           1  76    0 39992K  9772K select 0   2:16  0.00% dcd
   45 root           1 -16    0     0K    16K sdflus 0   2:03  0.00% softdepflush
19052 root           1  76    0 29968K 15008K select 0   2:02  0.00% nsd
   40 root           1 171   52     0K    16K pgzero 0   1:49  0.00% pagezero
19053 root           1  76    0 27076K 10292K select 0   1:45  0.00% smid
19097 nobody         1   4    0 10496K  1484K kqread 0   1:35  0.00% webapid
19049 root           1  76    0  9196K  3736K select 0   1:34  0.00% irsd
   32 root           1   8    0     0K    16K dwcint 0   1:26  0.00% dwc0
   44 root           1  -4    0     0K    16K vlruwt 0   1:25  0.00% vnlru
18989 root           1  76    0 17836K  7872K select 0   1:24  0.00% lacpd
   41 root           1 -16    0     0K    16K psleep 0   1:20  0.00% bufdaemon
   50 root           1 -16    0     0K    16K psleep 0   1:16  0.00% vmuncachedaemon
19032 root           1  76    0 16820K  6728K select 0   1:02  0.00% pkid
 1702 root           1  76    0 46924K 25700K select 0   0:53  0.00% mgd
 1705 root           1  76    0  6936K  2148K select 0   0:49  0.00% inetd
 1437 root           1   8    0  2720K  1160K nanslp 0   0:44  0.00% cron
   30 root           1 -28 -147     0K    16K WAIT   0   0:30  0.00% swi5: cambio
   82 root           1  -8    0     0K    16K mdwait 0   0:25  0.00% md1
19101 nobody         1  76    0 10884K  5080K select 0   0:25  0.00% httpd
19048 root           1  76    0 27668K 10524K select 0   0:22  0.00% dfwd
    6 root           1   8    0     0K    16K -      0   0:20  0.00% kqueue taskq
19034 root           1  76    0 24048K  8376K select 0   0:17  0.00% smihelperd
    9 root           1 -16    0     0K    16K psleep 0   0:17  0.00% pagedaemon
40859 root           1  77    0 56296K 23532K select 0   0:15  0.00% mgd
   47 root           1  -8    0     0K    16K select 0   0:13  0.00% if_pfe_listen
19036 root           3  79    0 18940K  6292K ucond  0   0:11  0.00% syshmd
19025 root           1  76    0 10264K  3812K select 0   0:10  0.00% pppd
   37 root           1 -36 -155     0K    16K WAIT   0   0:10  0.00% swi3: ip6opt ipopt
40840 root           1  76    0 10708K  3676K select 0   0:07  0.00% sshd
  396 root           1  -8    0     0K    16K mdwait 0   0:07  0.00% md2
40858 root           1  76    0 55636K 19600K select 0   0:07  0.00% cli
19033 root           1  76    0 15276K  5760K select 0   0:06  0.00% httpd-gk
    1 root           1   8    0  1596K   892K wait   0   0:06  0.00% init
19035 root           1  76    0 12336K  4176K select 0   0:05  0.00% nstraced
19037 root           1  76    0  9656K  2964K select 0   0:04  0.00% smtpd
19901 root           1  76    0  3332K  1308K select 0   0:01  0.00% usbd
   36 root           1   8    0     0K    16K usbevt 0   0:01  0.00% usb1
 1697 root           1  77    0  3656K  1536K select 0   0:01  0.00% tnetd
   33 root           1   8    0     0K    16K usbevt 0   0:01  0.00% usb0
 1713 root           1  76    0 21684K  6448K select 0   0:01  0.00% appsecured
 1712 root           2   4    0 21172K  5736K select 0   0:01  0.00% appidd
 1721 root           1   4    0 11848K  3308K select 0   0:01  0.00% sdxd
19031 root           1  82    0 16784K  5684K select 0   0:00  0.00% wwand
19030 root           1  83    0 12644K  3884K select 0   0:00  0.00% sendd
19028 root           1  77    0 11336K  3508K select 0   0:00  0.00% oamd
   31 root           1 -48 -167     0K    16K WAIT   0   0:00  0.00% swi0: uart
19039 root           1  20    0 10244K  3712K pause  0   0:00  0.00% webapid
 1547 root           1  -8    0     0K    16K mdwait 0   0:00  0.00% md4
19029 root           1  93    0 12132K  3236K select 0   0:00  0.00% mplsoamd
19050 root           1  76    0  9548K  2936K select 0   0:00  0.00% relayd
   59 root           1  -8    0     0K    16K mdwait 0   0:00  0.00% md0
40844 root           1  20    0  5056K  3116K pause  0   0:00  0.00% csh
    7 root           1   8    0     0K    16K -      0   0:00  0.00% thread taskq
19849 root           1   5    0  4556K  1832K ttyin  0   0:00  0.00% login
41674 root           1  79    0 24628K  2124K CPU0   0   0:00  0.00% top
41673 root           1  77    0 46956K  4720K select 0   0:00  0.00% mgd
 1532 root           1  -8    0     0K    16K mdwait 0   0:00  0.00% md3
    8 root           1   8    0     0K    16K -      0   0:00  0.00% mastership taskq
    0 root           1  -8    0     0K     0K WAIT   0   0:00  0.00% swapper
   49 root           1   4    0     0K    16K purge_ 0   0:00  0.00% kern_pir_proc
   51 root           1  -8    0     0K    16K select 0   0:00  0.00% if_pic_listen0
   54 root           1   8    0     0K    16K -      0   0:00  0.00% nfsiod 0
   53 root           1   4    0     0K    16K dump_r 0   0:00  0.00% kern_dump_proc
   46 root           1  76    0     0K    16K sleep  0   0:00  0.00% netdaemon
   56 root           1   8    0     0K    16K -      0   0:00  0.00% nfsiod 2
   57 root           1   8    0     0K    16K -      0   0:00  0.00% nfsiod 3
   55 root           1   8    0     0K    16K -      0   0:00  0.00% nfsiod 1
   35 root           1   8    0     0K    16K dwcint 0   0:00  0.00% dwc1
   34 root           1   8    0     0K    16K usbtsk 0   0:00  0.00% usbtask
   10 root           1 -16    0     0K    16K ktrace 0   0:00  0.00% ktrace
   28 root           1 -12 -131     0K    16K WAIT   0   0:00  0.00% swi9: +
   29 root           1 -12 -131     0K    16K WAIT   0   0:00  0.00% swi9: task queue
   27 root           1 -16 -135     0K    16K WAIT   0   0:00  0.00% swi8: +
   24 root           1 -24 -143     0K    16K WAIT   0   0:00  0.00% swi6: vm
   38 root           1 -32 -151     0K    16K WAIT   0   0:00  0.00% swi4: ip6mismatch+
   39 root           1 -44 -163     0K    16K WAIT   0   0:00  0.00% swi1: ipfwd
   15 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu7
   14 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu8
   13 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu9
   12 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu10
   11 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu11
   17 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu5
   18 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu4
   16 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu6

node1:
--------------------------------------------------------------------------
last pid:  2746;  load averages:  0.92,  0.97,  0.97  up 0+05:28:55    15:49:46
121 processes: 18 running, 91 sleeping, 1 zombie, 11 waiting

Mem: 140M Active, 90M Inact, 1061M Wired, 227M Cache, 112M Buf, 454M Free
Swap:


  PID USERNAME     THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
 2723 root           7  76    0  1026M 86940K select 0  16:56 367.53% flowd_octeon_hm
 1623 root           1  76    0 15940K  7588K select 0  42:11  0.00% jsrpd
   22 root           1 171   52     0K    16K RUN    0  19:56  0.00% idle: cpu0
   20 root           1 171   52     0K    16K RUN    2  11:15  0.00% idle: cpu2
   19 root           1 171   52     0K    16K RUN    3  11:13  0.00% idle: cpu3
   21 root           1 171   52     0K    16K RUN    1   3:42  0.00% idle: cpu1
 1659 root           1  76    0 22208K  6800K select 0   2:42  0.00% bdbrepd
   23 root           1 -20 -139     0K    16K RUN    0   2:18  0.00% swi7: clock
 1656 root           1  76    0  9832K  3940K select 0   2:10  0.00% ksyncd
 1266 root           1  76    0 15080K  5032K select 0   1:35  0.00% eventd
 1633 root           1  76    0 14412K  6056K select 0   1:17  0.00% license-check
 1660 root           1  76    0 27008K  9624K select 0   1:10  0.00% smid
    5 root           1 -16    0     0K    16K rtfifo 0   1:05  0.00% rtfifo_kern_recv
 1666 root           1  76    0 23224K  8816K select 0   0:41  0.00% pfed
 1667 root           1  76    0 33352K 12076K select 0   0:40  0.00% mib2d
 1630 root           1  76    0 19548K  8280K select 0   0:39  0.00% utmd
   25 root           1 -40 -159     0K    16K WAIT   0   0:29  0.00% swi2: netisr 0
 1622 root           1  76    0 17720K  3656K select 0   0:28  0.00% shm-rtsdbd
  386 root           1  -8    0     0K    16K mdwait 0   0:17  0.00% md2
 1669 root           1  76    0 14344K  5500K select 0   0:16  0.00% alarmd
   82 root           1  -8    0     0K    16K mdwait 0   0:15  0.00% md1
    3 root           1  -8    0     0K    16K -      0   0:12  0.00% g_up
 1608 root           1  76    0  3348K  1392K select 0   0:10  0.00% bslockd
   52 root           1 -16    0     0K    16K psleep 0   0:09  0.00% vmkmemdaemon
 1668 root           1  76    0 19040K  9652K select 0   0:09  0.00% snmpd
 1629 root           1  76    0 15532K  6460K select 0   0:08  0.00% rtlogd
   26 root           1 -16    0     0K    16K -      0   0:07  0.00% yarrow
 1627 root           1  76    0 43268K  9680K select 0   0:07  0.00% idpd
    1 root           1   8    0  1596K   892K wait   0   0:07  0.00% init
 1655 root           1  76    0 15920K  6344K select 0   0:06  0.00% ppmd
    4 root           1  -8    0     0K    16K -      0   0:06  0.00% g_down
 1663 root           4  76    0 81740K  4924K select 0   0:05  0.00% ipfd
 1632 root           3  76    0 16604K  5088K select 0   0:04  0.00% wland
 1664 root           1  76    0  9196K  3428K select 0   0:04  0.00% irsd
    2 root           1  -8    0     0K    16K -      0   0:04  0.00% g_event
   42 root           1  20    0     0K    16K syncer 0   0:04  0.00% syncer
 1670 root           1  76    0 39288K  8508K select 0   0:04  0.00% dcd
   43 root           1  20    0     0K    16K vnlrum 0   0:03  0.00% vnlru_mem
 1614 root           1  76    0 46924K 25700K select 0   0:03  0.00% mgd
 2531 root           1  76    0 50536K 25720K select 0   0:02  0.00% mgd
 1631 root           1  76    0  7640K  4748K select 0   0:02  0.00% ntpd
   40 root           1 171   52     0K    16K pgzero 0   0:02  0.00% pagezero
 1613 root           1  76    0 14936K  5408K select 0   0:02  0.00% craftd
   32 root           1   8    0     0K    16K dwcint 0   0:02  0.00% dwc0
 2727 root           1  76    0   128M 18556K select 0   0:02  0.00% chassisd
 1607 root           1  76    0  2320K   916K select 0   0:02  0.00% watchdog
 1657 root           1  76    0 15216K  6828K select 0   0:01  0.00% bfdd
 1665 root           1  76    0 26572K  8824K select 0   0:01  0.00% dfwd
 1624 root           2  82    0 21216K  5532K select 0   0:01  0.00% appidd
 1625 root           1  76    0 21684K  6260K select 0   0:01  0.00% appsecured
 1662 root           1   8    0 26620K  9936K nanslp 0   0:01  0.00% nsd
   30 root           1 -28 -147     0K    16K WAIT   0   0:01  0.00% swi5: cambio
 1654 root           1   8    0 27344K  7172K nanslp 0   0:01  0.00% kmd
 2737 budaim         1   6    0 55624K 19116K ttywri 0   0:01  0.00% cli
 1658 root           1  76    0 17452K  5352K select 0   0:01  0.00% lacpd
   45 root           1 -16    0     0K    16K sdflus 0   0:00  0.00% softdepflush
   41 root           1 -16    0     0K    16K psleep 0   0:00  0.00% bufdaemon
   44 root           1  -4    0     0K    16K vlruwt 0   0:00  0.00% vnlru
   50 root           1 -16    0     0K    16K psleep 0   0:00  0.00% vmuncachedaemon
 1634 root           1  95    0 11836K  3128K select 0   0:00  0.00% sdxd
 1349 root           1   8    0  2720K  1160K nanslp 0   0:00  0.00% cron
 1661 root           1  89    0  9548K  2912K select 0   0:00  0.00% relayd
 1617 root           1  87    0  6936K  2292K select 0   0:00  0.00% inetd
 2738 root           1  76    0 46976K  7192K select 0   0:00  0.00% mgd
   59 root           1  -8    0     0K    16K mdwait 0   0:00  0.00% md0
 2736 root           1   8    0  4828K  1812K wait   0   0:00  0.00% login
 1239 root           1  76    0  3332K  1288K select 0   0:00  0.00% usbd
   49 root           1   4    0     0K    16K purge_ 0   0:00  0.00% kern_pir_proc
 2533 root           1   4    0  6280K  2052K sbwait 0   0:00  0.00% rcp
 1609 root           1  77    0  3656K  1536K select 0   0:00  0.00% tnetd
    9 root           1 -16    0     0K    16K psleep 0   0:00  0.00% pagedaemon
   31 root           1 -48 -167     0K    16K WAIT   0   0:00  0.00% swi0: uart
 2746 root           1  81    0 24596K  1968K CPU0   0   0:00  0.00% top
 2745 root           1  77    0 46956K  4680K select 0   0:00  0.00% mgd
    7 root           1   8    0     0K    16K -      0   0:00  0.00% thread taskq
 1444 root           1  -8    0     0K    16K mdwait 0   0:00  0.00% md3
   36 root           1   8    0     0K    16K usbevt 0   0:00  0.00% usb1
    0 root           1  -8    0     0K     0K WAIT   0   0:00  0.00% swapper
 1459 root           1  -8    0     0K    16K mdwait 0   0:00  0.00% md4
   33 root           1   8    0     0K    16K usbevt 0   0:00  0.00% usb0
   51 root           1  -8    0     0K    16K select 0   0:00  0.00% if_pic_listen0
   47 root           1  -8    0     0K    16K select 0   0:00  0.00% if_pfe_listen
   53 root           1   4    0     0K    16K dump_r 0   0:00  0.00% kern_dump_proc
   46 root           1  76    0     0K    16K sleep  0   0:00  0.00% netdaemon
   54 root           1   8    0     0K    16K -      0   0:00  0.00% nfsiod 0
   55 root           1   8    0     0K    16K -      0   0:00  0.00% nfsiod 1
   56 root           1   8    0     0K    16K -      0   0:00  0.00% nfsiod 2
   57 root           1   8    0     0K    16K -      0   0:00  0.00% nfsiod 3
   35 root           1   8    0     0K    16K dwcint 0   0:00  0.00% dwc1
    6 root           1   8    0     0K    16K -      0   0:00  0.00% kqueue taskq
   34 root           1   8    0     0K    16K usbtsk 0   0:00  0.00% usbtask
    8 root           1   8    0     0K    16K -      0   0:00  0.00% mastership taskq
   10 root           1 -16    0     0K    16K ktrace 0   0:00  0.00% ktrace
   29 root           1 -12 -131     0K    16K WAIT   0   0:00  0.00% swi9: task queue
   28 root           1 -12 -131     0K    16K WAIT   0   0:00  0.00% swi9: +
   27 root           1 -16 -135     0K    16K WAIT   0   0:00  0.00% swi8: +
   24 root           1 -24 -143     0K    16K WAIT   0   0:00  0.00% swi6: vm
   38 root           1 -32 -151     0K    16K WAIT   0   0:00  0.00% swi4: ip6mismatch+
   37 root           1 -36 -155     0K    16K WAIT   0   0:00  0.00% swi3: ip6opt ipopt
   39 root           1 -44 -163     0K    16K WAIT   0   0:00  0.00% swi1: ipfwd
   15 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu7
   14 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu8
   13 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu9
   16 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu6
   11 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu11
   17 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu5
   18 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu4
   12 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu10

Re: SRX 240 HA cluster lost its secondary unit

‎12-10-2018 08:38 AM

The 'eventd' process, which is responsible for logging, is the top process contributing to the high RE CPU. You may have to check your security policy logging configuration and disable it to reduce CPU utilization.

 

Find out which policy is getting the most hits and disable logging on it if it is enabled.

show security policies hit-count descending

clear security policies hit-count <----------------- Reset the count and check again

show security policies hit-count descending
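
For example, once the busiest policy is identified, you can check whether it logs and remove the logging from it (the zone and policy names below are placeholders to adapt):

show security policies from-zone trust to-zone untrust policy-name busy-policy detail
delete security policies from-zone trust to-zone untrust policy busy-policy then log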

Also check your syslog configuration and fine-tune it if required (show configuration system syslog).

Once the RE CPU is back to normal, check the cluster status; if it is still not OK, you may have to reboot the secondary node.

Thanks,
Nellikka
JNCIE x3 (SEC #321; SP #2839; ENT #790)

Re: SRX 240 HA cluster lost its secondary unit

‎12-11-2018 01:00 AM

Well, I reviewed the policies, but we need the logging, so I couldn't turn it off.

A reboot was issued, but the problem is still there.


Re: SRX 240 HA cluster lost its secondary unit

‎12-11-2018 02:04 AM

You have to reduce the RE CPU utilization to stabilize the cluster. To do this, enable stream-mode logging to offload the RE: https://kb.juniper.net/InfoCenter/index?page=content&id=KB16224&actp=METADATA

And if possible, remove logging at session-init if it is configured. The session-close log includes a session summary that also tells you when the session started.
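
On a given policy that would look something like this (zone and policy names are placeholders):

delete security policies from-zone trust to-zone untrust policy allow-web then log session-init
set security policies from-zone trust to-zone untrust policy allow-web then log session-close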

 

 

Thanks,
Nellikka
JNCIE x3 (SEC #321; SP #2839; ENT #790)

Re: SRX 240 HA cluster lost its secondary unit

‎12-13-2018 07:59 AM

Hi,

 

The problem was caused by an MTU issue between the chassis members (the fabric links pass through switches); it is solved now, and everything works fine.
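
For anyone hitting the same thing: the fab child links carry encapsulated packets, so the switch ports between the nodes need jumbo frames. The fix was along these lines on the intermediate switches (EX-style syntax; the interface names and MTU value are examples, not our literal config):

set interfaces ge-0/0/2 mtu 9216
set interfaces ge-0/0/3 mtu 9216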

You mentioned syslog settings, session-init specifically. May I see your exact logging configuration? I tried using only session-close logs, but I didn't find them very useful, because reconstructing a session's start time from the timestamp and "elapsed-time" is a bit difficult to read.

My current configuration is:

 

set security log mode stream
set security log source-address x.x.x.x
set security log stream traffic3 format sd-syslog
set security log stream traffic3 category all
set security log stream traffic3 host x.x.x.x
set security log stream traffic3 host port xxxx

 

Thank you very much in advance!


Re: SRX 240 HA cluster lost its secondary unit

‎12-13-2018 08:16 AM
Thanks for the update. Are you still seeing high CPU caused by eventd?
Thanks,
Nellikka
JNCIE x3 (SEC #321; SP #2839; ENT #790)

Re: SRX 240 HA cluster lost its secondary unit

‎12-13-2018 08:44 AM

I found a strange thing in our configuration:

 

set security log mode stream
set security log source-address x.x.x.x
set security log stream traffic
set security log stream traffic2
set security log stream traffic3 host x.x.x.x
set security log stream traffic3 host port xxx

 

Stream "traffic" and "traffic2" were set, but not any server. I deleted these so session-init. CPU usage decreased to 0%. After this I enabled again logging session init. Now CPU usage is about 6%, but now is out of working hours. We will see it tomorrow. 🙂
