SRX Services Gateway
Reply
Contributor
rebus
Posts: 56
Registered: ‎05-28-2009
0
Accepted Solution

SRX 220 - Sudden Very High Sustained CPU Utilization

[ Edited ]

This afternoon, an SRX 220 that's been running with average CPU utilization around 10% since it was installed 2 months ago suddenly shot up to over 80% sustained CPU.  (we use Solarwinds Orion to monitor and graph)  This has been going on for several hours, sustained.  We've made no configuration changes recently.

 

See attached performance graph image.  it's after business hours and our traffic is near zero.

 

Can anyone suggest what might be causing this, and what the solution might be?  (firmware is 11.2R6.3)

 

 

root@SRX1> show system processes extensive
last pid: 65805;  load averages:  3.05,  2.09,  1.78  up 1+01:18:11    18:47:26
136 processes: 19 running, 104 sleeping, 2 zombie, 11 waiting

Mem: 131M Active, 113M Inact, 529M Wired, 144M Cache, 112M Buf, 53M Free
Swap:


  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
 1149 root        5  76    0   498M 52568K select 0  27.9H 98.44% flowd_octeon_hm
   22 root        1 171   52     0K    16K RUN    0  18.1H 11.23% idle: cpu0
65799 root        1  -8    0  8288K  3044K RUN    0   0:00  1.46% sshd
65800 root        1   4    0  7728K  2948K sbwait 0   0:00  1.46% sshd
65801 sshd        1   4    0  7768K  1728K sbwait 0   0:00  0.73% sshd
   24 root        1 -20 -139     0K    16K RUN    0  14:06  0.00% swi7: clock
 1144 root        1  76    0 20032K  8864K select 0  12:47  0.00% kmd
   23 root        1 -40 -159     0K    16K WAIT   0   9:07  0.00% swi2: net
  894 root        1  76    0  8100K  2708K select 0   8:18  0.00% eventd
    5 root        1 -16    0     0K    16K rtfifo 0   3:51  0.00% rtfifo_kern_recv
 1136 root        1  76    0 17692K 10352K select 0   3:32  0.00% snmpd
 1173 root        1  76    0 10576K  3880K select 0   3:09  0.00% license-check
 1139 root        1  76    0 17500K  6848K select 0   2:59  0.00% l2ald

 

 

Distinguished Expert
MMcD
Posts: 636
Registered: ‎07-20-2010
0

Re: SRX 220 - Sudden Very High Sustained CPU Utilization

What does the following show from the shell command?

 

user@srx>start shell
%top -H

 

MMcD [JNCIP-SEC, JNCIS-ENT, CCNA, MCP]
____________________________________________________

[Please Mark My Solution Accepted if it Helped, Kudos are Appreciated Too]
Contributor
rebus
Posts: 56
Registered: ‎05-28-2009
0

Re: SRX 220 - Sudden Very High Sustained CPU Utilization

Hello..... looks like the same process as above (flowd_octeon_hm)


 

 

ast pid: 66011;  load averages:  0.90,  0.65,  0.50                                                     up 1+14:08:53  07:38:08
72 processes:  3 running, 68 sleeping, 1 zombie
CPU states: 64.2% user,  0.0% nice,  5.6% system,  0.0% interrupt, 30.2% idle
Mem: 132M Active, 113M Inact, 529M Wired, 144M Cache, 112M Buf, 51M Free
Swap:

  PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
 1149 root     139    0   498M 52444K CPU1   1  42.2H 92.48% flowd_octeon_hm
 1149 root      78    0   498M 52444K select 0  42.2H  5.22% flowd_octeon_hm
66011 sshd     104    0  7736K  1576K RUN    0   0:00  0.73% sshd
 1149 root      76    0   498M 52444K select 0  42.2H  0.00% flowd_octeon_hm
 1149 root      76    0   498M 52444K select 0  42.2H  0.00% flowd_octeon_hm
 1149 root       8    0   498M 52444K nanslp 0  42.2H  0.00% flowd_octeon_hm
 1144 root      76    0 20076K  8908K select 0  20:27  0.00% kmd
  894 root      76    0  8100K  2708K select 0  12:46  0.00% eventd
 1136 root      76    0 17692K 10352K select 0   5:17  0.00% snmpd
 1173 root      76    0 10576K  3880K select 0   4:44  0.00% license-check
 1139 root      76    0 17500K  6848K select 0   4:29  0.00% l2ald
 1137 root      76    0 17472K  7732K select 0   4:23  0.00% mib2d
 1168 root      76    0 11832K  4676K select 0   3:33  0.00% rtlogd
 1145 root      76    0  9068K  3400K select 0   3:18  0.00% ppmd
 1179 nobody    76    0 10368K  4428K ucondt 0   2:32  0.00% httpd
 1179 nobody    76    0 10368K  4428K ucondt 0   2:32  0.00% httpd

Distinguished Expert
MMcD
Posts: 636
Registered: ‎07-20-2010
0

Re: SRX 220 - Sudden Very High Sustained CPU Utilization

[ Edited ]

This is normal.  The flowd_octeon thread usually runs at high percentage in order to make sure that all the packets received are processed faster.  This is why flowd reads high always, however it does not mean that system is busy.

 

The actual CPU usage can be viewed below.

 

For Routing Engine (Control Plane) CPU usage:

  show chassis routing-engine

 

For Packet Forwarding Engine (Data Plane) CPU usage:

  show chassis forwarding

  show security monitoring fpc 0

 

Here is an output from an SRX240, the multicore processor, people sometimes wonder why the CPU% is 270%, but it is normal, 90% per core.

 

PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
 1095 root        7  76    0   499M 49852K select 0    ??? 277.39% flowd_octeon_hm
   22 root        1 171   52     0K    16K RUN    0 3720.0 87.21% idle: cpu0
   23 root        1 -20 -139     0K    16K WAIT   0  41.1H  0.00% swi7: clock
 1120 root        1  76    0  8788K  3740K select 0 582:56  0.00% license-check
    5 root        1 -84    0     0K    16K rtfifo 0 502:39  0.00% rtfifo_kern_recv
 1086 root        1  76    0 17084K  6696K select 0 325:33  0.00% l2ald
   25 root        1 -40 -159     0K    16K WAIT   0 200:36  0.00% swi2: net
 1099 root        3  20    0 43940K 13416K sigwai 0 173:25  0.00% authd

 

And the top -H shows each CPU

 

PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
 1095 root     139    0   499M 49852K CPU3   3    ??? 91.70% flowd_octeon_hm
 1095 root     139    0   499M 49852K CPU2   2    ??? 91.70% flowd_octeon_hm
 1095 root     139    0   499M 49852K CPU1   1    ??? 91.70% flowd_octeon_hm
 1095 root      77    0   499M 49852K RUN    0    ???  1.46% flowd_octeon_hm
 1095 root      76    0   499M 49852K select 0    ???  0.00% flowd_octeon_hm
 1095 root      76    0   499M 49852K select 0    ???  0.00% flowd_octeon_hm
 1095 root       8    0   499M 49852K nanslp 0    ???  0.00% flowd_octeon_hm
 1120 root      76    0  8788K  3740K select 0 582:55  0.00% license-check
 1086 root      76    0 17084K  6696K select 0 325:33  0.00% l2ald
 1099 root      96    0 43940K 13416K ucond  0 173:25  0.00% authd
 1099 root      76    0 43940K 13416K select 0 173:25  0.00% authd
 1099 root      20    0 43940K 13416K sigwai 0 173:25  0.00% authd
 1088 root      76    0 22016K  6800K select 0 155:18  0.00% pfed


MMcD [JNCIP-SEC, JNCIS-ENT, CCNA, MCP]
____________________________________________________

[Please Mark My Solution Accepted if it Helped, Kudos are Appreciated Too]
Contributor
rebus
Posts: 56
Registered: ‎05-28-2009
0

Re: SRX 220 - Sudden Very High Sustained CPU Utilization

But notice the performance graph I attached to the original post.

 

This CPU spike is sudden an significant for this device, as depoloyed in our company.  We've been monitoring it for 2 months and it's always averaged around 10% for all that time.  Until yesterday.

 

Something has changed, and it wasn't a config change because we haven't made any.

Distinguished Expert
MMcD
Posts: 636
Registered: ‎07-20-2010
0

Re: SRX 220 - Sudden Very High Sustained CPU Utilization

Yup, I know, from Solad Winds it states:

 

"This poller should pull back the equivilant of "show security monitoring fpc 0" on an SRX"

 

What does the following commands return?

 

For Packet Forwarding Engine (Data Plane) CPU usage:

  show chassis forwarding

  show security monitoring fpc 0

MMcD [JNCIP-SEC, JNCIS-ENT, CCNA, MCP]
____________________________________________________

[Please Mark My Solution Accepted if it Helped, Kudos are Appreciated Too]
Contributor
rebus
Posts: 56
Registered: ‎05-28-2009
0

Re: SRX 220 - Sudden Very High Sustained CPU Utilization

[ Edited ]

Here you go......

 

root@SRX1> show chassis forwarding
FWDD status:
  State                                 Online   
  Microkernel CPU utilization         7 percent
  Real-time threads CPU utilization   1 percent
  Heap utilization                   52 percent
  Buffer utilization                  1 percent
  Uptime:                               1 day, 16 hours, 10 minutes, 5 seconds

root@SRX1> show security monitoring fpc 0
FPC 0
  PIC 0
    CPU utilization      :    2 %
    Memory utilization   :   52 %
    Current flow session :  426
    Max flow session     : 98304
Session Creation Per Second (for last 96 seconds on average):    0

 

 

Overnight the CPU utilization dropped suddenly from 80% to 40% and is holding steady in the low 40s.  An improvement, but still a far cry from "normal" for this device.

Also note that uptime of 1 day 16 hours was after a recovery from a power outage, which outlasted the UPS battery.  It was an unplanned un-graceful shutdown.  Could that have anything to do with it?  I didn't mention this before, because the unit ran for about 22 hours at "normal" 10% CPU before the sudden spike to 80% sustained.

Distinguished Expert
MMcD
Posts: 636
Registered: ‎07-20-2010
0

Re: SRX 220 - Sudden Very High Sustained CPU Utilization

Hi,

 

Well I do not know much about Solar Windows but somwhere along the way, either the SRX or the Solar Windows Product is giving/providing/showing incorrect data.

 

The real readings above are of course correct.  So I would try and restart the SNMP Daemon first.

 

user@srx>restart snmp

 

I also notice your graph is "Average" CPU load and not a true current reading.  Im leaning toward an issue with the way the Solar Windows product has handled the unexpected system shutdown.  I expect resetting the data would result in it dropping down again.

 

Can it show a running CPU graph?

MMcD [JNCIP-SEC, JNCIS-ENT, CCNA, MCP]
____________________________________________________

[Please Mark My Solution Accepted if it Helped, Kudos are Appreciated Too]
Contributor
rebus
Posts: 56
Registered: ‎05-28-2009
0

Re: SRX 220 - Sudden Very High Sustained CPU Utilization

See attached graph.

 

I restarted snmp on the SRX and the solarwinds processes, and reported CPU average utilization is back down around 10%.

 

It's interesting that the anomolous spike didn't happen right after the router came back up from the power outage, but rather, it happend about 22 hours later.  Hmmmm.

 

Thanks for you help and suggestions. 

Copyright© 1999-2013 Juniper Networks, Inc. All rights reserved.