SRX Services Gateway

Significant SRX reliability problems

‎12-06-2017 01:47 PM

Generally speaking, I really like working with the SRX.  We use 210, 220, and 240 models throughout the company.  It's trivially easy to set up tunnels with OSPF to do all kinds of neat inter-office connectivity, and working with JTAC is WAY better than Cisco TAC.  (We have a Cisco phone system.)

 

Five years ago, we bought 15 new SRXes from an authorized Juniper dealer, and each was installed in a separate geographic location.

 

I'm having GRAVE concerns about their reliability.  In the past 2 years, 5 of the 15 have failed with a 6th one heading to the toilet.

  • One lost its flash-- no storage recognized at boot.  It will only boot from a USB stick.
  • Another suddenly came up with a huge number of flash errors, enough that we had to remove it from service-- and this one's in a very high quality colo facility (clean power always).
  • One has a "reset" button problem, such that it kept resetting itself to factory defaults randomly.  I had to set "config-button no-clear" as a workaround.
  • One randomly lost power internally several times a day... not an OS crash, but as in "all the lights blink off then back on".  (Power supply swap didn't help.)
  • One slowly lost its RJ-45 interfaces, one at a time.  I moved services to other interfaces as they failed, until one day...... the unit just crashed and never rebooted.
  • Another one is starting the "randomly loses power internally" issue, in the exact same way as the other one did.  I'm configuring its replacement today.

6 failures out of 15... that's a 40% failure rate in 5 years.  For the record, all are on APC UPSes of varying capacities, and utility power problems are extremely rare.
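
As a rough sanity check on that 40% figure, the arithmetic can be sketched in a few lines of Python. The counts (6 failed out of 15, over roughly 5 years) come from the numbers above; the annualized figure assumes a constant, independent per-year failure probability, which is a simplifying assumption of this sketch and not something measured, and the function names are made up for illustration.

```python
def cumulative_failure_rate(failed: int, total: int) -> float:
    """Fraction of the fleet that has failed so far."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


def annualized_failure_rate(failed: int, total: int, years: float) -> float:
    """Equivalent constant per-year failure probability.

    Assumes (hypothetically) that each unit fails independently with the
    same probability every year, so survival after `years` is (1 - p) ** years.
    """
    if years <= 0:
        raise ValueError("years must be positive")
    survival = 1.0 - cumulative_failure_rate(failed, total)
    return 1.0 - survival ** (1.0 / years)


if __name__ == "__main__":
    # Counts from the post: 6 of 15 units failed over about 5 years.
    print(f"cumulative: {cumulative_failure_rate(6, 15):.0%}")
    print(f"annualized: {annualized_failure_rate(6, 15, 5):.1%}")
```

Since the failures all landed in the last 2 years, the constant-rate assumption probably understates how quickly units are failing now.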

Is the SRX really this much of a failure-prone dog?  Juniper Netscreens we bought circa 2005-06 are still running TODAY with no problems at all... which is why I was so anxious to adopt the SRX at new locations.  But wow.... the problems never end.

Are we alone in this experience?


Re: Significant SRX reliability problems

‎12-07-2017 01:52 AM

Hi!

 

No - you're not the only one. We didn't have any jack or pushbutton issues, but loads of problems with bad blocks in NAND, which often lead to problems during upgrades (i.e. on change of boot partition). ISSU going haywire, systems responding extremely slowly after a config change (had to re-image the device). Or SRXes stuck in the bootloader for no reason - issuing a 'boot' then brings them up (we've had it with several SRX300s so far) - but of course that has to be done from the console, i.e. driving to the customer's site and doing it locally, since customers usually don't have a serial adapter, nor do they want to (or are able to) revive their equipment. Not to mention the extended downtime at the customer's site...

 

And it's not the SRXes alone - in the last few months, we had increasing problems with EX-switches too.

Corrupted filesystems (no power outage - the NAND simply 'slowly dies' during regular operation within 2 years; JTAC tells me that's normal and we have to live with this). An update of a 9-chassis VC left 4 of the chassis at the boot prompt.

Spontaneous reboots after a simple commit, false emergency fire shutdowns due to a possible bug in the CPU temp sensor.

JUNOS quality has suffered massively - we ran into many bugs in the past, most of them 'confidential', i.e. we didn't even have a chance to work around them. To make things worse, many (not all!) JTAC engineers have a strange way of tackling problems: 'please try to install a different JUNOS version in your production environment - we don't know if it will work (potluck), but hey, it's just half an hour of downtime (if you're lucky) and a drive to the customer's site (since you might lose network access to the devices and need console access) - it might cost you a few thousand bucks, but be honest, money is not an issue...' or 'well, NAND problems are unavoidable - please check NAND on all your (200+) devices once a week to quickly identify problems...'.

And I have the feeling that often they didn't even try once to actually install their recommended versions of Junos on the corresponding devices - we had it more than once that the recommendation didn't work at all on the device (too little memory). Funny things then happen (e.g. the system boots and forwards packets but doesn't NAT anymore - no error messages...).

I have already complained to Juniper multiple times, asking them to beef up their QA again - so far in vain.

 

Kai


Re: Significant SRX reliability problems

‎12-07-2017 06:39 AM

Juniper has been my go-to vendor for over a decade-- but their  reliability problems are killing us-- and rapidly changing my mind.

 

I hate to say it, but I'm taking another look at pfSense, because that will give me control over hardware quality.  Last I saw, they didn't do routed IPSec, which was a show-stopper, and I really DON'T want to mess with Cisco PIX.  Dealing with TAC for our Cisco phone system is a big enough nightmare.  But none of our Cisco gear (switches, VPN gateway, phone system) has failed in any way.

 

Juniper is killing themselves with quality control problems.  Maybe not on million-dollar carrier gear, but definitely on branch tier equipment.


Re: Significant SRX reliability problems

‎12-07-2017 09:59 PM

Hello 

 

Thanks a lot for the feedback.

 

To understand this better: what are the models of the newly procured SRXes, and which JUNOS version is this fleet running?

 

Regards,

 

Vikas


Re: Significant SRX reliability problems

‎12-08-2017 03:36 AM

"Newly procured?"  Per my original post, these were bought 4-5 years ago-- which is still fairly young in networking gear terms.  (except the one with the reset button problem is just under 3 years old, problem started at age 2)


210HE, 220H, 240H

 

Some are on 12.1X46-D67, others are still at 12.1X44-D30.4.  

 

That's another MAJOR complaint.  ALL of our devices are still under PAID support, but there is NO JUNOS version we can run that mitigates vulnerabilities CVE-2016-10012, CVE-2016-10010, CVE-2015-6564 and CVE-2015-8325.  The fix is 12.3X48-D55 but none of our devices can run that build, per JTAC, because they are not the newer H2 model.  It is also impossible to disable SSL 3.0 and TLS 1.0 (per JTAC) because the builds that do that are also NOT able to run on our still-paid-supported gear.  I put in an enhancement request for that, but haven't heard a thing.  So I've had to disable nearly all external access on devices that are a long distance away.


Re: Significant SRX reliability problems

‎03-06-2018 11:04 AM

Since I wrote the original post 3 months ago, we've had 2 additional failures.  One crashed in service and on reboot couldn't find boot device (flash failure). 

To replace it, I pulled a gently-used SRX off the shelf, one that had been removed from a shuttered location.  The unit was running perfectly when it was gracefully shut down and brought back to the corporate server room for storage.  When it was booted to replace the flash-failed unit mentioned above, the primary boot partition couldn't be read, so it booted from the backup partition.  I tried to reformat the failed partition (req sys snap slice alt, i.e. request system snapshot slice alternate), but that failed with (can't remember the exact words) an error related to partition inaccessible or media unreadable-- something like that.  So it smells like another flash failure.

Unfortunately, after looking at options, we had no choice but to buy more Juniper because of the effort involved mixing another vendor into a production environment with so many tunnels.  So we're getting a batch of SRX320 and 340 models.

I hope they're more reliable, because my confidence in Juniper is at an all-time low right now.


Re: Significant SRX reliability problems

‎03-08-2018 07:00 AM

Hi

 

I am sad to hear you have had so many failures. This looks like an anomaly to me. Our experience with ten SRX240 boxes after ~5 years of running in a lab rack: zero failures. Are you monitoring the devices' temperature? Is it perhaps too high?

 

Best Regards,
PK

Juniper Ambassador, Juniper Networks Certified Instructor,
JNCIE-SEC #98, JNCIE-ENT #393, JNCIE-SP #2253
Twitter: @JuniperTrain
GitHub: https://github.com/pklimai
[Juniper Authorized Education & Support in Russia]

Re: Significant SRX reliability problems

‎03-08-2018 09:59 AM

Yes, all devices are kept in rooms with proper cooling and humidity.  The failed colo router is in a premium colocation facility where temp, humidity, and power are rigorously maintained-- and we've reviewed the logs to verify.  In our own on-premises telco/server rooms, we have dedicated cooling, and make extensive use of APC brand UPSes in various configurations.

There is nothing environmentally that would explain the failures.  Additionally, each of these locations has other brands of equipment, from Cisco switches and voice gateways, to HP and Dell servers, to video surveillance, and many other types of gear.  The only.... and I stress ONLY.... equipment failures we've had are the Juniper SRXes.

Corrupt or missing Flash.  Front panel reset button that seems to frequently "push itself" (enough that I had to disable it in config).  Ports that go bad for no apparent reason.  One unit even randomly goes dark (as if losing power) for a few seconds then powers back on.  (we replaced the power supply and cables on that one, but problem remained).

Many different types of failures, but only on our Juniper SRX devices.  The Juniper branded Netscreens (NS-25, NS-50) bought in 2006 are still running perfectly with zero failures after 12 years.


Re: Significant SRX reliability problems

Wednesday

Hi, 

 

We are an MSP in the Netherlands. We've been running Juniper for 10 years now, some SSG-5s for 8+ years. After many interface issues with the SRX-200/220, but otherwise reliable operation, we switched to SRX-300s. In the last 4 years we have replaced ALL of our SRX-300s, due to: no interfaces / overheating / crashing or general failure. We replaced 12 units and have made the hard decision to switch to Unifi equipment. A lot less functionality, but very capable as a generic router/firewall. I'm very displeased with how the SRX-300 has performed over the last years.

 

Refreshing regards,
Chris -  Lime Networks


Re: Significant SRX reliability problems

yesterday

Hello Chris,

 

Thank you for the feedback, and very sorry to hear about the reliability problems you have experienced with the SRX300 series product line.

 

We had an issue with our internal storage component, which was causing random crashes, boot failures, reboots, etc.

Storage issues were noticed when excessive logging was written onto the disk.

 

Based on feedback from many of our customers, we have changed the storage component, which provides better IO speed and reliability.

Field response has been positive so far from customers who are using SRX300 series devices with the new storage component and a newer Junos version (15.1X49-D150 and above).

 

We would definitely like to help you and fix all the reliability issues that you have been experiencing.

Could you open a JTAC support ticket so that we can assist you better?

 

Regards,

Raveen
