06-14-2009 12:47 PM - edited 06-14-2009 12:52 PM
An EX4200 member of the virtual chassis experienced some corruption that caused packets greater than roughly a hundred bytes to be dropped. If you pinged a device on that member, there was no problem (very small packets moving). But if you tried to pass traffic over a socket, the socket would ultimately be closed and traffic did not get through.
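In hindsight, a payload-size sweep would have localized this much faster: probe with increasing packet sizes and find where loss begins. A rough sketch of the isolation logic (the `probe` callable is a stand-in for a real reachability test such as `ping -c 3 -s <size> <host>`; the fault model below is hypothetical):

```python
def max_passing_payload(probe, lo=0, hi=1472):
    """Binary-search the largest payload size (in bytes) for which
    probe(size) still succeeds. probe(size) -> bool is a stand-in for
    a real test such as shelling out to `ping -c 3 -s <size> <host>`."""
    if not probe(lo):
        return None  # even the smallest probe fails; different problem
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if probe(mid):
            lo = mid   # this size still passes; search larger
        else:
            hi = mid - 1  # this size is dropped; search smaller
    return lo

# Hypothetical fault like the one described: anything over ~100 bytes drops.
broken_member = lambda size: size <= 100
print(max_passing_payload(broken_member))  # 100
```

A healthy path would report the full 1472 (standard Ethernet MTU minus IP/ICMP headers); a sharp cutoff around 100 bytes points straight at the sick member.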
This issue was further exacerbated because we run N+1 at the network level. We didn't really know it was this particular member of the chassis without extensive onsite testing. We upgraded JUNOS and did a reboot, and it seems to be fine. So the problem is: if the switch had just died outright, that would have been perfectly OK. But because it decided to stay somewhat healthy, it wreaked havoc in production as load-balanced traffic sometimes went over this unhealthy switch.
I experienced a similar issue with a Juniper SSG device about six months ago (again, a reboot solved the problem). I am not too happy about Juniper's software; it is probably some complicated and messy source code, given these types of bugs.
How do people monitor these types of scenarios? Do you deliberately set up some devices without redundancy on each switch and then run some type of test back and forth all day long? Or is this just one of those things you hope does not happen again? Do people script something to reboot switches every week or month? But that would send alarms all over the place... this particular switch had been up almost a year. I guess regularly updating JUNOS might have solved it?
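One answer to the monitoring question: don't rely on ping alone, but run a periodic end-to-end check that pushes a full-size payload over a real TCP socket to an echo service behind each member. A minimal sketch, assuming something on a known port echoes bytes back (the host/port and the 1400-byte size are assumptions, not anything from a particular product):

```python
import socket

def tcp_payload_check(host, port, size=1400, timeout=5.0):
    """Open a TCP connection and push `size` bytes through it; return True
    only if an echoing peer sends all of them back. A small ping can pass
    while this fails when larger frames are silently being dropped."""
    payload = b"x" * size
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(payload)
            received = b""
            while len(received) < size:
                chunk = s.recv(4096)
                if not chunk:  # peer closed early; payload never made it
                    return False
                received += chunk
        return received == payload
    except OSError:
        return False
```

Run from a host behind each switch member every few minutes and alert on failure; a member that passes pings but fails this check is exactly the "somewhat healthy" case described above.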
EDIT: We do not like to upgrade JUNOS or ScreenOS unless there is a compelling reason to (new feature, known bug). Juniper is one of the few product lines where we have had to actually roll back to older versions due to bugs in new versions. So as a policy we typically do not upgrade unless needed. Obviously we don't let the software get too old, so we will update if it gets to be a year old and everything is running OK (this way we stick with a good migration path). I can't help but think upgrading JUNOS every month might have solved this. What is your policy?
06-17-2009 04:30 PM
Hi. There are a lot of layers to this question...
First, the SSG series and EX series run completely different software (the EX runs JUNOS, the SSG is based on ScreenOS), so even if the issues seemed similar, it's extremely unlikely they were related.
A problem in code over a year old should be known to us and resolved by now. In either case, please have your reseller contact us about it, or if you have a support contract you can open a case directly with JTAC to identify the root cause. You shouldn't tolerate, expect, or design for packet-loss or corruption issues; those are high-priority issues.
Can you tell us what JUNOS version was running on the EX prior to the upgrade? What did you upgrade to? When the EX first shipped a year ago, it was not running as high-quality a release as we would have liked. We've learned, and I think you'll find it shows in the subsequent releases.
re: downgrades - I suspect you are really referring to the ScreenOS software here again. While we are very proud of our development practices and the quality of our software, there is always room for improvement. Recent field data shows we're making good progress.
Monthly upgrades and/or restarts are neither required nor advised. Your strategy to upgrade within our EOL window is perfectly fine.
That particular issue you faced must have been quite a challenge to isolate - please accept my apology on behalf of Juniper for the business disruption and time spent.
Juniper Networks, Inc
06-23-2009 12:17 AM
Thanks for the reply. After a wasted weekend, I was tempted to design something more robust into our infrastructure, but I recognize that is not really practical. We were running JUNOS 9.2 and upgraded to 9.5.
One problem I think Juniper is going to face is JUNOS being everything to everybody. It is a switch, a router, a firewall, everything, and as it gets more complex it just gets unwieldy. All it takes is one developer putting one bad line of code among millions of lines and it breaks. I have been advised by the reseller to periodically check for new versions and read what bugs have been fixed. I guess this is reasonable; I will do it once every couple of months.
Also, we ran into SNMP issues with the EX4200. If you query too much, the CPU spikes to 100%, so we turned SNMP off. I realize now that if I had been able to query SNMP, I possibly would have seen on our graphical displays that this particular member was passing trivial amounts of traffic compared to normal. I may revisit using SNMP and just manually select one or two interfaces.
Incidentally, what fails on the EX4200 is grabbing SNMP counters for all the ports: in and out bandwidth. I normally do this with the monitoring tool's wizard: we buy a switch, point the tool at it, and it grabs all the interfaces and displays them with stats compared to historical figures. So 4 EX4200s is 96 ports, and we grab out/in, so that is 192 SNMP queries. We would run them every few minutes and it would literally stop the switch for 3-5 seconds every 3-5 minutes. That was right after we first bought the switch (SNMP is one of the first things we set up), and we have had SNMP turned off ever since... the EX4200 needs to be handled with kid gloves. Kind of sad, really. Do you know if this SNMP issue has been fixed too? I guess I could grab only the counters that I "must have", but 192 counters is nothing; it shouldn't kill JUNOS. Maybe CPU should go from 3% to 5%, but not from 3% to 99%.
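Rather than one wizard-driven poll of all 192 counters at once, the load can be spread out: build the per-interface OID list yourself and fetch it in small batches with a pause between requests. A sketch of the batching idea (the ifIndex range 1-96 is an assumption; the actual fetch is left to whatever poller you use, e.g. net-snmp's `snmpget`):

```python
# Standard IF-MIB counter OIDs (ifInOctets / ifOutOctets), indexed by ifIndex.
IF_IN_OCTETS = "1.3.6.1.2.1.2.2.1.10"
IF_OUT_OCTETS = "1.3.6.1.2.1.2.2.1.16"

def build_oids(if_indexes):
    """Two OIDs (in and out octets) per interface: 96 ports -> 192 OIDs."""
    oids = []
    for idx in if_indexes:
        oids.append(f"{IF_IN_OCTETS}.{idx}")
        oids.append(f"{IF_OUT_OCTETS}.{idx}")
    return oids

def batched(oids, batch_size=10):
    """Split the OID list into small requests so no single poll hammers
    the routing engine; the real poller would sleep between batches."""
    return [oids[i:i + batch_size] for i in range(0, len(oids), batch_size)]

oids = build_oids(range(1, 97))        # assumed ifIndexes for 96 VC ports
print(len(oids), len(batched(oids)))   # 192 20
```

Twenty requests of ten varbinds each, spaced over the polling interval, keeps per-request work on the routing engine small while still collecting every counter.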
06-23-2009 09:39 AM
Have you ever tried to graph SNMP on other Juniper platforms?
The CPU usage you are graphing is the routing engine's, not the forwarding plane's. The routing-engine CPU goes up to 100% when you do, for example, an SNMP walk on an EX switch, but the same is true on a lot of other, "bigger" Juniper routers. You don't have to be afraid of this. It is a Unix kernel taking care of different processes and holding a different priority for each process.
Did you encounter any packet drops? Did any routing protocols go down? Did you, for example, lose remote access to the switch? Please just test this. Did you do the same thing on a Cisco switch? On a traditional SNMP walk, the CPU usage goes up to 77% on a C2960, but the MIB length is much smaller, and the software is not Unix-style, so the "processes" don't work the same way. I think you should do some other kinds of tests to be confident in Junos.
I hope that some Juniper people will also give you some advice regarding the CPU usage. Hope this helps.
06-23-2009 04:37 PM
The SNMP issue looks very similar to one reported last week. Is CoS configured, by any chance? That was implicated in the other report. It can't be terribly common or we'd be buried in the issue; SNMP polling of interface stats is a very common activity.
basistrdr - if you or your reseller can open a case with our technical assistance center, you'll be able to get better information on that issue. (I am not a product expert.)