Multiple RADIUS-Server configuration for UP/UNREACHABLE.
This is a question out of curiosity regarding the 'dot1x' configuration on the Juniper EX3*00 series.
We have a new datacenter, and while migrating the RADIUS servers we also completely rebuilt the wired guest network, moving it to a different /16 network and sending it out on a different IP not associated with our company.
I set up two new RADIUS servers in the new network with identical configuration for clients, policies, etc.
I then added the new RADIUS servers to the configuration. When we made the network change, I deactivated the dot1x protocol shortly before bringing the new servers up, and everything was fine and dandy: all four servers worked fine, and deactivating the two old ones to test the new ones caused no issues. So I went ahead and shut down the two old servers.
And boy did that make for strange issues: our Elasticsearch solution was suddenly flooded with dot1x discards while the denied rate remained the same, and the number of authenticated users dropped like a rock.
Thanks to the 'server-fail permit' setting in the configuration, no users were kicked out.
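For context, here is a minimal sketch of the kind of configuration involved (profile name, interface, addresses, and secret are placeholders, not our actual config):

```
set access profile DOT1X-PROFILE authentication-order radius
set access profile DOT1X-PROFILE radius-server 10.10.0.10 secret "PLACEHOLDER"
set access profile DOT1X-PROFILE radius-server 10.10.0.11 secret "PLACEHOLDER"
set protocols dot1x authenticator authentication-profile-name DOT1X-PROFILE
set protocols dot1x authenticator interface ge-0/0/0.0 server-fail permit
```

The 'server-fail permit' statement is what let supplicants stay authorized while the servers were unreachable.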
I reviewed the different switches and could see that some still considered the old servers 'UP' and others 'DOWN' or 'UNREACHABLE', depending on whether they ran ELS or not.
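For reference, the per-server state can be checked with an operational command along these lines (from memory; the exact command may vary by release and ELS/non-ELS image):

```
show network-access aaa radius-servers
```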
After scripting the removal of the old servers from the configuration, everything was once again fine.
What could have caused this issue? I understand that JunOS uses the authentication-order, but it seems it kept flooding the down servers with requests and consequently had issues with the other two because of that.
Is there a way to avoid this? I know there is a whole tundra of radius-message configuration options on JunOS, but I'm not experienced enough in that area.
Re: Multiple RADIUS-Server configuration for UP/UNREACHABLE.
By default, JUNOS uses round-robin to balance authentication attempts across all configured RADIUS servers. When a server doesn't respond, it is marked as unreachable. Once the revert timer expires, the server is marked UP again (without any actual check of the server's liveness taking place), and this cycle repeats. The following KB describes the revert-timer configuration option:
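As a sketch, the revert timer and per-server retry behavior can be tuned under the access profile (profile name and values here are illustrative; check the documentation for your release's defaults and ranges):

```
set access profile DOT1X-PROFILE radius options revert-interval 600
set access profile DOT1X-PROFILE radius-server 10.10.0.10 timeout 3
set access profile DOT1X-PROFILE radius-server 10.10.0.10 retry 2
```

A longer revert-interval keeps a dead server out of rotation for longer, and shorter timeout/retry values reduce how long each doomed attempt to a down server blocks the authentication flow.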