We are having a problem with a simple vSRX deployment in a chassis cluster on VMware. I am convinced the underlying hosts, network and storage are not to blame here, as we have many other clusters, load balancers, etc. that are sensitive to network issues, and we have not experienced any issues with these.
2 x vSRX firewalls, currently on 15.1X49-D130.6 - not the latest, but we have upgraded from previous versions and not experienced any change in behaviour. I don't think just upgrading to the latest firmware will change anything, as this seems a basic chassis cluster operation issue.
ESXi/vCenter 5.5; hosts are not overutilised in terms of CPU, network or storage.
The problem: the cluster does not seem to form properly. One node (the primary) will report the status as primary/secondary, and the other node will always report it as primary/lost.
So if the secondary node is restarted it seems the cluster never fully reconverges.
Config changes are sync'd
Failover works: if we simulate a failure of the primary node, the secondary node always takes over with maybe 1 packet dropped. Great. But when the previous primary node comes back, the cluster never reconverges and the status always shows as lost (see the screenshot of the console of each vSRX).
I'm sure this is not normal behaviour. I certainly haven't experienced it with hardware SRX firewalls, so I'm wondering what could be going on here. JTAC just advised us to back up the config and rebuild, which we could and probably will do, but I would like to troubleshoot further if possible.
Are both vSRX nodes on the same ESXi host, or are they moving between different hosts of the cluster?
If they are not on the same host, do the VLANs inside the ESXi hosts (distributed or standard switch) and on the physical switch (connecting the two ESXi hosts) have jumbo frames enabled, along with the other recommended settings for the cluster?
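It may also help to run the usual chassis cluster show commands on both nodes and compare how each one sees the control and fabric links. A rough sketch, assuming the standard Junos operational-mode commands behave the same on this vSRX release (I have not checked the exact output on 15.1X49-D130.6):

```
show chassis cluster status
show chassis cluster interfaces
show chassis cluster control-plane statistics
show chassis cluster statistics
show log jsrpd
```

If the heartbeat counters show packets being sent by one node but not received by the other, that would point at the control link path between the ESXi hosts rather than at the vSRX configuration itself.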