I have two clustered QFX5100 switches. They have a rather interesting config and an issue related to it.
Let me introduce the setup. The box is split into three routing instances (type virtual-router) with route leaking between them (using rib-groups).
The 1st routing instance (master) has routes to instances 2 and 3, so hosts connected to interfaces in this instance can reach both instance 2 and instance 3.
The 2nd instance has its own routes plus the routes from the 1st instance.
The 3rd instance has its own routes plus the routes from the 1st instance.
So the 2nd and 3rd instances are completely isolated from each other.
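To make the layout above more concrete, here is a rough sketch of how such leaking could be configured with rib-groups. The instance names (VR2, VR3) and rib-group names are hypothetical, and only interface (direct) routes are shown; this is not the actual config from the box.

```
# Hypothetical names: VR2, VR3, MASTER-TO-VRS, VR2-TO-MASTER, VR3-TO-MASTER
# Master (inet.0) leaks its interface routes into both instances
set routing-options rib-groups MASTER-TO-VRS import-rib [ inet.0 VR2.inet.0 VR3.inet.0 ]
set routing-options interface-routes rib-group inet MASTER-TO-VRS

# Each instance leaks its interface routes only into master,
# so VR2 and VR3 never see each other's routes
set routing-options rib-groups VR2-TO-MASTER import-rib [ VR2.inet.0 inet.0 ]
set routing-instances VR2 instance-type virtual-router
set routing-instances VR2 routing-options interface-routes rib-group inet VR2-TO-MASTER

set routing-options rib-groups VR3-TO-MASTER import-rib [ VR3.inet.0 inet.0 ]
set routing-instances VR3 instance-type virtual-router
set routing-instances VR3 routing-options interface-routes rib-group inet VR3-TO-MASTER
```

The first table listed in each import-rib is the primary table the routes come from; the rest are the tables they get copied into.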
Now to the issue itself. It happens from time to time, with no regular pattern. Here is an example:
A host on a connected network in the 2nd (or 3rd) instance tries to reach a host in the 1st instance (master).
The host sends packets from the 2nd instance to the 1st one (master). I see them in tcpdump, but they don't reach the target host on the connected network in the 1st instance (master).
Then I try the reverse direction: reaching the host in the 2nd instance from the host in the 1st instance.
The host on the connected network in the master instance sends packets to the host in the 2nd instance. The packets reach the host in the 2nd instance and it replies (I see both the incoming packets and the replies in tcpdump), !_ BUT _! the replies don't reach the host in the master instance (again according to tcpdump).
In this case a single host (not the whole network) is unreachable from all hosts on the connected networks in the 2nd instance.
Also, I found a workaround: if I add a static route in the 2nd instance with the unreachable host's IP as the destination, next-hop 0.0.0.0 and the resolve option, it starts to work. (The default route is imported from the master instance, where it comes from a BGP neighbour.)
It is hard to understand at first, so please ask me questions.
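For reference, the static route workaround would look roughly like this. The instance name VR2 and the host address 192.0.2.10 are placeholders, not the real values.

```
# Hypothetical instance name and host address
# Host route for the unreachable host, resolved recursively via next-hop 0.0.0.0
set routing-instances VR2 routing-options static route 192.0.2.10/32 next-hop 0.0.0.0 resolve
```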
If you enable route leaking between the routing instances (by using the rib-group
statement, for example), the downstream device cannot connect to the upstream device
because the switch connects to the upstream device over a direct route and these routes
are not leaked between instances.
NOTE: You can see a route to the upstream device in the routing table of the downstream device, but this route is not functional.
Indirect routes are leaked between routing instances, so the downstream device can
connect to any upstream devices that are connected to the switch over indirect routes.
You also found a workaround with the "resolve" knob, congrats.
Thanks for your reply, I hope this will be fixed in the next releases.
Also, in case somebody faces the same issue, I found a better solution. You need to delete the family inet address from the interface where the affected node is located, i.e. the node that doesn't receive the replies (in the example above, the node in the master instance), and then add the family inet address back on the interface.
This solution is not good either: it impacts all nodes in the affected network while the address is deleted, but I don't have another one. :-(
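Roughly, the delete and re-add sequence looks like this in configuration mode. The interface (irb unit 100) and the address are placeholders, so substitute whatever interface faces the affected node.

```
# Hypothetical interface and address
# Step 1: remove the address (all hosts in this network lose their gateway here)
delete interfaces irb unit 100 family inet address 198.51.100.1/24
commit

# Step 2: put the same address back
set interfaces irb unit 100 family inet address 198.51.100.1/24
commit
```

Keeping the time between the two commits as short as possible limits the outage for the other nodes in that network.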