[ClusterLabs] Fence node when network interface goes down

Mon Nov 15 07:32:41 EST 2021

On 15/11/2021 12:03, Klaus Wenninger wrote:
>
>
> On Mon, Nov 15, 2021 at 12:19 PM Andrei Borzenkov 
> <arvidjaar at gmail.com> wrote:
>
>     On Mon, Nov 15, 2021 at 1:18 PM Klaus Wenninger
>     <kwenning at redhat.com> wrote:
>     >
>     >
>     >
>     > On Mon, Nov 15, 2021 at 10:37 AM S Rogers
>     <sa.rogers1342 at gmail.com> wrote:
>     >>
>     >> I had thought about doing that, but the cluster is then
>     dependent on the
>     >> external system, and if that external system was to go down or
>     become
>     >> unreachable for any reason then it would falsely cause the
>     cluster to
>     >> failover or worse it could even take the cluster down
>     completely, if the
>     >> external system goes down and both nodes cannot ping it.
>     >
>     > You wouldn't necessarily have to ban resources from nodes that can't
>     > reach the external network. It would be enough to make them prefer
>     > the location that has connection. So if both lose connection 
>     one side
>     > would still stay up.
>     > Not to depend on something really external you might use the
>     > router to your external network as ping target.
>     > In case of fencing - triggered by whatever - and a potential
>     fence-race
>
>     The problem here is that nothing really triggers fencing. What
>     happens, is
>
>
> Got that! Which is why I gave the hint how to prevent shutting down
> services with ping first.
> Taking care of what happens when nodes are fenced still makes sense.
> Imagine a fence-race where the node running services loses just
> to afterwards get the services moved back when it comes up again.
>
> Klaus
Thanks, I wasn't aware of priority-fencing-delay. While it doesn't solve 
this problem, I can still use it to improve the fencing behaviour of the 
cluster in general.

Unfortunately, in some situations this cluster will be deployed in a 
completely isolated network so there may not even be a router that we 
can use as a ping target, and we can't guarantee the presence of any 
other system on the network that we could reliably use as a ping target.

>
>     - two postgres lose connection over external network, but cluster
>     nodes retain connectivity over another network
>     - postgres RA compares "latest timestamp" when selecting the best node
>     to fail over to
>     - primary postgres has better timestamp, so RA simply does not
>     consider secondary as suitable for (automatic) failover
>
>     The only solution here - as long as fencing node on external
>     connectivity loss is acceptable - is modifying ethmonitor RA to fail
>     monitor operation in this case.
>
I was hoping to find a way to achieve the desired outcome without 
resorting to a custom RA, but it does appear to be the only solution.

This may not be the right audience, but does anyone know whether it would 
be a viable change to add a parameter to the ethmonitor RA that lets users 
choose what the monitor operation does when it finds the interface down? 
(i.e., a 'monitor_force_fail' parameter that, when set to true, causes the 
monitor operation to fail if it determines the interface is down)

Being relatively new to pacemaker, I don't know whether this goes 
against RA conventions/practices.
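
To make the question concrete, the decision such a parameter would control could be sketched roughly as below. This is an illustration only, not the actual ethmonitor code: `monitor_force_fail`, `monitor_exit_code` and `interface_is_up` are hypothetical names, and reading `/sys/class/net/<iface>/operstate` is just one assumed way of checking link state on Linux. The only things taken as given are the standard OCF exit codes 0 (success) and 1 (generic error).

```python
from pathlib import Path

OCF_SUCCESS = 0      # standard OCF exit code: operation succeeded
OCF_ERR_GENERIC = 1  # standard OCF exit code: generic failure


def interface_is_up(iface: str, sysfs_root: str = "/sys/class/net") -> bool:
    """Return True if the kernel reports the interface's operstate as 'up'.

    Assumption: Linux sysfs layout. Some interfaces report 'unknown'
    rather than 'up', so a real check would need to be more careful.
    """
    try:
        state = (Path(sysfs_root) / iface / "operstate").read_text().strip()
    except OSError:
        # Interface missing or unreadable: treat it as down.
        return False
    return state == "up"


def monitor_exit_code(link_up: bool, monitor_force_fail: bool) -> int:
    """Pick the monitor result for the hypothetical 'monitor_force_fail' option.

    Without the option, a down interface is still reported as success
    (the agent itself is running fine). With the option set, a down
    interface makes the monitor operation fail.
    """
    if link_up or not monitor_force_fail:
        return OCF_SUCCESS
    return OCF_ERR_GENERIC


if __name__ == "__main__":
    # 'eth0' is only a placeholder interface name.
    print(monitor_exit_code(interface_is_up("eth0"), monitor_force_fail=True))
```

With something like this, a failed monitor would only lead to fencing if the monitor operation were also configured with on-fail=fence; otherwise the cluster would just try to recover the resource.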

>
>     > you might use the rather new feature priority-fencing-delay
>     (give the node
>     > that is running valuable resources a benefit in the race) or go for
>     > fence_heuristics_ping (pseudo fence-resource that together with a
>     > fencing-topology prevents the node without access to a certain IP
>     > from fencing the other node).
>     >
>     https://clusterlabs.org/pacemaker/doc/deprecated/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
>     >
>     https://github.com/ClusterLabs/fence-agents/blob/master/agents/heuristics_ping/fence_heuristics_ping.py
>     >
>     > Klaus
>     > _______________________________________________
>     >>
>     >> Manage your subscription:
>     >> https://lists.clusterlabs.org/mailman/listinfo/users
>     >>
>     >> ClusterLabs home: https://www.clusterlabs.org/
>     >>