[ClusterLabs] Fence node when network interface goes down

Ken Gaillot kgaillot at redhat.com
Fri Nov 12 16:44:16 EST 2021


On Fri, 2021-11-12 at 17:31 +0000, S Rogers wrote:
> Hi, I'm hoping someone will be able to point me in the right
> direction.
> 
> I am configuring a two-node active/passive cluster that utilises the
> PostgreSQL PAF resource agent. Each node has two NICs, so the
> cluster is configured with two corosync links - one on each network
> (one network is the public network, the other is effectively private
> and just used for cluster communication). The cluster has a virtual
> IP resource, which has a colocation constraint to keep it together
> with the primary Postgres instance.
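> 
> For reference, that colocation looks something like this (resource
> names here are illustrative rather than my exact configuration):
> 
>     pcs constraint colocation add virtual-ip \
>         with master pgsql-ha INFINITY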
> 
> I am trying to protect against the scenario where the public network
> interface on the active node goes down, in which case I want a
> failover to occur and the other node to take over and host the
> primary Postgres instance and the public virtual IP. My current
> approach is to use ocf:heartbeat:ethmonitor to monitor the public
> interface, along with a location constraint to ensure that the
> virtual IP runs only on a node where the public interface is up.
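> 
> Concretely, ethmonitor publishes a node attribute (named
> ethmonitor-<interface> by default, set to 1 while the link is up),
> so the pair looks roughly like this, again with illustrative
> interface and resource names:
> 
>     pcs resource create public-if-check ocf:heartbeat:ethmonitor \
>         interface=eth0 clone
>     pcs constraint location virtual-ip rule score=-INFINITY \
>         ethmonitor-eth0 ne 1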
> 
> With this configuration, if I disconnect the active node from the
> public network, Pacemaker attempts to move the primary PostgreSQL and
> virtual IP to the other node. The problem is that it attempts to stop
> the resources gracefully, which causes the pgsql resource to fail
> with "Switchover has been canceled from pre-promote action" (which I
> believe is because PostgreSQL shuts down but can't communicate with
> the standby during the shutdown - a similar situation to the one
> described here: https://github.com/ClusterLabs/PAF/issues/149).
> 
> Ideally, if the public network interface on the active node goes
> down, I would want to take that node offline (either fence it or put
> it in
> standby mode, so that no resources can run on it), leaving just the
> other node in the cluster as the active node. Then the old primary
> can be rebuilt from the new primary in order to join the cluster
> again. However, I can't figure out a way to cause the active node to
> be fenced as a result of ocf:heartbeat:ethmonitor detecting that the
> interface has gone down.
> 
> Does anyone have any ideas/pointers on how I could achieve this, or
> an alternative approach?
> 
> Hopefully that makes sense. Any help is appreciated!
> 
> Thanks.

Failure handling is configurable via the on-fail meta-attribute. You
can set on-fail=fence for the ethmonitor resource's monitor action to
fence the node if the monitor fails. There's also on-fail=standby, but
that will still try to stop any active resources gracefully, so it
doesn't help in this case.
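For example, with pcs, and assuming the ethmonitor clone is named
public-if-check as in the sketch quoted above, that could look like:

    pcs resource update public-if-check op monitor interval=10s \
        on-fail=fence

Keep in mind that on-fail=fence only takes effect when fencing is
enabled and a working fence device is configured; without that, the
cluster has no way to carry out the fencing.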
-- 
Ken Gaillot <kgaillot at redhat.com>