[ClusterLabs] Fence node when network interface goes down

Thu Nov 11 18:14:06 EST 2021

Hi, I'm hoping someone will be able to point me in the right direction.

I am configuring a two-node active/passive cluster that utilises the
PostgreSQL PAF resource agent. Each node has two NICs, so the cluster is
configured with two corosync links, one per network: a public network, and
an effectively private network used only for cluster communication. The
cluster has a virtual IP resource, which has a colocation constraint to
keep it together with the primary Postgres instance.

I am trying to protect against the scenario where the public network
interface on the active node goes down, in which case I want a failover to
occur and the other node to take over and host the primary Postgres
instance and the public virtual IP. My current approach is to use
ocf:heartbeat:ethmonitor to monitor the public interface along with a
location constraint to ensure that the virtual IP must be on a node where
the public interface is UP.
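
For illustration, a minimal sketch of the kind of configuration described
above, not the exact commands from this cluster. The interface name eth0 and
the resource names public-ifmon and pgsql-vip are hypothetical, and the rule
syntax shown may differ between pcs versions:

```shell
# Hypothetical names: eth0 (public interface), public-ifmon, pgsql-vip.
# Clone ethmonitor so each node publishes the node attribute ethmonitor-eth0
# (the agent's default attribute name: 1 = interface up, 0 = interface down).
pcs resource create public-ifmon ocf:heartbeat:ethmonitor interface=eth0 clone

# Keep the virtual IP off any node where the public interface is not up.
pcs constraint location pgsql-vip rule score=-INFINITY ethmonitor-eth0 ne 1
```

Note that ethmonitor only updates the node attribute when the interface goes
down; its monitor operation does not fail, so the constraint triggers an
ordinary (graceful) move of the resources rather than fencing.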

With this configuration, if I disconnect the active node from the public
network, Pacemaker attempts to move the primary PostgreSQL instance and the
virtual IP to the other node. The problem is that it attempts to stop the
resources gracefully, which causes the pgsql resource to fail with
"Switchover has been canceled from pre-promote action". I believe this is
because PostgreSQL shuts down but can't communicate with the standby during
the shutdown, a similar situation to the one described here:
https://github.com/ClusterLabs/PAF/issues/149

Ideally, if the public network interface on the active node goes down I
would want to take that node offline (either fence it or put it in standby
mode, so that no resources can run on it), leaving just the other node in
the cluster as the active node. Then the old primary can be rebuilt from
the new primary in order to join the cluster again. However, I can't figure
out a way to cause the active node to be fenced as a result of
ocf:heartbeat:ethmonitor detecting that the interface has gone down.

Does anyone have any ideas/pointers how I could achieve this, or an
alternative approach?

Hopefully that makes sense. Any help is appreciated!

Thanks.