[ClusterLabs] Fence node when network interface goes down

Fri Nov 12 14:05:29 EST 2021

On 12.11.2021 20:31, S Rogers wrote:
> Hi, I'm hoping someone will be able to point me in the right direction.
> 
> I am configuring a two-node active/passive cluster that utilises the
> PostgreSQL PAF resource agent. Each node has two NICs, therefore the
> cluster is configured with two corosync links - one on each network (one
> network is the public network, the other is effectively private and just
> used for cluster communication). The cluster has a virtual IP resource,
> which has a colocation constraint to keep it together with the primary
> Postgres instance.
> 
> I am trying to protect against the scenario where the public network
> interface on the active node goes down, in which case I want a failover to
> occur and the other node to take over and host the primary Postgres
> instance and the public virtual IP. My current approach is to use
> ocf:heartbeat:ethmonitor to monitor the public interface along with a
> location constraint to ensure that the virtual IP must be on a node where
> the public interface is UP.
> 
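
For readers following along, the setup described above could look roughly like the sketch below. It only builds and prints the pcs command lines and does not run anything. The resource names (public-net-monitor, pgsql-vip) and the interface name (eth0) are placeholders, not taken from the original configuration. The sketch assumes ethmonitor's default node attribute name, ethmonitor-<interface>, which the agent sets to 1 while the link is up. The exact pcs rule syntax differs between pcs versions, so check it against your installed version.

```python
import shlex


def ethmonitor_commands(interface, vip_resource, monitor_name="public-net-monitor"):
    """Build (but do not run) pcs commands for an ethmonitor clone plus a
    location rule that keeps vip_resource off nodes whose link is down.

    All names passed in are illustrative placeholders.
    """
    # Default attribute name used by ocf:heartbeat:ethmonitor (assumed here).
    attribute = f"ethmonitor-{interface}"
    return [
        # Monitor the interface on every node by cloning the resource.
        ["pcs", "resource", "create", monitor_name,
         "ocf:heartbeat:ethmonitor", f"interface={interface}", "clone"],
        # Ban the VIP from any node where the attribute is not 1 (link up).
        ["pcs", "constraint", "location", vip_resource,
         "rule", "score=-INFINITY", attribute, "ne", "1"],
    ]


if __name__ == "__main__":
    for command in ethmonitor_commands("eth0", "pgsql-vip"):
        print(shlex.join(command))
```
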
> With this configuration, if I disconnect the active node from the public
> network, Pacemaker attempts to move the primary PostgreSQL and virtual IP
> to the other node. The problem is that it attempts to stop the resources
> gracefully, which causes the pgsql resource to error with "Switchover has
> been canceled from pre-promote action" (which I believe is because
> PostgreSQL shuts down, but can't communicate with the standby during the
> shutdown - a similar situation to what is described here:
> https://github.com/ClusterLabs/PAF/issues/149)
> 
> Ideally, if the public network interface on the active node goes down I
> would want to take that node offline (either fence it or put it in standby
> mode, so that no resources can run on it), leaving just the other node in
> the cluster as the active node. Then the old primary can be rebuilt from
> the new primary in order to join the cluster again. However, I can't figure
> out a way to cause the active node to be fenced as a result of
> ocf:heartbeat:ethmonitor detecting that the interface has gone down.
> 
> Does anyone have any ideas/pointers how I could achieve this, or an
> alternative approach?
> 

If stopping a resource fails, Pacemaker's default reaction is to fence the
node (provided fencing is enabled). Assuming "causes the pgsql resource to
error" means "stopping the resource fails", the cluster should already do
what you want. Please show the logs from both nodes around the time you
simulate the failure.
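
As a rough illustration of that default, here is a small Python model of how the reaction to a failed stop is chosen. It is not Pacemaker code and the function name is made up. It only encodes the documented rule that a stop operation without an explicit on-fail setting defaults to "fence" when fencing is enabled and to "block" otherwise.

```python
def stop_failure_reaction(stonith_enabled, on_fail=None):
    """Return the reaction to a failed stop operation in this simplified model.

    on_fail=None means no explicit on-fail was configured for the stop
    operation. No validation of explicit values is attempted here.
    """
    if on_fail is not None:
        # An explicitly configured on-fail overrides the default.
        return on_fail
    # Default for stop: fence if fencing is enabled, otherwise block.
    return "fence" if stonith_enabled else "block"


if __name__ == "__main__":
    print(stop_failure_reaction(stonith_enabled=True))    # fence
    print(stop_failure_reaction(stonith_enabled=False))   # block
```

So if the node is not fenced after the failed stop, the first things to check are whether fencing is enabled and whether the stop operation has an explicit on-fail setting.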

> Hopefully that makes sense. Any help is appreciated!
> 
> Thanks.