[ClusterLabs] Automatic recovery from split brain?

Ken Gaillot kgaillot at redhat.com
Mon Aug 10 16:19:10 EDT 2020


On Sun, 2020-08-09 at 21:11 +0200, Adam Cécile wrote:
> Hello,
> 
> 
> I'm experiencing an issue with corosync/pacemaker running on Debian
> Buster. The cluster has three nodes running in VMware virtual machines,
> and the cluster fails when VEEAM backs up a virtual machine (I know
> it's doing bad things, like completely freezing the VM for a few
> minutes to take a disk snapshot).
> 
> My biggest issue is that once the backup has completed, the cluster
> stays in a split-brain state, and I'd like it to heal itself. Here

Fencing is how the cluster prevents split-brain. When one node is lost,
the other nodes will not recover any resources from it until it's
fenced. For VMware there's a fence_vmware_soap fence agent.
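
As a rough sketch (assuming the pcs shell is in use, and with the
vCenter address, credentials, and host-to-VM name map below as
placeholders to replace with your own; parameter names vary a bit
between fence-agents versions), the device could be set up along
these lines:

  pcs stonith create vmfence fence_vmware_soap \
      ipaddr=vcenter.domain.com login=fenceuser passwd=secret \
      ssl=1 ssl_insecure=1 \
      pcmk_host_map="host1.domain.com:host1-vm;host2.domain.com:host2-vm;host3.domain.com:host3-vm"
  pcs property set stonith-enabled=true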

However, that's intended for failure scenarios, not a planned outage
like a backup snapshot.

For planned outages, you can set the cluster-wide
property "maintenance-mode" to true. The cluster won't start, monitor,
or stop resources while in maintenance mode. You can use rules to
automatically put the cluster in maintenance mode at specific times.
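
For example, maintenance mode can be toggled by hand (a sketch,
assuming pcs; crmsh has equivalents):

  pcs property set maintenance-mode=true
  pcs property set maintenance-mode=false

A time-based rule would look roughly like the following, saved as
maintenance-window.xml and loaded with cibadmin; the ids and the
02:00-04:59 daily window are only illustrative:

  <cluster_property_set id="backup-window" score="100">
    <rule id="backup-window-rule" score="0">
      <date_expression id="backup-window-date" operation="date_spec">
        <date_spec id="backup-window-hours" hours="2-4"/>
      </date_expression>
    </rule>
    <nvpair id="backup-window-maintenance" name="maintenance-mode" value="true"/>
  </cluster_property_set>

  cibadmin -C -o crm_config -x maintenance-window.xml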

However, I believe that even in maintenance mode, the node will get fenced if
it drops out of the corosync membership. Ideally you'd put the cluster
in maintenance mode, stop pacemaker and corosync on the node, do the
backup, then start pacemaker and corosync, wait for them to come up,
and take the cluster out of maintenance mode.
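
A sketch of that sequence with pcs (host1.domain.com stands in for
whichever node is being backed up; "pcs cluster stop/start" handles
both pacemaker and corosync on that node):

  pcs property set maintenance-mode=true
  pcs cluster stop host1.domain.com    # stops pacemaker and corosync there
  # ... run the VEEAM backup of that VM ...
  pcs cluster start host1.domain.com
  crm_mon -1                           # repeat until the node shows Online again
  pcs property set maintenance-mode=false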

Alternatively, if you want the resources to move to other nodes while
the backup is being done, you could put the node in standby rather than
setting maintenance mode.
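
For example (again assuming pcs; older pcs releases use "pcs cluster
standby" instead):

  pcs node standby host1.domain.com    # resources are moved off the node
  # ... run the backup ...
  pcs node unstandby host1.domain.com  # the node can host resources again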

> current status:
> 
> 
> One node is isolated:
> 
> Stack: corosync
> Current DC: host2.domain.com (version 2.0.1-9e909a5bdd) - partition WITHOUT quorum
> Last updated: Sat Aug  8 11:59:46 2020
> Last change: Fri Jul 24 07:18:12 2020 by root via cibadmin on host1.domain.com
> 
> 3 nodes configured
> 6 resources configured
> 
> Online: [ host2.domain.com ]
> OFFLINE: [ host3.domain.com host1.domain.com ]
> 
> 
> The other two can see each other:
> 
> Stack: corosync
> Current DC: host3.domain.com (version 2.0.1-9e909a5bdd) - partition with quorum
> Last updated: Sat Aug  8 12:07:56 2020
> Last change: Fri Jul 24 07:18:12 2020 by root via cibadmin on host1.domain.com
> 
> 3 nodes configured
> 6 resources configured
> 
> Online: [ host3.domain.com host1.domain.com ]
> OFFLINE: [ host2.domain.com ]
> 
> 
> The problem is that one of the resources is a floating IP address,
> which is currently assigned to two different hosts...
> 
> 
> Can you help me configure the cluster correctly so this cannot
> occur?
> 
> 
> Thanks in advance,
> 
> Adam.
> 
> 
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
-- 
Ken Gaillot <kgaillot at redhat.com>


