[ClusterLabs] Automatic recovery from split brain?

Andrei Borzenkov arvidjaar at gmail.com
Tue Aug 11 02:48:15 EDT 2020


On 08.08.2020 13:10, Adam Cécile wrote:
> Hello,
> 
> 
> I'm experiencing an issue with corosync/pacemaker running on Debian
> Buster. The cluster has three nodes running in VMware virtual machines,
> and it fails when Veeam backs up a virtual machine (I know the backup
> does bad things, like completely freezing the VM for a few minutes to
> take a disk snapshot).
> 
> My biggest issue is that once the backup has completed, the cluster
> stays in a split-brain state, and I'd like it to heal itself. Here is
> the current status:
> 
> 
> One node is isolated:
> 
> Stack: corosync
> Current DC: host2.domain.com (version 2.0.1-9e909a5bdd) - partition
> WITHOUT quorum
> Last updated: Sat Aug  8 11:59:46 2020
> Last change: Fri Jul 24 07:18:12 2020 by root via cibadmin on
> host1.domain.com
> 
> 3 nodes configured
> 6 resources configured
> 
> Online: [ host2.domain.com ]
> OFFLINE: [ host3.domain.com host1.domain.com ]
> 
> 
> The two other nodes see each other:
> 
> Stack: corosync
> Current DC: host3.domain.com (version 2.0.1-9e909a5bdd) - partition with
> quorum
> Last updated: Sat Aug  8 12:07:56 2020
> Last change: Fri Jul 24 07:18:12 2020 by root via cibadmin on
> host1.domain.com
> 
> 3 nodes configured
> 6 resources configured
> 
> Online: [ host3.domain.com host1.domain.com ]
> OFFLINE: [ host2.domain.com ]
> 

Show your full configuration including defined STONITH resources and
cluster options (most importantly, no-quorum-policy and stonith-enabled).
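
For the record, something like this dumps everything relevant (cibadmin
and crm_attribute ship with pacemaker itself, regardless of whether you
use pcs or crmsh as a frontend):

  # complete live CIB, including any STONITH resources
  cibadmin --query

  # the two cluster options in question
  crm_attribute --type crm_config --query --name no-quorum-policy
  crm_attribute --type crm_config --query --name stonith-enabled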

> 
> The problem is that one of the resources is a floating IP address which
> is currently assigned to two different hosts...
> 

Of course - each partition assumes the other partition is dead, and so
each considers itself free to take over the remaining resources.
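
That both partitions run the floating IP suggests the quorum-less side
is not being told to stop anything - i.e. no-quorum-policy may be set to
ignore. With no-quorum-policy=stop (the pacemaker 2.0 default) the
minority partition stops its resources instead. A minimal sketch,
assuming the pcs frontend is installed:

  # make a partition without quorum stop all resources it runs
  pcs property set no-quorum-policy=stop

Note this alone does not make recovery safe - a frozen node cannot stop
anything while it is frozen, which is what STONITH is for (see below).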

> 
> Can you help me configure the cluster correctly so this cannot occur?
> 

Define "correctly".

The most straightforward textbook answer: you need STONITH resources
that will eliminate the "lost" node. But your lost node is in the middle
of performing a backup, and eliminating it may invalidate the backup
being created.
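
Since the nodes are VMware guests, a VMware fence agent is the natural
fit. A hedged sketch, assuming pcs, the fence_vmware_rest agent, and
made-up vCenter host and credentials (parameter names vary a little
between fence-agents versions):

  pcs stonith create vmfence fence_vmware_rest \
      ip=vcenter.example.com username=fenceuser password=secret \
      ssl=1 ssl_insecure=1 \
      pcmk_host_map="host1.domain.com:host1-vm;host2.domain.com:host2-vm;host3.domain.com:host3-vm"
  pcs property set stonith-enabled=true

pcmk_host_map maps each cluster node name to the VM name vCenter knows
it by.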

So another answer would be: put the cluster in maintenance mode, perform
the backup, then resume normal operation. Backup software usually allows
hooks to be executed before and after the backup. That may work too.
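
With Veeam on VMware that would be its pre-freeze/post-thaw script
slots, but any pre/post hook mechanism works. A sketch, assuming pcs:

  # pre-backup hook: stop managing resources, leave everything as-is
  pcs property set maintenance-mode=true

  # post-backup hook: resume normal management
  pcs property set maintenance-mode=false

In maintenance mode pacemaker neither stops nor starts anything, so the
other nodes will not try to take over the IP while one node is frozen.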

Or find a way to not freeze the VM during backup at all ... e.g. by
using a different backup method?

