[ClusterLabs] Resources too_active (active on all nodes of the cluster, instead of only 1 node)

Ken Gaillot kgaillot at redhat.com
Thu Mar 24 11:11:09 EDT 2022


On Wed, 2022-03-23 at 05:30 +0000, Balotra, Priyanka wrote:
> Hi All,
>  
> We have a scenario on SLES 12 SP3 cluster.
> The scenario is explained as follows in the order of events:
> - There is a 2-node cluster (FILE-1, FILE-2).
> - The cluster and the resources were up and running fine initially.
> - Then a fencing request from pacemaker was issued on both nodes
>   simultaneously.
>  
> Logs from 1st node:  
> 2022-02-22T03:26:36.737075+00:00 FILE-1 corosync[12304]: [TOTEM ]
> Failed to receive the leave message. failed: 2
> .
> .
> 2022-02-22T03:26:36.977888+00:00 FILE-1 pacemaker-fenced[12331]:
> notice: Requesting that FILE-1 perform 'off' action targeting FILE-2
>  
> Logs from 2nd node:  
> 2022-02-22T03:26:36.738080+00:00 FILE-2 corosync[4989]: [TOTEM ]
> Failed to receive the leave message. failed: 1
> .
> .
> Feb 22 03:26:38 FILE-2 pacemaker-fenced [5015] (call_remote_stonith)
> notice: Requesting that FILE-2 perform 'off' action targeting FILE-1
>  
> - When the nodes came up after unfencing, the DC was set after an
>   election.
> - After that, the resources which were expected to run on only one
>   node became active on both (all) nodes of the cluster.
>  
>  27290 2022-02-22T04:16:31.699186+00:00 FILE-2 pacemaker-schedulerd[5018]:
> error: Resource stonith-sbd is active on 2 nodes (attempting recovery)
> 27291 2022-02-22T04:16:31.699397+00:00 FILE-2 pacemaker-schedulerd[5018]:
> notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active
> for more information
> 27292 2022-02-22T04:16:31.699590+00:00 FILE-2 pacemaker-schedulerd[5018]:
> error: Resource FILE_Filesystem is active on 2 nodes (attempting recovery)
> 27293 2022-02-22T04:16:31.699731+00:00 FILE-2 pacemaker-schedulerd[5018]:
> notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active
> for more information
> 27294 2022-02-22T04:16:31.699878+00:00 FILE-2 pacemaker-schedulerd[5018]:
> error: Resource IP_Floating is active on 2 nodes (attempting recovery)
> 27295 2022-02-22T04:16:31.700027+00:00 FILE-2 pacemaker-schedulerd[5018]:
> notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active
> for more information
> 27296 2022-02-22T04:16:31.700203+00:00 FILE-2 pacemaker-schedulerd[5018]:
> error: Resource Service_Postgresql is active on 2 nodes (attempting recovery)
> 27297 2022-02-22T04:16:31.700354+00:00 FILE-2 pacemaker-schedulerd[5018]:
> notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active
> for more information
> 27298 2022-02-22T04:16:31.700501+00:00 FILE-2 pacemaker-schedulerd[5018]:
> error: Resource Service_Postgrest is active on 2 nodes (attempting recovery)
> 27299 2022-02-22T04:16:31.700648+00:00 FILE-2 pacemaker-schedulerd[5018]:
> notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active
> for more information
> 27300 2022-02-22T04:16:31.700792+00:00 FILE-2 pacemaker-schedulerd[5018]:
> error: Resource Service_esm_primary is active on 2 nodes (attempting recovery)
> 27301 2022-02-22T04:16:31.700939+00:00 FILE-2 pacemaker-schedulerd[5018]:
> notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active
> for more information
> 27302 2022-02-22T04:16:31.701086+00:00 FILE-2 pacemaker-schedulerd[5018]:
> error: Resource Shared_Cluster_Backup is active on 2 nodes
> (attempting recovery)
>  
> Can you please help us understand whether this is indeed a split-brain
> scenario? Under what circumstances can such a scenario be observed?

This does look like a split-brain, and the most likely cause is that
the fence agent reported that fencing was successful, but it actually
wasn't.
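
A quick way to check is to compare what the fencer recorded with what
actually happened on the device. A rough sketch (exact options vary by
Pacemaker/SBD version, and the disk path is a placeholder):

    # Fencing actions Pacemaker believes were carried out
    stonith_admin --history '*' --verbose

    # For disk-based SBD, inspect the message slots on the device
    sbd -d /dev/disk/by-id/<shared-disk> list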

What are you using as a fencing device?
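
If you're not sure, something like the following should show what's
configured (assuming the crmsh tooling that ships with SLES):

    # Fence devices registered with the fencer
    stonith_admin --list-registered

    # Their configuration in the CIB
    crm configure show | grep -A2 stonith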

If you're using watchdog-based SBD, that won't work with only two
nodes, because each node will assume it still has quorum and will not
self-fence. You need either true quorum or a shared external drive to
use SBD.
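
For a two-node cluster, the usual pattern is disk-based SBD on a shared
LUN plus corosync's two_node option. A minimal sketch, assuming a shared
disk at a placeholder path and default timeouts:

    # /etc/sysconfig/sbd (on both nodes)
    SBD_DEVICE="/dev/disk/by-id/<shared-disk>"
    SBD_WATCHDOG_DEV="/dev/watchdog"

    # Initialize the device once, from one node
    sbd -d /dev/disk/by-id/<shared-disk> create

    # Fencing resource (crmsh syntax); a random delay helps keep the
    # two nodes from fencing each other at the same moment
    crm configure primitive stonith-sbd stonith:external/sbd \
        params pcmk_delay_max=30

    # /etc/corosync/corosync.conf
    quorum {
        provider: corosync_votequorum
        two_node: 1
    }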

> We could see a very serious impact if such a case were to recur in
> spite of stonith already being configured; hence the ask.
> If this situation is reproduced, how can it be handled?
> 
> Note: We have stonith configured and it has been working fine so far.
> In this case too, the initial fencing was indeed performed via stonith.
>  
> Thanks in advance!
-- 
Ken Gaillot <kgaillot at redhat.com>


