[ClusterLabs] Resources too_active (active on all nodes of the cluster, instead of only 1 node)

Thu Mar 24 01:24:39 EDT 2022

On 23.03.2022 08:30, Balotra, Priyanka wrote:
> Hi All,
> 
> We have a scenario on a SLES 12 SP3 cluster.
> The scenario is explained as follows in the order of events:
> 
>   *   There is a 2-node cluster (FILE-1, FILE-2)
>   *   The cluster and the resources were up and running fine initially.
>   *   Then a fencing request from pacemaker was issued on both nodes simultaneously
> 
> Logs from 1st node:
> 2022-02-22T03:26:36.737075+00:00 FILE-1 corosync[12304]: [TOTEM ] Failed to receive the leave message. failed: 2
> .
> .
> 2022-02-22T03:26:36.977888+00:00 FILE-1 pacemaker-fenced[12331]: notice: Requesting that FILE-1 perform 'off' action targeting FILE-2
> 
> Logs from 2nd node:
> 2022-02-22T03:26:36.738080+00:00 FILE-2 corosync[4989]: [TOTEM ] Failed to receive the leave message. failed: 1
> .
> .
> Feb 22 03:26:38 FILE-2 pacemaker-fenced [5015] (call_remote_stonith) notice: Requesting that FILE-2 perform 'off' action targeting FILE-1
> 

This is normal behavior in case of split brain. Each node will try to
fence the other node so that it can take over resources from it.

> 
>   *   When the nodes came up after unfencing, the DC got set after election

What exactly does "came up" mean?

>   *   After that the resources which were expected to run on only one node became active on both (all) nodes of the cluster.
> 

It sounds like both nodes believed fencing had been successful, and so
each node took over resources from the other node. It is impossible to
tell more without seeing the actual logs from both nodes and the actual
configuration.
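
When going through the full logs, it may help to first list which
resources the scheduler reported as too active. Below is a minimal
Python sketch for that. The function name too_active_resources is made
up for this example, and the regular expression only assumes the
"Resource ... is active on N nodes" wording visible in the log excerpt
quoted in this thread; other Pacemaker versions may word it differently.

```python
import re
from typing import Dict, Iterable

# Wording taken from the pacemaker-schedulerd errors quoted in this thread.
TOO_ACTIVE = re.compile(r"Resource (\S+) is active on (\d+) nodes")


def too_active_resources(lines: Iterable[str]) -> Dict[str, int]:
    """Map each resource reported as too active to the highest node count seen."""
    found: Dict[str, int] = {}
    for line in lines:
        match = TOO_ACTIVE.search(line)
        if match:
            name, count = match.group(1), int(match.group(2))
            found[name] = max(count, found.get(name, 0))
    return found
```

It can be fed any iterable of log lines, for example an open log file,
and returns a dictionary keyed by resource name.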

> 2022-02-22T04:16:31.699186+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource stonith-sbd is active on 2 nodes (attempting recovery)
> 2022-02-22T04:16:31.699397+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> 2022-02-22T04:16:31.699590+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource FILE_Filesystem is active on 2 nodes (attempting recovery)
> 2022-02-22T04:16:31.699731+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> 2022-02-22T04:16:31.699878+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource IP_Floating is active on 2 nodes (attempting recovery)
> 2022-02-22T04:16:31.700027+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> 2022-02-22T04:16:31.700203+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_Postgresql is active on 2 nodes (attempting recovery)
> 2022-02-22T04:16:31.700354+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> 2022-02-22T04:16:31.700501+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_Postgrest is active on 2 nodes (attempting recovery)
> 2022-02-22T04:16:31.700648+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> 2022-02-22T04:16:31.700792+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_esm_primary is active on 2 nodes (attempting recovery)
> 2022-02-22T04:16:31.700939+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> 2022-02-22T04:16:31.701086+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Shared_Cluster_Backup is active on 2 nodes (attempting recovery)
> 
> 
> Can you guys please help us understand if this is indeed a split-brain scenario?

I do not understand this question, and I suspect you are using "split
brain" incorrectly. Split brain is the condition in which
corosync/pacemaker on two nodes cannot communicate. Split brain ends
with a fencing request.

> Under what circumstances can such a scenario be observed?

If "such a scenario" refers to split brain: whenever the two nodes are
unable to communicate with each other.

> We can have a very serious impact if such a case can re-occur in spite of stonith already being configured. Hence the ask.
> In case this situation gets reproduced, how can it be handled?
> 

A stonith agent must never return success unless it can confirm that
fencing was successful.
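
To illustrate the rule, here is a minimal Python sketch of an "off"
action that reports success only after the power state has been
confirmed. It is not code from any real fence agent: fence_off,
request_off and power_is_off are hypothetical names, and the two
callables stand in for whatever the actual fencing device provides.
The only convention assumed is that an agent exits with 0 on success
and non-zero on failure.

```python
import time
from typing import Callable


def fence_off(request_off: Callable[[], None],
              power_is_off: Callable[[], bool],
              timeout: float = 60.0,
              poll_interval: float = 1.0,
              sleep: Callable[[float], None] = time.sleep,
              clock: Callable[[], float] = time.monotonic) -> int:
    """Return 0 only once the target is confirmed off, otherwise 1."""
    try:
        request_off()          # hypothetical: ask the device to power off
    except Exception:
        return 1               # request could not be sent: report failure
    deadline = clock() + timeout
    while True:
        try:
            if power_is_off(): # hypothetical: query the device for state
                return 0       # confirmed off: only now report success
        except Exception:
            pass               # unknown state is not "off"; keep polling
        if clock() >= deadline:
            return 1           # never confirmed: report failure
        sleep(poll_interval)
```

The point of the design is that every path without a positive
confirmation (request failed, status unknown, timeout) ends in a
failure code, so the cluster does not start resources on the
assumption that the other node is down.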

> Note: We have stonith configured and it has been working fine so far. In this case also, the initial fencing happened from stonith only.
> 
> Thanks in advance!