[ClusterLabs] Resources too_active (active on all nodes of the cluster, instead of only 1 node)

Klaus Wenninger kwenning at redhat.com
Tue Mar 29 10:47:06 EDT 2022


On Thu, Mar 24, 2022 at 4:12 PM Ken Gaillot <kgaillot at redhat.com> wrote:
>
> On Wed, 2022-03-23 at 05:30 +0000, Balotra, Priyanka wrote:
> > Hi All,
> >
> > We have a scenario on SLES 12 SP3 cluster.
> > The scenario is explained as follows in the order of events:
> >  There is a 2-node cluster (FILE-1, FILE-2)
> >  The cluster and the resources were up and running fine initially.
> >  Then a fencing request from Pacemaker was issued on both nodes
> > simultaneously.
> >
> > Logs from 1st node:
> > 2022-02-22T03:26:36.737075+00:00 FILE-1 corosync[12304]: [TOTEM ]
> > Failed to receive the leave message. failed: 2
> > .
> > .
> > 2022-02-22T03:26:36.977888+00:00 FILE-1 pacemaker-fenced[12331]:
> > notice: Requesting that FILE-1 perform 'off' action targeting FILE-2
> >
> > Logs from 2nd node:
> > 2022-02-22T03:26:36.738080+00:00 FILE-2 corosync[4989]: [TOTEM ]
> > Failed to receive the leave message. failed: 1
> > .
> > .
> > Feb 22 03:26:38 FILE-2 pacemaker-fenced [5015] (call_remote_stonith)
> > notice: Requesting that FILE-2 perform 'off' action targeting FILE-1
> >
> >  When the nodes came up again after unfencing, the DC got set after
> > an election.
> >  After that, the resources which were expected to run on only one
> > node became active on both (all) nodes of the cluster.
> >
> > 2022-02-22T04:16:31.699186+00:00 FILE-2 pacemaker-schedulerd[5018]:
> > error: Resource stonith-sbd is active on 2 nodes (attempting recovery)
> > 2022-02-22T04:16:31.699397+00:00 FILE-2 pacemaker-schedulerd[5018]:
> > notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active
> > for more information
> > 2022-02-22T04:16:31.699590+00:00 FILE-2 pacemaker-schedulerd[5018]:
> > error: Resource FILE_Filesystem is active on 2 nodes (attempting recovery)
> > 2022-02-22T04:16:31.699731+00:00 FILE-2 pacemaker-schedulerd[5018]:
> > notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active
> > for more information
> > 2022-02-22T04:16:31.699878+00:00 FILE-2 pacemaker-schedulerd[5018]:
> > error: Resource IP_Floating is active on 2 nodes (attempting recovery)
> > 2022-02-22T04:16:31.700027+00:00 FILE-2 pacemaker-schedulerd[5018]:
> > notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active
> > for more information
> > 2022-02-22T04:16:31.700203+00:00 FILE-2 pacemaker-schedulerd[5018]:
> > error: Resource Service_Postgresql is active on 2 nodes (attempting
> > recovery)
> > 2022-02-22T04:16:31.700354+00:00 FILE-2 pacemaker-schedulerd[5018]:
> > notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active
> > for more information
> > 2022-02-22T04:16:31.700501+00:00 FILE-2 pacemaker-schedulerd[5018]:
> > error: Resource Service_Postgrest is active on 2 nodes (attempting
> > recovery)
> > 2022-02-22T04:16:31.700648+00:00 FILE-2 pacemaker-schedulerd[5018]:
> > notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active
> > for more information
> > 2022-02-22T04:16:31.700792+00:00 FILE-2 pacemaker-schedulerd[5018]:
> > error: Resource Service_esm_primary is active on 2 nodes (attempting
> > recovery)
> > 2022-02-22T04:16:31.700939+00:00 FILE-2 pacemaker-schedulerd[5018]:
> > notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active
> > for more information
> > 2022-02-22T04:16:31.701086+00:00 FILE-2 pacemaker-schedulerd[5018]:
> > error: Resource Shared_Cluster_Backup is active on 2 nodes (attempting
> > recovery)
> >
> > Can you guys please help us understand whether this is indeed a
> > split-brain scenario? Under what circumstances can such a scenario
> > be observed?
>
> This does look like a split-brain, and the most likely cause is that
> the fence agent reported that fencing was successful, but it actually
> wasn't.
>
> What are you using as a fencing device?
>
> If you're using watchdog-based SBD, that won't work with only two
> nodes, because both nodes will assume they still have quorum, and not
> self-fence. You need either true quorum or a shared external drive to
> use SBD.
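
To illustrate Ken's point about quorum in a 2-node cluster: with
corosync's two_node setting both halves of a split cluster keep
quorum, so watchdog-only SBD has no way to decide which side has
to go. You can check how your cluster is set up with e.g.:

  # flags should show whether 2Node / WaitForAll are active
  corosync-quorumtool -s

  # corresponding setting in /etc/corosync/corosync.conf:
  #   quorum {
  #       provider: corosync_votequorum
  #       two_node: 1
  #   }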

We see a fencing resource stonith-sbd, so I would guess
poison-pill fencing is configured.
We should then also verify that stonith-watchdog-timeout is not
set to anything but 0 - just to be sure the cluster would never
fall back to watchdog fencing.
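For example, the property can be queried directly (unset or 0
means no watchdog fallback):

  crm_attribute --type crm_config --name stonith-watchdog-timeout --query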
Maybe you can try inserting the poison pill manually and see
whether the targeted node reboots. You can do that either with
high-level tooling such as crmsh or pcs, or with the sbd binary
directly on the command line.
Try that both from the node to be rebooted and from the other
node, e.g. to check whether both sides see the same disk(s).
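A minimal sketch with the sbd binary (/dev/sdx is just a
placeholder - use the disk(s) from your configuration):

  # show the slot states on the shared disk
  sbd -d /dev/sdx list

  # write a harmless "test" pill for FILE-1; sbd on FILE-1 should
  # acknowledge it in its log
  sbd -d /dev/sdx message FILE-1 test

  # write a real poison pill - FILE-1 should reboot if delivery works
  sbd -d /dev/sdx message FILE-1 reset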
Check that the disk(s) configured for the sbd service are the
same as those configured for the sbd fencing resource (and of
course, when using sbd as a command-line tool to insert a poison
pill, the same disks have to be used as well).
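On SLES that comparison could look roughly like this:

  # disk(s) the sbd service picks up at start
  grep SBD_DEVICE /etc/sysconfig/sbd

  # disk(s) the fencing resource is configured with
  crm configure show stonith-sbd

  # sanity-check the sbd metadata on the disk itself
  sbd -d /dev/sdx dump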
Is the sbd service running without complaints?
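E.g.:

  systemctl status sbd.service
  journalctl -b -u sbd.service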
Please check as well that a (hardware) watchdog is properly
configured with sbd. In this case I guess we should have seen a
reboot even with a non-working watchdog, as both nodes seem to
have been alive enough. But it is still important that the
watchdog works properly for the cases where a node isn't
responsive anymore.
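sbd brings helpers to check the watchdog side (careful:
test-watchdog is expected to hard-reset the machine when the
watchdog is functional):

  # watchdog devices sbd can discover
  sbd query-watchdog

  # the device sbd is configured to use
  grep SBD_WATCHDOG_DEV /etc/sysconfig/sbd

  # CAUTION: reboots the node if the watchdog works
  sbd test-watchdog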

Klaus

>
> > It could have a very serious impact if such a case were to recur
> > in spite of stonith already being configured - hence the ask.
> > In case this situation gets reproduced, how can it be handled?
> >
> > Note: We have stonith configured and it has been working fine so far.
> > In this case too, the initial fencing was done through stonith.
> >
> > Thanks in advance!
> --
> Ken Gaillot <kgaillot at redhat.com>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


