[ClusterLabs] 2-Node Cluster - fencing with just one node running ?

Ken Gaillot kgaillot at redhat.com
Mon Aug 8 09:36:35 EDT 2022


On Thu, 2022-08-04 at 10:43 -0700, Reid Wahl wrote:
> On Thu, Aug 4, 2022 at 6:07 AM Lentes, Bernd
> <bernd.lentes at helmholtz-muenchen.de> wrote:
> > 
> > ----- On 4 Aug, 2022, at 00:27, Reid Wahl nwahl at redhat.com wrote:
> > 
> > > Such constraints are unnecessary.
> > > 
> > > Let's say we have two stonith devices called "fence_dev1" and
> > > "fence_dev2" that fence nodes 1 and 2, respectively. If node 2
> > > needs to be fenced, and fence_dev2 is running on node 2, node 1
> > > will still use fence_dev2 to fence node 2. The current location
> > > of the stonith device only tells us which node is running the
> > > recurring monitor operation for that stonith device. The device
> > > is available to ALL nodes, unless it's disabled or it's banned
> > > from a given node. So these constraints serve no purpose in most
> > > cases.
> > 
> > What do you mean by "banned"? "crm resource ban ..."?
> 
> Yes. If you run `pcs resource ban fence_dev1 node-1` (I presume `crm
> resource ban` does the same thing), then:
>   - fence_dev1 is not allowed to run on node-1
>   - node-1 is not allowed to use fence_dev1 to fence a node
> 
> If you disable fence_dev1 (the pcs command would be `pcs resource
> disable`, which sets the target-role meta attribute to Stopped), then
> **no** node can use fence_dev1 to fence a node.
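> 
> In the CIB, a disabled device simply gains a target-role meta
> attribute, roughly like this (a sketch; the fence agent type and the
> generated IDs here are only illustrative):
> 
>     <primitive id="fence_dev1" class="stonith" type="fence_ipmilan">
>       <meta_attributes id="fence_dev1-meta_attributes">
>         <nvpair id="fence_dev1-meta_attributes-target-role"
>                 name="target-role" value="Stopped"/>
>       </meta_attributes>
>     </primitive>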
> 
> > Is that something different from a location constraint?
> 
> It creates a -INFINITY location constraint.
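> 
> In the CIB that looks roughly like this (a sketch; the exact
> constraint ID depends on the tool and version):
> 
>     <rsc_location id="cli-ban-fence_dev1-on-node-1" rsc="fence_dev1"
>                   role="Started" node="node-1" score="-INFINITY"/>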
> 
> The same might also apply when a stonith device has a finite negative
> preference for a given node -- not sure without testing.

Correct, even a finite negative score will make the device unusable by
the affected node.
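
For example, a constraint created with something like this (a sketch
reusing Reid's example names; the constraint ID and score are
arbitrary):

    pcs constraint location add fence_dev1-avoids-node-1 \
        fence_dev1 node-1 -1000

still prevents node-1 from using fence_dev1 for fencing, even though
the scheduler would otherwise be free to run the device's recurring
monitor there.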

I'd like to change that someday, but the main issue is that the fencer
currently doesn't need to run the full scheduler; it just needs to
check the device configuration. If we want to differentiate based on
node scores, or support rules for fence device parameters, the fencer
will need to run the full scheduler, which would add significant
overhead and bug exposure.

> 
> > > If you ban fence_dev2 from node 1, then node 1 won't be able to
> > > use fence_dev2 to fence node 2. Likewise, if you ban fence_dev1
> > > from node 1, then node 1 won't be able to use fence_dev1 to
> > > fence itself. Usually that's unnecessary anyway, but it may be
> > > preferable to power ourselves off if we're the last remaining
> > > node and a stop operation fails.
> > So banning a fencing device from a node means that this node can't
> > use the fencing device ?
> > 
> > > If ha-idg-2 is in standby, it can still fence ha-idg-1. It
> > > sounds like you've banned fence_ilo_ha-idg-1 from ha-idg-1, so
> > > that it can't run anywhere when ha-idg-2 is in standby; I'm not
> > > sure off the top of my head whether fence_ilo_ha-idg-1 is
> > > available in this situation. It may not be.
> > 
> > ha-idg-2 was not only in standby, I also stopped Pacemaker on
> > that node.
> > Then ha-idg-2 can't fence ha-idg-1, I assume.
> 
> Correct, ha-idg-2 can't fence ha-idg-1 if ha-idg-2 is stopped.
> 
> > > A solution would be to stop banning the stonith devices from
> > > their respective nodes. Surely if fence_ilo_ha-idg-1 had been
> > > running on ha-idg-1, ha-idg-2 would have been able to use it to
> > > fence ha-idg-1. (Again, I'm not sure if that's still true if
> > > ha-idg-2 is in standby **and** fence_ilo_ha-idg-1 is banned
> > > from ha-idg-1.)
> > > 
> > > > Aug 03 01:19:58 [19364] ha-idg-1 stonith-ng:   notice: log_operation:
> > > > Operation 'Off' [20705] (call 2 from crmd.19368) for host
> > > > 'ha-idg-1' with device 'fence_ilo_ha-idg-2' returned: 0 (OK)
> > > >
> > > > So the cluster starts the resources running on ha-idg-1 and
> > > > cuts off ha-idg-2, which isn't necessary.
> > > 
> > > Here, it sounds like the pcmk_host_list setting is either
> > > missing or misconfigured for fence_ilo_ha-idg-2.
> > > fence_ilo_ha-idg-2 should NOT be usable for fencing ha-idg-1.
> > > 
> > > fence_ilo_ha-idg-1 should be configured with
> > > pcmk_host_list=ha-idg-1, and fence_ilo_ha-idg-2 should be
> > > configured with pcmk_host_list=ha-idg-2.
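> > > 
> > > With pcs, that would look something like this (a sketch; adjust
> > > to however the devices were originally created):
> > > 
> > >   pcs stonith update fence_ilo_ha-idg-1 pcmk_host_list=ha-idg-1
> > >   pcs stonith update fence_ilo_ha-idg-2 pcmk_host_list=ha-idg-2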
> > 
> > I will check that.
> > 
> > > What happened is that ha-idg-1 used fence_ilo_ha-idg-2 to fence
> > > itself. Of course, this only rebooted ha-idg-2. But based on the
> > > stonith device configuration, pacemaker on ha-idg-1 believed that
> > > ha-idg-1 had been fenced. Hence the "allegedly just fenced"
> > > message.
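> > > 
> > > To check which devices the fencer considers able to fence a given
> > > target, something like this should show it (assuming the
> > > stonith_admin tool is available on the node):
> > > 
> > >   stonith_admin --list ha-idg-1
> > > 
> > > With pcmk_host_list set correctly, only fence_ilo_ha-idg-1 should
> > > be listed for ha-idg-1.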
> > > 
> > > > Finally the cluster seems to realize that something went wrong:
> > > > Aug 03 01:19:58 [19368] ha-idg-1       crmd:     crit:
> > > > tengine_stonith_notify:
> > > > We were allegedly just fenced by ha-idg-1 for ha-idg-1!
> > 
> > Bernd
> 
> 
-- 
Ken Gaillot <kgaillot at redhat.com>


