[ClusterLabs] 2-Node Cluster - fencing with just one node running ?

Reid Wahl nwahl at redhat.com
Thu Aug 4 13:43:54 EDT 2022


On Thu, Aug 4, 2022 at 6:07 AM Lentes, Bernd
<bernd.lentes at helmholtz-muenchen.de> wrote:
>
>
> ----- On 4 Aug, 2022, at 00:27, Reid Wahl nwahl at redhat.com wrote:
>
> >
> > Such constraints are unnecessary.
> >
> > Let's say we have two stonith devices called "fence_dev1" and
> > "fence_dev2" that fence nodes 1 and 2, respectively. If node 2 needs
> > to be fenced, and fence_dev2 is running on node 2, node 1 will still
> > use fence_dev2 to fence node 2. The current location of the stonith
> > device only tells us which node is running the recurring monitor
> > operation for that stonith device. The device is available to ALL
> > nodes, unless it's disabled or it's banned from a given node. So these
> > constraints serve no purpose in most cases.
>
> What do you mean by "banned"? "crm resource ban ..."?

Yes. If you run `pcs resource ban fence_dev1 node-1` (I presume `crm
resource ban` does the same thing), then:
  - fence_dev1 is not allowed to run on node-1
  - node-1 is not allowed to use fence_dev1 to fence a node

If you disable fence_dev1 (the pcs command would be `pcs resource
disable`, which sets the target-role meta attribute to Stopped), then
**no** node can use fence_dev1 to fence a node.
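
For reference, the pcs commands for all of this would be something like
the following (crm should have equivalents; treat this as a sketch
rather than exact syntax for your version):

    # ban: fence_dev1 can't run on node-1, and node-1 can't use it
    pcs resource ban fence_dev1 node-1

    # disable: no node can run or use fence_dev1
    pcs resource disable fence_dev1

    # undo the above
    pcs resource clear fence_dev1
    pcs resource enable fence_dev1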

> Is that something different than a location constraint ?

It creates a -INFINITY location constraint.

The same might also apply when a stonith device has a finite negative
preference for a given node -- not sure without testing.
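
If you wanted to try that, such a constraint would be created with
something along these lines (again a sketch; I haven't tested whether
it changes which node can *use* the device):

    # give fence_dev1 a finite negative preference (-1000) for node-1
    pcs constraint location fence_dev1 avoids node-1=1000

    # review the location constraints that exist
    pcs constraint --full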

>
> > If you ban fence_dev2 from node 1, then node 1 won't be able to use
> > fence_dev2 to fence node 2. Likewise, if you ban fence_dev1 from node
> > 1, then node 1 won't be able to use fence_dev1 to fence itself.
> > Usually that's unnecessary anyway, but it may be preferable to power
> > ourselves off if we're the last remaining node and a stop operation
> > fails.
> So banning a fencing device from a node means that this node can't use the fencing device ?
>
> > If ha-idg-2 is in standby, it can still fence ha-idg-1. It sounds
> > like you've banned fence_ilo_ha-idg-1 from ha-idg-1, so that it
> > can't run anywhere when ha-idg-2 is in standby; I'm not sure off the
> > top of my head whether fence_ilo_ha-idg-1 is available in this
> > situation. It may not be.
>
> ha-idg-2 was not only in standby, I also stopped pacemaker on that node.
> Then ha-idg-2 can't fence ha-idg-1, I assume.

Correct, ha-idg-2 can't fence ha-idg-1 if ha-idg-2 is stopped.
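
If you want to check what the surviving node thinks it can do, running
something like this on it should show which devices it believes can
fence a given target (a sketch; check `stonith_admin --help` on your
version for the exact options):

    # devices the local node considers capable of fencing ha-idg-1
    stonith_admin --list ha-idg-1

    # all stonith devices currently registered on this node
    stonith_admin --list-registered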

>
> >
> > A solution would be to stop banning the stonith devices from their
> > respective nodes. Surely if fence_ilo_ha-idg-1 had been running on
> > ha-idg-1, ha-idg-2 would have been able to use it to fence ha-idg-1.
> > (Again, I'm not sure if that's still true if ha-idg-2 is in standby
> > **and** fence_ilo_ha-idg-1 is banned from ha-idg-1.)
> >
> >> Aug 03 01:19:58 [19364] ha-idg-1 stonith-ng:   notice: log_operation:
> >> Operation 'Off' [20705] (call 2 from crmd.19368) for host 'ha-idg-1' with
> >> device 'fence_ilo_ha-idg-2' returned: 0 (OK)
> >> So the cluster starts the resource running on ha-idg-1 and cuts off ha-idg-2,
> >> which isn't necessary.
> >
> > Here, it sounds like the pcmk_host_list setting is either missing or
> > misconfigured for fence_ilo_ha-idg-2. fence_ilo_ha-idg-2 should NOT be
> > usable for fencing ha-idg-1.
> >
> > fence_ilo_ha-idg-1 should be configured with pcmk_host_list=ha-idg-1,
> > and fence_ilo_ha-idg-2 should be configured with
> > pcmk_host_list=ha-idg-2.
>
> I will check that.
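
When you check, something like the following should do it (a sketch --
the exact subcommands depend on your pcs version; older pcs uses `pcs
stonith show --full` instead of `config`):

    # show the current device configuration, including pcmk_host_list
    pcs stonith config fence_ilo_ha-idg-1
    pcs stonith config fence_ilo_ha-idg-2

    # if pcmk_host_list is missing or wrong, set each device to target
    # only its own node
    pcs stonith update fence_ilo_ha-idg-1 pcmk_host_list=ha-idg-1
    pcs stonith update fence_ilo_ha-idg-2 pcmk_host_list=ha-idg-2

With crmsh, `crm configure show fence_ilo_ha-idg-1` and `crm configure
edit` should get you to the same settings.
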
>
> > What happened is that ha-idg-1 used fence_ilo_ha-idg-2 to fence
> > itself. Of course, this only rebooted ha-idg-2. But based on the
> > stonith device configuration, pacemaker on ha-idg-1 believed that
> > ha-idg-1 had been fenced. Hence the "allegedly just fenced" message.
> >
> >>
> >> Finally the cluster seems to realize that something went wrong:
> >> Aug 03 01:19:58 [19368] ha-idg-1       crmd:     crit: tengine_stonith_notify:
> >> We were allegedly just fenced by ha-idg-1 for ha-idg-1!
>
> Bernd
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker


