[ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

Mon Jun 22 17:19:12 EDT 2020

On Sat, 2020-06-20 at 08:47 +0300, Andrei Borzenkov wrote:
> 19.06.2020 01:13, Howard wrote:
> > Thanks for all the help so far.  With your assistance, I'm very
> > close to
> > stable.
> > 
> > Made the following changes to the vmfence stonith resource:
> > 
> > Meta Attrs: failure-timeout=30m migration-threshold=10
> >   Operations: monitor interval=60s (vmfence-monitor-interval-60s)
> > 
> > If I understand this correctly, it will check if the fencing device
> > is
> > online every 60 seconds. It will try 10 times and then mark the
> > node
> > ineligible.
> 
> No. That's the main problem - stonith resource failure on a node does
> not affect whether this node can be selected to perform stonith. Node
> becomes ineligible for *monitoring* operation, that's all.
> 
> Resource could be marked as failed on all nodes and still fencing
> will
> be attempted.
> 
> That is very counter-intuitive, OTOH this allows fencing to work even
> in
> case of transient issues.
> 
> I wonder if pacemaker will cycle through available nodes though.
> Consider three node cluster nodeA, nodeB, nodeC. nodeA is lost, nodeB
> is
> selected but cannot perform stonith for whatever reason. Will
> pacemaker retry on nodeC? Under which conditions (number of retries
> on
> nodeB, whatever)? If nodeC fails too, will pacemaker restart cycle
> from
> the beginning?

When selecting a node to execute fencing, pacemaker prefers (1) a node
that runs a recurring monitor on the device; (2) any other node besides
the target; or (3) the target, if no other node is available.

By default pacemaker will attempt to execute a fencing action twice.
This is customizable via the pcmk_reboot_retries / pcmk_off_retries /
etc. stonith device parameters (instance attributes). IIRC, the second
attempt will be tried on a different node if one is available.

However each attempt eats into the overall timeout. If the first
attempt hangs and uses up all the timeout, then no further attempts
will be made.
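
As a toy model of how the retries share one timeout budget, consider
the sketch below. The function and its inputs are hypothetical (each
attempt is given as a duration in seconds plus whether it would
succeed), and real pacemaker timeout handling is more involved than
this.

```python
from typing import List, Tuple


def run_attempts(attempts: List[Tuple[float, bool]],
                 total_timeout: float,
                 retries: int = 2) -> Tuple[bool, List[Tuple[int, float, bool]]]:
    """Simulate fencing attempts that all draw on one overall timeout.

    attempts: (duration_seconds, would_succeed) per hypothetical attempt.
    Returns (fenced, log), where log holds (attempt_no, seconds_used, ok).
    """
    remaining = total_timeout
    log = []
    for number, (duration, would_succeed) in enumerate(attempts[:retries], start=1):
        if remaining <= 0:
            # An earlier attempt used up the whole budget: no further tries.
            break
        used = min(duration, remaining)
        remaining -= used
        timed_out = duration > used  # the attempt was cut off by the timeout
        ok = would_succeed and not timed_out
        log.append((number, used, ok))
        if ok:
            return True, log
    return False, log


# First attempt hangs for the full 60s, so the second one is never made.
print(run_attempts([(60, False), (5, True)], total_timeout=60))
# First attempt fails quickly, leaving time for a second attempt.
print(run_attempts([(10, False), (5, True)], total_timeout=60))
```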

A fencing topology can be configured if multiple devices can be used to
fence a node, to specify which should be attempted first.

If all devices/attempts fail, pacemaker marks the fencing as failed.
From there it depends on how the fencing was initiated. If pacemaker
itself initiated it (vs. external software like DLM, or a sysadmin
running stonith_admin), the controller will resubmit the fencing
operation up to 10 times by default (the stonith-max-attempts cluster
property) then give up. However the controller will reset the counter
to zero and try again at the next transition if the node still needs to
be fenced.
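
The resubmission behaviour can be pictured with the following sketch.
It is a simplification under stated assumptions: fence_once stands in
for one complete fencing operation (all devices and retries), the
function name is invented, and a "transition" is reduced to one pass of
the outer loop.

```python
from typing import Callable, Optional, Tuple


def fence_across_transitions(fence_once: Callable[[], bool],
                             transitions: int,
                             max_attempts: int = 10) -> Tuple[Optional[int], int]:
    """Return (transition in which fencing succeeded or None, total operations)."""
    total = 0
    for transition in range(1, transitions + 1):
        failures = 0  # the counter starts from zero at each new transition
        while failures < max_attempts:  # stonith-max-attempts, default 10
            total += 1
            if fence_once():
                return transition, total
            failures += 1
        # Gave up for this transition; the node still needs fencing, so the
        # next transition tries again.
    return None, total


# Twelve failures followed by a success: the controller gives up after 10,
# then succeeds on the third operation of the next transition.
results = iter([False] * 12 + [True])
print(fence_across_transitions(lambda: next(results), transitions=2))
```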

> Also does stonith resource failure on a node affect selecting this
> node
> to perform stonith? Is there any sort of priority list? If yes, how
> is
> it ordered?

Currently, device monitor failure does not affect the selection of a
node to execute the device, but that is planned.

The priority list for selecting a node to execute a device is described
above. For selecting between multiple fence devices when there is no
topology, there is a priority meta-attribute for stonith devices, but
it is not currently implemented (another to-do item).

> 
> >  After 30 minutes it will start trying again.
> > 
> 
> ... resume monitoring. Nothing more.
-- 
Ken Gaillot <kgaillot at redhat.com>