[ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

Sat Jun 20 01:37:16 EDT 2020

19.06.2020 13:23, Klaus Wenninger wrote:
> On 6/19/20 12:13 AM, Howard wrote:
>> Thanks for all the help so far.  With your assistance, I'm very close
>> to stable.
>>
>> Made the following changes to the vmfence stonith resource:
>>   Meta Attrs: failure-timeout=30m migration-threshold=10
>>   Operations: monitor interval=60s (vmfence-monitor-interval-60s)
>>
>> If I understand this correctly, it will check if the fencing device is
>> online every 60 seconds. It will try 10 times and then mark the node
>> ineligible.  After 30 minutes it will start trying again.
>>
>> On Thu, Jun 18, 2020 at 12:29 PM Ken Gaillot <kgaillot at redhat.com>
>> wrote:
>>
>>     On Thu, 2020-06-18 at 21:32 +0300, Andrei Borzenkov wrote:
>>     > 18.06.2020 18:24, Ken Gaillot wrote:
>>     > > Note that a failed start of a stonith device will not prevent the
>>     > > cluster from using that device for fencing. It just prevents the
>>     > > cluster from monitoring the device.
>>     > >
>>     >
>>     > My understanding is that if stonith resource cannot run anywhere, it
>>     > also won't be used for stonith. When failcount exceeds threshold,
>>     > resource is banned from node. If it happens on all nodes, resource
>>     > cannot run anywhere and so won't be used for stonith. Start failure
>>     > automatically sets failcount to INFINITY.
>>     >
>>     > Or do I misunderstand something?
>>
>>     I had to test to confirm, but a stonith resource stopped due to
>>     failures can indeed be used. Only stonith resources stopped via
>>     location constraints (bans) or target-role=Stopped are prevented from
>>     being used.
>>
> Unfortunately this could be a bit tricky to test, as fenced updates
> the device list on configuration changes, but scores also influence
> whether a device is included in that list.

Can you elaborate? I understand it as "if the score is -INFINITY, the
device is ignored". Is that correct? This would be consistent: explicit
constraints are just one possible way to set a location score.
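
To make the question concrete, here is a toy Python model of the reading
described above: a fail count at or above migration-threshold bans the
resource from a node (score -INFINITY), and a device with no eligible node
is ignored. The names node_score and device_usable are hypothetical, and
this models only that interpretation, not pacemaker's actual code; Ken's
test quoted above suggests that stops caused by failures are in fact
treated differently from explicit bans.

```python
import math

NEG_INF = -math.inf  # stands in for pacemaker's -INFINITY score


def node_score(fail_count, migration_threshold, constraint_score=0):
    """Toy location score of a stonith resource on one node (hypothetical helper)."""
    # Assumption: reaching migration-threshold bans the resource from the node.
    if fail_count >= migration_threshold:
        return NEG_INF
    return constraint_score


def device_usable(scores):
    """Under the reading in question, a device banned on every node is ignored."""
    return any(score > NEG_INF for score in scores.values())


# A start failure sets the fail count to INFINITY, so that node is banned at once.
scores = {
    "node1": node_score(fail_count=math.inf, migration_threshold=10),
    "node2": node_score(fail_count=3, migration_threshold=10),
}
print(device_usable(scores))  # node2 is still eligible, so the device is usable
```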

> So there is also a possible dependency on when the device list was
> most recently updated.

My understanding was that pacemaker recomputes scores on every
transition. That is the whole idea: any event triggers re-evaluation
of the current resource placement. A node-lost event that results in
stonith recomputes scores as the very first thing.

Of course it is possible that after node loss some other event happens
that would have made the resource available, but it is no longer taken
into account because pacemaker has already decided that no fencing
resource was available to perform stonith. Is that what you mean?
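
One way to picture the timing dependency described in the quoted text: the
fencer keeps its own device list, refreshed on configuration changes, while
scores may change afterwards. The sketch below is purely illustrative;
FencerModel and its methods are invented for this example and do not
correspond to pacemaker internals.

```python
class FencerModel:
    """Hypothetical stand-in for a fencer holding a cached device list."""

    def __init__(self):
        self.device_list = set()

    def on_config_change(self, scores):
        # Assumption: the list is rebuilt only here, from the scores seen now.
        self.device_list = {dev for dev, score in scores.items()
                            if score != float("-inf")}

    def can_fence_with(self, device):
        # Fencing consults the cached list, not the current scores.
        return device in self.device_list


scores = {"vmfence": 0}
fencer = FencerModel()
fencer.on_config_change(scores)

# The score changes later without a configuration change reaching the fencer.
scores["vmfence"] = float("-inf")
print(fencer.can_fence_with("vmfence"))  # the cached list is now stale
```

In this model the answer depends on whether the list was refreshed before
or after the score changed, which is why such a setup is awkward to test.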

> I don't know if it is relevant for this config, but unfortunately it is
> something to keep in the back of one's mind for more complex fencing
> setups.
> It is an ugliness that has been known for a long time, but there is no
> easy way to solve the issue without losing part of the independence,
> and with that the robustness, of the fencing subsystem.
>