[ClusterLabs] cluster doesn't do HA as expected, pingd doesn't help

Andrei Borzenkov arvidjaar at gmail.com
Tue Dec 19 14:03:23 EST 2023


On 19.12.2023 21:42, Artem wrote:
> Andrei and Klaus, thanks for the prompt reply and clarification!
> As I understand it, the design and behavior of Pacemaker are tightly
> coupled with the stonith concept. But isn't it too rigid?
> 

If you insist on shooting yourself in the foot, pacemaker gives you the 
gun. It just does not load it by default and does not pull the trigger 
for you.

Seriously, this topic has been beaten to death. Just do some research.

You can avoid fencing and rely on quorum in a shared-nothing case. The 
prime example I have seen is NetApp C-Mode ONTAP, where the set of 
management processes goes read-only, preventing any modification, when a 
node goes out of quorum. But as soon as you have a shared resource, 
ignoring fencing will lead to data corruption sooner or later.
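
For reference, the knobs involved look roughly like this with pcs (a 
sketch, not a recommendation - with any shared resource you still want 
real stonith):

    # stop all resources in a partition that loses quorum
    pcs property set no-quorum-policy=stop
    # what is being argued against in this thread
    pcs property set stonith-enabled=false

In a shared-nothing setup the first property is what makes out-of-quorum 
nodes drop their resources; with shared storage it is not enough, 
because a hung node cannot be trusted to actually carry out the stop.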

> Is there a way to leverage self-monitoring or pingd rules to trigger an
> isolated node to umount its FS? Like the vSphere High Availability host
> isolation response.
> Can resource-stickiness=off (auto-failback) decrease the risk of corruption
> by an unresponsive node coming back online?
> Is there a quorum feature not for the cluster but for resource start/stop?
> Got the lock - welcome to mount; unable to refresh the lease - force unmount.
> Can on-fail=ignore break manual failover logic (a stopped resource will be
> considered failed and thus ignored)?
> 
> best regards,
> Artem
> 
> On Tue, 19 Dec 2023 at 17:03, Klaus Wenninger <kwenning at redhat.com> wrote:
> 
>>
>>
>> On Tue, Dec 19, 2023 at 10:00 AM Andrei Borzenkov <arvidjaar at gmail.com>
>> wrote:
>>
>>> On Tue, Dec 19, 2023 at 10:41 AM Artem <tyomikh at gmail.com> wrote:
>>> ...
>>>> Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline)
>>>> Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3
>>>> Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4 ) blocked
>>>
>>> This is the default for a failed stop operation. The only way
>>> pacemaker can resolve a failure to stop a resource is to fence the node
>>> where this resource was active. If that is not possible (and IIRC you
>>> refuse to use stonith), pacemaker has no choice but to block it.
>>> If you insist, you can of course set on-fail=ignore, but this means an
>>> unreachable node will continue to run resources. Whether that can lead
>>> to corruption in your case I cannot guess.
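>>>
>>> For illustration only (and again, not recommended), relaxing the
>>> failure policy of the stop operation would look roughly like this
>>> with pcs, using the OST4 resource from your logs:
>>>
>>>    pcs resource update OST4 op stop on-fail=ignore
>>>
>>> With that set, a failed or unrunnable stop no longer blocks the
>>> resource, and pacemaker may start it elsewhere - which is exactly
>>> the double-mount risk that fencing exists to prevent.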
>>>
>>
>> Don't know if I'm reading that correctly, but I understand from what
>> you had written above that you trigger the failover by stopping the VM
>> (lustre4) without an ordered shutdown.
>> With fencing disabled, what we are seeing is exactly what we would
>> expect: the state of the resource is unknown, pacemaker tries to stop
>> it, that doesn't work because the node is offline, and no fencing is
>> configured - so all it can do is wait until there is info on whether
>> the resource is up or not.
>> I guess the strange output below appears because fencing is disabled -
>> quite an unusual, and not recommended, configuration - so it might not
>> have shown up too often in that way.
>>
>> Klaus
>>
>>>
>>>> Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0)
>>>
>>> That is a rather strange phrase. The resource is blocked because
>>> pacemaker could not fence the node, not the other way round.
>>> _______________________________________________
>>> Manage your subscription:
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> ClusterLabs home: https://www.clusterlabs.org/
>>>
>>
> 
> 


