[ClusterLabs] cluster doesn't do HA as expected, pingd doesn't help

Artem tyomikh at gmail.com
Tue Dec 19 13:42:34 EST 2023


Andrei and Klaus thanks for prompt reply and clarification!
As I understand it, Pacemaker's design and behavior are tightly coupled with
the stonith concept. But isn't that too rigid?

Is there a way to leverage self-monitoring or pingd rules to trigger an
isolated node to unmount its FS? Like vSphere High Availability's host
isolation response.
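
For context, something like this is what I have in mind (an untested sketch;
the resource name FS1 and the gateway address are made up, and I'm assuming
the ocf:pacemaker:ping agent and pcs):

```shell
# Clone a ping resource on every node; it publishes a "pingd" node attribute
pcs resource create ping ocf:pacemaker:ping \
    host_list="192.168.1.1" dampen=5s multiplier=1000 \
    op monitor interval=10s clone

# Ban the filesystem resource (hypothetical name FS1) from any node
# that cannot reach the gateway
pcs constraint location FS1 rule score=-INFINITY \
    pingd lt 1 or not_defined pingd
```

But as far as I can tell this only moves or stops the resource through the
cluster; it doesn't make an isolated node unmount on its own.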
Can resource-stickiness=0 (auto-failback) decrease the risk of corruption
from an unresponsive node coming back online?
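
That is, something like this (if I read the docs right; the exact syntax
differs between pcs versions):

```shell
# Newer pcs syntax; older versions use
# "pcs resource defaults resource-stickiness=0" instead
pcs resource defaults update resource-stickiness=0
```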
Is there a quorum-like feature not for cluster membership but for resource
start/stop? Got the lock: welcome to mount; unable to refresh the lease:
force unmount.
Can on-fail=ignore break manual failover logic (a stopped resource would be
considered failed and thus ignored)?
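
What I would be setting is roughly this (untested, with OST4 as the example
resource from my config):

```shell
# Tell pacemaker to ignore a failed stop instead of blocking
# (or fencing, when stonith is enabled)
pcs resource update OST4 op stop on-fail=ignore
```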

best regards,
Artem

On Tue, 19 Dec 2023 at 17:03, Klaus Wenninger <kwenning at redhat.com> wrote:

>
>
> On Tue, Dec 19, 2023 at 10:00 AM Andrei Borzenkov <arvidjaar at gmail.com>
> wrote:
>
>> On Tue, Dec 19, 2023 at 10:41 AM Artem <tyomikh at gmail.com> wrote:
>> ...
>> > Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107]
>> (update_resource_action_runnable)    warning: OST4_stop_0 on lustre4 is
>> unrunnable (node is offline)
>> > Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107]
>> (recurring_op_for_active)    info: Start 20s-interval monitor for OST4 on
>> lustre3
>> > Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107]
>> (log_list_item)      notice: Actions: Stop       OST4        (     lustre4
>> )  blocked
>>
>> This is the default for a failed stop operation. The only way
>> pacemaker can resolve a failure to stop a resource is to fence the node
>> where that resource was active. If that is not possible (and IIRC you
>> refuse to use stonith), pacemaker has no choice but to block it.
>> If you insist, you can of course set on-fail=ignore, but this means an
>> unreachable node will continue to run resources. Whether that can lead
>> to corruption in your case I cannot guess.
>>
>
> Don't know if I'm reading that correctly, but I understand from what you
> wrote above that you are trying to trigger the failover by stopping the
> VM (lustre4) without an ordered shutdown.
> With fencing disabled, what we are seeing is exactly what we would expect:
> the state of the resource is unknown - pacemaker tries to stop it - that
> doesn't work because the node is offline - no fencing is configured - so
> all it can do is wait until there is info on whether the resource is up
> or not.
> I guess the strange output below is because fencing is disabled - quite an
> unusual, and not recommended, configuration - so this might not have shown
> up too often in that way.
>
> Klaus
>
>>
>> > Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107]
>> (pcmk__create_graph)         crit: Cannot fence lustre4 because of OST4:
>> blocked (OST4_stop_0)
>>
>> That is a rather strange phrase. The resource is blocked because
>> pacemaker could not fence the node, not the other way round.
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>>
