[ClusterLabs] cluster doesn't do HA as expected, pingd doesn't help

Tue Dec 19 03:59:52 EST 2023

On Tue, Dec 19, 2023 at 10:41 AM Artem <tyomikh at gmail.com> wrote:
...
> Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (update_resource_action_runnable)    warning: OST4_stop_0 on lustre4 is unrunnable (node is offline)
> Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active)    info: Start 20s-interval monitor for OST4 on lustre3
> Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item)      notice: Actions: Stop       OST4        (     lustre4 )  blocked

This is the default behavior for a failed stop operation. The only way
pacemaker can recover from a failure to stop a resource is to fence the
node where the resource was active. If that is not possible (and IIRC you
refuse to use stonith), pacemaker has no choice but to block the resource.
If you insist, you can of course set on-fail=ignore on the stop operation,
but this means the unreachable node may continue to run the resources.
Whether that can lead to corruption in your case I cannot guess.
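
To make the decision above concrete, here is a rough sketch in Python. It is
only a toy model of the rule as I understand it from the documentation, not
pacemaker's actual code: the function names are made up for illustration, and
only the on-fail values discussed here (fence, block, ignore) are modeled.

```python
from typing import Optional


def effective_stop_on_fail(stonith_enabled: bool,
                           on_fail: Optional[str] = None) -> str:
    """Return the on-fail policy that applies to a failed stop.

    Hypothetical helper. An explicitly configured on-fail wins; otherwise
    the default for stop is "fence" when fencing is enabled and "block"
    when it is not.
    """
    if on_fail is not None:
        return on_fail
    return "fence" if stonith_enabled else "block"


def describe_failed_stop(stonith_enabled: bool,
                         on_fail: Optional[str] = None) -> str:
    """Describe, in words, what happens to the resource (toy model)."""
    policy = effective_stop_on_fail(stonith_enabled, on_fail)
    outcomes = {
        "fence": "node is fenced, resource can be recovered elsewhere",
        "block": "resource is blocked, nothing further is done with it",
        "ignore": "failure is ignored, old node may still run the resource",
    }
    if policy not in outcomes:
        # Other on-fail values exist but are deliberately not modeled here.
        raise ValueError(f"on-fail={policy!r} is not covered by this sketch")
    return f"{policy}: {outcomes[policy]}"


if __name__ == "__main__":
    # No stonith, no explicit on-fail: the situation in the quoted log.
    print(describe_failed_stop(stonith_enabled=False))
    # Same cluster with on-fail=ignore set on the stop operation.
    print(describe_failed_stop(stonith_enabled=False, on_fail="ignore"))
    # Fencing configured and enabled.
    print(describe_failed_stop(stonith_enabled=True))
```

The first call corresponds to the "blocked" line in your log; the last one is
what would let pacemaker move OST4 safely.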

> Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__create_graph)         crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0)

That message is worded rather strangely. The resource is blocked because
pacemaker could not fence the node, not the other way round.