[ClusterLabs] Antw: [EXT] Correctly stop pacemaker on 2-node cluster with SBD and failed devices?

Wed Jun 16 05:37:50 EDT 2021

On Wed, Jun 16, 2021 at 11:26 AM Klaus Wenninger <kwenning at redhat.com>
wrote:

>
>
> On Wed, Jun 16, 2021 at 10:47 AM Roger Zhou <zzhou at suse.com> wrote:
>
>>
>> On 6/16/21 3:03 PM, Andrei Borzenkov wrote:
>>
>> >
>> >>>
>> >>> We thought that access to storage was restored, but one step was
>> >>> missing so devices appeared empty.
>> >>>
>> >>> At this point I tried to restart pacemaker. But as soon as I
>> >>> stopped pacemaker, SBD rebooted the nodes - which is logical, as quorum
>> >>> was now lost.
>> >>>
>> >>> How to cleanly stop pacemaker in this case and keep nodes up?
>> >>
>> >> Unconfigure the sbd devices, I guess.
>> >>
>> >
>> > Do you have *practical* suggestions on how to do it online in a
>> > running pacemaker cluster? Can you explain how it is going to help
>> > given that lack of sbd device was not the problem in the first place?
>>
>> I would translate this issue as "how to gracefully shut down sbd to
>> deregister sbd from pacemaker for the whole cluster". There seems to be
>> no way to do that except `systemctl stop corosync`.
>>
>> With that, to calm down the sbd suicide, I'm thinking some tricky steps
>> like the ones below might help. Well, I'm not sure they fit your
>> situation as a whole.
>>
>> crm cluster run "systemctl stop pacemaker"
>> crm cluster run "systemctl stop corosync"
>>
> I don't think this will help in this situation.
> As I've already tried to explain before, shutting down
> pacemaker on one of the nodes, when the sbd device can't
> be reached, should already be enough for the other
> one to suicide.
>
> One thing that comes to mind (no less ugly than the other
> suggestions here, I'm afraid) is to dummy-register at the
> cpg-protocol right after stopping pacemaker. If you then want
> to bring down corosync & sbd as well, it should be possible
> to do that quickly enough, as pcs otherwise does with
> 3+ node clusters.
>

Something else comes to mind that might be more
helpful and less ugly - I have to think it over a bit though:

With the new startup/shutdown syncing, pacemaker
should stay connected to the cpg-protocol until a final
handshake with sbd on shutdown.
If we could bring all nodes to a state right before that
handshake, with e.g. pcs, we would have plenty of time for that.
And the final step, incl. corosync/sbd shutdown, is quick
enough that it can happen on all nodes within
watchdog-timeout.
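
To make the timing argument concrete, here is a rough sketch of the check I have in mind. It is only a model, not anything pacemaker, sbd or pcs actually implements: the names (NodeShutdown, fits_watchdog_window) and all the numbers are made up for illustration, and it assumes the relevant window runs from the moment the first node starts its final step until the last node has finished.

```python
from dataclasses import dataclass


@dataclass
class NodeShutdown:
    """Hypothetical timing of the final shutdown step on one node."""
    name: str
    start_offset_s: float  # when the final step starts, relative to a common clock
    duration_s: float      # assumed time to stop corosync and sbd on this node


def fits_watchdog_window(nodes, watchdog_timeout_s):
    """Return True if all nodes finish the final step within the timeout.

    Assumption: the window is measured from the earliest start of the final
    step on any node to the latest finish on any node.
    """
    if not nodes:
        return True
    first_start = min(n.start_offset_s for n in nodes)
    last_finish = max(n.start_offset_s + n.duration_s for n in nodes)
    return (last_finish - first_start) < watchdog_timeout_s


# Illustrative numbers only: two nodes, second one starts half a second later.
nodes = [
    NodeShutdown("node1", start_offset_s=0.0, duration_s=2.0),
    NodeShutdown("node2", start_offset_s=0.5, duration_s=2.0),
]
print(fits_watchdog_window(nodes, watchdog_timeout_s=5.0))
```

The point is just that the long part (getting every node to the state right before the handshake) sits outside this window, so only the short final step has to fit inside it.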

Klaus

>
>> BR,
>> Roger