[ClusterLabs] Node reset on shutdown by SBD watchdog with corosync-qdevice

Sun Jul 28 23:34:25 EDT 2019

On 7/28/19 6:35 PM, Andrei Borzenkov wrote:
> In a two-node cluster + qnetd I consistently see the node that is
> shut down last being reset during shutdown, i.e.:
>
> - shut down the first node - OK
> - shut down the second node - reset
>
> As far as I understand, what happens is:
>
> - during shutdown pacemaker.service is stopped first. In the above
> configuration it leaves corosync.service, corosync-qdevice.service and
> sbd.service running (see another mail with subject "corosync.service
> (and sbd.service) are not stopper on pacemaker shutdown when
> corosync-qdevice is used")
>
> - corosync-qdevice.service is declared After=corosync.service, so on
> shutdown it is stopped first
>
> - this immediately removes one vote from quorum
>
> - when the first node is shut down, it remains in quorum (it lost
> qnetd but still has the second node)
>
> - when the second node is shut down, as soon as
> corosync-qdevice.service stops, the node goes out of quorum and SBD
> resets it
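
The vote arithmetic in the quoted steps can be sketched roughly as follows. This is a simplified model, not corosync code. It assumes one vote per node plus one vote from the qdevice, so three expected votes with a majority of two; the names EXPECTED_VOTES, QUORUM and is_quorate are made up for illustration.

```python
# Simplified model of the vote counting described above; not corosync code.
# Assumption: each node has one vote and the qdevice contributes one vote,
# so there are 3 expected votes and quorum needs a majority of 2.

EXPECTED_VOTES = 3
QUORUM = EXPECTED_VOTES // 2 + 1  # majority: 2 of 3


def is_quorate(peer_up: bool, qdevice_up: bool) -> bool:
    """Return True if the local node still sees enough votes for quorum."""
    votes = 1 + int(peer_up) + int(qdevice_up)  # own vote + peer + qdevice
    return votes >= QUORUM


# First node shuts down: its qdevice stops, but the peer is still up.
first_node = is_quorate(peer_up=True, qdevice_up=False)

# Second node shuts down: the peer is already gone, then its qdevice stops.
second_node = is_quorate(peer_up=False, qdevice_up=False)

print(first_node, second_node)
```

In this model the first node keeps quorum through its own shutdown, while the second node drops to a single vote the moment its qdevice stops.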

Actually, all of that should happen very quickly and well within the
watchdog timeout. You should lose quorum, but that shouldn't matter,
because sbd should be down and the watchdog switched off before the
timer runs out.
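
To make that timing argument concrete, here is a minimal sketch of the race. It is an assumed simplification, not sbd's actual logic: the node is reset only if sbd is still running when the watchdog timer expires. The function name and the numbers (a 5 s timeout, 1 s and 30 s shutdown gaps) are hypothetical.

```python
# Simplified model of the shutdown race; not taken from sbd itself.
# Assumption: the watchdog was last fed at last_feed_at, and a clean sbd
# stop at sbd_stopped_at switches the watchdog off again.


def node_gets_reset(last_feed_at: float, sbd_stopped_at: float,
                    watchdog_timeout: float) -> bool:
    """Return True if the timer can expire while sbd is still running."""
    return sbd_stopped_at - last_feed_at > watchdog_timeout


# Hypothetical numbers: sbd stops 1 s after the last feed, 5 s timeout.
quick_shutdown = node_gets_reset(0.0, 1.0, 5.0)

# Hypothetical numbers: something stalls and sbd only stops after 30 s.
stalled_shutdown = node_gets_reset(0.0, 30.0, 5.0)

print(quick_shutdown, stalled_shutdown)
```

Under these assumptions a reset points to something delaying the sbd stop beyond the timeout rather than to the loss of quorum as such.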

Which versions are you using? Do you have logs at hand that show
which servant is timing out and what it is saying?

I added pacemaker graceful-shutdown detection to sbd quite recently
so that it reliably survives longer gaps between the pacemaker
shutdown and the corosync/sbd shutdown.

Some time back, when playing with corosync observation via the cpg
protocol, we had a node-failure scenario with qdevice that drove
corosync into the sync phase, meaning that all calls (except the
quorum protocol) would stall for up to something like 30s with
default settings. No idea if your scenario could drive you into
that, but it may be something to keep in the back of one's mind, and
maybe the corosync specialists can say something about it.

Klaus
>
> Is it possible to start corosync-qdevice.service *before* corosync? Can
> it be made intelligent enough to wait for corosync to come up?
>
> This basically makes it impossible to safely shut down cluster nodes.