[ClusterLabs] Node reset on shutdown by SBD watchdog with corosync-qdevice

Jan Friesse jfriesse at redhat.com
Mon Jul 29 03:07:16 EDT 2019


> On 7/28/19 6:35 PM, Andrei Borzenkov wrote:
>> In a two-node cluster + qnetd I consistently see the node that is
>> shut down last being reset during shutdown. I.e.
>>
>> - shutdown the first node - OK
>> - shutdown the second node - reset
>>
>> As far as I understand, what happens is:
>>
>> - during shutdown, pacemaker.service is stopped first. In the above
>> configuration it leaves corosync.service, corosync-qdevice.service and
>> sbd.service running (see another mail with subject "corosync.service
>> (and sbd.service) are not stopped on pacemaker shutdown when
>> corosync-qdevice is used")
>>
>> - corosync-qdevice.service is declared After=corosync.service, so on
>> shutdown it is stopped first
>>
>> - this immediately removes one vote from quorum
>>
>> - when the first node is shut down, it remains quorate (it lost the
>> qnetd vote but still has the second node)
>>
>> - when the second node is shut down, as soon as corosync-qdevice.service
>> stops, the node goes out of quorum and SBD resets it
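
(For illustration: in a two-node + qnetd setup the quorum section of
corosync.conf usually looks roughly like the sketch below; the host name
and algorithm are placeholders, not taken from Andrei's configuration.
The single extra vote comes from corosync-qdevice, which is why it
disappears the moment corosync-qdevice.service is stopped:

    # corosync.conf - illustrative excerpt only, values are placeholders
    quorum {
        provider: corosync_votequorum
        device {
            # extra vote supplied by corosync-qdevice; it is gone as
            # soon as corosync-qdevice.service stops
            votes: 1
            model: net
            net {
                host: qnetd.example.com
                algorithm: ffsplit
            }
        }
    }
)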
> Actually all of that should happen very quickly and all within
> the timeout, meaning you should lose quorum, but that
> shouldn't matter as sbd should be down and the watchdog
> switched off before the timer runs out.
> Which versions are you using? Do you have logs at hand
> that show which servant is timing out and what it is saying?
> I've added pacemaker graceful-shutdown detection quite
> recently, to make sbd reliably survive longer gaps between
> pacemaker and corosync/sbd shutdown.
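
(The timeout in question is the sbd watchdog timeout; it is normally
configured in /etc/sysconfig/sbd, roughly as in this illustrative
excerpt - the value shown is only the common default, not a
recommendation for this setup:

    # /etc/sysconfig/sbd - illustrative excerpt; location and defaults
    # differ between distributions
    SBD_WATCHDOG_DEV=/dev/watchdog
    # The hardware watchdog resets the node if sbd stops feeding it for
    # this many seconds without having closed it cleanly first.
    SBD_WATCHDOG_TIMEOUT=5
)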
> Some time back, when observing corosync via the CPG protocol, we
> had a node-failure scenario with qdevice that drove corosync into
> the sync phase, meaning that all calls (except the quorum protocol)
> would stall for up to something like 30s with default settings. No
> idea if your scenario could drive you into that ... but maybe
> something to keep in the back of one's mind ... and maybe the
> corosync specialists can say something about it.
> 
> Klaus

I believe this is the case. Andrei, would you mind testing the
settings described in this comment:
https://github.com/ClusterLabs/sbd/pull/76#issuecomment-486952369
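
(Independent of whatever that comment recommends: the ~30s Klaus
mentions matches the documented default of the qdevice sync timeout;
both qdevice timeouts can be tuned in the quorum.device section of
corosync.conf. The values below are just the documented defaults,
shown for illustration:

    # corosync.conf - documented defaults, for illustration only
    quorum {
        device {
            # roughly: how long (ms) votequorum waits for the qdevice
            # vote during normal operation
            timeout: 10000
            # same, but while corosync is in the sync phase after a
            # membership change - the ~30s stall mentioned above
            sync_timeout: 30000
        }
    }
)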
>>
>> Is it possible to start corosync-qdevice.service *before* corosync? Can
>> it be made intelligent enough to wait for corosync to come up?

It would be implementable, but how exactly do you mean it would help?
corosync-qdevice needs to wait for the membership provided by corosync;
otherwise it's not possible to decide whether the node should have a
vote, so it would not get one. If you mean it the other way around -
that it is shut down after corosync (or not at all) - then yes, that
may help.
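
(In case it helps with experimenting: systemd stops units in the
reverse of their start ordering, so as long as corosync-qdevice.service
is ordered After=corosync.service it will always be stopped before
corosync on shutdown. The ordering can be inspected with standard
systemd commands, e.g.:

    # illustrative commands, output varies with distribution/version
    systemctl show -p After corosync-qdevice.service
    systemctl list-dependencies --reverse corosync.service
)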

Regards,
   Honza

>>
>> This basically makes it impossible to safely shut down cluster nodes.


