[ClusterLabs] Gracefully stop nodes one by one with disk-less sbd

Klaus Wenninger kwenning at redhat.com
Mon Aug 12 07:39:52 EDT 2019


On 8/9/19 9:06 PM, Yan Gao wrote:
> On 8/9/19 6:40 PM, Andrei Borzenkov wrote:
>> 09.08.2019 16:34, Yan Gao wrote:
>>> Hi,
>>>
>>> With disk-less sbd, it's fine to stop the cluster services on all of the
>>> cluster nodes at the same time.
>>>
>>> But when stopping the nodes one by one, for example in a 3-node cluster,
>>> after the 2nd node is stopped, the only remaining node resets itself with:
>>>
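For concreteness, the sequence that triggers this is roughly the following
(a sketch only - hypothetical node names, crmsh and systemd assumed,
diskless/watchdog-only sbd without a shared device):

    # node3: stop cluster services - 2 of 3 votes remain, still quorate
    ssh node3 'crm cluster stop'
    # node2: stop cluster services - node1 drops below the quorum threshold
    ssh node2 'crm cluster stop'
    # node1: diskless sbd sees liveness/quorum lost and self-fences via the
    # watchdog, as in the log further down
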
>> That is sort of documented in the SBD manual page:
>>
>> --><--
>> However, while the cluster is in such a degraded state, it can
>> neither successfully fence nor be shutdown cleanly (as taking the
>> cluster below the quorum threshold will immediately cause all remaining
>> nodes to self-fence).
>> --><--
>>
>> SBD in shared-nothing mode is basically always in such degraded state
>> and cannot tolerate loss of quorum.
> Well, the context here is that quorum is lost *expectedly*, since the
> other nodes were gracefully shut down.
>
>>
>>
>>> Aug 09 14:30:20 opensuse150-1 sbd[1079]:       pcmk:    debug:
>>> notify_parent: Not notifying parent: state transient (2)
>>> Aug 09 14:30:20 opensuse150-1 sbd[1080]:    cluster:    debug:
>>> notify_parent: Notifying parent: healthy
>>> Aug 09 14:30:20 opensuse150-1 sbd[1078]:  warning: inquisitor_child:
>>> Latency: No liveness for 4 s exceeds threshold of 3 s (healthy servants: 0)
>>>
>>> I can think of manipulating quorum with last_man_standing and
>>> potentially also auto_tie_breaker, not to mention that
>>> last_man_standing_window would also be a factor... But is there a better
>>> solution?
>>>
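(For reference, those knobs all live in the quorum section of corosync.conf;
a minimal sketch, with purely illustrative values - see votequorum(5) for
the exact semantics:

    quorum {
        provider: corosync_votequorum
        expected_votes: 3
        # recalculate expected_votes as nodes leave the cluster cleanly
        last_man_standing: 1
        # how long (ms) to wait before recalculating
        last_man_standing_window: 10000
        # required to allow stepping down from 2 nodes to 1
        auto_tie_breaker: 1
    }
)
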
>> The lack of a cluster-wide shutdown mode has been mentioned more than
>> once on this list. I guess the only workaround is to use higher-level
>> tools which basically just try to stop the cluster on all nodes at once.
>> That is still susceptible to race conditions.
> Gracefully stopping nodes one by one on purpose is still a reasonable 
> need though ...
If you do the teardown the way e.g. pcs does it - first tear down the
pacemaker instances and then corosync/sbd - it is at least possible to
tear down the pacemaker instances one by one without risking a reboot
due to quorum loss.
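A minimal sketch of that ordering, assuming systemd-managed services and
hypothetical node names (sbd is typically tied to corosync via unit
dependencies, so stopping corosync also stops sbd):

    # first stop only the pacemaker instances, one node after the other
    for n in node1 node2 node3; do
        ssh "$n" 'systemctl stop pacemaker'
    done
    # only then take down the membership/watchdog layer
    for n in node1 node2 node3; do
        ssh "$n" 'systemctl stop corosync'
    done
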
With a reasonably current sbd that includes
- https://github.com/ClusterLabs/sbd/commit/824fe834c67fb7bae7feb87607381f9fa8fa2945
- https://github.com/ClusterLabs/sbd/commit/79b778debfee5b4ab2d099b2bfc7385f45597f70
- https://github.com/ClusterLabs/sbd/commit/a716a8ddd3df615009bcff3bd96dd9ae64cb5f68
this should be pretty robust, although we are still thinking of a cleaner
way to detect a graceful pacemaker shutdown (probably together with some
heartbeat to pacemakerd that assures pacemakerd is properly checking the
liveness of its sub-daemons).

Klaus
>
> Regards,
>    Yan
