[ClusterLabs] Gracefully stop nodes one by one with disk-less sbd

Klaus Wenninger kwenning at redhat.com
Mon Aug 12 07:39:52 EDT 2019

On 8/9/19 9:06 PM, Yan Gao wrote:
> On 8/9/19 6:40 PM, Andrei Borzenkov wrote:
>> 09.08.2019 16:34, Yan Gao wrote:
>>> Hi,
>>> With disk-less sbd, it's fine to stop the cluster service on all
>>> cluster nodes at the same time.
>>> But if the nodes are stopped one by one, for example in a 3-node
>>> cluster, then after stopping the 2nd node, the only remaining node
>>> resets itself with:
>> That is sort of documented in SBD manual page:
>> --><--
>> However, while the cluster is in such a degraded state, it can
>> neither successfully fence nor be shutdown cleanly (as taking the
>> cluster below the quorum threshold will immediately cause all remaining
>> nodes to self-fence).
>> --><--
>> SBD in shared-nothing mode is basically always in such degraded state
>> and cannot tolerate loss of quorum.
> Well, the context here is it loses quorum *expectedly* since the other 
> nodes gracefully shut down.
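The quorum arithmetic behind this can be sketched as follows. This is
only an illustration, assuming one vote per node, a plain majority
threshold, and expected votes that stay at 3 (that is, without
last_man_standing or auto_tie_breaker adjusting them). The function
names are made up for the example and are not part of corosync or sbd.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Majority threshold: strictly more than half of the expected votes."""
    return expected_votes // 2 + 1


def is_quorate(active_votes: int, expected_votes: int) -> bool:
    """True if the active votes reach the majority threshold."""
    return active_votes >= quorum_threshold(expected_votes)


# Assumption: 3-node cluster, one vote per node, expected votes fixed at 3.
EXPECTED_VOTES = 3

for active in (3, 2, 1):
    if is_quorate(active, EXPECTED_VOTES):
        state = "quorate"
    else:
        state = "not quorate, disk-less sbd would self-fence"
    print(f"{active} of {EXPECTED_VOTES} nodes up: {state}")
```

Under these assumptions the cluster stays quorate after the first node
stops, and the last node loses quorum as soon as the second one stops.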
>>> Aug 09 14:30:20 opensuse150-1 sbd[1079]:       pcmk:    debug:
>>> notify_parent: Not notifying parent: state transient (2)
>>> Aug 09 14:30:20 opensuse150-1 sbd[1080]:    cluster:    debug:
>>> notify_parent: Notifying parent: healthy
>>> Aug 09 14:30:20 opensuse150-1 sbd[1078]:  warning: inquisitor_child:
>>> Latency: No liveness for 4 s exceeds threshold of 3 s (healthy servants: 0)
>>> I can think of the way to manipulate quorum with last_man_standing and
>>> potentially also auto_tie_breaker, not to mention
>>> last_man_standing_window would also be a factor... But is there a better
>>> solution?
>> Lack of cluster wide shutdown mode was mentioned more than once on this
>> list. I guess the only workaround is to use higher level tools which
>> basically simply try to stop cluster on all nodes at once. It is still
>> susceptible to race condition.
> Gracefully stopping nodes one by one on purpose is still a reasonable 
> need though ...
If you do the teardown the way e.g. pcs does it (first tear down the
pacemaker instances, then corosync/sbd), it is at least possible to
tear down the pacemaker instances one by one without risking a reboot
due to quorum loss.
With kind of current sbd having in
this should be pretty robust although we are still thinking
(probably together with some heartbeat to pacemakerd
that assures pacemakerd is checking liveness of sub-daemons
properly) of having a cleaner way to detect graceful shutdown.
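
The two-phase ordering described above could be planned like this. It
is a dry-run sketch only: it builds and prints the list of steps and
executes nothing. The node names are hypothetical, and it assumes
systemd units named pacemaker and corosync, with sbd going down
together with corosync. Whether the second phase is run sequentially or
in parallel is not modeled here.

```python
from typing import List, Tuple


def plan_teardown(nodes: List[str]) -> List[Tuple[str, str]]:
    """Return (node, command) steps: pacemaker everywhere, then corosync."""
    # Phase 1: stop pacemaker on every node, one node after the other.
    steps = [(node, "systemctl stop pacemaker") for node in nodes]
    # Phase 2: only once no pacemaker instance is left, stop corosync
    # (assumption: sbd is stopped along with corosync).
    steps += [(node, "systemctl stop corosync") for node in nodes]
    return steps


# Hypothetical node names, for illustration only.
NODES = ["node1", "node2", "node3"]

for node, command in plan_teardown(NODES):
    print(f"{node}: {command}")
```

The point of the ordering is that no corosync instance is stopped while
any pacemaker instance is still running.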

> Regards,
>    Yan