[ClusterLabs] Questions about SBD behavior

井上 和徳 inouekazu at intellilink.co.jp
Mon Jun 25 10:01:06 UTC 2018


> -----Original Message-----
> From: Klaus Wenninger [mailto:kwenning at redhat.com]
> Sent: Wednesday, June 13, 2018 6:40 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed; 井上 和徳
> Subject: Re: [ClusterLabs] Questions about SBD behavior
> 
> On 06/13/2018 10:58 AM, 井上 和徳 wrote:
> > Thanks for the response.
> >
> > I understand that, for v1.3.1 and later, real quorum is necessary.
> > I also read this:
> >
> > https://wiki.clusterlabs.org/wiki/Using_SBD_with_Pacemaker#Watchdog-based_self-fencing_with_resource_recovery
> >
> > In relation to this specification, and in preparation for using
> > pacemaker-2.0, we are confirming the following known issue.
> >
> > * When SIGSTOP is sent to a Pacemaker daemon process, failures of
> >   resources are no longer detected.
> >   https://lists.clusterlabs.org/pipermail/users/2016-September/011146.html
> >   https://lists.clusterlabs.org/pipermail/users/2016-October/011429.html
> >
> >   I expected SBD to handle this, but nothing detected that the
> >   following processes were frozen, so no resource failures were
> >   detected either.
> >   - pacemaker-based
> >   - pacemaker-execd
> >   - pacemaker-attrd
> >   - pacemaker-schedulerd
> >   - pacemaker-controld
> >
> >   I confirmed this behavior, but could not find the status of
> >   addressing it in the following slides:
> >   https://wiki.clusterlabs.org/w/images/1/1a/Recent_Work_and_Future_Plans_for_SBD_1.1.pdf
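> >
> >   (For reference, one minimal way to reproduce the scenario above;
> >   the daemon chosen below is just an example:)
> >
> >     # Freeze one of the Pacemaker daemons on a node. Nothing notices
> >     # the frozen process, so resource monitoring silently stops
> >     # instead of the node being fenced.
> >     kill -STOP $(pidof pacemaker-execd)
> >
> >     # Undo the freeze afterwards:
> >     kill -CONT $(pidof pacemaker-execd)
> >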
> You are right. The issue was already known when I created these slides,
> so a plan for improving the observation of the Pacemaker daemons
> probably should have gone in there.
> 

It's good news that there is a plan for improvement,
so I have registered it in CLBZ as a reminder:
https://bugs.clusterlabs.org/show_bug.cgi?id=5356

Best Regards

> Thanks for bringing this to the table.
> Guess the issue got a little bit neglected recently.
> 
> >
> > As a result of our discussion, we want SBD to detect it and reset the
> > machine.
> 
> Implementation-wise I would go for some kind of split
> solution between Pacemaker & SBD: Pacemaker would
> observe its sub-daemons by itself, while some kind of
> heartbeat (implicit via corosync or explicit) between
> Pacemaker & SBD would assure that this internal
> observation is doing its job properly.
> 
> >
> > Also, for users who have neither a shared disk nor qdevice,
> > we need an option that works even without real quorum.
> > (We plan to avoid fence races with a delay attribute; see the
> >  sketch after these links:
> >  https://access.redhat.com/solutions/91653
> >  https://access.redhat.com/solutions/1293523)
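> >
> >  (For illustration only - a rough sketch of the delay approach from
> >   those articles; the agent, addresses and credentials below are
> >   placeholders:)
> >
> >    # The device that fences node1 gets a static delay, so in a
> >    # simultaneous fence race node1 survives: node2 is fenced
> >    # immediately, while the delayed device is still waiting.
> >    pcs stonith create fence-node1 fence_ipmilan ip=192.0.2.11 \
> >        username=admin password=secret pcmk_host_list=node1 \
> >        pcmk_delay_base=10s
> >    pcs stonith create fence-node2 fence_ipmilan ip=192.0.2.12 \
> >        username=admin password=secret pcmk_host_list=node2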
> I'm not sure if I get your point here.
> Watchdog-fencing on a 2-node-cluster without
> additional qdevice or shared disk is like denying
> the laws of physics in my mind.
> At the moment I don't see why auto_tie_breaker
> wouldn't work here on clusters of 4 nodes and up.
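>
> (For illustration, a corosync.conf quorum section for a 4-node
> cluster along those lines - an untested sketch:)
>
>   quorum {
>       provider: corosync_votequorum
>       # On an even 2:2 split, the partition containing the node with
>       # the lowest nodeid keeps quorum; the other partition loses
>       # quorum and self-fences via watchdog-SBD.
>       auto_tie_breaker: 1
>   }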
> 
> Regards,
> Klaus
> >
> > Best Regards,
> > Kazunori INOUE
> >
> >> -----Original Message-----
> >> From: Users [mailto:users-bounces at clusterlabs.org] On Behalf Of Klaus Wenninger
> >> Sent: Friday, May 25, 2018 4:08 PM
> >> To: users at clusterlabs.org
> >> Subject: Re: [ClusterLabs] Questions about SBD behavior
> >>
> >> On 05/25/2018 07:31 AM, 井上 和徳 wrote:
> >>> Hi,
> >>>
> >>> I am checking the watchdog function of SBD (without shared block-device).
> >>> In a two-node cluster, if the cluster stack on one node is stopped,
> >>> the watchdog is triggered on the remaining node.
> >>> Is this the designed behavior?
> >> SBD without a shared block-device doesn't really make sense on
> >> a two-node cluster.
> >> The basic idea is that - e.g. in case of a networking problem -
> >> the cluster splits up into a quorate and a non-quorate partition.
> >> The quorate partition stays up, while SBD guarantees reliable
> >> watchdog-based self-fencing of the non-quorate partition
> >> within a defined timeout.
> >> This idea of course doesn't work with just 2 nodes.
> >> Taking quorum info from the 2-node feature of corosync (automatically
> >> switching on wait-for-all) doesn't help in this case but instead
> >> would lead to split-brain.
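> >>
> >> (For reference, the usual knobs for disk-less, watchdog-only SBD;
> >> the values are just examples and the file location may differ
> >> per distribution:)
> >>
> >>   # /etc/sysconfig/sbd - no SBD_DEVICE set => watchdog-only mode
> >>   SBD_WATCHDOG_DEV=/dev/watchdog
> >>   SBD_WATCHDOG_TIMEOUT=5
> >>
> >>   # Tell Pacemaker how long to assume a lost node needs to have
> >>   # self-fenced before its resources may be recovered elsewhere:
> >>   pcs property set stonith-watchdog-timeout=10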
> >> What you can do - and what e.g. pcs does automatically - is enable
> >> auto-tie-breaker instead of two-node in corosync. But that
> >> still doesn't give you higher availability than that of the
> >> auto-tie-breaker winner. (Maybe interesting if you are going
> >> for a load-balancing scenario that doesn't affect availability, or
> >> for a transient state while setting up a cluster node-by-node ...)
> >> What you can do, though, is use qdevice to still have 'real quorum'
> >> info with just 2 full cluster-nodes.
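> >>
> >> (For illustration, a qdevice setup with pcs might look roughly
> >> like this; the host name and algorithm are example choices:)
> >>
> >>   # On a third machine outside the cluster: install and start
> >>   # corosync-qnetd.
> >>   # Then, on the 2-node cluster:
> >>   pcs quorum device add model net host=qnetd-host algorithm=ffsplit
> >>   pcs quorum status   # should now list the Qdevice vote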
> >>
> >> There was quite a lot of discussion round this topic on this
> >> thread previously if you search the history.
> >>
> >> Regards,
> >> Klaus
> > _______________________________________________
> > Users mailing list: Users at clusterlabs.org
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org 


