[ClusterLabs] Questions about SBD behavior

Klaus Wenninger kwenning at redhat.com
Mon Jun 25 06:58:36 EDT 2018


On 06/25/2018 12:01 PM, 井上 和徳 wrote:
>> -----Original Message-----
>> From: Klaus Wenninger [mailto:kwenning at redhat.com]
>> Sent: Wednesday, June 13, 2018 6:40 PM
>> To: Cluster Labs - All topics related to open-source clustering welcomed; 井上 和徳
>> Subject: Re: [ClusterLabs] Questions about SBD behavior
>>
>> On 06/13/2018 10:58 AM, 井上 和徳 wrote:
>>> Thanks for the response.
>>>
>>> I understand that, as of v1.3.1 and later, real quorum is required.
>>> I also read this:
>>>
>>> https://wiki.clusterlabs.org/wiki/Using_SBD_with_Pacemaker#Watchdog-based_self-fencing_with_resource_recovery
>>> In relation to this specification, we are verifying the following
>>> known issue before moving to pacemaker-2.0.
>>>
>>> * When SIGSTOP is sent to the pacemaker process, no failure of the
>>>   resource will be detected.
>>>   https://lists.clusterlabs.org/pipermail/users/2016-September/011146.html
>>>   https://lists.clusterlabs.org/pipermail/users/2016-October/011429.html
>>>
>>>   I expected that this would be handled by SBD, but nothing detected
>>>   that the following processes were frozen, so no resource failures
>>>   were detected either.
>>>   - pacemaker-based
>>>   - pacemaker-execd
>>>   - pacemaker-attrd
>>>   - pacemaker-schedulerd
>>>   - pacemaker-controld
>>>
>>>   I confirmed this behavior, but I could not find the current status
>>>   of a fix in the following slides:
>>>
>>> https://wiki.clusterlabs.org/w/images/1/1a/Recent_Work_and_Future_Plans_for_SBD_1.1.pdf
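>>>
>>>   To make the reproduction concrete, here is a minimal sketch of the
>>>   kind of test involved (illustrative only, not the exact procedure
>>>   used above; it assumes Python 3 with the psutil package on a test
>>>   node running pacemaker-2.0):
>>>
>>>     import signal
>>>     import psutil  # assumption: psutil is installed on the test node
>>>
>>>     # pacemaker-2.0 sub-daemons whose freeze went undetected
>>>     DAEMONS = ["pacemaker-based", "pacemaker-execd", "pacemaker-attrd",
>>>                "pacemaker-schedulerd", "pacemaker-controld"]
>>>
>>>     def freeze(name):
>>>         """Send SIGSTOP to the first process matching 'name'."""
>>>         for proc in psutil.process_iter(["pid", "name"]):
>>>             if proc.info["name"] == name:
>>>                 proc.send_signal(signal.SIGSTOP)  # freeze, don't kill
>>>                 return proc.info["pid"]
>>>         raise RuntimeError(name + " not found")
>>>
>>>     # Freeze one daemon, then watch whether any resource failure is
>>>     # reported or the node self-fences within the watchdog timeout.
>>>     print(freeze("pacemaker-execd"))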
>> You are right. The issue was already known when I created these slides,
>> so a plan for improving the observation of the pacemaker daemons
>> probably should have gone into them.
>>
> It's good news that there is a plan for improvement,
> so I registered it in CLBZ as a reminder:
> https://bugs.clusterlabs.org/show_bug.cgi?id=5356
>
> Best Regards
Wasn't there a bug filed before?

Klaus

>
>> Thanks for bringing this to the table.
>> I guess the issue has been a bit neglected recently.
>>
>>> As a result of our discussion, we would like SBD to detect this
>>> condition and reset the machine.
>> Implementation-wise I would go for some kind of split solution
>> between Pacemaker and SBD: Pacemaker observing its sub-daemons by
>> itself, while some kind of heartbeat (implicit via corosync or
>> explicit) between Pacemaker and SBD assures that this internal
>> observation is doing its job properly.
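>>
>> A very rough sketch of the SBD side of that idea (purely illustrative;
>> the names, timeout and callbacks are made up and this is not the
>> planned implementation):
>>
>>   import time
>>
>>   HEARTBEAT_TIMEOUT = 5.0   # hypothetical tolerance for silence
>>   last_beat = time.monotonic()
>>
>>   def on_pacemaker_heartbeat():
>>       # Called whenever pacemaker reports that its own observation
>>       # of the sub-daemons is still healthy.
>>       global last_beat
>>       last_beat = time.monotonic()
>>
>>   def sbd_loop(tickle_watchdog, self_fence):
>>       # SBD keeps tickling the hardware watchdog only while
>>       # pacemaker's self-observation keeps reporting in; otherwise
>>       # it lets the watchdog reset the node.
>>       while True:
>>           if time.monotonic() - last_beat > HEARTBEAT_TIMEOUT:
>>               self_fence()
>>           else:
>>               tickle_watchdog()
>>           time.sleep(1)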
>>
>>> Also, for users who have neither a shared disk nor qdevice,
>>> we need an option that works even without real quorum.
>>> (Fence races would be avoided with a delay attribute:
>>>  https://access.redhat.com/solutions/91653
>>>  https://access.redhat.com/solutions/1293523)
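>>>
>>> The delay idea in those articles boils down to something like the
>>> following sketch (illustrative only; in a real cluster this is done
>>> with a delay parameter on the fencing resource, not hand-written code):
>>>
>>>   import random
>>>   import time
>>>
>>>   def fence_peer(fence, max_delay=10):
>>>       # In a split, both nodes try to fence each other; a static or
>>>       # random delay on one side lets the faster node shoot first,
>>>       # avoiding a mutual reset.
>>>       time.sleep(random.uniform(0, max_delay))
>>>       fence()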
>> I'm not sure I get your point here.
>> Watchdog fencing on a 2-node cluster without an
>> additional qdevice or shared disk is, to my mind, like
>> denying the laws of physics.
>> At the moment I don't see why auto_tie_breaker
>> wouldn't work on clusters of 4 nodes and up here.
>>
>> Regards,
>> Klaus
>>> Best Regards,
>>> Kazunori INOUE
>>>
>>>> -----Original Message-----
>>>> From: Users [mailto:users-bounces at clusterlabs.org] On Behalf Of Klaus Wenninger
>>>> Sent: Friday, May 25, 2018 4:08 PM
>>>> To: users at clusterlabs.org
>>>> Subject: Re: [ClusterLabs] Questions about SBD behavior
>>>>
>>>> On 05/25/2018 07:31 AM, 井上 和徳 wrote:
>>>>> Hi,
>>>>>
>>>>> I am checking the watchdog functionality of SBD (without a shared block device).
>>>>> In a two-node cluster, if one node is stopped, the watchdog is
>>>>> triggered on the remaining node.
>>>>> Is this the designed behavior?
>>>> SBD without a shared block-device doesn't really make sense on
>>>> a two-node cluster.
>>>> The basic idea is - e.g. in the case of a networking problem -
>>>> that a cluster splits up into a quorate and a non-quorate partition.
>>>> The quorate partition stays up, while SBD guarantees reliable
>>>> watchdog-based self-fencing of the non-quorate partition
>>>> within a defined timeout.
>>>> This idea of course doesn't work with just 2 nodes.
>>>> Taking quorum info from the 2-node feature of corosync (automatically
>>>> switching on wait-for-all) doesn't help in this case but instead
>>>> would lead to split-brain.
>>>> What you can do - and what e.g. pcs does automatically - is enable
>>>> the auto-tie-breaker instead of two-node in corosync. But that
>>>> still doesn't give you higher availability than that of the
>>>> winner of auto-tie-breaker. (Maybe interesting if you are going
>>>> for a load-balancing scenario that doesn't affect availability, or
>>>> for a transient state while setting up a cluster node-by-node ...)
>>>> What you can do, though, is use qdevice to still have 'real quorum'
>>>> info with just 2 full cluster nodes.
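>>>>
>>>> To illustrate why plain two-node mode can't help here while
>>>> auto-tie-breaker can, a simplified model of the quorum decision
>>>> (not corosync's actual code; the default tie-breaker is the lowest
>>>> node ID):
>>>>
>>>>   def has_quorum_atb(partition, all_nodes):
>>>>       # Majority partitions are always quorate.
>>>>       if 2 * len(partition) > len(all_nodes):
>>>>           return True
>>>>       # In an even split, only the side holding the tie-breaker
>>>>       # node (lowest node ID by default) stays quorate.
>>>>       if 2 * len(partition) == len(all_nodes):
>>>>           return min(all_nodes) in partition
>>>>       return False
>>>>
>>>>   # 2-node split: exactly one side keeps quorum, so availability is
>>>>   # bounded by that one predetermined node.
>>>>   print(has_quorum_atb({1}, {1, 2}), has_quorum_atb({2}, {1, 2}))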
>>>>
>>>> There was quite a lot of discussion around this topic in this
>>>> thread previously, if you search the history.
>>>>
>>>> Regards,
>>>> Klaus
>>> _______________________________________________
>>> Users mailing list: Users at clusterlabs.org
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org 



