[ClusterLabs] Re: 2-Node Cluster Pointless?
Klaus Wenninger
kwenning at redhat.com
Sat Apr 22 10:31:20 CEST 2017
On 04/22/2017 09:20 AM, Digimer wrote:
> On 22/04/17 03:05 AM, Andrei Borzenkov wrote:
>> On 18.04.2017 10:47, Ulrich Windl wrote:
>> ...
>>>> Now let me come back to quorum vs. stonith;
>>>>
>>>> Said simply: quorum is a tool for when everything is working.
>>>> Fencing is a tool for when things go wrong.
>>> I'd say: Quorum is the tool to decide who'll be alive and who's going to die,
>>> and STONITH is the tool to make nodes die.
>> If I had PROD, QA and DEV in a cluster and PROD were separated
>> from QA+DEV, I'd be very sad if PROD were shut down.
>>
>> Simple node majority is not an appropriate kill policy, and
>> neither are simple node-based delays. I wish pacemaker supported
>> a scoring system for resources so that we could base stonith
>> delays on it (the most important sub-cluster would start fencing
>> first).
>>
>>
>>> If everything is working you need
>>> neither quorum nor STONITH.
>>>
>> I wonder how SBD fits into this discussion. It is marketed as a
>> stonith agent, but it is based on committing suicide, so it
>> relies on well-behaved nodes - which we by definition cannot
>> trust to behave well; otherwise we'd not need stonith in the
>> first place.
> The logic, when using a watchdog timer, is that if the node is alive
> enough to kick the watchdog, it's alive enough to not do something dumb
> to the cluster. If it's not able to kick the timer, the watchdog timer
> will reset the machine. This works *if* all resources hang when messages
> stop coming back from the peer (a side effect of corosync's virtual
> synchrony).
In fact, watchdog implementations (meaning the software that kicks
the hardware watchdog) are a little bit smarter than that - and so
is SBD.
By keeping the watchdog-kicking and the observation code together
in a simple loop that is executed periodically, you don't need the
'if it is alive enough to do the kicking it will behave well'
assumption.
This boils down to keeping the critical part of the code very
small; on top of that, hard-to-control failures that result in any
kind of hang don't bother us, because once the loop hangs the
kicking stops and the hardware watchdog resets the node.
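To illustrate, here is a minimal sketch of such a loop, assuming the
standard Linux /dev/watchdog interface from linux/watchdog.h. The
health_check() function is a hypothetical stand-in for whatever
observation code the daemon runs; the real sbd daemon is of course
considerably more involved:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* Hypothetical placeholder: return 0 only while everything the
 * daemon observes (disk slot reader, cluster link, ...) is sane. */
static int health_check(void)
{
    return 0;
}

int main(void)
{
    int timeout = 5; /* seconds until the hardware resets the node */
    int fd = open("/dev/watchdog", O_WRONLY);

    if (fd < 0) {
        perror("open /dev/watchdog");
        return EXIT_FAILURE;
    }
    ioctl(fd, WDIOC_SETTIMEOUT, &timeout);

    for (;;) {
        /* Observation and kicking live in the same small loop: if
         * the check fails, or this process hangs or dies anywhere,
         * the kicking stops and the hardware watchdog fires. */
        if (health_check() != 0)
            break; /* stop kicking -> reset after 'timeout' */
        ioctl(fd, WDIOC_KEEPALIVE, 0);
        sleep(1);
    }
    /* Deliberately never send the magic close character ('V'), so
     * the watchdog keeps counting down and resets the machine. */
    for (;;)
        pause();
}

The point is that no cooperation from the failed code is needed: a
hang anywhere simply means the kicking stops.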
>
> So as I understand it, for SBD to be safe, it requires a hardware
> watchdog timer and a properly configured cluster.
Yes, yes and yes ... as important as fencing I would say ;-)
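For completeness, a heavily simplified sketch of the poison-pill
side of SBD, assuming each node polls its own slot on the shared
disk; SLOT_OFFSET and CMD_FENCE are hypothetical and the real sbd
on-disk format differs. A check like this would be one of the
observations feeding the loop above:

#define _XOPEN_SOURCE 500 /* for pread() */
#include <unistd.h>

#define SLOT_OFFSET 4096 /* hypothetical: this node's slot on disk */
#define CMD_FENCE   0x01 /* hypothetical poison-pill marker */

/* Returns nonzero if a peer has written a poison pill into our
 * slot - or if the shared disk can no longer be read, in which
 * case this sketch errs on the side of self-fencing. */
static int poison_pill_pending(int disk_fd)
{
    unsigned char cmd = 0;

    if (pread(disk_fd, &cmd, 1, SLOT_OFFSET) != 1)
        return 1;
    return cmd == CMD_FENCE;
}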
Regards,
Klaus
>
--
Klaus Wenninger
Senior Software Engineer, EMEA ENG Openstack Infrastructure
Red Hat
kwenning at redhat.com