[Pacemaker] stonith and avoiding split brain in two nodes cluster

Andrew Beekhof andrew at beekhof.net
Tue Mar 26 14:23:40 EDT 2013

On Tue, Mar 26, 2013 at 6:30 PM, Angel L. Mateo <amateo at um.es> wrote:
> On 25/03/13 20:50, Jacek Konieczny wrote:
>> On Mon, 25 Mar 2013 20:01:28 +0100
>> "Angel L. Mateo" <amateo at um.es> wrote:
>>>> quorum {
>>>>         provider: corosync_votequorum
>>>>         expected_votes: 2
>>>>         two_node: 1
>>>> }
>>>> Corosync will then manage quorum for the two-node cluster and
>>>> Pacemaker
>>>    I'm using corosync 1.1, which is the one provided with my
>>> distribution (Ubuntu 12.04). I could also use cman.
>> I don't think corosync 1.1 can do that, but I guess in this case cman
>> should be able provide this functionality.
>         Sorry, it's corosync 1.4, not 1.1.
>>>> can use that. You still need proper fencing to enforce the quorum
>>>> (both for pacemaker and the storage layer – dlm in case you use
>>>> clvmd), but no
>>>> extra quorum node is needed.
>>>    I have configured a dlm resource used with clvm.
>>>    One doubt... With this configuration, how is the split brain
>>> problem handled?
>> The first node to notice that the other is unreachable will fence (kill)
>> the other, making sure it is the only one operating on the shared data.
>> Even though it is only half of the nodes, the cluster is considered
>> quorate, as the other node is known not to be running any cluster
>> resources.
>> When the fenced node reboots, its cluster stack starts, but with no
>> quorum until it communicates with the surviving node again. So no
>> cluster services are started there until both nodes communicate
>> properly and quorum is recovered.
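>> The quorum state described above can be checked on each node with
>> corosync-quorumtool; a minimal sketch (flags as in corosync 2.x, so
>> this assumes a votequorum-capable stack rather than corosync 1.4):
>>
>> # Show this node's view of quorum; the "Quorate:" line tells you
>> # whether the cluster will start resources here
>> corosync-quorumtool -s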
>         But, will this work with corosync 1.4? Although with corosync 1.4 I
> may not be able to use the quorum configuration you mentioned (I'll try), I
> have configured no-quorum-policy="ignore" so the cluster can still run in
> the case of one node failing. Could this be a problem?

It's essentially required for two-node clusters, as quorum makes no sense
there. Without it the cluster would stop everything (everywhere) when a
node failed (because quorum was lost).
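A minimal sketch of setting this, using crmsh syntax (pcs has an
equivalent `pcs property set`):

    # Don't stop resources when quorum is lost (two-node cluster)
    crm configure property no-quorum-policy=ignore
    # Keep fencing on -- it is what actually prevents split brain
    crm configure property stonith-enabled=true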

But it also tells Pacemaker it can fence failed nodes (this is a good
thing, as we can't recover the services from a failed node until we're
100% sure the node is powered off).
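For completeness, a hedged example of one such fencing resource, again in
crmsh syntax using the external/ipmi agent; the node name, IP, and
credentials are placeholders, and the location constraint stops a node
from being asked to fence itself:

    # IPMI-based stonith device for node1 (all params are illustrative)
    crm configure primitive fence-node1 stonith:external/ipmi \
        params hostname=node1 ipaddr=192.168.1.101 \
               userid=admin passwd=secret interface=lan
    # Never run node1's fencing device on node1 itself
    crm configure location l-fence-node1 fence-node1 -inf: node1

A second, mirrored primitive would be needed for the other node.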
