[ClusterLabs] Pacemaker responsible of DRBD and a systemd resource

Wed Nov 15 15:51:34 EST 2017

I've driven for 22 years and never needed my seatbelt, yet I still make
sure I use it every time I am in a car. ;)

Why it happened now is perhaps an interesting question, but it is one I
would try to answer after fixing the core problem.

cheers,

digimer

On 2017-11-15 03:37 PM, Derek Wuelfrath wrote:
> And just to make sure, I’m not the kind of person who sticks to the “we
> always did it that way…” ;)
> Just trying to figure out why it suddenly breaks.
> 
> -derek
> 
> --
> Derek Wuelfrath
> dwuelfrath at inverse.ca :: +1.514.447.4918 (x110) :: +1.866.353.6153 (x110)
> Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence
> (www.packetfence.org) and Fingerbank (www.fingerbank.org)
> 
>> On Nov 15, 2017, at 15:30, Derek Wuelfrath <dwuelfrath at inverse.ca> wrote:
>>
>> I agree. The thing is, we have had this kind of setup widely deployed
>> for a while and never ran into any issue.
>> Not sure if something changed in the Corosync/Pacemaker code or in the
>> way it deals with systemd resources.
>>
>> As said, without a systemd resource, everything just works as it
>> should… 100% of the time.
>> As soon as a systemd resource comes in, it breaks.
>>
>> -derek
>>
>>> On Nov 14, 2017, at 23:03, Digimer <lists at alteeve.ca> wrote:
>>>
>>> Quorum doesn't prevent split-brains, stonith (fencing) does. 
>>>
>>> https://www.alteeve.com/w/The_2-Node_Myth
>>>
>>> There is no way to use quorum-only to avoid a potential split-brain.
>>> You might be able to make it less likely with enough effort, but
>>> never prevent it.
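
The point above can be illustrated with a little vote arithmetic. The sketch below is a minimal model, not corosync code: the function names are hypothetical, and the `two_node` flag is assumed to behave like corosync's votequorum option of the same name, where a single vote is enough in a two-node cluster.

```python
def quorum_threshold(expected_votes: int, two_node: bool = False) -> int:
    """Votes a partition needs to be quorate (simple majority model)."""
    if two_node and expected_votes == 2:
        # Assumption: mirrors votequorum's two_node mode, where one vote suffices.
        return 1
    return expected_votes // 2 + 1


def is_quorate(partition_votes: int, expected_votes: int, two_node: bool = False) -> bool:
    """True if a partition holding partition_votes may keep running resources."""
    return partition_votes >= quorum_threshold(expected_votes, two_node)


if __name__ == "__main__":
    # Two nodes lose contact with each other: each partition holds one vote.
    for two_node in (False, True):
        sides = [is_quorate(1, 2, two_node) for _ in range(2)]
        print(f"two_node={two_node}: partitions quorate = {sides}")
```

Without `two_node`, neither side is quorate and the whole cluster stops. With it, both sides are quorate and both may promote DRBD. In neither case does quorum tell a node that its peer is really gone; only fencing does.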
>>>
>>> digimer
>>>
>>> On 2017-11-14 10:45 PM, Garima wrote:
>>>> Hello All,
>>>>  
>>>> A split-brain situation occurs when quorum is lost and status
>>>> information is no longer exchanged between the two nodes of the
>>>> cluster.
>>>> This can be avoided if quorum communicates between both the nodes.
>>>> I have checked the code. In my opinion these files
>>>> (quorum.py/stonith.py) need to be updated to avoid the split-brain
>>>> situation and maintain the Active-Passive configuration.
>>>>  
>>>> Regards,
>>>> Garima
>>>>  
>>>> *From:* Derek Wuelfrath [mailto:dwuelfrath at inverse.ca] 
>>>> *Sent:* 13 November 2017 20:55
>>>> *To:* Cluster Labs - All topics related to open-source clustering
>>>> welcomed <users at clusterlabs.org>
>>>> *Subject:* Re: [ClusterLabs] Pacemaker responsible of DRBD and a
>>>> systemd resource
>>>>  
>>>> Hello Ken !
>>>>  
>>>>
>>>>     Make sure that the systemd service is not enabled. If pacemaker is
>>>>     managing a service, systemd can't also be trying to start and
>>>>     stop it.
>>>>
>>>>  
>>>> It is not. I made sure of this in the first place :)
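
For anyone wanting to double-check the same thing on their own nodes, here is a minimal sketch. It assumes `systemctl is-enabled` is available and treats only the `enabled` and `enabled-runtime` states as a conflict; the unit name `example.service` and the helper names are hypothetical. Units pulled in by other units (for example `static` ones) are not covered.

```python
import subprocess

# States reported by `systemctl is-enabled` that mean systemd itself
# would start the unit at boot, independently of pacemaker.
BOOT_ENABLED_STATES = {"enabled", "enabled-runtime"}


def starts_at_boot(is_enabled_output: str) -> bool:
    """Interpret the text printed by `systemctl is-enabled <unit>`."""
    return is_enabled_output.strip() in BOOT_ENABLED_STATES


def unit_enable_state(unit: str) -> str:
    """Return the enable state of a unit, or 'unknown' without systemctl."""
    try:
        # is-enabled exits non-zero for disabled units, so do not use check=True.
        result = subprocess.run(
            ["systemctl", "is-enabled", unit],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return "unknown"
    return result.stdout.strip() or "unknown"


if __name__ == "__main__":
    unit = "example.service"  # hypothetical unit name
    state = unit_enable_state(unit)
    if starts_at_boot(state):
        print(f"{unit} is {state}: systemd and pacemaker may both start it")
    else:
        print(f"{unit} is {state}")
```

Running it on every node, once per systemd resource in the group, covers the case where a unit was disabled on one node only.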
>>>>  
>>>>
>>>>     Beyond that, the question is what log messages are there from around
>>>>     the time of the issue (on both nodes).
>>>>
>>>>  
>>>> Well, that’s the thing. There are not many log messages telling what
>>>> is actually happening. The ‘systemd’ resource is not even trying to
>>>> start (nothing in either log for that resource). Here are the logs
>>>> from my last attempt:
>>>> Scenario:
>>>> - Services were running on ‘pancakeFence2’. DRBD was synced and
>>>> connected
>>>> - I rebooted ‘pancakeFence2’. Services failed over to ‘pancakeFence1’
>>>> - After ‘pancakeFence2’ comes back, services are running just fine
>>>> on ‘pancakeFence1’ but DRBD is in StandAlone due to split-brain
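
The last step of the scenario can be spotted mechanically. The sketch below assumes a DRBD 8.x style `/proc/drbd`, where each device line starts with the minor number followed by a `cs:` (connection state) field; the helper names are hypothetical. StandAlone only means the node is not talking to its peer, so the kernel log remains the place to confirm an actual split-brain.

```python
import re
from pathlib import Path
from typing import Dict, List

# Matches device lines such as " 0: cs:StandAlone ro:Primary/Unknown ..."
_STATE_LINE = re.compile(r"^\s*(\d+):\s+cs:(\S+)", re.MULTILINE)


def connection_states(proc_drbd_text: str) -> Dict[int, str]:
    """Map each DRBD minor number to its connection state (cs: field)."""
    return {int(minor): state for minor, state in _STATE_LINE.findall(proc_drbd_text)}


def standalone_minors(proc_drbd_text: str) -> List[int]:
    """Minors whose connection state is StandAlone."""
    states = connection_states(proc_drbd_text)
    return sorted(minor for minor, state in states.items() if state == "StandAlone")


if __name__ == "__main__":
    proc_drbd = Path("/proc/drbd")
    if proc_drbd.exists():
        minors = standalone_minors(proc_drbd.read_text())
        print("StandAlone minors:", minors or "none")
    else:
        print("/proc/drbd not found; is the drbd module loaded?")
```

Running this on both nodes right after the rebooted node rejoins shows whether the two sides ever reconnected.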
>>>>  
>>>> Logs for pancakeFence1: https://pastebin.com/dVSGPP78
>>>> Logs for pancakeFence2: https://pastebin.com/at8qPkHE
>>>>  
>>>> It really looks like the status check mechanism of
>>>> corosync/pacemaker for a systemd resource forces the resource to
>>>> “start” and therefore starts the ones above that resource in the
>>>> group (DRBD in this instance).
>>>> This does not happen for a regular OCF resource (IPaddr2, for example).
>>>>
>>>> Cheers!
>>>> -dw
>>>>  
>>>>
>>>>
>>>>     On Nov 10, 2017, at 11:39, Ken Gaillot <kgaillot at redhat.com> wrote:
>>>>      
>>>>     On Thu, 2017-11-09 at 20:27 -0500, Derek Wuelfrath wrote:
>>>>
>>>>         Hello there,
>>>>
>>>>         First post here, but I have been following for a while!
>>>>
>>>>
>>>>     Welcome!
>>>>
>>>>
>>>>
>>>>         Here’s my issue:
>>>>         we have been putting in place and running this type of cluster
>>>>         for a while and never really encountered this kind of problem.
>>>>
>>>>         I recently set up a Corosync / Pacemaker / PCS cluster to
>>>>         manage DRBD along with various other resources. Some of these
>>>>         resources are systemd resources… this is the part where things
>>>>         are “breaking”.
>>>>
>>>>         A two-server cluster running only DRBD, or DRBD with an OCF
>>>>         IPaddr2 resource (cluster IP in this instance), works just
>>>>         fine. I can easily move from one node to the other without
>>>>         any issue.
>>>>         As soon as I add a systemd resource to the resource group,
>>>>         things break. Moving from one node to the other using standby
>>>>         mode works just fine, but as soon as a Corosync / Pacemaker
>>>>         restart involves polling of a systemd resource, it seems like
>>>>         it is trying to start the whole resource group and therefore
>>>>         creates a split-brain of the DRBD resource.
>>>>
>>>>
>>>>     My first two suggestions would be:
>>>>
>>>>     Make sure that the systemd service is not enabled. If pacemaker is
>>>>     managing a service, systemd can't also be trying to start and
>>>>     stop it.
>>>>
>>>>     Fencing is the only way pacemaker can resolve split-brains and
>>>>     certain other situations, so that will help in the recovery.
>>>>
>>>>     Beyond that, the question is what log messages are there from around
>>>>     the time of the issue (on both nodes).
>>>>
>>>>
>>>>
>>>>
>>>>         This is the best explanation / description of the situation
>>>>         that I can give. If it needs any clarification, examples, … I
>>>>         am more than open to share them.
>>>>
>>>>         Any guidance would be appreciated :)
>>>>
>>>>         Here’s the output of a ‘pcs config’
>>>>
>>>>         https://pastebin.com/1TUvZ4X9
>>>>
>>>>         Cheers!
>>>>         -dw
>>>>
>>>>
>>>>     -- 
>>>>     Ken Gaillot <kgaillot at redhat.com>
>>>>
>>>>     _______________________________________________
>>>>     Users mailing list: Users at clusterlabs.org
>>>>     http://lists.clusterlabs.org/mailman/listinfo/users
>>>>
>>>>     Project Home: http://www.clusterlabs.org
>>>>     Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>     Bugs: http://bugs.clusterlabs.org
>>>>
>>>>  
>>>>
>>>>
>>>
>>>
>>
> 
> 
> 
> 

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould