[ClusterLabs] Pacemaker responsible of DRBD and a systemd resource
Digimer
lists at alteeve.ca
Wed Nov 15 15:51:34 EST 2017
I've driven for 22 years and never needed my seatbelt before, yet I
still make sure I use it every time I am in a car. ;)
Why it happened now is perhaps an interesting question, but it is one I
would try to answer after fixing the core problem.
cheers,
digimer
On 2017-11-15 03:37 PM, Derek Wuelfrath wrote:
> And just to make sure, I’m not the kind of person who sticks to the “we
> always did it that way…” ;)
> Just trying to figure out why it suddenly breaks.
>
> -derek
>
> --
> Derek Wuelfrath
> dwuelfrath at inverse.ca <mailto:dwuelfrath at inverse.ca> :: +1.514.447.4918
> (x110) :: +1.866.353.6153 (x110)
> Inverse inc. :: Leaders behind SOGo (www.sogo.nu <https://www.sogo.nu>),
> PacketFence (www.packetfence.org <https://www.packetfence.org/>) and
> Fingerbank (www.fingerbank.org <https://www.fingerbank.org>)
>
>> On Nov 15, 2017, at 15:30, Derek Wuelfrath <dwuelfrath at inverse.ca
>> <mailto:dwuelfrath at inverse.ca>> wrote:
>>
>> I agree. The thing is, we have had this kind of setup widely deployed
>> for a while and never ran into any issue.
>> I am not sure whether something changed in the Corosync/Pacemaker code
>> or in the way it deals with systemd resources.
>>
>> As said, without a systemd resource, everything just works as it
>> should… 100% of the time.
>> As soon as a systemd resource comes in, it breaks.
>>
>> -derek
>>
>>
>>> On Nov 14, 2017, at 23:03, Digimer <lists at alteeve.ca
>>> <mailto:lists at alteeve.ca>> wrote:
>>>
>>> Quorum doesn't prevent split-brains, stonith (fencing) does.
>>>
>>> https://www.alteeve.com/w/The_2-Node_Myth
>>>
>>> There is no way to use quorum-only to avoid a potential split-brain.
>>> You might be able to make it less likely with enough effort, but
>>> never prevent it.
>>>
>>> digimer
>>>
>>> On 2017-11-14 10:45 PM, Garima wrote:
>>>> Hello All,
>>>>
>>>> A split-brain situation occurs when there is a drop in quorum and
>>>> status information is no longer exchanged between the two nodes of
>>>> the cluster.
>>>> This can be avoided if quorum communicates between both nodes.
>>>> I have checked the code. In my opinion, these files
>>>> (quorum.py/stonith.py) need to be updated to avoid the split-brain
>>>> situation and maintain the Active-Passive configuration.
>>>>
>>>> Regards,
>>>> Garima
>>>>
>>>> *From:* Derek Wuelfrath [mailto:dwuelfrath at inverse.ca]
>>>> *Sent:* 13 November 2017 20:55
>>>> *To:* Cluster Labs - All topics related to open-source clustering
>>>> welcomed <users at clusterlabs.org>
>>>> *Subject:* Re: [ClusterLabs] Pacemaker responsible of DRBD and a
>>>> systemd resource
>>>>
>>>> Hello Ken!
>>>>
>>>>> Make sure that the systemd service is not enabled. If pacemaker is
>>>>> managing a service, systemd can't also be trying to start and
>>>>> stop it.
>>>>
>>>> It is not. I made sure of this in the first place :)
>>>>
>>>>> Beyond that, the question is what log messages are there from around
>>>>> the time of the issue (on both nodes).
>>>>
>>>> Well, that’s the thing. There are not many log messages telling what
>>>> is actually happening. The ‘systemd’ resource is not even trying to
>>>> start (nothing in either log for that resource). Here are the logs
>>>> from my last attempt:
>>>> Scenario:
>>>> - Services were running on ‘pancakeFence2’. DRBD was synced and
>>>> connected
>>>> - I rebooted ‘pancakeFence2’. Services failed over to ‘pancakeFence1’
>>>> - After ‘pancakeFence2’ comes back, services are running just fine
>>>> on ‘pancakeFence1’ but DRBD is in Standalone due to split-brain
>>>>
>>>> Logs for pancakeFence1: https://pastebin.com/dVSGPP78
>>>> Logs for pancakeFence2: https://pastebin.com/at8qPkHE
>>>>
>>>> It really looks like the status check mechanism of
>>>> corosync/pacemaker for a systemd resource forces the resource to
>>>> “start” and therefore starts the ones above that resource in the
>>>> group (DRBD, for instance).
>>>> This does not happen for a regular OCF resource (IPaddr2, for example).
>>>>
>>>> Cheers!
>>>> -dw
>>>>
>>>>
>>>>
>>>> On Nov 10, 2017, at 11:39, Ken Gaillot <kgaillot at redhat.com
>>>> <mailto:kgaillot at redhat.com>> wrote:
>>>>
>>>> On Thu, 2017-11-09 at 20:27 -0500, Derek Wuelfrath wrote:
>>>>
>>>>> Hello there,
>>>>>
>>>>> First post here, but I have been following the list for a while!
>>>>
>>>>
>>>> Welcome!
>>>>
>>>>
>>>>
>>>>> Here’s my issue: we have been deploying and running this type of
>>>>> cluster for a while and never really encountered this kind of
>>>>> problem.
>>>>>
>>>>> I recently set up a Corosync / Pacemaker / PCS cluster to manage
>>>>> DRBD along with various other resources. Some of these resources
>>>>> are systemd resources… this is the part where things are
>>>>> “breaking”.
>>>>>
>>>>> A two-server cluster running only DRBD, or DRBD with an OCF
>>>>> IPaddr2 resource (cluster IP, for instance), works just fine. I
>>>>> can easily move from one node to the other without any issue.
>>>>> As soon as I add a systemd resource to the resource group, things
>>>>> break. Moving from one node to the other using standby mode works
>>>>> just fine, but as soon as a Corosync / Pacemaker restart involves
>>>>> polling of a systemd resource, it seems to try to start the whole
>>>>> resource group and therefore creates a split-brain of the DRBD
>>>>> resource.
>>>>
>>>>
>>>> My first two suggestions would be:
>>>>
>>>> Make sure that the systemd service is not enabled. If pacemaker is
>>>> managing a service, systemd can't also be trying to start and
>>>> stop it.
>>>>
>>>> Fencing is the only way pacemaker can resolve split-brains and certain
>>>> other situations, so that will help in the recovery.
>>>>
>>>> Beyond that, the question is what log messages are there from around
>>>> the time of the issue (on both nodes).
>>>>
>>>>
>>>>
>>>>
>>>>> It is the best explanation / description of the situation that I
>>>>> can give. If it needs any clarification, examples, … I am more
>>>>> than open to sharing them.
>>>>>
>>>>> Any guidance would be appreciated :)
>>>>>
>>>>> Here’s the output of a ‘pcs config’:
>>>>>
>>>>> https://pastebin.com/1TUvZ4X9
>>>>>
>>>>> Cheers!
>>>>> -dw
>>>>
>>>>
>>>> --
>>>> Ken Gaillot <kgaillot at redhat.com <mailto:kgaillot at redhat.com>>
>>>>
>>>> _______________________________________________
>>>> Users mailing list: Users at clusterlabs.org
>>>> <mailto:Users at clusterlabs.org>
>>>> http://lists.clusterlabs.org/mailman/listinfo/users
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> <http://www.clusterlabs.org/>
>>>> Getting
>>>> started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org <http://bugs.clusterlabs.org/>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Digimer
>>> Papers and Projects: https://alteeve.com/w/
>>> "I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
>>
>
>
>
>
--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
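
Ken's first suggestion in the quoted thread (a service managed by pacemaker must not also be enabled in systemd) can be checked with a small script. This is only a rough sketch and not something posted in the thread. It assumes a host where `systemctl` is on the PATH, and the function names and the unit name `example.service` are hypothetical. It treats only the `enabled` and `enabled-runtime` states as a definite conflict; a unit in another state (for example `static`) might still be pulled in by a dependency, so a negative result is not proof that systemd will never start it.

```python
import shutil
import subprocess

# States printed by `systemctl is-enabled` that mean systemd itself will
# start the unit at boot, competing with the cluster manager for control.
BOOT_STARTED_STATES = {"enabled", "enabled-runtime"}


def systemd_may_start_unit(is_enabled_output: str) -> bool:
    """Return True if `systemctl is-enabled` output indicates a boot-time start."""
    return is_enabled_output.strip() in BOOT_STARTED_STATES


def check_unit(unit: str) -> bool:
    """Ask systemctl about `unit` (hypothetical helper; needs systemctl on PATH)."""
    if shutil.which("systemctl") is None:
        raise RuntimeError("systemctl not found on this host")
    # `is-enabled` exits non-zero for disabled units, so do not use check=True.
    result = subprocess.run(
        ["systemctl", "is-enabled", unit],
        capture_output=True,
        text=True,
        check=False,
    )
    return systemd_may_start_unit(result.stdout)
```

Running `check_unit("example.service")` on each node, with the real unit name substituted, should return False on both before the resource is handed to pacemaker.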