[ClusterLabs] DRBD failover in Pacemaker

Devin Ortner Devin.Ortner at gtshq.onmicrosoft.com
Wed Sep 7 21:15:27 UTC 2016


Message: 1
Date: Wed, 7 Sep 2016 19:23:04 +0900
From: Digimer <lists at alteeve.ca>
To: Cluster Labs - All topics related to open-source clustering
	welcomed	<users at clusterlabs.org>
Subject: Re: [ClusterLabs] DRBD failover in Pacemaker
Message-ID: <b1e95242-1b0d-ed28-2ba8-d6b58d152fea at alteeve.ca>
Content-Type: text/plain; charset=windows-1252

> no-quorum-policy: ignore
> stonith-enabled: false

You must have fencing configured.

CentOS 6 uses pacemaker with the cman plugin. So set up cman
(cluster.conf) to use the fence_pcmk passthrough agent, then set up
proper stonith in pacemaker (and test that it works). Finally, tell
DRBD to use 'fencing resource-and-stonith;' and configure the
'crm-{un,}fence-peer.sh' {un,}fence handlers.
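
A rough sketch only (node names and the DRBD resource name "r0" below
are placeholders, not taken from your config). In cluster.conf, point
each node's fencing at the fence_pcmk passthrough agent:

  <clusternodes>
    <clusternode name="node1" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="node1"/>
        </method>
      </fence>
    </clusternode>
    <!-- same again for node2 -->
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>

In pacemaker, turn stonith back on (pcs property set
stonith-enabled=true) and configure and test real stonith resources,
e.g. fence_ipmilan ones. Then in the DRBD resource definition:

  resource r0 {
    disk {
      fencing resource-and-stonith;
    }
    handlers {
      fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    # keep your existing device/disk/net/address settings as they are
  }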

See if that gets things working.

On 07/09/16 04:04 AM, Devin Ortner wrote:
> I have a 2-node cluster running CentOS 6.8 and Pacemaker with DRBD. I have been using the "Clusters from Scratch" documentation to create my cluster and I am running into a problem where DRBD is not failing over to the other node when one goes down. Here is my "pcs status" prior to when it is supposed to fail over:
> 
> ----------------------------------------------------------------------
> 
> [root at node1 ~]# pcs status
> Cluster name: webcluster
> Last updated: Tue Sep  6 14:50:21 2016		Last change: Tue Sep  6 14:50:17 2016 by root via crm_attribute on node1
> Stack: cman
> Current DC: node2 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
> 2 nodes and 5 resources configured
> 
> Online: [ node1 node2 ]
> 
> Full list of resources:
> 
>  Cluster_VIP	(ocf::heartbeat:IPaddr2):	Started node1
>  Master/Slave Set: ClusterDBclone [ClusterDB]
>      Masters: [ node1 ]
>      Slaves: [ node2 ]
>  ClusterFS	(ocf::heartbeat:Filesystem):	Started node1
>  WebSite	(ocf::heartbeat:apache):	Started node1
> 
> Failed Actions:
> * ClusterFS_start_0 on node2 'unknown error' (1): call=61, status=complete, exitreason='none',
>     last-rc-change='Tue Sep  6 13:15:00 2016', queued=0ms, exec=40ms
> 
> 
> PCSD Status:
>   node1: Online
>   node2: Online
> 
> [root at node1 ~]#
> 
> When I put node1 in standby, everything fails over except DRBD:
> ----------------------------------------------------------------------
> 
> [root at node1 ~]# pcs cluster standby node1
> [root at node1 ~]# pcs status
> Cluster name: webcluster
> Last updated: Tue Sep  6 14:53:45 2016		Last change: Tue Sep  6 14:53:37 2016 by root via cibadmin on node2
> Stack: cman
> Current DC: node2 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
> 2 nodes and 5 resources configured
> 
> Node node1: standby
> Online: [ node2 ]
> 
> Full list of resources:
> 
>  Cluster_VIP	(ocf::heartbeat:IPaddr2):	Started node2
>  Master/Slave Set: ClusterDBclone [ClusterDB]
>      Slaves: [ node2 ]
>      Stopped: [ node1 ]
>  ClusterFS	(ocf::heartbeat:Filesystem):	Stopped
>  WebSite	(ocf::heartbeat:apache):	Started node2
> 
> Failed Actions:
> * ClusterFS_start_0 on node2 'unknown error' (1): call=61, status=complete, exitreason='none',
>     last-rc-change='Tue Sep  6 13:15:00 2016', queued=0ms, exec=40ms
> 
> 
> PCSD Status:
>   node1: Online
>   node2: Online
> 
> [root at node1 ~]#
> 
> I have pasted the contents of "/var/log/messages" here:
> http://pastebin.com/0i0FMzGZ
> Here is my Configuration: http://pastebin.com/HqqBV90p
> 
> When I unstandby node1, it comes back as the master for the DRBD and
> everything else stays running on node2 (which is fine because I haven't
> set up colocation constraints for that). Here is what I have after node1
> is back:
> -----------------------------------------------------
> 
> [root at node1 ~]# pcs cluster unstandby node1
> [root at node1 ~]# pcs status
> Cluster name: webcluster
> Last updated: Tue Sep  6 14:57:46 2016		Last change: Tue Sep  6 14:57:42 2016 by root via cibadmin on node1
> Stack: cman
> Current DC: node2 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
> 2 nodes and 5 resources configured
> 
> Online: [ node1 node2 ]
> 
> Full list of resources:
> 
>  Cluster_VIP	(ocf::heartbeat:IPaddr2):	Started node2
>  Master/Slave Set: ClusterDBclone [ClusterDB]
>      Masters: [ node1 ]
>      Slaves: [ node2 ]
>  ClusterFS	(ocf::heartbeat:Filesystem):	Started node1
>  WebSite	(ocf::heartbeat:apache):	Started node2
> 
> Failed Actions:
> * ClusterFS_start_0 on node2 'unknown error' (1): call=61, status=complete, exitreason='none',
>     last-rc-change='Tue Sep  6 13:15:00 2016', queued=0ms, exec=40ms
> 
> 
> PCSD Status:
>   node1: Online
>   node2: Online
> 
> [root at node1 ~]#
> 
> Any help would be appreciated; I think there is something dumb that I'm missing.
> 
> Thank you.
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org 
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

------------------------------

Message: 2
Date: Wed, 7 Sep 2016 08:51:36 -0600
From: Greg Woods <woods at ucar.edu>
To: Cluster Labs - All topics related to open-source clustering
	welcomed	<users at clusterlabs.org>
Subject: Re: [ClusterLabs] DRBD failover in Pacemaker
Message-ID:
	<CAKhxXfZP7nwunp0pLfgd9J1YRerAmx4B4BoyN_XYU3nRXpDcRQ at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

On Tue, Sep 6, 2016 at 1:04 PM, Devin Ortner
<Devin.Ortner at gtshq.onmicrosoft.com> wrote:

> Master/Slave Set: ClusterDBclone [ClusterDB]
>      Masters: [ node1 ]
>      Slaves: [ node2 ]
>  ClusterFS      (ocf::heartbeat:Filesystem):    Started node1
>

As Digimer said, you really need fencing when you are using DRBD. Otherwise
it's only a matter of time before your shared filesystem gets corrupted.

You also need an order constraint to be sure that the ClusterFS Filesystem
does not start until after the Master DRBD resource, and a colocation
constraint to ensure these are on the same node.
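
Hedging on the exact pcs syntax shipped with CentOS 6, and using the
resource names from your status output, the constraints would look
something like:

  pcs constraint colocation add ClusterFS with ClusterDBclone INFINITY with-rsc-role=Master
  pcs constraint order promote ClusterDBclone then start ClusterFS

That keeps the Filesystem on whichever node holds the DRBD Master role
and only mounts it after the promotion has happened.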

--Greg

------------------------------

Message: 3
Date: Wed, 7 Sep 2016 10:02:45 -0500
From: Dmitri Maziuk <dmitri.maziuk at gmail.com>
To: users at clusterlabs.org
Subject: Re: [ClusterLabs] DRBD failover in Pacemaker
Message-ID: <8c6f6527-e691-55ed-f2cb-602a6dcece03 at gmail.com>
Content-Type: text/plain; charset=windows-1252; format=flowed

On 2016-09-06 14:04, Devin Ortner wrote:
> I have a 2-node cluster running CentOS 6.8 and Pacemaker with DRBD.
> I have been using the "Clusters from Scratch" documentation to create my
> cluster and I am running into a problem where DRBD is not failing over
> to the other node when one goes down.

I forget whether Clusters From Scratch spells this out: you have to create
the DRBD volume and let it finish the initial sync before you let pacemaker
near it. Was 'cat /proc/drbd' showing UpToDate/UpToDate
Primary/Secondary when you tried the failover?
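
For a healthy, fully synced resource the output looks roughly like this
(version string and counters will differ; what matters is cs:Connected,
ro:Primary/Secondary and ds:UpToDate/UpToDate):

  version: 8.4.x (api:1/proto:86-101)
   0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
      ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0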

Ignore the "stonith is optional; you *must* use stonith" mantra du jour.

Dima
------------------
Thank you for the responses. I followed Digimer's instructions, along with some information I had read on the DRBD site, and configured fencing on the DRBD resource. I also configured STONITH using IPMI in Pacemaker; I set up the Pacemaker side first and verified that it kills the other node.

After configuring DRBD fencing, though, I ran into a problem where failover stopped working. If I disable fencing in DRBD, then when one node is taken offline pacemaker kills it and everything fails over to the other node as I would expect; but with fencing enabled, the second node doesn't become master in DRBD until the first node completely finishes rebooting. This makes for a lot of downtime, and if one of the nodes had a hardware failure it would never fail over. I think it's something to do with the fencing scripts.

I am looking for complete redundancy, including in the event of hardware failure. Is there a way I can prevent split-brain while still allowing DRBD to fail over to the other node? Right now I have only STONITH configured in pacemaker and fencing turned OFF in DRBD. So far it works as I want it to, but sometimes when communication is lost between the two nodes the wrong one ends up getting killed, and when that happens it results in split-brain on recovery. I hope I have described the situation well enough for someone to offer a little help. I'm currently experimenting with the delays before STONITH to see if I can figure something out.
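
(To be concrete about the delay idea: put a delay on the stonith device
that fences the node you would prefer to survive. The resource name below
is made up; substitute whatever yours is called:

  pcs stonith update fence-node1-ipmi delay=15

That way, in a split, node2 has to wait 15 seconds before it can kill
node1, while node1 can kill node2 immediately.)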

Thank you,
Devin



