[ClusterLabs] DRBD failover in Pacemaker
Ken Gaillot
kgaillot at redhat.com
Wed Sep 7 17:20:23 UTC 2016
On 09/06/2016 02:04 PM, Devin Ortner wrote:
> I have a 2-node cluster running CentOS 6.8 and Pacemaker with DRBD. I have been using the "Clusters from Scratch" documentation to create my cluster and I am running into a problem where DRBD is not failing over to the other node when one goes down. Here is my "pcs status" prior to when it is supposed to fail over:
The most up-to-date version of Clusters From Scratch targets CentOS 7.1,
which has corosync 2, pcs, and a recent pacemaker:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Clusters_from_Scratch/index.html
There is an older version targeting Fedora 13, which has CMAN, corosync
1, the crm shell, and an older pacemaker:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Clusters_from_Scratch/index.html
Your system is in between, with CMAN, corosync 1, pcs, and a newer
pacemaker, so you might want to compare the two guides as you go.
> ----------------------------------------------------------------------------------------------------------------------
>
> [root@node1 ~]# pcs status
> Cluster name: webcluster
> Last updated: Tue Sep 6 14:50:21 2016 Last change: Tue Sep 6 14:50:17 2016 by root via crm_attribute on node1
> Stack: cman
> Current DC: node2 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
> 2 nodes and 5 resources configured
>
> Online: [ node1 node2 ]
>
> Full list of resources:
>
> Cluster_VIP (ocf::heartbeat:IPaddr2): Started node1
> Master/Slave Set: ClusterDBclone [ClusterDB]
> Masters: [ node1 ]
> Slaves: [ node2 ]
> ClusterFS (ocf::heartbeat:Filesystem): Started node1
> WebSite (ocf::heartbeat:apache): Started node1
>
> Failed Actions:
> * ClusterFS_start_0 on node2 'unknown error' (1): call=61, status=complete, exitreason='none',
> last-rc-change='Tue Sep 6 13:15:00 2016', queued=0ms, exec=40ms
'unknown error' means the Filesystem resource agent returned an error
status. Check the system log for messages from the resource agent to see
what the error actually was.
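For example, something like this will usually turn up the agent's
complaint (assuming the default CentOS 6 syslog location; the filter
terms are just a starting point):

   grep -E 'Filesystem|lrmd|crmd' /var/log/messages | less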
>
> PCSD Status:
> node1: Online
> node2: Online
>
> [root@node1 ~]#
>
> When I put node1 in standby everything fails over except DRBD:
> --------------------------------------------------------------------------------------
>
> [root@node1 ~]# pcs cluster standby node1
> [root@node1 ~]# pcs status
> Cluster name: webcluster
> Last updated: Tue Sep 6 14:53:45 2016 Last change: Tue Sep 6 14:53:37 2016 by root via cibadmin on node2
> Stack: cman
> Current DC: node2 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
> 2 nodes and 5 resources configured
>
> Node node1: standby
> Online: [ node2 ]
>
> Full list of resources:
>
> Cluster_VIP (ocf::heartbeat:IPaddr2): Started node2
> Master/Slave Set: ClusterDBclone [ClusterDB]
> Slaves: [ node2 ]
> Stopped: [ node1 ]
> ClusterFS (ocf::heartbeat:Filesystem): Stopped
> WebSite (ocf::heartbeat:apache): Started node2
>
> Failed Actions:
> * ClusterFS_start_0 on node2 'unknown error' (1): call=61, status=complete, exitreason='none',
> last-rc-change='Tue Sep 6 13:15:00 2016', queued=0ms, exec=40ms
>
>
> PCSD Status:
> node1: Online
> node2: Online
>
> [root@node1 ~]#
>
> I have pasted the contents of "/var/log/messages" here: http://pastebin.com/0i0FMzGZ
> Here is my Configuration: http://pastebin.com/HqqBV90p
One thing Clusters From Scratch leaves out: master/slave resources such
as ClusterDB should have two monitor operations, one for the Master role
and one for the Slave role. Something like:
  op monitor interval=59s role=Master
  op monitor interval=60s role=Slave
Not sure if that will help your issue, but it's a good idea.
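A minimal pcs sketch, assuming the ClusterDB primitive from your
configuration (the existing monitor interval may differ in your setup,
and the two new intervals must not be equal):

   # drop the guide's single role-less monitor first
   pcs resource op remove ClusterDB monitor interval=60s
   # then add one monitor per role, with distinct intervals
   pcs resource op add ClusterDB monitor interval=59s role=Master
   pcs resource op add ClusterDB monitor interval=60s role=Slave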
Another thing the guide should do differently is configure stonith
before drbd.
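For example (purely illustrative; the fence agent, addresses and
credentials below are placeholders for whatever fence hardware your
nodes actually have):

   pcs stonith create fence-node1 fence_ipmilan pcmk_host_list=node1 \
       ipaddr=192.168.1.101 login=admin passwd=secret op monitor interval=60s
   pcs stonith create fence-node2 fence_ipmilan pcmk_host_list=node2 \
       ipaddr=192.168.1.102 login=admin passwd=secret op monitor interval=60s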
Once you have fencing working in pacemaker, take a look at LINBIT's DRBD
User Guide for whatever version you installed (
https://www.drbd.org/en/doc ) and look for the Pacemaker chapter. It
will describe how to connect the fencing between DRBD and Pacemaker's CIB.
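With DRBD 8.4 that typically ends up looking roughly like this in the
DRBD resource configuration (check the guide for the exact script paths
and options for your version):

   disk {
       fencing resource-only;
   }
   handlers {
       fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
       after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
   }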
Your constraints need a few tweaks. You have two "ClusterFS with
ClusterDBclone" colocations, one with "with-rsc-role:Master" and one
without; you want only the one with Master. Your "Cluster_VIP with
ClusterDBclone" colocation should also be with Master. When you colocate
with a clone without specifying a role, the resource can run wherever
any instance of the clone is running, whether slave or master. Here you
want those resources to run only with the master instance, so you have
to say so explicitly. That could be the main source of your issue.
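A sketch of the fix with pcs, using the resource names from your
configuration (run "pcs constraint --full" first to get the ids of the
existing role-less colocations so you can remove them):

   pcs constraint --full
   pcs constraint remove <id-of-the-role-less-colocation>
   pcs constraint colocation add ClusterFS with master ClusterDBclone INFINITY
   pcs constraint colocation add Cluster_VIP with master ClusterDBclone INFINITY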
> When I unstandby node1, it comes back as the master for DRBD, and everything else stays running on node2 (which is fine because I haven't set up colocation constraints for that).
> Here is what I have after node1 is back:
> -----------------------------------------------------
>
> [root@node1 ~]# pcs cluster unstandby node1
> [root@node1 ~]# pcs status
> Cluster name: webcluster
> Last updated: Tue Sep 6 14:57:46 2016 Last change: Tue Sep 6 14:57:42 2016 by root via cibadmin on node1
> Stack: cman
> Current DC: node2 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
> 2 nodes and 5 resources configured
>
> Online: [ node1 node2 ]
>
> Full list of resources:
>
> Cluster_VIP (ocf::heartbeat:IPaddr2): Started node2
> Master/Slave Set: ClusterDBclone [ClusterDB]
> Masters: [ node1 ]
> Slaves: [ node2 ]
> ClusterFS (ocf::heartbeat:Filesystem): Started node1
> WebSite (ocf::heartbeat:apache): Started node2
>
> Failed Actions:
> * ClusterFS_start_0 on node2 'unknown error' (1): call=61, status=complete, exitreason='none',
> last-rc-change='Tue Sep 6 13:15:00 2016', queued=0ms, exec=40ms
>
>
> PCSD Status:
> node1: Online
> node2: Online
>
> [root@node1 ~]#
>
> Any help would be appreciated; I think there is something dumb that I'm missing.
>
> Thank you.
>