[Pacemaker] Resource ordering/colocating question (DRBD + LVM + FS)

Heikki Manninen hma at iki.fi
Thu Sep 5 08:08:07 EDT 2013


Hello,

I'm having a bit of a problem understanding what's going on with my simple two-node demo cluster here. The resources come up correctly after restarting the whole cluster, but the LVM and Filesystem resources fail to start after a single-node restart or after a standby/unstandby cycle, once the node comes back online. (Why do they even get stopped and restarted when the second node comes back?)

OS: CentOS 6.4 (cman stack)
Pacemaker: pacemaker-1.1.8-7.el6.x86_64
DRBD: drbd84-utils-8.4.3-1.el6.elrepo.x86_64

Everything is configured using: pcs-0.9.26-10.el6_4.1.noarch

Two DRBD resources configured and working: data01 & data02
Two nodes: pgdbsrv01.cl1.local & pgdbsrv02.cl1.local

Configuration:

node pgdbsrv01.cl1.local
node pgdbsrv02.cl1.local
primitive DRBD_data01 ocf:linbit:drbd \
     params drbd_resource="data01" \
     op monitor interval="30s"
primitive DRBD_data02 ocf:linbit:drbd \
     params drbd_resource="data02" \
     op monitor interval="30s"
primitive FS_data01 ocf:heartbeat:Filesystem \
     params device="/dev/mapper/vgdata01-lvdata01" directory="/data01" fstype="ext4" \
     op monitor interval="30s"
primitive FS_data02 ocf:heartbeat:Filesystem \
     params device="/dev/mapper/vgdata02-lvdata02" directory="/data02" fstype="ext4" \
     op monitor interval="30s"
primitive LVM_vgdata01 ocf:heartbeat:LVM \
     params volgrpname="vgdata01" exclusive="true" \
     op monitor interval="30s"
primitive LVM_vgdata02 ocf:heartbeat:LVM \
     params volgrpname="vgdata02" exclusive="true" \
     op monitor interval="30s"
group GRP_data01 LVM_vgdata01 FS_data01
group GRP_data02 LVM_vgdata02 FS_data02
ms DRBD_ms_data01 DRBD_data01 \
     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
ms DRBD_ms_data02 DRBD_data02 \
     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation colocation-GRP_data01-DRBD_ms_data01-INFINITY inf: GRP_data01 DRBD_ms_data01:Master
colocation colocation-GRP_data02-DRBD_ms_data02-INFINITY inf: GRP_data02 DRBD_ms_data02:Master
order order-DRBD_data01-GRP_data01-mandatory : DRBD_data01:promote GRP_data01:start
order order-DRBD_data02-GRP_data02-mandatory : DRBD_data02:promote GRP_data02:start
property $id="cib-bootstrap-options" \
     dc-version="1.1.8-7.el6-394e906" \
     cluster-infrastructure="cman" \
     stonith-enabled="false" \
     no-quorum-policy="ignore" \
     migration-threshold="1"
rsc_defaults $id="rsc_defaults-options" \
     resource-stickiness="100"


1) After starting the cluster, everything runs happily:

Last updated: Tue Sep  3 00:11:13 2013
Last change: Tue Sep  3 00:05:15 2013 via cibadmin on pgdbsrv01.cl1.local
Stack: cman
Current DC: pgdbsrv02.cl1.local - partition with quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, unknown expected votes
9 Resources configured.

Online: [ pgdbsrv01.cl1.local pgdbsrv02.cl1.local ]

Full list of resources:

Master/Slave Set: DRBD_ms_data01 [DRBD_data01]
     Masters: [ pgdbsrv01.cl1.local ]
     Slaves: [ pgdbsrv02.cl1.local ]
Master/Slave Set: DRBD_ms_data02 [DRBD_data02]
     Masters: [ pgdbsrv01.cl1.local ]
     Slaves: [ pgdbsrv02.cl1.local ]
Resource Group: GRP_data01
     LVM_vgdata01 (ocf::heartbeat:LVM): Started pgdbsrv01.cl1.local
     FS_data01 (ocf::heartbeat:Filesystem): Started pgdbsrv01.cl1.local
Resource Group: GRP_data02
     LVM_vgdata02 (ocf::heartbeat:LVM): Started pgdbsrv01.cl1.local
     FS_data02 (ocf::heartbeat:Filesystem): Started pgdbsrv01.cl1.local

2) Putting node #1 into standby mode, after which everything runs happily on node pgdbsrv02.cl1.local:

# pcs cluster standby pgdbsrv01.cl1.local
# pcs status
Last updated: Tue Sep  3 00:16:01 2013
Last change: Tue Sep  3 00:15:55 2013 via crm_attribute on pgdbsrv02.cl1.local
Stack: cman
Current DC: pgdbsrv02.cl1.local - partition with quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, unknown expected votes
9 Resources configured.


Node pgdbsrv01.cl1.local: standby
Online: [ pgdbsrv02.cl1.local ]

Full list of resources:

 IP_database     (ocf::heartbeat:IPaddr2):     Started pgdbsrv02.cl1.local
 Master/Slave Set: DRBD_ms_data01 [DRBD_data01]
     Masters: [ pgdbsrv02.cl1.local ]
     Stopped: [ DRBD_data01:1 ]
 Master/Slave Set: DRBD_ms_data02 [DRBD_data02]
     Masters: [ pgdbsrv02.cl1.local ]
     Stopped: [ DRBD_data02:1 ]
 Resource Group: GRP_data01
     LVM_vgdata01     (ocf::heartbeat:LVM):     Started pgdbsrv02.cl1.local
     FS_data01     (ocf::heartbeat:Filesystem):     Started pgdbsrv02.cl1.local
 Resource Group: GRP_data02
     LVM_vgdata02     (ocf::heartbeat:LVM):     Started pgdbsrv02.cl1.local
     FS_data02     (ocf::heartbeat:Filesystem):     Started pgdbsrv02.cl1.local

3) Putting node #1 back online. All the resources seem to stop (?), DRBD then gets promoted successfully again on node #2, but the LVM and FS resources never start:

# pcs cluster unstandby pgdbsrv01.cl1.local
# pcs status
Last updated: Tue Sep  3 00:17:00 2013
Last change: Tue Sep  3 00:16:56 2013 via crm_attribute on pgdbsrv02.cl1.local
Stack: cman
Current DC: pgdbsrv02.cl1.local - partition with quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, unknown expected votes
9 Resources configured.


Online: [ pgdbsrv01.cl1.local pgdbsrv02.cl1.local ]

Full list of resources:

 IP_database     (ocf::heartbeat:IPaddr2):     Started pgdbsrv02.cl1.local
 Master/Slave Set: DRBD_ms_data01 [DRBD_data01]
     Masters: [ pgdbsrv02.cl1.local ]
     Slaves: [ pgdbsrv01.cl1.local ]
 Master/Slave Set: DRBD_ms_data02 [DRBD_data02]
     Masters: [ pgdbsrv02.cl1.local ]
     Slaves: [ pgdbsrv01.cl1.local ]
 Resource Group: GRP_data01
     LVM_vgdata01     (ocf::heartbeat:LVM):     Stopped
     FS_data01     (ocf::heartbeat:Filesystem):     Stopped
 Resource Group: GRP_data02
     LVM_vgdata02     (ocf::heartbeat:LVM):     Stopped
     FS_data02     (ocf::heartbeat:Filesystem):     Stopped
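
Is there a good way to see why the policy engine leaves the groups stopped at this point? I was thinking of something like the following to dump the allocation scores against the live CIB, but I'm not sure it's the right tool (I haven't dug into the pengine logs yet):

# crm_simulate -sL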



Any ideas why this is happening, or what could be wrong in the resource configuration? The same thing happens when the resources start out on the other node, and also if I stop and start one of the nodes: once the node comes back online, the group resources are left stopped.
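
One thing I noticed while writing this up: the colocation constraints reference the ms resources (DRBD_ms_data01/DRBD_ms_data02), but the order constraints reference the underlying DRBD primitives (DRBD_data01/DRBD_data02). Should the order constraints point at the ms resources instead, i.e. something like the lines below, or does that not matter here?

order order-DRBD_data01-GRP_data01-mandatory : DRBD_ms_data01:promote GRP_data01:start
order order-DRBD_data02-GRP_data02-mandatory : DRBD_ms_data02:promote GRP_data02:start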


-- 
Heikki Manninen <hma at iki.fi>


