[Pacemaker] Resource ordering/colocating question (DRBD + LVM + FS)

Andreas Mock andreas.mock at web.de
Thu Sep 5 09:39:36 EDT 2013


Hi Heikki,

just a few comments to help you help yourself.

1) The second crm_mon output shows a resource IP_database
which is neither present in the initial crm_mon output nor
in the config. => Reduce your problem/config to the minimal
case that still reproduces the issue.

2) Enable logging and find out which node is the DC.
The logs on that node contain plenty of information showing
what is going on. Hint: open a terminal session with a
running 'tail -f <logfile>' and watch it while issuing
commands. You'll get used to it.
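For example, on a CentOS 6 cman/pacemaker stack the cluster typically
logs to /var/log/cluster/corosync.log or /var/log/messages (the exact
path depends on your corosync logging configuration):

     # on the DC node; adjust the path to wherever your cluster logs go
     tail -f /var/log/cluster/corosync.log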

3) The resource status shown by crm_mon doesn't give you all
the information about the DRBD devices. Have a look at
drbd-overview on both nodes (e.g. for the sync status).
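Both of the following are part of the standard DRBD tools:

     # run on BOTH nodes and compare connection state, roles and disk states
     drbd-overview
     # or the low-level view
     cat /proc/drbd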

4) This setup CRIES for STONITH, even in a test environment.
When a node gets fenced (and you see that immediately) you
know something went wrong, which is a good indicator for
errors in agents or in the config. Believe me, as tedious as
STONITH is, it is equally valuable for getting hints about a
bad cluster state. On virtual machines STONITH is not as
painful as on real servers.
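As a rough sketch with pcs, assuming libvirt-managed test VMs and the
fence_virsh agent (hypervisor address, credentials and domain names are
placeholders - use whatever fence agent fits your environment):

     # one fencing device per node that can be fenced
     pcs stonith create fence_pgdbsrv01 fence_virsh \
          ipaddr="hypervisor.example.com" login="root" passwd="secret" \
          port="pgdbsrv01" pcmk_host_list="pgdbsrv01.cl1.local"
     # and don't forget to switch stonith back on
     pcs property set stonith-enabled=true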

5) Is the DRBD fencing handler (fence-peer) enabled? If yes, under certain
circumstances -INFINITY constraints are inserted to prevent promotion
on the "wrong" node. You should grep for them: 'cibadmin -Q | grep <resname>'
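If the crm-fence-peer.sh handler is configured in your DRBD setup, the
constraints it leaves behind usually carry "drbd-fence-by-handler" in
their id, so something like this should find them (the exact ids depend
on your resource names):

     cibadmin -Q | grep drbd-fence-by-handler
     # a hit looks roughly like a rsc_location constraint with
     # score="-INFINITY" for role="Master" on all but the surviving node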

6) crm_simulate -L -v gives you an output of the allocation scores of
the resources on each node. I really don't know how to read it
exactly (is there documentation for that anywhere?), but it
gives you a hint where to look when resources don't start.
Especially the aggregation of stickiness values within groups is
sometimes misleading.
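For example (on most Pacemaker versions -s/--show-scores prints the
allocation scores explicitly):

     # -L works against the live CIB
     crm_simulate -L -v
     # additionally show the allocation scores per resource and node
     crm_simulate -L -s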


7) Sometimes Pacemaker's behaviour changes between releases and it is
possible that you have hit a bug. But this is hard to find out. One
possibility: try a newer version.

Hope this helps.

Best regards
Andreas Mock




-----Original Message-----
From: Heikki Manninen [mailto:hma at iki.fi]
Sent: Thursday, September 5, 2013 14:08
To: pacemaker at oss.clusterlabs.org
Subject: [Pacemaker] Resource ordering/colocating question (DRBD + LVM + FS)

Hello,

I'm having a bit of a problem understanding what's going on with my simple
two-node demo cluster here. My resources come up correctly after restarting
the whole cluster, but the LVM and Filesystem resources fail to start after a
single node restart or standby/unstandby (after the node comes back online -
why do they even stop/start when the second node comes back?).

OS: CentOS 6.4 (cman stack)
Pacemaker: pacemaker-1.1.8-7.el6.x86_64
DRBD: drbd84-utils-8.4.3-1.el6.elrepo.x86_64

Everything is configured using: pcs-0.9.26-10.el6_4.1.noarch

Two DRBD resources configured and working: data01 & data02
Two nodes: pgdbsrv01.cl1.local & pgdbsrv02.cl1.local

Configuration:

node pgdbsrv01.cl1.local
node pgdbsrv02.cl1.local
primitive DRBD_data01 ocf:linbit:drbd \
     params drbd_resource="data01" \
     op monitor interval="30s"
primitive DRBD_data02 ocf:linbit:drbd \
     params drbd_resource="data02" \
     op monitor interval="30s"
primitive FS_data01 ocf:heartbeat:Filesystem \
     params device="/dev/mapper/vgdata01-lvdata01" directory="/data01" fstype="ext4" \
     op monitor interval="30s"
primitive FS_data02 ocf:heartbeat:Filesystem \
     params device="/dev/mapper/vgdata02-lvdata02" directory="/data02" fstype="ext4" \
     op monitor interval="30s"
primitive LVM_vgdata01 ocf:heartbeat:LVM \
     params volgrpname="vgdata01" exclusive="true" \
     op monitor interval="30s"
primitive LVM_vgdata02 ocf:heartbeat:LVM \
     params volgrpname="vgdata02" exclusive="true" \
     op monitor interval="30s"
group GRP_data01 LVM_vgdata01 FS_data01
group GRP_data02 LVM_vgdata02 FS_data02
ms DRBD_ms_data01 DRBD_data01 \
     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
ms DRBD_ms_data02 DRBD_data02 \
     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation colocation-GRP_data01-DRBD_ms_data01-INFINITY inf: GRP_data01 DRBD_ms_data01:Master
colocation colocation-GRP_data02-DRBD_ms_data02-INFINITY inf: GRP_data02 DRBD_ms_data02:Master
order order-DRBD_data01-GRP_data01-mandatory : DRBD_data01:promote GRP_data01:start
order order-DRBD_data02-GRP_data02-mandatory : DRBD_data02:promote GRP_data02:start
property $id="cib-bootstrap-options" \
     dc-version="1.1.8-7.el6-394e906" \
     cluster-infrastructure="cman" \
     stonith-enabled="false" \
     no-quorum-policy="ignore" \
     migration-threshold="1"
rsc_defaults $id="rsc_defaults-options" \
     resource-stickiness="100"


1) After starting the cluster, everything runs happily:

Last updated: Tue Sep  3 00:11:13 2013
Last change: Tue Sep  3 00:05:15 2013 via cibadmin on pgdbsrv01.cl1.local
Stack: cman
Current DC: pgdbsrv02.cl1.local - partition with quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, unknown expected votes
9 Resources configured.

Online: [ pgdbsrv01.cl1.local pgdbsrv02.cl1.local ]

Full list of resources:

Master/Slave Set: DRBD_ms_data01 [DRBD_data01]
     Masters: [ pgdbsrv01.cl1.local ]
     Slaves: [ pgdbsrv02.cl1.local ]
Master/Slave Set: DRBD_ms_data02 [DRBD_data02]
     Masters: [ pgdbsrv01.cl1.local ]
     Slaves: [ pgdbsrv02.cl1.local ]
Resource Group: GRP_data01
     LVM_vgdata01 (ocf::heartbeat:LVM): Started pgdbsrv01.cl1.local
     FS_data01 (ocf::heartbeat:Filesystem): Started pgdbsrv01.cl1.local
Resource Group: GRP_data02
     LVM_vgdata02 (ocf::heartbeat:LVM): Started pgdbsrv01.cl1.local
     FS_data02 (ocf::heartbeat:Filesystem): Started pgdbsrv01.cl1.local

2) Putting node #1 to standby mode - after which everything runs happily on
node pgdbsrv02.cl1.local

# pcs cluster standby pgdbsrv01.cl1.local
# pcs status
Last updated: Tue Sep  3 00:16:01 2013
Last change: Tue Sep  3 00:15:55 2013 via crm_attribute on pgdbsrv02.cl1.local
Stack: cman
Current DC: pgdbsrv02.cl1.local - partition with quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, unknown expected votes
9 Resources configured.


Node pgdbsrv01.cl1.local: standby
Online: [ pgdbsrv02.cl1.local ]

Full list of resources:

 IP_database     (ocf::heartbeat:IPaddr2):     Started pgdbsrv02.cl1.local
 Master/Slave Set: DRBD_ms_data01 [DRBD_data01]
     Masters: [ pgdbsrv02.cl1.local ]
     Stopped: [ DRBD_data01:1 ]
 Master/Slave Set: DRBD_ms_data02 [DRBD_data02]
     Masters: [ pgdbsrv02.cl1.local ]
     Stopped: [ DRBD_data02:1 ]
 Resource Group: GRP_data01
     LVM_vgdata01     (ocf::heartbeat:LVM):     Started pgdbsrv02.cl1.local
     FS_data01     (ocf::heartbeat:Filesystem):     Started pgdbsrv02.cl1.local
 Resource Group: GRP_data02
     LVM_vgdata02     (ocf::heartbeat:LVM):     Started pgdbsrv02.cl1.local
     FS_data02     (ocf::heartbeat:Filesystem):     Started pgdbsrv02.cl1.local

3) Putting node #1 back online - it seems that all the resources stop (?)
and then DRBD gets promoted successfully on node #2 but LVM and FS resources
never start

# pcs cluster unstandby pgdbsrv01.cl1.local
# pcs status
Last updated: Tue Sep  3 00:17:00 2013
Last change: Tue Sep  3 00:16:56 2013 via crm_attribute on pgdbsrv02.cl1.local
Stack: cman
Current DC: pgdbsrv02.cl1.local - partition with quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, unknown expected votes
9 Resources configured.


Online: [ pgdbsrv01.cl1.local pgdbsrv02.cl1.local ]

Full list of resources:

 IP_database     (ocf::heartbeat:IPaddr2):     Started pgdbsrv02.cl1.local
 Master/Slave Set: DRBD_ms_data01 [DRBD_data01]
     Masters: [ pgdbsrv02.cl1.local ]
     Slaves: [ pgdbsrv01.cl1.local ]
 Master/Slave Set: DRBD_ms_data02 [DRBD_data02]
     Masters: [ pgdbsrv02.cl1.local ]
     Slaves: [ pgdbsrv01.cl1.local ]
 Resource Group: GRP_data01
     LVM_vgdata01     (ocf::heartbeat:LVM):     Stopped
     FS_data01     (ocf::heartbeat:Filesystem):     Stopped
 Resource Group: GRP_data02
     LVM_vgdata02     (ocf::heartbeat:LVM):     Stopped
     FS_data02     (ocf::heartbeat:Filesystem):     Stopped



Any ideas why this is happening or what could be wrong in the resource
configuration? The same thing happens when testing with the resources
initially located the other way around. Also, if I stop and start one of
the nodes, the same thing happens once the node comes back online.


-- 
Heikki Manninen <hma at iki.fi>
_______________________________________________
Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org




