[Pacemaker] failed over filesystem mount points not coming up on secondary node

Lonni J Friedman netllama at gmail.com
Thu Sep 27 22:10:16 UTC 2012


Greetings,
I've just started playing with pacemaker/corosync on a two-node setup.
At this point I'm just experimenting and trying to get a good feel for
how things work.  Eventually I'd like to start using this in a
production environment.  I'm running Fedora 16 (x86_64) with
pacemaker-1.1.7 & corosync-1.4.3.  I have DRBD set up and working fine
with two resources.  I've verified that pacemaker is doing the right
thing when initially configured.  Specifically (a rough sketch of how
to check each of these follows the list):
* the floating static IP is brought up
* DRBD is brought up correctly with a master & slave
* the local DRBD backed mount points are mounted correctly
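
The checks themselves are just standard shell commands plus crm status,
with the values from the configuration below plugged in; roughly:
#########
# floating IP present on eth1? (IP/interface taken from the config below)
ip addr show eth1 | grep 10.31.97.100

# DRBD connection state and roles (expecting ro:Primary/Secondary, ds:UpToDate/UpToDate)
cat /proc/drbd

# DRBD-backed filesystems mounted?
mount | grep -E '/mnt/sdb[12]'

# overall cluster view
crm status
#########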

Here's the configuration:
#########
node farm-ljf0 \
	attributes standby="off"
node farm-ljf1
primitive ClusterIP ocf:heartbeat:IPaddr2 \
	params ip="10.31.97.100" cidr_netmask="22" nic="eth1" \
	op monitor interval="10s"
primitive FS0 ocf:linbit:drbd \
	params drbd_resource="r0" \
	op monitor interval="10" role="Master" \
	op monitor interval="30" role="Slave"
primitive FS0_drbd ocf:heartbeat:Filesystem \
	params device="/dev/drbd0" directory="/mnt/sdb1" fstype="xfs"
primitive FS1 ocf:linbit:drbd \
	params drbd_resource="r1" \
	op monitor interval="10s" role="Master" \
	op monitor interval="30s" role="Slave"
primitive FS1_drbd ocf:heartbeat:Filesystem \
	params device="/dev/drbd1" directory="/mnt/sdb2" fstype="xfs"
ms FS0_Clone FS0 \
	meta master-max="1" master-node-max="1" clone-max="2" \
	clone-node-max="1" notify="true"
ms FS1_Clone FS1 \
	meta master-max="1" master-node-max="1" clone-max="2" \
	clone-node-max="1" notify="true"
location cli-prefer-ClusterIP ClusterIP \
	rule $id="cli-prefer-rule-ClusterIP" inf: #uname eq farm-ljf1
colocation fs0_on_drbd inf: FS0_drbd FS0_Clone:Master
colocation fs1_on_drbd inf: FS1_drbd FS1_Clone:Master
order FS0_drbd-after-FS0 inf: FS0_Clone:promote FS0_drbd
order FS1_drbd-after-FS1 inf: FS1_Clone:promote FS1_drbd
property $id="cib-bootstrap-options" \
	dc-version="1.1.7-2.fc16-ee0730e13d124c3d58f00016c3376a1de5323cff" \
	cluster-infrastructure="openais" \
	expected-quorum-votes="2" \
	stonith-enabled="false" \
	no-quorum-policy="ignore"
#########
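
For completeness, the CIB can be sanity-checked with crm_verify (I'm
assuming -L/-V is the right combination to check the live configuration
verbosely):
#########
# validate the live cluster configuration; -V increases verbosity
crm_verify -LV
#########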

However, when I attempted to simulate a failover situation (I shut down
the current master/primary node completely), not everything failed
over correctly.  Specifically, the mount points did not get mounted,
even though the other two elements did fail over correctly.
'farm-ljf1' is the node that I shut down, and farm-ljf0 is the node
that I expected to inherit all of the resources.  Here's the status:
#########
[root@farm-ljf0 ~]# crm status
============
Last updated: Thu Sep 27 15:00:19 2012
Last change: Thu Sep 27 13:59:42 2012 via cibadmin on farm-ljf1
Stack: openais
Current DC: farm-ljf0 - partition WITHOUT quorum
Version: 1.1.7-2.fc16-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
7 Resources configured.
============

Online: [ farm-ljf0 ]
OFFLINE: [ farm-ljf1 ]

 ClusterIP	(ocf::heartbeat:IPaddr2):	Started farm-ljf0
 Master/Slave Set: FS0_Clone [FS0]
     Masters: [ farm-ljf0 ]
     Stopped: [ FS0:0 ]
 Master/Slave Set: FS1_Clone [FS1]
     Masters: [ farm-ljf0 ]
     Stopped: [ FS1:0 ]

Failed actions:
    FS1_drbd_start_0 (node=farm-ljf0, call=23, rc=1, status=complete): unknown error
    FS0_drbd_start_0 (node=farm-ljf0, call=24, rc=1, status=complete): unknown error
#########
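
I haven't gotten anything more useful than "unknown error" out of crm
status itself, so my assumption is that the next steps are to look at
syslog on farm-ljf0 for the Filesystem agent's real error, try the
mount by hand, and then clear the failed actions so pacemaker retries;
something along these lines:
#########
# look for the Filesystem/drbd agents' actual error messages
# (assuming pacemaker/corosync are logging to syslog at /var/log/messages)
grep -iE 'filesystem|drbd' /var/log/messages | tail -n 50

# sanity check: is DRBD primary here, and does the mount work by hand?
# (device/mountpoint/resource names taken from my config above)
drbdadm role r0
mount -t xfs /dev/drbd0 /mnt/sdb1    # only expected to work while r0 is Primary

# once the underlying problem is fixed, clear the failures so pacemaker retries
crm resource cleanup FS0_drbd
crm resource cleanup FS1_drbd
#########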

I eventually brought the shut-down node (farm-ljf1) back up, hoping
that might at least bring things back into a good state, but it isn't
working either, and the node is showing up as OFFLINE:
##########
[root@farm-ljf1 ~]# crm status
============
Last updated: Thu Sep 27 15:06:54 2012
Last change: Thu Sep 27 14:49:06 2012 via cibadmin on farm-ljf1
Stack: openais
Current DC: NONE
2 Nodes configured, 2 expected votes
7 Resources configured.
============

OFFLINE: [ farm-ljf0 farm-ljf1 ]
##########


So at this point, I've got two problems:
0) FS mount failover isn't working.  I'm hoping this is some silly
configuration issue that can be easily resolved.
1) bringing the "failed" farm-ljf1 node back online doesn't seem to
happen automatically, and I can't figure out what kind of magic is
needed (some checks I intend to run are sketched after this list).
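
For problem 1, the first thing I can think to check is whether corosync
and pacemaker actually came back up on farm-ljf1 after the reboot, and
whether the two nodes see each other at the corosync level; something
like this (Fedora 16 is systemd-based, so I'm assuming these unit names
are right):
#########
# are the cluster services running at all after the reboot?
# (assuming these are the Fedora 16 unit names)
systemctl status corosync.service pacemaker.service

# is the corosync ring up, and does it list both members?
corosync-cfgtool -s
corosync-objctl | grep -i member

# live view of membership and DC election
crm_mon -1
#########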


If this stuff is documented somewhere, I'll gladly read it if someone
can point me in the right direction.

thanks!



