[Pacemaker] colocation conundrum

Craig Donnelly craig at goaf.net
Tue Nov 20 13:56:03 EST 2012


Hi there,

I think I've exhausted everything I can find online trying to solve my problem, so here goes with a posting in the hope that someone on this mailing list might be able to help.

I have a Pacemaker 1.1.7/Corosync 1.4.1 two-node cluster running on CentOS 6.3.
I'm using this cluster to serve shared storage via a combination of LVM and iSCSI.

Failover works fine if I offline/STONITH a node. However, when I bring the node back online, the two nodes enter a death-match over the resources.
I see the issue as being with ordering/colocation/resource sets; I have tried a bunch of different variations, and read and re-read all the information I can find online, without resolution.
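
The flapping itself is easiest to watch in one-shot cluster status output with fail counts shown, i.e.:

    crm_mon -rf1

(-1 = one shot, -r = include inactive resources, -f = show fail counts).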

I would really appreciate any help/advice.

The key entries that I can see in the logs are:

NODE1:
======
Nov 20 12:16:38 cs1san1 iSCSILogicalUnit(cs1lb1l1)[2710]: ERROR: tgtadm: invalid request
Nov 20 12:16:39 cs1san1 iSCSILogicalUnit(cs1man1l1)[2807]: ERROR: tgtadm: invalid request
Nov 20 12:22:55 cs1san1 iSCSILogicalUnit(cs1master1l1)[4482]: ERROR: tgtadm: invalid request
Nov 20 12:23:17 cs1san1 iSCSILogicalUnit(cs1ddb1l1)[4968]: ERROR: tgtadm: invalid request
Nov 20 12:23:18 cs1san1 iSCSILogicalUnit(cs1master1l1)[5081]: ERROR: tgtadm: invalid request
Nov 20 12:30:28 cs1san1 iSCSILogicalUnit(cs1lb1l1)[2670]: ERROR: tgtadm: invalid request
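
My reading of these is that tgtd is being asked to create targets/LUNs that already exist on the node, i.e. the previous instances were never cleanly removed. Assuming that is the case, the currently registered targets can be listed with the standard tgtadm query:

    tgtadm --lld iscsi --mode target --op show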

NODE2:
======
Nov 20 12:16:38 cs1san2 LVM(cs1vg1)[22039]: ERROR: Can't deactivate volume group "cs1vg1" with 3 open logical volume(s)
Nov 20 12:22:55 cs1san2 LVM(cs1vg1)[3386]: ERROR: Can't deactivate volume group "cs1vg1" with 2 open logical volume(s)
Nov 20 12:23:17 cs1san2 LVM(cs1vg1)[4296]: ERROR: Can't deactivate volume group "cs1vg1" with 1 open logical volume(s)
Nov 20 12:30:27 cs1san2 LVM(cs1vg1)[14943]: ERROR: Can't deactivate volume group "cs1vg1" with 4 open logical volume(s)
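
And here, presumably, the VG deactivation is racing against LUNs that are still exported on this node. The LVs holding the VG open can be identified with a standard lvs query (the 'o' in the sixth lv_attr character marks an open LV):

    lvs -o lv_name,lv_attr cs1vg1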

To me this clearly indicates an ordering issue: the LUN groups and the volume group are being stopped and started independently of one another. Yet the configuration I have follows the colocation/ordering rules as far as I can understand them; my reading of the docs is that colocation constrains placement only, not start/stop sequence, which may be the gap (see my guess at the missing order constraints after the config below).

My "current" CRM config is as follows:
==============================================================================
node cs1san1 \
	attributes standby="off"
node cs1san2 \
	attributes standby="off"
primitive alert ocf:heartbeat:MailTo \
	params email="ops at xyz.com" subject="CS takeover event" \
	op monitor interval="10s"
primitive cs1ddb1l1 ocf:heartbeat:iSCSILogicalUnit \
	params target_iqn="iqn.2012-10.com.xyz.cs1san1:cs1ddb1d1" lun="1" path="/dev/cs1vg1/cs1ddb1d1" \
	op monitor interval="10" timeout="15"
primitive cs1ddb1t1 ocf:heartbeat:iSCSITarget \
	params iqn="iqn.2012-10.com.xyz.cs1san1:cs1ddb1d1" tid="7" \
	op monitor interval="10" timeout="15"
primitive cs1dws1l1 ocf:heartbeat:iSCSILogicalUnit \
	params target_iqn="iqn.2012-10.com.xyz.cs1san1:cs1dws1d1" lun="1" path="/dev/cs1vg1/cs1dws1d1" \
	op monitor interval="10" timeout="15"
primitive cs1dws1t1 ocf:heartbeat:iSCSITarget \
	params iqn="iqn.2012-10.com.xyz.cs1san1:cs1dws1d1" tid="8" \
	op monitor interval="10" timeout="15"
primitive cs1lb1l1 ocf:heartbeat:iSCSILogicalUnit \
	params target_iqn="iqn.2012-10.com.xyz.cs1san1:cs1lb1d1" lun="1" path="/dev/cs1vg1/cs1lb1d1" \
	op start interval="0" timeout="15" \
	op stop interval="0" timeout="15" \
	op monitor interval="10" timeout="15" \
	meta is-managed="true"
primitive cs1lb1t1 ocf:heartbeat:iSCSITarget \
	params iqn="iqn.2012-10.com.xyz.cs1san1:cs1lb1d1" tid="1" \
	op monitor interval="10" timeout="15"
primitive cs1lb2l1 ocf:heartbeat:iSCSILogicalUnit \
	params target_iqn="iqn.2012-10.com.xyz.cs1san2:cs1lb2d1" lun="1" path="/dev/cs1vg2/cs1lb2d1" \
	op start interval="0" timeout="15" \
	op stop interval="0" timeout="15" \
	op monitor interval="10" timeout="15"
primitive cs1lb2t1 ocf:heartbeat:iSCSITarget \
	params iqn="iqn.2012-10.com.xyz.cs1san2:cs1lb2d1" tid="2" \
	op monitor interval="10" timeout="15"
primitive cs1man1l1 ocf:heartbeat:iSCSILogicalUnit \
	params target_iqn="iqn.2012-10.com.xyz.cs1san1:cs1man1d1" lun="1" path="/dev/cs1vg1/cs1man1d1" \
	op monitor interval="10" timeout="15"
primitive cs1man1t1 ocf:heartbeat:iSCSITarget \
	params iqn="iqn.2012-10.com.xyz.cs1san1:cs1man1d1" tid="5" \
	op monitor interval="10" timeout="15"
primitive cs1master1l1 ocf:heartbeat:iSCSILogicalUnit \
	params target_iqn="iqn.2012-10.com.xyz.cs1san1:cs1master1d1" lun="1" path="/dev/cs1vg1/cs1master1d1" \
	op monitor interval="10" timeout="15"
primitive cs1master1t1 ocf:heartbeat:iSCSITarget \
	params iqn="iqn.2012-10.com.xyz.cs1san1:cs1master1d1" tid="6" \
	op monitor interval="10" timeout="15"
primitive cs1vg1 ocf:heartbeat:LVM \
	params exclusive="true" volgrpname="cs1vg1" \
	op start interval="0" timeout="30s" \
	op stop interval="0" timeout="30s" \
	meta target-role="Started"
primitive cs1vg2 ocf:heartbeat:LVM \
	params exclusive="true" volgrpname="cs1vg2" \
	op start interval="0" timeout="30s" \
	op stop interval="0" timeout="30s" \
	meta target-role="Started"
primitive ping ocf:pacemaker:ping \
	params host_list="10.96.0.1 10.96.0.2" attempts="3" timeout="2s" multiplier="100" dampen="5s" \
	op monitor interval="10s"
primitive san1fencer stonith:fence_ipmilan \
	params pcmk_host_list="cs1san1" lanplus="1" ipaddr="10.96.0.21" login="admin" passwd="xxxxxxx" power_wait="4s" \
	op monitor interval="60s" \
	meta target-role="Started"
primitive san1vip ocf:heartbeat:IPaddr2 \
	params ip="10.94.0.101" cidr_netmask="24" \
	op monitor interval="10s" \
	meta target-role="Started"
primitive san2fencer stonith:fence_ipmilan \
	params pcmk_host_list="cs1san2" lanplus="1" ipaddr="10.96.0.22" login="admin" passwd="xxxxxxxx" power_wait="4s" \
	op monitor interval="60s" \
	meta target-role="Started"
primitive san2vip ocf:heartbeat:IPaddr2 \
	params ip="10.94.0.102" cidr_netmask="24" \
	op monitor interval="10s" \
	meta target-role="Started"
group cs1ddb1grp cs1ddb1t1 cs1ddb1l1 \
	meta target-role="Started"
group cs1dws1grp cs1dws1t1 cs1dws1l1 \
	meta target-role="Started"
group cs1lb1grp cs1lb1t1 cs1lb1l1 \
	meta target-role="Started"
group cs1lb2grp cs1lb2t1 cs1lb2l1 \
	meta target-role="Started"
group cs1man1grp cs1man1t1 cs1man1l1 \
	meta target-role="Started"
group cs1master1grp cs1master1t1 cs1master1l1 \
	meta target-role="Started"
clone alerts alert \
	meta target-role="Started"
clone pings ping \
	meta target-role="Started"
location san1fence san1fencer -inf: cs1san1
location san1loc cs1vg1 \
	rule $id="san1loc-rule1" 50: #uname eq cs1san1 \
	rule $id="san1loc-rule2" pingd: defined ping
location san2fence san2fencer -inf: cs1san2
location san2loc cs1vg2 \
	rule $id="san2loc-rule1" 50: #uname eq cs1san2 \
	rule $id="san2loc-rule2" pingd: defined ping
colocation san1colo inf: ( cs1lb1grp cs1man1grp cs1master1grp cs1ddb1grp cs1dws1grp ) san1vip cs1vg1
colocation san2colo inf: ( cs1lb2grp ) san2vip cs1vg2
property $id="cib-bootstrap-options" \
	dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
	cluster-infrastructure="openais" \
	expected-quorum-votes="2" \
	no-quorum-policy="ignore" \
	last-lrm-refresh="1353428951" \
	stonith-enabled="true" \
	maintenance-mode="false"
===================================================================
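
As far as I can tell the config above contains no order constraints at all, only colocations, and colocation by itself says nothing about the sequence in which resources start and stop. If that reading is right, what I think I am missing is something along these lines (the constraint names san1order/san2order are mine and this is untested, so please treat it as a sketch, not a working config):

order san1order inf: cs1vg1 san1vip ( cs1lb1grp cs1man1grp cs1master1grp cs1ddb1grp cs1dws1grp )
order san2order inf: cs1vg2 san2vip ( cs1lb2grp )

i.e. activate the VG, then the VIP, then the target/LUN groups (in parallel, hence the parentheses), with the reverse sequence on stop since mandatory order constraints are symmetrical by default. Is that the right direction?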

Regards
Craig