[Pacemaker] Resource does not come back to a node after node recovers from network issue

Prakash Velayutham prakash.velayutham at cchmc.org
Fri Aug 5 13:04:39 EDT 2011


Thanks Andrew. Setting migration-threshold and failure-timeout works well for my setup. MySQL (specifically the InnoDB engine) seems to be causing issues with OCFS2, but that is another story.
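Concretely, the meta attributes I set look roughly like this (crmsh syntax; the values below are illustrative, not necessarily right for every setup):

```shell
# Set cluster-wide resource defaults (example values):
# migration-threshold: after this many failures, the resource moves off the node.
# failure-timeout: recorded failures expire after this long, so the node
# becomes eligible to host the resource again once the problem is fixed.
crm configure rsc_defaults migration-threshold="1" failure-timeout="120s"
```

The same attributes can also be set per resource in the `meta` section of a primitive or group instead of via rsc_defaults.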

Thanks again,
Prakash

On Jul 31, 2011, at 10:45 PM, Andrew Beekhof wrote:

> On Fri, Jul 22, 2011 at 2:00 AM, Prakash Velayutham
> <prakash.velayutham at cchmc.org> wrote:
>> Hello all,
>> 
>> I have a 2 node cluster running
>> 
>> Corosync - 1.2.1
>> Pacemaker - 1.1.2
>> 
>> Both nodes have the primary (production) and private (heartbeat) networks bonded across 2 separate ethernet interfaces. eth0/eth1 for bond0 (primary) and eth2/eth3 for bond1 (private). I am trying to test the migration of resources by downing the production bond.
>> 
>> I am seeing a strange issue as described below.
>> 
>> 1. Assume the resources are currently hosted at node1.
>> 2. If I do "ifdown bond0", I can see that the g_mysql-1 group resource migrates to node2.
>> 3. If I do "ifup bond0" on node1 and then do a "ifdown bond0" on node2, the resources just get stopped, but not migrated back to node1.
>> 4. They do start up successfully on node1 if I do a "cleanup resource" on the resource group at this state.
>> 5. The strange thing is, at this point, if I do a "ifup bond0" on node2 and "ifdown bond0" on node1, the resources do migrate successfully to node2.
>> 
>> Not sure what is going on. I can see the following on node1's /var/log/messages.
>> 
>> Jul 21 11:02:41 node1 crm_resource: [6731]: ERROR: unpack_rsc_op: Hard error - p_vip-1_start_0 failed with rc=2: Preventing p_vip-1 from re-starting on node1
>> 
>> Is this what is stopping the resources and preventing them from migrating back to node1? Any idea what is going on?
> 
> By default failures to start the resource prevent the cluster from
> trying that node again.  You'd need to fix the underlying problem
> (which you did) and then run: crm cleanup p_vip-1
> Alternatively you could set migration-threshold=1 or
> start-failure-is-fatal=false.  Both options are well documented.
> 
>> 
>> The crm config is here.
>> 
>> node node1
>> node node2
>> primitive p_dlm-1 ocf:pacemaker:controld \
>>        operations $id="p_dlm-1-operations" \
>>        op monitor interval="120" timeout="20" start-delay="0" \
>>        params daemon="dlm_controld.pcmk"
>> primitive p_mysql-1 ocf:heartbeat:mysql \
>>        operations $id="p_mysql-1-operations" \
>>        op monitor interval="10s" timeout="15s" start-delay="15" \
>>        params datadir="/var/lib/mysql/data1" socket="/var/lib/mysql/data1/mysql.sock" \
>>        meta target-role="started"
>> primitive p_ocfs2-1 ocf:heartbeat:Filesystem \
>>        operations $id="p_ocfs2-1-operations" \
>>        op monitor interval="20" timeout="40" \
>>        params device="/dev/mapper/mysql01" directory="/var/lib/mysql/data1" fstype="ocfs2" \
>>        meta target-role="started"
>> primitive p_ocfs2control-1 ocf:ocfs2:o2cb \
>>        operations $id="p_ocfs2control-1-operations" \
>>        op monitor interval="120" timeout="20" start-delay="0" \
>>        params stack="pcmk"
>> primitive p_vip-1 ocf:heartbeat:IPaddr2 \
>>        operations $id="p_vip-1-operations" \
>>        op monitor interval="60s" timeout="10s" \
>>        params ip="10.200.31.103" broadcast="10.200.31.255" cidr_netmask="255.255.255.0" \
>>        meta target-role="started"
>> primitive stonith-1 stonith:external/riloe \
>>        meta target-role="started" \
>>        operations $id="stonith-1-operations" \
>>        op monitor interval="600" timeout="60" start-delay="0" \
>>        params hostlist="node1" ilo_hostname="node1rilo.chmcres.cchmc.org" ilo_user="xxxx" ilo_password="xxxx" ilo_can_reset="1" ilo_protocol="2.0" ilo_powerdown_method="power"
>> primitive stonith-2 stonith:external/riloe \
>>        meta target-role="started" \
>>        operations $id="stonith-2-operations" \
>>        op monitor interval="600" timeout="60" start-delay="0" \
>>        params hostlist="node2" ilo_hostname="node2rilo.chmcres.cchmc.org" ilo_user="xxxx" ilo_password="xxxx" ilo_can_reset="1" ilo_protocol="2.0" ilo_powerdown_method="power"
>> group g_mysql-1 p_vip-1 p_mysql-1 \
>>        meta target-role="started"
>> clone c_dlm-1 p_dlm-1 \
>>        meta interleave="true" target-role="started"
>> clone c_ocfs2-1 p_ocfs2-1 \
>>        meta interleave="true" target-role="started"
>> clone c_ocfs2control-1 p_ocfs2control-1 \
>>        meta interleave="true" target-role="started"
>> location stonith-1-never-on-node1 stonith-1 -inf: node1
>> location stonith-2-never-on-node2 stonith-2 -inf: node2
>> colocation g_mysql-1-with-ocfs2-1 inf: g_mysql-1 c_ocfs2-1
>> colocation ocfs2-1-with-ocfs2control-1 inf: c_ocfs2-1 c_ocfs2control-1
>> colocation ocfs2control-1-with-dlm-1 inf: c_ocfs2control-1 c_dlm-1
>> order start-mysql-1-after-ocfs2-1 : c_ocfs2-1 g_mysql-1
>> order start-ocfs2-1-after-ocfs2control-1 : c_ocfs2control-1 c_ocfs2-1
>> order start-ocfs2control-1-after-dlm-1 : c_dlm-1 c_ocfs2control-1
>> property $id="cib-bootstrap-options" \
>>        dc-version="1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5" \
>>        cluster-infrastructure="openais" \
>>        expected-quorum-votes="2" \
>>        no-quorum-policy="ignore" \
>>        last-lrm-refresh="1311260565" \
>>        stonith-timeout="30s" \
>>        start-failure-is-fatal="false"
>> 
>> Thanks a ton,
>> Prakash




More information about the Pacemaker mailing list