[Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?

Wed May 15 18:44:21 EDT 2013

On 2013-05-15 20:44, Andrew Widdersheim wrote:
> Sorry to bring up an old issue, but I am having the exact same problem as the original poster. A simultaneous disconnect on my two-node cluster causes the resources to start transitioning to the other node, but mid-flight the transition is aborted and the resources are started again on the original node once the cluster realizes that connectivity is the same on both nodes.
> 
> I have tried various dampen settings without any luck. It seems the nodes report the outage at slightly different times, which results in a partial transition of resources. I would have expected the cluster to wait until it knows the connectivity of all nodes before taking action, which is what I thought dampen would help with.
> 

Do you have some logs for us?
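
To make the timing problem described above concrete, here is a minimal Python sketch of the scoring, not Pacemaker's actual policy engine. It assumes the pingd attribute equals the number of reachable hosts times the multiplier, that the "pingd: defined pingd" rule uses that value as the location score, and that the node currently running the group gets a flat stickiness bonus. The node names node1/node2, the helper names and the flat stickiness handling are all hypothetical simplifications. The DRBD colocation is ignored.

```python
MULTIPLIER = 1000   # multiplier="1000" from p_ping
STICKINESS = 200    # resource-stickiness="200" from rsc_defaults (simplified)


def pingd_value(reachable_hosts, multiplier=MULTIPLIER):
    """Assumed attribute value: reachable ping targets times the multiplier."""
    return reachable_hosts * multiplier


def preferred_node(pingd, running_on, stickiness=STICKINESS):
    """Pick the node with the highest score; ties go to the current node."""
    scores = {
        node: value + (stickiness if node == running_on else 0)
        for node, value in pingd.items()
    }
    return max(scores, key=lambda node: (scores[node], node == running_on))


def replay(initial, updates, running_on):
    """Re-evaluate placement after every single attribute update.

    Assumes the move has not completed between updates, so the
    stickiness bonus stays with the original node throughout.
    """
    pingd = dict(initial)
    decisions = []
    for node, reachable_hosts in updates:
        pingd[node] = pingd_value(reachable_hosts)
        decisions.append(preferred_node(pingd, running_on))
    return decisions


initial = {"node1": pingd_value(1), "node2": pingd_value(1)}

# Both nodes lose the ping target, but node1's update arrives first.
staggered = replay(initial, [("node1", 0), ("node2", 0)], running_on="node1")
print(staggered)  # ['node2', 'node1']

# If both updates were seen in one evaluation, nothing would move.
together = preferred_node({"node1": 0, "node2": 0}, running_on="node1")
print(together)  # node1
```

In this model the first update alone makes node2 look better (1000 vs. 0 + 200), and the second update flips the decision back, which matches the aborted transition described in the quoted text. Whether the real cluster behaves this way can only be confirmed from the logs.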

> Ideally the cluster wouldn't start the transition if another cluster node is having a connectivity issue as well and connectivity status is shared between all cluster nodes. My configuration is below. Let me know if there is something I can change to fix this, or if this behavior is expected.
> 
> primitive p_drbd ocf:linbit:drbd \
>         params drbd_resource="r1" \
>         op monitor interval="30s" role="Slave" \
>         op monitor interval="10s" role="Master"
> primitive p_fs ocf:heartbeat:Filesystem \
>         params device="/dev/drbd/by-res/r1" directory="/drbd/r1" fstype="ext4" options="noatime" \
>         op start interval="0" timeout="60s" \
>         op stop interval="0" timeout="180s" \
>         op monitor interval="30s" timeout="40s"
> primitive p_mysql ocf:heartbeat:mysql \
>         params binary="/usr/libexec/mysqld" config="/drbd/r1/mysql/my.cnf" datadir="/drbd/r1/mysql" \
>         op start interval="0" timeout="120s" \
>         op stop interval="0" timeout="120s" \
>         op monitor interval="30s" \
>         meta target-role="Started"
> primitive p_ping ocf:pacemaker:ping \
>         params host_list="192.168.5.1" dampen="30s" multiplier="1000" debug="true" \
>         op start interval="0" timeout="60s" \
>         op stop interval="0" timeout="60s" \
>         op monitor interval="5s" timeout="10s"
> group g_mysql_group p_fs p_mysql \
>         meta target-role="Started"
> ms ms_drbd p_drbd \
>         meta notify="true" master-max="1" clone-max="2" target-role="Started"
> clone cl_ping p_ping
> location l_connected g_mysql \
>         rule $id="l_connected-rule" pingd: defined pingd
> colocation c_mysql_on_drbd inf: g_mysql ms_drbd:Master
> order o_drbd_before_mysql inf: ms_drbd:promote g_mysql:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.6-1.el6-8b6c6b9b6dc2627713f870850d20163fad4cc2a2" \
>         cluster-infrastructure="Heartbeat" \

Hmm ... you compiled your own Pacemaker version that supports Heartbeat
on RHEL6?

Best regards,
Andreas

>         no-quorum-policy="ignore" \
>         stonith-enabled="false" \
>         cluster-recheck-interval="5m" \
>         last-lrm-refresh="1368632470"
> rsc_defaults $id="rsc-options" \
>         migration-threshold="5" \
>         resource-stickiness="200"