[Pacemaker] Preventing Automatic Failback

Tue Jan 21 10:26:45 EST 2014

----- Original Message -----
> From: "Michael Monette" <mmonette at 2keys.ca>
> To: pacemaker at oss.clusterlabs.org
> Sent: Monday, January 20, 2014 8:22:25 AM
> Subject: [Pacemaker] Preventing Automatic Failback
> 
> Hi,
> 
> I posted this question before but my question was a bit unclear.
> 
> I have 2 nodes with DRBD with Postgresql.
> 
> When node-1 fails, everything fails over to node-2. But when node-1 is recovered,
> things try to fail back to node-1 and all the services running on node-2 get
> disrupted (things don't ACTUALLY fail back to node-1; they try, fail, and
> then all services on node-2 are simply restarted, which is very annoying). This
> does not happen if I perform the same tests on node-2! I can reboot node-2,
> things fail over to node-1, and node-2 comes online and waits until it is
> needed (this is what I want!). It seems to only affect my node-1's.
> 
> I have tried to set resource stickiness, I have tried everything I can really
> think of, but whenever the Primary has recovered, it will always disrupt
> services running on node-2.
> 
> Also I tried removing things from this config to try and isolate this. At one
> point I removed the atlassian_jira and drbd2_var primitives and only had a
> failover-ip and drbd1_opt, but still had the same problem. Hopefully someone
> can pinpoint this for me. If I can't really avoid this, I would at least
> like to make this "bug" or whatever happen on node-2 instead of the actives.

I bet this is due to the drbd resource's master score on node1 being higher than on node2.  When you recover node1, are you actually rebooting that node?  If node1 doesn't lose membership from the cluster (as it would on a reboot), the transient attributes that the drbd agent uses to specify which node will be the master instance will stick around.  So if you are just putting node1 in standby and then bringing the node back online, I believe the resources will come back if the drbd master was originally on node1.
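
A rough way to check this, offered as a sketch rather than a definitive procedure: the commands below print the node attributes and the scores the policy engine is currently working with. This assumes it is run on a cluster node with the Pacemaker command-line tools installed. The helper name filter_master_attrs is made up for illustration, and the exact attribute name (for example master-drbd_data) may differ between versions.

```shell
# Hypothetical helper: keep only lines that mention a master-* attribute.
filter_master_attrs() {
    grep -E 'master-' || true
}

if command -v crm_mon >/dev/null 2>&1; then
    # -A shows node attributes, -1 prints the status once and exits.
    crm_mon -A -1 | filter_master_attrs
    # -s shows allocation scores, -L reads the live cluster state.
    crm_simulate -sL
else
    echo "Pacemaker tools not found; run this on a cluster node." >&2
fi
```

Comparing the output before and after node1 rejoins should show whether its master score is still set while node2 holds the master role.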

If you provide a policy engine file that shows the unwanted transition from node2 back to node1, we'll be able to tell you exactly why it is occurring.
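
As a sketch of how to find and replay such a file: the policy engine saves its inputs as pe-input-N.bz2 files on the DC node. The directory below is an assumption (it is the usual location for Pacemaker 1.1.x builds and may differ on your install), and latest_pe_input is a hypothetical helper. The newest file is not necessarily the one that holds the unwanted transition, so match the file's timestamp against the time of the failback.

```shell
# Assumed default location; override PE_DIR if your build uses another path.
PE_DIR="${PE_DIR:-/var/lib/pacemaker/pengine}"

# Hypothetical helper: print the most recently written pe-input file, if any.
latest_pe_input() {
    ls -t "$PE_DIR"/pe-input-*.bz2 2>/dev/null | head -n 1
}

if command -v crm_simulate >/dev/null 2>&1; then
    pe_file=$(latest_pe_input)
    if [ -n "$pe_file" ]; then
        # -x reads the saved input, -S simulates the transition,
        # -s prints the scores that led to it.
        crm_simulate -s -S -x "$pe_file"
    else
        echo "No pe-input files found in $PE_DIR" >&2
    fi
else
    echo "crm_simulate not found; run this on a cluster node." >&2
fi
```

The file that corresponds to the failback is the one to attach to a reply.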

-- Vossel

> 
> Here is my config:
> 
> node node-1.comp.com \
>         attributes standby="off"
> node node-1.comp.com \
>         attributes standby="off"
> primitive atlassian_jira lsb:jira \
>         op start interval="0" timeout="240" \
>         op stop interval="0" timeout="240"
> primitive drbd1_opt ocf:heartbeat:Filesystem \
>         params device="/dev/drbd1" directory="/opt/atlassian" fstype="ext4"
> primitive drbd2_var ocf:heartbeat:Filesystem \
>         params device="/dev/drbd2" directory="/var/atlassian" fstype="ext4"
> primitive drbd_data ocf:linbit:drbd \
>         params drbd_resource="r0" \
>         op monitor interval="29s" role="Master" \
>         op monitor interval="31s" role="Slave"
> primitive failover-ip ocf:heartbeat:IPaddr2 \
>         params ip="10.199.0.13"
> group jira_services drbd1_opt drbd2_var failover-ip atlassian_jira
> ms ms_drbd_data drbd_data \
>         meta master-max="1" master-node-max="1" clone-max="2" \
>         clone-node-max="1" notify="true"
> colocation jira_services_on_drbd inf: atlassian_jira ms_drbd_data:Master
> order jira_services_after_drbd inf: ms_drbd_data:promote jira_services:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.10-14.el6_5.1-368c726" \
>         cluster-infrastructure="classic openais (with plugin)" \
>         expected-quorum-votes="2" \
>         stonith-enabled="false" \
>         no-quorum-policy="ignore" \
>         last-lrm-refresh="1390183165" \
>         default-resource-stickiness="INFINITY"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="INFINITY"
> 
> Thanks
> 
> Mike
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>