[Pacemaker] resource stickiness and preventing stonith on failback

Andrew Beekhof andrew at beekhof.net
Mon Sep 19 23:02:26 EDT 2011


On Wed, Aug 24, 2011 at 6:56 AM, Brian J. Murrell <brian at interlinx.bc.ca> wrote:
> Hi All,
>
> I am trying to configure pacemaker (1.0.10) to make a single filesystem
> highly available across two nodes (please don't be distracted by the dangers
> of multiply mounted filesystems and clustering filesystems, etc., as I
> am absolutely clear about that -- consider that I am using a filesystem
> resource as just an example if you wish).  Here is my filesystem
> resource description:
>
> node foo1
> node foo2 \
>        attributes standby="off"
> primitive OST1 ocf:heartbeat:Filesystem \
>        meta target-role="Started" \
>        operations $id="BAR1-operations" \
>        op monitor interval="120" timeout="60" \
>        op start interval="0" timeout="300" \
>        op stop interval="0" timeout="300" \
>        params device="/dev/disk/by-uuid/8c500092-5de6-43d7-b59a-ef91fa9667b9" \
>        directory="/mnt/bar1" fstype="ext3"
> primitive st-pm stonith:external/powerman \
>        params serverhost="192.168.122.1:10101" poweroff="0"
> clone fencing st-pm
> property $id="cib-bootstrap-options" \
>        dc-version="1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3" \
>        cluster-infrastructure="openais" \
>        expected-quorum-votes="1" \
>        no-quorum-policy="ignore" \
>        last-lrm-refresh="1306783242" \
>        default-resource-stickiness="1000"
> rsc_defaults $id="rsc-options" \
>        resource-stickiness="100"
>
> The two problems I have run into are:
>
> 1. preventing the resource from failing back to the node it was
>   previously on after it has failed over and the previous node has
>   been restored.  Basically what's documented at
>
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch05s03s02.html
>
> 2. preventing the active node from being STONITHed when the resource
>   is moved back to its failed-and-restored node after a failover.
>   IOW: BAR1 is available on foo1, which fails, and the resource is moved
>   to foo2.  foo1 returns and the resource is failed back to foo1, but
>   in doing that foo2 is STONITHed.
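
A minimal diagnostic sketch for both problems (illustrative only, nothing here
was run against this cluster, and the file name is arbitrary): feed the live
CIB to the policy engine and let it print the placement scores and the actions,
including any fencing, it would schedule:

    # dump the live CIB to a file
    cibadmin -Q > /tmp/cib.xml
    # replay it through the 1.0 policy engine, showing allocation
    # scores and the full transition, fencing included
    ptest -x /tmp/cib.xml -s -VV
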
>
> For #1, as you can see, I tried setting the default resource stickiness
> to 100.  That didn't seem to work.  When I stopped corosync on the
> active node, the service failed over but it promptly failed back when I
> started corosync again, contrary to the example on the referenced URL.
>
> Subsequently I (think I) tried adding a resource-specific stickiness
> of 1000.  That didn't seem to help either.
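
For comparison, a minimal sketch of how stickiness is usually expressed (the
values are illustrative, not a verified fix for this configuration): a single
resource-stickiness in rsc_defaults, or a meta attribute on the resource
itself, rather than the legacy default-resource-stickiness cluster property:

    # cluster-wide default for every resource (crm shell)
    crm configure rsc_defaults resource-stickiness="1000"
    # or just for OST1, as a meta attribute
    crm resource meta OST1 set resource-stickiness 1000
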
>
> As for #2, the issue with STONITHing foo2 when failing back to foo1 is
> that foo1 and foo2 are an active/active pair of servers.  STONITHing
> foo2 just to restore foo1's services puts foo2's services out of service.
>
> I do want a node that is believed to be dead to be STONITHed before its
> resource(s) are failed over, though.

That's a great way to ensure your data gets trashed.
If the "node that is believed to be dead" isn't /actually/ dead,
you'll have two nodes running the same resources and writing to the
same files.
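
Keeping fencing enabled is what protects against exactly that situation.  For
reference, a rough sketch of the cluster properties involved (values here are
illustrative, not tuned for this setup):

    # fencing must stay enabled for failover to be safe
    crm configure property stonith-enabled="true"
    # what the fencing device is asked to do; "reboot" is the default,
    # "poweroff" is the alternative
    crm configure property stonith-action="reboot"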

>
> Any hints on what I am doing wrong?
>
> Thanx and cheers,
> b.



