[Pacemaker] Infinite fail-count and migration-threshold after node fail-back

Pavlos Parissis pavlos.parissis at gmail.com
Fri Nov 12 12:45:11 EST 2010


On 11 November 2010 16:59, Dan Frincu <dfrincu at streamwide.ro> wrote:
[...snip...]
>
>   <constraints>
>     <rsc_location id="loc-1" rsc="Webserver" node="sles-1" score="200"/>
>     <rsc_location id="loc-2" rsc="Webserver" node="sles-3" score="0"/>
>     <rsc_location id="loc-3" rsc="Database" node="sles-2" score="200"/>
>     <rsc_location id="loc-4" rsc="Database" node="sles-3" score="0"/>
>   </constraints>
> Example 6.1. Example set of opt-in location constraints
>
> Since you have symmetric-cluster=false, you need to add location
> constraints in order to get your resources running.
> Below is my conf, and it works as expected: pbx_service_01 starts on
> node-01 and never fails back if it has failed over to node-03 and
> node-01 comes back online, due to resource-stickiness="1000". But take
> a look at the score in the location constraint - very low compared to
> 1000; I could also have set it to inf.
>
>
> Yes, but you don't have groups defined in your setup; with groups, the
> scores of all active member resources are added together.
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ch-advanced-resources.html#id2220530
>
> For example:
>
> root at cluster1:~# ptest -sL
> Allocation scores:
> group_color: all allocation score on cluster1: 0
> group_color: all allocation score on cluster2: -1000000
> group_color: virtual_ip_1 allocation score on cluster1: 1000
> group_color: virtual_ip_1 allocation score on cluster2: -1000000
> group_color: virtual_ip_2 allocation score on cluster1: 1000
> group_color: virtual_ip_2 allocation score on cluster2: 0
> group_color: Failover_Alert allocation score on cluster1: 1000
> group_color: Failover_Alert allocation score on cluster2: 0
> group_color: fs_home allocation score on cluster1: 1000
> group_color: fs_home allocation score on cluster2: 0
> group_color: fs_mysql allocation score on cluster1: 1000
> group_color: fs_mysql allocation score on cluster2: 0
> group_color: fs_storage allocation score on cluster1: 1000
> group_color: fs_storage allocation score on cluster2: 0
> group_color: httpd allocation score on cluster1: 1000
> group_color: httpd allocation score on cluster2: 0
> group_color: mysqld allocation score on cluster1: 1000
> group_color: mysqld allocation score on cluster2: 0
> clone_color: ms_drbd_home allocation score on cluster1: 9000
> clone_color: ms_drbd_home allocation score on cluster2: -1000000
> clone_color: drbd_home:0 allocation score on cluster1: 1100
> clone_color: drbd_home:0 allocation score on cluster2: 0
> clone_color: drbd_home:1 allocation score on cluster1: 0
> clone_color: drbd_home:1 allocation score on cluster2: 1100
> native_color: drbd_home:0 allocation score on cluster1: 1100
> native_color: drbd_home:0 allocation score on cluster2: 0
> native_color: drbd_home:1 allocation score on cluster1: -1000000
> native_color: drbd_home:1 allocation score on cluster2: 1100
> drbd_home:0 promotion score on cluster1: 18100
> drbd_home:1 promotion score on cluster2: -1000000
> clone_color: ms_drbd_mysql allocation score on cluster1: 10100
> clone_color: ms_drbd_mysql allocation score on cluster2: -1000000
> clone_color: drbd_mysql:0 allocation score on cluster1: 1100
> clone_color: drbd_mysql:0 allocation score on cluster2: 0
> clone_color: drbd_mysql:1 allocation score on cluster1: 0
> clone_color: drbd_mysql:1 allocation score on cluster2: 1100
> native_color: drbd_mysql:0 allocation score on cluster1: 1100
> native_color: drbd_mysql:0 allocation score on cluster2: 0
> native_color: drbd_mysql:1 allocation score on cluster1: -1000000
> native_color: drbd_mysql:1 allocation score on cluster2: 1100
> drbd_mysql:0 promotion score on cluster1: 20300
> drbd_mysql:1 promotion score on cluster2: -1000000
> clone_color: ms_drbd_storage allocation score on cluster1: 11200
> clone_color: ms_drbd_storage allocation score on cluster2: -1000000
> clone_color: drbd_storage:0 allocation score on cluster1: 1100
> clone_color: drbd_storage:0 allocation score on cluster2: 0
> clone_color: drbd_storage:1 allocation score on cluster1: 0
> clone_color: drbd_storage:1 allocation score on cluster2: 1100
> native_color: drbd_storage:0 allocation score on cluster1: 1100
> native_color: drbd_storage:0 allocation score on cluster2: 0
> native_color: drbd_storage:1 allocation score on cluster1: -1000000
> native_color: drbd_storage:1 allocation score on cluster2: 1100
> drbd_storage:0 promotion score on cluster1: 22500
> drbd_storage:1 promotion score on cluster2: -1000000
> native_color: virtual_ip_1 allocation score on cluster1: 12300
> native_color: virtual_ip_1 allocation score on cluster2: -1000000
> native_color: virtual_ip_2 allocation score on cluster1: 8000
> native_color: virtual_ip_2 allocation score on cluster2: -1000000
> native_color: Failover_Alert allocation score on cluster1: 7000
> native_color: Failover_Alert allocation score on cluster2: -1000000
> native_color: fs_home allocation score on cluster1: 6000
> native_color: fs_home allocation score on cluster2: -1000000
> native_color: fs_mysql allocation score on cluster1: 5000
> native_color: fs_mysql allocation score on cluster2: -1000000
> native_color: fs_storage allocation score on cluster1: 4000
> native_color: fs_storage allocation score on cluster2: -1000000
> native_color: mysqld allocation score on cluster1: 4000
> native_color: mysqld allocation score on cluster2: -1000000
> native_color: httpd allocation score on cluster1: 16000
> native_color: httpd allocation score on cluster2: -1000000
> drbd_home:0 promotion score on cluster1: 1000000
> drbd_home:1 promotion score on cluster2: -1000000
> drbd_mysql:0 promotion score on cluster1: 1000000
> drbd_mysql:1 promotion score on cluster2: -1000000
> drbd_storage:0 promotion score on cluster1: 1000000
> drbd_storage:1 promotion score on cluster2: -1000000
> clone_color: ping_gw_clone allocation score on cluster1: 0
> clone_color: ping_gw_clone allocation score on cluster2: 0
> clone_color: ping_gw:0 allocation score on cluster1: 1000
> clone_color: ping_gw:0 allocation score on cluster2: 0
> clone_color: ping_gw:1 allocation score on cluster1: 0
> clone_color: ping_gw:1 allocation score on cluster2: 1000
> native_color: ping_gw:0 allocation score on cluster1: 1000
> native_color: ping_gw:0 allocation score on cluster2: 0
> native_color: ping_gw:1 allocation score on cluster1: -1000000
> native_color: ping_gw:1 allocation score on cluster2: 1000

I have the same version as you, although I am using Heartbeat, and I ran
your scenario on my systems.
I had pbx_service_01 (which is a group) on node-01 and put that node in
standby with crm node standby node-01. The resource group and the
corresponding drbd ms resource failed over to node-03. When I brought
node-01 back online with crm node online node-01, the pbx_service_01
group and the corresponding drbd ms resource did not fail back.
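
In short, the test sequence looked roughly like this (just a sketch; the
score check is the same ptest -sL used above):

    # put the active node in standby and watch the group move
    crm node standby node-01    # pbx_service_01 + ms-drbd_01 fail over to node-03
    ptest -sL                   # inspect allocation/promotion scores after the move

    # bring the node back; with resource-stickiness=1000 the group stays on node-03
    crm node online node-01
    ptest -sL                   # node-01 scores stay below the group's accumulated stickiness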
Below are my scores from before the failover (the first column is a line
number), and you can also find my conf at the bottom.

 1 group_color: pbx_service_01 allocation score on node-01: 200
  2 group_color: pbx_service_01 allocation score on node-03: 10
  3 group_color: ip_01 allocation score on node-01: 1200
  4 group_color: ip_01 allocation score on node-03: 10
  5 group_color: fs_01 allocation score on node-01: 1000
  6 group_color: fs_01 allocation score on node-03: 0
  7 group_color: pbx_01 allocation score on node-01: 1000
  8 group_color: pbx_01 allocation score on node-03: 0
  9 group_color: sshd_01 allocation score on node-01: 1000
 10 group_color: sshd_01 allocation score on node-03: 0
 11 group_color: mailAlert-01 allocation score on node-01: 1000
 12 group_color: mailAlert-01 allocation score on node-03: 0
 13 native_color: ip_01 allocation score on node-01: 5200
 14 native_color: ip_01 allocation score on node-03: 10
 15 clone_color: ms-drbd_01 allocation score on node-01: 4100
 16 clone_color: ms-drbd_01 allocation score on node-03: -1000000
 17 clone_color: drbd_01:0 allocation score on node-01: 11100
 18 clone_color: drbd_01:0 allocation score on node-03: 0
 19 clone_color: drbd_01:1 allocation score on node-01: 100
 20 clone_color: drbd_01:1 allocation score on node-03: 11000
 21 native_color: drbd_01:0 allocation score on node-01: 11100
 22 native_color: drbd_01:0 allocation score on node-03: 0
 23 native_color: drbd_01:1 allocation score on node-01: -1000000
 24 native_color: drbd_01:1 allocation score on node-03: 11000
 25 drbd_01:0 promotion score on node-01: 18100
 26 drbd_01:1 promotion score on node-03: -1000000
 27 native_color: fs_01 allocation score on node-01: 15100
 28 native_color: fs_01 allocation score on node-03: -1000000
 29 native_color: pbx_01 allocation score on node-01: 3000
 30 native_color: pbx_01 allocation score on node-03: -1000000
 31 native_color: sshd_01 allocation score on node-01: 2000
 32 native_color: sshd_01 allocation score on node-03: -1000000
 33 native_color: mailAlert-01 allocation score on node-01: 1000
 34 native_color: mailAlert-01 allocation score on node-03: -1000000
 35 group_color: pbx_service_02 allocation score on node-02: 200
 36 group_color: pbx_service_02 allocation score on node-03: 10
 37 group_color: ip_02 allocation score on node-02: 1200
 38 group_color: ip_02 allocation score on node-03: 10
 39 group_color: fs_02 allocation score on node-02: 1000
 40 group_color: fs_02 allocation score on node-03: 0
 41 group_color: pbx_02 allocation score on node-02: 1000
 42 group_color: pbx_02 allocation score on node-03: 0
 43 group_color: sshd_02 allocation score on node-02: 1000
 44 group_color: sshd_02 allocation score on node-03: 0
 45 group_color: mailAlert-02 allocation score on node-02: 1000
 46 group_color: mailAlert-02 allocation score on node-03: 0
 47 native_color: ip_02 allocation score on node-02: 5200
 48 native_color: ip_02 allocation score on node-03: 10
 49 clone_color: ms-drbd_02 allocation score on node-02: 4100
 50 clone_color: ms-drbd_02 allocation score on node-03: -1000000
 51 clone_color: drbd_02:0 allocation score on node-02: 11100
 52 clone_color: drbd_02:0 allocation score on node-03: 0
 53 clone_color: drbd_02:1 allocation score on node-02: 100
 54 clone_color: drbd_02:1 allocation score on node-03: 11000
 55 native_color: drbd_02:0 allocation score on node-02: 11100
 56 native_color: drbd_02:0 allocation score on node-03: 0
 57 native_color: drbd_02:1 allocation score on node-02: -1000000
 58 native_color: drbd_02:1 allocation score on node-03: 11000
 59 drbd_02:0 promotion score on node-02: 18100
 60 drbd_02:2 promotion score on none: 0
 61 drbd_02:1 promotion score on node-03: -1000000
 62 native_color: fs_02 allocation score on node-02: 15100
 63 native_color: fs_02 allocation score on node-03: -1000000
 64 native_color: pbx_02 allocation score on node-02: 3000
 65 native_color: pbx_02 allocation score on node-03: -1000000
 66 native_color: sshd_02 allocation score on node-02: 2000
 67 native_color: sshd_02 allocation score on node-03: -1000000
 68 native_color: mailAlert-02 allocation score on node-02: 1000
 69 native_color: mailAlert-02 allocation score on node-03: -1000000
 70 drbd_01:0 promotion score on node-01: 1000000
 71 drbd_01:1 promotion score on node-03: -1000000
 72 drbd_02:0 promotion score on node-02: 1000000
 73 drbd_02:2 promotion score on none: 0
 74 drbd_02:1 promotion score on node-03: -1000000
 75 native_color: pdu allocation score on node-03: -1000000
 76 native_color: pdu allocation score on node-02: -1000000
 77 native_color: pdu allocation score on node-01: -1000000

node $id="059313ce-c6aa-4bd5-a4fb-4b781de6d98f" node-03
node $id="d791b1f5-9522-4c84-a66f-cd3d4e476b38" node-02
node $id="e388e797-21f4-4bbe-a588-93d12964b4d7" node-01 \
        attributes standby="off"
primitive drbd_01 ocf:linbit:drbd \
        params drbd_resource="drbd_resource_01" \
        op monitor interval="30s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="120s"
primitive drbd_02 ocf:linbit:drbd \
        params drbd_resource="drbd_resource_02" \
        op monitor interval="30s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="120s"
primitive fs_01 ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3" \
        meta migration-threshold="3" failure-timeout="60" is-managed="true" \
        op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive fs_02 ocf:heartbeat:Filesystem \
        params device="/dev/drbd2" directory="/pbx_service_02" fstype="ext3" \
        meta migration-threshold="3" failure-timeout="60" \
        op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive ip_01 ocf:heartbeat:IPaddr2 \
        params ip="192.168.78.10" nic="eth3" cidr_netmask="24" broadcast="192.168.78.255" \
        meta failure-timeout="120" migration-threshold="3" \
        op monitor interval="5s"
primitive ip_02 ocf:heartbeat:IPaddr2 \
        meta failure-timeout="120" migration-threshold="3" \
        params ip="192.168.78.20" nic="eth3" cidr_netmask="24" broadcast="192.168.78.255" \
        op monitor interval="5s"
primitive mailAlert-01 ocf:heartbeat:MailTo \
        params email="root" subject="[Zanadoo Clustet event] pbx_service_01" \
        op monitor interval="2" timeout="10" \
        op start interval="0" timeout="10" \
        op stop interval="0" timeout="10"
primitive mailAlert-02 ocf:heartbeat:MailTo \
        params email="root" subject="[Zanadoo Clustet event] pbx_service_02" \
        op monitor interval="2" timeout="10" \
        op start interval="0" timeout="10" \
        op stop interval="0" timeout="10"
primitive pbx_01 lsb:znd-pbx_01 \
        meta migration-threshold="3" failure-timeout="60" is-managed="true" \
        op monitor interval="20s" timeout="20s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive pbx_02 lsb:znd-pbx_02 \
        meta migration-threshold="3" failure-timeout="60" \
        op monitor interval="20s" timeout="20s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive pdu stonith:external/rackpdu \
        params community="empisteftiko" names_oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.4" oid=".1.3.6.1.4.1.318.1.1.4.4.2.1.3" hostlist="node-01,node-02,node-03" pduip="192.168.100.100" stonith-timeout="30" \
        op monitor interval="1m" timeout="60s" \
        meta target-role="Stopped"
primitive sshd_01 lsb:znd-sshd-pbx_01 \
        meta is-managed="true" \
        op monitor on-fail="stop" interval="10m" \
        op start interval="0" timeout="60s" on-fail="stop" \
        op stop interval="0" timeout="60s" on-fail="stop"
primitive sshd_02 lsb:znd-sshd-pbx_02 \
        op monitor on-fail="stop" interval="10m" \
        op start interval="0" timeout="60s" on-fail="stop" \
        op stop interval="0" timeout="60s" on-fail="stop" \
        meta target-role="Started"
group pbx_service_01 ip_01 fs_01 pbx_01 sshd_01 mailAlert-01 \
        meta target-role="Started"
group pbx_service_02 ip_02 fs_02 pbx_02 sshd_02 mailAlert-02 \
        meta target-role="Started"
ms ms-drbd_01 drbd_01 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started" is-managed="true"
ms ms-drbd_02 drbd_02 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started" is-managed="true"
location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
location PrimaryNode-drbd_02 ms-drbd_02 100: node-02
location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
location PrimaryNode-pbx_service_02 pbx_service_02 200: node-02
location SecondaryNode-drbd_01 ms-drbd_01 0: node-03
location SecondaryNode-drbd_02 ms-drbd_02 0: node-03
location SecondaryNode-pbx_service_01 pbx_service_01 10: node-03
location SecondaryNode-pbx_service_02 pbx_service_02 10: node-03
location fencing-on-node-01 pdu 1: node-01
location fencing-on-node-02 pdu 1: node-02
location fencing-on-node-03 pdu 1: node-03
colocation fs_01-on-drbd_01 inf: fs_01 ms-drbd_01:Master
colocation fs_02-on-drbd_02 inf: fs_02 ms-drbd_02:Master
order pbx_service_01-after-drbd_01 inf: ms-drbd_01:promote pbx_service_01:start
order pbx_service_02-after-drbd_02 inf: ms-drbd_02:promote pbx_service_02:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
        cluster-infrastructure="Heartbeat" \
        symmetric-cluster="false" \
        stonith-enabled="false" \
        last-lrm-refresh="1289304946"
rsc_defaults $id="rsc-options" \
        resource-stickiness="1000"



