[Pacemaker] Infinite fail-count and migration-threshold after node fail-back

Dan Frincu dfrincu at streamwide.ro
Mon Oct 11 03:40:04 EDT 2010


Hi all,

I've managed to make this setup work. The issue is that with 
symmetric-cluster="false" and the resources' locations specified 
manually, the resources always obey the location constraints and (as 
far as I could see) disregard the rsc_defaults resource-stickiness 
values. That is not the expected behavior: in theory, 
symmetric-cluster="false" should only control whether resources are 
allowed to run anywhere by default, while resource-stickiness should 
lock the resources in place so they don't bounce from node to node. 
That didn't happen here, but with symmetric-cluster="true", the same 
ordering and colocation constraints, and the same resource-stickiness, 
the behavior is as expected.
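
For reference, a minimal sketch of the combination that ended up 
working for me (the group and constraint names below are only 
illustrative, and the scores are examples rather than my actual values):

# crm configure property symmetric-cluster="true"
# crm configure rsc_defaults resource-stickiness="200"
# crm configure location prefer_bench1 grp_services 100: bench1.streamwide.ro

The idea is that once the group has failed over, its accumulated 
stickiness on the current node outweighs the +100 location preference 
for bench1, so the resources stay where they are when bench1 returns.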

I don't remember the docs on clusterlabs.org mentioning anywhere that 
resource-stickiness only works with symmetric-cluster="true", so I hope 
this helps anyone else who stumbles upon this issue.
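
To check which combination a cluster is actually running, grepping the 
live configuration is enough, along these lines:

# crm configure show | grep -E "symmetric-cluster|resource-stickiness"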

Regards,

Dan

Dan Frincu wrote:
> Hi,
>
> Since it was brought to my attention that I should upgrade from 
> openais-0.80 to a more recent version of corosync, I've done just 
> that; however, I'm now seeing some strange behavior on the cluster.
>
> The same setup was previously used with the packages below:
>
> # rpm -qa | grep -iE "(openais|cluster|heartbeat|pacemaker|resource)"
> openais-0.80.5-15.2
> cluster-glue-1.0-12.2
> pacemaker-1.0.5-4.2
> cluster-glue-libs-1.0-12.2
> resource-agents-1.0-31.5
> pacemaker-libs-1.0.5-4.2
> pacemaker-mgmt-1.99.2-7.2
> libopenais2-0.80.5-15.2
> heartbeat-3.0.0-33.3
> pacemaker-mgmt-client-1.99.2-7.2
>
> Now I've migrated to the most recent stable packages I could find (on 
> the clusterlabs.org website) for RHEL5:
>
> # rpm -qa | grep -iE "(openais|cluster|heartbeat|pacemaker|resource)"
> cluster-glue-1.0.6-1.6.el5
> pacemaker-libs-1.0.9.1-1.el5
> pacemaker-1.0.9.1-1.el5
> heartbeat-libs-3.0.3-2.el5
> heartbeat-3.0.3-2.el5
> openaislib-1.1.3-1.6.el5
> resource-agents-1.0.3-2.el5
> cluster-glue-libs-1.0.6-1.6.el5
> openais-1.1.3-1.6.el5
>
> Expected behavior:
> - all the resources in the group should go (based on location 
> preference) to bench1
> - if bench1 goes down, resources migrate to bench2
> - if bench1 comes back up, resources stay on bench2, unless manually 
> told otherwise (roughly as exercised in the sketch below).
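>
> The sequence I use to exercise this is more or less the standard one 
> (standby/online via the crm shell, crm_mon to check placement):
>
> # crm node standby bench1.streamwide.ro    (resources should move to bench2)
> # crm node online bench1.streamwide.ro     (resources should stay on bench2)
> # crm_mon -1                               (one-shot status to verify placement)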
>
> With the previous packages this worked; with the new packages, not so 
> much. Now if bench1 goes down (crm node standby `uname -n`), failover 
> occurs, but when bench1 comes back up the resources migrate back, even 
> though default-resource-stickiness is set. Worse, two drbd resources 
> reach an infinite fail-count, apparently because the cluster tries to 
> promote them to Master on bench1 but fails, the underlying device 
> being held open (by some process I could not identify).
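>
> For anyone trying to reproduce this, the obvious checks for what might 
> be holding a drbd device open would be along these lines (device names 
> are just examples):
>
> # lsof /dev/drbd0
> # fuser -v /dev/drbd0
> # grep drbd /proc/mounts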
>
> Strangely enough, since the drbd resources fail to be promoted to 
> Master on bench1, they fail back to bench2, where they are mounted and 
> functional, but crm_mon shows:
>
> Migration summary:
> * Node bench2.streamwide.ro:
>   drbd_mysql:1: migration-threshold=1000000 fail-count=1000000
>   drbd_home:1: migration-threshold=1000000 fail-count=1000000
> * Node bench1.streamwide.ro:
>
> .... infinite fail-counts on bench2, while the drbd resources are 
> available (the /proc/drbd output below)
>
> version: 8.3.2 (api:88/proto:86-90)
> GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by 
> mockbuild@v20z-x86-64.home.local, 2009-08-29 14:07:55
> 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
>    ns:1632 nr:1864 dw:3512 dr:3933 al:11 bm:19 lo:0 pe:0 ua:0 ap:0 
> ep:1 wo:b oos:0
> 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
>    ns:4 nr:24 dw:28 dr:25 al:1 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
> 2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
>    ns:4 nr:24 dw:28 dr:85 al:1 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
>
> and mounted
>
> /dev/drbd1 on /home type ext3 (rw,noatime,nodiratime)
> /dev/drbd0 on /mysql type ext3 (rw,noatime,nodiratime)
> /dev/drbd2 on /storage type ext3 (rw,noatime,nodiratime)
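>
> I assume that even once the promotion problem itself is solved, the 
> fail-counts above will have to be cleared before Pacemaker considers 
> those instances again; something along these lines (resource names 
> taken from the crm_mon output above):
>
> # crm resource cleanup drbd_mysql
> # crm resource cleanup drbd_home
> # crm_mon -1 -f        (one-shot status including fail-counts)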
>
> Attached is the hb_report.
>
> Thank you in advance.
>
> Best regards
>

-- 
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania




