[ClusterLabs] Singleton resource not being migrated
Nikita Koshikov
koshikov at gmail.com
Fri Aug 5 00:08:40 UTC 2016
Hello list,
Can you please help me debug a resource that is not being started after a
node failover?
Here is the configuration I'm testing:
A 3-node cluster (KVM VMs) with:
node 10: aic-controller-58055.test.domain.local
node 6: aic-controller-50186.test.domain.local
node 9: aic-controller-12993.test.domain.local
primitive cmha cmha \
        params conffile="/etc/cmha/cmha.conf" daemon="/usr/bin/cmhad" pidfile="/var/run/cmha/cmha.pid" user=cmha \
        meta failure-timeout=30 resource-stickiness=1 target-role=Started migration-threshold=3 \
        op monitor interval=10 on-fail=restart timeout=20 \
        op start interval=0 on-fail=restart timeout=60 \
        op stop interval=0 on-fail=block timeout=90
primitive sysinfo_aic-controller-12993.test.domain.local ocf:pacemaker:SysInfo \
        params disk_unit=M disks="/ /var/log" min_disk_free=512M \
        op monitor interval=15s
primitive sysinfo_aic-controller-50186.test.domain.local ocf:pacemaker:SysInfo \
        params disk_unit=M disks="/ /var/log" min_disk_free=512M \
        op monitor interval=15s
primitive sysinfo_aic-controller-58055.test.domain.local ocf:pacemaker:SysInfo \
        params disk_unit=M disks="/ /var/log" min_disk_free=512M \
        op monitor interval=15s
location cmha-on-aic-controller-12993.test.domain.local cmha 100: aic-controller-12993.test.domain.local
location cmha-on-aic-controller-50186.test.domain.local cmha 100: aic-controller-50186.test.domain.local
location cmha-on-aic-controller-58055.test.domain.local cmha 100: aic-controller-58055.test.domain.local
location sysinfo-on-aic-controller-12993.test.domain.local sysinfo_aic-controller-12993.test.domain.local inf: aic-controller-12993.test.domain.local
location sysinfo-on-aic-controller-50186.test.domain.local sysinfo_aic-controller-50186.test.domain.local inf: aic-controller-50186.test.domain.local
location sysinfo-on-aic-controller-58055.test.domain.local sysinfo_aic-controller-58055.test.domain.local inf: aic-controller-58055.test.domain.local
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.14-70404b0 \
cluster-infrastructure=corosync \
cluster-recheck-interval=15s \
no-quorum-policy=stop \
stonith-enabled=false \
start-failure-is-fatal=false \
symmetric-cluster=false \
node-health-strategy=migrate-on-red \
last-lrm-refresh=1470334410
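For reference, this is how I check per-node state while testing (option
spellings are for crmsh/pacemaker 1.1.x as I understand them; adjust if
they differ on your version):

    # where does the cluster currently run cmha?
    crm_resource --resource cmha --locate

    # fail count of cmha on one node
    crm resource failcount cmha show aic-controller-50186.test.domain.local

    # one-shot status including node attributes
    # (SysInfo should set the #health_disk attribute when free space drops below min_disk_free)
    crm_mon -A1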
With all 3 nodes online everything looks OK; this is the output of scoreshow.sh:
Resource                                         Score      Node                                     Stickiness  #Fail  Migration-Threshold
cmha                                             -INFINITY  aic-controller-12993.test.domain.local   1           0
cmha                                             101        aic-controller-50186.test.domain.local   1           0
cmha                                             -INFINITY  aic-controller-58055.test.domain.local   1           0
sysinfo_aic-controller-12993.test.domain.local   INFINITY   aic-controller-12993.test.domain.local   0           0
sysinfo_aic-controller-50186.test.domain.local   -INFINITY  aic-controller-50186.test.domain.local   0           0
sysinfo_aic-controller-58055.test.domain.local   INFINITY   aic-controller-58055.test.domain.local   0           0
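(scoreshow.sh is just a small helper; as far as I can tell, the same
allocation scores can also be dumped directly from the live CIB with:

    # show allocation scores computed from the current cluster state
    crm_simulate -sL
)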
The problem starts when one node goes offline (aic-controller-50186): the
cmha resource gets stuck in the Stopped state.
Here is the showscores output:
Resource   Score      Node                                     Stickiness  #Fail  Migration-Threshold
cmha       -INFINITY  aic-controller-12993.test.domain.local   1           0
cmha       -INFINITY  aic-controller-50186.test.domain.local   1           0
cmha       -INFINITY  aic-controller-58055.test.domain.local   1           0
Even though it has target-role=Started, Pacemaker skips this resource, and
in the logs I see:
pengine: info: native_print: cmha (ocf::heartbeat:cmha): Stopped
pengine: info: native_color: Resource cmha cannot run anywhere
pengine: info: LogActions: Leave cmha (Stopped)
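To get more detail on why the pengine decides "cannot run anywhere" I also
replay the live state with crm_simulate (verbosity flags as I understand
them; -V can be repeated for more output):

    # recompute the transition from the live CIB, showing scores and planned actions
    crm_simulate -sL -VV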
To recover the cmha resource I need to run either:
1) crm resource cleanup cmha
2) crm resource reprobe
After either of these commands the resource is picked up by Pacemaker again
and I see valid scores (equivalent low-level crm_resource calls are shown
after the table below):
Resource   Score      Node                                     Stickiness  #Fail  Migration-Threshold
cmha       100        aic-controller-58055.test.domain.local   1           0      3
cmha       101        aic-controller-12993.test.domain.local   1           0      3
cmha       -INFINITY  aic-controller-50186.test.domain.local   1           0      3
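For what it's worth, the lower-level calls that (as far as I know)
correspond to the two crmsh commands above behave the same way:

    # clear cmha's failure history and re-detect its current state
    crm_resource --cleanup --resource cmha

    # force a reprobe of all resources on all nodes
    crm_resource --reprobe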
So the questions are: why doesn't the cluster recheck handle this, and
should it trigger a reprobe? How do I make the migration work, or what did
I miss in the configuration that prevents it?
corosync 2.3.4
pacemaker 1.1.14