[ClusterLabs] crm_resource --wait

Leon Steffens leon at steffensonline.com
Tue Oct 10 05:19:58 UTC 2017


Hi Ken,

I managed to reproduce this on a simplified version of the cluster, on
Pacemaker 1.1.15, 1.1.16, and 1.1.18-rc1.

The steps to create the cluster are:

pcs property set stonith-enabled=false
pcs property set placement-strategy=balanced

pcs node utilization vm1 cpu=100
pcs node utilization vm2 cpu=100
pcs node utilization vm3 cpu=100

pcs property set maintenance-mode=true

pcs resource create sv-fencer ocf:pacemaker:Dummy

pcs resource create sv ocf:pacemaker:Dummy clone notify=false
pcs resource create std ocf:pacemaker:Dummy meta resource-stickiness=100

pcs resource create partition1 ocf:pacemaker:Dummy meta resource-stickiness=100
pcs resource create partition2 ocf:pacemaker:Dummy meta resource-stickiness=100
pcs resource create partition3 ocf:pacemaker:Dummy meta resource-stickiness=100

pcs resource utilization partition1 cpu=5
pcs resource utilization partition2 cpu=5
pcs resource utilization partition3 cpu=5

pcs constraint colocation add std with sv-clone INFINITY
pcs constraint colocation add partition1 with sv-clone INFINITY
pcs constraint colocation add partition2 with sv-clone INFINITY
pcs constraint colocation add partition3 with sv-clone INFINITY

pcs property set maintenance-mode=false


I can then reproduce the issues in the following way:

$ pcs resource
 sv-fencer      (ocf::pacemaker:Dummy): Started vm1
 Clone Set: sv-clone [sv]
     Started: [ vm1 vm2 vm3 ]
 std    (ocf::pacemaker:Dummy): Started vm2
 partition1     (ocf::pacemaker:Dummy): Started vm3
 partition2     (ocf::pacemaker:Dummy): Started vm1
 partition3     (ocf::pacemaker:Dummy): Started vm2

$ pcs cluster standby vm3

# Check that all resources have moved off vm3
$ pcs resource
 sv-fencer      (ocf::pacemaker:Dummy): Started vm1
 Clone Set: sv-clone [sv]
     Started: [ vm1 vm2 ]
     Stopped: [ vm3 ]
 std    (ocf::pacemaker:Dummy): Started vm2
 partition1     (ocf::pacemaker:Dummy): Started vm1
 partition2     (ocf::pacemaker:Dummy): Started vm1
 partition3     (ocf::pacemaker:Dummy): Started vm2

# Wait for any outstanding actions to complete.
$ crm_resource --wait --timeout 300
Pending actions:
        Action 22: sv-fencer_monitor_10000      on vm2
        Action 21: sv-fencer_start_0    on vm2
        Action 20: sv-fencer_stop_0     on vm1
Error performing operation: Timer expired

# Check the resources again - sv-fencer is still on vm1
$ pcs resource
 sv-fencer      (ocf::pacemaker:Dummy): Started vm1
 Clone Set: sv-clone [sv]
     Started: [ vm1 vm2 ]
     Stopped: [ vm3 ]
 std    (ocf::pacemaker:Dummy): Started vm2
 partition1     (ocf::pacemaker:Dummy): Started vm1
 partition2     (ocf::pacemaker:Dummy): Started vm1
 partition3     (ocf::pacemaker:Dummy): Started vm2

# Perform a random update to the CIB.
$ pcs resource update std op monitor interval=20 timeout=20

# Check resource status again - sv-fencer has now moved to vm2 (the action
# crm_resource was waiting for)
$ pcs resource
 sv-fencer      (ocf::pacemaker:Dummy): Started vm2  <<<============
 Clone Set: sv-clone [sv]
     Started: [ vm1 vm2 ]
     Stopped: [ vm3 ]
 std    (ocf::pacemaker:Dummy): Started vm2
 partition1     (ocf::pacemaker:Dummy): Started vm1
 partition2     (ocf::pacemaker:Dummy): Started vm1
 partition3     (ocf::pacemaker:Dummy): Started vm2

I do not get the problem if I:
1) remove the "std" resource; or
2) remove the co-location constraints; or
3) remove the utilization attributes for the partition resources.

In these cases the sv-fencer resource is happy to stay on vm1, and
crm_resource --wait returns immediately.
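
For reference, one way to double-check what the scheduler still considers
outstanding (independently of crm_resource --wait) is to run crm_simulate
against the live CIB. This is only a diagnostic sketch, not part of the
original reproduction, and the output format differs a bit between Pacemaker
versions:

# Show the actions the scheduler would run right now, with allocation scores;
# in the failing case I would expect the sv-fencer stop/start/monitor actions
# to show up here as well.
$ crm_simulate --live-check --show-scores

# Optionally save the transition graph for closer inspection
# (the output file names here are just examples):
$ crm_simulate --live-check --save-graph /tmp/transition.xml --save-dotfile /tmp/transition.dot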

It looks like the pcs cluster standby call creates/registers the actions to
move the sv-fencer resource to vm2, but does not include them in the cluster
transition.  When the CIB is later updated by something else, the actions are
included in that transition.
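
If it helps, the transition that the standby actually produced could be
replayed from the pe-input file the policy engine saved at that point. The
file number below is just a placeholder (the real one comes from the DC's
logs), and the directory can vary by distribution:

# Replay a saved policy engine input and show the resulting cluster status.
# pe-input-123.bz2 is a hypothetical name; substitute the one logged by the DC.
$ crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-123.bz2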


Regards,
Leon