[ClusterLabs] crm_resource --wait

Ken Gaillot kgaillot at redhat.com
Tue Oct 10 14:22:44 UTC 2017


On Tue, 2017-10-10 at 15:19 +1000, Leon Steffens wrote:
> Hi Ken,
> 
> I managed to reproduce this on a simplified version of the cluster,
> and on Pacemaker 1.1.15, 1.1.16, and 1.1.18-rc1.
> 
> The steps to create the cluster are:
> 
> pcs property set stonith-enabled=false
> pcs property set placement-strategy=balanced
> 
> pcs node utilization vm1 cpu=100
> pcs node utilization vm2 cpu=100
> pcs node utilization vm3 cpu=100
> 
> pcs property set maintenance-mode=true
> 
> pcs resource create sv-fencer ocf:pacemaker:Dummy
> 
> pcs resource create sv ocf:pacemaker:Dummy clone notify=false
> pcs resource create std ocf:pacemaker:Dummy meta resource-stickiness=100
> 
> pcs resource create partition1 ocf:pacemaker:Dummy meta resource-stickiness=100
> pcs resource create partition2 ocf:pacemaker:Dummy meta resource-stickiness=100
> pcs resource create partition3 ocf:pacemaker:Dummy meta resource-stickiness=100
> 
> pcs resource utilization partition1 cpu=5
> pcs resource utilization partition2 cpu=5
> pcs resource utilization partition3 cpu=5
> 
> pcs constraint colocation add std with sv-clone INFINITY
> pcs constraint colocation add partition1 with sv-clone INFINITY
> pcs constraint colocation add partition2 with sv-clone INFINITY
> pcs constraint colocation add partition3 with sv-clone INFINITY
> 
> pcs property set maintenance-mode=false
> 
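(Side note for anyone reproducing this: a convenient way to see how
the utilization and colocation settings play out in placement is
crm_simulate, e.g.

$ crm_simulate --live-check --show-scores --show-utilization

which prints the allocation scores along with each node's remaining
capacity. Exact option availability may vary with the Pacemaker
version.)
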
> I can then reproduce the issues in the following way:
> 
> $ pcs resource
>  sv-fencer      (ocf::pacemaker:Dummy): Started vm1
>  Clone Set: sv-clone [sv]
>      Started: [ vm1 vm2 vm3 ]
>  std    (ocf::pacemaker:Dummy): Started vm2
>  partition1     (ocf::pacemaker:Dummy): Started vm3
>  partition2     (ocf::pacemaker:Dummy): Started vm1
>  partition3     (ocf::pacemaker:Dummy): Started vm2
> 
> $ pcs cluster standby vm3
> 
> # Check that all resources have moved off vm3
> $ pcs resource
>  sv-fencer      (ocf::pacemaker:Dummy): Started vm1
>  Clone Set: sv-clone [sv]
>      Started: [ vm1 vm2 ]
>      Stopped: [ vm3 ]
>  std    (ocf::pacemaker:Dummy): Started vm2
>  partition1     (ocf::pacemaker:Dummy): Started vm1
>  partition2     (ocf::pacemaker:Dummy): Started vm1
>  partition3     (ocf::pacemaker:Dummy): Started vm2

Thanks for the detailed information; this should help me get to the
bottom of it. From the description, it sounds like a new transition
isn't being triggered when it should be.

Could you please attach the DC's pe-input file that is listed in the
logs after the standby step above? That would simplify analysis.
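
If it's not obvious which file that is: the pengine logs the file name
every time it runs, so grepping the DC's log for "pe-input" should
turn it up (log and pengine directory locations vary by distribution;
/var/lib/pacemaker/pengine/ is typical). You can also replay the file
locally to see what was scheduled, along the lines of:

$ grep pe-input /var/log/pacemaker.log | tail
$ crm_simulate --simulate --show-scores --xml-file /var/lib/pacemaker/pengine/pe-input-NN.bz2

substituting the real file name for pe-input-NN.bz2.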

> # Wait for any outstanding actions to complete.
> $ crm_resource --wait --timeout 300
> Pending actions:
>         Action 22: sv-fencer_monitor_10000      on vm2
>         Action 21: sv-fencer_start_0    on vm2
>         Action 20: sv-fencer_stop_0     on vm1
> Error performing operation: Timer expired
> 
> # Check the resources again - sv-fencer is still on vm1
> $ pcs resource
>  sv-fencer      (ocf::pacemaker:Dummy): Started vm1
>  Clone Set: sv-clone [sv]
>      Started: [ vm1 vm2 ]
>      Stopped: [ vm3 ]
>  std    (ocf::pacemaker:Dummy): Started vm2
>  partition1     (ocf::pacemaker:Dummy): Started vm1
>  partition2     (ocf::pacemaker:Dummy): Started vm1
>  partition3     (ocf::pacemaker:Dummy): Started vm2
> 
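(For what it's worth, --wait works by re-running the scheduler against
the live CIB until no actions remain pending, so the same pending list
should be visible directly with:

$ crm_simulate --simulate --live-check

If that keeps showing the sv-fencer stop/start while the cluster does
nothing, it points to the crmd never initiating a transition for those
actions.)
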
> # Perform a random update to the CIB.
> $ pcs resource update std op monitor interval=20 timeout=20
> 
> # Check resource status again - sv-fencer has now moved to vm2 (the
> action crm_resource was waiting for)
> $ pcs resource
>  sv-fencer      (ocf::pacemaker:Dummy): Started vm2  <<<============
>  Clone Set: sv-clone [sv]
>      Started: [ vm1 vm2 ]
>      Stopped: [ vm3 ]
>  std    (ocf::pacemaker:Dummy): Started vm2
>  partition1     (ocf::pacemaker:Dummy): Started vm1
>  partition2     (ocf::pacemaker:Dummy): Started vm1
>  partition3     (ocf::pacemaker:Dummy): Started vm2
> 
> I do not get the problem if I:
> 1) remove the "std" resource; or
> 2) remove the co-location constraints; or
> 3) remove the utilization attributes for the partition resources.
> 
> In these cases the sv-fencer resource is happy to stay on vm1, and
> crm_resource --wait returns immediately.
> 
> It looks like the pcs cluster standby call is creating/registering
> the actions to move the sv-fencer resource to vm2, but they are not
> included in the cluster transition.  When the CIB is later updated
> by something else, the actions are included in that transition.
> 
> 
> Regards,
> Leon
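
That analysis sounds plausible. Until the root cause is fixed, anything
that triggers a new transition should unstick the pending actions;
rather than editing an operation definition, a cluster-wide reprobe is
a less intrusive nudge (untested against this particular case):

$ crm_resource --reprobe

which makes the cluster re-check resource state everywhere and
schedule a fresh transition.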
