[ClusterLabs] crm_resource --wait

Ken Gaillot kgaillot at redhat.com
Mon Oct 9 15:23:44 UTC 2017


On Mon, 2017-10-09 at 16:37 +1000, Leon Steffens wrote:
> Hi all,
> 
> We have a use case where we want to place a node into standby and
> then wait for all the resources to move off the node (and be started
> on other nodes) before continuing.  
> 
> In order to do this we call:
> $ pcs cluster standby brilxvm45
> $ crm_resource --wait --timeout 300
> 
> This works most of the time, but in one of our test environments we
> are hitting a problem:
> 
> When we put the node in standby, the reported cluster transition is:
> 
> $  /usr/sbin/crm_simulate -x pe-input-3595.bz2 -S
> 
> Using the original execution date of: 2017-10-08 16:58:05Z
> ...
> Transition Summary:
>  * Restart sv_fencer    (Started brilxvm43)
>  * Stop    sv.svtest.aa.sv.monitor:1    (brilxvm45)
>  * Move    sv.svtest.aa.26.partition    (Started brilxvm45 ->
> brilxvm43)
>  * Move    sv.svtest.aa.27.partition    (Started brilxvm45 ->
> brilxvm44)
>  * Move    sv.svtest.aa.28.partition    (Started brilxvm45 ->
> brilxvm43)
> 
> We expect crm_resource --wait to return once sv_fencer (a fencing
> device) has been restarted (not sure why it's being restarted), and
> the 3 partition resources have been moved.
> 
> But crm_resource actually times out after 300 seconds with the
> following error:
> 
> Pending actions:
> Action 40: sv_fencer_monitor_60000 on brilxvm44
> Action 39: sv_fencer_start_0 on brilxvm44
> Action 38: sv_fencer_stop_0 on brilxvm43
> Error performing operation: Timer expired
> 
> It looks like it's waiting for the sv_fencer fencing agent to start
> on brilxvm44, even though the current transition did not include that
> move.  

crm_resource --wait doesn't wait for a specific transition to complete;
it waits until the cluster is idle, i.e. until no further actions are
needed.

That is one of its limitations: if something keeps provoking a new
transition, it never returns except by timing out.
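If what you actually need is "wait for this one node to drain" rather
than "wait for the whole cluster to go idle", one rough workaround is to
poll the status yourself. The sketch below is not a supported interface:
the wait_for_drain helper and the STATUS_CMD hook are hypothetical names
I'm using for illustration (STATUS_CMD exists so the function can be
exercised without a live cluster; on a real cluster it would be
something like "crm_mon -1"), and the grep pattern is a crude heuristic
against crm_mon's one-shot output that may need tuning:

```shell
#!/bin/sh
# Sketch: poll cluster status until no resource is reported as
# "Started" on the draining node, or until a timeout expires.
# STATUS_CMD defaults to crm_mon's one-shot mode but can be overridden
# for testing.
: "${STATUS_CMD:=crm_mon -1}"

wait_for_drain() {
    node=$1
    timeout=$2    # seconds
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # Crude heuristic: if nothing is reported "Started <node>" in
        # the status output, consider the node drained.
        if ! $STATUS_CMD | grep -q "Started[[:space:]]*$node"; then
            echo "drained"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "timeout"
    return 1
}
```

On a live cluster, usage might then look like
`pcs cluster standby brilxvm45 && wait_for_drain brilxvm45 300`.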

> 
> After the crm_resource --wait has timed out, we set a property on a
> different node (brilxvm43).  This seems to trigger a new transition
> to move sv_fencer to brilxvm44:
> 
> $  /usr/sbin/crm_simulate -x pe-input-3596.bz2 -S
> Using the original execution date of: 2017-10-08 17:03:27Z
> 
> Transition Summary:
>  * Move    sv_fencer    (Started brilxvm43 -> brilxvm44)
> 
> And from the corosync.log it looks like this transition triggers
> actions 38 - 40 (the ones crm_resource --wait waited for).
> 
> So it looks like crm_resource --wait knows about the transition that
> will move the sv_fencer resource, but the subsequent setting of the
> node property is what actually triggers it (which is too late, as it
> runs after the wait).
> 
> I have attached the DC's corosync.log for the applicable time period
> (timezone is UTC+10).  (The last few lines in the corosync log - the
> interruption of transition 141 - are because of a subsequent standby
> being done for brilxvm43.)
> 
> A possible workaround I thought of was to make the sv_fencer resource
> slightly sticky (all the other resources are), but I'm not sure if
> this will just hide the problem for this specific scenario.
> 
> We are using Pacemaker 1.1.15 on RedHat 6.9.
> 
> Regards,
> Leon
> 
> 
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



