[ClusterLabs] crm_resource --wait
Ken Gaillot
kgaillot at redhat.com
Mon Oct 9 17:23:44 CEST 2017
On Mon, 2017-10-09 at 16:37 +1000, Leon Steffens wrote:
> Hi all,
>
> We have a use case where we want to place a node into standby and
> then wait for all the resources to move off the node (and be started
> on other nodes) before continuing.
>
> In order to do this we call:
> $ pcs cluster standby brilxvm45
> $ crm_resource --wait --timeout 300
>
> This works most of the time, but in one of our test environments we
> are hitting a problem:
>
> When we put the node in standby, the reported cluster transition is:
>
> $ /usr/sbin/crm_simulate -x pe-input-3595.bz2 -S
>
> Using the original execution date of: 2017-10-08 16:58:05Z
> ...
> Transition Summary:
> * Restart sv_fencer (Started brilxvm43)
> * Stop sv.svtest.aa.sv.monitor:1 (brilxvm45)
> * Move sv.svtest.aa.26.partition (Started brilxvm45 -> brilxvm43)
> * Move sv.svtest.aa.27.partition (Started brilxvm45 -> brilxvm44)
> * Move sv.svtest.aa.28.partition (Started brilxvm45 -> brilxvm43)
>
> We expect crm_resource --wait to return once sv_fencer (a fencing
> device) has been restarted (not sure why it's being restarted), and
> the 3 partition resources have been moved.
>
> But crm_resource actually times out after 300 seconds with the
> following error:
>
> Pending actions:
> Action 40: sv_fencer_monitor_60000 on brilxvm44
> Action 39: sv_fencer_start_0 on brilxvm44
> Action 38: sv_fencer_stop_0 on brilxvm43
> Error performing operation: Timer expired
>
> It looks like it's waiting for the sv_fencer fencing agent to start
> on brilxvm44, even though the current transition did not include that
> move.

crm_resource --wait doesn't wait for a specific transition to complete;
it waits until no further actions are needed.

One of its limitations is that if something keeps provoking a new
transition, it will never return except by timing out.
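
As a rough illustration of that behaviour, here is a minimal sketch of a
wrapper around the two commands from the original message. The function
name standby_and_wait and the messages it prints are made up for this
example and are not part of pcs or Pacemaker. It only makes the point that
a non-zero exit from crm_resource --wait means "the cluster still had
actions pending when the timeout expired", not "the standby transition
failed".

```shell
#!/bin/sh
# Hypothetical helper (not part of pcs or Pacemaker): put a node into
# standby, then wait until the cluster reports nothing left to do.
standby_and_wait() {
    node=$1
    timeout=${2:-300}   # seconds, passed through to crm_resource --timeout

    # Same command as in the original message.
    pcs cluster standby "$node" || return 1

    # --wait is not tied to the transition caused by the standby above;
    # it returns success only once no further actions are needed.
    if crm_resource --wait --timeout "$timeout"; then
        echo "cluster settled: no further actions pending"
    else
        rc=$?
        echo "cluster did not settle within ${timeout}s (exit status $rc)" >&2
        return "$rc"
    fi
}
```

It would be called as "standby_and_wait brilxvm45 300". If some other
change keeps scheduling new actions, the second branch is the only way the
call ever returns.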
>
> After the crm_resource --wait has timed out, we set a property on a
> different node (brilxvm43). This seems to trigger a new transition
> to move sv_fencer to brilxvm44:
>
> $ /usr/sbin/crm_simulate -x pe-input-3596.bz2 -S
> Using the original execution date of: 2017-10-08 17:03:27Z
>
> Transition Summary:
> * Move sv_fencer (Started brilxvm43 -> brilxvm44)
>
> And from the corosync.log it looks like this transition triggers
> actions 38 - 40 (the ones crm_resource --wait waited for).
>
> So it looks like the crm_resource --wait knows about the transition
> to move the sv_fencer resource, but the subsequent setting of the
> node property is the one that actually triggers it (which is too
> late as it gets executed after the wait).
>
> I have attached the DC's corosync.log for the applicable time period
> (timezone is UTC+10). (The last few lines in the corosync.log, the
> interruption of transition 141, are due to a subsequent standby of
> brilxvm43.)
>
> A possible workaround I thought of was to make the sv_fencer resource
> slightly sticky (all the other resources are), but I'm not sure if
> this will just hide the problem for this specific scenario.
>
> We are using Pacemaker 1.1.15 on RedHat 6.9.
>
> Regards,
> Leon