[ClusterLabs] crm_resource --wait

Mon Oct 9 22:46:19 UTC 2017

On Tue, 2017-10-10 at 07:47 +1000, Leon Steffens wrote:
> > > Pending actions:
> > > Action 40: sv_fencer_monitor_60000 on brilxvm44
> > > Action 39: sv_fencer_start_0 on brilxvm44
> > > Action 38: sv_fencer_stop_0 on brilxvm43
> > > Error performing operation: Timer expired
> > >
> > > It looks like it's waiting for the sv_fencer fencing agent to start
> > > on brilxvm44, even though the current transition did not include
> > > that move.
> > 
> > crm_resource --wait doesn't wait for a specific transition to
> > complete; it waits until no further actions are needed.
> > 
> > That is one of its limitations, that if something keeps provoking a
> > new transition, it will never complete except by timeout.
> 
> Thanks Ken,
> 
> I understand that crm_resource --wait will wait until no further
> actions are needed, but I'm not quite sure of:

Ah, I missed that the property wasn't set until after the wait timed
out. I thought it was being set by something else during the wait.

> 1) what is triggering this movement of the sv_fencer resource from
> vm43 to vm44

Putting vm45 in standby results in moving its resources elsewhere, and
that can mean the ideal placement of resources already running
elsewhere changes (due to balancing, constraints, etc.).

I'm guessing sv_fencer moves due to the default placement strategy,
which is to spread the number of resources out evenly across available
nodes.

What is not always obvious about zero stickiness is that Pacemaker
treats the resource as if it were not running at all when deciding
where everything goes. The fact that sv_fencer is running on vm43 isn't
taken into account at all when deciding whether the other resources are
placed on vm43.
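
To make the stickiness point concrete, here is a toy model in Python. It
is only a sketch and is not Pacemaker's actual allocation code: the
place() function, the processing order, and the resources rsc_a, rsc_b
and rsc_c are all made up for illustration. Each resource simply goes to
the node with the best (score, fewest resources so far) pair, where the
score is the stickiness value on the node it currently runs on and 0
everywhere else.

```python
def place(resources, nodes, current, stickiness=0):
    """Toy placement model; NOT Pacemaker's real algorithm."""
    load = {node: 0 for node in nodes}
    placement = {}
    for rsc in resources:
        def rank(node):
            # The current location only counts if stickiness is nonzero.
            score = stickiness if current.get(rsc) == node else 0
            return (score, -load[node])
        best = max(nodes, key=rank)  # ties go to the earliest node listed
        placement[rsc] = best
        load[best] += 1
    return placement

# Hypothetical layout before vm45 goes into standby.
current = {"rsc_a": "vm43", "rsc_b": "vm44",
           "rsc_c": "vm45", "sv_fencer": "vm43"}
resources = ["rsc_a", "rsc_b", "rsc_c", "sv_fencer"]
active_nodes = ["vm43", "vm44"]  # vm45 is in standby

moved = place(resources, active_nodes, current, stickiness=0)
stayed = place(resources, active_nodes, current, stickiness=1)
print(moved["sv_fencer"])   # vm44
print(stayed["sv_fencer"])  # vm43
```

In this sketch, with zero stickiness the resource from vm45 lands on
vm43 first, and sv_fencer then goes to the emptier vm44 even though it
was already running on vm43. With any positive stickiness it stays put.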

> 2) why is the action only triggered on a CIB update after the wait
> has timed out (setting of node property), and not while crm_resource
> --wait is waiting.

This is a mystery, especially since the initial transition shows
sv_fencer restarting on vm43 and not moving. Are you sure there were no
transitions after that initial one, before the wait timed out?

The monitor is waiting on the start, and the start is waiting on the
stop, so the question is why the stop isn't able to proceed.
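
The ordering among the three pending actions can be written down as a
small chain. The Python below is just an illustrative sketch: the
waits_on mapping and the root_blocker() helper are hypothetical names,
and the edges only restate the ordering described above using the action
names from the quoted output.

```python
# Each pending action maps to the action it is waiting on.
waits_on = {
    "sv_fencer_monitor_60000 on brilxvm44": "sv_fencer_start_0 on brilxvm44",
    "sv_fencer_start_0 on brilxvm44": "sv_fencer_stop_0 on brilxvm43",
}

def root_blocker(action, waits_on):
    """Follow the chain to the action that is not waiting on anything."""
    seen = set()
    while action in waits_on:
        if action in seen:
            raise ValueError("cycle in ordering")
        seen.add(action)
        action = waits_on[action]
    return action

blocker = root_blocker("sv_fencer_monitor_60000 on brilxvm44", waits_on)
print(blocker)  # sv_fencer_stop_0 on brilxvm43
```

Everything traces back to the stop on brilxvm43, so that is the action
to look for in the logs.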

> 3) why is crm_resource --wait waiting for this action if it's only
> triggered by the setting of a node property after the wait has timed
> out? (i.e. why is this action not triggered if the cluster is aware
> of the action?)

Pacemaker isn't psychic as far as I know :-) so the move it's waiting
on can't be necessitated by the later property change. Something else
(I'm guessing balancing) is causing it to move before then, but the
stop can't proceed for some reason.

> 
> The sequence of events is:
> 
> 1) Put node 3 in standby
> 2) Wait until no further actions are needed
> 3) Set property on node 1.
> 
> I'll see if I can reproduce this in an independent test and then try
> it with a later version of Pacemaker.
> 
> Regards,
> Leon