[ClusterLabs] VirtualDomain & parallel shutdown
Ken Gaillot
kgaillot at redhat.com
Mon Nov 26 15:41:47 EST 2018
On Mon, 2018-11-26 at 14:24 +0200, Klecho wrote:
> Hi again,
>
> Just made one simple "parallel shutdown" test with a strange result,
> confirming the problem I've described.
>
> Created a few dummy resources, each of them taking 60s to stop. No
> constraints at all. After that issued "stop" to all of them, one by
> one.
>
> A stop operation wasn't attempted for any of the rest until the first
> resource had stopped.
>
> When the first resource stopped, all the rest stopped at the same
> moment, 120s after the stop commands were issued.
>
> This confirms that if many resources (VMs) need to be stopped and the
> first one starts some update (and a big stop timeout is set), no stop
> attempt will be made for the rest at all until the first one is done.
>
> Why is this so and is there a way to avoid it?
It has to do with pacemaker's concept of a "transition".
When an interesting event happens (like your first stop), pacemaker
calculates what actions need to be taken and then does them. A
transition may be interrupted between actions by a new event, but any
action already begun must complete before a new transition can begin.
What happened here is that when you stopped the first resource, a
transition was created with that one stop, and that stop was initiated.
When the later stops came in, they would cause a new transition, but
the first stop had to complete before the new transition could begin.
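As an aside, this serialization is easy to reproduce. Here's a rough
sketch with pcs, assuming the ocf:pacemaker:Dummy agent's op_sleep
parameter is available to make each operation take about 60s (resource
names are placeholders):

    # two resources whose operations (including stop) take ~60s
    pcs resource create slow1 ocf:pacemaker:Dummy op_sleep=60 \
        op start timeout=90s stop timeout=90s monitor interval=60s timeout=90s
    pcs resource create slow2 ocf:pacemaker:Dummy op_sleep=60 \
        op start timeout=90s stop timeout=90s monitor interval=60s timeout=90s

    # the first disable starts a transition containing only that stop
    pcs resource disable slow1
    # this one is recorded, but its stop isn't attempted until the
    # transition above completes
    pcs resource disable slow2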
There are a few ways around this:
* Shutdown will stop all resources on its own, so you could skip the
stopping altogether (see the example after this list).
* If you prefer to ensure all the resources stop successfully before
you start the shutdown, you could batch all the "stop" changes into one
file and apply that to the config (see the sketch after this list). A
stop command sets the resource's target-role meta-attribute to Stopped.
Normally, this is applied directly to the live configuration, so it
takes effect immediately. However, crm and pcs both offer ways to batch
commands in a file, then apply it all at once.
* Or, you could set the node(s) to standby mode as a transient
attribute (using attrd_updater; see the sketch after this list). That
would cause all resources to move off those nodes (and stop if there
are no nodes remaining). Transient node attributes are erased every
time a node leaves the cluster, so it would only have effect until
shutdown; when the node rejoined, it would be back in regular mode.
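For the first option, a sketch assuming a pcs-managed cluster (crm has
equivalents):

    # a full cluster shutdown stops all resources on its own
    pcs cluster stop --all      # all nodes
    pcs cluster stop            # or just the local node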
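For the batch-file approach, a rough sketch with pcs (the file and
resource names are placeholders):

    # take an offline copy of the CIB
    pcs cluster cib stop-all.xml

    # each disable sets target-role=Stopped in the file only
    pcs -f stop-all.xml resource disable vm1
    pcs -f stop-all.xml resource disable vm2
    pcs -f stop-all.xml resource disable vm3

    # push everything at once, so a single transition stops them all
    pcs cluster cib-push stop-all.xml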
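For the standby approach, a sketch (the node name is a placeholder):

    # put the node in standby via a transient attribute; it is cleared
    # automatically when the node leaves the cluster
    attrd_updater --name standby --update on --node node1

    # it can also be removed by hand once the node is back
    attrd_updater --name standby --delete --node node1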
>
> On 11/20/18 12:40 PM, Klechomir wrote:
> > Hi list,
> > Bumped onto the following issue lately:
> >
> > When multiple VMs are given shutdown one after another and the
> > shutdown of the first VM takes long, the others aren't shut down at
> > all until the first one stops.
> >
> > "batch-limit" doesn't seem to affect this.
> > Any suggestions why this could happen?
> >
> > Best regards,
> > Klecho
> > _______________________________________________
> > Users mailing list: Users at clusterlabs.org
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
>
>
--
Ken Gaillot <kgaillot at redhat.com>