[ClusterLabs] VirtualDomain & parallel shutdown

Tue Nov 27 05:29:55 EST 2018

Hi Ken,

Big thanks for the answer, but I don't see a solution among your 
workarounds for the following simple case:

I have several VMs (VirtualDomain RA) and just want to stop a few of 
them, not all.

While the first VM is shutting down (target-role=Stopped), it starts 
a slow update that can take hours (the stop timeout is set very high 
to allow for exactly this update case).

During these hours of update, no other VM can be stopped at all.

If this can't be avoided, it could be quite a big flaw, because it 
blocks basic functionality.

Best regards,

On 11/26/18 10:41 PM, Ken Gaillot wrote:
> On Mon, 2018-11-26 at 14:24 +0200, Klecho wrote:
>> Hi again,
>>
>> Just made one simple "parallel shutdown" test with a strange result,
>> confirming the problem I've described.
>>
>> Created a few dummy resources, each of them taking 60s to stop. No
>> constraints at all. After that issued "stop" to all of them, one by
>> one.
>>
>> Stop operation wasn't attempted for any of the rest until the first
>> resource stopped.
>>
>> When the first resource stopped, all the rest stopped at the same
>> moment, 120s after the stop commands were issued.
>>
>> This confirms that if many resources (VMs) need to be stopped and the
>> first one starts some update (with a big stop timeout set), no stop
>> attempt is made for the rest until the first has finished stopping.
>>
>> Why is this so and is there a way to avoid it?
> It has to do with pacemaker's concept of a "transition".
>
> When an interesting event happens (like your first stop), pacemaker
> calculates what actions need to be taken and then does them. A
> transition may be interrupted between actions by a new event, but any
> action already begun must complete before a new transition can begin.
>
> What happened here is that when you stopped the first resource, a
> transition was created with that one stop, and that stop was initiated.
> When the later stops came in, they would cause a new transition, but
> that first stop has to complete before that transition can begin.
>
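The timing reported in the test can be reproduced with a toy model of
the behaviour described above. This is only a sketch under a stated
assumption: a new transition starts once every action of the current
one has finished, and actions within one transition run in parallel.
It is not Pacemaker's scheduler; StopRequest, simulate and the dummy
resource names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class StopRequest:
    resource: str
    issued_at: float  # seconds after the first stop command
    duration: float   # how long the stop action itself takes


def simulate(requests):
    """Return {resource: finish_time} under the simplified model.

    Assumption: requests that arrive while a transition is running
    wait until all of its actions are done, then run together.
    """
    pending = sorted(requests, key=lambda r: r.issued_at)
    finished = {}
    now = 0.0
    while pending:
        # Next transition starts when the previous one is done
        # and at least one request is waiting.
        now = max(now, pending[0].issued_at)
        batch = [r for r in pending if r.issued_at <= now]
        pending = [r for r in pending if r.issued_at > now]
        for r in batch:
            finished[r.resource] = now + r.duration
        # The transition ends when its slowest action ends.
        now = max(finished[r.resource] for r in batch)
    return finished


if __name__ == "__main__":
    one_by_one = [StopRequest(f"dummy{i + 1}", float(i), 60.0) for i in range(3)]
    all_at_once = [StopRequest(f"dummy{i + 1}", 0.0, 60.0) for i in range(3)]
    print("one by one: ", simulate(one_by_one))
    print("all at once:", simulate(all_at_once))
```

With three 60s stops issued a second apart, the model finishes the
first at 60s and the other two together at 120s, matching the test.
When all three stops arrive together, all of them finish at 60s.
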
> There are a few ways around this:
>
> * Shutdown will stop all resources on its own, so you could skip the
> stopping altogether.
>
> * If you prefer to ensure all the resources stop successfully before
> you start the shutdown, you could batch all the "stop" changes into one
> file and apply that to the config. A stop command sets the resource's
> target-role meta-attribute to Stopped. Normally, this is applied
> directly to the live configuration, so it takes effect immediately.
> However crm and pcs both offer ways to batch commands in a file, then
> apply it all at once.
>
> * Or, you could set the node(s) to standby mode as a transient
> attribute (using attrd_updater). That would cause all resources to move
> off those nodes (and stop if there are no nodes remaining). Transient
> node attributes are erased every time a node leaves the cluster, so it
> would only have effect until shutdown; when the node rejoined, it would
> be in regular mode.
>
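The batching approach from the second bullet could look roughly like
the sketch below for pcs. It only builds and prints the commands by
default; nothing is run unless dry_run=False. The resource names
(vm1..vm3), the file name stops.xml and the helper functions are
hypothetical, and the pcs subcommands used (cluster cib, -f with
resource disable, cluster cib-push) should be checked against the
installed pcs version before relying on them.

```python
import shlex
import subprocess


def batch_stop_commands(resources, cib_file="stops.xml"):
    """Build pcs commands that disable several resources in one CIB push."""
    # Save a copy of the current CIB to a file.
    cmds = [["pcs", "cluster", "cib", cib_file]]
    # Apply each "disable" (target-role=Stopped) to the file, not the live CIB.
    for res in resources:
        cmds.append(["pcs", "-f", cib_file, "resource", "disable", res])
    # Push the file back, so all stops arrive as a single change.
    cmds.append(["pcs", "cluster", "cib-push", cib_file])
    return cmds


def run(cmds, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in cmds:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(batch_stop_commands(["vm1", "vm2", "vm3"]))
```

Because all target-role changes reach the cluster in one update, the
stops should be scheduled in the same transition. Note that this only
helps when the set of resources to stop is known up front; it does not
help once a slow stop is already in progress.
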
>> On 11/20/18 12:40 PM, Klechomir wrote:
>>> Hi list,
>>> I bumped into the following issue lately:
>>>
>>> When multiple VMs are told to shut down right one after another and
>>> the shutdown of the first VM takes long, the others aren't shut down
>>> at all until the first has stopped.
>>>
>>> "batch-limit" doesn't seem to affect this.
>>> Any suggestions why this could happen?
>>>
>>> Best regards,
>>> Klecho
>>
-- 
Klecho