[ClusterLabs] Antw: [EXT] Re: Disable all resources in a group if one or more of them fail and are unable to reactivate

Fri Jan 29 05:22:02 EST 2021

>>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 28.01.2021 at 18:30 in
message <db12df26-6cc4-bad2-8bf5-8ee3aad87533 at gmail.com>:
> 27.01.2021 22:03, Ken Gaillot wrote:
>> 
>> With a group, later members depend on earlier members. If an earlier
>> member can't run, then no members after it can run.
>> 
>> However we can't make the dependency go in both directions. If an
>> earlier member can't run unless a later member is active, and vice
>> versa, then how can anything be started?
>> 
>> By default, Pacemaker tries to recover failed resources on the same
>> node, up to its migration-threshold (which defaults to a million
>> times). Once a group member reaches its migration-threshold, Pacemaker
>> will move the entire group to another node if one is available. However
>> if no node is available for the failed member, then it will just remain
>> stopped (along with any later members in the group), and the earlier
>> members will stay active where they are.
>> 
>> I don't think there's any way to prevent earlier members from running
>> if a later member has no available node.
>> 
> 
> All other HA managers I am aware of use a collection of resources (often
> called an "application") as the scheduling unit. All resources in one
> collection are automatically activated on the same node (they may, of
> course, have ordering dependencies). If any required resource in the
> collection fails, the partially active collection is cleaned up: all
> resources activated so far are deactivated. This is indeed virtually
> impossible to express in pacemaker. The only way I can think of is to
> artificially restrict the management layer to top-level resources, but this
> also won't work for stopping a group of resources (where "group" is used
> generically, not in the narrow pacemaker sense) for the reasons you explained.
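
To make the contrast in the quoted text concrete, here is a minimal Python
sketch of the two behaviours. It is only a toy model of the semantics described
above, not Pacemaker code: the function names, the member names ("ip", "fs",
"db") and the can_run mapping are all assumptions made up for illustration.

```python
def pacemaker_group_active(members, can_run):
    """Toy model of the group behaviour described above (hypothetical helper).

    Each member depends on the members before it, so activation stops at the
    first member that cannot run; earlier members stay active.
    """
    active = []
    for member in members:
        if not can_run[member]:
            break  # this member and every later one remain stopped
        active.append(member)
    return active


def application_active(members, can_run):
    """Toy model of the all-or-nothing "application" unit (hypothetical helper).

    If any required member cannot run, the partially active collection is
    cleaned up and nothing stays active.
    """
    if all(can_run[member] for member in members):
        return list(members)
    return []


# Illustrative data only: the last member has no node available.
members = ["ip", "fs", "db"]
can_run = {"ip": True, "fs": True, "db": False}

print(pacemaker_group_active(members, can_run))  # ['ip', 'fs']
print(application_active(members, can_run))      # []
```

With the last member unable to run, the group model leaves the earlier members
active, while the all-or-nothing model deactivates everything.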

I just wonder: what about adding op timeouts to a group?
If the group fails to start or stop within the specified time, consider the
whole group as failed:
stop after a failed start, and fence after a failed stop.
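
As a rough illustration of that idea, the decision could look like the Python
sketch below. This is purely hypothetical: it does not describe an existing
Pacemaker feature, and the function name, parameters and action labels are
assumptions invented for the example.

```python
def group_recovery_action(operation, elapsed_s, timeout_s, succeeded):
    """Hypothetical policy for a group-level operation timeout.

    operation: "start" or "stop", applied to the group as a whole.
    Returns the recovery action this sketch would take.
    """
    if operation not in ("start", "stop"):
        raise ValueError(f"unsupported operation: {operation!r}")
    if succeeded and elapsed_s <= timeout_s:
        return "none"  # the whole group finished in time
    if operation == "start":
        return "stop-group"  # failed start: stop whatever was started
    return "fence"  # failed stop: escalate to fencing


print(group_recovery_action("start", elapsed_s=130, timeout_s=120, succeeded=False))
print(group_recovery_action("stop", elapsed_s=30, timeout_s=60, succeeded=False))
```

The first call models a start that exceeded its timeout, the second a stop
that failed outright.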

Regards,
Ulrich
