[ClusterLabs] Antw: Re: Master/slave failover does not work as expected
Ulrich.Windl at rz.uni-regensburg.de
Tue Aug 13 03:44:52 EDT 2019
>>> Harvey Shepherd <Harvey.Shepherd at Aviatnet.com> wrote on 12.08.2019 at 23:38
in message <ec767e3d-0cde-42c2-a8de-72ffce859e2f at email.android.com>:
> I've been experiencing exactly the same issue. Pacemaker prioritises
> restarting the failed resource over maintaining a master instance. In my case
> I used crm_simulate to analyse the actions planned and taken by pacemaker
> during resource recovery. It showed that the system did plan to failover the
> master instance, but that action was near the bottom of the list. Higher
> priority was given to restarting the failed instance; consequently, once that
> restart had completed, it was easier to simply promote the same instance than
> to fail over.
That's interesting: perhaps it is usually faster to restart a failed (master) process than to promote a slave to master, possibly demote the old master to slave, and so on.
More obviously, though, while Pacemaker supports utilization (a kind of cost) for resources, there is none for operations (AFAIK). If one could configure "operation costs" (perhaps via rules), the cluster could prefer the transition with the lowest total cost. Unfortunately, that would make things more complicated.
I could even imagine that if you set the cost of "stop" to infinity, the cluster would not even try to stop the resource, but would fence the node instead...
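For comparison, resource-level utilization already exists in Pacemaker today. A minimal sketch in crmsh syntax, assuming hypothetical node and resource names ("node1", "db0") and an IPaddr2 resource purely for illustration:

```shell
# Declare node capacity (hypothetical values):
crm configure node node1 utilization memory=4096 cpu=4

# Declare how much of that capacity a resource consumes:
crm configure primitive db0 ocf:heartbeat:IPaddr2 \
    params ip=192.168.1.10 \
    utilization memory=1024 cpu=1

# Utilization is only considered once a placement strategy is set:
crm configure property placement-strategy=balanced
```

Operations, by contrast, have no such attribute, which is what the "operation costs" idea above would add.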
> This particular behaviour caused me a lot of headaches. In the end I had to
> use a workaround by setting max failures for the resource to 1, and clearing
> the failure after 10 seconds. This forces a failover, but there is then a
> window (longer than 10 seconds, due to the cluster recheck timer that is used
> to clear failures) during which the resource cannot fail back if a second
> failure happens to occur. It also means that no slave is running during this
> time, which causes a performance hit in my case.
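The workaround described above can be sketched in crmsh syntax as follows. This is an assumption-laden illustration, not the poster's actual configuration: the resource name "my-stateful" and the use of the ocf:pacemaker:Stateful agent are hypothetical.

```shell
# Fail over after a single failure, and clear the failure after 10s:
crm configure primitive my-stateful ocf:pacemaker:Stateful \
    meta migration-threshold=1 failure-timeout=10s

# Run it as a master/slave (promotable) resource:
crm configure ms ms-my-stateful my-stateful \
    meta master-max=1 clone-max=2 notify=true

# Caveat from the post above: failure-timeout is only evaluated at
# cluster-recheck-interval, so the no-failback window can be much
# longer than 10 seconds in practice.
crm configure property cluster-recheck-interval=60s
```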