[ClusterLabs] Antw: Re: Master/slave failover does not work as expected
Ulrich.Windl at rz.uni-regensburg.de
Tue Aug 13 03:44:52 EDT 2019
>>> Harvey Shepherd <Harvey.Shepherd at Aviatnet.com> wrote on 12.08.2019 at 23:38
in message <ec767e3d-0cde-42c2-a8de-72ffce859e2f at email.android.com>:
> I've been experiencing exactly the same issue. Pacemaker prioritises
> restarting the failed resource over maintaining a master instance. In my case
> I used crm_simulate to analyse the actions planned and taken by pacemaker
> during resource recovery. It showed that the system did plan to failover the
> master instance, but that action was near the bottom of the list. Higher
> priority was given to restarting the failed instance; consequently, once that
> restart had completed, it was easier to simply promote the same instance than
> to fail over.
That's interesting: perhaps it is usually faster to restart a failed (master) process than to promote a slave to master, possibly demote the old master to slave, and so on.
More obviously, though, while Pacemaker supports utilization (a kind of cost) for resources, there is none for operations (AFAIK). If one could configure "operation costs" (perhaps via rules), the cluster could prefer the transition with the lowest total cost. Unfortunately, that would make things more complicated.
I could even imagine that if you set the cost of "stop" to infinity, the cluster would not even try to stop the resource, but would fence the node instead...
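For comparison, resource-level utilization already exists in Pacemaker today. A minimal sketch in crmsh syntax, assuming hypothetical node and resource names ("node1", "db0") and an IPaddr2 resource purely for illustration:

```shell
# Declare node capacity (hypothetical values):
crm configure node node1 utilization memory=4096 cpu=4

# Declare how much of that capacity a resource consumes:
crm configure primitive db0 ocf:heartbeat:IPaddr2 \
    params ip=192.168.1.10 \
    utilization memory=1024 cpu=1

# Utilization is only considered once a placement strategy is set:
crm configure property placement-strategy=balanced
```

Operations, by contrast, have no such attribute, which is what the "operation costs" idea above would add.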
> This particular behaviour caused me a lot of headaches. In the end I had to
> use a workaround by setting max failures for the resource to 1, and clearing
> the failure after 10 seconds. This forces a failover, but there is then a
> window (longer than 10 seconds, due to the cluster recheck timer that is used
> to clear failures) during which the resource cannot fail back if a second
> failure happens to occur. It also means that no slave is running during this
> time, which causes a performance hit in my case.
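The workaround described above can be sketched in crmsh syntax as follows. This is an assumption-laden illustration, not the poster's actual configuration: the resource name "my-stateful" and the use of the ocf:pacemaker:Stateful agent are hypothetical.

```shell
# Fail over after a single failure, and clear the failure after 10s:
crm configure primitive my-stateful ocf:pacemaker:Stateful \
    meta migration-threshold=1 failure-timeout=10s

# Run it as a master/slave (promotable) resource:
crm configure ms ms-my-stateful my-stateful \
    meta master-max=1 clone-max=2 notify=true

# Caveat from the post above: failure-timeout is only evaluated at
# cluster-recheck-interval, so the no-failback window can be much
# longer than 10 seconds in practice.
crm configure property cluster-recheck-interval=60s
```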