[ClusterLabs] Why Do All The Services Go Down When Just One Fails?

Andrei Borzenkov arvidjaar at gmail.com
Wed Feb 20 13:12:58 EST 2019


18.02.2019 18:53, Ken Gaillot wrote:
> On Sun, 2019-02-17 at 20:33 +0300, Andrei Borzenkov wrote:
>> 17.02.2019 0:33, Andrei Borzenkov wrote:
>>> 17.02.2019 0:03, Eric Robinson wrote:
>>>> Here are the relevant corosync logs.
>>>>
>>>> It appears that the stop action for resource p_mysql_002 failed,
>>>> and that caused a cascading series of service changes. However, I
>>>> don't understand why, since no other resources are dependent on
>>>> p_mysql_002.
>>>>
>>>
>>> You have mandatory colocation constraints for each SQL resource with the
>>> VIP. That means that to move one SQL resource to another node, pacemaker
>>> must also move the VIP to the other node, which in turn means it needs to
>>> move all of the other dependent resources as well.
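For the archives, the constraint shape being described is roughly the
following (pcs shown only as an illustration; resource names are taken from
this thread, and the INFINITY score is an assumption based on "mandatory"):

    # Mandatory colocation: each SQL resource must run wherever the shared
    # VIP runs, so each SQL resource's placement preferences feed back into
    # where the VIP (and everything else colocated with it) may run.
    pcs constraint colocation add p_mysql_001 with p_vip_clust01 INFINITY
    pcs constraint colocation add p_mysql_002 with p_vip_clust01 INFINITY

With finite (advisory) scores instead of INFINITY, a single failed SQL
resource could not on its own pull the VIP, and everything colocated with it,
away from the node.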
>>> ...
>>>> Feb 16 14:06:39 [3912] 001db01a    pengine:  warning: check_migration_threshold:        Forcing p_mysql_002 away from 001db01a after 1000000 failures (max=1000000)
>>>
>>> ...
>>>> Feb 16 14:06:39 [3912] 001db01a    pengine:   notice: LogAction:         * Stop       p_vip_clust01     (                   001db01a )   blocked
>>>
>>> ...
>>>> Feb 16 14:06:39 [3912] 001db01a    pengine:   notice: LogAction:         * Stop       p_mysql_001       (                   001db01a )   due to colocation with p_vip_clust01
>>
>> There is apparently more to it. Note that the p_vip_clust01 stop operation
>> is "blocked". That is because a mandatory order constraint is symmetrical
>> by default, so to move the VIP, pacemaker first needs to stop it on the
>> current node; but before it can stop the VIP it needs to (be able to) stop
>> p_mysql_002; and it cannot do that, because by default, when "stop" fails
>> without stonith, the resource is blocked and no further actions on it are
>> possible - i.e. the resource can no longer be stopped (or even attempted).
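To make the symmetry concrete: an ordering like the one below (again only a
sketch with pcs, names from the thread) is symmetrical unless configured
otherwise, so "start the VIP before the SQL resource" also implies "stop the
SQL resource before stopping the VIP":

    # Symmetrical by default: the implied reverse ordering is what requires
    # p_mysql_002 to be stopped before p_vip_clust01 can be stopped or moved.
    pcs constraint order start p_vip_clust01 then start p_mysql_002

Once the stop of p_mysql_002 is blocked, the stop of p_vip_clust01 can never
be scheduled, which is exactly the "blocked" seen in the pengine log above.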
> 
> Correct, failed stop actions are special -- an on-fail policy of "stop"
> or "restart" requires a stop, so obviously they can't be applied to
> failed stops. As you mentioned, without fencing, on-fail defaults to
> "block" for stops, which should freeze the resource as it is.
> 
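As an aside, the stop failure policy can also be set explicitly per
operation; roughly like this (sketch only, the timeout value is made up, and
on-fail=fence is only meaningful once stonith is configured and enabled):

    # Make the default explicit: a failed stop blocks the resource in place.
    pcs resource update p_mysql_002 op stop timeout=90s on-fail=block

    # With working fencing, a failed stop can instead escalate to fencing
    # the node, which lets the remaining resources recover elsewhere:
    # pcs resource update p_mysql_002 op stop timeout=90s on-fail=fence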
>> I still consider this rather questionable behavior. I tried to reproduce
>> it and I see the same.
>>
>> 1. After this happens, resource p_mysql_002 has target=Stopped in the CIB.
>> Why, oh why, does pacemaker try to "force away" a resource that is not
>> going to be started on another node anyway?
> 
> Without having the policy engine inputs, I can't be sure, but I suspect
> p_mysql_002 is not being forced away, but its failure causes that node
> to be less preferred for the resources it depends on.
> 
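For what it's worth, the fail count driving check_migration_threshold can be
inspected and cleared per resource, e.g. (names from the thread):

    # Show the recorded fail count; a failed stop is counted as INFINITY,
    # which is why the log says "after 1000000 failures (max=1000000)".
    pcs resource failcount show p_mysql_002

    # Once the underlying problem is fixed, clear the failure history so
    # the resource is allowed back on the node:
    pcs resource cleanup p_mysql_002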
>> 2. pacemaker knows that it cannot stop (and hence move) p_vip_clust01, yet
>> it will happily stop all resources that depend on it in preparation for
>> moving them, and then leave them like that because it cannot move
> 
> I think this is the point at which the behavior is undesirable, because
> it would be relevant whether the move was related to the blocked
> failure or not. Feel free to open a bug report and attach the relevant
> policy engine input (or a crm_report).
> 
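In case it helps anyone filing a similar report, collecting the policy engine
inputs for the window around the failed stop would look roughly like this
(timestamps taken from the log above; the destination name is arbitrary):

    # Gather logs, the CIB and the policy engine inputs for the interval:
    crm_report -f "2019-02-16 14:00:00" -t "2019-02-16 14:15:00" /tmp/p_mysql_002-stop-failure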

https://bugs.clusterlabs.org/show_bug.cgi?id=5379

>> them. The resources are neither restarted on the current node nor moved to
>> another node. At this point I'd expect pacemaker to be smart enough not to
>> even initiate actions that are known to be unsuccessful.
>>
>> The best we can do at this point is to set symmetrical=false, which allows
>> the move to actually happen, but that still means downtime for the
>> resources that are moved, and it opens its own can of worms in the normal
>> case.
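Concretely, that amounts to something like the following on each ordering
constraint (sketch; it removes the implied reverse ordering, so the blocked
stop of p_mysql_002 no longer prevents stopping and moving the VIP):

    # Asymmetrical ordering: only the start order is enforced, nothing is
    # implied about stop order on the way down.
    pcs constraint order start p_vip_clust01 then start p_mysql_002 symmetrical=false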
> --
> Ken Gaillot <kgaillot at redhat.com>