[ClusterLabs] Why Do All The Services Go Down When Just One Fails?

Andrei Borzenkov arvidjaar at gmail.com
Sun Feb 17 01:35:30 EST 2019


17.02.2019 0:44, Eric Robinson wrote:
> Thanks for the feedback, Andrei.
> 
> I only want cluster failover to occur if the filesystem or drbd resources fail, or if the cluster messaging layer detects a complete node failure. Is there a way to tell Pacemaker not to trigger a cluster failover if any of the p_mysql resources fail?
> 

The closest you can get is disabling the recurring monitor action. In
that case pacemaker will effectively ignore any resource state change.
Unfortunately, this also means your resource agent must now correctly
handle requests in the wrong state - i.e. it must be able to stop a
resource that has already failed without returning an error to
pacemaker.
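
For example, with pcs (a sketch only; the resource name p_mysql_002 is
taken from your configuration, and the 60s interval is a placeholder
that must match the interval of your existing monitor operation):

  # Keep the monitor op but disable it, so pacemaker stops polling the
  # resource state and no longer reacts when it fails on its own:
  pcs resource update p_mysql_002 op monitor interval=60s enabled=false

  # Or remove the recurring monitor operation entirely:
  pcs resource op remove p_mysql_002 monitor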

You may set the resource to "unmanaged", but this will also prevent
pacemaker from starting or stopping the resource at all. As a
compromise you could set "unmanaged" after the resource has been
started and unset it before stopping it, but then you hit exactly the
same issue: as soon as a failed resource is managed again, pacemaker
will trigger the corresponding recovery action.
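
With pcs that would look roughly like this (the same effect can be had
by setting the is-managed meta attribute directly):

  # Tell pacemaker to leave the resource alone; it will neither start,
  # stop nor recover p_mysql_002 while it is unmanaged:
  pcs resource unmanage p_mysql_002

  # Put it back under pacemaker control; any failure that happened in
  # the meantime will be acted upon at this point:
  pcs resource manage p_mysql_002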

Pacemaker's design differs from that of any other cluster resource
manager I have seen. Pacemaker is designed to maintain the target
resource state at any cost, and it has no notion of "important" or
"unimportant" resources at all. Even playing with scores won't help,
because a failed resource outweighs everything else with a -INFINITY
score, pushing every dependent resource away from its current node.
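
That -INFINITY ban is what shows up in your log as "Forcing
p_mysql_002 away from 001db01a after 1000000 failures". A sketch of
how to inspect and clear it with pcs:

  # Show the accumulated failure count that pushes the resource away
  # from a node once migration-threshold is reached:
  pcs resource failcount show p_mysql_002

  # Clear the failure history (and with it the resulting ban) so the
  # resource is allowed to run on its original node again:
  pcs resource cleanup p_mysql_002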

In this particular case it may be argued that pacemaker's reaction is
unjustified. The administrator explicitly set the target state to
"stop" (otherwise pacemaker would not attempt to stop it), so it is
unclear why it then tries to restart the resource on the other node.

>> -----Original Message-----
>> From: Users <users-bounces at clusterlabs.org> On Behalf Of Andrei
>> Borzenkov
>> Sent: Saturday, February 16, 2019 1:34 PM
>> To: users at clusterlabs.org
>> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
>> Fails?
>>
>> 17.02.2019 0:03, Eric Robinson wrote:
>>> Here are the relevant corosync logs.
>>>
>>> It appears that the stop action for resource p_mysql_002 failed, and
>>> that caused a cascading series of service changes. However, I don't
>>> understand why, since no other resources are dependent on p_mysql_002.
>>>
>>
>> You have a mandatory colocation constraint for each SQL resource with
>> the VIP. It means that to move an SQL resource to another node,
>> pacemaker must also move the VIP, which in turn means it needs to move
>> all other dependent resources as well.
>> ...
>>> Feb 16 14:06:39 [3912] 001db01a    pengine:  warning: check_migration_threshold:        Forcing p_mysql_002 away from 001db01a after 1000000 failures (max=1000000)
>> ...
>>> Feb 16 14:06:39 [3912] 001db01a    pengine:   notice: LogAction:         * Stop    p_vip_clust01     (                   001db01a )   blocked
>> ...
>>> Feb 16 14:06:39 [3912] 001db01a    pengine:   notice: LogAction:         * Stop    p_mysql_001       (                   001db01a )   due to colocation with p_vip_clust01
>>



