[ClusterLabs] Why Do All The Services Go Down When Just One Fails?

Sun Feb 17 17:33:53 UTC 2019

17.02.2019 0:33, Andrei Borzenkov wrote:
> 17.02.2019 0:03, Eric Robinson wrote:
>> Here are the relevant corosync logs.
>>
>> It appears that the stop action for resource p_mysql_002 failed, and that caused a cascading series of service changes. However, I don't understand why, since no other resources are dependent on p_mysql_002.
>>
> 
> You have mandatory colocation constraints for each SQL resource with
> the VIP. It means that to move an SQL resource to another node, pacemaker
> must also move the VIP to another node, which in turn means it needs to
> move all other dependent resources as well.
> ...
>> Feb 16 14:06:39 [3912] 001db01a    pengine:  warning: check_migration_threshold:        Forcing p_mysql_002 away from 001db01a after 1000000 failures (max=1000000)
> ...
>> Feb 16 14:06:39 [3912] 001db01a    pengine:   notice: LogAction:         * Stop       p_vip_clust01     (                   001db01a )   blocked
> ...
>> Feb 16 14:06:39 [3912] 001db01a    pengine:   notice: LogAction:         * Stop       p_mysql_001       (                   001db01a )   due to colocation with p_vip_clust01
> 

There is apparently more to it. Note that the p_vip_clust01 stop operation
is "blocked". That is because a mandatory order constraint is symmetrical
by default, so to move the VIP pacemaker first needs to stop it on the
current node; but before it can stop the VIP it needs to (be able to) stop
p_mysql_002; and it cannot do that, because by default when "stop" fails
without stonith, the resource is blocked and no further actions are
possible, i.e. pacemaker will no longer even try to stop the resource.
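
To make the chain above concrete, here is a toy model in Python. It is only
a sketch of the reasoning, not pacemaker's actual scheduler. The resource
names come from the logs; the order constraint for p_mysql_001 and the
helper name can_stop are my assumptions for illustration.

```python
# Toy model only: this is NOT pacemaker's scheduler, just the reasoning above.
# Each pair (first, then) stands for a mandatory, symmetrical order constraint:
# start `first` before `then`, and therefore stop `then` before `first`.
# The p_mysql_001 entry is assumed; only p_mysql_002 is implied by the logs.
ORDER = [
    ("p_vip_clust01", "p_mysql_001"),
    ("p_vip_clust01", "p_mysql_002"),
]

# Resources whose "stop" failed without stonith: blocked, no further actions.
BLOCKED = {"p_mysql_002"}


def can_stop(resource, order=ORDER, blocked=BLOCKED):
    """Return True if `resource`, and everything that must stop before it, can stop."""
    if resource in blocked:
        return False
    # With symmetrical ordering, every `then` must be stopped before its `first`.
    must_stop_first = [then for first, then in order if first == resource]
    return all(can_stop(r, order, blocked) for r in must_stop_first)


for name in ("p_mysql_001", "p_vip_clust01"):
    print(name, "can stop" if can_stop(name) else "blocked")
```

In this model p_mysql_001 can be stopped on its own, while the VIP cannot,
because one of the resources that has to stop before it is blocked.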

I still consider this rather questionable behavior. I tried to reproduce
it and I see the same.

1. After this happens, resource p_mysql_002 has target=Stopped in the CIB.
Why, oh why, does pacemaker try to "force away" a resource that is not
going to be started on another node anyway?

2. pacemaker knows that it cannot stop (and hence move) p_vip_clust01,
yet it will still happily stop all resources that depend on it in
preparation for moving them, and then leave them stopped because it cannot
move them. The resources are neither restarted on the current node nor
moved to another node. At this point I'd expect pacemaker to be smart
enough not to even initiate actions that are known to be unsuccessful.

The best we can do at this point is to set symmetrical=false, which allows
the move to actually happen, but it still means downtime for the resources
that are moved, and it has its own can of worms in the normal case.
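
For reference, a rough sketch of what such an order constraint could look
like in the CIB, built here with Python's standard xml.etree module so the
shape is easy to check. The constraint id is hypothetical, and the
first/then direction is my assumption based on the behavior described
above (VIP starts before the SQL resource); adjust both to your actual
configuration.

```python
import xml.etree.ElementTree as ET

# Hypothetical constraint id; the first/then direction is assumed from the
# description above, not taken from the poster's real CIB.
order = ET.Element(
    "rsc_order",
    {
        "id": "order-vip-then-mysql_002",
        "first": "p_vip_clust01",
        "then": "p_mysql_002",
        # Only the start ordering is enforced; no implied reverse stop ordering.
        "symmetrical": "false",
    },
)

constraint_xml = ET.tostring(order, encoding="unicode")
print(constraint_xml)
```

This only prints the element; it does not touch a running cluster. Each SQL
resource would need its own constraint changed the same way.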