[ClusterLabs] Why Do All The Services Go Down When Just One Fails?

Ken Gaillot kgaillot at redhat.com
Mon Feb 18 10:53:45 EST 2019


On Sun, 2019-02-17 at 20:33 +0300, Andrei Borzenkov wrote:
> 17.02.2019 0:33, Andrei Borzenkov пишет:
> > 17.02.2019 0:03, Eric Robinson пишет:
> > > Here are the relevant corosync logs.
> > > 
> > > It appears that the stop action for resource p_mysql_002 failed,
> > > and that caused a cascading series of service changes. However, I
> > > don't understand why, since no other resources are dependent on
> > > p_mysql_002.
> > > 
> > 
> > You have mandatory colocation constraints for each SQL resource
> > with the VIP. It means that to move an SQL resource to another
> > node, pacemaker must also move the VIP, which in turn means it
> > needs to move all the other dependent resources as well.
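For the record, the constraint pattern being described might look roughly like this with pcs (resource names are taken from the thread; the exact scores and syntax in the original cluster may differ):

```shell
# Hypothetical reconstruction of the described configuration: each SQL
# resource is colocated with the shared VIP and ordered after it. A
# score of INFINITY makes the colocation mandatory, so relocating any
# one SQL resource forces the VIP -- and everything else colocated
# with it -- to relocate as well.
pcs constraint colocation add p_mysql_002 with p_vip_clust01 INFINITY
pcs constraint order p_vip_clust01 then p_mysql_002
```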
> > ...
> > > Feb 16 14:06:39 [3912] 001db01a    pengine:  warning:
> > > check_migration_threshold:        Forcing p_mysql_002 away from
> > > 001db01a after 1000000 failures (max=1000000)
> > 
> > ...
> > > Feb 16 14:06:39 [3912] 001db01a    pengine:   notice:
> > > LogAction:         *
> > > Stop       p_vip_clust01     (                   001db01a
> > > )   blocked
> > 
> > ...
> > > Feb 16 14:06:39 [3912] 001db01a    pengine:   notice:
> > > LogAction:         *
> > > Stop       p_mysql_001       (                   001db01a )   due
> > > to colocation with p_vip_clust01
> 
> There is apparently more to it. Note that the p_vip_clust01 operation
> is "blocked". That is because a mandatory order constraint is
> symmetrical by default, so to move the VIP pacemaker first needs to
> stop it on the current node; but before it can stop the VIP it needs
> to (be able to) stop p_mysql_002; and it cannot do that because, by
> default, when "stop" fails without stonith the resource is blocked
> and no further actions are possible -- i.e. the resource can no
> longer be (or even be tried to be) stopped.

Correct, failed stop actions are special -- an on-fail policy of "stop"
or "restart" requires a stop, so obviously they can't be applied to
failed stops. As you mentioned, without fencing, on-fail defaults to
"block" for stops, which should freeze the resource as it is.
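That default can be made explicit per operation if desired; a minimal sketch with pcs (resource name from the thread, timeout value arbitrary):

```shell
# With no stonith configured, on-fail for the stop operation defaults
# to "block", which freezes the resource on its current node after a
# failed stop. The equivalent explicit setting would be something like:
pcs resource update p_mysql_002 op stop on-fail=block timeout=60s
```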

> I still consider this rather questionable behavior. I tried to
> reproduce it and I see the same thing.
> 
> 1. After this happens, resource p_mysql_002 has target=Stopped in the
> CIB. Why, oh why, does pacemaker try to "force away" a resource that
> is not going to be started on another node anyway?

Without having the policy engine inputs, I can't be sure, but I
suspect p_mysql_002 is not itself being forced away; rather, its
failure causes that node to be less preferred for the resources it
depends on.

> 2. pacemaker knows that it cannot stop (and hence move)
> p_vip_clust01, yet it will still happily stop all resources that
> depend on it in preparation for moving them, and leave them at that
> because it cannot move

I think this is the point at which the behavior is undesirable;
whether the move was related to the blocked failure or not would be
relevant here. Feel free to open a bug report and attach the relevant
policy engine input (or a crm_report).
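For example, a time-bounded crm_report archive covering the incident could be generated with something like the following (timestamps taken from the log excerpt above; the output path is just an example):

```shell
# Collect logs, CIB, and pengine inputs from all nodes for the window
# around the failed stop; the resulting tarball can be attached to the
# bug report.
crm_report -f "2019-02-16 14:00:00" -t "2019-02-16 14:15:00" /tmp/mysql-stop-failure
```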

> them. Resources are neither restarted on the current node nor moved
> to another node. At this point I'd expect pacemaker to be smart
> enough not even to initiate actions that are known to be
> unsuccessful.
> 
> The best we can do at this point is set symmetrical=false, which
> allows the move to actually happen, but it still means downtime for
> the resources that are moved, and it opens its own can of worms in
> the normal case.
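For anyone following along, the workaround mentioned would look roughly like this with pcs (names from the thread; adjust for your own configuration):

```shell
# symmetrical=false makes the order constraint apply only to starts:
# p_mysql_002 still starts after p_vip_clust01, but the VIP's stop
# (and hence its move) is no longer gated on first stopping
# p_mysql_002.
pcs constraint order p_vip_clust01 then p_mysql_002 symmetrical=false
```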
--
Ken Gaillot <kgaillot at redhat.com>