[ClusterLabs] Continuous master monitor failure of a resource in case some other resource is being promoted

Ken Gaillot kgaillot at redhat.com
Tue Feb 26 10:27:52 EST 2019


On Tue, 2019-02-26 at 07:03 +0300, Andrei Borzenkov wrote:
> 25.02.2019 23:13, Ken Gaillot wrote:
> > On Mon, 2019-02-25 at 14:20 +0530, Samarth Jain wrote:
> > > Hi,
> > > 
> > > 
> > > We have a bunch of resources running in master/slave
> > > configuration, with one master and one slave instance running at
> > > any given time.
> > > 
> > > What we observe is that, for any two given resources, if say
> > > resource Stateful_Test_1 is in the middle of a promote that takes
> > > a significant amount of time to complete (close to 150 seconds in
> > > our scenario, like starting a web server), and during this time
> > > resource Stateful_Test_2's master instance fails, then the failure
> > > of the Stateful_Test_2 master is never honored by the pengine, and
> > > the recurring monitor keeps failing without any action being taken
> > > by the DC.
> > > 
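For reference, the layout described above corresponds to CIB entries
roughly like the following (a minimal sketch using the
ocf:pacemaker:Stateful demo agent; ids, intervals and timeouts are
illustrative, not the poster's actual configuration), with an analogous
definition for Stateful_Test_2:

  <master id="Stateful_Test_1-master">
    <primitive id="Stateful_Test_1" class="ocf" provider="pacemaker"
               type="Stateful">
      <operations>
        <!-- recurring monitors; Master and Slave roles need distinct intervals -->
        <op id="Stateful_Test_1-monitor-16" name="monitor" interval="16s"
            role="Master" timeout="30s"/>
        <op id="Stateful_Test_1-monitor-18" name="monitor" interval="18s"
            role="Slave" timeout="30s"/>
      </operations>
    </primitive>
    <meta_attributes id="Stateful_Test_1-master-meta">
      <nvpair id="Stateful_Test_1-master-max" name="master-max" value="1"/>
      <nvpair id="Stateful_Test_1-clone-max" name="clone-max" value="2"/>
    </meta_attributes>
  </master>
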
> > > We see the logs below for the failure of Stateful_Test_2 on the
> > > DC, which was VM-3 at that time:
> > > 
> > > Feb 25 11:28:13 [6013] VM-3       crmd:   notice:
> > > abort_transition_graph:      Transition aborted by operation
> > > Stateful_Test_2_monitor_17000 'create' on VM-1: Old event |
> > > magic=0:9;329:8:8:4a2b407e-ad15-43d0-8248-e70f9f22436b
> > > cib=0.191.5
> > > source=process_graph_event:498 complete=false
> > > 
> > > As per our current testing, the Stateful_Test_2 resource has
> > > failed 590 times and it still continues to fail, without the
> > > failure ever being processed by pacemaker! We have to intervene
> > > manually to recover it by doing a resource restart.
> > > 
> > > Could you please help me understand:
> > > 1. Why doesn't pacemaker process the failure of the
> > > Stateful_Test_2 resource immediately after the first failure?
> > 
> > All actions that have already been initiated must complete before
> > the cluster can react to new conditions. The outcome of those
> > actions can (and likely will) affect what needs to be done, so the
> > cluster has to wait for them. The action timeouts are the only way
> > to really affect this.
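
Concretely, the timeout on the promote operation is that knob: it is
the upper bound on how long the cluster will wait for the in-flight
promote before killing it and reacting to anything else. For a promote
that legitimately needs ~150 seconds, the op would look roughly like
this (id and value illustrative):

  <op id="Stateful_Test_1-promote-0" name="promote" interval="0"
      timeout="180s"/>

A tighter timeout lets the cluster react to other failures sooner, but
risks aborting a promote that was going to succeed.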
> > 
> 
> Well, the promote action sets the master score, and this aborts and
> re-evaluates the current transition. So it's not that this rule is
> set in stone; there are obviously situations when pacemaker does not
> wait for an operation to complete before starting the next
> transition.

Actions that have been *scheduled* but not *initiated* can be aborted.
But anytime a resource agent has been invoked, we wait for that process
to complete.

> > We've discussed the theoretical possibility of figuring out what
> > would have to be done regardless of the outcome of the in-flight
> > actions, but that might be computationally impractical.
> > 
> 
> I'm not sure why we need "what if" guessing. If the new transition
> evaluates to the same resource state, pacemaker knows that the
> operation is already in flight and does not need to do anything. If
> the new resource state is different, can't pacemaker simply cancel
> the current operation and initiate a different one?

With the current design, the only time pacemaker kills an already
running process is if its timeout is reached. Scheduled actions can be
cancelled, but not in-flight actions. That makes sense because killing
a resource agent in the middle of a start/stop/promote/etc. could leave
things in a problematic state that would require recovery.

> I understand that operations *on the same resource* need
> serialization, but between completely independent resources?

Not within a single transition, but a new transition can't be done
(with the current model) until in-flight actions have completed.

Thinking about it some more, it would be easier to get around the
problem if we made record-pending permanently true (which is the
default in 2.0 but not 1.1). The scheduler could then be sure it knew
about all in-flight actions, and calculate a new transition where
actions that depend on an in-flight action are properly ordered. We'd
have to add the concept of waiting for an action that isn't scheduled
in the current transition.
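
For reference, record-pending can already be turned on with 1.1,
either per operation or cluster-wide via op_defaults; a minimal sketch
(the nvpair id is illustrative):

  <op_defaults>
    <meta_attributes id="op_defaults-options">
      <nvpair id="op_defaults-options-record-pending"
              name="record-pending" value="true"/>
    </meta_attributes>
  </op_defaults>

On its own that only records pending operations in the CIB; the
scheduler changes described above would still be needed for it to help
with this problem.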

This jogged my memory that we already have a BZ for this aspect of the
issue:

https://bugs.clusterlabs.org/show_bug.cgi?id=5208
-- 
Ken Gaillot <kgaillot at redhat.com>



