[ClusterLabs] Antw: [EXT] Inquiry - remote node fencing issue

Thu Nov 4 18:20:56 EDT 2021

On Sat, 2021-10-30 at 21:17 +0300, Andrei Borzenkov wrote:
> On 29.10.2021 18:37, Ken Gaillot wrote:
> ...
> > > > > To address the original question, this is the log sequence I
> > > > > find most relevant:
> > > > > 
> > > > > > Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (unpack_rsc_op_failure) warning: Unexpected result (error) was recorded for monitor of jangcluster-srv-4 on jangcluster-srv-2 at Oct 22 12:21:09 2021 | rc=1 id=jangcluster-srv-4_last_failure_0
> > > > > > Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (unpack_rsc_op_failure) notice: jangcluster-srv-4 will not be started under current conditions
> > > > > > Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (pe_fence_node) warning: Remote node jangcluster-srv-4 will be fenced: remote connection is unrecoverable
> > > > > 
> > > > > The "will not be started" is why the node had to be fenced.
> > > > > There was
> > > > 
> > > > OK so it implies that remote resource should fail over if
> > > > connection to remote node fails. Thank you, that was not exactly
> > > > clear from documentation.
> > > > 
> > > > > nowhere to recover the connection. I'd need to see the CIB from
> > > > > that time to know why; it's possible you had an old constraint
> > > > > banning the connection from the other node (e.g. from a ban or
> > > > > move command), or something like that.
> > > > > 
> > > > 
> > > > Hmm ... looking in (current) sources it seems this message is
> > > > emitted only in case of on-fail=stop operation property ...
> > > > 
> > > 
> > > Well ...
> > > 
> > >     /* For remote nodes, ensure that any failure that results in dropping an
> > >      * active connection to the node results in fencing of the node.
> > >      *
> > >      * There are only two action failures that don't result in fencing.
> > >      * 1. probes - probe failures are expected.
> > >      * 2. start - a start failure indicates that an active connection does not already
> > >      * exist. The user can set op on-fail=fence if they really want to fence start
> > >      * failures. */
> > > 
> > > 
> > > pacemaker will forcibly set on-fail=stop for remote resource.
> > 
> > The default isn't any different, it's on-fail=restart.
> > 
> > At that point in the code, on-fail is not what the user set (or
> > default), but how the result should be handled, taking into account
> > what the user set. E.g. if the result is success, then on-fail is set
> > to ignore because nothing needs to be done, regardless of what the
> > configured on-fail is.
> > 
> 
> There are two issues discussed in this thread.
> 
> 1. Remote node is fenced when connection with this node is lost. For
> all I can tell this is intended and expected behavior. That was the
> original question.

It's expected only because the connection can't be recovered elsewhere.
If another node can run the connection, pacemaker will try to reconnect
from there and re-probe everything to make sure what the current state
is.
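
To make that rule concrete, here is a rough sketch in Python. It is
not Pacemaker code: the function name and its parameters are made up
for illustration, and it only restates the source comment quoted
earlier together with the "can the connection be recovered elsewhere"
condition described above.

```python
def should_fence_remote(action, succeeded, configured_on_fail="restart",
                        can_recover_elsewhere=False):
    """Hypothetical model of when a remote node gets fenced.

    action: "probe", "start" or "monitor" on the remote connection resource.
    configured_on_fail: what the user set for the operation ("restart" is
        the default mentioned in this thread).
    can_recover_elsewhere: whether another cluster node is allowed to run
        the connection (an assumption of this sketch, passed in by hand).
    """
    if succeeded:
        return False                      # nothing needs to be done
    if action == "probe":
        return False                      # probe failures are expected
    if action == "start":
        # No active connection existed yet; fence only if explicitly asked.
        return configured_on_fail == "fence"
    # Any other failure drops an active connection. Fencing is needed only
    # when the connection cannot be recovered from another node.
    return not can_recover_elsewhere


# The situation in the original logs: the monitor failed and there was
# nowhere else to run the connection.
print(should_fence_remote("monitor", succeeded=False))
print(should_fence_remote("monitor", succeeded=False,
                          can_recover_elsewhere=True))
```

With the defaults, the first call models the logged case (monitor
failure, no other node available) and the second models the case where
another node can take over the connection.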

> 2. Remote resource appears to not fail over. I cannot reproduce it,
> but then we also do not have the complete CIB, so something may affect
> it. OTOH logs shown stop before fencing has possibly succeeded, so
> maybe remote resource *did* fail over.
> 
> What I see is - connection to remote node is lost, pacemaker fences
> remote node and attempts to restart remote resource, if this is
> unsuccessful (meaning - connection still could not be established)
> remote resource fails over to another node.
> 
> I do not know if it is possible to avoid fencing of remote node under
> described conditions.
> 
> What is somewhat interesting (and looks like a bug) - in my testing
> pacemaker ignored failed fencing attempt and proceeded with restarting
> of remote resource. Is it expected behavior?

I don't see a failed fencing attempt (or any result of the fencing
attempt) in the logs in the original message, only failures of the
connection monitor.
-- 
Ken Gaillot <kgaillot at redhat.com>