[ClusterLabs] Re: [EXT] Inquiry - remote node fencing issue

Ken Gaillot kgaillot at redhat.com
Fri Oct 29 11:37:14 EDT 2021


On Fri, 2021-10-29 at 18:18 +0300, Andrei Borzenkov wrote:
> On 29.10.2021 18:16, Andrei Borzenkov wrote:
> > On 29.10.2021 17:53, Ken Gaillot wrote:
> > > On Fri, 2021-10-29 at 13:59 +0000, Gerry R Sommerville wrote:
> > > > Hey Andrei,
> > > > 
> > > > Thanks for your response again. The cluster nodes and remote hosts
> > > > each share two networks; however, there is no routing between them.
> > > > I don't suppose there is a configuration parameter we can set to
> > > > tell Pacemaker to try communicating with the remotes using multiple
> > > > IP addresses?
> > > > 
> > > > Gerry Sommerville
> > > > E-mail: gerry at ca.ibm.com
> > > 
> > > Hi,
> > > 
> > > No, but you can use bonding if you want to have interface redundancy
> > > for a remote connection. To be clear, there is no requirement that
> > > remote nodes and cluster nodes have the same level of redundancy;
> > > it's just a design choice.
> > > 
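
As a rough sketch of that idea (not from the thread; the device names and
address below are placeholders), an active-backup bond on a
NetworkManager-managed remote host could look like:

  # Enslave two NICs to one active-backup bond (all names are examples)
  nmcli con add type bond con-name bond0 ifname bond0 \
      bond.options "mode=active-backup,miimon=100"
  nmcli con add type ethernet con-name bond0-port1 ifname eth0 master bond0
  nmcli con add type ethernet con-name bond0-port2 ifname eth1 master bond0
  nmcli con mod bond0 ipv4.method manual ipv4.addresses 192.0.2.10/24
  nmcli con up bond0

The remote connection resource then targets the bond's single address, so
either NIC can fail without dropping the Pacemaker Remote connection.
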
> > > To address the original question, this is the log sequence I find
> > > most relevant:
> > > 
> > > > Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (unpack_rsc_op_failure)      warning: Unexpected result (error) was recorded for monitor of jangcluster-srv-4 on jangcluster-srv-2 at Oct 22 12:21:09 2021 | rc=1 id=jangcluster-srv-4_last_failure_0
> > > > Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (unpack_rsc_op_failure)      notice: jangcluster-srv-4 will not be started under current conditions
> > > > Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (pe_fence_node)      warning: Remote node jangcluster-srv-4 will be fenced: remote connection is unrecoverable
> > > 
> > > The "will not be started" is why the node had to be fenced. There
> > > was
> > 
> > OK, so it implies that the remote resource should fail over if the
> > connection to the remote node fails. Thank you, that was not exactly
> > clear from the documentation.
> > 
> > > nowhere to recover the connection. I'd need to see the CIB from that
> > > time to know why; it's possible you had an old constraint banning the
> > > connection from the other node (e.g. from a ban or move command), or
> > > something like that.
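
(For reference, a hedged sketch rather than anything from this thread: a
ban or move leaves a cli-ban/cli-prefer location constraint behind, which
can be listed and cleared, e.g.

  # list leftover cli-* constraints created by ban/move commands
  cibadmin -Q -o constraints | grep cli-
  # clear any leftover ban/move constraints for the connection resource
  crm_resource --clear --resource jangcluster-srv-4

where jangcluster-srv-4 is the connection resource name from the logs
above.)
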
> > > 
> > 
> > Hmm ... looking at the (current) sources, it seems this message is
> > emitted only in the case of the on-fail=stop operation property ...
> > 
> 
> Well ...
> 
>     /* For remote nodes, ensure that any failure that results in dropping an
>      * active connection to the node results in fencing of the node.
>      *
>      * There are only two action failures that don't result in fencing.
>      * 1. probes - probe failures are expected.
>      * 2. start - a start failure indicates that an active connection does
>      * not already exist. The user can set op on-fail=fence if they really
>      * want to fence start failures. */
> 
> 
> Pacemaker will forcibly set on-fail=stop for the remote resource.

The default isn't any different; it's on-fail=restart.

At that point in the code, on-fail is not what the user set (or the
default), but how the result should be handled, taking into account
what the user set. For example, if the result is success, then on-fail
is set to ignore because nothing needs to be done, regardless of what
the configured on-fail is.
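
As a loose illustration of that remapping (simplified names, not the
actual Pacemaker types or functions):

#include <stdio.h>

/* Simplified stand-in for how a result is remapped to an effective
 * on-fail handling, per the source comment quoted above. */
enum on_fail { ON_FAIL_IGNORE, ON_FAIL_RESTART, ON_FAIL_FENCE };

enum on_fail
effective_on_fail(int rc, int expected_rc, int is_probe, int is_start,
                  enum on_fail configured)
{
    if (rc == expected_rc) {
        return ON_FAIL_IGNORE;   /* success: nothing needs to be done */
    }
    if (is_probe || is_start) {
        return configured;       /* the two non-fencing exceptions */
    }
    return ON_FAIL_FENCE;        /* a dropped active remote connection */
}

int main(void)
{
    /* A failed monitor (rc=1, expected 0) on an active connection is
     * remapped to fence regardless of the configured on-fail: */
    printf("%d\n", effective_on_fail(1, 0, 0, 0, ON_FAIL_RESTART));
    return 0;
}
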
-- 
Ken Gaillot <kgaillot at redhat.com>


