[ClusterLabs] Antw: [EXT] Inquiry - remote node fencing issue

Fri Nov 5 11:19:53 EDT 2021

On Fri, 2021-11-05 at 11:22 +0300, Andrei Borzenkov wrote:
> On 05.11.2021 01:20, Ken Gaillot wrote:
> > > There are two issues discussed in this thread.
> > > 
> > > 1. Remote node is fenced when connection with this node is lost. For
> > > all I can tell this is intended and expected behavior. That was the
> > > original question.
> > 
> > It's expected only because the connection can't be recovered
> > elsewhere. If another node can run the connection, pacemaker will
> > try to reconnect from there and re-probe everything to make sure
> > what the current state is.
> > 
> 
> That's not what I see in sources and documentation and not what I
> observe. Pacemaker will reprobe from another node only after
> attempting fencing of remote node.

Ah, you're right, I misremembered. Probe/start failures of a remote
connection don't require fencing but recurring monitor failures do. I
guess that makes sense, otherwise recovery of resources on a failed
remote could be greatly delayed.

I was confusing that with when the connection host is lost and has to
be fenced, in which case the connection will be recovered elsewhere if
possible, without fencing the remote.
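
To make the distinction concrete, here is a rough sketch in Python. It
is only a toy model of the rules described above, not Pacemaker's
scheduler code; the enum and function names are made up for
illustration.

```python
from enum import Enum


class Failure(Enum):
    """Hypothetical labels for the failure cases discussed in this thread."""
    PROBE = "probe"
    START = "start"
    RECURRING_MONITOR = "recurring_monitor"
    CONNECTION_HOST_LOST = "connection_host_lost"


def fence_targets(failure: Failure) -> set:
    """Return which nodes this toy model would fence for a given failure."""
    if failure is Failure.RECURRING_MONITOR:
        # A recurring monitor failure of the connection: fence the remote,
        # so resources on it can be recovered without a long delay.
        return {"remote"}
    if failure is Failure.CONNECTION_HOST_LOST:
        # The cluster node hosting the connection is fenced; the connection
        # is then recovered elsewhere if possible, without fencing the remote.
        return {"connection_host"}
    # Probe and start failures of the connection do not require fencing.
    return set()
```

For example, fence_targets(Failure.START) is the empty set, while
fence_targets(Failure.RECURRING_MONITOR) contains only "remote".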

<snip>

> The difference seems to be reconnect_interval parameter. If it is
> present in remote resource definition, pacemaker will not proceed
> after failed fencing.
> 
> As there is no real documentation how it is supposed to work I do not
> know whether all of this is a bug or not. But one thing is certain -
> when connection to remote node is lost the first thing pacemaker does
> is to fence it and only then initiate any recovery action.

reconnect_interval is implemented as a sort of special case of
failure-timeout. When the interval expires, the connection failure is
timed out, so the cluster no longer sees a need for fencing. It's not a
bug but maybe a questionable design.
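
As a minimal sketch of that behavior (a simplified model in Python, not
the actual implementation; the class, field, and function names are
hypothetical, and any separately configured failure-timeout is ignored):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConnectionFailure:
    """Toy record of a remote connection failure (times in seconds)."""
    failed_at: float
    reconnect_interval: Optional[float] = None  # None models "not set"

    def is_expired(self, now: float) -> bool:
        # Without reconnect_interval, this model never times the failure out.
        if self.reconnect_interval is None:
            return False
        return now - self.failed_at >= self.reconnect_interval


def fencing_wanted(failure: ConnectionFailure, now: float) -> bool:
    """The need for fencing lasts only while the failure is still on record."""
    return not failure.is_expired(now)
```

With reconnect_interval=60 and a failure at t=0, fencing_wanted() is
true at t=30 and false from t=60 onward, whether or not any fencing
attempt succeeded in between.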

That's a case of a broader problem: if the cause for fencing goes away,
the cluster will stop trying fencing and act as if nothing was wrong.
This can be a good thing, for example a brief network interruption can
sometimes heal without fencing. However, it's been suggested (e.g.
CLBZ#5476) that we need the concept of fencing required independently
of conditions -- i.e., for certain types of failure, fencing should be
considered required until it succeeds, regardless of whether the
original need for it goes away.
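
The difference between the two approaches could be sketched like this
(Python, purely illustrative; neither the function nor the class
corresponds to real Pacemaker code, and the second one models a
suggestion, not an existing feature):

```python
def conditional_fencing_pending(cause_present: bool,
                                fence_succeeded: bool) -> bool:
    """Current behavior as described above: once the cause goes away,
    the need for fencing goes away with it."""
    return cause_present and not fence_succeeded


class StickyFencing:
    """Suggested behavior: once required, fencing stays required until it
    succeeds, regardless of whether the original cause is still present."""

    def __init__(self) -> None:
        self._required = False

    def observe(self, cause_present: bool, fence_succeeded: bool) -> bool:
        if cause_present:
            self._required = True   # latch the requirement
        if fence_succeeded:
            self._required = False  # only successful fencing clears it
        return self._required
```

If the cause disappears before fencing succeeds, the first model
reports nothing pending, while the second keeps reporting that fencing
is still required.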
-- 
Ken Gaillot <kgaillot at redhat.com>