[ClusterLabs] Antw: [EXT] Inquiry - remote node fencing issue

Sat Oct 30 14:17:27 EDT 2021

On 29.10.2021 18:37, Ken Gaillot wrote:
...
>>>>
>>>> To address the original question, this is the log sequence I find most
>>>> relevant:
>>>>
>>>>> Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (unpack_rsc_op_failure)      warning: Unexpected result (error) was recorded for monitor of jangcluster-srv-4 on jangcluster-srv-2 at Oct 22 12:21:09 2021 | rc=1 id=jangcluster-srv-4_last_failure_0
>>>>> Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (unpack_rsc_op_failure)      notice: jangcluster-srv-4 will not be started under current conditions
>>>>> Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (pe_fence_node)      warning: Remote node jangcluster-srv-4 will be fenced: remote connection is unrecoverable
>>>>
>>>> The "will not be started" is why the node had to be fenced. There was
>>>
>>> OK so it implies that remote resource should fail over if connection
>>> to remote node fails. Thank you, that was not exactly clear from
>>> documentation.
>>>
>>>> nowhere to recover the connection. I'd need to see the CIB from that
>>>> time to know why; it's possible you had an old constraint banning the
>>>> connection from the other node (e.g. from a ban or move command), or
>>>> something like that.
>>>>
>>>
>>> Hmm ... looking in (current) sources it seems this message is emitted
>>> only in case of on-fail=stop operation property ...
>>>
>>
>> Well ...
>>
>>     /* For remote nodes, ensure that any failure that results in dropping an
>>      * active connection to the node results in fencing of the node.
>>      *
>>      * There are only two action failures that don't result in fencing.
>>      * 1. probes - probe failures are expected.
>>      * 2. start - a start failure indicates that an active connection does not already
>>      * exist. The user can set op on-fail=fence if they really want to fence start
>>      * failures. */
>>
>>
>> pacemaker will forcibly set on-fail=stop for remote resource.
> 
> The default isn't any different, it's on-fail=restart.
> 
> At that point in the code, on-fail is not what the user set (or
> default), but how the result should be handled, taking into account
> what the user set. E.g. if the result is success, then on-fail is set
> to ignore because nothing needs to be done, regardless of what the
> configured on-fail is.
> 
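
To make the two quoted points concrete, here is a minimal sketch in Python.
It is only an illustrative model, not Pacemaker's code: the function names
effective_handling and remote_failure_fences, the is_probe flag and the
string values are hypothetical and chosen for readability. The first
function follows the explanation that the handling is derived from the
result plus the configured on-fail; the second follows the two exceptions
listed in the quoted source comment.

```python
from typing import Optional

DEFAULT_ON_FAIL = "restart"  # default named in the thread


def effective_handling(succeeded: bool,
                       configured_on_fail: Optional[str] = None) -> str:
    """How a result should be handled, given what the user configured."""
    if succeeded:
        # Nothing needs to be done, regardless of the configured on-fail.
        return "ignore"
    return configured_on_fail or DEFAULT_ON_FAIL


def remote_failure_fences(action: str, is_probe: bool = False,
                          configured_on_fail: Optional[str] = None) -> bool:
    """Whether a failed action on a remote connection leads to fencing,
    according to the two exceptions in the quoted comment (simplified)."""
    if is_probe:
        return False  # probe failures are expected
    if action == "start":
        # No active connection existed yet; fence only on explicit request.
        return configured_on_fail == "fence"
    return True  # any other failure drops an active connection


print(effective_handling(True, "fence"))
print(remote_failure_fences("monitor"))
```

In this model a failed recurring monitor of the remote connection, as in
the log above, falls into the last branch and leads to fencing.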

There are two issues discussed in this thread.

1. The remote node is fenced when the connection to it is lost. As far as
I can tell, this is intended and expected behavior. That was the original
question.

2. The remote resource appears not to fail over. I cannot reproduce this,
but we also do not have the complete CIB, so something in it may affect
the result. On the other hand, the logs shown stop before fencing could
have succeeded, so maybe the remote resource *did* fail over.

What I see is this: the connection to the remote node is lost, pacemaker
fences the remote node and attempts to restart the remote resource; if
this is unsuccessful (meaning the connection still could not be
established), the remote resource fails over to another node.
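
The sequence I observed can be written down as a small Python sketch. It
only models the order of steps as I saw them in testing and is not how
Pacemaker implements recovery; the function recover_remote_connection, the
can_connect callback and the node names are hypothetical. Trying further
nodes in list order after the first failover target is my assumption.

```python
from typing import Callable, List


def recover_remote_connection(nodes: List[str], current: str,
                              can_connect: Callable[[str], bool]) -> List[str]:
    """Return the ordered steps taken after the remote connection is lost."""
    steps = ["fence remote node"]
    # First try to restart the connection resource where it was running.
    steps.append(f"restart connection on {current}")
    if can_connect(current):
        return steps
    # Restart failed: fail over to another cluster node (assumed order).
    for node in nodes:
        if node == current:
            continue
        steps.append(f"fail over connection to {node}")
        if can_connect(node):
            break
    return steps


# Hypothetical two-node cluster where only node-a can reach the remote node.
for step in recover_remote_connection(["node-a", "node-b"], "node-b",
                                      lambda n: n == "node-a"):
    print(step)
```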

I do not know whether it is possible to avoid fencing the remote node
under the described conditions.

What is somewhat interesting (and looks like a bug): in my testing,
pacemaker ignored a failed fencing attempt and proceeded with restarting
the remote resource. Is this expected behavior?