[ClusterLabs] Antw: [EXT] Inquiry - remote node fencing issue

Fri Oct 29 11:18:51 EDT 2021

On 29.10.2021 18:16, Andrei Borzenkov wrote:
> On 29.10.2021 17:53, Ken Gaillot wrote:
>> On Fri, 2021-10-29 at 13:59 +0000, Gerry R Sommerville wrote:
>>> Hey Andrei,
>>>  
>>> Thanks for your response again. The cluster nodes and remote hosts
>>> each share two networks, however there is no routing between them. I
>>> don't suppose there is a configuration parameter we can set to tell
>>> Pacemaker to try communicating with the remotes using multiple IP
>>> addresses?
>>>  
>>> Gerry Sommerville
>>> E-mail: gerry at ca.ibm.com
>>
>> Hi,
>>
>> No, but you can use bonding if you want to have interface redundancy
>> for a remote connection. To be clear, there is no requirement that
>> remote nodes and cluster nodes have the same level of redundancy, it's
>> just a design choice.
>>
>> To address the original question, this is the log sequence I find most
>> relevant:
>>
>>> Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553]
>>> (unpack_rsc_op_failure)      warning: Unexpected result (error) was
>>> recorded for monitor of jangcluster-srv-4 on jangcluster-srv-2 at Oct
>>> 22 12:21:09 2021 | rc=1 id=jangcluster-srv-4_last_failure_0
>>
>>> Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553]
>>> (unpack_rsc_op_failure)      notice: jangcluster-srv-4 will not be
>>> started under current conditions
>>
>>> Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553]
>>> (pe_fence_node)      warning: Remote node jangcluster-srv-4
>>> will be fenced: remote connection is unrecoverable
>>
>> The "will not be started" is why the node had to be fenced. There was
> 
> OK, so it implies that the remote resource should fail over if the
> connection to the remote node fails. Thank you, that was not exactly
> clear from the documentation.
> 
>> nowhere to recover the connection. I'd need to see the CIB from that
>> time to know why; it's possible you had an old constraint banning the
>> connection from the other node (e.g. from a ban or move command), or
>> something like that.
>>
> 
> Hmm ... looking at the (current) sources, it seems this message is
> emitted only when the operation has the on-fail=stop property ...
> 

Well ...

    /* For remote nodes, ensure that any failure that results in dropping an
     * active connection to the node results in fencing of the node.
     *
     * There are only two action failures that don't result in fencing.
     * 1. probes - probe failures are expected.
     * 2. start - a start failure indicates that an active connection does not already
     * exist. The user can set op on-fail=fence if they really want to fence start
     * failures. */

Pacemaker will forcibly set on-fail=stop for the remote resource.
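
To make the rule in that comment concrete, here is a minimal Python sketch of
how I read it. This is not Pacemaker's code: the function name and arguments
are made up for illustration, a probe is assumed to be a monitor with
interval 0, and it ignores everything else the scheduler looks at (whether
fencing is enabled, whether the resource is managed, and so on).

```python
def failure_fences_remote_node(task, interval_ms=0, on_fail=None):
    """Hypothetical helper: does a failed action on a remote connection
    resource lead to fencing of the remote node, per the source comment?

    task        -- action name, e.g. "start", "stop", "monitor"
    interval_ms -- operation interval; 0 for a monitor is assumed to be a probe
    on_fail     -- on-fail value configured by the user, or None if unset
    """
    # Exception 1: probe failures are expected and never fence.
    is_probe = (task == "monitor" and interval_ms == 0)
    if is_probe:
        return False

    # Exception 2: a failed start means no active connection existed,
    # so fence only if the user explicitly asked for on-fail=fence.
    if task == "start":
        return on_fail == "fence"

    # Any other failure dropped an active connection: fence the node.
    return True


if __name__ == "__main__":
    # The 30000 ms interval is an assumed value, not taken from the logs.
    for task, interval, on_fail in [
        ("monitor", 0, None),
        ("monitor", 30000, None),
        ("start", 0, None),
        ("start", 0, "fence"),
    ]:
        print(task, interval, on_fail,
              failure_fences_remote_node(task, interval, on_fail))
```

Under those assumptions, a failed recurring monitor of the connection (like
the rc=1 monitor result in the logs above, if it was not a probe) falls into
the last branch.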