[ClusterLabs] Antw: [EXT] Inquiry - remote node fencing issue

Andrei Borzenkov arvidjaar at gmail.com
Fri Oct 29 11:16:02 EDT 2021


On 29.10.2021 17:53, Ken Gaillot wrote:
> On Fri, 2021-10-29 at 13:59 +0000, Gerry R Sommerville wrote:
>> Hey Andrei,
>>  
>> Thanks for your response again. The cluster nodes and remote hosts
>> each share two networks; however, there is no routing between them. I
>> don't suppose there is a configuration parameter we can set to tell
>> Pacemaker to try communicating with the remotes using multiple IP
>> addresses?
>>  
>> Gerry Sommerville
>> E-mail: gerry at ca.ibm.com
> 
> Hi,
> 
> No, but you can use bonding if you want to have interface redundancy
> for a remote connection. To be clear, there is no requirement that
> remote nodes and cluster nodes have the same level of redundancy; it's
> just a design choice.
> 
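
(For anyone looking for a concrete starting point: interface redundancy on
the remote host could look roughly like the following. This is only a
sketch using NetworkManager's nmcli, with hypothetical interface names
eth0/eth1 and a documentation address; the bond's IP would then be what is
given as the remote connection resource's "server" parameter.)

  # create an active-backup bond and enslave the two physical NICs
  nmcli connection add type bond con-name bond0 ifname bond0 mode active-backup
  nmcli connection add type bond-slave ifname eth0 master bond0
  nmcli connection add type bond-slave ifname eth1 master bond0
  # give the bond a static address and bring it up
  nmcli connection modify bond0 ipv4.addresses 192.0.2.44/24 ipv4.method manual
  nmcli connection up bond0
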
> To address the original question, this is the log sequence I find most
> relevant:
> 
>> Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553]
>> (unpack_rsc_op_failure)      warning: Unexpected result (error) was
>> recorded for monitor of jangcluster-srv-4 on jangcluster-srv-2 at Oct
>> 22 12:21:09 2021 | rc=1 id=jangcluster-srv-4_last_failure_0
> 
>> Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553]
>> (unpack_rsc_op_failure)      notice: jangcluster-srv-4 will not be
>> started under current conditions
> 
>> Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553]
>> (pe_fence_node)      warning: Remote node jangcluster-srv-4 will be
>> fenced: remote connection is unrecoverable
> 
> The "will not be started" is why the node had to be fenced. There was

OK so it implies that remote resource should fail over if connection to
remote node fails. Thank you, that was not exactly clear from documentation.

> nowhere to recover the connection. I'd need to see the CIB from that
> time to know why; it's possible you had an old constraint banning the
> connection from the other node (e.g. from a ban or move command), or
> something like that.
> 
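
(A quick way to check for that from the command line could be something
like the following, assuming the pcs CLI is in use and that the connection
resource is named after the remote node, as the logs above suggest;
crm_resource --clear is the lower-level equivalent of the second command.)

  # list all constraints with their IDs, to spot a leftover
  # cli-ban-* / cli-prefer-* location constraint on the connection
  pcs constraint --full

  # remove constraints previously created by "pcs resource move/ban"
  pcs resource clear jangcluster-srv-4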

Hmm ... looking at the (current) sources, it seems this message is emitted
only when the failed operation has the on-fail=stop property set ...
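
(For illustration, such a configuration would look roughly like this; a
hypothetical sketch using pcs, with a made-up server address. With
on-fail=stop on the connection's monitor, a failed monitor leaves the
connection with nowhere to recover, which would match the "will not be
started" / fencing sequence above.)

  # remote connection resource whose monitor failure is handled with
  # on-fail=stop rather than the default restart
  pcs resource create jangcluster-srv-4 ocf:pacemaker:remote \
      server=192.0.2.44 op monitor interval=30s on-fail=stop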

