[ClusterLabs] Antw: [EXT] Inquiry - remote node fencing issue
Andrei Borzenkov
arvidjaar at gmail.com
Fri Oct 29 11:18:51 EDT 2021
On 29.10.2021 18:16, Andrei Borzenkov wrote:
> On 29.10.2021 17:53, Ken Gaillot wrote:
>> On Fri, 2021-10-29 at 13:59 +0000, Gerry R Sommerville wrote:
>>> Hey Andrei,
>>>
>>> Thanks for your response again. The cluster nodes and remote hosts
>>> each share two networks, however there is no routing between them. I
>>> don't suppose there is a configuration parameter we can set to tell
>>> Pacemaker to try communicating with the remotes using multiple IP
>>> addresses?
>>>
>>> Gerry Sommerville
>>> E-mail: gerry at ca.ibm.com
>>
>> Hi,
>>
>> No, but you can use bonding if you want to have interface redundancy
>> for a remote connection. To be clear, there is no requirement that
>> remote nodes and cluster nodes have the same level of redundancy, it's
>> just a design choice.
>>
>> To address the original question, this is the log sequence I find most
>> relevant:
>>
>>> Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553]
>>> (unpack_rsc_op_failure) warning: Unexpected result (error) was
>>> recorded for monitor of jangcluster-srv-4 on jangcluster-srv-2 at Oct
>>> 22 12:21:09 2021 | rc=1 id=jangcluster-srv-4_last_failure_0
>>
>>> Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553]
>>> (unpack_rsc_op_failure) notice: jangcluster-srv-4 will not be
>>> started under current conditions
>>
>>> Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553]
>>> (pe_fence_node) warning: Remote node jangcluster-srv-4 will be
>>> fenced: remote connection is unrecoverable
>>
>> The "will not be started" is why the node had to be fenced. There was
>
> OK, so it implies that the remote resource should fail over if the
> connection to the remote node fails. Thank you, that was not exactly
> clear from the documentation.
>
>> nowhere to recover the connection. I'd need to see the CIB from that
>> time to know why; it's possible you had an old constraint banning the
>> connection from the other node (e.g. from a ban or move command), or
>> something like that.
>>
>
> Hmm ... looking in (current) sources it seems this message is emitted
> only in case of on-fail=stop operation property ...
>
Well ...
/* For remote nodes, ensure that any failure that results in dropping an
 * active connection to the node results in fencing of the node.
 *
 * There are only two action failures that don't result in fencing.
 * 1. probes - probe failures are expected.
 * 2. start - a start failure indicates that an active connection does not
 *    already exist. The user can set op on-fail=fence if they really want
 *    to fence start failures. */
So Pacemaker will forcibly set on-fail=stop for the remote resource.
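
To make the rule in that source comment concrete, here is a minimal
sketch in Python. It is not Pacemaker code: the function name and its
parameters are hypothetical, and it models only the two exceptions the
comment lists (probes and start), assuming a probe is a monitor with
interval 0. It ignores any explicit on-fail setting, unmanaged
resources, and whether fencing is enabled at all.

```python
def remote_failure_leads_to_fencing(task: str, interval_ms: int) -> bool:
    """Hypothetical helper mirroring the quoted comment.

    Returns True when a failed action on a remote connection resource
    would, by default, result in fencing of the remote node.
    """
    # Assumption: a probe is a one-shot monitor, i.e. interval 0.
    is_probe = task == "monitor" and interval_ms == 0
    # A failed start means no active connection existed to begin with.
    is_start = task == "start"
    # Every other failure drops an active connection.
    return not (is_probe or is_start)


if __name__ == "__main__":
    # Recurring monitor failure, as in the log sequence quoted earlier.
    print(remote_failure_leads_to_fencing("monitor", 30000))
    # Probe failure and start failure are the two exceptions.
    print(remote_failure_leads_to_fencing("monitor", 0))
    print(remote_failure_leads_to_fencing("start", 0))
```

Under these assumptions, the failed recurring monitor of
jangcluster-srv-4 falls into the "drops an active connection" case,
which is consistent with the pe_fence_node warning in the log.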