[ClusterLabs] inquiry - remote node fails over

Wed Oct 13 18:21:30 EDT 2021

On Wed, 2021-10-13 at 21:13 +0000, Janghyuk Boo wrote:
> Dear Community,
>  
> I have a Pacemaker cluster with two cluster nodes (two network
> interfaces each), two remote nodes, and one fabric fencing agent
> for the whole cluster.
> nodelist {
>    node {
>        ring0_addr: xxx
>        ring1_addr: xxx
>        name: jangcluster-srv-1
>        nodeid: 1
>    }
>    node {
>        ring0_addr: xxx
>        ring1_addr: xxx
>        name: jangcluster-srv-2
>        nodeid: 2
>    }
>  }
>  
> Node List:
>  * Online: [ jangcluster-srv-1 jangcluster-srv-2 ]
>  * RemoteOnline: [ jangcluster-srv-3 jangcluster-srv-4 ]
>  
>  Full List of Resources:
>  * GPFS-Fence (stonith:fence_gpfs):   Started jangcluster-srv-1
>  * jangcluster-srv-3  (ocf::pacemaker:remote):        Started jangcluster-srv-1
>  * jangcluster-srv-4  (ocf::pacemaker:remote):        Started jangcluster-srv-2
>  
>  node 1: jangcluster-srv-1 \
>        attributes ethmonitor-eth1=1
>  node 2: jangcluster-srv-2 \
>        attributes ethmonitor-eth1=1
>  node jangcluster-srv-3:remote \
>        attributes ethmonitor-eth1=1
>  node jangcluster-srv-4:remote \
>        attributes ethmonitor-eth1=1
>  primitive GPFS-Fence stonith:fence_gpfs \
>        params ipaddr=jangcluster-srv-1 pcmk_host_list="jangcluster-srv-1 jangcluster-srv-2 jangcluster-srv-3 jangcluster-srv-4" secure=true \
>        op monitor interval=10s timeout=500s \
>        op off interval=0 \
>        meta is-managed=true
>  primitive NIC_eth1 ethmonitor \
>        params interface=eth1 repeat_count=4 repeat_interval=4 link_status_only=true \
>        op monitor timeout=30s interval=4 \
>        op start timeout=60s interval=0s \
>        op stop interval=0s timeout=20s
> location prefer-node-jangcluster-srv-3 jangcluster-srv-3 100: jangcluster-srv-1
> location prefer-node-jangcluster-srv-4 jangcluster-srv-4 100: jangcluster-srv-2
> location prefer-node-jangcluster-srv-3-2 jangcluster-srv-3 50: jangcluster-srv-2
> location prefer-node-jangcluster-srv-4-2 jangcluster-srv-4 50: jangcluster-srv-1
>  
>  
> I noticed that a remote node gets fenced when the cluster node it is
> connected to gets fenced or experiences a network failure.
> For example, when I disconnected srv-2 from the rest of the cluster,
> I expected remote node jangcluster-srv-4 to fail over to srv-1, given
> my location constraints, but srv-4 was fenced along with srv-2
> instead.
> How can I configure the cluster so that remote node srv-4 fails over
> instead of getting fenced?
>  
>  
> Thank you
>  
> Janghyuk Boo.
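
The placement the question expects can be sketched in Python. This is a simplified, hypothetical model: `LOCATION_SCORES` and `pick_node` are illustrative names, not Pacemaker APIs. Among the full cluster nodes still online, it picks the one with the highest location score for a remote connection resource, and it ignores stickiness, colocation, quorum, and fencing. It only shows why srv-1 is the expected fallback for jangcluster-srv-4. Whether Pacemaker can actually recover the connection rather than fence the remote node is a separate question, which the reply below addresses.

```python
"""Hypothetical, simplified model of score-based placement.

Ignores resource stickiness, colocation, quorum, and fencing; it only
compares the location scores from the constraints in the question.
"""

# Location scores from the question's constraints:
# remote connection resource -> {cluster node: score}
LOCATION_SCORES = {
    "jangcluster-srv-3": {"jangcluster-srv-1": 100, "jangcluster-srv-2": 50},
    "jangcluster-srv-4": {"jangcluster-srv-2": 100, "jangcluster-srv-1": 50},
}


def pick_node(resource, online_cluster_nodes):
    """Return the online cluster node with the highest score, or None.

    Nodes without a constraint default to score 0, as in an opt-out
    (symmetric) cluster. Ties are broken alphabetically only to keep
    this sketch deterministic; Pacemaker's real tie-breaking is not
    modeled here.
    """
    if not online_cluster_nodes:
        return None
    scores = LOCATION_SCORES.get(resource, {})
    # Highest score first (negated for min), then node name.
    return min(online_cluster_nodes, key=lambda n: (-scores.get(n, 0), n))


if __name__ == "__main__":
    both = ["jangcluster-srv-1", "jangcluster-srv-2"]
    # Both cluster nodes up: srv-4's connection prefers srv-2 (100 > 50).
    print(pick_node("jangcluster-srv-4", both))  # jangcluster-srv-2
    # srv-2 lost: the expected fallback is srv-1 (score 50).
    print(pick_node("jangcluster-srv-4", ["jangcluster-srv-1"]))  # jangcluster-srv-1
```
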

Hi,

Failing over is how it works whenever possible. If it fences the remote
node, it is because the remote node was not recoverable. Logs from
srv-1, srv-2, and srv-4 from around that time would help explain why in
more detail.
-- 
Ken Gaillot <kgaillot at redhat.com>