[ClusterLabs] Stonith configuration

Fri Feb 14 08:58:30 EST 2020

On February 14, 2020 12:41:58 PM GMT+02:00, "BASDEN, ALASTAIR G." <a.g.basden at durham.ac.uk> wrote:
>Hi,
>I wonder whether anyone could give me some advice about a stonith 
>configuration.
>
>We have 2 nodes, which form a HA cluster.
>
>These have 3 networks:
>A generic network over which they are accessed (e.g. ssh) (node1.primary.network, node2.primary.network)
>A directly connected cable between them (10.0.6.20, 10.0.6.21).
>A management network, on which IPMI is reachable (172.16.150.20, 172.16.150.21)
>
>We have done:
>pcs cluster setup --name hacluster node1.primary.network,10.0.6.20 node2.primary.network,10.0.6.21 --token 20000
>pcs cluster start --all
>pcs property set no-quorum-policy=ignore
>pcs property set stonith-enabled=true
>pcs property set symmetric-cluster=true
>pcs stonith create node1_ipmi fence_ipmilan ipaddr="172.16.150.20" lanplus=true login="root" passwd="password" pcmk_host_list="node1.primary.network" power_wait=10
>pcs stonith create node2_ipmi fence_ipmilan ipaddr="172.16.150.21" lanplus=true login="root" passwd="password" pcmk_host_list="node2.primary.network" power_wait=10
>
>/etc/corosync/corosync.conf has:
>totem {
>     version: 2
>     cluster_name: hacluster
>     secauth: off
>     transport: udpu
>     rrp_mode: passive
>     token: 20000
>}
>
>nodelist {
>     node {
>         ring0_addr: node1.primary.network
>         ring1_addr: 10.0.6.20
>         nodeid: 1
>     }
>
>     node {
>         ring0_addr: node2.primary.network
>         ring1_addr: 10.0.6.21
>         nodeid: 2
>     }
>}
>
>quorum {
>     provider: corosync_votequorum
>     two_node: 1
>}
>
>logging {
>     to_logfile: yes
>     logfile: /var/log/cluster/corosync.log
>     to_syslog: no
>}
>
>
>What I find is that if there is a problem with the directly connected 
>cable, the nodes stonith each other, even though the generic network is
>fine.
>
>What I would expect is that they would only shoot each other when both 
>networks are down (generic and directly connected).
>
>Any ideas?
>
>Thanks,
>Alastair.

What is the output of:
corosync-cfgtool -s
corosync-quorumtool -s

Also check the logs of the surviving node for clues.
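
As a rough starting point for that log check, here is a minimal sketch. It assumes the log path from the corosync.conf quoted above; the helper name scan_cluster_log and the keyword list are only illustrative guesses, not a complete set of corosync or pacemaker messages.

```shell
#!/bin/sh
# Hypothetical helper: print log lines that often accompany ring or fencing trouble.
# The keyword list is an assumption; widen or narrow it to suit your logs.
scan_cluster_log() {
    # Default path taken from the logging section of the quoted corosync.conf.
    logfile="${1:-/var/log/cluster/corosync.log}"
    grep -iE 'faulty|token|fence|stonith' "$logfile"
}
```

Running `scan_cluster_log` on the surviving node (or passing another log file as the first argument) should narrow things down to the entries around the time the cable problem started.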

What about the firewall?
Have you enabled the 'high-availability' service in firewalld in all zones for your interfaces?
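
A minimal sketch of that firewall check, assuming firewalld is what is running on both nodes. The function name check_ha_service and the FIREWALL_CMD variable are hypothetical, added here only so the command can be swapped out; the firewall-cmd options themselves are the standard ones.

```shell
#!/bin/sh
# FIREWALL_CMD is a hypothetical override hook; it defaults to the real firewall-cmd.
FIREWALL_CMD="${FIREWALL_CMD:-firewall-cmd}"

# Report, for every active zone, whether the high-availability service is allowed.
check_ha_service() {
    # Zone names are the unindented lines of the --get-active-zones output.
    for zone in $("$FIREWALL_CMD" --get-active-zones | awk '/^[^[:space:]]/ {print $1}'); do
        if "$FIREWALL_CMD" --zone="$zone" --query-service=high-availability >/dev/null 2>&1; then
            echo "$zone: high-availability enabled"
        else
            echo "$zone: high-availability MISSING"
        fi
    done
}

# If a zone reports MISSING, something like this (as root, on every node) should add it:
#   firewall-cmd --permanent --zone=<zone> --add-service=high-availability
#   firewall-cmd --reload
```

Run `check_ha_service` on both nodes. If the zone that holds the interface for one of the rings is missing the service, corosync traffic on that ring may be getting dropped.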

Best Regards,
Strahil Nikolov