[ClusterLabs] Stonith configuration
Strahil Nikolov
hunter86_bg at yahoo.com
Fri Feb 14 13:06:59 EST 2020
On February 14, 2020 4:44:53 PM GMT+02:00, "BASDEN, ALASTAIR G." <a.g.basden at durham.ac.uk> wrote:
>Hi Strahil,
>corosync-cfgtool -s
>Printing ring status.
>Local node ID 1
>RING ID 0
> id = 172.17.150.20
> status = ring 0 active with no faults
>RING ID 1
> id = 10.0.6.20
> status = ring 1 active with no faults
>
>corosync-quorumtool -s
>Quorum information
>------------------
>Date:             Fri Feb 14 14:41:11 2020
>Quorum provider:  corosync_votequorum
>Nodes:            2
>Node ID:          1
>Ring ID:          1/96
>Quorate:          Yes
>
>Votequorum information
>----------------------
>Expected votes:   2
>Highest expected: 2
>Total votes:      2
>Quorum:           1
>Flags:            2Node Quorate WaitForAll
>
>Membership information
>----------------------
>    Nodeid      Votes Name
>         1          1 node1.primary.network (local)
>         2          1 node2.primary.network
>
>
>On the surviving node, the 10.0.6.21 interface flip-flopped (though
>nothing was detected on the other node), and that is what started it all off.
>
>We have no firewall running.
>
>Cheers,
>Alastair.
>
>
>On Fri, 14 Feb 2020, Strahil Nikolov wrote:
>
>> On February 14, 2020 12:41:58 PM GMT+02:00, "BASDEN, ALASTAIR G." <a.g.basden at durham.ac.uk> wrote:
>>> Hi,
>>> I wonder whether anyone could give me some advice about a stonith
>>> configuration.
>>>
>>> We have 2 nodes, which form an HA cluster.
>>>
>>> These have 3 networks:
>>> A generic network over which they are accessed (e.g. ssh)
>>> (node1.primary.network, node2.primary.network)
>>> A directly connected cable between them (10.0.6.20, 10.0.6.21).
>>> A management network, on which IPMI runs (172.16.150.20, 172.16.150.21)
>>>
>>> We have done:
>>> pcs cluster setup --name hacluster node1.primary.network,10.0.6.20
>>> node2.primary.network,10.0.6.21 --token 20000
>>> pcs cluster start --all
>>> pcs property set no-quorum-policy=ignore
>>> pcs property set stonith-enabled=true
>>> pcs property set symmetric-cluster=true
>>> pcs stonith create node1_ipmi fence_ipmilan ipaddr="172.16.150.20"
>>> lanplus=true login="root" passwd="password"
>>> pcmk_host_list="node1.primary.network" power_wait=10
>>> pcs stonith create node2_ipmi fence_ipmilan ipaddr="172.16.150.21"
>>> lanplus=true login="root" passwd="password"
>>> pcmk_host_list="node2.primary.network" power_wait=10
>>>
>>> /etc/corosync/corosync.conf has:
>>> totem {
>>>     version: 2
>>>     cluster_name: hacluster
>>>     secauth: off
>>>     transport: udpu
>>>     rrp_mode: passive
>>>     token: 20000
>>> }
>>>
>>> nodelist {
>>>     node {
>>>         ring0_addr: node1.primary.network
>>>         ring1_addr: 10.0.6.20
>>>         nodeid: 1
>>>     }
>>>
>>>     node {
>>>         ring0_addr: node2.primary.network
>>>         ring1_addr: 10.0.6.21
>>>         nodeid: 2
>>>     }
>>> }
>>>
>>> quorum {
>>>     provider: corosync_votequorum
>>>     two_node: 1
>>> }
>>>
>>> logging {
>>>     to_logfile: yes
>>>     logfile: /var/log/cluster/corosync.log
>>>     to_syslog: no
>>> }
>>>
>>>
>>> What I find is that if there is a problem with the directly connected
>>> cable, the nodes stonith each other, even though the generic network is
>>> fine.
>>>
>>> What I would expect is that they would only shoot each other when both
>>> networks are down (generic and directly connected).
>>>
>>> Any ideas?
>>>
>>> Thanks,
>>> Alastair.
>>> _______________________________________________
>>> Manage your subscription:
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> ClusterLabs home: https://www.clusterlabs.org/
>>
>> What is the output of:
>> corosync-cfgtool -s
>> corosync-quorumtool -s
>>
>> Also check the logs of the surviving node for clues.
>>
>> What about the firewall?
>> Have you enabled the 'high-availability' service in firewalld for all
>> zones on your interfaces?
>>
>> Best Regards,
>> Strahil Nikolov
>>
>>
One thing that comes to mind is that you have a 20s token, but consensus is left at the default; it should be token * 1.2 -> 24000 (24s).
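For example, the totem section could look like this (the consensus value is just your 20s token times 1.2; adjust to your environment):

totem {
    version: 2
    cluster_name: hacluster
    secauth: off
    transport: udpu
    rrp_mode: passive
    token: 20000
    consensus: 24000
}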
That can be done live with some caution. Put the cluster in maintenance mode, reload corosync (or, even better, stop and start the cluster stack), and then run 'crm_simulate' to verify what will happen when you remove the maintenance mode.
Finally, remove the maintenance mode if the simulation doesn't show any pending actions.
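A rough outline of that procedure with pcs (verify the exact commands against your versions):

pcs property set maintenance-mode=true
# edit /etc/corosync/corosync.conf, then push it to the other node:
pcs cluster sync
# tell all corosync instances to reload the config (if your version supports it):
corosync-cfgtool -R
# simulate against the live CIB to see what would happen:
crm_simulate -sL
pcs property set maintenance-mode=false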
Corosync seems OK, but you should consider whether you really need 'WaitForAll'.
If both nodes fail (a power failure, for example), you need to power up both nodes before the cluster will start any resource.
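The flag comes from 'two_node: 1', which turns wait_for_all on by default. If you decide you don't need it, something like this in the quorum section should disable it (check the votequorum man page for your version):

quorum {
    provider: corosync_votequorum
    two_node: 1
    wait_for_all: 0
}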
There is a chance that the primary network had an issue at the same time, but that can only be confirmed from the logs.
If you can share the logs, send a link; otherwise you will have to analyze them yourself. Keep in mind that the DC node has the more comprehensive logs, but if the DC was the fenced server, check both servers.
Note: Check whether the fencing mechanism has an option for a delay; then evaluate which node hosts the more important resources and configure the delays so that the important node gets fenced second.
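With fence_ipmilan that is the 'delay' parameter. A minimal sketch, assuming node1 hosts the more important resources: delaying node1's own fence device gives node1 time to shoot node2 first, so node1 survives a split:

pcs stonith update node1_ipmi delay=15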
Note2: Consider adding a third node (for example a VM) or a qdevice on a separate host (it can be on a separate network, so simple routing is the only requirement) and reconfigure the cluster so that you have 'Expected votes: 3'.
This will protect you from split brain and is highly recommended.
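A rough sketch of the qdevice variant ('arbiter.example.com' is just a placeholder; the arbiter needs corosync-qnetd, the cluster nodes corosync-qdevice):

# on the arbiter host:
pcs qdevice setup model net --enable --start
# on one of the cluster nodes:
pcs quorum device add model net host=arbiter.example.com algorithm=ffsplit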
P.S.: Sorry for the long post :D
Best Regards,
Strahil Nikolov