[ClusterLabs] Why Do Nodes Leave the Cluster?
Strahil Nikolov
hunter86_bg at yahoo.com
Sun Apr 12 02:31:37 EDT 2020
On April 11, 2020 5:01:37 PM GMT+03:00, Eric Robinson <eric.robinson at psmnv.com> wrote:
>
>Hi Strahil --
>
>I hope you won't mind if I revive this old question. In your comments
>below, you suggested using a 1s token with a 1.2s consensus. I
>currently have 2-node clusters (will soon install a qdevice). I was
>reading in the corosync.conf man page where it says...
>
>"For two node clusters, a consensus larger than the join timeout
>but less than token is safe. For three node or larger clusters,
>consensus should be larger than token."
>
>Do you still think the consensus should be 1.2 * token in a 2-node
>cluster? Why is a smaller consensus considered safe for 2-node
>clusters? Should I use a larger consensus anyway?
>
>--Eric
>
>
>> -----Original Message-----
>> From: Strahil Nikolov <hunter86_bg at yahoo.com>
>> Sent: Thursday, February 6, 2020 1:07 PM
>> To: Eric Robinson <eric.robinson at psmnv.com>; Cluster Labs - All topics
>> related to open-source clustering welcomed <users at clusterlabs.org>;
>> Andrei Borzenkov <arvidjaar at gmail.com>
>> Subject: RE: [ClusterLabs] Why Do Nodes Leave the Cluster?
>>
>> On February 6, 2020 7:35:53 PM GMT+02:00, Eric Robinson
>> <eric.robinson at psmnv.com> wrote:
>> >Hi Nikolov --
>> >
>> >> Defaults are 1s token, 1.2s consensus which is too small.
>> >> In SUSE, token is 10s, while consensus is 1.2 * token -> 12s.
>> >> With these settings, the cluster will not react for 22s (10s + 12s).
>> >>
>> >> I think it's a good start for your cluster.
>> >> Don't forget to put the cluster in maintenance (pcs property set
>> >> maintenance-mode=true) before restarting the stack, or even better,
>> >> get some downtime.
>> >>
>> >> You can use the following article to run a simulation before removing
>> >> the maintenance:
>> >> https://www.suse.com/support/kb/doc/?id=7022764
>> >>
>> >
>> >
>> >Thanks for the suggestions. Any thoughts on timeouts for DRBD?
>> >
>> >--Eric
>> >
>> >Disclaimer : This email and any files transmitted with it are
>> >confidential and intended solely for intended recipients. If you are
>> >not the named addressee you should not disseminate, distribute, copy or
>> >alter this email. Any views or opinions presented in this email are
>> >solely those of the author and might not represent those of Physician
>> >Select Management. Warning: Although Physician Select Management has
>> >taken reasonable precautions to ensure no viruses are present in this
>> >email, the company cannot accept responsibility for any loss or damage
>> >arising from the use of this email or attachments.
>>
>> Hi Eric,
>>
>> The timeouts can be treated as 'how much time to wait before taking any
>> action'. The workload is not very important (HANA is something different).
>>
>> You can try with 10s (token), 12s (consensus) and if needed you can adjust.
>>
>> Warning: Use a 3 node cluster or at least 2 DRBD nodes + qdisk. The 2 node
>> cluster is vulnerable to split brain, especially when one of the nodes is
>> syncing (for example after patching) and the source is
>> fenced/lost/disconnected. It's very hard to extract data from a semi-synced
>> DRBD.
>>
>> Also, if you need guidance for SELinux, I can point you to my guide in the
>> CentOS forum.
>>
>> Best Regards,
>> Strahil Nikolov
Hey Eric,
1s (token) and 1.2s (consensus) are the defaults, and I guess they are the defaults for 2-node clusters too.
Yet if the man page says a smaller consensus is safe for two nodes, then you should set it up that way.
On our SLES 2-node clusters we moved to token 30s and consensus 36s after several network issues, and they have been fine since. The only cost is that the reaction to a failure is delayed by about a minute (30s + 36s = 66s).
Maybe you can try consensus 0.8s with the default token of 1s.
I have noticed that my oVirt lab RHEL nodes get fenced with the defaults, so I use SUSE's defaults of 10s token and 12s consensus even on RHEL 2-node clusters.
Best Regards,
Strahil Nikolov
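
To make the arithmetic in this thread easier to check, here is a minimal Python sketch. It models the reaction delay as token + consensus (the same sum used above for 10s + 12s = 22s) and encodes one reading of the man page sentence Eric quoted. The function names are hypothetical, the 50 ms join timeout is an assumed default, and treating "consensus larger than join" as sufficient for two nodes is an interpretation, not something corosync itself enforces.

```python
# Sketch only: helper names are illustrative, not part of corosync or pcs.

JOIN_MS = 50  # assumed default join timeout in milliseconds


def default_consensus(token_ms):
    """Consensus derived as 1.2 * token, the ratio quoted in this thread."""
    return token_ms * 12 // 10


def reaction_delay_ms(token_ms, consensus_ms):
    """Time before the cluster reacts, modelled as token + consensus."""
    return token_ms + consensus_ms


def consensus_ok(nodes, token_ms, consensus_ms, join_ms=JOIN_MS):
    """One reading of the quoted man page guidance.

    Two nodes: consensus only has to exceed the join timeout.
    Three or more nodes: consensus should exceed token.
    """
    if nodes <= 2:
        return consensus_ms > join_ms
    return consensus_ms > token_ms


if __name__ == "__main__":
    for token_ms in (1000, 10000, 30000):
        consensus_ms = default_consensus(token_ms)
        delay_s = reaction_delay_ms(token_ms, consensus_ms) / 1000
        print(f"token={token_ms}ms consensus={consensus_ms}ms delay={delay_s}s")
    # The 0.8s consensus suggested above, with the default 1s token:
    print("2 nodes:", consensus_ok(2, 1000, 800))
    print("3 nodes:", consensus_ok(3, 1000, 800))
```

Under these assumptions the 0.8s consensus passes for a 2-node cluster but not for three or more nodes, which matches the distinction the man page draws.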