[ClusterLabs] reducing corosync-qnetd "response time"

Fri Oct 25 03:17:24 EDT 2019

Sherrard Burton wrote:
> 
> 
> On 10/24/19 1:30 PM, Andrei Borzenkov wrote:
>> 24.10.2019 16:54, Sherrard Burton wrote:
>>> background:
>>> we are upgrading a (very) old HA cluster running heartbeat, DRBD and NFS,
>>> with no stonith, to a much more modern implementation. for the existing
>>> cluster, as well as the new one, the disk space requirements make
>>> running a full three-node cluster infeasible, so i am trying to
>>> configure a quorum-only node using corosync-qnetd.
>>>
>>> the installation went fine, the nodes can communicate, etc, and the
>>> cluster seems to perform as desired when gracefully shutting down or
>>> restarting a node. but during my torture testing, simulating a node
>>> crash by stopping the network on one node leaves the remaining node in
>>> limbo for approximately 20 seconds before it and the quorum-only node
>>> decide that they are indeed quorate.
>>>
>>> the problem:
>>> the intended implementation involves DRBD, and its resource-level
>>> fencing freezes IO during the time that the remaining node is inquorate
>>> in order to avoid any possible data divergence/split-brain. this
>>> precaution is obviously desirable, and is the reason that i am trying to
>>> configure this cluster "properly".
>>>
>>> my (admittedly naive) expectation is that the remaining node and the
>>> quorum-only node would continue ticking along as if nothing happened,
>>> and i am hoping that this delay is due to some
>>> misconfiguration/oversight/bone-headedness on my part.
>>>
>>> so i am seeking understanding on the reason for this delay, and whether
>>> there is any (prudent) way to reduce it. of course, any other advice on
>>> the intended setup is welcome as well.
>>>
>>> please let me know if you require any additional details.
>>>
>>
>>
>> You may be interested in this discussion
>>
>> https://www.mail-archive.com/users@clusterlabs.org/msg08907.html
> 
> thanks Andrei.
> 
> my searches have brought me to that thread a few times, but i did not 
> think it applied because it seemed as if the asker was having issues 
> with complete loss of quorum and some unwanted fencing that resulted 
> from that, based on the relative values of some of these timeouts.
> 
> after re-reading it, i can see how it relates to my issue. but given the 
> number of iterations of suggestion/question -> misunderstanding -> 
> correction/clarification, i was unable to distill from that discussion 
> which settings should and shouldn't be touched, and which ones will 
> positively affect my situation while avoiding negative implications.
> 
> was there ever a "final verdict" from that discussion which would allow 
> me to reduce the delay in determining quorum after partition without 
> also ending up in the same situation as the asker, in which conflicting 
> timeout values introduce a different problem?

Hi,
the distillation is here:

https://github.com/ClusterLabs/sbd/pull/76#issuecomment-486952369

This should reduce the time of corosync "limbo" to ~2 sec.

Regards,
   Honza
