[ClusterLabs] reducing corosync-qnetd "response time"

Fri Oct 25 09:12:08 EDT 2019

On 10/25/19 3:17 AM, Jan Friesse wrote:
> Sherrard Burton wrote:
>>
>>
>> On 10/24/19 1:30 PM, Andrei Borzenkov wrote:
>>> 24.10.2019 16:54, Sherrard Burton wrote:
>>>> background:
>>>> we are upgrading a (very) old HA cluster running heartbeat, DRBD and NFS,
>>>> with no stonith, to a much more modern implementation. for the existing
>>>> cluster, as well as the new one, the disk space requirements make
>>>> running a full three-node cluster infeasible, so i am trying to
>>>> configure a quorum-only node using corosync-qnetd.
>>>>
>>>> the installation went fine, the nodes can communicate, etc, and the
>>>> cluster seems to perform as desired when gracefully shutting down or
>>>> restarting a node. but during my torture testing, simulating a node
>>>> crash by stopping the network on one node leaves the remaining node in
>>>> limbo for approximately 20 seconds before it and the quorum-only node
>>>> decide that they are indeed quorate.
>>>>
>>>> the problem:
>>>> the intended implementation involves DRBD, and its resource-level
>>>> fencing freezes IO during the time that the remaining node is inquorate
>>>> in order to avoid any possible data divergence/split-brain. this
>>>> precaution is obviously desirable, and is the reason that i am trying to
>>>> configure this cluster "properly".
>>>>
>>>> my (admittedly naive) expectation is that the remaining node and the
>>>> quorum-only node would continue ticking along as if nothing happened,
>>>> and i am hoping that this delay is due to some
>>>> misconfiguration/oversight/bone-headedness on my part.
>>>>
>>>> so i am seeking understanding on the reason for this delay, and whether
>>>> there is any (prudent) way to reduce it. of course, any other advice on
>>>> the intended setup is welcome as well.
>>>>
>>>> please let me know if you require any additional details.
>>>>
>>>
>>>
>>> You may be interested in this discussion
>>>
>>> https://www.mail-archive.com/users@clusterlabs.org/msg08907.html
>>
>> thanks Andrei.
>>
>> my searches have brought me to that thread a few times, but i did not 
>> think it applied because it seemed as if the asker was having issues 
>> with complete loss of quorum and some unwanted fencing that resulted 
>> from that, based on the relative values of some of these timeouts.
>>
>> after re-reading it, i can see how it relates to my issue. but given 
>> the number of iterations of suggestion/question -> misunderstanding -> 
>> correction/clarification, i was unable to distill from that discussion 
>> which settings should and shouldn't be touched, and which ones will 
>> positively affect my situation while avoiding negative implications.
>>
>> was there ever a "final verdict" from that discussion which would 
>> allow me to reduce the delay in determining quorum after partition 
>> without also ending up in the same situation as the asker, in which 
>> conflicting timeout values introduce a different problem?
> 
> Hi,
> distillation
> 
> https://github.com/ClusterLabs/sbd/pull/76#issuecomment-486952369
> 
> This should reduce the time of corosync "limbo" to ~2 sec.
> 
> Regards,
>    Honza

i will give that a try. thanks a bunch Jan.