[ClusterLabs] reducing corosync-qnetd "response time"
Sherrard Burton
sb-clusterlabs at allafrica.com
Thu Oct 24 14:40:10 EDT 2019
On 10/24/19 1:30 PM, Andrei Borzenkov wrote:
> 24.10.2019 16:54, Sherrard Burton wrote:
>> Background:
>> We are upgrading a (very) old HA cluster, running heartbeat, DRBD and
>> NFS with no STONITH, to a much more modern implementation. For both the
>> existing cluster and the new one, the disk space requirements make
>> running a full three-node cluster infeasible, so I am trying to
>> configure a quorum-only node using corosync-qnetd.
>>
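
(for anyone following along: this is the standard corosync-qnetd
arrangement, i.e. two full nodes plus a third, disk-less machine that
only contributes a vote. A minimal sketch of the quorum section of
corosync.conf on the two full nodes would look roughly like the
following -- "qnetd-host" is just a placeholder for the quorum-only
machine, and my understanding is that two_node must be left off when a
qdevice is in use:)

    quorum {
        provider: corosync_votequorum
        device {
            model: net
            votes: 1
            net {
                host: qnetd-host      # placeholder for the quorum-only node
                algorithm: ffsplit    # 50/50-split algorithm for even-sized clusters
                tie_breaker: lowest
            }
        }
    }
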
>> The installation went fine, the nodes can communicate, etc., and the
>> cluster seems to perform as desired when gracefully shutting down or
>> restarting a node. But during my torture testing, simulating a node
>> crash by stopping the network on one node leaves the remaining node in
>> limbo for approximately 20 seconds before it and the quorum-only node
>> decide that they are indeed quorate.
>>
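
("stopping the network" here means something along the lines of the
commands below on the node being "crashed" -- the interface name is
obviously machine-specific -- while watching the quorum state on the
surviving node with corosync's own tools:)

    # on the node simulating the crash (eth0 is only a placeholder)
    ip link set eth0 down

    # on the surviving node, watch the vote/quorum state change
    watch -n1 corosync-quorumtool -s

    # and check the qdevice's own view of things
    corosync-qdevice-tool -sv
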
>> The problem:
>> The intended implementation involves DRBD, and its resource-level
>> fencing freezes IO while the remaining node is inquorate, in order to
>> avoid any possible data divergence/split-brain. This precaution is
>> obviously desirable, and is the reason that I am trying to configure
>> this cluster "properly".
>>
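
(by "resource-level fencing" I mean the usual DRBD fence-peer hook into
pacemaker. A rough sketch of the relevant pieces of the DRBD resource
configuration is below; the handler paths vary with the DRBD
version/packaging, and as far as I understand it is the
resource-and-stonith policy that actually suspends IO until the handler
returns:)

    resource r0 {
        net {
            # resource-only also calls the handler, but resource-and-stonith
            # is the policy that freezes IO while the peer is being fenced
            fencing resource-and-stonith;
        }
        handlers {
            # shipped with drbd-utils; exact names/paths differ between
            # DRBD 8.4 and 9
            fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.9.sh";
        }
    }
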
>> My (admittedly naive) expectation was that the remaining node and the
>> quorum-only node would continue ticking along as if nothing had
>> happened, and I am hoping that this delay is due to some
>> misconfiguration/oversight/bone-headedness on my part.
>>
>> So I am seeking to understand the reason for this delay, and whether
>> there is any (prudent) way to reduce it. Of course, any other advice on
>> the intended setup is welcome as well.
>>
>> Please let me know if you require any additional details.
>>
>
>
> You may be interested in this discussion
>
> https://www.mail-archive.com/users@clusterlabs.org/msg08907.html
Thanks, Andrei.

My searches have brought me to that thread a few times, but I did not
think it applied, because it seemed as if the asker was having issues
with a complete loss of quorum and some unwanted fencing that resulted
from it, based on the relative values of some of those timeouts.

After re-reading it, I can see how it relates to my issue. But given the
number of iterations of suggestion/question -> misunderstanding ->
correction/clarification, I was unable to distill from that discussion
which settings should and shouldn't be touched, and which ones would
positively affect my situation while avoiding negative side effects.

Was there ever a "final verdict" from that discussion that would allow
me to reduce the delay in determining quorum after a partition, without
also ending up in the same situation as the asker, where conflicting
timeout values introduce a different problem?
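
For what it is worth, the knobs that thread seemed to be circling
around are the totem token timeout and the qdevice timeouts in
corosync.conf, i.e. something like the sketch below. The qdevice values
shown are what I believe the defaults to be (per corosync-qdevice(8));
the token value is only an example, so please correct me if I have any
of this wrong:

    totem {
        token: 1000              # totem token timeout in ms (example value)
    }

    quorum {
        device {
            timeout: 10000       # ms; default per corosync-qdevice(8)
            sync_timeout: 30000  # ms; default per corosync-qdevice(8)
        }
    }

So what I am really asking is which of these can safely be lowered, and
by how much, without reintroducing the conflicting-timeouts problem
from that thread.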