[ClusterLabs] reducing corosync-qnetd "response time"
Sherrard Burton
sb-clusterlabs at allafrica.com
Thu Oct 24 14:40:10 EDT 2019
On 10/24/19 1:30 PM, Andrei Borzenkov wrote:
> 24.10.2019 16:54, Sherrard Burton wrote:
>> Background:
>> We are upgrading a (very) old HA cluster, running heartbeat, DRBD and
>> NFS with no STONITH, to a much more modern implementation. For both the
>> existing cluster and the new one, the disk space requirements make
>> running a full three-node cluster infeasible, so I am trying to
>> configure a quorum-only node using corosync-qnetd.
>>
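
(for anyone following along: this is the standard corosync-qnetd
arrangement, i.e. two full nodes plus a third, disk-less machine that
only contributes a vote. A minimal sketch of the quorum section of
corosync.conf on the two full nodes would look roughly like the
following -- "qnetd-host" is just a placeholder for the quorum-only
machine, and my understanding is that two_node must be left off when a
qdevice is in use:)

    quorum {
        provider: corosync_votequorum
        device {
            model: net
            votes: 1
            net {
                host: qnetd-host      # placeholder for the quorum-only node
                algorithm: ffsplit    # 50/50-split algorithm for even-sized clusters
                tie_breaker: lowest
            }
        }
    }
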
>> The installation went fine, the nodes can communicate, etc., and the
>> cluster seems to perform as desired when gracefully shutting down or
>> restarting a node. But during my torture testing, simulating a node
>> crash by stopping the network on one node leaves the remaining node in
>> limbo for approximately 20 seconds before it and the quorum-only node
>> decide that they are indeed quorate.
>>
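
("stopping the network" here means something along the lines of the
commands below on the node being "crashed" -- the interface name is
obviously machine-specific -- while watching the quorum state on the
surviving node with corosync's own tools:)

    # on the node simulating the crash (eth0 is only a placeholder)
    ip link set eth0 down

    # on the surviving node, watch the vote/quorum state change
    watch -n1 corosync-quorumtool -s

    # and check the qdevice's own view of things
    corosync-qdevice-tool -sv
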
>> The problem:
>> The intended implementation involves DRBD, and its resource-level
>> fencing freezes IO while the remaining node is inquorate, in order to
>> avoid any possible data divergence/split-brain. This precaution is
>> obviously desirable, and is the reason that I am trying to configure
>> this cluster "properly".
>>
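
(by "resource-level fencing" I mean the usual DRBD fence-peer hook into
pacemaker. A rough sketch of the relevant pieces of the DRBD resource
configuration is below; the handler paths vary with the DRBD
version/packaging, and as far as I understand it is the
resource-and-stonith policy that actually suspends IO until the handler
returns:)

    resource r0 {
        net {
            # resource-only also calls the handler, but resource-and-stonith
            # is the policy that freezes IO while the peer is being fenced
            fencing resource-and-stonith;
        }
        handlers {
            # shipped with drbd-utils; exact names/paths differ between
            # DRBD 8.4 and 9
            fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.9.sh";
        }
    }
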
>> My (admittedly naive) expectation was that the remaining node and the
>> quorum-only node would continue ticking along as if nothing had
>> happened, and I am hoping that this delay is due to some
>> misconfiguration/oversight/bone-headedness on my part.
>>
>> So I am seeking to understand the reason for this delay, and whether
>> there is any (prudent) way to reduce it. Of course, any other advice on
>> the intended setup is welcome as well.
>>
>> Please let me know if you require any additional details.
>>
>
>
> You may be interested in this discussion
>
> https://www.mail-archive.com/users@clusterlabs.org/msg08907.html
Thanks, Andrei.

My searches have brought me to that thread a few times, but I did not
think it applied, because it seemed as if the asker was having issues
with a complete loss of quorum and some unwanted fencing that resulted
from it, based on the relative values of some of those timeouts.

After re-reading it, I can see how it relates to my issue. But given the
number of iterations of suggestion/question -> misunderstanding ->
correction/clarification, I was unable to distill from that discussion
which settings should and shouldn't be touched, and which ones would
positively affect my situation while avoiding negative side effects.

Was there ever a "final verdict" from that discussion that would allow
me to reduce the delay in determining quorum after a partition, without
also ending up in the same situation as the asker, where conflicting
timeout values introduce a different problem?
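
For what it is worth, the knobs that thread seemed to be circling
around are the totem token timeout and the qdevice timeouts in
corosync.conf, i.e. something like the sketch below. The qdevice values
shown are what I believe the defaults to be (per corosync-qdevice(8));
the token value is only an example, so please correct me if I have any
of this wrong:

    totem {
        token: 1000              # totem token timeout in ms (example value)
    }

    quorum {
        device {
            timeout: 10000       # ms; default per corosync-qdevice(8)
            sync_timeout: 30000  # ms; default per corosync-qdevice(8)
        }
    }

So what I am really asking is which of these can safely be lowered, and
by how much, without reintroducing the conflicting-timeouts problem
from that thread.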