[ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connection disrupted.

alexey at pavlyuts.ru alexey at pavlyuts.ru
Fri May 3 14:59:03 EDT 2024


Hi,

> > Also, I've done a wireshark capture and found a great mess in TCP; it
> > seems like the connection between qdevice and qnetd really stops for
> > some time and packets are not delivered.
> 
> Could you check UDP? I guess there is a lot of UDP packets sent by
> corosync which probably makes TCP to not go thru.
Very improbable. UDP itself can't prevent TCP from working, and a 1 Gbps
link seems too wide for corosync to overload.
Also, overload usually leads to SOME packets being dropped, but this is an
entirely different case: NO TCP packets passed at all. I took two captures,
one on each side, and I see that for some time each party sends TCP packets
but the other party does not receive them at all.
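
To illustrate the check I did on the two captures, here is a minimal sketch
in Python. It assumes each capture has already been exported to a list of
(timestamp, tcp_seq) tuples; the function names and the sample data are made
up for illustration and are not taken from the real captures.

```python
def missing_segments(sent, received):
    """Return segments seen in the sender capture but not in the receiver capture.

    Each capture is a list of (timestamp, tcp_seq) tuples; matching is by
    tcp_seq only, which is a simplification.
    """
    received_seqs = {seq for _, seq in received}
    return [(ts, seq) for ts, seq in sent if seq not in received_seqs]


def blackout_window(sent, received):
    """Return (start, end) sender timestamps spanning the lost segments, or None."""
    lost = missing_segments(sent, received)
    if not lost:
        return None
    times = [ts for ts, _ in lost]
    return min(times), max(times)


# Made-up example data: segments 200 and 300 never reach the other side.
sender_side = [(0.0, 100), (1.0, 200), (2.0, 300), (3.0, 400)]
receiver_side = [(0.001, 100), (3.001, 400)]

print(blackout_window(sender_side, receiver_side))
```

With ordinary overload I would expect the lost segments to be scattered; a
single contiguous window where nothing arrives is what I see instead.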

> >
> > My guess is that it matches corosync syncing activity, and I suspect
> > that corosync prevents any other traffic on the interface it uses for
> > rings.
> >
> > When I switch qnetd and qdevice to use a different interface it seems
> > to work fine.
> 
> Actually having dedicated interface just for corosync/knet traffic is
> optimal solution. qdevice+qnetd on the other hand should be as close to
> "customer" as possible.
> 
I am sure qnetd is not intended to be a proof of network reachability; it is
only an arbiter that provides quorum resolution. Therefore, in my view it is
better to keep it on the intra-cluster network with a high-priority
transport. If we need a solution based on network reachability, there are
other ways to provide it.

> So if you could have two interfaces (one just for corosync, second for
> qnetd+qdevice+publicly accessible services) it might be a solution?
> 
Yes, this way it works, but I wish to know WHY it won't work on the shared
interface.

> > So, the question is: does corosync really temporarily block any other
> > traffic on the interface it uses? Or is it just a coincidence? If it
> > blocks, is
> 
> Nope, no "blocking". But it sends quite some few UDP packets and I guess
> it can really use all available bandwidth so no TCP goes thru.
Use all of the available 1 Gbps? Impossible.
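
As a rough sanity check, the packet rate needed to fill such a link can be
estimated. This is only a sketch: it assumes the link is 1 Gbit/s and a
hypothetical 1500-byte frame size, and it ignores Ethernet preamble and
inter-frame gap. Neither number comes from a measurement of my cluster.

```python
def packets_per_second_to_saturate(link_bits_per_second, frame_bytes):
    """Packets per second needed to fill a link with frames of the given size."""
    return link_bits_per_second / (frame_bytes * 8)


LINK_BPS = 1_000_000_000   # assumed 1 Gbit/s link
FRAME_BYTES = 1500         # assumed frame size, not measured

rate = packets_per_second_to_saturate(LINK_BPS, FRAME_BYTES)
print(round(rate))
```

Under these assumptions corosync would have to send on the order of 80
thousand full-size packets per second, continuously, to leave no room for
TCP.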
