[ClusterLabs] Inconclusive recap for bonding (balance-rr) vs. HA (Was: why is node fenced ?)
Jan Pokorný
jpokorny at redhat.com
Thu May 30 11:53:38 EDT 2019
On 20/05/19 14:35 +0200, Jan Pokorný wrote:
> On 20/05/19 08:28 +0200, Ulrich Windl wrote:
>>> One network interface is gone for a short period. But it's in a
>>> bonding device (round-robin), so the connection shouldn't be lost.
>>> Both nodes are connected directly, there is no switch in between.
>>
>> I think you misunderstood: a round-robin bonding device is not
>> fault-safe IMHO, but it depends a lot on your cabling details. Also
>> you did not show the logs on the other nodes.
>
> That was sort of my point. I think that in this case, the
> fault tolerance together with TCP's "best effort" makes the case
> effectively fault-recoverable (except perhaps for some pathological
> scenarios) -- so the whole "value proposition" of that mode can
> create false impressions even where it does not hold ... like
> with corosync (since it intentionally opts for an unreliable
> transport for performance/scalability reasons).
>
> (saying that as someone who has just about a single hands-on
> experiment with bonding behind me...)
Ok, I've tried to read up on the subject a bit -- still nothing more
hands-on, so feel free to correct or amend my conclusions below.
This discusses a Linux setup.
First of all, I think that claiming a further unspecified balance-rr
bonding mode to be a SPOF-solving solution is a myth -- and relying
on that unconditionally is a way to fail hard.
It needs in-depth consideration, I think:

1. some bonding modes (incl. the mentioned balance-rr and
   active-backup) vitally depend on the ability to actually detect
   a link failure

2. configuration of the bonding therefore needs to specify the
   optimal way to detect such failures for the given selection of
   network adapters (it seems the configuration for a resulting bond
   instance is shared across the enslaved devices -- that would mean
   the same models should preferably be used, since this optimality
   is then inherently shared), that is, either
   - miimon, or
   - arp_interval and arp_ip_target
   parameters need to be specified for the kernel bonding module
   (see the sketch after this list); when this is not done, the
   presumed non-SPOF interconnect still remains a SPOF (!)
3. then, it is being discussed that there's hardly any notion of
   real-time detection of a link failure -- since all such
   indications are basically polled for, and moreover, drivers for
   particular adapters can add to the propagation delay, meaning
   the detection happens on the order of hundreds of milliseconds
   or more after the fact -- which makes me think that such a
   faulty link behaves essentially as a black hole for that period
   of time
4. finally, to get back to what I meant about the diametric
   differences between casual TCP vs. corosync (considering UDP
   unicast) traffic, which may play a role here as well: the
   mentioned TCP "best effort" with built-in confirmations will
   not normally give up for tens of seconds or more (please
   correct me), but in the corosync case, with the default "token"
   parameter of 1 second, multiplied by the retransmit attempts
   (4 by default, see token_retransmits_before_loss_const), we
   operate on the order of a few seconds (again, please correct me)

   therefore, specifying the above bonding parameters in a manner
   excessive compared to the corosync configuration (like
   miimon=10000) could under further circumstances mean the same
   SPOF effect (see the worked comparison below), e.g. when
   - the packets_per_slave parameter is specified in a way that it
     will contain all those possibly repeated attempts of the
     corosync exchange (the selected link may, out of bad luck, be
     the faulty one while the failure hasn't been detected yet)
   - (unsure if this can happen) when the logical messages corosync
     exchanges don't fit into a single UDP datagram (low MTU?) and
     packets_per_slave is 1 (the default), the complete message is
     never successfully transmitted, since some part of it will
     always be carried over the faulty link (again, while its
     failure hasn't been detected yet), IIUIC
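
For illustration, here's a minimal sketch of what point 2 could look
like in practice -- the bond/interface names and all concrete values
are hypothetical examples of mine, not recommendations, and the exact
configuration mechanism differs per distribution:

    # /etc/modprobe.d/bonding.conf
    # MII link monitoring, polled every 100 ms (example value):
    options bonding mode=balance-rr miimon=100

    # ... or alternatively ARP monitoring, where arp_ip_target in the
    # switchless, directly cabled setup discussed here would be the
    # peer node's address (again, example values only):
    #options bonding mode=balance-rr arp_interval=100 arp_ip_target=192.168.122.2

    # the effective monitoring settings of an existing bond instance
    # can then be checked with:
    cat /proc/net/bonding/bond0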
Looks quite tricky, overall. Myself, I'd rather opt to live with an
admitted SPOF than with something that's possibly a heisen-SPOF
(your mileage may vary, don't take it as FUD, just a reminder
that discretion is always needed in the HA world).
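
Coming back to the timing mismatch from point 4: taking my reading
of the corosync defaults above at face value (again, corrections
welcome), a rough worked comparison for the excessive miimon example:

    corosync gives up after ~ token * token_retransmits_before_loss_const
                            ~ 1 s * 4 = ~4 s

    bonding detects failure <= miimon interval (+ driver delays)
                             = 10 s with miimon=10000

    10 s > ~4 s, i.e. the faulty link can stay undetected (and keep
    blackholing its share of the round-robin traffic) long enough
    for corosync to declare the token lost -- the presumed non-SPOF
    bond didn't help in time.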
I didn't investigate further, but if there are any bonding modes
with actual redundancy, they could be a more reliable way to avoid
a SPOF in the interconnect (but then, you also need more
switches...). I don't know enough to comment on RRP (allegedly
discouraged [1]) or kronosnet for such a task.
Would be glad if someone more knowledgeable could chime in to share
their insights and show where the limits of bonding are, especially
in HA settings.
Some references:
https://github.com/torvalds/linux/blob/master/Documentation/networking/bonding.txt
https://wiki.linuxfoundation.org/networking/bonding
https://wiki.debian.org/Bonding
[1] https://lists.clusterlabs.org/pipermail/users/2019-April/025651.html
--
Jan (Poki)