[ClusterLabs] Making xt_cluster IP load-sharing work with IPv6 (Was: Concept of a Shared ipaddress/resource for generic applicatons)

Thu Jan 2 15:52:09 EST 2020

On 27/12/19 15:04 +0100, Valentin Vidić wrote:
> On Wed, Dec 04, 2019 at 02:44:49PM +0100, Jan Pokorný wrote:
>> For the record, based on my feedback, iptables-extensions man page is
>> headed to (finally) align with the actual in-kernel deprecation
>> message:
>> https://lore.kernel.org/netfilter-devel/20191204130921.2914-1-phil@nwl.cc/
> 
> From a quick run of xt_cluster it seems to be working as expected
> for IPv4

FTR, when the "netfilter"/nftables backend is available, you can either
use the iptables-translate conversion utility or derive a similar
takeaway from
https://git.netfilter.org/iptables/tree/extensions/libxt_cluster.txlate?h=v1.8.4
possibly making it feasible to ditch any dependency on iptables-*
tooling, and on xt_cluster.ko just as well!
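
For illustration, a minimal sketch of feeding an xt_cluster rule to the
conversion utility.  The rule mirrors the example in the cluster section
of the iptables-extensions manpage; eth1, the node counts, the hash seed
and the mark value are illustrative assumptions, not anything tested in
this thread.  The utility only prints the nft equivalent, it does not
modify the live ruleset.

```shell
#!/bin/sh
# Illustrative rule (assumed values): node 1 out of 2 marks "its" share
# of the traffic arriving on eth1.
RULE='-A PREROUTING -t mangle -i eth1 -m cluster --cluster-total-nodes 2 --cluster-local-node 1 --cluster-hash-seed 0xdeadbeef -j MARK --set-mark 0xffff'

translate_rule() {
    if command -v iptables-translate >/dev/null 2>&1; then
        # $RULE is intentionally unquoted so that it splits into arguments.
        iptables-translate $RULE
    else
        echo "iptables-translate not found; would run: iptables-translate $RULE" >&2
        return 127
    fi
}

# Print the translation when the tool is installed, carry on otherwise.
translate_rule || true
```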

As mentioned in a newer incarnation asking about xt_cluster:
https://lists.clusterlabs.org/pipermail/users/2020-January/026718.html

for the envisioned agent, it would be a way to (optionally) allow for
a rather lightweight operation in the future (where iptables may not
get installed by default with some Linux distros at all; well, even a
firewalld-as-a-middleware variant controlled just via DBus calls might
be conceivable, meaning that the "nft" tool wouldn't be required either).

> It requires iptables rules and ARP reply rewrite like:
> 
> arptables -A OUTPUT -o eth1 --h-length 6 -j mangle --mangle-mac-s 01:00:5e:00:01:01

Pardon my ignorance, but you currently appear to be the greatest expert
with practical experience on this list regarding the topic.

* * *

1. Is this ARP-level rewrite something that solely your experience with
   the xt_cluster extension led you to, i.e. is it unique to using the
   netfilter backend, or would the same actually be needed with the true
   CLUSTERIP target?

Actually, I took a look at the code of the CLUSTERIP extension, and it
in fact does the very same ARP-level mangling itself, though slightly
more precisely, akin to the following (comments mine):

  # Ethernet (--h-type 1) carrying IPv4 (--proto-type 0x800); --h-length 6
  # is perhaps redundant to --h-type.  A limitation on the size of the
  # network address cannot be expressed, but that would perhaps be
  # redundant to --proto-type.  CLUSTERMAC stands for the shared MAC.
  # Opcode 2 is Reply:
  arptables -A OUTPUT \
    --h-type 1 \
    --proto-type 0x800 \
    --h-length 6 \
    --opcode 2 \
    -j mangle --mangle-mac-s CLUSTERMAC

  # see also
  # https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4095ebf1e641b0f37ee1cd04c903bb85cf4ed25b
  # Same match, this time for Request (opcode 1):
  arptables -A OUTPUT \
    --h-type 1 \
    --proto-type 0x800 \
    --h-length 6 \
    --opcode 1 \
    -j mangle --mangle-mac-s CLUSTERMAC

What you've used appears to be akin to what this chunk of manpage
suggests (amongst others):
https://git.netfilter.org/iptables/tree/extensions/libxt_cluster.man

which is (yet another) indicator to me that xt_cluster extension
doesn't carry that functionality on its own (like CLUSTERIP target
did, as mentioned).
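
To make the comparison concrete, here is a sketch of the link-layer part
of the arrangement along the lines of that manpage chunk, as far as I
recall it.  IFACE and REAL_MAC are assumptions to be replaced (REAL_MAC
below is a made-up value standing for the interface's own address), and
the `run` helper is hypothetical: it only prints the commands unless
APPLY=1 is set, so nothing is changed by default.

```shell
#!/bin/sh
IFACE=${IFACE:-eth1}                       # assumed interface name
CLUSTER_MAC=01:00:5e:00:01:01              # multicast MAC from the quoted example
REAL_MAC=${REAL_MAC:-00:11:22:33:44:55}    # hypothetical; the NIC's own MAC

# Dry run by default: print the command; execute only with APPLY=1 (as root).
run() {
    if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

# Let the interface accept frames addressed to the shared multicast MAC.
run ip maddr add "$CLUSTER_MAC" dev "$IFACE"
# Egress: present the multicast MAC as the source in outgoing ARP.
run arptables -A OUTPUT -o "$IFACE" --h-length 6 \
    -j mangle --mangle-mac-s "$CLUSTER_MAC"
# Ingress: rewrite the multicast destination MAC back to the real one.
run arptables -A INPUT -i "$IFACE" --h-length 6 \
    --destination-mac "$CLUSTER_MAC" -j mangle --mangle-mac-d "$REAL_MAC"
```

Note that, unlike the CLUSTERIP-like rules above, this variant does not
restrict the match by hardware type, protocol type or opcode.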

*
* Anyway, I'd like to understand why this is necessary in the first
* place, which gets me to my second question.
*

2. Is the following explanation, which looks viable to me, correct?

That arrangement is there to prevent unexpectedly leaking specific
associations (I'd call them "fixations") of the interface's true (hence
non-multicast) MAC address with the meant-to-be-shared IP address at
hand.  Such a leak would cancel the effect of link-multicasted frames
(to which at most a single recipient would respond per the firewall
matching rules), and therefore botch the "shared IP" concept altogether
from the perspective of those network members that would undesirably
learn the non-multicast address association for the particular
meant-to-be-shared IP.
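
For reference, whether a MAC address is a multicast (group) one is
decided by the least significant bit of its first octet.  A small sketch
of such a check follows; the function name is made up for illustration,
and colon-separated notation is assumed.

```shell
#!/bin/sh
# Succeeds when the given colon-separated MAC address has the group
# (multicast) bit set, i.e. the lowest bit of the first octet.
is_multicast_mac() {
    first_octet=${1%%:*}
    [ $(( 0x$first_octet & 1 )) -eq 1 ]
}
```

The address from the quoted arptables rule, 01:00:5e:00:01:01, passes
this check, whereas a regular interface address is expected not to.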

*
* But it doesn't explain the suggested destination MAC renormalization
* on INPUT, which hasn't come up for our purpose so far...
*

3. Is, perhaps, the following plausible explanation sound?

- this is so as not to spoil the local ARP cache and the interactions
  depending on it, such as when an actual ARP request is sent with a
  link layer address identical to the particular multicast MAC in use --
  likely feasible with another host on the network configured the same
  way and with the already mentioned source-MAC rewriting on egress
  in place (so this would actually neutralize the harmful network
  effect it could otherwise be causing)

- or any other reasons?

*
* Finally, when referring to the suggestive example above, there is
* one more question to ask...
*

4. Shouldn't even the existing IPaddr2 (whether in CLUSTERIP-based mode
   or not) actually verify that
   /proc/sys/net/netfilter/nf_conntrack_tcp_loose
   is cleared, at least until told not to through configuration?

- it looks like a good idea, as a safety precaution, not to allow
  any after-cut packet interaction (this would only apply to
  anything outside of the critical cluster infrastructure, since
  that uses UDP); there are no liveness aspects to wish for in
  such scenarios, which could otherwise interfere, I think
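
A rough sketch of what such a verification could look like, purely as
an assumption of mine and not anything IPaddr2 does today.  The function
name and the return code convention are hypothetical; the path argument
exists only to make the check easy to exercise outside of /proc.

```shell
#!/bin/sh
# Returns 0 if loose TCP tracking is disabled (the knob reads 0),
# 1 if it is enabled, 2 if the knob cannot be read at all
# (e.g. the nf_conntrack module is not loaded).
check_tcp_loose() {
    knob=${1:-/proc/sys/net/netfilter/nf_conntrack_tcp_loose}
    [ -r "$knob" ] || return 2
    [ "$(cat "$knob")" = 0 ]
}
```

An agent could call `check_tcp_loose` at start and refuse to proceed (or
just warn) on a non-zero result, unless configured otherwise.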

> However for IPv6 I could not find an equivalent command to rewrite
> Neighbour Advertisment packets.  Does anyone have an idea how this
> could be done?

5. Here, I had a closer look at the code as well and have an option
   to try -- does this help?

It appears that the address given in the (solicited) Neighbour
Advertisement response is -- in the Linux kernel -- unconditionally
picked from the very first address configured on the device (not to
be confused with the "permanent address").  Hence it looks to me that
the way to go, so as to achieve IPv4 vs. IPv6 feature parity, would be
to either:

- give up on the sole identity of the interface, so that it either
  operates under selected multicast link layer address or doesn't
  operate at all (rationale: better not to confuse the network with
  occasional MAC flips?)

- stick with a new macvlan pseudointerface, surprise-surprise, yet
  another virtualization/mimicking/independence-increasing layer :-)

No experience with macvlan on my side, but bridge mode looks appealing,
and would retain the interface addressable through its standard MAC
address as well.  And importantly, the newly created interface would
have the correct (multicast) MAC address to respond with to the
respective Neighbour Solicitations (which is exactly what's asked,
IIUIC), and I expect it would be the one selected to respond to
the very matching IP in question?

Still, this doesn't resolve any concern around point 3. above
(assuming it's not bogus, to begin with).

* * *

Sorry for any imprecision, it's all quite confusing to me, and
the WHYs are rather underdocumented/inaccessible for my taste.
But hopefully, we can put some knowledge and practice together.

-- 
Jan (Poki)