[ClusterLabs] EL6, cman, rrp, unicast and iptables

Tue Sep 15 09:13:55 EDT 2015

On 15/09/15 03:20 AM, Jan Friesse wrote:
> Digimer wrote:
>> On 14/09/15 04:20 AM, Jan Friesse wrote:
>>> Digimer wrote:
>>>> Hi all,
>>>>
>>>>     Starting a new thread from the "Clustered LVM with iptables issue"
>>>> thread...
>>>>
>>>>     I've decided to review how I do networking entirely in my cluster.
>>>> I make zero claims to being great at networks, so I would love some
>>>> feedback.
>>>>
>>>>     I've got three active/passive bonded interfaces: the Back-Channel
>>>> (BCN), Storage (SN) and Internet-Facing (IFN) networks. The IFN is
>>>> "off limits" to the cluster as it is dedicated to hosted server
>>>> traffic only.
>>>>
>>>>     So before, I used only the BCN for cman/corosync multicast
>>>> cluster traffic, with no RRP. A couple of months ago, I had a cluster
>>>> partition when VM live migration (also on the BCN) congested the
>>>> network. So I decided to enable RRP using the SN as backup, which has
>>>> been marginally successful.
>>>>
>>>>     Now, I want to switch to unicast (<cman transport="udpu" .../>),
>>>> RRP with the BCN as the primary ring and the SN as the backup, and a
>>>> proper iptables firewall. Is this sane?
>>>>
>>>>     When I stopped iptables entirely and started cman with unicast +
>>>> RRP, I saw this:
>>>>
>>>> ====] Node 1
>>>> Sep 11 17:31:24 node1 kernel: DLM (built Aug 10 2015 09:45:36)
>>>> installed
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [MAIN  ] Corosync Cluster
>>>> Engine
>>>> ('1.4.7'): started and ready to provide service.
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [MAIN  ] Corosync built-in
>>>> features: nss dbus rdma snmp
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [MAIN  ] Successfully read
>>>> config from /etc/cluster/cluster.conf
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [MAIN  ] Successfully parsed
>>>> cman config
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [TOTEM ] Initializing transport
>>>> (UDP/IP Unicast).
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [TOTEM ] Initializing
>>>> transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [TOTEM ] Initializing transport
>>>> (UDP/IP Unicast).
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [TOTEM ] Initializing
>>>> transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [TOTEM ] The network interface
>>>> [10.20.10.1] is now up.
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [QUORUM] Using quorum provider
>>>> quorum_cman
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [SERV  ] Service engine loaded:
>>>> corosync cluster quorum service v0.1
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [CMAN  ] CMAN 3.0.12.1 (built
>>>> Jul  6 2015 05:30:35) started
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [SERV  ] Service engine loaded:
>>>> corosync CMAN membership service 2.90
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [SERV  ] Service engine loaded:
>>>> openais checkpoint service B.01.01
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [SERV  ] Service engine loaded:
>>>> corosync extended virtual synchrony service
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [SERV  ] Service engine loaded:
>>>> corosync configuration service
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [SERV  ] Service engine loaded:
>>>> corosync cluster closed process group service v1.01
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [SERV  ] Service engine loaded:
>>>> corosync cluster config database access v1.01
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [SERV  ] Service engine loaded:
>>>> corosync profile loading service
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [QUORUM] Using quorum provider
>>>> quorum_cman
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [SERV  ] Service engine loaded:
>>>> corosync cluster quorum service v0.1
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [MAIN  ] Compatibility mode set
>>>> to whitetank.  Using V1 and V2 of the synchronization engine.
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [TOTEM ] adding new UDPU member
>>>> {10.20.10.1}
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [TOTEM ] adding new UDPU member
>>>> {10.20.10.2}
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [TOTEM ] The network interface
>>>> [10.10.10.1] is now up.
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [TOTEM ] adding new UDPU member
>>>> {10.10.10.1}
>>>> Sep 11 17:31:24 node1 corosync[2523]:   [TOTEM ] adding new UDPU member
>>>> {10.10.10.2}
>>>> Sep 11 17:31:27 node1 corosync[2523]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 1 iface 10.10.10.1 to [1 of 3]
>>>> Sep 11 17:31:27 node1 corosync[2523]:   [TOTEM ] A processor joined or
>>>> left the membership and a new membership was formed.
>>>> Sep 11 17:31:27 node1 corosync[2523]:   [CMAN  ] quorum regained,
>>>> resuming activity
>>>> Sep 11 17:31:27 node1 corosync[2523]:   [QUORUM] This node is within
>>>> the
>>>> primary component and will provide service.
>>>> Sep 11 17:31:27 node1 corosync[2523]:   [QUORUM] Members[1]: 1
>>>> Sep 11 17:31:27 node1 corosync[2523]:   [QUORUM] Members[1]: 1
>>>> Sep 11 17:31:27 node1 corosync[2523]:   [CPG   ] chosen downlist:
>>>> sender
>>>> r(0) ip(10.20.10.1) r(1) ip(10.10.10.1) ; members(old:0 left:0)
>>>> Sep 11 17:31:27 node1 corosync[2523]:   [MAIN  ] Completed service
>>>> synchronization, ready to provide service.
>>>> Sep 11 17:31:27 node1 corosync[2523]:   [TOTEM ] A processor joined or
>>>> left the membership and a new membership was formed.
>>>> Sep 11 17:31:27 node1 corosync[2523]:   [QUORUM] Members[2]: 1 2
>>>> Sep 11 17:31:27 node1 corosync[2523]:   [QUORUM] Members[2]: 1 2
>>>> Sep 11 17:31:27 node1 corosync[2523]:   [CPG   ] chosen downlist:
>>>> sender
>>>> r(0) ip(10.20.10.1) r(1) ip(10.10.10.1) ; members(old:1 left:0)
>>>> Sep 11 17:31:27 node1 corosync[2523]:   [MAIN  ] Completed service
>>>> synchronization, ready to provide service.
>>>> Sep 11 17:31:29 node1 corosync[2523]:   [TOTEM ] ring 1 active with no
>>>> faults
>>>> Sep 11 17:31:29 node1 fenced[2678]: fenced 3.0.12.1 started
>>>> Sep 11 17:31:29 node1 dlm_controld[2691]: dlm_controld 3.0.12.1 started
>>>> Sep 11 17:31:30 node1 gfs_controld[2755]: gfs_controld 3.0.12.1 started
>>>> ====
>>>>
>>>> ====] Node 2
>>>> Sep 11 17:31:23 node2 kernel: DLM (built Aug 10 2015 09:45:36)
>>>> installed
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [MAIN  ] Corosync Cluster
>>>> Engine
>>>> ('1.4.7'): started and ready to provide service.
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [MAIN  ] Corosync built-in
>>>> features: nss dbus rdma snmp
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [MAIN  ] Successfully read
>>>> config from /etc/cluster/cluster.conf
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [MAIN  ] Successfully parsed
>>>> cman config
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [TOTEM ] Initializing transport
>>>> (UDP/IP Unicast).
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [TOTEM ] Initializing
>>>> transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [TOTEM ] Initializing transport
>>>> (UDP/IP Unicast).
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [TOTEM ] Initializing
>>>> transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [TOTEM ] The network interface
>>>> [10.20.10.2] is now up.
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [QUORUM] Using quorum provider
>>>> quorum_cman
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [SERV  ] Service engine loaded:
>>>> corosync cluster quorum service v0.1
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [CMAN  ] CMAN 3.0.12.1 (built
>>>> Jul  6 2015 05:30:35) started
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [SERV  ] Service engine loaded:
>>>> corosync CMAN membership service 2.90
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [SERV  ] Service engine loaded:
>>>> openais checkpoint service B.01.01
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [SERV  ] Service engine loaded:
>>>> corosync extended virtual synchrony service
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [SERV  ] Service engine loaded:
>>>> corosync configuration service
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [SERV  ] Service engine loaded:
>>>> corosync cluster closed process group service v1.01
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [SERV  ] Service engine loaded:
>>>> corosync cluster config database access v1.01
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [SERV  ] Service engine loaded:
>>>> corosync profile loading service
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [QUORUM] Using quorum provider
>>>> quorum_cman
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [SERV  ] Service engine loaded:
>>>> corosync cluster quorum service v0.1
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [MAIN  ] Compatibility mode set
>>>> to whitetank.  Using V1 and V2 of the synchronization engine.
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [TOTEM ] adding new UDPU member
>>>> {10.20.10.1}
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [TOTEM ] adding new UDPU member
>>>> {10.20.10.2}
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [TOTEM ] The network interface
>>>> [10.10.10.2] is now up.
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [TOTEM ] adding new UDPU member
>>>> {10.10.10.1}
>>>> Sep 11 17:31:23 node2 corosync[2271]:   [TOTEM ] adding new UDPU member
>>>> {10.10.10.2}
>>>> Sep 11 17:31:26 node2 corosync[2271]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 1 iface 10.10.10.2 to [1 of 3]
>>>> Sep 11 17:31:26 node2 corosync[2271]:   [TOTEM ] A processor joined or
>>>> left the membership and a new membership was formed.
>>>> Sep 11 17:31:26 node2 corosync[2271]:   [CMAN  ] quorum regained,
>>>> resuming activity
>>>> Sep 11 17:31:26 node2 corosync[2271]:   [QUORUM] This node is within
>>>> the
>>>> primary component and will provide service.
>>>> Sep 11 17:31:26 node2 corosync[2271]:   [QUORUM] Members[1]: 2
>>>> Sep 11 17:31:26 node2 corosync[2271]:   [QUORUM] Members[1]: 2
>>>> Sep 11 17:31:26 node2 corosync[2271]:   [CPG   ] chosen downlist:
>>>> sender
>>>> r(0) ip(10.20.10.2) r(1) ip(10.10.10.2) ; members(old:0 left:0)
>>>> Sep 11 17:31:26 node2 corosync[2271]:   [MAIN  ] Completed service
>>>> synchronization, ready to provide service.
>>>> Sep 11 17:31:27 node2 corosync[2271]:   [TOTEM ] A processor joined or
>>>> left the membership and a new membership was formed.
>>>> Sep 11 17:31:27 node2 corosync[2271]:   [QUORUM] Members[2]: 1 2
>>>> Sep 11 17:31:27 node2 corosync[2271]:   [QUORUM] Members[2]: 1 2
>>>> Sep 11 17:31:27 node2 corosync[2271]:   [CPG   ] chosen downlist:
>>>> sender
>>>> r(0) ip(10.20.10.1) r(1) ip(10.10.10.1) ; members(old:1 left:0)
>>>> Sep 11 17:31:27 node2 corosync[2271]:   [MAIN  ] Completed service
>>>> synchronization, ready to provide service.
>>>> Sep 11 17:31:28 node2 corosync[2271]:   [TOTEM ] ring 1 active with no
>>>> faults
>>>> Sep 11 17:31:28 node2 fenced[2359]: fenced 3.0.12.1 started
>>>> Sep 11 17:31:28 node2 dlm_controld[2390]: dlm_controld 3.0.12.1 started
>>>> Sep 11 17:31:29 node2 gfs_controld[2442]: gfs_controld 3.0.12.1 started
>>>> ====
>>>>
>>>>
>>>> This looked good to me. So I wanted to test RRP by ifdown'ing bcn_bond1
>>>> on node 1 only, leaving bcn_bond1 up on node 2. The cluster survived and
>>>> seemed to fail over to the SN, but I saw this error repeatedly printed:
>>>>
>>>> ====] Node 1
>>>> Sep 11 17:31:46 node1 kernel: bcn_bond1: Removing slave bcn_link1
>>>> Sep 11 17:31:46 node1 kernel: bcn_bond1: Releasing active interface
>>>> bcn_link1
>>>> Sep 11 17:31:46 node1 kernel: bcn_bond1: the permanent HWaddr of
>>>> bcn_link1 - 52:54:00:b0:e4:c8 - is still in use by bcn_bond1 - set the
>>>> HWaddr of bcn_link1 to a different address to avoid conflicts
>>>> Sep 11 17:31:46 node1 kernel: bcn_bond1: making interface bcn_link2 the
>>>> new active one
>>>> Sep 11 17:31:46 node1 kernel: ICMPv6 NA: someone advertises our address
>>>> fe80:0000:0000:0000:5054:00ff:feb0:e4c8 on bcn_link1!
>>>> Sep 11 17:31:46 node1 kernel: bcn_bond1: Removing slave bcn_link2
>>>> Sep 11 17:31:46 node1 kernel: bcn_bond1: Releasing active interface
>>>> bcn_link2
>>>> Sep 11 17:31:48 node1 ntpd[2037]: Deleting interface #7 bcn_link1,
>>>> fe80::5054:ff:feb0:e4c8#123, interface stats: received=0, sent=0,
>>>> dropped=0, active_time=48987 secs
>>>> Sep 11 17:31:48 node1 ntpd[2037]: Deleting interface #6 bcn_bond1,
>>>> fe80::5054:ff:feb0:e4c8#123, interface stats: received=0, sent=0,
>>>> dropped=0, active_time=48987 secs
>>>> Sep 11 17:31:48 node1 ntpd[2037]: Deleting interface #3 bcn_bond1,
>>>> 10.20.10.1#123, interface stats: received=0, sent=0, dropped=0,
>>>> active_time=48987 secs
>>>> Sep 11 17:31:51 node1 corosync[2523]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 677 iface 10.20.10.1 to [1 of 3]
>>>> Sep 11 17:31:53 node1 corosync[2523]:   [TOTEM ] ring 0 active with no
>>>> faults
>>>> Sep 11 17:31:57 node1 corosync[2523]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 679 iface 10.20.10.1 to [1 of 3]
>>>> Sep 11 17:31:59 node1 corosync[2523]:   [TOTEM ] ring 0 active with no
>>>> faults
>>>> Sep 11 17:32:04 node1 corosync[2523]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 681 iface 10.20.10.1 to [1 of 3]
>>>> Sep 11 17:32:06 node1 corosync[2523]:   [TOTEM ] ring 0 active with no
>>>> faults
>>>> Sep 11 17:32:11 node1 corosync[2523]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 683 iface 10.20.10.1 to [1 of 3]
>>>> Sep 11 17:32:13 node1 corosync[2523]:   [TOTEM ] ring 0 active with no
>>>> faults
>>>> Sep 11 17:32:17 node1 corosync[2523]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 685 iface 10.20.10.1 to [1 of 3]
>>>> Sep 11 17:32:19 node1 corosync[2523]:   [TOTEM ] ring 0 active with no
>>>> faults
>>>> Sep 11 17:32:24 node1 corosync[2523]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 687 iface 10.20.10.1 to [1 of 3]
>>>> Sep 11 17:32:26 node1 corosync[2523]:   [TOTEM ] ring 0 active with no
>>>> faults
>>>> Sep 11 17:32:31 node1 corosync[2523]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 689 iface 10.20.10.1 to [1 of 3]
>>>> Sep 11 17:32:33 node1 corosync[2523]:   [TOTEM ] ring 0 active with no
>>>> faults
>>>> Sep 11 17:32:37 node1 corosync[2523]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 691 iface 10.20.10.1 to [1 of 3]
>>>> Sep 11 17:32:39 node1 corosync[2523]:   [TOTEM ] ring 0 active with no
>>>> faults
>>>> Sep 11 17:32:44 node1 corosync[2523]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 693 iface 10.20.10.1 to [1 of 3]
>>>> Sep 11 17:32:46 node1 corosync[2523]:   [TOTEM ] ring 0 active with no
>>>> faults
>>>> ====
>>>>
>>>> ====] Node 2
>>>> Sep 11 17:31:48 node2 corosync[2271]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 676 iface 10.20.10.2 to [1 of 3]
>>>> Sep 11 17:31:50 node2 corosync[2271]:   [TOTEM ] ring 0 active with no
>>>> faults
>>>> Sep 11 17:31:54 node2 corosync[2271]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 678 iface 10.20.10.2 to [1 of 3]
>>>> Sep 11 17:31:56 node2 corosync[2271]:   [TOTEM ] ring 0 active with no
>>>> faults
>>>> Sep 11 17:32:01 node2 corosync[2271]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 680 iface 10.20.10.2 to [1 of 3]
>>>> Sep 11 17:32:03 node2 corosync[2271]:   [TOTEM ] ring 0 active with no
>>>> faults
>>>> Sep 11 17:32:08 node2 corosync[2271]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 682 iface 10.20.10.2 to [1 of 3]
>>>> Sep 11 17:32:10 node2 corosync[2271]:   [TOTEM ] ring 0 active with no
>>>> faults
>>>> Sep 11 17:32:14 node2 corosync[2271]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 684 iface 10.20.10.2 to [1 of 3]
>>>> Sep 11 17:32:16 node2 corosync[2271]:   [TOTEM ] ring 0 active with no
>>>> faults
>>>> Sep 11 17:32:21 node2 corosync[2271]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 686 iface 10.20.10.2 to [1 of 3]
>>>> Sep 11 17:32:23 node2 corosync[2271]:   [TOTEM ] ring 0 active with no
>>>> faults
>>>> Sep 11 17:32:28 node2 corosync[2271]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 688 iface 10.20.10.2 to [1 of 3]
>>>> Sep 11 17:32:30 node2 corosync[2271]:   [TOTEM ] ring 0 active with no
>>>> faults
>>>> Sep 11 17:32:35 node2 corosync[2271]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 690 iface 10.20.10.2 to [1 of 3]
>>>> Sep 11 17:32:37 node2 corosync[2271]:   [TOTEM ] ring 0 active with no
>>>> faults
>>>> Sep 11 17:32:41 node2 corosync[2271]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 692 iface 10.20.10.2 to [1 of 3]
>>>> Sep 11 17:32:43 node2 corosync[2271]:   [TOTEM ] ring 0 active with no
>>>> faults
>>>> Sep 11 17:32:48 node2 corosync[2271]:   [TOTEM ] Incrementing problem
>>>> counter for seqid 694 iface 10.20.10.2 to [1 of 3]
>>>> Sep 11 17:32:50 node2 corosync[2271]:   [TOTEM ] ring 0 active with no
>>>> faults
>>>> ====
>>>>
>>>> When I ifup'ed bcn_bond1 on node1, the messages stopped printing. So
>>>> before I even start on iptables, I am curious if I am doing something
>>>> incorrect here.
>>>>
>>>> Advice?
>>>
>>> Don't do ifdown. Corosync reacts to ifdown very badly (a long-known
>>> issue, and one of the reasons for knet in a future version).
>>
>> Is knet in the future "official" now? I know it's been talked about for
>> a while (and I am hopeful that does become the case).
> 
> Yep, it's "official". We don't have any other (reasonable) choice. RRP
> fails in many cases and fixing it would mean redesign/reimplementation,
> so it's just better to use knet. The only real problem seems to be with
> dlm.
> 
> On the other hand, knet is not exactly the highest priority for now. I
> expect it to become the highest priority in approx. one year.
> 
> Honza

Excellent!

I'm very much looking forward to that. My old knet notes will get
dusted off. :D

https://alteeve.ca/w/Kronosnet

>>> Also, rrp active is not as well tested as passive, so give passive a
>>> try.
>>>
>>> Honza
>>
>> Ah, the 'don't ifdown' now rings a bell. So OK, I'll use iptables to
>> drop all traffic instead. Also, I will switch to passive.
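
For anyone who wants to try the same test, here is a minimal sketch of
dropping traffic on one ring with iptables rather than ifdown. It is only
an illustration under assumptions: the helper names (run, block_iface,
unblock_iface) and the DRY_RUN switch are made up for this example, the
rules match on the bond device name, and they are inserted at the top of
the INPUT and OUTPUT chains. By default it only prints the commands, so
they can be reviewed before being run as root.

```shell
#!/bin/bash
# Sketch: simulate failure of one corosync ring with iptables instead of
# ifdown. Helper names and DRY_RUN are hypothetical, not from any tool.

# Print the command (DRY_RUN=1, the default) or actually run it (DRY_RUN=0).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Drop everything entering or leaving the given interface.
block_iface() {
    run iptables -I INPUT -i "$1" -j DROP
    run iptables -I OUTPUT -o "$1" -j DROP
}

# Remove the two rules added by block_iface.
unblock_iface() {
    run iptables -D INPUT -i "$1" -j DROP
    run iptables -D OUTPUT -o "$1" -j DROP
}
```

Something like `DRY_RUN=0 block_iface bcn_bond1` on one node, then
`corosync-cfgtool -s` to watch the ring status, and
`DRY_RUN=0 unblock_iface bcn_bond1` to restore traffic, should exercise
the failover while leaving the interface itself up.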

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?