[ClusterLabs] EL6, cman, rrp, unicast and iptables

Vladislav Bogdanov bubble at hoster-ok.com
Fri Sep 11 20:30:29 UTC 2015


On 11 September 2015 21:01:30 GMT+03:00, Digimer <lists at alteeve.ca> wrote:
>On 11/09/15 01:42 PM, Digimer wrote:
>> Hi all,
>> 
>>   Starting a new thread from the "Clustered LVM with iptables issue"
>> thread...
>> 
>>   I've decided to review how I do networking entirely in my cluster.
>> I make zero claims to being great at networks, so I would love some
>> feedback.
>> 
>>   I've got three active/passive bonded interfaces: Back-Channel,
>> Storage and Internet-Facing networks. The IFN is "off limits" to the
>> cluster as it is dedicated to hosted server traffic only.
>> 
>>   So before, I used only the BCN for cluster traffic (cman/corosync
>> multicast), no RRP. A couple of months ago, I had a cluster partition
>> when VM live migration (also on the BCN) congested the network. So I
>> decided to enable RRP using the SN as backup, which has been
>> marginally successful.
>> 
>>   Now, I want to switch to unicast (<cman transport="udpu">), RRP with
>> the SN as the backup and the BCN as the primary ring, and do a proper
>> iptables firewall. Is this sane?
>> 
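For what it's worth, the udpu + RRP part of cluster.conf would look
roughly like the sketch below. It is only a sketch: the node names and
the <altname> hostnames (the altname is what puts ring 1 on the SN) are
placeholders rather than values from this post, and fencing and the
rest of the configuration are omitted.

  <cluster name="example-cluster" config_version="1">
    <!-- unicast totem transport; two_node/expected_votes as usual
         for a two-node cman cluster -->
    <cman transport="udpu" two_node="1" expected_votes="1"/>
    <!-- redundant ring mode; see corosync.conf(5) for active vs. passive -->
    <totem rrp_mode="passive"/>
    <clusternodes>
      <clusternode name="node1.bcn" nodeid="1">
        <!-- altname = this node's name/address on the second ring (SN) -->
        <altname name="node1.sn"/>
      </clusternode>
      <clusternode name="node2.bcn" nodeid="2">
        <altname name="node2.sn"/>
      </clusternode>
    </clusternodes>
  </cluster>
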
>>   When I stopped iptables entirely and started cman with unicast +
>> RRP, I saw this:
>> 
>> ====] Node 1
>> Sep 11 17:31:24 node1 kernel: DLM (built Aug 10 2015 09:45:36) installed
>> Sep 11 17:31:24 node1 corosync[2523]:   [MAIN  ] Corosync Cluster Engine ('1.4.7'): started and ready to provide service.
>> Sep 11 17:31:24 node1 corosync[2523]:   [MAIN  ] Corosync built-in features: nss dbus rdma snmp
>> Sep 11 17:31:24 node1 corosync[2523]:   [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
>> Sep 11 17:31:24 node1 corosync[2523]:   [MAIN  ] Successfully parsed cman config
>> Sep 11 17:31:24 node1 corosync[2523]:   [TOTEM ] Initializing transport (UDP/IP Unicast).
>> Sep 11 17:31:24 node1 corosync[2523]:   [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
>> Sep 11 17:31:24 node1 corosync[2523]:   [TOTEM ] Initializing transport (UDP/IP Unicast).
>> Sep 11 17:31:24 node1 corosync[2523]:   [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
>> Sep 11 17:31:24 node1 corosync[2523]:   [TOTEM ] The network interface [10.20.10.1] is now up.
>> Sep 11 17:31:24 node1 corosync[2523]:   [QUORUM] Using quorum provider quorum_cman
>> Sep 11 17:31:24 node1 corosync[2523]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
>> Sep 11 17:31:24 node1 corosync[2523]:   [CMAN  ] CMAN 3.0.12.1 (built Jul  6 2015 05:30:35) started
>> Sep 11 17:31:24 node1 corosync[2523]:   [SERV  ] Service engine loaded: corosync CMAN membership service 2.90
>> Sep 11 17:31:24 node1 corosync[2523]:   [SERV  ] Service engine loaded: openais checkpoint service B.01.01
>> Sep 11 17:31:24 node1 corosync[2523]:   [SERV  ] Service engine loaded: corosync extended virtual synchrony service
>> Sep 11 17:31:24 node1 corosync[2523]:   [SERV  ] Service engine loaded: corosync configuration service
>> Sep 11 17:31:24 node1 corosync[2523]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
>> Sep 11 17:31:24 node1 corosync[2523]:   [SERV  ] Service engine loaded: corosync cluster config database access v1.01
>> Sep 11 17:31:24 node1 corosync[2523]:   [SERV  ] Service engine loaded: corosync profile loading service
>> Sep 11 17:31:24 node1 corosync[2523]:   [QUORUM] Using quorum provider quorum_cman
>> Sep 11 17:31:24 node1 corosync[2523]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
>> Sep 11 17:31:24 node1 corosync[2523]:   [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
>> Sep 11 17:31:24 node1 corosync[2523]:   [TOTEM ] adding new UDPU member {10.20.10.1}
>> Sep 11 17:31:24 node1 corosync[2523]:   [TOTEM ] adding new UDPU member {10.20.10.2}
>> Sep 11 17:31:24 node1 corosync[2523]:   [TOTEM ] The network interface [10.10.10.1] is now up.
>> Sep 11 17:31:24 node1 corosync[2523]:   [TOTEM ] adding new UDPU member {10.10.10.1}
>> Sep 11 17:31:24 node1 corosync[2523]:   [TOTEM ] adding new UDPU member {10.10.10.2}
>> Sep 11 17:31:27 node1 corosync[2523]:   [TOTEM ] Incrementing problem counter for seqid 1 iface 10.10.10.1 to [1 of 3]
>> Sep 11 17:31:27 node1 corosync[2523]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Sep 11 17:31:27 node1 corosync[2523]:   [CMAN  ] quorum regained, resuming activity
>> Sep 11 17:31:27 node1 corosync[2523]:   [QUORUM] This node is within the primary component and will provide service.
>> Sep 11 17:31:27 node1 corosync[2523]:   [QUORUM] Members[1]: 1
>> Sep 11 17:31:27 node1 corosync[2523]:   [QUORUM] Members[1]: 1
>> Sep 11 17:31:27 node1 corosync[2523]:   [CPG   ] chosen downlist: sender r(0) ip(10.20.10.1) r(1) ip(10.10.10.1) ; members(old:0 left:0)
>> Sep 11 17:31:27 node1 corosync[2523]:   [MAIN  ] Completed service synchronization, ready to provide service.
>> Sep 11 17:31:27 node1 corosync[2523]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Sep 11 17:31:27 node1 corosync[2523]:   [QUORUM] Members[2]: 1 2
>> Sep 11 17:31:27 node1 corosync[2523]:   [QUORUM] Members[2]: 1 2
>> Sep 11 17:31:27 node1 corosync[2523]:   [CPG   ] chosen downlist: sender r(0) ip(10.20.10.1) r(1) ip(10.10.10.1) ; members(old:1 left:0)
>> Sep 11 17:31:27 node1 corosync[2523]:   [MAIN  ] Completed service synchronization, ready to provide service.
>> Sep 11 17:31:29 node1 corosync[2523]:   [TOTEM ] ring 1 active with no faults
>> Sep 11 17:31:29 node1 fenced[2678]: fenced 3.0.12.1 started
>> Sep 11 17:31:29 node1 dlm_controld[2691]: dlm_controld 3.0.12.1 started
>> Sep 11 17:31:30 node1 gfs_controld[2755]: gfs_controld 3.0.12.1 started
>> ====
>> 
>> ====] Node 2
>> Sep 11 17:31:23 node2 kernel: DLM (built Aug 10 2015 09:45:36) installed
>> Sep 11 17:31:23 node2 corosync[2271]:   [MAIN  ] Corosync Cluster Engine ('1.4.7'): started and ready to provide service.
>> Sep 11 17:31:23 node2 corosync[2271]:   [MAIN  ] Corosync built-in features: nss dbus rdma snmp
>> Sep 11 17:31:23 node2 corosync[2271]:   [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
>> Sep 11 17:31:23 node2 corosync[2271]:   [MAIN  ] Successfully parsed cman config
>> Sep 11 17:31:23 node2 corosync[2271]:   [TOTEM ] Initializing transport (UDP/IP Unicast).
>> Sep 11 17:31:23 node2 corosync[2271]:   [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
>> Sep 11 17:31:23 node2 corosync[2271]:   [TOTEM ] Initializing transport (UDP/IP Unicast).
>> Sep 11 17:31:23 node2 corosync[2271]:   [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
>> Sep 11 17:31:23 node2 corosync[2271]:   [TOTEM ] The network interface [10.20.10.2] is now up.
>> Sep 11 17:31:23 node2 corosync[2271]:   [QUORUM] Using quorum provider quorum_cman
>> Sep 11 17:31:23 node2 corosync[2271]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
>> Sep 11 17:31:23 node2 corosync[2271]:   [CMAN  ] CMAN 3.0.12.1 (built Jul  6 2015 05:30:35) started
>> Sep 11 17:31:23 node2 corosync[2271]:   [SERV  ] Service engine loaded: corosync CMAN membership service 2.90
>> Sep 11 17:31:23 node2 corosync[2271]:   [SERV  ] Service engine loaded: openais checkpoint service B.01.01
>> Sep 11 17:31:23 node2 corosync[2271]:   [SERV  ] Service engine loaded: corosync extended virtual synchrony service
>> Sep 11 17:31:23 node2 corosync[2271]:   [SERV  ] Service engine loaded: corosync configuration service
>> Sep 11 17:31:23 node2 corosync[2271]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
>> Sep 11 17:31:23 node2 corosync[2271]:   [SERV  ] Service engine loaded: corosync cluster config database access v1.01
>> Sep 11 17:31:23 node2 corosync[2271]:   [SERV  ] Service engine loaded: corosync profile loading service
>> Sep 11 17:31:23 node2 corosync[2271]:   [QUORUM] Using quorum provider quorum_cman
>> Sep 11 17:31:23 node2 corosync[2271]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
>> Sep 11 17:31:23 node2 corosync[2271]:   [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
>> Sep 11 17:31:23 node2 corosync[2271]:   [TOTEM ] adding new UDPU member {10.20.10.1}
>> Sep 11 17:31:23 node2 corosync[2271]:   [TOTEM ] adding new UDPU member {10.20.10.2}
>> Sep 11 17:31:23 node2 corosync[2271]:   [TOTEM ] The network interface [10.10.10.2] is now up.
>> Sep 11 17:31:23 node2 corosync[2271]:   [TOTEM ] adding new UDPU member {10.10.10.1}
>> Sep 11 17:31:23 node2 corosync[2271]:   [TOTEM ] adding new UDPU member {10.10.10.2}
>> Sep 11 17:31:26 node2 corosync[2271]:   [TOTEM ] Incrementing problem counter for seqid 1 iface 10.10.10.2 to [1 of 3]
>> Sep 11 17:31:26 node2 corosync[2271]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Sep 11 17:31:26 node2 corosync[2271]:   [CMAN  ] quorum regained, resuming activity
>> Sep 11 17:31:26 node2 corosync[2271]:   [QUORUM] This node is within the primary component and will provide service.
>> Sep 11 17:31:26 node2 corosync[2271]:   [QUORUM] Members[1]: 2
>> Sep 11 17:31:26 node2 corosync[2271]:   [QUORUM] Members[1]: 2
>> Sep 11 17:31:26 node2 corosync[2271]:   [CPG   ] chosen downlist: sender r(0) ip(10.20.10.2) r(1) ip(10.10.10.2) ; members(old:0 left:0)
>> Sep 11 17:31:26 node2 corosync[2271]:   [MAIN  ] Completed service synchronization, ready to provide service.
>> Sep 11 17:31:27 node2 corosync[2271]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Sep 11 17:31:27 node2 corosync[2271]:   [QUORUM] Members[2]: 1 2
>> Sep 11 17:31:27 node2 corosync[2271]:   [QUORUM] Members[2]: 1 2
>> Sep 11 17:31:27 node2 corosync[2271]:   [CPG   ] chosen downlist: sender r(0) ip(10.20.10.1) r(1) ip(10.10.10.1) ; members(old:1 left:0)
>> Sep 11 17:31:27 node2 corosync[2271]:   [MAIN  ] Completed service synchronization, ready to provide service.
>> Sep 11 17:31:28 node2 corosync[2271]:   [TOTEM ] ring 1 active with no faults
>> Sep 11 17:31:28 node2 fenced[2359]: fenced 3.0.12.1 started
>> Sep 11 17:31:28 node2 dlm_controld[2390]: dlm_controld 3.0.12.1 started
>> Sep 11 17:31:29 node2 gfs_controld[2442]: gfs_controld 3.0.12.1 started
>> ====
>> 
>> 
>> This looked good to me. So I wanted to test RRP by ifdown'ing
>> bcn_bond1 on node 1 only, leaving bcn_bond1 up on node2. The cluster
>> survived and seemed to fail over to the SN, but I saw this error
>> repeatedly printed:
>> 
>> ====] Node 1
>> Sep 11 17:31:46 node1 kernel: bcn_bond1: Removing slave bcn_link1
>> Sep 11 17:31:46 node1 kernel: bcn_bond1: Releasing active interface bcn_link1
>> Sep 11 17:31:46 node1 kernel: bcn_bond1: the permanent HWaddr of bcn_link1 - 52:54:00:b0:e4:c8 - is still in use by bcn_bond1 - set the HWaddr of bcn_link1 to a different address to avoid conflicts
>> Sep 11 17:31:46 node1 kernel: bcn_bond1: making interface bcn_link2 the new active one
>> Sep 11 17:31:46 node1 kernel: ICMPv6 NA: someone advertises our address fe80:0000:0000:0000:5054:00ff:feb0:e4c8 on bcn_link1!
>> Sep 11 17:31:46 node1 kernel: bcn_bond1: Removing slave bcn_link2
>> Sep 11 17:31:46 node1 kernel: bcn_bond1: Releasing active interface bcn_link2
>> Sep 11 17:31:48 node1 ntpd[2037]: Deleting interface #7 bcn_link1, fe80::5054:ff:feb0:e4c8#123, interface stats: received=0, sent=0, dropped=0, active_time=48987 secs
>> Sep 11 17:31:48 node1 ntpd[2037]: Deleting interface #6 bcn_bond1, fe80::5054:ff:feb0:e4c8#123, interface stats: received=0, sent=0, dropped=0, active_time=48987 secs
>> Sep 11 17:31:48 node1 ntpd[2037]: Deleting interface #3 bcn_bond1, 10.20.10.1#123, interface stats: received=0, sent=0, dropped=0, active_time=48987 secs
>> Sep 11 17:31:51 node1 corosync[2523]:   [TOTEM ] Incrementing problem counter for seqid 677 iface 10.20.10.1 to [1 of 3]
>> Sep 11 17:31:53 node1 corosync[2523]:   [TOTEM ] ring 0 active with no faults
>> Sep 11 17:31:57 node1 corosync[2523]:   [TOTEM ] Incrementing problem counter for seqid 679 iface 10.20.10.1 to [1 of 3]
>> Sep 11 17:31:59 node1 corosync[2523]:   [TOTEM ] ring 0 active with no faults
>> Sep 11 17:32:04 node1 corosync[2523]:   [TOTEM ] Incrementing problem counter for seqid 681 iface 10.20.10.1 to [1 of 3]
>> Sep 11 17:32:06 node1 corosync[2523]:   [TOTEM ] ring 0 active with no faults
>> Sep 11 17:32:11 node1 corosync[2523]:   [TOTEM ] Incrementing problem counter for seqid 683 iface 10.20.10.1 to [1 of 3]
>> Sep 11 17:32:13 node1 corosync[2523]:   [TOTEM ] ring 0 active with no faults
>> Sep 11 17:32:17 node1 corosync[2523]:   [TOTEM ] Incrementing problem counter for seqid 685 iface 10.20.10.1 to [1 of 3]
>> Sep 11 17:32:19 node1 corosync[2523]:   [TOTEM ] ring 0 active with no faults
>> Sep 11 17:32:24 node1 corosync[2523]:   [TOTEM ] Incrementing problem counter for seqid 687 iface 10.20.10.1 to [1 of 3]
>> Sep 11 17:32:26 node1 corosync[2523]:   [TOTEM ] ring 0 active with no faults
>> Sep 11 17:32:31 node1 corosync[2523]:   [TOTEM ] Incrementing problem counter for seqid 689 iface 10.20.10.1 to [1 of 3]
>> Sep 11 17:32:33 node1 corosync[2523]:   [TOTEM ] ring 0 active with no faults
>> Sep 11 17:32:37 node1 corosync[2523]:   [TOTEM ] Incrementing problem counter for seqid 691 iface 10.20.10.1 to [1 of 3]
>> Sep 11 17:32:39 node1 corosync[2523]:   [TOTEM ] ring 0 active with no faults
>> Sep 11 17:32:44 node1 corosync[2523]:   [TOTEM ] Incrementing problem counter for seqid 693 iface 10.20.10.1 to [1 of 3]
>> Sep 11 17:32:46 node1 corosync[2523]:   [TOTEM ] ring 0 active with no faults
>> ====
>> 
>> ====] Node 2
>> Sep 11 17:31:48 node2 corosync[2271]:   [TOTEM ] Incrementing problem counter for seqid 676 iface 10.20.10.2 to [1 of 3]
>> Sep 11 17:31:50 node2 corosync[2271]:   [TOTEM ] ring 0 active with no faults
>> Sep 11 17:31:54 node2 corosync[2271]:   [TOTEM ] Incrementing problem counter for seqid 678 iface 10.20.10.2 to [1 of 3]
>> Sep 11 17:31:56 node2 corosync[2271]:   [TOTEM ] ring 0 active with no faults
>> Sep 11 17:32:01 node2 corosync[2271]:   [TOTEM ] Incrementing problem counter for seqid 680 iface 10.20.10.2 to [1 of 3]
>> Sep 11 17:32:03 node2 corosync[2271]:   [TOTEM ] ring 0 active with no faults
>> Sep 11 17:32:08 node2 corosync[2271]:   [TOTEM ] Incrementing problem counter for seqid 682 iface 10.20.10.2 to [1 of 3]
>> Sep 11 17:32:10 node2 corosync[2271]:   [TOTEM ] ring 0 active with no faults
>> Sep 11 17:32:14 node2 corosync[2271]:   [TOTEM ] Incrementing problem counter for seqid 684 iface 10.20.10.2 to [1 of 3]
>> Sep 11 17:32:16 node2 corosync[2271]:   [TOTEM ] ring 0 active with no faults
>> Sep 11 17:32:21 node2 corosync[2271]:   [TOTEM ] Incrementing problem counter for seqid 686 iface 10.20.10.2 to [1 of 3]
>> Sep 11 17:32:23 node2 corosync[2271]:   [TOTEM ] ring 0 active with no faults
>> Sep 11 17:32:28 node2 corosync[2271]:   [TOTEM ] Incrementing problem counter for seqid 688 iface 10.20.10.2 to [1 of 3]
>> Sep 11 17:32:30 node2 corosync[2271]:   [TOTEM ] ring 0 active with no faults
>> Sep 11 17:32:35 node2 corosync[2271]:   [TOTEM ] Incrementing problem counter for seqid 690 iface 10.20.10.2 to [1 of 3]
>> Sep 11 17:32:37 node2 corosync[2271]:   [TOTEM ] ring 0 active with no faults
>> Sep 11 17:32:41 node2 corosync[2271]:   [TOTEM ] Incrementing problem counter for seqid 692 iface 10.20.10.2 to [1 of 3]
>> Sep 11 17:32:43 node2 corosync[2271]:   [TOTEM ] ring 0 active with no faults
>> Sep 11 17:32:48 node2 corosync[2271]:   [TOTEM ] Incrementing problem counter for seqid 694 iface 10.20.10.2 to [1 of 3]
>> Sep 11 17:32:50 node2 corosync[2271]:   [TOTEM ] ring 0 active with no faults
>> ====
>> 
>> When I ifup'ed bcn_bond1 on node1, the messages stopped printing. So
>> before I even start on iptables, I am curious if I am doing something
>> incorrect here.
>> 
>> Advice?
>> 
>> Thanks!
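
One aside on the RRP test itself: it helps to look at the ring state
directly with the standard corosync/cman tools while bcn_bond1 is down,
rather than relying only on the syslog messages. A quick sketch (output
format will vary):

  # show both rings and their fault status as corosync sees them
  corosync-cfgtool -s

  # overall cman membership/quorum view
  cman_tool status

  # once the interface is back up, reset a ring that was marked faulty
  corosync-cfgtool -r

Also keep in mind that ifdown'ing the bond removes the IP address
corosync is bound to, which is not quite the same failure as a pulled
cable or a dead switch, so a test done this way should be read with
some caution.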
>
>According to this:
>
>https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/s1-config-network-conga-CA.html
>
>Unicast + GFS2 is NOT recommended. So maybe that idea is already out
>the window?

That advice is, hmmm... weird. The dlm and gfs control daemons use corosync/cman only for membership and quorum. Everything else is done directly in the kernel, which is unaware of what corosync is and of which transport it uses.
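
As for the iptables side of the plan: corosync needs its totem UDP
ports open on both ring networks, and DLM needs its own TCP port open
between the nodes no matter which totem transport corosync uses. A
rough sketch only -- the ports are the usual EL6 defaults and the /24
masks are assumptions here, so adjust to the real addressing:

  # corosync/cman totem traffic on the BCN (ring 0) and SN (ring 1)
  iptables -A INPUT -s 10.20.10.0/24 -p udp --dport 5404:5405 -j ACCEPT
  iptables -A INPUT -s 10.10.10.0/24 -p udp --dport 5404:5405 -j ACCEPT

  # DLM inter-node traffic (in-kernel, TCP port 21064 by default)
  iptables -A INPUT -s 10.20.10.0/24 -p tcp --dport 21064 -j ACCEPT
  iptables -A INPUT -s 10.10.10.0/24 -p tcp --dport 21064 -j ACCEPT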



