[ClusterLabs] Re: [EXT] Cluster unable to find back together

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Thu May 19 05:00:31 EDT 2022


I have no knet experience, but the symptoms really sound odd.

>>> "Leditzky, Fabian via Users" <users at clusterlabs.org> schrieb am 19.05.2022
um
10:16 in Nachricht
<CO1PR08MB6707CD476BBA2D4FA52F891896D09 at CO1PR08MB6707.namprd08.prod.outlook.com>

> Hello
> 
> We have been dealing with our pacemaker/corosync clusters becoming unstable.
> The OS is Debian 10 and we use Debian packages for pacemaker and corosync,
> versions 2.0.1-5+deb10u1 and 3.0.1-2+deb10u1 respectively.
> We use knet over UDP transport.
> 
> We run multiple 2-node and 4-8 node clusters, primarily managing VIP
> resources.
> The issue we experience presents itself as a spontaneous disagreement about
> the status of cluster members. In two-node clusters, each node spontaneously
> sees the other node as offline, despite network connectivity being OK.
> In larger clusters, the status can be inconsistent across the nodes.
> E.g. node 1 sees 2 and 4 as offline, node 2 sees 1 and 4 as offline, while
> nodes 3 and 4 see every node as online.
> The cluster becomes generally unresponsive to resource actions in this 
> state.
> Thus far we have been unable to restore cluster health without restarting 
> corosync.
> 
> We are running packet captures 24/7 on the clusters and have custom tooling
> to detect lost UDP packets on knet ports. So far we could not see significant
> packet loss trigger an event; at most we have seen a single UDP packet
> dropped some seconds before the cluster fails.
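>
> To illustrate the capture setup (only a sketch, not our exact tooling; 5405 is
> corosync's default knet port and eth0 is the cluster interface from our
> config):
>
> # write hourly rotated pcaps of knet traffic on the cluster interface
> tcpdump -i eth0 -G 3600 -w /var/tmp/knet-%Y%m%d-%H%M.pcap udp port 5405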
> 
> However, even if the root cause is indeed a flaky network, we do not
> understand why the cluster cannot recover on its own in any way. The issues
> definitely persist beyond the presence of any intermittent network problem.
> 
> We were able to artificially break clusters by inducing packet loss with an
> iptables rule. Dropping packets on a single node of an 8-node cluster can
> cause malfunctions on multiple other cluster nodes. The expected behavior
> would be detecting that the artificially broken node has failed while keeping
> the rest of the cluster stable. We were also able to reproduce this on
> Debian 11 with more recent corosync/pacemaker versions.
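>
> To give an idea of the kind of rule (a sketch only; the probability and port
> are just examples, 5405 being the default knet port):
>
> # on one node, randomly drop ~30% of inbound corosync/knet packets
> iptables -A INPUT -p udp --dport 5405 -m statistic --mode random --probability 0.3 -j DROP
> # delete the same rule again to end the test
> iptables -D INPUT -p udp --dport 5405 -m statistic --mode random --probability 0.3 -j DROP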
> 
> Our configuration is basic; we do not significantly deviate from the defaults.
> 
> We will be very grateful for any insights into this problem.
> 
> Thanks,
> Fabian
> 
> // corosync.conf
> totem {
>     version: 2
>     cluster_name: cluster01
>     crypto_cipher: aes256
>     crypto_hash: sha512
>     transport: knet
> }
> logging {
>     fileline: off
>     to_stderr: no
>     to_logfile: no
>     to_syslog: yes
>     debug: off
>     timestamp: on
>     logger_subsys {
>         subsys: QUORUM
>         debug: off
>     }
> }
> quorum {
>     provider: corosync_votequorum
>     two_node: 1
>     expected_votes: 2
> }
> nodelist {
>     node {
>         name: node01
>         nodeid: 01
>         ring0_addr: 10.0.0.10
>     }
>     node {
>         name: node02
>         nodeid: 02
>         ring0_addr: 10.0.0.11
>     }
> }
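>
> (We have not tuned any of the knet link-health parameters; purely for
> reference, corosync.conf(5) exposes them in the totem/interface section,
> roughly as below. The values are placeholders, not something we run.)
>
> // example only, not part of our config
> totem {
>     token: 3000                    # token timeout in ms (placeholder)
>     interface {
>         linknumber: 0
>         knet_ping_interval: 500    # ms between link heartbeat pings
>         knet_ping_timeout: 1000    # ms without a pong before the link is marked down
>         knet_pong_count: 2         # pongs required before a link is marked up again
>     }
> }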
> 
> // crm config show
> node 1: node01 \
>     attributes standby=off
> node 2: node02 \
>     attributes standby=off maintenance=off
> primitive IP-clusterC1 IPaddr2 \
>     params ip=10.0.0.20 nic=eth0 cidr_netmask=24 \
>     meta migration-threshold=2 target-role=Started is-managed=true \
>     op monitor interval=20 timeout=60 on-fail=restart
> primitive IP-clusterC2 IPaddr2 \
>     params ip=10.0.0.21 nic=eth0 cidr_netmask=24 \
>     meta migration-threshold=2 target-role=Started is-managed=true \
>     op monitor interval=20 timeout=60 on-fail=restart
> location STICKY-IP-clusterC1 IP-clusterC1 100: node01
> location STICKY-IP-clusterC2 IP-clusterC2 100: node02
> property cib-bootstrap-options: \
>     have-watchdog=false \
>     dc-version=2.0.1-9e909a5bdd \
>     cluster-infrastructure=corosync \
>     cluster-name=cluster01 \
>     stonith-enabled=no \
>     no-quorum-policy=ignore \
>     last-lrm-refresh=1632230917
> 
> 
