[ClusterLabs] Cluster unable to find back together

Ken Gaillot kgaillot at redhat.com
Thu May 19 10:33:33 EDT 2022


Also, be sure to configure and test fencing. If corosync is having
trouble, fencing is the only way for Pacemaker to recover.
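
As a rough sketch, fencing could be set up with crmsh along these
lines (fence_ipmilan, the address and the credentials below are only
placeholders; the agent and its parameters depend entirely on your
hardware, and note the posted configuration has stonith-enabled=no):

    # illustrative only: pick the fence agent that matches your hardware
    crm configure primitive fence-node01 stonith:fence_ipmilan \
        params ip=10.0.0.100 username=admin password=secret \
               lanplus=true pcmk_host_list=node01 \
        op monitor interval=60s
    crm configure property stonith-enabled=true

A similar primitive would be needed for node02, and the agent should
be tested (e.g. with stonith_admin --reboot node01) before relying on
it.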

Check for anything unusual in the system logs around the times of
interest, like processes not being scheduled (possibly indicating a CPU
issue), corosync token timeouts, etc.

On Thu, 2022-05-19 at 14:55 +0200, Jan Friesse wrote:
> Hi,
> 
> On 19/05/2022 10:16, Leditzky, Fabian via Users wrote:
> > Hello
> > 
> > We have been dealing with our pacemaker/corosync clusters becoming
> > unstable.
> > The OS is Debian 10 and we use Debian packages for pacemaker and
> > corosync,
> > version 3.0.1-5+deb10u1 and 3.0.1-2+deb10u1 respectively.
> 
> Seems like the pcmk version is not so important for the behavior
> you've described. Corosync 3.0.1 is very old; are you able to
> reproduce the behavior with 3.1.6? What is the version of knet? There
> have been quite a few fixes, so the latest (1.23) is really
> recommended.
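> 
> For example (package names as on Debian), the versions currently in
> use can be checked with:
> 
>      corosync -v
>      dpkg -s corosync libknet1 | grep -E '^(Package|Version)'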
> 
> You can try to compile it yourself, or use the Proxmox repo
> (http://download.proxmox.com/debian/pve/), which contains newer
> versions of the packages.
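> 
> On Debian 10, for instance, that repo can be added roughly like this
> (the release name and key handling are assumptions; check the Proxmox
> documentation for the exact release key to import):
> 
>      echo "deb http://download.proxmox.com/debian/pve buster pve-no-subscription" \
>          > /etc/apt/sources.list.d/pve.list
>      # import the Proxmox release key, then:
>      apt update && apt install --only-upgrade corosync libknet1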
> 
> > We use knet over UDP transport.
> > 
> > We run multiple 2-node and 4-8 node clusters, primarily managing
> > VIP resources.
> > The issue we experience presents itself as a spontaneous
> > disagreement about the status of cluster members. In two-node
> > clusters, each node spontaneously sees the other node as offline,
> > despite network connectivity being OK. In larger clusters, the
> > status can be inconsistent across the nodes. E.g.: node 1 sees 2 and
> > 4 as offline, node 2 sees 1 and 4 as offline, while nodes 3 and 4
> > see every node as online.
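> > 
> > (Each node's view can be compared with, for example:
> > 
> >      corosync-quorumtool -s      # corosync membership view
> >      crm_mon -1                  # pacemaker node status
> > 
> > run on every node at the same time.)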
> 
> This really shouldn't happen.
> 
> > The cluster becomes generally unresponsive to resource actions in
> > this state.
> 
> Expected
> 
> > Thus far we have been unable to restore cluster health without
> > restarting corosync.
> > 
> > We are running packet captures 24/7 on the clusters and have custom
> > tooling to detect lost UDP packets on the knet ports. So far we have
> > not seen significant packet loss trigger an event; at most, a single
> > UDP packet was dropped some seconds before the cluster failed.
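> > 
> > A capture restricted to the knet traffic can be taken with something
> > like (assuming the default corosync port 5405 and eth0 as in the
> > configuration below):
> > 
> >      tcpdump -ni eth0 -w /var/tmp/knet-$(hostname -s).pcap udp port 5405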
> > 
> > However, even if the root cause is indeed a flaky network, we do
> > not understand
> > why the cluster cannot recover on its own in any way. The issues
> > definitely persist
> > beyond the presence of any intermittent network problem.
> 
> Try a newer version. If the problem persists, it's a good idea to
> monitor whether packets are really getting through. Corosync always
> creates (at least) a single-node membership.
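> 
> The knet link state can also be inspected directly on each node, e.g.:
> 
>      corosync-cfgtool -s      # per-link status for every configured link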
> 
> Regards,
>    Honza
> 
> > We were able to artificially break clusters by inducing packet loss
> > with an iptables rule. Dropping packets on a single node of an
> > 8-node cluster can cause malfunctions on multiple other cluster
> > nodes. The expected behavior would be to detect that the
> > artificially broken node failed, but keep the rest of the cluster
> > stable. We were also able to reproduce this on Debian 11 with more
> > recent corosync/pacemaker versions.
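> > 
> > Such a rule could look roughly like this (port and drop rate are
> > illustrative only):
> > 
> >      # randomly drop ~20% of incoming corosync/knet packets on this node
> >      iptables -A INPUT -p udp --dport 5405 -m statistic \
> >          --mode random --probability 0.2 -j DROP
> >      # remove it again with -D instead of -A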
> > 
> > Our configuration is basic; we do not deviate significantly from the
> > defaults.
> > 
> > We will be very grateful for any insights into this problem.
> > 
> > Thanks,
> > Fabian
> > 
> > // corosync.conf
> > totem {
> >      version: 2
> >      cluster_name: cluster01
> >      crypto_cipher: aes256
> >      crypto_hash: sha512
> >      transport: knet
> > }
> > logging {
> >      fileline: off
> >      to_stderr: no
> >      to_logfile: no
> >      to_syslog: yes
> >      debug: off
> >      timestamp: on
> >      logger_subsys {
> >          subsys: QUORUM
> >          debug: off
> >      }
> > }
> > quorum {
> >      provider: corosync_votequorum
> >      two_node: 1
> >      expected_votes: 2
> > }
> > nodelist {
> >      node {
> >          name: node01
> >          nodeid: 01
> >          ring0_addr: 10.0.0.10
> >      }
> >      node {
> >          name: node02
> >          nodeid: 02
> >          ring0_addr: 10.0.0.11
> >      }
> > }
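> > 
> > (The totem section above relies on the default timeouts; if they
> > ever need to be made explicit, the relevant knobs are, with purely
> > illustrative values:
> > 
> >      token: 3000                              # token timeout in ms
> >      token_retransmits_before_loss_const: 10
> > 
> > inside the totem block.)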
> > 
> > // crm config show
> > node 1: node01 \
> >      attributes standby=off
> > node 2: node02 \
> >      attributes standby=off maintenance=off
> > primitive IP-clusterC1 IPaddr2 \
> >      params ip=10.0.0.20 nic=eth0 cidr_netmask=24 \
> >      meta migration-threshold=2 target-role=Started is-managed=true \
> >      op monitor interval=20 timeout=60 on-fail=restart
> > primitive IP-clusterC2 IPaddr2 \
> >      params ip=10.0.0.21 nic=eth0 cidr_netmask=24 \
> >      meta migration-threshold=2 target-role=Started is-managed=true \
> >      op monitor interval=20 timeout=60 on-fail=restart
> > location STICKY-IP-clusterC1 IP-clusterC1 100: node01
> > location STICKY-IP-clusterC2 IP-clusterC2 100: node02
> > property cib-bootstrap-options: \
> >      have-watchdog=false \
> >      dc-version=2.0.1-9e909a5bdd \
> >      cluster-infrastructure=corosync \
> >      cluster-name=cluster01 \
> >      stonith-enabled=no \
> >      no-quorum-policy=ignore \
> >      last-lrm-refresh=1632230917
> > 
> > 
-- 
Ken Gaillot <kgaillot at redhat.com>


