[ClusterLabs] Merging partitioned two_node cluster?

Tue May 5 00:44:39 EDT 2020

On May 5, 2020 6:39:54 AM GMT+03:00, "Nickle, Richard" <rnickle at holycross.edu> wrote:
>I have a two-node cluster managing a VIP.  The service is an SMTP service.
>This could be active/active; it doesn't matter which node accepts the SMTP
>connection, but I wanted to make sure that a VIP was in place so that there
>was a well-known address.
>
>This service has been running for quite a while with no problems.  All of a
>sudden, it partitioned, and now I can't work out a good way to get the two
>partitions to merge back again.  Right now one partition takes the resource
>and starts the VIP, but doesn't see the other node.  The other node doesn't
>create a resource, and can't seem to see the first node either.
>
>At this point, I am perfectly willing to create another node and make an
>odd-numbered cluster, the arguments for this being fairly persuasive.  But
>I'm not sure why they are blocking.
>
>Surely there must be some manual way to get a partitioned cluster to
>merge?  Some trick?  I also had a scenario several weeks ago where an
>odd-numbered cluster configured in a similar way partitioned into a 3-node
>and a 2-node cluster, and I was unable to work out how to get them to merge,
>until all of a sudden they seemed to fix themselves after doing a 'pcs node
>remove/pcs node add' which had failed many times before.  I have tried that
>here but with no success so far.
>
>I ruled out some common cases I've seen in discussions and threads, such as
>having my host name defined in /etc/hosts as localhost, etc.
>
>Corosync 2.4.3, Pacemaker 1.1.18, pcs 0.9.164 (Ubuntu 18.04).
>
>Output from pcs status for both nodes:
>
>Cluster name: mail
>Stack: corosync
>Current DC: mail2 (version 1.1.18-2b07d5c5a9) - partition with quorum
>Last updated: Mon May  4 23:28:53 2020
>Last change: Mon May  4 21:50:04 2020 by hacluster via crmd on mail2
>
>2 nodes configured
>1 resource configured
>
>Online: [ mail2 ]
>OFFLINE: [ mail3 ]
>
>Full list of resources:
>
> mail_vip (ocf::heartbeat:IPaddr2): Started mail2
>
>Daemon Status:
>  corosync: active/enabled
>  pacemaker: active/enabled
>  pcsd: active/enabled
>
>Cluster name: mail
>Stack: corosync
>Current DC: mail3 (version 1.1.18-2b07d5c5a9) - partition with quorum
>Last updated: Mon May  4 22:13:10 2020
>Last change: Mon May  4 22:10:34 2020 by root via cibadmin on mail3
>
>2 nodes configured
>0 resources configured
>
>Online: [ mail3 ]
>OFFLINE: [ mail2 ]
>
>No resources
>
>Daemon Status:
>  corosync: active/enabled
>  pacemaker: active/enabled
>  pcsd: active/enabled
>
>/etc/corosync/corosync.conf:
>
>totem {
>    version: 2
>    cluster_name: mail
>    clear_node_high_bit: yes
>    crypto_cipher: none
>    crypto_hash: none
>
>    interface {
>        ringnumber: 0
>        bindnetaddr: 192.168.80.128
>        mcastport: 5405
>    }
>}
>
>logging {
>    fileline: off
>    to_stderr: no
>    to_logfile: no
>    to_syslog: yes
>    syslog_facility: daemon
>    debug: off
>    timestamp: on
>}
>
>quorum {
>    provider: corosync_votequorum
>    wait_for_all: 0
>    two_node: 1
>}
>
>nodelist {
>    node {
>        ring0_addr: mail2
>        name: mail2
>        nodeid: 1
>    }
>
>    node {
>        ring0_addr: mail3
>        name: mail3
>        nodeid: 2
>    }
>}
>
>Thanks!
>
>Rick

I had similar issues caused by multicast traffic being blocked.
Check with your network team whether anything has changed recently, or switch corosync to the 'udpu' (UDP unicast) transport to verify whether multicast is actually the problem.
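
For reference, a minimal sketch of the udpu change, assuming the corosync.conf quoted above. Only the 'transport' line is new; everything else is copied from your config. With udpu, corosync takes the peer addresses from the nodelist section, which you already have. The file has to be identical on both nodes, and corosync needs to be restarted on both for the change to take effect.

    totem {
        version: 2
        cluster_name: mail
        # new line: use UDP unicast instead of multicast
        transport: udpu
        clear_node_high_bit: yes
        crypto_cipher: none
        crypto_hash: none

        interface {
            ringnumber: 0
            bindnetaddr: 192.168.80.128
            mcastport: 5405
        }
    }

After the restart, 'corosync-quorumtool -s' on either node should list both mail2 and mail3 as members if blocked multicast was the cause. If each node still sees only itself, the problem is somewhere else (a firewall blocking UDP port 5405 between the nodes, name resolution, etc.).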

Best Regards,
Strahil Nikolov