[ClusterLabs] pacemaker won't start because duplicate node but can't remove dupe node because pacemaker won't start
JC
snafuxnj at yahoo.com
Wed Dec 18 02:44:25 EST 2019
I've asked this question on Server Fault, and I'll re-ask the whole thing here for posterity's sake:
https://serverfault.com/questions/995981/pacemaker-wont-start-because-duplicate-node-but-cant-remove-dupe-node-because
OK! I'm really new to pacemaker/corosync, as in one day new.
Software: Ubuntu 18.04 LTS and the versions associated with that distro.
pacemakerd: 1.1.18
corosync: 2.4.3
I accidentally removed the nodes from my entire test cluster (3 nodes).
When I tried to bring everything back up using the `pcsd` GUI, that failed because the nodes were "wiped out". Cool.
So, I had a copy of the last `corosync.conf` from my "primary" node. I copied it to the other two nodes, fixed the `bindnetaddr` in each conf, and ran `pcs cluster start` on my "primary" node.
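For reference, the sequence looked roughly like this (<node2> and <node3> stand in for the other two hosts; scp is just how I happened to copy the file):

    # on the "primary" node, after putting my saved corosync.conf back in place
    scp /etc/corosync/corosync.conf <node2>:/etc/corosync/corosync.conf
    scp /etc/corosync/corosync.conf <node3>:/etc/corosync/corosync.conf
    # fixed bindnetaddr by hand on each node, then:
    pcs cluster start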
One of the nodes failed to come up. I took a look at the status of `pacemaker` on that node and saw the following error:
Dec 18 06:33:56 region-ctrl-2 crmd[1049]: crit: Nodes 1084777441 and 2 share the same name 'region-ctrl-2': shutting down
I tried running `crm_node -R --force 1084777441` on the machine where `pacemaker` won't start, but of course `pacemaker` isn't running, so I got a `crmd: connection refused (111)` error. I then ran the same command on one of the healthy nodes, which reported no errors, but the node never went away and `pacemaker` on the affected machine continued to show the same error.
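To be concrete, the removal attempts were along these lines (the ID is taken straight from the log line above):

    # on the broken node -- fails because crmd isn't running
    crm_node -R --force 1084777441
    # on a healthy node -- exits without complaint, but the stale entry never goes away
    crm_node -R --force 1084777441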
So, I decided to tear down the entire cluster and start again. I purged all the related packages from the machine, reinstalled everything fresh, copied the `corosync.conf` over and fixed it, and recreated the cluster. I get the exact same bloody error.
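The teardown itself was just a purge and reinstall of the packages, roughly (package names as on Ubuntu 18.04; I may be forgetting one):

    apt-get purge -y pacemaker corosync pcs
    apt-get install -y pacemaker corosync pcs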
So this node with ID `1084777441` is not a machine I created; it's one the cluster created for me. Earlier in the day I realized that I was using IP addresses in `corosync.conf` instead of names, so I fixed the `/etc/hosts` on the machines and removed the IP addresses from the corosync config. That's how I inadvertently deleted my whole cluster in the first place (I removed the nodes that were identified by IP addresses).
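The `/etc/hosts` fix was just adding name entries for the three nodes, something like this (the addresses are placeholders here; only the names matter to the nodelist below):

    192.168.99.x    region-ctrl-1
    192.168.99.y    region-ctrl-2
    192.168.99.z    postgres-sb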
The following is my corosync.conf:
totem {
    version: 2
    cluster_name: maas-cluster
    token: 3000
    token_retransmits_before_loss_const: 10
    clear_node_high_bit: yes
    crypto_cipher: none
    crypto_hash: none

    interface {
        ringnumber: 0
        bindnetaddr: 192.168.99.225
        mcastport: 5405
        ttl: 1
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on

    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

quorum {
    provider: corosync_votequorum
    expected_votes: 3
    two_node: 1
}

nodelist {
    node {
        ring0_addr: postgres-sb
        nodeid: 3
    }

    node {
        ring0_addr: region-ctrl-2
        nodeid: 2
    }

    node {
        ring0_addr: region-ctrl-1
        nodeid: 1
    }
}
The only thing different about this conf between the nodes is the `bindnetaddr`.
There seems to be a chicken-and-egg issue here, unless there's some way I'm not aware of to remove a node from a flat-file or SQLite database somewhere, or some other more authoritative way to remove a node from the cluster.
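In case it helps with an answer, the closest I've found to an "authoritative" view of the membership is querying corosync and the CIB from a healthy node; I'm assuming these are the right places to look (corrections welcome):

    corosync-cmapctl | grep members      # runtime membership as corosync sees it
    cibadmin --query --scope nodes       # node entries stored in the CIB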