[ClusterLabs] Corosync/Pacemaker bug methinks! (was: pacemaker won't start because duplicate node but can't remove dupe node because pacemaker won't start)

JC snafuxnj at yahoo.com
Wed Dec 18 15:11:18 EST 2019


Adding More Context

I forgot to mention: I've made sure that `/etc/hosts` and the hostname match on each of the machines.

	127.0.0.1 localhost
	127.0.1.1 postgres
	192.168.99.224 postgres-sb
	192.168.99.223 region-ctrl-1
	192.168.99.225 region-ctrl-2

	192.168.7.224 postgres-sb
	192.168.7.223 region-ctrl-1
	192.168.7.225 region-ctrl-2


	# The following lines are desirable for IPv6 capable hosts
	::1     ip6-localhost ip6-loopback
	fe00::0 ip6-localnet
	ff00::0 ip6-mcastprefix
	ff02::1 ip6-allnodes
	ff02::2 ip6-allrouters
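
One thing the listing above does show is that each cluster hostname appears twice, once per subnet. Whether that has anything to do with the error is an assumption on my part, not something established here. A minimal sketch for spotting such names; the helper names `names_to_addresses` and `ambiguous_names` are made up for illustration, and the file contents are pasted in as a string rather than read from disk:

```python
from collections import defaultdict

# Contents of the /etc/hosts shown above, pasted in as a string.
HOSTS = """\
127.0.0.1 localhost
127.0.1.1 postgres
192.168.99.224 postgres-sb
192.168.99.223 region-ctrl-1
192.168.99.225 region-ctrl-2

192.168.7.224 postgres-sb
192.168.7.223 region-ctrl-1
192.168.7.225 region-ctrl-2

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
"""


def names_to_addresses(hosts_text):
    """Map each hostname to the list of addresses it is listed under."""
    mapping = defaultdict(list)
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            mapping[name].append(address)
    return dict(mapping)


def ambiguous_names(hosts_text):
    """Return only the hostnames that are listed under more than one address."""
    return {
        name: addresses
        for name, addresses in names_to_addresses(hosts_text).items()
        if len(addresses) > 1
    }


if __name__ == "__main__":
    for name, addresses in sorted(ambiguous_names(HOSTS).items()):
        print(name, addresses)
```

For this file it reports `postgres-sb`, `region-ctrl-1` and `region-ctrl-2`, each with a 192.168.99.x and a 192.168.7.x address.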

I decided to try to start from scratch. I `apt remove --purge`d `corosync*`, `pacemaker*`, `crmsh`, and `pcs`. I `rm -rf`ed `/etc/corosync`. I kept a copy of the `corosync.conf` on each machine.

I re-installed all the things on each of the machines. I copied my saved `corosync.conf` to `/etc/corosync/` and restarted `corosync` on all the machines.

I *STILL* get the same exact error. This has to be a bug in one of the components! 

So it seems that `crm_get_peer` is failing to recognize that the host named `region-ctrl-2` is assigned nodeid 2 in `corosync.conf`. Node 2 then gets auto-assigned an ID of 1084777441. This is the part that doesn't make sense to me. The hostname of the machine is `region-ctrl-2`, set in `/etc/hostname` and `/etc/hosts` and confirmed using `uname -n`. The `corosync.conf` explicitly assigns an ID to the machine named `region-ctrl-2`, but something is apparently not picking up that assignment from `corosync` and instead assigns this host the same, non-random ID every time: 1084777441. How the freak do I fix this?
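
For what it's worth, 1084777441 is not an arbitrary number: it is the IPv4 address 192.168.99.225 (`region-ctrl-2` in the hosts file above) read as a 32-bit integer with the top bit cleared. As far as I understand the `corosync.conf` man page, that is how a nodeid is derived from the ring 0 address when none is specified and `clear_node_high_bit` is in effect. That this fallback is the code path actually being taken here is an assumption. A minimal sketch of the arithmetic; the function names are hypothetical:

```python
import ipaddress

AUTO_ID = 1084777441  # the ID reported for region-ctrl-2


def ipv4_to_nodeid(address, clear_high_bit=True):
    """Read an IPv4 address as a 32-bit integer, optionally clearing bit 31."""
    value = int(ipaddress.IPv4Address(address))
    if clear_high_bit:
        value &= 0x7FFFFFFF
    return value


def candidates_from_nodeid(nodeid):
    """Return the two IPv4 addresses that would produce this ID:
    the literal value, and the value with bit 31 set again."""
    return [
        str(ipaddress.IPv4Address(nodeid)),
        str(ipaddress.IPv4Address(nodeid | 0x80000000)),
    ]


if __name__ == "__main__":
    print(ipv4_to_nodeid("192.168.99.225"))
    print(candidates_from_nodeid(AUTO_ID))
```

The second candidate printed is 192.168.99.225, which suggests the ID is being computed from the address rather than read from the `nodeid` entry for that node.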

